Issue: Need a trace or debug feature in LangChain DirectoryLoader #7725

Closed
botchagalupe opened this issue Jul 14, 2023 · 3 comments
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases

Comments

@botchagalupe

Issue you'd like to raise.

I've set "langchain.debug=True"; however, it does not work for the DirectoryLoader. I have a notebook that tries to load a dozen or more PDFs, and typically at least one of the files fails (see attached). I looked at the code, and as far as I can tell, there is no trace or debug feature in the document loaders (https://github.com/hwchase17/langchain/tree/master/langchain/document_loaders).

My issue is that the loader code is a black box. I can't tell which file is failing; therefore, I have to process each one individually to find out which one is failing. It would be beneficial if a trace/debugger could help me identify which file it's failing on.

TIA
[Attached screenshot: Screen Shot 2023-07-14 at 9 04 56 AM]

Suggestion:

Please add a debug option to the document loader code (https://github.com/hwchase17/langchain/tree/master/langchain/document_loaders).

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Jul 14, 2023
@dosubot

dosubot bot commented Jul 14, 2023

Answer generated by a 🤖

Answer

I understand that you're having trouble with the DirectoryLoader in LangChain. You've set the debug mode to true, but it doesn't seem to be working as expected. You're unable to identify which file is causing the issue when loading multiple PDFs, and you're looking for a way to trace or debug the loader code to help you pinpoint the problematic file.

Here's a potential solution that might help. You can modify the load_file method in the DirectoryLoader to log the name of the file it's currently processing. This way, if an error occurs, you can check the logs to see which file was being processed at the time.

Here's how you can do it:

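import logging
from pathlib import Path
from typing import Any, List, Optional

from langchain.docstore.document import Document

# Module-level logger, so a level set on the "langchain" logger applies to it.
logger = logging.getLogger(__name__)

# Method of DirectoryLoader; _is_visible is a helper defined in the same module.
def load_file(self, item: Path, path: Path, docs: List[Document], pbar: Optional[Any]) -> None:
    if item.is_file():
        if _is_visible(item.relative_to(path)) or self.load_hidden:
            try:
                logger.debug(f"Processing file: {item}")  # Add this line
                sub_docs = self.loader_cls(str(item), **self.loader_kwargs).load()
                docs.extend(sub_docs)
            except Exception as e:
                if self.silent_errors:
                    # Include the file name so the warning identifies the culprit.
                    logger.warning(f"Error loading {item}: {e}")
                else:
                    raise  # re-raise with the original traceback
            finally:
                if pbar:
                    pbar.update(1)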

This modification will log the name of the file being processed at the debug level. To see these logs, you'll need to set the logging level to debug in your application. You can do this by adding the following line at the start of your application:

logging.basicConfig(level=logging.DEBUG)

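This will set the root logger level to DEBUG, so all debug (and higher level) logs will be shown. If you only want to see debug logs from LangChain, leave the root logger at its default level and set the level on LangChain's logger instead. A handler still has to be configured (for example with a plain logging.basicConfig() call), and this only affects messages logged through a logger under the "langchain" namespace, not messages sent directly to the root logger: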

logging.getLogger('langchain').setLevel(logging.DEBUG)
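
As a minimal, standard-library-only sketch of how that logger hierarchy behaves: the logger names `langchain.document_loaders.directory` and `some_other_library` below are assumptions used for illustration, and the in-memory handler is only there so the result can be inspected.

```python
import io
import logging

# Capture log output in memory so the effect is easy to inspect.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

root = logging.getLogger()
root.addHandler(handler)
root.setLevel(logging.WARNING)  # root stays at its default level

# Only the "langchain" subtree is switched to DEBUG.
logging.getLogger("langchain").setLevel(logging.DEBUG)

# Assumed logger names, for illustration only.
loader_logger = logging.getLogger("langchain.document_loaders.directory")
other_logger = logging.getLogger("some_other_library")

loader_logger.debug("Processing file: report.pdf")  # passes: inherits DEBUG
other_logger.debug("noisy detail")                  # dropped: inherits WARNING

captured = stream.getvalue()
print(captured)
```

The child logger inherits its effective level from the nearest configured ancestor, which is why the debug message from the `langchain.*` logger gets through while the other library stays quiet.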

Please note that this is just one potential solution. There might be other ways to achieve the same result depending on your specific use case and environment.
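
One such alternative, which needs no change to the library source, is to load the files one at a time and record which ones fail. The sketch below assumes a hypothetical `load_one` callable that takes a path and returns a list of documents; in a LangChain setup it might wrap a per-file loader, but that wiring is an assumption here, and `fake_loader` is only a stand-in for demonstration.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def load_each(
    paths: Iterable[Path],
    load_one: Callable[[Path], list],
) -> Tuple[list, List[Tuple[Path, Exception]]]:
    """Load files one by one, collecting documents and per-file failures."""
    docs: list = []
    failures: List[Tuple[Path, Exception]] = []
    for path in paths:
        try:
            docs.extend(load_one(path))
        except Exception as exc:  # keep going, but remember which file failed
            failures.append((path, exc))
    return docs, failures


# Hypothetical stand-in loader: fails on one specific file name.
def fake_loader(path: Path) -> list:
    if path.name == "broken.pdf":
        raise ValueError("could not parse PDF")
    return [f"contents of {path.name}"]


docs, failures = load_each(
    [Path("a.pdf"), Path("broken.pdf"), Path("c.pdf")], fake_loader
)
for path, exc in failures:
    print(f"Failed: {path} ({exc})")
```

With real files, `paths` could come from something like `Path("docs").glob("**/*.pdf")`, and the `failures` list then names every file that raised, together with its exception.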

Also, as an AI, I'm unable to view images or follow links. Therefore, I can't see the screenshot you've attached or access the URL you've provided.

If you find this solution helpful and believe it could benefit others, I encourage you to make a pull request to the LangChain repository with these changes. This way, other users facing the same issue can benefit from your contribution.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.


This was referenced Jul 14, 2023
baskaryan pushed a commit that referenced this issue Jul 15, 2023
Description: Added debugging output in DirectoryLoader to identify the
file being processed.
Issue: Need a trace or debug feature in LangChain DirectoryLoader #7725
Dependencies: No additional dependencies are required.
Tag maintainer: @rlancemartin, @eyurtsev
This PR enhances the DirectoryLoader with debugging output to help
diagnose issues when loading documents. This new feature does not add
any dependencies and has been tested on a local machine.
@dosubot

dosubot bot commented Oct 14, 2023

Hi, @botchagalupe! I'm Dosu, and I'm here to help the LangChain team manage their backlog. I wanted to let you know that we are marking this issue as stale.

From what I understand, you were requesting a trace or debug feature for the DirectoryLoader in LangChain. You were experiencing issues with loading PDF files and wanted a way to identify which file is causing the problem. Dosu-beta suggested modifying the load_file method in the DirectoryLoader to log the name of the file being processed at the debug level, which can help pinpoint the problematic file. They provided code snippets and instructions on how to set the logging level to debug.

Before we close this issue, we wanted to check with you if it is still relevant to the latest version of the LangChain repository. If it is, please let us know by commenting on the issue. Otherwise, feel free to close the issue yourself or it will be automatically closed in 7 days.

Thank you for your understanding and contribution to the LangChain project!

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 14, 2023
@dosubot dosubot bot closed this as not planned Oct 21, 2023
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Oct 21, 2023
@mattwa1sh

Modifying the library source cannot possibly be the answer. Isn't this a super basic feature?
