Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

S3 Directory Loader not working as expected #11854

Closed
1 of 14 tasks
VpkPrasanna opened this issue Oct 16, 2023 · 4 comments
Closed
1 of 14 tasks

S3 Directory Loader not working as expected #11854

VpkPrasanna opened this issue Oct 16, 2023 · 4 comments
Labels
🔌: aws Primarily related to Amazon Web Services (AWS) integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)

Comments

@VpkPrasanna
Copy link
Contributor

System Info

langchain : 0.0.306
python : 3.10.12
platform : Ubuntu

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

loader = S3DirectoryLoader( bucket="", prefix="data/", aws_access_key_id="", aws_secret_access_key="") data = loader.load()

It should have load all the files with in the bucket from a given prefix which may contain multiple sub-folders but it does not

Expected behavior

It should have loaded all the files as list of Document of langchain schema , but it does not

@dosubot dosubot bot added Ɑ: doc loader Related to document loader module (not documentation) 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature 🔌: aws Primarily related to Amazon Web Services (AWS) integrations labels Oct 16, 2023
@dosubot
Copy link

dosubot bot commented Oct 16, 2023

🤖

Hello,

Thank you for bringing this to our attention. The issue you're experiencing with the S3DirectoryLoader not loading all the files from a given prefix within the bucket, including those in multiple sub-folders, is due to the way the load method is implemented in LangChain version 0.0.306.

In the load method, it uses the bucket.objects.filter(Prefix=self.prefix) to get the objects in the S3 bucket. This line of code will only get the objects that directly match the prefix, not the objects in the sub-folders.

Here is the relevant code:

for obj in bucket.objects.filter(Prefix=self.prefix):
    loader = S3FileLoader(
        self.bucket,
        obj.key,
        region_name=self.region_name,
        api_version=self.api_version,
        use_ssl=self.use_ssl,
        verify=self.verify,
        endpoint_url=self.endpoint_url,
        aws_access_key_id=self.aws_access_key_id,
        aws_secret_access_key=self.aws_secret_access_key,
        aws_session_token=self.aws_session_token,
        boto_config=self.boto_config,
    )
    docs.extend(loader.load())

To load all the files from a given prefix within the bucket, including those in multiple sub-folders, you would need to modify the load method to recursively search through all the sub-folders. However, this is not currently implemented in the provided version of LangChain.

There have been similar issues reported in the past, such as S3 Directory Loader reads prefix directory as file_path and s3Directory Loader with prefix error . The solution suggested in the first issue was to bypass any directory/prefix paths and collect only files. Here is the suggested solution:

docs = []
for obj in bucket.objects.filter(Prefix=self.prefix):
       if obj.key.endswith("/"):   # bypass the prefix directory
              continue
       else:
              loader = S3FileLoader(self.bucket, obj.key)
              docs.extend(loader.load())
return docs

Please try this solution and let us know if it resolves your issue. If not, we'll continue to investigate and find a solution for you.

Best regards.

Sources


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

@VpkPrasanna
Copy link
Contributor Author

@dosu-beta is this issue resolved in the new release of langchain ?
else i am ready to raise a PR for this !

@dosubot
Copy link

dosubot bot commented Oct 16, 2023

🤖

Thank you for reaching out and providing a potential solution for the issue you are experiencing with the S3DirectoryLoader in LangChain. Your contribution is greatly appreciated!

I'm glad to hear that you have found a potential solution by modifying the load method to recursively search through all the sub-folders. This could be a valuable addition to the framework.

To answer your question, I don't have access to the latest release notes or the current state of the LangChain codebase. However, if you believe that the issue still persists in the latest release, I encourage you to go ahead and raise a pull request (PR) with your proposed solution. The maintainers of LangChain will review your contribution and determine if it should be merged into the codebase.

Thank you for your willingness to contribute to the LangChain project! Let me know if there's anything else I can assist you with.


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

Copy link

dosubot bot commented Feb 3, 2024

Hi, @VpkPrasanna,

I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the S3 Directory Loader in langchain is not retrieving all files within the specified prefix, including those in sub-folders, due to the current implementation of the load method. There's a potential solution suggested to modify the load method to recursively search through all sub-folders, and you are considering raising a pull request (PR) for this issue, which the maintainers have encouraged.

Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and cooperation.

@dosubot dosubot bot added the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 3, 2024
@dosubot dosubot bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 10, 2024
@dosubot dosubot bot removed the stale Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed label Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🔌: aws Primarily related to Amazon Web Services (AWS) integrations 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature Ɑ: doc loader Related to document loader module (not documentation)
Projects
None yet
Development

No branches or pull requests

1 participant