-
Notifications
You must be signed in to change notification settings - Fork 13.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
S3 Directory Loader not working as expected #11854
Comments
🤖 Hello, Thank you for bringing this to our attention. The issue you're experiencing with the S3DirectoryLoader not loading all the files from a given prefix within the bucket, including those in multiple sub-folders, is due to the way the In the Here is the relevant code: for obj in bucket.objects.filter(Prefix=self.prefix):
loader = S3FileLoader(
self.bucket,
obj.key,
region_name=self.region_name,
api_version=self.api_version,
use_ssl=self.use_ssl,
verify=self.verify,
endpoint_url=self.endpoint_url,
aws_access_key_id=self.aws_access_key_id,
aws_secret_access_key=self.aws_secret_access_key,
aws_session_token=self.aws_session_token,
boto_config=self.boto_config,
)
docs.extend(loader.load()) To load all the files from a given prefix within the bucket, including those in multiple sub-folders, you would need to modify the There have been similar issues reported in the past, such as S3 Directory Loader reads prefix directory as file_path and s3Directory Loader with prefix error . The solution suggested in the first issue was to bypass any directory/prefix paths and collect only files. Here is the suggested solution: docs = []
for obj in bucket.objects.filter(Prefix=self.prefix):
if obj.key.endswith("/"): # bypass the prefix directory
continue
else:
loader = S3FileLoader(self.bucket, obj.key)
docs.extend(loader.load())
return docs Please try this solution and let us know if it resolves your issue. If not, we'll continue to investigate and find a solution for you. Best regards. Sources
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot. |
@dosu-beta is this issue resolved in the new release of langchain ? |
🤖 Thank you for reaching out and providing a potential solution for the issue you are experiencing with the S3DirectoryLoader in LangChain. Your contribution is greatly appreciated! I'm glad to hear that you have found a potential solution by modifying the To answer your question, I don't have access to the latest release notes or the current state of the LangChain codebase. However, if you believe that the issue still persists in the latest release, I encourage you to go ahead and raise a pull request (PR) with your proposed solution. The maintainers of LangChain will review your contribution and determine if it should be merged into the codebase. Thank you for your willingness to contribute to the LangChain project! Let me know if there's anything else I can assist you with. This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot. |
Hi, @VpkPrasanna, I'm helping the LangChain team manage their backlog and am marking this issue as stale. From what I understand, the S3 Directory Loader in langchain is not retrieving all files within the specified prefix, including those in sub-folders, due to the current implementation of the Could you please confirm if this issue is still relevant to the latest version of the LangChain repository? If it is, please let the LangChain team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you for your understanding and cooperation. |
System Info
langchain : 0.0.306
python : 3.10.12
platform : Ubuntu
Who can help?
No response
Information
Related Components
Reproduction
loader = S3DirectoryLoader( bucket="", prefix="data/", aws_access_key_id="", aws_secret_access_key="") data = loader.load()
It should have load all the files with in the bucket from a given prefix which may contain multiple sub-folders but it does not
Expected behavior
It should have loaded all the files as list of Document of langchain schema , but it does not
The text was updated successfully, but these errors were encountered: