
Add bucket and object key to metadata in S3 loader #9317

Merged · 3 commits · Aug 30, 2023

Conversation

cbornet (Collaborator) commented Aug 16, 2023

  • Description: this PR adds s3_object_key and s3_bucket to the doc metadata when loading an S3 file. This is particularly useful when using S3DirectoryLoader to remove files from the directory once they have been processed, since getting the object keys out of the metadata source field seems brittle.
  • Dependencies: N/A
  • Tag maintainer: ?
  • Twitter handle: _cbornet
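A minimal sketch of the clean-up use case the description mentions, assuming each loaded document carries the new s3_bucket and s3_object_key metadata fields. Plain dicts stand in for LangChain Document objects, and the actual boto3 delete_object call is left out:

```python
# Sketch: collect (bucket, key) pairs from loaded documents so the
# corresponding S3 objects can be deleted once processing is done.
# Plain dicts stand in for LangChain Document objects here.

def objects_to_delete(docs):
    """Return the (bucket, key) pair for every doc that carries S3 metadata."""
    pairs = []
    for doc in docs:
        meta = doc.get("metadata", {})
        if "s3_bucket" in meta and "s3_object_key" in meta:
            pairs.append((meta["s3_bucket"], meta["s3_object_key"]))
    return pairs

docs = [{"metadata": {"s3_bucket": "my-bucket", "s3_object_key": "reports/a.pdf"}}]
print(objects_to_delete(docs))  # [('my-bucket', 'reports/a.pdf')]
```

Each pair could then be passed to a boto3 `delete_object(Bucket=..., Key=...)` call once the document has been processed.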


@dosubot added labels Ɑ: doc loader (Related to document loader module, not documentation) and 🤖:improvement (Medium size change to existing code to handle new use-cases) on Aug 16, 2023
cbornet (Collaborator Author) commented Aug 16, 2023

An alternative would be to allow passing initial metadata values to UnstructuredFileLoader when creating it.

```diff
@@ -35,4 +35,8 @@ def load(self) -> List[Document]:
         os.makedirs(os.path.dirname(file_path), exist_ok=True)
         s3.download_file(self.bucket, self.key, file_path)
         loader = UnstructuredFileLoader(file_path)
-        return loader.load()
+        docs = loader.load()
+        for doc in docs:
```
Collaborator commented on the diff:

Could you replace this with

    doc.metadata['source'] = f's3://{bucket_name}/{key}'

source can then be parsed downstream to yield bucket_name and key. We want to standardize on a single field to describe the provenance of the data.
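The downstream parsing this suggests could look like the following sketch; the helper name is made up and is not LangChain API:

```python
def parse_s3_source(source: str) -> tuple[str, str]:
    """Split an 's3://bucket/key' source string into (bucket, key)."""
    prefix = "s3://"
    if not source.startswith(prefix):
        raise ValueError(f"not an S3 URI: {source!r}")
    # Everything before the first '/' is the bucket; the rest is the key.
    bucket, _, key = source[len(prefix):].partition("/")
    return bucket, key

print(parse_s3_source("s3://my-bucket/reports/2023/summary.pdf"))
# ('my-bucket', 'reports/2023/summary.pdf')
```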

Collaborator Author replied:

Sure.

Collaborator Author replied:

Done.
I also changed the way documents are loaded, by inheriting from UnstructuredBaseLoader instead of using UnstructuredFileLoader. I think it's cleaner this way. WDYT?

Collaborator replied:

The loader lives under a generic s3 namespace, so I don't think it should be coupled to any particular parser. I'd prefer to make it possible to pass an arbitrary parser in the initializer instead; we can keep the unstructured one as the default parser to make sure there are no breaking changes for existing users.
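A sketch of that composition idea, using hypothetical names rather than the actual LangChain classes: the loader takes a parser callable at construction time and defaults to an unstructured-based one, so existing users are unaffected.

```python
def unstructured_parse(file_path):
    """Stand-in for the default unstructured-based parser (hypothetical)."""
    return [{"page_content": f"<contents of {file_path}>", "metadata": {}}]

class S3FileLoader:
    """Hypothetical loader that accepts any parser callable."""

    def __init__(self, bucket, key, parser=unstructured_parse):
        self.bucket = bucket
        self.key = key
        self.parser = parser  # any callable: file_path -> list of docs

    def load(self, file_path):
        # Fetching (download from S3) is omitted; after download, delegate
        # parsing to the injected callable and stamp provenance metadata.
        docs = self.parser(file_path)
        for doc in docs:
            doc["metadata"]["source"] = f"s3://{self.bucket}/{self.key}"
        return docs

docs = S3FileLoader("my-bucket", "a.txt").load("/tmp/a.txt")
print(docs[0]["metadata"]["source"])  # s3://my-bucket/a.txt
```

Because the parser is just a callable, any existing file parser can be reused regardless of where the file came from, which is the decoupling discussed below.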

cbornet (Collaborator Author) commented Aug 24, 2023

You mean we pass the parser class and, in load(), instantiate it with the file name, key, and bucket as constructor arguments?

Collaborator replied:

I think that would be a bit better, though maybe not critical since the current change isn't breaking for users.

At a higher level, I'm trying to decouple fetching from content and rely more on composition to make it as easy as possible to re-use existing parsers on files regardless of where the files are stored.

We have a BlobGenerator abstraction that could be given an S3 implementation able to fetch file blobs from S3 matching certain criteria.

Collaborator commented:

Let me know if you want to make changes now or if you'd like me to merge as is.

Collaborator Author replied:

I think one of the problems with that is that parsers are currently mixed with loaders; the two concepts would probably need to be separated.
I'm OK with merging as-is and helping to improve it later.

@eyurtsev added the lgtm label (PR looks good; use to confirm that a PR is ready for merging) on Aug 30, 2023
eyurtsev (Collaborator) commented:

Ready to merge as soon as tests pass
