Add bucket and object key to metadata in S3 loader #9317
Conversation
An alternative would be to be able to pass initial metadata values to …
@@ -35,4 +35,8 @@ def load(self) -> List[Document]:
         os.makedirs(os.path.dirname(file_path), exist_ok=True)
         s3.download_file(self.bucket, self.key, file_path)
         loader = UnstructuredFileLoader(file_path)
-        return loader.load()
+        docs = loader.load()
+        for doc in docs:
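The hunk is truncated after the `for` loop. Presumably the loop body attaches the bucket and key to each document's metadata before returning; a minimal sketch of that continuation (the `Document` class here is a hypothetical stand-in for langchain's, which only needs a `metadata` dict for this purpose):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Document:
    # Minimal stand-in for langchain's Document; only `metadata` matters here.
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)

def attach_s3_metadata(docs: List[Document], bucket: str, key: str) -> List[Document]:
    # Continuation of the truncated loop: record where each document came from.
    for doc in docs:
        doc.metadata["s3_bucket"] = bucket
        doc.metadata["s3_object_key"] = key
    return docs
```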
Could you replace this with `doc.metadata['source'] = f's3://{bucket_name}/{key}'`? The `source` field can be parsed downstream to yield `bucket_name` and `key`. We want to standardize on a single field to be able to describe the provenance of data.
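As a sanity check on that suggestion, recovering the two parts from the single `source` field downstream is straightforward; a sketch (the helper name is mine):

```python
from typing import Tuple

def parse_s3_source(source: str) -> Tuple[str, str]:
    # Recover (bucket, key) from a 's3://{bucket}/{key}' source string.
    # Keys may contain '/', so split only on the first slash after the scheme.
    prefix = "s3://"
    if not source.startswith(prefix):
        raise ValueError(f"not an s3 source: {source!r}")
    bucket, _, key = source[len(prefix):].partition("/")
    return bucket, key
```

For example, `parse_s3_source('s3://my-bucket/reports/q1.pdf')` returns `('my-bucket', 'reports/q1.pdf')`.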
Sure.
Done.
I also changed the way documents are loaded: the loader now inherits from `UnstructuredBaseLoader` instead of using `UnstructuredFileLoader`. I think it's cleaner this way. WDYT?
The loader lives under a generic `s3` namespace, so I don't think it should be coupled to any particular parser. I'd prefer to make it possible to pass an arbitrary parser to the initializer; we can keep the unstructured one as the default parser to make sure there are no breaking changes for existing users.
You mean we pass the parser class, and in `load()` we instantiate it with the file name, key, and bucket as constructor arguments?
I think that would be a bit better, though it's maybe not critical since the current change isn't breaking for users.
At a higher level, I'm trying to decouple fetching content from parsing it and to rely more on composition, so it's as easy as possible to re-use existing parsers on files regardless of where the files are stored.
We have a `BlobGenerator` abstraction that could get an S3 implementation able to fetch file blobs from S3 that match certain criteria.
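One way this composition idea could look, purely as an illustration (none of these names are the actual langchain API): the S3 loader only downloads, and any callable that turns a local path into documents can be plugged in as the parser.

```python
import os
import tempfile
from typing import Callable, List

# Illustrative sketch only: a fetcher decoupled from parsing. In a real
# loader, `download` would wrap boto3's s3.download_file(bucket, key, path);
# it is injected here so the sketch stays self-contained.
class S3FileLoader:
    def __init__(self, bucket: str, key: str,
                 parser: Callable[[str], List[str]],
                 download: Callable[[str, str, str], None]):
        self.bucket = bucket
        self.key = key
        self.parser = parser      # turns a local file path into "documents"
        self.download = download  # fetches s3://bucket/key to a local path

    def load(self) -> List[str]:
        with tempfile.TemporaryDirectory() as tmp:
            path = os.path.join(tmp, os.path.basename(self.key))
            self.download(self.bucket, self.key, path)
            return self.parser(path)
```

With this shape, swapping parsers is a constructor argument rather than a subclass, and the unstructured-based parser could stay as the default.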
Let me know if you want to make changes now or if you'd like me to merge as is.
I think one of the problems there is that parsers are currently mixed with loaders; the two concepts would probably need to be separated.
I'm OK with merging as-is and helping improve this later.
Ready to merge as soon as tests pass
Adds `s3_object_key` and `s3_bucket` to the doc metadata when loading an S3 file. This is particularly useful when using `S3DirectoryLoader` to remove the files from the directory once they have been processed (getting the object keys from the metadata `source` field seems brittle).
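Assuming those fields land, the cleanup described above could look roughly like this (the helper is hypothetical; in practice `s3_client` would be a `boto3.client('s3')`, injected here so the sketch stays testable without AWS access):

```python
from typing import List, Tuple

def delete_processed(docs, s3_client) -> List[Tuple[str, str]]:
    # Delete each processed object using the proposed metadata fields.
    # `s3_client` is expected to offer boto3's delete_object(Bucket=, Key=).
    deleted = []
    for doc in docs:
        bucket = doc.metadata["s3_bucket"]
        key = doc.metadata["s3_object_key"]
        s3_client.delete_object(Bucket=bucket, Key=key)
        deleted.append((bucket, key))
    return deleted
```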