
Add bucket and object key to metadata in S3 loader #9317

Merged · 3 commits · Aug 30, 2023

Conversation

cbornet (Collaborator) commented Aug 16, 2023

  • Description: this PR adds s3_object_key and s3_bucket to the doc metadata when loading an S3 file. This is particularly useful when using S3DirectoryLoader to remove files from the directory once they have been processed, since getting the object keys out of the metadata source field seems brittle.
  • Dependencies: N/A
  • Tag maintainer: ?
  • Twitter handle: _cbornet
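A minimal sketch of the clean-up use case the description mentions, assuming each loaded document carries the new s3_bucket and s3_object_key metadata fields. Plain dicts stand in for LangChain Document objects, and the actual boto3 delete_object call is left out:

```python
# Sketch: collect (bucket, key) pairs from loaded documents so the
# corresponding S3 objects can be deleted once processing is done.
# Plain dicts stand in for LangChain Document objects here.

def objects_to_delete(docs):
    """Return the (bucket, key) pair for every doc that carries S3 metadata."""
    pairs = []
    for doc in docs:
        meta = doc.get("metadata", {})
        if "s3_bucket" in meta and "s3_object_key" in meta:
            pairs.append((meta["s3_bucket"], meta["s3_object_key"]))
    return pairs

docs = [{"metadata": {"s3_bucket": "my-bucket", "s3_object_key": "reports/a.pdf"}}]
print(objects_to_delete(docs))  # [('my-bucket', 'reports/a.pdf')]
```

Each pair could then be passed to a boto3 `delete_object(Bucket=..., Key=...)` call once the document has been processed.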


@dosubot added labels Ɑ: doc loader (Related to document loader module, not documentation) and 🤖:improvement (Medium size change to existing code to handle new use-cases) on Aug 16, 2023
cbornet (Collaborator Author) commented Aug 16, 2023

An alternative would be to allow passing initial metadata values to UnstructuredFileLoader when creating it.

```diff
@@ -35,4 +35,8 @@ def load(self) -> List[Document]:
         os.makedirs(os.path.dirname(file_path), exist_ok=True)
         s3.download_file(self.bucket, self.key, file_path)
         loader = UnstructuredFileLoader(file_path)
-        return loader.load()
+        docs = loader.load()
+        for doc in docs:
```
Collaborator commented on the diff:

Could you replace this with

    doc.metadata['source'] = f's3://{bucket_name}/{key}'

source can then be parsed downstream to yield bucket_name and key. We want to standardize on a single field to describe the provenance of the data.
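The downstream parsing this suggests could look like the following sketch; the helper name is made up and is not LangChain API:

```python
def parse_s3_source(source: str) -> tuple[str, str]:
    """Split an 's3://bucket/key' source string into (bucket, key)."""
    prefix = "s3://"
    if not source.startswith(prefix):
        raise ValueError(f"not an S3 URI: {source!r}")
    # Everything before the first '/' is the bucket; the rest is the key.
    bucket, _, key = source[len(prefix):].partition("/")
    return bucket, key

print(parse_s3_source("s3://my-bucket/reports/2023/summary.pdf"))
# ('my-bucket', 'reports/2023/summary.pdf')
```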

Collaborator Author replied:

Sure.

Collaborator Author replied:

Done.
I also changed the way documents are loaded, by inheriting from UnstructuredBaseLoader instead of using UnstructuredFileLoader. I think it's cleaner this way. WDYT?

Collaborator replied:

The loader lives under a generic s3 namespace, so I don't think it should be coupled to any particular parser. I'd prefer to make it possible to pass an arbitrary parser in the initializer instead; we can keep the unstructured one as the default parser to make sure there are no breaking changes for existing users.
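A sketch of that composition idea, using hypothetical names rather than the actual LangChain classes: the loader takes a parser callable at construction time and defaults to an unstructured-based one, so existing users are unaffected.

```python
def unstructured_parse(file_path):
    """Stand-in for the default unstructured-based parser (hypothetical)."""
    return [{"page_content": f"<contents of {file_path}>", "metadata": {}}]

class S3FileLoader:
    """Hypothetical loader that accepts any parser callable."""

    def __init__(self, bucket, key, parser=unstructured_parse):
        self.bucket = bucket
        self.key = key
        self.parser = parser  # any callable: file_path -> list of docs

    def load(self, file_path):
        # Fetching (download from S3) is omitted; after download, delegate
        # parsing to the injected callable and stamp provenance metadata.
        docs = self.parser(file_path)
        for doc in docs:
            doc["metadata"]["source"] = f"s3://{self.bucket}/{self.key}"
        return docs

docs = S3FileLoader("my-bucket", "a.txt").load("/tmp/a.txt")
print(docs[0]["metadata"]["source"])  # s3://my-bucket/a.txt
```

Because the parser is just a callable, any existing file parser can be reused regardless of where the file came from, which is the decoupling discussed below.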

cbornet (Collaborator Author) commented Aug 24, 2023

You mean we pass the parser class and, in load(), instantiate it with the file name, key, and bucket as constructor arguments?

Collaborator replied:

I think that would be a bit better, though maybe not critical since the current change isn't breaking for users.

At a higher level, I'm trying to decouple fetching from content and rely more on composition to make it as easy as possible to re-use existing parsers on files regardless of where the files are stored.

We have a BlobGenerator abstraction that could be given an S3 implementation able to fetch file blobs from S3 matching certain criteria.

Collaborator commented:

Let me know if you want to make changes now or if you'd like me to merge as is.

Collaborator Author replied:

I think one of the problems with that is that parsers are currently mixed with loaders; the two concepts would probably need to be separated.
I'm OK with merging as-is and helping to improve it later.

@eyurtsev added the lgtm label (PR looks good; use to confirm that a PR is ready for merging) on Aug 30, 2023
eyurtsev (Collaborator) commented:

Ready to merge as soon as tests pass
