
Allow adding metadata for chunks of files #318

Open
drale2k opened this issue Sep 18, 2023 · 17 comments

Comments

@drale2k

drale2k commented Sep 18, 2023

Currently only add_texts takes a metadata argument, but add_data does not. Since add_data takes an array of files, it would be clunky to extend it to accept metadata directly. Adding metadata needs to happen at the chunk level.

The use case I have for this is adding the page number a chunk was found on and referencing that as the source of the information.

To work around it, I am currently reading and chunking files manually and then calling add_texts to supply the metadata. It's not too difficult, but it would be nice if this were easier.
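The workaround described above can be sketched in plain Ruby (a minimal illustration, not langchainrb's actual API; the chunking here is a naive fixed-size split, and the `add_texts` call is shown only as a comment):

```ruby
# Hypothetical workaround sketch: chunk a document manually and attach
# per-page metadata before handing the pieces to add_texts.
# `pages` stands in for text extracted per page (e.g. from a PDF reader).
def chunks_with_page_metadata(pages, chunk_size: 200)
  pages.flat_map.with_index(1) do |page_text, page_number|
    page_text.scan(/.{1,#{chunk_size}}/m).map do |chunk|
      { text: chunk, metadata: { page: page_number } }
    end
  end
end

pages = ["First page text.", "Second page text that is a bit longer."]
chunks = chunks_with_page_metadata(pages, chunk_size: 20)
# Each chunk could then be supplied individually, e.g.:
#   client.add_texts(texts: [chunk[:text]], metadata: chunk[:metadata])
```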

@andreibondarev
Collaborator

@drale2k Do you think this functionality should go into https://github.com/moekidev/baran which is the gem we're using to do chunking?

@drale2k
Author

drale2k commented Sep 20, 2023

Good question, given that baran is a text splitter built specifically for LLMs. But even if baran were to accept metadata, langchainrb would still need to take it as input. You still want people to interact with the langchainrb APIs and not baran directly, right?

@andreibondarev
Collaborator

@drale2k Correct, I'm just saying that those changes would need to happen in the baran gem first, and then Langchain.rb would make the corresponding changes to accept metadata. I think that instead of returning a plain array of chunks, baran should return a different data structure that holds all that metadata as well. Do you want to suggest those changes to @moekidev?
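One possible shape for that richer data structure (purely illustrative; not baran's actual API): instead of plain strings, each chunk is a hash carrying the text plus positional metadata the splitter already knows.

```ruby
# Hypothetical return structure for a metadata-aware splitter: each chunk
# records the text and the cursor (character offset) where it started.
def chunks_with_positions(text, chunk_size:)
  position = 0
  text.scan(/.{1,#{chunk_size}}/m).map do |chunk|
    entry = { text: chunk, cursor: position }
    position += chunk.length
    entry
  end
end
```

A caller could then merge its own metadata (page numbers, source URLs) into each chunk hash before upserting.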

@kawakamimoeki
Contributor

We released! https://github.com/moekidev/baran/releases/tag/v0.1.9

@andreibondarev
Collaborator

andreibondarev commented Sep 28, 2023

@drale2k Which vectorsearch DB are you using btw? And what kind of files are you looking to upload?

@drale2k
Author

drale2k commented Sep 28, 2023

Currently mostly Pinecone, but I have been looking into open-source ones as well. PDFs and MS Office docx and ppt, mostly. I'm starting to look into audio transcriptions as well, using https://github.com/guillaumekln/faster-whisper

@jjimenez

This would help support the ability to add metadata, such as source document names / source URLs, for the text.

I can see this being useful in add_data by optionally accepting an array of objects instead of just string paths: check the class of the "path" argument before passing it to the chunker, so that an object like

{ path: 'string/path/to/file', metadata: { url: "https://some.location.com/some-page-name", path: 'string/path/to/file', style: "blues" } }

could be sent to the chunker. Then, when we ask the vectorsearch database for similarities, we should also get the metadata back to use for source links.
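The normalization step described above could look roughly like this (a sketch only; the method name and hash shape are hypothetical, not the actual add_data implementation):

```ruby
# Accept either plain string paths or { path:, metadata: } hashes,
# normalizing everything to the hash form before chunking.
def normalize_sources(sources)
  sources.map do |source|
    case source
    when String then { path: source, metadata: {} }
    when Hash   then { path: source.fetch(:path), metadata: source[:metadata] || {} }
    else raise ArgumentError, "unsupported source: #{source.class}"
    end
  end
end
```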

@jjimenez

I'm still reading the code... It looks like Langchain::Loader will actually take a URL! That is nice. I'll have to give that a try. It would be nice if the URL were passed into the vectorsearch database as metadata directly.

I'm wishing out loud and should definitely consider making a pull request.

Thanks for making this a lot easier!

@andreibondarev
Collaborator

@jjimenez Take a look at this draft branch I'm working on: https://github.com/andreibondarev/langchainrb/pull/538/files. The rest of the vectorsearch DBs need to be fixed to accept the metadatas: param.

@pedroresende

Any news on this feature?

@sean-dickinson

I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch DBs don't all currently have a schema that supports storing this new metadata. Is this something you would be interested in getting help with?

@andreibondarev
Collaborator

> I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch dbs don't all currently have a schema to support storing this new metadata. Would this be something you would be interested in help with?

@sean-dickinson Yes! Any help here would be extremely appreciated! Do you have a DSL in mind that we'd implement?

@sean-dickinson

> I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch dbs don't all currently have a schema to support storing this new metadata. Would this be something you would be interested in help with?
>
> @sean-dickinson Yes! Any help here would be extremely appreciated! Do you have a DSL in mind we'd implement?

@andreibondarev I'm not sure it's so much a DSL as a standardized schema for storing the data. That said, I'm not very knowledgeable about the different vector databases here, but I'm assuming that in an ideal schema we would be able to store an object that represents the original data source (with a unique identifier) and then a collection of objects with the actual text splits that reference the original source.

The metadata field could live on either the parent or the chunks (or both, I suppose, if you wanted?), but the parent probably makes the most sense. Then when you do a search, you search the chunks and can also grab the parent record that the chunks reference to get the metadata.

I'm thinking this schema allows for the easiest updates if the sources you are using change (for instance, if your source is a URL and the content updates, the URL stays the same but you want to clear out the old chunks and add new ones).

Note: I'm taking these ideas from the LangChain Python pgvector implementation.

In terms of a DSL, maybe it makes sense to name the concept of a data source, like a DataSource, to help conform to a more structured schema?
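The parent/chunk layout sketched above can be modeled in a few lines of plain Ruby (a hypothetical structure, not the actual langchainrb schema): sources hold the metadata once, chunks point back at their source, and a search hit joins the two.

```ruby
# Hypothetical parent/chunk model: metadata lives on the Source, and each
# Chunk references its parent by source_id.
Source = Struct.new(:id, :metadata)
Chunk  = Struct.new(:source_id, :text)

sources = { 1 => Source.new(1, { url: "https://example.com/page" }) }
chunks  = [Chunk.new(1, "first split"), Chunk.new(1, "second split")]

# A "search hit" returns the chunk text plus the metadata from its parent.
def hit_with_metadata(chunk, sources)
  { text: chunk.text, metadata: sources[chunk.source_id].metadata }
end
```

Because the metadata lives on the parent, refreshing a changed source only means deleting its chunks and inserting new ones; the source row (and its URL) stays put.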

@andreibondarev
Collaborator

@sean-dickinson I think I would be more in favor of an iterative approach here: enhancing and standardizing the current schema across all the different vectorsearch DBs, as opposed to overhauling it. We can slowly iterate toward the ideal state eventually.

@sean-dickinson

> @sean-dickinson I think I would be more in favor of an iterative approach here: enhancing and standardizing the current schema across all the different vectorsearch DBs, as opposed to overhauling it. We can slowly iterate toward the ideal state eventually.

@andreibondarev Totally fair. I think we can achieve essentially the same functionality by just adding a metadata field for each of the vector DBs, like you said; it could always be improved upon in the future should the need arise.

Regardless, what's the state of your branch referenced here, where you added the metadata as part of the parsing process? Do you want to build on that and update all the vector DB adapters to expect this new field on that branch? Or do you want a separate PR to update all the vector DB schemas?

@andreibondarev
Collaborator

andreibondarev commented May 20, 2024

@sean-dickinson I think maybe we first standardize the metadata: {} param across all of the different vectorsearch providers.

We could probably do pgvector first. I think you'd need to add some sort of metadata JSON column and change this method: https://github.com/patterns-ai-core/langchainrb/blob/main/lib/langchain/vectorsearch/pgvector.rb#L73-L83
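As a rough sketch of that change (illustrative only; the method name and row shape are assumptions, not the current pgvector.rb code), the upsert would serialize a per-text metadata hash into the new JSON column:

```ruby
require "json"

# Hypothetical helper: build one row per text for the pgvector upsert,
# with the metadata hash serialized into a JSON column.
def rows_for_upsert(texts:, vectors:, metadatas: [])
  texts.each_with_index.map do |text, i|
    {
      content: text,
      vectors: vectors[i],
      metadata: (metadatas[i] || {}).to_json
    }
  end
end
```

On the read side, the similarity-search query would then select the metadata column back out alongside the content, so hits carry their source info.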

What're your thoughts?
