
Allow adding metadata for chunks of files #318

Open
drale2k opened this issue Sep 18, 2023 · 17 comments

Comments

@drale2k

drale2k commented Sep 18, 2023

Currently only add_texts takes a metadata argument, but add_data does not. Since add_data takes an array of files, it would be clunky to extend it to accept metadata directly. Adding metadata needs to happen at the chunk level.

The use case I have for this is adding the page number a chunk was found on and referencing that as the source of the information.

To work around it, I am currently reading and chunking files manually and then calling add_texts to supply the metadata. It's not too difficult, but it would be nice if this were easier.
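The workaround described above can be sketched in plain Ruby (a minimal illustration, not langchainrb's actual API; the chunking here is a naive fixed-size split, and the `add_texts` call is shown only as a comment):

```ruby
# Hypothetical workaround sketch: chunk a document manually and attach
# per-page metadata before handing the pieces to add_texts.
# `pages` stands in for text extracted per page (e.g. from a PDF reader).
def chunks_with_page_metadata(pages, chunk_size: 200)
  pages.flat_map.with_index(1) do |page_text, page_number|
    page_text.scan(/.{1,#{chunk_size}}/m).map do |chunk|
      { text: chunk, metadata: { page: page_number } }
    end
  end
end

pages = ["First page text.", "Second page text that is a bit longer."]
chunks = chunks_with_page_metadata(pages, chunk_size: 20)
# Each chunk could then be supplied individually, e.g.:
#   client.add_texts(texts: [chunk[:text]], metadata: chunk[:metadata])
```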

@andreibondarev
Collaborator

@drale2k Do you think this functionality should go into https://github.com/moekidev/baran which is the gem we're using to do chunking?

@drale2k
Author

drale2k commented Sep 20, 2023

Good question, given that baran is a text splitter built specifically for LLMs. But even if baran were to accept metadata, langchainrb would still need to take it as input. You still want people to interact with the langchainrb APIs and not baran directly, right?

@andreibondarev
Collaborator

@drale2k Correct, I'm just saying that those changes would need to happen in the baran gem first, and then Langchain.rb would make the corresponding changes to accept metadata. I think that instead of returning a plain array of chunks, baran should return a different data structure that holds all that metadata as well. Do you want to suggest those changes to @moekidev?
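One possible shape for that richer data structure (purely illustrative; not baran's actual API): instead of plain strings, each chunk is a hash carrying the text plus positional metadata the splitter already knows.

```ruby
# Hypothetical return structure for a metadata-aware splitter: each chunk
# records the text and the cursor (character offset) where it started.
def chunks_with_positions(text, chunk_size:)
  position = 0
  text.scan(/.{1,#{chunk_size}}/m).map do |chunk|
    entry = { text: chunk, cursor: position }
    position += chunk.length
    entry
  end
end
```

A caller could then merge its own metadata (page numbers, source URLs) into each chunk hash before upserting.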

@kawakamimoeki
Contributor

We released! https://github.com/moekidev/baran/releases/tag/v0.1.9

@andreibondarev
Collaborator

andreibondarev commented Sep 28, 2023

@drale2k Which vectorsearch DB are you using btw? And what kind of files are you looking to upload?

@drale2k
Author

drale2k commented Sep 28, 2023

Currently mostly Pinecone, but I have been looking into open-source ones as well. PDFs and MS Office docx and ppt, mostly. I'm starting to look into audio transcriptions as well, using https://github.com/guillaumekln/faster-whisper

@jjimenez

This would help support the ability to add metadata, such as source document names / source URLs, for the text.

I can see this being useful in add_data by optionally accepting an array of objects instead of just string paths: check the class of the "path" argument before passing it to the chunker, so that an object like

{ path: 'string/path/to/file', metadata: { url: "https://some.location.com/some-page-name", path: 'string/path/to/file', style: "blues" } }

could be sent to the chunker. Then, when we ask the vectorsearch database for similarities, we should also get the metadata back to use for source links.
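The normalization step described above could look roughly like this (a sketch only; the method name and hash shape are hypothetical, not the actual add_data implementation):

```ruby
# Accept either plain string paths or { path:, metadata: } hashes,
# normalizing everything to the hash form before chunking.
def normalize_sources(sources)
  sources.map do |source|
    case source
    when String then { path: source, metadata: {} }
    when Hash   then { path: source.fetch(:path), metadata: source[:metadata] || {} }
    else raise ArgumentError, "unsupported source: #{source.class}"
    end
  end
end
```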

@jjimenez

I'm still reading the code... It looks like Langchain::Loader will actually take a URL! That is nice. I'll have to give that a try. It would be nice if the URL were passed into the vectorsearch database as metadata directly.

I'm wishing out loud and should definitely consider making a pull request.

Thanks for making this a lot easier!

@andreibondarev
Collaborator

@jjimenez Take a look at this draft branch I'm working on: https://github.com/andreibondarev/langchainrb/pull/538/files. The rest of the vectorsearch DBs need to be fixed to accept the metadatas: param.

@pedroresende

Any news on this feature?

@sean-dickinson

I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch DBs don't all currently have a schema that supports storing this new metadata. Is this something you would be interested in getting help with?

@andreibondarev
Collaborator

> I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch dbs don't all currently have a schema to support storing this new metadata. Would this be something you would be interested in help with?

@sean-dickinson Yes! Any help here would be extremely appreciated! Do you have a DSL in mind that we'd implement?

@sean-dickinson

> I'm interested in this feature, particularly for pgvector. I'm noticing that the different vectorsearch dbs don't all currently have a schema to support storing this new metadata. Would this be something you would be interested in help with?
>
> @sean-dickinson Yes! Any help here would be extremely appreciated! Do you have a DSL in mind we'd implement?

@andreibondarev I'm not sure it's so much a DSL as a standardized schema for storing the data. That said, I'm not very knowledgeable about the different vector databases here, but I'm assuming that in an ideal schema we would be able to store an object that represents the original data source (with a unique identifier) and then a collection of objects with the actual text splits that reference the original source.

The metadata field could live on either the parent or the chunks (or both, I suppose, if you wanted?), but the parent probably makes the most sense. Then when you do a search, you search the chunks and can also grab the parent record that the chunks reference to get the metadata.

I'm thinking this schema allows for the easiest updates if the sources you are using change (for instance, if your source is a URL and the content updates, the URL stays the same but you want to clear out the old chunks and add new ones).

Note: I'm taking these ideas from the LangChain Python pgvector implementation.

In terms of a DSL, maybe it makes sense to name the concept of a data source, like a DataSource, to help conform to a more structured schema?
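The parent/chunk layout sketched above can be modeled in a few lines of plain Ruby (a hypothetical structure, not the actual langchainrb schema): sources hold the metadata once, chunks point back at their source, and a search hit joins the two.

```ruby
# Hypothetical parent/chunk model: metadata lives on the Source, and each
# Chunk references its parent by source_id.
Source = Struct.new(:id, :metadata)
Chunk  = Struct.new(:source_id, :text)

sources = { 1 => Source.new(1, { url: "https://example.com/page" }) }
chunks  = [Chunk.new(1, "first split"), Chunk.new(1, "second split")]

# A "search hit" returns the chunk text plus the metadata from its parent.
def hit_with_metadata(chunk, sources)
  { text: chunk.text, metadata: sources[chunk.source_id].metadata }
end
```

Because the metadata lives on the parent, refreshing a changed source only means deleting its chunks and inserting new ones; the source row (and its URL) stays put.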

@andreibondarev
Collaborator

@sean-dickinson I think I would be more in favor of an iterative approach here: enhancing and standardizing the current schema across all the different vectorsearch DBs, as opposed to overhauling it. We can slowly iterate toward the ideal state eventually.

@sean-dickinson

> @sean-dickinson I think I would be more in favor of an iterative approach here: enhancing and standardizing the current schema across all the different vectorsearch DBs, as opposed to overhauling it. We can slowly iterate toward the ideal state eventually.

@andreibondarev Totally fair. I think we can achieve essentially the same functionality by just adding a metadata field for each of the vector DBs, like you said; it could always be improved upon in the future should the need arise.

Regardless, what's the state of your branch referenced here, where you added the metadata as part of the parsing process? Do you want to build on that and update all the vector DB adapters to expect this new field on that branch? Or do you want a separate PR to update all the vector DB schemas?

@andreibondarev
Collaborator

andreibondarev commented May 20, 2024

@sean-dickinson I think maybe we first standardize the metadata: {} param across all of the different vectorsearch providers.

We could probably do pgvector first. I think you'd need to add some sort of metadata JSON column and change this method: https://github.com/patterns-ai-core/langchainrb/blob/main/lib/langchain/vectorsearch/pgvector.rb#L73-L83
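As a rough sketch of that change (illustrative only; the method name and row shape are assumptions, not the current pgvector.rb code), the upsert would serialize a per-text metadata hash into the new JSON column:

```ruby
require "json"

# Hypothetical helper: build one row per text for the pgvector upsert,
# with the metadata hash serialized into a JSON column.
def rows_for_upsert(texts:, vectors:, metadatas: [])
  texts.each_with_index.map do |text, i|
    {
      content: text,
      vectors: vectors[i],
      metadata: (metadatas[i] || {}).to_json
    }
  end
end
```

On the read side, the similarity-search query would then select the metadata column back out alongside the content, so hits carry their source info.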

What're your thoughts?
