Adding functionality for multi-column ingestion into vector databases and skills #8990

QuantumPlumber · 2024-03-26T00:04:14Z

Description

The current vector database and knowledge base implementation does not allow multi-column insertion, even though the langchain_embeddings handler supports multi-column embedding. The handler concatenates the row entries as context for the embedding but does not return the concatenation, so the actual context provided to the embedding model remains hidden. This PR explicitly returns the embedding context as the embedding_context column.

The knowledge base insert statement is modified to handle the protected embedding_context column.

embedding_context is registered as a protected column name for the vector database integration.
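As a rough illustration of the pattern described above, the sketch below concatenates several columns into one context string per row and returns that string next to the vectors. The `fake_embed_documents` function, the `embed_rows` helper, and the "column: value" join format are assumptions made for this example; they are not taken from the handler's actual code.

```python
import pandas as pd


def fake_embed_documents(texts):
    """Stand-in for an embedding model (assumption): 2 numbers per text."""
    return [[float(len(t)), float(t.count(" "))] for t in texts]


def embed_rows(df, input_columns, target="embeddings"):
    """Hypothetical helper: embed rows and expose the text that was embedded."""
    # Build one context string per row, e.g. "title: Foo\nbody: Bar".
    df_texts = df[input_columns].apply(
        lambda row: "\n".join(f"{col}: {row[col]}" for col in input_columns),
        axis=1,
    )
    embeddings = fake_embed_documents(df_texts.tolist())
    # Return the context alongside the vectors so callers can inspect it.
    return df.copy().assign(**{"embedding_context": df_texts, target: embeddings})


df = pd.DataFrame({"title": ["Foo", "Baz"], "body": ["Bar qux", "Quux"]})
result = embed_rows(df, ["title", "body"])
print(result["embedding_context"].tolist())
```

With the context returned as a regular column, the knowledge base insert can store exactly what the model saw instead of guessing at it.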

Fixes #issue_number

Type of change

(Please delete options that are not relevant)

[ ] 🐛 Bug fix (non-breaking change which fixes an issue)
[x] ⚡ New feature (non-breaking change which adds functionality)
[ ] 📢 Breaking change (fix or feature that would cause existing functionality not to work as expected)
[x] 📄 This change requires a documentation update

Verification Process

To ensure the changes are working as expected:

Tested locally on most up-to-date staging branch.

Test Location: Specify the URL or path for testing.
Verification Steps: Outline the steps or queries needed to validate the change. Include any data, configurations, or actions required to reproduce or see the new functionality.

Additional Media:

I have attached a brief loom video or screenshots showcasing the new functionality or change.

Checklist:

[x] My code follows the style guidelines (PEP 8) of MindsDB.
[x] I have appropriately commented on my code, especially in complex areas.
[ ] Necessary documentation updates are either made or tracked in issues.
[ ] Relevant unit and integration tests are updated or added.

tmichaeldb

Thanks for tackling this! Just need to update the rest of the embedding handlers, and (preferably) add some tests.

tmichaeldb · 2024-03-26T16:56:55Z

mindsdb/interfaces/knowledge_base/controller.py

+
+        # rename model's 'embedding_context' column to 'content'
+        df = df.rename(
+            columns={TableField.CONTEXT.value: TableField.CONTENT.value}


This currently only works with langchain_embedding_handler, because it is the only handler that adds this embedding_context column.

Gotcha, adding it to the sentence transformer handler.

@QuantumPlumber perhaps worth us creating a base Embedding class like we have for vector stores
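One way such a base class might look is sketched below. Everything here is hypothetical: `BaseEmbeddingHandler`, `build_context`, `embed`, and the toy subclass are names invented for illustration and do not exist in MindsDB. The idea is only that the shared `predict` always adds the embedding_context column, so individual handlers cannot forget it.

```python
from abc import ABC, abstractmethod
from typing import List

import pandas as pd


class BaseEmbeddingHandler(ABC):
    """Hypothetical base class; not part of MindsDB."""

    CONTEXT_COLUMN = "embedding_context"

    @abstractmethod
    def build_context(self, df: pd.DataFrame) -> pd.Series:
        """Return the exact text sent to the model for each row."""

    @abstractmethod
    def embed(self, texts: List[str]) -> List[List[float]]:
        """Return one vector per input text."""

    def predict(self, df: pd.DataFrame, target: str = "embeddings") -> pd.DataFrame:
        # Shared logic: every subclass returns the context next to the vectors.
        context = self.build_context(df)
        vectors = self.embed(context.tolist())
        return df.copy().assign(**{self.CONTEXT_COLUMN: context, target: vectors})


class SingleColumnHandler(BaseEmbeddingHandler):
    """Toy subclass: the context is just the 'content' column."""

    def build_context(self, df):
        return df["content"].astype(str)

    def embed(self, texts):
        # Placeholder "model": one number per text.
        return [[float(len(t))] for t in texts]


out = SingleColumnHandler().predict(pd.DataFrame({"content": ["hello", "hi"]}))
print(out)
```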

tmichaeldb · 2024-03-26T17:01:25Z

mindsdb/integrations/handlers/langchain_embedding_handler/langchain_embedding_handler.py

@@ -161,7 +161,7 @@ def predict(self, df: DataFrame, args) -> DataFrame:
        embeddings = model.embed_documents(df_texts.tolist())

        # create a new dataframe with the embeddings
-        df_embeddings = df.copy().assign(**{target: embeddings})
+        df_embeddings = df.copy().assign(**{'embedding_context': df_texts, target: embeddings})


We would need to update this for all embedding handlers (e.g. https://github.com/mindsdb/mindsdb/blob/staging/mindsdb/integrations/handlers/sentence_transformers_handler/sentence_transformers_handler.py#L66)

The sentence transformer is more transparent: the input to the model is just the document, so we can duplicate that entry in the dataframe.

tmichaeldb · 2024-03-26T17:03:12Z

mindsdb/interfaces/knowledge_base/controller.py

@@ -126,6 +128,15 @@ def insert(self, df: pd.DataFrame):
        df_emb = self._df_to_embeddings(df)
        df = pd.concat([df, df_emb], axis=1)

+        # drop original 'content' column if it exists
+        if TableField.CONTENT.value in df.columns:
+            df = df.drop(TableField.CONTENT.value, axis='columns')


The original content may be different from the embedding_context and is good to have still. Is there any major downside to keeping both columns?

We can rename the original content to something else. I dropped it because we have to rename embedding_context to content to preserve the rest of the functionality.
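A minimal sketch of the "keep both" option discussed here, assuming the original column is moved aside before the rename. The string constants stand in for `TableField.CONTENT.value` and `TableField.CONTEXT.value` (their values are assumed), and `original_content` is a hypothetical column name that does not come from the PR.

```python
import pandas as pd

CONTENT = "content"              # assumed value of TableField.CONTENT.value
CONTEXT = "embedding_context"    # assumed value of TableField.CONTEXT.value
ORIGINAL = "original_content"    # hypothetical name, not from the PR


def keep_both(df: pd.DataFrame) -> pd.DataFrame:
    # Move the user-supplied content aside instead of dropping it.
    if CONTENT in df.columns:
        df = df.rename(columns={CONTENT: ORIGINAL})
    # Expose the embedding context under the name downstream code expects.
    return df.rename(columns={CONTEXT: CONTENT})


df = pd.DataFrame({
    "content": ["raw text"],
    "embedding_context": ["title: T\nbody: raw text"],
    "embeddings": [[0.1, 0.2]],
})
result = keep_both(df)
print(list(result.columns))
```

The trade-off is one extra stored column per row in exchange for being able to compare what the user submitted with what was actually embedded.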

ea-rus · 2024-04-24T15:13:36Z

This PR should be covered by #9005

removing redundant package lists and sentence-transformers library

c36ff6b

QuantumPlumber requested a review from ea-rus March 26, 2024 00:04

adding logging in controller.py

bd416df

QuantumPlumber requested a review from tmichaeldb March 26, 2024 00:45

tmichaeldb requested changes Mar 26, 2024

QuantumPlumber added 2 commits March 26, 2024 12:33

conform sentence transformer embedding to new RAG pattern.

6012122

Keep originally submitted context in database return.

64f259e

ea-rus mentioned this pull request Mar 28, 2024

Managing knowledge base columns #9005

Merged

11 tasks


QuantumPlumber commented Mar 26, 2024

tmichaeldb left a comment

tmichaeldb Mar 26, 2024

QuantumPlumber Mar 26, 2024

dusvyat Mar 27, 2024

tmichaeldb Mar 26, 2024

QuantumPlumber Mar 26, 2024

tmichaeldb Mar 26, 2024

QuantumPlumber Mar 26, 2024

ea-rus commented Apr 24, 2024
