Pprados/rag vectorstore #13910

pprados · 2023-11-27T11:38:43Z

Description:
The ParentDocumentRetriever class provides an add_documents() method, which may seem unconventional for a retriever.

This deviation arises from the need to establish and maintain links between document fragments and their various versions. That is why parent_splitter and child_splitter were introduced. As elaborated in this pull request, these should not be restricted to text splitters. It is preferable to work with transformations, from which text splitters inherit.

However, my initial proposal proved insufficient to fully leverage this concept with all the advanced features of Langchain. An issue has been raised to address this limitation.

To improve text selection, the idea behind ParentDocumentRetriever is as follows:

- Convert each text fragment into multiple versions to bring questions and potential answers closer together.
- Select the corresponding fragments in the vector store.
- Return the original fragment for injection into the prompt.

For this purpose, it is essential to maintain a link between the original document, each document chunk, and every transformation applied to each chunk.
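
The three steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from this PR: `Doc`, `split_into_chunks`, `make_variants`, `overlap`, and `retrieve` are hypothetical names, and a word-overlap score stands in for real embeddings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Doc:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_into_chunks(doc_id: str, text: str) -> List[Doc]:
    """Hypothetical splitter: one chunk per sentence, linked to its document."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [
        Doc(s, {"doc_id": doc_id, "chunk_id": f"{doc_id}-{i}"})
        for i, s in enumerate(sentences)
    ]


def make_variants(chunk: Doc, transforms: List[Callable[[str], str]]) -> List[Doc]:
    """Each variant keeps the chunk_id of the chunk it was derived from."""
    return [Doc(t(chunk.text), dict(chunk.metadata)) for t in transforms]


def overlap(query: str, text: str) -> int:
    """Toy similarity: count of shared lowercase words (stand-in for embeddings)."""
    return len(set(query.lower().split()) & set(text.lower().split()))


# Step 1: turn each chunk into several searchable versions.
transforms = [str.lower, lambda t: "question about " + t.lower()]
chunks = split_into_chunks("doc1", "Paris is the capital of France. Berlin is in Germany.")
chunk_store = {c.metadata["chunk_id"]: c for c in chunks}
variant_index = [v for c in chunks for v in make_variants(c, transforms)]


def retrieve(query: str) -> Doc:
    # Step 2: select the best-matching variant.
    best = max(variant_index, key=lambda v: overlap(query, v.text))
    # Step 3: follow the link back and return the original chunk.
    return chunk_store[best.metadata["chunk_id"]]


result = retrieve("capital of France")
```

The point of the sketch is the `chunk_id` carried in every variant's metadata: without that link, a match on a transformed version could not be mapped back to the original fragment.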

My Proposition:
The RagVectorStore class serves as a vector store wrapper with rules for transforming a document into chunks and each chunk into different versions. It closely resembles ParentDocumentRetriever but is a VectorStore instead of a Retriever.

This approach is advantageous due to its simplicity and compatibility with all advanced features of Langchain.
Feel free to explore the benefits of this solution in this notebook.
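
As a rough illustration of the wrapper idea (not the RagVectorStore implementation from this PR), the sketch below wraps a toy store and keeps the same add/search surface. `ToyVectorStore` and `WrappedStoreSketch` are hypothetical names, and the word-overlap ranking is an assumption standing in for embedding similarity.

```python
from typing import Callable, Dict, List, Set, Tuple

Row = Tuple[str, Dict[str, str]]


class ToyVectorStore:
    """Stand-in for a real vector store; ranks by shared words, not embeddings."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def add_texts(self, texts: List[str], metadatas: List[Dict[str, str]]) -> None:
        self._rows.extend(zip(texts, metadatas))

    def similarity_search(self, query: str, k: int = 4) -> List[Row]:
        words = set(query.lower().split())
        ranked = sorted(
            self._rows,
            key=lambda row: len(words & set(row[0].lower().split())),
            reverse=True,
        )
        return ranked[:k]


class WrappedStoreSketch:
    """Hypothetical wrapper: indexes chunk variants, returns original chunks."""

    def __init__(
        self,
        store: ToyVectorStore,
        chunker: Callable[[str], List[str]],
        transforms: List[Callable[[str], str]],
    ) -> None:
        self._store = store
        self._chunker = chunker
        self._transforms = transforms
        self._chunks: Dict[str, str] = {}  # chunk_id -> original chunk text

    def add_texts(self, texts: List[str]) -> None:
        for text in texts:
            for chunk in self._chunker(text):
                chunk_id = str(len(self._chunks))
                self._chunks[chunk_id] = chunk
                for transform in self._transforms:
                    # Every variant records which chunk it came from.
                    self._store.add_texts([transform(chunk)], [{"chunk_id": chunk_id}])

    def similarity_search(self, query: str, k: int = 4) -> List[str]:
        # Over-fetch, since several variants may point at the same chunk.
        hits = self._store.similarity_search(query, k=k * len(self._transforms))
        seen: Set[str] = set()
        results: List[str] = []
        for _, metadata in hits:
            chunk_id = metadata["chunk_id"]
            if chunk_id not in seen:
                seen.add(chunk_id)
                results.append(self._chunks[chunk_id])
            if len(results) == k:
                break
        return results


wrapped = WrappedStoreSketch(
    ToyVectorStore(),
    chunker=lambda t: [s.strip() for s in t.split(".") if s.strip()],
    transforms=[str.lower, str.upper],
)
wrapped.add_texts(["Cats purr. Dogs bark."])
top = wrapped.similarity_search("dogs bark", k=1)
```

Because the wrapper exposes the same kind of interface as the store it wraps, code written against a vector store could, in principle, use it without knowing about the chunk and variant bookkeeping.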

Compatibility includes:

- Indexing API
- Advanced retrievers like MultiQueryRetriever, Self Querying, etc.

The implementation required several changes and additional pull requests, each of which can be validated independently before merging into this one:

Issues:

Dependencies:
Any dependencies required for this change.

Tag Maintainers:
@baskaryan @hwchase17

Twitter Handle:
Twitter account: pprados

@baskaryan

Changes:
- remove langchain_core/schema since there is no clear distinction between schema and non-schema modules
- make every module that doesn't end in -y plural
- where easy, have 1-2 classes per file
- no more than one level of nesting in directories
- only import from top level core modules in langchain

@tjaffri

Adds a cookbook for semi-structured RAG via Docugami. This follows the same outline as the semi-structured RAG with Unstructured cookbook: https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb

The main change is that this cookbook uses Docugami instead of Unstructured to find text and tables, and shows how XML markup in the output helps with retrieval and generation.

We are @docugami on twitter, I am @tjaffri

Co-authored-by: Taqi Jaffri <tjaffri@docugami.com>

…ate(...) (langchain-ai#13645) **Description:** BaseStringMessagePromptTemplate.from_template was passing the value of partial_variables into cls(...) via **kwargs, rather than passing it to PromptTemplate.from_template. As a result, those *partial_variables* were lost and became required *input_variables*. Co-authored-by: Josep Pon Farreny <josep.pon-farreny@siemens.com> Co-authored-by: Bagatur <baskaryan@gmail.com>

…-ai#13626) **Description:** Currently, if we pass a ToolMessage back to the chain, it crashes with the error `Got unsupported message type: `. This fixes it. Tested locally. Co-authored-by: Bagatur <baskaryan@gmail.com>

…3652) - **Description:** This commit fixes a problem where the Redis vector store changed a metadata value of 0 to empty when saving the document, which is unintended behavior. - **Issue:** N/A - **Dependencies:** N/A


vercel · 2023-11-27T11:38:47Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
langchain	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Nov 27, 2023 5:08pm

pprados · 2023-12-01T07:54:36Z

@hwchase17
I think you're the author of ParentDocumentRetriever.
This PR proposes an improvement on this approach.

pprados and others added 30 commits November 23, 2023 11:46

Adds an in-memory implementation of RecordStore

abe534f

Fix spell

a9e342a

Update to last langchain version

b069a09

Add error rate (langchain-ai#13568)

ed94157

To the in-memory outputs. Separate it out from the outputs so it's present in the dataframe.describe() results

bump 0.0.339rc0 (langchain-ai#13664)

fc41231

fix templates dockerfile (langchain-ai#13672)

df326da

- **Description:** We need to update the Dockerfile for templates to also copy your README.md. This is because poetry requires that a readme exists if it is specified in the pyproject.toml

update langserve to v0.0.30 (langchain-ai#13673)

66f0b35

Upgrade langserve template version to 0.0.30 to include new improvements

CLI 0.0.19 (langchain-ai#13677)

942d525

Update name (langchain-ai#13676)

aacea1a

BUG: more core fixes (langchain-ai#13665)

714698e

Fix some circular deps:
- move PromptValue into top level module because both PromptTemplates and OutputParsers import it
- move tracer context vars to `tracers.context` and import them in functions in `callbacks.manager`
- add core import tests

BUG: Add core utils imports (langchain-ai#13688)

408af0f

add callback import test (langchain-ai#13689)

e6a21fb

IMPROVEMENT: bump core dep 0.0.3 (langchain-ai#13690)

8cfe906

DOCS: fixed import error for BashOutputParser (langchain-ai#13680)

f5c4f60

DOCS: remove openai api key from cookbook (langchain-ai#13633)

0985486

BUGFIX: Update bedrock.py to fix provider bug (langchain-ai#13646)

9d63ebc

Provider check was incorrectly failing for anything other than "meta"

BUGFIX: anthropic models on bedrock (langchain-ai#13629)

9815a2f

Introduced in langchain-ai#13403

INFRA: Lint for imports (langchain-ai#13632)

ce88619

- Adds pydantic/import linting to core - Adds a check for `langchain_experimental` imports to langchain

IMPROVEMENT: VoyageEmbeddings embed_general_texts (langchain-ai#13620)

1a66810

- **Description:** add method embed_general_texts in VoyageEmbeddings to support input_type - **Issue:** - **Dependencies:** - **Tag maintainer:** - **Twitter handle:** @Voyage_AI_

BUGFIX: llm backwards compat imports (langchain-ai#13698)

cc841dc

IMPROVEMENT: Conditionally import core type hints (langchain-ai#13700)

3c2ec49

TEMPLATES Metadata (langchain-ai#13691)

e93f0a7

Co-authored-by: Lance Martin <lance@langchain.dev>

BUGFIX: add prompt imports for backwards compat (langchain-ai#13702)

fd7c052

Fix locking (langchain-ai#13725)

ffc4867

pprados added 7 commits November 27, 2023 10:25

Update to last langchain version

47280a6

Add *LCEL* Transformers

fc7732c

Update to last langchain version

cb43ecf

Merge branch 'pprados/memory_recordmanager' into pprados/rag_vectorstore

9d7fdd0

Merge branch 'pprados/sql_docstore' into pprados/rag_vectorstore

bb29c16

Merge branch 'pprados/lcel_transformer' into pprados/rag_vectorstore

c5519b3

Add RagVectorStore

f33a206

vercel bot deployed to Preview November 27, 2023 11:50 View deployment

Fixe spell

56eff69

pprados force-pushed the pprados/rag_vectorstore branch from 82acf08 to 56eff69 Compare November 27, 2023 12:12

vercel bot deployed to Preview November 27, 2023 12:31 View deployment

vercel bot deployed to Preview November 27, 2023 15:09 View deployment

pprados force-pushed the pprados/rag_vectorstore branch from c172423 to 375b8b5 Compare November 27, 2023 16:13

vercel bot deployed to Preview November 27, 2023 16:26 View deployment

Fix demo

70880a9

pprados force-pushed the pprados/rag_vectorstore branch from 375b8b5 to 70880a9 Compare November 27, 2023 16:56

vercel bot deployed to Preview November 27, 2023 17:08 View deployment

pprados marked this pull request as ready for review November 27, 2023 17:12

dosubot bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files), Ɑ: vector store (Related to vector store module), and 🤖:enhancement (A large net-new component, integration, or chain) labels Nov 27, 2023

hwchase17 closed this Jan 30, 2024

baskaryan reopened this Jan 30, 2024

pprados mentioned this pull request Feb 1, 2024

Add VectorStore wrapper to help the integration with SelfQuery #16454

Closed
