
Easy RAG #686

Merged
merged 22 commits into main from easy_rag_poc on Mar 21, 2024

Conversation

@langchain4j (Owner) commented Mar 1, 2024

Easy RAG Example

Implementing RAG applications is hard, especially for those who are just getting started exploring LLMs and RAG.

This PR introduces an "Easy RAG" feature that should help developers get started with RAG as easily as possible.

With it, there is no need to learn about chunking/splitting/segmentation, embeddings, embedding models, vector databases, retrieval techniques and other RAG-related concepts.

This is similar to how one can simply upload one or more files to the OpenAI Assistants API, and the LLM will automagically know about their contents when answering questions.

Easy RAG uses a local embedding model running on your CPU (GPU support can be added later).
Your files are ingested into an in-memory embedding store.

Please note that "Easy RAG" is not meant to replace manual RAG setups, and especially not advanced RAG techniques; it provides an easier way to get started with RAG.
The quality of "Easy RAG" should be sufficient for demos, proofs of concept, and getting started.

To use "Easy RAG", simply import langchain4j-easy-rag dependency that includes everything needed to do RAG:

  • Apache Tika document loader (to parse all document types automatically)
  • A quantized BAAI/bge-small-en-v1.5 in-process embedding model, which has an impressive (for its size) 51.68 retrieval score
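
For Maven users this is a single dependency; a minimal sketch (the version is a placeholder, use the latest release):

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-easy-rag</artifactId>
    <version>${langchain4j.version}</version> <!-- placeholder: use the latest release -->
</dependency>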

Here is the proposed API:

List<Document> documents = FileSystemDocumentLoader.loadDocuments(directoryPath); // one can also load documents recursively and filter with glob/regex

EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>(); // we will use an in-memory embedding store for simplicity

EmbeddingStoreIngestor.ingest(documents, embeddingStore);

Assistant assistant = AiServices.builder(Assistant.class)
                .chatLanguageModel(model)
                .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
                .build();

String answer = assistant.chat("Who is Charlie?"); // Charlie is a carrot...
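
For completeness, the Assistant interface and model in the snippet above are assumed rather than defined; a minimal sketch (OpenAiChatModel is just one possible ChatLanguageModel):

interface Assistant {

    String chat(String userMessage);
}

ChatLanguageModel model = OpenAiChatModel.withApiKey(System.getenv("OPENAI_API_KEY"));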

FileSystemDocumentLoader in the above code loads documents using a DocumentParser available on the classpath via SPI, in this case the ApacheTikaDocumentParser imported with the langchain4j-easy-rag dependency.
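
Loading can also be recursive and filtered (per the comment in the snippet above); a sketch, assuming the recursive/filtering overloads and java.nio glob matching:

// load only PDF files from the directory
PathMatcher pdfOnly = FileSystems.getDefault().getPathMatcher("glob:*.pdf");
List<Document> pdfs = FileSystemDocumentLoader.loadDocuments(directoryPath, pdfOnly);

// load all supported documents from the whole directory tree
List<Document> all = FileSystemDocumentLoader.loadDocumentsRecursively(directoryPath);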

The EmbeddingStoreIngestor in the above code:

  • splits documents into smaller text segments using a DocumentSplitter loaded via SPI from the langchain4j-easy-rag dependency. Currently it uses DocumentSplitters.recursive(300, 30, new HuggingFaceTokenizer())
  • embeds text segments using an AllMiniLmL6V2QuantizedEmbeddingModel loaded via SPI from the langchain4j-easy-rag dependency
  • stores text segments and their embeddings into the specified embedding store

When using InMemoryEmbeddingStore, one can serialize/persist it into a JSON string or into a file.
This way one can skip loading documents and embedding them on each application run.
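
A sketch of that round trip, assuming InMemoryEmbeddingStore's file serialization methods:

InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
embeddingStore.serializeToFile(Paths.get("embedding-store.json")); // first run: persist after ingestion

// subsequent runs: restore the store and skip loading/embedding entirely
InMemoryEmbeddingStore<TextSegment> restored =
                InMemoryEmbeddingStore.fromFile(Paths.get("embedding-store.json"));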

It is easy to customize the ingestion in the above code: just change

EmbeddingStoreIngestor.ingest(documents, embeddingStore);

into

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                //.documentTransformer(...) // you can optionally transform (clean, enrich, etc) documents before splitting
                //.documentSplitter(...) // you can optionally specify another splitter
                //.textSegmentTransformer(...) // you can optionally transform (clean, enrich, etc) segments before embedding
                //.embeddingModel(...) // you can optionally specify another embedding model to use for embedding
                .embeddingStore(embeddingStore)
                .build();

ingestor.ingest(documents);
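
For instance, to make the splitter explicit while keeping the other defaults (reusing the splitter configuration mentioned above):

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                // 300-token segments with a 30-token overlap, counted with a HuggingFace tokenizer
                .documentSplitter(DocumentSplitters.recursive(300, 30, new HuggingFaceTokenizer()))
                .embeddingStore(embeddingStore)
                .build();

ingestor.ingest(documents);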

Over time, we can add an auto-eval feature that will find the most suitable hyperparameters for the given documents (e.g. which embedding model to use, which splitting method, possibly advanced RAG techniques, etc.) so that "Easy RAG" can be comparable to "advanced RAG".

Related:
langchain4j/langchain4j-embeddings#16

@langchain4j changed the title from "Dirty POC: Easy RAG" to "POC: Easy RAG" on Mar 5, 2024
@geoand (Contributor) commented Mar 6, 2024

cc @jmartisk @cescoffier

@cescoffier commented

I like the idea. How do you compute the chunk size and the number of relevant documents to include? Should it be configurable with sensible defaults?

@jmartisk (Contributor) commented Mar 6, 2024

This is really cool and easy. +1 to configurable document size and count
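
On the "count" side, the number of retrieved segments is configurable on the content retriever; a hedged sketch using its builder (method names assumed, values illustrative):

ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
                .embeddingStore(embeddingStore)
                .maxResults(5)  // how many relevant segments to include in the prompt
                .minScore(0.6)  // drop weakly related segments
                .build();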

@dliubars (Collaborator) commented Mar 6, 2024

The segment size can be made configurable, but it will have a sensible static default (probably something like 300 tokens with a 30-token overlap).

Over time, we can implement an eval step to find the optimal size for the given document(s). Also, it does not have to be fixed; it can be variable within some range (min, max) to minimize information loss during embedding and maximize recall during retrieval.

I was working on the side on a hybrid splitting algorithm that takes into account both structural signals (headers, newlines, paragraphs, lists, etc.) and semantic signals (using embeddings) to find the best splitting point. One can define a min/max segment size, and it will split the given text into segments within that range while minimizing information loss, by looking at the dispersion of individual sentences in the potential segment (their deviation from the overall segment embedding, the centroid). Maybe this can be used here.
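
A rough sketch of that dispersion test (every name here is hypothetical; none of this is a langchain4j API): grow a candidate segment sentence by sentence and close it once adding a sentence pushes the average deviation from the segment centroid past a threshold.

// Hypothetical sketch: build segments whose sentence embeddings stay close to their centroid.
static List<List<String>> splitBySemanticDispersion(List<String> sentences,
                                                    Function<String, float[]> embed,
                                                    int minSize, int maxSize,
                                                    double maxDispersion) {
    List<List<String>> segments = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (String sentence : sentences) {
        List<String> candidate = new ArrayList<>(current);
        candidate.add(sentence);
        boolean tooDisperse = candidate.size() >= minSize
                && dispersion(candidate, embed) > maxDispersion;
        if (candidate.size() > maxSize || tooDisperse) {
            segments.add(current); // close the segment before the outlier sentence
            current = new ArrayList<>(List.of(sentence));
        } else {
            current = candidate;
        }
    }
    if (!current.isEmpty()) segments.add(current);
    return segments;
}

// Mean cosine distance of each sentence embedding from the segment centroid.
// Embeddings are recomputed here for brevity; a real implementation would cache them.
static double dispersion(List<String> segment, Function<String, float[]> embed) {
    List<float[]> vectors = segment.stream().map(embed).toList();
    float[] centroid = new float[vectors.get(0).length];
    for (float[] v : vectors) {
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] += v[i] / vectors.size();
        }
    }
    return vectors.stream().mapToDouble(v -> 1.0 - cosine(v, centroid)).average().orElse(0.0);
}

static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
}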

@langchain4j (Owner, Author) commented

BTW, all credit for the idea goes to @dandreadis

@maxandersen commented

+1; maybe I'm missing something, but isn't this already feasible today if we just enabled sensible defaults on the existing API (or maybe this is exactly that)?

To be clear: 100% for enabling this kind of easy setup, especially if it is built on the existing APIs, so we know users can "escape" and adjust when/where needed.

@langchain4j (Owner, Author) commented

@maxandersen The Quarkus extension could probably enable sensible defaults (if it doesn't yet), but for the non-Quarkus folks there should be a way to do RAG hassle-free, without needing to manually set up document loaders, splitters, an embedding model, an embedding store, and the ingestion orchestration; hence the proposed solution. Later it will be expanded to support more advanced RAG strategies and do auto-evals. So IMHO this deserves its own class/name/abstraction. The name also stresses that this is an easy/demo/PoC feature.

@coderabbitai (bot) left a comment

Review Status

Actionable comments generated: 5

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 3aafa79 and 4458ec8.
Files ignored due to path filters (11)
  • document-parsers/langchain4j-document-parser-apache-tika/pom.xml is excluded by: !**/*.xml
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.doc is excluded by: !**/*.doc
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.docx is excluded by: !**/*.docx
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.pdf is excluded by: !**/*.pdf
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.ppt is excluded by: !**/*.ppt
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.pptx is excluded by: !**/*.pptx
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.xls is excluded by: !**/*.xls
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.xlsx is excluded by: !**/*.xlsx
  • langchain4j-bom/pom.xml is excluded by: !**/*.xml
  • langchain4j-easy-rag/pom.xml is excluded by: !**/*.xml
  • pom.xml is excluded by: !**/*.xml
Files selected for processing (7)
  • document-parsers/langchain4j-document-parser-apache-tika/src/main/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java (1 hunks)
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java (1 hunks)
  • langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/EasyRAG.java (1 hunks)
  • langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/IngestionConfig.java (1 hunks)
  • langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/RetrievalConfig.java (1 hunks)
  • langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java (1 hunks)
  • langchain4j-easy-rag/src/test/resources/story-about-happy-carrot.txt (1 hunks)
Check Runs (5)
java_build (21) completed (2)
java_build (17) completed (2)
java_build (11) completed (1)
java_build (8) completed (2)
compliance completed (1)
Additional comments: 2
document-parsers/langchain4j-document-parser-apache-tika/src/test/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java (1)
  • 13-49: The tests in ApacheTikaDocumentParserTest cover a good range of document formats, ensuring the parser's functionality across different file types. It's a good practice to also include negative test cases or edge cases, such as empty files or files with unsupported formats, to ensure the parser behaves as expected in these scenarios.

Consider adding negative test cases or edge cases to further validate the parser's robustness.

langchain4j-easy-rag/src/test/resources/story-about-happy-carrot.txt (1)
  • 1-28: The narrative text story-about-happy-carrot.txt is well-written and serves as a good test resource. However, there are a few minor typographical issues, such as possible spelling mistakes with "VeggieVille" and a missing comma near the beginning. While these may be intentional for the story's context, it's important to ensure that test resources are as error-free as possible to avoid confusion.

Consider reviewing the spelling of "VeggieVille" and adding any missing commas for clarity.

@coderabbitai (bot) left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 4458ec8 and 3b897b7.
Files selected for processing (2)
  • langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java (1 hunks)
  • langchain4j/src/test/java/dev/langchain4j/service/AiServicesWithRagIT.java (2 hunks)
Check Runs (5)
java_build (21) completed (2)
java_build (17) completed (2)
java_build (11) completed (2)
java_build (8) completed (1)
compliance completed (1)
Files skipped from review as they are similar to previous changes (1)
  • langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java
Additional comments: 8
langchain4j/src/test/java/dev/langchain4j/service/AiServicesWithRagIT.java (8)
  • 32-37: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-1]

The package declaration and imports are correctly organized, ensuring that all necessary classes and methods are available for the tests. No issues found here.

  • 32-37: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [14-14]

The @BeforeEach method correctly sets up the embedding store with test data before each test. This is a good practice for ensuring test isolation and repeatability.

  • 32-37: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [16-16]

The use of interfaces to define test-specific behavior, such as Assistant and MultiUserAssistant, enhances modularity and readability. This approach allows for clear separation of test scenarios.

  • 36-36: The parameterized tests effectively cover various scenarios, leveraging different components of the RAG functionality. This ensures comprehensive testing of the feature.
  • 36-36: The use of @MethodSource("models") for providing test models is a good practice, allowing for easy extension of test cases with additional models in the future.
  • 36-36: The tests make good use of mocking and spying to isolate the system under test and verify interactions with dependencies. This is crucial for unit and integration testing.
  • 36-36: The ingest method demonstrates a clear and concise way to prepare test data, contributing to the maintainability of the test suite.
  • 36-36: The tests are well-structured and easy to understand, following best practices for readability and maintainability.

Repository owner deleted a comment from coderabbitai bot Mar 18, 2024
@geoand (Contributor) commented Mar 19, 2024

I like it a lot!

@langchain4j merged commit 2f425da into main on Mar 21, 2024
6 checks passed
@langchain4j deleted the easy_rag_poc branch on Mar 21, 2024 at 16:37
@langchain4j changed the title from "POC: Easy RAG" to "Easy RAG" on Mar 23, 2024