
Easy RAG #686

Merged
merged 22 commits into main from easy_rag_poc on Mar 21, 2024

Conversation

@langchain4j (Owner) commented Mar 1, 2024

Easy RAG Example

Implementing RAG applications is hard, especially for those who are just getting started exploring LLMs and RAG.

This PR introduces an "Easy RAG" feature that should help developers get started with RAG as easily as possible.

With it, there is no need to learn about chunking/splitting/segmentation, embeddings, embedding models, vector databases, retrieval techniques and other RAG-related concepts.

This is similar to how one can simply upload one or more files to the OpenAI Assistants API, and the LLM will automagically know about their contents when answering questions.

Easy RAG uses a local embedding model running on your CPU (GPU support can be added later).
Your files are ingested into an in-memory embedding store.

Please note that "Easy RAG" is not meant to replace manual RAG setups, and especially not advanced RAG techniques; it provides an easier way to get started with RAG.
The quality of "Easy RAG" should be sufficient for demos, proofs of concept, and getting started.

To use "Easy RAG", simply import langchain4j-easy-rag dependency that includes everything needed to do RAG:

  • Apache Tika document loader (to parse all document types automatically)
  • A quantized BAAI/bge-small-en-v1.5 in-process embedding model, which has an impressive (for its size) 51.68 retrieval score
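
For Maven users this is a single dependency; a minimal sketch (the version is a placeholder, use the latest release):

<dependency>
    <groupId>dev.langchain4j</groupId>
    <artifactId>langchain4j-easy-rag</artifactId>
    <version>${langchain4j.version}</version> <!-- placeholder: use the latest release -->
</dependency>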

Here is the proposed API:

List<Document> documents = FileSystemDocumentLoader.loadDocuments(directoryPath); // one can also load documents recursively and filter with glob/regex

EmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>(); // we will use an in-memory embedding store for simplicity

EmbeddingStoreIngestor.ingest(documents, embeddingStore);

Assistant assistant = AiServices.builder(Assistant.class)
                .chatLanguageModel(model)
                .contentRetriever(EmbeddingStoreContentRetriever.from(embeddingStore))
                .build();

String answer = assistant.chat("Who is Charlie?"); // Charlie is a carrot...
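
For completeness, the Assistant interface and model in the snippet above are assumed rather than defined; a minimal sketch (OpenAiChatModel is just one possible ChatLanguageModel):

interface Assistant {

    String chat(String userMessage);
}

ChatLanguageModel model = OpenAiChatModel.withApiKey(System.getenv("OPENAI_API_KEY"));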

FileSystemDocumentLoader in the above code loads documents using a DocumentParser available on the classpath via SPI, in this case the ApacheTikaDocumentParser imported with the langchain4j-easy-rag dependency.
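
Loading can also be recursive and filtered (per the comment in the snippet above); a sketch, assuming the recursive/filtering overloads and java.nio glob matching:

// load only PDF files from the directory
PathMatcher pdfOnly = FileSystems.getDefault().getPathMatcher("glob:*.pdf");
List<Document> pdfs = FileSystemDocumentLoader.loadDocuments(directoryPath, pdfOnly);

// load all supported documents from the whole directory tree
List<Document> all = FileSystemDocumentLoader.loadDocumentsRecursively(directoryPath);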

The EmbeddingStoreIngestor in the above code:

  • splits documents into smaller text segments using a DocumentSplitter loaded via SPI from the langchain4j-easy-rag dependency. Currently it uses DocumentSplitters.recursive(300, 30, new HuggingFaceTokenizer())
  • embeds text segments using an AllMiniLmL6V2QuantizedEmbeddingModel loaded via SPI from the langchain4j-easy-rag dependency
  • stores text segments and their embeddings into the specified embedding store

When using InMemoryEmbeddingStore, one can serialize/persist it into a JSON string or into a file.
This way one can skip loading documents and embedding them on each application run.
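
A sketch of that round trip, assuming InMemoryEmbeddingStore's file serialization methods:

InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
EmbeddingStoreIngestor.ingest(documents, embeddingStore);
embeddingStore.serializeToFile(Paths.get("embedding-store.json")); // first run: persist after ingestion

// subsequent runs: restore the store and skip loading/embedding entirely
InMemoryEmbeddingStore<TextSegment> restored =
                InMemoryEmbeddingStore.fromFile(Paths.get("embedding-store.json"));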

It is easy to customize the ingestion in the above code: just change

EmbeddingStoreIngestor.ingest(documents, embeddingStore);

into

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                //.documentTransformer(...) // you can optionally transform (clean, enrich, etc) documents before splitting
                //.documentSplitter(...) // you can optionally specify another splitter
                //.textSegmentTransformer(...) // you can optionally transform (clean, enrich, etc) segments before embedding
                //.embeddingModel(...) // you can optionally specify another embedding model to use for embedding
                .embeddingStore(embeddingStore)
                .build();

ingestor.ingest(documents);
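
For instance, to make the splitter explicit while keeping the other defaults (reusing the splitter configuration mentioned above):

EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
                // 300-token segments with a 30-token overlap, counted with a HuggingFace tokenizer
                .documentSplitter(DocumentSplitters.recursive(300, 30, new HuggingFaceTokenizer()))
                .embeddingStore(embeddingStore)
                .build();

ingestor.ingest(documents);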

Over time, we can add an auto-eval feature that will find the most suitable hyperparameters for the given documents (e.g. which embedding model to use, which splitting method, possibly advanced RAG techniques, etc.) so that "Easy RAG" can be comparable to "advanced RAG".

Related:
langchain4j/langchain4j-embeddings#16

@langchain4j changed the title from "Dirty POC: Easy RAG" to "POC: Easy RAG" on Mar 5, 2024
@geoand (Contributor) commented Mar 6, 2024

cc @jmartisk @cescoffier

@cescoffier commented

I like the idea. How do you compute the chunk size and the number of relevant documents to include? Should it be configurable with sensible defaults?

@jmartisk (Contributor) commented Mar 6, 2024

This is really cool and easy. +1 to configurable document size and count
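
On the "count" side, the number of retrieved segments is configurable on the content retriever; a hedged sketch using its builder (method names assumed, values illustrative):

ContentRetriever retriever = EmbeddingStoreContentRetriever.builder()
                .embeddingStore(embeddingStore)
                .maxResults(5)  // how many relevant segments to include in the prompt
                .minScore(0.6)  // drop weakly related segments
                .build();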

@dliubars (Collaborator) commented Mar 6, 2024

The segment size can be made configurable, but it will have a sensible static default (probably something like 300 tokens with a 30-token overlap).

Over time, we can implement an eval step to find the optimal size for the given document(s). Also, it does not have to be fixed; it can be variable within some range (min, max) to minimize information loss during embedding and maximize recall during retrieval.

I was working on the side on a hybrid splitting algorithm that takes into account both structural signals (headers, newlines, paragraphs, lists, etc.) and semantic signals (using embeddings) to find the best splitting point. One can define a min/max segment size, and it will split the given text into segments within that range while minimizing information loss, by looking at the dispersion of individual sentences in the potential segment (their deviation from the overall segment embedding, the centroid). Maybe this can be used here.
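
A rough sketch of that dispersion test (every name here is hypothetical; none of this is a langchain4j API): grow a candidate segment sentence by sentence and close it once adding a sentence pushes the average deviation from the segment centroid past a threshold.

// Hypothetical sketch: build segments whose sentence embeddings stay close to their centroid.
static List<List<String>> splitBySemanticDispersion(List<String> sentences,
                                                    Function<String, float[]> embed,
                                                    int minSize, int maxSize,
                                                    double maxDispersion) {
    List<List<String>> segments = new ArrayList<>();
    List<String> current = new ArrayList<>();
    for (String sentence : sentences) {
        List<String> candidate = new ArrayList<>(current);
        candidate.add(sentence);
        boolean tooDisperse = candidate.size() >= minSize
                && dispersion(candidate, embed) > maxDispersion;
        if (candidate.size() > maxSize || tooDisperse) {
            segments.add(current); // close the segment before the outlier sentence
            current = new ArrayList<>(List.of(sentence));
        } else {
            current = candidate;
        }
    }
    if (!current.isEmpty()) segments.add(current);
    return segments;
}

// Mean cosine distance of each sentence embedding from the segment centroid.
// Embeddings are recomputed here for brevity; a real implementation would cache them.
static double dispersion(List<String> segment, Function<String, float[]> embed) {
    List<float[]> vectors = segment.stream().map(embed).toList();
    float[] centroid = new float[vectors.get(0).length];
    for (float[] v : vectors) {
        for (int i = 0; i < centroid.length; i++) {
            centroid[i] += v[i] / vectors.size();
        }
    }
    return vectors.stream().mapToDouble(v -> 1.0 - cosine(v, centroid)).average().orElse(0.0);
}

static double cosine(float[] a, float[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-9);
}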

@langchain4j (Owner, Author) commented

BTW, all credit for the idea goes to @dandreadis

@maxandersen commented

+1; maybe I'm missing something, but isn't this already feasible today if we just enabled sensible defaults on the existing API (or maybe this is exactly that)?

To be clear: 100% for enabling this kind of easy setup, especially if it is built on the existing APIs, so we know users can "escape" and adjust when/where needed.

@langchain4j (Owner, Author) commented

@maxandersen The Quarkus extension could probably enable sensible defaults (if it doesn't yet), but for the non-Quarkus folks there should be a way to do RAG hassle-free, without needing to manually set up document loaders, splitters, an embedding model, an embedding store, and the ingestion orchestration; hence the proposed solution. Later it will be expanded to support more advanced RAG strategies and do auto-evals. So IMHO this deserves its own class/name/abstraction. The name also stresses that this is an easy/demo/PoC feature.

@coderabbitai (bot) left a comment

Review Status

Actionable comments generated: 5

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 3aafa79 and 4458ec8.
Files ignored due to path filters (11)
  • document-parsers/langchain4j-document-parser-apache-tika/pom.xml is excluded by: !**/*.xml
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.doc is excluded by: !**/*.doc
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.docx is excluded by: !**/*.docx
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.pdf is excluded by: !**/*.pdf
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.ppt is excluded by: !**/*.ppt
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.pptx is excluded by: !**/*.pptx
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.xls is excluded by: !**/*.xls
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/resources/test-file.xlsx is excluded by: !**/*.xlsx
  • langchain4j-bom/pom.xml is excluded by: !**/*.xml
  • langchain4j-easy-rag/pom.xml is excluded by: !**/*.xml
  • pom.xml is excluded by: !**/*.xml
Files selected for processing (7)
  • document-parsers/langchain4j-document-parser-apache-tika/src/main/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java (1 hunks)
  • document-parsers/langchain4j-document-parser-apache-tika/src/test/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java (1 hunks)
  • langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/EasyRAG.java (1 hunks)
  • langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/IngestionConfig.java (1 hunks)
  • langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/RetrievalConfig.java (1 hunks)
  • langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java (1 hunks)
  • langchain4j-easy-rag/src/test/resources/story-about-happy-carrot.txt (1 hunks)
Check Runs (5)
java_build (21) completed (2)
java_build (17) completed (2)
java_build (11) completed (1)
java_build (8) completed (2)
compliance completed (1)
Additional comments: 2
document-parsers/langchain4j-document-parser-apache-tika/src/test/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java (1)
  • 13-49: The tests in ApacheTikaDocumentParserTest cover a good range of document formats, ensuring the parser's functionality across different file types. It's a good practice to also include negative test cases or edge cases, such as empty files or files with unsupported formats, to ensure the parser behaves as expected in these scenarios.

Consider adding negative test cases or edge cases to further validate the parser's robustness.

langchain4j-easy-rag/src/test/resources/story-about-happy-carrot.txt (1)
  • 1-28: The narrative text story-about-happy-carrot.txt is well-written and serves as a good test resource. However, there are a few minor typographical issues, such as possible spelling mistakes with "VeggieVille" and a missing comma near the beginning. While these may be intentional for the story's context, it's important to ensure that test resources are as error-free as possible to avoid confusion.

Consider reviewing the spelling of "VeggieVille" and adding any missing commas for clarity.

@coderabbitai (bot) left a comment

Review Status

Actionable comments generated: 0

Configuration used: CodeRabbit UI

Commits reviewed: files that changed from the base of the PR, between 4458ec8 and 3b897b7.
Files selected for processing (2)
  • langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java (1 hunks)
  • langchain4j/src/test/java/dev/langchain4j/service/AiServicesWithRagIT.java (2 hunks)
Check Runs (5)
java_build (21) completed (2)
java_build (17) completed (2)
java_build (11) completed (2)
java_build (8) completed (1)
compliance completed (1)
Files skipped from review as they are similar to previous changes (1)
  • langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java
Additional comments: 8
langchain4j/src/test/java/dev/langchain4j/service/AiServicesWithRagIT.java (8)
  • 32-37: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-1]

The package declaration and imports are correctly organized, ensuring that all necessary classes and methods are available for the tests. No issues found here.

  • 32-37: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [14-14]

The @BeforeEach method correctly sets up the embedding store with test data before each test. This is a good practice for ensuring test isolation and repeatability.

  • 32-37: > 📝 NOTE

This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [16-16]

The use of interfaces to define test-specific behavior, such as Assistant and MultiUserAssistant, enhances modularity and readability. This approach allows for clear separation of test scenarios.

  • 36-36: The parameterized tests effectively cover various scenarios, leveraging different components of the RAG functionality. This ensures comprehensive testing of the feature.
  • 36-36: The use of @MethodSource("models") for providing test models is a good practice, allowing for easy extension of test cases with additional models in the future.
  • 36-36: The tests make good use of mocking and spying to isolate the system under test and verify interactions with dependencies. This is crucial for unit and integration testing.
  • 36-36: The ingest method demonstrates a clear and concise way to prepare test data, contributing to the maintainability of the test suite.
  • 36-36: The tests are well-structured and easy to understand, following best practices for readability and maintainability.

Repository owner deleted a comment from coderabbitai bot Mar 18, 2024
@geoand (Contributor) commented Mar 19, 2024

I like it a lot!

@langchain4j merged commit 2f425da into main on Mar 21, 2024
6 checks passed
@langchain4j deleted the easy_rag_poc branch on Mar 21, 2024 at 16:37
@langchain4j changed the title from "POC: Easy RAG" to "Easy RAG" on Mar 23, 2024