Easy RAG #686
Conversation
I like the idea. How do you compute the chunk size and the number of relevant documents to include? Should it be configurable with sensible defaults?
This is really cool and easy. +1 to configurable document size and count
Size of the segment can be configurable, but it will have a sensible static default (probably something like 300 tokens with a 30-token overlap). Over time, we can implement an eval step to find the optimal size for the given document(s). Also, it does not have to be fixed; it can be variable within some range (min, max) to minimize information loss during embedding and maximize recall during retrieval. I was working on the side to implement a hybrid splitting algorithm that takes into account both structural (headers, newlines, paragraphs, lists, etc.) and semantic (using embeddings) signals to find the best splitting point. One can define a min/max segment size and it will split the given text into segments within that range while minimizing information loss, by looking at the dispersion of individual sentences in the potential segment (their deviation from the overall segment embedding (centroid)). Maybe this can be used here.
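The dispersion criterion described above can be sketched as follows. This is not the actual implementation from the PR, just a minimal self-contained illustration assuming dispersion is measured as the mean cosine distance of each sentence embedding from the segment centroid:

```java
import java.util.List;

public class SegmentDispersion {

    // Mean cosine distance of each sentence embedding from the segment centroid.
    // Lower dispersion suggests a more semantically coherent segment, so a splitter
    // could prefer the candidate segment (within the min/max size range) that minimizes it.
    static double dispersion(List<double[]> sentenceEmbeddings) {
        int dim = sentenceEmbeddings.get(0).length;
        double[] centroid = new double[dim];
        for (double[] e : sentenceEmbeddings) {
            for (int i = 0; i < dim; i++) centroid[i] += e[i];
        }
        for (int i = 0; i < dim; i++) centroid[i] /= sentenceEmbeddings.size();

        double total = 0;
        for (double[] e : sentenceEmbeddings) {
            total += 1.0 - cosine(e, centroid);
        }
        return total / sentenceEmbeddings.size();
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Identical embeddings -> zero dispersion; orthogonal ones -> higher dispersion.
        double coherent = dispersion(List.of(new double[]{1, 0}, new double[]{1, 0}));
        double mixed = dispersion(List.of(new double[]{1, 0}, new double[]{0, 1}));
        System.out.println(coherent < mixed); // prints "true"
    }
}
```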
BTW, all credit for the idea goes to @dandreadis
+1; maybe I'm missing something, but isn't this already feasible today if we just enabled sensible defaults on the existing API (or maybe this is exactly that)? To be clear: 100% for enabling this kind of easy setup, especially if it is built on the existing APIs, so we know users can "escape" and adjust when/where needed.
@maxandersen The Quarkus extension could probably enable sensible defaults (if it doesn't yet), but for the non-Quarkus folks there should be a way to do RAG hassle-free, without a need to manually set up document loaders, splitters, the embedding model and the embedding store, and to orchestrate the ingestion; hence the proposed solution. Later it will be expanded to support more advanced RAG strategies and do auto-evals. So IMHO this deserves its own class/name/abstraction. Also, the name should stress that this is an easy/demo/PoC feature.
# Conflicts:
#   langchain4j-bom/pom.xml
Review Status
Actionable comments generated: 5
Configuration used: CodeRabbit UI
Files selected for processing (7)
- document-parsers/langchain4j-document-parser-apache-tika/src/main/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParser.java (1 hunks)
- document-parsers/langchain4j-document-parser-apache-tika/src/test/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java (1 hunks)
- langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/EasyRAG.java (1 hunks)
- langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/IngestionConfig.java (1 hunks)
- langchain4j-easy-rag/src/main/java/dev/langchain4j/rag/easy/RetrievalConfig.java (1 hunks)
- langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java (1 hunks)
- langchain4j-easy-rag/src/test/resources/story-about-happy-carrot.txt (1 hunks)
Additional comments: 2
document-parsers/langchain4j-document-parser-apache-tika/src/test/java/dev/langchain4j/data/document/parser/apache/tika/ApacheTikaDocumentParserTest.java (1)
- 13-49: The tests in `ApacheTikaDocumentParserTest` cover a good range of document formats, ensuring the parser's functionality across different file types. It's good practice to also include negative test cases or edge cases, such as empty files or files with unsupported formats, to ensure the parser behaves as expected in these scenarios. Consider adding negative or edge test cases to further validate the parser's robustness.
langchain4j-easy-rag/src/test/resources/story-about-happy-carrot.txt (1)
- 1-28: The narrative text `story-about-happy-carrot.txt` is well written and serves as a good test resource. However, there are a few minor typographical issues, such as possible spelling mistakes with "VeggieVille" and a missing comma near the beginning. While these may be intentional for the story's context, it's important to ensure that test resources are as error-free as possible to avoid confusion. Consider reviewing the spelling of "VeggieVille" and adding any missing commas for clarity.
Review Status
Actionable comments generated: 0
Configuration used: CodeRabbit UI
Files selected for processing (2)
- langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java (1 hunks)
- langchain4j/src/test/java/dev/langchain4j/service/AiServicesWithRagIT.java (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- langchain4j-easy-rag/src/test/java/dev/langchain4j/rag/easy/EasyRAGTest.java
Additional comments: 8
langchain4j/src/test/java/dev/langchain4j/service/AiServicesWithRagIT.java (8)
- 32-37: > 📝 NOTE: This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [1-1]
  The package declaration and imports are correctly organized, ensuring that all necessary classes and methods are available for the tests. No issues found here.
- 32-37: > 📝 NOTE: This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [14-14]
  The `@BeforeEach` method correctly sets up the embedding store with test data before each test. This is a good practice for ensuring test isolation and repeatability.
- 32-37: > 📝 NOTE: This review was outside the diff hunks, and no overlapping diff hunk was found. Original lines [16-16]
  The use of interfaces to define test-specific behavior, such as `Assistant` and `MultiUserAssistant`, enhances modularity and readability. This approach allows for clear separation of test scenarios.
- 36-36: The parameterized tests effectively cover various scenarios, leveraging different components of the RAG functionality. This ensures comprehensive testing of the feature.
- 36-36: The use of `@MethodSource("models")` for providing test models is a good practice, allowing for easy extension of test cases with additional models in the future.
- 36-36: The tests make good use of mocking and spying to isolate the system under test and verify interactions with dependencies. This is crucial for unit and integration testing.
- 36-36: The `ingest` method demonstrates a clear and concise way to prepare test data, contributing to the maintainability of the test suite.
- 36-36: The tests are well-structured and easy to understand, following best practices for readability and maintainability.
I like it a lot!
Easy RAG Example
Implementing RAG applications is hard, especially for those who are just getting started exploring LLMs and RAG.
This PR introduces an "Easy RAG" feature that should help developers get started with RAG as easily as possible.
With it, there is no need to learn about chunking/splitting/segmentation, embeddings, embedding models, vector databases, retrieval techniques and other RAG-related concepts.
This is similar to how one can simply upload one or multiple files into the OpenAI Assistants API and the LLM will automagically know about their contents when answering questions.
Easy RAG uses a local embedding model running on your CPU (GPU support can be added later).
Your files are ingested into an in-memory embedding store.
Please note that "Easy RAG" will not replace manual RAG setups, and especially not advanced RAG techniques, but it will provide an easier way to get started with RAG.
The quality of "Easy RAG" should be sufficient for demos, proofs of concept and for getting started.
To use "Easy RAG", simply import the `langchain4j-easy-rag` dependency, which includes everything needed to do RAG. Here is the proposed API:
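The snippet referenced above is not reproduced in this page. Based on the surrounding description, the proposed usage presumably looks roughly like the following sketch (the directory path and class name `EasyRagSketch` are illustrative; exact package names may differ by langchain4j version):

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.loader.FileSystemDocumentLoader;
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.EmbeddingStoreIngestor;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

import java.util.List;

public class EasyRagSketch {

    public static void main(String[] args) {
        // Load all documents from a directory; a DocumentParser (here Apache Tika)
        // is picked up from the classpath via SPI.
        List<Document> documents = FileSystemDocumentLoader.loadDocuments("/home/user/documents");

        // Split, embed and store the documents. The splitter and the embedding
        // model are likewise discovered via SPI from langchain4j-easy-rag.
        InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
        EmbeddingStoreIngestor.ingest(documents, embeddingStore);
    }
}
```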
`FileSystemDocumentLoader` in the above code loads documents using a `DocumentParser` available on the classpath via SPI, in this case the `ApacheTikaDocumentParser` imported with the `langchain4j-easy-rag` dependency.

The `EmbeddingStoreIngestor` in the above code:
- splits documents using a `DocumentSplitter` loaded via SPI from the `langchain4j-easy-rag` dependency. Currently it uses `DocumentSplitters.recursive(300, 30, new HuggingFaceTokenizer())`
- embeds segments using an `AllMiniLmL6V2QuantizedEmbeddingModel` loaded via SPI from the `langchain4j-easy-rag` dependency

When using an `InMemoryEmbeddingStore`, one can serialize/persist it into a JSON string or into a file. This way one can skip loading and embedding documents on each application run.
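For illustration, persisting and restoring the store might look like the following sketch (the file path is hypothetical; the serialization method names follow the `InMemoryEmbeddingStore` API as I understand it and may differ by version):

```java
import dev.langchain4j.data.segment.TextSegment;
import dev.langchain4j.store.embedding.inmemory.InMemoryEmbeddingStore;

public class PersistStoreSketch {

    public static void main(String[] args) {
        InMemoryEmbeddingStore<TextSegment> embeddingStore = new InMemoryEmbeddingStore<>();
        // ... ingest documents into the store here ...

        // Serialize so documents need not be re-embedded on every run.
        String json = embeddingStore.serializeToJson();
        embeddingStore.serializeToFile("/home/user/embedding-store.json");

        // On a later run, restore the store instead of re-ingesting:
        InMemoryEmbeddingStore<TextSegment> restored =
                InMemoryEmbeddingStore.fromFile("/home/user/embedding-store.json");
    }
}
```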
It is easy to customize the ingestion in the above code: just change the default one-liner into an explicitly configured `EmbeddingStoreIngestor`.
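Since the original before/after snippets are not shown on this page, here is a hedged sketch of what an explicitly configured ingestor might look like (the splitter parameters and model choice are illustrative, not the defaults):

```java
// embeddingStore and documents are assumed to exist, as in the ingestion above;
// 500/50 and the model choice are illustrative values, not the library defaults.
EmbeddingStoreIngestor ingestor = EmbeddingStoreIngestor.builder()
        .documentSplitter(DocumentSplitters.recursive(500, 50, new HuggingFaceTokenizer()))
        .embeddingModel(new AllMiniLmL6V2QuantizedEmbeddingModel())
        .embeddingStore(embeddingStore)
        .build();
ingestor.ingest(documents);
```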
Over time, we can add an auto-eval feature that will find the most suitable hyperparameters for the given documents (e.g. which embedding model to use, which splitting method, possibly advanced RAG techniques, etc.) so that "Easy RAG" can be comparable to "advanced RAG".
Related:
langchain4j/langchain4j-embeddings#16