Multi-pdf Capabilities #13

Merged
kartikm7 merged 2 commits into kartikm7:master from dandonarahul2002:master
Nov 22, 2024
@dandonarahul2002
Contributor

Enhanced Multi-PDF RAG Capabilities and Optimized Reranking

Overview

This pull request significantly improves our RAG (Retrieval-Augmented Generation) system by extending single-PDF capabilities to support multiple PDFs and implementing an optimized reranking algorithm.

Key Changes

1. Multi-PDF RAG Support

  • Modified rag-utils.ts to handle multiple PDF documents simultaneously
  • Enhanced similarity search to work across multiple vector databases
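
One way to picture search across several vector databases is sketched below. It is not the code from rag-utils.ts: `PdfChunk`, `VectorStoreLike`, `mergeStoreResults`, and `searchAllPdfs` are hypothetical names, and the sketch assumes one store per PDF, each exposing a `similaritySearch(query, k)` method. Every store is queried in parallel, and the hits are merged into one candidate list that records which PDF each chunk came from.

```typescript
// Hypothetical minimal shapes (assumptions, not the project's real types).
interface PdfChunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}

interface VectorStoreLike {
  similaritySearch(query: string, k: number): Promise<PdfChunk[]>;
}

// Pure step: flatten per-PDF results into one candidate list,
// tagging each chunk with the name of the PDF it came from.
function mergeStoreResults(resultsByPdf: Map<string, PdfChunk[]>): PdfChunk[] {
  const merged: PdfChunk[] = [];
  resultsByPdf.forEach((chunks, pdfName) => {
    for (const chunk of chunks) {
      merged.push({ ...chunk, metadata: { ...chunk.metadata, pdfName } });
    }
  });
  return merged;
}

// Async step: query every store in parallel, then merge the results.
async function searchAllPdfs(
  stores: Map<string, VectorStoreLike>,
  query: string,
  kPerStore = 4, // assumed default, chosen only for illustration
): Promise<PdfChunk[]> {
  const pairs = await Promise.all(
    Array.from(stores.entries()).map(
      async ([name, store]) =>
        [name, await store.similaritySearch(query, kPerStore)] as [string, PdfChunk[]],
    ),
  );
  return mergeStoreResults(new Map(pairs));
}
```

The merged list is unranked across PDFs, so a reranking pass over the combined candidates is the natural next step.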

2. Optimized Reranking Algorithm

Implemented a new bm25Rerank function with the following optimizations:

  • Preprocessed query terms to filter out single-character words
  • Precomputed IDF scores for improved efficiency
  • Utilized a single regex for term matching, reducing string operations
  • Implemented more efficient term frequency counting using a Map
  • Improved BM25 score calculation for better result ranking
  • Reset the RecursiveCharacterTextSplitter parameters to their default values, which gave better results in manual testing (see section 4)
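
The optimizations listed above can be sketched roughly as follows. This is an illustrative reconstruction, not the merged `bm25Rerank`: the names `TextChunk`, `ScoredChunk`, and `bm25RerankSketch`, the parameter values `k1 = 1.5` and `b = 0.75`, the tokenization on `\W+`, and the particular IDF variant are all assumptions.

```typescript
// Hypothetical stand-ins for the project's document types.
interface TextChunk {
  pageContent: string;
  metadata: Record<string, unknown>;
}
interface ScoredChunk extends TextChunk {
  score: number;
}

function bm25RerankSketch(
  docs: TextChunk[],
  query: string,
  k1 = 1.5, // assumed term-frequency saturation parameter
  b = 0.75, // assumed length-normalization parameter
): ScoredChunk[] {
  // Preprocess the query: lowercase, split, drop single-character terms, dedupe.
  const terms = Array.from(
    new Set(query.toLowerCase().split(/\W+/).filter((t) => t.length > 1)),
  );
  if (docs.length === 0 || terms.length === 0) {
    return docs.map((d) => ({ ...d, score: 0 }));
  }

  // One regex for all terms. After the \W+ split, terms contain only
  // word characters, so no escaping is needed.
  const termRegex = new RegExp(`\\b(?:${terms.join("|")})\\b`, "g");

  // Term frequencies per document, counted in a Map, plus document length.
  const stats = docs.map((d) => {
    const text = d.pageContent.toLowerCase();
    const tf = new Map<string, number>();
    for (const hit of text.match(termRegex) || []) {
      tf.set(hit, (tf.get(hit) || 0) + 1);
    }
    const length = text.split(/\W+/).filter((t) => t.length > 0).length;
    return { tf, length };
  });
  const avgLength = stats.reduce((sum, s) => sum + s.length, 0) / docs.length || 1;

  // Precompute IDF once per query term (non-negative variant).
  const idf = new Map<string, number>();
  for (const term of terms) {
    const df = stats.filter((s) => s.tf.has(term)).length;
    idf.set(term, Math.log(1 + (docs.length - df + 0.5) / (df + 0.5)));
  }

  // BM25 score per document, then sort from highest to lowest.
  const scored = docs.map((d, i) => {
    const { tf, length } = stats[i];
    let score = 0;
    tf.forEach((freq, term) => {
      const norm = freq + k1 * (1 - b + (b * length) / avgLength);
      score += (idf.get(term) || 0) * ((freq * (k1 + 1)) / norm);
    });
    return { ...d, score };
  });
  return scored.sort((x, y) => y.score - x.score);
}
```

Splitting on `\W+` treats non-ASCII letters as separators, so a real implementation for multilingual PDFs would likely need a different tokenizer.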

3. Type Safety Improvements

  • Added a new ScoredDocument interface extending Document to include a score property
  • Updated similaritySearch function to use the new bm25Rerank function, returning ScoredDocument[]
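
A minimal sketch of what such a type could look like is shown below. `BaseDocument` is a local stand-in for the `Document` type the PR extends (assumed here to carry `pageContent` and `metadata`), and `topK` is a hypothetical helper added only to show what the typed `score` property makes possible downstream.

```typescript
// Stand-in for the Document type the PR extends (assumed shape).
interface BaseDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// A document plus its relevance score, as described in the PR.
interface ScoredDocument extends BaseDocument {
  score: number;
}

// Hypothetical helper: return the k best-scoring documents
// without mutating the input array.
function topK(docs: ScoredDocument[], k: number): ScoredDocument[] {
  return docs
    .slice()
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

Because the score is part of the type, callers can sort, filter, or threshold results without casting or consulting a separate score array.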

4. Text Splitting Adjustment

  • Reset parameters of RecursiveCharacterTextSplitter to default values based on improved results from manual testing

Performance Impact

These changes are expected to significantly improve the accuracy of our RAG system, particularly for queries involving multiple PDFs or large document sets.

Next Steps

  • Potentially improve reranking using cross-encoders (I have not yet found a way to run SBERT models in JS via ONNX)
  • Explore potential for further optimizations in vector search and embedding processes

Please review these changes, paying particular attention to the reranking algorithm and multi-PDF handling logic.

@kartikm7 kartikm7 merged commit bc87ab8 into kartikm7:master Nov 22, 2024
@kartikm7
Owner

Thank you so much!

@dandonarahul2002
Contributor Author

dandonarahul2002 commented Nov 23, 2024 via email