notkshitijsingh/document-retrieval-system-for-llms

Document Retrieval System

Overview

This project implements a backend for a document retrieval system. It allows users to query documents based on text similarity using a REST API. Documents are stored in a database with their vector embeddings, and queries are processed to return the most relevant documents using cosine similarity.
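The ranking step described above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the function names `cosine_similarity` and `rank_documents` are hypothetical, the toy 2-D vectors stand in for real embeddings, and treating the threshold as inclusive (`>=`) is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_documents(query_vec, docs, top_k=5, threshold=0.5):
    """Score every (text, vector) pair, drop low scores, return the best top_k."""
    scored = [(text, cosine_similarity(query_vec, vec)) for text, vec in docs]
    kept = [pair for pair in scored if pair[1] >= threshold]  # assumed inclusive
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

# Toy 2-D vectors for illustration only; real embeddings have many more dimensions.
docs = [("doc A", [1.0, 0.0]), ("doc B", [0.0, 1.0]), ("doc C", [1.0, 1.0])]
results = rank_documents([1.0, 0.0], docs, top_k=2, threshold=0.5)
```

Here "doc B" is orthogonal to the query (score 0) and is filtered out, while "doc A" and "doc C" are returned in descending score order.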

Features

  • Document Storage and Retrieval: Stores documents with their encoded vector representations and retrieves the top relevant documents based on user queries.
  • API Rate Limiting: Limits each user to 5 API calls; further requests return a 429 status code.
  • Background News Scraper: A placeholder background task that periodically scrapes news articles.
  • Caching: Uses Redis to cache search results, improving the response time for repeated queries.
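One way the caching feature could work is sketched below. It is an assumption-laden illustration rather than the project's implementation: a plain `dict` stands in for Redis so the example runs without a server, and `cache_key`, `cached_search`, and `fake_search` are hypothetical names. With a real Redis client, the lookup and store would map to its GET and SET commands, usually with an expiry.

```python
import hashlib
import json

def cache_key(text, top_k, threshold):
    """Build a deterministic key from the query parameters."""
    payload = json.dumps(
        {"text": text, "top_k": top_k, "threshold": threshold}, sort_keys=True
    )
    return "search:" + hashlib.sha256(payload.encode("utf-8")).hexdigest()

def cached_search(store, search_fn, text, top_k=5, threshold=0.5):
    """Return cached results when present; otherwise compute and store them."""
    key = cache_key(text, top_k, threshold)
    hit = store.get(key)
    if hit is not None:
        return json.loads(hit)
    results = search_fn(text, top_k, threshold)
    store[key] = json.dumps(results)  # values are stored as JSON strings
    return results

calls = []

def fake_search(text, top_k, threshold):
    """Hypothetical stand-in for the real similarity search."""
    calls.append(text)
    return [["example document", 0.9]]

store = {}  # stand-in for Redis
first = cached_search(store, fake_search, "What is AI?")
second = cached_search(store, fake_search, "What is AI?")  # served from the cache
```

Including `top_k` and `threshold` in the key keeps a cached answer from being reused for a query with different parameters.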

Project Structure

document_retrieval_system/
│
├── app/
│   ├── __init__.py           # Initializes the Flask app
│   ├── api.py                # Main API routes for the system
│   ├── models.py             # Database models and logic
│   ├── scraper.py            # Background news scraper logic
│   └── utils.py              # Utility functions (e.g., for caching, encoding)
│
├── data/
│   └── documents.db          # SQLite database (generated by initialize_db.py)
│
├── initialize_db.py          # Script to initialize the database and add sample data
├── Dockerfile                # Dockerfile for building the app
├── README.md                 # Documentation with details on how to run the project
├── requirements.txt          # Python dependencies
└── run.py                    # Entry point for running the Flask app

Setup Instructions

1. Install Dependencies

First, make sure you have Python 3.9+ and Redis installed. If you have Docker, you can use it to start Redis.

  • Install the required Python packages:

    pip install -r requirements.txt
  • Run Redis in Docker (if you don’t have Redis installed locally):

    docker run -d -p 6379:6379 redis

2. Initialize the Database

The project uses a SQLite database to store documents and track user requests. Before running the app, you need to initialize the database and insert sample documents.

  • Run the initialize_db.py script:
    python initialize_db.py

This script will:

  • Create a documents table in the data/documents.db SQLite database.
  • Create a users table to track API requests for rate-limiting.
  • Insert three sample documents into the documents table with their encoded embeddings.

Sample documents inserted:

  1. "Artificial intelligence is transforming the world by enabling machines to learn from data."
  2. "Machine learning is a subset of AI that focuses on building algorithms that can learn and make predictions."
  3. "Data science combines domain knowledge, programming skills, and statistics to extract insights from data."
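The steps performed by the initialization script might look roughly like the following sketch. The column names, the JSON-encoded embedding column, and the `toy_encode` placeholder are all assumptions made for illustration; the real script's schema and encoder may differ. An in-memory database is used so the example leaves no file behind.

```python
import json
import sqlite3

SAMPLE_DOCUMENTS = [
    "Artificial intelligence is transforming the world by enabling machines to learn from data.",
    "Machine learning is a subset of AI that focuses on building algorithms that can learn and make predictions.",
    "Data science combines domain knowledge, programming skills, and statistics to extract insights from data.",
]

def toy_encode(text):
    """Placeholder, NOT a real embedding: two crude numeric features."""
    return [float(len(text)), float(text.count(" "))]

def initialize(conn, documents, encode):
    """Create both tables (assumed schema) and insert the encoded documents."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "user_id TEXT PRIMARY KEY, request_count INTEGER NOT NULL DEFAULT 0)"
    )
    rows = [(text, json.dumps(encode(text))) for text in documents]
    conn.executemany("INSERT INTO documents (text, embedding) VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # the project itself uses data/documents.db
initialize(conn, SAMPLE_DOCUMENTS, toy_encode)
doc_count = conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0]
```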

3. Run the Application

After initializing the database, start the Flask application:

python run.py

The server will be running on http://localhost:5000.


API Endpoints

1. Health Check: /health

  • Method: GET
  • Description: Checks if the API is active.
  • Response:
    {
      "status": "API is active"
    }

2. Document Search: /search

  • Method: POST
  • Description: Retrieves the top-k most relevant documents based on a query text.
  • Request Parameters:
    • user_id (string, required): A unique identifier for the user making the request.
    • text (string, required): The query text to search for relevant documents.
    • top_k (integer, optional): The number of top results to return (default: 5).
    • threshold (float, optional): The minimum similarity score required for a document to be included in the results (default: 0.5).
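The request parameters listed above could be validated along these lines. This is a hedged sketch: `parse_search_request` is a hypothetical helper, and the exact validation rules (for example, rejecting an empty `text`) are assumptions; only the field names and defaults come from the documentation above.

```python
def parse_search_request(payload):
    """Validate a /search body and fill in the documented defaults."""
    for field in ("user_id", "text"):
        value = payload.get(field)
        if not isinstance(value, str) or not value:
            raise ValueError(f"'{field}' is required and must be a non-empty string")

    top_k = payload.get("top_k", 5)            # documented default: 5
    threshold = payload.get("threshold", 0.5)  # documented default: 0.5

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("'top_k' must be a positive integer")
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise ValueError("'threshold' must be a number")

    return {
        "user_id": payload["user_id"],
        "text": payload["text"],
        "top_k": top_k,
        "threshold": float(threshold),
    }

params = parse_search_request({"user_id": "user_1", "text": "What is AI?"})
```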

Example Use

1. Initializing the Database

Before you run the application, initialize the SQLite database and add sample documents by running:

python initialize_db.py

This script creates the necessary database schema and inserts the three sample documents listed under Setup Instructions.

Once this is done, you can start the Flask application.

2. Running the Server

Start the Flask app by running:

python run.py

The server will be available at http://localhost:5000.

3. Example Request and Response

Example Search Request (via Curl)

curl -X POST "http://localhost:5000/search" \
-H "Content-Type: application/json" \
-d '{
  "user_id": "user_1",
  "text": "What is AI?",
  "top_k": 3,
  "threshold": 0.4
}'
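The same request can be issued from Python using only the standard library. This is a minimal sketch; `build_search_request` and `send` are hypothetical helper names, and the call that actually contacts the server is left commented out so the snippet runs without one.

```python
import json
import urllib.request

def build_search_request(base_url, user_id, text, top_k=5, threshold=0.5):
    """Prepare a POST request for /search with a JSON body."""
    body = json.dumps(
        {"user_id": user_id, "text": text, "top_k": top_k, "threshold": threshold}
    ).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

request = build_search_request(
    "http://localhost:5000", "user_1", "What is AI?", top_k=3, threshold=0.4
)
# response = send(request)  # uncomment once the server is running
```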

Example Search Request (via Postman)

  1. Open Postman and create a new POST request.
  2. Set the URL to http://localhost:5000/search.
  3. In the Headers tab, add:
    • Key: Content-Type
    • Value: application/json
  4. In the Body tab, select raw and paste the following JSON:
    {
      "user_id": "user_1",
      "text": "What is AI?",
      "top_k": 3,
      "threshold": 0.4
    }
  5. Click Send.

Expected Response:

The system will respond with the top relevant documents and their similarity scores:

{
  "results": [
    [
      "Artificial intelligence is transforming the world by enabling machines to learn from data.",
      0.89
    ],
    [
      "Machine learning is a subset of AI that focuses on building algorithms that can learn and make predictions.",
      0.75
    ]
  ]
}

Explanation:

  • Query: The query "What is AI?" was encoded, and the system searched for the most relevant documents in the database based on cosine similarity of the text embeddings.
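  • Results: The system returned two documents that matched the query, with similarity scores of 0.89 and 0.75, respectively. Although top_k was 3, the third sample document does not appear, presumably because its score fell below the 0.4 threshold.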

4. API Rate Limiting

If the same user makes more than 5 requests, the system responds with a 429 status code and the following error:

{
  "error": "Rate limit exceeded"
}

This helps prevent abuse by limiting the number of requests a single user can make.
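The limit described above can be illustrated with a small counter. This is a sketch under stated assumptions: the `RateLimiter` class is hypothetical, and it keeps counts in an in-memory dict, whereas the project tracks requests in the users table.

```python
MAX_REQUESTS = 5  # per-user limit stated in this README

class RateLimiter:
    """Count requests per user and reject any beyond the limit."""

    def __init__(self, limit=MAX_REQUESTS):
        self.limit = limit
        self.counts = {}  # in-memory stand-in for the users table

    def check(self, user_id):
        """Record one request and return (status_code, error_body_or_None)."""
        used = self.counts.get(user_id, 0)
        if used >= self.limit:
            return 429, {"error": "Rate limit exceeded"}
        self.counts[user_id] = used + 1
        return 200, None

limiter = RateLimiter()
statuses = [limiter.check("user_1")[0] for _ in range(6)]
other_user = limiter.check("user_2")[0]  # limits are tracked per user
```

The first five calls for "user_1" are accepted and the sixth is rejected, while "user_2" is unaffected.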


Docker Setup

To run the application inside a Docker container, follow these steps:

  1. Build the Docker image:

    docker build -t document_retrieval_system .
  2. Run the container:

    docker run -p 5000:5000 document_retrieval_system

The application will be available at http://localhost:5000.


Future Improvements

  • Re-ranking Algorithms: Implement re-ranking of search results to improve accuracy.
  • Real News Scraping: Implement actual news scraping in the background task.
  • Fine-Tuning: Add support for fine-tuning retrievers on specific datasets for better results.
