This project implements a backend for a document retrieval system. It allows users to query documents based on text similarity using a REST API. Documents are stored in a database with their vector embeddings, and queries are processed to return the most relevant documents using cosine similarity.
- Document Storage and Retrieval: Stores documents with their encoded vector representations and retrieves the top relevant documents based on user queries.
- API Rate Limiting: Limits each user to 5 API calls; further requests return a 429 status code.
- Background News Scraper: A placeholder background task that periodically scrapes news articles.
- Caching: Uses Redis to cache search results, improving the response time for repeated queries.
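The caching feature could be sketched as follows. This is a minimal illustration, not the project's actual code: it assumes a client object exposing the redis-py style `get` and `setex` methods, and the key format, the TTL, and the `cached_search` helper are hypothetical. `InMemoryCache` is a stand-in so the example runs without a Redis server.

```python
import hashlib
import json


class InMemoryCache:
    """Stand-in exposing the two redis-py methods used below (get, setex)."""

    def __init__(self):
        self._store = {}

    def get(self, name):
        return self._store.get(name)

    def setex(self, name, time, value):
        # A real Redis server would expire the key after `time` seconds;
        # this stand-in ignores the TTL.
        self._store[name] = value


def cache_key(text, top_k, threshold):
    """Build a deterministic key from the query parameters (hypothetical format)."""
    raw = json.dumps(
        {"text": text, "top_k": top_k, "threshold": threshold}, sort_keys=True
    )
    return "search:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()


def cached_search(cache, search_fn, text, top_k=5, threshold=0.5, ttl_seconds=300):
    """Return cached results when present; otherwise compute and store them."""
    key = cache_key(text, top_k, threshold)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    results = search_fn(text, top_k, threshold)
    cache.setex(key, ttl_seconds, json.dumps(results))
    return results
```

Passing a `redis.Redis` instance in place of `InMemoryCache` should work the same way, since `json.loads` accepts the bytes that Redis returns.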
document_retrieval_system/
│
├── app/
│ ├── __init__.py # Initializes the Flask app
│ ├── api.py # Main API routes for the system
│ ├── models.py # Database models and logic
│ ├── scraper.py # Background news scraper logic
│ └── utils.py # Utility functions (e.g., for caching, encoding)
│
├── data/
│ └── documents.db # SQLite database (generated by initialize_db.py)
│
├── initialize_db.py # Script to initialize the database and add sample data
├── Dockerfile # Dockerfile for building the app
├── README.md # Documentation with details on how to run the project
├── requirements.txt # Python dependencies
├── run.py # Entry point for running the Flask app
First, make sure you have Python 3.9+ and Redis installed. If you have Docker, you can use it to start Redis.
- Install the required Python packages:
pip install -r requirements.txt
- Run Redis in Docker (if you don’t have Redis installed locally):
docker run -d -p 6379:6379 redis
The project uses a SQLite database to store documents and track user requests. Before running the app, you need to initialize the database and insert sample documents.
- Run the initialize_db.py script:
python initialize_db.py
This script will:
- Create a documents table in the data/documents.db SQLite database.
- Create a users table to track API requests for rate limiting.
- Insert three sample documents into the documents table with their encoded embeddings.
Sample documents inserted:
- "Artificial intelligence is transforming the world by enabling machines to learn from data."
- "Machine learning is a subset of AI that focuses on building algorithms that can learn and make predictions."
- "Data science combines domain knowledge, programming skills, and statistics to extract insights from data."
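The initialization steps above might look roughly like this. It is a sketch under stated assumptions: the column names, the JSON encoding of embeddings, and the `placeholder_embedding` function are all hypothetical, and the real initialize_db.py would call the project's own text encoder instead.

```python
import json
import sqlite3

SAMPLE_DOCUMENTS = [
    "Artificial intelligence is transforming the world by enabling machines to learn from data.",
    "Machine learning is a subset of AI that focuses on building algorithms that can learn and make predictions.",
    "Data science combines domain knowledge, programming skills, and statistics to extract insights from data.",
]


def placeholder_embedding(text):
    """Placeholder only: a real script would call the project's encoder here."""
    return [float(len(text))]


def initialize_db(path=":memory:"):
    """Create the documents and users tables and insert the sample documents."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "user_id TEXT PRIMARY KEY, request_count INTEGER NOT NULL DEFAULT 0)"
    )
    rows = [(t, json.dumps(placeholder_embedding(t))) for t in SAMPLE_DOCUMENTS]
    conn.executemany("INSERT INTO documents (text, embedding) VALUES (?, ?)", rows)
    conn.commit()
    return conn
```

Calling `initialize_db("data/documents.db")` would write to the path shown in the project layout; the default uses an in-memory database so the sketch can be tried without touching disk.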
After initializing the database, start the Flask application:
python run.py
The server will be running at http://localhost:5000.
- Method: GET
- Description: Checks if the API is active.
- Response:
{ "status": "API is active" }
- Method: POST
- Description: Retrieves the top-k most relevant documents based on a query text.
- Request Parameters:
  - user_id (string, required): A unique identifier for the user making the request.
  - text (string, required): The query text to search for relevant documents.
  - top_k (integer, optional): The number of top results to return (default: 5).
  - threshold (float, optional): The minimum similarity score required for a document to be included in the results (default: 0.5).
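The request parameters and their defaults can be illustrated with a small validation helper. This is a sketch only: `parse_search_request` is a hypothetical name, and the specific checks (non-empty strings, positive top_k) are assumptions that go beyond what the parameter list states.

```python
def parse_search_request(payload):
    """Validate a search request body and apply the documented defaults."""
    for field in ("user_id", "text"):
        value = payload.get(field)
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"'{field}' is required and must be a non-empty string")

    top_k = payload.get("top_k", 5)            # documented default: 5
    threshold = payload.get("threshold", 0.5)  # documented default: 0.5

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(top_k, bool) or not isinstance(top_k, int) or top_k < 1:
        raise ValueError("'top_k' must be a positive integer")
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise ValueError("'threshold' must be a number")

    return {
        "user_id": payload["user_id"],
        "text": payload["text"],
        "top_k": top_k,
        "threshold": float(threshold),
    }
```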
Before you run the application, initialize the SQLite database and add sample documents by running:
python initialize_db.py
This script creates the necessary database schema and inserts the three sample documents listed above.
Once this is done, you can start the Flask application.
Start the Flask app by running:
python run.py
The server will be available at http://localhost:5000.
curl -X POST "http://localhost:5000/search" \
-H "Content-Type: application/json" \
-d '{
"user_id": "user_1",
"text": "What is AI?",
"top_k": 3,
"threshold": 0.4
}'
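The same request can be sent from Python using only the standard library. The sketch below mirrors the curl command above; the helper names `build_search_request` and `search` are hypothetical, and `search` only works while the server is running.

```python
import json
import urllib.request


def build_search_request(user_id, text, top_k=3, threshold=0.4,
                         base_url="http://localhost:5000"):
    """Build a POST request for /search with a JSON body."""
    body = json.dumps({
        "user_id": user_id,
        "text": text,
        "top_k": top_k,
        "threshold": threshold,
    }).encode("utf-8")
    return urllib.request.Request(
        base_url + "/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(user_id, text, **kwargs):
    """Send the request and decode the JSON response (server must be running)."""
    request = build_search_request(user_id, text, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `search("user_1", "What is AI?")` sends the same payload as the curl command.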
- Open Postman and create a new POST request.
- Set the URL to http://localhost:5000/search.
- In the Headers tab, add:
  Key: Content-Type
  Value: application/json
- In the Body tab, select raw and paste the following JSON:
{ "user_id": "user_1", "text": "What is AI?", "top_k": 3, "threshold": 0.4 }
- Click Send.
The system will respond with the top relevant documents and their similarity scores:
{
"results": [
[
"Artificial intelligence is transforming the world by enabling machines to learn from data.",
0.89
],
[
"Machine learning is a subset of AI that focuses on building algorithms that can learn and make predictions.",
0.75
]
]
}
- Query: The query "What is AI?" was encoded, and the system searched the database for the most relevant documents based on the cosine similarity of the text embeddings.
- Results: The system returned two documents that matched the query, with similarity scores of 0.89 and 0.75, respectively. Although top_k was 3, only two documents appear; the third sample document presumably scored below the 0.4 threshold.
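The ranking step described here can be sketched in plain Python. This is an illustration under assumptions rather than the project's implementation: `top_k_documents` is a hypothetical name, embeddings are taken to be plain lists of floats, and the threshold is treated as inclusive.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_documents(query_vec, documents, top_k=5, threshold=0.5):
    """Rank (text, embedding) pairs by similarity to the query vector.

    Documents scoring below `threshold` are dropped before the top_k cut,
    so fewer than top_k results may be returned.
    """
    scored = [(text, cosine_similarity(query_vec, emb)) for text, emb in documents]
    kept = [pair for pair in scored if pair[1] >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

Filtering before truncating is what lets a request with top_k set to 3 return only two documents.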
If the same user makes more than 5 requests, the system returns a 429 status code with the following error:
{
"error": "Rate limit exceeded"
}
This helps prevent abuse by limiting the number of requests a single user can make.
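A per-user counter backed by the users table could enforce this limit. The sketch below is one possible approach, not the project's code: the `request_count` column and the `check_rate_limit` helper are hypothetical, and the counter is never reset because the README does not describe a reset window.

```python
import sqlite3

MAX_REQUESTS = 5  # limit stated in the README


def create_users_table(conn):
    """Create the table used to count requests per user (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "user_id TEXT PRIMARY KEY, request_count INTEGER NOT NULL DEFAULT 0)"
    )


def check_rate_limit(conn, user_id, limit=MAX_REQUESTS):
    """Count one request for user_id; return (status_code, error_body_or_None)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO users (user_id, request_count) VALUES (?, 0)",
            (user_id,),
        )
        count = conn.execute(
            "SELECT request_count FROM users WHERE user_id = ?", (user_id,)
        ).fetchone()[0]
        if count >= limit:
            return 429, {"error": "Rate limit exceeded"}
        conn.execute(
            "UPDATE users SET request_count = request_count + 1 WHERE user_id = ?",
            (user_id,),
        )
    return 200, None
```

In a Flask route, the returned status code and body would be passed straight to the response when the status is 429.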
To run the application inside a Docker container, follow these steps:
- Build the Docker image:
docker build -t document_retrieval_system .
- Run the container:
docker run -p 5000:5000 document_retrieval_system
The application will be available at http://localhost:5000.
- Re-ranking Algorithms: Implement re-ranking of search results to improve accuracy.
- Real News Scraping: Implement actual news scraping in the background task.
- Fine-Tuning: Add support for fine-tuning retrievers on specific datasets for better results.