PDF Embedding Indexer is a CLI tool that extracts text from PDF files, generates embeddings from that text using Sentence Transformers, and stores the embeddings in a PostgreSQL database. This enables fast similarity search across a large collection of PDF files.
- Currently supports PDF files only, and is limited to a single PDF file per command.
- Extracts text content from PDF files and generates embeddings using Sentence Transformers.
- Stores the embeddings in a PostgreSQL database for quick and efficient similarity searching.
- Allows for sentence-level indexing, offering granular search results.
- Stores additional metadata for each document, including file hash, timestamp, and title.
- Prevents duplicate PDFs from being indexed.
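
The duplicate check mentioned above can be illustrated with a small sketch. It assumes the file hash is a SHA-256 digest of the raw file bytes and that previously indexed hashes are available as a set; the project may use a different hash function and will look the hash up in PostgreSQL instead. The names `file_sha256` and `is_duplicate` are hypothetical and are not taken from the repository.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_duplicate(path, known_hashes):
    """Return True if the file's hash is already in the known set.

    `known_hashes` stands in for a database lookup (an assumption).
    """
    return file_sha256(path) in known_hashes
```

Hashing the file contents rather than the file name means a renamed copy of an already indexed PDF is still recognized as a duplicate.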
- Python 3.8 or higher
- PostgreSQL
- sentence-transformers
- psycopg2-binary
- sqlalchemy
- pgvector
- python-magic
- pdfminer.six
- nltk
- Ensure that you have Python 3.8 or higher installed.
- Install PostgreSQL and set up a database for this project.
- Clone this repository:
  git clone https://github.com/itsyaasir/pdf-intellect.git
- Change into the project directory:
  cd pdf-intellect
- Create and activate either a Conda environment or a virtual environment:
  - For a Conda environment:
    Run the provided setup script to create a Conda environment and install the necessary packages:
    bash conda_setup.sh
  - For a virtual environment:
    Create a virtual environment:
    python -m venv venv
    Activate the virtual environment:
    - On Unix or macOS, run:
      source venv/bin/activate
    - On Windows, run:
      venv\Scripts\activate
    Install the required packages:
    pip install -r requirements.txt
- Run the provided setup script to set up the database:
  You will need to prov
  bash db_setup.sh
- Modify the environment variables in config.py if necessary.
- Depending on your model, you might need to adjust the prompt template to match the model's input format. You can check the default template in app/llama.py.
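
As a rough illustration of what adjusting a prompt template involves, the sketch below fills a template with retrieved context and a user question. The template text, the placeholder names `{context}` and `{question}`, and the function `build_prompt` are all hypothetical; they are not the contents of app/llama.py, which should be checked for the actual default.

```python
# Hypothetical template; the real default lives in app/llama.py.
PROMPT_TEMPLATE = (
    "Use the context below to answer the question.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context_chunks, question, template=PROMPT_TEMPLATE):
    """Join retrieved text chunks and substitute them into the template."""
    context = "\n".join(context_chunks)
    return template.format(context=context, question=question)
```

A model that expects a specific chat or instruction format would need the same two pieces of information wrapped in whatever markers that model was trained with.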
To index a PDF:
python main.py index <pdf_file>
To search for similar content given a query:
python main.py search "<query>"
To query the PDF with an LLM:
python main.py query "<query>"
These commands will print their results to the console.
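
To give a sense of what the search command computes, here is a minimal pure-Python sketch of ranking stored sentence embeddings against a query embedding. It assumes cosine similarity as the measure and uses tiny hand-written vectors; in the tool itself the embeddings come from Sentence Transformers and the comparison is delegated to PostgreSQL, and the measure actually used is not stated here. The function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, indexed, top_k=3):
    """Return the top_k (score, sentence) pairs, highest score first.

    `indexed` is a list of (sentence, embedding) pairs standing in for
    rows stored in the database (an assumption).
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Because indexing is done at the sentence level, each result in a ranking like this corresponds to a single sentence rather than a whole document.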
Please feel free to fork this repository and contribute. When submitting your changes, please ensure that your code is well-commented and that you have tested your changes.
This project is licensed under the terms of the MIT license. See LICENSE for more details.