PDF Intellect

PDF Intellect is a CLI tool that extracts text from PDF files, generates embeddings from that text using Sentence Transformers, and stores the embeddings in a PostgreSQL database. This enables fast similarity search across a large collection of PDF files.
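
To illustrate what similarity searching over embeddings means, here is a minimal sketch in plain Python. It ranks stored sentences by cosine similarity to a query vector. The three-dimensional vectors and the sentences are toy stand-ins for real Sentence Transformers embeddings, and the function names are hypothetical; the project itself delegates this work to PostgreSQL.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Rank (sentence, vector) pairs by similarity to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy vectors standing in for real sentence embeddings (assumption).
indexed = [
    ("Sentence about topic A.", [0.9, 0.1, 0.0]),
    ("Sentence about topic B.", [0.0, 0.2, 0.9]),
    ("Another sentence close to topic A.", [0.8, 0.3, 0.1]),
]

results = top_k([1.0, 0.0, 0.0], indexed, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Real embeddings have hundreds of dimensions, but the ranking logic is the same.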

Features

- Supports PDF files only, and indexes a single PDF file per command.
- Extracts text content from PDF files and generates embeddings using Sentence Transformers.
- Stores the embeddings in a PostgreSQL database for quick and efficient similarity searching.
- Indexes at the sentence level, offering granular search results.
- Stores additional metadata for each document, including file hash, timestamp, and title.
- Prevents duplicate PDFs from being indexed.
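
The sketch below shows one way that sentence-level indexing, per-document metadata, and duplicate prevention could fit together. It is not the project's actual code: SHA-256 as the hash, the regex splitter (a simple stand-in for a tokenizer such as nltk's), and the `InMemoryIndex` class are all assumptions made for illustration.

```python
import hashlib
import re
from datetime import datetime, timezone


def file_hash(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def split_sentences(text: str) -> list:
    """Naive splitter on ., ! and ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


class InMemoryIndex:
    """Hypothetical stand-in for the database tables."""

    def __init__(self):
        self.documents = {}  # file hash -> metadata
        self.sentences = []  # (file hash, position, sentence)

    def add(self, title: str, data: bytes, text: str) -> bool:
        """Index a document; return False if the same file was already indexed."""
        digest = file_hash(data)
        if digest in self.documents:
            return False  # identical bytes, so skip the duplicate
        self.documents[digest] = {
            "title": title,
            "indexed_at": datetime.now(timezone.utc),
        }
        for position, sentence in enumerate(split_sentences(text)):
            self.sentences.append((digest, position, sentence))
        return True


index = InMemoryIndex()
print(index.add("report", b"raw pdf bytes", "First sentence. Second one!"))
print(index.add("report copy", b"raw pdf bytes", "First sentence. Second one!"))
```

Hashing the file contents rather than the file name means a renamed copy of the same PDF is still detected as a duplicate.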

Requirements

- Python 3.8 or higher
- PostgreSQL
- sentence-transformers
- psycopg2-binary
- sqlalchemy
- pgvector
- python-magic
- pdfminer.six
- nltk
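
For context on why pgvector is on this list: it adds a `vector` column type to PostgreSQL along with distance operators, including `<=>` for cosine distance. The sketch below builds a parameterized nearest-neighbour query in psycopg2's `%s` placeholder style. The table name `sentence_embeddings` and its columns are hypothetical and may not match the project's schema.

```python
def to_pgvector_literal(vec):
    """Format a sequence of numbers in pgvector's text form, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in vec) + "]"


def build_search_query(query_vec, limit=5):
    """Return (sql, params) for a cosine-distance search.

    Table and column names are assumptions for illustration only.
    """
    sql = (
        "SELECT sentence, embedding <=> %s::vector AS distance "
        "FROM sentence_embeddings "
        "ORDER BY distance "
        "LIMIT %s"
    )
    return sql, (to_pgvector_literal(query_vec), limit)


sql, params = build_search_query([0.1, 0.2, 0.3], limit=3)
print(sql)
print(params)
```

The returned pair can be passed to a database cursor's `execute(sql, params)`. Smaller distances mean more similar sentences.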

Setup

Ensure that you have Python 3.8 or higher installed.
Install PostgreSQL and set up a database for this project.

Clone this repository:

git clone https://github.com/itsyaasir/pdf-intellect.git

Change into the project directory:
```
cd pdf-intellect
```
Create and activate either a Conda environment or a Python virtual environment:
- For Conda environment:
  
  Run the provided setup script to create a Conda environment and install the necessary packages:
```
bash conda_setup.sh
```
- For virtual environment:
  
  Create a virtual environment:
```
python -m venv venv
```
  Activate the virtual environment:
  - On Unix or macOS, run:
```
source venv/bin/activate
```
  - On Windows, run:
```
venv\Scripts\activate
```
  Install the required packages:
```
pip install -r requirements.txt
```
Run the provided setup script to set up the database:

You will need to prov
```
bash db_setup.sh
```
Modify the environment variables in config.py if necessary.
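
As a rough picture of what environment-driven configuration can look like, here is a minimal sketch. The variable names (`DB_HOST`, `DB_PORT`, `DB_NAME`, `DB_USER`, `DB_PASSWORD`), their defaults, and the `Settings` class are assumptions; check config.py and .env.sample for the names the project actually uses.

```python
import os
from dataclasses import dataclass
from urllib.parse import quote_plus


@dataclass(frozen=True)
class Settings:
    """Hypothetical database settings; field names are illustrative."""

    db_host: str
    db_port: int
    db_name: str
    db_user: str
    db_password: str

    @property
    def database_url(self) -> str:
        # Quote the password so characters such as '@' or '/' stay valid in a URL.
        password = quote_plus(self.db_password)
        return (
            f"postgresql://{self.db_user}:{password}"
            f"@{self.db_host}:{self.db_port}/{self.db_name}"
        )


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), with fallbacks."""
    env = os.environ if env is None else env
    return Settings(
        db_host=env.get("DB_HOST", "localhost"),
        db_port=int(env.get("DB_PORT", "5432")),
        db_name=env.get("DB_NAME", "pdf_intellect"),
        db_user=env.get("DB_USER", "postgres"),
        db_password=env.get("DB_PASSWORD", ""),
    )


settings = load_settings({"DB_NAME": "pdfs", "DB_USER": "u", "DB_PASSWORD": "p"})
print(settings.database_url)
```

Passing a plain dictionary instead of `os.environ` makes the loader easy to test without touching the real environment.
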
Depending on your model, you might need to adjust the prompt template to match the model's input format. You can check the default template in app/llama.py.
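
To show what adjusting a prompt template can involve, here is a minimal sketch using Python's `string.Template`. Both templates and the `build_prompt` helper are hypothetical; the project's real default lives in app/llama.py and may be structured differently.

```python
from string import Template

# Hypothetical default template; not the one shipped in app/llama.py.
DEFAULT_TEMPLATE = Template(
    "Use the context below to answer the question.\n\n"
    "Context:\n$context\n\n"
    "Question: $question\n"
    "Answer:"
)

# An instruction-style variant, for models that expect section markers.
INSTRUCTION_TEMPLATE = Template(
    "### Instruction:\n$question\n\n"
    "### Input:\n$context\n\n"
    "### Response:\n"
)


def build_prompt(question, passages, template=DEFAULT_TEMPLATE):
    """Fill a template with the question and the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return template.substitute(context=context, question=question)


passages = ["First retrieved sentence.", "Second retrieved sentence."]
print(build_prompt("What does the document say?", passages))
print(build_prompt("What does the document say?", passages, INSTRUCTION_TEMPLATE))
```

Keeping the template separate from the code that fills it means switching models only requires swapping the template text.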

Usage

To index a PDF:

python main.py index <pdf_file>

To search for similar content given a query:

python main.py search "<query>"

To query the PDF with an LLM:

python main.py query "<query>"

These commands will print their results to the console.
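
For readers curious how a command layout like this can be wired up, here is a minimal `argparse` sketch with the three subcommands shown above. It is an assumption about structure only and does not reproduce the project's main.py; the help strings are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with index, search, and query subcommands."""
    parser = argparse.ArgumentParser(
        prog="main.py", description="Index and search PDF content."
    )
    subparsers = parser.add_subparsers(dest="command", required=True)

    index_cmd = subparsers.add_parser("index", help="index a single PDF file")
    index_cmd.add_argument("pdf_file", help="path to the PDF to index")

    search_cmd = subparsers.add_parser("search", help="find similar content")
    search_cmd.add_argument("query", help="text to search for")

    query_cmd = subparsers.add_parser("query", help="ask an LLM about the PDF")
    query_cmd.add_argument("query", help="question to ask")

    return parser


# Parse an explicit argument list so the example runs without a shell.
args = build_parser().parse_args(["search", "example search text"])
print(args.command, args.query)
```

Quoting the query on the command line keeps a multi-word query as a single argument.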

Contributing

Please feel free to fork this repository and contribute. When submitting your changes, please ensure that your code is well-commented and that you have tested your changes.

License

This project is licensed under the terms of the MIT license. See LICENSE for more details.
