This project searches across all PDF files on a system, using word embeddings for summarization. By leveraging Word2Vec, it can find and return relevant documents for a user query even when the exact keywords do not appear in the documents.
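To illustrate why embedding-based search can match documents without shared keywords, here is a minimal sketch that ranks documents by cosine similarity to a query vector. The vectors and file names are hand-made toy values, not real Word2Vec output, and the function name `cosine_similarity` is hypothetical rather than taken from this project.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors (assumption: illustrative values only).
query = [0.9, 0.1, 0.0]
documents = {
    "neural_networks.pdf": [0.8, 0.2, 0.1],
    "cooking_recipes.pdf": [0.0, 0.1, 0.9],
}

# Rank documents from most to least similar to the query.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```

Documents whose vectors point in a similar direction to the query rank first, regardless of whether they contain the literal query words.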
Ensure you have the following installed:
- Python 3.7 or higher
- Required Python packages (listed in requirements.txt)
- Clone the repository:
  git clone https://github.com/patelchaitany/Copilot-for-Linux
  cd Copilot-for-Linux
- Install the required packages:
  pip install -r requirements.txt
- Install Chromadb and run it on localhost, port 8000.
- Run the script:
  python main.py --size <max-file-size-in-MB>
  Replace <max-file-size-in-MB> with the maximum size, in MB, of the PDF files you want to process.
- Input your query: When prompted, enter the word or phrase you want to search for in the PDF documents.
- View the results: The script displays the list of PDFs containing the relevant information.
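Because the script expects Chromadb to be running on localhost and port 8000, it can help to confirm that something is listening there before starting a search. The following is a minimal sketch using only the Python standard library. The helper name `is_server_reachable` is hypothetical, and the check only confirms that a TCP port accepts connections, not that the listener is actually Chromadb.

```python
import socket


def is_server_reachable(host="localhost", port=8000, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if is_server_reachable():
    print("A server is listening on localhost:8000")
else:
    print("Nothing is listening on localhost:8000; start Chromadb first")
```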
python main.py --size 10
Word need to search in document: machine learning
This command processes all PDF files under the user's home directory that are less than 10 MB in size and searches for the term "machine learning".
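The file selection described above can be sketched as follows. This is an illustration under stated assumptions, not the project's actual code: the function names `find_pdfs` and `parse_args` are hypothetical, 1 MB is taken to mean 1024 * 1024 bytes, and only files with a lowercase `.pdf` extension are matched.

```python
import argparse
from pathlib import Path


def find_pdfs(root, max_size_mb):
    """Return PDF paths under root that are smaller than max_size_mb megabytes."""
    limit_bytes = max_size_mb * 1024 * 1024  # assumption: 1 MB = 1024 * 1024 bytes
    matches = []
    for path in Path(root).rglob("*.pdf"):
        try:
            if path.is_file() and path.stat().st_size < limit_bytes:
                matches.append(path)
        except OSError:
            # Skip files that disappear or cannot be inspected.
            continue
    return sorted(matches)


def parse_args(argv=None):
    """Parse a --size option like the one shown in the usage example."""
    parser = argparse.ArgumentParser(description="List PDFs below a size limit.")
    parser.add_argument("--size", type=float, required=True,
                        help="maximum PDF size in MB")
    return parser.parse_args(argv)


# Equivalent to passing "--size 10" on the command line.
args = parse_args(["--size", "10"])
print(args.size)
```

Calling `find_pdfs(Path.home(), args.size)` would then list the candidate files under the home directory. On a large home directory the walk may take a while.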
.
├── directory.py # Handles directory structure and file comparisons
├── embedding.py # Manages document embeddings and queries
├── main.py # Main script to execute the project
├── requirements.txt # List of dependencies
└── README.md # Project documentation
- Ensure all dependencies are installed.
- Make sure your PDF files are accessible and not corrupted.
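A quick way to spot unreadable or obviously damaged files is to check that each one opens and begins with the `%PDF-` header that PDF files start with. The sketch below is a coarse check only: a file can pass it and still be corrupted further in, and the helper name `looks_like_pdf` is hypothetical.

```python
def looks_like_pdf(path):
    """Return True if the file can be opened and starts with the PDF header."""
    try:
        with open(path, "rb") as handle:
            return handle.read(5) == b"%PDF-"
    except OSError:
        # Missing file, permission error, or other read failure.
        return False
```

For example, `looks_like_pdf("report.pdf")` returns False for a file that is missing, unreadable, or not a PDF at all.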
Feel free to fork this repository, make your changes, and submit a pull request. Contributions are welcome!