This Python script extracts and indexes keywords from PDF documents. It scans through a specified PDF file to identify occurrences of predefined keywords and generates an organized report showing which pages contain each keyword. The script is particularly useful for quickly navigating large technical documents, research papers, or reference manuals.
- Keyword indexing: Scans PDF documents for specific keywords and records the page numbers where each appears
- Multi-word support: Properly handles both single-word and multi-word phrases with flexible spacing
- Page exclusion: Allows specific pages to be excluded from analysis (e.g., references, appendices)
- Progress reporting: Shows real-time progress indicators during processing
- Comprehensive output: Generates a text report containing found and not-found keywords
- Error handling: Robust error checking for file access and text extraction issues
- Python 3.6+
- PyPDF2 library
- Clone the repository:
    git clone https://github.com/mihir-28/PDFKeyWordMapper.git
    cd PDFKeyWordMapper
- Install the required dependencies:
    pip install PyPDF2
- Place your PDF file in the same directory as the script or provide the correct path in the script.
- Edit the pdf_path variable in the script to point to your PDF file:
    pdf_path = "path/to/your/file"
- Customize the extended_keywords list to include terms relevant to your document.
- If needed, update the pages_to_skip list to exclude specific pages from analysis:
    # For individual pages: [1, 5, 9]
    # For page ranges: list(range(37, 100)) + list(range(122, 126))
    pages_to_skip = []
- Run the script:
    python keywords.py
- Check the generated keywords_index.txt file for results.
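Note that `range` excludes its end value, so `list(range(122, 126))` covers pages 122 through 125. As a minimal sketch, the skip list could also be built from inclusive ranges with a small helper. `expand_pages` is a hypothetical name used for illustration here and is not part of the script:

```python
from typing import Iterable, List, Tuple, Union

# A page spec is either a single page number or an inclusive (first, last) pair.
PageSpec = Union[int, Tuple[int, int]]


def expand_pages(specs: Iterable[PageSpec]) -> List[int]:
    """Expand single pages and inclusive (first, last) ranges into a sorted list."""
    pages = set()
    for spec in specs:
        if isinstance(spec, tuple):
            first, last = spec
            pages.update(range(first, last + 1))  # +1 because range() excludes its end
        else:
            pages.add(spec)
    return sorted(pages)


# Pages 1 and 5, plus pages 122 through 125 inclusive.
pages_to_skip = expand_pages([1, 5, (122, 125)])
print(pages_to_skip)  # [1, 5, 122, 123, 124, 125]
```

The result is a plain list of integers, so it can be assigned to pages_to_skip in the same way as the hand-written lists above.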
The output file keywords_index.txt contains:
Keyword Index Results:
Document: your_document.pdf
Total pages: 125
Pages skipped: 122, 123, 124, 125
FOUND KEYWORDS:
Authentication: 15, 42, 87
Blockchain: 3, 8, 10, 12, 45, 67
React: 56, 78, 92
...
NOT FOUND KEYWORDS:
ABI
Cloud Functions
...
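A report in this layout could be assembled roughly as follows. This is a sketch, not the script's actual code: `format_report` and its parameters are hypothetical names, and the alphabetical ordering of keywords is an assumption based on the sample above.

```python
from typing import Dict, List


def format_report(document: str, total_pages: int, skipped: List[int],
                  found: Dict[str, List[int]], not_found: List[str]) -> str:
    """Build the report text in the layout shown above."""
    lines = [
        "Keyword Index Results:",
        f"Document: {document}",
        f"Total pages: {total_pages}",
        "Pages skipped: " + ", ".join(str(p) for p in skipped),
        "FOUND KEYWORDS:",
    ]
    # Assumption: keywords are listed alphabetically, ignoring case.
    for keyword in sorted(found, key=str.lower):
        pages = ", ".join(str(p) for p in found[keyword])
        lines.append(f"{keyword}: {pages}")
    lines.append("NOT FOUND KEYWORDS:")
    lines.extend(sorted(not_found, key=str.lower))
    return "\n".join(lines) + "\n"


report = format_report(
    "your_document.pdf", 125, [122, 123, 124, 125],
    {"React": [56, 78, 92], "Authentication": [15, 42, 87]},
    ["Cloud Functions", "ABI"],
)
print(report)
```

The returned string can then be written to keywords_index.txt with an ordinary `open(..., "w", encoding="utf-8")` call.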
You can easily customize the script by modifying:
- pdf_path: Path to the PDF file to analyze
- pages_to_skip: List of page numbers to exclude from analysis
- extended_keywords: List of keywords to search for
- The script opens and validates the specified PDF file
- It iterates through each page, extracting text content
- For each page, it searches for all keywords in the predefined list
- Single-word keywords use word boundary regex matching for accuracy
- Multi-word phrases use flexible spacing pattern matching
- Found keywords and their page numbers are compiled into a dictionary
- Results are written to an output file with statistics
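The steps above can be sketched as a single loop over page texts. To keep the example runnable without a PDF, plain strings stand in for the text that PyPDF2 would extract from each page. `index_keywords` is a hypothetical name, and 1-based page numbering is an assumption based on the sample output.

```python
import re
from typing import Dict, Iterable, List


def index_keywords(page_texts: List[str], keywords: Iterable[str],
                   pages_to_skip: Iterable[int] = ()) -> Dict[str, List[int]]:
    """Map each keyword to the 1-based page numbers whose text contains it."""
    skip = set(pages_to_skip)
    # One case-insensitive pattern per keyword; words may be separated by any whitespace.
    patterns = {
        kw: re.compile(r"\b" + r"\s+".join(re.escape(w) for w in kw.split()) + r"\b",
                       re.IGNORECASE)
        for kw in keywords
    }
    index = {kw: [] for kw in patterns}
    for page_number, text in enumerate(page_texts, start=1):
        if page_number in skip or not text:
            continue  # skipped page, or a page with no extractable text
        for kw, pattern in patterns.items():
            if pattern.search(text):
                index[kw].append(page_number)
    return index


pages = ["Intro to blockchain.", "React and\nCloud  Functions.",
         "Blockchain again.", "References: React"]
index = index_keywords(pages, ["Blockchain", "React", "Cloud Functions", "ABI"],
                       pages_to_skip=[4])
found = {kw: nums for kw, nums in index.items() if nums}
not_found = [kw for kw, nums in index.items() if not nums]
print(found)      # {'Blockchain': [1, 3], 'React': [2], 'Cloud Functions': [2]}
print(not_found)  # ['ABI']
```

Each page number is recorded at most once per keyword, because the loop only checks whether a keyword occurs on the page, not how often.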
The script uses regular expressions with word boundaries to ensure accurate matching:
- For single words: \bkeyword\b matches the exact word
- For multi-word phrases: It allows flexible spacing between words while maintaining word order
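One way to build such patterns is shown below. This is a sketch under stated assumptions: `build_pattern` is a hypothetical helper, and joining the words with `\s+` is one plausible reading of "flexible spacing" (the script's exact expression may differ).

```python
import re


def build_pattern(keyword):
    """Compile a case-insensitive, word-bounded pattern for a keyword or phrase."""
    # Escape each word so characters such as "." are matched literally,
    # then allow one or more whitespace characters between consecutive words.
    body = r"\s+".join(re.escape(word) for word in keyword.split())
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


single = build_pattern("React")
phrase = build_pattern("Cloud Functions")

print(bool(single.search("Built with react.")))        # True: case is ignored
print(bool(single.search("Reactive streams")))         # False: not a whole word
print(bool(phrase.search("cloud\n  functions")))       # True: spacing is flexible
print(bool(phrase.search("Functions in the cloud")))   # False: word order matters
```

Because `\b` only matches between a word character and a non-word character, a keyword that begins or ends with punctuation would need a different boundary check.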
- Depends on PDF text extraction quality (some PDFs with scanned content or unusual fonts may not extract properly)
- Case-insensitive matching is used, so capitalization variants are treated as the same keyword
This project is licensed under the MIT License - see the LICENSE for details.
Contributions are welcome! Please feel free to submit a Pull Request.