PDF Keyword Index Generator

Overview

This Python script extracts and indexes keywords from PDF documents. It scans through a specified PDF file to identify occurrences of predefined keywords and generates an organized report showing which pages contain each keyword. The script is particularly useful for quickly navigating large technical documents, research papers, or reference manuals.

Features

Keyword indexing: Scans PDF documents for specific keywords and records the page numbers where each appears
Multi-word support: Properly handles both single-word and multi-word phrases with flexible spacing
Page exclusion: Allows specific pages to be excluded from analysis (e.g., references, appendices)
Progress reporting: Shows real-time progress indicators during processing
Comprehensive output: Generates a text report containing found and not-found keywords
Error handling: Checks for file access and text extraction failures and reports them

Requirements

Python 3.6+
PyPDF2 library

Installation

Clone the repository:

```
git clone https://github.com/mihir-28/PDFKeyWordMapper.git
cd PDFKeyWordMapper
```

Install the required dependencies:
```
pip install PyPDF2
```

Usage

Place your PDF file in the same directory as the script or provide the correct path in the script.
Edit the pdf_path variable in the script to point to your PDF file:
```
pdf_path = "path/to/your/file"
```
Customize the list of extended_keywords to include terms relevant to your document.

If needed, update the pages_to_skip list to exclude specific pages from analysis:

```
# For individual pages: [1, 5, 9]
# For page ranges: list(range(37, 100)) + list(range(122, 126))
pages_to_skip = []
```

Run the script:
```
python keyword_mapper.py
```
Check the generated keywords_index.txt file for results.

Output Format

The output file keywords_index.txt contains:

```
Keyword Index Results:
Document: your_document.pdf
Total pages: 125
Pages skipped: 122, 123, 124, 125

FOUND KEYWORDS:
Authentication: 15, 42, 87
Blockchain: 3, 8, 10, 12, 45, 67
React: 56, 78, 92
...

NOT FOUND KEYWORDS:
ABI
Cloud Functions
...
```
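
A report in this layout could be assembled roughly as follows. This is a minimal sketch, not the script's actual code: the function name `format_report`, its parameters, and the alphabetical ordering of keywords are assumptions made for illustration.

```python
def format_report(document, total_pages, pages_skipped, index):
    """Build the report text from a {keyword: [page numbers]} dictionary."""
    # Split the index into keywords with at least one hit and keywords with none.
    found = {kw: pages for kw, pages in index.items() if pages}
    not_found = [kw for kw, pages in index.items() if not pages]

    lines = [
        "Keyword Index Results:",
        f"Document: {document}",
        f"Total pages: {total_pages}",
        "Pages skipped: " + ", ".join(str(p) for p in sorted(pages_skipped)),
        "",
        "FOUND KEYWORDS:",
    ]
    # Assumption: keywords are listed alphabetically, ignoring case.
    for kw in sorted(found, key=str.lower):
        lines.append(f"{kw}: " + ", ".join(str(p) for p in found[kw]))
    lines += ["", "NOT FOUND KEYWORDS:"]
    lines += sorted(not_found, key=str.lower)
    return "\n".join(lines) + "\n"


report = format_report(
    "your_document.pdf",
    125,
    [122, 123, 124, 125],
    {"React": [56, 78, 92], "ABI": [], "Authentication": [15, 42, 87]},
)
print(report)
```

The returned string can then be saved with `open("keywords_index.txt", "w", encoding="utf-8")` and a single `write` call.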

Configuration Options

You can easily customize the script by modifying:

pdf_path: Path to the PDF file to analyze
pages_to_skip: List of page numbers to exclude from analysis
extended_keywords: List of keywords to search for
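
As an illustration, a filled-in configuration might look like the sketch below. Only the three variable names come from the script; the file name, keywords, and page numbers are placeholder values, and page numbers are assumed to be 1-based as the sample output suggests.

```python
# Placeholder values for illustration; adjust them to your document.
pdf_path = "your_document.pdf"

# Skip page 1 plus pages 122-125 (the end of range() is exclusive).
pages_to_skip = [1] + list(range(122, 126))

extended_keywords = ["Authentication", "Blockchain", "Cloud Functions"]

# Optional refinement: a set makes the per-page "skip?" check cheap.
skip_set = set(pages_to_skip)
print(sorted(skip_set))
```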

How It Works

The script opens and validates the specified PDF file
It iterates through each page, extracting text content
For each page, it searches for all keywords in the predefined list
Single-word keywords use word boundary regex matching for accuracy
Multi-word phrases use flexible spacing pattern matching
Found keywords and their page numbers are compiled into a dictionary
Results are written to an output file with statistics
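
The steps above can be sketched as follows. This is a simplified outline under stated assumptions rather than the script's actual implementation: the function names `extract_page_texts` and `index_keywords` are hypothetical, and the extraction helper assumes PyPDF2 2.0 or newer, where `PdfReader`, `reader.pages`, and `page.extract_text()` are available.

```python
import re


def extract_page_texts(pdf_path):
    """Return one text string per page (assumes PyPDF2 2.0 or newer)."""
    from PyPDF2 import PdfReader  # imported lazily so the rest runs without PyPDF2

    reader = PdfReader(pdf_path)
    # Image-only pages may yield no text, so fall back to an empty string.
    return [page.extract_text() or "" for page in reader.pages]


def index_keywords(page_texts, keywords, pages_to_skip=()):
    """Map each keyword to the 1-based page numbers on which it occurs."""
    skip = set(pages_to_skip)
    # Compile one whole-word, case-insensitive pattern per keyword up front.
    patterns = {
        kw: re.compile(
            r"\b" + r"\s+".join(map(re.escape, kw.split())) + r"\b",
            re.IGNORECASE,
        )
        for kw in keywords
    }
    index = {kw: [] for kw in keywords}
    for page_number, text in enumerate(page_texts, start=1):
        if page_number in skip:
            continue
        for kw, pattern in patterns.items():
            if pattern.search(text):
                index[kw].append(page_number)
    return index


# Stand-in page texts so the example runs without a PDF file.
pages = ["Intro to blockchain", "React and Blockchain", "References: blockchain"]
result = index_keywords(pages, ["Blockchain", "React", "ABI"], pages_to_skip=[3])
print(result)
```

With a real document, `index_keywords(extract_page_texts(pdf_path), extended_keywords, pages_to_skip)` would produce the dictionary that the report is built from.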

Advanced Pattern Matching

The script uses regular expressions with word boundaries to ensure accurate matching:

For single words: \bkeyword\b matches the keyword only as a whole word, not as part of a longer word
For multi-word phrases: The pattern allows any amount of whitespace between words while preserving word order
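
One way to build such patterns is shown below. This is a sketch of the technique, not necessarily the script's exact code, and the helper name `build_pattern` is hypothetical. Note that `\b` only works as expected when the keyword starts and ends with a letter, digit, or underscore.

```python
import re


def build_pattern(keyword):
    """Compile a case-insensitive, whole-word pattern for a word or phrase."""
    # Escape each word so characters such as '.' are matched literally.
    words = [re.escape(word) for word in keyword.split()]
    # \s+ between words tolerates extra spaces, tabs, or line breaks.
    body = r"\s+".join(words)
    return re.compile(r"\b" + body + r"\b", re.IGNORECASE)


single = build_pattern("React")
phrase = build_pattern("Cloud Functions")

print(bool(single.search("Built with React.")))    # True: whole word
print(bool(single.search("A reactive system")))    # False: part of a longer word
print(bool(phrase.search("cloud\n   functions")))  # True: flexible whitespace
```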

Limitations

Depends on PDF text extraction quality (some PDFs with scanned content or unusual fonts may not extract properly)
Case-insensitive matching is used, so capitalization variants are treated as the same keyword
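
To spot pages affected by the first limitation, one option is to flag pages whose extracted text is empty or very short. The helper below is an illustrative sketch; `pages_without_text` and the `min_chars` threshold are assumptions, not part of the script.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is suspiciously short."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        # Treat a missing result (None) the same as an empty string.
        if len((text or "").strip()) < min_chars
    ]


sample = ["A normal page with plenty of extractable text.", "", "   \n", None]
print(pages_without_text(sample))
```

Pages reported here are likely scanned images or use fonts that the extractor cannot decode, so keywords on them would be missed.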

License

This project is licensed under the MIT License. See LICENSE.md for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
