PDF MCP

MCP server for PDF processing and analysis using PyPDFium2.

Features

extract_text: Extract text content from PDF files with page range support
extract_metadata: Extract PDF metadata including title, author, and page count
search_text: Search for specific text within PDF files with context
get_page_count: Get the total number of pages in a PDF file
extract_pages: Extract specific pages from a PDF and save as a new PDF
split_pdf: Split a PDF into one PDF per selected page, returned as base64-encoded data
merge_pdfs: Merge multiple PDF files into a single PDF
pdf_to_images: Convert PDF pages to PNG images with configurable DPI
get_form_fields: Extract all form fields from a PDF including names, types, and values
fill_form: Fill form fields in a PDF with provided values and save to output path

Installation

From Git Repository

# Clone the repository
git clone https://github.com/gzigurella/pdf-mcp.git
cd pdf-mcp

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install the package
pip install -e .

With uv (recommended)

# Clone and enter directory
git clone https://github.com/gzigurella/pdf-mcp.git
cd pdf-mcp

# Install with uv
uv pip install -e .

Integration

OpenCode

Add to your ~/.config/opencode/opencode.json:

{
  "mcpServers": {
    "pdf-mcp": {
      "type": "local",
      "command": [
        "/path/to/pdf-mcp/venv/bin/python",
        "-m",
        "pdf_mcp"
      ],
      "enabled": true
    }
  }
}

Claude Desktop

Add to your Claude Desktop config:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
Linux: ~/.config/Claude/claude_desktop_config.json

{
  "mcpServers": {
    "pdf-mcp": {
      "command": "/path/to/pdf-mcp/venv/bin/python",
      "args": ["-m", "pdf_mcp"]
    }
  }
}

Generic MCP Client

For any MCP-compatible client:

# Start the server directly
/path/to/venv/bin/python -m pdf_mcp

The server communicates over stdio using the Model Context Protocol (MCP).
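
For a hand-written client, a tool invocation is a JSON-RPC 2.0 request with the method tools/call, written to the server's stdin as a single line of JSON. The sketch below only builds that message. build_tool_call is a hypothetical helper, not part of pdf_mcp, and a real session would first need the MCP initialize handshake, which an MCP client library normally handles.

```python
import json


def build_tool_call(name: str, arguments: dict, request_id: int = 1) -> str:
    """Serialize an MCP tools/call request as a single line of JSON."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }
    # json.dumps escapes newlines inside strings, so the result stays on one line
    return json.dumps(message)


line = build_tool_call("get_page_count", {"file_path": "/path/to/document.pdf"})
print(line)
```

The name and arguments values have the same shape as the entries under Example Usage Scenarios.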

Tools

extract_text

Extract text content from a PDF file. Supports PDFs with searchable text and can extract text from specific pages or ranges.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the PDF file to extract text from
pages	string	No	"all"	Page range to extract (e.g., '1-5', '3,7,9', 'all')

{
  "file_path": "/path/to/document.pdf",
  "pages": "1-5"
}
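
To make the page range syntax concrete, here is a minimal sketch of how a string such as '1-5' or '3,7,9' could be expanded into page numbers. parse_page_range is a hypothetical helper and not the server's actual parser. It assumes pages are numbered from 1 and that an out-of-range page is an error.

```python
def parse_page_range(spec: str, page_count: int) -> list[int]:
    """Expand a spec such as 'all', '1-5' or '1,3-5' into 1-based page numbers."""
    if spec.strip().lower() == "all":
        return list(range(1, page_count + 1))
    pages: list[int] = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Assumption: ranges must be ascending and lie inside the document
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"invalid page range {part!r} for a {page_count}-page PDF")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_range("1,3-5", page_count=10))  # [1, 3, 4, 5]
```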

extract_metadata

Extract metadata from a PDF file including title, author, subject, keywords, creator, producer, creation date, modification date, and page count.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the PDF file to extract metadata from

{
  "file_path": "/path/to/document.pdf"
}

search_text

Search for specific text within a PDF file. Returns page numbers and context around the found text. Useful for finding specific content in large documents.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the PDF file to search within
query	string	Yes	-	Text to search for in the PDF
case_sensitive	boolean	No	false	Whether to perform case-sensitive search
context_words	integer	No	10	Number of words to include before and after each match

{
  "file_path": "/path/to/document.pdf",
  "query": "important term",
  "case_sensitive": false,
  "context_words": 5
}
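
The effect of case_sensitive and context_words can be illustrated on plain text. The sketch below is an assumption about the behavior, not the server's implementation: find_with_context is a hypothetical function that returns each match with up to context_words whitespace-separated words on either side.

```python
import re


def find_with_context(text: str, query: str, case_sensitive: bool = False,
                      context_words: int = 10) -> list[str]:
    """Return each match of query with up to context_words words on either side."""
    flags = 0 if case_sensitive else re.IGNORECASE
    snippets = []
    for match in re.finditer(re.escape(query), text, flags):
        before = text[:match.start()].split()
        after = text[match.end():].split()
        # Slice from an explicit index: before[-0:] would return the whole list
        leading = before[max(0, len(before) - context_words):]
        snippets.append(" ".join(leading + [match.group()] + after[:context_words]))
    return snippets


page_text = "The supplier accepts the Liability Clause set out in section four of this contract."
print(find_with_context(page_text, "liability clause", context_words=2))
# ['accepts the Liability Clause set out']
```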

get_page_count

Get the total number of pages in a PDF file. Returns a simple integer count.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the PDF file to count pages for

{
  "file_path": "/path/to/document.pdf"
}

extract_pages

Extract specific pages from a PDF file and save as a new PDF. Supports page ranges and individual page selection.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the source PDF file
pages	string	Yes	-	Pages to extract (e.g., '1-5', '3,7,9', '1,3-5')
output_path	string	Yes	-	Path where the extracted pages will be saved as a new PDF

{
  "file_path": "/path/to/source.pdf",
  "pages": "1,3,5-7",
  "output_path": "/path/to/output.pdf"
}

split_pdf

Split a PDF file into separate PDFs based on a page range. Returns a JSON object containing one base64-encoded PDF for each selected page. Supports single pages, page ranges, and all pages.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the PDF file to split
page_range	string	Yes	-	Page range to split - 'all', single page (e.g., '1'), or range (e.g., '1-3', '2-5')

{
  "file_path": "/path/to/document.pdf",
  "page_range": "1-3"
}
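
Because the result is base64-encoded, a client has to decode each entry before writing it to disk. The exact layout of the response is not documented here, so the sketch below takes a plain mapping of file name to base64 string. That mapping, the file names, and save_base64_files itself are assumptions for illustration.

```python
import base64
import tempfile
from pathlib import Path


def save_base64_files(encoded: dict[str, str], output_dir: str) -> list[Path]:
    """Decode each base64 string and write it to output_dir under its key."""
    target = Path(output_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, payload in encoded.items():
        path = target / filename
        path.write_bytes(base64.b64decode(payload))
        written.append(path)
    return written


# Placeholder bytes stand in for a real page returned by the server
sample = {"page_1.pdf": base64.b64encode(b"%PDF-1.7 placeholder").decode("ascii")}
with tempfile.TemporaryDirectory() as tmp:
    for path in save_base64_files(sample, tmp):
        print(path.name, path.stat().st_size, "bytes")
```

The same decoding step would apply to the base64-encoded PNG images returned by pdf_to_images.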

merge_pdfs

Merge multiple PDF files into a single PDF. Files are merged in the order provided.

Parameter	Type	Required	Default	Description
file_paths	array	Yes	-	List of PDF file paths to merge
output_path	string	Yes	-	Path where the merged PDF will be saved

{
  "file_paths": ["/path/to/doc1.pdf", "/path/to/doc2.pdf", "/path/to/doc3.pdf"],
  "output_path": "/path/to/merged.pdf"
}

pdf_to_images

Convert PDF pages to PNG images. Returns a JSON object containing a base64-encoded PNG image for each page. The dpi parameter controls the output resolution.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the PDF file to convert to images
dpi	integer	No	150	Image resolution in dots per inch
format	string	No	"png"	Image format (PNG only)

{
  "file_path": "/path/to/document.pdf",
  "dpi": 300,
  "format": "png"
}
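
The dpi value determines the pixel size of each image. PDF page sizes are measured in points of 1/72 inch, so the scale factor is dpi / 72. The following sketch estimates the output dimensions. rendered_size is a hypothetical helper, and the server's own rounding may differ by a pixel.

```python
def rendered_size(width_pt: float, height_pt: float, dpi: int = 150) -> tuple[int, int]:
    """Convert a page size in PDF points (1/72 inch) to pixel dimensions at dpi."""
    scale = dpi / 72  # pixels per point
    return round(width_pt * scale), round(height_pt * scale)


# US Letter is 612 x 792 points
print(rendered_size(612, 792))           # (1275, 1650)
print(rendered_size(612, 792, dpi=300))  # (2550, 3300)
```

Doubling the DPI quadruples the pixel count, which matters for memory use and response size on long documents.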

get_form_fields

Extract all form fields from a PDF document including field names, types, current values, and available choices for dropdown fields.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the PDF file to extract form fields from

{
  "file_path": "/path/to/form.pdf"
}

Returns a JSON object with field information:

{
  "fields": [
    {
      "name": "first_name",
      "type": "text",
      "value": "",
      "page": 1,
      "rect": {"x0": 50, "y0": 72, "x1": 150, "y1": 92}
    },
    {
      "name": "country",
      "type": "combobox",
      "value": "",
      "page": 1,
      "rect": {...},
      "choices": ["USA", "Canada", "UK"]
    },
    {
      "name": "accept_terms",
      "type": "checkbox",
      "value": "",
      "page": 1,
      "rect": {...},
      "on_state": "Yes"
    }
  ],
  "total_fields": 3
}

fill_form

Fill form fields in a PDF document with provided values and save to output path. Supports text fields, checkboxes, radio buttons, and dropdowns.

Parameter	Type	Required	Default	Description
file_path	string	Yes	-	Path to the source PDF file
fields	object	Yes	-	Dictionary of field names and their values to fill
output_path	string	Yes	-	Path where the filled PDF will be saved

{
  "file_path": "/path/to/form.pdf",
  "fields": {
    "first_name": "John",
    "last_name": "Doe",
    "country": "USA",
    "accept_terms": true
  },
  "output_path": "/path/to/filled_form.pdf"
}

Checkbox values accept true/false, "yes"/"no", or "1"/"0". For radio buttons, pass the on_state value of the option to select; call get_form_fields first to look it up.
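
Since fill_form depends on field names and choices reported by get_form_fields, it can help to check the values on the client side first. The sketch below assumes the fields list has the shape shown in the get_form_fields example. build_fill_values is a hypothetical helper, and the server may apply its own, different validation.

```python
def build_fill_values(form_fields: list[dict], answers: dict) -> dict:
    """Check answers against get_form_fields output and return a fields mapping."""
    by_name = {field["name"]: field for field in form_fields}
    values = {}
    for name, answer in answers.items():
        if name not in by_name:
            raise KeyError(f"no form field named {name!r}")
        choices = by_name[name].get("choices")
        # Only fields that list choices (such as a combobox) are restricted
        if choices is not None and answer not in choices:
            raise ValueError(f"{answer!r} is not one of {choices} for field {name!r}")
        values[name] = answer
    return values


form_fields = [
    {"name": "first_name", "type": "text", "value": ""},
    {"name": "country", "type": "combobox", "value": "", "choices": ["USA", "Canada", "UK"]},
    {"name": "accept_terms", "type": "checkbox", "value": "", "on_state": "Yes"},
]
answers = {"first_name": "John", "country": "USA", "accept_terms": True}
print(build_fill_values(form_fields, answers))
```

The returned dictionary can be passed as the fields argument of fill_form.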

Configuration

Environment Variables

Variable	Default	Description
PDF_MCP_DEBUG	false	Enable debug logging

# Example
export PDF_MCP_DEBUG=true
python -m pdf_mcp

Development

Running Tests

source venv/bin/activate
pytest

# With coverage
pytest --cov=src --cov-report=html

Project Structure

pdf-mcp/
├── src/pdf_mcp/
│   ├── __init__.py
│   ├── __main__.py
│   ├── server.py
│   ├── config.py
│   └── tools/
│       ├── __init__.py
│       ├── extract_text.py
│       ├── extract_metadata.py
│       ├── search_text.py
│       ├── get_page_count.py
│       ├── extract_pages.py
│       ├── split_pdf.py
│       ├── merge_pdfs.py
│       ├── pdf_to_images.py
│       ├── get_form_fields.py
│       └── fill_form.py
├── tests/
├── pyproject.toml
└── README.md

Troubleshooting

Installation Issues

If you encounter installation errors, ensure you have Python 3.10 or later:

python --version

File Not Found Errors

Make sure the PDF file paths are correct and the files exist:

ls -l /path/to/your/document.pdf

Encrypted PDFs

The tools raise a RuntimeError when asked to process an encrypted PDF. Make sure your PDFs are not password-protected.

Memory Issues with Large PDFs

For very large PDF files, consider processing them in smaller chunks using the extract_pages or split_pdf tools.
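
One way to plan the chunks is to take the result of get_page_count and generate consecutive page ranges to pass to extract_pages or split_pdf one at a time. chunk_page_ranges and the chunk size of 50 are assumptions for illustration.

```python
def chunk_page_ranges(page_count: int, chunk_size: int = 50) -> list[str]:
    """Split pages 1..page_count into consecutive range strings."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    ranges = []
    for start in range(1, page_count + 1, chunk_size):
        end = min(start + chunk_size - 1, page_count)
        # A chunk holding one page is written as a single page number
        ranges.append(str(start) if start == end else f"{start}-{end}")
    return ranges


print(chunk_page_ranges(120))  # ['1-50', '51-100', '101-120']
```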

Permission Errors (Linux)

If you encounter permission errors, ensure the PDF files are readable:

chmod +r /path/to/your/document.pdf

Security Considerations

File Access: The server only processes files that exist and are readable by the running process
Path Validation: All file paths are validated before processing
No Network Access: The server does not make any network requests
Temporary Files: Temporary files are properly cleaned up after processing
Error Handling: Sensitive information is not exposed in error messages
Encrypted PDFs: Password-protected PDFs are rejected with appropriate error messages
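
How the server validates paths is not described here. As an illustration only, a client could run a similar check before calling a tool. check_pdf_path is a hypothetical helper, and the rules it applies (existing file, .pdf suffix, readable) are assumptions rather than the server's actual checks.

```python
import os
import tempfile
from pathlib import Path


def check_pdf_path(file_path: str) -> Path:
    """Resolve file_path and confirm it is an existing, readable .pdf file."""
    path = Path(file_path).expanduser().resolve()
    if not path.is_file():
        raise FileNotFoundError(f"not a file: {path}")
    if path.suffix.lower() != ".pdf":
        raise ValueError(f"not a .pdf file: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"file is not readable: {path}")
    return path


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "report.pdf"
    sample.write_bytes(b"%PDF-1.7")
    print(check_pdf_path(str(sample)).name)  # report.pdf
```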

Example Usage Scenarios

Scenario 1: Extract Text from Specific Pages

{
  "name": "extract_text",
  "arguments": {
    "file_path": "/documents/report.pdf",
    "pages": "1-3,7,9"
  }
}

Scenario 2: Search and Extract Context

{
  "name": "search_text",
  "arguments": {
    "file_path": "/documents/contract.pdf",
    "query": "liability clause",
    "case_sensitive": true,
    "context_words": 15
  }
}

Scenario 3: Merge Multiple Reports

{
  "name": "merge_pdfs",
  "arguments": {
    "file_paths": [
      "/reports/q1.pdf",
      "/reports/q2.pdf",
      "/reports/q3.pdf",
      "/reports/q4.pdf"
    ],
    "output_path": "/reports/annual.pdf"
  }
}

Scenario 4: Convert PDF to Images

{
  "name": "pdf_to_images",
  "arguments": {
    "file_path": "/documents/presentation.pdf",
    "dpi": 300
  }
}

License

MIT
