pdf-json-parser

Objective

The goal of this project is to design and implement a Python-based program that can read a PDF file and convert its contents into a well-structured JSON format. Instead of extracting plain text, the program aims to preserve the logical hierarchy of the document such as pages, sections, sub-sections, paragraphs, tables, and charts/images. This structured JSON output can be further used in data analytics, search engines, or knowledge management systems. Requirements

Input & Output

• Input: A PDF file (e.g., financial reports, fact sheets, research papers). • Output: A JSON file that organizes extracted content into a clear structure with types like paragraph, table, chart.

Libraries Used

pdfplumber → for extracting text and simple tables.
PyMuPDF (fitz) → for extracting images and handling layout.
streamlit → for creating a user interface.
camelot (optional) → for advanced table extraction.

Methodology

The project follows a modular approach, where each step of PDF parsing is handled separately:

Text & Paragraph Extraction o Text is extracted using pdfplumber. o Heuristics are applied to split text into paragraphs by checking line breaks, sentence endings, and spacing.
Heading & Section Detection o Font size and capitalization are analyzed to identify sections and sub-sections. o Example: Larger font or uppercase lines are treated as section headers.
Table Extraction o Tables are identified using pdfplumber’s built-in table detection. o Optionally, camelot is used for more accurate extraction of complex tables.
Chart/Image Extraction o Images (likely charts or figures) are extracted using PyMuPDF. o They are saved as separate PNG files and referenced inside the JSON output.
JSON Structuring o Each page is represented as an object in JSON. o Inside each page, content is divided into paragraph, table, or chart. o Section headers are linked to paragraphs when possible.

JSON Structure (Example)

{ "pages": [ { "page_number": 1, "content": [ { "type": "paragraph", "section": "Introduction", "sub_section": "Background", "text": "This project demonstrates how PDF content can be structured..." }, { "type": "table", "section": "Financial Data", "description": null, "table_data": [ ["Year", "Revenue", "Profit"], ["2022", "$10M", "$2M"], ["2023", "$12M", "$3M"] ] }, { "type": "chart", "section": "Performance Overview", "description": "Chart showing revenue growth over years", "image_path": "extracted_images/p1_img_0.png", "width": 1200, "height": 600 } ] } ] }

Conclusion

This project successfully demonstrates how a PDF file can be transformed into a machine-readable JSON format while preserving the document hierarchy and content types. The solution can: • Extract clean text paragraphs, • Detect and store tables, • Capture charts/images for reference, • Organize everything in a structured JSON. Such a system is highly valuable for businesses, researchers, and developers who need to convert unstructured documents into structured data for analytics, automation, and search applications.

OUTPUT

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
README.md		README.md
app.py		app.py
pdf_to_json.py		pdf_to_json.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

pdf-json-parser

About

Uh oh!

Releases

Packages

Languages

pavan-git21/pdf-json-parser

Folders and files

Latest commit

History

Repository files navigation

pdf-json-parser

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages