🔹 Project: PDF → Structured Data with Hugging Face

✅ Goal

Upload a PDF → Extract:

Title

Authors

Summary

Return the result as structured JSON, using only information that is actually present in the PDF (no hallucination).
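To make the target concrete, here is a minimal sketch of the output record. The `PaperMetadata` class and its `to_json` helper are illustrative names, not part of any library used in this notebook; the three fields simply mirror the goal above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class PaperMetadata:
    """Hypothetical container for the three fields this project extracts."""
    title: str = ""
    authors: list[str] = field(default_factory=list)
    summary: str = ""

    def to_json(self) -> str:
        # ensure_ascii=False keeps non-ASCII author names readable in the file.
        return json.dumps(asdict(self), indent=2, ensure_ascii=False)


# Placeholder values, only to show the shape of the record.
record = PaperMetadata(
    title="Example title",
    authors=["A. Author"],
    summary="One sentence.",
)
print(record.to_json())
```

Keeping the shape in one place makes it easier to check later that every field was filled from the PDF text rather than invented.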

In [3]:
# Step 1: Install requirements
!pip install pypdf2 transformers sentence-transformers accelerate

Collecting pypdf2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: pypdf2
Successfully installed pypdf2-3.0.1


In [4]:
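# Step 2: Extract Text from PDF
from PyPDF2 import PdfReader

def read_pdf(file_path):
    """Return the text of every page in the PDF, concatenated in page order."""
    reader = PdfReader(file_path)
    # Guard against pages that yield no text (e.g. scanned, image-only pages)
    # by falling back to an empty string before joining.
    return "".join(page.extract_text() or "" for page in reader.pages)

pdf_text = read_pdf("/content/sample1.pdf")
print(pdf_text[:500])  # preview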

                                 DEPARTMENT OF  
          COMPUTER SCIENCE AND ENGINEERING  
                              DIGITAL NOTES  
ON                                                
DEEP LEARNING  
(R20A6610)  
  
  
                                      Prepared by  
                                      K.Chandusha  
 
 
    MALLA REDDY COLLEGE OF                             
ENGINEERING&TECHNOLOGY  (AutonomousInstitution –UGC,Govt.of India) 
Recognizedunder2(f)and12(B)ofUGCACT 1956  
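
The preview above carries a lot of layout padding from the PDF. A minimal clean-up sketch, assuming only whitespace needs fixing, could look like this. `normalize_whitespace` is a hypothetical helper, and it will not repair words that the extractor fused together, such as `Recognizedunder2(f)and12(B)ofUGCACT`.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse layout padding left behind by PDF text extraction."""
    cleaned_lines = []
    for line in text.splitlines():
        # Collapse runs of spaces/tabs inside the line and trim both ends.
        cleaned_lines.append(re.sub(r"[ \t]+", " ", line).strip())
    # Collapse runs of blank lines into a single blank line.
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(cleaned_lines))
    return collapsed.strip()


# Hand-made snippet shaped like the preview above.
sample = (
    "          DEPARTMENT OF  \n"
    "   COMPUTER SCIENCE AND ENGINEERING  \n"
    "  \n"
    "  \n"
    "   Prepared by  \n"
    "   K.Chandusha  "
)
print(normalize_whitespace(sample))
```

Running the cleaned text through the later steps should make the first-line title heuristic less sensitive to indentation.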


Step 3: Use Hugging Face Models

We’ll use:

Summarization → facebook/bart-large-cnn

NER (Named Entity Recognition) → dslim/bert-base-NER (to detect authors, names)

Title extraction → heuristic: take the first line(s) of the PDF text, or fall back to the summary
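
The "first lines" title heuristic can be sketched as below. This is one possible reading of it, assuming the title is the first block of non-empty lines on the cover page; `guess_title` and its `max_chars` parameter are hypothetical names.

```python
def guess_title(text: str, max_chars: int = 200) -> str:
    """Join the first block of non-empty lines into a single title string."""
    block = []
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            block.append(stripped)
        elif block:
            # The first blank line after some content ends the title block.
            break
    # Collapse any leftover runs of spaces inside the joined title.
    title = " ".join(" ".join(block).split())
    return title[:max_chars]


# Hand-made snippet shaped like a cover page.
sample = (
    "      DEPARTMENT OF  \n"
    "   DIGITAL NOTES  \n"
    "ON        \n"
    "DEEP LEARNING  \n"
    "  \n"
    "   Prepared by  \n"
)
print(guess_title(sample))
```

Because the title is copied from the text rather than generated, it cannot contain words that are absent from the PDF, although it may still pick up a department or institution line.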

In [5]:
from transformers import pipeline

# Summarizer for abstract/summary
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# NER for authors
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

# Extract summary
summary = summarizer(pdf_text[:2000], max_length=150, min_length=50, do_sample=False)[0]['summary_text']

# Extract names (potential authors)
entities = ner(pdf_text[:1000])  # first 1000 chars
authors = [ent['word'] for ent in entities if ent['entity_group'] == "PER"]

# Extract title (heuristic: first line of the text, capped at 200 chars)
title = pdf_text.split("\n")[0][:200]

print("Title:", title)
print("Authors:", authors)
print("Summary:", summary)


Title:                                  DEPARTMENT OF  
Authors: ['CO', '##MP', 'K', 'Chandus', 'MA']
Summary: The paper was written by K.Chandusha, an assistant professor at the Malla RedDy College of Engineering and Science, Secunderabad, India. The paper was published by the Department of Computer Science and Engineering of the University of Hyderabad. It was written in the form of an open-source code.
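
The author list above contains word-piece fragments such as `##MP` and two-letter tokens. Two possible clean-ups are sketched here: a rule-based lookup that relies on a cue line like "Prepared by", and a filter over the NER output. Both helpers, the cue phrases, and the `min_score` / `min_length` thresholds are assumptions for illustration; the scores in the demo entities are invented, not taken from the run above.

```python
import re

# Hypothetical cue phrases; extend the list for other document styles.
AUTHOR_CUES = re.compile(
    r"^(?:prepared by|written by|authors?)\b\s*:?\s*(.*)$", re.IGNORECASE
)


def authors_from_cue(text: str) -> list[str]:
    """Return the name(s) found on, or right after, a cue line."""
    lines = [line.strip() for line in text.splitlines()]
    for index, line in enumerate(lines):
        match = AUTHOR_CUES.match(line)
        if not match:
            continue
        # Names may follow the cue on the same line or on the next non-empty line.
        candidate = match.group(1).strip()
        if not candidate:
            candidate = next((nxt for nxt in lines[index + 1:] if nxt), "")
        parts = re.split(r",|\band\b", candidate)
        return [name.strip() for name in parts if name.strip()]
    return []


def filter_ner_people(entities, min_score=0.9, min_length=3):
    """Keep PER entities that look like whole words with a confident score."""
    names = []
    for ent in entities:
        word = ent["word"].strip()
        if ent["entity_group"] != "PER":
            continue
        if word.startswith("##") or len(word) < min_length:
            continue  # word-piece fragment or too short to be a name
        if ent["score"] < min_score:
            continue
        names.append(word)
    # dict.fromkeys removes duplicates while preserving order.
    return list(dict.fromkeys(names))


print(authors_from_cue("DEEP LEARNING\n   Prepared by  \n   K.Chandusha  \n"))

# Invented entities in the shape returned with aggregation_strategy="simple".
fake_entities = [
    {"entity_group": "PER", "word": "CO", "score": 0.55},
    {"entity_group": "PER", "word": "##MP", "score": 0.60},
    {"entity_group": "PER", "word": "Chandus", "score": 0.97},
    {"entity_group": "ORG", "word": "Malla Reddy College", "score": 0.95},
]
print(filter_ner_people(fake_entities))
```

The filter drops fragments but cannot restore a truncated name, so for cover pages like this one the cue-based lookup is likely the more reliable of the two.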


In [6]:
# Step 4: Build Structured JSON
import json

metadata = {
    "title": title,
    "authors": list(set(authors)),  # remove duplicates (set() does not preserve order)
    "summary": summary
}

with open("paper_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2, ensure_ascii=False)

print("✅ Saved structured data:", metadata)


✅ Saved structured data: {'title': '                                 DEPARTMENT OF  ', 'authors': ['MA', 'CO', '##MP', 'Chandus', 'K'], 'summary': 'The paper was written by K.Chandusha, an assistant professor at the Malla RedDy College of Engineering and Science, Secunderabad, India. The paper was published by the Department of Computer Science and Engineering of the University of Hyderabad. It was written in the form of an open-source code.'}
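
Since the goal is JSON with no hallucination, it helps to check the generated summary against the source text before saving it. The sketch below is a crude lexical check, not a factuality model: it flags capitalized words in the summary that never occur in the source. `unsupported_terms` and the small `STOPWORDS` set are hypothetical, and the demo strings are hand-made rather than taken from the PDF.

```python
import re

# Tiny illustrative stopword list so sentence-initial words are not flagged.
STOPWORDS = {"the", "a", "an", "it", "this", "that", "these", "those", "in", "on"}


def unsupported_terms(summary: str, source: str) -> list[str]:
    """List capitalized words in the summary that never occur in the source."""
    source_words = {word.lower() for word in re.findall(r"[A-Za-z]+", source)}
    flagged = []
    for word in re.findall(r"\b[A-Z][A-Za-z]+\b", summary):
        lowered = word.lower()
        if lowered in STOPWORDS or lowered in source_words:
            continue
        if word not in flagged:
            flagged.append(word)
    return flagged


# Hand-made strings, only to show how the check behaves.
source = "DEEP LEARNING Prepared by K.Chandusha MALLA REDDY COLLEGE OF ENGINEERING"
summary = "The notes were prepared by Chandusha at the University of Hyderabad."
print(unsupported_terms(summary, source))  # ['University', 'Hyderabad']
```

A non-empty result is a signal to review the summary or fall back to an extractive one; an empty result does not prove the summary is faithful.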
