Research Paper Summarizer

# Step 1: Install required libraries


This step installs the necessary **Python libraries** (*Transformers*, *PyPDF2* and *Torch*) to enable text summarization and PDF processing.


The `!pip install` command runs in Jupyter or Google Colab to quietly (`-q`) install `transformers`, `PyPDF2`, and `torch`. `transformers` offers pre-trained BART for summarization, `PyPDF2` extracts PDF text, and `torch` supports BART via PyTorch. This ensures all required libraries are installed before proceeding with the summarization process.

In [1]:
# Step 1: Install required libraries
!pip install transformers PyPDF2 torch -q


# Step 2: Import necessary modules

This step imports the required Python modules to handle **PDF extraction**, **text summarization** and **file uploads**.

The code imports `BartTokenizer` and `BartForConditionalGeneration` from `transformers` to tokenize text and generate summaries with the BART model. `PyPDF2` handles PDF text extraction, while `files` from `google.colab` facilitates file uploads in Colab. The `re` module, for regular expressions, cleans text. These imports prepare essential tools for later steps.

In [2]:
# Step 2: Import necessary modules
from transformers import BartTokenizer, BartForConditionalGeneration
import PyPDF2
from google.colab import files
import re

# Step 3: Load the pre-trained BART model and tokenizer

This step loads the BART model and tokenizer from the pre-trained `facebook/bart-large-cnn` checkpoint.

The `model_name` variable names the `facebook/bart-large-cnn` checkpoint, a BART model fine-tuned for summarization on the CNN/DailyMail news dataset. `BartTokenizer.from_pretrained` loads the tokenizer that converts text into model input, and `BartForConditionalGeneration.from_pretrained` loads the summarization model itself. Reusing a pre-trained checkpoint avoids the time and compute cost of training a summarizer from scratch.

In [3]:
# Step 3: Load the pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name)




# Step 4: Function to extract text from PDF

This step defines a function to extract and clean text from a PDF file for summarization.

The `extract_text_from_pdf` function reads a PDF with `PyPDF2.PdfReader`, extracts the text of each page with `extract_text()`, and concatenates the results into one string. The call `re.sub(r'\s+', ' ', text).strip()` then collapses every run of whitespace (spaces, tabs, line breaks) into a single space and trims leading and trailing whitespace, so the model receives one continuous block of text.

In [4]:
# Step 4: Function to extract text from PDF
def extract_text_from_pdf(pdf_file):
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    text = ""
    for page in pdf_reader.pages:
        # Fall back to "" in case a page yields no text (e.g. a scanned image)
        text += page.extract_text() or ""
    # Clean the text: remove extra spaces and line breaks
    text = re.sub(r'\s+', ' ', text).strip()
    return text
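
The whitespace cleanup above does not rejoin words that the PDF hyphenated at a line end, which is why fragments such as "fac- tors" appear in the sample output at the end of this notebook. The following is a minimal sketch of one possible extra cleaning pass; `clean_extracted_text` is a hypothetical helper, not part of the original notebook, and the hyphen rule is a heuristic that would also merge a genuine compound word if it happened to break at a line end.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Normalize text pulled out of a PDF (hypothetical helper)."""
    # Rejoin words split at a line end, e.g. "fac-\ntors" -> "factors".
    # Heuristic: only fires when lowercase letters surround "-<whitespace>".
    text = re.sub(r'(?<=[a-z])-\s+(?=[a-z])', '', raw)
    # Collapse every run of whitespace into a single space, as in Step 4.
    return re.sub(r'\s+', ' ', text).strip()
```

Hyphens that are not followed by whitespace, as in "hyper-impulsive", are left untouched.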

# Step 5: Function to generate summary of main content

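This step defines a function that uses the BART model to generate a summary of the extracted text, targeting roughly 500 words.

The `generate_summary` function tokenizes the text and truncates it to 1024 tokens, the input limit of `facebook/bart-large-cnn`, returning a PyTorch tensor; for a long paper this means only the opening portion is summarized. BART then generates the summary with `max_length=550`, `min_length=450`, `length_penalty=1.0`, `num_beams=6` for coherence, and `no_repeat_ngram_size=3` to avoid repetition. Both length limits count tokens rather than words, so the result usually comes out somewhat under 500 words. The decoded summary is split into sentences, placed one per line, and cut to the first 33 sentences (500 / 15, assuming about 15 words per sentence) if it exceeds 500 words.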
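
Because the input is truncated, text past the first 1024 tokens never reaches the model. One common workaround, shown here only as a sketch and not as part of the original notebook, is to split the text into chunks and summarize each one. The helper names `split_into_chunks` and `summarize_long_text` are hypothetical, and the 600-word chunk size is an unmeasured assumption about what fits within the token limit. The summarizer is passed in as a callable so the sketch runs without loading a model.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 600) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def summarize_long_text(text: str,
                        summarize: Callable[[str], str],
                        max_words: int = 600) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    partial = [summarize(chunk) for chunk in split_into_chunks(text, max_words)]
    return "\n".join(partial)
```

In this notebook the call would look like `summarize_long_text(pdf_text, generate_summary)`. The per-chunk `min_length` and `max_length` would likely need lowering, since each chunk is much shorter than a full paper.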

In [5]:
# Step 5: Function to generate summary of main content
def generate_summary(text):
    # Tokenize the input text
    inputs = tokenizer(text, max_length=1024, truncation=True, return_tensors="pt")
    # Generate summary focusing on main content, aiming for ~500 words
    summary_ids = model.generate(
        inputs["input_ids"],
        max_length=550,  # Upper bound on summary length, in tokens (not words)
        min_length=450,  # Lower bound on summary length, in tokens
        length_penalty=1.0,  # Length exponent used when scoring beams; 1.0 is the library default
        num_beams=6,  # Wider beam search for better coherence
        no_repeat_ngram_size=3,  # Never repeat the same 3-token sequence
        early_stopping=True
    )
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    # Split into sentences and join with newlines
    sentences = re.split(r'(?<=[.!?])\s+', summary)
    formatted_summary = "\n".join(sentences)
    # Rough word count check and trim if needed
    words = formatted_summary.split()
    if len(words) > 500:
        trimmed_sentences = formatted_summary.split("\n")[:int(500 / 15)]  # ~15 words per sentence
        formatted_summary = "\n".join(trimmed_sentences)
    return formatted_summary
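
The trimming step above keeps a fixed number of sentences, so the result can still exceed 500 words when sentences are long. As an alternative sketch (`trim_to_word_budget` is a hypothetical helper, not part of the original notebook), whole sentences can be added until the word budget would be exceeded. It uses the same sentence-splitting pattern as `generate_summary`.

```python
import re
from typing import List


def trim_to_word_budget(summary: str, max_words: int = 500) -> str:
    """Keep whole sentences, one per line, without exceeding max_words."""
    sentences = re.split(r'(?<=[.!?])\s+', summary.strip())
    kept: List[str] = []
    total = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Always keep the first sentence, even if it alone exceeds the budget.
        if kept and total + n > max_words:
            break
        kept.append(sentence)
        total += n
    return "\n".join(kept)
```

Like the original split, this pattern treats every period followed by whitespace as a sentence end, so abbreviations such as "U.S." still cause a line break.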

# Step 6: Upload PDF and process

This step prompts the user to upload a PDF file and extracts its text for summarization.

The code calls `files.upload()` from `google.colab` to open an upload prompt in Colab and stores the result in the `uploaded` dictionary, which maps file names to file contents. It takes the first file name from the dictionary’s keys, opens that file in binary read mode (`"rb"`), passes it to `extract_text_from_pdf`, and saves the extracted text in `pdf_text`. This connects user input to the summarization process.
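
Taking the first key assumes the user uploaded exactly one file and that it is a PDF. A slightly more defensive variant is sketched below. `pick_pdf_filename` is a hypothetical helper, not part of the original notebook, and it relies only on the upload result being a mapping from file names to contents.

```python
from typing import Mapping


def pick_pdf_filename(uploaded: Mapping[str, bytes]) -> str:
    """Return the first uploaded file name ending in .pdf (hypothetical helper)."""
    for name in uploaded:
        if name.lower().endswith(".pdf"):
            return name
    raise ValueError("No PDF file found among the uploaded files.")
```

With this helper, the lookup becomes `pdf_filename = pick_pdf_filename(uploaded)`, and an upload without a PDF fails with a clear message instead of an error inside `PyPDF2`.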

In [6]:
# Step 6: Upload PDF and process
print("Please upload your research paper PDF:")
uploaded = files.upload()

# Get the uploaded file name
pdf_filename = list(uploaded.keys())[0]

# Extract text from the uploaded PDF
with open(pdf_filename, "rb") as pdf_file:
    pdf_text = extract_text_from_pdf(pdf_file)

Please upload your research paper PDF:


Saving Paper 19.pdf to Paper 19.pdf


# Step 7: Generate and display the summary

This step generates and prints a summary of the PDF text if extraction is successful, or an error message if it fails.

The code checks whether `pdf_text` has content; if so, it prints a "Generating summary..." notice, calls `generate_summary`, and prints the result one sentence per line. If `pdf_text` is empty (e.g., because the PDF has no extractable text layer), it prints a failure message instead. This final step completes the process.

In [7]:
# Step 7: Generate and display the summary
if pdf_text:
    print("\nGenerating summary...\n")
    summary = generate_summary(pdf_text)
    print(summary)
else:
    print("No text extracted from the PDF.")


Generating summary...

Attention deficit hyperactivity disorder (ADHD) is an early child - hood neurodevelopmental condition that in general continues to adulthood.
The abnormal functioning of the brain and associative neural systems are greatly responsible for the occurrence of ADHD.
Environmental fac- tors such as alcoholism, drug use, and smoking in the pregnancy period, and in some cases the family hereditary are also the contributing ele- ments for the onset of ADHD (Tor et al., 2021 ).
ADHD children face difficulty in following instructions, paying and sustaining attention to any object or task, and show behavioral shifts and hyper-impulsive pe- culiarities (American Psychiatric Association (APA), 2013 ).
The Center for Disease Control and Prevention (CDC) has estimated that around 6.1 million children aged 2–17 years in the U.S.
had been diagnosed with ADHD according to the National Survey of Children ’s Health (NSCH) conducted in 2016.
Also, boys were more affected by ADHD tha