# Simple PDF Analyzer (any PDF)

The following sections will demonstrate how to:
1. load a standard PDF (e.g., an article or a scientific paper)
2. extract continuous text from it using PyPDF2
3. summarize the information with a small LLM

Here's how we do it.

## Loading a PDF file

First off, we need to read the PDF into Python. To do that, we'll again use the `PdfReader` class from [PyPDF2](https://pypdf2.readthedocs.io/en/3.0.0/modules/PdfReader.html#PyPDF2.PdfReader).
In contrast to before, though, this PDF has no fields, so we need to extract the text page by page instead.

Here's what that looks like:

In [1]:
from PyPDF2 import PdfReader

pdf_path = "data/Why Dashboarding is Amazing.pdf"
reader = PdfReader(pdf_path)
text = ""

# the PdfReader instance does not have fields now, but it has pages which we extract text from:
for page in reader.pages:
    page_text = page.extract_text()
    if page_text:
        text += page_text + "\n"

print(text)


ModuleNotFoundError: No module named 'PyPDF2'

Nice. We have now loaded the entire text from the PDF into Python. (If the cell fails with a `ModuleNotFoundError` like the one above, PyPDF2 is not installed in the current environment; `pip install PyPDF2` fixes that.) One issue, though: it's all
one big string.

If we had to analyze a lot of scientific papers, we'd rather know the text of the
individual sections (Introduction, Methods, etc.).
To get there, we'll use Python's built-in `re` module, which offers all kinds of regex methods.

Here's how we can 'skim' the text according to predefined keywords.

In [3]:
# a general example: how to use regex!
import re

example_sentence = "Hello, my name is Hannah."
match = re.search("Hello", example_sentence)

# now we want to know the characters' indices in the sentence!
match_start, match_end = match.span()
match_word = example_sentence[match_start:match_end]
print(str(match_start) + ", " + str(match_end))
print(match_word)

# What about typos?
# What about upper- or lowercase words?
# What if there is no match?

0, 5
Hello
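
The three questions in the comments above can be explored with a few lines. The sketch below is one possible way to handle them, not the only one: `re.search` returns `None` when nothing matches, so the result should be checked before calling `.span()`, and the `re.IGNORECASE` flag makes the search ignore upper- and lowercase. Typos are harder, since plain regex has no built-in fuzzy matching; here we simply list a known misspelling as an alternative (the misspelling `Helo` is made up for illustration).

```python
import re

example_sentence = "Hello, my name is Hannah."

# no match: re.search returns None, so check before using the result
match = re.search("Goodbye", example_sentence)
no_match_result = match.span() if match else None
print(no_match_result)

# upper-/lowercase: re.IGNORECASE lets "hello" match "Hello", "HELLO", ...
match = re.search("hello", example_sentence, flags=re.IGNORECASE)
ignorecase_span = match.span() if match else None
print(ignorecase_span)

# typos: list known misspellings as alternatives ("Helo" is a made-up example)
match = re.search("Hello|Helo", "Helo, my name is Hannah.")
typo_word = match.group(0) if match else None
print(typo_word)
```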


In [4]:
# now a bit more specific: what if we look for more than one word?

example_sentence = "The text contains an Abstract, an abstract, and a Discussion."
matches = re.finditer("Abstract|Introduction|Discussion", example_sentence)

for match in matches:
    match_start, match_end = match.span()
    match_word = example_sentence[match_start:match_end]
    print(str(match_start) + ", " + str(match_end))
    print(match_word)

21, 29
Abstract
50, 60
Discussion


In [7]:
# What if our keyword is ambiguous?

example_sentence = "No. Nothing's wrong with him."
matches = re.finditer("No", example_sentence)

for m, match in enumerate(matches):
    match_start, match_end = match.span()
    match_word = example_sentence[match_start:match_end]
    print(f"-----------match {m}-----------")
    print(str(match_start) + ", " + str(match_end))
    print(match_word)

# How to fix such cases?

-----------match 0-----------
0, 2
No
-----------match 1-----------
4, 6
No


In [9]:
# -> we explicitly look for a "No" between two word boundaries (\b)
example_sentence = "No. Nothing's wrong with him."
matches = re.finditer(r"\bNo\b", example_sentence)

for m, match in enumerate(matches):
    match_start, match_end = match.span()
    match_word = example_sentence[match_start:match_end]
    print(f"-----------match {m}-----------")
    print(str(match_start) + ", " + str(match_end))
    print(match_word)

-----------match 0-----------
0, 2
No


In [10]:
# how does that work? some differences between normal and raw strings:
print("------normal string------")
print("Hello \n World")
print("------raw string------")
print(r"Hello \n World")

------normal string------
Hello 
 World
------raw string------
Hello \n World
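
The raw string matters for `\b` in particular. In a normal Python string, `"\b"` is the escape sequence for a backspace character, so the regex engine never sees the word-boundary token. A short sketch of the difference, reusing the example sentence from the cells above:

```python
import re

example_sentence = "No. Nothing's wrong with him."

# normal string: "\b" becomes a single backspace character
normal_pattern = "\bNo\b"
# raw string: the backslash is kept, so the regex engine sees the \b token
raw_pattern = r"\bNo\b"

# the two patterns do not even have the same length
print(len(normal_pattern), len(raw_pattern))

normal_matches = [m.span() for m in re.finditer(normal_pattern, example_sentence)]
raw_matches = [m.span() for m in re.finditer(raw_pattern, example_sentence)]
print(normal_matches)  # nothing found: the sentence contains no backspace characters
print(raw_matches)
```

Writing regex patterns as raw strings by default avoids this kind of silent mismatch.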


In [11]:
import re
import pandas as pd

# find keyword matches
SECTION_PATTERN = "Abstract|Introduction|Methods|Discussion|Conclusion"
matches = re.finditer(SECTION_PATTERN, text)
matches = list(matches)

# find text sections
sections = {}
for m, match in enumerate(matches):

    section_name = match.group(0)
    start = match.end()    
    end = matches[m + 1].start() if m + 1 < len(matches) else len(text)

    # capture any text before the first section as 'Title'
    if m == 0:  
        title = text[:match.start()].strip()
        if title:
            sections["Title"] = title
    section_text = text[start:end].strip()
    sections[section_name] = section_text

In [12]:
print(sections)

{'Title': 'Why Dashboarding is Amazing: Unlocking the Power of \nPython for Real-Time Insight and Communication', 'Abstract': 'In today’s data-driven world, the ability to interact with data in real time is not a luxury — it’s a necessity. \nTraditional static reports, spreadsheets, and even conventional business intelligence tools often fall short in \ndelivering the agility, clarity, and actionable insights required for modern decision-making. Dashboards bridge \nthis gap by o Ưering interactive visualizations, real -time data exploration, and immediate feedback on key \nmetrics. They empower users to not only view data but actively engage with it, uncovering insights that might \notherwise remain hidden. Python, a dominant language in data science, o Ưers frameworks that make \ndashboarding accessible, e Ưicient, and highly customizable. These tools streamline the translation of complex \ndatasets into intuitive visual representations, enhancing comprehension and decision-making acr
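
One caveat of the pattern above: it matches the keywords anywhere in the text, so a sentence such as "see the Methods below" would start a new section, and a repeated keyword overwrites the earlier entry in `sections`. A possible refinement is sketched below, under the assumption that each heading stands alone on its own line in the extracted text (this depends on the PDF and is not guaranteed). It anchors the pattern to the start and end of a line with `re.MULTILINE`. The function name `split_sections` and the sample text are made up for illustration.

```python
import re

# assumption: each section heading is the only content on its line
HEADING_PATTERN = re.compile(
    r"^(Abstract|Introduction|Methods|Discussion|Conclusion)[ \t]*$",
    flags=re.MULTILINE,
)

def split_sections(text):
    """Split text into {section name: section text} at heading lines."""
    matches = list(HEADING_PATTERN.finditer(text))
    sections = {}
    if not matches:
        return sections
    # capture any text before the first heading as 'Title'
    title = text[:matches[0].start()].strip()
    if title:
        sections["Title"] = title
    for m, match in enumerate(matches):
        end = matches[m + 1].start() if m + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections

# made-up sample: "Methods" also appears inside a sentence of the abstract
sample_text = (
    "A Made-Up Paper\n"
    "Abstract\n"
    "We describe the Methods only briefly.\n"
    "Methods\n"
    "We counted things.\n"
)
sample_sections = split_sections(sample_text)
print(sample_sections)
```

With this pattern, the word "Methods" inside the abstract sentence no longer starts a new section; only the heading line does.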

In [13]:
# that's not all we can do with regex. we can also exchange words with other words:

example_sentence = "There's quite a lot you can do with apples."
print(example_sentence)

fixed_sentence = re.sub("apples", "regex", example_sentence)
print(fixed_sentence)

There's quite a lot you can do with apples.
There's quite a lot you can do with regex.


Nice! We have now successfully extracted the text from the PDF section by section.
One issue, though: the formatting is quite strange, with `\n` (the newline character) and
similar whitespace in there. We want to remove those non-text characters.

Here's how we do that.

In [14]:
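for key, value in sections.items():
    # replace newline characters with spaces
    value = re.sub(r"\n", " ", value)

    # additionally, collapse runs of whitespace into a single space
    # (\s also matches \n, so this step alone would already handle the newlines)
    value = re.sub(r"\s+", " ", value)

    sections[key] = value
print(sections)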

{'Title': 'Why Dashboarding is Amazing: Unlocking the Power of Python for Real-Time Insight and Communication', 'Abstract': 'In today’s data-driven world, the ability to interact with data in real time is not a luxury — it’s a necessity. Traditional static reports, spreadsheets, and even conventional business intelligence tools often fall short in delivering the agility, clarity, and actionable insights required for modern decision-making. Dashboards bridge this gap by o Ưering interactive visualizations, real -time data exploration, and immediate feedback on key metrics. They empower users to not only view data but actively engage with it, uncovering insights that might otherwise remain hidden. Python, a dominant language in data science, o Ưers frameworks that make dashboarding accessible, e Ưicient, and highly customizable. These tools streamline the translation of complex datasets into intuitive visual representations, enhancing comprehension and decision-making across technical an

## Summarizing the text

We successfully extracted text from the PDF and maintained the section structure. Now,
we want to use an LLM to summarize the section text for us because we don't have much
time to read everything that's going on in there.

Here's how we do this.

In [18]:
# how to work with an LLM: using ollama's llama3.2:1b model locally
from ollama import chat, ChatResponse

input_sentence = "Hi, please tell us a joke."
response: ChatResponse = chat(
        model='llama3.2:1b', 
        messages=[{'role': 'user', 'content': input_sentence}])
print(response.message.content)

Here's one that's sure to bring a smile:

What do you call a fake noodle?

An impasta.


In [23]:
# now with a prompt specifying how the model should behave:
input_sentence = "Hi, please tell us a joke."
prompt = "Reply to the following sentence as if you were a servant at court: "
response: ChatResponse = chat(
        model='llama3.2:1b', 
        messages=[{'role': 'user', 'content': prompt + input_sentence}])
print(response.message.content)

Your Majesty, I'm happy to oblige your request. However, I must remind Your Highness that jests are often shared among the nobility and upper classes, but may not be suitable for your more...refined audience. Nevertheless, I shall endeavor to present a jest of moderate levity.

Why did the cardinal's vestments go to therapy? Because they were feeling a little "folded" under the pressure! (The cardinal, it is said, was quite taken aback by this pun, but attempted to maintain a stiff upper lip.)

May I suggest another one, Your Majesty?


In [None]:
from ollama import chat, ChatResponse

# define a prompt for the LLM (the trailing blank line separates it from the section text)
prompt = (
    "Summarize the following text. Do not verify facts, and do not add "
    "commentary. Only output a concise summary in three sentences maximum.\n\n"
)
MAX_RETRIES = 5  # upper bound on re-runs so the loop below cannot run forever

# initialize an empty list for the model's section summaries
summaries = []
dataframe = pd.DataFrame([sections])
for col in dataframe.columns:

    # fetch the content of the specific column's first row
    # (named section_text so we don't overwrite the full PDF text stored in `text`)
    section_text = dataframe[col].iloc[0]

    # have the LLM (model: llama3.2:1b) summarize the text
    response: ChatResponse = chat(
        model='llama3.2:1b',
        messages=[{'role': 'user', 'content': prompt + section_text}])
    result = response.message.content

    # sometimes, the model adds a ChatGPT-like 'Here is an output for ...' to the
    # beginning of its response - if that's the case, we have it run again until we
    # get a response without that preamble or reach MAX_RETRIES
    # (not the most efficient solution, but it works).
    retries = 0
    while result.startswith("Here is") and retries < MAX_RETRIES:
        response: ChatResponse = chat(
            model='llama3.2:1b',
            messages=[{'role': 'user', 'content': prompt + section_text}])
        result = response.message.content
        retries += 1

    summaries.append(result)

# add the summaries to the sections dataframe as a second row
summary_df = pd.concat([dataframe, pd.DataFrame([summaries], columns=dataframe.columns)], axis=0)
summary_df.index = ["Original Text", "Summary"]


This takes quite long (about 5 minutes), but the results are quite nice in the end.
Sometimes, the model does not stick to the 'three sentences maximum' instruction and
gives longer answers; that's because the model is quite small, and small models tend to
lose track of instructions when the input text is long. It's quite okay for our purposes, though!
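
Instead of re-running the model or accepting overly long answers, the response can also be post-processed with the regex tools from above. The following is a rough sketch, not part of the original pipeline: the helper `tidy_summary` is hypothetical, it assumes any preamble starts with "Here is" or "Here's" and ends with a colon, and it splits sentences naively at `.`, `!`, or `?` followed by whitespace (so abbreviations like "e.g." would be split incorrectly).

```python
import re

def tidy_summary(summary, max_sentences=3):
    """Drop a leading 'Here is ...:' preamble and keep at most max_sentences."""
    # assumption: the preamble is a first line like "Here is a summary:"
    summary = re.sub(r"^Here(?: is|'s)[^\n:]*:\s*", "", summary.strip())
    # naive sentence split: break after ., ! or ? when followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", summary)
    return " ".join(sentences[:max_sentences])

# made-up model response with a preamble and four sentences
raw_summary = (
    "Here is a summary of the text:\n\n"
    "Dashboards are useful. They are interactive. "
    "Python makes them easy. Streamlit is one option."
)
tidied = tidy_summary(raw_summary)
print(tidied)
```

Cutting off extra sentences can drop information, so whether this is preferable to a longer summary depends on the use case.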

In [11]:
summary_df.head()


| | Title | Abstract | Introduction | Methods | Discussion | Conclusion |
|---|---|---|---|---|---|---|
| Original Text | Why Dashboarding is Amazing: Unlocking the Pow... | In today’s data-driven world, the ability to i... | Modern organizations are inundated with vast a... | Among the tools available for dashboarding in ... | The real value of dashboarding lies in the way... | The future of data communication lies in inter... |
| Summary | Dashboarding allows users to easily view and i... | Dashboarding is a crucial tool for modern deci... | Modern organizations have vast amounts of data... | Streamlit is a tool for dashboarding in Python... | A dashboard reshapes how organizations interac... | Data communication's future lies in interactiv... |
