### Purpose

This notebook walks through how to 1) read a PDF, 2) extract the relevant text, 3) identify the key topics, and 4) summarize them with the OpenAI chat completions API.

#### The Problem:

Imagine you're a student who is preparing to apply to grad school and wants to target a few universities based on the coursework they offer. You know your goals, but you're unsure where to begin. There are *a lot* of syllabi out there, and reading through all of them will require *a lot* of time and effort. You want to make the decision quickly while remaining informed. This is a perfect use case for processing PDFs and summarizing the key learnings so you can select your courses.

#### The Data:

In this notebook we are going to use the syllabus from [6500 Statistical Machine Learning](https://stat.osu.edu/courses/stat-6500) taught at The Ohio State University by Dr. Lee. I took this class while studying for my MS in Statistics at OSU and really enjoyed the course, so I chose a recent syllabus as our opportunity to extract and summarize text.

#### The Example:

This code will allow you to use your own OpenAI API key to process the text in a PDF, as well as sort through the content in this syllabus.

#### Key Learnings:

Along the way I will call out:

- Key libraries needed to process PDFs

- The components of a call to the OpenAI API

- Checking your costs upfront (we don't want to go overboard here)

In [1]:
## generic libraries
import os
import sys
import pandas as pd
import numpy as np

## beautiful soup to handle the HTML
from bs4 import BeautifulSoup

## openai libraries
import openai
from openai import OpenAI

## libraries to count tokens etc
import tiktoken
import textwrap

## to help process any outputs
import ast
import re

## load my .env which contains API key and directory
from dotenv import load_dotenv
load_dotenv()
API_KEY = os.getenv("OPENAPI_KEY")
datapath = os.getenv("DATAPATH")
filename = os.getenv("FILENAME")
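
If any of these variables is missing from `.env`, `os.getenv` quietly returns `None` and the failure only shows up later. A minimal sketch of an early check is shown here; `require_env` is a hypothetical helper, not part of the original notebook or of `python-dotenv`.

```python
import os

def require_env(*names):
    """Return the values of the named environment variables, failing early if any is unset or empty."""
    missing = [name for name in names if not os.getenv(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return [os.environ[name] for name in names]

# Possible usage with the variable names this notebook already reads:
# API_KEY, datapath, filename = require_env("OPENAPI_KEY", "DATAPATH", "FILENAME")
```

The usage line is left commented out so the sketch runs even when the variables are not set.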


First we need to read in the PDF, which is done using PyMuPDF. We need to pip install this library before we can import fitz.

```pip install pymupdf```

In [2]:
## the import name (fitz) is quite different from the package name (pymupdf), don't worry about it!
import fitz

def extract_pdf_text(file_path):
    """
    This function will grab all of the text from the provided path and
    return it as a string.
    """
    text = ""
    with fitz.open(file_path) as doc:
        for page in doc:
            text += page.get_text()
    return text

In [3]:
# Example usage:
pdf_path = os.path.join(datapath, filename)  # works whether or not datapath ends with a separator
pdf_text = extract_pdf_text(pdf_path)
print(pdf_text[:1000])

STAT 6500 Statistical Machine Learning
Term: Spring 2024
Lectures: MWF 3:00–3:55 PM (3 credit hours) in Cockins Hall 312
Instructor: Yoonkyung Lee
Oﬃce: 440H Cockins Hall
Oﬃce Hours: M 4:00–5:00 PM and F 11:00 AM–12:00 PM or by appointment
Email: yklee@stat.osu.edu or lee.2272@osu.edu
Grader: Zhizhen Zhao
Oﬃce Hours: by appointment only
Email: zhao.3053@osu.edu
Course Website: https://carmen.osu.edu
Course Description:
Statistical Machine Learning explores the methodology and algorithms behind modern supervised and
unsupervised learning techniques to explore relationships between variables in large, complex datasets.
Topics include linear and logistic regression, classiﬁcation, clustering, resampling methods, model selection
and regularization, and non-linear regression. Students will also gain exposure to popular statistical ma-
chine learning algorithms implemented in R. A focus will be on understanding the formulation of statistical
models and their implementation, and the practical

When we print the text, we see that it has some structure, and that should set off alarm bells. Structure like this usually means the raw string contains formatting characters or leftover markup. Let's look at the first 100 characters of the raw string to see what is in there.

In [4]:
pdf_text[:100]

'STAT 6500 Statistical Machine Learning\nTerm: Spring 2024\nLectures: MWF 3:00–3:55 PM (3 credit hours)'

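The raw string contains `\n` characters. These are newlines (plain-text line breaks, not HTML tags), and they produce the layout we see above. This makes the text pretty to read, but it is noise for the LLM. Text pulled from PDFs and web pages can also carry real HTML tags, so the cleaning step below handles both.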
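
To see what the `\n` sequences in the output above actually are, here is a small sketch using only the standard library. The `sample` string is a shortened copy of the text printed earlier; the variable names are illustrative.

```python
import re

sample = "STAT 6500 Statistical Machine Learning\nTerm: Spring 2024\nLectures: MWF"

print(sample)        # print() renders each \n as an actual line break
print(repr(sample))  # repr() shows the raw string, with \n visible as an escape sequence

# Collapse every run of whitespace (newlines, tabs, repeated spaces) into a single space
flattened = re.sub(r"\s+", " ", sample).strip()
print(flattened)
```

The same `re.sub(r"\s+", " ", ...)` pattern is what removes the line breaks in the cleaning function used in this notebook.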

#### Why clean HTML and extra whitespace before sending to an LLM?

1) Costs
    - APIs charge per token, and leftover HTML or extra whitespace increases the token count. More tokens means higher costs

2) Quality
    - The model will usually ignore leftover markup, but enough of it can distract from the actual content and lower the quality of the results

3) Speed
    - More input tokens = longer time to process. Why waste time with something we don't care about?

In [5]:
def strip_html(html):

    """
    This function uses BeautifulSoup to remove HTML from the text string and then
    collapses line breaks and repeated whitespace into single spaces. It may not be
    100% perfect, but it will remove the majority of HTML, which is good enough for
    this demo.
    """

    # parse html content
    soup = BeautifulSoup(html, "html.parser")

    # drop these elements entirely, including the text inside them
    for data in soup(['style', 'script', 'code', 'a']):
        data.decompose()

    # pull out the remaining text, then collapse newlines and extra spaces
    text = soup.get_text(separator=" ", strip=True)
    cleaned = re.sub(r'\s+', ' ', text)

    return cleaned

In [6]:
## clean the html
cleaned_text = strip_html(pdf_text)

In [7]:
## call the text to look for any html formatting
cleaned_text[:100]

'STAT 6500 Statistical Machine Learning Term: Spring 2024 Lectures: MWF 3:00–3:55 PM (3 credit hours)'

Compare the new text to the original: the line breaks are gone, and the text is now one continuous string.

In [8]:
print(cleaned_text[:100])
print("\n")
print(pdf_text[:100])

STAT 6500 Statistical Machine Learning Term: Spring 2024 Lectures: MWF 3:00–3:55 PM (3 credit hours)


STAT 6500 Statistical Machine Learning
Term: Spring 2024
Lectures: MWF 3:00–3:55 PM (3 credit hours)


### Moving on to the API Call

Calling the API takes some prep. We need to complete a few steps first:

1) Pick a model

2) Count the number of tokens - if the text is too long, we will run into a variety of issues

3) Engineer a prompt (seriously easier than it sounds)
    - Set the context
    - Give the instruction and specify the output format

#### Model selection

I am going to use gpt-3.5-turbo-0125 because it does a great job extracting topics and summarizing text while being a cheaper option. As of 5/11/2025, newer GPT-4-class models are available, but they are more expensive. If this were a business use case or a more complex task, I would choose a different model.

In [9]:
model = "gpt-3.5-turbo-0125"

### Count the tokens

The function below takes the model name and the text, both of which are ready to go, and returns the token count before anything is sent to the API. Checking this up front is generally wise: if the text is too long for the model's context window, it's better to chunk it and summarize the chunk outputs in another step. Syllabi are long, so this is a very real possibility.

In [10]:
def count_tokens(text, model):
    '''
    Counts the number of tokens for the provided model. Can be used to kill an api call,
    estimate costs, chunk, etc. depending on the length of the input.

    Parameters:
                - text (str): this is the text before it is converted to tokens
                - model (str): this specifies the openai model chosen
    Returns:
                - len(tokens) (int): the number of tokens
    '''
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    return len(tokens)
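
If a document does turn out to be too long, the chunking mentioned above could look roughly like this. It is a minimal sketch: `chunk_by_tokens` is a hypothetical helper, and the demo uses a whitespace splitter as a stand-in tokenizer so it runs without downloading anything.

```python
def chunk_by_tokens(text, max_tokens, encode, decode):
    """Split text into pieces of at most max_tokens tokens each."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = encode(text)
    return [
        decode(tokens[start:start + max_tokens])
        for start in range(0, len(tokens), max_tokens)
    ]

# Stand-in tokenizer for the demo: one "token" per whitespace-separated word
demo_chunks = chunk_by_tokens(
    "linear and logistic regression, classification, clustering",
    max_tokens=3,
    encode=str.split,
    decode=" ".join,
)
print(demo_chunks)
```

In this notebook you would pass the `encode` and `decode` methods of the object returned by `tiktoken.encoding_for_model(model)` instead. Note that cutting at fixed token counts can split a sentence in half, so a real pipeline may prefer to break on paragraph boundaries.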

In [11]:
count_tokens(cleaned_text, model)

2069

2069 tokens is nothing. This will cost a fraction of a cent. We're almost ready to go!
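
To make the cost check concrete, here is a back-of-the-envelope sketch. `estimate_cost_usd` is a hypothetical helper, and the per-token prices are assumed placeholder values; check the current pricing page for the model you pick.

```python
def estimate_cost_usd(input_tokens, max_output_tokens, input_price_per_1m, output_price_per_1m):
    """Upper-bound cost of one call, assuming the response uses every allowed output token."""
    input_cost = input_tokens / 1_000_000 * input_price_per_1m
    output_cost = max_output_tokens / 1_000_000 * output_price_per_1m
    return input_cost + output_cost

# 2069 is the token count from above; 800 is the max_tokens cap used in the API call in this notebook.
# The prices (USD per 1M tokens) are assumptions for illustration only.
cost = estimate_cost_usd(2069, 800, input_price_per_1m=0.50, output_price_per_1m=1.50)
print(f"Estimated upper bound: ${cost:.4f}")
```

The count covers only the syllabus text, so the system message and prompt template add a few more input tokens. Under these assumed prices the call still comes to well under one cent.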

### Engineer the prompts

1) We want to include a system message, which provides context for the LLM. This helps guide the LLM to perform our task with more precision. I always use these and highly recommend them. Be precise and creative. Imagine an expert performing the task - how would you describe them? What would you want them to emphasize during the task?
    - Imagine this expert is really good at their job, but terrible at assuming context. Giving them specific instructions will help them do a better job.

2) The prompt should provide a little bit more context, but also specify the output. I want a bulleted list of topics that contain no more than 3-5 words per bullet.

In [12]:
system_message = {
    "role": "system",
    "content": (
        """ You are an expert education analyst. Your job is to read a course syllabus and extract the key learning topics
            a student can expect to study. Focus on concepts, skill, machine learning algorithms, or other statistics topics
            that the course aims to teach Masters-level statistics students.
            Ignore administrative details such as grading policies, attendance, or instructor bios.
            Present the output as a concise list of the main subject areas or skills covered in the course."""
    )
}

prompt = """Here is the syllabus for a university-level course. Please extract the main learning topics and skills a student will gain
            from this course, return them in a bulleted list where each bullet is no more than 3-5 words in length. The text is as follows: {}"""

### Finally, we make the call

First we write a function to handle this, then we send it off to the API, clean the response, and look at the results.

Parameters:
 - Temperature: This controls the randomness of the results. Generally, the extremes are bad. Values close to 0.0 give almost no randomness, and higher values give more (the API accepts 0 to 2, and the default is 1). We want something in the middle, but guided. I generally like 0.7, but this is also something I'd tune to get the performance I want.

 - Penalty: In this case it's a frequency penalty. We don't want to see the same bullets repeatedly, so we will penalize the model for repeating itself. More penalty = less repetition. I don't always use this parameter, but I think it fits this problem nicely. We'll do 0.2. The default is 0.
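
To build some intuition for these two parameters, here is a conceptual sketch of temperature-scaled softmax and a count-based frequency penalty. This is the textbook formulation, not OpenAI's actual implementation; the function names and the example scores are made up for illustration.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_frequency_penalty(logits, counts, penalty):
    """Lower each token's score in proportion to how often it has already appeared."""
    return [x - penalty * c for x, c in zip(logits, counts)]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])

# Token 0 has already been generated 3 times, so its score drops by 3 * 0.2
penalized = apply_frequency_penalty(logits, counts=[3, 0, 0], penalty=0.2)
print(penalized)
```

At a low temperature nearly all of the probability lands on the top-scoring token, while a higher temperature spreads it across the alternatives. The penalty pushes down tokens that have already been used, which is why it discourages repeated bullets.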

In [13]:
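def extract_bulleted_topics(systemPrep, prompt, model, temp, penalty):
    """
    Sends the system message and the user prompt to the chat completions API
    and returns the full response object. Relies on the `client` created in
    the next cell, so run that cell before calling this function.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[
            systemPrep,
            {"role": "user", "content": prompt}
        ],
        temperature=temp,
        frequency_penalty=penalty,
        max_tokens=800  # cap on the length of the response
    )

    return response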

In [14]:
### use the API Key
client = OpenAI(api_key=API_KEY)

In [15]:
bullets = extract_bulleted_topics(system_message, prompt.format(cleaned_text), model, 0.7, 0.2)

In [16]:
print(bullets.choices[0].message.content)

- Statistical Machine Learning Methodology
- Supervised and Unsupervised Learning Techniques
- Linear and Logistic Regression
- Classification and Clustering
- Resampling Methods
- Model Selection and Regularization
- Non-linear Regression
- Statistical Learning Framework
- Statistical Models Formulation and Evaluation
- Learning Algorithms Rationale and Implementation
- Quantitative Evaluation of Learning Methods
- Application of Learning Methods to Real-world Data
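
The response comes back as one string, so a little post-processing makes it easier to work with. The sketch below turns the bullets into a Python list and flags any that run past five words; `parse_bullets` is a hypothetical helper, and `sample_response` is a shortened copy of the output above.

```python
import re

def parse_bullets(response_text):
    """Turn a bulleted string into a list of topics, dropping bullet markers and blank lines."""
    topics = []
    for line in response_text.splitlines():
        cleaned = re.sub(r"^\s*[-*•]\s*", "", line).strip()
        if cleaned:
            topics.append(cleaned)
    return topics

sample_response = (
    "- Resampling Methods\n"
    "- Non-linear Regression\n"
    "\n"
    "- Application of Learning Methods to Real-world Data"
)
topics = parse_bullets(sample_response)
too_long = [t for t in topics if len(t.split()) > 5]  # bullets over the 5-word limit

print(topics)
print(too_long)
```

In the notebook you would call `parse_bullets(bullets.choices[0].message.content)`. The check shows that the model does not always respect the word limit, which is worth knowing before relying on the output format downstream.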


### Conclusion:

I think the model did a great job hitting the highlights of the course. I would say that I learned all of that while in grad school, and I think a student would benefit from this list. Additionally, I think this is generalizable to any statistics course. For other disciplines, the context and prompt may need to be tweaked, but this is 95% of the way to completion.