<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_015_summarize_pdf.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
# !pip install transformers datasets
# !pip install python-dotenv

### Environment Setup

In [4]:
# Necessary imports
import os
from dotenv import load_dotenv
from transformers import pipeline
from datasets import load_dataset

# Load environment variables
load_dotenv('/content/huggingface_api_key.env')
api_key = os.getenv("HUGGINGFACE_API_KEY")
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = api_key
# os.environ values must be strings, so only set the token if the key was found
if api_key:
    os.environ["HF_TOKEN"] = api_key

# Initialize Hugging Face sentiment-analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

### Step 1: Extract Text from the PDF
To process the PDF text with an LLM, we first need to extract the text. Python libraries like **PyMuPDF (fitz)**, **PyPDF2**, or **pdfplumber** are great for this.


### Steps to Read and Extract Text from a PDF Hosted Online

When working with PDFs hosted online, we need to first download the PDF and then process it to extract the text. Here’s a step-by-step explanation of the process:

#### Step 1: Download the PDF File Using `requests`

The `requests` library in Python allows us to fetch content from a URL. In this case, we use `requests.get(pdf_url)` to download the PDF file from a specified URL.

- **Code**: `response = requests.get(pdf_url)`
- **Explanation**: This sends an HTTP GET request to the provided URL (`pdf_url`), which should point to a PDF file.
- **Status Check**: `response.raise_for_status()` will raise an error if the request was unsuccessful (e.g., if the file is not found or the server returns an error). This step ensures that the download was successful before proceeding.

#### Step 2: Open the PDF from Memory with `pdfplumber`

Once the PDF content is downloaded, we need a way to process it without saving it to disk. We can use `io.BytesIO` to handle the PDF content directly in memory.

- **Code**: `pdfplumber.open(io.BytesIO(response.content))`
- **Explanation**: `response.content` contains the raw bytes of the PDF. Wrapping it in `io.BytesIO` creates an in-memory file-like object, allowing `pdfplumber` to open and process it as if it were a local file.
- **Usage**: `with pdfplumber.open(io.BytesIO(response.content)) as pdf` opens the PDF, making it accessible page-by-page.
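
To see why the `io.BytesIO` wrapper is enough, the sketch below treats a short hand-written byte string as a file. The bytes and the `looks_like_pdf` helper are illustrative assumptions and not part of the notebook; in practice the data would come from `response.content`.

```python
import io

def looks_like_pdf(data: bytes) -> bool:
    """Hypothetical sanity check: PDF files normally begin with the marker b'%PDF-'."""
    return data.startswith(b"%PDF-")

# Stand-in for response.content (a real download would be far longer).
fake_download = b"%PDF-1.7\n...rest of the file..."

buffer = io.BytesIO(fake_download)   # in-memory, file-like object
header = buffer.read(8)              # read() behaves as it would on a file opened in binary mode
buffer.seek(0)                       # rewind before handing the buffer to a parser

print(header)
print(looks_like_pdf(fake_download))
```

A check like this can catch the case where the server answers with an HTML error page instead of a PDF, which `raise_for_status()` alone would not detect if the status code is 200.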

#### Step 3: Extract Text from Each Page

Once the PDF is open, we can iterate through its pages and extract text.

- **Code**: `for page in pdf.pages`
- **Explanation**: This loop iterates through each page of the PDF.
- **Text Extraction**: `page.extract_text()` extracts the text from each page. We concatenate the text from all pages to get a complete text representation of the document.
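
One detail worth guarding against: for a page without a text layer (a scanned image, for example), `page.extract_text()` may return an empty result, and in some pdfplumber versions that result is `None`, which cannot be concatenated to a string. The sketch below shows the guard using a stand-in page class; `FakePage` and `join_page_texts` are hypothetical names introduced only for illustration.

```python
class FakePage:
    """Stand-in for a pdfplumber page; only the extract_text() method is mimicked."""
    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text

def join_page_texts(pages):
    """Concatenate page texts, treating a missing result (None) as an empty page."""
    parts = []
    for page in pages:
        parts.append(page.extract_text() or "")
    return "\n".join(parts)

pages = [FakePage("First page."), FakePage(None), FakePage("Third page.")]
combined = join_page_texts(pages)
print(repr(combined))
```

Collecting the pieces in a list and joining once also avoids building many intermediate strings on long documents.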
  
#### Step 4: Return or Print Extracted Text

The extracted text is stored in a variable (`text`) and can either be returned by the function or printed for a quick preview.





In [7]:
# !pip install pdfplumber

In [11]:
import requests
import pdfplumber
import io

# Function to extract text from a PDF hosted at a URL
def extract_text_from_pdf(pdf_url):
    # Download the PDF from the URL (the timeout avoids hanging on an unresponsive server)
    response = requests.get(pdf_url, timeout=30)
    response.raise_for_status()  # Raise an exception on an HTTP error status

    # Open the PDF from the downloaded bytes
    text = ""
    with pdfplumber.open(io.BytesIO(response.content)) as pdf:
        for page in pdf.pages:
            # Guard against pages with no extractable text (the result may be None)
            text += (page.extract_text() or "") + "\n"
    return text

# Example usage
pdf_url = "https://sdplantatlas.org/ge_files/pdf/Yellow-rumped%20Warbler.pdf"
pdf_text = extract_text_from_pdf(pdf_url)
print(pdf_text[:500])  # Display the first 500 characters of extracted text


Wood Warblers — Family Parulidae 477
Yellow-rumped Warbler Dendroica coronata
The Yellow-rumped Warbler is probably San Diego
County’s most abundant winter visitor. If the White-
crowned Sparrow exceeds it, it is not by much.
Eucalyptus groves and other exotic trees planted in
developed areas suit the Yellow-rumped Warbler at
least as much as natural sage scrub, chaparral, and
woodland. The birds are strongly concentrated in
the coastal lowland but are found almost through-
out the county, lacki


### PDF from Local Folder

In [12]:
# Extract text from each page of a local PDF file
def extract_text_from_pdf(pdf_path):
    text = ""
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Guard against pages with no extractable text (the result may be None)
            text += (page.extract_text() or "") + "\n"
    return text

# Example usage
pdf_path = "/content/Yellow-rumped Warbler.pdf"
pdf_text = extract_text_from_pdf(pdf_path)
print(pdf_text[:500])  # Display the first 500 characters of extracted text

Wood Warblers — Family Parulidae 477
Yellow-rumped Warbler Dendroica coronata
The Yellow-rumped Warbler is probably San Diego
County’s most abundant winter visitor. If the White-
crowned Sparrow exceeds it, it is not by much.
Eucalyptus groves and other exotic trees planted in
developed areas suit the Yellow-rumped Warbler at
least as much as natural sage scrub, chaparral, and
woodland. The birds are strongly concentrated in
the coastal lowland but are found almost through-
out the county, lacki


### Step 2: Split Long Text into Chunks


This code **splits a long text into smaller chunks** so that each chunk stays under a size limit (`max_length=512`). Transformer models cap their input length: BERT-based models, for example, typically accept at most **512 tokens** at a time. Note that the function counts whitespace-separated *words*, not tokens. Because a tokenizer often splits one word into several subword tokens, a 512-word chunk can contain more than 512 tokens. Let’s break down the code:

### Code Explanation

1. **Split Text into Words**: `words = text.split()`
   - This splits the `text` string into a list of individual words based on spaces. Each word becomes an element in the `words` list.

2. **Loop Through Words in Chunks**: `for i in range(0, len(words), max_length)`
   - The `range(0, len(words), max_length)` statement allows us to loop over `words` in increments of `max_length`.
   - The loop variable `i` starts at `0` and increments by `max_length` each time, representing the starting index of each new chunk of words.

3. **Generate Chunks**: `yield ' '.join(words[i:i + max_length])`
   - `words[i:i + max_length]` selects a slice of the list containing `max_length` words.
   - `join()` combines the list of words in the slice into a single string, creating a text chunk.
   - `yield` returns this chunk of text, pausing the function until the next chunk is requested.


4. **Collect the Chunks**: `chunks = list(split_text_into_chunks(pdf_text, max_length=512))`
   - This line collects all generated chunks in a list by calling `split_text_into_chunks()` on `pdf_text`.
   - Each element in `chunks` is a text segment of up to `512` words (or fewer if it’s the last chunk).

### Why This is Useful
- **Model Token Limit**: Transformer models like BERT or BART limit how many tokens (whole words or subword pieces) they can process at once. This code breaks the text into manageable segments.
- **Ensuring Full Text Coverage**: By processing text in chunks, you can summarize, classify, or analyze the entire document instead of only the part that fits into a single model input.
- **Memory Efficiency**: Using `yield` allows chunks to be generated one at a time, which is memory-efficient when dealing with large documents.

This chunking approach is essential for large documents in NLP tasks, ensuring the model can handle each part of the text without exceeding input limits.
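
Because the function in this notebook counts words rather than tokens, a chunk can still exceed a model's token limit. A minimal sketch of a token-aware variant follows. It packs words greedily against a budget using a caller-supplied `count_tokens` function. Both `split_by_token_budget` and `rough_token_count` are hypothetical names, and the four-characters-per-token rule is only a stand-in for a real tokenizer.

```python
def split_by_token_budget(text, count_tokens, max_tokens=512):
    """Greedy word packing: yield chunks whose summed per-word token cost stays <= max_tokens."""
    current, used = [], 0
    for word in text.split():
        cost = count_tokens(word)
        # Start a new chunk when adding this word would exceed the budget.
        if current and used + cost > max_tokens:
            yield " ".join(current)
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        yield " ".join(current)

# Stand-in counter (assumption): pretend every 4 characters cost one token.
def rough_token_count(word):
    return max(1, (len(word) + 3) // 4)

sample = "Eucalyptus groves and other exotic trees suit the warbler"
budget_chunks = list(split_by_token_budget(sample, rough_token_count, max_tokens=6))
print(budget_chunks)
```

With a real model, `count_tokens` could be replaced by a function that measures each word with the model's own tokenizer. Summing per-word counts is still an approximation, since tokenizers also add special tokens and treat leading spaces differently, so leaving some headroom below the model limit is prudent.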

In [13]:
def split_text_into_chunks(text, max_length=512):
    words = text.split()
    for i in range(0, len(words), max_length):
        yield ' '.join(words[i:i + max_length])

# Example usage
chunks = list(split_text_into_chunks(pdf_text, max_length=512))

### Step 3: Summarize Each Chunk with Hugging Face Transformers
Hugging Face provides pre-trained summarization models like **BART** or **T5**, which work well for summarizing text.


In [14]:
from transformers import pipeline

# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize each chunk
summaries = [summarizer(chunk)[0]["summary_text"] for chunk in chunks]

# Combine the chunk summaries into a single document summary
full_summary = " ".join(summaries)
print(full_summary)

The Yellow-rumped Warbler is probably San Diego County’s most abundant winter visitor. The birds are strongly concentrated in the coastal lowland but are found almost through- out the county. The warblers are now of annual occurrence in the Cuyamaca Mountains, with up to eight, including five singing males, on Middle Peak. Audubon’s Warblers usually nest in the outer branches of P. K. Nelson but much less common in native desert the middle levels of conifers. It is rare to absent in the least-vegetated tracts of 2004 was in a burned canyon live oak retaining toasted desert and lacking entirely from pinyon–juniper zone leaves. Most records are from the coastal lowland, summits of San Diego County’s peaks. The Myrtle Warbler too is usu- southeastern Arizona. The dominant subspecies of the Yellow- high as 30 in Tijuana River valley 17 December 1977. Dates for the Myrtle range from 5 October (Townsend, 1837)


### Format Summary

To make your summary more readable, you can format it by breaking it into shorter paragraphs or adding line breaks. Here are a few techniques for improving readability:

### 1. **Text Wrapping**
   - Use the `textwrap` module to wrap long lines of text into a specified width (e.g., 80 characters), making it easier to read without scrolling horizontally.

### 2. **Adding Paragraph Breaks**
   - You can split the summary into paragraphs if it’s long. Adding breaks every few sentences improves readability.


### Explanation

- **`textwrap.fill(sentence, width=width)`**: Wraps each sentence to a specified width, breaking lines to fit within the limit.
- **Adding Line Breaks**: Every 3 sentences, a blank line is added to create paragraph breaks.

### Result
This will print the summary in a structured, legible format with line breaks and neatly wrapped lines, making it easier to read in a console or output window. Adjust `width` or the number of sentences per paragraph as needed to customize the format.
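
Splitting on `". "` also breaks after initials such as "P. K.", which shows up in the formatted output. As a hedged alternative, the sketch below uses a regular expression that skips single-letter initials. It is a different technique from the notebook's function: it wraps each paragraph as a block instead of printing one sentence per line, and `format_paragraphs` and `SENTENCE_BOUNDARY` are names invented for this example.

```python
import re
import textwrap

# Split after ., ! or ? when followed by whitespace and a capital letter,
# but not right after a single-letter initial such as "P." or "K.".
SENTENCE_BOUNDARY = re.compile(r"(?<!\b[A-Z]\.)(?<=[.!?])\s+(?=[A-Z])")

def format_paragraphs(summary_text, width=80, sentences_per_paragraph=3):
    """Group sentences into paragraphs and wrap each paragraph to `width` columns."""
    sentences = SENTENCE_BOUNDARY.split(summary_text.strip())
    paragraphs = []
    for start in range(0, len(sentences), sentences_per_paragraph):
        group = " ".join(sentences[start:start + sentences_per_paragraph])
        paragraphs.append(textwrap.fill(group, width=width))
    return "\n\n".join(paragraphs)

sample = ("The collection was described by P. K. Nelson. It covers the coast. "
          "Records begin in 1977. A fourth sentence starts a new paragraph.")
print(format_paragraphs(sample, width=60))
```

The pattern is a heuristic: a sentence that genuinely ends in a single capital letter would not be split, and abbreviations such as "Dr." are not handled.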

In [18]:
import textwrap

def format_summary(summary_text, width=80):
    # Split text into sentences (naive: this also splits after initials such as "P. K.")
    sentences = summary_text.split(". ")

    # Group sentences into paragraphs
    formatted_text = ""
    for i, sentence in enumerate(sentences, 1):
        # Drop any trailing period so the last sentence does not end in ".."
        wrapped_sentence = textwrap.fill(sentence.rstrip("."), width=width)
        formatted_text += wrapped_sentence + ".\n"

        # Add a blank line every 3 sentences for readability
        if i % 3 == 0:
            formatted_text += "\n"

    return formatted_text

# Example usage
formatted_summary = format_summary(full_summary, width=80)
print(formatted_summary)

The Yellow-rumped Warbler is probably San Diego County’s most abundant winter
visitor.
The birds are strongly concentrated in the coastal lowland but are found almost
through- out the county.
The warblers are now of annual occurrence in the Cuyamaca Mountains, with up to
eight, including five singing males, on Middle Peak.

Audubon’s Warblers usually nest in the outer branches of P.
K.
Nelson but much less common in native desert the middle levels of conifers.

It is rare to absent in the least-vegetated tracts of 2004 was in a burned
canyon live oak retaining toasted desert and lacking entirely from
pinyon–juniper zone leaves.
Most records are from the coastal lowland, summits of San Diego County’s peaks.
The Myrtle Warbler too is usu- southeastern Arizona.

The dominant subspecies of the Yellow- high as 30 in Tijuana River valley 17
December 1977.
Dates for the Myrtle range from 5 October (Townsend, 1837).











### Step 4: Optional - Save the Summary to a Text File


### Explanation of Each Step
1. **Extract Text**: We use `pdfplumber` to extract text from each page in the PDF.
2. **Split Text**: The text is split into chunks to fit within the LLM’s input limit.
3. **Summarize**: Each chunk is summarized, and all summaries are combined for a final output.
4. **Save**: The final summary is saved to a file if needed.

### Key Considerations
- **Accuracy**: For best results, you may want to experiment with different summarization models like `t5-small`, `facebook/bart-large-cnn`, or `google/pegasus-xsum`.
- **Document Structure**: If the PDF has distinct sections, summarizing each section separately can improve the coherence of the summary.



In [15]:
with open("summary.txt", "w", encoding="utf-8") as file:
    file.write(full_summary)

print("Summary saved to summary.txt")

Summary saved to summary.txt


### Modify Summary Length

You can adjust the summary length by setting parameters in the Hugging Face `pipeline` when using models like BART or T5 for summarization. Specifically, you can use the **`min_length`** and **`max_length`** parameters to control the minimum and maximum number of tokens in the summary.

### Parameters Explained
- **`min_length`**: Sets the minimum number of tokens for the summary. Generation is not allowed to stop before this many tokens have been produced.
- **`max_length`**: Sets the maximum number of tokens for the summary, allowing you to restrict the length of the output.

### Considerations
- **Balance**: Choose `min_length` and `max_length` based on your needs. A higher `max_length` produces a longer, more detailed summary, while a lower one provides a concise overview.
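- **Truncation**: Setting `truncation=True` ensures that if the input text is too long, it will be truncated to fit within the model’s token limit (1,024 tokens for `facebook/bart-large-cnn`). Anything beyond the limit is dropped and never reaches the model, so truncation complements chunking and does not replace it.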

This approach allows you to control the summary length dynamically, making it possible to adjust the length based on the document type or your summarization needs.
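
Fixed values such as `max_length=150` fit long chunks but can be larger than a short input, in which case the pipeline typically warns that `max_length` exceeds the input length. One way to adjust the length dynamically is to derive both parameters from the input size. The helper below is a hypothetical sketch; the 30% ratio and the bounds are arbitrary assumptions to tune per document type.

```python
def pick_summary_lengths(input_tokens, ratio=0.3, floor=30, ceiling=150):
    """Hypothetical helper: derive (min_length, max_length) from the input size."""
    max_len = int(input_tokens * ratio)
    # Keep the target inside the chosen bounds.
    max_len = max(floor, min(ceiling, max_len))
    # Never ask for a summary longer than the input itself.
    max_len = min(max_len, max(1, input_tokens))
    min_len = max(1, max_len // 3)
    return min_len, max_len

for n in (20, 40, 200, 900):
    print(n, pick_summary_lengths(n))
```

The returned pair could then be passed as `min_length` and `max_length` to the summarizer call, with `input_tokens` measured by the model's tokenizer.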

In [None]:
from transformers import pipeline

# Load the summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Set min_length and max_length for a custom summary length
def summarize_text(text, min_len=50, max_len=150):
    summary = summarizer(text, min_length=min_len, max_length=max_len, truncation=True)
    return summary[0]['summary_text']

# Example usage
text = "Your long PDF text or extracted text goes here..."
print(summarize_text(text, min_len=50, max_len=150))

## Full Document Context

When a large document is broken into chunks for an LLM, each chunk is processed independently, which means the model doesn’t have direct access to the full document context across chunks. The strategies below work around this limitation and help preserve connections between chunks:

### 1. **Chunk-by-Chunk Summarization**
   - **Independent Summaries**: Each chunk is treated as a standalone piece of text. The model generates a summary for each chunk separately, but this lacks a direct contextual link between chunks.
   - **Combining Summaries**: After summarizing each chunk, the individual summaries can be concatenated into a new “summary document” and then summarized again. This is often called a **hierarchical summarization approach** and helps retain a higher-level summary of the entire document.

### 2. **Hierarchical Summarization Approach**
   - **Stage 1**: Summarize each chunk independently to get a "mini-summary" for each section.
   - **Stage 2**: Combine all mini-summaries and run them through the model again to produce a cohesive summary that captures the document’s main points.
   - **Why This Helps**: Although each chunk’s summary doesn’t know the content of other chunks directly, the second-stage summary provides a synthesized view by “summarizing the summaries.”

### 3. **Embedding-Based Approaches for Semantic Connection**
   - Using an **embedding model** (e.g., Sentence-BERT) allows you to create a vector representation of each chunk. These embeddings capture semantic similarities between chunks without needing a sequential context.
   - **Clustering or Topic Modeling**: Once you have embeddings for each chunk, you can use clustering or topic modeling to find related themes across chunks. This provides a sense of document cohesion and can guide which key points to include in the final summary.
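
As a rough illustration of the idea, the sketch below groups chunks whose vectors point in a similar direction. It uses a simple greedy threshold rule, not a full clustering algorithm, and the three-dimensional vectors are hand-made stand-ins for real embeddings. The function names and the 0.8 threshold are assumptions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_by_similarity(vectors, threshold=0.8):
    """Greedy grouping: a chunk joins the first group whose first member is similar enough."""
    groups = []  # each group is a list of chunk indices
    for index, vector in enumerate(vectors):
        for group in groups:
            if cosine_similarity(vectors[group[0]], vector) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups

# Hand-made vectors standing in for real chunk embeddings (assumption).
toy_embeddings = [
    [0.9, 0.1, 0.0],   # chunk 0
    [0.8, 0.2, 0.1],   # chunk 1: close to chunk 0
    [0.0, 0.1, 0.9],   # chunk 2
    [0.1, 0.0, 0.8],   # chunk 3: close to chunk 2
]
print(group_by_similarity(toy_embeddings))
```

Each resulting group could be summarized together, so that related passages from different parts of the document inform the same summary.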

### 4. **Long-Context Transformers**
   - Some transformer models, like **Longformer** or **Reformer**, are designed to handle longer documents. They replace full self-attention with cheaper variants (sliding-window plus global attention in Longformer, locality-sensitive-hashing attention in Reformer), which lets them attend over thousands of tokens and preserve more context across a long document.

### 5. **Using Overlapping Chunks**
   - By creating overlapping chunks (where, for instance, the last 50 tokens of one chunk are repeated at the beginning of the next), you allow some degree of context carry-over between chunks.
   - This method lets the model “see” some of the previous content in the next chunk, though it’s still a limited solution.
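
A minimal sketch of overlapping chunks follows. It counts words rather than tokens, to stay consistent with the splitter used earlier in this notebook; `split_with_overlap` is a hypothetical name.

```python
def split_with_overlap(text, max_length=512, overlap=50):
    """Yield word-based chunks; each chunk repeats the last `overlap` words of the previous one."""
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must be non-negative and smaller than max_length")
    words = text.split()
    step = max_length - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + max_length])
        if start + max_length >= len(words):
            break  # the final words are already covered

sample = "one two three four five six seven eight nine ten"
overlapping = list(split_with_overlap(sample, max_length=4, overlap=1))
print(overlapping)
```

A larger overlap carries more context forward but also increases the number of chunks, and therefore the number of model calls.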

### Practical Implementation
If you’re using a standard summarization model (such as BART or T5), the **hierarchical summarization approach** is often the most practical:
- **Step 1**: Summarize each chunk independently.
- **Step 2**: Concatenate the summaries.
- **Step 3**: Run the concatenated summary through the model again to create a cohesive summary.

Here’s a code snippet showing how you might implement hierarchical summarization:
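
The sketch below keeps the model out of the picture so the control flow is easy to follow. It takes any `summarize` callable and repeats the second stage until the combined text fits a word budget. `summarize_hierarchically`, `first_five_words`, and the `max_words` budget are hypothetical names and values chosen for illustration.

```python
def summarize_hierarchically(chunks, summarize, max_words=400):
    """Summarize chunks, then re-summarize the joined result until it fits max_words."""
    summaries = [summarize(chunk) for chunk in chunks]
    combined = " ".join(summaries)
    while len(combined.split()) > max_words:
        words = combined.split()
        groups = [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
        shorter = " ".join(summarize(group) for group in groups)
        if len(shorter.split()) >= len(words):
            break  # the summarizer is no longer shrinking the text; stop
        combined = shorter
    return summarize(combined)

# Stand-in summarizer (assumption): keeps the first five words of its input.
def first_five_words(text):
    return " ".join(text.split()[:5])

toy_chunks = ["alpha " * 20, "beta " * 20, "gamma " * 20]
result = summarize_hierarchically(toy_chunks, first_five_words, max_words=12)
print(result)
```

With the pipeline from this notebook, `summarize` could be a small wrapper that calls `summarizer(...)` and returns the `summary_text` field of the first result.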


### Summary
- **Chunk-by-chunk summarization** provides summaries for each part of the document.
- **Hierarchical summarization** combines those summaries for a cohesive final summary.
- While standard models can’t retain document-wide context directly, these techniques give you a practical way to retain key points from across the document.

This process won’t capture nuanced context as well as reading the entire document at once, but it’s a robust workaround when handling documents that exceed model token limits.

In [19]:
from transformers import pipeline

# Load summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Summarize each chunk (truncation=True guards against inputs longer than the model's limit)
chunk_summaries = [summarizer(chunk, max_length=100, min_length=30, truncation=True)[0]['summary_text'] for chunk in chunks]

# Combine summaries
combined_summary_text = " ".join(chunk_summaries)

# Run a second summarization on the combined text
final_summary = summarizer(combined_summary_text, max_length=150, min_length=50, truncation=True)[0]['summary_text']

print("Final Summary of Document:", final_summary)



Final Summary of Document: The Yellow-rumped Warbler is probably San Diego County’s most abundant winter visitor. The birds are strongly concentrated in the coastal lowland but are found almost through- out the county. Sightings of adults carrying insects are as early as 27 May.


In [21]:
formatted_summary = format_summary(final_summary, width=80)
print(formatted_summary)

The Yellow-rumped Warbler is probably San Diego County’s most abundant winter
visitor.
The birds are strongly concentrated in the coastal lowland but are found almost
through- out the county.
Sightings of adults carrying insects are as early as 27 May..


