# Document Summarization Using GPT-3.5-turbo and PDF Processing

## Project Overview

This Jupyter notebook outlines an approach to automatically summarizing large PDF documents using sentence embeddings, clustering, and GPT-3.5-turbo, one of OpenAI's chat models. The process involves extracting text from a PDF file, segmenting the text into sections, grouping and summarizing those sections, and then linking the summaries into a cohesive narrative.

The goal of this project is to demonstrate how to handle and transform large volumes of text into concise summaries that maintain the core information and context. This is particularly useful in academic, legal, or corporate settings where quickly understanding large documents is crucial.

## Workflow Summary

1. **PDF Text Extraction**: Extract text from a PDF file, handling various formatting and structure to ensure high-quality text retrieval.
2. **Text Segmentation**: Identify logical sections within the extracted text, using titles or headings as delimiters to enhance the relevance of the segments.
3. **Text Summarization**: Apply NLP techniques to summarize each identified section. This step involves generating embeddings for text segments, clustering them to find similar themes, and summarizing each cluster.
4. **Generating Cohesive Narrative**: Enhance the flow between summaries using GPT-3.5-turbo to generate transitional texts that link the summaries in a coherent and fluid manner.
5. **Output as PDF**: Finally, compile the generated summaries into a single document, formatted as a PDF, ready for distribution or archiving.

## Author
* Nizar ZEROUALE

## Importing Required Libraries

Before we dive into the implementation details of our document summarization project, let's first import all the necessary libraries:

- **PyMuPDF (`fitz`)**: This library will be used for reading and extracting text from PDF files. It provides robust capabilities for handling various PDF content formats.
- **Sentence Transformers**: A library built on top of Hugging Face's transformers. It's designed for generating sentence embeddings, which are crucial for understanding and comparing text segments semantically.
- **Scikit-learn (`KMeans`)**: From this popular machine learning library, we use the KMeans clustering algorithm to group text segments into clusters based on their semantic similarity.
- **NumPy**: Essential for handling large arrays and matrices of numerical data, which is fundamental for any data processing in machine learning tasks.
- **re (Regular Expressions)**: This module is used for text processing, specifically for identifying the title and subtitle patterns on which the text is segmented.
- **time**: Used to pause between API calls when the estimated token usage approaches the configured limit.
- **FPDF**: A library for assembling and saving our summaries into a PDF format, allowing for easy distribution and archiving of the final document.

In [163]:
import fitz  # PyMuPDF
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np
import time
import re
from fpdf import FPDF

## Setup for OpenAI API Integration

This section of the notebook establishes our connection to the OpenAI API, which is essential for accessing GPT-3.5 to perform advanced natural language processing tasks, including generating transitions and summarizing text:

- **os**: This module is used for environment management, in particular for reading the API key from an environment variable rather than writing it into the notebook.
- **openai**: The OpenAI Python client library. Its `OpenAI` class creates the client we use to call the chat completions endpoint.



In [178]:
import os
from openai import OpenAI

client = OpenAI(
    # Read the key from the OPENAI_API_KEY environment variable instead of hardcoding it
    api_key=os.environ.get("OPENAI_API_KEY"),
)

## 1. PDF Text Extraction

In this section, we focus on extracting textual content from a PDF document, which is the first critical step in our document summarization process. Efficient and accurate extraction of text from PDFs is essential as it forms the basis for all subsequent processing and analysis steps. 

We will use the `PyMuPDF` library, known in our code as `fitz`, to handle this task. This library provides robust support for interacting with PDF files, enabling us to read pages and extract text efficiently. Here's what the upcoming code will accomplish:

- **Open the PDF file**: We'll load the PDF document into our Python environment.
- **Read and Extract Text**: Iterate through each page of the PDF to gather all textual content.
- **Return the Text**: The text of all pages is concatenated into a single string and returned for further processing.


In [88]:
def extract_text_from_pdf(file_path):
    # Open the provided PDF file
    doc = fitz.open(file_path)
    text = ''
    
    # Iterate through each page of the PDF
    for page in doc:
        # Extract text from the page
        text += page.get_text()
    
    # Close the document
    doc.close()
    return text
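
Text extracted from PDFs often contains words hyphenated across line breaks and irregular spacing. The sketch below shows one optional cleanup pass that could run on the extracted string. It is not part of the original pipeline: the helper name `normalize_extracted_text` and its rules are assumptions for illustration. Apart from rejoining hyphenated words, it leaves line breaks in place, because the segmentation step relies on them.

```python
import re

def normalize_extracted_text(text):
    """Hypothetical cleanup pass for text returned by extract_text_from_pdf."""
    # Rejoin words split by a hyphen at the end of a line ("implemen-\ntation")
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs, but leave newlines alone
    text = re.sub(r"[ \t]+", " ", text)
    # Reduce three or more consecutive newlines to a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(normalize_extracted_text("implemen-\ntation  of   text\n\n\n\nnext"))
```

One caveat: a genuinely hyphenated compound that happens to break at a line end is merged into one word, so this rule trades a small loss in fidelity for cleaner input.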
    

## 2. Text Segmentation

After extracting the text from the PDF, our next task is to segment it into logical sections. Proper segmentation is crucial as it allows us to process and summarize each part of the document more effectively, especially in large and complex documents where content varies significantly from one section to another.

### How We Segment the Text
We use a function called `split_into_sections`, which employs regular expressions to identify patterns that typically denote the start of a new section. These patterns include:

- Numbers indicating sections (e.g., `1`, `2.1`, etc.).
- Capital letters possibly indicating appendices or major section starts (e.g., `A`, `B.2`).
- These are often followed by titles in all caps or initial caps that help in recognizing the beginning of a section.

**Regex Approach**: 
The function defines a regex pattern that matches lines which are likely to be titles based on the above criteria. This helps in splitting the document text at these points:

- The regex pattern used is `r'(\n(\d+|\d+\.\d+|[A-Z]|\w\.\d+)\s*\n[A-Z])'`, which looks for new lines followed by a pattern of numbers or letters that are typical of titles or headers, followed by a new line starting with a capital letter.
- We then create a list of start indices for each match found by the regex, which marks where each new section begins.
- The text is split at these indices, so each segment runs from one detected heading to the next. Any text before the first heading becomes its own segment.

In [89]:
def split_into_sections(text):
    # Regex pattern to match the described subtitle formats
    pattern = r'(\n(\d+|\d+\.\d+|[A-Z]|\w\.\d+)\s*\n[A-Z])'
    sections = []
    start_indices = [0] + [match.start(1) for match in re.finditer(pattern, text)]
    
    # Split the text at each start index
    start_indices.append(len(text))  # Append the end of the text to handle the last section
    for i in range(len(start_indices) - 1):
        section = text[start_indices[i]:start_indices[i+1]].strip()
        if section:
            sections.append(section)

    return sections
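
To see what the pattern does and does not detect, the short check below runs it on a toy string. The sample text and the helper name `heading_offsets` are made up for illustration; only the regex itself comes from `split_into_sections`.

```python
import re

# Same pattern as in split_into_sections
PATTERN = r'(\n(\d+|\d+\.\d+|[A-Z]|\w\.\d+)\s*\n[A-Z])'

def heading_offsets(text):
    """Return the character offsets where the pattern detects a heading."""
    return [match.start(1) for match in re.finditer(PATTERN, text)]

# Toy text: each section marker sits on its own line, the title on the next line
sample = (
    "Preface text.\n"
    "1\nIntroduction\nBody of the intro.\n"
    "2.1\nMethods\nBody of methods.\n"
    "A\nAppendix\nExtra material."
)
offsets = heading_offsets(sample)
# Each offset points at the newline before the marker, so the marker is the next line
markers = [sample[i:].split("\n")[1] for i in offsets]
print(markers)  # ['1', '2.1', 'A']

# A marker and title on the same line are not detected by this pattern
inline = "Preface text.\n1 Introduction\nBody of the intro."
print(heading_offsets(inline))  # []
```

The second case matters in practice: documents whose headings are extracted as `1 Introduction` on a single line would come back as one large section and would need a different pattern.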


## 3. Text Summarization

Once we have segmented the text into logical sections, the next step in our text summarization pipeline is to convert these text segments into a numerical format that machine learning models can process. This is achieved through the generation of text embeddings.

### What are Text Embeddings?
Text embeddings are vector representations of text where words or phrases with similar meanings have a similar representation. By converting text into embeddings, we can perform various types of semantic analyses, such as clustering similar texts together based on their content.
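
As a small illustration of "similar meaning, similar representation", the sketch below compares toy vectors with cosine similarity, a common way to measure how closely two embeddings point in the same direction. The three-dimensional vectors are invented for the example; real sentence embeddings have hundreds of dimensions and come from a trained model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Invented toy "embeddings": two related texts and one unrelated text
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.2, 0.9]

print(cosine_similarity(cat, kitten))   # close to 1: similar direction
print(cosine_similarity(cat, invoice))  # close to 0: nearly orthogonal
```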

### Using Sentence Transformers for Embeddings
In this section, we use the `SentenceTransformer` class from the `sentence-transformers` library, which is specifically designed for generating high-quality sentence embeddings. The model we use is `all-MiniLM-L6-v2`:

- **Model Choice**: `all-MiniLM-L6-v2` is a lightweight model trained to produce embeddings that capture the semantic meaning of sentences efficiently.
- **Process**: We pass our text segments to the model's `encode` method, which by default returns a NumPy array with one row per segment. Each row is a vector that numerically represents the corresponding text segment.

The embeddings generated here will be used in subsequent steps for clustering the text segments based on their semantic similarity, enabling us to summarize content that discusses similar topics.

### Implementation Details
The function `generate_embeddings` takes a list of text segments as input and returns their corresponding embeddings. This simplifies downstream tasks like clustering and summarization, as dealing with numerical data is often more straightforward and effective in machine learning workflows.

In [90]:
def generate_embeddings(text_segments):
    # Code to generate embeddings for text segments
    model = SentenceTransformer('all-MiniLM-L6-v2')  # Lightweight model for generating embeddings
    embeddings = model.encode(text_segments)
    return embeddings


### Clustering Text Sections for Summarization
In our text summarization workflow, one key step is to group similar text segments into clusters. This lets us summarize a large document with one API call per cluster rather than one per segment, since all segments in a cluster are later concatenated and summarized together. Here is an overview of the clustering process implemented in the `cluster_segments` function:

### Purpose of Clustering
The primary goal of clustering in this context is to:

- **Reduce Redundancy**: Similar text segments are summarized together, which reduces repetitive information in the final summary.
- **Enhance Diversity**: Each cluster contributes its own summary, so the result covers the range of topics or themes presented in the document.

In [91]:
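def cluster_segments(embeddings):
    # Use at most 250 clusters; with fewer than 250 sections, the number of
    # clusters equals the number of sections, so little or no grouping happens
    n_clusters = min(250, len(embeddings))
    # n_init is set explicitly so results do not depend on the scikit-learn version's default
    clustering_model = KMeans(n_clusters=n_clusters, n_init=10, random_state=42)
    clustering_model.fit(embeddings)
    # One integer label per section, in the same order as the embeddings
    cluster_labels = clustering_model.labels_
    return cluster_labels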
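
Because `min(250, len(embeddings))` assigns one cluster per section for shorter documents, it can be useful to derive the cluster count from the number of sections instead. The helper below is only a sketch of one possible heuristic. The name `choose_cluster_count` and the default of four sections per cluster are assumptions, not values from this project.

```python
def choose_cluster_count(n_sections, sections_per_cluster=4, max_clusters=250):
    """Hypothetical heuristic: aim for a fixed average number of sections per cluster."""
    if n_sections <= 0:
        return 0
    # At least one cluster, never more than max_clusters
    return max(1, min(max_clusters, n_sections // sections_per_cluster))

for n in (3, 100, 2000):
    print(n, choose_cluster_count(n))
```

The returned value could be passed to `KMeans(n_clusters=...)` in place of the fixed cap. The right ratio depends on the document and on how much text the summarization model can accept per request.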
    

### Concatenating Text Sections Within Clusters
For our text summarization workflow, after segmenting and clustering text sections based on their semantic similarities, it is essential to manage the text within these clusters effectively. The `concatenate_cluster_samples` function addresses this by concatenating all text sections within each identified cluster. This step prepares the text for more comprehensive processing, such as summarization. 


In [179]:
def concatenate_cluster_samples(cluster_labels, text_sections):

    from collections import defaultdict
    cluster_content = defaultdict(list)

    # Group all sections by their cluster labels
    for label, section in zip(cluster_labels, text_sections):
        cluster_content[label].append(section)

    # Concatenate all sections in each cluster into a single string
    concatenated_sections = []
    for sections in cluster_content.values():
        concatenated_sections.append(" ".join(sections))

    return concatenated_sections
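
The function above returns the concatenated texts in the order in which cluster labels first appear, and the labels themselves are discarded. If the labels are needed later, for example to trace a summary back to its cluster, a variant can return a dictionary instead. The sketch below is such a variant; the name `group_sections_by_cluster` is hypothetical.

```python
from collections import defaultdict

def group_sections_by_cluster(cluster_labels, text_sections):
    """Hypothetical variant: map each cluster label to its concatenated text."""
    grouped = defaultdict(list)
    for label, section in zip(cluster_labels, text_sections):
        grouped[label].append(section)
    # Sort by label so the output order does not depend on the input order
    return {label: " ".join(grouped[label]) for label in sorted(grouped)}

# Toy input: four sections assigned to three clusters
labels = [2, 0, 2, 1]
sections = ["alpha", "beta", "gamma", "delta"]
print(group_sections_by_cluster(labels, sections))
# {0: 'beta', 1: 'delta', 2: 'alpha gamma'}
```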

## Text Summarization with GPT-3.5-Turbo
In our document processing workflow, one of the key functionalities is the summarization of text segments. This is carried out by the `summarize_text` function, which leverages the capabilities of OpenAI's GPT-3.5-turbo model. Here's a detailed look at how this function operates and its significance in the overall process:

### Purpose of the Summarize Text Function
The `summarize_text` function is designed to condense lengthy text segments into concise, informative summaries. This helps in distilling essential information from large volumes of text, making it more manageable and easier to understand.



In [185]:
def summarize_text(segment):
    # Build the summarization prompt for a single text segment
    prompt = (f"Convert the following into a concise, factual summary without introductory phrases. The conversion should be narrated. "
              f"Focus on key information and outcomes:\n\n{segment}")
    try:
        # Make an API call to OpenAI's GPT-3.5-turbo model
        response = client.chat.completions.create(
            messages=[
                {"role": "system", "content": "You are a helpful assistant."},
                {"role": "user", "content": prompt}
            ],
            model="gpt-3.5-turbo",  
        
            max_tokens=150,  # You can adjust this based on how concise you want the summary to be
            temperature=0.2  # Controls randomness: lower values make the output more deterministic
        )
        # Extract the text summary from the response
        summary = response.choices[0].message.content.strip()
        return summary
    except Exception as e:
        print(f"An error occurred: {e}")
        return "Error in summarization."

### Estimating Token Count for Text
Before sending text segments to the API, it helps to estimate how many tokens they contain. Tokens are the subword units a model reads, so a token count is usually somewhat higher than a word count. This estimate helps manage API usage, particularly with services that limit tokens per request or per minute, like OpenAI's GPT models. The `estimate_token_count` function uses the number of whitespace-separated words as a rough lower-bound approximation.


In [96]:
def estimate_token_count(text):
    # Rough estimate: count whitespace-separated words (tends to undercount real tokens)
    return len(text.split())
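
Since a word count tends to underestimate the number of tokens, a slightly more conservative estimate can combine it with a character-based rule of thumb. The sketch below assumes roughly four characters per token, a commonly cited approximation for English text. The function name and the default ratio are assumptions, and a tokenizer library would be needed for exact counts.

```python
import math

def estimate_token_count_conservative(text, chars_per_token=4):
    """Hypothetical estimate: the larger of a word count and a character-based count."""
    word_estimate = len(text.split())
    char_estimate = math.ceil(len(text) / chars_per_token)
    return max(word_estimate, char_estimate)

print(estimate_token_count_conservative("hello world"))  # 3
```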

## 4. Generating Cohesive Narrative
In the process of compiling a comprehensive document from segmented summaries, it's crucial to ensure that the narrative flows logically and smoothly from one section to the next. The `create_fluid_summary` function is designed to achieve this by generating cohesive transitions between consecutive summaries, using OpenAI's GPT model. This approach enhances the readability and continuity of the final document, making it appear as a unified narrative rather than a collection of disjointed sections.

### Function Overview
The `create_fluid_summary` function integrates individual summaries into a seamless narrative by:

* **Appending each summary to a collective text**
* **Dynamically generating and inserting transition sentences between summaries using GPT-3.5-turbo**

In [186]:
def create_fluid_summary(summaries):
    # Nothing to join if no summaries were produced
    if not summaries:
        return ""

    full_text = []

    for i in range(len(summaries) - 1):
        # Append the current summary
        full_text.append(summaries[i])
        
        # Create the prompt for transition
        prompt = (
            f"Create a smooth transition sentence between the following two sections:\n\n"
            f"---\nSection {i+1}:\n{summaries[i]}\n\n"
            f"---\nSection {i+2}:\n{summaries[i+1]}\n\n"
            "Answer only with the transition :"
        )
        
        try:
            # Request the transition from GPT-3.5-turbo
            response = client.chat.completions.create(
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
            
                model="gpt-3.5-turbo",

                max_tokens=50,  # Adjust based on how long you expect transitions to be
                temperature=0.4,  # Adjust for more creative transitions
            )
            transition = response.choices[0].message.content.strip()
            full_text.append(transition)  # Append the generated transition
        except Exception as e:
            print(f"An error occurred while generating transition: {e}")

    # Append the last summary since it has no subsequent summary to transition to
    full_text.append(summaries[-1])

    # Join all parts into a single cohesive narrative
    return "\n\n".join(full_text)
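
The interleaving logic can be tried out without calling the API by passing the transition generator in as a function. The sketch below is a hypothetical variant written for that purpose: `interleave_with_transitions` and its `make_transition` parameter are not part of the original notebook. In real use, `make_transition` would wrap the chat completion request shown above.

```python
def interleave_with_transitions(summaries, make_transition):
    """Join summaries, inserting a generated transition between each consecutive pair."""
    if not summaries:
        return ""
    parts = [summaries[0]]
    for previous, current in zip(summaries, summaries[1:]):
        try:
            transition = make_transition(previous, current)
        except Exception as error:
            # Skip the transition but keep the summaries if generation fails
            print(f"An error occurred while generating transition: {error}")
            transition = ""
        if transition:
            parts.append(transition)
        parts.append(current)
    return "\n\n".join(parts)

# Stub transition generator used only for this demonstration
def stub(previous, current):
    return f"[from {previous} to {current}]"

print(interleave_with_transitions(["A", "B", "C"], stub))
```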


## 5. Output as PDF
Once we have our cohesive narrative or document summary, the next step is to preserve and share it in a widely supported format. The `save_text_as_pdf` function serves this purpose by converting the final text into a PDF file. This makes the summary easy to distribute and presents it in a format suitable for reading and printing.

In [182]:
def save_text_as_pdf(text, filename="Summary.pdf"):
    pdf = FPDF()
    pdf.add_page()
    # Register the Unicode TrueType font before selecting it;
    # arial.ttf must be available at this path for add_font to succeed
    pdf.add_font("Arial", "", "arial.ttf", uni=True)
    pdf.set_font("Arial", size=8)

    # Add text to PDF
    pdf.multi_cell(0, 10, text)  # Width = 0 means the cell is extended to the right margin

    # Save the PDF to a file
    pdf.output(filename)
    print(f"PDF successfully saved as {filename}")
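
If no TrueType font file is available, a common fallback is to use one of FPDF's built-in fonts, which cover only Latin-1 characters. In that case the text has to be reduced to Latin-1 first. The helper below is a sketch of such a step under that assumption; the name `to_latin1_safe` and the replacement table are illustrative choices.

```python
def to_latin1_safe(text):
    """Hypothetical helper: replace characters that cannot be encoded as Latin-1."""
    # Map common typographic characters to plain ASCII equivalents
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en and em dashes
        "\u2026": "...",                # ellipsis
    }
    for source, target in replacements.items():
        text = text.replace(source, target)
    # Anything still outside Latin-1 becomes "?"
    return text.encode("latin-1", errors="replace").decode("latin-1")

print(to_latin1_safe("\u201cGPT\u201d \u2013 r\u00e9sum\u00e9 \u4e2d"))
```

Accented Latin characters such as "é" survive this step, while characters from other scripts are replaced, so this fallback is only suitable for documents written mostly in Western European languages.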


## Main Function Overview

The `main` function orchestrates a comprehensive sequence of operations to process a PDF document into a coherent summary. Below is a step-by-step explanation of each component involved in this process:

- **Extract Text**: Retrieves all readable content from the specified PDF file.

- **Segment Text**: Divides the extracted text into sections based on predefined patterns, facilitating better organization of the content.

- **Generate Embeddings**: Converts each section into numerical representations (embeddings) that capture their semantic meaning.

- **Cluster Text Sections**: Groups similar text sections into clusters to consolidate related content, enhancing summarization efficiency.

- **Concatenate Cluster Samples**: Merges all text sections within each cluster to form a comprehensive view of each cluster’s content.

- **Generate Summaries**: Summarizes the concatenated text of each cluster, producing a concise overview of its content.

- **Create Cohesive Narrative**: Integrates all individual summaries into a single fluid narrative using transitions generated between them.

- **Output as PDF**: Saves the final cohesive narrative into a PDF document, ready for distribution or archival.

### Function Execution Details
- The function prints the number of text sections, embeddings, and cluster labels at each stage, giving a quick check that every step produced the expected amount of data.
- It limits API usage by pausing for 100 seconds whenever the running token estimate would exceed a budget of 30,000 tokens. Because the estimate is a rough word count, this reduces the risk of hitting API rate limits rather than ruling it out.
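
The pause-and-reset logic described above can also be written as a small reusable object. The sketch below is one way to do this, not the notebook's implementation: the class name `TokenBudget`, the 60-second window, and the injectable `clock` and `sleep` parameters are assumptions made so the behavior can be checked without actually waiting.

```python
import time

class TokenBudget:
    """Track estimated tokens per time window and pause when the budget would be exceeded."""

    def __init__(self, max_tokens_per_minute=30000, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_tokens = max_tokens_per_minute
        self.window_seconds = window_seconds
        self.clock = clock
        self.sleep = sleep
        self.window_start = clock()
        self.tokens_used = 0

    def reserve(self, estimated_tokens):
        """Call before each request; blocks if the current window is full."""
        elapsed = self.clock() - self.window_start
        if elapsed >= self.window_seconds:
            # The window has expired, so start a new one
            self.window_start = self.clock()
            self.tokens_used = 0
        elif self.tokens_used + estimated_tokens > self.max_tokens:
            # Wait out the remainder of the window, then start a new one
            self.sleep(self.window_seconds - elapsed)
            self.window_start = self.clock()
            self.tokens_used = 0
        self.tokens_used += estimated_tokens

# Demonstration with a fake clock so nothing actually sleeps
waits = []
now = [0.0]

def fake_clock():
    return now[0]

def fake_sleep(seconds):
    waits.append(seconds)
    now[0] += seconds

budget = TokenBudget(max_tokens_per_minute=100, clock=fake_clock, sleep=fake_sleep)
budget.reserve(60)   # fits in the first window
now[0] = 10.0
budget.reserve(60)   # would exceed 100, so it waits for the remaining 50 seconds
print(waits)         # [50.0]
```

Compared with a fixed 100-second pause, waiting only for the remainder of the window avoids idle time, at the cost of relying on the same rough token estimate.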

In [183]:
def main():
    file_path = './2303.08774v6.pdf'
    full_text = extract_text_from_pdf(file_path)
    
    sections = split_into_sections(full_text)
    
    print("sections")
    #for section in sections:  
    #    print(section)
    #    print("-------------------------------------------------------")
    
    print(len(sections))

    embeddings = generate_embeddings(sections)
    print("embeddings")
    print(len(embeddings))
    #print(embeddings[:10])
    
    cluster_labels = cluster_segments(embeddings)
    print("cluster labels")
    print(len(cluster_labels))
    #print(cluster_labels)
    
    concatenated_samples = concatenate_cluster_samples(cluster_labels, sections)
    print("Concatenated cluster samples: ", len(concatenated_samples))
    print(len(concatenated_samples[-1]))

    summaries = []
    pause_duration = 100  # Pause duration in seconds
    tokens_used = 0
    max_tokens_per_minute = 30000

    for index, sample in enumerate(concatenated_samples):
        print("Sample" + str(index))
        #print(sample)
        estimated_tokens = estimate_token_count(sample) + 150  # Add expected output tokens

        if tokens_used + estimated_tokens > max_tokens_per_minute:
            print("Approaching token limit, pausing...")
            time.sleep(pause_duration)
            tokens_used = 0  # Reset tokens after pause

        summary = summarize_text(sample)
        print(summary)
        summaries.append(summary)
        tokens_used += estimated_tokens  # Update tokens used

    
    final_summary = create_fluid_summary(summaries)
    print("Final Summary :", final_summary)
    save_text_as_pdf(final_summary, filename='Summary_GPT-4_Technical_Report.pdf')


In [None]:
if __name__ == "__main__":
    main()

## Conclusion

Throughout this notebook, we have demonstrated a process for extracting text from a PDF document, segmenting the text, generating semantic embeddings, and clustering the text for efficient summarization. We concatenated the sections in each cluster and generated a concise summary for each one. These summaries were then integrated into a cohesive narrative using transition sentences generated by OpenAI's GPT-3.5-turbo model.

The final narrative was written to a formatted PDF document, making it suitable for a wide range of applications including academic reviews, business reporting, or any scenario requiring quick digestion of large documents.

### Achievements
- **Efficient Text Handling**: Managed to efficiently process large volumes of text by leveraging clustering to reduce redundancy and focus on diversity.
- **NLP Techniques**: Utilized state-of-the-art NLP techniques and AI technologies to generate meaningful and context-aware summaries.
- **Automation and Scalability**: Established a workflow that can be automated and scaled for handling multiple documents or larger datasets.

### Future Directions
- **Improvement of Text Segmentation**: Further refine the text segmentation to capture more nuanced distinctions between sections and improve the accuracy of clustering.
- **Enhanced Summary Personalization**: Develop methods to tailor summaries more closely to specific user needs or preferences.
- **Broader Language Support**: Extend the model's capabilities to include multiple languages, increasing its applicability in global contexts.

This notebook shows how traditional data processing techniques can be combined with AI-driven tools to make long documents easier to digest. The methods and processes outlined here can be adapted to other content types and industries.
