<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_012_huggingface_PDF_summarizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Introduction to Hugging Face PDF Summarizer Notebook**

This notebook demonstrates how to build a PDF summarization tool using Hugging Face's Transformers library. It leverages a pre-trained summarization model to process and summarize large amounts of text extracted from PDFs, making it easier to digest long documents quickly. Key components include:

1. **Summarization Pipeline**: Loads a pre-trained summarization model (`sshleifer/distilbart-cnn-12-6`). Text is fed to it in short chunks because the model accepts only a limited input length.
2. **generate_summary Function**: Extracts text from each page of a PDF, splits the text into manageable chunks, and applies the summarization model to each chunk.
3. **Generating and Saving Summaries**: The notebook processes PDFs in a specified directory, summarizes the text content, and saves the resulting summaries as text files.

Through this notebook, users can learn how to effectively summarize lengthy documents by handling PDF text extraction, chunk processing, and model inference using state-of-the-art NLP techniques from Hugging Face.

## Install Libraries

### 1. **`langchain` (`!pip install langchain`)**
   - **Purpose**: LangChain is a framework for developing applications powered by large language models (LLMs). It helps streamline the process of integrating language models with external data sources, chains of reasoning, and other utilities.
   - **Common Uses**:
     - Creating LLM-based applications (e.g., chatbots, question-answering systems).
     - Building complex workflows where language models interact with APIs, databases, or other tools.
     - Integrating language models with external data sources for document retrieval and advanced reasoning.

### 2. **`unstructured` (`!pip install unstructured`)**
   - **Purpose**: Unstructured is a library for preprocessing and transforming unstructured data (such as text from documents, PDFs, emails, etc.) into structured formats that are easier to analyze and process.
   - **Common Uses**:
     - Parsing and extracting information from a variety of unstructured document formats (e.g., PDFs, emails, Word documents).
     - Preparing data for natural language processing (NLP) or machine learning tasks by cleaning and structuring raw text.

### 3. **`openai` (`!pip install openai`)**
   - **Purpose**: The `openai` library is the official Python client for OpenAI’s API, providing access to powerful models such as GPT-3, GPT-4, Codex, and DALL·E for tasks like natural language understanding, code generation, and image generation.
   - **Common Uses**:
     - Interacting with OpenAI’s language models for text completion, summarization, translation, and code generation.
     - Generating images using DALL·E.
     - Building AI-powered applications, such as chatbots, language translation tools, or content generation systems.

### 4. **`chromadb` (`!pip install chromadb`)**
   - **Purpose**: ChromaDB (Chroma Database) is an open-source embedding database used to manage and query vector embeddings efficiently. It is designed for applications like semantic search, similarity search, and working with large-scale vectorized data.
   - **Common Uses**:
     - Storing and querying vector embeddings for tasks such as document similarity, image search, and semantic search.
     - Managing embeddings from language models or other deep learning models for fast, efficient retrieval.

### 5. **`Cython` (`!pip install Cython`)**
   - **Purpose**: Cython is a programming language that allows you to write C extensions for Python. It helps speed up Python code by converting it into C, which can be compiled to run much faster than pure Python code.
   - **Common Uses**:
     - Optimizing performance-critical sections of Python code by converting them to C.
     - Creating Python bindings for C/C++ libraries.
     - Improving the performance of Python programs, particularly for computational tasks like numerical calculations and data processing.

### 6. **`tiktoken` (`!pip install tiktoken`)**
   - **Purpose**: `tiktoken` is a fast and efficient tokenization library designed to work with OpenAI's language models. It’s optimized for use with large language models like GPT-3 and GPT-4 and is used for splitting text into tokens that language models can understand.
   - **Common Uses**:
     - Tokenizing text efficiently for use with OpenAI’s language models.
     - Ensuring that text inputs and outputs are formatted correctly for language model processing (e.g., controlling the number of tokens in a request).
     - Preprocessing large text datasets for language model tasks.

### Summary for Documentation:

- **LangChain**: A framework for building applications powered by large language models, enabling interactions with external data sources and complex workflows.
- **Unstructured**: A library for transforming unstructured data (like PDFs or emails) into structured formats suitable for text analysis and processing.
- **OpenAI**: The official Python client for OpenAI's API, providing access to powerful models for tasks like text generation, code generation, and image creation.
- **ChromaDB**: An embedding database designed to manage and query vector embeddings efficiently for tasks like similarity search and semantic search.
- **Cython**: A tool for improving the performance of Python code by converting it into C for faster execution, especially in computationally intensive applications.
- **Tiktoken**: A tokenizer optimized for OpenAI's language models, used for splitting text into tokens efficiently for language model tasks.


In [None]:
!pip install langchain
!pip install unstructured
!pip install openai
!pip install chromadb
!pip install Cython
!pip install tiktoken
!pip install transformers
!pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [None]:
import os
import openai
from transformers import pipeline
import PyPDF2

file_path = '/content/Art Collector personality types_ finish.pdf'

/content


In [None]:


# Load the summarization pipeline
summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')

# Function to generate summary for a PDF
def generate_summary(pdf_path, chunk_size=1000, max_length=150, min_length=50):
    with open(pdf_path, 'rb') as file:
        pdf_text = ""
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            pdf_text += page.extract_text()

        # Split text into chunks
        chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]

        # Generate summaries for each chunk
        summaries = []
        for i, chunk in enumerate(chunks):
            summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
            summaries.append(summary[0]['summary_text'])

        return "\n".join(summaries)

if __name__ == '__main__':
    pdf_dir = '/content/'
    output_dir = '/content/'

    for filename in os.listdir(pdf_dir):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(pdf_dir, filename)
            pdf_summary = generate_summary(pdf_path)

            print(f"PDF: {filename}\nSummary:\n{pdf_summary}\n{'=' * 30}")

Your max_length is set to 150, but your input_length is only 25. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


PDF: Art Collector personality types_ finish.pdf
Summary:
 The Niche Enthusiast looks for very specific art forms, be it abstract, surrealism, or any niche genre . The Novice Collector is eager, curious, and often rely on others' opinions or visible  trends . For artists, aligning  with their niche preference is key to gain their attention .
 The Investment Collector buys artworks they believe will appreciate value . The Speculative Collector takes risks on unknown or lesser-known artists hoping that their value will explode in the future . Give them VIP access to artists with limited editions or exclusive previews to entice them .
 The Emotional Buyer buys art that resonates with them on a deep, emotional level . The Trend Follower buys artworks that are trending or popular at the moment . For artists, storytelling and personal connection can be the key to attracting these collectors .
 The Philanthropic/Patron Collector: They believe in supporting artists . The Social Collector: For 

### **Summarization Pipeline**:

   - This loads a pre-trained summarization model from Hugging Face's Transformers library. The model is used later to summarize chunks of text extracted from PDFs.

In [None]:
summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')

### **`generate_summary` Function**:

   - **Purpose**: This function extracts text from the PDF at the given `pdf_path`, splits it into chunks, and uses the summarization pipeline to generate summaries.
   - `chunk_size=1000`: The size of each text chunk, in characters, passed to the summarizer. Chunking is needed because the model accepts only a limited number of input tokens (1024 for this model).
   - `max_length` and `min_length`: These control the length, in tokens, of the summary generated for each chunk.


In [None]:
def generate_summary(pdf_path, chunk_size=1000, max_length=150, min_length=50):
    with open(pdf_path, 'rb') as file:
        pdf_text = ""
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            pdf_text += page.extract_text()

### **Splitting Text and Generating Summaries**:

   - The PDF text is split into chunks of size `chunk_size`. Then, for each chunk, the `summarizer` is called to generate a summary.
   - The generated summaries for each chunk are combined into one overall summary.

In [None]:
chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]

summaries = []
for i, chunk in enumerate(chunks):
    summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
    summaries.append(summary[0]['summary_text'])
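
Slicing by a fixed character count can cut a chunk off in the middle of a sentence, which may be why some of the generated summaries end abruptly. The sketch below is one possible alternative that groups whole sentences into chunks. The function name `chunk_by_sentences`, its `max_chars` parameter, and the simple regex sentence splitter are assumptions made for illustration; they are not part of the notebook.

```python
import re

def chunk_by_sentences(text, max_chars=1000):
    """Group whole sentences into chunks of at most max_chars characters."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Fallback: hard-split a single sentence longer than max_chars.
            while len(sentence) > max_chars:
                chunks.append(sentence[:max_chars])
                sentence = sentence[max_chars:]
            current = sentence
    if current:
        chunks.append(current)
    return chunks

sample = "First sentence here. Second one follows! Is this the third? Yes."
print(chunk_by_sentences(sample, max_chars=40))
# ['First sentence here. Second one follows!', 'Is this the third? Yes.']
```

The result could replace the list comprehension that builds `chunks`. Text extracted from PDFs often has irregular punctuation, so the regex splitter should be treated as a rough heuristic.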

### **Main Loop**:

   - **Purpose**: The script scans the `pdf_dir` directory for PDF files and generates summaries for each one.
   - It prints the summary to the console. You might want to save the summaries to files instead of just printing them.

In [None]:
if __name__ == '__main__':
    pdf_dir = '/content/'
    output_dir = '/content/'

    for filename in os.listdir(pdf_dir):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(pdf_dir, filename)
            pdf_summary = generate_summary(pdf_path)

            print(f"PDF: {filename}\nSummary:\n{pdf_summary}\n{'=' * 30}")


### Improvements and Enhancements:

**Handle Empty or Non-Extractable PDFs**:
   - Some PDFs have no extractable text, for example because they are scanned images or encrypted. Add error handling so that these files do not crash the run.

   **Improved PDF Reading with Error Handling**:
   ```python
   def generate_summary(pdf_path, chunk_size=1000, max_length=150, min_length=50):
       try:
           with open(pdf_path, 'rb') as file:
               pdf_text = ""
               pdf_reader = PyPDF2.PdfReader(file)
               for page in pdf_reader.pages:
                   text = page.extract_text()
                   if text:  # Check if text extraction was successful
                       pdf_text += text

               # If no text is extracted, return an error message
               if not pdf_text:
                   return "No text could be extracted from this PDF."
               
               # Split text into chunks
               chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]

               # Generate summaries for each chunk
               summaries = []
               for i, chunk in enumerate(chunks):
                   summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
                   summaries.append(summary[0]['summary_text'])

               return "\n".join(summaries)
       except Exception as e:
           return f"Error processing PDF: {str(e)}"
   ```

3. **Save Summaries to Files**:
   - Instead of printing the summaries, you could save them to a text file for later review. For example:

   **Saving Summaries**:
   ```python
   if __name__ == '__main__':
       pdf_dir = '/content/'
       output_dir = '/content/summaries'

       if not os.path.exists(output_dir):
           os.makedirs(output_dir)

       for filename in os.listdir(pdf_dir):
           if filename.lower().endswith('.pdf'):
               pdf_path = os.path.join(pdf_dir, filename)
               pdf_summary = generate_summary(pdf_path)

               # Save summary to a file
               summary_filename = os.path.join(output_dir, f"{filename[:-4]}_summary.txt")
               with open(summary_filename, 'w') as summary_file:
                   summary_file.write(pdf_summary)

               print(f"Summary saved for: {filename}")
   ```

4. **Handle Long Documents**:
   - The code splits the PDF text by character count (`chunk_size=1000`), but the model's input limit is measured in tokens, not characters. The BART-based models used in this notebook accept at most 1024 input tokens, so make sure each chunk stays under that limit. If needed, lower `chunk_size` or split the text with the model's tokenizer so that chunks are measured in tokens.

5. **Enhance Efficiency for Multiple PDFs**:
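   - If you’re processing multiple PDFs, you can run files in parallel using multi-threading or multiprocessing to speed things up. Parallel workers mainly help with reading and parsing the files. For the model call itself, the pipeline accepts a list of texts together with a `batch_size` argument, which is usually a simpler first step.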
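
The token-based splitting suggested in the list above could look roughly like the sketch below. It assumes a tokenizer object that exposes `encode` and `decode` the way Hugging Face tokenizers do. The helper name `chunk_by_tokens`, the default of 900 tokens, and the `WhitespaceTokenizer` stand-in (included only so the example runs without downloading a model) are all hypothetical.

```python
def chunk_by_tokens(text, tokenizer, max_tokens=900):
    """Split text into chunks of roughly max_tokens tokens each."""
    # Encode once without special tokens; the pipeline adds its own later.
    ids = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    for start in range(0, len(ids), max_tokens):
        window = ids[start:start + max_tokens]
        chunks.append(tokenizer.decode(window, skip_special_tokens=True))
    return chunks


class WhitespaceTokenizer:
    """Hypothetical stand-in so the sketch runs offline; not a real tokenizer."""

    def __init__(self):
        self.vocab = []

    def encode(self, text, add_special_tokens=False):
        ids = []
        for word in text.split():
            if word not in self.vocab:
                self.vocab.append(word)
            ids.append(self.vocab.index(word))
        return ids

    def decode(self, ids, skip_special_tokens=True):
        return " ".join(self.vocab[i] for i in ids)


demo = chunk_by_tokens("one two three four five", WhitespaceTokenizer(), max_tokens=2)
print(demo)  # ['one two', 'three four', 'five']
```

In the notebook, the tokenizer already loaded by the pipeline (`summarizer.tokenizer`) could be passed instead of the stand-in. Decoded text does not always re-encode to exactly the same number of tokens, which is why the default leaves some headroom below 1024.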

### Improved Final Code with Suggestions Implemented:

### Summary of Changes:
- Added error handling for PDFs that may not contain extractable text.
- Added the ability to save summaries to text files instead of just printing them.
- Created the output directory automatically when it does not already exist.


## Full Code

In [None]:
import os
import PyPDF2
from transformers import pipeline

# Load the summarization pipeline
summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')

# Function to generate summary for a PDF
def generate_summary(pdf_path, chunk_size=1000, max_length=150, min_length=50):
    try:
        with open(pdf_path, 'rb') as file:
            pdf_text = ""
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                text = page.extract_text()
                if text:  # Check if text extraction was successful
                    pdf_text += text

            # If no text is extracted, return an error message
            if not pdf_text:
                return "No text could be extracted from this PDF."

            # Split text into chunks
            chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]

            # Generate summaries for each chunk
            summaries = []
            for i, chunk in enumerate(chunks):
                summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
                summaries.append(summary[0]['summary_text'])

            return "\n".join(summaries)
    except Exception as e:
        return f"Error processing PDF: {str(e)}"

if __name__ == '__main__':
    pdf_dir = '/content/'
    output_dir = '/content/summaries'

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for filename in os.listdir(pdf_dir):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(pdf_dir, filename)
            pdf_summary = generate_summary(pdf_path)

            # Save summary to a file
            summary_filename = os.path.join(output_dir, f"{filename[:-4]}_summary.txt")
            with open(summary_filename, 'w') as summary_file:
                summary_file.write(pdf_summary)

            # Print the summary to the console
            print(f"Summary for {filename}:\n")
            print(pdf_summary)
            print("=" * 30)  # Separator for better readability


Your max_length is set to 150, but your input_length is only 25. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


Summary saved for: Art Collector personality types_ finish.pdf


In [None]:
# Path to the directory where summaries are saved
output_dir = '/content/summaries'

# List all files in the summaries directory
for summary_filename in os.listdir(output_dir):
    # Open and read the summary file
    with open(os.path.join(output_dir, summary_filename), 'r') as summary_file:
        summary_content = summary_file.read()
        print(f"Summary from {summary_filename}:\n")
        print(summary_content)
        print("=" * 30)  # Separator for readability


Summary from Art Collector personality types_ finish_summary.txt:

 The Niche Enthusiast looks for very specific art forms, be it abstract, surrealism, or any niche genre . The Novice Collector is eager, curious, and often rely on others' opinions or visible  trends . For artists, aligning  with their niche preference is key to gain their attention .
 The Investment Collector buys artworks they believe will appreciate value . The Speculative Collector takes risks on unknown or lesser-known artists hoping that their value will explode in the future . Give them VIP access to artists with limited editions or exclusive previews to entice them .
 The Emotional Buyer buys art that resonates with them on a deep, emotional level . The Trend Follower buys artworks that are trending or popular at the moment . For artists, storytelling and personal connection can be the key to attracting these collectors .
 The Philanthropic/Patron Collector: They believe in supporting artists . The Social Collec

### Adjust Summary Length
To ensure that the summarization actually **reduces the text** rather than simply repeating it, you can adjust several aspects of the summarization process. The key lies in how you control the length of the generated summary and the granularity of the text chunks being summarized.

Here are some strategies to improve the summarization process:

### 1. **Reduce `chunk_size` and Control Summary Length**:
   - The `chunk_size` parameter controls how much text from the PDF is passed to the summarizer at once.
   - The `max_length` and `min_length` parameters control the length of each summary chunk.
   - If the PDF contains short text sections, you may want to reduce `chunk_size` and reduce the `max_length` parameter to ensure a more concise summary.

### Explanation:
- **Smaller `chunk_size` (200)**: This ensures smaller portions of text are summarized at a time, allowing the summarizer to focus on condensing the information.
- **Shorter summaries with `max_length=25` and `min_length=10`**: The summaries will now be much shorter and more concise, focusing on extracting key information rather than simply repeating the text. Because these limits are counted in tokens, a cap this low can cut a summary off mid-sentence, as the output below shows.
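
The warning in the earlier outputs ("Your max_length is set to 150, but your input_length is only 25") appears when a chunk, typically the last one, is much shorter than the others. One way to avoid it is to derive the limits from each chunk's token count instead of using fixed values. The helper below is a minimal sketch: the name `summary_bounds` and the `ratio`, `floor`, and `cap` parameters are assumptions, not settings from the notebook.

```python
def summary_bounds(input_tokens, ratio=0.5, floor=10, cap=150):
    """Return (min_length, max_length) scaled to a chunk's token count."""
    # Aim for a summary about `ratio` times the input, clamped to [floor, cap].
    max_length = max(floor, min(cap, int(input_tokens * ratio)))
    # Keep min_length at half of max_length so it never exceeds it.
    min_length = max(1, max_length // 2)
    return min_length, max_length

for n in (25, 200, 1000):
    print(n, summary_bounds(n))
# 25 (6, 12)
# 200 (50, 100)
# 1000 (75, 150)
```

The token count for a chunk could be obtained with something like `len(summarizer.tokenizer.encode(chunk))`, and the returned pair passed to the summarizer as `min_length` and `max_length`.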




In [None]:
# Load the summarization pipeline
summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')

# Function to generate summary for a PDF
def generate_summary(pdf_path, chunk_size=200, max_length=25, min_length=10):
    try:
        with open(pdf_path, 'rb') as file:
            pdf_text = ""
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                text = page.extract_text()
                if text:  # Check if text extraction was successful
                    pdf_text += text

            # If no text is extracted, return an error message
            if not pdf_text:
                return "No text could be extracted from this PDF."

            # Split text into chunks
            chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]

            # Generate summaries for each chunk
            summaries = []
            for i, chunk in enumerate(chunks):
                summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
                summaries.append(summary[0]['summary_text'])

            return "\n".join(summaries)
    except Exception as e:
        return f"Error processing PDF: {str(e)}"

if __name__ == '__main__':
    pdf_dir = '/content/'
    output_dir = '/content/summaries'

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for filename in os.listdir(pdf_dir):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(pdf_dir, filename)
            pdf_summary = generate_summary(pdf_path)

            # Save summary to a file
            summary_filename = os.path.join(output_dir, f"{filename[:-4]}_summary.txt")
            with open(summary_filename, 'w') as summary_file:
                summary_file.write(pdf_summary)

            # Print the summary to the console
            print(f"Summary for {filename}:\n")
            print(pdf_summary)
            print("=" * 30)  # Separator for better readability


Summary for Art Collector personality types_ finish.pdf:

 The Novice Collector: These are individuals just starting in the art collection . They are eager, curious, and
 For artists, understanding their tastes and guiding them can lead to  a long-term patron relationship . Understanding their
 The Niche Enthusiast looks for very specific art forms, be it abstract, surrealism
 For artists, aligning with their niche preference is key to gaining their attention . Offer educational content about art
 The ricacies of the art world can be traced back to the world's art world .
 The Investment Collector: Primarily driven by the potential return on artworks they believe will appreciate artworks . The
 Artists who’ve received media attention or have a growing reputation will appeal to this group . Unlike the Investment
 The speculative collector takes risks on unknown or lesser-known artists hoping that their value will explode in the future .
 Provide data on past art sales, auction results, 

### 3. **Summarize the Entire PDF at Once**:
If the PDF is short, instead of breaking the text into chunks, you might want to summarize the **entire text** of the PDF in one go. This could result in a more coherent summary than summarizing smaller sections separately. For longer PDFs, the full text will not fit into the model's 1024-token input, which leaves two options:

1. **Split the PDF Text into Chunks**:
   Since the model can handle only 1024 tokens at a time, you can split the text into smaller chunks and summarize each chunk separately.

2. **Change the Model**:
   Some models are better suited for handling longer texts, but most summarization models have similar token limitations (e.g., `bart-large-cnn` also has a 1024 token limit). Using OpenAI's GPT models via the OpenAI API would allow for more tokens, but that would require using the `openai` library.

### Option 1: Splitting the PDF Text into Chunks

Here’s how you can modify your code to split the text into chunks that fit within the model’s token limit:

#### Step-by-Step Update:

1. **Split Text into Chunks**: Create text chunks small enough for the model to process (under 1024 tokens).
2. **Summarize Each Chunk**: Apply the summarization model to each chunk separately.
3. **Combine the Summaries**: Concatenate the summaries from each chunk into one final summary.



### Key Changes:
1. **`chunk_text` function**: This function splits the input text into fixed-size character chunks.
   - `chunk_size=1024` is a character count, not a token count. Since a token usually spans several characters of English text, a 1024-character chunk stays well under the model's 1024-token limit.
   
2. **Chunk Summarization**: The text is split into chunks, and the summarizer is applied to each chunk separately. The summaries are then combined into one final summary.

### Benefits of This Approach:
- **Avoid Token Limit Errors**: By breaking the text into smaller chunks, you ensure that no chunk exceeds the model’s 1024 token limit.
- **Summarize Long Texts**: You can now summarize entire PDFs, regardless of their length, by handling each chunk individually.
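
For long PDFs, the joined chunk summaries can themselves be lengthy. A possible extension, which the notebook itself does not implement, is to run a second pass over the combined summaries. In the sketch below, `summarize_in_passes`, `target_chars`, and `max_passes` are hypothetical names, and `summarize_fn` stands for any callable that maps a string to a shorter string.

```python
def summarize_in_passes(text, summarize_fn, chunk_chars=1000,
                        target_chars=1500, max_passes=3):
    """Summarize chunk by chunk, then repeat on the result while it is too long."""
    current = text
    for _ in range(max_passes):
        # Same character-based slicing as the notebook's chunking step.
        chunks = [current[i:i + chunk_chars]
                  for i in range(0, len(current), chunk_chars)]
        current = "\n".join(summarize_fn(chunk) for chunk in chunks)
        if len(current) <= target_chars:
            break
    return current

# Toy stand-in for the model: keep only the first 10 characters of each chunk.
result = summarize_in_passes("x" * 5000, lambda chunk: chunk[:10],
                             chunk_chars=1000, target_chars=30)
print(result)  # xxxxxxxxxx
```

With the real pipeline, `summarize_fn` could be a small wrapper such as `lambda chunk: summarizer(chunk, max_length=150, min_length=50, do_sample=False)[0]['summary_text']`. Each extra pass costs more model calls and can lose detail, so `max_passes` is kept small.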





In [None]:
import os
import PyPDF2
from transformers import pipeline

# Load the summarization pipeline (you can change the model if needed)
summarizer = pipeline("summarization", model='sshleifer/distilbart-cnn-12-6')

# Function to split text into fixed-size character chunks
def chunk_text(text, max_length=1024):
    # max_length counts characters here; 1024 characters is well below the model's 1024-token limit
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]

# Function to generate a summary for a PDF, now in chunks
def generate_summary(pdf_path, chunk_size=1024, max_length=150, min_length=50):
    try:
        with open(pdf_path, 'rb') as file:
            pdf_text = ""
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                text = page.extract_text()
                if text:
                    pdf_text += text

            if not pdf_text:
                return "No text could be extracted from this PDF."

            # Split the text into smaller chunks to fit the model's token limit
            chunks = chunk_text(pdf_text, max_length=chunk_size)

            # Summarize each chunk separately
            summaries = []
            for chunk in chunks:
                summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
                summaries.append(summary[0]['summary_text'])

            # Combine the summaries of each chunk into one final summary
            return "\n".join(summaries)
    except Exception as e:
        return f"Error processing PDF: {str(e)}"


if __name__ == '__main__':
    pdf_dir = '/content/'
    output_dir = '/content/summaries'

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for filename in os.listdir(pdf_dir):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(pdf_dir, filename)
            pdf_summary = generate_summary(pdf_path)

            # Save summary to a file
            summary_filename = os.path.join(output_dir, f"{filename[:-4]}_summary.txt")
            with open(summary_filename, 'w') as summary_file:
                summary_file.write(pdf_summary)

            # Print the summary to the console
            print(f"Summary for {filename}:\n")
            print(pdf_summary)
            print("=" * 30)  # Separator for better readability


Summary for Art Collector personality types_ finish.pdf:

 Novice Collector: These are individuals just starting in the art collection world . Niche Enthusiast: This collector looks for very specific art forms, be it  grotesqueabstract, surrealism, or any niche genre . For artists, aligning  with their niche preference is key to gain their attention .
 The Investment Collector buys artworks they believe will appreciate value . The Speculative Collector takes risks on unknown or lesser-known artists hoping that their value will explode in the future . Artists who’ve received media attention or have a growing ipient reputation will appeal to this group .
 The Emotional Buyer buys art that resonates with them on a deep, emotional level . The Trend Follower buys artworks that are trending or popular at the moment . For artists, storytelling and personal connection can be the key to attracting collectors .
 The Philanthropic/Patron Collector: They believe in supporting artists . The Social 

### BART Large Summarizer

### 2. **Increase Abstraction by Using a More Powerful Model**:
The `sshleifer/distilbart-cnn-12-6` model is good for basic summarization, but its summaries often stay close to the source wording. A larger model may produce more abstractive summaries that focus on the key points.

Here are some alternative summarization models:
- **`facebook/bart-large-cnn`**: This is the full BART model fine-tuned on CNN/DailyMail and often provides more detailed and refined summaries.
- **`t5-small` or `t5-base`**: T5 (Text-to-Text Transfer Transformer) is a general text-to-text model that can also be used for summarization. The small and base checkpoints are much lighter than BART-large.


In [None]:
summarizer = pipeline("summarization", model='facebook/bart-large-cnn')

# Function to generate summary for a PDF
def generate_summary(pdf_path, chunk_size=500, max_length=50, min_length=25):
    try:
        with open(pdf_path, 'rb') as file:
            pdf_text = ""
            pdf_reader = PyPDF2.PdfReader(file)
            for page in pdf_reader.pages:
                text = page.extract_text()
                if text:  # Check if text extraction was successful
                    pdf_text += text

            # If no text is extracted, return an error message
            if not pdf_text:
                return "No text could be extracted from this PDF."

            # Split text into chunks
            chunks = [pdf_text[i:i + chunk_size] for i in range(0, len(pdf_text), chunk_size)]

            # Generate summaries for each chunk
            summaries = []
            for i, chunk in enumerate(chunks):
                summary = summarizer(chunk, max_length=max_length, min_length=min_length, do_sample=False)
                summaries.append(summary[0]['summary_text'])

            return "\n".join(summaries)
    except Exception as e:
        return f"Error processing PDF: {str(e)}"

if __name__ == '__main__':
    pdf_dir = '/content/'
    output_dir = '/content/summaries'

    if not os.path.exists(output_dir):
        os.makedirs(output_dir)

    for filename in os.listdir(pdf_dir):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(pdf_dir, filename)
            pdf_summary = generate_summary(pdf_path)

            # Save summary to a file
            summary_filename = os.path.join(output_dir, f"{filename[:-4]}_summary.txt")
            with open(summary_filename, 'w') as summary_file:
                summary_file.write(pdf_summary)

            # Print the summary to the console
            print(f"Summary for {filename}:\n")
            print(pdf_summary)
            print("=" * 30)  # Separator for better readability


Your max_length is set to 50, but your input_length is only 25. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=12)


Summary for Art Collector personality types_ finish.pdf:

The Novice Collector: These are individuals just starting in the art collection world. The Niche Enthusiast: This collector looks for very specific art forms.
For artists, aligning with their niche preference is key to gain their attention. Offer educational content about art, perhaps through workshops. Showcase expertise within their preferred niche. Organize exhibi
The Investment Collector buys artworks they believe will appreciate in value. Artists who have received media attention or have a growing reputation appeal to this group. The Speculative Collector takes artworks that focus exclusively on their chosen genre.
a. Provide data on past art sales, auction results, and any press coverage to highlight potential ROI.b. Offer limited editions or exclusive previews to entice risk-taking artists.
The Trend Follower: They’re all about what’s “in.” They buy artworks that are trending or popular at the moment. The Emotional Buyer:

### **Post-Processing the Summary**:
You can further reduce the summary length by post-processing the generated summary to remove unnecessary information (e.g., boilerplate or common phrases).
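
As a rough illustration of such post-processing, the sketch below tidies the stray spaces before punctuation that appear in the distilBART outputs above and drops exact duplicate sentences. The function name `clean_summary` and the specific cleanup rules are assumptions chosen for this example, not steps taken in the notebook.

```python
import re

def clean_summary(summary):
    """Tidy spacing and drop exact duplicate sentences from a generated summary."""
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r'\s+', ' ', summary).strip()
    # Remove the space left before punctuation, e.g. "genre ." -> "genre."
    text = re.sub(r'\s+([.,;:!?])', r'\1', text)
    # Split into sentences and keep only the first occurrence of each.
    seen, kept = set(), []
    for sentence in re.split(r'(?<=[.!?])\s+', text):
        key = sentence.lower()
        if sentence and key not in seen:
            seen.add(key)
            kept.append(sentence)
    return ' '.join(kept)

raw = (" The Emotional Buyer buys art that resonates with them .\n"
       " The Emotional Buyer buys art that resonates with them ."
       " For artists, storytelling can be key .")
print(clean_summary(raw))
# The Emotional Buyer buys art that resonates with them. For artists, storytelling can be key.
```

Note that this version also merges the per-chunk lines into a single paragraph. If the line breaks between chunk summaries matter, the function could be applied to each line separately.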

### Example Adjustments:
1. **Tune `min_length` and `max_length`**: These parameters help control the summary length. Lower the `max_length` to make the summary shorter.
2. **Try different summarization models**: Some models are better suited for certain types of text. BART, T5, or even OpenAI's GPT models may provide better summaries depending on the document's content.

### Summary of Adjustments:
- **Control summary length** by tuning `max_length` and `min_length`.
- **Reduce chunk size** to allow for better summarization of smaller sections.
- **Use a more powerful summarization model** (like `facebook/bart-large-cnn`) for better abstraction.
- **Summarize the entire document at once** if it's short enough for the model to handle in one go.

By combining these strategies, you should be able to get more concise and useful summaries that don't simply repeat the original text.