# PDF Summarizer Project

This project extracts key information from research papers in PDF format using OpenAI's GPT-3.5 Turbo through LangChain. It provides a structured pipeline for loading, chunking, and summarizing large PDF files, aimed at researchers and professionals.

The project also includes a **Streamlit** app that exposes the PDF summarization functionality through a simple interface.


## Features

### PDF Loading
- Uses **PyPDFLoader** to extract content from PDF files.
- Supports splitting large documents into manageable pages.

### Chunk-Based Summarization
- Dynamically splits raw text into smaller chunks with adjustable size and overlap for processing.
- Recursive Splitting:

  The splitter attempts to divide text at meaningful boundaries (e.g., paragraphs, sentences, words) to create chunks that make sense on their own, rather than cutting arbitrarily at the character limit.
- Supports both **"Stuff"** and **"Map-Reduce"** chain methods for summarization.

### Customizable Prompt Templates
- **Stuff Chain Template**: Summarizes entire chunks directly, suitable for small documents.
- **Map-Reduce Chain Template**: Breaks the process into mapping and reducing stages for better scalability and synthesis of complex documents.

### Integration with GPT-3.5 Turbo
- Utilizes **gpt-3.5-turbo-16k** for accurate and concise summaries.
- Allows temperature and prompt-based customization for tailored outputs.

---

## Workflow

### 1. Define and Load PDF
- Input the PDF URL or file path.
- Load and split content into chunks.

### 2. Summarization Method
- Use **"stuff"** for direct processing.
- Use **"map-reduce"** for larger documents with more comprehensive synthesis.
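
One way to make the choice between the two methods explicit is a small size check before building the chain. The sketch below is illustrative only: `choose_chain_type` is a hypothetical helper, and the 16,000-token window, the four-characters-per-token estimate, and the 75% budget are rough assumptions rather than values taken from this project.

```python
def choose_chain_type(text: str, context_tokens: int = 16_000,
                      chars_per_token: int = 4) -> str:
    """Return "stuff" if the text plausibly fits in one prompt, else "map_reduce"."""
    # Rough token estimate; a tokenizer such as tiktoken would be more precise.
    estimated_tokens = len(text) // chars_per_token
    # Keep about a quarter of the window free for the prompt wording and the reply.
    budget = int(context_tokens * 0.75)
    return "stuff" if estimated_tokens <= budget else "map_reduce"


print(choose_chain_type("a" * 1_000))    # short text
print(choose_chain_type("a" * 100_000))  # long text
```

The returned string can be passed as `chain_type` when the chain is created.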

### 3. Prompt Integration
- Leverages `map_prompt` and `combine_prompt` templates for tailored AI guidance.
- Outputs summaries in bullet-point or final cohesive formats.

### 4. Execution
- Easily switch between methods by adjusting the function call parameters (`chunk_size`, `chunk_overlap`, and chain type).




## <font color=blue> TASK 1: Install and setup LangChain
Welcome to this project notebook, which will guide you through building your first Generative AI application. Each task comes with a short description to help you follow the sequence of steps. The first task is to install and import the libraries required for this project.

***Note: Before importing the libraries please ensure that all the library modules such as Langchain, Streamlit, PyPdf and other required libraries are installed on this notebook by running the pip command as shown below.***

In [44]:
#For installing the Langchain module associated with OpenAI LLM model
!pip install langchain
!pip install langchain_openai
#For installing the Python library responsible for PDF upload
!pip install pypdf
#For installing the library responsible for web app development
!pip install streamlit
#For installing the tokeniser library that assists with converting text strings into tokens recognizable by OpenAI models
!pip install tiktoken
#For installing the library used for invoking the environment file containing secret API key
!pip install python-dotenv
!pip install -U langchain-community



In [45]:
import os
import dotenv
from langchain_openai import OpenAI
from langchain_openai.chat_models import ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
import streamlit as st

## <font color=blue> Load OpenAI API Key to access LLM model

## <font color=black>
1. Go to https://platform.openai.com/api-keys
2. Click on the "+ Create new secret key" button
3. Enter an identifier name (optional) and click on the "Create secret key" button
4. Copy the API key into the API.env file (under the `OPENAI_KEY` variable) that you need to upload to the Google Colab environment

In [46]:
# Load the .env file and invoke the secret API key from the file
dotenv.load_dotenv('API.env')
OpenAI.api_key = os.getenv("OPENAI_KEY")
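
If the variable is missing, `os.getenv` returns `None` silently and the failure only shows up later, when the model is called. A minimal sketch of an early check is shown below; `require_env` is a hypothetical helper, not part of this project or of `python-dotenv`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check that API.env was loaded")
    return value


# Demonstration with a throwaway variable instead of a real key.
os.environ["DEMO_KEY"] = "abc123"
print(require_env("DEMO_KEY"))
```

In the notebook this would be called as `require_env("OPENAI_KEY")` right after `dotenv.load_dotenv('API.env')`.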

## <font color=blue> Load PDF file

In [47]:
pdf_url = "https://www.medrxiv.org/content/10.1101/2021.07.15.21260605v1.full.pdf"

loader = PyPDFLoader(pdf_url)
pages = loader.load_and_split()


In [48]:
#number of pages
len(pages)

18

In [50]:
#view page content
print(pages[0].page_content)

COVID-19 Chest X-Ray Image Classification Using Deep Learning
Gunther Correia Bacellar,1 Mallikarjuna Chandrappa,1 Rajlakshman Kulkarni,1 Soumava Dey1* 
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA  
*Correspondence: soumava2@illinois.edu; soumavadey87@gmail.com 
               
ABSTRACT 
The rise of the coronavirus disease 2019 (COVID-19) pandemic has made it necessary to improve existing medical screening 
and clinical management of this disease. While COVID-19 patients are known to exhibit a variety of symptoms, the major 
symptoms include  fever, cough, and fatigue. Since these symptoms also appear in pneumonia patients, this creates 
complications in COVID-19 detection especially during the flu season. Early studies identified abnormalities in chest X -ray 
images of COVID -19 infected patient s that could be beneficial for disease diagnosis. Therefore, chest X -ray image -based 
disease classification has emerged as an alterna

## <font color=blue> TASK 2: Define the summarize pdf function
Define the main function that takes a PDF file path as input and generates a summary of the file.

### Recursive Splitting Logic

- The method first tries to split the text into larger logical sections (e.g., paragraphs).
- If the resulting sections are still too large (i.e., exceed `chunk_size`), it further breaks them into smaller parts (e.g., sentences).
- This process continues recursively down to the smallest meaningful unit (e.g., words) until the chunks meet the desired size constraint.
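
The fallback through progressively finer separators can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `recursive_split` is a hypothetical function, it drops the separators, and it omits the step where small neighbouring pieces are merged back up to `chunk_size` and where overlap is added.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters,
    trying coarse separators (paragraphs) before fine ones (words)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left: cut at the character limit as a last resort.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(head):
        # Parts that are still too long are split again with the finer separators.
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


sample = "First paragraph here.\n\nSecond paragraph is a bit longer than the first."
for piece in recursive_split(sample, chunk_size=30):
    print(repr(piece))
```

The first paragraph fits within 30 characters and is kept whole, while the second is too long and falls through to word-level splitting.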


### Adding Overlap

- After splitting the text into chunks, the method adds overlap between consecutive chunks based on the `chunk_overlap` parameter.
- For example:
  - If `chunk_size` is 1000 and `chunk_overlap` is 200:
    - The first chunk might include characters 0-1000.
    - The second chunk would include characters 800-1800.
- This overlap ensures that critical context isn't lost between chunks.
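
The arithmetic in the example above can be checked with a short sketch. It assumes idealized fixed-size windows; the real splitter cuts at separators, so actual chunk boundaries will usually differ. `chunk_ranges` is a hypothetical helper written for this illustration.

```python
def chunk_ranges(text_length, chunk_size, chunk_overlap):
    """Return (start, end) character ranges for fixed-size windows with overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    ranges = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        ranges.append((start, end))
        if end == text_length:
            break
        start += step
    return ranges


print(chunk_ranges(2500, 1000, 200))
# [(0, 1000), (800, 1800), (1600, 2500)]
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, which is why the second chunk begins at character 800.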


In [51]:
def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap):
    # Instantiate LLM model (GPT-3.5 Turbo)
    llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    # Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    # Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    # Summarize the chunks
    chain = load_summarize_chain(llm, chain_type="stuff")
    summary = chain.invoke(docs_chunks, return_only_outputs=True)

    # Return the summary
    return summary['output_text']




In [52]:
#Chunk size and chunk overlap are set to arbitrary values
# Print the summary produced with chain type "stuff"
print(summarize_pdf(pdf_url, 1000, 20))

This study developed a deep learning model called DLH_COVID to classify COVID-19, pneumonia, and normal/healthy cases from chest X-ray images. The model was compared to pre-trained models and achieved the highest accuracy of 96% in detecting COVID-19. A web application was also developed based on the DLH_COVID model for users to upload chest X-ray images and detect the presence of COVID-19. The study highlights the potential of AI-based systems for efficient COVID-19 detection and diagnosis.


In [53]:
def summarize_map_reduce_pdf(pdf_file_path, chunk_size, chunk_overlap):
    # Instantiate LLM model (GPT-3.5 Turbo)
    llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    # Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    # Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    # Summarize the chunks
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    summary = chain.invoke(docs_chunks, return_only_outputs=True)

    # Return the summary
    return summary['output_text']

In [54]:
print(summarize_map_reduce_pdf(pdf_url, 1000, 20))

This study developed a deep learning model called DLH-COVID to classify chest X-ray images for COVID-19 detection. The model achieved a 96% accuracy rate and a user-friendly web application was created for quick diagnosis. The study highlights the challenges in differentiating COVID-19 from pneumonia and the limitations of current testing methods. The DLH-COVID model shows promise as a rapid and efficient method for COVID-19 diagnosis, but further assessments are needed. The study used transfer learning and hyperparameter optimization techniques, and references various research papers on CNN models, medical imaging diagnosis, and AI for COVID-19 detection.


## <font color=blue> TASK 3: Add Prompt template to the summarizer function
Use prompt templates to extract key information from the research paper in a more guided manner.

### Stuff Method
- Uses a single prompt (`map_prompt` in this notebook), passed to the chain as `prompt`.
- Suitable for smaller documents where all content can be processed in a single step.

### Map-Reduce Method
A multi-stage processing approach divided into two key steps:

#### 1. Map Stage
- The input document is divided into smaller chunks.
- Each chunk is processed individually to generate a separate summary.
- This stage utilizes `map_prompt` to guide the summarization of each chunk.

#### 2. Reduce Stage
- Summaries from the map stage are combined into a comprehensive final summary.
- This stage leverages `combine_prompt` to aggregate and synthesize the smaller summaries into one cohesive result.
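
The two stages can be illustrated without an LLM. In the sketch below, `map_reduce_summarize`, `toy_map`, and `toy_combine` are hypothetical names; the toy functions stand in for the model calls guided by `map_prompt` and `combine_prompt` and only mimic the data flow, not the quality of a real summary.

```python
from typing import Callable, List


def map_reduce_summarize(chunks: List[str],
                         map_fn: Callable[[str], str],
                         combine_fn: Callable[[List[str]], str]) -> str:
    """Summarize each chunk on its own, then merge the partial summaries."""
    partial_summaries = [map_fn(chunk) for chunk in chunks]  # map stage
    return combine_fn(partial_summaries)                     # reduce stage


def toy_map(chunk: str) -> str:
    """Stand-in for the map prompt: keep the first sentence as a bullet point."""
    return "- " + chunk.split(".")[0].strip() + "."


def toy_combine(summaries: List[str]) -> str:
    """Stand-in for the combine prompt: join the bullets into one paragraph."""
    return " ".join(summary[2:] for summary in summaries)


chunks = ["Model A works. Details follow.", "Accuracy is 96%. More text."]
print(map_reduce_summarize(chunks, toy_map, toy_combine))
# Model A works. Accuracy is 96%.
```

Because the map stage treats each chunk independently, the number of model calls grows with the number of chunks, which is the main cost of this method compared with "stuff".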


## <font color=blue> Define Prompt Templates

### <font color=black> Prompt Template for Stuffing chain type

In [55]:
map_prompt_template = """
                       Write a summary of the research paper for an
                       artificial intelligence researcher that includes
                       main points and any important details in bullet points.{text}
                      """

map_prompt = PromptTemplate(
    input_variables=["text"],
    template=map_prompt_template,
)


### <font color=black> Combine Prompt Template for Map_Reduce chain type

In [56]:
combine_prompt_template = """
You will be given main points and any important details of a research paper in bullet points.
Your goal is to give a final summary of the main research topic and findings
which will be useful to an artificial intelligence researcher
to grasp what was done during the research work.

```{text}```

FINAL SUMMARY:
"""

combine_prompt = PromptTemplate(
    input_variables=["text"],
    template=combine_prompt_template,
)
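
Both templates rely on a single `{text}` placeholder that is filled with the chunk text or with the partial summaries. The sketch below imitates that substitution with Python's built-in `str.format`, on the assumption that the default template format behaves the same way for a lone `{text}` variable. `escape_braces` is a hypothetical helper showing one way to keep literal braces in free-form prompt text from being read as placeholders.

```python
template = "Summarize in bullet points:\n\n{text}"
filled = template.format(text="Chunk one of the paper.")
print(filled)


def escape_braces(user_text: str) -> str:
    """Double any braces so they survive formatting as literal characters."""
    return user_text.replace("{", "{{").replace("}", "}}")


# Free-form prompt text containing braces, followed by the real placeholder.
user_prompt = escape_braces('Return JSON like {"summary": ...}. ') + "{text}"
print(user_prompt.format(text="Chunk two."))
```

Without the escaping step, the braces in the user's text would be treated as an extra template variable and formatting would fail.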

In [57]:
# Modify the custom function to add the prompt templates
def summarize_stuff_pdf(pdf_file_path, chunk_size, chunk_overlap, map_prompt):
    # Instantiate LLM model
    llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    # Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    # Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    # Summarize the chunks
    chain = load_summarize_chain(llm, chain_type="stuff", prompt=map_prompt)

    # Return the summary
    summary = chain.invoke(docs_chunks, return_only_outputs=True)
    return summary['output_text']


In [58]:
#Print the summary using the "stuff" chain with map_prompt
#A larger chunk size produces fewer chunks, which mainly shortens the run time of the map_reduce method used later
print(summarize_stuff_pdf(pdf_url, 2000, 100, map_prompt))

- The research paper focuses on the development of an artificial intelligence (AI) system for the classification of COVID-19 chest X-ray images.
- The major symptoms of COVID-19 include fever, cough, and fatigue, which also appear in pneumonia patients, making it difficult to differentiate between the two during the flu season.
- Manual detection of COVID-19 from chest X-ray images is cumbersome and prone to human error, so AI techniques powered by deep learning algorithms are used to enhance the diagnosis process.
- The researchers implemented various pre-trained deep learning models such as ResNet, VGG, Inception, and EfficientNet, and developed their own convolutional neural network (CNN) model called DLH-COVID.
- The DLH-COVID model achieved the highest accuracy of 96% in detecting COVID-19 from chest X-ray images compared to pneumonia-affected and healthy individuals.
- The researchers also developed a web application based on the DLH-COVID model, allowing users to upload chest X-

In [59]:
# Modify the custom function to add the prompt templates
def summarize_map_reduce_prompt_pdf(pdf_file_path, chunk_size, chunk_overlap, map_prompt, combine_prompt):
    # Instantiate LLM model
    llm = ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    # Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    # Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    # Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    # Summarize the chunks
    # Using map_reduce chain type with map_prompt and combine_prompt
    chain = load_summarize_chain(llm, chain_type="map_reduce", map_prompt=map_prompt, combine_prompt=combine_prompt)

    # Return the summary
    summary = chain.invoke(docs_chunks, return_only_outputs=True)
    return summary['output_text']


In [60]:
print(summarize_map_reduce_prompt_pdf(pdf_url, 2000, 100, map_prompt, combine_prompt))

The research paper focuses on the development and evaluation of an artificial intelligence model, DLH_COVID, for the detection of COVID-19 from chest X-ray images. The researchers collected a dataset of chest X-ray images from COVID-19 patients, pneumonia-affected individuals, and healthy individuals. They implemented various pre-trained deep learning models and developed their own convolutional neural network model, DLH_COVID. The DLH_COVID model achieved the highest accuracy of 96% in detecting COVID-19 from chest X-ray images and outperformed the pre-trained models. A web application based on the DLH_COVID model was also developed, allowing users to upload chest X-ray images and quickly detect the presence of COVID-19. The paper concludes that the DLH_COVID model and web application provide an efficient and user-friendly method for COVID-19 detection from chest X-ray images, with the potential to become a rapid diagnosis method in the future. The paper also discusses the use of tran

## <font color=blue> TASK 4: Build and test a GenAI app for PDF summarization

In [61]:
%%writefile app.py
import os
import dotenv
from langchain_openai import OpenAI
from langchain_openai.chat_models import ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
import streamlit as st

Overwriting app.py


In [62]:
#summarize_pdf function
def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, prompt):
    #Instantiate LLM model gpt-3.5-turbo-16k
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    #Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    #Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    #Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    #Summarize the chunks
    chain = load_summarize_chain(llm, chain_type="stuff", prompt = prompt)
    #Return the summary
    summary = chain.invoke(docs_chunks, return_only_outputs=True)
    return summary['output_text']

In [63]:
#streamlit app main() function
def main():
    # Set page config and title
    st.set_page_config(page_title="PDF Summarizer", page_icon=":book:", layout="wide")
    st.title("PDF Summarizer")

    # Input pdf file path
    pdf_file_path = st.text_input("Enter the path to the PDF file:")
    if pdf_file_path != "":
        st.write("PDF file path received")

    # Prompt input
    user_prompt = st.text_input("Enter your prompt:")
    user_prompt = user_prompt + """{text}"""
    prompt = PromptTemplate(
        input_variables=["text"],
        template=user_prompt,
    )

    # Summarize button
    if st.button("Summarize"):
        summary = summarize_pdf(pdf_file_path, 1000, 20, prompt)
        st.write(summary)




In [64]:
if __name__ == "__main__":
    main()



## <font color=blue> Launch Streamlit app from Google Colab

The following cells let users launch the Streamlit app from Google Colab using the [ngrok service](https://ngrok.com/).

In [None]:
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip

In [None]:
!unzip ngrok-stable-linux-amd64.zip

In [None]:
get_ipython().system_raw('./ngrok http 8501 &')

In [None]:
!wget -q -O - ipv4.icanhazip.com

In [None]:
!streamlit run app.py & npx localtunnel --port 8501

## <font color=blue> FINAL TASK: Cumulative Activity

Click the link to explore useful Streamlit library functions:
https://docs.streamlit.io/library/cheatsheet

In [65]:
%%writefile app.py
import os
import dotenv
from langchain_openai import OpenAI
from langchain_openai.chat_models import ChatOpenAI
from langchain_community.document_loaders import PyPDFLoader
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains.summarize import load_summarize_chain
import streamlit as st

Overwriting app.py


In [66]:
# Load the .env file and invoke the secret API key from the file
dotenv.load_dotenv('API.env')
OpenAI.api_key = os.getenv("OPENAI_KEY")

In [67]:
#summarize_pdf function
def summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, chain_type, prompt):
    #Instantiate LLM model gpt-3.5-turbo-16k
    llm=ChatOpenAI(model="gpt-3.5-turbo-16k", temperature=0, openai_api_key=OpenAI.api_key)

    #Load PDF file
    loader = PyPDFLoader(pdf_file_path)
    docs_raw = loader.load()

    #Create multiple documents
    docs_raw_text = [doc.page_content for doc in docs_raw]

    #Split text into chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs_chunks = text_splitter.create_documents(docs_raw_text)

    # Create the prompts: the user's prompt drives the final summary,
    # while a fixed bullet-point prompt summarizes each chunk in the map stage
    prompt = prompt + """{text}"""
    combine_prompt = PromptTemplate(input_variables=["text"], template=prompt)
    map_prompt = PromptTemplate(template="Summarize in bullet points:\n\n{text}", input_variables=["text"])

    # Summarize the chunks
    if chain_type == "map_reduce":
        chain = load_summarize_chain(llm, chain_type=chain_type,
                                    map_prompt=map_prompt, combine_prompt=combine_prompt)
    else:
        # "stuff" takes a single prompt, so pass the user's prompt directly
        chain = load_summarize_chain(llm, chain_type=chain_type, prompt=combine_prompt)

    # Return the summary
    summary = chain.invoke(docs_chunks, return_only_outputs=True)
    return summary['output_text']




In [68]:
#streamlit app main() function
def main():
    # Set page config and title
    st.set_page_config(page_title="PDF Summarizer", page_icon=":book:", layout="wide")
    st.title("Sam's GenAI App")

    # Add custom sliders and selectbox for more user interaction
    chain_type = st.sidebar.selectbox("Chain Type", ["stuff", "map_reduce"])
    chunk_size = st.sidebar.slider("Chunk Size", min_value=100, max_value=10000, step=100, value=1000)
    chunk_overlap = st.sidebar.slider("Chunk Overlap", min_value=0, max_value=1000, step=10, value=20)

    # Display warning message
    if 'map_reduce' in chain_type:
        st.sidebar.warning("Map_reduce chain type takes more than 5 mins to generate summary due to prompt latency!")

    # Input pdf file path
    pdf_file_path = st.text_input("Enter PDF file path:")

    # Prompt input
    user_prompt = st.text_input("Enter prompt:")


    #Summarize button
    if st.button("Summarize"):
        #Summarize pdf
        summary = summarize_pdf(pdf_file_path, chunk_size, chunk_overlap, chain_type, user_prompt)
        st.write(summary)

if __name__ == "__main__":
    main()
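
The sidebar sliders allow combinations such as a chunk size of 100 with an overlap of 1000, which the text splitter is not expected to accept. A minimal sketch of a pre-flight check is shown below; `validate_settings` is a hypothetical helper and is not part of the app above.

```python
from typing import List


def validate_settings(pdf_file_path: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Return a list of readable problems; an empty list means the inputs look usable."""
    problems = []
    if not pdf_file_path.strip():
        problems.append("Enter a PDF file path or URL.")
    if chunk_size <= 0:
        problems.append("Chunk size must be positive.")
    if chunk_overlap >= chunk_size:
        problems.append("Chunk overlap must be smaller than chunk size.")
    return problems


print(validate_settings("", 100, 200))
print(validate_settings("paper.pdf", 1000, 20))
```

Inside `main()`, the returned messages could be shown with `st.error` and the call to `summarize_pdf` skipped whenever the list is not empty.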



In [None]:
!wget -q -O - ipv4.icanhazip.com

34.133.247.52


In [None]:
!streamlit run app.py & npx localtunnel --port 8501


Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://172.28.0.12:8501
  External URL: http://34.133.247.52:8501

Need to install the following packages:
localtunnel@2.0.2
Ok to proceed? (y) y

your url is: https://empty-emus-cough.loca.lt
  Stopping...
^C
