# Text Summarization of Large Documents using LangChain 🦜🔗



In [2]:
# !pip install --upgrade pytesseract pypdf PyPDF2 textract langchain langchain-openai transformers python-dotenv pandas --quiet

This notebook is inspired by this Google Notebook: https://github.com/GoogleCloudPlatform/generative-ai/blob/main/language/use-cases/document-summarization/summarization_large_documents_langchain.ipynb 

**Key differences:**

- Use OpenAI models instead.

- Centralize all sub-summaries and summarize them together.

- Use document intelligence to extract text from the document instead of open-source solutions.

**Objective**:  
- This notebook demonstrates a method to summarize large documents using the Refine approach from LangChain.
- We will focus on implementing and optimizing this specific summarization technique.
- Our goal is to efficiently process and summarize the content of a substantial PDF document.


In [3]:

# Standard library imports
import os
import urllib.request  # needed for urllib.request.urlretrieve in the download cell
import warnings
from pathlib import Path as p

# Third-party imports
import pandas as pd
from dotenv import load_dotenv

# LangChain imports
from langchain_core.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain_openai import AzureChatOpenAI

# Load environment variables
load_dotenv()

# Suppress warnings
warnings.filterwarnings("ignore")

In [4]:
# Initialize the Azure OpenAI model
llm = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",  # The deployment name of the model in Azure
    temperature=0.3,                 # Controls randomness in the output (0.0 to 1.0)
    api_version=os.environ['AZURE_OPENAI_API_VERSION'],  # API version from environment variable
    model_name='gpt-4o-mini'         # The name of the model being used
)

# Large Document Summarization

## Preparing data files

Here we will load a Home Depot Annual Report, which is a substantial document. 
Documents of this size can be challenging for LLMs to summarize comprehensively.
We'll use this to demonstrate advanced summarization techniques.

In [5]:
# # data_folder = p.cwd() / "data"
# # p(data_folder).mkdir(parents=True, exist_ok=True)
# pdf_url = "https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf"
# pdf_file = str(p(data_folder, pdf_url.split("/")[-1]))

# urllib.request.urlretrieve(pdf_url, pdf_file)

Here we use `PyPDFLoader` to extract the text from the PDF document.

In [6]:
# extract text from the pdf

pdf_loader = PyPDFLoader('hd_annual_report.pdf')

pages = pdf_loader.load_and_split()

print(pages[1].page_content)


Fiscal 2023: A Year of Moderation
Fiscal 2023 was a year of moderation after three years of unprecedented growth in the home improvement market.  
It was also a year of opportunity. We focused on several operational improvements to strengthen the business, while 
also staying true to the growth opportunities detailed at our Investor and Analyst conference in June of 2023.
During /f_iscal 2023, total sales declined 3.0 percent to $152.7 billion, compared to /f_iscal 2022. Fiscal 2023  
comparable sales declined 3.2 percent for the total Company and 3.5 percent in the U.S. Our /f_iscal 2023 net  
earnings were $15.1 billion, and earnings per diluted share decreased 9.5 percent to $15.11.
Focused on Strategic Objectives
Over the last several years, we have successfully managed through a dynamic macroeconomic environment,  
including in/f_lation and disin/f_lation, higher interest rates, and shifts in consumer spending. Throughout this time, our 
strategic priorities have remained the same
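The extracted text above contains residue such as `/f_iscal` and `in/f_lation`, which looks like ligature glyph names (`fi`, `fl`) leaking into the output. A minimal sketch of a cleanup step is shown below. It assumes the residue always has the form `/x_y` with lowercase letters, and `repair_ligatures` is a hypothetical helper, not part of the original notebook. The pattern could also match legitimate text such as URL path segments, so check it against your own data.

```python
import re

# Matches glyph-name residue such as "/f_i" or "/f_l" (assumed pattern).
_LIGATURE_RESIDUE = re.compile(r"/([a-z])_([a-z])")


def repair_ligatures(text: str) -> str:
    """Replace residue like '/f_i' with the plain letters 'fi'."""
    return _LIGATURE_RESIDUE.sub(r"\1\2", text)


sample = "During /f_iscal 2023, in/f_lation and disin/f_lation"
print(repair_ligatures(sample))
```

The same function could be applied to each `page_content` before summarization.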

## Method: Refine

The Refine method is an alternative approach to summarizing large documents. It first runs an initial prompt on a small chunk of data to generate some output. Then, for each subsequent chunk, the output from the previous step is passed in along with the new chunk, and the LLM is asked to refine the output based on the new content.

In LangChain, you can use `RefineDocumentsChain` through the `load_summarize_chain` function by setting `chain_type` to `refine`.
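The control flow described above can be sketched in plain Python. This is an illustrative sketch, not LangChain's implementation: `refine_summarize`, `fake_initial`, and `fake_refine` are hypothetical names, and the two "LLM" functions are stand-ins so the example runs without API access.

```python
from typing import Callable, List


def refine_summarize(
    chunks: List[str],
    initial_step: Callable[[str], str],
    refine_step: Callable[[str, str], str],
) -> List[str]:
    """Return the running output after each chunk (the intermediate steps)."""
    if not chunks:
        return []
    # The first chunk is processed on its own.
    running = initial_step(chunks[0])
    steps = [running]
    # Each later chunk is combined with the previous output.
    for chunk in chunks[1:]:
        running = refine_step(running, chunk)
        steps.append(running)
    return steps


# Stand-in "LLM" functions: keep only the first sentence of each chunk.
def fake_initial(chunk: str) -> str:
    return chunk.split(".")[0]


def fake_refine(previous: str, chunk: str) -> str:
    return previous + " | " + chunk.split(".")[0]


steps = refine_summarize(
    ["Sales fell. More detail.", "Stores opened. More detail."],
    fake_initial,
    fake_refine,
)
print(steps[-1])
```

The last element of `steps` plays the role of the final summary, and the full list corresponds to the intermediate steps.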

### Prompt design with `Refine` Chain

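With LangChain, the `refine` chain requires two prompts:

- The question prompt generates the initial output from the first chunk.
- The refine prompt is applied to each subsequent chunk to update that output.

Note that the refine prompt defined below references only `{text}`, so each intermediate step is effectively a summary of its own chunk. These per-chunk summaries are what the final summarization step combines.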
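To illustrate how the two prompts differ in a fully cumulative refine setup, here is a small sketch using plain string templates. The placeholder names `text` and `existing_answer` are assumptions based on what we believe are the default variable names in LangChain's refine chain, so verify them against your installed version. The template wording itself is hypothetical.

```python
# Hypothetical templates; placeholder names are assumptions (see lead-in).
QUESTION_TEMPLATE = (
    "Please provide a summary of the following text.\n"
    "TEXT: {text}\n"
    "SUMMARY:"
)

REFINE_TEMPLATE = (
    "Here is the summary so far:\n"
    "{existing_answer}\n\n"
    "Refine it using the additional text below. "
    "If the text adds nothing useful, return the summary unchanged.\n"
    "TEXT: {text}\n"
    "REFINED SUMMARY:"
)

# The first chunk only needs the question template.
first_prompt = QUESTION_TEMPLATE.format(text="Chunk one.")

# Later chunks receive both the previous output and the new chunk.
second_prompt = REFINE_TEMPLATE.format(
    existing_answer="Summary of chunk one.",
    text="Chunk two.",
)
print(second_prompt)
```

A refine prompt without the previous-output placeholder, as used in this notebook, yields independent per-chunk summaries instead of a cumulative one.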


In [7]:

question_prompt_template = """
                  Please provide a summary of the following text.
                  TEXT: {text}
                  SUMMARY:
                  """

question_prompt = PromptTemplate(
    template=question_prompt_template, input_variables=["text"]
)

refine_prompt_template = """
              Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
              """

refine_prompt = PromptTemplate(
    template=refine_prompt_template, input_variables=["text"]
)

## Generate Summaries

After you define prompts, you initiate a summarization chain using `refine` chain type.

In [8]:
refine_chain = load_summarize_chain(
    llm,
    chain_type="refine",
    question_prompt=question_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
)

Then run the summarization chain over the document chunks using the Refine method.

In [9]:
# .invoke() is the current way to run a chain; calling the chain directly is deprecated
refine_outputs = refine_chain.invoke({"input_documents": pages})

In [10]:
final_refine_data = []
for doc, out in zip(
    refine_outputs["input_documents"], refine_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = p(doc.metadata["source"]).stem
    output["file_type"] = p(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["concise_summary"] = out
    final_refine_data.append(output)

In [11]:
pdf_refine_summary = pd.DataFrame.from_dict(final_refine_data)
pdf_refine_summary = pdf_refine_summary.sort_values(
    by=["file_name", "page_number"]
)  # sorting the dataframe by filename and page_number
pdf_refine_summary.reset_index(inplace=True, drop=True)
pdf_refine_summary.head()
     

| | file_name | file_type | page_number | chunks | concise_summary |
|---|---|---|---|---|---|
| 0 | hd_annual_report | .pdf | 0 | ‘23Annual\nReport | It seems you provided a title or heading for a... |
| 1 | hd_annual_report | .pdf | 1 | Fiscal 2023: A Year of Moderation\nFiscal 2023... | - **Fiscal 2023 Overview**: Marked a year of m... |
| 2 | hd_annual_report | .pdf | 2 | We also see an opportunity to drive sales thro... | - Opportunity identified for sales growth thro... |
| 3 | hd_annual_report | .pdf | 3 | Table of Contents\nUNITED STATES\nSECURITIES A... | - The document is the FORM 10-K annual report ... |
| 4 | hd_annual_report | .pdf | 3 | The number of shares outstanding of the regist... | - As of February 28, 2024, the registrant has ... |


Examine summarization of a specific page

In [13]:
# page 98 (row index 138); rows are chunks, so the row index can differ from the page number
index = 138
print("[Context]")
print(pdf_refine_summary["chunks"].iloc[index])
print("\n\n [Simple Summary]")
print(pdf_refine_summary["concise_summary"].iloc[index])
print("\n\n [Page number]")
print(pdf_refine_summary["page_number"].iloc[index])
print("\n\n [Source: file_name]")
print(pdf_refine_summary["file_name"].iloc[index])

[Context]
LIVING OUR
VALUES
The Home Depot, Inc.
2455 Paces Ferry Road, Atlanta, GA 30339-4024
(770)433-8211
http://ir.homedepot.comNYSE: HDINVESTED AN 
ADDITIONAL     
~ $1 BILLION  
in annualized compensation 
for our frontline, hourly 
associates~ 90% of our U.S. store 
leaders started as 
HOURLY ASSOCIATESTargeting 85% of our 
U.S. & Canadian Sales in 
push mowers and handheld 
outdoor lawn equipment will be 
BATTERY POWERED   
by the end of 2028
Since 2018, our Foundation’s  
PATH TO PRO  
program helped train over 
41,000 participants and 
introduced over 200,000 people  
to the skilled tradesThe Home Depot Foundation
SURPASSED 
$500 MILLION  
in veterans giving since 2011
ESTABLISHED 
SCIENCE-BASED 
TARGETS  
to reduce our 
emissions across Scopes 
1, 2 & 3 by the end of 2030STRENGTHEN OUR FOCUS ON
OUR PEOPLE COMMUNITIESOPERATE
SUSTAINABLY


 [Simple Summary]
- Home Depot invested an additional ~$1 billion in annualized compensation for frontline, hourly associates.
- 90% of U.S

## Final Summarization 

In [16]:
# Only the message classes are needed for the final summarization call
from langchain_core.messages import HumanMessage, SystemMessage

Combine all the per-chunk summaries into a single text, then generate a final summary from it.

In [17]:
final_summary = "\n\n".join(pdf_refine_summary['concise_summary'])
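For very long reports, the joined text may become too large for a single prompt. As a hedged sketch that is not part of the original pipeline, the helper below groups summaries into batches under a character budget, so each batch could be summarized separately before a final pass. `batch_by_budget` and `max_chars` are hypothetical names, and a character count is only a rough proxy for tokens.

```python
from typing import List


def batch_by_budget(
    summaries: List[str], max_chars: int, sep: str = "\n\n"
) -> List[str]:
    """Join summaries into batches whose length stays within max_chars.

    A single summary longer than max_chars still gets its own batch.
    """
    batches: List[str] = []
    current: List[str] = []
    current_len = 0
    for summary in summaries:
        # Account for the separator when the batch already has content.
        added = len(summary) + (len(sep) if current else 0)
        if current and current_len + added > max_chars:
            batches.append(sep.join(current))
            current, current_len = [summary], len(summary)
        else:
            current.append(summary)
            current_len += added
    if current:
        batches.append(sep.join(current))
    return batches


batches = batch_by_budget(["aaaa", "bbbb", "cc"], max_chars=10)
print(batches)
```

With a large enough budget the function returns a single batch, which matches the plain join used in this notebook.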

In [21]:
final_sum_prompt_template = """You are an intelligent writing assistant. Your task is to generate a coherent, well-structured summary of a document based on individual page summaries provided below.

Page Summaries:
{page_summaries}

When synthesizing these, consider the following:
- Identify and highlight major themes, recurring ideas, and the most important points.
- Maintain a logical flow from introduction to conclusion, ensuring that the summary reflects the document's overall structure.
- Avoid redundancy but ensure no critical information is left out.
- Your summary should read smoothly, making sure to connect ideas where appropriate.

Your final output should be clear, comprehensive, and reflective of the entire document's essence.

Overall Summary:
"""


final_sum_prompt = final_sum_prompt_template.format(page_summaries=final_summary)

overall_summary = llm.invoke(
    [SystemMessage(content=final_sum_prompt), 
     HumanMessage(content='Generate the overall summary.')]
)

# Comparison with ChatGPT summarization

### Refine Method Summary

In [22]:
# Display the overall summary
print("Overall Summary of the Document:")
print(overall_summary.content)


Overall Summary of the Document:
### Overall Summary of The Home Depot, Inc. Fiscal 2023 Form 10-K

The Home Depot, Inc. reported a year of moderation in fiscal 2023, following three years of significant growth in the home improvement market. Total sales declined by 3.0% to $152.7 billion, with comparable sales down 3.2%, reflecting changing consumer trends and rising interest rates. Net earnings fell to $15.1 billion, resulting in a decrease in earnings per diluted share by 9.5% to $15.11.

**Strategic Focus and Operational Improvements**: The company emphasized enhancing customer experience, developing differentiated capabilities, and maintaining a low-cost provider position. Key operational improvements included enhanced digital navigation in stores, optimized logistics for appliance delivery, and investments in fulfillment modes tailored for professional customers (Pros). The acquisition of Construction Resources aimed to bolster product offerings for Pro customers.

**Store Expans

### ChatGPT generated summary

The document is **The Home Depot's Annual Report for Fiscal 2023**, providing a detailed review of the company's financial performance, strategic objectives, and future plans. Here's a summary of the key points:

- **Financial Performance**: In fiscal 2023, total sales dropped by 3% to $152.7 billion, while net earnings were $15.1 billion. The company's diluted earnings per share also decreased by 9.5% to $15.11.
  
- **Market Conditions**: The year was marked by moderation following three years of strong growth. The company faced macroeconomic challenges such as inflation, rising interest rates, and shifts in consumer spending patterns.

- **Strategic Focus**: Home Depot remained committed to its long-term goals, including enhancing customer experience, developing unique capabilities, maintaining its low-cost position, and growing market share. The company focused on reducing friction in the shopping experience, expanding its store footprint, and enhancing its Pro customer offerings.

- **Operational Improvements**: Efforts were made to improve store navigation, reduce checkout times, and increase customer satisfaction through better delivery systems. Investments were made in digital technologies, such as the "Sidekick" application powered by machine learning, and the acquisition of Construction Resources.

- **Growth Opportunities**: The company aims to expand by opening around 80 new stores over the next five years, especially in areas with significant population growth.

- **Employee Investments**: Home Depot invested $1 billion in compensation for frontline workers, improving staff retention and customer service quality.

- **Technological Advancements**: Machine learning and computer vision technologies were introduced to improve product availability and enhance productivity within stores.

- **Outlook for 2024**: The company plans to continue leveraging its strengths to capitalize on long-term growth opportunities while focusing on delivering shareholder value and improving the shopping experience.


### Evaluation 

Here’s a comparison of the two versions, where Version 1 is the Refine method summary and Version 2 is the ChatGPT-generated summary:

### **Version 1**:
- **Style**: More detailed and comprehensive. It breaks down the document into clearly labeled sections such as **Strategic Focus**, **Financial Performance**, **Store Expansion**, etc.
- **Length**: Longer, covering a wide range of areas including financials, operational improvements, sustainability, risks, governance, and future strategies.
- **Tone**: Professional and formal. It reads like an executive summary that you might find in a corporate report, with in-depth insights.
- **Focus**: While it mentions fiscal performance, it gives significant weight to **strategic focus**, **workforce investments**, **sustainability**, **risks**, and **governance**. It provides a well-rounded picture of the company beyond just the financials.

### **Version 2**:
- **Style**: Shorter and more concise. It hits the key points but doesn’t delve into as much detail.
- **Length**: Much shorter and more to the point. It offers a high-level summary of important points without elaborating too much on specific areas.
- **Tone**: Slightly more informal and straightforward. It feels like a brief recap rather than a deep analysis.
- **Focus**: Concentrates on key elements like **financial performance**, **strategic focus**, **employee investments**, and **growth opportunities**, while omitting more detailed sections like governance, sustainability, and risk factors.

### **Which to Use**:
- If you're looking for a **comprehensive overview** that covers all aspects of the document, **Version 1** is better suited.
- If you need a **quick, high-level summary** that hits the main points without too much detail, **Version 2** works best.
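The qualitative comparison above could be complemented by a simple, rough check of length and topic coverage. The sketch below is only illustrative: `coverage_report` is a hypothetical helper, the topic list is an assumption, and the two strings are short placeholders, not the actual summaries. Substring matching is a crude proxy for coverage.

```python
from typing import Dict, Iterable


def coverage_report(summary: str, topics: Iterable[str]) -> Dict[str, object]:
    """Return the word count and the topics mentioned in the summary."""
    lowered = summary.lower()
    return {
        "word_count": len(summary.split()),
        "topics_covered": [t for t in topics if t.lower() in lowered],
    }


# Placeholder inputs for illustration only.
topics = ["sustainability", "governance", "risk", "financial"]
version_1 = "Financial results, sustainability targets, governance and risk factors."
version_2 = "Financial results and growth plans."

report_1 = coverage_report(version_1, topics)
report_2 = coverage_report(version_2, topics)
print(report_1)
print(report_2)
```

In practice you would pass `overall_summary.content` and the ChatGPT text in place of the placeholders.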
