# Text Summarization of Large Documents using LangChain and LlamaParse



In [2]:
# !pip install --upgrade pytesseract pypdf PyPDF2 textract langchain transformers --quiet

**Objective**:  

This notebook demonstrates and compares two methods for summarizing large documents using LangChain and different PDF parsing techniques:

1. Basic extraction using PyPDFLoader
2. Advanced extraction using LlamaParse

The main goals are:

1. To implement and optimize the Refine summarization technique from LangChain for processing large documents.
2. To compare the effectiveness of PyPDFLoader and LlamaParse in extracting content from PDF files.
3. To generate comprehensive summaries of multiple Home Depot financial documents (annual report, proxy statement, and quarterly report).
4. To evaluate the quality and depth of summaries produced by each method.

The notebook aims to provide insights into which approach might be more suitable for different summarization needs, considering factors such as clarity, comprehensiveness, and level of detail in financial reporting contexts.


In [195]:
# Standard library imports
import os
import urllib.request
import warnings
from pathlib import Path as p
from typing import List
from dataclasses import dataclass
from IPython.display import Image, Markdown, display

# Third-party imports
import pandas as pd
import nest_asyncio
from dotenv import load_dotenv

# LangChain imports
from langchain.prompts import PromptTemplate
from langchain.chains.summarize import load_summarize_chain
from langchain.document_loaders import PyPDFLoader
from langchain_openai import AzureChatOpenAI
from langchain_core.documents import Document

# Llama Index imports
from llama_parse import LlamaParse
from llama_index.core.extractors import (
    SummaryExtractor,
    QuestionsAnsweredExtractor,
    TitleExtractor,
    KeywordExtractor,
    BaseExtractor,
)
from llama_index.extractors.entity import EntityExtractor
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.schema import MetadataMode
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline

# Load environment variables and apply nest_asyncio
# (raw string so the backslashes in the Windows path are not treated as escapes)
load_dotenv(r'C:\Lang Graph\LangGraph_SF_Repo\.env')
nest_asyncio.apply()

# Suppress warnings
warnings.filterwarnings("ignore")

In [68]:
# Initialize the Azure OpenAI model for LlamaIndex (used by the metadata extractors)
llm_llamaindex = AzureOpenAI(
    model="gpt-4o-mini",
    deployment_name="gpt-4o-mini",
    api_key=os.environ['AZURE_OPENAI_API_KEY'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
    api_version=os.environ['AZURE_OPENAI_API_VERSION'],
)

In [126]:
# Initialize the Azure OpenAI chat model for LangChain (used for the page-level summaries)
llm = AzureChatOpenAI(
    azure_deployment="gpt-4o-mini",  
    temperature=0.3,
    api_version=os.environ['AZURE_OPENAI_API_VERSION'],  
    model_name='gpt-4o-mini'         
)

In [127]:
# text_splitter = TokenTextSplitter(
#     separator=" ", chunk_size=1024, chunk_overlap=200
# )

# class CustomExtractor(BaseExtractor):
#     def extract(self, nodes):
#         metadata_list = [
#             {
#                 "custom": (
#                     node.metadata["document_title"]
#                     + "\n"
#                     + node.metadata["excerpt_keywords"]
#                 )
#             }
#             for node in nodes
#         ]
#         return metadata_list

# extractors = [
#     TitleExtractor(nodes=5, llm=llm_llamaindex),
#     QuestionsAnsweredExtractor(questions=3, llm=llm_llamaindex),
#     # EntityExtractor(prediction_threshold=0.5),
#     # SummaryExtractor(summaries=["prev", "self"], llm=llm),
#     # KeywordExtractor(keywords=10, llm=llm),
#     # CustomExtractor()
# ]

# transformations = [text_splitter] + extractors

In [128]:
# # Note the uninformative document file name, which may be a common scenario in a production setting
# hd_earning_statement = SimpleDirectoryReader(input_files=["C:\Lang Graph\LangGraph_SF_Repo\Large_Document_Summarization\data\hd_q2_2024_earning_release.pdf"]).load_data()

# hd_earning_statement[:3]

# Large Document Summarization

## Preparing data files

Here we load three Home Depot documents: the annual report, the proxy statement, and the Q2 2024 earnings release.
Documents of this size (211 pages in total) can be challenging for LLMs to summarize comprehensively.
We'll use them to demonstrate advanced summarization techniques.

In [5]:
# # data_folder = p.cwd() / "data"
# # p(data_folder).mkdir(parents=True, exist_ok=True)
# pdf_url = "https://services.google.com/fh/files/misc/practitioners_guide_to_mlops_whitepaper.pdf"
# pdf_file = str(p(data_folder, pdf_url.split("/")[-1]))

# urllib.request.urlretrieve(pdf_url, pdf_file)

## Basic Data Extraction with PyPDFLoader

Here we use `PyPDFLoader` to extract the text from our documents.

In [129]:
# list all data in a folder 
files = os.listdir('./data')
files

['hd_annual_report.pdf',
 'hd_proxy_statement_2024.pdf',
 'hd_q2_2024_earning_release.pdf']

In [155]:
# load all these documents 

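def load_multiple_pdfs(directory):
    """Load every PDF in `directory` and return one Document per page."""
    documents = []
    # Sort file names so the page order is reproducible across runs
    for file in sorted(os.listdir(directory)):
        if file.lower().endswith('.pdf'):
            file_path = os.path.join(directory, file)
            loader = PyPDFLoader(file_path)
            # loader.load() returns one Document per PDF page
            documents.extend(loader.load())
    return documents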


In [159]:
# Process all documents 
pdf_directory = './data'
pdf_reader_processed = load_multiple_pdfs(pdf_directory)


In [160]:
len(pdf_reader_processed)

211

This equals the total number of pages across all documents in our directory, which is expected because `PyPDFLoader` returns one `Document` per page.
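To check the page count per file rather than only the total, the loaded pages can be grouped by their `source` metadata. The sketch below is illustrative only: `Page` is a hypothetical stand-in for a loaded page (not a LangChain type), and the file names and page counts are made up so the example runs without any PDFs. With the real list, the same `Counter` expression should apply, since `PyPDFLoader` stores the file path under `metadata["source"]`.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; mirrors only the fields used here."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def pages_per_source(pages):
    """Count how many page-level documents came from each source file."""
    return Counter(page.metadata.get("source", "unknown") for page in pages)


# Illustrative data only: two files with 3 and 2 pages.
sample_pages = (
    [Page(f"a{i}", {"source": "a.pdf", "page": i}) for i in range(3)]
    + [Page(f"b{i}", {"source": "b.pdf", "page": i}) for i in range(2)]
)
counts = pages_per_source(sample_pages)
print(dict(counts))
```

In the notebook, `pages_per_source(pdf_reader_processed)` would give the per-file breakdown of the 211 pages.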

## Advanced Data Extraction with LlamaParse

In this approach, we use LlamaParse to extract the text and see whether the final summarization quality is better than with the basic PDF reader.

In [161]:
# Parser for documents
parser = LlamaParse(
    result_type="markdown",  # "markdown" and "text" are available
    num_workers=5,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
)

In [172]:
# all home depot data in the data folders

# async batch
llama_parse_processed = await parser.aload_data([f"./data/{file}" for file in files])


Parsing files: 100%|██████████| 3/3 [00:23<00:00,  7.81s/it]


In [174]:
# simplify Document format to contain necessary information only
def simplify_document(document):
    return Document(
        metadata={
            'source': document.metadata.get('file_path', ''),
            'page': document.metadata.get('page_label', '')
        },
        page_content=document.text
    )

llama_parse_processed = [simplify_document(doc) for doc in llama_parse_processed]



# Method: Refine

The Refine method is an alternative way to summarize large documents. It first runs an initial prompt on a small chunk of data to generate an output. Then, for each subsequent document, the output from the previous step is passed in along with the new document, and the LLM is asked to refine the output based on the new document.

In LangChain, you can use `RefineDocumentsChain` through the `load_summarize_chain` method by setting `chain_type` to `refine`.
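The loop described above can be sketched in plain Python. This is a minimal illustration of the idea, not LangChain's implementation: `refine_summarize` and `echo_llm` are hypothetical names, and the "model" simply echoes its prompt so the data flow is visible without calling an API.

```python
from typing import Callable, List, Tuple


def refine_summarize(
    chunks: List[str],
    llm: Callable[[str], str],
    question_template: str,
    refine_template: str,
) -> Tuple[str, List[str]]:
    """Run a refine-style loop; return the final output and every intermediate one."""
    if not chunks:
        raise ValueError("need at least one chunk")
    # First chunk: the initial (question) prompt sees only the text.
    summary = llm(question_template.format(text=chunks[0]))
    steps = [summary]
    # Remaining chunks: the refine prompt sees the running output plus the new text.
    for chunk in chunks[1:]:
        summary = llm(refine_template.format(existing_answer=summary, text=chunk))
        steps.append(summary)
    return summary, steps


def echo_llm(prompt: str) -> str:
    """Stand-in for a model call: returns the prompt unchanged."""
    return prompt


final, steps = refine_summarize(
    ["a", "b", "c"],
    echo_llm,
    question_template="{text}",
    refine_template="{existing_answer}+{text}",
)
print(steps)  # ['a', 'a+b', 'a+b+c']
```

Each step depends on the previous one, so the calls are sequential; the list of intermediate outputs plays the role of `intermediate_steps` in the chain used later.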

### Prompt design with `Refine` Chain

With LangChain, the `refine` chain requires two prompts:

The question prompt generates the initial output from the first chunk. The refine prompt then updates that output as each subsequent chunk is processed.


In [107]:

question_prompt_template = """
                  Please provide a summary of the following text.
                  TEXT: {text}
                  SUMMARY:
                  """

question_prompt = PromptTemplate(
    template=question_prompt_template, input_variables=["text"]
)

refine_prompt_template = """
              Write a concise summary of the following text delimited by triple backquotes.
              Return your response in bullet points which covers the key points of the text.
              ```{text}```
              BULLET POINT SUMMARY:
              """

refine_prompt = PromptTemplate(
    template=refine_prompt_template, input_variables=["text"]
)
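Note that the refine prompt above references only `{text}`, so each step summarizes its page on its own, which suits the page-level tables built later. If a cumulative summary were wanted instead, the refine prompt would also need the running-summary variable, which LangChain's refine chain names `existing_answer` by default. The sketch below uses plain string templates; the prompt wording and the helper `template_variables` are hypothetical.

```python
from string import Formatter


def template_variables(template: str) -> set:
    """Return the placeholder names used in a str.format-style template."""
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Same shape as the notebook's refine prompt: only the new text is referenced.
per_page_template = """
Write a concise summary of the following text.
TEXT: {text}
BULLET POINT SUMMARY:
"""

# Hypothetical alternative: carries the running summary forward at each step.
cumulative_template = """
Here is the summary so far:
{existing_answer}

Refine it using the additional text below, keeping the key points seen so far.
TEXT: {text}
REFINED BULLET POINT SUMMARY:
"""

print(sorted(template_variables(per_page_template)))
print(sorted(template_variables(cumulative_template)))
```

Checking the placeholder set before building the chain is a cheap way to confirm which behavior a prompt will produce.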

## Generate Summaries

After defining the prompts, create a summarization chain with the `refine` chain type. Setting `return_intermediate_steps=True` keeps the output produced at every step.

In [33]:
refine_chain = load_summarize_chain(
    llm,
    chain_type="refine",
    question_prompt=question_prompt,
    refine_prompt=refine_prompt,
    return_intermediate_steps=True,
)

### PDF Reader Chunk Summaries

Next, run the summarization chain over the page-level documents extracted by `PyPDFLoader`. The `intermediate_steps` output holds one summary per input document.

In [175]:
# create summary for each chunk extracted by pdf reader 
pdf_reader_refine_outputs = refine_chain({'input_documents': pdf_reader_processed})

In [177]:
pdfreader_final_refine_data = []
for doc, out in zip(
    pdf_reader_refine_outputs["input_documents"], pdf_reader_refine_outputs["intermediate_steps"]
):
    output = {}
    output["file_name"] = p(doc.metadata["source"]).stem
    output["file_type"] = p(doc.metadata["source"]).suffix
    output["page_number"] = doc.metadata["page"]
    output["chunks"] = doc.page_content
    output["concise_summary"] = out
    pdfreader_final_refine_data.append(output)

In [179]:
pdfreader_refine_summary = pd.DataFrame.from_dict(pdfreader_final_refine_data)
pdfreader_refine_summary.reset_index(inplace=True, drop=True)
pdfreader_refine_summary.head()
     

| | file_name | file_type | page_number | chunks | concise_summary |
|---|---|---|---|---|---|
| 0 | hd_annual_report | .pdf | 0 | ‘23Annual\nReport | It seems like the text you provided is incompl... |
| 1 | hd_annual_report | .pdf | 1 | Fiscal 2023: A Year of Moderation\nFiscal 2023... | - Fiscal 2023 experienced moderation following... |
| 2 | hd_annual_report | .pdf | 2 | We also see an opportunity to drive sales thro... | - Opportunity identified for sales growth thro... |
| 3 | hd_annual_report | .pdf | 3 | Table of Contents\nUNITED STATES\nSECURITIES A... | - The document is the Form 10-K annual report ... |
| 4 | hd_annual_report | .pdf | 4 | TABLE OF CONTENTS\nCommonly Used or Defined Te... | - **Table of Contents Overview**: Organized in... |


### Llama Parse Chunk Summaries

In [182]:
# create summary for each chunk extracted by LlamaParse
llamaparse_refine_outputs = refine_chain({'input_documents': llama_parse_processed})

In [184]:
llamaparse_final_refine_data = []
for doc, out in zip(
    llamaparse_refine_outputs["input_documents"], llamaparse_refine_outputs["intermediate_steps"]
):
    output = {}
    output["chunks"] = doc.page_content
    output["concise_summary"] = out
    llamaparse_final_refine_data.append(output)

In [186]:
llamaparse_refine_summary = pd.DataFrame.from_dict(llamaparse_final_refine_data)
llamaparse_refine_summary.reset_index(inplace=True, drop=True)
llamaparse_refine_summary.head()
     

| | chunks | concise_summary |
|---|---|---|
| 0 | # Annual Report '23\n\n# Espanol Yo Hablo\n\n#... | The text appears to be a title or header for a... |
| 1 | # Fiscal 2023: A Year of Moderation\n\nFiscal ... | - Fiscal 2023 experienced a moderation in the ... |
| 2 | # Cleaning\n\n# Lumber\n\n# Light Bulbs\n\n# P... | - **Store Expansion**: Plans to open approxima... |
| 3 | # Table of Contents\n\n# UNITED STATES SECURIT... | - The document is the Form 10-K annual report ... |
| 4 | # TABLE OF CONTENTS\n\n\|Commonly Used or Defin... | - **Table of Contents Overview**: Organized in... |


# Final Summarization 

In [187]:
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain 
from langchain_core.messages import AIMessage, HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser

Combine all the page-level summaries to generate a final summary.
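Joining several hundred page summaries can produce a very long prompt, so it may help to check the combined size before sending it to the model. The sketch below is one possible approach, not part of the original notebook: `combine_summaries` and its `max_chars` limit are hypothetical, and character count is only a rough proxy for tokens.

```python
def combine_summaries(summaries, max_chars=None, separator="\n\n"):
    """Join non-empty summaries; stop before exceeding max_chars if it is set.

    Returns the combined text and the number of summaries left out.
    """
    cleaned = [str(s).strip() for s in summaries if str(s).strip()]
    kept, total = [], 0
    for text in cleaned:
        # Account for the separator that precedes every summary but the first.
        added = len(text) + (len(separator) if kept else 0)
        if max_chars is not None and total + added > max_chars:
            break
        kept.append(text)
        total += added
    return separator.join(kept), len(cleaned) - len(kept)


# Illustrative data only.
page_summaries = ["- a", "", "- b", "- c"]
combined, omitted = combine_summaries(page_summaries)
limited, limited_omitted = combine_summaries(page_summaries, max_chars=8)
print(repr(combined), omitted)
print(repr(limited), limited_omitted)
```

A non-zero omitted count signals that later pages would be missing from the final summary, in which case a larger limit or a second summarization pass is worth considering.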

## LlamaParse Final Summarization

In [None]:
llamparse_combined_summaries = "\n\n".join(llamaparse_refine_summary['concise_summary'])

In [189]:
final_sum_prompt_template = """
**Overview**:
- General Overview and Trends:
    * Provide an industry outlook, indicating if the industry is trending up, down, or stable, and explain why.
    * Analyze this retailer’s performance within the industry (sales, margins, market share).
    * Identify key macroeconomic factors influencing the retailer’s performance (e.g., inflation, interest rates).
    * Make a prediction for the next quarter, including sales growth, earnings per share, etc.
    * Note any strategic initiatives that may affect future trends and elaborate if possible.

**Annual Report:**
    1. Business Overview:
    - Summarize key business activities, including new store openings, digital transformation, pro-customer growth, etc.

    2. Risk Factors:
    - Detail significant risk factors and their potential impact on the business.
    
    3. Management’s Discussion and Analysis:
    - Prioritize quantitative metrics from this section.

**Proxy Statement:**
    1. Management’s Discussion of Operations and Business Strategy:
    - Provide strategic focus on customer-centric growth, supply chain, market leadership, etc.

    2. Human Capital Management:
    - Include details on talent retention, diversity, equity, inclusion, and employee engagement.

    3. Corporate Governance and Risk Management:
    - Discuss enterprise risk management, board oversight, cybersecurity, etc.

    4. Market Performance and Shareholder Returns:
    - Provide data on return on capital, share repurchase programs, dividends, and ESG.

**Quarterly Report:**
    1. Sales Performance:
    - Include data on comparable sales, product categories, e-commerce, and regional sales performance.

    2. Market Environment:
    - Analyze macroeconomic trends, housing market, consumer demand, and competition.

    3. Operational Performance:
    - Focus on cost management, margins, inventory, and staffing metrics.

    4. Guidance:
    - Include forward-looking metrics like sales and profit projections, investment plans, etc.

    5. Strategic Investments:
    - Highlight capital expenditures, technology investments, sustainability initiatives, etc.

Be sure to highlight any trends, strategic shifts, or significant risks, while grounding your analysis in the quantitative details provided by the reports.

Here are the page summaries for all three documents:
{page_summaries}

<---end of provided page summaries--->

Overall Summary:
"""

final_sum_prompt = final_sum_prompt_template.format(page_summaries = llamparse_combined_summaries)

In [190]:
# use the best model
# Initialize the Azure OpenAI model
llm_4o = AzureChatOpenAI(
    azure_deployment="answer_generation",  # The deployment name of the model in Azure
    temperature=0.3,                 # Controls randomness in the output (0.0 to 1.0)
    api_version=os.environ['AZURE_OPENAI_API_VERSION'],  # API version from environment variable
    model_name='gpt-4o'         # The name of the model being used
)

In [None]:
llamaparse_summary = llm_4o.invoke(
    [SystemMessage(content=final_sum_prompt), 
     HumanMessage(content='Generate the overall summary.')]
)

In [196]:
# Display markdown
Markdown(llamaparse_summary.content)

**Industry Overview and Trends:**

- **Outlook:** The home improvement industry experienced a moderation in growth during fiscal 2023, influenced by economic uncertainty and higher interest rates. The industry is currently stable but faces pressures from shifting consumer spending patterns.

- **Retailer Performance:** Home Depot's sales declined by 3.0% to $152.7 billion, with comparable sales down 3.2%. Net earnings dropped by 11.5% to $15.1 billion, reflecting the challenging market environment.

- **Macroeconomic Factors:** Key influences include inflation, interest rates, and changing consumer preferences post-COVID-19. These factors have affected demand for home improvement products.

- **Prediction for Next Quarter:** Sales growth is expected to remain modest, with a projected decline in comparable sales by 3% to 4%. Earnings per share may also see a slight decrease.

- **Strategic Initiatives:** Home Depot is focusing on digital transformation, enhancing the Pro customer ecosystem, and expanding its store footprint. Investments in technology and supply chain improvements are ongoing.

**Annual Report:**

1. **Business Overview:**
   - Home Depot opened 13 new stores in fiscal 2023 and invested in digital navigation and store efficiency.
   - The company transitioned to a new market delivery network for appliances and acquired Construction Resources to enhance Pro offerings.

2. **Risk Factors:**
   - Risks include intense competition, economic volatility, supply chain disruptions, and cybersecurity threats.

3. **Management’s Discussion and Analysis:**
   - Total sales were $152.7 billion, with a gross profit of $50.96 billion. Operating income was $21.69 billion.

**Proxy Statement:**

1. **Management’s Discussion of Operations and Business Strategy:**
   - Focus on customer-centric growth, supply chain enhancements, and maintaining market leadership.

2. **Human Capital Management:**
   - Investment in talent retention, diversity, equity, inclusion, and employee engagement is emphasized.

3. **Corporate Governance and Risk Management:**
   - Strong governance with a focus on cybersecurity, risk management, and board oversight.

4. **Market Performance and Shareholder Returns:**
   - Return on Invested Capital (ROIC) was 36.7%. The company returned over $16 billion to shareholders through dividends and share repurchases.

**Quarterly Report:**

1. **Sales Performance:**
   - Q2 fiscal 2024 sales were $43.2 billion, a 0.6% increase from the previous year. Comparable sales decreased by 3.3%.

2. **Market Environment:**
   - The market is affected by higher interest rates and economic uncertainty, impacting consumer demand.

3. **Operational Performance:**
   - Operating income was $6.5 billion, with an operating margin of 15.1%. Cost management and inventory efficiency remain priorities.

4. **Guidance:**
   - Total sales for fiscal 2024 are projected to increase by 2.5% to 3.5%, with a decline in comparable sales by 3% to 4%.

5. **Strategic Investments:**
   - Continued focus on technology investments, sustainability initiatives, and expanding store locations.

**Overall Summary:**

Home Depot is navigating a challenging market environment with strategic investments in digital and physical retail enhancements. Despite a decline in sales and earnings, the company remains committed to long-term growth through customer-centric initiatives and operational efficiencies. The focus on Pro customers and supply chain improvements positions Home Depot for future resilience.

## PyPDFLoader Final Summarization

In [198]:
pdfreader_combined_summaries = "\n\n".join(pdfreader_refine_summary['concise_summary'])

In [209]:
final_sum_prompt = final_sum_prompt_template.format(page_summaries = pdfreader_combined_summaries)

In [210]:
pdfreader_summary = llm_4o.invoke(
    [SystemMessage(content=final_sum_prompt), 
     HumanMessage(content='Generate the overall summary.')]
)

In [211]:
# Display markdown
Markdown(pdfreader_summary.content)

**Overall Summary:**

**Industry Overview and Trends:**
- The home improvement industry experienced moderation in fiscal 2023 after several years of growth, influenced by high interest rates and economic uncertainty.
- Home Depot's total sales decreased by 3.0% to $152.7 billion, with comparable sales down 3.2%.
- Despite challenges, Home Depot maintained a strong position with net earnings of $15.1 billion.
- Key macroeconomic factors affecting performance include inflation and shifting consumer spending patterns.
- Strategic initiatives focus on enhancing customer experience, digital transformation, and expanding the Pro ecosystem.

**Annual Report Highlights:**
- Home Depot opened 13 new stores in fiscal 2023 and plans to open 80 more over five years.
- Investments included $1 billion in frontline associate compensation and $3.2 billion in capital expenditures.
- Risk factors include competition, cybersecurity threats, and supply chain disruptions.
- Management emphasizes operational improvements and maintaining a low-cost provider position.

**Proxy Statement Insights:**
- Strategic focus on customer-centric growth and supply chain enhancements.
- Commitment to diversity, equity, and inclusion with detailed workforce demographics.
- Strong corporate governance with emphasis on cybersecurity and risk management.
- Shareholder returns include $8.4 billion in dividends and $8.0 billion in share repurchases.

**Quarterly Report Highlights:**
- Q2 fiscal 2024 sales increased slightly to $43.2 billion, but comparable sales fell by 3.3%.
- Operating income and net earnings showed slight declines due to economic pressures.
- Updated guidance projects a 2.5% to 3.5% sales increase for fiscal 2024, with strategic investments in new stores and technology.
- Continued focus on enhancing the interconnected shopping experience and Pro customer services.

**Strategic Initiatives:**
- Home Depot is investing in digital tools and supply chain capabilities to improve customer experience.
- The company aims to leverage its competitive advantages for long-term growth and shareholder value.
- Sustainability efforts include reducing emissions and promoting eco-friendly products.

Overall, Home Depot remains a leader in the home improvement industry, focusing on strategic growth and operational efficiency despite economic challenges.

# Conclusion

Both summaries offer strong insights, but each approach has its strengths and limitations:

### PyPDFLoader Version:
**Strengths:**
- **Clarity**: This version presents the information in a very clear and well-organized manner, making it easy to digest.
- **Comprehensive Overview**: The "Overall Summary" section gives a broader view of Home Depot's performance and key trends.
- **Detailed Proxy and Quarterly Report**: The breakdown of the proxy and quarterly report is informative, with a good balance of strategic and financial highlights.

**Weaknesses:**
- **Lack of Depth in Quantitative Data**: Although it mentions sales and net earnings, this version lacks deeper quantitative analysis, especially when discussing projections or comparisons to past figures.
- **Mildly Generic Language**: Some phrases (e.g., "continued focus on enhancing the interconnected shopping experience") could be more specific. While it provides strategic initiatives, it doesn’t elaborate much on their potential impact.

### LlamaParse Version:
**Strengths:**
- **Strong Quantitative Focus**: This version includes more precise figures (e.g., ROIC, operating income, and margins), offering deeper financial insights and predictions for the future.
- **Prediction Clarity**: It gives a clearer and more detailed forecast for the next quarter, with figures on projected sales decline and earnings per share.
- **Risk and Strategy Integration**: The LlamaParse version connects risks (economic volatility, cybersecurity) with the company’s strategic responses, providing a more holistic picture of how Home Depot is navigating these challenges.

**Weaknesses:**
- **More Dense**: The wording can feel a bit more technical, which might make it harder for some audiences to follow.
- **Less Emphasis on Narrative**: Compared to the PyPDFLoader version, this one is more data-driven, but it doesn’t weave the story of Home Depot’s performance and strategy as smoothly.

### Overall Recommendation:
- If your goal is to provide **straightforward, easily readable summaries** that are quick to digest, the **PyPDFLoader version** does a great job.
- If you're aiming for a **data-rich, highly detailed financial summary** with stronger quantitative focus and clear predictions, the **LlamaParse version** excels.

It depends on your audience and what level of detail they require!
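One rough way to put a number on the "quantitative focus" comparison is to count the dollar amounts and percentages each summary contains. The sketch below is a crude proxy under stated assumptions, not a quality metric: `FIGURE_PATTERN` and `count_figures` are hypothetical helpers, the pattern only covers the figure styles seen in these summaries, and the sample sentences are short illustrations rather than the full outputs.

```python
import re

# Dollar amounts such as "$152.7 billion" or "$16 billion", and percentages such as "3.2%".
FIGURE_PATTERN = re.compile(
    r"\$\d[\d,]*(?:\.\d+)?(?:\s(?:billion|million))?|\d+(?:\.\d+)?%"
)


def count_figures(summary: str) -> int:
    """Count dollar amounts and percentages mentioned in a summary."""
    return len(FIGURE_PATTERN.findall(summary))


# Illustrative sentences only.
with_figures = "Sales fell 3.0% to $152.7 billion; ROIC was 36.7%."
without_figures = "Management emphasizes operational improvements."

print(count_figures(with_figures))     # 3
print(count_figures(without_figures))  # 0
```

Applied to `llamaparse_summary.content` and `pdfreader_summary.content`, the two counts would give a simple, repeatable check alongside the qualitative reading above.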