# Large File Summarization using LangChain/LCEL with Bedrock API 
## GenAI Code Accelerator 
Author: Sundaresan Manoharan - Enterprise Architecture AI/ML Team
> *This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

Text summarization is a Natural Language Processing (NLP) task that condenses a long text into a shorter one. Machine learning and deep learning models extract the important information from a document and present it concisely and coherently while preserving its meaning. This makes it possible to digest large volumes of content efficiently. Summarization is a key capability of LLMs, with applications across industries that improve understanding and save time. This notebook demonstrates text summarization using the Amazon Bedrock API.

Challenge: Large documents may exceed the model's token limit, and summary quality is harder to maintain as the input grows. When working with large documents, the input text might not fit into the model's context length, the model might hallucinate, or the request might fail with out-of-memory errors.

To address these problems, this notebook shows an architecture based on chunking and chaining prompts. It uses LangChain, a popular framework for developing applications powered by language models.

Use Cases:
- Books, Articles, Blogs, Research Papers

Foundation Model(s):
- Amazon Titan Lite

This notebook introduces Text Summarization using Amazon Bedrock API.  
- Uses various Foundation Models (LLM agnostic)
- Uses a PDF document (a large document that doesn't fit in the LLM context window)
- Uses a simple, easy-to-adapt, bite-sized code accelerator

Insert Architecture Diagram

In this architecture:

1. A large document (or one large file built by appending smaller ones) is loaded.
1. A LangChain utility splits it into multiple smaller chunks (chunking).
1. The first chunk is sent to the model, and the model returns its summary.
1. LangChain appends the next chunk to the returned summary and sends the combined text to the model as a new request; this repeats until all chunks are processed.
1. The result is a final summary based on the entire content.
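
The steps above can be sketched in plain Python. This is a minimal illustration, not LangChain's implementation: `split_into_chunks`, `rolling_summary`, and `fake_summarize` are hypothetical names, and `fake_summarize` stands in for a real model call so the example runs without AWS access.

```python
from typing import Callable, List


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Naive fixed-width split; stands in for a real text splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def rolling_summary(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize the first chunk, then fold each later chunk into the running summary."""
    if not chunks:
        return ""
    summary = summarize(chunks[0])
    for chunk in chunks[1:]:
        # Combine the previous summary with the next chunk and summarize again
        summary = summarize(summary + "\n" + chunk)
    return summary


def fake_summarize(text: str) -> str:
    """Hypothetical stand-in for a model call: keeps the first 20 characters."""
    return text[:20]


chunks = split_into_chunks("a" * 50 + "b" * 50, chunk_size=40)
final_summary = rolling_summary(chunks, fake_summarize)
```

In a real run, `summarize` would wrap a Bedrock call, and each request would contain only one chunk plus the running summary, which keeps every prompt within the model's input limit.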


### Install Libraries

In [2]:
!pip install --upgrade pip


In [3]:
%pip install --no-build-isolation --force-reinstall \
    "boto3>=1.28.57" \
    "awscli>=1.29.57" \
    "botocore>=1.31.57"

Collecting boto3>=1.28.57
  Downloading boto3-1.34.42-py3-none-any.whl.metadata (6.6 kB)
Collecting awscli>=1.29.57
  Downloading awscli-1.32.42-py3-none-any.whl.metadata (11 kB)
Collecting botocore>=1.31.57
  Downloading botocore-1.34.42-py3-none-any.whl.metadata (5.7 kB)
Collecting jmespath<2.0.0,>=0.7.1 (from boto3>=1.28.57)
  Using cached jmespath-1.0.1-py3-none-any.whl (20 kB)
Collecting s3transfer<0.11.0,>=0.10.0 (from boto3>=1.28.57)
  Using cached s3transfer-0.10.0-py3-none-any.whl.metadata (1.7 kB)
Collecting docutils<0.17,>=0.10 (from awscli>=1.29.57)
  Using cached docutils-0.16-py2.py3-none-any.whl (548 kB)
Collecting PyYAML<6.1,>=3.10 (from awscli>=1.29.57)
  Using cached PyYAML-6.0.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.1 kB)
Collecting colorama<0.4.5,>=0.2.5 (from awscli>=1.29.57)
  Using cached colorama-0.4.4-py2.py3-none-any.whl (16 kB)
Collecting rsa<4.8,>=3.1.2 (from awscli>=1.29.57)
  Using cached rsa-4.7.2-py3-none-any.whl (34 kB)


In [4]:
%pip install langchain
!pip install transformers
# install PDF Reader library
!pip install PyPDF2


Note: you may need to restart the kernel to use updated packages.

### Import Libraries

In [5]:
import json
import os
import sys
import pandas as pd
import re

import boto3
import botocore
from IPython.display import display_markdown, Markdown, clear_output
from PyPDF2 import PdfReader

from langchain.llms.bedrock import Bedrock


### Initialize boto session

In [6]:
# module_path = ".."
# sys.path.append(os.path.abspath(module_path))

boto_session = boto3.Session()
aws_region = boto_session.region_name
print(aws_region)
br_client = boto_session.client("bedrock", region_name=aws_region)
br_runtime = boto_session.client("bedrock-runtime", region_name=aws_region)


us-east-1


### Test Connection & List Foundation Models

In [7]:
fms = br_client.list_foundation_models()['modelSummaries']
dfFM = pd.DataFrame(fms)
print(dfFM.shape)
dfFM.head()

(45, 10)


| | modelArn | modelId | modelName | providerName | inputModalities | outputModalities | responseStreamingSupported | customizationsSupported | inferenceTypesSupported | modelLifecycle |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | arn:aws:bedrock:us-east-1::foundation-model/am... | amazon.titan-tg1-large | Titan Text Large | Amazon | [TEXT] | [TEXT] | True | [] | [ON_DEMAND] | {'status': 'ACTIVE'} |
| 1 | arn:aws:bedrock:us-east-1::foundation-model/am... | amazon.titan-image-generator-v1:0 | Titan Image Generator G1 | Amazon | [TEXT, IMAGE] | [IMAGE] | | [FINE_TUNING] | [ON_DEMAND, PROVISIONED] | {'status': 'ACTIVE'} |
| 2 | arn:aws:bedrock:us-east-1::foundation-model/am... | amazon.titan-image-generator-v1 | Titan Image Generator G1 | Amazon | [TEXT, IMAGE] | [IMAGE] | | [] | [ON_DEMAND] | {'status': 'ACTIVE'} |
| 3 | arn:aws:bedrock:us-east-1::foundation-model/am... | amazon.titan-embed-g1-text-02 | Titan Text Embeddings v2 | Amazon | [TEXT] | [EMBEDDING] | | [] | [ON_DEMAND] | {'status': 'ACTIVE'} |
| 4 | arn:aws:bedrock:us-east-1::foundation-model/am... | amazon.titan-text-lite-v1:0:4k | Titan Text G1 - Lite | Amazon | [TEXT] | [TEXT] | True | [FINE_TUNING, CONTINUED_PRE_TRAINING] | [PROVISIONED] | {'status': 'ACTIVE'} |


In [8]:
dfFM.columns

Index(['modelArn', 'modelId', 'modelName', 'providerName', 'inputModalities',
       'outputModalities', 'responseStreamingSupported',
       'customizationsSupported', 'inferenceTypesSupported', 'modelLifecycle'],
      dtype='object')

In [9]:
dfFM.modelName.unique()

array(['Titan Text Large', 'Titan Image Generator G1',
       'Titan Text Embeddings v2', 'Titan Text G1 - Lite',
       'Titan Text G1 - Express', 'Titan Embeddings G1 - Text',
       'Titan Multimodal Embeddings G1', 'SDXL 0.8', 'SDXL 1.0',
       'J2 Grande Instruct', 'J2 Jumbo Instruct', 'Jurassic-2 Mid',
       'Jurassic-2 Ultra', 'Claude Instant', 'Claude', 'Command',
       'Command Light', 'Embed English', 'Embed Multilingual',
       'Llama 2 Chat 13B', 'Llama 2 Chat 70B', 'Llama 2 13B',
       'Llama 2 70B'], dtype=object)

## Large Document Summarization

### Download a public dataset

In [10]:
%%sh
# 49786
wget -O fnma-esg-2022.pdf https://www.fanniemae.com/media/48156/display

--2024-02-15 00:05:02--  https://www.fanniemae.com/media/48156/display
Resolving www.fanniemae.com (www.fanniemae.com)... 104.18.27.25, 104.18.26.25, 2606:4700::6812:1b19, ...
Connecting to www.fanniemae.com (www.fanniemae.com)|104.18.27.25|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8202655 (7.8M) [application/pdf]
Saving to: ‘fnma-esg-2022.pdf’


### Read and Extract Text from PDF File

In [11]:
filename = 'fnma-esg-2022.pdf'
reader = PdfReader(filename)
print("Total Pages:", len(reader.pages))


Total Pages: 78


In [12]:
pages = []

for page in reader.pages:
    # Passing 0 keeps only text in the upright (0-degree) orientation
    text = page.extract_text(0)
    pages.append(text)

# Report the count from the list itself so this also works for an empty PDF
print(f"Extracted {len(pages)} pages successfully.")

Extracted 78 pages successfully.


In [13]:
# Combine the extracted text from pages 6-20 (zero-based slice 5:20)
all_text = "\n".join(pages[5:20])
# Count words (a rough proxy for size; this is not a token count)
print('Total Word Count:', len(re.findall(r"[\w']+", all_text)))

Total Word Count: 7039


### Send Large Text Directly to Small Context Window

The following cell sends the large text directly to the LLM for inference. The request fails with a `ValidationException` because the input exceeds the maximum input length for this model.

In [14]:
prompt = f"""
Please provide a summary of the following text. 

<text>
{all_text}
</text>

"""

In [15]:
%%time

body = json.dumps({"inputText": prompt, 
                   "textGenerationConfig":{
                       "maxTokenCount":256,
                       "stopSequences":[],
                       "temperature":0,
                       "topP":1
                   },
                  }) 

modelId = 'amazon.titan-text-lite-v1' # change this to use a different version from the model provider
accept = 'application/json'
contentType = 'application/json'

try:
    
    response = br_runtime.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
    response_body = json.loads(response.get('body').read())
    output_text = response_body.get('results')[0].get('outputText')
    print(len(re.findall(r"[\w']+", output_text)))
    print(output_text)

except botocore.exceptions.ClientError as error:    
    raise error

ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: Malformed input request: expected maxLength: 42000, actual: 46393, please reformat your input and try again.
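
The failure above can be caught before any network call. The sketch below is a minimal pre-flight check. The 42,000 value is taken from the error message above, and it is an assumption that it applies to the character length of `inputText`: the reported length of 46,393 is consistent with `len(all_text)` (46,325) plus the prompt wrapper. Other models may report a different limit. `build_titan_body` is a hypothetical helper name.

```python
import json

# Assumption: 42,000 is the inputText limit reported in the error above for
# amazon.titan-text-lite-v1; other models may report a different value.
MAX_INPUT_CHARS = 42_000


def build_titan_body(prompt: str, max_input_chars: int = MAX_INPUT_CHARS) -> str:
    """Return the JSON request body, or raise locally if the prompt is too long."""
    if len(prompt) > max_input_chars:
        raise ValueError(
            f"Prompt has {len(prompt)} characters; limit is {max_input_chars}. "
            "Split the text into chunks first."
        )
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": 256,
            "stopSequences": [],
            "temperature": 0,
            "topP": 1,
        },
    })
```

The returned string can be passed as `body` to `br_runtime.invoke_model`; an oversized prompt then fails fast with a local `ValueError` instead of a round trip to the service.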

In [29]:
len(all_text)

46325

## Summarize large text 

### Configuring LangChain with Boto3

LangChain can access Bedrock once you pass it a boto3 client. If you pass None, LangChain tries to get session information from your environment. Ensure the correct client, `bedrock-runtime`, is used.

You need to specify the LLM for the LangChain Bedrock class, and you can pass arguments for inference. Here you specify Amazon Titan Text Lite in `model_id` and pass Titan's inference parameters in `model_kwargs`.

The following cell creates the LLM and counts the number of tokens in the extracted text.

You will see a warning indicating that the number of tokens in the text exceeds the maximum sequence length (1024) reported by the tokenizer used for counting.


In [17]:
modelId = "amazon.titan-text-lite-v1"
llm = Bedrock(
    model_id=modelId,
    model_kwargs={
        "maxTokenCount": 4096,
        "stopSequences": [],
        "temperature": 0,
        "topP": 1,
    },
    client=br_runtime,
)

# Estimate the token count (roughly 4-5 characters per token for this text)
llm.get_num_tokens(all_text)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Token indices sequence length is longer than the specified maximum sequence length for this model (11074 > 1024). Running this sequence through the model will result in indexing errors


11074
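
The counts reported in this notebook give a rough characters-per-token ratio for this document, which can help when choosing a chunk size. The sketch below uses the two numbers shown in the outputs (46,325 characters and 11,074 tokens). The helper names are hypothetical, and the ratio is specific to this text and to the tokenizer used for counting, so treat it as an estimate only.

```python
import math


def chars_per_token(num_chars: int, num_tokens: int) -> float:
    """Average number of characters per token for a measured text."""
    return num_chars / num_tokens


def estimate_tokens(num_chars: int, ratio: float) -> int:
    """Estimate the token count of a text of num_chars characters, rounded up."""
    return math.ceil(num_chars / ratio)


# Values taken from the cell outputs in this notebook
ratio = chars_per_token(num_chars=46_325, num_tokens=11_074)
estimated_chunk_tokens = estimate_tokens(5_000, ratio)
```

An estimate like this is only a planning aid; `llm.get_num_tokens` on the actual chunk remains the more direct check.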

### Splitting the long text into chunks

The text is too long to fit in the prompt, so we split it into smaller chunks. LangChain's `RecursiveCharacterTextSplitter` splits long text recursively until each chunk is smaller than `chunk_size`. With `separators=["\n\n", "\n"]`, it splits on paragraph breaks first and then on line breaks, which avoids cutting a paragraph across chunks where possible.

Using 5,000 characters per chunk, we can get summaries for each portion separately. The number of tokens, or word pieces, in a chunk depends on the text.
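
To make the idea of paragraph-preserving chunking concrete, here is a simplified sketch. It is not LangChain's algorithm: it uses a single separator, has no chunk overlap, and does not recurse into oversized paragraphs. `pack_paragraphs` is a hypothetical name.

```python
from typing import List


def pack_paragraphs(text: str, chunk_size: int, separator: str = "\n\n") -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most chunk_size characters.

    A single paragraph longer than chunk_size is kept whole here; a recursive
    splitter would fall back to a finer separator instead.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split(separator):
        candidate = paragraph if not current else current + separator + paragraph
        if len(candidate) <= chunk_size or not current:
            # Still fits (or the chunk is empty), so keep accumulating
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk, start a new one
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["alpha " * 5, "beta " * 5, "gamma " * 5])
chunks = pack_paragraphs(sample, chunk_size=60)
```

Because paragraphs are never cut, joining the chunks with the separator reproduces the original text, which is a convenient property to check when experimenting with chunk sizes.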


In [18]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n"], chunk_size=5000, chunk_overlap=100
)

docs = text_splitter.create_documents([all_text])

In [19]:
num_docs = len(docs)

num_tokens_first_doc = llm.get_num_tokens(docs[0].page_content)

print(
    f"Now we have {num_docs} documents and the first one has {num_tokens_first_doc} tokens"
)

Now we have 10 documents and the first one has 1122 tokens


### Summarizing chunks and combining them

Assuming the token count is similar in the other docs, we should be good to go. Let's use LangChain's [load_summarize_chain](https://python.langchain.com/en/latest/use_cases/summarization.html) to summarize the text. `load_summarize_chain` provides three summarization methods: `stuff`, `map_reduce`, and `refine`.
- `stuff` puts all the chunks into one prompt, so it would hit the token limit here.
- `map_reduce` summarizes each chunk, combines the summaries, and summarizes the combined summary. If the combined summary is too large, it raises an error.
- `refine` summarizes the first chunk, and then summarizes the second chunk together with the first summary. The same process repeats until all chunks are summarized.

`map_reduce` and `refine` invoke the LLM multiple times, so obtaining the final summary takes time.
Let's try `map_reduce` here.
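
As a rough illustration of the `map_reduce` pattern described above, the following sketch shows the control flow in plain Python. It is not the code behind `load_summarize_chain`; `map_reduce_summary` and `first_sentence` are hypothetical names, and `first_sentence` replaces the model call so the example runs offline.

```python
from typing import Callable, List


def map_reduce_summary(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Map: summarize each chunk independently. Reduce: summarize the joined summaries."""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial_summaries))


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for a model call: returns the text up to the first period."""
    return text.split(".")[0].strip() + "."


chunks = ["Cats sleep a lot. They also purr.", "Dogs bark. They fetch."]
result = map_reduce_summary(chunks, first_sentence)
```

With N chunks this makes N + 1 calls to `summarize`. The map calls are independent of each other, whereas `refine` must process the chunks in order.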


### Option 1. Use Map reduce pattern on Langchain

In [20]:
# Set verbose=True if you want to see the prompts being used
from langchain.chains.summarize import load_summarize_chain

summary_chain = load_summarize_chain(llm=llm, chain_type="map_reduce", verbose=False)

In [21]:
%%time
output = ""

try:
    output = summary_chain.run(docs)
except ValueError as error:
    raise error




CPU times: user 92.6 ms, sys: 6.4 ms, total: 99 ms
Wall time: 1min 44s


In [22]:
print(output)


Fannie Mae is a company that helps people buy or rent homes by providing mortgage financing. It has two reportable business segments: Single-Family and Multifamily. The 2022 ESG Report provides information on the company's business and operations with a focus on social impact, sustainability, and responsible governance. Fannie Mae's ESG strategy is designed around two core objectives: improving access to equitable and sustainable housing and enhancing its financial and risk positions. The company's ESG team works to deepen its understanding of ESG priorities and solutions, benefiting from enterprise-wide connectivity and visibility, engagement with external stakeholders, and Board-level oversight. Fannie Mae's priority ESG topics include business ethics, climate resilience, climate risk, community engagement, data privacy and security, diversity and inclusion, ESG integration, green homes, housing affordability, housing stability, human capital management, racial equity in housing fin

## Option 2. Manually process insights, then summarize

### LangChain Expression Language (LCEL)

In [23]:
from langchain.prompts import PromptTemplate
from langchain.output_parsers import XMLOutputParser, PydanticOutputParser
from langchain.output_parsers.json import SimpleJsonOutputParser
from langchain.schema.output_parser import StrOutputParser

xml_parser = XMLOutputParser()
str_parser = StrOutputParser()

prompt = PromptTemplate(
    template="""
    
    Human:
    {instructions} : \"{document}\"
    Assistant:""",
    input_variables=["instructions","document"],
    # Format help: {format_instructions}.
    # partial_variables={"format_instructions": xml_parser.get_format_instructions()},
)

insight_chain = prompt | llm | StrOutputParser()

In [24]:
%%time
insights = []
for doc in docs:
    insights.append(
        insight_chain.invoke({
            "instructions": "Provide Key insights from the following text",
            # Pass the text itself; wrapping it in braces would create a set
            "document": doc.page_content,
        }))
    

CPU times: user 44 ms, sys: 4.42 ms, total: 48.4 ms
Wall time: 50.6 s


In [25]:
print(docs[0].page_content, '\n\n')
print(insights[0], '\n\n')


About Fannie Mae
Who we are
The Federal National Mortgage Association, better known as 
Fannie Mae, is a purpose-driven company by charter and by 
choice. Our business supports mortgage lenders by providing 
mortgage financing to help people buy or rent a home. We help 
make the popular 30-year fixed-rate mortgage possible, enabling 
predictable mortgage payments over the life of the loan and 
giving homeowners stability and peace of mind. 
Our charter, an act of Congress, establishes our purposes: 
to provide liquidity and stability to the residential mortgage 
market and to promote access to mortgage credit. This mandate 
includes facilitating mortgages on housing for low- and 
moderate-income families involving a reasonable economic 
return that may be less than the return earned on other 
activities. Congress declared that our operations should be 
financed by private capital to the maximum extent feasible. With 
these Congressional intentions in mind, we have, principally 
using p

In [26]:
prompt = PromptTemplate(
    template="""
    
    Human:
    {instructions} : \"{document}\"
    Assistant:""",
    input_variables=["instructions","document"]
)

summary_chain = prompt | llm | StrOutputParser()

In [27]:
%%time
print(summary_chain.invoke({
        "instructions":"You will be provided with multiple paragraphs of insights. Compile and summarize these insights and provide key takeaways in one concise paragraph.",
        # Pass the joined string itself; wrapping it in braces would create a set
        "document": '\n'.join(insights)
    }))

 Fannie Mae is a company that provides mortgage financing to help people buy or rent a home. They help make the popular 30-year fixed-rate mortgage possible, enabling predictable mortgage payments over the life of the loan and giving homeowners stability and peace of mind. They do not originate mortgage loans or lend money directly to borrowers, but work primarily with lenders who originate loans to borrowers. They acquire and securitize those loans into mortgage-backed securities that they guarantee. Their revenues are primarily driven by guaranty fees they receive for assuming the credit risk on loans underlying their MBS. As of December 31, 2022, they owned or guaranteed mortgage assets representing an estimated 27% of single-family mortgage debt outstanding and 21% of multifamily mortgage debt outstanding in the U.S. Fannie Mae is committed to realizing scalable, positive impact while mitigating risk through thoughtful integration of ESG priorities throughout its business. The comp

### Conclusion

You have now experimented with the boto3 SDK, which provides direct access to the Amazon Bedrock API. Using this API, you generated a summary of a large document that doesn't fit into the LLM context window: LangChain and LCEL split it into much smaller documents, summarized each one, and then summarized the summaries to produce the final output.

#### Takeaways
- Adapt this notebook to experiment with different models available through Amazon Bedrock, such as Amazon Titan and AI21 Labs Jurassic models.
- Change the prompts to your specific use case and evaluate the output of different models.
- Experiment with the token length to understand the latency and responsiveness of the service.
- Apply different prompt engineering principles to get better outputs.