### Objective: Summarize a selected page of the 2021 or 2022 Amazon Shareholder Letter (PDF).

#### The high-level steps are as follows:

1. Convert and split the PDF files into text files, where each text file holds one page of the original document
2. Deploy the AI21 Summarizer model
3. Create a Streamlit-based UI that lets the user
    a. Select a shareholder letter document
    b. Select a page of that document
    c. Generate a summary
4. Clean up

In [3]:
### install libraries  
%pip install -q -U  pymupdf "ai21[AWS]" langchain streamlit ipykernel

Note: you may need to restart the kernel to use updated packages.


In [4]:
import fitz
import io
from langchain.document_loaders import PyPDFLoader
import sagemaker
import boto3
import ai21
import os
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri
import json
from sagemaker import ModelPackage, get_execution_role
import urllib.request



### Step 1 : Download each PDF and save every page as a local text file

In [3]:
shareholder_letter_url_2022 = 'https://s2.q4cdn.com/299287126/files/doc_financials/2023/ar/2022-Shareholder-Letter.pdf'
shareholder_letter_url_2021 = 'https://s2.q4cdn.com/299287126/files/doc_financials/2022/ar/2021-Shareholder-Letter.pdf'

In [10]:
os.chdir("summarizer-workshop")

In [26]:
def split_pdf_to_text(doc_url):
    ## download the PDF into the local docs/ folder
    os.makedirs('docs', exist_ok=True)
    saved_file = 'docs/' + doc_url.split('/')[-1]
    with urllib.request.urlopen(doc_url) as pdf, open(saved_file, 'wb') as output:
        output.write(pdf.read())
    ## write the text of each page to its own file, numbered from 1
    doc = fitz.open(saved_file)
    for i, page in enumerate(doc):
        saved_txt_file = saved_file.replace('.pdf', f'_{i+1}.txt')
        with open(saved_txt_file, 'w') as output:
            output.write(page.get_text())
        print(f'Saved page {i+1} in {saved_txt_file}')

    

In [27]:
split_pdf_to_text(shareholder_letter_url_2022)

Saved page 1 in docs/2022-Shareholder-Letter_1.txt
Saved page 2 in docs/2022-Shareholder-Letter_2.txt
Saved page 3 in docs/2022-Shareholder-Letter_3.txt
Saved page 4 in docs/2022-Shareholder-Letter_4.txt
Saved page 5 in docs/2022-Shareholder-Letter_5.txt
Saved page 6 in docs/2022-Shareholder-Letter_6.txt
Saved page 7 in docs/2022-Shareholder-Letter_7.txt
Saved page 8 in docs/2022-Shareholder-Letter_8.txt
Saved page 9 in docs/2022-Shareholder-Letter_9.txt
Saved page 10 in docs/2022-Shareholder-Letter_10.txt


In [28]:
split_pdf_to_text(shareholder_letter_url_2021)

Saved page 1 in docs/2021-Shareholder-Letter_1.txt
Saved page 2 in docs/2021-Shareholder-Letter_2.txt
Saved page 3 in docs/2021-Shareholder-Letter_3.txt
Saved page 4 in docs/2021-Shareholder-Letter_4.txt
Saved page 5 in docs/2021-Shareholder-Letter_5.txt
Saved page 6 in docs/2021-Shareholder-Letter_6.txt
Saved page 7 in docs/2021-Shareholder-Letter_7.txt
Saved page 8 in docs/2021-Shareholder-Letter_8.txt
Saved page 9 in docs/2021-Shareholder-Letter_9.txt


In [18]:
doc = fitz.open('docs/2022-Shareholder-Letter.pdf')
doc[0]

page 0 of docs/2022-Shareholder-Letter.pdf

In [22]:
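def process_pdf(saved_file):
    ## split an already downloaded PDF into one text file per page
    doc = fitz.open(saved_file)
    for i, page in enumerate(doc):
        saved_txt_file = saved_file.replace('.pdf', f'_{i+1}.txt')
        with open(saved_txt_file, 'w') as output:
            output.write(page.get_text())
        print(f'Saved page {i+1} in {saved_txt_file}')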

In [23]:
process_pdf('docs//2022-Shareholder-Letter.pdf')

Saved page 1 in docs//2022-Shareholder-Letter_1.txt
Saved page 2 in docs//2022-Shareholder-Letter_2.txt
Saved page 3 in docs//2022-Shareholder-Letter_3.txt
Saved page 4 in docs//2022-Shareholder-Letter_4.txt
Saved page 5 in docs//2022-Shareholder-Letter_5.txt
Saved page 6 in docs//2022-Shareholder-Letter_6.txt
Saved page 7 in docs//2022-Shareholder-Letter_7.txt
Saved page 8 in docs//2022-Shareholder-Letter_8.txt
Saved page 9 in docs//2022-Shareholder-Letter_9.txt
Saved page 10 in docs//2022-Shareholder-Letter_10.txt


### Step 2 : Deploy AI21 Summarizer model from Amazon SageMaker Marketplace 
This model generates a summary of any body of text. The source text can contain up to 50,000 characters, which is roughly 10,000 words or about 40 pages.

**No prompting needed** – simply input the text that needs to be summarized. The model is specifically trained to generate summaries that capture the essence and key ideas of the original text.

Learn more at https://aws.amazon.com/marketplace/pp/prodview-dkwy6chb63hk2?sr=0-3&ref_=beagle&applicationId=AWSMPContessa
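
Since the model description quotes a limit of 50,000 characters, it may help to confirm that a page file stays under that limit before sending it to the endpoint. The following is a minimal sketch; `MAX_SOURCE_CHARS` and `check_source_length` are hypothetical names used for illustration and are not part of the AI21 SDK.

```python
import os
import tempfile

# Limit quoted in the model description; assumed here to apply per request.
MAX_SOURCE_CHARS = 50_000

def check_source_length(path, limit=MAX_SOURCE_CHARS):
    """Return (char_count, fits) for the text file at `path`."""
    with open(path, 'r') as f:
        text = f.read()
    return len(text), len(text) <= limit

# Usage example with a throwaway file standing in for one page file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'sample_page.txt')
    with open(sample, 'w') as f:
        f.write('a' * 1200)
    sample_chars, sample_fits = check_source_length(sample)
    print(sample_chars, sample_fits)
```

In the notebook, the same helper could be pointed at any of the `docs/*_<page>.txt` files written in Step 1.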


In [11]:
boto3_session = boto3.session.Session()
sess = sagemaker.session.Session(boto_session=boto3_session)
sagemaker_session_bucket = sess.default_bucket()
aws_role = sess.get_caller_identity_arn()
aws_region = boto3_session.region_name
print(f'Session S3 Bucket: {sagemaker_session_bucket}')
print(f'Role: {aws_role}')
print(f'aws_region: {aws_region}')

Session S3 Bucket: sagemaker-us-west-2-102048127330
Role: arn:aws:iam::102048127330:role/service-role/AmazonSageMaker-ExecutionRole-20230403T093418
aws_region: us-west-2


In [13]:
model_package_map = {
    "us-east-1": "arn:aws:sagemaker:us-east-1:865070037744:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "us-east-2": "arn:aws:sagemaker:us-east-2:057799348421:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "us-west-1": "arn:aws:sagemaker:us-west-1:382657785993:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "us-west-2": "arn:aws:sagemaker:us-west-2:594846645681:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "ca-central-1": "arn:aws:sagemaker:ca-central-1:470592106596:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "eu-central-1": "arn:aws:sagemaker:eu-central-1:446921602837:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "eu-west-1": "arn:aws:sagemaker:eu-west-1:985815980388:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "eu-west-2": "arn:aws:sagemaker:eu-west-2:856760150666:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "eu-west-3": "arn:aws:sagemaker:eu-west-3:843114510376:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "eu-north-1": "arn:aws:sagemaker:eu-north-1:136758871317:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "ap-southeast-1": "arn:aws:sagemaker:ap-southeast-1:192199979996:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "ap-southeast-2": "arn:aws:sagemaker:ap-southeast-2:666831318237:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "ap-northeast-2": "arn:aws:sagemaker:ap-northeast-2:745090734665:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "ap-northeast-1": "arn:aws:sagemaker:ap-northeast-1:977537786026:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "ap-south-1": "arn:aws:sagemaker:ap-south-1:077584701553:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba",
    "sa-east-1": "arn:aws:sagemaker:sa-east-1:270155090741:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba"
}

model_package_arn = model_package_map[aws_region]
model_package_arn

'arn:aws:sagemaker:us-west-2:594846645681:model-package/summarize-1-1-003-c51dc6a4ff7e34a1b55ac1e4f337baba'
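
The direct dictionary lookup raises a bare `KeyError` when the session's region is not in the map. A minimal sketch of a friendlier lookup is shown here; `lookup_model_package` is a hypothetical helper, and `demo_map` uses placeholder values rather than real ARNs.

```python
def lookup_model_package(region, package_map):
    """Return the model package ARN for `region`, or raise a descriptive error."""
    try:
        return package_map[region]
    except KeyError:
        supported = ', '.join(sorted(package_map))
        raise ValueError(
            f"No model package listed for region '{region}'. "
            f"Supported regions: {supported}"
        ) from None

# Tiny stand-in map with placeholder values; the notebook would pass model_package_map.
demo_map = {'us-west-2': 'arn-placeholder-west', 'us-east-1': 'arn-placeholder-east'}
print(lookup_model_package('us-west-2', demo_map))
```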

In [7]:
## function to deploy and undeploy AI21 model 
def ai21_summary_model_handler(DEPLOYED=False):
    endpoint_name = "ai21-summarizer-endpoint"
    content_type = "application/json"
    # create a deployable model from the model package.
    model = ModelPackage(role=aws_role, 
                    model_package_arn=model_package_arn, 
                    sagemaker_session=sess
    )
    if not DEPLOYED:
        # Deploy the model
        predictor = model.deploy(1, "ml.g4dn.4xlarge", 
                                endpoint_name=endpoint_name, 
                                model_data_download_timeout=3600,
                                container_startup_health_check_timeout=600,
                                )
    else:
        # Undeploy and cleanup 
        model.sagemaker_session.delete_endpoint(endpoint_name)
        model.sagemaker_session.delete_endpoint_config(endpoint_name)
        print('clean up done!')
        

In [37]:
## deploy the model 
ai21_summary_model_handler(DEPLOYED=False)


---------!

### Step 3 : Create Streamlit-based web UI to summarize the documents 

In [5]:
%%writefile summarizer-workshop/summ.py

import ai21
import json as json
import os
import streamlit as st 

USER_ICON = "images/user-icon.png"
AI_ICON = "images/ai-icon.png"
SUMMARIZER_ENDPOINT_NAME = "ai21-summarizer-endpoint"

st.markdown("""
        <style>
               .block-container {
                    padding-top: 32px;
                    padding-bottom: 32px;
                    padding-left: 0;
                    padding-right: 0;
                }
                .element-container img {
                    background-color: #000000;
                }

                .main-header {
                    font-size: 32px;
                }
                .main-subheader {
                    font-size: 24px;
                }
        </style>
        """, unsafe_allow_html=True)

def write_logo():
    col1, col2, col3 = st.columns([1, 1, 5])
    with col1:
        st.image(AI_ICON, use_column_width='always')
    with col3:
        header = "Generative AI Powered Business Document Summarizer!"
        st.write(f"<h3 class='main-header'>{header}</h3>", 
                 unsafe_allow_html=True)

def write_top_bar():
    col1, col2, col3 = st.columns([12,1,4])
    with col1:
        selected_doc = st.selectbox( 
        'Please choose a Document',
         ('Amazon Shareholder Letter 2022', 
          'Amazon Shareholder Letter 2021'))
    with col2:
        pass
    with col3:
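        # Step 1 wrote 10 page files for the 2022 letter and 9 for the 2021 letter
        page_count = 10 if selected_doc.endswith('2022') else 9
        selected_page = st.selectbox( 
            'Page Number',
            [str(p) for p in range(1, page_count + 1)])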
    return selected_doc, selected_page

def get_text_source_file(selected_doc, selected_page):
    if selected_doc == 'Amazon Shareholder Letter 2022':
        filepath = f'docs/2022-Shareholder-Letter'
    elif selected_doc == 'Amazon Shareholder Letter 2021':
        filepath = f'docs/2021-Shareholder-Letter'
    filepath = f'{filepath}_{selected_page}.txt'
    return filepath

def generate_summary(selected_doc, selected_page):
    source_file = get_text_source_file(selected_doc, selected_page)
    print(source_file)
    with open(source_file, 'r') as f:
        source_text = f.read()
    response = ai21.Summarize.execute(
                          source=source_text,
                          sourceType="TEXT",
                          destination=ai21.SageMakerDestination(SUMMARIZER_ENDPOINT_NAME)
    )
    st.text_area(label="summary",
        value=f"{response.summary}",
        key="summary_results",
        label_visibility="hidden",
        height=640)
    # return the text so the caller's `summary` variable is not None
    return response.summary
    
if __name__ == "__main__":
    write_logo()
    selected_doc, selected_page  = write_top_bar()
    st.markdown('---')
    header=f"Summary of page {selected_page} of {selected_doc}"
    col1, col2, col3 = st.columns([2,12,1])
    with col2:
        prompt = st.button(f"Click here to generate {header}",
                type="primary")
    if not prompt:
        st.text_area(label="summary",
                    value="Summary will be shown here",
                    label_visibility="hidden",
                    height=240)
    else:
        summary = generate_summary(selected_doc, selected_page)
    


Overwriting summarizer-workshop/summ.py
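
The page selector in the app uses a hard-coded list of page numbers. One alternative, sketched here under the assumption that page files follow the `<name>_<n>.txt` pattern produced in Step 1, is to derive the list from the files on disk. `available_pages` is a hypothetical helper and is not part of `summ.py`.

```python
import glob
import os
import re
import tempfile

def available_pages(doc_prefix, docs_dir='docs'):
    """Return page numbers (as strings, in numeric order) for <doc_prefix>_<n>.txt files."""
    pattern = re.compile(re.escape(doc_prefix) + r'_(\d+)\.txt$')
    pages = []
    for path in glob.glob(os.path.join(docs_dir, f'{doc_prefix}_*.txt')):
        match = pattern.match(os.path.basename(path))
        if match:
            pages.append(int(match.group(1)))
    return [str(p) for p in sorted(pages)]

# Usage example with empty placeholder files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    for name in ('2022-Shareholder-Letter_1.txt',
                 '2022-Shareholder-Letter_2.txt',
                 '2022-Shareholder-Letter_10.txt',
                 '2021-Shareholder-Letter_1.txt'):
        open(os.path.join(tmp, name), 'w').close()
    demo_pages = available_pages('2022-Shareholder-Letter', docs_dir=tmp)
    print(demo_pages)
```

The returned list could be passed directly as the options of the `Page Number` select box, so the choices always match the files that exist.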


### Last step : Run Streamlit app 

    1. Open a system terminal through the launcher
    2. Ensure the current working directory is summarizer-workshop
    3. Run `pip install streamlit "ai21[AWS]"`
    4. Run `sh setup.sh`
    5. Run `sh run.sh`
        Follow the URL for the Streamlit app

In [14]:
## undeploy the AI21 model
## may need to re-import the libraries and recreate the boto3/SageMaker sessions first
ai21_summary_model_handler(DEPLOYED=True)

clean up done!
