# Hands-on Session Text Summarization
This hands-on session accompanies the workshop on LangChain fundamentals. It is inspired by the more extensive LangChain Cookbook Part 1.

Copyright (c) 2023 Michael Neumayr

## Setup

### 0. Set up the Colab in your drive

- Load this Colab from GitHub
- Run the first cell to install all required packages (this takes a moment)
- While the packages install, jump to the section "3. OpenAI API key" and replace "PUT_YOUR_KEY_HERE" with the key we provide

### 1. Required Python packages

In [None]:
# install required packages; this may take a few minutes. Dependency warnings can be ignored; the notebook should work anyway.
%pip install openai
%pip install langchain
%pip install pypdf
%pip install tiktoken

### 2. Load the workshop GitHub repository

In [None]:
!git clone https://github.com/michaelnoi/venture_labs_build.git

In [None]:
!cd venture_labs_build && git checkout only_static_files

### 3. OpenAI API key

In [None]:
import os

openai_api_key = os.getenv('OPENAI_API_KEY', 'PUT_YOUR_KEY_HERE')
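
A quick sanity check can catch a forgotten key before the first API call fails. The following is a minimal sketch: the helper name `key_looks_set` is hypothetical, and it only compares against the placeholder. It does not verify the key with OpenAI.

```python
import os

PLACEHOLDER = "PUT_YOUR_KEY_HERE"

def key_looks_set(key: str) -> bool:
    """Hypothetical helper: True if the key is non-empty and not the placeholder."""
    key = key.strip()
    return bool(key) and key != PLACEHOLDER

# Same lookup as in the cell above: environment variable first, placeholder as fallback.
candidate_key = os.getenv("OPENAI_API_KEY", PLACEHOLDER)
if not key_looks_set(candidate_key):
    print("No API key set yet: replace PUT_YOUR_KEY_HERE or set OPENAI_API_KEY.")
```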

### 4. Optional: Connect to your Google Drive storage to upload your own documents later

In [None]:
# connect to your Google Drive storage to use your own documents
from google.colab import drive

drive.mount('/content/drive')

## Project: Text summarization

In [None]:
from langchain.llms import OpenAI

In [None]:
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key)

### 1. Summarize a short PDF document in one request ("stuff" method)

<div class="alert alert-info">
  <b>Context Size</b>
  <p>Every model has a context size: the maximum number of tokens it can process in one request, which covers the prompt and the generated answer together. Typical values range from about 2k tokens (GPT-3) to 32k tokens (GPT-4), and models with larger context sizes cost more per 1,000 tokens. Our GPT-3.5 Turbo model has a context size of 4k tokens, so our short prompt plus the document without the reference pages should fit in a single request. Let's try!</p>
</div>
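
The context window has to hold both the prompt and the generated answer, so it helps to budget tokens before sending a request. The sketch below is illustrative only: the function names are hypothetical, `4096` stands in for the "4k" context size, the 256 tokens reserved for the answer are an assumed value, and the 4-characters-per-token figure is a rule of thumb for English text, not an exact count.

```python
def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate from character count (rule of thumb, not exact)."""
    return -(-len(text) // chars_per_token)  # ceiling division

def fits_in_context(prompt_tokens: int,
                    context_size: int = 4096,
                    reserved_for_answer: int = 256) -> bool:
    """Check whether the prompt leaves enough room for the answer (assumed budget)."""
    return prompt_tokens + reserved_for_answer <= context_size

sample_prompt = "word " * 3000  # 15000 characters of dummy text
n_tokens = estimate_tokens(sample_prompt)
print(n_tokens, fits_in_context(n_tokens))
```

For real counts, use `llm.get_num_tokens` as in the cells of this notebook; the estimate is only useful for quick planning.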

In [None]:
from langchain.document_loaders import PyPDFLoader

# load short business model canvas pdf again
pdf_path = "static/business_Model_Canvas.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

We load the Wikipedia article on the Business Model Canvas again as a PDF document. Let's first check how many tokens the document has.

In [None]:
# get number of tokens in each page
overall_tokens = 0
for page in documents:
    n_tokens = llm.get_num_tokens(page.page_content)
    print(f"Number of tokens in page {page.metadata['page']}: {n_tokens}")
    overall_tokens += n_tokens
print(f"Overall number of tokens: {overall_tokens}")

Drop the last two pages (references) and add the rest of the article to a prompt for summarization.

<div class="alert alert-warning">
  <p>Try changing the prompt to get a summary for 5-year-olds, and then a summary in the style you would personally prefer.</p>
</div>

In [None]:
# the trailing blank line separates the instruction from the document text
prompt = "Provide a short summary of the following document. Your summary should be 3-5 sentences long.\n\n"

# add all pages except the last two to the prompt
for page in documents[:-2]:
    prompt += page.page_content + "\n\n"

print(llm.get_num_tokens(prompt))

In [None]:
# get a summary of the document with our summarization prompt
answer = llm(prompt)
print(answer)
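
To try different summary styles without rewriting the loop each time, the prompt construction can be wrapped in a small function. This is a minimal sketch: `build_summary_prompt` and the sample page strings are hypothetical and stand in for the `page_content` of the loaded pages.

```python
from typing import List

def build_summary_prompt(instruction: str, pages: List[str], skip_last: int = 2) -> str:
    """Join an instruction and the page texts, dropping the last `skip_last` pages."""
    body_pages = pages[:-skip_last] if skip_last > 0 else pages
    return instruction + "\n\n" + "\n\n".join(body_pages)

# Dummy page texts; in the notebook these would be [p.page_content for p in documents].
pages = ["Page one text.", "Page two text.", "References A", "References B"]

eli5_prompt = build_summary_prompt(
    "Summarize the following document for a 5-year-old in 3 sentences.", pages
)
print(eli5_prompt)
```

The resulting string can be passed to the model in the same way as `prompt` above.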

### 2. Summarize a long PDF document in chunks ("map_reduce" method)

<div class="alert alert-info">
  <b>Handling large documents</b>
  <p>The Wikipedia article on natural language processing exceeds the token limit of our model (even when dropping the references), so we need another way to handle this document. An efficient approach is to split the document into chunks (smaller parts) and summarize these chunks independently, which can run in parallel (map step). Afterwards, we summarize all the generated summaries again (reduce step) to get a final summary. This method is called map-reduce.</p>
  <p>The map and reduce concept is not exclusive to summarization and can also be applied to other tasks like translation, question answering, etc.</p>
</div>

<img src="static/map_reduce.png" width="700"/>
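
The map and reduce steps can be sketched in plain Python. This is not LangChain's implementation, only an illustration of the idea: the function `map_reduce_summarize` is hypothetical, the `summarize` argument stands in for an LLM call, and the sketch performs a single reduce step without checking whether the combined summaries fit into the context window.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def map_reduce_summarize(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk independently, then summarize the joined summaries."""
    # Map step: one summary per chunk, computed in parallel threads.
    with ThreadPoolExecutor() as pool:
        partial_summaries = list(pool.map(summarize, chunks))
    # Reduce step: one final summary over all partial summaries.
    return summarize("\n\n".join(partial_summaries))

def first_five_words(text: str) -> str:
    """Stand-in for an LLM call: keep only the first five words."""
    return " ".join(text.split()[:5])

demo_chunks = [
    "Tokens are small text units that models read and write.",
    "Chunks are smaller parts of a long document.",
]
print(map_reduce_summarize(demo_chunks, first_five_words))
```

Replacing `first_five_words` with a function that sends a summarization prompt to the model gives the same flow the predefined chain automates.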

In [None]:
# import predefined chain for summarization and text splitter
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# load the longer document on NLP and get the number of tokens
from langchain.document_loaders import PyPDFLoader

pdf_path = "static/natural_language_processing.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

overall_tokens = 0
for page in documents[:-4]:
    n_tokens = llm.get_num_tokens(page.page_content)
    overall_tokens += n_tokens
    
print(f"Overall number of tokens: {overall_tokens}")

We load a predefined summarization chain, and we also need a text splitter to split our document into chunks. Let's first split the text into chunks of roughly 2000 tokens with some overlap. The splitter counts characters, and a token is roughly 4 characters, so we set the chunk size to 8000 characters.

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=8000, chunk_overlap=500)

# put relevant pages into one string
article = ""
for page in documents[:-4]:
    article += page.page_content + "\n\n"

# split into chunks with the defined text splitter
chunks = text_splitter.create_documents([article])

print(f"Number of chunks: {len(chunks)}")
print("Number of tokens in each chunk:")
for chunk in chunks:
    print(llm.get_num_tokens(chunk.page_content))
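
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of fixed-size splitting with overlap. It is not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to break at separators such as paragraph breaks and therefore often produces chunks shorter than `chunk_size`; the function `split_with_overlap` is hypothetical.

```python
from typing import List

def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks

demo = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1)
print(demo)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so content near a boundary appears in both neighboring chunks.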

In [None]:
chain = load_summarize_chain(llm=llm, chain_type="map_reduce") #, verbose=True)

Now, with our chunks and the map_reduce summarization chain, we can generate the full summary by passing the list of chunks to the chain. Setting verbose to True prints the prompts that are sent to the model.

In [None]:
output = chain.invoke(chunks)

In [None]:
print(output["output_text"])

## More resources

- Documentation: https://python.langchain.com/docs/get_started/introduction
- Really comprehensive tutorials: https://github.com/gkamradt/langchain-tutorials