# Document Summarization

This notebook demonstrates an application of long document summarization techniques to a work of literature.

## Install Dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [1]:
# ! pip install git+https://github.com/ibm-granite-community/utils \
#     langchain \
#     langchain-community \
#     transformers \
#     langchain-huggingface \
#     replicate torch \
#     tiktoken \
#     langchain-experimental


Collecting git+https://github.com/ibm-granite-community/utils
  Cloning https://github.com/ibm-granite-community/utils to /private/var/folders/bb/yhn_xpt54b10d138nmjsk8x40000gn/T/pip-req-build-jc12wo32
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite-community/utils /private/var/folders/bb/yhn_xpt54b10d138nmjsk8x40000gn/T/pip-req-build-jc12wo32
  Resolved https://github.com/ibm-granite-community/utils to commit a5965f40db3950dd2a41f3ca62a2c34adcdc20d7
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


## Select your model

Select a Granite model from the [`ibm-granite`](https://replicate.com/ibm-granite) org on Replicate. Here we use the LangChain client for Replicate to connect to the model.

To get set up with Replicate, see [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb).

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [2]:
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var


model_path = "ibm-granite/granite-3.1-8b-instruct"
model = Replicate(
    model=model_path,
    replicate_api_token=get_env_var('REPLICATE_API_TOKEN'),
)

## Download a book

Here we fetch H.D. Thoreau's "Walden" from [Project Gutenberg](https://www.gutenberg.org/) for summarization.

The full text is too long to fit in the model's 128K-token context window, so it has to be split into smaller pieces before summarization.

In [3]:
import requests

# The following URL contains a text version of H.D. Thoreau's "Walden"
url = "https://www.gutenberg.org/cache/epub/205/pg205.txt"

# Get the contents
response = requests.get(url)
response.raise_for_status()
full_contents = response.text

# Extract the text of the book, leaving out the Project Gutenberg boilerplate.
start_str = "*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
start_index = full_contents.index(start_str) + len(start_str)
end_str = "*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
end_index = full_contents.index(end_str)
book_contents = full_contents[start_index:end_index]
print(f"Length of book text: {len(book_contents)} chars")

Length of book text: 644843 chars


## Count the tokens

Before sending the text to the model, it helps to know how much of the model's capacity it will use. Language models have a limit on the number of tokens they can process in a single request.

Key points:
- We're using the [`granite-3.1-8b-instruct`](https://huggingface.co/ibm-granite/granite-3.1-8b-instruct) model, which has a context window of 128K tokens.
- Tokenization can vary between models, so we use the specific tokenizer for our chosen model.

Understanding token count helps us optimize our prompts and ensure we're using the model efficiently.

In [4]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
print("Your model uses the tokenizer " + type(tokenizer).__name__)

print(f"Your document has {len(tokenizer.tokenize(book_contents))} tokens. ")

Your model uses the tokenizer GPT2TokenizerFast
Your document has 184361 tokens. 
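
The two figures above (the document's token count and the model's context window) can be compared with a little arithmetic. The sketch below is illustrative only: it assumes "128K" means 128 × 1024 tokens, and `reserved_tokens` is a hypothetical allowance for the prompt template and the generated summary, not a value taken from the notebook.

```python
import math

# Token count reported by the tokenizer cell above.
document_tokens = 184_361
# Assumption: "128K" is read as 128 * 1024 tokens.
context_window = 128 * 1024
# Hypothetical reserve for the prompt template and the model's output.
reserved_tokens = 4_096

usable_tokens = context_window - reserved_tokens
fits = document_tokens <= usable_tokens
# Lower bound on the number of pieces needed if each piece fills the window.
min_chunks = math.ceil(document_tokens / usable_tokens)

print(f"Usable tokens per request: {usable_tokens}")
print(f"Document fits in one request: {fits}")
print(f"Minimum number of chunks: {min_chunks}")
```

Under these assumptions the book does not fit in a single request, which is why the following sections split it into chunks before summarizing.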


### Pick an embedding model

In [5]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings_model = HuggingFaceEmbeddings(model_name="ibm-granite/granite-embedding-125m-english")

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### Chunk the text meaningfully using SemanticChunker

In [6]:
from langchain_experimental.text_splitter import SemanticChunker

text_splitter = SemanticChunker(embeddings=embeddings_model)
docs = text_splitter.create_documents(texts=[book_contents])
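
To give a feel for what "chunking meaningfully" means, here is a toy sketch of the general idea behind semantic chunking: embed adjacent sentences and start a new chunk where they differ sharply. It is not `SemanticChunker`'s actual implementation. The bag-of-words `embed` function, the fixed `threshold`, and the sample text are all assumptions made for illustration; the notebook uses a real embedding model instead.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return 1.0 - (dot / norm if norm else 0.0)


def semantic_chunks(text: str, threshold: float = 0.5) -> list[str]:
    """Split into sentences, then break where adjacent sentences diverge."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, curr, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, curr) > threshold:
            chunks.append([sentence])  # large jump: start a new chunk
        else:
            chunks[-1].append(sentence)  # similar: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


# Hypothetical sample text with one clear topic shift.
sample = (
    "The pond was calm in the morning. The pond froze in the winter. "
    "My expenses for the year were small. My expenses for food were smaller."
)
for chunk in semantic_chunks(sample):
    print(chunk)
```

With this sample, the two sentences about the pond stay together and the two about expenses form a second chunk, because the only large distance is at the topic shift.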

## Summarize the text

### Create the Final Summary

Generate a single summary from the chunked documents using LangChain's map-reduce summarization chain.
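
The map-reduce pattern itself is simple: summarize each chunk independently (map), then summarize the combined partial summaries (reduce). The sketch below shows only that control flow. `map_reduce_summarize` and `first_sentence` are hypothetical names, and `first_sentence` is a trivial stand-in for a model call; the LangChain chain uses its own prompts and may also collapse intermediate summaries in several passes.

```python
from typing import Callable


def map_reduce_summarize(
    chunks: list[str],
    summarize: Callable[[str], str],
) -> str:
    """Summarize each chunk (map), then summarize the joined results (reduce)."""
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    combined = " ".join(partial_summaries)
    return summarize(combined)  # reduce step


def first_sentence(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    return text.split(". ")[0].rstrip(".") + "."


# Hypothetical chunks; in the notebook these would be the semantic chunks.
chunks = [
    "Thoreau builds a cabin. He records every expense.",
    "He grows beans. He sells part of the crop.",
]
print(map_reduce_summarize(chunks, first_sentence))
```

Because the map step treats each chunk independently, those calls can run concurrently, while the reduce step has to wait for all of them to finish.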

In [7]:
from langchain.chains.summarize import load_summarize_chain
from langchain_core.runnables.config import RunnableConfig

chain = load_summarize_chain(llm=model, chain_type="map_reduce")
summary = chain.invoke(input={"input_documents": docs}, config=RunnableConfig(max_concurrency=5))['output_text']
summary

Token indices sequence length is longer than the specified maximum sequence length for this model (25263 > 1024). Running this sequence through the model will result in indexing errors


'Henry David Thoreau\'s "Walden" is a philosophical exploration of simple living and self-reliance, chronicling his two-year stay in a cabin near Walden Pond. Thoreau advocates for a life less burdened by material possessions and more attuned to one\'s purpose, critiquing societal norms, materialism, and the dehumanizing effects of economic striving. He emphasizes the value of individualism, nature, and firsthand experience, and critiques traditional college education for prioritizing convenience over practical skills. Thoreau reflects on the futility of rapid technological advancements, valuing slower, more meaningful pursuits. He discusses his frugal lifestyle, advocating for minimal possessions and self-sufficiency in food production. The text also explores themes of simplicity, natural living, and self-improvement, criticizing corrupted religious manners and advocating for a more positive perspective on God. Thoreau\'s "Civil Disobedience" critiques passive citizens who oppose inju