# Document Summarization

This notebook demonstrates long-document summarization techniques applied to a work of literature.

## Install Dependencies

Granite Kitchen comes with a bundle of dependencies that are required for notebooks. See the list of packages in its [`setup.py`](https://github.com/ibm-granite-community/granite-kitchen/blob/main/setup.py). 

In [None]:
! pip install git+https://github.com/ibm-granite-community/granite-kitchen \
    transformers \
    torch \
    tiktoken

## Select your model

Select a Granite Code model from the [`ibm-granite`](https://replicate.com/ibm-granite) org on Replicate. Here we use the Replicate LangChain client to connect to the model.

To get set up with Replicate, see [Getting Started with Replicate](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Getting_Started/Getting_Started_with_Replicate.ipynb).

To connect to a model on a provider other than Replicate, substitute this code cell with one from the [LLM component recipe](https://github.com/ibm-granite-community/granite-kitchen/blob/main/recipes/Components/Langchain_LLMs.ipynb).

In [None]:
from langchain_community.llms import Replicate
from ibm_granite_community.notebook_utils import get_env_var

model = Replicate(
    model="ibm-granite/granite-8b-code-instruct-128k",
    replicate_api_token=get_env_var('REPLICATE_API_TOKEN'),
)

## Download a book

Here we fetch H.D. Thoreau's "Walden" from [Project Gutenberg](https://www.gutenberg.org/) for summarization.

We have to trim it down so that it will fit in the 128k-token context window of the model.

In [None]:
import requests

# The following URL contains a text version of H.D. Thoreau's "Walden"
url = "https://www.gutenberg.org/cache/epub/205/pg205.txt"

# Get the contents
response = requests.get(url)
response.raise_for_status()
full_contents = response.text

# Extract the text of the book, leaving out the Project Gutenberg boilerplate.
start_str = "*** START OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
start_index = full_contents.index(start_str) + len(start_str)
end_str = "*** END OF THE PROJECT GUTENBERG EBOOK WALDEN, AND ON THE DUTY OF CIVIL DISOBEDIENCE ***"
end_index = full_contents.index(end_str)
book_contents = full_contents[start_index:end_index]
print("Length of book text: {} chars".format(len(book_contents)))

# We limit the text to 10k characters for this demonstration.
# For reference: 200k chars is about 57k tokens; 300k chars is ~86k tokens; 350k chars is ~100k tokens; 400k chars is ~114k tokens.
char_limit = 10000
contents = book_contents[:char_limit]
print("Length of text for summarization: {} chars".format(len(contents)))
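
The character and token figures quoted in the comment above imply roughly 3.5 characters per token for this text. As a rough sketch, a character limit for a given token budget could be estimated as shown below. The helper name `estimate_char_limit` and the fixed ratio are assumptions for illustration only; the real count should always be checked with the model's tokenizer.

```python
# Rough estimate only: the ratio comes from the figures quoted above
# (200k chars is about 57k tokens), not from running a tokenizer.
CHARS_PER_TOKEN = 200_000 / 57_000  # about 3.5


def estimate_char_limit(token_budget: int, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return an approximate character count that fits within token_budget tokens."""
    if token_budget < 0:
        raise ValueError("token_budget must be non-negative")
    return int(token_budget * chars_per_token)


# Example: keep 10k tokens of a 128k-token window free for instructions and output.
print(estimate_char_limit(128_000 - 10_000))
```

Because the ratio varies with the text and the tokenizer, an estimate like this is only a starting point for choosing `char_limit`.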

## Count the tokens

Before sending our text to the AI model, it's crucial to understand how much of the model's capacity we're using. Language models typically have a limit on the number of tokens they can process in a single request.

Key points:
- We're using the [`granite-8B-Code-instruct-128k`](https://huggingface.co/ibm-granite/granite-8b-code-instruct-128k) model, which has a context window of 128,000 tokens.
- Tokenization can vary between models, so we use the specific tokenizer for our chosen model.

Understanding token count helps us optimize our prompts and ensure we're using the model efficiently.

In [None]:
from transformers import AutoTokenizer

model_path = "ibm-granite/granite-8B-Code-instruct-128k"
tokenizer = AutoTokenizer.from_pretrained(model_path)
print("Your model uses the tokenizer " + type(tokenizer).__name__)

print(f"Your document has {len(tokenizer(contents, return_tensors='pt')['input_ids'][0])} tokens. ")
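
Once the token count is known, it can be compared against the context window. The sketch below assumes that the prompt and the generated output share the same 128,000-token window, which is typical for decoder-only models but is not confirmed here for this deployment. The function name `fits_in_context` is hypothetical.

```python
CONTEXT_WINDOW = 128_000  # context window of the model, as stated above


def fits_in_context(prompt_tokens: int, max_output_tokens: int,
                    context_window: int = CONTEXT_WINDOW) -> bool:
    """Return True if the prompt plus the requested output fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


# A 57k-token prompt leaves room for 10k output tokens; a 125k-token prompt does not.
print(fits_in_context(prompt_tokens=57_000, max_output_tokens=10_000))   # True
print(fits_in_context(prompt_tokens=125_000, max_output_tokens=10_000))  # False
```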

## Summarize the text

We construct our final prompt and send it to the AI model on Replicate for processing.

In [None]:
prompt = f"""
Summarize the following text from "Walden" by Henry David Thoreau:
{contents}
"""

output = model.invoke(
    prompt,
    model_kwargs={
        "max_tokens": 10000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 200, # Set the minimum number of tokens to generate as output.
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0
    }
    )

print(output)

## Summary of Summaries

Here we use an iterative summarization technique for texts that exceed the model's context length: summarize each passage separately, then summarize the summaries.

### Chunk the text

Divide the full text into smaller passages for separate processing.

In [None]:
from langchain.text_splitter import TokenTextSplitter
from langchain.docstore.document import Document

excerpt_length = 20000
doc = Document(page_content=book_contents[:excerpt_length], metadata={"source": "local"})
print(f"The text is {len(doc.page_content)} chars")

# Split the document into chunks; chunk_size and chunk_overlap are measured in tokens, not characters
chunk_token_limit = 1000
text_splitter = TokenTextSplitter.from_huggingface_tokenizer(tokenizer=tokenizer, chunk_size=chunk_token_limit, chunk_overlap=50)
chunks = text_splitter.split_documents([doc])
print("Chunk count: " + str(len(chunks)))
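
To show what chunking with overlap means, here is a minimal, self-contained sketch of a sliding window over a list of token IDs. It is an illustration under simple assumptions, not LangChain's implementation, and the function name `split_tokens` is hypothetical.

```python
from typing import List


def split_tokens(tokens: List[int], chunk_size: int, chunk_overlap: int) -> List[List[int]]:
    """Split tokens into windows of at most chunk_size tokens, where
    consecutive windows share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
        start += step
    return chunks


# Toy input: ten "tokens", windows of four, one token shared between neighbors.
for chunk in split_tokens(list(range(10)), chunk_size=4, chunk_overlap=1):
    print(chunk)
```

The overlap repeats a little text at each boundary so that a sentence cut by one chunk still appears whole in its neighbor.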

### Summarize the chunks

Here we create a separate summary of each passage. This can take a few minutes.

In [None]:
summaries = []

for i, chunk in enumerate(chunks):
    prompt = f"""
        Summarize the following text from "Walden" by Henry David Thoreau:
        {chunk.page_content}
        """
    output = model.invoke(
        prompt,
        model_kwargs={
            "max_tokens": 10000, # Set the maximum number of tokens to generate as output.
            "min_tokens": 200, # Set the minimum number of tokens to generate as output.
            "temperature": 0.75,
            "system_prompt": "You are a helpful assistant.",
            "presence_penalty": 0,
            "frequency_penalty": 0
        }
    )
    summary = f"Summary {i+1}:\n{output}\n\n"
    summaries.append(summary)
    print(summary)

print("Summary count: " + str(len(summaries)))


### Summarize the Summaries

We tell the model that it is receiving separate summaries of passages from an original text, and we ask it to create a unified summary of that text.

In [None]:
summary_contents = "\n\n".join(summaries)
print(len(summary_contents))

prompt = f"""
The text of "Walden", by Henry David Thoreau, was summarized in separate passages; those passage summaries are provided below. 

{summary_contents}

From these summaries, compose a single lengthy, unified summary of the original text.
"""

output = model.invoke(
    prompt,
    model_kwargs={
        "max_tokens": 100000, # Set the maximum number of tokens to generate as output.
        "min_tokens": 5000, # Set the minimum number of tokens to generate as output.
        "temperature": 0.75,
        "system_prompt": "You are a helpful assistant.",
        "presence_penalty": 0,
        "frequency_penalty": 0
    }
    )

print(output)
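
The overall flow of this section can be condensed into one small function. The sketch below takes the summarizer as a parameter so that it runs without a model; `summarize_hierarchically` and the toy `first_sentence` stand-in are hypothetical names for illustration, and in the notebook the callable would wrap the model call instead.

```python
from typing import Callable, List


def summarize_hierarchically(chunks: List[str], summarize: Callable[[str], str]) -> str:
    """Summarize each chunk, then summarize the joined chunk summaries."""
    if not chunks:
        return ""
    partial_summaries = [summarize(chunk) for chunk in chunks]
    if len(partial_summaries) == 1:
        return partial_summaries[0]  # nothing to merge
    combined = "\n\n".join(
        f"Summary {i + 1}:\n{s}" for i, s in enumerate(partial_summaries)
    )
    return summarize(combined)


def first_sentence(text: str) -> str:
    """Toy stand-in for a model call: keep only the first sentence."""
    return text.split(".")[0].strip() + "."


passages = [
    "The author builds a cabin. He lists what it cost.",
    "The author plants beans. He describes the work.",
]
print(summarize_hierarchically(passages, first_sentence))
```

If the joined summaries were themselves too long for the context window, the same step could be applied again to groups of summaries; that extension is not shown here.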