# How to summarize long text inputs

This notebook walks through summarizing long text inputs with the ChatGPT API and the `tiktoken` package. A long input can exceed what a model is able to process in one request, so the text is split into token-sized chunks, each chunk is summarized separately, and the chunk summaries are joined into a single summary.

## What is tiktoken?

[tiktoken](https://github.com/openai/tiktoken) is a fast BPE tokeniser for use with OpenAI's models.

Given a text string (e.g., "tiktoken is great!") and an encoding (e.g., "cl100k_base"), a tokenizer can split the text string into a list of tokens (e.g., ["t", "ik", "token", " is", " great", "!"]).

Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).
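
As a rough illustration of points (a) and (b), the sketch below checks a token count against a context limit and estimates the input cost. The function name `check_budget`, the 4,096-token limit, and the price of 0.002 USD per 1,000 tokens are placeholder assumptions for illustration, not values taken from this notebook or from current OpenAI pricing.

```python
def check_budget(n_tokens: int, context_limit: int, usd_per_1k_tokens: float) -> dict:
    """Report whether a text fits a context limit and what its tokens would cost.

    All three inputs are supplied by the caller; nothing here queries OpenAI.
    """
    return {
        "fits": n_tokens <= context_limit,
        "estimated_cost_usd": round(n_tokens / 1000 * usd_per_1k_tokens, 6),
    }


# Hypothetical numbers: a 6,000-token text, a 4,096-token limit, 0.002 USD per 1K tokens.
report = check_budget(n_tokens=6000, context_limit=4096, usd_per_1k_tokens=0.002)
print(report)
```

In practice `n_tokens` would come from `len(encoding.encode(text))` for a `tiktoken` encoding.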

### Encodings

Encodings specify how text is converted into tokens. Different models use different encodings.

`tiktoken` supports three encodings used by OpenAI models:

| Encoding name           | OpenAI models                                       |
|-------------------------|-----------------------------------------------------|
| `cl100k_base`           | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002`  |
| `p50k_base`             | Codex models, `text-davinci-002`, `text-davinci-003`|
| `r50k_base` (or `gpt2`) | GPT-3 models like `davinci`                         |
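
The table can be mirrored as a small lookup, sketched below with a hypothetical helper named `encoding_name_for`. It only covers model names listed in the table. `tiktoken` itself provides `tiktoken.encoding_for_model(model_name)` for the same purpose, which is the better choice in real code.

```python
# Model-to-encoding pairs copied from the table above (not an exhaustive list).
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
    "text-embedding-ada-002": "cl100k_base",
    "text-davinci-002": "p50k_base",
    "text-davinci-003": "p50k_base",
    "davinci": "r50k_base",
}


def encoding_name_for(model: str) -> str:
    """Return the encoding name for a model listed in the table, or raise KeyError."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise KeyError(f"No encoding recorded for model {model!r}") from None


print(encoding_name_for("gpt-3.5-turbo"))  # cl100k_base
```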

## Notebook Sections

- **Setup:** Install and import the required packages.
- **Encode texts:** Encode the input text into tokens so that it can be processed by the ChatGPT API.
- **Create chunks:** Split the long input text into smaller segments with `tiktoken` so that each segment is short enough to summarize in one request.
- **Define a prompt:** Write a clear prompt that guides the summarization.
- **Summarize the chunks:** Use an OpenAI chat model to summarize each chunk, then join the chunk summaries into a summary of the full text.

# Setup

If needed, install the packages below with pip. The libraries are imported in later cells.


In [None]:
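%pip install --upgrade tiktoken
# The summarization cell uses the legacy openai.ChatCompletion interface, which was removed in openai 1.0
%pip install --upgrade "openai<1"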

# Input data

The content of an e-book has been extracted and saved as a TXT file in the `data` folder: `gutenberg.org_cache_epub_13897_pg13897.txt`.

More free e-books can be downloaded from [Project Gutenberg](https://www.gutenberg.org/ebooks).


# Loading The Data


In [47]:
import os
import re


def load_document(file_name):
    """
    Load the content of a document from a specified file name within the "data" folder located in the current working directory.
    
    Parameters:
        file_name (str): The name of the file to be loaded, including the file extension.
        
    Returns:
        list: A list of strings, each representing a line from the loaded document.
        
    Raises:
        FileNotFoundError: If the specified file does not exist in the "data" folder.
        UnicodeDecodeError: If there is an issue with decoding the file using UTF-8 encoding.
        
    Example:
        Suppose we have a file named "example.txt" in the "data" folder. To load its content into a list, we would call the function like this:
        
        document_content = load_document("example.txt")
    """
    root_path = os.getcwd()
    with open(f"{root_path}/data/{file_name}", encoding="utf-8") as f:
        document = f.readlines()
        return document



def preprocess_text(text):
    """
    Preprocesses the input text by applying the following steps:
    1. Lowercasing the entire text.
    2. Removing special characters and punctuation, retaining only alphanumeric characters, spaces, and newline characters.
    3. Removing excessive whitespace, except for a single newline character.

    Parameters:
        text (str): The input text to be preprocessed.

    Returns:
        str: The preprocessed text after applying the specified cleaning steps.

    Example:
        Suppose we have the following text:
        text = "Hello! How are you?\nI am doing well!  "

        Calling preprocess_text(text) will return (the trailing whitespace
        is collapsed to a single space, not removed):
        "hello how are you\ni am doing well "
    """
    # Lowercasing
    text = text.lower()

    # Removing special characters and punctuation (all whitespace, including newlines, is kept)
    text = re.sub(r"[^a-z0-9\s]", "", text)

    # Collapsing runs of spaces and tabs into one space; newline characters are left untouched
    text = re.sub(r"[^\S\n]+", " ", text)

    return text


In [48]:
# load document
document = load_document("gutenberg.org_cache_epub_13897_pg13897.txt")


# pre-process document
processed_data = []
for doc in document:
    processed_data.append(preprocess_text(doc))

In [62]:
processed_data = [string.strip() for string in processed_data if string.strip()]

In [63]:
processed_data

['the project gutenberg ebook the adventure club afloat by ralph henry',
 'barbour illustrated by e c caswell',
 'this ebook is for the use of anyone anywhere at no cost and with',
 'almost no restrictions whatsoever you may copy it give it away or',
 'reuse it under the terms of the project gutenberg license included',
 'with this ebook or online at wwwgutenbergorg',
 'title the adventure club afloat',
 'author ralph henry barbour',
 'release date october 30 2004 ebook 13897',
 'language english',
 'character set encoding iso646us usascii',
 'start of the project gutenberg ebook the adventure club afloat',
 'etext prepared by juliet sutherland kathryn lybarger and the project',
 'gutenberg online distributed proofreading team',
 'the adventure club afloat',
 'by',
 'ralph henry barbour',
 'author of left end edwards left tackle thayer etc',
 'with illustrations by e c caswell',
 '1917',
 'illustration the two cruisers were chugchugging out of the harbour',
 'to',
 'hp holt',
 'whose t

# Encode Texts and Create Chunks

In [50]:
import tiktoken

def tiktoken_encoding(encoding_model):
    """
    Get the tiktoken encoding object for a given encoding name.

    :param encoding_model: The name of the encoding to load (e.g., "cl100k_base").
    :type encoding_model: str

    :return: The encoding object, which provides encode() and decode() methods.
    :rtype: tiktoken.Encoding
    """

    return tiktoken.get_encoding(encoding_model)


In [51]:
# Define the encoding
# See https://github.com/openai/tiktoken/blob/main/tiktoken/model.py for the encoding used by each model

tt_encoding = tiktoken_encoding("cl100k_base")

In [52]:
def create_gpt_chunks(data: list, chunk_size: int = 512, chunk_overlap: int = 0):
    """
    Encode a list of texts using the tokenizer (tt_encoding) and divide them into chunks based on the defined chunk_size and chunk_overlap.

    :param data: List of texts to be encoded and divided into chunks.
    :type data: list[str]
    :param chunk_size: Maximum size of each chunk in tokens.
    :type chunk_size: int, optional
    :param chunk_overlap: Number of tokens to overlap between adjacent chunks (0 for no overlap).
    :type chunk_overlap: int, optional

    :return: List of text chunks, where each chunk is a string.
    :rtype: list[str]
    :raises ValueError: If chunk_overlap is negative or not smaller than chunk_size.
    """
    # A step of zero or less would make range() fail or return nothing
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    # Join the lines into one string so that chunks can span line boundaries
    texts = '\n'.join(data)
    tokens = tt_encoding.encode(texts)
    total_tokens = len(tokens)

    chunks = []
    # Each chunk starts (chunk_size - chunk_overlap) tokens after the previous one
    for i in range(0, total_tokens, chunk_size - chunk_overlap):
        chunk = tokens[i:i + chunk_size]
        chunks.append(tt_encoding.decode(chunk))
    return chunks


In [53]:
chunks = create_gpt_chunks(data=processed_data, chunk_size=512, chunk_overlap=0)
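
To see what `chunk_size` and `chunk_overlap` do without running the tokenizer, the sketch below applies the same slicing to a plain list of integers standing in for token IDs. The helper name `chunk_token_ids` is illustrative and not part of the notebook.

```python
def chunk_token_ids(tokens: list, chunk_size: int, chunk_overlap: int = 0) -> list:
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between the starts of adjacent chunks
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]


toy_tokens = list(range(10))  # stand-in for real token IDs
print(chunk_token_ids(toy_tokens, chunk_size=4, chunk_overlap=0))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(chunk_token_ids(toy_tokens, chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9], [9]]
```

With an overlap, the final chunk can consist only of tokens already present in the previous chunk, as `[9]` does here.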

# Summarization

### Define prompt

In [54]:
# Modeled on the LlamaIndex summary prompt
SUMMARY_PROMPT = (
    "Write a summary of the following. Try to use only the "
    "information provided. "
    "Try to include as many key details as possible.\n"
    "\n"
    "\n"
    "{context_str}\n"
    "\n"
    "\n"
    'SUMMARY:"""\n'
)

### Set the OpenAI API key

In [55]:
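import os

import openai

# Read the key from the OPENAI_API_KEY environment variable instead of hard-coding it in the notebook
openai.api_key = os.getenv("OPENAI_API_KEY", "")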

### Set configs

In [56]:
config = {
    "model_name": "gpt-3.5-turbo-16k",
    "max_summary_token": 256,
    "summary_temperature": 0.7,
    "top_p": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0
}
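
Each request has to fit the prompt (template plus chunk) and the generated summary inside the model's context window. The sketch below works through that arithmetic. The 16,385-token window for `gpt-3.5-turbo-16k` and the 40-token allowance for the prompt template and message formatting are assumptions; check them against the model documentation and a real token count before relying on them.

```python
def largest_chunk_size(context_window: int, template_tokens: int, max_summary_tokens: int) -> int:
    """Return the largest chunk (in tokens) that still leaves room for the summary."""
    remaining = context_window - template_tokens - max_summary_tokens
    if remaining <= 0:
        raise ValueError("No room left for the input chunk")
    return remaining


# Assumed values: 16,385-token context window, about 40 tokens of prompt template.
limit = largest_chunk_size(context_window=16385, template_tokens=40, max_summary_tokens=256)
print(limit)  # 16089
```

Under these assumptions the 512-token chunks used in this notebook sit far below the limit, so larger chunks would also fit and would need fewer API calls.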

In [57]:
def summarization(text_chunks: list):
    """
    Summarize a list of text chunks using the GPT-3.5 Chat model from OpenAI.

    :param text_chunks: List of text chunks to be summarized.
    :type text_chunks: list[str]
    
    :return: The summarized text as a single string.
    :rtype: str
    """
    final_responses = []
    for chunk in text_chunks:
        # Fill the prompt template with the current chunk
        prompt = SUMMARY_PROMPT.format(context_str=chunk)
        # The instructions and the chunk go in the system message; the user message is left empty
        messages = [
            {"role": "system", "content": prompt},
            {"role": "user", "content": ""},
        ]
        response = openai.ChatCompletion.create(
            model=config["model_name"],
            messages=messages,
            max_tokens=config["max_summary_token"],
            temperature=config["summary_temperature"],
            top_p=config["top_p"],
            presence_penalty=config["presence_penalty"],
            frequency_penalty=config["frequency_penalty"],
        )
        text = response["choices"][0]["message"]["content"].strip()
        final_responses.append(text)
    summarized_text = ' '.join(final_responses)
    return summarized_text

In [58]:
# get summary result
summary = summarization(chunks)
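
The loop in `summarization` can be tried without calling the API by passing the per-chunk step in as a function. The sketch below assumes a hypothetical `summarize_chunks` helper, a shortened `PROMPT_TEMPLATE`, and a stand-in `fake_summarize` that returns the first sentence of its input. None of these are part of the notebook, and the stand-in does not produce real summaries.

```python
from typing import Callable, List

# Shortened template for illustration; the notebook's SUMMARY_PROMPT is longer.
PROMPT_TEMPLATE = "Write a summary of the following.\n\n{context_str}\n\nSUMMARY:"


def summarize_chunks(chunks: List[str], summarize_one: Callable[[str], str]) -> str:
    """Format a prompt for each chunk, summarize it, and join the results with spaces."""
    summaries = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(context_str=chunk)
        summaries.append(summarize_one(prompt).strip())
    return " ".join(summaries)


def fake_summarize(prompt: str) -> str:
    """Stand-in for the API call: return the first sentence of the chunk in the prompt."""
    chunk = prompt.split("\n\n")[1]
    return chunk.split(".")[0] + "."


result = summarize_chunks(["First part. More detail.", "Second part. Even more."], fake_summarize)
print(result)  # First part. Second part.
```

Replacing `fake_summarize` with a function that sends the prompt to the chat model and returns the reply text would reproduce the structure of the notebook's loop.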

In [60]:
print(summary)

Later, the adventurers turn back into home waters and encounter some clouds. Joe suggests running for shelter, but the others dismiss the suggestion because there is no wind and the barometer says fair. The barometer is considered a joke on the boat as nobody can read it properly. Steve has given up on it, but Joe still believes in its weather forecasting abilities. The speaker is on a boat and they are discussing the weather. They mention that it might snow before night and they don't understand why they need to go into the harbor if there is no sign of fog. They mention that if it's just rain, they have been wet before and they should continue sailing. The speaker asks someone to find the next chart. They notice the seal islands off to the port and the boat they are following is behind them. Suddenly, a gust of wind comes from the north and sets the canvas and flags rattling. The sky darkens and spray starts coming over the rail. They lose sight of the boat they were following. The s