# Azure OpenAI - Progressive Document Summarization

OpenAI's GPT-3 model affords state-of-the-art language processing capabilities to generate concise and informative summaries of longer documents. For enterprises, the Azure OpenAI service makes this cutting-edge ML technology available while maintaining a strong security posture. Summaries of long-form text can provide meaningful insights across many domains; however, the amount of text that can be summarized in a single call to OpenAI's models is currently limited.

The notebook below demonstrates a "progressive document summarization" approach in which multiple documents are recursively summarized to generate a final summary report. This is showcased by generating a summary of Jules Verne's 20,000 Leagues Under the Sea, which is in the public domain and [accessible online via Project Gutenberg](https://www.gutenberg.org/files/164/164-h/164-h.htm). Each chapter is retrieved as a standalone document and summarized before a final "book summary" is generated. Recursive calls keep every request to the Azure OpenAI (AOAI) API within its token limit.
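
The recursion described above can be sketched without calling any API. The following is a minimal illustration under stated assumptions, not the notebook's implementation: `count_tokens` approximates tokens by counting whitespace-separated words, `fake_summarize` stands in for the model by keeping only the first sentence, and the 50-token budget is deliberately tiny. All names in it are hypothetical.

```python
import re

MAX_TOKENS = 50  # hypothetical per-call budget, far smaller than a real limit

def count_tokens(text):
    # Stand-in estimate: whitespace-separated words approximate tokens.
    return len(text.split())

def split_sentences(text):
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    return re.split(r'(?<=[.!?])\s+', text.strip())

def fake_summarize(text):
    # Stand-in for the model call: keep only the first sentence.
    return split_sentences(text)[0]

def split_to_budget(text, budget=MAX_TOKENS):
    # Halve the text on sentence boundaries until every chunk fits the budget.
    if count_tokens(text) <= budget:
        return [text]
    sentences = split_sentences(text)
    if len(sentences) < 2:
        return [text]  # a single sentence cannot be split further
    mid = len(sentences) // 2
    return (split_to_budget(' '.join(sentences[:mid]), budget)
            + split_to_budget(' '.join(sentences[mid:]), budget))

def progressive_summary(documents, budget=MAX_TOKENS):
    # Summarize every chunk of every document, then recurse on the merged
    # summaries until only one summary remains.
    summaries = [fake_summarize(chunk)
                 for doc in documents
                 for chunk in split_to_budget(doc, budget)]
    if len(summaries) == 1:
        return summaries[0]
    merged = ' '.join(summaries)
    if count_tokens(merged) >= sum(count_tokens(d) for d in documents):
        return merged  # no shrinkage this round; stop instead of looping forever
    return progressive_summary([merged], budget)

chapters = [' '.join(f"Doc {d} sentence {i} here." for i in range(20))
            for d in range(2)]
print(progressive_summary(chapters))
```

The shrinkage guard is a design choice of this sketch: the recursion only terminates if each round produces less text than it consumed, which a real model is expected, but not guaranteed, to do.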

### Import required packages (see conda.yml environment definition)

In [1]:
import requests
from bs4 import BeautifulSoup
import os
from dotenv import load_dotenv
import tiktoken

# Load environment variables from the 'env' file in the working directory
dotenv_path = os.path.join(os.getcwd(), 'env')
load_dotenv(dotenv_path)

True

### Document summarization helper functions

In [2]:
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

def prompt_openai(doc):
    import openai
    openai.api_type = "azure"
    openai.api_base = os.getenv('OPENAI_BASE')
    openai.api_version = "2022-12-01"
    openai.api_key = os.getenv("OPENAI_KEY")

    prompt = f"""
    Create a summary of the text below.

    '{doc}'
    """

    response = openai.Completion.create(
        engine=os.getenv('OPENAI_DEPLOYMENT'),
        prompt=prompt,
        temperature=0.5,
        max_tokens=1096,
        top_p=0.5,
        frequency_penalty=0,
        presence_penalty=0,
        stop=None
    )

    return response['choices'][0]['text'].replace('\n', '').replace(' .', '.').strip()

def estimate_token_count(text):
    encoding = tiktoken.get_encoding('gpt2')
    num_tokens = len(encoding.encode(text))
    return num_tokens

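def split_document(text, out_documents):
    # Recursively halve the text at a sentence boundary until every chunk
    # is at most 2500 tokens; chunks are appended to out_documents in order.
    tc = estimate_token_count(text)
    sentences = sent_tokenize(text) if tc > 2500 else []
    if len(sentences) > 1:
        midpoint = len(sentences) // 2
        chunk_1 = ' '.join(sentences[:midpoint])
        chunk_2 = ' '.join(sentences[midpoint:])
        split_document(chunk_1, out_documents)
        split_document(chunk_2, out_documents)
    else:
        # Small enough, or a single sentence that cannot be split further
        # (splitting it again would recurse forever).
        out_documents.append(text)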

def summarize_all_documents(documents):
    summarized_docs = []
    for index, doc in enumerate(documents):
        print(str(index) + '/' + str(len(documents)))
        split_documents = []
        split_document(doc, split_documents)
        for sdoc in split_documents:
            summarized_docs.append(str(prompt_openai(sdoc)))
    return summarized_docs

def create_single_summary(documents):
    summaries = summarize_all_documents(documents)
    if len(summaries)==1:
        return summaries[0]
    else:
        return create_single_summary([' '.join(summaries)])


[nltk_data] Downloading package punkt to /home/azureuser/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
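
The 2500-token threshold in `split_document` and `max_tokens=1096` in `prompt_openai` together bound the size of each request, because the prompt and the completion share the model's context window. A rough budget check might look like the sketch below. The 4097-token context window and the 20-token allowance for the instruction text are assumptions made for illustration, not values taken from this notebook; substitute the figures for the deployed model.

```python
CHUNK_TOKEN_LIMIT = 2500       # threshold used in split_document
COMPLETION_MAX_TOKENS = 1096   # max_tokens passed in prompt_openai
ASSUMED_CONTEXT_WINDOW = 4097  # assumption: depends on the deployed model
ASSUMED_PROMPT_OVERHEAD = 20   # assumption: tokens used by the instruction text

def remaining_headroom(chunk_tokens, completion_tokens, context_window,
                       prompt_overhead=0):
    # Tokens left over once the chunk, the instruction text, and the
    # reserved completion are all counted against the context window.
    used = chunk_tokens + prompt_overhead + completion_tokens
    return context_window - used

headroom = remaining_headroom(CHUNK_TOKEN_LIMIT, COMPLETION_MAX_TOKENS,
                              ASSUMED_CONTEXT_WINDOW, ASSUMED_PROMPT_OVERHEAD)
print(f"Headroom per request: {headroom} tokens")
```

A negative result would mean the chunk threshold or `max_tokens` needs to be lowered for that model.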


### Document retrieval

Each chapter of Jules Verne's 20,000 Leagues Under the Sea is initially treated as a separate document.

In [3]:
documents = []

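# Download the HTML edition of the book from Project Gutenberg
resp = requests.get('https://www.gutenberg.org/files/164/164-h/164-h.htm')
resp.raise_for_status()  # fail early if the download did not succeed
soup = BeautifulSoup(resp.text, 'html.parser')
# Select every <div class="chapter"> element
chapters = soup.find_all('div', class_='chapter')
for c in chapters:
    # Skip blocks with 100 characters or fewer
    if len(c.text) > 100:
        documents.append(c.text)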

### Document summarization

All documents are summarized, and the resulting summaries are merged and summarized again until a single consolidated summary remains. The same approach can be adapted to business tasks such as user-feedback consolidation, competitive analysis, and maintenance summarization.

In [4]:
book_summary = create_single_summary(documents)

print(book_summary)

0/46
1/46
2/46
3/46
4/46
5/46
6/46
7/46
8/46
9/46
10/46
11/46
12/46
13/46
14/46
15/46
16/46
17/46
18/46
19/46
20/46
21/46
22/46
23/46
24/46
25/46
26/46
27/46
28/46
29/46
30/46
31/46
32/46
33/46
34/46
35/46
36/46
37/46
38/46
39/46
40/46
41/46
42/46
43/46
44/46
45/46
0/1
0/1
Pierre Aronnax, a professor from the Museum of Natural History in Paris, joins an expedition to hunt a giant narwhal in the North Pacific Ocean. After three months of searching, the crew spots the narwhal and prepares for battle. Pierre and Conseil are lost at sea after the frigate they were on collides with the narwhal. They are rescued by a submarine boat and taken to the Nautilus, a powerful submarine vessel owned by Captain Nemo. The vessel is powered by electricity and is equipped with a kitchen, bathroom, and berthroom. The passengers explore the ocean floor and a submarine forest, encountering a variety of flora and fauna. They navigate the dangerous waters of the Torres Straits and the Indian Ocean, encounter