# Practice Using an LLM to Chunk and Summarize Data

In [26]:
from langchain_openai import AzureOpenAI
from langchain_openai import AzureChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import SimpleJsonOutputParser
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

from azure.identity import AzureAuthorityHosts, DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

from dotenv import load_dotenv
import json
import tiktoken
import os

### Create your Azure OpenAI Resource

_If you have your resource from the last exercise, you don't need to complete the following steps._ 

Navigate to the [Azure Portal](https://portal.azure.com/#home) or [US Gov Azure Portal](https://portal.azure.us/#home) and log in with your account. Next, create an Azure OpenAI resource: create a new resource group for it and give the resource any unique name.  
Once the resource is created, open [Azure AI Foundry](https://ai.azure.com/) or [Azure OpenAI Studio](https://ai.azure.us/) to deploy the model. Navigate to Deployments, select Deploy model, and choose gpt-35-turbo-instruct. Make sure to increase the deployment's rate limit in tokens per minute (around 700k should be sufficient).

In [27]:
load_dotenv()

credential = DefaultAzureCredential(authority=AzureAuthorityHosts.AZURE_GOVERNMENT)

secret_client = SecretClient(vault_url=os.getenv('KEY_VAULT_URL'), credential=credential)
deployment = os.getenv('DEPLOYMENT')
endpoint_url = os.getenv('AZURE_OPENAI_ENDPOINT')
api_version = os.getenv('API_VERSION')
api_key = secret_client.get_secret(os.getenv('SECRET_NAME')).value


azure_client = AzureChatOpenAI(
    api_key=api_key,
    api_version=api_version,
    azure_endpoint=endpoint_url,
    deployment_name=deployment,
    temperature=0,
    max_tokens=4000,
    model_kwargs={"response_format": {"type": "json_object"}},
)

### Our Data

`lotr.pdf` contains the complete _Lord of the Rings_ novel by J.R.R. Tolkien. In this exercise, we ingest the prologue, split it into chunks, and use an LLM to pull the most important information out of each chunk.

### What is MapReduce?

MapReduce is a programming model that solves a large problem by splitting it into independent pieces, applying the same operation to each piece (the map step), and then combining the partial results into a single answer (the reduce step). As the name suggests, there are two main steps: mapping and reducing.  

**Map Step:**
* Break down the long text into smaller chunks
* Generate individual summaries for each chunk using an LLM
* This step is typically parallelized over the input chunks

**Reduce Step:**
* Combine the individual summaries generated in the Map step
* Create a single cohesive summary that captures key points from all chunks
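
The two steps above can be sketched in plain Python. This is only an illustration of the control flow, not the notebook's implementation: `summarize_chunk` and `combine_summaries` are hypothetical stand-ins for LLM calls (they keep the first sentence and join the results), and the input sentences are made-up sample text.

```python
from typing import Callable, List


def summarize_chunk(chunk: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    first_sentence = chunk.split(".")[0].strip()
    return first_sentence + "."


def combine_summaries(summaries: List[str]) -> str:
    """Hypothetical stand-in for the reduce-step LLM call: join the pieces."""
    return " ".join(summaries)


def map_reduce(
    chunks: List[str],
    map_fn: Callable[[str], str],
    reduce_fn: Callable[[List[str]], str],
) -> str:
    # Map: every chunk is processed independently of the others.
    partial_results = [map_fn(chunk) for chunk in chunks]
    # Reduce: the partial results are merged into one output.
    return reduce_fn(partial_results)


# Made-up sample chunks, used only to exercise the functions.
chunks = [
    "Hobbits are an unobtrusive people. They love peace and quiet.",
    "They divided into three breeds. Each preferred different terrain.",
]
summary = map_reduce(chunks, summarize_chunk, combine_summaries)
print(summary)
# Hobbits are an unobtrusive people. They divided into three breeds.
```

Because `map_fn` never looks at more than one chunk, the map step can be distributed or run concurrently without changing the result.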

In [28]:
loader = PyPDFLoader("data/lotr.pdf")

pages = loader.load_and_split()

### Tiktoken

Tiktoken is a fast BPE (Byte Pair Encoding) tokenizer developed by OpenAI for use with their models. Two of its key features are that the conversion between text and tokens is reversible and lossless, and that it works on arbitrary text.  

Here, we define a helper function, `num_tokens_from_string`, that uses tiktoken's `cl100k_base` encoding to return the number of tokens in a string.  

With the splitter settings used below (1,500-character chunks with a 150-character overlap), the prologue splits into 28 chunks, which are then run through the LLM. 

In [29]:
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens
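
To make the idea behind byte pair encoding concrete, here is a toy sketch in plain Python. It is not how tiktoken is implemented: tiktoken loads a fixed, pre-trained merge table (such as `cl100k_base`), whereas this example learns a handful of merges from the input string itself. All names (`toy_bpe_encode`, `toy_bpe_decode`, and the helpers) are made up for illustration. The sketch shows why the conversion is reversible: every merged token remembers the pair it replaced, so the original bytes can always be rebuilt.

```python
from collections import Counter
from typing import Dict, List, Tuple


def most_frequent_pair(ids: List[int]) -> Tuple[int, int]:
    """Return the adjacent pair of token ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(ids: List[int], pair: Tuple[int, int], new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    merged, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            merged.append(new_id)
            i += 2
        else:
            merged.append(ids[i])
            i += 1
    return merged


def toy_bpe_encode(
    text: str, num_merges: int
) -> Tuple[List[int], Dict[int, Tuple[int, int]]]:
    """Start from raw UTF-8 bytes and apply up to `num_merges` greedy merges."""
    ids = list(text.encode("utf-8"))
    merges: Dict[int, Tuple[int, int]] = {}
    for step in range(num_merges):
        if len(ids) < 2:
            break
        pair = most_frequent_pair(ids)
        new_id = 256 + step  # ids 0-255 are reserved for single bytes
        ids = merge_pair(ids, pair, new_id)
        merges[new_id] = pair
    return ids, merges


def toy_bpe_decode(ids: List[int], merges: Dict[int, Tuple[int, int]]) -> str:
    """Undo the merges, newest first, to recover the original bytes."""
    for new_id in sorted(merges, reverse=True):
        left, right = merges[new_id]
        expanded: List[int] = []
        for token in ids:
            expanded.extend((left, right) if token == new_id else (token,))
        ids = expanded
    return bytes(ids).decode("utf-8")


text = "the hobbit and the other hobbits"
token_ids, merge_table = toy_bpe_encode(text, num_merges=5)
print(len(text.encode("utf-8")), "bytes ->", len(token_ids), "tokens")
print(toy_bpe_decode(token_ids, merge_table) == text)  # round trip is lossless
```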

In [30]:
# Documents that cover the prologue in this PDF
prologue_pages = pages[24:39]
# Concatenate the page contents into a single string
prologue_text = ''.join(page.page_content for page in prologue_pages)

In [31]:
text_splitter = RecursiveCharacterTextSplitter(
    # Sizes are measured in characters by default; kept small for demonstration
    chunk_size=1500,
    chunk_overlap=150,
)

docs = text_splitter.create_documents([prologue_text])
print(len(docs))

28
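
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch of overlapping chunking in plain Python. It is a deliberate simplification: `RecursiveCharacterTextSplitter` prefers to break at paragraph, line, and word boundaries, while the hypothetical `split_with_overlap` below cuts at fixed character positions.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one. The overlap gives the model some context for sentences that would otherwise be cut at a chunk boundary.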


### The Map Step

Above, we did the first part of the Map step, which is to break the text down into smaller chunks. Now, we use the LLM to pull out themes (with supporting summaries) from each chunk. 

In [32]:
output_parser = StrOutputParser()
prompt = PromptTemplate(
    template="Identify the most important themes covered in the following text.  Take all information from the text only.  If a piece of information is not in the text, do not include it.  Results format: <theme>: <detailed summary of text that supports the theme in narrative form.>\n {text}",
    input_variables=["text"],
)

# Build the chain once and reuse it for every chunk
chain = (
    {"text": RunnablePassthrough()}
    | prompt
    | azure_client
    | output_parser
)

responses = []
for doc in docs:
    # Replace double quotes with single quotes in the model output
    responses.append(chain.invoke(doc.page_content).replace('"', '\''))
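
Because each chunk is handled independently, the map loop can also be run concurrently. The sketch below uses only the standard library and assumes a hypothetical `fake_llm_call` in place of `chain.invoke`, so it runs without a model. LangChain runnables also expose a `batch` method that serves a similar purpose.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def fake_llm_call(chunk: str) -> str:
    """Hypothetical stand-in for chain.invoke(chunk); sleeps to mimic latency."""
    time.sleep(0.05)
    return f"themes for: {chunk[:20]}"


def map_chunks(chunks: List[str], max_workers: int = 4) -> List[str]:
    # executor.map preserves input order, so results[i] belongs to chunks[i].
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fake_llm_call, chunks))


# Made-up sample chunks, used only to exercise the function.
chunks = [f"chunk number {i} of the prologue" for i in range(8)]
results = map_chunks(chunks)
print(len(results))
print(results[0])
```

Keep `max_workers` small enough that the concurrent requests stay within the deployment's tokens-per-minute rate limit.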



In [33]:
print(docs[2].page_content)
print(responses[2])


mingled with the other kinds that had preceded them, but being 
somewhat bolder and more adventurous, they were often found as 
leaders or chieftains among clans of Harfoots or Stoors. Even in 
Bilbo’s time the strong Fallohidish strain could still be noted among4 TH E L ORD OF THE RI NGS 
the greater families, such as the Tooks and the Masters of Buckland. 
In the westlands of Eriador, between the Misty Mountains and 
the Mountains of Lune, the Hobbits found both Men and Elves. 
Indeed, a remnant still dwelt there of the Du´nedain, the kings of 
Men that came over the Sea out of Westernesse; but they were dwin-
dling fast and the lands of their North Kingdom were falling far and 
wide into waste. There was room and to spare for incomers, and ere 
long the Hobbits began to settle in ordered communities. Most of 
their earlier settlements had long disappeared and been forgotten in 
Bilbo’s time; but one of the ﬁrst to become important still endured, 
though reduced in size; this was at 

### The Reduce Step

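Here, we take the themes the LLM generated for each chunk and prompt the LLM to turn them into a narrative summary, one summary per chunk. The per-chunk summaries are then joined in order to form the summary of the whole prologue.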

In [42]:
reduce_prompt_text = """
### Instructions:
Below is a collection of themes and supporting text. Summarize all of the supporting text in narrative form, as if telling a story.
Your summary will be added to other summaries, so do not give an introduction to the summary; assume the reader has context for what you are telling them.
Do not include the theme title in the summary.

### Output Key and Value JSON
{{"text": "<summary of supporting text>"}}

### Themes and Supporting Text:
{text}
"""

In [43]:

final_summary = []
output_parser = StrOutputParser()
prompt = ChatPromptTemplate.from_template(reduce_prompt_text)

# Build the chain once and reuse it for every set of themes
chain = (
    {"text": RunnablePassthrough()}
    | prompt
    | azure_client
    | output_parser
)

for response in responses:
    final_summary.append(chain.invoke(response))

print(' '.join(final_summary))

{"text": "In the early days of their existence, Hobbits began to document their history after settling in the Shire, though their legends hint at a time long before, when they inhabited the upper vales of Anduin. This rich tapestry of history suggests a westward migration and a deep-rooted connection to other races, including Elves, Dwarves, and Men. As they settled into their new home, Hobbits naturally divided into three distinct breeds: the Harfoots, Stoors, and Fallohides. Each breed exhibited unique physical traits and preferred different terrains, showcasing the diversity within their culture and their ability to adapt to various environments. However, as time passed, external pressures began to influence their way of life. The increasing presence of Men and the encroaching darkness of the forest, which would later be known as Mirkwood, prompted the Hobbits to migrate across the mountains into Eriador, marking a significant chapter in their ongoing story."} {"text": "In a land wh

In [44]:
print(len(final_summary))

28
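
The count above shows that this step yields one summary per chunk rather than a single text. If one cohesive summary is wanted, the summaries can be combined in rounds until only one remains. The sketch below is one possible approach, not part of the original exercise: `group_by_budget`, `hierarchical_reduce`, and `fake_combine` are hypothetical names, `fake_combine` stands in for an LLM call, and the budget is measured in characters for simplicity (a token count from `num_tokens_from_string` could be used instead).

```python
from typing import Callable, List


def group_by_budget(items: List[str], max_chars: int) -> List[List[str]]:
    """Pack consecutive items into groups whose combined length fits max_chars."""
    groups: List[List[str]] = []
    current: List[str] = []
    used = 0
    for item in items:
        if current and used + len(item) > max_chars:
            groups.append(current)
            current, used = [], 0
        current.append(item)
        used += len(item)
    if current:
        groups.append(current)
    return groups


def hierarchical_reduce(
    summaries: List[str],
    combine: Callable[[List[str]], str],
    max_chars: int,
) -> str:
    """Combine groups of summaries repeatedly until a single one remains."""
    while len(summaries) > 1:
        groups = group_by_budget(summaries, max_chars)
        if len(groups) == len(summaries):
            # Nothing fits together under the budget; merge pairs to make progress.
            groups = [summaries[i:i + 2] for i in range(0, len(summaries), 2)]
        summaries = [combine(group) for group in groups]
    return summaries[0] if summaries else ""


def fake_combine(group: List[str]) -> str:
    """Hypothetical stand-in for an LLM call that merges several summaries."""
    return "+".join(group)


result = hierarchical_reduce(["a", "b", "c", "d", "e"], fake_combine, max_chars=2)
print(result)
# a+b+c+d+e
```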


In [45]:
# Each reduce response is a JSON string of the form {"text": "..."}; keep only the text
summaries = [json.loads(fs)["text"] for fs in final_summary if fs]
final_text = '\n\n'.join(summaries)
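
Model output is not guaranteed to be valid JSON, for example when a response is cut off by the token limit, and `json.loads` raises an exception in that case. A more defensive variant might look like the sketch below. The helper name `extract_text` and the sample responses are made up for illustration.

```python
import json
from typing import List


def extract_text(raw_responses: List[str], key: str = "text") -> List[str]:
    """Pull `key` out of each JSON response, skipping anything unparseable."""
    texts = []
    for index, raw in enumerate(raw_responses):
        if not raw.strip():
            continue  # empty response, nothing to parse
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as err:
            print(f"response {index} is not valid JSON: {err}")
            continue
        if isinstance(payload, dict) and isinstance(payload.get(key), str):
            texts.append(payload[key])
        else:
            print(f"response {index} has no string field named {key!r}")
    return texts


# Made-up responses covering the success case and three failure cases.
sample_responses = [
    '{"text": "First summary."}',
    "",
    '{"text": "Second summary."',  # truncated: missing closing brace
    '{"summary": "Wrong key."}',
]
print("\n\n".join(extract_text(sample_responses)))
```

Only the first sample response survives. The other three are skipped and the two malformed ones are reported, instead of the whole run stopping on the first bad response.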



In [46]:
# Save the original prologue text (UTF-8, since the text contains non-ASCII characters)
with open("./prologue_text.txt", "w", encoding="utf-8") as prologue:
    prologue.write(prologue_text)

# Save the generated summary
with open("./prologue_summary_text.txt", "w", encoding="utf-8") as summary:
    summary.write(final_text)