## Summarization

In [None]:
#!pip install openai langchain chromadb pandas openpyxl tqdm

In [1]:
import pandas as pd

In [2]:
# Load the chat data and drop rows with missing values or empty messages
main_df = pd.read_excel("topical_chat.xlsx")
main_df = main_df.dropna()
main_df = main_df[main_df['message'] != '']
main_df = main_df.reset_index(drop=True)

# Keep only the rows whose conversation_id is at most 100
first_100_conversations = main_df[main_df["conversation_id"] <= 100]


In [3]:
# Group messages that share the same conversation id
unique_conversations_ids = first_100_conversations['conversation_id'].unique()
print("Unique Conversation ids:", unique_conversations_ids)


texts = []

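for conversation_id in unique_conversations_ids:
    # Select every message of this conversation, in its original order
    conversation = first_100_conversations[first_100_conversations["conversation_id"] == conversation_id]
    # Join the messages into one string, one message per line
    texts.append({"id": conversation_id, "text": "\n ".join(conversation["message"])})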

Unique Conversation ids: [  1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
  91  92  93  94  95  96  97  98  99 100]
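
Because `texts` is a list, `texts[i]` is addressed by position rather than by `conversation_id`, so position 1 holds conversation 2. The following is a minimal sketch, using only the standard library, of keying the joined text by conversation id instead. The sample rows and the names `messages_by_id` and `texts_by_id` are hypothetical and do not appear in the notebook.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the (conversation_id, message) columns.
rows = [
    (1, "hello"),
    (1, "hi there"),
    (2, "do you like dance?"),
    (2, "Yes I do."),
]

# Collect messages per conversation, preserving their original order.
messages_by_id = defaultdict(list)
for conversation_id, message in rows:
    messages_by_id[conversation_id].append(message)

# Join each conversation into one string, keyed by its id rather than by position.
texts_by_id = {cid: "\n".join(msgs) for cid, msgs in messages_by_id.items()}

print(texts_by_id[2])
```

With a mapping like this, a lookup such as `texts_by_id[2]` always returns conversation 2, even if some ids are missing from the data.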


In [10]:

from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

from langchain.output_parsers import PydanticOutputParser
from langchain.output_parsers import RetryWithErrorOutputParser
from pydantic import BaseModel, Field


# Reads the API key from the OPENAI_API_KEY environment variable
llm = ChatOpenAI()

class AnswerModel(BaseModel):
    summary: str = Field(description="summary of the given conversation")


parser = PydanticOutputParser(pydantic_object=AnswerModel)

template = """Summarize the given chat context in short and generate 2-3 lines of summary. context : {context},\n format instructions : {format_instructions}"""


prompt = PromptTemplate(
    template=template,
    input_variables=["context"],
    partial_variables={"format_instructions": parser.get_format_instructions()},
)

In [11]:

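# Build the retrying parser once instead of on every call
retry_parser = RetryWithErrorOutputParser.from_llm(parser=parser, llm=llm)

def summarize(index):
    """Summarize the conversation stored at position `index` of `texts` (0-based)."""
    # `index` is a list position, not a conversation id: index 1 is conversation 2
    context = texts[index]["text"]

    prompt_value = prompt.format_prompt(context=context)
    output = llm.predict(prompt_value.to_string())
    # Re-ask the model if the output does not parse into AnswerModel
    model = retry_parser.parse_with_prompt(output, prompt_value)
    return model.summary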
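
The parse-then-retry step in `summarize` can be pictured roughly as follows. This is a simplified standard-library sketch of the general pattern, not LangChain's implementation. The names `parse_summary`, `parse_with_retry`, and `fake_reask` are hypothetical, and `fake_reask` is a stub standing in for a second model call.

```python
import json

def parse_summary(reply: str) -> str:
    """Extract the `summary` field from a JSON reply, raising ValueError if absent."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("summary"), str):
        raise ValueError("reply has no string 'summary' field")
    return data["summary"]

def parse_with_retry(reply, reask, max_retries=1):
    """Parse `reply`; on failure, pass the error to `reask` and try its new reply."""
    for attempt in range(max_retries + 1):
        try:
            return parse_summary(reply)
        except ValueError as error:
            if attempt == max_retries:
                raise  # out of retries: surface the last parsing error
            reply = reask(reply, str(error))

# Hypothetical stand-in for a second model call that returns corrected output.
def fake_reask(bad_reply, error_message):
    return json.dumps({"summary": "Two people trade facts about dance."})

print(parse_with_retry("not json", fake_reask))
```

Passing the error message back to the model gives it a chance to correct the specific formatting problem, at the cost of one extra call per failed parse.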

In [12]:
print(texts[1]['text'])

do you like dance?
 Yes  I do. Did you know Bruce Lee was a cha cha dancer?
 Yes he even won a hardcore cha cha championship in 1958
 Yeah. Did you know Tupac was a ballet dancer?
 Yes and he even was in the production of the nutcracker
 Yeah. Ballet dancer go through 4 pairs of shoes a week
 Yes that is a lot of shoes and also a lot of money
 Yeah true. Did you know babies are really good at dancing?
 Yes and they smile more when they hit the beat
 Yeah they are much smarter than we give them credit for
 True Did you know Jackson had a patent on a dancing device?
 Yes it helped him smooth out his dance moves
 Nice. Do you like Shakespeare?
 Yes I do. Do you know that he popularized many phrases
  Yes like good riddance, in my heart of hearts and such
  Yes and then he also invented names like Jessica, Olivia and Miranda
 Yes. And for his works you have to use old english for it to make sense
 Yes otherwise the rhymes and puns do not seem to work out
 Yes. He lived at the same time as 

In [13]:
print(summarize(1))

The conversation revolves around various interesting facts about dance and Shakespeare. The participants discuss Bruce Lee's cha cha dancing, Tupac's ballet dancing, Michael Jackson's patented dancing device, and the natural dancing ability of babies. They also discuss Shakespeare's contribution to English language, his invention of names, and the necessity of using Old English to understand his works.


Writing the summaries of all 100 conversations to the task_2_summaries folder

In [17]:
import os
from tqdm import tqdm

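def write_to_files(prefix="summary_conversation"):
    """Summarize every conversation in `texts` and write one file per conversation."""
    # Create the output folder if it does not exist yet
    os.makedirs("task_2_summaries", exist_ok=True)

    for index in tqdm(range(len(texts))):
        summary = summarize(index)
        # Name the file after the conversation id rather than the list position
        filename = os.path.join("task_2_summaries", f"{prefix}_{texts[index]['id']}.txt")
        with open(filename, "w", encoding="utf-8") as file:
            file.write(summary)

write_to_files()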


100%|██████████| 100/100 [08:28<00:00,  5.08s/it]
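
The run above takes about eight and a half minutes and makes at least one model call per conversation, so an interrupted run is costly to repeat. The following is a minimal sketch of a resumable variant that skips conversations whose file already exists. The names `write_summaries` and `fake_summarize` are hypothetical, the stub replaces the model-backed `summarize`, and a temporary directory is used so the example does not touch real output.

```python
import os
import tempfile

def write_summaries(ids, summarize_fn, out_dir, prefix="summary_conversation"):
    """Write one summary file per id, skipping ids whose file already exists."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for conversation_id in ids:
        filename = os.path.join(out_dir, f"{prefix}_{conversation_id}.txt")
        if os.path.exists(filename):
            continue  # already summarized in an earlier run
        with open(filename, "w", encoding="utf-8") as file:
            file.write(summarize_fn(conversation_id))
        written.append(conversation_id)
    return written

# Hypothetical stub in place of the model-backed summarizer.
def fake_summarize(conversation_id):
    return f"summary for conversation {conversation_id}"

out_dir = os.path.join(tempfile.mkdtemp(), "task_2_summaries")
first_run = write_summaries([1, 2, 3], fake_summarize, out_dir)
second_run = write_summaries([1, 2, 3, 4], fake_summarize, out_dir)
print(first_run, second_run)
```

In this sketch the second call only writes the file for id 4, because the files for ids 1 to 3 are already present.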
