## LangChain
These experiments aim to determine how well language models perform at creating meeting notes from transcripts. The input format is based on the Microsoft Teams .vtt transcript format.

In [1]:
!pip install openai tiktoken langchain

Collecting openai
  Downloading openai-1.3.5-py3-none-any.whl.metadata (16 kB)
Collecting tiktoken
  Downloading tiktoken-0.5.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata (6.6 kB)
Collecting langchain
  Downloading langchain-0.0.340-py3-none-any.whl.metadata (16 kB)
Collecting anyio<4,>=3.5.0 (from openai)
  Downloading anyio-3.7.1-py3-none-any.whl.metadata (4.7 kB)
Collecting distro<2,>=1.7.0 (from openai)
  Downloading distro-1.8.0-py3-none-any.whl (20 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.2-py3-none-any.whl.metadata (6.9 kB)
Collecting pydantic<3,>=1.9.0 (from openai)
  Downloading pydantic-2.5.2-py3-none-any.whl.metadata (65 kB)
Collecting tqdm>4 (from openai)
  Downloading tqdm-4.66.1-py3-none-any.whl.metadata (57 kB)

### Approach high-level

The core approaches are depicted below. In this notebook we experiment with them using different prompts and different ways of packaging the transcripts.

![img](summarizeApproaches.png)


**Target**

We aim to produce meeting notes that follow the template in the next cell.

**Subject: [Meeting Date] - [Meeting Title/Topic]**



Dear all,

Please find the notes and action items from our recent meeting below.

**Meeting Summary:**

[Concise summary of the meeting discussions.]

**Key Points:**

[Key Point 1]

[Key Point 2]

[Key Point 3]

**Action Items:**

[Action Item 1: Description, Deadline, Owner]

[Action Item 2: Description, Deadline, Owner]

[Action Item 3: Description, Deadline, Owner]

Please review the action items and ensure that deadlines are noted. Feel free to reach out if you have any questions or concerns related to the meeting minutes and action points.


Best regards,

[Standard signature]

## Preprocessing of .vtt


In [2]:
import json
import re

In [3]:
FILE_PATH = "sample.vtt"
with open(FILE_PATH, "r") as f:
    lines = f.read()

In [4]:
print(lines)

WEBVTT

00:00:00.000 --> 00:00:10.260
<v Donald Trump>These are bad, sick people. That was your coup, you know, against you. Well, it started right at the beginning. Like when Millie's talking about, oh, you were going to try to do a coup.</v>

00:00:10.340 --> 00:00:19.140
<v BBC News>No, they were tryin x x … g to do that before you even were sworn in. szThat's right. zssssszTrying to overthrow your election. Well, with Millie, let me see that. I'll show you an example.</v>

00:00:19.660 --> 00:00:31.840
<v Donald Trump>He said that I wanted to attack Iran. Isn't it amazing? bffbbbbbbb I have a big pile of papers. This thing just came out. Look. This was him.</v>


In [5]:
# Convert a VTT timestamp (HH:MM:SS.mmm) to seconds
def vtt_to_seconds(vtt_timestamp):
    h, m, s = vtt_timestamp.split(':')
    s, ms = s.split('.')
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

extraction_pattern = re.compile(r'<v([^>]*)>(.*?)<\/v>')

## Create function that converts vtt with speakers to json
def vtt_to_json(vtt_content):
    sections = []
    current_section = None

    lines = vtt_content.split('\n')

    for line in lines:
        if re.match(r'^\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}$', line):
            # This line represents the timestamp, create a new section
            if current_section:
                sections.append(current_section)
            current_section = {'timestamp_start': vtt_to_seconds(line.split(' --> ')[0]),
                               'timestamp_end': vtt_to_seconds(line.split(' --> ')[1]),
                               'speaker': '',
                               'text': ''}
        elif current_section is not None and (match := extraction_pattern.search(line)):
            # This line holds a complete <v Speaker>text</v> span:
            # extract the speaker's name and the spoken text with a single search
            current_section['speaker'] = match.group(1).strip()
            current_section['text'] = match.group(2).strip()
        elif line.strip() == '':
            # Empty line indicates the end of a section
            if current_section:
                sections.append(current_section)
                current_section = None

    # Add the last section if there is any
    if current_section:
        sections.append(current_section)

    # Return the parsed sections as a list of JSON-serializable dicts
    return sections

In [6]:
from pprint import pprint
pprint(vtt_to_json(lines))

[{'speaker': 'Donald Trump',
  'text': 'These are bad, sick people. That was your coup, you know, against '
          "you. Well, it started right at the beginning. Like when Millie's "
          'talking about, oh, you were going to try to do a coup.',
  'timestamp_end': 10.26,
  'timestamp_start': 0.0},
 {'speaker': 'BBC News',
  'text': 'No, they were tryin x x … g to do that before you even were sworn '
          "in. szThat's right. zssssszTrying to overthrow your election. Well, "
          "with Millie, let me see that. I'll show you an example.",
  'timestamp_end': 19.14,
  'timestamp_start': 10.34},
 {'speaker': 'Donald Trump',
  'text': "He said that I wanted to attack Iran. Isn't it amazing? bffbbbbbbb "
          'I have a big pile of papers. This thing just came out. Look. This '
          'was him.',
  'timestamp_end': 31.84,
  'timestamp_start': 19.66}]
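
One possible way to package the parsed sections for a prompt is to flatten them into speaker-labelled lines. The sketch below is only an assumption about that packaging step; `format_timestamp` and `sections_to_text` are hypothetical helper names, and the example sections are made up in the shape returned by `vtt_to_json`.

```python
def format_timestamp(seconds):
    """Render seconds as HH:MM:SS (fractions of a second are dropped)."""
    total = int(seconds)
    return f"{total // 3600:02d}:{(total % 3600) // 60:02d}:{total % 60:02d}"


def sections_to_text(sections):
    """Turn parsed sections into one '[HH:MM:SS] Speaker: text' line each."""
    return "\n".join(
        f"[{format_timestamp(s['timestamp_start'])}] {s['speaker']}: {s['text']}"
        for s in sections
    )


# Hypothetical sections in the same shape as the output of vtt_to_json
example_sections = [
    {"timestamp_start": 0.0, "timestamp_end": 10.26,
     "speaker": "Alice", "text": "Welcome everyone."},
    {"timestamp_start": 3725.5, "timestamp_end": 3730.0,
     "speaker": "Bob", "text": "Thanks."},
]
print(sections_to_text(example_sections))
```

Keeping the speaker name on every line is a design choice: it lets a later prompt attribute action items to an owner even after the transcript has been split into chunks.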


## Approach 1 - Map reduce

Create the following elements separately, then merge them into the final format:
* Action items
* Summary 
* Key points
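
The merge step could look roughly like the sketch below, which fills the target layout from the three separately generated elements. `build_meeting_notes` and its parameters are hypothetical names, and the action-item fields (`description`, `deadline`, `owner`) are an assumption taken from the target template rather than from any code in this notebook.

```python
def build_meeting_notes(date, title, summary, key_points, action_items):
    """Merge separately generated elements into the target e-mail layout."""
    lines = [
        f"**Subject: {date} - {title}**",
        "",
        "Dear all,",
        "",
        "Please find the notes and action items from our recent meeting below.",
        "",
        "**Meeting Summary:**",
        "",
        summary,
        "",
        "**Key Points:**",
        "",
    ]
    # One paragraph per key point, separated by blank lines as in the template
    for point in key_points:
        lines += [point, ""]
    lines += ["**Action Items:**", ""]
    # Each action item is rendered as "Description, Deadline, Owner"
    for item in action_items:
        lines += [f"{item['description']}, {item['deadline']}, {item['owner']}", ""]
    return "\n".join(lines).rstrip()


# Made-up values, for illustration only
print(build_meeting_notes(
    date="2023-11-27",
    title="Weekly sync",
    summary="We reviewed progress.",
    key_points=["Budget approved"],
    action_items=[{"description": "Send report", "deadline": "Friday", "owner": "Alice"}],
))
```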

In [43]:
from langchain.chains import MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.text_splitter import CharacterTextSplitter
from langchain.llms import AzureOpenAI
from langchain.chains.combine_documents.stuff import StuffDocumentsChain, format_document
from langchain.chains.llm import LLMChain
from langchain.prompts import PromptTemplate

llm = AzureOpenAI(
    deployment_name="td2",
    model_name="text-davinci-002",
)

# Map
map_template = """
You are a highly skilled AI trained in language comprehension and summarization. 
I would like you to read the following transcript from a meeting and summarize it into a concise abstract paragraph. The transcript you will be summarizing is automatically generated from a video call and may contain errors. Additionally, it covers only one part of the full meeting.
Aim to retain the most important points, providing a coherent and readable summary that could help a person understand the main points of the discussion without needing to read the entire text. Please avoid unnecessary details or tangential points.
Transcript:
{docs}
"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

ModuleNotFoundError: No module named 'langchain'

### Map

Let's unpack the map reduce approach. For this, we'll first map each document to an individual summary using an `LLMChain`. Then we'll use a `ReduceDocumentsChain` to combine those summaries into a single global summary.
 
First, we specify the LLMChain to use for mapping each document to an individual summary:

In [None]:
map_template = """
You are a highly skilled AI trained in language comprehension and summarization. The following is one section of an automatically generated meeting transcript. I would like you to read and summarize it into a concise abstract paragraph. Aim to retain the most important points, providing a coherent and readable summary that could help a person understand the main points of the discussion without needing to read the entire text. Please avoid unnecessary details or tangential points.
Transcript:
{docs}
"""
map_prompt = PromptTemplate.from_template(map_template)
map_chain = LLMChain(llm=llm, prompt=map_prompt)

### Reduce

The `ReduceDocumentsChain` takes the document mapping results and reduces them into a single output. It wraps a generic `CombineDocumentsChain` (like `StuffDocumentsChain`) but adds the ability to collapse documents before passing them to the `CombineDocumentsChain` if their cumulative size exceeds `token_max`. In this example, we can re-use the chain that combines our docs to also collapse them.

So if the cumulative number of tokens in our mapped documents exceeds 4000 tokens, then we'll recursively pass in the documents in batches of < 4000 tokens to our `StuffDocumentsChain` to create batched summaries. And once those batched summaries are cumulatively less than 4000 tokens, we'll pass them all one last time to the `StuffDocumentsChain` to create the final summary.
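
The collapse-then-combine logic described above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: `count_tokens`, `batch_by_tokens`, `collapse_and_combine`, and the stub `first_words` are hypothetical names, tokens are approximated by whitespace-separated words, and the stub stands in for the LLM call.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def batch_by_tokens(texts, token_max):
    """Group texts into consecutive batches whose token total stays within token_max."""
    batches, current, current_tokens = [], [], 0
    for text in texts:
        n = count_tokens(text)
        if current and current_tokens + n > token_max:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += n
    if current:
        batches.append(current)
    return batches


def collapse_and_combine(summaries, combine, token_max):
    """Collapse summaries batch by batch until they fit, then combine one last time."""
    while sum(count_tokens(s) for s in summaries) > token_max:
        batches = batch_by_tokens(summaries, token_max)
        if len(batches) == len(summaries):
            # Every batch holds a single summary, so batching makes no progress
            raise ValueError("cannot collapse further within token_max")
        summaries = [combine(batch) for batch in batches]
    return combine(summaries)


def first_words(batch, n=3):
    """Stub 'LLM': keep only the first n words of the joined batch."""
    return " ".join(" ".join(batch).split()[:n])


mapped = ["a b c d", "e f g h", "i j k l"]  # made-up per-chunk summaries
print(collapse_and_combine(mapped, first_words, token_max=8))
```

Because each pass strictly reduces the number of summaries, the loop terminates even if the combine step fails to shorten the text.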

In [None]:
# Reduce
reduce_template = """
You are a highly skilled AI trained in language comprehension and summarization. The following is a set of generated meeting summaries from different sections of automated meeting transcription. I would like you to read and summarize it into a concise abstract paragraph. Aim to retain the most important points, providing a coherent and readable summary that could help a person understand the main points of the discussion without needing to read the entire text. Please avoid unnecessary details or tangential points.
Summaries:
{doc_summaries}
Take these and distill them into a final, consolidated summary of the main themes.
Helpful Answer:"""
reduce_prompt = PromptTemplate.from_template(reduce_template)

In [None]:
# Run chain
reduce_chain = LLMChain(llm=llm, prompt=reduce_prompt)

# Takes a list of documents, combines them into a single string, and passes this to an LLMChain
combine_documents_chain = StuffDocumentsChain(
    # The variable name must match the placeholder used in reduce_template
    llm_chain=reduce_chain, document_variable_name="doc_summaries"
)

# Combines and iteratively reduces the mapped documents
reduce_documents_chain = ReduceDocumentsChain(
    # This is the final chain that is called.
    combine_documents_chain=combine_documents_chain,
    # If documents exceed context for `StuffDocumentsChain`
    collapse_documents_chain=combine_documents_chain,
    # The maximum number of tokens to group documents into.
    token_max=4000,
)

The map and reduce steps are then combined into a single chain:

In [None]:
# Combining documents by mapping a chain over them, then combining results
map_reduce_chain = MapReduceDocumentsChain(
    # Map chain
    llm_chain=map_chain,
    # Reduce chain
    reduce_documents_chain=reduce_documents_chain,
    # The variable name in the llm_chain to put the documents in
    document_variable_name="docs",
    # Return the results of the map steps in the output
    return_intermediate_steps=False,
)

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000, chunk_overlap=0
)
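# NOTE: `docs` is not defined in the cells above; it is assumed to be a list of
# LangChain Document objects built from the parsed transcript.
split_docs = text_splitter.split_documents(docs)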