# Meeting Transcription Analysis

This notebook uses generative AI (GenAI) in the form of Large Language Models (LLMs) to analyze Zoom meeting transcriptions.

Analysis includes:
- Main topics of discussion
- Jargon used
- Action items

This notebook uses the [OpenAI REST API](https://platform.openai.com/docs/api-reference/introduction) to interact with LLMs hosted in a [FastChat](https://github.com/lm-sys/FastChat) deployment.

*Disclaimer*: While developing this notebook, I used LLMs as a pair programmer to get template code for specific functions.

## Install necessary packages

In [1]:
pip install openai==0.28.1 nltk

Collecting openai==0.28.1
  Obtaining dependency information for openai==0.28.1 from https://files.pythonhosted.org/packages/1e/9f/385c25502f437686e4aa715969e5eaf5c2cb5e5ffa7c5cdd52f3c6ae967a/openai-0.28.1-py3-none-any.whl.metadata
  Using cached openai-0.28.1-py3-none-any.whl.metadata (11 kB)
Collecting nltk
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting regex>=2021.8.3 (from nltk)
  Obtaining dependency information for regex>=2021.8.3 from https://files.pythonhosted.org/packages/8d/6b/2f6478814954c07c04ba60b78d688d3d7bab10d786e0b6c1db607e4f6673/regex-2023.12.25-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata
  Downloading regex-2023.12.25-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (40 kB)
Using cached openai-0.28.1-py3-none-any.whl (76 kB)
Downloading regex-2023.12.25-cp311-cp311-manylinu

## Import Packages

In [2]:
import yaml
import openai
from nltk.corpus import stopwords
import nltk
import re

## Set Environment Variables

In [3]:
with open('env.yaml', 'r') as f:
    env = yaml.safe_load(f)

print(env["fastchat"]["base_url"])

https://sdsu-rci-fastchat.nrp-nautilus.io/v1
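
The cells in this notebook read three values from `env.yaml`: `fastchat.base_url`, `fastchat.api_key`, and `transcript_filename`. The following is a minimal sketch, not part of the original notebook, of a check that reports which of those keys are missing from the parsed config. The helper name `missing_env_keys` and the placeholder values are assumptions made for illustration.

```python
# Keys this notebook reads from the parsed env.yaml.
REQUIRED_KEYS = [
    ("fastchat", "base_url"),
    ("fastchat", "api_key"),
    ("transcript_filename",),
]


def missing_env_keys(env):
    """Return the dotted names of required keys that are absent from env."""
    missing = []
    for path in REQUIRED_KEYS:
        node = env
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Hypothetical config with placeholder values, shaped like the parsed env.yaml.
example_env = {
    "fastchat": {"base_url": "https://example.invalid/v1", "api_key": "PLACEHOLDER"},
    "transcript_filename": "meeting.vtt",
}
print(missing_env_keys(example_env))
```

Running a check like this right after `yaml.safe_load` would surface a misnamed key before the first API call rather than as a `KeyError` several cells later.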


## Importance of Token/Context Limit
It is important to make effective use of an LLM's token limit (its context window) so that the model spends it on meaningful data. The Zoom video text track (.vtt) file contains many extra characters, such as timestamps, that are not useful for our purposes and should be removed.

With that said, I conducted some preliminary, small-scale testing with the timestamps included and the LLM's analysis didn't appear to be adversely affected.

Regardless, when dealing with nearly one hour of spoken words, it is wise to optimize your input length. Prior to any pre-processing, a simple run of the Linux command `wc -w` revealed my transcript file had 10,815 "words."

For context, LLM tokenization results in a ratio of roughly 1 token = 3/4 of a word, so 100 tokens is roughly 75 words. Using that ratio, we could expect the 10,815 words to be roughly equal to 14,420 tokens (10,815 / 0.75). And that doesn't include any user or system prompts!
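
The rule of thumb above can be written as a one-line estimate: divide the word count by the words-per-token ratio. This is a rough sketch, not a real tokenizer. The function name `estimate_tokens` is hypothetical, and the 0.75 default is only the approximation quoted above, so actual counts will vary by model and text.

```python
def estimate_tokens(word_count, words_per_token=0.75):
    """Estimate a token count from a word count using a rough ratio."""
    return round(word_count / words_per_token)


# 75 words should come out at roughly 100 tokens under this ratio.
print(estimate_tokens(75))

# Estimate for the raw transcript word count reported by `wc -w`.
print(estimate_tokens(10815))
```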

## Clean the Zoom .vtt File

In [4]:
transcript_filename = env["transcript_filename"]

transcript_raw = ""

with open(transcript_filename, 'r') as f:
    transcript_raw = f.read()

# Calculate and print info about raw file
rawCharCount = len(transcript_raw)
rawWordCount = len(transcript_raw.split())
rawLineCount = len(transcript_raw.split("\n"))

print(f"Raw transcript character count: {rawCharCount}")
print(f"Raw transcript word count: {rawWordCount}")
print(f"Raw transcript line count: {rawLineCount}")

# Process transcript as a list to make it iterable
transcript_transform = transcript_raw.split("\n")

# Remove first two lines because they have no value
transcript_transform = transcript_transform[2:]

# Matches both numbered lines and timestamp lines
digit_pattern = r"^[0-9]+"

# Matches lines that start with a name
name_pattern = r"^[A-Z][a-z]+\s[A-Z][a-z]+:\s"

transcript_processed = []

for line in transcript_transform:
    # Ignore empty lines
    if line == "":
        continue
    # Ignore timestamp and numbered lines
    if re.search(digit_pattern, line):
        continue
    # Strip off the name from lines that start with one
    line = re.sub(name_pattern, "", line, count=1)
    transcript_processed.append(line)

Raw transcript character count: 64680
Raw transcript word count: 10815
Raw transcript line count: 1783
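
To see what the cleaning step does, here is a small self-contained sketch that applies the same two patterns to a made-up `.vtt` snippet. The function name `clean_vtt` and the sample text (including the speaker name) are assumptions for illustration, not data from the real transcript.

```python
import re

# Same patterns as the cell above.
DIGIT_PATTERN = re.compile(r"^[0-9]+")
NAME_PATTERN = re.compile(r"^[A-Z][a-z]+\s[A-Z][a-z]+:\s")


def clean_vtt(raw_text):
    """Drop the header, cue numbers, timestamps, and speaker names."""
    cleaned = []
    # Skip the first two lines (the WEBVTT header and the blank line after it).
    for line in raw_text.split("\n")[2:]:
        if line == "" or DIGIT_PATTERN.search(line):
            continue
        cleaned.append(NAME_PATTERN.sub("", line, count=1))
    return cleaned


# Hypothetical sample in the shape of a Zoom .vtt file.
sample = "\n".join([
    "WEBVTT",
    "",
    "1",
    "00:00:01.000 --> 00:00:04.000",
    "Jane Doe: Welcome everyone.",
    "",
    "2",
    "00:00:04.500 --> 00:00:07.000",
    "Let's get started.",
])
print(clean_vtt(sample))
```

Two limitations follow directly from the patterns. Any spoken line that happens to begin with a digit is discarded along with the timestamps. The name pattern only strips names of the form `First Last: `, so single-word or three-part names are left in place.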


## Calculate and print info about processed file

In [5]:
word_count = 0
for line in transcript_processed:
    word_count += len(line.split())
    
print(word_count)

# Took the first 50 lines, but could optimize this based on the token limit of the model
# In this case I use Vicuna 33B v1.3 with 2048 max context length.
# Join with a space so words at line boundaries do not run together.
transcript = ' '.join(transcript_processed[:50])

# # Generator expressions cover the first two counts
# processedCharCount = sum(len(line) for line in transcript_processed)
# processedWordCount = sum(len(line.split()) for line in transcript_processed)
# processedLineCount = len(transcript_processed)

# print(f"Processed transcript character count: {processedCharCount}")
# print(f"Processed transcript word count: {processedWordCount}")
# print(f"Processed transcript line count: {processedLineCount}")

8150
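
The comment in the cell above notes that the fixed 50-line cutoff could instead be based on the model's token limit. One way to sketch that idea is to keep adding lines until an estimated token budget is reached. The function name `take_lines_within_budget` is hypothetical, and the 0.75 words-per-token ratio is the same rough approximation discussed earlier, not an exact tokenizer count.

```python
def take_lines_within_budget(lines, max_tokens, words_per_token=0.75):
    """Return the longest prefix of lines whose estimated tokens fit max_tokens."""
    selected = []
    used_tokens = 0.0
    for line in lines:
        line_tokens = len(line.split()) / words_per_token
        if used_tokens + line_tokens > max_tokens:
            break
        selected.append(line)
        used_tokens += line_tokens
    return selected


# Each demo line has 3 words, so roughly 4 estimated tokens.
demo_lines = ["one two three", "four five six", "seven eight nine"]
print(take_lines_within_budget(demo_lines, max_tokens=8))
```

In practice the budget passed as `max_tokens` would need to be well below the model's context length (2048 for the model used here), since the system prompt and the generated reply share the same window.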


## Configure OpenAI API

In [6]:
openai.api_key = env["fastchat"]["api_key"]
openai.api_base = env["fastchat"]["base_url"]

# Test config by printing available models
models = openai.Model.list()
print(models)

{
  "object": "list",
  "data": [
    {
      "id": "vicuna-13b-v1.5-16k",
      "object": "model",
      "created": 1706900577,
      "owned_by": "fastchat",
      "root": "vicuna-13b-v1.5-16k",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-97JyKKoTFjGKFvJSnzwJjT",
          "object": "model_permission",
          "created": 1706900577,
          "allow_create_engine": false,
          "allow_sampling": true,
          "allow_logprobs": true,
          "allow_search_indices": true,
          "allow_view": true,
          "allow_fine_tuning": false,
          "organization": "*",
          "group": null,
          "is_blocking": false
        }
      ]
    },
    {
      "id": "vicuna-33b-v1.3",
      "object": "model",
      "created": 1706900577,
      "owned_by": "fastchat",
      "root": "vicuna-33b-v1.3",
      "parent": null,
      "permission": [
        {
          "id": "modelperm-3RbB5LgGNxUBxiXSeKfq6X",
          "object": "model_permission

## Ask the LLM to Perform the Analysis

In [7]:
# Model can be replaced with the model id from the previous call
model = "vicuna-33b-v1.3"
prompt = transcript

# create a chat completion
completion = openai.ChatCompletion.create(
  model=model,
  messages=[
      {"role": "system", "content": "You will be given a meeting transcript. From this transcript: Provide the top 3 items discussed. Provide a short 2 or 3 sentence summary. Provide 3 action items."},
      {"role": "user", "content": prompt}
  ]
)

# print the completion
print(completion.choices[0].message.content)

The top 3 items discussed are:

1. Vern is an instructional cluster that provides CPU and GPU resources to students in machine learning, data science, big data, and analytics courses.
2. Vern is part of a nationally distributed Kubernetes cluster, which is managed by the National Research Platform team and provides technical support and priority scheduling.
3. Vern is open for use by the National Research community, and any spare capacity is available to researchers.

A short 2 or 3 sentence summary is that Vern is an instructional cluster that provides resources for students in data science, machine learning, and big data courses. It is part of a nationally distributed Kubernetes cluster and is managed by the National Research Platform team.

Three action items are:

1. Researchers should familiarize themselves with Vern and Jupiter Hub.
2. Researchers should work with the Kubernetes software factory to containerize their software and run it on Vern.
3. Finance department members shou
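
The introduction lists jargon as one of the analysis goals, but the system prompt used above only asks for top items, a summary, and action items. A possible way to keep the prompt easy to extend is to assemble it from a list of tasks. This is a sketch under assumptions: the helper name `build_messages` and the jargon task wording are hypothetical and were not part of the original run, so the output shown above does not reflect them.

```python
def build_messages(transcript, tasks):
    """Build an OpenAI-style chat message list from a transcript and task list."""
    instructions = " ".join(tasks)
    system_prompt = (
        "You will be given a meeting transcript. "
        f"From this transcript: {instructions}"
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcript},
    ]


tasks = [
    "Provide the top 3 items discussed.",
    "Provide a short 2 or 3 sentence summary.",
    "Provide 3 action items.",
    # Hypothetical extra task, not in the original prompt:
    "List any jargon used.",
]
messages = build_messages("Example transcript text.", tasks)
print(messages[0]["content"])
```

The resulting list has the same shape as the `messages` argument passed to `openai.ChatCompletion.create` in the cell above.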