## Meeting Assistant UI

The task is to:
1. Take a meeting audio recording, uploaded through a Gradio interface.
2. Generate meeting minutes from it.
3. Generate action items from it.

I will use a frontier model (OpenAI's `whisper-1`) to convert the audio into text.<br>
I will use an open-source model (Llama 3.1 8B Instruct) to generate the minutes.<br>
The goal is to stream the result back as Markdown, including the action items.
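
The steps above amount to a small pipeline: an audio file goes in, a transcript sits in the middle, and Markdown minutes come out. Below is a minimal sketch of that shape. `run_pipeline`, `fake_transcribe` and `fake_minutes` are hypothetical names used only for illustration; the stand-ins replace the real models so the sketch runs without a GPU or an API key.

```python
from typing import Callable


def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    write_minutes: Callable[[str], str],
) -> str:
    """Chain the two stages: audio path -> transcript -> Markdown minutes."""
    transcript = transcribe(audio_path)
    return write_minutes(transcript)


# Hypothetical stand-ins so the sketch runs without any model or API key.
def fake_transcribe(audio_path: str) -> str:
    return f"(transcript of {audio_path})"


def fake_minutes(transcript: str) -> str:
    return f"## Minutes\n\n- Source: {transcript}"


print(run_pipeline("meeting.mp3", fake_transcribe, fake_minutes))
```

Passing the two stages in as callables keeps the speech-to-text step and the summarization step independent, so either model can be swapped out later.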


In [None]:
## installation
!pip install -q --upgrade torch==2.5.1+cu124 torchvision==0.20.1+cu124 torchaudio==2.5.1+cu124 --index-url https://download.pytorch.org/whl/cu124
!pip install -q requests bitsandbytes==0.46.0 transformers==4.48.3 accelerate==1.3.0

In [None]:
## imports
import os
import requests
from IPython.display import display, Markdown, update_display
from openai import OpenAI
from huggingface_hub import login
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch
import gradio as gr
import gc

In [None]:
## configuration
AUDIO_MODEL = "whisper-1"
LLAMA = "meta-llama/Llama-3.1-8B-Instruct"

In [None]:
## log in to hugging face
hf_token = userdata.get("HF_TOKEN")
login(hf_token, add_to_git_credential=True)

In [None]:
## openai configuration
OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
openai_client = OpenAI(api_key=OPENAI_API_KEY)

In [None]:
## transcription function
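def transcribe_audio(file_obj):
  # file_obj is the path to the uploaded audio file (gr.Audio with type="filepath")
  with open(file_obj, "rb") as audiofile:
    # response_format="text" makes the API return the transcript as a plain string
    transcription = openai_client.audio.transcriptions.create(
        model=AUDIO_MODEL,
        file=audiofile,
        response_format="text"
    )
    return transcription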
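
Long meetings can produce large audio files, and the transcription endpoint rejects uploads above a size limit. Below is a minimal sketch of a pre-upload check. `check_audio_size` and `MAX_UPLOAD_BYTES` are hypothetical names, and the 25 MB figure is an assumption based on the limit OpenAI documents for audio uploads; check the current API documentation before relying on it.

```python
import os

# Assumption: 25 MB upload limit for the transcription endpoint.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024


def check_audio_size(path: str, limit: int = MAX_UPLOAD_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds `limit`."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(
            f"{path} is {size / 1e6:.1f} MB; split or compress it before uploading."
        )
    return size
```

Calling `check_audio_size(file_obj)` before `transcribe_audio(file_obj)` would surface the problem locally instead of after a failed upload.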

In [None]:
## quantization configuration
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
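
The point of 4-bit quantization is to shrink the weights enough to fit on a single Colab GPU. Below is a rough back-of-the-envelope sketch. `weight_memory_gb` is a hypothetical helper, and the parameter count of about 8 billion is an assumption taken from the model name. The real footprint is somewhat higher, since the embedding layer is not quantized and the quantization constants add overhead.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring overhead."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 8 billion parameters, taken from the model's name.
N_PARAMS = 8e9

print(f"bf16 : {weight_memory_gb(N_PARAMS, 16):.1f} GB")
print(f"4-bit: {weight_memory_gb(N_PARAMS, 4):.1f} GB")
```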

In [None]:
tokenizer = AutoTokenizer.from_pretrained(LLAMA, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

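# device_map="auto" places the quantized weights on the GPU at load time,
# so no separate .to("cuda") call is needed
model = AutoModelForCausalLM.from_pretrained(
    LLAMA,
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)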
model.eval()


`low_cpu_mem_usage` was None, now default to True since model is quantized.


LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (up_proj): Linear4bit(in_features=4096, out_features=14336, bias=False)
          (down_proj): Linear4bit(in_features=14336, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((409

In [None]:
## function to generate meeting minutes
def generate_minutes(transcript):
  SYSTEM_PROMPT="""
  You are a very helpful assistant that can produce meeting minutes from transcripts.
  You provide a summary; key discussion point; takeaways and action items with owners.
  You provide the output in Markdown format.
  """
  USER_PROMPT=f"""
  Please write meeting minutes in Markdown, including summaries, key discussions, takeaways, and action items.
  If you find any name, please replace the name with an adjective and another noun like experiment names.
  Below is the transcript of the meeting:
  {transcript}
  """

  messages = [
      {"role": "system", "content": SYSTEM_PROMPT},
      {"role": "user", "content": USER_PROMPT}
  ]
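  print(messages)  # debug: show the prompt that is sent to the model

  # return_dict=True also returns the attention mask that generate() expects
  inputs = tokenizer.apply_chat_template(
      messages,
      add_generation_prompt=True,
      return_tensors="pt",
      return_dict=True
  ).to(model.device)

  # generate output
  outputs = model.generate(
      **inputs,
      max_new_tokens=2000,
      do_sample=True,
      temperature=0.7,
      pad_token_id=tokenizer.eos_token_id
  )

  # keep only the newly generated tokens, dropping the echoed prompt
  prompt_length = inputs["input_ids"].shape[-1]
  new_tokens = outputs[0, prompt_length:]
  response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
  return response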
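
The `new_tokens` slice relies on `generate` returning the prompt tokens followed by the newly generated tokens for a decoder-only model. Below is a toy version of that slicing using plain Python lists. `strip_prompt` is a hypothetical helper and the token IDs are made up for illustration.

```python
def strip_prompt(output_ids: list[int], prompt_length: int) -> list[int]:
    """Drop the first `prompt_length` IDs, keeping only generated tokens."""
    return output_ids[prompt_length:]


# Made-up token IDs, purely for illustration.
prompt_ids = [101, 7, 8, 9]
generated_ids = [42, 43, 44]
output_ids = prompt_ids + generated_ids  # prompt first, then the new tokens

print(strip_prompt(output_ids, len(prompt_ids)))
```

Decoding only this tail keeps the system prompt and the full transcript out of the minutes shown to the user.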

In [None]:
## pipeline wrapper
def meeting_assistant(file_obj):
  transcript = transcribe_audio(file_obj)
  minutes = generate_minutes(transcript)
  return minutes
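
The stated goal is to stream the result back, while `meeting_assistant` returns the full text in one go. Gradio treats a generator function as a streaming handler: each yielded value replaces the content of the output component. Below is a minimal sketch of the accumulation pattern. `stream_markdown` is a hypothetical helper and the chunks are hard-coded; with transformers they could come from a `TextIteratorStreamer` fed by `model.generate` running in a background thread, which is not shown here.

```python
from typing import Iterable, Iterator


def stream_markdown(chunks: Iterable[str]) -> Iterator[str]:
    """Yield the text accumulated so far after each incoming chunk."""
    text = ""
    for chunk in chunks:
        text += chunk
        yield text


# Hypothetical chunks standing in for text produced by the model.
for partial in stream_markdown(["## Minutes\n", "- Item one\n", "- Item two\n"]):
    print(repr(partial))
```

Yielding the accumulated text rather than the individual chunk matters because the Markdown component is re-rendered with whatever value it receives.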

In [None]:
with gr.Blocks() as demo:
  gr.Markdown("## Meeting Minutes Assistant")

  with gr.Row():
    audio_input = gr.Audio(sources=["upload"], type="filepath", label="upload meeting audio here")

  output = gr.Markdown()

  btn = gr.Button("Generate Summary")
  btn.click(meeting_assistant, inputs=audio_input, outputs=output)

demo.launch(debug=True)

It looks like you are running Gradio on a hosted Jupyter notebook, which requires `share=True`. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://a51f2720545b84b408.gradio.live


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


[{'role': 'system', 'content': '\n  You are a very helpful assistant that can produce meeting minutes from transcripts.\n  You provide a summary; key discussion point; takeaways and action items with owners.\n  You provide the output in Markdown format.\n  '}, {'role': 'user', 'content': "\n  Please write meeting minutes in Markdown, including summaries, key discussions, takeaways, and action items.\n  If you find any name, please replace the name with an adjective and another noun like experiment names.\n  Below is the transcript of the meeting:\n  I'm guessing in the session here today, is there anyone who's not had a phone screening call yet? Everyone's had a phone screening call, everyone's got, Srishti hasn't had a phone screening call yet. Anyone else who hasn't yet had a phone screening interview? No. All right. Just as revision for everyone else, right? So, what are the things you need to know about in a phone screening interview? Phone screening interviews usually are the firs