# More models and tools

AI is more than just large language model chatbots!

## Whisper

OpenAI has released other AI tools besides ChatGPT – one of the most popular is [Whisper](https://openai.com/research/whisper), a model that can **transcribe audio**. The fancy, technical name for this is "speech to text."

Unlike GPT, **you can actually download and use Whisper**. Python programmers can bop on over to [the GitHub repo](https://github.com/openai/whisper) and be coding with it in minutes.

Because Whisper is an open model (definition *to be discussed*), you'll see all sorts of Whisper-powered tools out there. [MacWhisper](https://goodsnooze.gumroad.com/l/macwhisper) allows you to transcribe audio from the safety of your Mac - powered by Whisper! [This random website](https://whisperui.com/) allows you to drag-and-drop audio files and transcribe them on the web – powered by Whisper!

And now we'll do the exact same thing right here, in Python – powered by Whisper! We'll start by **installing it.**

In [None]:
%pip install --quiet openai-whisper torch torchaudio whisperx pandas yt-dlp pydantic openai llama-index transformers sentence-transformers sacremoses timm

Just like spaCy or the Hugging Face models, Whisper isn't just one piece of software - it's a collection of models with different sizes and names that you have to download separately. When we use `whisper.load_model` below it will run out on the internet and grab the model we're asking for.

You can see [the models here](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). We're going to start with `tiny.en`, an English-only model that is the smallest and fastest.

In [None]:
import whisper

model = whisper.load_model("tiny.en")

Here's the audio we're going to transcribe. Yes, it's *very* short and not terribly complicated.

<audio controls src="sample-4.mp3"></audio>

The actual transcribing is just one line! We'll use `%%time` at the top of the cell to see how long it takes, so later we can compare the `tiny.en` model with some other, larger ones.

In [None]:
%%time
result = model.transcribe("sample-4.mp3")
result["text"]

2 seconds of audio transcribed in about 1 second! Pretty good, *except* for the fact that it says the incorrect "We've thrown" instead of the correct "We frown."

Let's try again with a slightly larger model, the medium English-only one.

In [None]:
model = whisper.load_model("medium.en")

In [None]:
%%time
result = model.transcribe("sample-4.mp3")
result["text"]

Changing to a slightly larger model really impacted our time! It took 4 seconds for a 2-second audio clip. But on the upside it was at least *correct*.

You can try [other models](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages), too. The non-`.en` ones are multilingual (to varying degrees), so give them a shot as well.

In [None]:
model = whisper.load_model("turbo")

In [None]:
%%time
result = model.transcribe("sample-4.mp3")
result["text"]

You can see various metrics about how good the transcription abilities are for each language, including CER, WER, BLEU and other scores. One thing to note is that in transcription an 80% score is far worse than an 80% score on, say, a math test. Having one out of every five words be wrong is... not great in practice.
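
For a sense of how a score like WER (word error rate) is computed, here is a minimal sketch: count the word-level edits needed to turn the model's output into the correct transcript, then divide by the number of words in the correct transcript. The function name `word_error_rate` and the example sentences are made up for illustration, and real evaluations usually normalize punctuation and spelling first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edits needed to match the reference words seen so far
    # against the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up sentences: two of the five reference words come out wrong
reference = "we frown at bad transcripts"
hypothesis = "we've thrown at bad transcripts"
print(word_error_rate(reference, hypothesis))
```

A WER of 0.2 is the "80% right" situation: on average, one word in five needs fixing.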

Never listen to scores when dealing with transcription tools, **always test them in the field.**

## But... Whisper is actually bad!

[According to everyone](https://apnews.com/article/ai-artificial-intelligence-health-business-90020cdf5fa16c79ca2e5b6c4c9bbb14), and the excellently-named paper [Careless Whisper: Speech-to-Text Hallucination Harms](https://dl.acm.org/doi/10.1145/3630106.3658996), Whisper makes *a lot of bad mistakes.*

> In an example they uncovered, a speaker said, “He, the boy, was going to, I’m not sure exactly, take the umbrella.”
>
> But the transcription software added: “He took a big piece of a cross, a teeny, small piece ... I’m sure he didn’t have a terror knife so he killed a number of people.”

One of the biggest problems is **silence**. Like human beings, Whisper isn't very good at silence! It's trained to transcribe transcribe transcribe, so when there's silence it tends to start writing regardless of what's going on.

One way to fix this is **voice activity detection**, which cuts out silences before it transcribes.
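
To get a feel for the idea, here is a toy sketch of the simplest possible detector: mark a chunk of audio as "speech" when it is loud enough. This is an assumption-heavy illustration, not how real VAD tools work (they generally rely on trained models), and the function name `speech_frames`, the frame size, and the threshold are all invented for the example.

```python
import math


def speech_frames(samples, frame_size=4, threshold=0.1):
    """Return (start, end) sample ranges whose RMS energy exceeds the threshold."""
    ranges = []
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms > threshold:
            end = start + len(frame)
            # Merge with the previous range when the two are adjacent
            if ranges and ranges[-1][1] == start:
                ranges[-1] = (ranges[-1][0], end)
            else:
                ranges.append((start, end))
    return ranges


# Toy signal: silence, a loud burst, silence, another burst
signal = [0.0] * 8 + [0.5, -0.4, 0.6, -0.5] * 2 + [0.0] * 4 + [0.3, -0.3, 0.2, -0.4]
voiced = speech_frames(signal)
print(voiced)
```

Only the ranges that come back would be handed to the transcription model, so it never gets the chance to make things up during the quiet parts.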

Remember how I said Whisper was an open model, and other people could build tools on top of it? As a result, we have great libraries like [WhisperX](https://github.com/m-bain/whisperX) which has add-ons like VAD, speaker diarization (splitting speakers!) and more. It's a little more unwieldy to use, but it's worth it.

In [None]:
import whisperx
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16 if device == "cuda" else 4
compute_type = "float16" if device == "cuda" else "int8" 

# 1. Load the Whisper model through WhisperX
model = whisperx.load_model("turbo", device, compute_type=compute_type)

In [None]:
audio_file = "How to Fix Holes in Drywall - 4 Easy Methods [uvQK7WTkKpI].webm"

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print("Transcribed")

model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)
print("Aligned")

In [None]:
import pandas as pd

df = pd.json_normalize(result['segments'])
df.head()

In [None]:
# Just grab the text
text = "\n".join(df['text'])
print(text[:500])

In [None]:
# Save it to a file
with open("transcript.txt", "w") as fp:
    fp.write(text)

In [None]:
len(text)

## Wait, where did that video come from?

[yt-dlp](https://github.com/yt-dlp/yt-dlp) is the best tool for downloading video (and audio) from YouTube, TikTok, or anywhere else. I always use ChatGPT to write the code for me because I can't remember all the little bits and pieces.

In [None]:
import yt_dlp

url = "https://www.youtube.com/watch?v=uvQK7WTkKpI"

ydl_opts = {
    'format': 'bestvideo[height<=720]+bestaudio/best[height<=720]',  # Max 720p
    'outtmpl': '%(title)s.%(ext)s',  # Save file as video title
    'merge_output_format': 'mp4',  # Ensure MP4 format
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])

If you want **just the audio:**

In [None]:
import yt_dlp

url = "https://www.youtube.com/watch?v=uvQK7WTkKpI"

ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': '%(title)s.%(ext)s',
    'postprocessors': [
        {'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3'},  # Converts to MP3
        {'key': 'FFmpegMetadata'},  # Embeds metadata
    ],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])

## Connecting with last session's LLM knowledge

Back to our transcript: **that text is so long!** We don't want to read it! We'd rather have a nice, simple summary.

So: **let's ask for one.**

In [None]:
# A new dataset

with open("transcript.txt", "r") as fp:
    transcript = fp.read()

transcript[:100]

In [None]:
# A new model

from pydantic import BaseModel, Field
from typing import Literal, List

class TranscriptSummary(BaseModel):
    """Data model for a transcript."""
    topic: str = Field(description="One-sentence blurb about transcript content")
    summary: str = Field(description="Summary of content covered, covering all major points")
    highlights: str = Field(description="Most interesting fact(s) from transcript")

In [None]:
# A new prompt

import os
from openai import OpenAI

os.environ['OPENAI_API_KEY'] = 'XXXXXXXX'

client = OpenAI()

prompt = transcript

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the relevant information from the following transcript."},
        {"role": "user", "content": prompt},
    ],
    response_format=TranscriptSummary,
)

response = completion.choices[0].message.parsed

In [None]:
response.model_dump()

## But what if it's just too long?

Each model has a **context window**, which is the amount of text it can handle at one time. To summarize something longer than that, you often use tools that break the text up into sections, summarize each part, then give you a final, overall summary.
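
The break-it-up-then-combine approach can be sketched in plain Python. Everything here is a simplified assumption: `split_into_chunks` and `summarize_long_text` are hypothetical helpers, chunk size is counted in characters rather than tokens, and `summarize` is a stand-in for whatever LLM call you would actually make.

```python
def split_into_chunks(text, max_chars=2000):
    """Group lines into chunks of at most max_chars characters.

    A single line longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for line in text.splitlines():
        candidate = f"{current}\n{line}" if current else line
        if len(candidate) > max_chars and current:
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_long_text(text, summarize, max_chars=2000):
    """Summarize each chunk, then summarize the combined chunk summaries."""
    partial = [summarize(chunk) for chunk in split_into_chunks(text, max_chars)]
    if len(partial) == 1:
        return partial[0]
    return summarize("\n".join(partial))


# Stand-in "summarizer" that keeps the first five words; swap in a real LLM call
def first_words(text):
    return " ".join(text.split()[:5])


print(summarize_long_text("line one is here\n" * 300, first_words, max_chars=500))
```

If the combined summaries are still too long for the context window, the same function can be applied again to its own output, which is roughly the "tree" in tree summarization.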

In [None]:
from llama_index.core.response_synthesizers import TreeSummarize

summarizer = TreeSummarize(verbose=True, output_cls=TranscriptSummary)

In [None]:
response = summarizer.get_response("Summarize the following YouTube video transcript", [transcript])

In [None]:
response.model_dump()

## What about not using GPT?

While GPT is the most *popular* tool, it isn't necessarily the best! Sometimes instead of media partnerships (OpenAI), you want more personality or coding abilities (Claude), or the largest context windows and the ability to directly take in audio/images/video (Gemini).

To use Gemini you'll need a [Gemini API key](https://aistudio.google.com/app/apikey).

In [None]:
%pip install --quiet --upgrade google-genai

In [None]:
from openai import OpenAI

client = OpenAI(
    api_key="XXXXXXXX",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

response = client.chat.completions.create(
    model="gemini-2.0-flash-lite",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": "What color is the sky?"
        }
    ]
)

print(response.choices[0].message)

In [None]:
# A new prompt

import os
from openai import OpenAI

client = OpenAI(
    api_key="XXXXXXXX",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

prompt = transcript

completion = client.beta.chat.completions.parse(
    model="gemini-2.0-flash-lite",
    messages=[
        {"role": "system", "content": "Extract the relevant information from the following transcript."},
        {"role": "user", "content": prompt},
    ],
    response_format=TranscriptSummary,
)

response = completion.choices[0].message.parsed

In [None]:
response

### But Gemini can also take audio directly!

And video, for that matter! First we'll download the audio with yt-dlp (naming it directly after the ID this time), then we'll feed it into Gemini.

In [None]:
import yt_dlp

url = "https://www.youtube.com/watch?v=uvQK7WTkKpI"

ydl_opts = {
    'format': 'bestaudio/best',
    'outtmpl': '%(id)s.%(ext)s',
    'postprocessors': [
        {'key': 'FFmpegExtractAudio', 'preferredcodec': 'mp3'},  # Converts to MP3
        {'key': 'FFmpegMetadata'},  # Embeds metadata
    ],
}

with yt_dlp.YoutubeDL(ydl_opts) as ydl:
    ydl.download([url])

In [None]:
import base64
from openai import OpenAI

client = OpenAI(
    api_key="XXXXXXXX",
    base_url="https://generativelanguage.googleapis.com/v1beta/openai/"
)

with open("uvQK7WTkKpI.mp3", "rb") as audio_file:
    base64_audio = base64.b64encode(audio_file.read()).decode('utf-8')

completion = client.beta.chat.completions.parse(
    model="gemini-2.0-flash-lite",
    messages=[{
        "role": "user",
        "content": [
            { "type": "text", "text": "Extract the relevant information from the following YouTube audio", },
            { "type": "input_audio", "input_audio": { "data": base64_audio, "format": "mp3" } }
        ],
    }],
    response_format=TranscriptSummary
)

response = completion.choices[0].message.parsed

In [None]:
response

## Local models

A "local model" is an LLM or similar tool that runs on your own computer, not out in the cloud.

Local models are important to discuss because **journalists are very cheap, and also obsessed with privacy**. If we can avoid using the cloud it might make sense to do so (although... maybe not?). To start, I'm going to link to [DeepSeek right here](https://huggingface.co/unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF), `unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF`, because the name is so absurd. Look at all those pieces:

- **unsloth/:** the makers of unsloth produced some aspect of this
- **DeepSeek:** This is a DeepSeek
- **R1:** reasoning model which has been
- **Distill:** distilled into
- **Qwen:** a Qwen model
- **1.5B:** that has 1.5 billion parameters
- **GGUF:** transformed into a GGUF file format

AFAIK the last one is the step from unsloth. **We'll talk about what each one of these means as we go on.** Or maybe, hopefully, potentially!

Even though you *can* write Python code directly to use the LLMs in their current state, using a tool on top of it is going to make things a lot easier.

### LM Studio and Quantized models

[LM Studio](https://github.com/lmstudio-ai) is a piece of software you load up on your computer that allows you to browse, download, and use LLM models. You can jump through all sorts of weird Python situations, but LM Studio is just *so easy* – you click around, it tells you what to do, what models will work, etc, and life is perfect.

The goal of using LM Studio is to be able to track down models that fit on your computer, **even versions of big models that are sized to fit on your computer.**

When I say "big models," "big" refers to the **number of parameters**, the numbers inside the model that get tuned during training and store what it has learned. For example, a 7B model has 7 billion parameters, while a 300B model has 300 billion parameters. Larger models are typically slower and require fancier computers, but are almost always better.

> This explanation of quantized models is *almost certainly very inaccurate*, but it's a good conceptual framework.

To make these giant models fit on your computer, they are **quantized**, or made into smaller versions. To a large degree a quantized model is just a smaller version of a big model that's aaaaaalmost as capable. Because an LLM is a big magic box of numerical calculations, you can make it smaller by just **rounding off the numbers inside of it**. The original numbers are big and long and specific, like this:

- 3.4967845395720573
- 6.1857232150673637
- 1.3792183003746249

Quantized models take those numbers and just chop off the end. They typically come in q8, q6, a smattering of q4's, and q2 versions. Here's a completely fake example of what the quantized numbers might look like:

|q8|q6|q4|q2|
|---|---|---|---|
|3.49678453|3.496784|3.4967|3.49|
|6.18572321|6.185723|6.1857|6.18|
|1.37921830|1.379218|1.3792|1.37|
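
The same fake table can be reproduced in a few lines of Python. This follows the conceptual picture above (chopping decimal places), which is *not* how real quantization formats store numbers; the helper name `chop` and the mapping from q-level to decimal places are assumptions made to match the table.

```python
import math


def chop(value, decimals):
    """Drop everything past the given number of decimal places (no rounding up)."""
    factor = 10 ** decimals
    return math.trunc(value * factor) / factor


original = 3.4967845395720573
# Pretend each "q" level keeps that many decimal places, as in the table
for name, decimals in [("q8", 8), ("q6", 6), ("q4", 4), ("q2", 2)]:
    chopped = chop(original, decimals)
    print(name, chopped, "error:", abs(original - chopped))
```

The printed error grows as more digits are thrown away, which is the trade-off: each step down saves space but drifts further from the original numbers.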

Let's say you have a 36B model which requires 40GB of RAM (actual memory, not disk space) to load into use on your computer. If you only have 16GB of RAM, that isn't going to work! You *could* go with the 7B model, but that's about 80% fewer parameters, which means the model is a *lot* dumber. Instead, you'll go for a q4 version of the 36B model – the numbers are rounded off which makes the model a lot smaller, but having more parameters usually matters more than having more precise numbers.

Usually q4 is still fine, but anything lower starts to know too little.

In [None]:
with open("transcript.txt") as fp:
    transcript = fp.read()

print(transcript[:400])

Now you'll need to **load up LM Studio** and load your model. You can see some pictures of how to do that [here](https://jonathansoma.com/words/olmocr-on-macos-with-lm-studio.html) (although you won't be looking for olmocr). Just look for anything popular that fits on your computer!

We talk to LM Studio using the same OpenAI library as when we talk to GPT – every tool seems to have standardized on the conversation format they use. We just set the `base_url` and `api_key` according to what LM Studio tells us to use.

In [None]:
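from openai import OpenAI

# LM Studio is :1234, Msty is :10000 - use whichever your tool reports
client = OpenAI(
    base_url="http://127.0.0.1:10000/v1",
    api_key="lm-studio"
)


# No structured output here, so a plain chat completion is enough
completion = client.chat.completions.create(
    model="deepseek-r1:1.5b",
    messages=[
        {"role": "user", "content": "Who are you?"},
    ]
)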

In [None]:
print(completion.choices[0].message.content)

In [None]:
from openai import OpenAI

# LM Studio is :1234, Msty is :10000
client = OpenAI(
    base_url="http://127.0.0.1:10000/v1",
    api_key="lm-studio"
)


completion = client.chat.completions.create(
    model="deepseek-r1:1.5b",
    messages=[
        {"role": "system", "content": "Summarize this transcript in four descriptive bullet points."},
        {"role": "user", "content": transcript},
    ]
)

In [None]:
print(completion.choices[0].message.content)

It's slow, but looks great!

## What can we do with this information?

- [GitHub Actions](https://github.com/features/actions)
- [Mailgun](https://www.mailgun.com/)

There are *two GitHub Actions sessions?* But they were in the past, so instead [watch this video](https://www.youtube.com/watch?v=QNKxzkNpsko&list=PLewNEVDy7gq17ju86mZqPzr2mGwVLwNNM&index=1) and [learn to use secrets](https://docs.github.com/en/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions) and you'll be good to go. Maybe a little [Claude](https://claude.ai/) or [ChatGPT](https://chatgpt.com/) magic.

> I have a script that transcribes an audio file and summarizes/gets bullet points out of it. I want to make it download the most recent file from a youtube playlist and email me the transcript/summary/highlights. I want this to happen every Monday morning. You're going to help me do this!
>
> The workflow should use GitHub Actions and Mailgun. I want to drag-and-drop my GitHub files because I don't really understand git. Step me through each part of the process. My code currently is in a Jupyter notebook, so you might have to ask me to copy and paste it in here so you can see what I'm doing. Make sure I copy all of the relevant portions.
>
> Let's do this **step by step**. Set out a plan and walk me through it, making sure I understand what's happening in each step. Don't proceed to the next step until you're comfortable that I've done what is necessary. Before each step in the process, ask if I have code that does it already or if we need to do it from scratch. Don't assume I know much of anything, as I'm working from a template/tutorial (e.g. I maybe have heard of yt-dlp, but don't really understand what it is).

## Hugging Face

Hugging Face is like **YouTube for AI.** All the models go there!

There are a [zillion models on Hugging Face](https://huggingface.co/models), but the top ones appear to be ALL text generation models. Since we know that [ones that cost money are the best](https://lmarena.ai/?leaderboard), we'll ignore them for now.

By the second or third page you see a few text classification or sentence similarity models, mostly due to the popularity of "retrieval augmented generation," the idea that we can ask a question to an LLM, it finds relevant sentences, then answers a question with them. We don't want to do that, either!

As we scroll and scroll and scroll, we eventually come across [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli), which is a text classification model from Facebook. Most normal classification models only know some specific categories to put things in - positive tweets vs negative tweets, for example – but `facebook/bart-large-mnli` can categorize... anything?

### Putting things in categories

Now that we've settled on [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli), let's use it!

To use the model, we're going to click the "use in Transformers" link on the top left. That will give us the base code for loading the model with the `pipeline` tool. Then we'll scroll down on the page itself to see if there's an example. And there is!

In [None]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

In [None]:
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

In [None]:
sequence_to_classify = "I'm tired from so much ballet, but it's time to make lunch"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)
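
The classifier hands back a dictionary with the original `sequence` plus matching `labels` and `scores` lists, sorted from most to least likely. Here is a small sketch for pulling out the labels that clear a cutoff. `labels_above` is a hypothetical helper, and the result dictionary is hand-written with invented scores so the example runs without downloading the model.

```python
def labels_above(result, cutoff=0.3):
    """Return (label, score) pairs from a zero-shot result whose score meets the cutoff."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= cutoff
    ]


# Hand-written stand-in for classifier(...) output; the scores are invented
example_result = {
    "sequence": "I'm tired from so much ballet, but it's time to make lunch",
    "labels": ["dancing", "cooking", "travel"],
    "scores": [0.55, 0.40, 0.05],
}

print(labels_above(example_result))
```

By default the scores are split across the candidate labels, so a sentence about two topics can look lukewarm on both. Passing `multi_label=True` to the classifier scores each label independently, which may suit cases like the ballet-and-lunch sentence better.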

### Other tasks

While we're used to just asking questions back and forth with ChatGPT all day, most questions involving language or images are actually pre-defined tasks that have been studied for decades. For example, "put this text into a category" is called **classification**.

You can see a ton of examples of different machine learning tasks on Hugging Face's [tasks page](https://huggingface.co/tasks).

### Translation

For example, [translation is one option](https://huggingface.co/tasks/translation). It comes with a [small example](https://huggingface.co/tasks/translation#inference) that seems easy enough:

In [None]:
from transformers import pipeline

en_fr_translator = pipeline("translation_en_to_fr")
en_fr_translator("How old are you?")

I showed this to a French person and they *laughed!* It's a word by word translation, not actually how French is spoken. Such is the state of machine learning, *c'est la vie.*

> It's probably important to think about how even though you can get *an answer* from a tool like this, it doesn't mean it's a *good answer*. It's easy to be distracted by AI seeming fancy and confident, when really it's just a computer pushing numbers around!

There's [another example on that page](https://huggingface.co/tasks/translation#inference), but they screwed it up! It uses another model that, if prompted, gives the correct translation of "How old are you?". I guess they wanted to mix things up on the page, so they changed it to translate "How are you?". We'll go with what was intended below:

In [None]:
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
translator("How old are you?")

When translating, you request the pipeline `translation_xx_to_yy`, where `xx` is the source language and `yy` is the target language. Not all models support all languages, so you might have to [poke around for what you want](https://huggingface.co/models?pipeline_tag=translation) (the languages tab isn't even always the best route: sometimes the model you want is only filed under "multilingual").

There are two English-Chinese models that are ranked high: one is [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh) and the other is [Helsinki-NLP/opus-mt-zh-en](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en). If we don't read the documentation we won't notice that for going from English to Chinese we need the first one, `opus-mt-en-zh`.

In [None]:
from transformers import pipeline

translator = pipeline("translation_en_to_zh", model="Helsinki-NLP/opus-mt-en-zh")
translator("How old are you?")

We can check the translation in the opposite direction by switching both the model and the pipeline name.

In [None]:
from transformers import pipeline

translator = pipeline("translation_zh_to_en", model="Helsinki-NLP/opus-mt-zh-en")
translator("你几岁了?")
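
Both directions follow the same naming pattern, which a tiny helper can capture. `translation_names` is a hypothetical function, and it only builds strings: not every language pair has a matching `Helsinki-NLP/opus-mt-*` model, so check that the model exists on Hugging Face before passing these names to `pipeline`.

```python
def translation_names(source, target):
    """Build the pipeline task name and the Helsinki-NLP model ID for a language pair."""
    task = f"translation_{source}_to_{target}"
    model = f"Helsinki-NLP/opus-mt-{source}-{target}"
    return task, model


task, model = translation_names("en", "zh")
print(task, model)
```

The pair could then be used as `pipeline(task, model=model)`, the same call as in the cells above.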

If we try [this multilingual model](https://huggingface.co/facebook/nllb-200-distilled-600M) suddenly everything gets very crazy very quickly.

In [None]:
from transformers import pipeline

translator = pipeline(
    "translation_ja_to_en",
    model="facebook/nllb-200-distilled-600M")
translator("私は鉛筆です")

Why is it so bad??? Because despite not telling us how to use the model, `translation_xx_to_yy` apparently is *not* how you use this model, and we apparently need to use [some other weird language codes](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) that we pass in as `src_lang` and `tgt_lang`.

In [None]:
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang='jpn_Jpan',
    tgt_lang='eng_Latn')
translator("私は鉛筆です")

I honestly don't know how we were supposed to learn how to do this. I figured it out by reading [the code](https://huggingface.co/spaces/Geonmo/nllb-translation-demo/blob/main/app.py) of one of [the demo spaces](https://huggingface.co/spaces/Geonmo/nllb-translation-demo).

## Wait, what's a "demo space?"

When you hear about new models or approaches that are released, you immediately want to try them. Like DeepSeek!

In the next notebook we'll see **how to effectively experiment with these tools**.

## Reflection

How can you trust anything?? Even if we're impressed by output from AI at first blush, it might not have the consistent quality necessary to perform *real* tasks. Or maybe it does?