In [None]:
!wget --quiet -O items.zip https://github.com/jsoma/nicar24-beyond-chatgpt/raw/main/items.zip
!unzip -o -q items.zip

# More models and tools

AI is more than just large language model chatbots!

## Whisper

OpenAI has released other AI tools besides ChatGPT – one of the most popular is [Whisper](https://openai.com/research/whisper), a model that can **transcribe audio**. The fancy, technical name for this is "speech to text."

Unlike GPT, **you can actually download and use Whisper**. Python programmers can bop on over to [the GitHub repo](https://github.com/openai/whisper) and be coding with it in minutes.

Because Whisper is an open model (definition *to be discussed*), you'll see all sorts of Whisper-powered tools out there. [MacWhisper](https://goodsnooze.gumroad.com/l/macwhisper) allows you to transcribe audio from the safety of your Mac - powered by Whisper! [This random website](https://whisperui.com/) allows you to drag-and-drop audio files and transcribe them on the web – powered by Whisper!

And now we'll do the exact same thing right here, in Python – powered by Whisper! We'll start by **installing it.**

In [None]:
!pip install --quiet openai-whisper

Just like spaCy or the Hugging Face models, Whisper isn't just one piece of software - it's a collection of models with different sizes and names that you have to download separately. When we use `whisper.load_model` below it will run out on the internet and grab the model we're asking for.

You can see [the models here](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages). We're going to use `tiny.en`, an English-only model that is the smallest and fastest.

In [None]:
import whisper

model = whisper.load_model("tiny.en")

Here's the audio we're going to transcribe. Yes, it's *very* short and not terribly complicated.

<audio controls src="sample-4.mp3"></audio>

The actual transcribing is just one line! We'll use `%%time` at the top of the cell to see how long it takes, so later we can compare the `tiny.en` model with some other, larger ones.

In [None]:
%%time
result = model.transcribe("sample-4.mp3")
result["text"]

2 seconds of audio transcribed in about 1 second! Pretty good, *except* for the fact that it says the incorrect "We've thrown" instead of the correct "We frown."
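
Besides `"text"`, the dictionary that `transcribe` returns also has a `"segments"` list, where each segment carries `start` and `end` times (in seconds) along with its own `text`. Here's a minimal sketch of turning that into timestamped lines. The `format_segments` helper and the sample dictionary are made up for illustration, so the timings are not real model output.

```python
def format_segments(result):
    """Turn a Whisper-style result dict into timestamped lines."""
    lines = []
    for segment in result.get("segments", []):
        start = segment["start"]
        end = segment["end"]
        text = segment["text"].strip()
        lines.append(f"[{start:6.2f}s -> {end:6.2f}s] {text}")
    return lines


# Hand-written stand-in shaped like the output of model.transcribe().
# The timings and text are placeholders, not real output.
sample_result = {
    "text": " We frown",
    "segments": [{"start": 0.0, "end": 2.0, "text": " We frown"}],
}

for line in format_segments(sample_result):
    print(line)
```

With a real transcription you would call `format_segments(result)` on the `result` from the cell above.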

Let's try again with a larger model, the medium English-only one.

In [None]:
model = whisper.load_model("medium.en")

In [None]:
%%time
result = model.transcribe("sample-4.mp3")
result["text"]

Changing to a larger model really impacted our time! It took 4 seconds for a 2-second audio clip. But on the upside it was at least *correct*.
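
`%%time` only prints the timing, so comparing models means reading numbers off the screen. Here's a small sketch of capturing the elapsed time in a variable with the standard library instead. `timed` and `real_time_factor` are hypothetical helper names, and the figures plugged in at the bottom are just the rough ones quoted above.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (its result, elapsed seconds)."""
    start = time.perf_counter()
    value = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return value, elapsed


def real_time_factor(processing_seconds, audio_seconds):
    """Below 1.0 is faster than real time, above 1.0 is slower."""
    return processing_seconds / audio_seconds


# Rough numbers from the two runs above, both on a 2-second clip
print("tiny.en:", real_time_factor(1, 2))
print("medium.en:", real_time_factor(4, 2))
```

In the notebook this would look something like `result, seconds = timed(model.transcribe, "sample-4.mp3")`.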

You can try [other models](https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages), too. The non-`.en` ones are multilingual (to varying degrees), so give them a shot as well.

In [None]:
model = whisper.load_model("small")

You can see various metrics about how good the transcription abilities are for each language, including CER, WER, BLEU and other scores. One thing to note is that in transcription an 80% score is far worse than an 80% score on, say, a math test. Having one out of every five words be wrong is... not great in practice.
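
To make one of those scores concrete: WER (word error rate) is usually described as the number of word-level substitutions, insertions and deletions needed to turn the transcript into the correct text, divided by the number of words in the correct text. Below is a simplified sketch. Real evaluations normalize punctuation and casing more carefully, the `word_error_rate` name is just for illustration, and the sentences are invented rather than taken from our audio.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Classic dynamic-programming edit distance, one row at a time
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Invented example: two wrong words out of ten
correct = "the quick brown fox jumps over the lazy dog today"
transcribed = "the quick brown box jumps over the hazy dog today"
print(word_error_rate(correct, transcribed))
```

Two wrong words in a ten-word sentence is a WER of 0.2, which is roughly what an "80%" transcript looks like.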

Never listen to scores when dealing with transcription tools, **always test them in the field.**

## Hugging Face

There are a [zillion models on Hugging Face](https://huggingface.co/models), but the top ones appear to be ALL text generation models. Since we know that [GPT-4 is the best](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard) there isn't much use in using any of the other generative text models (...for the moment).

By the second or third page you see a few text classification or sentence similarity models, mostly due to the popularity of "retrieval augmented generation," the idea that when we ask an LLM a question, a retrieval step first finds relevant sentences and the LLM then answers using them. We don't want to do that, either!

As we scroll and scroll and scroll, we eventually come across [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli), which is a text classification model from Facebook. Most normal classification models only know some specific categories to put things in - positive tweets vs negative tweets, for example – but `facebook/bart-large-mnli` can categorize... anything?

### Putting things in categories

Now that we've settled on [facebook/bart-large-mnli](https://huggingface.co/facebook/bart-large-mnli), let's install what we need in order to use it.

In [None]:
!pip install --quiet transformers torch sentence-transformers sacremoses

To use the model, we're going to click the "use in Transformers" link on the top left. That will give us the base code for loading the model with the `pipeline` tool. Then we'll scroll down on the page itself to see if there's an example. And there is!

In [None]:
from transformers import pipeline
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

In [None]:
sequence_to_classify = "one day I will see the world"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)

In [None]:
sequence_to_classify = "I'm tired from so much ballet, but it's time to make lunch"
candidate_labels = ['travel', 'cooking', 'dancing']
classifier(sequence_to_classify, candidate_labels)
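
The classifier hands back a dictionary with the original `sequence`, the `labels` sorted from most to least likely, and a matching list of `scores`. Here's a minimal sketch of pulling out only the labels that clear a cutoff. `labels_above` is a hypothetical helper, the 0.5 threshold is arbitrary, and the scores in the example are invented rather than real model output.

```python
def labels_above(result, threshold=0.5):
    """Return (label, score) pairs whose score meets the threshold."""
    return [
        (label, score)
        for label, score in zip(result["labels"], result["scores"])
        if score >= threshold
    ]


# Hand-written stand-in with the same keys as the classifier output.
# These scores are made up for illustration.
example_output = {
    "sequence": "one day I will see the world",
    "labels": ["travel", "dancing", "cooking"],
    "scores": [0.9, 0.06, 0.04],
}

print(labels_above(example_output))
```

By default the scores are normalized across the candidate labels, so a sentence about both ballet and lunch has to split its score between `dancing` and `cooking`. The pipeline also accepts `multi_label=True`, as in `classifier(sequence_to_classify, candidate_labels, multi_label=True)`, which scores each label independently.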

### Other tasks

While we're used to just asking questions back and forth with ChatGPT all day, most questions involving language or images are actually pre-defined tasks that have been studied for decades. For example, "put this text into a category" is called **classification**.

You can see a ton of examples of different machine learning tasks on Hugging Face's [tasks page](https://huggingface.co/tasks).

### Translation

For example, [translation is one option](https://huggingface.co/tasks/translation). It comes with a [small example](https://huggingface.co/tasks/translation#inference) that seems easy enough:

In [None]:
from transformers import pipeline

en_fr_translator = pipeline("translation_en_to_fr")
en_fr_translator("How old are you?")

I showed this to a French person and they *laughed!* It's a word by word translation, not actually how French is spoken. Such is the state of machine learning, *c'est la vie.*

> It's probably important to think about how even though you can get *an answer* from a tool like this, it doesn't mean it's a *good answer*. It's easy to be distracted by AI seeming fancy and confident, when really it's just a computer pushing numbers around!

There's [another example on that page](https://huggingface.co/tasks/translation#inference), but they screwed it up! It uses a different model that, when given "How old are you?", produces the correct translation. I guess they wanted to mix things up on the page, though, and changed the input to "How are you?". We'll go with what was intended below:

In [None]:
from transformers import pipeline

translator = pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
translator("How old are you?")

When translating, you request the pipeline `translation_xx_to_yy`, where `xx` is the source language and `yy` is the target language. Not all models support all languages, so you might have to [poke around for what you want](https://huggingface.co/models?pipeline_tag=translation) (the languages tab isn't even always the best route: sometimes the model you want is only filed under "multilingual").

There are two highly ranked English-Chinese models: [Helsinki-NLP/opus-mt-en-zh](https://huggingface.co/Helsinki-NLP/opus-mt-en-zh) and [Helsinki-NLP/opus-mt-zh-en](https://huggingface.co/Helsinki-NLP/opus-mt-zh-en). If we don't read the documentation, we won't notice that going from English to Chinese requires the first one, `opus-mt-en-zh`.
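
Since the direction is baked into both the pipeline name and the model name, it's easy to mismatch them. Here's a small sketch that builds both strings from one pair of language codes. The helper names are hypothetical, and the `Helsinki-NLP/opus-mt-xx-yy` pattern is only inferred from the three models used in this notebook, so not every language pair will actually exist on Hugging Face.

```python
def translation_task(source, target):
    """Build the pipeline task name, e.g. translation_en_to_fr."""
    return f"translation_{source}_to_{target}"


def opus_mt_model(source, target):
    """Guess the Helsinki-NLP model name for a pair (assumed pattern)."""
    return f"Helsinki-NLP/opus-mt-{source}-{target}"


for source, target in [("en", "fr"), ("en", "zh"), ("zh", "en")]:
    print(translation_task(source, target), opus_mt_model(source, target))
```

Both strings can then be passed along together, as in `pipeline(translation_task("en", "zh"), model=opus_mt_model("en", "zh"))`.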

In [None]:
from transformers import pipeline

translator = pipeline("translation_en_to_zh", model="Helsinki-NLP/opus-mt-en-zh")
translator("How old are you?")

We can check the translation in the opposite direction by switching both the model and the pipeline name.

In [None]:
from transformers import pipeline

translator = pipeline("translation_zh_to_en", model="Helsinki-NLP/opus-mt-zh-en")
translator("你几岁了?")
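
Translating there and back again is a cheap sanity check when you can't read the target language. Here's a minimal sketch of wrapping that up as a function. `round_trip` is a hypothetical helper, and it relies on the translation pipeline returning a list of dictionaries with a `translation_text` key.

```python
def round_trip(text, forward, backward):
    """Translate text with `forward`, then translate it back with `backward`.

    Both arguments are expected to behave like translation pipelines:
    called with a string, they return [{"translation_text": "..."}].
    """
    translated = forward(text)[0]["translation_text"]
    back = backward(translated)[0]["translation_text"]
    return translated, back


# Usage, assuming the two pipelines above were saved under separate
# names (in this notebook they are both called `translator`):
#
#   translated, back = round_trip("How old are you?", en_zh, zh_en)
```

If `back` comes out close to the original sentence that's a good sign, but it isn't proof: two models can make compatible mistakes in each direction.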

If we try [this multilingual model](https://huggingface.co/facebook/nllb-200-distilled-600M) suddenly everything gets very crazy very quickly.

In [None]:
from transformers import pipeline

translator = pipeline(
    "translation_ja_to_en",
    model="facebook/nllb-200-distilled-600M")
translator("私は鉛筆です")

Why is it so bad??? Because, even though the model page doesn't tell us how to use it, `translation_xx_to_yy` is apparently *not* how you use this model. Instead we need to use [some other weird language codes](https://github.com/facebookresearch/flores/blob/main/flores200/README.md) that we pass in as `src_lang` and `tgt_lang`.

In [None]:
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang='jpn_Jpan',
    tgt_lang='eng_Latn')
translator("私は鉛筆です")

I honestly don't know how we were supposed to learn how to do this. I figured it out by reading [the code](https://huggingface.co/spaces/Geonmo/nllb-translation-demo/blob/main/app.py) of one of [the demo spaces](https://huggingface.co/spaces/Geonmo/nllb-translation-demo).
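
Those codes follow a pattern: a three-letter language code, an underscore, then a four-letter script name. Here's a small sketch of a lookup so you don't have to memorize them. The `FLORES_CODES` dictionary and `nllb_languages` helper are made up for this example and only cover a handful of languages; the linked FLORES-200 README has the full list.

```python
# A few FLORES-200 codes; see the linked README for the rest
FLORES_CODES = {
    "English": "eng_Latn",
    "French": "fra_Latn",
    "Spanish": "spa_Latn",
    "Japanese": "jpn_Jpan",
    "Chinese (Simplified)": "zho_Hans",
}


def nllb_languages(source, target):
    """Return src_lang / tgt_lang keyword arguments for an NLLB pipeline."""
    return {
        "src_lang": FLORES_CODES[source],
        "tgt_lang": FLORES_CODES[target],
    }


print(nllb_languages("Japanese", "English"))
```

The result can be unpacked straight into the call from the previous cell: `pipeline("translation", model="facebook/nllb-200-distilled-600M", **nllb_languages("Japanese", "English"))`.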

## Reflection

How can you trust anything?? Even if we're impressed by output from AI at first blush, it might not have the consistent quality necessary to perform *real* tasks. Or maybe it does?