# Open source project: whisper.cpp (based on OpenAI Whisper)

[OpenAI Whisper](https://github.com/openai/whisper) is a well-known, high-quality model for ASR (Automatic Speech Recognition). However, the original implementation requires more than 10 GB of VRAM for the large model, which supports more than 50 languages.

A good alternative is [whisper.cpp](https://github.com/ggerganov/whisper.cpp), a high-performance implementation in C and C++ that requires about 4 GB of VRAM for the large model.


## Create a new environment:

```bash
conda create -n env_mc_asr python=3.12
```

Install the packages that you need (jupyterlab, pandas, matplotlib, etc.).

## Cloning the Repository

Open a terminal in your development/projects folder and clone the repository:

```bash
git clone https://github.com/ggerganov/whisper.cpp.git
```

## Downloading the Whisper Model

Download the base model (base.en) using the shell script from the repository:

```bash
cd whisper.cpp
```

```bash
./models/download-ggml-model.sh base.en
```

## Build the Application

Use the `make` command to compile the source code into an executable (a Makefile is provided in the repository):

```bash
make
```

## Run the Application and Transcribe a Sample Audio

Run the application and transcribe a sample file:

```bash
./main -m models/ggml-base.en.bin -f samples/jfk.wav
```

See other parameters:

```bash
./main -h
```

Download the large model (version 3):

```bash
./models/download-ggml-model.sh large-v3      
```

Transcribe a sample file using the large model:

```bash
./main -m models/ggml-large-v3.bin -f samples/jfk.wav
```

## Possible issues

Both the official OpenAI Whisper and whisper.cpp struggle when the audio contains silence. Even pauses of less than 1 second can cause problems. The model attempts to transcribe the silent parts and starts to hallucinate, producing random text or, more often, repeating the same text multiple times.

Several sources document this behavior:

* https://github.com/ggerganov/whisper.cpp/issues/1507
* https://deepgram.com/learn/whisper-v3-results
* https://github.com/openai/whisper/discussions/1606
* https://huggingface.co/spaces/openai/whisper/discussions/74
* https://github.com/ggerganov/whisper.cpp/pull/1588
* https://github.com/ggerganov/whisper.cpp/issues/1724
* https://github.com/ggerganov/whisper.cpp/pull/1768#issuecomment-1924743917


There are [ways](https://github.com/ggerganov/whisper.cpp/issues/1507#issuecomment-1816263320) to reduce the probability of hallucination (unfortunately, they can have side effects):
> Here are some strategies that I've observed to reduce repetition and hallucinations:
> * Use 5 beams
> * Increase entropy threshold from the default 2.4 to 2.8 for example. Higher threshold will reject repetitive text and fallback to sampling with higher temperature
> * Reduce the maximum context size (--max-context). By default it is 224. Setting it to 64 or 32 can reduce the repetitions significantly. Setting it to 0 will most  likely eliminate all repetitions, but the transcription quality can be affected because it will be losing the context from the previous transcript

Also, the whisper.cpp developers plan to release a new version in the near future that should fix some of these problems.
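
The strategies quoted above correspond to command-line options of the `main` example. The sketch below assembles such a command in Python. The helper name `build_whisper_command` and its default values are illustrative only, and the long flag names (`--beam-size`, `--entropy-thold`, `--max-context`) should match what `./main -h` prints, but check them against your own build.

```python
import shlex

def build_whisper_command(model, audio, beam_size=5, entropy_thold=2.8, max_context=64):
    """Assemble a whisper.cpp command with settings that may reduce repetitions.

    Illustrative helper, not part of whisper.cpp; verify flag names with ./main -h.
    """
    return [
        "./main",
        "-m", model,
        "-f", audio,
        "--beam-size", str(beam_size),          # "Use 5 beams"
        "--entropy-thold", str(entropy_thold),  # raised from the default 2.4
        "--max-context", str(max_context),      # lowered from the default 224
    ]

cmd = build_whisper_command("models/ggml-large-v3.bin", "samples/jfk.wav")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from inside the `whisper.cpp` directory. Keeping the values as function arguments makes it easy to compare several settings on the same audio.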


## Transcribe the real audio

Transcribe the audio extracted from the video earlier as detailed in the [Data Source](./01_main.ipynb) using the large model with additional flags:

```bash
./main -m models/ggml-large-v3.bin -f ~/Sync/AI/My_projects/av2txtsum/data/audio.wav --output-file ~/Sync/AI/My_projects/av2txtsum/data/whisper_output/audio --output-txt --output-srt --output-csv --output-json
```

where
* ./main - the executable file located in the current directory;
* -m models/ggml-large-v3.bin - the model that we use (the large model in this case);
* -f ~/Sync/AI/My_projects/av2txtsum/data/audio.wav - the path to the file to transcribe;
* --output-file ~/Sync/AI/My_projects/av2txtsum/data/whisper_output/audio - the output folder together with the output file name without an extension;
* --output-txt, --output-srt, --output-csv, --output-json - the output formats.

The transcription of the 35-minute video took about 17 minutes on a MacBook Air (Apple M1).
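
One way to put this timing in perspective is the real-time factor, the ratio of processing time to audio duration. The term is not used elsewhere in this notebook, and both numbers are the approximate figures quoted above, so treat the result as a rough estimate.

```python
def real_time_factor(processing_minutes, audio_minutes):
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    return processing_minutes / audio_minutes

# Approximate figures from the run described above.
rtf = real_time_factor(processing_minutes=17, audio_minutes=35)
print(f"Real-time factor: {rtf:.2f}")  # prints "Real-time factor: 0.49"
```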

The transcribed txt-file:

In [1]:
!cat ./data/whisper_output/audio.txt | head -n 15

 ♪
 Welcome to IBM Think 2023.
 ♪
 AI-generated art.
 AI-generated songs.
 AI, what is that?
 It sure is a lot of fun.
 But when foundation models are applied to big business,
 well, you need to think bigger.
 Because AI in business needs to be held to a higher standard.
 Built to be trusted, secured, and adaptable.
 This isn't simple automation that is only trained to do one thing.
 This is AI that is built and focused to work across your organization.
 This isn't committing to a single system.
 This is hybrid-ready AI that can scale across your systems.


## Clean the whisper transcript

Import the [clean_whisper_transcript](./utils/clean_text.py) function and clean the transcript generated by Whisper (remove musical notes, newline characters, etc.):

In [2]:
from utils.clean_text import clean_whisper_transcript
cleaned_whisper_transcript = clean_whisper_transcript("data/whisper_output/audio.txt", "data/cleaned_whisper_transcript.txt")
print(cleaned_whisper_transcript[:1000])

Welcome to IBM Think 2023. AI-generated art. AI-generated songs. AI, what is that? It sure is a lot of fun. But when foundation models are applied to big business, well, you need to think bigger. Because AI in business needs to be held to a higher standard. Built to be trusted, secured, and adaptable. This isn't simple automation that is only trained to do one thing. This is AI that is built and focused to work across your organization. This isn't committing to a single system. This is hybrid-ready AI that can scale across your systems. This isn't wondering where an answer came from. This is AI that can show its work. When you build AI into the core of your business, you can go so much further. This is more than AI. This is AI for business. Let's create. Please welcome Senior Vice President and Director of Research, IBM, Dr. Dario Gil. Hello. Welcome. Welcome. The last session of "Think." And I understand some of you even had a drink. How special. So I hope you've enjoyed the last two 
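
The body of `clean_whisper_transcript` is not shown in this notebook. As a rough idea of what such a cleaning step involves, here is a minimal sketch that works on a string rather than on files. The function name `clean_whisper_text_sketch` is hypothetical, and the real helper in `utils/clean_text.py` may behave differently.

```python
import re

def clean_whisper_text_sketch(raw_text):
    """Hypothetical stand-in for clean_whisper_transcript (string in, string out)."""
    text = raw_text.replace("♪", " ")   # drop musical-note markers
    text = re.sub(r"\s+", " ", text)     # collapse newlines and repeated spaces
    return text.strip()

sample = " ♪\n Welcome to IBM Think 2023.\n ♪\n AI-generated art.\n AI-generated songs.\n"
print(clean_whisper_text_sketch(sample))
# prints: Welcome to IBM Think 2023. AI-generated art. AI-generated songs.
```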

## Clean the original transcript

Import the [clean_original_transcript](./utils/clean_text.py) function and clean the original transcript (remove timestamps, speaker labels, newline characters, etc.):

In [3]:
from utils.clean_text import clean_original_transcript
cleaned_original_transcript = clean_original_transcript("data/transcript.txt", "data/cleaned_original_transcript.txt")
print(cleaned_original_transcript[:1000])

Welcome to IBM THINK 2023! AI generated art, AI generated songs. AI, what is that? It sure is a lot of fun. But when foundation models are applied to big business, well, you need to think bigger. Because AI and business needs to be held to a higher standard. Built to be trusted, secured, and adaptable. This isn't simple automation that is only trained to do one thing. This is AI that is built and focused to work across your organization. This isn't committing to a single system. This is hybrid ready AI that can scale across your systems. This isn't wondering where an answer came from. This is AI that can show its work. When you build AI into the core of your business, you can go so much further. This is more than AI. This is AI for business. Let's create. Please welcome Senior Vice President and Director of Research, IBM, Dr. Dario Gil. Hello. Welcome, welcome to the last session of THINK. And I understand some of you even had a drink. How special. So, I hope you've enjoyed the last tw
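
The body of `clean_original_transcript` is not shown either. The sketch below illustrates one possible approach, assuming the raw transcript puts timestamps such as `0:09` on their own lines and marks speakers with `>>` or `>> DARIO GIL:`. Both the function name `clean_original_text_sketch` and the sample input are hypothetical.

```python
import re

def clean_original_text_sketch(raw_text):
    """Hypothetical stand-in for clean_original_transcript (string in, string out)."""
    # Remove lines that contain only a timestamp such as "0:09".
    text = re.sub(r"^\d+:\d{2}\s*$", "", raw_text, flags=re.MULTILINE)
    # Remove ">>" markers together with an optional upper-case speaker label.
    text = re.sub(r">>\s*(?:[A-Z][A-Z ]*:)?", "", text)
    # Collapse newlines and repeated spaces.
    return re.sub(r"\s+", " ", text).strip()

sample = "0:00\n>> Welcome to IBM THINK 2023!\n0:09\n>> DARIO GIL: Hello.\n"
print(clean_original_text_sketch(sample))
# prints: Welcome to IBM THINK 2023! Hello.
```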

## Calculate Error Rate Metrics

Import the [calculate_error_rates](./utils/clean_text.py) function and calculate the most common error rate metrics:

In [4]:
from utils.clean_text import calculate_error_rates
metrics = calculate_error_rates(cleaned_original_transcript, cleaned_whisper_transcript)

In [5]:
print(metrics)

{'WER': 0.20786979810465595, 'CER': 0.11381461350454164, 'MER': 0.19411312043093498, 'WIL': 0.29295809231354897, 'WIP': 0.707041907686451}
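
To make these numbers easier to interpret: WER is the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference, and CER is the same ratio at the character level. The following is a minimal pure-Python sketch of both metrics. It is not the implementation behind `calculate_error_rates`, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # match or substitution
        previous = current
    return previous[-1]

def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def char_error_rate(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

reference = "welcome welcome to the last session of think"
hypothesis = "welcome welcome the last session of think"
print(word_error_rate(reference, hypothesis))  # one deleted word out of eight: 0.125
```

Because the denominator is the reference length, a single missing word in a short sentence already produces a noticeable WER, which is why formatting differences can inflate the metric.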


If we compare the two transcripts, we see that the Whisper transcript contains a few repetitions. To reduce them, lower --max-context from 224 to 64:

```bash
./main -m models/ggml-large-v3.bin -f ~/Sync/AI/My_projects/av2txtsum/data/audio.wav --output-file ~/Sync/AI/My_projects/av2txtsum/data/whisper_output/audio_max_context_64 --output-txt --output-srt --output-csv --output-json --max-context 64
```

Clean the Whisper transcript:

In [6]:
cleaned_whisper_transcript_max_context_64 = clean_whisper_transcript("data/whisper_output/audio_max_context_64.txt", "data/cleaned_whisper_transcript_max_context_64.txt")
print(cleaned_whisper_transcript_max_context_64[:1000])

Welcome to IBM Think 2023. AI-generated art. AI-generated songs. AI, what is that? It sure is a lot of fun. But when foundation models are applied to big business, well, you need to think bigger. Because AI in business needs to be held to a higher standard. Built to be trusted, secured, and adaptable. This isn't simple automation that is only trained to do one thing. This is AI that is built and focused to work across your organization. This isn't committing to a single system. This is hybrid-ready AI that can scale across your systems. This isn't wondering where an answer came from. This is AI that can show its work. When you build AI into the core of your business, you can go so much further. This is more than AI. This is AI for business. Let's create. Please welcome Senior Vice President and Director of Research, IBM, Dr. Dario Gil. Hello. Welcome. Welcome. The last session of Think. And I understand some of you even had a drink. How special. So I hope you've enjoyed the last two da

Calculate error rate metrics:

In [7]:
metrics = calculate_error_rates(cleaned_original_transcript, cleaned_whisper_transcript_max_context_64)
metrics

{'WER': 0.12237330037082818,
 'CER': 0.03538467150621968,
 'MER': 0.12038913660316174,
 'WIL': 0.2010647493170774,
 'WIP': 0.7989352506829226}

As we can see, the metrics improve. Next, remove punctuation, lowercase the text, and recalculate the metrics:

In [8]:
import re
patterns = r'[^\w\s]'
cleaned_whisper_transcript_max_context_64 = cleaned_whisper_transcript_max_context_64.lower()
cleaned_original_transcript = cleaned_original_transcript.lower()
cleaned_whisper_transcript_max_context_64 = re.sub(patterns, "", cleaned_whisper_transcript_max_context_64)
cleaned_original_transcript = re.sub(patterns, "", cleaned_original_transcript)

In [9]:
print(cleaned_whisper_transcript_max_context_64[:1000])

welcome to ibm think 2023 aigenerated art aigenerated songs ai what is that it sure is a lot of fun but when foundation models are applied to big business well you need to think bigger because ai in business needs to be held to a higher standard built to be trusted secured and adaptable this isnt simple automation that is only trained to do one thing this is ai that is built and focused to work across your organization this isnt committing to a single system this is hybridready ai that can scale across your systems this isnt wondering where an answer came from this is ai that can show its work when you build ai into the core of your business you can go so much further this is more than ai this is ai for business lets create please welcome senior vice president and director of research ibm dr dario gil hello welcome welcome the last session of think and i understand some of you even had a drink how special so i hope youve enjoyed the last two days with us and what an incredible year it 

In [10]:
print(cleaned_original_transcript[:1000])

welcome to ibm think 2023 ai generated art ai generated songs ai what is that it sure is a lot of fun but when foundation models are applied to big business well you need to think bigger because ai and business needs to be held to a higher standard built to be trusted secured and adaptable this isnt simple automation that is only trained to do one thing this is ai that is built and focused to work across your organization this isnt committing to a single system this is hybrid ready ai that can scale across your systems this isnt wondering where an answer came from this is ai that can show its work when you build ai into the core of your business you can go so much further this is more than ai this is ai for business lets create please welcome senior vice president and director of research ibm dr dario gil hello welcome welcome to the last session of think and i understand some of you even had a drink how special so i hope youve enjoyed the last two days with us and what an incredible y

In [11]:
metrics = calculate_error_rates(cleaned_original_transcript, cleaned_whisper_transcript_max_context_64)
metrics

{'WER': 0.06100577081615829,
 'CER': 0.022469802685872908,
 'MER': 0.06012593946780419,
 'WIL': 0.08965672751913412,
 'WIP': 0.9103432724808659}

Compare the original and Whisper transcripts:

In [12]:
import difflib
d = difflib.Differ()
diff = list(d.compare(cleaned_original_transcript.split(), cleaned_whisper_transcript_max_context_64.split()))

In [13]:
diff

['  welcome',
 '  to',
 '  ibm',
 '  think',
 '  2023',
 '- ai',
 '- generated',
 '+ aigenerated',
 '? ++\n',
 '  art',
 '- ai',
 '- generated',
 '+ aigenerated',
 '? ++\n',
 '  songs',
 '  ai',
 '  what',
 '  is',
 '  that',
 '  it',
 '  sure',
 '  is',
 '  a',
 '  lot',
 '  of',
 '  fun',
 '  but',
 '  when',
 '  foundation',
 '  models',
 '  are',
 '  applied',
 '  to',
 '  big',
 '  business',
 '  well',
 '  you',
 '  need',
 '  to',
 '  think',
 '  bigger',
 '  because',
 '  ai',
 '- and',
 '+ in',
 '  business',
 '  needs',
 '  to',
 '  be',
 '  held',
 '  to',
 '  a',
 '  higher',
 '  standard',
 '  built',
 '  to',
 '  be',
 '  trusted',
 '  secured',
 '  and',
 '  adaptable',
 '  this',
 '  isnt',
 '  simple',
 '  automation',
 '  that',
 '  is',
 '  only',
 '  trained',
 '  to',
 '  do',
 '  one',
 '  thing',
 '  this',
 '  is',
 '  ai',
 '  that',
 '  is',
 '  built',
 '  and',
 '  focused',
 '  to',
 '  work',
 '  across',
 '  your',
 '  organization',
 '  this',
 '  isnt',

Or in a more readable [html-format](./data/diff.html):

In [14]:
html_diff = difflib.HtmlDiff().make_file(cleaned_original_transcript.split(), cleaned_whisper_transcript_max_context_64.split())

In [15]:
with open('./data/diff.html', 'w') as f:
    f.write(html_diff)

If we examine the differences, we will see cases such as:
- `ai generated` vs. `aigenerated`;
- `you have` vs. `youve`;
- `we have` vs. `weve`;
- `super computer` vs. `supercomputer`;
- `watson x` vs. `watsonx`;
- etc.

The transcripts could be cleaned further, with more careful handling of punctuation. Still, a WER of about 6% is quite a good result.
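
As an illustration of more careful punctuation handling, the sketch below replaces hyphens with spaces before stripping the remaining punctuation, so that `AI-generated` becomes `ai generated` rather than `aigenerated`. The function name `normalize_for_wer` is hypothetical, and this only addresses the hyphen cases from the list above, not the contractions.

```python
import re

def normalize_for_wer(text):
    """Lowercase, split hyphenated words, then drop the remaining punctuation."""
    text = text.lower()
    text = text.replace("-", " ")          # "ai-generated" -> "ai generated"
    text = re.sub(r"[^\w\s]", "", text)    # same pattern as used in the cell above
    return re.sub(r"\s+", " ", text).strip()

print(normalize_for_wer("AI-generated art. This is hybrid-ready AI."))
# prints: ai generated art this is hybridready ai -> no: see assertion-backed output below
```

With this change, the Whisper output `AI-generated art.` and the original `AI generated art,` both normalize to `ai generated art`, so the difference no longer counts as an error.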

## Use a more advanced cleaning method offered by Whisper developers

The authors of the original Whisper address this issue in their paper (Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., Sutskever, I. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv preprint arXiv:2212.04356, 2022):

> Speech recognition research typically evaluates and compares systems based on the word error rate (WER) metric.
> However, WER, which is based on string edit distance, penalizes all differences between the model’s output and the
> reference transcript including innocuous differences in transcript style. As a result, systems that output transcripts that
> would be judged as correct by humans can still have a large
> WER due to minor formatting differences. While this poses
> a problem for all transcribers, it is particularly acute for
> zero-shot models like Whisper, which do not observe any
> examples of specific datasets transcript formats.
> 
> This is not a novel observation; the development of evaluation metrics that better correlate with human judgement is an
> active area of research, and while there are some promising
> methods, none have seen widespread adoption for speech
> recognition yet. We opt to address this problem with extensive standardization of text before the WER calculation
> to minimize penalization of non-semantic differences. Our
> text normalizer was developed through iterative manual inspection to identify common patterns where naive WER
> penalized Whisper models for an innocuous difference.

The original Whisper includes [normalizers](https://github.com/openai/whisper/blob/ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab/whisper/normalizers):

> [BasicTextNormalizer](https://wiki.mutable.ai/openai/whisper#text-normalization-1) handles basic cleaning tasks like lowercasing, removing symbols and diacritics.
> 
> [EnglishTextNormalizer](https://wiki.mutable.ai/openai/whisper#text-normalization-1) performs additional normalization steps for English text such as expanding contractions and standardizing numbers and spellings.

Let's apply these normalizers to the same transcripts that we worked on above.

Install Whisper:

In [16]:
!pip install git+https://github.com/openai/whisper.git

Collecting git+https://github.com/openai/whisper.git
  Cloning https://github.com/openai/whisper.git to /private/var/folders/gh/mx3xbl0d69jftpptc8yn6mn40000gn/T/pip-req-build-wie5ffvr
  Running command git clone --filter=blob:none --quiet https://github.com/openai/whisper.git /private/var/folders/gh/mx3xbl0d69jftpptc8yn6mn40000gn/T/pip-req-build-wie5ffvr
  Resolved https://github.com/openai/whisper.git to commit ba3f3cd54b0e5b8ce1ab3de13e32122d0d5f98ab
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


Import the normalizers:

In [16]:
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer
# normalizer_b = BasicTextNormalizer(remove_diacritics=True, split_letters=False)
normalizer_b = BasicTextNormalizer()
normalizer_en = EnglishTextNormalizer()

Open the original transcript and remove timestamps like "0:09" and speaker labels like ">> DARIO GIL:":

In [17]:
with open("./data/transcript.txt", "r") as file:
    original_transcript = file.read()
original_transcript_in_lines = re.sub(r"\d+:\d{2}\n|>>\s*[A-Z\s]+:", "", original_transcript).strip().split("\n")
print(original_transcript_in_lines[0:10])

['>> Welcome to IBM THINK 2023!', '>> AI generated art, AI generated songs.', 'AI, what is that? It sure is a lot of fun. But when foundation models are applied to big business, well,', 'you need to think bigger. Because AI and business needs to be held to a higher standard.', "Built to be trusted, secured, and adaptable. This isn't simple automation that is only", 'trained to do one thing. This is AI that is built and focused to work across your organization.', "This isn't committing to a single system. This is hybrid ready AI that can scale across your systems.", "This isn't wondering where an answer came from. This is AI that can show its work.", 'When you build AI into the core of your business, you can go so much further. This is more than AI.', "This is AI for business. Let's create."]


Clean the original transcript using normalizer_en:

In [18]:
cleaned_original_transcript = []
for line in original_transcript_in_lines:
    cleaned_original_transcript.append(normalizer_en(line))

In [19]:
cleaned_original_transcript[0:10]

['welcome to ibm think 2023',
 'ai generated art ai generated songs',
 'ai what is that it sure is a lot of fun but when foundation models are applied to big business well',
 'you need to think bigger because ai and business needs to be held to a higher standard',
 'built to be trusted secured and adaptable this is not simple automation that is only',
 'trained to do one thing this is ai that is built and focused to work across your organization',
 'this is not committing to a single system this is hybrid ready ai that can scale across your systems',
 'this is not wondering where an answer came from this is ai that can show its work',
 'when you build ai into the core of your business you can go so much further this is more than ai',
 'this is ai for business let us create']

Load the Whisper transcript produced with --max-context 64:

In [20]:
with open("./data/whisper_output/audio_max_context_64.txt", "r") as file:
    whisper_transcript = file.read()
whisper_transcript_in_lines = whisper_transcript.strip().split("\n")
whisper_transcript_in_lines[0:10]

['♪',
 ' Welcome to IBM Think 2023.',
 ' ♪',
 ' AI-generated art.',
 ' AI-generated songs.',
 ' AI, what is that?',
 ' It sure is a lot of fun.',
 ' But when foundation models are applied to big business,',
 ' well, you need to think bigger.',
 ' Because AI in business needs to be held to a higher standard.']

Clean the whisper transcript using normalizer_en:

In [21]:
cleaned_whisper_transcript = []
for line in whisper_transcript_in_lines:
    cleaned_whisper_transcript.append(normalizer_en(line))

In [22]:
cleaned_whisper_transcript[0:10]

['',
 'welcome to ibm think 2023',
 '',
 'ai generated art',
 'ai generated songs',
 'ai what is that',
 'it sure is a lot of fun',
 'but when foundation models are applied to big business',
 'well you need to think bigger',
 'because ai in business needs to be held to a higher standard']

Remove the empty lines:

In [23]:
cleaned_whisper_transcript = [line for line in cleaned_whisper_transcript if line != ""]
cleaned_whisper_transcript[0:10]

['welcome to ibm think 2023',
 'ai generated art',
 'ai generated songs',
 'ai what is that',
 'it sure is a lot of fun',
 'but when foundation models are applied to big business',
 'well you need to think bigger',
 'because ai in business needs to be held to a higher standard',
 'built to be trusted secured and adaptable',
 'this is not simple automation that is only trained to do one thing']

Join the lines of the cleaned original transcript and of the cleaned Whisper transcript:

In [24]:
reference = " ".join(cleaned_original_transcript)
reference[0:1000]

'welcome to ibm think 2023 ai generated art ai generated songs ai what is that it sure is a lot of fun but when foundation models are applied to big business well you need to think bigger because ai and business needs to be held to a higher standard built to be trusted secured and adaptable this is not simple automation that is only trained to do one thing this is ai that is built and focused to work across your organization this is not committing to a single system this is hybrid ready ai that can scale across your systems this is not wondering where an answer came from this is ai that can show its work when you build ai into the core of your business you can go so much further this is more than ai this is ai for business let us create please welcome senior vice president and director of research ibm doctor dario gil hello welcome welcome to the last session of think and i understand some of you even had a drink how special so i hope you have enjoyed the last 2 days with us and what a

In [25]:
hypothesis = " ".join(cleaned_whisper_transcript)
hypothesis[0:1000]

'welcome to ibm think 2023 ai generated art ai generated songs ai what is that it sure is a lot of fun but when foundation models are applied to big business well you need to think bigger because ai in business needs to be held to a higher standard built to be trusted secured and adaptable this is not simple automation that is only trained to do one thing this is ai that is built and focused to work across your organization this is not committing to a single system this is hybrid ready ai that can scale across your systems this is not wondering where an answer came from this is ai that can show its work when you build ai into the core of your business you can go so much further this is more than ai this is ai for business let us create please welcome senior vice president and director of research ibm doctor dario gil hello welcome welcome the last session of think and i understand some of you even had a drink how special so i hope you have enjoyed the last 2 days with us and what an in

Calculate the metrics:

In [26]:
metrics2 = calculate_error_rates(reference, hypothesis)
metrics2

{'WER': 0.03683464885650678,
 'CER': 0.017973245646812644,
 'MER': 0.03634911124425804,
 'WIL': 0.05387081619333034,
 'WIP': 0.9461291838066697}

As we can see, a combination of custom cleaning and the Whisper normalizer yields even better results, with the WER dropping to 3.7%.
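
To summarize the progression, the sketch below collects the WER values reported in the cells above and prints the relative reduction against the first run. The step labels are shorthand introduced here for illustration.

```python
# WER values copied from the outputs above; labels are shorthand for each step.
wer_by_step = {
    "default settings": 0.20786979810465595,
    "--max-context 64": 0.12237330037082818,
    "--max-context 64, lowercase, no punctuation": 0.06100577081615829,
    "--max-context 64, EnglishTextNormalizer": 0.03683464885650678,
}

baseline = wer_by_step["default settings"]
for step, wer in wer_by_step.items():
    reduction = (baseline - wer) / baseline
    print(f"{step:<45} WER {wer:6.1%}  relative reduction {reduction:6.1%}")
```

Note that only the second step changes the transcription itself. The last two steps change how the text is normalized before scoring, so they reduce the measured error without altering the model output.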

The result in a more readable [html-format](./data/diff_normalizer.html):

In [27]:
html_diff = difflib.HtmlDiff().make_file(reference.split(), hypothesis.split())
with open('./data/diff_normalizer.html', 'w') as f:
    f.write(html_diff)

In [28]:
from IPython.display import HTML
HTML(html_diff)

| | # | Original | | # | Whisper |
|---|---|---|---|---|---|
| f | 1 | welcome | f | 1 | welcome |
| | 2 | to | | 2 | to |
| | 3 | ibm | | 3 | ibm |
| | 4 | think | | 4 | think |
| | 5 | 2023 | | 5 | 2023 |
| | 6 | ai | | 6 | ai |
| | 7 | generated | | 7 | generated |
| | 8 | art | | 8 | art |
| | 9 | ai | | 9 | ai |
| | 10 | generated | | 10 | generated |

Legend: colors mark Added, Changed, and Deleted words; links: (f)irst change, (n)ext change, (t)op.


## Conclusion

whisper.cpp can produce hallucinations, especially in parts with no sound. Overall, however, it provides good results with a low WER.

## Useful links:
- https://cdn.openai.com/papers/whisper.pdf
- https://deepgram.com/learn/how-openai-s-text-normalization-hides-whisper-s-true-word-error-rate-for-south-asian-and-southeast-asian-languages
- https://wiki.mutable.ai/openai/whisper#text-normalization-1
- https://deepgram.com/learn/benchmarking-openai-whisper-for-non-english-asr
- https://medium.com/aimonks/seamlessm4t-vs-whisper-a-speech-to-text-benchmark-6dc873154825
- https://huggingface.co/docs/transformers/en/model_doc/whisper
- https://colab.research.google.com/github/openai/whisper/blob/master/notebooks/LibriSpeech.ipynb#scrollTo=dl-KBDflMhrg

