# Open source project: SeamlessM4T

[Seamless](https://github.com/facebookresearch/seamless_communication) is a well-known open-source project that includes a number of AI models for automatic speech recognition (ASR), speech-to-speech translation, text-to-text translation, and related tasks.

## Cloning the Repository

From the development/projects folder, run Terminal and clone the repository:

```bash
git clone https://github.com/facebookresearch/seamless_communication
```

## Create a new environment and install Seamless

Be careful with the Python version; an incompatible version can cause dependency problems:

```bash
conda create -n env_mc_sm4t_311 python==3.11 jupyterlab
conda activate env_mc_sm4t_311
python -m ipykernel install --user --name env_mc_sm4t_311 --display-name "Python 3.11 (env_mc_sm4t_311)"
cd /Users/lex/Sync/AI/My_projects/seamless_communication
pip install .
```

## Split the audio file

SeamlessM4T cannot process long files. If the input is too long, you will encounter the following error:
"ValueError: The input sequence length must be less than or equal to the maximum sequence length (4096), but is ...... instead."

The most common way to handle long audio files is to split them into smaller files, each approximately 1 minute long. However, splitting is generally a significant disadvantage because context is lost at every segment boundary. For example, Whisper uses a sliding 30-second window that keeps information from the previous 30 seconds as a prompt for context, which significantly improves the result.

Divide the full audio files into segments using [split_audio.py](./utils/split_audio.py).

In [1]:
from utils.split_audio import split_audio
split_audio("data/audio.wav", "data/audio_segments", 60000, 500, -30)

data/audio_segments/segment_000001_70.0.wav has been created.
data/audio_segments/segment_000002_50.3.wav has been created.
data/audio_segments/segment_000003_61.9.wav has been created.
data/audio_segments/segment_000004_58.2.wav has been created.
data/audio_segments/segment_000005_60.2.wav has been created.
data/audio_segments/segment_000006_60.3.wav has been created.
data/audio_segments/segment_000007_61.1.wav has been created.
data/audio_segments/segment_000008_59.3.wav has been created.
data/audio_segments/segment_000009_59.8.wav has been created.
data/audio_segments/segment_000010_60.2.wav has been created.
data/audio_segments/segment_000011_60.1.wav has been created.
data/audio_segments/segment_000012_61.8.wav has been created.
data/audio_segments/segment_000013_60.5.wav has been created.
data/audio_segments/segment_000014_58.3.wav has been created.
data/audio_segments/segment_000015_60.8.wav has been created.
data/audio_segments/segment_000016_59.2.wav has been created.
data/aud
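
The contents of `split_audio.py` are not reproduced here. As a rough illustration of why the segments above range from about 50 to 70 seconds instead of being exactly 60 seconds long, the sketch below assumes that the splitter cuts at the pause nearest to each target boundary. The function name `choose_cut_points`, its parameters, and the sample silence intervals are hypothetical and are not taken from the actual script.

```python
def choose_cut_points(silences, total_ms, target_ms):
    """Pick cut points (in ms) at the silence midpoint closest to each target boundary.

    silences: list of (start_ms, end_ms) pauses, assumed to be already detected.
    """
    midpoints = sorted((start + end) // 2 for start, end in silences)
    cuts = []
    last_cut = 0
    # Keep cutting while the remaining audio is longer than one target segment.
    while total_ms - last_cut > target_ms:
        ideal = last_cut + target_ms
        candidates = [m for m in midpoints if m > last_cut]
        if not candidates:
            break  # no pause left, so the rest stays in one piece
        cut = min(candidates, key=lambda m: abs(m - ideal))
        cuts.append(cut)
        last_cut = cut
    return cuts


# Hypothetical pauses in a 40-second recording, with a 10-second target length.
example_silences = [(9800, 10200), (19000, 19400), (33000, 33600)]
example_cuts = choose_cut_points(example_silences, total_ms=40000, target_ms=10000)
print(example_cuts)
```

Because a cut is only allowed inside a pause, the resulting segments drift around the target length, which matches the varying durations in the file names above.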

## Transcribe the real audio

Transcribe the audio extracted from the video earlier as detailed in the [Data Source](./01_main.ipynb) using the large model with additional flags:

```bash
m4t_predict ~/Sync/AI/My_projects/av2txtsum/data/audio_segments/segment_1_70.0.wav --task asr --tgt_lang eng --model_name seamlessM4T_v2_large

m4t_predict ~/Sync/AI/My_projects/av2txtsum/data/audio_segments/segment_1_70.0.wav --task s2tt --tgt_lang eng --model_name seamlessM4T_v2_large
```

where

- m4t_predict - runs the model;
- ~/Sync/AI/My_projects/av2txtsum/data/audio_segments/segment_1_70.0.wav - the path to the file to transcribe;
- --task asr - automatic speech recognition (s2tt - speech-to-text translation);
- --tgt_lang - the target language;
- --model_name - the model name (large, medium, etc)
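
When there are many segments, it may be more convenient to build these commands in a script. The sketch below only assembles the argument lists from the flags described above; the helper name `build_m4t_command` and the relative path are assumptions for illustration. Each list could then be passed to `subprocess.run(command, check=True)` in an environment where `m4t_predict` is installed.

```python
import shlex


def build_m4t_command(audio_path, task, tgt_lang="eng", model_name="seamlessM4T_v2_large"):
    """Return the m4t_predict invocation as an argument list (hypothetical helper)."""
    return [
        "m4t_predict", audio_path,
        "--task", task,
        "--tgt_lang", tgt_lang,
        "--model_name", model_name,
    ]


# Build one command per task for the same (illustrative) segment path.
segment = "data/audio_segments/segment_1_70.0.wav"
commands = [build_m4t_command(segment, task) for task in ("asr", "s2tt")]
for command in commands:
    # shlex.join renders the list as a shell-safe string for inspection.
    print(shlex.join(command))
```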


In some cases, you also need to install libsndfile:

```bash
conda install -c conda-forge libsndfile==1.0.31
```

## The result

### The audio length is about 60 sec

**segment_1_70.0.wav**

The original transcript:

```
0:09
>> Welcome to IBM THINK 2023!
0:17
>> AI generated art, AI generated songs.
0:23
AI, what is that? It sure is a lot of fun. But when foundation models are applied to big business, well,
0:31
you need to think bigger. Because AI and business needs to be held to a higher standard.
0:36
Built to be trusted, secured, and adaptable. This isn't simple automation that is only
0:42
trained to do one thing. This is AI that is built and focused to work across your organization.
0:48
This isn't committing to a single system. This is hybrid ready AI that can scale across your systems.
0:54
This isn't wondering where an answer came from. This is AI that can show its work.
1:00
When you build AI into the core of your business, you can go so much further. This is more than AI.
1:07
This is AI for business.
```

The local model (seamlessM4T_v2_large) output:

```
When you build AI into the core of your business, you can go much further than that.
```

The HuggingFace model output:

```
Error because the file is longer than 60 sec
```

**segment_2_50.3.wav**

The original transcript:

```
Let's create.
1:13
(MUSIC) >> Please welcome Senior Vice President and Director of Research, IBM, Dr. Dario Gil.
1:21
(Applause) >> DARIO GIL: Hello.
1:27
Welcome, welcome to the last session of THINK. And I understand some of you even had a drink.
1:33
How special. So, I hope you've enjoyed the last two days with us.
1:39
And what an incredible year it has been for AI. You can really feel the change that is happening all around us.
1:48
And there's just no denying that the pace of this technology continues to be exhilarating and that its implications are
1:57
now so clear for all to see around the
```

The local model (seamlessM4T_v2_large) output:

```
I hope you enjoyed the last two days with us and what an incredible year it has been for AI.
```

The HuggingFace model output:

```
Let's welcome the senior vice president and director of research, Dr. Gary O'Gill, and I hope you've enjoyed the last two days with us.
```

**segment_16_59.2.wav**

The original transcript:

```
topic that you want the assistant to handle, and it
15:06
generates the corresponding conversational flow. We have an inference stack to scale the serving of
15:14
the model in applications. It consists of state-of- the-art technology that has been field tested for scalable model serving.
15:22
This is how Watsonx allows us to go from data to a model that is trusted, governed, deployed and ready to serve, and how
15:31
we can scale that model to different applications. Once models are deployed, we continuously monitor them
15:39
and update them in both .data and in .ai. We call this constant process our data and model factory.
15:48
At Watsonx.governance monitors the models, if there's any change that may impact how the model can be used or performs,
15:58
be driven because we have new data that can be leveraged
```

The local model (seamlessM4T_v2_large) output:

```
We have a state-of-the-art technology that has been field-tested for scalable modeling.
```

The HuggingFace model output:

```
This is how we can scale that model to different applications, and this is how we can scale that model to different applications, and this is how we can scale that model to different applications, and this is how we can scale that model to different applications.
```

**segment_29_57.2.wav**

The original transcript:

```
your application.
28:05
It's that simple. We took the complexity of the process away so you only
28:10
need to worry about creating value for your business. And here are some of our current AI value creators.
28:17
SAP will use IBM Watson capabilities to power its digital assistant in the recipe solutions.
28:23
You have been hearing about Red Hat, how it's embedding IBM Watson Code Assistant into the Ansible Automation Platform,
28:30
BBVA is bringing their enterprise data to use with their own foundation model for natural language.
28:37
Moderna is applying IBM's foundation models to help predict potential MRNA medicines.
28:44
NASA is using our language models together with US spatial models we have created together to improve our
28:50
scientific understanding and response to earth and climate related issues. And WiX is using foundation models to gain novel insights
28:58
for customer care as they meet the needs of their customers.
```

The local model (seamlessM4T_v2_large) output:

```
It's that simple. We've taken the complexity of the process away so you just have to worry about creating value for your business.
```

The HuggingFace model output:

```
It's that simple. WIX is used in the process to create value for your business. And here are some of our current AI models.
```

As we can see, the model performs poorly even on files about 60 seconds long. There are many missed sentences and words, as well as hallucinations.

### The audio length is about 10 sec

We can try splitting it into smaller chunks (about 10 seconds each). However, in this case, we may end up with many segments divided in the middle of a sentence, which is bad for context.

Divide the full audio files into segments using [split_audio.py](./utils/split_audio.py).

In [1]:
from utils.split_audio import split_audio
split_audio("data/audio.wav", "data/audio_segments", 10000, 100, -30)

data/audio_segments/segment_000001_10.1.wav has been created.
data/audio_segments/segment_000002_10.2.wav has been created.
data/audio_segments/segment_000003_11.0.wav has been created.
data/audio_segments/segment_000004_15.5.wav has been created.
data/audio_segments/segment_000005_9.4.wav has been created.
data/audio_segments/segment_000006_8.4.wav has been created.
data/audio_segments/segment_000007_5.5.wav has been created.
data/audio_segments/segment_000008_12.2.wav has been created.
data/audio_segments/segment_000009_7.8.wav has been created.
data/audio_segments/segment_000010_10.0.wav has been created.
data/audio_segments/segment_000011_10.3.wav has been created.
data/audio_segments/segment_000012_9.9.wav has been created.
data/audio_segments/segment_000013_10.2.wav has been created.
data/audio_segments/segment_000014_9.8.wav has been created.
data/audio_segments/segment_000015_9.7.wav has been created.
data/audio_segments/segment_000016_10.0.wav has been created.
data/audio_segm
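
The file names in the output above appear to follow the pattern `segment_<index>_<duration in seconds>.wav`. Assuming that pattern holds, a small helper can report how far the actual segment lengths deviate from the 10-second target. The helper name `parse_segment_name` is hypothetical, and the list below copies only the first seven names from the output.

```python
import re

# Assumed naming pattern: segment_<zero-padded index>_<seconds>.wav
SEGMENT_NAME = re.compile(r"segment_(\d+)_(\d+(?:\.\d+)?)\.wav$")


def parse_segment_name(file_name):
    """Return (index, duration_in_seconds) parsed from a segment file name."""
    match = SEGMENT_NAME.search(file_name)
    if match is None:
        raise ValueError(f"Unexpected segment name: {file_name}")
    return int(match.group(1)), float(match.group(2))


names = [
    "data/audio_segments/segment_000001_10.1.wav",
    "data/audio_segments/segment_000002_10.2.wav",
    "data/audio_segments/segment_000003_11.0.wav",
    "data/audio_segments/segment_000004_15.5.wav",
    "data/audio_segments/segment_000005_9.4.wav",
    "data/audio_segments/segment_000006_8.4.wav",
    "data/audio_segments/segment_000007_5.5.wav",
]
durations = [parse_segment_name(name)[1] for name in names]
print(f"shortest: {min(durations)} s, longest: {max(durations)} s")
```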

Transcribe the 10-second audio files:

In [2]:
# Import libraries
import os

import torch
from seamless_communication.inference import Translator


# Set device and Translator (mps does not work well with seamlessm4t)
# device = torch.device("mps")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
translator = Translator(
    "seamlessM4T_v2_large",
    "vocoder_v2",
    device=device,
    dtype=torch.float32
)

directory = "data/audio_segments"
# Get all file names, sort them alphabetically
files = [f for f in os.listdir(directory) if f.endswith(".wav")]
files.sort()
# print(files)

# Transcribe the audio
text = []
for file in files:
    text_output, _ = translator.predict(
        input=os.path.join(directory, file),
        task_str="ASR",
        tgt_lang="eng"
    )
    text.append(str(text_output[0]))


Using the cached checkpoint of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached checkpoint of vocoder_v2. Set `force` to `True` to download again.


In [3]:
text

["It's the first time I've ever seen a movie like this.",
 'Welcome to IBM Think 2023. AI generated art.',
 "AI generated songs? AI, what is that? It's sure is a lot of fun. But when foundation models are applied to big business, well",
 'You need to think bigger, because AI in business needs to be held to a higher standard, built to be trusted, secure and adaptable.',
 "across your organization. This isn't committing to a single system. This is hybrid-ready AI that can scale across your systems. This isn't wondering",
 'This is AI that can show its work. When you build AI into the core of your business, you can go so much further.',
 'This is more than AI, this is AI for business.',
 "Let's create please welcome senior vice president and director of research IBM Dr. Dario Gill",
 'Hello, welcome to the last session of the show.',
 "So I hope you've enjoyed the last two days with us.",
 "What an incredible year it has been for AI. You can really feel the change that is happening all ar

In [4]:
# Save the transcript, one segment per line
with open("./data/seamlessm4t_output/sm4t_transcript.txt", "w") as file:
    file.write("\n".join(text))

## Prepare the transcripts for WER

### The original transcript

Use the whisper normalizer as we did before:

In [5]:
from whisper.normalizers import BasicTextNormalizer, EnglishTextNormalizer
normalizer_b = BasicTextNormalizer()
normalizer_en = EnglishTextNormalizer()

Open the original transcript and remove timestamps like "0:09" and speaker labels like ">> DARIO GIL:":

In [6]:
import re
with open("./data/transcript.txt", "r") as file:
    original_transcript = file.read()
original_transcript_in_lines = re.sub(r"\d+:\d{2}\n|>>\s*[A-Z\s]+:", "", original_transcript).strip().split("\n")
print(original_transcript_in_lines[0:10])

['>> Welcome to IBM THINK 2023!', '>> AI generated art, AI generated songs.', 'AI, what is that? It sure is a lot of fun. But when foundation models are applied to big business, well,', 'you need to think bigger. Because AI and business needs to be held to a higher standard.', "Built to be trusted, secured, and adaptable. This isn't simple automation that is only", 'trained to do one thing. This is AI that is built and focused to work across your organization.', "This isn't committing to a single system. This is hybrid ready AI that can scale across your systems.", "This isn't wondering where an answer came from. This is AI that can show its work.", 'When you build AI into the core of your business, you can go so much further. This is more than AI.', "This is AI for business. Let's create."]
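
To make the effect of the regular expression easier to see, here is the same pattern applied to a short sample. The sample text is assembled from lines of the transcript shown earlier; the variable names are only for illustration.

```python
import re

# Same pattern as above: timestamp lines, or ">>" followed by an upper-case speaker name and a colon.
pattern = r"\d+:\d{2}\n|>>\s*[A-Z\s]+:"

sample = (
    "0:09\n"
    ">> Welcome to IBM THINK 2023!\n"
    "1:21\n"
    "(Applause) >> DARIO GIL: Hello.\n"
)
sample_lines = re.sub(pattern, "", sample).strip().split("\n")
print(sample_lines)
```

Note that a bare ">>" without a speaker name is left in place by this pattern, as the first element of the output above also shows. The normalization step that follows strips it together with the rest of the punctuation.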


Clean the original transcript using normalizer_en:

In [7]:
cleaned_original_transcript = []
for line in original_transcript_in_lines:
    cleaned_original_transcript.append(normalizer_en(line))

In [8]:
cleaned_original_transcript[0:10]

['welcome to ibm think 2023',
 'ai generated art ai generated songs',
 'ai what is that it sure is a lot of fun but when foundation models are applied to big business well',
 'you need to think bigger because ai and business needs to be held to a higher standard',
 'built to be trusted secured and adaptable this is not simple automation that is only',
 'trained to do one thing this is ai that is built and focused to work across your organization',
 'this is not committing to a single system this is hybrid ready ai that can scale across your systems',
 'this is not wondering where an answer came from this is ai that can show its work',
 'when you build ai into the core of your business you can go so much further this is more than ai',
 'this is ai for business let us create']

Join the cleaned original transcript and create the reference for WER:

In [9]:
reference = " ".join(cleaned_original_transcript)
reference[0:1000]

'welcome to ibm think 2023 ai generated art ai generated songs ai what is that it sure is a lot of fun but when foundation models are applied to big business well you need to think bigger because ai and business needs to be held to a higher standard built to be trusted secured and adaptable this is not simple automation that is only trained to do one thing this is ai that is built and focused to work across your organization this is not committing to a single system this is hybrid ready ai that can scale across your systems this is not wondering where an answer came from this is ai that can show its work when you build ai into the core of your business you can go so much further this is more than ai this is ai for business let us create please welcome senior vice president and director of research ibm doctor dario gil hello welcome welcome to the last session of think and i understand some of you even had a drink how special so i hope you have enjoyed the last 2 days with us and what a

### The seamlessm4t transcript

Read the seamlessm4t transcript and split it into lines:

In [10]:
with open("./data/seamlessm4t_output/sm4t_transcript.txt", "r") as file:
    sm4t_transcript = file.read()
sm4t_transcript_in_lines = sm4t_transcript.strip().split("\n")
sm4t_transcript_in_lines[0:10]

["It's the first time I've ever seen a movie like this.",
 'Welcome to IBM Think 2023. AI generated art.',
 "AI generated songs? AI, what is that? It's sure is a lot of fun. But when foundation models are applied to big business, well",
 'You need to think bigger, because AI in business needs to be held to a higher standard, built to be trusted, secure and adaptable.',
 "across your organization. This isn't committing to a single system. This is hybrid-ready AI that can scale across your systems. This isn't wondering",
 'This is AI that can show its work. When you build AI into the core of your business, you can go so much further.',
 'This is more than AI, this is AI for business.',
 "Let's create please welcome senior vice president and director of research IBM Dr. Dario Gill",
 'Hello, welcome to the last session of the show.',
 "So I hope you've enjoyed the last two days with us."]

Clean the seamlessm4t transcript using normalizer_en:

In [11]:
cleaned_sm4t_transcript = []
for line in sm4t_transcript_in_lines:
    cleaned_sm4t_transcript.append(normalizer_en(line))

In [12]:
cleaned_sm4t_transcript[0:10]

['it is the 1st time i have ever seen a movie like this',
 'welcome to ibm think 2023 ai generated art',
 'ai generated songs ai what is that it is sure is a lot of fun but when foundation models are applied to big business well',
 'you need to think bigger because ai in business needs to be held to a higher standard built to be trusted secure and adaptable',
 'across your organization this is not committing to a single system this is hybrid ready ai that can scale across your systems this is not wondering',
 'this is ai that can show its work when you build ai into the core of your business you can go so much further',
 'this is more than ai this is ai for business',
 'let us create please welcome senior vice president and director of research ibm doctor dario gill',
 'hello welcome to the last session of the show',
 'so i hope you have enjoyed the last 2 days with us']

Create the hypothesis:

In [13]:
hypothesis = " ".join(cleaned_sm4t_transcript)
hypothesis[0:1000]

'it is the 1st time i have ever seen a movie like this welcome to ibm think 2023 ai generated art ai generated songs ai what is that it is sure is a lot of fun but when foundation models are applied to big business well you need to think bigger because ai in business needs to be held to a higher standard built to be trusted secure and adaptable across your organization this is not committing to a single system this is hybrid ready ai that can scale across your systems this is not wondering this is ai that can show its work when you build ai into the core of your business you can go so much further this is more than ai this is ai for business let us create please welcome senior vice president and director of research ibm doctor dario gill hello welcome to the last session of the show so i hope you have enjoyed the last 2 days with us what an incredible year it has been for ai you can really feel the change that is happening all around us and there is just no denying that the pace of thi

### Calculate the error rate metrics

Import the [calculate_error_rates](./utils/clean_text.py) function and calculate the most common error rate metrics:

In [14]:
from utils.clean_text import calculate_error_rates
metrics = calculate_error_rates(reference, hypothesis)
metrics

{'WER': 0.23942521756729407,
 'CER': 0.19583738136163217,
 'MER': 0.23250786163522014,
 'WIL': 0.28410521349793494,
 'WIP': 0.7158947865020651}
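
The implementation of `calculate_error_rates` is not shown here. For orientation, WER is conventionally defined as the number of word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of words in the reference. The following is a minimal sketch of that definition using a word-level edit distance. It illustrates the metric only and is not the project's code; the function name `word_error_rate` is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    # previous[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


# A short pair taken from the normalized transcripts above.
short_reference = "welcome welcome to the last session of think"
short_hypothesis = "welcome to the last session of the show"
print(word_error_rate(short_reference, short_hypothesis))
```

Because the denominator is the reference length, a hallucinated sentence at the start of the hypothesis counts as a run of insertions and raises the WER just as missed words do.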

The result in a more readable [html-format](./data/diff_normalizer.html):

In [15]:
import difflib
html_diff = difflib.HtmlDiff().make_file(reference.split(), hypothesis.split())
with open('./data/diff_normalizer_sm4t.html', 'w') as f:
    f.write(html_diff)

In [16]:
from IPython.display import HTML
HTML(html_diff)

(Rendered HTML diff table. Its first rows show the words "it is the 1st time i have ever seen a", which appear only in the hypothesis. The legend color-codes added, changed, and deleted words.)


The WER (Word Error Rate) is about 24%, which is a poor result.
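
Besides reading the HTML report, the differences can also be listed programmatically. The sketch below uses `difflib.SequenceMatcher` from the standard library to collect every non-matching span between two word lists. The helper name `list_differences` and the toy input are assumptions for illustration; on the real transcripts the word lists would be `reference.split()` and `hypothesis.split()`.

```python
import difflib


def list_differences(reference_words, hypothesis_words):
    """Return (tag, reference_text, hypothesis_text) for every non-matching span."""
    matcher = difflib.SequenceMatcher(None, reference_words, hypothesis_words, autojunk=False)
    differences = []
    for tag, ref_start, ref_end, hyp_start, hyp_end in matcher.get_opcodes():
        if tag != "equal":
            differences.append((
                tag,
                " ".join(reference_words[ref_start:ref_end]),
                " ".join(hypothesis_words[hyp_start:hyp_end]),
            ))
    return differences


# Toy example: the words "b c" are missing from the hypothesis.
toy_differences = list_differences("a b c d e".split(), "a d e".split())
print(toy_differences)
```

Spans tagged `delete` are words present only in the reference (missed speech), while spans tagged `insert` are words present only in the hypothesis (possible hallucinations).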

When we examine the comparison between transcripts, we can identify the main issues:

- Some phrases and sentences are not transcribed even within small chunks. We observed this issue with 60-second files earlier. Splitting into approximately 10-second files helps but does not resolve the problem in all cases.
- Because the splitting is automatic, some files are cut in the middle of a sentence. This creates additional issues with hallucinations, and we suspect it could be one of the reasons for the missed sentences or phrases.


## Conclusion

After multiple attempts, we can conclude that SeamlessM4T is not the best choice for processing long audio files, even in English and even after the files have been split.

We tested files that were:

- Slightly shorter than 1 minute,
- Slightly longer than 1 minute,
- Exactly 1 minute,
- 10 seconds long,
- With additional noises (music, applause, silence pauses, artificial voices),
- Without noise (only one speaker talking).

File parameters:

- Format: WAV
- Bit depth: 16-bit
- Frame rate: 16 kHz
- Channels: 1 (mono)
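
For anyone reproducing the tests, the parameters above can be checked with the standard-library `wave` module. This is a minimal sketch: the helper name `has_expected_format` is hypothetical, and the demo file is a generated second of silence, not one of the tested recordings.

```python
import os
import tempfile
import wave


def has_expected_format(path, channels=1, sample_width_bytes=2, frame_rate=16000):
    """Check that a WAV file is mono, 16-bit (2 bytes per sample), and 16 kHz by default."""
    with wave.open(path, "rb") as wav_file:
        actual = (wav_file.getnchannels(), wav_file.getsampwidth(), wav_file.getframerate())
    return actual == (channels, sample_width_bytes, frame_rate)


# Demo: write one second of 16-bit mono silence at 16 kHz and check it.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    with wave.open(demo_path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(16000)
        wav_file.writeframes(b"\x00\x00" * 16000)
    demo_ok = has_expected_format(demo_path)
    demo_matches_stereo = has_expected_format(demo_path, channels=2)

print(demo_ok, demo_matches_stereo)
```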

Similarly poor results were reproduced for 60-second files not only on a local system but also on [Hugging Face](https://huggingface.co/spaces/facebook/seamless_m4t).

There are complaints about this issue:

- https://news.ycombinator.com/item?id=37222822
- https://github.com/facebookresearch/seamless_communication/issues/82
- https://github.com/facebookresearch/seamless_communication/issues/303

SeamlessM4T has its strengths: it supports multiple languages and can translate speech to speech directly. However, Whisper is a more convenient option for ASR:

- It consumes fewer computational resources, especially Whisper.cpp, and requires less time to transcribe audio files;
- It can handle relatively long files directly without splitting them;
- It has a significantly better Word Error Rate (WER): 4% vs. 24% for our specific file.

We look forward to future releases and improvements.