# **Use our fine-tuned XLSR-Wav2Vec2 on Air Traffic Control data with 🤗 Transformers**

Wav2Vec2 is a pretrained model for Automatic Speech Recognition (ASR), released in [September 2020](https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) by Alexei Baevski, Michael Auli, and Alex Conneau. Soon after Wav2Vec2 demonstrated superior performance on the English ASR dataset LibriSpeech, *Facebook AI* presented [XLSR-Wav2Vec2](https://arxiv.org/abs/2006.13979). XLSR stands for *cross-lingual speech representations* and refers to XLSR-Wav2Vec2's ability to learn speech representations that are useful across multiple languages.

Similar to Wav2Vec2, XLSR-Wav2Vec2 learns powerful speech representations from hundreds of thousands of hours of unlabeled speech in more than 50 languages. Similar to [BERT's masked language modeling](http://jalammar.github.io/illustrated-bert/), the model learns contextualized speech representations by randomly masking feature vectors before passing them to a transformer network.

![wav2vec2_structure](https://raw.githubusercontent.com/patrickvonplaten/scientific_images/master/xlsr_wav2vec2.png)
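
The random masking of feature vectors described above can be illustrated with a small, pure-Python sketch. It is only a toy illustration under our own assumptions: the function names, the start probability, the span length, and the zero vector standing in for the mask embedding are all illustrative choices, not the actual pretraining code.

```python
import random


def sample_span_mask(num_frames, start_prob=0.065, span_length=10, seed=None):
    """Return a list of booleans, True where a frame is masked.

    Each frame is chosen as a span start with probability `start_prob`;
    the `span_length` frames beginning at each start are masked (spans may overlap).
    """
    rng = random.Random(seed)
    mask = [False] * num_frames
    for start in range(num_frames):
        if rng.random() < start_prob:
            for t in range(start, min(start + span_length, num_frames)):
                mask[t] = True
    return mask


def apply_mask(features, mask, mask_vector):
    """Replace every masked feature vector with the shared `mask_vector`."""
    return [mask_vector if is_masked else frame for frame, is_masked in zip(features, mask)]


# Toy example: 50 frames of 2-dimensional features.
features = [[float(t), 0.5 * float(t)] for t in range(50)]
mask = sample_span_mask(len(features), seed=0)
# A zero vector stands in for the mask embedding (an assumption of this sketch).
masked_features = apply_mask(features, mask, mask_vector=[0.0, 0.0])
print(f"masked {sum(mask)} of {len(mask)} frames")
```

The transformer then has to infer what was hidden at the masked positions from the surrounding context, which is what makes the learned representations contextualized.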

The authors show for the first time that massively pretraining an ASR model on cross-lingual unlabeled speech data, followed by language-specific fine-tuning on very little labeled data, achieves state-of-the-art results. See Tables 1-5 of the official [paper](https://arxiv.org/pdf/2006.13979.pdf).

(**Introduction from** the [Google Colab implementation of Patrick von Platen](https://colab.research.google.com/github/patrickvonplaten/notebooks/blob/master/Fine_Tune_XLSR_Wav2Vec2_on_Turkish_ASR_with_%F0%9F%A4%97_Transformers.ipynb?authuser=1#scrollTo=V7YOT2mnUiea).)


Our fine-tuned model is open-sourced at: https://huggingface.co/Jzuluaga/wav2vec2-xls-r-300m-en-atc-atcosim and https://huggingface.co/Jzuluaga/wav2vec2-large-960h-lv60-self-en-atc-atcosim

In this notebook, we give an initial explanation of how to use the fine-tuned XLSR-Wav2Vec2 checkpoint on Air Traffic Control data, with and without a language model (LM). Using an LM can yield better results. If you are interested, see: [Boosting Wav2Vec2 with n-grams in 🤗 Transformers](https://huggingface.co/blog/wav2vec2-with-ngram).

- We also provide a documented setup for training an LM with KenLM and integrating it with this model in our GitHub repository: [How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications](https://github.com/idiap/w2v2-air-traffic)

For demonstration purposes, we fine-tune the [wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on the low-resource ATC dataset:

- ATCOSIM ASR dataset: more information is available on the [ATCOSIM website](https://www.spsc.tugraz.at/databases-and-tools/atcosim-air-traffic-control-simulation-speech-corpus.html)
- However, **do not worry**: we have prepared the database in the [Datasets](https://huggingface.co/docs/datasets/index) format here: [ATCOSIM CORPUS on HuggingFace](https://huggingface.co/datasets/Jzuluaga/atcosim_corpus). You can scroll through the train/test partitions and even listen to some audio samples.

# Data exploration of the model - Load ATCOSIM dataset

We need to load the dataset in HuggingFace format. The dataset is here: [ATCOSIM CORPUS on HuggingFace](https://huggingface.co/datasets/Jzuluaga/atcosim_corpus)

In [None]:
from datasets import load_dataset, Audio

dataset_id = "Jzuluaga/atcosim_corpus"

# We only load the 'test' partition; to load the 'train' partition, change the arguments below accordingly.
atcosim_corpus_test = load_dataset(dataset_id, "test", split="test")



This is a short function that displays some random samples from the dataset. Only for visualization purposes:

In [None]:
from datasets import ClassLabel
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=10):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)
    
    df = pd.DataFrame(dataset[picks])
    display(HTML(df.to_html()))

In [None]:
show_random_elements(atcosim_corpus_test.remove_columns(["audio", "segment_start_time", "segment_end_time"]), num_examples=10)

| | id | text | duration |
|---|---|---|---|
| 0 | atcosim_zf1_05_147_000000_000263 | contact milan one three four five two good bye | 2.63 |
| 1 | atcosim_zf1_05_148_000000_000273 | u s air one four descend to flight level two nine zero | 2.73 |
| 2 | atcosim_zf1_05_152_000000_000303 | air france three five six reims one three four four good bye | 3.04 |
| 3 | atcosim_zf1_05_146_000000_000350 | lufthansa four three nine three descend to flight level two seven zero | 3.5 |
| 4 | atcosim_zf1_05_153_000000_000308 | viva nine zero eight one descend to flight level two nine zero | 3.08 |
| 5 | atcosim_zf1_05_150_000000_000301 | swiss air four eight eight climb to flight level three two zero | 3.01 |
| 6 | atcosim_zf1_05_154_000000_000344 | swiss air four eight eight contact rhein one three two decimal four adieu | 3.45 |
| 7 | atcosim_zf1_05_155_000000_000359 | air france one five five four milan one three four five two good bye | 3.59 |
| 8 | atcosim_zf1_05_151_000000_000308 | air france three five six set course to morok | 3.08 |
| 9 | atcosim_zf1_05_149_000000_000270 | aero lloyd five one seven set course direct to karlsruhe | 2.7 |


## Load Tokenizer, Feature Extractor and Model
ASR models transcribe speech to text, which means that we need both a feature extractor that converts the speech signal into the model's input format, *e.g.* a feature vector, and a tokenizer that converts the model's output into text.

In 🤗 Transformers, the XLSR-Wav2Vec2 model is thus accompanied by both a tokenizer, called [Wav2Vec2CTCTokenizer](https://huggingface.co/transformers/master/model_doc/wav2vec2.html#wav2vec2ctctokenizer), and a feature extractor, called [Wav2Vec2FeatureExtractor](https://huggingface.co/transformers/master/model_doc/wav2vec2.html#wav2vec2featureextractor). Here, we load the model with the `AutoModelForCTC` class and the processors with the `Wav2Vec2Processor` and `Wav2Vec2ProcessorWithLM` classes.
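
To give an idea of what "processing the speech signal to the model's input format" can involve, here is a minimal pure-Python sketch of zero-mean, unit-variance waveform normalization. It assumes the feature extractor is configured with normalization enabled; the function name, the `eps` value, and the toy sine wave are our own illustrative choices, not the library's code.

```python
import math


def zero_mean_unit_variance(waveform, eps=1e-7):
    """Normalize a 1-D waveform (list of floats) to zero mean and unit variance."""
    n = len(waveform)
    mean = sum(waveform) / n
    variance = sum((x - mean) ** 2 for x in waveform) / n
    scale = math.sqrt(variance + eps)  # eps guards against division by zero on silence
    return [(x - mean) / scale for x in waveform]


# Toy waveform: 10 ms of a 440 Hz sine sampled at 16 kHz, with a DC offset.
waveform = [0.3 + 0.1 * math.sin(2 * math.pi * 440 * t / 16000) for t in range(160)]
normalized = zero_mean_unit_variance(waveform)
print(f"mean after normalization: {sum(normalized) / len(normalized):.2e}")
```

In practice you never need to do this by hand: calling the processor on the raw audio array takes care of the preprocessing.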

Let's start by loading the processors, which include the tokenizer responsible for decoding the model's predictions.
We also download the model, which is public here: [XLSR-Wav2Vec2 model fine-tuned on ATC data](https://huggingface.co/Jzuluaga/wav2vec2-xls-r-300m-en-atc-atcosim)


In [None]:
import torch
from transformers import AutoModelForCTC, Wav2Vec2Processor, Wav2Vec2ProcessorWithLM
import torchaudio.functional as F

# ID of the model, link: https://huggingface.co/Jzuluaga/wav2vec2-xls-r-300m-en-atc-atcosim
model_id = "Jzuluaga/wav2vec2-xls-r-300m-en-atc-atcosim"

# load the model, you can ignore the warnings
model = AutoModelForCTC.from_pretrained(model_id)
processor_with_lm = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)
processor_without_lm = Wav2Vec2Processor.from_pretrained(model_id)





## Load the sample in memory

We load one sample into memory to show how to run our model.

In [83]:
# load one sample into memory
sample = next(iter(atcosim_corpus_test))
file_sampling_rate = sample["audio"]["sampling_rate"]

# the model is fed 16 kHz audio, so resample if the file uses another rate
if file_sampling_rate != 16000:
    resampled_audio = F.resample(torch.tensor(sample["audio"]["array"]), file_sampling_rate, 16000).numpy()
else:
    resampled_audio = torch.tensor(sample["audio"]["array"]).numpy()

# extract the input values; passing sampling_rate explicitly avoids silent sampling-rate mismatches
input_values = processor_without_lm(resampled_audio, sampling_rate=16000, return_tensors="pt").input_values


## Evaluate model (run forward pass)
Here, we perform inference with the model: first decoding with the LM, then greedy decoding without it.

In [86]:
# run forward pass of the model
with torch.no_grad():
    logits = model(input_values).logits
    
# get the transcription with processor_with_lm (LM-boosted decoding)
transcription = processor_with_lm.batch_decode(logits.numpy()).text
transcription

['u s air one four descend to flight level two nine zero']
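
To quantify how close a transcription is to the reference in the dataset's `text` column, a common metric is the word error rate (WER). Below is a minimal pure-Python sketch of it; the function name is ours, and the second hypothesis is a made-up example with one word dropped, not an actual model output.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: word-level edit distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] = edit distance between the reference words processed so far
    # and the first j hypothesis words (starts with the empty reference).
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


reference = "u s air one four descend to flight level two nine zero"
# Hypothetical hypothesis with the word "to" missing (for illustration only).
hypothesis = "u s air one four descend flight level two nine zero"

print(word_error_rate(reference, reference))   # identical strings
print(word_error_rate(reference, hypothesis))  # one deletion out of twelve words
```

For evaluation over a whole test set, the WER is usually computed over all utterances together (total errors divided by total reference words) rather than averaged per utterance.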

In [90]:
# greedy (argmax) decoding without the language model
pred_ids = torch.argmax(logits, dim=-1)
transcription = processor_without_lm.batch_decode(pred_ids)
transcription
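
Conceptually, decoding without an LM takes the most likely token per frame and then applies the CTC rules: merge consecutive repeats, drop blank tokens, and turn word delimiters into spaces. The following pure-Python sketch shows that idea on a toy example. The vocabulary, the ids, and the function name are hypothetical, and we assume the padding token plays the role of the CTC blank and `|` is the word delimiter.

```python
from itertools import groupby


def ctc_greedy_collapse(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse per-frame argmax ids into text following the CTC rules."""
    # 1) merge consecutive repeated ids
    collapsed = [token_id for token_id, _ in groupby(frame_ids)]
    # 2) drop blank tokens and map the remaining ids to their tokens
    tokens = [id_to_token[token_id] for token_id in collapsed if token_id != blank_id]
    # 3) word delimiters become spaces
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary (not the real vocabulary of the model).
id_to_token = {0: "<pad>", 1: "|", 2: "o", 3: "n", 4: "e", 5: "t", 6: "w"}
# Per-frame argmax ids, as torch.argmax(logits, dim=-1) would produce for one utterance.
frame_ids = [0, 2, 2, 0, 3, 4, 4, 1, 1, 0, 5, 6, 0, 2, 2]

print(ctc_greedy_collapse(frame_ids, id_to_token))  # prints "one two"
```

The blank token is what lets the model emit a genuinely doubled letter: two identical ids separated by a blank survive the merge step, while two adjacent identical ids collapse into one.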


That's all! If you have more questions, reach out to us on GitHub: https://github.com/idiap/w2v2-air-traffic

# Cite us
If you use this code for your research, please cite our recent papers that involved this work:

```
@article{zuluaga2022how,
    title={How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications},
    author={Zuluaga-Gomez, Juan and Prasad, Amrutha and Nigmatulina, Iuliia and Sarfjoo, Saeed and Motlicek, Petr and Kleinert, Matthias and Helmke, Hartmut and Ohneiser, Oliver and Zhan, Qingran},
    journal={IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar},
    year={2022}
  }
```

and,

```
@article{zuluaga2022bertraffic,
  title={BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications},
  author={Zuluaga-Gomez, Juan and Sarfjoo, Seyyed Saeed and Prasad, Amrutha and Nigmatulina, Iuliia and Motlicek, Petr and Ondre, Karel and Ohneiser, Oliver and Helmke, Hartmut},
  journal={IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar},
  year={2022}
  }
```