<a href="https://colab.research.google.com/github/jimregan/wav2vec2-sprint/blob/comparison/Irish_comparisons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Background

My "adventure" with ASR for Irish started a few years ago; I had been interested in it since seeing Star Trek as a kid, but when I started to get involved in NLP, datasets were rare, and those that did exist cost large amounts of money.

My master's was in Speech and Language Processing, which was a strange mix of computing and linguistics: as many of the people I knew went on to work as speech therapists as went on to be programmers. During one of the phonetics assignments (comparing the acoustic properties of vowels in two dialects of English), not having slept enough, I deleted the working copy of my data instead of the various scratch folders I'd meant to delete. There wasn't enough time to start again, so I wrote a quick script to wrap CMU pocketsphinx, and used that to find the vowels instead (pocketsphinx is phoneme-based). My phonetics professor is the instigator of the [Abair project](https://www.abair.tcd.ie/), the first speech synthesiser for Irish, so speech recognition for Irish was an item of interest for her; over the next couple of weekends I scraped some pronunciation data from [the Irish Pronunciation Database](https://www.teanglann.ie/en/fuaim/) and trained a pocketsphinx-based recogniser. The results were not good, but they showed that (multi-dialect) speech recognition for Irish was at least possible.

A year and a half later, I had to write my dissertation. My topic was machine translation: the "Attention is All You Need" paper came out while I was writing, so while it was clear that NMT was the way forward, it wasn't quite as obvious as it is now. Still, I felt like I should try to include NMT. My brother had just bought a new PC, and his old one had a GPU, so he held off on selling it while I wrote so that I could use it. I'd already bitten off more than I could chew, and this was really too much extra, but because my brother had loaned his PC to me, I felt obliged to make use of it, so the rest of the writing process was a descent into madness: I still have an aversion to machine translation, and I thoroughly burned my bridges with my former supervisor in avoiding her attempts to talk some sense into me. When the thing was finally submitted, I wasn't ready to try to be a functioning member of society again, and my brother didn't ask for his PC back for a couple of weeks, so I went back to ASR, and spent the time figuring out how to train a Kaldi model. The results from the Kaldi DNN model were worse than those from the GMM model (not enough data), and the GMM results were worse than the Sphinx results.

Common Voice had launched a little before this, but was originally English-only; when they internationalised the codebase, they had a requirement of at least 1000 sentences to be read per language, and those sentences had to be public domain. For Irish, this was quite difficult: anything that's out of copyright is pre-standard (and difficult to read for most people). It took a while, but eventually I managed to collect the sentences, and Kevin Scannell translated the interface, which was the other major stumbling block.

Finding work after the master's proved to be quite difficult; in the end, I took advantage of the linguistics part, and at the start of the summer, I agreed to take a job teaching English in Poland, to start in September. In the meantime, Abair had branched out to working on ASR; I had a desk in their lab while I was studying there, and it had become like a second home, so while visiting, I shared some data I had collected, and they offered me a job for the summer to collect more. (And if I hadn't taken the teaching job, it's quite possible I would still be working there.)

tl;dr - I've been trying to get this working for a long time.

## Pocketsphinx

The pocketsphinx model doesn't have a real language model (it came from a dictionary, so it's single words only, headwords only), so to show it in the best light, I'm using the data from the website for [Fuaimeanna na Gaeilge](http://www.fuaimeanna.ie/en/) *("The Sounds of Irish")*, which has equivalent pronunciation examples. I've put an old scraper for the site [here](https://github.com/jimregan/wav2vec2-sprint/blob/main/irish/fuaimeanna.pl); this writes a .tsv file with the data, and a shell script that uses wget to download the sounds. I already have the data, so I'm just uploading it, but during the sprint I wrote a [script](https://github.com/jimregan/wav2vec2-sprint/blob/main/irish/convert-fuaimeanna-csv.pl) to convert the .tsv to a .csv that `datasets` could read more easily.

Setting up is easy:

In [None]:
!apt-get install pocketsphinx

Next, grab the pretrained model:

In [None]:
!wget https://github.com/jimregan/irish-asr-data/releases/download/teanglann-0.1/cmusphinx-ga-teanglann-0.1.zip

In [None]:
!unzip cmusphinx-ga-teanglann-0.1.zip

In [None]:
!unzip fuaimeanna.zip

Pocketsphinx comes from the bad old days before audio libraries were something that could be relied on being present, so the files need to be converted to 16 kHz, mono, 16-bit PCM `.wav`:

In [None]:
!for i in fuaimeanna/mp3/*.mp3;do ffmpeg -i "$i" -acodec pcm_s16le -ac 1 -ar 16000 "$i.wav";done
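
Since pocketsphinx is picky about its input, it can be worth checking a converted file before running the recogniser. The following is a minimal sketch using only Python's standard `wave` module; the function name `is_16k_mono_pcm` is made up for illustration and is not part of pocketsphinx or of this notebook.

```python
import wave


def is_16k_mono_pcm(path):
    """Return True if `path` is a 16 kHz, mono, 16-bit PCM WAV file.

    Hypothetical helper for illustration only. Note that `wave.open`
    raises `wave.Error` for files that are not PCM WAV at all.
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == 16000   # 16 kHz sample rate
            and wav.getnchannels() == 1   # mono
            and wav.getsampwidth() == 2   # 2 bytes per sample = 16-bit
        )
```

Running it over everything matched by `glob.glob("fuaimeanna/mp3/*.wav")` would flag any file that the conversion step missed.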

In [None]:
!for i in fuaimeanna/mp3/*.wav; do f=$(basename "$i"); printf "%s\t" "$f" >> ps-output; pocketsphinx_continuous -infile "$i" -hmm cmusphinx-ga-teanglann-0.1/ -dict cmusphinx-ga-teanglann-0.1/ga.dic -lm cmusphinx-ga-teanglann-0.1/ga.lm.DMP >> ps-output; done
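
The loop leaves `ps-output` as a file of `filename<TAB>hypothesis` records. A rough sketch of reading that back into Python follows; `parse_ps_output` is a hypothetical helper, and it assumes that the recogniser prints at least a newline for every file and that any extra line without a tab belongs to the preceding file (neither assumption has been checked against real output here).

```python
def parse_ps_output(text):
    """Map each file name to the text recognised for it.

    Assumption: every record starts with "<file name>\t". A non-empty
    line with no tab is treated as a continuation of the previous record.
    """
    results = {}
    current = None
    for line in text.splitlines():
        if "\t" in line:
            # New record: the file name comes before the first tab.
            current, hypothesis = line.split("\t", 1)
            results[current] = hypothesis.strip()
        elif current is not None and line.strip():
            # Continuation line: append it to the current hypothesis.
            results[current] = (results[current] + " " + line.strip()).strip()
    return results


# Typical use (assumes the loop has already written ps-output):
# with open("ps-output", encoding="utf-8") as f:
#     hypotheses = parse_ps_output(f.read())
```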

In [None]:
!pip install jiwer
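
`jiwer` is used for scoring: it computes word error rate (WER), the word-level edit distance between the reference and the recogniser's output, divided by the number of words in the reference. As a sketch of what that number means, here is a plain-Python version; `word_error_rate` is a hypothetical function written for illustration, not jiwer's implementation, and it does no text normalisation.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

In the notebook itself, `jiwer.wer(reference, hypothesis)` is the call to reach for; because the pocketsphinx model only outputs single headwords, the per-item score here will mostly be either 0 or 1.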