<a href="https://colab.research.google.com/github/jimregan/wav2vec2-sprint/blob/comparison/Irish_comparisons.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Pocketsphinx

The pocketsphinx model doesn't have a real language model: it was built from a dictionary, so it only covers single words, and headwords at that. To show it in the best light, I'm using the data from the website for [Fuaimeanna na Gaeilge](http://www.fuaimeanna.ie/en/) *("The Sounds of Irish")*, which has equivalent single-word pronunciation examples. I've put an old scraper for the site [here](https://github.com/jimregan/wav2vec2-sprint/blob/main/irish/fuaimeanna.pl); it writes a .tsv file with the data, and a shell script that uses wget to download the sounds. I already have the data, so I'm just uploading it, but during the sprint I wrote a [script](https://github.com/jimregan/wav2vec2-sprint/blob/main/irish/convert-fuaimeanna-csv.pl) to convert the .tsv to a .csv that `datasets` could read more easily.

Setting up is easy:

In [None]:
!apt-get install pocketsphinx

Next, grab the pretrained model:

In [None]:
!wget https://github.com/jimregan/irish-asr-data/releases/download/teanglann-0.1/cmusphinx-ga-teanglann-0.1.zip

In [None]:
!unzip cmusphinx-ga-teanglann-0.1.zip

In [None]:
!unzip fuaimeanna.zip

Pocketsphinx comes from the bad old days before audio libraries could be relied on to be present, so the files need to be converted to 16 kHz, mono, 16-bit `.wav`.

In [None]:
!for i in fuaimeanna/mp3/*.mp3;do ffmpeg -i "$i" -acodec pcm_s16le -ac 1 -ar 16000 "$i.wav";done
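
A quick sanity check on the converted files can save a confusing recognition run later. The sketch below uses only Python's standard `wave` module to confirm that a file matches what the `ffmpeg` flags above ask for (`-ac 1`, `-ar 16000`, `pcm_s16le`). The function name `is_pocketsphinx_ready` is made up for this example.

```python
import glob
import wave


def is_pocketsphinx_ready(path, rate=16000):
    """Return True if `path` is a mono, 16-bit PCM WAV file at `rate` Hz."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1          # mono, as requested by `-ac 1`
            and wav.getsampwidth() == 2      # 2 bytes per sample, i.e. pcm_s16le
            and wav.getframerate() == rate   # sample rate, as requested by `-ar 16000`
        )


if __name__ == "__main__":
    # Lists nothing if the directory does not exist.
    bad = [p for p in glob.glob("fuaimeanna/mp3/*.wav")
           if not is_pocketsphinx_ready(p)]
    print(f"{len(bad)} file(s) need re-conversion")
```
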

In [None]:
!for i in fuaimeanna/mp3/*.wav; do f=$(basename "$i"); printf "%s\t" "$f" >> ps-output; pocketsphinx_continuous -infile "$i" -hmm cmusphinx-ga-teanglann-0.1/ -dict cmusphinx-ga-teanglann-0.1/ga.dic -lm cmusphinx-ga-teanglann-0.1/ga.lm.DMP >> ps-output; done

In [None]:
!pip install jiwer

In [48]:
import csv

def get_lists(filea, fileb="/content/fuaimeanna/all-fuaimeanna-data.tsv"):
  """Pair recognizer output (filea) with the reference spellings (fileb).

  Returns (recognizer outputs, reference spellings), in that order.
  """
  # Map each sound file name to its orthographic form; the sound files
  # are listed in columns 1, 3 and 5 of the .tsv.
  data = dict()
  with open(fileb) as file:
    rows = csv.reader(file, delimiter="\t", quotechar=None)
    for row in rows:
      if row[0] == 'Orthographic':  # skip the header row
        continue
      for col in (1, 3, 5):
        data[row[col].replace('/sounds/', '')] = row[0]
  merged = list()
  with open(filea) as file:
    ps = csv.reader(file, delimiter="\t", quotechar=None)
    for row in ps:
      if len(row) != 2:
        continue
      # Output lines are keyed on e.g. "x.mp3.wav"; the .tsv uses "x.mp3"
      filename = row[0].replace('.wav', '')
      merged.append((row[1], data[filename]))
  lista = [a[0] for a in merged]
  listb = [a[1] for a in merged]
  return (lista, listb)
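
One thing to watch with the join above is that `data[filename]` raises `KeyError` as soon as a file in the recognizer output is missing from the .tsv. The sketch below shows the same join on in-memory data, collecting unknown file names instead of failing. The helper names (`build_reference_map`, `join_output`) and the sample rows are made up for illustration; only the column layout (orthographic form in column 0, `/sounds/...` paths in columns 1, 3 and 5) is taken from `get_lists`.

```python
import csv
import io


def build_reference_map(tsv_file):
    """Map each audio file name to its orthographic form."""
    refs = {}
    for row in csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if row[0] == "Orthographic":  # header row
            continue
        for col in (1, 3, 5):  # the three sound-file columns
            refs[row[col].replace("/sounds/", "")] = row[0]
    return refs


def join_output(output_file, refs):
    """Pair recognizer output with references; collect unknown file names."""
    pairs, missing = [], []
    for row in csv.reader(output_file, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) != 2:
            continue
        name = row[0].replace(".wav", "")
        if name in refs:
            pairs.append((row[1], refs[name]))
        else:
            missing.append(row[0])
    return pairs, missing


# Made-up sample rows, for illustration only (not taken from the real data).
sample_tsv = io.StringIO(
    "Orthographic\tA\tx\tB\ty\tC\tz\n"
    "focal\t/sounds/focal_i1_s1.mp3\t-\t/sounds/focal_i2_s2.mp3\t-"
    "\t/sounds/focal_i3_s3.mp3\t-\n"
)
sample_output = io.StringIO("focal_i1_s1.mp3.wav\tfocal\nunknown.mp3.wav\tis\n")

pairs, missing = join_output(sample_output, build_reference_map(sample_tsv))
print(pairs)    # [('focal', 'focal')]
print(missing)  # ['unknown.mp3.wav']
```
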

In [49]:
from jiwer import wer
lista, listb = get_lists("ps-output")
result = wer(lista, listb)
'{:.2f}'.format(result)

'0.99'
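
For anyone unfamiliar with the metric, the number that `wer` reports can be sketched in a few lines of plain Python. This is a minimal illustration, not jiwer's actual implementation: `edit_distance` and `word_error_rate` are names made up for this sketch, and the placeholder words `w1` to `w4` are not real data. It follows jiwer's documented argument order (reference first, then hypothesis); note that `get_lists` above returns the recognizer output first.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance, computed one row at a time."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, at no cost)
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Total word errors divided by the total number of reference words."""
    errors = sum(edit_distance(r.split(), h.split())
                 for r, h in zip(references, hypotheses))
    total = sum(len(r.split()) for r in references)
    return errors / total


references = ["w1 w2", "w3", "w4"]
hypotheses = ["", "", "w4"]  # a recognizer that mostly outputs nothing

print(word_error_rate(references, hypotheses))  # 0.75
print(word_error_rate(hypotheses, references))  # 3.0 (arguments swapped)
```

Because the denominator is the number of reference words, the score is a ratio rather than a percentage, it can exceed 1, and swapping the two arguments changes the result.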

## DeepSpeech

In [None]:
!pip install deepspeech

In [None]:
!wget https://github.com/jimregan/DeepSpeech/releases/download/0.8.2-ga-test/output_graph_ga.pbmm https://github.com/jimregan/DeepSpeech/releases/download/0.8.2-ga-test/kenlm.scorer

In [None]:
!for i in fuaimeanna/mp3/*.wav; do f=$(basename "$i"); printf "%s\t" "$f" >> ds-output; deepspeech --model output_graph_ga.pbmm --scorer kenlm.scorer --audio "$i" >> ds-output; done

In [50]:
from jiwer import wer
lista, listb = get_lists("ds-output")
result = wer(lista, listb)
'{:.2f}'.format(result)

'7.83'

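At first glance, 7.83 looks pretty impressive! But it's a false impression: `wer` returns a ratio, not a percentage (compare the 0.99 above), so this is 783%, not 7.83%. A look at the raw output shows how little was actually recognised: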

In [51]:
!head ds-output

aaineas_i1_s1.mp3.wav	
aaineas_i2_s2.mp3.wav	
aaineas_i3_s3.mp3.wav	
aaine_i1_s1.mp3.wav	
aaine_i2_s2.mp3.wav	
aaine_i3_s3.mp3.wav	
aaisiuuil_i1_s1.mp3.wav	is
aaisiuuil_i2_s2.mp3.wav	is 
aaisiuuil_i3_s3.mp3.wav	is
aa_ndiiol_i1_s1.mp3.wav	ní


In [52]:
!awk -F'\t' 'BEGIN{c=0}($2==""){c++}END{print "Lines: " NR " Without output: " c}' ds-output

Lines: 2276 Without output: 1968

In other words, only 308 of the 2276 files produced any output at all.


In [53]:
!cat ds-output |awk -F'\t' '{print $2}'|sort|uniq


a 
ach
ach 
an
an 
ar an 
i 
is
is 
is as
is as 
is í 
ní
ní 
sa
sa 
seans
sin
tá 
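
The two `awk` one-liners above can also be reproduced in Python, which makes it easy to collapse variants that differ only in trailing whitespace (`is` and `is ` are listed separately above). This is a rough sketch: the function name `summarise` is made up, and unlike the `awk` version it skips blank lines and treats whitespace-only output as empty. The sample lines are the first ten lines of `ds-output` shown earlier.

```python
from collections import Counter


def summarise(lines):
    """Count empty transcriptions and tally the distinct non-empty ones."""
    total = empty = 0
    outputs = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        total += 1
        _, _, text = line.partition("\t")  # file name, tab, transcription
        text = text.strip()
        if text:
            outputs[text] += 1
        else:
            empty += 1
    return total, empty, outputs


sample = [
    "aaineas_i1_s1.mp3.wav\t",
    "aaineas_i2_s2.mp3.wav\t",
    "aaineas_i3_s3.mp3.wav\t",
    "aaine_i1_s1.mp3.wav\t",
    "aaine_i2_s2.mp3.wav\t",
    "aaine_i3_s3.mp3.wav\t",
    "aaisiuuil_i1_s1.mp3.wav\tis",
    "aaisiuuil_i2_s2.mp3.wav\tis ",
    "aaisiuuil_i3_s3.mp3.wav\tis",
    "aa_ndiiol_i1_s1.mp3.wav\tní",
]

total, empty, outputs = summarise(sample)
print(total, empty)           # 10 6
print(outputs.most_common())  # [('is', 3), ('ní', 1)]
```

For the real file, passing an open file handle, e.g. `summarise(open("ds-output", encoding="utf-8"))`, should work the same way.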
