# wav2vec-u Common Voice Swedish - prepare ltr/phn/wrd
> "ltr/phn/wrd preparation for wav2vec-u on Common Voice Swedish"

- toc: false
- branch: master
- badges: false
- comments: true
- categories: [kaggle, wav2vec-u]

Original [here](https://www.kaggle.com/jimregan/wav2vec-u-cv-swedish-prep-ltr-phn-wrd)

The section [Preparation of speech and text data](https://github.com/pytorch/fairseq/tree/master/examples/wav2vec/unsupervised#preparation-of-speech-and-text-data) of the README says:

> Similar to [wav2vec 2.0](https://github.com/pytorch/fairseq/blob/master/examples/wav2vec/README.md),  data folders contain {train,valid,test}.{tsv,wrd,phn} files, where audio paths are stored in tsv files, and word, letter or phoneme transcriptions are stored in .{wrd,ltr,phn}.

The `.wrd` and `.ltr` files are the outputs of `libri_labels.py`.

In [None]:
%%capture
!pip install phonemizer

In [None]:
%%capture
!apt-get -y install espeak

In [None]:
%%capture
!apt-get -y install zsh

This is just my best guess at what the `.wrd` files contain; it seems to match what `libri_labels.py` does: given input like
```
1272-128104-0000 MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL
```
it does `" ".join(items[1:])`, which drops the utterance ID and keeps only the transcript, so the result is basically the same.

In [None]:
!cat /kaggle/input/download-common-voice-swedish/cv-corpus-6.1-2020-12-11/sv-SE/test.tsv | awk -F'\t' '{print $3}'|grep -v '^sentence$' | perl -C7 -ane 'chomp;$_=lc($_);s/[^\p{L}\p{N}\p{M}'"\'"' \-]/ /g;s/  +/ /g;s/ $//;s/^ //;print "$_\n";' > test.wrd
!cat /kaggle/input/download-common-voice-swedish/cv-corpus-6.1-2020-12-11/sv-SE/dev.tsv | awk -F'\t' '{print $3}'|grep -v '^sentence$' | perl -C7 -ane 'chomp;$_=lc($_);s/[^\p{L}\p{N}\p{M}'"\'"' \-]/ /g;s/  +/ /g;s/ $//;s/^ //;print "$_\n";' > valid.wrd
!cat /kaggle/input/download-common-voice-swedish/cv-corpus-6.1-2020-12-11/sv-SE/train.tsv | awk -F'\t' '{print $3}'|grep -v '^sentence$' | perl -C7 -ane 'chomp;$_=lc($_);s/[^\p{L}\p{N}\p{M}'"\'"' \-]/ /g;s/  +/ /g;s/ $//;s/^ //;print "$_\n";' > train.wrd
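
For readers who find the Perl one-liner hard to follow, here is a rough Python sketch of the same normalisation: lowercase, replace anything that is not a letter, number, combining mark, apostrophe, space, or hyphen with a space, then collapse and trim spaces. The function name `normalize` is hypothetical, and Python's `str.lower()` may differ from Perl's `lc` on a few unusual characters, so treat this as an approximation rather than an exact port.

```python
import unicodedata

# Characters kept in addition to letters (L), numbers (N) and marks (M)
KEEP = set("' -")


def normalize(sentence: str) -> str:
    """Approximate the Perl clean-up used to build the .wrd files."""
    lowered = sentence.lower()
    chars = [
        ch if unicodedata.category(ch)[0] in "LNM" or ch in KEEP else " "
        for ch in lowered
    ]
    # split() + join collapses runs of spaces and trims both ends
    return " ".join("".join(chars).split())


print(normalize("Vad är det i euro?"))
```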


In [None]:
for i in ['train', 'test', 'valid']:
    with open(f'/kaggle/working/{i}.wrd', 'r') as inf, open(f'/kaggle/working/{i}.ltr', 'w') as out:
        for line in inf:
            # As in libri_labels.py: word boundaries become "|", letters are space-separated,
            # and every line ends with a final " |"
            print(" ".join(line.strip().replace(" ", "|")) + " |", file=out)

In [None]:
!head train.ltr

v a d | ä r | d e t | i | e u r o |
d u | s k a | v e t a | a t t | d e t | ä r | d u | s o m | h a r | f e l |
g å | n e r | p å | k n ä |
f ö r s t | m å s t e | j a g | s l å | s ö n d e r | d e n | d ä r | s t o r a | s k r o t h ö g e n |
d e t | b l i r | s v å r t |
v a d | f ö r | j ä v l a | f r å g a | ä r | d e t |
j a g | å t e r v ä n d e r | i n t e | t i l l | s k i t h å l e t |
t i t t a | p å | s ö m m a r n a |
f e s | d u | p r e c i s |
a k t r i s e r | h a r | e t t | b ä s t | f ö r e d a t u m |
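
The letter transformation is easier to check on a single sentence. A minimal sketch, with the hypothetical helper name `to_letters`, applying the same expression as the cell above to the first training sentence:

```python
def to_letters(words: str) -> str:
    """Turn a space-separated transcript into the .ltr format."""
    # Word boundaries become "|", then every character is space-separated
    return " ".join(words.strip().replace(" ", "|")) + " |"


print(to_letters("vad är det i euro"))
```

This should reproduce the first line of `train.ltr` shown above.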


The phonemizer emits some warnings about language switching, so echo the filename first to know which file each warning refers to.

In [None]:
!for i in train test valid; do echo $i.wrd; cat $i.wrd | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -o $i.phn -p ' ' -w '' -l sv  -j 70 --language-switch remove-flags ;done

train.wrd
test.wrd
valid.wrd


In [None]:
!awk 'NR==81' test.wrd
!awk 'NR==254||NR==1457' train.wrd
!awk 'NR==1831' valid.wrd

det är taskigt
och så unik design
internet slutade fungera
det finns inget internet


In [None]:
!awk 'NR==81' test.phn
!awk 'NR==254||NR==1457' train.phn
!awk 'NR==1831' valid.phn

d eː t ɛː r  t a s k ɪ ɡ t  
ɔ k s oː ɵ n iː k  d ɪ z aɪ n  
 ɪ n t ə n ɛ t  s l ʉ t a d ə f ɵ n ɡ eː r a 
d eː t f ɪ n s ɪ ŋ ə t  ɪ n t ə n ɛ t  


"design" and "internet" are clearly the English words that are causing the switch in their respective sentences, but I'm not sure what the problem in test.wrd is: "taskigt"?
* [design](https://en.wiktionary.org/wiki/design#Swedish) `/dɛˈsajn/`
* [internet](https://en.wiktionary.org/wiki/internet#Swedish) `/ˈɪntɛrnɛt/, /ɪntɛrˈnɛt/`

In [None]:
!echo taskigt|espeak -v sv --ipa 2> /dev/null

 (en)tˈaskɪɡt(sv)
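
The `(en)` and `(sv)` markers in this output are the language-switch flags that the warnings refer to. A minimal sketch for spotting and stripping them from raw espeak output, assuming the flags always look like a short lowercase language code in parentheses (an assumption based only on the output shown here); the helper names are hypothetical:

```python
import re

# Assumed flag shape: "(en)", "(sv)", possibly with a variant such as "(en-us)"
FLAG = re.compile(r"\(([a-z]{2,3}(?:-[a-z0-9]+)*)\)")


def switched_languages(espeak_output: str) -> list:
    """Return the language codes of all switch flags, in order."""
    return FLAG.findall(espeak_output)


def strip_flags(espeak_output: str) -> str:
    """Remove the flags and surrounding whitespace."""
    return FLAG.sub("", espeak_output).strip()


raw = " (en)tˈaskɪɡt(sv)"
print(switched_languages(raw))
print(strip_flags(raw))
```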


In [None]:
!cat test.phn|sed -e 's/^ //;s/t a s k ɪ ɡ t/t a s k ɪ t/' > tmp
!mv tmp test.phn
!cat train.phn|sed -e 's/^ //;s/d ɪ z aɪ n/d ɛ s a j n/;s/ɪ n t ə n ɛ t/ɪ n t ɛ r n ɛ t/' > tmp
!mv tmp train.phn
!cat valid.phn|sed -e 's/^ //;s/ɪ n t ə n ɛ t/ɪ n t ɛ r n ɛ t/' > tmp
!mv tmp valid.phn
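
The `sed` fixes above can also be written in Python. This is only a sketch: the name `fix_line` is hypothetical, and unlike the cells above it applies all three corrections to every file instead of choosing them per file. Like `sed` without the `g` flag, it replaces only the first occurrence on each line, and it matches plain substrings, so it could in principle hit part of a longer word.

```python
# Phoneme strings produced via the English voice, mapped to Swedish replacements
CORRECTIONS = {
    "t a s k ɪ ɡ t": "t a s k ɪ t",
    "d ɪ z aɪ n": "d ɛ s a j n",
    "ɪ n t ə n ɛ t": "ɪ n t ɛ r n ɛ t",
}


def fix_line(line: str) -> str:
    """Drop one leading space and apply each correction once."""
    if line.startswith(" "):
        line = line[1:]
    for wrong, right in CORRECTIONS.items():
        line = line.replace(wrong, right, 1)
    return line


print(fix_line(" ɪ n t ə n ɛ t  s l ʉ t a d ə"))
```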

In [None]:
!for i in train test valid; do cat $i.wrd|tr ' ' '\n'|sort|uniq |grep -v '^internet$'|grep -v '^design$'|grep -v '^taskigt$' > /tmp/$i.wl; cat /tmp/$i.wl | PHONEMIZER_ESPEAK_PATH=$(which espeak) phonemize -o /tmp/$i.wl.phn -p ' ' -w '' -l sv  -j 70 --language-switch remove-flags;paste /tmp/$i.wl /tmp/$i.wl.phn > dict.$i; done
!printf "taskigt\tt a s k ɪ t\n" >> dict.test
!printf "design\td ɛ s a j n\n" >> dict.train
!printf "internet\tɪ n t ɛ r n ɛ t\n" >> dict.train
!printf "internet\tɪ n t ɛ r n ɛ t\n" >> dict.valid
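
The cell above builds one pronunciation dictionary per split by pasting the word list next to its phonemised counterpart and then appending the hand-corrected entries. A rough Python sketch of that pairing step, assuming the phonemiser returns exactly one output line per input word; `build_lexicon` and its parameters are hypothetical names, and Python sorts by code point, which may order entries differently from the locale-aware `sort` command:

```python
def build_lexicon(words, phones, overrides=None):
    """Pair words with pronunciations and return sorted tab-separated lines."""
    lexicon = dict(zip(words, phones))
    # Hand-corrected entries replace or extend the automatic ones
    lexicon.update(overrides or {})
    return [f"{word}\t{pron}" for word, pron in sorted(lexicon.items())]


entries = build_lexicon(
    ["det", "finns"],
    ["d eː t", "f ɪ n s"],
    overrides={"internet": "ɪ n t ɛ r n ɛ t"},
)
print("\n".join(entries))
```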

In [None]:
!for i in dic*;do cat $i |sort > tmp;mv tmp $i;done

cat: valid: No such file or directory
