### About this kernel
This kernel describes ONLY the data preparation scripts that create the files Kaldi needs; it cannot run end to end in a Kaggle kernel. I will not go into how to run Kaldi itself, since there is a detailed tutorial [here](http://kaldi-asr.org/doc/kaldi_for_dummies.html).
I have only been working with Kaldi for about a month myself, so please feel free to discuss, ask questions, or suggest improvements.
The folder with everything needed to run the recipe (run.sh, data, etc.) is uploaded to this [GitHub repository](https://github.com/minhnq97/asr-commands).

## Data preparation
Let's begin by creating the training data.
As a reminder, Kaldi needs three files for each data set:
* *wav.scp*: each line follows the pattern: utterance_id path_to_audio
* *text*: each line follows the pattern: utterance_id transcript
* *utt2spk*: each line follows the pattern: utterance_id speaker_id
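
As a concrete illustration of the three patterns, here is a minimal sketch that builds one line of each file for a single recording. The path and the helper name `kaldi_lines` are made up for this example; the id scheme follows the one used in the cells below.

```python
import os

def kaldi_lines(audio_path):
    """Return the wav.scp, text and utt2spk lines for one labelled recording.

    Hypothetical helper: assumes the layout <label>/<speaker>_<rest>.wav.
    """
    label = os.path.basename(os.path.dirname(audio_path))
    stem = os.path.splitext(os.path.basename(audio_path))[0]
    utt_id = stem + "_" + label      # unique per (recording, label)
    speaker = stem.split("_")[0]     # first field of the file name
    return {
        "wav.scp": "{} {}".format(utt_id, audio_path),
        "text": "{} {}".format(utt_id, label),
        "utt2spk": "{} {}".format(utt_id, speaker),
    }

# Illustrative path, nothing is read from disk
lines = kaldi_lines("audio/bed/0a7c2a8d_nohash_0.wav")
for name, line in lines.items():
    print(name, "->", line)
```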

### Training and validation data

In [None]:
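import os
import random

transcript = {}  # utterance_id -> spoken word (the label folder name)
all_info = []    # rows of [speaker_id, utterance_id, path_to_audio]
path = '../input/train/audio/'

# r = root, d = directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if file.endswith('.wav'):
            # The parent folder name is the label, e.g. .../audio/bed/x.wav -> "bed"
            trans = os.path.basename(r)
            # Append the label so the utterance id is unique across words
            file_id = file.split(".")[0] + "_" + trans
            # The first "_"-separated field of the file name is the speaker id
            spk_id = file_id.split("_")[0]
            transcript[file_id] = trans
            all_info.append([spk_id, file_id, os.path.join(r, file)])

# Hold out a random 10% of the utterances for validation
random.shuffle(all_info)
counter = int(len(all_info) * 0.1)
all_train_info = all_info[counter:]
all_test_info = all_info[:counter]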

All the data is collected in a list of [*speaker_id*, *utterance_id*, *path_to_audio*] entries.  
A random 10% of the entries is held out as a test list (despite the name, it is really a validation list).  
**Note** that to avoid errors from Kaldi, the utterance_id should begin with the speaker_id, so that the files stay sorted consistently by both utterance and speaker.
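
A quick sanity check along these lines can catch id problems before Kaldi does. This is only a sketch: `check_ids` is a hypothetical helper, it takes rows shaped like the entries of `all_info` above, and it assumes `_` is the separator between speaker id and the rest of the utterance id, as in this kernel.

```python
def check_ids(rows):
    """Return the utterance ids that do not start with their speaker id.

    rows: iterable of [speaker_id, utterance_id, path_to_audio].
    """
    return [utt_id for spk_id, utt_id, _ in rows
            if not utt_id.startswith(spk_id + "_")]

# Toy rows for illustration only
sample_rows = [
    ["0a7c2a8d", "0a7c2a8d_nohash_0_bed", "bed/0a7c2a8d_nohash_0.wav"],
    ["0b09edd3", "nohash_0_cat", "cat/0b09edd3_nohash_0.wav"],  # bad id
]
bad_ids = check_ids(sample_rows)
print(bad_ids)  # ['nohash_0_cat']
```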

In [None]:
os.makedirs('data/train_command', exist_ok=True)
os.makedirs('data/test_command', exist_ok=True)

def text(file_infos):
    # One "utterance_id transcript" line per utterance, sorted by utterance id
    results = []
    for info in file_infos:
        utt_id = info[1]
        trans = transcript[utt_id]
        results.append("{} {}".format(utt_id, trans))
    return '\n'.join(sorted(results))

# write() plus a trailing newline so that the last line is terminated too
with open("data/train_command/text", "wt") as f:
    f.write(text(all_train_info) + "\n")
with open("data/test_command/text", "wt") as f:
    f.write(text(all_test_info) + "\n")

This creates the *text* file, one line per utterance: utterance_id transcript

In [None]:
os.makedirs('data/train_command', exist_ok=True)
os.makedirs('data/test_command', exist_ok=True)

def wavscp(file_infos):
    # One "utterance_id path_to_audio" line per utterance, sorted by utterance id
    results = []
    for info in file_infos:
        results.append("{} {}".format(info[1], info[2]))
    return '\n'.join(sorted(results))

# write() plus a trailing newline so that the last line is terminated too
with open("data/train_command/wav.scp", "wt") as f:
    f.write(wavscp(all_train_info) + "\n")
with open("data/test_command/wav.scp", "wt") as f:
    f.write(wavscp(all_test_info) + "\n")

This creates the *wav.scp* file, one line per utterance: utterance_id path_to_audio

In [None]:
os.makedirs('data/train_command', exist_ok=True)
os.makedirs('data/test_command', exist_ok=True)

def utt2spk(file_infos):
    # One "utterance_id speaker_id" line per utterance, sorted by utterance id
    results = []
    for info in file_infos:
        speaker = info[0]
        utt_id = info[1]
        results.append("{} {}".format(utt_id, speaker))
    return '\n'.join(sorted(results))

# write() plus a trailing newline so that the last line is terminated too
with open("data/train_command/utt2spk", "wt") as f:
    f.write(utt2spk(all_train_info) + "\n")
with open("data/test_command/utt2spk", "wt") as f:
    f.write(utt2spk(all_test_info) + "\n")

This creates the *utt2spk* file, one line per utterance: utterance_id speaker_id
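
Kaldi recipes usually also need a *spk2utt* file, which is normally derived from *utt2spk* with Kaldi's `utils/utt2spk_to_spk2utt.pl`. For reference, a rough Python equivalent could look like the sketch below; `utt2spk_to_spk2utt` is a name chosen for this example, not a Kaldi API.

```python
from collections import defaultdict

def utt2spk_to_spk2utt(utt2spk_lines):
    """Invert 'utterance_id speaker' lines into 'speaker utt1 utt2 ...' lines."""
    spk2utt = defaultdict(list)
    for line in utt2spk_lines:
        if not line.strip():
            continue  # ignore blank lines
        utt_id, speaker = line.split()
        spk2utt[speaker].append(utt_id)
    return ["{} {}".format(spk, " ".join(sorted(utts)))
            for spk, utts in sorted(spk2utt.items())]

# Toy input for illustration
example = [
    "0a7c2a8d_nohash_0_bed 0a7c2a8d",
    "0a7c2a8d_nohash_0_cat 0a7c2a8d",
    "0b09edd3_nohash_0_cat 0b09edd3",
]
spk2utt_lines = utt2spk_to_spk2utt(example)
print("\n".join(spk2utt_lines))
```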

### Evaluation data

In [None]:
all_info = []  # rows of [speaker_id, utterance_id, path_to_audio]
path = '../input/data/test'

# r = root, d = directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if file.endswith('.wav'):
            file_name = file.split(".")[0]
            # Test clips carry no speaker label, so the second "_"-separated
            # field of the file name stands in for the speaker id
            spk_id = file_name.split("_")[1]
            all_info.append([spk_id, spk_id + "_" + file_name, os.path.join(r, file)])

os.makedirs('data/eval_command', exist_ok=True)

# wavscp() and utt2spk() are the functions defined in the cells above
with open("data/eval_command/wav.scp", "wt") as f:
    f.write(wavscp(all_info) + "\n")
with open("data/eval_command/utt2spk", "wt") as f:
    f.write(utt2spk(all_info) + "\n")

Similarly, the evaluation data needs *utt2spk* and *wav.scp* files; there is no *text* file because the transcripts are unknown.
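
The evaluation ids are built so that the original file name can be recovered later when writing the submission. Here is a small sketch of that round trip, assuming file names of the form clip_XXXXXXXXX.wav as in the decode logs shown later; both helper names are hypothetical.

```python
def eval_utt_id(file_name):
    """'clip_000044442.wav' -> '000044442_clip_000044442'."""
    stem = file_name.split(".")[0]
    spk_id = stem.split("_")[1]   # stands in for a speaker id
    return spk_id + "_" + stem

def utt_id_to_file_name(utt_id):
    """Strip the '<spk_id>_' prefix and restore the extension."""
    _, _, stem = utt_id.partition("_")
    return stem + ".wav"

utt_id = eval_utt_id("clip_000044442.wav")
print(utt_id)
print(utt_id_to_file_name(utt_id))
```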

### Prepare language data
Kaldi requires 4 language data files:  
* *lexicon.txt*: every word in the dataset with its phonemes. Pattern: `<word> <phone1> <phone2> ...`
* *nonsilence_phones.txt*: every phoneme you use, one per line. Pattern: `<phone>`
* *optional_silence.txt*: the optional silence phone. I use only `<sil>`
* *silence_phones.txt*: the list of silence phones. I also use only `<sil>`


The most important part is creating your own phoneme list. I listed all the labels and looked up their phonemes on this [website](http://www.speech.cs.cmu.edu/tools/lextool.html).  
My *lexicon.txt* is here:  
> bed	b eh d  
bird	b er d  
cat	k ae t  
dog	d ao g  
down	d aw n  
eight	ey t  
five	f ay v  
four	f ao r  
go	g ow  
happy	hh ae p iy  
house	hh aw s  
left	l eh f t  
marvin	m aa r v ih n  
nine	n ay n  
no	n ow  
off	ao f  
on	aa n  
on	ao n  
one	w ah n  
one	hh w ah n  
right	r ay t  
seven	s eh v ah n  
sheila	sh iy l ah  
six	s ih k s  
stop	s t aa p  
three	th r iy  
tree	t r iy  
two	t uw  
up	ah p  
wow	w aw  
yes	y eh s  
zero	z iy r ow  
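
The non-silence phone list can be derived from the lexicon rather than typed by hand. Below is a minimal sketch that uses a few entries from the lexicon above; `phones_from_lexicon` is a hypothetical helper, and writing the result to a file is left out.

```python
def phones_from_lexicon(lexicon_lines, silence=("<sil>",)):
    """Collect the sorted set of phones used in 'word phone1 phone2 ...' lines."""
    phones = set()
    for line in lexicon_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        phones.update(fields[1:])
    return sorted(phones - set(silence))

lexicon = [
    "bed b eh d",
    "bird b er d",
    "cat k ae t",
    "on aa n",
    "on ao n",   # a word may have several pronunciations
]
nonsilence = phones_from_lexicon(lexicon)
print(nonsilence)
```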

## Running script on Kaldi

The details of how to run the script are in the README of the GitHub repository linked at the top of this kernel.  
I just want to note that I'm using HMM + GMM models trained with triphones.  
The result on the validation set is about **5% WER** (Word Error Rate) after training finishes.
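
For reference, WER is the word-level edit distance between the reference and the hypothesis divided by the number of reference words, WER = (S + D + I) / N. The sketch below is a generic implementation for illustration, not the scoring script that Kaldi runs. With single-word commands it comes down to the fraction of utterances recognised incorrectly, as long as each hypothesis is also a single word.

```python
def word_errors(ref, hyp):
    """Minimum substitutions + deletions + insertions to turn ref into hyp."""
    ref, hyp = ref.split(), hyp.split()
    # prev[j] = edit distance between the processed ref prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(pairs):
    """pairs: list of (reference, hypothesis) transcripts."""
    errors = sum(word_errors(ref, hyp) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words

# Toy example: 1 wrong word out of 4 reference words
toy = [("yes", "yes"), ("no", "go"), ("up", "up"), ("stop", "stop")]
print(wer(toy))
```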

## Postprocess for submission

After training, you can find the decoding output in Kaldi's log folder, something like: `$kaldi_path/egs/command/exp/tri3b/decode*/log/`  
This log folder contains many `decode.*.log` files (the number of files depends on the number of jobs used for decoding in the Kaldi script, which should not exceed the number of CPU cores or the number of speakers). Each `decode.*.log` file looks like this:  

>000044442_clip_000044442 no   
LOG (gmm-latgen-faster[5.5.382~1-c2163]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) Log-like per frame for utterance 000044442_clip_000044442 is -3.39615 over 98 frames.  
0000adecb_clip_0000adecb happy   
LOG (gmm-latgen-faster[5.5.382~1-c2163]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) Log-like per frame for utterance 0000adecb_clip_0000adecb is -4.63466 over 98 frames.

Each result line has the form `utterance_id<whitespace>transcript`; if the transcript is an empty string, we treat the utterance as silence.  
Let's make a sample! We will create a *decode.1.log* file inside the *log/* folder.

In [None]:
if not os.path.exists(os.path.dirname('log/decode.1.log')):
    os.makedirs(os.path.dirname('log/decode.1.log'))
    
with open("log/decode.1.log","wt") as f:
    f.write("000044442_clip_000044442 no\n")
    f.write("LOG (gmm-latgen-faster[5.5.382~1-c2163]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) Log-like per frame for utterance 000044442_clip_000044442 is -3.39615 over 98 frames.\n")
    f.write("0000adecb_clip_0000adecb happy\n")
    f.write("LOG (gmm-latgen-faster[5.5.382~1-c2163]:DecodeUtteranceLatticeFaster():decoder-wrappers.cc:289) Log-like per frame for utterance 0000adecb_clip_0000adecb is -4.63466 over 98 frames.\n")

In [None]:
!cat log/decode.1.log

The postprocessing script has to parse this file to produce a submission file.

In [None]:
import re

path = 'log'  # Path to the log folder
predictions = {}  # wav file name -> predicted transcript
# Result lines look like "<9 chars>_clip_<9 chars> <transcript>"
pattern = r'.{9}_clip_.{9}.*'

def _read_decode_file(filepath):
    with open(filepath, "rt") as f:
        for line in f.read().splitlines():
            # Skip Kaldi's own log messages
            if line.startswith("LOG"):
                continue
            x = re.search(pattern, line)
            if x is not None:
                info = x.group(0).split()
                utt_id = info[0]
                # Drop the 10-character "<speaker_id>_" prefix to recover the file name
                wav_id = utt_id[10:] + ".wav"
                if len(info) == 1:
                    # No words were decoded: treat the clip as silence
                    trans = "silence"
                else:
                    trans = " ".join(info[1:])
                predictions[wav_id] = trans


# r = root, d = directories, f = files
for r, d, f in os.walk(path):
    for file in f:
        if 'decode.' in file:
            _read_decode_file(os.path.join(r, file))

To keep the order of the submission, I read the file names from sample_submission.csv and look up the corresponding transcript in the dictionary above.

In [None]:
all_lines = []
with open("../input/sample_submission.csv", "rt") as f:
    for line in f.read().splitlines():
        # Copy the header row unchanged
        if line.startswith("fname"):
            all_lines.append(line)
            continue
        fname = line.split(",")[0]
        # Clips missing from the decode logs fall back to "silence"
        trans = predictions.get(fname, "silence")
        all_lines.append(",".join([fname, trans]))

with open("submission.csv", "wt") as f:
    f.write("\n".join(all_lines) + "\n")

## Conclusion
This kernel is only a reference for anyone who wants to approach the challenge with Kaldi. I searched for the keyword Kaldi on Kaggle and it does not seem very popular, so I decided to write this as a tutorial. Please give me some advice if you see something wrong.  
The final result I got is 75% on the Public LB and 77% on the Private LB.