# Welcome to ExKaldi

In this section, we will extract and process acoustic features.

Please ensure you have downloaded the complete librispeech_dummy corpus from our GitHub repository:
https://github.com/wangyu09/exkaldi

First of all, update the wav path information in the wav.scp file.

In [1]:
! cd librispeech_dummy && python3 reset_wav_path.py

From now on, we will build an ASR system from scratch.

In [2]:
import exkaldi
exkaldi.info.reset_kaldi_root("/home/khanh/workspace/projects/kaldi")

import os
dataDir = "librispeech_dummy"

As in the cell above, set the Kaldi root directory to the path of your own Kaldi installation:
exkaldi.info.reset_kaldi_root( yourPath )
Otherwise, errors will occur when calling some core functions.


The train dataset contains 100 utterances from 10 speakers, with 10 utterances per speaker.

You can compute features from a __WAV file__, a __Kaldi script-file table__, or an ExKaldi __ListTable__ object.

In [3]:
scpFile = os.path.join(dataDir, "train", "wav.scp")

feat = exkaldi.compute_mfcc(scpFile, name="mfcc")

feat

<exkaldi.core.archive.BytesFeat at 0x7f2e84178790>
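
For readers unfamiliar with Kaldi script-file tables: a file such as wav.scp is plain text with one entry per line, an utterance ID followed by whitespace and a path. The sketch below reads such lines into a plain dict. It is only an illustration and not how ExKaldi loads the table; the helper name `read_scp` and the paths are made up.

```python
def read_scp(lines):
    """Parse script-file table lines of the form '<utterance-ID> <value>'."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt, value = line.split(maxsplit=1)  # split on the first run of whitespace
        table[utt] = value
    return table

# Made-up entries; real paths depend on where the corpus is stored.
toy_lines = [
    "103-1240-0000 /path/to/103-1240-0000.wav",
    "103-1240-0001 /path/to/103-1240-0001.wav",
]
toy_table = read_scp(toy_lines)
print(list(toy_table))
```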

Use the function __compute_mfcc__ to compute MFCC features. The current version of ExKaldi provides 4 functions for computing acoustic features:

__compute_mfcc__: compute the MFCC feature.  
__compute_fbank__: compute the fBank feature.  
__compute_plp__: compute the PLP feature.  
__compute_spectrogram__: compute the power spectrogram feature.  

The returned object, ___feat___, is an ExKaldi feature archive whose class name is __BytesFeat__. In ExKaldi, we use 3 approaches to describe Kaldi archives: __Bytes Object__, __Index Table__, and __Numpy Array__. We have designed a group of classes to hold them, and we will introduce them in later steps.

Here, __BytesFeat__ is one of the __Bytes Object__ classes, and its object holds the acoustic feature data in bytes format. You can use the attribute __.data__ to get it, but we do not recommend this if you just want to look at the data, because bytes are not a human-readable format.

The ___feat___ object has some useful attributes and methods. For example, use __.dim__ to check the feature dimension.

In [4]:
feat.dim

13

Use __.utts__ to get its utterance IDs.

In [5]:
feat.utts[0:5]

['103-1240-0000',
 '103-1240-0001',
 '103-1240-0002',
 '103-1240-0003',
 '103-1240-0004']

Randomly sample 10 utterances.

In [6]:
samplingFeat = feat.subset(nRandom=10)

samplingFeat

<exkaldi.core.archive.BytesFeat at 0x7f2e6c480970>

Here, ___samplingFeat___ is also a __BytesFeat__ object.

In ExKaldi, the name of an object records the operations applied to it. For example, the ___samplingFeat___ generated above now has a new name.

In [7]:
samplingFeat.name

'subset(mfcc,random 10)'
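
To make the naming idea concrete, here is a minimal sketch of an object whose name records the operation that produced it. The class `ToyArchive` and its `seed` argument are hypothetical and this is not ExKaldi's implementation; only the name format is copied from the output above.

```python
import random

class ToyArchive:
    """Hypothetical stand-in for an archive: a dict of utterances plus a name."""

    def __init__(self, data, name):
        self.data = dict(data)
        self.name = name

    def subset(self, n_random, seed=None):
        # Draw utterance IDs at random, then keep them in sorted order.
        rng = random.Random(seed)
        chosen = rng.sample(sorted(self.data), n_random)
        picked = {utt: self.data[utt] for utt in sorted(chosen)}
        # The new object's name wraps the old name in the operation description.
        return ToyArchive(picked, name=f"subset({self.name},random {n_random})")

toy = ToyArchive({f"utt-{i:02d}": i for i in range(100)}, name="mfcc")
sample = toy.subset(n_random=10, seed=0)
print(sample.name)  # subset(mfcc,random 10)
```

Because each derived object builds its name from its parent's name, chained operations leave a readable trace of how the data was produced.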

In [8]:
del samplingFeat

Besides the __BytesFeat__ class, the following classes hold other Kaldi archive tables in bytes format.

__BytesCMVN__: to hold the CMVN statistics.  
__BytesProb__: to hold the neural network output.  
__BytesAliTrans__: to hold the transition-ID alignment.  
__BytesFmllr__: to hold the fMLLR transform matrices.

All these classes have similar properties. For more information, please check the [ExKaldi Documents](https://wangyu09.github.io/exkaldi/#/). Here we focus only on feature processing.

By the way, ExKaldi keeps these archives strictly sorted in order to reduce buffering cost and speed up processing.

In [9]:
featTemp = feat.sort(by="utt", reverse=True)

featTemp.utts[0:5]

['1088-134315-0009',
 '1088-134315-0008',
 '1088-134315-0007',
 '1088-134315-0006',
 '1088-134315-0005']
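
As a small illustration of sorting by utterance ID, the sketch below orders a plain dict by its keys. It assumes the IDs are compared as text, which is consistent with the output above but is not taken from the ExKaldi source; `sort_by_utt` is a hypothetical helper and the values are placeholders.

```python
def sort_by_utt(table, reverse=False):
    """Return a new dict whose keys are in (reverse) lexicographic order."""
    return {utt: table[utt] for utt in sorted(table, reverse=reverse)}

# Utterance IDs taken from the outputs above; the values are placeholders.
toy_feat = {
    "103-1240-0001": None,
    "1088-134315-0009": None,
    "103-1240-0000": None,
    "1088-134315-0008": None,
}
print(list(sort_by_utt(toy_feat, reverse=True)))

# Text comparison is character by character, so zero-padding matters:
print(sorted(["utt-9", "utt-10"]))  # ['utt-10', 'utt-9']
```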

In [10]:
del featTemp

Raw features can be further improved, typically by applying CMVN (cepstral mean and variance normalization). Here we first compute the CMVN statistics.

In [11]:
spk2uttFile = os.path.join(dataDir, "train", "spk2utt")

cmvn = exkaldi.compute_cmvn_stats(feat, spk2utt=spk2uttFile, name="cmvn")

cmvn

<exkaldi.core.archive.BytesCMVN at 0x7f2e67e1abe0>

___cmvn___ is an ExKaldi __BytesCMVN__ object. It holds the CMVN statistics in binary format. We then use it to normalize the feature.

In [12]:
utt2spkFile = os.path.join(dataDir, "train", "utt2spk")

feat = exkaldi.use_cmvn(feat, cmvn, utt2spk=utt2spkFile)

feat.name

'cmvn(mfcc,cmvn)'
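
To show what the two steps above do conceptually, here is a pure-Python sketch of per-speaker CMVN on a tiny made-up dataset. It is a generic illustration, not the ExKaldi or Kaldi implementation: the function names, the `norm_vars` flag, and the toy data are all assumptions, and nested lists stand in for feature matrices.

```python
def toy_cmvn_stats(feats, spk2utt):
    """Accumulate per-speaker frame count, sum, and sum of squares."""
    stats = {}
    for spk, utts in spk2utt.items():
        dim = len(feats[utts[0]][0])
        count, total, total_sq = 0, [0.0] * dim, [0.0] * dim
        for utt in utts:
            for frame in feats[utt]:
                count += 1
                for d, x in enumerate(frame):
                    total[d] += x
                    total_sq[d] += x * x
        stats[spk] = (count, total, total_sq)
    return stats

def toy_apply_cmvn(feats, stats, utt2spk, norm_vars=False):
    """Subtract the speaker mean; optionally divide by the speaker std."""
    out = {}
    for utt, frames in feats.items():
        count, total, total_sq = stats[utt2spk[utt]]
        mean = [s / count for s in total]
        if norm_vars:
            std = [max(sq / count - m * m, 1e-10) ** 0.5
                   for sq, m in zip(total_sq, mean)]
        else:
            std = [1.0] * len(mean)
        out[utt] = [[(x - m) / s for x, m, s in zip(frame, mean, std)]
                    for frame in frames]
    return out

# Made-up data: two speakers, 2-dimensional features.
toy_feats = {
    "spkA-u1": [[1.0, 2.0], [3.0, 4.0]],
    "spkA-u2": [[5.0, 6.0]],
    "spkB-u1": [[10.0, 0.0], [20.0, 0.0]],
}
toy_spk2utt = {"spkA": ["spkA-u1", "spkA-u2"], "spkB": ["spkB-u1"]}
toy_utt2spk = {u: s for s, utts in toy_spk2utt.items() for u in utts}

toy_stats = toy_cmvn_stats(toy_feats, toy_spk2utt)
normalized = toy_apply_cmvn(toy_feats, toy_stats, toy_utt2spk)
print(normalized["spkA-u1"])  # [[-2.0, -2.0], [0.0, 0.0]]
```

The statistics are gathered per speaker (hence the spk2utt mapping), while normalization is applied per utterance and needs to look up each utterance's speaker (hence utt2spk).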

We save this feature to a file so that it can be loaded again in later steps. ExKaldi bytes archives are saved in the same format as Kaldi archive files.

In [13]:
featFile = os.path.join(dataDir, "exp", "train_mfcc_cmvn.ark")

exkaldi.utils.make_dependent_dirs(path=featFile, pathIsFile=True)

featIndex = feat.save(featFile, returnIndexTable=True)

#del feat

If you set the option __returnIndexTable__ to True, an __IndexTable__ object will be returned. As introduced above, this is our second approach to describing archives: the __index table__. It plays almost the same role as the original feature object. __IndexTable__ is a subclass of the Python dict class, so you can view its data directly.

When training on a large corpus or using multiple processes, __IndexTable__ becomes the main form in which data is passed around.

In [14]:
featIndex

{'103-1240-0000': IndexInfo(frames=1407, startIndex=0, dataSize=73193, filePath='librispeech_dummy/exp/train_mfcc_cmvn.ark'),
 '103-1240-0001': IndexInfo(frames=1593, startIndex=73193, dataSize=82865, filePath='librispeech_dummy/exp/train_mfcc_cmvn.ark'),
 '103-1240-0002': IndexInfo(frames=1393, startIndex=156058, dataSize=72465, filePath='librispeech_dummy/exp/train_mfcc_cmvn.ark'),
 '103-1240-0003': IndexInfo(frames=1469, startIndex=228523, dataSize=76417, filePath='librispeech_dummy/exp/train_mfcc_cmvn.ark'),
 '103-1240-0004': IndexInfo(frames=1250, startIndex=304940, dataSize=65029, filePath='librispeech_dummy/exp/train_mfcc_cmvn.ark'),
 '103-1240-0005': IndexInfo(frames=1516, startIndex=369969, dataSize=78861, filePath='librispeech_dummy/exp/train_mfcc_cmvn.ark'),
 '103-1240-0006': IndexInfo(frames=956, startIndex=448830, dataSize=49741, filePath='librispeech_dummy/exp/train_mfcc_cmvn.ark'),
 '103-1240-0007': IndexInfo(frames=1502, startIndex=498571, dataSize=78133, filePath='libr

Of course, the original archives can also be loaded into memory again. For example, features can be loaded from a Kaldi binary archive file (__.ark__ file) or a script table file (__.scp__ file).

In particular, we can fetch the data directly via the index table.

In [15]:
feat = featIndex.fetch(arkType="feat")
del featIndex

feat

<exkaldi.core.archive.BytesFeat at 0x7f2e67fccca0>
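
The idea behind an index table can be sketched with the standard library alone: while writing, record where each utterance starts and how many bytes it occupies; when fetching, seek to that offset and read exactly that many bytes. The field names below mirror the `IndexInfo` entries printed above, but `save_with_index` and `fetch` are hypothetical helpers and the byte strings are dummy data, not real features.

```python
import os
import tempfile
from collections import namedtuple

IndexInfo = namedtuple("IndexInfo", ["frames", "startIndex", "dataSize", "filePath"])

def save_with_index(records, path):
    """Write each utterance's bytes back to back and record where it landed."""
    index = {}
    with open(path, "wb") as fw:
        for utt, (frames, blob) in records.items():
            start = fw.tell()  # offset of this entry in the file
            fw.write(blob)
            index[utt] = IndexInfo(frames, start, len(blob), path)
    return index

def fetch(index, utt):
    """Read one utterance's bytes without loading the rest of the file."""
    info = index[utt]
    with open(info.filePath, "rb") as fr:
        fr.seek(info.startIndex)
        return fr.read(info.dataSize)

# Dummy records: utterance ID -> (frame count, raw bytes).
records = {
    "utt-a": (2, b"\x01\x02\x03\x04"),
    "utt-b": (3, b"\x05\x06\x07\x08\x09\x0a"),
}
ark_path = os.path.join(tempfile.mkdtemp(), "toy.ark")
toy_index = save_with_index(records, ark_path)
print(toy_index["utt-b"].startIndex)  # 4
print(fetch(toy_index, "utt-b") == records["utt-b"][1])  # True
```

The same pattern is visible in the real output above, where each `startIndex` equals the previous entry's `startIndex` plus its `dataSize` (for example, 0 + 73193 = 73193). Since the table holds only small records, it is cheap to pass between processes.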

All bytes archives can be transformed to __Numpy__ format. So if you want to train an NN acoustic model with TensorFlow or another framework, you can use the Numpy format data.

In [16]:
feat = feat.to_numpy()

feat

<exkaldi.core.archive.NumpyFeat at 0x7f2e67d72640>

By calling the __.to_numpy()__ method, ___feat___ became an ExKaldi __NumpyFeat__ object. It shares some attributes and methods with __BytesFeat__, but also has properties of its own. Let's skip the details here.

This is the third way to describe archives: __Numpy Array__. __NumpyFeat__ is one of the Numpy archive classes.

Here we will introduce some ways to use its data.

In [17]:
sampleFeat = feat.subset(nHead=2)

1. Use __.data__ to get the dict object whose keys are utterance IDs and whose values are data arrays.

In [18]:
sampleFeat.data

{'103-1240-0000': array([[ -2.254528  ,  -3.344388  ,   8.894275  , ...,   2.5323038 ,
           6.9771852 ,   3.545384  ],
        [ -2.2711601 ,  -3.6887007 ,   8.395479  , ...,  -2.0043678 ,
           2.486678  ,   7.0842047 ],
        [ -2.2453518 ,  -2.678547  ,  12.083347  , ...,  -0.5561874 ,
           4.9453325 ,   3.957767  ],
        ...,
        [ -1.5548878 , -16.208216  , -15.402991  , ...,  -5.0331793 ,
          22.171038  ,   4.512825  ],
        [ -1.6056385 , -18.538912  , -13.540999  , ...,  -1.7717261 ,
          10.223823  ,  -1.9313327 ],
        [ -1.6158581 , -17.09361   , -12.508477  , ...,  -0.69831765,
           8.234857  ,   2.7844687 ]], dtype=float32),
 '103-1240-0001': array([[ -1.5342865, -13.794619 , -11.781871 , ...,   7.934154 ,
          11.860016 ,  -3.388668 ],
        [ -1.6354351, -16.402199 , -14.878404 , ...,  -4.1155005,
          10.174247 ,  -4.190131 ],
        [ -1.8280525, -11.93771  ,  -9.464145 , ...,  -2.257836 ,
          18.10266

2. Use __.array__ to get the arrays only.

In [19]:
sampleFeat.array

[array([[ -2.254528  ,  -3.344388  ,   8.894275  , ...,   2.5323038 ,
           6.9771852 ,   3.545384  ],
        [ -2.2711601 ,  -3.6887007 ,   8.395479  , ...,  -2.0043678 ,
           2.486678  ,   7.0842047 ],
        [ -2.2453518 ,  -2.678547  ,  12.083347  , ...,  -0.5561874 ,
           4.9453325 ,   3.957767  ],
        ...,
        [ -1.5548878 , -16.208216  , -15.402991  , ...,  -5.0331793 ,
          22.171038  ,   4.512825  ],
        [ -1.6056385 , -18.538912  , -13.540999  , ...,  -1.7717261 ,
          10.223823  ,  -1.9313327 ],
        [ -1.6158581 , -17.09361   , -12.508477  , ...,  -0.69831765,
           8.234857  ,   2.7844687 ]], dtype=float32),
 array([[ -1.5342865, -13.794619 , -11.781871 , ...,   7.934154 ,
          11.860016 ,  -3.388668 ],
        [ -1.6354351, -16.402199 , -14.878404 , ...,  -4.1155005,
          10.174247 ,  -4.190131 ],
        [ -1.8280525, -11.93771  ,  -9.464145 , ...,  -2.257836 ,
          18.10266  ,  -2.389845 ],
        ...,
   

3. Use the getitem syntax to get a specified utterance.

In [20]:
sampleFeat['103-1240-0000']

array([[ -2.254528  ,  -3.344388  ,   8.894275  , ...,   2.5323038 ,
          6.9771852 ,   3.545384  ],
       [ -2.2711601 ,  -3.6887007 ,   8.395479  , ...,  -2.0043678 ,
          2.486678  ,   7.0842047 ],
       [ -2.2453518 ,  -2.678547  ,  12.083347  , ...,  -0.5561874 ,
          4.9453325 ,   3.957767  ],
       ...,
       [ -1.5548878 , -16.208216  , -15.402991  , ...,  -5.0331793 ,
         22.171038  ,   4.512825  ],
       [ -1.6056385 , -18.538912  , -13.540999  , ...,  -1.7717261 ,
         10.223823  ,  -1.9313327 ],
       [ -1.6158581 , -17.09361   , -12.508477  , ...,  -0.69831765,
          8.234857  ,   2.7844687 ]], dtype=float32)

4. As with a dict object, __.keys()__, __.values()__, and __.items()__ are available to iterate over it.

In [21]:
for key in sampleFeat.keys():
    print( sampleFeat[key].shape )

(1407, 13)
(1593, 13)


5. setitem is also available, as long as you set an array in the right format.

In [22]:
sampleFeat['103-1240-0000'] *= 2

In [23]:
sampleFeat['103-1240-0000']

array([[ -4.509056 ,  -6.688776 ,  17.78855  , ...,   5.0646076,
         13.9543705,   7.090768 ],
       [ -4.5423203,  -7.3774014,  16.790958 , ...,  -4.0087357,
          4.973356 ,  14.168409 ],
       [ -4.4907036,  -5.357094 ,  24.166695 , ...,  -1.1123748,
          9.890665 ,   7.915534 ],
       ...,
       [ -3.1097755, -32.41643  , -30.805983 , ..., -10.066359 ,
         44.342075 ,   9.02565  ],
       [ -3.211277 , -37.077824 , -27.081999 , ...,  -3.5434523,
         20.447645 ,  -3.8626654],
       [ -3.2317162, -34.18722  , -25.016954 , ...,  -1.3966353,
         16.469713 ,   5.5689373]], dtype=float32)

In [24]:
del sampleFeat
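
The five access patterns above can be summarized with a small dict-like class. This is a hypothetical sketch, not the NumpyFeat implementation: `ToyNumpyFeat` stores nested lists instead of Numpy arrays so that it runs with the standard library only, and the check in `__setitem__` (non-empty, rectangular, same number of columns as existing entries) is our assumption about what the "right format" might mean.

```python
class ToyNumpyFeat:
    """Hypothetical dict-like holder: utterance ID -> frames x dim matrix."""

    def __init__(self, data):
        self._data = {}
        self._dim = None
        for utt, matrix in data.items():
            self[utt] = matrix  # reuse the validation in __setitem__

    def __setitem__(self, utt, matrix):
        # Assumed format check: non-empty, rectangular, consistent dimension.
        if not matrix or not matrix[0] or any(len(r) != len(matrix[0]) for r in matrix):
            raise ValueError("expected a non-empty rectangular matrix")
        if self._dim is not None and len(matrix[0]) != self._dim:
            raise ValueError(f"expected {self._dim} columns, got {len(matrix[0])}")
        self._dim = len(matrix[0])
        self._data[utt] = [list(row) for row in matrix]

    def __getitem__(self, utt):
        return self._data[utt]

    @property
    def data(self):
        return self._data  # dict: utterance ID -> matrix

    @property
    def array(self):
        return list(self._data.values())  # matrices only

    def keys(self):
        return self._data.keys()

    def values(self):
        return self._data.values()

    def items(self):
        return self._data.items()

toy = ToyNumpyFeat({"utt-a": [[1.0, 2.0], [3.0, 4.0]], "utt-b": [[5.0, 6.0]]})
for utt, matrix in toy.items():
    print(utt, (len(matrix), len(matrix[0])))

# Setting a value goes through the same format check as construction.
toy["utt-a"] = [[2 * x for x in row] for row in toy["utt-a"]]
print(toy["utt-a"])  # [[2.0, 4.0], [6.0, 8.0]]
```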

Similarly, ExKaldi Numpy archives can be transformed back to bytes archives easily. 

In [25]:
tempFeat = feat.to_bytes()

tempFeat

<exkaldi.core.archive.BytesFeat at 0x7f2e67c2fa90>

In [26]:
del tempFeat
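
To give a feel for what converting between the two forms involves, the sketch below packs a small matrix into bytes and unpacks it again with the standard `struct` module. It follows the layout of an uncompressed single-precision matrix entry in a Kaldi binary archive as we understand it: the utterance ID, a space, the binary marker `\0B`, the token `FM `, then the row and column counts (each a size byte followed by a 4-byte integer), then the float32 values. The function names are hypothetical, little-endian byte order is assumed, and compressed or double-precision matrices are not handled.

```python
import struct

HEADER_SIZE = 15  # b"\0B" + b"FM " + (1 + 4) bytes for rows + (1 + 4) bytes for cols

def encode_entry(utt, matrix):
    """Pack one utterance (nested lists of floats) into a binary archive entry."""
    rows, cols = len(matrix), len(matrix[0])
    header = (utt.encode() + b" " + b"\0B" + b"FM "
              + b"\x04" + struct.pack("<i", rows)
              + b"\x04" + struct.pack("<i", cols))
    body = b"".join(struct.pack("<%df" % cols, *row) for row in matrix)
    return header + body

def decode_entry(blob):
    """Unpack an entry produced by encode_entry back into (utt, matrix)."""
    key, rest = blob.split(b" ", 1)
    if rest[:5] != b"\0BFM ":
        raise ValueError("not an uncompressed float matrix entry")
    rows = struct.unpack("<i", rest[6:10])[0]
    cols = struct.unpack("<i", rest[11:15])[0]
    values = struct.unpack("<%df" % (rows * cols),
                           rest[HEADER_SIZE:HEADER_SIZE + 4 * rows * cols])
    matrix = [list(values[r * cols:(r + 1) * cols]) for r in range(rows)]
    return key.decode(), matrix

def entry_size(utt, rows, cols):
    """Size in bytes of one entry: ID, space, header, and float32 data."""
    return len(utt) + 1 + HEADER_SIZE + 4 * rows * cols

# Values chosen to be exactly representable in float32.
toy_matrix = [[1.5, -2.0, 0.25], [0.0, 4.0, -8.5]]
blob = encode_entry("utt-a", toy_matrix)
print(decode_entry(blob) == ("utt-a", toy_matrix))  # True
print(entry_size("103-1240-0000", 1407, 13))  # 73193
```

Under these assumptions, the computed size for a 1407-frame, 13-dimensional utterance with a 13-character ID agrees with the `dataSize=73193` reported for '103-1240-0000' in the index table above.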

Numpy data can also be saved to a .npy file in a specified format.

In [27]:
tempFile = os.path.join(dataDir, "exp", "temp_mfcc.npy")

feat.save(tempFile)

'librispeech_dummy/exp/temp_mfcc.npy'

In [28]:
del feat

It can also be loaded back into memory.

In [29]:
feat = exkaldi.load_feat(tempFile, name="mfcc")

feat

<exkaldi.core.archive.NumpyFeat at 0x7f2e67a8ff40>

Besides the __NumpyFeat__ class, the following classes hold Kaldi archives in Numpy format.

__NumpyCMVN__: to hold CMVN statistics data.  
__NumpyProb__: to hold NN output data.  
__NumpyAli__: to hold users' own alignment data.  
__NumpyAliTrans__: to hold transition-ID alignment.  
__NumpyAliPhone__: to hold phone-ID alignment.  
__NumpyAliPdf__: to hold pdf-ID alignment.  
__NumpyFmllr__: to hold fMLLR transform matrices.  

They have properties similar to those of __NumpyFeat__. We will introduce them in the next steps.