tKaldi

Yet Another Approach to Port Kaldi

This is an experimental attempt to re-write Kaldi's matrix library with PyTorch's C++ API.

Note: This is my Sunday project.

Approach to Port Kaldi

This project aims to implement the following classes as wrappers around PyTorch's torch::Tensor class.

Vector Classes

kaldi::VectorBase
kaldi::Vector
kaldi::SubVector

Matrix Classes

kaldi::MatrixBase
kaldi::Matrix
kaldi::SubMatrix

(You can check out the code from here.)
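
To make the relationship between the three vector classes concrete, here is a small Python analogy. It is not the project's C++ code: a plain `array` buffer stands in for torch::Tensor, and the method names (`dim`, `sum`) are only illustrative. In Kaldi, Vector owns its storage while SubVector is a non-owning view of another vector, and that is the behaviour the sketch reproduces.

```python
from array import array


class VectorBase:
    """Shared element-access interface over a buffer view.

    The memoryview is a stand-in for the wrapped torch::Tensor.
    """

    def __init__(self, view):
        self._view = view

    def dim(self):
        return len(self._view)

    def __getitem__(self, i):
        return self._view[i]

    def __setitem__(self, i, value):
        self._view[i] = value

    def sum(self):
        return sum(self._view)


class Vector(VectorBase):
    """Owns its storage."""

    def __init__(self, dim):
        self._storage = array("f", [0.0] * dim)
        super().__init__(memoryview(self._storage))


class SubVector(VectorBase):
    """Non-owning view of a contiguous range of another vector."""

    def __init__(self, parent, offset, length):
        # Slicing a memoryview shares memory with the parent; nothing is copied.
        super().__init__(parent._view[offset:offset + length])


v = Vector(5)
s = SubVector(v, 1, 3)
s[0] = 2.0   # write through the view
print(v[1])  # the owner sees the same element
```

Slicing a torch::Tensor gives the same kind of storage-sharing view, which is why a tensor-backed SubVector can plausibly be a thin wrapper rather than a copy.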

Theoretically, by swapping the original source code with these implementations, we should be able to build the rest of the Kaldi libraries (except the parts related to CUDA and OpenFST, which I have not looked into).

Once we build the Kaldi code with PyTorch's backend, it should be fairly easy to build the PyTorch binding of the resulting library, and this means that we can call Kaldi functions from PyTorch natively.

Execution

Since Kaldi's code base is huge, it is difficult to start by forking and modifying it. Instead, I took a bottom-up approach: decide on a target feature to port, then implement the interface of the Vector/Matrix classes that the feature needs.

When compiling the target feature, its source code is copied to the workspace with minimal modification. Interestingly, all I have had to do so far is comment out some #include statements that are not directly related to the target feature and swap some type definitions. You can see these changes in kaldi.patch.

For the initial target feature, I chose ComputeKaldiPitch and the corresponding CLI, compute-kaldi-pitch-feats.

I am porting these features in the following manner.

Phase 1 - Port `ComputeKaldiPitch`

The goal of this phase is to have a ComputeKaldiPitch function that produces exactly the same result as the original implementation. The performance of the function does not matter at this stage. In fact, the resulting Vector / Matrix classes are wrappers around torch::Tensor, which is backed by a similar (or the same) BLAS library, whereas Kaldi's original implementation calls the BLAS library directly. The ported function is therefore expected to be slower than the original, or at best the same speed.

Implement the minimal set of methods from Vector / Matrix classes (016ab2e7).
Compile ComputeKaldiPitch.
Bind the resulting ComputeKaldiPitch to Python (src).
Check the parity of the Python function and compute-kaldi-pitch-feats from the original code (test).
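
The parity check in the last step can be sketched as follows. This is a minimal illustration, not the repository's actual test: the helper name `assert_parity`, the tolerance values, and the toy feature matrices are all hypothetical. In practice the two inputs would be the output of the Python binding and the output of the original compute-kaldi-pitch-feats for the same audio.

```python
import math


def assert_parity(ported, reference, rel_tol=1e-6, abs_tol=1e-8):
    """Raise AssertionError unless two feature matrices match element-wise.

    Both arguments are sequences of frames, each frame a sequence of floats.
    The tolerances are illustrative defaults, not values used by the project.
    """
    assert len(ported) == len(reference), "frame count differs"
    for t, (row_p, row_r) in enumerate(zip(ported, reference)):
        assert len(row_p) == len(row_r), f"feature dimension differs at frame {t}"
        for d, (p, r) in enumerate(zip(row_p, row_r)):
            assert math.isclose(p, r, rel_tol=rel_tol, abs_tol=abs_tol), (
                f"mismatch at frame {t}, dim {d}: {p} != {r}"
            )


# Toy rows standing in for real features.
reference = [[0.5, 100.0], [0.6, 101.5]]
ported = [[0.5, 100.0], [0.6, 101.5]]
assert_parity(ported, reference)
```

Reporting the frame and dimension of the first mismatch makes it easier to trace a discrepancy back to the Vector / Matrix method that produced it.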

Phase 2 - Port `compute-kaldi-pitch-feats`

The next step is to port the compute-kaldi-pitch-feats CLI so that I can compare the speed of the original CLI and the ported version.

Extend the Vector / Matrix classes (bc8ac3c0).
Compile compute-kaldi-pitch-feats (#12).
Compare the speed of the original compute-kaldi-pitch-feats and the ported one.
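
One simple way to run such a comparison is to time each binary over a few runs and keep the best wall-clock time. The sketch below is an assumption about how this could be done, not the project's benchmark: `time_command` is a hypothetical helper, and the command shown is a placeholder that would be replaced by the actual invocations of the original and ported CLIs.

```python
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` runs of `cmd`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True makes a failing run raise instead of being timed silently.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Placeholder command so the sketch runs anywhere; substitute the real
    # command lines for the original and the ported compute-kaldi-pitch-feats.
    placeholder = [sys.executable, "-c", "pass"]
    print(f"best of 3: {time_command(placeholder):.3f} s")
```

Taking the minimum rather than the mean reduces the influence of one-off noise such as cold file caches.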

Phase 3 - Improve the performance of `ComputeKaldiPitch`

The third step is to improve the speed of ComputeKaldiPitch by modifying the implementation to take advantage of PyTorch's C++ API (and potentially getting rid of the Vector / Matrix classes).

Vectorize the operation and get rid of sequential element access.
Parallelize operations.
(Optional) Enable GPU support.

Build

Because of the approach explained in the previous section, this repository is not a fork of the original Kaldi. Instead, it references Kaldi as a git submodule and copies the required source code from it.

tools.py facilitates this process.

Note: When changing the list of source files under source control in src/libtkaldi/src, edit .gitignore and tools.py.

./tools.py init
This will sync the Kaldi submodule (in third_party/kaldi), clean up any changes present there, then apply the patch from kaldi.patch.
./tools.py dev
This will run git-clean on the current src/libtkaldi (so that files that are not under source control are removed), copy the designated source code from the third_party/kaldi directory, then run python setup.py develop to build the library.
./tools.py stash
This will stash the changes made to the Kaldi submodule to kaldi.patch. When you change the original Kaldi source code and need to persist the change across commits, you need to check in the patch.
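
For readers who want a mental model of the init step, the sketch below lists git commands that would have roughly the described effect. It is an assumption based only on the description above, not the actual contents of tools.py; `init_commands` is a hypothetical helper that only builds the command lists and does not execute anything.

```python
from pathlib import Path

KALDI_DIR = Path("third_party/kaldi")  # submodule location named in this README
PATCH_FILE = Path("kaldi.patch")       # patch file named in this README


def init_commands(repo_root):
    """Build git commands approximating `./tools.py init` (assumed behaviour)."""
    root = Path(repo_root)
    kaldi = str(root / KALDI_DIR)
    # Use an absolute patch path because `git -C` changes the working directory.
    patch = str((root / PATCH_FILE).resolve())
    return [
        # 1. Sync the Kaldi submodule.
        ["git", "-C", str(root), "submodule", "update", "--init", str(KALDI_DIR)],
        # 2. Discard any local changes inside the submodule.
        ["git", "-C", kaldi, "reset", "--hard"],
        # 3. Re-apply the recorded modifications.
        ["git", "-C", kaldi, "apply", patch],
    ]


for cmd in init_commands("."):
    print(" ".join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to try in any directory.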

Getting Started

git clone https://github.com/mthrok/tkaldi
cd tkaldi
./tools.py init

Building and Running Tests

./tools.py dev
pytest tests

Requirements

pytorch >= 1.7
