tKaldi

Yet Another Approach to Port Kaldi

This is an experimental attempt to re-write Kaldi's matrix library with PyTorch's C++ API.

Note: This is my Sunday project.

Approach to Port Kaldi

This project aims to implement the following classes as wrappers around PyTorch's torch::Tensor class.

Vector Classes

  • kaldi::VectorBase
  • kaldi::Vector
  • kaldi::SubVector

Matrix Classes

  • kaldi::MatrixBase
  • kaldi::Matrix
  • kaldi::SubMatrix

(You can check out the code from here.)

Theoretically, by swapping the original source code with these implementations, we should be able to build the rest of the Kaldi libraries (except the parts related to CUDA and OpenFST, which I have not looked into).

Once we build the Kaldi code with PyTorch's backend, it should be fairly easy to build the PyTorch binding of the resulting library, and this means that we can call Kaldi functions from PyTorch natively.

Execution

Since Kaldi's code base is huge, it is difficult to start by forking and modifying it. Instead, I took a bottom-up approach: decide on a target feature to port, then implement the interface of the Vector/Matrix classes that the feature needs.

When compiling the target feature, its source code is copied to the workspace with minimal modification. Interestingly, all I have had to do so far is comment out some #include statements that are not directly related to the target feature and swap some type definitions. You can check these out in kaldi.patch.

For the initial target feature, I chose ComputeKaldiPitch and the corresponding CLI, compute-kaldi-pitch-feats.

I am porting these features in the following manner.

Phase 1 - Port ComputeKaldiPitch

The goal of this phase is to have a ComputeKaldiPitch function that produces exactly the same result as the original implementation. Performance does not matter at this stage. In fact, the resulting Vector / Matrix classes are wrappers around torch::Tensor, which is backed by a similar (or the same) BLAS library, whereas Kaldi's original implementation calls the BLAS library directly. The port is therefore expected to be slower, or at best equally fast.

  • Implement the minimal set of methods from Vector / Matrix classes. 016ab2e7
  • Compile ComputeKaldiPitch.
  • Bind the resulting ComputeKaldiPitch to Python. src
  • Check the parity of the Python function and compute-kaldi-pitch-feats from the original code. test
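The parity check in the last step can be sketched roughly as follows. This is not the repository's actual test. It assumes the original CLI was run with a text-mode output specifier (ark,t:), which prints each matrix as `key  [ rows ]`, and it compares the parsed values against the Python result within a tolerance, since text output rounds values. The helper names `parse_kaldi_text_matrix` and `matrices_close` and the default tolerances are hypothetical.

```python
import math


def parse_kaldi_text_matrix(text):
    """Parse a single `key  [ row ... ]` matrix from Kaldi text-archive output."""
    header, _, body = text.partition("[")
    key = header.strip()
    body = body.rsplit("]", 1)[0]  # drop the closing bracket and anything after it
    rows = [
        [float(value) for value in line.split()]
        for line in body.splitlines()
        if line.strip()  # skip the empty line that follows "["
    ]
    return key, rows


def matrices_close(a, b, rel_tol=1e-5, abs_tol=1e-6):
    """Return True if both matrices have the same shape and near-equal entries."""
    if len(a) != len(b):
        return False
    for row_a, row_b in zip(a, b):
        if len(row_a) != len(row_b):
            return False
        for x, y in zip(row_a, row_b):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True
```

In a test, the rows parsed from the CLI output would be compared with the matrix returned by the Python binding, converted to nested lists first (for a tensor, `.tolist()`).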

Phase 2 - Port compute-kaldi-pitch-feats

The next step is to port compute-kaldi-pitch-feats CLI so that I can compare the speed of the original CLI and the ported version.

  • Extend the Vector / Matrix classes bc8ac3c0.
  • Compile compute-kaldi-pitch-feats (#12)
  • Compare the speed of the original compute-kaldi-pitch-feats and the ported one.
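One simple way to run the speed comparison is to time each CLI as a subprocess. The sketch below is an illustration, not the method used in this repository. The helper name `best_wall_time` and the choice to report the fastest of several runs are assumptions.

```python
import subprocess
import sys
import time


def best_wall_time(cmd, repeats=3):
    """Run `cmd` `repeats` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the command fails, so failed runs are never timed.
        subprocess.run(
            cmd,
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        timings.append(time.perf_counter() - start)
    return min(timings)
```

For example, `best_wall_time(["compute-kaldi-pitch-feats", "scp:wav.scp", "ark:/dev/null"])` would time the original CLI on the files listed in a wav.scp. The same call with the path of the ported binary (a placeholder, since it depends on the build) gives the number to compare against. Taking the minimum reduces noise from disk caching and other processes.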

Phase 3 - Improve the performance of ComputeKaldiPitch

The third step is to improve the speed of ComputeKaldiPitch by modifying the implementation to take advantage of PyTorch's C++ API (and potentially getting rid of the Vector / Matrix classes).

  • Vectorize the operation and get rid of sequential element access.
  • Parallelize operations.
  • (Optional) Enable GPU support.

Build

Because of the approach explained in the previous section, this repository is not a fork of the original Kaldi. Instead, it references Kaldi as a git submodule and copies the required source files from it.

tools.py facilitates this process.

Note: When changing the list of source files under source control in src/libtkaldi/src, edit .gitignore and tools.py.

  • ./tools.py init
    This will sync the Kaldi submodule (in third_party/kaldi), clean up any changes present there, then apply the patch from kaldi.patch.

  • ./tools.py dev
    This will run git-clean on the current src/libtkaldi (so that files that are not under source control are removed), copy the designated source files from the third_party/kaldi directory, then run python setup.py develop to build the library.

  • ./tools.py stash
    This will stash the changes made to the Kaldi submodule to kaldi.patch. When you change the original Kaldi source code and need to persist the change across commits, you need to check in the patch.
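To make the init and stash steps more concrete, the sketch below lists git commands that would have the described effect. These are assumptions for illustration and are not taken from tools.py, which may do this differently. The function names are hypothetical, and only the paths third_party/kaldi and kaldi.patch come from the description above.

```python
import shlex

KALDI_DIR = "third_party/kaldi"  # submodule location named in the description
PATCH = "kaldi.patch"            # patch file named in the description


def init_commands():
    """Commands approximating `./tools.py init` (assumed, not the real script)."""
    return [
        # Sync the Kaldi submodule.
        ["git", "submodule", "update", "--init", "--recursive"],
        # Clean up any changes present in the submodule.
        ["git", "-C", KALDI_DIR, "reset", "--hard"],
        ["git", "-C", KALDI_DIR, "clean", "-fdx"],
        # Apply the patch; the path is relative to the submodule directory.
        ["git", "-C", KALDI_DIR, "apply", f"../../{PATCH}"],
    ]


def stash_command():
    """Command whose stdout would be saved as kaldi.patch (assumed)."""
    return ["git", "-C", KALDI_DIR, "diff"]


def render(commands):
    """Format the commands as shell lines, for a dry run."""
    return "\n".join(shlex.join(cmd) for cmd in commands)
```

Calling `print(render(init_commands()))` shows the commands without running anything.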

Getting Started

git clone https://github.com/mthrok/tkaldi
cd tkaldi
./tools.py init

Building and Running Tests

./tools.py dev
pytest tests

Requirements

pytorch >= 1.7