Feature extraction in Lhotse is currently based exclusively on the Torchaudio library. We support spectrograms, log-Mel energies (fbank) and MFCCs. Fbank features are the default. We also support custom feature extractors via a Python API (these won't be available in the CLI unless there is popular demand for that).
We are striving for a simple relation between the audio duration, the number of frames, and the frame shift. You only need to know two of those values to compute the third one, regardless of the frame length. This is equivalent to setting Kaldi's snip_edges parameter to False.
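The relation above can be sketched in a few lines. This is a minimal illustration rather than Lhotse's actual implementation: the helper names are hypothetical, and rounding half up through Decimal is an assumption made here to sidestep floating-point surprises.

```python
from decimal import Decimal, ROUND_HALF_UP


def num_frames_from(duration: float, frame_shift: float) -> int:
    """Number of frames for `duration` seconds, independent of the frame length."""
    ratio = Decimal(str(duration)) / Decimal(str(frame_shift))
    return int(ratio.quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def duration_from(num_frames: int, frame_shift: float) -> float:
    """Duration in seconds covered by `num_frames` frames."""
    return float(Decimal(num_frames) * Decimal(str(frame_shift)))


def frame_shift_from(duration: float, num_frames: int) -> float:
    """Frame shift in seconds implied by a duration and a frame count."""
    return float(Decimal(str(duration)) / Decimal(num_frames))


# Any two of the three values determine the third one.
print(num_frames_from(16.04, 0.01))   # 1604
print(duration_from(1604, 0.01))      # 16.04
print(frame_shift_from(16.04, 1604))  # 0.01
```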
Features in Lhotse are stored as numpy matrices with shape (num_frames, num_features). By default, we use lilcom for lossy compression, which reduces the size on disk by about 3x. The lilcom compression method uses a fixed precision that doesn't depend on the magnitude of the values being compressed, so it's better suited to log-energy features than to energy features. We currently support two kinds of storage:
- HDF5 files with multiple feature matrices
- a directory with one feature matrix per file
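To see why a fixed-precision codec suits log-energies better than raw energies, consider the toy quantizer below. It is not lilcom's algorithm, and the step size is a hypothetical value chosen only for illustration. It just rounds values to a fixed grid and compares the relative error in both domains.

```python
import math


def quantize(value: float, step: float) -> float:
    """Round `value` to the nearest multiple of a fixed `step`."""
    return round(value / step) * step


def relative_error(true_value: float, approx: float) -> float:
    """Error of `approx` relative to the magnitude of `true_value`."""
    return abs(approx - true_value) / true_value


STEP = 0.05  # hypothetical fixed precision, not lilcom's real setting

for energy in (1e-6, 1e-3, 1.0, 1e3):
    # Quantize the energy directly ...
    linear = quantize(energy, STEP)
    # ... or quantize its logarithm and map the result back.
    via_log = math.exp(quantize(math.log(energy), STEP))
    print(
        f"energy={energy:g}: "
        f"linear error={relative_error(energy, linear):.3f}, "
        f"log-domain error={relative_error(energy, via_log):.3f}"
    )
```

In the linear domain small energies collapse to zero, while in the log domain the relative error stays bounded regardless of the magnitude.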
We retrieve the arrays by loading the whole feature matrix from disk and selecting the relevant region (e.g. specified by a cut). Therefore it makes sense to cut the recordings first, and then extract the features for them to avoid loading unnecessary data from disk (especially for very long recordings).
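The retrieval step described above amounts to slicing along the frame axis. The sketch below uses a hypothetical helper and a plain list of rows as a stand-in for a loaded matrix; the same slicing applies to a numpy array.

```python
from typing import Sequence


def select_region(features: Sequence, frame_shift: float, start: float, duration: float):
    """Return the frames covering [start, start + duration) seconds."""
    first = round(start / frame_shift)
    count = round(duration / frame_shift)
    return features[first:first + count]


# Stand-in for a matrix loaded from disk: 1604 frames with 23 features each.
# Each row holds its own frame index so the selection is easy to inspect.
full_matrix = [[float(i)] * 23 for i in range(1604)]

# The whole matrix had to be loaded even though only 1.5 seconds are needed.
region = select_region(full_matrix, frame_shift=0.01, start=2.0, duration=1.5)
print(len(region), region[0][0], region[-1][0])
```

The cost of loading is paid for all 1604 frames even though only 150 are used, which is why cutting before extraction pays off for long recordings.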
There are two types of manifests:
- one describing the feature extractor;
- one describing the extracted feature matrices.
The feature extractor manifest is mapped to a Python configuration dataclass. An example for spectrogram:
dither: 0.0
energy_floor: 1e-10
frame_length: 0.025
frame_shift: 0.01
min_duration: 0.0
preemphasis_coefficient: 0.97
raw_energy: true
remove_dc_offset: true
round_to_power_of_two: true
window_type: povey
type: spectrogram
And the corresponding configuration class:
lhotse.features.SpectrogramConfig
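The mapping between a manifest and a configuration dataclass can be illustrated with the standard library alone. The class below is a hypothetical stand-in with a subset of the fields shown above, not the real lhotse.features.SpectrogramConfig.

```python
from dataclasses import asdict, dataclass


@dataclass
class ExampleSpectrogramConfig:
    # Hypothetical stand-in: a few fields copied from the manifest example.
    dither: float = 0.0
    energy_floor: float = 1e-10
    frame_length: float = 0.025
    frame_shift: float = 0.01
    window_type: str = "povey"


# Override one default, as a user-edited manifest might do.
config = ExampleSpectrogramConfig(frame_shift=0.02)

# A dataclass converts to a plain dict, which is easy to dump as YAML or JSON ...
as_dict = asdict(config)
print(as_dict)

# ... and the dict read back from a manifest re-creates an equal config object.
restored = ExampleSpectrogramConfig(**as_dict)
print(restored == config)
```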
The feature matrices manifest is a list of documents. These documents contain the information necessary to tie the features to a particular recording: start, duration, channel and recording_id. They currently do not have their own IDs. They also provide some useful information, such as the type of features, number of frames and feature dimension. Finally, they specify how the feature matrix is stored with storage_type (currently numpy or lilcom), and where to find it with storage_path. In the future there might be more storage types.
- channels: 0
  duration: 16.04
  num_features: 23
  num_frames: 1604
  recording_id: recording-1
  start: 0.0
  storage_path: test/fixtures/libri/storage/dc2e0952-f2f8-423c-9b8c-f5481652ee1d.llc
  storage_type: lilcom
  type: fbank
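Since the documents carry no IDs of their own, a features entry is found through the recording it belongs to. The sketch below shows one way such a lookup could work; the FeaturesEntry class and the find_entries helper are hypothetical and only mirror the fields of the example document.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FeaturesEntry:
    # Hypothetical mirror of one manifest document.
    recording_id: str
    channels: int
    start: float
    duration: float
    num_frames: int
    num_features: int
    type: str
    storage_type: str
    storage_path: str

    @property
    def end(self) -> float:
        return self.start + self.duration


def find_entries(
    entries: List[FeaturesEntry], recording_id: str, channel: int, start: float, end: float
) -> List[FeaturesEntry]:
    """Entries of the given recording and channel that fully cover [start, end]."""
    return [
        e for e in entries
        if e.recording_id == recording_id
        and e.channels == channel
        and e.start <= start
        and end <= e.end
    ]


manifest = [
    FeaturesEntry(
        recording_id="recording-1", channels=0, start=0.0, duration=16.04,
        num_frames=1604, num_features=23, type="fbank", storage_type="lilcom",
        storage_path="test/fixtures/libri/storage/dc2e0952-f2f8-423c-9b8c-f5481652ee1d.llc",
    )
]
matches = find_entries(manifest, "recording-1", channel=0, start=2.0, end=5.0)
print([m.storage_path for m in matches])
```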
There are two components needed to implement a custom feature extractor: a configuration and the extractor itself. We expect the configuration class to be a dataclass, so that it can be automatically mapped to dict and serialized. The feature extractor should inherit from FeatureExtractor, and implement a small number of methods/properties. The base class takes care of initialization (you need to pass a config object), serialization to YAML, etc. A minimal, complete example of adding a new feature extractor:
from dataclasses import dataclass

import numpy as np
from scipy.signal import stft

from lhotse.features import FeatureExtractor
from lhotse.utils import Seconds


@dataclass
class ExampleFeatureExtractorConfig:
    frame_len: Seconds = 0.025
    frame_shift: Seconds = 0.01


class ExampleFeatureExtractor(FeatureExtractor):
    """
    A minimal class example, showing how to implement a custom feature extractor in Lhotse.
    """
    name = 'example-feature-extractor'
    config_type = ExampleFeatureExtractorConfig

    def extract(self, samples: np.ndarray, sampling_rate: int) -> np.ndarray:
        frame_len_samples = round(self.config.frame_len * sampling_rate)
        frame_shift_samples = round(self.frame_shift * sampling_rate)
        f, t, Zxx = stft(
            samples,
            sampling_rate,
            nperseg=frame_len_samples,
            # scipy expects the overlap between consecutive frames, not the hop size
            noverlap=frame_len_samples - frame_shift_samples
        )
        # Note: returning a magnitude of the STFT might interact badly with lilcom compression,
        # as it performs quantization of the float values and works best with log-scale quantities.
        # It's advised to turn lilcom compression off, or use log-scale, in such cases.
        return np.abs(Zxx)

    @property
    def frame_shift(self) -> Seconds:
        return self.config.frame_shift

    def feature_dim(self, sampling_rate: int) -> int:
        # Number of frequency bins of a one-sided FFT over a single frame
        return round(sampling_rate * self.config.frame_len) // 2 + 1
The overridden members include:
- name, for easy debuggability/automatic re-creation of an extractor;
- config_type, which specifies the complementary configuration class type;
- extract(), where the actual computation takes place;
- the frame_shift property, which is key to knowing the relationship between the duration and the number of frames;
- the feature_dim() method, which accepts the sampling_rate as its argument, as some types of features (e.g. spectrogram) will depend on that.
Additionally, there are two extra methods that, when overridden, allow performing dynamic feature-space mixing (see Cuts):
@staticmethod
def mix(features_a: np.ndarray, features_b: np.ndarray, gain_b: float) -> np.ndarray:
    raise ValueError(f'The feature extractor\'s "mix" operation is undefined.')

@staticmethod
def compute_energy(features: np.ndarray) -> float:
    raise ValueError(f'The feature extractor\'s "compute_energy" is undefined.')
They are:
- mix(), which specifies how to mix two feature matrices to obtain a new feature matrix representing the sum of signals;
- compute_energy(), which specifies how to obtain the total energy of the feature matrix, which is needed to mix two signals with a specified SNR. E.g. for a power spectrogram, this could be the sum of every time-frequency bin. It is expected to never return zero.
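As a rough sketch of what these two methods could look like for a power spectrogram, consider the functions below. They assume that the powers of the two signals simply add. The energy floor is a hypothetical constant introduced here so that the result is never zero, and plain lists are used instead of numpy arrays only to keep the example dependency-free.

```python
from typing import List

Matrix = List[List[float]]  # rows are frames, columns are frequency bins

ENERGY_FLOOR = 1e-10  # hypothetical floor, so that the energy is never zero


def compute_energy(features: Matrix) -> float:
    """Total energy of a power spectrogram: the sum of every time-frequency bin."""
    total = sum(sum(frame) for frame in features)
    return max(total, ENERGY_FLOOR)


def mix(features_a: Matrix, features_b: Matrix, gain_b: float) -> Matrix:
    """Add `features_b`, scaled by `gain_b`, to `features_a` bin by bin."""
    return [
        [a + gain_b * b for a, b in zip(frame_a, frame_b)]
        for frame_a, frame_b in zip(features_a, features_b)
    ]


speech = [[1.0, 2.0], [3.0, 4.0]]
noise = [[0.5, 0.5], [0.5, 0.5]]
print(compute_energy(speech))
print(mix(speech, noise, gain_b=2.0))
```

For log-scale features the same idea would require converting back to the linear domain before adding and taking the logarithm afterwards.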
During the feature-domain mix with a specified signal-to-noise ratio (SNR), we assume that one of the signals is a reference signal - it is used to initialize the FeatureMixer class. We compute the energy of both signals and scale the non-reference signal, so that its energy satisfies the requested SNR. The scaling factor (gain) is computed using the following formula:
../lhotse/features/mixer.py
Note that we interpret the energy and the SNR in a power quantity context (as opposed to root-power/field quantities).
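Under the power-quantity interpretation, a gain that satisfies a requested SNR follows directly from the definition SNR = 10 * log10(E_ref / (gain * E_added)). The sketch below is derived from that definition only; it does not reproduce the code in lhotse/features/mixer.py, and the function name is hypothetical.

```python
import math


def gain_for_snr(reference_energy: float, added_energy: float, snr_db: float) -> float:
    """Gain for the non-reference signal so that the mix has the requested SNR (in dB)."""
    # Energy that the added signal should have after scaling
    target_energy = reference_energy * 10.0 ** (-snr_db / 10.0)
    # Energies are power quantities, so the gain is a plain ratio (no square root)
    return target_energy / added_energy


gain = gain_for_snr(reference_energy=200.0, added_energy=50.0, snr_db=10.0)

# Sanity check: recompute the SNR from the scaled energy
achieved_snr = 10.0 * math.log10(200.0 / (gain * 50.0))
print(gain, achieved_snr)
```

A time-domain mix would instead work with root-power quantities and take the square root of this ratio.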
We will briefly discuss how to perform mean and variance normalization (a.k.a. CMVN) in Lhotse effectively. We compute and store unnormalized features, and it is up to the user to normalize them if they want to do so. There are three common ways to perform feature normalization:
- Global normalization: we compute the means and variances using the whole data (FeatureSet or CutSet), and apply the same transform on every sample. The global statistics can be computed efficiently with FeatureSet.compute_global_stats() or CutSet.compute_global_feature_stats(). They use an iterative algorithm that does not require loading the whole dataset into memory.
- Per-instance normalization: we compute the means and variances separately for each data sample (i.e. a single feature matrix). Each feature matrix undergoes a different transform. This approach seems to be common in computer vision modelling.
- Sliding window ("online") normalization: we compute the means and variances using a slice of the feature matrix with a specified duration, e.g. 3 seconds (a standard value in Kaldi). This is useful when we expect the model to work on incomplete inputs, e.g. streaming speech recognition. We currently recommend using Torchaudio CMVN for that.
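The first two strategies can be sketched as follows. This is an illustration of the idea, not the algorithm used by the Lhotse methods named above: the helpers are hypothetical, they accumulate running sums one matrix at a time, and they use plain lists to stay dependency-free.

```python
import math
from typing import Iterable, List, Tuple

Matrix = List[List[float]]  # rows are frames, columns are feature dimensions


def global_stats(matrices: Iterable[Matrix]) -> Tuple[List[float], List[float]]:
    """Per-dimension (means, stds), visiting one matrix at a time."""
    count = 0
    sums: List[float] = []
    sq_sums: List[float] = []
    for matrix in matrices:
        for frame in matrix:
            if not sums:
                sums = [0.0] * len(frame)
                sq_sums = [0.0] * len(frame)
            for dim, value in enumerate(frame):
                sums[dim] += value
                sq_sums[dim] += value * value
            count += 1
    means = [s / count for s in sums]
    # Population variance; clamp at zero to guard against round-off
    stds = [math.sqrt(max(sq / count - m * m, 0.0)) for sq, m in zip(sq_sums, means)]
    return means, stds


def normalize(matrix: Matrix, means: List[float], stds: List[float], eps: float = 1e-8) -> Matrix:
    """Apply mean and variance normalization with the given statistics."""
    return [[(v - m) / (s + eps) for v, m, s in zip(frame, means, stds)] for frame in matrix]


def per_instance_normalize(matrix: Matrix) -> Matrix:
    """Normalize a matrix with statistics computed from that matrix alone."""
    return normalize(matrix, *global_stats([matrix]))


dataset = [[[1.0, 10.0], [3.0, 10.0]], [[5.0, 10.0], [7.0, 10.0]]]
means, stds = global_stats(iter(dataset))  # an iterator: matrices are visited once
print(means, stds)
print(normalize(dataset[0], means, stds))
print(per_instance_normalize(dataset[0]))
```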
Lhotse can be extended with additional storage backends via two abstractions: FeaturesWriter and FeaturesReader. We currently implement the following writers (and their corresponding readers):
lhotse.features.io.LilcomFilesWriter
lhotse.features.io.NumpyFilesWriter
lhotse.features.io.LilcomHdf5Writer
lhotse.features.io.NumpyHdf5Writer
The FeaturesWriter and FeaturesReader API is as follows:
lhotse.features.io.FeaturesWriter
lhotse.features.io.FeaturesReader
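To give a feel for what a storage backend involves, here is a hypothetical in-memory writer/reader pair. It does not inherit from the Lhotse base classes, and the method names and signatures (write() returning a storage key, read() accepting frame offsets) are assumptions made for this sketch; consult the class documentation for the actual interface.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]


class InMemoryWriter:
    """Hypothetical writer that keeps the matrices in a dict instead of on disk."""

    name = "memory"

    def __init__(self, storage: Dict[str, Matrix]):
        self.storage = storage

    def write(self, key: str, value: Matrix) -> str:
        self.storage[key] = [list(frame) for frame in value]
        # Return the key that a manifest would record to locate the matrix later
        return key

    def __enter__(self) -> "InMemoryWriter":
        return self

    def __exit__(self, *exc) -> None:
        # A file-based backend would flush and close its handles here
        return None


class InMemoryReader:
    """Hypothetical reader matching InMemoryWriter."""

    name = "memory"

    def __init__(self, storage: Dict[str, Matrix]):
        self.storage = storage

    def read(
        self, key: str, left_offset_frames: int = 0, right_offset_frames: Optional[int] = None
    ) -> Matrix:
        # Select only the requested frame range
        return self.storage[key][left_offset_frames:right_offset_frames]


shared: Dict[str, Matrix] = {}
with InMemoryWriter(shared) as writer:
    storage_key = writer.write("utt-1", [[0.0], [1.0], [2.0], [3.0]])

reader = InMemoryReader(shared)
print(reader.read(storage_key, left_offset_frames=1, right_offset_frames=3))
```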
The feature manifest is represented by a FeatureSet object. Feature extractors have a class that represents both the extractor and its configuration, named FeatureExtractor. We provide a utility called FeatureSetBuilder that can process a RecordingSet in parallel, store the feature matrices on disk and generate a feature manifest.
For example:
from lhotse import RecordingSet, Fbank, LilcomFilesWriter
from lhotse.features import FeatureSetBuilder

# Read a RecordingSet from disk
recording_set = RecordingSet.from_yaml('audio.yml')
# Create a log Mel energy filter bank feature extractor with default settings
feature_extractor = Fbank()
# Create a feature set builder that uses this extractor and stores the results in a directory called 'features'
with LilcomFilesWriter('features') as storage:
    builder = FeatureSetBuilder(feature_extractor=feature_extractor, storage=storage)
    # Extract the features using 8 parallel processes, compress, and store them in the 'features/storage/' directory.
    # Then, return the feature manifest object, which is also compressed and
    # stored in 'features/feature_manifest.json.gz'
    feature_set = builder.process_and_store_recordings(
        recordings=recording_set,
        num_jobs=8
    )
It is also possible to extract the features directly from a CutSet - see below:
lhotse.cut.CutSet.compute_and_store_features
An equivalent example using the terminal:
lhotse write-default-feature-config feat-config.yml
lhotse make-feats -j 8 --storage-type lilcom_files -f feat-config.yml audio.yml features/
We rely on the Torchaudio Kaldi compatibility module, so most of the spectrogram/fbank/mfcc parameters are the same as in Kaldi. However, we are not fully compatible - Kaldi computes energies from a signal scaled between -32,768 and 32,767, while Torchaudio scales the signal between -1.0 and 1.0. As a result, Kaldi energies are significantly greater than those in Lhotse. By default, we turn off dithering for deterministic feature extraction.
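The size of that difference can be estimated with a small experiment. The sketch below ignores windowing, pre-emphasis and every other step of a real pipeline. It only compares the log-energy of the same toy signal at both scales, assuming a scale factor of 32768.

```python
import math
from typing import Sequence


def log_energy(samples: Sequence[float]) -> float:
    """Natural logarithm of the sum of squared samples."""
    return math.log(sum(s * s for s in samples))


# The same toy signal at the two scales
unit_scale = [0.1, -0.25, 0.5, -0.05]           # within [-1.0, 1.0]
int16_scale = [s * 32768 for s in unit_scale]   # 16-bit integer range

# Energy grows with the square of the amplitude, so in the log domain
# the two scales differ by a constant offset of 2 * ln(32768), about 20.79.
offset = log_energy(int16_scale) - log_energy(unit_scale)
print(offset, 2 * math.log(32768))
```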