Allosaurus is a pretrained universal phone recognizer. It can be used to recognize phones in more than 2000 languages.
This tool is based on our ICASSP 2020 work, Universal Phone Recognition with a Multilingual Allophone System.
Allosaurus is available from pip
pip install allosaurus
You can also clone this repository and install
python setup.py install
Basic usage is simple: the input is a wav audio file and the output is a sequence of phones.
python -m allosaurus.run -i <audio>
For example, you can try using the attached sample file in this repository. Guess what's in this audio file :)
python -m allosaurus.run -i sample.wav
æ l u s ɔ ɹ s
You can also use allosaurus directly in python
from allosaurus.app import read_recognizer
# load your model
model = read_recognizer()
# run inference -> æ l u s ɔ ɹ s
model.recognize('sample.wav')
For full features and details, please refer to the following sections.
The command line interface is as follows:
python -m allosaurus.run [--lang <language name>] [--model <model name>] [--device_id <gpu_id>] [--output <output_file>] [--topk <int>] -i <audio file/directory>
It recognizes the narrow phones in the audio file(s). Only the input argument is mandatory; the other options can be omitted. Please refer to the following sections for details.
There is also a simple python interface as follows:
from allosaurus.app import read_recognizer
# load your model by the <model name>, will use 'latest' if left empty
model = read_recognizer(model)
# run inference on <audio_file> with <lang>, lang will be 'ipa' if left empty
model.recognize(audio_file, lang)
The arguments of both interfaces are described below.
The input can be a single file or a directory containing multiple audio files.
If the input is a single file, only the phone sequence is output. If the input is a directory, both the file name and the phone sequence are output for each file, sorted by file name.
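When calling the Python interface directly, the directory behaviour described above can be approximated with a small helper. This is a minimal sketch and not part of allosaurus: `recognize_directory` is a hypothetical name, it assumes only that the recognizer object exposes `recognize(path, lang)` as shown earlier, and restricting the scan to `.wav` files is also an assumption.

```python
from pathlib import Path


def recognize_directory(recognizer, directory, lang="ipa"):
    """Run recognizer.recognize on every .wav file in `directory`.

    Hypothetical helper: `recognizer` is any object with a
    recognize(path, lang) method, such as the one returned by
    allosaurus.app.read_recognizer().
    Returns (file name, phone string) pairs sorted by file name.
    """
    results = []
    for wav_path in sorted(Path(directory).glob("*.wav")):
        phones = recognizer.recognize(str(wav_path), lang)
        results.append((wav_path.name, phones))
    return results
```

Because the helper only relies on the `recognize` method, it can be exercised with a stub object before a real model is downloaded.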
The audio file(s) should be in the following format:
- It should be a wav file. If the audio is not in wav format, please convert it to wav with sox or ffmpeg in advance.
- The sampling rate can be arbitrary; the audio is automatically resampled to match the model's requirements.
- The audio is assumed to be mono-channel.
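Before running recognition, it can be handy to confirm that a file meets these requirements. The sketch below uses only Python's standard `wave` module; `describe_wav` and `is_mono` are hypothetical helper names, not allosaurus functions. Note that the `wave` module reads uncompressed PCM wav files only, so a file it rejects may still need conversion with sox or ffmpeg.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_seconds) for a PCM wav file."""
    with wave.open(path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
        duration = wav_file.getnframes() / sample_rate
    return channels, sample_rate, duration


def is_mono(path):
    """True if the file has exactly one channel."""
    return describe_wav(path)[0] == 1
```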
The output is by default stdout (i.e. it will print all results to terminal).
If you specify a file as the output, then all output will be directed to that file.
The `lang` option is the language id. It specifies the phone inventory you want to use.
The default is `ipa`, which tells the recognizer to use the entire inventory (around 230 phones).
Generally, specifying the language inventory can improve your recognition accuracy.
You can check the full language list with the following command. The number of available languages is around 2000.
python -m allosaurus.bin.list_lang
To check a language's inventory, you can use the following command.
python -m allosaurus.bin.list_phone [--lang <language name>]
For example,
# to get English phone inventory
# ['a', 'aː', 'b', 'd', 'd̠', 'e', 'eː', 'e̞', 'f', 'h', 'i', 'iː', 'j', 'k', 'kʰ', 'l', 'm', 'n', 'o', 'oː', 'p', 'pʰ', 'r', 's', 't', 'tʰ', 't̠', 'u', 'uː', 'v', 'w', 'x', 'z', 'æ', 'ð', 'øː', 'ŋ', 'ɐ', 'ɐː', 'ɑ', 'ɑː', 'ɒ', 'ɒː', 'ɔ', 'ɔː', 'ɘ', 'ə', 'əː', 'ɛ', 'ɛː', 'ɜː', 'ɡ', 'ɪ', 'ɪ̯', 'ɯ', 'ɵː', 'ɹ', 'ɻ', 'ʃ', 'ʉ', 'ʉː', 'ʊ', 'ʌ', 'ʍ', 'ʒ', 'ʔ', 'θ']
python -m allosaurus.bin.list_phone --lang eng
# you can also skip lang option to get all inventory
#['I', 'a', 'aː', 'ã', 'ă', 'b', 'bʲ', 'bʲj', 'bʷ', 'bʼ', 'bː', 'b̞', 'b̤', 'b̥', 'c', 'd', 'dʒ', 'dʲ', 'dː', 'd̚', 'd̥', 'd̪', 'd̯', 'd͡z', 'd͡ʑ', 'd͡ʒ', 'd͡ʒː', 'd͡ʒ̤', 'e', 'eː', 'e̞', 'f', 'fʲ', 'fʷ', 'fː', 'g', 'gʲ', 'gʲj', 'gʷ', 'gː', 'h', 'hʷ', 'i', 'ij', 'iː', 'i̞', 'i̥', 'i̯', 'j', 'k', 'kx', 'kʰ', 'kʲ', 'kʲj', 'kʷ', 'kʷʼ', 'kʼ', 'kː', 'k̟ʲ', 'k̟̚', 'k͡p̚', 'l', 'lʲ', 'lː', 'l̪', 'm', 'mʲ', 'mʲj', 'mʷ', 'mː', 'n', 'nj', 'nʲ', 'nː', 'n̪', 'n̺', 'o', 'oː', 'o̞', 'o̥', 'p', 'pf', 'pʰ', 'pʲ', 'pʲj', 'pʷ', 'pʷʼ', 'pʼ', 'pː', 'p̚', 'q', 'r', 'rː', 's', 'sʲ', 'sʼ', 'sː', 's̪', 't', 'ts', 'tsʰ', 'tɕ', 'tɕʰ', 'tʂ', 'tʂʰ', 'tʃ', 'tʰ', 'tʲ', 'tʷʼ', 'tʼ', 'tː', 't̚', 't̪', 't̪ʰ', 't̪̚', 't͡s', 't͡sʼ', 't͡ɕ', 't͡ɬ', 't͡ʃ', 't͡ʃʲ', 't͡ʃʼ', 't͡ʃː', 'u', 'uə', 'uː', 'u͡w', 'v', 'vʲ', 'vʷ', 'vː', 'v̞', 'v̞ʲ', 'w', 'x', 'x̟ʲ', 'y', 'z', 'zj', 'zʲ', 'z̪', 'ä', 'æ', 'ç', 'çj', 'ð', 'ø', 'ŋ', 'ŋ̟', 'ŋ͡m', 'œ', 'œ̃', 'ɐ', 'ɐ̞', 'ɑ', 'ɑ̱', 'ɒ', 'ɓ', 'ɔ', 'ɔ̃', 'ɕ', 'ɕː', 'ɖ̤', 'ɗ', 'ə', 'ɛ', 'ɛ̃', 'ɟ', 'ɡ', 'ɡʲ', 'ɡ̤', 'ɡ̥', 'ɣ', 'ɣj', 'ɤ', 'ɤɐ̞', 'ɤ̆', 'ɥ', 'ɦ', 'ɨ', 'ɪ', 'ɫ', 'ɯ', 'ɯ̟', 'ɯ̥', 'ɰ', 'ɱ', 'ɲ', 'ɳ', 'ɴ', 'ɵ', 'ɸ', 'ɹ', 'ɹ̩', 'ɻ', 'ɻ̩', 'ɽ', 'ɾ', 'ɾj', 'ɾʲ', 'ɾ̠', 'ʀ', 'ʁ', 'ʁ̝', 'ʂ', 'ʃ', 'ʃʲː', 'ʃ͡ɣ', 'ʈ', 'ʉ̞', 'ʊ', 'ʋ', 'ʋʲ', 'ʌ', 'ʎ', 'ʏ', 'ʐ', 'ʑ', 'ʒ', 'ʒ͡ɣ', 'ʔ', 'ʝ', 'ː', 'β', 'β̞', 'θ', 'χ', 'ә', 'ḁ']
python -m allosaurus.bin.list_phone
The `model` option selects the model used for inference.
The default is `latest`, which points to the latest model you have downloaded.
It will automatically download the latest model during your first inference if you do not have any local models.
We intend to train new models and continuously release them. The update might include both acoustic model binary files and phone inventory. Typically, the model's name indicates its training date, so usually a higher model id should be expected to perform better.
To download a new model, you can run the following command.
python -m allosaurus.bin.download_model -m <model>
If you do not know the model name, you can use `latest` as the model name and the latest model will be downloaded automatically.
Updating to a new model does not delete the existing models. All models are stored under the `pretrained` directory where you installed allosaurus.
You might want to pin a specific model to get consistent results within an experiment.
To see which models are available in your local environment, you can check with the following command
python -m allosaurus.bin.list_model
To delete a model, you can use the following command. This can be useful when fine-tuning models, as described later.
python -m allosaurus.bin.remove_model
The currently available models are as follows.
The universal models predict language-independent phones and cover many languages. This is the default model allosaurus will try to download and use. If you cannot find your language among the language-dependent models, please use this universal model instead.
Model | Target Language | Description |
---|---|---|
uni2005 | Universal | This is the latest model (previously named 200529) |
We are planning to deliver language-dependent models for some widely used languages. The models here are trained specifically on the target language.
They should perform much better than the universal model for the target language. These models are not downloaded automatically. Please use the `download_model` command above to download them, and pass the `--model` flag during inference.
Model | Target Language | Description |
---|---|---|
eng2102 | English (eng) | English only model |
`device_id` controls which device runs the inference.
By default, `device_id` is -1, which means the model uses only CPUs.
If you have GPUs, you can use one for inference by setting `device_id` to a single GPU id (multi-GPU inference is not supported).
You can retrieve an approximate timestamp for each recognized phone by using the `timestamp` argument.
python -m allosaurus.run --timestamp=True -i sample.wav
0.210 0.045 æ
0.390 0.045 l
0.450 0.045 u
0.540 0.045 s
0.630 0.045 ɔ
0.720 0.045 ɹ
0.870 0.045 s
Each line has the format `start_timestamp duration phone`, where `start_timestamp` and `duration` are in seconds.
Note that the current timestamp is only an approximation. It is provided by the CTC model, which might not be accurate in some cases due to its nature.
The same interface is also available in python as follows:
model = read_recognizer()
model.recognize('./sample.wav', timestamp=True)
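For downstream processing, the `start_timestamp duration phone` lines can be turned into structured records. The following is a minimal sketch: `parse_timestamps` is a hypothetical helper, and it assumes the text has exactly the command-line layout shown above. Whether the value returned by the Python call uses the same layout is an assumption you should verify.

```python
def parse_timestamps(output):
    """Parse lines of 'start duration phone' into (start, duration, phone) tuples."""
    segments = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        start, duration, phone = line.split(maxsplit=2)
        segments.append((float(start), float(duration), phone))
    return segments
```

Each tuple holds the start time and duration in seconds, so the end of a segment is simply `start + duration`.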
Sometimes generating more phones might be helpful. Specifying the `--topk` argument generates k phones at each emitting frame. The default is 1.
# default topk is 1
python -m allosaurus.run -i sample.wav
æ l u s ɔ ɹ s
# output top 5 probable phones at emitting frame, "|" is used to delimit frames (no delimiter when topk=1)
# probability is attached for each phone, the left most phone is the most probable phone
# <blk> is blank which can be ignored.
python -m allosaurus.run -i sample.wav --topk=5
æ (0.577) ɛ (0.128) ɒ (0.103) a (0.045) ə (0.021) | l (0.754) l̪ (0.196) lː (0.018) ʁ (0.007) ʀ (0.006) | u (0.233) ɨ (0.218) uː (0.104) ɤ (0.070) ɪ (0.066) | s (0.301) <blk> (0.298) z (0.118) s̪ (0.084) sː (0.046) | ɔ (0.454) ɑ (0.251) <blk> (0.105) ɹ̩ (0.062) uə (0.035) | ɹ (0.867) ɾ (0.067) <blk> (0.024) l̪ (0.018) r (0.015) | s (0.740) z (0.191) s̪ (0.039) zʲ (0.009) sː (0.003)
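The top-k output is a single line, so a little parsing helps when you want to inspect the alternatives per frame. This sketch assumes the layout shown above (candidates written as `phone (probability)`, frames separated by ` | `); `parse_topk` and `best_path` are hypothetical helper names.

```python
import re

# One candidate looks like "æ (0.577)": a phone, a space, a probability in parentheses.
_CANDIDATE = re.compile(r"(\S+) \(([0-9.]+)\)")


def parse_topk(line):
    """Parse one line of --topk output into a list of frames.

    Each frame is a list of (phone, probability) pairs, most probable first.
    """
    frames = []
    for frame_text in line.split(" | "):
        candidates = [(phone, float(prob)) for phone, prob in _CANDIDATE.findall(frame_text)]
        frames.append(candidates)
    return frames


def best_path(frames, blank="<blk>"):
    """Take the top candidate of each frame, dropping blanks."""
    return [frame[0][0] for frame in frames if frame and frame[0][0] != blank]
```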
You can tell the model to emit more or fewer phones by changing the `--emit` (or `-e`) argument.
# default emit is 1.0
python -m allosaurus.run -i sample.wav
æ l u s ɔ ɹ s
# emit more phones when emit > 1
python -m allosaurus.run -e 1.2 -i sample.wav
æ l u s f h ɔ ɹ s
# emit fewer phones when emit < 1
python -m allosaurus.run -e 0.8 -i sample.wav
æ l u ɹ s
The default phone inventory might not be the inventory you would like to use, so we provide several commands here for you to customize your own inventory.
As mentioned above, you can check your current (default) inventory with the following command.
python -m allosaurus.bin.list_phone --lang <language name>
The current phone inventory can be dumped into a file:
# dump the phone file
python -m allosaurus.bin.write_phone --lang <language name> --output <a path to save this file>
If you take a look at the file, it is just a simple format where each line represents a single phone. For example, the following one is the English file
a
aː
b
d
...
You can edit this file to add or delete IPA phones as you like. Each line should contain exactly one IPA phone without any spaces. Sorting the phones may make later debugging easier, but it is not required.
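A quick check of the edited file can catch formatting slips before it is registered. The sketch below is a hypothetical helper (`normalize_inventory` is not an allosaurus function): it enforces the one-phone-per-line rule described above, drops duplicates and blank lines, and sorts the phones by Unicode code point, which is just one convenient ordering.

```python
def normalize_inventory(lines):
    """Return a sorted, de-duplicated list of phones, one per input line.

    Raises ValueError if a line contains whitespace inside the phone,
    since each line should hold exactly one IPA phone.
    """
    phones = set()
    for line_number, line in enumerate(lines, start=1):
        phone = line.strip()
        if not phone:
            continue  # skip blank lines
        if any(ch.isspace() for ch in phone):
            raise ValueError(f"line {line_number}: more than one token: {phone!r}")
        phones.add(phone)
    return sorted(phones)
```

The function accepts any iterable of lines, so an open file object can be passed directly.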
Next, update your model's inventory by the following command
python -m allosaurus.bin.update_phone --lang <language name> --input <the file you customized>
The file is now registered in your model. Run the list_phone command again and you should see that it is using your updated inventory.
python -m allosaurus.bin.list_phone --lang <language name>
If you run inference again, the results should also reflect your updated inventory.
Even after your update, you can easily switch back to the original inventory. In this case, your updated file will be deleted.
python -m allosaurus.bin.restore_phone --lang <language name>
You can also change the results by adjusting the prior probability for each phone. This can help you reduce the unwanted phones or increase the wanted phones.
For example, in the sample file, we get the output
æ l u s ɔ ɹ s
Suppose you think the first phone is wrong and would like to reduce its probability. You can create a new file `prior.txt` as follows:
æ -10.0
The file can contain multiple lines, one per phone. The first field is the target phone and the second field is a log-scale score that adjusts its probability. A positive score boosts the phone's prediction; a negative score suppresses it. In this case, we get a new result:
python -m allosaurus.run -i=sample.wav --lang=eng --prior=prior.txt
ɛ l u s ɔ ɹ s
Here `æ` is suppressed and another vowel, `ɛ`, replaces it.
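If you adjust several phones at once, generating the prior file from a dictionary avoids hand-editing mistakes. This is a minimal sketch: `format_prior` is a hypothetical helper, and it assumes the two-field, space-separated layout described above with one phone per line.

```python
def format_prior(adjustments):
    """Render {phone: log_score} as prior-file text, one 'phone score' pair per line."""
    lines = []
    for phone, score in adjustments.items():
        if not phone or any(ch.isspace() for ch in phone):
            raise ValueError(f"invalid phone: {phone!r}")
        lines.append(f"{phone} {float(score)}")
    return "\n".join(lines) + "\n"


# Suppress 'æ' strongly; the value mirrors the example above.
prior_text = format_prior({"æ": -10.0})
```

The resulting text can then be written to `prior.txt` with UTF-8 encoding and passed to `--prior`.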
Another application of the prior is to change the total number of output phones. You might want more or fewer phones in the output.
In this case, you can change the score for `<blk>`, which corresponds to the silence phone.
A positive `<blk>` score adds more silence and therefore decreases the number of output phones; similarly, a negative `<blk>` score increases it. The following example illustrates this.
# prior.txt contains the line: <blk> 1.0
python -m allosaurus.run -i=sample.wav --lang=eng --prior=prior.txt
æ l u ɔ ɹ s
# prior.txt contains the line: <blk> -1.0
python -m allosaurus.run -i=sample.wav --lang=eng --prior=prior.txt
æ l u s f ɔ ɹ s
The first example removes one phone and the second adds one.
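One way to picture why a `<blk>` score changes the output length is the toy model below. It assumes the score is simply added to each phone's log-probability before the best candidate in a frame is chosen; this is an illustration of the idea, not a description of the actual allosaurus implementation. The probabilities are taken from the fourth frame of the `--topk` example earlier (which was run without `--lang=eng`), and `top_after_prior` is a hypothetical name.

```python
import math


def top_after_prior(candidates, prior):
    """Return the best phone after adding log-scores from `prior`.

    `candidates` maps phone -> probability for one frame;
    `prior` maps phone -> additive log-score (missing phones get 0).
    """
    adjusted = {
        phone: math.log(prob) + prior.get(phone, 0.0)
        for phone, prob in candidates.items()
    }
    return max(adjusted, key=adjusted.get)


# A frame where 's' only barely beats the blank.
frame = {"s": 0.301, "<blk>": 0.298, "z": 0.118}
```

In this toy model, a score of 1.0 on `<blk>` is enough for the blank to overtake `s` in that frame, which is in line with the `s` disappearing in the first example above.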
We notice that the pretrained models might not be accurate enough for some languages, so we also provide a fine-tuning tool that lets users improve a model by adapting it to their own data. Currently, fine-tuning is limited to one language at a time.
To fine-tune on your data, you need to prepare audio files and their transcriptions.
First, create a data directory (the name can be arbitrary). Inside it, create a `train` directory and a `validate` directory.
The `train` directory contains your training set, and the `validate` directory contains the validation set.
Each directory should contain the following two files:
- `wave`: a file associating each utterance with its corresponding audio
- `text`: a file associating each utterance with its phones
`wave` is a txt file mapping each utterance to your wav files. Each line should be prepared as follows:
utt_id /path/to/your/audio.wav
Here `utt_id` denotes the utterance id; it can be an arbitrary string as long as it is unique in your dataset.
`audio.wav` is your wav file as described above. It should be a mono-channel wav file, but the sampling rate can be arbitrary (the tool automatically resamples if necessary).
The delimiter is a space.
To get the best fine-tuning results, each audio file should not be very long. We recommend keeping each utterance shorter than 10 seconds.
`text` is another txt file mapping each utterance to your transcription. Each line should be prepared as follows:
utt_id phone1 phone2 ...
Here `utt_id` is again the utterance id and should match the corresponding wav file.
The phone sequence that follows the utterance id is your phonetic transcription of the wav file.
The phones should be restricted to the phone inventory of your target language.
Please make sure all your phones are already registered in your target language by checking with the `list_phone` command.
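Mismatched ids and out-of-inventory phones are easy to introduce when the two files are prepared by hand, so a consistency check before feature extraction can save time. The sketch below is a hypothetical helper, not part of allosaurus. It assumes the space-delimited formats described above, and takes the inventory as a plain Python collection of phones that you supply, for example copied from the `list_phone` output.

```python
def read_mapping(lines):
    """Parse 'utt_id rest-of-line' records into a dict, rejecting duplicate ids."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        utt_id, _, rest = line.partition(" ")
        if utt_id in mapping:
            raise ValueError(f"duplicate utterance id: {utt_id}")
        mapping[utt_id] = rest.strip()
    return mapping


def check_dataset(wave_lines, text_lines, inventory):
    """Return a list of human-readable problems; an empty list means none were found."""
    waves = read_mapping(wave_lines)
    texts = read_mapping(text_lines)
    problems = []
    # Every utterance id must appear in both files.
    for utt_id in sorted(set(waves) ^ set(texts)):
        problems.append(f"{utt_id}: present in only one of wave/text")
    # Every transcribed phone must belong to the target inventory.
    for utt_id, transcription in texts.items():
        unknown = [p for p in transcription.split() if p not in inventory]
        if unknown:
            problems.append(f"{utt_id}: phones not in inventory: {' '.join(unknown)}")
    return problems
```

Run it once for the `train` directory and once for the `validate` directory, passing the lines of each directory's `wave` and `text` files.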
Next, we extract features from both the `wave` file and the `text` file.
We assume that you have already prepared the `wave` file and the `text` file in BOTH the `train` directory and the `validate` directory.
To prepare the audio features, run the following command on both your `train` directory and your `validate` directory.
# command to prepare audio features
python -m allosaurus.bin.prep_feat --model=some_pretrained_model --path=/path/to/your/directory (train or validate)
`path` should point to the train or the validate directory, and `model` should point to your target pretrained model. If unspecified, the latest model is used.
This generates three files: `feat.scp`, `feat.ark` and `shape`.
- The first is a file indexing each utterance to an offset in the second file.
- The second is a binary file containing all audio features.
- The third contains the feature dimension information.
If you are curious, the `scp` and `ark` formats are standard file formats used in Kaldi.
To prepare the text features, run the following command, again on both your `train` directory and your `validate` directory.
# command to prepare token
python -m allosaurus.bin.prep_token --model=<some_pretrained_model> --lang=<your_target_language_id> --path=/path/to/your/directory (train or validate)
`path` and `model` should be the same as in the previous command. `lang` is the 3-character ISO language id of this dataset.
Note that you should have already verified that the phone inventory of this language id contains all the phones in your transcriptions.
Otherwise, the extraction here might fail.
This command generates a file called `token`, which maps each utterance to its phone id sequence.
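Since both preparation commands must be run on both directories, it can help to generate the four invocations in one place. The sketch below only builds the argument lists, using the flags shown above; `build_prep_commands` is a hypothetical helper, and the directory, model and language values in the comment are placeholders.

```python
import sys
from pathlib import Path


def build_prep_commands(data_dir, model, lang):
    """Return argv lists for prep_feat and prep_token on train and validate."""
    commands = []
    for split in ("train", "validate"):
        split_dir = str(Path(data_dir) / split)
        commands.append([sys.executable, "-m", "allosaurus.bin.prep_feat",
                         f"--model={model}", f"--path={split_dir}"])
        commands.append([sys.executable, "-m", "allosaurus.bin.prep_token",
                         f"--model={model}", f"--lang={lang}", f"--path={split_dir}"])
    return commands


# To execute them (requires allosaurus to be installed):
#   import subprocess
#   for command in build_prep_commands("data", "uni2005", "eng"):
#       subprocess.run(command, check=True)
```

Using `sys.executable` keeps the commands in the same Python environment as the calling script.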
Next, we can start fine-tuning our model with the dataset we just prepared. The fine-tuning command is very simple.
# command to fine_tune your data
python -m allosaurus.bin.adapt_model --pretrained_model=<pretrained_model> --new_model=<your_new_model> --path=/path/to/your/data/directory --lang=<your_target_language_id> --device_id=<device_id> --epoch=<epoch>
There are a couple of other optional arguments available, but here we describe only the required ones.
- `pretrained_model` should be the same model you specified before in `prep_token` and `prep_feat`.
- `new_model` can be an arbitrary model name (it might be easier to manage if you give each model the same format as the pretrained models, i.e. YYMMDD).
- `path` should point to the parent directory of your `train` and `validate` directories.
- `lang` is the language id you specified in `prep_token`.
- `device_id` is the GPU id for fine-tuning; if you do not have a GPU, use -1 as `device_id`. Multiple GPUs are not supported.
- `epoch` is the number of training epochs.
During training, information such as the loss and phone error rate is shown for both the training set and the validation set.
After each epoch, the model is evaluated on the validation set, and a checkpoint is saved if its validation phone error rate is better than the previous ones.
After the specified number of epochs has finished, the fine-tuning process ends and the new model should be available.
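The phone error rate is not defined in this document. A common definition is the edit distance between the reference and recognized phone sequences divided by the reference length, and the sketch below uses that definition as an assumption; it may not match the tool's exact computation. `edit_distance` and `phone_error_rate` are hypothetical helper names.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences of phones."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_phone in enumerate(reference, start=1):
        current = [i]
        for j, hyp_phone in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_phone != hyp_phone)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def phone_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both functions take lists of phones, for example the result of calling `.split()` on a recognized phone string, which makes it possible to compare a fine-tuned model against a pretrained one on your own held-out transcriptions.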
After training, the new model should appear in your model list. Use the `list_model` command to check that it is available.
# command to check all your models
python -m allosaurus.bin.list_model
If it is available, the new model can be used in the same way as any other pretrained model. Just run inference with your new model.
python -m allosaurus.run --lang <language id> --model <your new model> --device_id <gpu_id> -i <audio>
This work uses parts of the following code bases and inventories. In particular, we heavily used AlloVera and Phoible to build this model's phone inventory.
- AlloVera: For pretraining the model with correct allophone mappings
- Phoible: For language specific phone inventory
- python_speech_features: For mfcc, filter bank feature extraction
- fairseq: For some utilities
- kaldi_io: For kaldi scp, ark reader and writer
If you have any advice or suggestions, please feel free to email me (xinjianl [at] cs.cmu.edu) or submit an issue in this repo. Thanks!
Please cite the following paper if you use this code in your work.
@inproceedings{li2020universal,
title={Universal phone recognition with a multilingual allophone system},
author={Li, Xinjian and Dalmia, Siddharth and Li, Juncheng and Lee, Matthew and Littell, Patrick and Yao, Jiali and Anastasopoulos, Antonios and Mortensen, David R and Neubig, Graham and Black, Alan W and Metze, Florian},
booktitle={ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={8249--8253},
year={2020},
organization={IEEE}
}