Switch KenLM to trie based language model #1236

Closed
kdavis-mozilla opened this issue Feb 15, 2018 · 10 comments
@kdavis-mozilla
Contributor

No description provided.

@dbanka

dbanka commented Jun 1, 2018

@kdavis-mozilla what would be the benefit of switching to a trie-based language model?

@kdavis-mozilla
Contributor Author

@dbanka Trie-based models can be compressed[1], making the entire footprint smaller. Our current language models can't be compressed.

@kdavis-mozilla
Contributor Author

@dbanka Concretely, our current language model is 1.5 GB; we've made a trie-based model that basically reproduces its quality and is 66 MB.
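
For illustration only, a minimal sketch of the kind of KenLM command that produces such a compressed trie binary (paths are placeholders; -q/-b set the quantization bits and -a the array-pointer compression bits, which is where most of the size reduction comes from; the 66 MB figure above refers to kdavis-mozilla's own model, not to this command):

# Hypothetical sketch: quantize an existing ARPA model (built with lmplz)
# into a compressed trie binary. Paths are placeholders.
!build_binary -a 64 \
              -q 8 \
              -b 8 \
              trie \
              /tmp/lm.arpa \
              /tmp/lm_trie.binary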

mathematiguy pushed a commit to TeHikuMedia/DeepSpeech that referenced this issue Jul 12, 2018
lissyx pushed a commit to lissyx/STT that referenced this issue Jul 20, 2018
lissyx added a commit that referenced this issue Jul 23, 2018
Revert "Fixes #1236 (Switch KenLM to trie based language model)"
lissyx reopened this Jul 23, 2018
@lissyx
Collaborator

lissyx commented Jul 23, 2018

Reopening since we reverted the fixes.

@samgd
Collaborator

samgd commented Aug 16, 2018

The code snippet below builds a pruned, quantized 5-gram language model that is significantly better than the "quick-fix" language model.

The corpus used is described in Section 4, "Language Models", of the original LibriSpeech paper.

With little to no optimisation or hyper-parameter tuning we get a dev-clean WER of ~6.4 on a version of our internal implementation of DS1. You can adjust the order, pruning level, and quantization level to suit your needs :-) (e.g. a 3-gram 8-bit binary trie is <1 GB and has a dev-clean WER of ~6.5; a sketch of that variant follows the example output below).

Note that this code was written in a Jupyter notebook and uses the lmplz and build_binary commands from the kenlm library. The resulting language model is ~1.7GB. I've sent this to @kdavis-mozilla via IRC already.

import gzip
import io
import os

from urllib import request

# Grab corpus.
url = 'http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz'
data_upper = '/tmp/upper.txt.gz'
request.urlretrieve(url, data_upper)

# Convert to lowercase and cleanup.
data_lower = '/tmp/lower.txt'
with open(data_lower, 'w', encoding='utf-8') as lower:
    with io.TextIOWrapper(io.BufferedReader(gzip.open(data_upper)), encoding='utf8') as upper:
        for line in upper:
            lower.write(line.lower())
os.remove(data_upper)

# Build pruned LM.
lm_path = '/tmp/lm.arpa'
!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1

# Quantize and produce trie binary.
binary_path = '/tmp/lm.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path} 
os.remove(lm_path)

Example output:

=== 1/5 Counting and sorting n-grams ===
Reading /tmp/lower.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 803288729 types 973676
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11684112 2:3126698496 3:5862559744 4:9380094976 5:13679306752
Statistics:
1 973676 D1=0.647192 D2=1.04159 D3+=1.3919
2 41161096 D1=0.723617 D2=1.06317 D3+=1.36127
3 207278547 D1=0.804357 D2=1.09256 D3+=1.31993
4 60615302/438095063 D1=0.876863 D2=1.15052 D3+=1.32047
5 42225053/587120377 D1=0.914203 D2=1.27108 D3+=1.35262
Memory estimate for binary LM:
type      MB
probing 7822 assuming -p 1.5
probing 9594 assuming -r models -p 1.5
trie    4304 without quantization
trie    2457 assuming -q 8 -b 8 quantization 
trie    3556 assuming -a 22 array pointer compression
trie    1708 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*******#############################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:31701288 kB	VmRSS:32144 kB	RSSMax:27359364 kB	user:1187.72	sys:465.288	CPU:1653.02	real:2043.65

Reading /tmp/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
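
For reference, a sketch of the smaller 3-gram variant mentioned above, reusing data_lower from the snippet; the exact pruning thresholds here are an assumption, and the resulting size and WER will depend on your setup:

# Sketch of the 3-gram variant: prune trigram singletons, 8-bit quantization.
# Reuses data_lower from the snippet above; pruning levels are an assumption.
lm3_path = '/tmp/lm3.arpa'
!lmplz --order 3 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm3_path} \
       --prune 0 0 1

binary3_path = '/tmp/lm3.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm3_path} \
              {binary3_path}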

@pvanickova

Are there going to be tools to extend the new language model with custom corpus data or individual phrases?

@kdavis-mozilla
Contributor Author

@pvanickova You'll be able to use all the features of KenLM to extend the language model.

@lissyx
Collaborator

lissyx commented Oct 4, 2018

@pvanickova You can do that, following data/lm/README.md and augmenting with your own data.

@pvanickova

@lissyx perfect, thanks - so basically rebuilding the language model from scratch using the librivox corpus + my own corpus
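
For anyone following along, a minimal sketch of that approach, assuming your own corpus is a plain text file (hypothetical path /tmp/my_phrases.txt) with one sentence per line; it simply concatenates the lowercased corpora and reruns the same lmplz/build_binary steps as in the snippet above:

# Hypothetical example: combine the lowercased LibriSpeech text with a
# custom corpus (placeholder path), then rebuild the LM as in the snippet above.
custom_corpus = '/tmp/my_phrases.txt'  # placeholder: your own sentences
combined = '/tmp/combined.txt'

with open(combined, 'w', encoding='utf-8') as out:
    for path in ('/tmp/lower.txt', custom_corpus):
        with open(path, encoding='utf-8') as f:
            for line in f:
                out.write(line.lower())

!lmplz --order 5 --temp_prefix /tmp/ --memory 50% \
       --text {combined} --arpa /tmp/lm_custom.arpa --prune 0 0 0 1
!build_binary -a 255 -q 8 trie /tmp/lm_custom.arpa /tmp/lm_custom.binary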

@lock

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019