Switch KenLM to trie based language model #1236

Closed
kdavis-mozilla opened this issue Feb 15, 2018 · 10 comments
@kdavis-mozilla
Contributor

No description provided.

@dbanka

dbanka commented Jun 1, 2018

@kdavis-mozilla what would be the benefit of switching to a trie-based language model?

@kdavis-mozilla
Contributor Author

@dbanka Trie-based models can be compressed[1], making the entire footprint smaller. Our current language models can't be compressed.

@kdavis-mozilla
Contributor Author

@dbanka Concretely, our current language model is 1.5 GB; we've made a trie-based model that basically reproduces its quality and is 66 MB.
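
For illustration only, a minimal sketch of the kind of KenLM command that produces such a compressed trie binary (paths are placeholders; -q/-b set the quantization bits and -a the array-pointer compression bits, which is where most of the size reduction comes from; the 66 MB figure above refers to kdavis-mozilla's own model, not to this command):

# Hypothetical sketch: quantize an existing ARPA model (built with lmplz)
# into a compressed trie binary. Paths are placeholders.
!build_binary -a 64 \
              -q 8 \
              -b 8 \
              trie \
              /tmp/lm.arpa \
              /tmp/lm_trie.binary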

mathematiguy pushed a commit to TeHikuMedia/DeepSpeech that referenced this issue Jul 12, 2018
lissyx pushed a commit to lissyx/STT that referenced this issue Jul 20, 2018
lissyx added a commit that referenced this issue Jul 23, 2018
Revert "Fixes #1236 (Switch KenLM to trie based language model)"
lissyx reopened this Jul 23, 2018
@lissyx
Collaborator

lissyx commented Jul 23, 2018

Reopening since we reverted the fixes.

@samgd
Collaborator

samgd commented Aug 16, 2018

The code snippet below builds a pruned, quantized 5-gram language model that is significantly better than the "quick-fix" language model.

The corpus used is described in Section 4, "Language Models", of the original LibriSpeech paper.

With little to no optimisation or hyper-parameter tuning we get a dev-clean WER of ~6.4 on a version of our internal implementation of DS1. You can adjust the order, pruning level, and quantization level to suit your needs :-) (e.g. a 3-gram 8-bit binary trie is <1 GB and has a dev-clean WER of ~6.5; a sketch of that variant follows the example output below).

Note that this code was written in a Jupyter notebook and uses the lmplz and build_binary commands from the kenlm library. The resulting language model is ~1.7GB. I've sent this to @kdavis-mozilla via IRC already.

import gzip
import io
import os

from urllib import request

# Grab corpus.
url = 'http://www.openslr.org/resources/11/librispeech-lm-norm.txt.gz'
data_upper = '/tmp/upper.txt.gz'
request.urlretrieve(url, data_upper)

# Convert to lowercase and cleanup.
data_lower = '/tmp/lower.txt'
with open(data_lower, 'w', encoding='utf-8') as lower:
    with io.TextIOWrapper(io.BufferedReader(gzip.open(data_upper)), encoding='utf8') as upper:
        for line in upper:
            lower.write(line.lower())
os.remove(data_upper)

# Build pruned LM.
lm_path = '/tmp/lm.arpa'
!lmplz --order 5 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm_path} \
       --prune 0 0 0 1

# Quantize and produce trie binary.
binary_path = '/tmp/lm.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm_path} \
              {binary_path} 
os.remove(lm_path)

Example output:

=== 1/5 Counting and sorting n-grams ===
Reading /tmp/lower.txt
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Unigram tokens 803288729 types 973676
=== 2/5 Calculating and sorting adjusted counts ===
Chain sizes: 1:11684112 2:3126698496 3:5862559744 4:9380094976 5:13679306752
Statistics:
1 973676 D1=0.647192 D2=1.04159 D3+=1.3919
2 41161096 D1=0.723617 D2=1.06317 D3+=1.36127
3 207278547 D1=0.804357 D2=1.09256 D3+=1.31993
4 60615302/438095063 D1=0.876863 D2=1.15052 D3+=1.32047
5 42225053/587120377 D1=0.914203 D2=1.27108 D3+=1.35262
Memory estimate for binary LM:
type      MB
probing 7822 assuming -p 1.5
probing 9594 assuming -r models -p 1.5
trie    4304 without quantization
trie    2457 assuming -q 8 -b 8 quantization 
trie    3556 assuming -a 22 array pointer compression
trie    1708 assuming -a 22 -q 8 -b 8 array pointer compression and quantization
=== 3/5 Calculating and sorting initial probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
*******#############################################################################################
=== 4/5 Calculating and writing order-interpolated probabilities ===
Chain sizes: 1:11684112 2:658577536 3:4145570940 4:1454767248 5:1182301484
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
####################################################################################################
=== 5/5 Writing ARPA model ===
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Name:lmplz	VmPeak:31701288 kB	VmRSS:32144 kB	RSSMax:27359364 kB	user:1187.72	sys:465.288	CPU:1653.02	real:2043.65

Reading /tmp/lm.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Identifying n-grams omitted by SRI
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Quantizing
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
Writing trie
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
****************************************************************************************************
SUCCESS
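
For reference, a sketch of the smaller 3-gram variant mentioned above, reusing data_lower from the snippet; the exact pruning thresholds here are an assumption, and the resulting size and WER will depend on your setup:

# Sketch of the 3-gram variant: prune trigram singletons, 8-bit quantization.
# Reuses data_lower from the snippet above; pruning levels are an assumption.
lm3_path = '/tmp/lm3.arpa'
!lmplz --order 3 \
       --temp_prefix /tmp/ \
       --memory 50% \
       --text {data_lower} \
       --arpa {lm3_path} \
       --prune 0 0 1

binary3_path = '/tmp/lm3.binary'
!build_binary -a 255 \
              -q 8 \
              trie \
              {lm3_path} \
              {binary3_path}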

@pvanickova

Are there going to be tools to extend the new language model with custom corpus data or individual phrases?

@kdavis-mozilla
Contributor Author

@pvanickova You'll be able to use all the features of KenLM to extend the language model.

@lissyx
Collaborator

lissyx commented Oct 4, 2018

@pvanickova You can do that, following data/lm/README.md and augmenting with your own data.

@pvanickova

@lissyx perfect, thanks - so basically rebuilding the language model from scratch using the librivox corpus + my own corpus
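
For anyone following along, a minimal sketch of that approach, assuming your own corpus is a plain text file (hypothetical path /tmp/my_phrases.txt) with one sentence per line; it simply concatenates the lowercased corpora and reruns the same lmplz/build_binary steps as in the snippet above:

# Hypothetical example: combine the lowercased LibriSpeech text with a
# custom corpus (placeholder path), then rebuild the LM as in the snippet above.
custom_corpus = '/tmp/my_phrases.txt'  # placeholder: your own sentences
combined = '/tmp/combined.txt'

with open(combined, 'w', encoding='utf-8') as out:
    for path in ('/tmp/lower.txt', custom_corpus):
        with open(path, encoding='utf-8') as f:
            for line in f:
                out.write(line.lower())

!lmplz --order 5 --temp_prefix /tmp/ --memory 50% \
       --text {combined} --arpa /tmp/lm_custom.arpa --prune 0 0 0 1
!build_binary -a 255 -q 8 trie /tmp/lm_custom.arpa /tmp/lm_custom.binary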

@lock

lock bot commented Mar 22, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators Mar 22, 2019