## Step 0: Prerequisites

To run this notebook, you need to build the decoder binaries and runtime first. Please refer to [README.md](../LanguageModelDecoder/README.md) for more details.

You will need at least **230GB** of free disk space and **100GB** of RAM to run this notebook.
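
Before starting the long-running steps, it can help to confirm that the machine meets these requirements. The following is a minimal sketch, not part of the original notebook: the helper names are hypothetical, the thresholds are taken from the figures above (interpreted as decimal gigabytes), and the RAM check relies on `os.sysconf`, which is only available on Unix-like systems.

```python
import os
import shutil

GB = 10 ** 9  # decimal gigabytes (assumption: the figures above are decimal GB)

def free_disk_gb(path="."):
    """Return free disk space, in GB, on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / GB

def total_ram_gb():
    """Return total physical RAM in GB, or None if the platform does not expose it."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / GB
    except (AttributeError, ValueError, OSError):
        return None

def check_resources(path=".", disk_gb=230, ram_gb=100):
    """Return a list of warnings; an empty list means every available check passed."""
    warnings = []
    disk = free_disk_gb(path)
    if disk < disk_gb:
        warnings.append(f"only {disk:.0f} GB free on disk, {disk_gb} GB recommended")
    ram = total_ram_gb()
    if ram is not None and ram < ram_gb:
        warnings.append(f"only {ram:.0f} GB of RAM, {ram_gb} GB recommended")
    return warnings

for warning in check_resources():
    print("WARNING:", warning)
```

An empty result only means the thresholds are met at the time of the check; disk usage grows as the corpus is downloaded and merged.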


## Step 1: Prepare language model training corpus

The training corpus should be a text file with one sentence per line. Here we use [OpenWebText2](https://openwebtext2.readthedocs.io/en/latest/) as an example.


In [1]:
%%sh
# Download the OpenWebText2 corpus

CORPUS_DIR=../../data/lm_corpus
mkdir -p $CORPUS_DIR
# If the download URL does not work, you can find the latest one at https://openwebtext2.readthedocs.io/en/latest/
wget https://mystic.the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar -O $CORPUS_DIR/openwebtext2.jsonl.zst.tar
cd $CORPUS_DIR
tar -xvf openwebtext2.jsonl.zst.tar


--2023-11-14 07:32:29--  https://mystic.the-eye.eu/public/AI/pile_preliminary_components/openwebtext2.jsonl.zst.tar
Resolving mystic.the-eye.eu (mystic.the-eye.eu)... 62.6.154.15
Connecting to mystic.the-eye.eu (mystic.the-eye.eu)|62.6.154.15|:443... failed: Connection timed out.
Retrying.

Process is interrupted.

If the mirror is unreachable, as in the log above, OpenWebText2 data hosted on the Hugging Face Hub can be loaded with the `datasets` library instead. Note that, according to the output of `print(dataset)` further down, this copy provides only `validation` and `test` splits.

In [2]:
from datasets import load_dataset

dataset = load_dataset("suolyer/pile_openwebtext2")


In [6]:
print(dataset)

DatasetDict({
    validation: Dataset({
        features: ['text', 'meta'],
        num_rows: 33400
    })
    test: Dataset({
        features: ['text', 'meta'],
        num_rows: 32925
    })
})
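
The merging code later in this step reads `.jsonl.zst` archives, so data loaded through `datasets` needs a different route into a plain text file. A minimal sketch, assuming each record is a dict with a `text` field (as the `features` list above suggests); the helper name `write_texts` and the output path in the usage comment are hypothetical.

```python
import os
import tempfile

def write_texts(records, out_path, text_key="text"):
    """Write the `text_key` field of each record to `out_path`, each followed by a newline."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(record[text_key])
            out.write("\n")
            count += 1
    return count

# Self-contained demo with in-memory records standing in for a dataset split.
demo_records = [
    {"text": "first document", "meta": {}},
    {"text": "second document", "meta": {}},
]
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    written = write_texts(demo_records, demo_path)
    with open(demo_path, encoding="utf-8") as fh:
        demo_content = fh.read()

print(written)

# With the dataset loaded above, usage might look like (hypothetical output path):
# write_texts(dataset["validation"], "lm_corpus/openwebtext2_validation.txt")
```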


Now we need to concatenate the text of all the extracted `.jsonl.zst` archives into one big plain text file.
Make sure you have the Python libraries `zstandard`, `jsonlines`, and `tqdm` installed.

In [None]:
import os
import glob
import zstandard
import json
import jsonlines
import io
import datetime
from tqdm.notebook import tqdm

def json_serial(obj):
    """JSON serializer for objects not serializable by default json code"""

    if isinstance(obj, (datetime.datetime,)):
        return obj.isoformat()
    raise TypeError("Type %s not serializable" % type(obj))

# Modified version of lm_dataformat Archive for single file.
class Archive:
    def __init__(self, file_path, compression_level=3):
        self.file_path = file_path
        dir_name = os.path.dirname(file_path)
        if dir_name:
            os.makedirs(dir_name, exist_ok=True)
        self.fh = open(self.file_path, 'wb')
        self.cctx = zstandard.ZstdCompressor(level=compression_level)
        self.compressor = self.cctx.stream_writer(self.fh)

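    def add_data(self, data, meta=None):
        # Use None as the default to avoid sharing one mutable dict across calls.
        meta = {} if meta is None else meta
        self.compressor.write(json.dumps({'text': data, 'meta': meta}, default=json_serial).encode('UTF-8') + b'\n')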

    def commit(self):
        self.compressor.flush(zstandard.FLUSH_FRAME)
        self.fh.flush()
        self.fh.close()

# Modified version of lm_dataformat Reader with self.fh set, allowing peeking for tqdm.
class Reader:
    def __init__(self):
        pass

    def read_jsonl(self, file, get_meta=False, autojoin_paragraphs=True, para_joiner='\n\n'):
        with open(file, 'rb') as fh:
            self.fh = fh
            cctx = zstandard.ZstdDecompressor()
            reader = io.BufferedReader(cctx.stream_reader(fh))
            rdr = jsonlines.Reader(reader)
            for ob in rdr:
                # naive jsonl where each object is just the string itself, with no meta. For legacy compatibility.
                if isinstance(ob, str):
                    assert not get_meta
                    yield ob
                    continue

                text = ob['text']

                if autojoin_paragraphs and isinstance(text, list):
                    text = para_joiner.join(text)

                if get_meta:
                    yield text, (ob['meta'] if 'meta' in ob else {})
                else:
                    yield text

lm_corpus_dir = 'lm_corpus'
merged_text_path = os.path.join(lm_corpus_dir, 'openwebtext2.txt')

files = sorted(glob.glob(os.path.join(lm_corpus_dir, "*jsonl.zst")))
# Use a context manager so the merged file is flushed and closed when the loop ends.
with open(merged_text_path, 'w', encoding='utf-8') as output:
    for file_path in tqdm(files, dynamic_ncols=True):
        print(file_path)
        reader = Reader()
        for document in tqdm(reader.read_jsonl(file_path)):
            # Terminate each document with a newline.
            output.write(document)
            output.write('\n')
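
Since the merge can run for a long time, a quick sanity check on the result may be worthwhile before moving on. The sketch below streams a text file, counts its lines, and keeps the first few for inspection; the helper name `summarize_text_file` is hypothetical, and the demo runs on a throwaway file rather than the real corpus.

```python
import os
import tempfile

def summarize_text_file(path, preview=3):
    """Return (line_count, first `preview` lines) without loading the file into memory."""
    head = []
    count = 0
    with open(path, "r", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            if count < preview:
                head.append(line.rstrip("\n"))
            count += 1
    return count, head

# Self-contained demo on a temporary file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.txt")
    with open(demo_path, "w", encoding="utf-8") as fh:
        fh.write("first line\nsecond line\nthird line\nfourth line\n")
    line_count, head = summarize_text_file(demo_path, preview=2)

print(line_count, head)

# For the merged corpus, usage might look like:
# summarize_text_file('lm_corpus/openwebtext2.txt')
```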

## Step 2: Download CMU dictionary

In [None]:
%%bash

wget https://github.com/Alexir/CMUdict/raw/master/cmudict-0.7b -O lm_corpus/cmudict.txt
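
To get a feel for what the downloaded dictionary contains, the sketch below parses CMUdict-style lines into a word-to-pronunciations mapping. It assumes the usual layout of the file: comment lines starting with `;;;`, one entry per line with the word followed by its phonemes, and alternate pronunciations marked with a suffix such as `(1)`. The function name and the sample entries are illustrative and not taken from the notebook.

```python
import re
from collections import defaultdict

# Variant marker such as "(1)" at the end of a word, e.g. "READ(1)".
_VARIANT = re.compile(r"\(\d+\)$")

def parse_cmudict(lines):
    """Map each word to a list of pronunciations, each a list of phoneme strings."""
    lexicon = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue  # skip blank lines and comments
        word, *phones = line.split()
        lexicon[_VARIANT.sub("", word)].append(phones)
    return dict(lexicon)

# Illustrative sample in the assumed format.
sample = [
    ";;; comment line",
    "HELLO  HH AH0 L OW1",
    "READ  R EH1 D",
    "READ(1)  R IY1 D",
]
lexicon = parse_cmudict(sample)
print(lexicon["READ"])
```

When reading the real file, it may be necessary to open it with `encoding="latin-1"`, since it is not guaranteed to be valid UTF-8.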

## Step 3: Build language model

Build a 3-gram language model based on the OpenWebText2 corpus.
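
As a rough illustration of what a 3-gram model captures, the sketch below counts trigrams in a toy corpus and computes unsmoothed maximum-likelihood probabilities. This is not what `local/build_lm.sh` does internally: the real build also prunes the model (see `prune_threshold` in the cell below) and presumably applies smoothing. The function names and the `<s>`/`</s>` boundary symbols are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of `tokens` as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over whitespace-tokenized sentences."""
    trigram_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        trigram_counts.update(ngrams(tokens, 3))
        # Every trigram's context is a bigram that does not include the final token.
        context_counts.update(ngrams(tokens[:-1], 2))
    return trigram_counts, context_counts

def trigram_prob(trigram_counts, context_counts, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 when the context was never seen."""
    context = context_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / context if context else 0.0

tri, ctx = train_trigram_counts(["the cat sat", "the cat ran", "the dog sat"])
print(trigram_prob(tri, ctx, "the", "cat", "sat"))
```

In this toy corpus, "the cat" is followed once by "sat" and once by "ran", so the estimate for "sat" after "the cat" is one half.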

In [3]:
%%bash

set -xe

LM_ROOT=../LanguageModelDecoder/examples/speech/s0/
LM_CORPUS_DIR=$PWD/lm_corpus
LM_MODEL_DIR=$PWD/lm_model

cd $LM_ROOT
echo $PWD
. path.sh

# First step is formatting the text corpus.
mkdir -p $LM_MODEL_DIR/data/local/lm_data
python local/format_lm_data.py \
    --input_text $LM_CORPUS_DIR/openwebtext2.txt \
    --output_text $LM_MODEL_DIR/data/local/lm_data/corpus.txt \
    --dict $LM_CORPUS_DIR/cmudict.txt \
    --unk

# Build the LM
dict_type=phn
lm_order=3
prune_threshold=1e-9
local/build_lm.sh \
    $LM_MODEL_DIR/data/local/lm_data/corpus.txt \
    $LM_MODEL_DIR/data/local/lm \
    $dict_type \
    $lm_order \
    $prune_threshold \
    $LM_CORPUS_DIR/cmudict.txt

# Optionally, if you have 1TB of RAM, you can build a 5-gram LM
#dict_type=phn
#lm_order=5
#prune_threshold=4e-11
#local/build_lm.sh \
#    $LM_MODEL_DIR/data/local/lm_data/corpus.txt \
#    $LM_MODEL_DIR/data/local/lm \
#    $dict_type \
#    $lm_order \
#    $prune_threshold \
#    $LM_CORPUS_DIR/cmudict.txt

/oak/stanford/groups/shenoy/stfan/code/speechBCI/LanguageModelDecoder/examples/speech/s0
/oak/stanford/groups/shenoy/stfan/code/speechBCI/AnalysisExamples/lm_model/data/local/lm
Prune LM with threshold 1e-9





## Step 4: Build WFST decoder graph

Convert the previous 3-gram language model into a WFST decoder graph.

In [None]:
%%bash

LM_ROOT=../LanguageModelDecoder/examples/speech/s0/
LM_MODEL_DIR=$PWD/lm_model
use_all_phones=1
dict_type=phn
sil_prob=0.9

cd $LM_ROOT
. path.sh

# Prepare L.fst
local/prepare_dict_ctc.sh $LM_MODEL_DIR/data/local/lm $LM_MODEL_DIR/data/local/dict_phn $use_all_phones
tools/fst/ctc_compile_dict_token.sh --dict-type $dict_type --sil-prob $sil_prob \
    $LM_MODEL_DIR/data/local/dict_phn $LM_MODEL_DIR/data/local/lang_phn_tmp $LM_MODEL_DIR/data/lang_phn

# Build TLG decoding graph
tools/fst/make_tlg.sh $LM_MODEL_DIR/data/local/lm $LM_MODEL_DIR/data/lang_phn $LM_MODEL_DIR/data/lang_test
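
In CTC-style TLG graphs, the token (T) transducer typically maps frame-level label sequences to token sequences by merging consecutive repeats and removing blank symbols; the lexicon (L) then maps tokens to words and the grammar (G) scores word sequences. The sketch below shows only that collapsing rule in plain Python. The blank symbol `<blk>` and the function name are hypothetical, and the FST produced by `ctc_compile_dict_token.sh` may differ in its symbols and topology.

```python
from itertools import groupby

def ctc_collapse(frame_labels, blank="<blk>"):
    """Merge consecutive repeated labels, then drop blank symbols."""
    return [label for label, _ in groupby(frame_labels) if label != blank]

# A blank between the two "L" frames keeps them as two separate tokens.
frames = ["<blk>", "HH", "HH", "<blk>", "AH", "AH", "<blk>", "L", "L", "<blk>", "L", "OW"]
collapsed = ctc_collapse(frames)
print(collapsed)
```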

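Now test loading the decoder graph. Make sure you have [NeuralDecoder](../NeuralDecoder) installed before running this. A `ModuleNotFoundError` for `lm_decoder`, as in the output below, most likely means the decoder runtime from Step 0 is not installed in the current Python environment.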

In [2]:
import torch
import lm_decoder

import neuralDecoder.utils.lmDecoderUtils as lmDecoderUtils

ngramDecoder = lmDecoderUtils.build_lm_decoder(
    'lm_model/data/lang_test'
)


ModuleNotFoundError: No module named 'lm_decoder'
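
An import failure like the one above can be detected up front. The following sketch checks whether the top-level modules used in this cell can be found, without actually importing them; the helper name `missing_modules` is hypothetical, and the list of required modules is simply read off the imports above.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

required = ["torch", "lm_decoder", "neuralDecoder"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

If `lm_decoder` is reported as missing, revisiting the build instructions referenced in Step 0 is the natural next step.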