# Text Normalization

Text normalization converts text from its written form into its verbalized form. It is used as a preprocessing step for Automatic Speech Recognition (ASR) training transcripts. For German, we rely primarily on the NeMo text normalization [library](https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/de).
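
To make "written form" versus "verbalized form" concrete, here is a toy sketch that spells out small integers in German. It is purely illustrative: `SMALL_NUMBERS` and `toy_verbalize` are hypothetical names, and the lookup table is a stand-in for NeMo's grammars, which are finite-state transducers covering far more cases.

```python
import re

# Hypothetical lookup table for illustration only; NeMo's German grammars
# cover far more than the integers 0 to 12.
SMALL_NUMBERS = {
    "0": "null", "1": "eins", "2": "zwei", "3": "drei", "4": "vier",
    "5": "fünf", "6": "sechs", "7": "sieben", "8": "acht", "9": "neun",
    "10": "zehn", "11": "elf", "12": "zwölf",
}


def toy_verbalize(text: str) -> str:
    """Replace standalone integers found in SMALL_NUMBERS with German words."""
    def replace(match: re.Match) -> str:
        token = match.group(0)
        # Numbers outside the table are left untouched.
        return SMALL_NUMBERS.get(token, token)

    return re.sub(r"\b\d+\b", replace, text)


print(toy_verbalize("Ich habe 3 Katzen und 12 Fische."))
# prints: Ich habe drei Katzen und zwölf Fische.
```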

In this tutorial, we use NeMo to normalize the Mozilla Common Voice (MCV), Multilingual LibriSpeech (MLS), and VoxPopuli datasets. The following code reads a manifest file, normalizes the `text_original` field of each entry, stores the result in a `text` field, and writes the entries to a new, normalized manifest file.
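
The manifest round trip can be sketched on a tiny example before running it at scale. The block below is a minimal sketch under assumptions: `toy_normalize` is a hypothetical stand-in for the NeMo normalizer, and the `audio_filepath` and `duration` fields are assumed for illustration; only `text_original` and `text` are used by the tutorial code.

```python
import json
import os
import tempfile


def toy_normalize(text: str) -> str:
    # Hypothetical stand-in for normalizer.normalize(); it only expands "&".
    return text.replace("&", "und")


# Sample entries; "audio_filepath" and "duration" are assumed fields.
entries = [
    {"audio_filepath": "clip_0001.wav", "duration": 2.1, "text_original": "Kaffee & Kuchen"},
    {"audio_filepath": "clip_0002.wav", "duration": 1.4, "text_original": "Guten Morgen"},
]

workdir = tempfile.mkdtemp()
input_manifest = os.path.join(workdir, "toy_manifest.json")
output_manifest = os.path.join(workdir, "toy_manifest_normalized.json")

# Write the input manifest: one JSON object per line.
with open(input_manifest, "w", encoding="utf-8") as fp:
    for entry in entries:
        fp.write(json.dumps(entry, ensure_ascii=False) + "\n")

# Read each entry, add the normalized "text" field, and write it back out.
with open(input_manifest, encoding="utf-8") as fin, \
        open(output_manifest, "w", encoding="utf-8") as fout:
    for line in fin:
        entry = json.loads(line)
        entry["text"] = toy_normalize(entry["text_original"])
        fout.write(json.dumps(entry, ensure_ascii=False) + "\n")

with open(output_manifest, encoding="utf-8") as fp:
    normalized = [json.loads(line) for line in fp]

print(normalized[0]["text"])
# prints: Kaffee und Kuchen
```

Keeping `text_original` alongside `text` makes it easy to compare the two fields later and to rerun normalization without returning to the raw dataset.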

Note: This tutorial should be run within a NeMo Docker container with the following command:
```bash
docker run --gpus=all --net=host --rm -it -v $PWD:/myworkspace nvcr.io/nvidia/nemo:22.08 bash
```
Then start JupyterLab from within the NeMo container.

**Note: this process takes a long time. On VoxPopuli, each additional 10k samples takes roughly 1 more hour on 80 CPU cores.**
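
For planning purposes, the figure above can be turned into a rough estimate. This is a back-of-the-envelope sketch that assumes runtime grows linearly with the number of samples at about 10k samples per hour; `estimate_hours` is a hypothetical helper, and actual throughput depends on the dataset and hardware.

```python
def estimate_hours(num_samples: int, samples_per_hour: float = 10_000) -> float:
    """Linear runtime estimate; the default rate is the figure quoted above."""
    return num_samples / samples_per_hour


# Example: a manifest with 250,000 utterances.
print(estimate_hours(250_000))
# prints: 25.0
```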

```python
import json
import multiprocessing
import os
from functools import partial

from tqdm import tqdm

from nemo_text_processing.text_normalization.normalize import Normalizer


def load_jsonl(filepath):
    """Load a JSON Lines manifest, skipping blank lines and `//` comment lines."""
    data = []
    with open(filepath, 'r', encoding='utf-8') as fp:
        for line in fp:
            if line.startswith("//") or line.strip() == '':
                continue
            data.append(json.loads(line))
    return data


def dump_jsonl(filepath, data):
    """Write one JSON object per line, keeping non-ASCII characters such as umlauts."""
    # Explicit UTF-8 is needed because ensure_ascii=False emits raw non-ASCII text.
    with open(filepath, 'w', encoding='utf-8') as fp:
        for datum in data:
            fp.write(json.dumps(datum, ensure_ascii=False))
            fp.write('\n')


def normalize_manifest(input_manifest, output_manifest, normalizer):
    """Normalize every `text_original` field and store the result under `text`."""
    utterances = load_jsonl(input_manifest)
    transcripts = [utt['text_original'] for utt in utterances]

    normalize = partial(normalizer.normalize, verbose=False)
    # The context manager shuts the worker pool down once all results are consumed.
    with multiprocessing.Pool(processes=os.cpu_count()) as pool:
        # imap preserves input order, so results line up with `utterances`.
        normalized_result = tqdm(pool.imap(normalize, transcripts))
        for utt, text in zip(utterances, normalized_result):
            utt['text'] = text
    dump_jsonl(output_manifest, utterances)
```

```python
# Minimal alternative without grammar caching:
# normalizer = Normalizer(input_case="cased", lang='de')
normalizer = Normalizer(
    input_case="cased",
    cache_dir="/tmp",        # compiled grammars (.far files) are stored here
    overwrite_cache=True,    # rebuild the grammars instead of reusing the cache
    lang="de",
)

# To process all three datasets, use: ['mls', 'voxpopuli', 'mcv']
for dataset in ['mcv']:
    for subset in ['train', 'dev', 'test']:
        input_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest.json")
        output_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized.json")
        print("Processing ", input_manifest)
        normalize_manifest(input_manifest, output_manifest, normalizer)
```

```
[NeMo I 2022-05-06 06:46:33 tokenize_and_classify:83] Creating ClassifyFst grammars. This might take some time...
Created /tmp/_cased_de_tn_True_deterministic.far
[NeMo I 2022-05-06 06:46:56 tokenize_and_classify:143] ClassifyFst grammars are saved to /tmp/_cased_de_tn_True_deterministic.far.
Processing  ./data/processed/mcv/mcv_train_manifest.json
9471it [53:20,  2.34it/s]
```
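
Once a normalized manifest has been written, a quick sanity check can flag transcripts that still contain digits and may need manual review. The following is a minimal sketch, not part of NeMo: `find_unnormalized` is a hypothetical helper, and treating any remaining digit as suspicious is an assumption that may not suit every dataset.

```python
import json
import re

DIGIT_PATTERN = re.compile(r"\d")


def find_unnormalized(manifest_lines):
    """Yield (line_number, text) for entries whose `text` still contains digits."""
    for line_number, line in enumerate(manifest_lines, start=1):
        if not line.strip():
            continue
        entry = json.loads(line)
        if DIGIT_PATTERN.search(entry.get("text", "")):
            yield line_number, entry["text"]


# Two sample manifest lines; the second one was left unnormalized.
sample = [
    '{"text_original": "3 Katzen", "text": "drei Katzen"}',
    '{"text_original": "Im Jahr 1999", "text": "Im Jahr 1999"}',
]
leftovers = list(find_unnormalized(sample))
print(leftovers)
# prints: [(2, 'Im Jahr 1999')]
```

Because the function accepts any iterable of lines, an open file handle for a `*_manifest_normalized.json` file can be passed in place of `sample`.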