# Filter, Mix and Train/Test Split

In this notebook we apply a simple filter to remove outlying data points. Then we merge the three datasets into a single one, keeping their train/dev/test splits.

## Filter

We implement a simple filter that removes samples that:
 - are too long (> 20 s), too short (< 0.1 s) or empty.
 - contain characters other than those in the German alphabet, punctuation marks and digits.

More advanced filtering to weed out samples that are considered 'noisy', that is, samples with a very high WER (word error rate) or CER (character error rate) w.r.t. a previously trained German model, is left as an exercise for interested readers.
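As a starting point for that exercise, the following is a minimal sketch of WER/CER-based filtering, not a definitive implementation. It assumes each utterance already carries a transcript predicted by a previously trained model under a hypothetical `pred_text` key; the helper names and the thresholds `max_cer` and `max_wer` are illustrative assumptions and would need tuning.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp, unit='char'):
    """CER (unit='char') or WER (unit='word') of hyp with respect to ref."""
    ref_units = list(ref) if unit == 'char' else ref.split()
    hyp_units = list(hyp) if unit == 'char' else hyp.split()
    if not ref_units:
        return 0.0 if not hyp_units else float('inf')
    return edit_distance(ref_units, hyp_units) / len(ref_units)

def filter_noisy(utterances, ref_key='text_original', hyp_key='pred_text',
                 max_cer=0.5, max_wer=0.75):
    """Keep utterances whose CER and WER stay below the given thresholds.

    hyp_key is a hypothetical field holding the model prediction;
    the default thresholds are arbitrary placeholders.
    """
    kept = []
    for utt in utterances:
        ref, hyp = utt[ref_key].lower(), utt[hyp_key].lower()
        if error_rate(ref, hyp, 'char') > max_cer:
            continue
        if error_rate(ref, hyp, 'word') > max_wer:
            continue
        kept.append(utt)
    return kept
```

The predictions themselves would have to be produced separately by running the trained model over each manifest; that step is outside the scope of this sketch.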

In [42]:
import os
import tqdm
import json
import string

def load_jsonl(filepath):
    data = []
    with open(filepath, 'r', encoding='utf_8') as fp:
        inlines = fp.readlines()
        for line in inlines:
            if line.startswith("//") or line.strip() == '':
                continue
            row = json.loads(line)
            data.append(row)
    return data

def dump_jsonl(filepath, data):
    with open(filepath, 'w', encoding='utf_8') as fp:  # explicit encoding, since ensure_ascii=False writes non-ASCII characters
        for datum in data:
            row = json.dumps(datum, ensure_ascii=False)
            fp.write(row)
            fp.write('\n')

german_alphabet = set(' abcdefghijklmnopqrstuvwxyzäöüß'+string.punctuation+'0123456789')
 
def filter_manifest(input_manifest, output_manifest, min_duration=0.1, max_duration=20):
    utterances = load_jsonl(input_manifest)
    filtered_utterances = []
    invalid_chars = set()
    for i in tqdm.tqdm(range(len(utterances))):
        if (utterances[i]['duration'] > max_duration) or (utterances[i]['duration'] < min_duration): # too long or too short
            continue
        unknown_chars = set(utterances[i]['text_original'].lower()) - german_alphabet
        if unknown_chars: # text containing non-German characters
            invalid_chars |= unknown_chars
            continue
        
        filtered_utterances.append(utterances[i])
    print("Number of utterances filtered out: ", len(utterances) - len(filtered_utterances))    
    print("Invalid characters:", ''.join(list(invalid_chars)))
    
    dump_jsonl(output_manifest, filtered_utterances)

In [43]:
for dataset in ['mls', 'voxpopuli', 'mcv']:
    for subset in ['train', 'dev', 'test']:        
        input_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized.json")
        output_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized_filtered.json")
        print("Processing ", input_manifest)
        filter_manifest(input_manifest, output_manifest)

Processing  ./data/processed/mls/mls_train_manifest_normalized.json


100%|█████████████████████████████████████████████████████████████████████████████| 23497/23497 [00:00<00:00, 163329.04it/s]


Number of utterances filtered out:  0
Invalid characters: 
Processing  ./data/processed/mls/mls_dev_manifest_normalized.json


100%|███████████████████████████████████████████████████████████████████████████████| 3469/3469 [00:00<00:00, 159734.33it/s]

Number of utterances filtered out:  6
Invalid characters: àèé





Processing  ./data/processed/mls/mls_test_manifest_normalized.json


100%|███████████████████████████████████████████████████████████████████████████████| 3394/3394 [00:00<00:00, 155981.19it/s]

Number of utterances filtered out:  2
Invalid characters: é





Processing  ./data/processed/voxpopuli/voxpopuli_train_manifest_normalized.json


100%|███████████████████████████████████████████████████████████████████████████| 108473/108473 [00:00<00:00, 182750.22it/s]


Number of utterances filtered out:  2101
Invalid characters: ćľņó─łáőňâęșéˊí—ūń¸̇şżïøýӧ‟‹ţàą„țúī°èšåñãž´źčğçė§ı›śă⁰đê
Processing  ./data/processed/voxpopuli/voxpopuli_dev_manifest_normalized.json


100%|███████████████████████████████████████████████████████████████████████████████| 2109/2109 [00:00<00:00, 180268.74it/s]


Number of utterances filtered out:  51
Invalid characters: ñíãć—čğăá„‚èé
Processing  ./data/processed/voxpopuli/voxpopuli_test_manifest_normalized.json


100%|███████████████████████████████████████████████████████████████████████████████| 1968/1968 [00:00<00:00, 179225.08it/s]

Number of utterances filtered out:  49
Invalid characters: ø—ğóàôá„żèš





Processing  ./data/processed/mcv/mcv_train_manifest_normalized.json


100%|███████████████████████████████████████████████████████████████████████████| 196403/196403 [00:00<00:00, 248552.22it/s]


Number of utterances filtered out:  5095
Invalid characters: 道ûņóáảő«șňďęéəшḫན″ếř临—ūиṩîœş‚ēżõаṭħøýе‹à¡ą–„țú尣⟩ô支òсšŏñůž´ṟčė›カśă−ġễч་ê’”ōćµùłквǐâí無臣“āộмńť′…ěþï‟ʻ»фắ孙о‘→īʿрìåźãæ⟨ğçıạ≡‑ðđë
Processing  ./data/processed/mcv/mcv_dev_manifest_normalized.json


100%|█████████████████████████████████████████████████████████████████████████████| 15340/15340 [00:00<00:00, 243739.41it/s]


Number of utterances filtered out:  386
Invalid characters: ćół辶áőș«âęéəíř“āọş‚ë…ěïø‹ʻ»ąứ–„ț‘ʿôåšñæžãźčğ›ıśă乡ðê’đ幺ō
Processing  ./data/processed/mcv/mcv_test_manifest_normalized.json


100%|█████████████████████████████████████████████████████████████████████████████| 15340/15340 [00:00<00:00, 248284.23it/s]


Number of utterances filtered out:  352
Invalid characters: ćנעóồłáả«șęňâéíř“āū”ń̇îşå…ěøýʻ»ǔàắ–„țú‘°ʿòשźšãæžא´čğñçבıśð’ëō


## Mix and Train/Test Split

We keep the train/dev/test structure of the original datasets, and simply merge them together. 
For other applications where a certain dataset is over- or under-represented, one might want to apply oversampling or undersampling instead.
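For readers who do want to rebalance, here is a minimal sketch of random over- and undersampling applied to a list of utterances. The function `resample` and its parameters `target_size` and `seed` are illustrative assumptions and are not used elsewhere in this notebook.

```python
import random

def resample(utterances, target_size, seed=0):
    """Return a list of exactly target_size utterances (hypothetical helper).

    Undersampling draws without replacement; oversampling keeps every
    original utterance and adds random duplicates drawn with replacement.
    """
    if not utterances:
        return []
    rng = random.Random(seed)  # fixed seed for reproducibility
    if target_size <= len(utterances):
        return rng.sample(utterances, target_size)
    extra = rng.choices(utterances, k=target_size - len(utterances))
    return list(utterances) + extra
```

One possible use is to call `resample` on each dataset's filtered train manifest with a common `target_size` before extending the merged list, while leaving the dev and test subsets untouched so that evaluation numbers stay comparable.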

In [44]:
for subset in ['train', 'dev', 'test']:
    merged_manifest = []
    for dataset in ['mls', 'voxpopuli', 'mcv']:    
        print("Processing ", dataset, subset)
        merged_manifest.extend(load_jsonl(os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized_filtered.json")))
    output_manifest = os.path.join('./data/processed/', f"{subset}_manifest_merged.json")
    dump_jsonl(output_manifest, merged_manifest)

Processing  mls train
Processing  voxpopuli train
Processing  mcv train
Processing  mls dev
Processing  voxpopuli dev
Processing  mcv dev
Processing  mls test
Processing  voxpopuli test
Processing  mcv test
