# Filter, Mix and Train/Test Split

In this notebook we apply a simple filter to remove outlying data points. Then we combine the three datasets into a single one and write out the train/dev/test splits.

## Filter

We implement a simple filter here that drops samples that are too long (>20 s), too short (<0.1 s) or empty.

A more advanced filter that weeds out 'noisy' samples, that is, samples with a very high WER (word error rate) or CER (character error rate) w.r.t. a previously trained German model, is left as an advanced exercise for interested readers.
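As a starting point for that exercise, the sketch below computes WER and CER with a plain Levenshtein distance and drops utterances above a CER threshold. It is only an illustration under assumptions: the field names `text` and `pred_text` (the latter holding a transcript produced by some previously trained model), the helper names, and the threshold of 0.3 are all hypothetical and not part of this notebook.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0 if not hyp_words else 1.0
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def filter_noisy(utterances, max_cer=0.3, text_key="text", pred_key="pred_text"):
    """Keep utterances whose CER against the model transcript is at most max_cer.

    Assumption: every entry already carries a model prediction under pred_key.
    """
    return [u for u in utterances if cer(u[text_key], u[pred_key]) <= max_cer]
```

The same pattern works with `wer` in place of `cer`; a suitable threshold depends on the quality of the model used for the predictions and would need to be tuned on the data.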

In [2]:
import tqdm
import os
import json

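def load_jsonl(filepath):
    # Read a JSON-lines manifest, skipping blank lines and '//' comment lines.
    data = []
    with open(filepath, 'r', encoding='utf-8') as fp:
        for line in fp:
            if line.startswith("//") or line.strip() == '':
                continue
            row = json.loads(line)
            data.append(row)
    return data

def dump_jsonl(filepath, data):
    # Write UTF-8 explicitly: ensure_ascii=False keeps umlauts and other
    # non-ASCII characters as-is, which can fail with a non-UTF-8 default encoding.
    with open(filepath, 'w', encoding='utf-8') as fp:
        for datum in data:
            row = json.dumps(datum, ensure_ascii=False)
            fp.write(row)
            fp.write('\n')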
            
 
def filter_manifest(input_manifest, output_manifest, min_duration=0.1, max_duration=20):
    # Keep only utterances whose duration (in seconds) lies strictly between the two bounds.
    utterances = load_jsonl(input_manifest)
    filtered_utterances = []
    for utterance in tqdm.tqdm(utterances):
        if min_duration < utterance['duration'] < max_duration:
            filtered_utterances.append(utterance)

    dump_jsonl(output_manifest, filtered_utterances)

In [3]:
#for dataset in ['mls', 'voxpopuli', 'mcv']:
for dataset in ['mls', 'voxpopuli']:
    for subset in ['train', 'dev', 'test']:        
        input_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized.json")
        output_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized_filtered.json")
        print("Processing ", input_manifest)
        filter_manifest(input_manifest, output_manifest)

Processing  ./data/processed/mls/mls_train_manifest_normalized.json
100%|██████████| 23497/23497 [00:00<00:00, 972101.17it/s]

Processing  ./data/processed/mls/mls_dev_manifest_normalized.json
100%|██████████| 3469/3469 [00:00<00:00, 1132298.88it/s]

Processing  ./data/processed/mls/mls_test_manifest_normalized.json
100%|██████████| 3394/3394 [00:00<00:00, 1176582.18it/s]

Processing  ./data/processed/voxpopuli/voxpopuli_train_manifest_normalized.json
100%|██████████| 108473/108473 [00:00<00:00, 1309530.11it/s]

Processing  ./data/processed/voxpopuli/voxpopuli_dev_manifest_normalized.json
100%|██████████| 2109/2109 [00:00<00:00, 1281069.82it/s]

Processing  ./data/processed/voxpopuli/voxpopuli_test_manifest_normalized.json
100%|██████████| 1968/1968 [00:00<00:00, 1282534.22it/s]


## Mix and Train/Test Split

We keep the train/dev/test structure of the original datasets and simply merge them together.
For other applications where a certain dataset is over- or under-represented, one might want to apply oversampling or undersampling instead.
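One simple way to oversample is to repeat randomly chosen utterances from the smaller datasets until each matches the largest one. The sketch below illustrates the idea; the function name `oversample_to_largest` and the dict-of-lists input are assumptions made for this example, not something this notebook defines.

```python
import random


def oversample_to_largest(datasets, seed=0):
    """Merge datasets after padding each one to the size of the largest.

    datasets: dict mapping a dataset name to its list of manifest entries.
    Smaller datasets are padded with entries drawn at random with replacement.
    """
    rng = random.Random(seed)  # fixed seed keeps the merge reproducible
    target = max(len(utts) for utts in datasets.values())
    balanced = []
    for name, utts in datasets.items():
        balanced.extend(utts)
        if utts:  # an empty dataset cannot be sampled from
            balanced.extend(rng.choices(utts, k=target - len(utts)))
    return balanced
```

Resampling of this kind would normally be applied to the training subset only, so that the dev and test manifests continue to reflect the original data.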

In [8]:
for subset in ['train', 'dev', 'test']:
    # Start from an empty list for each subset; otherwise the train data
    # would leak into the dev and test manifests.
    merged_manifest = []
    #for dataset in ['mls', 'voxpopuli', 'mcv']:
    for dataset in ['mls', 'voxpopuli']:
        print("Processing ", dataset, subset)
        merged_manifest.extend(load_jsonl(os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized_filtered.json")))
    output_manifest = os.path.join('./data/processed/', f"{subset}_manifest_merged.json")
    dump_jsonl(output_manifest, merged_manifest)

Processing  mls train
Processing  voxpopuli train
Processing  mls dev
Processing  voxpopuli dev
Processing  mls test
Processing  voxpopuli test
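
Since the merged manifests are meant to preserve the original train/dev/test structure, it can be worth checking that no utterance ends up in more than one split. The following is a minimal sketch of such a check; it assumes each manifest entry identifies its audio through an `audio_filepath` field, which is an assumption about the manifest layout, and `find_overlap` is a hypothetical helper name.

```python
def find_overlap(splits, key="audio_filepath"):
    """Return a dict mapping each duplicated key to the set of splits containing it.

    splits: dict mapping a split name ('train', 'dev', 'test') to its list of entries.
    key: field assumed to identify an utterance uniquely.
    """
    first_seen = {}  # key value -> name of the first split it appeared in
    overlap = {}
    for split_name, utterances in splits.items():
        for utt in utterances:
            value = utt[key]
            first = first_seen.setdefault(value, split_name)
            if first != split_name:
                overlap.setdefault(value, {first}).add(split_name)
    return overlap
```

An empty result means the splits are disjoint with respect to the chosen key. The inputs could be obtained by calling `load_jsonl` on each of the merged manifest files.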
