# Filter, Mix, and Train/Test Split

In this tutorial, we apply a simple filter to remove outlying data points. Then we combine the three datasets into a single one, keeping their train/dev/test splits.

## Filter

We implement a simple filter here that drops samples that are too long (>20 s), too short (<0.1 s), or empty.
We also replace special characters (anything other than the German alphabet, punctuation marks, and digits) with a space, and then strip all punctuation except apostrophes.
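
To make the character rule concrete, here is a minimal sketch applied to a single string. The helper name `clean_text` and the sample sentence are illustrative assumptions and not part of the tutorial's code; the allowed character set mirrors the one used by the filter.

```python
import string

# Allowed characters: space, lowercase German letters, ASCII punctuation, digits.
ALLOWED = set(" abcdefghijklmnopqrstuvwxyzäöüß" + string.punctuation + "0123456789")

# Translation table that deletes all ASCII punctuation except the apostrophe.
STRIP_PUNCT = str.maketrans('', '', ''.join(set(string.punctuation) - {"'"}))

def clean_text(text):
    """Hypothetical helper: replace disallowed characters, then strip punctuation."""
    # Test each character in lowercase so uppercase letters pass the check.
    replaced = ''.join(c if c.lower() in ALLOWED else ' ' for c in text)
    return replaced.translate(STRIP_PUNCT)

sample = "Café, wie geht's?"
print(repr(clean_text(sample)))  # 'é' becomes a space; ',' and '?' are removed
```

Note that replacing a character with a space can leave two spaces in a row, so a later whitespace-normalization step may be worth adding.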

A more advanced filter would weed out 'noisy' samples, that is, samples with a very high WER (word error rate) or CER (character error rate) when their transcripts are compared against the output of a previously trained German model. This is left as an advanced exercise for interested readers.
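
One possible starting point for that exercise is sketched below. It assumes each manifest entry already carries a hypothetical `pred_text` field holding the transcript predicted by a previously trained model; that field name, the function names, and the 0.75 thresholds are all illustrative assumptions. The error rate is computed as the Levenshtein edit distance divided by the reference length.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (word lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        # Convention chosen for this sketch: an empty reference is only
        # error-free if the hypothesis is empty too.
        return 0.0 if len(hyp) == 0 else 1.0
    return edit_distance(ref, hyp) / len(ref)

def filter_noisy(utterances, max_wer=0.75, max_cer=0.75):
    """Keep utterances whose WER and CER are both within the thresholds."""
    kept = []
    for utt in utterances:
        ref, hyp = utt['text'], utt['pred_text']  # 'pred_text' is an assumed field
        wer = error_rate(ref.split(), hyp.split())
        cer = error_rate(ref, hyp)
        if wer <= max_wer and cer <= max_cer:
            kept.append(utt)
    return kept

examples = [
    {'text': 'guten morgen', 'pred_text': 'guten morgen'},
    {'text': 'guten morgen', 'pred_text': 'xyz'},
]
print(len(filter_noisy(examples)))
```

Suitable thresholds depend on the quality of the model used to produce the predictions, so they are best chosen after inspecting the WER/CER distribution of the data.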

In [76]:
import os
import tqdm
import json
import string
import re

def load_jsonl(filepath):
    data = []
    with open(filepath, 'r', encoding='utf_8') as fp:
        inlines = fp.readlines()
        for line in inlines:
            if line.startswith("//") or line.strip() == '':
                continue
            row = json.loads(line)
            data.append(row)
    return data

def dump_jsonl(filepath, data):
    # Write UTF-8 explicitly: ensure_ascii=False emits umlauts and ß as-is
    with open(filepath, 'w', encoding='utf_8') as fp:
        for datum in data:
            row = json.dumps(datum, ensure_ascii=False)
            fp.write(row)
            fp.write('\n')

german_alphabet = set(" abcdefghijklmnopqrstuvwxyzäöüß"+ string.punctuation + "0123456789")
 
def filter_manifest(input_manifest, output_manifest, min_duration=0.1, max_duration=20):
    utterances = load_jsonl(input_manifest)
    filtered_utterances = []
    for i in tqdm.tqdm(range(len(utterances))):
        # Skip samples that are too long or too short
        if (utterances[i]['duration'] > max_duration) or (utterances[i]['duration'] < min_duration):
            continue
        
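        # Replace characters outside the allowed set with a space.
        # Each character is tested in lowercase so uppercase letters are handled too.
        text = ''.join(c if c.lower() in german_alphabet else ' ' for c in utterances[i]['text'])
        
        # Remove punctuation, keeping apostrophes
        text = text.translate(str.maketrans('', '', ''.join(set(string.punctuation)-{"'"})))
        
        # Skip samples whose transcript is empty after cleaning
        if not text.strip():
            continue
        utterances[i]['text'] = text
            
        filtered_utterances.append(utterances[i])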
    
    print("Number of utterances filtered out: ", len(utterances) - len(filtered_utterances))    
    dump_jsonl(output_manifest, filtered_utterances)

In [77]:
for dataset in ['mls', 'voxpopuli', 'mcv']:
    for subset in ['train', 'dev', 'test']:        
        input_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized.json")
        output_manifest = os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized_filtered.json")
        print("Processing ", input_manifest)
        filter_manifest(input_manifest, output_manifest)

Processing  ./data/processed/mls/mls_train_manifest_normalized.json
100%|██████████████████████████████████████████████████████████████████████████████| 23497/23497 [00:00<00:00, 37217.18it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/mls/mls_dev_manifest_normalized.json
100%|████████████████████████████████████████████████████████████████████████████████| 3469/3469 [00:00<00:00, 34964.46it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/mls/mls_test_manifest_normalized.json
100%|████████████████████████████████████████████████████████████████████████████████| 3394/3394 [00:00<00:00, 35854.17it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/voxpopuli/voxpopuli_train_manifest_normalized.json
100%|████████████████████████████████████████████████████████████████████████████| 108473/108473 [00:02<00:00, 47112.42it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/voxpopuli/voxpopuli_dev_manifest_normalized.json
100%|████████████████████████████████████████████████████████████████████████████████| 2109/2109 [00:00<00:00, 46509.29it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/voxpopuli/voxpopuli_test_manifest_normalized.json
100%|████████████████████████████████████████████████████████████████████████████████| 1968/1968 [00:00<00:00, 45736.27it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/mcv/mcv_train_manifest_normalized.json
100%|████████████████████████████████████████████████████████████████████████████| 196403/196403 [00:02<00:00, 73612.86it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/mcv/mcv_dev_manifest_normalized.json
100%|██████████████████████████████████████████████████████████████████████████████| 15340/15340 [00:00<00:00, 75184.66it/s]
Number of utterances filtered out:  0
Processing  ./data/processed/mcv/mcv_test_manifest_normalized.json
100%|██████████████████████████████████████████████████████████████████████████████| 15340/15340 [00:00<00:00, 77212.30it/s]
Number of utterances filtered out:  0


## Mix and Train/Test Split

We keep the train/dev/test structure of the original datasets and simply merge them subset by subset.
For other applications where certain datasets are over- or under-represented, one might want to apply oversampling or undersampling instead.
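
As a rough illustration of the resampling idea, the sketch below draws a fixed number of utterances from each dataset before merging: datasets larger than the target are undersampled without replacement, and smaller ones are oversampled by adding random repeats. The function name `resample`, the target size, and the toy manifests are assumptions made for this example.

```python
import random

def resample(utterances, target_size, seed=0):
    """Hypothetical helper: return target_size entries drawn from utterances."""
    if not utterances:
        return []
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    if len(utterances) >= target_size:
        # Undersample: pick target_size distinct entries.
        return rng.sample(utterances, target_size)
    # Oversample: keep every original entry, then add random repeats.
    extra = rng.choices(utterances, k=target_size - len(utterances))
    return utterances + extra

# Toy manifests standing in for two datasets of very different size.
manifests = {
    'large': [{'text': f'large {i}'} for i in range(10)],
    'small': [{'text': f'small {i}'} for i in range(3)],
}

balanced = []
for name, utts in manifests.items():
    balanced.extend(resample(utts, target_size=5))

print(len(balanced))  # 5 entries from each dataset, 10 in total
```

Resampling of this kind would typically be applied to the training subset only, so that the dev and test subsets remain comparable across experiments.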

In [None]:
for subset in ['train', 'dev', 'test']:
    merged_manifest = []
    for dataset in ['mls', 'voxpopuli', 'mcv']:    
        print("Processing ", dataset, subset)
        merged_manifest.extend(load_jsonl(os.path.join('./data/processed/', dataset, f"{dataset}_{subset}_manifest_normalized_filtered.json")))
    output_manifest = os.path.join('./data/processed/', f"{subset}_manifest_merged.json")
    dump_jsonl(output_manifest, merged_manifest)

Processing  mls train
Processing  voxpopuli train
Processing  mcv train
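
A quick sanity check after merging is to confirm the utterance count and total audio duration of each merged manifest. The sketch below shows one way to do that; the helper name `summarize` and the in-memory sample entries are assumptions, and only the `duration` field (in seconds) already used by the filter is relied on.

```python
def summarize(utterances):
    """Hypothetical helper: count utterances and total audio hours."""
    total_seconds = sum(u['duration'] for u in utterances)
    return {'utterances': len(utterances), 'hours': total_seconds / 3600}

# Stand-in entries; in practice these would come from a merged manifest.
sample = [
    {'duration': 1800.0, 'text': 'erstes beispiel'},
    {'duration': 5400.0, 'text': 'zweites beispiel'},
]

print(summarize(sample))  # {'utterances': 2, 'hours': 2.0}
```

The utterance count of a merged subset should equal the sum of the counts of the three filtered manifests it was built from.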
