<span style="font-size:36px"><b>Preprocess Bibleis</b></span>

Copyright &copy; 2020 Gunawan Lumban Gaol

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Import Packages

In [None]:
import os
import re
import glob
import json

import numpy as np
import pandas as pd

# Preprocess Transcription for Alignment

Preprocess each chapter transcription by:
1. Splitting the text into sentences at each `'.'`, one sentence per line.
2. Removing every character except `"a-z"`, `"."`, `","`, and `"<space>"`.
3. Writing each chapter's verses to a `.txt` file.
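The steps above can be sketched on a single string before running them on the whole dataframe. This is a minimal illustration, not the notebook's exact code: the function name `clean_transcription` and the sample sentence are made up for this example.

```python
import re


def clean_transcription(text):
    """Lowercase the text, put one sentence per line, and keep only
    a-z, '.', ',', spaces and newlines (hypothetical helper)."""
    text = text.replace("\n\n", " ").lower()   # join paragraphs, lowercase
    text = text.replace(". ", ".\n")           # one sentence per line
    text = text.replace("-", " ")              # hyphens become spaces
    return re.sub(r"[^a-z.,\n ]", "", text)    # drop every other character


# Made-up sample text, not taken from the dataset.
sample = "In the beginning, God made heaven-and-earth. The earth was empty (void)!"
cleaned = clean_transcription(sample)
print(cleaned)
```

Because the text is lowercased first, the character class only needs `a-z`; punctuation such as parentheses and exclamation marks is dropped.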

In [None]:
df = pd.read_csv("../../dataset/processed/bibleis_trimmed.csv")

In [None]:
df.shape

In [None]:
df.head(1).values

In [None]:
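def clean_str(x):
    # Keep only letters, '.', ',', newlines and spaces.
    # Note: 'A-Z', not 'A-z'; the latter also matches [ \ ] ^ _ and `.
    return re.sub(r'[^a-zA-Z.,\n ]', '', x)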

In [None]:
tmp = [x.replace('\n\n', ' ').lower() for x in df['chapter_string']]
tmp = [x.replace('. ', '.\n') for x in tmp]
tmp = [re.sub(r'[-]', ' ', x) for x in tmp]
tmp = [clean_str(x) for x in tmp]

Store the result back in the dataframe and inspect an example of the cleaned transcription.

In [None]:
df['chapter_string'] = tmp

In [None]:
df.head(1)['chapter_string'].values

Write the cleaned transcription into `.txt` files.

In [None]:
# for x in df.values:
#     with open(x[2][:-4] + '.txt', 'w', encoding='utf-8') as f:
#         f.writelines(x[1])

# Preprocess Audio & Text After Alignment

Given the aligned `.json` files produced by aeneas, split each sentence into its own `.mp3` and `.txt` file.
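To make the input concrete, here is a minimal sketch of reading sentence boundaries from an aeneas sync map. It assumes the usual aeneas JSON layout (a top-level `fragments` list whose items carry `begin`, `end`, `id` and `lines`). The sample data and the helper `read_fragments` are made up, and how `AeneasSplitter` reads the file internally is not shown in this notebook.

```python
import json

# Made-up sync map in the shape aeneas typically writes for JSON output.
sync_map_text = """
{"fragments": [
  {"begin": "0.000", "end": "2.680", "id": "f000001", "language": "ind",
   "children": [], "lines": ["first sentence."]},
  {"begin": "2.680", "end": "5.120", "id": "f000002", "language": "ind",
   "children": [], "lines": ["second sentence."]}
]}
"""


def read_fragments(text):
    """Return (fragment id, start seconds, end seconds, text) tuples
    (hypothetical helper, not part of gurih)."""
    sync_map = json.loads(text)
    return [
        (frag["id"], float(frag["begin"]), float(frag["end"]), " ".join(frag["lines"]))
        for frag in sync_map["fragments"]
    ]


fragments = read_fragments(sync_map_text)
for frag_id, begin, end, line in fragments:
    print(f"{frag_id}: {begin:.3f}s to {end:.3f}s -> {line}")
```

Each tuple gives the time span to cut from the chapter audio and the sentence text to store alongside it.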

In [None]:
from gurih.data.splitter import AeneasSplitter

In [None]:
input_dir = '../../dataset/processed/bibleis_trimmed/'
output_dir = '../../dataset/processed/bibleis_trimmed_splitted/'
splitter = AeneasSplitter(input_dir=input_dir, output_dir=output_dir)

In [None]:
aligned_jsons = glob.glob(input_dir+"*.json")
aligned_jsons = [os.path.basename(path) for path in aligned_jsons]

In [None]:
# Use a name other than `json` so the imported json module is not shadowed.
for json_name in aligned_jsons:
    fragments = splitter.load(json_name)
    splitter.split_and_write(fragments)

# Extract Audio Features

Given the split `.mp3` files, extract the features and write them in `.npz` format.
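As a rough illustration of what appending deltas and writing `.npz` could look like, the sketch below computes first-order deltas with the common regression formula over neighbouring frames and saves the result with NumPy. The function `append_delta`, the `width` parameter, the random stand-in for MFCCs and the `features` key are all assumptions for this example; the actual behaviour of `MFCCFeatureExtractor` may differ.

```python
import os
import tempfile

import numpy as np


def append_delta(features, width=2):
    """Append first-order deltas to a (frames, coefficients) array.

    Uses a regression over +/- `width` frames with edge padding
    (hypothetical helper, not the gurih implementation).
    """
    n_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    delta = np.zeros(features.shape, dtype=float)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + n_frames]    # frames t + n
        behind = padded[width - n : width - n + n_frames]   # frames t - n
        delta += n * (ahead - behind)
    return np.hstack([features, delta / denom])


# Random numbers standing in for 13 MFCCs over 50 frames.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(50, 13))
with_delta = append_delta(mfcc)

# Write to a temporary .npz file and read it back.
out_path = os.path.join(tempfile.mkdtemp(), "example.npz")
np.savez(out_path, features=with_delta)
loaded = np.load(out_path)["features"]
print(loaded.shape)  # (50, 26)
```

Appending deltas doubles the feature dimension, which is worth keeping in mind when sizing the model input.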

In [None]:
from sklearn.pipeline import Pipeline
from gurih.data.normalizer import AudioNormalizer
from gurih.features.extractor import MFCCFeatureExtractor

In [None]:
input_dir = "../../test/test_data/data_generator/"

In [None]:
X = glob.glob(input_dir+"*.mp3")

pipeline = Pipeline(
    steps = [
        ("normalizer", AudioNormalizer(output_dir=input_dir)),
        ("mfcc_feature_extractor", MFCCFeatureExtractor(write_output=True,
                                                        output_dir=input_dir,
                                                        append_delta=True))
    ]
)
outputs = pipeline.fit_transform(X)