<span style="font-size:36px"><b>Preprocess Bibleis</b></span>

Copyright &copy; 2020 Gunawan Lumban Gaol

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Import Packages

In [1]:
import os
import re
import glob
import json

import numpy as np
import pandas as pd

# Preprocess Transcription for Alignment

## INDASV

Preprocess each chapter transcription in these steps:
1. Lowercase the text and split it into sentences at each `'.'`.
2. Remove every character except `a-z`, `.`, `,`, and space.
3. Write each chapter's transcription to a `.txt` file.
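
Steps 1 and 2 can be sketched on a short string as follows. This is a minimal, self-contained illustration rather than the notebook's exact cells: `clean_chapter` and the sample text are made up for the example. Note that a sentence ending in `."` is not split, because only the literal sequence `'. '` is replaced.

```python
import re

def clean_chapter(text):
    """Lowercase, put one sentence per line, and keep only a-z, '.', ',', space, newline."""
    text = text.replace('\n\n', ' ').lower()   # join verses, lowercase
    text = text.replace('. ', '.\n')           # one sentence per line
    text = text.replace('-', ' ')              # "orang-orang" -> "orang orang"
    return re.sub(r'[^a-z.,\n ]', '', text)    # drop everything else

# Hypothetical sample, not taken from the dataset
sample = 'Orang-orang Israel lari. Saul berkata: "Hunuslah pedangmu!"\n\nLalu ia mati.'
cleaned = clean_chapter(sample)
print(cleaned)
```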

In [2]:
df = pd.read_csv("../../dataset/processed/bibleis_trimmed.csv")

In [3]:
df.shape

(1189, 3)

In [4]:
df.head(1).values

array([['https://live.bible.is/bible/INDASV/1CH/10?audio_type=audio',
        'Orang Filistin berperang melawan orang Israel. Orang-orang Israel melarikan diri dari hadapan orang Filistin dan banyak yang mati terbunuh di pegunungan Gilboa.\n\nOrang Filistin terus mengejar Saul dan anak-anaknya dan menewaskan Yonatan, Abinadab dan Malkisua, anak-anak Saul.\n\nKemudian makin beratlah pertempuran itu bagi Saul; para pemanah menjumpainya dan melukainya.\n\nLalu berkatalah Saul kepada pembawa senjatanya: "Hunuslah pedangmu dan tikamlah aku, supaya jangan datang orang-orang yang tidak bersunat ini memperlakukan aku sebagai permainan." Tetapi pembawa senjatanya tidak mau, karena ia sangat segan. Kemudian Saul mengambil pedang itu dan menjatuhkan dirinya ke atasnya.\n\nKetika pembawa senjatanya melihat, bahwa Saul telah mati, ia pun menjatuhkan dirinya ke atas pedangnya, lalu mati.\n\nJadi Saul, ketiga anaknya dan segenap keluarganya sama-sama mati.\n\nKetika dilihat seluruh orang Israel yang 

In [5]:
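def clean_str(x):
    # Keep only ASCII letters, '.', ',', newline, and space.
    # The range must be A-Z: A-z would also match [ \ ] ^ _ and `.
    return re.sub(r'[^a-zA-Z.,\n ]', '', x)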

In [6]:
tmp = [x.replace('\n\n', ' ').lower() for x in df['chapter_string']]
tmp = [x.replace('. ', '.\n') for x in tmp]
tmp = [re.sub(r'[-]', ' ', x) for x in tmp]
tmp = [clean_str(x) for x in tmp]

Store the result back in the dataframe and inspect an example of the cleaned transcription.

In [7]:
df['chapter_string'] = tmp

In [8]:
df.head(1)['chapter_string'].values

array(['orang filistin berperang melawan orang israel.\norang orang israel melarikan diri dari hadapan orang filistin dan banyak yang mati terbunuh di pegunungan gilboa.\norang filistin terus mengejar saul dan anak anaknya dan menewaskan yonatan, abinadab dan malkisua, anak anak saul.\nkemudian makin beratlah pertempuran itu bagi saul para pemanah menjumpainya dan melukainya.\nlalu berkatalah saul kepada pembawa senjatanya hunuslah pedangmu dan tikamlah aku, supaya jangan datang orang orang yang tidak bersunat ini memperlakukan aku sebagai permainan. tetapi pembawa senjatanya tidak mau, karena ia sangat segan.\nkemudian saul mengambil pedang itu dan menjatuhkan dirinya ke atasnya.\nketika pembawa senjatanya melihat, bahwa saul telah mati, ia pun menjatuhkan dirinya ke atas pedangnya, lalu mati.\njadi saul, ketiga anaknya dan segenap keluarganya sama sama mati.\nketika dilihat seluruh orang israel yang di lembah, bahwa tentara telah melarikan diri, dan bahwa saul serta anak anaknya suda

Write the cleaned transcription into `.txt` files.

In [9]:
# for x in df.values:
#     with open(x[2][:-4] + '.txt', 'w', encoding='utf-8') as f:
#         f.writelines(x[1])
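
The commented-out loop above derives each output path by replacing the last four characters of the third column (presumably an `.mp3` path) with `.txt`. A slightly more defensive sketch of the same idea is shown below; `write_transcript` and the file name are hypothetical, and the example writes into a temporary directory so it can be run without the dataset.

```python
import os
import tempfile

def write_transcript(audio_path, text):
    """Write `text` next to `audio_path`, swapping the extension for .txt."""
    txt_path = os.path.splitext(audio_path)[0] + '.txt'
    with open(txt_path, 'w', encoding='utf-8') as f:
        f.write(text)
    return txt_path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, used only for illustration
    audio_path = os.path.join(tmp_dir, 'INDASV_1CH_10.mp3')
    txt_path = write_transcript(audio_path, 'orang filistin berperang.\n')
    with open(txt_path, encoding='utf-8') as f:
        written = f.read()
```

Using `os.path.splitext` instead of slicing avoids truncating paths whose extension is not exactly three characters long.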

## INDWBT

Preprocess each chapter transcription in these steps:
1. For the first chapter of every book (e.g. MAT1, MRK1), add the special spoken introduction customarily used in church (e.g. 1CO_1 --> *'Surat Rasul Paulus yang pertama kepada jemaat di Korintus pasal satu'*).
2. For the rest, insert at the beginning a sentence announcing the book and chapter (e.g. MAT1 --> *'Matius pasal satu'*).
3. (Optional) Apply additional splitting (e.g. `smart split`, `split by comma`). The default is `split by verse`.
4. Remove every character except `a-z`, `.`, `,`, and space.
5. Write the transcription to a new `.txt` file compatible with the `Aeneas` plain input format.
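
Steps 1 and 2 could look roughly like the sketch below. Everything here is an assumption made for illustration: the `BOOK_CHAPTER` key format (taken from the `dict_add` keys), the `BOOK_NAMES` and `SPECIAL_INTROS` tables, and the helper names. Only the two example sentences come from the list above.

```python
UNITS = ['', 'satu', 'dua', 'tiga', 'empat', 'lima',
         'enam', 'tujuh', 'delapan', 'sembilan']

# Hypothetical lookup tables; a real run would need every book listed
BOOK_NAMES = {'MAT': 'Matius', 'MRK': 'Markus'}
SPECIAL_INTROS = {
    '1CO_1': 'Surat Rasul Paulus yang pertama kepada jemaat di Korintus pasal satu',
}

def number_to_words(n):
    """Spell out 1..99 in Indonesian."""
    if not 1 <= n <= 99:
        raise ValueError('only 1..99 are supported in this sketch')
    if n < 10:
        return UNITS[n]
    if n == 10:
        return 'sepuluh'
    if n == 11:
        return 'sebelas'
    if n < 20:
        return UNITS[n - 10] + ' belas'
    tens, unit = divmod(n, 10)
    words = UNITS[tens] + ' puluh'
    return words + ' ' + UNITS[unit] if unit else words

def chapter_intro(key):
    """Return the spoken introduction for a key such as 'MAT_2'."""
    if key in SPECIAL_INTROS:
        return SPECIAL_INTROS[key]
    book, chapter = key.split('_')
    return f'{BOOK_NAMES[book]} pasal {number_to_words(int(chapter))}'

print(chapter_intro('MAT_2'))
print(chapter_intro('1CO_1'))
```

The returned sentence would then go through the same cleaning as the rest of the chapter (step 4), which lowercases it.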

In [None]:
dict_add = {
    "1CO_1": "",
    "1JN_1": "",
    "1PE_1": "",  # introductions still to be filled in
}

# Preprocess Audio & Text After Alignment

Given the aligned `.json` output from Aeneas, split each sentence into its own `.mp3` and `.txt` file.

\*\***NOTE**\*\*: This notebook uses a single thread to do the split. See `2.0-glg-split_mp.py` for a multiprocessing approach.
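
For orientation, an Aeneas JSON sync map holds a `fragments` list whose entries carry `begin` and `end` times (seconds, stored as strings) and the aligned text in `lines`. The sketch below only reads those fields from a made-up sample string; `read_fragments` is a hypothetical helper, and how `AeneasSplitter` handles the file internally is not shown in this notebook.

```python
import json

# Made-up sync map in the shape Aeneas writes for JSON output
sample_sync_map = '''
{
  "fragments": [
    {"begin": "0.000", "end": "2.680", "id": "f000001", "language": "ind",
     "children": [], "lines": ["orang filistin berperang melawan orang israel."]},
    {"begin": "2.680", "end": "7.120", "id": "f000002", "language": "ind",
     "children": [], "lines": ["orang orang israel melarikan diri."]}
  ]
}
'''

def read_fragments(sync_map_json):
    """Return (begin_seconds, end_seconds, text) for every fragment."""
    sync_map = json.loads(sync_map_json)
    return [
        (float(frag['begin']), float(frag['end']), ' '.join(frag['lines']))
        for frag in sync_map['fragments']
    ]

fragments = read_fragments(sample_sync_map)
for begin, end, text in fragments:
    print(f'{begin:.2f}-{end:.2f}s: {text}')
```

Each `(begin, end)` pair marks the slice of the chapter audio that belongs to one sentence, which is what the split step cuts out and saves.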

In [2]:
from gurih.data.splitter import AeneasSplitter

In [3]:
input_dir = '../../dataset/processed/bibleis_trimmed/'
output_dir = '../../dataset/processed/bibleis_trimmed_splitted/'
splitter = AeneasSplitter(input_dir=input_dir, output_dir=output_dir)

In [6]:
aligned_jsons = glob.glob(input_dir+"*.json")
aligned_jsons = [os.path.basename(path) for path in aligned_jsons]

In [None]:
# Use a name other than `json` so the imported json module is not shadowed.
for json_name in aligned_jsons:
    fragments = splitter.load(json_name)
    splitter.split_and_write(fragments)


# Extract Audio Features

Given the split `.mp3` files, extract the features and write them in `.npz` format.

\*\***NOTE**\*\*: This notebook uses a single thread to do the extraction. See `hr-extraction_pipeline_mp.py` for a multiprocessing approach.
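
As a rough picture of what "append delta and write `.npz`" can mean, the sketch below stacks a simple frame-to-frame derivative under a feature matrix and round-trips it through `np.savez`. It is an assumption-laden stand-in: the random matrix replaces real MFCCs, `append_delta` uses `np.gradient` rather than whatever `MFCCFeatureExtractor` does internally, and the array key `features` is chosen arbitrarily.

```python
import os
import tempfile

import numpy as np

def append_delta(features):
    """Stack a first-order time derivative under the original coefficients."""
    delta = np.gradient(features, axis=1)  # derivative along the frame axis
    return np.concatenate([features, delta], axis=0)

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(13, 50))  # stand-in for 13 coefficients over 50 frames
features = append_delta(mfcc)

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_path = os.path.join(tmp_dir, 'sample.npz')
    np.savez(npz_path, features=features)
    with np.load(npz_path) as data:
        restored = data['features']

print(features.shape)
```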

In [11]:
from sklearn.pipeline import Pipeline
from gurih.data.normalizer import AudioNormalizer
from gurih.features.extractor import MFCCFeatureExtractor

In [12]:
input_dir = "../../test/test_data/data_generator/"

In [13]:
X = glob.glob(input_dir+"*.mp3")

pipeline = Pipeline(
    steps = [
        ("normalizer", AudioNormalizer(output_dir=input_dir)),
        ("mfcc_feature_extractor", MFCCFeatureExtractor(write_output=True,
                                                        output_dir=input_dir,
                                                        append_delta=True))
    ]
)
outputs = pipeline.fit_transform(X)