<span style="font-size:36px"><b>Preprocess Bibleis</b></span>

Copyright &copy; 2020 Gunawan Lumban Gaol

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

# Import Packages

In [1]:
import os
import re
import glob
import json

import numpy as np
import pandas as pd

# Preprocess Transcription for Alignment

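Preprocess each chapter transcription by:
1. Splitting the text into sentences at each `'.'`, one sentence per line.
2. Lowercasing the text, replacing hyphens with a space, and removing every character except `"a-z"`, `"."`, `","`, and `"<space>"`.
3. Writing each chapter's cleaned text to a `.txt` file.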

In [2]:
df = pd.read_csv("../../dataset/processed/bibleis_trimmed.csv")

In [3]:
df[df['audio_title'] == 'INDASV_1CH_1.mp3'].values

array([['https://live.bible.is/bible/INDASV/1CH/1?audio_type=audio',
        "Adam, Set, Enos,\n\nKenan, Mahalaleel, Yared,\n\nHenokh, Metusalah, Lamekh,\n\nNuh, Sem, Ham dan Yafet.\n\nKeturunan Yafet ialah Gomer, Magog, Madai, Yawan, Tubal, Mesekh dan Tiras.\n\nKeturunan Gomer ialah Askenas, Difat dan Togarma.\n\nKeturunan Yawan ialah Elisa, Tarsis, orang Kitim dan orang Rodanim.\n\nKeturunan Ham ialah Kush, Misraim, Put dan Kanaan.\n\nKeturunan Kush ialah Seba, Hawila, Sabta, Raema dan Sabtekha; keturunan Raema ialah Syeba dan Dedan.\n\nKush memperanakkan Nimrod; dialah orang yang mula-mula sekali berkuasa di bumi.\n\nMisraim memperanakkan orang Ludim, orang Anamim, orang Lehabim, orang Naftuhim,\n\norang Patrusim, orang Kasluhim — dari mereka inilah berasal orang Filistin — dan orang Kaftorim.\n\nKanaan memperanakkan Sidon, anak sulungnya dan Het,\n\nserta orang Yebusi, orang Amori, orang Girgasi,\n\norang Hewi, orang Arki, orang Sini,\n\norang Arwadi, orang Semari dan orang Hamati.

In [4]:
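def clean_str(x):
    # Keep only letters, '.', ',', newlines and spaces.
    # Note: 'A-Z' (not 'A-z'), since 'A-z' would also keep [ \ ] ^ _ and `.
    return re.sub(r'[^a-zA-Z.,\n ]', '', x)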

In [5]:
tmp = [x.replace('\n\n', ' ').lower() for x in df['chapter_string']]
tmp = [x.replace('. ', '.\n') for x in tmp]
tmp = [re.sub(r'[-]', ' ', x) for x in tmp]
tmp = [clean_str(x) for x in tmp]
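
The effect of these steps is easiest to see on a single short string. The following is a minimal sketch, not part of the original notebook: `preprocess_chapter` is a hypothetical helper that chains the same four operations, and the sample text is adapted from the chapter shown above. Because the text is lowercased first, the character class only needs `a-z`.

```python
import re

def preprocess_chapter(text):
    """Hypothetical helper: apply the cleaning steps to one chapter string."""
    text = text.replace('\n\n', ' ').lower()  # join verses into one line, lowercase
    text = text.replace('. ', '.\n')          # put each sentence on its own line
    text = text.replace('-', ' ')             # split hyphenated words ("mula-mula")
    return re.sub(r'[^a-z.,\n ]', '', text)   # drop every other character

sample = ("Kush memperanakkan Nimrod; dialah orang yang mula-mula berkuasa.\n\n"
          "Keturunan Ham ialah Kush, Put dan Kanaan.")
print(preprocess_chapter(sample))
```

The semicolon is removed, the hyphen becomes a space, and the two sentences end up on separate lines, which is the one-sentence-per-line layout the alignment step expects.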

Store the result back in the dataframe and inspect a sample of the cleaned transcription.

In [6]:
df['chapter_string'] = tmp

In [7]:
df.sample(1)['chapter_string'].values

array(['lalu allah memberkati nuh dan anak anaknya serta berfirman kepada mereka beranakcuculah dan bertambah banyaklah serta penuhilah bumi.\nakan takut dan akan gentar kepadamu segala binatang di bumi dan segala burung di udara, segala yang bergerak di muka bumi dan segala ikan di laut ke dalam tanganmulah semuanya itu diserahkan.\nsegala yang bergerak, yang hidup, akan menjadi makananmu.\naku telah memberikan semuanya itu kepadamu seperti juga tumbuh tumbuhan hijau.\nhanya daging yang masih ada nyawanya, yakni darahnya, janganlah kamu makan.\ntetapi mengenai darah kamu, yakni nyawa kamu, aku akan menuntut balasnya dari segala binatang aku akan menuntutnya, dan dari setiap manusia aku akan menuntut nyawa sesama manusia.\nsiapa yang menumpahkan darah manusia, darahnya akan tertumpah oleh manusia, sebab allah membuat manusia itu menurut gambar nya sendiri.\ndan kamu, beranakcuculah dan bertambah banyak, sehingga tak terbilang jumlahmu di atas bumi, ya, bertambah banyaklah di atasnya. b

Write the cleaned transcription into `.txt` files.

In [8]:
# x[1] is the cleaned chapter text; x[2] is the audio file name ('.mp3' is stripped).
# for x in df.values:
#     with open(x[2][:-4] + '.txt', 'w', encoding='utf-8') as f:
#         f.write(x[1])
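
As an alternative to positional indexing, the write step can be sketched with explicit names. This is an illustrative version under assumptions: `write_transcripts` is a hypothetical helper, and it takes `(audio_title, chapter_string)` pairs so that it does not depend on the column order of the dataframe. The demo writes to a temporary directory rather than the dataset folder.

```python
import os
import tempfile

def write_transcripts(rows, out_dir):
    """Hypothetical helper: write one .txt file per (audio_title, text) pair."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for title, text in rows:
        stem = os.path.splitext(title)[0]        # 'INDASV_1CH_1.mp3' -> 'INDASV_1CH_1'
        path = os.path.join(out_dir, stem + '.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write(text)
        paths.append(path)
    return paths

demo_rows = [('INDASV_1CH_1.mp3', 'adam, set, enos, kenan.\nketurunan ham ialah kush.')]
paths = write_transcripts(demo_rows, tempfile.mkdtemp())
print(paths)
```

With the dataframe from this notebook, the rows could be supplied as `zip(df['audio_title'], df['chapter_string'])`.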

# Preprocess Audio & Text After Alignment

Given the aligned `.json` sync maps produced by aeneas, split each chapter's audio into one `.mp3` file and one `.txt` file per sentence.
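
To make the input concrete, here is a rough sketch of reading such a sync map. It assumes the usual aeneas JSON layout, a top-level `fragments` list whose items carry `begin`, `end` (seconds, as strings) and `lines`; the timings and the `load_fragments` helper are invented for illustration and are not the `AeneasSplitter` implementation. Cutting the audio itself is left out.

```python
import json

def load_fragments(sync_map_text):
    """Hypothetical helper: parse an aeneas-style sync map into (begin, end, text)."""
    data = json.loads(sync_map_text)
    rows = []
    for frag in data['fragments']:
        begin, end = float(frag['begin']), float(frag['end'])
        text = ' '.join(frag['lines']).strip()
        if end > begin and text:  # skip empty or zero-length fragments
            rows.append((begin, end, text))
    return rows

# Illustrative sync map; the timings are made up.
example = json.dumps({"fragments": [
    {"id": "f000001", "begin": "0.000", "end": "4.120",
     "lines": ["adam, set, enos, kenan."]},
    {"id": "f000002", "begin": "4.120", "end": "9.560",
     "lines": ["keturunan ham ialah kush."]},
]})

for begin, end, text in load_fragments(example):
    print(f"{begin:7.3f} {end:7.3f}  {text}")
```

Each tuple gives the time span to cut from the chapter audio and the sentence to store alongside it.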

In [2]:
from gurih.data.splitter import AeneasSplitter

In [3]:
input_dir = '../../dataset/processed/bibleis_trimmed/'
output_dir = '../../dataset/processed/bibleis_trimmed_splitted/'
splitter = AeneasSplitter(input_dir=input_dir, output_dir=output_dir)

In [6]:
aligned_jsons = glob.glob(input_dir+"*.json")
aligned_jsons = [os.path.basename(path) for path in aligned_jsons]

In [8]:
aligned_jsons[:1]

['INDASV_1CH_1.json']

In [None]:
# Use a name other than `json` so the imported json module is not shadowed.
for json_name in aligned_jsons:
    fragments = splitter.load(json_name)
    splitter.split_and_write(fragments)
