## Data Trimming 
### Reduce each chapter from its full text to its first two sentences, for both KJV and BBE

Many chapters are very long, so keeping only the first two sentences of each one reduces the computational cost of both training and inference.

### Import packages

In [23]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [24]:
import pandas as pd
import nltk
from nltk.tokenize import sent_tokenize

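# Punkt data used by sent_tokenize; newer NLTK releases may additionally need nltk.download('punkt_tab')
nltk.download('punkt')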

[nltk_data] Downloading package punkt to /home/jiax1/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [25]:
# Show full cell contents and every row when displaying DataFrames
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)

### Keep first two sentences of each chapter for each Bible version

In [26]:
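def keep_first_two_sentences(text):
    """Return the first two sentences of `text`, joined by a single space."""
    # Sentence boundaries are detected by NLTK's Punkt tokenizer
    sentences = sent_tokenize(text)
    return ' '.join(sentences[:2])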

In [27]:
# Load the cleaned data and trim both translations in place
df = pd.read_csv('bible_cleaned_data.csv')
df['KJV'] = df['KJV'].apply(keep_first_two_sentences)
df['BBE'] = df['BBE'].apply(keep_first_two_sentences)

## Dirty row removal

### The BBE text for many chapters in Psalms is dirty and difficult to preprocess, so remove all Psalms chapters from the dataset

In [37]:
psalms_df = df[df['chapter'].str.startswith('Psalms')]

sampled_psalms_df = psalms_df.sample(n=10, random_state=42) 

sampled_psalms_df

Unnamed: 0,chapter,KJV,BBE
1042,Psalms3,"Lord, how are they increased that trouble me! many are they that rise up against me.",&lt;A Psalm. Of David.
987,Psalms115,"Not unto us, O Lord, not unto us, but unto thy name give glory, for thy mercy, and for thy truth's sake. Wherefore should the heathen say, Where is now their God?","Not to us, O Lord, not to us, but to your name let glory be given, because of your mercy and your unchanging faith. Why may the nations say, Where is now their God?"
1087,Psalms70,"Make haste, o God, to deliver me; make haste to help me, O Lord. Let them be ashamed and confounded that seek after my soul: let them be turned backward, and put to confusion, that desire my hurt.",&lt;To the chief music-maker. Of David.
1047,Psalms34,"I will bless the Lord at all times: his praise shall continually be in my mouth. My soul shall make her boast in the Lord: the humble shall hear thereof, and be glad.","&lt;Of David. When he made a change in his behaviour before Abimelech, who sent him away, and he went.&gt; I will be blessing the Lord at all times; his praise will be ever in my mouth."
1045,Psalms32,"Blessed is he whose transgression is forgiven, whose sin is covered. Blessed is the man unto whom the Lord imputeth not iniquity, and in whose spirit there is no guile.","&lt;Of David. Maschil.&gt; Happy is he who has forgiveness for his wrongdoing, and whose sin is covered."
1000,Psalms127,"Except the Lord build the house, they labour in vain that build it: except the Lord keep the city, the watchman waketh but in vain. It is vain for you to rise up early, to sit up late, to eat the bread of sorrows: for so he giveth his beloved sleep.","&lt;A Song of the going up. Of Solomon.&gt; If the Lord is not helping the builders, then the building of a house is to no purpose: if the Lord does not keep the town, the watchman keeps his watch for nothing."
1033,Psalms21,"The king shall joy in thy strength, O Lord; and in thy salvation how greatly shall he rejoice! Thou hast given him his heart's desire, and hast not withholden the request of his lips.",&lt;To the chief music-maker. A Psalm.
1110,Psalms91,"He that dwelleth in the secret place of the most High shall abide under the shadow of the Almighty. I will say of the Lord, He is my refuge and my fortress: my God; in him will I trust.","Happy is he whose resting-place is in the secret of the Lord, and under the shade of the wings of the Most High; Who says of the Lord, He is my safe place and my tower of strength: he is my God, in whom is my hope. He will take you out of the bird-net, and keep you safe from wasting disease."
1037,Psalms25,"Unto thee, O Lord, do I lift up my soul. O my God, I trust in thee: let me not be ashamed, let not mine enemies triumph over me.","&lt;Of David.&gt; To you, O Lord, my soul is lifted up. O my God, I have put my faith in you, let me not be shamed; let not my haters be glorying over me."
1051,Psalms38,"O Lord, rebuke me not in thy wrath: neither chasten me in thy hot displeasure. For thine arrows stick fast in me, and thy hand presseth me sore.",&lt;A Psalm. Of David.


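**Note that in many Psalms chapters the BBE heading (for example `&lt;A Psalm. Of David.`) is captured as the first two sentences, because the periods inside the heading are treated as sentence boundaries.**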
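One way to gauge how widespread the problem is, before dropping the rows, is to flag trimmed BBE texts that open with a heading marker. The sketch below is illustrative only: `starts_with_heading` is a hypothetical helper that is not part of the original notebook, and it assumes that headings always begin with `<` or its escaped form `&lt;`, as in the sample above. The three strings are copied from that sample (Psalms115 is shortened to its first sentence).

```python
import html


def starts_with_heading(text: str) -> bool:
    """Return True if the text opens with a '<...>'-style heading marker."""
    # html.unescape turns '&lt;' into '<', so both forms are handled the same way
    return html.unescape(text).lstrip().startswith('<')


# Trimmed BBE texts taken from the sample output above
samples = {
    'Psalms3': '&lt;A Psalm. Of David.',
    'Psalms115': 'Not to us, O Lord, not to us, but to your name let glory be given, '
                 'because of your mercy and your unchanging faith.',
    'Psalms25': '&lt;Of David.&gt; To you, O Lord, my soul is lifted up.',
}

# Chapters whose trimmed BBE text starts with a heading
flagged = [chapter for chapter, text in samples.items() if starts_with_heading(text)]
print(flagged)
```

On the full DataFrame, the same helper could be passed to `df['BBE'].apply(...)` to count affected rows, assuming the column contains no missing values.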

In [38]:
# Create a mask that is True for rows where 'chapter' does not start with 'Psalms'
mask = ~df['chapter'].str.startswith('Psalms')

# Apply the mask to the DataFrame to filter out unwanted rows
filtered_df = df[mask]

In [42]:
original_row_count = df.shape[0]
filtered_row_count = filtered_df.shape[0]
rows_removed = original_row_count - filtered_row_count

# Print the results
print(f"Original DataFrame had {original_row_count} rows.")
print(f"Filtered DataFrame has {filtered_row_count} rows.")
print(f"{rows_removed} rows were removed, corresponding to Psalm chapters.")

Original DataFrame had 1189 rows.
Filtered DataFrame has 1039 rows.
150 rows were removed, corresponding to Psalm chapters.


#### Remove Isaiah52

The first two BBE sentences captured for Isaiah52 do not correspond to the first two KJV sentences, because the exclamation points in the BBE text were treated as sentence-ending punctuation. Keeping this mismatched pair would negatively affect model evaluation later, so the row is removed.

In [45]:
filtered_df[filtered_df['chapter'] == 'Isaiah52']

Unnamed: 0,chapter,KJV,BBE
577,Isaiah52,"Awake, awake; put on thy strength, O Zion; put on thy beautiful garments, O Jerusalem, the holy city: for henceforth there shall no more come into thee the uncircumcised and the unclean. Shake thyself from the dust; arise, and sit down, O Jerusalem: loose thyself from the bands of thy neck, O captive daughter of Zion.",Awake! awake!


In [46]:
filtered_df = filtered_df[filtered_df['chapter'] != 'Isaiah52']
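Isaiah52 was identified by inspection. A rough way to look for similar mismatches is to compare the word counts of the two trimmed texts and flag pairs where the BBE side is far shorter. The sketch below is a heuristic under stated assumptions, not the method used in this notebook: `word_ratio` and the `0.25` threshold are hypothetical choices made for illustration. The Isaiah52 texts are the ones shown above, and the Psalms115 pair from the earlier sample serves only as an example of a well-aligned row.

```python
def word_ratio(kjv: str, bbe: str) -> float:
    """Return the BBE word count divided by the KJV word count (0.0 if KJV is empty)."""
    kjv_words = len(kjv.split())
    return len(bbe.split()) / kjv_words if kjv_words else 0.0


# (chapter, trimmed KJV text, trimmed BBE text)
rows = [
    ('Isaiah52',
     "Awake, awake; put on thy strength, O Zion; put on thy beautiful garments, "
     "O Jerusalem, the holy city: for henceforth there shall no more come into thee "
     "the uncircumcised and the unclean. Shake thyself from the dust; arise, and sit "
     "down, O Jerusalem: loose thyself from the bands of thy neck, O captive daughter "
     "of Zion.",
     "Awake! awake!"),
    ('Psalms115',
     "Not unto us, O Lord, not unto us, but unto thy name give glory, for thy mercy, "
     "and for thy truth's sake. Wherefore should the heathen say, Where is now their God?",
     "Not to us, O Lord, not to us, but to your name let glory be given, because of "
     "your mercy and your unchanging faith. Why may the nations say, Where is now their God?"),
]

THRESHOLD = 0.25  # assumed cut-off; tune it by inspecting the flagged rows

# Chapters whose BBE text is much shorter than the KJV text
suspicious = [chapter for chapter, kjv, bbe in rows if word_ratio(kjv, bbe) < THRESHOLD]
print(suspicious)
```

A low ratio only suggests a mismatch; each flagged row still needs a manual look, since the two translations legitimately differ in length.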

### Save final clean and trimmed dataframe as csv file 'bible_cleaned_and_short_data.csv'

In [34]:
filtered_df.to_csv('bible_cleaned_and_short_data.csv', index=False) 