# Capstone 3 - Data Wrangling

In this notebook, I will extract the verses of the Hebrew Bible from the Text-Fabric API, place them in a DataFrame, and create labeled and unlabeled subsets of the data for modeling. 

In [2]:
#import the necessary packages
import re
import pandas as pd
from tf.app import use

In [4]:
A = use('etcbc/bhsa', hoist=globals())

This is Text-Fabric 9.2.5
Api reference : https://annotation.github.io/text-fabric/tf/cheatsheet.html

122 features found and 0 ignored


## Creating the Initial DataFrame

Text-Fabric offers 12 different formatting options for the biblical text. I opted for the 'text-phono-full' format since it most closely approximates the Society of Biblical Literature (SBL) transcription system and uses fewer non-Latin characters than the other options. I then defined a helper function that strips unneeded features of the transcription and used it to transfer the verses of the Hebrew Bible from the Text-Fabric API into a DataFrame.

In [5]:
whole_text = " ".join([T.text(word, fmt='text-phono-full') for word in F.otype.s('word')])

verses = whole_text.split(".")

In [8]:
def clean_text(sentence, replacement_vals):
    for k, v in replacement_vals.items():
        sentence = re.sub(k, v, sentence)
    return sentence.strip()

In [9]:
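# Keys are regex patterns, applied in insertion order; raw strings avoid invalid-escape warnings
replacement_vals = {" f " : "", 
                    " s " : "", 
                    "ˌ" : "", 
                    "ˈ" : "", 
                    r"\." : "", 
                    "-" : " ", 
                    r"\[" : "", 
                    r"\]" : "", 
                    "v" : "b", 
                    "ḡ" : "g", 
                    "ḏ" : "d", 
                    "ḵ" : "k", 
                    "f" : "p", 
                    "ṯ" : "t",  
                    "ᵊ" : "ə", 
                    "ᵉ" : "ĕ", 
                    "ₐ" : "a", 
                    "ᵃ" : "ă",
                    "eʸ" : "ê",
                    "eh" : "ê",
                    r"\*" : "",
                    # collapse a repeated letter at the start of a word into one
                    r'\b(\w)\1*' : r'\1'}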
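
Because `clean_text` applies the patterns in dictionary order, the order of the keys matters: the standalone ` f ` token has to be removed before the general `f` to `p` rule runs. The sketch below illustrates this with two rules from the dictionary on a made-up string. The sample input and the names `ordered`, `reordered`, `good`, and `bad` are assumptions for illustration only.

```python
import re

def clean_text(sentence, replacement_vals):
    # Same logic as the helper above: apply each regex in insertion order.
    for pattern, repl in replacement_vals.items():
        sentence = re.sub(pattern, repl, sentence)
    return sentence.strip()

# Two rules taken from the dictionary above, in their original order.
ordered = {" f ": "", "f": "p"}
# The same two rules with the order swapped (hypothetical, for contrast).
reordered = {"f": "p", " f ": ""}

sample = " f nefeš"  # made-up input: a stray marker followed by one word

good = clean_text(sample, ordered)    # marker removed first, then f -> p
bad = clean_text(sample, reordered)   # f -> p runs first, so the marker survives as "p"

print(repr(good))
print(repr(bad))
```

With the original order the marker disappears, while the swapped order leaves a stray `p` at the start of the verse.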

In [11]:
df = pd.DataFrame({'Verse' : ['{} {}:{}'.format(*T.sectionFromNode(verse)) for verse in F.otype.s("verse")],
                   'Text' : [clean_text(verse, replacement_vals) for verse in verses]})

df.head()

,Verse,Text
0,Genesis 1:1,bə rēšît bārā ʔĕlōhîm ʔēt ha šāmayim wə ʔ...
1,Genesis 1:2,wə hā ʔāreṣ hāyətā tōhû wā bōhû wə ḥōšek ...
2,Genesis 1:3,wa yōmer ʔĕlōhîm yəhî ʔôr wa yəhî ʔôr
3,Genesis 1:4,wa yar ʔĕlōhîm ʔet hā ʔôr kî ṭôb wa yabd...
4,Genesis 1:5,wa yiqrā ʔĕlōhîm lā ʔôr yôm wə la ḥōšek ...


To make the data easier to work with, I replaced the full names of biblical books with short abbreviations based on the SBL Handbook of Style (Judges is shortened to 'Jud' here rather than SBL's 'Judg').


In [12]:
replacement_vals_books = {'Genesis' : 'Gen', 
                          'Exodus' : 'Exod', 
                          'Numbers' : 'Num', 
                          'Leviticus' : 'Lev', 
                          'Deuteronomy' : 'Deut', 
                          'Joshua' : 'Josh', 
                          'Judges' : 'Jud',
                          '1_Samuel' : '1 Sam', 
                          '2_Samuel' : '2 Sam', 
                          '1_Kings' : '1 Kgs', 
                          '2_Kings' : '2 Kgs', 
                          'Isaiah' : 'Isa', 
                          'Jeremiah' : 'Jer', 
                          'Ezekiel' : 'Ezek', 
                          'Hosea' : 'Hos', 
                          'Obadiah' : 'Obad', 
                          'Micah' : 'Mic', 
                          'Nahum' : 'Nah', 
                          'Habakkuk' : 'Hab', 
                          'Zephaniah' : 'Zeph', 
                          'Haggai' : 'Hag', 
                          'Zechariah' : 'Zech', 
                          'Malachi' : 'Mal', 
                          'Psalms' : 'Ps', 
                          'Proverbs' : 'Prov', 
                          'Song_of_Songs' : 'Song', 
                          'Ecclesiastes' : 'Eccl', 
                          'Lamentations' : 'Lam', 
                          'Esther' : 'Esth', 
                          'Daniel' : 'Dan', 
                          'Nehemiah' : 'Neh', 
                          '1_Chronicles' : '1 Chr', 
                          '2_Chronicles' : '2 Chr'}

# Assign the result back instead of relying on an in-place replace of a column selection
for k, v in replacement_vals_books.items():
    df['Verse'] = df['Verse'].str.replace(k, v, regex=True)
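
The loop above treats each book name as a regular expression that can match anywhere in the reference. A slightly stricter variant, sketched here in plain Python, anchors the match to the start of the string and leaves unlisted books untouched. The three-entry `abbrev` mapping, `book_pattern`, and the `abbreviate` helper are hypothetical names used only for this illustration.

```python
import re

# Hypothetical three-entry subset of the book mapping.
abbrev = {"Genesis": "Gen", "1_Samuel": "1 Sam", "Song_of_Songs": "Song"}

# One alternation, anchored at the start of the reference and followed by
# the space that separates the book name from chapter:verse.
book_pattern = re.compile(
    r"^(" + "|".join(re.escape(name) for name in abbrev) + r")(?= )"
)

def abbreviate(reference):
    # Swap the full book name for its abbreviation; leave other text as is.
    return book_pattern.sub(lambda m: abbrev[m.group(1)], reference)

refs = ["Genesis 1:1", "1_Samuel 3:4", "Song_of_Songs 2:1", "Ruth 1:1"]
short = [abbreviate(ref) for ref in refs]
print(short)
```

Books that are not in the mapping, such as Ruth in this toy list, pass through unchanged.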

Text-Fabric does not include a verse marker between 1 Kings 16:26 and 16:27, so the two verses end up in a single row and the text of every verse from 1 Kings 16:28 onward is shifted up by one row. We need to fix this problem before moving forward.
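
The effect of a single missing marker can be reproduced on a toy string. In the sketch below the labels and text are made up, and `.` plays the role of the verse-end marker. When every verse ends with a marker, splitting yields one more chunk than there are verses (the trailing empty string), so equal counts combined with an empty final chunk hint that a marker is missing somewhere.

```python
# Made-up labels and text, not BHSA data.
labels = ["v1", "v2", "v3", "v4"]
text = "aaa. bbb ccc. ddd."  # the marker between v2 ("bbb") and v3 ("ccc") is missing

chunks = [chunk.strip() for chunk in text.split(".")]

# Pair each label with the chunk that lands in its row.
for label, chunk in zip(labels, chunks):
    print(label, repr(chunk))

# Symptom: the counts match only because the trailing empty string
# fills the slot of the last verse.
misaligned = len(chunks) == len(labels) and chunks[-1] == ""
print(misaligned)
```

Here `v2` holds two verses, `v3` holds the text of `v4`, and `v4` is empty, which mirrors the empty final row in the DataFrame.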

In [13]:
df.tail()

,Verse,Text
23208,2 Chr 36:19,wa yegel ha šəʔērît min ha ḥereb ʔel bābe...
23209,2 Chr 36:20,lə mallôt dəbar yəhwāh bə pî yirməyāhû ʕa...
23210,2 Chr 36:21,û bi šənat ʔaḥat lə kôreš melek pāras li ...
23211,2 Chr 36:22,kō ʔāmar kôreš melek pāras kol mamləkôt ...
23212,2 Chr 36:23,


In [14]:
# Work on explicit copies so the assignments below modify real DataFrames
first_half = df.iloc[0:9227].copy()
second_half = df.iloc[9227:].copy()
second_half['Text'] = second_half['Text'].shift(1)

In [15]:
# Restore the two merged verses, 1 Kings 16:26 and 16:27, by position
text_col = df.columns.get_loc('Text')
first_half.iloc[9226, text_col] = 'wa yēlek  bə kol  derek  yārobʕām  ben  nəbāṭ  û bə ḥaṭṭātô  ʔăšer  heḥĕṭî  ʔet  yiśrāʔēl  lə hakʕîs  ʔet  yəhwāh  ʔĕlōhê  yiśrāʔēl  bə hablêhem'
second_half.iloc[0, text_col] = 'wə yeter  dibrê  ʕomrî  ʔăšer  ʕāśā  û gəbûrātô  ʔăšer  ʕāśā  hă lō  hēm  kətûbîm  ʕal  sēper  dibrê  ha yāmîm  lə maləkê  yiśrāʔēl'
second_half = second_half.dropna()
df = pd.concat([first_half, second_half])
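
The realignment can be tried on a toy frame before touching the full data. This is a minimal sketch with made-up labels and text. The names `toy`, `split_at`, `head`, `tail`, and `fixed` are assumptions for illustration, and the positional `.iloc[row, column]` assignments on explicit copies are one way to avoid chained-assignment warnings.

```python
import pandas as pd

# v2 wrongly holds two verses, so every later text sits one row too early.
toy = pd.DataFrame({"Verse": ["v1", "v2", "v3", "v4"],
                    "Text": ["aaa", "bbb ccc", "ddd", ""]})

split_at = 2  # position of the first row whose text belongs to the next verse
head = toy.iloc[:split_at].copy()
tail = toy.iloc[split_at:].copy()

# Push the texts in the tail down by one row; its first row becomes NaN.
tail["Text"] = tail["Text"].shift(1)

text_col = toy.columns.get_loc("Text")
head.iloc[-1, text_col] = "bbb"  # v2 keeps only its own text
tail.iloc[0, text_col] = "ccc"   # v3 receives the text that was merged into v2

fixed = pd.concat([head, tail])
print(fixed)
```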


To make the DataFrame easier to navigate, I replaced the numerical index with the verse column. Now we can subset the DataFrame on verses.

In [16]:
df.set_index('Verse', inplace=True)
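
With verse references as the index, label-based slices follow row order rather than alphabetical order and include both endpoints. The toy frame below is made up for illustration. The sketch assumes the labels are unique and that both endpoints exist in the index.

```python
import pandas as pd

# Canonical order, not alphabetical: Exodus follows Genesis.
toy = pd.DataFrame(
    {"Verse": ["Gen 1:1", "Gen 1:2", "Gen 1:3", "Exod 1:1"],
     "Text": ["a", "b", "c", "d"]}
).set_index("Verse")

single = toy.loc["Gen 1:2", "Text"]     # one verse
span = toy.loc["Gen 1:2":"Exod 1:1"]    # inclusive range across a book boundary

print(single)
print(span)
```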

Let's double check that all of the transformations worked. 

In [17]:
df.tail()

Verse,Text
2 Chr 36:19,wa yiśrəpû ʔet bêt hā ʔĕlōhîm wa yənattəṣû...
2 Chr 36:20,wa yegel ha šəʔērît min ha ḥereb ʔel bābe...
2 Chr 36:21,lə mallôt dəbar yəhwāh bə pî yirməyāhû ʕa...
2 Chr 36:22,û bi šənat ʔaḥat lə kôreš melek pāras li ...
2 Chr 36:23,kō ʔāmar kôreš melek pāras kol mamləkôt ...


Despite its name, the Hebrew Bible contains roughly 270 verses in Aramaic (Jeremiah 10:11; Daniel 2:4b-7:28; Ezra 4:8-6:18; 7:12-26), a Semitic language closely related to Hebrew. I removed these verses because they would add nothing to a biblical Hebrew language model.

In [18]:
df.drop('Jer 10:11', inplace=True)
df.drop(df.loc['Dan 2:5' : 'Dan 7:28'].index, inplace=True)
df.drop(df.loc['Ezra 4:8' : 'Ezra 6:18'].index, inplace=True)
df.drop(df.loc['Ezra 7:12' : 'Ezra 7:26'].index, inplace=True)

In [19]:
# Keep only the Hebrew half of Dan 2:4; a single .loc call writes to df itself
df.loc['Dan 2:4', 'Text'] = 'wa yədabbərû  ha kaśdîm  la  melek  ʔărāmît'

df.loc['Dan 2:4', 'Text']

'wa yədabbərû  ha kaśdîm  la  melek  ʔărāmît'
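
The two operations used for the Aramaic passages, dropping an inclusive label range and overwriting one cell, can be sketched on a toy frame. The placeholder texts are made up. Assigning through a single `.loc[row, column]` call is used here because chained indexing may write to a temporary copy instead of the frame.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Text": ["hebrew clause + aramaic speech", "aramaic", "aramaic", "hebrew"]},
    index=pd.Index(["Dan 2:4", "Dan 2:5", "Dan 2:6", "Dan 8:1"], name="Verse"),
)

# Drop an inclusive range of verses by label.
toy = toy.drop(toy.loc["Dan 2:5":"Dan 2:6"].index)

# Overwrite one cell so that only the Hebrew part of the verse remains.
toy.loc["Dan 2:4", "Text"] = "hebrew clause"

print(toy)
```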

Finally, let's save the dataframe to csv so that we can access it later. 

In [20]:
df.to_csv('HebrewBiblebyVerse.csv')
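
A quick round trip shows what survives the CSV export. The sketch below writes a made-up toy frame to an in-memory buffer instead of a file. One detail worth knowing is that `read_csv` reads empty fields back as missing values by default, so an empty verse text would return as NaN.

```python
import io
import pandas as pd

toy = pd.DataFrame(
    {"Text": ["bə rēšît", "wə hā ʔāreṣ", ""]},
    index=pd.Index(["Gen 1:1", "Gen 1:2", "Gen 1:3"], name="Verse"),
)

buffer = io.StringIO()
toy.to_csv(buffer)   # the index is written as the first column, named "Verse"
buffer.seek(0)

# index_col restores the verse references as the index.
restored = pd.read_csv(buffer, index_col="Verse")

print(restored)
print(restored["Text"].isna().tolist())
```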

## Creating the Labeled and Unlabeled Datasets

Now that all of the verses of the Hebrew Bible are cleaned and stored in a DataFrame, I can create the labeled and unlabeled datasets. To do so, I will first label each chronological phase individually and then combine them into the labeled dataset. Then I will create the unlabeled dataset by subsetting the original DataFrame to exclude the labeled verses.

### Archaic Biblical Hebrew

Archaic Biblical Hebrew (ABH) is the earliest and most poorly attested phase of the Hebrew language. It can be found in the poems from Genesis 49:2-27; Exodus 15:1-18; Numbers 23:7-10, 18-24; 24:3-9, 15-24; Deuteronomy 32:1-43; 33:2-29; Judges 5:2-31; 2 Samuel 22:2-51; Psalms 18:1-50; 68:2-35. 

In [21]:
df = pd.read_csv('HebrewBiblebyVerse.csv', index_col='Verse')

df.tail()

Verse,Text
2 Chr 36:19,wa yiśrəpû ʔet bêt hā ʔĕlōhîm wa yənattəṣû...
2 Chr 36:20,wa yegel ha šəʔērît min ha ḥereb ʔel bābe...
2 Chr 36:21,lə mallôt dəbar yəhwāh bə pî yirməyāhû ʕa...
2 Chr 36:22,û bi šənat ʔaḥat lə kôreš melek pāras li ...
2 Chr 36:23,kō ʔāmar kôreš melek pāras kol mamləkôt ...


In [22]:
ABH = pd.concat([df['Gen 49:2':'Gen 49:27'], 
                 df['Exod 15:1':'Exod 15:18'], 
                 df['Num 23:7' : 'Num 23:10'], 
                 df['Num 23:18' : 'Num 23:24'], 
                 df['Num 24:3' : 'Num 24:9'], 
                 df['Num 24:15' : 'Num 24:24'], 
                 df['Deut 32:1' : 'Deut 32:43'], 
                 df['Deut 33:2' : 'Deut 33:29'], 
                 df['Jud 5:2' : 'Jud 5:31'], 
                 df['2 Sam 22:2' : '2 Sam 22:51'], 
                 df['Ps 18:1' : 'Ps 18:50'], 
                 df['Ps 68:2' : 'Ps 68:35']])

In [23]:
ABH['Stage'] = 'Archaic Biblical Hebrew'

ABH.head()

Verse,Text,Stage
Gen 49:2,hiqqābəṣû wə šimʕû bənê yaʕăqōb wə šimʕû ...,Archaic Biblical Hebrew
Gen 49:3,rəʔûbēn bəkōrî ʔattā kōḥî wə rēšît ʔônî ...,Archaic Biblical Hebrew
Gen 49:4,paḥaz ka mayim ʔal tôtar kî ʕālîtā mišk...,Archaic Biblical Hebrew
Gen 49:5,šimʕôn wə lēwî ʔaḥîm kəlê ḥāmās məkērōtêhem,Archaic Biblical Hebrew
Gen 49:6,bə sōdām ʔal tābō napšî bi qəhālām ʔal t...,Archaic Biblical Hebrew
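
The same slice, concatenate, and label pattern recurs for every stage, so it could be wrapped in a small helper. The `label_ranges` function below is a hypothetical sketch, not part of the notebook. It assumes a unique verse index that contains both endpoints of each range.

```python
import pandas as pd

def label_ranges(frame, ranges, stage):
    """Collect inclusive (start, end) verse ranges and tag them with a stage."""
    parts = [frame.loc[start:end] for start, end in ranges]
    # assign() returns a new DataFrame, so the source frame is left untouched.
    return pd.concat(parts).assign(Stage=stage)

# Made-up toy frame with six verses.
toy = pd.DataFrame(
    {"Text": list("abcdef")},
    index=pd.Index([f"Gen 49:{n}" for n in range(1, 7)], name="Verse"),
)

poem = label_ranges(
    toy,
    [("Gen 49:2", "Gen 49:3"), ("Gen 49:5", "Gen 49:6")],
    "Archaic Biblical Hebrew",
)
print(poem)
```

Because the helper returns a new object, it does not trigger the copy-versus-view warning that appears when a column is added to a plain slice.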


### Classical Biblical Hebrew

Classical Biblical Hebrew (CBH) refers to the form of Hebrew used during the monarchic period of Israelite history (~900-586 BCE). Although several scholars have argued that CBH makes up the bulk of the Hebrew Bible, I've focused on texts that refer to historical events and can therefore be assigned to CBH with a higher degree of certainty. 

In [24]:
# Take an explicit copy so that adding a column does not trigger SettingWithCopyWarning
CBH = df.loc['1 Sam 1:1' : '2 Kgs 25:30'].copy()

In [25]:
CBH['Stage'] = 'Classical Biblical Hebrew'

CBH.head()


Verse,Text,Stage
1 Sam 1:1,wa yəhî ʔîš ʔeḥād min hā rāmātayim ṣôpîm ...,Classical Biblical Hebrew
1 Sam 1:2,wə lô šəttê nāšîm šēm ʔaḥat ḥannā wə šēm...,Classical Biblical Hebrew
1 Sam 1:3,wə ʕālā hā ʔîš ha hû mē ʕîrô mi yāmîm yām...,Classical Biblical Hebrew
1 Sam 1:4,wa yəhî ha yôm wa yizbaḥ ʔelqānā wə nātan ...,Classical Biblical Hebrew
1 Sam 1:5,û lə ḥannā yittēn mānā ʔaḥat ʔappāyim kî ...,Classical Biblical Hebrew


### Transitional Biblical Hebrew

Transitional Biblical Hebrew (TBH) refers to the Hebrew used around the exilic period (586-539 BCE) when Judah was under Babylonian control. True to its name, it represents a transitional phase between Classical and Late Biblical Hebrew, often mixing features of both. 

In [26]:
TBH = pd.concat([df['Isa 40:1' : 'Isa 54:17'], df['Jer 1:1' : 'Ezek 48:35']])

In [27]:
TBH['Stage'] = 'Transitional Biblical Hebrew'

TBH.head()

Verse,Text,Stage
Isa 40:1,naḥămû naḥămû ʕammî yōmar ʔĕlōhêkem,Transitional Biblical Hebrew
Isa 40:2,dabbərû ʕal lēb yərûšālaim wə qirʔû ʔēlêh...,Transitional Biblical Hebrew
Isa 40:3,qôl qôrē ba midbār pannû derek yəhwāh y...,Transitional Biblical Hebrew
Isa 40:4,kol gê yinnāśē wə kol har wə gibʕā yišpā...,Transitional Biblical Hebrew
Isa 40:5,wə niglā kəbôd yəhwāh wə rāʔû kol bāśār ...,Transitional Biblical Hebrew


### Late Biblical Hebrew

Late Biblical Hebrew (LBH) is the last recorded phase of biblical Hebrew and hails from the Persian period (539-333 BCE). It can be found in the books of Ecclesiastes, Esther, Daniel, Ezra, Nehemiah, and the non-synoptic portions of 1 and 2 Chronicles.

In [28]:
LBH = pd.concat([df['Eccl 1:1' : 'Eccl 12:14'], 
                 df['Esth 1:1' : '1 Chr 9:44'], 
                 df['1 Chr 12:1' : '1 Chr 12:40'], 
                 df['1 Chr 15:1' : '1 Chr 15:24'], 
                 df['1 Chr 16:7' : '1 Chr 16:43'], 
                 df['1 Chr 21:1' : '1 Chr 29:19'], 
                 df['2 Chr 7:1' : '2 Chr 7:3'], 
                 df['2 Chr 14:9' : '2 Chr 15:7'], 
                 df['2 Chr 17:1' : '2 Chr 17:19'], 
                 df['2 Chr 21:12' : '2 Chr 21:17'], 
                 df['2 Chr 24:15' : '2 Chr 24:22'], 
                 df['2 Chr 26:6' : '2 Chr 26:21'], 
                 df['2 Chr 29:3' : '2 Chr 31:21'], 
                 df['2 Chr 33:10' : '2 Chr 33:20'], 
                 df['2 Chr 34:3' : '2 Chr 34:7'], 
                 df['2 Chr 36:22' : '2 Chr 36:23']])

In [29]:
LBH['Stage'] = 'Late Biblical Hebrew'

len(LBH)

2087

The ABH dataset is much smaller than the other corpora (307 vs 2000-3000 verses). To avoid working with an unbalanced dataset, I drew a subsample the size of ABH from each of the larger CBH, TBH, and LBH DataFrames to create the labeled dataset. The full, unsampled concatenation is kept separately as 'labeled_verses2'.

In [30]:
sample_size = len(ABH) 

labeled_verses = pd.concat([ABH, 
                            CBH.sample(sample_size, random_state=42), 
                            TBH.sample(sample_size, random_state=42), 
                            LBH.sample(sample_size, random_state=42)])
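
Downsampling every class to the size of the smallest one can be checked on toy data. The class sizes and texts below are made up. The sketch also shows that a fixed `random_state` makes the draw repeatable.

```python
import pandas as pd

# Made-up class sizes: one small class and two larger ones.
sizes = {"ABH": 3, "CBH": 10, "TBH": 7}
stages = {
    name: pd.DataFrame({"Text": [f"{name}-{i}" for i in range(n)], "Stage": name})
    for name, n in sizes.items()
}

# Every class is cut down to the size of the smallest class.
sample_size = min(len(frame) for frame in stages.values())
balanced = pd.concat(
    [frame.sample(sample_size, random_state=42) for frame in stages.values()]
)

counts = balanced["Stage"].value_counts()
print(counts)

# The same seed draws the same rows again.
first_draw = stages["CBH"].sample(sample_size, random_state=42)
second_draw = stages["CBH"].sample(sample_size, random_state=42)
repeatable = first_draw["Text"].tolist() == second_draw["Text"].tolist()
```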

In [31]:
labeled_verses2 = pd.concat([ABH, CBH, TBH, LBH])

In [32]:
unlabeled_verses = df[~df.index.isin(labeled_verses.index)]
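
One consequence of filtering against the sampled set is worth seeing on toy data: labeled verses that were not drawn into the sample remain in the unlabeled pool. Whether that is intended depends on the modeling plan. The sketch below only illustrates the difference, using made-up verses and hypothetical variable names.

```python
import pandas as pd

toy = pd.DataFrame(
    {"Text": list("abcdef")},
    index=pd.Index([f"v{n}" for n in range(1, 7)], name="Verse"),
)

full_labeled = toy.loc["v1":"v4"]                      # every verse with a known stage
sampled_labeled = full_labeled.sample(2, random_state=0)

# Excluding only the sampled rows keeps the two unsampled labeled verses.
unlabeled_vs_sample = toy[~toy.index.isin(sampled_labeled.index)]
# Excluding the full labeled set leaves only verses without a stage.
unlabeled_vs_full = toy[~toy.index.isin(full_labeled.index)]

print(len(unlabeled_vs_sample), len(unlabeled_vs_full))
```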

Finally, let's save the labeled and unlabeled datasets to csv for later use. 

In [33]:
labeled_verses.to_csv('labeled_verses.csv')

In [34]:
labeled_verses2.to_csv('labeled_verses2.csv')

In [35]:
unlabeled_verses.to_csv('unlabeled_verses.csv')