In [1]:
import pandas as pd

KJV = '/kaggle/input/bible-text-by-age/kjv_with_age.csv'
df = pd.read_csv(filepath_or_buffer=KJV).drop(columns=['id'])
df.columns = ['book_number', 'chapter', 'verse', 'text', 'book', 'age']
df.head()

Unnamed: 0,book_number,chapter,verse,text,book,age
0,1,1,1,In the beginning God created the heaven and th...,Genesis,1
1,1,1,2,"And the earth was without form, and void; and ...",Genesis,1
2,1,1,3,"And God said, Let there be light: and there wa...",Genesis,1
3,1,1,4,"And God saw the light, that it was good: and G...",Genesis,1
4,1,1,5,"And God called the light Day, and the darkness...",Genesis,1


First, let's do some basic counts.

In [2]:
df.nunique().to_frame().T

Unnamed: 0,book_number,chapter,verse,text,book,age
0,66,150,176,30834,66,11


We know from the data card that there are eleven ages. How many books are in each age?

In [3]:
df[['book', 'age']].drop_duplicates().T

Unnamed: 0,0,299,1533,2746,3605,4893,5852,6510,7128,7213,...,29939,29964,30267,30375,30480,30541,30646,30659,30674,30699
book,Genesis,Genesis,Exodus,Leviticus,Numbers,Deuteronomy,Joshua,Judges,Ruth,1 Samuel,...,Philemon,Hebrews,James,1 Peter,2 Peter,1 John,2 John,3 John,Jude,Revelation
age,1,2,3,3,3,3,3,4,4,4,...,11,11,11,11,11,11,11,11,11,12


It turns out that some books span more than one age (Genesis, for example, appears in both age 1 and age 2), but we can still approximate the answer we want without assigning any of those books to a single age.
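To list exactly which books span more than one age, one option is to count distinct ages per book. The sketch below uses a small made-up frame (`toy`) in place of `df[['book', 'age']]`; the rows are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical stand-in for df[['book', 'age']]; rows are illustrative only.
toy = pd.DataFrame({
    'book': ['Genesis', 'Genesis', 'Genesis', 'Exodus', 'Exodus', 'Job'],
    'age':  [1, 1, 2, 3, 3, 2],
})

# Count the distinct ages each book appears in.
ages_per_book = toy.groupby('book')['age'].nunique()

# Keep only the books that appear in more than one age.
multi_age_books = ages_per_book[ages_per_book > 1].index.tolist()
print(multi_age_books)
```

On the real data the same two lines would run against `df` directly, assuming the column names set in the first cell.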

In [4]:
df[['book', 'age']].drop_duplicates()['age'].value_counts().sort_index().to_frame().T

age,1,2,3,4,5,6,7,8,10,11,12
count,1,2,5,3,8,9,8,6,4,22,1


This is an interesting result: two ages (1 and 12) have one book each, and one age label (9) has no books at all.
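The gap can also be found programmatically rather than by eye. This is a minimal sketch that assumes the age labels are meant to be consecutive integers; the list of observed labels is copied from the output above.

```python
import pandas as pd

# Age labels copied from the value_counts output above.
observed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12])

# Assumption: labels should run consecutively from the smallest to the largest.
expected = range(int(observed.min()), int(observed.max()) + 1)

# Any expected label that never occurs is an age with no books.
missing = sorted(set(expected) - set(observed.tolist()))
print(missing)
```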

In [5]:
ages_df = df[['book', 'age']].drop_duplicates(ignore_index=True)
ages_df[ages_df['age'] == 2]

Unnamed: 0,book,age
1,Genesis,2
20,Job,2


The source of this dataset assigns an early date to Job, which is not especially controversial but is not universally accepted either.

There are multiple interesting judgment calls in putting books into ages, but they're not all interesting enough to dig into here.

Instead, let's look at the verses that are identical. How many would we expect to find?

In [6]:
verse_counts_df = df['text'].value_counts().to_frame()
verse_counts_df[verse_counts_df['count'] > 1]

text,count
"And the LORD spake unto Moses, saying,",72
"And the word of the LORD came unto me, saying,",12
"One young bullock, one ram, one lamb of the first year, for a burnt offering:",12
One kid of the goats for a sin offering:,12
"One golden spoon of ten shekels, full of incense:",10
...,...
"And Solomon had horses brought out of Egypt, and linen yarn: the king's merchants received the linen yarn at a price.",2
"The men of Anathoth, an hundred twenty and eight.",2
And there was war between Asa and Baasha king of Israel all their days.,2
"There went up a smoke out of his nostrils, and fire out of his mouth devoured: coals were kindled by it.",2


There are 120 unique verses that appear more than once, and the most prevalent one appears 72 times. That seems like a lot.
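A natural follow-up is to check the count of repeated texts directly and to see where the most frequent one occurs. The sketch below uses a made-up frame (`toy`) with placeholder verse text and illustrative book labels; only the column names `book` and `text` match the real `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; texts and book labels are placeholders.
toy = pd.DataFrame({
    'book': ['Book A', 'Book B', 'Book B', 'Book C'],
    'text': ['repeated verse', 'repeated verse', 'repeated verse', 'unique verse'],
})

# value_counts sorts by frequency, most common first.
counts = toy['text'].value_counts()

# Distinct texts that occur more than once.
repeated = counts[counts > 1]
n_repeated = len(repeated)

# The most frequent repeated text, and how its occurrences split across books.
top_text = repeated.index[0]
books_with_top = toy.loc[toy['text'] == top_text, 'book'].value_counts()

print(n_repeated)
print(books_with_top)
```

Grouping the occurrences by book (or by `age`, on the real data) would show whether a repeated verse is a formula confined to a few books or is spread across the whole text.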