In [1]:
import pandas as pd

KJV = '/kaggle/input/bible-text-by-age/kjv_with_age.csv'
df = pd.read_csv(filepath_or_buffer=KJV).drop(columns=['id'])
df.columns = ['book_number', 'chapter', 'verse', 'text', 'book', 'age']
df.head()

Unnamed: 0,book_number,chapter,verse,text,book,age
0,1,1,1,In the beginning God created the heaven and th...,Genesis,1
1,1,1,2,"And the earth was without form, and void; and ...",Genesis,1
2,1,1,3,"And God said, Let there be light: and there wa...",Genesis,1
3,1,1,4,"And God saw the light, that it was good: and G...",Genesis,1
4,1,1,5,"And God called the light Day, and the darkness...",Genesis,1


First, let's do some basic counts: the number of distinct values in each column.

In [2]:
df.nunique().to_frame().T

Unnamed: 0,book_number,chapter,verse,text,book,age
0,66,150,176,30834,66,11


We know from the data card that there are eleven ages. How many books are in each age?

In [3]:
df[['book', 'age']].drop_duplicates(ignore_index=True).T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,59,60,61,62,63,64,65,66,67,68
book,Genesis,Genesis,Exodus,Leviticus,Numbers,Deuteronomy,Joshua,Judges,Ruth,1 Samuel,...,Philemon,Hebrews,James,1 Peter,2 Peter,1 John,2 John,3 John,Jude,Revelation
age,1,2,3,3,3,3,3,4,4,4,...,11,11,11,11,11,11,11,11,11,12


It turns out that some books fit into more than one age: the 66 books produce 69 distinct book/age pairs. We should still be able to approximate the answer we want without having to allocate any of those books to one age or the other.
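To see which books span more than one age, one option is to count distinct ages per book. This is a minimal sketch on a few book/age pairs copied from the output above; `pairs` and `multi_age` are names introduced here for illustration, and on the real data `pairs` would be `df[['book', 'age']].drop_duplicates()`.

```python
import pandas as pd

# A handful of (book, age) pairs taken from the In [3] output above.
pairs = pd.DataFrame({
    'book': ['Genesis', 'Genesis', 'Exodus', 'Leviticus', 'Numbers'],
    'age': [1, 2, 3, 3, 3],
})

# Count distinct ages per book and keep the books with more than one.
ages_per_book = pairs.groupby('book')['age'].nunique()
multi_age = ages_per_book[ages_per_book > 1]
print(multi_age)
```

On this sample only Genesis qualifies; the full dataset should show the other multi-age books as well.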

In [4]:
df[['book', 'age']].drop_duplicates()['age'].value_counts().sort_index().to_frame().T

age,1,2,3,4,5,6,7,8,10,11,12
count,1,2,5,3,8,9,8,6,4,22,1


This is an interesting result: two ages (1 and 12) have one book each, and one age (9) has none.
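The single-book ages and the gap can also be read off programmatically. The sketch below rebuilds the counts from the In [4] output and assumes the ages are meant to be consecutive integers, which the data card may or may not guarantee; `counts`, `single_book_ages`, and `missing_ages` are illustrative names.

```python
import pandas as pd

# Books per age, copied from the In [4] output above.
counts = pd.Series(
    [1, 2, 5, 3, 8, 9, 8, 6, 4, 22, 1],
    index=[1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12],
    name='count',
)

# Ages that contain exactly one book.
single_book_ages = counts[counts == 1].index.tolist()

# Ages absent from the data, assuming consecutive integer numbering.
expected = set(range(counts.index.min(), counts.index.max() + 1))
missing_ages = sorted(expected - set(counts.index))

print(single_book_ages, missing_ages)
```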

In [5]:
ages_df = df[['book', 'age']].drop_duplicates(ignore_index=True)
ages_df[ages_df['age'] == 2]

Unnamed: 0,book,age
1,Genesis,2
20,Job,2


The source of this dataset puts an early date on Job, which is not exactly controversial, but neither is it universally accepted.

There are multiple interesting judgment calls in putting books into ages, but they're not all interesting enough to dig into here.

Instead, let's look at the verses that are identical. How many would we expect to find?

In [6]:
verse_counts_df = df['text'].value_counts().to_frame()
verse_counts_df[verse_counts_df['count'] > 1]

Unnamed: 0_level_0,count
text,Unnamed: 1_level_1
"And the LORD spake unto Moses, saying,",72
"And the word of the LORD came unto me, saying,",12
"One young bullock, one ram, one lamb of the first year, for a burnt offering:",12
One kid of the goats for a sin offering:,12
"One golden spoon of ten shekels, full of incense:",10
...,...
"And Solomon had horses brought out of Egypt, and linen yarn: the king's merchants received the linen yarn at a price.",2
"The men of Anathoth, an hundred twenty and eight.",2
And there was war between Asa and Baasha king of Israel all their days.,2
"There went up a smoke out of his nostrils, and fire out of his mouth devoured: coals were kindled by it.",2


There are 120 distinct verse texts that appear more than once, and the most common one appears 72 times. That seems like a lot.
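To see where the repeated verses occur, and not just how often, `duplicated(keep=False)` flags every row whose text appears more than once. The rows below are made-up placeholders, not data from the dataset; on the real data `toy` would simply be `df`.

```python
import pandas as pd

# Placeholder rows (hypothetical, for illustration only).
toy = pd.DataFrame({
    'book': ['A', 'A', 'B', 'B', 'C'],
    'chapter': [1, 2, 1, 1, 3],
    'verse': [1, 1, 4, 5, 2],
    'text': ['repeated text', 'unique one', 'repeated text',
             'unique two', 'repeated text'],
})

# keep=False marks every occurrence of a repeated text, not only the later ones.
repeats = toy[toy['text'].duplicated(keep=False)]

# Number of different books in which each repeated text occurs.
spread = repeats.groupby('text')['book'].nunique()
print(repeats[['book', 'chapter', 'verse']])
print(spread)
```

A repeated text confined to one book is often a formula or a list refrain, while one that spans several books may point to parallel passages.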

Before we proceed, let's note that chapter markings are uncommon in the oldest texts (Psalms being an exception) and that verse markings were added much later. Different books are also broken up into chapters and verses somewhat unevenly.

Which books have only one chapter?

In [7]:
chapter_count_df = df[['book', 'chapter']].drop_duplicates(ignore_index=True)['book'].value_counts().to_frame()
chapter_count_df[chapter_count_df['count'] == 1].T

book,Philemon,Obadiah,2 John,3 John,Jude
count,1,1,1,1,1


Which books are the longest when measured by chapters?

In [8]:
chapter_count_df.head(n=14).T

book,Psalms,Isaiah,Jeremiah,Genesis,Ezekiel,Job,Exodus,Numbers,2 Chronicles,Deuteronomy,Proverbs,1 Samuel,1 Chronicles,Acts
count,150,66,52,50,48,42,40,36,36,34,31,31,29,28


Acts, the fourteenth-longest book overall by chapter count, is the longest in the New Testament by that measure (Matthew also has 28 chapters).

In [9]:
df['book'].value_counts().to_frame().head(n=10).T

book,Psalms,Genesis,Jeremiah,Isaiah,Numbers,Ezekiel,Exodus,Luke,Matthew,Job
count,2461,1533,1364,1292,1288,1273,1213,1151,1071,1070


If we count verses instead of chapters, two New Testament books (Luke and Matthew) make the top ten.
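The Old/New Testament split can be made explicit with a derived column. This sketch assumes `book_number` follows the usual Protestant canonical order, so that 1 through 39 is the Old Testament and 40 through 66 is the New; the source confirms only that Genesis is book 1 and that there are 66 books. The verse counts come from the In [9] output, and the `testament` column is an illustrative addition.

```python
import numpy as np
import pandas as pd

# Verse counts from the In [9] output; book_number values are the
# canonical positions and are an assumption about the dataset.
top = pd.DataFrame({
    'book': ['Psalms', 'Genesis', 'Luke', 'Matthew', 'Job'],
    'book_number': [19, 1, 42, 40, 18],
    'verses': [2461, 1533, 1151, 1071, 1070],
})

# Books 1-39 are Old Testament, 40-66 New Testament (assumed ordering).
top['testament'] = np.where(top['book_number'] <= 39, 'Old', 'New')
nt_books = top.loc[top['testament'] == 'New', 'book'].tolist()
print(nt_books)
```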

In [10]:
from plotly import express
from plotly import io

io.renderers.default = 'iframe'

express.bar(
    data_frame=df[['book', 'chapter']].groupby(by='book').max().reset_index().sort_values(ascending=False, by='chapter'),
    x='book', y='chapter',
)


In [11]:
express.bar(
    data_frame=df[['book', 'verse']].groupby(by='book').size().reset_index().sort_values(ascending=False, by=0),
    x='book', y=0,
)


Let's plot chapter and verse counts together. 

We expect the average chapter to have roughly the same number of verses across all books, so we expect a roughly linear relationship here. But the biggest books are so much bigger than the smallest that we need log axes to be able to see much at all.
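One way to check the linear expectation on log axes: if verse = k * chapter, then log(verse) = log(k) + log(chapter), which is a straight line with slope 1. A least-squares fit in log space estimates that slope. The sketch below uses only the eight books whose chapter and verse counts both appear in the tables above, so the number it prints is illustrative and not a result for all 66 books; `subset`, `slope`, and `intercept` are names introduced here.

```python
import numpy as np
import pandas as pd

# Counts copied from the In [8] and In [9] outputs above (largest books only).
subset = pd.DataFrame({
    'book': ['Psalms', 'Genesis', 'Jeremiah', 'Isaiah',
             'Numbers', 'Ezekiel', 'Exodus', 'Job'],
    'chapter': [150, 50, 52, 66, 36, 48, 40, 42],
    'verse': [2461, 1533, 1364, 1292, 1288, 1273, 1213, 1070],
})

# Fit log(verse) = slope * log(chapter) + intercept.
slope, intercept = np.polyfit(
    np.log(subset['chapter']), np.log(subset['verse']), deg=1
)
print(f'log-log slope: {slope:.2f}')
```

A slope near 1 on the full data would support the proportionality assumption; a slope well below 1 would mean that books with more chapters tend to have shorter chapters.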

In [12]:
express.scatter(
    data_frame=df[['book', 'verse']].groupby(by='book').size().reset_index().sort_values(ascending=False, by=0).merge(
    right=df[['book', 'chapter']].groupby(by='book').max().reset_index().sort_values(ascending=False, by='chapter'),
    on='book', how='inner').rename(columns={0: 'verse'}),
    x='chapter', y='verse', log_x=True, log_y=True, hover_name='book',
)


Let's take the data from the scatter plot above and put it in a DataFrame and then look at the distribution of average verses per chapter.

In [13]:
book_df = df[['book', 'verse']].groupby(by='book').size().reset_index().sort_values(ascending=False, by=0).merge(
    right=df[['book', 'chapter']].groupby(by='book').max().reset_index().sort_values(ascending=False, by='chapter'),
    on='book', how='inner').rename(columns={0: 'verse'})
book_df['vpc'] = book_df['verse'] / book_df['chapter']
express.histogram(data_frame=book_df, x='vpc', nbins=66)
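The same per-book table can probably be built in one pass with pandas named aggregation, which avoids the merge and the rename of column `0`. This is a sketch on placeholder rows, not a replacement tested against the real file. Like the cell above, it takes the highest chapter number as the chapter count, which assumes chapters are numbered consecutively from 1.

```python
import pandas as pd

# Placeholder rows (hypothetical): book A has 2 chapters and 5 verses,
# book B has 1 chapter and 4 verses.
toy = pd.DataFrame({
    'book':    ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'chapter': [1, 1, 1, 2, 2, 1, 1, 1, 1],
    'verse':   [1, 2, 3, 1, 2, 1, 2, 3, 4],
})

# One row per book: total verses (row count) and highest chapter number.
toy_book_df = (
    toy.groupby('book')
    .agg(verse=('verse', 'size'), chapter=('chapter', 'max'))
    .reset_index()
)
toy_book_df['vpc'] = toy_book_df['verse'] / toy_book_df['chapter']
print(toy_book_df)
```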

Average verses per chapter vary widely from book to book. Let's plot them against chapter count.

In [14]:
express.scatter(data_frame=book_df, x='chapter', y='vpc', color='verse', hover_name='book', log_x=True, log_y=False)

Interestingly, almost all of the books with the most verses per chapter are primarily narrative. The books above 35 verses per chapter include all four Gospels and Acts, along with 1 Kings and Numbers.
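The 35-verse threshold can be applied directly as a filter. The sketch below uses only the eight books whose chapter and verse counts both appear in the tables above, so it cannot reproduce the full list; on the real data the same filter would be applied to `book_df`. The names `subset` and `above_35` are illustrative.

```python
import pandas as pd

# Counts copied from the In [8] and In [9] outputs above.
subset = pd.DataFrame({
    'book': ['Psalms', 'Genesis', 'Jeremiah', 'Isaiah',
             'Numbers', 'Ezekiel', 'Exodus', 'Job'],
    'chapter': [150, 50, 52, 66, 36, 48, 40, 42],
    'verse': [2461, 1533, 1364, 1292, 1288, 1273, 1213, 1070],
})
subset['vpc'] = subset['verse'] / subset['chapter']

# Books averaging more than 35 verses per chapter.
above_35 = subset.loc[subset['vpc'] > 35, 'book'].tolist()
print(subset.sort_values('vpc', ascending=False)[['book', 'vpc']].round(1))
print(above_35)
```

Among these eight, only Numbers clears the threshold, while Psalms, with 150 short chapters, sits at the bottom.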