# 📖 Analyzing KJV Vocab  

I want my prayers to use vocabulary (stored in `myvocabulary.py`) that is Biblical in flavor. But while I have some idea of which words are Biblical (e.g. "myrrh") and which are not (e.g. "toothpaste"), I am sure that the words contained in the KJV itself are much more interesting than whatever KJV-like words I could conjure with my dubiously-lettered mind.

I don't have time to go through the Bible word by word.

In this notebook I simply do some rudimentary analysis of words by part of speech, hastily surfacing those words that seem most "Biblical." I could have used a more statistically robust technique for discovering keywords (e.g. Rayson & Garside's technique), but it seems sufficient for my purposes to simply compare how often a word occurs in the KJV with how often it occurs in another corpus (for convenience, the Brown Corpus).

I can then just copy and paste the Biblical words I "discover" into `myvocabulary.py`.
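
For reference, here is a minimal sketch of the log-likelihood keyness statistic usually associated with Rayson & Garside, which this notebook does not use. The function name `log_likelihood` and the counts in the example are made up for illustration; `a` and `b` are a word's counts in the two corpora, `c` and `d` are the corpus sizes.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of one word.

    a, b: occurrences of the word in corpus 1 and corpus 2
    c, d: total token counts of corpus 1 and corpus 2
    """
    # Expected counts if the word were equally likely in both corpora
    expected1 = c * (a + b) / (c + d)
    expected2 = d * (a + b) / (c + d)
    ll = 0.0
    # A zero observed count contributes nothing (0 * ln 0 is taken as 0)
    if a > 0:
        ll += a * math.log(a / expected1)
    if b > 0:
        ll += b * math.log(b / expected2)
    return 2 * ll

# Made-up counts, for illustration only
print(log_likelihood(50, 5, 1000, 1000))
print(log_likelihood(10, 10, 1000, 1000))  # identical rates give 0.0
```

The statistic measures how surprising the difference is, not its direction, so a word that is unusually rare in the KJV would also score high; a real ranking would need to check which corpus has the higher rate.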

***

In [1]:
from nltk import corpus, FreqDist, pos_tag

In [2]:
bible = corpus.gutenberg.words('bible-kjv.txt')

In [3]:
bible_tagged = pos_tag(bible)

In [4]:
brown = corpus.brown.words()

In [5]:
# FreqDist was already imported in the first cell; lowercase so counts ignore case
biblefd = FreqDist(w.lower() for w in bible)
brownfd = FreqDist(w.lower() for w in brown)

Function for comparing frequency of word in KJV vs. Brown Corpus.

In [6]:
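def biblicity(word):
    """
    Ratio of the word's raw count in the KJV to its raw count in Brown.
    A Brown count of 0 is treated as 1 to avoid division by zero.
    Counts are not normalized by corpus size.
    """
    return biblefd[word]/max(brownfd[word],1)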
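
The `biblicity` function above divides raw counts, so the score also reflects the relative sizes of the two corpora. As a hedged alternative (not what this notebook does), the sketch below compares per-token rates instead. The name `relative_biblicity`, the `smoothing` parameter, and the toy sentences are assumptions for illustration.

```python
from collections import Counter

def relative_biblicity(word, target_fd, reference_fd, smoothing=1.0):
    """Ratio of per-token rates; `smoothing` is added to each count to avoid dividing by zero."""
    target_total = sum(target_fd.values())
    reference_total = sum(reference_fd.values())
    target_rate = (target_fd[word] + smoothing) / target_total
    reference_rate = (reference_fd[word] + smoothing) / reference_total
    return target_rate / reference_rate

# Toy corpora standing in for the KJV and Brown word lists
toy_bible = Counter("thou shalt not covet thou shalt not".split())
toy_brown = Counter("you should not do that not now".split())

print(relative_biblicity("thou", toy_bible, toy_brown))  # well above 1
print(relative_biblicity("not", toy_bible, toy_brown))   # close to 1
```

NLTK's `FreqDist` is a `Counter` subclass, so the same function should accept `biblefd` and `brownfd`; for large corpora the totals are better computed once (for example with `FreqDist.N()`) than on every call.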

In [7]:
from collections import defaultdict
token2tags = defaultdict(list)
for token,tag in bible_tagged:
    token2tags[token].append(tag)
    
token2tags['shalt'][:10]

['NN', 'NN', 'VBP', 'NN', 'NN', 'NN', 'VBD', 'NN', 'NN', 'NN']

Dealing with the fact that POS taggers don't work so well on archaic text: as the output above shows, "shalt" is mostly tagged as a noun (`NN`).

In [8]:
def most_common(lst):  # https://stackoverflow.com/a/1518632
    return max(set(lst), key=lst.count)

def get_most_common_tag(word, nn_threshold=.7):
    """
    Get the most common POS tag for a token.
    Penalize NN since it is the default tag, often applied incorrectly:
    NN is accepted only if it makes up at least nn_threshold of the token's tags.
    """
    tags = token2tags[word]
    most_common_tag = most_common(tags)
    if most_common_tag == "NN" and tags.count("NN") / len(tags) < nn_threshold:
        # NN won the vote, but not convincingly: fall back to the best non-NN tag
        return most_common([t for t in tags if t != "NN"])
    return most_common_tag
    
    
get_most_common_tag("shalt")
    

'VBD'
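
To make the NN penalty concrete, here is a small standalone sketch of the same voting rule using `collections.Counter`. The function name `vote_tag` and the tag lists are made up for illustration; they are not taken from the tagged KJV.

```python
from collections import Counter

def vote_tag(tags, nn_threshold=0.7):
    """Majority vote over POS tags, accepting NN only above a threshold share."""
    counts = Counter(tags)
    top_tag, top_count = counts.most_common(1)[0]
    if top_tag != "NN" or top_count / len(tags) >= nn_threshold:
        return top_tag
    # NN won but fell short of the threshold: take the best non-NN tag
    del counts["NN"]
    if not counts:  # only NN tags were seen, nothing to fall back on
        return "NN"
    return counts.most_common(1)[0][0]

mostly_noun = ["NN"] * 8 + ["VBP", "VBD"]         # NN share is 0.8
weak_noun = ["NN"] * 6 + ["VBP"] * 3 + ["VBD"]    # NN share is 0.6

print(vote_tag(mostly_noun))  # NN clears the threshold
print(vote_tag(weak_noun))    # NN is overruled by the next-best tag
```

With the default threshold of 0.7, a token whose tags are 60% `NN` is assigned its most frequent non-noun tag instead.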

Function for ranking words according to biblicity.

In [9]:
def sort_words_based_on_pos(pos_start, strict=False):
    """
    Rank KJV words by biblicity (most Biblical first), keeping only words whose
    most common tag starts with pos_start (e.g. "N" or "VBG"), or equals it
    exactly when strict=True.
    """
    if strict:
        words = [w for w in set(bible) if get_most_common_tag(w) == pos_start]
    else:
        words = [w for w in set(bible) if get_most_common_tag(w).startswith(pos_start)]
    words_biblicity = [(w, biblicity(w)) for w in words]
    words_biblicity.sort(key=lambda x: x[1], reverse=True)
    return words_biblicity
    

In [22]:
tops = sort_words_based_on_pos("NN",strict=True)

In [23]:
tops[700:800]

[('ransom', 2.6),
 ('grief', 2.6),
 ('heir', 2.5714285714285716),
 ('word', 2.551094890510949),
 ('wood', 2.5454545454545454),
 ('day', 2.537117903930131),
 ('master', 2.513888888888889),
 ('dawning', 2.5),
 ('fortress', 2.5),
 ('raging', 2.5),
 ('draught', 2.5),
 ('brim', 2.5),
 ('cleansing', 2.5),
 ('sect', 2.5),
 ('gourd', 2.5),
 ('insurrection', 2.5),
 ('rump', 2.5),
 ('custody', 2.5),
 ('chastisement', 2.5),
 ('robber', 2.5),
 ('destruction', 2.473684210526316),
 ('devil', 2.44),
 ('solemn', 2.4166666666666665),
 ('cock', 2.4),
 ('correction', 2.4),
 ('noise', 2.3783783783783785),
 ('deed', 2.375),
 ('herb', 2.375),
 ('iron', 2.3488372093023258),
 ('thigh', 2.3333333333333335),
 ('thorn', 2.3333333333333335),
 ('banquet', 2.3333333333333335),
 ('greet', 2.2857142857142856),
 ('idol', 2.2857142857142856),
 ('man', 2.2659486329743164),
 ('flood', 2.263157894736842),
 ('twilight', 2.25),
 ('virginity', 2.25),
 ('warp', 2.25),
 ('vale', 2.25),
 ('mule', 2.25),
 ('honey', 2.24),
 ('voi

***

### Comparing words (and adding them) to those already in `myvocabulary.py`

In [52]:
import myvocabulary

In [53]:
from importlib import reload  
reload(myvocabulary)

<module 'myvocabulary' from '/Users/kyle/Box Sync/computation/projects/TPwoC/orisonation/myvocabulary.py'>

In [54]:
alreadywords = []

for w in myvocabulary.myvocabulary["NN"]:
    if ">" in w:
        w1,w2 = w.split(">")
        alreadywords.append(w1)
        alreadywords.append(w2)
    else:
        alreadywords.append(w)
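
A slightly more forgiving variant of the loop above, sketched with made-up entries: it collects the parts in a set (faster membership tests than a list) and tolerates entries with more than one `>`. The name `flatten_entries` and the sample data are assumptions for illustration.

```python
def flatten_entries(entries, sep=">"):
    """Split 'a>b' style entries and collect every part in a set."""
    seen = set()
    for entry in entries:
        seen.update(entry.split(sep))
    return seen

toy_entries = ["ox>oxen", "myrrh", "harp"]  # made-up sample vocabulary
already = flatten_entries(toy_entries)

# Candidates in the same (word, score) shape that sort_words_based_on_pos returns
candidates = [("myrrh", 17.0), ("ensign", 2.0), ("oxen", 3.0)]
print([w for w, _ in candidates if w not in already])
```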

In [62]:
tops = sort_words_based_on_pos("NN",strict=True)[900:1000]

In [63]:
not_already = [n for n,c in tops if n not in alreadywords]

In [64]:
not_already[:30]

['ensign',
 'winefat',
 'patriarchs',
 'lowliness',
 'bribe',
 'desiredst',
 'fryingpan',
 'lama',
 'dowry',
 'curtain',
 'partridge',
 'flagon',
 'stumblingstone',
 'trump',
 'sorcerer',
 'camphire',
 'lapwing',
 'dimness',
 'extortion',
 'bier',
 'solemnity',
 'cuckow',
 'shipmaster',
 'fining',
 'amiss',
 'ninety',
 'watercourse',
 'bolster',
 'fellowsoldier',
 'slime']

Just test to make sure that the words are in the word2vec vocab.

In [65]:
import gensim
word2vec_path = "shrunkenvectors_200000.bin"
print("****Loading word2vec model: %s" % word2vec_path)
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)
print("****Loaded")

****Loading word2vec model: shrunkenvectors_200000.bin
****Loaded


Interactive loop for going through words one by one, sorting some into a list, then (in cell below) joining them into a string that can be inserted into a list in `myvocabulary.py`.

In [66]:
output = []

for w in not_already:
    # `w in word2vec` works across gensim versions; the `.vocab` attribute was removed in gensim 4
    if w in word2vec:
        print(w)
        if input().startswith("y"):
            output.append(w)

ensign
y
patriarchs
n
bribe
y
lama

dowry
y
curtain
y
partridge
y
trump

sorcerer
y
extortion
y
solemnity
y
fining

amiss

ninety

watercourse
y
bolster

slime
y
pinnacle
y
furtherance

glutton
y
heath
y
obstinate
y
wellspring
y
partiality
y
purification
y
regeneration
y
dial
y
spoon
y
kite
y
botch

evangelist
y
fool
y
battle
y
wind
y
generation
y
flame
y
creature
y
month
y
household
y
valley
y
truth
y
prisoner
y
marvel
y
fountain
y
waste
y
thereto

camp
y
mistress
y
province
y
wife
y
earthquake
y
accord
y
height
y
strength
y
hearth
y
hem
y
weaver
y
drunkenness
y
ornament
y
law
y
oven
y
condemnation
y
offspring
y
meek
y
ministry
y
rest
y
palm
y
goddess
y
sweetness
y
maker
y
drunkard
y
emerald
y
horseback
y
skull
y
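
The review loop can also be wrapped in a function whose prompt is replaceable, which makes it possible to try out without typing. This is only a sketch: the name `review` and the `ask` parameter are hypothetical, and scripted answers stand in for keyboard input.

```python
def review(candidates, ask=input):
    """Show each candidate and keep those whose answer starts with 'y'."""
    kept = []
    for word in candidates:
        print(word)
        if ask().strip().lower().startswith("y"):
            kept.append(word)
    return kept

# Scripted answers replace keyboard input here; an empty answer means "skip"
answers = iter(["y", "n", "", "Y"])
print(review(["ensign", "patriarchs", "lama", "dowry"], ask=lambda: next(answers)))
```

Called as `review(not_already)` it would fall back to `input` and behave like the cell above, except that it also accepts an uppercase "Y".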


In [67]:
'"'+ '","'.join(output)+ '"'

'"ensign","bribe","dowry","curtain","partridge","sorcerer","extortion","solemnity","watercourse","slime","pinnacle","glutton","heath","obstinate","wellspring","partiality","purification","regeneration","dial","spoon","kite","evangelist","fool","battle","wind","generation","flame","creature","month","household","valley","truth","prisoner","marvel","fountain","waste","camp","mistress","province","wife","earthquake","accord","height","strength","hearth","hem","weaver","drunkenness","ornament","law","oven","condemnation","offspring","meek","ministry","rest","palm","goddess","sweetness","maker","drunkard","emerald","horseback","skull"'

*Notes to myself*

NN through 464 with small model

harps,oxenwise,servile
edification,reverence, dispossess,temperate

### Analyzing Vocab

In [None]:
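# NOTE: `vocab` and `w2v` are not defined earlier in this notebook; presumably they
# refer to `myvocabulary.myvocabulary` and the loaded `word2vec` model.
for pos in vocab:
    words = vocab[pos]
    print(pos)
    for w in words:
        base = w.split(">")[0]  # check only the part before ">"
        try:
            w2v[base]
        except KeyError:  # raised when the word has no vector
            print("%s not in word2vec vocab" % base)
    print()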
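
The same check can be sketched as a function that returns the missing words instead of printing them. Everything here is illustrative: `missing_words` is a hypothetical name, the vocabulary is assumed to be a dict mapping a POS tag to a list of entries, and a plain set stands in for the embedding vocabulary.

```python
def missing_words(vocab, known, sep=">"):
    """Map each part of speech to the base forms that are absent from `known`."""
    missing = {}
    for pos, words in vocab.items():
        bases = [w.split(sep)[0] for w in words]
        absent = [b for b in bases if b not in known]
        if absent:
            missing[pos] = absent
    return missing

toy_vocab = {"NN": ["ox>oxen", "winefat"], "VB": ["smite>smote"]}  # made-up entries
toy_known = {"ox", "smite"}  # stands in for the word2vec vocabulary

print(missing_words(toy_vocab, toy_known))
```

Anything that supports `in` can be passed as `known`; a gensim `KeyedVectors` object should work as well, since it supports membership tests.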

***