# Metadata

```
Author: Linnaea Kavulich
Contact: qpk4kp@virginia.edu
Course: DS 5001 (Spring 2023)
```

<h1><center>The State of Natural Language Processing</center></h1>
<h3><center>An exploration of the infamous Locke-Hobbes-Rousseau Debate Using NLP techniques</center></h3>

*** 

```
This notebook imports the F0 source texts (.txt files), converts them to F1 form (TOKENS), normalizes TOKENS to TERMS (F2), and annotates the TERMS with stopwords, parts of speech, and stems to build the VOCAB table (F3).

The end result is three tables (CORPUS, LIB, and VOCAB) covering six texts by the social contract theorists Thomas Hobbes, Jean-Jacques Rousseau, and John Locke.
```

| Book_id | Title | Author |
| :- | :- | :- |
| 1 | The Social Contract | Jean-Jacques Rousseau |
| 2 | Leviathan | Thomas Hobbes |
| 3 | Second Treatise of Government | John Locke |
| 4 | Discourse on the Origin and Basis of Inequality Among Men | Jean-Jacques Rousseau |
| 5 | An Essay Concerning Humane Understanding, Vol. 1 | John Locke |
| 6 | An Essay Concerning Humane Understanding, Vol. 2 | John Locke |

## Import Packages

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import nltk
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer
import re

import os
os.chdir('C:/Users/linna/Box/MSDS/DS5001/Final Project/Corpus/raw texts')

***

## Load Texts

In [2]:
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

In [3]:
text_file = 'social_contract_rousseau.txt'

In [4]:
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
LINES.line_str = LINES.line_str.str.replace('"', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('{', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('}', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('&', '', regex=True)
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)\)", r"\1", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)|(\w+)\)", r"\1\2", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\d+", "", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"[^\w\s.?!]+", "", x))

In [5]:
LINES

line_num,line_str
0,THE SOCIAL CONTRACT
1,
2,FOREWORD
3,
4,This little treatise is part of a longer work ...
...,...
10588,the rich will be far from sparing themselves t...
10589,this is quite beside the point. If in every na...
10590,the Sovereign commits the government of the pe...
10591,position its enemies it would not be worth whi...
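
The cleaning steps in cell 4 are easier to follow on a single string. The sketch below applies the same substitutions, in the same order, to one made-up line; `clean_line` and the sample text are illustrative names and data, not part of the notebook. One consequence worth noting is that the last substitution keeps only word characters, whitespace, and `.`, `?`, `!`, so semicolons and colons are already gone before any later sentence splitting.

```python
import re

def clean_line(line):
    """Apply the same substitutions as the cleaning cell to a single string."""
    line = re.sub(r"\n+", " ", line).strip()
    for ch in ('"', "{", "}", "&"):
        line = line.replace(ch, "")
    line = re.sub(r"\((\w+)\)", r"\1", line)          # (word) -> word
    line = re.sub(r"\((\w+)|(\w+)\)", r"\1\2", line)  # stray ( or ) attached to a word
    line = re.sub(r"\d+", "", line)                   # drop digits
    line = re.sub(r"[^\w\s.?!]+", "", line)           # keep word chars, whitespace, . ? !
    return line

# Hypothetical input line, invented for this illustration
sample = 'Man is born free; and everywhere (he) is in chains. [1]\n'
cleaned = clean_line(sample)
print(repr(cleaned))
```

The semicolon, the parentheses, and the bracketed footnote marker all disappear, while the full stop survives for the sentence splitter.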


In [6]:
chap_book_pat = r"\s*(chapter|BOOK)\s+(?=([IVX]+))\2(\s|$)"

In [7]:
chap_pat = r"\s*(chapter)\s+(?=([IVX]+))\2(\s|$)"

In [8]:
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [10]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.chap_num = LINES.chap_num.ffill()

In [11]:
#LINES.loc[chap_lines]

In [12]:
LINES.sample(10)

line_num,line_str,chap_num
921,his powers goods and liberty as it is importan...,13.0
1425,Free peoples be mindful of maxim Liberty may b...,17.0
4382,act of cowardice as indeed it is then since co...,46.0
7543,held in some repute the jealousy of lovers and...,48.0
5988,case of need. It follows that a republican Sta...,48.0
10158,provisions money and merchandise in just propo...,48.0
194,,4.0
8157,man could aggrandise himself only at the expen...,48.0
9060,the veil which hides all these horrors let us ...,48.0
6117,,48.0


In [13]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

In [14]:
LINES.sample(10)

line_num,line_str,chap_num
3117,great ones as formerly the Greek towns resiste...,34
6613,through its successive developments nor shall ...,48
3947,consuming greed unrest intrigue continual remo...,43
5752,and the peoples will continue to be as they ar...,48
8121,also his wit beauty strength or skill merit or...,48
10516,the common people is not oppressed and the dut...,48
9250,necessity and afterwards from gratitude after ...,48
373,were unanimous would be the obligation on the ...,5
5893,MOST HONOURABLE MAGNIFICENT AND SOVEREIGN LORD...,48
7884,himself with many conveniences unknown to his ...,48
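
The chapter-numbering idiom used above (match the heading lines, number them, forward-fill, then drop the front matter and the headings) can be checked on a toy frame. This is a minimal sketch with invented lines and a simplified pattern; only the pandas calls mirror the cells above.

```python
import pandas as pd

# Made-up lines standing in for a source text
lines = pd.DataFrame({'line_str': ['Front matter', 'CHAPTER I', 'first text',
                                   'CHAPTER II', 'second text', 'more text']})
lines.index.name = 'line_num'

chap_pat = r"\s*chapter\s+[IVX]+(\s|$)"   # simplified pattern for this toy input
is_heading = lines.line_str.str.match(chap_pat, case=False)

# Number the heading lines 1..n, then carry each number down to the next heading
lines.loc[is_heading, 'chap_num'] = list(range(1, int(is_heading.sum()) + 1))
lines['chap_num'] = lines['chap_num'].ffill()

# Drop front matter (no chapter yet) and the heading lines themselves
body = lines[lines.chap_num.notna() & ~is_heading].copy()
body['chap_num'] = body['chap_num'].astype(int)
print(body)
```

Every remaining line now carries the number of the chapter it belongs to, which is what the later `groupby` on `chap_num` relies on.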


In [15]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')

In [16]:
CHAPS['chap_str'] = CHAPS['chap_str'].astype(str)

In [17]:
# CHAPS['chap_str'] = CHAPS['chap_str'].str.replace('\n', ' ')
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS['chap_str'] = CHAPS['chap_str'].str.replace('_', '', regex=True).str.strip()

In [18]:
para_pat = r'\n\n+'

In [19]:
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

In [20]:
PARAS.head()

chap_num,para_num,para_str
1,0,SUBJECT OF THE FIRST BOOK
1,1,Man is born free and everywhere he is in chain...
1,2,If I took into account only force and the effe...
2,0,THE FIRST SOCIETIES
2,1,The most ancient of all societies and the only...
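
The `str.split(..., expand=True).stack()` idiom does the real work at each level of the hierarchy: splitting produces one column per piece, and stacking turns those columns into a new index level. The sketch below shows it on two invented chapter strings; the extra `dropna()` is an addition of mine that guards against padding values being kept by some pandas versions.

```python
import pandas as pd

# Two made-up chapter strings indexed by chapter number
chaps = pd.Series(
    ['TITLE\n\nFirst paragraph,\nsecond line.\n\nSecond paragraph.',
     'Only one paragraph.'],
    index=pd.Index([1, 2], name='chap_num'), name='chap_str')

# Wide form: one column per piece, shorter chapters padded with missing values
wide = chaps.str.split(r'\n\n+', expand=True)

# Long form: the column labels 0, 1, 2 become the paragraph number
paras = wide.stack().dropna().to_frame('para_str')
paras.index.names = ['chap_num', 'para_num']
print(paras)
```

The same two steps, with a different pattern and one more index name each time, yield the sentence and token tables.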


In [21]:
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

In [22]:
SENTS

chap_num,para_num,sent_num,sent_str
1,0,0,SUBJECT OF THE FIRST BOOK
1,1,0,Man is born free and everywhere he is in chains
1,1,1,One thinks himself\nthe master of others and ...
1,1,2,How\ndid this change come about
1,1,3,I do not know
...,...,...,...
48,387,1,e
48,387,2,those who impose or contrive the taxes being ...
48,387,3,But\nthis is quite beside the point
48,387,4,If in every nation those to whom\nthe Soverei...


In [23]:
token_pat = r"[\s',-]+"
TOKENS1 = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

In [24]:
TOKENS1.index.names = OHCO[:4]

In [25]:
TOKENS1

chap_num,para_num,sent_num,token_num,token_str
1,0,0,0,SUBJECT
1,0,0,1,OF
1,0,0,2,THE
1,0,0,3,FIRST
1,0,0,4,BOOK
...,...,...,...,...
48,387,4,36,make
48,387,4,37,the
48,387,4,38,people
48,387,4,39,happy


In [26]:
# Normalize tokens to terms: strip non-word characters and underscores, then lowercase
TOKENS1['term_str'] = TOKENS1.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()

In [27]:
TOKENS1 = pd.concat({'1': TOKENS1}, names=['book_id'])

In [28]:
TOKENS1

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
1,1,0,0,0,SUBJECT,subject
1,1,0,0,1,OF,of
1,1,0,0,2,THE,the
1,1,0,0,3,FIRST,first
1,1,0,0,4,BOOK,book
1,...,...,...,...,...,...
1,48,387,4,36,make,make
1,48,387,4,37,the,the
1,48,387,4,38,people,people
1,48,387,4,39,happy,happy
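
Since the same sequence of steps applies to every book, it can be wrapped in a function that takes the cleaned lines and a chapter pattern. The following is a sketch under stated assumptions, not the notebook's own code: `split_level` and `tokenize_book` are hypothetical helper names, the input is a plain list of strings rather than a file, chapter numbers come from a running count of headings (`cumsum`) instead of the forward-fill used above, and empty terms are dropped at the end.

```python
import pandas as pd

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

def split_level(series, pat, name):
    """Split every string on a regex and stack the pieces as a new index level."""
    return series.str.split(pat, expand=True).stack().dropna().to_frame(name)

def tokenize_book(raw_lines, chap_pat):
    """Turn a list of already-cleaned lines into an OHCO-indexed token table."""
    lines = pd.DataFrame({'line_str': [s.strip() for s in raw_lines]})
    is_heading = lines.line_str.str.match(chap_pat, case=False)
    # Running count of headings = chapter number (0 means front matter)
    lines['chap_num'] = is_heading.cumsum()
    lines = lines[(lines.chap_num > 0) & ~is_heading]
    # One string per chapter, then split down the hierarchy
    chaps = lines.groupby('chap_num').line_str.apply(lambda x: '\n'.join(x)).str.strip()
    paras = split_level(chaps, r'\n\n+', 'para_str')
    sents = split_level(paras.para_str, r'[.?!;:]+', 'sent_str')
    tokens = split_level(sents.sent_str, r"[\s',-]+", 'token_str')
    tokens.index.names = OHCO
    tokens['term_str'] = tokens.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()
    return tokens[tokens.term_str != '']

# Invented input for illustration
raw = ['Front matter', 'CHAPTER I', 'Man is born free.', 'He is in chains.', '',
       'CHAPTER II', 'The family is natural.']
toks = tokenize_book(raw, r'\s*chapter\s+[IVX]+(\s|$)')
print(toks)
```

With a helper like this, each book would need only its file name and its own chapter pattern.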


***

In [30]:
text_file = 'hobbes_leviathan.txt'
OHCO=['chap_num', 'para_num', 'sent_num', 'token_num']
   
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
LINES.line_str = LINES.line_str.str.replace('"', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('{', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('}', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('&', '', regex=True)
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)\)", r"\1", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)|(\w+)\)", r"\1\2", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\d+", "", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"[^\w\s.?!]+", "", x))
LINES

line_num,line_str
0,LEVIATHAN
1,
2,By Thomas Hobbes
3,
4,
...,...
22612,all men welcome.
22613,
22614,
22615,


In [31]:
chap_pat=r"\s*CHAPTER\s+[IVXLCDM]+\."
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [32]:
#LINES.loc[chap_lines]

In [33]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [34]:
LINES.chap_num = LINES.chap_num.ffill()

In [35]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int
LINES

line_num,line_str,chap_num
364,,1
365,,1
366,Concerning the Thoughts of man I will consider...,1
367,afterwards in Trayne or dependance upon one an...,1
368,are every one a Representation or Apparence of...,1
...,...,...
22612,all men welcome.,46
22613,,46
22614,,46
22615,,46


In [36]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')
    
CHAPS['chap_str'] = CHAPS['chap_str'].astype(str)
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS['chap_str'] = CHAPS['chap_str'].str.replace('_', '', regex=True).str.strip()

para_pat = r'\n\n+'
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
.to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
.to_frame('sent_str')
SENTS.index.names = OHCO[:3]

token_pat = r"[\s',-]+"
TOKENS2 = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
.to_frame('token_str')

TOKENS2.index.names = OHCO[:4]

# Normalize tokens to terms: strip non-word characters and underscores, then lowercase
TOKENS2['term_str'] = TOKENS2.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()

In [37]:
VOCAB = TOKENS2.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'

In [38]:
VOCAB

term_id,term_str,n
0,the,14783
1,of,10645
2,and,7226
3,to,7206
4,,6989
...,...,...
9502,hiding,1
9503,parcels,1
9504,molesteth,1
9505,incumbent,1


In [39]:
TOKENS2 = pd.concat({'2': TOKENS2}, names=['book_id'])

In [40]:
TOKENS2

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
2,1,0,0,0,Concerning,concerning
2,1,0,0,1,the,the
2,1,0,0,2,Thoughts,thoughts
2,1,0,0,3,of,of
2,1,0,0,4,man,man
2,...,...,...,...,...,...
2,46,70,3,13,all,all
2,46,70,3,14,men,men
2,46,70,3,15,welcome,welcome
2,46,70,4,0,,


***

In [42]:
text_file = 'locke_second.txt'
OHCO=['chap_num', 'para_num', 'sent_num', 'token_num']
   
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
LINES.line_str = LINES.line_str.str.replace(r'"', '', regex=True).str.strip()
LINES.line_str = LINES.line_str.str.replace('{', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('}', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('&', '', regex=True)
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)\)", r"\1", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)|(\w+)\)", r"\1\2", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\d+", "", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"[^\w\s.?!]+", "", x))
LINES

line_num,line_str
0,SECOND TREATISE OF GOVERNMENT
1,
2,by JOHN LOCKE
3,
4,Digitized by Dave Gowan. John Lockes Second Tr...
...,...
5110,legislative in themselves or erect a new form ...
5111,place it in new hands as they think good.
5112,
5113,


In [43]:
chap_pat=r"\s*CHAPTER+\.\s+[IVXLCDM]+\."
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [44]:
LINES.loc[chap_lines]

line_num,line_str
128,CHAPTER. I.
186,CHAPTER. II.
446,CHAPTER. III.
570,CHAPTER. IV.
627,CHAPTER. V.
1131,CHAPTER. VI.
1645,CHAPTER. VII.
2037,CHAPTER. VIII.
2617,CHAPTER. IX.
2741,CHAPTER. X.


In [45]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [46]:
LINES.chap_num = LINES.chap_num.ffill()

In [47]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int
LINES

line_num,line_str,chap_num
129,,1
130,AN ESSAY CONCERNING THE TRUE ORIGINAL EXTENT A...,1
131,GOVERNMENT,1
132,,1
133,,1
...,...,...
5110,legislative in themselves or erect a new form ...,19
5111,place it in new hands as they think good.,19
5112,,19
5113,,19


In [48]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')
    
CHAPS['chap_str'] = CHAPS['chap_str'].astype(str)
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS['chap_str'] = CHAPS['chap_str'].str.replace('_', '', regex=True).str.strip()

para_pat = r'\n\n+'
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
.to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
.to_frame('sent_str')
SENTS.index.names = OHCO[:3]

token_pat = r"[\s',-]+"
TOKENS3 = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
.to_frame('token_str')

TOKENS3.index.names = OHCO[:4]

# Normalize tokens to terms: strip non-word characters and underscores, then lowercase
TOKENS3['term_str'] = TOKENS3.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()

In [49]:
TOKENS3 = pd.concat({'3': TOKENS3}, names=['book_id'])

In [50]:
TOKENS3

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
3,1,0,0,0,AN,an
3,1,0,0,1,ESSAY,essay
3,1,0,0,2,CONCERNING,concerning
3,1,0,0,3,THE,the
3,1,0,0,4,TRUE,true
3,...,...,...,...,...,...
3,19,48,3,87,think,think
3,19,48,3,88,good,good
3,19,48,4,0,,
3,19,49,0,0,FINIS,finis


***

In [52]:
text_file = 'rousseau_inequality.txt'
OHCO=['chap_num', 'para_num', 'sent_num', 'token_num']
   
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
LINES.line_str = LINES.line_str.str.replace('"', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('{', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('}', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('&', '', regex=True)
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)\)", r"\1", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)|(\w+)\)", r"\1\2", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\d+", "", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"[^\w\s.?!]+", "", x))
LINES.head()

line_num,line_str
0,A Discourse Upon The Origin And The Foundation...
1,Mankind
2,
3,By J. J. Rousseau
4,


In [53]:
chap_pat=r"\s*(DISCOURSE FIRST PART|SECOND PART)\s*"
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [54]:
LINES.loc[chap_lines]

line_num,line_str
180,DISCOURSE FIRST PART
1295,SECOND PART


In [55]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.chap_num = LINES.chap_num.ffill()

In [56]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int
LINES.head()

line_num,line_str,chap_num
181,,1
182,However important it may be in order to form a...,1
183,natural state of man to consider him from his ...,1
184,him as it were in the first embryo of the spec...,1
185,attempt to trace his organization through its ...,1


In [57]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')
    
CHAPS['chap_str'] = CHAPS['chap_str'].astype(str)
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS['chap_str'] = CHAPS['chap_str'].str.replace('_', '', regex=True).str.strip()

para_pat = r'\n\n+'
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
.to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
.to_frame('sent_str')
SENTS.index.names = OHCO[:3]

token_pat = r"[\s',-]+"
TOKENS4 = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
.to_frame('token_str')

TOKENS4.index.names = OHCO[:4]

# Normalize tokens to terms: strip non-word characters and underscores, then lowercase
TOKENS4['term_str'] = TOKENS4.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()

In [58]:
TOKENS4 = pd.concat({'4': TOKENS4}, names=['book_id'])

In [59]:
TOKENS4

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
4,1,0,0,0,However,however
4,1,0,0,1,important,important
4,1,0,0,2,it,it
4,1,0,0,3,may,may
4,1,0,0,4,be,be
4,...,...,...,...,...,...
4,2,59,2,94,commonest,commonest
4,2,59,2,95,necessaries,necessaries
4,2,59,2,96,of,of
4,2,59,2,97,life,life


***

In [61]:
text_file = 'humane_understanding_locke.txt'
OHCO=['chap_num', 'para_num', 'sent_num', 'token_num']
   
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
LINES.line_str = LINES.line_str.str.replace('"', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('{', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('}', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('&', '', regex=True)
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)\)", r"\1", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)|(\w+)\)", r"\1\2", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\d+", "", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"[^\w\s.?!]+", "", x))
LINES.head()

line_num,line_str
0,An Essay Concerning Humane Understanding
1,
2,TO THE RIGHT HONOURABLE THOMAS EARL OF PEMBROK...
3,HERBERT OF CARDIFF LORD ROSS OF KENDAL PAR FIT...
4,QUINTIN AND SHURLAND


In [62]:
chap_pat=r"\s*CHAPTER+\s+[IVXLCDM]+\."
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [63]:
#LINES.loc[chap_lines]

In [64]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.chap_num = LINES.chap_num.ffill()

In [65]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int
LINES.head()

line_num,line_str,chap_num
559,INTRODUCTION.,1
560,,1
561,,1
562,. An Inquiry into the Understanding pleasant a...,1
563,,1


In [66]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')
    
CHAPS['chap_str'] = CHAPS['chap_str'].astype(str)
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS['chap_str'] = CHAPS['chap_str'].str.replace('_', '', regex=True).str.strip()

para_pat = r'\n\n+'
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
.to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
.to_frame('sent_str')
SENTS.index.names = OHCO[:3]

token_pat = r"[\s',-]+"
TOKENS5 = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
.to_frame('token_str')

TOKENS5.index.names = OHCO[:4]

# Normalize tokens to terms: strip non-word characters and underscores, then lowercase
TOKENS5['term_str'] = TOKENS5.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()

In [67]:
TOKENS5 = pd.concat({'5': TOKENS5}, names=['book_id'])

In [68]:
TOKENS5

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
5,1,0,0,0,INTRODUCTION,introduction
5,1,0,1,0,,
5,1,1,0,0,,
5,1,1,1,0,,
5,1,1,1,1,An,an
5,...,...,...,...,...,...
5,37,38,2,0,,
5,37,39,0,0,END,end
5,37,39,0,1,OF,of
5,37,39,0,2,VOLUME,volume


In [69]:
TOKENS5.term_str.count()

147681

***

In [71]:
text_file = 'humane_understanding_vol2.txt'
OHCO=['chap_num', 'para_num', 'sent_num', 'token_num']
   
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()
LINES.line_str = LINES.line_str.str.replace('"', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('{', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('}', '', regex=True)
LINES.line_str = LINES.line_str.str.replace('&', '', regex=True)
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)\)", r"\1", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\((\w+)|(\w+)\)", r"\1\2", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"\d+", "", x))
LINES["line_str"] = LINES["line_str"].apply(lambda x: re.sub(r"[^\w\s.?!]+", "", x))
LINES.head()

line_num,line_str
0,AN ESSAY CONCERNING HUMAN UNDERSTANDING
1,
2,BY
3,
4,JOHN LOCKE


In [72]:
chap_pat=r"\s*CHAPTER+\s+[IVXLCDM]+\."
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector

In [73]:
# LINES.loc[chap_lines]

In [74]:
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.chap_num = LINES.chap_num.ffill()

In [75]:
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int
LINES.head()

line_num,line_str,chap_num
14,,1
15,OF WORDS OR LANGUAGE IN GENERAL.,1
16,,1
17,,1
18,. Man fitted to form articulated Sounds.,1


In [76]:
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')
    
CHAPS['chap_str'] = CHAPS['chap_str'].astype(str)
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
CHAPS['chap_str'] = CHAPS['chap_str'].str.replace('_', '', regex=True).str.strip()

para_pat = r'\n\n+'
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
.to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
.to_frame('sent_str')
SENTS.index.names = OHCO[:3]

token_pat = r"[\s',-]+"
TOKENS6 = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
.to_frame('token_str')

TOKENS6.index.names = OHCO[:4]

# Normalize tokens to terms: strip non-word characters and underscores, then lowercase
TOKENS6['term_str'] = TOKENS6.token_str.str.replace(r'[\W_]+', '', regex=True).str.lower()

In [77]:
TOKENS6 = pd.concat({'6': TOKENS6}, names=['book_id'])

In [78]:
TOKENS6

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
6,1,0,0,0,OF,of
6,1,0,0,1,WORDS,words
6,1,0,0,2,OR,or
6,1,0,0,3,LANGUAGE,language
6,1,0,0,4,IN,in
6,...,...,...,...,...,...
6,32,10,3,51,from,from
6,32,10,3,52,another,another
6,32,10,4,0,,
6,32,11,0,0,The,the


In [79]:
TOKENS6.term_str.count()

128205

***

## Make CORPUS one table

In [81]:
# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat does the same job
TOKENS = pd.concat([TOKENS1, TOKENS2, TOKENS3, TOKENS4, TOKENS5, TOKENS6])
TOKENS


book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
1,1,0,0,0,SUBJECT,subject
1,1,0,0,1,OF,of
1,1,0,0,2,THE,the
1,1,0,0,3,FIRST,first
1,1,0,0,4,BOOK,book
...,...,...,...,...,...,...
6,32,10,3,51,from,from
6,32,10,3,52,another,another
6,32,10,4,0,,
6,32,11,0,0,The,the
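
Passing a dict to `pd.concat` adds the dict keys as a new outermost index level, which is how each book received its `book_id` earlier. As a sketch with two invented mini-books, the same call can also label and stack several books in one step:

```python
import pandas as pd

# Two tiny made-up token tables sharing the same index layout
idx = pd.MultiIndex.from_tuples([(1, 0), (1, 1)], names=['chap_num', 'token_num'])
book_a = pd.DataFrame({'term_str': ['man', 'is']}, index=idx)
book_b = pd.DataFrame({'term_str': ['the', 'state']}, index=idx)

# Dict keys become the values of the new outer level, named via `names`
corpus = pd.concat({'1': book_a, '2': book_b}, names=['book_id'])
print(corpus)
```

The keys here are strings, as in the cells above, so selections by book use `'1'` rather than `1`.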


In [82]:
TOKENS.drop(TOKENS.loc[TOKENS['term_str']=='"'].index, inplace=True)

In [83]:
cit_pat = r"\"?\[\d+\]"

In [84]:
m = ~TOKENS.term_str.str.match(cit_pat, case=False)
TOKENS = TOKENS[m]

In [85]:
TOKENS

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
1,1,0,0,0,SUBJECT,subject
1,1,0,0,1,OF,of
1,1,0,0,2,THE,the
1,1,0,0,3,FIRST,first
1,1,0,0,4,BOOK,book
...,...,...,...,...,...,...
6,32,10,3,51,from,from
6,32,10,3,52,another,another
6,32,10,4,0,,
6,32,11,0,0,The,the



## Get VOCAB

In [88]:
# VOCAB = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
# VOCAB.index.name = 'term_id'
VOCAB = TOKENS.term_str.value_counts().to_frame('n')
VOCAB.index.name = 'term_str'
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['i'] = -np.log2(VOCAB.p)
VOCAB['n_chars'] = VOCAB.index.str.len()
VOCAB['h'] = VOCAB['p'] * VOCAB['i']

In [89]:
VOCAB

term_str,n,p,i,n_chars,h
the,41075,0.059972,4.059561,3,0.243461
of,31825,0.046467,4.427661,2,0.205739
and,22465,0.032800,4.930142,3,0.161711
,22176,0.032378,4.948822,0,0.160235
to,22072,0.032227,4.955604,2,0.159702
...,...,...,...,...,...
ammunition,1,0.000001,19.385534,10,0.000028
spoiles,1,0.000001,19.385534,7,0.000028
serene,1,0.000001,19.385534,6,0.000028
achor,1,0.000001,19.385534,5,0.000028
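
The `p`, `i`, and `h` columns are the relative frequency of each term, its self-information in bits (`-log2(p)`), and its contribution `p * i` to the entropy of the term distribution. A small worked example with four invented tokens makes the arithmetic easy to verify by hand:

```python
import numpy as np
import pandas as pd

# Four made-up tokens: 'the' twice, 'of' and 'state' once each
terms = pd.Series(['the', 'the', 'of', 'state'])

vocab = terms.value_counts().to_frame('n')
vocab.index.name = 'term_str'
vocab['p'] = vocab.n / vocab.n.sum()   # the: 0.5, of: 0.25, state: 0.25
vocab['i'] = -np.log2(vocab.p)         # the: 1 bit, of: 2 bits, state: 2 bits
vocab['h'] = vocab.p * vocab.i         # each term contributes 0.5 bits

print(vocab)
print('entropy in bits:', vocab.h.sum())
```

Summing `h` gives 1.5 bits for this toy distribution; rare terms carry more information each, while frequent ones contribute through their high probability.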


## Annotate CORPUS, VOCAB with POS

In [90]:
TOKENS['pos_tuple'] = nltk.pos_tag(TOKENS['term_str'])

In [91]:
TOKENS['pos'] = TOKENS.pos_tuple.apply(lambda x: x[1])
CORPUS = TOKENS.drop(columns='pos_tuple')
CORPUS

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str,pos
1,1,0,0,0,SUBJECT,subject,NN
1,1,0,0,1,OF,of,IN
1,1,0,0,2,THE,the,DT
1,1,0,0,3,FIRST,first,JJ
1,1,0,0,4,BOOK,book,NN
...,...,...,...,...,...,...,...
6,32,10,3,51,from,from,IN
6,32,10,3,52,another,another,DT
6,32,10,4,0,,,NN
6,32,11,0,0,The,the,DT


In [92]:
CORPUS = CORPUS[CORPUS.term_str != '']

In [93]:
CORPUS

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str,pos
1,1,0,0,0,SUBJECT,subject,NN
1,1,0,0,1,OF,of,IN
1,1,0,0,2,THE,the,DT
1,1,0,0,3,FIRST,first,JJ
1,1,0,0,4,BOOK,book,NN
...,...,...,...,...,...,...,...
6,32,10,3,50,one,one,CD
6,32,10,3,51,from,from,IN
6,32,10,3,52,another,another,DT
6,32,11,0,0,The,the,DT


In [94]:
VOCAB['max_pos'] = TOKENS[['term_str', 'pos']].value_counts().unstack(fill_value=0).idxmax(1)
VOCAB

term_str,n,p,i,n_chars,h,max_pos
the,41075,0.059972,4.059561,3,0.243461,DT
of,31825,0.046467,4.427661,2,0.205739,IN
and,22465,0.032800,4.930142,3,0.161711,CC
,22176,0.032378,4.948822,0,0.160235,NNP
to,22072,0.032227,4.955604,2,0.159702,TO
...,...,...,...,...,...,...
ammunition,1,0.000001,19.385534,10,0.000028,NN
spoiles,1,0.000001,19.385534,7,0.000028,NNS
serene,1,0.000001,19.385534,6,0.000028,JJ
achor,1,0.000001,19.385534,5,0.000028,NN


In [95]:
VOCAB['cat_pos'] = CORPUS[['term_str','pos']].value_counts().to_frame('n').reset_index()\
    .groupby('term_str').pos.apply(lambda x: set(x))

In [96]:
VOCAB = VOCAB[VOCAB['cat_pos'].notna()]

In [97]:
VOCAB

term_str,n,p,i,n_chars,h,max_pos,cat_pos
the,41075,0.059972,4.059561,3,0.243461,DT,{DT}
of,31825,0.046467,4.427661,2,0.205739,IN,{IN}
and,22465,0.032800,4.930142,3,0.161711,CC,{CC}
to,22072,0.032227,4.955604,2,0.159702,TO,{TO}
that,13160,0.019214,5.701662,4,0.109554,IN,"{RB, VBN, VB, DT, IN, WDT}"
...,...,...,...,...,...,...,...
ammunition,1,0.000001,19.385534,10,0.000028,NN,{NN}
spoiles,1,0.000001,19.385534,7,0.000028,NNS,{NNS}
serene,1,0.000001,19.385534,6,0.000028,JJ,{JJ}
achor,1,0.000001,19.385534,5,0.000028,NN,{NN}
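
The two POS columns summarize the per-token tags in different ways: `max_pos` keeps the tag a term receives most often, and `cat_pos` keeps the set of all tags it ever receives. A minimal sketch on five invented tagged tokens (the tags are made up, not tagger output):

```python
import pandas as pd

# Made-up tagged tokens: 'that' is tagged IN twice and DT once
tokens = pd.DataFrame({
    'term_str': ['that', 'that', 'that', 'state', 'state'],
    'pos':      ['IN',   'IN',   'DT',   'NN',    'NN'],
})

# Count (term, tag) pairs, pivot tags into columns, take the most frequent tag per term
counts = tokens[['term_str', 'pos']].value_counts().unstack(fill_value=0)
max_pos = counts.idxmax(axis=1)

# Collect every tag seen for each term
cat_pos = tokens.groupby('term_str').pos.apply(lambda x: set(x))

print(counts)
print(max_pos)
print(cat_pos)
```

A term with a large `cat_pos` set, such as `that` in the table above, is one whose grammatical role varies across contexts.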


In [98]:
sw = pd.DataFrame({'stop': 1}, index=nltk.corpus.stopwords.words('english'))
sw.index.name='term_str'

In [99]:
if 'stop' not in VOCAB.columns:
    VOCAB = VOCAB.join(sw)
    VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')
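
The stopword flag comes from a left join on the `term_str` index: terms found in the stopword frame get 1, and the missing values left for all other terms are filled with 0. The sketch below uses a two-word stand-in list rather than the NLTK stopword corpus, so it runs without any downloads:

```python
import pandas as pd

# Toy vocabulary and a stand-in stopword list (not the NLTK list)
vocab = pd.DataFrame({'n': [5, 3, 1]},
                     index=pd.Index(['the', 'state', 'of'], name='term_str'))
sw = pd.DataFrame({'stop': 1}, index=pd.Index(['the', 'of'], name='term_str'))

# Left join on the shared index; non-stopwords come back as NaN and become 0
vocab = vocab.join(sw)
vocab['stop'] = vocab['stop'].fillna(0).astype(int)
print(vocab)
```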

In [100]:
stemmer1 = PorterStemmer()
VOCAB['stem_porter'] = VOCAB.apply(lambda x: stemmer1.stem(x.name), 1)

stemmer2 = SnowballStemmer("english")
VOCAB['stem_snowball'] = VOCAB.apply(lambda x: stemmer2.stem(x.name), 1)

stemmer3 = LancasterStemmer()
VOCAB['stem_lancaster'] = VOCAB.apply(lambda x: stemmer3.stem(x.name), 1)

In [101]:
VOCAB

term_str,n,p,i,n_chars,h,max_pos,cat_pos,stop,stem_porter,stem_snowball,stem_lancaster
the,41075,0.059972,4.059561,3,0.243461,DT,{DT},1,the,the,the
of,31825,0.046467,4.427661,2,0.205739,IN,{IN},1,of,of,of
and,22465,0.032800,4.930142,3,0.161711,CC,{CC},1,and,and,and
to,22072,0.032227,4.955604,2,0.159702,TO,{TO},1,to,to,to
that,13160,0.019214,5.701662,4,0.109554,IN,"{RB, VBN, VB, DT, IN, WDT}",1,that,that,that
...,...,...,...,...,...,...,...,...,...,...,...
ammunition,1,0.000001,19.385534,10,0.000028,NN,{NN},0,ammunit,ammunit,ammunit
spoiles,1,0.000001,19.385534,7,0.000028,NNS,{NNS},0,spoil,spoil,spoil
serene,1,0.000001,19.385534,6,0.000028,JJ,{JJ},0,seren,seren,ser
achor,1,0.000001,19.385534,5,0.000028,NN,{NN},0,achor,achor,ach


## Make LIB table

In [102]:
book_id = [1, 2, 3, 4, 5, 6]
source_file_path = ['social_contract_rousseau.txt', 'hobbes_leviathan.txt', 'locke_second.txt', 
                     'rousseau_inequality.txt', 'humane_understanding_locke.txt', 'humane_understanding_vol2.txt']

# Raw strings keep backslash sequences such as \2 intact
chap_regex = [r'\s*(chapter)\s+(?=([IVX]+))\2(\s|$)', r'\s*CHAPTER\s+[IVXLCDM]+\.', r'\s*CHAPTER+\.\s+[IVXLCDM]+\.',
              r'\s*(DISCOURSE FIRST PART|SECOND PART)\s*', r'\s*CHAPTER+\s+[IVXLCDM]+\.', r'\s*CHAPTER+\s+[IVXLCDM]+\.']

title = ['The Social Contract', 'Leviathan', 'The Second Treatise of Government', 
         'A Discourse Upon The Origin And The Foundation Of The Inequality Among Mankind', 
         'An Essay Concerning Humane Understanding', 'An Essay Concerning Humane Understanding, Volume 2']

author = ['Jean-Jacques Rousseau', 'Thomas Hobbes', 'John Locke', 'Jean-Jacques Rousseau', 'John Locke', 'John Locke']

book_len = [111749, 224684, 58125, 26001, 150692, 130868]

n_chaps = [48, 46, 19, 2, 37, 32]
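
Regex patterns stored in ordinary (non-raw) string literals pass through Python's escape processing first, and a backreference such as `\2` is read as an octal escape for the control character with code 2. The sketch below shows the difference on a shortened variant of the first chapter pattern (literal spaces replace `\s` to keep the example free of other escapes):

```python
import re

raw_pat = r"(chapter) (?=([IVX]+))\2( |$)"    # \2 stays a backreference
plain_pat = "(chapter) (?=([IVX]+))\2( |$)"   # \2 becomes the character chr(2)

raw_hit = re.match(raw_pat, "chapter IV", flags=re.IGNORECASE)
plain_hit = re.match(plain_pat, "chapter IV", flags=re.IGNORECASE)

print(raw_hit is not None)    # the raw pattern matches the heading
print(plain_hit is None)      # the plain pattern looks for chr(2) and finds nothing
print('\x02' in plain_pat)    # the control character is embedded in the string
```

For this reason, patterns that are stored for later reuse are safest written with the `r` prefix.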

In [103]:
cols = {'book_id': book_id, 'source_file_path': source_file_path, 'chap_regex': chap_regex, 'title': title, 'author': author,
       'book_len': book_len, 'n_chaps': n_chaps} 
    
LIB = pd.DataFrame(cols)

In [104]:
LIB

,book_id,source_file_path,chap_regex,title,author,book_len,n_chaps
0,1,social_contract_rousseau.txt,\s*(chapter)\s+(?=([IVX]+))\2(\s|$),The Social Contract,Jean-Jacques Rousseau,111749,48
1,2,hobbes_leviathan.txt,\s*CHAPTER\s+[IVXLCDM]+\.,Leviathan,Thomas Hobbes,224684,46
2,3,locke_second.txt,\s*CHAPTER+\.\s+[IVXLCDM]+\.,The Second Treatise of Government,John Locke,58125,19
3,4,rousseau_inequality.txt,\s*(DISCOURSE FIRST PART|SECOND PART)\s*,A Discourse Upon The Origin And The Foundation...,Jean-Jacques Rousseau,26001,2
4,5,humane_understanding_locke.txt,\s*CHAPTER+\s+[IVXLCDM]+\.,An Essay Concerning Humane Understanding,John Locke,150692,37
5,6,humane_understanding_vol2.txt,\s*CHAPTER+\s+[IVXLCDM]+\.,"An Essay Concerning Humane Understanding, Volu...",John Locke,130868,32
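
The `book_len` and `n_chaps` values above are entered by hand. As a sketch of one alternative, comparable per-book statistics can be derived from a token table indexed by `book_id` and `chap_num`. The toy corpus below is invented, and the hand-entered figures may have been counted differently (for example, before empty tokens were removed), so this is not expected to reproduce them exactly.

```python
import pandas as pd

# Tiny made-up corpus: book '1' has two chapters, book '2' has one
idx = pd.MultiIndex.from_tuples(
    [('1', 1, 0), ('1', 1, 1), ('1', 2, 0), ('2', 1, 0)],
    names=['book_id', 'chap_num', 'token_num'])
corpus = pd.DataFrame({'term_str': ['man', 'is', 'born', 'the']}, index=idx)

stats = pd.DataFrame({
    # Number of tokens per book
    'book_len': corpus.groupby('book_id').term_str.count(),
    # Number of distinct chapter numbers per book
    'n_chaps': corpus.reset_index().groupby('book_id').chap_num.nunique(),
})
print(stats)
```

Joining such a frame onto LIB by `book_id` would keep the library table in step with the corpus whenever the tokenization changes.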


***

## Save CORPUS, VOCAB, LIB

In [105]:
os.chdir('C:/Users/linna/Box/MSDS/DS5001/Final Project/Corpus/')

In [106]:
CORPUS.to_csv("CORPUS.csv")

In [107]:
VOCAB.to_csv("VOCAB.csv")

In [108]:
LIB.to_csv("LIB.csv")