# Metadata

```
Course:   DS 5001
Module:   04 HW KEY
Author:   R.C. Alvarado
```

# Instructions

In this week’s code exercise, you will use NLTK to help tokenize and annotate a small corpus of George Eliot's novels to create an `F3` level digital analytical edition from them.

Using this week's Lab notebook as a guide (`M04_01_Pipeline.ipynb`), which uses the `TextParser` class in the `/lib` directory of the notebook repository, import and combine the novels contained in the directory `/data/gutenberg/eliot-set`.

You should produce the following related dataframes:

* A library `LIB` with the following metadata (and data) about each book:
  * The `book_id`, matching the first level of the index in the `CORPUS`.
  * The book title. The raw title, with author and title combined, is sufficient.
  * The path of the source file.
  * The regex used to parse chapter milestones.
  * The length of the book (number of tokens).
  * The number of chapters in the book.
* An aggregate of all the novels' tokens `CORPUS` with an appropriate `OHCO` index, with the following features:
  * The token string.
  * The term string.
  * The part-of-speech tag inferred by NLTK.
* A vocabulary `VOCAB` of terms extracted from `CORPUS`, with the following annotation features derived from either NLTK or by using operations presented in the notebook:
  * Stopwords.
  * Porter stems.
  * Maximum POS, i.e. the POS tag most frequently associated with the term, found using `.idxmax()`. When two or more tags tie, `.idxmax()` returns the first one in column order.
  * POS ambiguity, expressed as the number of distinct POS tags associated with a term's tokens.

Once you have these, use the dataframes to answer the questions below.

**Hints**:
* You will need to edit the `ohco_pats` config to match the downloaded texts.
* You may also need to edit the code that reads files from disk and parses their names.
* In defining the milestone regexes, be sure to include all chapter-level sections.

# Questions

## Q1 

What regular expression did you use to chunk _Middlemarch_ into chapters?

**Answer**: `^(?:PRELUDE|CHAPTER|FINALE)` or something similar.

## Q2

What is the title of the book that has the most tokens?

**Answer**: _Middlemarch_. 

## Q3

How many chapter-level chunks are there in this novel?

**Answer**: 88

## Q4 

Among the three stemming algorithms -- Porter, Snowball, and Lancaster -- which is the most aggressive, in terms of the number of words associated with each stem?

**Answer**: Lancaster (1.8 terms/stem)

## Q5 

Using the most aggressive stemmer from the previous question, what is the stem with the most associated terms?

**Answer**: 'cont'

# Code

## Setup

In [3]:
data_home = "../labs-repo/data"
local_lib = "../labs-repo/lib"
source_files = f'{data_home}/gutenberg/eliot-set'
data_prefix = 'eliot'

In [4]:
# Level names match the index that TextParser produces below (chap_id, not chap_num)
OHCO = ['book_id', 'chap_id', 'para_num', 'sent_num', 'token_num']

In [5]:
import pandas as pd
import numpy as np
from glob import glob
import re
import nltk

In [6]:
import sys
sys.path.append(local_lib)

In [7]:
from textparser import TextParser

## Inspect

Since Project Gutenberg texts vary widely in their markup, we define our chunking patterns by hand.

In [47]:
roman = '[IVXLCM]+'
caps = "[A-Z';, -]+"
clip_pats = [
    r"\*\*\*\s*START OF",
    r"\*\*\*\s*END OF"
]
# Each entry pairs a book_id with its chapter regex; every pattern is registered below with level name 'chap' and type 'm'
ohco_pat_list = [
    (6688,  rf"^Chapter\s+{roman}\.\s*$"),
    (507,   rf"^(?:Chapter\s+{roman}|Epilogue)\s*$"),
    (145,   rf"^(?:PRELUDE|BOOK|CHAPTER|FINALE)")
]
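
Before tokenizing, a hand-written milestone pattern can be sanity-checked against a few heading lines. The following is a minimal sketch that assumes the pattern is applied line by line, which may not be exactly how `TextParser` uses it. The helper `find_milestones` and the sample text are hypothetical and invented for illustration.

```python
import re

roman = '[IVXLCM]+'
# Chapter pattern registered above for book 507
chap_regex = rf"^(?:Chapter\s+{roman}|Epilogue)\s*$"

def find_milestones(text, pattern):
    """Return the lines of `text` that match the milestone pattern."""
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]

# Hypothetical excerpt, invented for illustration; not taken from the source file
sample_text = "\n".join([
    "Chapter I",
    "The Workshop",
    "Chapter I of this history was long.",
    "Chapter LV",
    "Epilogue",
])

print(find_milestones(sample_text, chap_regex))
# → ['Chapter I', 'Chapter LV', 'Epilogue']
```

The sentence that merely begins with "Chapter I" is rejected because the pattern anchors the end of the line with `\s*$`.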

## Register

We get each file and add it to a library `LIB`.

In [48]:
source_file_list = sorted(glob(f"{source_files}/*.*"))

In [49]:
source_file_list

['../labs-repo/data/gutenberg/eliot-set/ELIOT_GEORGE_ADAM_BEDE-pg507.txt',
 '../labs-repo/data/gutenberg/eliot-set/ELIOT_GEORGE_MIDDLEMARCH-pg145.txt',
 '../labs-repo/data/gutenberg/eliot-set/ELIOT_GEORGE_THE_MILL_ON_THE_FLOSS-pg6688.txt']

In [50]:
book_data = []
for source_file_path in source_file_list:
    book_id = int(source_file_path.split('-')[-1].split('.')[0].replace('pg',''))
    book_title = source_file_path.split('/')[-1].split('-')[0].replace('_', ' ')
    book_data.append((book_id, source_file_path, book_title))
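
The chained `split` calls above work for this file set but would silently truncate a title that contains a hyphen. As an alternative sketch, a single regex over the file name can extract both fields and fail loudly on unexpected names. The helper `parse_source_path` and the pattern `FILE_NAME_PAT` are hypothetical names, and the sketch assumes every file follows the `AUTHOR_TITLE-pg<ID>.txt` convention seen in the listing.

```python
import re
from pathlib import Path

# Assumes file names look like AUTHOR_TITLE-pg<ID>.txt, as in the listing above
FILE_NAME_PAT = re.compile(r"^(?P<raw_title>.+)-pg(?P<book_id>\d+)$")

def parse_source_path(source_file_path):
    """Return (book_id, raw_title) parsed from a Gutenberg-style file name."""
    stem = Path(source_file_path).stem  # file name without directory or extension
    match = FILE_NAME_PAT.match(stem)
    if match is None:
        raise ValueError(f"Unexpected file name: {source_file_path}")
    raw_title = match.group("raw_title").replace("_", " ")
    return int(match.group("book_id")), raw_title

path = "../labs-repo/data/gutenberg/eliot-set/ELIOT_GEORGE_MIDDLEMARCH-pg145.txt"
print(parse_source_path(path))
# → (145, 'ELIOT GEORGE MIDDLEMARCH')
```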

In [51]:
LIB = pd.DataFrame(book_data, columns=['book_id','source_file_path','raw_title'])\
    .set_index('book_id').sort_index()

In [52]:
LIB

book_id,source_file_path,raw_title
145,../labs-repo/data/gutenberg/eliot-set/ELIOT_GE...,ELIOT GEORGE MIDDLEMARCH
507,../labs-repo/data/gutenberg/eliot-set/ELIOT_GE...,ELIOT GEORGE ADAM BEDE
6688,../labs-repo/data/gutenberg/eliot-set/ELIOT_GE...,ELIOT GEORGE THE MILL ON THE FLOSS


## Tokenize

We tokenize each book and add each `TOKENS` table to a list to be concatenated into a single `CORPUS`.

In [53]:
books = []
for pat in ohco_pat_list:
    
    book_id, chap_regex = pat
    print("Tokenizing", book_id, LIB.loc[book_id].raw_title)
    ohco_pats = [('chap', chap_regex, 'm')]
    src_file_path = LIB.loc[book_id].source_file_path
    
    text = TextParser(src_file_path, ohco_pats=ohco_pats, clip_pats=clip_pats, use_nltk=True)
    text.verbose = False
    text.strip_hyphens = True
    text.strip_whitespace = True
    text.import_source().parse_tokens();
    text.TOKENS['book_id'] = book_id
    text.TOKENS = text.TOKENS.reset_index().set_index(['book_id'] + text.OHCO)
    
    books.append(text.TOKENS)

Tokenizing 6688 ELIOT GEORGE THE MILL ON THE FLOSS
Tokenizing 507 ELIOT GEORGE ADAM BEDE
Tokenizing 145 ELIOT GEORGE MIDDLEMARCH


## Create Corpus

In [54]:
CORPUS = pd.concat(books).sort_index()
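
Another way to attach `book_id` is to let `pd.concat` do it: passing a dict keyed by `book_id` prepends the keys as a new outer index level. The sketch below uses two toy tables with invented values in place of the real `TOKENS` tables, so it illustrates the indexing only, not the notebook's actual data.

```python
import pandas as pd

# Toy stand-ins for per-book TOKENS tables; values are invented for illustration
idx_names = ['chap_id', 'para_num', 'sent_num', 'token_num']
positions = [(1, 0, 0, 0), (1, 0, 0, 1)]
tokens_a = pd.DataFrame(
    {'token_str': ['Who', 'that']},
    index=pd.MultiIndex.from_tuples(positions, names=idx_names),
)
tokens_b = pd.DataFrame(
    {'token_str': ['With', 'a']},
    index=pd.MultiIndex.from_tuples(positions, names=idx_names),
)

# The dict keys become the outermost index level, named via `names`
toy_corpus = pd.concat({507: tokens_b, 145: tokens_a}, names=['book_id']).sort_index()

print(list(toy_corpus.index.names))
print(toy_corpus.loc[145].token_str.tolist())
```

With this approach the per-book tables do not need a `book_id` column or a `reset_index()` round trip before concatenation.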

In [55]:
CORPUS.loc[145]

chap_id,para_num,sent_num,token_num,pos_tuple,pos,token_str,term_str
1,0,0,0,"(Who, WP)",WP,Who,who
1,0,0,1,"(that, WDT)",WDT,that,that
1,0,0,2,"(cares, VBZ)",VBZ,cares,cares
1,0,0,3,"(much, RB)",RB,much,much
1,0,0,4,"(to, TO)",TO,to,to
...,...,...,...,...,...,...,...
88,0,85,56,"(in, IN)",IN,in,in
88,0,85,57,"(unvisited, JJ)",JJ,unvisited,unvisited
88,0,85,58,"(tombs., NN)",NN,tombs.,tombs
88,0,86,0,"(THE, DT)",DT,THE,the


## Extract some features for `LIB`

In [56]:
LIB['book_len'] = CORPUS.groupby('book_id').term_str.count()

In [57]:
LIB['n_chaps'] = CORPUS.reset_index()[['book_id','chap_id']]\
    .drop_duplicates()\
    .groupby('book_id').chap_id.count()
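
The two features above can be checked on a toy corpus. This sketch uses invented rows and a shortened index (`book_id`, `chap_id`, `token_num`), and it counts chapters with `nunique()` instead of `drop_duplicates()` followed by `count()`; both should give the same result.

```python
import pandas as pd

# Toy corpus with a shortened index; all values are invented for illustration
toy = pd.DataFrame(
    {'term_str': ['a', 'b', 'c', 'd', 'e']},
    index=pd.MultiIndex.from_tuples(
        [(145, 1, 0), (145, 1, 1), (145, 2, 0), (507, 1, 0), (507, 1, 1)],
        names=['book_id', 'chap_id', 'token_num'],
    ),
)

# Number of tokens per book
book_len = toy.groupby('book_id').term_str.count()

# Number of distinct chapter ids per book
n_chaps = toy.reset_index().groupby('book_id').chap_id.nunique()

print(book_len.to_dict())  # → {145: 3, 507: 2}
print(n_chaps.to_dict())   # → {145: 2, 507: 1}
```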

In [58]:
LIB['chap_regex'] = LIB.index.map(pd.Series({x[0]:x[1] for x in ohco_pat_list}))

In [59]:
LIB.sort_values('book_len')

book_id,source_file_path,raw_title,book_len,n_chaps,chap_regex
6688,../labs-repo/data/gutenberg/eliot-set/ELIOT_GE...,ELIOT GEORGE THE MILL ON THE FLOSS,207459,58,^Chapter\s+[IVXLCM]+\.\s*$
507,../labs-repo/data/gutenberg/eliot-set/ELIOT_GE...,ELIOT GEORGE ADAM BEDE,215402,57,^(?:Chapter\s+[IVXLCM]+|Epilogue)\s*$
145,../labs-repo/data/gutenberg/eliot-set/ELIOT_GE...,ELIOT GEORGE MIDDLEMARCH,317799,88,^(?:PRELUDE|BOOK|CHAPTER|FINALE)


## Extract VOCAB

Extract a vocabulary from the `CORPUS` as a whole.

In [60]:
# CORPUS[CORPUS.term_str == '']

In [61]:
CORPUS[CORPUS.term_str == ''].token_str.value_counts()

&      10
…       3
),      2
);      2
(&)     1
):      1
;”      1
Name: token_str, dtype: int64

In [62]:
CORPUS = CORPUS[CORPUS.term_str != '']

In [63]:
VOCAB = CORPUS.term_str.value_counts().to_frame('n').sort_index()
VOCAB.index.name = 'term_str'
VOCAB['n_chars'] = VOCAB.index.str.len()
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['i'] = -np.log2(VOCAB.p)
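
In the cell above, `p` is a term's relative frequency and `i` is its self-information in bits, `-log2(p)`, so rarer terms get larger values. A minimal sketch on a toy vocabulary (the counts are invented so that the probabilities are powers of two):

```python
import numpy as np
import pandas as pd

# Toy term counts, invented for illustration
toy_vocab = pd.DataFrame(
    {'n': [4, 2, 1, 1]},
    index=pd.Index(['the', 'of', 'mill', 'floss'], name='term_str'),
)

toy_vocab['p'] = toy_vocab.n / toy_vocab.n.sum()  # relative frequency
toy_vocab['i'] = -np.log2(toy_vocab.p)            # self-information in bits

print(toy_vocab)
# 'the' has p = 0.5 and i = 1.0; 'mill' and 'floss' have p = 0.125 and i = 3.0
```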

## Annotate VOCAB

In [64]:
VOCAB['max_pos'] = CORPUS[['term_str','pos']].value_counts().unstack(fill_value=0).idxmax(1)

In [65]:
TPM = CORPUS[['term_str','pos']].value_counts().unstack()

In [66]:
VOCAB['n_pos'] = TPM.count(1)

In [67]:
VOCAB['cat_pos'] = CORPUS[['term_str','pos']].value_counts().to_frame('n').reset_index()\
    .groupby('term_str').pos.apply(lambda x: set(x))
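
The POS annotations can be illustrated on a handful of tagged tokens. This sketch builds the term-by-POS count matrix with `pd.crosstab` rather than the `value_counts().unstack()` used above, and the tagged tokens are invented for illustration. It also shows how a tie is resolved: `idxmax` returns the first column that holds the maximum.

```python
import pandas as pd

# Toy tagged tokens, invented for illustration
toy_tokens = pd.DataFrame({
    'term_str': ['fear', 'fear', 'fear', 'shroud', 'shroud', 'are'],
    'pos':      ['NN',   'NN',   'VB',   'NN',     'VB',     'VBP'],
})

# Term-by-POS count matrix; rows and columns come out sorted, absent pairs are 0
tpm = pd.crosstab(toy_tokens.term_str, toy_tokens.pos)

max_pos = tpm.idxmax(axis=1)    # most frequent tag per term
n_pos = (tpm > 0).sum(axis=1)   # number of distinct tags per term

print(max_pos.to_dict())
print(n_pos.to_dict())
```

Here 'shroud' is tagged once as `NN` and once as `VB`; the tie goes to `NN` because that column comes first.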

In [68]:
VOCAB

term_str,n,n_chars,p,i,max_pos,n_pos,cat_pos
1,1,1,0.000001,19.498413,CD,1,{CD}
1790,1,4,0.000001,19.498413,CD,1,{CD}
1799,2,4,0.000003,18.498413,CD,1,{CD}
1801more,1,8,0.000001,19.498413,CD,1,{CD}
1807,1,4,0.000001,19.498413,CD,1,{CD}
...,...,...,...,...,...,...,...
œdipus,2,6,0.000003,18.498413,NN,1,{NN}
μέγεθος,1,7,0.000001,19.498413,NNP,1,{NNP}
τι,1,2,0.000001,19.498413,NNP,1,{NNP}
ἀπέρωτος,1,8,0.000001,19.498413,JJ,1,{JJ}


## Add Stopwords

We use NLTK's built-in stopword list for English. Note that we can add to and subtract from this list, or just create our own list and keep it in our data model.

In [28]:
sw = pd.DataFrame(nltk.corpus.stopwords.words('english'), columns=['term_str'])
sw = sw.reset_index().set_index('term_str')
sw.columns = ['dummy']
sw.dummy = 1

In [29]:
VOCAB['stop'] = VOCAB.index.map(sw.dummy)
VOCAB['stop'] = VOCAB['stop'].fillna(0).astype('int')
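
A shorter route to the same 0/1 flag is to test index membership directly. This is a sketch only: the three-word `stopwords` list below is a hand-picked stand-in for `nltk.corpus.stopwords.words('english')`, and the toy vocabulary is invented.

```python
import pandas as pd

# Hand-picked stand-in for NLTK's English stopword list
stopwords = ['the', 'of', 'are']

# Toy vocabulary, invented for illustration
toy_vocab = pd.DataFrame(
    {'n': [3, 5, 7, 1]},
    index=pd.Index(['are', 'fear', 'of', 'shroud'], name='term_str'),
)

# isin returns a boolean array, which is cast to 0/1
toy_vocab['stop'] = toy_vocab.index.isin(stopwords).astype('int')

print(toy_vocab.stop.to_dict())
# → {'are': 1, 'fear': 0, 'of': 1, 'shroud': 0}
```

Keeping the stopwords in their own indexed table, as the cell above does, is still useful when the list itself should be stored in the data model.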

In [30]:
VOCAB

term_str,n,n_chars,p,i,max_pos,n_pos,cat_pos,stop
1,1,1,0.000001,19.498411,CD,1,{CD},0
1790,1,4,0.000001,19.498411,CD,1,{CD},0
1799,2,4,0.000003,18.498411,CD,1,{CD},0
1801more,1,8,0.000001,19.498411,CD,1,{CD},0
1807,1,4,0.000001,19.498411,CD,1,{CD},0
...,...,...,...,...,...,...,...,...
œdipus,2,6,0.000003,18.498411,NN,1,{NN},0
μέγεθος,1,7,0.000001,19.498411,NNP,1,{NNP},0
τι,1,2,0.000001,19.498411,NNP,1,{NNP},0
ἀπέρωτος,1,8,0.000001,19.498411,JJ,1,{JJ},0


## Add Stems

In [31]:
from nltk.stem.porter import PorterStemmer
stemmer1 = PorterStemmer()
VOCAB['stem_porter'] = VOCAB.apply(lambda x: stemmer1.stem(x.name), 1)

from nltk.stem.snowball import SnowballStemmer
stemmer2 = SnowballStemmer("english")
VOCAB['stem_snowball'] = VOCAB.apply(lambda x: stemmer2.stem(x.name), 1)

from nltk.stem.lancaster import LancasterStemmer
stemmer3 = LancasterStemmer()
VOCAB['stem_lancaster'] = VOCAB.apply(lambda x: stemmer3.stem(x.name), 1)

In [32]:
VOCAB.sample(10)

term_str,n,n_chars,p,i,max_pos,n_pos,cat_pos,stop,stem_porter,stem_snowball,stem_lancaster
extricate,1,9,1e-06,19.498411,VB,1,{VB},0,extric,extric,ext
protestantism,3,13,4e-06,17.913448,NNP,2,"{NNP, NN}",0,protestant,protestant,protest
recovered,32,9,4.3e-05,14.498411,VBN,4,"{JJ, VBD, VBN, NN}",0,recov,recov,recov
energumena,1,10,1e-06,19.498411,JJ,1,{JJ},0,energumena,energumena,energumen
are,1559,3,0.002105,8.892006,VBP,10,"{VB, IN, NNP, VBP, JJ, NN, VBD, NNS, VBZ, RB}",1,are,are,ar
conferences,1,11,1e-06,19.498411,NNS,1,{NNS},0,confer,confer,conf
fear,166,4,0.000224,12.123372,NN,6,"{VB, NN, VBP, JJ, VBN, RB}",0,fear,fear,fear
bonaventure,1,11,1e-06,19.498411,NNP,1,{NNP},0,bonaventur,bonaventur,bonav
shroud,3,6,4e-06,17.913448,NN,2,"{NN, VB}",0,shroud,shroud,shroud
culture,8,7,1.1e-05,16.498411,NN,2,"{NN, JJ}",0,cultur,cultur,cult


In [33]:
VOCAB[VOCAB.stem_porter != VOCAB.stem_snowball]

term_str,n,n_chars,p,i,max_pos,n_pos,cat_pos,stop,stem_porter,stem_snowball,stem_lancaster
abjectly,1,8,0.000001,19.498411,RB,1,{RB},0,abjectli,abject,abject
abruptly,16,8,0.000022,15.498411,RB,5,"{NN, VBD, JJ, RB, RP}",0,abruptli,abrupt,abrupt
abstractedly,3,12,0.000004,17.913448,NN,2,"{NN, RB}",0,abstractedli,abstract,abstract
abundantly,4,10,0.000005,17.498411,VB,3,"{RB, VB, NNS}",0,abundantli,abund,abund
accordingly,11,11,0.000015,16.038979,NN,6,"{IN, NNP, NN, VBP, JJ, RB}",0,accordingli,accord,accord
...,...,...,...,...,...,...,...,...,...,...,...
yeswellyou,1,10,0.000001,19.498411,NN,1,{NN},0,yeswelly,yeswellyou,yeswellyou
yous,3,4,0.000004,17.913448,NN,2,"{NN, RB}",0,you,yous,yo
zealous,10,7,0.000014,16.176483,JJ,1,{JJ},0,zealou,zealous,zeal
æschylus,2,8,0.000003,18.498411,NNP,2,"{NNP, NN}",0,æschylu,æschylus,æschylus


# Answers

## Q1

In [36]:
# Look up Middlemarch (book_id 145) explicitly instead of relying on the loop's last value
LIB.loc[145].chap_regex

'^(?:PRELUDE|BOOK|CHAPTER|FINALE)'

## Q2

In [40]:
LIB.loc[LIB.book_len.idxmax()].raw_title

'ELIOT GEORGE MIDDLEMARCH'

## Q3

How many chapter-level chunks are there in this novel?

In [42]:
LIB.loc[145].n_chaps

88

## Q4

Among the three stemming algorithms -- Porter, Snowball, and Lancaster -- which is the most aggressive, defined as the average number of terms associated with each stem?

In [55]:
for stem_type in ['porter', 'snowball', 'lancaster']:
    x = VOCAB[f"stem_{stem_type}"].value_counts().mean()
    print(stem_type, round(x,2))

porter 1.5
snowball 1.53
lancaster 1.8


lancaster
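
The aggressiveness measure can be read as a simple ratio: the mean of `value_counts()` over a stem column equals the number of terms divided by the number of distinct stems. The sketch below checks this on a toy table; the stem assignments are hypothetical and are not real stemmer output.

```python
import pandas as pd

# Hypothetical term-to-stem assignments, not real stemmer output
toy = pd.DataFrame(
    {
        'stem_mild':       ['conceal', 'conceal', 'conced', 'conced', 'fear'],
        'stem_aggressive': ['cont', 'cont', 'cont', 'cont', 'fear'],
    },
    index=pd.Index(['conceal', 'concealed', 'concede', 'conceded', 'fear'], name='term_str'),
)

results = {}
for col in toy.columns:
    terms_per_stem = toy[col].value_counts()       # terms grouped under each stem
    ratio = len(toy) / toy[col].nunique()          # terms divided by distinct stems
    results[col] = (terms_per_stem.mean(), ratio)
    print(col, round(terms_per_stem.mean(), 2), round(ratio, 2))

# Stem with the most associated terms under the more aggressive assignment
print(toy.stem_aggressive.value_counts().idxmax())
```

A stemmer that maps more terms onto fewer stems therefore scores higher, which is the sense in which Lancaster is the most aggressive of the three.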

## Q5

Using the most aggressive stemmer from the previous question, what is the stem with the most associated terms?

In [66]:
most_aggressive_stem = VOCAB.stem_lancaster.value_counts().head(1).index.values[0]

In [68]:
most_aggressive_stem

'cont'

In [67]:
VOCAB.query(f"stem_lancaster == '{most_aggressive_stem}'")

term_str,n,n_chars,p,i,max_pos,n_pos,cat_pos,stop,stem_porter,stem_snowball,stem_lancaster
conceal,12,7,1.6e-05,15.913448,VB,1,{VB},0,conceal,conceal,cont
concealed,4,9,5e-06,17.498411,VBD,3,"{JJ, VBD, VBN}",0,conceal,conceal,cont
concealing,3,10,4e-06,17.913448,VBG,2,"{NN, VBG}",0,conceal,conceal,cont
concealment,17,11,2.3e-05,15.410948,NN,2,"{NN, JJ}",0,conceal,conceal,cont
concealments,1,12,1e-06,19.498411,NNS,1,{NNS},0,conceal,conceal,cont
conceals,1,8,1e-06,19.498411,VBZ,1,{VBZ},0,conceal,conceal,cont
concede,1,7,1e-06,19.498411,VB,1,{VB},0,conced,conced,cont
conceded,1,8,1e-06,19.498411,JJ,1,{JJ},0,conced,conced,cont
conceding,1,9,1e-06,19.498411,VBG,1,{VBG},0,conced,conced,cont
concentrate,3,11,4e-06,17.913448,VB,1,{VB},0,concentr,concentr,cont
