# M04 Homework

```yaml
Course:   DS 5001
Module:   04 Lab
Topic:    NLP and the Pipeline
Author:   JiHo Lee (qxz6hb)
Date:     10 February 2023
```

### Question 1. What regular expression did you use to chunk _Middlemarch_ into chapters?

> The regular expression used to chunk _Middlemarch_ into chapters is ```^\s*CHAPTER\s+[IVXLCM]+\.\s*$```

### Question 2. What is the title of the book that has the most tokens? 

> MIDDLEMARCH

_Middlemarch_ has the most tokens, with 317305.

### Question 3. How many chapter level chunks are there in this novel?

> _Middlemarch_: 86 

The number of chapters in _Middlemarch_ is 86.

_Adam Bede_ and _The Mill on the Floss_ have 55 and 58 chapters, respectively.

### Question 4. Among the three stemming algorithms -- Porter, Lancaster, and Snowball --  which is the most aggressive, in terms of the number of words associated with each stem?

> Lancaster



### Question 5. Using the most aggressive stemmer from the previous question, what is the stem with the most associated terms?

> The stem with the most associated terms is ***'cont'***, with 34 terms.



#### <mark>The code and explanation behind each answer appear in the corresponding sections below.</mark>

### Set Up

In [1]:
import pandas as pd
import numpy as np
from glob import glob
import re
import nltk
import plotly_express as px
import configparser

config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
local_lib = config['DEFAULT']['local_lib']

data_home = data_home.replace('/', '\\')
output_dir = output_dir.replace('/', '\\')
local_lib = local_lib.replace('/', '\\')

source_files = f'{data_home}/gutenberg/eliot-set'
data_prefix = 'eliot'

OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']

import sys
sys.path.append(local_lib)
# print(local_lib)

from textparser import TextParser
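
The setup cell reads three keys from `env.ini` and then swaps path separators by hand. As a hedged alternative, the sketch below shows the same keys being read and joined with `pathlib`, which avoids the manual `replace('/', '\\')` step. The INI contents and directory names are hypothetical; the real file's values are not shown in this notebook.

```python
import configparser
from pathlib import Path

# Hypothetical env.ini contents; only the key names come from the notebook.
EXAMPLE_INI = """
[DEFAULT]
data_home = /example/data
output_dir = /example/output
local_lib = /example/lib
"""

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)

# Path objects join with "/" on every platform, so no separator swap is needed.
data_home = Path(config["DEFAULT"]["data_home"])
source_files = data_home / "gutenberg" / "eliot-set"

print(source_files.as_posix())
```

Whether this is preferable depends on how `TextParser` expects its path argument; the notebook passes plain strings, so `str(source_files)` would be needed at that boundary.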

In [2]:
clip_pats = [
    r"\*\*\*\s*START OF",
    r"\*\*\*\s*END OF"
]

# All are 'chap' and 'm'
roman = '[IVXLCM]+'
caps = "[A-Z';, -]+"
ohco_pat_list = [
    (145, rf"^\s*CHAPTER\s+{roman}\.\s*$"),
#     (507, rf"^\s*Chapter\s+{roman}\s"),
    (507, rf"^\s*Chapter\s+{roman}\b$"),
    (6688, rf"^\s*Chapter\s+{roman}\.\s*$")
]

### Register

We register each source file in a library table, `LIB`.

In [3]:
source_file_list = sorted(glob(f"{source_files}/*.*"))

book_data = []
for source_file_path in source_file_list:
    source_file_path = source_file_path.replace('/', '\\')
#     print(source_file_path)
    book_id = int(source_file_path.split('-')[-1].split('.')[0].replace('pg',''))
    book_title = source_file_path.split('\\')[-1].split('-')[0].replace('_', ' ')
    book_data.append((book_id, source_file_path, book_title))
    
LIB = pd.DataFrame(book_data, columns=['book_id','source_file_path','raw_title'])\
    .set_index('book_id').sort_index()

try:
    LIB['author'] = LIB.raw_title.apply(lambda x: ', '.join(x.split()[:2]))
    LIB['title'] = LIB.raw_title.apply(lambda x: ' '.join(x.split()[2:]))
    LIB = LIB.drop('raw_title', axis=1)
except AttributeError:
    pass

LIB['chap_regex'] = LIB.index.map(pd.Series({x[0]:x[1] for x in ohco_pat_list}))
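
The registration step relies on a file-naming convention: author and title joined by underscores, then `-pg<id>.txt`. A minimal sketch of that parsing as a standalone function follows. The function name `parse_source_path` and the directory in the example are hypothetical, and it assumes, as the cell above does, that the first two words of the file name are the author.

```python
from pathlib import PureWindowsPath

def parse_source_path(path_str):
    """Split a source file name into (book_id, author, title)."""
    # PureWindowsPath accepts both "\\" and "/" separators on any OS.
    stem = PureWindowsPath(path_str).stem   # file name without ".txt"
    raw_title, _, id_part = stem.rpartition("-")
    book_id = int(id_part.replace("pg", ""))
    words = raw_title.replace("_", " ").split()
    # Assumption: the first two words are the author's name, the rest the title.
    author = ", ".join(words[:2])
    title = " ".join(words[2:])
    return book_id, author, title

# Hypothetical directory; the file name follows the pattern seen in this corpus.
print(parse_source_path(r"C:\data\eliot-set\ELIOT_GEORGE_ADAM_BEDE-pg507.txt"))
```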


### <mark>Question 1.</mark>

The regular expression used to chunk _Middlemarch_ into chapters is ```^\s*CHAPTER\s+[IVXLCM]+\.\s*$```
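
To illustrate what this pattern accepts and rejects, here is a small check on a made-up sample. It assumes the pattern is tested one line at a time; how `TextParser` applies it internally is not shown in this notebook.

```python
import re

# Pattern copied from the answer above.
CHAP_PAT = re.compile(r"^\s*CHAPTER\s+[IVXLCM]+\.\s*$")

# Hypothetical sample text, not taken from the actual Gutenberg file.
sample = """PRELUDE.

CHAPTER I.

Miss Brooke had that kind of beauty...

  CHAPTER II.

"Did you see the CHAPTER IV. heading?" she asked.

Chapter III.
"""

# Keep only the lines that match the whole pattern.
headings = [line.strip() for line in sample.splitlines() if CHAP_PAT.match(line)]
print(headings)
```

The quoted sentence is rejected because the pattern is anchored at both ends of the line, and `Chapter III.` is rejected because the match is case-sensitive, which is why the other two novels need their own patterns.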

In [4]:
LIB

| book_id | source_file_path | author | title | chap_regex |
|---|---|---|---|---|
| 145 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | MIDDLEMARCH | `^\s*CHAPTER\s+[IVXLCM]+\.\s*$` |
| 507 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | ADAM BEDE | `^\s*Chapter\s+[IVXLCM]+\b$` |
| 6688 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | THE MILL ON THE FLOSS | `^\s*Chapter\s+[IVXLCM]+\.\s*$` |


### Tokenize Corpus

We tokenize each book and add each `TOKENS` table to a list to be concatenated into a single `CORPUS`.

In [5]:
def tokenize_collection(LIB):

    clip_pats = [
        r"\*\*\*\s*START OF",
        r"\*\*\*\s*END OF"
    ]

    books = []
    for book_id in LIB.index:

        # Announce
        print("Tokenizing", book_id, LIB.loc[book_id].title)

        # Define vars
        chap_regex = LIB.loc[book_id].chap_regex
        ohco_pats = [('chap', chap_regex, 'm')]
        src_file_path = LIB.loc[book_id].source_file_path

        # Create object
        text = TextParser(src_file_path, ohco_pats=ohco_pats, clip_pats=clip_pats, use_nltk=True)

        # Define parameters
        text.verbose = True
        text.strip_hyphens = True
        text.strip_whitespace = True

        # Parse
        text.import_source().parse_tokens();

        # Name things
        text.TOKENS['book_id'] = book_id
        text.TOKENS = text.TOKENS.reset_index().set_index(['book_id'] + text.OHCO)

        # Add to list
        books.append(text.TOKENS)
        
    # Combine into a single dataframe
    CORPUS = pd.concat(books).sort_index()

    # Clean up
    del(books)
    del(text)
        
    print("Done")
        
    return CORPUS
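
The parsing done inside `TextParser` is not shown in this notebook. As a rough illustration of the OHCO idea only, the sketch below builds a token table indexed by chapter, paragraph, and token position using plain `re` and pandas. It is not the library's implementation: it skips the sentence level, splits tokens on whitespace rather than with NLTK, and the function name `toy_parse` is hypothetical.

```python
import re
import pandas as pd

def toy_parse(text, chap_pat):
    """Build a token table indexed by chapter, paragraph, and token position."""
    rows = []
    chap_num = 0
    # Paragraphs are assumed to be separated by blank lines.
    for para in re.split(r"\n\s*\n", text.strip()):
        if re.match(chap_pat, para):
            chap_num += 1          # a milestone line starts a new chapter
            para_num = 0
            continue
        if chap_num == 0:
            continue               # skip front matter before the first chapter
        for token_num, token in enumerate(para.split()):
            term = re.sub(r"\W+", "", token).lower()   # crude normalization
            rows.append((chap_num, para_num, token_num, token, term))
        para_num += 1
    cols = ["chap_num", "para_num", "token_num", "token_str", "term_str"]
    return pd.DataFrame(rows, columns=cols).set_index(cols[:3])

# Hypothetical sample text.
sample = "Title\n\nCHAPTER I.\n\nFirst paragraph here.\n\nSecond one.\n\nCHAPTER II.\n\nThe end."
TOKENS = toy_parse(sample, r"^\s*CHAPTER\s+[IVXLCM]+\.\s*$")
print(TOKENS)
```

Each row's index records where the token sits in the hierarchy, which is what lets later cells aggregate by `book_id` or by chapter with a simple `groupby`.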

In [6]:
CORPUS = tokenize_collection(LIB)

Tokenizing 145 MIDDLEMARCH
Importing  C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\eliot-set\ELIOT_GEORGE_MIDDLEMARCH-pg145.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^\s*CHAPTER\s+[IVXLCM]+\.\s*$
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by NLTK model
Parsing OHCO level 3 token_num by NLTK model
Tokenizing 507 ADAM BEDE
Importing  C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\eliot-set\ELIOT_GEORGE_ADAM_BEDE-pg507.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^\s*Chapter\s+[IVXLCM]+\b$
line_str chap_str
Index(['chap_str'], dtype='object')
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by NLTK model
Parsing OHCO level 3 token_num by NLTK model
Tokenizing 6688 THE MILL ON THE FLOSS
Importing  C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\eliot-set\ELIOT_GEORGE_THE_MILL_ON_THE_FLOSS-pg6688.txt
Clipping text
Parsing OHCO level 0 chap_id by

### Extract some features for `LIB`

In [7]:
LIB['book_len'] = CORPUS.groupby('book_id').term_str.count()

### <mark>Question 2.</mark>

_Middlemarch_ has the most tokens, with 317305, as shown below.

In [8]:
LIB.sort_values('book_len')

| book_id | source_file_path | author | title | chap_regex | book_len |
|---|---|---|---|---|---|
| 6688 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | THE MILL ON THE FLOSS | `^\s*Chapter\s+[IVXLCM]+\.\s*$` | 207461 |
| 507 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | ADAM BEDE | `^\s*Chapter\s+[IVXLCM]+\b$` | 215404 |
| 145 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | MIDDLEMARCH | `^\s*CHAPTER\s+[IVXLCM]+\.\s*$` | 317305 |


### <mark>Question 3.</mark>

The number of chapters in _Middlemarch_ is 86.

_Adam Bede_ and _The Mill on the Floss_ have 55 and 58 chapters, respectively.

In [9]:
LIB['n_chaps'] = CORPUS.reset_index()[['book_id','chap_id']]\
    .drop_duplicates()\
    .groupby('book_id').chap_id.count()

LIB

| book_id | source_file_path | author | title | chap_regex | book_len | n_chaps |
|---|---|---|---|---|---|---|
| 145 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | MIDDLEMARCH | `^\s*CHAPTER\s+[IVXLCM]+\.\s*$` | 317305 | 86 |
| 507 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | ADAM BEDE | `^\s*Chapter\s+[IVXLCM]+\b$` | 215404 | 55 |
| 6688 | `C:\DS5001\DS5001_2024_01_R\..\data\gutenberg\e...` | ELIOT, GEORGE | THE MILL ON THE FLOSS | `^\s*Chapter\s+[IVXLCM]+\.\s*$` | 207461 | 58 |
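
Both `book_len` and `n_chaps` are group-by aggregations over the corpus index. The sketch below reproduces the two calculations on a hypothetical miniature corpus (`CORPUS_TOY`), using `nunique()` as a shorter equivalent of the `drop_duplicates()` followed by `count()` used above.

```python
import pandas as pd

# Hypothetical miniature corpus: two books, indexed like the real one.
idx = pd.MultiIndex.from_tuples(
    [(1, 1, 0), (1, 1, 1), (1, 2, 0), (2, 1, 0), (2, 1, 1), (2, 1, 2)],
    names=["book_id", "chap_id", "token_num"],
)
CORPUS_TOY = pd.DataFrame({"term_str": ["a", "b", "c", "d", "e", "f"]}, index=idx)

# Tokens per book: count rows within each book_id group.
book_len = CORPUS_TOY.groupby("book_id").term_str.count()

# Chapters per book: count distinct chap_id values within each book.
n_chaps = CORPUS_TOY.reset_index().groupby("book_id").chap_id.nunique()

print(pd.DataFrame({"book_len": book_len, "n_chaps": n_chaps}))
```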


### Extract VOCAB

Extract a vocabulary from the CORPUS as a whole.

### Handle Anomalies

NLTK's POS tagger is not perfect -- note the classification of punctuation as nouns, verbs, etc. We remove these from our corpus.

In [10]:
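# Handle anomalies: drop tokens whose term string is empty
# .copy() avoids assigning a new column to a slice of the original frame
CORPUS = CORPUS[CORPUS.term_str != ''].copy()
CORPUS['pos_group'] = CORPUS.pos.str[:2]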

VOCAB = CORPUS.term_str.value_counts().to_frame('n').sort_index()
VOCAB.index.name = 'term_str'
VOCAB['n_chars'] = VOCAB.index.str.len()
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['i'] = -np.log2(VOCAB.p)
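
The `p` and `i` columns are the relative frequency of each term and its self-information in bits, `i = -log2(p)`. A small worked example on hypothetical counts (chosen as powers of two so the results are easy to check by hand):

```python
import numpy as np
import pandas as pd

# Hypothetical term counts.
VOCAB_TOY = pd.DataFrame({"n": [4, 2, 1, 1]}, index=["the", "of", "dorothea", "zeal"])
VOCAB_TOY.index.name = "term_str"

VOCAB_TOY["p"] = VOCAB_TOY.n / VOCAB_TOY.n.sum()   # relative frequency
VOCAB_TOY["i"] = -np.log2(VOCAB_TOY.p)             # self-information in bits

print(VOCAB_TOY)
```

Rarer terms carry more bits: a term seen in half of all tokens has `i` of 1, while one seen in an eighth of them has `i` of 3.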

### Add Stems

In [11]:
from nltk.stem.porter import PorterStemmer
stemmer1 = PorterStemmer()
VOCAB['stem_porter'] = VOCAB.apply(lambda x: stemmer1.stem(x.name), 1)

from nltk.stem.snowball import SnowballStemmer
stemmer2 = SnowballStemmer("english")
VOCAB['stem_snowball'] = VOCAB.apply(lambda x: stemmer2.stem(x.name), 1)

from nltk.stem.lancaster import LancasterStemmer
stemmer3 = LancasterStemmer()
VOCAB['stem_lancaster'] = VOCAB.apply(lambda x: stemmer3.stem(x.name), 1)

VOCAB[VOCAB.stem_porter != VOCAB.stem_snowball]

| term_str | n | n_chars | p | i | stem_porter | stem_snowball | stem_lancaster |
|---|---|---|---|---|---|---|---|
| abjectly | 1 | 8 | 0.000001 | 19.497458 | abjectli | abject | abject |
| abruptly | 16 | 8 | 0.000022 | 15.497458 | abruptli | abrupt | abrupt |
| abstractedly | 3 | 12 | 0.000004 | 17.912496 | abstractedli | abstract | abstract |
| abundantly | 4 | 10 | 0.000005 | 17.497458 | abundantli | abund | abund |
| accordingly | 11 | 11 | 0.000015 | 16.038027 | accordingli | accord | accord |
| ... | ... | ... | ... | ... | ... | ... | ... |
| yeswellyou | 1 | 10 | 0.000001 | 19.497458 | yeswelly | yeswellyou | yeswellyou |
| yous | 3 | 4 | 0.000004 | 17.912496 | you | yous | yo |
| zealous | 10 | 7 | 0.000014 | 16.175530 | zealou | zealous | zeal |
| æschylus | 2 | 8 | 0.000003 | 18.497458 | æschylu | æschylus | æschylus |
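
To look at individual words rather than the whole vocabulary, the three stemmers can be called directly. This sketch uses the same NLTK classes as the cell above on three terms taken from the output table; it assumes an NLTK version that behaves like the one used for this notebook.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

stemmers = {
    "porter": PorterStemmer(),
    "snowball": SnowballStemmer("english"),
    "lancaster": LancasterStemmer(),
}

# Terms taken from the output table above.
results = {}
for term in ["abruptly", "zealous", "yous"]:
    results[term] = {name: s.stem(term) for name, s in stemmers.items()}
    print(term, results[term])
```

In the table, Lancaster reduces `zealous` to `zeal` and `yous` to `yo`, while Snowball leaves both unchanged, a first hint that Lancaster cuts more deeply.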


### <mark>Question 4.</mark>

_Lancaster_ is the most aggressive stemming algorithm.

The _Porter_, _Snowball_, and _Lancaster_ algorithms produce 17540, 17203, and 14612 unique stems, respectively.
The number of terms changed by stemming is 16823, 16617, and 19401, respectively.
The average number of terms associated with each stem is 1.50, 1.53, and 1.80, respectively.
_Lancaster_ therefore changes the most terms and collapses them into the fewest unique stems, giving it the most terms per stem of the three algorithms.

In [13]:
unique_porter = VOCAB['stem_porter'].nunique()
unique_snowball = VOCAB['stem_snowball'].nunique()
unique_lancaster = VOCAB['stem_lancaster'].nunique()

comp1 = VOCAB[VOCAB.index != VOCAB.stem_porter]
comp2 = VOCAB[VOCAB.index != VOCAB.stem_snowball]
comp3 = VOCAB[VOCAB.index != VOCAB.stem_lancaster]

n1 = len(comp1)
n2 = len(comp2)
n3 = len(comp3)

porter_grouped = VOCAB.groupby('stem_porter').size().reset_index(name='n_terms')
avg1 = porter_grouped['n_terms'].mean()
snowball_grouped = VOCAB.groupby('stem_snowball').size().reset_index(name='n_terms')
avg2 = snowball_grouped['n_terms'].mean()
lancaster_grouped = VOCAB.groupby('stem_lancaster').size().reset_index(name='n_terms')
avg3 = lancaster_grouped['n_terms'].mean()

comparison = {
    'Stemmer': ['Porter', 'Snowball', 'Lancaster'],
    'Unique_Stems': [unique_porter, unique_snowball, unique_lancaster],
    '# of changed term_str': [n1, n2, n3],
    'Average of # of associated words with each stem' : [avg1, avg2, avg3]
}

comparison = pd.DataFrame(comparison)

comparison

| | Stemmer | Unique_Stems | # of changed term_str | Average of # of associated words with each stem |
|---|---|---|---|---|
| 0 | Porter | 17540 | 16823 | 1.501539 |
| 1 | Snowball | 17203 | 16617 | 1.530954 |
| 2 | Lancaster | 14612 | 19401 | 1.802423 |
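
The unique-stem count and the average terms per stem carry the same information: every term maps to exactly one stem, so the mean group size equals the vocabulary size divided by the number of unique stems. The three rows above are consistent with a vocabulary of about 26,337 terms. A small sketch on hypothetical stem assignments (the stemmer names and stems here are invented for illustration):

```python
import pandas as pd

# Hypothetical term-to-stem assignments for two imaginary stemmers.
toy = pd.DataFrame(
    {
        "stem_mild": ["content", "contain", "contin", "run", "run"],
        "stem_harsh": ["cont", "cont", "cont", "run", "run"],
    },
    index=["content", "contains", "continue", "run", "running"],
)

summary = {}
for col in toy.columns:
    n_stems = toy[col].nunique()
    mean_terms = toy.groupby(col).size().mean()
    # Every term has exactly one stem, so the mean group size is V / n_stems.
    assert abs(mean_terms - len(toy) / n_stems) < 1e-12
    summary[col] = (n_stems, mean_terms)
    print(col, n_stems, mean_terms)
```

The count of changed terms is a separate signal, since a stemmer could alter many terms without merging them.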


### <mark>Question 5.</mark>

The stem with the most associated terms is ***'cont'***, with 34 terms, as shown below.

In [14]:
# lancaster is the most aggressive algorithm
lancaster_grouped = VOCAB.groupby('stem_lancaster').size().reset_index(name='n_terms')
tmp = lancaster_grouped.loc[lancaster_grouped['n_terms'].idxmax()]

stem = tmp['stem_lancaster']
cnt = tmp['n_terms']

print(stem, ': ', cnt)
lancaster_grouped[lancaster_grouped['stem_lancaster'] == stem]

cont :  34


| | stem_lancaster | n_terms |
|---|---|---|
| 2481 | cont | 34 |
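
A natural follow-up is to list which terms share the top stem. The sketch below shows one way to do that with `value_counts()` and a boolean mask. The vocabulary is hypothetical and its stems are assigned by hand, so it does not show Lancaster's actual output for these words.

```python
import pandas as pd

# Hypothetical vocabulary with a hand-assigned stem column.
VOCAB_TOY = pd.DataFrame(
    {"stem": ["cont", "cont", "cont", "run", "run", "zeal"]},
    index=pd.Index(
        ["content", "contain", "continue", "run", "running", "zealous"],
        name="term_str",
    ),
)

counts = VOCAB_TOY["stem"].value_counts()   # number of terms per stem
top_stem = counts.idxmax()                  # stem with the most terms
top_terms = VOCAB_TOY.index[VOCAB_TOY["stem"] == top_stem].tolist()

print(top_stem, counts.max(), top_terms)
```

Applied to the real table, the same mask on `stem_lancaster` would return the 34 terms grouped under `cont`.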
