# M02 Homework

```yaml
Course:   DS 5001 
Author:   JiHo Lee (qxz6hb)
```

### Question 1. How many raw tokens are in the combined data frame?

> The combined data frame contains <b>207896</b> raw tokens.

### Question 2. How many distinct terms are there in the combined data frame (i.e. how big is the vocabulary)?

> The combined data frame contains <b>8239</b> distinct terms.

### Question 3. How many more terms does the vocabulary of Sense and Sensibility have than that of Persuasion?

> _Sense and Sensibility_ has <b>520</b> more terms than _Persuasion_: 6280 versus 5760.

### Question 4. What is the average number of tokens, rounded to an integer, per chapter in the corpus?

> The average number of tokens per chapter in the corpus is <b>2809</b>.

### Question 5. What is the average number of tokens, rounded to an integer, per paragraph in the corpus?

> The average number of tokens per paragraph in the corpus is <b>74</b>.

The code and explanation for each answer are given in the corresponding sections below.

### Import libraries

In [1]:
import pandas as pd

import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']
data_home = data_home.replace('/', '\\')
output_dir = output_dir.replace('/', '\\')

text_file = f"{data_home}/gutenberg/pg161.txt"
csv_file  = f"{output_dir}/austen-persuasion.csv" # The file we will create

OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']
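
The cell above swaps `/` for `\` in the configured directories and then appends further path segments with `/`, so the final paths mix both separators. A minimal sketch of one way to keep separators consistent, assuming a Windows-style layout; the `env.ini` contents and directory names here are hypothetical, since the real file is not shown in this notebook:

```python
import configparser
from pathlib import PureWindowsPath

# Hypothetical contents of env.ini (an assumption for illustration only).
ini_text = """
[DEFAULT]
data_home = C:/Users/me/data
output_dir = C:/Users/me/output
"""

config = configparser.ConfigParser()
config.read_string(ini_text)

# PureWindowsPath accepts either separator and renders backslashes throughout.
data_home = PureWindowsPath(config["DEFAULT"]["data_home"])
text_file = data_home / "gutenberg" / "pg161.txt"
print(text_file)
```

On a real machine, `pathlib.Path` would pick the right flavor for the host operating system without any manual `replace` calls.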

### Import file into a dataframe

In [2]:
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.replace(r'\n+', ' ', regex=True).str.strip()

title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

clip_pats = [
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT",
    r"\*\*\*\s*END OF (?:THE|THIS) PROJECT"
]

pat_a = LINES.line_str.str.match(clip_pats[0])
pat_b = LINES.line_str.str.match(clip_pats[1])

line_a = LINES.loc[pat_a].index[0] + 1
line_b = LINES.loc[pat_b].index[0] - 1

LINES = LINES.loc[line_a : line_b]
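
The clipping step can be tried on a few made-up lines. This is a sketch with toy data, not the real Gutenberg file; it uses the same two patterns as the cell above and shows why the boundaries are shifted by one:

```python
import pandas as pd

# Toy stand-in for a Gutenberg file (invented lines).
toy = pd.DataFrame({"line_str": [
    "The Project Gutenberg EBook of Example",
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "CHAPTER 1",
    "Some body text.",
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***",
    "License boilerplate",
]})
toy.index.name = "line_num"

start = toy.line_str.str.match(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT")
end = toy.line_str.str.match(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT")

# .loc slices by label and includes both endpoints, hence the +1 and -1
# to leave out the marker lines themselves.
line_a = toy.loc[start].index[0] + 1
line_b = toy.loc[end].index[0] - 1
body = toy.loc[line_a:line_b]
print(body.line_str.tolist())
```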

### Chunk by Chapter

#### Find all chapter headers, assign numbers to chapters, forward-fill chapter numbers to the following text lines, and clean up

The regex depends on the source text, so inspect the text first to see how its chapter headings are written.

In [3]:
chap_pat = r"^\s*(?:chapter|letter)\s+\d+"
chap_lines = LINES.line_str.str.match(chap_pat, case=False) # Returns a truth vector
LINES.loc[chap_lines, 'chap_num'] = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]
LINES.chap_num = LINES.chap_num.ffill()
LINES = LINES.dropna(subset=['chap_num']) # Remove everything before Chapter 1
# LINES = LINES.loc[~LINES.chap_num.isna()] # Remove everything before Chapter 1 (alternate method)
LINES = LINES.loc[~chap_lines] # Remove chapter heading lines; their work is done
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int
# Make big string for each chapter
CHAPS = LINES.groupby(OHCO[:1])\
    .line_str.apply(lambda x: '\n'.join(x))\
    .to_frame('chap_str')
CHAPS['chap_str'] = CHAPS.chap_str.str.strip()
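
The header-matching and forward-fill steps are easier to follow on a handful of lines. The following is a toy sketch with invented text; `body` is a name introduced here for illustration:

```python
import pandas as pd

lines = pd.DataFrame({"line_str": [
    "Title page", "CHAPTER 1", "First line.", "", "CHAPTER 2", "Second line.",
]})

is_head = lines.line_str.str.match(r"^\s*(?:chapter|letter)\s+\d+", case=False)

# Number the header rows 1..n, then carry each number down to the rows below it.
lines.loc[is_head, "chap_num"] = list(range(1, int(is_head.sum()) + 1))
lines["chap_num"] = lines.chap_num.ffill()

# Drop the front matter (no chapter number yet) and the header rows themselves.
body = lines.loc[~is_head & lines.chap_num.notna()].copy()
body["chap_num"] = body.chap_num.astype(int)
print(body)
```

The front-matter row stays `NaN` after `ffill()` because there is no earlier value to carry forward, which is what makes it easy to drop.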

### Split chapters into paragraphs 

We use Pandas' convenient `.str.split()` method with `expand=True`, followed by `.stack()`.
Note that the resulting paragraph numbers are zero-based.

In [4]:
para_pat = r'\n\n+'
PARAS = CHAPS['chap_str'].str.split(para_pat, expand=True).stack()\
    .to_frame('para_str').sort_index()
PARAS.index.names = OHCO[:2]

PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True)
PARAS['para_str'] = PARAS['para_str'].str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs
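
A small sketch of what `.str.split(..., expand=True).stack()` does, using two invented chapters. The explicit `.dropna()` is an addition for this example: depending on the pandas version, `stack()` may or may not drop the padding that `expand=True` creates for chapters with fewer paragraphs.

```python
import pandas as pd

chaps = pd.Series(
    ["Para A.\n\nPara B.", "Para C."],
    index=pd.Index([1, 2], name="chap_num"),
    name="chap_str",
)

# expand=True spreads the pieces across columns 0, 1, ...;
# stack() turns those column labels into a second index level.
paras = (
    chaps.str.split(r"\n\n+", expand=True)
    .stack()
    .dropna()
    .to_frame("para_str")
)
paras.index.names = ["chap_num", "para_num"]
print(paras)
```

The column labels produced by `expand=True` start at 0, which is where the zero-based `para_num` values come from.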

### Split paragraphs into sentences

In [5]:
# sent_pat = r'[.?!;:"]+'
sent_pat = r'[.?!;:]+'
SENTS = PARAS['para_str'].str.split(sent_pat, expand=True).stack()\
    .to_frame('sent_str')
SENTS.index.names = OHCO[:3]

SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip() # Strip whitespace; otherwise tokenizing yields blank tokens

### Split sentences into tokens

In [6]:
token_pat = r"[\s',-]+"
TOKENS = SENTS['sent_str'].str.split(token_pat, expand=True).stack()\
    .to_frame('token_str')

TOKENS.index.names = OHCO[:4]
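
To see what the token pattern does, and why sentences are stripped before tokenizing, here is a sketch that uses the standard library's `re.split` as a stand-in for the pandas `.str.split` call (an assumption that the two behave alike for this pattern). The sentence is invented:

```python
import re

token_pat = r"[\s',-]+"

# Leading and trailing whitespace produce empty strings at both ends.
raw = re.split(token_pat, " Elinor's half-sister came, then left ")
print(raw)

# Stripping first avoids the blank tokens.
tokens = re.split(token_pat, "Elinor's half-sister came, then left")
print(tokens)
```

Because the pattern includes `'` and `-`, possessives and hyphenated words are split into separate tokens ("Elinor" and "s", "half" and "sister"), which affects both the token count and the vocabulary size.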

### Combine _Persuasion_ and _Sense and Sensibility_

### <mark>Question 1.</mark>

The combined data frame contains <b>207896</b> raw tokens, one per row of `merged`.

In [7]:
csv_file  = f"{data_home}/austen-persuasion.csv"
OHCO = ['book_num', 'chap_num', 'para_num', 'sent_num', 'token_num']

def gather(df, ohco_level):
    """Index df by the first ohco_level OHCO levels, joining the tokens in each group."""
    level_name = OHCO[ohco_level-1].split('_')[0]
    return df.groupby(OHCO[:ohco_level])\
        .token_str.apply(lambda x: x.str.cat(sep=' '))\
        .to_frame(f"{level_name}_str")

# Persuasion (book 2): tokens read from CSV
temp_df = pd.read_csv(csv_file)
temp_df = temp_df[['chap_num', 'para_num', 'sent_num', 'token_num', 'token_str']]
temp_df.insert(0, "book_num", 2)
temp_df = gather(temp_df, 5)

# Sense and Sensibility (book 1): tokens built above
TOKENS.insert(0, "book_num", 1)
TOKENS = gather(TOKENS, 5)

merged = pd.concat([TOKENS, temp_df])
merged

| book_num | chap_num | para_num | sent_num | token_num | token_str |
|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | The |
| 1 | 1 | 0 | 0 | 1 | family |
| 1 | 1 | 0 | 0 | 2 | of |
| 1 | 1 | 0 | 0 | 3 | Dashwood |
| 1 | 1 | 0 | 0 | 4 | had |
| ... | ... | ... | ... | ... | ... |
| 2 | 24 | 13 | 0 | 6 | of |
| 2 | 24 | 13 | 0 | 7 | Persuasion |
| 2 | 24 | 13 | 0 | 8 | by |
| 2 | 24 | 13 | 0 | 9 | Jane |
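
As an alternative sketch (not the method used in the cell above), `pd.concat` can prepend the book level directly through its `keys` and `names` arguments, after which the raw token count is the number of rows. The helper `toy_tokens` and the word lists are invented for illustration:

```python
import pandas as pd

ohco = ["chap_num", "para_num", "sent_num", "token_num"]

def toy_tokens(words):
    """Build a tiny token table with a one-chapter, one-sentence OHCO index."""
    idx = pd.MultiIndex.from_tuples(
        [(1, 0, 0, i) for i in range(len(words))], names=ohco
    )
    return pd.DataFrame({"token_str": words}, index=idx)

book1 = toy_tokens(["The", "family", "of", "Dashwood"])
book2 = toy_tokens(["Sir", "Walter", "Elliot"])

# keys= adds an outer index level holding the book number for each frame.
merged_toy = pd.concat([book1, book2], keys=[1, 2], names=["book_num"])
n_tokens = len(merged_toy)
print(n_tokens)
```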


### <mark>Question 2.</mark>

The combined data frame contains <b>8239</b> distinct terms, one per row of `VOCAB`.

In [8]:
merged['term_str'] = merged.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB = merged.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB.index.name = 'term_id'
VOCAB

| term_id | term_str | n |
|---|---|---|
| 0 | the | 7435 |
| 1 | to | 6923 |
| 2 | and | 6290 |
| 3 | of | 6146 |
| 4 | her | 3747 |
| ... | ... | ... |
| 8234 | unconquerable | 1 |
| 8235 | outgrown | 1 |
| 8236 | prosperously | 1 |
| 8237 | nominal | 1 |
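
A toy sketch of the normalization step, using the same pattern as the cell above but applied with `.str.replace`. The tokens are invented. One thing it makes visible: a token made only of punctuation collapses to the empty string, which `value_counts()` still counts as a term. Whether the 8239 figure includes such an entry is not visible in the truncated output, so this is something to check rather than a known issue.

```python
import pandas as pd

tokens = pd.Series(["The", "the", "Dashwood's", "well_known", "--", "the"])

# Remove non-word characters and underscores, then lowercase.
terms = tokens.str.replace(r"[\W_]+", "", regex=True).str.lower()
vocab = terms.value_counts()

print(len(vocab))                      # vocabulary size, empty string included
print(terms[terms != ""].nunique())    # vocabulary size, empty string excluded
```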


### <mark>Question 3.</mark>

_Sense and Sensibility_ has <b>520</b> more terms than _Persuasion_: 6280 versus 5760.

In [9]:
TOKENS['term_str'] = TOKENS.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB1 = TOKENS.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB1.index.name = 'term_id'

sense_len = len(VOCAB1)

temp_df['term_str'] = temp_df.token_str.replace(r'[\W_]+', '', regex=True).str.lower()
VOCAB2 = temp_df.term_str.value_counts().to_frame('n').reset_index().rename(columns={'index':'term_str'})
VOCAB2.index.name = 'term_id'

per_len = len(VOCAB2)

print(sense_len, per_len)

6280 5760
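
The same comparison can be sketched in one pass with `groupby(...).nunique()`, shown here on invented data where `book_num` is an ordinary column. The last line simply rechecks the subtraction using the two vocabulary sizes printed above.

```python
import pandas as pd

toy = pd.DataFrame({
    "book_num": [1, 1, 1, 1, 2, 2, 2],
    "term_str": ["the", "family", "of", "the", "sir", "walter", "sir"],
})

# Distinct terms per book, then the difference between the two books.
vocab_sizes = toy.groupby("book_num").term_str.nunique()
toy_diff = vocab_sizes.loc[1] - vocab_sizes.loc[2]
print(vocab_sizes.to_dict(), toy_diff)

# Recheck of the reported answer from the printed sizes.
reported_diff = 6280 - 5760
print(reported_diff)
```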


### <mark>Question 4.</mark>

The average number of tokens per chapter in the corpus is <b>2809</b>.

In [10]:
average_tokens_per_chapter = merged.groupby(['book_num', 'chap_num']).size().mean()
average_tokens_per_chapter 

2809.4054054054054
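
Grouping by `book_num` as well as `chap_num` matters because chapter numbers restart in each book. A toy sketch with an invented index shows the difference:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1, 1, 0), (1, 1, 0), (1, 1, 1), (1, 2, 0), (2, 1, 0), (2, 1, 0)],
    names=["book_num", "chap_num", "para_num"],
)
toy = pd.DataFrame({"token_str": list("abcdef")}, index=idx)

# Three chapters in total: (1, 1), (1, 2) and (2, 1).
per_chapter = toy.groupby(["book_num", "chap_num"]).size()
print(per_chapter.mean())

# Grouping by chap_num alone merges chapter 1 of both books.
merged_chapters = toy.groupby("chap_num").size()
print(merged_chapters.mean())
```

With the book level included the mean is 6 tokens over 3 chapters; without it, chapter 1 of both books is counted as a single chapter and the mean is inflated.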

### <mark>Question 5.</mark>

The average number of tokens per paragraph in the corpus is <b>74</b>.

In [11]:
average_tokens_per_paragraph = merged.groupby(['book_num', 'chap_num', 'para_num']).size().mean()
average_tokens_per_paragraph 

73.74813763746009
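
As a consistency check on the answers, the mean of the group sizes equals the total token count divided by the number of groups, so dividing the total from Question 1 by each mean recovers the number of chapters and paragraphs behind it. The figures are copied from the outputs above; the variable names are introduced here.

```python
total_tokens = 207896                     # Question 1
mean_per_chapter = 2809.4054054054054     # output of In [10]
mean_per_paragraph = 73.74813763746009    # output of In [11]

# total / mean = number of groups
n_chapters = total_tokens / mean_per_chapter
n_paragraphs = total_tokens / mean_per_paragraph
print(round(n_chapters), round(n_paragraphs))   # 74 2819

# Rounded answers to Questions 4 and 5
print(round(mean_per_chapter), round(mean_per_paragraph))   # 2809 74
```

The implied 74 chapters is consistent with 50 chapters in _Sense and Sensibility_ plus 24 in _Persuasion_.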