# Metadata

```yaml
Course:   DS 5001 
Module:   02 Homework KEY
Topic:    Text Models
Author:   R.C. Alvarado
Date:     30 January 2023 (revised)
```

**Overview**

Students parse a second text following the pattern of the first, and then combine them to create a corpus.

With the corpus, students observe basic descriptive statistical features.

# Instructions

In this exercise, you will convert a different text from raw text into a data frame of tokens while preserving its OHCO. Then you will extract some statistical features from the resulting corpus.

Follow these instructions:

1. Download the attached Gutenberg version of Jane Austen's _Sense and Sensibility_ (`pg161.txt`).
2. Create a notebook to convert the raw text into a data frame of tokens, just as we did with _Persuasion_. You may use the notebook from the lab as your guide.
3. Specifically, make sure you complete these tasks:
    1. Remove Gutenberg's front and back matter using the lines that indicate the start and end of the project.
    2. Chunk by chapter, using the pattern of locating the headers in the data frame, assigning them numbers, forward-filling those numbers, and then grouping by number (and cleaning up).
    3. Split the resulting data frame into paragraphs using the regex provided.
    4. Split the resulting data frame into sentences using the regex provided.
    5. Split the resulting data frame into tokens using the regex provided.
    6. Be sure to include the OHCO of Chapters, Paragraphs, and Sentences in your data frame's index.
4. Once you have done this, combine both _Persuasion_ and _Sense and Sensibility_ into a single data frame with an appropriately modified OHCO list. In other words, make sure your index includes a new index level for the book. Use the attached CSV (`austen-persuasion.csv`) to get the _Persuasion_ data and then import it into your notebook as a data frame.
5. From the combined data frame, extract a vocabulary, i.e. a data frame with term string as index, along with term frequency and term length as features.
6. After you have done all this, answer the following questions by extracting features from the corpus.
    1. How many raw tokens are in the combined data frame?
    2. How many distinct terms are there in the combined data frame (i.e. how big is the vocabulary)?
    3. How many more terms does the vocabulary of _Sense and Sensibility_ have than that of _Persuasion_?
    4. What is the average number of tokens, rounded to an integer, per chapter in the corpus?
    5. What is the average number of tokens, rounded to an integer, per paragraph in the corpus?

## Summary

* Convert `pg161.txt` into `austen-sense.csv` with OHCO of chapters, paragraphs, sentences, and tokens.
* Combine tokenized dataframes of `austen-sense.csv` and `austen-persuasion.csv` into `austen-combo.csv`.
* Extract a vocabulary with term frequencies.
* Answer the questions.

# Set Up

In [94]:
import pandas as pd
import seaborn as sns

In [95]:
sns.set()

In [96]:
data_home = '../../../repo/lessons/data'

In [97]:
text_file = f"{data_home}/gutenberg/pg161.txt" 
csv_file1 = f"{data_home}/output/austen-sense.csv" # To be created
csv_file2 = f"{data_home}/output/austen-persuasion.csv" # Already created
csv_combo = f"{data_home}/output/austen-combo.csv" # To be created
OHCO = ['chap_num', 'para_num', 'sent_num', 'token_num']

# Import Text

Import _Sense and Sensibility_ into a dataframe.

In [98]:
LINES = pd.DataFrame(open(text_file, 'r', encoding='utf-8-sig').readlines(), 
    columns=['line_str'])
LINES.index.name = 'line_num'
LINES.line_str = LINES.line_str.str.strip()

In [99]:
LINES.sample(10)

line_num,line_str
3690,"they say it is a sweet pretty place."""
9395,since it can advance him so little towards wha...
5744,"arrival, without once stirring from her seat, ..."
12762,"1.E.1. The following sentence, with active li..."
8713,"barbarous have I been to you!--you, who have b..."
5656,of her real situation with respect to him.
5911,"circumstances, it was better for both that the..."
2804,Mrs. Dashwood was sorry for what she had said;...
3870,"""Is Mr. Willoughby much known in your part of ..."
2501,will always be welcome; for I will not press y...


# Extract title of work from first line

In [100]:
title = LINES.loc[0].line_str.replace('The Project Gutenberg EBook of ', '')

In [101]:
title

'Sense and Sensibility, by Jane Austen'

In [102]:
LINES.head()

line_num,line_str
0,The Project Gutenberg EBook of Sense and Sensi...
1,
2,This eBook is for the use of anyone anywhere a...
3,almost no restrictions whatsoever. You may co...
4,re-use it under the terms of the Project Guten...


# Remove Gutenberg's front and back matter

In [103]:
a = LINES.line_str.str.match(r"\*\*\*\s*START OF (THE|THIS) PROJECT")
b = LINES.line_str.str.match(r"\*\*\*\s*END OF (THE|THIS) PROJECT")

In [104]:
an = LINES.loc[a].index[0]
bn = LINES.loc[b].index[0]

In [105]:
LINES = LINES.loc[an + 1 : bn - 2].copy() # Copy so later column assignments do not act on a slice

In [106]:
LINES

line_num,line_str
20,
21,
22,
23,
24,
...,...
12661,
12662,
12663,
12664,


# Chunk by chapter

## Find all chapter headers

In [107]:
chap_lines = LINES.line_str.str.match(r"^\s*(chapter|letter)\s+(\d+)", case=False)

In [108]:
LINES.loc[chap_lines]

line_num,line_str
42,CHAPTER 1
196,CHAPTER 2
399,CHAPTER 3
561,CHAPTER 4
756,CHAPTER 5
858,CHAPTER 6
986,CHAPTER 7
1112,CHAPTER 8
1244,CHAPTER 9
1448,CHAPTER 10


## Assign numbers to chapters

In [109]:
chap_nums = [i+1 for i in range(LINES.loc[chap_lines].shape[0])]

In [110]:
LINES.loc[chap_lines, 'chap_num'] = chap_nums

## Forward-fill chapter numbers to following text lines

In [111]:
LINES.chap_num = LINES.chap_num.ffill()

## Clean up

In [112]:
# Remove everything before Chapter 1 (lines that still have no chapter number)
LINES = LINES.dropna(subset=['chap_num'])
LINES = LINES.loc[~chap_lines] # Remove the chapter heading lines themselves
LINES.chap_num = LINES.chap_num.astype('int') # Convert chap_num from float to int

In [113]:
LINES.sample(10)

line_num,line_str,chap_num
3994,"demands which this politeness made on it, was ...",21
10943,"happy, and afterwards returned to town to be g...",44
8132,,35
7473,,33
1436,"destroyed all its ingenuity.""",9
602,perceive how you could express yourself more w...,4
7157,"Mrs. Jennings, who knew nothing of all this, w...",32
11596,"In the evening, when they were all three toget...",47
10595,,44
12002,,49
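
The whole chunking pattern (find headers, number them, forward-fill, clean up, group) can be seen end to end on a few invented lines. This is a minimal sketch, assuming a toy frame named `toy` and a simplified header regex; it follows the same steps as the cells above but is not the notebook's data.

```python
import pandas as pd

# Invented lines (assumption): one line of front matter and two short chapters.
toy = pd.DataFrame({"line_str": [
    "Preface line", "CHAPTER 1", "First line.", "Second line.",
    "CHAPTER 2", "Third line.",
]})

# 1. Find the header lines.
is_header = toy.line_str.str.match(r"^\s*chapter\s+\d+", case=False)

# 2. Number the headers 1..n; every other row gets NaN.
toy.loc[is_header, "chap_num"] = list(range(1, int(is_header.sum()) + 1))

# 3. Forward-fill so each text line inherits the number of the header above it.
toy["chap_num"] = toy.chap_num.ffill()

# 4. Clean up: drop lines before the first header, then the headers themselves.
toy = toy.dropna(subset=["chap_num"])
toy = toy.loc[~is_header].copy()
toy["chap_num"] = toy.chap_num.astype(int)

# 5. Group by chapter number and join the lines of each chapter.
chaps = toy.groupby("chap_num").line_str.apply(lambda x: "\n".join(x)).to_frame("chap_str")
print(chaps)
```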


## Group lines by chapter num 

In [114]:
CHAPS = LINES.groupby(OHCO[:1]).line_str.apply(lambda x: '\n'.join(x)).to_frame('chap_str')

In [115]:
CHAPS.head()

chap_num,chap_str
1,\n\nThe family of Dashwood had long been settl...
2,\n\nMrs. John Dashwood now installed herself m...
3,\n\nMrs. Dashwood remained at Norland several ...
4,"\n\n""What a pity it is, Elinor,"" said Marianne..."
5,"\n\nNo sooner was her answer dispatched, than ..."


# Split into paragraphs 

In [116]:
PARAS = CHAPS['chap_str'].str.split(r'\n\n+', expand=True).stack()\
    .to_frame('para_str')
PARAS.index.names = OHCO[:2] 

In [117]:
PARAS.head()

chap_num,para_num,para_str
1,0,
1,1,The family of Dashwood had long been settled i...
1,2,"By a former marriage, Mr. Henry Dashwood had o..."
1,3,"The old gentleman died: his will was read, and..."
1,4,"Mr. Dashwood's disappointment was, at first, s..."


In [118]:
PARAS['para_str'] = PARAS['para_str'].str.replace(r'\n', ' ', regex=True).str.strip()
PARAS = PARAS[~PARAS['para_str'].str.match(r'^\s*$')] # Remove empty paragraphs

In [119]:
PARAS.head()

chap_num,para_num,para_str
1,1,The family of Dashwood had long been settled i...
1,2,"By a former marriage, Mr. Henry Dashwood had o..."
1,3,"The old gentleman died: his will was read, and..."
1,4,"Mr. Dashwood's disappointment was, at first, s..."
1,5,His son was sent for as soon as his danger was...
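
The `split(..., expand=True).stack()` idiom turns one row per chapter into one row per paragraph, with the paragraph position becoming a new index level. Here is a minimal sketch on two invented chapter strings; the name `chaps` and its contents are assumptions. The explicit `dropna()` is an addition not present in the notebook: it guards against pandas versions in which `stack()` keeps the padding values that `expand=True` creates for shorter chapters.

```python
import pandas as pd

# Invented chapter strings (assumption), shaped like the notebook's chap_str values.
chaps = pd.DataFrame(
    {"chap_str": ["\n\nPara one,\nline two.\n\nPara two.", "\n\nOnly para."]},
    index=pd.Index([1, 2], name="chap_num"),
)

# Split on blank lines: one column per paragraph, then stack columns into rows.
paras = (
    chaps.chap_str.str.split(r"\n\n+", expand=True)
    .stack()
    .dropna()  # drop padding cells, in case stack() has not already done so
    .to_frame("para_str")
)
paras.index.names = ["chap_num", "para_num"]

# Flatten internal line breaks and drop the empty leading paragraph.
paras["para_str"] = paras.para_str.str.replace(r"\n", " ", regex=True).str.strip()
paras = paras[~paras.para_str.str.match(r"^\s*$")]
print(paras)
```

Because the empty paragraph at position 0 is removed after numbering, `para_num` starts at 1 in each chapter, as in the output above.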


# Split into sentences

NOTE: `"` has been added to the delimiter set in the regex passed to `split()`.

In [72]:
SENTS = PARAS['para_str'].str.split(r'[.?!;:"]+', expand=True).stack()\
    .to_frame().rename(columns={0:'sent_str'})
SENTS.index.names = OHCO[:3]
SENTS = SENTS[~SENTS['sent_str'].str.match(r'^\s*$')] # Remove empty sentences
SENTS.sent_str = SENTS.sent_str.str.strip()

In [73]:
SENTS.head()

chap_num,para_num,sent_num,sent_str
1,1,0,The family of Dashwood had long been settled i...
1,1,1,"Their estate was large, and their residence wa..."
1,1,2,The late owner of this estate was a single man...
1,1,3,"But her death, which happened ten years before..."
1,1,4,"for to supply her loss, he invited and receive..."
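
The sentence regex is a simple delimiter set, so it also splits at abbreviations such as "Mr." and at semicolons (note the fragment beginning "for to supply her loss" above). The following sketch applies the same pattern with Python's `re` module to an invented paragraph; the sample text and the name `parts` are assumptions.

```python
import re

SENT_PAT = r'[.?!;:"]+'  # same delimiter set as the notebook's split()

# Invented sample paragraph (assumption).
para = 'Mr. Dashwood was sorry; "Is he well?" she asked.'

# Split, trim whitespace, and drop empty fragments, as the notebook does.
parts = [s.strip() for s in re.split(SENT_PAT, para) if s.strip()]
print(parts)
# ['Mr', 'Dashwood was sorry', 'Is he well', 'she asked']
```

The resulting units are therefore closer to clauses than to grammatical sentences, which is worth keeping in mind when interpreting `sent_num`.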


# Split into tokens

In [74]:
TOKENS = SENTS['sent_str'].str.split(r"[\s',-]+", expand=True).stack()\
    .to_frame('token_str')
TOKENS.index.names = OHCO[:4]

In [75]:
TOKENS['term_str'] = TOKENS.token_str.str.replace(r"[\W_]+", '', regex=True).str.lower()

In [76]:
TOKENS

chap_num,para_num,sent_num,token_num,token_str,term_str
1,1,0,0,The,the
1,1,0,1,family,family
1,1,0,2,of,of
1,1,0,3,Dashwood,dashwood
1,1,0,4,had,had
...,...,...,...,...,...
50,23,0,8,and,and
50,23,0,9,Sensibility,sensibility
50,23,0,10,by,by
50,23,0,11,Jane,jane
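
To see what the two regexes do to individual words, the sketch below applies them with Python's `re` module to an invented sentence (the sample text is an assumption). The token pattern splits on whitespace, apostrophes, commas, and hyphens; the term pattern then strips any remaining non-word characters and underscores before lowercasing.

```python
import re

TOKEN_PAT = r"[\s',-]+"  # token delimiters used in the notebook
TERM_PAT = r"[\W_]+"     # characters removed to normalize a token into a term

# Invented sample sentence (assumption).
sent = "Mr Dashwood's son was _very_ well-bred"

tokens = re.split(TOKEN_PAT, sent)
terms = [re.sub(TERM_PAT, "", t).lower() for t in tokens]

print(tokens)
# ['Mr', 'Dashwood', 's', 'son', 'was', '_very_', 'well', 'bred']
print(terms)
# ['mr', 'dashwood', 's', 'son', 'was', 'very', 'well', 'bred']
```

Note that possessives and hyphenated compounds become separate tokens under this pattern, which raises the raw token count relative to a whitespace-only split.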


# Save work to CSV

In [77]:
TOKENS.to_csv(csv_file1)

# Combine the two into a Corpus

In [78]:
csv_file2 = f"{data_home}/output/austen-persuasion.csv"

In [79]:
df1 = pd.read_csv(csv_file1)
df2 = pd.read_csv(csv_file2)

In [80]:
len(df1), len(df2)

(122257, 85014)

In [81]:
df1['book_id'] = 1 # Students may use title strings as book IDs instead
df2['book_id'] = 2

In [166]:
LIB = {
    1: 'Sense & Sensibility',
    2: 'Persuasion'
}

In [83]:
CORPUS = pd.concat([df1, df2])

In [84]:
OHCO2 = ['book_id'] + OHCO

In [85]:
CORPUS = CORPUS.set_index(OHCO2)
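
The combination step can be sketched on two tiny invented token tables. The frames `df1` and `df2` below are assumptions standing in for the CSV contents; the point is that prepending `book_id` to the OHCO list keeps every row uniquely addressable after concatenation.

```python
import pandas as pd

OHCO = ["chap_num", "para_num", "sent_num", "token_num"]

# Invented token tables (assumption), shaped like the CSVs after read_csv().
df1 = pd.DataFrame({
    "chap_num": [1, 1], "para_num": [1, 1], "sent_num": [0, 0],
    "token_num": [0, 1], "token_str": ["The", "family"],
})
df2 = pd.DataFrame({
    "chap_num": [1, 1], "para_num": [1, 1], "sent_num": [0, 0],
    "token_num": [0, 1], "token_str": ["Sir", "Walter"],
})

# Tag each table with its book before stacking them.
df1["book_id"] = 1
df2["book_id"] = 2

# The book level goes first because it is the outermost container.
OHCO2 = ["book_id"] + OHCO
corpus = pd.concat([df1, df2]).set_index(OHCO2)

print(corpus.loc[2])  # selects one book by its first index level
```

Without the `book_id` level, the two books would share identical index tuples and the index would no longer be unique.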

In [86]:
# CORPUS.sample(10)

In [87]:
len(CORPUS), CORPUS.shape[0], CORPUS.token_str.count()

(207271, 207271, 205599)
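
The third number is smaller because `count()` skips missing values while `len()` and `shape[0]` count every row. The notebook does not say where the missing values come from; one plausible source, offered here as an assumption, is that `pd.read_csv` by default reads empty fields and strings such as `null` or `NA` as `NaN`. A minimal sketch with an invented CSV:

```python
import io
import pandas as pd

# Invented CSV (assumption): one empty token and one token spelled "null".
csv_text = "token_num,token_str\n0,The\n1,\n2,null\n3,family\n"

df = pd.read_csv(io.StringIO(csv_text))
print(len(df), df.token_str.count())  # 4 2

# keep_default_na=False disables the default missing-value markers,
# so both fields come back as ordinary strings.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
print(kept.token_str.tolist())  # ['The', '', 'null', 'family']
```

Whether rows with missing tokens count as "raw tokens" is a definitional choice; the answer to question 1 below counts all rows.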

# Extract a vocabulary $V$

In [88]:
CORPUS['term_str'] = CORPUS.token_str.str.replace(r"\W+", "", regex=True).str.lower()
V = CORPUS.term_str.value_counts().to_frame('n')
V.index.name = 'term_str'
V['n_chars'] = V.index.str.len()

In [89]:
len(V)

8239

In [90]:
V.n_chars.mean()

7.5543148440344705
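
The vocabulary table is a frequency count with the term as index. A minimal sketch on an invented term list (the sample data is an assumption) shows the shape of the result:

```python
import pandas as pd

# Invented term sequence (assumption).
terms = pd.Series(
    ["the", "family", "of", "the", "dashwood", "the", "of"], name="term_str"
)

# One row per distinct term, with its frequency as column n.
V = terms.value_counts().to_frame("n")
V.index.name = "term_str"

# Term length in characters, computed from the index itself.
V["n_chars"] = V.index.str.len()

print(V)
print(len(V))  # vocabulary size: 4
```

Note that `V.n_chars.mean()` averages over distinct terms, not over tokens, so long rare words weigh as much as short frequent ones.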

In [91]:
CORPUS

book_id,chap_num,para_num,sent_num,token_num,token_str,term_str
1,1,1,0,0,The,the
1,1,1,0,1,family,family
1,1,1,0,2,of,of
1,1,1,0,3,Dashwood,dashwood
1,1,1,0,4,had,had
...,...,...,...,...,...,...
2,24,13,0,6,of,of
2,24,13,0,7,Persuasion,persuasion
2,24,13,0,8,by,by
2,24,13,0,9,Jane,jane


In [92]:
V

term_str,n,n_chars
the,7435,3
to,6923,2
and,6290,3
of,6146,2
her,3747,3
...,...,...
unconquerable,1,13
outgrown,1,8
prosperously,1,12
nominal,1,7


# Save Combo

Do this for safe keeping.

Students are not asked to do this, so don't worry if it's not there.

In [93]:
CORPUS.to_csv(csv_combo)

# Answer Questions


## 1. How many raw tokens are in the combined data frame?

In [120]:
CORPUS.shape[0]

207271

## 2. How many distinct terms are there in the combined data frame (i.e. how big is the vocabulary)?

In [121]:
V.shape[0]

8239

## 3. How many more terms does the vocabulary of _Sense and Sensibility_ have than that of _Persuasion_?

### Method 1

In [152]:
vc_sense = CORPUS.loc[1].term_str.value_counts().shape[0]
vc_persu = CORPUS.loc[2].term_str.value_counts().shape[0]

In [153]:
vc_sense - vc_persu

520

### Method 2

Students don't have to do this, but it's a good idea to put features where they belong.

In this case, we can think of the term counts per book as features of $V$.

In [146]:
V['in_1'] = CORPUS.loc[1].term_str.value_counts()
V['in_2'] = CORPUS.loc[2].term_str.value_counts()

In [172]:
V.in_1.count() - V.in_2.count()

520

A second way to do this, which does not rely on the existence of `NA`s in the dataframe (since these may have been replaced by $0$, for example), is to convert the values to booleans and then sum them.

In [None]:
V.in_1.fillna(0).astype('bool').sum() - V.in_2.fillna(0).astype('bool').sum()

520
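
The approaches above should always agree, which can be checked on a tiny invented corpus. In this sketch the frame `corpus` and its contents are assumptions, and `nunique()` is used as a shorthand equivalent to `value_counts().shape[0]` when no terms are missing.

```python
import pandas as pd

# Invented two-book corpus (assumption): book 1 has 4 distinct terms, book 2 has 2.
rows = [
    (1, 0, "the"), (1, 1, "family"), (1, 2, "of"), (1, 3, "dashwood"), (1, 4, "the"),
    (2, 0, "the"), (2, 1, "of"), (2, 2, "of"),
]
corpus = pd.DataFrame(rows, columns=["book_id", "token_num", "term_str"])
corpus = corpus.set_index(["book_id", "token_num"])

# Method 1: compare the sizes of the per-book vocabularies directly.
diff_direct = corpus.loc[1].term_str.nunique() - corpus.loc[2].term_str.nunique()

# Method 2: store per-book counts as features of V; absent terms become NaN.
V = corpus.term_str.value_counts().to_frame("n")
V["in_1"] = corpus.loc[1].term_str.value_counts()
V["in_2"] = corpus.loc[2].term_str.value_counts()
diff_count = V.in_1.count() - V.in_2.count()

# Variant: robust to NaN having been replaced by 0.
diff_bool = V.in_1.fillna(0).astype(bool).sum() - V.in_2.fillna(0).astype(bool).sum()

print(diff_direct, diff_count, diff_bool)  # 2 2 2
```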

## 4. What is the average number of tokens, rounded to an integer, per chapter in the corpus?


In [167]:
CORPUS.groupby(OHCO2[:2]).term_str.count().mean().round().astype('int')

2778

## 5. What is the average number of tokens, rounded to an integer, per paragraph in the corpus?

In [168]:
CORPUS.groupby(OHCO2[:3]).term_str.count().mean().round().astype('int')

73
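
Both averages follow the same pattern: group by a prefix of the OHCO list, count tokens per group, then take the mean of those counts. The sketch below uses an invented corpus with the sentence level omitted for brevity (both are assumptions) so the arithmetic can be checked by hand.

```python
import pandas as pd

# Invented corpus (assumption): 9 tokens in 3 chapters and 5 paragraphs.
rows = [
    (1, 1, 1, 0, "a"), (1, 1, 1, 1, "b"), (1, 1, 2, 0, "c"),
    (1, 2, 1, 0, "d"), (1, 2, 1, 1, "e"),
    (2, 1, 1, 0, "f"), (2, 1, 2, 0, "g"), (2, 1, 2, 1, "h"), (2, 1, 2, 2, "i"),
]
levels = ["book_id", "chap_num", "para_num", "token_num"]
corpus = pd.DataFrame(rows, columns=levels + ["term_str"]).set_index(levels)

# Tokens per chapter: 3, 2, 4, so the mean is 3.0.
per_chap = corpus.groupby(levels[:2]).term_str.count()
mean_chap = int(round(per_chap.mean()))

# Tokens per paragraph: 2, 1, 2, 1, 3, so the mean is 1.8, which rounds to 2.
per_para = corpus.groupby(levels[:3]).term_str.count()
mean_para = int(round(per_para.mean()))

print(mean_chap, mean_para)  # 3 2
```

Since `count()` ignores missing `term_str` values, using `size()` instead would include those rows and could shift the averages slightly on real data.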