# Metadata

```yaml
Course: DS 5001 
Module: 03: Homework KEY
Topics: Inferring and Interpreting Language Models 
Author: R.C. Alvarado
```

# Instructions

Use the following libraries and source text to answer the questions in this assessment.
  * `pg42324.txt`
  * `textimporter.py`
  * `langmod.py`

Follow this pattern:
* Create a new notebook for your work.
* Parse the _Frankenstein_ text to generate TOKENS and VOCAB tables.
* Create a list of sentences from the TOKENS table and a list of terms from the VOCAB table. 
* Pass the two lists to a `langmod.NgramCounter` object to generate ngram type tables and models, going up to the trigram level.
* Write the code to answer the following questions:
  1. List six words that precede the word "monster," excluding stop words (and sentence boundary markers). Stop words include 'a', 'an', 'the', 'this', 'that', etc. Hint: use the `df.query()` method.  
  2. List the following sentences in ascending order of bigram perplexity according to the language model generated from the text:
    ```
    The monster is on the ice.
    Flowers are happy things.
    I have never seen the aurora borealis.
    He never knew the love of a family.
    ```
  3. Using the bigram model represented as a matrix, explore the relationship between bigram pairs using the following lists. Hint: use the `.unstack()` method on the feature `n` and then use `.loc[]` to select the first list from the index, and the second list from the columns.
     1. `['he','she']` to select the indices.
     2. `['said','heard']` to select the columns.
  4. Generate 20 sentences using the `.generate_text()` method from the `langmod.NgramLanguageModel` class.
  5. Compute the redundancy $R$ for each of the n-gram models using the MLE of the joint probability of each ngram type. In other words, for each model, just use the `.mle` feature as $p$ in computing $H = \sum p(ng) \log_2(1/p(ng))$. Does $R$ increase, decrease, or remain the same as the choice of n-gram increases in length? Hint: Remember that $R = 1 - \frac{H}{H_{max}}$, where $H$ is the actual entropy of the model and $H_{max}$ is its maximum entropy. 


Hints:
* You may use the libraries or cut-and-paste code from the relevant notebooks.
* Use the `M03_LanguageModels.ipynb` to see how the objects from the libraries are used.
* The story begins with the Preface.
* Even though they are not called "chapters," treat the Preface and Letters as chapters.
* Don't worry about OOV words or creating an `<UNK>` term in your vocabulary.
* You don't have to use the "START OF PROJECT GUTENBERG ...", etc., to clip the text. Find the lines where you think the text actually begins and ends.
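
The counting step in the pattern above can be illustrated with a toy example. This is a minimal sketch, not the implementation of `langmod.NgramCounter`: the two sentences are invented, padding each sentence with one `<s>` and one `</s>` is an assumption, and the names `bigrams`, `joint`, and `cond` are hypothetical.

```python
from collections import Counter

# Toy sentences standing in for the list built from the TOKENS table
sents = ["the monster fled", "the monster spoke"]

# Count bigram types, padding each sentence with boundary markers (assumed scheme)
bigrams = Counter()
for s in sents:
    toks = ["<s>"] + s.split() + ["</s>"]
    bigrams.update(zip(toks, toks[1:]))

# Total bigram tokens, and how often each word opens a bigram
total = sum(bigrams.values())
firsts = Counter()
for (w0, w1), n in bigrams.items():
    firsts[w0] += n

# MLE of the joint probability p(w0, w1) and the conditional p(w1 | w0)
joint = {bg: n / total for bg, n in bigrams.items()}
cond = {bg: n / firsts[bg[0]] for bg, n in bigrams.items()}

print(joint[("the", "monster")])
print(cond[("monster", "fled")])
```

The joint estimate is the one Q5 asks for; the conditional estimate is what a bigram model would use to predict the next word.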

# Solution

## Config

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_home = "./"
local_lib = "./"
src_file_path = f'{data_home}/pg42324.txt'

In [3]:
import sys
sys.path.append(local_lib)

In [4]:
from textimporter import TextImporter
from langmod import NgramCounter, NgramLanguageModel

## Import Data

In [5]:
ohco_pats = [
    ('chap', r"^(?:PREFACE|CHAPTER|LETTER)\s", 'm')
]
clip_pats = [
    r"^M\. W\. S\.\s*$",
    r"^THE END\.\s*$"
]

In [6]:
franky = TextImporter(src_file_path, ohco_pats=ohco_pats, clip_pats=clip_pats)

In [7]:
franky.import_source().parse_tokens().extract_vocab();

Importing  .//pg42324.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^(?:PREFACE|CHAPTER|LETTER)\s
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by delimitter [.?!;:]+
Parsing OHCO level 3 token_num by delimitter [\s',-]+


In [8]:
franky.TOKENS

chap_id,para_num,sent_num,token_num,token_str,term_str
1,0,0,0,_To,to
1,0,0,1,Mrs,mrs
1,0,1,1,Saville,saville
1,0,1,2,England,england
1,0,2,0,_,
...,...,...,...,...,...
28,82,1,10,lost,lost
28,82,1,11,in,in
28,82,1,12,darkness,darkness
28,82,1,13,and,and


In [9]:
franky.VOCAB

term_str,n,n_chars,p,s,i,h
the,4197,3,0.055427,18.041696,4.173263,0.231312
and,2976,3,0.039302,25.443884,4.669247,0.183512
i,2852,1,0.037665,26.550140,4.730648,0.178178
of,2647,2,0.034957,28.606347,4.838263,0.169133
to,2101,2,0.027747,36.040457,5.171545,0.143493
...,...,...,...,...,...,...
overweigh,1,9,0.000013,75721.000000,16.208406,0.000214
pledge,1,6,0.000013,75721.000000,16.208406,0.000214
salvation,1,9,0.000013,75721.000000,16.208406,0.000214
timorous,1,8,0.000013,75721.000000,16.208406,0.000214


In [10]:
franky.OHCO

['chap_id', 'para_num', 'sent_num', 'token_num']

In [11]:
sents = franky.gather_tokens(2).sent_str.to_list()

In [12]:
sents[:10]

['to mrs',
 'saville england',
 '',
 'st',
 'petersburgh dec',
 '11th 17',
 'you will rejoice to hear that no disaster has accompanied the commencement of an enterprise which you have regarded with such evil forebodings',
 'i arrived here yesterday',
 'and my first task is to assure my dear sister of my welfare and increasing confidence in the success of my undertaking',
 'i am already far north of london']

In [13]:
vocab = franky.VOCAB.index.to_list()

In [14]:
vocab[:10]

['the', 'and', 'i', 'of', 'to', 'my', 'a', 'in', 'was', 'that']

In [15]:
train = NgramCounter(sents, vocab)

In [16]:
train.generate()

In [46]:
train.LM[1]

w0,w1,n,mle,mle2,p,log_p
1,</s>,2,0.000022,1.000000,0.000717,-10.445015
11th,17,1,0.000011,0.500000,0.000430,-11.181980
11th,the,1,0.000011,0.500000,0.000430,-11.181980
12th,17,1,0.000011,0.500000,0.000430,-11.181980
12th,</s>,1,0.000011,0.500000,0.000430,-11.181980
...,...,...,...,...,...,...
youthful,days,1,0.000011,0.333333,0.000430,-11.182394
youthful,lovers,2,0.000022,0.666667,0.000717,-10.445429
zeal,</s>,1,0.000011,0.250000,0.000430,-11.182808
zeal,modern,1,0.000011,0.250000,0.000430,-11.182808


In [17]:
# train.LM[2].n.unstack(fill_value=0)

## Q1

List six words that precede the word "monster," excluding stop words (and sentence boundary markers). Stop words include 'a', 'an', 'the', 'this', 'that', etc.

Hint: use the `df.query()` method.

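**<span style="color:red;">ISSUE</span>**: If you use `textimporter.py`, you get a set of six words. If you parse the text yourself, you may get a slightly different set; the by-hand parse below, for example, also returns "cried" because it ignores sentence boundaries.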

In [18]:
train.LM[1].query("w1 == 'monster'")

w0,w1,n,mle
<s>,monster,1,1.1e-05
a,monster,3,3.3e-05
abhorred,monster,1,1.1e-05
detestable,monster,1,1.1e-05
gigantic,monster,1,1.1e-05
hellish,monster,1,1.1e-05
hideous,monster,1,1.1e-05
miserable,monster,1,1.1e-05
the,monster,20,0.00022
this,monster,1,1.1e-05


```
abhorred
detestable    
gigantic      
hellish       
hideous       
miserable     
```
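
The stop words and boundary markers can also be filtered inside the query itself rather than by eye. The sketch below uses a toy table shaped like `train.LM[1]` (a DataFrame indexed by `w0` and `w1`); the rows and the `exclude` list are illustrative assumptions, not the full model or a complete stop word list.

```python
import pandas as pd

# Toy bigram table indexed by (w0, w1), mimicking the shape of train.LM[1]
BG = pd.DataFrame(
    {"n": [1, 3, 1, 1, 20, 1]},
    index=pd.MultiIndex.from_tuples(
        [("<s>", "monster"), ("a", "monster"), ("abhorred", "monster"),
         ("hideous", "monster"), ("the", "monster"), ("the", "ice")],
        names=["w0", "w1"],
    ),
)

# Hypothetical exclusion list: a few stop words plus sentence boundary markers
exclude = ["a", "an", "the", "this", "that", "<s>", "</s>"]

# Named index levels can be referenced in query(); @ pulls in a local variable
hits = BG.query("w1 == 'monster' and w0 not in @exclude")
preceding = hits.index.get_level_values("w0").to_list()
print(preceding)
```

Keeping the exclusion list in a Python variable makes it easy to extend without rewriting the query string.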

Trying it by hand ...

In [19]:
import re

In [20]:
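# Read the whole file, lowercase it, and flatten it to a single line
with open(src_file_path, 'r') as f:
    big_line = f.read()
big_line = big_line.lower().replace("\n", ' ')
# Replace runs of punctuation and underscores with a single space
big_line = re.sub(r"[\W_]+", " ", big_line)
big_line = re.sub(r"\s+", " ", big_line)
tokens = big_line.split()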

In [21]:
big_line[:500]

' the project gutenberg ebook of frankenstein by mary w shelley this ebook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever you may copy it give it away or re use it under the terms of the project gutenberg license included with this ebook or online at www gutenberg org title frankenstein or the modern prometheus author mary w shelley release date march 13 2013 ebook 42324 language english start of this project gutenberg ebook frankenstein produced by greg w'

In [22]:
# Pair each token with its successor; sentence boundaries are ignored here
bg_data = []
for i in range(len(tokens) - 1):
    bg_data.append(tokens[i:i+2])
BG = pd.DataFrame(bg_data, columns=['w0','w1']).drop_duplicates()

In [23]:
BG.query("w1 == 'monster'").sort_values('w0')

,w0,w1
40878,a,monster
33259,abhorred,monster
48760,cried,monster
45661,detestable,monster
72064,gigantic,monster
70652,hellish,monster
48800,hideous,monster
18370,miserable,monster
19663,the,monster
19350,this,monster
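
One plausible reason the two approaches disagree is sentence segmentation: a parser that splits on `[.?!;:]+` turns a word after a semicolon into a sentence opener, while the flattened by-hand parse pairs it with the word before the semicolon. The sketch below shows this on a single illustrative string; the string and the `<s>`/`</s>` padding are assumptions for demonstration only.

```python
import re

# Illustrative string with a semicolon right before "monster"
text = "Let me go, he cried; monster! Ugly wretch!"

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

# Approach A: strip all punctuation, then pair adjacent tokens
flat_tokens = re.sub(r"[\W_]+", " ", text.lower()).split()
flat_bg = bigrams(flat_tokens)

# Approach B: split into sentences first, then pad each with boundary markers
sent_bg = []
for sent in re.split(r"[.?!;:]+", text.lower()):
    toks = re.sub(r"[\W_]+", " ", sent).split()
    if toks:
        sent_bg += bigrams(["<s>"] + toks + ["</s>"])

before_flat = [w0 for w0, w1 in flat_bg if w1 == "monster"]
before_sent = [w0 for w0, w1 in sent_bg if w1 == "monster"]
print(before_flat)
print(before_sent)
```

Under approach A the predecessor is a content word; under approach B it is the boundary marker, which Q1 asks you to exclude.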


## Q2 

List the following sentences in ascending order of bigram perplexity according to the language model generated from the text.

```
The monster is on the ice.
Flowers are happy things.
I have never seen the aurora borealis.
He never knew the love of a family.
```

In [24]:
model = NgramLanguageModel(train)
model.apply_smoothing()

In [25]:
test_sents = """
The monster is on the ice.
Flowers are happy things.
I have never seen the aurora borealis.
He never knew the love of a family.
""".split('\n')[1:-1]

In [26]:
test_sents = [s.lower() for s in test_sents]

In [27]:
test = NgramCounter(test_sents, vocab)
test.generate()

In [28]:
model.predict(test)

In [29]:
model.T.S

,sent_str,len,ng_1_ll,pp1,ng_2_ll,pp2,ng_3_ll,pp3
0,the monster is on the ice.,9,-46.64946,36.334631,-74.688657,314.897754,-213.042107,13359340.0
1,flowers are happy things.,7,-44.532783,82.243297,-75.997581,1854.477868,-177.397939,42547250.0
2,i have never seen the aurora borealis.,10,-50.323281,32.725155,-87.041808,417.080128,-230.966554,8969869.0
3,he never knew the love of a family.,11,-65.633527,62.538999,-115.580343,1455.504786,-232.560952,2313915.0


In [30]:
model.T.S.sort_values('pp2').sent_str

0                the monster is on the ice.
2    i have never seen the aurora borealis.
3       he never knew the love of a family.
1                 flowers are happy things.
Name: sent_str, dtype: object

## Q3

Using the bigram model represented as a matrix, explore the relationship between bigram pairs as done in the "Explore" section of the template notebook, but use the following lists. **What might you speculate about gender and communication given the results you see?**
* `['he','she']` to select the indices.
* `['said','heard']` to select the columns.

Hint: use `.unstack()` method on the feature `n` and then use `.loc[]` to select the first list from the index, and the second list from the columns.
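
The hint can be tried first on a toy Series. In this sketch the four `he`/`she` counts are taken from the result shown in this section, while the `('he', 'was')` row is invented to show what happens when a pair is missing; the names `n`, `M`, and `sub` are illustrative.

```python
import pandas as pd

# Toy bigram counts indexed by (w0, w1); ('he', 'was') is an invented row
n = pd.Series(
    [21, 5, 3, 3, 7],
    index=pd.MultiIndex.from_tuples(
        [("he", "said"), ("he", "heard"), ("she", "said"),
         ("she", "heard"), ("he", "was")],
        names=["w0", "w1"],
    ),
    name="n",
)

# Pivot w1 into columns; fill_value=0 marks unseen pairs as zero counts
M = n.unstack(fill_value=0)

# Select rows from the index and columns by label
sub = M.loc[["he", "she"], ["said", "heard"]]
print(sub)
print(M.loc["she", "was"])
```

Without `fill_value`, unseen pairs become `NaN` and the counts are converted to floats, which is why the solution output shows `21.0` rather than `21`.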

In [31]:
BGX = model.LM[1].n.unstack()

In [32]:
print(BGX.loc[['he','she'],['said','heard']])

w1   said  heard
w0              
he   21.0    5.0
she   3.0    3.0


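Speculation: In this novel, "he said" occurs 21 times against 3 for "she said", while "heard" follows both pronouns with similar counts (5 and 3). This suggests that male characters are given far more reported speech than female characters.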

## Q4

Generate a text using the `generate_text` function.

In [33]:
model.generate_text()

01. I REMEMBERED SHUDDERING THE MAD ENTHUSIASM THAT HURRIED ME ON EVERY SIDE THE SOUND OF VOICES AS THE CASE I DARE NOT.

02. BUT SUCCESS SHALL CROWN MY ENDEAVOURS SO SOON.

03. AND IF THEIR TESTIMONY SHALL NOT.

04. THEY DIED BY MY PROTECTORS HAD MANIFESTED TOWARDS HIM.

05. A DEADLY STRUGGLE WOULD THEN DRIVE AWAY INCIPIENT DISEASE.

06. BUT SLEEP DID NOT ALLOW ME TO WRITE TO YOU FIRST SAW HIM SOMETIMES SHUDDER WITH HORROR.

07. BUT AS I DID NOT LIVE TO FULFIL IT.

08. HE ASKED ME WITH AFFECTION WAS THE CORPSE OF SOME DISCOVERIES HAVING BEEN MADE BY DIFFERENT FEELINGS.

09. BEFORE I LOOKED UPON ME HOWEVER WITH SOME DEGREE BENEFICIAL.

10. EVERY ONE WAS NEAR ME WHO SOOTHED ME.

11. AND ALTHOUGH THE STRANGER.

12. AND I CONJECTURED TO REST IF THERE WAS NO LONGER NECESSARY AND YET SHE PAID THE GREATEST DANGER OWING TO THE ACTIVE SPIRIT OF GOOD.

13. .

14. HIS MANNERS WERE RUDE DESERVED BETTER TREATMENT THAN BLOWS AND A WINNING MILDNESS TO HER.

15. .

16. WHEN I MOMENTARILY EXPECT MY RE

## Q5

Compute the redundancy $R$ for each of the n-gram models using the MLE of the joint probability of each ngram type. In other words, for each model, just use the `.mle` feature as $p$ in computing $H = \sum p(ng) \log_2(1/p(ng))$.

Remember that $R = 1 - \frac{H}{H_{max}}$, where $H$ is the actual entropy of the model and $H_{max}$ is its maximum entropy. 

Does $R$ increase, decrease, or remain the same as the choice of n-gram increases in length?

In [34]:
V = len(vocab)

In [35]:
R = []
for i in range(3):
    N = V**(i+1)
    H = (train.LM[i]['mle'] * np.log2(1/train.LM[i]['mle'])).sum()
    Hmax = np.log2(N)
    R.append(int(round(1 - H/Hmax, 2) * 100))

In [36]:
R

[33, 48, 61]

**ANSWER**: Redundancy increases.

**<span style="color:red;">ISSUE</span>**: If you use just the vector length of seen values, the redundancy will decrease. We accept both answers since some students were told to use only the seen values for the length.
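
The effect of the two choices of $H_{max}$ can be seen on a toy distribution. This is a minimal sketch under stated assumptions: the probabilities and the vocabulary size `V = 4` are invented, and `redundancy` is a hypothetical helper, not part of `langmod`.

```python
import numpy as np

def redundancy(p, n_outcomes):
    """R = 1 - H / Hmax, with Hmax = log2(n_outcomes)."""
    p = np.asarray(p, dtype=float)
    H = (p * np.log2(1 / p)).sum()
    return 1 - H / np.log2(n_outcomes)

# Toy MLE joint probabilities for three seen bigram types
p = [0.5, 0.25, 0.25]
V = 4  # toy vocabulary size, so V**2 = 16 possible bigrams

# Hmax over every possible bigram (the approach used in the solution above)
r_full = redundancy(p, V**2)

# Hmax over only the bigram types actually seen
r_seen = redundancy(p, len(p))

print(round(r_full, 3), round(r_seen, 3))
```

Here $H = 1.5$ bits in both cases; only the denominator changes. Because the space of possible n-grams grows as $V^n$ while the number of seen types grows far more slowly, the two conventions can move $R$ in opposite directions as $n$ increases.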

# Notes

## Q2

```
self.T.S[f'ng_{ng}_ll'] = self.T.NG[i]\
    .join(self.LM[i].log_p, on=self.widx[:ng])\
    .fillna(self.Z1[i]).fillna(self.Z2[i])\
    .groupby('sent_num').log_p.sum()
    
self.T.S[f'pp{ng}'] = 2**( -self.T.S[f'ng_{ng}_ll'] / self.T.S['len'])
```
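
The perplexity line in the excerpt above can be checked by hand against the table in the Q2 solution. The sketch below is a minimal check, not the library's code: it copies the `len` and `ng_2_ll` values from that table into a toy DataFrame (the name `S` is only for illustration) and applies `2 ** (-ll / len)`.

```python
import pandas as pd

# Values copied from the Q2 results table (bigram log-likelihoods and lengths)
S = pd.DataFrame({
    "sent_str": [
        "the monster is on the ice.",
        "flowers are happy things.",
        "i have never seen the aurora borealis.",
        "he never knew the love of a family.",
    ],
    "len": [9, 7, 10, 11],
    "ng_2_ll": [-74.688657, -75.997581, -87.041808, -115.580343],
})

# Perplexity: 2 raised to the negative per-token log-likelihood
S["pp2"] = 2 ** (-S["ng_2_ll"] / S["len"])

# Ascending perplexity gives the answer order for Q2
order = S.sort_values("pp2").index.to_list()
print(S[["sent_str", "pp2"]].round(2))
print(order)
```

Dividing by `len` normalizes for sentence length, which is why the sentence with the lowest total log-likelihood (index 3) does not end up with the highest perplexity.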

In [37]:
# Bigram Prediction 
X = test.NG[1]\
    .join(train.LM[1].log_p, on=['w0','w1'])\
    .fillna(model.Z1[1])\
    .fillna(model.Z2[1])

In [38]:
X.groupby('sent_num').log_p.sum() #.sort_values(ascending=False).to_frame('log_p_sum')

sent_num
0    -74.688657
1    -75.997581
2    -87.041808
3   -115.580343
Name: log_p, dtype: float64

In [39]:
test.S.ng_2_ll

0    -74.688657
1    -75.997581
2    -87.041808
3   -115.580343
Name: ng_2_ll, dtype: float64