# Metadata

```yaml
Course: DS 5001 
Module: 03: Homework KEY
Topics: Inferring and Interpreting Language Models 
Author: R.C. Alvarado
```

# Instructions

Use the following source text and library to answer the questions in this assessment. 
  * `pg42324.txt`
  * `textimporter.py`

Follow this pattern:
* Create a new notebook for your work.
* Parse the _Frankenstein_ text to generate TOKENS and VOCAB tables.
* Create a list of sentences from the TOKENS table and a list of terms from the VOCAB table. 
* Generate ngram type tables and models, going up to the trigram level.
* Write the code to answer the following questions:
  1. List six words that precede the word "monster," excluding stop words (and sentence boundary markers). Stop words include 'a', 'an', 'the', 'this', 'that', etc. Hint: use the `df.query()` method.  
  2. List the following sentences in ascending order of bigram perplexity according to the language model generated from the text:
    ```
    The monster is on the ice.
    Flowers are happy things.
    I have never seen the aurora borealis.
    He never knew the love of a family.
    ```
  3. Using the bigram model represented as a matrix, explore the relationship between bigram pairs using the following lists. Hint: use the `.unstack()` method on the feature `n` and then use `.loc[]` to select the first list from the index, and the second list from the columns.
     1. `['he','she']` to select the indices.
     2. `['said','heard']` to select the columns.
  4. Generate 20 sentences using the `.generate_text()` method from the `langmod.NgramLanguageModel` class.
  5. Compute the redundancy $R$ for each of the n-gram models using the MLE of the joint probability of each ngram type. In other words, for each model, just use the `.mle` feature as $p$ in computing $H = \sum p(ng) \log_2(1/p(ng))$. Does $R$ increase, decrease, or remain the same as the choice of n-gram increases in length? Hint: Remember that $R = 1 - \frac{H}{H_{max}}$, where $H$ is the actual entropy of the model and $H_{max}$ is its maximum entropy. 
  
**Hints for Q5:**

- If `mle` is not a feature in your models, just use `p` for the unigram model and compute `p` for the other two models by dividing `n` by the sum of `n`, i.e. 

```python
M[1]['p'] = M[1].n /  M[1].n.sum()
M[2]['p'] = M[2].n /  M[2].n.sum()
``` 
- $N$, the number of possible n-grams used in $H_{max} = \log_2 N$, is the count of all possible combinations at each n-gram length. For the bigram model, $N$ is the number of unigram types (i.e. the vocabulary size plus the sentence boundary markers) squared; for the trigram model, it is that number cubed, i.e.

```python
N = len(M[0].index)**(i+1)  # i = 0 for unigrams, 1 for bigrams, 2 for trigrams
```


**Other Hints**:
* You may use the libraries or cut-and-paste code from the relevant notebooks.
* Use the `M03_06_NGramLanguageModels.ipynb` for code patterns.
* The story begins with the Preface.
* Even though they are not called "chapters," treat the Preface and Letters as chapters.
* You don't have to use the "START OF PROJECT GUTENBERG ...", etc., to clip the text. Find the lines where you think the text actually begins and ends.

# Solution

## Config

In [1]:
import pandas as pd
import numpy as np

In [2]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_home = config['DEFAULT']['data_home']
local_lib = config['DEFAULT']['local_lib']
output_dir = config['DEFAULT']['output_dir']

In [3]:
src_file_path = f'{data_home}/gutenberg/pg42324.txt'

In [4]:
import sys
sys.path.append(local_lib)

In [5]:
from textimporter import TextImporter
import langmod_funcs as lm

## Import Data

In [6]:
ohco_pats = [
    ('chap', r"^(?:PREFACE|CHAPTER|LETTER)\s", 'm')
]
clip_pats = [
    r"^M\. W\. S\.\s*$",
    r"^THE END\.\s*$"
]

In [7]:
franky = TextImporter(src_file_path, ohco_pats=ohco_pats, clip_pats=clip_pats)

In [8]:
franky.import_source().parse_tokens().extract_vocab();

Importing  /Users/rca2t1/Dropbox/Courses/DS/DS5001/DS5001_2024_01_R/data/gutenberg/pg42324.txt
Clipping text
Parsing OHCO level 0 chap_id by milestone ^(?:PREFACE|CHAPTER|LETTER)\s
Parsing OHCO level 1 para_num by delimitter \n\n
Parsing OHCO level 2 sent_num by delimitter [.?!;:]+
Parsing OHCO level 3 token_num by delimitter [\s',-]+


In [9]:
franky.TOKENS

chap_id,para_num,sent_num,token_num,token_str,term_str
1,0,0,0,_To,to
1,0,0,1,Mrs,mrs
1,0,1,1,Saville,saville
1,0,1,2,England,england
1,0,2,0,_,
...,...,...,...,...,...
28,82,1,10,lost,lost
28,82,1,11,in,in
28,82,1,12,darkness,darkness
28,82,1,13,and,and


In [10]:
franky.VOCAB

term_str,n,n_chars,p,s,i,h
the,4197,3,0.055427,18.041696,4.173263,0.231312
and,2976,3,0.039302,25.443884,4.669247,0.183512
i,2852,1,0.037665,26.550140,4.730648,0.178178
of,2647,2,0.034957,28.606347,4.838263,0.169133
to,2101,2,0.027747,36.040457,5.171545,0.143493
...,...,...,...,...,...,...
overweigh,1,9,0.000013,75721.000000,16.208406,0.000214
pledge,1,6,0.000013,75721.000000,16.208406,0.000214
salvation,1,9,0.000013,75721.000000,16.208406,0.000214
timorous,1,8,0.000013,75721.000000,16.208406,0.000214


In [11]:
franky.OHCO

['chap_id', 'para_num', 'sent_num', 'token_num']

## Model Config

In [12]:
ngrams = 3
widx = [f"w{i}" for i in range(ngrams)]

## OOV Terms

In [13]:
franky.VOCAB['n_chars'] = franky.VOCAB.index.str.len()
franky.VOCAB['modified_term_str'] = franky.VOCAB.index
franky.VOCAB.loc[(franky.VOCAB.n == 1) & (franky.VOCAB.n_chars < 3), 'modified_term_str'] = "<UNK>"

In [14]:
franky.TOKENS['modified_term_str'] = franky.TOKENS.term_str.map(franky.VOCAB.modified_term_str)

## Get Ngrams

In [15]:
def token_to_padded(token, grouper=['sent_num'], term_str='term_str'):
    ohco = token.index.names # We preserve these since they get lost in the shuffle
    padded = token.groupby(grouper)\
        .apply(lambda x: '<s> ' + ' '.join(x[term_str]) + ' </s>')\
        .apply(lambda x: pd.Series(x.split()))\
        .stack().to_frame('term_str')
    padded.index.names = ohco
    return padded

In [16]:
PADDED = token_to_padded(franky.TOKENS, grouper=franky.OHCO[:3], term_str='modified_term_str')

In [17]:
PADDED

chap_id,para_num,sent_num,token_num,term_str
1,0,0,0,<s>
1,0,0,1,to
1,0,0,2,mrs
1,0,0,3,</s>
1,0,1,0,<s>
...,...,...,...,...
28,82,1,11,in
28,82,1,12,darkness
28,82,1,13,and
28,82,1,14,distance


In [18]:
def padded_to_ngrams(padded, grouper=['sent_num'], n=2):
    
    ohco = padded.index.names
    ngrams = padded.groupby(grouper)\
        .apply(lambda x: pd.concat([x.shift(0-i) for i in range(n)], axis=1))\
        .reset_index(drop=True)
    ngrams.index = padded.index
    ngrams.columns = widx
    
    return ngrams

In [19]:
NG = padded_to_ngrams(PADDED, franky.OHCO[:3], ngrams)

In [20]:
# NG = lm.get_ngrams(franky.TOKENS, n=3)

In [21]:
NG

chap_id,para_num,sent_num,token_num,w0,w1,w2
1,0,0,0,<s>,to,mrs
1,0,0,1,to,mrs,</s>
1,0,0,2,mrs,</s>,
1,0,0,3,</s>,,
1,0,1,0,<s>,saville,england
...,...,...,...,...,...,...
28,82,1,11,in,darkness,and
28,82,1,12,darkness,and,distance
28,82,1,13,and,distance,</s>
28,82,1,14,distance,</s>,


## Generate Models

In [22]:
# M = lm.get_ngram_counts(NG)

In [23]:
def ngrams_to_models(ngrams):
    global widx
    n = len(ngrams.columns)
    model = [None for i in range(n)]
    for i in range(n):
        if i == 0:
            model[i] = ngrams.value_counts('w0').to_frame('n')
            model[i]['p'] = model[i].n / model[i].n.sum()
            model[i]['i'] = np.log2(1/model[i].p)
        else:
            model[i] = ngrams.value_counts(widx[:i+1]).to_frame('n')    
            model[i]['cp'] = model[i].n / model[i-1].n
            model[i]['i'] = np.log2(1/model[i].cp)
        model[i] = model[i].sort_index()
    return model

In [24]:
M = ngrams_to_models(NG)

In [25]:
M[2]

w0,w1,w2,n,cp,i
11th,17,</s>,1,1.0,0.0
11th,the,passage,1,1.0,0.0
12th,17,</s>,1,1.0,0.0
13th,17,</s>,1,1.0,0.0
18th,17,</s>,2,1.0,0.0
...,...,...,...,...,...
youthful,lovers,have,1,0.5,1.0
youthful,lovers,while,1,0.5,1.0
zeal,modern,philosophers,1,1.0,0.0
zeal,of,felix,1,0.5,1.0


## Q1

List six words that precede the word "monster," excluding stop words (and sentence boundary markers). Stop words include 'a', 'an', 'the', 'this', 'that', etc.

Hint: use the `df.query()` method.

**<span style="color:red;">ISSUE</span>**: If you use `textimporter.py`, you get the set of six words shown below. If you parse the text yourself, you get five of the same words but a different sixth.

In [26]:
M[1].query("w1 == 'monster'")

w0,w1,n,cp,i
<s>,monster,1,0.000194,12.328114
a,monster,3,0.002161,8.853829
abhorred,monster,1,0.083333,3.584963
detestable,monster,1,0.5,1.0
gigantic,monster,1,0.166667,2.584963
hellish,monster,1,0.142857,2.807355
hideous,monster,1,0.090909,3.459432
miserable,monster,1,0.015385,6.022368
the,monster,20,0.004765,7.713215
this,monster,1,0.002488,8.651052


```
abhorred
detestable    
gigantic      
hellish       
hideous       
miserable     
```
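As a minimal sketch of how the stop-word filter could be folded into the query itself, the example below uses a toy bigram table with `w0` and `w1` as ordinary columns. The counts and the `stop_words` list are illustrative assumptions, not values or names from the course library.

```python
import pandas as pd

# Toy bigram counts (hypothetical values, not the novel's full table).
bigrams = pd.DataFrame(
    [
        ("the", "monster", 20),
        ("a", "monster", 3),
        ("<s>", "monster", 1),
        ("hideous", "monster", 1),
        ("hideous", "form", 2),
    ],
    columns=["w0", "w1", "n"],
)

# Hypothetical stop list: a few function words plus the boundary markers.
stop_words = ["a", "an", "the", "this", "that", "<s>", "</s>"]

# Keep bigrams ending in 'monster' whose first word is not a stop word.
preceding = bigrams.query("w1 == 'monster' and w0 not in @stop_words")
print(preceding.w0.tolist())
```

With the model tables above, where `w0` and `w1` are index levels, calling `.reset_index()` first gives the same column layout.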

## Q2 

List the following sentences in ascending order of bigram perplexity according to the language model generated from the text.

```
The monster is on the ice.
Flowers are happy things.
I have never seen the aurora borealis.
He never knew the love of a family.
```

In [27]:
test_sents = """
The monster is on the ice.
Flowers are happy things.
I have never seen the aurora borealis.
He never knew the love of a family.
""".split('\n')[1:-1]

In [28]:
def sentence_to_token(sent_list, file=True):
    
    # Convert list of sentences to dataframe
    if file:
        S = pd.read_csv("test_sentences.txt", header=None, names=['sent_str'])
    else:
        S = pd.DataFrame(sent_list, columns=['sent_str'])
    S.index.name = 'sent_num'
    
    # Convert dataframe of sentences to TOKEN with normalized terms
    K = S.sent_str.apply(lambda x: pd.Series(x.split())).stack().to_frame('token_str')
    K['term_str'] = K.token_str.str.replace(r"[\W_]+", "", regex=True).str.lower()
    K.index.names = ['sent_num', 'token_num']
    
    return S, K

In [29]:
TEST_SENTS, TEST_TOKENS = sentence_to_token(test_sents, False)

In [30]:
TEST_PADDED = token_to_padded(TEST_TOKENS)

In [31]:
TEST_NGRAMS = padded_to_ngrams(TEST_PADDED, n=ngrams)

## Test Model

In [32]:
def test_model(model, ngrams):
    
    global widx
    
    assert len(model) == len(ngrams.columns)
    
    n = len(model)
    ohco = ngrams.index.names
    
    R = []
    for i in range(n):
        T = ngrams.merge(model[i], on=widx[:i+1], how='left')  # use the model passed in, not the global M
        T.index = ngrams.index
        T = T.reset_index().set_index(ohco + widx).i #.to_frame(f"i{i}")
        
        # This is how we handle unseen combos: assign them the largest observed information value
        T[T.isna()] = T.max()
        R.append(T.to_frame(f"i{i}"))
                
    return pd.concat(R, axis=1)

In [33]:
R = test_model(M, TEST_NGRAMS)

In [34]:
R.head()

sent_num,token_num,w0,w1,w2,i0,i1,i2
0,0,<s>,the,monster,4.058119,3.816361,7.511753
0,1,the,monster,is,4.35109,7.713215,7.511753
0,2,monster,is,on,11.432037,10.438792,7.511753
0,3,is,on,the,8.124138,6.262095,2.0
0,4,on,the,ice,7.543883,1.623182,7.219169


In [35]:
def compute_perplexity(results, test_sents, n=3):
    for i in range(n):
        test_sents[f"pp{i}"] = np.exp2(results.groupby('sent_num')[f"i{i}"].mean())
    return test_sents

In [36]:
PP = compute_perplexity(R, TEST_SENTS)

In [37]:
PP.sort_values('pp1')

sent_num,sent_str,pp0,pp1,pp2
0,The monster is on the ice.,116.146797,80.632951,68.983256
3,He never knew the love of a family.,170.855904,136.87052,64.734928
2,I have never seen the aurora borealis.,340.954187,138.718691,81.279212
1,Flowers are happy things.,587.20506,533.982028,182.5


Answer: 0, 3, 2, 1
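The perplexity computed above is 2 raised to the mean information (in bits) of a sentence's n-grams. The sketch below illustrates that relationship with made-up information values; the helper name `perplexity` is hypothetical and is not part of the course code.

```python
def perplexity(surprisals):
    """Return 2 ** (mean information in bits) for one sentence."""
    return 2 ** (sum(surprisals) / len(surprisals))

# Made-up per-ngram information values, in bits.
uniform = perplexity([2.0, 2.0, 2.0])  # every step costs 2 bits
mixed = perplexity([1.0, 3.0, 5.0])    # the mean is 3 bits

print(uniform, mixed)
```

A sentence whose n-grams are all equally surprising at 2 bits has a perplexity of 4, as if the model were choosing among four equally likely options at each step.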

## Q3

Using the bigram model represented as a matrix, explore the relationship between bigram pairs as done in the "Explore" section of the template notebook, but use the following lists. **What might you speculate about gender and communication given the results you see?**
* `['he','she']` to select the indices.
* `['said','heard']` to select the columns.

Hint: use `.unstack()` method on the feature `n` and then use `.loc[]` to select the first list from the index, and the second list from the columns.

In [38]:
BGX = M[1].n.unstack()

In [39]:
print(BGX.loc[['he','she'],['said','heard']])

w1   said  heard
w0              
he   21.0    5.0
she   3.0    3.0


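Speculation: In this text, men are reported as speaking far more often than women ("he said" occurs 21 times, "she said" 3 times), while "heard" is split more evenly (5 vs. 3).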
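To make the `.unstack()` and `.loc[]` pattern concrete, here is a small sketch on a hand-built count series. The four `said`/`heard` counts are copied from the output above; the `('he', 'was')` row is a hypothetical extra, added only to show that `.loc[]` drops the columns that were not requested. Normalizing each row is one possible way to compare the two pronouns and is an assumption of this sketch, not part of the assignment.

```python
import pandas as pd

# Bigram counts indexed by (w0, w1), mimicking the shape of M[1].n.
counts = pd.Series(
    [21, 5, 3, 3, 7],
    index=pd.MultiIndex.from_tuples(
        [("he", "said"), ("he", "heard"), ("she", "said"),
         ("she", "heard"), ("he", "was")],  # ('he', 'was') is hypothetical
        names=["w0", "w1"],
    ),
    name="n",
)

# Pivot w1 into columns, then select rows and columns by label.
matrix = counts.unstack()
sub = matrix.loc[["he", "she"], ["said", "heard"]]

# Share of each verb within a pronoun's selected counts.
share = sub.div(sub.sum(axis=1), axis=0)
print(share)
```

Row-wise shares separate the question of which verb each pronoun favors from the raw difference in how often each pronoun occurs.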

## Q4

Generate a text using the `generate_text` function.

In [40]:
def generate_text(M, n=250):
    
    if len(M) < 3:
        raise ValueError("Must have trigram model generated.")
    
    # Start list of words
    first_word = M[1].loc['<s>'].sample(weights='cp').index[0]
    
    words = ['<s>', first_word]
    
    for i in range(n):
        
        bg = tuple(words[-2:])

        # Try trigram model
        try:
            next_word = M[2].loc[bg].sample(weights='cp').index[0]

        # If not found in model, back off ...
        except KeyError as e1:
            try:
                # Get the last word in the bigram
                ug = bg[1]
                next_word = M[1].loc[ug].sample(weights='cp').index[0]
            
            except KeyError as e2:
                next_word = M[0].sample(weights='p').index[0]
                
        words.append(next_word)
    
    
    text = ' '.join(words[2:])
    print('\n\n'.join([str(i+1) + ' ' + line.replace('<s>','')\
        .strip().upper() for i, line in enumerate(text.split('</s>'))]))

In [41]:
generate_text(M)

1 WHOLE OF THE WRETCHED SPHERE OF MY ENEMY AND UNPROTECTED TO THE COURT WAS HELD

2 MY PLAN OF YOUR HANDS

3 ALMOST AS IMPOSING AND INTERESTING AS TRUTH

4 ARDENTLY HOPE THAT YOUR OTHER DUTIES ARE EQUALLY NEGLECTED

5 I SLAKED MY THIRST FOR SYMPATHY AND COMPASSION CONFIRMED MY RESOLUTION TO PURSUE MY DESTROYER IN ITS GENERAL RELATIONS AND THAT CLERVAL SHOULD JOIN ME AT ONCE WITH FRIGHTFUL LOUDNESS FROM VARIOUS QUARTERS OF THE NEARLY PERPENDICULAR ASCENT OF MONT SALÊVE

6 AND THUS WHILE THEY EXIST YOU SHALL CURSE THE HOUR FROM WHICH THE MURDER HAD DISCOVERED IN HER HIDING PLACES

7 THE IDEA OF THE ACCUSED

8 OF COLD AND WET AT LENGTH EXHAUSTED BY HIS PRESENCE BROUGHT BACK DESPAIR TO HIM TO BE MEN ON WHOM I HAD SO MISERABLY

9 DEAR SISTER BUT I WAS SEATED IN A LOUD SCREAM I FIRED THE STRAW AND FELL AT HIS FEET

10 CLEAR CONCEPTION OF MY LABOURS DID NOT SATISFY MY OWN VAMPIRE MY OWN INEXPERIENCE AND MISTAKE THAN TO THE COUNTY TOWN WHERE THE MERCHANT HAD DECIDED TO WAIT A FAVOURABLE OPPORT

## Q5

Compute the redundancy $R$ for each of the n-gram models using the MLE of the joint probability of each ngram type. In other words, for each model, just use the `.mle` feature as $p$ in computing $H = \sum p(ng) \log_2(1/p(ng))$

Remember that $R = 1 - \frac{H}{H_{max}}$, where $H$ is the actual entropy of the model and $H_{max}$ is its maximum entropy. 

Does $R$ increase, decrease, or remain the same as the choice of n-gram increases in length?

**Hints**:
- If `mle` is not a feature in your models, just use `p` for the unigram model and compute `p` for the other two models by dividing `n` by the sum of `n`, i.e. 

```python
M[1]['p'] = M[1].n /  M[1].n.sum()
M[2]['p'] = M[2].n /  M[2].n.sum()
```
- $N$, the number of possible n-grams used in $H_{max} = \log_2 N$, is the count of all possible combinations at each n-gram length. For the bigram model, $N$ is the number of unigram types (i.e. the vocabulary size plus the sentence boundary markers) squared; for the trigram model, it is that number cubed, i.e.

```python
N = len(M[0].index)**(i+1)  # i = 0 for unigrams, 1 for bigrams, 2 for trigrams
```
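Before applying the formula to the real models, it may help to check it on a distribution small enough to verify by hand. The sketch below uses a hypothetical helper, `redundancy`, and made-up probabilities over four possible outcomes.

```python
import math

def redundancy(probs, n_possible):
    """R = 1 - H / H_max, where H_max = log2(n_possible)."""
    h = sum(p * math.log2(1 / p) for p in probs if p > 0)
    h_max = math.log2(n_possible)
    return 1 - h / h_max

# Uniform over 4 outcomes: H = H_max = 2 bits, so R = 0.
r_uniform = redundancy([0.25, 0.25, 0.25, 0.25], 4)

# Only 3 of the 4 outcomes are observed: H = 1.5 bits, so R = 0.25.
r_skewed = redundancy([0.5, 0.25, 0.25], 4)

print(r_uniform, r_skewed)
```

Outcomes that are possible but never observed contribute nothing to $H$ while still counting toward $H_{max}$, which is why redundancy can grow quickly as the space of possible n-grams expands.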

In [44]:
V = len(M[0].index)

M[1]['p'] = M[1].n /  M[1].n.sum()
M[2]['p'] = M[2].n /  M[2].n.sum()

R = []
for i in range(3):
    N = V**(i+1)
    H = (M[i].p * np.log2(1/M[i].p)).sum()
    Hmax = np.log2(N)
    print(i, N, Hmax)
    R.append(int(round(1 - H/Hmax, 2) * 100))

In [45]:
R

[31, 45, 59]

**ANSWER**: Redundancy increases.