# Metadata

```yaml
Course: DS 5001 
Module: 03: Language Models
Topics: Inferring NGram Language Models 
Author: R.C. Alvarado
Date:   25 January 2024
```

## Purpose

Wew create word-level langage models from a set of novels and evaluate them.

## Set Up

### Import libraries

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import re
import seaborn as sns
from IPython.core.display import HTML
sns.set()

### Configure

In [2]:
import configparser
config = configparser.ConfigParser()
config.read("../../../env.ini")
data_dir = config['DEFAULT']['data_home']
output_dir = config['DEFAULT']['output_dir']

In [3]:
OHCO = ['book_id', 'chap_num', 'para_num', 'sent_num', 'token_num']
text_file = f'{output_dir}/austen-combo-TOKENS.csv' # Generated in HW 02
vocab_file = f'{output_dir}/austen-combo-VOCAB.csv' # Generated in HW 02

In [4]:
ngram_size = 3

## Prepare Training Data

Problem: These texts treat Mr. and Mrs. as sentences. Since sentence breaks are represented in the OHCO, to fix this requires reparsing the source text into the TOKEN table with the abbreviation periods removed.

### Import TOKENS

In [5]:
TOKENS = pd.read_csv(text_file).set_index(OHCO).dropna()
TOKENS['term_str'] = TOKENS.token_str.str.lower().str.replace(r'[\W_]+', '', regex=True)
TOKENS = TOKENS[TOKENS.term_str != '']

In [6]:
TOKENS.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,Unnamed: 4_level_0,token_str,term_str
book_id,chap_num,para_num,sent_num,token_num,Unnamed: 5_level_1,Unnamed: 6_level_1
1,1,0,0,0,Sir,sir
1,1,0,0,1,Walter,walter
1,1,0,0,2,Elliot,elliot
1,1,0,0,3,of,of
1,1,0,0,4,Kellynch,kellynch


Look at book 1 (_Persuasion_) only:

In [7]:
TOKENS.loc[1].head()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,token_str,term_str
chap_num,para_num,sent_num,token_num,Unnamed: 4_level_1,Unnamed: 5_level_1
1,0,0,0,Sir,sir
1,0,0,1,Walter,walter
1,0,0,2,Elliot,elliot
1,0,0,3,of,of
1,0,0,4,Kellynch,kellynch


### Get VOCAB

In [8]:
VOCAB = pd.read_csv(vocab_file).set_index('term_str')

### Add features to VOCAB

In [9]:
VOCAB['n_chars'] = VOCAB.index.str.len()

In [10]:
VOCAB['p'] = VOCAB.n / VOCAB.n.sum()
VOCAB['s'] = 1/VOCAB.p
VOCAB['i'] = np.log2(VOCAB.s)
VOCAB['h'] = VOCAB.p * VOCAB.i

In [11]:
VOCAB

Unnamed: 0_level_0,n,n_chars,p,s,i,h
term_str,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,3,1,0.000015,68267.333333,16.058908,0.000235
15,1,2,0.000005,204802.000000,17.643870,0.000086
16,1,2,0.000005,204802.000000,17.643870,0.000086
1760,1,4,0.000005,204802.000000,17.643870,0.000086
1784,1,4,0.000005,204802.000000,17.643870,0.000086
...,...,...,...,...,...,...
youthful,3,8,0.000015,68267.333333,16.058908,0.000235
z,1,1,0.000005,204802.000000,17.643870,0.000086
zeal,7,4,0.000034,29257.428571,14.836515,0.000507
zealous,4,7,0.000020,51200.500000,15.643870,0.000306


## Define NGramCounter Class

In [12]:
class NgramCounter():
    """A class to generate tables of ngram tokens and types from a list of sentences."""
    
    unk_sign = '<UNK>'
    sent_pad_signs = ['<s>','</s>']
        
    def __init__(self, sents:[], vocab:[], n:int=3):
        self.sents = sents # Expected to be normalized
        self.vocab = vocab # Can be extracted from another corpus
        self.n = n
        self.widx = [f'w{i}' for i in range(self.n)] # Used for cols and index names
        
    def generate(self):
        
        # Convert sentence list to dataframe
        self.S = pd.DataFrame(dict(sent_str=self.sents))
            
        # Pad sentences 
        pad = (self.sent_pad_signs[0] + ' ') * (self.n - 1)
        self.I = (pad + self.S.sent_str + ' ' + self.sent_pad_signs[1])\
            .str.split(expand=True).stack().to_frame('w0')
        
        # Set index names
        self.I.index.names = ['sent_num', 'token_num']
        
        # Remove OOV terms (for test data)
        self.I.loc[~self.I.w0.isin(self.vocab + self.sent_pad_signs), 'w0'] = self.unk_sign

        # Get sentence lengths (these will include pads)
        self.S['token_len'] = self.I.groupby('sent_num').w0.count()
                
        # Add w columns
        for i in range(1, self.n):
            self.I[f'w{i}'] = self.I[f"w{i-1}"].shift(-1)         
        
        # Generate ngrams
        self.NG = []
        for i in range(self.n):
            self.NG.append(self.I.iloc[:, :i+1].copy())

        # Remove spurious rows
        self.NG[i] = self.NG[i].dropna()
                                
        # Generate raw ngram counts
        self.LM = []
        for i in range(self.n):
            self.LM.append(self.NG[i].value_counts().to_frame('n'))
            # self.LM[i]['mle'] = self.LM[i].n / self.LM[i].n.sum()
            self.LM[i] = self.LM[i].sort_index()

        # Convert single value tuple to scalar in unigram table ...
        self.LM[0].index = [i[0] for i in self.LM[0].index]
        self.LM[0].index.name = 'w0'

## Get Training NGrams

In [13]:
S = TOKENS.groupby(OHCO[:4]).term_str\
    .apply(lambda x: ' '.join(x)).to_list()

In [14]:
V = VOCAB.index.to_list()

In [15]:
train = NgramCounter(S, V)
train.generate()

Look at the first sentence

In [16]:
train.NG[2].loc[0]

Unnamed: 0_level_0,w0,w1,w2
token_num,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,<s>,<s>,sir
1,<s>,sir,walter
2,sir,walter,elliot
3,walter,elliot,of
4,elliot,of,kellynch
5,of,kellynch,hall
6,kellynch,hall,in
7,hall,in,somersetshire
8,in,somersetshire,was
9,somersetshire,was,a


Look at the bigram model

In [17]:
train.LM[2]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,n
w0,w1,w2,Unnamed: 3_level_1
1,1760,married,1
1,1785,</s>,1
1,ends,</s>,1
15,1784,elizabeth,1
16,1810,charles,1
...,...,...,...
zealous,attention,as,1
zealous,officer,too,1
zealous,on,the,1
zealously,active,as,1


Look at words that precede "joy"

In [18]:
train.LM[1].query("w1 == 'joy'").sort_values('n', ascending=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,n
w0,w1,Unnamed: 2_level_1
of,joy,8
her,joy,6
s,joy,3
the,joy,3
you,joy,2
a,joy,1
great,joy,1
him,joy,1
like,joy,1
me,joy,1


## Define NGram Language Model Class

In [68]:
class NgramLanguageModel():
    """A class to create ngram language models."""
    
    def __init__(self, ngc:NgramCounter):
        self.S = ngc.S
        self.LM = ngc.LM
        self.NG = ngc.NG
        self.n = ngc.n
        self.widx = ngc.widx
        self.k:float = .5 # Set the Lidstone Smoothing value; LaPlace = 1

    def apply_smoothing(self):
        """Applies simple smoothing to ngram type counts to estimate the models."""
        
        # Z1 and Z2 will hold info about unseen ngrams
        self.Z1 = [None for _ in range(self.n)] # Unseen N grams, but seen N-1 grams
        self.Z2 = [None for _ in range(self.n)] # Unseen N-1 grams too
        
        # The base vocab size (same as number of unigram types)
        V = len(self.LM[0]) # Inlcides <s> and </s>
        
        # The number of ngram types
        B = [V**(i+1) for i in range(self.n)]

        for i in range(self.n):
            
            self.LM[i]['p'] = self.LM[i].n / self.LM[i].n.sum()
            self.LM[i]['log_p'] = np.log2(self.LM[i].p)
            
            # MLE
            # self.LM[i]['mle'] = self.LM[i].n / self.LM[i-1].n
            
            if i > 0:

                # Employ smoothing formula
                self.LM[i]['cp'] = (self.LM[i].n + self.k) / (self.LM[i-1].n + B[i-1] * self.k)
                self.LM[i]['log_cp'] = np.log2(self.LM[i].cp)

                # Unseen N grams, but seen N-1 grams
                self.Z1[i] = np.log2(self.k / (self.LM[i-1].n + B[i-1] * self.k))

                # Unsess N-1 grams too
                self.Z2[i] = np.log2(self.k / B[i-1] * self.k)
                
            # Tidy up the index
            self.LM[i].sort_index(inplace=True)
        
    def predict(self, test:NgramCounter):
        """Predicts test sentences with estimated models."""
        self.T = test
        for i in range(self.n):
            ng = i + 1
            if i == 0:
                self.T.S[f'ng_{ng}_ll'] = self.T.NG[0]\
                    .join(self.LM[0].log_p, on=self.widx[:ng])\
                    .groupby('sent_num').log_p.sum()
            else:
                self.T.S[f'ng_{ng}_ll'] = self.T.NG[i]\
                    .join(self.LM[i].log_cp, on=self.widx[:ng])\
                    .fillna(self.Z1[i]).fillna(self.Z2[i])\
                    .groupby('sent_num').log_cp.sum()
                
            self.T.S[f'pp{ng}'] = 2**( -self.T.S[f'ng_{ng}_ll'] / self.T.S['token_len'])
            

    def generate_text(self, n_sents=20):
        """Generate texts using Shannon Game method."""
        
        LM = self.LM # For convenience
        i = self.n - 1
    
        # Start with beginning sentence marker
        words = ['<s>' for _ in range(i)]

        # Sentence counter
        sent_count = 0

        # Generate sentences until we've reached our limit
        while sent_count < n_sents:

            # Get ngram context
            ng = tuple(words[-i:])

            # Get next word
            weight_param = 'p' if i == 0 else 'cp'
            
            words.append(LM[i].loc[ng].sample(weights=weight_param).index.values[0])

            # Terminate when end-of-sentence marker found
            if words[-1] == '</s>':
                sent_count += 1                        
                if sent_count < n_sents:
                    words.append('<s>')

        # Create text from words
        text = ' '.join(words)

        sents = pd.DataFrame(dict(sent_str=text.split('<s> <s>')))
        sents['len'] = sents.sent_str.str.len()
        sents = sents[sents.len > 0]
        sents.sent_str = sents.sent_str.str.replace('<s> ', '')
        sents.sent_str = sents.sent_str.str.replace(' </s>', '')
        sents.sent_str = sents.sent_str.str.strip()
        sents.sent_str = sents.sent_str.str.replace(r" s ", "'s ", regex=True)
        
        _ = [print(f"{str(x+1).zfill(2)}. {sent}.\n".upper()) for x, sent in enumerate(sents.sent_str)]

        self.generated_sents = sents

## Train the Model

In [69]:
model = NgramLanguageModel(train)

In [70]:
model.apply_smoothing()

In [71]:
NG = model.NG
LM = model.LM
Z1 = model.Z1
Z2 = model.Z2

In [72]:
LM[1].sort_index()

Unnamed: 0_level_0,Unnamed: 1_level_0,n,p,log_p,cp,log_cp
w0,w1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1,1760,1,0.000004,-17.892003,0.000364,-11.424516
1,1785,1,0.000004,-17.892003,0.000364,-11.424516
1,ends,1,0.000004,-17.892003,0.000364,-11.424516
15,1784,1,0.000004,-17.892003,0.000364,-11.423816
16,1810,1,0.000004,-17.892003,0.000364,-11.423816
...,...,...,...,...,...,...
zealous,attention,1,0.000004,-17.892003,0.000364,-11.424866
zealous,officer,1,0.000004,-17.892003,0.000364,-11.424866
zealous,on,1,0.000004,-17.892003,0.000364,-11.424866
zealously,active,1,0.000004,-17.892003,0.000364,-11.424166


In [73]:
LM[2].loc[('anne', 'was')].sort_values('n', ascending=False)

Unnamed: 0_level_0,n,p,log_p,cp,log_cp
w2,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
not,2,8e-06,-16.891997,7.36402e-08,-23.694931
at,2,8e-06,-16.891997,7.36402e-08,-23.694931
startled,2,8e-06,-16.891997,7.36402e-08,-23.694931
so,1,4e-06,-17.891997,4.418412e-08,-24.431897
now,1,4e-06,-17.891997,4.418412e-08,-24.431897
obliged,1,4e-06,-17.891997,4.418412e-08,-24.431897
one,1,4e-06,-17.891997,4.418412e-08,-24.431897
out,1,4e-06,-17.891997,4.418412e-08,-24.431897
really,1,4e-06,-17.891997,4.418412e-08,-24.431897
renewing,1,4e-06,-17.891997,4.418412e-08,-24.431897


## Collect Test Data

### Choose Test Sentences

In [74]:
# Some paragraphs from Austen's _Emma_ and other stuff (first two)
S_TEST = """
The car was brand new
Computer programs are full of bugs
The event had every promise of happiness for her friend 
Mr Weston was a man of unexceptionable character easy fortune suitable age and pleasant manners
and there was some satisfaction in considering with what self-denying generous friendship she had always wished and promoted the match
but it was a black morning's work for her 
The want of Miss Taylor would be felt every hour of every day 
She recalled her past kindness the kindness the affection of sixteen years 
how she had taught and how she had played with her from five years old 
how she had devoted all her powers to attach and amuse her in health 
and how nursed her through the various illnesses of childhood 
A large debt of gratitude was owing here 
but the intercourse of the last seven years 
the equal footing and perfect unreserve which had soon followed Isabella's marriage 
on their being left to each other was yet a dearer tenderer recollection 
She had been a friend and companion such as few possessed intelligent well-informed useful gentle 
knowing all the ways of the family 
interested in all its concerns 
and peculiarly interested in herself in every pleasure every scheme of hers 
one to whom she could speak every thought as it arose 
and who had such an affection for her as could never find fault 
How was she to bear the change 
It was true that her friend was going only half a mile from them 
but Emma was aware that great must be the difference between a Mrs Weston 
only half a mile from them 
and a Miss Taylor in the house 
and with all her advantages natural and domestic 
she was now in great danger of suffering from intellectual solitude 
She dearly loved her father 
but he was no companion for her 
He could not meet her in conversation rational or playful 
The evil of the actual disparity in their ages
and Mr Woodhouse had not married early
was much increased by his constitution and habits 
for having been a valetudinarian all his life 
without activity of mind or body 
he was a much older man in ways than in years 
and though everywhere beloved for the friendliness of his heart and his amiable temper 
his talents could not have recommended him at any time 
Her sister though comparatively but little removed by matrimony 
being settled in London only sixteen miles off was much beyond her daily reach 
and many a long October and November evening must be struggled through at Hartfield 
before Christmas brought the next visit from Isabella and her husband 
and their little children to fill the house and give her pleasant society again 
""".split('\n')[1:-1]

In [75]:
test = NgramCounter(S_TEST, V)
test.generate()

In [77]:
test.LM[1]

Unnamed: 0_level_0,Unnamed: 1_level_0,n
w0,w1,Unnamed: 2_level_1
</s>,<s>,43
<UNK>,</s>,4
<UNK>,<UNK>,7
<UNK>,all,1
<UNK>,and,2
...,...,...
work,for,1
would,be,1
years,</s>,3
years,old,1


In [78]:
model.predict(test)

In [81]:
model.T.S.sort_values("pp1")

Unnamed: 0,sent_str,token_len,ng_1_ll,pp1,ng_2_ll,pp2,ng_3_ll,pp3
1,Computer programs are full of bugs,9,-37.354252,17.758926,-98.393438,1954.527555,-235.507108,75368400.0
25,and a Miss Taylor in the house,10,-43.161226,19.91968,-74.24722,171.816168,-220.859051,4451639.0
32,and Mr Woodhouse had not married early,10,-53.587787,41.034877,-93.857561,668.950692,-245.248022,24137990.0
21,How was she to bear the change,10,-57.965564,55.582405,-89.612213,498.42108,-248.30722,29839540.0
0,The car was brand new,8,-46.915495,58.260125,-87.366171,1938.562364,-203.168322,44153980.0
6,The want of Miss Taylor would be felt every ho...,16,-98.111283,70.129774,-154.025633,790.489253,-403.911573,39750600.0
23,but Emma was aware that great must be the diff...,17,-104.26102,70.18064,-175.616114,1287.502612,-435.887388,52304600.0
34,for having been a valetudinarian all his life,11,-68.411242,74.501825,-85.966001,225.204372,-259.830367,12900670.0
42,before Christmas brought the next visit from I...,14,-90.443779,88.051148,-135.129892,804.604013,-351.791831,36667230.0
41,and many a long October and November evening m...,17,-110.867239,91.875117,-188.455331,2173.203691,-437.044155,54830670.0
