# Advanced NLP HW0

Before starting the task, please read these chapters of Speech and Language Processing by Daniel Jurafsky & James H. Martin thoroughly:

•	N-gram language models: https://web.stanford.edu/~jurafsky/slp3/3.pdf

•	Neural language models: https://web.stanford.edu/~jurafsky/slp3/7.pdf 

In this task you will implement the models described there.

Build a text generator based on an n-gram language model and a neural language model.
1.	Find a corpus (e.g. http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt ), or use any other text that interests you
2.	Preprocess it if necessary (we suggest using nltk for that)
3.	Build an n-gram model
4.	Try out different values of n and calculate perplexity on a held-out set
5.	Build a simple neural network model for text generation (start from a feed-forward net, for example). We suggest using tensorflow + keras for this task

Criteria:
1.	Data is split into train / validation / test, and the motivation for the split method is given
2.	N-gram model is implemented
  *	Unknown words are handled
  *	Add-k smoothing is implemented
3.	Neural network for text generation is implemented
4.	Perplexity is calculated for both models
5.	Examples of texts generated with different models are present and compared
6.	Optional: Try both character-based and word-based approaches.
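
For reference, the two quantities named in the criteria can be written as below, following the notation of the n-gram chapter listed above. Here $C(\cdot)$ is a count in the training data, $V$ is the vocabulary size, $k$ is the smoothing constant, and $N$ is the number of predicted tokens in the evaluated sequence.

```latex
P_{\text{add-}k}\left(w_t \mid w_{t-n+1}^{t-1}\right)
  = \frac{C\left(w_{t-n+1}^{t-1}\, w_t\right) + k}{C\left(w_{t-n+1}^{t-1}\right) + kV}

\mathrm{PP}(w_1 \dots w_N)
  = \left( \prod_{t=1}^{N} P\left(w_t \mid w_{t-n+1}^{t-1}\right) \right)^{-1/N}
```

With $k = 0$ the first expression reduces to the unsmoothed relative-frequency estimate, which assigns zero probability to any n-gram not seen in training.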

In [1]:
from typing import Iterable, Union, Tuple, List
import random
from functools import wraps
from collections import Counter, defaultdict
from prettytable import PrettyTable

from tqdm import tqdm

In [2]:
import numpy as np
import re
import multiprocessing
from sklearn import preprocessing

from sklearn.model_selection import train_test_split

In [3]:
import nltk
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import wordnet
from nltk import word_tokenize
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/aleksandr/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/aleksandr/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Custom ngram model

Base class for the model.

In [4]:
class BaseLM:
    def _check_fit(func):
        """
        A helper decorator that ensures that the LM was fit on vocab.
        """
        @wraps(func)
        def wrapper(self,*args,**kwargs):
            if not self.is_fitted:
                raise AttributeError(f"Fit model before call {func.__name__} method")
            return func(self, *args,**kwargs)
        return wrapper

    def __init__(self, 
                 n: int, 
                 vocab: Iterable[str] = None, 
                 unk_label: str = "<UNK>"
                ):
        """
        Language model constructor
        n -- n-gram size
        vocab -- optional fixed vocabulary for the model
        unk_label -- special token that stands in for so-called "unknown" items
        """
        self.n = n
        self._vocab = vocab if vocab else None
        self.unk_label = unk_label
        # Checked by `_check_fit`; subclasses set it to True once fit or loaded.
        self.is_fitted = False
  
    def _lookup(self, 
                words: Union[str, Iterable[str]]
               ) -> Union[str, Tuple[str]]:
        """
        Looks up words in the vocabulary
        """
        raise NotImplementedError

    @_check_fit
    def prob(self, 
             word: str, 
             context: Tuple[str] = None
            ) -> float:
        """This method returns probability of a word with given context: P(w_t | w_{t - 1}...w_{t - n + 1})

        For example:
        >>> lm.prob('hello', context=('world',))
        0.99988
        """
        raise NotImplementedError

    def prob_with_smoothing(self, 
                            word: str, 
                            context: Tuple[str] = None, 
                            alpha: float = 1.0
                            ) -> float:
        """Probability with additive smoothing

        see: https://en.wikipedia.org/wiki/Additive_smoothing
        where:
        x - count of the word in the given context
        N - total count of the context
        d - vocab size
        a - alpha

        """
        raise NotImplementedError

    @_check_fit
    def generate(self, 
                 text_length: int, 
                 text_seed: Iterable[str] = None,
                 random_seed: Union[int,random.Random] = 42,
                 prob_method: str = None
                 ) -> List[str]:
        """
        This method generates text of a given length.

        text_length: int -- Length of the output text including `text_seed`.
        text_seed: List[str] -- Given text used to calculate probabilities for the next words.
        prob_method: str -- Specifies which method to use: with or without smoothing.

        For example
        >>> lm.generate(2)
        ["hello", "world"]

        """
        raise NotImplementedError

    def fit(self, 
            sequence_of_tokens: Iterable[str]
           ):
        """
        This method learns probabilities based on given sequence of tokens and
        updates `self.vocab`.

        sequence_of_tokens -- iterable of tokens

        For example
        >>> lm.fit(['hello', 'world'])
        """
        raise NotImplementedError

    @_check_fit  
    def perplexity(self, 
                   sequence_of_tokens: Union[Iterable[str], Iterable[Tuple[str]]]
                   ) -> float:
        """
        This method returns perplexity for a given sequence of tokens

        sequence_of_tokens -- iterable of tokens
        """
        raise NotImplementedError

In [5]:
class MyNgrammLM:
    
    def _check_fit(func):
        """
        A helper decorator that ensures that the LM was fit on vocab.
        """
        @wraps(func)
        def wrapper(self,*args,**kwargs):
            if not self.is_fitted:
                raise AttributeError(f"Fit model before call {func.__name__} method")
            return func(self, *args,**kwargs)
        return wrapper
    
    def __init__(self, 
                 n: int, 
                 vocab: Iterable[str] = None, 
                 unk_label: str = "<UNK>",
                 unk_ratio = 0.1,
                 alpha = 1.
                ):
        self.n = n
        
        self.unk_label = unk_label
        
        self.start_sym = "<s>"
        
        self.model = defaultdict(lambda: defaultdict(lambda: 0))
        self.probs = defaultdict(lambda: defaultdict(lambda: 0))
        self.smooth_probs = defaultdict(lambda: defaultdict(lambda: 0))
        

        if vocab:
            self.is_vocab = True
            self.vocab = set(vocab)
        else:
            self.is_vocab = False
            self.vocab = set()
        
        self.vocab.add(self.unk_label)
        self.vocab.add(self.start_sym)
        self.vocab.add('.')
        
        
        self.is_fitted = False
        
        self.unk_ratio = unk_ratio
        
        self.alpha = alpha
            
    def fit_doc(self, 
            sequence_of_texts: Iterable[str]
           ):
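        """
        This method fits the model on a collection of texts.

        Each text is split on whitespace, and consecutive texts are separated
        by n-1 boundary symbols so that no n-gram spans two texts.

        sequence_of_texts -- iterable of whitespace-separated strings

        For example
        >>> lm.fit_doc(['hello world', 'hello there'])
        """
        
        doc = []
        
        for text in sequence_of_texts:
            for word in text.split():
                doc.append(word)
            
            for _ in range(self.n-1):
                doc.append(self.start_sym)
        
        # Drop the trailing boundary symbols; `fit` adds its own padding.
        # The guard avoids `doc[:-0]`, which would empty the list when n == 1.
        if self.n > 1:
            doc = doc[:-(self.n-1)]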
        
        self.fit(doc)
        
    def update_probs(self):
        for context in self.model:
            for word in self.model[context]:
                self.probs[context][word] = self.prob(word, context)
                
    def update_smooth_probs(self):
        for context in self.model:
            for word in self.model[context]:
                self.smooth_probs[context][word] = self.prob_with_smoothing(word, context, self.alpha)
                
        
    def fit(self, 
            sequence_of_tokens: Iterable[str]
           ):
        """
        This method learns probabilities based on given sequence of tokens and
        updates `self.vocab`.

        sequence_of_tokens -- iterable of tokens

        For example
        >>> lm.fit(['hello', 'world'])
        """
        
        if not self.is_vocab:
            counts = Counter(sequence_of_tokens)
            words = list(counts.keys())
            
            # Indices of word types sorted by ascending frequency.
            order = np.argsort(list(counts.values()))
            
            # The rarest `unk_ratio` share of word types is left out of the
            # vocabulary, so those words are later mapped to `unk_label`.
            n_unk = int(len(words) * self.unk_ratio)
            self.vocab.update(words[i] for i in order[n_unk:])
                    
                    
        
        sequence = self._add_padding(list(sequence_of_tokens))
        
        sequence = self.check_vocab(sequence)
        
        for i in range(len(sequence)-self.n+1):
            
            context = tuple(sequence[i:i+self.n-1])
            word = sequence[i+self.n-1]
            
            self.vocab.add(word)
            
            self.model[context][word]+=1
            
        self.is_fitted = True
        
        self.update_probs()
        self.update_smooth_probs()
        
        #raise NotImplementedError
        
        
    @_check_fit
    def prob(self, 
             word: str, 
             context: Tuple[str] = None
            ) -> float:
        """This method returns probability of a word with given context: P(w_t | w_{t - 1}...w_{t - n + 1})

        For example:
        >>> lm.prob('hello', context=('world',))
        0.99988
        
        """
        if word not in self.vocab:
            word = self.unk_label
        
        text = self._add_padding(context, end=False)
        
        text = self.check_vocab(text)
        
        context = tuple(text[-self.n+1:])
            
        num = sum(self.model[context].values())
        
        if num == 0:
            return 0
        
        return self.model[context][word] / num
    

    def prob_with_smoothing(self, 
                            word: str, 
                            context: Tuple[str] = None, 
                            alpha: float = 1.0
                            ) -> float:
        """Probability with additive smoothing

        see: https://en.wikipedia.org/wiki/Additive_smoothing
        where:
        x - count of the word in the given context
        N - total count of the context
        d - vocab size
        a - alpha

        """
        if word not in self.vocab:
            word = self.unk_label

        text = self._add_padding(context, end=False)
        
        text = self.check_vocab(text)
        
        context = tuple(text[-self.n+1:])
            
        num = sum(self.model[context].values())
         
        res = (self.model[context][word] + alpha) / (num + alpha* len(self.vocab))
        
        if alpha == self.alpha:
            self.smooth_probs[context][word] = res
            
        return res
    
    @_check_fit  
    def perplexity(self, 
                   sequence_of_tokens: Union[Iterable[str], Iterable[Tuple[str]]],
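                   prob_method: str = None,
                   ) -> float:
        """
        This method returns the perplexity of a given sequence of tokens,
        or -1 if some n-gram has zero probability (infinite perplexity).

        sequence_of_tokens -- iterable of tokens
        prob_method -- 'smooth' to use additive smoothing; any other value
                       uses the unsmoothed estimates
        """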
        
        sequence_of_tokens = self._add_padding(sequence_of_tokens)
        
        sequence_of_tokens = self.check_vocab(sequence_of_tokens)
        
        N = len(sequence_of_tokens)
        muls = 1
        for i in range(N-self.n+1):
            
            context = tuple(sequence_of_tokens[i:i+self.n-1])
            word = sequence_of_tokens[i+self.n-1]
            
            if prob_method == 'smooth':
                
                if (context not in self.smooth_probs) or (word not in self.smooth_probs[context]):
                    if context not in self.model:
                        p = 1 / len(self.vocab)
                        self.smooth_probs[context][word] = p
                            
                    else:
                        p = self.prob_with_smoothing(word, context, self.alpha)
                
                else:
                    p = self.smooth_probs[context][word]
                    
            else:
                p = self.probs[context][word]
                
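            # Multiply in the root of each probability so that `muls` ends up
            # as the geometric mean instead of an underflowing raw product.
            muls *= p ** (1/(N-self.n+1))

        # A zero probability means infinite perplexity; signal it with -1.
        if muls == 0:
            return -1
        
        return muls ** -1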
    
    
    
    @_check_fit
    def generate(self, 
                 text_length: int, 
                 text_seed: Iterable[str] = None,
                 random_seed: Union[int,random.Random] = 42,
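                 prob_method: str = None,
                 ) -> List[List[str]]:
        """
        This method generates text and returns it as a list of token lists;
        a new list is started each time the boundary symbol is sampled.

        text_length: int -- Number of tokens to generate, not counting `text_seed`.
        text_seed: List[str] -- Given text used to calculate probabilities for the next words.
        random_seed -- Currently unused: sampling relies on the global NumPy random state.
        prob_method: str -- 'smooth' to use additive smoothing; any other value
                            uses the unsmoothed estimates.

        For example
        >>> lm.generate(2)
        [["hello", "world"]]

        """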
        
          
        text = self._add_padding(text_seed, end=False)
        
        text = self.check_vocab(text)
        
        result = []
        
        i = 0
        
        while i < text_length:
            context = tuple(text[-self.n+1:])
            
            words = list(self.vocab)
            words.remove(self.unk_label)
            
            if prob_method == 'smooth':
                probs = []
                for word in words:
                    if (context not in self.smooth_probs) or (word not in self.smooth_probs[context]):
                        if context not in self.model:
                            p = 1 / len(self.vocab)
                            self.smooth_probs[context][word] = p
                            
                        else:
                            p = self.prob_with_smoothing(word, context, self.alpha)
                        
                        
                        
                    else:
                        p = self.smooth_probs[context][word]
                        
                    probs.append(p)
            else:
                probs = [self.probs[context][x] for x in words]
            
            
            probs = preprocessing.normalize([probs], norm='l1')[0]
            
            if probs.sum() == 0:
                probs = None

            num = np.random.choice(len(words), p=probs)
            
            word = words[num]
            
            if word == self.start_sym:
                result.append([x for x in text[self.n-1:]])
                text = self._add_padding(None, end=False)
                
            else:    
                text.append(word)
                i+=1
                
        result.append([x for x in text[self.n-1:]])    
            
        return result
    
    
    def update_alpha(self, alpha):
        self.alpha = alpha
        self.update_smooth_probs()
        
        
    def _lookup(self, 
                words: Union[str, Iterable[str]]
               ) -> Union[str, Tuple[str]]:
        """
        Looks up words in the vocabulary
        """
        raise NotImplementedError
        
        
    def check_vocab(self,
                    text: Iterable[str],
                   ):
        return [word if word in self.vocab
                else self.unk_label
                for word in text]
        
        
    def _add_padding(self,
                     text: Iterable[str],
                     end=True
                    ):
        
        start_ = [self.start_sym] * (self.n-1)
        end_ = [self.start_sym] * (self.n-1)
        
        if not text:
            text = []
        
        sequence = start_ + list(text)
        if end:
            sequence += end_
            
        
        return sequence
        
        
    def show(self):
        print(self.model)
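
`generate` accepts a `random_seed` argument, but the sampling step calls `np.random.choice` on the global random state, so the argument has no effect on the output. A minimal sketch of seeded sampling is shown below, using only the standard library; the helper `sample_words` and the toy distribution are hypothetical and not part of the class above.

```python
import random
from typing import Dict, List


def sample_words(probs: Dict[str, float], k: int, seed: int = 42) -> List[str]:
    """Draw k words from `probs` using a private, seeded generator."""
    rng = random.Random(seed)      # local RNG: the global random state is untouched
    words = list(probs)
    weights = [probs[w] for w in words]
    if sum(weights) == 0:          # unseen context: fall back to a uniform draw
        weights = [1.0] * len(words)
    return rng.choices(words, weights=weights, k=k)


# Hypothetical next-word distribution for a single context.
toy_probs = {"speak": 0.5, "hear": 0.3, "die": 0.2}
first = sample_words(toy_probs, k=5, seed=42)
second = sample_words(toy_probs, k=5, seed=42)
print(first == second)  # the same seed reproduces the same draw
```

Keeping the generator local to the call means two models can be sampled independently without one affecting the other's sequence.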

# Text

In [6]:
def regular_preproc(data):
    transform = [['\'ll', ' will'],
             ['can\'t', 'cannot'],
             ['won\'t', 'will not'],
             ['n\'t', ' not'],
             ['\'d', ' would'],
             ['\'re', ' are'],
             ['\'s', ' is'],
             ['\'m', ' am'],
             ['\'ve', ' have'],
             ['\'t', ' it'],
             ['o\'', 'of']] # contractions and their expansions, applied in order

    for op in transform:
        data = re.sub(op[0], op[1], data) # apply replacement
  
    return data
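
The replacement pairs are applied one after another, so the order of the table matters: `can't` has to be handled before the generic `n't` suffix. The sketch below illustrates this with a shortened, hypothetical copy of the table; it also shows that the `'s` rule rewrites possessives as well as contractions, which may or may not be desired.

```python
import re

# Shortened, hypothetical versions of the replacement table used above.
general_first = [("n't", " not"), ("can't", "cannot")]
specific_first = [("can't", "cannot"), ("n't", " not")]


def expand(text, rules):
    """Apply each (pattern, replacement) pair in order."""
    for pattern, repl in rules:
        text = re.sub(pattern, repl, text)
    return text


sample = "I can't go and she didn't stay"
wrong = expand(sample, general_first)    # "n't" fires first and breaks "can't"
right = expand(sample, specific_first)   # specific rule first, then the suffix
possessive = expand("caesar's wife", [("'s", " is")])

print(wrong)
print(right)
print(possessive)
```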

In [7]:
with open("shakespeare_input.txt", "r") as file:
    text = file.read()

In [8]:
text



## Dialogs

In [9]:
text_d = re.split('\n\n', text)

In [10]:
text_d = [re.sub(': ', ' ', seq) for seq in text_d]
text_d = [re.sub(':\n', ' : ', seq) for seq in text_d]
text_d = [re.sub('\n', ' ', seq) for seq in text_d]

In [11]:
with multiprocessing.Pool() as pool:
    text_d = pool.map(regular_preproc, text_d)

In [12]:
train_d = []
for seq in text_d:
    x = re.sub(r'[^\w.:?! ]', ' ', seq).lower()
    x = re.sub('[?!.]', ' . ', x)
    if len(x)>0:
        train_d.append(" ".join(x.split()))
        
text_d = train_d
text_d

['first citizen : before we proceed any further hear me speak .',
 'all : speak speak .',
 'first citizen : you are all resolved rather to die than to famish .',
 'all : resolved . resolved .',
 'first citizen : first you know caius marcius is chief enemy to the people .',
 'all : we know it we know it .',
 'first citizen : let us kill him and we will have corn at our own price . is it a verdict .',
 'all : no more talking o not let it be done away away .',
 'second citizen : one word good citizens .',
 'first citizen : we are accounted poor citizens the patricians good . what authority surfeits on would relieve us if they would yield us but the superfluity while it were wholesome we might guess they relieved us humanely but they think we are too dear the leanness that afflicts us the object of our misery is as an inventory to particularise their abundance our sufferance is a gain to them let us revenge this with our pikes ere we become rakes for the gods know i speak this in hunger fo

## sentence

In [13]:
text_s = re.split('\n\n', text)

In [14]:
text_s = [re.sub('.+:\n', '', seq) for seq in text_s]
text_s = [re.sub('\n', ' ', seq) for seq in text_s]

with multiprocessing.Pool() as pool:
    text_s = pool.map(regular_preproc, text_s)

In [15]:
train_s = [] 
for seq in text_s:
    train_s += re.split('[.!?]', seq)

In [16]:
text_s = []
for seq in train_s:
    x = re.sub(r'[^\w ]', ' ', seq).lower()
    if len(x)>0:
        text_s.append(" ".join(x.split()))

In [17]:
text_s

['before we proceed any further hear me speak',
 'speak speak',
 'you are all resolved rather to die than to famish',
 'resolved',
 'resolved',
 'first you know caius marcius is chief enemy to the people',
 'we know it we know it',
 'let us kill him and we will have corn at our own price',
 'is it a verdict',
 'no more talking o not let it be done away away',
 'one word good citizens',
 'we are accounted poor citizens the patricians good',
 'what authority surfeits on would relieve us if they would yield us but the superfluity while it were wholesome we might guess they relieved us humanely but they think we are too dear the leanness that afflicts us the object of our misery is as an inventory to particularise their abundance our sufferance is a gain to them let us revenge this with our pikes ere we become rakes for the gods know i speak this in hunger for bread not in thirst for revenge',
 'would you proceed especially against caius marcius',
 'against him first he is a very dog to th

## transform

In [18]:
def get_wordnet_pos(word):
    """Map POS tag to first character lemmatize() accepts"""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}

    return tag_dict.get(tag, wordnet.NOUN)



def lemm(text):
    words = map(lambda word: word.lower(), word_tokenize(text))
         
    lemmatizer = WordNetLemmatizer() # init lemmatizer

#     p = re.compile('[a-zA-Z]+')
#     filtered_tokens = list(filter(lambda token: p.match(token) and len(token)>=2, words))
    

    return ' '.join([lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in words])


In [19]:
%%time

with multiprocessing.Pool() as pool:
    text_d = pool.map(lemm, text_d)

CPU times: user 289 ms, sys: 75.6 ms, total: 364 ms
Wall time: 17 s


In [20]:
%%time
with multiprocessing.Pool() as pool:
    text_s = pool.map(lemm, text_s)

CPU times: user 299 ms, sys: 80.2 ms, total: 379 ms
Wall time: 16.5 s


In [21]:
dialogs_tr, dialogs_val = train_test_split(text_d, test_size=0.2, random_state=42)

In [22]:
sentence_tr, sentence_val = train_test_split(text_s, test_size=0.2, random_state=42)
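
The criteria ask for a train / validation / test split, while the two cells above produce train and validation parts only. One possible way to obtain all three parts is sketched below with the standard library; the function name `three_way_split`, the 10% sizes, and the toy data are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def three_way_split(items: Sequence[str],
                    val_size: float = 0.1,
                    test_size: float = 0.1,
                    seed: int = 42) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle once, then cut into train / validation / test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    n_val = int(len(shuffled) * val_size)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical stand-in for `text_s`.
toy_sentences = [f"sentence {i}" for i in range(20)]
train, val, test = three_way_split(toy_sentences)
print(len(train), len(val), len(test))
```

Shuffling at the utterance or sentence level, as `train_test_split` also does by default, places text from the same play in every part. Splitting by contiguous blocks instead would give a stricter held-out set.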

# NGramModel

In [23]:
def toFixed(numObj, digits=0):
    # Format a number with a fixed number of decimal places.
    return f"{numObj:.{digits}f}"

In [25]:
d = 5  # number of decimal places shown in the result tables

## dialogs

In [192]:
table = PrettyTable(['model', 
                     'train perplexity', 
                     'val perplexity',
                     'inf ratio on val',
                    ])  # create table for results

for i in [2, 3, 5, 10, 15]:
    model = MyNgrammLM(i)
    model.fit_doc(dialogs_tr)

    train_score = np.array([model.perplexity(x.split()) for x in dialogs_tr])
    
    val_score = np.array([model.perplexity(x.split()) for x in dialogs_val])
    
    table.add_row([ f'{i}-gramm model',
                   toFixed(np.mean(train_score[train_score>=0]), d),
                   toFixed(np.mean(val_score[val_score>=0]), d),
                   toFixed(len(val_score[val_score==-1]) / len(dialogs_val), d),
                  ]) # add scores
    
print(table)

+----------------+------------------+----------------+------------------+
|     model      | train perplexity | val perplexity | inf ratio on val |
+----------------+------------------+----------------+------------------+
| 2-gramm model  |     61.00328     |    37.07633    |     0.75391      |
| 3-gramm model  |     8.30589      |    8.87059     |     0.94343      |
| 5-gramm model  |     1.90399      |    3.91641     |     0.98131      |
| 10-gramm model |     1.51272      |    2.18098     |     0.98259      |
| 15-gramm model |     1.38910      |    1.72926     |     0.98259      |
+----------------+------------------+----------------+------------------+
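
The `inf ratio on val` column is the share of validation sequences for which `perplexity` returned -1, that is, sequences containing at least one n-gram with zero unsmoothed probability. The perplexity columns average only over the remaining sequences, so the low values for large n cover only a small part of the validation set. The effect can be reproduced on a toy bigram model, sketched below independently of `MyNgrammLM`; the training sentence and function names are chosen for illustration only.

```python
import math
from collections import Counter
from typing import List

train_tokens = "we know it we know it".split()
vocab = set(train_tokens)
bigrams = Counter(zip(train_tokens, train_tokens[1:]))
contexts = Counter(train_tokens[:-1])


def bigram_prob(prev: str, word: str, k: float = 0.0) -> float:
    """Add-k estimate of P(word | prev); k = 0 gives the unsmoothed estimate."""
    denom = contexts[prev] + k * len(vocab)
    return (bigrams[(prev, word)] + k) / denom if denom else 0.0


def perplexity(tokens: List[str], k: float = 0.0) -> float:
    """Perplexity over the bigrams of `tokens`, computed in log space."""
    log_sum = 0.0
    n = 0
    for prev, word in zip(tokens, tokens[1:]):
        p = bigram_prob(prev, word, k)
        if p == 0.0:
            return math.inf  # a single unseen bigram makes perplexity infinite
        log_sum += math.log(p)
        n += 1
    return math.exp(-log_sum / n)


seen = perplexity("we know it".split())               # all bigrams seen in training
unseen = perplexity("it know we".split())             # contains unseen bigrams
smoothed = perplexity("it know we".split(), k=0.01)   # finite once smoothing is on

print(seen, unseen, smoothed)
```

This is why the smoothed tables report an `inf ratio` of 0 but much higher validation perplexity: every sequence now contributes to the average, including those with unseen n-grams.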


In [232]:
model = MyNgrammLM(2)
model.fit_doc(dialogs_tr)

res = model.generate(300)

for x in res:
    print(" ".join(x))

cade driven before i persuade me my brother and twenty year may he bear he hath that must wear a cart and say that have need for duty not a caesar call .
pandarus : ay such a house be he doth fill would thou art thou lay fourteen and yet i shall not that have well do you sir .
sand into rhyme very taunt i hope i think you come bring in a mad away two hour by the sail and thereupon he wore the very virtue which valiantly he mock lord my wolsey : hum a we proceed good parentage .
iago : and fleet .
othello .
mistress . then confess she with this time which we forth her a i with much thou shalt be the night in ross :
captain steward : nay it with your crown within but give no pulse . i must not ruin and so cunning to our side strike my lord .
king that gem be my body and crouch under mar you could myself most know the inaudible and what be the third servingman : he do not afeard now turn therefore fasten would him what in his tender spray thus for grace . that i shall forget ist and there

#### smooth

In [212]:
table = PrettyTable(['model', 
                     'train perplexity', 
                     'val perplexity',
                     'inf ratio on val',
                    ])  # create table for results


for i in tqdm([2, 3, 5, 10]):
    model = MyNgrammLM(i)
    model.fit_doc(dialogs_tr)
    table.add_row([f'{i}-gramm model',
                   '',
                   '',
                   '']) # add scores
        
    for alpha in [0.0001, 0.001, 0.01, 0.1]:
        
        model.update_alpha(alpha)

        train_score = np.array([model.perplexity(x.split(), prob_method='smooth') for x in dialogs_tr])

        val_score = np.array([model.perplexity(x.split(), prob_method='smooth') for x in dialogs_val])

        table.add_row([f'alpha: {alpha}',
                       toFixed(np.mean(train_score[train_score>=0]), d),
                       toFixed(np.mean(val_score[val_score>=0]), d),
                       toFixed(len(val_score[val_score==-1]) / len(dialogs_val), d),
                      ]) # add scores
    
print(table)

100%|██████████| 4/4 [01:26<00:00, 21.65s/it]

+----------------+------------------+----------------+------------------+
|     model      | train perplexity | val perplexity | inf ratio on val |
+----------------+------------------+----------------+------------------+
| 2-gramm model  |                  |                |                  |
| alpha: 0.0001  |     62.19712     |   516.17293    |     0.00000      |
|  alpha: 0.001  |     68.01581     |   317.62360    |     0.00000      |
|  alpha: 0.01   |     94.15980     |   263.96228    |     0.00000      |
|   alpha: 0.1   |    199.86126     |   349.71493    |     0.00000      |
| 3-gramm model  |                  |                |                  |
| alpha: 0.0001  |     9.98537      |   1906.32992   |     0.00000      |
|  alpha: 0.001  |     18.24615     |   1102.12657   |     0.00000      |
|  alpha: 0.01   |     64.26872     |   1083.94150   |     0.00000      |
|   alpha: 0.1   |    376.83011     |   1695.39509   |     0.00000      |
| 5-gramm model  |                  | 




In [233]:
model = MyNgrammLM(2, alpha=0.01)
model.fit_doc(dialogs_tr)

res = model.generate(100, prob_method='smooth')

for x in res:
    print(" ".join(x))

leontes calmness staider carpenter insultment outsport scarcely cubiculo spending assembly disburse erst daffed befall rive tod heaviest unblessed lump upstart ravisher misprison triton devoutly saxon pecus cog bottomless consisteth recomforture thought most gracious .
gloucester .
arthur hautboy immodest happiest determinate boon future swarths northamptonshire warp sunken nothing louse closer prescribe whitehall libertine sumpter gazer nod dragon cricket discordant blossoming gloss nuptial leonato hapless endows doreus minion machiavel personates fresher cheaply bolter fog exhort testril biscuit lisbon violent hand and pride perpetuity pendulous insupportable uncertain trill countenance quite actual toadstool tuft hardness of grievous revolt nuncle downfall of clotens syria


## sentence

In [197]:
table = PrettyTable(['model', 
                     'train perplexity', 
                     'val perplexity',
                     'inf ratio on val',
                    ])  # create table for results

for i in [2, 3, 5, 10]:
    model = MyNgrammLM(i)
    model.fit_doc(sentence_tr)

    train_score = np.array([model.perplexity(x.split()) for x in sentence_tr])
    
    val_score = np.array([model.perplexity(x.split()) for x in sentence_val])
    
    table.add_row([ f'{i}-gramm model',
                   toFixed(np.mean(train_score[train_score>=0]), d),
                   toFixed(np.mean(val_score[val_score>=0]), d),
                   toFixed(len(val_score[val_score==-1]) / len(sentence_val), d),
                  ]) # add scores
    
print(table)

+----------------+------------------+----------------+------------------+
|     model      | train perplexity | val perplexity | inf ratio on val |
+----------------+------------------+----------------+------------------+
| 2-gramm model  |     91.59909     |    85.28901    |     0.70482      |
| 3-gramm model  |     10.44828     |    17.26235    |     0.88105      |
| 5-gramm model  |     2.93297      |    7.45387     |     0.91930      |
| 10-gramm model |     1.93122      |    3.53005     |     0.92107      |
+----------------+------------------+----------------+------------------+


In [234]:
model = MyNgrammLM(2)
model.fit_doc(sentence_tr)

res = model.generate(300)

for x in res:
    print(" ".join(x))

no more of his mistress be this foolish knight
be of the shade all
i be have thy book of noble lady entertain would in company here at me not choose
by god do
fie coward a tongue
how so truly limn would malice itself in night can make
i ask forgiveness then to my ache bone hear thee and truly by and a bond of his tongue to be not home
till he will bring thee
nor to my lord
thy head shalt not pas
nor charter a pound a fearful soul
by my past and bury when would it
hum
o pardon
a for you must lave our honour
for the small prick my lord chatillon
o be gentleman your foot let thy friend
down upon macbeth
valiant that there be dead
i would and with my master would and when you may give me whereas no
what to the shift into the devil porter let but how much more she could not churchman master bardolph
get thee the day be such a i be unlearn would benefit which well do never wear and nod they bear me you be not so perchance do you render you down and let why i of you say what have thrice bow w

In [207]:
table = PrettyTable(['model', 
                     'train perplexity', 
                     'val perplexity',
                     'inf ratio on val',
                    ])  # create table for results


for i in tqdm([2, 3, 5, 10]):
    model = MyNgrammLM(i)
    model.fit_doc(sentence_tr)
    table.add_row([f'{i}-gramm model',
                   '',
                   '',
                   '']) # add scores
        
    for alpha in [0.0001, 0.001, 0.01, 0.1]:
        
        model.update_alpha(alpha)

        train_score = np.array([model.perplexity(x.split(), prob_method='smooth') for x in sentence_tr])

        val_score = np.array([model.perplexity(x.split(), prob_method='smooth') for x in sentence_val])

        table.add_row([f'alpha: {alpha}',
                       toFixed(np.mean(train_score[train_score>=0]), d),
                       toFixed(np.mean(val_score[val_score>=0]), d),
                       toFixed(len(val_score[val_score==-1]) / len(sentence_val), d),
                      ]) # add scores
    
print(table)

100%|██████████| 4/4 [01:20<00:00, 20.18s/it]

+----------------+------------------+----------------+------------------+
|     model      | train perplexity | val perplexity | inf ratio on val |
+----------------+------------------+----------------+------------------+
| 2-gramm model  |                  |                |                  |
| alpha: 0.0001  |     93.62841     |   5553.19558   |     0.00000      |
|  alpha: 0.001  |    104.83816     |   1474.15306   |     0.00000      |
|  alpha: 0.01   |    156.02278     |   784.55832    |     0.00000      |
|   alpha: 0.1   |    365.46459     |   804.33733    |     0.00000      |
| 3-gramm model  |                  |                |                  |
| alpha: 0.0001  |     12.74687     |   3600.03890   |     0.00000      |
|  alpha: 0.001  |     24.78790     |   2053.71656   |     0.00000      |
|  alpha: 0.01   |     93.48619     |   1896.08682   |     0.00000      |
|   alpha: 0.1   |    545.32808     |   2635.93452   |     0.00000      |
| 5-gramm model  |                  | 




In [235]:
model = MyNgrammLM(2, alpha=0.01)
model.fit_doc(sentence_tr)

res = model.generate(100, prob_method='smooth')

for x in res:
    print(" ".join(x))

i know it would these rogue comply exhalation termagant ardent penetrable myrtle lavish spirit burthen total hubert poop herne be peaseblossom perdurably rapine consonant affecteth story we look you have ta encertainties cure abortives erflows mind hail lord call virtue complots intrenchant guildfords debitor betwitched pulse bedazzle deprave rainy simple soundest plantain ti abroach silently brown witted passado menton dedicates bane mealy nuncio chafe sweetest sleep outside appris trice harshness enacts icicle expell tyrannous therewithal dissension brief only semblable sinner hammer notify virginal lentus severn chape dilate totter capability and then leave
yet i have in deepest behoof swain affectioned
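
The long first sample is dominated by rare words. That is what one would expect if smoothing gives every vocabulary word a small non-zero probability after every context. The implementation of `MyNgrammLM` is not shown here, so the following is only a minimal sketch of the textbook add-k estimate for bigrams, P(w | c) = (count(c, w) + k) / (count(c) + k·|V|). The function name `add_k_bigram_prob` and the toy corpus are hypothetical.

```python
from collections import Counter

def add_k_bigram_prob(word, prev, bigram_counts, context_counts, vocab_size, k=0.01):
    """Add-k estimate of P(word | prev) for a bigram model."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)

# Hypothetical toy corpus, used only for illustration.
tokens = "i know it i know you".split()
context_counts = Counter(tokens[:-1])             # the last token has no continuation
bigram_counts = Counter(zip(tokens, tokens[1:]))
vocab = sorted(set(tokens))

seen = add_k_bigram_prob("know", "i", bigram_counts, context_counts, len(vocab))
unseen = add_k_bigram_prob("you", "i", bigram_counts, context_counts, len(vocab))
total = sum(add_k_bigram_prob(w, "i", bigram_counts, context_counts, len(vocab))
            for w in vocab)

print(seen, unseen, total)
```

The unseen continuation gets a small but non-zero probability, and the probabilities over the vocabulary still sum to one.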


# Neural network model

In [27]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding
from keras import layers 
from keras import callbacks
from keras import models

In [25]:
class NNLM(BaseLM):
    def _check_fit(func):
        """
        A helper decorator that ensures that the LM was fit on vocab.
        """
        @wraps(func)
        def wrapper(self,*args,**kwargs):
            if not self.is_fitted:
                raise AttributeError(f"Fit model before call {func.__name__} method")
            return func(self, *args,**kwargs)
        return wrapper
    
    
    
    def __init__(self, 
                 n: int, 
                 vocab: Iterable[str] = None, 
                 unk_label: str = "<UNK>",
                 unk_ratio = 0.1
                ):
        super().__init__(n, vocab, unk_label)
        self.start_sym = "<s>"
        self.unk_label = unk_label
        
        self.tokenizer = Tokenizer(oov_token=unk_label, filters='')
        
        if vocab:
            self.is_vocab = True
            self.vocab = set(vocab)
            self.tokenizer.fit_on_texts(vocab)
        else:
            self.is_vocab = False
            self.vocab = set()
        
        self.vocab.add(self.unk_label)
        self.vocab.add(self.start_sym)
        self.vocab.add('.')
        
        self.unk_ratio = unk_ratio
        self.is_fitted = False  # set to True by fit() or load_model()
        
    def load_model(self, path):
        self.model = self.get_model()
        self.model.load_weights(path)
        self.is_fitted = True
        
    def get_model(self):
        # Feed-forward LM: embed the n-1 context tokens, max-pool over them,
        # then predict the next token with a softmax over the vocabulary.
        vocab_size = len(self.tokenizer.word_index) + 1  # +1: id 0 is reserved by the Tokenizer
        self.model = Sequential()
        self.model.add(Embedding(vocab_size, 10, input_length=self.n-1))
        self.model.add(layers.GlobalMaxPool1D())
        self.model.add(Dense(50, activation='relu'))
        self.model.add(Dense(vocab_size, activation='softmax'))

        # Targets are integer token ids, hence the sparse loss.
        self.model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
        return self.model
    
    
    def update_vocab(self, doc):
        doc = self.doc_gen(doc) 
            
        dd = defaultdict(lambda: 0)
        for word in doc:
            dd[word]+=1
            
        N = len(dd.keys())
            
            
        min_args = np.argsort(list(dd.values()))
            
        for i in range(N):
            if i not in min_args[ : int(N*self.unk_ratio)]:
                self.vocab.add(list(dd.keys())[i])
                    
        self.tokenizer.fit_on_texts(self.vocab)
        
        self.model = self.get_model()
        
        
    
    def _add_padding(self,
                     text: Iterable[str],
                     end=True
                    ):
        
        start_ = [self.start_sym] * (self.n-1)
        end_ = [self.start_sym] * (self.n-1)
        
        if not text:
            text = []
        
        # Accept any iterable (e.g. a tuple context), not only a list.
        sequence = start_ + list(text)
        if end:
            sequence += end_
            
        
        return sequence
    
    
    def doc_gen(self, sequence_of_texts):
        doc = [self.start_sym] * (self.n - 1)
        
        for text in sequence_of_texts:
            for word in text.split():
                doc.append(word)
            
            for _ in range(self.n-1):
                doc.append(self.start_sym)
        
        return doc
    
    
    def splits(self, text):
        encoded = self.tokenizer.texts_to_sequences([text])[0]
        

        contexts = []
        words = []
        
        for i in range(self.n-1, len(encoded)):
            contexts.append(encoded[i-self.n+1:i])
            words.append(encoded[i])
            
        return contexts, words
        
    def fit(self, 
            sequence_of_texts: Iterable[str],
            val_texts: Iterable[str],
            patience=5,
            epochs=200,
            path = 'model'
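           ):
        """
        This method trains the network on the given texts, building
        `self.vocab` first if no vocabulary was supplied, and saves the
        weights with the lowest validation loss to `path`.

        sequence_of_texts -- iterable of training texts (whitespace-separated tokens)
        val_texts -- iterable of validation texts, used for early stopping

        For example
        >>> lm.fit(['hello world'], ['hello'])
        """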
        
        
        val_doc = self.doc_gen(val_texts)
        
        doc = self.doc_gen(sequence_of_texts) 
        
        
        
        if not self.is_vocab:
            
            dd = defaultdict(lambda: 0)
            for word in doc:
                dd[word]+=1
            
            N = len(dd.keys())
            
            
            min_args = np.argsort(list(dd.values()))
            
            for i in range(N):
                if i not in min_args[ : int(N*self.unk_ratio)]:
                    self.vocab.add(list(dd.keys())[i])
                    
            self.tokenizer.fit_on_texts(self.vocab)
        
        
        doc = " ".join(doc)
        val_doc = " ".join(val_doc)
        
        
        contexts, words = self.splits(doc)
        
        
        val_contexts, val_words = self.splits(val_doc)
        
        
        #words = to_categorical(words, num_classes=len(self.tokenizer.word_index)+1)
        
        
        self.get_model()
        
        early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=patience)
        model_checkpoint = callbacks.ModelCheckpoint(
                                                  filepath=path,
                                                  save_weights_only=True,
                                                  monitor='val_loss',
                                                  mode='min',
                                                  save_best_only=True)

            
        self.model.fit(np.array(contexts), 
                       np.array(words),
                       validation_data=(np.array(val_contexts), np.array(val_words)),
                       epochs=epochs, 
                       callbacks=[early_stop, model_checkpoint],
                       verbose=0)
        
        self.is_fitted = True
        
        #raise NotImplementedError
        
        
        
    @_check_fit    
    def generate(self, 
                 text_length: int, 
                 text_seed: Iterable[str] = None,
                 random_seed: Union[int,random.Random] = 42,
                 ) -> List[str]:
        """
        This method generates `text_length` words in total.

        text_length: int -- Number of words to generate (padding symbols are not counted).
        text_seed: List[str] -- Given text used as the initial context.
        random_seed -- Currently unused; sampling goes through NumPy's global random state.

        Returns a list of strings, one per generated text. A new text starts
        whenever the padding symbol is sampled.

        For example
        >>> lm.generate(2)
        ["hello world"]

        """
        
          
        text = self._add_padding(text_seed, end=False)
        text = " ".join(text)
        
        start_len = len(self.start_sym) * (self.n - 1) + (self.n - 2)
        
        
        result = []
        i = 0
        
        while i < text_length:
            encoded = self.tokenizer.texts_to_sequences([text])[0]
            
            context = encoded[-self.n+1:]

            
            # Pass an explicit batch of shape (1, n-1).
            probs = self.model.predict(np.array([context]))[0]
            
            #probs = preprocessing.normalize([probs], norm='l1')[0]

            num = np.random.choice(len(probs), p=probs)
            
            
            word = 'ERROR'
            for pred_word, index in self.tokenizer.word_index.items():
                if index == num:
                    word = pred_word
                    break
            
            if word == self.start_sym:
                if len(text) > start_len:
                    result.append(text[start_len+1:])
                text = self._add_padding(None, end=False)
                text = " ".join(text)
                
            else:    
                text += ' ' + word
                i+=1
                
        result.append(text[start_len+1:])  
            
        return result    
    
    
    
    
    @_check_fit
    def prob(self, 
             word: str, 
             context: Tuple[str] = None
            ) -> float:
        """This method returns probability of a word with given context: P(w_t | w_{t - 1}...w_{t - n + 1})

        For example:
        >>> lm.prob('hello', context=('world',))
        0.99988
        
        """
        
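        # Tokenizer ids index the softmax output directly (id 0 is reserved for padding),
        # so the id is used as-is, without an offset.
        word = self.tokenizer.texts_to_sequences([word])[0][0]
        
        text = self._add_padding(context, end=False)
        
        text = " ".join(text)
        
        encoded = self.tokenizer.texts_to_sequences([text])[0]
        
        context = encoded[-self.n+1:]
            
        # Pass an explicit batch of shape (1, n-1).
        probs = self.model.predict(np.array([context]))[0]
        
        return probs[word]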
    
    
    
    @_check_fit  
    def perplexity(self, 
                   sequence_of_tokens: Union[Iterable[str], Iterable[Tuple[str]]]
                   ) -> float:
        """
        This method returns perplexity for a given sequence of tokens,
        or -1 if the accumulated probability is zero.

        sequence_of_tokens -- iterable of tokens
        """
        
        sequence_of_tokens = self._add_padding(sequence_of_tokens)
        
        N = len(sequence_of_tokens)
        
        
        muls = 1
        for i in range(N-self.n+1):
            
            context = sequence_of_tokens[i:i+self.n-1]
            word = sequence_of_tokens[i+self.n-1]
            
            muls *= self.prob(word, context) ** (1/(N-self.n+1))

        if muls == 0:
            return -1
        
        return muls ** -1
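
The `perplexity` method above multiplies per-token probabilities, each raised to the power 1/N. An equivalent and commonly used formulation averages log-probabilities and exponentiates once, which avoids underflow on long sequences. The sketch below assumes the per-token probabilities are already available as a list. The helper name `perplexity_from_probs` is hypothetical, and unlike the method above it returns infinity rather than -1 when a token has zero probability.

```python
import math

def perplexity_from_probs(probs):
    """Perplexity of a sequence, given the probability the model assigned to each token."""
    if any(p <= 0 for p in probs):
        return float("inf")  # a zero-probability token makes the perplexity unbounded
    avg_log_prob = sum(math.log(p) for p in probs) / len(probs)
    return math.exp(-avg_log_prob)

# Hypothetical per-token probabilities.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
print(perplexity_from_probs([0.5, 0.0]))
```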

In [30]:
for n in tqdm([2, 3, 5]):
    
    nn_model = NNLM(n=n)
    nn_model.fit(dialogs_tr, dialogs_val, epochs=200, path=f'models/{n}_model', patience=3)


100%|██████████| 3/3 [40:20<00:00, 806.68s/it]
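
For reference, the training pairs built by `splits` are sliding windows over the encoded token stream: each target is preceded by its n-1 context tokens. The stand-alone sketch below reproduces that windowing on a toy sequence. The function name `context_target_pairs` and the token ids are hypothetical.

```python
def context_target_pairs(encoded, n):
    """Split an encoded token stream into (n-1)-token contexts and next-token targets."""
    contexts, targets = [], []
    for i in range(n - 1, len(encoded)):
        contexts.append(encoded[i - n + 1:i])
        targets.append(encoded[i])
    return contexts, targets

# Hypothetical ids; 2 stands for the "<s>" padding symbol.
encoded = [2, 2, 5, 7, 9, 2, 2]
contexts, targets = context_target_pairs(encoded, n=3)

for context, target in zip(contexts, targets):
    print(context, "->", target)
```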


In [23]:
def to_fixed(num):
    return "{0:.5f}".format(num)

In [46]:
table = PrettyTable(['model', 
                     'train perplexity', 
                     'val perplexity',
                    ])  # create a table for the results


for n in tqdm([2, 3, 5]):
    nn_model = NNLM(n=n)
    nn_model.update_vocab(dialogs_tr)
    nn_model.load_model(f'models/{n}_model')
        
    train_score = np.array([nn_model.perplexity(x.split()) for x in dialogs_tr[:100]])
    
    val_score = np.array([nn_model.perplexity(x.split()) for x in dialogs_val[:100]])

    table.add_row([ f'{n}-gramm NNmodel',
                   to_fixed(np.mean(train_score[train_score>=0])),
                   to_fixed(np.mean(val_score[val_score>=0])),
                  ]) # add scores
    
print(table)

100%|██████████| 3/3 [09:26<00:00, 188.68s/it]

+-----------------+------------------+------------------+
|      model      | train perplexity |  val perplexity  |
+-----------------+------------------+------------------+
| 2-gramm NNmodel | 2067776932.01734 |  32060722.89975  |
| 3-gramm NNmodel | 546141224.43736  | 2591031142.82844 |
| 5-gramm NNmodel | 678448272.27763  | 726211154.08193  |
+-----------------+------------------+------------------+
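
As a reference point, a model that spreads probability uniformly over V words has perplexity exactly V, so values far above the vocabulary size suggest the probabilities are not being read as intended. One thing worth checking in that case is that the probability is taken from the softmax column whose index equals the tokenizer id. The toy example below uses a hypothetical 4-word vocabulary and made-up probabilities to show how an off-by-one column inflates the score.

```python
import numpy as np

# Toy softmax output for a hypothetical 4-word vocabulary with ids 1..4.
# Column 0 is the padding id that the Keras Tokenizer never assigns to a word.
probs = np.array([0.0, 0.05, 0.05, 0.85, 0.05])
target_id = 3

def single_token_perplexity(p):
    """Perplexity of a one-token sequence is the inverse of its probability."""
    return 1.0 / p

aligned = single_token_perplexity(probs[target_id])      # column matches the token id
shifted = single_token_perplexity(probs[target_id - 1])  # column is off by one
uniform = float(len(probs) - 1)                          # uniform baseline equals the vocabulary size

print(aligned, uniform, shifted)
```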







In [48]:
nn_model = NNLM(n=2)
nn_model.update_vocab(dialogs_tr)
nn_model.load_model('models/2_model')

In [None]:
result = nn_model.generate(300)
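
`generate` accepts a `random_seed` argument, but the sampling step calls `np.random.choice` on NumPy's global state, so repeated runs can produce different texts. A reproducible draw could be sketched as follows, assuming a seeded `np.random.default_rng` generator. The vocabulary, the probabilities and the helper `sample_word` are hypothetical.

```python
import numpy as np

# Hypothetical id-to-word mapping and next-word distribution (column 0 is the unused padding id).
index_word = {1: "<UNK>", 2: "<s>", 3: "hello", 4: "world"}
probs = np.array([0.0, 0.1, 0.2, 0.3, 0.4])

def sample_word(probs, index_word, rng):
    """Draw one token id from the distribution and map it back to its word."""
    idx = int(rng.choice(len(probs), p=probs))
    return index_word.get(idx, "<UNK>")

rng = np.random.default_rng(42)
first = [sample_word(probs, index_word, rng) for _ in range(5)]

rng = np.random.default_rng(42)  # the same seed reproduces the same draws
second = [sample_word(probs, index_word, rng) for _ in range(5)]

print(first == second)
```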

In [52]:
result

['second nor not strong not to a she his corse .',
 'touchstone : good husband be a before how will catch up a speak in heart be spoke of : these meaner skill would with new tutor out for his apprehension too he will not take the place but appear .',
 'bawd and plot a a it last a knight .',
 'first with florence while he ever .',
 'first a blow with a chain . i prove you be well let u . why once i will we know it . rouse your idle curse . i live honesty should command how of her face . high with a word and half the princess be duke of me enough beloved of thy eldest .',
 'hamlet : what saw her o what make get the sword home come .',
 'brutus : there to the look the senator : signior young world',
 'pericles : i nor never wring indeed i shall we must be pride to by be pity for your strife again no still low necessity and four hand chamber free them . thy very than tongue when if my meat of the dangerous of slander and hark : i spake or let me whom that how i be more goodness : yet brigh

## sentence

In [28]:
for n in tqdm([2, 3, 5]):
    
    nn_model = NNLM(n=n)
    nn_model.fit(sentence_tr, sentence_val, epochs=200, path=f'models/{n}_model_sen')

100%|██████████| 3/3 [48:20<00:00, 966.87s/it]


In [29]:
table = PrettyTable(['model', 
                     'train perplexity', 
                     'val perplexity',
                    ])  # create a table for the results


for n in tqdm([2, 3, 5]):
    nn_model = NNLM(n=n)
    nn_model.update_vocab(sentence_tr)
    nn_model.load_model(f'models/{n}_model_sen')
        
    train_score = np.array([nn_model.perplexity(x.split()) for x in sentence_tr[:100]])
    
    val_score = np.array([nn_model.perplexity(x.split()) for x in sentence_val[:100]])

    table.add_row([ f'{n}-gramm NNmodel',
                   to_fixed(np.mean(train_score[train_score>=0])),
                   to_fixed(np.mean(val_score[val_score>=0])),
                  ]) # add scores
    

100%|██████████| 3/3 [05:26<00:00, 108.76s/it]


In [30]:
print(table)

+-----------------+------------------+-----------------+
|      model      | train perplexity |  val perplexity |
+-----------------+------------------+-----------------+
| 2-gramm NNmodel |  4876755.80593   |  5354134.01411  |
| 3-gramm NNmodel |  13982476.39807  |  18084974.64803 |
| 5-gramm NNmodel |  93081145.43325  | 193248152.16751 |
+-----------------+------------------+-----------------+


In [31]:
nn_model = NNLM(n=2)
nn_model.update_vocab(sentence_tr)
nn_model.load_model('models/2_model_sen')

In [None]:
result = nn_model.generate(300)

In [33]:
result

['by the earth',
 'these hat of my lord',
 'you play me my more to tell',
 'what be about it take his authority in faction he hath far and found',
 'my head',
 'take this rogue',
 'what be go still',
 'be grievously accurse',
 'it in the same violently dress would and soon shook thine shall you be milan before be father',
 'yes my name and nothing can it but quench would sorrow so man bend knock',
 'how i be a blood to speak in humble for tell me meet by winter be him',
 'well play in a',
 'i say there and weep yet some adramadio bassanio who like have see she not my equal good i could go thou the sky in france that set me but she',
 'i have do bestow the flesh must i the sure his favour',
 'fetch thee to play fair body now young hang you hurt this your with this desperate temper be in the world leave this shot n be',
 'but set me if he will be fed in your peace new form of usurp of him fill thy sense or it thee',
 'an contemplation be occasion and which not nought i have thou i be tal