# Language Models
* One traditional approach
* And one deep approach

----

## What is a Language Model?

### A model which tries to assess the likelihood of language

$P(W) = P(w_1, w_2, ..., w_n)$

or

$P(w_{t+1} | w_{t-n+1}, ..., w_{t})$
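
The two forms are connected by the chain rule of probability. As a sketch, the joint probability factorises exactly into next-word predictions, and a bigram model then approximates each factor by keeping only the previous word (here $w_0$ is an assumed start-of-sentence symbol, not something defined above):

$P(w_1, w_2, ..., w_n) = \prod_{t=1}^{n} P(w_t | w_1, ..., w_{t-1}) \approx \prod_{t=1}^{n} P(w_t | w_{t-1})$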

### Three kinds of knowledge make a sentence more or less 'likely':
* Syntax - e.g. 'I go home' vs 'I home go'
* Semantics - e.g. 'I go home' vs 'I go house'
* Pragmatics - e.g. 'I go home' vs 'I go home and 2+2=4'

### All this knowledge needs to be captured inside the pairings of words with other words!
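
To make the idea of word pairings concrete, here is a minimal sketch that scores a sentence by multiplying bigram probabilities estimated from raw counts. The toy corpus and the helper names `bigram_probability` and `sentence_score` are made up for illustration and are not part of the notebook.

```python
from collections import Counter

# Toy corpus (made up for illustration)
corpus = "i go home . you go home . i go to work . we go home".split()

# Count each context word and each adjacent word pair
unigram_counts = Counter(corpus[:-1])
bigram_counts = Counter(zip(corpus[:-1], corpus[1:]))

def bigram_probability(prev_word, word):
    """Estimate P(word | prev_word) from counts; 0.0 if prev_word was never seen."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

def sentence_score(sentence):
    """Multiply the bigram probabilities along a sentence."""
    words = sentence.lower().split()
    score = 1.0
    for prev_word, word in zip(words[:-1], words[1:]):
        score *= bigram_probability(prev_word, word)
    return score

print(sentence_score("I go home"))   # word order seen in the corpus
print(sentence_score("I home go"))   # word order never seen in the corpus
```

The scrambled sentence scores zero because the pair ('i', 'home') never occurs, which is how a pairing-based model ends up penalising a syntax error without any explicit grammar.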

---

## A 'traditional' language model... for Sequence Generation

#### A bigram Markov chain (more later in the course)

In [1]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
import re

In [2]:
def bigram_word_distribution(data):
    """create a probability distribution over all bigrams
    
        params: data - a Bunch data object from sklearn
        returns: Word probability distribution
    """

    text = data['data']
    all_data = ' '.join([' '.join(re.findall('(?u)\\b\\w\\w*\\b',article.lower())) for article in text]).split()
    words = pd.DataFrame({'words':all_data})
    words['next_words'] = words['words'].shift(-1)
    word_distribution = words.groupby('words')['next_words'].value_counts(normalize=True)
    
    return word_distribution
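
The `shift` and `groupby` steps above can be hard to picture on 11,000 articles, so here is the same pattern on a tiny token list. The tokens are made up and `toy_dist` is a hypothetical name; only the pandas calls mirror the function.

```python
import pandas as pd

# Toy token list (made up) standing in for all_data
tokens = "the cat sat on the mat the cat ran".split()

words = pd.DataFrame({'words': tokens})
# Pair every word with the word that follows it; the last row gets NaN
words['next_words'] = words['words'].shift(-1)

# Relative frequency of each follower, grouped by the preceding word
toy_dist = words.groupby('words')['next_words'].value_counts(normalize=True)

# Indexing by a word returns its next-word distribution
the_followers = toy_dist['the']
print(the_followers)
```

Each per-word distribution sums to 1, which is what lets `np.random.choice` use its values directly as the `p` argument.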

In [3]:
def bigram_text_generation(seed, length, distribution):
    """seed a distribution with a seed word, and ask it to make more words
        
        params: seed - A seed word, 
                length -Length of the generated sentence
                distribution - A word probability distribution
                
        returns: generated sentence
    """
    
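    try:
        seed = seed.lower()
        for i in range(length):
            # sample the next word from the distribution conditioned on the last word
            next_words = distribution[seed.split()[-1]]
            seed += ' ' + np.random.choice(next_words.index, p=next_words.values)
        return seed
    
    except KeyError:
        # the last word never appeared as a context in the training data
        print('Oops! Try another seed')
        return None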

### Download text data

In [4]:
data = fetch_20newsgroups(remove=['headers', 'footers'])

In [5]:
data.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [6]:
data['data'][1:2], len(data['data'])

(["A fair number of brave souls who upgraded their SI clock oscillator have\nshared their experiences for this poll. Please send a brief message detailing\nyour experiences with the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies are especially requested.\n\nI will be summarizing in the next two days, so please add to the network\nknowledge base if you have done the clock upgrade and haven't answered this\npoll. Thanks."],
 11314)

### Calculate the bigram probabilities

In [7]:
bi_dist = bigram_word_distribution(data)

### Generate a sentence

In [8]:
seed = 'It'

In [9]:
sentence_bigram = bigram_text_generation(seed, 20, bi_dist)

In [10]:
sentence_bigram

'it is removed from the disk doctor i build say about this sounds very little rom s immune system and feel'

### How can we improve it?
* We're predicting from bigrams; we can use trigrams instead
* That is, take context from the previous 2 words instead of only the previous word
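
A small sketch of why the extra context helps: with a made-up token list, the followers of a single word are compared with the followers of a two-word context. The names `one_word_context` and `two_word_context` are hypothetical.

```python
from collections import Counter, defaultdict

# Toy token list (made up for illustration)
tokens = "i go home now . i go to work . you go home late . i go home now".split()

one_word_context = defaultdict(Counter)   # previous word -> counts of next word
two_word_context = defaultdict(Counter)   # previous two words -> counts of next word

for prev_word, word in zip(tokens, tokens[1:]):
    one_word_context[prev_word][word] += 1

for first, second, word in zip(tokens, tokens[1:], tokens[2:]):
    two_word_context[(first, second)][word] += 1

print(one_word_context['go'])             # several possible followers
print(two_word_context[('you', 'go')])    # narrower once 'you' is known
```

The longer context gives a sharper prediction, but each context is backed by fewer observations, so the counts get sparser as n grows.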

In [11]:
def trigram_word_distribution(data):
    """create a probability distribution over all trigrams
    
        params: data - a Bunch data object from sklearn
        returns: [Bigram probability distribution, trigram probability distribution]
    """
    
    text = data['data']
    all_data = ' '.join([' '.join(re.findall('(?u)\\b\\w\\w*\\b',article.lower())) for article in text]).split()
    # two-word contexts: each word joined with the word that follows it
    tri_gram = [' '.join([x,y]) for x,y in zip(all_data[:-1], all_data[1:])]
    # the word following each two-word context; pad the final row, which has no successor
    next_word = all_data[2:] + [' ']
    words = pd.DataFrame({'seed_word':all_data[:-1],'gram_words':tri_gram, "next_word":next_word})
    # the word following each single seed word, used to pick the second word of a sentence
    words['seed_next_word'] = words['seed_word'].shift(-1)
    seed_word_distribution = words.groupby('seed_word')['seed_next_word'].value_counts(normalize=True)
    gram_word_distribution = words.groupby('gram_words')['next_word'].value_counts(normalize=True)
    
    return [seed_word_distribution, gram_word_distribution]

In [12]:
tri_dist = trigram_word_distribution(data)

In [13]:
def trigram_text_generation(seed, length, distribution):
    """seed the distributions with a seed word, and ask them to make more words
        
        params: seed - A seed word, 
                length - Number of words to generate after the first two
                distribution - [Bigram probability distribution, trigram probability distribution]
                
        returns: generated sentence
    """
    
    try:
        seed = seed.lower()
        # pick the second word from the bigram distribution of the seed word
        seed += ' ' + np.random.choice(distribution[0][seed].index, p=distribution[0][seed].values)
        for i in range(length):
            # sample the next word given the last two words
            next_words = distribution[1][' '.join(seed.split()[-2:])]
            seed += ' ' + np.random.choice(next_words.index, p=next_words.values)
        return seed
    
    except KeyError:
        # the seed word or the current two-word context was never seen in training
        print('Oops! Try another seed')
        return None

In [14]:
sentence_trigram = trigram_text_generation('Because',20,tri_dist)

In [15]:
sentence_trigram

'because all the text itself b parallel passages my reply please be aware of the whosoever wont s whether slaves were discouraged'

### Other areas to improve
* Model the distribution of sentence-starting words separately
* Add start and end tokens to generate separable sentences
* Keep punctuation as tokens
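
One possible way to add start and end tokens is sketched below. The marker strings `<s>` and `</s>` and the helper `add_sentence_markers` are assumptions for illustration, and the naive split on `.`, `!` and `?` would also break on decimals such as `1.4`, so treat it as a starting point only.

```python
import re

START, END = "<s>", "</s>"  # hypothetical marker tokens

def add_sentence_markers(article):
    """Split on ., ! or ? and wrap each non-empty sentence in START/END tokens."""
    tokens = []
    for sentence in re.split(r"[.!?]+", article.lower()):
        words = re.findall(r"\b\w+\b", sentence)
        if words:
            tokens.extend([START] + words + [END])
    return tokens

marked = add_sentence_markers("I go home. You go to work!")
print(marked)
```

With tokens like these in the training data, generation could be seeded with the start token and stopped as soon as the end token is sampled, instead of running for a fixed length.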

### Drawbacks:

* This type of model requires a lot of computation to train and a lot of space to store, especially for higher-order n-grams
* N-grams are a sparse representation of language - any word or word sequence not present in the training corpus is assigned zero probability
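
The sparsity point can be illustrated by comparing the bigrams actually observed with the number of bigrams the vocabulary allows. This is a sketch on a made-up token list; the variable names are hypothetical.

```python
from collections import Counter

# Toy token list (made up for illustration)
tokens = "i go home . you go home . i go to work".split()

vocab = set(tokens)
observed_bigrams = set(zip(tokens, tokens[1:]))
possible_bigrams = len(vocab) ** 2   # every ordered pair of vocabulary words

coverage = len(observed_bigrams) / possible_bigrams
print(f"{len(observed_bigrams)} of {possible_bigrams} possible bigrams observed ({coverage:.0%})")

# A pair that never occurs has count 0, hence probability 0 under this model
bigram_counts = Counter(zip(tokens, tokens[1:]))
print(bigram_counts[("you", "work")])
```

The table of possible n-grams grows as the vocabulary size raised to the power n, while the observed counts grow only with the corpus size, so most entries stay empty.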


---

# Better approach - Deep Language Models!
* Deep Language generation using LSTMs

In [16]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout
from tensorflow.keras.layers import LSTM
from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.utils import to_categorical

In [17]:
# grab the text
text = data['data']

In [18]:
# create a continuous list of words
all_data = ' '.join([' '.join(re.findall('(?u)\\b\\w\\w*\\b',article.lower())) for article in text]).split()

In [19]:
# work out the vocab and size of vocab
vocab_list = list(set(all_data))
n_vocab = len(vocab_list)

In [20]:
#translate words to numbers
word_to_num = {}
num_to_word = {}
for i, word in enumerate(vocab_list):
    num_to_word[i] = word
    word_to_num[vocab_list[i]] = i

In [21]:
# encode each word as its integer id
embedded_data = [word_to_num[word] for word in all_data]

In [22]:
# build (previous 10 words, next word) training pairs with a sliding window
X_data= []
y_data = []
seq_length=10
for i in range(len(embedded_data)-seq_length):
    X_data.append(embedded_data[i:i+seq_length])
    y_data.append(embedded_data[i+seq_length])

In [23]:
# reshape the X and y data
X = np.array(X_data).reshape(len(X_data), seq_length, 1)
y = to_categorical(y_data)

#normalise the X data
X = X / float(n_vocab)
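
To check the shapes produced by the windowing and reshaping steps, here is the same procedure on a tiny made-up sequence of word ids. `np.eye` stands in for `to_categorical` so the sketch only needs NumPy; the `toy_` names are hypothetical.

```python
import numpy as np

encoded = [0, 1, 2, 3, 4, 2, 1]   # made-up word ids
toy_vocab_size = 5
toy_seq_length = 3

X_toy, y_toy = [], []
for i in range(len(encoded) - toy_seq_length):
    X_toy.append(encoded[i:i + toy_seq_length])   # window of previous words
    y_toy.append(encoded[i + toy_seq_length])     # the word that follows the window

# (samples, time steps, features), scaled into [0, 1) like the real data
X_toy = np.array(X_toy).reshape(len(X_toy), toy_seq_length, 1) / float(toy_vocab_size)
# one-hot targets: one column per vocabulary word
y_toy = np.eye(toy_vocab_size)[y_toy]

print(X_toy.shape, y_toy.shape)
```

Note that the one-hot target matrix has one column per vocabulary word, so on the full newsgroups vocabulary it becomes very large.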

### Build the model

In [24]:
model = Sequential()
model.add(LSTM(128, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

#### Save our best version of the model

In [27]:
# Keras fills in {epoch} itself when saving, so this is a plain string template, not an f-string
filepath = "weights_{epoch:02d}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='min')
callbacks_list = [checkpoint]
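
`ModelCheckpoint` fills in the `{epoch}` placeholder itself with `str.format` at save time, so the path needs to be a plain string template; an f-string would try to evaluate `epoch` immediately and fail with a `NameError`. The sketch below mimics that formatting step with illustrative epoch numbers.

```python
# Plain string template, as the callback expects (not an f-string)
template = "weights_{epoch:02d}.hdf5"

# Roughly what happens at the end of each epoch; the epoch numbers are illustrative
for epoch in (1, 2, 10):
    print(template.format(epoch=epoch))
```

The `02d` format pads to two digits with a leading zero, which matches the `weights_01.hdf5` filename used when loading a saved model.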

In [None]:
# fit the model
#history = model.fit(X, y, epochs=1, batch_size=128, callbacks=callbacks_list)

In [None]:
# load a saved model
# filename = "weights_01.hdf5"
# model.load_weights(filename)
# model.compile(loss='categorical_crossentropy', optimizer='adam')

### For a given input string generate some new text

In [None]:
def prepare_input(seed_input):
    """prepare a string for the LSTM"""
    
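    # the vocabulary was built from lower-cased text, so lower-case the input too
    seed_input = seed_input.lower().split()
    try:
        # look up each word's integer id and shape the result as (1, n_words, 1)
        return np.expand_dims(np.array([word_to_num[x] for x in seed_input]).reshape(-1,1),axis=0)
    except KeyError:
        # a word was not in the training vocabulary
        return 'please try with different words'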

In [None]:
def generate_text(input_string):
    """generate some new text as a string"""
    
    seed = prepare_input(input_string)
    for _ in range(10):
        # predict the next word from the window of the last 10 words, scaled like the training data
        window = seed[:, -seq_length:, :] / float(n_vocab)
        next_word = np.argmax(model.predict(window)).reshape(1,-1,1)
        # append the predicted word id to the sequence
        seed = np.append(seed,next_word,axis=1)

    return ' '.join([num_to_word[x] for x in seed[0,:,0]])

----