# Language Model (Natural Language Proessing)
## Overview


*   Language Models are very important component in the field of Natural Language Processing.
*   Language models are are very useful for the AI powered NLP applications.


### *  Statistical Language Model


*   These models use traditional statistical methods like **n-gram**, **hidden markov model**, **rule based models** etc.


### *   Neural Language Model


*   Due to the advancement of neural networks the language models are very powerful and used extensively.





Here we will discuss the **n-gram** and **lstm** based language models.



In [0]:
# Mounting my G-Drive to this session
from google.colab import drive
drive.mount('/content/drive')

In [0]:
# Dataset directory
data_dir = '/content/drive/My Drive/dataset/'

## Statistical (tri-gram) language model
 * We will use nltk's reuters corpus for this excersise.
 * In case of trigram the first two tokens will be the key of the dictionary and the last token will be counted everytime and will be stored.
 * Once the whole corpus is read and stored, we will calculate the percentage of the last token for each first two tokens, which will give the prediction.

In [0]:
# Import dependencies
from nltk import trigrams
from collections import defaultdict
from nltk.corpus import reuters

In [0]:
# Download nltk datasets
import nltk
nltk.download('all')
# nltk.download('punkt')

## Model Creation and prediction
* Here we are trying to create a model based on python's nested dictionary structure where the key of the outer dictionary will be the two previous tokens (words) and the value will be another dictionary in which the key will be the third token (word) and its count.
* Once the above model (nested dictionary) is created, we will calculate the percentage count of each third tokens with respect to the previous two tokens.
It will help us to predict the output properly with a probability measure.

In [0]:
def create_model():
  # Create a model based on a dictionary
  model = defaultdict(lambda: defaultdict(lambda: 0))
  
  for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
      model[(w1, w2)][w3] += 1
  return model


def get_total_count(model):
  # get the percentage value
  total_count = dict()
  for w1_w2, w3 in model.items():
    key_count = sum(w3.values())
    for w3_ in w3.keys():
      model[w1_w2][w3_] /= key_count
      model[w1_w2][w3_] *= 100
  return model


def predict_next(w1, w2, model):
  next_words = tuple(model[w1, w2].items())
  next_words = sorted(next_words,key=lambda x:x[-1], reverse=True)
  next_words = list(map(lambda x:(x[0],round(x[-1], 2)), next_words))
  next_words = next_words[0:3]
  for word, perc in next_words:
    print(word, perc, '%')

In [0]:
model = create_model()
model = get_total_count(model)

In [0]:
predict_next('the', 'time', model)

of 38.18 %
being 18.18 %
, 4.55 %


## Observations
* As we can see that the model works preety fine like an auto completion tool which works similarly like a mobile device.
* But there are some limitations to this approach.
* ### Limitations


* Typically if we increase the 'n' the model will try to perform better, but eventually we will need more computation power and huge resource of memory (RAM) for the same.
*   Here in this approach we are building the model based on the probability of words co-occurring. It will give zero probability to all the words that are not present in the training corpus, which is not at all desirable.






## Neural Language Model (LSTM)

In [0]:
file = data_dir + 'train_dataset_modified.txt'
with open(file, 'r') as f:
  data = f.read()
print(data[0:100])

When did Beyonce start becoming popular? in the late 1990s What areas did Beyonce compete in when sh


In [0]:
len(data)

9673809

In [0]:
import numpy as np
import pandas as pd
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import LSTM, Dense, GRU, Embedding
from keras.callbacks import EarlyStopping, ModelCheckpoint

Using TensorFlow backend.


## Clean the data
* As the text has many unnecessary characters, punctuation, numbers, special characters etc. We need to clean them up for our model to predict only necessary things.

In [0]:
import re
def text_cleaner(text):
  newString = text.lower()
  newString = re.sub(r"'s\b","",newString)
  newString = re.sub("[^a-zA-Z]", " ", newString)
  newString = re.sub(r"won't", "will not", newString)
  newString = re.sub(r"can\'t", "can not", newString)

  # general
  newString = re.sub(r"n\'t", " not", newString)
  newString = re.sub(r"\'re", " are", newString)
  newString = re.sub(r"\'s", " is", newString)
  newString = re.sub(r"\'d", " would", newString)
  newString = re.sub(r"\'ll", " will", newString)
  newString = re.sub(r"\'t", " not", newString)
  newString = re.sub(r"\'ve", " have", newString)
  newString = re.sub(r"\'m", " am", newString)
  
  #remove words with numbers python: https://stackoverflow.com/a/18082370/4084039
  newString = re.sub("\S*\d\S*", "", newString).strip()
  
  #remove html tags
  newString = re.sub(r"http\S+", "", newString)
  
  long_words=[]
  for i in newString.split():
    if len(i)>=3:                  
      long_words.append(i)
  return (" ".join(long_words)).strip()


data_new = text_cleaner(data)

In [0]:
data_new[:100]


'when did beyonce start becoming popular the late what areas did beyonce compete when she was growing'

## Pre processing
As part of pre-processing the cleaned text we will do the following steps before fitting the model.
1. We will now create a list of character sequences of fixed length for the training of the language model.
2. Encoding each character by means of assigning each character an unique number.
3. Split the data into training set and validation set where X will be the first n-1 characters and Y will be the nth character.

In [0]:
def create_seq(text):
  length = 30
  sequences = list()
  for i in range(length, len(text)):
    seq = text[i-length:i+1]
    sequences.append(seq)
  print('Total Sequences: %d' % len(sequences))
  return sequences

sequences = create_seq(data_new)

Total Sequences: 8473433


In [0]:
sequences[0:3]

['when did beyonce start becoming',
 'hen did beyonce start becoming ',
 'en did beyonce start becoming p']

In [0]:
# Encode sequence
set_chars = sorted(list(set(data_new)))
mapping_dict = dict((c, i) for i, c in enumerate(set_chars))

def encode_sequence(seqs):
  encoded = list()
  for seq in seqs:
    encoded.append([mapping_dict[char] for char in seq])
  return encoded

sequences = encode_sequence(sequences)

In [0]:
from sklearn.model_selection import train_test_split

# vocabulary size
vocab_size = len(mapping_dict)
sequences = np.array(sequences[:100000])
# create X and y
X, y = sequences[:,:-1], sequences[:,-1]
# one hot encode y
y = to_categorical(y, num_classes=vocab_size)
# create train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=12)

print('Train shape:', X_tr.shape, 'Val shape:', X_val.shape)

Train shape: (80000, 30) Val shape: (20000, 30)


## Build the LSTM model
Now we are ready to fir our LSTM model.
* ***NOTE : as part of building lstm model i have used two layered lstm networks and due to less computation resources I used very few data for the same. As part your activity you can play with the model by increasing or decreasing the layers, by changing the dropout rates, by changing the number of neurons in each layer, to get the best performance.***

In [0]:
# define model
model = Sequential()
model.add(Embedding(vocab_size, 50, input_length=30, trainable=True))
#model.add(GRU(150, recurrent_dropout=0.1, dropout=0.1))
model.add(LSTM(128, activation='relu', recurrent_dropout=0.3, dropout=0.4, return_sequences=True))
model.add(LSTM(128, activation='relu', recurrent_dropout=0.2, dropout=0.4))
model.add(Dense(vocab_size, activation='softmax'))
print(model.summary())

# compile the model
model.compile(loss='categorical_crossentropy', metrics=['acc'], optimizer='adam')
# fit the model
model.fit(X_tr, y_tr, epochs=50, verbose=2, validation_data=(X_val, y_val))

W0814 12:21:11.492724 139823921956736 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:74: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

W0814 12:21:11.528764 139823921956736 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0814 12:21:11.534662 139823921956736 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

W0814 12:21:11.667651 139823921956736 deprecation_wrapper.py:119] From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:133: The name tf.placeholder_with_default is deprecated. Please use tf.compat.v1.placeholder_with_default instead.

W0814 12:21:11.678578 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 30, 50)            1350      
_________________________________________________________________
lstm_1 (LSTM)                (None, 30, 128)           91648     
_________________________________________________________________
lstm_2 (LSTM)                (None, 128)               131584    
_________________________________________________________________
dense_1 (Dense)              (None, 27)                3483      
Total params: 228,065
Trainable params: 228,065
Non-trainable params: 0
_________________________________________________________________
None
Train on 80000 samples, validate on 20000 samples
Epoch 1/50
 - 260s - loss: 2.5132 - acc: 0.2667 - val_loss: 2.0685 - val_acc: 0.3814
Epoch 2/50
 - 255s - loss: 1.9950 - acc: 0.4043 - val_loss: 1.8020 - val_acc: 0.4599
Epoch 3/50
 - 254s - loss: 1.8120 - acc: 0.

<keras.callbacks.History at 0x7f2af3440ba8>

In [0]:
# generate a sequence of characters with a language model
def generate_seq(model, mapping_dict, seq_length, in_text, pred_chars):
	for _ in range(pred_chars):
		# encode the characters
		encoded_seq = [mapping_dict[char] for char in in_text]
		
    # truncate sequences to a fixed length if it's more
		encoded_seq = pad_sequences([encoded_seq], maxlen=seq_length, truncating='pre')
		
    # predict character from the model
		yout = model.predict_classes(encoded_seq, verbose=0)
		
		out_char = ''
		for char, index in mapping_dict.items():
			if index == yout:
				out_char = char
				break
		# append to input
		in_text += char
	return in_text

In [0]:
generate_seq(model, mapping_dict, 30, 'mother', 15)

'mother what was the n'

In [0]:
import pickle
pick_file = data_dir + 'language_model1.pkl'
with open(pick_file, 'wb') as f:
  pickle.dump(model, f)