# Neural Language Model


This is a very basic implementation of a neural language model trained on the Berkeley Restaurant Project Dataset. The code is adapted from the following two tutorials:

* [Neural Machine Translation](https://github.com/kmsravindra/ML-AI-experiments/blob/master/AI/Neural%20Machine%20Translation/Neural%20machine%20translation%20-%20Encoder-Decoder%20seq2seq%20model.ipynb) 
* [Neural Language Model](https://ethen8181.github.io/machine-learning/keras/rnn_language_model_basic_keras.html) 

This model is designed as follows:
* **Input:** one-hot vectors. No projection layer is used to learn word embeddings.
    * Note: You can try improving this model with pre-trained word embeddings.

* **Recurrence:** We use a single LSTM layer. An LSTM is a type of recurrent unit that handles long sequences better than a regular RNN. [Learn More](https://colah.github.io/posts/2015-08-Understanding-LSTMs/).

* **Output:** Softmax layer with 1815 units (the size of the vocabulary in this set). 
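
To make the input shape concrete, here is a minimal sketch of one-hot encoding with a toy four-word vocabulary. The names `toy_vocab`, `toy_word2index`, and `one_hot_sequence` are illustrative assumptions and do not appear in the notebook; the real vocabulary has 1815 entries, so each row would have 1815 columns instead of 4.

```python
import numpy as np

# Hypothetical toy vocabulary; the notebook builds its own from the corpus.
toy_vocab = ["<s>", "thai", "food", "</s>"]
toy_word2index = {w: i for i, w in enumerate(toy_vocab)}

def one_hot_sequence(words, word2index):
    """Return a (len(words), vocab_size) array with a single 1 per row."""
    encoded = np.zeros((len(words), len(word2index)), dtype="float32")
    for position, word in enumerate(words):
        encoded[position, word2index[word]] = 1.0
    return encoded

example = one_hot_sequence(["<s>", "thai", "food"], toy_word2index)
print(example.shape)  # (3, 4)
```

The softmax output has the same width as one of these rows: one probability per vocabulary word.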

We do not model unknown words here, and we only test the model on text generation. You will notice that the output is not great; it is in fact worse than what a bigram language model produces. Possible improvements include using pre-trained word embeddings, training on more data, and experimenting with different network architectures. Decoding could also be improved by resetting the LSTM state before each sentence and sampling the next word at random from the softmax distribution rather than always taking the most likely word (here we use greedy decoding and simply carry the LSTM state over from one sentence to the next as a way to vary the output).
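
As a sketch of the random-sampling idea mentioned above, the function below draws a word index from a probability vector instead of taking the argmax. The name `sample_index` and the `temperature` parameter are assumptions introduced here for illustration; they are not part of the notebook.

```python
import numpy as np

def sample_index(probs, temperature=1.0, rng=None):
    """Draw an index from a probability vector instead of taking the argmax."""
    if rng is None:
        rng = np.random.default_rng()
    probs = np.asarray(probs, dtype="float64")
    # Rescale in log space: temperature < 1 sharpens the distribution,
    # temperature > 1 flattens it.
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# Example: index 2 is most likely, but indices 0 and 1 can still be drawn.
print(sample_index([0.1, 0.2, 0.7]))
```

In the decoding function, this could stand in for `np.argmax(decoder_out[0,:])`, which would let each call produce a different sentence even when the LSTM state is reset.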

Here is [another way](https://www.kdnuggets.com/2020/07/pytorch-lstm-text-generation-tutorial.html) to implement neural text generation, using PyTorch.



---

# Running the code

You can run the code blocks sequentially. Using a GPU runtime is recommended for faster training (Runtime --> Change runtime type --> GPU).


In [None]:
# download the dataset
! git clone https://github.com/wooters/berp-trans.git


In [None]:
# process the data
import re

# read the transcript file
filename = 'berp-trans/transcript.txt'
with open(filename, 'rt') as file:
  text = file.read()

text = text.split('\n')
for i in range(len(text)):
  # remove the first whitespace-delimited token on each line
  text[i] = re.sub(r'^\S+\s+', '', text[i])
  # remove <...> and [...] tags that are followed by whitespace
  text[i] = re.sub(r'[<\[][^>\]]+[>\]]\s+', '', text[i])

import nltk

#create FreqDist object
unigram_dist = nltk.FreqDist()
for sen in text:
  sen = "<s> "+ sen +" </s>"
  unigram_dist.update(sen.split())
      
vocab = unigram_dist.keys()
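# flatten the corpus into one token stream, keeping the sentence boundary
# markers so the model can learn where sentences start and end
trainset = " ".join("<s> " + sen + " </s>" for sen in text if sen.strip()).split()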
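
To see what the two regular expressions in the cleaning loop do, here is a small sketch that applies them to two made-up lines. The sample lines and the helper name `clean_line` are hypothetical and only illustrate the patterns; the actual transcript format may differ.

```python
import re

def clean_line(line):
    """Apply the same two substitutions as the preprocessing loop."""
    # drop the first whitespace-delimited token
    line = re.sub(r'^\S+\s+', '', line)
    # drop <...> and [...] tags that are followed by whitespace
    line = re.sub(r'[<\[][^>\]]+[>\]]\s+', '', line)
    return line

# Hypothetical lines, invented for illustration.
first = clean_line("utt_0001 i'd like [uh] some <noise> thai food")
second = clean_line("utt_0002 thai food [laughter]")
print(first)   # i'd like some thai food
print(second)  # thai food [laughter]
```

Note that the second pattern requires trailing whitespace, so a tag at the very end of a line is left in place and ends up in the vocabulary.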

In [None]:

import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense

V_size = len(vocab) #length of vocab


word2index = {}
index2word = [] 

for word in vocab:
    index2word.append(word)
    word2index[word] = len(word2index)

num_classes = len(word2index)


lstm_size = 256  # number of LSTM units
maxlen = 10      # context length: number of words used to predict the next word

# one-hot word sequences of any length
encoder_input = Input(shape=(None, V_size))

# return_state=True also returns the final hidden and cell states,
# which the inference model reuses during generation
encoder_LSTM = LSTM(lstm_size, return_state=True)
encoder_outputs, encoder_h, encoder_c = encoder_LSTM(encoder_input)
encoder_states = [encoder_h, encoder_c]

# softmax over the vocabulary predicts the next word
encoder_dense = Dense(V_size, activation='softmax')
encoder_out = encoder_dense(encoder_outputs)
model = Model(inputs=[encoder_input], outputs=[encoder_out])

model.compile(loss='categorical_crossentropy', optimizer='adam')

model.summary()


# create semi-overlapping sequences of words with
# a fixed length specified by the maxlen parameter

step = 3
starts = range(0, len(trainset) - maxlen, step)
N = len(starts)  # number of training windows

X = np.zeros(shape=(N, maxlen, V_size), dtype='float32')
Y = np.zeros(shape=(N, V_size), dtype='float32')
for ii, i in enumerate(starts):
    sent = trainset[i:i + maxlen]
    next_word = trainset[i + maxlen]
    for k, w in enumerate(sent):
        X[ii, k, word2index[w]] = 1
    Y[ii, word2index[next_word]] = 1


print('sequence dimension: ', X.shape)
print('target dimension: ', Y.shape)
print('example sequence:\n', X[0])
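
The windowing above can be hard to picture from one-hot arrays, so here is a minimal sketch of the same sliding-window logic on plain word lists. The helper `make_windows` and the toy sentence are illustrative assumptions, with a shorter `maxlen` and `step` than the notebook uses.

```python
def make_windows(tokens, maxlen, step):
    """Return (context, next_word) pairs using a sliding window."""
    pairs = []
    for start in range(0, len(tokens) - maxlen, step):
        context = tokens[start:start + maxlen]
        target = tokens[start + maxlen]
        pairs.append((context, target))
    return pairs

toy_tokens = "i would like some cheap thai food please".split()
toy_pairs = make_windows(toy_tokens, maxlen=3, step=2)
for context, target in toy_pairs:
    print(context, "->", target)
# ['i', 'would', 'like'] -> some
# ['like', 'some', 'cheap'] -> thai
# ['cheap', 'thai', 'food'] -> please
```

Each context becomes one row of `X` and each target one row of `Y` once the words are one-hot encoded.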





In [None]:
#Train the model
model.fit(X, Y, batch_size = 64, 
                            epochs = 50, 
                            validation_split = 0.2)
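
One way to put a number on model quality is perplexity. Since the categorical cross-entropy loss is an average negative log-likelihood in nats, its exponential can be read as a per-word perplexity. The helper below is a sketch under that assumption; the name `perplexity` is introduced here and is not part of the notebook.

```python
import math

def perplexity(cross_entropy_nats):
    """Convert an average per-word cross-entropy (in nats) to perplexity."""
    return math.exp(cross_entropy_nats)

# Sanity check: a uniform guess over 1815 words has cross-entropy ln(1815),
# which corresponds to a perplexity of 1815.
uniform_ppl = perplexity(math.log(1815))
print(round(uniform_ppl))  # 1815
```

If the return value of `model.fit` is stored, for example as `history`, then `perplexity(history.history['val_loss'][-1])` would give the validation perplexity after the last epoch.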






In [None]:
# Model that returns the LSTM states after reading the start token
initial_model = Model(encoder_input, encoder_states)
target_seq = np.zeros((1, 1, V_size))

# Initial token
target_seq[0, 0, word2index['<s>']] = 1

states_val = initial_model.predict(target_seq)

# Inference model: takes one word plus the previous LSTM states and
# returns the next-word distribution plus the updated states
encoder_state_input_h = Input(shape=(lstm_size,))
encoder_state_input_c = Input(shape=(lstm_size,))
encoder_input_states = [encoder_state_input_h, encoder_state_input_c]

encoder_out, encoder_h, encoder_c = encoder_LSTM(encoder_input,
                                                 initial_state=encoder_input_states)

encoder_states = [encoder_h, encoder_c]

encoder_out = encoder_dense(encoder_out)

encoder_model_inf = Model(inputs=[encoder_input] + encoder_input_states,
                          outputs=[encoder_out] + encoder_states)
    



In [None]:
# a function for decoding one sentence
def decode_seq(states_val):

    target_seq = np.zeros((1, 1, V_size))
    target_seq[0, 0, word2index['<s>']] = 1

    generated_sent = ''
    stop_condition = False
    count = 0  # number of words generated so far

    while not stop_condition:

        decoder_out, decoder_h, decoder_c = encoder_model_inf.predict(x=[target_seq] + states_val)
        max_val_index = np.argmax(decoder_out[0, :])  # greedy decoding
        sampled_word = index2word[max_val_index]
        generated_sent += " " + sampled_word
        count = count + 1
        # stop if </s> is reached or 50 words have been generated
        if sampled_word == '</s>' or count >= 50:
            stop_condition = True

        # feed the chosen word back in as the next input
        target_seq = np.zeros((1, 1, V_size))
        target_seq[0, 0, max_val_index] = 1
        states_val = [decoder_h, decoder_c]

    return generated_sent, states_val

# generate 10 sentences, carrying the LSTM state from one to the next
for i in range(10):
  sent, states_val = decode_seq(states_val)
  print(sent)

