<a href="https://colab.research.google.com/github/leeming99/next_word_predictor/blob/master/Next_word_predictor_with_USE.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Fetch and preprocess corpus: lowercase and remove punctuation


In [0]:
from keras.utils.data_utils import get_file
import string
print('\nFetching the text...')
url = 'https://raw.githubusercontent.com/maxim5/stanford-tensorflow-tutorials/master/data/arxiv_abstracts.txt'
path = get_file('arxiv_abstracts.txt', origin=url)

print('\nPreparing the sentences...')
max_sentence_len = 40
with open(path) as file_:
  docs = file_.readlines()
translator = str.maketrans('', '', string.punctuation)
# for doc in docs:
#   print(doc.lower().translate(translator))
sentences = [doc.lower().translate(translator) for doc in docs]
print('First sentence: ', sentences[0])
print('Num sentences:', len(sentences))


Fetching the text...

Preparing the sentences...
First sentence:  in science and engineering intelligent processing of complex signals such as images sound or language is often performed by a parameterized hierarchy of nonlinear processing layers sometimes biologically inspired hierarchical systems or more generally nested systems offer a way to generate complex mappings using simple stages each layer performs a different operation and achieves an ever more sophisticated representation of the input as for example in an deep artificial neural network an object recognition cascade in computer vision or a speech frontend processing joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem difficult to parallelize for execution in a distributed computation environment and requiring significant human expert effort which leads to suboptimal systems in practice we describe a ge
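
The preprocessing above lowercases each line and deletes every character listed in `string.punctuation`. As a minimal sketch on a made-up sentence (the sample text and the `normalize` helper are illustrative, not part of the notebook), note that hyphenated words are fused rather than split, which is why tokens such as "nonconvex" appear in the output above:

```python
import string

# Translation table that deletes every ASCII punctuation character.
translator = str.maketrans('', '', string.punctuation)

def normalize(line):
    """Lowercase a line and strip ASCII punctuation (hypothetical helper)."""
    return line.lower().translate(translator)

sample = "Non-convex optimization, e.g. SGD, is hard!"  # made-up input
print(normalize(sample))
```

Only ASCII punctuation is removed this way; typographic quotes or dashes, if present in a corpus, would pass through unchanged.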

### Build the vocabulary array and word-to-index dictionary

In [0]:
vocab_arr = list(set(' '.join(sentences).replace('\n','').split(' ')))
vocab_index_dict = {}
for i, vocab in enumerate(vocab_arr):
  vocab_index_dict[vocab] = i
print(len(vocab_arr))

2694
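
The cell above derives the vocabulary from a `set`, so the index assigned to each word can change from one Python session to the next. A possible variation, shown here on toy data and not what the notebook does, is to sort the tokens so the mapping is reproducible:

```python
# Toy stand-in for the preprocessed corpus (assumption for illustration).
sentences = ["deep neural networks\n", "deep learning works\n"]

# Collect tokens, dropping the trailing newline of each line.
tokens = ' '.join(s.replace('\n', '') for s in sentences).split(' ')

# Sorting the unique tokens makes the index assignment deterministic.
vocab_arr = sorted(set(tokens))
vocab_index_dict = {word: i for i, word in enumerate(vocab_arr)}

print(vocab_arr)
print(vocab_index_dict)
```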


### Split into X and Y

In [0]:
import numpy
import tensorflow.keras.backend as K
X=[]
y=[]
for sent in sentences:
  words = sent.replace('\n','').split(' ')
  # Input: every word except the last; target: the last word.
  X.append(' '.join(words[:-1]))
  # One-hot encode the target word over the vocabulary.
  tmpy = [0] * len(vocab_arr)
  tmpy[vocab_index_dict[words[-1]]] = 1
  y.append(tmpy)
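
The same split can be sketched with NumPy on toy data: each sentence contributes its leading words as the input string and its final word as a one-hot target row. The sentences and the `targets` list below are assumptions made for illustration only:

```python
import numpy as np

sentences = ["deep neural networks", "deep learning works"]  # toy data
vocab_arr = sorted({w for s in sentences for w in s.split(' ')})
vocab_index_dict = {w: i for i, w in enumerate(vocab_arr)}

X, targets = [], []
for sent in sentences:
    words = sent.split(' ')
    X.append(' '.join(words[:-1]))               # context: all but the last word
    targets.append(vocab_index_dict[words[-1]])  # label: index of the last word

# One row per sentence, with a single 1 in the column of the target word.
y = np.zeros((len(sentences), len(vocab_arr)), dtype=np.float32)
y[np.arange(len(sentences)), targets] = 1.0

print(X)
print(y)
```

Allocating the array once avoids building a Python list of length `len(vocab_arr)` for every sentence.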


### Split into train and test sets

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
# y_train = K.constant(numpy.array(y_train))
# y_test = K.constant(numpy.array(y_test))
y_test = numpy.array(y_test)
y_train = numpy.array(y_train)

### Vectorize sentences using Universal Sentence Encoder

In [0]:
from absl import logging

import tensorflow as tf

import tensorflow_hub as hub
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import re
import seaborn as sns

module_url = "https://tfhub.dev/google/universal-sentence-encoder/4" 
use_model = hub.load(module_url)
print("module %s loaded" % module_url)
def embed(texts):
  # Returns one 512-dimensional embedding per input string.
  return use_model(texts)

# print(model.variables)

module https://tfhub.dev/google/universal-sentence-encoder/4 loaded


In [0]:
# message_embeddings = embed(sentences)

# for i, message_embedding in enumerate(np.array(message_embeddings).tolist()[:5]):
#   print("Message: {}".format(sentences[i]))
#   print("Embedding size: {}".format(len(message_embedding)))
#   message_embedding_snippet = ", ".join(
#       (str(x) for x in message_embedding[:3]))
#   print("Embedding: [{}, ...]\n".format(message_embedding_snippet))

X_train = embed(X_train)
X_test = embed(X_test)
X_train = X_train.numpy()
X_test = X_test.numpy()

In [0]:
print(X_train.shape, X_test.shape, y_test.shape, y_train.shape)

(5400, 512) (1800, 512) (1800, 2694) (5400, 2694)
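
The notebook embeds all training sentences in a single call. If that proves too memory-hungry, one option is to embed in slices and stack the results. The sketch below assumes a hypothetical helper `embed_in_batches` and uses a stand-in `fake_embed` function in place of the real encoder so that it runs without TensorFlow Hub:

```python
import numpy as np

def embed_in_batches(embed_fn, texts, batch_size=256):
    """Apply embed_fn to slices of texts and stack the results row-wise."""
    chunks = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        chunks.append(np.asarray(embed_fn(batch)))
    return np.concatenate(chunks, axis=0)

# Stand-in encoder for illustration only: maps each text to a 4-dim vector.
def fake_embed(batch):
    return np.array([[len(t), 0.0, 0.0, 0.0] for t in batch], dtype=np.float32)

vectors = embed_in_batches(fake_embed, ["a", "bb", "ccc"], batch_size=2)
print(vectors.shape)
```

With the real encoder, `embed_fn` would be the notebook's `embed` function and each row would have 512 columns.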


### Initialize Model

In [0]:
from keras.callbacks import LambdaCallback
from keras.layers import Dense
from keras.models import Sequential

embedding_size = 512  # dimensionality of the sentence embedding
vocab_size = len(vocab_arr)

# Feed-forward classifier: sentence embedding -> hidden layer -> softmax over the vocabulary
model = Sequential()
model.add(Dense(512, input_shape=[embedding_size], activation='relu'))
model.add(Dense(units=vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])
model.summary()

Model: "sequential_28"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_23 (Dense)             (None, 512)               262656    
_________________________________________________________________
dense_24 (Dense)             (None, 2694)              1382022   
Total params: 1,644,678
Trainable params: 1,644,678
Non-trainable params: 0
_________________________________________________________________
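
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A short sketch of that arithmetic (the `dense_params` helper is illustrative):

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

embedding_size, hidden_units, vocab_size = 512, 512, 2694

hidden = dense_params(embedding_size, hidden_units)  # first Dense layer
output = dense_params(hidden_units, vocab_size)      # softmax layer

print(hidden, output, hidden + output)
```

Most of the parameters sit in the output layer, since its size grows linearly with the vocabulary.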


### Train model

In [0]:
# train model
from keras.callbacks import LambdaCallback
import numpy as np
def generate_next(text, num_generated=1):
  # Greedy decoding: append the most probable next word num_generated times.
  for i in range(num_generated):
    prediction = model.predict(x=embed([text]).numpy())
    idx = np.argmax(prediction[-1])
    text += ' ' + vocab_arr[idx]
  return text

def on_epoch_end(epoch, _):
  print('\nGenerating text after epoch: %d' % epoch)
  texts = [
    'deep convolutional',
    'simple and effective',
    'a nonconvex',
    'a',
  ]
  for text in texts:
    sample = generate_next(text)
    print('%s... -> %s' % (text, sample))

# print(y_train, X_train)
model.fit(X_train, y_train,
          batch_size=512,
          shuffle=True,
          epochs=20,
          validation_data=(X_test, y_test),
          callbacks=[LambdaCallback(on_epoch_end=on_epoch_end)])

Train on 5400 samples, validate on 1800 samples
Epoch 1/20

Generating text after epoch: 0
deep convolutional... -> deep convolutional performance
simple and effective... -> simple and effective experiments
a nonconvex... -> a nonconvex networks
a... -> a improvement
Epoch 2/20

Generating text after epoch: 1
deep convolutional... -> deep convolutional performance
simple and effective... -> simple and effective experiments
a nonconvex... -> a nonconvex networks
a... -> a improvement
Epoch 3/20

Generating text after epoch: 2
deep convolutional... -> deep convolutional networks
simple and effective... -> simple and effective experiments
a nonconvex... -> a nonconvex networks
a... -> a improvement
Epoch 4/20

Generating text after epoch: 3
deep convolutional... -> deep convolutional networks
simple and effective... -> simple and effective experiments
a nonconvex... -> a nonconvex networks
a... -> a improvement
Epoch 5/20

Generating text after epoch: 4
deep convolutional... -> deep convo

<keras.callbacks.callbacks.History at 0x7fb32ddcfa20>
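
`generate_next` keeps only the single most probable word. To inspect the runners-up as well, the softmax output can be ranked. The sketch below uses a hypothetical helper `top_k_words` and a made-up probability vector rather than real model output:

```python
import numpy as np

def top_k_words(probs, vocab, k=3):
    """Return the k most probable (word, probability) pairs, best first."""
    probs = np.asarray(probs)
    top = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(vocab[i], float(probs[i])) for i in top]

# Made-up vocabulary and probabilities for illustration.
vocab = ["networks", "performance", "experiments", "improvement"]
probs = [0.5, 0.3, 0.15, 0.05]

print(top_k_words(probs, vocab, k=2))
```

In the notebook, `probs` would be `model.predict(...)[-1]` and `vocab` would be `vocab_arr`.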

### Mount GDrive

In [0]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


### Save Model

In [0]:
path = "/content/gdrive/My Drive/pencil/next_word_predictor"
model.save(path)

### Save vocab_arr
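Remember to save the vocabulary array along with the model; it is needed to interpret the model's output. Because `vocab_arr` is built from a `set`, its order can differ between runs, so the saved array is the only reliable index-to-word mapping.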

In [0]:
vocab_arr = numpy.array(vocab_arr)
numpy.save('/content/gdrive/My Drive/pencil/vocab_arr.npy', vocab_arr)

### Load Model

In [0]:
from tensorflow import keras
path = "/content/gdrive/My Drive/pencil/next_word_predictor"
model = keras.models.load_model(path)


### Load vocab_arr

In [0]:
vocab_arr = numpy.load('/content/gdrive/My Drive/pencil/vocab_arr.npy') 
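
A quick way to confirm that the vocabulary survives the save and load steps unchanged is a round trip through a temporary file. This sketch uses a toy array and a temporary directory instead of the Drive path above:

```python
import os
import tempfile

import numpy as np

vocab_arr = np.array(["deep", "networks", "learning"])  # toy vocabulary

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vocab_arr.npy")
    np.save(path, vocab_arr)   # stored as a fixed-width unicode array
    restored = np.load(path)   # no pickling needed for string arrays

# The order must match exactly, since indices map to output units.
print(restored.tolist())
```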

### Use Model

In [0]:
vocab_arr[np.argmax(model.predict(embed(['test']).numpy())[-1])]

'datasets'

In [0]:
vocab_arr[np.argmax(model.predict(embed(['fly']).numpy())[-1])]

'improvement'

In [0]:
vocab_arr[np.argmax(model.predict(embed(['datasets']).numpy())[-1])]

'work'