<a href="https://colab.research.google.com/github/nikp29/MachineLearningClass20182019/blob/master/Spring/190204WordVectors.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## GPU Setup

In [1]:
import tensorflow as tf
import numpy
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


In [2]:
import tensorflow as tf
import timeit

# See https://www.tensorflow.org/tutorials/using_gpu#allowing_gpu_memory_growth
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.device('/cpu:0'):
  random_image_cpu = tf.random_normal((100, 100, 100, 3))
  net_cpu = tf.layers.conv2d(random_image_cpu, 32, 7)
  net_cpu = tf.reduce_sum(net_cpu)

with tf.device('/gpu:0'):
  random_image_gpu = tf.random_normal((100, 100, 100, 3))
  net_gpu = tf.layers.conv2d(random_image_gpu, 32, 7)
  net_gpu = tf.reduce_sum(net_gpu)

sess = tf.Session(config=config)

# Test execution once to detect errors early.
try:
  sess.run(tf.global_variables_initializer())
except tf.errors.InvalidArgumentError:
  print(
      '\n\nThis error most likely means that this notebook is not '
      'configured to use a GPU.  Change this in Notebook Settings via the '
      'command palette (cmd/ctrl-shift-P) or the Edit menu.\n\n')
  raise

def cpu():
  sess.run(net_cpu)
  
def gpu():
  sess.run(net_gpu)
  
# Runs the op several times.
print('Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images '
      '(batch x height x width x channel). Sum of ten runs.')
print('CPU (s):')
cpu_time = timeit.timeit('cpu()', number=10, setup="from __main__ import cpu")
print(cpu_time)
print('GPU (s):')
gpu_time = timeit.timeit('gpu()', number=10, setup="from __main__ import gpu")
print(gpu_time)
print('GPU speedup over CPU: {}x'.format(int(cpu_time/gpu_time)))

sess.close()

Instructions for updating:
Use keras.layers.conv2d instead.
Instructions for updating:
Colocations handled automatically by placer.
Time (s) to convolve 32x7x7x3 filter over random 100x100x100x3 images (batch x height x width x channel). Sum of ten runs.
CPU (s):
9.634823247
GPU (s):
1.8757040849999953
GPU speedup over CPU: 5x


## Installation

1. If you haven't already installed Python and Jupyter Notebook:   
    1. Get Python3 from [Python.org](https://www.python.org/downloads/). **Tensorflow does not yet work with Python 3.7, so you _must_ get Python 3.6.** See https://github.com/tensorflow/tensorflow/issues/20517 for updates on 3.7 support.
    1. In Terminal, run `python3 -m pip install jupyter`
    1. In Terminal, cd to the folder in which you downloaded this file and run `jupyter notebook`. This should open up a page in your web browser that shows all of the files in the current directory, so that you can open this file. You will need to leave this Terminal window up and running and use a different one for the rest of the instructions.
1. Install the Gensim word2vec Python implementation: `pip3 install --upgrade gensim`
1. Get the trained model (1billion_word_vectors.zip) from me via AirDrop or flash drive and put it in the same folder as the ipynb file, which is the folder where you run the `jupyter notebook` command.
1. Unzip the trained model file. You should now have three files in the folder (if zip created a new folder, move these files out of that separate folder into the same folder as the ipynb file):
    * 1billion_word_vectors
    * 1billion_word_vectors.syn1neg.npy
    * 1billion_word_vectors.wv.syn0.npy
1. If you didn't install Keras last time, install it now:
    1. Install the tensorflow machine learning library by typing the following into Terminal:
    `pip3 install --upgrade tensorflow`
    1. Install the keras machine learning library by typing the following into Terminal:
    `pip3 install keras`


## Documentation/Sources
* [https://radimrehurek.com/gensim/models/word2vec.html](https://radimrehurek.com/gensim/models/word2vec.html) for more information about how to use gensim word2vec in general
* [https://codekansas.github.io/blog/2016/gensim.html](https://codekansas.github.io/blog/2016/gensim.html) for information about using it to create embedding layers for neural networks.
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) for using pre-trained embeddings with keras (though the syntax they use for the model layers is different than most other tutorials I've seen).
* [https://keras.io/](https://keras.io/) Keras API documentation

## Load the trained word vectors

In [0]:
from gensim.models import word2vec
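
A minimal sketch of loading the model from the installation steps, assuming the file was written with gensim's `Word2Vec.save` (the `.npy` file names suggest this, but the notebook does not confirm it). The helper name `load_word_vectors` is hypothetical.

```python
def load_word_vectors(model_path="1billion_word_vectors"):
    """Load a saved gensim Word2Vec model and return its word vectors.

    Assumes the two .npy files sit next to model_path; gensim reads
    them automatically when the main file is loaded.
    """
    # Imported here so that defining the function does not require gensim.
    from gensim.models import Word2Vec

    model = Word2Vec.load(model_path)
    return model.wv


# Usage, assuming the unzipped files are in the working directory:
# wordvec = load_word_vectors()
# print(wordvec["movie"][:5])  # first five components of one vector
```

The call is left commented out because it only works once the three model files are in place.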

Now we have a way to turn words into word vectors with Keras layers. Yes! Time to get some data.

# Exercise: Use the word vectors in a full model
Using what you know about how the IMDB dataset and the Keras embedding layer represent words, as detailed above, define a model that uses word vectors pre-trained on the IMDB dataset rather than an embedding that Keras learns as it goes along. You'll need to swap out the embedding layer and feed in different training data.

For any model that you try, take notes about the performance you see or anything you notice about the differences between each of them.
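
To see why the training data has to change along with the embedding layer, it can help to think of a frozen embedding layer as a plain lookup table. The sketch below uses made-up 3-dimensional vectors and hypothetical names (`toy_vectors`, `embed`); it is an illustration in pure Python, not the Keras implementation.

```python
# Hypothetical 3-dimensional vectors; real word2vec vectors are much longer.
toy_vectors = {
    "PADDING": [0.0, 0.0, 0.0],
    "good": [0.9, 0.1, 0.3],
    "bad": [-0.8, 0.2, 0.1],
}

# Fix an order for the vocabulary: a word's position is its row index.
index_to_word = sorted(toy_vectors)
word_to_index = {word: i for i, word in enumerate(index_to_word)}

# The "weights" of a frozen embedding layer: row i holds the vector of word i.
embedding_matrix = [toy_vectors[word] for word in index_to_word]


def embed(indices):
    """Look up one vector per index, as a non-trainable embedding layer would."""
    return [embedding_matrix[i] for i in indices]


review = [word_to_index[w] for w in ["PADDING", "good", "bad"]]
print(review)
print(embed(review))
```

The indices in the input only make sense relative to the row order of the matrix, so reviews encoded with one vocabulary's indices must be re-encoded before they are fed to a layer built from another vocabulary.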

## Process the dataset
For this exercise, we're going to keep all inputs the same length (we'll see how to do variable-length later). This means we need to choose a maximum length for the review, cutting off longer ones and adding padding to shorter ones. What should we make the length? Let's understand our data.
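
As a rough illustration of "cutting off longer ones and adding padding to shorter ones", here is a pure-Python sketch. It mirrors the default behavior of Keras `pad_sequences`, which pads and truncates at the front of each sequence; the function name `pad_or_truncate` and the toy reviews are assumptions made for this example.

```python
def pad_or_truncate(review, maxlen, pad_value=0):
    """Force one review (a list of word indices) to exactly maxlen entries."""
    if len(review) >= maxlen:
        # Too long: keep only the last maxlen indices (cut from the front).
        return review[len(review) - maxlen:]
    # Too short: add padding at the front.
    return [pad_value] * (maxlen - len(review)) + review


toy_reviews = [[5, 8, 2], [7, 1, 9, 4, 6, 3]]
fixed = [pad_or_truncate(r, maxlen=4) for r in toy_reviews]
print(fixed)  # [[0, 5, 8, 2], [9, 4, 6, 3]]
```

Every review comes out with the same length, which is what lets the reviews be stacked into one rectangular array for the model.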

In [4]:
from keras.datasets import imdb
(x_train, y_train), (x_test, y_test) = imdb.load_data()

Using TensorFlow backend.


Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz


In [0]:
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Dense, Flatten, MaxPooling1D, Dropout

In [6]:
imdb_offset = 3
imdb_map = dict((index + imdb_offset, word) for (word, index) in imdb.get_word_index().items())
imdb_map[0] = 'PADDING'
imdb_map[1] = 'START'
imdb_map[2] = 'UNKNOWN'

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
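
The offset of 3 exists because, with the default `imdb.load_data()` settings, indices 0, 1, and 2 are reserved (padding, start-of-review, unknown word) and every real word is stored as its rank plus 3. A small sketch with a made-up word index (`toy_word_index` is hypothetical, standing in for `imdb.get_word_index()`):

```python
# Hypothetical miniature of imdb.get_word_index(): word -> frequency rank.
toy_word_index = {"the": 1, "movie": 2, "great": 3}

offset = 3  # reviews store rank + 3; indices 0, 1, 2 are reserved
index_to_word = {rank + offset: word for word, rank in toy_word_index.items()}
index_to_word.update({0: "PADDING", 1: "START", 2: "UNKNOWN"})

encoded_review = [1, 4, 5, 6]
decoded = [index_to_word[i] for i in encoded_review]
print(" ".join(decoded))  # START the movie great
```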


In [0]:
train_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_train]
test_sentences = [['PADDING'] + [imdb_map[word_index] for word_index in review] for review in x_test]

In [0]:
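# gensim does not run on TensorFlow, so a tf.device block has no effect here.
# Each sentence must be a list of tokens, so 'UNKNOWN' is wrapped in its own
# list; a bare string would be split into single characters.
imdb_wv_model = word2vec.Word2Vec(train_sentences + test_sentences + [['UNKNOWN']], min_count=1)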

In [0]:
imdb_wordvec = imdb_wv_model.wv

In [10]:
imdb_wordvec.vocab['PADDING'].index

28

In [11]:
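# Map each IMDB index to the index of the same word in the word2vec vocabulary.
map_to_wordvec = {}  # {str(IMDB index): word2vec index}
for map_index, word in imdb_map.items():
  if word == "'l'":
    # Special case: this token is not looked up in the vocabulary. Print its
    # IMDB index and use a hard-coded word2vec index instead.
    print(map_index)
    wordvec_index = 1841
  else:
    wordvec_index = imdb_wordvec.vocab[word].index
  map_to_wordvec[str(map_index)] = wordvec_index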

88587
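
The remapping idea can be sketched on toy data. Note that this is a variation rather than the code above: it keeps integer keys and sends any word missing from the word2vec vocabulary to the `UNKNOWN` entry instead of a hard-coded index. All names and values here (`toy_imdb_map`, `toy_wordvec_index`) are hypothetical.

```python
# Hypothetical stand-ins for imdb_map and the word2vec vocabulary indices.
toy_imdb_map = {0: "PADDING", 1: "START", 2: "UNKNOWN", 4: "the", 5: "movie", 6: "'l'"}
toy_wordvec_index = {"the": 0, "PADDING": 1, "START": 2, "movie": 3, "UNKNOWN": 4}

unknown_row = toy_wordvec_index["UNKNOWN"]

# Words missing from the word2vec vocabulary fall back to the UNKNOWN row.
toy_map_to_wordvec = {
    imdb_index: toy_wordvec_index.get(word, unknown_row)
    for imdb_index, word in toy_imdb_map.items()
}

review = [1, 4, 5, 6]  # START the movie 'l'
print([toy_map_to_wordvec[i] for i in review])  # [2, 0, 3, 4]
```

A fallback like this avoids special-casing individual tokens, at the cost of treating every missing word as the same unknown word.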


In [20]:
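# x_train and x_test are NumPy object arrays, so `x_train + x_test` adds them
# elementwise, joining review i of train to review i of test (that pairwise
# sum is what the recorded output reflects). Converting to lists gives one
# length per review across both sets.
lengths = [len(review) for review in list(x_train) + list(x_test)]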
print('Longest review: {} Shortest review: {}'.format(max(lengths), min(lengths)))


Longest review: 2697 Shortest review: 70


2697 words! Wow. Well, let's see how many reviews would get cut off at a particular cutoff.

In [40]:
cutoff = 1000
print('{} reviews out of {} are over {}.'.format(
    sum([1 for length in lengths if length > cutoff]), 
    len(lengths), 
    cutoff))

1097 reviews out of 25000 are over 1000.
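
To compare several candidate cutoffs at once, the same count can be wrapped in a small helper. This is a sketch on made-up lengths; `share_truncated` and `toy_lengths` are hypothetical names, and with the real data you would pass in `lengths` instead.

```python
def share_truncated(lengths, cutoff):
    """Fraction of reviews longer than cutoff, i.e. reviews that would be cut."""
    return sum(1 for length in lengths if length > cutoff) / len(lengths)


toy_lengths = [70, 120, 250, 400, 980, 1500, 2697]  # hypothetical review lengths
for candidate in (100, 500, 1000):
    print(candidate, round(share_truncated(toy_lengths, candidate), 2))
```

A larger cutoff truncates fewer reviews but makes every padded input longer, so training gets slower; the fraction above is one way to weigh that trade-off.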


In [0]:
x_train_mapped = [[map_to_wordvec[str(word_index)] for word_index in review] for review in x_train]
x_test_mapped = [[map_to_wordvec[str(word_index)] for word_index in review] for review in x_test]

In [0]:
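from keras.preprocessing import sequence
# Pad the remapped reviews (word2vec indices), not the raw IMDB indices, so
# that every index selects the matching row of the embedding layer.
padding_index = imdb_wordvec.vocab['PADDING'].index  # 28 in the cell above
x_train_padded = sequence.pad_sequences(x_train_mapped, maxlen=cutoff, value=padding_index)
x_test_padded = sequence.pad_sequences(x_test_mapped, maxlen=cutoff, value=padding_index)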

In [0]:
test_embedding_layer = imdb_wordvec.get_keras_embedding(train_embeddings=False)
test_embedding_layer.input_length = cutoff

In [0]:
vector_model = Sequential()
vector_model.add(test_embedding_layer)
vector_model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
vector_model.add(MaxPooling1D(pool_size=3))
vector_model.add(Dropout(.1))
vector_model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
vector_model.add(MaxPooling1D(pool_size=3))
vector_model.add(Dropout(.2))
vector_model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
vector_model.add(MaxPooling1D(pool_size=5))
vector_model.add(Dropout(.3))

vector_model.add(Flatten())
vector_model.add(Dense(units=128, activation='relu'))
vector_model.add(Dropout(.5))
vector_model.add(Dense(units=64, activation='relu'))
vector_model.add(Dropout(.5))
vector_model.add(Dense(units=32, activation='relu'))
vector_model.add(Dense(units=1, activation='sigmoid')) # because at the end, we want one yes/no answer
vector_model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

In [49]:
with tf.device('/gpu:0'):
  # Train our model
  vector_model.fit(x_train_padded, y_train, epochs=30, batch_size=256, validation_data=(x_test_padded, y_test))

  # Evaluate our model
  score = vector_model.evaluate(x_test_padded, y_test, verbose=0)
  print('Test loss:', score[0])
  print('Test accuracy:', score[1])

Train on 25000 samples, validate on 25000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
Test loss: 0.42696055778503417
Test accuracy: 0.803


### Notes

The best model I could build still reached only about 80% accuracy. A next step for me would be to use the vectors from the billion-word vectors file and train my model with that embedding layer.