## Sentiment Analysis on Twitter data using LSTM with Tensorflow and a word2vec algorithm

### 01. Word2Vec
This first notebook loads and introduces the word vectors, explores the data set with a few simple examples, and gives a first introduction to TensorFlow.

A Word2Vec model is itself a neural network. In this project we do not train one ourselves; instead we use a predefined set of words and vectors, namely the pre-trained GloVe vectors available at https://nlp.stanford.edu/projects/glove/. I downloaded the zip file and loaded the txt file with 400k words, each represented by a 50-dimensional vector.
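
Each line of the GloVe text file holds a token followed by its vector components, separated by single spaces. Below is a minimal sketch of parsing that layout. It uses two made-up 3-dimensional lines instead of the real 50-dimensional file; the sample values and the helper name `parse_glove_line` are assumptions for illustration only.

```python
def parse_glove_line(line):
    """Split one GloVe-style line into (word, list of floats)."""
    parts = line.rstrip().split(' ')
    return parts[0], [float(x) for x in parts[1:]]

# Made-up sample lines; the real file has 50 numbers per line.
sample_lines = [
    "the 0.1 -0.2 0.3\n",
    "bitcoin 1.5 0.25 -0.75\n",
]

parsed = [parse_glove_line(line) for line in sample_lines]
for word, vector in parsed:
    print(word, vector)
```

The cells below do the same on the full file, keeping the words and the vectors in two parallel lists.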

In [76]:
# import libraries
import numpy as np
import tensorflow as tf
import matplotlib
import pandas as pd
import time

# system info
import sys
print(sys.version)

3.5.4 |Anaconda custom (x86_64)| (default, Oct 27 2017, 11:48:53) 
[GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)]


### Data
#### Import and prepare the word and vector data

In [77]:
with open('/Users/olafdeleeuw/Desktop/ODSC/Project/ODSC-London-2018/data/glove.6B.50d.txt') as f:
    content = f.readlines()
# remove whitespace characters like `\n` at the end of each line
content = [x.strip() for x in content]

In [85]:
words = []
vectors = []
for item in content:
    list_item = item.split(' ')
    word = list_item[0]
    vector = list_item[1:]  # everything after the word: the 50 vector components
    words.append(word)
    vectors.append(vector)

# index 0 is reserved for padding and unknown words,
# so prepend a dummy word '0' with an all-zero vector
words = ['0'] + words
vectors = [['0'] * 50] + vectors
# convert list of vectors to numpy array of dtype float32
vectors = np.asarray(vectors).astype('float32')

In [97]:
np.save("words", words)
np.save("vectors", vectors)
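
`np.save` appends the `.npy` extension when the file name does not already have it, so the two calls above write `words.npy` and `vectors.npy`. A small sketch of the round trip, using tiny stand-in data and a temporary directory (the names `words_demo` and `vectors_demo` are assumptions, not part of the notebook):

```python
import os
import tempfile

import numpy as np

# Tiny stand-ins for the real `words` list and `vectors` array.
words_demo = ['0', 'the', 'bitcoin']
vectors_demo = np.zeros((3, 50), dtype='float32')

with tempfile.TemporaryDirectory() as tmp_dir:
    # np.save adds ".npy" because the names below have no extension.
    np.save(os.path.join(tmp_dir, "words"), words_demo)
    np.save(os.path.join(tmp_dir, "vectors"), vectors_demo)

    # Load both arrays back; the word array is converted to a plain list.
    words_loaded = np.load(os.path.join(tmp_dir, "words.npy")).tolist()
    vectors_loaded = np.load(os.path.join(tmp_dir, "vectors.npy"))

print(words_loaded)
print(vectors_loaded.shape, vectors_loaded.dtype)
```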

In [86]:
# let's see what the data looks like
print(words[0:50])
print(vectors[0:2])

['0', 'the', ',', '.', 'of', 'to', 'and', 'in', 'a', '"', "'s", 'for', '-', 'that', 'on', 'is', 'was', 'said', 'with', 'he', 'as', 'it', 'by', 'at', '(', ')', 'from', 'his', "''", '``', 'an', 'be', 'has', 'are', 'have', 'but', 'were', 'not', 'this', 'who', 'they', 'had', 'i', 'which', 'will', 'their', ':', 'or', 'its', 'one']
[[  0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00000000e+00   0.00000000e+00
    0.00000000e+00   0.00000000e+00   0.00

#### Time for an example: what is the index of the word 'bitcoin', and what is its vector?

In [87]:
bitcoinIndex = words.index('bitcoin')
print(bitcoinIndex)
vectors[bitcoinIndex]

113658


array([ -4.21889991e-01,  -1.77220002e-01,   4.74950001e-02,
         9.31389987e-01,  -1.92499995e-01,   3.30260009e-01,
         3.46399993e-01,   1.65500000e-01,  -4.02389988e-02,
         1.51989996e+00,   8.18899989e-01,   4.06619996e-01,
         9.31710005e-01,   9.09690022e-01,   5.04790008e-01,
         5.25290012e-01,  -1.89789996e-01,   2.40840003e-01,
         9.88189995e-01,   4.50459987e-01,   1.69430006e+00,
        -7.72909999e-01,  -1.68149993e-01,   2.61500001e-01,
         6.53119981e-02,   9.95670021e-01,   6.19050026e-01,
        -1.06849998e-01,   3.08539987e-01,  -9.48569998e-02,
        -2.32089996e-01,  -5.40139973e-01,   2.01020002e-01,
         3.00799996e-01,  -5.27949989e-01,   3.29109989e-02,
        -1.48550004e-01,  -6.03600025e-01,  -2.75130004e-01,
         4.72749993e-02,   1.40760001e-03,  -5.94990015e-01,
        -9.98790026e-01,   5.39520025e-01,  -4.62949991e-01,
        -8.96990001e-01,   3.14480007e-01,   8.33949983e-01,
        -1.66089997e-01,

### Function to convert sentences into indices

In the end we are interested in whole sentences rather than single words. The next function therefore turns a sentence (a sequence of words) into a fixed-length array of 250 word indices, padded with zeros. It is followed by three example sentences.

In [88]:
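def turn_sentence_to_indices(sentence):
    # fixed-length output: sentences are padded with 0 up to 250 words
    indices = np.zeros(250, dtype='int32')
    for i in range(len(sentence)):
        try:
            indices[i] = words.index(sentence[i])
        except ValueError:
            # word not in the vocabulary: fall back to index 0
            indices[i] = 0
    return indices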

In [89]:
example = np.array(['the', 'share', 'price', 'went', 'up'])
example2 = np.array(['the', 'stock', 'market', 'was', 'rising'])
example3 = np.array(['the', 'soccer', 'game', 'ended', 'in', 'a', 'draw'])
# print(turn_sentence_to_indices(example))
# print(turn_sentence_to_indices(example2))
print(turn_sentence_to_indices(example3))

[   1 1734  187  844    7    8 1708    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0  
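
`words.index` scans the list from the start on every call, which can get slow with 400k entries. One possible alternative is to build a dictionary from word to position once and look words up there. The sketch below uses a tiny made-up vocabulary; `word_to_index` and `sentence_to_indices_fast` are hypothetical names, not part of the original notebook.

```python
import numpy as np

# Small made-up vocabulary; position 0 is the padding / unknown slot.
vocab = ['0', 'the', 'share', 'price', 'went', 'up']

# Build the lookup table once.
word_to_index = {word: i for i, word in enumerate(vocab)}

def sentence_to_indices_fast(sentence, max_len=250):
    """Map tokens to vocabulary indices, padding with 0 up to max_len."""
    indices = np.zeros(max_len, dtype='int32')
    for i, token in enumerate(sentence[:max_len]):
        # Unknown tokens fall back to index 0.
        indices[i] = word_to_index.get(token, 0)
    return indices

demo = sentence_to_indices_fast(['the', 'share', 'price', 'rocketed'])
print(demo[:6])
```

One difference to keep in mind: if a word occurs more than once in the list, the dictionary keeps its last position, whereas `list.index` returns the first.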

### Function to convert indices into vectors using Tensorflow

Next, the indices are turned into vectors with TensorFlow's embedding lookup. The code uses the TensorFlow 1.x graph API (`tf.Session`, `tf.placeholder`).

In [90]:
def indices_to_vector(indices):

    # one sentence: 250 words, each represented by a 50d vector
    v1 = tf.get_variable("vectors", [250, 50])

    # Define a placeholder and assign op for each variable, so
    # that we can feed the initial value without adding it to the graph.
    variables = [v1]  # renamed from `vars` to avoid shadowing the built-in
    placeholders = [tf.placeholder(tf.float32, shape=[250, 50]) for v in variables]
    assign_ops = [v.assign(p) for (v, p) in zip(variables, placeholders)]

    init_op = tf.global_variables_initializer()

    saver = tf.train.Saver(tf.global_variables())

    with tf.Session() as sess:
        sess.run(init_op)
        for p, assign_op in zip(placeholders, assign_ops):
            # look up the 50d vector of every index in the sentence
            vector = tf.nn.embedding_lookup(vectors, indices).eval()
            sess.run(assign_op, {p: vector})
            print(vector)
        # Save the variables to disk.
        save_path = saver.save(sess, "/tmp/trainmodel.ckpt")
        print("Vectors saved in file: %s" % save_path)

#### Let's check out the vector of the first example

In [91]:
indices_to_vector(turn_sentence_to_indices(example))

[[ 0.41800001  0.24968    -0.41242    ..., -0.18411    -0.11514    -0.78580999]
 [ 0.39412001  0.23183     0.68751001 ...,  0.57809001  0.25825    -0.1166    ]
 [-0.44953999  0.11784     0.65070999 ...,  0.45262     0.40169001
   0.67246997]
 ..., 
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]
 [ 0.          0.          0.         ...,  0.          0.          0.        ]]
Vectors saved in file: /tmp/trainmodel.ckpt
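
For a single embedding matrix, `tf.nn.embedding_lookup` picks one row of the matrix per index. The same selection can be sketched in plain NumPy with integer array indexing. The small matrix below is made up for illustration and stands in for the real 50-dimensional `vectors` array.

```python
import numpy as np

# Made-up embedding matrix: 4 "words", 3 dimensions each; row 0 is padding.
embedding = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
], dtype='float32')

# A padded "sentence" of word indices.
indices = np.array([2, 1, 3, 0, 0], dtype='int32')

# Row selection: one embedding row per index, in sentence order.
sentence_matrix = embedding[indices]
print(sentence_matrix.shape)
```

This is why the output above ends in rows of zeros: every padded position has index 0 and therefore selects the all-zero vector.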


#### How to restore the data...

The vectors are now saved in a checkpoint file. To use them again later, restore them from disk as follows.

In [92]:
tf.reset_default_graph()

# Create some variables.
v1 = tf.get_variable("vectors", shape=[250, 50])

# Add ops to save and restore all the variables.
saver = tf.train.Saver()

# Later, launch the model, use the saver to restore variables from disk, and
# do some work with the model.
with tf.Session() as sess:
  # Restore variables from disk.
  saver.restore(sess, "/tmp/trainmodel.ckpt")
  print("Model restored.")
  # Check the values of the variables
  # print("v1 : %s" % v1.eval())
  word_vectors = v1.eval()

INFO:tensorflow:Restoring parameters from /tmp/trainmodel.ckpt
Model restored.


In [93]:
word_vectors

array([[ 0.41800001,  0.24968   , -0.41242   , ..., -0.18411   ,
        -0.11514   , -0.78580999],
       [ 0.39412001,  0.23183   ,  0.68751001, ...,  0.57809001,
         0.25825   , -0.1166    ],
       [-0.44953999,  0.11784   ,  0.65070999, ...,  0.45262   ,
         0.40169001,  0.67246997],
       ..., 
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.        ,  0.        ,  0.        , ...,  0.        ,
         0.        ,  0.        ]], dtype=float32)