## Introduction

In this project I will use Ye

## <b>Import libraries</b>

In [48]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
# Embedding is imported from keras.layers; the keras.layers.embeddings path is not available in newer Keras releases
from keras.layers import Dense, Flatten, LSTM, Conv1D, MaxPooling1D, Dropout, Activation, Embedding
import pandas as pd
import numpy as np

In [143]:
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.tools as tls
%matplotlib inline

## <b>Get and process the data</b>

In [49]:
# Note: pandas 1.3+ deprecates error_bad_lines; use on_bad_lines='skip' there instead
df = pd.read_csv('train.csv', sep='|', names=['stars', 'text'], error_bad_lines=False)

In [50]:
# Drop rows with missing values
df = df.dropna()
# Keep only rows whose star rating is a numeric string
df = df[df.stars.apply(lambda x: x.isnumeric())]
df = df[df.stars.apply(lambda x: x !="")]

In [51]:
df = df[df.text.apply(lambda x: x !="")]

In [52]:
df.describe()

| | stars | text |
|---|---|---|
| count | 1673870 | 1673870 |
| unique | 5 | 1673452 |
| top | 5 | Good stuff |
| freq | 709732 | 6 |


In [53]:
df.head()

| | stars | text |
|---|---|---|
| 0 | 5 | The minute I realized that Conflict was a bloc... |
| 2 | 5 | I love Conflict Kitchen. The food is fantasti... |
| 3 | 4 | Holy moly! I'm addicted!\n\nI first heard of C... |
| 4 | 4 | Had some great Persian food, though it was mor... |
| 5 | 4 | Yummy food. Good prices. Encourages me to try ... |


### Convert five classes into two classes (positive = 1 and negative = 0)

Since the main purpose is to identify positive and negative comments, I convert the five star categories into two classes:

<li> (1) Positive: comments with stars > 3 and 
<li> (2) Negative: comments with stars <= 3

In [54]:
labels = df['stars'].map(lambda x : 1 if int(x) > 3 else 0)
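
As a quick illustration of the thresholding rule above, here is a minimal, dependency-free sketch. The sample ratings are made up, and `to_binary_label` is a hypothetical helper name that does not appear in the notebook; it only mirrors the `lambda` used in the cell above.

```python
from collections import Counter


def to_binary_label(stars):
    """Map a star rating (string or int) to 1 (positive, > 3 stars) or 0 (negative)."""
    return 1 if int(stars) > 3 else 0


# Made-up ratings, stored as strings like the `stars` column read from the CSV.
sample_stars = ["5", "4", "3", "1", "5", "2"]
sample_labels = [to_binary_label(s) for s in sample_stars]

# Counting the labels shows how balanced the two classes are.
label_counts = Counter(sample_labels)
print(sample_labels)  # [1, 1, 0, 0, 1, 0]
print(label_counts[1], label_counts[0])  # 3 3
```

Running the same count on the real `labels` series (for example with `labels.value_counts()`) is a cheap way to check class balance before training.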

### Tokenize text data

To limit the computational cost, I keep only the 20,000 most common words. I first tokenize the comments and then convert them to integer sequences. Each comment is padded or truncated to 50 tokens.

In [72]:
vocabulary_size = 20000
tokenizer = Tokenizer(num_words= vocabulary_size)
tokenizer.fit_on_texts(df['text'])
sequences = tokenizer.texts_to_sequences(df['text'])
data = pad_sequences(sequences, maxlen=50)

In [73]:
print(data.shape)

(1673870, 50)
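
To make the padding step concrete, here is a small pure-Python sketch of what happens to a single sequence. It assumes the default `pad_sequences` settings (`padding='pre'`, `truncating='pre'`, padding value 0); `pad_or_truncate` is a hypothetical helper written for illustration, not a Keras function.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Return a list of exactly `maxlen` integers (maxlen must be >= 1).

    Long sequences keep their last `maxlen` items; short ones are
    left-padded with `value`, mirroring the pad_sequences defaults.
    """
    trimmed = sequence[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed


short = pad_or_truncate([7, 2, 9], maxlen=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)  # [0, 0, 7, 2, 9]
print(long_)  # [3, 4, 5, 6, 7]
```

With `maxlen=50` and these defaults, a review longer than 50 tokens keeps only its last 50 tokens. Passing `truncating='post'` to `pad_sequences` would keep the beginning of the review instead.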


## <b>Build neural network with LSTM</b>

### Network Architecture
The network starts with an embedding layer, which maps each token to a dense vector so that the network can represent words in a meaningful way. The layer takes 20000 as its first argument, which is the size of our vocabulary, and 100 as its second argument, which is the dimension of the embeddings. The third parameter is the input_length of 50, which is the length of each comment sequence.

In [13]:
model = Sequential()
model.add(Embedding(20000, 100, input_length=50))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
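
To get a feel for the size of this network, the sketch below counts the trainable parameters of each layer by hand. The helper functions are hypothetical names introduced for illustration; the formulas are the standard ones for an embedding table, an LSTM with bias terms, and a dense layer, and the result should be comparable with what `model.summary()` reports.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four weight blocks (three gates plus the candidate cell state), each with
    # input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


emb = embedding_params(20000, 100)
lstm = lstm_params(100, 100)
dense = dense_params(100, 1)
total = emb + lstm + dense
print(emb, lstm, dense, total)  # 2000000 80400 101 2080501
```

Almost all of the parameters sit in the embedding layer, which is one reason capping the vocabulary at 20,000 words keeps the model manageable.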

### Train the network

There are about 1.6 million comments, so it takes a while to train the model on a laptop. To save time I have used only 3 epochs. A GPU machine can be used to accelerate training and run more epochs. I split the whole dataset into 60% for training and 40% for validation.

In [14]:
model.fit(data, np.array(labels), validation_split=0.4, epochs=3)

Train on 1004322 samples, validate on 669548 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


NameError: name 'padded_docs' is not defined
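
The sample counts in the training log follow directly from `validation_split=0.4`. The sketch below reproduces them with integer arithmetic; `split_sizes` is a hypothetical helper, not a Keras function. Keras takes the validation rows from the end of the arrays, before any shuffling, so here the validation set is simply the last 40% of the rows in `data`.

```python
def split_sizes(n_samples, validation_percent):
    """Return (n_train, n_val) when the last `validation_percent` % of rows are held out."""
    n_train = n_samples * (100 - validation_percent) // 100
    return n_train, n_samples - n_train


n_train, n_val = split_sizes(1673870, 40)
print(n_train, n_val)  # 1004322 669548
```

Because the held-out rows come from the end of the file, it may be worth shuffling `data` and `labels` together beforehand if the reviews in `train.csv` are ordered in some way.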

## <b>Build neural network with LSTM and CNN</b>
The LSTM model worked well; however, it takes a very long time to train for 3 epochs. One way to speed up training is to change our network architecture and add a “Convolutional” layer. Convolutional Neural Networks (CNNs) come from image processing. They pass a “filter” over the data and calculate a higher-level representation. They have been shown to work surprisingly well for text, even though they have none of the sequence processing ability of LSTMs.

In [59]:
def create_conv_model():
    model_conv = Sequential()
    model_conv.add(Embedding(vocabulary_size, 100, input_length=50))
    model_conv.add(Dropout(0.2))
    model_conv.add(Conv1D(64, 5, activation='relu'))
    model_conv.add(MaxPooling1D(pool_size=4))
    model_conv.add(LSTM(100))
    model_conv.add(Dense(1, activation='sigmoid'))
    model_conv.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model_conv 
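
One way to see why this model trains faster is to track the sequence length through the convolution and pooling layers. The sketch below does the arithmetic by hand, assuming the Keras defaults used above (`padding='valid'`, convolution stride 1, pooling stride equal to `pool_size`); the two helper functions are hypothetical names introduced for illustration.

```python
def conv1d_output_length(length, kernel_size, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (length - kernel_size) // stride + 1


def maxpool1d_output_length(length, pool_size):
    """Output length of 1D max pooling with 'valid' padding and stride == pool_size."""
    return (length - pool_size) // pool_size + 1


after_conv = conv1d_output_length(50, 5)
after_pool = maxpool1d_output_length(after_conv, 4)
print(after_conv, after_pool)  # 46 11
```

Under these assumptions the LSTM processes 11 time steps per comment instead of 50, which accounts for much of the speed-up.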

In [60]:
model_conv = create_conv_model()
model_conv.fit(data, np.array(labels), validation_split=0.4, epochs = 3)

Train on 1004322 samples, validate on 669548 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1579daeb8>

### Save processed Data

In [28]:
df_save = pd.DataFrame(data)
df_label = pd.DataFrame(np.array(labels))

In [29]:
result = pd.concat([df_save, df_label], axis = 1)

In [31]:
result.to_csv('train_dense_word_vectors.csv', index=False)

## <b>Use pre-trained GloVe word embeddings</b>

In this subsection I want to use pre-trained GloVe word embeddings. The glove.6B set used here was trained on a corpus of six billion tokens (words) with a vocabulary of 400 thousand words. It is available in 50, 100, 200 and 300 dimensions; I chose the 100-dimensional version. I also want to see how the model behaves when the pre-trained word weights are not updated during training, so I set the trainable attribute of the embedding layer to False.

### Get the embeddings from GloVe

In [61]:
embeddings_index = dict()
# Each line holds a word followed by its 100 vector components
with open('glove.6B/glove.6B.100d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.


In [105]:
# create a weight matrix for words in training docs
embedding_matrix = np.zeros((vocabulary_size, 100))
for word, index in tokenizer.word_index.items():
    # Skip words outside the vocabulary; `continue` does not depend on dict ordering
    if index > vocabulary_size - 1:
        continue
    embedding_vector = embeddings_index.get(word)
    # Words missing from GloVe keep an all-zero row
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector
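
The toy example below walks through the same two steps (parsing GloVe-style lines, then filling a weight matrix by tokenizer index) on made-up data, using plain Python lists instead of NumPy. All `toy_*` names and values are assumptions for illustration only. It also reports how many vocabulary words were found in the pre-trained vectors, which is a useful sanity check on real data.

```python
# Two made-up lines in the GloVe text format: a word followed by its vector.
toy_glove_lines = [
    "good 0.1 0.2 0.3",
    "food 0.4 0.5 0.6",
]
toy_embeddings = {}
for line in toy_glove_lines:
    values = line.split()
    toy_embeddings[values[0]] = [float(v) for v in values[1:]]

# Hypothetical tokenizer output: word -> index, with index 0 reserved for padding.
toy_word_index = {"good": 1, "food": 2, "yummy": 3}
toy_vocab_size, toy_dim = 4, 3

# Start from an all-zero matrix and copy in the vectors that exist.
toy_matrix = [[0.0] * toy_dim for _ in range(toy_vocab_size)]
found = 0
for word, index in toy_word_index.items():
    vector = toy_embeddings.get(word)
    if index < toy_vocab_size and vector is not None:
        toy_matrix[index] = vector
        found += 1

coverage = found / len(toy_word_index)
print(toy_matrix)
print(f"coverage: {coverage:.2f}")  # coverage: 0.67
```

Rows for the padding index and for words missing from the pre-trained vectors ("yummy" here) stay at zero.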

### Develop model

I use the same model architecture, with a convolutional layer in front of the LSTM layer, because it trains faster than the one without a convolutional layer.

In [106]:
model_glove = Sequential()
model_glove.add(Embedding(vocabulary_size, 100, input_length=50, weights=[embedding_matrix], trainable=False))
model_glove.add(Dropout(0.2))
model_glove.add(Conv1D(64, 5, activation='relu'))
model_glove.add(MaxPooling1D(pool_size=4))
model_glove.add(LSTM(100))
model_glove.add(Dense(1, activation='sigmoid'))
model_glove.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [107]:
model_glove.fit(data, np.array(labels), validation_split=0.4, epochs = 3)

Train on 1004322 samples, validate on 669548 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x1228e2e80>

## <b>Word embedding visualization</b>

In this section I want to visualize the word embedding weights obtained from the trained models. The 100-dimensional word embeddings are first reduced to 2 dimensions using t-SNE. TensorFlow has a great tool for visualizing embeddings, but here I just want a simple view of the word relationships.

### Get embedding weights from GloVe

In [115]:
word_weights = model_glove.layers[0].get_weights()[0]

### Get word list 

In [137]:
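# Row i of the embedding weights belongs to the word with tokenizer index i.
# Index 0 is reserved for padding, so a placeholder keeps labels and rows aligned.
word_list = ['<pad>']
for word, i in sorted(tokenizer.word_index.items(), key=lambda item: item[1]):
    word_list.append(word)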

### Scatter plot of the first two t-SNE components

In [124]:
from sklearn.manifold import TSNE
X_embedded = TSNE(n_components=2).fit_transform(word_weights)

In [150]:
trace = go.Scatter(
    x = X_embedded[0:1000,0], 
    y = X_embedded[0:1000, 1],
    mode = 'markers',
    text= word_list[0:1000]
)

layout = dict(title= 't-SNE 1 vs t-SNE 2',
              yaxis = dict(title='t-SNE 2'),
              xaxis = dict(title='t-SNE 1'),
              hovermode= 'closest',)

fig = dict(data = [trace], layout= layout)
py.iplot(fig)