The dataset used in this notebook can be found [here](https://inclass.kaggle.com/c/si650winter11).

In [1]:
import pandas as pd
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
from string import punctuation
from collections import Counter
%matplotlib inline

In [6]:
sentiment_data = pd.read_csv('kaggle_sentiment.txt', sep='\t')
sentiment_data.columns = ['Class', 'Data']

In [7]:
unlabeld_data = pd.read_csv('unlabeld_data.txt', sep='\t')
unlabeld_data.columns = ['Data']

### Step 1. Preprocessing pipeline

In [8]:
sentiment_data.head()

| | Class | Data |
|---|---|---|
| 0 | 1 | this was the first clive cussler i've ever rea... |
| 1 | 1 | i liked the Da Vinci Code a lot. |
| 2 | 1 | i liked the Da Vinci Code a lot. |
| 3 | 1 | I liked the Da Vinci Code but it ultimatly did... |
| 4 | 1 | that's not even an exaggeration ) and at midni... |


In [9]:
unlabeld_data.head()

| | Data |
|---|---|
| 0 | harvard is dumb, i mean they really have to be... |
| 1 | I'm loving Shanghai > > > ^ _ ^. |
| 2 | harvard is for dumb people. |
| 3 | As i stepped out of my beautiful Toyota, i hea... |
| 4 | Bodies being dismembered, blown apart, and mut... |


#### Step 1.1 Shuffle dataframe

The dataset is sorted by class: the first half of the samples is positive and the second half is negative. If we split it into training and testing parts as it is, one of the parts will contain samples mostly (if not entirely) from a single class. To prevent that, we shuffle the dataset first.

In [10]:
from sklearn.utils import shuffle
sentiment_data = shuffle(sentiment_data)
unlabeld_data = shuffle(unlabeld_data)
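
To see why the shuffle matters, here is a minimal sketch on toy labels (not the Kaggle data) that uses only the standard library. The 80/20 split and the seed are assumptions chosen for illustration. If a reproducible order is wanted in the cell above, `sklearn.utils.shuffle` also accepts a `random_state` argument.

```python
import random

# Toy labels sorted by class, mimicking the layout described above.
labels_sorted = [1] * 50 + [0] * 50
split = int(0.8 * len(labels_sorted))

# Without shuffling, the held-out tail contains a single class.
test_unshuffled = labels_sorted[split:]

# Shuffle a copy with a fixed seed (an arbitrary choice) and split again.
labels_shuffled = labels_sorted[:]
random.Random(0).shuffle(labels_shuffled)
test_shuffled = labels_shuffled[split:]

print(set(test_unshuffled))
print(set(test_shuffled))
```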

In [11]:
sentiment_data.head()

| | Class | Data |
|---|---|---|
| 749 | 1 | the people who are worth it know how much i lo... |
| 339 | 1 | the people who are worth it know how much i lo... |
| 1745 | 1 | i love kirsten / leah / kate escapades and mis... |
| 3907 | 1 | man i loved brokeback mountain! |
| 1560 | 1 | I like Mission Impossible movies because you n... |


#### Step 1.2 Split to labels and reviews

In this step we create separate variables that hold the labels (positive or negative) and the reviews.

In [12]:
labels = sentiment_data.iloc[:, 0].values
reviews = sentiment_data.iloc[:, 1].values
unlabeled_reviews = unlabeld_data.iloc[:,0].values

#### Step 1.3 Clean data from punctuation

Punctuation is unlikely to affect our predictions, so we remove all of it from the reviews.

In [13]:
reviews_processed = []
unlabeled_processed = [] 
for review in reviews:
    review_cool_one = ''.join([char for char in review if char not in punctuation])
    reviews_processed.append(review_cool_one)
    
for review in unlabeled_reviews:
    review_cool_one = ''.join([char for char in review if char not in punctuation])
    unlabeled_processed.append(review_cool_one)
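
The same cleaning can be sketched with `str.translate`, which deletes every character listed in `string.punctuation` in one pass. The helper name `strip_punctuation` is hypothetical, and the sample sentences are taken from the previews above.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character.
_DELETE_PUNCT = str.maketrans('', '', punctuation)

def strip_punctuation(text):
    """Return text with all characters from string.punctuation removed."""
    return text.translate(_DELETE_PUNCT)

print(strip_punctuation("i liked the Da Vinci Code a lot."))
print(strip_punctuation("I'm loving Shanghai > > > ^ _ ^."))
```

Note that apostrophes are removed too, so `I'm` becomes `Im`. The leftover runs of spaces disappear later, when each review is split into words.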

#### Step 1.4 Creating vocabulary, converting all characters to lower case and splitting each review into words

In this step we build the vocabulary with `Counter`. We also convert every character in the dataset to lower case, which is safe here because letter case is not expected to affect the prediction results. Lastly, we split each review into separate words.

In [21]:
word_reviews = []
word_unlabeled = []
all_words = []
for review in reviews_processed:
    word_reviews.append(review.lower().split())
    for word in review.split():
        all_words.append(word.lower())

for review in unlabeled_processed:
    word_unlabeled.append(review.lower().split())
    for word in review.split():
        all_words.append(word.lower())
    
counter = Counter(all_words)
vocab = sorted(counter, key=counter.get, reverse=True)

#### Step 1.5 Creating vocab_to_int dictionary which maps each word to a number

In [22]:
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
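
A minimal sketch of the vocabulary steps on a handful of made-up tokens shows what the mapping looks like. The token list is an assumption for illustration; the numbering starts at 1, as in the cell above, so that 0 stays free for padding.

```python
from collections import Counter

# Made-up tokens standing in for all_words.
tokens = ['i', 'liked', 'the', 'movie', 'i', 'loved', 'the', 'book', 'i']

counter = Counter(tokens)
# Most frequent words first; ties keep their first-seen order.
vocab = sorted(counter, key=counter.get, reverse=True)
vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}

# Encode a review as a list of integers.
encoded = [vocab_to_int[word] for word in 'i loved the movie'.split()]
print(vocab_to_int)
print(encoded)
```

Because the vocabulary in this notebook is built from both the labeled and the unlabeled reviews, every word that is looked up later has an entry in the dictionary.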

#### Step 1.6 Using vocab_to_int to transform each review into a vector of numbers

In [23]:
reviews_to_ints = []
for review in word_reviews:
    reviews_to_ints.append([vocab_to_int[word] for word in review])

In [24]:
unlabeled_to_ints = []

for review in word_unlabeled:
    unlabeled_to_ints.append([vocab_to_int[word] for word in review])

#### Step 1.7 Check whether we have any zero-length reviews

In [25]:
reviews_lens = Counter([len(x) for x in reviews_to_ints])
print('Zero-length {}'.format(reviews_lens[0]))
print("Max review length {}".format(max(reviews_lens)))

Zero-length 0
Max review length 931


#### Step 1.8 Creating word vectors

This step can be done as follows:
    1. Define the sequence length (250 in this case).
    2. Each review shorter than the sequence length is padded with zeros at the beginning.
    3. Each review longer than the sequence length is truncated to its first 250 words.

In [27]:
seq_len = 250

features = np.zeros((len(reviews_to_ints), seq_len), dtype=int)
for i, review in enumerate(reviews_to_ints):
    features[i, -len(review):] = np.array(review)[:seq_len]
    
features_test = np.zeros((len(unlabeled_to_ints), seq_len), dtype=int)
for i, review in enumerate(unlabeled_to_ints):
    features_test[i, -len(review):] = np.array(review)[:seq_len]
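
The padding and truncation rules can be sketched as a small helper. The name `pad_left` is hypothetical; it follows the same convention as the cell above (zeros on the left, first `seq_len` tokens kept) and additionally skips empty reviews, which the slice assignment above would not handle.

```python
import numpy as np

def pad_left(sequences, seq_len):
    """Left-pad with zeros or truncate each sequence to exactly seq_len items."""
    features = np.zeros((len(sequences), seq_len), dtype=int)
    for i, seq in enumerate(sequences):
        if not seq:
            continue  # an empty review stays all zeros
        trimmed = seq[:seq_len]  # keep the first seq_len tokens
        features[i, -len(trimmed):] = trimmed
    return features

print(pad_left([[1, 2, 3], [4, 5, 6, 7, 8, 9], []], seq_len=5))
```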

#### Step 1.9 Split into training and testing parts

In [28]:
X_train = features[:6400]
y_train = labels[:6400]

X_test = features[6400:]
y_test = labels[6400:]

X_unlabeled = features_test

print('X_train shape {}'.format(X_train.shape))
print('X_unlabeled shape {}'.format(X_unlabeled.shape))

X_train shape (6400, 250)
X_unlabeled shape (28936, 250)
X_unlabeled shape (28936, 250)


### Done with preprocessing pipeline

## Step 2. Defining RNN

In [39]:
hidden_layer_size = 512 # number of units in each LSTM cell
number_of_layers = 1 # how many RNN layers the network will use
batch_size = 100 # how many reviews we feed at once
learning_rate = 0.001 # learning rate
number_of_words = len(vocab_to_int) + 1 # number of unique words in the vocab (+1 for index 0, which is used for padding)
dropout_rate = 0.8 # passed to DropoutWrapper as the input keep probability, so about 20% of the inputs are dropped
embed_size = 300 # length of each word embedding vector
epochs = 6 # how many epochs we use for training

In [30]:
tf.reset_default_graph() #Clean the graph

#### Step 2.1 Define placeholders

In [31]:
inputs = tf.placeholder(tf.int32, [None, None], name='inputs')
targets = tf.placeholder(tf.int32, [None, None], name='targets')

#### Step 2.2 Define embedding layer

In [32]:
word_embedings = tf.Variable(tf.random_uniform((number_of_words, embed_size), -1, 1))
embed = tf.nn.embedding_lookup(word_embedings, inputs)
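
Conceptually, `tf.nn.embedding_lookup` selects rows of the embedding matrix by index. The NumPy sketch below illustrates the resulting shapes with small made-up sizes; it is an analogy, not the TensorFlow implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

number_of_words, embed_size = 10, 4  # tiny stand-ins for the real sizes
word_embeddings = rng.uniform(-1, 1, size=(number_of_words, embed_size))

# Two left-padded reviews of length 3 (0 is the padding index).
inputs = np.array([[0, 2, 5],
                   [1, 1, 9]])

# Integer-array indexing picks one embedding row per word id.
embed = word_embeddings[inputs]
print(embed.shape)  # (2, 3, 4): batch, sequence length, embedding size
```

The padding index 0 gets an embedding row of its own, which is why `number_of_words` is `len(vocab_to_int) + 1`.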

#### Step 2.3 Define hidden layer and Dynamic RNN

In [33]:
hidden_layer = tf.contrib.rnn.BasicLSTMCell(hidden_layer_size)
hidden_layer = tf.contrib.rnn.DropoutWrapper(hidden_layer, dropout_rate)

cell = tf.contrib.rnn.MultiRNNCell([hidden_layer]*number_of_layers)
init_state = cell.zero_state(batch_size, tf.float32)
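
Despite its name, `dropout_rate` is passed as the second positional argument of `DropoutWrapper`, which is `input_keep_prob`, so 0.8 is the probability of keeping an input. A rough NumPy sketch of dropout with a keep probability (an illustration, not the TensorFlow code itself) looks like this:

```python
import numpy as np

def dropout(x, keep_prob, rng):
    """Keep each element with probability keep_prob and rescale the survivors."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

rng = np.random.default_rng(0)
x = np.ones((1000, 8))
dropped = dropout(x, keep_prob=0.8, rng=rng)

print((dropped != 0).mean())  # fraction kept, close to 0.8
print(dropped.mean())         # close to 1.0 thanks to the rescaling
```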

In [34]:
outputs, states = tf.nn.dynamic_rnn(cell, embed, initial_state=init_state)

#### Step 2.4 Get the prediction for each review 

We take the output from the last time step of the network and use it as the prediction. Then we compare that result with the real sentiment of the review.

In [35]:
prediction = tf.layers.dense(outputs[:, -1], 1, activation=tf.sigmoid)
cost = tf.losses.mean_squared_error(targets, prediction)

optimizer = tf.train.AdamOptimizer(learning_rate).minimize(cost)
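
Because the reviews are padded on the left, the last time step holds a real word rather than padding, so `outputs[:, -1]` is a reasonable summary of the whole review. The NumPy sketch below mirrors the shapes involved using random stand-in values; the weights `w` and `b` are hypothetical placeholders for the variables that `tf.layers.dense` creates.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
batch, steps, hidden = 4, 6, 3

outputs = rng.normal(size=(batch, steps, hidden))  # stand-in for the RNN outputs
w = rng.normal(size=(hidden, 1))                   # hypothetical dense weights
b = np.zeros(1)                                    # hypothetical dense bias

last_step = outputs[:, -1]                # shape (batch, hidden)
prediction = sigmoid(last_step @ w + b)   # shape (batch, 1), values in (0, 1)

targets = np.array([[1], [0], [1], [0]])
cost = np.mean((targets - prediction) ** 2)  # mean squared error
print(prediction.shape, cost)
```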

#### Step 2.5 Define accuracy

In [36]:
currect_pred = tf.equal(tf.cast(tf.round(prediction), tf.int32), targets)
accuracy = tf.reduce_mean(tf.cast(currect_pred, tf.float32))
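
The accuracy computation can be checked by hand on a few made-up values. This NumPy sketch follows the same two steps as the cell above: round the sigmoid outputs to 0 or 1, then average the matches with the targets.

```python
import numpy as np

prediction = np.array([[0.9], [0.2], [0.6], [0.4]])  # made-up sigmoid outputs
targets = np.array([[1], [0], [0], [0]])

correct_pred = np.round(prediction).astype(int) == targets
accuracy = correct_pred.astype(float).mean()
print(accuracy)  # 0.75: three of the four predictions match
```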

## Step 3. Training

In [37]:
session = tf.Session()

In [38]:
session.run(tf.global_variables_initializer())

In [40]:
for i in range(epochs):
    training_accurcy = []
    ii = 0
    epoch_loss = []
    while ii + batch_size <= len(X_train):
        X_batch = X_train[ii:ii+batch_size]
        y_batch = y_train[ii:ii+batch_size].reshape(-1, 1)
        
        a, o, _ = session.run([accuracy, cost, optimizer], feed_dict={inputs:X_batch, targets:y_batch})

        training_accurcy.append(a)
        epoch_loss.append(o)
        ii += batch_size
    print('Epoch: {}/{}'.format(i, epochs), ' | Current loss: {}'.format(np.mean(epoch_loss)),
          ' | Training accuracy: {:.4f}'.format(np.mean(training_accurcy)*100))

Epoch: 0/6  | Current loss: 0.05040295422077179  | Training accuracy: 93.2500
Epoch: 1/6  | Current loss: 0.008079711347818375  | Training accuracy: 98.9844
Epoch: 2/6  | Current loss: 0.0035864049568772316  | Training accuracy: 99.6406
Epoch: 3/6  | Current loss: 0.0047574774362146854  | Training accuracy: 99.4844
Epoch: 4/6  | Current loss: 0.002684194827452302  | Training accuracy: 99.7031
Epoch: 5/6  | Current loss: 0.0015217065811157227  | Training accuracy: 99.7812
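
The `while` loop above walks over the training set in full batches and skips any remainder. A small generator makes that pattern explicit; `full_batches` is a hypothetical helper name, not part of the notebook.

```python
def full_batches(n_samples, batch_size):
    """Yield (start, stop) index pairs for every complete batch."""
    start = 0
    while start + batch_size <= n_samples:
        yield start, start + batch_size
        start += batch_size

print(len(list(full_batches(6400, 100))))  # 64 batches per epoch
print(list(full_batches(10, 4)))           # [(0, 4), (4, 8)]; the last 2 samples are skipped
```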


In [41]:
test_accuracy = []

ii = 0
while ii + batch_size <= len(X_test):
    X_batch = X_test[ii:ii+batch_size]
    y_batch = y_test[ii:ii+batch_size].reshape(-1, 1)

    a = session.run([accuracy], feed_dict={inputs:X_batch, targets:y_batch})
    
    test_accuracy.append(a)
    ii += batch_size

In [42]:
print("Test accuracy is {:.4f}%".format(np.mean(test_accuracy)*100))

Test accuracy is 98.8000%


## Step 4. Testing on the unlabeled data

In [60]:
predictions_unlabeled = []
ii = 0
while ii + batch_size <= len(X_unlabeled):
    # init_state is built for a fixed batch_size, so only full batches are fed;
    # reviews left over after the last full batch get no prediction
    X_batch = X_unlabeled[ii:ii+batch_size]

    # prediction does not depend on targets, so only inputs are fed
    pred = session.run([prediction], feed_dict={inputs:X_batch})
    
    predictions_unlabeled.append(pred)
    ii += batch_size
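
With 28936 unlabeled reviews and a batch size of 100, the loop above produces 28900 predictions and leaves the last 36 reviews out. One way to cover them, sketched below under the assumption that the model only accepts full batches, is to pad the final batch with all-zero rows and discard the extra outputs. `predict_all` is a hypothetical helper, and `predict_fn` is a stand-in for a call such as `session.run(prediction, feed_dict={inputs: X_batch})`.

```python
import numpy as np

def predict_all(X, batch_size, predict_fn):
    """Run predict_fn on fixed-size batches, padding the last one with zero rows."""
    results = []
    for start in range(0, len(X), batch_size):
        batch = X[start:start + batch_size]
        n_real = len(batch)
        if n_real < batch_size:
            pad = np.zeros((batch_size - n_real, X.shape[1]), dtype=X.dtype)
            batch = np.vstack([batch, pad])
        results.append(predict_fn(batch)[:n_real])  # drop outputs for the padding
    return np.concatenate(results)

# Dummy model for demonstration: one constant score per row.
def dummy_predict(batch):
    return np.full((len(batch), 1), 0.5)

X_demo = np.ones((250, 7), dtype=int)
print(predict_all(X_demo, batch_size=100, predict_fn=dummy_predict).shape)  # (250, 1)
```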

In [64]:
pred_real = []
for i in range(len(predictions_unlabeled)):
    for ii in range(len(predictions_unlabeled[i][0])):
        if predictions_unlabeled[i][0][ii][0] >= 0.5:
            pred_real.append(1)
        else:
            pred_real.append(0)
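
The nested loops can also be written as a vectorized threshold. This sketch assumes, as in the cells above, that each entry of `predictions_unlabeled` is a one-element list holding an array of shape `(batch_size, 1)`; the values here are made up.

```python
import numpy as np

# Made-up predictions in the same nested layout as predictions_unlabeled.
predictions_demo = [
    [np.array([[0.91], [0.12]])],
    [np.array([[0.50], [0.49]])],
]

# Flatten all batches into one score vector, then apply the 0.5 threshold.
scores = np.concatenate([batch[0] for batch in predictions_demo]).ravel()
pred_real_demo = (scores >= 0.5).astype(int).tolist()
print(pred_real_demo)  # [1, 0, 1, 0]
```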

In [65]:
np.savetxt('predictions.txt', pred_real)

In [66]:
new_dataframe = unlabeld_data[:len(pred_real)].copy() # explicit copy, so adding a column does not trigger SettingWithCopyWarning

In [67]:
new_dataframe['Classes'] = pred_real


In [69]:
new_dataframe

| | Data | Classes |
|---|---|---|
| 6244 | london sucks.... | 0 |
| 22537 | I love the Toyota Prius. | 1 |
| 28827 | Great job ata soccer.......... | 1 |
| 23001 | AAA's " Q " is catchy and an ear worm, like ma... | 1 |
| 6567 | i love shanghai ~ ~ ~ ~ 外滩好像有一家专卖上海纪念品的小店 ~ ~ ~. | 1 |
| 17371 | + + + Bruce Willis hat den PrÃ ¤ sident von Ko... | 0 |
| 9954 | Since then, 25 automakers including Toyota Mot... | 0 |
| 18476 | And as stupid as San Francisco's road system i... | 1 |
| 8848 | Today, when Monkee was backing out of the Milp... | 0 |
| 6277 | Then we had stupid trivia about San Francisco ... | 0 |
