# Sentiment Analysis using Word Embeddings
Sentiment analysis is the task of determining whether a piece of text expresses a positive or a negative opinion.

In this notebook we use the [IMDB Review dataset](http://ai.stanford.edu/~amaas/data/sentiment/) compiled by Stanford. This dataset has 50,000 reviews, half of which are used for training and the other half for testing. This is a binary classification problem where the classes are either 'positive' or 'negative'.

In [30]:
import pandas as pd
import numpy as np
from glob import glob
import os
import sys

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split, GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.layers.core import Activation
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, RNN
from keras.layers.embeddings import Embedding

The IMDB dataset has been downloaded from [here](http://ai.stanford.edu/~amaas/data/sentiment/) and unzipped into the 'data' directory.

Note: Keras has a built-in function to access this dataset, but we want to perform the preprocessing manually.

In [2]:
TRAIN_PATH = 'data/train'
TEST_PATH = 'data/test'
SEED = 2018
VOCAB_SIZE = 100
MAX_REVIEW_LEN = 250
NUM_EPOCHS = 5
BATCH_SIZE = 64

In [3]:
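def get_x_y(file_path):
    """Read every review under file_path/pos and file_path/neg; return (texts, labels)."""
    files = {}
    files['pos'] = glob(os.path.join(file_path, 'pos', '*.txt'))
    files['neg'] = glob(os.path.join(file_path, 'neg', '*.txt'))

    # Positive reviews are labelled 1, negative reviews 0
    sentiment_map = {'pos': 1, 'neg': 0}
    x = []
    y = []
    for sentiment in files:
        for file_name in files[sentiment]:
            # The review files are assumed to be UTF-8; setting the encoding
            # explicitly avoids depending on the platform default
            with open(file_name, encoding='utf-8') as file_:
                x.append(file_.read())
            y.append(sentiment_map[sentiment])
    return x, y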

In [4]:
# Read in the text data
x_train, y_train = get_x_y(TRAIN_PATH)
x_test, y_test = get_x_y(TEST_PATH)

Our data now looks like the following:

In [None]:
print(x_train[0])

While this type of data makes sense to humans, we need to convert the (in this case) English sentences into sequences of numbers. Keras provides a tool for this called a 'Tokenizer'. It transforms strings into sequences of integers, where each word is mapped to its rank by overall frequency in the training text. For example, if 'a' is the most common word and 'this' is the second most common, and these are the only two words kept, the sentence 'This is a dog.' becomes [2, 1]: by default the Tokenizer lowercases the text, strips punctuation, and skips any word outside its vocabulary (unless an 'oov_token' is set). We also need all of our sequences to be the same length, so we choose a length that makes sense and pad shorter sequences with zeros (index 0 is reserved for padding), while longer sequences are truncated to that length.
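As a rough illustration of the two steps described above, here is a plain-Python sketch of frequency-based indexing and zero padding. It is not the Keras implementation: the helper names (build_word_index, to_sequence, pad) and the three-sentence corpus are invented for this example, and it assumes the Keras defaults of skipping out-of-vocabulary words and padding at the front.

```python
from collections import Counter
import re


def build_word_index(texts, num_words):
    """Rank words by frequency and keep the top num_words - 1 (0 is left for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    # Sort by count (descending), then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    kept = ranked[:num_words - 1]
    return {word: rank for rank, (word, _) in enumerate(kept, start=1)}


def to_sequence(text, word_index):
    """Replace each known word by its rank and skip unknown words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word_index[w] for w in words if w in word_index]


def pad(sequence, maxlen):
    """Keep the last maxlen items and left-pad with zeros up to maxlen."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed


# Hypothetical toy corpus: 'a' occurs 3 times, 'this' twice, everything else once
corpus = ["this is a dog", "a cat", "this a"]
word_index = build_word_index(corpus, num_words=3)
sequence = to_sequence("This is a dog.", word_index)
padded = pad(sequence, maxlen=5)

print(word_index)  # {'a': 1, 'this': 2}
print(sequence)    # [2, 1]
print(padded)      # [0, 0, 0, 2, 1]
```

With such a small vocabulary most words disappear from the sequence, which is worth keeping in mind when choosing VOCAB_SIZE.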

Note: We have already done one transformation by converting 'pos' => 1 and 'neg' => 0 when we read in the data.

In [5]:
tokenizer = Tokenizer(num_words=VOCAB_SIZE)
tokenizer.fit_on_texts(x_train)

In [6]:
# Convert the training reviews to integer sequences and pad them to a fixed length
x_train = tokenizer.texts_to_sequences(x_train)
x_train = pad_sequences(x_train, maxlen=MAX_REVIEW_LEN)

# Apply the same tokenizer (fitted on the training data only) to the test reviews
x_test = tokenizer.texts_to_sequences(x_test)
x_test = pad_sequences(x_test, maxlen=MAX_REVIEW_LEN)

After applying our tokenizer the data looks like:

In [None]:
print(x_train[0])

In [None]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score

In [None]:
model = SGDClassifier()
model.fit(x_train, y_train)
pred = model.predict(x_test)
accuracy = accuracy_score(y_test, pred)
print(accuracy)

In [31]:
def basic_model():
    model = Sequential()

    model.add(LSTM(
        input_dim=1,
        output_dim=50,
        return_sequences=True))
    model.add(Dropout(0.2))

    model.add(LSTM(
        100,
        return_sequences=False))
    model.add(Dropout(0.2))

    model.add(Dense(
        output_dim=1))
    model.add(Activation("linear"))
    model.compile(loss="mse", optimizer="rmsprop")
    return model

In [32]:
model = basic_model()
model.fit(x_train, y_train, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

ValueError: Error when checking input: expected lstm_10_input to have 3 dimensions, but got array with shape (25000, 250)
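
The error above comes from a shape mismatch: a Keras LSTM layer expects 3-dimensional input of shape (samples, timesteps, features), while the padded reviews form a 2-dimensional array of shape (samples, timesteps) holding one integer per word. An embedding layer closes that gap by replacing each integer with a vector. The following is a rough NumPy sketch of that lookup, not Keras code. The array sizes and the names token_ids and embedding_table are made up for illustration, and in a real model the table is learned during training rather than drawn at random.

```python
import numpy as np

# Hypothetical toy sizes, much smaller than the notebook's settings
vocab_size = 5      # number of distinct word indices, including 0 for padding
embedding_dim = 3   # length of the vector that represents each word

# Two padded "reviews" of four tokens each: shape (samples, timesteps)
token_ids = np.array([[0, 0, 2, 1],
                      [0, 3, 1, 4]])

# One row per word index; random here, learned by the network in practice
rng = np.random.default_rng(2018)
embedding_table = rng.normal(size=(vocab_size, embedding_dim))

# Integer-array indexing looks up one row per token and appends a features axis
embedded = embedding_table[token_ids]

print(token_ids.shape)  # (2, 4)
print(embedded.shape)   # (2, 4, 3)
```

The next model in this notebook puts an Embedding layer in front of the LSTM, which supplies this third axis.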

In [None]:
def basic_lstm_model(embedding_vector_length=32, dropout_rate=0.2):
    model = Sequential()
    model.add(Embedding(VOCAB_SIZE, embedding_vector_length, input_length=MAX_REVIEW_LEN))
    model.add(Dropout(dropout_rate))
    model.add(LSTM(100))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

In [None]:
model = basic_lstm_model()
model.fit(x_train, y_train, epochs=NUM_EPOCHS, batch_size=BATCH_SIZE)
scores = model.evaluate(x_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))