# Sentiment Analysis
### Anything Goes Implementation

This implementation is heavily based on Dr. Scannell's Keras example notebook. I tried a few implementations using other libraries such as PyTorch (which I am much more familiar with than Keras), but its NLP package, Torchtext, seems to have a significant learning curve. For this project I therefore went with Keras, since the example code made it easy to get started. I hope to spend some time in the coming weeks getting more familiar with the PyTorch NLP packages, and I plan to use them in the future.

Imports

In [1]:
import csv
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Embedding, Conv1D
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import classification_report, accuracy_score

As in my "from scratch" implementation, **to run with test data, set the `test_file` variable to the relative file path of the test data.** If you are only testing and not developing, set `validation_percent` to 0.01 or some other very small number to maximize the amount of training data available.

In [2]:
train_file = 'data/train.tsv'
validation_percent = 0.3
# if running locally, using only a subset of the overall dataset for development purposes
using_subset = True
subset_count = 5000
test_file = ''

Load all training data. This is the same code as in the "from scratch" implementation.

In [3]:
print("Loading training data...")
labels = []
inputs = []
with open(train_file, encoding='utf-8') as data:
    reader = csv.reader(data, delimiter='\t')
    idx = 0
    for row in reader:
        # Keep well-formed (label, text) rows; stop collecting after subset_count rows when using a subset
        if len(row) == 2 and (idx < subset_count or not using_subset):
            labels.append(row[0])
            inputs.append(row[1])
        idx += 1
print("Loaded {} documents".format(len(labels)))

Loading training data...
Loaded 4997 documents


Load testing data if testing.

In [4]:
test_labels = []
test_inputs = []
if len(test_file) > 0:
    print("Loading testing data...")
    with open(test_file, encoding='utf-8') as data:
        reader = csv.reader(data, delimiter='\t')
        idx = 0
        for row in reader:
            if len(row) == 2 and (idx < subset_count or not using_subset):
                test_labels.append(row[0])
                test_inputs.append(row[1])
            idx += 1
    # Report the number of test documents, not training documents
    print("Loaded {} documents".format(len(test_labels)))

Turn the train and test data into pandas DataFrames.

In [5]:
data = {'labels': labels, 'inputs': inputs}
train_data = pd.DataFrame(data=data)
# Build the test frame from the test lists rather than the training dict
test_dict = {'labels': test_labels, 'inputs': test_inputs}
test_data = pd.DataFrame(data=test_dict)

Split the training data into train and validation sets.

In [6]:
X_train, X_valid, y_train_str, y_valid_str = train_test_split(train_data['inputs'], train_data['labels'], test_size=validation_percent, random_state = 42)
y_train = [int(a) for a in y_train_str]
y_valid = [int(a) for a in y_valid_str]
y_test = [int(a) for a in test_data['labels']]
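
To make the effect of `validation_percent` concrete, the sketch below estimates how many documents end up in each split. It assumes that `train_test_split`, given a float `test_size`, rounds the held-out count up and leaves the remainder for training. `split_sizes` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
import math

def split_sizes(n_samples, validation_percent):
    """Hypothetical helper: estimate (train, validation) sizes for a float test_size."""
    # Assumption: the held-out count is rounded up and training gets the rest.
    n_valid = math.ceil(validation_percent * n_samples)
    return n_samples - n_valid, n_valid

# 4,997 documents were loaded in the subset run above.
dev_sizes = split_sizes(4997, 0.3)    # development setting
test_sizes = split_sizes(4997, 0.01)  # "testing only" setting
print(dev_sizes, test_sizes)
```

Under these assumptions, dropping `validation_percent` from 0.3 to 0.01 moves roughly 1,450 documents of the subset from validation into training.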

Create word tokens from the 10,000 most frequent words.

In [7]:
V = 10000
tokenizer = Tokenizer(num_words=V)
tokenizer.fit_on_texts(train_data['inputs'])
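
As a rough illustration of what a frequency-capped tokenizer does, here is a simplified pure-Python sketch. It is not the Keras implementation: it only lowercases and splits on whitespace, whereas `Tokenizer` also filters punctuation. The helper names `fit_word_index` and `texts_to_ids` are hypothetical. The sketch assumes that index 1 goes to the most frequent word, that index 0 is left free for padding, and that only indices below `num_words` survive.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_ids(texts, word_index, num_words):
    """Map words to ids, dropping unknown words and ids >= num_words."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

toy_texts = ["good movie", "bad movie", "good good plot"]  # made-up examples
word_index = fit_word_index(toy_texts)
ids = texts_to_ids(["good bad movie unseen"], word_index, num_words=3)
print(word_index, ids)
```

In this toy run, "bad" is too rare to make the cut and "unseen" was never seen during fitting, so both are silently dropped from the sequence.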

Generate sequences and pad with 0s so all lengths are equal.

In [8]:
max_length = max([len(document.split()) for document in train_data['inputs']])
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_valid_seq = tokenizer.texts_to_sequences(X_valid)
X_test_seq = tokenizer.texts_to_sequences(test_data['inputs'])
X_train_padded = pad_sequences(X_train_seq, maxlen=max_length, padding='post')
X_valid_padded = pad_sequences(X_valid_seq, maxlen=max_length, padding='post')
X_test_padded = pad_sequences(X_test_seq, maxlen=max_length, padding='post')
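
The following sketch mimics post-padding on plain Python lists to show what the padded arrays look like. `pad_post` is a hypothetical stand-in, not the Keras function. It assumes that sequences longer than `maxlen` keep their last `maxlen` tokens, which is my understanding of the default `truncating='pre'` behaviour of `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Hypothetical helper: right-pad each sequence to maxlen with `value`."""
    padded = []
    for seq in sequences:
        # Assumption: over-long sequences keep their last `maxlen` tokens.
        trimmed = list(seq)[-maxlen:]
        padded.append(trimmed + [value] * (maxlen - len(trimmed)))
    return padded

padded = pad_post([[5, 2], [7, 1, 4, 9, 3], []], maxlen=4)
print(padded)
```

Because `max_length` is computed from the training data, a longer document in the validation or test set would be truncated rather than extending the input shape.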

I made two major changes to the model. First, I added two more LSTM layers, for a total of three, because a single layer did not seem to capture the complexity the task requires. I also tried a few other layer types, such as a 1D convolutional layer, but settled on a purely LSTM stack.
Second, I increased the learning rate to 0.05 in an attempt to speed up training and avoid getting trapped in local minima.

In [9]:
emb_dim = 100
model = Sequential()
model.add(Embedding(input_dim=V, output_dim=emb_dim, input_length=max_length))
model.add(LSTM(100, dropout=0.2, return_sequences=True))
model.add(LSTM(100, dropout=0.1, return_sequences=True))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
optimizer = Adam(learning_rate=0.05)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 557, 100)          1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 557, 100)          80400     
_________________________________________________________________
lstm_1 (LSTM)                (None, 557, 100)          80400     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100)               80400     
_________________________________________________________________
dense (Dense)                (None, 1)                 101       
Total params: 1,241,301
Trainable params: 1,241,301
Non-trainable params: 0
_________________________________________________________________
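
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard formulas for embedding, LSTM, and dense layers; the helper function names are hypothetical and exist only for this check.

```python
def embedding_params(vocab_size, emb_dim):
    # One emb_dim-sized vector per vocabulary entry
    return vocab_size * emb_dim

def lstm_params(input_dim, units):
    # Four weight blocks (three gates plus the candidate cell state),
    # each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

V, emb_dim, units = 10000, 100, 100
counts = [
    embedding_params(V, emb_dim),
    lstm_params(emb_dim, units),   # first LSTM reads the embeddings
    lstm_params(units, units),     # second LSTM reads the first LSTM's outputs
    lstm_params(units, units),     # third LSTM
    dense_params(units, 1),        # sigmoid output unit
]
total = sum(counts)
print(counts, total)
```

The embedding table accounts for about 80% of the weights, so the vocabulary size `V` and `emb_dim` dominate the model size far more than the number of LSTM layers.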


Fit the model for 50 epochs with a batch size of 64.

In [None]:
model.fit(X_train_padded, np.asarray(y_train), batch_size=64, epochs=50, validation_data=(X_valid_padded, np.asarray(y_valid)),verbose=1)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50

In [None]:
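# predict_classes is not available in recent TensorFlow releases; threshold the sigmoid output at 0.5 instead
y_classes = (model.predict(X_valid_padded) > 0.5).astype('int32')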
print(accuracy_score(y_valid, y_classes))
print(classification_report(y_valid, y_classes))
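
Since the model ends in a single sigmoid unit, class labels come from thresholding its output probabilities. The sketch below shows that step and the accuracy calculation in plain Python. The probabilities and labels are made up for illustration, and `to_classes` and `accuracy` are hypothetical helpers rather than Keras or scikit-learn functions.

```python
def to_classes(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

probs = [0.91, 0.12, 0.55, 0.49]  # made-up sigmoid outputs
y_true = [1, 0, 0, 0]             # made-up gold labels
y_pred = to_classes(probs)
acc = accuracy(y_true, y_pred)
print(y_pred, acc)
```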

if len(test_file) > 0:
    # Reuse the tokenizer and max_length from training so the input shape matches the model
    X_test_seq = tokenizer.texts_to_sequences(test_data['inputs'])
    X_test_padded = pad_sequences(X_test_seq, maxlen=max_length, padding='post')

    # Threshold the sigmoid output at 0.5 to get class labels
    y_classes = (model.predict(X_test_padded) > 0.5).astype('int32')
    print(accuracy_score(y_test, y_classes))
    print(classification_report(y_test, y_classes))