# Beta Testing Opinions | Sentiment Analysis Model

_Author: Karolina Mamczarz_

_Based on: [Deep Learning Nanodegree Program | Udacity](https://www.udacity.com/course/deep-learning-nanodegree--nd101)_

_Code sources:_ 
* _[Udacity SageMaker deployment project](https://github.com/udacity/sagemaker-deployment/tree/master/Project)_
* _[Udacity Recurrent Neural Networks | Sentiment Analysis RNN](https://github.com/udacity/deep-learning-v2-pytorch/tree/master/sentiment-rnn)_
* _[Amazon Review Data (2018) | Code](https://nijianmo.github.io/amazon/index.html)_


## Description

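This notebook trains a sentiment analysis model, a recurrent neural network with an LSTM layer, on Amazon product reviews. The model is trained with PyTorch, an open-source machine learning framework.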

## Load dataset

This research uses the [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/index.html) datasets (downloaded on March 4, 2020):
* [Video Games subset](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz)
* [Software subset](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Software_5.json.gz)

See the citation below:

> Jianmo Ni, Jiacheng Li, Julian McAuley, **Justifying recommendations using distantly-labeled reviews and fine-grained aspects**, _Empirical Methods in Natural Language Processing (EMNLP)_, 2019

### Read sentiment data

In [1]:
import gzip
import json

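def parse_dataset(path):
    # Stream one JSON record per line; the context manager closes the file.
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)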

In [2]:
def get_sentiment_data(path):
    data = {'pos': [], 'neg': []}
    labels = {'pos': [], 'neg': []}

    for review in parse_dataset(path):
        if 'reviewText' in review:
            if review['overall'] >= 4.0:
                data['pos'].append(review['reviewText'])
                labels['pos'].append(1)
            elif review['overall'] <= 2.0:
                data['neg'].append(review['reviewText'])
                labels['neg'].append(0)
    
    for sentiment in ['pos', 'neg']:
        assert len(data[sentiment]) == len(labels[sentiment]), \
                    "{} data size does not match labels size".format(sentiment)
    
    return data, labels   

In [3]:
video_games_data, video_games_labels = get_sentiment_data('./data/Video_Games_5.json.gz')
software_data, software_labels = get_sentiment_data('./data/Software_5.json.gz')

print('Reviews Video Games: {} pos / {} neg'.format(len(video_games_data['pos']), len(video_games_data['neg'])))
print('Reviews Software: {} pos / {} neg'.format(len(software_data['pos']), len(software_data['neg'])))

Reviews Video Games: 393267 pos / 55012 neg
Reviews Software: 8987 pos / 2219 neg
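
The labeling rule in `get_sentiment_data` can be illustrated on a handful of ratings. This is a minimal sketch; `label_for_rating` is a hypothetical helper that is not part of the notebook. Ratings of 4 and above map to the positive label, ratings of 2 and below map to the negative label, and 3-star reviews are skipped.

```python
def label_for_rating(overall):
    """Return 1 for positive, 0 for negative, None for skipped (neutral) ratings."""
    if overall >= 4.0:
        return 1
    if overall <= 2.0:
        return 0
    return None


ratings = [5.0, 4.0, 3.0, 2.0, 1.0]
print([label_for_rating(r) for r in ratings])  # [1, 1, None, 0, 0]
```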


In [4]:
def join_sentiment_data(data1, data2, labels1, labels2):
    data = {'pos': [], 'neg': []}
    labels = {'pos': [], 'neg': []}
    
    for sentiment in ['pos', 'neg']:
        data[sentiment] = data1[sentiment] + data2[sentiment]
        labels[sentiment] = labels1[sentiment] + labels2[sentiment]
    
    return data, labels

In [5]:
pre_data, pre_labels = join_sentiment_data(video_games_data, software_data, video_games_labels, software_labels)

print('Data: {} pos / {} neg'.format(len(pre_data['pos']), len(pre_data['neg'])))

Data: 402254 pos / 57231 neg


In [6]:
def crop_sentiment_data(data, limit=25000):
    new_data = {'pos': [], 'neg': []}

    for sentiment in ['pos', 'neg']:
        new_data[sentiment] = data[sentiment][0:limit]
        
    return new_data

In [7]:
data = crop_sentiment_data(pre_data)
labels = crop_sentiment_data(pre_labels)

print('Data: {} pos / {} neg'.format(len(data['pos']), len(data['neg'])))
print('Labels: {} pos / {} neg'.format(len(labels['pos']), len(labels['neg'])))

Data: 25000 pos / 25000 neg
Labels: 25000 pos / 25000 neg


In [8]:
def combine_sentiment_data(data, labels):
    all_data = data['pos'] + data['neg']
    all_labels = labels['pos'] + labels['neg']
    
    return all_data, all_labels

In [9]:
all_data, all_labels = combine_sentiment_data(data, labels)
print('Data: {}'.format(len(all_data)))
print('Labels: {}'.format(len(all_labels)))

Data: 50000
Labels: 50000


### Clean up sentiment data

In [10]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    # Build the stop-word set once per call instead of once per word.
    stop_words = set(stopwords.words("english"))
    
    text = BeautifulSoup(review, "html.parser").get_text()  # strip HTML markup
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())       # keep letters and digits only
    words = text.split()
    words = [w for w in words if w not in stop_words]
    words = [stemmer.stem(w) for w in words]
    
    return words

In [11]:
print(review_to_words(all_data[1266]))

['rememb', 'first', 'time', 'laid', 'eye', 'game', 'hear', 'awesom', 'nintendo', '64', 'go', 'schoolmat', 'eagerli', 'anticip', 'usa', 'releas', 'system', 'rememb', 'hear', 'demand', 'ensu', 'shortag', 'upcom', 'christma', 'even', 'saw', 'one', 'day', 'christma', 'shop', 'dad', 'happen', 'come', 'across', 'demo', 'set', 'mall', 'final', 'got', 'see', 'massiv', 'hype', 'game', 'bright', 'color', 'mario', 'fulli', '3d', 'larg', 'crowd', 'gather', 'around', 'catch', 'glimps', 'latest', 'video', 'game', 'technolog', 'linger', 'moment', 'yet', 'first', 'impress', 'still', 'fresh', 'memori', 'mario', '64', 'look', 'like', 'noth', 'ever', 'seen', 'also', 'look', 'leap', 'bound', 'better', 'anyth', 'ever', 'seen', 'playstat', 'sega', 'saturn', 'system', 'impress', 'sever', 'month', 'later', 'final', 'save', 'enough', 'money', 'purchas', 'system', 'bundl', 'mario', '64', 'love', 'game', 'dearli', 'back', 'day', 'still', 'love', 'game', 'still', 'look', 'feel', 'play', 'unlik', 'game', 'came', '

In [14]:
import pickle, os

cache_dir = os.path.join("./cache", "sentiment_analysis")
os.makedirs(cache_dir, exist_ok=True)

def reviews_to_words(data, cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
        except (OSError, EOFError, pickle.UnpicklingError):
            pass  # no usable cache file; the data is rebuilt below
    
    if cache_data is None:
        words = [review_to_words(review) for review in data]
        
        if cache_file is not None:
            cache_data = dict(words=words)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        words = cache_data['words']
    
    return words

In [29]:
reviews_words = reviews_to_words(all_data)
all_words = [item for sublist in reviews_words for item in sublist]

Read preprocessed data from cache file: preprocessed_data.pkl


### Tokenize words

In [30]:
from collections import Counter

def tokenize_words(all_words, reviews_words):
    counts = Counter(all_words)
    vocab = sorted(counts, key=counts.get, reverse=True)
    vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)}
    reviews_ints = []
    for review_words in reviews_words:
        reviews_ints.append([vocab_to_int[word] for word in review_words])
        
    return vocab_to_int, reviews_ints

In [43]:
vocab_to_int, pre_reviews_ints = tokenize_words(all_words, reviews_words)

print('Unique words: {}'.format(len((vocab_to_int))))
print('Tokenized review: \n {}'.format(pre_reviews_ints[:1]))

Unique words: 50770
Tokenized review: 
 [[1, 137, 83, 4, 1219, 11]]
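
To make the frequency-based indexing concrete, here is a minimal sketch on a toy corpus. The names `toy_reviews`, `toy_vocab_to_int` and `toy_ints` are hypothetical and the data is made up; the logic mirrors `tokenize_words`, where the most frequent word receives index 1 and index 0 stays free for padding.

```python
from collections import Counter

# Toy corpus of already-cleaned reviews (hypothetical data).
toy_reviews = [["good", "game"], ["bad", "game", "good"], ["good"]]
toy_words = [w for review in toy_reviews for w in review]

counts = Counter(toy_words)
# Sort words by descending frequency; enumerate from 1 so that 0 is never a word index.
vocab = sorted(counts, key=counts.get, reverse=True)
toy_vocab_to_int = {word: i for i, word in enumerate(vocab, 1)}
toy_ints = [[toy_vocab_to_int[w] for w in review] for review in toy_reviews]

print(toy_vocab_to_int)  # {'good': 1, 'game': 2, 'bad': 3}
print(toy_ints)          # [[1, 2], [3, 2, 1], [1]]
```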


### Remove zero-length reviews

In [44]:
review_lens = Counter([len(x) for x in pre_reviews_ints])

print('Zero-length reviews: {}'.format(review_lens[0]))
print('Maximum review length: {}'.format(max(review_lens)))

Zero-length reviews: 55
Maximum review length: 2996


In [45]:
import numpy as np

def remove_zero_length_reviews(reviews_ints, labels):
    non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

    new_reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
    labels = np.array([labels[ii] for ii in non_zero_idx])
    
    return new_reviews_ints, labels

In [46]:
reviews_ints, encoded_labels = remove_zero_length_reviews(pre_reviews_ints, all_labels)

print('Number of reviews after removing zero-length review: {}'.format(len(reviews_ints)))
print('Number of labels after removing zero-length review: {}'.format(len(encoded_labels)))

Number of reviews after removing zero-length review: 49945
Number of labels after removing zero-length review: 49945


### Pad features

In [50]:
def pad_features(reviews_ints, seq_length=200):
    features = np.zeros((len(reviews_ints), seq_length), dtype=int)
    for i, row in enumerate(reviews_ints):
        features[i, -len(row):] = np.array(row)[:seq_length]
        
    assert len(features) == len(reviews_ints), "Features should have as many rows as reviews."
    assert len(features[0]) == seq_length, "Each feature row should contain seq_length values."
    
    return features

In [55]:
features = pad_features(reviews_ints)
print(features[:25,:12])

[[    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0     0     0]
 [  269   488   339  3626
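
The padding rule can be sketched without NumPy on two short rows. `pad_left` is a hypothetical helper that is not part of the notebook; it mirrors the behavior of `pad_features` for non-empty rows: shorter reviews are left-padded with zeros, and longer reviews keep only their first `seq_length` tokens. Unlike the NumPy version, this sketch also accepts an empty row.

```python
def pad_left(row, seq_length):
    """Left-pad with zeros, or keep only the first seq_length tokens."""
    row = list(row)[:seq_length]
    return [0] * (seq_length - len(row)) + row


print(pad_left([5, 6], 4))           # [0, 0, 5, 6]
print(pad_left([1, 2, 3, 4, 5], 4))  # [1, 2, 3, 4]
```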

### Prepare training, validation and test data

In [112]:
def get_train_valid_test_datasets(features, encoded_labels, batch_size, train_size=39000):
    rest = len(features) % batch_size
    if rest:
        new_features_len = len(features) - rest
        features = features[:new_features_len]
        encoded_labels = encoded_labels[:new_features_len]
    
    train_x, remain_x = features[:train_size], features[train_size:]
    train_y, remain_y = encoded_labels[:train_size], encoded_labels[train_size:]

    test_idx = int(len(remain_x)*0.5)
    valid_x, test_x = remain_x[:test_idx], remain_x[test_idx:]
    valid_y, test_y = remain_y[:test_idx], remain_y[test_idx:]
    
    return train_x, train_y, valid_x, valid_y, test_x, test_y

In [113]:
batch_size = 50
train_x, train_y, valid_x, valid_y, test_x, test_y = get_train_valid_test_datasets(features, encoded_labels, batch_size)


print('Training dataset: {}'.format(train_x.shape))
print('Validation dataset: {}'.format(valid_x.shape))
print('Test dataset: {}'.format(test_x.shape))

Training dataset: (39000, 200)
Validation dataset: (5450, 200)
Test dataset: (5450, 200)
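
The split sizes above follow from simple arithmetic, sketched here with the standard library only. `split_sizes` and `positives_per_split` are hypothetical helpers that are not part of the notebook. The second helper assumes the labels are still in the order returned by `combine_sentiment_data` (all positives first, then all negatives), with toy label counts chosen to add up to the 49900 usable rows. Under that assumption the validation and test slices contain no positive labels, so it may be worth checking `valid_y.mean()` and `test_y.mean()` before interpreting the test accuracy.

```python
def split_sizes(n_samples, batch_size, train_size):
    """Reproduce the size arithmetic of get_train_valid_test_datasets."""
    usable = n_samples - n_samples % batch_size  # drop the incomplete last batch
    remaining = usable - train_size
    valid = int(remaining * 0.5)
    test = remaining - valid
    return train_size, valid, test


def positives_per_split(labels, train_size, valid_size):
    """Count positive labels in consecutive train/valid/test slices."""
    train = labels[:train_size]
    valid = labels[train_size:train_size + valid_size]
    test = labels[train_size + valid_size:]
    return sum(train), sum(valid), sum(test)


print(split_sizes(49945, 50, 39000))  # (39000, 5450, 5450)

# Toy labels in concatenation order: positives first, then negatives.
ordered_labels = [1] * 25000 + [0] * 24900
print(positives_per_split(ordered_labels, 39000, 5450))  # (25000, 0, 0)
```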


In [130]:
unique_train_reviews = set(map(tuple, train_x))
print("Train data set has {} reviews that have the same content.".format(len(train_x) - len(unique_train_reviews)))

unique_valid_reviews = set(map(tuple, valid_x))
print("Validation data set has {} reviews that have the same content.".format(len(valid_x) - len(unique_valid_reviews)))

unique_test_reviews = set(map(tuple, test_x))
print("Test data set has {} reviews that have the same content.".format(len(test_x) - len(unique_test_reviews)))

Train data set has 2245 reviews that have the same content.
Validation data set has 1302 reviews that have the same content.
Test data set has 79 reviews that have the same content.
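
The duplicate count relies on converting each row to a hashable tuple and comparing the number of rows with the number of distinct tuples. A minimal sketch on made-up rows follows (`rows` is hypothetical data). Note that rows are compared after padding and truncation, so two reviews that differ only beyond the first 200 tokens would also count as duplicates.

```python
# Four encoded rows, three of which are identical (hypothetical data).
rows = [[0, 0, 3, 7], [0, 0, 3, 7], [0, 1, 2, 4], [0, 0, 3, 7]]

unique_rows = set(map(tuple, rows))  # lists are not hashable, tuples are
n_duplicates = len(rows) - len(unique_rows)

print(n_duplicates)  # 2
```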


In [114]:
import torch
from torch.utils.data import TensorDataset, DataLoader

def get_data_loader(dataset_x, dataset_y, batch_size=batch_size):
    tensor_dataset = TensorDataset(torch.from_numpy(dataset_x), torch.from_numpy(dataset_y))
    data_loader = DataLoader(tensor_dataset, shuffle=True, batch_size=batch_size)
    
    return data_loader

In [115]:
train_loader = get_data_loader(train_x, train_y)
valid_loader = get_data_loader(valid_x, valid_y)
test_loader = get_data_loader(test_x, test_y)

In [116]:
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample dataset size: {}'.format(sample_x.size()))
print('Sample data: \n{}'.format(sample_x))
print('Sample label: \n{}'.format(sample_y))

Sample dataset size: torch.Size([50, 200])
Sample data: 
tensor([[   0,    0,    0,  ...,  732,   23, 2157],
        [   0,    0,    0,  ...,    2,  719,  825],
        [   0,    0,    0,  ...,    5,  940,  373],
        ...,
        [   0,    0,    0,  ..., 1048,   17,   63],
        [   0,    0,    0,  ...,    1,    4,  806],
        [   0,    0,    0,  ..., 1193,    1,  689]], dtype=torch.int32)
Sample label: 
tensor([0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
        1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1,
        1, 1], dtype=torch.int32)


### Sentiment recurrent neural network with a long short-term memory (LSTM) layer

In [117]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x, hidden):
        batch_size = x.size(0)
        x = x.long()
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        out = self.dropout(lstm_out)
        out = self.fc(out)
        sig_out = self.sig(out)
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1]

        return sig_out, hidden
    
    
    def init_hidden(self, batch_size, use_cuda):
        weight = next(self.parameters()).data
        
        if use_cuda:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                  weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

### Hyperparameters and model configuration

In [118]:
vocab_size = len(vocab_to_int) + 1
output_size = 1
embedding_dim = 400
hidden_dim = 256
n_layers = 2

model = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(model)

SentimentRNN(
  (embedding): Embedding(50771, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
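
As a rough sanity check on the printed architecture, the number of trainable parameters can be worked out by hand. This is a sketch using only the sizes shown above; `lstm_layer_params` is a hypothetical helper, and it assumes PyTorch's LSTM layout of four gates with input weights, recurrent weights and two bias vectors per layer. The total should agree with `sum(p.numel() for p in model.parameters())`.

```python
def lstm_layer_params(input_size, hidden_size):
    # Four gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)


vocab_size, embedding_dim, hidden_dim, output_size = 50771, 400, 256, 1

embedding = vocab_size * embedding_dim
# Layer 1 reads the embeddings; layer 2 reads the hidden states of layer 1.
lstm = lstm_layer_params(embedding_dim, hidden_dim) + lstm_layer_params(hidden_dim, hidden_dim)
fc = hidden_dim * output_size + output_size  # weights plus bias
total = embedding + lstm + fc

print(embedding, lstm, fc, total)  # 20308400 1200128 257 21508785
```

Most of the parameters sit in the embedding table, which is why the vocabulary size dominates the model size here.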


### Training

In [121]:
def train(model, train_loader, valid_loader, epochs, loss_fn, optimizer, use_cuda, save_path, batch_size=batch_size):
    counter = 0
    print_every = 100
    clip = 5
    val_loss_min = np.inf

    if use_cuda:
        model.cuda()

    model.train()
    for e in range(epochs):
        h = model.init_hidden(batch_size, use_cuda)

        for inputs, labels in train_loader:
            counter += 1

            if use_cuda:
                inputs, labels = inputs.cuda(), labels.cuda()

            h = tuple([each.data for each in h])
            model.zero_grad()
            output, h = model(inputs, h)
            loss = loss_fn(output.squeeze(), labels.float())
            loss.backward()
            nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()

            if counter % print_every == 0:
                val_h = model.init_hidden(batch_size, use_cuda)
                val_losses = []
                model.eval()
                
                for inputs, labels in valid_loader:

                    val_h = tuple([each.data for each in val_h])

                    if (use_cuda):
                        inputs, labels = inputs.cuda(), labels.cuda()

                    output, val_h = model(inputs, val_h)
                    val_loss = loss_fn(output.squeeze(), labels.float())

                    val_losses.append(val_loss.item())

                model.train()
                mean_val_losses = np.mean(val_losses)
                print("Epoch: {}/{}...".format(e+1, epochs),
                      "Step: {}...".format(counter),
                      "Loss: {:.6f}...".format(loss.item()),
                      "Val Loss: {:.6f}".format(mean_val_losses))
                
                if mean_val_losses <= val_loss_min:
                    print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(val_loss_min, mean_val_losses))
                    torch.save(model.state_dict(), save_path)
                    val_loss_min = mean_val_losses
    
    return model

In [128]:
use_cuda = torch.cuda.is_available()
loss_fn = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
epochs = 4
train_model_save_path = './cache/sentiment_analysis/train.pt'

In [122]:
trained_model = train(model, train_loader, valid_loader, epochs, loss_fn, optimizer, use_cuda, train_model_save_path)

Epoch: 1/4... Step: 100... Loss: 0.424100... Val Loss: 0.592706
Validation loss decreased (inf --> 0.592706).  Saving model ...
Epoch: 1/4... Step: 200... Loss: 0.318558... Val Loss: 0.516958
Validation loss decreased (0.592706 --> 0.516958).  Saving model ...
Epoch: 1/4... Step: 300... Loss: 0.493396... Val Loss: 0.582965
Epoch: 1/4... Step: 400... Loss: 0.385282... Val Loss: 0.518720
Epoch: 1/4... Step: 500... Loss: 0.437951... Val Loss: 0.603074
Epoch: 1/4... Step: 600... Loss: 0.422860... Val Loss: 0.458582
Validation loss decreased (0.516958 --> 0.458582).  Saving model ...
Epoch: 1/4... Step: 700... Loss: 0.316570... Val Loss: 0.381392
Validation loss decreased (0.458582 --> 0.381392).  Saving model ...
Epoch: 2/4... Step: 800... Loss: 0.127956... Val Loss: 0.369119
Validation loss decreased (0.381392 --> 0.369119).  Saving model ...
Epoch: 2/4... Step: 900... Loss: 0.200017... Val Loss: 0.329930
Validation loss decreased (0.369119 --> 0.329930).  Saving model ...
Epoch: 2/4... S

In [124]:
trained_model.load_state_dict(torch.load(train_model_save_path))

### Testing

In [126]:
def test(model, test_loader, loss_fn, use_cuda):
    test_losses = []
    num_correct = 0
    
    if use_cuda:
        model.cuda()

    h = model.init_hidden(batch_size, use_cuda)

    model.eval()
    for inputs, labels in test_loader:

        h = tuple([each.data for each in h])

        if use_cuda:
            inputs, labels = inputs.cuda(), labels.cuda()

        output, h = model(inputs, h)
        test_loss = loss_fn(output.squeeze(), labels.float())
        test_losses.append(test_loss.item())
        pred = torch.round(output.squeeze())
        correct_tensor = pred.eq(labels.float().view_as(pred))
        correct = np.squeeze(correct_tensor.numpy()) if not use_cuda else np.squeeze(correct_tensor.cpu().numpy())
        num_correct += np.sum(correct)

    print("Test loss: {:.3f}".format(np.mean(test_losses)))
    test_acc = num_correct/len(test_loader.dataset)
    print("Test accuracy: {:.3f}".format(test_acc))

In [127]:
test(trained_model, test_loader, loss_fn, use_cuda)

Test loss: 0.194
Test accuracy: 0.928
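
The accuracy figure is the share of reviews whose rounded sigmoid output equals the label. A minimal sketch on made-up values follows; `accuracy` is a hypothetical helper. It uses a strict `> 0.5` threshold, which matches `torch.round` for an output of exactly 0.5 because that value rounds to 0.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of predictions that match the labels after thresholding."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


print(accuracy([0.9, 0.2, 0.6, 0.4], [1, 0, 0, 0]))  # 0.75
```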


### Test custom reviews

In [143]:
def tokenize_review(review, vocab_to_int):
    words = review_to_words(review)
    # Skip words that are missing from the training vocabulary to avoid a KeyError.
    ints = [[vocab_to_int[word] for word in words if word in vocab_to_int]]

    return ints
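
Custom reviews may contain words that never appeared in the training vocabulary and therefore have no index. One way to handle them, sketched here on a hypothetical toy vocabulary, is to drop such words during encoding; `toy_vocab` and `encode` are illustrative names only.

```python
toy_vocab = {"good": 1, "game": 2, "bad": 3}


def encode(words, vocab):
    # Drop words without an index instead of raising KeyError.
    return [vocab[w] for w in words if w in vocab]


print(encode(["good", "unseenword", "game"], toy_vocab))  # [1, 2]
```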

In [151]:
def predict(model, review, vocab_to_int, use_cuda):
    if use_cuda:
        model.cuda()
        
    model.eval()
    ints = tokenize_review(review, vocab_to_int)
    features = pad_features(ints)
    feature_tensor = torch.from_numpy(features)
    batch_size = feature_tensor.size(0)
    h = model.init_hidden(batch_size, use_cuda)
    
    if use_cuda:
        feature_tensor = feature_tensor.cuda()
    
    output, h = model(feature_tensor, h)
    
    pred = torch.round(output.squeeze()) 
    print('Prediction value, pre-rounding: {:.6f}'.format(output.item()))
    
    sentiment_message = '{} review detected.'.format('Positive' if pred.item() == 1 else 'Negative')
    print(sentiment_message)

In [80]:
test_review_pos = "It's not like that app is good or bad. It sometimes just work but you don't feel it's comfortable. I got this great chance to use an app which is good and comfortable."
predict(trained_model, test_review_pos, vocab_to_int, use_cuda)

Prediction value, pre-rounding: 0.859440
Positive review detected.


In [81]:
test_review_neg = "If this application did not have so many errors, I would like it."
predict(trained_model, test_review_neg, vocab_to_int, use_cuda)

Prediction value, pre-rounding: 0.024830
Negative review detected.


In [82]:
test_review_sentiment = "To say this application is bad it's like to lie. I don't like to lie that's why I won't say it."
predict(trained_model, test_review_sentiment, vocab_to_int, use_cuda)

Prediction value, pre-rounding: 0.244205
Negative review detected.


In [142]:
test_review_long_1 = "First and foremost, it is a great application for someone who is looking for something simple yet sophisticated. It allows for fair number of actions and allows to automate long and tireing processes. With it's admin pannel it is possible to manage and supervise so many aspects of daily tasks, reducing boring tasks to minimum and maximizing cool and initiative challenges! I don't like the layout though. It seems old and not responsive. In 20s I would expect more wow factor. Despite that all the advantages overcome slight shortcomings. The application is superior to anything available on the market right now and I foresee it to stay like that for a long time. I was looking for application like this!"
predict(trained_model, test_review_long_1, vocab_to_int, use_cuda)

Prediction value, pre-rounding: 0.056693
Negative review detected.


In [140]:
test_review_long_2 = "3D designers did a fine job with landscapes, so user can get impression as if those are from real movie. However, sometimes the interface is sinking with some objects, especially on level two, which spoils the whole effect. Critical error appeared when I wanted to restart game from checkpoint. It needs fixing."
predict(trained_model, test_review_long_2, vocab_to_int, use_cuda)

Prediction value, pre-rounding: 0.382295
Negative review detected.


In [143]:
test_review_short = "App is not working properly. I get error on when I'm tryng to reach any item of the list."
predict(trained_model, test_review_short, vocab_to_int, use_cuda)

Prediction value, pre-rounding: 0.007539
Negative review detected.
