# Beta Testing Opinions | Sentiment Analysis Model

_Author: Karolina Mamczarz_

_Based on: [Deep Learning Nanodegree Program | Udacity](https://www.udacity.com/course/deep-learning-nanodegree--nd101)_

## Description

The model is trained with PyTorch, an open-source machine learning framework.

## Load dataset

This research uses the [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/index.html) datasets (downloaded on March 4, 2020):
* [Video Games subset](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz)
* [Software subset](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Software_5.json.gz)

See the citation below:

> Jianmo Ni, Jiacheng Li, Julian McAuley, **Justifying recommendations using distantly-labeled reviews and fine-grained aspects**, _Empirical Methods in Natural Language Processing (EMNLP)_, 2019

### Read sentiment data

In [106]:
import gzip
import json

def parse_dataset(path):
    # Each line of the gzipped file is one JSON-encoded review
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)

In [107]:
def get_sentiment_data(path):
    data = {'pos': [], 'neg': []}
    labels = {'pos': [], 'neg': []}

    for review in parse_dataset(path):
        if 'reviewText' in review:
            if review['overall'] >= 4.0:
                data['pos'].append(review['reviewText'])
                labels['pos'].append(1)
            elif review['overall'] <= 2.0:
                data['neg'].append(review['reviewText'])
                labels['neg'].append(0)
    
    for sentiment in ['pos', 'neg']:
        assert len(data[sentiment]) == len(labels[sentiment]), \
                    "{} data size does not match labels size".format(sentiment)
    
    return data, labels   

In [108]:
pre_data_train, pre_labels_train = get_sentiment_data('./data/Video_Games_5.json.gz')
pre_data_test, pre_labels_test = get_sentiment_data('./data/Software_5.json.gz')

print('Reviews Video Games: {} pos / {} neg'.format(len(pre_data_train['pos']), len(pre_data_train['neg'])))
print('Reviews Software: {} pos / {} neg'.format(len(pre_data_test['pos']), len(pre_data_test['neg'])))

Reviews Video Games: 393267 pos / 55012 neg
Reviews Software: 8987 pos / 2219 neg
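
The thresholds used in `get_sentiment_data` can be isolated into a small helper to make the labelling rule explicit: ratings of 4 and above count as positive, ratings of 2 and below as negative, and anything in between (in practice, 3-star reviews) is skipped. The helper name `rating_to_label` is hypothetical and not part of the notebook; it is only a sketch of the rule.

```python
def rating_to_label(overall):
    """Map a star rating to a binary sentiment label, or None if neutral."""
    if overall >= 4.0:
        return 1   # positive
    if overall <= 2.0:
        return 0   # negative
    return None    # neutral ratings are left out of the dataset

ratings = [5.0, 4.0, 3.0, 2.0, 1.0]
labels = [rating_to_label(r) for r in ratings]
print(labels)  # [1, 1, None, 0, 0]
```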


### Adjust probe number

In [109]:
# Adjustment for the case where the train dataset is bigger than the test dataset
# Reviews Video Games: 393267 pos / 55012 neg
# Reviews Software: 8987 pos / 2219 neg

def adjust_probe_number(train, test):
    probe_number = 10000
    adj_train = {'pos': [], 'neg': []}
    adj_test = {'pos': [], 'neg': []}
    
    for sentiment in ['pos', 'neg']:
        probe_test_diff = abs(probe_number - len(test[sentiment]))
        
        adj_test[sentiment] = test[sentiment] + train[sentiment][0:probe_test_diff]
        adj_train[sentiment] = train[sentiment][probe_test_diff:probe_test_diff+probe_number]
    
    return adj_train, adj_test

In [110]:
data_train, data_test = adjust_probe_number(pre_data_train, pre_data_test)
labels_train, labels_test = adjust_probe_number(pre_labels_train, pre_labels_test)

print('Reviews Video Games after probe number adjustment: {} pos / {} neg (labels  {} pos / {} neg)'.format(len(data_train['pos']), len(data_train['neg']), len(labels_train['pos']), len(labels_train['neg'])))
print('Reviews Software after probe number adjustment: {} pos / {} neg (labels  {} pos / {} neg)'.format(len(data_test['pos']), len(data_test['neg']), len(labels_test['pos']), len(labels_test['neg'])))

Reviews Video Games after probe number adjustment: 10000 pos / 10000 neg (labels  10000 pos / 10000 neg)
Reviews Software after probe number adjustment: 10000 pos / 10000 neg (labels  10000 pos / 10000 neg)
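
The slicing in `adjust_probe_number` is easier to follow on toy lists. The sketch below mirrors it for a single sentiment list; the function name `top_up_and_split` and the toy data are hypothetical. With the real counts above, the same arithmetic means the test set is topped up with 10000 - 8987 = 1013 positive and 10000 - 2219 = 7781 negative Video Games reviews. Because of the `abs`, the logic only behaves as intended when the test list is shorter than the target.

```python
def top_up_and_split(train, test, target):
    """Mirror the slicing in adjust_probe_number for one sentiment list."""
    missing = abs(target - len(test))             # items the test list lacks
    adj_test = test + train[:missing]             # borrow from the head of train
    adj_train = train[missing:missing + target]   # next `target` items, no overlap
    return adj_train, adj_test

train = ["vg{}".format(i) for i in range(10)]   # stand-in for Video Games reviews
test = ["sw0", "sw1"]                            # stand-in for Software reviews
adj_train, adj_test = top_up_and_split(train, test, target=4)
print(adj_test)    # ['sw0', 'sw1', 'vg0', 'vg1']
print(adj_train)   # ['vg2', 'vg3', 'vg4', 'vg5']
```

The borrowed items come from the front of the training list and the training slice starts right after them, so no review ends up in both sets.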


### Combine and shuffle sentiment data

In [111]:
from sklearn.utils import shuffle

def get_combined_data(data_train, labels_train, data_test, labels_test):
    d_train = data_train['pos'] + data_train['neg']
    l_train = labels_train['pos'] + labels_train['neg']
    d_test = data_test['pos'] + data_test['neg']
    l_test = labels_test['pos'] + labels_test['neg']
    
    d_train, l_train = shuffle(d_train, l_train)
    d_test, l_test = shuffle(d_test, l_test)
    
    return d_train, d_test, l_train, l_test

In [112]:
pre_train_X, pre_test_X, pre_train_Y, pre_test_Y = get_combined_data(data_train, labels_train, data_test, labels_test)
print("Reviews (combined): train = {}, test = {}".format(len(pre_train_X), len(pre_test_X)))

Reviews (combined): train = 20000, test = 20000


In [113]:
print(pre_train_X[100])
print(pre_train_Y[100])

This ain't Resident Evil, folks. More like Gears of Evil. The graphics are nice, too bad they decided to skip out on the gameplay or plot. It took 4 years to make this? Capcom pumped out the classics of Resident Evil 1,2,3 and Code Veronica in the same time span of 4 years. If you want a real survival horror or just a good game go play the originals. (Except for Zero)
0


### Clean up sentiment data

In [114]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build once; a set gives fast lookups
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Lower-case and replace non-alphanumerics with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word
    
    return words

In [115]:
print(review_to_words(pre_train_X[100]))

['resid', 'evil', 'folk', 'like', 'gear', 'evil', 'graphic', 'nice', 'bad', 'decid', 'skip', 'gameplay', 'plot', 'took', '4', 'year', 'make', 'capcom', 'pump', 'classic', 'resid', 'evil', '1', '2', '3', 'code', 'veronica', 'time', 'span', '4', 'year', 'want', 'real', 'surviv', 'horror', 'good', 'game', 'go', 'play', 'origin', 'except', 'zero']
10000


### Process sentiment data

In [116]:
import pickle, os

cache_dir = os.path.join("./cache", "sentiment_analysis")
os.makedirs(cache_dir, exist_ok=True)

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
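        except (OSError, EOFError, pickle.UnpicklingError):
            pass # No usable cache file; fall through and preprocess from scratch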
    
    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [117]:
train_X, test_X, train_Y, test_Y = preprocess_data(pre_train_X, pre_test_X, pre_train_Y, pre_test_Y)

Wrote preprocessed data to cache file: preprocessed_data.pkl


### Create word dictionary

In [118]:
import numpy as np

def build_dict(data, vocab_size = 5000):
    word_count = {}
    for review in data:
        for word in review:
            if word in word_count:
                word_count[word] += 1
            else:
                word_count[word] = 1
    
    sorted_words = [item[0] for item in sorted(word_count.items(), key=lambda x: x[1], reverse=True)]
    
    word_dict = {}
    for idx, word in enumerate(sorted_words[:vocab_size - 2]):
        word_dict[word] = idx + 2
        
    return word_dict

In [119]:
word_dict = build_dict(train_X)

In [120]:
count = 0
for word, key in word_dict.items():
    if count < 5:
        print(word)
        count += 1
    else:
        break

game
play
like
one
get
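
`build_dict` assigns indices starting at 2 because 0 and 1 are reserved: `convert_and_pad` uses 0 for padding ("no word") and 1 for words outside the vocabulary. A compact sketch of the same idea using `collections.Counter` is shown here on toy data; the name `build_dict_sketch` is hypothetical, and it is an alternative formulation rather than the notebook's own code.

```python
from collections import Counter

def build_dict_sketch(data, vocab_size=5000):
    """Map the most frequent words to integer ids, reserving 0 and 1."""
    counts = Counter(word for review in data for word in review)
    most_common = counts.most_common(vocab_size - 2)
    return {word: idx + 2 for idx, (word, _) in enumerate(most_common)}

toy = [["game", "fun", "game"], ["game", "bad"], ["fun"]]
word_dict_toy = build_dict_sketch(toy, vocab_size=4)
print(word_dict_toy)  # {'game': 2, 'fun': 3}
```

With `vocab_size=4` only two real words fit, so the least frequent word (`bad`) is left out and would later be encoded as 1.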


In [121]:
data_dir = './data/pytorch'
os.makedirs(data_dir, exist_ok=True)

In [122]:
with open(os.path.join(data_dir, 'word_dict.pkl'), "wb") as f:
    pickle.dump(word_dict, f)

### Transform data

In [123]:
def convert_and_pad(word_dict, sentence, pad=500):
    NOWORD = 0
    INFREQ = 1
    
    working_sentence = [NOWORD] * pad
    
    for word_index, word in enumerate(sentence[:pad]):
        if word in word_dict:
            working_sentence[word_index] = word_dict[word]
        else:
            working_sentence[word_index] = INFREQ
            
    return working_sentence, min(len(sentence), pad)

def convert_and_pad_data(word_dict, data, pad=500):
    result = []
    lengths = []
    
    for sentence in data:
        converted, leng = convert_and_pad(word_dict, sentence, pad)
        result.append(converted)
        lengths.append(leng)
        
    return np.array(result), np.array(lengths)

In [124]:
train_X, train_X_len = convert_and_pad_data(word_dict, train_X)
test_X, test_X_len = convert_and_pad_data(word_dict, test_X)

In [125]:
train_X_len[15]

40

In [128]:
train_X[15]

array([ 141,   34,    4,  949,   35,   43,  122, 1512, 1294,  154, 1682,
         22,   90,    2,  141,   34,    4, 1551,  217,   79,   53,  913,
         52,  182,   12,    1,   12, 3770,  182,   61, 1433,  602,  174,
        461,  141,   34,    4,  108,   13,  122,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

In [129]:
len(train_X[15])

500
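
A small worked example may help to read the array above. The sketch below re-implements the conversion on a toy dictionary with `pad=5`; the name `convert_and_pad_sketch` and the toy inputs are hypothetical. It shows both cases: a short review padded with 0, and a long review truncated to the pad length.

```python
def convert_and_pad_sketch(word_dict, sentence, pad=500):
    """Encode words as ids, truncate to `pad`, and right-pad with 0."""
    NOWORD, INFREQ = 0, 1
    ids = [word_dict.get(word, INFREQ) for word in sentence[:pad]]
    ids += [NOWORD] * (pad - len(ids))
    return ids, min(len(sentence), pad)

toy_dict = {"game": 2, "fun": 3}
short = convert_and_pad_sketch(toy_dict, ["game", "boring", "fun"], pad=5)
long_ = convert_and_pad_sketch(toy_dict, ["game"] * 7, pad=5)
print(short)  # ([2, 1, 3, 0, 0], 3)
print(long_)  # ([2, 2, 2, 2, 2], 5)
```

The returned length is what allows a model to ignore the padded positions later on.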

In [147]:
import pandas as pd
    
pd.concat([pd.DataFrame(train_Y), pd.DataFrame(train_X_len), pd.DataFrame(train_X)], axis=1) \
        .to_csv(os.path.join(data_dir, 'train.csv'), header=False, index=False)
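
Each row of `train.csv` holds the label, the unpadded length, and then the 500 word ids, in that order, which gives 502 columns per row. `get_train_data_loader` later treats column 0 as the label and passes everything else, length included, to the model as one tensor. The sketch below illustrates that layout with the standard `csv` module and toy rows (4 word ids instead of 500); the data is made up for illustration.

```python
import csv
import io

# Two toy rows in the same column order as train.csv:
# label, original length, then the padded word ids.
rows = [
    [1, 3, 2, 1, 3, 0],
    [0, 1, 2, 0, 0, 0],
]

buffer = io.StringIO()
csv.writer(buffer).writerows(rows)

buffer.seek(0)
parsed = []
for record in csv.reader(buffer):
    values = [int(v) for v in record]
    label, features = values[0], values[1:]       # features still start with the length
    length, word_ids = features[0], features[1:]
    parsed.append((label, length, word_ids))

print(parsed)  # [(1, 3, [2, 1, 3, 0]), (0, 1, [2, 0, 0, 0])]
```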

### Build and train PyTorch model

In [32]:
import torch
import torch.utils.data

import torch.optim as optim
from train.model import LSTMClassifier

def train(model, train_loader, epochs, optimizer, loss_fn, device, save_path):
    for epoch in range(1, epochs + 1):
        model.train()
        total_loss = 0
        for batch in train_loader:         
            batch_X, batch_y = batch

            batch_X = batch_X.to(device)
            batch_y = batch_y.to(device)

            optimizer.zero_grad()
            out = model(batch_X)
            loss = loss_fn(out, batch_y)
            loss.backward()
            optimizer.step()

            total_loss += loss.item()
        
        torch.save(model.state_dict(), save_path)
        print("Epoch: {}, BCELoss: {}".format(epoch, total_loss / len(train_loader)))
        
    return model

In [33]:
def get_train_data_loader(train_sample):
    train_sample_y = torch.from_numpy(train_sample[[0]].values).float().squeeze()
    train_sample_X = torch.from_numpy(train_sample.drop([0], axis=1).values).long()
    train_sample_ds = torch.utils.data.TensorDataset(train_sample_X, train_sample_y)
    train_sample_dl = torch.utils.data.DataLoader(train_sample_ds, batch_size=50)
    
    return train_sample_dl

#### Test training on small sample

In [34]:
train_data_from_csv_250 = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None, nrows=250)
train_sample_dl = get_train_data_loader(train_data_from_csv_250)

In [36]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 100, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

train(model, train_sample_dl, 5, optimizer, loss_fn, device, os.path.join(data_dir, 'test_train.pt'))

Epoch: 1, BCELoss: 0.6920093178749085
Epoch: 2, BCELoss: 0.6814617395401001
Epoch: 3, BCELoss: 0.6715063571929931
Epoch: 4, BCELoss: 0.6588733911514282
Epoch: 5, BCELoss: 0.6392780542373657


LSTMClassifier(
  (embedding): Embedding(5000, 32, padding_idx=0)
  (lstm): LSTM(32, 100)
  (dense): Linear(in_features=100, out_features=1, bias=True)
  (sig): Sigmoid()
)
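
`LSTMClassifier` is imported from `train.model`, which is not shown in this notebook. The sketch below is one architecture consistent with the module summary printed above (an embedding with `padding_idx=0`, a single LSTM, a linear layer, and a sigmoid). The class name `SketchLSTMClassifier` and, in particular, the `forward` logic are assumptions: it supposes that the first column of each input row is the review length, which matches the column order written to `train.csv`, and it reads the LSTM output at the last real token of each review.

```python
import torch
import torch.nn as nn

class SketchLSTMClassifier(nn.Module):
    """Assumed architecture; the real train.model.LSTMClassifier may differ."""

    def __init__(self, embedding_dim, hidden_dim, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.dense = nn.Linear(hidden_dim, 1)
        self.sig = nn.Sigmoid()

    def forward(self, x):
        x = x.t()                           # (1 + seq_len, batch)
        lengths = x[0, :]                   # (batch,) review lengths
        reviews = x[1:, :]                  # (seq_len, batch) word ids
        embeds = self.embedding(reviews)    # (seq_len, batch, embedding_dim)
        lstm_out, _ = self.lstm(embeds)     # (seq_len, batch, hidden_dim)
        out = self.dense(lstm_out)          # (seq_len, batch, 1)
        # Pick the output at the last real token of every review
        batch_idx = torch.arange(len(lengths), device=lengths.device)
        out = out[lengths - 1, batch_idx]   # (batch, 1)
        return self.sig(out.squeeze(-1))    # (batch,) probabilities

# Usage: a toy batch of 4 rows shaped like train.csv without the label column
batch = torch.zeros(4, 501, dtype=torch.long)
batch[:, 0] = 3                             # every toy review has 3 real tokens
batch[:, 1:4] = torch.tensor([2, 1, 3])
model_sketch = SketchLSTMClassifier(32, 100, 5000)
probs = model_sketch(batch)
print(probs.shape)                          # one probability per review
```

Because the output passes through a sigmoid, it can be fed directly to `torch.nn.BCELoss`, as the training cells do.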

#### Run the actual training

In [37]:
train_data_from_csv_full = pd.read_csv(os.path.join(data_dir, 'train.csv'), header=None, names=None)
train_sample_dl_full = get_train_data_loader(train_data_from_csv_full)

In [38]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = LSTMClassifier(32, 200, 5000).to(device)
optimizer = optim.Adam(model.parameters())
loss_fn = torch.nn.BCELoss()

estimator = train(model, train_sample_dl_full, 10, optimizer, loss_fn, device, os.path.join(data_dir, 'train.pt'))

Epoch: 1, BCELoss: 0.5496622703969478
Epoch: 2, BCELoss: 0.40822387877851724
Epoch: 3, BCELoss: 0.3381731440499425
Epoch: 4, BCELoss: 0.27152110420167447
Epoch: 5, BCELoss: 0.21735492413863539
Epoch: 6, BCELoss: 0.18828568392433226
Epoch: 7, BCELoss: 0.17431037111207842
Epoch: 8, BCELoss: 0.1313941475469619
Epoch: 9, BCELoss: 0.10618830774910748
Epoch: 10, BCELoss: 0.10010675488039851
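
The values printed by `train` are mean binary cross-entropy losses on the training data. As a reference point, a model that outputs 0.5 for every review scores ln 2 ≈ 0.693, which is roughly where the small-sample run started, while a loss near 0.10 corresponds to assigning about 0.9 to the correct class. The sketch below computes the loss by hand with the standard library; the function name and toy probabilities are hypothetical.

```python
import math

def binary_cross_entropy(probabilities, labels):
    """Mean BCE over a batch: -[y*log(p) + (1-y)*log(1-p)]."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 1, 0]
chance = binary_cross_entropy([0.5, 0.5, 0.5, 0.5], labels)
confident = binary_cross_entropy([0.9, 0.1, 0.9, 0.1], labels)
print(round(chance, 4))     # 0.6931
print(round(confident, 4))  # 0.1054
```

Since these are training losses, they say nothing yet about how well the model generalises to the test reviews.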


### Test PyTorch model

In [39]:
estimator.load_state_dict(torch.load(os.path.join(data_dir, 'train.pt'), map_location=device))