# Beta Testing Opinions | Sentiment Analysis Model

_Author: Karolina Mamczarz_

_Based on: [Deep Learning Nanodegree Program | Udacity](https://www.udacity.com/course/deep-learning-nanodegree--nd101)_

## Description

PyTorch, an open source machine learning framework, is used to train the model.

## Load dataset

This research uses the [Amazon Review Data (2018)](https://nijianmo.github.io/amazon/index.html) datasets (downloaded on March 4, 2020):
* [Video Games subset](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Video_Games_5.json.gz)
* [Software subset](http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/Software_5.json.gz)

See the citation below:

> Jianmo Ni, Jiacheng Li, Julian McAuley, **Justifying recommendations using distantly-labeled reviews and fine-grained aspects**, _Empirical Methods in Natural Language Processing (EMNLP)_, 2019

### Read sentiment data

In [106]:
import gzip
import json

def parse_dataset(path):
    # Each line of the gzipped file is one JSON-encoded review
    with gzip.open(path, 'rt', encoding='utf-8') as g:
        for line in g:
            yield json.loads(line)

In [107]:
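def get_sentiment_data(path):
    # Ratings of 4.0 and above are labelled positive (1), ratings of 2.0 and below negative (0);
    # reviews rated in between and reviews without text are skipped
    data = {'pos': [], 'neg': []}
    labels = {'pos': [], 'neg': []}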

    for review in parse_dataset(path):
        if 'reviewText' in review:
            if review['overall'] >= 4.0:
                data['pos'].append(review['reviewText'])
                labels['pos'].append(1)
            elif review['overall'] <= 2.0:
                data['neg'].append(review['reviewText'])
                labels['neg'].append(0)
    
    for sentiment in ['pos', 'neg']:
        assert len(data[sentiment]) == len(labels[sentiment]), \
                    "{} data size does not match labels size".format(sentiment)
    
    return data, labels   

In [108]:
pre_data_train, pre_labels_train = get_sentiment_data('./data/Video_Games_5.json.gz')
pre_data_test, pre_labels_test = get_sentiment_data('./data/Software_5.json.gz')

print('Reviews Video Games: {} pos / {} neg'.format(len(pre_data_train['pos']), len(pre_data_train['neg'])))
print('Reviews Software: {} pos / {} neg'.format(len(pre_data_test['pos']), len(pre_data_test['neg'])))

Reviews Video Games: 393267 pos / 55012 neg
Reviews Software: 8987 pos / 2219 neg
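The reading and labelling steps above can be tried on a tiny file before running them on the full datasets. The following is a minimal sketch, not the notebook's own code: `label_for_rating` and the sample records are hypothetical, and the file is written to a temporary directory in the same one-JSON-object-per-line layout that `parse_dataset` expects.

```python
import gzip
import json
import os
import tempfile

def label_for_rating(overall):
    """Map a star rating to a sentiment label; ratings in between return None."""
    if overall >= 4.0:
        return 1
    if overall <= 2.0:
        return 0
    return None

# Hypothetical sample records, not taken from the real dataset
sample_reviews = [
    {"overall": 5.0, "reviewText": "Great game"},
    {"overall": 3.0, "reviewText": "It is okay"},
    {"overall": 1.0, "reviewText": "Broken on arrival"},
    {"overall": 4.0},  # no reviewText, so it is skipped
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.json.gz")

    # Write one JSON object per line, gzip-compressed
    with gzip.open(sample_path, "wt", encoding="utf-8") as f:
        for record in sample_reviews:
            f.write(json.dumps(record) + "\n")

    # Read the file back and keep only labelled reviews that have text
    labelled = []
    with gzip.open(sample_path, "rt", encoding="utf-8") as f:
        for line in f:
            review = json.loads(line)
            label = label_for_rating(review["overall"])
            if "reviewText" in review and label is not None:
                labelled.append((review["reviewText"], label))

print(labelled)  # [('Great game', 1), ('Broken on arrival', 0)]
```

The 3-star review and the review without text are both dropped, which is why the positive and negative counts printed above do not add up to the full size of each subset.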


### Adjust probe number

In [109]:
# Adjustment for the case where the train dataset is bigger than the test dataset
# Reviews Video Games: 393267 pos / 55012 neg
# Reviews Software: 8987 pos / 2219 neg

def adjust_probe_number(train, test):
    probe_number = 10000
    adj_train = {'pos': [], 'neg': []}
    adj_test = {'pos': [], 'neg': []}
    
    for sentiment in ['pos', 'neg']:
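        # Number of reviews by which the test set falls short of probe_number
        probe_test_diff = abs(probe_number - len(test[sentiment]))
        
        # Top up the test set with the first reviews from the train set,
        # then take the next probe_number reviews as the train set
        adj_test[sentiment] = test[sentiment] + train[sentiment][0:probe_test_diff]
        adj_train[sentiment] = train[sentiment][probe_test_diff:probe_test_diff+probe_number]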
    
    return adj_train, adj_test

In [110]:
data_train, data_test = adjust_probe_number(pre_data_train, pre_data_test)
labels_train, labels_test = adjust_probe_number(pre_labels_train, pre_labels_test)

print('Reviews Video Games after probe number adjustment: {} pos / {} neg (labels  {} pos / {} neg)'.format(len(data_train['pos']), len(data_train['neg']), len(labels_train['pos']), len(labels_train['neg'])))
print('Reviews Software after probe number adjustment: {} pos / {} neg (labels  {} pos / {} neg)'.format(len(data_test['pos']), len(data_test['neg']), len(labels_test['pos']), len(labels_test['neg'])))

Reviews Video Games after probe number adjustment: 10000 pos / 10000 neg (labels  10000 pos / 10000 neg)
Reviews Software after probe number adjustment: 10000 pos / 10000 neg (labels  10000 pos / 10000 neg)
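To make the slicing in `adjust_probe_number` easier to follow, here is a small sketch on toy lists. The helper name `adjust_sample_size` and its `sample_size` parameter are hypothetical; it also uses `max(..., 0)` instead of `abs(...)`, which is an assumption about the intended behaviour when the test set is already large enough, not something the notebook does.

```python
def adjust_sample_size(train, test, sample_size):
    """Top up `test` with leading items of `train`, then take the next
    `sample_size` items of `train` as the training set."""
    # How many items the test set is missing (never negative)
    shortfall = max(sample_size - len(test), 0)
    adj_test = test + train[:shortfall]
    adj_train = train[shortfall:shortfall + sample_size]
    return adj_train, adj_test

train = ["train_{}".format(i) for i in range(10)]
test = ["test_0", "test_1"]

adj_train, adj_test = adjust_sample_size(train, test, sample_size=4)
print(adj_test)   # ['test_0', 'test_1', 'train_0', 'train_1']
print(adj_train)  # ['train_2', 'train_3', 'train_4', 'train_5']
```

The two slices never overlap, so no review ends up in both sets. Note the consequence for the real data: with the counts printed earlier, the adjusted "Software" test set holds 10000 - 8987 = 1013 positive and 10000 - 2219 = 7781 negative reviews that actually come from the Video Games subset.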


### Combine and shuffle sentiment data

In [111]:
from sklearn.utils import shuffle

def get_combined_data(data_train, labels_train, data_test, labels_test):
    d_train = data_train['pos'] + data_train['neg']
    l_train = labels_train['pos'] + labels_train['neg']
    d_test = data_test['pos'] + data_test['neg']
    l_test = labels_test['pos'] + labels_test['neg']
    
    d_train, l_train = shuffle(d_train, l_train)
    d_test, l_test = shuffle(d_test, l_test)
    
    return d_train, d_test, l_train, l_test

In [112]:
pre_train_X, pre_test_X, pre_train_Y, pre_test_Y = get_combined_data(data_train, labels_train, data_test, labels_test)
print("Reviews (combined): train = {}, test = {}".format(len(pre_train_X), len(pre_test_X)))

Reviews (combined): train = 20000, test = 20000


In [113]:
print(pre_train_X[100])
print(pre_train_Y[100])

This ain't Resident Evil, folks. More like Gears of Evil. The graphics are nice, too bad they decided to skip out on the gameplay or plot. It took 4 years to make this? Capcom pumped out the classics of Resident Evil 1,2,3 and Code Veronica in the same time span of 4 years. If you want a real survival horror or just a good game go play the originals. (Except for Zero)
0
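`shuffle` from `sklearn.utils` applies the same permutation to every list it receives, which is what keeps each review aligned with its label. It also accepts a `random_state` argument for repeatable runs, which the notebook does not set. The idea can be sketched with the standard library alone; `shuffle_together` below is a hypothetical helper written for illustration, not a replacement for the scikit-learn call.

```python
import random

def shuffle_together(items, labels, seed=0):
    """Shuffle two equal-length lists with the same permutation."""
    assert len(items) == len(labels)
    paired = list(zip(items, labels))
    # A seeded generator makes the order repeatable between runs
    random.Random(seed).shuffle(paired)
    shuffled_items, shuffled_labels = zip(*paired)
    return list(shuffled_items), list(shuffled_labels)

reviews = ["good", "great", "bad", "awful"]
labels = [1, 1, 0, 0]
shuffled_reviews, shuffled_labels = shuffle_together(reviews, labels, seed=42)

# Every review still carries its original label after shuffling
expected = {"good": 1, "great": 1, "bad": 0, "awful": 0}
print(all(expected[r] == l for r, l in zip(shuffled_reviews, shuffled_labels)))  # True
```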


### Clean up sentiment data

In [114]:
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

import re
from bs4 import BeautifulSoup

def review_to_words(review):
    nltk.download("stopwords", quiet=True)
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english")) # Build the stopword set once per review
    
    text = BeautifulSoup(review, "html.parser").get_text() # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower()) # Convert to lower case, replace non-alphanumeric characters with spaces
    words = text.split() # Split string into words
    words = [w for w in words if w not in stop_words] # Remove stopwords
    words = [stemmer.stem(w) for w in words] # Stem each word
    
    return words

In [115]:
print(review_to_words(pre_train_X[100]))

['resid', 'evil', 'folk', 'like', 'gear', 'evil', 'graphic', 'nice', 'bad', 'decid', 'skip', 'gameplay', 'plot', 'took', '4', 'year', 'make', 'capcom', 'pump', 'classic', 'resid', 'evil', '1', '2', '3', 'code', 'veronica', 'time', 'span', '4', 'year', 'want', 'real', 'surviv', 'horror', 'good', 'game', 'go', 'play', 'origin', 'except', 'zero']
10000
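The normalization step inside `review_to_words` can be inspected without NLTK or BeautifulSoup. The sketch below isolates the regular expression and the stopword filter. `normalize` and `SAMPLE_STOPWORDS` are hypothetical names, and the tiny stopword set is only an illustration, not NLTK's English list.

```python
import re

# Hypothetical, tiny stopword set for illustration only
SAMPLE_STOPWORDS = {"this", "t", "the", "are", "too"}

def normalize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()

tokens = normalize("This ain't Resident Evil, folks.")
print(tokens)  # ['this', 'ain', 't', 'resident', 'evil', 'folks']

kept = [t for t in tokens if t not in SAMPLE_STOPWORDS]
print(kept)  # ['ain', 'resident', 'evil', 'folks']
```

Because the apostrophe is replaced by a space, a contraction such as "ain't" is split into two fragments. In the notebook output above neither fragment appears, so the stopword filter removes both before stemming.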


### Process sentiment data

In [116]:
import os
import pickle

cache_dir = os.path.join("./cache", "sentiment_analysis")
os.makedirs(cache_dir, exist_ok=True)

def preprocess_data(data_train, data_test, labels_train, labels_test,
                    cache_dir=cache_dir, cache_file="preprocessed_data.pkl"):
    cache_data = None
    if cache_file is not None:
        try:
            with open(os.path.join(cache_dir, cache_file), "rb") as f:
                cache_data = pickle.load(f)
            print("Read preprocessed data from cache file:", cache_file)
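        except (OSError, EOFError, pickle.UnpicklingError):
            pass # Cache is missing or unreadable, so the data is preprocessed again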
    
    if cache_data is None:
        words_train = [review_to_words(review) for review in data_train]
        words_test = [review_to_words(review) for review in data_test]
        
        if cache_file is not None:
            cache_data = dict(words_train=words_train, words_test=words_test,
                              labels_train=labels_train, labels_test=labels_test)
            with open(os.path.join(cache_dir, cache_file), "wb") as f:
                pickle.dump(cache_data, f)
            print("Wrote preprocessed data to cache file:", cache_file)
    else:
        words_train, words_test, labels_train, labels_test = (cache_data['words_train'],
                cache_data['words_test'], cache_data['labels_train'], cache_data['labels_test'])
    
    return words_train, words_test, labels_train, labels_test

In [None]:
train_X, test_X, train_Y, test_Y = preprocess_data(pre_train_X, pre_test_X, pre_train_Y, pre_test_Y)
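The caching pattern in `preprocess_data` (try to load a pickle, otherwise compute and store it) can be sketched in isolation. `load_or_compute` is a hypothetical helper, and the temporary directory stands in for the notebook's `./cache/sentiment_analysis` folder.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return (value, from_cache): load the pickle at `cache_path` if possible,
    otherwise call `compute()`, store the result, and return it."""
    try:
        with open(cache_path, "rb") as f:
            return pickle.load(f), True
    except (OSError, EOFError, pickle.UnpicklingError):
        value = compute()
        with open(cache_path, "wb") as f:
            pickle.dump(value, f)
        return value, False

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")
    # First call: no cache file yet, so the value is computed and written
    first, first_from_cache = load_or_compute(path, lambda: [["resid", "evil"]])
    # Second call: the cached value wins and the new function is never called
    second, second_from_cache = load_or_compute(path, lambda: [["never", "called"]])

print(first_from_cache, second_from_cache)  # False True
print(second)  # [['resid', 'evil']]
```

The second call shows the main caveat of this pattern: the cache is not tied to its inputs. In the notebook, an existing `preprocessed_data.pkl` is returned even if the datasets or the shuffle order have changed, so the file has to be deleted to force a fresh preprocessing run.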