<a href="https://colab.research.google.com/github/mariyaperchyk/codepubHPI/blob/master/AmazonBookReviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Let's prepare the data and our runtime environment

1. Install fastText via pip
2. Download our Amazon Review data

   Our dataset consists of almost 1.7M book reviews (337,350 per star rating).
3. Unzip it

In [11]:
!pip install fasttext
!wget -O amazon_review_data.zip https://owncloud.hpi.de/s/173VppQ6LPxqM7B/download
!unzip amazon_review_data.zip
!ls -lah

--2019-11-27 23:34:46--  https://owncloud.hpi.de/s/173VppQ6LPxqM7B/download
Resolving owncloud.hpi.de (owncloud.hpi.de)... 141.89.226.235
Connecting to owncloud.hpi.de (owncloud.hpi.de)|141.89.226.235|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 357272653 (341M) [application/zip]
Saving to: ‘amazon_review_data.zip’


2019-11-27 23:35:03 (22.6 MB/s) - ‘amazon_review_data.zip’ saved [357272653/357272653]

Archive:  amazon_review_data.zip
replace amazon_book_review_balanced_full.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: total 1.3G
drwxr-xr-x 1 root root 4.0K Nov 27 22:34 .
drwxr-xr-x 1 root root 4.0K Nov 27 22:30 ..
-rw-r--r-- 1 root root 927M Nov 24 16:27 amazon_book_review_balanced_full.csv
-rw-r--r-- 1 root root 341M Nov 27 23:35 amazon_review_data.zip
drwxr-xr-x 1 root root 4.0K Nov 21 16:30 .config
drwxr-xr-x 1 root root 4.0K Nov 21 16:30 sample_data


## Let's take a look at the data

We will work with the **pandas** and **sklearn** Python libraries for data analysis.
The dataset has three columns: *star_rating*, *review_headline* and *review_body*.
* *star_rating* - Number of stars in the review
* *review_headline* - Title of the review
* *review_body* - Actual review text

In [12]:
import pandas as pd
from sklearn.model_selection import train_test_split
import csv

full_data = pd.read_csv('amazon_book_review_balanced_full.csv', dtype=str)
full_data

Unnamed: 0,star_rating,review_headline,review_body
0,1,Completely Terrible,"Fortunately, I read this for free via Amazon's..."
1,1,"This is an essay, not a short story, barely 17...","Unfortunately, I didn't realize what I was buy..."
2,1,no.,zero stars if I could. I received the wrong book.
3,1,"What an underdeveloped, unoriginal waste of time","First off, I have no idea who is writing the 5..."
4,1,Caution: Cahn's theory is contradicted by empi...,I am a lecturer in empirical research methods ...
...,...,...,...
1686745,5,I enjoyed this book very much,I enjoyed this book very much. Once I got to ...
1686746,5,we lost hours of sleep because we could not pu...,My wife and I give UNHINGED five stars!!! I w...
1686747,5,Five Stars,Excellent
1686748,5,Five Stars,Really enjoyed this. Well told.


In [13]:
full_data.star_rating.value_counts()

1    337350
2    337350
4    337350
5    337350
3    337350
Name: star_rating, dtype: int64

## Next step: Prepare the data for classification

1. Remove all reviews with 3 stars, since 3 star reviews are neither positive nor negative. 
2. Add an additional column *sentiment*, which specifies whether the review is positive or negative. *Positive* reviews have 4 or 5 stars, and *negative* reviews have 1 or 2 stars.
3. To keep the initial run fast and execution times short, we will use only **20%** of the remaining book reviews.
4. Split the dataset into training and test data.
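The labeling rule from steps 1 and 2 can be sketched on toy values as follows. This is only an illustration: the helper name `rating_to_label` is hypothetical and not part of the notebook, and unlike the cell below it returns `None` for anything that is not a 1, 2, 4 or 5 star rating.

```python
from typing import Optional


def rating_to_label(star_rating: str) -> Optional[str]:
    """Map a star rating (stored as a string) to a fastText-style label.

    Returns None for 3-star reviews, which are treated as neutral and dropped.
    """
    if star_rating in ('4', '5'):
        return '__label__positive'
    if star_rating in ('1', '2'):
        return '__label__negative'
    return None  # neutral or unexpected rating


# Toy ratings, stored as strings like in the dataset (read with dtype=str)
ratings = ['1', '2', '3', '4', '5']
labeled = [(r, rating_to_label(r)) for r in ratings]

# Keep only the reviews that received a label
kept = [(r, label) for r, label in labeled if label is not None]
print(kept)
```

The `__label__` prefix is what fastText uses to tell labels apart from ordinary words in the training file.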

In [14]:
# Drop the neutral 3-star reviews (ratings are strings because of dtype=str)
sentimentDataset = full_data.loc[full_data.star_rating != '3']
# Label 4 and 5 stars as positive, the remaining 1 and 2 stars as negative
sentimentColumn = ['__label__positive' if rating in ('4', '5') else '__label__negative' for rating in sentimentDataset['star_rating']]
sentimentDataset = sentimentDataset.assign(sentiment=sentimentColumn)
sentimentDataset


Unnamed: 0,star_rating,review_headline,review_body,sentiment
0,1,Completely Terrible,"Fortunately, I read this for free via Amazon's...",__label__negative
1,1,"This is an essay, not a short story, barely 17...","Unfortunately, I didn't realize what I was buy...",__label__negative
2,1,no.,zero stars if I could. I received the wrong book.,__label__negative
3,1,"What an underdeveloped, unoriginal waste of time","First off, I have no idea who is writing the 5...",__label__negative
4,1,Caution: Cahn's theory is contradicted by empi...,I am a lecturer in empirical research methods ...,__label__negative
...,...,...,...,...
1686745,5,I enjoyed this book very much,I enjoyed this book very much. Once I got to ...,__label__positive
1686746,5,we lost hours of sleep because we could not pu...,My wife and I give UNHINGED five stars!!! I w...,__label__positive
1686747,5,Five Stars,Excellent,__label__positive
1686748,5,Five Stars,Really enjoyed this. Well told.,__label__positive


In [0]:
# Keep a random 20% sample of the reviews, then hold out 20% of that sample for testing
sentimentDataset = sentimentDataset.sample(frac=0.2)
trainBinary, testBinary = train_test_split(sentimentDataset, test_size=0.2)

## Last preparation step
We need to combine the sentiment column and the review body into a single column, so that each line starts with the label followed by the review text. Then we write the result to a file.

In [0]:
train_dataset = trainBinary.sentiment + ' ' + trainBinary.review_body
train_dataset.to_csv('train_data.csv', index=False, quoting=csv.QUOTE_MINIMAL, header=False, sep='\t')

## It's time for fastText!
`train_supervised` trains a supervised model and returns a model object. `input` must be a file path. Each line of the input file holds one example: its label, prefixed with `__label__`, followed by the text. The input text does not need to be tokenized as per the tokenize function, but it must be preprocessed and encoded as UTF-8.
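As a rough illustration of that input format, the sketch below writes two made-up examples to a UTF-8 file with one example per line. The helper `to_fasttext_line`, the toy reviews and the file name `toy_train.txt` are assumptions for this example only and do not come from the notebook. Whitespace is collapsed because a review containing a line break would otherwise span several lines of the file.

```python
import os
import tempfile


def to_fasttext_line(label: str, text: str) -> str:
    """Build one training line: the label, a space, then the text on a single line."""
    # str.split() with no arguments splits on any whitespace run (spaces, tabs, newlines)
    return label + ' ' + ' '.join(text.split())


# Made-up examples; the second element plays the role of review_body
examples = [
    ('__label__positive', 'Great book,\nI enjoyed it.'),
    ('__label__negative', 'Boring\tand too long.'),
]
lines = [to_fasttext_line(label, text) for label, text in examples]

# Write one example per line, explicitly encoded as UTF-8
path = os.path.join(tempfile.mkdtemp(), 'toy_train.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write('\n'.join(lines) + '\n')

print(lines)
```

A file written this way could then be passed as `input` to `train_supervised` in the same way as `train_data.csv` below.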

In [0]:
import fasttext
model = fasttext.train_supervised(input='train_data.csv')

In [44]:
model.predict("This is a great book, I really enjoyed reading it")

(('__label__positive',), array([0.9999603]))

In [45]:
model.predict("This is a aweful book, I  was bored")

(('__label__negative',), array([0.98591971]))

## Now it's time to test how accurate our predictions are

1. Prepare the test data
2. Predict labels for test data

   For a list of inputs, `model.predict` returns a tuple of two parallel sequences: the predicted labels and their probabilities. Let's drop the probabilities, since we will not need them.


In [46]:
ground_truth = testBinary.sentiment.values.tolist()
test_dataset = (testBinary.review_body).values.tolist()
test_dataset = [x.replace('\n', ' ') for x in test_dataset]
predictions = model.predict(test_dataset)
# Note: this prints the label of the first review and the probability of the second review
print(predictions[0][0], predictions[1][1])

['__label__negative'] [0.56501603]


In [0]:
predictions = [item[0] for item in predictions[0]]
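The unpacking step above can be sketched on mock data, without a trained model. The values below are invented and only mimic the shape seen in the output above: one list of labels and one list of probabilities per input text.

```python
# Mock of the structure returned by model.predict for a list of three texts.
# The labels and probabilities are made up for illustration.
mock_predictions = (
    [['__label__negative'], ['__label__positive'], ['__label__positive']],
    [[0.91], [0.57], [0.99]],
)

# First element: labels per text; second element: probabilities per text
labels_per_text, probs_per_text = mock_predictions

# Each inner list holds the top label for one text, so take its first entry
flat_labels = [labels[0] for labels in labels_per_text]
print(flat_labels)
```

After this step the predictions are a flat list of strings with the same length and order as the ground truth labels, which is the shape the sklearn metrics expect.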

## How good are we?
* Compare the actual labels to the predictions.
* Compute the accuracy and the confusion matrix.

In [48]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

matrix = confusion_matrix(ground_truth, predictions)
accuracy = accuracy_score(ground_truth, predictions)
print(matrix)
print(accuracy)

[[7006 1063]
 [1045 7079]]
0.8698202927190761
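To see how the accuracy relates to the confusion matrix, the numbers printed above can be recombined by hand. This sketch assumes sklearn's convention that rows are the true labels and columns the predicted labels, with labels in sorted order, so `__label__negative` comes first. The variable names are chosen for this example only.

```python
# Confusion matrix copied from the output above
matrix = [[7006, 1063],
          [1045, 7079]]

# Rows: true label, columns: predicted label (negative first, then positive)
(neg_as_neg, neg_as_pos), (pos_as_neg, pos_as_pos) = matrix

total = neg_as_neg + neg_as_pos + pos_as_neg + pos_as_pos

# Accuracy: share of reviews on the diagonal, i.e. predicted correctly
accuracy = (neg_as_neg + pos_as_pos) / total

# Precision and recall for the positive class
precision_pos = pos_as_pos / (pos_as_pos + neg_as_pos)
recall_pos = pos_as_pos / (pos_as_pos + pos_as_neg)

print(total, accuracy, precision_pos, recall_pos)
```

The two off-diagonal counts are of similar size, so the model misclassifies positive and negative reviews about equally often.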
