### Overview

The dataset we will be using is a collection of Amazon reviews for instant video products, available at http://jmcauley.ucsd.edu/data/amazon/. Each record includes the product, the reviewer, the review text, the rating, the helpfulness votes, and other metadata. We use the “5-core” version of the dataset, which keeps only products and reviewers with at least 5 reviews: each product in this dataset has at least 5 reviews, and each reviewer has posted at least 5 reviews in this category. The set contains 37,126 reviews.
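The 5-core property can be illustrated with a small sketch. The dataset is distributed already filtered, so this is not part of our pipeline; the function name `k_core` and the toy review pairs are hypothetical. The sketch shows one way such a subset could be obtained: filtering is repeated because removing a review can push another reviewer or product below the threshold.

```python
from collections import Counter

def k_core(pairs, k):
    """Keep only (reviewer, product) pairs where both have at least k reviews."""
    pairs = list(pairs)
    while True:
        reviewer_counts = Counter(r for r, _ in pairs)
        product_counts = Counter(p for _, p in pairs)
        kept = [(r, p) for r, p in pairs
                if reviewer_counts[r] >= k and product_counts[p] >= k]
        # Stop once a full pass removes nothing.
        if len(kept) == len(pairs):
            return kept
        pairs = kept

# Toy data: a 2-core example for brevity (the dataset itself uses k = 5).
toy_reviews = [("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2"), ("u3", "p1")]
core = k_core(toy_reviews, k=2)
print(core)
```

Here reviewer `u3` has a single review, so that review is dropped and the remaining four pairs form the 2-core.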

### Purpose

This data was originally collected by Amazon as consumers browsed, bought, and reviewed products; Amazon needs it in order to display the reviews on each product. The specific dataset we are using is UCSD’s collection, structured by two researchers (Julian McAuley and Alex Yang) to understand which reviews are helpful to consumers making a purchase, and why. They outline their findings in the following paper: http://cseweb.ucsd.edu/~jmcauley/pdfs/www16b.pdf

### Prediction Task

Reviews have become increasingly crucial in consumer purchase decisions. According to a BrightLocal study, 88% of consumers incorporate reviews into their purchase decisions. For most consumers this finding is obvious. A less obvious finding is that although customers rely more on reviews, they are reading fewer of them: 73% of consumers formed an opinion after reading up to six reviews, compared with 64% in 2014. Moreover, 40% of consumers formed an opinion after reading just one to three reviews, compared with 29% in 2014.

Consumers are basing decisions on fewer reviews. This trend introduces new issues. For example, what if the first few reviews a consumer reads are not very helpful? What if all of the reviews on a product are unhelpful? This puts a premium on identifying and presenting useful reviews. If a company knew beforehand which types of reviews are helpful, it could enforce those attributes as rules or promote those reviews ahead of unhelpful ones.
Given this scenario, our prediction task for this data is to determine the helpfulness of a review.

### Third Parties

The parties interested in this result include e-commerce vendors (Amazon, eBay, Alibaba, etc.) that offer reviews on products. The result could give them insight into which reviews are generally helpful and how to get those reviews in front of consumers ahead of other reviews to help with product conversion. It would also allow them to enforce rules that remove unhelpful reviews, which are potentially detrimental to product conversion. Finally, it would help with products that do not yet have many reviews, where the normal process of voting has not yet taken place.

Other third parties interested in the data (although not specifically in these findings) include advertisers, who would benefit from a text analysis of the reviews to see which related products or needs are mentioned. They could use this information to decide when to advertise certain products. Consumers, in turn, will benefit from seeing more helpful reviews.


### Sources
BrightLocal, Business2community, Bazaarvoice, webrepublic, reprevive, Econsultancy, Reevoo, and Social Media Today:

https://www.vendasta.com/blog/50-stats-you-need-to-know-about-online-reviews

Forbes: 

https://www.forbes.com/sites/jaysondemers/2015/12/28/how-important-are-customer-reviews-for-online-marketing/#276bfb421928

UCSD:

http://jmcauley.ucsd.edu/data/amazon/

http://cseweb.ucsd.edu/~jmcauley/pdfs/www16b.pdf




## 2. Data understanding
In this section we will examine the dataset and its features.

In [1]:
#Imports
import sys
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy

print(sys.version)
print(np.__version__)

plt.style.use('seaborn-deep')

#load dataset
dataframe = pd.read_json('../lab-one/reviews_Amazon_Instant_Video_5.json',lines=True)
dataframe.info()

3.5.2 (default, Mar 30 2017, 20:11:21) 
[GCC 5.4.0]
1.14.2
<class 'pandas.core.frame.DataFrame'>
Int64Index: 37126 entries, 0 to 37125
Data columns (total 9 columns):
asin              37126 non-null object
helpful           37126 non-null object
overall           37126 non-null int64
reviewText        37126 non-null object
reviewTime        37126 non-null object
reviewerID        37126 non-null object
reviewerName      36797 non-null object
summary           37126 non-null object
unixReviewTime    37126 non-null int64
dtypes: int64(2), object(7)
memory usage: 2.8+ MB


From the initial load we can see that this dataset is generally well filled. There are some missing values for reviewerName, which may be due to reviewers preferring to remain anonymous. However, the reviewer names are not important to our findings, because reviewerID already links each review to its author. Thus, we will not use that column in further analysis.

### 2.1. Data types
The features of this dataset are as follows:
+ asin - ID of the product - Nominal
+ helpful - A pair of vote counts: the number of people who found the review helpful and the total number of votes - Both ordinal, stored as integers
+ overall - The overall rating (1 to 5) given to the product in the review. Ordinal, stored as integer
+ reviewText - The full text of the review - Bag of words
+ reviewTime - The timestamp of the review - Interval
+ reviewerID - The ID of the reviewer - Nominal
+ reviewerName - The name of the reviewer - Nominal
+ summary - A summary of the review  - Bag of words
+ unixReviewTime - The UNIX timestamp of the review - Interval


Additionally, we derived the following attributes from the existing ones to help in our analysis of the dataset:
+ numberHelpful - The number of people who thought the review was helpful, extracted from the helpful tuple. Ordinal, stored as integer
+ numberUnhelpful - The number of people who thought the review was unhelpful (total votes minus helpful votes), extracted from the helpful tuple. Ordinal, stored as integer
+ helpfulRatio - The fraction of votes that were helpful; missing when a review has no votes. Stored as float
+ reviewerNumberReviews - The number of reviews each reviewer has left. Ordinal, stored as integer
+ reviewLength - The length of the review text in characters. Ordinal, stored as integer

   

In [2]:
#extract information for number of reviews by each reviewer
from collections import Counter
reviewer_ids = list(dataframe['reviewerID'])
#Counter tallies every reviewer in one pass instead of calling list.count once per reviewer
authorToNumReviews = Counter(reviewer_ids)
dataframe['reviewLength'] = [len(text) for text in dataframe['reviewText']]
dataframe['reviewerNumberReviews'] = [authorToNumReviews[author] for author in dataframe['reviewerID']]

#extract additional columns for analysis
helpful_count = []
unhelpful_count = []
helpful_ratio = []
for (h,total) in dataframe['helpful']:
    helpful_count.append(h)
    unhelpful_count.append(total-h)
    if total == 0:
        helpful_ratio.append(None)
    else:
        helpful_ratio.append(h/(total))
dataframe['numberUnhelpful'] = unhelpful_count
dataframe['numberHelpful'] = helpful_count
dataframe['helpfulRatio'] = helpful_ratio

<br/> 
We discarded the attributes we would not need for our analysis and set appropriate storage types for the rest. The discarded attributes are the time the review was written and the original helpful tuple. We removed the former because the reviews in this dataset come from a generally similar time period, and we believe time will not be a large differentiator in review quality. Although this would be a good hypothesis to test further, the dataset is rich enough for us to focus our efforts on other features. The original helpful tuple was removed because it has been replaced by our numeric columns.
<br/>

In [87]:
import numpy as np
import nltk
import string
import re
#requires the NLTK 'words' corpus (nltk.download('words'))
words = set(nltk.corpus.words.words())
def english_filter(passed_string):
    #strip punctuation and lowercase the text
    passed_string = re.sub(r'[^\w\s]','',passed_string)
    passed_string = passed_string.lower()
    #split() with no argument collapses spaces and newlines, so dropped words leave no gaps
    return " ".join(w for w in passed_string.split() if w in words)
unneeded_attributes = ['unixReviewTime', 'helpful', 'reviewTime']
ordinal_attributes = ['numberHelpful', 'numberUnhelpful', 'reviewLength', 'overall', 'reviewerNumberReviews']
nominal_attributes = ['asin', 'reviewerID', 'reviewerName']

for attr in unneeded_attributes:
    if attr in dataframe:
        del dataframe[attr]

dataframe[ordinal_attributes] = dataframe[ordinal_attributes].astype(np.int64)
dataframe['reviewText'] = dataframe['reviewText'].apply(english_filter)
dataframe['summary'] = dataframe['summary'].apply(english_filter)
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 37126 entries, 0 to 37125
Data columns (total 11 columns):
asin                     37126 non-null object
overall                  37126 non-null int64
reviewText               37126 non-null object
reviewerID               37126 non-null object
reviewerName             36797 non-null object
summary                  37126 non-null object
reviewLength             37126 non-null int64
reviewerNumberReviews    37126 non-null int64
numberUnhelpful          37126 non-null int64
numberHelpful            37126 non-null int64
helpfulRatio             13133 non-null float64
dtypes: float64(1), int64(5), object(5)
memory usage: 4.6+ MB
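
To make the effect of the text filtering concrete, here is a simplified sketch of the steps in `english_filter` that uses a tiny stand-in vocabulary instead of the NLTK word list. The names `toy_vocabulary` and `toy_english_filter` and the sample sentence are made up for illustration.

```python
import re

# Stand-in for the NLTK word list; hypothetical, for illustration only.
toy_vocabulary = {"great", "show", "i", "loved", "it"}

def toy_english_filter(text, vocabulary):
    # Remove punctuation, lowercase, then keep only tokens in the vocabulary.
    text = re.sub(r"[^\w\s]", "", text).lower()
    return " ".join(w for w in text.split() if w in vocabulary)

filtered = toy_english_filter("Great show!!! I loooved it :)", toy_vocabulary)
print(filtered)
```

The misspelled token "loooved" is not in the vocabulary, so it is dropped along with the punctuation. Note that reviewLength was computed before this filtering, so it reflects the length of the original text.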


### 2.2. Data quality

As mentioned before, this dataset is missing 329 reviewer names, as can be seen from the outputs above and below. We decided to keep this data as is, because another unique identifier for reviewers (reviewerID) lets us associate a reviewer with a review. Reviewer names are also not part of our analysis or prediction, so the missing values will not skew our results.
<br/>

The most probable reason for the missing reviewerName values is that Amazon gave reviewers the option to post anonymously and hid their names in those reviews. Since only logged-in members are allowed to post reviews, we can assume that reviewerID is sufficient to identify reviews by the same member.

<br/> 
Furthermore, entries where numberHelpful and numberUnhelpful are both 0 pose an interesting situation. Although there is an argument for treating these entries as missing and removing or imputing them, we believe they are valuable as is. They can help with predicting a different class of reviews: indistinct reviews, which no other customer marked as either helpful or unhelpful. About 65% of the reviews (23,993 of 37,126, the rows where helpfulRatio is null) have 0 for both attributes.
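
One way to make these groups explicit is sketched below. The function name `helpfulness_class` and the 0.5 cut-off are assumptions for illustration only; the discussion above does not fix a threshold.

```python
def helpfulness_class(number_helpful, number_unhelpful, threshold=0.5):
    """Assign a review to one of three groups based on its votes.

    The 0.5 threshold is an illustrative assumption, not a tuned value.
    """
    total = number_helpful + number_unhelpful
    if total == 0:
        return "indistinct"  # nobody voted either way
    ratio = number_helpful / total
    return "helpful" if ratio >= threshold else "unhelpful"

examples = [(0, 0), (8, 2), (1, 4)]
labels = [helpfulness_class(h, u) for h, u in examples]
print(labels)
```

Keeping the zero-vote reviews as their own group avoids having to impute a ratio for them.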

<br/>

In [88]:
#Number of rows with a null value in at least one column
null_data = dataframe[dataframe.isnull().any(axis=1)]
len(null_data)

24095
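
Because the count above covers a null in any column, it mixes the two sources of missing values (reviewerName and helpfulRatio). A per-column breakdown separates them. The sketch below uses a made-up three-row frame rather than the real data; on the real dataset the equivalent call would be `dataframe.isnull().sum()`.

```python
import pandas as pd

# Toy frame (hypothetical values) with the two columns that contain nulls.
toy = pd.DataFrame({
    "reviewerName": ["Ann", None, "Bob"],
    "helpfulRatio": [0.5, None, None],
})

# Nulls counted column by column.
per_column = toy.isnull().sum()
# Rows with at least one null; a row missing both fields is counted once.
rows_with_any = int(toy.isnull().any(axis=1).sum())
print(per_column)
print(rows_with_any)
```

Since a row can be missing both fields, the any-null row count can be smaller than the sum of the per-column counts.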

In [89]:
#More information about the dataset
dataframe.describe()

| | overall | reviewLength | reviewerNumberReviews | numberUnhelpful | numberHelpful | helpfulRatio |
|---|---|---|---|---|---|---|
| count | 37126.0 | 37126.0 | 37126.0 | 37126.0 | 37126.0 | 13133.0 |
| mean | 4.20953 | 515.292033 | 10.667026 | 0.725475 | 1.293541 | 0.574588 |
| std | 1.11855 | 835.14561 | 13.346323 | 3.532468 | 8.301778 | 0.391384 |
| min | 1.0 | 4.0 | 5.0 | 0.0 | 0.0 | 0.0 |
| 25% | 4.0 | 145.0 | 5.0 | 0.0 | 0.0 | 0.2 |
| 50% | 5.0 | 232.0 | 7.0 | 0.0 | 0.0 | 0.666667 |
| 75% | 5.0 | 484.0 | 10.0 | 0.0 | 1.0 | 1.0 |
| max | 5.0 | 18152.0 | 123.0 | 214.0 | 484.0 | 1.0 |


In [90]:
#Print the number of duplicate entries in the dataset
len(dataframe[dataframe.duplicated(['asin','reviewerID'],keep=False)])

0
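
The duplicate check relies on `keep=False`, which flags every row of a repeated (asin, reviewerID) pair rather than only the later copies. A small sketch with invented IDs shows the difference:

```python
import pandas as pd

# Toy frame (hypothetical IDs): the first two rows share the same product and reviewer.
toy = pd.DataFrame({
    "asin": ["A1", "A1", "A2"],
    "reviewerID": ["R1", "R1", "R1"],
})
subset = ["asin", "reviewerID"]

# keep=False marks both rows of the repeated pair.
all_copies = int(toy.duplicated(subset, keep=False).sum())
# keep="first" marks only the second occurrence.
later_copies = int(toy.duplicated(subset, keep="first").sum())
print(all_copies, later_copies)
```

A result of 0 with `keep=False`, as in the output above, therefore means no reviewer reviewed the same product twice in this dataset.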