## 2. Data understanding

In [88]:
# Load the dataset (the file holds one JSON object per line)
import pandas as pd
dataframe = pd.read_json('reviews_Amazon_Instant_Video_5.json',lines=True)
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37126 entries, 0 to 37125
Data columns (total 9 columns):
asin              37126 non-null object
helpful           37126 non-null object
overall           37126 non-null int64
reviewText        37126 non-null object
reviewTime        37126 non-null object
reviewerID        37126 non-null object
reviewerName      36797 non-null object
summary           37126 non-null object
unixReviewTime    37126 non-null int64
dtypes: int64(2), object(7)
memory usage: 2.5+ MB


### 2.1. Data types
The features of this dataset are as follows:
+ asin - ID of the product - Nominal
+ helpful - A pair of vote counts for the review, which we read as the number of people who found it helpful and the number who found it unhelpful - Both ordinal, stored as integers
+ overall - The rating the reviewer gave the product, from 1 to 5 - Ordinal, stored as integer
+ reviewText - The full text of the review - Bag of words
+ reviewTime - The date of the review, stored as text - Interval
+ reviewerID - The ID of the reviewer - Nominal
+ reviewerName - The name of the reviewer - Nominal
+ summary - A summary of the review - Bag of words
+ unixReviewTime - The UNIX timestamp of the review - Interval
   


Additionally, we derived the following attributes from the existing ones to support our analysis of the dataset:
+ numberHelpful - The number of people who found the review helpful, extracted from the helpful pair. Ordinal, stored as integer
+ numberUnhelpful - The number of people who found the review unhelpful, extracted from the helpful pair. Ordinal, stored as integer
+ reviewerNumberReviews - The number of reviews each reviewer has left. Ordinal, stored as integer
+ reviewLength - The length of the review text in characters. Ordinal, stored as integer

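The reading of the helpful pair as [helpful votes, unhelpful votes] is worth checking before relying on it. The sketch below uses hypothetical toy rows (not taken from the dataset) and tests an alternative reading in which the second entry is the total number of votes. That alternative is an assumption to verify against the dataset's documentation. If the second entry is never smaller than the first on the real data, the alternative becomes plausible, and the unhelpful count would then be the difference of the two entries.

```python
import pandas as pd

# Hypothetical toy rows, not taken from the real dataset.
toy = pd.DataFrame({"helpful": [[0, 0], [2, 3], [5, 5], [1, 4]]})

# .str[i] picks the i-th element of each list in the column.
first = toy["helpful"].str[0]
second = toy["helpful"].str[1]

# A total vote count could never be smaller than the helpful count.
second_never_smaller = bool((second >= first).all())

# Under the assumed "total votes" reading, unhelpful votes are the difference.
unhelpful_if_total = [int(v) for v in (second - first)]

print(second_never_smaller, unhelpful_if_total)
```

On the real data, `toy` would be replaced by `dataframe`. A single row where the second entry is smaller than the first would rule out the "total votes" reading.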

In [89]:
#extract additional columns for analysis
dataframe['numberUnhelpful']= [y for [x,y] in dataframe['helpful']]
dataframe['numberHelpful']= [x for [x,y] in dataframe['helpful'] ]
dataframe['reviewLength'] = [len(x) for x in dataframe['reviewText']]

#extract information for number of reviews by each reviewer
reviewerNames = list(dataframe['reviewerName'])
authorToNumReviews = {name:reviewerNames.count(name) for name in set(reviewerNames)}
dataframe['reviewerNumberReviews'] = [authorToNumReviews[author] for author in dataframe['reviewerName']]

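The cell above counts reviews per reviewerName. Names are not guaranteed to be unique and are missing for some rows, so counting per reviewerID may be a safer alternative. The following is a minimal sketch on hypothetical toy data. The column name `reviewsByID` is an illustrative choice, and the resulting numbers on the real data may differ from those reported in this notebook.

```python
import pandas as pd

# Hypothetical toy data: three distinct reviewers, two of whom share a name
# and one of whom has no name at all.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A1", "A3", "A2", "A1"],
    "reviewerName": ["Sam", "Sam", "Sam", None, "Sam", "Sam"],
})

# Count the reviews in each reviewerID group and broadcast the count
# back to every row of that group.
toy["reviewsByID"] = toy.groupby("reviewerID")["reviewerID"].transform("count")

# Distinct IDs versus distinct names (missing names are not counted).
distinct_ids = toy["reviewerID"].nunique()
distinct_names = toy["reviewerName"].nunique()

print(toy["reviewsByID"].tolist(), distinct_ids, distinct_names)
```

A grouped count also avoids calling `list.count` once per distinct name, which rescans the whole list each time.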
<br/>
We converted reviewText and summary to a bag-of-words representation so that the text could be used in our analysis and prediction. The resulting matrices are stored in two variables, <i>review_bag</i> and <i>summary_bag</i>.
<br/> <br/>


In [90]:
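from sklearn.feature_extraction.text import CountVectorizer

# Use one vectorizer per text column: refitting a shared vectorizer on the
# summaries would overwrite the vocabulary learned from the review texts.
count_vect = CountVectorizer()
summary_vect = CountVectorizer()
review_bag = count_vect.fit_transform(dataframe['reviewText'])
summary_bag = summary_vect.fit_transform(dataframe['summary'])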

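To illustrate what the bag-of-words matrices contain, here is a small sketch on hypothetical toy sentences. Each row is a document, each column is a word of the learned vocabulary, and each cell is a word count. The sketch also shows that a vectorizer keeps only the vocabulary of its most recent fit.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy texts, not taken from the dataset.
toy_reviews = ["great show great cast", "boring show"]

vect = CountVectorizer()
bag = vect.fit_transform(toy_reviews)

# Columns follow the alphabetically sorted vocabulary.
vocab = sorted(vect.vocabulary_)
counts = bag.toarray().tolist()
print(vocab)   # ['boring', 'cast', 'great', 'show']
print(counts)  # [[0, 1, 2, 1], [1, 0, 0, 1]]

# Fitting the same vectorizer again replaces the vocabulary.
vect.fit_transform(["loved it"])
vocab_after_refit = sorted(vect.vocabulary_)
print(vocab_after_refit)  # ['it', 'loved']
```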
<br/> 
We discarded the attributes we did not need for our analysis and made sure that the ordinal attributes were stored as 64-bit integers.
<br/>

In [91]:
import numpy as np
unneeded_attributes = ['unixReviewTime', 'helpful', 'reviewTime']
ordinal_attributes = ['numberHelpful', 'numberUnhelpful', 'reviewLength', 'overall', 'reviewerNumberReviews']
nominal_attributes = ['asin', 'reviewerID', 'reviewerName']

for attr in unneeded_attributes:
    if attr in dataframe:
        del dataframe[attr]

dataframe[ordinal_attributes] = dataframe[ordinal_attributes].astype(np.int64)
dataframe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37126 entries, 0 to 37125
Data columns (total 10 columns):
asin                     37126 non-null object
overall                  37126 non-null int64
reviewText               37126 non-null object
reviewerID               37126 non-null object
reviewerName             36797 non-null object
summary                  37126 non-null object
numberUnhelpful          37126 non-null int64
numberHelpful            37126 non-null int64
reviewLength             37126 non-null int64
reviewerNumberReviews    37126 non-null int64
dtypes: int64(5), object(5)
memory usage: 2.8+ MB

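The cell above leaves the nominal attributes as generic objects. One possible refinement, shown here only as a sketch on hypothetical toy data, is to store them with the pandas `category` dtype. The sketch also shows `drop` with `errors="ignore"` as an equivalent of the deletion loop. Whether this is worthwhile for the real dataset is an assumption that depends on the later analysis.

```python
import pandas as pd

# Hypothetical toy frame with a subset of the notebook's column names.
toy = pd.DataFrame({
    "asin": ["B001", "B001", "B002"],
    "reviewerID": ["A1", "A2", "A1"],
    "overall": [5.0, 3.0, 4.0],
    "unixReviewTime": [1, 2, 3],
})

# Drop unneeded columns; names that are absent are silently skipped.
toy = toy.drop(columns=["unixReviewTime", "helpful"], errors="ignore")

# Ordinal attribute stored as a 64-bit integer.
toy["overall"] = toy["overall"].astype("int64")

# Nominal attributes stored as categories.
for col in ["asin", "reviewerID"]:
    toy[col] = toy[col].astype("category")

print(toy.dtypes)
```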

### 2.2. Data quality

The dataset was not missing any values for the attributes we wanted to use. It was, however, missing 329 reviewer names, as can be seen from the outputs above and below. We decided to keep these rows as they are, because reviewerID is another unique identifier for reviewers that we could use to associate a reviewer with a review. Reviewer names were also not part of our analysis or prediction, so the missing names would not skew our results.
<br/>
The most probable reason for the missing reviewerName values is that Amazon allowed reviewers to post anonymously and hid their names in the review. Since only logged-in members are allowed to post reviews, we can assume that reviewerID is sufficient to identify reviews by the same member.

In [97]:
# Select the rows that have at least one missing value
null_data = dataframe[dataframe.isnull().any(axis=1)]
len(null_data)

329

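The count above covers rows with a missing value in any column. To confirm which columns are affected, a per-column count can help. This is a minimal sketch on hypothetical toy data that uses a few of the notebook's column names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing reviewer name.
toy = pd.DataFrame({
    "reviewerID": ["A1", "A2", "A3"],
    "reviewerName": ["Sam", np.nan, "Alex"],
    "overall": [5, 4, 1],
})

# Number of missing values in each column.
missing_per_column = toy.isnull().sum()

# Names of the columns that have at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()

print(missing_per_column)
print(columns_with_missing)
```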
<br/> 
Furthermore, we decided not to treat entries where numberHelpful and numberUnhelpful are 0 as missing. A zero is not an erroneous entry; it may simply mean that no other customer voted on the review. Deleting or imputing these values could therefore damage our final result. The summary statistics below show a median of 0 for both attributes, so at least half of the reviews have a value of 0 for each of them.
<br/>

In [93]:
dataframe.describe()

| | overall | numberUnhelpful | numberHelpful | reviewLength | reviewerNumberReviews |
|---|---|---|---|---|---|
| count | 37126.0 | 37126.0 | 37126.0 | 37126.0 | 37126.0 |
| mean | 4.20953 | 2.019016 | 1.293541 | 515.292033 | 25.152562 |
| std | 1.11855 | 10.086076 | 8.301778 | 835.14561 | 88.755763 |
| min | 1.0 | 0.0 | 0.0 | 4.0 | 1.0 |
| 25% | 4.0 | 0.0 | 0.0 | 145.0 | 5.0 |
| 50% | 5.0 | 0.0 | 0.0 | 232.0 | 7.0 |
| 75% | 5.0 | 1.0 | 1.0 | 484.0 | 11.0 |
| max | 5.0 | 512.0 | 484.0 | 18152.0 | 645.0 |

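The quartiles only bound the share of zeros from below. The exact share can be computed directly, as in this sketch on hypothetical toy counts (the real values would come from `dataframe`).

```python
import pandas as pd

# Hypothetical toy vote counts, not taken from the dataset.
toy = pd.DataFrame({
    "numberHelpful": [0, 0, 3, 0, 1],
    "numberUnhelpful": [0, 1, 0, 0, 0],
})

# Fraction of rows that are 0 in each column (mean of a boolean mask).
zero_share = (toy == 0).mean()

# Fraction of rows that are 0 in both columns at once.
both_zero_share = (toy == 0).all(axis=1).mean()

print(zero_share)
print(both_zero_share)
```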

<br/>
The dataset we chose had no duplicate entries, so we did not need to handle them. There were no instances where the same reviewer had posted on the same item twice, and therefore none where the same review was posted twice on an item. Had we found such cases, we would have removed all of the affected entries, as they could be interpreted as a mistake in the data collection.

In [96]:
len(dataframe[dataframe.duplicated(['asin','reviewerID'],keep=False)])

0
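
The check above can be illustrated on hypothetical toy data that does contain a repeated (asin, reviewerID) pair. With `keep=False`, every row of a duplicated pair is flagged, which matches the stated plan of removing all such entries. The second check, for the same text posted by the same reviewer on different items, is an additional idea and not part of the original analysis.

```python
import pandas as pd

# Hypothetical toy data: reviewer A1 reviews item B001 twice and
# reuses the same text on item B002.
toy = pd.DataFrame({
    "asin": ["B001", "B001", "B002", "B001"],
    "reviewerID": ["A1", "A2", "A1", "A1"],
    "reviewText": ["Loved it", "Too slow", "Loved it", "Loved it again"],
})

# All rows whose (asin, reviewerID) pair occurs more than once.
same_pair = toy[toy.duplicated(["asin", "reviewerID"], keep=False)]

# All rows where a reviewer posted identical text more than once.
same_text = toy[toy.duplicated(["reviewerID", "reviewText"], keep=False)]

# Remove every row that belongs to a duplicated (asin, reviewerID) pair.
deduplicated = toy.drop_duplicates(["asin", "reviewerID"], keep=False)

print(len(same_pair), len(same_text), len(deduplicated))
```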