 EDA on Amazon Fine Food Review dataset
 ===

# Mount Google Drive

In [1]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Import Required Modules

In [0]:
import sqlite3
import pandas as pd
import numpy as np

# Load Data

In [3]:
# Using sqlite read data from the database
con = sqlite3.connect('/content/drive/My Drive/Colab Notebooks/AFF-Review/database.sqlite')

# Get reviews which do not have score as 3
filtered_data = pd.read_sql_query(""" SELECT * FROM Reviews WHERE Score != 3 """, con)
filtered_data.head()

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


# High-Level Statistics

In [4]:
filtered_data.describe()

Unnamed: 0,Id,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time
count,525814.0,525814.0,525814.0,525814.0,525814.0
mean,284599.060038,1.747293,2.209544,4.279148,1295943000.0
std,163984.038077,7.575819,8.195329,1.316725,48281290.0
min,1.0,0.0,0.0,1.0,939340800.0
25%,142730.25,0.0,0.0,4.0,1270598000.0
50%,284989.5,0.0,1.0,5.0,1310861000.0
75%,426446.75,2.0,2.0,5.0,1332634000.0
max,568454.0,866.0,878.0,5.0,1351210000.0


## Features / Labels

In [5]:
filtered_data.columns

Index(['Id', 'ProductId', 'UserId', 'ProfileName', 'HelpfulnessNumerator',
       'HelpfulnessDenominator', 'Score', 'Time', 'Summary', 'Text'],
      dtype='object')

In [6]:
filtered_data.dtypes

Id                         int64
ProductId                 object
UserId                    object
ProfileName               object
HelpfulnessNumerator       int64
HelpfulnessDenominator     int64
Score                      int64
Time                       int64
Summary                   object
Text                      object
dtype: object

### Observation
- 10 features are given in total
- No labels are given
- The feature descriptions below are taken from the Kaggle dataset page
  - https://www.kaggle.com/snap/amazon-fine-food-reviews
- Id
  - Row Id
- ProductId
  - Unique identifier for the product
- UserId
  - Unique identifier for the user
- ProfileName
  - Profile name of the user
- HelpfulnessNumerator
  - Number of users who found the review helpful
- HelpfulnessDenominator
  - Number of users who indicated whether they found the review helpful
- Score
  - Rating between 1 and 5
- Time
  - Timestamp for the review
- Summary
  - Brief summary of the review
- Text
  - Text of the review
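
The `Time` values in the `head()` output look like Unix epoch seconds. That is an assumption based on their magnitude, not something the column listing states. Under that assumption, a minimal sketch of converting them to readable dates could look like this; `sample` and `ReviewDate` are hypothetical names used only for illustration.

```python
import pandas as pd

# Toy frame holding three Time values copied from the head() output above.
sample = pd.DataFrame({"Time": [1303862400, 1346976000, 1219017600]})

# Assumption: Time is stored as seconds since the Unix epoch (UTC).
sample["ReviewDate"] = pd.to_datetime(sample["Time"], unit="s")
print(sample)
```

The same call should work on `filtered_data['Time']` if the assumption about the unit holds.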

# Data Cleaning

## Analysis

### Id

In [7]:
u = filtered_data.Id.value_counts()
u.unique()

array([1])

#### Observation
- No Id is repeated
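
As an alternative to inspecting `value_counts()`, pandas offers a direct uniqueness check. The sketch below uses a hypothetical toy Series `ids` rather than the real column.

```python
import pandas as pd

# Hypothetical toy ids standing in for filtered_data.Id.
ids = pd.Series([1, 2, 3, 4, 5], name="Id")

# True when no value occurs more than once.
print(ids.is_unique)

# Number of rows that repeat an earlier value.
print(int(ids.duplicated().sum()))
```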

### ProductId

In [8]:
len(filtered_data.ProductId.unique())

72005

#### Observation
- 72005 Products

### UserId

In [9]:
len(filtered_data.UserId.unique())

243414

#### Observation
- 243414 Users

### HelpfulnessNumerator 

In [10]:
print(filtered_data.HelpfulnessNumerator.min(),
      filtered_data.HelpfulnessNumerator.max(),
      len(filtered_data.HelpfulnessNumerator.unique()))

0 866 222


#### Observation
- Values range from 0 to 866
- 222 unique entries

### HelpfulnessDenominator

In [11]:
print(filtered_data.HelpfulnessDenominator.min(),
      filtered_data.HelpfulnessDenominator.max(),
      len(filtered_data.HelpfulnessDenominator.unique()))

0 878 227


In [12]:
# As per the feature details, Denominator should be greater than or equal to Numerator
# Let's check whether the data follows that description
filtered_data[(filtered_data.HelpfulnessDenominator < filtered_data.HelpfulnessNumerator)]

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
41159,44737,B001EQ55RW,A2V0I904FH7ABY,Ram,3,2,4,1212883200,Pure cocoa taste with crunchy almonds inside,It was almost a 'love at first bite' - the per...
59301,64422,B000MIDROQ,A161DK06JJMCYF,"J. E. Stephens ""Jeanne""",3,1,5,1224892800,Bought This for My Son at College,My son loves spaghetti so I didn't hesitate or...


#### Observation
- Values range from 0 to 878
- 227 unique entries
- **2 invalid entries found**
  - Numerator is greater than Denominator
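
The validity rule can be sketched as a boolean mask on a small frame. The first two rows of the hypothetical `toy` frame below reuse the helpfulness values of the two invalid entries shown above, and the third row is a valid entry from the `head()` output.

```python
import pandas as pd

# Toy frame: two invalid rows from the output above plus one valid row.
toy = pd.DataFrame({
    "Id": [44737, 64422, 1],
    "HelpfulnessNumerator": [3, 3, 1],
    "HelpfulnessDenominator": [2, 1, 1],
})

# A row is invalid when more users found the review helpful than voted at all.
invalid = toy["HelpfulnessNumerator"] > toy["HelpfulnessDenominator"]
print(toy[invalid])
print("invalid rows:", int(invalid.sum()))
```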

### Score

In [13]:
filtered_data.Score.unique()

array([5, 1, 4, 2])

In [14]:
filtered_data.Score.value_counts()

5    363122
4     80655
1     52268
2     29769
Name: Score, dtype: int64

#### Observation
- Scores take the values 1, 2, 4 and 5 (reviews with score 3 were filtered out when loading)
- No invalid entries found
- **The number of data points differs widely between scores**
  - We have an IMBALANCED dataset
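
To put a number on the imbalance, the counts printed above can be turned into proportions. This is only a sketch built from those printed counts; `score_counts`, `proportions` and `imbalance_ratio` are hypothetical names.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
score_counts = pd.Series({5: 363122, 4: 80655, 1: 52268, 2: 29769})

# Share of each score among all retained reviews.
proportions = score_counts / score_counts.sum()
print(proportions.round(3))

# Ratio between the most and the least frequent score.
imbalance_ratio = score_counts.max() / score_counts.min()
print(round(imbalance_ratio, 1))
```

On the real frame, `filtered_data.Score.value_counts(normalize=True)` gives the proportions directly.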

### Time

In [15]:
len(filtered_data.Time.unique())

3157

In [0]:
#filtered_data['Time'].value_counts()

In [0]:
# Check whether any user has entries with the same timestamp for more than one product,
# which is practically impossible
userid_group = filtered_data.groupby('UserId')
#g = userid_group.groups
#g.values()

In [18]:
userid_group.filter(lambda x:len(x)>1).sort_values('Time')

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
346055,374359,B00004CI84,A344SMIA5JECGM,Vincent P. Ross,1,2,5,944438400,A modern day fairy tale,"A twist of rumplestiskin captured on film, sta..."
417859,451878,B00004CXX9,A344SMIA5JECGM,Vincent P. Ross,1,2,5,944438400,A modern day fairy tale,"A twist of rumplestiskin captured on film, sta..."
212472,230285,B00004RYGX,A344SMIA5JECGM,Vincent P. Ross,1,2,5,944438400,A modern day fairy tale,"A twist of rumplestiskin captured on film, sta..."
346116,374422,B00004CI84,A1048CYU0OV4O8,Judy L. Eans,2,2,5,947376000,GREAT,THIS IS ONE MOVIE THAT SHOULD BE IN YOUR MOVIE...
417927,451949,B00004CXX9,A1048CYU0OV4O8,Judy L. Eans,2,2,5,947376000,GREAT,THIS IS ONE MOVIE THAT SHOULD BE IN YOUR MOVIE...
212533,230348,B00004RYGX,A1048CYU0OV4O8,Judy L. Eans,2,2,5,947376000,GREAT,THIS IS ONE MOVIE THAT SHOULD BE IN YOUR MOVIE...
417847,451864,B00004CXX9,A1B2IZU1JLZA6,Wes,19,23,1,948240000,WARNING: CLAMSHELL EDITION IS EDITED TV VERSION,"I, myself always enjoyed this movie, it's very..."
212458,230269,B00004RYGX,A1B2IZU1JLZA6,Wes,19,23,1,948240000,WARNING: CLAMSHELL EDITION IS EDITED TV VERSION,"I, myself always enjoyed this movie, it's very..."
346041,374343,B00004CI84,A1B2IZU1JLZA6,Wes,19,23,1,948240000,WARNING: CLAMSHELL EDITION IS EDITED TV VERSION,"I, myself always enjoyed this movie, it's very..."
346141,374450,B00004CI84,ACJR7EQF9S6FP,Jeremy Robertson,2,3,4,951523200,Bettlejuice...Bettlejuice...BETTLEJUICE!,What happens when you say his name three times...


In [19]:
#filtered_data[filtered_data['Summary'].str.contains('book')]
#type(filtered_data[filtered_data['Summary'].str.contains('book')].index.tolist())

#suspicious_indices = []
#
#l = filtered_data[filtered_data['Summary'].str.contains('book')].index.tolist()
#print("No. of entries having '{0}' is {1}".format('book', len(l)))
#suspicious_indices = suspicious_indices + l
#
#l = filtered_data[filtered_data['Summary'].str.contains('film')].index.tolist()
#print("No. of entries having '{0}' is {1}".format('film', len(l)))
#suspicious_indices = suspicious_indices + l
#
#l = filtered_data[filtered_data['Summary'].str.contains('Film')].index.tolist()
#print("No. of entries having '{0}' is {1}".format('Film', len(l)))
#suspicious_indices = suspicious_indices + l
#
#l = filtered_data[filtered_data['Summary'].str.contains('Book')].index.tolist()
#print("No. of entries having '{0}' is {1}".format('Book', len(l)))
#suspicious_indices = suspicious_indices + l

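def getEntriesHavingTexts(df, col_to_search, text_list):
  # Return the index labels of rows in df whose col_to_search matches each
  # regex pattern in text_list, plus the number of matches per pattern
  indices = []
  counts = []
  for text in text_list:
    # Search the DataFrame passed in, not the global filtered_data
    l = df[df[col_to_search].str.contains(text)].index.tolist()
    counts.append(len(l))
    indices = indices + l
  return indices, counts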
  

text_list = ['[bB]ook', '[fF]ilm']
suspicious_indices, counts = getEntriesHavingTexts(filtered_data,
                                       'Summary',
                                       text_list)

for i in range(len(counts)):
  print("No. of entries having '{0}' is {1}".format(text_list[i], counts[i]))


No. of entries having '[bB]ook' is 85
No. of entries having '[fF]ilm' is 24


In [20]:
print('Total suspicious entries : ', len(suspicious_indices))
#filtered_data.loc[suspicious_indices]
# suspicious_indices holds index labels, so select with .loc rather than .iloc
filtered_data.loc[suspicious_indices[:4]]

Total suspicious entries :  109


Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
15797,17270,B000GPXRRW,A2R6RA8FRBS608,Matthew G. Sherwin,3,3,5,1185580800,Vanilla caramel Coffee-mate is TOPS in my book...,Vanilla caramel flavored Coffee-mate makes for...
40441,43949,B000CDZY8I,AI8Z1GP75O9J4,"Phyllis J. Kirk ""sttchurchlady""",2,3,5,1196553600,"Lindt Chocolate truffles -- not a book, but C...","This is excellent Belgian Chocolate, not readi..."
60089,65280,B00060ONFW,A3F0IH7U4M0O05,J. D. Smith,2,2,1,1296604800,No Hero in my book,I paid twice as much for shipping as the cost ...
65402,71046,B0007WGV6S,A1EGZYG8PC51U5,"D. Wilson ""SonRisedInTheEast""",6,7,4,1189036800,As good as candy gets in my book! So why only...,Milk Duds and I have had a longstanding love a...


#### Observation
- There are duplicates
  - The same user has identical reviews for more than one product at the same timestamp, which is not plausible
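
One way to surface such duplicates is `DataFrame.duplicated` with `keep=False`, which flags every member of a duplicate group. The sketch below uses a hypothetical `toy` frame; the ids and timestamps are copied from the output above, while the review texts are shortened placeholders.

```python
import pandas as pd

# Toy frame: three copies of one review across products plus an unrelated review.
toy = pd.DataFrame({
    "ProductId": ["B00004CI84", "B00004CXX9", "B00004RYGX", "B001E4KFG0"],
    "UserId": ["A344SMIA5JECGM"] * 3 + ["A3SGXH7AUHU8GW"],
    "Time": [944438400, 944438400, 944438400, 1303862400],
    "Text": ["placeholder review A"] * 3 + ["placeholder review B"],
})

# Rows sharing user, timestamp and text are treated as the same review.
key = ["UserId", "Time", "Text"]
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)
```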

## Cleaning

### Drop Duplicates

In [0]:
# Sort by ProductId in ascending order so that, within each group of duplicates,
# the entry with the smallest ProductId comes first and is the one kept
sorted_data = filtered_data.sort_values('ProductId', axis=0, ascending=True, inplace=False, na_position='last')

In [22]:
# Keep the first entry and drop the remaining duplicates
# (subset is given as a list rather than a set so the column order is explicit)
final_data = sorted_data.drop_duplicates(subset=['UserId', 'ProfileName', 'Time', 'Text'], keep='first', inplace=False)
print(final_data.shape)

(364173, 10)


### Remove Invalid Helpfulness Entries

In [23]:
final_data = final_data[final_data.HelpfulnessNumerator <= final_data.HelpfulnessDenominator]
print(final_data.shape)

(364171, 10)


### Remove Invalid Summary Entries - TO DO

### Analyse and Remove any invalid entries in review text - TO DO

## Observation Summary

- TO DO