In [76]:
import numpy as np
import pandas as pd
import sqlite3
%matplotlib inline

# Amazon Fine Food Reviews, User Classification
The objective of this project is to identify the author of a review from its text. This is a supervised classification problem: given a reasonable sample of the data, the model should make accurate predictions about who wrote a review. One of the first issues to consider is that not every reviewer has written multiple reviews. A reviewer with only one review cannot be represented in both the training set and the testing set, so the model could not be both trained and evaluated on that reviewer. This was accounted for by only considering users who have written more than a threshold number of reviews (more than 30 in this notebook).
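To make the train/test constraint concrete, here is a minimal sketch of a per-user split in which every retained user contributes reviews to both sets. The function name `split_per_user`, the `test_fraction` default, and the fixed seed are illustrative assumptions, not the project's actual splitting code.

```python
import random
from collections import defaultdict


def split_per_user(rows, test_fraction=0.25, seed=0):
    """Split (user_id, text) rows so each kept user is in both sets.

    Hypothetical helper: users with fewer than two reviews are dropped,
    because they cannot appear in both the training and testing set.
    """
    by_user = defaultdict(list)
    for user_id, text in rows:
        by_user[user_id].append(text)

    rng = random.Random(seed)
    train, test = [], []
    for user_id, texts in by_user.items():
        if len(texts) < 2:
            continue  # a single review cannot be split across two sets
        texts = texts[:]
        rng.shuffle(texts)
        # Hold out at least one review, but always leave one for training.
        n_test = max(1, int(len(texts) * test_fraction))
        n_test = min(n_test, len(texts) - 1)
        test += [(user_id, t) for t in texts[:n_test]]
        train += [(user_id, t) for t in texts[n_test:]]
    return train, test


rows = [("a", "r1"), ("a", "r2"), ("a", "r3"), ("a", "r4"), ("b", "only")]
train, test = split_per_user(rows)
```

In this toy input, user `b` has a single review and is dropped, which is the same effect the review-count threshold has on the real data.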

## Data Collection
The data was obtained from <a href="https://www.kaggle.com/snap/amazon-fine-food-reviews">Kaggle</a>. It consists of roughly 500,000 Amazon fine food reviews and is distributed both as a CSV file and as a SQLite database. For this project I chose to query the SQLite database. Overall the data was very clean and needed almost no cleaning to produce a usable copy.

In [78]:
db = sqlite3.connect("/data/amazon-fine-foods/amazon-fine-foods/database.sqlite")

In [90]:
toy = pd.read_sql_query("SELECT * from reviews limit 10", db)
toy.head(5)

Unnamed: 0,Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,HelpfulnessDenominator,Score,Time,Summary,Text
0,1,B001E4KFG0,A3SGXH7AUHU8GW,delmartian,1,1,5,1303862400,Good Quality Dog Food,I have bought several of the Vitality canned d...
1,2,B00813GRG4,A1D87F6ZCVE5NK,dll pa,0,0,1,1346976000,Not as Advertised,Product arrived labeled as Jumbo Salted Peanut...
2,3,B000LQOCH0,ABXLMWJIXXAIN,"Natalia Corres ""Natalia Corres""",1,1,4,1219017600,"""Delight"" says it all",This is a confection that has been around a fe...
3,4,B000UA0QIQ,A395BORC6FGVXV,Karl,3,3,2,1307923200,Cough Medicine,If you are looking for the secret ingredient i...
4,5,B006K2ZZ7K,A1UQRSCLF8GW1T,"Michael D. Bigham ""M. Wassir""",0,0,5,1350777600,Great taffy,Great taffy at a great price. There was a wid...


In [80]:
# Count reviews per user and keep only users with more than 30 reviews.
counts = pd.read_sql_query("select UserId, count(UserId) as Count \
                           from Reviews \
                           group by UserId \
                           having Count > 30", db)

In [81]:
counts.head()

Unnamed: 0,UserId,Count
0,A100WO06OQR8BQ,55
1,A106ZCP7RSXMRU,60
2,A1080SE9X3ECK0,72
3,A10AFVU66A79Y1,41
4,A10G136JEISLVR,51


In [82]:
counts.shape

(703, 2)

In [83]:
text = pd.read_sql_query("select UserId, Text \
                           from Reviews", db)

In [84]:
text.head()

Unnamed: 0,UserId,Text
0,A3SGXH7AUHU8GW,I have bought several of the Vitality canned d...
1,A1D87F6ZCVE5NK,Product arrived labeled as Jumbo Salted Peanut...
2,ABXLMWJIXXAIN,This is a confection that has been around a fe...
3,A395BORC6FGVXV,If you are looking for the secret ingredient i...
4,A1UQRSCLF8GW1T,Great taffy at a great price. There was a wid...


Below is a sample of the review text. The text should be preprocessed to remove punctuation.

In [91]:
text["Text"][7]

'This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!'
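The punctuation removal mentioned above could look something like the following. This is a minimal sketch using only the standard library; the helper name `strip_punctuation` is hypothetical, and the project's actual preprocessing may differ.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(review):
    """Remove ASCII punctuation and collapse runs of whitespace."""
    return " ".join(review.translate(_PUNCT_TABLE).split())


sample = "This taffy is so good.  It is very soft and chewy.  Very satisfying!!"
cleaned = strip_punctuation(sample)
# cleaned == "This taffy is so good It is very soft and chewy Very satisfying"
```

Applied to the DataFrame, this could be written as `text["Text"].map(strip_punctuation)`. Note that this approach also removes apostrophes, so "don't" becomes "dont"; whether that is acceptable depends on the features extracted later.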

In [96]:
# Confirm that neither column contains missing values.
assert not text["Text"].isnull().any()
assert not text["UserId"].isnull().any()

In [86]:
text.shape

(568454, 2)

Merge the two tables to keep only the reviews written by users who meet the minimum review count.

In [89]:
reviews = pd.merge(text, counts, on="UserId", how="inner")

In [88]:
reviews.shape

(39517, 3)

The filtered dataset consists of 39,517 reviews from 703 users, considering only reviewers who have written more than 30 reviews. With the data in a usable format, it is time to gain a better understanding of how the data looks. From this notebook I created the query_data function found in support.py.
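The contents of support.py are not shown here, so the following is only a hypothetical sketch of how the count filter and the text query above could be combined into a single SQL statement instead of a pandas merge. The name `query_reviews_sketch`, the `threshold` parameter, and the in-memory demo table are assumptions made for illustration.

```python
import sqlite3

# Join each review to the set of users with more than `?` reviews.
FILTERED_REVIEWS_SQL = """
    SELECT r.UserId, r.Text
    FROM Reviews AS r
    JOIN (
        SELECT UserId
        FROM Reviews
        GROUP BY UserId
        HAVING COUNT(*) > ?
    ) AS active ON active.UserId = r.UserId
"""


def query_reviews_sketch(connection, threshold=30):
    """Return (UserId, Text) rows for users with more than `threshold` reviews."""
    return connection.execute(FILTERED_REVIEWS_SQL, (threshold,)).fetchall()


# Tiny in-memory table standing in for the real database (assumption).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Reviews (UserId TEXT, Text TEXT)")
demo.executemany(
    "INSERT INTO Reviews VALUES (?, ?)",
    [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "d")],
)
rows = query_reviews_sketch(demo, threshold=2)
```

To get a DataFrame instead of a list of tuples, the same SQL string could be passed to `pd.read_sql_query` with `params=(threshold,)`.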