In [7]:
import numpy as np
import pandas as pd
import sqlite3
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from support import query_data, unique_values
from sklearn.manifold import TSNE
%matplotlib inline

In [2]:
db = sqlite3.connect("/data/amazon-fine-foods/amazon-fine-foods/database.sqlite")

# Amazon Fine Food Reviews, User Classification
The objective of this project is to identify the author of a text review. This is a supervised classification problem: given a reasonable sample of each user's reviews, the model should accurately predict which user wrote an unseen review. One of the first issues to consider is that many reviewers have written only a few reviews. This makes training and testing difficult, because a user with a single review cannot appear in both the training set and the test set. This was accounted for by only considering users who have written at least a threshold number of reviews.
Note: Do not actually run through this entire notebook. Many of the visualizations will not load. Reference "2 - Exploration" for the visualization code. 

[1. Data Collection](#Data-Collection)

[2. Visualization](#Visualization)

[3. Model Fitting](#Model-Fitting)

[4. Results](#Results)

## Data Collection
The data was obtained from <a href="https://www.kaggle.com/snap/amazon-fine-food-reviews">Kaggle</a>. It consists of ~500,000 Amazon fine food reviews. The dataset came as a CSV file as well as a SQLite database. For this project I chose to query the SQLite database. Overall the data was very clean and needed almost no cleaning to create a usable copy.

In [10]:
text = pd.read_sql_query("select UserId, Text \
                           from Reviews", db)
text.head()

Unnamed: 0,UserId,Text
0,A3SGXH7AUHU8GW,I have bought several of the Vitality canned d...
1,A1D87F6ZCVE5NK,Product arrived labeled as Jumbo Salted Peanut...
2,ABXLMWJIXXAIN,This is a confection that has been around a fe...
3,A395BORC6FGVXV,If you are looking for the secret ingredient i...
4,A1UQRSCLF8GW1T,Great taffy at a great price. There was a wid...


In [18]:
text["Text"][7]

'This taffy is so good.  It is very soft and chewy.  The flavors are amazing.  I would definitely recommend you buying it.  Very satisfying!!'

The dataset was limited to only consider users that had a large volume of textual data.

In [17]:
review_min = 75
reviews, rev_count, user_count = query_data(review_min, db)
print(str(user_count), "users,", str(rev_count), "reviews")

104 users, 13218 reviews
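
The implementation of `query_data` lives in the `support` module and is not shown here. As a rough sketch of the same idea in pandas, the filter below keeps only users with at least `review_min` reviews. The function name `filter_active_users` and the toy data are hypothetical and only illustrate the thresholding step.

```python
import pandas as pd


def filter_active_users(df, review_min):
    """Keep only rows whose UserId appears at least `review_min` times."""
    counts = df["UserId"].value_counts()
    active = counts[counts >= review_min].index
    return df[df["UserId"].isin(active)].reset_index(drop=True)


# Toy data standing in for the Reviews table (hypothetical values).
toy = pd.DataFrame({
    "UserId": ["A", "A", "A", "B", "C", "C"],
    "Text": ["r1", "r2", "r3", "r4", "r5", "r6"],
})

subset = filter_active_users(toy, review_min=2)
print(subset["UserId"].nunique(), "users,", len(subset), "reviews")
# prints: 2 users, 5 reviews
```

User B has a single review and is dropped, which is the situation the threshold is meant to avoid.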


### Methods Used
Bag O' Words - TFIDF - term frequency–inverse document frequency 

<img src="td-idf-graphic.png">

<a href="http://filotechnologia.blogspot.com/2014/01/a-simple-java-class-for-tfidf-scoring.html">source</a>

Ex: "is" is very common across all documents, so it receives a low weight.
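
To make the "is" example concrete, here is a minimal sketch on a made-up three-review corpus (the sentences are not from the dataset). It fits a `TfidfVectorizer` with default settings and compares the inverse document frequency of a word that appears everywhere with one that appears once.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up reviews; "is" appears in every one, "taffy" in only one.
docs = [
    "this taffy is great",
    "the coffee is bitter",
    "this tea is mild",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(docs)

# Map each term to its learned inverse document frequency.
idf = {term: vectorizer.idf_[index]
       for term, index in vectorizer.vocabulary_.items()}

print("is:   ", round(idf["is"], 3))
print("taffy:", round(idf["taffy"], 3))
```

The term that occurs in every document ends up with the smallest IDF, so it contributes least to each review's vector.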

# Visualization
The next step is to get a better understanding of the data and explore the effect of the parameters of TFIDF vectorization. This is done by iterating over several TfidfVectorizer parameters and visualizing the effect each one has. A smaller subset of the data is used to produce a clearer visualization. The following document on <a href="http://nbviewer.jupyter.org/urls/gist.githubusercontent.com/AlexanderFabisch/1a0c648de22eff4a2a3e/raw/59d5bc5ed8f8bfd9ff1f7faa749d1b095aa97d5a/t-SNE.ipynb">TSNE</a> (t-Distributed Stochastic Neighbor Embedding) was the main source used for the visualization. The TSNE reduction can take a while on a large set. For the subset size used below, the process takes around a minute depending on the number of features.

In [None]:
plt.figure(figsize=(15,10))
reviews, rev_count, user_count = query_data(100, db)
colors = unique_values(reviews)
vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(reviews["Text"])
X_transformed = TruncatedSVD(n_components=50).fit_transform(matrix)
X_embedded = TSNE(n_components=2, perplexity=40, verbose=0).fit_transform(X_transformed)
# The 'spectral' colormap was renamed 'nipy_spectral' in newer matplotlib releases.
plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=colors, cmap=plt.get_cmap('nipy_spectral', 10))
plt.axis("off")
plt.title("t-SNE visualization of TFIDF vectorization on Amazon Reviews ("
                                               + str(user_count) + " users, "
                                               + str(rev_count) + " reviews)");

<img src="images/fullviz.png">

Look at all the pretty colors! Unfortunately this data is difficult to process in its current state. Reducing the number of users as well as isolating parameters will yield a better understanding of the data. 

## Visualization of TFIDF Parameters
To get a better understanding of TFIDF and the dataset, a smaller subset of the data is analyzed. The reviews are vectorized with TFIDF, reduced with SVD, and then embedded with TSNE. SVD alone does not produce a very good visualization; it is mainly used to reduce the dimensionality to a size that TSNE can process.

By only looking at the 7 most active reviewers, we can begin to very clearly see the groupings of users' review patterns. Interestingly, users appear to form multiple clusters, which suggests that a single user can have several different writing styles.

### Ngram Range
The ngram range controls the sizes of ngrams considered in the counting stage. Users who frequently use the same sequences of words should therefore become more distinguishable as larger ngrams are included.
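
As a small illustration of what `ngram_range` changes, the sketch below fits the vectorizer on a single made-up sentence and lists the features produced for a few ranges. The sentence and the `features` dictionary are only for demonstration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

doc = ["great taffy at a great price"]

features = {}
for ngram_range in [(1, 1), (1, 2), (2, 2)]:
    vectorizer = TfidfVectorizer(ngram_range=ngram_range).fit(doc)
    # vocabulary_ maps each extracted ngram to a column index.
    features[ngram_range] = sorted(vectorizer.vocabulary_)
    print(ngram_range, features[ngram_range])
```

Note that the default token pattern drops single-character tokens such as "a" before ngrams are formed, so the bigrams include "at great" rather than "at a".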

In [6]:
parameters = {
    'ngram_range':((1, 1), (1, 2), (2, 2), (4, 4))
}

<img src="images/ngrams.png">

### Preprocessor
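A preprocessor function can be applied to the text before it is tokenized and counted. A simple version is used here with the aim of removing punctuation. Note that scikit-learn passes the whole review to the preprocessor, not individual words, so `str.strip` only removes punctuation from the beginning and end of each review; supplying a custom preprocessor also replaces the default lowercasing step. The expected result of this is to see more defined clusters.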

In [10]:
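# The backslash is escaped explicitly; '\:' is an invalid escape sequence in Python.
PUNCTUATION = '`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t\n'
def process(x):
    """
    Basic preprocessor: strips leading and trailing punctuation
    from the string it is given.
    """
    return x.strip(PUNCTUATION)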
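
To check what a preprocessor actually receives, the sketch below uses a hypothetical `recording_preprocessor` that stores every string passed to it and returns it unchanged. The two example reviews are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

seen = []


def recording_preprocessor(text):
    """Record each string handed over by the vectorizer, unchanged."""
    seen.append(text)
    return text


docs = ["Very satisfying!!", "So good. Soft and chewy."]
vectorizer = TfidfVectorizer(preprocessor=recording_preprocessor).fit(docs)

print(seen)                            # the strings the preprocessor was given
print(sorted(vectorizer.vocabulary_))  # the resulting vocabulary
```

The recorded strings are whole reviews, and the vocabulary keeps its capital letters because a custom preprocessor takes the place of the built-in lowercasing step. Punctuation is still absent from the vocabulary because the default token pattern only matches word characters.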

In [11]:
parameters = {
    'preprocessor':(None, process)
}

<img src="images/preprocessor.png">

### Maximum document frequency
The TFIDF vectorizer accepts a maximum document frequency (`max_df`); terms that occur in a larger fraction of documents than this are ignored. This eliminates common words and focuses on the more unique phrasing of each user.
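
A minimal sketch of the effect, using four made-up reviews: fitting once with `max_df=1.0` and once with `max_df=0.5` shows which terms the lower threshold removes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the taffy is soft",
    "the coffee is strong",
    "the tea is weak",
    "the chips are salty",
]

full = TfidfVectorizer(max_df=1.0).fit(docs)
trimmed = TfidfVectorizer(max_df=0.5).fit(docs)

# Terms present with max_df=1.0 but removed with max_df=0.5.
dropped = sorted(set(full.vocabulary_) - set(trimmed.vocabulary_))
print(dropped)
# prints: ['is', 'the']
```

"the" appears in all four reviews and "is" in three, so both exceed the 50% threshold and are dropped.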

In [13]:
parameters = {
    'max_df':(0.25, 1.0)
}

<img src="images/maxdf.png">

# Model Fitting
I initially considered three models: RandomForest, MultinomialNB and BernoulliNB. Below we construct the pipeline that vectorizes the text and then fits the RandomForestClassifier to it. This process was iterated on extensively, which partly explains why the section below looks scattered. I initially started with most of the reasonable parameters in GridSearch and then slowly narrowed the values I searched over until I no longer saw improvement. Once I found a value I liked, I added it to the pipeline and focused on the remaining parameters. This approach matters because running GridSearchCV on every possible parameter at once would have taken days.

In [26]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

The following parameters can be iterated over using GridSearchCV:

In [24]:
TfidfVectorizer().get_params().keys()

dict_keys(['smooth_idf', 'lowercase', 'min_df', 'token_pattern', 'vocabulary', 'norm', 'analyzer', 'dtype', 'input', 'sublinear_tf', 'encoding', 'tokenizer', 'preprocessor', 'max_df', 'strip_accents', 'max_features', 'use_idf', 'binary', 'ngram_range', 'decode_error', 'stop_words'])

In [25]:
RandomForestClassifier().get_params().keys()

dict_keys(['warm_start', 'min_samples_leaf', 'random_state', 'max_leaf_nodes', 'class_weight', 'verbose', 'max_features', 'min_weight_fraction_leaf', 'bootstrap', 'n_estimators', 'max_depth', 'oob_score', 'criterion', 'n_jobs', 'min_samples_split'])

Using GridSearchCV, each of these parameters can be searched to find the configuration that yields the highest classification accuracy. Along with RandomForestClassifier I also tried MultinomialNB and BernoulliNB. These two did not produce very good results, so RandomForestClassifier was used in the end.
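
The grids actually used in the search are not recorded in this notebook. As a sketch of the mechanics only, the example below runs GridSearchCV over two vectorizer parameters on a tiny made-up dataset. The texts, the author labels, the small `n_estimators` and the grid values are all assumptions chosen to keep the example fast.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus with two "authors".
texts = [
    "love this taffy so soft", "taffy is sweet and chewy",
    "soft chewy candy is great", "sweet candy love it",
    "coffee is bold and dark", "dark roast coffee is strong",
    "strong bold espresso roast", "espresso is dark and rich",
]
authors = ["u1"] * 4 + ["u2"] * 4

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "tfidf__max_df": [0.5, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, authors)
print(search.best_params_, search.best_score_)
```

Fixing a parameter in the pipeline after each round, as described above, simply means moving it out of `param_grid` and into the constructor of the corresponding step.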

# Results
Of the three models tested, RandomForestClassifier produced the best results. GridSearchCV was used heavily to tune the pipeline. The final configuration and its accuracy are shown below.

In [27]:
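# The backslash is escaped explicitly; '\:' is an invalid escape sequence in Python.
PUNCTUATION = '`~!@#$%^&*()_-+={[}]|\\:;"<,>.?/}\t\n'
def process(x):
    """
    Basic preprocessor: strips leading and trailing punctuation
    from the string it is given.
    """
    return x.strip(PUNCTUATION)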

In [28]:
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(preprocessor=process, norm='l1', ngram_range=(1, 2), max_df=1.0, max_features=18000)),
    ('rf', RandomForestClassifier(n_estimators=500)),
])

In [29]:
X = reviews["Text"]
y = reviews["UserId"]

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

Fitting the model can take about a minute. I believe this is due to the number of estimators in the random forest.

In [38]:
model = pipeline.fit(X_train, y_train)

In [39]:
y_pred = model.predict(X_test)

In [40]:
accuracy_score(y_test, y_pred)

0.75219364599092287

### Challenge Areas
- Exploring textual data
    - TFIDF
    - TSNE and SVD
- Extensively working with GridSearchCV

### Improvements
Use RandomizedSearchCV initially to shorten the parameter-narrowing process.

Hold out a portion of the data at the start and evaluate on it only once at the end. This avoids the variability that comes from tuning and scoring on different random splits.
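
One way the hold-out idea could look, as a sketch on placeholder data: split once with a fixed `random_state`, tune only on the development portion, and keep the final portion untouched until the end. The 80/20 ratio, the seed and the use of `stratify` are assumptions, not settings used in this notebook.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 20 reviews from two users.
texts = [f"review {i}" for i in range(20)]
authors = ["u1"] * 10 + ["u2"] * 10

# stratify keeps each user's share the same in both portions.
X_dev, X_final, y_dev, y_final = train_test_split(
    texts, authors, test_size=0.2, stratify=authors, random_state=42
)

print(len(X_dev), "development reviews,", len(X_final), "held-out reviews")
# prints: 16 development reviews, 4 held-out reviews
```

All grid searching would then run on `X_dev` and `y_dev`, with a single final score computed on `X_final` and `y_final`.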

### Questions
What should the review minimum be set to? What guides that decision? 