# Linear Classifier

## Import modules

In [1]:
import string
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

## Data Cleaning

In [2]:
# Load Data
products = pd.read_csv('amazon_baby.csv')
# Fill N/A
products = products.fillna({'review':''})  # fill in N/A's in the review column
# Reomove Punctuation for Text Cleaning
products['review'] = products['review'].astype('str')
products['review_clean'] = products['review'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
# Add Sentiment Column
products['sentiment'] = products['rating'].apply(lambda rating: 1 if rating >= 4 else -1)

Since `train-idx.json` and `test-idx.json` contains index only, we will extract train and test data set from `products` by using this indices.

## Extract Train Data

In [3]:
# Extract Train Data
train_index = pd.read_json('train-idx.json')
train_idx_list = train_index[0].values.tolist()
train_data = products.ix[train_idx_list]

## Extract Test Data

In [4]:
test_index = pd.read_json('test-idx.json')
test_idx_list = test_index[0].values.tolist()
test_data = products.ix[test_idx_list]

In [5]:
# Remove rating of 3 since it is neutral
products = products[products['rating'] != 3]
train_data = train_data[train_data['rating'] != 3]
test_data = test_data[test_data['rating'] != 3]

## Build the word count vector for each review 

In [6]:
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# Use this token pattern to keep single-letter words
# First, learn vocabulary from the training data and assign columns to words
# Then convert the training data into a sparse matrix
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
# Second, convert the test data into a sparse matrix, using the same word-column mapping
test_matrix = vectorizer.transform(test_data['review_clean'])

## Train a sentiment classifier with logistic regression

We will now use logistic regression to create a sentiment classifier on the training data. This model will use the column **word_count** as a feature and the column **sentiment** as the target. We will use validation_set=None to obtain same results as everyone else.

In [7]:
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix,train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

Now that we have fitted the model, we can extract the weights. (coefficients)

In [8]:
coeffs_list = list(zip(vectorizer.vocabulary_.keys(),sentiment_model.coef_[0]))

In [9]:
sorted(coeffs_list, key=lambda x: x[1], reverse=True)[:5]

[('blanketjust', 2.0577917728848623),
 ('snickerdoodle', 2.0290640753254028),
 ('cute65281', 1.9988168665774535),
 ('premoistened', 1.9093435278344242),
 ('zebras', 1.8976852910477762)]

Number of positive coefficients will be:

In [10]:
print("Number of positive coefficients: ", sum(sentiment_model.coef_[0] >= 0))

Number of positive coefficients:  79684


## Making predictions with logistic regression

Now that a model is trained, we can make predictions on the test data. In this section, we will explore this in the context of 3 examples in the test dataset. We refer to this set of 3 examples as the sample_test_data.

In [11]:
sample_test_data = test_data[10:13]
sample_test_data

Unnamed: 0,name,review,rating,review_clean,sentiment
64,Our Baby Girl Memory Book,Really happy with this purchase. I was looking...,5,Really happy with this purchase I was looking ...,1
82,Cloth Diaper Pins Stainless Steel Traditional ...,It has been many years since we needed diaper ...,5,It has been many years since we needed diaper ...,1
102,Newborn Baby Tracker&reg; - Round the Clock Ch...,Love it love it love it!! Got my first baby t...,5,Love it love it love it Got my first baby tra...,1


## Predicting sentiment

These scores can be used to make class predictions as follows:
$$
\hat{y} = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{w}^T h(\mathbf{x}_i) >; 0 \\
      -1 & \mathbf{w}^T h(\mathbf{x}_i) \leq 0 \\
\end{array} 
\right.
$$

In [12]:
sample_test_matrix = vectorizer.transform(sample_test_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print("Score of sample test data: ", scores)

Score of sample test data:  [  9.42812899   7.48897389  10.70174709]


In [13]:
print("Predicted sentiment of sample data: ", sentiment_model.predict(sample_test_matrix))

Predicted sentiment of sample data:  [1 1 1]


## Probability predictions

Recall from the lectures that we can also calculate the probability predictions from the scores using:

$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}.
$$

Using the variable **scores** calculated previously, write code to calculate the probability that a sentiment is positive using the above formula. For each row, the probabilities should be a number in the **range [0, 1]**.

In [14]:
import math

def calculate_probability(scores):
    """ Calculate the probability predictions from the scores.
    """
    prob = []
    for score in scores:
        pred =  1 / (1 + math.exp(-score))
        prob.append(pred)
    return prob

In [15]:
print("Probability Prediction: ", calculate_probability(scores))

Probability Prediction:  [0.9999195769246673, 0.9994410960670393, 0.9999774949221907]


## Find the most positive (and negative) review

We now turn to examining the full test dataset, test_data, and use GraphLab Create to form predictions on all of the test data points for faster performance.

Using the sentiment_model, find the 20 reviews in the entire test_data with the highest probability of being classified as a positive review. We refer to these as the "most positive reviews."

In [16]:
test_scores = sentiment_model.decision_function(test_matrix)
test_data['prob'] = calculate_probability(test_scores)

Let's explore test_data:

In [17]:
test_data.head(5)

Unnamed: 0,name,review,rating,review_clean,sentiment,prob
8,"Baby Tracker&reg; - Daily Childcare Journal, S...",A friend of mine pinned this product on Pinter...,5,A friend of mine pinned this product on Pinter...,1,0.973162
9,"Baby Tracker&reg; - Daily Childcare Journal, S...",This has been an easy way for my nanny to reco...,4,This has been an easy way for my nanny to reco...,1,0.897673
14,Nature's Lullabies First Year Sticker Calendar,"Space for monthly photos, info and a lot of us...",5,Space for monthly photos info and a lot of use...,1,0.993811
18,Nature's Lullabies Second Year Sticker Calendar,I completed a calendar for my son's first year...,4,I completed a calendar for my sons first year ...,1,0.996074
24,Nature's Lullabies Second Year Sticker Calendar,Wife loves this calender. Comes with a lot of ...,5,Wife loves this calender Comes with a lot of s...,1,0.995809


20 Most Positive Reviews:

In [18]:
test_data[['name','prob']].sort_values(by='prob',ascending=False).head(20)

Unnamed: 0,name,prob
152013,"UPPAbaby Vista Stroller, Denny",1.0
162354,HALO SleepSack SwaddleChange Diaper Pad Covers...,1.0
144112,"GroVia Hybrid Hook/Loop Shell Diaper, Surf, On...",1.0
123632,"Zooper 2011 Waltz Standard Stroller, Flax Brown",1.0
133651,"Britax 2012 B-Agile Stroller, Red",1.0
161127,Safety 1st Alpha Omega Elite Convertible 3-in-...,1.0
21557,"Joovy Caboose Stand On Tandem Stroller, Black",1.0
158209,Ubbi Cloth Diaper Pail Liner,1.0
50735,"Joovy Zoom 360 Swivel Wheel Jogging Stroller, ...",1.0
100166,"Infantino Wrap and Tie Baby Carrier, Black Blu...",1.0


20 Most Negative Reviews:

In [19]:
test_data[['name','prob']].sort_values(by='prob',ascending=True).head(20)

Unnamed: 0,name,prob
87026,Baby Einstein Around The World Discovery Center,5.764725e-21
120707,The European NANNY Baby Movement Monitor - EU ...,2.267863e-18
2186,Philips Avent 3 Pack 9oz Bottles,2.1672760000000002e-17
10370,Wimmer-Ferguson Infant Stim-Mobile,1.676004e-15
131738,"Kids Line Cascade Bow Diaper Bag, Black",9.148861e-14
20331,Cabinet Flex-Lock (2 pack) from Safety First,8.540356e-13
47740,"Room Magic Desk/Chair Set, Tropical Seas Natural",1.758822e-12
143095,"Graco ComfortSport Convertible Car Seat, Zara",2.330207e-12
38197,JJ Cole System Bag - Graphite/Green,7.251241e-12
131280,Motorola MBP36 Remote Wireless Video Baby Moni...,9.048731e-12


## Compute accuracy of the classifier

We will now evaluate the accuracy of the trained classifer. Recall that the accuracy is given by

$$
\mbox{accuracy} = \frac{\mbox{# correctly classified examples}}{\mbox{# total examples}}
$$

Let's compute the classification accuracy of the sentiment_model on the test_data.

In [20]:
print("ACCURACY: ", (test_data['sentiment'] == sentiment_model.predict(test_matrix)).sum()/len(test_data))


ACCURACY:  0.933833245243


## Learn another classifier with fewer words

There were a lot of words in the model we trained above. We will now train a simpler logistic regression model using only a subet of words that occur in the reviews. For this assignment, we selected a 20 words to work with. These are:

In [21]:
significant_words = ['love', 'great', 'easy', 'old', 'little', 'perfect', 'loves', 
      'well', 'able', 'car', 'broke', 'less', 'even', 'waste', 'disappointed', 
      'work', 'product', 'money', 'would', 'return']

In [22]:
vectorizer_word_subset = CountVectorizer(vocabulary=significant_words) # limit to 20 words
train_matrix_word_subset = vectorizer_word_subset.fit_transform(train_data['review_clean'])
test_matrix_word_subset = vectorizer_word_subset.transform(test_data['review_clean'])

## Train a logistic regression model on a subset of data

We will now build a classifier with word_count_subset as the feature and sentiment as the target.

In [23]:
simple_model = LogisticRegression()
simple_model.fit(train_matrix_word_subset,train_data['sentiment'])

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We will inspect the weights (coefficients) of the simple_model:

In [24]:
pd.DataFrame(list(zip(significant_words,simple_model.coef_[0])), columns = ['features', 'estimated coefficients'])

Unnamed: 0,features,estimated coefficients
0,love,1.362031
1,great,0.925623
2,easy,1.186222
3,old,0.124464
4,little,0.500502
5,perfect,1.517395
6,loves,1.743727
7,well,0.470707
8,able,0.183111
9,car,0.091578


## Comparing Models

We will now compare the accuracy of the sentiment_model and the simple_model using the get_classification_accuracy method you implemented above.

First, compute the classification accuracy of the sentiment_model on the `train_data`:

In [25]:
print("ACCURACY (Sentiment Model, TRAIN DATA): ", (train_data['sentiment'] == sentiment_model.predict(train_matrix)).sum()/len(train_data))
print("ACCURACY (Simple Model, TRAIN DATA): ", (train_data['sentiment'] == simple_model.predict(train_matrix_word_subset)).sum()/len(train_data))

ACCURACY (Sentiment Model, TRAIN DATA):  0.963994654877
ACCURACY (Simple Model, TRAIN DATA):  0.865060380098


Now, we will repeat this excercise on the `test_data`. Start by computing the classification accuracy of the sentiment_model on the `test_data`:

In [26]:
print("ACCURACY (Sentiment Model, TEST DATA): ", (test_data['sentiment'] == sentiment_model.predict(test_matrix)).sum()/len(test_data))
print("ACCURACY (Simple Model, TEST DATA): ", (test_data['sentiment'] == simple_model.predict(test_matrix_word_subset)).sum()/len(test_data))

ACCURACY (Sentiment Model, TEST DATA):  0.933833245243
ACCURACY (Simple Model, TEST DATA):  0.868888742072


## Baseline: Majority class prediction

It is quite common to use the majority class classifier as the a baseline (or reference) model for comparison with your classifier model. The majority classifier model predicts the majority class for all data points. At the very least, you should healthily beat the majority class classifier, otherwise, the model is (usually) pointless.

In [27]:
num_positive  = (train_data['sentiment'] == +1).sum()
num_negative = (train_data['sentiment'] == -1).sum()
print("# Positive: ", num_positive)
print("# Negative: ", num_negative)

# Positive:  101674
# Negative:  19558


Accuracy of this majority class model will be:

In [28]:
print("ACCURACY (Majority Class Model, TEST DATA): ", (test_data['sentiment']==1).sum()/len(test_data))

ACCURACY (Majority Class Model, TEST DATA):  0.841635835095
