# Homework #5: Classification

## 1. Laptop vs. Phone 

We have collected over 5000 reviews of products that are either cell phones or laptops. In this exercise we are going to write your own Naive Bayes classifier for deciphering whether a review pertains to a Laptop or a Phone.

In the `reviews/` subdirectory, you will find a (poorly formatted) CSV file named `reviews.csv`.  First, you need to run the Python script `prepare.py` (in the first code cell below), which parses these file, constructs the objects (e.g., feature matrix and outcome vector) to be used by your classifier, and writes them into a `.pik` for faster loading by your subsequent analysis.

For a description of what these objects mean, see comments in the code cells below.

Your job is to modify and complete this notebook file to implement Naive Bayes classification according to the specification given in the code.  You need to implement the algorithm by yourself; do __NOT__ use the Naive Bayes implementations provided by Python `sklearn` package.

In addition to accuracy, note that the program will also print out the most important features used by this classifier---assuming that you have implemented the algorithm correctly according to the specification.  This information can be very useful debugging.  Do your top features make intuitive sense?

In [1]:
%run prepare.py

In [2]:
# DO NOT MODIFY
import sys
import numpy
from sklearn import cross_validation
from sklearn.base import BaseEstimator
import prepare



You can run the following cell to explore the datastructures that are used to store the features $(X)$ and labels $(Y)$.

In [3]:
# This is how the reviews data is preprocessed into feature vectors (X) and labels (Y)
X, y, instances, features = prepare.get_data()

# instances: a list (array) of review ids, one for each review in reviews.csv.
# features: a list (array) containing words that ever appear somewhere in reviews.
print '{} instances'.format(X.shape[0])
print '{} features'.format(X.shape[1])

print 'Review # {}'.format(instances[0]) # you can replace 0 with a number between 0 and X.shape[0]-1
print 'Feature # 3238 = {}'.format(features[3238]) # you can replace 0 with a number between 0 and X.shape[1]-1


# y: a vector (array) with one component per instance;
#    y[i] = 0 if review with id instance[i] is for a laptop,
#        or 1 if it is for a mobile phone.
print 'Label for Review # {} = {}'.format(instances[0], y[0]) # you can replace 0 with a number between 0 and X.shape[0]-1

# X: a matrix with one row per instance and one column per feature;n
#    X[i,j] = 1 if word features[j] appears in review with id instance[i],
#          or 0 if the word doesn't appear in it.

# X[i, j] tells you the value of feature j for instance i
print 'Does the feature `{}` appear in {}? {}'.format(features[3238], instances[0], 'yes' if X[0, 3238]==1 else 'no')

print 'The number of words that appear in this review: {}'.format(X[0, :].getnnz())

print 'All features for this review as a *dense* array {} ... this has {} 1\'s'.format(X[0].toarray()[0], sum(X[0].toarray()[0]))
print 'All features for this review as a *sparse* array (i.e., it only stores the indexes of features that are non-zero) \n{}'.format(X[0,:])


5648 instances
29427 features
Review # B00JJ9687W.json
Feature # 3238 = after
Label for Review # B00JJ9687W.json = 0
Does the feature `after` appear in B00JJ9687W.json? yes
The number of words that appear in this review: 141
All features for this review as a *dense* array [0 0 0 ... 0 0 0] ... this has 141 1's
All features for this review as a *sparse* array (i.e., it only stores the indexes of features that are non-zero) 
  (0, 95)	1
  (0, 3238)	1
  (0, 3250)	1
  (0, 3479)	1
  (0, 3525)	1
  (0, 3609)	1
  (0, 3687)	1
  (0, 3733)	1
  (0, 3745)	1
  (0, 3842)	1
  (0, 3878)	1
  (0, 4156)	1
  (0, 4161)	1
  (0, 4737)	1
  (0, 4785)	1
  (0, 5321)	1
  (0, 5393)	1
  (0, 5544)	1
  (0, 5728)	1
  (0, 5761)	1
  (0, 5906)	1
  (0, 5934)	1
  (0, 5971)	1
  (0, 7684)	1
  (0, 8038)	1
  :	:
  (0, 26335)	1
  (0, 26424)	1
  (0, 26460)	1
  (0, 26499)	1
  (0, 26566)	1
  (0, 26846)	1
  (0, 26931)	1
  (0, 26933)	1
  (0, 26982)	1
  (0, 27555)	1
  (0, 27578)	1
  (0, 27585)	1
  (0, 27591)	1
  (0, 27597)	1
  (0, 277

You are now ready to edit the code in the following cells to create your own Naive Bayes classifier.

In [4]:
# DO NOT MODIFY
#
# Please follow the comments inside the following coding cells and implement what it asks
# 
# This section is the definition of the class MyNaiveBayes
# The actual implementations of class functions are defined in the following three cells.
#

class MyNaiveBayes(BaseEstimator):
    def __init__(self): pass
    def fit(self, X_train, y_train): pass
    def predict(self, X_test): pass
    def score(self, X_test, y_test):
        return float(sum(predicted == actual \
                         for predicted, actual \
                         in zip(self.predict(X_test), y_test))) \
            / len(y_test)




In [46]:
# MODIFY IF REQUIRED

def __init__(self):
    # You should set the following attributes in the fit() method.
    # class_log_prior_:
    #   an array of two floats (real numbers), where class_log_prior_[k] 
    #   is the natural logarithm of the probability (
    #   estimated from training data) that
    #   class is k (k is 0 or 1).
    # feature_count_:
    #   number of features, or the number of columns in X.
    # feature_log_prob_:
    #   a matrix of floats, with two rows and as many columns as
    #   self.feature_count_; feature_log_prob_[k,f] is the natural logarithm
    #   of the probability (estimated from training data) of seeing the word
    #   corresponding to feature f as part of a review of class k. zip(x_train,y_train)
    self.class_log_prior_ = None
    self.feature_count_ = None
    self.feature_log_prob_ = None
    # You may use additional attributes as needed.
    
MyNaiveBayes.__init__ = __init__

In [47]:
# MODIFY AND COMPLETE

def fit(self, X_train, y_train): #x-train 0 and 1 
    # Train your classifier with given data.  See comments below
    # on X and y for an explanation of the format of X_train,
    # y_train (labels of phone or laptop)
    #
    # REPLACE THE FOLLOWING (INCORRECT) WITH YOUR IMPLEMENTATION:
    self.class_log_prior_ = numpy.log([(y_train.sum()*1.0/len(y_train)*1.0), ((len(y_train) - y_train.sum()*1.0)/len(y_train)*1.0)])  #([, 0.5]) #Num reviews in class(laptop or phone) / total num reviews
    self.feature_count_ = X_train.shape[1]
    self.feature_log_prob_ = numpy.zeros((2, self.feature_count_))
    
    counts = numpy.zeros((2, self.feature_count_)) #creates array 
    
    for review,k in zip(X_train, y_train): #([0,1,1] 1/0) k =0 or 1 laptop or phone
        counts[k] += review.toarray()[0]
        
        #count[k].sum()
        
        #in counts we have the total number of occurances of each word in all of reviews combined 
        
        #get probability of word appearing in reviews overall 
        
            # num words (in counts) +1 / (# of words in laptop reviews + total number of words)
    
    n_instances = X_train.shape[0]*1.0 #number of laptop reviews
    n_observations = X_train.sum()*1.0 #number of words (in all reviews)
    
    for k in (0, 1):
        for f in range(self.feature_count_): #for every feature 
            self.feature_log_prob_[k,f] = numpy.log((counts[k][f]*1.0 + 1) / ((counts[k].sum()*1.0) + n_observations)) #numpy.log(1.0 / n_observations) 

MyNaiveBayes.fit = fit

In [59]:
# MODIFY AND COMPLETE

def predict(self, X_test):
    # Return a vector (array) y_test; each component holds the
    # predicted class for each row of X_test.  See comments below
    # on X and y for an explanation of the format of X_test,
    # y_test.
    #
    # As a simplification, you may assume that all words in the
    # testing data are already in your vocabulary (so there is a
    # feature for each); however, do not assume that every word
    # appears in the training data.
    #
    # REPLACE THE FOLLOWING (INCORRECT) WITH YOUR IMPLEMENTATION:
    y_test = list()
    
    for x in X_test: # for each row
        probphone = 0.0
        problaptop = 0.0
       
        x = x.toarray()[0]
        for i in range(0,len(x)):
            if x[i] == 1:
                problaptop += self.feature_log_prob_[0,i]
                probphone += self.feature_log_prob_[1,i]
                
        problaptop += self.class_log_prior_[0]
        probphone += self.class_log_prior_[1]
            
        if problaptop > probphone:
            y_test.append(0)
        else: 
            y_test.append(1)
            
            
       # if x.sum() % 2 == 0:
            #y_test.append(1)
       # else:
         #   y_test.append(0) for f in range(self.feature_count_):
        
        
    return numpy.array(y_test)

MyNaiveBayes.predict = predict

In [60]:
# After you have implemented and runned the cells above, excute this cell to get the output of the classification algorithm
%run -i classify_main.py 

5648 instances, 29427 features
debugging on one train/test split:
train accuracy: 0.8198
test accuracy: 0.7511
top 20 features:
	             texting: -5.3085	                asus: +5.0305
	               nokia: -5.1918	                  i7: +4.8827
	                 htc: -5.0377	               intel: +4.6383
	            unlocked: -4.8784	                 ssd: +4.5559
	            motorola: -4.8487	                 hdd: +4.5497
	                 gsm: -4.7275	                  i3: +4.4593
	              sprint: -4.4697	                bios: +4.1983
	       international: -4.4598	             toshiba: +4.1717
	                  fm: -4.3986	                  i5: +4.1443
	                 lte: -4.3986	                acer: +4.0471
running 10-fold cross validation...
[0.06902655 0.06017699 0.04247788 0.55575221 1.         1.
 1.         1.         1.         1.        ]
accuracy: 0.6727 (+/- 0.8469)
