# ML Course 3: Classification: Week 2 Assignment 1
## Implementing logistic regression from scratch

The goal of this notebook is to implement your own logistic regression classifier. You will:

 * Extract features from Amazon product reviews.
 * Convert an SFrame into a NumPy array.
 * Implement the link function for logistic regression.
 * Write a function to compute the derivative of the log likelihood function with respect to a single coefficient.
 * Implement gradient ascent.
 * Given a set of coefficients, predict sentiments.
 * Compute classification accuracy for the logistic regression model.
 
    
## Import SFrame



In [109]:
from __future__ import division  # must come first; ensures floating point division
import sframe as sf
#import pandas as pd
#import numpy as np

## Load review dataset

**1.** For this assignment, we will use a subset of the Amazon product review dataset. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted primarily of positive reviews.

One column of this dataset is 'sentiment', corresponding to the class label with +1 indicating a review with positive sentiment and -1 indicating one with negative sentiment.

In [78]:
#print len(products['sentiment'])   #53072 rows/samples
products['sentiment']

dtype: int
Rows: 53072
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... ]

**2.** Let us quickly explore more of this dataset.  The 'name' column indicates the name of the product.  Here we list the first 2 products in the dataset.  We then count the number of positive and negative reviews.

In [79]:
products.head(2)['name']

dtype: str
Rows: 2
["Stop Pacifier Sucking without tears with Thumbuddy To Love's Binky Fairy Puppet and Adorable Book", "Nature's Lullabies Second Year Sticker Calendar"]

In [80]:
print '# of positive reviews =', len(products[ products['sentiment']==1  ])
print '# of negative reviews =', len(products[ products['sentiment']==-1 ])
print  len(products['sentiment']) == ( len(products[ products['sentiment']==1  ]) \
                                     + len(products[ products['sentiment']==-1 ]) )

# of positive reviews = 26579
# of negative reviews = 26493
True


**Note:** For this assignment, we eliminated class imbalance by choosing 
a subset of the data with a similar number of positive and negative reviews. 

## Apply text cleaning on the review data

**3.** In this section, we will perform some simple feature cleaning using **SFrames**. The last assignment used all words in building bag-of-words features, but here we limit ourselves to 193 words (for simplicity). We compiled a list of the 193 most frequent words into a JSON file.

Now, we will load these words from this JSON file:

In [81]:
import json
with open('important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)           #reads entire file in one go 
    
# for each word in the list, force it to be of data type string
important_words = [str(s) for s in important_words]

In [82]:
print important_words

['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'retu

**4.** Now, we will perform 2 simple data transformations:

1. Remove punctuation using [Python's built-in](https://docs.python.org/2/library/string.html) string functionality.
2. Compute word counts (only for **important_words**)

We start with *Step 1* which can be done as follows:

In [83]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)

Apply the remove_punctuation function to every element of the review column and assign the result to the new column review_clean. **Note:** Many data frame packages support an apply operation for this type of task; consult the documentation of the package you use.
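The `text.translate(None, string.punctuation)` call above relies on the Python 2 signature of `str.translate`. If you run the notebook under Python 3, a minimal sketch of the same cleaning step could look like the following; the name `remove_punctuation_py3` is made up for this illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character.
# str.maketrans is the Python 3 API; the cell above uses the Python 2 form.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    """Return text with ASCII punctuation removed (hypothetical Python 3 variant)."""
    return text.translate(_PUNCT_TABLE)

sample = "Great product, works well!"
cleaned = remove_punctuation_py3(sample)
print(cleaned)  # Great product works well
```

Building the table once outside the function avoids rebuilding it for each of the reviews.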

**5.** Now we proceed with *Step 2*. For each word in **important_words**, we compute a count for the number of times the word occurs in the review. We will store this count in a separate column (one for each word). The result of this feature processing is a single column for each word in **important_words** which keeps a count of the number of times the respective word occurs in the review text.

**Note:** There are several ways of doing this. 

In [84]:
# http://programminghistorian.org/lessons/counting-frequencies
wordstring = 'it was the best of times it was the worst of times '
wordstring += 'it was the age of wisdom it was the age of foolishness'

words = wordstring.split()

wordfreq = []
#for word in words:
#    wordfreq.append(words.count(word))

wordfreq = [words.count(word) for word in words] # a list comprehension
print("String\n" + wordstring +"\n")
print("List\n" + str(words) + "\n")
print("Frequencies\n" + str(wordfreq) + "\n")
print("Pairs\n" + str(zip(words, wordfreq)))

String
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness

List
['it', 'was', 'the', 'best', 'of', 'times', 'it', 'was', 'the', 'worst', 'of', 'times', 'it', 'was', 'the', 'age', 'of', 'wisdom', 'it', 'was', 'the', 'age', 'of', 'foolishness']

Frequencies
[4, 4, 4, 1, 4, 2, 4, 4, 4, 1, 4, 2, 4, 4, 4, 2, 4, 1, 4, 4, 4, 2, 4, 1]

Pairs
[('it', 4), ('was', 4), ('the', 4), ('best', 1), ('of', 4), ('times', 2), ('it', 4), ('was', 4), ('the', 4), ('worst', 1), ('of', 4), ('times', 2), ('it', 4), ('was', 4), ('the', 4), ('age', 2), ('of', 4), ('wisdom', 1), ('it', 4), ('was', 4), ('the', 4), ('age', 2), ('of', 4), ('foolishness', 1)]


In [85]:
# Given a list of words, return a dictionary of word-frequency pairs.

def wordListToFreqDict(wordlist):
    wordfreq = [wordlist.count(p) for p in wordlist]
    return dict(zip(wordlist,wordfreq))

wordListToFreqDict(words)

{'age': 2,
 'best': 1,
 'foolishness': 1,
 'it': 4,
 'of': 4,
 'the': 4,
 'times': 2,
 'was': 4,
 'wisdom': 1,
 'worst': 1}

Using the technique above, we first split each cleaned review into a list of words and then count how many times a given word occurs in it. We wrap this in an anonymous function (lambda) and use apply() to run it on every row of the review_clean column, repeating the step for every word in important_words with a for loop. In this assignment, we use the built-in *count* method of Python lists.

In [86]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
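The loop above splits every review once per word, so each review is split 193 times. As an alternative sketch (not the approach used in this notebook), `collections.Counter` can count all words of a review in one pass. The helper name `count_important_words` and the toy data are assumptions made for illustration.

```python
from collections import Counter

def count_important_words(review_clean, important_words):
    """Count how often each word in important_words occurs in one cleaned review."""
    counts = Counter(review_clean.split())   # a single pass over the review
    # A Counter returns 0 for words it has never seen.
    return {word: counts[word] for word in important_words}

toy_words = ['great', 'love', 'perfect']     # hypothetical subset of important_words
toy_review = 'great stroller my son and I love love it'
toy_counts = count_important_words(toy_review, toy_words)
print(toy_counts)
```

Each resulting dictionary holds the same counts that the per-word columns store for one review.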

**6.** After #4 and #5, the data frame **products** should now contain one column for each of the 193 **important_words**. As an example, the column **perfect** contains a count of the number of times the word **perfect** occurs in each of the reviews.

In [87]:
products['perfect']

dtype: int
Rows: 53072
[0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... ]

**7.** Now, write some code to compute the number of product reviews that contain the word **perfect**.

**Hint**: 
* First create a column called `contains_perfect` which is set to 1 if the count of the word **perfect** (stored in column **perfect**) is >= 1.
* Sum the number of 1s in the column `contains_perfect`.

In [88]:
# products['contains_perfect'] = [ 1 if products['perfect'] >= 1 else 0 ]
# The line above always raises this "ValueError":
# The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

products['contains_perfect']  = products['perfect'].apply(lambda x : 1 if x >= 1 else 0)  #lambda in place function
print "%d reviews contain the word perfect" %sum(products['contains_perfect'])

2955 reviews contain the word perfect
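The same count can be sketched with a vectorized comparison once the counts are in a NumPy array. The toy array below is an invented stand-in for `products['perfect']`, so the printed number is not the quiz answer.

```python
import numpy as np

# Toy stand-in for the 'perfect' count column (values invented for illustration).
perfect_counts = np.array([0, 0, 0, 1, 0, 2, 0, 1])

# Elementwise comparison gives a boolean array; casting to int yields 0/1 flags.
contains_perfect = (perfect_counts >= 1).astype(int)
num_with_perfect = int(contains_perfect.sum())
print("%d reviews contain the word perfect" % num_with_perfect)
```

The comparison is applied to every element at once, which avoids the ambiguous truth value error noted in the comment of the cell above.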


**Quiz Question**. How many reviews contain the word **perfect**?

## Convert SFrame/data frame to multi-dimensional NumPy array

**8.** Now convert our data frame to a multi-dimensional array. Look for a package that provides highly optimized matrix operations. In the case of Python, NumPy is a good choice. NumPy is a powerful library for doing matrix manipulation. Let us convert our data to matrices and then implement our algorithms with matrices.

First, do the following import.

In [89]:
import numpy as np

Write a function that extracts columns from a data frame/SFrame and converts them into a multi-dimensional NumPy array. Two arrays are returned: one representing features (a feature matrix H) and another representing class labels, y or true class labels. Note that the feature matrix includes an additional column 'intercept' to take account of the intercept term.

The function should accept three parameters:

*    dataframe: a data frame to be converted
*    features: a list of string, containing the names of the columns that are used as features.
*    label: a string, containing the name of the single column that is used as class labels.

The function should return two values:

*    one 2D array for features
*    one 1D array for class labels

The function should do the following:

*    Prepend a new column constant to dataframe and fill it with 1's. This column takes account of the intercept term (the code below names it 'intercept'). Make sure that the constant column appears first in the data frame.
*    Prepend the string 'constant' to the list features (again, 'intercept' in the code below). Make sure the string appears first in the list.
*    Extract columns in dataframe whose names appear in the list features.
*    Convert the extracted columns into a 2D array using a function in the data frame library. If you are using Pandas, you would use as_matrix() function.
*    Extract the single column in dataframe whose name corresponds to the string label.
*    Convert the column into a 1D array.
*    Return the 2D array and the 1D array.

In [90]:
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1
    features = ['intercept'] + features
    features_sframe = data_sframe[features]
    feature_matrix = features_sframe.to_numpy()  #.as_matrix() if using pandas data frame
    label_sarray = data_sframe[label]
    label_array = label_sarray.to_numpy()
    return(feature_matrix, label_array)
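To make the steps concrete without an SFrame, here is a minimal NumPy-only sketch. It assumes the data is available as a plain dictionary of columns; that dictionary, the function name `get_numpy_data_from_columns` and the toy values are stand-ins invented for this illustration.

```python
import numpy as np

def get_numpy_data_from_columns(columns, features, label):
    """columns: dict mapping column name -> sequence of values
    (a hypothetical stand-in for a data frame)."""
    n_rows = len(columns[label])
    intercept = np.ones(n_rows)                      # constant column of 1's
    feature_columns = [intercept] + [
        np.asarray(columns[name], dtype=float) for name in features
    ]
    feature_matrix = np.column_stack(feature_columns)  # intercept comes first
    label_array = np.asarray(columns[label])
    return feature_matrix, label_array

toy_columns = {'great': [1, 0, 2], 'bad': [0, 3, 0], 'sentiment': [1, -1, 1]}
toy_H, toy_y = get_numpy_data_from_columns(toy_columns, ['great', 'bad'], 'sentiment')
print(toy_H.shape)  # (3, 3): 3 rows, 2 word features plus the intercept
```

Unlike the SFrame version, this sketch does not add a column to the input data; it only builds the arrays.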

**9.** Using the function written in #8, extract two arrays, feature_matrix and sentiment. The 2D array feature_matrix would contain the content of the columns given by the list important_words. The 1D array sentiment would contain the content of the column sentiment, the true labels. Let us convert the data into NumPy arrays.

In [91]:
# Warning: This may take a few minutes...
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment') 

**Are you running this notebook on an Amazon EC2 t2.micro instance?** (If you are using your own machine, please skip this section)

It has been reported that t2.micro instances do not provide sufficient power to complete the conversion in an acceptable amount of time. In the interest of time, please refrain from running the `get_numpy_data` function. Instead, download the [binary file](https://s3.amazonaws.com/static.dato.com/files/coursera/course-3/numpy-arrays/module-3-assignment-numpy-arrays.npz) containing the four NumPy arrays you'll need for the assignment. To load the arrays, run the following commands:
```
arrays = np.load('module-3-assignment-numpy-arrays.npz')
feature_matrix, sentiment = arrays['feature_matrix'], arrays['sentiment']
```

In [92]:
feature_matrix.shape   #this is our H matrix - see notes what H represents

(53072, 194)

**Quiz Question:** How many features are there in the **feature_matrix**?

We had 193 words, so 193 word-count features plus 1 constant column (all 1s) gives 194 features.

**Quiz Question:** Assuming that the intercept is present, how does the number of features in **feature_matrix** relate to the number of features in the logistic regression model?

Now, let us see what the **sentiment** column looks like:

In [93]:
sentiment

array([ 1,  1,  1, ..., -1, -1, -1])

## Estimating conditional probability with link function

Recall from lecture that the link function is given by:
$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))},
$$

where the feature vector $h(\mathbf{x}_i)$ represents the word counts of **important_words** in the review  $\mathbf{x}_i$. Complete the following function that implements the link function:

In [114]:
def predict_probability(feature_matrix, coefficients):
    '''
    Produces a probabilistic estimate for P(y_i = +1 | x_i, w).
    The estimate ranges between 0 and 1.
    '''
    # Take dot product of feature_matrix and coefficients  
    margin = np.dot(feature_matrix, coefficients)
    # Swapping the arguments only works with the transposed matrix:
    #margin = np.dot(coefficients, feature_matrix.T)    
    
    # Compute P(y_i = +1 | x_i, w) using the link function
    predictions = 1/(1 + np.exp(-margin))
    
    # return predictions
    return predictions

**Aside**. How the link function works with matrix algebra

Since the word counts are stored as columns in **feature_matrix**, each $i$-th row of the matrix corresponds to the feature vector $h(\mathbf{x}_i)$:
$$
[\text{feature_matrix}] =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right] =
\left[
\begin{array}{cccc}
h_0(\mathbf{x}_1) & h_1(\mathbf{x}_1) & \cdots & h_D(\mathbf{x}_1) \\
h_0(\mathbf{x}_2) & h_1(\mathbf{x}_2) & \cdots & h_D(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
h_0(\mathbf{x}_N) & h_1(\mathbf{x}_N) & \cdots & h_D(\mathbf{x}_N)
\end{array}
\right]
$$

By the rules of matrix multiplication, the score vector containing elements $\mathbf{w}^T h(\mathbf{x}_i)$ is obtained by multiplying **feature_matrix** and the coefficient vector $\mathbf{w}$.
$$
[\text{score}] =
[\text{feature_matrix}]\mathbf{w} =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right]
\mathbf{w}
= \left[
\begin{array}{c}
h(\mathbf{x}_1)^T\mathbf{w} \\
h(\mathbf{x}_2)^T\mathbf{w} \\
\vdots \\
h(\mathbf{x}_N)^T\mathbf{w}
\end{array}
\right]
= \left[
\begin{array}{c}
\mathbf{w}^T h(\mathbf{x}_1) \\
\mathbf{w}^T h(\mathbf{x}_2) \\
\vdots \\
\mathbf{w}^T h(\mathbf{x}_N)
\end{array}
\right]
$$
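The identity above can be checked numerically. The sketch below reuses the small matrix and coefficient vector from the checkpoint that follows and compares the matrix product against a row-by-row dot product; the variable names are chosen for this illustration only.

```python
import numpy as np

H = np.array([[1., 2., 3.],
              [1., -1., -1.]])      # N = 2 rows, 3 feature columns
w = np.array([1., 3., -1.])

scores_matrix = np.dot(H, w)                                # [feature_matrix] w
scores_rowwise = np.array([np.dot(w, row) for row in H])    # w^T h(x_i), one row at a time
scores_transposed = np.dot(w, H.T)                          # same values with the arguments swapped

# Note: np.dot(w, H) without the transpose would fail here, because
# w has 3 entries while H has 2 rows.
print(scores_matrix)
```

All three arrays hold the same scores, so the single matrix product is enough.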

**Checkpoint**

Just to make sure you are on the right track, we have provided a few examples. If your `predict_probability` function is implemented correctly, then the outputs will match:

In [115]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])

correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),          1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_predictions = np.array( [ 1./(1+np.exp(-correct_scores[0])), 1./(1+np.exp(-correct_scores[1])) ] )

print 'The following outputs must match '
print '------------------------------------------------'
print 'correct_predictions           =', correct_predictions
print 'output of predict_probability =', predict_probability(dummy_feature_matrix, dummy_coefficients)

The following outputs must match 
------------------------------------------------
correct_predictions           = [ 0.98201379  0.26894142]
output of predict_probability = [ 0.98201379  0.26894142]


## Compute derivative of log likelihood with respect to a single coefficient

**11.** Recall from lecture:
$$
\frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$

We will now write a function that computes the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:
* `errors` vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
* `feature` vector containing $h_j(\mathbf{x}_i)$  for all $i$. 

This corresponds to the j-th column of feature_matrix.

The function should do the following:

*    Take two parameters errors and feature.
*    Compute the dot product of errors and feature.
*    Return the dot product. This is the derivative with respect to a single coefficient w_j.


Complete the following code block:

In [96]:
def feature_derivative(errors, feature):     
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    
    return derivative      # Return the derivative
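Because every coefficient uses the same `errors` vector, all partial derivatives can also be obtained in one matrix product, $H^T \cdot \text{errors}$. The sketch below, with invented toy numbers, checks that this agrees with calling `feature_derivative` column by column; the function is repeated so the block runs on its own.

```python
import numpy as np

def feature_derivative(errors, feature):
    # Same definition as in the cell above.
    return np.dot(errors, feature)

toy_H = np.array([[1., 2., 3.],
                  [1., -1., -1.]])
toy_errors = np.array([-0.98, 0.73])     # invented values of indicator - prediction

# One derivative per coefficient, computed column by column.
per_coefficient = np.array([
    feature_derivative(toy_errors, toy_H[:, j]) for j in range(toy_H.shape[1])
])

# The full gradient in a single matrix-vector product.
full_gradient = np.dot(toy_H.T, toy_errors)
print(full_gradient)
```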

**12.** In the main lecture, our focus was on the likelihood.  In the advanced optional video, however, we introduced a transformation of this likelihood---called the log likelihood---that simplifies the derivation of the gradient and is more numerically stable.  Due to its numerical stability, we will use the log likelihood instead of the likelihood to assess the algorithm.

The log likelihood is computed using the following formula (see the advanced optional video if you are curious about the derivation of this equation):

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$

We provide a function to compute the log likelihood for the entire dataset. 

In [97]:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    logexp = np.log(1. + np.exp(-scores))
    
    # Simple check to prevent overflow
    mask = np.isinf(logexp)
    logexp[mask] = -scores[mask]
    
    lp = np.sum((indicator-1)*scores - logexp)
    return lp
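As an alternative to the overflow mask, `np.logaddexp(0, -scores)` evaluates $\ln(1 + \exp(-s))$ directly in a numerically stable way. The variant below is a sketch, not the function provided by the course; the name `compute_log_likelihood_stable` is made up.

```python
import numpy as np

def compute_log_likelihood_stable(feature_matrix, sentiment, coefficients):
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    # logaddexp(0, -s) = ln(exp(0) + exp(-s)) = ln(1 + exp(-s)), without overflow
    logexp = np.logaddexp(0., -scores)
    return np.sum((indicator - 1) * scores - logexp)

dummy_feature_matrix = np.array([[1., 2., 3.], [1., -1., -1.]])
dummy_coefficients = np.array([1., 3., -1.])
dummy_sentiment = np.array([-1, 1])
stable_ll = compute_log_likelihood_stable(dummy_feature_matrix, dummy_sentiment,
                                          dummy_coefficients)
print(stable_ll)
```

On the dummy data used in the checkpoint below, this should agree with `compute_log_likelihood`.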

**Checkpoint**

Just to make sure we are on the same page, run the following code block and check that the outputs match.

In [116]:
dummy_feature_matrix = np.array([[1.,2.,3.], [1.,-1.,-1]])
dummy_coefficients = np.array([1., 3., -1.])
dummy_sentiment = np.array([-1, 1])

correct_indicators  = np.array( [ -1==+1,                                       1==+1 ] )
correct_scores      = np.array( [ 1.*1. + 2.*3. + 3.*(-1.),                     1.*1. + (-1.)*3. + (-1.)*(-1.) ] )
correct_first_term  = np.array( [ (correct_indicators[0]-1)*correct_scores[0],  (correct_indicators[1]-1)*correct_scores[1] ] )
correct_second_term = np.array( [ np.log(1. + np.exp(-correct_scores[0])),      np.log(1. + np.exp(-correct_scores[1])) ] )

correct_ll          =      sum( [ correct_first_term[0]-correct_second_term[0], correct_first_term[1]-correct_second_term[1] ] ) 

print 'The following outputs must match '
print '------------------------------------------------'
print 'correct_log_likelihood           =', correct_ll
print 'output of compute_log_likelihood =', compute_log_likelihood(dummy_feature_matrix, dummy_sentiment, dummy_coefficients)

The following outputs must match 
------------------------------------------------
correct_log_likelihood           = -5.33141161544
output of compute_log_likelihood = -5.33141161544


## Taking gradient steps

**13.** Now we are ready to implement our own logistic regression. All we have to do is to write a gradient ascent function that takes gradient steps towards the optimum. 

Write a function logistic_regression to fit a logistic regression model using gradient ascent.

The function accepts the following parameters:

*    feature_matrix: 2D array of features
*    sentiment: 1D array of class labels
*    initial_coefficients: 1D array containing initial values of coefficients
*    step_size: a parameter controlling the size of the gradient steps
*    max_iter: number of iterations to run gradient ascent

The function returns the last set of coefficients after performing gradient ascent.

The function carries out the following steps:

*    Initialize vector coefficients to initial_coefficients.
*    Predict the class probability P(y_i = +1 | x_i,w) using your predict_probability function and save it to variable predictions.
*    Compute indicator value for (y_i = +1) by comparing sentiment against +1. Save it to variable indicator.
*    Compute the errors as difference between indicator and predictions. Save the errors to variable errors.
*    For each j-th coefficient, compute the per-coefficient derivative by calling feature_derivative with the j-th column of feature_matrix. Then increment the j-th coefficient by (step_size*derivative).
*    Once in a while, insert code to print out the log likelihood.
*    Repeat steps 2-6 for max_iter times.

Complete the following function to solve the logistic regression model using gradient ascent:

In [99]:
from math import sqrt

def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in xrange(max_iter):

        # Predict P(y_i = +1|x_i,w) using your predict_probability() function
        predictions = predict_probability(feature_matrix, coefficients)
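        # Equivalent alternative (coefficients first, matrix transposed); left commented out
        # because running it would only recompute the same predictions:
        #predictions = predict_probability(coefficients, np.transpose(feature_matrix))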
        
        # Compute indicator value for (y_i = +1)
        indicator = (sentiment == +1)  #indicator is list of True,False 
        #[ True  True  True ..., False False False]
        
        # Compute the errors as indicator - predictions
        errors = indicator - predictions
        for j in xrange(len(coefficients)): # loop over each coefficient
            
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j].
            # Compute the derivative for coefficients[j]. Save it in a variable called derivative
            
            derivative = feature_derivative(errors, feature_matrix[:,j])
            
            # add the step size times the derivative to the current coefficient
            coefficients[j] += step_size *  derivative
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
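The inner loop over `j` uses the same `errors` for every coefficient, so the whole update can be written as one matrix product. The following is a compact sketch under that observation, without the progress printing and written in Python 3-compatible syntax; the function name and the toy data are assumptions for illustration.

```python
import numpy as np

def logistic_regression_vectorized(feature_matrix, sentiment, initial_coefficients,
                                   step_size, max_iter):
    coefficients = np.array(initial_coefficients, dtype=float)
    indicator = (sentiment == +1).astype(float)
    for _ in range(max_iter):
        scores = np.dot(feature_matrix, coefficients)
        predictions = 1. / (1. + np.exp(-scores))       # P(y_i = +1 | x_i, w)
        errors = indicator - predictions
        # Gradient of the log likelihood for all coefficients at once.
        coefficients += step_size * np.dot(feature_matrix.T, errors)
    return coefficients

# Tiny invented data set: intercept column plus one feature.
toy_H = np.array([[1., 2.], [1., 1.], [1., -1.], [1., -2.]])
toy_y = np.array([1, 1, -1, -1])
toy_w = logistic_regression_vectorized(toy_H, toy_y, np.zeros(2),
                                       step_size=0.1, max_iter=50)
print(toy_w)
```

On this toy data the feature coefficient grows positive, since larger feature values go with the +1 label.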

**14.** Now, let us run the logistic regression solver with the parameters below:

*    feature_matrix = feature_matrix extracted in #9
*    sentiment = sentiment extracted in #9
*    initial_coefficients = a 194-dimensional vector filled with zeros
*    step_size = 1e-7
*    max_iter = 301

Now, let us run the logistic regression solver and save the returned coefficients to variable coefficients.

In [100]:
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

iteration   0: log likelihood of observed labels = -36780.91768478
iteration   1: log likelihood of observed labels = -36775.13434712
iteration   2: log likelihood of observed labels = -36769.35713564
iteration   3: log likelihood of observed labels = -36763.58603240
iteration   4: log likelihood of observed labels = -36757.82101962
iteration   5: log likelihood of observed labels = -36752.06207964
iteration   6: log likelihood of observed labels = -36746.30919497
iteration   7: log likelihood of observed labels = -36740.56234821
iteration   8: log likelihood of observed labels = -36734.82152213
iteration   9: log likelihood of observed labels = -36729.08669961
iteration  10: log likelihood of observed labels = -36723.35786366
iteration  11: log likelihood of observed labels = -36717.63499744
iteration  12: log likelihood of observed labels = -36711.91808422
iteration  13: log likelihood of observed labels = -36706.20710739
iteration  14: log likelihood of observed labels = -36700.5020

**Quiz question:** As each iteration of gradient ascent passes, does the log likelihood increase or decrease?

Answer: increase

## Predicting sentiments

**15.** Recall from lecture that class predictions for a data point $\mathbf{x}_i$ can be computed from the coefficients $\mathbf{w}$ using the following formula:
$$
\hat{y}_i = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{x}_i^T\mathbf{w} > 0 \\
      -1 & \mathbf{x}_i^T\mathbf{w} \leq 0 \\
\end{array} 
\right.
$$

Now, we will write some code to compute class predictions. We will do this in two steps:
* **Step 1**: First compute the **scores** using **feature_matrix** and **coefficients** using a dot product.
* **Step 2**: Using the formula above, compute the class predictions from the scores - apply threshold 0 on the scores to compute the class predictions.

Step 1 can be implemented as follows:

In [101]:
# Compute the scores as a dot product between feature_matrix and coefficients.
scores = np.dot(feature_matrix, coefficients)

Now, complete the following code block for **Step 2** to compute the class predictions using the **scores** obtained above:

In [117]:
#products['pred'] = scores.apply(lambda x : +1 if x > 0 else -1)  #lambda in place function
products['pred_sentiment'] = [+1 if x > 0 else -1 for x in scores]   
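The thresholding step can also be sketched with `np.where`, which applies the rule to the whole score array at once. The scores below are invented; note that a score of exactly 0 maps to -1, matching the formula above.

```python
import numpy as np

toy_scores = np.array([2.3, -0.7, 0.0, 0.4])      # invented scores

# +1 where the score is strictly positive, -1 otherwise (including 0).
toy_predictions = np.where(toy_scores > 0, 1, -1)
num_positive_toy = int((toy_predictions == 1).sum())
print(toy_predictions)
```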

**Quiz question:** How many reviews were predicted to have positive sentiment?

In [118]:
num_positive  = (products['pred_sentiment'] == +1).sum()
num_negative = (products['pred_sentiment'] == -1).sum()
print num_positive
print num_negative
print round(num_positive/len(products['sentiment']), 2)

25126
27946
0.47


## Measuring accuracy

We will now measure the classification accuracy of the model. Recall from the lecture that the classification accuracy can be computed as follows:

$$
\mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}}
$$

Complete the following code block to compute the accuracy of the model.

In [119]:
num_correct = sum( np.equal(products['pred_sentiment'], products['sentiment']))
num_mistakes = len(products) - num_correct
accuracy = num_correct/len(products)
print "-----------------------------------------------------"
print '# Reviews   correctly classified =', num_correct
print '# Reviews incorrectly classified =', num_mistakes
print '# Reviews total                  =', len(products)
print "-----------------------------------------------------"
print 'Accuracy = %.2f' % accuracy

-----------------------------------------------------
# Reviews   correctly classified = 39903
# Reviews incorrectly classified = 13169
# Reviews total                  = 53072
-----------------------------------------------------
Accuracy = 0.75
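The accuracy formula amounts to the mean of an elementwise equality check. A minimal sketch follows, with a hypothetical helper name and invented labels:

```python
import numpy as np

def classification_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return float(np.mean(predicted == actual))

# Three of the four invented predictions match the true labels.
toy_accuracy = classification_accuracy([1, -1, -1, 1], [1, 1, -1, 1])
print('Accuracy = %.2f' % toy_accuracy)
```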


In [139]:
#alternatively using 
from sklearn import metrics

# now we do same but for predictions using our own classifier
y_pred =    products['pred_sentiment'].to_numpy()
y_true =    products['sentiment'].to_numpy()

print 'Accuracy = %.2f' % (metrics.accuracy_score(y_true, y_pred))

Accuracy = 0.75


**Quiz question**: What is the accuracy of the model on predictions made above? (round to 2 digits of accuracy)

Answer: 0.75

In [137]:
#To check whether our accuracy is comparable to sklearn's built-in LogisticRegression
X_matrix = products[important_words].to_numpy()  #.as_matrix() if using pandas data frame

#X_matrix omits the 'constant'=1 column of feature_matrix; fitting on it gave the same accuracy (0.79).
#The fit below uses feature_matrix from
#(feature_matrix, sentiment) = get_numpy_data(products, important_words, 'sentiment') 

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(feature_matrix, y_true )

y_pred_class = logreg.predict( feature_matrix)

print 'Accuracy = %.2f' % (metrics.accuracy_score(y_true, y_pred_class))

Accuracy = 0.79
Accuracy = 0.75


## Which words contribute most to positive & negative sentiments?

Recall that in Module 2 assignment, we were able to compute the "**most positive words**". These are words that correspond most strongly with positive reviews. In order to do this, we will first do the following:
* Treat each coefficient as a tuple, i.e. (**word**, **coefficient_value**).
* Sort all the (**word**, **coefficient_value**) tuples by **coefficient_value** in descending order.

In [106]:
coefficients = list(coefficients[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words, coefficients)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)

Now, **word_coefficient_tuples** contains a sorted list of (**word**, **coefficient_value**) tuples. The first 10 elements in this list correspond to the words that are most positive.
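To see how the slicing works on a sorted list, here is a small sketch with four words and made-up coefficient values (they are not the fitted coefficients).

```python
toy_words = ['great', 'would', 'old', 'money']
toy_coefficients = [0.07, -0.05, 0.02, -0.04]    # illustrative values only

toy_tuples = sorted(zip(toy_words, toy_coefficients),
                    key=lambda pair: pair[1], reverse=True)

most_positive = toy_tuples[:2]     # front of the list: largest coefficients
most_negative = toy_tuples[-2:]    # end of the list: smallest coefficients
print(most_positive)
print(most_negative)
```

Because the list is sorted in descending order, the most negative words appear at the end, with the single most negative word last.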

### Ten "most positive" words

Now, we compute the 10 words that have the most positive coefficient values. These words are associated with positive sentiment.

In [111]:
print word_coefficient_tuples[:10]   #start index default 0 for slicing

[('great', 0.066546084170457695), ('love', 0.065890762922123258), ('easy', 0.06479458680257838), ('little', 0.045435626308421372), ('loves', 0.044976401394906038), ('well', 0.030135001092107074), ('perfect', 0.029739937104968462), ('old', 0.020077541034775381), ('nice', 0.018408707995268992), ('daughter', 0.01770319990570169)]


**Quiz question:** Which word is **not** present in the top 10 "most positive" words?

### Ten "most negative" words

Next, we repeat this exercise on the 10 most negative words.  That is, we compute the 10 words that have the most negative coefficient values. These words are associated with negative sentiment.

In [108]:
print word_coefficient_tuples[-10:]  #end index default len of data for slicing

# if you don't trust defaults - we can do same using very long stmt
print word_coefficient_tuples[len(word_coefficient_tuples)-10:len(word_coefficient_tuples)]


[('monitor', -0.024482100545891717), ('return', -0.026592778462247283), ('back', -0.027742697230661331), ('get', -0.028711552980192581), ('disappointed', -0.028978976142317068), ('even', -0.030051249236035808), ('work', -0.033069515294752737), ('money', -0.038982037286487109), ('product', -0.041511033392108897), ('would', -0.053860148445203121)]


**Quiz question:** Which word is **not** present in the top 10 "most negative" words?