# Logistic Regression

In this notebook, we will explore the concept of [*Logistic Regression*](https://en.wikipedia.org/wiki/Logistic_regression) and its application to sentiment analysis.

The goal is to use the [amazon_baby_subset.csv](../Data/amazon_baby_subset.csv) dataset, which contains 4 columns: product name, customer review, customer rating, and sentiment. The rating goes from 1 (worst) to 5 (best), and the sentiment is -1 if the rating is low (< 3) and 1 if it is good (>= 3). Here, logistic regression is used to assign a weight to each important word in the reviews (the important words are given for now; later we will see how to select them) and to build a model that predicts whether a future review is good or bad.

First, let's load all used packages and the dataset.

In [1]:
import turicreate as tc
import json
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline



In [2]:
products = tc.SFrame('../Data/amazon_baby_subset.csv')

------------------------------------------------------
Inferred types from first 100 line(s) of file as 
column_type_hints=[str,str,int,int]
If parsing fails due to incorrect types, you can correct
the inferred type list above and pass it to read_csv in
the column_type_hints argument
------------------------------------------------------


In [3]:
products.head()

name,review,rating,sentiment
Stop Pacifier Sucking without tears with ...,All of my kids have cried non-stop when I tried to ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,We wanted to get something to keep track ...,5,1
Nature's Lullabies Second Year Sticker Calendar ...,My daughter had her 1st baby over a year ago. ...,5,1
"Lamaze Peekaboo, I Love You ...","One of baby's first and favorite books, and i ...",4,1
SoftPlay Peek-A-Boo Where's Elmo A Childr ...,Very cute interactive book! My son loves this ...,5,1
Our Baby Girl Memory Book,"Beautiful book, I love it to record cherished t ...",5,1
Hunnt&reg; Falling Flowers and Birds Kids ...,"Try this out for a spring project !Easy ,fun and ...",5,1
Blessed By Pope Benedict XVI Divine Mercy Full ...,very nice Divine Mercy Pendant of Jesus now on ...,5,1
Cloth Diaper Pins Stainless Steel ...,We bought the pins as my 6 year old Autistic son ...,4,1
Cloth Diaper Pins Stainless Steel ...,It has been many years since we needed diaper ...,5,1


We can count how many positive and negative reviews the data set has.

In [4]:
print 'Number of positive reviews =', len(products[products['sentiment']==1])
print 'Number of negative reviews =', len(products[products['sentiment']==-1])

Number of positive reviews = 26579
Number of negative reviews = 26493


Pretty close numbers!

The reviews contain punctuation. Let's remove it and also create columns with the count of each important word. The important words are in the file [important_words.json](../Data/important_words.json).

## Cleaning the data

First, load the important words from the *json* file:

In [5]:
with open('../Data/important_words.json', 'r') as f: # Reads the list of most frequent words
    important_words = json.load(f)
important_words = [str(s) for s in important_words]
print important_words

['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little', 'seat', 'old', 'well', 'get', 'also', 'really', 'son', 'time', 'bought', 'product', 'good', 'daughter', 'much', 'loves', 'stroller', 'put', 'months', 'car', 'still', 'back', 'used', 'recommend', 'first', 'even', 'perfect', 'nice', 'bag', 'two', 'using', 'got', 'fit', 'around', 'diaper', 'enough', 'month', 'price', 'go', 'could', 'soft', 'since', 'buy', 'room', 'works', 'made', 'child', 'keep', 'size', 'small', 'need', 'year', 'big', 'make', 'take', 'easily', 'think', 'crib', 'clean', 'way', 'quality', 'thing', 'better', 'without', 'set', 'new', 'every', 'cute', 'best', 'bottles', 'work', 'purchased', 'right', 'lot', 'side', 'happy', 'comfortable', 'toy', 'able', 'kids', 'bit', 'night', 'long', 'fits', 'see', 'us', 'another', 'play', 'day', 'money', 'monitor', 'tried', 'thought', 'never', 'item', 'hard', 'plastic', 'however', 'disappointed', 'reviews', 'something', 'going', 'pump', 'bottle', 'cup', 'waste', 'retu

Now, let's remove the punctuation from the reviews:

In [6]:
def remove_punctuation(text):
    import string
    return text.translate(None, string.punctuation) 

products['review_clean'] = products['review'].apply(remove_punctuation)
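The `translate(None, ...)` call above is Python 2 syntax. A minimal sketch of an equivalent for Python 3, assuming the reviews are plain `str` objects (the name `remove_punctuation_py3` is hypothetical and not part of the notebook):

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def remove_punctuation_py3(text):
    # Hypothetical Python 3 counterpart of remove_punctuation above.
    return text.translate(_PUNCT_TABLE)

cleaned = remove_punctuation_py3("Very cute, my son loves it!")
print(cleaned)
```

Note that apostrophes are removed too, so a word such as "baby's" becomes "babys" and will not be counted as "baby".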

The next step is to count each important word and create new columns with the results.

In [7]:
for word in important_words:
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))
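To see what each new column holds, here is a small sketch of the same counting rule applied to a single made-up review (the review text and the short word list are assumptions for illustration only):

```python
toy_review = "Baby loves this baby seat and the baby monitor"  # made-up review
toy_words = ['baby', 'seat', 'great']  # small stand-in for important_words

# Same rule as above: split on whitespace and count exact token matches
tokens = toy_review.split()
toy_counts = {word: tokens.count(word) for word in toy_words}
print(toy_counts)
```

Because `split().count(word)` matches whole tokens and is case-sensitive, "Baby" is not counted as "baby" here.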

In [8]:
products[important_words[0]] # This is the count of the word 'baby' for each review.

dtype: int
Rows: 53072
[0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 7, 0, 0, 1, 0, 0, 0, 0, 0, 0, 13, 0, 1, 1, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 2, 0, 0, 0, ... ]

In [9]:
len(products[products[important_words[0]] > 0]) # Number of reviews that contain the word 'baby'

12174

How many times does the word *baby* appear in all the reviews?

In [10]:
sum(products['baby'])

18715

## Starting the logistic regression

Logistic regression models a categorical target. Here, in sentiment analysis, the target takes only 2 values: 1 for a good review and -1 for a bad review.

The whole idea is to estimate the probability that a review is good or bad. This probability is computed using the [logistic function](https://en.wikipedia.org/wiki/Logistic_function) (also known as the sigmoid function; it is the inverse of the logit link function):

$$ P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + e^{-\mathbf{w}^T h(\mathbf{x}_i)}} $$

where, in our code, the feature vector $h(\mathbf{x}_i)$ represents the word counts of **important_words** in the review $\mathbf{x}_i$. For a given vector of weights $\mathbf{w}$ and the word counts $h(\mathbf{x}_i)$, this equation gives the probability that the sentiment $y_i$ is positive (equal to 1). The values go from 0 to 1. We can choose a threshold (usually 0.5): a probability above it is classified as a positive sentiment, and a probability at or below it as a negative sentiment.
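To get a feel for the logistic function, the sketch below evaluates it on a few made-up scores $\mathbf{w}^T h(\mathbf{x}_i)$. The helper name `sigmoid` is an assumption for illustration and is not used elsewhere in the notebook.

```python
import numpy as np

def sigmoid(score):
    # Logistic function applied to the score w^T h(x)
    return 1.0 / (1.0 + np.exp(-score))

scores = np.array([-2.0, 0.0, 2.0])  # made-up scores
probs = sigmoid(scores)
print(probs)
```

A score of 0 gives a probability of exactly 0.5, and opposite scores give complementary probabilities: $P(s) + P(-s) = 1$.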

The best predictions come from optimized weights, and a good way to obtain such weights is by [maximizing the likelihood function](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation). The [likelihood function](https://en.wikipedia.org/wiki/Likelihood_function) is written as:

$$\ell(\mathbf{w}) = \prod_{i = 1}^{N}P(y_i | \mathbf{x}_i,\mathbf{w})$$

The optimization can be done with the gradient ascent method (gradient descent applied to the negative of the function). To find the optimized weights, we need the derivative of the likelihood function with respect to $\mathbf{w}$. However, differentiating a product of $N$ terms is not easy. A good strategy is to work with the [natural logarithm](https://en.wikipedia.org/wiki/Natural_logarithm) of the likelihood function, called the **log-likelihood**. Since the logarithm is monotonically increasing, both functions are maximized by the same weights.

$$\ell\ell(\mathbf{w}) = \ln\prod_{i = 1}^{N}P(y_i | \mathbf{x}_i,\mathbf{w}) = \sum_{i = 1}^{N}\ln P(y_i | \mathbf{x}_i,\mathbf{w})$$

The log-likelihood can be computed using the following formula:

$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \{ (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln[1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))]\} $$
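As a quick sanity check, the closed-form expression can be compared numerically with the direct definition $\sum_i \ln P(y_i | \mathbf{x}_i, \mathbf{w})$. The sketch below uses small random data; all names and values are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
H = rng.randn(6, 3)                    # toy feature matrix h(x_i), made up
w = rng.randn(3)                       # toy weights
y = np.array([1, -1, 1, 1, -1, -1])    # toy labels

scores = H.dot(w)
p_pos = 1.0 / (1.0 + np.exp(-scores))  # P(y_i = +1 | x_i, w)

# Direct definition: sum of ln P(y_i | x_i, w), using 1 - p_pos for negative labels
ll_direct = np.sum(np.log(np.where(y == 1, p_pos, 1.0 - p_pos)))

# Closed-form expression from the formula above
indicator = (y == 1).astype(float)
ll_closed = np.sum((indicator - 1.0) * scores - np.log(1.0 + np.exp(-scores)))

print(ll_direct, ll_closed)
```

Both values should agree up to floating-point rounding.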

The derivative of the log-likelihood is:

$$\frac{\partial\ell\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)$$

where $\mathbf{1}[y_i = +1]$ is equal to 1 if the sentiment is positive (+1) and 0 if the sentiment is negative (-1). Now that we can compute the gradient of the log-likelihood, the gradient ascent method can be applied. At iteration $t$, each coefficient $w_j$ is updated as:

$$w_j^{(t+1)} = w_j^{(t)} + \eta\frac{\partial\ell\ell}{\partial w_j}$$

where $\eta$ is the step size and the derivative is evaluated at the current weights $\mathbf{w}^{(t)}$.
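One way to gain confidence in the derivative formula is to compare it with a central finite-difference approximation of the log-likelihood. This is only a sketch on random toy data; the helper names `log_likelihood` and `gradient` are assumptions for illustration.

```python
import numpy as np

def log_likelihood(H, y, w):
    scores = H.dot(w)
    indicator = (y == 1).astype(float)
    return np.sum((indicator - 1.0) * scores - np.log(1.0 + np.exp(-scores)))

def gradient(H, y, w):
    # sum_i h_j(x_i) * (1[y_i = +1] - P(y_i = +1 | x_i, w)) for every j
    p_pos = 1.0 / (1.0 + np.exp(-H.dot(w)))
    errors = (y == 1).astype(float) - p_pos
    return H.T.dot(errors)

rng = np.random.RandomState(1)
H = rng.randn(8, 3)                      # made-up features
y = np.where(rng.rand(8) > 0.5, 1, -1)   # made-up labels
w = rng.randn(3)

# Central differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (log_likelihood(H, y, w + eps * e) - log_likelihood(H, y, w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient(H, y, w)
print(np.max(np.abs(numeric - analytic)))
```

The printed difference should be very small, which indicates that the analytic gradient matches the slope of the log-likelihood.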

Great!!! Now let's implement the equations above to find the optimal weights $\mathbf{w}$. Then we can use them to predict the sentiment of the reviews.

The first step is to convert the Turicreate SFrame data to a NumPy array to perform the math. The following function receives the SFrame data, the list of desired features (that will be the important word counts), and the class label (in this case, 'sentiment'). It returns two outputs: a matrix with the features (word counts) plus an intercept column (all values equal to 1), and an array with the sentiment (-1 or +1) of each review.

In [11]:
def get_numpy_data(data_sframe, features, label):
    data_sframe['intercept'] = 1 # Constant column of ones for the intercept term, added to the sframe
    features = ['intercept'] + features # Including 'intercept' in the features list
    features_sframe = data_sframe[features] # Keeping an sframe with only the desired features
    feature_matrix = features_sframe.to_numpy() # Converting the features sframe to a numpy array (matrix)
    label_sarray = data_sframe[label] # Picking the desired label
    label_array = label_sarray.to_numpy() # Converting the label to a numpy array
    return feature_matrix, label_array

In [12]:
feature_matrix, sentiment = get_numpy_data(products, important_words, 'sentiment')
feature_matrix.shape

(53072, 194)

In [13]:
print len(products) # Number of reviews
print len(important_words) + 1 # Number of features plus the intercept

53072
194


Okay, the *feature_matrix* matches in size the number of reviews and the number of features (plus the intercept).

Let's now create a function to calculate the probability of a positive sentiment (the negative sentiment probability is just one minus the positive sentiment probability), given the feature matrix and an array of coefficients. The output is an array with one predicted probability per review (that is, per row of the feature matrix).

In [14]:
def predict_probability(feature_matrix, coefficients):
    # Computing P(y_i = +1 | x_i, w), using the logistic function.
    # The dot product yields one score w^T h(x_i) per review.
    predictions = 1./(1. + np.exp(-np.dot(feature_matrix, coefficients)))
    return predictions

We will now write a function that computes the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:

* **errors** vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
* **feature** vector containing $h_j(\mathbf{x}_i)$  for all $i$. 

The derivative is just the dot product of the errors and the feature.

In [15]:
def feature_derivative(errors, feature):     
    derivative = np.dot(errors, feature)
    return derivative
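Since each partial derivative is a dot product with one column of the feature matrix, all of them can also be obtained at once as a matrix-vector product. A small sketch on made-up values (the `toy_` names are assumptions for illustration):

```python
import numpy as np

rng = np.random.RandomState(2)
toy_features = rng.randint(0, 3, size=(5, 4)).astype(float)  # made-up word counts
toy_errors = rng.uniform(-1, 1, size=5)                      # made-up errors

# One dot product per feature, as in feature_derivative
per_feature = np.array([np.dot(toy_errors, toy_features[:, j])
                        for j in range(toy_features.shape[1])])

# All partial derivatives in a single product: H^T * errors
all_at_once = toy_features.T.dot(toy_errors)

print(per_feature)
print(all_at_once)
```

The two results are the same; the second form avoids the Python loop over features.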

We will also create a function to compute the log-likelihood over all the reviews. It is useful as a sanity check: in the gradient ascent method, the log-likelihood should increase at each iteration until a maximum is reached.

The function has three inputs: the feature matrix, the sentiment array, and the computed (or initial) coefficients. The output is a scalar with the value of the log-likelihood.

In [16]:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    isone = (sentiment==+1) # Saving an array with positive sentiments as 1 and others as 0
    
    # Computing each part of the log-likelihood formula
    dotfc = np.dot(feature_matrix, coefficients)
    lnexp = np.log(1. + np.exp(-dotfc))
    
    # Avoiding infinite results.
    mask = np.isinf(lnexp)
    lnexp[mask] = -dotfc[mask]
    
    # log-likelihood
    ll = np.sum((isone-1)*dotfc - lnexp)
    return ll
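The overflow guard above can also be written with `np.logaddexp`, since $\ln(1 + e^{-s}) = \ln(e^{0} + e^{-s})$. This is only a sketch of an alternative, not the notebook's method; the function name `compute_log_likelihood_stable` and the toy inputs are assumptions.

```python
import numpy as np

def compute_log_likelihood_stable(feature_matrix, sentiment, coefficients):
    # Hypothetical variant: ln(1 + exp(-s)) is evaluated as logaddexp(0, -s),
    # which stays finite even when exp(-s) would overflow.
    scores = np.dot(feature_matrix, coefficients)
    indicator = (sentiment == +1).astype(float)
    return np.sum((indicator - 1.0) * scores - np.logaddexp(0.0, -scores))

# Made-up data; the second row produces a very negative score (-1000)
toy_matrix = np.array([[1.0, 2.0], [1.0, -1000.0]])
toy_sentiment = np.array([1, -1])
toy_coefficients = np.array([0.0, 1.0])

ll_toy = compute_log_likelihood_stable(toy_matrix, toy_sentiment, toy_coefficients)
print(ll_toy)
```

The result stays finite even though `np.exp(1000)` on its own would overflow to infinity.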

Now we can use the three functions above to do the logistic regression using the gradient descent method.
The next function receives the feature matrix, the sentiment array, the initial guess for the coefficients, the step size, and the maximum number of iterations. The output is the array of coefficients, one for each feature.

In [17]:
def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    
    # Run gradient ascent for max_iter iterations
    for itr in xrange(max_iter):

        # Making the predictions with the initial or updated coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value for +1 (positive sentiment)
        indicator = (sentiment==+1)
        
        # Compute the errors with the initial or updated coefficients
        errors = indicator - predictions
        
        # Apply the gradient ascent update to each coefficient
        for j in xrange(len(coefficients)):
            
            # Update the coefficient for feature j (feature_derivative already returns a scalar)
            coefficients[j] = coefficients[j] + step_size * feature_derivative(errors, feature_matrix[:,j])
            
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print 'Iteration %*d: log-likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
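The inner loop over coefficients can be replaced by a single matrix-vector product. The sketch below is a hypothetical vectorized variant (the name `logistic_regression_vectorized`, the toy data, and the step size are assumptions), written with `range` so that it also runs on Python 3.

```python
import numpy as np

def logistic_regression_vectorized(feature_matrix, sentiment, initial_coefficients,
                                   step_size, max_iter):
    # Hypothetical vectorized variant of the gradient ascent loop above
    coefficients = np.array(initial_coefficients, dtype=float)
    indicator = (sentiment == +1).astype(float)
    for _ in range(max_iter):
        predictions = 1.0 / (1.0 + np.exp(-np.dot(feature_matrix, coefficients)))
        errors = indicator - predictions
        # Gradient for all coefficients at once: H^T * errors
        coefficients += step_size * np.dot(feature_matrix.T, errors)
    return coefficients

# Made-up data: an intercept column plus one count-like feature
toy_matrix = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
toy_sentiment = np.array([-1, -1, 1, 1])
toy_coefficients = logistic_regression_vectorized(toy_matrix, toy_sentiment,
                                                  np.zeros(2), step_size=0.1, max_iter=200)
print(toy_coefficients)
```

On this toy data the coefficient of the feature ends up positive, since larger feature values go with positive labels.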

Now, let's see it in action.

In [18]:
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

Iteration   0: log-likelihood of observed labels = -36780.91768478
Iteration   1: log-likelihood of observed labels = -36775.13434712
Iteration   2: log-likelihood of observed labels = -36769.35713564
Iteration   3: log-likelihood of observed labels = -36763.58603240
Iteration   4: log-likelihood of observed labels = -36757.82101962
Iteration   5: log-likelihood of observed labels = -36752.06207964
Iteration   6: log-likelihood of observed labels = -36746.30919497
Iteration   7: log-likelihood of observed labels = -36740.56234821
Iteration   8: log-likelihood of observed labels = -36734.82152213
Iteration   9: log-likelihood of observed labels = -36729.08669961
Iteration  10: log-likelihood of observed labels = -36723.35786366
Iteration  11: log-likelihood of observed labels = -36717.63499744
Iteration  12: log-likelihood of observed labels = -36711.91808422
Iteration  13: log-likelihood of observed labels = -36706.20710739
Iteration  14: log-likelihood of observed labels = -36700.5020

Now, with the coefficients in hand, let's predict the sentiments.

For this analysis, I am choosing to classify probabilities larger than 0.5 as positive and probabilities of 0.5 or less as negative. So:

$$
\hat{y}_i = 
\left\{
\begin{array}{ll}
      +1 & P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) > 0.5 \\
      -1 & P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) \leq 0.5 \\
\end{array} 
\right.
$$
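Because the logistic function equals 0.5 exactly when the score $\mathbf{w}^T h(\mathbf{x}_i)$ is 0, thresholding the probability at 0.5 is the same as checking the sign of the score. A small sketch with made-up scores:

```python
import numpy as np

toy_scores = np.array([-3.0, -0.1, 0.0, 0.1, 3.0])  # made-up values of w^T h(x)
toy_probs = 1.0 / (1.0 + np.exp(-toy_scores))

# Rule from the equation above: +1 if P > 0.5, otherwise -1
by_probability = np.where(toy_probs > 0.5, 1, -1)

# Equivalent rule on the raw score: +1 if score > 0, otherwise -1
by_score = np.where(toy_scores > 0, 1, -1)

print(by_probability)
print(by_score)
```

Both rules give the same labels, including the boundary case of a zero score, which is classified as negative.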

Time to compute the predictions.

In [19]:
pred = predict_probability(feature_matrix,coefficients)

In [20]:
classify_predictions = tc.SArray(pred).apply(lambda x: 1 if x > 0.5 else -1)

In [21]:
len(classify_predictions[classify_predictions == 1])

25126

In [22]:
len(sentiment[sentiment == 1])

26579

The number of predicted positive reviews is close to the true number, but that does not tell us whether the -1's and 1's are assigned to the correct reviews. So, let's compare the predictions with the true sentiments and compute the **accuracy** of our logistic regression.

In [23]:
num_total = len(products) # Total number of reviews
num_correct = (classify_predictions == products['sentiment']).sum() # Each match counts as 1 (True), each mismatch as 0 (False)
num_wrong = num_total - num_correct
accuracy = 1.0*num_correct/num_total

print 'Number of correct reviews:', num_correct
print 'Number of wrong reviews:', num_wrong
print 'Number of reviews:', num_total
print 'Accuracy: %.2f' % accuracy

Number of correct reviews: 39903
Number of wrong reviews: 13169
Number of reviews: 53072
Accuracy: 0.75
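To put the accuracy in context, it helps to compare it with a majority-class baseline, that is, always predicting the most frequent label. The sketch below computes an accuracy with NumPy on made-up labels and derives the baseline from the class counts printed earlier (26579 positive, 26493 negative). The helper name `accuracy_score_np` is an assumption.

```python
import numpy as np

def accuracy_score_np(predicted, actual):
    # Fraction of positions where the predicted label equals the true label
    predicted = np.asarray(predicted)
    actual = np.asarray(actual)
    return np.mean(predicted == actual)

# Made-up labels for illustration
toy_actual = np.array([1, 1, -1, -1, 1])
toy_predicted = np.array([1, -1, -1, 1, 1])
toy_accuracy = accuracy_score_np(toy_predicted, toy_actual)

# Majority-class baseline: always predict +1
baseline = 26579.0 / (26579 + 26493)
print(toy_accuracy, baseline)
```

With nearly balanced classes the baseline is only about 0.50, so an accuracy of 0.75 is a clear improvement over always guessing "positive".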


Let's compare our result with the built-in logistic classifier of Turicreate:

In [24]:
modeltc = tc.logistic_classifier.create(products, target='sentiment', features = important_words)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [25]:
predictionstc = modeltc.predict(products, output_type = 'class')
resultstc = modeltc.evaluate(products)
print 'Accuracy of our code: %.2f' % accuracy
print 'Accuracy of Turicreate: %.2f' % resultstc['accuracy']
print classify_predictions.head()
print predictionstc.head()
print products.head()['sentiment']

Accuracy of our code: 0.75
Accuracy of Turicreate: 0.79
[1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
[1, -1, 1, 1, 1, 1, 1, 1, 1, -1]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


We got pretty close to the Turicreate package. We can try to improve the model by penalizing large coefficients, as we did with the [polynomial regression](../Polynomial_Regression/Polynomial_Regression.ipynb).

## Logistic regression with L2 penalization (regularization)

As in the [polynomial regression](../Polynomial_Regression/Polynomial_Regression.ipynb), we will include a term in the cost function to penalize large coefficients, which are a symptom of overfitting. For the "simple" logistic regression, the cost function (which we maximize) is the log-likelihood:

$$Cost(\mathbf{w}) = \ell\ell(\mathbf{w})$$

Now, we include the penalization term: the squared L2-norm of the coefficients multiplied by the tuning parameter $\lambda$:

$$Cost(\mathbf{w}) = \ell\ell(\mathbf{w}) - \lambda \|\mathbf{w}\|_{2}^{2}$$

To find the best coefficients, we have to take the derivative of the cost function. The derivative of the log-likelihood is already known. The derivative of the penalty term with respect to the j-th coefficient is:

$$\frac{\partial \|\mathbf{w}\|_{2}^{2}}{\partial w_j} = \frac{\partial}{\partial w_j}(w_0^2 + w_1^2 + w_2^2 + \dots + w_j^2 + \dots + w_D^2) = 2w_j$$

where $D$ is the number of features.

For gradient ascent, the update of coefficient $w_j$ at iteration $t$ will be:

$$w_j^{(t+1)} = w_j^{(t)} + \eta \big( \frac{\partial\ell\ell}{\partial w_j} - 2\lambda w_j^{(t)} \big)$$

where:

$$\frac{\partial\ell\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)$$

Usually the intercept is not penalized. So, let's keep that in mind when we create our functions.
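The penalized gradient can be checked against finite differences in the same spirit as an unpenalized one. The sketch below assumes the intercept is the first coefficient and leaves it out of the penalty; the helper names `cost_l2` and `gradient_l2`, the random data, and the penalty value are made up for illustration.

```python
import numpy as np

def cost_l2(H, y, w, l2_penalty):
    scores = H.dot(w)
    indicator = (y == 1).astype(float)
    ll = np.sum((indicator - 1.0) * scores - np.logaddexp(0.0, -scores))
    return ll - l2_penalty * np.sum(w[1:] ** 2)   # intercept w[0] is not penalized

def gradient_l2(H, y, w, l2_penalty):
    errors = (y == 1).astype(float) - 1.0 / (1.0 + np.exp(-H.dot(w)))
    grad = H.T.dot(errors)
    grad[1:] -= 2.0 * l2_penalty * w[1:]          # -2 * lambda * w_j, skipping the intercept
    return grad

rng = np.random.RandomState(3)
H = np.hstack([np.ones((8, 1)), rng.randn(8, 2)])  # intercept column plus 2 made-up features
y = np.where(rng.rand(8) > 0.5, 1, -1)
w = rng.randn(3)
lam = 5.0

# Central differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (cost_l2(H, y, w + eps * e, lam) - cost_l2(H, y, w - eps * e, lam)) / (2 * eps)
    for e in np.eye(3)
])
analytic = gradient_l2(H, y, w, lam)
print(np.max(np.abs(numeric - analytic)))
```

A very small printed difference indicates that the $-2\lambda w_j$ term, including the factor $\lambda$, is consistent with the penalized cost.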

The first step is to split the data into the train and validation sets.

In [26]:
train_data, validation_data = products.random_split(.8, seed = 1)

print 'Training set   : %d data points' % len(train_data)
print 'Validation set : %d data points' % len(validation_data)

Training set   : 42474 data points
Validation set : 10598 data points


Convert both sets to NumPy arrays.

In [27]:
feature_matrix_train, sentiment_train = get_numpy_data(train_data, important_words, 'sentiment')
feature_matrix_valid, sentiment_valid = get_numpy_data(validation_data, important_words, 'sentiment')

The gradient must now include the term $-2\lambda w_j$ for all the coefficients but the intercept.

In [28]:
def feature_derivative_with_L2(errors, feature, coefficient, l2_penalty, intercept): 
    derivative = np.dot(errors, feature)

    # Add the L2 penalty to all coefficients but the intercept
    if not intercept:
        derivative -= 2 * l2_penalty * coefficient
        
    return derivative

The cost function (log-likelihood) also needs the L2 term.

In [29]:
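def compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty):
    isone = (sentiment==+1) # Array with positive sentiments as 1 and others as 0
    scores = np.dot(feature_matrix, coefficients) # w^T h(x_i) for each review
    # Log-likelihood minus the L2 penalty (the intercept, coefficients[0], is not penalized)
    lp = np.sum((isone-1)*scores - np.log(1. + np.exp(-scores))) - l2_penalty*np.sum(coefficients[1:]**2)
    
    return lp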

Now, the logistic regression:

In [30]:
def logistic_regression_with_L2(feature_matrix, sentiment, initial_coefficients, step_size, l2_penalty, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array

    # Run gradient ascent for max_iter iterations
    for itr in xrange(max_iter):

        # Making the predictions with the initial or updated coefficients
        predictions = predict_probability(feature_matrix, coefficients)
        
        # Compute indicator value for +1 (positive sentiment)
        indicator = (sentiment==+1)
        
        # Compute the errors with the initial or updated coefficients
        errors = indicator - predictions

        # Apply the gradient ascent update to each coefficient
        for j in xrange(len(coefficients)): # loop over each coefficient
            is_intercept = (j == 0)

            # Computing the derivative
            derivative = feature_derivative_with_L2(errors, feature_matrix[:,j], coefficients[j], l2_penalty, is_intercept)
            
            # Updating the coefficient
            coefficients[j] += step_size * derivative
        
        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood_with_L2(feature_matrix, sentiment, coefficients, l2_penalty)
            print 'iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp)
    return coefficients
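To illustrate the effect of the penalty before running it on the reviews, here is a hypothetical vectorized variant of the regularized update applied to a tiny made-up dataset. The name `fit_l2`, the toy data, the step size, and the penalty values are all assumptions.

```python
import numpy as np

def fit_l2(feature_matrix, sentiment, step_size, l2_penalty, max_iter):
    # Hypothetical vectorized variant of logistic_regression_with_L2
    coefficients = np.zeros(feature_matrix.shape[1])
    indicator = (sentiment == +1).astype(float)
    for _ in range(max_iter):
        predictions = 1.0 / (1.0 + np.exp(-np.dot(feature_matrix, coefficients)))
        gradient = np.dot(feature_matrix.T, indicator - predictions)
        gradient[1:] -= 2.0 * l2_penalty * coefficients[1:]  # the intercept is not penalized
        coefficients += step_size * gradient
    return coefficients

# Made-up data: an intercept column plus one count-like feature
toy_matrix = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
toy_sentiment = np.array([-1, -1, 1, 1])

slopes = {}
for penalty in [0.0, 10.0]:
    w = fit_l2(toy_matrix, toy_sentiment, step_size=0.01, l2_penalty=penalty, max_iter=500)
    slopes[penalty] = w[1]
    print(penalty, w)
```

On this toy data, the feature coefficient obtained with the larger penalty is closer to zero than the unpenalized one.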

Let's check different values for the l2_penalty:

In [31]:
# L2_penalty = 0
coefficients_0_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=5e-6, l2_penalty=0, max_iter=501)

iteration   0: log likelihood of observed labels = -29256.53509469
iteration   1: log likelihood of observed labels = -29080.15933459
iteration   2: log likelihood of observed labels = -28910.70310738
iteration   3: log likelihood of observed labels = -28747.51727534
iteration   4: log likelihood of observed labels = -28590.11088099
iteration   5: log likelihood of observed labels = -28438.09355101
iteration   6: log likelihood of observed labels = -28291.14073932
iteration   7: log likelihood of observed labels = -28148.97259118
iteration   8: log likelihood of observed labels = -28011.34094170
iteration   9: log likelihood of observed labels = -27878.02119960
iteration  10: log likelihood of observed labels = -27748.80718346
iteration  11: log likelihood of observed labels = -27623.50775262
iteration  12: log likelihood of observed labels = -27501.94453552
iteration  13: log likelihood of observed labels = -27383.95033346
iteration  14: log likelihood of observed labels = -27269.3679

In [32]:
# L2_penalty = 4
coefficients_4_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=5e-6, l2_penalty=4, max_iter=501)

iteration   0: log likelihood of observed labels = -29256.53882048
iteration   1: log likelihood of observed labels = -29080.18103708
iteration   2: log likelihood of observed labels = -28910.75588332
iteration   3: log likelihood of observed labels = -28747.61324430
iteration   4: log likelihood of observed labels = -28590.26130285
iteration   5: log likelihood of observed labels = -28438.30891072
iteration   6: log likelihood of observed labels = -28291.43081329
iteration   7: log likelihood of observed labels = -28149.34650266
iteration   8: log likelihood of observed labels = -28011.80720946
iteration   9: log likelihood of observed labels = -27878.58778173
iteration  10: log likelihood of observed labels = -27749.48151749
iteration  11: log likelihood of observed labels = -27624.29679256
iteration  12: log likelihood of observed labels = -27502.85478602
iteration  13: log likelihood of observed labels = -27384.98788148
iteration  14: log likelihood of observed labels = -27270.5384

In [33]:
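# L2_penalty = 10 was intended, but the call below passes l2_penalty=1;
# set l2_penalty=10 and rerun if a penalty of 10 is wanted.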
coefficients_10_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=5e-6, l2_penalty=1, max_iter=501)

iteration   0: log likelihood of observed labels = -29256.53602614
iteration   1: log likelihood of observed labels = -29080.16476032
iteration   2: log likelihood of observed labels = -28910.71630190
iteration   3: log likelihood of observed labels = -28747.54126902
iteration   4: log likelihood of observed labels = -28590.14848947
iteration   5: log likelihood of observed labels = -28438.14739631
iteration   6: log likelihood of observed labels = -28291.21326647
iteration   7: log likelihood of observed labels = -28149.06608203
iteration   8: log likelihood of observed labels = -28011.45752709
iteration   9: log likelihood of observed labels = -27878.16287030
iteration  10: log likelihood of observed labels = -27748.97580016
iteration  11: log likelihood of observed labels = -27623.70505522
iteration  12: log likelihood of observed labels = -27502.17215166
iteration  13: log likelihood of observed labels = -27384.20978639
iteration  14: log likelihood of observed labels = -27269.6606

In [34]:
# L2_penalty = 1e2
coefficients_1e2_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=5e-6, l2_penalty=1e2, max_iter=501)

iteration   0: log likelihood of observed labels = -29256.62823956
iteration   1: log likelihood of observed labels = -29080.70154526
iteration   2: log likelihood of observed labels = -28912.02080432
iteration   3: log likelihood of observed labels = -28749.91187801
iteration   4: log likelihood of observed labels = -28593.86180762
iteration   5: log likelihood of observed labels = -28443.46039146
iteration   6: log likelihood of observed labels = -28298.36496072
iteration   7: log likelihood of observed labels = -28158.27896711
iteration   8: log likelihood of observed labels = -28022.93880567
iteration   9: log likelihood of observed labels = -27892.10557528
iteration  10: log likelihood of observed labels = -27765.55981884
iteration  11: log likelihood of observed labels = -27643.09807298
iteration  12: log likelihood of observed labels = -27524.53052351
iteration  13: log likelihood of observed labels = -27409.67934231
iteration  14: log likelihood of observed labels = -27298.3774

In [35]:
# L2_penalty = 1e3
coefficients_1e3_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=5e-6, l2_penalty=1e3, max_iter=501)

iteration   0: log likelihood of observed labels = -29257.46654341
iteration   1: log likelihood of observed labels = -29085.54856449
iteration   2: log likelihood of observed labels = -28923.72175265
iteration   3: log likelihood of observed labels = -28771.03546664
iteration   4: log likelihood of observed labels = -28626.73397364
iteration   5: log likelihood of observed labels = -28490.18963521
iteration   6: log likelihood of observed labels = -28360.86278202
iteration   7: log likelihood of observed labels = -28238.27724819
iteration   8: log likelihood of observed labels = -28122.00512090
iteration   9: log likelihood of observed labels = -28011.65695208
iteration  10: log likelihood of observed labels = -27906.87523962
iteration  11: log likelihood of observed labels = -27807.32988968
iteration  12: log likelihood of observed labels = -27712.71489976
iteration  13: log likelihood of observed labels = -27622.74581080
iteration  14: log likelihood of observed labels = -27537.1576

In [36]:
# L2_penalty = 1e5
coefficients_1e5_penalty = logistic_regression_with_L2(feature_matrix_train, sentiment_train,
                                                     initial_coefficients=np.zeros(194),
                                                     step_size=5e-6, l2_penalty=1e5, max_iter=501)

iteration   0: log likelihood of observed labels = -29349.67996719
iteration   1: log likelihood of observed labels = -29349.56226752
iteration   2: log likelihood of observed labels = -29349.55642623
iteration   3: log likelihood of observed labels = -29349.55401770
iteration   4: log likelihood of observed labels = -29349.55195595
iteration   5: log likelihood of observed labels = -29349.55009433
iteration   6: log likelihood of observed labels = -29349.54840914
iteration   7: log likelihood of observed labels = -29349.54688349
iteration   8: log likelihood of observed labels = -29349.54550226
iteration   9: log likelihood of observed labels = -29349.54425177
iteration  10: log likelihood of observed labels = -29349.54311966
iteration  11: log likelihood of observed labels = -29349.54209472
iteration  12: log likelihood of observed labels = -29349.54116680
iteration  13: log likelihood of observed labels = -29349.54032672
iteration  14: log likelihood of observed labels = -29349.5395

Let's compare the coefficients for each l2-penalty. We are going to create a table that will merge all the coefficients for all the features plus the intercept.

In [37]:
table = tc.SFrame({'word': ['(intercept)'] + important_words})
def add_coefficients_to_table(coefficients, column_name):
    table[column_name] = coefficients
    return table

In [38]:
add_coefficients_to_table(coefficients_0_penalty, 'coefficients [L2=0]')
add_coefficients_to_table(coefficients_4_penalty, 'coefficients [L2=4]')
add_coefficients_to_table(coefficients_10_penalty, 'coefficients [L2=10]')
add_coefficients_to_table(coefficients_1e2_penalty, 'coefficients [L2=1e2]')
add_coefficients_to_table(coefficients_1e3_penalty, 'coefficients [L2=1e3]')
add_coefficients_to_table(coefficients_1e5_penalty, 'coefficients [L2=1e5]')

word,coefficients [L2=0],coefficients [L2=4],coefficients [L2=10],coefficients [L2=1e2],coefficients [L2=1e3],coefficients [L2=1e5]
(intercept),-0.0745874045386,-0.0739703019154,-0.0744325272678,-0.0609039624246,-0.00949455273584,0.00235878386375
baby,0.0914044653289,0.0912103099182,0.0913556716001,0.0872120460521,0.0665918463126,0.00192984031292
one,0.022029160162,0.0217417002678,0.0219568925235,0.0159507965338,-0.00184159157178,-0.00156175667368
great,0.80062465855,0.795865100799,0.799428111181,0.699758211124,0.373207112946,0.00880804781406
love,1.03975150459,1.03225253965,1.0378658425,0.88209210377,0.415568149057,0.00906466298034
use,0.012980586266,0.0131784070551,0.0130303425931,0.0170725614152,0.0233987111383,0.000509449689132
would,-0.290384002777,-0.289359116939,-0.290126305589,-0.268704669272,-0.189556697274,-0.00814952117861
like,-0.00915773854399,-0.00918981709201,-0.00916578711491,-0.00985497846215,-0.0102424153145,-0.000945362361057
easy,0.974789783574,0.967963138841,0.973073202649,0.831173338471,0.400522104686,0.0087949757352
little,0.527735163007,0.524703617655,0.526972975907,0.463614351574,0.254306309532,0.00602985393423


Let's take $l2\_penalty = 10$ and find the 5 most positive words (those with the largest coefficients):

In [39]:
my_keys = table.column_names()
del my_keys[0]
table_temp = table.sort(my_keys[2], ascending = False)
print my_keys
print '--------------------------------------------------------------------------------------------------------'
print my_keys[2]
print '--------------------------------------------------------------------------------------------------------'
print 'Five most important words:'
print table_temp.head(5)['word']

['coefficients [L2=0]', 'coefficients [L2=4]', 'coefficients [L2=10]', 'coefficients [L2=1e2]', 'coefficients [L2=1e3]', 'coefficients [L2=1e5]']
--------------------------------------------------------------------------------------------------------
coefficients [L2=10]
--------------------------------------------------------------------------------------------------------
Five most important words:
['loves', 'love', 'easy', 'perfect', 'great']
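The same ranking can be sketched with plain NumPy, without going through the SFrame sort. This is only an illustration: `top_k_words` is a hypothetical helper (it is not part of the notebook), and the example values are a few of the L2=10 coefficients from the table above, rounded to four decimals and without the intercept.

```python
import numpy as np

def top_k_words(words, coefficients, k=5):
    """Return the k words with the largest (most positive) coefficients."""
    coefficients = np.asarray(coefficients, dtype=float)
    # argsort is ascending, so reverse it and keep the first k indices
    order = np.argsort(coefficients)[::-1][:k]
    return [words[i] for i in order]

# Illustrative subset of the L2=10 column (rounded), intercept excluded
example_words = ['baby', 'one', 'great', 'love', 'use', 'would', 'like', 'easy', 'little']
example_coefficients = [0.0914, 0.0220, 0.7994, 1.0379, 0.0130, -0.2901, -0.0092, 0.9731, 0.5270]

top_words = top_k_words(example_words, example_coefficients, k=3)
print(top_words)
```

On this subset the ranking is `love`, `easy`, `great`, the same relative order these three words have in the full result above.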


Now we need to check which penalty gives the most accurate model. We measure the accuracy on both the train and validation sets.

In [40]:
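def get_classification_accuracy(feature_matrix, sentiment, coefficients):
    # Predicted probability of a positive review for every row
    pred = predict_probability(feature_matrix, coefficients)
    # Threshold at 0.5: +1 (positive) above it, -1 (negative) otherwise
    predictions = tc.SArray(pred).apply(lambda x: 1 if x > 0.5 else -1)
    sentiment = tc.SArray(sentiment)

    # Fraction of rows whose predicted label matches the true sentiment
    num_correct = (predictions == sentiment).sum()
    accuracy = 1.0 * num_correct / len(feature_matrix)
    return accuracy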
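To make the thresholding step concrete, here is a minimal NumPy-only sketch of the same computation on a toy example. It assumes that `predict_probability` returns the sigmoid of the dot product between features and coefficients; `sigmoid_probability`, `accuracy_numpy` and the toy data are hypothetical names and values used only for illustration.

```python
import numpy as np

def sigmoid_probability(feature_matrix, coefficients):
    """Assumed form of predict_probability: P(y = +1 | x) = 1 / (1 + exp(-x.w))."""
    scores = np.dot(feature_matrix, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

def accuracy_numpy(feature_matrix, sentiment, coefficients):
    """Fraction of rows where the thresholded prediction equals the true label."""
    probabilities = sigmoid_probability(feature_matrix, coefficients)
    # Same rule as above: +1 if the probability exceeds 0.5, otherwise -1
    predictions = np.where(probabilities > 0.5, 1, -1)
    return np.mean(predictions == np.asarray(sentiment))

# Toy data: first column is the constant (intercept) feature
toy_features = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -3.0]])
toy_sentiment = np.array([1, -1, -1, -1])
toy_coefficients = np.array([0.0, 1.0])

toy_accuracy = accuracy_numpy(toy_features, toy_sentiment, toy_coefficients)
print(toy_accuracy)
```

Here the scores are 2, -1, 0.5 and -3, so the predictions are +1, -1, +1, -1 and three of the four labels match, giving an accuracy of 0.75.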

In [41]:
sentiment_train

array([ 1,  1,  1, ..., -1, -1, -1])

Computing the accuracy of every penalty on the train and validation sets.

In [42]:
train_accuracy = {}
train_accuracy[0]   = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_0_penalty)
train_accuracy[4]   = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_4_penalty)
train_accuracy[10]  = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_10_penalty)
train_accuracy[1e2] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e2_penalty)
train_accuracy[1e3] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e3_penalty)
train_accuracy[1e5] = get_classification_accuracy(feature_matrix_train, sentiment_train, coefficients_1e5_penalty)

validation_accuracy = {}
validation_accuracy[0]   = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_0_penalty)
validation_accuracy[4]   = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_4_penalty)
validation_accuracy[10]  = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_10_penalty)
validation_accuracy[1e2] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e2_penalty)
validation_accuracy[1e3] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e3_penalty)
validation_accuracy[1e5] = get_classification_accuracy(feature_matrix_valid, sentiment_valid, coefficients_1e5_penalty)

In [43]:
for key in sorted(validation_accuracy.keys()):
    print("L2 penalty = %g" % key)
    print("train accuracy = %s, validation_accuracy = %s" % (train_accuracy[key], validation_accuracy[key]))
    print("--------------------------------------------------------------------------------")

L2 penalty = 0
train accuracy = 0.785586476433, validation_accuracy = 0.785525570862
--------------------------------------------------------------------------------
L2 penalty = 4
train accuracy = 0.785539388803, validation_accuracy = 0.785431213436
--------------------------------------------------------------------------------
L2 penalty = 10
train accuracy = 0.785633564063, validation_accuracy = 0.785619928288
--------------------------------------------------------------------------------
L2 penalty = 100
train accuracy = 0.783914865565, validation_accuracy = 0.784204566899
--------------------------------------------------------------------------------
L2 penalty = 1000
train accuracy = 0.773673306023, validation_accuracy = 0.771277599547
--------------------------------------------------------------------------------
L2 penalty = 100000
train accuracy = 0.744267081038, validation_accuracy = 0.740800150972
--------------------------------------------------------------------------------

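For this analysis, $l2\_penalty = 10$ got the best accuracy on both the train and validation sets, although its margin over the penalties 0 and 4 is below 0.02 percentage points. Accuracy only drops noticeably from $l2\_penalty = 1000$ onwards.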
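Picking the best penalty can also be done programmatically instead of reading the printout. The sketch below copies the validation accuracies from the output above into a separate dictionary (`validation_accuracy_example` is a hypothetical name, chosen so it does not overwrite the notebook's `validation_accuracy`) and selects the key with the highest value.

```python
# Validation accuracies copied from the printed results above
validation_accuracy_example = {
    0: 0.785525570862,
    4: 0.785431213436,
    10: 0.785619928288,
    1e2: 0.784204566899,
    1e3: 0.771277599547,
    1e5: 0.740800150972,
}

# Penalty whose validation accuracy is the highest
best_penalty = max(validation_accuracy_example, key=validation_accuracy_example.get)
# Gain of the best penalty over the unregularized model (L2 = 0)
margin = validation_accuracy_example[best_penalty] - validation_accuracy_example[0]

print("best L2 penalty = %g" % best_penalty)
print("gain over L2 = 0: %.6f" % margin)
```

Selecting on validation accuracy rather than train accuracy is the safer habit, since the validation rows were not used to fit the coefficients.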

Let's compare with the Turicreate logistic classifier.

In [44]:
modeltc_l2 = tc.logistic_classifier.create(train_data, target='sentiment', features = important_words, 
                                           l2_penalty = 10, max_iterations = 501)

PROGRESS: Creating a validation set from 5 percent of training data. This may take a while.
          You can set ``validation_set=None`` to disable validation tracking.



In [45]:
predictionstc_l2 = modeltc_l2.predict(validation_data, output_type = 'class')
resultstc_train_l2 = modeltc_l2.evaluate(train_data)
resultstc_val_l2 = modeltc_l2.evaluate(validation_data)
print("--------------------------------------------------------------------------------")
print('Accuracy of our code:')
print("Train accuracy = %s, validation_accuracy = %s" % (train_accuracy[10], validation_accuracy[10]))
print("--------------------------------------------------------------------------------")
print('Accuracy of Turicreate:')
print("Train accuracy = %s, validation_accuracy = %s" % (resultstc_train_l2['accuracy'], resultstc_val_l2['accuracy']))
print("--------------------------------------------------------------------------------")

--------------------------------------------------------------------------------
Accuracy of our code:
Train accuracy = 0.785633564063, validation_accuracy = 0.785619928288
--------------------------------------------------------------------------------
Accuracy of Turicreate:
Train accuracy = 0.792932146725, validation_accuracy = 0.788545008492
--------------------------------------------------------------------------------


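With similar parameters, we got close results: Turicreate is about 0.7 percentage points more accurate on the train set and about 0.3 on the validation set. Note that Turicreate set aside 5 percent of the training data for its own validation set, so the two models were not fitted on exactly the same rows. But Turicreate did it in just 6 iterations... we needed 501... )-;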