# Implementing logistic regression from scratch

## Import Modules

In [1]:
import string
import numpy as np
import pandas as pd
from logistic_classifier_func import logistic_regression

## Load Data

In [2]:
# Load Data
products = pd.read_csv('amazon_baby_subset.csv')
# Fill N/A
products = products.fillna({'review':''})  # fill in N/A's in the review column
# Reomove Punctuation for Text Cleaning
products['review'] = products['review'].astype('str')
products['review_clean'] = products['review'].apply(lambda x: ''.join([i for i in x if i not in string.punctuation]))
# Add Sentiment Column
products['sentiment'] = products['rating'].apply(lambda rating: 1 if rating >= 4 else -1)

Let's explore some data:

In [3]:
products['name'].head(10)

0    Stop Pacifier Sucking without tears with Thumb...
1      Nature's Lullabies Second Year Sticker Calendar
2      Nature's Lullabies Second Year Sticker Calendar
3                          Lamaze Peekaboo, I Love You
4    SoftPlay Peek-A-Boo Where's Elmo A Children's ...
5                            Our Baby Girl Memory Book
6    Hunnt&reg; Falling Flowers and Birds Kids Nurs...
7    Blessed By Pope Benedict XVI Divine Mercy Full...
8    Cloth Diaper Pins Stainless Steel Traditional ...
9    Cloth Diaper Pins Stainless Steel Traditional ...
Name: name, dtype: object

Let us quickly explore more of this dataset. The 'name' column indicates the name of the product. Here we list the first 10 products in the dataset. We then count the number of positive and negative reviews.

In [4]:
print("\nCounting Positive & Negative Reviews")
num_pos = sum(products['sentiment'] == 1)
num_neg = sum(products['sentiment'] == -1)
print("Number of Positive Reviews: ", num_pos)
print("Number of Negative Reviews: ", num_neg)


Counting Positive & Negative Reviews
Number of Positive Reviews:  26579
Number of Negative Reviews:  26493


## Apply text cleaning on the review data

In [5]:
# Import Important words
important_words = pd.read_json('important_words.json')
print("Important Words: ", important_words)

# Counting Important words in `review_clean` column
for word in important_words[0].values.tolist():
    products[word] = products['review_clean'].apply(lambda s : s.split().count(word))

Important Words:                0
0          baby
1           one
2         great
3          love
4           use
5         would
6          like
7          easy
8        little
9          seat
10          old
11         well
12          get
13         also
14       really
15          son
16         time
17       bought
18      product
19         good
20     daughter
21         much
22        loves
23     stroller
24          put
25       months
26          car
27        still
28         back
29         used
..          ...
163     started
164    anything
165        last
166     company
167        come
168    returned
169       maybe
170        took
171       broke
172       makes
173        stay
174     instead
175        idea
176        head
177        said
178        less
179        went
180     working
181        high
182        unit
183       seems
184     picture
185  completely
186        wish
187      buying
188      babies
189         won
190         tub
191      almost
192   

The products now contains one column for each of the 193 important_words. As an example, the column perfect contains a count of the number of times the word perfect occurs in each of the reviews

In [6]:
products['contains_perfect'] = products['perfect'].apply(lambda x: 1 if x >= 1 else 0)
print("Number of reviews contain 'perfect': ", products['contains_perfect'].sum())

Number of reviews contain 'perfect':  2955


## Convert to NumPy array

In [7]:
products['intercept'] = 1
feature_set = ['intercept'] + important_words[0].tolist()
# Coverrt to ndarray
feature_matrix = products[feature_set].as_matrix()
sentiment = products['sentiment'].as_matrix()

In [8]:
# Shape of feature matrix
print("Size of feature matrix: ", feature_matrix.shape)

Size of feature matrix:  (53072, 194)


## Estimating conditional probability with link function

Link function is given by:

$$
P(y_i = +1 | \mathbf{x}_i,\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))},
$$

where the feature vector $h(\mathbf{x}_i)$ represents the word counts of important_words in the review  $\mathbf{x}_i$. Complete the following function that implements the link function:

In [9]:
'''
produces probablistic estimate for P(y_i = +1 | x_i, w).
estimate ranges between 0 and 1.
'''

def predict_probability(feature_matrix, coefficients):
    # Take dot product of feature_matrix and coefficients  
    # YOUR CODE HERE
    scores = np.dot(feature_matrix, coefficients)
    
    # Compute P(y_i = +1 | x_i, w) using the link function
    # YOUR CODE HERE
    predictions = 1. / (1 + np.exp(-scores))
    
    # return predictions
    return predictions

Aside. How the link function works with matrix algebra
Since the word counts are stored as columns in feature_matrix, each $i$-th row of the matrix corresponds to the feature vector $h(\mathbf{x}_i)$: $$
[\text{feature_matrix}] =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right] =
\left[
\begin{array}{cccc}
h_0(\mathbf{x}_1) & h_1(\mathbf{x}_1) & \cdots & h_D(\mathbf{x}_1) \\
h_0(\mathbf{x}_2) & h_1(\mathbf{x}_2) & \cdots & h_D(\mathbf{x}_2) \\
\vdots & \vdots & \ddots & \vdots \\
h_0(\mathbf{x}_N) & h_1(\mathbf{x}_N) & \cdots & h_D(\mathbf{x}_N)
\end{array}
\right]
$$
By the rules of matrix multiplication, the score vector containing elements $\mathbf{w}^T h(\mathbf{x}_i)$ is obtained by multiplying feature_matrix and the coefficient vector $\mathbf{w}$. $$
[\text{score}] =
[\text{feature_matrix}]\mathbf{w} =
\left[
\begin{array}{c}
h(\mathbf{x}_1)^T \\
h(\mathbf{x}_2)^T \\
\vdots \\
h(\mathbf{x}_N)^T
\end{array}
\right]
\mathbf{w}
= \left[
\begin{array}{c}
h(\mathbf{x}_1)^T\mathbf{w} \\
h(\mathbf{x}_2)^T\mathbf{w} \\
\vdots \\
h(\mathbf{x}_N)^T\mathbf{w}
\end{array}
\right]
= \left[
\begin{array}{c}
\mathbf{w}^T h(\mathbf{x}_1) \\
\mathbf{w}^T h(\mathbf{x}_2) \\
\vdots \\
\mathbf{w}^T h(\mathbf{x}_N)
\end{array}
\right]
$$

## Compute derivative of log likelihood with respect to a single coefficient

Recall from lecture: $$
\frac{\partial\ell}{\partial w_j} = \sum_{i=1}^N h_j(\mathbf{x}_i)\left(\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})\right)
$$
We will now write a function that computes the derivative of log likelihood with respect to a single coefficient $w_j$. The function accepts two arguments:
errors vector containing $\mathbf{1}[y_i = +1] - P(y_i = +1 | \mathbf{x}_i, \mathbf{w})$ for all $i$.
feature vector containing $h_j(\mathbf{x}_i)$ for all $i$.
Complete the following code block:

In [10]:
def feature_derivative(errors, feature):     
    # Compute the dot product of errors and feature
    derivative = np.dot(errors, feature)
    
    # Return the derivative
    return derivative

In the main lecture, our focus was on the likelihood. In the advanced optional video, however, we introduced a transformation of this likelihood---called the log likelihood---that simplifies the derivation of the gradient and is more numerically stable. Due to its numerical stability, we will use the log likelihood instead of the likelihood to assess the algorithm.
The log likelihood is computed using the following formula (see the advanced optional video if you are curious about the derivation of this equation):
$$\ell\ell(\mathbf{w}) = \sum_{i=1}^N \Big( (\mathbf{1}[y_i = +1] - 1)\mathbf{w}^T h(\mathbf{x}_i) - \ln\left(1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))\right) \Big) $$
We provide a function to compute the log likelihood for the entire dataset.

In [11]:
def compute_log_likelihood(feature_matrix, sentiment, coefficients):
    indicator = (sentiment==+1)
    scores = np.dot(feature_matrix, coefficients)
    lp = np.sum((indicator-1)*scores - np.log(1. + np.exp(-scores)))
    return lp

## Taking gradient steps

Now we are ready to implement our own logistic regression. All we have to do is to write a gradient ascent function that takes gradient steps towards the optimum.
Complete the following function to solve the logistic regression model using gradient ascent:

In [12]:
def logistic_regression(feature_matrix, sentiment, initial_coefficients, step_size, max_iter):
    coefficients = np.array(initial_coefficients) # make sure it's a numpy array
    for itr in range(max_iter):
        # Predict P(y_i = +1|x_1,w) using your predict_probability() function
        # YOUR CODE HERE
        predictions = predict_probability(feature_matrix, initial_coefficients)

        # Compute indicator value for (y_i = +1)
        indicator = (sentiment==+1)

        # Compute the errors as indicator - predictions
        errors = indicator - predictions

        for j in range(len(coefficients)): # loop over each coefficient
            # Recall that feature_matrix[:,j] is the feature column associated with coefficients[j]
            # compute the derivative for coefficients[j]. Save it in a variable called derivative
            # YOUR CODE HERE
            derivative = feature_derivative(errors, feature_matrix[:,j])

            # add the step size times the derivative to the current coefficient
            # YOUR CODE HERE
            coefficients[j] += step_size * derivative

        # Checking whether log likelihood is increasing
        if itr <= 15 or (itr <= 100 and itr % 10 == 0) or (itr <= 1000 and itr % 100 == 0) \
        or (itr <= 10000 and itr % 1000 == 0) or itr % 10000 == 0:
            lp = compute_log_likelihood(feature_matrix, sentiment, coefficients)
            print('iteration %*d: log likelihood of observed labels = %.8f' % \
                (int(np.ceil(np.log10(max_iter))), itr, lp))
    return coefficients

Now, let us run the logistic regression solver.

In [13]:
coefficients = logistic_regression(feature_matrix, sentiment, initial_coefficients=np.zeros(194),
                                   step_size=1e-7, max_iter=301)

iteration   0: log likelihood of observed labels = -36780.91768478
iteration   1: log likelihood of observed labels = -36775.13127954
iteration   2: log likelihood of observed labels = -36769.34795095
iteration   3: log likelihood of observed labels = -36763.56769899
iteration   4: log likelihood of observed labels = -36757.79052366
iteration   5: log likelihood of observed labels = -36752.01642492
iteration   6: log likelihood of observed labels = -36746.24540276
iteration   7: log likelihood of observed labels = -36740.47745714
iteration   8: log likelihood of observed labels = -36734.71258803
iteration   9: log likelihood of observed labels = -36728.95079539
iteration  10: log likelihood of observed labels = -36723.19207918
iteration  11: log likelihood of observed labels = -36717.43643934
iteration  12: log likelihood of observed labels = -36711.68387583
iteration  13: log likelihood of observed labels = -36705.93438858
iteration  14: log likelihood of observed labels = -36700.1879

## Predicting sentiments

Recall from lecture that class predictions for a data point $\mathbf{x}$ can be computed from the coefficients $\mathbf{w}$ using the following formula: $$
\hat{y}_i = 
\left\{
\begin{array}{ll}
      +1 & \mathbf{x}_i^T\mathbf{w} >; 0 \\
      -1 & \mathbf{x}_i^T\mathbf{w} \leq 0 \\
\end{array} 
\right.
$$
Now, we will write some code to compute class predictions. We will do this in two steps:
* Step 1: First compute the scores using feature_matrix and coefficients using a dot product.
* Step 2: Using the formula above, compute the class predictions from the scores.

Step 1 can be implemented as follows:

In [14]:
scores = np.dot(feature_matrix, coefficients)

Now, complete the following code block for Step 2 to compute the class predictions using the scores obtained above:

In [15]:
class_predictions = pd.DataFrame(scores)[0].apply(lambda x: 1 if x > 0 else -1)
print("Predicted Positive Sentiment: ", (class_predictions.values > 0).sum())

Predicted Positive Sentiment:  21348


## Measuring accuracy

We will now measure the classification accuracy of the model. Recall from the lecture that the classification accuracy can be computed as follows:

$$
\mbox{accuracy} = \frac{\mbox{# correctly classified data points}}{\mbox{# total data points}}
$$

Complete the following code block to compute the accuracy of the model.

In [16]:
# Measuring accuracy
num_mistakes = (class_predictions != sentiment).sum() # YOUR CODE HERE
num_correct = len(sentiment) - num_mistakes
accuracy = num_correct/len(sentiment) # YOUR CODE HERE
print("-----------------------------------------------------")
print('# Reviews   correctly classified =', num_correct)
print('# Reviews incorrectly classified =', num_mistakes)
print('# Reviews total                  =', len(products))
print("-----------------------------------------------------")
print('Accuracy = %.2f' % accuracy)

-----------------------------------------------------
# Reviews   correctly classified = 39325
# Reviews incorrectly classified = 13747
# Reviews total                  = 53072
-----------------------------------------------------
Accuracy = 0.74


 ## Which words contribute most to positive & negative sentiments?

* Treat each coefficient as a tuple, i.e. (word, coefficient_value).
* Sort all the (word, coefficient_value) tuples by coefficient_value in descending order.

In [17]:
coefficients = list(coefficients[1:]) # exclude intercept
word_coefficient_tuples = [(word, coefficient) for word, coefficient in zip(important_words[0].values, coefficients)]
word_coefficient_tuples = sorted(word_coefficient_tuples, key=lambda x:x[1], reverse=True)

### Most Positive Words:

Now, we compute the 10 words that have the most positive coefficient values. These words are associated with positive sentiment.

In [18]:
word_coefficient_tuples[:10]

[('great', 0.069275149999999647),
 ('love', 0.069004250000000072),
 ('easy', 0.06748420000000005),
 ('little', 0.046790450000000261),
 ('loves', 0.046414200000000044),
 ('well', 0.030355849999999917),
 ('perfect', 0.030355849999999917),
 ('old', 0.020272350000000126),
 ('nice', 0.018481399999999922),
 ('soft', 0.017954649999999985)]

### Most Negative Words:

Next, we repeat this exercise on the 10 most negative words. That is, we compute the 10 words that have the most negative coefficient values. These words are associated with negative sentiment.

In [19]:
word_coefficient_tuples[-10:]

[('return', -0.027977950000000137),
 ('monitor', -0.028324099999999935),
 ('disappointed', -0.030054849999999821),
 ('back', -0.031589949999999895),
 ('even', -0.03347119999999984),
 ('get', -0.03396785000000007),
 ('work', -0.036195249999999971),
 ('money', -0.04141760000000029),
 ('product', -0.047452650000000124),
 ('would', -0.063811999999999619)]