### Predicting IMDB Sentiment purely from words in review 

**Data**:  The data comes from IMDB and is neatly stored in a keras dataset.  It provides 25_000 training reviews and 25_000 test reviews, each with a label for whether the review is positive or negative.

**Hypothesis**: We can predict the sentiment of a review using only the 1_000 most frequent words across all reviews, and whether or not each of those words appears in the review.

**Definitions**
Review:  A review is the words of the review, represented as a sequence of integers.  Each integer is the word's frequency rank in the dataset as a whole.  Smaller numbers mean more frequent words.

**Parameters/features/x_values**: The variables we will use to predict the outcome.

**Supervised learning**: We will have the 'right answers' to each review's sentiment.  Y values will be either 0 or 1, indicating that the review was actually negative or positive respectively.  These discrete output values make this a classification problem, which is a good fit for logistic regression.

As features we will use whether or not a word is included in a review.


In [95]:
import numpy as np
from keras.datasets import imdb

TOP_N_WORDS = 1_000
# To speed up model, we will only analyze the most frequent TOP_N_WORDS words in all reviews.  

(x_train, y_train), (x_test, y_test) = imdb.load_data(path="imdb.npz",
                                                      num_words=TOP_N_WORDS,
                                                      )

# x_train holds the training reviews and y_train the corresponding sentiment labels.
# We fit the model by using x_train to predict y_train.
# x_test and y_test are the held-out test data.  We can use them to test our model at the end.


In [81]:
# As an example, the first training review has len(x_train[0]) words and the label y_train[0].
print('The first training review has {0} words and is a {1} review.  \n 0 for negative, 1 for positive'.format(len(x_train[0]), y_train[0]))

The first training review has 218 words and is a 1 review.  
 0 for negative, 1 for positive


In [87]:
# Vectorizing each review.
# For simplicity we sacrifice word order, word counts within individual reviews, and word pairs.
# Each review becomes a vector of length TOP_N_WORDS + 1 whose entries are 1 if the word with that
# index appears in the review and 0 otherwise.  These entries are our features.

# + 1 is added for the y-intercept parameter, stored at index 0 and always set to 1.
# This simplifies vector multiplication later on.

def word_vector(review):
    result = np.zeros(TOP_N_WORDS + 1)
    for word_index in review:
        result[word_index] = 1
    result[0] = 1
    return result
    
def vectorize_data(x_is):
    return np.array([word_vector(review) for review in x_is])

new_x_train = vectorize_data(x_train)
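
To make the vectorization concrete, here is a minimal sketch on a made-up review. The tiny vocabulary size `TOY_TOP_N_WORDS` and the review `toy_review` are assumptions for illustration only and are not taken from the IMDB data.

```python
import numpy as np

TOY_TOP_N_WORDS = 5  # hypothetical tiny vocabulary, for illustration only

def toy_word_vector(review, top_n=TOY_TOP_N_WORDS):
    result = np.zeros(top_n + 1)
    for word_index in review:
        result[word_index] = 1  # mark the word as present; repeats change nothing
    result[0] = 1               # intercept term, always on
    return result

toy_review = [1, 4, 4, 2]       # word 4 appears twice, word 3 never
toy_vector = toy_word_vector(toy_review)
print(toy_vector)               # [1. 1. 1. 0. 1. 0.]
```

Note how the repeated word 4 still produces a single 1: the vector records presence, not counts.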

In [89]:
thetas = np.zeros(TOP_N_WORDS + 1)  

# Initialize the theta for each parameter at 0.  A theta is the predictive weight of its parameter
# and tells us how strongly the model thinks that parameter points to a positive or negative review.

# For example, if the 100th most frequent word in the dataset is the word 'good', then
# theta_100 is the predictive weight for the word 'good'.  If it turns out to be higher than the
# thetas of other words, the presence of 'good' in a review makes the model more likely to call
# the review positive.
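
As a sketch of how trained thetas could be read, the snippet below ranks words by weight. The word list `toy_words` and the values in `toy_thetas` are invented for illustration; they are not results from this model.

```python
import numpy as np

# Hypothetical vocabulary and weights; index 0 is the intercept, not a word.
toy_words = ['<intercept>', 'good', 'bad', 'movie']
toy_thetas = np.array([0.1, 1.2, -1.5, 0.0])

# Sort word indices (skipping the intercept) from most positive to most negative weight.
order = np.argsort(toy_thetas[1:])[::-1] + 1
ranked = [(toy_words[j], float(toy_thetas[j])) for j in order]
print(ranked)
```

Large positive weights push a prediction toward 1 (positive), large negative weights push it toward 0, and weights near 0 barely matter.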


In [91]:
# We will feed the parameters for each review through the sigmoid function, a squasher function, to
# calculate the predicted sentiment for each review based on our thetas.  Outputs will be between 0 and 1 for each review.

def sigma(z):
    return 1 / (1 + np.exp(-z))  # 0 < output < 1

def calc_predictions(x_is, thetas):
    return sigma(x_is.dot(thetas))

predictions = calc_predictions(new_x_train, thetas)
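
A quick sketch of how the squasher behaves on a few hand-picked inputs (the values in `zs` are arbitrary):

```python
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

zs = np.array([-5.0, 0.0, 5.0])
outputs = sigma(zs)
print(outputs)  # small value, exactly 0.5, value close to 1
```

An input of 0 maps to exactly 0.5, and the function is symmetric: `sigma(-z)` equals `1 - sigma(z)`. This is why all-zero thetas produce a prediction of 0.5 for every review.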

In [93]:
def accuracy(predictions, ys):
    # A prediction above 0.5 counts as positive; compare with the 0/1 labels and average the matches.
    return np.mean((predictions > 0.5) == ys)

acc = accuracy(predictions, y_train)
print(acc)

0.5


Our first guess at the thetas gave an accuracy of 50%.  That seems appropriate so far, because we guessed 0 for all thetas.  In other words, we guessed that none of the TOP_N_WORDS appearing in an individual review has any predictive power.
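
The 50% figure can be reproduced on a toy example. In this sketch the data `toy_x` and `toy_y` are made up; the point is only that zero thetas give a prediction of 0.5 everywhere, which is never above the 0.5 threshold, so every review is called negative and the accuracy equals the share of negative labels.

```python
import numpy as np

# Made-up data: 4 reviews, intercept column plus 2 word features.
toy_x = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 0.0, 0.0]])
toy_y = np.array([0, 1, 1, 0])  # half negative, half positive

zero_thetas = np.zeros(toy_x.shape[1])
toy_predictions = 1 / (1 + np.exp(-toy_x.dot(zero_thetas)))

predicted_positive = toy_predictions > 0.5  # all False, since 0.5 is not > 0.5
toy_accuracy = np.mean(predicted_positive == toy_y)
print(toy_predictions, toy_accuracy)
```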

Now we can adjust the thetas up or down by subtracting the derivative of the cost function, scaled by a learning rate...

### Here is the function we will use to calculate the error for the model.

\begin{equation*}
CE = \frac{1}{m} * \sum_{i=1}^m [ -y^i * log(h_{ \theta}(x^i)) - (1 - y^i) * log(1 - h_{ \theta}(x^i) )]
\end{equation*}

Here h_theta(x) is the prediction: theta transpose x run through the squasher function.  The index i refers to the ith review, and m is the number of reviews.

This error function calculates what is known as the cross entropy loss.  Taking the negative log of the probability the model assigned to the true label is a good way to evaluate model effectiveness.  In the worst case scenario, where a real value is actually a 1 [positive] and our model tells us with 99.9999% certainty that the prediction is 0, the error becomes very large, and it approaches infinity as that certainty approaches 100%.  This severely punishes the model for confident wrong guesses.
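
To see the punishment in numbers, here is a small sketch that evaluates the loss of a single review whose true label is 1, for a few hypothetical predictions (the probabilities in `candidate_predictions` are chosen only for illustration):

```python
import numpy as np

def per_review_loss(prediction, y):
    # Cross entropy for one review: only the term matching the true label is non-zero.
    return -(y * np.log(prediction) + (1 - y) * np.log(1 - prediction))

candidate_predictions = [0.9, 0.5, 0.1, 1e-6]
losses = [per_review_loss(p, 1) for p in candidate_predictions]

for p, loss in zip(candidate_predictions, losses):
    print('prediction {0:g} -> loss {1:0.4f}'.format(p, loss))
```

The loss is small for a confident correct guess, equals log(2) for an undecided 0.5, and keeps growing as the prediction approaches 0.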

Later on, when we run the model, we will update the thetas in batches.  At the core of each update, we subtract the first derivative of our CE function with respect to theta_j, where j indicates the jth parameter.


In [85]:
def cross_entropy(predictions, ys):
    positive_y_loss = -np.sum(ys * np.log(predictions))
    negative_y_loss = -np.sum((1 - ys) * np.log(1 - predictions))
    return (positive_y_loss + negative_y_loss) / ys.shape[0] # returns the average CE error
    

 **Want**: to minimize CE.  
 
 We can do this using the first derivative of CE, defined below.
 
 \begin{equation*}
 \frac{\partial}{\partial \theta_{j}}CE = \frac{1}{m} * \sum_{i=1}^m[ (h_{ \theta}(x^i) - y^i) * x^i_{j} ]
 \end{equation*}
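
One way to gain confidence in this derivative is a numerical gradient check. The sketch below compares the formula, averaged over the m reviews as the notebook's code does, with a finite-difference estimate of the slope of CE. The random toy data, the seed, and the helper names `analytic_gradient` and `numeric_gradient` are assumptions made for this illustration.

```python
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_at(thetas, xs, ys):
    predictions = sigma(xs.dot(thetas))
    return -np.mean(ys * np.log(predictions) + (1 - ys) * np.log(1 - predictions))

def analytic_gradient(thetas, xs, ys):
    # (1/m) * sum_i (h(x^i) - y^i) * x^i_j, for all j at once
    return xs.T.dot(sigma(xs.dot(thetas)) - ys) / ys.shape[0]

def numeric_gradient(thetas, xs, ys, eps=1e-6):
    # Central differences: nudge one theta at a time and measure the change in CE.
    grad = np.zeros_like(thetas)
    for j in range(thetas.shape[0]):
        step = np.zeros_like(thetas)
        step[j] = eps
        grad[j] = (cross_entropy_at(thetas + step, xs, ys)
                   - cross_entropy_at(thetas - step, xs, ys)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
toy_x = np.hstack([np.ones((8, 1)), rng.integers(0, 2, size=(8, 3)).astype(float)])
toy_y = rng.integers(0, 2, size=8).astype(float)
toy_thetas = rng.normal(size=4)

max_gap = np.max(np.abs(analytic_gradient(toy_thetas, toy_x, toy_y)
                        - numeric_gradient(toy_thetas, toy_x, toy_y)))
print(max_gap)  # expected to be tiny if the formula is right
```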

At the heart of gradient descent, we run updates across all thetas, subtracting from each theta the derivative with respect to that theta.  The derivative is multiplied by a stepping rate (the learning rate, alpha) to fine-tune the steps.

\begin{equation*}
\theta_{j} = \theta_{j} - \alpha * \frac{\partial}{\partial \theta_{j}}CE = \theta_{j} - \alpha * \frac{1}{m} * \sum_{i=1}^m[ (h_{ \theta}(x^i) - y^i) * x^i_{j} ]
\end{equation*}
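
The update rule can be tried on a tiny problem before running it on the reviews. This is a minimal sketch with made-up data (`toy_x`, `toy_y`) and an arbitrary step count; it uses the gradient averaged over the m examples and only illustrates that repeated updates lower CE.

```python
import numpy as np

def sigma(z):
    return 1 / (1 + np.exp(-z))

def cross_entropy_at(thetas, xs, ys):
    predictions = sigma(xs.dot(thetas))
    return -np.mean(ys * np.log(predictions) + (1 - ys) * np.log(1 - predictions))

# Made-up data: the second word feature lines up with the label.
toy_x = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [1.0, 1.0, 1.0],
                  [1.0, 0.0, 0.0]])
toy_y = np.array([0.0, 1.0, 1.0, 0.0])

alpha = 0.1                      # learning rate
toy_thetas = np.zeros(toy_x.shape[1])
loss_history = []

for step in range(50):           # arbitrary number of steps for the demo
    loss_history.append(cross_entropy_at(toy_thetas, toy_x, toy_y))
    gradient = toy_x.T.dot(sigma(toy_x.dot(toy_thetas)) - toy_y) / toy_y.shape[0]
    toy_thetas = toy_thetas - alpha * gradient

print(loss_history[0], loss_history[-1])
print(toy_thetas)
```

The theta of the feature that matches the label grows positive, and the loss falls from its starting value of log(2).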


In [99]:
NUM_EPOCHS = 10
LEARNING_RATE = 0.1
BATCH_SIZE = 128

# gradient is the derivative of our CE error function with respect to every theta_j at once.
# Entry j is the average over the batch of (prediction - y) * x_j.

def gradient(x_is, ys, thetas):
    predictions = calc_predictions(x_is, thetas)
    return x_is.T.dot(predictions - ys) / ys.shape[0]

def update_thetas(x_is, ys, thetas):
    # One gradient descent step: move each theta against its derivative, scaled by LEARNING_RATE.
    return thetas - LEARNING_RATE * gradient(x_is, ys, thetas)

new_x_test = vectorize_data(x_test)


for epoch in range(0, NUM_EPOCHS):
    # By updating our thetas every BATCH_SIZE reviews, we can arrive at an answer with lower CE and
    # higher acc more efficiently.
    for batch_start in range(0, new_x_train.shape[0], BATCH_SIZE):
        x_is = new_x_train[batch_start:(batch_start + BATCH_SIZE), :]
        ys = y_train[batch_start:(batch_start + BATCH_SIZE)]
        thetas = update_thetas(x_is, ys, thetas)
        
    predictions = calc_predictions(new_x_train, thetas)
    acc = accuracy(predictions, y_train)
    ce = cross_entropy(predictions, y_train)

    test_predictions = calc_predictions(new_x_test, thetas) # Try model with test data.
    test_acc = accuracy(test_predictions, y_test)
    
    print(
        'Epoch #{3} | Error: {0:0.2f} | acc: {1:0.2f} | test acc: {2:0.2f}'.format(ce, acc, test_acc, epoch)
    )

Epoch #0 | Error: 0.31 | acc: 0.87 | test acc: 0.86
Epoch #1 | Error: 0.31 | acc: 0.87 | test acc: 0.86
Epoch #2 | Error: 0.31 | acc: 0.87 | test acc: 0.86
Epoch #3 | Error: 0.31 | acc: 0.87 | test acc: 0.86
Epoch #4 | Error: 0.31 | acc: 0.87 | test acc: 0.86
Epoch #5 | Error: 0.31 | acc: 0.87 | test acc: 0.86
Epoch #6 | Error: 0.30 | acc: 0.87 | test acc: 0.86
Epoch #7 | Error: 0.30 | acc: 0.87 | test acc: 0.86
Epoch #8 | Error: 0.30 | acc: 0.87 | test acc: 0.86
Epoch #9 | Error: 0.30 | acc: 0.88 | test acc: 0.86
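
As a small check on the batching arithmetic in the training loop, the sketch below counts the batches in one epoch, assuming 25_000 training reviews and the batch size of 128 used above. Slicing past the end of a numpy array simply truncates, so the final batch is smaller than the rest.

```python
NUM_REVIEWS = 25_000  # size of the training split
BATCH_SIZE = 128

batch_starts = list(range(0, NUM_REVIEWS, BATCH_SIZE))
batch_sizes = [min(BATCH_SIZE, NUM_REVIEWS - start) for start in batch_starts]

print(len(batch_starts))  # 196 theta updates per epoch
print(batch_sizes[-1])    # 40 reviews in the last, partial batch
```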


Amazingly, we can achieve 86% test accuracy in predicting sentiment solely from the existence of the top 1_000 words in each review!