# Sentiment Analysis Lab
The purpose of this notebook is to experiment with different ways of solving the problem of text sentiment analysis, that is, determining whether a given sentence has a positive or negative feel to it.

We'll compare the following: 
* A simple TensorFlow logistic regression model.
* An LSTM-based model.
* A custom logistic regression model implementing gradient descent.

## Table of Contents

* [2 - Framing the problem](#2)
    * [2.1 - Purpose](#2_1)
    * [2.2 - Type of problem](#2_2)
    * [2.3 - Logistic regression](#2_3)
        * [2.3.1 - Logistic regression and the sigmoid function](#2_3_1)
* [3 - The data](#3)
* [4 - Exploring the data](#4)
* [5 - Processing the data](#5)
    * [5.1 - Feature engineering](#5_1)
    * [5.2 - Input processing and clean-up](#5_2)
    * [5.3 - Data set split](#5_3)
    * [5.4 - Extracting the features](#5_4)
* [6 - Model Exploration](#6)
    * [6.1 Logistic Regression with TensorFlow](#6_1)
        * [6.1.1 Doing some tests](#6_1_1)
    * [6.2 LSTM-based Sentiment Analysis](#6_2)
        * [6.2.1 Feature engineering](#6_2_1)
        * [6.2.2 Model architecture](#6_2_2)
        * [6.2.3 LSTM v1](#6_2_3)
        * [6.2.4 LSTM v2](#6_2_4)
            * [6.2.4.1 Avoiding overfitting](#6_2_4_1)
            * [6.2.4.2 Results](#6_2_4_2)
            * [6.2.4.3 Comparing the LSTM version with the LR baseline](#6_2_4_3)
* [7 - Custom implementation of a Logistic regression model](#7)

# Framing the problem<a class="anchor" id="2"></a>

## Purpose<a class="anchor" id="2_1"></a>
Given a text sentence, in particular a tweet, we want to classify it as having a positive or negative sentiment. This will be used purely for statistical purposes. As a consequence, detecting abusive language is not a priority, as this might require a specialized data set.

## Type of problem<a class="anchor" id="2_2"></a>
This can be solved with supervised and offline machine learning algorithms. Furthermore, this is a classification problem.

## Logistic regression<a class="anchor" id="2_3"></a>
Logistic regression is a statistical model to predict the probability of an event given a set of independent variables. 
$$ \hat{y} = P(y=1 \mid x), \quad x \in \mathbb{R}^n $$

It is usually used to categorize an input and it can be binary or multinomial. Therefore, it is a natural predictive model for binary classification problems such as sentiment analysis. 

### Logistic Regression and the Sigmoid function<a class="anchor" id="2_3_1"></a>

Logistic regression takes the output of a regular linear regression and applies a sigmoid function to it.

**Regression:**
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$

**Logistic regression:**
$$ h(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
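
As a small worked example of the two formulas above, the sketch below computes $z$ and $h(z)$ for a single feature vector. It uses plain Python so that it runs without the notebook's dependencies; the weights `theta` and the features `x` are made-up values for illustration, not values learned later in this notebook.

```python
import math

def sigmoid(z):
    # h(z) = 1 / (1 + e^(-z)), maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, x):
    # z is the linear regression part: theta_0*x_0 + theta_1*x_1 + ...
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical weights and features (x_0 = 1 plays the role of the bias input)
theta = [0.0, 0.5, -0.5]
x = [1.0, 3.0, 1.0]

p = predict_proba(theta, x)  # z = 0.5*3 - 0.5*1 = 1.0
print(round(p, 4))           # probability that y = 1
```

A value of $z = 0$ gives a probability of exactly 0.5, which is why 0.5 is the natural decision threshold between the two classes.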


# The data <a class="anchor" id="3"></a>
NLTK's Twitter corpus currently contains a sample of 20k Tweets (named 'twitter_samples') retrieved from the Twitter Streaming API, together with another 10k which are divided according to sentiment into negative and positive.

# Exploring the data <a class="anchor" id="4"></a>
We can observe a few important characteristics of the input text. 
- Twitter handles contain specific characters and in most cases do not affect the sentiment of a tweet. We can safely remove them.
- Hashtags are preceded by the pound symbol and can contain meaningful information. We should keep them.
- Tweets contain informal language, often lengthening words to express emotion or shortening them for brevity. This makes stemming a good tool for this problem.
- Tweets contain emojis written as combinations of punctuation symbols, which should be kept as they carry a lot of meaning.


In [None]:
import nltk
from os import getcwd
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples 
from utils import process_tweet, build_freqs

nltk.download('twitter_samples')
nltk.download('stopwords')

In [None]:
# select the set of positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_positive_count = len(all_positive_tweets)
all_negative_count = len(all_negative_tweets)

In [None]:
all_positive_tweets[:10]

<a class="anchor" id="5"></a>
# Processing the data

<a class="anchor" id="5_1"></a>
## Feature engineering
The raw input is text, which we could tokenize and convert to the index numbers of each word in the corpus vocabulary. This would probably work, as the model could potentially learn that certain words appear whenever the label is positive or negative.

However, we can make things easier for the model by condensing a certain amount of information into the feature vectors beforehand. One of the most common approaches is to count, for each word in a given tweet, how often it appears in positive and in negative tweets, since these frequencies have a clear impact on the sentiment of the tweet.

<a class="anchor" id="5_2"></a>
## Input processing and clean-up
For the task of sentiment analysis we will perform the following operations. Notice that we are not converting the words into numbers. This is because of our decision to create a feature vector based on positive and negative frequency counts.

- Remove stock market tickers like $GE
- Remove old style retweet text "RT"
- Remove hyperlinks    
- Remove hashtag symbols, keeping the names
- Remove stopwords
- Remove punctuation, keeping emojis
- Stem the words
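
The first few steps of this list can be sketched with regular expressions. This is only an illustration under stated assumptions: `clean_tweet` is a hypothetical helper, not the `process_tweet` function imported from `utils`, and it leaves out stopword removal, punctuation filtering and stemming, which rely on NLTK. It also removes Twitter handles, as discussed in the data exploration section.

```python
import re

def clean_tweet(tweet):
    """Hypothetical, simplified clean-up; not the notebook's process_tweet."""
    tweet = re.sub(r'\$\w*', '', tweet)         # stock market tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)      # old style retweet text "RT"
    tweet = re.sub(r'https?://\S+', '', tweet)  # hyperlinks
    tweet = re.sub(r'@\w+', '', tweet)          # Twitter handles
    tweet = re.sub(r'#', '', tweet)             # hash sign only, the name is kept
    # Lower-case and split on whitespace; emojis such as ":)" survive as tokens
    return tweet.lower().split()

print(clean_tweet("RT @user I love $GE #happy https://t.co/abc :)"))
```

Splitting on whitespace keeps punctuation-based emojis intact as single tokens, which is the property the list above asks for.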

## Data set split <a class="anchor" id="5_3"></a>
We will use 80% of the samples for training, the rest for validation.

In [None]:
# split the data into two pieces, one for training and one for testing (validation set) 
TRAIN_SPLIT = 0.8
train_pos_count = int(all_positive_count * TRAIN_SPLIT)
train_neg_count = int(all_negative_count * TRAIN_SPLIT)

test_pos = all_positive_tweets[train_pos_count:]
train_pos = all_positive_tweets[:train_pos_count]
test_neg = all_negative_tweets[train_neg_count:]
train_neg = all_negative_tweets[:train_neg_count]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
freqs = build_freqs(train_x, train_y)


## Extracting the features <a class="anchor" id="5_4"></a>

For this we need two steps:
- Count the number of times a word appears in a tweet labelled as positive and also as negative.
- Represent each tweet as a feature vector with two of its components being the positive and negative counts of all its words.

The transformed inputs that the logistic regression models work with are `train_X`, `train_y`, `test_X` and `test_y`.
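
To make the two steps concrete, here is a minimal sketch on a toy corpus. The helper names `build_freqs_sketch` and `features_sketch` are hypothetical; the real `build_freqs` lives in `utils` and is not shown in this notebook, so we only assume that it produces a dictionary keyed by `(word, label)`. The tweets are assumed to be already tokenized.

```python
from collections import defaultdict

def build_freqs_sketch(tokenized_tweets, labels):
    # Step 1: count how often each word appears under each label
    freqs = defaultdict(int)
    for tokens, label in zip(tokenized_tweets, labels):
        for word in tokens:
            freqs[(word, label)] += 1
    return dict(freqs)

def features_sketch(tokens, freqs):
    # Step 2: sum the positive and negative counts of all words in the tweet
    pos = sum(freqs.get((word, 1), 0) for word in tokens)
    neg = sum(freqs.get((word, 0), 0) for word in tokens)
    return [1, pos, neg]  # [bias, pos_count, neg_count]

toy_tweets = [['happi', 'day'], ['sad', 'day'], ['happi', 'happi']]
toy_labels = [1, 0, 1]

toy_freqs = build_freqs_sketch(toy_tweets, toy_labels)
print(features_sketch(['happi', 'day'], toy_freqs))  # [1, 4, 1]
```

In the toy corpus "happi" appears three times in positive tweets and "day" once in each class, so the tweet "happi day" gets a positive count of 4 and a negative count of 1.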

In [None]:
def extract_features(tweet, freqs, process_tweet=process_tweet):
    '''
    Input: 
        tweet: a string containing one tweet
        freqs: a dictionary with key: (word, label), value: count of word with label
    Output: 
        feat_vector: a feature vector of dimension (1,3): [bias, pos_count, neg_count]
    '''
    word_l = process_tweet(tweet)
    feat_vector = np.zeros(3) 
    feat_vector[0] = 1 # always set the bias to 1
    
    for word in word_l:        
        feat_vector[1] += freqs.get((word, 1), 0) # inc. count for positive label
        feat_vector[2] += freqs.get((word, 0), 0) # inc. count for negative label
    
    feat_vector = feat_vector[None, :]  # adding batch dimension for further processing

    return feat_vector

In [None]:
train_X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    train_X[i, :]= extract_features(train_x[i], freqs)

test_X = np.zeros((len(test_x), 3))
for i in range(len(test_x)):
    test_X[i, :]= extract_features(test_x[i], freqs)

# Model exploration <a class="anchor" id="6"></a>

## Logistic Regression with TensorFlow <a class="anchor" id="6_1"></a>

We start with a very simple model with one dense layer made up of one neuron.

We can observe good accuracy results in both the training and validation sets, with no signs of overfitting. The data set is very small and not representative of a realistic problem setting, but for the purposes of comparing different implementations we won't make further changes.

In [None]:
import tensorflow as tf
from tensorflow import keras

model = tf.keras.models.Sequential([
    # one sigmoid unit over the 3 features [bias, pos_count, neg_count]
    tf.keras.layers.Dense(1, input_shape=(train_X.shape[1],), activation='sigmoid')
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.003), 
              loss='binary_crossentropy',
              metrics=['accuracy'])
history = model.fit(np.array(train_X),
                    np.array(train_y),
                    batch_size=1024,
                    epochs=50,
                    validation_data=(np.array(test_X), np.array(test_y)))

In [None]:
from matplotlib import pyplot

# == DRAWING THE ACCURACY == 
# plot both curves over the same epochs so that they line up
pyplot.plot(history.history['accuracy'], label='train acc')
pyplot.plot(history.history['val_accuracy'], label='val acc')
pyplot.xlabel('Epoch')
pyplot.ylabel('Accuracy')
pyplot.legend(loc='lower right')


<img src="./images/lr-tf-acc.PNG" alt="LR TensorFlow accuracy" />

In [None]:
# == DRAWING THE LOSS == 
from matplotlib import pyplot 

pyplot.plot(history.history['loss'], label='train loss') 
pyplot.plot(history.history['val_loss'], label='val loss') 
pyplot.xlabel('Epoch')
pyplot.ylabel('Loss')
pyplot.legend()
pyplot.show()

<img src="./images/lr-tf-loss.PNG" alt="LR TensorFlow loss" />

In [None]:
model.evaluate(np.array(test_X), np.array(test_y))

### Doing some tests <a class="anchor" id="6_1_1"></a>

In [None]:
for tweet in ['she is very anoying and selfish', 'good job', 'I was happy after our first talk', 'I was undecided after our first talk']:
    feats = extract_features(tweet, freqs)
    # predict returns an array of shape (1, 1); take the single probability
    print( '%s -> %f' % (tweet, model.predict(np.array(feats))[0, 0]))    


## LSTM-based Sentiment Analysis <a class="anchor" id="6_2"></a>

### Feature engineering <a class="anchor" id="6_2_1"></a>
In order to solve the problem with a recurrent neural network (RNN), we need to rethink how we input data into the model. RNNs expect a sequence of values (also called time steps) and at each step the previous state and the new value are used to update the internal state.
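
The state update described above can be illustrated with a deliberately simplified recurrence. This sketch uses a plain RNN cell with scalar weights rather than an LSTM, and the weights `w_x`, `w_h` and `b` are made-up values; it only shows how each time step combines the new value with the previous state.

```python
import math

def run_rnn(sequence, w_x=1.0, w_h=0.5, b=0.0):
    # h starts at zero and is updated once per time step:
    # h_t = tanh(w_x * x_t + w_h * h_{t-1} + b)
    h = 0.0
    for x_t in sequence:
        h = math.tanh(w_x * x_t + w_h * h + b)
    return h

# The same values in a different order lead to a different final state,
# which is what lets a recurrent model make use of word order.
print(run_rnn([1.0, 0.0]))
print(run_rnn([0.0, 1.0]))
```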

Therefore, a feature based on frequency counts doesn't seem natural for this particular implementation. What we will do instead is tokenize each tweet, representing each word by its index in the corpus vocabulary.

Additionally, the network is trained on batches of fixed-length sequences. Since tweets vary in length, we need to pad them with zeros to guarantee that they all have the same length. We will choose post-padding, where the zeros are appended at the end of the sequence.
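
The sketch below imitates in plain Python what tokenizing and post-padding do, so the transformation is easy to inspect. The helpers `build_vocab` and `encode_and_pad` are hypothetical and are not the Keras API; we assume index 0 is reserved for padding and index 1 for out-of-vocabulary words, and that sequences longer than `max_length` lose their first tokens, which is the default truncation in Keras' `pad_sequences`.

```python
from collections import Counter

def build_vocab(texts, vocab_size):
    # Rank words by frequency; index 0 is padding, index 1 is <OOV>
    counts = Counter(word for text in texts for word in text.split())
    vocab = {'<OOV>': 1}
    for word, _ in counts.most_common(vocab_size - 2):
        vocab[word] = len(vocab) + 1
    return vocab

def encode_and_pad(text, vocab, max_length):
    # Replace each word by its index, unknown words by the <OOV> index
    seq = [vocab.get(word, vocab['<OOV>']) for word in text.split()]
    seq = seq[-max_length:]                       # keep the last max_length tokens
    return seq + [0] * (max_length - len(seq))    # post-padding with zeros

vocab = build_vocab(['good good good day day', 'bad'], vocab_size=4)
print(encode_and_pad('good bad day', vocab, max_length=5))  # [2, 1, 3, 0, 0]
```

With a vocabulary size of 4 only the two most frequent words get their own index, so "bad" is encoded as the out-of-vocabulary index 1.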


In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

def process_inputs(train_x, test_x, max_length=10, vocab_size=1000):
    # Clean each tweet with process_tweet and join the tokens back into one string
    train_x_processed = [' '.join(process_tweet(tweet)) for tweet in train_x]
    test_x_processed = [' '.join(process_tweet(tweet)) for tweet in test_x]

    # Fit the tokenizer on the training tweets only; unseen words map to <OOV>
    tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
    tokenizer.fit_on_texts(train_x_processed)

    # Convert to index sequences and pad with trailing zeros up to max_length
    train_sequences = tokenizer.texts_to_sequences(train_x_processed)
    train_padded_sequences = pad_sequences(train_sequences, padding='post', maxlen=max_length)
    test_sequences = tokenizer.texts_to_sequences(test_x_processed)
    test_padded_sequences = pad_sequences(test_sequences, padding='post', maxlen=max_length)
    
    return train_sequences, train_padded_sequences, test_sequences, test_padded_sequences

### Model architecture <a class="anchor" id="6_2_2"></a>
The model stacks an embedding layer, a single LSTM layer, a dense layer with ReLU activation and a sigmoid output unit. The dimensions of the embedding, LSTM and dense layers are exposed as parameters so that we can vary the complexity of the model between versions.

In [None]:
# == BUILD THE MODEL ==

def lstm_model_1(vocab_size=1000, embedding_dim=8, lstm_dim=8, dense_dim=12, max_length=10):
    # Model Definition with LSTM
    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
        keras.layers.LSTM(lstm_dim),
        keras.layers.Dense(dense_dim, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    # Set the training parameters
    model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy'])

    # Print the model summary
    model.summary()  
    
    return model



In [None]:
import matplotlib.pyplot as plt

# Plot Utility
def plot_graphs(history, string):
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

### LSTM v1 <a class="anchor" id="6_2_3"></a>
We will try an initial configuration where we use close to the full vocabulary to train the network. In addition to that, we will select a certain amount of complexity for the dimensions of the LSTM, embedding and dense layers.

#### Results <a class="anchor" id="6_2_3_1"></a>
The accuracy achieved is 0.67, and the charts do not show the loss and accuracy settling into a stable curve. Since overfitting is a common problem for LSTM networks, we'll try to apply countermeasures in the next version.

In [None]:
max_length = 50
vocab_size = 9000
train_sequences, train_padded_sequences, test_sequences, test_padded_sequences = process_inputs(train_x, test_x, max_length=max_length, vocab_size=vocab_size)
model_1 = lstm_model_1(vocab_size=vocab_size, embedding_dim=16, lstm_dim=64, dense_dim=32, max_length=max_length)
history_lstm = model_1.fit(train_padded_sequences, train_y, epochs=10, validation_data=(test_padded_sequences, test_y))

In [None]:
# Plot the accuracy and loss history
plot_graphs(history_lstm, 'accuracy')
plot_graphs(history_lstm, 'loss')

# Repeat the model evaluation to check that it matches the graphs
model_1.evaluate(test_padded_sequences, test_y, batch_size=128)

<img src="./images/lstm-v1.PNG" alt="LSTM-based architecture" />

### LSTM v2 <a class="anchor" id="6_2_4"></a>

#### Avoiding overfitting <a class="anchor" id="6_2_4_1"></a>
- Reduce the vocabulary size from 9000 to 1000. This effectively reduces the complexity of the model, which is a typical measure against overfitting. It can cause a performance decrease though, as the model won't be able to learn from the words that are now unknown.
- Define a shorter max_length to prevent the model from trying to learn from a long tail of zeros in each sentence.
- Reduce the complexity of the model (number of neurons inside the LSTM, number of neurons in the dense layer, dimensionality of the embedding layer).
- Add dropout layers. Excluding certain neurons from some of the training steps helps prevent the model from overfitting the data set. Most of the improvement could be achieved by modifying the hyperparameters alone. However, the dropout layer increased the accuracy further while delaying the rise in the loss, which in this context can be read as a measure of the confidence in the result.
- From the charts, we can see that val_loss, although much closer to a stable horizontal line, still increases with the number of epochs. Therefore, in this case, training for longer would not be beneficial.
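
To illustrate what a dropout layer does to its inputs, here is a minimal plain Python sketch of inverted dropout, the variant in which the surviving values are scaled by `1 / (1 - rate)` during training and nothing changes at inference time. The function `dropout` and its arguments are hypothetical and only meant to show the idea, not to reproduce the Keras implementation.

```python
import random

def dropout(values, rate, training, rng):
    # At inference time the layer passes its inputs through unchanged
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # During training each value is zeroed with probability `rate`;
    # survivors are scaled up so the expected sum stays the same
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)
activations = [1.0] * 1000
dropped = dropout(activations, rate=0.2, training=True, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
print(f"fraction of zeroed activations: {zero_fraction:.2f}")
```

Because a different random subset is zeroed at every training step, the network cannot rely on any single activation, which is the regularizing effect described in the list above.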

#### Results <a class="anchor" id="6_2_4_2"></a>
- The accuracy on the validation set increased to 0.75.
- The loss and accuracy curves in the charts also look more typical for an LSTM.

#### Comparing the LSTM version with the LR baseline <a class="anchor" id="6_2_4_3"></a>
Although we were able to decrease the overfitting and improve the performance on the validation set, we are still far away from the baseline model's results. A few considerations are laid out below:
- The size of the training set is 8000 samples, which is definitely small and makes the model prone to overfitting. Increasing it should be an action moving forward.
- Given the simplicity of the problem and the amount of information encoded in the features based on frequency counts, it seems natural that a sequence model approach could yield worse results.

In [None]:
def lstm_model_2(vocab_size=1000, embedding_dim=8, lstm_dim=8, dense_dim=12, max_length=10):
    # Model Definition with LSTM
    model = keras.Sequential([
        keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
        keras.layers.Dropout(0.2),
        keras.layers.LSTM(lstm_dim),
        keras.layers.Dense(dense_dim, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid')
    ])

    # Set the training parameters
    model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy'])

    # Print the model summary
    model.summary()  
    
    return model

In [None]:
max_length = 10
vocab_size = 1000
train_sequences, train_padded_sequences, test_sequences, test_padded_sequences = process_inputs(train_x, test_x, max_length=max_length, vocab_size=vocab_size)
model_2 = lstm_model_2(vocab_size=vocab_size, embedding_dim=8, lstm_dim=8, dense_dim=12, max_length=max_length)
history_lstm = model_2.fit(train_padded_sequences, train_y, epochs=10, validation_data=(test_padded_sequences, test_y))

In [None]:
# Plot the accuracy and loss history
plot_graphs(history_lstm, 'accuracy')
plot_graphs(history_lstm, 'loss')

# Repeat the model evaluation to check that it matches the graphs
model_2.evaluate(test_padded_sequences, test_y, batch_size=128)

<img src="./images/lstm-v2.PNG" alt="LSTM v2 accuracy and loss" />

In [None]:
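# TRY WITH YOUR OWN TWEETS
my_tweet = ""
p_my_tweet = ' '.join(process_tweet(my_tweet))
# process_inputs does not return its tokenizer, so refit one with the same settings
tokenizer = Tokenizer(num_words=vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts([' '.join(process_tweet(t)) for t in train_x])
seqs = tokenizer.texts_to_sequences([p_my_tweet])
padded_seqs = pad_sequences(seqs, padding='post', maxlen=max_length)
r = model_2.predict(padded_seqs)
result = r.flatten()[0] > 0.5
if result:
    print("Positive")
else:
    print("Negative")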

# Custom implementation of a Logistic regression model <a class="anchor" id="7"></a>

## Sigmoid function

In [None]:
def sigmoid(z): 
    '''
    Input:
        z: is the input (can be a scalar or an array)
    Output:
        h: the sigmoid of z
    '''
    
    return 1 / (1 + np.exp(-1 * np.array(z)))

<a name='1-2'></a>
## Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] \tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of training example 'i'.
* $h(z^{(i)})$ is the model's prediction for the training example 'i'.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$
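
A quick numeric check helps build intuition for this cost. The sketch below evaluates the average log loss in plain Python on made-up labels and predictions; `log_loss` is a hypothetical helper written for this illustration.

```python
import math

def log_loss(y_true, y_prob):
    # Average of -[y*log(h) + (1-y)*log(1-h)] over all examples
    m = len(y_true)
    total = sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, y_prob))
    return -total / m

labels = [1, 0]
print(log_loss(labels, [0.5, 0.5]))  # undecided model, equals log(2)
print(log_loss(labels, [0.9, 0.1]))  # confident and correct, much lower
print(log_loss(labels, [0.1, 0.9]))  # confident and wrong, much higher
```

An undecided model that always predicts 0.5 has a cost of $\log 2 \approx 0.693$, while confident wrong predictions are penalized far more heavily than confident correct ones are rewarded.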

### Update the weights

To update the weight vector $\theta$, we apply gradient descent to iteratively improve the model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x^{(i)}_j \tag{6}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x^{(i)}_j$ is the feature associated with weight $\theta_j$

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
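
The update rule can be traced by hand on a tiny example. The sketch below performs a single gradient descent step in plain Python; the data, the starting weights and the learning rate are made-up values, and `gradient_step` is a hypothetical helper rather than the vectorized implementation used in this notebook.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_step(theta, xs, ys, alpha):
    m = len(xs)
    # h^(i): prediction for every training example with the current weights
    hs = [sigmoid(sum(t * xj for t, xj in zip(theta, x))) for x in xs]
    new_theta = []
    for j, theta_j in enumerate(theta):
        # gradient for theta_j: (1/m) * sum_i (h^(i) - y^(i)) * x_j^(i)
        grad_j = sum((h - y) * x[j] for h, y, x in zip(hs, ys, xs)) / m
        new_theta.append(theta_j - alpha * grad_j)
    return new_theta

xs = [[1.0, 2.0], [1.0, -2.0]]  # first column is the bias input
ys = [1, 0]
theta = gradient_step([0.0, 0.0], xs, ys, alpha=0.1)
print(theta)  # the bias weight stays at 0, the second weight moves to 0.1
```

With zero weights both predictions are 0.5, so the gradient for the bias cancels out while the gradient for the second weight is -1, and the step moves that weight in the positive direction.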


### Implementing gradientDescent
Cost function:
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot \log(\mathbf{1-h}) \right)$$

Updating the weights:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$

In [None]:
def gradientDescent(x, y, theta, alpha, num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        alpha: learning rate
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    Hint: you might want to print the cost to make sure that it is going down.
    '''

    # Number of rows in matrix x
    m = x.shape[0]
    
    for i in range(0, num_iters):        
        z = np.dot(x, theta) # linear scores x·theta, shape (m,1)
        h = sigmoid(z)
        J = float(-1 / m) * (np.dot(np.transpose(y), np.log(h)) + np.dot(np.transpose(1 - y), np.log(1 - h)))
        theta = theta - alpha/m * np.dot(np.transpose(x), h-y)

    J = float(np.squeeze(J)) # J is a (1,1) array; return it as a plain float
    return J, theta

## Training the model

To train the model:
* Stack the features for all training examples into a matrix X (this is `train_X`, built earlier).
* Call `gradientDescent`, implemented above, starting from weights initialized to zero.

The cell below uses a learning rate of 1e-9 and 1500 iterations.

In [None]:
J, theta = gradientDescent(train_X, train_y, np.zeros((3, 1)), 1e-9, 1500)

## Testing the custom Logistic Regression version

In [None]:
def predict_tweet(tweet, freqs, theta):
    '''
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label)
        theta: (3,1) vector of weights
    Output: 
        y_pred: the probability of the tweet being positive
    '''
    x = extract_features(tweet, freqs)
    z = np.dot(x, theta)
    y_pred = sigmoid(z)
    
    return y_pred

In [None]:
# Try the prediction function on a few sample tweets
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    # predict_tweet returns an array of shape (1, 1); item() extracts the scalar
    print( '%s -> %f' % (tweet, predict_tweet(tweet, freqs, theta).item()))    
    

## Testing the model

In [None]:
def test_logistic_regression(test_x, test_y, freqs, theta, predict_tweet=predict_tweet):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """
    
    y_hat = []
    m = len(test_x)
    
    for tweet in test_x:
        y_pred = predict_tweet(tweet, freqs, theta)
        
        if y_pred > 0.5:
            y_hat.append(1.0)
        else:
            y_hat.append(0.0)

    y_hat_array = np.array(y_hat)
    t_y_array = np.reshape(test_y, m)    
    accuracy = 1/m * np.sum(y_hat_array == t_y_array)

    return accuracy

In [None]:
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, theta)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

In [None]:
# Feel free to change the tweet below
my_tweet = 'This is a ridiculously bright movie. The plot was terrible and I was sad until the ending!'
print(process_tweet(my_tweet))
y_hat = predict_tweet(my_tweet, freqs, theta)
print(y_hat)
if y_hat > 0.5:
    print('Positive sentiment')
else: 
    print('Negative sentiment')