# Sentiment Analysis - Logistic Regression (Deep Learning)
In this small POC, I will show how to apply logistic regression to determine whether a given tweet has a positive or a negative sentiment.
For this, we will use a dataset included in the NLTK package that contains 10k labeled tweets: 5k marked as positive and 5k as negative. This is convenient, since the two classes are perfectly balanced and the model sees both sentiments equally often during training.

## Reading data and understanding it
Our first step is always to get familiar with the data: we need to understand its format and how to obtain the information we need to train our model.

In [1]:
# Import some libraries we always use
import pandas as pd
import numpy as np

from nltk.corpus import twitter_samples


# To know the ids of the json files, we can run the following command
#print(twitter_samples.fileids())

# Reading our files
all_negative_tweets = twitter_samples.strings('negative_tweets.json')
all_positive_tweets = twitter_samples.strings('positive_tweets.json')

# Count how many type of tweets we have in each case
print(f"Amount of positive tweets: {len(all_positive_tweets)}")
print(f"Amount of negative tweets: {len(all_negative_tweets)}")
print()

# Let's display some samples
print("Some positive tweets:")
print(all_positive_tweets[1:10])
print()
print("Some negative tweets:")
print(all_negative_tweets[1:10])


Amount of positive tweets: 5000
Amount of negative tweets: 5000

Some positive tweets:
['@Lamb2ja Hey James! How odd :/ Please call our Contact Centre on 02392441234 and we will be able to assist you :) Many thanks!', '@DespiteOfficial we had a listen last night :) As You Bleed is an amazing track. When are you in Scotland?!', '@97sides CONGRATS :)', 'yeaaaah yippppy!!!  my accnt verified rqst has succeed got a blue tick mark on my fb profile :) in 15 days', '@BhaktisBanter @PallaviRuhail This one is irresistible :)\n#FlipkartFashionFriday http://t.co/EbZ0L2VENM', "We don't like to keep our lovely customers waiting for long! We hope you enjoy! Happy Friday! - LWWF :) https://t.co/smyYriipxI", '@Impatientraider On second thought, there’s just not enough time for a DD :) But new shorts entering system. Sheep must be buying.', 'Jgh , but we have to go to Bayan :D bye', 'As an act of mischievousness, am calling the ETL layer of our in-house warehousing app Katamari.\n\nWell… as the name im

## Pre-processing and cleaning our data
This is a classic step of the NLP pipeline. Here we clean the text, remove punctuation, tokenize it into separate words, remove stopwords (the most common words in English) and reduce each word to its root form (stemming).

For tweets in particular, we can see that some messages contain URLs that we should also get rid of (since URLs don't add any value to the sentiment of a message). The same goes for quotes and tags (the previous messages contain mentions of users such as "@ketchBurning", which won't add any sentiment either).
Let's get rid of all this information and clean our text.
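As a quick illustration of the cleaning idea on its own, here is a minimal sketch that uses only the standard library. The helper name `strip_tweet_noise`, the `@user` handle and the mention regex are assumptions made for this example; the notebook's own function relies on NLTK's tokenizer to strip handles instead.

```python
import re

def strip_tweet_noise(tweet):
    """Remove the retweet prefix, URLs, @mentions and '#' signs from a tweet."""
    tweet = re.sub(r'^RT\s+', '', tweet)        # old-style retweet prefix
    tweet = re.sub(r'https?://\S+', '', tweet)  # a URL, up to the next whitespace
    tweet = re.sub(r'@\w+', '', tweet)          # user mentions (hypothetical rule)
    tweet = tweet.replace('#', '')              # keep the hashtag word, drop the sign
    return ' '.join(tweet.split())              # collapse leftover whitespace

raw = "RT @user Loving this :) #FlipkartFashionFriday http://t.co/EbZ0L2VENM"
print(strip_tweet_noise(raw))
# Loving this :) FlipkartFashionFriday
```

Note that the emoticon survives the cleaning, which matters here because emoticons carry a lot of the sentiment signal in this dataset.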

In [2]:
import re                                  # library for regular expression operations
import string                              # for string operations

from nltk.corpus import stopwords          # module for stop words that come with NLTK
from nltk.stem import PorterStemmer        # module for stemming
from nltk.tokenize import TweetTokenizer   # module for tokenizing strings

ps = PorterStemmer()
en_stopwords = stopwords.words('english')

def process_tweet(tweet):
    # Remove hashtag, retweet marks, and hyperlinks
    # Remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)

    # Remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)

    # Remove hashtags
    # Only removing the hash # sign from the word
    tweet = re.sub(r'#', '', tweet)  
    
    # Tokenize and lowercase words so ("Hello" and "hello" have the same meaning)
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,reduce_len=True)
    tweet_tokenized = tokenizer.tokenize(tweet)
    
    # Remove punctuation, stopwords and stem token
    tweet_tokenized_cleaned = []
    for token in tweet_tokenized:
        if token not in string.punctuation and token not in en_stopwords:
            tweet_tokenized_cleaned.append(ps.stem(token))
    
    return tweet_tokenized_cleaned
    

# Testing it out
sample_tweet = all_positive_tweets[0]
processed_tweet = process_tweet(sample_tweet)
print(f"Sample Tweet: {sample_tweet}")
print(f"Processed Tweet: {processed_tweet}")

Sample Tweet: #FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)
Processed Tweet: ['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']


## What can we use to train our model with?
Well, we can now process each one of our tweets. However, we need to think about what information the model will use to determine whether a given tweet has a positive or a negative sentiment.

One thing that is clear is that we want a vector representation for each one of our tweets, and we want that representation to be as compact and informative as possible. Building it is called feature extraction.
What we can do in this case is start by counting how many times each word appears in positive tweets and how many times it appears in negative ones. We will write a function that returns a dictionary with (word, label) pairs as keys and, as values, the total number of times that word appeared in tweets with that label.

As an example, we might have:
("week", 1) -> 15
("week", 0) -> 5
This means that the token "week" appeared 15 times in positive tweets (that's why we use 1 as part of the key) and 5 times in tweets labeled as negative. We will later use this dictionary to build our feature vector (that is, the vector we will use as input to our model).
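The counting idea can be sketched on already-tokenized toy data with `collections.Counter`. The function name `count_token_label_pairs` and the three toy tweets are made up for this illustration; the notebook's own implementation follows in the next cell.

```python
from collections import Counter

def count_token_label_pairs(tokenized_tweets, labels):
    """Count how often each (token, label) pair occurs."""
    freqs = Counter()
    for tokens, label in zip(tokenized_tweets, labels):
        for token in tokens:
            freqs[(token, label)] += 1
    return freqs

# Toy data (hypothetical): two positive tweets and one negative tweet
toy_tweets = [["great", "week", ":)"], ["week", "bad", ":("], ["great", "great"]]
toy_labels = [1, 0, 1]

toy_freqs = count_token_label_pairs(toy_tweets, toy_labels)
print(toy_freqs[("great", 1)])  # 3
print(toy_freqs[("week", 0)])   # 1
print(toy_freqs[("great", 0)])  # 0 (a Counter returns 0 for unseen pairs)
```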

In [3]:
# as input, this will receive all the tweets and labels (1 or 0) we use to train our model with
def frequency_dictionary(tweets,labels):  
    # Zip the tweets and the labels so we get a (tweet, label) tuple for each tweet
    tweets_labels = zip(tweets,labels)
    freq_dict = {}
    
    for (tweet,label) in tweets_labels:
        tweet_tokens = process_tweet(tweet)
        for token in tweet_tokens:
            pair = (token,label)
            if pair in freq_dict:
                freq_dict[pair] +=1
            else:
                freq_dict[pair] = 1
    
    return freq_dict
                
# Let's try this with 10 messages from each type of tweet
sample_msgs = all_positive_tweets[:10] + all_negative_tweets[:10]
sample_labels = np.append(np.ones((10,1)),np.zeros((10,1)))
sample_freq_dict = frequency_dictionary(sample_msgs,sample_labels)
print(sample_freq_dict)

{('followfriday', 1.0): 1, ('top', 1.0): 1, ('engag', 1.0): 1, ('member', 1.0): 1, ('commun', 1.0): 1, ('week', 1.0): 1, (':)', 1.0): 8, ('hey', 1.0): 1, ('jame', 1.0): 1, ('odd', 1.0): 1, (':/', 1.0): 1, ('pleas', 1.0): 1, ('call', 1.0): 2, ('contact', 1.0): 1, ('centr', 1.0): 1, ('02392441234', 1.0): 1, ('abl', 1.0): 1, ('assist', 1.0): 1, ('mani', 1.0): 1, ('thank', 1.0): 1, ('listen', 1.0): 1, ('last', 1.0): 1, ('night', 1.0): 1, ('bleed', 1.0): 1, ('amaz', 1.0): 1, ('track', 1.0): 1, ('scotland', 1.0): 1, ('congrat', 1.0): 1, ('yeaaah', 1.0): 1, ('yipppi', 1.0): 1, ('accnt', 1.0): 1, ('verifi', 1.0): 1, ('rqst', 1.0): 1, ('succeed', 1.0): 1, ('got', 1.0): 1, ('blue', 1.0): 1, ('tick', 1.0): 1, ('mark', 1.0): 1, ('fb', 1.0): 1, ('profil', 1.0): 1, ('15', 1.0): 1, ('day', 1.0): 1, ('one', 1.0): 1, ('irresist', 1.0): 1, ('flipkartfashionfriday', 1.0): 1, ('like', 1.0): 1, ('keep', 1.0): 1, ('love', 1.0): 1, ('custom', 1.0): 1, ('wait', 1.0): 1, ('long', 1.0): 1, ('hope', 1.0): 1, ('e

### Tweet representation as a vector
Since we will be using a deep learning model in this example, having a sparse matrix to represent each tweet is not the best choice. In a sparse representation, the matrix has one row per tweet and one column per word in the vocabulary (all the distinct words found across the tweets we use to train our model). Each value is a 1 or a 0, depending on whether that word is used in the given tweet or not.
The problem is that this matrix of 1's and 0's can be really big, and our model would have to learn one parameter per vocabulary word in order to predict later on whether a given tweet has a negative or positive sentiment.

Instead of this representation, we will represent each one of our tweets as a vector of 3 dimensions:
<b>[bias, sum of positive frequencies for each token in that tweet, sum of negative frequencies for each token in that tweet]</b>. 
Instead of learning one weight per vocabulary word ("v" weights), we will only have to learn 3 weights.
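To make the arithmetic concrete, here is a small worked example in plain Python. The frequency dictionary and the helper `toy_features` are hypothetical and only illustrate how the three numbers are obtained from tokens that have already been processed.

```python
# Hypothetical (token, label) counts, in the same shape as the frequency dictionary
toy_freqs = {("great", 1): 3, ("week", 1): 1, ("week", 0): 1, ("bad", 0): 1}

def toy_features(tokens, freqs):
    """Return [bias, sum of positive counts, sum of negative counts]."""
    pos = sum(freqs.get((token, 1), 0) for token in tokens)
    neg = sum(freqs.get((token, 0), 0) for token in tokens)
    return [1, pos, neg]

print(toy_features(["great", "week"], toy_freqs))  # [1, 4, 1]
print(toy_features(["alberta"], toy_freqs))        # [1, 0, 0] for an unseen word
```

For "great week", the positive sum is 3 + 1 = 4 and the negative sum is 0 + 1 = 1. A word that never appeared in training contributes nothing to either sum.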

Let's define a function that can help us extract those features and generate the row vector that we need.

In [63]:
def extract_features(tweet, freqs):
    # Row vector: [bias, positive frequency sum, negative frequency sum]
    tweet_representation = np.zeros((1,3))
    tweet_representation[0,0] = 1  # bias term

    for token in process_tweet(tweet):        
        # If the key is not found, it will return a 0, adding nothing to our sum
        tweet_representation[0,1] += freqs.get((token,1.0), 0)
        tweet_representation[0,2] += freqs.get((token,0.0), 0)
   
    return tweet_representation

In [64]:
print(sample_freq_dict)
print()
print(extract_features("I'm learning NLP :) alberta",sample_freq_dict))

{('followfriday', 1.0): 1, ('top', 1.0): 1, ('engag', 1.0): 1, ('member', 1.0): 1, ('commun', 1.0): 1, ('week', 1.0): 1, (':)', 1.0): 8, ('hey', 1.0): 1, ('jame', 1.0): 1, ('odd', 1.0): 1, (':/', 1.0): 1, ('pleas', 1.0): 1, ('call', 1.0): 2, ('contact', 1.0): 1, ('centr', 1.0): 1, ('02392441234', 1.0): 1, ('abl', 1.0): 1, ('assist', 1.0): 1, ('mani', 1.0): 1, ('thank', 1.0): 1, ('listen', 1.0): 1, ('last', 1.0): 1, ('night', 1.0): 1, ('bleed', 1.0): 1, ('amaz', 1.0): 1, ('track', 1.0): 1, ('scotland', 1.0): 1, ('congrat', 1.0): 1, ('yeaaah', 1.0): 1, ('yipppi', 1.0): 1, ('accnt', 1.0): 1, ('verifi', 1.0): 1, ('rqst', 1.0): 1, ('succeed', 1.0): 1, ('got', 1.0): 1, ('blue', 1.0): 1, ('tick', 1.0): 1, ('mark', 1.0): 1, ('fb', 1.0): 1, ('profil', 1.0): 1, ('15', 1.0): 1, ('day', 1.0): 1, ('one', 1.0): 1, ('irresist', 1.0): 1, ('flipkartfashionfriday', 1.0): 1, ('like', 1.0): 1, ('keep', 1.0): 1, ('love', 1.0): 1, ('custom', 1.0): 1, ('wait', 1.0): 1, ('long', 1.0): 1, ('hope', 1.0): 1, ('e

## Logistic Regression in practice
Logistic regression can be seen as the simplest form of a neural network (a single unit with a sigmoid activation) and is used for binary classification. For example, classifying emails as spam or non-spam is a classic use case of logistic regression. So how does it work? Simple. Logistic regression takes an input, computes a weighted sum of its features, passes the result through the sigmoid function (https://en.wikipedia.org/wiki/Sigmoid_function) and returns a probability between 0 and 1. 
In our case, that output is the probability that a given tweet is positive; a value close to 0 means the tweet is most likely negative.

Now, we know that the model will sometimes classify an input wrongly, which is bad for the algorithm. This “mistake” is measured by the loss (also called the cost). The goal of a good logistic regression model is to reduce the loss by improving the correctness of the output, and this is achieved with an optimization algorithm called gradient descent. The cost function quantifies the error between the predicted values and the expected values, so a good way to evaluate the training of the model is to check that it reaches a low cost.

In our case, the weights are the parameters that multiply our input features (as mentioned before, each tweet is represented by a vector of 3 elements, so there are 3 weights to learn), and the objective of gradient descent is to find the weights that produce the minimum cost or error.

### Logistic regression: regression and a sigmoid

Logistic regression takes a regular linear regression, and applies a sigmoid to the output of the linear regression.

Regression:
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
Note that the $\theta$ values are "weights".

Logistic regression:
$$ h(z) = \frac{1}{1+e^{-z}}$$
$$z = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_N x_N$$
We will refer to 'z' as the 'logits'.
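As a small sketch of these two formulas in plain Python: the weights and features below are made-up numbers, and `stable_sigmoid` is a hypothetical variant that branches on the sign of `z` so that `math.exp` is never called with a large positive argument. It is not the implementation used later in this notebook.

```python
import math

def stable_sigmoid(z):
    """Sigmoid that avoids overflow in math.exp for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so e is at most 1
    return e / (1.0 + e)

def predict_probability(theta, x):
    """Compute z = theta_0*x_0 + ... + theta_N*x_N, then apply the sigmoid."""
    z = sum(t * xj for t, xj in zip(theta, x))
    return stable_sigmoid(z)

theta = [0.0, 0.001, -0.001]  # made-up weights
x = [1, 4, 1]                 # [bias, positive sum, negative sum]

print(stable_sigmoid(0.0))            # 0.5
print(predict_probability(theta, x))  # slightly above 0.5, since z = 0.003
```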

### Cost function and Gradient

The cost function used for logistic regression is the average of the log loss across all training examples:

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right] \tag{5} $$
* $m$ is the number of training examples
* $y^{(i)}$ is the actual label of the i-th training example.
* $h(z(\theta)^{(i)})$ is the model's prediction for the i-th training example.

The loss function for a single training example is
$$ Loss = -1 \times \left( y^{(i)}\log (h(z(\theta)^{(i)})) + (1-y^{(i)})\log (1-h(z(\theta)^{(i)})) \right)$$

* All the $h$ values are between 0 and 1, so the logs will be negative. That is the reason for the factor of -1 applied to the sum of the two loss terms.
* Note that when the model predicts 1 ($h(z(\theta)) = 1$) and the label $y$ is also 1, the loss for that training example is 0.
* Similarly, when the model predicts 0 ($h(z(\theta)) = 0$) and the actual label is also 0, the loss for that training example is 0. 
* However, when the model prediction is close to 1 ($h(z(\theta)) = 0.9999$) and the label is 0, the second term of the log loss becomes a large negative number, which is then multiplied by the overall factor of -1 to convert it to a positive loss value: $-1 \times (1 - 0) \times \log(1 - 0.9999) \approx 9.2$. The closer the model prediction gets to 1, the larger the loss. The same happens the other way around: if the model prediction is close to 0 and the label is 1, the loss grows toward infinity.
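The numbers in the bullets above can be reproduced with a few lines of plain Python. The helper name `log_loss` and the clipping constant `eps` are assumptions added for this sketch, so that a prediction of exactly 0 or 1 does not hit `log(0)`.

```python
import math

def log_loss(y, h, eps=1e-12):
    """Log loss for one example with label y (0 or 1) and prediction h."""
    h = min(max(h, eps), 1.0 - eps)  # keep h strictly inside (0, 1)
    return -(y * math.log(h) + (1 - y) * math.log(1.0 - h))

print(round(log_loss(0, 0.9999), 2))  # 9.21, confident but wrong
print(round(log_loss(1, 0.9999), 4))  # 0.0001, confident and right
```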

### Update the weights

To update your weight vector $\theta$, you will apply gradient descent to iteratively improve your model's predictions.  
The gradient of the cost function $J$ with respect to one of the weights $\theta_j$ is:

$$\nabla_{\theta_j}J(\theta) = \frac{1}{m} \sum_{i=1}^m(h^{(i)}-y^{(i)})x_j^{(i)} \tag{6}$$
* 'i' is the index across all 'm' training examples.
* 'j' is the index of the weight $\theta_j$, so $x_j^{(i)}$ is the value of the feature associated with weight $\theta_j$ in the i-th training example.

* To update the weight $\theta_j$, we adjust it by subtracting a fraction of the gradient determined by $\alpha$:
$$\theta_j = \theta_j - \alpha \times \nabla_{\theta_j}J(\theta) $$
* The learning rate $\alpha$ is a value that we choose to control how big a single update will be.
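Here is a minimal, loop-based sketch of one update step on a toy dataset, written in plain Python so that every sum in the formulas is visible. The data, the learning rate of 0.1 and the helper names `cost` and `gradient` are all made up for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, xs, ys):
    """Average log loss over all examples."""
    total = 0.0
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(xs)

def gradient(theta, xs, ys):
    """Partial derivative of the cost with respect to each theta_j."""
    grad = [0.0] * len(theta)
    for x, y in zip(xs, ys):
        h = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
        for j, xj in enumerate(x):
            grad[j] += (h - y) * xj
    return [g / len(xs) for g in grad]

# Toy features [bias, positive sum, negative sum] and labels (hypothetical)
xs = [[1, 4, 1], [1, 0, 3], [1, 2, 2]]
ys = [1, 0, 1]
theta = [0.0, 0.0, 0.0]
alpha = 0.1

cost_before = cost(theta, xs, ys)
grad = gradient(theta, xs, ys)
theta = [t - alpha * g for t, g in zip(theta, grad)]  # one gradient descent step
cost_after = cost(theta, xs, ys)

print(cost_before, cost_after)  # the cost goes down after the update
```

With all weights at zero every prediction is 0.5, so the starting cost is $\log 2 \approx 0.693$, and a single step in the direction opposite to the gradient already lowers it.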


## Implement gradient descent function
* The number of iterations `num_iters` is the number of times that you'll use the entire training set.
* For each iteration, you'll calculate the cost function using all training examples (there are `m` training examples), and for all features.
* Instead of updating a single weight $\theta_j$ at a time, we can update all the weights in the column vector:  
$$\mathbf{\theta} = \begin{pmatrix}
\theta_0
\\
\theta_1
\\ 
\theta_2 
\\ 
\vdots
\\ 
\theta_n
\end{pmatrix}$$
* $\mathbf{\theta}$ has dimensions (n+1, 1), where 'n' is the number of features, and there is one more element for the bias term $\theta_0$ (note that the corresponding feature value $\mathbf{x_0}$ is 1).
* The 'logits', 'z', are calculated by multiplying the feature matrix 'x' with the weight vector 'theta'.  $z = \mathbf{x}\mathbf{\theta}$
    * $\mathbf{x}$ has dimensions (m, n+1) -> In our case n = 2 (the sum of positive freqs for each token in the tweet and the sum of negative freqs for each token in the tweet).
    * $\mathbf{\theta}$: has dimensions (n+1, 1)
    * $\mathbf{z}$: has dimensions (m, 1) -> This makes sense: we have one predicted value per example (we have "m" examples)
* The prediction 'h', is calculated by applying the sigmoid to each element in 'z': $h(z) = sigmoid(z)$, and has dimensions (m,1).
* The cost function $J$ is calculated by taking the dot product of the vectors 'y' and 'log(h)'.  Since both 'y' and 'h' are column vectors (m,1), transpose the vector on the left, so that matrix multiplication of a row vector with a column vector performs the dot product.
$$J = \frac{-1}{m} \times \left(\mathbf{y}^T \cdot \log(\mathbf{h}) + \mathbf{(1-y)}^T \cdot \log(\mathbf{1-h}) \right)$$
* The update of theta is also vectorized.  Because the dimensions of $\mathbf{x}$ are (m, n+1), and both $\mathbf{h}$ and $\mathbf{y}$ are (m, 1), we need to transpose the $\mathbf{x}$ and place it on the left in order to perform matrix multiplication, which then yields the (n+1, 1) answer we need:
$$\mathbf{\theta} = \mathbf{\theta} - \frac{\alpha}{m} \times \left( \mathbf{x}^T \cdot \left( \mathbf{h-y} \right) \right)$$


<b>IMPORTANT: we vectorize these calculations in order to avoid for loops (otherwise we would need to loop on each training example in order to calculate the sum needed for the cost function)</b>
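To check that the vectorized formulas match the per-example sums, the following sketch computes the cost both ways on a toy dataset. The data and weights are made-up values; only the shapes follow the description above.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data (hypothetical): m = 3 examples, n + 1 = 3 columns including the bias
x = np.array([[1.0, 4.0, 1.0], [1.0, 0.0, 3.0], [1.0, 2.0, 2.0]])  # (m, n+1)
y = np.array([[1.0], [0.0], [1.0]])                                # (m, 1)
theta = np.array([[0.0], [0.1], [-0.1]])                           # (n+1, 1)
m = x.shape[0]

# Loop version: one training example at a time
cost_loop = 0.0
for i in range(m):
    h_i = sigmoid(x[i] @ theta).item()
    cost_loop += y[i, 0] * np.log(h_i) + (1 - y[i, 0]) * np.log(1 - h_i)
cost_loop = -cost_loop / m

# Vectorized version: all examples at once
h = sigmoid(x @ theta)                                              # (m, 1)
cost_vec = (-(y.T @ np.log(h) + (1 - y).T @ np.log(1 - h)) / m).item()
grad = x.T @ (h - y) / m                                            # (n+1, 1)

print(cost_loop, cost_vec)  # both versions give the same cost
print(grad.shape)           # (3, 1), the same shape as theta
```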

In [22]:
def sigmoid(z):
    return 1/(1 + np.exp(-z))

In [47]:
def gradientDescent(x,y,theta,learning_rate,num_iters):
    '''
    Input:
        x: matrix of features which is (m,n+1)
        y: corresponding labels of the input matrix x, dimensions (m,1)
        theta: weight vector of dimension (n+1,1)
        learning_rate: learning rate (alpha in the formulas above)
        num_iters: number of iterations you want to train your model for
    Output:
        J: the final cost
        theta: your final weight vector
    '''  
    m = x.shape[0]
    
    for i in range(0,num_iters):
        # Forward propagation ------------------------------------------
        # Calculate z (the logits)
        z = np.dot(x,theta)
        
        # Calculate h(z), this is the sigmoid function
        h = sigmoid(z)
                       
        # Calculate the cost (this should go down in every iteration)
        J = (-1/m)*(np.dot(y.T,np.log(h)) + np.dot((1-y).T,np.log(1-h)))
        
        # Backward propagation -----------------------------------------
        # Update theta according to the GD formula        
        theta = theta - (learning_rate/m)*(np.dot(x.T,(h-y)))
  
    # J is a (1,1) array at this point; .item() extracts the scalar value
    J = J.item()
    return J,theta

## Training our model
Now we have everything we need. We have a method to generate the input vector that we want and we have a way of determining the optimal values for our theta (allowing us to predict).
The only thing that's missing is training our model. For this, I will split our data into training and test and proceed to fit the model and calculate the optimal values that we'll later use to predict.

In [54]:
from sklearn.model_selection import train_test_split

# Let's split our data into train and test
all_tweets = all_positive_tweets + all_negative_tweets
# Generate a matrix with our labels
labels = np.append(np.ones((len(all_positive_tweets),1)), np.zeros((len(all_negative_tweets),1)))

# Split our data into train and test data. Take 8000 as train and 2000 for test
x_train, x_test, y_train, y_test = train_test_split(all_tweets,labels,test_size=0.2)

# Generate our dictionary with the frequencies for the training data
frequencies_dict = frequency_dictionary(x_train,y_train)

# Reshape y_train into an (m,1) column vector rather than a flat array
y_train = np.reshape(y_train, (len(y_train),1))

# Extract features for all of our x_train tweets
x = np.zeros((len(x_train), 3))
for i in range(0,len(x_train)):
    x[i, :] = extract_features(x_train[i],frequencies_dict)  # fill row i only

# Hyperparameters
weights = np.zeros((3,1)) # PARAMETERS WE WANT TO LEARN
learning_rate = 1e-9
num_iterations = 5000
print("Running gradient descent....")
J, theta = gradientDescent(x,y_train,weights,learning_rate,num_iterations)

print(f"J:{J}, theta:{theta}")

Running gradient descent....
J:0.14125806473785132, theta:[[ 2.23722850e-07]
 [ 9.74921055e-04]
 [-8.94219667e-04]]


## Testing our logistic regression model
Let's create a method that can help us predict the sentiment of a given tweet. We now have the parameters we need (we obtained them in the previous step), so we can use them to predict.

In [59]:
def predict_tweet(tweet, theta, freqs):
    '''
    Input: 
        tweet: a string
        freqs: a dictionary corresponding to the frequencies of each tuple (word, label) for the training set
        theta: (3,1) vector of weights, we obtained them before
    Output: 
        y_pred: the probability that the tweet is positive, as a (1,1) array
    '''
    # Transform the tweet to the features vector
    x = extract_features(tweet,freqs)
    
    # Apply the activation function we used before
    z = np.dot(x,theta)
    y_pred = sigmoid(z)
    
    return y_pred

In [65]:
# Run this cell to test your function
for tweet in ['I am happy', 'I am bad', 'this movie should have been great.', 'great', 'great great', 'great great great', 'great great great great']:
    print( '%s -> %f' % (tweet, predict_tweet(tweet,theta,frequencies_dict).item()))

I am happy -> 0.536411
I am bad -> 0.491138
this movie should have been great. -> 0.529330
great -> 0.529353
great great -> 0.558505
great great great -> 0.587259
great great great great -> 0.615430
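The probabilities above can be turned into class predictions by thresholding, which also gives a simple accuracy measure for the held-out test split. The sketch below is plain Python; the helper name `accuracy`, the 0.5 threshold and the toy labels are assumptions for illustration, not results from this notebook.

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Four of the probabilities printed above, paired with hypothetical labels
toy_probs = [0.536411, 0.491138, 0.529330, 0.615430]
toy_labels = [1, 0, 0, 1]

print(accuracy(toy_probs, toy_labels))  # 0.75
```

With the objects defined in this notebook, the same helper could presumably be applied to the test split by collecting `predict_tweet(tweet, theta, frequencies_dict).item()` for every tweet in `x_test` and comparing against `y_test`.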
