# Machine Learned Sentiment Analysis using Python

“Can our project correctly classify the sentiment of just about any
sentence in the English language?”

Our goal is to build and train a model that classifies more than one
dataset with over 70% accuracy on each of them. To get there, we may
have to tune the model so that it does not overfit to a single dataset
and generalizes better. Because we are still beginners in machine
learning, our first priority is getting a working model running.


## Binary classification using binary logistic regression

For the first part of this notebook we will build a binary classification
model based on logistic regression, as we learned in class.

**Run the code cell below** to import the required packages.

To start with, we will import modules from the nltk (Natural Language
Toolkit) package, which will help us with the preprocessing and training
of our model. We will also need some standard libraries, such as regular
expressions, to filter out unnecessary data.

In [1]:
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer
from nltk.corpus import twitter_samples
import sklearn.model_selection   # for train and test splits
import numpy as np

import nltk
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!


True

### Preprocessing the data
**Stop-word removal**: In English, words like a, an, the, as, in, on,
etc. are considered stop-words. We remove them to reduce the vocabulary
size, since they contribute little to the meaning or sentiment of a
sentence. To do this, we import the stopwords list from nltk. Before
that, we strip retweet markers, hashtag symbols, and URLs from each
tweet and tokenize the cleaned string into a list of words.

As a final step, we perform stemming: **stemming** refers to the process of
removing suffixes and reducing a word to some base form such that all
different variants of that word can be represented by the same form
(e.g., “walk” and “walking” are both reduced to “walk”).

In [2]:
#Preprocessing tweets
def process_tweet(tweet):
    #Remove old style retweet text "RT"
    cleaned_tweet = re.sub(r'^RT[\s]','', tweet)

    #Remove URLS
    cleaned_tweet = re.sub(r'https?:\/\/.*[\r\n]*','', cleaned_tweet)

    #Remove hashtags
    cleaned_tweet = re.sub(r'#','',cleaned_tweet)

    #Instantiate tokenizer class
    tokenizer = TweetTokenizer(preserve_case=False,    strip_handles=True, reduce_len=True)

    #Tokenize tweets
    tweet_tokens = tokenizer.tokenize(cleaned_tweet)

    #Import the english stop words list from nltk
    stopwords_english = stopwords.words('english')

    #Creating a list of words without stopwords
    tweets_clean = []
    for word in tweet_tokens:
        if word not in stopwords_english and word not in string.punctuation:
            tweets_clean.append(word)

    #Instantiate stemming class
    stemmer = PorterStemmer()

    #Creating a list of stems of words in tweet
    tweets_stem = []
    for word in tweets_clean:
        stem_word = stemmer.stem(word)
        tweets_stem.append(stem_word)

    return tweets_stem
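
To see what the three `re.sub` calls in `process_tweet` do on their own, here is a minimal sketch that applies the same patterns to two made-up tweets. The `clean` helper and the tweet texts are illustrative and not part of the notebook. It also shows a side effect worth knowing: the URL pattern ends in `.*`, so it removes everything from the URL to the end of the line.

```python
import re

def clean(tweet):
    # Hypothetical helper: the same three substitutions as process_tweet,
    # without tokenizing, stop-word removal, or stemming.
    tweet = re.sub(r'^RT[\s]', '', tweet)               # leading "RT "
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)  # URL and the rest of the line
    tweet = re.sub(r'#', '', tweet)                     # only the "#" symbol
    return tweet

print(repr(clean("RT @user: Loving #NLTK! https://example.com/x")))
# '@user: Loving NLTK! '
print(repr(clean("See https://example.com/x for the #docs")))
# 'See '
```

If the words after a URL matter, a narrower pattern such as `https?://\S+` would stop at the first whitespace. That is an alternative to consider, not what the notebook uses.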

### Building the Frequency Dictionary

Here we define a function that takes the tweets and their labels as
parameters. It goes through every tweet, preprocesses it with the
function we just defined, and counts how often each (word, label) pair
occurs in the data set, building a frequency dictionary.

In [3]:
#Frequency generating function
def build_freqs(tweets, ys):
    yslist = np.squeeze(ys).tolist() #squeeze is needed or the list
                                     # will end up with one element

    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1

    return freqs
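
As a small illustration of the dictionary this produces, the sketch below runs the same counting loop on token lists that are already processed, so it needs neither nltk nor numpy. The function name `build_freqs_from_tokens` and the toy tokens are assumptions made for this example.

```python
def build_freqs_from_tokens(token_lists, labels):
    # Hypothetical variant of build_freqs that takes already-processed
    # token lists instead of raw tweets.
    freqs = {}
    for y, tokens in zip(labels, token_lists):
        for word in tokens:
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

token_lists = [["happi", "day"], ["sad", "day"], ["happi", "happi"]]
labels = [1.0, 0.0, 1.0]   # floats, like the labels in y_train
toy_freqs = build_freqs_from_tokens(token_lists, labels)

print(toy_freqs[("happi", 1.0)])       # 3
print(toy_freqs.get(("day", 0), 0))    # 1: the int key 0 matches the float key 0.0
print(toy_freqs.get(("sad", 1), 0))    # 0: "sad" never appears with label 1
```

The second lookup works because Python treats `0` and `0.0` as equal with equal hashes, so keys stored with float labels can be read back with integer labels.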

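Next, let's create a function that extracts features from a tweet using
the ‘freqs’ dictionary and the process_tweet function defined earlier.
Each tweet becomes a vector of three numbers: a bias term, the summed
positive-label counts of its words, and the summed negative-label counts.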

In [4]:
def extract_features(tweet, freqs):
    # process_tweet tokenizes, stems, and removes stopwords
    word_l = process_tweet(tweet)

    # 3 elements in the form of a 1 x 3 vector
    x = np.zeros((1, 3))

    #bias term is set to 1
    x[0,0] = 1

    # loop through each word in the list of words
    for word in word_l:

        # increment the word count for the positive label 1
        x[0,1] += freqs.get((word,1),0)

        # increment the word count for the negative label 0
        x[0,2] += freqs.get((word,0),0)

    assert(x.shape == (1, 3))
    return x
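
To make the three features concrete, here is a minimal list-based sketch of the same computation on a hand-written frequency dictionary. The helper name `extract_features_from_tokens` and the toy counts are assumptions for illustration; the notebook's version returns a 1 x 3 numpy array instead of a list.

```python
def extract_features_from_tokens(tokens, freqs):
    # Hypothetical list-based variant of extract_features: returns
    # [bias, sum of positive-label counts, sum of negative-label counts].
    pos = sum(freqs.get((word, 1), 0) for word in tokens)
    neg = sum(freqs.get((word, 0), 0) for word in tokens)
    return [1, pos, neg]

toy_freqs = {("happi", 1): 3, ("day", 1): 1, ("day", 0): 1, ("sad", 0): 1}

print(extract_features_from_tokens(["happi", "day"], toy_freqs))       # [1, 4, 1]
print(extract_features_from_tokens(["sad", "day", "day"], toy_freqs))  # [1, 2, 3]
print(extract_features_from_tokens(["unseen"], toy_freqs))             # [1, 0, 0]
```

A word that occurs twice in a tweet contributes its counts twice, and a word missing from the dictionary contributes nothing.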

Next, we import the sample tweets from nltk and split the data into a
training set and a test set.

In [12]:
#import dataset from nltk
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

#create labels
one_array = [1] * len(all_positive_tweets)
zero_array = [0] * len(all_negative_tweets)

all_tweets = all_positive_tweets + all_negative_tweets
all_labels = one_array + zero_array
tweets_and_labels = np.vstack((all_tweets,all_labels)).T

#introduce randomization to the training and test sets
np.random.shuffle(tweets_and_labels)

#extract X (data) and y (labels) columns
X = tweets_and_labels[:,0]
y = tweets_and_labels[:,1].T

#split the data set into train and test sets with a 80/20 ratio
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.20)
X_train = X_train.tolist()
X_test = X_test.tolist()
y_train = np.array([y_train]).T.astype(float)
y_test = np.array([y_test]).T.astype(float)

With our required functions for processing tweets ready to go, we can begin
to build our logistic regression model.

The logistic regression maps the input $\mathbf{x}_i$ into the following
output:

$p(y_i = 1 \mid \mathbf{x}_i, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x}_i) =  \sigma(w_0 + w_1 x_1 + w_2 x_2)$.

$\sigma$ is the sigmoid function, which is defined as:

$\sigma(z) = \frac{1}{1 + e^{-z}} = (1+e^{-z})^{-1}$

The output of the sigmoid function is a value between 0 and 1. Let us
define the sigmoid function to be used in the model.

In [6]:
def sigmoid(z):
    """Returns the element-wise logistic sigmoid of z."""
    return 1 / (1 + np.exp(-z))
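
The properties of $\sigma$ can be checked numerically. The sketch below uses only the standard library and a hypothetical `stable_sigmoid` helper, which switches to an algebraically equivalent form for negative inputs so that `exp` is never called on a large positive number. It is an optional variant for scalar inputs, not a replacement for the notebook's element-wise `sigmoid`.

```python
import math

def stable_sigmoid(z):
    # For z >= 0, exp(-z) is at most 1; for z < 0, exp(z) is below 1,
    # so neither branch can overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

print(stable_sigmoid(0.0))        # 0.5
print(stable_sigmoid(-1000.0))    # 0.0 (exp underflows; no OverflowError)
# symmetry: sigma(z) + sigma(-z) = 1
print(abs(stable_sigmoid(2.0) + stable_sigmoid(-2.0) - 1.0) < 1e-12)   # True
```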

### Cost Function and Gradient Descent

Our goal is to find a configuration of our parameters $\mathbf{w}$ that
minimizes our objective function. For logistic regression, we use
the binary cross-entropy (BCE) loss. Recall from Lecture 1 the basic
logistic regression training objective (learning objective); written as
a loss to minimize, it is the negative log-likelihood:

$$
\ell_\text{LR}(\mathbf{w}) = -\sum_{i=1}^N \left[ y_i \ln \sigma(\mathbf{w}^T \mathbf{x}_i) + (1-y_i) \ln \left(1-\sigma(\mathbf{w}^T \mathbf{x}_i)\right) \right]
$$

Our implementation divides this sum by the number of examples, which
rescales the loss but does not move its minimum.

The "basic" gradient for the above training objective is on a slide
titled "Maximum likelihood estimate for LR" from Lecture 1, and
reproduced here:

$$
\nabla \ell_\text{LR}(\mathbf{w}) = \sum_{i=1}^N (\sigma(\mathbf{w}^T \mathbf{x}_i) - y_i)\mathbf{x}_i
$$
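
One way to gain confidence in this gradient is to compare it with finite differences. The sketch below assumes the objective is the summed binary cross-entropy (the negative log-likelihood) and uses made-up data and weights; the function names `bce` and `bce_grad` are hypothetical and chosen for this example only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(w, xs, ys):
    # Summed binary cross-entropy over all examples.
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def bce_grad(w, xs, ys):
    # Analytic gradient: sum over examples of (sigma(w^T x) - y) * x.
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj
    return grad

# Made-up inputs: a bias column followed by two features.
xs = [[1.0, 2.0, 0.5], [1.0, 0.0, 1.5], [1.0, 1.0, 1.0]]
ys = [1, 0, 1]
w = [0.1, -0.2, 0.3]

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = []
for j in range(len(w)):
    w_plus = list(w)
    w_plus[j] += eps
    w_minus = list(w)
    w_minus[j] -= eps
    numeric.append((bce(w_plus, xs, ys) - bce(w_minus, xs, ys)) / (2 * eps))

analytic = bce_grad(w, xs, ys)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_diff < 1e-6)   # True
```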

Let's define a function that implements these operations.

In [7]:
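def gradient_descent(x, y, w, lr, num_iters):
    """Batch gradient descent on the mean binary cross-entropy.

    x: (m, n) feature matrix, y: (m, 1) labels, w: (n, 1) initial weights,
    lr: learning rate, num_iters: number of update steps (at least 1).
    Returns the cost from the last iteration and the final weights.
    """
    m = len(x)

    for i in range(0, num_iters):

        # predicted probabilities for the current weights
        z = np.dot(x,w)
        h = sigmoid(z)

        # calculate the cost function (mean binary cross-entropy)
        J = (-1/m)*(np.dot(y.T,np.log(h)) + np.dot((1-y).T,np.log(1-h)))

        # update the weights by gradient descent
        w = w - (lr/m)*np.dot(x.T, h-y)

    # J is a (1, 1) array; .item() extracts the scalar
    J = J.item()
    return J, w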
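
To watch the update rule at work without the tweet data, the sketch below runs the same averaged update on four made-up points using only the standard library. The data, the learning rate of 0.1, and the helper `mean_bce` are assumptions chosen for this toy problem and say nothing about good settings for the real features.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mean_bce(w, xs, ys):
    # Mean binary cross-entropy over the data set.
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)

# Made-up data: a bias column plus one feature; positive feature, label 1.
xs = [[1.0, 2.0], [1.0, -1.0], [1.0, 1.5], [1.0, -2.0]]
ys = [1, 0, 1, 0]

w = [0.0, 0.0]
lr = 0.1
costs = [mean_bce(w, xs, ys)]
for _ in range(100):
    # Averaged gradient: (1/m) * sum of (sigma(w^T x) - y) * x
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        for j, xj in enumerate(x):
            grad[j] += (p - y) * xj / len(xs)
    w = [wj - lr * gj for wj, gj in zip(w, grad)]
    costs.append(mean_bce(w, xs, ys))

print(f"cost before: {costs[0]:.4f}, after: {costs[-1]:.4f}")
```

With zero initial weights every prediction is 0.5, so the starting cost is $\ln 2 \approx 0.6931$; each step on this toy problem lowers it.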

Now that all the required functions are ready, we can finally train our
model using the training set.

In [8]:
freqs = build_freqs(X_train, y_train)

# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(X_train), 3))
for i in range(len(X_train)):
    X[i, :]= extract_features(X_train[i], freqs)

# training labels corresponding to X
Y = y_train

# Apply gradient descent to extract the weight vector w
J, w = gradient_descent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(w)]}")

The cost after training is 0.23869960.
The resulting vector of weights is [7e-08, 0.00052556, -0.00055654]


The cell above prints the final cost J and the weight vector w obtained
after training. This weight vector defines our logistic regression
classifier.

Let's proceed to write two more functions. The first takes a tweet and
predicts its sentiment using the freqs dictionary and the weight vector.
The second uses the first to compute the accuracy of the model on the
given test set.

In [9]:
def predict_tweet(tweet, freqs, theta):
    """
    Input:
        tweet: a string
        freqs: a dictionary of the frequencies of each tuple (word, label)
        theta: vector of weights
    Output:
        y_pred: the probability that the tweet is positive
    """

    # extract the features of the tweet and store it into x
    x = extract_features(tweet, freqs)

    # make the prediction using x and theta
    z = np.dot(x,theta)
    y_pred = sigmoid(z)
    return y_pred

def test_logistic_regression(X_test, y_test, freqs, theta):
    """
    Input:
        X_test: a list of tweets
        y_test: vector with the corresponding labels for the list of tweets
        freqs: a dictionary of the frequencies of each tuple (word, label)
        theta: vector of weights
    Output:
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """

    # the list for storing predictions
    y_hat = []

    for tweet in X_test:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, theta)

        if y_pred > 0.5:
            # append 1 (positive) to the list
            y_hat.append(1)
        else:
            # append 0 (negative) to the list
            y_hat.append(0)
    # With the above implementation, y_hat is a list, but y_test is a
    # (m,1) array
    # convert both to one-dimensional arrays in order to compare them
    # using the '==' operator
    y_hat = np.array(y_hat)
    test_y = y_test.reshape(-1)
    accuracy = np.sum((test_y == y_hat).astype(int))/len(X_test)

    return accuracy
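
The thresholding and accuracy calculation can be illustrated on a handful of made-up probabilities. The helper `accuracy_at_threshold` and the numbers below are hypothetical and only mirror the logic of `test_logistic_regression`.

```python
def accuracy_at_threshold(probs, labels, threshold=0.5):
    # Hypothetical helper: predict 1 when the probability exceeds the
    # threshold, then return the fraction of predictions matching the labels.
    preds = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for pred, y in zip(preds, labels) if pred == y)
    return correct / len(labels)

probs = [0.91, 0.20, 0.50, 0.65, 0.08]   # made-up model outputs
labels = [1.0, 0.0, 1.0, 0.0, 0.0]       # float labels, like y_test

print(accuracy_at_threshold(probs, labels))   # 0.6
```

A probability of exactly 0.5 is counted as negative here because the comparison is strict, as in `test_logistic_regression`; the cell that classifies sample sentences uses `>=` instead, so the two would disagree only in that edge case.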

With all the required functions defined, we can proceed to try out our
model and look at the output.

In [10]:
acc = test_logistic_regression(X_test, y_test, freqs, w)

print("The accuracy on the test set is {}%".format(acc*100))

The accuracy on the test set is 99.15%


We have obtained a high accuracy (99.15% on this test split) with the
trained model. Let's formulate a few sentences and see what the model
predicts for them.

In [11]:
tweet = "I hate insta so much everyone is nicer on here"
prediction = predict_tweet(tweet, freqs, w)
print("This tweet is positive" if prediction >= 0.5 else "This tweet is negative")

tweet = "Everyday is my favorite, waking up one more day makes each day awesome"
prediction = predict_tweet(tweet, freqs, w)
print("This tweet is positive" if prediction >= 0.5 else "This tweet is negative")

tweet = "Checking the box full of ole' photos is always fun!"
prediction = predict_tweet(tweet, freqs, w)
print("This tweet is positive" if prediction >= 0.5 else "This tweet is negative")

This tweet is negative
This tweet is positive
This tweet is positive
