# Building a Naive Bayes Model for Sentiment Analysis

Building a Naive Bayes model for sentiment analysis on the sample Twitter dataset from NLTK.

[**1. Preprocessing**](#1.-Preprocessing)

[**2. Model Building**](#2.-Model-Building)

[**3. Predicting and Evaluation**](#3.-Predicting-and-Evaluation)

## 1. Preprocessing

### 1.1. Initializing

In [142]:
import nltk
from nltk.corpus import twitter_samples
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer 
from nltk.tokenize import TweetTokenizer
import re
import string
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

In [143]:
nltk.download('twitter_samples')
nltk.download('stopwords')

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Mahmoud\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Mahmoud\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [144]:
postw = twitter_samples.strings('positive_tweets.json')
negtw = twitter_samples.strings('negative_tweets.json')
print('Number of positive tweets: ', len(postw))
print('Number of negative tweets: ', len(negtw))

Number of positive tweets:  5000
Number of negative tweets:  5000


Splitting the data into training and test sets (the first 4,000 tweets of each class for training, the remaining 1,000 for testing):

In [145]:
test_pos = postw[4000:]
train_pos = postw[:4000]
test_neg = negtw[4000:]
train_neg = negtw[:4000]

train_x = train_pos + train_neg 
test_x = test_pos + test_neg

Creating labels (1 for positive tweets, 0 for negative tweets):

In [146]:
train_y = np.append(np.ones((len(train_pos), 1)), np.zeros((len(train_neg), 1)), axis=0)
test_y = np.append(np.ones((len(test_pos), 1)), np.zeros((len(test_neg), 1)), axis=0)
print("train_y.shape = " + str(train_y.shape))
print("test_y.shape = " + str(test_y.shape))

train_y.shape = (8000, 1)
test_y.shape = (2000, 1)


### 1.2. Tweet processing

Defining the tweet-processing function.

**Input**: a string containing a tweet  
**Output**: a list of cleaned, stemmed tokens from the tweet

In [147]:
def process_tweet(tweet):
    
    stemmer = PorterStemmer()
    stopwords_english = stopwords.words('english')
    
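    # remove stock tickers such as $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # remove the old-style retweet marker "RT" at the start of the tweet
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks (this also drops any text that follows a link on the same line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # remove only the hash sign, keeping the hashtag word
    tweet = re.sub(r'#', '', tweet)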
    
    # tokenizing tweets
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tweet_tokens = tokenizer.tokenize(tweet)

    # removing stop words and punctuation, then stemming what is left
    tweets_clean = []
    for word in tweet_tokens:
        if (word not in stopwords_english and
                word not in string.punctuation):
            stem_word = stemmer.stem(word)
            tweets_clean.append(stem_word)

    return tweets_clean
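
The regular-expression steps can be tried on their own, without NLTK. The following is a minimal sketch using a made-up tweet (the text, handle, and link are hypothetical). The handle is left in place here because in `process_tweet` it is the tokenizer (`strip_handles=True`) that removes it.

```python
import re

# Hypothetical tweet, made up for illustration
tweet = "RT @user: Loving this #NLP course $AAPL https://t.co/abc"

cleaned = re.sub(r'\$\w*', '', tweet)                   # drop the ticker "$AAPL"
cleaned = re.sub(r'^RT[\s]+', '', cleaned)              # drop the leading "RT "
cleaned = re.sub(r'https?:\/\/.*[\r\n]*', '', cleaned)  # drop the link up to the end of the line
cleaned = re.sub(r'#', '', cleaned)                     # "#NLP" -> "NLP"

print(cleaned.split())

# Caveat: the link pattern is greedy, so text after a link on the same line is lost too
after_link = re.sub(r'https?:\/\/.*[\r\n]*', '', "see https://t.co/x now")
print(repr(after_link))
```

The second call shows that anything written after a link on the same line is discarded along with the link, which may or may not be desirable depending on the data.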

### 1.3. Building word frequencies

Defining the function that builds the word frequencies.

**Input**: a list of tweets, an m x 1 array with the sentiment label of each tweet (either 0 or 1)  
**Output**: a dictionary mapping each (word, sentiment) pair to its frequency

In [148]:
def build_freqs(tweets, ys):

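    # flatten the m x 1 label array into a plain list (labels become floats 1.0 / 0.0)
    yslist = np.squeeze(ys).tolist()

    # count how often each (word, label) pair occurs across all tweets
    freqs = {}
    for y, tweet in zip(yslist, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1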

    return freqs
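
To see what the resulting dictionary looks like, here is a small sketch on made-up tweets. `toy_tokenize` is a hypothetical stand-in for `process_tweet` (it only lowercases and splits on whitespace), so the example runs without NLTK.

```python
def toy_tokenize(tweet):
    # Hypothetical stand-in for process_tweet: lowercase and split on whitespace
    return tweet.lower().split()

def toy_build_freqs(tweets, labels):
    # Same counting logic as build_freqs, with the toy tokenizer swapped in
    freqs = {}
    for y, tweet in zip(labels, tweets):
        for word in toy_tokenize(tweet):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
    return freqs

toy_tweets = ["happy happy day", "sad day", "happy ending"]
toy_labels = [1.0, 0.0, 1.0]  # floats, as produced by np.squeeze(ys).tolist()

toy_freqs = toy_build_freqs(toy_tweets, toy_labels)
print(toy_freqs)

# An int label finds the same key, because 1 == 1.0 and both hash alike
print(toy_freqs.get(("happy", 1), 0))
```

The last line is why later lookups with the integer labels `1` and `0` still match keys that were stored with float labels.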

Creating frequency dictionary:

In [149]:
freqs = build_freqs(train_x, train_y)
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 11340


## 2. Model Building

### 2.1. Lookup function

**Input**: a dictionary with the frequency of each pair (or tuple), the word to look up, the label corresponding to the word  
**Output**: the number of times the word with its corresponding label appears

In [150]:
def lookup(freqs, word, label):

    n = 0 

    pair = (word, label)
    if (pair in freqs):
        n = freqs[pair]

    return n

### 2.2. Naive Bayes function

**Input**: a dictionary mapping each (word, label) pair to how often it appears, a list of tweets, a list of labels corresponding to the tweets (0 or 1)  
**Output**: the log prior and a dictionary with the log likelihood ratio of each word, as used in the Naive Bayes equation
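
For reference, the quantities computed by this function can be written as follows. The formulas are read off the code; the symbol $\lambda(w)$ for the per-word log likelihood ratio is a notation chosen here, not one used in the notebook. $D_{pos}$ and $D_{neg}$ are the numbers of positive and negative training tweets, $\mathrm{freq}(w, c)$ is the count of word $w$ in class $c$, $N_c$ is the total word count of class $c$, and $V$ is the vocabulary size.

```latex
\begin{aligned}
\text{logprior} &= \log D_{pos} - \log D_{neg} \\
P(w \mid c) &= \frac{\mathrm{freq}(w, c) + 1}{N_c + V}, \qquad c \in \{pos, neg\} \\
\lambda(w) &= \log \frac{P(w \mid pos)}{P(w \mid neg)}
\end{aligned}
```

The $+1$ in the numerator and $+V$ in the denominator are add-one (Laplace) smoothing, which keeps the ratio finite for words seen in only one class.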

In [151]:
def train_naive_bayes(freqs, train_x, train_y):

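    loglikelihood = {}
    logprior = 0

    # vocabulary: the set of unique words in the frequency dictionary
    vocab = set([pair[0] for pair in freqs.keys()])
    V = len(vocab)

    # total number of word occurrences in positive and in negative tweets
    N_pos = N_neg = 0
    for pair in freqs.keys():
        if pair[1] > 0:
            N_pos += freqs.get(pair,0)
        else:
            N_neg += freqs.get(pair,0)

    # number of documents (tweets) overall and per class
    D = len(train_y)
    D_pos = np.sum(train_y)
    D_neg = D - D_pos

    # log prior: log of the ratio of positive to negative tweets
    logprior = np.log(D_pos)-np.log(D_neg)

    for word in vocab:
        freq_pos = lookup(freqs, word, 1)
        freq_neg = lookup(freqs, word, 0)

        # class-conditional word probabilities with add-one (Laplace) smoothing
        p_w_pos = (freq_pos+1)/(N_pos + V)
        p_w_neg = (freq_neg+1)/(N_neg + V)

        loglikelihood[word] = np.log(p_w_pos/p_w_neg)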

    return logprior, loglikelihood
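
The smoothing step is easier to follow on a few numbers. The sketch below repeats the same arithmetic on hypothetical toy counts (not taken from the Twitter data), using only the standard library.

```python
import math

# Hypothetical toy counts, keyed by (word, label) as in freqs
toy_freqs = {("happy", 1): 3, ("day", 1): 1, ("ending", 1): 1,
             ("sad", 0): 1, ("day", 0): 1}

vocab = {word for word, _ in toy_freqs}
V = len(vocab)  # number of unique words
N_pos = sum(n for (_, label), n in toy_freqs.items() if label > 0)
N_neg = sum(n for (_, label), n in toy_freqs.items() if label == 0)

toy_loglikelihood = {}
for word in vocab:
    # add-one smoothing: every word gets a non-zero probability in both classes
    p_w_pos = (toy_freqs.get((word, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (toy_freqs.get((word, 0), 0) + 1) / (N_neg + V)
    toy_loglikelihood[word] = math.log(p_w_pos / p_w_neg)

for word in sorted(toy_loglikelihood):
    print(word, round(toy_loglikelihood[word], 3))
```

Here "ending" never occurs in a negative tweet, yet its ratio stays finite because of the smoothing. Words seen mostly in positive tweets get a positive value and words seen mostly in negative tweets get a negative one.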

### 2.3. Training Naive Bayes

In [152]:
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)

## 3. Predicting and Evaluation

### 3.1. Prediction function

**Input**: a tweet (string), logprior (number), loglikelihood (a dictionary mapping words to numbers)  
**Output**: the log prior plus the sum of the log likelihoods of the words in the tweet that are found in the dictionary; a score above 0 is read as positive sentiment

In [153]:
def naive_bayes_predict(tweet, logprior, loglikelihood):

    word_l = process_tweet(tweet)

    # start from the log prior
    p = 0
    p += logprior

    # add the log likelihood of every word known to the model
    for word in word_l:
        if word in loglikelihood:
            p += loglikelihood[word]
            
    return p
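
As a rough illustration of the scoring rule, the sketch below applies it to already-tokenized words. The log likelihood values and the helper `toy_score` are hypothetical and only mirror the logic above.

```python
# Hypothetical log likelihood ratios and a balanced-class log prior
toy_logprior = 0.0
toy_loglikelihood = {"happi": 0.98, "sad": -1.10, "day": -0.41}

def toy_score(words, logprior, loglikelihood):
    # Words missing from the dictionary add nothing to the score
    return logprior + sum(loglikelihood.get(w, 0.0) for w in words)

score_pos = toy_score(["happi", "day", "unseen"], toy_logprior, toy_loglikelihood)
score_neg = toy_score(["sad", "day"], toy_logprior, toy_loglikelihood)

# A score above 0 is read as positive sentiment, otherwise negative
print(score_pos > 0, score_neg > 0)
```

Unknown words are simply skipped, so a tweet made up entirely of unseen words gets a score equal to the log prior.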

### 3.2. Accuracy calculator function

**Input**: a list of tweets, the corresponding labels for the list of tweets, the logprior, loglikelihood  
**Output**: (# of tweets classified correctly) / (total # of tweets)

In [154]:
def test_naive_bayes(test_x, test_y, logprior, loglikelihood):

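    accuracy = 0

    # predict 1 (positive) when the score is above 0, otherwise 0 (negative)
    y_hats = []
    for tweet in test_x:
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            y_hat_i = 1
        else:
            y_hat_i = 0
        y_hats.append(y_hat_i)
    # reshape to (m, 1) so the predictions line up with test_y
    y_hats = np.reshape(y_hats,(len(y_hats),1))

    # for 0/1 labels, the mean absolute difference is the error rate
    error = np.mean(np.abs(y_hats-test_y))
    accuracy = 1 - error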

    return accuracy
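
The reshape before the subtraction matters because of NumPy broadcasting. The following sketch, with made-up labels and predictions, shows that the error-based formula agrees with a direct comparison and what happens to the shapes if the reshape is skipped.

```python
import numpy as np

y_true = np.array([[1], [0], [1], [0]])  # shape (4, 1), like test_y
preds = [1, 0, 0, 0]                     # hypothetical predictions

y_hats = np.reshape(preds, (len(preds), 1))
accuracy = 1 - np.mean(np.abs(y_hats - y_true))
same_accuracy = np.mean(y_hats == y_true)
print(accuracy, same_accuracy)

# Without the reshape, (4,) - (4, 1) broadcasts to a (4, 4) matrix
print((np.array(preds) - y_true).shape)
```

With mismatched shapes the mean would be taken over all pairs of predictions and labels, which would silently give a wrong accuracy instead of raising an error.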

Evaluating the model's performance on the test set:

In [155]:
accuracy = test_naive_bayes(test_x, test_y, logprior, loglikelihood)
print("The naive bayes accuracy is %.2f %%" %(accuracy*100))

The naive bayes accuracy is 99.40 %
