##Introduction
In this notebook, I will walk you through the steps I took to analyze the sentiment of tweet sample data using natural language processing techniques. I started by importing the tweet sample data from the NLTK library, and then performed preprocessing tasks such as removing stop words and punctuations, tokenization, and stemming. Next, I built and trained a logistic regressor from scratch and utilized gradient descent to optimize the model. Finally, I tested the model with test data to evaluate its performance. Join me as we delve into the fascinating world of sentiment analysis and uncover insights from textual data using Python and the NLTK library.

In [9]:
# importing necessary libraries
import nltk
from nltk.corpus import twitter_samples
import matplotlib.pyplot as plt
import random
import numpy as np
import re                        
import string                    
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

In [2]:
# import our dataset
nltk.download('twitter_samples')

[nltk_data] Downloading package twitter_samples to /root/nltk_data...
[nltk_data]   Unzipping corpora/twitter_samples.zip.


True

In [4]:
# store positive and negative tweets in separate variables
positive = twitter_samples.strings('positive_tweets.json')
negative = twitter_samples.strings('negative_tweets.json')
type(positive)

list

In [6]:
# view samples of postive and negative tweets
print(positive[70])
print(negative[70])

#HappyBirthdayEmilyBett @emilybett :) Wishing you all the best you beautiful,sweet,talented,amazing… https://t.co/humtC1tr3I
Ahh Fam @MeekMill :( #RespectLost http://t.co/NT25MYnGYd


## Preprocessing Our Tweets for Sentiment Analysis

In [11]:
# store a sample to test preprocessing techniques
sample = positive[2277]
print(sample)

My beautiful sunflowers on a sunny Friday morning off :) #sunflowers #favourites #happy #Friday off… https://t.co/3tfYom0N1i


In [14]:
## preprocessing the sample tweet
# remove hashtag(#) from tweet
sample2 = re.sub('#','',sample)
# remove oldstyle 'RT' from our tweets
sample2 = re.sub(r'^RT[\s]+','',sample2)
# remove hyperlink
sample2 = re.sub(r'https?://[^s\n\r]+','',sample2) 
print(sample2)

My beautiful sunflowers on a sunny Friday morning off :) sunflowers favourites happy Friday off… 


In [15]:
# tokenize and change to lowercase
tokenizer = TweetTokenizer(preserve_case=False,strip_handles=True,reduce_len=True)
tokenized_sample = tokenizer.tokenize(sample2)
print(tokenized_sample)

['my', 'beautiful', 'sunflowers', 'on', 'a', 'sunny', 'friday', 'morning', 'off', ':)', 'sunflowers', 'favourites', 'happy', 'friday', 'off', '…']


In [65]:
nltk.download('stopwords') # download stopwords from nltk

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [21]:
stopwords_english = stopwords.words('english') #store english stopwords

# remove stopwords and punctuations from tokenized tweets
processed_tweets = []
for word in tokenized_sample:
  if word not in stopwords_english and word not in string.punctuation:
    processed_tweets.append(word)

print(processed_tweets)

['beautiful', 'sunflowers', 'sunny', 'friday', 'morning', ':)', 'sunflowers', 'favourites', 'happy', 'friday', '…']


In [22]:
# use stemming to reduce our vocabulary
stemmer = PorterStemmer()
stemmed_tweets = []
for word in processed_tweets:
  stem = stemmer.stem(word)
  stemmed_tweets.append(stem)

print(stemmed_tweets)

['beauti', 'sunflow', 'sunni', 'friday', 'morn', ':)', 'sunflow', 'favourit', 'happi', 'friday', '…']


In [25]:
# define a function to preprocess our tweet dataset
def process_tweet(tweet):
  '''
  input: 
    tweet - text to process
  output:
    tweet_clean - processed tweet
  '''
  stemmer = PorterStemmer()
  stopwords_english = stopwords.words('english')
  # remove hashtag from our tweets
  tweet = re.sub(r'#','',tweet)
  # remove oldstyle 'RT' from our tweets
  tweet = re.sub(r'^RT[\s]+','',tweet)
  # remove hyperlink from tweets
  tweet = re.sub(r'https?://[^s\n\r]+','',tweet)
  # remove stock market tickers like $GE
  tweet = re.sub(r'\$\w*', '', tweet)

  # tokenize tweets
  tokenizer = TweetTokenizer(preserve_case=False, 
                             strip_handles=True, reduce_len=True)
  tweet = tokenizer.tokenize(tweet)
  pro_tweet = []
  # remove stopwords and punctions
  for word in tweet:
    if word not in stopwords_english and word not in string.punctuation:
      pro_tweet.append(word)
  # stemming
  tweet_clean = []
  for word in pro_tweet:
    stem = stemmer.stem(word)
    tweet_clean.append(stem)
  
  return tweet_clean

In [27]:
# test our function
text = 'I am a bad man #badman testing https://mail.google.com/mail/u/0/#inbox'
process_tweet(text)

['bad', 'man', 'badman', 'test']

## Building a Frequency Dictionary.
We will be building a word frequency dictionary that will be made of a key value pair of a word from our vocabulary and its negative and positive frequency. i.e {(word, sentiment): frequncy}

In [31]:
# concate all our tweet samples
tweets = positive + negative
print(f'All tweets: {len(tweets)}')

# create an array for sentiments; 1 for positive, 0 for negative
labels = np.append(np.ones((len(positive),1)),np.zeros((len(negative),1)))
print(f'All labels: {len(labels)}')

All tweets: 10000
All labels: 10000


In [38]:
# define a function to build frequency dictionaries
def build_freq(tweets, labels):
  '''
  inputs:
    tweets - vocabulary to of text
    labels - corresponding sentiment
  output:
    freq_dict - frequency dictionary
  '''
  # create an empty dictionary 
  freq_dict = {}
  yslist = np.squeeze(labels).tolist()
  # loop thought tweets and labels to populate our dictiinary
  for y, tweet in zip(yslist, tweets):
    for word in process_tweet(tweet):
      pair = (word,y)
      if pair in freq_dict:
        freq_dict[pair] += 1
      else:
        freq_dict[pair] = 1
  return freq_dict

In [39]:
# bulid our frequency dictionary
freq = build_freq(tweets, labels)

# print the length and datatype of freq
print(f'Length: {len(freq)}')
print(f'Datatype: {type(freq)}')

Length: 13391
Datatype: <class 'dict'>


In [None]:
print(freq)

##Building and Training A Logistic Regressor

In [41]:
# separating our train and test data
train_pos = positive[:4000]
train_neg = negative[:4000]
test_pos = positive[4000:]
test_neg = negative[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

# store our labels in a variable
train_y = np.append(np.ones((len(train_pos),1)),np.zeros((len(train_neg),1)), axis=0)
test_y = np.append(np.ones((len(test_pos),1)),np.zeros((len(test_neg),1)), axis=0)

print(f'Number of tweets in training set: {len(train_x)}')
print(f'Number of tweets in test set: {len(test_x)}')
print(f'Number of labels in training set: {len(train_y)}')
print(f'Number of labels in test set: {len(test_y)}')

Number of tweets in training set: 8000
Number of tweets in test set: 2000
Number of labels in training set: 8000
Number of labels in test set: 2000


In [42]:
# build a frequency dictionary for our train data
freqs = build_freq(train_x,train_y)

# view our dictionary
print("type(freqs) = " + str(type(freqs)))
print("len(freqs) = " + str(len(freqs.keys())))

type(freqs) = <class 'dict'>
len(freqs) = 11617


In [43]:
# define a sigmoid function for our logistic regressor
def sigmoid(z):
  '''
  input:
    z - a scalar or an array
  output: 
    h - sigmoid of z
  '''
  h = 1/(1 + np.exp(-z))
  return h

In [46]:
# define a gradient descent function to train and optimize our module
def gradientDescent(x, y, weight, alpha, epochs):
  '''
  input:
    x - input vector
    y - labels for input vector
    weight - traing weight
    alpha - learning rate
    epochs - number of training iterations
  output:
    J - final cost of our training
    weight - final weight vector
  '''
  m = len(x)
  
  for i in range(0, epochs):
    # get the dot product of x and weight
    z = np.dot(x, weight)
    # get the sigmoid of z
    h = sigmoid(z)
    # calculate the cost function
    J = (-1/m)*((np.dot(np.transpose(y),np.log(h))) + np.dot(np.transpose(1 - y),np.log(1-h)))
    # update our model parameters
    weight = weight - (alpha/m) * (np.dot(np.transpose(x),(h-y)))
  J = float(J)
  return J, weight

In [49]:
# testing our function with random inputs and labels
np.random.seed(1)
# X input is 10 x 3 with ones for the bias terms
temp_x = np.append(np.ones((10,1)), np.random.rand(10,2)*2000, axis=1)
# label is a 10 x 1 array
temp_y = (np.random.rand(10,1) > 0.35).astype('float')

# test our gradent descent function
gradientDescent(temp_x, temp_y, np.zeros((3,1)), 1e-8, 700)

(0.6709497038162118,
 array([[4.10713435e-07],
        [3.56584699e-04],
        [7.30888526e-05]]))

In [52]:
# define a functions to extract features from our tweet
def extract_features(tweet, freqs, process_tweet=process_tweet):
  '''
  input:
    tweet - text to extract features
    freqs - frequency dictionary
  output:
    x - feature_vector
  '''
  # process tweet
  word_list = process_tweet(tweet)

  # create vector x with zeros
  x = np.zeros((1,3))

  # add a bias term, 1
  x[0,0] = 1

  #loop through each word to populate our x vector
  for word in word_list:
    # positive frequency
    x[0,1] += freqs.get((word,1),0)
    # negative frequency
    x[0,2] += freqs.get((word,0),0)
  
  return x 

In [53]:
# test our function
print(train_x[0])
extract_features(train_x[0], freqs)

#FollowFriday @France_Inte @PKuchly57 @Milipol_Paris for being top engaged members in my community this week :)


array([[1.000e+00, 3.051e+03, 6.100e+01]])

##Train Our Logistic Regressor

In [60]:
# collect the features 'x' and stack them into a matrix 'X'
X = np.zeros((len(train_x), 3))
for i in range(len(train_x)):
    X[i, :]= extract_features(train_x[i], freqs)

# training labels corresponding to X
Y = train_y

# Apply gradient descent
J, weight = gradientDescent(X, Y, np.zeros((3, 1)), 1e-9, 1500)
print(f"The cost after training is {J:.8f}.")
print(f"The resulting vector of weights is {[round(t, 8) for t in np.squeeze(weight)]}")

The cost after training is 0.23625149.
The resulting vector of weights is [7e-08, 0.00052853, -0.00055649]


In [61]:
# define a function to make predictions using our weights
def predict_tweet(tweet, freqs, weight):
    '''
    Input: 
        tweet - a string
        freqs - a dictionary corresponding to the frequencies of each tuple (word, label)
        theta - (3,1) vector of weights
    Output: 
        y_pred: the probability of a tweet being positive or negative
    '''    
    # extract the features of the tweet and store it into x
    x = extract_features(tweet,freqs)
    
    # make the prediction using x and theta
    y_pred = sigmoid(np.dot(x, weight))
    
    return y_pred

In [63]:
def test_logistic_regression(test_x, test_y, freqs, weight, predict_tweet=predict_tweet):
    """
    Input: 
        test_x: a list of tweets
        test_y: (m, 1) vector with the corresponding labels for the list of tweets
        freqs: a dictionary with the frequency of each pair (or tuple)
        theta: weight vector of dimension (3, 1)
    Output: 
        accuracy: (# of tweets classified correctly) / (total # of tweets)
    """    
    # create an empty list to store predictions
    y_hat = []
    
    for tweet in test_x:
        # get the label prediction for the tweet
        y_pred = predict_tweet(tweet, freqs, weight)
        
        if y_pred > 0.5:
            # append 1.0 to the list
            y_hat.append(1.0)
        else:
            # append 0 to the list
            y_hat.append(0.0)

    accuracy = np.mean(np.array(y_hat) == np.squeeze(test_y))
    
    return accuracy

In [64]:
tmp_accuracy = test_logistic_regression(test_x, test_y, freqs, weight)
print(f"Logistic regression model's accuracy = {tmp_accuracy:.4f}")

Logistic regression model's accuracy = 0.9950
