<h2>Naive Bayes Sentiment Classification for Tweets</h2>

In this project, we'll build a sentiment analysis tool that classifies tweets from X (formerly Twitter) as positive or negative using the multinomial Naive Bayes algorithm. Sentiment analysis can help evaluate how a product is performing, automate customer preference reports, and anticipate customer behavior when planning a product launch. We'll build our sentiment analyzer using this [Kaggle Tweet dataset](https://www.kaggle.com/datasets/yasserh/twitter-tweets-sentiment-dataset). Our goal is at least 80% accuracy on the test set.

<h3>Exploring the Data</h3>

In [2]:
import pandas as pd
twitter = pd.read_csv("Tweets.csv")

print(twitter.shape)
print(twitter.dtypes)
twitter.head(3)

(27481, 4)
text_ID           object
tweet_text        object
selected_text     object
sentiment_type    object
dtype: object


Unnamed: 0,text_ID,tweet_text,selected_text,sentiment_type
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative
2,088c60f138,my boss is bullying me...,bullying me,negative


In [5]:
#We'll only focus on positive & negative sentiment, so we'll exclude neutral
twitter = twitter[twitter["sentiment_type"]!="neutral"]

#Now we'll fetch the sentiment percentages
twitter["sentiment_type"].value_counts(normalize=True)

positive    0.524476
negative    0.475524
Name: sentiment_type, dtype: float64

After excluding neutral tweets, about 52.4% of the remaining tweets are positive and 47.6% are negative.

<h3>Create Training and Test Sets</h3>

We'll split the dataset into a training set and a test set. The training set will comprise 80% of the data, and the test set the remaining 20%.

In [6]:
#Randomize the data
randomized_data = twitter.sample(frac=1, random_state=1)

#Calculate the index at which the split should occur
split_index = round(len(randomized_data)*0.8)

#Split for Training & Test sets
training_set = randomized_data[:split_index].reset_index(drop=True)
test_set = randomized_data[split_index:].reset_index(drop=True)

#Look at shape
print(training_set.shape)
print(test_set.shape)

(13090, 4)
(3273, 4)


Now we'll analyze the percentage of positive and negative tweets in the training and test sets. We expect the percentages to be close to what we have in the full dataset, where about 52.4% of the tweets were positive and 47.6% were negative.

In [7]:
training_set["sentiment_type"].value_counts(normalize=True)

positive    0.521925
negative    0.478075
Name: sentiment_type, dtype: float64

In [8]:
test_set["sentiment_type"].value_counts(normalize=True)

positive    0.534678
negative    0.465322
Name: sentiment_type, dtype: float64

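Both sets are within roughly one percentage point of the full dataset's 52.4% / 47.6% balance, so the split looks representative.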

<h3>Clean Data</h3>

To calculate the probabilities the algorithm requires, we first need to clean the training set. We'll strip punctuation and other special characters and convert the text to lower case, so that the words in each tweet can be counted consistently.

In [9]:
import re
def clean(string):
    #Remove punctuation and other special characters (anything that isn't a letter, digit, or whitespace)
    special_removed = re.sub(r'[^a-zA-Z0-9\s]', '', string)
    #Make all characters lower case
    lower_case = special_removed.lower()
    return lower_case

training_set["tweet_text"] = training_set["tweet_text"].apply(str).apply(clean)
training_set["tweet_text"].sample(20)

11207    good morning  i dont think it has stopped rain...
8868                                     becuz you braggin
9262     woah 311 is really good the rain earlier was r...
5823     practically my whole body burns i cant bend ov...
2523           feeling much better  doing history research
8637      awww  omg garbo fake playing during one of th...
4339     finally made it to jp licks in coolidge corner...
8900      im watching some of your videos in youtube yo...
5976                 haha better drunken tweeting you mean
3461                                  girl talk is awesome
12461     really good but its definitely not a 12s so m...
11404               i kno i kknow  sigh been on but it sux
8909     having a great time with family big ups to my ...
5906                                      gooooood morning
3884      they actually use standard speaker wire betwe...
5415      i ake it youre at work then and not lazing at...
8379                     macbook dying switching to ipho
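To make the effect of the cleaning step concrete, here is a minimal sketch that applies the same regular expression and lower-casing to two raw tweets from the preview shown earlier. The helper name `clean_demo` is introduced only for this illustration.

```python
import re

def clean_demo(text):
    # Same steps as clean(): drop every character that is not a letter,
    # digit, or whitespace, then lower-case the result.
    return re.sub(r"[^a-zA-Z0-9\s]", "", text).lower()

print(clean_demo("Sooo SAD I will miss you here in San Diego!!!"))
# sooo sad i will miss you here in san diego
print(clean_demo("I`d have responded, if I were going"))
# id have responded if i were going
```

Note that apostrophes and backticks are removed rather than replaced with a space, which is why contractions appear as `dont` and `cant` in the cleaned sample.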

<h3>Creating a Vocabulary List</h3>

Now that we have clean data, we'll create a list of all the unique words in our training set. This list is our vocabulary.

In [10]:
training_set["tweet_text"] = training_set["tweet_text"].str.split()
print(training_set["tweet_text"].head(3))

vocabulary = []
for tweet in training_set["tweet_text"]:
    for word in tweet:
        vocabulary.append(word)

vocabulary = list(set(vocabulary))
print(vocabulary[:3])
print(len(vocabulary))

0    [haha, its, pretty, good, theyre, making, some...
1                [happy, mamas, day, to, all, mothers]
2    [im, sure, youd, consider, it, if, they, offer...
Name: tweet_text, dtype: object
['mcphee', 'friendyou', 'aahs']
17731


There are 17,731 unique words in our vocabulary.

<h3>Final Training Set</h3>

In [11]:
#create a dictionary that has an empty list the length of the entire training_set for every unique word
word_counts_per_tweet = {}
for unique_word in vocabulary:
    empty_list = [0]*len(training_set["tweet_text"])
    word_counts_per_tweet[unique_word] = empty_list

#fill the empty lists in word_counts_per_tweet dictionary with the word counts for each tweet in the training_set
for index, tweet in enumerate(training_set["tweet_text"]):
    for word in tweet:
        word_counts_per_tweet[word][index]+=1
        
#turn word_counts_per_tweet into a dataframe
word_counts = pd.DataFrame(word_counts_per_tweet)
word_counts.head(2)

Unnamed: 0,mcphee,friendyou,aahs,mjname,pink,pollard,backbum,wisdom,tourdeus,defiance,...,degrees,encouragementit,jacked,alexander,rem,fusion,decorated,reals,sleepytown,genious
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [12]:
clean_training_set = pd.concat([training_set, word_counts], axis=1)
clean_training_set.head(2)

Unnamed: 0,text_ID,tweet_text,selected_text,sentiment_type,mcphee,friendyou,aahs,mjname,pink,pollard,...,degrees,encouragementit,jacked,alexander,rem,fusion,decorated,reals,sleepytown,genious
0,733d30d2f7,"[haha, its, pretty, good, theyre, making, some...",good,positive,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,42e9dce94a,"[happy, mamas, day, to, all, mothers]",Happy,positive,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [13]:
clean_training_set.shape

(13090, 17735)

<h3>Calculate Constants</h3>

Now that we're done cleaning the training set, we can create the sentiment classifier. To classify a new tweet, the Naive Bayes algorithm needs to evaluate these two expressions:

$$
  P(Positive | w_1,w_2, ..., w_n) \propto P(Positive) \cdot \prod_{i=1}^{n}P(w_i|Positive)\hspace{2cm}(1)
$$

$$
  P(Negative | w_1,w_2, ..., w_n) \propto P(Negative) \cdot \prod_{i=1}^{n}P(w_i|Negative)\hspace{1.5cm}(2)
$$


Also, to calculate $P(w_{i}|Positive)$ and $P(w_{i}|Negative)$ inside the formulas above, we'll need to use these equations:

$$
  P(w_i|Positive) = \frac{N_{w_i|Positive} + \alpha}{N_{Positive} + \alpha \cdot N_{Vocabulary}}\hspace{5.8cm}(3)
$$

$$
  P(w_i|Negative) = \frac{N_{w_i|Negative} + \alpha}{N_{Negative} + \alpha \cdot N_{Vocabulary}}\hspace{5.5cm}(4)
$$


Some of the terms in the four equations above will have the same value for every new tweet. We can calculate the value of these terms once and avoid doing the computations again for every new tweet. Below, we'll use our training set to calculate:

- P(Positive) and P(Negative) <br>
- N<sub>Positive</sub>, N<sub>Negative</sub>, and N<sub>Vocabulary</sub> <br>

We'll also use Laplace smoothing and set $\alpha = 1$.
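As a quick worked example of equations (3) and (4), the sketch below evaluates the smoothed probability for hypothetical toy counts: a class containing 5 words in total and a vocabulary of 4 words. The function name `smoothed_probability` and all the numbers are assumptions made for illustration, not values from the dataset.

```python
def smoothed_probability(n_word_given_class, n_class, n_vocabulary, alpha=1):
    # Equations (3) and (4): additive (Laplace) smoothing.
    return (n_word_given_class + alpha) / (n_class + alpha * n_vocabulary)

# Hypothetical counts of each vocabulary word inside one class (they sum to 5).
toy_counts = [2, 1, 2, 0]
toy_probabilities = [smoothed_probability(c, 5, 4) for c in toy_counts]

print(toy_probabilities[0])    # word seen twice: (2 + 1) / (5 + 4)
print(toy_probabilities[3])    # unseen word:     (0 + 1) / (5 + 4)
print(sum(toy_probabilities))  # the smoothed probabilities still sum to 1 (up to rounding)
```

With $\alpha = 0$ the unseen word would get probability 0, which would zero out the whole product in equations (1) and (2). Smoothing avoids that.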

In [14]:
# Isolating positive and negative tweets first
positive_tweets = clean_training_set[clean_training_set["sentiment_type"] == "positive"]
negative_tweets = clean_training_set[clean_training_set["sentiment_type"] == "negative"]

print(positive_tweets.shape)
print(negative_tweets.shape)

(6832, 17735)
(6258, 17735)


In [15]:
# P(Positive) & P(Negative)
p_positive = len(positive_tweets) / len(clean_training_set)
p_negative = len(negative_tweets) / len(clean_training_set)

print(p_positive)
print(p_negative)

0.5219251336898396
0.47807486631016044


In [16]:
# N_Positive: total number of words across all positive tweets
n_words_per_positive_tweet = positive_tweets["tweet_text"].apply(len)
n_positive = n_words_per_positive_tweet.sum()

# N_Negative
n_words_per_negative_tweet = negative_tweets["tweet_text"].apply(len)
n_negative = n_words_per_negative_tweet.sum()

print(n_positive)
print(n_negative)

88025
82818


In [17]:
# N_Vocabulary
n_vocabulary = len(vocabulary)

# Laplace smoothing
alpha = 1

print(n_vocabulary)
print(alpha)

17731
1


<h3>Calculating Parameters</h3>

Now that we have the constant terms calculated above, we can move on to calculating the parameters $P(w_{i}|Positive)$ and $P(w_{i}|Negative)$. Each parameter is a conditional probability associated with one word in the vocabulary. The parameters are calculated using formulas (3) and (4), repeated here:

$$
  P(w_i|Positive) = \frac{N_{w_i|Positive} + \alpha}{N_{Positive} + \alpha \cdot N_{Vocabulary}}\hspace{5.8cm}(3)
$$

$$
  P(w_i|Negative) = \frac{N_{w_i|Negative} + \alpha}{N_{Negative} + \alpha \cdot N_{Vocabulary}}\hspace{5.5cm}(4)
$$

In [32]:
#Initialize the parameters as dictionaries using dict comprehensions:
parameters_positive = {unique_word:0 for unique_word in vocabulary}
parameters_negative = {unique_word:0 for unique_word in vocabulary}

# Calculate parameters:
for word in vocabulary:
    n_word_given_positive = positive_tweets[word].sum() #positive_tweets already defined in a cell above
    p_word_given_positive = (n_word_given_positive + alpha) / (n_positive + alpha*n_vocabulary)
    parameters_positive[word] = p_word_given_positive
    
    n_word_given_negative = negative_tweets[word].sum() #negative_tweets already defined in a cell above
    p_word_given_negative = (n_word_given_negative + alpha) / (n_negative + alpha*n_vocabulary)
    parameters_negative[word] = p_word_given_negative
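The same per-class word counts can also be gathered without a wide word-count table. The following is a minimal sketch, assuming tokenized tweets paired with labels and using `collections.Counter`. The toy tweets and the names `toy_tweets`, `toy_vocabulary`, and `toy_parameters` are hypothetical, and this is an alternative illustration rather than the approach used in this notebook.

```python
from collections import Counter

# Hypothetical tokenized training tweets with their labels.
toy_tweets = [
    (["good", "morning"], "positive"),
    (["good", "day"], "positive"),
    (["bad", "day"], "negative"),
]

toy_vocabulary = sorted({word for tokens, _ in toy_tweets for word in tokens})

# Count how often each word occurs within each class.
counts = {"positive": Counter(), "negative": Counter()}
for tokens, label in toy_tweets:
    counts[label].update(tokens)

alpha = 1
toy_parameters = {}
for label, counter in counts.items():
    n_class = sum(counter.values())  # total words in this class
    toy_parameters[label] = {
        word: (counter[word] + alpha) / (n_class + alpha * len(toy_vocabulary))
        for word in toy_vocabulary
    }

print(toy_parameters["positive"]["good"])  # (2 + 1) / (4 + 4) = 0.375
```

A `Counter` returns 0 for words it has never seen, so unseen words still receive the smoothed probability from the $\alpha$ term.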

<h3>Classifying A New Tweet</h3>

Now that we have all our parameters calculated, we can start creating the sentiment classifier. The sentiment classifier can be understood as a function that:

• Takes in as input a new tweet (w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>). <br>
• Calculates P(Positive|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Negative|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>). <br>
• Compares the values of P(Positive|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) and P(Negative|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>). If P(Positive|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>) is greater than P(Negative|w<sub>1</sub>, w<sub>2</sub>, ..., w<sub>n</sub>), the tweet is classified as positive, and vice versa. If the two are equal, we can ask for human input.

In [33]:
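def classify(tweet):

    #Apply the same cleaning as the training set (clean() already lower-cases), then split into words
    tweet = clean(str(tweet))
    tweet = tweet.split()
    
    #Start from the class priors P(Positive) and P(Negative)
    p_positive_given_tweet = p_positive
    p_negative_given_tweet = p_negative
    
    #Multiply in P(word|class) for each word; words outside the vocabulary are skipped
    for word in tweet:
        if word in parameters_positive:
            p_positive_given_tweet*=parameters_positive[word]
            
        if word in parameters_negative:
            p_negative_given_tweet*=parameters_negative[word]
    
    #Pick the class with the larger score
    if p_positive_given_tweet>p_negative_given_tweet:
        return "positive"
    elif p_negative_given_tweet>p_positive_given_tweet:
        return "negative"
    else:
        return "needs human classification"    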

In [34]:
test1 = classify("i love noodles")
test2 = classify("bad weather")
print(test1)
print(test2)

positive
negative
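Multiplying many small probabilities can underflow to 0.0 for long inputs, at which point both classes tie. A common variant compares sums of logarithms instead, which preserves the ordering of the two scores. The sketch below assumes the priors and parameters are passed in as dictionaries; `classify_log`, `toy_priors`, and `toy_parameters` are hypothetical names with made-up values, not part of the notebook.

```python
import math

def classify_log(tokens, priors, parameters):
    # tokens: list of cleaned words
    # priors: {label: P(label)}
    # parameters: {label: {word: P(word | label)}}
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for word in tokens:
            if word in parameters[label]:  # skip out-of-vocabulary words
                score += math.log(parameters[label][word])
        scores[label] = score

    if scores["positive"] > scores["negative"]:
        return "positive"
    if scores["negative"] > scores["positive"]:
        return "negative"
    return "needs human classification"

# Made-up priors and word probabilities, for illustration only.
toy_priors = {"positive": 0.5, "negative": 0.5}
toy_parameters = {
    "positive": {"love": 0.4, "bad": 0.1},
    "negative": {"love": 0.1, "bad": 0.4},
}

print(classify_log(["i", "love", "noodles"], toy_priors, toy_parameters))  # positive
print(classify_log(["bad", "weather"], toy_priors, toy_parameters))        # negative
```

Tweets are short, so the plain product is unlikely to underflow here, but the log form is the safer choice if the classifier is reused on longer texts.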


Now we'll classify every tweet in the test set and store the predicted sentiment in a new column:

In [35]:
test_set["predicted"] = test_set["tweet_text"].apply(classify)
test_set.head(4)

Unnamed: 0,text_ID,tweet_text,selected_text,sentiment_type,predicted
0,2c79fd035e,Bloody servers are down at work for at least 3...,down,negative,negative
1,093021fd94,i hope it doesnt rain tonight tomorrow my fam....,i hope,positive,positive
2,eaf0dc067f,is sipping OJ in the sun in San Pedro at La So...,Yummy!!!,positive,positive
3,c9100b0595,I know Buffie. I am sitting in my office inst...,bummer,negative,negative


Below we'll write a script to measure how accurate our tweet classifier is: 

In [36]:
correct = 0
total = test_set.shape[0]

for row in test_set.iterrows(): #use iterrows to iterate over test_set
    row_content = row[1] #extract row content from the (index, row content) tuple produced by iterrows
    if row_content["predicted"]==row_content["sentiment_type"]:
        correct+=1
        
accuracy = 100*(correct/total)
print(accuracy)

85.21234341582647


The classifier is about 85.2% accurate on the test set.
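A single accuracy figure hides which kind of mistake is more common. As a minimal sketch, the (actual, predicted) pairs can be tallied with `collections.Counter` to see both error types separately. The label pairs below are hypothetical and are not results from this dataset.

```python
from collections import Counter

# Hypothetical (actual, predicted) label pairs.
toy_results = [
    ("positive", "positive"),
    ("positive", "negative"),
    ("negative", "negative"),
    ("negative", "negative"),
    ("negative", "positive"),
]

# Tally each (actual, predicted) combination.
confusion = Counter(toy_results)

# Accuracy is the share of pairs where the two labels agree.
toy_correct = sum(n for (actual, predicted), n in confusion.items() if actual == predicted)
toy_accuracy = 100 * toy_correct / len(toy_results)

for (actual, predicted), n in sorted(confusion.items()):
    print(actual, predicted, n)
print(toy_accuracy)  # 60.0
```

On the real test set, the pairs could be built from the `sentiment_type` and `predicted` columns.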

<h3>Next Steps</h3>

In this project, we built a sentiment classifier for tweets from X (formerly Twitter) using the multinomial Naive Bayes algorithm. The classifier reached an accuracy of about 85% on the test set, which exceeds our 80% goal.

Potential next steps to improve the classifier could include:<br>
• Analyzing the tweets that were classified incorrectly and trying to figure out why the algorithm misclassified them<br>
• Making the classifying process more complex by writing an algorithm that's sensitive to letter case, emoticons, and neutral sentiment<br>