# Sentiment Analysis with Naive Bayes

In this notebook, we will explore the use of Naive Bayes for sentiment analysis.

It is essentially the same task as in the previous assignment, except we use Naive Bayes this time.

## Dataset

We will use the same dataset as in the previous assignment, namely the NLTK tweets dataset (`twitter_samples`).

We also use the same train/test split as in the previous assignment: 80% of the tweets for training and 20% for testing, balanced across both classes.

In [1]:
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings("positive_tweets.json")
negative_tweets = twitter_samples.strings("negative_tweets.json")
n_samples = len(positive_tweets) + len(negative_tweets)
n_pos = len(positive_tweets)
n_neg = len(negative_tweets)

print("Total number of tweets: ", n_samples)
print("Number of positive tweets: ", n_pos)
print("Number of negative tweets: ", n_neg)

# 80/20 train/test split
n_train = int(n_samples * 0.8)
n_test = n_samples - n_train

print("Number of training samples: ", n_train)
print("Number of test samples: ", n_test)

# number of training tweets per class
n = n_train // 2

# training data
train_data_pos = positive_tweets[:n]
train_data_neg = negative_tweets[:n]
print(f"train_data_pos: {len(train_data_pos)}")
print(f"train_data_neg: {len(train_data_neg)}")

# test data
test_data_pos = positive_tweets[n:]
test_data_neg = negative_tweets[n:]
print(f"test_data_pos: {len(test_data_pos)}")
print(f"test_data_neg: {len(test_data_neg)}")

# build train and test datasets
train_data = train_data_pos + train_data_neg
test_data = test_data_pos + test_data_neg
print(f"train_data: {len(train_data)}")
print(f"test_data: {len(test_data)}")

Total number of tweets:  10000
Number of positive tweets:  5000
Number of negative tweets:  5000
Number of training samples:  8000
Number of test samples:  2000
train_data_pos: 4000
train_data_neg: 4000
test_data_pos: 1000
test_data_neg: 1000
train_data: 8000
test_data: 2000


In [2]:
import numpy as np

# create labels
y_train = np.append(
    np.ones((len(train_data_pos), 1)), np.zeros((len(train_data_neg), 1)), axis=0
)
y_test = np.append(
    np.ones((len(test_data_pos), 1)), np.zeros((len(test_data_neg), 1)), axis=0
)

print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

y_train shape:  (8000, 1)
y_test shape:  (2000, 1)


## Preprocessing

We reuse our preprocessing pipeline, which turns each raw tweet into a list of cleaned, stemmed tokens.

In [3]:
from htwgnlp.preprocessing import TweetProcessor

processor = TweetProcessor()
train_data_processed = [processor.process_tweet(tweet) for tweet in train_data]
train_data_processed[0]

['followfriday', 'top', 'engag', 'member', 'commun', 'week', ':)']

## Training

For training, the goal is to estimate the word probabilities for each class.

We also need the log ratios of these probabilities, which are computed from the word probabilities.
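To make these two quantities concrete, here is a minimal sketch on toy data. It assumes Laplace (add-one) smoothing and natural logarithms, which is the standard formulation; the toy tweets and the helper `word_probability` are illustrative and not part of `htwgnlp`, whose actual implementation may differ.

```python
import math
from collections import Counter

# Toy, already-tokenized tweets (illustrative data, not from the NLTK corpus)
pos_tweets = [["great", "day", ":)"], ["love", "day"]]
neg_tweets = [["bad", "day", ":("], ["hate", "rain", ":("]]

# Word frequencies per class
pos_counts = Counter(w for tweet in pos_tweets for w in tweet)
neg_counts = Counter(w for tweet in neg_tweets for w in tweet)
vocab = sorted(set(pos_counts) | set(neg_counts))

n_pos_words = sum(pos_counts.values())  # total tokens in positive tweets
n_neg_words = sum(neg_counts.values())  # total tokens in negative tweets
v = len(vocab)                          # vocabulary size


def word_probability(count, n_class_words, vocab_size):
    """Laplace-smoothed estimate of P(word | class) (hypothetical helper)."""
    return (count + 1) / (n_class_words + vocab_size)


# log ratio > 0: word is more typical of positive tweets; < 0: of negative tweets
log_ratios = {
    w: math.log(
        word_probability(pos_counts[w], n_pos_words, v)
        / word_probability(neg_counts[w], n_neg_words, v)
    )
    for w in vocab
}

for word, ratio in sorted(log_ratios.items(), key=lambda kv: kv[1]):
    print(f"{word:>6}: {ratio:+.3f}")
```

The smoothing ensures that a word seen in only one class still gets a non-zero probability in the other, so the log ratio is always defined.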

In [4]:
from htwgnlp.naive_bayes import NaiveBayes

model = NaiveBayes()

model.fit(train_data_processed, y_train)
model.word_probabilities

| word | 0 | 1 |
|---|---|---|
| hopeless | 0.000083 | 0.000027 |
| tmr | 0.000110 | 0.000054 |
| :( | 0.101256 | 0.000054 |
| everyth | 0.000413 | 0.000300 |
| kid | 0.000468 | 0.000381 |
| ... | ... | ... |
| umair | 0.000028 | 0.000054 |
| thoracicbridg | 0.000028 | 0.000054 |
| 5minut | 0.000028 | 0.000054 |
| nonscript | 0.000028 | 0.000054 |


In [5]:
model.log_ratios

hopeless        -1.109570
tmr             -0.704105
:(              -7.527391
everyth         -0.321113
kid             -0.205114
                   ...   
umair            0.682189
thoracicbridg    0.682189
5minut           0.682189
nonscript        0.682189
soph             0.682189
Length: 9160, dtype: float64

## Testing

For testing, we must apply the same preprocessing pipeline as for training.

In the standard Naive Bayes formulation, we then compute a score for each tweet by summing the log ratios of its words. A positive score indicates the positive class, a negative score the negative class.

This is done by the `predict` function, which returns the predicted class label.
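The prediction step can be sketched as follows. This is an assumption about how a Naive Bayes `predict` typically works, not the actual `htwgnlp` code: the functions `score` and `predict_label` are hypothetical, and the log ratios are a few values rounded from the training output above. Because the training set is balanced (4000 positive and 4000 negative tweets), the log prior ratio is `log(4000 / 4000) = 0`.

```python
import math

# A few log ratios, rounded from the training output above
log_ratios = {
    "hopeless": -1.11,
    "tmr": -0.70,
    ":(": -7.53,
    "kid": -0.21,
    "umair": 0.68,
}

# Balanced training set, so the log prior ratio vanishes
log_prior = math.log(4000 / 4000)


def score(tokens, log_ratios, log_prior=0.0):
    """Sum the log ratios of all known tokens; unseen tokens contribute 0."""
    return log_prior + sum(log_ratios.get(tok, 0.0) for tok in tokens)


def predict_label(tokens, log_ratios, log_prior=0.0):
    """Return 1 (positive) if the score is above 0, else 0 (negative)."""
    return 1 if score(tokens, log_ratios, log_prior) > 0 else 0


print(predict_label(["umair"], log_ratios, log_prior))           # positive score
print(predict_label(["hopeless", ":("], log_ratios, log_prior))  # negative score
```

Ignoring unseen tokens is one common design choice; it means a tweet made up entirely of unknown words gets a score equal to the log prior.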

In [6]:
test_data_processed = [processor.process_tweet(tweet) for tweet in test_data]

In [7]:
y_pred = model.predict(test_data_processed)
y_pred

array([[1],
       [1],
       [1],
       ...,
       [0],
       [0],
       [0]])

## Evaluation

We achieve a high accuracy of 99.65% on the test set, that is, a fraction of 0.9965 of the test tweets is classified correctly.

```
# expected output
Accuracy: 0.9965
```

In [8]:
print(f"Accuracy: {(y_pred == y_test).mean() * 100}")

Accuracy: 99.65
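The accuracy computation relies on `y_pred` and `y_test` having the same shape, here `(2000, 1)`. The following sketch with made-up labels shows what can go wrong otherwise: comparing a flat array with a column vector silently broadcasts to a square matrix, so flattening both sides first is the safer habit.

```python
import numpy as np

# Made-up labels for illustration only
y_true = np.array([[1], [1], [0], [0]])      # column vector, shape (4, 1)
y_hat_col = np.array([[1], [0], [0], [0]])   # same shape as y_true
y_hat_flat = y_hat_col.ravel()               # flat array, shape (4,)

# Same shapes: element-wise comparison, as in the notebook
accuracy = (y_hat_col == y_true).mean()

# Mismatched shapes: NumPy broadcasts (4,) against (4, 1) to (4, 4)
broadcast = y_hat_flat == y_true

# Flattening both sides makes the comparison shape-independent
safe_accuracy = (y_hat_flat.ravel() == y_true.ravel()).mean()

print(f"Accuracy: {accuracy}")
print(f"Shape after accidental broadcasting: {broadcast.shape}")
print(f"Accuracy on flattened arrays: {safe_accuracy}")
```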


Now we can try to predict our own tweet.

In [9]:
tweet = "Konstanz is a great place to live!"
x_i = [processor.process_tweet(tweet)]
print(f"tweet: {x_i[0]}")
print(f"prediction: {model.predict(x_i)[0]}")

tweet: ['konstanz', 'great', 'place', 'live']
prediction: [1]


## Error Analysis

Finally, we can check the error cases to see where our model fails.

In [10]:
error_cases = np.nonzero((y_pred.flatten() != y_test.flatten()))[0]
y_prob = model.predict_prob(test_data_processed)

for i in error_cases:
    print(
        f"sample: {i:>4}, predicted class: {y_pred[i]}, actual class: {y_test[i]} log likelihood: {y_prob[i].item():7.4f}, tweet: {test_data[i]}"
    )

sample:   65, predicted class: [0], actual class: [1.] log likelihood: -1.4684, tweet: @jaredNOTsubway @iluvmariah @Bravotv Then that truly is a LATERAL move! Now, we all know the Queen Bee is UPWARD BOUND : ) #MovingOnUp
sample:  222, predicted class: [0], actual class: [1.] log likelihood: -1.0290, tweet: A new report talks about how we burn more calories in the cold, because we work harder to warm up. Feel any better about the weather? :p
sample:  753, predicted class: [0], actual class: [1.] log likelihood: -0.9607, tweet: off to the park to get some sunlight : )
sample:  822, predicted class: [0], actual class: [1.] log likelihood: -0.4665, tweet: @msarosh Uff Itna Miss karhy thy ap :p
sample: 1057, predicted class: [1], actual class: [0.] log likelihood:  0.7028, tweet: @rcdlccom hello, any info about possible interest in Jonathas ?? He is close to join Betis :( greatings
sample: 1298, predicted class: [1], actual class: [0.] log likelihood:  1.9149, tweet: @phenomyoutube u probs

To better understand our classifier, we can check which words have the most impact on the predicted sentiment of a tweet.

We can use the log ratios of the conditional probabilities to find the words that are most indicative of a positive or negative tweet.

Remember from the lecture that a value greater than 0 means that the word is more likely to appear in a positive tweet, and a value less than 0 means that the word is more likely to appear in a negative tweet.
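Since the log ratios are natural logarithms of probability ratios, exponentiating them recovers how many times more likely a word is in one class than in the other. The short sketch below does this for two values copied from the `model.log_ratios` output earlier in the notebook; the variable names are illustrative.

```python
import math

# Values copied from the model.log_ratios output above
log_ratio_umair = 0.682189   # rare word seen only in positive tweets
log_ratio_frown = -7.527391  # ":("

# exp(log ratio) = P(word | positive) / P(word | negative)
odds_umair = math.exp(log_ratio_umair)

# For negative log ratios, flip the sign to get the odds in favour of the negative class
odds_frown_negative = math.exp(-log_ratio_frown)

print(f"'umair' is about {odds_umair:.2f}x as likely in a positive tweet")
print(f"':(' is about {odds_frown_negative:.0f}x as likely in a negative tweet")
```

A word like `umair` ends up roughly twice as likely in positive tweets despite being very rare, which is a reminder that large-magnitude log ratios backed by high counts are far more trustworthy than small ones backed by a single occurrence.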

In [11]:
model.log_ratios.sort_values(ascending=False).head(10)

:)              6.883712
:-)             6.304400
:d              6.250534
:p              4.652481
stat            3.940286
bam             3.795705
warsaw          3.795705
blog            3.321247
fback           3.284879
followfriday    3.167096
dtype: float64

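Looking at the raw counts may give us a better intuition. The ratio below adds one to each count so that words that never occur in one class do not cause a division by zero.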

In [12]:
df = model.df_freqs.copy()

df["ratio"] = (df[1] + 1) / (df[0] + 1)
df.sort_values(by="ratio", ascending=False).head(10)

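| word | 0 | 1 | ratio |
|---|---|---|---|
| :) | 2 | 2960 | 987.0 |
| :-) | 0 | 552 | 553.0 |
| :d | 0 | 523 | 524.0 |
| :p | 0 | 105 | 106.0 |
| stat | 0 | 51 | 52.0 |
| warsaw | 0 | 44 | 45.0 |
| bam | 0 | 44 | 45.0 |
| blog | 0 | 27 | 28.0 |
| fback | 0 | 26 | 27.0 |
| followfriday | 0 | 23 | 24.0 |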


In [13]:
df.sort_values(by="ratio").head(10)

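| word | 0 | 1 | ratio |
|---|---|---|---|
| :( | 3675 | 1 | 0.000544 |
| :-( | 386 | 0 | 0.002584 |
| 》 | 210 | 0 | 0.004739 |
| ♛ | 210 | 0 | 0.004739 |
| >:( | 43 | 0 | 0.022727 |
| justi̇n | 35 | 0 | 0.027778 |
| wi̇ll | 35 | 0 | 0.027778 |
| beli̇ev | 35 | 0 | 0.027778 |
| ｓｅｅ | 35 | 0 | 0.027778 |
| ｍｅ | 35 | 0 | 0.027778 |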


## Conclusion

The Naive Bayes classifier is a simple but powerful classifier that works well on text classification problems.

It assumes that the features are conditionally independent given the class, which is not true in general, but it still performs well in practice.
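The independence assumption can be written out explicitly. The notation below is an assumption for illustration: $w_1, \dots, w_n$ are the tokens of a tweet and $c \in \{0, 1\}$ is the class.

$$
P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c)
$$

Taking the logarithm of the ratio between the two classes turns this product into the sum of per-word log ratios used throughout this notebook. A tweet is classified as positive when

$$
\log \frac{P(c = 1)}{P(c = 0)} + \sum_{i=1}^{n} \log \frac{P(w_i \mid c = 1)}{P(w_i \mid c = 0)} > 0
$$

With a balanced training set, the first term is zero and only the word log ratios matter.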
