In [None]:
%load_ext autoreload
%autoreload 2

# Sentiment Analysis with Naive Bayes

In this notebook, we will explore the use of Naive Bayes for sentiment analysis.

It is essentially the same task as in the previous assignment, except we use Naive Bayes this time.

## Dataset

We will use the same dataset as in the previous assignment, namely the NLTK tweets dataset.

We will also use the same train/test split as in the previous assignment.

In [None]:
from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings("positive_tweets.json")
negative_tweets = twitter_samples.strings("negative_tweets.json")
n_samples = len(positive_tweets) + len(negative_tweets)
n_pos = len(positive_tweets)
n_neg = len(negative_tweets)

print("Total number of tweets: ", n_samples)
print("Number of positive tweets: ", n_pos)
print("Number of negative tweets: ", n_neg)

n_train = int(n_samples * 0.8)
n_test = n_samples - n_train

print("Number of training samples: ", n_train)
print("Number of test samples: ", n_test)

# number of training samples per class
n = n_train // 2

# training data
train_data_pos = positive_tweets[:n]
train_data_neg = negative_tweets[:n]
print(f"train_data_pos: {len(train_data_pos)}")
print(f"train_data_neg: {len(train_data_neg)}")

# test data
test_data_pos = positive_tweets[n:]
test_data_neg = negative_tweets[n:]
print(f"test_data_pos: {len(test_data_pos)}")
print(f"test_data_neg: {len(test_data_neg)}")

# build train and test datasets
train_data = train_data_pos + train_data_neg
test_data = test_data_pos + test_data_neg
print(f"train_data: {len(train_data)}")
print(f"test_data: {len(test_data)}")

In [None]:
import numpy as np

# create labels
y_train = np.append(
    np.ones((len(train_data_pos), 1)), np.zeros((len(train_data_neg), 1)), axis=0
)
y_test = np.append(
    np.ones((len(test_data_pos), 1)), np.zeros((len(test_data_neg), 1)), axis=0
)

print("y_train shape: ", y_train.shape)
print("y_test shape: ", y_test.shape)

## Preprocessing

We will reuse our preprocessing pipeline.

In [None]:
from htwgnlp.preprocessing import TweetProcessor

processor = TweetProcessor()
train_data_processed = [processor.process_tweet(tweet) for tweet in train_data]
train_data_processed[0]

## Training

For training, the goal is to find the word probabilities for each class.

We also need the log ratios of the probabilities, which are calculated from the word probabilities.
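
To make the two quantities concrete, here is a minimal sketch on a toy corpus. It assumes add-one (Laplace) smoothing, which is a common choice; the internals of `htwgnlp.naive_bayes.NaiveBayes` are not shown in this notebook and may differ. The toy tweets and the helper `word_probability` are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus of already tokenized tweets (hypothetical data): 1 = positive, 0 = negative
tweets = [
    ["happi", "great", "day"],
    ["great", "friend"],
    ["sad", "day"],
    ["sad", "bad", "miss"],
]
labels = [1, 1, 0, 0]

# Count how often each word occurs in each class
counts = {0: Counter(), 1: Counter()}
for tokens, label in zip(tweets, labels):
    counts[label].update(tokens)

vocab = set(counts[0]) | set(counts[1])
n_words = {c: sum(counts[c].values()) for c in (0, 1)}


def word_probability(word, label):
    """P(word | label) with add-one smoothing (an assumption of this sketch)."""
    return (counts[label][word] + 1) / (n_words[label] + len(vocab))


# Log ratio per word: > 0 leans positive, < 0 leans negative
log_ratios = {
    w: math.log(word_probability(w, 1) / word_probability(w, 0)) for w in vocab
}

# Log ratio of the class priors (0 when both classes are equally frequent)
log_prior = math.log(labels.count(1) / labels.count(0))

for word, ratio in sorted(log_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word:>7}: {ratio:+.3f}")
```

In this toy corpus, `great` appears only in positive tweets and gets a positive log ratio, `sad` gets a negative one, and `day` appears once in each class, so its log ratio is zero.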

In [None]:
from htwgnlp.naive_bayes import NaiveBayes

model = NaiveBayes()

model.fit(train_data_processed, y_train)
model.word_probabilities

In [None]:
model.log_ratios

## Testing

For testing, we need to make sure to apply the same preprocessing pipeline as for training.

Then we can calculate the log ratio of the class probabilities for each tweet.

This is done by the `predict` method, which returns the predicted class label.
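
A prediction of this kind can be sketched as follows. This is an illustration under assumptions, not the actual `predict` implementation: the log ratios are made-up numbers, unknown words are assumed to contribute nothing, and the decision threshold is assumed to be 0.

```python
# Hypothetical log ratios, e.g. as learned during training
log_ratios = {"great": 1.1, "happi": 0.7, "day": 0.0, "sad": -1.1, "bad": -0.7}

# log(P(pos) / P(neg)); zero when both classes are equally frequent in training
log_prior = 0.0


def score(tokens):
    """Log prior plus the summed log ratios of all known words in the tweet."""
    return log_prior + sum(log_ratios.get(token, 0.0) for token in tokens)


def predict(tokens):
    """Return 1 (positive) if the score is above 0, otherwise 0 (negative)."""
    return 1 if score(tokens) > 0 else 0


print(predict(["great", "day"]))    # positive words dominate
print(predict(["sad", "unknown"]))  # the unknown word is ignored
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow for longer tweets.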

In [None]:
test_data_processed = [processor.process_tweet(tweet) for tweet in test_data]

In [None]:
y_pred = model.predict(test_data_processed)
y_pred

## Evaluation

The model should reach a high accuracy of 99.65% on the test set.

```
# expected output
Accuracy: 0.9965
```
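
Accuracy is the fraction of predictions that match the labels. The sketch below uses made-up arrays to show one pitfall when computing it with NumPy. Whether `predict` returns an array of shape `(n, 1)` or `(n,)` is not shown in this notebook, so flattening both sides before comparing is a safe default.

```python
import numpy as np

y_true = np.array([[1.0], [1.0], [0.0], [0.0]])  # shape (4, 1), like y_test
y_hat = np.array([1.0, 0.0, 0.0, 0.0])           # shape (4,), hypothetical predictions

# Broadcasting (4,) against (4, 1) compares every prediction with every label
pairwise = y_hat == y_true
print(pairwise.shape)

# Flattening both sides compares element by element
accuracy = (y_hat.flatten() == y_true.flatten()).mean()
print(f"Accuracy: {accuracy:.4f}")
```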

In [None]:
print(f"Accuracy: {(y_pred.flatten() == y_test.flatten()).mean():.4f}")

Now we can try to predict our own tweet.

In [None]:
tweet = "Konstanz is a great place to live!"
x_i = [processor.process_tweet(tweet)]
print(f"tweet: {x_i[0]}")
print(f"prediction: {model.predict(x_i)[0]}")

## Error Analysis

Finally, we can check the error cases to see where our model fails.

In [None]:
error_cases = np.nonzero((y_pred.flatten() != y_test.flatten()))[0]
y_prob = model.predict_prob(test_data_processed)

for i in error_cases:
    print(
        f"sample: {i:>4}, predicted class: {y_pred[i]}, actual class: {y_test[i]}, log likelihood: {y_prob[i].item():7.4f}, tweet: {test_data[i]}"
    )

To better understand our classifier, we can check which words have the most impact on the sentiment of a tweet.

We can use the log ratios of the conditional probabilities to find the words that are most indicative of a positive or negative tweet.

Remember from the lecture that a value greater than 0 means that the word is more likely to appear in a positive tweet, and a value less than 0 means that the word is more likely to appear in a negative tweet.

In [None]:
model.log_ratios.sort_values(ascending=False).head(10)

Looking at the counts may give us a better intuition.
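
The smoothed count ratio and the log ratio rank words in the same order. Assuming the log ratios are built from add-one smoothed probabilities (an assumption, since the model internals are not shown here), the two differ only by a logarithm and a constant offset that depends on the class totals. A small sketch with made-up counts and totals:

```python
import math

# Hypothetical counts per word: word -> (negative count, positive count)
freqs = {":)": (2, 2960), "sad": (100, 5), "day": (150, 190), "miss": (240, 27)}

# Made-up totals: words per class and vocabulary size
n_neg, n_pos, vocab_size = 27_000, 27_500, 9_000

# Add-one smoothed count ratio
count_ratio = {w: (pos + 1) / (neg + 1) for w, (neg, pos) in freqs.items()}


def log_ratio(neg, pos):
    """Log ratio of add-one smoothed conditional probabilities."""
    p_pos = (pos + 1) / (n_pos + vocab_size)
    p_neg = (neg + 1) / (n_neg + vocab_size)
    return math.log(p_pos / p_neg)


# Constant that separates log(count ratio) from the log ratio
offset = math.log((n_neg + vocab_size) / (n_pos + vocab_size))

for word, (neg, pos) in freqs.items():
    print(
        f"{word:>5}: count ratio {count_ratio[word]:9.3f}, "
        f"log ratio {log_ratio(neg, pos):+.3f}, "
        f"log(count ratio) + offset {math.log(count_ratio[word]) + offset:+.3f}"
    )
```

Because the offset is the same for every word, sorting by the count ratio gives the same order as sorting by the log ratio.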

In [None]:
df = model.df_freqs.copy()

df["ratio"] = (df[1] + 1) / (df[0] + 1)
df.sort_values(by="ratio", ascending=False).head(10)

In [None]:
df.sort_values(by="ratio").head(10)

## Conclusion

The Naive Bayes classifier is a simple but powerful classifier that works well on text classification problems. 

It makes the assumption that the features are conditionally independent given the class, which is not true in general, but it still performs well in practice.
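
Written out for a tweet with words $w_1, \dots, w_n$ and class $c$ (notation introduced here for illustration), the independence assumption says that the likelihood factorizes into per-word terms:

$$
P(w_1, \dots, w_n \mid c) = \prod_{i=1}^{n} P(w_i \mid c)
$$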
