# Naive Bayes Sentiment Analysis on Tweets

Welcome to my learning notebook for Naive Bayes sentiment analysis! In this notebook, I implement a full pipeline that classifies tweets as positive or negative, using standard NLP preprocessing and a Naive Bayes model written from scratch. The code and explanations are my own, written step by step for sharing on GitHub.

**Outline:**
1. Import Required Libraries and Data
2. Process the Data
3. Implement Helper Functions
4. Train Naive Bayes Model
5. Test Naive Bayes Model
6. Analyze Word Ratios
7. Error Analysis
8. Predict Sentiment on Your Own Tweet

## 1. Import Required Libraries and Data

Let's start by importing the necessary libraries, downloading the required NLTK datasets, and loading the tweet data. The first 4,000 tweets of each class go to training; the rest are held out for testing.

In [1]:
# Import libraries
import numpy as np
import pandas as pd
import nltk
import string
from nltk.corpus import stopwords, twitter_samples
from nltk.tokenize import TweetTokenizer

# Download NLTK datasets (run once)
nltk.download('twitter_samples')
nltk.download('stopwords')

# Load positive and negative tweets
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# Split data into training and testing sets
train_pos = all_positive_tweets[:4000]
test_pos = all_positive_tweets[4000:]
train_neg = all_negative_tweets[:4000]
test_neg = all_negative_tweets[4000:]

train_x = train_pos + train_neg
test_x = test_pos + test_neg

train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))
test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))

[nltk_data] Downloading package twitter_samples to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package twitter_samples is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Dell\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
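
To make the label layout concrete, here is a small sketch with made-up tweets. The toy lists and the split point of 2 are assumptions for illustration, not the NLTK data. It mirrors the slicing above: positives first, then negatives, with labels built in the same order.

```python
# Toy stand-ins for the NLTK lists; the real ones are much longer.
toy_pos = ["great day :)", "love this", "so happy"]
toy_neg = ["bad day :(", "hate this", "so sad"]

split = 2  # hypothetical split point, playing the role of 4000 above

toy_train_x = toy_pos[:split] + toy_neg[:split]
toy_test_x = toy_pos[split:] + toy_neg[split:]

# Labels follow the order of the concatenation: 1s for positives, then 0s.
toy_train_y = [1] * split + [0] * split
toy_test_y = [1] * (len(toy_pos) - split) + [0] * (len(toy_neg) - split)

for tweet, label in zip(toy_train_x, toy_train_y):
    print(label, tweet)
```

Because the labels are generated rather than read from the data, the ordering of the concatenation is what keeps each tweet paired with the right label.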


## 2. Process the Data

Before training our model, we need to clean and preprocess the tweets. We'll use a custom `process_tweet` function to remove noise, punctuation, and stopwords, and to apply stemming.

In [2]:
import re
from nltk.stem import PorterStemmer

def process_tweet(tweet):
    """
    Custom tweet preprocessing: lowercase, remove links, handles, punctuation, stopwords, and apply stemming.
    """
    stemmer = PorterStemmer()
    stop_words = set(stopwords.words('english'))
    # Remove stock tickers like $GE
    tweet = re.sub(r'\$\w*', '', tweet)
    # Remove retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
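    # Remove hyperlinks (the greedy '.*' also drops any text after the link on that line)
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # Remove only the '#' sign, keeping the hashtag word
    tweet = re.sub(r'#', '', tweet)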
    # Tokenize
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    tokens = tokenizer.tokenize(tweet)
    clean_tokens = []
    for word in tokens:
        if (word not in stop_words and  # remove stopwords
            word not in string.punctuation and  # remove punctuation
            len(word) > 1 and  # remove short tokens
            not word.isnumeric()):
            stem_word = stemmer.stem(word)
            clean_tokens.append(stem_word)
    return clean_tokens

# Example usage
sample_tweet = "RT @Twitter @chapagain Hello There! Have a great day. :) #good #morning http://chapagain.com.np"
print(process_tweet(sample_tweet))

['hello', 'great', 'day', ':)', 'good', 'morn']
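
One detail worth seeing in isolation is how the hyperlink pattern behaves when a link sits in the middle of a tweet. The sketch below compares the pattern used in `process_tweet` with a whitespace-bounded alternative. The alternative pattern and the example text are my assumptions, not part of the pipeline above, and switching patterns would change the vocabulary and the numbers reported later.

```python
import re

text = "check this http://example.com it made my day :)"

# Pattern from process_tweet: '.*' runs to the end of the line,
# so the words after the link are removed as well.
greedy = re.sub(r'https?:\/\/.*[\r\n]*', '', text)

# Alternative (an assumption): stop at the first whitespace character.
bounded = re.sub(r'https?://\S+', '', text)

print(repr(greedy))
print(repr(bounded))
```

In the sample tweet above the link is the last token, so both patterns give the same result there.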


## 3. Implement Helper Functions

To train our Naive Bayes model, we need to count how often each word appears in positive and negative tweets. We'll implement two helper functions: `count_tweets` to build the frequency dictionary, and `lookup` to retrieve word-label frequencies.

In [3]:
def count_tweets(result, tweets, ys):
    """
    Count the frequency of each word in positive and negative tweets.
    Args:
        result: dict to store counts
        tweets: list of tweet strings
        ys: list/array of labels (0 or 1)
    Returns:
        result: dict mapping (word, label) to frequency
    """
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)
            if pair in result:
                result[pair] += 1
            else:
                result[pair] = 1
    return result

def lookup(freqs, word, label):
    """
    Return the frequency of a word with a given label from the frequency dictionary.
    """
    return freqs.get((word, label), 0)

# Test helper functions
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
print(count_tweets(result, tweets, ys))

{('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
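
The same counting can be sketched with `collections.Counter` from the standard library. This is an alternative, not what the notebook uses; `count_pairs` is a hypothetical helper, and it takes tweets that are already tokenized so the example runs without NLTK.

```python
from collections import Counter

def count_pairs(tokenized_tweets, labels):
    """Count (word, label) pairs; hypothetical helper for pre-tokenized tweets."""
    counts = Counter()
    for tokens, label in zip(tokenized_tweets, labels):
        counts.update((token, label) for token in tokens)
    return counts

# Tokens matching the stems shown in the output above.
tokenized = [['happi'], ['trick'], ['sad'], ['tire'], ['tire']]
labels = [1, 0, 0, 0, 0]

toy_freqs = count_pairs(tokenized, labels)
print(dict(toy_freqs))
print(toy_freqs[('happi', 0)])  # unseen pairs read as 0, like lookup()
```

A `Counter` returns 0 for missing keys, so it covers the role of `lookup` as well.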


## 4. Train Naive Bayes Model

Now let's build the frequency dictionary using our helper functions, and implement a custom Naive Bayes training function that computes the logprior (the log ratio of positive to negative tweets) and a loglikelihood for each word in the vocabulary.

In [4]:
def train_naive_bayes(freqs, train_x, train_y):
    """
    Train a Naive Bayes classifier: compute logprior and loglikelihood for each word.
    """
    loglikelihood = {}
    vocab = {word for word, _ in freqs}
    V = len(vocab)
    N_pos = N_neg = 0
    for (word, label), count in freqs.items():
        if label == 1:
            N_pos += count
        else:
            N_neg += count
    D = len(train_y)
    D_pos = np.sum(train_y)
    D_neg = D - D_pos
    logprior = np.log(D_pos) - np.log(D_neg)
    for word in vocab:
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)
    return logprior, loglikelihood

# Build frequency dictionary and train model
freqs = count_tweets({}, train_x, train_y)
logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)
print(f"Logprior: {logprior}")
print(f"Vocabulary size: {len(loglikelihood)}")

Logprior: 0.0
Vocabulary size: 8728
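
To see the arithmetic on something small, here is a sketch that applies the same formulas to the toy counts from section 3. Each word gets `(freq + 1) / (N_class + V)` per class (Laplace smoothing), and the loglikelihood is the log of the positive value over the negative one. The `toy_` names are mine, and the code uses `math` instead of NumPy so it runs on its own.

```python
import math

# Toy counts from the helper-function test in section 3.
toy_freqs = {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}
toy_labels = [1, 0, 0, 0, 0]

vocab = {word for word, _ in toy_freqs}
V = len(vocab)  # number of unique words
N_pos = sum(c for (_, y), c in toy_freqs.items() if y == 1)  # word count, positive class
N_neg = sum(c for (_, y), c in toy_freqs.items() if y == 0)  # word count, negative class

D_pos = sum(toy_labels)
D_neg = len(toy_labels) - D_pos
toy_logprior = math.log(D_pos) - math.log(D_neg)

toy_loglikelihood = {}
for word in vocab:
    p_w_pos = (toy_freqs.get((word, 1), 0) + 1) / (N_pos + V)
    p_w_neg = (toy_freqs.get((word, 0), 0) + 1) / (N_neg + V)
    toy_loglikelihood[word] = math.log(p_w_pos / p_w_neg)

print(round(toy_logprior, 3))
for word in sorted(toy_loglikelihood):
    print(word, round(toy_loglikelihood[word], 3))
```

With one positive and four negative toy tweets the logprior is negative, unlike the balanced training set above, where it comes out as 0.0.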


## 5. Test Naive Bayes Model

Let's implement a function to predict the sentiment of a tweet using our trained model, and another function to evaluate the model's accuracy on the test set.

In [5]:
def naive_bayes_predict(tweet, logprior, loglikelihood):
    """
    Predict the sentiment of a tweet using the trained Naive Bayes model.
    """
    words = process_tweet(tweet)
    p = logprior
    for word in words:
        if word in loglikelihood:
            p += loglikelihood[word]
    return p

def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    """
    Evaluate model accuracy on the test set.
    """
    y_hats = []
    for tweet in test_x:
        if naive_bayes_predict(tweet, logprior, loglikelihood) > 0:
            y_hats.append(1)
        else:
            y_hats.append(0)
    # Convert the predictions to an array so the subtraction is explicitly element-wise
    error = np.mean(np.abs(np.array(y_hats) - test_y))
    accuracy = 1 - error
    return accuracy

# Evaluate accuracy on the held-out test set
print(f"Test accuracy: {test_naive_bayes(test_x, test_y, logprior, loglikelihood):.4f}")

Test accuracy: 0.9945
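
The scoring and accuracy steps can be traced by hand on a tiny example. The log-likelihood values below are invented for illustration and do not come from the trained model; `score` and `accuracy` are hypothetical stand-ins for the two functions above, working on pre-tokenized input.

```python
# Invented log-likelihoods for a few stems; not values from the trained model.
toy_loglikelihood = {'happi': 2.0, 'sad': -2.5, 'day': 0.1}
toy_logprior = 0.0

def score(tokens, logprior, loglikelihood):
    """Add the log-likelihood of every known token to the prior."""
    return logprior + sum(loglikelihood.get(t, 0.0) for t in tokens)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_test = [(['happi', 'day'], 1), (['sad', 'day'], 0), (['day'], 0)]
preds = [1 if score(tokens, toy_logprior, toy_loglikelihood) > 0 else 0
         for tokens, _ in toy_test]
labels = [label for _, label in toy_test]

print(preds)
print(accuracy(labels, preds))
```

The third toy tweet is misclassified because its only token leans slightly positive, which gives two correct predictions out of three.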


## 6. Analyze Word Ratios

Some words are much more likely to appear in positive or negative tweets. Let's implement functions to compute the positive/negative ratio for a word, and to filter words by their sentiment ratio.

In [6]:
def get_ratio(freqs, word):
    """
    Compute the positive/negative ratio for a word.
    """
    pos = lookup(freqs, word, 1)
    neg = lookup(freqs, word, 0)
    ratio = (pos + 1) / (neg + 1)
    return {'positive': pos, 'negative': neg, 'ratio': ratio}

def get_words_by_threshold(freqs, label, threshold):
    """
    Filter words by their positive/negative ratio.
    label=1 keeps words with ratio >= threshold; label=0 keeps words with ratio <= threshold.
    """
    word_list = {}
    for (word, _), _ in freqs.items():
        pos_neg_ratio = get_ratio(freqs, word)
        if label == 1 and pos_neg_ratio['ratio'] >= threshold:
            word_list[word] = pos_neg_ratio
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:
            word_list[word] = pos_neg_ratio
    return word_list

# Example: find strongly positive and negative words
print(get_words_by_threshold(freqs, label=1, threshold=10))
print(get_words_by_threshold(freqs, label=0, threshold=0.05))

{'followfriday': {'positive': 23, 'negative': 0, 'ratio': 24.0}, 'commun': {'positive': 27, 'negative': 1, 'ratio': 14.0}, ':)': {'positive': 2847, 'negative': 2, 'ratio': 949.3333333333334}, 'flipkartfashionfriday': {'positive': 16, 'negative': 0, 'ratio': 17.0}, ':d': {'positive': 498, 'negative': 0, 'ratio': 499.0}, ':p': {'positive': 104, 'negative': 0, 'ratio': 105.0}, 'influenc': {'positive': 16, 'negative': 0, 'ratio': 17.0}, ':-)': {'positive': 543, 'negative': 0, 'ratio': 544.0}, "here'": {'positive': 20, 'negative': 0, 'ratio': 21.0}, 'youth': {'positive': 14, 'negative': 0, 'ratio': 15.0}, 'bam': {'positive': 44, 'negative': 0, 'ratio': 45.0}, 'warsaw': {'positive': 44, 'negative': 0, 'ratio': 45.0}, 'shout': {'positive': 11, 'negative': 0, 'ratio': 12.0}, ';)': {'positive': 22, 'negative': 0, 'ratio': 23.0}, 'stat': {'positive': 51, 'negative': 0, 'ratio': 52.0}, 'arriv': {'positive': 57, 'negative': 4, 'ratio': 11.6}, 'via': {'positive': 60, 'negative': 1, 'ratio': 30.5}, 
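
The ratios in this output can be checked by hand. The sketch below reuses the formula from `get_ratio` on counts copied from the output above; `smoothed_ratio` is a hypothetical name for that one-line formula.

```python
def smoothed_ratio(pos, neg):
    """(pos + 1) / (neg + 1), the same formula as get_ratio."""
    return (pos + 1) / (neg + 1)

# Counts copied from the output above.
print(smoothed_ratio(2847, 2))  # ':)'
print(smoothed_ratio(57, 4))    # 'arriv'
print(smoothed_ratio(23, 0))    # 'followfriday'
```

Adding 1 to both counts keeps the ratio defined for words that never appear in negative tweets, such as `followfriday`.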

## 7. Error Analysis

Let's look at some tweets that our model misclassified. This can help us understand the model's limitations and where it might be improved.

In [7]:
print('Truth\tPredicted\tTweet')
for x, y in zip(test_x, test_y):
    y_hat = naive_bayes_predict(x, logprior, loglikelihood)
    predicted = int(y_hat > 0)  # same decision rule as test_naive_bayes
    if y != predicted:
        print(f'{int(y)}\t{predicted}\t{" ".join(process_tweet(x))}')

Truth	Predicted	Tweet
1	0	
1	0	truli later move know queen bee upward bound movingonup
1	0	new report talk burn calori cold work harder warm feel better weather :p
1	0	harri niall harri born ik stupid wanna chang :d
1	0	
1	0	
1	0	park get sunlight
1	0	uff itna miss karhi thi ap :p
0	1	hello info possibl interest jonatha close join beti :( great
0	1	prob fun david
0	1	pat jay
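
The blank rows are tweets for which `process_tweet` returned no tokens at all. In that case the score is just the logprior, which is 0.0 here, and the `> 0` rule sends the tweet to the negative class. A minimal sketch of that tie-breaking behaviour, with `decide` as a hypothetical helper and an invented one-word model:

```python
def decide(tokens, logprior, loglikelihood):
    """Return (score, label) using the same '> 0' rule as the notebook."""
    score = logprior + sum(loglikelihood.get(t, 0.0) for t in tokens)
    return score, int(score > 0)

toy_model = {'happi': 2.0}  # invented value, for illustration only

print(decide([], 0.0, toy_model))       # no tokens survive preprocessing
print(decide(['zzz'], 0.0, toy_model))  # only out-of-vocabulary tokens
print(decide(['happi'], 0.0, toy_model))
```

Any tweet with no usable tokens is therefore predicted negative regardless of its true label, which matches the blank rows above all having truth 1 and prediction 0.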


## 8. Predict Sentiment on Your Own Tweet

Finally, let's use our trained model to predict the sentiment of a custom tweet!

In [8]:
my_tweet = 'I am happy because I am learning :)'
p = naive_bayes_predict(my_tweet, logprior, loglikelihood)
print(f"Tweet: {my_tweet}")
print(f"Predicted sentiment score: {p:.2f}")
if p > 0:
    print("Sentiment: Positive")
else:
    print("Sentiment: Negative")

Tweet: I am happy because I am learning :)
Predicted sentiment score: 9.53
Sentiment: Positive
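
The score is a log-odds value: under the model it is the log of P(positive) over P(negative) for the tweet. If a probability is easier to read, the logistic function converts it. This is a sketch with a hypothetical helper name, and since Naive Bayes treats words as independent, the resulting probabilities tend to be more extreme than they should be.

```python
import math

def log_odds_to_probability(score):
    """Map a log-odds score to P(positive) with the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))

print(f"{log_odds_to_probability(9.53):.5f}")  # score from the example above
print(log_odds_to_probability(0.0))            # a score of 0 means even odds
```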
