#### AI-Powered Brand Sentiment Tracking from Twitter Trends: Naive Bayes 

We wanted to explore how AI could track brand sentiment across competitors.In this notebook, we will go through how we achieved implementing Naive Bayes. At the end, we will compare all CSV files for both starbucks and dunkin to understand our results. 


In [1]:
# IMPORT STATEMENTS
import pandas as pd
import re
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from collections import Counter
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
import math
import naive_bayes as nb
import warnings
warnings.filterwarnings('ignore')


#### 1. PREPARE AND LOAD DATA 

Cleaning Data. What did that entail?
- Pulling data from X. We used search_tweets.py to pull about 500 tweets. 
- We had to manually label our data. 
- We had to figure out what information was important for our model. 
For our 'starbucks.csv' data, that left us with 435 tweets. No duplicates, all tweets in english, no links, no mentions, no rewteets. 
To test our model, we will use starbucks.csv to start doing sentiment analysis.

Using the pandas library, we read in our csv file and dropped duplicates and null values, reindexed our data, and then visualized our data.  

In [2]:
# Read in the starbucks data and save it into a dataframe. Remove any duplicates and reindex data. 
df = pd.read_csv('starbucks.csv')
df.drop_duplicates(subset=['text'], keep='last', inplace=True)
df.dropna(subset=['text','label'], inplace=True)
df.dropna()
df.reindex()
df.head(500)

Unnamed: 0,text,label
0,broccoli head just came in and had me make a l...,neutral
1,I took my dog with me to Starbucks &amp; to go...,neutral
2,Tomorrow Is Teacher Appreciation Week So All M...,positive
3,why did starbucks just call us out infront of ...,neutral
4,My cousins jst posted a story of them on a Sta...,neutral
...,...,...
430,1. give cool houseless guy a fiver as he said ...,neutral
431,My American girlfriend will not drink fresh sq...,neutral
432,they’re bringing back the blue drink at starbu...,positive
433,It’s too many Starbucks stores and they keep c...,negative


#### 2. SPLIT OUR DATA 

We used the scikit libaray to split our data. Using their parameters, we split our data 30-70 and used the stratify param to ensure we had equal number of positive, neutral, and negative data training our model. 

In [3]:
train_df, test_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['label'])
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)

#### 3. PREPROCESS DATA 
Our first step in implementing our data into the alogorithm is preprocessing our data. That means, taking every tweet, breaking up the sentence into indivdiuals words and emojis, and saving it into an array as one of our tokens. We apply the preprocessing function to all of our text data using the .apply() function. We can test out how our data will look under the 'Tokens' col by printing out the top 5 messages using .head(). We wanted to incorporate emojis because it is such an integral part of communication. Saying "I love going grocery shopping ❤️" and "I love going grocery shopping 🙄" show two different sentiments using the same exact sentence. 
We applied our function to all of the positive, negative, and neutral messages in our train_df. 

In [4]:
def preprocess_text(text, remove_stopwords=True):
    # lowercase all words 
    lowercase = text.lower()
    # tokenize both words and emojis 
    emoji_pattern = re.compile("[\U0001F600-\U0001F64F\U0001F300-\U0001F5FF"
                        "\U0001F680-\U0001F6FF\U0001F1E0-\U0001F1FF"
                        "\U00002700-\U000027BF\U0001F900-\U0001F9FF"
                        "\U00002600-\U000026FF\u200d]+", flags=re.UNICODE)

    text = re.sub(r"[^a-z0-9\s" + emoji_pattern.pattern + "]", "", lowercase)
    tokens = text.split()
    #remove stopwords 
    stop_words = stopwords.words('english')
    if remove_stopwords:
        tokens = [w for w in tokens if w not in stop_words]
    return tokens 

test_message = 'i love going grocery shopping ❤️'
tokenized_message = preprocess_text(test_message)
print(f'Test Message: {test_message}')
print(f'Tokenized Message: {tokenized_message}')

Test Message: i love going grocery shopping ❤️
Tokenized Message: ['love', 'going', 'grocery', 'shopping', '❤️']


In [5]:
#take the training data and split it into positive, negative, and neutral dataframes
train_positive = train_df[train_df['label']=='positive']
train_negative = train_df[train_df['label']=='negative']
train_neutral = train_df[train_df['label']=='neutral']

#tokenize 
pos_counts = Counter()
neg_counts = Counter()
neu_counts = Counter()

for text in train_positive['text']:
    toks = preprocess_text(text)
    pos_counts.update(toks)
for text in train_negative['text']:
    toks = preprocess_text(text)
    neg_counts.update(toks)
for text in train_neutral['text']:
    toks = preprocess_text(text)
    neu_counts.update(toks)
    

#### 4. TERM FREQUENCY VECTORS

We calculated the number of unique words that were in our training data to understand our vocabulary size. 

In [6]:
positive_tokens_list = [list(pos_counts)] 
negative_tokens_list = [list(neg_counts)]
neutral_tokens_list  = [list(neu_counts)]

positive_word_counts = pos_counts
negative_word_counts = neg_counts
neutral_word_counts  = neu_counts

# total tokens per class
total_positive_tokens = sum(pos_counts.values())
total_negative_tokens = sum(neg_counts.values())
total_neutral_tokens  = sum(neu_counts.values())

# build vocab as union
vocab = set(pos_counts) | set(neg_counts) | set(neu_counts)
print(f"Vocabulary size: {len(vocab)}")
print(f"Total positive tokens: {total_positive_tokens}")

Vocabulary size: 2047
Total positive tokens: 1003


#### 5. CONDITIONAL PROBABILITY SMOOTHING

After that, we implemented the conditional probabilty function using laplace smoothing. We applied conditional probabilty to every token and returned whether each sentence was either positive, negative, or neutral. 

In [7]:
num_positive = len(train_positive)
num_negative = len(train_negative)
num_neutral = len(train_neutral)
num_total = len(train_df)

def conditional_probability_smoothing(word, class_label):
    V = len(vocab)
    if class_label == 'positive':
        count = positive_word_counts.get(word, 0)
        return (count + 1) / (total_positive_tokens + V)
    elif class_label == 'negative':
        count = negative_word_counts.get(word, 0)
        return (count + 1) / (total_negative_tokens + V)
    elif class_label == 'neutral':
        count = neutral_word_counts.get(word, 0)
        return (count + 1) / (total_neutral_tokens + V)
    else:
        raise ValueError("Invalid class")

# Compute the prior probabilities: P(positive), P(negative), P(neutral)
P_positive = num_positive/num_total
P_negative = num_negative/num_total
P_neutral = num_neutral/num_total

def naive_bayes(text):
    tokens = preprocess_text(text, remove_stopwords=True)

    # Initialize prob with prior prob
    log_prob_positive = math.log(P_positive)
    log_prob_negative = math.log(P_negative)
    log_prob_neutral = math.log(P_neutral)

    # Add log conditional prob to each word 
    for word in tokens:
        if word in vocab:
            log_prob_positive += math.log(conditional_probability_smoothing(word, 'positive'))
            log_prob_negative += math.log(conditional_probability_smoothing(word, 'negative'))
            log_prob_neutral += math.log(conditional_probability_smoothing(word, 'neutral'))
    
    if log_prob_positive > log_prob_negative and log_prob_positive > log_prob_neutral:
        return 'positive' 
    elif log_prob_negative > log_prob_positive and log_prob_negative > log_prob_neutral:
        return 'negative'
    else:
        return 'neutral'

# test messages
test_messages = [
    "I just went to this Paris Baguette bakery coffee place in Uptown Phoenix and I can see why Starbucks is having trouble bouncing back. So many other bakery coffee places are popping up in so many places and are way better than Starbucks.",  # likely negative
    "Omg I just tried the new matcha latte at starbucks and it's SO good!",  # likely positive
    "been at starbucks for 40 minutes and dont even have a page of this damn paper done this is bad" # likely neutral
]

for msg in test_messages:
    result = naive_bayes(msg)
    print(f'Message: {msg}')
    print(f'Predicted Sentiment: {result}\n')

Message: I just went to this Paris Baguette bakery coffee place in Uptown Phoenix and I can see why Starbucks is having trouble bouncing back. So many other bakery coffee places are popping up in so many places and are way better than Starbucks.
Predicted Sentiment: negative

Message: Omg I just tried the new matcha latte at starbucks and it's SO good!
Predicted Sentiment: positive

Message: been at starbucks for 40 minutes and dont even have a page of this damn paper done this is bad
Predicted Sentiment: neutral



#### 6. TEST OUR MODEL USING TRAIN_DF

Now that we can predict whether a message is positive, negative, or neutral, we can individual run through each message in train_df and compare our new array of labels to the actual labels. We will test a few messages from train_df to show how our model is woking. 

In [8]:
y_true = list(test_df['label'])
y_pred = []

for text in test_df['text']:
    pred_label = naive_bayes(text)
    y_pred.append(pred_label)

test_df["predicted"] = y_pred 

for i in range(10):
    actual = test_df['label'][i]
    predicted = test_df['predicted'][i]
    print(f'message: {test_df['text'][i]}')
    print(f'{actual} -> {predicted}\n\n')


message: My cock is so hard while I sit at this Starbucks and talk to my sluts.
neutral -> neutral


message: I’ve worked at McDonalds and been a barista at Starbucks. I know the deal.
neutral -> neutral


message: i’ve got a whole collection of my name misspelled on starbucks cups

so now I’m curious, how do you think you’re supposed to pronounce “aix”?
neutral -> neutral


message: Decided to try that new coffee shop downtown, will their latte be better than starbucks?
neutral -> negative


message: Omg I just reconnected with my old Starbucks barista so blessed
positive -> neutral


message: Smashing burgers to be a thick a Starbucks straw is unAmerican.
neutral -> neutral


message: I want to make a Starbucks pink drink.
neutral -> positive


message: Starbucks for the first time 🤣
neutral -> positive


message: In a Starbucks, striped sweater on, modest mouse playing, screenplay OUT WHO WANNA READ IT
positive -> neutral


message: Coffee is so much better when it's cold I'm really

#### 7. Analyze our Data 

To make it easier for us, we transferred our work here into 'naive_bayes.py' to run naive bayes on all of our csv files. We will use scikit's library to get the our classification report. This will tell us how our model performed in terms of accuracy, precision, recall, and f1 scoring. Our files include: 

- csv file 'starbucks.csv'
- csv file 'starbucks2.csv'
- last df holding both csv files 
- csv file 'dunkin.csv'
- final df of all 3 files to maximize training  

In [9]:
df1 = pd.read_csv('starbucks.csv')
df1.drop_duplicates(subset=['text'], keep='last', inplace=True)
df1.dropna()
df1.reindex()

starbucks1 = nb.naive_bayes(df1)

Accuracy: 0.6564885496183206
Precision: 0.6603256655190082
Recall: 0.6564885496183206
F1 Score: 0.650448632897622
Classification Report:
              precision    recall  f1-score   support

    negative       0.61      0.39      0.48        28
     neutral       0.74      0.73      0.73        62
    positive       0.58      0.73      0.65        41

    accuracy                           0.66       131
   macro avg       0.64      0.62      0.62       131
weighted avg       0.66      0.66      0.65       131



In [10]:
df2 = pd.read_csv('starbucks2.csv')
df2.drop_duplicates(subset=['text'], keep='last', inplace=True)
df2.dropna()
df2.reindex()

starbucks2 = nb.naive_bayes(df2)

Accuracy: 0.6291390728476821
Precision: 0.6246215704824977
Recall: 0.6291390728476821
F1 Score: 0.6196794476762884
Classification Report:
              precision    recall  f1-score   support

    negative       0.59      0.62      0.61        53
     neutral       0.60      0.39      0.48        38
    positive       0.67      0.78      0.72        60

    accuracy                           0.63       151
   macro avg       0.62      0.60      0.60       151
weighted avg       0.62      0.63      0.62       151



In [11]:
sb = pd.concat([df1, df2], ignore_index=True)

starbucks = nb.naive_bayes(sb)

Accuracy: 0.5567375886524822
Precision: 0.5642241962126303
Recall: 0.5567375886524822
F1 Score: 0.5413435346282688
Classification Report:
              precision    recall  f1-score   support

    negative       0.64      0.53      0.58        81
     neutral       0.54      0.34      0.42       100
    positive       0.53      0.79      0.63       101

    accuracy                           0.56       282
   macro avg       0.57      0.55      0.54       282
weighted avg       0.56      0.56      0.54       282



In [12]:
d = pd.read_csv('dunkin.csv')
d.drop_duplicates(subset=['text'], keep='last', inplace=True)
d.dropna(subset=['text','label'], inplace=True)
d.reindex()

dunkin = nb.naive_bayes(d)

Accuracy: 0.5
Precision: 0.5489046344837629
Recall: 0.5
F1 Score: 0.4827500850892663
Classification Report:
              precision    recall  f1-score   support

    negative       0.29      0.18      0.22        33
     neutral       0.78      0.40      0.53        63
    positive       0.46      0.76      0.57        66

    accuracy                           0.50       162
   macro avg       0.51      0.45      0.44       162
weighted avg       0.55      0.50      0.48       162



In [13]:
all_data = pd.concat([df1, df2, d], ignore_index=True)

_NB = nb.naive_bayes(all_data)

Accuracy: 0.54627539503386
Precision: 0.5667467087885334
Recall: 0.54627539503386
F1 Score: 0.5320721352980172
Classification Report:
              precision    recall  f1-score   support

    negative       0.60      0.41      0.49       114
     neutral       0.60      0.39      0.48       163
    positive       0.51      0.79      0.62       166

    accuracy                           0.55       443
   macro avg       0.57      0.53      0.53       443
weighted avg       0.57      0.55      0.53       443



In [14]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# 1. Load the data
df = df

# 2. Inspect the first few rows to confirm structure
print("Sample data:")
print(df.head(), "\n")

# 3. Split into features and labels
X = df['text']
y = df['label']

# 4. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# 5. Vectorize text using TF-IDF
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words='english',
    max_df=0.9,
    min_df=5
)
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# 6. Train a Multinomial Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train_vec, y_train)

# 7. Evaluate on the test set
y_pred = nb.predict(X_test_vec)
print("Classification Report:")
print(classification_report(y_test, y_pred))

Sample data:
                                                text     label
0  broccoli head just came in and had me make a l...   neutral
1  I took my dog with me to Starbucks &amp; to go...   neutral
2  Tomorrow Is Teacher Appreciation Week So All M...  positive
3  why did starbucks just call us out infront of ...   neutral
4  My cousins jst posted a story of them on a Sta...   neutral 

Classification Report:
              precision    recall  f1-score   support

    negative       0.67      0.29      0.40        28
     neutral       0.60      0.90      0.72        62
    positive       0.81      0.51      0.63        41

    accuracy                           0.65       131
   macro avg       0.69      0.57      0.58       131
weighted avg       0.68      0.65      0.62       131

