# Machine learning: Naive Bayes for text prediction

### Here I use a multinomial naive Bayes classifier to predict whether a Twitter message is a rumor (true, false, or unverified) or not
Data from Kaggle: https://www.kaggle.com/datasets/syntheticprogrammer/rumor-detection-acl-2017  
Rumor Detection Dataset (Twitter15 and Twitter16)

In [66]:
# Import libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB

## Read in data

In [4]:
# read in tweet data
df = pd.read_csv("./twitter15/source_tweets.txt", sep='\t', header = None)
df.head()

Unnamed: 0,0,1
0,731166399389962242,🔥ca kkk grand wizard 🔥 endorses @hillaryclinto...
1,714598641827246081,an open letter to trump voters from his top st...
2,691809004356501505,america is a nation of second chances —@potus ...
3,693204708933160960,"brandon marshall visits and offers advice, sup..."
4,551099691702956032,rip elly may clampett: so sad to learn #beverl...


In [7]:
# read in labels
labels = pd.read_csv("./twitter15/label.txt", sep=':', header = None)
labels.head()

Unnamed: 0,0,1
0,unverified,731166399389962242
1,unverified,714598641827246081
2,non-rumor,691809004356501505
3,non-rumor,693204708933160960
4,true,551099691702956032


## Preprocess data

In [82]:
df.shape  # 1490 tweets; the third column is clean_text, because this cell was re-run after that column was added below

(1490, 3)

In [9]:
# rename columns
df.columns = ['tweet_id', 'tweet_text']
labels.columns = ['label', 'tweet_id']
print(df.head())
print(labels.head())

             tweet_id                                         tweet_text
0  731166399389962242  🔥ca kkk grand wizard 🔥 endorses @hillaryclinto...
1  714598641827246081  an open letter to trump voters from his top st...
2  691809004356501505  america is a nation of second chances —@potus ...
3  693204708933160960  brandon marshall visits and offers advice, sup...
4  551099691702956032  rip elly may clampett: so sad to learn #beverl...
        label            tweet_id
0  unverified  731166399389962242
1  unverified  714598641827246081
2   non-rumor  691809004356501505
3   non-rumor  693204708933160960
4        true  551099691702956032


In [15]:
# Look at all possible label values
labels.label.value_counts()

label
unverified    374
non-rumor     374
true          372
false         370
Name: count, dtype: int64

In [None]:
# 4 classes of tweets: non-rumor: Non-Rumor; false: False Rumor; true: True Rumor; unverified: Unverified Rumor

In [16]:
# check to see if our labels and tweets are in the same order
df['tweet_id'].equals(labels['tweet_id'])

True
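
The positional check above passes because both files list the tweets in the same order. A minimal sketch of a more defensive alternative is to join the two tables on `tweet_id`, so the pairing does not depend on row order. The two small frames below are made-up stand-ins (the `_demo` names and their contents are assumptions, not rows from the dataset).

```python
import pandas as pd

# Hypothetical toy frames standing in for df and labels (not real dataset rows)
tweets_demo = pd.DataFrame({
    "tweet_id": [101, 102, 103],
    "tweet_text": ["first tweet", "second tweet", "third tweet"],
})
labels_demo = pd.DataFrame({
    "label": ["true", "non-rumor", "false"],
    "tweet_id": [103, 101, 102],  # deliberately in a different order
})

# Join on the shared key; validate raises an error if a tweet_id is duplicated on either side
merged_demo = tweets_demo.merge(
    labels_demo, on="tweet_id", how="inner", validate="one_to_one"
)
print(merged_demo)
```

After the merge, each row carries its own label, so a later shuffle or filter cannot silently misalign text and target.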

In [28]:
# clean up text: remove links, mentions, emojis and punctuation
import re

def preprocess_text(text):
    text = re.sub(r"http\S+", "", text)  # Remove links that start with http
    text = re.sub(r"@\w+", "", text)     # Remove mentions
    # text = re.sub(r"#", "", text)      # Remove the '#' symbol (the \W+ step below already does this)
    text = re.sub(r"\W+", " ", text)     # Replace runs of non-word characters (emojis, punctuation, '#') with a space
    return text.strip().lower()

df['clean_text'] = df['tweet_text'].apply(preprocess_text)
df.head()

Unnamed: 0,tweet_id,tweet_text,clean_text
0,731166399389962242,🔥ca kkk grand wizard 🔥 endorses @hillaryclinto...,ca kkk grand wizard endorses neverhillary trum...
1,714598641827246081,an open letter to trump voters from his top st...,an open letter to trump voters from his top st...
2,691809004356501505,america is a nation of second chances —@potus ...,america is a nation of second chances on new r...
3,693204708933160960,"brandon marshall visits and offers advice, sup...",brandon marshall visits and offers advice supp...
4,551099691702956032,rip elly may clampett: so sad to learn #beverl...,rip elly may clampett so sad to learn beverlyh...


In [29]:
# compare a couple of our cleaned tweets to the original ones
print(df.tweet_text[0])
print(df.clean_text[0])

print(df.tweet_text[1])
print(df.clean_text[1])

🔥ca kkk grand wizard 🔥 endorses @hillaryclinton #neverhillary #trump2016 URL
ca kkk grand wizard endorses neverhillary trump2016 url
an open letter to trump voters from his top strategist-turned-defector URL via @xojanedotcom
an open letter to trump voters from his top strategist turned defector url via
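
In the output above, the word `url` survives cleaning: the dataset has already replaced links with the literal token `URL`, which the `http\S+` pattern does not match. Below is a sketch of one possible variant that also drops this placeholder. The function name `preprocess_text_drop_url` is hypothetical, and the scores reported later in this notebook were obtained without this extra step.

```python
import re

def preprocess_text_drop_url(text):
    """Variant of preprocess_text that also removes the literal URL placeholder (assumption)."""
    text = re.sub(r"http\S+", "", text)                        # remove real links
    text = re.sub(r"@\w+", "", text)                           # remove mentions
    text = re.sub(r"\burl\b", "", text, flags=re.IGNORECASE)   # remove the URL placeholder token
    text = re.sub(r"\W+", " ", text)                           # replace non-word runs with a space
    return text.strip().lower()

sample = "an open letter to trump voters from his top strategist-turned-defector URL via @xojanedotcom"
cleaned_sample = preprocess_text_drop_url(sample)
print(cleaned_sample)
```

Whether removing the token helps is an empirical question: it appears in many tweets, so it may carry little signal either way.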


In [None]:
# Set up our data and labels for the model

In [74]:
X = df['clean_text']  # Features (tweets)
y = labels['label']       # Target (rumor classes)

# split data into testing and training sets (80/20 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
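
The `stratify=y` argument keeps the class proportions the same in the training and test sets, which is why the test supports reported later are almost equal across the four classes. A small sketch on made-up labels (the `toy_` names are assumptions) shows the effect:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy labels: 10 examples for each of the four classes
toy_labels = ["false"] * 10 + ["non-rumor"] * 10 + ["true"] * 10 + ["unverified"] * 10
toy_texts = [f"tweet {i}" for i in range(len(toy_labels))]

_, _, _, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# With stratify, each class contributes 2 of the 8 test examples
test_counts = Counter(toy_y_test)
print(test_counts)
```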

In [75]:
# create a count vectorizer
count_vectorizer = CountVectorizer()

# fit the count vectorizer to the training data
count_vectorizer.fit(X_train)

# transform the training data using the count vectorizer
X_train_count = count_vectorizer.transform(X_train)

# transform the testing data using the count vectorizer
X_test_count = count_vectorizer.transform(X_test)

## Create classifier

In [76]:
# create a multinomial naive bayes classifier
clf = MultinomialNB()

# fit the classifier to the training data
clf.fit(X_train_count, y_train)
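
The vectorizing and fitting steps above can also be bundled into a single object. The sketch below uses `sklearn.pipeline.Pipeline` on made-up sentences; it is an alternative arrangement rather than what this notebook ran, and the step names `"counts"` and `"nb"` and the `toy_` variables are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data (not from the dataset)
toy_texts = [
    "aliens landed in ohio",
    "aliens spotted over ohio",
    "city council approves budget",
    "council votes on budget",
]
toy_labels = ["false", "false", "non-rumor", "non-rumor"]

# The pipeline fits the vectorizer and the classifier together on raw text
toy_pipeline = Pipeline([
    ("counts", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_pipeline.fit(toy_texts, toy_labels)

# predict() vectorizes the new text with the fitted vocabulary before classifying
toy_pred = toy_pipeline.predict(["aliens in ohio", "council budget"])
print(toy_pred)
```

A pipeline makes it harder to fit the vectorizer on test data by accident, and it can be passed directly to tools such as cross-validation.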

In [77]:
# make predictions on the testing data
y_pred = clf.predict(X_test_count)
print(classification_report(y_test, y_pred))

# precision = true positives / (true positives + false positives); high precision means the model rarely assigns a class to tweets that do not belong to it

# recall (aka sensitivity) = proportion of the actual instances of a class that the model identifies correctly
# recall formula: true positives / (true positives + false negatives)

# f1-score = harmonic mean of precision and recall
# f1 formula: 2 x (precision x recall) / (precision + recall)

# support = number of occurrences of each class in the test dataset

              precision    recall  f1-score   support

       false       0.80      0.80      0.80        74
   non-rumor       0.85      0.59      0.69        75
        true       0.78      0.93      0.85        74
  unverified       0.74      0.83      0.78        75

    accuracy                           0.79       298
   macro avg       0.79      0.79      0.78       298
weighted avg       0.79      0.79      0.78       298
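
With four classes, each row of the report treats one class as "positive" and the other three as "negative". A minimal sketch of that per-class calculation in plain Python, on six made-up labels (the function `per_class_scores` and the `_demo` data are assumptions for illustration):

```python
# Hypothetical true and predicted labels for six tweets (not from the dataset)
y_true_demo = ["true", "true", "false", "false", "non-rumor", "true"]
y_pred_demo = ["true", "false", "false", "true", "non-rumor", "true"]

def per_class_scores(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # correctly predicted as positive
    fp = sum(t != positive and p == positive for t, p in pairs)  # wrongly predicted as positive
    fn = sum(t == positive and p != positive for t, p in pairs)  # positives the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

precision, recall, f1 = per_class_scores(y_true_demo, y_pred_demo, positive="true")
print(precision, recall, f1)
```

Here the class `true` has 2 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all equal 2/3.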



In [78]:
# calculate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

# accuracy = number of tweets that are classified correctly divided by the total number of test tweets

0.785234899328859
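
Accuracy is a single number, so it does not show which classes get confused with each other. A confusion matrix does. The sketch below uses made-up labels (the `_cm` and `class_names` variables are assumptions) rather than this notebook's predictions:

```python
from sklearn.metrics import confusion_matrix

class_names = ["false", "non-rumor", "true", "unverified"]

# Hypothetical true and predicted labels (not from the dataset)
y_true_cm = ["false", "non-rumor", "non-rumor", "true", "unverified", "non-rumor"]
y_pred_cm = ["false", "non-rumor", "unverified", "true", "unverified", "true"]

# Rows are the actual classes, columns are the predicted classes, in class_names order
cm = confusion_matrix(y_true_cm, y_pred_cm, labels=class_names)
print(cm)

# Accuracy is the sum of the diagonal divided by the total number of examples
accuracy_demo = cm.trace() / cm.sum()
print(accuracy_demo)
```

Applied to the real data, the same call would be `confusion_matrix(y_test, y_pred, labels=clf.classes_)`.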


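## Conclusions
Overall, the model does a reasonably good job of classifying tweets into the four classes from the tweet text alone, with an accuracy of 0.79 on the 298 test tweets. Because the classes are almost perfectly balanced, always guessing a single class would score about 0.25. The weakest result is recall for non-rumor tweets (0.59): a large share of them are mislabeled as one of the rumor classes.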