# Machine Learning - Naive Bayes for Text Classification

This is a fun little project comparing the tweets of U.S. Senators running for president in 2020. The tweets are a small sample from spring 2019. Many of the topics are the same, but each Senator also tweeted about state-specific news. Their styles of writing differed in length, directness, and tone. Let's see how a very simple Naive Bayes classifier handles it.

In [1]:
import numpy as np
import pandas as pd
import os
from os import listdir
from os.path import isfile, join
import re
import itertools
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
import nltk
from nltk.tokenize import TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer

### Import data

In [2]:
path = "c:/Users/h_lig/Documents/Data Science Classes/GitHub Projects/Senator Tweets Naive Bayes/"

#List file names
files = [f for f in listdir(path) if isfile(join(path, f))]

#Create hash
tweets_hash = dict.fromkeys(files)

for file in files:
    # Use a context manager so each file handle is closed after reading
    with open(join(path, file), "r", encoding='utf-8', errors='ignore') as f:
        tweets_hash[file] = f.read()

In [3]:
tweets_hash.keys()

dict_keys(['SenBooker.txt', 'SenGillibrand.txt', 'SenHarris.txt', 'SenKlobuchar.txt', 'SenSanders.txt', 'SenWarren.txt'])

### Parse Data into Lists

Each tweet becomes an item in a list of all the tweets. The author of each tweet is saved in a separate, parallel list.

In [4]:
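# Create class for parsing text files by tweet
class parser():

    def listTweets(self, key):
        """Split one Senator's file into tweets and build a matching list of author names."""
        twts = tweets_hash[key]
        # Remove non-breaking spaces
        twts = twts.replace('\xa0', '')
        # Tweets are separated by blank lines
        twt_list = twts.split("\n\n")
        # The file name without its extension is the author label
        twt_author = key.replace('.txt', '')
        namelist = list(itertools.repeat(twt_author, len(twt_list)))
        return twt_list, namelist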

In [5]:
#Create parser object
tweet_parse = parser()

# Create list of tweets and list of tweeters
tweetslist = []
whosaidlist = []
for key in tweets_hash.keys():
    parsedtweets, author = tweet_parse.listTweets(key)
    tweetslist.extend(parsedtweets)
    whosaidlist.extend(author)

### Example of pre-processed tweet and who said it

In [6]:
tweetslist[10]

'On #NationalVoterRegistrationDay please check your registration, confirm that your information is up to date & make a commitment to cast your ballot on November 6th.'

In [7]:
whosaidlist[10]

'SenBooker'

### Prepare target vector

In [8]:
labels = ['SenBooker', 'SenGillibrand', 'SenHarris', 'SenKlobuchar', 'SenSanders', 'SenWarren']

In [9]:
# Encode each author as its index in `labels` (e.g. 'SenBooker' -> 0).
# An unknown author raises a ValueError instead of silently reusing the previous value.
target = [labels.index(s) for s in whosaidlist]

# We should have 457 target values, one per tweet.
len(target)

457

### Vectorize the Tweets

Create a document-term frequency matrix where each row is a tweet and each column is a word in the vocabulary.
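As a small illustration of what such a matrix looks like, here is a sketch on two made-up sentences (not the Senators' tweets). The names `toy_docs`, `toy_vectorizer`, `toy_matrix`, and `toy_terms` are hypothetical and exist only for this example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only (not from the tweet files)
toy_docs = ["vote early vote often", "register to vote"]

# Default settings: lowercase the text and count each word per document
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_docs)

# List the terms in column order by sorting on their vocabulary index
toy_terms = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(toy_terms)             # ['early', 'often', 'register', 'to', 'vote']
print(toy_matrix.toarray())  # row 0: [1 1 0 0 2], row 1: [0 0 1 1 1]
```

Each row is one document and each column is one term, so the 2 in the first row records that "vote" appears twice in the first sentence.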

In [10]:
# Initialize the vectorizer
# Also set some parameters for the vectorizer. I use NLTK's English stop-word list for stop-word removal.
# This requires the NLTK stopwords corpus; run nltk.download('stopwords') once if it is not installed.
nltk_stop_words = nltk.corpus.stopwords.words('english')
vectorizer = CountVectorizer(stop_words=nltk_stop_words)

# Fit and transform the tweets into the matrix
tfmatrix = vectorizer.fit_transform(tweetslist)

We can print the vocabulary that shows the numerical value associated with each word, although it is not necessary.

In [11]:
print(vectorizer.vocabulary_)




We can print the matrix, too.  It reads as follows:  

    (document index, vocabulary index)     frequency

For example, consider the sixth line of the printed output. In the tweet at index 0, the word with vocabulary index 2076 appears twice. That word is "nomination."


In [13]:
print(tfmatrix)

  (0, 2020)	1
  (0, 2320)	1
  (0, 1090)	1
  (0, 2999)	1
  (0, 777)	1
  (0, 2076)	2
  (0, 3374)	1
  (0, 1819)	1
  (0, 3005)	1
  (0, 2031)	2
  (0, 1312)	1
  (0, 1632)	1
  (0, 188)	1
  (0, 791)	1
  (0, 362)	1
  (0, 1681)	1
  (0, 1695)	1
  (0, 3350)	1
  (1, 3064)	1
  (1, 883)	1
  (1, 2944)	1
  (1, 2730)	1
  (1, 3095)	1
  (1, 153)	1
  (1, 433)	1
  :	:
  (455, 251)	1
  (455, 2952)	1
  (455, 121)	1
  (456, 2031)	1
  (456, 613)	1
  (456, 286)	1
  (456, 3252)	1
  (456, 3220)	1
  (456, 3371)	1
  (456, 2216)	1
  (456, 115)	1
  (456, 1753)	1
  (456, 2185)	1
  (456, 1040)	1
  (456, 2251)	1
  (456, 1617)	1
  (456, 1488)	1
  (456, 3017)	1
  (456, 577)	1
  (456, 1504)	1
  (456, 3250)	1
  (456, 272)	1
  (456, 3018)	1
  (456, 742)	1
  (456, 2026)	1
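To check which word sits behind a vocabulary index such as 2076, the term-to-index mapping can be inverted. The sketch below uses a small made-up dictionary, `toy_vocabulary`, as a stand-in for `vectorizer.vocabulary_`; its terms and indices are illustrative only and do not come from the tweet data.

```python
# Hypothetical stand-in for vectorizer.vocabulary_, which maps term -> column index
toy_vocabulary = {"ballot": 0, "nomination": 1, "register": 2, "vote": 3}

# Invert the mapping so a term can be looked up by its column index
index_to_term = {index: term for term, index in toy_vocabulary.items()}

print(index_to_term[1])  # nomination
```

Applying the same inversion to `vectorizer.vocabulary_` from the notebook should allow a lookup by any index that appears in the printed matrix.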


### Train the Model

In [14]:
# Create training and testing set
X_train, X_test, y_train, y_test = train_test_split(tfmatrix, target, test_size=0.3, random_state=0)

In [15]:
# Initialize the model
classifier = MultinomialNB()

# Fit the model on training data
classifier.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
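The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter, and `fit_prior=True` means class priors are estimated from the training data. To make the scoring rule concrete, here is a minimal pure-Python sketch of multinomial Naive Bayes on a made-up corpus. It is not scikit-learn's implementation; the names `train`, `ALPHA`, `log_posterior`, and `predict` are hypothetical and used only for illustration.

```python
import math
from collections import Counter, defaultdict

# Toy training data, made up for illustration (not from the tweet files)
train = [
    ("vote early vote often", "A"),
    ("register to vote", "A"),
    ("medicare for all", "B"),
    ("health care for all", "B"),
]
ALPHA = 1.0  # additive smoothing, playing the role of alpha=1.0

# Count words per class and documents per class
word_counts = defaultdict(Counter)
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

vocab = {word for counts in word_counts.values() for word in counts}


def log_posterior(text, label):
    """Unnormalized log P(label | text) under multinomial Naive Bayes."""
    score = math.log(doc_counts[label] / sum(doc_counts.values()))  # log prior
    total = sum(word_counts[label].values())
    for word in text.split():
        if word not in vocab:
            continue  # out-of-vocabulary words are skipped
        # Smoothed estimate of P(word | label)
        p = (word_counts[label][word] + ALPHA) / (total + ALPHA * len(vocab))
        score += math.log(p)
    return score


def predict(text):
    """Return the class with the highest log posterior."""
    return max(doc_counts, key=lambda label: log_posterior(text, label))


print(predict("vote for health care"))  # B
print(predict("register to vote"))      # A
```

Smoothing matters here because "vote" never appears in class B; without `ALPHA`, its estimated probability would be zero and the logarithm would be undefined.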

### Predict on Testing Data

In [17]:
# Predict labels for the testing data (X_test is already vectorized)
y_test_predict = classifier.predict(X_test)

# Print the predicted label for the first item in the testing data.
# The prediction is that this tweet belongs to the Senator whose index is 2.
y_test_predict[0]

2

In [18]:
# Index 2 belongs to:
labels[2]

'SenHarris'

In [19]:
# Test model accuracy
print('The accuracy of the model is: ' + str(classifier.score(X_test, y_test)))

The accuracy of the model is: 0.5072463768115942


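At about 51%, the model accuracy is low, although it is well above the roughly 17% expected from guessing uniformly among six Senators. Why is it not higher? The model as set up is very simple: it uses only word counts. A more complex version that takes into account the length of a tweet or other features might do better. Additionally, the tweets are very similar in content across Senators because they were all tweeting about the same current news.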
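One way to put an accuracy figure in context is to compare it with a majority-class baseline, which always predicts the most frequent training label. The sketch below shows the idea on made-up label lists; `y_train_toy`, `y_true`, and `y_pred` are hypothetical and are not the notebook's `y_train`, `y_test`, or predictions.

```python
from collections import Counter

# Toy labels, made up for illustration only
y_train_toy = [0, 1, 2, 2, 2, 3, 0, 2]
y_true = [0, 0, 0, 1, 1, 2, 2, 2, 2, 3]
y_pred = [0, 0, 1, 1, 2, 2, 2, 2, 0, 3]


def accuracy(truth, predicted):
    """Fraction of positions where the predicted label equals the true label."""
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


# Baseline: always predict the most frequent label seen in training
majority_label = Counter(y_train_toy).most_common(1)[0][0]
baseline_accuracy = accuracy(y_true, [majority_label] * len(y_true))

model_accuracy = accuracy(y_true, y_pred)
print(f"model: {model_accuracy:.2f}, majority baseline: {baseline_accuracy:.2f}")
# model: 0.70, majority baseline: 0.40
```

Running the same comparison with the notebook's `y_train`, `y_test`, and `y_test_predict` would show how much the classifier gains over always guessing the most frequent Senator.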