# Building a tweet classifier in Python

I think it's fair to say that we are currently living in a particularly partisan era of politics. This, coupled with the rise of populist figures such as Trump, Sanders, and Corbyn, makes it an interesting time for politics. Twitter makes things even more interesting because it lets our politicians reach us directly on our phones. In the case of Donald Trump, the results are equally fascinating and horrifying.

This leads me to the question: can we build a machine learning model to tell us who wrote a given tweet? I am fairly certain the answer is "yes", so I am going to put my findings in this notebook.

## Step 1: Collecting the data

There is a script in this folder called get_tweets.py, which uses Tweepy to scrape the most recent 3,000 or so tweets from a number of accounts. The ones I chose for this were those of Donald Trump, Barack Obama, Hillary Clinton, Bernie Sanders, and... me. I would recommend that readers take the time to familiarise themselves with this script if they intend to do something similar.

## Step 2: Pre-Processing the data

All the tweets I collected are in a text file, one per line, with the tweet and the user separated by ',::,' (I wanted a delimiter that was unlikely to come up in the course of a normal tweet). I am going to load pandas and then store all of these tweets in a dataframe.

In [1]:
#Import the pandas library to store my data in a dataframe
import pandas as pd

#Make a list of lists, where each sub list is [tweet,user]
tweet_list = []
labels = ['Tweet','User']
with open('tweets.txt','r') as tweets:
    for line in tweets:
        tweet_list.append(line.split(',::,'))

#Create the df, drop any rows where User == None and see the first 5 entries
tweet_df = pd.DataFrame.from_records(tweet_list,columns = labels)

tweet_df = tweet_df.dropna(axis=0)
tweet_df.head()

        

| | Tweet | User |
|---|---|---|
| 0 | If you feel compelled to play fruit machines a... | James Blood\n |
| 1 | Never occurred to me that I might get a card. ... | James Blood\n |
| 2 | @colmocinneide It's the mass influx of cheap E... | James Blood\n |
| 3 | @tchanpoker I wonder what breast cheese would ... | James Blood\n |
| 4 | I don't pay for the cpu time, so aug-cc-pVTZ h... | James Blood\n |
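The User column above still carries the trailing newline from each line of the file. A minimal sketch of one way to strip it while parsing, assuming the same ',::,' delimiter; `parse_line` is a hypothetical helper and is not part of the original notebook. Lines without the delimiter get `None` as their user, so they can still be dropped afterwards.

```python
DELIMITER = ',::,'

def parse_line(line):
    """Split one 'tweet,::,user' line into (tweet, user), without the newline."""
    # rpartition splits on the last delimiter, so the user is always the final field
    tweet, sep, user = line.rstrip('\n').rpartition(DELIMITER)
    if not sep:
        # No delimiter on this line: keep the text and mark the user as missing
        return user, None
    return tweet, user

rows = [parse_line(line) for line in ['Hello world,::,James Blood\n',
                                      'a stray line with no user\n']]
```

Stripping the newline would change the label strings shown in this notebook's outputs, which is why the cells here are left as they were run.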


Next we need to import some machine learning and NLP packages to start constructing the model. We will use the 'bag of words' approach as a basis. To start with, we will try models built with Random Forest and Naive Bayes classifiers. We will see how each of these gets on and then evaluate our next step.
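For readers who have not met the 'bag of words' idea before, here is a minimal sketch using only the standard library. The two example documents are made up for illustration; each document becomes a vector of word counts over a shared vocabulary, and word order is thrown away.

```python
from collections import Counter

# Two made-up documents, already lowercased
docs = ['make america great', 'great deal great news']

# Vocabulary: every distinct word across the documents, in sorted order
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Count vector for one document over the shared vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [bag_of_words(doc) for doc in docs]
```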

In [2]:
import nltk
from nltk.tokenize import RegexpTokenizer
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

import numpy as np

#set tokeniser, stemmer, and stopwords
#raw string so that \w is passed to the regex engine unchanged
tokeniser = RegexpTokenizer(r'\w+')
stemmer = SnowballStemmer('english')
sw = stopwords.words('english')

#there will be some links in the tweets that probably won't help
#let's add some extra stop words
link_text_words = ['http','https','com','co','www','ly']
for word in link_text_words:
    sw.append(word)

def text_processor(doc):
    '''
    takes document and does the tokenisation, stemming, and removal
    of stopwords
    '''
    tokens = tokeniser.tokenize(doc)
    #remove the stop words
    filtered_words = []
    for token in tokens:
        if token not in sw:
            filtered_words.append(token)
    
    #stem the remaining words
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    
    return ' '.join(stemmed_words)

#process all the tweets and add them to the dataframe
tokenised_tweets = [text_processor(tweet) for tweet in tweet_df['Tweet']]

tweet_df['Tokens'] = tokenised_tweets
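One detail worth checking in a function like `text_processor`: NLTK's English stop-word list is lowercase, so a case-sensitive membership test lets capitalised stop words such as "If" through. A small sketch of the difference, using a hypothetical miniature stop-word set rather than the real list. Whether case folding actually helps this classifier is not something I have tested.

```python
# Hypothetical miniature stop-word set; NLTK's English list is likewise lowercase
stop_words = {'if', 'you', 'i', 'to'}

tokens = ['If', 'you', 'feel', 'compelled', 'to', 'play']

# Case-sensitive test: 'If' is not in the set, so it survives
case_sensitive = [t for t in tokens if t not in stop_words]

# Folding case before the test removes it
case_folded = [t for t in tokens if t.lower() not in stop_words]
```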

In [3]:
#Get the TFidf for each tweet
tf_idf = TfidfVectorizer()
X = tf_idf.fit_transform(tweet_df['Tokens'])
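To give a feel for what the vectoriser computes, here is a hand-rolled sketch of TF-IDF on two made-up tokenised documents. It assumes scikit-learn's default settings (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row). It is an illustration only and not a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

# Two made-up tokenised documents
docs = [['great', 'deal'], ['great', 'news', 'news']]
n_docs = len(docs)

# Document frequency: in how many documents each term appears
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """L2-normalised TF-IDF weights for one tokenised document."""
    counts = Counter(doc)
    weights = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

weights = tfidf(docs[0])
```

In the first document, 'deal' ends up with a larger weight than 'great', because 'great' appears in both documents and is therefore less distinctive.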

We have cleaned our tweets ready for classification, but we need numerical values for our users. We will use the scikit-learn LabelEncoder that we imported earlier.

In [4]:
user_le = LabelEncoder()
tweet_df['User ID'] = user_le.fit_transform(tweet_df['User'].astype(str))
tweet_df.head()

| | Tweet | User | Tokens | User ID |
|---|---|---|---|---|
| 0 | If you feel compelled to play fruit machines a... | James Blood\n | if feel compel play fruit machin hilton park s... | 4 |
| 1 | Never occurred to me that I might get a card. ... | James Blood\n | never occur i might get card brought tear eye ... | 4 |
| 2 | @colmocinneide It's the mass influx of cheap E... | James Blood\n | colmocinneid it mass influx cheap european phd... | 4 |
| 3 | @tchanpoker I wonder what breast cheese would ... | James Blood\n | tchanpok i wonder breast chees would tast like... | 4 |
| 4 | I don't pay for the cpu time, so aug-cc-pVTZ h... | James Blood\n | i pay cpu time aug cc pvtz 3xmphgpsh0 | 4 |
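The reason I come out as class 4 is that LabelEncoder numbers the labels in sorted order. A minimal standard-library sketch of the same idea, on a made-up list of users:

```python
users = ['James Blood', 'Donald Trump', 'Barack Obama', 'Donald Trump']

# Sorted distinct labels; a label's position in this list is its integer ID
classes = sorted(set(users))
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in users]

# Decoding is just a lookup back into the sorted list
decoded = [classes[i] for i in encoded]
```

With the fitted encoder itself, the `classes_` attribute and `inverse_transform` method give the same mapping back without having to write it out by hand.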


We have computed the TF-IDF features and given numerical labels to our users, so let's have a go at training some models. We need to split our data into training and test sets, bearing in mind that, because of the way the data was read into the dataframe, the tweets are currently sorted by class. The data must therefore be shuffled before splitting. Luckily, train_test_split does that by default.

In [5]:
y = tweet_df['User ID'].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
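To see why the shuffle matters for data sorted by class, here is a toy sketch with made-up labels. It is a plain-Python illustration of the idea and does not reproduce train_test_split's own implementation.

```python
import random

# Toy labels sorted by class, as they would be when read account by account
labels = ['Obama'] * 4 + ['Trump'] * 4 + ['Blood'] * 2
cut = int(len(labels) * 0.8)

# Without shuffling, the last class never reaches the training set
unshuffled_train = labels[:cut]

# With a seeded shuffle of the indices first, the split is a random sample
indices = list(range(len(labels)))
random.Random(42).shuffle(indices)
shuffled_train = [labels[i] for i in indices[:cut]]
shuffled_test = [labels[i] for i in indices[cut:]]
```

train_test_split also accepts a `stratify` argument that keeps the class proportions the same in both halves. I did not use it here, so the split above is a purely random one.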


Now that we have split the data, we can fit our models and see how they perform. As previously mentioned, we are going to try a multinomial Naive Bayes classifier and a random forest. MultinomialNB is often used for text classification, so I have high hopes for it.

In [6]:
rf = RandomForestClassifier(n_estimators = 40)
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
acc = rf.score(X_test,y_test)
print(acc)

0.783850931677


In [7]:
mnb = MultinomialNB()
mnb.fit(X_train,y_train)
y_pred = mnb.predict(X_test)
acc = mnb.score(X_test,y_test)
print(acc)

0.809627329193
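For anyone curious about what a multinomial Naive Bayes classifier does under the hood, here is a minimal count-based sketch with Laplace smoothing (`alpha=1.0`, which is also MultinomialNB's default). The training documents are made up, and `train_mnb` and `predict_mnb` are hypothetical helper names. Note that the notebook feeds TF-IDF weights to scikit-learn's implementation, whereas this sketch uses raw word counts.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    """Fit class log-priors and Laplace-smoothed word log-likelihoods."""
    vocab = {w for doc in docs for w in doc}
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    class_counts = Counter(labels)
    model = {}
    for label, n in class_counts.items():
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = (
            math.log(n / len(labels)),
            {w: math.log((word_counts[label][w] + alpha) / total) for w in vocab},
        )
    return model

def predict_mnb(model, doc):
    """Return the class with the highest log-posterior; unseen words are skipped."""
    def score(label):
        log_prior, log_lik = model[label]
        return log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(model, key=score)

# Made-up training data
docs = [['fake', 'news', 'bad'], ['health', 'care', 'afford'], ['fake', 'bad', 'deal']]
labels = ['Trump', 'Obama', 'Trump']
model = train_mnb(docs, labels)
prediction = predict_mnb(model, ['bad', 'news'])
```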


The Naive Bayes model is about 2.6 percentage points more accurate than the Random Forest (81.0% against 78.4%), which makes sense given how widely that algorithm is used for text classification. Even more happily, fitting the model was about three orders of magnitude faster than fitting the random forest. Let's see what it got wrong:

In [8]:
#Dict of User IDs to people:
tweeters = {0: 'Obama', 1: 'Sanders', 2: 'Trump', 3: 'Clinton', 4: 'Blood'}
#get all our prediction data
predictions = zip(X_test, y_test, y_pred)

#print out the corpus, user, and predicted user for the failures
for p in predictions:
    if p[1] != p[2]:
        print(
            tf_idf.inverse_transform(p[0]), 'Actual:', tweeters[p[1]],
            ', Predicted: ', tweeters[p[2]])

[array(['afternoon', 'beauti', 'break', 'grc', 'jelfscompchem',
       'legwpvlfoh', 'mountain', 'nanopor', 'nh', 'rt', 'trip', 'white'], 
      dtype='<U102')] Actual: Blood , Predicted:  Trump
[array(['61inovwczf', 'bad', 'deal', 'don', 'make', 'news', 'sander',
       'trump'], 
      dtype='<U102')] Actual: Sanders , Predicted:  Trump
[array(['afford', 'coverag', 'don', 'go', 'health', 'p8dqwo89c7', 'sign',
       'today', 'wait'], 
      dtype='<U102')] Actual: Sanders , Predicted:  Obama
[array(['htt'], 
      dtype='<U102')] Actual: Clinton , Predicted:  Blood
[array(['black', 'casino', 'dealer', 'fine', 'floor', 'one', 'regul',
       'remov', 'repeat', 'state', 'trump'], 
      dtype='<U102')] Actual: Clinton , Predicted:  Sanders
[array(['cissieglynch', 'evangel', 'k5kgxpr2wa', 'live', 'pdpryor1',
       'pynanc', 'saysgabriell', 'trumptow', 'women'], 
      dtype='<U102')] Actual: Trump , Predicted:  Clinton
[array(['nationalpetday', 'nvursfujq', 'side'], 
      dtype='<U102

A lot of the mistakes look like they have come from the shortened URLs that I didn't clean out of the data. One of my jokey tweets, in which I watched Schweinsteiger score for Chicago Fire, was classified as being by Obama, presumably because he tweets about Chicago a lot.
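A possible next step would be to strip links before tokenising, so the random-looking URL fragments never become features. A minimal sketch with a regular expression; the pattern is my assumption about how links appear in the text, `strip_urls` is a hypothetical helper, and the example tweet and link are made up. It will not catch links that were cut off mid-way, such as a bare 'htt'.

```python
import re

# Matches http(s) links up to the next whitespace (an assumed pattern)
URL_PATTERN = re.compile(r'https?://\S+')

def strip_urls(text):
    """Remove links and collapse the whitespace they leave behind."""
    return ' '.join(URL_PATTERN.sub(' ', text).split())

cleaned = strip_urls('Bad deal! https://t.co/abc123XYZ Sad.')
```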

I think 81% accuracy is pretty good for five separate classes with very little tuning of the model or the features. Before I get too carried away, however, let's cross-validate it to try to make sure we haven't over-fit our model.

In [9]:
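#random_state only has an effect when shuffle=True (recent versions of
#scikit-learn raise an error if it is set without it), so it is left out
kfold = StratifiedKFold(n_splits = 10)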
#initialise list for accuracy scores
scores = []
#perform cv
for train,test in kfold.split(X,y):
    X_train, X_test = X[train],X[test]
    y_train, y_test = y[train],y[test]
    mnb.fit(X_train, y_train)
    y_pred = mnb.predict(X_test)
    score = mnb.score(X_test,y_test)
    scores.append(score)
    #print model information
    print('Acc %.3f' %score)

print('mean_score: %.3f, standard deviation: %.3f' % (np.mean(scores),np.std(scores)))

Acc 0.702
Acc 0.838
Acc 0.813
Acc 0.812
Acc 0.806
Acc 0.827
Acc 0.797
Acc 0.800
Acc 0.790
Acc 0.767
mean_score: 0.795, standard deviation: 0.036
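The point of a stratified split is that every fold keeps roughly the same class mix as the full data set. Here is a rough standard-library sketch of that idea, dealing each class's samples out to the folds in turn. `stratified_folds` is a hypothetical helper, and this is not the exact assignment scheme StratifiedKFold uses.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing every class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds

# Made-up labels: six tweets from one account, three from another
labels = ['Obama'] * 6 + ['Trump'] * 3
folds = stratified_folds(labels, n_splits=3)
```

Each of the three folds ends up with two 'Obama' samples and one 'Trump' sample, mirroring the 2:1 ratio of the whole list.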


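The 81% accuracy from the single train/test split lies within one standard deviation of the mean cross-validated accuracy (0.795 ± 0.036), so I think we can be reasonably confident that the model was not over-fit to that one split and should generalise to unseen tweets from these five accounts.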
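The comparison between the single-split accuracy and the cross-validation results can be checked directly from the numbers printed above. A small standard-library sketch; the scores are copied from the outputs in this notebook.

```python
from statistics import mean, pstdev

# Fold accuracies and the single-split accuracy, copied from the outputs above
fold_scores = [0.702, 0.838, 0.813, 0.812, 0.806, 0.827, 0.797, 0.800, 0.790, 0.767]
holdout_acc = 0.809627329193

cv_mean = mean(fold_scores)
cv_std = pstdev(fold_scores)  # population form, matching np.std's default

# True when the single-split accuracy is within one standard deviation of the mean
within_one_std = abs(holdout_acc - cv_mean) <= cv_std
```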

Hopefully this project was at least partially interesting to you. I would welcome any comments.