# Twitter Spam Learning: Classifer
## CSCE 670 Spring 2018, Course Project
### By: Rose Lin (826009602)

There will be series of notebooks outlining how we train our models. This one looks into [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), [random forest](https://en.wikipedia.org/wiki/Random_forest) and [SVM](https://en.wikipedia.org/wiki/Support_vector_machine) in details.

## Load the aggregated user data

Our main data sources come from the [Bot Repository](https://botometer.iuni.iu.edu/bot-repository/datasets.html):

* Varol-2017, which was released in 2017. It contains 2,573 user IDs crawled in April 2016. We repeatedly called Twitter API to crawl account and tweet information. Despite some suspended accounts, we were able to get information of 2,401 users. 
* cresci-2017, annotated by CrowdFlower contributors. We downloded the whole dataset and used the following labels: genuine, social_spambots_1, social_spambots_2, social_spambots_3, traditional_spambots_1, traditional_spambots_2, traditional_spambots_3, and traditional_spambots_4.

Initially, our aggregated user dataset was imbalanced. We had ~7,000 labeled spammers and ~5,000 labeled legitimate users. To balance it out, we did not use oversampling/downsampling; rather, we utilized Twitter API again. Our crawler started from President Trump's [Twitter account](https://twitter.com/realDonaldTrump) and scraped his friends lists, his friends' following lists, and so on until we collected 2,000 rows of user data. We assume that President Trump is following real users, and his friends follow authentic accounts as well. In this way, we were able to gather a balanced user dataset with spammers:legitimate users ratio roughly to be 1:1. It is acknowledged that our aggregated dataset may subject to biases. If time permits, we will collect a larger dataset that covers as many groups as possible.

In [6]:
#Loading the aggregated data
import pandas as pd
import os
from datetime import date
cwd = os.getcwd()

print "Loading the user data..."
user = pd.read_csv(cwd+"/all_users_balanced.csv",sep=',',header='infer')
print "Number of users:",user.id.nunique()
print "Number of spammers:",len(user[user.user_type == 1])
print "Number of legitmate users:",len(user[user.user_type == 0])

Loading the user data...
Number of users: 14202
Number of spammers: 7293
Number of legitmate users: 7364


In [7]:
# Look at the head
user.head()

Unnamed: 0,contributors_enabled,crawled_at,created_at,default_profile,default_profile_image,description,favourites_count,follow_request_sent,followers_count,following,...,profile_text_color,profile_use_background_image,protected,screen_name,statuses_count,time_zone,url,user_type,utc_offset,verified
0,,2014-04-19 14:46:19,Tue Mar 17 08:51:12 +0000 2009,1.0,1.0,,1,,22,,...,333333,1.0,,davideb66,1299,Rome,,1,7200.0,
1,,2014-05-18 23:20:58,Sun Apr 19 14:38:04 +0000 2009,,,Autrice del libro #unavitatuttacurve dal 9 apr...,16358,,12561,,...,333333,1.0,,ElisaDospina,18665,Greenland,http://t.co/ceK8TovxwI,1,-7200.0,
2,,2014-05-13 23:21:54,Wed May 13 15:34:41 +0000 2009,,,[Live Long and Prosper],14,,600,,...,333333,1.0,,Vladimir65,22987,Rome,,1,7200.0,
3,,2014-05-19 23:24:18,Wed Jul 15 12:55:03 +0000 2009,,,"Cuasi Odontologa*♥,#Bipolar, #Sarcastica & Som...",11,,398,,...,3E4415,1.0,,RafielaMorales,7975,Pacific Time (US & Canada),,1,-25200.0,
4,,2014-05-11 23:22:23,Wed Aug 05 21:12:49 +0000 2009,,,"I shall rise from my own death, to avenge hers...",162,,413,,...,D67345,1.0,,FabrizioC_c,20218,Rome,http://t.co/PK5F0JDKcy,1,7200.0,


We mainly converted the json response from Twitter API into the dataframe, with two additional features:
* crawled_at: the date a record was crawled. It will be used for account age computation.
* user_type: 0 = normal users, 1 = spammers. It serves as a binary indicator.

## Feature Engineering

Based on the visualization results, we will consider the following account features:

* Count of favorite tweets
* Friends to follower ratio
* Total status count
* Default profile image
* Default profile
* Account ages
* Username, count of characters
* Username, count of numbers
* Screen_name, count of characters
* Screen_name, count of numbers
* Length of description 
* Description text
* Average tweet per day

These features will be derived from the original dataset.

In [8]:
from datetime import date

# Create a new dataframe to store the result
usert = pd.DataFrame()
# add count of favorite tweets
usert['favorite_count'] = user['favourites_count']
# add friends to follower ratio
usert['friends_to_followers'] = user['friends_count'] / user['followers_count']
# add total status count
usert['statuses_count'] = user['statuses_count']
# add default profile image
temp_df = pd.get_dummies(user['default_profile_image'])
temp_df.columns = ['def_p_img_na','def_p_img_false','def_p_img_true']
usert = pd.concat([usert, temp_df], axis=1)
# add default profile
temp_df = pd.get_dummies(user['default_profile'])
temp_df.columns = ['def_p_na','def_p_false','def_p_true']
usert = pd.concat([usert, temp_df], axis=1)
# add account ages 
agedf = pd.to_datetime(user['crawled_at'])-pd.to_datetime(user['created_at'])
usert['age'] = agedf.dt.days
# add username, count of characters and letters
for index, item in user['name'].iteritems():
    letter = 0
    num = 0
    for c in item:
        if c.isalpha():
            letter += 1
        elif c.isdigit():
            num += 1
    usert.loc[index,'name_letter'] = letter
    usert.loc[index,'name_num'] = num
# add screen name, count of characters and letters
for index, item in user['screen_name'].iteritems():
    letter = 0
    num = 0
    for c in item:
        if c.isalpha():
            letter += 1
        elif c.isdigit():
            num += 1
    usert.loc[index,'screen_letter'] = letter
    usert.loc[index,'screen_num'] = num
# add len of description
usert['des_len'] = pd.Series([len(d) for d in user['description']])

In [9]:
# add description text (TFIDF)
import re, string
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

tfidf_transformer = TfidfVectorizer()
des_text = tfidf_transformer.fit_transform(user['description'].tolist())
des_text
# because the description text is tooooo large, we won't add it to the dataframe

<14656x21265 sparse matrix of type '<type 'numpy.float64'>'
	with 124455 stored elements in Compressed Sparse Row format>

In [10]:
# add average tweet per day
usert['avg_tweet_per_day'] = usert['statuses_count']/usert['age']
# Now let's look at the new dataframe!
print usert.head()
print ""
print usert.describe()

   favorite_count  friends_to_followers  statuses_count  def_p_img_na  \
0               1              1.818182            1299             0   
1           16358              0.274023           18665             1   
2              14              1.258333           22987             1   
3              11              0.879397            7975             1   
4             162              0.980630           20218             1   

   def_p_img_false  def_p_img_true  def_p_na  def_p_false  def_p_true   age  \
0                0               1         0            0           1  1859   
1                0               0         1            0           0  1855   
2                0               0         1            0           0  1826   
3                0               0         1            0           0  1769   
4                0               0         1            0           0  1740   

   name_letter  name_num  screen_letter  screen_num  des_len  \
0         13.0       0

Even though we have all the desired features now, we still need to do some final checks so that our classifiers can process these data without any question. We are mainly concern about 1) duplicates, and 2) missing values.

## Analysis on duplicates

In [11]:
# identify duplicate rows in the original dataframe
count = 0
for index, row in user.duplicated().iteritems():
    if row is True:
        print index, row
        count += 1
print "There are",count,"duplicated records in total."

10289 True
There are 1 duplicated records in total.


In [12]:
# see the duplicate
print user.iloc[10289]
# drop it from the user and usert dataframes
user.drop([10289],inplace=True)
usert.drop([10289],inplace=True)
# also update description text here
des_text = tfidf_transformer.fit_transform(user['description'].tolist())

contributors_enabled                                                                0.0
crawled_at                                                          2016-04-30 00:00:00
created_at                                               Sat Mar 26 11:56:05 +0000 2011
default_profile                                                                     0.0
default_profile_image                                                               0.0
description                           私は支那人です。日本皇军大好！大東亞共榮萬歲！支那事變日軍被迫進入支那解救僑民和駐軍，也是為...
favourites_count                                                                  84243
follow_request_sent                                                                    
followers_count                                                                    2271
following                                                                              
friends_count                                                                       640
geo_enabled                     

In [13]:
# Check again
count = 0
for index, row in user.duplicated().iteritems():
    if row is True:
        print index, row
        count += 1
print "There are",count,"duplicated records in total."

There are 0 duplicated records in total.


Looks like we are free from duplicates now! How about missing values?

## Analysis on missing values

In [14]:
# Check to see if there is any NA
for c in usert.columns.values:
    print "At column",c,"# of NA records:",usert[c].isnull().sum()

At column favorite_count # of NA records: 0
At column friends_to_followers # of NA records: 331
At column statuses_count # of NA records: 0
At column def_p_img_na # of NA records: 0
At column def_p_img_false # of NA records: 0
At column def_p_img_true # of NA records: 0
At column def_p_na # of NA records: 0
At column def_p_false # of NA records: 0
At column def_p_true # of NA records: 0
At column age # of NA records: 0
At column name_letter # of NA records: 0
At column name_num # of NA records: 0
At column screen_letter # of NA records: 0
At column screen_num # of NA records: 0
At column des_len # of NA records: 0
At column avg_tweet_per_day # of NA records: 0


So all other columns are good except for the *friends_to_followers* ratio column that contains some NA. Let's see what these records are.

In [15]:
# Extract the indexes of row that are nan
nan_index = [index for index, row in usert.friends_to_followers.iteritems() if np.isnan(row)]
for i in nan_index:
    record = user.iloc[i]
    print record.id, record.friends_count, record.followers_count

465196345 0 0
465306140 0 0
465318952 0 0
465320112 0 0
465325633 0 0
465328572 0 0
465335657 0 0
465338328 0 0
465343577 0 0
465349136 0 0
465360460 0 0
465366176 0 0
465369079 0 0
465369276 0 0
465371415 0 0
465373317 0 0
465373507 0 0
465373538 0 0
465376231 0 0
465376509 0 0
465378609 0 0
465379325 0 0
465387088 0 0
465387671 0 0
465392182 0 0
465393290 0 0
465398499 0 0
465418896 0 0
465427410 0 0
465428761 0 0
465434130 0 0
465434388 0 0
465435993 0 0
465442625 0 0
465445130 0 0
466109264 0 0
466114317 0 0
466116893 0 0
466121357 0 0
466124818 0 0
466125372 0 0
466126074 0 0
466143639 0 0
466152441 0 0
466154098 0 0
466155797 0 0
466163583 0 0
466175265 0 0
466182757 0 0
466183322 0 0
466184623 0 0
466188621 0 0
466189042 0 0
466189857 0 0
466192045 0 0
466192470 0 0
466194674 0 0
466195745 0 0
466200874 0 0
466205550 0 0
466212113 0 0
466217028 0 0
466217564 0 0
466220361 0 0
466225850 0 0
466226701 0 0
466226882 0 0
466227486 0 0
466229029 0 0
466232314 0 0
466234508 0 0
466235

Looks like most of the nan comes from dividing 0. Nonetheless, we do have two records that appear to have normal friends and follower counts:

| id        | friends_count | followers_count |
|-----------|---------------|-----------------|
| 704523342 | 476           | 142             |
| 923339203 | 3846          | 3668            |

To properly handle this issue, we would set the ratio to be **100000** if the followers_count is 0 (not infinity because the classifier can't handle infinities). For the two special cases, we hope that proper divison may help mitigate them (the classical python 2 division problem).

In [16]:
# Fixing the friends_to_followers ratio
# This way is slower but hopefully more accurate
from __future__ import division

for index, row in user.iterrows():
    if row['followers_count'] == 0:
        usert.loc[index,'friends_to_followers'] = 100000
    else:
        usert.loc[index,'friends_to_followers'] = row['friends_count'] / row['followers_count']

# Check if there is still any NA
print "At column friends_to_followers, # of NA records:",usert['friends_to_followers'].isnull().sum()

At column friends_to_followers, # of NA records: 0


Terrific! Now we are free from NAs :) We can proceed to the next step: training classifiers!

In [12]:
# to store the data
%store usert
%store user
%store des_text

Stored 'usert' (DataFrame)
Stored 'user' (DataFrame)
Stored 'des_text' (csr_matrix)


In [1]:
# to restore data
%store -r usert
%store -r user
%store -r des_text

## Classifier: training

We will split the whole dataset into 80% training and 20% testing.

Given that we have more observations than features, one may argue that we should normalize our data first. Nonetheless, our trained model will be used in online prediction. We have not came up with a way of normalizing input data under the online setting. Thus, we won't perform any transformation on our data further.

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(usert, user.user_type, test_size=0.2, random_state=0)
des_text_train, des_text_test = train_test_split(des_text, test_size=0.2, random_state=0)

# check the size
print X_train.shape, y_train.shape
print X_test.shape, y_test.shape
print des_text_train.shape, des_text_test.shape

(11724, 16) (11724L,)
(2932, 16) (2932L,)
(11724, 21265) (2932, 21265)


In [3]:
# New feature set - with TFIDF!
X_train_new = np.hstack((X_train,des_text_train.toarray()))
X_test_new = np.hstack((X_test,des_text_test.toarray()))

In [6]:
# run the actual classifier
# without descrption text
from sklearn import linear_model
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

# Initial version: use default setting
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)

# Getting some evaluation metrics here
print "Logistic Regression report:"
print classification_report(y_test, pred)
print "Logistic Regression accuracy:", accuracy_score(y_test, pred)
print "ROC score:",roc_auc_score(y_test, pred)

Logistic Regression report:
             precision    recall  f1-score   support

          0       0.84      0.80      0.82      1518
          1       0.80      0.84      0.81      1414

avg / total       0.82      0.82      0.82      2932

Logistic Regression accuracy: 0.816848567531
ROC score: 0.817477865799


In [14]:
# with text
logreg.fit(X_train_new, y_train)
pred = logreg.predict(X_test_new)

# Getting some evaluation metrics here
print "Logistic Regression report:"
print classification_report(y_test, pred)
print "Logistic Regression accuracy:", accuracy_score(y_test, pred)
print "ROC score:",roc_auc_score(y_test, pred)

Logistic Regression report:
             precision    recall  f1-score   support

          0       0.83      0.78      0.80      1518
          1       0.77      0.83      0.80      1414

avg / total       0.80      0.80      0.80      2932

Logistic Regression accuracy: 0.800477489768
ROC score: 0.801376876818


It looks like adding description text TFIDF increases the number of features, thus causing an overfitting issue.

Next, we will try Random Forest.

In [6]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Getting some evaluation metrics here
print "Random Forest report:"
print classification_report(y_test, pred)
print "Random Forest accuracy:", accuracy_score(y_test, pred)
print "ROC score:",roc_auc_score(y_test, pred)

Random Forest report:
             precision    recall  f1-score   support

          0       0.94      0.96      0.95      1518
          1       0.96      0.93      0.95      1414

avg / total       0.95      0.95      0.95      2932

Random Forest accuracy: 0.948158253752
ROC score: 0.947559973389


In [16]:
# how about w/ text features?
clf.fit(X_train_new, y_train)
pred = clf.predict(X_test_new)

# Getting some evaluation metrics here
print "Random Forest report:"
print classification_report(y_test, pred)
print "Random Forest accuracy:", accuracy_score(y_test, pred)
print "ROC score:",roc_auc_score(y_test, pred)

Random Forest report:
             precision    recall  f1-score   support

          0       0.91      0.99      0.95      1518
          1       0.98      0.89      0.94      1414

avg / total       0.94      0.94      0.94      2932

Random Forest accuracy: 0.941336971351
ROC score: 0.939688378776


Similarly, random forest suffers when the description text is added. How about SVM?

In [7]:
# No model adjustment version
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Getting some evaluation metrics here
print "SVM report:"
print classification_report(y_test, pred)
print "SVM accuracy:", accuracy_score(y_test, pred)
print "ROC score:",roc_auc_score(y_test, pred)

SVM report:
             precision    recall  f1-score   support

          0       0.72      1.00      0.84      1518
          1       1.00      0.58      0.74      1414

avg / total       0.86      0.80      0.79      2932

SVM accuracy: 0.798772169168
ROC score: 0.791371994342


In [18]:
# get support vectors
print clf.support_vectors_

# get indices of support vectors
print clf.support_ 

# get number of support vectors for each class
print clf.n_support_ 

[[  5.80000000e+02   4.58823529e+00   3.79000000e+03 ...,   0.00000000e+00
    1.29000000e+02   1.76525384e+00]
 [  6.84600000e+03   1.66101695e+00   1.71870000e+04 ...,   2.00000000e+00
    1.60000000e+02   6.55491991e+00]
 [  8.00000000e+01   2.63736264e-01   6.39000000e+02 ...,   0.00000000e+00
    1.74000000e+02   9.06382979e-01]
 ..., 
 [  0.00000000e+00   1.00000000e+05   4.70000000e+01 ...,   0.00000000e+00
    3.00000000e+01   5.85305106e-02]
 [  0.00000000e+00   8.39874411e-01   1.80000000e+02 ...,   0.00000000e+00
    1.08000000e+02   7.18562874e-02]
 [  1.00000000e+00   2.17721519e+00   1.13460000e+04 ...,   0.00000000e+00
    3.70000000e+01   2.11679104e+01]]
[    1     3     5 ..., 11709 11713 11718]
[5626 3729]


In [None]:
# how about w/ text features?
clf.fit(X_train_new, y_train)
pred = clf.predict(X_test_new)

# Getting some evaluation metrics here
print "SVM report:"
print classification_report(y_test, pred)
print "SVM accuracy:", accuracy_score(y_test, pred)
print "ROC score:",roc_auc_score(y_test, pred)

(SVM above timeout)

Our initial thought was to use LASSO to scale down the size of total features available (especially with TFIDF included). Nonetheless, it seems that LASSO is a variation of the generalized linear model and thus not applicable for this project (we are doing classification instead of regression). Thus, we won't explore further.

Below is a summary of model performance: (average reported for precision, recall and F1-score)

** NO TFIDF **

| Model               | Accuracy | Precision | Recall | F1-Score | ROC Score |
|---------------------|----------|-----------|--------|----------|-----------|
| Logistic Regression | 0.8168   | 0.82      | 0.82   | 0.82     | 0.8174    |
| Random Forest       | 0.9482   | 0.95      | 0.95   | 0.95     | 0.9475    |
| SVM                 | 0.7987   | 0.86      | 0.80   | 0.79     | 0.7913    |

** WITH TFIDF **

| Model               | Accuracy | Precision | Recall | F1-Score | ROC Score |
|---------------------|----------|-----------|--------|----------|-----------|
| Logistic Regression | 0.8004   | 0.80      | 0.80   | 0.80     | 0.8014    |
| Random Forest       | 0.9413   | 0.94      | 0.94   | 0.94     | 0.9397    |
| SVM                 |          |           |        |          |           |


It seems that without the description text, our models perform a little bit better.

Next, we will attempt to fine tune the model and consider output it for our website.

## Model Fine-tuning

## Model output