# Twitter Spam Learning: Classifier
## CSCE 670 Spring 2018, Course Project
### By: Rose Lin (826009602)

This is one of a series of notebooks outlining how we train our models. This one looks at [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression), [random forest](https://en.wikipedia.org/wiki/Random_forest) and [SVM](https://en.wikipedia.org/wiki/Support_vector_machine) in detail.

## Load the aggregated user data

Our main data sources come from the [Bot Repository](https://botometer.iuni.iu.edu/bot-repository/datasets.html):

* Varol-2017, which was released in 2017. It contains 2,573 user IDs crawled in April 2016. We repeatedly called the Twitter API to crawl account and tweet information. Although some accounts had been suspended, we were able to get information for 2,401 users. 
* cresci-2017, annotated by CrowdFlower contributors. We downloaded the whole dataset and used the following labels: genuine, social_spambots_1, social_spambots_2, social_spambots_3, traditional_spambots_1, traditional_spambots_2, traditional_spambots_3, and traditional_spambots_4.

Initially, our aggregated user dataset was imbalanced: we had ~7,000 labeled spammers and ~5,000 labeled legitimate users. To balance it, we did not oversample or downsample; instead, we used the Twitter API again. Our crawler started from President Trump's [Twitter account](https://twitter.com/realDonaldTrump) and scraped his friends list, his friends' following lists, and so on until we had collected 2,000 rows of user data. We assume that President Trump follows real users, and that his friends follow authentic accounts as well. 
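
The crawl described above is essentially a breadth-first walk over "following" lists. The sketch below illustrates the idea only: `crawl_friends`, `get_friends`, and the toy graph are hypothetical stand-ins for the real crawler and the Twitter API calls, which are not shown in this notebook.

```python
from collections import deque


def crawl_friends(seed, get_friends, limit):
    """Breadth-first walk over 'following' lists, starting from a seed account.

    `get_friends` is a hypothetical callable returning the accounts that a
    given account follows; in the real crawler this would be an API call.
    """
    seen = {seed}
    queue = deque([seed])
    collected = []
    while queue and len(collected) < limit:
        account = queue.popleft()
        for friend in get_friends(account):
            if friend in seen:
                continue  # skip accounts we have already visited
            seen.add(friend)
            collected.append(friend)
            queue.append(friend)
            if len(collected) >= limit:
                break
    return collected


# Toy follow graph with made-up account names, standing in for the API.
toy_graph = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["e"],
    "e": [],
}
accounts = crawl_friends("seed", lambda u: toy_graph.get(u, []), limit=4)
print(accounts)
```

The `seen` set prevents an account from being collected twice when it appears in several following lists, and the `limit` check stops the crawl once enough rows have been gathered.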

After the initial run, we noticed that our account-based model did not perform well, because most of the data was old (except for the 2,000+ records crawled from President Trump's network). Thus, we acquired more data using the Twitter API again. 

* We acquired information on an additional 9,573 spam accounts using the streaming API. We filtered the tweet stream on April 15 and 16, 2018, using the following keywords, on the assumption that any account sending a tweet containing one of them was a spam account: ['make money from home','enter to win','Credit Card', 'lonely', 'debt','deals','ad', '100% free','Act now','apply online','Click below','Click here', 'Extra cash','Offer expires', 'order now','Save $','Serious cash','Satisfaction guaranteed', 'Supplies are limited', 'trial','Work from home','you are a winner','your income','Weight loss','why pay more']. Sources of the spam keywords: [455 Spam Trigger Words to Avoid in 2018](https://prospect.io/blog/455-email-spam-trigger-words-avoid-2018/), [“SPAM Tweets” – 5 Buzzwords that Attract Spammers](http://www.adweek.com/digital/spam-tweets-5-buzzwords-that-attract-spammers/)

* We also scraped data for 9,464 ham (legitimate) users under the same assumption as for President Trump's network above, this time seeding the crawl with Dr. [Philip Guo](https://twitter.com/pgbovine/), Assistant Professor of Cognitive Science at UC San Diego.
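
To make the keyword assumption concrete, here is a minimal offline sketch of matching tweet text against a few of the keywords. It is not the streaming API's own matching logic, which is not shown here; `build_matcher`, `looks_like_spam`, and the choice of case-insensitive whole-word matching are assumptions made for illustration.

```python
import re

# A few of the keywords listed above (not the full list).
SPAM_KEYWORDS = ["make money from home", "enter to win", "ad", "100% free", "Click here"]


def build_matcher(keywords):
    """Compile one case-insensitive pattern that matches any keyword as a whole word."""
    # (?<!\w) and (?!\w) act as word boundaries that also work for keywords
    # that start with a digit or end with a symbol.
    parts = [r"(?<!\w)" + re.escape(k) + r"(?!\w)" for k in keywords]
    return re.compile("|".join(parts), flags=re.IGNORECASE)


matcher = build_matcher(SPAM_KEYWORDS)


def looks_like_spam(text):
    """Return True if the text contains at least one keyword."""
    return matcher.search(text) is not None


print(looks_like_spam("CLICK HERE to claim your prize"))
print(looks_like_spam("I had a bad day"))
```

Whole-word matching matters for short keywords such as 'ad': a plain substring test would also flag ordinary words like "bad" or "had".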

In this way, we gathered a balanced user dataset with a spammer-to-legitimate-user ratio of roughly 1:1 (15,731 spammers, 16,828 legitimate users). We acknowledge that our aggregated dataset may be subject to biases. If time permits, we will collect a larger dataset that covers as many groups as possible.

In [11]:
#Loading the aggregated data
import pandas as pd
import numpy as np
import os
from datetime import date
cwd = os.getcwd()

print("Loading the user data...")
user = pd.read_csv(cwd+"/416.csv",sep=',',header='infer',low_memory=False)
print("Number of users:",user.id.nunique())
print("Number of spammers:",len(user[user.user_type == '1']))
print("Number of legitimate users:",len(user[user.user_type == '0']))

Loading the user data...
Number of users: 32500
Number of spammers: 15731
Number of legitimate users: 16828


In [12]:
# Look at the head
user.head()

Unnamed: 0,index,contributors_enabled,crawled_at,created_at,default_profile,default_profile_image,description,favourites_count,follow_request_sent,followers_count,...,profile_text_color,profile_use_background_image,protected,screen_name,statuses_count,time_zone,url,user_type,utc_offset,verified
0,0,,2014-04-19 14:46:19,Tue Mar 17 08:51:12 +0000 2009,1.0,1.0,,1,,22,...,333333,1.0,,davideb66,1299,Rome,,1,7200.0,
1,1,,2014-05-18 23:20:58,Sun Apr 19 14:38:04 +0000 2009,,,Autrice del libro #unavitatuttacurve dal 9 apr...,16358,,12561,...,333333,1.0,,ElisaDospina,18665,Greenland,http://t.co/ceK8TovxwI,1,-7200.0,
2,2,,2014-05-13 23:21:54,Wed May 13 15:34:41 +0000 2009,,,[Live Long and Prosper],14,,600,...,333333,1.0,,Vladimir65,22987,Rome,,1,7200.0,
3,3,,2014-05-19 23:24:18,Wed Jul 15 12:55:03 +0000 2009,,,"Cuasi Odontologa*♥,#Bipolar, #Sarcastica & Som...",11,,398,...,3E4415,1.0,,RafielaMorales,7975,Pacific Time (US & Canada),,1,-25200.0,
4,4,,2014-05-11 23:22:23,Wed Aug 05 21:12:49 +0000 2009,,,"I shall rise from my own death, to avenge hers...",162,,413,...,D67345,1.0,,FabrizioC_c,20218,Rome,http://t.co/PK5F0JDKcy,1,7200.0,


In [13]:
user.dtypes

index                                  int64
contributors_enabled                  object
crawled_at                            object
created_at                            object
default_profile                       object
default_profile_image                 object
description                           object
favourites_count                       int64
follow_request_sent                   object
followers_count                        int64
following                             object
friends_count                          int64
geo_enabled                           object
id                                     int64
is_translator                         object
lang                                  object
listed_count                           int64
location                              object
name                                  object
notifications                         object
profile_background_color              object
profile_background_image_url          object
profile_ba

In [14]:
# need to reset type here
user.contributors_enabled = user.contributors_enabled.astype('category')
user.default_profile = user.default_profile.astype('category')
user.default_profile_image = user.default_profile_image.astype('category')
user.geo_enabled = user.geo_enabled.astype('category')
user.is_translator = user.is_translator.astype('category')
user.profile_background_tile = user.profile_background_tile.astype('category')
user.profile_use_background_image = user.profile_use_background_image.astype('category')
user.user_type = pd.to_numeric(user.user_type, downcast='integer', errors='coerce')
user.user_type = user.user_type.fillna(0.0).astype('int64')
#user.crawled_at = user.crawled_at.astype('datetime64')
#user.created_at = user.created_at.astype('datetime64')
#user.favourites_count = user.favourites_count.astype('float64')
#user.followers_count = user.followers_count.astype('float64')
#user.friends_count = user.friends_count.astype('float64')

In [15]:
user.dtypes

index                                    int64
contributors_enabled                  category
crawled_at                              object
created_at                              object
default_profile                       category
default_profile_image                 category
description                             object
favourites_count                         int64
follow_request_sent                     object
followers_count                          int64
following                               object
friends_count                            int64
geo_enabled                           category
id                                       int64
is_translator                         category
lang                                    object
listed_count                             int64
location                                object
name                                    object
notifications                           object
profile_background_color                object
profile_backg

We mainly converted the JSON response from the Twitter API into a dataframe, adding two features:
* crawled_at: the date a record was crawled. It will be used for account age computation.
* user_type: 0 = normal users, 1 = spammers. It serves as a binary indicator.
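
As a small worked example of the account age computation, the sketch below parses the two timestamp formats visible in the first row of the table above and takes the difference in whole days. The helper name `account_age_days` is hypothetical, and treating `crawled_at` as UTC is an assumption, since the crawl timestamps carry no timezone.

```python
from datetime import datetime, timezone


def account_age_days(created_at, crawled_at):
    """Whole days between account creation and the crawl date."""
    # Twitter's created_at format, e.g. 'Tue Mar 17 08:51:12 +0000 2009'
    created = datetime.strptime(created_at, "%a %b %d %H:%M:%S %z %Y")
    # crawled_at has no timezone; we assume UTC here
    crawled = datetime.strptime(crawled_at, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return (crawled - created).days


age = account_age_days("Tue Mar 17 08:51:12 +0000 2009", "2014-04-19 14:46:19")
print(age)
```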

## Feature Engineering

Based on the visualization results, we will consider the following account features:

* Count of favorite tweets
* Friends to follower ratio
* Total status count
* Default profile image
* Default profile
* Account age
* Username, count of letters
* Username, count of digits
* Screen_name, count of letters
* Screen_name, count of digits
* Length of description 
* Description text
* ~~Average tweet per day~~ (eventually removed because it correlated with total status count and ages)

These features will be derived from the original dataset.
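
As a small illustration of the name-based features in the list above, the sketch below counts letters and digits in a single string. The helper `count_letters_and_digits` is a hypothetical name; the same function could be applied to each value of a column, for example with `Series.apply`.

```python
def count_letters_and_digits(text):
    """Return (number of alphabetic characters, number of digit characters)."""
    letters = sum(ch.isalpha() for ch in text)
    digits = sum(ch.isdigit() for ch in text)
    return letters, digits


print(count_letters_and_digits("davideb66"))
print(count_letters_and_digits("PrizeTrain55275"))
```

Characters that are neither letters nor digits (underscores, spaces, emoji) are simply not counted in either total.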

In [17]:
from datetime import date

# Create a new dataframe to store the result
usert = pd.DataFrame()
# add count of favorite tweets
usert['favorite_count'] = user['favourites_count']
# add friends to follower ratio
usert['friends_to_followers'] = user['friends_count'] / user['followers_count']
# add total status count
usert['statuses_count'] = user['statuses_count']
# add default profile image
temp_df = pd.get_dummies(user['default_profile_image'])
temp_df.columns = ['def_p_img_na','def_p_img_false','def_p_img_true']
usert = pd.concat([usert, temp_df], axis=1)
# add default profile
temp_df = pd.get_dummies(user['default_profile'])
temp_df.columns = ['def_p_na','def_p_false','def_p_true']
usert = pd.concat([usert, temp_df], axis=1)
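# add account age in days (utc=True keeps both timestamps timezone-aware so they can be subtracted)
agedf = pd.to_datetime(user['crawled_at'], utc=True)-pd.to_datetime(user['created_at'], utc=True)
usert['age'] = agedf.dt.days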
# add username, count of letters and digits
for index, item in user['name'].items():
    letter = 0
    num = 0
    for c in item:
        if c.isalpha():
            letter += 1
        elif c.isdigit():
            num += 1
    usert.loc[index,'name_letter'] = letter
    usert.loc[index,'name_num'] = num
# add screen name, count of letters and digits
for index, item in user['screen_name'].items():
    letter = 0
    num = 0
    for c in item:
        if c.isalpha():
            letter += 1
        elif c.isdigit():
            num += 1
    usert.loc[index,'screen_letter'] = letter
    usert.loc[index,'screen_num'] = num
# add length of description (a missing description counts as length 0)
usert['des_len'] = user['description'].fillna('').astype(str).str.len()

In [18]:
# no description text (TFIDF)
#import re, string
#from nltk.stem import PorterStemmer
#from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

#tfidf_transformer = TfidfVectorizer()
#des_text = tfidf_transformer.fit_transform(user['description'].tolist())
#des_text
# because the description text is tooooo large, we won't add it to the dataframe

In [19]:
# add average tweet per day
#usert['avg_tweet_per_day'] = usert['statuses_count']/usert['age']
# Now let's look at the new dataframe!
print(usert.head())
print("")
print(usert.describe())

   favorite_count  friends_to_followers  statuses_count  def_p_img_na  \
0               1              1.818182            1299             0   
1           16358              0.274023           18665             1   
2              14              1.258333           22987             1   
3              11              0.879397            7975             1   
4             162              0.980630           20218             1   

   def_p_img_false  def_p_img_true  def_p_na  def_p_false  def_p_true   age  \
0                0               1         0            0           1  1859   
1                0               0         1            0           0  1855   
2                0               0         1            0           0  1826   
3                0               0         1            0           0  1769   
4                0               0         1            0           0  1740   

   name_letter  name_num  screen_letter  screen_num  des_len  
0         13.0       0.

Even though we now have all the desired features, we still need some final checks so that our classifiers can process the data without problems. We are mainly concerned about 1) duplicates, and 2) missing values.

## Analysis on duplicates

In [20]:
# identify duplicate rows in the original dataframe
count = 0
for index, row in user.duplicated().items():
    if row:
        print(index, row)
        count += 1
print("There are",count,"duplicated records in total.")

There are 0 duplicated records in total.


Looks like we are free from duplicates now! How about missing values?

## Analysis on missing values

In [21]:
# Check to see if there is any NA
for c in usert.columns.values:
    print("At column",c,"# of NA records:",usert[c].isnull().sum())

At column favorite_count # of NA records: 0
At column friends_to_followers # of NA records: 381
At column statuses_count # of NA records: 0
At column def_p_img_na # of NA records: 0
At column def_p_img_false # of NA records: 0
At column def_p_img_true # of NA records: 0
At column def_p_na # of NA records: 0
At column def_p_false # of NA records: 0
At column def_p_true # of NA records: 0
At column age # of NA records: 0
At column name_letter # of NA records: 0
At column name_num # of NA records: 0
At column screen_letter # of NA records: 0
At column screen_num # of NA records: 0
At column des_len # of NA records: 0


So all other columns are good except for the *friends_to_followers* ratio column that contains some NA. Let's see what these records are.

In [22]:
# Extract the indexes of row that are nan
nan_index = [index for index, row in usert.friends_to_followers.items() if np.isnan(row)]
for i in nan_index:
    record = user.iloc[i]
    print(record.id, record.friends_count, record.followers_count)

465196345 0 0
465306140 0 0
465318952 0 0
465320112 0 0
465325633 0 0
465328572 0 0
465335657 0 0
465338328 0 0
465343577 0 0
465349136 0 0
465360460 0 0
465366176 0 0
465369079 0 0
465369276 0 0
465371415 0 0
465373317 0 0
465373507 0 0
465373538 0 0
465376231 0 0
465376509 0 0
465378609 0 0
465379325 0 0
465387088 0 0
465387671 0 0
465392182 0 0
465393290 0 0
465398499 0 0
465418896 0 0
465427410 0 0
465428761 0 0
465434130 0 0
465434388 0 0
465435993 0 0
465442625 0 0
465445130 0 0
466109264 0 0
466114317 0 0
466116893 0 0
466121357 0 0
466124818 0 0
466125372 0 0
466126074 0 0
466143639 0 0
466152441 0 0
466154098 0 0
466155797 0 0
466163583 0 0
466175265 0 0
466182757 0 0
466183322 0 0
466184623 0 0
466188621 0 0
466189042 0 0
466189857 0 0
466192045 0 0
466192470 0 0
466194674 0 0
466195745 0 0
466200874 0 0
466205550 0 0
466212113 0 0
466217028 0 0
466217564 0 0
466220361 0 0
466225850 0 0
466226701 0 0
466226882 0 0
466227486 0 0
466229029 0 0
466232314 0 0
466234508 0 0
466235

Looks like the NaNs shown above come from dividing 0 by 0: these accounts have neither friends nor followers.

To handle this, we set the ratio to **100000** whenever followers_count is 0 (not infinity, because the classifiers can't handle infinities).
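
The toy example below shows why a zero followers_count needs special handling: in pandas, 0 / 0 yields NaN, while a non-zero count divided by 0 yields infinity. The dataframe `toy` and the name `SENTINEL` are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"friends_count": [10, 0, 5], "followers_count": [4, 0, 0]})

# Row 0 is an ordinary ratio, row 1 is 0/0 (NaN), row 2 is 5/0 (inf).
ratio = toy["friends_count"] / toy["followers_count"]
print(ratio.tolist())

# Replace the ratio with a large finite sentinel wherever followers_count is 0.
SENTINEL = 100000
ratio_fixed = ratio.mask(toy["followers_count"] == 0, SENTINEL)
print(ratio_fixed.tolist())
```

Keying the replacement on `followers_count == 0` rather than on `isnull()` covers both the NaN and the infinity case in one step.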

In [23]:
# Fixing the friends_to_followers ratio
# Vectorized: set the ratio to 100000 wherever followers_count is 0
usert.loc[user['followers_count'] == 0, 'friends_to_followers'] = 100000

# Check if there is still any NA
print("At column friends_to_followers, # of NA records:",usert['friends_to_followers'].isnull().sum())

At column friends_to_followers, # of NA records: 0


Terrific! Now we are free from NAs :) We can proceed to the next step: training classifiers!

Note: *avg_tweet_per_day* is left out because it correlates with other features (total status count and account age). The remaining features are stored so that the training steps below can reload them.

In [25]:
# to store the data
%store usert
%store user

Stored 'usert' (DataFrame)
Stored 'user' (DataFrame)


In [1]:
# to restore data
%store -r usert
%store -r user

## Classifier: training

We will split the whole dataset into 80% training and 20% testing.

Given that we have more observations than features, one may argue that we should normalize our data first. Nonetheless, our trained model will be used for online prediction, and we have not come up with a way of normalizing input data in the online setting. Thus, we do not transform our data any further.
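
One possible way to normalize inputs at prediction time, sketched here as an assumption rather than something this project did, is to wrap a scaler and a classifier in a single scikit-learn pipeline. The scaler learns its mean and variance from the training data only, and the fitted pipeline then applies the same transformation to every new input. The synthetic data and all variable names below are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Two synthetic features on very different scales, two well-separated classes.
X_toy = np.vstack([
    rng.normal(loc=[0, 0], scale=[1, 1000], size=(100, 2)),
    rng.normal(loc=[3, 3000], scale=[1, 1000], size=(100, 2)),
])
y_toy = np.array([0] * 100 + [1] * 100)

# The scaler is fitted on the training data as part of model.fit().
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_toy, y_toy)

# At prediction time a single raw record is scaled with the stored statistics.
new_account = np.array([[3.0, 3000.0]])
prediction = model.predict(new_account)
print(prediction)
```

Because the scaler's statistics live inside the fitted pipeline, the whole object can be persisted and reused as one unit.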

In [27]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(usert, user.user_type, test_size=0.2, random_state=0)
#des_text_train, des_text_test = train_test_split(des_text, test_size=0.2, random_state=0)

# check the size
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
#print(des_text_train.shape, des_text_test.shape)

(26949, 15) (26949,)
(6738, 15) (6738,)


In [28]:
# New feature set - with TFIDF!
#X_train_new = np.hstack((X_train,des_text_train.toarray()))
#X_test_new = np.hstack((X_test,des_text_test.toarray()))

In [29]:
# run the actual classifier
# without description text
from sklearn import linear_model
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

# Initial version: use default setting
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)

# Getting some evaluation metrics here
print("Logistic Regression report:")
print(classification_report(y_test, pred))
print("Logistic Regression accuracy:", accuracy_score(y_test, pred))
print("ROC score:",roc_auc_score(y_test, pred))

Logistic Regression report:
             precision    recall  f1-score   support

          0       0.74      0.77      0.75      3419
          1       0.75      0.72      0.73      3319

avg / total       0.75      0.75      0.74      6738

Logistic Regression accuracy: 0.7451766102701098
ROC score: 0.7447525529710484


In [30]:
# with text
#logreg.fit(X_train_new, y_train)
#pred = logreg.predict(X_test_new)

# Getting some evaluation metrics here
#print("Logistic Regression report:")
#print(classification_report(y_test, pred))
#print("Logistic Regression accuracy:", accuracy_score(y_test, pred))
#print("ROC score:",roc_auc_score(y_test, pred))

It looks like adding description text TFIDF increases the number of features, thus causing an overfitting issue.

Next, we will try Random Forest.

In [31]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

# Getting some evaluation metrics here
print("Random Forest report:")
print(classification_report(y_test, pred))
print("Random Forest accuracy:", accuracy_score(y_test, pred))
print("ROC score:",roc_auc_score(y_test, pred))

Random Forest report:
             precision    recall  f1-score   support

          0       0.86      0.90      0.88      3419
          1       0.89      0.84      0.87      3319

avg / total       0.87      0.87      0.87      6738

Random Forest accuracy: 0.8725140991392104
ROC score: 0.8720828459715179


In [32]:
# how about w/ text features?
#rf.fit(X_train_new, y_train)
#pred = rf.predict(X_test_new)

# Getting some evaluation metrics here
#print("Random Forest report:")
#print(classification_report(y_test, pred))
#print("Random Forest accuracy:", accuracy_score(y_test, pred))
#print("ROC score:",roc_auc_score(y_test, pred))

Similarly, random forest suffers when the description text is added. How about SVM?

In [33]:
# No model adjustment version
# Note: `svm = svm.SVC()` below rebinds the name `svm` from the module to the classifier instance
from sklearn import svm

svm = svm.SVC()
svm.fit(X_train, y_train)
pred = svm.predict(X_test)

# Getting some evaluation metrics here
print("SVM report:")
print(classification_report(y_test, pred))
print("SVM accuracy:", accuracy_score(y_test, pred))
print("ROC score:",roc_auc_score(y_test, pred))

SVM report:
             precision    recall  f1-score   support

          0       0.59      1.00      0.74      3419
          1       1.00      0.29      0.45      3319

avg / total       0.79      0.65      0.60      6738

SVM accuracy: 0.651231819531018
ROC score: 0.6459777041277494


In [34]:
# get support vectors
print(svm.support_vectors_)

# get indices of support vectors
print(svm.support_) 

# get number of support vectors for each class
print(svm.n_support_)

[[1.04000000e+02 2.91666667e+00 1.96000000e+02 ... 1.00000000e+01
  0.00000000e+00 7.70000000e+01]
 [9.00000000e+01 5.82877960e-02 7.98000000e+02 ... 1.30000000e+01
  0.00000000e+00 1.43000000e+02]
 [1.05000000e+02 9.76627713e-01 4.05900000e+03 ... 9.00000000e+00
  0.00000000e+00 1.56000000e+02]
 ...
 [4.68900000e+03 2.96624088e+00 1.48010000e+04 ... 1.10000000e+01
  0.00000000e+00 6.40000000e+01]
 [1.34780000e+04 7.14845839e-01 1.18595000e+05 ... 1.00000000e+01
  0.00000000e+00 1.00000000e+00]
 [1.08000000e+02 2.55000000e+00 1.42000000e+03 ... 5.00000000e+00
  2.00000000e+00 1.00000000e+00]]
[    1     2     3 ... 26943 26944 26947]
[13022 11366]


In [None]:
# how about w/ text features?
#svm.fit(X_train_new, y_train)
#pred = svm.predict(X_test_new)

# Getting some evaluation metrics here
#print("SVM report:")
#print(classification_report(y_test, pred))
#print("SVM accuracy:", accuracy_score(y_test, pred))
#print("ROC score:",roc_auc_score(y_test, pred))

Our initial thought was to use LASSO to reduce the number of features (especially with TFIDF included). However, LASSO in its standard form is an L1-penalized linear regression model, so it does not apply directly to this project (we are doing classification instead of regression). Thus, we did not explore it further.
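
One related option that this notebook does not explore is putting an L1 penalty on logistic regression itself, which keeps the task a classification problem while still driving some coefficients to exactly zero. The sketch below uses synthetic data; the data, the variable names, and the value `C=0.05` are illustrative assumptions, not part of the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# 20 synthetic features, of which only the first two determine the label.
X_toy = rng.normal(size=(300, 20))
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

# The liblinear solver supports the L1 penalty; a smaller C means stronger regularization.
l1_logreg = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
l1_logreg.fit(X_toy, y_toy)

# Features whose coefficient is exactly zero are effectively dropped by the model.
n_kept = int(np.sum(l1_logreg.coef_ != 0))
print("features with non-zero coefficients:", n_kept)
```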

Below is a summary of model performance (precision, recall and F1-score are the averages from the reports above):

**No description TFIDF**

| Model               | Accuracy | Precision | Recall | F1-Score | ROC Score |
|---------------------|----------|-----------|--------|----------|-----------|
| Logistic Regression | 0.7452   | 0.75      | 0.75   | 0.74     | 0.7448    |
| Random Forest       | 0.8725   | 0.87      | 0.87   | 0.87     | 0.8721    |
| SVM                 | 0.6512   | 0.79      | 0.65   | 0.60     | 0.6460    |
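
The ROC scores in this notebook are computed by passing hard class predictions to `roc_auc_score`. With only 0/1 predictions the ROC curve has a single operating point, so the score equals the mean of the two per-class recalls (for logistic regression, (0.77 + 0.72) / 2 ≈ 0.745). A hedged sketch of the alternative, scoring with predicted probabilities from `predict_proba`, is shown below on synthetic data; all names and data here are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(400, 3))
# The label depends on the first feature plus some noise.
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)
clf_toy = LogisticRegression().fit(X_tr, y_tr)

# Score from hard 0/1 predictions (what the cells in this notebook do).
auc_from_labels = roc_auc_score(y_te, clf_toy.predict(X_te))
# Score from the predicted probability of class 1 (uses the full ranking).
auc_from_scores = roc_auc_score(y_te, clf_toy.predict_proba(X_te)[:, 1])

print("ROC AUC from labels:", auc_from_labels)
print("ROC AUC from probabilities:", auc_from_scores)
```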

Next, we will attempt to combine models and consider exporting the result for our website.

## Model Fine-tuning

Based on our observations, Random Forest and KNN both had accuracies above 93% (KNN is not evaluated in this notebook, and the Random Forest run above reaches 87%). We combine them through an ensemble method to see if we can push the performance up further.

In [35]:
# source: https://stats.stackexchange.com/questions/139042/ensemble-of-different-kinds-of-regressors-using-scikit-learn-or-any-other-pytho
# import knn
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier

# create the sub models
estimators = []
model1 = RandomForestClassifier()
estimators.append(('RF', model1))
model2 = KNeighborsClassifier(n_neighbors=5)
estimators.append(('KNN', model2))
# create the ensemble model
ensemble = VotingClassifier(estimators,n_jobs=2)
ensemble.fit(X_train, y_train)
pred = ensemble.predict(X_test)

# Getting some evaluation metrics here
print("Ensemble report (RF+KNN):")
print(classification_report(y_test, pred))
print("Ensemble accuracy:", accuracy_score(y_test, pred))
print("ROC score:",roc_auc_score(y_test, pred))

Ensemble report (RF+KNN):
             precision    recall  f1-score   support

          0       0.75      0.96      0.84      3419
          1       0.94      0.68      0.79      3319

avg / total       0.85      0.82      0.82      6738

Ensemble accuracy: 0.8202730780647076
ROC score: 0.8181515556377653




There seems to be a tradeoff here. Both the accuracy and the ROC score have decreased to around 82%, compared with around 87% for the original random forest. But precision for spammers and recall for legitimate users both increase. Of the accounts the ensemble predicts to be legitimate, 75% really are legitimate, and of the accounts it flags as spammers, 94% really are spammers (precision). It catches 68% of the actual spammers and correctly recognizes 96% of the actual legitimate users (recall). We will save this trained model in the next section.
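
One likely contributor to this pattern is how hard voting resolves ties. With only two estimators, every disagreement is a 1-1 tie, and scikit-learn's `VotingClassifier` breaks ties in favor of the lower class label, so the ensemble predicts spam (1) only when both models agree. The sketch below demonstrates the tie rule with two constant dummy classifiers; the toy data and names are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import VotingClassifier

X_toy = np.zeros((4, 1))
y_toy = np.array([0, 1, 0, 1])

# Two estimators that always disagree: one always says 0, the other always says 1.
always_ham = DummyClassifier(strategy="constant", constant=0)
always_spam = DummyClassifier(strategy="constant", constant=1)

ensemble_toy = VotingClassifier([("ham", always_ham), ("spam", always_spam)], voting="hard")
ensemble_toy.fit(X_toy, y_toy)

# Every vote is tied 1-1, and the tie goes to the lower class label (0).
tie_prediction = ensemble_toy.predict(X_toy)
print(tie_prediction)
```

Adding a third estimator, or switching to `voting="soft"` so that predicted probabilities are averaged, would avoid systematic ties.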

## Model output

We could utilize the [model persistence](http://scikit-learn.org/stable/modules/model_persistence.html) feature in scikit-learn to output our trained models.
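
As a quick sanity check of the persistence round trip, the sketch below saves a small model with joblib, loads it back, and confirms that both copies give the same predictions. The toy data, the file name `toy_model.pkl`, and the use of the standalone `joblib` package are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(100, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)
clf_toy = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Write the fitted model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(clf_toy, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly.
same_predictions = bool(np.array_equal(clf_toy.predict(X_toy), restored.predict(X_toy)))
print(same_predictions)
```

A pickled model is only guaranteed to load under the same library versions it was saved with, so it is worth recording the scikit-learn version next to the file.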

In [36]:
# outputting all trained classifiers
# (without the TFIDF version)
# Caution: in order to output the correct model, please refer to the steps above and rerun them (the ones without the TFIDF feature)
# so that the final pickle file captures the correct model
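try:
    import joblib  # standalone package, used by newer scikit-learn releases
except ImportError:
    from sklearn.externals import joblib  # older scikit-learn releases bundled joblib here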

#joblib.dump(logreg, 'logreg_user.pkl')
#joblib.dump(rf, 'rf_user.pkl') 
#joblib.dump(svm, 'svm_user.pkl')
joblib.dump(ensemble, 'ensemble_user_2.pkl')

['ensemble_user_2.pkl']

In [32]:
# try unpack
# Note: this loads 'ensemble_user.pkl', not the 'ensemble_user_2.pkl' written above

clf = joblib.load('ensemble_user.pkl')

In [33]:
print(user.iloc[5047].screen_name)
print(clf.predict(X_test.iloc[2].values.reshape(1, -1)))
print(y_test.iloc[2])

PrizeTrain55275
[1]
1


