# Twitter Spam Learning: Classifier
## CSCE 670 Spring 2018, Course Project
### By: Rose Lin (826009602)

This is one of a series of notebooks outlining how we train our models. This one looks at [logistic regression](https://en.wikipedia.org/wiki/Logistic_regression) in detail.

## Load the aggregated user data

Our main data sources come from the [Bot Repository](https://botometer.iuni.iu.edu/bot-repository/datasets.html):

* Varol-2017, which was released in 2017. It contains 2,573 user IDs crawled in April 2016. We repeatedly called the Twitter API to crawl account and tweet information. Although some accounts had been suspended, we were able to retrieve information for 2,401 users.
* cresci-2017, annotated by CrowdFlower contributors. We downloaded the whole dataset and used the following labels: genuine, social_spambots_1, social_spambots_2, social_spambots_3, traditional_spambots_1, traditional_spambots_2, traditional_spambots_3, and traditional_spambots_4.

Initially, our aggregated user dataset was imbalanced: we had ~7,000 labeled spammers and ~5,000 labeled legitimate users. To balance it, we did not oversample or downsample; instead, we used the Twitter API again. Our crawler started from President Trump's [Twitter account](https://twitter.com/realDonaldTrump) and scraped his friends list, his friends' following lists, and so on until we had collected 2,000 rows of user data. We assume that President Trump follows real users, and that his friends follow authentic accounts as well. In this way, we gathered a balanced user dataset with a spammer-to-legitimate-user ratio of roughly 1:1. We acknowledge that our aggregated dataset may be subject to bias. If time permits, we will collect a larger dataset that covers as many groups as possible.
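The crawl described above is essentially a breadth-first traversal of the "following" graph. The sketch below illustrates that idea only; it is not the crawler we actually ran. The function name `crawl_following` and the `get_friends` callable are hypothetical, and a toy dictionary stands in for the Twitter API.

```python
from collections import deque

def crawl_following(seed, get_friends, limit=2000):
    """Breadth-first crawl over 'following' lists, starting from `seed`.

    `get_friends` is a hypothetical callable that returns the accounts a
    given user follows; in practice it would wrap a Twitter API call.
    """
    collected = []          # accounts gathered so far, in discovery order
    seen = {seed}           # avoid visiting the same account twice
    queue = deque([seed])   # accounts whose friends lists are still unread
    while queue and len(collected) < limit:
        current = queue.popleft()
        for friend in get_friends(current):
            if friend in seen:
                continue
            seen.add(friend)
            collected.append(friend)
            queue.append(friend)
            if len(collected) >= limit:
                break
    return collected

# Toy follow graph standing in for the real API (made-up account names).
toy_graph = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["d"],
    "c": [],
    "d": ["seed"],
}
sample = crawl_following("seed", lambda u: toy_graph.get(u, []), limit=3)
everyone = crawl_following("seed", lambda u: toy_graph.get(u, []), limit=10)
```

The `seen` set keeps accounts that follow each other from being collected twice, and the `limit` check stops the crawl as soon as enough rows are gathered.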

In [1]:
#Loading the aggregated data
import pandas as pd
import os
from datetime import date
cwd = os.getcwd()

print "Loading the user data..."
user = pd.read_csv(cwd+"/all_users_balanced.csv",sep=',',header='infer')
print "Number of users:",user.id.nunique()
print "Number of spammers:",len(user[user.user_type == 1])
print "Number of legitimate users:",len(user[user.user_type == 0])

Loading the user data...
Number of users: 14202
Number of spammers: 7293
Number of legitimate users: 7364


In [2]:
# Look at the head
user.head()

| Unnamed: 0 | contributors_enabled | crawled_at | created_at | default_profile | default_profile_image | description | favourites_count | follow_request_sent | followers_count | following | ... | profile_text_color | profile_use_background_image | protected | screen_name | statuses_count | time_zone | url | user_type | utc_offset | verified |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | | 2014-04-19 14:46:19 | Tue Mar 17 08:51:12 +0000 2009 | 1.0 | 1.0 | | 1 | | 22 | | ... | 333333 | 1.0 | | davideb66 | 1299 | Rome | | 1 | 7200.0 | |
| 1 | | 2014-05-18 23:20:58 | Sun Apr 19 14:38:04 +0000 2009 | | | Autrice del libro #unavitatuttacurve dal 9 apr... | 16358 | | 12561 | | ... | 333333 | 1.0 | | ElisaDospina | 18665 | Greenland | http://t.co/ceK8TovxwI | 1 | -7200.0 | |
| 2 | | 2014-05-13 23:21:54 | Wed May 13 15:34:41 +0000 2009 | | | [Live Long and Prosper] | 14 | | 600 | | ... | 333333 | 1.0 | | Vladimir65 | 22987 | Rome | | 1 | 7200.0 | |
| 3 | | 2014-05-19 23:24:18 | Wed Jul 15 12:55:03 +0000 2009 | | | Cuasi Odontologa*♥,#Bipolar, #Sarcastica & Som... | 11 | | 398 | | ... | 3E4415 | 1.0 | | RafielaMorales | 7975 | Pacific Time (US & Canada) | | 1 | -25200.0 | |
| 4 | | 2014-05-11 23:22:23 | Wed Aug 05 21:12:49 +0000 2009 | | | I shall rise from my own death, to avenge hers... | 162 | | 413 | | ... | D67345 | 1.0 | | FabrizioC_c | 20218 | Rome | http://t.co/PK5F0JDKcy | 1 | 7200.0 | |


We mostly converted the JSON response from the Twitter API directly into the dataframe, adding two features:
* crawled_at: the date a record was crawled. It is used to compute the account age.
* user_type: a binary label, 0 = legitimate user, 1 = spammer.
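As a rough illustration of that conversion, the sketch below builds a labeled dataframe from a list of user objects. The helper name `users_to_frame`, the sample records, and the crawl timestamp are made up for the example; only the `crawled_at` and `user_type` columns come from our dataset.

```python
import pandas as pd

def users_to_frame(records, crawled_at, user_type):
    """Turn a list of user JSON objects (dicts) into a labeled dataframe."""
    frame = pd.DataFrame(records)
    frame["crawled_at"] = crawled_at   # when this batch was crawled
    frame["user_type"] = user_type     # 0 = legitimate user, 1 = spammer
    return frame

# Two made-up records carrying a few fields of the Twitter user object.
records = [
    {"id": 1, "screen_name": "example_a", "followers_count": 10, "friends_count": 20},
    {"id": 2, "screen_name": "example_b", "followers_count": 0, "friends_count": 5},
]
labeled = users_to_frame(records, crawled_at="2018-04-01 00:00:00", user_type=0)
```

Batches labeled this way (one per source and class) could then be stacked with `pd.concat` into a single file such as `all_users_balanced.csv`.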

## Feature Engineering

Based on the visualization results, we will consider the following account features:

* Count of favorite tweets
* Friends-to-followers ratio
* Total status count
* Default profile image
* Default profile
* Account age
* Username, count of letters
* Username, count of digits
* Screen_name, count of letters
* Screen_name, count of digits
* Length of description
* Description text
* Average tweets per day

These features will be derived from the original dataset.

In [38]:
from datetime import date

# Create a new dataframe to store the result
usert = pd.DataFrame()
# add count of favorite tweets
usert['favorite_count'] = user['favourites_count']
# add friends to follower ratio
usert['friends_to_followers'] = user['friends_count'] / user['followers_count']
# add total status count
usert['statuses_count'] = user['statuses_count']
# add default profile image
temp_df = pd.get_dummies(user['default_profile_image'])
temp_df.columns = ['def_p_img_na','def_p_img_false','def_p_img_true']
usert = pd.concat([usert, temp_df], axis=1)
# add default profile
temp_df = pd.get_dummies(user['default_profile'])
temp_df.columns = ['def_p_na','def_p_false','def_p_true']
usert = pd.concat([usert, temp_df], axis=1)
# add account ages 
agedf = pd.to_datetime(user['crawled_at'])-pd.to_datetime(user['created_at'])
usert['age'] = agedf.dt.days
# add username, count of letters and digits
for index, item in user['name'].iteritems():
    letter = 0
    num = 0
    for c in item:
        if c.isalpha():
            letter += 1
        elif c.isdigit():
            num += 1
    usert.loc[index,'name_letter'] = letter
    usert.loc[index,'name_num'] = num
# add screen name, count of letters and digits
for index, item in user['screen_name'].iteritems():
    letter = 0
    num = 0
    for c in item:
        if c.isalpha():
            letter += 1
        elif c.isdigit():
            num += 1
    usert.loc[index,'screen_letter'] = letter
    usert.loc[index,'screen_num'] = num
# add len of description
usert['des_len'] = pd.Series([len(d) for d in user['description']])
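The two counting loops above share the same logic, so they could be factored into one helper. The version below is only a sketch: `count_letters_digits` is a hypothetical name, and treating a missing value as zero letters and zero digits is an assumption, not something the cell above does.

```python
def count_letters_digits(text):
    """Return (letters, digits) for a string; non-iterable values give (0, 0)."""
    try:
        chars = list(text)
    except TypeError:
        return 0, 0   # e.g. a missing value that was read in as NaN
    letters = sum(1 for c in chars if c.isalpha())
    digits = sum(1 for c in chars if c.isdigit())
    return letters, digits

name_counts = count_letters_digits("Vladimir65")
missing_counts = count_letters_digits(float("nan"))
```

Applied to a column, this would look like `user['name'].apply(count_letters_digits)`, which yields one `(letters, digits)` pair per row.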

In [33]:
# add description text (TFIDF)
import re, string
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfTransformer, TfidfVectorizer

tfidf_transformer = TfidfVectorizer()
des_text = tfidf_transformer.fit_transform(user['description'].tolist())
des_text
# The TF-IDF matrix has 21,265 columns, so we keep it as a separate sparse matrix
# instead of adding it to the dataframe

<14657x21265 sparse matrix of type '<type 'numpy.float64'>'
	with 124464 stored elements in Compressed Sparse Row format>
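If the description text were to be used alongside the account features later, one option is to stack the two horizontally as sparse matrices, so the text part never needs to be made dense. This is a sketch with made-up descriptions and numbers, not something this notebook does; `scipy.sparse.hstack` is the only piece beyond the libraries already imported.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up descriptions and two made-up numeric features per account.
descriptions = ["free followers click here", "coffee and code", "code and music"]
dense_features = np.array([[1.0, 0.5], [20.0, 1.2], [7.0, 0.9]])

vectorizer = TfidfVectorizer()
text_features = vectorizer.fit_transform(descriptions)   # sparse, one row per account

# Stack the dense columns next to the TF-IDF columns without densifying the text part.
combined = hstack([csr_matrix(dense_features), text_features]).tocsr()
```

scikit-learn's `LogisticRegression` accepts sparse input, so a matrix like `combined` could be passed to `fit` directly.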

In [39]:
# add average tweets per day
usert['avg_tweet_per_day'] = usert['statuses_count']/usert['age']
# Now let's look at the new dataframe!
print usert.head()
print ""
print usert.describe()

   favorite_count  friends_to_followers  statuses_count  def_p_img_na  \
0               1              1.818182            1299             0   
1           16358              0.274023           18665             1   
2              14              1.258333           22987             1   
3              11              0.879397            7975             1   
4             162              0.980630           20218             1   

   def_p_img_false  def_p_img_true  def_p_na  def_p_false  def_p_true   age  \
0                0               1         0            0           1  1859   
1                0               0         1            0           0  1855   
2                0               0         1            0           0  1826   
3                0               0         1            0           0  1769   
4                0               0         1            0           0  1740   

   name_letter  name_num  screen_letter  screen_num  des_len  \
0         13.0       0

Now that we have all the desired features, we can start training our models.

## Classifier: training

We will split the whole *usert* dataframe into 80% training and 20% testing.

In [40]:
import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(usert, user.user_type, test_size=0.2, random_state=0)

# check the size
print X_train.shape, y_train.shape
print X_test.shape, y_test.shape

(11725, 16) (11725L,)
(2932, 16) (2932L,)


In [41]:
# run the actual classifier
from sklearn import linear_model
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

# Initial version: use default setting
logreg = linear_model.LogisticRegression()
logreg.fit(X_train, y_train)
pred = logreg.predict(X_test)

# Print some evaluation metrics
print "Logistic Regression report:"
print classification_report(y_test, pred)
print "Logistic Regression accuracy:", accuracy_score(y_test, pred)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
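This error is raised by scikit-learn's input validation, which rejects NaN and infinite values. A likely source, though we have not verified it on the full data, is the ratio features: `friends_to_followers` becomes infinite when `followers_count` is 0 (and NaN when both counts are 0), and `avg_tweet_per_day` behaves the same way when `age` is 0. The sketch below reproduces the problem on a toy frame and shows one possible cleanup. The helper name `clean_features` and the choice to fill with 0 are assumptions, not a settled decision.

```python
import numpy as np
import pandas as pd

def clean_features(features, fill_value=0.0):
    """Replace +/-inf with NaN, then fill every NaN with `fill_value`."""
    cleaned = features.replace([np.inf, -np.inf], np.nan)
    return cleaned.fillna(fill_value)

# Toy frame reproducing the problem: dividing by zero followers.
toy = pd.DataFrame({"friends_count": [20, 5, 0], "followers_count": [10, 0, 0]})
toy["friends_to_followers"] = toy["friends_count"] / toy["followers_count"]

# List the columns that contain at least one inf or NaN.
finite_mask = np.isfinite(toy.values.astype(float)).all(axis=0)
bad_columns = [col for col, ok in zip(toy.columns, finite_mask) if not ok]

cleaned = clean_features(toy)
```

With the real data, the same check on `usert` would show which features need attention, and `clean_features(usert)` could be applied before `train_test_split`. Dropping the affected rows or filling with a column median are alternatives worth comparing.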