The goal of this notebook is to train a random forest classifier on our cleaned dataset.

The data dictionary at https://developer.twitter.com/en/docs/tweets/data-dictionary/overview/user-object explains what the column names mean.

See notebooks 4, 5, and 6 from the Classification lectures.

- [x] Implement a basic train-test split
- [x] Sample the data for training
- [ ] Train one model
- [ ] Quantify the model's accuracy
- [ ] Monkey with hyperparameters, etc.
- [ ] Make a more sophisticated train-test split
- [ ] Implement K-fold cross validation
- [ ] Create a proper scikit-learn pipeline
- [ ] Make some decent visualizations
- [ ] Add the categorical variables to the dataframe (starting by turning binary categories into boolean variables)
- [ ] One-hot encode the categorical variables, if necessary
- [ ] Examine feature importance
- [ ] Examine AUC

## Front matter

In [55]:
from pathlib import Path # For navigating to the datafiles
import json # For processing json files
import csv # For processing csv files

from functools import reduce # So Chris can pretend he's a CS major

import pandas as pd
import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt # For simple graphing
import seaborn as sns # For sns.set_style("whitegrid"), I guess?

## This sets the plot style
## to have a grid on a white background
sns.set_style("whitegrid")

## Prepare data

In [6]:
# Read "well_formatted_json_files.txt"

data_files_to_read_path = Path('well_formatted_json_files.txt')
with open(data_files_to_read_path) as file:
    lines = file.read().splitlines()
    # Remove commented lines and blank lines.
    # Check for blank lines first so that li[0] is never evaluated on an empty string.
    lines = [li for li in lines if li.strip() and not li.startswith('#')]
    print(len(lines))

10


In [7]:
# Read in json and tsv files

data_path = Path('data')

l = list()
for li in lines:
    # Read into dataframes
    
    json_path = data_path / Path(li + '_tweets.json')
    with open(json_path) as file:
        very_raw_json_contents = json.load(file)
    raw_json_contents = [x['user'] for x in very_raw_json_contents]
    json_contents = pd.DataFrame(raw_json_contents)
    tsv_path = data_path / Path(li + '.tsv')
    tsv_contents = pd.read_csv(tsv_path, sep='\t')
    tsv_contents.columns = ['user_id','species']
    
    # Set indexes to user ids and remove duplicate indices
        
    json_contents = json_contents.set_index('id')
    tsv_contents = tsv_contents.set_index('user_id')
    json_contents = json_contents.loc[~json_contents.index.duplicated(keep='last')]
    tsv_contents = tsv_contents.loc[~tsv_contents.index.duplicated(keep='last')]
    
    # Merge
    
    merged_data = pd.concat([json_contents, tsv_contents], join='inner', axis = 1)
    merged_data = merged_data.assign(source=li)
    
    l.append(merged_data)


# pd.concat accepts the whole list at once, so no pairwise reduce is needed
cumulative_raw_data = pd.concat(l)
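To make the dedupe-then-merge step above easier to follow, here is a minimal sketch on toy data. The frames `profiles` and `labels` and their values are made up for illustration; only the pandas calls mirror the cell above.

```python
import pandas as pd

# Hypothetical stand-ins for json_contents and tsv_contents
profiles = pd.DataFrame(
    {"id": [1, 2, 2, 3], "followers_count": [10, 20, 25, 30]}
).set_index("id")
labels = pd.DataFrame(
    {"user_id": [2, 3, 4], "species": ["bot", "human", "bot"]}
).set_index("user_id")

# Keep only the last row for each repeated user id
profiles = profiles.loc[~profiles.index.duplicated(keep="last")]

# Inner join on the index: only ids present in both frames survive
merged = pd.concat([profiles, labels], join="inner", axis=1)
print(merged)
```

User 1 (no label) and user 4 (no profile) are dropped, and user 2 keeps its last profile row.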

In [94]:
columns_to_keep = ['has_extended_profile','profile_use_background_image','profile_background_image_url_https','verified','profile_text_color','profile_sidebar_fill_color','followers_count','profile_sidebar_border_color','profile_background_color','listed_count','utc_offset','statuses_count','profile_link_color','geo_enabled','profile_background_image_url','lang','profile_background_tile','favourites_count','url','created_at','time_zone','default_profile','friends_count']

target = cumulative_raw_data['species'] == 'bot'
# old code: target = np.asarray([cumulative_raw_data['species'] == 'bot']).reshape(-1,1)
source = cumulative_raw_data['source']
cut_raw_data = cumulative_raw_data[columns_to_keep]

cut_raw_data.sample(3)

| | has_extended_profile | profile_use_background_image | profile_background_image_url_https | verified | profile_text_color | profile_sidebar_fill_color | followers_count | profile_sidebar_border_color | profile_background_color | listed_count | ... | geo_enabled | profile_background_image_url | lang | profile_background_tile | favourites_count | url | created_at | time_zone | default_profile | friends_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1552732999 | False | True | https://abs.twimg.com/images/themes/theme1/bg.png | False | 333333 | DDEEF6 | 9 | C0DEED | C0DEED | 1 | ... | False | http://abs.twimg.com/images/themes/theme1/bg.png | en | False | 75 | | Fri Jun 28 09:56:01 +0000 2013 | | True | 0 |
| 623456010 | True | False | https://abs.twimg.com/images/themes/theme1/bg.png | True | 0 | 000000 | 1544019 | 000000 | 000000 | 858 | ... | False | http://abs.twimg.com/images/themes/theme1/bg.png | en | False | 1604 | https://t.co/SabJqZeRBD | Sun Jul 01 03:38:22 +0000 2012 | Beijing | False | 0 |
| 1131598854 | True | True | https://abs.twimg.com/images/themes/theme10/bg... | False | 333333 | DDEEF6 | 1674 | 000000 | 000000 | 1 | ... | True | http://abs.twimg.com/images/themes/theme10/bg.gif | es | True | 16263 | | Tue Jan 29 17:11:24 +0000 2013 | | False | 821 |


In [62]:
numerical_cols = ['favourites_count','statuses_count','friends_count','followers_count','listed_count']
boolean_cols = ['profile_background_tile','default_profile','geo_enabled','verified','has_extended_profile','profile_use_background_image']

assert set(numerical_cols) <= set(cut_raw_data.columns)
assert set(boolean_cols) <= set(cut_raw_data.columns)

cleaned_data = cut_raw_data[numerical_cols + boolean_cols]

display(cleaned_data.sample(3, random_state=855))
print(cleaned_data.shape)

| | favourites_count | statuses_count | friends_count | followers_count | listed_count | profile_background_tile | default_profile | geo_enabled | verified | has_extended_profile | profile_use_background_image |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1870916851 | 132 | 30 | 0 | 1 | 0 | False | True | False | False | False | True |
| 340052871 | 2497 | 7452 | 183 | 199 | 0 | True | False | True | False | False | True |
| 156031870 | 6635 | 19851 | 146 | 259 | 4 | True | False | True | False | True | True |


(44595, 11)


NUMERICAL:

* favourites_count
* statuses_count
* friends_count
* followers_count
* listed_count

BOOLEAN:

* profile_background_tile
* default_profile
* geo_enabled
* verified
* has_extended_profile
* profile_use_background_image

CATEGORICAL (NOT YET ADDED):

* profile_background_image_url_https [consider categorical: equal to theme1, theme14, NaN, or other]
* url [categorical: NaN or other]
* lang [categorical: en or other (maybe NaN, es, pt, it, too)]
* profile_link_color [categorical: 1DA1F2 or other]
* profile_background_image_url [categorical: theme1, NaN, or other]
* profile_sidebar_border_color [consider categorical: either equal to C0DEED, FFFFFF, 000000, or other]
* profile_background_color [categorical: equal to C0DEED, F5F8FA, 000000, or other]
* profile_text_color [consider categorical: either equal to 333333 or not]
* profile_sidebar_fill_color [consider categorical: either equal to DDEEF6 or not]
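As a rough sketch of how the simplest of these could become boolean features (the "equal to X or not" cases), assuming toy values and hypothetical new column names `default_text_color`, `has_url`, and `lang_is_en`:

```python
import pandas as pd

# Toy rows standing in for cut_raw_data (values are illustrative only)
toy = pd.DataFrame({
    "profile_text_color": ["333333", "000000", "333333"],
    "url": [None, "https://t.co/SabJqZeRBD", None],
    "lang": ["en", "es", "en"],
})

# Each two-way category collapses to a single True/False column
flags = pd.DataFrame({
    "default_text_color": toy["profile_text_color"] == "333333",
    "has_url": toy["url"].notna(),
    "lang_is_en": toy["lang"] == "en",
}, index=toy.index)

print(flags)
```

Categories with more than two levels (for example theme1, theme14, NaN, or other) would still need one-hot encoding, which this sketch does not cover.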

WEIRD/OTHER (NOT GOING TO BE ADDED):

* utc_offset [categorical? UTC offset measured in seconds]
* created_at [look at day of week, or at time of day]
* time_zone [categorical: NaN or other]

## Train model

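The train-test split currently stratifies by target only, not by source. We should stratify by both, so that every source contributes bots and humans to the training and test sets in the same proportions. (Note to Chris: Think of stratifying by a source-target ordered pair.)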
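One possible way to do that, sketched on toy data: build a single label per (source, target) pair and pass it as `stratify`. The frame `toy` and the name `strata` are placeholders, not part of this notebook's data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: two sources, each with 10 bots and 10 humans
toy = pd.DataFrame({
    "favourites_count": range(40),
    "source": ["a"] * 20 + ["b"] * 20,
    "is_bot": ([True] * 10 + [False] * 10) * 2,
})

# One stratum label per (source, target) pair, e.g. "a_True"
strata = toy["source"].astype(str) + "_" + toy["is_bot"].astype(str)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy[["favourites_count"]],
    toy["is_bot"],
    train_size=0.5,
    shuffle=True,
    stratify=strata,
    random_state=855,
)

# How often each pair shows up in the training set
train_strata_counts = strata.loc[X_tr.index].value_counts()
print(train_strata_counts)
```

Note that `train_test_split` needs at least two rows in every stratum, so a very rare source-target pair would have to be merged or dropped first.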

In [101]:
# Train test split

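samp_size = 1500  # Number of training rows; all remaining rows go to the test set
# NOTE: only favourites_count is used as a feature here, not all of cleaned_data
X_train,X_test,y_train,y_test = train_test_split(cleaned_data[['favourites_count']],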
                                                target,
                                                train_size=samp_size,
                                                # test_size = 5000,
                                                shuffle=True,
                                                stratify=target,
                                                random_state=855)

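The model is fit with scikit-learn's default settings apart from `n_estimators`, so there is plenty of room for hyperparameter tuning. No `random_state` is set, so results will vary slightly from run to run.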

In [107]:
rf_model = RandomForestClassifier(n_estimators = 100)
rf_model.fit(X_train,y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
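A minimal sketch of what tuning with K-fold cross validation could look like, using `GridSearchCV` on synthetic data from `make_classification`. The grid values are arbitrary examples, not recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for X_train / y_train
X_toy, y_toy = make_classification(n_samples=300, n_features=5, random_state=855)

# Small, arbitrary grid over two common random forest hyperparameters
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

# 5-fold cross validation that keeps the class balance in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=855)

search = GridSearchCV(
    RandomForestClassifier(random_state=855),
    param_grid,
    cv=cv,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is the model refit on all of the data passed to `fit` with the best parameters found.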

## Test model

The target `y` is True for bots and False for humans. The cell below reports plain accuracy on the test set.

In [112]:
pred = rf_model.predict(X_test)
np.sum(pred == y_test)/len(y_test)

0.6918668058939552
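Accuracy on its own is hard to interpret without knowing the class balance. The sketch below, on synthetic data, compares accuracy with the majority-class baseline and also computes AUC and a confusion matrix. All names here (`X_toy`, `model`, `baseline`, and so on) are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and the boolean bot/human target
X_toy, y_toy = make_classification(n_samples=400, n_features=5, random_state=855)
y_toy = y_toy.astype(bool)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, random_state=855
)

model = RandomForestClassifier(n_estimators=100, random_state=855).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Same quantity as np.sum(pred == y_test) / len(y_test)
accuracy = accuracy_score(y_te, pred)

# Accuracy of always predicting the more common class
baseline = max(y_te.mean(), 1 - y_te.mean())

# AUC needs scores, not hard labels: use the predicted probability of True
true_col = list(model.classes_).index(True)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, true_col])

# Rows are true classes (False, True); columns are predicted classes
cm = confusion_matrix(y_te, pred)

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f} auc={auc:.3f}")
print(cm)
```

If the accuracy is close to the baseline, the model is doing little better than always guessing the majority class.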