# Twitter Bot Detection: Exploratory Data Analysis

Since the 2016 US presidential election, I (like many) have become incredibly concerned about the spread of misinformation across social media platforms. I knew that bots played a role but knew very little about how they worked, so my imagination ran wild: indefatigable chaos agents, silently sowing seeds of distrust among us and disrupting democracy.

While that depiction might be part of it, it's not true of *all* bots. Essentially, **a Twitter bot is a software bot that controls a Twitter account via the Twitter API**. It can tweet, retweet, like, follow, and direct message, just like any user. The bot is governed by a set of rules written by its creator. And while nefarious bots exist, Twitter does try to regulate improper usage.

Some bots are just companies or organizations like the New York Times or NBA that have automated their social media. 

Armed with the [Twitter Bot Accounts](https://www.kaggle.com/davidmartngutirrez/twitter-bots-accounts?) dataset from Kaggle, I'm hoping to find features in account-level information that can aid in Twitter bot detection. 

The dataset is comprised of approximately 37,000 Twitter users, labeled bot or human, with account-level information like: 
* number of favorites/likes
* number of tweets
* number of followers
* number of friends (accounts they're following)
* whether or not the profile is still in default mode
* and more

In this notebook, I'll be exploring some of these provided features as well as transforming the data to create some interesting interactions that might aid in creating a predictive classification model. 

In [1]:
# Basics
import pandas as pd
import psycopg2 as pg
import numpy as np

# Visuals
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

## Data import and setup

The dataset for this project is available on [Kaggle](https://www.kaggle.com/davidmartngutirrez/twitter-bots-accounts?); however, I've set up a local SQL database to house the data.

In [1]:
# Postgres info to connect
connection_args = {
    'host': 'localhost',             # Connecting to _local_ version of psql
    'dbname': 'twitter_accounts',    # DB with Twitter info
    'port': 5432                     # default Postgres port
}

connection = pg.connect(**connection_args)  

In [4]:
raw_df = pd.read_sql('SELECT * FROM human_bots', connection)

In [None]:
raw_df.head()

In [None]:
raw_df.info()

Setting up a few views to look at entry counts by account type:

In [1]:
q = '''
SELECT account_type, COUNT(*) as total 
FROM human_bots
GROUP BY account_type
ORDER BY account_type  -- fixes the row order alphabetically: 'bot' first, then 'human'
'''

account_counts = pd.read_sql(q, connection)
account_counts

In [None]:
num_humans = account_counts.total.values[1]
num_bots = account_counts.total.values[0]

types = ['Humans', 'Bots']
counts = [num_humans, num_bots]

plt.figure(figsize=(4, 4))
sns.barplot(x = types, y = counts)
plt.title("Number of Entries by Account Type", fontsize=11)
sns.despine();

In [None]:
q = '''
WITH ver as
    (SELECT account_type, COUNT(*) as ver_true
        FROM human_bots
        WHERE verified='true'
        GROUP BY account_type),
    
    not_ver as
    (SELECT account_type, COUNT(*) as ver_false
        FROM human_bots
        WHERE verified='false'
        GROUP BY account_type)
    
    SELECT v.account_type, v.ver_true, nv.ver_false
        FROM ver v
        JOIN not_ver nv
        ON v.account_type = nv.account_type
        ORDER BY v.account_type

'''
ver_status_by_type = pd.read_sql(q, connection)
ver_status_by_type

What about by verification status? 

In [None]:
bots_verified = ver_status_by_type.ver_true.values[0]
bots_not_verified = ver_status_by_type.ver_false.values[0]
humans_verified = ver_status_by_type.ver_true.values[1]
humans_not_verified = ver_status_by_type.ver_false.values[1]

types = ['Not Verified', 'Verified']
bot_counts = [bots_not_verified, bots_verified]
human_counts = [humans_not_verified, humans_verified]

plt.figure(figsize=(8, 4), dpi=100)
plt.suptitle("Count by Verification Status", fontsize=12)

plt.subplot(1, 2, 1)
sns.barplot(x = types, y = human_counts)
plt.title("Humans", fontsize=11)
plt.ylabel("Count", fontsize=10)
sns.despine()

plt.subplot(1, 2, 2)
sns.barplot(x = types, y = bot_counts)
plt.title("Bots", fontsize=11)
sns.despine();
#plt.savefig('imgs/account_counts_by_verification_status.png');

This is likely to make a big impact on the model - I'll dig into verification status more as a possible feature later.

**Data types**: converting the data to more usable forms - changing booleans to binary 1/0, and datetime conversion for the account creation timestamp. 

In [None]:
# drop extra index column
raw_df.drop(columns=['index'], inplace=True)

# Binary classifications for bots and boolean values
raw_df['bot'] = raw_df['account_type'].apply(lambda x: 1 if x == 'bot' else 0)
raw_df['default_profile'] = raw_df['default_profile'].astype(int)
raw_df['default_profile_image'] = raw_df['default_profile_image'].astype(int)
raw_df['geo_enabled'] = raw_df['geo_enabled'].astype(int)
raw_df['verified'] = raw_df['verified'].astype(int)

# datetime conversion
raw_df['created_at'] = pd.to_datetime(raw_df['created_at'])
# hour created
raw_df['hour_created'] = raw_df['created_at'].dt.hour

In [2]:
# usable df setup (.copy() gives an independent frame, so columns added later don't trigger SettingWithCopyWarning)
df = raw_df[['bot', 'screen_name', 'created_at', 'hour_created', 'verified', 'acct_location', 'geo_enabled', 'lang', 'default_profile', 
              'default_profile_image', 'favourites_count', 'followers_count', 'friends_count', 'statuses_count',
             'average_tweets_per_day', 'account_age_days']].copy()

del raw_df

In [3]:
df.head()

In [4]:
df.info()

In [5]:
df.describe()

## Data transformations for EDA

After looking at a few plots, I noticed that the distributions in this data are highly skewed, so I'm setting up some log transformations for the sake of interpretability.

The dataset provides one calculated feature: average tweets per day. I think there could be other interesting rate-based features, like follower acquisition. I'm also curious about interactions that capture overall network size and reach, such as tweets times followers.
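To illustrate why a log transform helps with heavily skewed counts, here is a minimal sketch on synthetic data. The numbers are generated and are not taken from the Kaggle dataset; `followers` and `followers_log` are illustrative names. It uses `np.log1p(x)`, which computes `log(1 + x)` and therefore stays defined for accounts with a count of zero.

```python
import numpy as np
import pandas as pd

# Synthetic, heavy-tailed "follower counts" (illustrative only, not the real data)
rng = np.random.default_rng(0)
followers = pd.Series(rng.lognormal(mean=5, sigma=2, size=10_000)).round()

# log(1 + x) keeps zero counts defined, since log1p(0) == 0
followers_log = np.log1p(followers)

# Sample skewness before and after the transform
raw_skew = followers.skew()
log_skew = followers_log.skew()
print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

On data shaped like this, the transformed values should be far closer to symmetric, which is what makes the histograms further down readable.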

In [6]:
# Interesting features to look at: average daily rates over the account's lifetime
df['avg_daily_followers'] = np.round(df['followers_count'] / df['account_age_days'])
df['avg_daily_friends'] = np.round(df['friends_count'] / df['account_age_days'])
df['avg_daily_favorites'] = np.round(df['favourites_count'] / df['account_age_days'])

# Log transformations for highly skewed data
df['friends_log'] = np.round(np.log(1 + df['friends_count']), 3)
df['followers_log'] = np.round(np.log(1 + df['followers_count']), 3)
df['favs_log'] = np.round(np.log(1 + df['favourites_count']), 3)
df['avg_daily_tweets_log'] = np.round(np.log(1+ df['average_tweets_per_day']), 3)

# Possible interaction features
df['network'] = np.round(df['friends_log'] * df['followers_log'], 3)
df['tweet_to_followers'] = np.round(np.log( 1+ df['statuses_count']) * np.log(1+ df['followers_count']), 3)

# Log-transformed daily acquisition metrics for dist. plots
df['follower_acq_rate'] = np.round(np.log(1 + (df['followers_count'] / df['account_age_days'])), 3)
df['friends_acq_rate'] = np.round(np.log(1 + (df['friends_count'] / df['account_age_days'])), 3)
df['favs_rate'] = np.round(np.log(1 + (df['favourites_count'] / df['account_age_days'])), 3)

## Correlations

Now that the transformations and interactions are set up, let's take a look at the correlation heatmaps of the different features.
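As a numeric complement to the heatmaps, one option is to rank features by the strength of their correlation with the `bot` label. The sketch below uses a small made-up frame called `toy` as a stand-in for `df`; the values are invented for illustration, and `numeric_only=True` assumes pandas 1.5 or later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's `df` (values are made up)
toy = pd.DataFrame({
    "bot": [1, 1, 1, 0, 0, 0],
    "verified": [0, 0, 0, 1, 0, 1],
    "followers_log": [2.1, 3.0, 1.5, 9.2, 6.8, 10.1],
    "screen_name": ["a", "b", "c", "d", "e", "f"],  # non-numeric, excluded below
})

# Correlation of every numeric feature with the label,
# sorted by absolute strength so the sign is still visible
corr_with_bot = (
    toy.corr(numeric_only=True)["bot"]
    .drop("bot")
    .sort_values(key=np.abs, ascending=False)
)
print(corr_with_bot)
```

Sorting by absolute value keeps strongly negative features (here, ones where humans score higher) near the top instead of burying them at the bottom.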

### All data

In [7]:
plt.figure(figsize=(8,6), dpi=100)
cmap = sns.diverging_palette(230, 20, as_cmap=True)

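# restrict to numeric columns; text and datetime columns cannot be correlated
sns.heatmap(df.select_dtypes('number').corr(), cmap=cmap, annot=False)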
plt.title('Correlation of potential features: all data', fontsize=14);

### Bots vs Humans

In [8]:
bots = df[df['bot'] == 1]
humans = df[df['bot'] == 0]

In [9]:
plt.figure(figsize=(8,6), dpi=100)
cmap = sns.diverging_palette(230, 20, as_cmap=True)

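# restrict to numeric columns; text and datetime columns cannot be correlated
sns.heatmap(bots.select_dtypes('number').corr(), cmap=cmap, annot=False)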
plt.title('Correlation of potential features: bots only', fontsize=14);

In [10]:
plt.figure(figsize=(8,6), dpi=100)
cmap = sns.diverging_palette(230, 20, as_cmap=True)

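# restrict to numeric columns; text and datetime columns cannot be correlated
sns.heatmap(humans.select_dtypes('number').corr(), cmap=cmap, annot=False)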
plt.title('Correlation of potential features: humans only', fontsize=14);

## Distribution plots

Now to set up some distribution plots to better understand how some of the features differ between bot and human accounts.

### Account age

In [11]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='account_age_days', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title('Distribution of Account Ages', fontsize=12)
plt.xlabel('Account Age (# days)', fontsize=10)
plt.ylabel('Density', fontsize=10)
sns.despine();

### Number of followers

In [12]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='followers_log', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title('Distribution of # Followers', fontsize=12)
plt.xlabel('Followers (natural log scale)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 18)
plt.ylim(0, 0.11)
sns.despine();
#plt.savefig('imgs/dist_followers.png');

### Number of friends (follow*ing*)

In [13]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='friends_log', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title('Distribution of Friends (# Following)', fontsize=12)
plt.xlabel('Friends Count (natural log scale)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 16)
plt.ylim(0, 0.65)
sns.despine();

In [14]:
plt.figure(figsize=(12,5), dpi=100)
plt.suptitle('Distribution of Followers & Friends', fontsize=14)

plt.subplot(1, 2, 1)
sns.histplot(x='followers_log', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=False)
plt.title('Distribution of # Followers', fontsize=12)
plt.xlabel('Followers (natural log scale)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 18)
plt.ylim(0, 0.25)
sns.despine()

plt.subplot(1, 2, 2)
sns.histplot(x='friends_log', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title('Distribution of Friends (# Following)', fontsize=12)
plt.xlabel('Friends Count (natural log scale)', fontsize=10)
plt.ylabel("")
plt.yticks([])
plt.xlim(0, 18)
plt.ylim(0, 0.25)
sns.despine();
#plt.savefig('imgs/dist_followers_friends.png');

### 'Network' Size

Defined as `log(1 + number of friends) * log(1 + number of followers)`, i.e. the product of the two log-transformed counts (natural log).
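To make the definition concrete, the sketch below contrasts the product of logs that the `network` column actually computes with the log of the raw product, which is a different quantity. The counts `friends` and `followers` are made-up illustrative values.

```python
import numpy as np

# Made-up counts for a single account
friends, followers = 100, 10_000

# What the notebook's `network` column computes: a product of logs
product_of_logs = np.log1p(friends) * np.log1p(followers)

# A literal log of (friends * followers) is instead a *sum* of logs
log_of_product = np.log(friends * followers)

print(round(product_of_logs, 3), round(log_of_product, 3))

# An account with zero friends scores 0 regardless of its follower count
zero_friends_score = np.log1p(0) * np.log1p(followers)
```

Because the two logs are multiplied, the metric can reach values well above either log on its own, which is why the x-axis of the plot extends to 150.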

In [15]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='network', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title('Distribution of Network Size', fontsize=12)
plt.xlabel('Network Size [log(1 + Friends) * log(1 + Followers)]', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 150)
plt.ylim(0, 0.05)
sns.despine();
#plt.savefig('imgs/network_size.png');

### Number of favorites

In [16]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='favs_log', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title("Distribution of Favorites Count", fontsize=12)
plt.xlabel('Favorites Count (natural log scale)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 14)
plt.ylim(0, 0.3)
sns.despine();

### Avg tweets per day

In [17]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='avg_daily_tweets_log', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title("Distribution of Tweets per Day", fontsize=12)
plt.xlabel('Average Tweets per Day (natural log scale)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 5)
plt.ylim(0, 0.7)
sns.despine();
#plt.savefig('imgs/tweets_per_day.png');

### Follower acquisition rate

Can be thought of as the number of new followers per day, assuming followers were gained at an even cadence over the account's lifetime.
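The rate calculation can be sketched on a few made-up accounts as follows. The zero-age guard is an added assumption: the notebook divides directly, and I have not checked whether the dataset contains accounts with `account_age_days == 0`. The frame `toy` and the column `followers_per_day` are illustrative names.

```python
import numpy as np
import pandas as pd

# Made-up accounts; the last one is brand new (age of zero days)
toy = pd.DataFrame({
    "followers_count": [0, 50, 3_650, 120],
    "account_age_days": [10, 100, 365, 0],
})

# Treat a zero-day age as missing so the division yields NaN rather than inf
age = toy["account_age_days"].replace(0, np.nan)
toy["followers_per_day"] = toy["followers_count"] / age

# Same log(1 + x) transform the notebook applies before plotting
toy["follower_acq_rate"] = np.log1p(toy["followers_per_day"]).round(3)
print(toy)
```

Leaving the brand-new account as `NaN` keeps it out of the histogram instead of stretching the axis with an infinite value.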

In [18]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='follower_acq_rate', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title("Distribution of Follower Acquisition Rate", fontsize=12)
plt.xlabel('Followers per Day (natural log scale)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 10)
plt.ylim(0, 2.5)
sns.despine();

### Friends acquisition rate

In [19]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='friends_acq_rate', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title("Distribution of Friend Acquisition Rate", fontsize=12)
plt.xlabel('Friends per Day (natural log scale)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 1)
plt.ylim(0, 6)
sns.despine();

### Tweets to Followers Metric

The idea here is **network *reach***: how much an account tweets and how many people might see it. Since the metric multiplies the log-transformed tweet count by the log-transformed follower count, it is highest for accounts with both many tweets and many followers, and it drops to zero if either count is zero.
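A quick way to build intuition for this metric is to score a few archetypal accounts. The accounts below are invented for illustration; only the formula mirrors the `tweet_to_followers` column defined earlier.

```python
import numpy as np
import pandas as pd

# Made-up archetypes, labeled in the index for readability
toy = pd.DataFrame(
    {
        "statuses_count": [50_000, 50_000, 20, 0],
        "followers_count": [200_000, 3, 200_000, 500],
    },
    index=["prolific + popular", "prolific, unseen", "quiet celebrity", "never tweeted"],
)

# Product of the log-transformed tweet and follower counts
toy["tweet_to_followers"] = (
    np.log1p(toy["statuses_count"]) * np.log1p(toy["followers_count"])
).round(3)

print(toy.sort_values("tweet_to_followers", ascending=False))
```

The account that is both prolific and widely followed should rank first, while the account that has never tweeted scores exactly zero no matter how many followers it has.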

In [20]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='tweet_to_followers', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='upper right', labels=['Bot', 'Human'], frameon=False)
plt.title("Tweets to Followers Metric", fontsize=12)
plt.xlabel('log(1 + Tweets) * log(1 + Followers)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 200)
plt.ylim(0, 0.009)
sns.despine();
#plt.savefig('imgs/tweets_to_followers.png');

### Time of day account created

At what time of day are most human accounts created? When are most bots created? 

In [21]:
plt.figure(figsize=(7,5), dpi=100)
sns.histplot(x='hour_created', data=df, hue='bot', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Account Type', loc='best', labels=['Bot', 'Human'])
plt.title("Time of Day Account was Created", fontsize=12)
plt.xlabel('Time of Day (24h)', fontsize=10)
plt.ylabel('Density', fontsize=10)
plt.xlim(0, 24)
plt.ylim(0, 0.05)
sns.despine();
#plt.savefig('imgs/hour_created.png');

## Verification Status

**Verification status** is likely to be a huge feature here. 

From Twitter's website: 
> An account may be verified if it is determined to be an account of public interest. Typically this includes accounts maintained by users in music, acting, fashion, government, politics, religion, journalism, media, sports, business, and other key interest areas.

Verification requires a vetting process and the earlier bar plot with verification status by account type showed that the numbers differ significantly between bot and human accounts. This may not mean that bots are less likely to be verified, however, and may be the result of an imbalanced dataset. Still, it has potential to be a very important feature in the model.
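One way to separate a real difference from a class-imbalance effect is to compare verification *rates* within each account type rather than raw counts. The sketch below does this with `pd.crosstab` on a made-up frame `toy`; the values are invented and say nothing about the real dataset.

```python
import pandas as pd

# Made-up labels: 6 humans (bot == 0) and 4 bots (bot == 1)
toy = pd.DataFrame({
    "bot":      [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "verified": [1, 1, 0, 0, 0, 0, 0, 0, 0, 1],
})

# Raw counts are affected by how many rows each class has
counts = pd.crosstab(toy["bot"], toy["verified"])

# normalize="index" divides each row by its total, giving within-class rates
rates = pd.crosstab(toy["bot"], toy["verified"], normalize="index")

print(counts)
print(rates)
```

If the verified share still differs clearly between the two rows after normalizing, the gap is not just an artifact of one class having more entries.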

In [22]:
plt.figure(figsize=(12,5), dpi=100)
plt.suptitle('Tweets-Followers Metric by Verification Status', fontsize=14)

plt.subplot(1, 2, 1)
sns.histplot(x='tweet_to_followers', data=humans, hue='verified', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=False)

plt.title("Humans", fontsize=12)
plt.xlabel('log(1 + Tweets) * log(1 + Followers)', fontsize=10)
plt.ylabel('Density', fontsize=10)
sns.despine(bottom = True, left = True)
plt.xlim(0, 200)
plt.ylim(0, 0.025)

plt.subplot(1, 2, 2)
sns.histplot(x='tweet_to_followers', data=bots, hue='verified', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)

plt.legend(title='Verification Status', loc='upper right', labels=['Verified', 'Not Verified'], frameon=False)
plt.title("Bots", fontsize=12)
plt.xlabel('log(1 + Tweets) * log(1 + Followers)', fontsize=10)
plt.ylabel("")
plt.yticks([])
sns.despine()
plt.xlim(0, 200)
plt.ylim(0, 0.025);
#plt.savefig('imgs/tweets_to_follows_by_verification_status.png');

In [24]:
plt.figure(figsize=(12,5), dpi=100)
plt.suptitle('Network Size by Verification Status', fontsize=14)

plt.subplot(1, 2, 1)
sns.histplot(x='network', data=humans, hue='verified', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=False)
plt.title("Humans", fontsize=12)
plt.xlabel('Network Size [log(1 + Friends) * log(1 + Followers)]', fontsize=10)
plt.ylabel('Density', fontsize=10)
sns.despine(bottom = True, left = True)
plt.xlim(0, 150)
plt.ylim(0, 0.05)

plt.subplot(1, 2, 2)
sns.histplot(x='network', data=bots, hue='verified', alpha=.25, 
             kde=True, stat='density', common_bins=True, element='step', legend=True)
plt.legend(title='Verification Status', loc='upper right', labels=['Verified', 'Not Verified'], frameon=False)
plt.title("Bots", fontsize=12)
plt.xlabel('Network Size [log(1 + Friends) * log(1 + Followers)]', fontsize=10)
plt.ylabel("")
plt.yticks([])
sns.despine()
plt.xlim(0, 150)
plt.ylim(0, 0.05);
#plt.savefig('imgs/network_size_by_verification_status.png');
