## Createtime analysis to find chunks of Twitter bots

This notebook performs a simple analysis of the sequence of times at which an account's Twitter followers were created. It is the idea behind the plots in the NY Times "Follower Factory" story from Jan 27, 2018:

https://www.nytimes.com/interactive/2018/01/27/technology/social-media-bots.html

Running the notebook requires a set of tokens for the Twitter API. To get your own tokens, see here:

https://developer.twitter.com/en/docs/basics/authentication/overview

I don't recommend trying to use this notebook unless you have a solid general understanding of Python and web APIs, but I still wanted to put it out there. Unless you have very good tokens, your patience will be severely tested if you try to use this notebook to look at users with >200k followers. So consider yourself advised.
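
To get a feel for how long that wait might be, here is a rough back-of-the-envelope sketch. The per-window call counts (15 follower-ID calls and 900 user-lookup calls per 900-second window) are assumptions about the API limits, not values taken from this notebook, so check what your own tokens allow. `estimate_minutes` is a hypothetical helper name, and it only counts time spent waiting for rate-limit windows to reset.

```python
import math

def estimate_minutes(n_followers,
                     ids_per_call=5000,             # `count` passed to get_followers_ids in this notebook
                     ids_calls_per_window=15,       # ASSUMED limit, check your own tokens
                     users_per_call=100,            # `chunk_size` used in this notebook
                     lookup_calls_per_window=900,   # ASSUMED limit, check your own tokens
                     window_secs=900,
                     downsample_rate=1):
    """Rough minutes spent waiting for rate-limit windows to reset."""
    ids_calls = math.ceil(n_followers / ids_per_call)
    n_chunks = math.ceil(n_followers / users_per_call)
    lookup_calls = math.ceil(n_chunks / downsample_rate)
    # The first window is "free"; every further full window of calls costs one wait.
    ids_waits = max(ids_calls - 1, 0) // ids_calls_per_window
    lookup_waits = max(lookup_calls - 1, 0) // lookup_calls_per_window
    return (ids_waits + lookup_waits) * window_secs / 60

print(estimate_minutes(200000))                     # 60.0
print(estimate_minutes(200000, downsample_rate=2))  # 45.0
```

Under those assumed limits, a 200k-follower account costs roughly an hour of waiting on top of the request time itself.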

In [None]:
import sys
import string
import simplejson
from twython import Twython
import matplotlib
import matplotlib.pyplot as plt

from datetime import datetime
from dateutil import parser as dateparser
import time

import twitter_creds # Your .py file in this folder with API_KEY, API_SECRET, TOKEN, TOKEN_SECRET variables assigned

#FOR OAUTH AUTHENTICATION -- NEEDED TO ACCESS THE TWITTER API
t = Twython(app_key=twitter_creds.API_KEY, 
    app_secret=twitter_creds.API_SECRET,
    oauth_token=twitter_creds.TOKEN,
    oauth_token_secret=twitter_creds.TOKEN_SECRET)


In [None]:
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))
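
A quick illustration of what `chunker` yields, using a toy list instead of real follower IDs (the list here is made up for the example). The last chunk is simply shorter when the length is not a multiple of `size`.

```python
def chunker(seq, size):
    # Same helper as in the cell above: lazily yield consecutive slices of `seq`
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

ids = list(range(7))            # stand-in for a list of follower IDs
chunks = list(chunker(ids, 3))
print(chunks)                   # [[0, 1, 2], [3, 4, 5], [6]]
```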

In [None]:
#ego_screenname -- the person you're looking to query for
#ego_screenname = "jugander"

# "Bot scholars" that may be useful to look at
#ego_screenname = "gilgul"

ego_screenname = "informor"
#ego_screenname = "suneman"
#ego_screenname = "yy"

ego_id = t.lookup_user(screen_name=ego_screenname)[0]['id_str']

In [None]:
# Get the follower list. 
# Rate limiting caps this at ~80000 followers every 900 seconds. Hacky pauses added. 
follower_list = []
nextcurs = -1
sleep_step=60

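while (nextcurs != 0):
    # Note: on the first pass these headers come from the lookup_user call above
    remaining = t.get_lastfunction_header(header='x-rate-limit-remaining')
    reset = t.get_lastfunction_header(header='x-rate-limit-reset')
    # Only check the allowance when both headers are actually present
    if isinstance(remaining, str) and isinstance(reset, str):
        secs = int(reset) - int(datetime.now().timestamp())
        if ((int(remaining) == 0) and (secs >= 0)):
            print("Waiting " + str(secs) + " seconds...")
            secs += 3 # extra
            # Sleep in short steps so progress is visible
            while secs > 0:
                print(str(secs) + " seconds remaining")
                time.sleep(min(secs, sleep_step))
                secs -= sleep_step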
    
    follower_object = t.get_followers_ids(user_id = ego_id, count = 5000, cursor = nextcurs)
    follower_list += follower_object['ids']
    nextcurs = follower_object['next_cursor']
    print(len(follower_list)) 
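
The waiting logic in the loop above can be pulled into a small pure function, which makes it easier to check without touching the API. This is only a sketch: `seconds_to_wait` and its parameters are hypothetical names, and it assumes the two header values arrive as strings (or `None` when missing), as the code in this notebook treats them.

```python
import time

def seconds_to_wait(remaining, reset_epoch, now_epoch=None, pad=3):
    """Seconds to sleep before the next call, given rate-limit header values.

    remaining   -- value of 'x-rate-limit-remaining' (str or None)
    reset_epoch -- value of 'x-rate-limit-reset', a Unix timestamp (str or None)
    pad         -- a few extra seconds of slack, as in the loop above
    """
    if remaining is None or reset_epoch is None:
        return 0                      # headers missing: nothing to go on
    if int(remaining) > 0:
        return 0                      # calls still available in this window
    if now_epoch is None:
        now_epoch = time.time()
    secs = int(reset_epoch) - int(now_epoch)
    return secs + pad if secs >= 0 else 0

print(seconds_to_wait("0", "1000", now_epoch=900))  # 103
print(seconds_to_wait("5", "1000", now_epoch=900))  # 0
```

In the notebook it could be fed from `t.get_lastfunction_header(header='x-rate-limit-remaining')` and `t.get_lastfunction_header(header='x-rate-limit-reset')` before each request.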

In [None]:
# Status of API allowances from last call
print("Allowed " + str(t.get_lastfunction_header(header='x-rate-limit-remaining')) +
      " more follow requests over the next " + 
      str(int(t.get_lastfunction_header(header='x-rate-limit-reset')) - int(datetime.now().timestamp())) +
        " seconds.")

len(follower_list)

In [None]:
# Once you have the followers, gather the create_times of those followers

# Optionally downsample
if (len(follower_list) > 20000):
    downsample_rate = 2
else:
    downsample_rate = 1

# Begin
i          = downsample_rate
k          = 0
ts_list_mk = []
chunk_size = 100

for follower_chunk in chunker(follower_list,chunk_size):
    k += chunk_size
    if (i == downsample_rate):
        i = 1
    else:
        i += 1
        continue
    
    if isinstance(t.get_lastfunction_header(header='x-rate-limit-reset'),str):
        remaining = t.get_lastfunction_header(header='x-rate-limit-remaining')
        secs = int(t.get_lastfunction_header(header='x-rate-limit-reset')) - int(datetime.now().timestamp())
        if ((int(remaining) == 0) and (secs >= 0)):
            print("Waiting " + str(secs) + " seconds...")
            time.sleep(secs+2)

    user_objects = t.lookup_user(user_id=follower_chunk)
    ts_mk        = [dateparser.parse(u['created_at']).timestamp() for u in user_objects]
    #    Next two lines will give you "date last posted" instead of "date created"
    #    user_objects = [u for u in user_objects if 'status' in u]  
    #    ts_mk        = [dateparser.parse(u['status']['created_at']).timestamp() for u in user_objects]
    ts_list_mk  += ts_mk
    print(str(len(ts_list_mk)) + " " + str(k)) 
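
The `i` counter in the loop above keeps the first chunk and then every `downsample_rate`-th chunk after it. The same selection can be sketched with `itertools.islice`; `sampled_chunks` is a hypothetical name, and the toy list stands in for real follower IDs.

```python
from itertools import islice

def chunker(seq, size):
    # Same helper as earlier in the notebook
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def sampled_chunks(seq, size, rate):
    """Yield every `rate`-th chunk of `seq`, starting with the first one."""
    return islice(chunker(seq, size), 0, None, rate)

followers = list(range(10))                 # stand-in for follower IDs
kept = list(sampled_chunks(followers, 2, 2))
print(kept)                                 # [[0, 1], [4, 5], [8, 9]]
```

Note that this samples whole chunks of consecutive followers rather than individual followers, so the fingerprint plot keeps short contiguous stretches of the follower sequence.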


In [None]:
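# Convert epoch seconds back to datetime objects (naive, in local time)
ts_list_dt = [datetime.fromtimestamp(x) for x in ts_list_mk]
# The follower list is assumed to come back most-recent-first,
# so reverse it to put the earliest follower first
ts_list_dt.reverse()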
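
For reference, here is a sketch of the round trip from a `created_at` string to an epoch timestamp and back, using only the standard library instead of `dateutil`. The sample string is made up but follows the format Twitter uses for `created_at`. Passing `tz=timezone.utc` gives a timezone-aware result, whereas the plain `datetime.fromtimestamp(x)` call above returns naive local time.

```python
from datetime import datetime, timezone

TWITTER_FMT = "%a %b %d %H:%M:%S %z %Y"          # layout of Twitter's created_at field
created_at = "Mon Nov 29 21:18:15 +0000 2010"    # made-up sample value

dt = datetime.strptime(created_at, TWITTER_FMT)  # timezone-aware datetime
epoch = dt.timestamp()                           # seconds since the Unix epoch
utc_dt = datetime.fromtimestamp(epoch, tz=timezone.utc)
print(utc_dt.isoformat())                        # 2010-11-29T21:18:15+00:00
```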

In [None]:
# API status again, from last call
print("Allowed " + str(t.get_lastfunction_header(header='x-rate-limit-remaining')) +
      " more user requests over the next " + 
      str(int(t.get_lastfunction_header(header='x-rate-limit-reset')) - int(datetime.now().timestamp())) +
        " seconds.")

len(ts_list_dt)

In [None]:
# Histogram of create_times, sometimes reveals things:

plt.hist(ts_list_dt,bins=100)
plt.show()

In [None]:
# The "createtime fingerprint"

start=1
stop=len(ts_list_dt)
plt.plot(range(len(ts_list_dt[start:stop])),ts_list_dt[start:stop],'r.',markersize=2);
fig = plt.gcf()
fig.set_size_inches(10, 6)
plt.xlabel('follower number')
plt.ylabel('account creation date')
plt.title(ego_screenname)
plt.show()
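
The plot is meant to be read by eye: a horizontal streak means many consecutive followers were created at nearly the same time. As a purely illustrative sketch, and not the method used in the NY Times story, one hypothetical way to flag such streaks is to look for long runs of consecutive followers whose creation times fit inside a narrow window. `find_bursts`, `window_secs`, and `min_run` are made-up names with arbitrary defaults, and the timestamps below are synthetic.

```python
def find_bursts(create_times, window_secs=86400, min_run=50):
    """Return (start, end) index pairs, end exclusive, for runs of consecutive
    followers whose creation timestamps all fit within `window_secs`."""
    bursts = []
    n = len(create_times)
    start = 0
    while start < n:
        lo = hi = create_times[start]
        end = start + 1
        # Greedily extend the run while its spread stays inside the window
        while end < n:
            new_lo = min(lo, create_times[end])
            new_hi = max(hi, create_times[end])
            if new_hi - new_lo > window_secs:
                break
            lo, hi = new_lo, new_hi
            end += 1
        if end - start >= min_run:
            bursts.append((start, end))
            start = end          # continue after the flagged run
        else:
            start += 1           # too short: try the next starting point
    return bursts

# Synthetic data: three scattered accounts, five created seconds apart, one more
times = [0, 1000000, 2000000] + [5000000 + i for i in range(5)] + [9000000]
print(find_bursts(times, window_secs=100, min_run=4))  # [(3, 8)]
```

With real data this would take `ts_list_mk` (reversed to follow order) as input; sensible values for the window and run length depend on the account and would need tuning by hand.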


### Other links:

* Sune's writing on bots (2013):
https://sunelehmann.com/2013/12/04/youre-here-because-of-a-robot/

* Gilad's writing on bots (2014):
https://medium.com/i-data/fake-friends-with-real-benefits-eec8c4693bd3

In [None]:
# Clear Twython object from notebook
t = None