The rough pipeline used to create this data is:
- Run `pull_sample.py` for as long as necessary to collect the data (I got about 100k tweets in 24 hours).
- Run `get_subset.py` to extract a random subset of the collected tweets, with optional filtering criteria (I selected 5k at random).
- Run `extract_users.py` to re-format the tweets as users.
- Run `get_followers.py` to get the followers of each user. There is a 60-second delay between requests because of Twitter's rate limiting, so I ran this on my server and left it alone for several days.
- Run `get_countries.py`. This uses the Bing Maps reverse-geocoding API to associate a country with each user and follower based on their "location" tag. This step is probabilistic, and errors were definitely introduced here.
- Run `generate_graph.py`. This produces a `.gexf` file of the follower graph suitable for import into Gephi.
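The subset step amounts to filtering and then sampling. Here is a minimal sketch of what `get_subset.py` might do; the function name, the JSON-lines input format, and the `keep` predicate are assumptions for illustration, not the script's actual interface:

```python
import json
import random

def sample_tweets(lines, k, keep=lambda t: True, seed=None):
    """Pick a random subset of k tweets from an iterable of JSON lines.

    `keep` is a hypothetical filter predicate applied before sampling
    (the real filtering criteria, if any, live in get_subset.py).
    """
    rng = random.Random(seed)
    tweets = [t for t in (json.loads(line) for line in lines) if keep(t)]
    # Guard against asking for more tweets than survived the filter.
    return rng.sample(tweets, min(k, len(tweets)))
```

Filtering before sampling keeps the subset uniform over the tweets that actually match the criteria, rather than sampling first and discarding.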
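The follower-fetching step is dominated by the enforced pause between requests. A hedged sketch of the pacing logic, where `get_followers` is a stand-in for whatever call hits Twitter's followers endpoint (not the actual code of `get_followers.py`):

```python
import time

def fetch_all_followers(users, get_followers, delay=60.0):
    """Collect follower lists for each user, pausing between requests.

    `get_followers` is a hypothetical callable wrapping the Twitter API;
    `delay` reflects the ~1 request/minute rate limit mentioned above.
    """
    followers = {}
    for i, user in enumerate(users):
        if i > 0:
            time.sleep(delay)  # stay under the rate limit
        try:
            followers[user] = get_followers(user)
        except Exception:
            # Skip users that can't be fetched (protected, deleted, etc.)
            followers[user] = []
    return followers
```

At one request per minute, 5k users takes roughly 5,000 minutes, i.e. about three and a half days of wall-clock time, which is consistent with leaving it running on a server for several days.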
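For the geocoding step, the fiddly part is digging the country out of a Bing Maps Locations API response. The helper below assumes the documented `resourceSets` → `resources` → `address.countryRegion` response shape; it is a sketch, not the actual code of `get_countries.py`:

```python
def country_from_location(response):
    """Extract the country from a Bing Maps Locations API response dict.

    Returns None when nothing matched, which is common for free-form
    "location" strings like "the moon" or "everywhere".
    """
    try:
        resources = response["resourceSets"][0]["resources"]
        return resources[0]["address"]["countryRegion"] if resources else None
    except (KeyError, IndexError):
        return None
```

Because the "location" field is free text, the first match the geocoder returns is only a guess, which is why this step is described as probabilistic.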
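A real `generate_graph.py` would likely lean on a library such as networkx (which can write GEXF directly), but the GEXF that Gephi needs for a plain directed graph is simple enough to emit by hand. A minimal stdlib sketch, where `edges` is an iterable of (follower, followed) pairs:

```python
import xml.etree.ElementTree as ET

def to_gexf(edges):
    """Serialize a directed follower graph as a minimal GEXF 1.2 string."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="directed")
    nodes = ET.SubElement(graph, "nodes")
    edge_list = ET.SubElement(graph, "edges")
    seen = set()
    for i, (src, dst) in enumerate(edges):
        # Declare each node once, the first time it appears in an edge.
        for node_id in (src, dst):
            if node_id not in seen:
                seen.add(node_id)
                ET.SubElement(nodes, "node", id=str(node_id), label=str(node_id))
        ET.SubElement(edge_list, "edge", id=str(i), source=str(src), target=str(dst))
    return ET.tostring(gexf, encoding="unicode")
```

Writing the returned string to a `.gexf` file gives something Gephi will open directly.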