The rough pipeline used to create this data is:
- Run `pull_sample.py` for as long as necessary to collect the data (I got about 100k tweets in 24 hours).
- Run `get_subset.py` to extract a random subset of the collected tweets, with optional filtering criteria (I selected 5k at random).
- Run `extract_users.py` to re-format the tweets as users.
- Run `get_followers.py` to get the followers of each user. There is a 60-second delay between requests because of Twitter's rate limiting, so I ran this on my server and left it alone for several days.
- Run `get_countries.py`. This uses the Bing Maps reverse-geocoding API to associate a country with each user and follower based on their "location" tag. This step is probabilistic, and errors were definitely introduced here.
- Run `generate_graph.py`. This produces a `.gexf` file of the follower graph suitable for import into Gephi.
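The subset step amounts to filtering and then sampling. Here is a minimal sketch of what `get_subset.py` might do; the function name, the JSON-lines input format, and the `keep` predicate are assumptions for illustration, not the script's actual interface:

```python
import json
import random

def sample_tweets(lines, k, keep=lambda t: True, seed=None):
    """Pick a random subset of k tweets from an iterable of JSON lines.

    `keep` is a hypothetical filter predicate applied before sampling
    (the real filtering criteria, if any, live in get_subset.py).
    """
    rng = random.Random(seed)
    tweets = [t for t in (json.loads(line) for line in lines) if keep(t)]
    # Guard against asking for more tweets than survived the filter.
    return rng.sample(tweets, min(k, len(tweets)))
```

Filtering before sampling keeps the subset uniform over the tweets that actually match the criteria, rather than sampling first and discarding.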
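The follower-fetching step is dominated by the enforced pause between requests. A hedged sketch of the pacing logic, where `get_followers` is a stand-in for whatever call hits Twitter's followers endpoint (not the actual code of `get_followers.py`):

```python
import time

def fetch_all_followers(users, get_followers, delay=60.0):
    """Collect follower lists for each user, pausing between requests.

    `get_followers` is a hypothetical callable wrapping the Twitter API;
    `delay` reflects the ~1 request/minute rate limit mentioned above.
    """
    followers = {}
    for i, user in enumerate(users):
        if i > 0:
            time.sleep(delay)  # stay under the rate limit
        try:
            followers[user] = get_followers(user)
        except Exception:
            # Skip users that can't be fetched (protected, deleted, etc.)
            followers[user] = []
    return followers
```

At one request per minute, 5k users takes roughly 5,000 minutes, i.e. about three and a half days of wall-clock time, which is consistent with leaving it running on a server for several days.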
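For the geocoding step, the fiddly part is digging the country out of a Bing Maps Locations API response. The helper below assumes the documented `resourceSets` → `resources` → `address.countryRegion` response shape; it is a sketch, not the actual code of `get_countries.py`:

```python
def country_from_location(response):
    """Extract the country from a Bing Maps Locations API response dict.

    Returns None when nothing matched, which is common for free-form
    "location" strings like "the moon" or "everywhere".
    """
    try:
        resources = response["resourceSets"][0]["resources"]
        return resources[0]["address"]["countryRegion"] if resources else None
    except (KeyError, IndexError):
        return None
```

Because the "location" field is free text, the first match the geocoder returns is only a guess, which is why this step is described as probabilistic.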
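A real `generate_graph.py` would likely lean on a library such as networkx (which can write GEXF directly), but the GEXF that Gephi needs for a plain directed graph is simple enough to emit by hand. A minimal stdlib sketch, where `edges` is an iterable of (follower, followed) pairs:

```python
import xml.etree.ElementTree as ET

def to_gexf(edges):
    """Serialize a directed follower graph as a minimal GEXF 1.2 string."""
    gexf = ET.Element("gexf", xmlns="http://www.gexf.net/1.2draft", version="1.2")
    graph = ET.SubElement(gexf, "graph", defaultedgetype="directed")
    nodes = ET.SubElement(graph, "nodes")
    edge_list = ET.SubElement(graph, "edges")
    seen = set()
    for i, (src, dst) in enumerate(edges):
        # Declare each node once, the first time it appears in an edge.
        for node_id in (src, dst):
            if node_id not in seen:
                seen.add(node_id)
                ET.SubElement(nodes, "node", id=str(node_id), label=str(node_id))
        ET.SubElement(edge_list, "edge", id=str(i), source=str(src), target=str(dst))
    return ET.tostring(gexf, encoding="unicode")
```

Writing the returned string to a `.gexf` file gives something Gephi will open directly.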