# Finding Guido

In [72]:
import os

import pandas as pd
from dotenv import load_dotenv
from tweepy import Client, Paginator

# read TWITTER_BEARER_TOKEN from a local .env file into the environment
load_dotenv()

bearer_token = os.environ.get("TWITTER_BEARER_TOKEN")
client = Client(bearer_token)

We'll start by looking at the network of Guido van Rossum, creator of Python and former "Benevolent Dictator for Life". 

In [2]:
user_response = client.get_user(username="gvanrossum", user_fields=["description", "location", "public_metrics"])

We can retrieve GVR's Twitter id, which is required for API calls. We can also look at his public metrics. He has almost 270,000 accounts following him!

In [3]:
gvr = user_response[0].id
user_response[0].public_metrics

{'followers_count': 268158,
 'following_count': 514,
 'tweet_count': 3553,
 'listed_count': 4702}

Who does Guido van Rossum follow? Because I find it really confusing to talk about followers and following, I'm going to refer to the accounts that GVR follows as his "subscriptions". (Originally I made the mistake of using "subscriptions" instead of "following"; don't do that!) GVR subscribes to just over 500 accounts. Conveniently, the user objects from tweepy map very nicely to pandas dataframes. A little pandas magic with json_normalize lets us explode the "public_metrics" dictionary into new columns in our frame.
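To see what the json_normalize step does in isolation, here is a minimal sketch on two made-up user records. The usernames and numbers are invented for illustration and are not real API output.

```python
import pandas as pd

# Two hypothetical user records shaped like the fields we request from the API.
toy_users = pd.DataFrame([
    {"id": 1, "username": "alice",
     "public_metrics": {"followers_count": 10, "following_count": 3}},
    {"id": 2, "username": "bob",
     "public_metrics": {"followers_count": 250, "following_count": 40}},
])

# json_normalize turns the list of dicts into one column per key.
metrics = pd.json_normalize(toy_users["public_metrics"].tolist())

# Both frames share the default 0..n-1 index, so join lines the rows up.
toy_flat = toy_users.drop(columns="public_metrics").join(metrics)

print(toy_flat.columns.tolist())
```

The nested dictionary column is gone and each metric is an ordinary column that can be filtered or summarized directly.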

In [109]:
import pandas as pd

def get_subscriptions_df(id):
    """ 
    Function retrieves subscriptions (accounts that a user is following) for a given user id. 
    Function uses pagination if the user account has more than 1000 subscriptions, which is the limit 
    of what the Twitter api will return in one call. 
    """
    client = Client(bearer_token=bearer_token)
    subscriptions = []
    for subscription in Paginator(client.get_users_following, id, user_fields=["description", "location", "public_metrics"], max_results=1000):
        if subscription.data is None:
            return None
        else:
            subscriptions.extend(subscription.data)
    subscriptions_df = pd.DataFrame(subscriptions)

    
    # public metrics is a dictionary containing tweet and follower counts. 
    # we merge it with the original dataframe. 
    subscriptions_df = subscriptions_df.merge(
        pd.json_normalize(subscriptions_df.public_metrics), left_index=True, right_index=True).\
        drop("public_metrics", axis="columns")
    subscriptions_df["subscriber"] = id

    return subscriptions_df



In [5]:
gvr_subscriptions_df = get_subscriptions_df(gvr)
gvr_subscriptions_df.head()

| | description | id | location | name | username | followers_count | following_count | tweet_count | listed_count | subscriber |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I work @Microsoft, making Python more awesome ... | 14076724 | Redmond, WA | John Lam | john_lam | 4406 | 586 | 6182 | 294 | 15804774 |
| 1 | Intl. Relations & MPP Grad 🌍 noob intersection... | 187196955 | Berlin | Camila Gutiérrez (she/her) | Mariacamilagl30 | 695 | 444 | 259 | 9 | 15804774 |
| 2 | Creator of @datasetteproj, co-creator Django. ... | 12497 | San Francisco, CA | Simon Willison | simonw | 41380 | 5174 | 41106 | 1594 | 15804774 |
| 3 | Partner Software Architect at Microsoft on .NE... | 19481808 | Redmond | David Fowler 🇧🇧🇺🇸 | davidfowl | 103795 | 1318 | 61361 | 1607 | 15804774 |
| 4 | The #1 Python-focused podcast covering the peo... | 3098427092 | Portland, OR USA | Talk Python Podcast | TalkPython | 61473 | 4043 | 6076 | 1342 | 15804774 |


We can take a look around at the accounts that Guido is following. They have a very wide range of "subscribers", from 16 to 19,250,000. At the top end, some of the accounts are for celebrities, including: 

- Edward Snowden
- Samantha Power, former US Ambassador to the UN
- Dave Matthews
- Grant Imahara
- Leonard Nimoy

We can also see a lot of organizations, such as Meta, BBC Breaking News, the Mars Rover, etc. For both sets of accounts, it is interesting that so many of them have very small subscription counts of their own, much more in line with typical people. Al Gore, for instance, has 2.9 million subscribers but is subscribing to just 40 accounts himself.

We are interested in the subscriptions for these accounts because if one of them is following GVR back, it's a pretty crucial indicator that they are part of the Python universe (though a number of them could be friends, family, or institutions with personal connections). If we find such a person, we're also interested in who they follow, because those accounts are likely to be part of the community as well, and it gives us a sense of where they sit and what they're interested in within the Python ecosystem.
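The "follows back" check described above can be sketched on a toy edge list. This is only an illustration in plain Python; the account names are made up, and `follows_back` is a hypothetical helper, not part of the notebook.

```python
# Each edge is (subscriber, followed_account); the usernames here are made up.
edges = [
    ("gvr", "alice"),
    ("gvr", "bob"),
    ("gvr", "news_org"),
    ("alice", "gvr"),
    ("alice", "bob"),
    ("bob", "carol"),
]


def follows_back(edges, seed):
    """Return the accounts that the seed follows and that also follow the seed."""
    edge_set = set(edges)
    followed_by_seed = {target for source, target in edge_set if source == seed}
    return sorted(a for a in followed_by_seed if (a, seed) in edge_set)


mutuals = follows_back(edges, "gvr")
print(mutuals)  # ['alice']
```

Note that answering this question for any account requires that account's full subscription list, which is exactly where the API limits start to hurt.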

We are left with a problem, though. To find out if someone is following GVR back, we need to get their list of subscriptions. This would leave us with 508,982 user records to sort through! Even getting that list would be a difficult problem. The Twitter API will return a maximum of 1000 subscriptions per call, and there is a rate limit of 15 calls per 15 minutes. Getting the entire list, then, would take a long time and would need to be babysat constantly. To get under this limit, we will need to take a different approach. As we are looking for the structure of the Python ecosystem, we will focus on GVR's subscriptions that mention Python in their descriptions.
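A quick back-of-the-envelope calculation shows why this is impractical. The sketch below uses only the numbers quoted above and assumes the best case, where every call returns a full page of 1000 records; the real cost would be higher, because each account needs at least one call of its own and most pages would not be full.

```python
import math

total_records = 508_982   # user records across all of GVR's subscriptions (from the text)
max_per_call = 1000       # maximum subscriptions returned per API call
calls_per_window = 15     # rate limit: calls allowed per window
window_minutes = 15       # length of one rate-limit window

# Best case: every call returns a full page of 1000 records.
min_calls = math.ceil(total_records / max_per_call)
min_windows = math.ceil(min_calls / calls_per_window)
min_hours = min_windows * window_minutes / 60

print(min_calls, min_windows, min_hours)  # 509 34 8.5
```

Even under these optimistic assumptions, the job spans 34 rate-limit windows, or roughly eight and a half hours.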

In [66]:
# match "Python" and "python"; na=False treats missing descriptions as non-matches
python_subscriptions_df = gvr_subscriptions_df[gvr_subscriptions_df.description.str.contains("ython", na=False)].copy()

This leaves us with about 100 accounts. Of these, at least 75% have fewer than 1000 subscriptions and could be retrieved in one call. We'll start by getting their connections.

In [67]:
python_subscriptions_df["following_count"].describe()

count     101.000000
mean      685.336634
std      1111.875202
min         0.000000
25%       113.000000
50%       311.000000
75%       772.000000
max      6644.000000
Name: following_count, dtype: float64

We start by getting "regular" users, defined for our purposes as anyone with 1000 or fewer subscriptions. They are special because we can retrieve their entire subscription list in a single call to the Twitter API.

In [68]:
regular_user_ids = python_subscriptions_df.query("following_count <= 1000").id.values
subscriptions_frames = []
len(regular_user_ids)

83

I had thought about using a try/except block here to catch the TooManyRequests exception, but I wasn't sure what to do when it happened. The rate limit is a known quantity more than an exception, which means it's better for the code to respect it explicitly than to crash because it exceeds the limit. If we were going to handle it, it would make sense to do so in our get_subscriptions_df function rather than here, as that code would also have to handle the case where an account has more than 1000 subscriptions and the call has to be paginated. I have included logic for recovery, though: every user id whose call completes gets added to a list of processed ids. If we have to run the loop again, we only make calls for users that we have not yet processed. The outer loop makes 15 calls at a time to make sure we stay under the Twitter rate limit. After each batch of 15, we sleep for 930 seconds.

In [104]:
from time import sleep

# Keep processed_ids from an earlier (interrupted) run of this cell, if any,
# so that re-running the cell resumes with the first unprocessed user.
try:
    processed_ids
except NameError:
    processed_ids = []

for i in range(len(processed_ids), len(regular_user_ids), 15):
    print(f"Processing {i} to {i+15} of {len(regular_user_ids)}")
    for user_id in regular_user_ids[i:i+15]:
        user_subscriptions_df = get_subscriptions_df(user_id)
        if user_subscriptions_df is not None:
            subscriptions_frames.append(user_subscriptions_df)
        # record the id only after its call has returned
        processed_ids.append(user_id)
    # sleep 15 minutes (+30s to be careful) so we don't exceed the rate limit
    sleep(930)

Processing 39 to 54 of 83
Processing 54 to 69 of 83
Processing 69 to 84 of 83
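The recovery idea can be tested without touching the API. The sketch below is a hypothetical helper, `process_in_batches`, that takes the fetch and pause steps as arguments so they can be replaced with stand-ins; in the notebook they would correspond to `get_subscriptions_df` and `sleep(930)`. It is an illustration of the pattern, not the code that produced the output above.

```python
def process_in_batches(user_ids, fetch, processed_ids, batch_size=15, pause=lambda: None):
    """Call fetch(user_id) for every id not yet in processed_ids, pausing between batches."""
    results = []
    done = set(processed_ids)
    remaining = [uid for uid in user_ids if uid not in done]
    for start in range(0, len(remaining), batch_size):
        for uid in remaining[start:start + batch_size]:
            result = fetch(uid)
            if result is not None:
                results.append(result)
            processed_ids.append(uid)  # record progress only after the call returned
        if start + batch_size < len(remaining):
            pause()  # skipped after the final batch, where waiting is pointless
    return results


# Simulate an interrupted run: ids 0-38 were already processed.
all_ids = list(range(83))
already_processed = list(range(39))
pauses = []
fetched = process_in_batches(
    all_ids,
    fetch=lambda uid: uid,           # stand-in for the real API call
    processed_ids=already_processed,
    pause=lambda: pauses.append(1),  # stand-in for sleep(930)
)
print(len(fetched), len(pauses))  # 44 2
```

Filtering by id rather than by position means the resume point stays correct even when some calls return nothing.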


We concatenate the results of our calls and save the work so we can resume from this point if we have to reload the notebook. 

In [108]:
pd.concat(subscriptions_frames).to_parquet("light_subscribers.parquet")

The remaining accounts present an interesting intellectual problem / pain in the butt. Given pagination limits and rate limits, if we wanted to process accounts that follow more than 1000 accounts efficiently, we would have to solve something like "the knapsack problem". That is, we would have to make sure that we were not requesting more than 15,000 subscriptions (15 calls of up to 1000 each) in any given round of calls. As there are only 18 such individuals, I'll take the hit of waiting 15 minutes after each account, despite the inefficiency.
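For the curious, here is a rough sketch of what the more efficient approach might look like. Grouping accounts into as few 15-call windows as possible is closer to what is usually called bin packing, a relative of the knapsack problem, and a simple greedy heuristic (first-fit decreasing) gets reasonably close to optimal. The function name and the following counts below are hypothetical, and this is not what the notebook actually runs.

```python
import math


def pack_into_windows(following_counts, calls_per_window=15, page_size=1000):
    """Greedy first-fit-decreasing grouping of accounts into rate-limit windows."""
    # Each account costs ceil(following_count / page_size) calls, and at least one.
    costs = {uid: max(1, math.ceil(n / page_size)) for uid, n in following_counts.items()}
    windows = []  # each window is [calls_used, [user ids]]
    for uid in sorted(costs, key=costs.get, reverse=True):
        if costs[uid] > calls_per_window:
            raise ValueError(f"{uid} needs more calls than one window allows")
        for window in windows:
            if window[0] + costs[uid] <= calls_per_window:
                window[0] += costs[uid]
                window[1].append(uid)
                break
        else:
            # no existing window has room, so open a new one
            windows.append([costs[uid], [uid]])
    return windows


# Hypothetical following counts, not the real accounts.
toy_counts = {"a": 6644, "b": 4100, "c": 2500, "d": 1200, "e": 1001, "f": 3900}
toy_windows = pack_into_windows(toy_counts)
print(toy_windows)  # [[15, ['a', 'b', 'c']], [8, ['f', 'd', 'e']]]
```

In this toy case six accounts fit into two windows instead of six, which is the saving being traded away for simplicity.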

In [111]:
heavy_subscriber_userids = python_subscriptions_df.query("""following_count > 1000""").id.values
heavy_subscribers = []

In [112]:
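for userid in heavy_subscriber_userids:
    # pass the loop variable itself, not a leftover user_id from the earlier cell
    subscriptions = get_subscriptions_df(userid)
    if subscriptions is not None:
        heavy_subscribers.append(subscriptions)
    # wait out a full 15-minute rate-limit window before the next account
    sleep(900)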


We save our work here as well to make sure we don't have to do this again!

In [114]:
pd.concat(heavy_subscribers).to_parquet("heavy_subscribers.parquet")

Finally, we can concatenate both sets of data and save it as a single frame. 

In [119]:
subscriptions_df = pd.concat([pd.concat(subscriptions_frames), pd.concat(heavy_subscribers)])
subscriptions_df.to_parquet("gvr_subscribers.parquet")
subscriptions_df

We have to remember to add back GVR's own subscriptions!

In [137]:
subscribers_with_gvr_df = pd.concat([subscriptions_df, gvr_subscriptions_df])

There are 30000 subscriptions in our fledgling network for about 17000 individuals and 80 seeds. Time to bring in networkx!

In [142]:
import networkx as nx

python_twitterverse = nx.from_pandas_edgelist(subscribers_with_gvr_df, source="subscriber", target="id", create_using=nx.DiGraph())

I thought it would be nice to bring in the user attributes to associate with each node. I thought that would be as simple as removing our "subscriber" field from the subscriptions dataframe and dropping duplicates.  I had a surprise, though - in between my calls, the public metrics of certain users changed! This highlights the really dynamic nature of Twitter and is a cautionary tale about thinking it is possible to have a truly complete sense of the Python (or any) twitterverse, even with infinite computing power and no rate limits. 

In [143]:
users_df = subscriptions_df.drop("subscriber", axis="columns").drop_duplicates().copy()
users_df[users_df.id==users_df[users_df.id.duplicated()].id.values[0]]

| | description | id | location | name | username | followers_count | following_count | tweet_count | listed_count |
|---|---|---|---|---|---|---|---|---|---|
| 114 | 46th President of the United States, husband t... | 1349149096909668363 | | President Biden | POTUS | 25928667 | 12 | 4146 | 19997 |
| 4 | 46th President of the United States, husband t... | 1349149096909668363 | | President Biden | POTUS | 25928668 | 12 | 4146 | 19997 |
| 27 | 46th President of the United States, husband t... | 1349149096909668363 | | President Biden | POTUS | 25929994 | 12 | 4146 | 19997 |
| 198 | 46th President of the United States, husband t... | 1349149096909668363 | | President Biden | POTUS | 25934541 | 12 | 4148 | 20003 |
| 33 | 46th President of the United States, husband t... | 1349149096909668363 | | President Biden | POTUS | 25932578 | 12 | 4148 | 20000 |
| 114 | 46th President of the United States, husband t... | 1349149096909668363 | | President Biden | POTUS | 25932599 | 12 | 4148 | 20000 |


We drop duplicates based on the id column only, then set the index for the frame to id. This lets us make a dictionary mapping user ids to their attributes, which is what networkx needs from us. 

In [144]:
users_df = users_df.drop_duplicates(subset=["id"], keep="last")
users_df = users_df.set_index("id")
users_df = users_df.fillna("Not Provided")
users_dict = users_df.to_dict(orient="index")
id2user_dict = users_df["username"].to_dict()
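The difference between the two deduplication steps is easy to see on a small made-up frame. The sketch below uses invented snapshots of one account whose follower count changed between calls; the ids, names, and counts are illustrative only.

```python
import pandas as pd

# Hypothetical snapshots of one account captured at different times, plus a second account.
snapshots = pd.DataFrame({
    "id": [1, 1, 1, 2],
    "username": ["busy_account", "busy_account", "busy_account", "someone"],
    "followers_count": [100, 100, 105, 7],
})

# Exact-row deduplication only removes the one identical snapshot.
exact = snapshots.drop_duplicates()
print(len(exact))  # 3

# Deduplicating on id keeps one row per account (here, the last one seen).
latest = snapshots.drop_duplicates(subset=["id"], keep="last").set_index("id")
print(latest.to_dict(orient="index"))
```

With `keep="last"` the retained row is simply the last one in frame order, which is not guaranteed to be the most recent snapshot unless the frame is ordered by retrieval time.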

We can now set attributes for our nodes. We also relabel them to use the usernames instead of ids, which will make our graph a little easier to play with. 

In [145]:
# enhance nodes by adding user details

nx.set_node_attributes(python_twitterverse, users_dict)
nx.relabel_nodes(python_twitterverse, users_df["username"].to_dict(), copy=False)

<networkx.classes.digraph.DiGraph at 0x7f8c67c23a00>

Finally, we write the network to a common graph file format, graphml. This can be read by networkx and also by other useful tools such as Gephi. 

In [146]:
nx.write_graphml(python_twitterverse, "gvr_twitterverse.graphml")