# Finding Guido

In [1]:
import pandas as pd
from dotenv import load_dotenv
load_dotenv()
import os
from tweepy import Client, Paginator

bearer_token = os.environ.get("TWITTER_BEARER_TOKEN")
client = Client(bearer_token)

We'll start by looking at the network of Guido van Rossum, creator of Python and former "Benevolent Dictator for Life". 

In [2]:
user_response = client.get_user(username="gvanrossum", user_fields=["description", "location", "public_metrics"])

In [3]:
gvr = user_response[0].id
user_response[0].public_metrics

{'followers_count': 268158,
 'following_count': 514,
 'tweet_count': 3553,
 'listed_count': 4702}

Who does Guido van Rossum follow? He has almost 270K followers of his own. Because I find it really confusing to talk about "followers" and "following" side by side, I'm going to refer to the accounts that GVR follows as his "subscriptions", and to the accounts that follow someone as their "subscribers". (The API itself says "following", not "subscriptions"; I originally mixed the two up, don't do that!) GVR subscribes to just over 500 accounts. Conveniently, the user objects from tweepy map very nicely to pandas dataframes. A little pandas magic with json_normalize lets us explode the "public_metrics" dictionary into new columns in our frame.
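To make the json_normalize step concrete, here is a minimal sketch on made-up records. The frame `toy_users` and its values are invented for illustration and only mimic the shape of the user objects; nothing here calls the Twitter API.

```python
import pandas as pd

# Toy records shaped like the user objects (values are invented for illustration).
toy_users = pd.DataFrame([
    {"id": 1, "username": "alice",
     "public_metrics": {"followers_count": 10, "following_count": 3}},
    {"id": 2, "username": "bob",
     "public_metrics": {"followers_count": 25, "following_count": 7}},
])

# json_normalize turns the list of dicts into one column per key,
# indexed 0..n-1, which lines up with the default index of toy_users.
metrics = pd.json_normalize(toy_users["public_metrics"].tolist())

# Swap the nested column for the flattened ones.
flat_users = toy_users.drop(columns="public_metrics").join(metrics)
print(flat_users)
```

The join relies on both frames sharing the same default index, which is also why the real function merges on the index rather than on a key column.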

In [4]:
import pandas as pd

def get_subscriptions_df(user_id):
    """
    Retrieve subscriptions (accounts that a user is following) for a given user id.
    Uses pagination if the account has more than 1000 subscriptions, which is the
    most the Twitter API will return in one call.
    """
    subscriptions = []
    for page in Paginator(client.get_users_following, user_id, user_fields=["description", "location", "public_metrics"], max_results=1000):
        # page.data is None when a page comes back empty
        subscriptions.extend(page.data or [])
    subscriptions_df = pd.DataFrame(subscriptions)
    if subscriptions_df.empty:
        return subscriptions_df

    # public_metrics is a dictionary containing tweet and follower counts.
    # We expand it into columns and merge it with the original dataframe.
    subscriptions_df = subscriptions_df.merge(
        pd.json_normalize(subscriptions_df.public_metrics), left_index=True, right_index=True
    ).drop("public_metrics", axis="columns")
    subscriptions_df["subscriber"] = user_id
    return subscriptions_df



In [5]:
gvr_subscriptions_df = get_subscriptions_df(gvr)
gvr_subscriptions_df.head()

Unnamed: 0,description,id,location,name,username,followers_count,following_count,tweet_count,listed_count,subscriber
0,"I work @Microsoft, making Python more awesome ...",14076724,"Redmond, WA",John Lam,john_lam,4406,586,6182,294,15804774
1,Intl. Relations & MPP Grad 🌍 noob intersection...,187196955,Berlin,Camila Gutiérrez (she/her),Mariacamilagl30,695,444,259,9,15804774
2,"Creator of @datasetteproj, co-creator Django. ...",12497,"San Francisco, CA",Simon Willison,simonw,41380,5174,41106,1594,15804774
3,Partner Software Architect at Microsoft on .NE...,19481808,Redmond,David Fowler 🇧🇧🇺🇸,davidfowl,103795,1318,61361,1607,15804774
4,The #1 Python-focused podcast covering the peo...,3098427092,"Portland, OR USA",Talk Python Podcast,TalkPython,61473,4043,6076,1342,15804774


We can take a look around at the accounts that Guido is following. They have a very wide range of "subscribers", from 16 to 19,250,000. At the top end, some of the accounts are for celebrities, including: 

- Edward Snowden
- Samantha Power, former US Ambassador to the UN
- Dave Matthews
- Grant Imahara
- Leonard Nimoy

We can also see a lot of organizations, such as Meta, BBC Breaking News, the Mars Rover, etc. For both sets of accounts, it is interesting that so many of them have very small subscription counts of their own, much more in line with typical people. Al Gore, for instance, has 2.9 million subscribers but subscribes to just 40 accounts himself.

We are interested in the subscriptions for these accounts because if one of them is following GVR back, it's a pretty strong indicator that they are part of the Python universe. If we find such a person, we're also interested in who they follow, because those accounts are likely to be part of the community as well, and it gives us a sense of where they sit and what they're interested in within the Python ecosystem.

We are left with a problem, though. To find out if someone is following GVR back, we need to get their list of subscriptions. This would leave us 508,982 user records to sort through! Even getting that list would be a difficult problem. The Twitter API will return a maximum of 1000 subscriptions per call, and there is a rate limit of 15 calls in 15 minutes. Getting the entire list, then, would take a long time and would need to be babysat constantly. To stay under this limit, we will need to take a different approach. As a first step, we want to see just how many accounts have more subscriptions than the 1000-per-page limit.
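The arithmetic behind that worry can be sketched as follows. This is a rough estimate only, assuming the limits quoted above (1000 results per call, 15 calls per 15 minutes); the helper `estimate_api_cost` is a hypothetical name, and the sample counts are the following_count values of the first five rows shown earlier.

```python
import math

PAGE_SIZE = 1000        # maximum subscriptions returned per call
CALLS_PER_WINDOW = 15   # rate limit: calls allowed per window
WINDOW_MINUTES = 15     # length of one rate-limit window

def estimate_api_cost(following_counts):
    """Return (calls, minutes_waiting) needed to page through every list."""
    # Each account needs one call per page of up to PAGE_SIZE subscriptions.
    calls = sum(math.ceil(count / PAGE_SIZE) for count in following_counts if count > 0)
    windows = math.ceil(calls / CALLS_PER_WINDOW)
    # The first window starts immediately; every later window costs a full wait.
    minutes_waiting = max(windows - 1, 0) * WINDOW_MINUTES
    return calls, minutes_waiting

sample_counts = [586, 444, 5174, 1318, 4043]
calls, minutes_waiting = estimate_api_cost(sample_counts)
print(calls, minutes_waiting)
```

Applied to the whole `following_count` column, the same function gives a lower bound on how long a full crawl would take, since it ignores network failures and retries.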

In [63]:
gvr_subscriptions_df.query("""following_count > 1000""")

Unnamed: 0,description,id,location,name,username,followers_count,following_count,tweet_count,listed_count,subscriber
2,"Creator of @datasetteproj, co-creator Django. ...",12497,"San Francisco, CA",Simon Willison,simonw,41380,5174,41106,1594,15804774
3,Partner Software Architect at Microsoft on .NE...,19481808,Redmond,David Fowler 🇧🇧🇺🇸,davidfowl,103795,1318,61361,1607,15804774
4,The #1 Python-focused podcast covering the peo...,3098427092,"Portland, OR USA",Talk Python Podcast,TalkPython,61473,4043,6076,1342,15804774
5,"Python, Cloud and OSS at Microsoft. Author of ...",18185983,"Sydney, Australia",Anthony Shaw 🇦🇺🤝🇺🇦,anthonypjshaw,21153,2827,19483,499,15804774
13,Symmathecist. Engineering Manager of Developer...,25103,"St. Louis, MO",Jessica Joy Kerr,jessitron,37994,2426,21012,1265,15804774
...,...,...,...,...,...,...,...,...,...,...
502,@missionbit board member & past lead instructo...,755178,"San Francisco, CA",Bob Ippolito,etrepum,5060,3145,16985,275,15804774
507,"Googler who looks after Open Source. Dad, Husb...",44423,Washington State,Chris DiBona,cdibona,41803,3046,20642,1673,15804774
508,Frequent fliers know lie-flat options in coach...,15341387,"New York, NY",Asya Kamsky,asya999,1585,1200,9338,80,15804774
511,She/her \nGeek Bridge addict \nGarden to Table...,8799612,Sillycon Valley,annaraven is so tired of this shit.,annaraven,1281,2145,20224,77,15804774


Surely there's a more elegant way to do this.

In [None]:
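subscription_frames = []

# call the accounts with <1000 subscriptions first, since each needs only one API call
sub1k_ids = gvr_subscriptions_df.query("""following_count < 1000""").id.values
for user_id in sub1k_ids:
    subscription_frames.append(get_subscriptions_df(user_id))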

In [44]:
python_subscriptions_df = gvr_subscriptions_df[gvr_subscriptions_df.description.str.contains("ython", na=False)].copy()

In [54]:
user_ids = python_subscriptions_df.query("""following_count < 750""").id.values

In [55]:
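from time import sleep
subscriptions_frames = []

# The API allows 15 calls per 15-minute window, so fetch in batches of 15
# and wait out the window after each batch.
for i in range(len(subscriptions_frames), len(user_ids), 15):
    for user_id in user_ids[i:i+15]:
        subscriptions_frames.append(get_subscriptions_df(user_id))
    sleep(15 * 60)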

ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

In [56]:
len(subscriptions_frames)

5
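A dropped connection like the one above loses the rest of the batch. One way to soften that is a small retry wrapper with exponential backoff. This is a sketch, not what the notebook actually ran: `fetch_with_retry` is a hypothetical helper, and `retry_on` defaults to Python's built-in `ConnectionError`. For an HTTP client built on requests you would likely need to pass `requests.exceptions.ConnectionError` instead, since that class does not inherit from the built-in one. Separately, tweepy's `Client` accepts `wait_on_rate_limit=True`, which should cover rate-limit responses but not dropped connections.

```python
import time

def fetch_with_retry(fetch, user_id, retries=3, base_delay=1.0,
                     retry_on=(ConnectionError,), sleep=time.sleep):
    """Call fetch(user_id), retrying with exponential backoff on connection errors."""
    for attempt in range(retries + 1):
        try:
            return fetch(user_id)
        except retry_on:
            if attempt == retries:
                # Out of retries: let the caller see the original error.
                raise
            # Wait 1x, 2x, 4x, ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
```

In the loop above, `get_subscriptions_df(user_id)` could then be swapped for `fetch_with_retry(get_subscriptions_df, user_id)`. The `sleep` parameter exists only so the waiting can be stubbed out when testing.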

In [8]:
gvr_subscriptions_df.sort_values(by="followers_count").tail(50)

Unnamed: 0,description,id,location,name,username,followers_count,following_count,tweet_count,listed_count,subscriber
342,Tweets about SciPy (Scientific Python) and rel...,254791849,Houston,Scientific Python,SciPyTip,161528,17,7027,3123,15804774
320,A high-level Python Web framework that encoura...,191225303,Web,Django,djangoproject,165249,69,588,2470,15804774
204,Videos and podcasts by @BradyHaran,393621632,MSRI,Numberphile,numberphile,172377,140,5260,1871,15804774
400,"NYC gay techie. Founder of Fog Creek, Trello, ...",15948437,"New York, NY",Joel Spolsky,spolsky,178228,603,7759,5940,15804774
89,"Programmer, coach coach, artist, pokerist, sin...",16891384,"San Francisco, CA",Kent Beck 🌻,KentBeck,185191,822,16301,5384,15804774
12,CEO of Epic Games. We make games and to make o...,1686323288,,Tim Sweeney,TimSweeneyEpic,194957,285,12910,1479,15804774
377,,74979757,"The Hague, The Netherlands",Wim de Bie,wimdebie,211671,46,2693,1674,15804774
148,Apparently still a film director.\nDoesn't lik...,421944319,Somewhere in London,Terry Gilliam,TerryGilliam,231762,26,796,1881,15804774
363,"Creator of @jquery, Chief Software Architect a...",752673,"Hudson Valley, NY",John Resig,jeresig,257120,2649,10252,9168,15804774
19,I draw the comic xkcd,21146468,,Randall Munroe,xkcd,266488,1,742,1950,15804774


In [39]:
seed_users = client.get_users(usernames=["wesmckinn", "honnibal", "_inesmontani", "pwang"]).data

In [40]:
seed_ids = [user.id for user in seed_users]

In [41]:
subscription_frames = []
for seed_id in seed_ids: 
    subscriptions_df = get_subscriptions_df(seed_id)
    subscription_frames.append(subscriptions_df)

  subscriptions_df = pd.concat(subscription_frames, gvr_subscriptions_df)


TypeError: unhashable type: 'DataFrame'

In [46]:
subscriptions_df = pd.concat(subscription_frames)
subscriptions_df = pd.concat([subscriptions_df, gvr_subscriptions_df])
subscriptions_df.to_parquet("subscriptions.parquet")

We would be interested in anyone that at least 3 of these people follow. We don't need network analysis for this one; we can just look at counts by id. There are 68 accounts followed by at least 3 of our seeds, which may be a manageable number to work with.

In [49]:
shared_subscribers = subscriptions_df.id.value_counts()
shared_subscribers = shared_subscribers[shared_subscribers > 2].index
len(shared_subscribers)

68
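Counting rows with value_counts works as long as each seed's subscriptions were fetched exactly once. If a seed could appear twice in the combined frame, counting distinct subscribers per account is a little safer. A minimal sketch on made-up edges (the `edges` frame is invented for illustration):

```python
import pandas as pd

# Toy edge list: each row says "subscriber follows id".
edges = pd.DataFrame({
    "subscriber": ["a", "a", "b", "c", "c", "d"],
    "id":         [1,   2,   1,   1,   2,   3],
})

# Number of distinct seeds following each account.
seeds_per_account = edges.groupby("id")["subscriber"].nunique()

# Keep accounts followed by at least 3 different seeds.
shared = seeds_per_account[seeds_per_account >= 3].index.tolist()
print(shared)
```

Here only account 1 is followed by three different subscribers, so it is the only one that survives the filter.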

Let's take a closer look at these individuals. As we know already that they're in both sets, we'll look at them in GVR's subscriptions. There is a mix of people and organizations in the list. We are more interested in the people. Of those people, we are more interested, for this analysis, in people who mention Python in their bios. We have several directions we could take from here: 
- Jake VanderPlas, if we were interested in Astronomical / Scientific Python
- Fernando Perez, creator of IPython, if we are interested in the Jupyter community

We might also be interested in the accounts that have the largest number of followers, as it implies some amount of influence in the community, especially if the number of people they are following is relatively low. After exploring a bit I'm interested in Jacob Kaplan-Moss, who was heavily involved with Django. Django is a web framework and one of the more popular uses of Python.

In [51]:
subscriptions_df[subscriptions_df.id.isin(shared_subscribers)].groupby(["id", "username", "name"])[["followers_count", "following_count"]].min()

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,followers_count,following_count
id,username,name,Unnamed: 3_level_1,Unnamed: 4_level_1
25103,jessitron,Jessica Joy Kerr,37994,2426
765548,hmason,Hilary Mason,128902,2053
803694,mikeolson,Mike Olson 🇺🇸,25111,232
817083,EricaJoy,EricaJoy,124195,1067
823083,migueldeicaza,Miguel de Icaza,88127,4106
...,...,...,...,...
2543588034,clattner_llvm,Chris Lattner,65353,149
3422200198,spacy_io,spaCy,24749,23
3562121415,RealSexyCyborg,Naomi Wu 机械妖姬,221819,2913
794723912189775872,vorpalsmith,Nathaniel J. Smith,2954,210


In [14]:
jacobian_subscriptions_df = get_subscriptions_df(18824526)

With Jacob's subscriptions in hand, we can look for accounts that are followed by all 3 individuals (GVR, Wes McKinney, and Jacob Kaplan-Moss). There are 6 accounts in total. One of these is Jessica McKellar, whom I recognize as a major contributor to the Twisted Python library. I did a little research and saw that she was also at one point a director of the Python Software Foundation. Interestingly, a lot of the individuals on the list have very small subscription counts. The exception is Nick Coghlan, a CPython developer. There is also an individual, Alex Gaynor, who doesn't subscribe to anyone, but to whom many people are subscribed. This is much more the pattern of a media organization than a person, which is most curious. Between all these individuals, there are just over 4500 subscriptions. Let's get 'em all and call it the end of the day for the source of our network.

In [15]:
follows_three = pd.concat([jacobian_subscriptions_df, gvr_subscriptions_df, wes_subscriptions_df]).id.value_counts()
follows_three = follows_three[follows_three > 2].index
len(follows_three)
jacobian_subscriptions_df[jacobian_subscriptions_df.id.isin(follows_three)]

Unnamed: 0,description,id,location,name,username,followers_count,following_count,tweet_count,listed_count,subscriber
16,"CTO, https://t.co/FESdoVX9di",24945605,"San Francisco, CA",Jessica McKellar,jessicamckellar,15291,339,1046,521,18824526
19,"Python, software, @coveragepy, typography, jug...",16273417,Boston,Ned Batchelder,nedbat,32189,162,9175,660,18824526
29,chef ➡️ developer ➡️ keynote speaker & enginee...,17213939,"🎶 Nashville, TN",adriennefriend,adriennefriend,4794,1135,36507,245,18824526
34,"CPython core developer, software architect @Tr...",261665936,"Brisbane, AU (Turrbal land)",Nick Coghlan,ncoghlan_dev,9713,2547,40207,447,18824526
45,Python core developer; snarky Canadian,14428320,"Vancouver, British Columbia",Brett Cannon,brettsky,17953,326,16319,574,18824526
49,"software resilience engineer, and generalist t...",14635493,here,Alex Gaynor,alex_gaynor,11730,0,114,769,18824526


In [16]:
subscriber_frames = []
# Filter to the shared accounts first, then skip anyone who follows nobody.
# Applying the boolean mask to the unfiltered frame avoids a pandas reindexing warning.
shared_df = jacobian_subscriptions_df[jacobian_subscriptions_df.id.isin(follows_three)]
for subscriber in shared_df.query("""following_count > 0""").id.values:
    subscriber_frames.append(get_subscriptions_df(subscriber))
subscriptions_df = pd.concat(subscriber_frames)


In [17]:
subscriptions_df = pd.concat([
    gvr_subscriptions_df,
    wes_subscriptions_df,
    jacobian_subscriptions_df,
    subscriptions_df
])
subscriptions_df

Unnamed: 0,description,id,location,name,username,followers_count,following_count,tweet_count,listed_count,subscriber
0,"I work @Microsoft, making Python more awesome ...",14076724,"Redmond, WA",John Lam,john_lam,4406,586,6182,294,15804774
1,Intl. Relations & MPP Grad 🌍 noob intersection...,187196955,Berlin,Camila Gutiérrez (she/her),Mariacamilagl30,695,444,259,9,15804774
2,"Creator of @datasetteproj, co-creator Django. ...",12497,"San Francisco, CA",Simon Willison,simonw,41365,5173,41097,1594,15804774
3,Partner Software Architect at Microsoft on .NE...,19481808,Redmond,David Fowler 🇧🇧🇺🇸,davidfowl,103781,1319,61351,1607,15804774
4,The #1 Python-focused podcast covering the peo...,3098427092,"Portland, OR USA",Talk Python Podcast,TalkPython,61467,4043,6076,1342,15804774
...,...,...,...,...,...,...,...,...,...,...
321,Working on Python infrastructure at Instagram,17147516,"Seattle, WA",Dino Viehland,DinoViehland,440,134,306,32,14428320
322,"software resilience engineer, and generalist t...",14635493,here,Alex Gaynor,alex_gaynor,11730,0,114,769,14428320
323,"Engineer, Cyclist, Pythonista, Foodie",38579562,Vancouver,Dr. Doug Latornell,dlatornell,180,442,1416,14,14428320
324,,36183197,Idaho,Whitney Cannon,cannonwe,2,0,2,0,14428320


In [19]:
subscriptions_df.subscriber.nunique()

8

There are just under 6000 subscriptions in our fledgling network, covering about 5200 individuals and 8 seeds. Time to bring in networkx!

In [116]:
import networkx as nx

python_twitterverse = nx.from_pandas_edgelist(subscriptions_df, source="subscriber", target="id", create_using=nx.DiGraph())

I thought it would be nice to bring in the user attributes to associate with each node. I thought that would be as simple as removing our "subscriber" field from the subscriptions dataframe and dropping duplicates. I had a surprise, though: in between my calls, the public metrics of certain users changed! This highlights the really dynamic nature of Twitter and is a cautionary tale about thinking it is possible to have a truly complete picture of the Python twitterverse, even with infinite computing power and no rate limits.

In [117]:
users_df = subscriptions_df.drop("subscriber", axis="columns").drop_duplicates().copy()
users_df[users_df.id==users_df[users_df.id.duplicated()].id.values[0]]

Unnamed: 0,description,id,location,name,username,followers_count,following_count,tweet_count,listed_count
446,"Fun Stack Vibing. Started Xamarin, Mono, Gnome...",823083,"boston, ma",Miguel de Icaza,migueldeicaza,88127,4107,100887,3417
163,"Fun Stack Vibing. Started Xamarin, Mono, Gnome...",823083,"boston, ma",Miguel de Icaza,migueldeicaza,88128,4107,100887,3417
71,"Fun Stack Vibing. Started Xamarin, Mono, Gnome...",823083,"boston, ma",Miguel de Icaza,migueldeicaza,88129,4107,100887,3417


We drop duplicates based on the id column only, then set the index for the frame to id. This lets us make a dictionary mapping user ids to their attributes, which is what networkx needs from us. 

In [118]:
users_df = users_df.drop_duplicates(subset=["id"], keep="last")
users_df = users_df.set_index("id")
users_df = users_df.fillna("Not Provided")
users_dict = users_df.to_dict(orient="index")
id2user_dict = users_df["username"].to_dict()
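The difference between the two kinds of de-duplication is easy to see on a tiny frame. This sketch reuses the three follower counts shown above for migueldeicaza plus one other row; the frame `rows` is built by hand for illustration.

```python
import pandas as pd

# Three snapshots of the same account taken at slightly different times,
# plus one account that was only seen once.
rows = pd.DataFrame({
    "id": [823083, 823083, 823083, 12497],
    "username": ["migueldeicaza", "migueldeicaza", "migueldeicaza", "simonw"],
    "followers_count": [88127, 88128, 88129, 41380],
})

# Exact-match de-duplication keeps all three snapshots, since the counts differ.
exact = rows.drop_duplicates()

# De-duplicating on id alone keeps one row per account (here, the last one seen).
by_id = rows.drop_duplicates(subset=["id"], keep="last")

# One attribute dictionary per id, the shape networkx expects for node attributes.
users = by_id.set_index("id").to_dict(orient="index")
print(len(exact), len(by_id), users[823083])
```

Which snapshot to keep is a judgment call; `keep="last"` simply takes whichever row was appended latest.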

We can now set attributes for our nodes. We also relabel them to use the usernames instead of ids, which will make our graph a little easier to play with. 

In [121]:
users_df[users_df.username=="dongheena92"]

Unnamed: 0_level_0,description,location,name,username,followers_count,following_count,tweet_count,listed_count
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1169287827521015808,SDE @ Line corp & CPython Core-dev: Opinions a...,Not Provided,Dong-hee Na,dongheena92,760,188,863,11


In [125]:
users_dict[users_df[users_df.username=="dongheena92"].index.values[0]]

{'description': 'SDE @ Line corp & CPython Core-dev: Opinions are my own not my employer',
 'location': 'Not Provided',
 'name': 'Dong-hee Na',
 'username': 'dongheena92',
 'followers_count': 760,
 'following_count': 188,
 'tweet_count': 863,
 'listed_count': 11}

In [133]:
# enhance nodes by adding user details

nx.set_node_attributes(python_twitterverse, users_dict)
nx.relabel_nodes(python_twitterverse, users_df["username"].to_dict(), copy=False)

<networkx.classes.digraph.DiGraph at 0x7f38e68f4df0>
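The same two steps on a three-node toy graph, as a minimal sketch with invented ids and usernames. Unlike the cell above, this version uses the default `copy=True`, so relabel_nodes returns a new graph and leaves the original untouched.

```python
import networkx as nx

# Toy directed graph keyed by numeric ids: 1 follows 2 and 3, 2 follows 3.
graph = nx.DiGraph()
graph.add_edges_from([(1, 2), (1, 3), (2, 3)])

# Attribute dictionaries keyed by node id.
attrs = {
    1: {"username": "alice", "followers_count": 10},
    2: {"username": "bob", "followers_count": 25},
    3: {"username": "carol", "followers_count": 40},
}
nx.set_node_attributes(graph, attrs)

# Map each id to its username and relabel into a new graph.
id_to_username = {node_id: node_attrs["username"] for node_id, node_attrs in attrs.items()}
named_graph = nx.relabel_nodes(graph, id_to_username)
print(named_graph.nodes(data=True))
```

Edges and attributes both carry over to the relabeled graph, so lookups by username work directly afterwards.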

In [135]:
python_twitterverse.nodes()["dongheena92"]

{'description': 'SDE @ Line corp & CPython Core-dev: Opinions are my own not my employer',
 'location': 'Not Provided',
 'name': 'Dong-hee Na',
 'username': 'dongheena92',
 'followers_count': 760,
 'following_count': 188,
 'tweet_count': 863,
 'listed_count': 11}

It's an awfully big network to work with. It does make a beautiful graph drawing, for sure, though my computer couldn't really handle the load. I did notice, before it gave up, that most of the nodes in the network only have one connection to the seeds that we chose. We can significantly reduce the number of nodes in our network by removing anyone who only has that one connection. We are now down to 505 nodes!

In [136]:
python_degree = python_twitterverse.degree()
single_nodes = [node for node in python_twitterverse if python_degree[node] == 1]
python_twitterverse.remove_nodes_from(single_nodes)
len(python_twitterverse)

505
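On a toy graph the pruning step looks like this. The node names are invented; note that for a DiGraph, `degree` counts incoming and outgoing edges together.

```python
import networkx as nx

# Two seeds; "b" is followed by both, "a" and "c" by only one seed each.
toy = nx.DiGraph()
toy.add_edges_from([("s1", "a"), ("s1", "b"), ("s2", "b"), ("s2", "c")])

# Collect the nodes first, then remove them, so the graph is not
# modified while it is being iterated over.
degree = toy.degree()
single_nodes = [node for node in toy if degree[node] == 1]
toy.remove_nodes_from(single_nodes)
print(sorted(toy.nodes))
```

Only a single pass is made here. Removing leaves can in principle create new degree-1 nodes, so a stricter pruning would repeat the step until nothing changes.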

In [137]:
nx.get_node_attributes(python_twitterverse, "location")

{'maintainerati': 'A bike shed',
 'jreback': 'Not Provided',
 'necaris': 'Washington, D.C.',
 'MitchellBaker': 'Not Provided',
 'mxsash': 'she/her',
 'dmoisset': 'London, England',
 'matthew_d_green': 'Baltimore, MD',
 'TheASF': 'Worldwide',
 'andrewgodwin': 'Denver, CO, USA',
 'GaelVaroquaux': 'Paris, France',
 'pyconca': 'Canada',
 'TimirahJ': 'Los Angeles, CA',
 'BigDataBorat': 'Алматы',
 'OpenSourceOrg': 'Palo Alto, CA',
 'webology': 'Lawrence, KS',
 'drfeifei': 'Stanford, CA, U.S.A.',
 'gjbernat': 'Los Angeles, CA',
 'etrepum': 'San Francisco, CA',
 'stefanvdwalt': 'Truckee, California',
 'kushaldas': 'Sweden',
 'tarah': '(🇺🇸)🇮🇹🇫🇷🇬🇧',
 'berkerpeksag': 'Helsinki, Finland',
 'katie_k7r': 'Austin, TX',
 'ftrain': 'Brooklyn, NY',
 'OssAnna16': 'London, England 🇬🇧',
 'juliaelman': 'Not Provided',
 'kf': 'Portland, OR',
 'corbett': 'Sierra Madre, CA',
 'joeerl': 'Sweden',
 'mariatta': 'Port Moody, British Columbia',
 'PyCaribbean': 'Santo Domingo, DR',
 'aprilwensel': 'Encinitas, CA',
 

In [138]:
nx.write_graphml(python_twitterverse, "python_twitterverse_d2.graphml")
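A quick round trip on a toy graph shows what survives the trip to GraphML. This is a sketch with invented nodes, written to a temporary directory so it leaves no file behind.

```python
import os
import tempfile

import networkx as nx

# Toy directed graph with simple string and integer attributes.
graph = nx.DiGraph()
graph.add_node("alice", location="Not Provided", followers_count=10)
graph.add_node("bob", location="Berlin", followers_count=25)
graph.add_edge("alice", "bob")

# Write the graph to disk and read it straight back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.graphml")
    nx.write_graphml(graph, path)
    loaded = nx.read_graphml(path)

print(loaded.nodes(data=True))
```

The reloaded graph is still directed and keeps its attributes, which is what lets a later session pick up from the saved file instead of calling the API again.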

## Cliques - not just for Mean Girls!

Because of the way we constructed our network, the most connections anyone can have would be 7. 

In [2]:
for clique in nx.find_cliques(python_twitterverse.to_undirected()):
    if len(clique) > 6:
        print(clique)

NameError: name 'nx' is not defined
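For reference, here is how the clique search behaves on a toy graph with invented nodes. `find_cliques` works on undirected graphs and yields maximal cliques. By default `to_undirected()` keeps an edge if a follow exists in either direction, while `reciprocal=True` keeps only mutual follows, which may be closer to what "clique" means socially.

```python
import networkx as nx

# a and b follow each other; everything else is one-directional.
toy = nx.DiGraph()
toy.add_edges_from([("a", "b"), ("b", "a"), ("b", "c"), ("a", "c"), ("c", "d")])

# Any follow in either direction becomes an undirected edge.
either_direction = toy.to_undirected()
cliques = sorted(sorted(clique) for clique in nx.find_cliques(either_direction))

# Only mutual follows become undirected edges.
mutual_only = toy.to_undirected(reciprocal=True)
largest_mutual = max(nx.find_cliques(mutual_only), key=len)

print(cliques, sorted(largest_mutual))
```

On the real graph, the choice between the two conversions changes which groups count as cliques, so it is worth deciding which one the analysis actually needs.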

In [4]:
import networkx as nx
python_twitterverse = nx.read_graphml("python_twitterverse_d2.graphml")

In [7]:
python_twitterverse.degree()

DiDegreeView({'maintainerati': 2, 'jreback': 2, 'necaris': 3, 'MitchellBaker': 2, 'mxsash': 2, 'dmoisset': 2, 'matthew_d_green': 2, 'TheASF': 2, 'andrewgodwin': 3, 'GaelVaroquaux': 2, 'pyconca': 2, 'TimirahJ': 3, 'BigDataBorat': 2, 'OpenSourceOrg': 2, 'webology': 4, 'drfeifei': 2, 'gjbernat': 2, 'etrepum': 2, 'stefanvdwalt': 2, 'kushaldas': 4, 'tarah': 2, 'berkerpeksag': 2, 'katie_k7r': 2, 'ftrain': 2, 'OssAnna16': 2, 'juliaelman': 4, 'kf': 2, 'corbett': 2, 'joeerl': 2, 'mariatta': 4, 'PyCaribbean': 2, 'aprilwensel': 2, 'djangogirls': 2, 'fwiles': 4, 'alicegoldfuss': 2, 'PyData': 2, 'JukkaLeh': 2, 'jess_ingrass': 2, 'quansightai': 2, 'betswaliszewski': 3, 'hiwearespiders': 2, 'gergdotca': 2, 'raymondh': 3, 'cyen': 2, 'bostonpython': 2, 'zeynep': 2, 'rikkiends': 3, 'vboykis': 2, 'skippyhammond': 2, 'migueldeicaza': 3, 'MissPhilbin': 2, 'giovannibajo': 2, 'kjam': 2, 'yarkot': 3, 'jaykreps': 2, 'LorenaABarba': 3, 'lorencrary': 2, 'pyconpune': 2, 'flowerhack': 2, 'hynek': 2, 'Skud': 2, 'cl