# Project Ranking Over Time

In this notebook, we examine how project rankings vary over time for different CNCF projects. Using graph-based measures such as PageRank, betweenness centrality, and closeness centrality, we can rank each project within a given time range.

## Connect to Augur database

We will be fetching the data from an Augur database which stores the GitHub data for a large number of open source repositories.

In [4]:
import pandas as pd
import psycopg2
import itertools
import collections
from operator import itemgetter
import sqlalchemy as salc
import json
import networkx as nx
import random
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.preprocessing import MinMaxScaler

from ipynb.fs.defs.graph_helper_functions import (
     get_all_repos,
     get_repos,
     get_contributors,
     created_melted_dfs,
     get_repos_outside,
     get_page_ranks,
     get_betweenness_centrality,
     get_closeness_centrality,
     plot_graph,
     project_nodes_edges_contributions
)

with open("../wasm_creds.json") as config_file:
    config = json.load(config_file)

In [5]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})
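
A connection string built with plain `format` can break when the user name or password contains characters such as `@` or `/`. The sketch below shows one way to guard against that by percent-encoding both fields with the standard library's `quote_plus`. The helper name `build_connection_string` and the credentials are hypothetical, and the keys are assumed to match those read from the config file above.

```python
from urllib.parse import quote_plus


def build_connection_string(config):
    """Build a PostgreSQL URL, percent-encoding the user name and password."""
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote_plus(config["user"]),
        quote_plus(config["password"]),
        config["host"],
        config["port"],
        config["database"],
    )


# Hypothetical credentials, for illustration only.
example_config = {
    "user": "augur",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "augur",
}
example_url = build_connection_string(example_config)
```

The resulting string could then be passed to `salc.create_engine` in place of `database_connection_string`.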

In [6]:
READ_LOCALLY = False

## Retrieve Available Repositories

Fetch all the repositories for the WASM ecosystem.

In [7]:
df_repos = get_all_repos(engine)

In [8]:
df_repos.head()

Unnamed: 0,repo_id,repo_group_id,repo_git,repo_path,repo_name,repo_added,repo_type,url,owner_id,description,primary_language,created_at,forked_from,updated_at,repo_archived_date_collected,repo_archived,tool_source,tool_version,data_source,data_collection_date
0,25598,25432,https://github.com/fluencelabs/musl,github.com-fluencelabs-musl,musl,2023-08-01 08:24:29,Organization,,,,,,LinusU/musl,,,0,Frontend,1.0,Git,2023-08-01 08:24:29
1,25492,25430,https://github.com/wasmerio/wasmer,github.com-wasmerio-wasmer,wasmer,2023-08-01 08:24:09,Organization,,,,,,Parent not available,,,0,Frontend,1.0,Git,2023-08-01 08:24:09
2,25509,25431,https://github.com/ewasm/hera,github.com-ewasm-hera,hera,2023-08-01 08:24:16,Organization,,,,,,Parent not available,,,0,Frontend,1.0,Git,2023-08-01 08:24:16
3,25641,25432,https://github.com/fluencelabs/js-client,github.com-fluencelabs-js-client,js-client,2023-08-01 08:24:29,Organization,,,,,,Parent not available,,,0,Frontend,1.0,Git,2023-08-01 08:24:29
4,25490,25430,https://github.com/bytecodealliance/wasmtime,github.com-bytecodealliance-wasmtime,wasmtime,2023-08-01 08:24:09,Organization,,,,,,Parent not available,,,0,Frontend,1.0,Git,2023-08-01 08:24:09


In [9]:
df_repos.repo_id.nunique()

258

In [10]:
# Let's subset the df for just 30 repo ids for testing purposes only
repo_set = random.sample(df_repos['repo_id'].unique().tolist(), 30)
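
Because `random.sample` draws a different subset on every run, the repositories listed below will change each time the notebook is executed. If a stable test subset is wanted, a seeded generator is one option. This is a minimal sketch; the helper name `sample_repo_ids`, the seed value, and the example ids are assumptions, not part of the original notebook.

```python
import random


def sample_repo_ids(repo_ids, k, seed=42):
    """Return k repo ids drawn without replacement, reproducibly."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    # Sorting first makes the draw independent of the input order.
    return rng.sample(sorted(set(repo_ids)), k)


# Hypothetical repo ids, for illustration only.
example_ids = list(range(25400, 25658))
first_draw = sample_repo_ids(example_ids, 30)
second_draw = sample_repo_ids(example_ids, 30)
```

Both draws contain the same 30 ids, so downstream cells would see the same repositories on every run.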

In [12]:
df_subset = df_repos[df_repos['repo_id'].isin(repo_set)]

In [13]:
# Extract corresponding names into a list
repo_names = df_subset['repo_name'].tolist()

In [14]:
repos = {"repo_names": repo_names}

In [15]:
repos

{'repo_names': ['musl',
  'wasm-gc',
  'yew',
  'lumos-hackathon',
  'fluence-network-environment',
  'fluence-shared-canvas',
  'hackfs-2022',
  'indexer_workshop',
  'dinps-22-hackathon',
  'fluence-service-template',
  'sqlite',
  'wasmbench',
  'devcontainer',
  'ewasm-tests',
  'wrc20-examples',
  'wast2wasm',
  'go-libp2p-kad-dht',
  'jabci',
  'p2p-fileshare',
  'poa-explorer',
  'pm',
  'assemblyscript-sdk',
  'wasm-timer',
  'ammo.js',
  'glas',
  'go-wasm-cli',
  'eth2.0-specs',
  'wasmi',
  'crate-versions-experiment',
  'links']}

## Retrieve All Contributors

In [16]:
if READ_LOCALLY:
    contrib_df = pd.read_pickle("ep_data/all_contrib.pkl")
else:
    contrib_df = get_contributors(repo_set, engine)
    # Convert to UTC, drop the timezone, and truncate to midnight
    contrib_df['created_at'] = (
        pd.to_datetime(contrib_df['created_at'], utc=True)
        .dt.tz_localize(None)
        .dt.normalize()
    )
    contrib_df.to_pickle("ep_data/all_contrib.pkl")

In [17]:
contrib_df.head()

Unnamed: 0,cntrb_id,created_at,repo_id,action,repo_name,login,rank
0,0106f555-9300-0000-0000-000000000000,2023-07-06,25644,pull_request_comment,fluence-network-environment,fluencebot,9
1,0101b353-ff00-0000-0000-000000000000,2023-07-06,25644,pull_request_merged,fluence-network-environment,nahsi,16
2,0106f555-9300-0000-0000-000000000000,2023-07-06,25644,commit,fluence-network-environment,fluencebot,8
3,0106f555-9300-0000-0000-000000000000,2023-07-05,25644,pull_request_open,fluence-network-environment,fluencebot,7
4,0101b353-ff00-0000-0000-000000000000,2023-07-05,25644,pull_request_merged,fluence-network-environment,nahsi,15


In [18]:
contrib_df.repo_name.nunique()

30

In [19]:
contrib_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39876 entries, 0 to 39875
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   cntrb_id    39655 non-null  object        
 1   created_at  39876 non-null  datetime64[ns]
 2   repo_id     39876 non-null  int64         
 3   action      39876 non-null  object        
 4   repo_name   39876 non-null  object        
 5   login       39655 non-null  object        
 6   rank        39876 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 2.1+ MB


## Graph Type 1: Projects and Contributors as Nodes

In this section, we plot projects and contributors on the same graph as nodes and color them differently to see the relationships between them.

In [20]:
repo_contributions = contrib_df.groupby(['repo_name', 'cntrb_id', 'created_at']).size().unstack(fill_value=0)
repo_contributions = repo_contributions.reset_index()
repo_contributions.head()

created_at,repo_name,cntrb_id,2022-11-07 00:00:00,2011-10-30 00:00:00,2014-05-20 00:00:00,2014-05-23 00:00:00,2014-08-11 00:00:00,2014-08-16 00:00:00,2014-09-27 00:00:00,2014-10-05 00:00:00,...,2021-09-11 00:00:00,2021-11-18 00:00:00,2022-05-12 00:00:00,2020-12-21 00:00:00,2023-06-04 00:00:00,2023-04-08 00:00:00,2023-05-01 00:00:00,2023-07-09 00:00:00,2023-04-18 00:00:00,2023-08-04 00:00:00
0,ammo.js,01000000-8a00-0000-0000-000000000000,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,ammo.js,01000027-9900-0000-0000-000000000000,0,1,1,1,2,1,1,1,...,0,0,0,0,0,0,0,0,0,0
2,ammo.js,01000042-9200-0000-0000-000000000000,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,ammo.js,010000a9-ae00-0000-0000-000000000000,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,ammo.js,01000115-0800-0000-0000-000000000000,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
df_melted = repo_contributions.melt(
    id_vars=['repo_name', 'cntrb_id'],
    var_name = 'date', value_name='number'
)
# Keep only the (repo, contributor, date) rows with at least one contribution
df_melted = df_melted[df_melted['number'] != 0].copy()
df_melted.head()

Unnamed: 0,repo_name,cntrb_id,date,number
0,ammo.js,01000000-8a00-0000-0000-000000000000,2022-11-07,1
574,wasmi,01007d04-8300-0000-0000-000000000000,2022-11-07,4
622,wasmi,0105dc96-5d00-0000-0000-000000000000,2022-11-07,1
760,yew,01000717-d000-0000-0000-000000000000,2022-11-07,2
1040,yew,01006398-4b00-0000-0000-000000000000,2022-11-07,6


In [22]:
df_melted.rename(columns = {'number':'total_contributions'}, inplace = True)

In [23]:
df_melted.head()

Unnamed: 0,repo_name,cntrb_id,date,total_contributions
0,ammo.js,01000000-8a00-0000-0000-000000000000,2022-11-07,1
574,wasmi,01007d04-8300-0000-0000-000000000000,2022-11-07,4
622,wasmi,0105dc96-5d00-0000-0000-000000000000,2022-11-07,1
760,yew,01000717-d000-0000-0000-000000000000,2022-11-07,2
1040,yew,01006398-4b00-0000-0000-000000000000,2022-11-07,6


In [24]:
# Make sure the melted column labels are datetimes before extracting the year
df_melted['year'] = pd.to_datetime(df_melted['date']).dt.year

In [25]:
df_melted.head()

Unnamed: 0,repo_name,cntrb_id,date,total_contributions,year
0,ammo.js,01000000-8a00-0000-0000-000000000000,2022-11-07,1,2022
574,wasmi,01007d04-8300-0000-0000-000000000000,2022-11-07,4,2022
622,wasmi,0105dc96-5d00-0000-0000-000000000000,2022-11-07,1,2022
760,yew,01000717-d000-0000-0000-000000000000,2022-11-07,2,2022
1040,yew,01006398-4b00-0000-0000-000000000000,2022-11-07,6,2022


In [26]:
len(df_melted)

12915

In [27]:
# Total contributions made by each contributor to each repo in each year
grouped_contributions_year = df_melted.groupby(['repo_name', 'cntrb_id', 'year'])['total_contributions'].sum()

In [28]:
grouped_contributions_year = grouped_contributions_year.reset_index()
grouped_contributions_year.head()

Unnamed: 0,repo_name,cntrb_id,year,total_contributions
0,ammo.js,01000000-8a00-0000-0000-000000000000,2022,1
1,ammo.js,01000027-9900-0000-0000-000000000000,2011,1
2,ammo.js,01000027-9900-0000-0000-000000000000,2014,11
3,ammo.js,01000027-9900-0000-0000-000000000000,2015,7
4,ammo.js,01000027-9900-0000-0000-000000000000,2016,2


In [29]:
len(grouped_contributions_year)

2031
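
The pivot, melt, and group-by steps above amount to counting contribution events per (repo, contributor, year). The sketch below shows that aggregation directly with the standard library, using hypothetical events; a pandas `groupby` over the same three keys would presumably give the same totals.

```python
from collections import Counter
from datetime import date

# Hypothetical contribution events: (repo_name, cntrb_id, date).
events = [
    ("yew", "c1", date(2022, 11, 7)),
    ("yew", "c1", date(2022, 11, 7)),
    ("yew", "c1", date(2023, 1, 3)),
    ("wasmi", "c1", date(2022, 5, 12)),
    ("wasmi", "c2", date(2022, 11, 7)),
]


def yearly_contributions(events):
    """Count events per (repo, contributor, year)."""
    totals = Counter()
    for repo, contributor, day in events:
        totals[(repo, contributor, day.year)] += 1
    return dict(totals)


yearly = yearly_contributions(events)
```

Each key in `yearly` corresponds to one row of `grouped_contributions_year`, and the value to its `total_contributions`.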

### Use PageRank and Betweenness Centrality to Subset Nodes

We can now run the `PageRank` algorithm to rank the nodes in the graph based on the structure of their incoming links.

We will also compute the shortest-path betweenness centrality of each node, which measures how often the node lies on the shortest paths between pairs of other nodes. Here we want to find the repositories that most often sit on those shortest paths.
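
To make the betweenness measure concrete, here is a minimal standard-library sketch of Brandes' algorithm for an unweighted, undirected graph, applied to a hypothetical graph of three contributors and two repositories. It is an illustration of the measure only, not the implementation used by `networkx` or by the helper functions in this notebook, and it ignores edge weights.

```python
from collections import deque


def betweenness_centrality(adjacency):
    """Brandes' algorithm; adjacency maps each node to a set of neighbours."""
    scores = {node: 0.0 for node in adjacency}
    for source in adjacency:
        # Breadth-first search that counts shortest paths from source.
        order = []
        predecessors = {node: [] for node in adjacency}
        path_count = {node: 0 for node in adjacency}
        distance = {node: -1 for node in adjacency}
        path_count[source] = 1
        distance[source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            order.append(node)
            for neighbour in adjacency[node]:
                if distance[neighbour] < 0:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
                if distance[neighbour] == distance[node] + 1:
                    path_count[neighbour] += path_count[node]
                    predecessors[neighbour].append(node)
        # Walk back from the farthest nodes, accumulating dependencies.
        dependency = {node: 0.0 for node in adjacency}
        for node in reversed(order):
            for pred in predecessors[node]:
                share = path_count[pred] / path_count[node]
                dependency[pred] += share * (1.0 + dependency[node])
            if node != source:
                scores[node] += dependency[node]
    # Each undirected pair was counted from both ends; normalise to [0, 1].
    n = len(adjacency)
    scale = 1.0 / ((n - 1) * (n - 2)) if n > 2 else 1.0
    return {node: value * scale for node, value in scores.items()}


# Hypothetical contributor-repo links: c1 - yew - c2 - wasmi - c3.
links = [("c1", "yew"), ("c2", "yew"), ("c2", "wasmi"), ("c3", "wasmi")]
adjacency = {}
for contributor, repo in links:
    adjacency.setdefault(contributor, set()).add(repo)
    adjacency.setdefault(repo, set()).add(contributor)

bc = betweenness_centrality(adjacency)
```

In this toy graph the contributor `c2`, who bridges the two repositories, scores highest, while contributors attached to a single repository score zero. This may help explain why many repositories in the results further down have a betweenness of 0.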

### PageRank

PageRank ranks nodes by the quantity and quality of the links that point to them. In our case, the links that point to repositories come from contributors.
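
As a rough illustration of the idea, the sketch below runs a plain power iteration on a small hypothetical contributor-repository graph, with every link added in both directions. The function, the damping value of 0.85, and the example edges are assumptions for illustration; the notebook's own scores come from the `get_page_ranks` helper, whose internals are not shown here, and this sketch ignores edge weights.

```python
def pagerank(out_links, damping=0.85, iterations=100):
    """Plain power iteration over an unweighted directed graph.

    out_links maps each node to the list of nodes it points to; every
    node is assumed to have at least one outgoing link.
    """
    nodes = list(out_links)
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share each round.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node, targets in out_links.items():
            # A node splits its damped rank evenly among its targets.
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical contributor -> repo edges.
edges = [("c1", "yew"), ("c2", "yew"), ("c3", "yew"), ("c3", "wasmi")]
out_links = {}
for contributor, repo in edges:
    out_links.setdefault(contributor, []).append(repo)
    out_links.setdefault(repo, []).append(contributor)  # reverse edge

ranks = pagerank(out_links)
```

Here `yew`, which receives links from three contributors, ends up with a higher score than `wasmi`, which receives only one.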

#### Run PageRank grouped by year

In [30]:
yearly_score_dict = dict()

In [31]:
yearly_score_dict

{}

In [32]:
repo_scores = pd.DataFrame(
    {'repo': repo_names
    })

In [33]:
all_years = grouped_contributions_year["year"].unique()

In [34]:
def get_page_rank_scores(df, repos, scores):
    """Add PageRank scores for one year's contribution edges to `scores`."""
    # Swap the two columns to get the reverse (repo -> contributor) edges
    bi_df = df.rename(columns={"repo_name": "cntrb_id", "cntrb_id": "repo_name"})

    # Stack both directions so every link exists contributor -> repo and back
    bidirect_df = pd.concat([df, bi_df], ignore_index=True)

    # Create a directed graph to run PageRank on
    H = nx.from_pandas_edgelist(bidirect_df,
                            source='cntrb_id',
                            target='repo_name',
                            edge_attr='total_contributions',
                            create_using=nx.DiGraph())

    top_repos, pageranks, scores = get_page_ranks(H, 100, repos, scores)

    return scores

### Betweenness Centrality

In [35]:
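def get_betweenness_centrality_scores(df, repos, scores):
    """Add betweenness centrality scores for one year's contribution edges to `scores`."""
    # Undirected graph with repos and contributors as nodes
    G = nx.from_pandas_edgelist(df,
                            source='repo_name',
                            target='cntrb_id',
                            edge_attr='total_contributions',
                            create_using=nx.Graph())

    top_repos, bc, scores = get_betweenness_centrality(G, 50, repos, scores)

    return scores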

### Closeness Centrality

In [36]:
def get_closeness_centrality_scores(df, repos, scores):
    """Add closeness centrality scores for one year's contributions to `scores`."""
    result, common_repo_contri = project_nodes_edges_contributions(df)

    g = nx.Graph()
    g.add_weighted_edges_from(result)

    # Keep connected components with more than 5 nodes that contain a tracked repo
    tracked = set(repos["repo_names"])
    sub_graphs = [
        g.subgraph(c)
        for c in nx.connected_components(g)
        if len(c) > 5 and (c & tracked)
    ]

    # Only the first qualifying component is scored
    if sub_graphs:
        top_repos, cc, scores = get_closeness_centrality(sub_graphs[0], 100, repos, scores)

    return scores
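
Closeness centrality is based on the distances from a node to every other node, so it is only well defined within a connected component, which is presumably why the function above scores one component at a time. The following standard-library sketch computes it for a small, connected, unweighted graph of hypothetical project-to-project links. It illustrates the measure only and is not the `get_closeness_centrality` helper.

```python
from collections import deque


def closeness_centrality(adjacency):
    """Closeness for a connected, unweighted, undirected graph."""
    scores = {}
    for source in adjacency:
        # Breadth-first search gives the hop distance to every other node.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total = sum(distance.values())
        # Number of reachable nodes divided by the sum of their distances.
        scores[source] = (len(distance) - 1) / total if total else 0.0
    return scores


# Hypothetical links between projects that share contributors.
project_links = [("wasmi", "yew"), ("wasmi", "wasmtime"), ("wasmi", "wasmer")]
project_adjacency = {}
for left, right in project_links:
    project_adjacency.setdefault(left, set()).add(right)
    project_adjacency.setdefault(right, set()).add(left)

cc = closeness_centrality(project_adjacency)
```

The hub project `wasmi` is one hop from every other project and scores 1.0, while each of the other projects scores 0.6.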

In [37]:
# Score each year's subset of the contributions separately
for year in all_years:

    subset_df = grouped_contributions_year[grouped_contributions_year["year"] == year]

    # Start from a fresh frame so that a metric skipped for this year
    # (e.g. closeness, when no component qualifies) is not carried over
    year_scores = pd.DataFrame({'repo': repo_names})
    year_scores = get_page_rank_scores(subset_df, repos, year_scores)
    year_scores = get_betweenness_centrality_scores(subset_df, repos, year_scores)
    year_scores = get_closeness_centrality_scores(subset_df, repos, year_scores)

    yearly_score_dict[year] = year_scores

In [38]:
yearly_score_dict

{2022:                            repo  page_rank  betweenness_centrality
 0                          musl        NaN                     NaN
 1                       wasm-gc        NaN                     NaN
 2                           yew   0.335235                0.831526
 3               lumos-hackathon   0.002288                0.000000
 4   fluence-network-environment        NaN                     NaN
 5         fluence-shared-canvas        NaN                     NaN
 6                   hackfs-2022   0.004066                0.000094
 7              indexer_workshop   0.003401                0.000000
 8            dinps-22-hackathon   0.002288                0.000000
 9      fluence-service-template   0.002620                0.000000
 10                       sqlite   0.002620                0.000000
 11                    wasmbench   0.006526                0.000070
 12                 devcontainer        NaN                     NaN
 13                  ewasm-tests        Na

In [43]:
yearly_score_df = pd.concat(yearly_score_dict.values(), keys=yearly_score_dict.keys()).reset_index()

In [44]:
yearly_score_df.head()

Unnamed: 0,level_0,level_1,repo,page_rank,betweenness_centrality,closeness_centrality
0,2022,0,musl,,,
1,2022,1,wasm-gc,,,
2,2022,2,yew,0.335235,0.831526,
3,2022,3,lumos-hackathon,0.002288,0.0,
4,2022,4,fluence-network-environment,,,


In [45]:
yearly_score_df = yearly_score_df.rename(columns={'level_0': 'year'})

In [46]:
yearly_score_df.head()

Unnamed: 0,year,level_1,repo,page_rank,betweenness_centrality,closeness_centrality
0,2022,0,musl,,,
1,2022,1,wasm-gc,,,
2,2022,2,yew,0.335235,0.831526,
3,2022,3,lumos-hackathon,0.002288,0.0,
4,2022,4,fluence-network-environment,,,


In [47]:
yearly_score_df = yearly_score_df[['repo', 'year', 'page_rank', 'betweenness_centrality', 'closeness_centrality']]

In [48]:
yearly_score_df.head()

Unnamed: 0,repo,year,page_rank,betweenness_centrality,closeness_centrality
0,musl,2022,,,
1,wasm-gc,2022,,,
2,yew,2022,0.335235,0.831526,
3,lumos-hackathon,2022,0.002288,0.0,
4,fluence-network-environment,2022,,,
