# Project Ranking Over Time

In this notebook, we examine how project rankings vary over time for different CNCF projects. Using graph-based measures such as PageRank, betweenness centrality, and closeness centrality, we can rank each project within a given time range.

## Connect to Augur database

We will be fetching the data from an Augur database which stores the GitHub data for a large number of open source repositories.

In [1]:
import pandas as pd
import psycopg2
import itertools
import collections
from operator import itemgetter
import sqlalchemy as salc
import json
import networkx as nx
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.preprocessing import MinMaxScaler

from ipynb.fs.defs.graph_helper_functions import (
    get_repos,
    get_contributors,
    created_melted_dfs,
    get_repos_outside,
    get_page_ranks,
    get_betweenness_centrality,
    get_closeness_centrality,
    plot_graph,
    project_nodes_edges_contributions
)

with open("../copy_cage-padres.json") as config_file:
    config = json.load(config_file)

In [2]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(
    config['user'], config['password'], config['host'], config['port'], config['database']
)

# Resolve unqualified table names against the Augur schema.
dbschema = 'augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})
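
If the password in the config file contains characters such as `@` or `/`, the formatted URL above will not parse as intended. The following is a minimal sketch of one way to guard against that using only the standard library. The `config` values are made-up placeholders, not real credentials, and the notebook itself does not do this escaping.

```python
from urllib.parse import quote_plus

# Placeholder config with the same keys the notebook reads from its JSON file.
config = {
    "user": "augur",
    "password": "p@ss/word",  # hypothetical password with URL-special characters
    "host": "localhost",
    "port": 5432,
    "database": "augur",
}

# Percent-encode the credentials so '@' and '/' cannot be mistaken for URL delimiters.
database_connection_string = "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
    quote_plus(config["user"]),
    quote_plus(config["password"]),
    config["host"],
    config["port"],
    config["database"],
)
print(database_connection_string)
```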

In [3]:
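# When True, load the cached contributor data from a local pickle instead of querying the database.
READ_LOCALLY = True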

## Retrieve Available Repositories

We start from an initial list of CNCF projects, which fall into three categories (the cells below load the sandbox list):

1. Graduated projects - stable, widely adopted, production-ready projects that attract thousands of contributors
2. Incubating projects - projects used successfully in production by a small number of users
3. Sandbox projects - experimental projects at the bleeding edge of technology that are not yet widely tested in production

In [4]:
with open('../repo_lists/sandbox_cncf_repos.txt', 'r') as f:
    # Skip blank lines so empty strings are not passed on as repository entries.
    sandbox_projects = [line.strip() for line in f if line.strip()]

In [5]:
repo_set_sandbox, repo_name_set_sandbox = get_repos(sandbox_projects, engine)

In [6]:
# Strip the GitHub prefix, leaving "org/repo".
org_repo_sandbox = [x.split("https://github.com/", 1)[1] for x in repo_name_set_sandbox]

In [7]:
org_repo_set = org_repo_sandbox

In [8]:
repo_set = repo_set_sandbox

## Retrieve All Contributors

In [9]:
if READ_LOCALLY:
    contrib_df = pd.read_pickle("ep_data/all_contrib.pkl")
else:
    contrib_df = get_contributors(repo_set, engine)
    # Parse timestamps as UTC, then truncate each one to its calendar day (timezone-naive midnight).
    contrib_df['created_at'] = pd.to_datetime(contrib_df['created_at'], utc=True)
    contrib_df['created_at'] = contrib_df['created_at'].dt.strftime('%Y-%m-%d')
    contrib_df['created_at'] = pd.to_datetime(contrib_df['created_at']).dt.normalize()
    contrib_df.to_pickle("ep_data/all_contrib.pkl")
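
The three `created_at` conversions above reduce each timestamp to its UTC calendar day. The sketch below shows that effect on two invented timestamps, written as a single chain. `tz_localize(None)` and `normalize()` are standard pandas accessors; the chain is offered as an equivalent alternative, not as the notebook's own code.

```python
import pandas as pd

# Two invented timestamps with different UTC offsets.
raw = pd.Series(["2023-08-01T23:15:00-05:00", "2023-08-01T09:30:00+00:00"])

# Convert to UTC, drop the timezone, and truncate to midnight.
days = pd.to_datetime(raw, utc=True).dt.tz_localize(None).dt.normalize()
print(days)
```

The first timestamp falls on 2 August once converted to UTC, which is why the conversion to UTC happens before the truncation.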

In [10]:
contrib_df.head()

| | cntrb_id | created_at | repo_id | action | repo_name | login | rank |
|---|---|---|---|---|---|---|---|
| 0 | 010009da-4500-0000-0000-000000000000 | 2023-08-01 | 30910 | pull_request_review_APPROVED | keylime | aplanas | 663 |
| 1 | 0101563f-ff00-0000-0000-000000000000 | 2023-08-01 | 30910 | pull_request_comment | keylime | codecov[bot] | 65 |
| 2 | 01001edd-e300-0000-0000-000000000000 | 2023-08-01 | 30910 | pull_request_open | keylime | maugustosilva | 1052 |
| 3 | 01001edd-e300-0000-0000-000000000000 | 2023-08-01 | 30910 | pull_request_merged | keylime | maugustosilva | 1051 |
| 4 | 01001edd-e300-0000-0000-000000000000 | 2023-08-01 | 30910 | pull_request_review_APPROVED | keylime | maugustosilva | 1050 |


In [11]:
contrib_df.repo_name.nunique()

24

In [12]:
contrib_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 421242 entries, 0 to 421241
Data columns (total 7 columns):
 #   Column      Non-Null Count   Dtype         
---  ------      --------------   -----         
 0   cntrb_id    420354 non-null  object        
 1   created_at  421242 non-null  datetime64[ns]
 2   repo_id     421242 non-null  int64         
 3   action      421242 non-null  object        
 4   repo_name   421242 non-null  object        
 5   login       420354 non-null  object        
 6   rank        421242 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(4)
memory usage: 22.5+ MB


## Graph Type 1: Projects and Contributors as Nodes

In this section, we plot projects and contributors on the same graph as nodes and color them differently to see the relationships between them.

In [13]:
repo_contributions = contrib_df.groupby(['repo_name', 'cntrb_id', 'created_at']).size().unstack(fill_value=0)
repo_contributions = repo_contributions.reset_index()
repo_contributions.head()

created_at,repo_name,cntrb_id,2022-12-15 00:00:00,2023-05-05 00:00:00,2020-06-08 00:00:00,2020-06-09 00:00:00,2021-01-11 00:00:00,2021-05-05 00:00:00,2021-05-06 00:00:00,2021-05-24 00:00:00,...,2017-11-12 00:00:00,2018-01-07 00:00:00,2018-07-06 00:00:00,2018-07-07 00:00:00,2018-04-15 00:00:00,2018-06-17 00:00:00,2018-07-14 00:00:00,2018-12-30 00:00:00,2018-06-10 00:00:00,2018-12-16 00:00:00
0,WasmEdge,01000022-0200-0000-0000-000000000000,2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,WasmEdge,01000029-9700-0000-0000-000000000000,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,WasmEdge,0100002a-3600-0000-0000-000000000000,0,6,1,1,3,1,1,4,...,0,0,0,0,0,0,0,0,0,0
3,WasmEdge,0100004b-6300-0000-0000-000000000000,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,WasmEdge,0100004f-7400-0000-0000-000000000000,0,0,2,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [14]:
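df_melted = repo_contributions.melt(
    id_vars=['repo_name', 'cntrb_id'],
    var_name='date', value_name='number'
)
# Keep only the (repo, contributor, date) combinations with at least one contribution.
# .copy() avoids a SettingWithCopyWarning when columns are added later.
df_melted = df_melted[df_melted['number'] != 0].copy()
df_melted.head()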

| | repo_name | cntrb_id | date | number |
|---|---|---|---|---|
| 0 | WasmEdge | 01000022-0200-0000-0000-000000000000 | 2022-12-15 | 2 |
| 39 | WasmEdge | 01001078-0600-0000-0000-000000000000 | 2022-12-15 | 1 |
| 55 | WasmEdge | 01002a5e-b400-0000-0000-000000000000 | 2022-12-15 | 8 |
| 60 | WasmEdge | 01003291-1b00-0000-0000-000000000000 | 2022-12-15 | 1 |
| 143 | WasmEdge | 0101583e-5700-0000-0000-000000000000 | 2022-12-15 | 1 |
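
The pivot-then-melt round trip above yields one row per (repository, contributor, date) with a non-zero count. As a hedged sketch on invented toy data, the same long-form table can be obtained directly from `groupby(...).size()`, without building the wide intermediate frame. The `toy` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for contrib_df with only the columns the aggregation uses.
toy = pd.DataFrame({
    "repo_name": ["keylime", "keylime", "keylime", "WasmEdge"],
    "cntrb_id": ["a", "a", "b", "a"],
    "created_at": pd.to_datetime(
        ["2023-08-01", "2023-08-01", "2023-08-02", "2023-08-01"]
    ),
})
keys = ["repo_name", "cntrb_id", "created_at"]

# Route used in the notebook: pivot dates into columns, melt back, drop zeros.
wide = toy.groupby(keys).size().unstack(fill_value=0).reset_index()
long_from_wide = wide.melt(
    id_vars=["repo_name", "cntrb_id"], var_name="date", value_name="number"
)
long_from_wide = long_from_wide[long_from_wide["number"] != 0]

# Direct route: count rows per group and keep the result in long form.
direct = (
    toy.groupby(keys)
    .size()
    .reset_index(name="number")
    .rename(columns={"created_at": "date"})
)
print(direct)
```

The direct route never materializes the zero-filled wide table, which may matter when there are many distinct dates.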


In [15]:
df_melted.rename(columns={'number': 'total_contributions'}, inplace=True)

In [16]:
df_melted.head()

| | repo_name | cntrb_id | date | total_contributions |
|---|---|---|---|---|
| 0 | WasmEdge | 01000022-0200-0000-0000-000000000000 | 2022-12-15 | 2 |
| 39 | WasmEdge | 01001078-0600-0000-0000-000000000000 | 2022-12-15 | 1 |
| 55 | WasmEdge | 01002a5e-b400-0000-0000-000000000000 | 2022-12-15 | 8 |
| 60 | WasmEdge | 01003291-1b00-0000-0000-000000000000 | 2022-12-15 | 1 |
| 143 | WasmEdge | 0101583e-5700-0000-0000-000000000000 | 2022-12-15 | 1 |


In [17]:
df_melted['year'] = df_melted['date'].dt.year

In [18]:
df_melted.head()

| | repo_name | cntrb_id | date | total_contributions | year |
|---|---|---|---|---|---|
| 0 | WasmEdge | 01000022-0200-0000-0000-000000000000 | 2022-12-15 | 2 | 2022 |
| 39 | WasmEdge | 01001078-0600-0000-0000-000000000000 | 2022-12-15 | 1 | 2022 |
| 55 | WasmEdge | 01002a5e-b400-0000-0000-000000000000 | 2022-12-15 | 8 | 2022 |
| 60 | WasmEdge | 01003291-1b00-0000-0000-000000000000 | 2022-12-15 | 1 | 2022 |
| 143 | WasmEdge | 0101583e-5700-0000-0000-000000000000 | 2022-12-15 | 1 | 2022 |


In [19]:
len(df_melted)

125637

In [20]:
# Total number of contributions made by each contributor to each repository in each year
grouped_contributions_year = df_melted.groupby(['repo_name', 'cntrb_id', 'year'])['total_contributions'].sum()

In [21]:
grouped_contributions_year = grouped_contributions_year.reset_index()
grouped_contributions_year.head()

| | repo_name | cntrb_id | year | total_contributions |
|---|---|---|---|---|
| 0 | WasmEdge | 01000022-0200-0000-0000-000000000000 | 2022 | 2 |
| 1 | WasmEdge | 01000029-9700-0000-0000-000000000000 | 2023 | 1 |
| 2 | WasmEdge | 0100002a-3600-0000-0000-000000000000 | 2020 | 2 |
| 3 | WasmEdge | 0100002a-3600-0000-0000-000000000000 | 2021 | 394 |
| 4 | WasmEdge | 0100002a-3600-0000-0000-000000000000 | 2022 | 166 |


In [22]:
len(grouped_contributions_year)

15520

### Plot Graphs
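
As a rough sketch of the idea behind this graph type, the yearly contribution table can be turned into a `networkx` graph in which projects and contributors are both nodes, tagged with a `kind` attribute that drives the node color. The toy rows, the `kind` attribute, and the color palette are assumptions for illustration; this is not the implementation of the notebook's `plot_graph` helper.

```python
import networkx as nx
import pandas as pd

# Toy rows shaped like grouped_contributions_year (values are invented).
edges_df = pd.DataFrame({
    "repo_name": ["WasmEdge", "WasmEdge", "keylime"],
    "cntrb_id": ["c1", "c2", "c2"],
    "year": [2022, 2022, 2022],
    "total_contributions": [2, 8, 1],
})

G = nx.Graph()
for row in edges_df.itertuples(index=False):
    G.add_node(row.repo_name, kind="project")
    G.add_node(row.cntrb_id, kind="contributor")
    # Edge weight is the number of contributions to that project.
    G.add_edge(row.cntrb_id, row.repo_name, weight=int(row.total_contributions))

# One color per node kind, in the order networkx iterates over the nodes.
palette = {"project": "tab:red", "contributor": "tab:blue"}
node_colors = [palette[G.nodes[n]["kind"]] for n in G.nodes]

# To draw (requires matplotlib):
# nx.draw(G, node_color=node_colors, with_labels=True)
```

Contributor `c2` is linked to both projects here, so it is the node that connects the two project neighborhoods.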

### Use PageRank and Betweenness Centrality to Subset Nodes

We can now run the `PageRank` algorithm, which ranks the nodes of a graph based on the structure of their incoming links.

We will also compute the shortest-path betweenness centrality of each node, which measures how often a node lies on the shortest paths between pairs of other nodes. Here, the aim is to identify the repositories that most often sit on the paths connecting other nodes in the graph.
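
To make the definition concrete, here is a small sketch using `networkx.betweenness_centrality` on an invented graph of two repositories that share one contributor. The node names are hypothetical, and the notebook's own `get_betweenness_centrality` helper may differ in its details.

```python
import networkx as nx

# Toy undirected graph: alice - repo_a - bob - repo_b - carol (names invented).
G = nx.Graph()
G.add_edges_from([
    ("alice", "repo_a"),
    ("bob", "repo_a"),
    ("bob", "repo_b"),
    ("carol", "repo_b"),
])

# Fraction of shortest paths between other node pairs that pass through each node.
betweenness = nx.betweenness_centrality(G)  # normalized by default

for node, score in sorted(betweenness.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

In this toy graph the shared contributor `bob` scores highest, because every shortest path between the two repository neighborhoods runs through him, while the two repositories tie behind him.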

### PageRank

PageRank ranks nodes by importance, analyzing the quantity and quality of the links that point to each node. In our case, the links that point to repositories come from contributors.
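
The sketch below illustrates how PageRank arrives at such scores. It is a plain-Python power iteration over weighted contributor-to-repository edges, with invented names and weights. It is meant as an explanation of the mechanism under those assumptions, not as the notebook's `get_page_ranks` helper.

```python
def pagerank(edges, alpha=0.85, iterations=100):
    """Power-iteration PageRank over (source, target, weight) edges."""
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    out_weight = {n: 0.0 for n in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing links is spread evenly over all nodes.
        dangling = sum(rank[n] for n in nodes if out_weight[n] == 0)
        new_rank = {n: (1 - alpha + alpha * dangling) / len(nodes) for n in nodes}
        # Each node passes its rank along its outgoing links, in proportion to weight.
        for s, t, w in edges:
            new_rank[t] += alpha * rank[s] * w / out_weight[s]
        rank = new_rank
    return rank


# Invented contributor -> repository edges, weighted by contribution count.
edges = [
    ("alice", "repo_a", 5),
    ("bob", "repo_a", 3),
    ("bob", "repo_b", 1),
    ("carol", "repo_b", 1),
]
ranks = pagerank(edges)
for node, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{node}: {score:.3f}")
```

Because only repositories receive links, they outrank contributors, and `repo_a` outranks `repo_b` because it receives a larger share of its contributors' weight. With `networkx`, comparable scores are available from `nx.pagerank(G, alpha=0.85, weight="weight")` on a `DiGraph`.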

#### TO DO: Run PageRank grouped by year
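
One possible way to approach this step is sketched below: build a directed contributor-to-repository graph per year and rank the repositories by their PageRank score. The toy frame mirrors the columns of `grouped_contributions_year` but its values are invented, and the `pagerank` and `position` column names are assumptions. Note that `nx.pagerank` relies on SciPy in recent `networkx` releases.

```python
import networkx as nx
import pandas as pd

# Toy rows shaped like grouped_contributions_year (values are invented).
yearly = pd.DataFrame({
    "repo_name": ["WasmEdge", "WasmEdge", "keylime", "WasmEdge", "keylime", "keylime"],
    "cntrb_id": ["c1", "c2", "c2", "c1", "c2", "c3"],
    "year": [2021, 2021, 2021, 2022, 2022, 2022],
    "total_contributions": [10, 4, 1, 2, 6, 3],
})

rows = []
for year, group in yearly.groupby("year"):
    # Edges point from contributor to repository, weighted by contribution count.
    G = nx.DiGraph()
    G.add_weighted_edges_from(
        group[["cntrb_id", "repo_name", "total_contributions"]]
        .itertuples(index=False, name=None)
    )
    scores = nx.pagerank(G, weight="weight")
    for repo in group["repo_name"].unique():
        rows.append({"year": year, "repo_name": repo, "pagerank": scores[repo]})

rank_by_year = pd.DataFrame(rows)
# Position 1 is the highest-scoring repository within each year.
rank_by_year["position"] = (
    rank_by_year.groupby("year")["pagerank"]
    .rank(ascending=False, method="min")
    .astype(int)
)
print(rank_by_year)
```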