# RF5 GitHub Repo-Level Metrics

The following fields are included in the GitHub-based metrics analysis for Retro Funding 5:

1. `artifact_url`: Repository URL associated with the project based on the application. A project may have multiple repos in its application.
2. `project_name`: Name of the project.
3. `project_category_id`: Category ID of the project (1, 2, 3).
4. `num_contributors`: Total number of contributors before August 1, 2024. A contributor is defined as any non-bot user that has contributed to the repository (since 2015) by committing code directly to a repository, opening an issue, or opening/reviewing a pull request.
5. `num_trusted_contributors`: Number of trusted contributors before August 1, 2024. This is the subset of the contributors defined above who also rank among the top 420 developers by [OpenRank](https://openrank.com/) developer trust score.
6. `num_contributors_last_6_months`: Number of contributors over the period February 1, 2024 - August 1, 2024.
7. `num_stars`: Total number of stars as of September 23, 2024.
8. `num_trusted_stars`: Number of stars from trusted users before August 1, 2024. This is the subset of the stars defined above that come from users who also rank among the top 420 developers by [OpenRank](https://openrank.com/) developer trust score.
9. `trust_weighted_stars`: A percentage score between 0% and 100% equal to the summed reputation share of the developers who starred the repo. We first calculate every developer's reputation share (%) based on their OpenRank score, then sum the shares of the developers who starred the target repo. If every developer in the OpenRank developer ranking had starred the repo, the value would be 100%. The more developers who starred the repo, and the higher they rank, the higher the value, which is read as a signal of the repo's impact and quality.
10. `num_forks`: Total number of forks as of September 23, 2024.
11. `num_trusted_forks`: Number of forks from trusted users before August 1, 2024. This is the subset of the forks defined above that come from users who also rank among the top 420 developers by [OpenRank](https://openrank.com/) developer trust score.
12. `trust_weighted_forks`: A percentage score between 0% and 100% equal to the summed reputation share of the developers who forked the repo. We first calculate every developer's reputation share (%) based on their OpenRank score, then sum the shares of the developers who forked the target repo. If every developer in the OpenRank developer ranking had forked the repo, the value would be 100%. The more developers who forked the repo, and the higher they rank, the higher the value, which is read as a signal of the repo's impact and quality.
13. `trust_rank_for_repo_in_category`: Ranking of the repository's OpenRank trust score within its category. A rank of 1 indicates the highest-ranked repo in its category.
14. `age_of_project_years`: Age of the project in years, computed at monthly resolution from the repository's earliest indexed event (before August 1, 2024) to September 2024.
15. `license(s)`: License(s) used by the project.
16. `application_id`: Application ID in the sign-up and voting UIs.
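
To make the trust-weighted definitions (fields 9 and 12) concrete, here is a minimal sketch. It assumes a developer's reputation share is their score divided by the sum of all scores; the developer names, the scores, and the helper `trust_weighted_share` are hypothetical and are not part of OpenRank's tooling.

```python
def trust_weighted_share(scores, engaged_users):
    """Sum the reputation shares of the developers who engaged with a repo.

    scores: dict of developer -> trust score (hypothetical values).
    engaged_users: developers who starred (or forked) the repo.
    Assumption: share = score / sum of all scores.
    """
    total = sum(scores.values())
    if total == 0:
        return 0.0
    # Developers without a score contribute nothing; duplicates count once.
    return sum(scores.get(user, 0.0) for user in set(engaged_users)) / total


# Hypothetical four-developer ranking.
scores = {"alice": 4.0, "bob": 3.0, "carol": 2.0, "dave": 1.0}

share_top = trust_weighted_share(scores, ["alice", "bob"])   # two highest ranked
share_all = trust_weighted_share(scores, list(scores))       # everyone starred
share_none = trust_weighted_share(scores, ["erin"])          # unranked user only
print(share_top, share_all, share_none)
```

Under this assumption the value reaches 1.0 (100%) only when every ranked developer has starred the repo, and a star from a higher-ranked developer moves the metric more than one from a lower-ranked developer.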

In [1]:
from google.cloud import bigquery
import json
import os
import pandas as pd

In [2]:
# https://docs.opensource.observer/docs/get-started/
# add GCP project and credentials here

PROJECT = 'opensource-observer'
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '../../../gcp_credentials.json'
client = bigquery.Client()

# Load and process the applications

In [3]:
with open("data/applications.json") as f:
    applications = json.load(f)
print(len(applications))

147


In [4]:
projects_data = []

for app in applications:
    project = app.get('project', {})
    repos = project.get('repos', [])
    project_name = project.get('name')
    project_id = project.get('id')
    category_id = app['impactStatementAnswer'][0]['impactStatement']['categoryId']
    application_id = app['impactStatementAnswer'][0]['applicationId']

    if repos:
        for repo in repos:
            repo_url = repo.get('url', None)
            
            # fix one edge case (the guard avoids calling .lower() on a missing URL)
            if repo_url and repo_url.lower() == 'https://github.com/protocolguild/membership':
                repo_url = 'https://github.com/protocolguild/documentation'

            projects_data.append({
                'project_name': project_name,
                'project_id': project_id,
                'application_id': application_id,
                'category_id': category_id,
                'repo_url': repo_url,
                'repo_name': repo.get('name', None)                
            })
    else:
        projects_data.append({
            'project_name': project_name,
            'project_id': project_id,
            'application_id': application_id,
            'category_id': category_id,
            'repo_url': None,
            'repo_name': None            
        })

df_projects = pd.DataFrame(projects_data)

def extract_owner_and_repo(url):
    if url and isinstance(url, str):
        url = url.lower()
        if "github.com" in url:
            parts = url.split('/')
            if len(parts) >= 5:
                return f"{parts[3]}/{parts[4]}"
    return None

def clean_repo_url(owner_and_name):
    if owner_and_name:
        return f"https://github.com/{owner_and_name}"

df_projects['repo_owner_and_name'] = df_projects['repo_url'].apply(extract_owner_and_repo)
df_projects['clean_url'] = df_projects['repo_owner_and_name'].apply(clean_repo_url)

project_name_mappings = df_projects.set_index('application_id')['project_name'].to_dict()
project_category_mappings = df_projects.set_index('application_id')['category_id'].to_dict()

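# Note: len(df_projects['application_id']) counts rows (one per application/repo pair),
# so the last line repeats the record count; use .nunique() for distinct applications.
print(f"Loaded {len(df_projects)} records\
        \n... including {len(df_projects['clean_url'].dropna().unique())} repos\
        \n... from {len(df_projects['application_id'])} unique applications.\n\n")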

repo_urls = list(df_projects['clean_url'].dropna().unique())
df_projects.tail(1)

Loaded 203 records        
... including 151 repos        
... from 203 unique applications.




,project_name,project_id,application_id,category_id,repo_url,repo_name,repo_owner_and_name,clean_url
202,OP Reth,0x5759249c433d67eeb2ca1b6ff827feec164b60b92e84...,dd8d7a7e-5f95-4523-8f6d-c3a64ee7754b,1,https://github.com/paradigmxyz/reth,OP Reth,paradigmxyz/reth,https://github.com/paradigmxyz/reth
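
The `extract_owner_and_repo` helper above takes the fourth and fifth `/`-separated pieces of the URL, which assumes every URL starts with a scheme such as `https://`. As a hedged alternative, the sketch below parses the URL with the standard library so that a missing scheme, a trailing slash, or a `.git` suffix is handled. `parse_github_repo` and the `NON_OWNER_PREFIXES` set are hypothetical and are not part of the notebook.

```python
from urllib.parse import urlparse

# Assumption: these path prefixes point at profile pages, not repositories.
NON_OWNER_PREFIXES = {"users", "orgs"}


def parse_github_repo(url):
    """Return 'owner/repo' in lowercase, or None if the URL is not a repo URL."""
    if not url or not isinstance(url, str):
        return None
    url = url.strip().lower()
    # urlparse only fills in netloc when a scheme is present.
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    if parsed.netloc not in ("github.com", "www.github.com"):
        return None
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) < 2 or parts[0] in NON_OWNER_PREFIXES:
        return None
    owner, repo = parts[0], parts[1]
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    return f"{owner}/{repo}"


for candidate in [
    "https://github.com/paradigmxyz/reth",
    "github.com/Paradigmxyz/Reth/",
    "https://github.com/paradigmxyz/reth.git",
    "https://github.com/users/zeus199803",
]:
    print(candidate, "->", parse_github_repo(candidate))
```

Unlike the notebook's helper, this version rejects profile-style URLs up front, so whether to use it depends on whether such entries should be dropped before or after the lookup against the indexed repos.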


# Fetch a snapshot of current repo metrics from OSO

In [5]:
# Get snapshot of repo metrics (taken 2024-09-23)

repo_urls_str = "'" + "','".join(repo_urls) + "'"
repos_query = f"""
    select
      abp.artifact_id,
      abp.artifact_namespace,
      abp.artifact_name,
      abp.artifact_url,
      abp.artifact_type,
      rm.is_fork,
      rm.fork_count,
      rm.star_count,
      rm.language,
      rm.license_spdx_id,
      abp.project_id as oso_project_id,
    from `{PROJECT}.oso.int_artifacts_in_ossd_by_project` as abp
    join `{PROJECT}.oso.int_repo_metrics_by_project` as rm
      on abp.artifact_id = rm.artifact_id
    where abp.artifact_url in ({repo_urls_str})
"""
repos_query_result = client.query(repos_query)
df_repos = repos_query_result.to_dataframe()
df_repos['license_spdx_id'] =  df_repos['license_spdx_id'].replace({'NOASSERTION': 'Custom'})
df_repos.tail(1)

,artifact_id,artifact_namespace,artifact_name,artifact_url,artifact_type,is_fork,fork_count,star_count,language,license_spdx_id,oso_project_id
136,EfQGtZdYoGnJfiBDCPu_UMv8mqjIZ9FWfTcUEpUz-EY=,ethereum-optimism,asterisc,https://github.com/ethereum-optimism/asterisc,REPOSITORY,False,14,98,Go,MIT,7tn6nZfvnltUNZjqR8QpXkjGDo-pYJanf8CoCwWAHpc=


In [6]:
# identify any repos in apps that do not have data
print("Ignored repos:")
valid_repo_urls = []
for repo in repo_urls:
    if repo not in df_repos['artifact_url'].unique():
        print(repo)
    else:
        valid_repo_urls.append(repo)

print()        
print("Indexed repos:",len(valid_repo_urls))        

Ignored repos:
https://github.com/jseiferth/op-analytics
https://github.com/mali030303/monstersonbasee
https://github.com/mali030303/base-btc-earth--
https://github.com/mali030303/dragons-on-op-stack--
https://github.com/blockpilabs/aggregator
https://github.com/richardgreg/execution-specs-contribution
https://github.com/richardgreg/op-docs-improvements
https://github.com/zeus199803/8-bit-cats--
https://github.com/zeus199803/opstack-for-cats-dream-
https://github.com/users/zeus199803
https://github.com/blockchaindevsh/optimism
https://github.com/jsvisa/retro5
https://github.com/nonboring/nft-starter
https://github.com/blockchef-io/op-rpgf

Indexed repos: 137


In [7]:
repo_app_mapping = (
    df_projects[df_projects.clean_url.isin(valid_repo_urls)]
    [['clean_url', 'application_id', 'project_id']]
    .drop_duplicates()
    .set_index('clean_url')['application_id']
    .to_dict()
)
df_repos['application_id'] = df_repos['artifact_url'].map(repo_app_mapping)

artifact_app_mapping = df_repos.set_index('artifact_id')['application_id'].to_dict()
artifact_url_mapping = df_repos.set_index('artifact_url')['artifact_id'].to_dict()

df_repos.tail(1)

,artifact_id,artifact_namespace,artifact_name,artifact_url,artifact_type,is_fork,fork_count,star_count,language,license_spdx_id,oso_project_id,application_id
136,EfQGtZdYoGnJfiBDCPu_UMv8mqjIZ9FWfTcUEpUz-EY=,ethereum-optimism,asterisc,https://github.com/ethereum-optimism/asterisc,REPOSITORY,False,14,98,Go,MIT,7tn6nZfvnltUNZjqR8QpXkjGDo-pYJanf8CoCwWAHpc=,946c73a2-8263-4849-9db0-e550782ee23b
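
Because `repo_app_mapping` is a plain dictionary, each repo URL can map to only one application ID. If the same repo were listed in more than one application, only one of those IDs would be kept. The sketch below shows one way such cases could be surfaced before building the mapping; the helper name, URLs, and IDs are hypothetical.

```python
from collections import defaultdict


def repos_with_multiple_applications(pairs):
    """Return {repo_url: [application_ids]} for repos claimed by more than one application.

    pairs: iterable of (repo_url, application_id) tuples.
    """
    apps_by_repo = defaultdict(set)
    for repo_url, application_id in pairs:
        apps_by_repo[repo_url].add(application_id)
    return {
        repo: sorted(apps)
        for repo, apps in apps_by_repo.items()
        if len(apps) > 1
    }


# Hypothetical (repo_url, application_id) pairs.
pairs = [
    ("https://github.com/org/repo-a", "app-1"),
    ("https://github.com/org/repo-a", "app-2"),
    ("https://github.com/org/repo-b", "app-1"),
]
shared = repos_with_multiple_applications(pairs)
print(shared)
```

With the notebook's DataFrame, the input could be built from the `clean_url` and `application_id` columns, for example with `zip(df_projects['clean_url'], df_projects['application_id'])`.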


# Fetch OSO event data from relevant repos

In [8]:
# Get all event data (cutoff date of 2024-08-01)

artifact_ids = list(artifact_app_mapping.keys())
artifact_ids_str = "'" + "','".join(artifact_ids) + "'"

CUTOFF = '2024-08-01'

events_query = f"""
    select
        time,
        event_type,
        from_artifact_name as user,
        from_artifact_id,
        to_artifact_id 
    from `{PROJECT}.oso.int_events`
    where
        to_artifact_id in ({artifact_ids_str})
        and time < '{CUTOFF}'
"""

# uncomment everything below if you want live data, otherwise uses local backup

# events_query_results = client.query(events_query)
# df_events = events_query_results.to_dataframe()

# # add application ids
# df_events['application_id'] = df_events['to_artifact_id'].map(artifact_app_mapping)

# # filter bot activity
# bot_list = ['codecov-commenter', 'claassistant', 'googlebot', 'omahs']
# github_users = list(df_events['user'].unique())
# bots = [x for x in github_users if '[bot]' in x or x in bot_list]
# df_events = df_events[~df_events['user'].isin(bots)]

# df_events.to_parquet("data/rf5_events.parquet")
df_events = pd.read_parquet("data/rf5_events.parquet")

df_events['bucket_day'] = pd.to_datetime(df_events['time'].dt.date)
df_events['amount'] = 1
df_events.tail(1)

,time,event_type,user,from_artifact_id,to_artifact_id,application_id,bucket_day,amount
1052682,2023-07-31 13:25:23+00:00,PULL_REQUEST_REVIEW_COMMENT,thomaseizinger,xOfgF7_wYw1J5fCCwpUuFs53BTw1iXb1wenhuspVXXM=,dxsMNRXWzfg8lMvq0M4bY-NZ5961glN0Q-X64anZ8BI=,0xdf1bb03d08808e2d789f5eac8462bdc560f1bb5b0877...,2023-07-31,1
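
The bot filter in the commented-out section drops any user whose name contains `[bot]` or appears in a short hand-maintained list. A minimal, dependency-free sketch of the same rule is shown below; the sample events are made up and `is_bot` is a hypothetical helper.

```python
# Same hand-maintained list as in the notebook cell above.
KNOWN_BOTS = {"codecov-commenter", "claassistant", "googlebot", "omahs"}


def is_bot(user):
    """Treat a user as a bot if the name contains '[bot]' or is in KNOWN_BOTS."""
    return "[bot]" in user or user in KNOWN_BOTS


# Hypothetical events for illustration only.
events = [
    {"user": "thomaseizinger", "event_type": "PULL_REQUEST_REVIEW_COMMENT"},
    {"user": "dependabot[bot]", "event_type": "PULL_REQUEST_OPENED"},
    {"user": "codecov-commenter", "event_type": "ISSUE_OPENED"},
]

human_events = [event for event in events if not is_bot(event["user"])]
print([event["user"] for event in human_events])
```

The rule is name-based, so a bot account that has neither the `[bot]` suffix nor an entry in the list would still be counted as a contributor.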


# Join OpenRank metrics for trusted developers on our repo snapshot

In [9]:
# identify the top N users from openrank

N = 420
users = pd.read_csv('data/openrank/devrank_20240910_user_scores.csv')
top_users = users['i'].iloc[:N].to_list()

In [10]:
stars = pd.read_csv('data/openrank/devrank_20240910_star_or_scores.csv', index_col=0)
forks = pd.read_csv('data/openrank/devrank_20240910_fork_or_scores.csv', index_col=0)
repo_score = pd.read_csv('data/openrank/devrank_20240910_repo_scores.csv', index_col=0)

open_rank_metrics = pd.concat([stars, forks, repo_score], axis=1).reset_index().fillna(0)
open_rank_metrics.columns = ['repo', 'trust_weighted_stars', 'trust_weighted_forks', 'trust_score_for_repo']
open_rank_metrics['repo_url'] = open_rank_metrics['repo'].apply(lambda x: f"https://github.com/{x.lower()}")
open_rank_metrics['artifact_id'] = open_rank_metrics['repo_url'].map(artifact_url_mapping)

open_rank_metrics

,repo,trust_weighted_stars,trust_weighted_forks,trust_score_for_repo,repo_url,artifact_id
0,paradigmxyz/reth,0.363514,0.255173,3.657535e-03,https://github.com/paradigmxyz/reth,dgCmpNNSMNgI_DiE_5Ule1csx3ZrridU0QJCzxtpTLE=
1,testinprod-io/op-erigon,0.344644,0.007681,9.908351e-02,https://github.com/testinprod-io/op-erigon,56tds-rywDPjo1MNO5QPWamXkI903fCVy9LXdgtFoDA=
2,protolambda/asterisc,0.315052,0.037471,2.881870e-04,https://github.com/protolambda/asterisc,
3,a16z/magi,0.311287,0.163477,9.311861e-02,https://github.com/a16z/magi,
4,ethereum-optimism/optimism,0.308565,0.422081,1.621405e-01,https://github.com/ethereum-optimism/optimism,
...,...,...,...,...,...,...
47049,ccpgames/aws-nginx-ha-manager,0.000000,0.000000,2.275016e-18,https://github.com/ccpgames/aws-nginx-ha-manager,
47050,ccpgames/gcloud-python,0.000000,0.000000,2.025622e-18,https://github.com/ccpgames/gcloud-python,
47051,ethelo/bonmin,0.000000,0.000000,1.572443e-18,https://github.com/ethelo/bonmin,
47052,ccpgames/esky,0.000000,0.000000,1.272280e-18,https://github.com/ccpgames/esky,


In [11]:
df_merged = df_repos.set_index('artifact_id').join(open_rank_metrics.set_index('artifact_id'))
metadata_cols = ['application_id', 'artifact_url', 'language', 'license_spdx_id']
metric_cols = ['fork_count', 'star_count', 'trust_weighted_stars', 'trust_weighted_forks', 'trust_score_for_repo']

cols = metadata_cols + metric_cols
df_merged[metric_cols] = df_merged[metric_cols].fillna(0)
df_merged = df_merged[cols]
df_merged.tail(1)

artifact_id,application_id,artifact_url,language,license_spdx_id,fork_count,star_count,trust_weighted_stars,trust_weighted_forks,trust_score_for_repo
EfQGtZdYoGnJfiBDCPu_UMv8mqjIZ9FWfTcUEpUz-EY=,946c73a2-8263-4849-9db0-e550782ee23b,https://github.com/ethereum-optimism/asterisc,Go,MIT,14,98,0.042962,0.036703,0.088941


# Derive contributor metrics from the GitHub event data

In [12]:
contributor_event_types = [
    'COMMIT_CODE',
    'PULL_REQUEST_OPENED',
    'PULL_REQUEST_REVIEW_COMMENT',
    'ISSUE_OPENED'
]
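
The contributor metrics count distinct users per repo, restricted to the event types listed above and optionally to a time window or a set of trusted users. The following is a small sketch of that counting logic in plain Python; the events, user names, and the `count_contributors` helper are hypothetical, and the notebook itself performs the equivalent step with pandas `groupby(...).nunique()`.

```python
from collections import defaultdict
from datetime import date

CONTRIBUTOR_EVENT_TYPES = {
    "COMMIT_CODE",
    "PULL_REQUEST_OPENED",
    "PULL_REQUEST_REVIEW_COMMENT",
    "ISSUE_OPENED",
}


def count_contributors(events, since=None, allowed_users=None):
    """Count distinct contributing users per repo.

    since: optional date; only events on or after it are considered.
    allowed_users: optional set; only these users are counted (e.g. trusted users).
    """
    contributors = defaultdict(set)
    for event in events:
        if event["event_type"] not in CONTRIBUTOR_EVENT_TYPES:
            continue
        if since is not None and event["day"] < since:
            continue
        if allowed_users is not None and event["user"] not in allowed_users:
            continue
        contributors[event["repo"]].add(event["user"])
    return {repo: len(users) for repo, users in contributors.items()}


# Hypothetical events for a single repo.
events = [
    {"repo": "repo-a", "user": "alice", "event_type": "COMMIT_CODE", "day": date(2023, 5, 1)},
    {"repo": "repo-a", "user": "alice", "event_type": "COMMIT_CODE", "day": date(2023, 6, 1)},
    {"repo": "repo-a", "user": "bob", "event_type": "ISSUE_OPENED", "day": date(2024, 3, 10)},
    {"repo": "repo-a", "user": "carol", "event_type": "STARRED", "day": date(2024, 5, 1)},
]

total = count_contributors(events)
recent = count_contributors(events, since=date(2024, 2, 1))
trusted = count_contributors(events, allowed_users={"alice"})
print(total, recent, trusted)
```

Here `carol` is never counted because starring is not a contributor event type, and `alice` counts once even though she committed twice.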

In [13]:
metrics = []

metrics.append(
    df_events[df_events.event_type.isin(contributor_event_types)]
    .groupby('to_artifact_id')['from_artifact_id']
    .nunique()
    .rename('num_contributors')
)

metrics.append(
    df_events[
        (df_events.event_type.isin(contributor_event_types))
        & (df_events.user.isin(top_users))
    ]
    .groupby('to_artifact_id')['from_artifact_id']
    .nunique()
    .rename('num_trusted_contributors')
)

metrics.append(
    df_events[
        (df_events.event_type.isin(contributor_event_types))
        & (df_events['bucket_day'] >= pd.to_datetime('2024-02-01'))
    ]
    .groupby('to_artifact_id')['from_artifact_id']
    .nunique()
    .rename('num_contributors_last_6_months')
)

metrics.append(
    df_events[
        (df_events.event_type == 'STARRED')
        & (df_events.user.isin(top_users))
    ]
    .groupby('to_artifact_id')['from_artifact_id']
    .nunique()
    .rename('num_trusted_stars')
)

metrics.append(
    df_events[
        (df_events.event_type == 'FORKED')
        & (df_events.user.isin(top_users))
    ]
    .groupby('to_artifact_id')['from_artifact_id']
    .nunique()
    .rename('num_trusted_forks')
)

metrics.append(
    df_events
    .groupby('to_artifact_id')['bucket_day']
    .min()
    .apply(lambda x: (2024. + 9/12.) - (x.year + x.month/12.))
    .rename('age_of_project_years')
)

contributor_metrics = pd.concat(metrics, axis=1).fillna(0)
contributor_metrics.tail(1)

to_artifact_id,num_contributors,num_trusted_contributors,num_contributors_last_6_months,num_trusted_stars,num_trusted_forks,age_of_project_years
zWq7G5YTyumXUtfEgJB_M1vHQlCsxWlsyvKyAsKaZRc=,3,1.0,3.0,1.0,0.0,0.333333
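
The `age_of_project_years` expression in the cell above works at monthly resolution: it converts the earliest event date to `year + month/12` and subtracts it from the fixed reference `2024 + 9/12`. A short worked example of that arithmetic follows; the dates are made up and `age_in_years` is a hypothetical wrapper around the same formula.

```python
from datetime import date


def age_in_years(first_event_day, ref_year=2024, ref_month=9):
    """Apply the notebook's formula: (ref_year + ref_month/12) - (year + month/12).

    The day of the month is ignored, so all dates within one month give the same age.
    """
    return (ref_year + ref_month / 12) - (
        first_event_day.year + first_event_day.month / 12
    )


# Hypothetical first-event dates.
three_months = age_in_years(date(2024, 6, 15))   # 2024.75 - 2024.50
five_years = age_in_years(date(2019, 9, 1))      # 2024.75 - 2019.75
same_month = age_in_years(date(2024, 6, 1))      # same month as the first example
print(three_months, five_years, same_month)
```

Because only the year and month enter the calculation, ages come out in steps of one twelfth of a year, which matches values such as 0.333333 and 0.166667 in the outputs.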


# Join all the metrics together and export to a CSV

In [14]:
df_metrics = df_merged.join(contributor_metrics)

df_metrics['project_name'] = df_metrics.application_id.map(project_name_mappings)
df_metrics['project_category_id'] = df_metrics.application_id.map(project_category_mappings)
df_metrics['trust_rank_for_repo_in_category'] = df_metrics.groupby('project_category_id')['trust_score_for_repo'].rank(ascending=False)
df_metrics.rename(columns={'star_count': 'num_stars', 'fork_count': 'num_forks', 'license_spdx_id': 'license(s)'}, inplace=True)

cols = [
    'artifact_url', 'project_name', 'project_category_id', 
    'num_contributors', 'num_trusted_contributors', 'num_contributors_last_6_months', 
    'num_stars', 'num_trusted_stars', 'trust_weighted_stars',
    'num_forks', 'num_trusted_forks', 'trust_weighted_forks',
    'trust_rank_for_repo_in_category', 'age_of_project_years', 'license(s)', 'application_id'
]
df_metrics = (
    df_metrics[cols]
    .set_index('artifact_url', drop=True)
    .sort_values(by=['project_category_id', 'project_name'])
)
df_metrics.tail(1)

artifact_url,project_name,project_category_id,num_contributors,num_trusted_contributors,num_contributors_last_6_months,num_stars,num_trusted_stars,trust_weighted_stars,num_forks,num_trusted_forks,trust_weighted_forks,trust_rank_for_repo_in_category,age_of_project_years,license(s),application_id
https://github.com/optimism-java/dispute-explorer-frontend,superproof,3,2.0,0.0,2.0,0,0.0,0.0,0,0.0,0.0,22.0,0.166667,,0e20af70-ef20-4ef7-aec5-ee747d971fbb
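
The `trust_rank_for_repo_in_category` column comes from a grouped `rank(ascending=False)`, which by default in pandas gives tied values the average of the ranks they span. The sketch below reimplements that behavior in plain Python to make the tie handling visible; the repo names, categories, and scores are hypothetical, as is the `rank_within_category` helper.

```python
from collections import defaultdict


def rank_within_category(rows):
    """Rank repos by score within each category (1 = highest score).

    rows: list of (repo, category, score) tuples.
    Ties share the average of the ranks they occupy.
    """
    by_category = defaultdict(list)
    for repo, category, score in rows:
        by_category[category].append((repo, score))

    ranks = {}
    for members in by_category.values():
        ordered = sorted(members, key=lambda member: member[1], reverse=True)
        i = 0
        while i < len(ordered):
            # Extend j to cover every repo tied with the one at position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
                j += 1
            average_rank = ((i + 1) + (j + 1)) / 2
            for repo, _ in ordered[i : j + 1]:
                ranks[repo] = average_rank
            i = j + 1
    return ranks


# Hypothetical repos, categories, and trust scores.
rows = [
    ("repo-a", 1, 0.9),
    ("repo-b", 1, 0.5),
    ("repo-c", 1, 0.5),
    ("repo-d", 2, 0.1),
]
ranks = rank_within_category(rows)
print(ranks)
```

In this example the two tied repos in category 1 both receive rank 2.5, while the only repo in category 2 receives rank 1.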


In [16]:
df_metrics.to_csv('data/rf5_applicant_github_metrics.csv')
df_metrics

artifact_url,project_name,project_category_id,num_contributors,num_trusted_contributors,num_contributors_last_6_months,num_stars,num_trusted_stars,trust_weighted_stars,num_forks,num_trusted_forks,trust_weighted_forks,trust_rank_for_repo_in_category,age_of_project_years,license(s),application_id
https://github.com/eth-infinitism/account-abstraction,Account Abstraction - ERC-4337,1,128.0,6.0,34.0,1510,17.0,0.017811,624,9.0,0.005068,20.0,2.416667,GPL-3.0,41923142-8520-43d2-9a97-1a9f895644ee
https://github.com/ethereum/act,Act,1,16.0,1.0,4.0,215,13.0,0.063650,36,0.0,0.000270,22.0,5.000000,AGPL-3.0,c9b60136-216c-424b-aa79-52bcb81285da
https://github.com/dappnode/dappnode,Dappnode,1,137.0,5.0,11.0,583,10.0,0.031963,100,1.0,0.001142,15.0,6.583333,GPL-3.0,03c24056-2795-45f8-963b-b675a616e1ac
https://github.com/erigontech/erigon,Erigon,1,1019.0,54.0,220.0,3106,52.0,0.002988,1095,34.0,0.000245,12.0,5.333333,LGPL-3.0,85dd5d93-d1b6-47d0-ac10-012618d999cf
https://github.com/rzmahmood/ethereum-pos-testnet,Ethereum POS Testnet,1,8.0,0.0,3.0,41,1.0,0.003047,18,0.0,0.000006,28.0,1.000000,MIT,82c49ee5-985f-48f2-8a28-c4cd8db34105
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
https://github.com/wevm/viem,Viem: TypeScript Tooling for OP Stack,3,675.0,28.0,312.0,2466,28.0,0.006152,746,20.0,0.017511,4.0,1.500000,Custom,a62680b3-1a97-4585-8201-193dc5cb7abf
https://github.com/dapp-learning-dao/dapp-learning,[DappLearning] Web3 Development Tutorial,3,229.0,3.0,41.0,5094,9.0,0.007433,1291,3.0,0.001201,10.0,3.250000,MIT,eb59793e-ec56-47fc-969c-4bb34c5d7647
https://github.com/0xfableorg/roll-op,roll-op,3,11.0,4.0,4.0,89,16.0,0.074109,25,4.0,0.003980,6.0,1.250000,BSD-3-Clause-Clear,44cefdfb-c91d-440d-8128-f02ba7426865
https://github.com/optimism-java/dispute-explorer,superproof,3,1.0,0.0,1.0,0,0.0,0.000000,0,0.0,0.000000,20.0,0.250000,MIT,0e20af70-ef20-4ef7-aec5-ee747d971fbb
