## Bus Factor: How high is the risk to a project should the most active people leave?

Bus factor quantifies the number of contributors a project can afford to lose before it stalls, using the hypothetical scenario of these people being run over by a bus. Typically, it is the smallest number of people who together account for 50% of contributions.
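
As a quick illustration of the definition, here is a minimal standalone sketch. The function name `bus_factor` and the contribution counts are made up for illustration; it is not the implementation used later in the notebook.

```python
def bus_factor(contribution_counts, threshold=0.5):
    """Smallest number of top contributors whose combined share reaches `threshold`."""
    total = sum(contribution_counts)
    if total == 0:
        return 0
    running = 0
    # walk through contributors from most to least active
    for rank, count in enumerate(sorted(contribution_counts, reverse=True), start=1):
        running += count
        if running / total >= threshold:
            return rank
    return len(contribution_counts)


# hypothetical contribution counts for six contributors (100 contributions in total)
toy_counts = [40, 25, 15, 10, 5, 5]
print(bus_factor(toy_counts))        # top two contributors cover 65% of contributions
print(bus_factor(toy_counts, 0.75))  # a higher threshold needs more contributors
```

In this toy example, a lower bus factor means the work is concentrated in fewer people.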

In this notebook, we analyze bus factor for a user-specified percentage of contributions, with options to filter out outliers and to parameterize the analysis by a time window and step size. We consider contributors including, but not limited to, individuals **1) making commits, 2) closing issues, and 3) having their PRs merged or merging PRs**.

**Note:** In the notebook, we use the Ansible repository as an example but our analyses can be replicated with any GitHub repository.

### Table of Contents
- [1. Connect to Augur database](#connect-to-augur-database)
- [2. Query data](#query-data)
- [3. Calculate bus factor](#calculate-bus-factor)
- [4. Bus factor by number of commits](#bus-factor-by-number-of-commits)
- [5. Bus factor by number of issues closed](#bus-factor-by-number-of-issues-closed)
- [6. Bus factor by number of PRs](#bus-factor-by-number-of-prs)

In [None]:
!pip install -q sqlalchemy

In [56]:
# import the required libraries and packages
import sqlalchemy as salc
import json
import os

import psycopg2
import pandas as pd 
import numpy as np
from scipy import stats

import datetime as dt
from dateutil.relativedelta import relativedelta

import warnings
warnings.filterwarnings('ignore')

## Connect to Augur database

In [2]:
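# NOTE: `config` is not defined in the cells shown here; it is assumed to be a dict of
# database credentials (user, password, host, port, database) loaded beforehand.
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])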

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})

## Query data
Let's first define some functions that will help us organize the data in a data frame and count the number of contributions each individual has made.  

In [22]:
def get_df(repo_query):
    '''
    :param repo_query: A SQL statement (string or SQLAlchemy text clause)
    returns a DataFrame with the query results; a 'date' column, if present, is converted to datetime
    '''
    # open a connection only for the duration of the query
    with engine.connect() as connection:
        df = pd.read_sql_query(repo_query, con=connection)
    df = df.reset_index(drop=True)
    if 'date' in df.columns:
        df['date'] = pd.to_datetime(df['date'])
    return df

In [4]:
def preprocess_df(df, activity):
    '''
    :param df: DataFrame
    :param activity: A string indicating the type of contribution
    returns a DataFrame where the columns are cntrb_id and the number of contributions of the given type
    '''
    # count the number of contributions per contributor
    df = (df.groupby('cntrb_id')[activity].count()).to_frame()
    df.sort_values(by=activity, ascending=False, inplace=True)
    df = df.reset_index()
    return df

## Calculate bus factor

While the bus factor is typically defined using 50% as the threshold for the share of contributions that the most active contributors account for, we're interested in a range of thresholds. In the following function `calc_bus_factor`, we've added a parameter `threshold` to do so.

Additionally, we're interested in filtering out outliers, i.e., individuals whose number of contributions is significantly below or above average. We've added a parameter `remove_outliers` to give us the option to do so. In our implementation, we define outliers as any contributors whose number of contributions is 2 or more standard deviations away from the mean.
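
To make the outlier rule concrete, here is a small sketch that uses only the standard library. The helper name `drop_outliers` and the counts are hypothetical; it uses the population standard deviation, which is also the default of `scipy.stats.zscore`.

```python
from statistics import mean, pstdev


def drop_outliers(counts, z_cutoff=2.0):
    """Keep only the values whose absolute z-score is below `z_cutoff`."""
    mu = mean(counts)
    sigma = pstdev(counts)  # population standard deviation (ddof=0)
    if sigma == 0:
        # all values are identical, so nothing can be an outlier
        return list(counts)
    return [c for c in counts if abs((c - mu) / sigma) < z_cutoff]


# hypothetical contribution counts: one contributor dominates the rest
toy_counts = [500, 12, 10, 9, 8, 7, 5, 4, 3, 2]
kept = drop_outliers(toy_counts)
print(f'Removed {len(toy_counts) - len(kept)} outlier(s); kept {kept}')
```

Because the dominant contributor is dropped before the shares are computed, the remaining contributions are spread more evenly, which is why a bus factor computed after filtering can be larger than one computed on the full data.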

In [145]:
def calc_bus_factor(df, activity, threshold, remove_outliers):
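    '''
    :param df: DataFrame with one row per contributor, sorted by number of contributions (descending)
    :param activity: A string naming the column that holds the contribution counts
    :param threshold: A float indicating the cut off threshold for bus factor
    :param remove_outliers: A boolean indicating whether to remove outliers or not
    returns the bus factor, the list of cntrb_ids that make it up, and the number of outliers removed
    '''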
    if remove_outliers:
        length_df = len(df)
        df = df[(np.abs(stats.zscore(df[activity])) < 2)]
        num_outliers = length_df - len(df)
    else: 
        num_outliers = 0
    
    # calculate total number of contributions
    t_commits = df[activity].sum()

    # calculate each contributor's share and accumulate until the threshold is reached
    cntrb_ids = []
    cum_per = 0
    df = df.assign(cntrbs = lambda x: (x[activity]/t_commits))
    df = df.reset_index()
    
    for i in range(len(df)):
        cum_per += df.loc[i, 'cntrbs']
        cntrb_ids.append(df.loc[i, 'cntrb_id'])
        if cum_per >= threshold:
            break
    
    bus_factor=len(cntrb_ids)
    return bus_factor, cntrb_ids, num_outliers

The bus factor can fluctuate over the lifetime of the project. Thus, we're interested in analyzing the bus factor in a specific window of time and seeing how it evolves over different periods.
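
To make the window and step idea concrete, here is a small standalone sketch that only generates the window boundaries. The helpers `add_months` and `sliding_windows` are hypothetical and use only the standard library; the month arithmetic is an assumption meant to approximate what `relativedelta(months=...)` does.

```python
import calendar
import datetime as dt


def add_months(day, months):
    """Shift a date by whole months, clamping the day to the target month's length."""
    month_index = day.month - 1 + months
    year = day.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return day.replace(year=year, month=month, day=min(day.day, last_day))


def sliding_windows(start, end, window_months, step_months):
    """Return (period_from, period_to) pairs for every window that ends before `end`."""
    windows = []
    current = start
    while add_months(current, window_months) < end:
        windows.append((current, add_months(current, window_months)))
        # slide the start of the window forward by the step size
        current = add_months(current, step_months)
    return windows


# 6-month windows sliding by 4 months over one year
windows = sliding_windows(dt.date(2017, 1, 24), dt.date(2018, 1, 24), 6, 4)
for period_from, period_to in windows:
    print(period_from, period_to)
```

When the step size is smaller than the window width, as here, consecutive windows overlap, so the same contributions can count toward more than one window.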

In [171]:
def bus_factor_by_windows(df, activity, start_date, end_date, window_width, step_size, threshold, remove_outliers):
    ''' 
    :param df: DataFrame with one row per contribution, including cntrb_id, date, and the activity column
    :param activity: A string naming the column that holds the contributions to count
    :param start_date: A string ('YYYY-MM-DD') representing the start date of the interval, or None
    :param end_date: A string ('YYYY-MM-DD') representing the end date of the interval, or None
    :param window_width: An integer representing the window in months for which the bus factor is calculated. 
    :param step_size: An integer representing the number of months by which the window slides
    :param threshold: A float indicating the cut off threshold for bus factor
    :param remove_outliers: A boolean indicating whether to remove outliers or not
    returns a DataFrame where the columns are the period from, period to, contributor id, and bus factor for that time interval
    '''
    # if the start date or end date is None, fall back to the first or last contribution date
    if start_date is None:
        start_date = df['date'].min()
    
    if end_date is None:
        end_date = df['date'].max()

    # initialize variables; pd.to_datetime accepts both 'YYYY-MM-DD' strings and Timestamps
    start_date = pd.to_datetime(start_date)
    end_date = pd.to_datetime(end_date)
    window_width=relativedelta(months=window_width)
    step_size=relativedelta(months=step_size)

    bfs=[]
    cntrb_ids = []
    period_from = []
    period_to = [] 

    # calculate bus factor for each window
    while start_date + window_width < end_date:
        period_from.append(start_date.strftime('%Y-%m-%d'))
        period_to.append((start_date + window_width).strftime('%Y-%m-%d'))

        # select the rows of df that fall between the start date and start date + window
        mask = (df['date'] >= start_date) & (df['date'] <= start_date+window_width)
        cntrbs = preprocess_df(df.loc[mask], activity)

        # calculate bus factor
        bus_factor, cntrb_id, _ = calc_bus_factor(cntrbs, activity, threshold, remove_outliers)
        
        if (bus_factor == 0) or (not cntrb_id):
            bfs.append(bus_factor)
            cntrb_ids.append('outlier removed')
        else:
            bfs.append(bus_factor)
            cntrb_ids.append(cntrb_id[0])

        # shift the window by the step size
        start_date += step_size

    # create a dataframe with period to, period from, bus factor, and cntrb_ids as columns
    df = pd.DataFrame(list(zip(period_from, period_to, cntrb_ids, bfs)), columns = ['period_from', 'period_to', 'cntrb_ids', 'bus_factor'])
    return df

Now that we have all of our functions defined, let's see them in practice!

## Bus factor by number of commits

In [10]:
repo_name = 'Ansible'
repo_statement = str([28336])
repo_statement = repo_statement[1:-1]

repo_query = salc.sql.text(f"""
                SELECT
                    ca.cntrb_id, 
                    c.cmt_commit_hash AS commit,
                    c.cmt_author_date AS date
                FROM
                    contributors_aliases ca
                JOIN commits c 
                ON c.cmt_committer_email = ca.alias_email
                WHERE
                    c.repo_id in({repo_statement})
               """)

In [11]:
# organize commits data in a DataFrame
conn_commits = get_df(repo_query)
display(conn_commits.head())

Unnamed: 0,cntrb_id,commit,date
0,01012f1b-7f00-0000-0000-000000000000,00003cbb7b8ac1ab4f729662190f028f9cd7fd62,2019-02-23
1,01012f1b-7f00-0000-0000-000000000000,00003cbb7b8ac1ab4f729662190f028f9cd7fd62,2019-02-23
2,01001c87-8900-0000-0000-000000000000,00031e454e02272d50051df4841e5f9487d27bf7,2017-01-06
3,01000d6a-1100-0000-0000-000000000000,00067c8d54e40fa564a1a8923857905a3dd99d77,2019-05-29
4,01012f1b-7f00-0000-0000-000000000000,00068e9fcc298b0b18858360d2b8b08708b4993b,2018-03-23


In [70]:
# count the number of commits per contributor and order the rows from most to least number of commits
commits = preprocess_df(conn_commits, 'commit')
display(commits.head())

Unnamed: 0,cntrb_id,commit
0,01012f1b-7f00-0000-0000-000000000000,58857
1,01000c4d-d800-0000-0000-000000000000,19832
2,01000099-ac00-0000-0000-000000000000,15935
3,01022886-a200-0000-0000-000000000000,11416
4,01000331-5a00-0000-0000-000000000000,9954


Based on the number of commits per contributor, we can see that the most active contributor has made a very large number of commits, more than 2x the second most active. Let's calculate the bus factor at a threshold of 50%, both without and with outliers removed, and compare our results.

In [152]:
bf_commits, cntrb_ids, _ = calc_bus_factor(commits, activity='commit', threshold=0.50, remove_outliers=False)
print(f'The bus factor in terms of number of commits for {repo_name} is {bf_commits}.')
print(' ')
bf_no_outliers, cntrb_ids, num_outliers = calc_bus_factor(commits, activity='commit', threshold=0.50, remove_outliers=True)
print(f'Filtering out {num_outliers} contributors that have been identified as outliers...')
print(f'The bus factor in terms of number of commits for {repo_name} with outliers removed is {bf_no_outliers}.')

The bus factor in terms of number of commits for Ansible is 3.
 
Filtering out 8 contributors that have been identified as outliers...
The bus factor in terms of number of commits for Ansible with outliers removed is 12.


Clearly, the three most active contributors are extreme outliers: they are among the 8 outliers we identified, yet on their own they account for roughly 50% of the commits to the repo.

Now let's analyze how the bus factor has evolved over a period of the project by taking windows of 6 months from Jan. 24, 2017 to Jan. 24, 2018 and sliding by a step size of 4 months.

In [172]:
bf_by_windows = bus_factor_by_windows(conn_commits, activity='commit',start_date='2017-01-24', end_date= '2018-01-24', window_width=6, step_size=4, threshold=0.5, remove_outliers=False)
bf_by_windows

Unnamed: 0,period_from,period_to,cntrb_ids,bus_factor
0,2017-01-24,2017-07-24,0 01012f1b-7f00-0000-0000-000000000000...,1
1,2017-05-24,2017-11-24,0 01012f1b-7f00-0000-0000-000000000000...,1


Now that we've analyzed the bus factor for one type of contribution to a project, let's explore others. Contributors include not only individuals who have committed directly to the project but also those who have contributed by closing issues.

## Bus factor by number of issues closed

In [126]:
issue_events_query = salc.sql.text(f"""
                SELECT
                    ie.cntrb_id,
                    ie.issue_id AS issue,
                    ie.created_at AS date, 
                    ie.action AS action
                FROM
                    issue_events ie
                WHERE
                    ie.repo_id in({repo_statement}) AND
                    ie.action = 'closed'
               """)

conn_issues = get_df(issue_events_query)
display(conn_issues.head())


In [149]:
issues_closed = preprocess_df(conn_issues, 'issue')
display(issues_closed.head())

Unnamed: 0,cntrb_id,issue
0,0100647b-c300-0000-0000-000000000000,495
1,01000099-ac00-0000-0000-000000000000,272
2,01000cc2-4b00-0000-0000-000000000000,233
3,0100128c-1e00-0000-0000-000000000000,194
4,0101cdd9-cf00-0000-0000-000000000000,143


In [151]:
bf_issues, cntrb_ids, _ = calc_bus_factor(issues_closed, activity='issue', threshold=0.50, remove_outliers=False)
print(f'The bus factor in terms of number of issues closed for {repo_name} is {bf_issues}.')
print(' ')
bf_no_outliers_issues, cntrb_ids, num_outliers = calc_bus_factor(issues_closed, activity='issue', threshold=0.50, remove_outliers=True)
print(f'Filtering out {num_outliers} contributors that have been identified as outliers...')
print(f'The bus factor in terms of number of issues closed for {repo_name} with outliers removed is {bf_no_outliers_issues}.')


The bus factor in terms of number of issues closed for Ansible is 4.
 
Filtering out 6 contributors that have been identified as outliers...
The bus factor in terms of number of issues closed for Ansible with outliers removed is 3.


We can further extend our notion of bus factor to the number of PRs. Here we calculate bus factor separately for contributors who've had their PRs merged and reviewers who have merged PRs.

## Bus factor by number of PRs

In [81]:
pr_query = salc.sql.text(f"""
                 SELECT 
                    pr.pull_request_id as pr, 
                    pr.pr_created_at,
                    pr.pr_merged_at,
                    pre.cntrb_id, 
                    prr.cntrb_id as reviewer
                FROM
                    repo r,
                    pull_requests pr, 
                    pull_request_events pre, 
                    pull_request_reviewers prr
                WHERE
                    r.repo_id in({repo_statement}) AND
                    pr.repo_id = r.repo_id AND
                    pr.pull_request_id = pre.pull_request_id AND
                    pr.pull_request_id = prr.pull_request_id AND 
                    pre.cntrb_id != prr.cntrb_id AND
                    pr_merged_at IS NOT NULL
        """)

conn_pr = get_df(pr_query)
display(conn_pr)

Unnamed: 0,pr,pr_created_at,pr_merged_at,cntrb_id,reviewer
0,116604254,2023-06-05 17:30:39,2023-06-07 14:30:25,01021389-0f00-0000-0000-000000000000,01000067-2300-0000-0000-000000000000
1,36457647,2022-12-08 15:59:14,2023-01-09 16:42:07,01001700-7c00-0000-0000-000000000000,01008121-3500-0000-0000-000000000000
2,36457647,2022-12-08 15:59:14,2023-01-09 16:42:07,01012aa8-bd00-0000-0000-000000000000,01008121-3500-0000-0000-000000000000
3,36457647,2022-12-08 15:59:14,2023-01-09 16:42:07,01012aa8-bd00-0000-0000-000000000000,01008121-3500-0000-0000-000000000000
4,36457647,2022-12-08 15:59:14,2023-01-09 16:42:07,01012aa8-bd00-0000-0000-000000000000,01008121-3500-0000-0000-000000000000
...,...,...,...,...,...
3746,68493795,2023-03-27 11:23:09,2023-04-06 19:27:59,0100647b-c300-0000-0000-000000000000,0101cdd9-cf00-0000-0000-000000000000
3747,68493795,2023-03-27 11:23:09,2023-04-06 19:27:59,0100647b-c300-0000-0000-000000000000,0101cdd9-cf00-0000-0000-000000000000
3748,68493795,2023-03-27 11:23:09,2023-04-06 19:27:59,0100647b-c300-0000-0000-000000000000,0101cdd9-cf00-0000-0000-000000000000
3749,68493795,2023-03-27 11:23:09,2023-04-06 19:27:59,0100647b-c300-0000-0000-000000000000,0101cdd9-cf00-0000-0000-000000000000


In [133]:
prs_cntrb = conn_pr[['pr', 'pr_merged_at', 'cntrb_id']]
prs_cntrb = preprocess_df(prs_cntrb, activity='pr')
display(prs_cntrb.head())

Unnamed: 0,cntrb_id,pr
0,0100647b-c300-0000-0000-000000000000,1744
1,010008d3-ef00-0000-0000-000000000000,244
2,01000099-ac00-0000-0000-000000000000,216
3,01000c4d-d800-0000-0000-000000000000,212
4,01001700-7c00-0000-0000-000000000000,181


Based on the number of merged PRs by contributor, there appears to be one clear outlier. Let's confirm our hypothesis by calculating the bus factor.

In [153]:
bf_prs_cntrbs, cntrb_ids, _ = calc_bus_factor(prs_cntrb, activity='pr', threshold=0.50, remove_outliers=False)
print(f'The bus factor in terms of contributors who had their PRs merged for {repo_name} is {bf_prs_cntrbs}.')
print(' ')
bf_no_outliers_prs_cntrbs, cntrb_ids, num_outliers = calc_bus_factor(prs_cntrb, activity='pr', threshold=0.50, remove_outliers=True)
print(f'Filtering out {num_outliers} contributors that have been identified as outliers...')
print(f'The bus factor in terms of contributors who had their PRs merged for {repo_name} with outliers removed is {bf_no_outliers_prs_cntrbs}.')

The bus factor in terms of contributors who had their PRs merged for Ansible is 2.
 
Filtering out 1 contributors that have been identified as outliers...
The bus factor in terms of contributors who had their PRs merged for Ansible with outliers removed is 5.


Great, we've validated our hypothesis. Now let's repeat the process for reviewers who've merged PRs.

In [134]:
prs_reviewers = conn_pr[['pr', 'pr_merged_at', 'reviewer']]
prs_reviewers = prs_reviewers.rename(columns={'reviewer': 'cntrb_id'})
prs_reviewers = preprocess_df(prs_reviewers, activity='pr')
display(prs_reviewers.head())

Unnamed: 0,cntrb_id,pr
0,01000099-ac00-0000-0000-000000000000,746
1,01006763-cc00-0000-0000-000000000000,647
2,01000cc2-4b00-0000-0000-000000000000,360
3,0101cdd9-cf00-0000-0000-000000000000,290
4,01012aa8-bd00-0000-0000-000000000000,284


In [155]:
bf_prs_reviewers, cntrb_ids, _ = calc_bus_factor(prs_reviewers, activity='pr', threshold=0.50, remove_outliers=False)
print(f'The bus factor in terms of reviewers who have merged PRs for {repo_name} is {bf_prs_reviewers}.')
print(' ')
bf_no_outliers_prs_reviewers, cntrb_ids, num_outliers = calc_bus_factor(prs_reviewers, activity='pr', threshold=0.50, remove_outliers=True)
print(f'Filtering out {num_outliers} contributors that have been identified as outliers...')
print(f'The bus factor in terms of reviewers who have merged PRs for {repo_name} with outliers removed is {bf_no_outliers_prs_reviewers}.')

The bus factor in terms of reviewers who have merged PRs for Ansible is 4.
 
Filtering out 2 contributors that have been identified as outliers...
The bus factor in terms of reviewers who have merged PRs for Ansible with outliers removed is 5.
