# Using Count-based Categorizing Intuition to Identify Core Developers in a GitHub Repository

In this notebook, we use count-based metrics to identify the core contributors of a GitHub repository. We apply the standard 80% threshold: contributors are ranked by contribution count, and the smallest top-ranked group that accounts for 80% of all contributions is classified as core. We use this threshold because it is widely adopted and is justified by contribution data following a Zipf distribution.
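To make the threshold concrete, here is a minimal sketch on invented data. The function name `core_contributors`, the `dev*` names, and the Zipf-like toy counts (roughly 1200 divided by rank) are assumptions for illustration only and are not taken from the Ansible data.

```python
def core_contributors(counts, threshold=80.0):
    """Return the top-ranked contributors whose cumulative share reaches `threshold` percent."""
    total = sum(counts.values())
    # Rank contributors from most to fewest contributions
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    core, cumulative = [], 0.0
    for name, n in ranked:
        if cumulative >= threshold:
            break  # the threshold is already covered by the contributors collected so far
        cumulative += 100.0 * n / total
        core.append(name)
    return core, cumulative

# Hypothetical Zipf-like counts: the contributor at rank r has about 1200 / r commits
toy_counts = {f"dev{r}": 1200 // r for r in range(1, 21)}
core, covered = core_contributors(toy_counts)
print(len(core), "of", len(toy_counts), "contributors cover", round(covered, 1), "% of commits")
```

As with the loops later in this notebook, the contributor who pushes the running total past 80% is included, so the covered share usually ends up slightly above the threshold.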

Count-based metrics outlined in [Joblin et al., Classifying Developers into Core and Peripheral (2017)](https://drive.google.com/file/d/19WdPBmpEdnU76aVnwOEnY8YtK8extM7C/view):
1. Commit count
2. Lines of code (LOC) count

Note: this notebook analyzes the Ansible-2 repository, but the code is general and can be applied to any GitHub repository.

### Definitions:

<ins>Core developers</ins> - play an essential role in developing the system architecture and forming the general leadership structure, and they have substantial, long-term involvement.

<ins>Peripheral developers</ins> - typically involved in bug fixes/small enhancements, and they have irregular or short-term involvement.


## Setting up imports and file path

In [1]:
!pip install --upgrade 'sqlalchemy<2.0'

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sqlalchemy<2.0
  Downloading SQLAlchemy-1.4.47-cp39-cp39-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.6 MB)
Installing collected packages: sqlalchemy
  Attempting uninstall: sqlalchemy
    Found existing installation: SQLAlchemy 2.0.9
    Uninstalling SQLAlchemy-2.0.9:
      Successfully uninstalled SQLAlchemy-2.0.9
Successfully installed sqlalchemy-1.4.47


In [2]:
import sqlalchemy as salc
import json
from google.colab import drive
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [3]:
# Mount Google Drive
drive.mount('/content/drive', force_remount=True)

# Change to the folder where your files are saved to or loaded from
%cd /content/drive/MyDrive/Aspen Research

Mounted at /content/drive
/content/drive/MyDrive/Aspen Research


## Connect to Augur database

In [4]:
with open("copy_cage-padres.json") as config_file: # MS changed path from ../comm_cage.json
    config = json.load(config_file)
    
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])
dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})
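One caveat with building the connection string via `format`: if the user name or password contains characters such as `@` or `/`, the URL can be misparsed. A minimal sketch of escaping the credentials with the standard library's `quote_plus` follows; the values in `example_config` are hypothetical placeholders, not the contents of the real config file.

```python
from urllib.parse import quote_plus

# Hypothetical credentials for illustration only
example_config = {
    "user": "augur",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "augur",
}

# Escape only the credentials; host, port, and database are left as-is
example_url = "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
    quote_plus(example_config["user"]),
    quote_plus(example_config["password"]),
    example_config["host"],
    example_config["port"],
    example_config["database"],
)
print(example_url)  # postgresql+psycopg2://augur:p%40ss%2Fword@localhost:5432/augur
```

The escaped string can then be passed to `salc.create_engine` in the same way as `database_connection_string` above.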

## Core contributors by # of commits

**Note:** We count distinct `cmt_commit_hash` values because the `commits` table stores one row (identified by `cmt_id`) for **each file** changed in a commit, so counting rows would overcount commits.
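The difference between counting rows and counting distinct hashes can be sketched with an in-memory SQLite table. This toy `commits` table is a simplified stand-in for the Augur table, and its rows and the `cmt_filename` column are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the Augur commits table: one row per file changed in a commit
conn.execute(
    "CREATE TABLE commits ("
    "cmt_id INTEGER PRIMARY KEY, cmt_commit_hash TEXT, cmt_filename TEXT)"
)
toy_rows = [
    ("a1", "setup.py"),     # commit a1 touches three files
    ("a1", "README.md"),
    ("a1", "lib/core.py"),
    ("b2", "lib/core.py"),  # commit b2 touches one file
]
conn.executemany(
    "INSERT INTO commits (cmt_commit_hash, cmt_filename) VALUES (?, ?)", toy_rows
)

row_count, commit_count = conn.execute(
    "SELECT COUNT(*), COUNT(DISTINCT cmt_commit_hash) FROM commits"
).fetchone()
print("rows:", row_count, "distinct commits:", commit_count)  # rows: 4 distinct commits: 2
conn.close()
```

Counting rows would credit the author of `a1` with three commits instead of one, which is why the query below uses `COUNT(DISTINCT c.cmt_commit_hash)`.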

In [5]:
repo_org = 'ansible'
repo_name = 'ansible-2'
cmt_query = salc.sql.text(f"""
                 SET SCHEMA 'augur_data';
                 SELECT 
                    ca.cntrb_id,
                    COUNT(DISTINCT c.cmt_commit_hash) as num_commits,
                    COUNT(DISTINCT c.cmt_commit_hash)*100/(SELECT COUNT(DISTINCT cmt_commit_hash)*1.0 FROM commits WHERE repo_id = (SELECT repo_id FROM repo WHERE repo_name ='{repo_name}')) as pct_commits
                FROM
                    repo_groups a,
                    repo b,
                    commits c, 
                    contributors_aliases ca 
                WHERE
                    a.repo_group_id = b.repo_group_id AND
                    a.rg_name = \'{repo_org}\' AND
                    b.repo_name = \'{repo_name}\' AND 
                    c.cmt_committer_email = ca.alias_email AND 
                    b.repo_id = c.repo_id
                GROUP BY
                    ca.cntrb_id
                ORDER BY
                    num_commits DESC
        """)
    
cmt_data = pd.read_sql(cmt_query, con=engine)
display(cmt_data)

Unnamed: 0,cntrb_id,num_commits,pct_commits
0,01012f1b-7f00-0000-0000-000000000000,9831,18.399431
1,01000c4d-d800-0000-0000-000000000000,7774,14.549606
2,01022886-a200-0000-0000-000000000000,5289,9.898748
3,01000cc2-4b00-0000-0000-000000000000,4233,7.922367
4,01000067-2300-0000-0000-000000000000,2896,5.420074
...,...,...,...
1278,01000ae9-7b00-0000-0000-000000000000,1,0.001872
1279,01000afd-b700-0000-0000-000000000000,1,0.001872
1280,01000134-de00-0000-0000-000000000000,1,0.001872
1281,010001a6-f100-0000-0000-000000000000,1,0.001872


Identify contributors responsible for 80% of commits

In [6]:
total_pct = 0
top_cmt_contributors = []
# Rows are sorted by num_commits (descending); stop once 80% is covered
for i, row in cmt_data.iterrows():
    if total_pct >= 80:
        break
    total_pct += row['pct_commits']
    top_cmt_contributors.append(row['cntrb_id'])

print('Core contributors:', top_cmt_contributors)
print('Number of core contributors:', len(top_cmt_contributors))
print('Total percentage:', total_pct)

Core contributors: [UUID('01012f1b-7f00-0000-0000-000000000000'), UUID('01000c4d-d800-0000-0000-000000000000'), UUID('01022886-a200-0000-0000-000000000000'), UUID('01000cc2-4b00-0000-0000-000000000000'), UUID('01000067-2300-0000-0000-000000000000'), UUID('01000331-5a00-0000-0000-000000000000'), UUID('01000e5a-0d00-0000-0000-000000000000'), UUID('01001c87-8900-0000-0000-000000000000'), UUID('01000d6a-1100-0000-0000-000000000000'), UUID('010009b6-c200-0000-0000-000000000000'), UUID('0100005d-0100-0000-0000-000000000000'), UUID('01000099-ac00-0000-0000-000000000000'), UUID('010009ab-a500-0000-0000-000000000000'), UUID('010005ec-6600-0000-0000-000000000000'), UUID('01008121-3500-0000-0000-000000000000'), UUID('01006763-cc00-0000-0000-000000000000'), UUID('01003dc3-f500-0000-0000-000000000000'), UUID('0101cdd9-cf00-0000-0000-000000000000'), UUID('01000ac2-8900-0000-0000-000000000000'), UUID('01012aa8-bd00-0000-0000-000000000000'), UUID('01000653-a200-0000-0000-000000000000'), UUID('010009e9

## Core contributors by # lines of code

In [7]:
repo_org = 'ansible'
repo_name = 'ansible-2'
loc_query = salc.sql.text(f"""
                 SET SCHEMA 'augur_data';
                 SELECT 
                    ca.cntrb_id,
                    SUM(c.cmt_added+c.cmt_removed) as num_lines,
                    100.0 * SUM(c.cmt_added+c.cmt_removed) / 
                        (SELECT SUM(c2.cmt_added+c2.cmt_removed) 
                         FROM commits c2 JOIN repo r2 ON c2.repo_id = r2.repo_id 
                         WHERE r2.repo_name = '{repo_name}'
                        ) as pct_lines                
                 FROM
                    repo_groups a,
                    repo b,
                    commits c, 
                    contributors_aliases ca 
                WHERE
                    a.repo_group_id = b.repo_group_id AND
                    a.rg_name = \'{repo_org}\' AND
                    b.repo_name = \'{repo_name}\' AND 
                    c.cmt_committer_email = ca.alias_email AND 
                    b.repo_id = c.repo_id
                GROUP BY
                    ca.cntrb_id
                ORDER BY
                    num_lines DESC
        """)
    
loc_data = pd.read_sql(loc_query, con=engine)
display(loc_data)

Unnamed: 0,cntrb_id,num_lines,pct_lines
0,01000099-ac00-0000-0000-000000000000,2431622,31.261367
1,01012f1b-7f00-0000-0000-000000000000,1688957,21.713533
2,01000c4d-d800-0000-0000-000000000000,617697,7.941223
3,01000e5a-0d00-0000-0000-000000000000,499929,6.427177
4,01022886-a200-0000-0000-000000000000,282792,3.635625
...,...,...,...
1278,01000d1a-8900-0000-0000-000000000000,1,0.000013
1279,01000e45-5100-0000-0000-000000000000,1,0.000013
1280,010001d4-ce00-0000-0000-000000000000,0,0.000000
1281,01000380-b400-0000-0000-000000000000,0,0.000000


Identify contributors responsible for 80% of lines of code in commits

In [8]:
total_pct = 0
top_loc_contributors = []
# Rows are sorted by num_lines (descending); stop once 80% is covered
for i, row in loc_data.iterrows():
    if total_pct >= 80:
        break
    total_pct += row['pct_lines']
    top_loc_contributors.append(row['cntrb_id'])

print('Core contributors:', top_loc_contributors)
print('Number of core contributors:', len(top_loc_contributors))
print('Total percentage:', total_pct)

Core contributors: [UUID('01000099-ac00-0000-0000-000000000000'), UUID('01012f1b-7f00-0000-0000-000000000000'), UUID('01000c4d-d800-0000-0000-000000000000'), UUID('01000e5a-0d00-0000-0000-000000000000'), UUID('01022886-a200-0000-0000-000000000000'), UUID('0100ece0-2b00-0000-0000-000000000000'), UUID('01000331-5a00-0000-0000-000000000000'), UUID('01000cc2-4b00-0000-0000-000000000000'), UUID('010009b6-c200-0000-0000-000000000000')]
Number of core contributors: 9
Total percentage: 80.01333442867977


In [9]:
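# Contributors classified as core by both the commit count and the LOC count
top_contributors = set(top_cmt_contributors).intersection(set(top_loc_contributors))
print('Core contributors by both commit and LOC counts:')
for cntrb_id in top_contributors:  # avoid shadowing the built-in id()
    print(cntrb_id)
print('Number of core contributors:', len(top_contributors))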

Core contributors by both commit and LOC counts:
01000099-ac00-0000-0000-000000000000
01012f1b-7f00-0000-0000-000000000000
01000e5a-0d00-0000-0000-000000000000
01022886-a200-0000-0000-000000000000
01000c4d-d800-0000-0000-000000000000
01000331-5a00-0000-0000-000000000000
01000cc2-4b00-0000-0000-000000000000
010009b6-c200-0000-0000-000000000000
Number of core contributors: 8
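Beyond the intersection, it can be useful to see where the two metrics disagree. The sketch below uses plain set operations on hypothetical contributor IDs; the names `by_commits` and `by_loc` and the Jaccard overlap score are illustrative choices rather than part of the analysis above.

```python
# Hypothetical core sets produced by two different count-based metrics
by_commits = {"dev1", "dev2", "dev3", "dev4"}
by_loc = {"dev1", "dev2", "dev5"}

both = by_commits & by_loc          # core under both metrics
only_commits = by_commits - by_loc  # core by commit count only
only_loc = by_loc - by_commits      # core by LOC count only

# Jaccard similarity: shared members divided by all members of either set
jaccard = len(both) / len(by_commits | by_loc)

print("both:", sorted(both))
print("commit count only:", sorted(only_commits))
print("LOC count only:", sorted(only_loc))
print("Jaccard overlap:", jaccard)
```

With the real data, `set(top_cmt_contributors)` and `set(top_loc_contributors)` would take the place of the two toy sets.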
