# Graphical Analysis of GitHub Repositories and Contributors

In this notebook, we programatically view the connections between open source projects, determine project clusters, and map out technology ecosystems. We explore the Augur GitHub data to view relationships between open source projects and communities by studying graphs based on relations such as common contributors and project activities between different GitHub repositories.

## Connect to Augur database

Until the Operate First enviroment can connect to the DB, use config file to access. Do not push config file to Github repo

In [1]:
import psycopg2
import pandas as pd
import collections
from functools import reduce
import datetime

import sqlalchemy as salc
import json
import os
import networkx as nx
import matplotlib.pyplot as plt

with open("../../../config.json") as config_file:
    config = json.load(config_file)

In [2]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})

## Get all the Data, Build all the Graphs

In [3]:
# Subset repositories based on a category
# Selecting repositories that fall under the Containers org on Github
science_repo_sql = salc.sql.text(f"""
                 SET SCHEMA 'augur_data';
                    --science 
                    SELECT
                    	repo_git, ntile(4) over ( order by commits_all_time) 
                    FROM
                    	repo A,
                    	(
                    	SELECT C.repo_id, d.commits_all_time
                    	FROM
                    		augur_operations.users A,
                    		augur_operations.user_groups b,
                    		augur_operations.user_repos C, 
                    		api_get_all_repos_commits d 
                    	WHERE
                    		A.user_id = b.user_id 
                    		AND b.group_id = C.group_id 
                    		AND d.repo_id= c.repo_id
                    		AND b.NAME = 'Google' --AND lower(A.login_name)='numfocus'
                    		
                    	ORDER BY
                    		A.login_name,
                    		d.commits_all_time,
                    		b.group_id 
                    	) b 
                    WHERE
                    	A.repo_id = b.repo_id order by commits_all_time desc
                            """)

with engine.connect() as conn:
    #df = pd.read_sql(sql, cnxn)
    results = conn.execute(science_repo_sql)
    df_results = pd.DataFrame(results) 

print(df_results)

                                               repo_git  ntile
0                       https://github.com/google/kmsan      4
1                https://github.com/google/ghost-kernel      4
2              https://github.com/google/capsicum-linux      4
3          https://github.com/google/intellij-community      4
4                https://github.com/google/llvm-project      4
...                                                 ...    ...
2466                  https://github.com/google/x-amber      1
2467        https://github.com/google/dropzone-polyfill      1
2468           https://github.com/google/gnostic-models      1
2469              https://github.com/google/lut3d_utils      1
2470  https://github.com/google/vertex-pipelines-boi...      1

[2471 rows x 2 columns]


In [4]:
#find unique ntile values
ntiles = df_results.ntile.unique()
display(df_results) 
#sort values smallest to largest
ntiles.sort()

#display sorted values
display(ntiles)

#run analysis in ntiles
#surveys_df[surveys_df.year == 2002]

for i in ntiles:
    repo_set=[]
    repo_git_set = []
    repo_name_set = []
    result_tile = df_results[df_results.ntile==i]
    #print(i)
    #print(result_tile)

    
    print("Graphs for NTILE: " + str(i))
    print('Starting Data Collection for NTILE: ' + str(i))
    ct = datetime.datetime.now()
    print("current time:-", ct)

#    for index, row in result_tile.iterrows():
    for row in result_tile.itertuples(index = True):
        trepo_git=getattr(row,"repo_git")
        display(trepo_git)
        repo_query = salc.sql.text(f"""
                     SET SCHEMA 'augur_data';
                     SELECT 
                        b.repo_id,
                        b.repo_name
                    FROM
                        repo_groups a,
                        repo b
                    WHERE
                        a.repo_group_id = b.repo_group_id AND
                        b.repo_git = '{trepo_git}'
            """)
        with engine.connect() as conn:
            results = conn.execute(repo_query)
            df2_results = pd.DataFrame(results) 
        df2_results.reset_index(drop=True, inplace=True) 
        repo_id = int(df2_results['repo_id'].values[0])
        repo_name = df2_results['repo_name'].to_string(index=False)
        repo_set.append(repo_id)
        repo_name_set.append(repo_name)
        repo_git_set.append(trepo_git)    

        #Issue Contributors
        issue_contrib = pd.DataFrame()
        for repo_id in repo_set:
            repo_query = salc.sql.text(f"""
                        SET SCHEMA 'augur_data';
                        SELECT r.repo_id,
                        r.repo_git,
                        i.reporter_id as cntrb_id,
                        i.issue_id
                        FROM
                        repo r, issues i
                         WHERE
                        i.repo_id = {repo_id} AND
                        i.repo_id = r.repo_id
                """)
            df_current_repo = pd.read_sql(repo_query, con=engine)
            issue_contrib = pd.concat([issue_contrib, df_current_repo])
        
        issue_contrib = issue_contrib.reset_index()
        issue_contrib.drop("index", axis=1, inplace=True)
        issue_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'issue_id']
        #display(issue_contrib)
        #issue_contrib.dtypes    
    
        #PR Contributors
        pr_contrib = pd.DataFrame()
        
        for repo_id in repo_set:
            repo_query = salc.sql.text(f"""
                        SET SCHEMA 'augur_data';
                        SELECT r.repo_id,
                        r.repo_git,
                        prm.cntrb_id,
                        prm.pull_request_id
                        FROM
                        repo r, pull_request_meta prm
                        WHERE
                        prm.repo_id = {repo_id} AND
                        prm.repo_id = r.repo_id
                """)
            df_current_repo = pd.read_sql(repo_query, con=engine)
            pr_contrib = pd.concat([pr_contrib, df_current_repo])
        
        pr_contrib = pr_contrib.reset_index()
        pr_contrib.drop("index", axis=1, inplace=True)
        pr_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'pull_request_id']
        #display(pr_contrib)
        #pr_contrib.dtypes

    
        #PR Reviewers
        prr_contrib = pd.DataFrame()
        
        for repo_id in repo_set:
            repo_query = salc.sql.text(f"""
                        SET SCHEMA 'augur_data';
                        SELECT r.repo_id,
                        r.repo_git,
                        prr.cntrb_id,
                        prr.pull_request_id
                        FROM
                        repo r, pull_request_reviews prr
                        WHERE
                        prr.repo_id = {repo_id} AND
                        prr.repo_id = r.repo_id
                """)
            df_current_repo = pd.read_sql(repo_query, con=engine)
            prr_contrib = pd.concat([prr_contrib, df_current_repo])
        
        pr_contrib = pr_contrib.reset_index()
        pr_contrib.drop("index", axis=1, inplace=True)
        prr_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'pull_request_id']
        #display(prr_contrib)
        #prr_contrib.dtypes

    
        # Commit Contributors
        commit_contrib = pd.DataFrame()
        
        for repo_id in repo_set:
            repo_query = salc.sql.text(f"""
                        SET SCHEMA 'augur_data';
                        SELECT r.repo_id,
                        r.repo_git,
                        ca.cntrb_id,
                        c.cmt_id
                        FROM
                        repo r, commits c, contributors_aliases ca
                        WHERE
                        c.repo_id = {repo_id} AND
                        c.repo_id = r.repo_id and
                        c.cmt_committer_email = ca.alias_email
                """)
            df_current_repo = pd.read_sql(repo_query, con=engine)
            commit_contrib = pd.concat([commit_contrib, df_current_repo])
        
        commit_contrib = commit_contrib.reset_index()
        commit_contrib.drop("index", axis=1, inplace=True)
        commit_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'cmt_id']
        #display(commit_contrib)
        #commit_contrib.dtypes

###########################################################################

    print('Completed Data Collection for NTILE: ' + str(i))
    ct = datetime.datetime.now()
    print("current time:-", ct)
    
    # Commit Contributor Graph
    df_commit = commit_contrib.groupby(['repo_id', 'cntrb_id']).size().unstack(fill_value=0)
    df_commit.head()
    df_commit = df_commit.reset_index()
    
    df_melted_commit = df_commit.melt(
        ['repo_id'],
        var_name = 'cntrb_id',value_name='number')
    
    df_melted_commit = df_melted_commit[df_melted_commit[df_melted_commit.columns[2]] != 0]
    df_melted_commit.head()
    G = nx.from_pandas_edgelist(df_melted_commit, 
                                source='repo_id',
                                target='cntrb_id',
                                edge_attr='number',
                                create_using=nx.MultiGraph())
    nodes = G.nodes()
    Repo_id = df_melted_commit['repo_id'].to_list()
    contributor_id = df_melted_commit['cntrb_id'].to_list()
    colors = ['blue' if n in Repo_id else 'yellow' for n in nodes]

    fig, ax = plt.subplots(figsize=(20,20))
    #yellow_patch = mpatches.Patch(color='yellow', label='Contributor')
    #blue_patch = mpatches.Patch(color='blue', label='Repository')
    #ax.legend(handles=[yellow_patch, blue_patch])
    print('commit contributor graph')
  

    pos = nx.fruchterman_reingold_layout(G)
    spos = nx.spring_layout(G, pos=pos, k=.4)
    anonygraph = nx.draw_networkx(G, node_color=colors, with_labels=False, pos=spos, alpha=0.2, font_size=7, ax=ax)
    filename = "commit_contributor_graph_ntile_" + str(i) +"_corporate.png"
    plt.savefig(fname=filename, format="png")
    plt.show(nx.draw_networkx(G, node_color=colors, pos=spos, with_labels=False, alpha=0.2, font_size=7, ax=ax))
    nx.draw_networkx(G, node_color=colors, with_labels=False, pos=spos, alpha=0.2, font_size=7, ax=ax)   
    
    #nx.draw_networkx(G, node_color=colors, with_labels=False, font_size=8, ax=ax)
    print('Graph for commit contributors should have just printed')
    ct = datetime.datetime.now()
    print("current time:-", ct)

    # Issue Contributor Graph
    df_issue = issue_contrib.groupby(['repo_id', 'cntrb_id']).size().unstack(fill_value=0)
    df_issue = df_issue.reset_index()
    df_melted_issue = df_issue.melt(
        ['repo_id'],
        var_name = 'cntrb_id',value_name='number')
    Repo_id = df_melted_issue['repo_id'].to_list()
    contributor_id = df_melted_issue['cntrb_id'].to_list()
    
    G = nx.from_pandas_edgelist(df_melted_issue, 
                                source='repo_id',
                                target='cntrb_id',
                                edge_attr='number',
                                create_using=nx.MultiGraph())
    nodes = G.nodes()
    colors = ['blue' if n in Repo_id else 'yellow' for n in nodes]
    fig, ax = plt.subplots(figsize=(20,20))
    #ax.legend(handles=[yellow_patch, blue_patch])
    print('issue contributor graph')


    pos = nx.fruchterman_reingold_layout(G)
    spos = nx.spring_layout(G, pos=pos, k=.4)
    anonygraph = nx.draw_networkx(G, node_color=colors, with_labels=False, pos=spos, alpha=0.2, font_size=7, ax=ax)
    filename = "issue_contributor_graph_ntile_" + str(i) +"_corporate.png"
    plt.savefig(fname=filename, format="png")
    plt.show(nx.draw_networkx(G, node_color=colors, pos=spos, with_labels=False, alpha=0.2, font_size=7, ax=ax))
    nx.draw_networkx(G, node_color=colors, with_labels=False, pos=spos, alpha=0.2, font_size=7, ax=ax)   

    
    print('Graph for issue contributors should have just printed')
    ct = datetime.datetime.now()
    print("current time:-", ct)

    ### PR Contributor Graph
    df_pr = pr_contrib.groupby(['repo_id', 'cntrb_id']).size().unstack(fill_value=0)
    df_pr = df_pr.reset_index()
    df_melted_pr = df_pr.melt(
        ['repo_id'],
        var_name = 'cntrb_id',value_name='number')

    df_melted_pr = df_melted_pr[df_melted_pr[df_melted_pr.columns[2]] != 0]
    Repo_id = df_melted_issue['repo_id'].to_list()
    contributor_id = df_melted_issue['cntrb_id'].to_list()
    
    G = nx.from_pandas_edgelist(df_melted_pr, 
                                source='repo_id',
                                target='cntrb_id',
                                edge_attr='number',
                                create_using=nx.MultiGraph())

    nodes = G.nodes()
    colors = ['blue' if n in Repo_id else 'yellow' for n in nodes]

    fig, ax = plt.subplots(figsize=(20,20))
    pos = nx.fruchterman_reingold_layout(G)
    print('PR contributor graph')


    pos = nx.fruchterman_reingold_layout(G)
    spos = nx.spring_layout(G, pos=pos, k=.4)
    anonygraph = nx.draw_networkx(G, node_color=colors, with_labels=False, pos=spos, alpha=0.2, font_size=7, ax=ax)
    filename = "PR_contributor_graph_ntile_" + str(i) +"_corporate.png"
    plt.savefig(fname=filename, format="png")
    plt.show(nx.draw_networkx(G, node_color=colors, pos=spos, with_labels=False, alpha=0.2, font_size=7, ax=ax))
    nx.draw_networkx(G, node_color=colors, with_labels=False, pos=spos, alpha=0.2, font_size=7, ax=ax)   
    
    print('Graph for PR contributors should have just printed')
    ct = datetime.datetime.now()
    print("current time:-", ct)



Unnamed: 0,repo_git,ntile
0,https://github.com/google/kmsan,4
1,https://github.com/google/ghost-kernel,4
2,https://github.com/google/capsicum-linux,4
3,https://github.com/google/intellij-community,4
4,https://github.com/google/llvm-project,4
...,...,...
2466,https://github.com/google/x-amber,1
2467,https://github.com/google/dropzone-polyfill,1
2468,https://github.com/google/gnostic-models,1
2469,https://github.com/google/lut3d_utils,1


array([1, 2, 3, 4])

Graphs for NTILE: 1
Starting Data Collection for NTILE: 1
current time:- 2023-07-03 11:29:26.073643


'https://github.com/google/lk_onload_stub'

'https://github.com/google/n-digit-mnist'

'https://github.com/google/budget-protector'

'https://github.com/google/truestreet'

'https://github.com/google/ruby-openid-apps-discovery'

'https://github.com/google/python-yaml-config'

'https://github.com/google/spindle-dv360'

'https://github.com/google/ashier'

'https://github.com/google/voice-iot-maker-demo'

'https://github.com/google/video-localized-narratives'

'https://github.com/google/rtb_creative_filtering_report'

'https://github.com/google/mockable_filesystem.dart'

'https://github.com/google/guice-aqueduct'

'https://github.com/google/asset-check'

'https://github.com/google/schemaorg-java'

'https://github.com/google/security-annotation-tools'

'https://github.com/google/hana-bq-beam-connector'

'https://github.com/google/rttcp'

'https://github.com/google/episodes.dart'

'https://github.com/google/androiddevicesjs'

'https://github.com/google/keep-sorted'

'https://github.com/google/it-cert-automation-practice'

'https://github.com/google/ai_video_dubbing'

'https://github.com/google/segment'

'https://github.com/google/deepboost'

'https://github.com/google/liburing_cpp'

'https://github.com/google/mweb-analysis-tools'

'https://github.com/google/memcpy-gemm'

'https://github.com/google/custom-tab-groups'

'https://github.com/google/randomized-graphics-shaders'

'https://github.com/google/secret-manager-with-sendgrid'

'https://github.com/google/angular_node_bind.dart'

'https://github.com/google/encrypted-bigquery-client'

'https://github.com/google/python-laurel'

'https://github.com/google/java-sourcetools'

'https://github.com/google/python-atfork'

'https://github.com/google/ct-hackday-schwag'

'https://github.com/google/git-tree'

'https://github.com/google/active-qa'

'https://github.com/google/rustcxx'

'https://github.com/google/dart-emacs-plugin-unsupported'

'https://github.com/google/poweranalysis'

'https://github.com/google/.allstar'

'https://github.com/google/ngrx-visualizer'

'https://github.com/google/go-circuits'

'https://github.com/google/goldfinch'

'https://github.com/google/responsible-innovation'

'https://github.com/google/adcase'

'https://github.com/google/model_search'

'https://github.com/google/googlesource-auth-tools'

'https://github.com/google/sgtm-migrator'

'https://github.com/google/web-starter-kit-extras'

'https://github.com/google/rfmt'

'https://github.com/google/hangouts-chat-bot-cloud-function-nodejs-example'

'https://github.com/google/traceout'

'https://github.com/google/arc-proselint'

'https://github.com/google/pymql'

'https://github.com/google/tim-gan'

'https://github.com/google/gpu-emulation-stress-test'

'https://github.com/google/godata'

'https://github.com/google/chrome-tabber'

'https://github.com/google/virtualdesktops-extension'

'https://github.com/google/eclipse2017'

'https://github.com/google/paper-gui'

'https://github.com/google/globalfoundries-pdk-ip-gf180mcu_fd_ip_sram'

'https://github.com/google/project-gameface'

'https://github.com/google/zetasketch'

'https://github.com/google/e3d_lstm'

'https://github.com/google/gce-rescue'

'https://github.com/google/tsunami-security-scanner-testbed'

'https://github.com/google/gitprotocolio'

'https://github.com/google/tabletopaudio-action'

'https://github.com/google/marmot'

'https://github.com/google/receipts-to-riches-part1'

'https://github.com/google/google.github.io'

'https://github.com/google/librato.dart'

'https://github.com/google/fchan-go'

'https://github.com/google/ion'

'https://github.com/google/aegis_cipher'

'https://github.com/google/allstar-config'

'https://github.com/google/deluca-igpc'

'https://github.com/google/i18n_sanitycheck'

'https://github.com/google/jarvan'

'https://github.com/google/amss'

'https://github.com/google/dynamex-proto'

'https://github.com/google/dynamic-video-depth'

'https://github.com/google/chat-enhanced'

'https://github.com/google/webview-local-server'

'https://github.com/google/coursebuilder-hello-world-module'

'https://github.com/google/tsviewdb'

'https://github.com/google/mcp2221-rs'

'https://github.com/google/chrome-language-immersion'

'https://github.com/google/flutter-stream-extensions'

'https://github.com/google/java-video-live-stream'

'https://github.com/google/vapso'

'https://github.com/google/.github'

'https://github.com/google/lecam-gan'

'https://github.com/google/rysim'

'https://github.com/google/linux-sensor'

'https://github.com/google/us-altgr-intl'

'https://github.com/google/ot-crdt-papers'

'https://github.com/google/ga4-ecom-attributor'

'https://github.com/google/cpu-check'

'https://github.com/google/compynator'

'https://github.com/google/causalexpanalysis'

'https://github.com/google/coursebuilder-lti-module'

'https://github.com/google/crumsort-rs'

'https://github.com/google/retrieval-qa-eval'

'https://github.com/google/voltair'

'https://github.com/google/strawnet'

'https://github.com/google/gpu-mux'

'https://github.com/google/actions-on-google-pq2-template-sdk'

'https://github.com/google/device-access-sample-web-app'

'https://github.com/google/hell0world-curriculum'

'https://github.com/google/geoexperimentsresearch'

'https://github.com/google/crossmodal-3600'

'https://github.com/google/mozart'

'https://github.com/google/csp-validator'

'https://github.com/google/libcppbor'

'https://github.com/google/vmregistry'

'https://github.com/google/android-wear-stitch-script'

'https://github.com/google/recog'

'https://github.com/google/meet-on-fhir'

'https://github.com/google/pinotify'

'https://github.com/google/argtail-check'

'https://github.com/google/swift'

'https://github.com/google/mobile-data-download'

'https://github.com/google/helm-broker'

'https://github.com/google/blockly-experimental'

'https://github.com/google/formal-ml'

'https://github.com/google/h2e_technical_documentation'

'https://github.com/google/weighted-dict'

'https://github.com/google/android-auto-companion-calendarsync-ios'

'https://github.com/google/py-decorators-tutorial'

'https://github.com/google/dcm-bulk-trafficking'

'https://github.com/google/sg2im'

'https://github.com/google/cloud-berg'

'https://github.com/google/cabal2bazel'

'https://github.com/google/cordova-plugin-browsertab'

'https://github.com/google/nips_assignments'

'https://github.com/google/video-exclusion-toolbox'

'https://github.com/google/picasa-app-demo'

'https://github.com/google/elfling'

'https://github.com/google/dv360-bidbyweather'

'https://github.com/google/pbvi'

'https://github.com/google/atlassian-addons-audit-sheet'

'https://github.com/google/cm-creatives-drive-uploader'

'https://github.com/google/gke-auditor'

'https://github.com/google/pdl-language'

'https://github.com/google/skywater-pdk-libs-sky130_fd_bd_sram'

'https://github.com/google/terminal-py'

'https://github.com/google/exists-ref'

'https://github.com/google/go-webdav'

'https://github.com/google/teknowledge'

'https://github.com/google/deluca-lung'

'https://github.com/google/gulp-google-closure-deps'

'https://github.com/google/edf'

'https://github.com/google/dcm_bulk_onboarding'

'https://github.com/google/google-drive-dokany'

'https://github.com/google/distributed-git-forks'

'https://github.com/google/tcav-for-ehr'

'https://github.com/google/agency-ads-management-solutions'

'https://github.com/google/vbootrom'

'https://github.com/google/googlesre'

'https://github.com/google/ceviche-challenges'

'https://github.com/google/multibox'

'https://github.com/google/ds-trix-addon'

'https://github.com/google/library-wrapper-unity-common'

'https://github.com/google/web-audio-recognition'

'https://github.com/google/go-pipeline'

'https://github.com/google/boundedwait'

'https://github.com/google/service_worker.dart'

'https://github.com/google/trueview-sdf-generator'

'https://github.com/google/py-ast-utils'

'https://github.com/google/shipshape-demo'

'https://github.com/google/libhidtelephony'

'https://github.com/google/go-write'

'https://github.com/google/libprio-cc'

'https://github.com/google/uvq'

'https://github.com/google/flutter-mediapipe'

'https://github.com/google/cog'

'https://github.com/google/swiftlogfirecloud'

'https://github.com/google/testrunner-rosemary'

'https://github.com/google/go-pcie-tlp'

'https://github.com/google/git-appraise-eclipse'

'https://github.com/google/rust-multihash'

'https://github.com/google/http2preload'

'https://github.com/google/metrosvg'

'https://github.com/google/embedding-tests'

'https://github.com/google/go-pcie-screamer'

'https://github.com/google/power-traces'

'https://github.com/google/drone-firebase'

'https://github.com/google/inception'

'https://github.com/google/adapt-googleanalytics'

'https://github.com/google/bgu'

'https://github.com/google/speech_intelligibility_index'

'https://github.com/google/dreambooth'

'https://github.com/google/selinux-policy-languages'

'https://github.com/google/wikiloop-analysis'

'https://github.com/google/favcolor-findidp'

'https://github.com/google/asset-inventory-worksheet'

'https://github.com/google/trustscore'

'https://github.com/google/graph_distillation'

'https://github.com/google/easybundler'

'https://github.com/google/python-dimond'

'https://github.com/google/plb'

'https://github.com/google/ads-account-structure-script'

'https://github.com/google/copper'

'https://github.com/google/android-arscblamer'

'https://github.com/google/tpm-js'

'https://github.com/google/file-based-test-driver'

'https://github.com/google/ad_language_monitor'

'https://github.com/google/vertex-ai-nas'

'https://github.com/google/dom-tutorials'

'https://github.com/google/clerk'

'https://github.com/google/minetest_pnr'

'https://github.com/google/gde-speakersbureau'

'https://github.com/google/mpact-sim-codelabs'

'https://github.com/google/creatine-ads-inspector'

'https://github.com/google/templatekit'

'https://github.com/google/gps_data_solutions'

'https://github.com/google/cost-attribution-solution'

'https://github.com/google/xssxss'

'https://github.com/google/zoom-to-inpaint'

'https://github.com/google/rescue-tools-reiserfs'

'https://github.com/google/wayback-machine-button'

'https://github.com/google/prog-edu-assistant-quizzes'

'https://github.com/google/uiimage-additions'

'https://github.com/google/hyperprotobench'

'https://github.com/google/render-timing-for-unity'

'https://github.com/google/image-compression'

'https://github.com/google/revisiting-self-supervised'

'https://github.com/google/grpc-java-bazel-minimal'

'https://github.com/google/deputy-api-python-client'

'https://github.com/google/ga4-gtm-utilities'

'https://github.com/google/ios-chatbot'

'https://github.com/google/chive-varying-prosody-icml-2019'

'https://github.com/google/audit-normalmap'

'https://github.com/google/turing-doodle'

'https://github.com/google/go-microservice-helpers'

'https://github.com/google/geovelo'

'https://github.com/google/orfconverter'

'https://github.com/google/hammer-kit'

'https://github.com/google/kiosk-app-reference-implementation'

'https://github.com/google/looker-studio-dashboard-cloner'

'https://github.com/google/dart-immutables'

'https://github.com/google/gke-cloud-dns-tls'

'https://github.com/google/sonic-midi'

'https://github.com/google/portrait-shadow-manipulation'

'https://github.com/google/dynamicworld'

'https://github.com/google/gsrsup'

'https://github.com/google/msft-on-gcp-code-samples'

'https://github.com/google/applied-machine-learning-intensive'

'https://github.com/google/aiyprojects-raspbian-tools'

'https://github.com/google/bocado'

'https://github.com/google/angular-sticky-element'

'https://github.com/google/actions-on-google-flashcards-template-sdk'

'https://github.com/google/dev-on-chromeos-che'

'https://github.com/google/hark'

'https://github.com/google/vim-codereview'

'https://github.com/google/app-resource-bundle'

'https://github.com/google/blockbuster'

'https://github.com/google/gtm-currency-rates-sync'

'https://github.com/google/forcefield'

'https://github.com/google/sqlcommenter-php'

'https://github.com/google/spirv-tutor'

'https://github.com/google/red'

'https://github.com/google/wasserstein-dist'

'https://github.com/google/octoprint-heatertimeout'

'https://github.com/google/dev-on-chromeos-openvpn'

'https://github.com/google/emoticon-composer-font'

'https://github.com/google/streaming_hdp'

'https://github.com/google/dualhttp'

'https://github.com/google/web-prototyping-tool'

'https://github.com/google/flutter_async_storage'

'https://github.com/google/ctap2-test-tool-corpus'

'https://github.com/google/quickshift'

'https://github.com/google/zhi'

'https://github.com/google/llvm-propeller'

'https://github.com/google/tadau'

'https://github.com/google/project_cartesian'

'https://github.com/google/experience-accessibility'

'https://github.com/google/ihmehimmeli'

'https://github.com/google/poseshield-tfjs'

'https://github.com/google/overcoming-conflicting-data'

'https://github.com/google/appspeedindex'

'https://github.com/google/dnae'

'https://github.com/google/image-supplemental-feed-creator'

'https://github.com/google/ehr-predictions'

'https://github.com/google/dev-on-chromeos-gce-setup'

'https://github.com/google/sqlcommenter-laravel-php'

'https://github.com/google/perforce-utils'

'https://github.com/google/channel-id-enclave'

'https://github.com/google/xdelta3-decoder-js'

'https://github.com/google/source_transformer.dart'

'https://github.com/google/zombies-on-steroids'

'https://github.com/google/nodejs-wiki'

'https://github.com/google/anthos-microk8s'

'https://github.com/google/wikiloop-wikidata-game'

'https://github.com/google/intellij-protocol-buffer-editor'

'https://github.com/google/older-mirrored-patches'

'https://github.com/google/prometheus-slo-burn-example'

'https://github.com/google/blue-green-deployment-controller'

'https://github.com/google/vertex-ai-benchmarker'

'https://github.com/google/dcm-trix-addon'

'https://github.com/google/realestate10k'

'https://github.com/google/realtime-help'

'https://github.com/google/cloud-function-edit-drive-permissions'

'https://github.com/google/gofountain'

'https://github.com/google/videotts'

'https://github.com/google/sa360-flightsfeed'

'https://github.com/google/ps_log'

'https://github.com/google/mimosa'

'https://github.com/google/amt-forensics'

'https://github.com/google/rbe-integration-test'

'https://github.com/google/indexable-pwa-samples'

'https://github.com/google/mcafp'

'https://github.com/google/speedy'

'https://github.com/google/gumbel_sinkhorn'

'https://github.com/google/dv360_feature_adoption'

'https://github.com/google/safevarargs'

'https://github.com/google/plusfish'

'https://github.com/google/touchtime'

'https://github.com/google/voter-info-tool'

'https://github.com/google/golden'

'https://github.com/google/cloud-reporting'

'https://github.com/google/tcp_killer'

'https://github.com/google/glassbox'

'https://github.com/google/repose'

'https://github.com/google/vulkan-pre-rotation-demo'

'https://github.com/google/analytics-audience-automation-tool'

'https://github.com/google/prerender-test'

'https://github.com/google/powered-caster-vehicle'

'https://github.com/google/applied-data-structures-algorithms'

'https://github.com/google/ngx_token_binding'

'https://github.com/google/impl_trait_utils'

'https://github.com/google/talos-dv360'

'https://github.com/google/text2text'

'https://github.com/google/bottlerocket'

'https://github.com/google/hierarchical-state-machine.dart'

'https://github.com/google/budoux-illustrator-script'

'https://github.com/google/ahdlc'

'https://github.com/google/mr4c'

'https://github.com/google/globalfoundries-pdk-libs-gf180mcu_fd_bd_sram'

'https://github.com/google/agata'

'https://github.com/google/flutter_minimal_store'

'https://github.com/google/rxcppuniq'

'https://github.com/google/gtd-txt'

'https://github.com/google/cluster-scoped-cicd'

'https://github.com/google/active-learning'

'https://github.com/google/lexical-masks'

'https://github.com/google/skywater-pdk-libs-sky90fd_fd_pr'

'https://github.com/google/ukip'

'https://github.com/google/xscreensaver-dbus'

'https://github.com/google/pyctr'

'https://github.com/google/quark'

'https://github.com/google/driblet'

'https://github.com/google/gcp-fwrule-open-domain'

'https://github.com/google/orderedcode'

'https://github.com/google/coursebuilder-android-container-module'

'https://github.com/google/zarathustra'

'https://github.com/google/openjdk-kerberos'

'https://github.com/google/jse4conf'

DatabaseError: (psycopg2.DatabaseError) could not receive data from server: Operation timed out
SSL SYSCALL error: Operation timed out

[SQL: 
                        SET SCHEMA 'augur_data';
                        SELECT r.repo_id,
                        r.repo_git,
                        prm.cntrb_id,
                        prm.pull_request_id
                        FROM
                        repo r, pull_request_meta prm
                        WHERE
                        prm.repo_id = 54541 AND
                        prm.repo_id = r.repo_id
                ]
(Background on this error at: https://sqlalche.me/e/20/4xp6)

    ## Nodes as projects edge as contributors
    #  In this section, we represent data differently and try out another graph representation where nodes represent the project repositories, and the edges are shared contributions between those projects

    print('contributor graph: Nodes as projects, edges as contributors')
    #    print(`contributorGraph` above is a dictionary where each key is a project repository, and the value is a list of **"connected"** project repositories and the number of 
    #**"shared connections"** between them. Let's explain **"connected"** repositories and shared "connections.")
    #**"shared connections"** constitute of *commits*, *pull requests*, *issues* and *pull request reviews* that are made by the same contributor.
    #We will call 2 project repositories **"connected"** if they have a **"shared connection"** between them. 
    #This means if they have a contributor who makes a *commit*, *pull request*, *issue*, or *pull request review* in both repositories, 
    # they count as a shared contributor, and the repositories are connected. 

    #structure of `contributorGraph` =  
       # {  
       # `repo1`: [(`repo2`, `PRs by same authors in repo 1 and repo 2`)],  
       # `repo2`: [(`repo4`, `PRs created by same authors in repo 1 and repo 4` ), (`repo5`, `PRs by same authors in repo 2 and repo 5`)]  
       # }

#We track the number of shared contributions between 2 repositories for creating this graph plot.
    contributorGraph = {}
    for i, row in df_melted_pr.iterrows():
        if row['cntrb_id'] not in contributorGraph:
            contributorGraph[row['cntrb_id']] = []
        if(row['number'] > 0):
            contributorGraph[row['cntrb_id']].append((row['repo_id'], row['number']))

    commonRepoContributionsByContributor = collections.defaultdict(int)
    for key in contributorGraph:
        if len(contributorGraph[key])-1 <= 0:
            continue
        for repoContributionIndex in range(len(contributorGraph[key])-1):
            commonRepoContributionsByContributor[(contributorGraph[key][repoContributionIndex][0], contributorGraph[key][repoContributionIndex+1][0])] += contributorGraph[key][repoContributionIndex][1]+contributorGraph[key][repoContributionIndex+1][1]
    res = []
    for key in commonRepoContributionsByContributor:
        res.append(tuple(str(k) for k in list(key)) + (commonRepoContributionsByContributor[key],))

    G = nx.Graph()
    G.add_weighted_edges_from(res)
    
    fig, ax = plt.subplots(figsize=(30,30))

    pos = nx.fruchterman_reingold_layout(G)
    #spos = nx.spring_layout(G, pos=pos, k=.4)
    anonygraph = nx.draw_networkx(G, node_color=colors, with_labels=False, pos=pos, alpha=0.2, font_size=7, ax=ax)
    filename = "2_repo_contributors_contributor_graph_ntile_" + str(i) +".png"
    plt.savefig(fname=filename, format="png")
    plt.show(nx.draw_networkx(G, node_color=colors, pos=spos, with_labels=False, alpha=0.2, font_size=7, ax=ax))
    nx.draw_networkx(G, node_color=colors, with_labels=False, pos=spos, alpha=0.2, font_size=7, ax=ax)   

    nx.draw_networkx(g, node_size=120, with_labels=False, font_size=14, ax=ax)
    print('Graph for Nodes:Projects, Edges: Contributors should have just printed')
    ct = datetime.datetime.now()
    print("current time:-", ct)


The above graph represents project repositories and how close or far they are to each other based on their degree of connected (number of shared contributions amongst them). If 2 nodes are close to each other, the 2 projects have a high number of shared contributions and vice versa. Each node in this graph has atleast one connection. We are not plotting lone projects in this graph as we want to identify project repositories in connection to existing known repositories.  
Note: this is not a complete (fully-connected) graph. All projects are not **"connected"** to each project. See above for the definition of **"connected"** 

## Conclusion

In this notebook, we created initial graph representations of existing open source GitHub repositories falling under a certain category using [NetworkX](https://networkx.org/). 

We used 2 type of graph representations:

- One where repositories and contributors both are both nodes (differently colored). Viewing which repositories share which set of contributors and analyzing their clusters can give an idea about how projects are connected to each other and to what degree 
- One where repositories are nodes, and edges are number of contributions. The distance between repositories, how close or far they are will depend on the number of shared contributions that exist between them.