# Graphical Analysis of GitHub Repositories and Contributors

In this notebook, we programmatically view the connections between open source projects, determine project clusters, and map out technology ecosystems. We explore the Augur GitHub data to view relationships between open source projects and communities by studying graphs based on relations such as common contributors and project activities between different GitHub repositories.

## Connect to Augur database

Until the Operate First environment can connect to the DB, use a config file to access it. Do not push the config file to the GitHub repo.

In [25]:
import psycopg2
import pandas as pd
import collections
from functools import reduce

import sqlalchemy as salc
import json
import os
import networkx as nx
import matplotlib.pyplot as plt

with open("../../../config.json") as config_file:
    config = json.load(config_file)

In [26]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})
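
If the password in the config file contains characters such as `@` or `/`, the formatted connection string above may be parsed incorrectly. Below is a minimal sketch of one way to guard against that, using only the standard library. The helper name `build_connection_string` and the sample values are hypothetical.

```python
from urllib.parse import quote_plus

REQUIRED_KEYS = ("user", "password", "host", "port", "database")


def build_connection_string(config):
    """Build a postgresql+psycopg2 URL from a config dict (hypothetical helper)."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    # Percent-encode the credentials so reserved characters do not break the URL
    user = quote_plus(str(config["user"]))
    password = quote_plus(str(config["password"]))
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        user, password, config["host"], config["port"], config["database"]
    )


# Made-up values for illustration only
example_config = {
    "user": "augur",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "augur",
}
connection_string = build_connection_string(example_config)
print(connection_string)
```

The resulting string can be passed to `salc.create_engine` in the same way as `database_connection_string`.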

## Retrieve Available Repositories

In [50]:
# Subset repositories based on a category
# Selecting repositories that belong to the 'science' user group in Augur
repo_git_set = []
repo_name_set = []
science_repo_sql = salc.sql.text(f"""
                 SET SCHEMA 'augur_data';
                    --science 
                    select repo_git from repo a, 
                    (
                    SELECT 
                    	C.repo_id
                    FROM
                    	augur_operations.users A,
                    	augur_operations.user_groups b,
                    	augur_operations.user_repos C 
                    WHERE
                    	A.user_id = b.user_id 
                    	AND b.group_id = C.group_id 
                    	AND b.name='science'
                    	--AND lower(A.login_name)='numfocus'
                    ORDER BY
                    	A.login_name,
                    	b.group_id) b 
                    	where a.repo_id = b.repo_id; 
                            """)

# Run the query and load the result into a dataframe
with engine.connect() as conn:
    results = conn.execute(science_repo_sql)
    df_results = pd.DataFrame(results)

print(df_results)

                                              repo_git
0             https://github.com/iqss/dataverse-people
1                 https://github.com/rstudio/shinyapps
2        https://github.com/rstudio/lucid-kube-fledged
3         https://github.com/rstudio/spark.rstudio.com
4                  https://github.com/yulab-smu/scplot
...                                                ...
1914               https://github.com/ropensci/spatsoc
1915        https://github.com/jupyterhub/batchspawner
1916      https://github.com/bioconductor/bioc2015dday
1917    https://github.com/bioconductor/splicinggraphs
1918  https://github.com/bioconductor/hcamatrixbrowser

[1919 rows x 1 columns]


In [52]:
#for row in df_results: 
#    print(results)
#    #results = results.mappings().all()[0]
#    repo_git = df_results['repo_git']
#    repo_git_set.append(repo_git)

#print(repo_git_set)


In [71]:
# Look up the repo_id and repo_name for each repository URL retrieved above.
# The list is named `repo_set` because the cells below iterate over `repo_set`.
repo_set = []

for index, row in df_results.iterrows():
    trepo_git = row["repo_git"]
    repo_query = salc.sql.text(f"""
                 SET SCHEMA 'augur_data';
                 SELECT 
                    b.repo_id,
                    b.repo_name
                FROM
                    repo_groups a,
                    repo b
                WHERE
                    a.repo_group_id = b.repo_group_id AND
                    b.repo_git = '{trepo_git}'
        """)

    with engine.connect() as conn:
        results = conn.execute(repo_query)
        df2_results = pd.DataFrame(results)

    # Skip URLs with no matching row
    if df2_results.empty:
        continue
    # Store scalar values rather than whole Series objects
    repo_set.extend(df2_results['repo_id'].tolist())
    repo_name_set.extend(df2_results['repo_name'].tolist())

KeyboardInterrupt: 
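
Issuing one query per repository means roughly 1,900 round trips for this group, which is likely why the cell above was interrupted. Below is a sketch of fetching all IDs in a single query instead. It runs against an in-memory SQLite table so that it works without the Augur database; the rows are made up, and only the `repo` column names come from the queries above.

```python
import sqlite3

# Stand-in for the Augur `repo` table (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo (repo_id INTEGER, repo_name TEXT, repo_git TEXT)")
conn.executemany(
    "INSERT INTO repo VALUES (?, ?, ?)",
    [
        (1, "shinyapps", "https://github.com/rstudio/shinyapps"),
        (2, "thematic", "https://github.com/rstudio/thematic"),
        (3, "other", "https://github.com/example/other"),
    ],
)

wanted_urls = [
    "https://github.com/rstudio/shinyapps",
    "https://github.com/rstudio/thematic",
]

# One placeholder per URL, so the values are bound rather than formatted into the SQL
placeholders = ", ".join("?" for _ in wanted_urls)
rows = conn.execute(
    f"SELECT repo_id, repo_name FROM repo "
    f"WHERE repo_git IN ({placeholders}) ORDER BY repo_id",
    wanted_urls,
).fetchall()

repo_ids = [repo_id for repo_id, _ in rows]
repo_names = [repo_name for _, repo_name in rows]
print(repo_ids, repo_names)
conn.close()
```

Against Augur itself, the same effect could be had by selecting `repo_id` and `repo_name` alongside `repo_git` in the first query, since that query already joins on `repo_id`.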

### Retrieve Issue Contributors

We will now fetch all Issue contributors for various repositories.

In [None]:
issue_contrib = pd.DataFrame()
for repo_id in repo_set:
    repo_query = salc.sql.text(f"""
                SET SCHEMA 'augur_data';
                SELECT r.repo_id,
                r.repo_git,
                i.cntrb_id,
                i.issue_id
                FROM
                repo r, issues i
                 WHERE
                i.repo_id = \'{repo_id}\' AND
                i.repo_id = r.repo_id
        """)
    df_current_repo = pd.read_sql(repo_query, con=engine)
    issue_contrib = pd.concat([issue_contrib, df_current_repo])

issue_contrib = issue_contrib.reset_index()
issue_contrib.drop("index", axis=1, inplace=True)
issue_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'issue_id']
display(issue_contrib)
issue_contrib.dtypes

### Retrieve PR Contributors

We will now fetch all the PR contributors for various repositories.

In [None]:
pr_contrib = pd.DataFrame()

for repo_id in repo_set:
    repo_query = salc.sql.text(f"""
                SET SCHEMA 'augur_data';
                SELECT r.repo_id,
                r.repo_git,
                prm.cntrb_id,
                prm.pull_request_id
                FROM
                repo r, pull_request_meta prm
                WHERE
                prm.repo_id = \'{repo_id}\' AND
                prm.repo_id = r.repo_id
                LIMIT 50000
        """)
    df_current_repo = pd.read_sql(repo_query, con=engine)
    pr_contrib = pd.concat([pr_contrib, df_current_repo])

pr_contrib = pr_contrib.reset_index()
pr_contrib.drop("index", axis=1, inplace=True)
pr_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'pull_request_id']
display(pr_contrib)
pr_contrib.dtypes

### Retrieve PR Reviewers

We will now fetch all the PR Reviewers for various repositories.

In [None]:
prr_contrib = pd.DataFrame()

for repo_id in repo_set:
    repo_query = salc.sql.text(f"""
                SET SCHEMA 'augur_data';
                SELECT r.repo_id,
                r.repo_git,
                prr.cntrb_id,
                prr.pull_request_id
                FROM
                repo r, pull_request_reviews prr
                WHERE
                prr.repo_id = \'{repo_id}\' AND
                prr.repo_id = r.repo_id
        """)
    df_current_repo = pd.read_sql(repo_query, con=engine)
    prr_contrib = pd.concat([prr_contrib, df_current_repo])

prr_contrib = prr_contrib.reset_index()
prr_contrib.drop("index", axis=1, inplace=True)
prr_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'pull_request_id']
display(prr_contrib)
prr_contrib.dtypes

### Retrieve Commit Contributors

We will now fetch all the Commit contributors for various repositories.

In [None]:
commit_contrib = pd.DataFrame()

for repo_id in repo_set:
    repo_query = salc.sql.text(f"""
                SET SCHEMA 'augur_data';
                SELECT r.repo_id,
                r.repo_git,
                ca.cntrb_id,
                c.cmt_id
                FROM
                repo r, commits c, contributors_aliases ca
                WHERE
                c.repo_id = \'{repo_id}\' AND
                c.repo_id = r.repo_id and
                c.cmt_committer_email = ca.alias_email
        """)
    df_current_repo = pd.read_sql(repo_query, con=engine)
    commit_contrib = pd.concat([commit_contrib, df_current_repo])

commit_contrib = commit_contrib.reset_index()
commit_contrib.drop("index", axis=1, inplace=True)
commit_contrib.columns =['repo_id', 'repo_git', 'cntrb_id', 'cmt_id']
display(commit_contrib)
commit_contrib.dtypes

## Projects and Contributors as Nodes

In this section, we plot projects and contributors on the same graph as nodes and color them differently to see the relationships between them.

### Commit Contributor Graph

In [None]:
df_commit = commit_contrib.groupby(['repo_id', 'cntrb_id']).size().unstack(fill_value=0)
df_commit.head()

In the above dataframe `df_commit`, each row represents a repository ID and each column represents a contributor. The dataframe contains counts of the number of times a contributor has contributed to a particular repository; here, each contribution is a commit. A value of 0 means that the contributor has made no commits to the repository, and a value of x means that the contributor has made x commits to the repository.
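
To make the shape of this table concrete, the same counting can be sketched in plain Python on a few made-up rows. The repository IDs and contributor names below are illustrative, not Augur data.

```python
from collections import Counter

# Each tuple stands for one commit row: (repo_id, cntrb_id)
commit_rows = [
    (1, "alice"), (1, "alice"), (1, "bob"),
    (2, "bob"),
]

# Count commits per (repository, contributor) pair
pair_counts = Counter(commit_rows)

repos = sorted({repo for repo, _ in commit_rows})
contributors = sorted({cntrb for _, cntrb in commit_rows})

# Wide table: one row per repository, one column per contributor, 0 where absent
wide = {
    repo: {cntrb: pair_counts.get((repo, cntrb), 0) for cntrb in contributors}
    for repo in repos
}
print(wide)
```

`groupby(['repo_id', 'cntrb_id']).size().unstack(fill_value=0)` produces the same layout as `wide`, with repositories as the index and contributors as the columns.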

In [None]:
df_commit = df_commit.reset_index()

In [None]:
df_commit.head()

In [None]:
deps_df = pd.DataFrame()


deps_query = salc.sql.text(f"""
            SET SCHEMA 'augur_data';
            SELECT
            	repo_id,
            	dep_name,
            	number 
            FROM
            	(
            	SELECT
            		augur_data.repo_dependencies.dep_name,
            		augur_data.repo_dependencies.repo_id,
            		COUNT ( * ) AS number 
            	FROM
            		augur_data.repo_dependencies 
            	GROUP BY
            		augur_data.repo_dependencies.dep_name,
            		augur_data.repo_dependencies.repo_id 
            	ORDER BY
            		number DESC 
            	) A 
            WHERE
            	dep_name IN ( 'flask', 'requests', 'logging' ) 
            ORDER BY
            	repo_id;
    """)
deps_df = pd.read_sql(deps_query, con=engine)

df_deps = deps_df.groupby(['repo_id', 'dep_name']).size().unstack(fill_value=0)


display(deps_df)
display(df_deps)
deps_df.dtypes
df_deps.dtypes

df_deps = df_deps.reset_index()

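df_melted_deps = df_deps.melt(
    ['repo_id'],
     var_name='dep_name', value_name='number') 

# Drop repo/dependency pairs that never occur so they do not become edges
df_melted_deps = df_melted_deps[df_melted_deps['number'] != 0]

print(df_melted_deps)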

G = nx.from_pandas_edgelist(df_melted_deps, 
                            source='dep_name',
                            target='repo_id',
                            edge_attr='number',
                            create_using=nx.MultiGraph())

nodes = G.nodes()


Repo_id = df_melted_deps['repo_id'].to_list()
dep_name = df_melted_deps['dep_name'].to_list()
colors = ['red' if n in Repo_id else 'yellow' for n in nodes]

fig, ax = plt.subplots(figsize=(20,20))
#yellow_patch = mpatches.Patch(color='yellow', label='Contributor')
#blue_patch = mpatches.Patch(color='blue', label='Repository')
#ax.legend(handles=[yellow_patch, blue_patch])
nx.draw_networkx(G, node_color=colors, font_size=8, ax=ax)

In [None]:
df_melted_commit = df_commit.melt(
    ['repo_id'],
    var_name = 'cntrb_id',value_name='number')

In [None]:
df_melted_commit = df_melted_commit[df_melted_commit[df_melted_commit.columns[2]] != 0]
df_melted_commit.head()

In `df_melted_commit` we unpivot the contributor ID columns into rows. Each row is a combination of a unique repository and a unique contributor, and `number` is the number of contributions the contributor has made to that repository.

In [None]:
G = nx.from_pandas_edgelist(df_melted_commit, 
                            source='repo_id',
                            target='cntrb_id',
                            edge_attr='number',
                            create_using=nx.MultiGraph())

In [None]:
nodes = G.nodes()

In [None]:
Repo_id = df_melted_commit['repo_id'].to_list()
contributor_id = df_melted_commit['cntrb_id'].to_list()
colors = ['blue' if n in Repo_id else 'yellow' for n in nodes]

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
#yellow_patch = mpatches.Patch(color='yellow', label='Contributor')
#blue_patch = mpatches.Patch(color='blue', label='Repository')
#ax.legend(handles=[yellow_patch, blue_patch])
nx.draw_networkx(G, node_color=colors, font_size=8, ax=ax)

What we see above is a set of repositories and their contributors plotted on the same graph. The blue dots represent project repositories and the yellow dots represent their contributors. This gives us an idea of the central projects, which have a large number of contributors, and of how other projects are connected to them. However, given the number of repositories, this graph is hard to dig into, so let's subset it to create a smaller plot.

In [None]:
#subsetting the first 50 rows (repo-contributor pairs) for a smaller plot
smaller_df_melted_commit = df_melted_commit[0:50]

Here, we narrow the data down to the first 50 repository-contributor pairs (edges) so that the plot is easier to read. Note that this is only for visual simplicity. It is not a logical filter, so not all contributors for a project will appear on the plot.
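
A logical subset is also possible. One option is to keep every edge of the repositories with the most distinct contributors. The sketch below uses made-up `(repo_id, cntrb_id, number)` rows rather than the real dataframe, and the function name `top_repo_edges` is hypothetical.

```python
from collections import defaultdict


def top_repo_edges(edges, n_repos):
    """Keep all edges of the n_repos repositories with the most contributors."""
    contributors_by_repo = defaultdict(set)
    for repo_id, cntrb_id, _ in edges:
        contributors_by_repo[repo_id].add(cntrb_id)
    # Rank repositories by distinct contributor count, breaking ties by repo_id
    ranked = sorted(
        contributors_by_repo,
        key=lambda repo_id: (-len(contributors_by_repo[repo_id]), repo_id),
    )
    keep = set(ranked[:n_repos])
    return [edge for edge in edges if edge[0] in keep]


edges = [
    (1, "alice", 4), (1, "bob", 1), (1, "carol", 2),
    (2, "bob", 7),
    (3, "alice", 1), (3, "dave", 3),
]
subset = top_repo_edges(edges, n_repos=2)
print(subset)
```

Unlike a positional slice, this keeps the full contributor list for each repository that remains in the plot.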

In [None]:
G = nx.from_pandas_edgelist(smaller_df_melted_commit, 
                            source='repo_id',
                            target='cntrb_id',
                            edge_attr='number',
                            create_using=nx.MultiGraph())

In [None]:
nodes = G.nodes()

In [None]:
Repo_id = smaller_df_melted_commit['repo_id'].to_list()
contributor_id = smaller_df_melted_commit['cntrb_id'].to_list()
colors = ['blue' if n in Repo_id else 'yellow' for n in nodes]

In [None]:
fig, ax = plt.subplots(figsize=(15,15))
#ax.legend(handles=[yellow_patch, blue_patch])
nx.draw_networkx(G, node_color=colors, font_size=8, ax=ax)

### Issue Contributor Graph

We now plot similar graphs for issue contributions.

In [None]:
df_issue = issue_contrib.groupby(['repo_id', 'cntrb_id']).size().unstack(fill_value=0)
df_issue.head()

In [None]:
df_issue = df_issue.reset_index()

In [None]:
df_issue.head()

In [None]:
df_melted_issue = df_issue.melt(
    ['repo_id'],
    var_name = 'cntrb_id',value_name='number')

In [None]:
df_melted_issue = df_melted_issue[df_melted_issue[df_melted_issue.columns[2]] != 0]
df_melted_issue.head()

In [None]:
Repo_id = df_melted_issue['repo_id'].to_list()
contributor_id = df_melted_issue['cntrb_id'].to_list()

In [None]:
G = nx.from_pandas_edgelist(df_melted_issue, 
                            source='repo_id',
                            target='cntrb_id',
                            edge_attr='number',
                            create_using=nx.MultiGraph())

In [None]:
nodes = G.nodes()

In [None]:
colors = ['blue' if n in Repo_id else 'yellow' for n in nodes]

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
#ax.legend(handles=[yellow_patch, blue_patch])
nx.draw_networkx(G, node_color=colors, font_size=8, ax=ax)

### PR Contributor Graph

We now plot similar graphs for pull request contributors.

In [None]:
df_pr = pr_contrib.groupby(['repo_id', 'cntrb_id']).size().unstack(fill_value=0)
df_pr.head()

In [None]:
df_pr = df_pr.reset_index()

In [None]:
df_melted_pr = df_pr.melt(
    ['repo_id'],
    var_name = 'cntrb_id',value_name='number')

In [None]:
df_melted_pr = df_melted_pr[df_melted_pr[df_melted_pr.columns[2]] != 0]
df_melted_pr.head()

In [None]:
Repo_id = df_melted_pr['repo_id'].to_list()
contributor_id = df_melted_pr['cntrb_id'].to_list()

In [None]:
G = nx.from_pandas_edgelist(df_melted_pr, 
                            source='repo_id',
                            target='cntrb_id',
                            edge_attr='number',
                            create_using=nx.MultiGraph())

In [None]:
nodes = G.nodes()

In [None]:
colors = ['blue' if n in Repo_id else 'yellow' for n in nodes]

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
#ax.legend(handles=[yellow_patch, blue_patch])
nx.draw_networkx(G, node_color=colors, font_size=8, ax=ax)

## Projects as Nodes, Contributors as Edges

In this section, we try another graph representation in which the project repositories are the nodes and the edges are the shared contributions between those projects.

Let's pick the **Pull Request** type of contribution as an example for these graph plots.

In [None]:
df_melted_pr.head()

In [None]:
contributorGraph = {}
for i, row in df_melted_pr.iterrows():
    if row['cntrb_id'] not in contributorGraph:
        contributorGraph[row['cntrb_id']] = []
    if(row['number'] > 0):
        contributorGraph[row['cntrb_id']].append((row['repo_id'], row['number']))

In [None]:
list(contributorGraph.items())[:10]

`contributorGraph` above is a dictionary where each key is a contributor ID, and the value is a list of the project repositories that contributor has opened pull requests in, together with the number of pull requests in each. We use it to find **"connected"** project repositories and the **"shared connections"** between them. Both terms are defined after the structure.

structure of `contributorGraph` =  
{  
`contributor1`: [(`repo1`, `PRs by contributor 1 in repo 1`), (`repo2`, `PRs by contributor 1 in repo 2`)],  
 `contributor2`: [(`repo2`, `PRs by contributor 2 in repo 2`), (`repo4`, `PRs by contributor 2 in repo 4`)]  
}

**"shared connections"** consist of *commits*, *pull requests*, *issues* and *pull request reviews* that are made by the same contributor.
We will call 2 project repositories **"connected"** if they have a **"shared connection"** between them. 
This means that if a contributor makes a *commit*, *pull request*, *issue* or *pull request review* in both repositories, that contributor counts as shared and the repositories are connected. 

We track the number of shared contributions between 2 repositories for creating this graph plot.

In [None]:
commonRepoContributionsByContributor = collections.defaultdict(int)
for key in contributorGraph:
    if len(contributorGraph[key])-1 <= 0:
        continue
    for repoContributionIndex in range(len(contributorGraph[key])-1):
        commonRepoContributionsByContributor[(contributorGraph[key][repoContributionIndex][0], contributorGraph[key][repoContributionIndex+1][0])] += contributorGraph[key][repoContributionIndex][1]+contributorGraph[key][repoContributionIndex+1][1]
print(commonRepoContributionsByContributor)

`commonRepoContributionsByContributor` is a dictionary keyed by repository pairs. Each value is the total number of contributions made to the two repositories by the contributors they share.

structure of `commonRepoContributionsByContributor` =  
{  
(`repo1, repo2`): `PRs by same authors in repo 1 and repo 2`,  
(`repo2, repo4`): `PRs by same authors in repo 2 and repo 4`,  
(`repo2, repo5`): `PRs by same authors in repo 2 and repo 5`,   
}
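
The loop above pairs each repository only with the next one in a contributor's list, so a contributor active in repositories A, B and C links A-B and B-C but not A-C. If every pair should count as connected, as the definition of **"connected"** suggests, one possible variant uses `itertools.combinations`. This is a sketch on made-up data, not the notebook's method.

```python
import collections
import itertools

# Same shape as contributorGraph: contributor -> [(repo_id, number of PRs), ...]
contributor_graph = {
    "alice": [(1, 2), (2, 1), (3, 4)],
    "bob": [(1, 5), (2, 3)],
    "carol": [(3, 1)],  # single repository, so no pairs
}

shared = collections.defaultdict(int)
for repo_counts in contributor_graph.values():
    # Sort so each pair is always keyed as (smaller repo_id, larger repo_id)
    for (repo_a, n_a), (repo_b, n_b) in itertools.combinations(sorted(repo_counts), 2):
        shared[(repo_a, repo_b)] += n_a + n_b

print(dict(shared))
```

The resulting dictionary has the same structure as `commonRepoContributionsByContributor`, so the later cells that build `res` and the graph would work on it unchanged.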

In [None]:
res = []
for key in commonRepoContributionsByContributor:
    res.append(tuple(str(k) for k in list(key)) + (commonRepoContributionsByContributor[key],))

For plotting the graph below, we pick the repositories as the nodes and let the shared contributions dictate the edge weights.

In [None]:
g = nx.Graph()
g.add_weighted_edges_from(res)

In [None]:
fig, ax = plt.subplots(figsize=(20,20))
nx.draw_networkx(g, node_size=120, font_size=14, ax=ax)


The above graph represents project repositories and how close or far they are to each other based on their degree of connection (the number of shared contributions between them). If 2 nodes are close to each other, the 2 projects have a high number of shared contributions, and vice versa. Each node in this graph has at least one connection. We are not plotting lone projects in this graph, as we want to identify project repositories in connection to existing known repositories.  
Note: this is not a complete (fully connected) graph. Not every project is **"connected"** to every other project. See above for the definition of **"connected"**.
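
One way to put numbers on how central a project is, without relying on the layout, is to sum the edge weights that touch each repository. Below is a minimal sketch on made-up `(repo, repo, weight)` triples shaped like `res`.

```python
from collections import defaultdict

# Same shape as `res`: (repo_a, repo_b, shared contributions); values are made up
weighted_edges = [("1", "2", 11), ("1", "3", 6), ("2", "3", 5), ("3", "4", 2)]

strength = defaultdict(int)
for repo_a, repo_b, weight in weighted_edges:
    # Every edge contributes its weight to both endpoints
    strength[repo_a] += weight
    strength[repo_b] += weight

# Repositories ordered from most to least shared contributions
ranking = sorted(strength.items(), key=lambda item: (-item[1], item[0]))
print(ranking)
```

On the NetworkX graph itself, `g.degree(weight='weight')` reports the same quantity for each node.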

## Conclusion

In this notebook, we created initial graph representations of existing open source GitHub repositories falling under a certain category using [NetworkX](https://networkx.org/). 

We used 2 types of graph representation:

- One where repositories and contributors are both nodes (colored differently). Viewing which repositories share which contributors, and analyzing their clusters, gives an idea of how projects are connected to each other and to what degree.
- One where repositories are nodes and edges are weighted by the number of shared contributions. How close or far apart two repositories are depends on the number of shared contributions between them.