# GitHub Metrics - Activity by Repo

This notebook queries the Augur DB for the information needed to compute the following metrics, which are derived from the GitHub Community Metrics working document: https://docs.google.com/document/d/1Yocr6fk0J8EsVZnJwoIl3kRQaI94tI-XHe7VSMFT0yM/edit?usp=sharing

Any computations needed to turn the queried data into a metric value will be added as the queries are determined.

In [19]:
import psycopg2
import pandas as pd 
import sqlalchemy as salc
import json
import os

with open("../config_temp.json") as config_file:
    config = json.load(config_file)

In [20]:
database_connection_string = 'postgresql+psycopg2://{}:{}@{}:{}/{}'.format(config['user'], config['password'], config['host'], config['port'], config['database'])

dbschema='augur_data'
engine = salc.create_engine(
    database_connection_string,
    connect_args={'options': '-csearch_path={}'.format(dbschema)})
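
The connection string above interpolates the credentials directly into the URL, so a password containing characters such as `@` or `/` would be parsed incorrectly. The following is a minimal sketch of one way to guard against that, assuming the same config keys as above. `build_connection_string` is a hypothetical helper name, and the sample values are made up for illustration.

```python
from urllib.parse import quote_plus

# Keys the notebook reads from config_temp.json.
REQUIRED_KEYS = ("user", "password", "host", "port", "database")


def build_connection_string(config):
    """Build a postgresql+psycopg2 URL with percent-encoded credentials."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"config is missing keys: {missing}")
    return "postgresql+psycopg2://{}:{}@{}:{}/{}".format(
        quote_plus(str(config["user"])),
        quote_plus(str(config["password"])),
        config["host"],
        config["port"],
        config["database"],
    )


# Made-up values for illustration only.
example_config = {
    "user": "augur",
    "password": "p@ss/word",
    "host": "localhost",
    "port": 5432,
    "database": "augur",
}
example_url = build_connection_string(example_config)
print(example_url)
```

The returned string could then be passed to `salc.create_engine` in place of `database_connection_string`.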

## Control Cell - Set Variables and Filters 

The cell below generates repo_ids from the repo names. For this to work, the repo must already be in the database. If you want to assign a repo_id manually, skip the cell below and read the comment in the next cell.

In [21]:
# Add the name(s) of the repo(s) you want to query (they must already be in the database)
repo_name_set = ['augur', 'grimoirelab']
repo_set = []

for repo_name in repo_name_set:
    repo_query = salc.sql.text(f"""
                 SET SCHEMA 'augur_data';
                 SELECT 
                    b.repo_id
                FROM
                    repo_groups a,
                    repo b
                WHERE
                    a.repo_group_id = b.repo_group_id AND
                    b.repo_name = \'{repo_name}\'
        """)

    t = engine.execute(repo_query)
    repo_id = t.mappings().all()[0].get('repo_id')
    repo_set.append(repo_id)
print(repo_set)

[25440, 25448]
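
The lookup above splices each repo name into the SQL text, and the `[0]` index raises an `IndexError` when a name is not in the database. The sketch below shows the same lookup with a bound parameter and explicit handling of missing names. To stay runnable without an Augur instance it uses the standard library's `sqlite3` with an in-memory table as a stand-in, which is an assumption of this example and not part of the notebook. With SQLAlchemy's `text()` the same idea is written with `:name` style bind parameters. `lookup_repo_ids` is a hypothetical helper name.

```python
import sqlite3

# Stand-in database: an in-memory SQLite table holding the two repo_ids
# printed above. The real notebook queries Augur's PostgreSQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repo (repo_id INTEGER, repo_name TEXT)")
conn.executemany(
    "INSERT INTO repo VALUES (?, ?)",
    [(25440, "augur"), (25448, "grimoirelab")],
)


def lookup_repo_ids(connection, repo_names):
    """Return (found_ids, missing_names) for the given repo names."""
    found, missing = [], []
    for name in repo_names:
        # The driver binds the value to "?", so the name is never
        # spliced into the SQL text itself.
        row = connection.execute(
            "SELECT repo_id FROM repo WHERE repo_name = ?", (name,)
        ).fetchone()
        if row is None:
            missing.append(name)
        else:
            found.append(row[0])
    return found, missing


repo_set_demo, not_found = lookup_repo_ids(
    conn, ["augur", "grimoirelab", "not-a-repo"]
)
print(repo_set_demo)
print(not_found)
```

Collecting the missing names, instead of failing on the first one, makes it easier to spot a typo in `repo_name_set` before running the longer queries.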


In [22]:
# Uncomment the line below if you want to assign repo_id number(s) manually
#repo_set = [25440]

## Query for commit analysis

Note that each row in this table represents a file changed in a commit, not a commit itself, so a single commit can span multiple rows.

In [23]:
df_commits = pd.DataFrame()

for repo_id in repo_set: 

    commit_query = salc.sql.text(f"""
                SELECT
                    r.repo_name,
                    c.cmt_commit_hash AS commits,
                    c.cmt_id AS file,
                    c.cmt_added AS lines_added,
                    c.cmt_removed AS lines_removed,
                    c.cmt_author_date AS date,
                    c.cmt_author_timestamp AS time_stamp
                FROM
                    repo r,
                    commits c
                WHERE
                    r.repo_id = c.repo_id AND
                    c.repo_id = \'{repo_id}\'
        """)
    df_current_repo = pd.read_sql(commit_query, con=engine)
    df_commits = pd.concat([df_commits, df_current_repo])

df_commits = df_commits.reset_index()
df_commits.drop("index", axis=1, inplace=True)
        
df_commits

Unnamed: 0,repo_name,commits,file,lines_added,lines_removed,date,time_stamp
0,augur,79ee6429a570b2982fab12a20dea670a858e1141,34605744,2,2,2021-03-24,2021-03-24 16:47:14+01:00
1,augur,79ee6429a570b2982fab12a20dea670a858e1141,34605745,1,1,2021-03-24,2021-03-24 16:47:14+01:00
2,augur,574901f5cd8372006cc64ff2b9f14f5470bba4e4,34605767,6,6,2021-03-19,2021-03-19 07:21:06+01:00
3,augur,7877b0c7937c6242e18ed712d5fb14323d13115f,34605778,2,2,2021-03-24,2021-03-24 16:06:07+01:00
4,augur,7877b0c7937c6242e18ed712d5fb14323d13115f,34605779,1,1,2021-03-24,2021-03-24 16:06:07+01:00
...,...,...,...,...,...,...,...
27673,grimoirelab,fe7ab96bdf7d0187737285af6dc3c08f5c422f14,34556304,6,2,2018-01-19,2018-01-19 01:01:41+01:00
27674,grimoirelab,ff18ee3474f147c6c3b99997ac79b612fd7a2adc,34556713,2,2,2018-06-04,2018-06-04 00:05:02+02:00
27675,grimoirelab,ff22f21630e80b67707db0551ca1787de843e403,34556365,1,1,2018-09-10,2018-09-10 21:02:37+02:00
27676,grimoirelab,ff7bd07014362d6b1b54a7c4de6e70e8412ee1d0,34557194,15,0,2019-04-01,2019-04-01 10:52:44+02:00
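
Because the table holds one row per file, counting rows overstates the number of commits. The sketch below shows one way to collapse such rows to one row per commit with a pandas `groupby`. It uses a small toy frame with the same column names as `df_commits`; the values are taken from the output above, with the hashes shortened for readability.

```python
import pandas as pd

# Toy rows mirroring the df_commits columns above: one row per file.
df_files = pd.DataFrame({
    "repo_name": ["augur", "augur", "augur", "grimoirelab"],
    "commits": ["79ee64", "79ee64", "574901", "fe7ab9"],
    "file": [34605744, 34605745, 34605767, 34556304],
    "lines_added": [2, 1, 6, 6],
    "lines_removed": [2, 1, 6, 2],
})

# Collapse to one row per commit: count the file rows and sum the lines.
df_per_commit = (
    df_files
    .groupby(["repo_name", "commits"], as_index=False)
    .agg(
        files_changed=("file", "count"),
        lines_added=("lines_added", "sum"),
        lines_removed=("lines_removed", "sum"),
    )
)

# Distinct commits per repo, as opposed to the raw row count.
commits_per_repo = df_files.groupby("repo_name")["commits"].nunique()

print(df_per_commit)
print(commits_per_repo)
```

The same two steps should apply to the full `df_commits` frame by replacing `df_files`.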


### Number of commits (by hour/day/week)
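
A possible starting point for this metric, sketched on toy rows copied from the `df_commits` output above (hashes shortened). It assumes the rows are first deduplicated to one per commit and that timestamps are normalized to UTC, since the real data mixes `+01:00` and `+02:00` offsets. The variable names are illustrative only.

```python
import pandas as pd

# Toy rows copied from the df_commits output above (one row per file).
df_demo = pd.DataFrame({
    "commits": ["79ee64", "79ee64", "574901", "7877b0", "7877b0"],
    "time_stamp": [
        "2021-03-24 16:47:14+01:00",
        "2021-03-24 16:47:14+01:00",
        "2021-03-19 07:21:06+01:00",
        "2021-03-24 16:06:07+01:00",
        "2021-03-24 16:06:07+01:00",
    ],
})

# Keep one row per commit and normalize the timestamps to UTC.
commit_times = df_demo.drop_duplicates(subset="commits").copy()
commit_times["time_stamp"] = pd.to_datetime(commit_times["time_stamp"], utc=True)

# Resampling needs a datetime index; sort it so the bins are in order.
indexed = commit_times.set_index("time_stamp").sort_index()
commits_per_day = indexed["commits"].resample("D").count()
commits_per_week = indexed["commits"].resample("W").count()

# Commits grouped by hour of day (UTC), regardless of date.
commits_by_hour = commit_times.groupby(
    commit_times["time_stamp"].dt.hour
)["commits"].count()

print(commits_per_day)
print(commits_per_week)
print(commits_by_hour)
```

Days and weeks with no commits appear as zero counts in the resampled series, which keeps gaps visible when the result is plotted.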

### Number of lines per commit

### Number of lines/commits/file 

## Query for issues analysis
 

In [26]:
df_issues = pd.DataFrame()

for repo_id in repo_set: 

    issue_query = salc.sql.text(f"""
                SELECT
                    r.repo_name,
                    i.issue_id AS issue,
                    i.gh_issue_number AS issue_number,
                    i.gh_issue_id AS gh_issue,
                    i.created_at AS created,
                    i.closed_at AS closed
                FROM
                    repo r,
                    issues i
                WHERE
                    r.repo_id = i.repo_id AND
                    i.repo_id = \'{repo_id}\'
        """)
    df_current_repo = pd.read_sql(issue_query, con=engine)
    df_issues = pd.concat([df_issues, df_current_repo])

df_issues = df_issues.reset_index()
df_issues.drop("index", axis=1, inplace=True)
        
df_issues

Unnamed: 0,repo_name,issue,issue_number,gh_issue,created,closed
0,augur,340115,28,213149529,2017-03-09 20:06:18,2017-04-07 21:18:01
1,augur,343231,886,682259157,2020-08-20 00:09:30,2020-08-20 00:16:50
2,augur,343216,880,679627659,2020-08-15 19:11:45,2020-08-17 14:30:04
3,augur,343467,967,724668885,2020-10-19 14:21:08,2020-10-19 14:21:34
4,augur,342738,740,628534692,2020-06-01 15:34:33,2020-08-20 10:48:14
...,...,...,...,...,...,...
1855,grimoirelab,735294,437,941801983,2021-07-12 08:23:29,2021-07-28 08:58:49
1856,grimoirelab,735295,436,924259145,2021-06-17 19:24:54,NaT
1857,grimoirelab,340606,284,559853733,2020-02-04 17:00:31,NaT
1858,grimoirelab,734649,429,889819068,2021-05-12 08:28:28,NaT


### Number of issues created (by hour/day/week)

### Number of issues closed (by hour/day/week)

### Number of issues open (by hour/day/week)
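
One way to approach the open-issue count is to take the cumulative number of issues created minus the cumulative number closed at the end of each period. The sketch below does this per week on toy rows taken from the `df_issues` output above. The weekly granularity and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Toy rows with the same columns as df_issues; the last issue is still open.
df_demo = pd.DataFrame({
    "issue": [343231, 343216, 343467, 735295],
    "created": pd.to_datetime([
        "2020-08-20 00:09:30", "2020-08-15 19:11:45",
        "2020-10-19 14:21:08", "2021-06-17 19:24:54",
    ]),
    "closed": pd.to_datetime([
        "2020-08-20 00:16:50", "2020-08-17 14:30:04",
        "2020-10-19 14:21:34", None,
    ]),
})

# Issues created per week.
created_per_week = (
    df_demo.set_index("created").sort_index()["issue"].resample("W").count()
)
# Issues closed per week; rows without a close date are dropped first.
closed_per_week = (
    df_demo.dropna(subset=["closed"])
    .set_index("closed").sort_index()["issue"].resample("W").count()
)

# Align both series on one weekly index; weeks missing from either are 0.
weekly = (
    pd.concat(
        [created_per_week.rename("created"), closed_per_week.rename("closed")],
        axis=1,
    )
    .sort_index()
    .fillna(0)
    .astype(int)
)
# Open at the end of each week = cumulative created - cumulative closed.
weekly["open"] = weekly["created"].cumsum() - weekly["closed"].cumsum()

print(weekly[weekly[["created", "closed"]].any(axis=1)])
```

The `created` and `closed` columns of `weekly` also cover the two preceding metrics, so a single frame could serve all three issue counts.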

## Query for Pull Request Analysis

In [None]:
df_pr = pd.DataFrame()

for repo_id in repo_set: 

    # Assumes Augur's pull_requests table and its pr_* column names;
    # check them against your schema before running.
    pr_query = salc.sql.text(f"""
                SELECT
                    r.repo_name,
                    pr.pull_request_id AS pull_request,
                    pr.pr_src_number AS pr_number,
                    pr.pr_src_id AS gh_pr,
                    pr.pr_created_at AS created,
                    pr.pr_closed_at AS closed
                FROM
                    repo r,
                    pull_requests pr
                WHERE
                    r.repo_id = pr.repo_id AND
                    r.repo_id = \'{repo_id}\'
        """)
    df_current_repo = pd.read_sql(pr_query, con=engine)
    df_pr = pd.concat([df_pr, df_current_repo])

df_pr = df_pr.reset_index()
df_pr.drop("index", axis=1, inplace=True)
        
df_pr

### Number of PRs created (by hour/day/week)

### Number of PRs closed (by hour/day/week)

### Number of PRs open (by hour/day/week)

### Number of Reviews started/closed (by hour/day/week)

### Number of repositories

### Number of assignees

### Commits 