# GitHub Repository Metric Analysis

In this notebook, we fetch the GitHub Issue/PR data for the repositories listed in [sigs.yaml](https://github.com/open-services-group/community/blob/main/sigs.yaml) using the [MI tool](https://github.com/thoth-station/mi), pre-process the raw data into suitable data frames, and store them as parquet files in an S3 bucket. We also create [Trino](https://trino.io/) tables for the generated parquet files so that we can later build dashboards in [Superset](https://superset.operate-first.cloud/).

This notebook serves as a template for analyzing different GitHub repositories, so that it can be executed in automation as part of our metrics processing pipeline. It can be run in parallel for different repositories by passing in the GitHub repository for which metrics should be calculated.

(Related issues: [Issue 1](https://github.com/open-services-group/metrics/issues/19))
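
Because the notebook reads `ORG` and `REPO` from the environment, one way to prepare parallel runs is to build a separate environment per repository. The sketch below is only an illustration under that assumption: the helper name `build_job_envs` and the repository list are hypothetical, and the source does not say which runner the pipeline uses to execute the notebook.

```python
import os


def build_job_envs(repo_slugs, base_env=None):
    """Return one environment mapping per "org/repo" slug.

    Each mapping sets the ORG and REPO variables that this notebook reads
    with os.getenv, on top of a copy of the base environment.
    """
    base = dict(os.environ if base_env is None else base_env)
    envs = []
    for slug in repo_slugs:
        org, sep, repo = slug.partition("/")
        # Reject anything that is not exactly "org/repo"
        if not (org and sep and repo) or "/" in repo:
            raise ValueError(f"expected 'org/repo', got {slug!r}")
        envs.append({**base, "ORG": org, "REPO": repo})
    return envs


# Illustrative repository list (not taken from sigs.yaml)
jobs = build_job_envs(
    ["os-climate/aicoe-osc-demo", "thoth-station/mi"], base_env={}
)
print([(env["ORG"], env["REPO"]) for env in jobs])
```

Each mapping could then be handed to whatever process launches the notebook, for example as the `env=` argument of `subprocess.run`. Variables set this way should take precedence over a local `.env` file, since `load_dotenv` does not override already-set variables unless `override=True` is passed.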

In [1]:
import os
from dotenv import find_dotenv, load_dotenv
import warnings
import trino
from s3_communication import S3Communication

warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

True

In [2]:
# Create a .env file on your local machine with the correct configuration values
REPO = os.getenv("REPO")
ORG = os.getenv("ORG")
GITHUB_ACCESS_TOKEN = os.getenv("GITHUB_ACCESS_TOKEN")
s3_endpoint_url = os.getenv("S3_ENDPOINT")
aws_access_key_id = os.getenv("S3_ACCESS_KEY")
aws_secret_access_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
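
`os.getenv` returns `None` for unset variables, so a missing entry in `.env` only surfaces later as a less obvious error. A minimal sketch of an early check follows; the helper `missing_settings` is hypothetical, and the variable list simply mirrors the names read in the cell above.

```python
import os

# Names read by the configuration cell above
REQUIRED_VARS = [
    "REPO",
    "ORG",
    "GITHUB_ACCESS_TOKEN",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
    "S3_BUCKET",
]


def missing_settings(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Example with a deliberately incomplete environment
example_env = {"REPO": "aicoe-osc-demo", "ORG": "os-climate"}
print(missing_settings(example_env))
```

In the notebook, calling `missing_settings()` without arguments and raising an error when the list is non-empty would stop the run before any data is fetched. The Trino cells further down read `TRINO_USER`, `TRINO_PASSWD`, `TRINO_HOST`, and `TRINO_PORT`, which could be added to the list in the same way.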

In [3]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url, aws_access_key_id, aws_secret_access_key, s3_bucket
)

In [4]:
repo_slug = f"{ORG}/{REPO}"
repo_slug

'os-climate/aicoe-osc-demo'

In [5]:
# Note: The GitHub access token needs to be exported before importing the srcopsmetrics package (current bug)
from srcopsmetrics.entities.issue import Issue  # noqa: E402
from srcopsmetrics.entities.pull_request import PullRequest  # noqa: E402

In [6]:
# Gather the data
!python -m srcopsmetrics.cli -clr $repo_slug -e Issue,PullRequest

INFO:srcopsmetrics.github_knowledge:Overall repositories found: 1
INFO:srcopsmetrics.bot_knowledge:######################## Analysing os-climate/aicoe-osc-demo ########################

INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Detected entities:
CodeFrequency # Commit # DependencyUpdate # Fork # Issue # IssueEvent # KebechetUpdateManager # License # PullRequest # PullRequestDiscussion # RawIssue # RawPullRequest # ReadMe # Release # Stargazer # TrafficClones # TrafficPaths # TrafficPaths # TrafficReferrers # TrafficClones # TrafficViews
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Issue inspection
INFO:srcopsmetrics.entities.tools.storage:Loading knowledge locally
INFO:srcopsmetrics.entities.tools.storage:Data from file %s loaded
INFO:srcopsmetrics.entities.interface:Found previous Issue knowledge for os-climate/aicoe-osc-demo with 77 records
INFO:srcopsmetrics.iterator:-------------Issue An

## Issue Metrics

Now, let's fetch the issues for the repository and derive some metrics.

In [7]:
issue = Issue(repo_slug)
issue_df = issue.load_previous_knowledge(is_local=True)
issue_df.head()

id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
132,Rerun NLP demo on the new cluster,The new dev cluster with a bigger GPU is up ([...,Shreyanand,2022-02-24 16:12:24,,NaT,{},{}
131,Use physical-landing bucket for NLP demo,"As an effort to decouple data owned by Trino, ...",Shreyanand,2022-02-24 15:57:35,,NaT,"{'enhancement': {'color': 'a2eeef', 'labeled_a...",{}
128,Prepare GPU image for training teacher network...,**Is your feature request related to a problem...,pacospace,2022-02-10 16:16:08,erikerlandson,2022-02-17 16:09:37,"{'enhancement': {'color': 'a2eeef', 'labeled_a...",{'pacospace': 8}
127,Value Error related to S3 in demo2 notebook,Value Error related to S3 at Import in when ru...,andraNew,2022-01-07 16:17:07,andraNew,2022-01-13 11:27:54,"{'bug': {'color': 'd73a4a', 'labeled_at': 1641...","{'andraNew': 123, 'erikerlandson': 90, 'chauha..."
125,Create Jupyterbook,Add _toc.yaml and _config.yml for the repo and...,oindrillac,2021-12-17 13:09:41,oindrillac,2021-12-20 13:23:13,{},{}


In [8]:
issue_df = issue_df.reset_index()

In [9]:
issue_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
0,132,Rerun NLP demo on the new cluster,The new dev cluster with a bigger GPU is up ([...,Shreyanand,2022-02-24 16:12:24,,NaT,{},{}
1,131,Use physical-landing bucket for NLP demo,"As an effort to decouple data owned by Trino, ...",Shreyanand,2022-02-24 15:57:35,,NaT,"{'enhancement': {'color': 'a2eeef', 'labeled_a...",{}
2,128,Prepare GPU image for training teacher network...,**Is your feature request related to a problem...,pacospace,2022-02-10 16:16:08,erikerlandson,2022-02-17 16:09:37,"{'enhancement': {'color': 'a2eeef', 'labeled_a...",{'pacospace': 8}
3,127,Value Error related to S3 in demo2 notebook,Value Error related to S3 at Import in when ru...,andraNew,2022-01-07 16:17:07,andraNew,2022-01-13 11:27:54,"{'bug': {'color': 'd73a4a', 'labeled_at': 1641...","{'andraNew': 123, 'erikerlandson': 90, 'chauha..."
4,125,Create Jupyterbook,Add _toc.yaml and _config.yml for the repo and...,oindrillac,2021-12-17 13:09:41,oindrillac,2021-12-20 13:23:13,{},{}


In [10]:
# Retain only relevant columns
issue_cols_to_drop = ["labels", "interactions"]
issue_df = issue_df.drop(columns=issue_cols_to_drop)
issue_df["org"] = ORG
issue_df["repo"] = REPO

issue_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,org,repo
0,132,Rerun NLP demo on the new cluster,The new dev cluster with a bigger GPU is up ([...,Shreyanand,2022-02-24 16:12:24,,NaT,os-climate,aicoe-osc-demo
1,131,Use physical-landing bucket for NLP demo,"As an effort to decouple data owned by Trino, ...",Shreyanand,2022-02-24 15:57:35,,NaT,os-climate,aicoe-osc-demo
2,128,Prepare GPU image for training teacher network...,**Is your feature request related to a problem...,pacospace,2022-02-10 16:16:08,erikerlandson,2022-02-17 16:09:37,os-climate,aicoe-osc-demo
3,127,Value Error related to S3 in demo2 notebook,Value Error related to S3 at Import in when ru...,andraNew,2022-01-07 16:17:07,andraNew,2022-01-13 11:27:54,os-climate,aicoe-osc-demo
4,125,Create Jupyterbook,Add _toc.yaml and _config.yml for the repo and...,oindrillac,2021-12-17 13:09:41,oindrillac,2021-12-20 13:23:13,os-climate,aicoe-osc-demo


In [11]:
# Upload the processed df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=issue_df,
    s3_prefix="open-services-group/metrics/github/issues",
    s3_key=f"{ORG}-{REPO}.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx000000000000000054328-0062212460-bd9943-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"425a237e4e4a645d4b9a5e369ed2cb22"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx000000000000000054328-0062212460-bd9943-ocs-storagecluster-cephobjectstore',
   'date': 'Thu, 03 Mar 2022 20:26:08 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=9245b3fe660230b2beaa13e1023f5083; path=/; HttpOnly'},
  'RetryAttempts': 0},
 'ETag': '"425a237e4e4a645d4b9a5e369ed2cb22"'}
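
The `created_at` and `closed_at` columns are enough to derive simple issue metrics. The following is a sketch rather than part of the original pipeline: it uses three rows copied from the table above, and the derived column names `is_open` and `days_to_close` are assumptions chosen for illustration.

```python
import pandas as pd

# Three rows shaped like issue_df above (timestamps copied from the output)
issues = pd.DataFrame(
    {
        "id": [128, 127, 132],
        "created_at": pd.to_datetime(
            ["2022-02-10 16:16:08", "2022-01-07 16:17:07", "2022-02-24 16:12:24"]
        ),
        "closed_at": pd.to_datetime(
            ["2022-02-17 16:09:37", "2022-01-13 11:27:54", None]
        ),
    }
)

# An issue without a closing timestamp (NaT) is still open
issues["is_open"] = issues["closed_at"].isna()

# Time to close in days; open issues yield NaN
issues["days_to_close"] = (
    issues["closed_at"] - issues["created_at"]
).dt.total_seconds() / 86400

print(issues[["id", "is_open", "days_to_close"]])
print("median days to close:", issues["days_to_close"].median())
```

Aggregations such as the median skip the `NaN` values of open issues by default, so they describe closed issues only.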

## PR Metrics

Now, let's fetch the PRs for the repository and derive some metrics.

In [12]:
pr = PullRequest(repo_slug)
pr_df = pr.load_previous_knowledge(is_local=True)
pr_df.head()

id,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at
133,[WIP] Add move data util nb and update config,# Related issues\r\n#131 #132 \r\n\r\nThis PR ...,L,Shreyanand,2022-02-28 21:18:36,NaT,,NaT,,1,3,{'review-notebook-app[bot]': 29},{},[],[fcd345815e73d8c8d570a8c32a2a34e3e6bf7447],"[notebooks/demo2/config.py, notebooks/move_dat...",NaT,NaT
130,Fix gpu manifests builds,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XS,pacospace,2022-02-22 09:31:00,NaT,,NaT,,1,1,{},{},[],[d87b0a45185451072c2fef554024a273ed8985a3],[manifests/nm-gpu-training-manifests.yaml],NaT,NaT
129,Add manifests for GPU image build,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XL,pacospace,2022-02-14 11:10:27,2022-02-17 16:09:37,erikerlandson,2022-02-17 16:09:37,erikerlandson,1,9,{'pacospace': 22},"{'881638114': {'author': 'erikerlandson', 'wor...",[],[17709f8aac5533bbaa20c311a937c89384644b14],"[manifests/.sops.yaml, manifests/README.md, ma...",2022-02-14 13:36:05,2022-02-14 13:36:05
126,Updated documentation,closes #125 \r\ncloses #110 \r\n\r\nJupyterBoo...,M,oindrillac,2021-12-17 16:40:00,2021-12-20 13:23:14,oindrillac,2021-12-20 13:23:13,oindrillac,1,5,"{'oindrillac': 13, 'chauhankaranraj': 18}","{'835457435': {'author': 'aakankshaduggal', 'w...",[],[523e26606333956764986a296149ca56edf56b40],"[README.md, _config.yml, _toc.yml, notebooks/d...",2021-12-17 17:05:43,2021-12-17 21:18:59
124,Update README,This PR \r\n- updates the README to mention th...,XS,chauhankaranraj,2021-12-14 20:45:08,2021-12-15 12:57:35,MichaelClifford,2021-12-15 12:57:35,MichaelClifford,1,1,{'MichaelClifford': 1},{},[],[ad9668f096e1e5ebc5b123d1c6df4ea5ed98af5a],[notebooks/demo2/README.md],NaT,NaT


In [13]:
pr_df = pr_df.reset_index()

In [14]:
pr_df.head()

Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at
0,133,[WIP] Add move data util nb and update config,# Related issues\r\n#131 #132 \r\n\r\nThis PR ...,L,Shreyanand,2022-02-28 21:18:36,NaT,,NaT,,1,3,{'review-notebook-app[bot]': 29},{},[],[fcd345815e73d8c8d570a8c32a2a34e3e6bf7447],"[notebooks/demo2/config.py, notebooks/move_dat...",NaT,NaT
1,130,Fix gpu manifests builds,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XS,pacospace,2022-02-22 09:31:00,NaT,,NaT,,1,1,{},{},[],[d87b0a45185451072c2fef554024a273ed8985a3],[manifests/nm-gpu-training-manifests.yaml],NaT,NaT
2,129,Add manifests for GPU image build,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XL,pacospace,2022-02-14 11:10:27,2022-02-17 16:09:37,erikerlandson,2022-02-17 16:09:37,erikerlandson,1,9,{'pacospace': 22},"{'881638114': {'author': 'erikerlandson', 'wor...",[],[17709f8aac5533bbaa20c311a937c89384644b14],"[manifests/.sops.yaml, manifests/README.md, ma...",2022-02-14 13:36:05,2022-02-14 13:36:05
3,126,Updated documentation,closes #125 \r\ncloses #110 \r\n\r\nJupyterBoo...,M,oindrillac,2021-12-17 16:40:00,2021-12-20 13:23:14,oindrillac,2021-12-20 13:23:13,oindrillac,1,5,"{'oindrillac': 13, 'chauhankaranraj': 18}","{'835457435': {'author': 'aakankshaduggal', 'w...",[],[523e26606333956764986a296149ca56edf56b40],"[README.md, _config.yml, _toc.yml, notebooks/d...",2021-12-17 17:05:43,2021-12-17 21:18:59
4,124,Update README,This PR \r\n- updates the README to mention th...,XS,chauhankaranraj,2021-12-14 20:45:08,2021-12-15 12:57:35,MichaelClifford,2021-12-15 12:57:35,MichaelClifford,1,1,{'MichaelClifford': 1},{},[],[ad9668f096e1e5ebc5b123d1c6df4ea5ed98af5a],[notebooks/demo2/README.md],NaT,NaT


In [15]:
# Retain only relevant columns
pr_cols_to_drop = ["interactions", "reviews", "labels", "commits", "changed_files"]
prs_df = pr_df.drop(columns=pr_cols_to_drop)
prs_df["org"] = ORG
prs_df["repo"] = REPO

prs_df.head()

Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,first_review_at,first_approve_at,org,repo
0,133,[WIP] Add move data util nb and update config,# Related issues\r\n#131 #132 \r\n\r\nThis PR ...,L,Shreyanand,2022-02-28 21:18:36,NaT,,NaT,,1,3,NaT,NaT,os-climate,aicoe-osc-demo
1,130,Fix gpu manifests builds,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XS,pacospace,2022-02-22 09:31:00,NaT,,NaT,,1,1,NaT,NaT,os-climate,aicoe-osc-demo
2,129,Add manifests for GPU image build,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XL,pacospace,2022-02-14 11:10:27,2022-02-17 16:09:37,erikerlandson,2022-02-17 16:09:37,erikerlandson,1,9,2022-02-14 13:36:05,2022-02-14 13:36:05,os-climate,aicoe-osc-demo
3,126,Updated documentation,closes #125 \r\ncloses #110 \r\n\r\nJupyterBoo...,M,oindrillac,2021-12-17 16:40:00,2021-12-20 13:23:14,oindrillac,2021-12-20 13:23:13,oindrillac,1,5,2021-12-17 17:05:43,2021-12-17 21:18:59,os-climate,aicoe-osc-demo
4,124,Update README,This PR \r\n- updates the README to mention th...,XS,chauhankaranraj,2021-12-14 20:45:08,2021-12-15 12:57:35,MichaelClifford,2021-12-15 12:57:35,MichaelClifford,1,1,NaT,NaT,os-climate,aicoe-osc-demo


In [16]:
# Upload the processed df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=prs_df,
    s3_prefix="open-services-group/metrics/github/prs",
    s3_key=f"{ORG}-{REPO}.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx00000000000000005432a-0062212461-bd9943-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"4d0e3cb60d8d34b61b1ca93739e55d21"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx00000000000000005432a-0062212461-bd9943-ocs-storagecluster-cephobjectstore',
   'date': 'Thu, 03 Mar 2022 20:26:09 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=9245b3fe660230b2beaa13e1023f5083; path=/; HttpOnly'},
  'RetryAttempts': 0},
 'ETag': '"4d0e3cb60d8d34b61b1ca93739e55d21"'}
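
The timestamp columns kept in `prs_df` support review and merge latency metrics. As a hedged sketch (the helper `add_pr_durations` and its output column names are hypothetical, not part of the MI tool), durations could be derived like this, using three rows copied from the table above:

```python
import pandas as pd


def add_pr_durations(df):
    """Return a copy of df with review/merge durations in hours."""
    out = df.copy()
    hour = pd.Timedelta(hours=1)
    # Dividing a timedelta column by one hour gives float hours (NaN for NaT)
    out["hours_to_first_review"] = (
        out["first_review_at"] - out["created_at"]
    ) / hour
    out["hours_to_merge"] = (out["merged_at"] - out["created_at"]) / hour
    out["is_merged"] = out["merged_at"].notna()
    return out


# Three rows shaped like prs_df above (timestamps copied from the output)
prs = pd.DataFrame(
    {
        "id": [129, 126, 133],
        "created_at": pd.to_datetime(
            ["2022-02-14 11:10:27", "2021-12-17 16:40:00", "2022-02-28 21:18:36"]
        ),
        "first_review_at": pd.to_datetime(
            ["2022-02-14 13:36:05", "2021-12-17 17:05:43", None]
        ),
        "merged_at": pd.to_datetime(
            ["2022-02-17 16:09:37", "2021-12-20 13:23:13", None]
        ),
    }
)

metrics = add_pr_durations(prs)
print(metrics[["id", "hours_to_first_review", "hours_to_merge", "is_merged"]])
print("share of merged PRs:", metrics["is_merged"].mean())
```

The function returns a copy, so the data frame that was uploaded to S3 stays unchanged and keeps the schema used for the Trino table.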

## Create Trino Tables

Now that the processed data frames are stored as parquet files in S3, we can generate [Trino](https://trino.io/) tables from them so that interactive dashboards can be built in [Superset](https://superset.apache.org/). We will connect to the [Operate First Trino](https://trino.operate-first.cloud/) instance.

In [17]:
# Map pandas column dtypes to datatypes supported by Trino/Superset
_p2smap = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))


# Generate the Trino table schema
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0], t=e[1]) for e in pz])
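
The lookup above matches dtype names exactly, so a column with a dtype such as `datetime64[us]` or `int32` raises a `ValueError`. A possible variant, shown here only as a sketch, classifies dtypes with the predicates in `pandas.api.types` instead. The function names `dtype_to_trino` and `schema_pairs` are hypothetical, and mapping every integer width to `bigint` is an assumption that may not match the physical type in the parquet file.

```python
import pandas as pd
from pandas.api import types as ptypes


def dtype_to_trino(dtype):
    """Map a pandas dtype to a Trino type name by dtype family."""
    if ptypes.is_bool_dtype(dtype):
        return "boolean"
    if ptypes.is_integer_dtype(dtype):
        return "bigint"  # assumption: all integer widths become bigint
    if ptypes.is_float_dtype(dtype):
        return "double"
    if ptypes.is_datetime64_any_dtype(dtype):
        return "timestamp"
    if ptypes.is_object_dtype(dtype) or ptypes.is_string_dtype(dtype):
        return "varchar"
    raise ValueError(f"unexpected pandas column type '{dtype}'")


def schema_pairs(df):
    """Render 'name type' pairs, one per column, for a CREATE TABLE body."""
    return ",\n".join(
        f"    {name} {dtype_to_trino(dtype)}" for name, dtype in df.dtypes.items()
    )


# Small illustrative frame
example = pd.DataFrame(
    {
        "id": pd.Series([1, 2], dtype="int64"),
        "title": ["a", "b"],
        "created_at": pd.to_datetime(["2022-01-01", "2022-01-02"]),
        "merged": [True, False],
    }
)
print(schema_pairs(example))
```

Column names are emitted unquoted, as in the original helper, so names containing spaces or reserved words would need quoting before being used in a table definition.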

In [18]:
# Create a Trino client
conn = trino.dbapi.connect(
    auth=trino.auth.BasicAuthentication(
        os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]
    ),
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    http_scheme="https",
    verify=True,
)
cur = conn.cursor()

In [19]:
# Check if Trino connection was successful
cur.execute("show catalogs")
cur.fetchall()[1]

['data_science_general']

In [20]:
# Create the issues table with data populated from parquet file
issue_schema = generate_table_schema_pairs(issue_df)

tabledef = """create table if not exists data_science_general.default.issues(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{s3_bucket}/open-services-group/metrics/github/issues'
)""".format(
    schema=issue_schema,
    s3_bucket=os.environ["S3_BUCKET"],
)

cur.execute(tabledef)
cur.fetchall()

[[True]]

In [21]:
# Create the PR table with data populated from parquet file
pr_schema = generate_table_schema_pairs(prs_df)

tabledef = """create table if not exists data_science_general.default.prs(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{s3_bucket}/open-services-group/metrics/github/prs'
)""".format(
    schema=pr_schema,
    s3_bucket=os.environ["S3_BUCKET"],
)

cur.execute(tabledef)
cur.fetchall()

[[True]]
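
The two table definitions differ only in the table name, the schema, and the S3 prefix, so the statement could be produced by one helper. This is a sketch under that observation: the function `build_external_table_ddl` and the bucket name `example-bucket` are hypothetical, while the catalog, schema, and prefix are the ones used in the cells above.

```python
def build_external_table_ddl(
    table,
    schema,
    bucket,
    prefix,
    catalog="data_science_general",
    db="default",
):
    """Build a CREATE TABLE statement for parquet files under an S3 prefix."""
    location = f"s3a://{bucket}/{prefix.strip('/')}"
    return (
        f"create table if not exists {catalog}.{db}.{table}(\n"
        f"{schema}\n"
        ") with (\n"
        "    format = 'parquet',\n"
        f"    external_location = '{location}'\n"
        ")"
    )


ddl = build_external_table_ddl(
    table="issues",
    schema="    id bigint,\n    title varchar",
    bucket="example-bucket",  # hypothetical bucket name
    prefix="open-services-group/metrics/github/issues",
)
print(ddl)
```

In the notebook, the `schema` argument would come from `generate_table_schema_pairs(issue_df)` or `generate_table_schema_pairs(prs_df)`, and the resulting string would be passed to `cur.execute`.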

## Conclusion

In this notebook we:

- Fetched GitHub Issue/PR data for a specified org/repo using the MI `srcopsmetrics` module
- Pre-processed the raw data into data frames with relevant columns
- Uploaded the processed data frames as parquet files to an S3 bucket
- Created Trino tables for the generated parquet files

We can now further explore the GitHub data obtained for different repos/orgs and create interactive visualization dashboards in [Superset](https://superset.operate-first.cloud/).