# GitHub Repository Metric Analysis

In this notebook, we fetch the GitHub issue and pull request (PR) data for the repositories listed in [sigs.yaml](https://github.com/open-services-group/community/blob/main/sigs.yaml) using the [MI tool](https://github.com/thoth-station/mi), pre-process the raw data into data frames, and store them as parquet files in an S3 bucket. We also create [Trino](https://trino.io/) tables for the generated parquet files so that we can later build dashboards in [Superset](https://superset.operate-first.cloud/).

This notebook serves as a template for analyzing different GitHub repositories, so that it can be executed in automation as part of our metrics processing pipeline. It can be run in parallel for different repositories by passing the target GitHub repository as an argument.

(Related issues: [Issue 1](https://github.com/open-services-group/metrics/issues/19))
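
As a rough sketch of how such per-repository runs might be driven: the notebook reads `ORG`, `REPO` and `SIG` from environment variables, and `load_dotenv` does not override variables that are already set by default, so each run can receive its target through the process environment. The notebook filename `github_metrics.ipynb`, the helper names, and the use of `jupyter nbconvert` are assumptions for illustration, not part of the existing pipeline.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical notebook filename; adjust to wherever this notebook is saved.
NOTEBOOK = "github_metrics.ipynb"


def build_run(org, repo, sig, base_env=None):
    """Return (command, env) for executing the notebook against one repo."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({"ORG": org, "REPO": repo, "SIG": sig})
    cmd = [
        "jupyter", "nbconvert", "--to", "notebook", "--execute",
        "--output", f"{org}-{repo}.ipynb", NOTEBOOK,
    ]
    return cmd, env


def run_all(targets, max_workers=4):
    """Execute the notebook once per (org, repo, sig) tuple, in parallel."""
    def _run(target):
        cmd, env = build_run(*target)
        # check=False: collect return codes instead of raising on failure
        return subprocess.run(cmd, env=env, check=False).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(_run, targets))
```

A call such as `run_all([("open-services-group", "devsecops", "wg-devsecops")])` would then return one exit code per repository. Because the data-gathering cell deletes and recreates a local `srcopsmetrics/` directory, concurrent runs would probably need separate working directories.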

In [1]:
import os
from dotenv import find_dotenv, load_dotenv
import warnings
import trino
from s3_communication import S3Communication
from github import Github
import pandas as pd

warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

True

In [2]:
## Create a .env file on your local machine with the correct configs
REPO = os.getenv("REPO")
ORG = os.getenv("ORG")
SIG = os.getenv("SIG")
GITHUB_ACCESS_TOKEN = os.getenv("GITHUB_ACCESS_TOKEN")
s3_endpoint_url = os.getenv("S3_ENDPOINT")
aws_access_key_id = os.getenv("S3_ACCESS_KEY")
aws_secret_access_key = os.getenv("S3_SECRET_KEY")
s3_bucket = os.getenv("S3_BUCKET")
repo_slug = f"{ORG}/{REPO}"
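
Since `os.getenv` silently returns `None` for anything missing from the `.env` file, a small check up front can make misconfiguration easier to spot. The following is a minimal sketch; the helper names `missing_vars` and `check_config` are hypothetical, and the variable list is simply collected from the cells of this notebook.

```python
import os

# Variables read by this notebook (S3, GitHub and Trino settings).
REQUIRED_VARS = [
    "REPO", "ORG", "SIG", "GITHUB_ACCESS_TOKEN",
    "S3_ENDPOINT", "S3_ACCESS_KEY", "S3_SECRET_KEY", "S3_BUCKET",
    "TRINO_USER", "TRINO_PASSWD", "TRINO_HOST", "TRINO_PORT",
]


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


def check_config(env=None):
    """Raise early with a readable message if any setting is missing."""
    missing = missing_vars(env)
    if missing:
        raise RuntimeError("Missing settings in .env: " + ", ".join(missing))
```

Calling `check_config()` right after `load_dotenv(find_dotenv())` would stop the run before any data is fetched.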

In [3]:
# Note: The GitHub access token must be exported before importing the srcopsmetrics package (current bug)
from srcopsmetrics.entities.issue import Issue  # noqa: E402
from srcopsmetrics.entities.pull_request import PullRequest  # noqa: E402

In [4]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url, aws_access_key_id, aws_secret_access_key, s3_bucket
)

In [5]:
# Remove any existing old data
!rm -rf srcopsmetrics/
# Gather the data
!python -m srcopsmetrics.cli -clr $repo_slug -e Issue,PullRequest

INFO:srcopsmetrics.github_knowledge:Overall repositories found: 1
INFO:srcopsmetrics.bot_knowledge:######################## Analysing open-services-group/devsecops ########################

INFO:srcopsmetrics.utils:No repo identified, creating new directory at /opt/app-root/src/metrics/notebooks/srcopsmetrics/bot_knowledge/open-services-group/devsecops
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Detected entities:
CodeFrequency # Commit # DependencyUpdate # Fork # Issue # IssueEvent # KebechetUpdateManager # License # PullRequest # PullRequestDiscussion # RawIssue # RawPullRequest # ReadMe # Release # Stargazer # TrafficClones # TrafficPaths # TrafficPaths # TrafficReferrers # TrafficClones # TrafficViews
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Issue inspection
INFO:srcopsmetrics.entities.tools.storage:Loading knowledge locally
INFO:srcopsmetrics.entities.tools.storage:Data from file %s lo

## Issue Metrics

Now, let's fetch the issues for the repository and derive some metrics.

In [6]:
issue = Issue(repo_slug)
issue_df = issue.load_previous_knowledge(is_local=True)
issue_df = issue_df.reset_index()
issue_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions
0,18,TAG vs BRANCHES,,harshad16,2022-03-23 17:17:41,,NaT,{},{'Gkrumbach07': 142}
1,14,Devsecops ODH documentation,We are moving into the end of this quarter and...,Gregory-Pereira,2022-03-16 22:06:45,,NaT,{},"{'Gregory-Pereira': 14, 'harshad16': 7}"
2,11,Implement way to cherry pick usptream commits ...,We want to be able to integrate upstream chang...,Gregory-Pereira,2022-03-09 17:00:57,,NaT,{},{'Gregory-Pereira': 959}
3,10,Write ADR on odh operator deployment,We should start documenting how we're deployin...,HumairAK,2022-03-02 17:50:34,HumairAK,2022-03-22 14:52:32,{},"{'Gkrumbach07': 482, 'HumairAK': 120}"
4,9,Upgrade osc-cl1 and osc-cl2 to odh v1.1.2,This will require rebasing the `odh-manifests`...,HumairAK,2022-03-02 17:47:11,HumairAK,2022-03-22 14:52:44,{},"{'HumairAK': 34, 'Gregory-Pereira': 1, 'harsha..."


In [7]:
# Retain only relevant columns
issue_cols_to_drop = ["labels", "interactions"]
issue_df = issue_df.drop(columns=issue_cols_to_drop)
issue_df["org"] = ORG
issue_df["repo"] = REPO

issue_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,org,repo
0,18,TAG vs BRANCHES,,harshad16,2022-03-23 17:17:41,,NaT,open-services-group,devsecops
1,14,Devsecops ODH documentation,We are moving into the end of this quarter and...,Gregory-Pereira,2022-03-16 22:06:45,,NaT,open-services-group,devsecops
2,11,Implement way to cherry pick usptream commits ...,We want to be able to integrate upstream chang...,Gregory-Pereira,2022-03-09 17:00:57,,NaT,open-services-group,devsecops
3,10,Write ADR on odh operator deployment,We should start documenting how we're deployin...,HumairAK,2022-03-02 17:50:34,HumairAK,2022-03-22 14:52:32,open-services-group,devsecops
4,9,Upgrade osc-cl1 and osc-cl2 to odh v1.1.2,This will require rebasing the `odh-manifests`...,HumairAK,2022-03-02 17:47:11,HumairAK,2022-03-22 14:52:44,open-services-group,devsecops
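
The `created_at` and `closed_at` columns are enough to derive simple issue metrics. As an illustration only (the notebook itself leaves metric calculation to the dashboards), the sketch below counts open and closed issues and computes the median time to close. The function name `issue_close_stats` is hypothetical and the sample values are made up; only the column names come from `issue_df`.

```python
import pandas as pd


def issue_close_stats(df):
    """Summarise open/closed counts and time-to-close for an issues frame."""
    closed = df[df["closed_at"].notna()]
    time_to_close = closed["closed_at"] - closed["created_at"]
    return {
        "total": len(df),
        "open": int(df["closed_at"].isna().sum()),
        "closed": len(closed),
        # Express the median Timedelta as a number of days
        "median_days_to_close": (
            time_to_close.median() / pd.Timedelta(days=1) if len(closed) else None
        ),
    }


# Toy frame with the same column names as issue_df (values are made up).
sample = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "created_at": pd.to_datetime(["2022-03-01", "2022-03-02", "2022-03-05"]),
        "closed_at": pd.to_datetime(["2022-03-03", None, "2022-03-09"]),
    }
)
stats = issue_close_stats(sample)
```

On the real data this would be called as `issue_close_stats(issue_df)`.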


In [8]:
# Upload the processed df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=issue_df,
    s3_prefix="open-services-group/metrics/github/issues",
    s3_key=f"{ORG}-{REPO}.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx00000000000000003a5b6-0062474698-c1dc3c-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"188ac1a186596b61cb56d2afe741033b"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx00000000000000003a5b6-0062474698-c1dc3c-ocs-storagecluster-cephobjectstore',
   'date': 'Fri, 01 Apr 2022 18:38:16 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=9305a9992645bb0698c9f5d65ae10c7e; path=/; HttpOnly'},
  'RetryAttempts': 0},
 'ETag': '"188ac1a186596b61cb56d2afe741033b"'}

## PR Metrics

Now, let's fetch the PRs for the repository and derive some metrics.

In [9]:
pr = PullRequest(repo_slug)
pr_df = pr.load_previous_knowledge(is_local=True)
pr_df = pr_df.reset_index()
pr_df.head()

Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at
0,17,adding documentation on the operate-first odh-...,Couldn't find more documentation on the Kfdefs...,M,Gregory-Pereira,2022-03-22 23:22:24,NaT,,NaT,,1,1,"{'sesheta': 202, 'Gkrumbach07': 21, 'Gregory-P...",{},"[size/M, needs-ok-to-test]",[217505f48c8ccf8220406cc6955ceda0c386bb95],[Docs/odh-deployment.md],NaT,NaT
1,16,added ADR 0002,This is a an ADR for the issue here https://gi...,M,Gkrumbach07,2022-03-22 17:14:51,NaT,,NaT,,1,1,{'sesheta': 85},"{'918920515': {'author': 'harshad16', 'words_c...","[size/M, lgtm]",[fe01ec850333967c5e8e69c2c22a409fa4b57a0b],[ADR/0002-odh-manifest-differences.md],2022-03-23 14:56:45,NaT
2,15,Add superset operational docs.,Related: https://github.com/open-services-grou...,M,HumairAK,2022-03-22 14:53:14,NaT,,NaT,,1,1,{'sesheta': 73},{},[size/M],[32bb7d9704be408090403798defc04bedbf7deb4],[Docs/odh-superset.md],NaT,NaT
3,13,added CI configurations from operate-first/apps,In response to WG-Devsecops discussion today: ...,M,Gregory-Pereira,2022-03-16 20:48:20,2022-03-22 13:17:19,sesheta,2022-03-22 13:17:19,sesheta,1,5,"{'sesheta': 182, 'Gregory-Pereira': 9, 'Humair...","{'912979439': {'author': 'harshad16', 'words_c...","[size/M, approved, lgtm, ok-to-test]",[13fcd1b354f350eed005e9587ab3d5a3516e2a97],"[.aicoe-ci.yaml, .pre-commit-config.yaml, .pro...",2022-03-17 11:09:47,NaT
4,12,Create 0001-dont-use-olm.md,Add OLM ADR\r\n\r\nrelated: https://github.com...,M,Gkrumbach07,2022-03-16 16:49:04,2022-03-22 13:42:21,HumairAK,2022-03-22 13:42:21,HumairAK,2,2,"{'sesheta': 124, 'Gkrumbach07': 1, 'HumairAK':...","{'916333890': {'author': 'Gregory-Pereira', 'w...",[size/M],"[55aa296a863f9bdf607deea0e5b8c7fe8f2be153, b11...","[ADR/0001-dont-use-olm.md, OWNERS]",2022-03-21 20:42:13,NaT


In [10]:
# Retain only relevant columns
pr_cols_to_drop = ["interactions", "reviews", "labels", "commits", "changed_files"]
prs_df = pr_df.drop(columns=pr_cols_to_drop)
prs_df["org"] = ORG
prs_df["repo"] = REPO

prs_df.head()

Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,first_review_at,first_approve_at,org,repo
0,17,adding documentation on the operate-first odh-...,Couldn't find more documentation on the Kfdefs...,M,Gregory-Pereira,2022-03-22 23:22:24,NaT,,NaT,,1,1,NaT,NaT,open-services-group,devsecops
1,16,added ADR 0002,This is a an ADR for the issue here https://gi...,M,Gkrumbach07,2022-03-22 17:14:51,NaT,,NaT,,1,1,2022-03-23 14:56:45,NaT,open-services-group,devsecops
2,15,Add superset operational docs.,Related: https://github.com/open-services-grou...,M,HumairAK,2022-03-22 14:53:14,NaT,,NaT,,1,1,NaT,NaT,open-services-group,devsecops
3,13,added CI configurations from operate-first/apps,In response to WG-Devsecops discussion today: ...,M,Gregory-Pereira,2022-03-16 20:48:20,2022-03-22 13:17:19,sesheta,2022-03-22 13:17:19,sesheta,1,5,2022-03-17 11:09:47,NaT,open-services-group,devsecops
4,12,Create 0001-dont-use-olm.md,Add OLM ADR\r\n\r\nrelated: https://github.com...,M,Gkrumbach07,2022-03-16 16:49:04,2022-03-22 13:42:21,HumairAK,2022-03-22 13:42:21,HumairAK,2,2,2022-03-21 20:42:13,NaT,open-services-group,devsecops
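
The timestamp columns kept in `prs_df` support review-oriented metrics. A minimal sketch, assuming we are interested in the share of merged PRs and the median time to first review: the function name `pr_review_stats` is hypothetical and the sample rows are invented, while the column names match `prs_df`.

```python
import pandas as pd


def pr_review_stats(df):
    """Compute merge share and hours to first review for a PR frame."""
    merged = df["merged_at"].notna()
    # PRs without a review yield NaT here and are dropped
    hours_to_review = (
        (df["first_review_at"] - df["created_at"]).dropna() / pd.Timedelta(hours=1)
    )
    return {
        "total": len(df),
        "merged_share": float(merged.mean()) if len(df) else None,
        "median_hours_to_first_review": (
            float(hours_to_review.median()) if len(hours_to_review) else None
        ),
    }


# Toy frame with the same column names as prs_df (values are made up).
sample_prs = pd.DataFrame(
    {
        "created_at": pd.to_datetime(
            ["2022-03-16 10:00", "2022-03-17 09:00", "2022-03-18 09:00", "2022-03-19 09:00"]
        ),
        "first_review_at": pd.to_datetime(
            ["2022-03-16 14:00", "2022-03-17 11:00", None, None]
        ),
        "merged_at": pd.to_datetime(
            ["2022-03-17 10:00", None, "2022-03-18 12:00", None]
        ),
    }
)
pr_stats = pr_review_stats(sample_prs)
```

On the real data this would be called as `pr_review_stats(prs_df)`.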


In [11]:
# Upload the processed df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=prs_df,
    s3_prefix="open-services-group/metrics/github/prs",
    s3_key=f"{ORG}-{REPO}.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx00000000000000003a5b7-0062474699-c1dc3c-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"2454938d7e26ede0fa9c5b7b98640f86"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx00000000000000003a5b7-0062474699-c1dc3c-ocs-storagecluster-cephobjectstore',
   'date': 'Fri, 01 Apr 2022 18:38:18 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=9305a9992645bb0698c9f5d65ae10c7e; path=/; HttpOnly'},
  'RetryAttempts': 0},
 'ETag': '"2454938d7e26ede0fa9c5b7b98640f86"'}

## Contributor Metrics

Next, let's fetch the events for the repository. This table will be used to find information about contributors, their affiliations, and the events they generate in the repository.

In [12]:
# TODO
# In the current version we can only get data for the last 90 days or 300 events,
# as limited by the GitHub API.
# Add a loop for getting older data:
# probably get all events for one month and then loop to the next month.

In [13]:
## Get internal members of OSG group
gth_obj = Github(GITHUB_ACCESS_TOKEN)
osg = gth_obj.get_organization("open-services-group")
internal = [m.login for m in osg.get_members()]

In [14]:
repo = gth_obj.get_repo(repo_slug)

In [15]:
# Define events we are interested in
issue_event_types = ["IssueCommentEvent", "IssuesEvent"]

pr_event_types = [
    "PullRequestEvent",
    "PullRequestReviewEvent",
    "PullRequestReviewCommentEvent",
]

In [16]:
# Create the events data frame
rows = []
for e in repo.get_events():
    if e.type in issue_event_types or e.type in pr_event_types:
        event_id = e.id
        created_at = e.created_at
        event_contributor_id = e.actor.id
        event_contributor = e.actor.login
        event_type = e.type
        event_action = e.payload["action"]
        if event_type in issue_event_types:
            issue_or_pr_id = e.payload["issue"]["number"]
        else:
            issue_or_pr_id = e.payload["pull_request"]["number"]
        rows.append(
            [
                event_id,
                created_at,
                event_contributor_id,
                event_contributor,
                event_type,
                event_action,
                issue_or_pr_id,
            ]
        )
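
The loop above can also be expressed as a small function that turns one event into a row, which makes the extraction easier to test in isolation. This is a sketch under assumptions: `event_to_row` is a hypothetical helper, the stand-in object only mimics the attributes the loop reads and is not a real PyGithub event, and `payload.get("action")` is used so that a payload without an `action` key yields `None` rather than a `KeyError`.

```python
from types import SimpleNamespace

ISSUE_EVENT_TYPES = {"IssueCommentEvent", "IssuesEvent"}
PR_EVENT_TYPES = {
    "PullRequestEvent",
    "PullRequestReviewEvent",
    "PullRequestReviewCommentEvent",
}


def event_to_row(event):
    """Return a row list for a tracked event, or None for any other type."""
    if event.type in ISSUE_EVENT_TYPES:
        number = event.payload["issue"]["number"]
    elif event.type in PR_EVENT_TYPES:
        number = event.payload["pull_request"]["number"]
    else:
        return None
    return [
        event.id,
        event.created_at,
        event.actor.id,
        event.actor.login,
        event.type,
        event.payload.get("action"),  # None instead of KeyError if absent
        number,
    ]


# Stand-in object with the attributes the loop reads; not a real PyGithub event.
fake = SimpleNamespace(
    id="1",
    created_at="2022-03-23",
    type="IssuesEvent",
    actor=SimpleNamespace(id=7, login="someone"),
    payload={"action": "opened", "issue": {"number": 18}},
)
row = event_to_row(fake)
```

With this helper, the rows could be collected as `rows = [r for r in map(event_to_row, repo.get_events()) if r is not None]`.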

In [17]:
# Add column names for events data frame
column_name = [
    "id",
    "created_at",
    "contributor_id",
    "contributor_name",
    "type",
    "action",
    "issue_or_pr_id",
]

In [18]:
# Add other required columns
events_df = pd.DataFrame(data=rows, columns=column_name)
events_df["org"] = ORG
events_df["repo"] = REPO
events_df["sig"] = SIG
events_df["internal_contributor"] = events_df["contributor_name"].apply(
    lambda x: x in internal
)

In [19]:
events_df.head()

Unnamed: 0,id,created_at,contributor_id,contributor_name,type,action,issue_or_pr_id,org,repo,sig,internal_contributor
0,20976303499,2022-03-28 15:29:47,12587674,Gkrumbach07,IssueCommentEvent,created,18,open-services-group,devsecops,wg-devsecops,True
1,20898787396,2022-03-23 17:17:41,14028058,harshad16,IssuesEvent,opened,18,open-services-group,devsecops,wg-devsecops,True
2,20896021955,2022-03-23 15:01:53,19876404,Gregory-Pereira,IssueCommentEvent,created,17,open-services-group,devsecops,wg-devsecops,False
3,20895900246,2022-03-23 14:56:45,14028058,harshad16,PullRequestReviewEvent,created,16,open-services-group,devsecops,wg-devsecops,True
4,20893004155,2022-03-23 12:38:25,12587674,Gkrumbach07,IssueCommentEvent,created,17,open-services-group,devsecops,wg-devsecops,True
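
To illustrate the kind of contributor information this table can provide, the sketch below counts events and distinct issues/PRs per contributor and computes the share of events coming from internal contributors. The function name `contributor_summary` is hypothetical and the sample rows are invented; the column names match `events_df`.

```python
import pandas as pd


def contributor_summary(df):
    """Count events and distinct issues/PRs touched per contributor."""
    return (
        df.groupby(["contributor_name", "internal_contributor"])
        .agg(events=("id", "count"), items=("issue_or_pr_id", "nunique"))
        .reset_index()
        .sort_values("events", ascending=False, ignore_index=True)
    )


# Toy frame with the same column names as events_df (values are made up).
sample_events = pd.DataFrame(
    {
        "id": ["1", "2", "3", "4"],
        "contributor_name": ["a", "a", "b", "a"],
        "internal_contributor": [True, True, False, True],
        "issue_or_pr_id": [18, 17, 17, 17],
    }
)
summary = contributor_summary(sample_events)
# Fraction of events generated by internal contributors
internal_share = float(sample_events["internal_contributor"].mean())
```

On the real data this would be called as `contributor_summary(events_df)`.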


In [20]:
# Upload the processed df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=events_df,
    s3_prefix="open-services-group/metrics/github/events",
    s3_key=f"{ORG}-{REPO}.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx00000000000000003a5b9-006247469e-c1dc3c-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"7941f59df5a1834180b83f1760b8b264"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx00000000000000003a5b9-006247469e-c1dc3c-ocs-storagecluster-cephobjectstore',
   'date': 'Fri, 01 Apr 2022 18:38:23 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=9305a9992645bb0698c9f5d65ae10c7e; path=/; HttpOnly'},
  'RetryAttempts': 0},
 'ETag': '"7941f59df5a1834180b83f1760b8b264"'}

## Create Trino Tables

Now that the processed data frames are stored as parquet files in S3, we can generate [Trino](https://trino.io/) tables from them so that interactive dashboards can be built in [Superset](https://superset.apache.org/). We will connect to the [Operate First Trino](https://trino.operate-first.cloud/).

In [21]:
# Map pandas column dtypes to the corresponding types supported by Trino/Superset
_p2smap = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))


# Generate the Trino table schema
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0], t=e[1]) for e in pz])
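
The lookup above matches dtype names exactly, so it raises for variants such as `int32`, nullable `Int64`, or timezone-aware timestamps. One possible alternative, sketched here as an assumption rather than a replacement, classifies columns by dtype family using `pandas.api.types`. The names `dtype_to_trino` and `schema_pairs` are hypothetical, and whether each chosen Trino type matches how the parquet writer stored the column is not verified here.

```python
import pandas as pd
from pandas.api import types as ptypes


def dtype_to_trino(dtype):
    """Map a pandas dtype to a Trino type name, by dtype family."""
    if ptypes.is_bool_dtype(dtype):
        return "boolean"
    if ptypes.is_integer_dtype(dtype):
        return "bigint"
    if ptypes.is_float_dtype(dtype):
        return "double"
    if ptypes.is_datetime64_any_dtype(dtype):
        return "timestamp"
    if ptypes.is_object_dtype(dtype) or ptypes.is_string_dtype(dtype):
        return "varchar"
    raise ValueError(f"unexpected pandas column type '{dtype}'")


def schema_pairs(df):
    """Build the indented 'name type' column list used in CREATE TABLE."""
    return ",\n".join(
        f"    {name} {dtype_to_trino(dtype)}" for name, dtype in df.dtypes.items()
    )


# Toy frame covering each dtype family (values are made up).
sample_df = pd.DataFrame(
    {
        "id": [1, 2],
        "score": [0.5, 1.5],
        "title": ["a", "b"],
        "created_at": pd.to_datetime(["2022-03-01", "2022-03-02"]),
        "internal": [True, False],
    }
)
schema = schema_pairs(sample_df)
```

The resulting string has the same shape as the output of `generate_table_schema_pairs`, so it could be passed to the table definitions in the same way.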

In [22]:
# Create a Trino client
conn = trino.dbapi.connect(
    auth=trino.auth.BasicAuthentication(
        os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]
    ),
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    http_scheme="https",
    verify=True,
)
cur = conn.cursor()

In [23]:
# Check if Trino connection was successful
cur.execute("show catalogs")
cur.fetchall()[1]

['data_science_general']

In [24]:
# Create the issues table with data populated from parquet file
issue_schema = generate_table_schema_pairs(issue_df)

tabledef = """create table if not exists data_science_general.default.issues(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{s3_bucket}/open-services-group/metrics/github/issues'
)""".format(
    schema=issue_schema,
    s3_bucket=os.environ["S3_BUCKET"],
)

cur.execute(tabledef)
cur.fetchall()

[[True]]

In [25]:
# Create the PR table with data populated from parquet file
pr_schema = generate_table_schema_pairs(prs_df)

tabledef = """create table if not exists data_science_general.default.prs(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{s3_bucket}/open-services-group/metrics/github/prs'
)""".format(
    schema=pr_schema,
    s3_bucket=os.environ["S3_BUCKET"],
)

cur.execute(tabledef)
cur.fetchall()

[[True]]

In [26]:
# Create the events table with data populated from parquet file
events_schema = generate_table_schema_pairs(events_df)

tabledef = """create table if not exists data_science_general.default.events(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{s3_bucket}/open-services-group/metrics/github/events'
)""".format(
    schema=events_schema,
    s3_bucket=os.environ["S3_BUCKET"],
)

cur.execute(tabledef)
cur.fetchall()

[[True]]
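
The three table definitions differ only in the table name and schema, so they could be generated by one helper. A minimal sketch follows; `build_table_ddl` is a hypothetical name, and the default catalog, schema and S3 prefix are simply copied from the statements above.

```python
def build_table_ddl(
    table,
    schema,
    s3_bucket,
    catalog="data_science_general",
    db="default",
    prefix="open-services-group/metrics/github",
):
    """Return the CREATE TABLE statement for one parquet-backed table."""
    return (
        f"create table if not exists {catalog}.{db}.{table}(\n"
        f"{schema}\n"
        ") with (\n"
        "    format = 'parquet',\n"
        f"    external_location = 's3a://{s3_bucket}/{prefix}/{table}'\n"
        ")"
    )


# Example with a one-column schema and a placeholder bucket name.
ddl = build_table_ddl("issues", "    id bigint", "my-bucket")
```

In the notebook, this could be used in a loop over `{"issues": issue_df, "prs": prs_df, "events": events_df}`, calling `cur.execute(build_table_ddl(name, generate_table_schema_pairs(df), s3_bucket))` followed by `cur.fetchall()` for each entry.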

## Conclusion

In this notebook we:

- Fetched GitHub Issue/PR data for a specified org/repo using the MI `srcopsmetrics` module
- Fetched recent repository events and flagged internal contributors using the GitHub API
- Pre-processed the raw data into data frames with relevant columns
- Uploaded the processed data frames as parquet files to an S3 bucket
- Created Trino tables for the generated parquet files

We can now further explore the GitHub data obtained for different repos/orgs and create interactive visualization dashboards in [Superset](https://superset.operate-first.cloud/).