# GitHub Repository Metric Analysis

In this notebook, we will fetch the GitHub Issue/PR data for the repositories listed in [sigs.yaml](https://github.com/open-services-group/community/blob/main/sigs.yaml) using the [MI tool](https://github.com/thoth-station/mi), pre-process the raw data into suitable data frames, and store them as parquet files in an S3 bucket. We will also create [Trino](https://trino.io/) tables for the generated parquet files so that we can later build dashboards in [Superset](https://superset.operate-first.cloud/).

This notebook serves as a template for analyzing different GitHub repositories, so that it can be easily executed in automation as part of our metrics processing pipeline. It can be executed in parallel for different repos by passing the GitHub repository to analyze as an argument.

(Related issues: [Issue 1](https://github.com/open-services-group/metrics/issues/19))

In [1]:
import os
import yaml
import requests
from dotenv import find_dotenv, load_dotenv
import warnings
import trino
from s3_communication import S3Communication
from github import Github
import pandas as pd

warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

True

In [2]:
## Create a .env file on your local machine with the correct configs
RAW_SIGS_YAML_URL = os.getenv("RAW_SIGS_YAML_URL")
GITHUB_ACCESS_TOKEN = os.getenv("GITHUB_ACCESS_TOKEN")

s3_bucket = os.getenv("S3_BUCKET")
s3_endpoint_url = os.getenv("S3_ENDPOINT")
aws_access_key_id = os.getenv("S3_ACCESS_KEY")
aws_secret_access_key = os.getenv("S3_SECRET_KEY")

In [3]:
# Note: The GitHub access token needs to be exported before importing the srcopsmetrics package (current bug)
from srcopsmetrics.entities.issue import Issue  # noqa: E402
from srcopsmetrics.entities.pull_request import PullRequest  # noqa: E402

In [4]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url, aws_access_key_id, aws_secret_access_key, s3_bucket
)

In [5]:
# read the sigs.yaml file
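ret = requests.get(RAW_SIGS_YAML_URL)
ret.raise_for_status()  # fail early if the file could not be downloaded
# safe_load avoids the missing-Loader error raised by yaml.load in recent PyYAML
sigs_file = yaml.safe_load(ret.content)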

In [6]:
def get_org_repo_from_owners_slug(owners_slug):
    split_path = owners_slug.split("/")
    org = split_path[3]
    repo = split_path[4]
    return org, repo
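
The helper above relies on fixed positions in the split string, which works when every `owners` entry is a full GitHub URL. As a minimal sketch under that assumption (the URL layout `https://github.com/<org>/<repo>/...` and the name `parse_org_repo` are illustrative, not taken from the MI tool), the same parsing could be done with `urllib.parse` so that malformed entries fail with a clear error:

```python
from urllib.parse import urlparse


def parse_org_repo(owners_url):
    """Return (org, repo) from a GitHub URL (assumed layout: /<org>/<repo>/...)."""
    parsed = urlparse(owners_url)
    # Drop empty segments produced by leading or trailing slashes
    parts = [p for p in parsed.path.split("/") if p]
    if parsed.netloc != "github.com" or len(parts) < 2:
        raise ValueError(f"not a GitHub repository URL: {owners_url!r}")
    return parts[0], parts[1]


print(
    parse_org_repo(
        "https://github.com/open-services-group/community/blob/main/OWNERS"
    )
)
```

Raising on unexpected input is a design choice here: a silent `IndexError` or a wrong org/repo pair would otherwise surface much later, during data collection.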

In [8]:
# Remove any existing old data (-f avoids an error if the directory does not exist yet)
!rm -rf srcopsmetrics/

# Gather the data for each subproject in each SIG (other groups may not have a
# repo associated with them).
# NOTE: multiple subprojects can share the same repo. In such cases we don't
# want to pull the GitHub data all over again, since we are already constrained
# by the number of GitHub API calls we can make. This set keeps track of the
# repos whose data has already been pulled, so that no API calls are made for
# a repo whose data already exists.
repos_collected = set()

for sig in sigs_file['sigs']:
    for sp in sig['subprojects']:
        for owners_slug in sp['owners']:
            org, repo = get_org_repo_from_owners_slug(owners_slug)
            # key on (org, repo): different OWNERS paths can point to the same repo
            if (org, repo) not in repos_collected:
                print(f'#### Collecting data for {org}/{repo}')
                !python -m srcopsmetrics.cli -clr $org/$repo -e Issue,PullRequest # noqa: E999
                repos_collected.add((org, repo))

#### Collecting data for open-services-group/community
INFO:srcopsmetrics.github_knowledge:Overall repositories found: 1
INFO:srcopsmetrics.bot_knowledge:######################## Analysing open-services-group/community ########################

INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Detected entities:
CodeFrequency # Commit # DependencyUpdate # Fork # Issue # IssueEvent # KebechetUpdateManager # License # PullRequest # PullRequestDiscussion # RawIssue # RawPullRequest # ReadMe # Release # Stargazer # TrafficClones # TrafficPaths # TrafficPaths # TrafficReferrers # TrafficClones # TrafficViews
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Issue inspection
INFO:srcopsmetrics.entities.tools.storage:Loading knowledge locally
INFO:srcopsmetrics.entities.tools.storage:Data from file %s loaded
INFO:srcopsmetrics.entities.interface:Found previous Issue knowledge for open-services-group/community wi
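
The nested loop over SIGs, subprojects and owners recurs in several cells of this notebook. One possible way to centralize it is a small generator that yields each distinct repo once. The sketch below is illustrative only: `iter_sig_repos` is a hypothetical helper, and the example dictionary is made-up data shaped like the parsed `sigs.yaml` (keys `sigs`, `dir`, `subprojects`, `owners`).

```python
def iter_sig_repos(sigs_file):
    """Yield (sig_dir, org, repo) once per distinct org/repo pair."""
    seen = set()
    for sig in sigs_file["sigs"]:
        # .get() tolerates groups that list no subprojects
        for sp in sig.get("subprojects", []):
            for owners_slug in sp["owners"]:
                parts = owners_slug.split("/")
                org, repo = parts[3], parts[4]
                if (org, repo) not in seen:
                    seen.add((org, repo))
                    yield sig["dir"], org, repo


# Hypothetical data shaped like the parsed sigs.yaml
example = {
    "sigs": [
        {
            "dir": "sig-a",
            "subprojects": [
                {"owners": ["https://github.com/org-x/repo-1/blob/main/OWNERS"]},
                {"owners": ["https://github.com/org-x/repo-1/blob/main/docs/OWNERS"]},
            ],
        },
        {
            "dir": "sig-b",
            "subprojects": [
                {"owners": ["https://github.com/org-y/repo-2/blob/main/OWNERS"]}
            ],
        },
    ]
}

for sig_dir, org, repo in iter_sig_repos(example):
    print(sig_dir, org, repo)
```

Note that a repo shared by two SIGs is attributed to the first SIG that lists it; whether that is the desired behavior depends on how the dashboards should count shared repos.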

## Issue Metrics

Now, let's load the issue data collected for the repository and prepare it for deriving metrics.

In [9]:
def get_issue_metrics(sig, org, repo):
    # read repo data we fetched in previous step via MI
    issue = Issue(f"{org}/{repo}")
    issue_df = issue.load_previous_knowledge(is_local=True)
    issue_df = issue_df.reset_index()

    # Retain only relevant columns
    issue_cols_to_drop = ["labels", "interactions"]
    issue_df = issue_df.drop(columns=issue_cols_to_drop)

    # add sig, org, repo columns
    issue_df["sig"] = sig
    issue_df["org"] = org
    issue_df["repo"] = repo

    return issue_df

In [10]:
# let's read a sample df
issue_df = get_issue_metrics(
    sig="sig-data-science", org="os-climate", repo="aicoe-osc-demo"
)
issue_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,sig,org,repo
0,150,Run infer_relevance.ipynb in demo2,I have an issue when I run this notebook. Ther...,JeremyGohBNP,2022-04-21 09:43:13,,NaT,sig-data-science,os-climate,aicoe-osc-demo
1,149,Deliver Container Image,"Hey, AICoE-CI!\r\n\r\nPlease build and deliver...",Shreyanand,2022-04-14 15:14:43,sesheta,2022-04-14 15:44:21,sig-data-science,os-climate,aicoe-osc-demo
2,146,Deliver Container Image,"Hey, AICoE-CI!\r\n\r\nPlease build and deliver...",Shreyanand,2022-04-13 15:19:00,sesheta,2022-04-13 15:50:35,sig-data-science,os-climate,aicoe-osc-demo
3,145,Remove editable src package from pipfile,Remove editable _src_ package from the Pipfile...,Shreyanand,2022-04-11 14:10:53,,NaT,sig-data-science,os-climate,aicoe-osc-demo
4,142,Fine tune KPI inference model,"As a data scientist,\r\nI want to fine tune a ...",chauhankaranraj,2022-04-04 19:35:17,,NaT,sig-data-science,os-climate,aicoe-osc-demo
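
The `created_at` and `closed_at` columns shown above are enough to derive simple issue metrics. The following is a minimal sketch on a small hand-made data frame (the values are invented for illustration); it computes the number of open issues and the median time to close, which are two metrics one might chart in a dashboard:

```python
import pandas as pd

# Invented sample with the same column names as the issue data frame
issues = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "created_at": pd.to_datetime(
            ["2022-04-01 10:00:00", "2022-04-02 09:00:00", "2022-04-03 12:00:00"]
        ),
        "closed_at": pd.to_datetime(
            ["2022-04-01 12:00:00", None, "2022-04-04 12:00:00"]
        ),
    }
)

# Open issues have no closing timestamp (NaT)
open_count = int(issues["closed_at"].isna().sum())

# Time to close in hours; stays NaN for issues that are still open
issues["time_to_close_hours"] = (
    issues["closed_at"] - issues["created_at"]
).dt.total_seconds() / 3600
median_hours = issues["time_to_close_hours"].median()

print(f"open issues: {open_count}, median hours to close: {median_hours}")
```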


In [18]:
# read in and process issue dfs for all repos
# then upload the processed df to s3 as a parquet file
for sig in sigs_file["sigs"]:
    for sp in sig["subprojects"]:
        for owners_slug in sp["owners"]:
            # parse org, repo names
            org, repo = get_org_repo_from_owners_slug(owners_slug)

            # create issues df for org, repo
            issue_df = get_issue_metrics(sig=sig["dir"], org=org, repo=repo)

            # upload to bucket
            s3c.upload_df_to_s3(
                df=issue_df,
                s3_prefix="open-services-group/metrics/github/issues",
                s3_key=f"{org}-{repo}.parquet",
            )

## PR Metrics

Now, let's load the PR data collected for the repository and prepare it for deriving metrics.

In [19]:
def get_pr_metrics(sig, org, repo):
    # read repo data we fetched in previous step via MI
    pr = PullRequest(f"{org}/{repo}")
    pr_df = pr.load_previous_knowledge(is_local=True)
    pr_df = pr_df.reset_index()

    # Retain only relevant columns
    pr_cols_to_drop = ["interactions", "reviews", "labels", "commits", "changed_files"]
    prs_df = pr_df.drop(columns=pr_cols_to_drop)

    # add sig, org, repo columns
    prs_df["sig"] = sig
    prs_df["org"] = org
    prs_df["repo"] = repo

    return prs_df

In [20]:
pr_df = get_pr_metrics(sig="sig-data-science", org="os-climate", repo="aicoe-osc-demo")
pr_df.head()

Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,first_review_at,first_approve_at,sig,org,repo
0,148,Update table curator to work w changes in kpi ...,pdf text curation notebook imports curation cl...,S,chauhankaranraj,2022-04-14 14:27:32,2022-04-14 14:30:01,Shreyanand,2022-04-14 14:30:01,Shreyanand,1,1,2022-04-14 14:29:15,2022-04-14 14:29:15,sig-data-science,os-climate,aicoe-osc-demo
1,147,Downgrade markupsafe to fix soft_unicode depre...,Fixes os-climate/os_c_data_commons#162,XL,chauhankaranraj,2022-04-14 14:07:34,2022-04-14 14:25:54,Shreyanand,2022-04-14 14:25:54,Shreyanand,1,2,2022-04-14 14:09:22,NaT,sig-data-science,os-climate,aicoe-osc-demo
2,144,Remove s3c from kpi-mapping.py,Closes #69 \r\n\r\nThis PR makes following cha...,XXL,Shreyanand,2022-04-08 19:13:26,2022-04-13 14:59:49,chauhankaranraj,2022-04-13 14:59:49,chauhankaranraj,1,15,2022-04-11 20:20:52,2022-04-13 14:53:50,sig-data-science,os-climate,aicoe-osc-demo
3,143,Update base image version in Dockerfile. Updat...,Update image so that it works in the new Elyra...,XXL,chauhankaranraj,2022-04-05 18:10:23,2022-04-05 19:00:15,MichaelClifford,2022-04-05 19:00:15,MichaelClifford,1,2,NaT,NaT,sig-data-science,os-climate,aicoe-osc-demo
4,140,Step 2.5 Fine Tune model,Signed-off-by: Francesco Murdaca <fmurdaca@red...,XXL,pacospace,2022-03-16 10:50:53,NaT,,NaT,,2,3,NaT,NaT,sig-data-science,os-climate,aicoe-osc-demo
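
The PR columns shown above (`size`, `created_at`, `first_review_at`, `merged_at`) lend themselves to review-latency metrics. As a rough sketch on invented sample rows, the snippet below computes the median hours to first review and to merge per PR size, plus the overall share of merged PRs. The helper name `hours_between` is introduced here for illustration only.

```python
import pandas as pd

# Invented sample with a subset of the PR data frame's columns
prs = pd.DataFrame(
    {
        "id": [10, 11, 12],
        "size": ["S", "XL", "S"],
        "created_at": pd.to_datetime(
            ["2022-04-01 08:00:00", "2022-04-02 08:00:00", "2022-04-03 08:00:00"]
        ),
        "first_review_at": pd.to_datetime(
            ["2022-04-01 09:00:00", "2022-04-02 12:00:00", None]
        ),
        "merged_at": pd.to_datetime(
            ["2022-04-01 10:00:00", "2022-04-03 08:00:00", None]
        ),
    }
)


def hours_between(start, end):
    """Elapsed hours between two datetime columns (NaN where either is NaT)."""
    return (end - start).dt.total_seconds() / 3600


prs["hours_to_first_review"] = hours_between(prs["created_at"], prs["first_review_at"])
prs["hours_to_merge"] = hours_between(prs["created_at"], prs["merged_at"])

# Median latency per PR size; unreviewed or unmerged PRs are ignored by median()
summary = prs.groupby("size")[["hours_to_first_review", "hours_to_merge"]].median()
merge_rate = prs["merged_at"].notna().mean()

print(summary)
print(f"merge rate: {merge_rate:.2f}")
```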


In [22]:
# read in and process pr dfs for all repos
# then upload the processed df to s3 as a parquet file
for sig in sigs_file["sigs"]:
    for sp in sig["subprojects"]:
        for owners_slug in sp["owners"]:
            # parse org, repo names
            org, repo = get_org_repo_from_owners_slug(owners_slug)

            # create prs df for org, repo
            pr_df = get_pr_metrics(sig=sig["dir"], org=org, repo=repo)

            # upload to bucket
            s3c.upload_df_to_s3(
                df=pr_df,
                s3_prefix="open-services-group/metrics/github/prs",
                s3_key=f"{org}-{repo}.parquet",
            )

## Contributor Metrics

Next, let's fetch the events for the repository. This table will be used to find information about contributors, their affiliations, and the events they generate in the repository.

In [23]:
# To do
# In the current version we can only get data for the last 90 days or 300 events,
# as limited by the GitHub API
# Add a loop for getting older data
# Probably, get all events for a month and then loop to the next month
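
Since the events endpoint only exposes a recent window, one possible way to build up history is to run the collection on a schedule and merge each new snapshot with the events stored previously. The sketch below is an assumption about how that could look, not part of the current pipeline: `merge_event_snapshots` is a hypothetical helper, and the two data frames stand in for a stored parquet file and a fresh API pull.

```python
import pandas as pd


def merge_event_snapshots(previous_df, new_df):
    """Combine two event snapshots, keeping one row per event id."""
    combined = pd.concat([previous_df, new_df], ignore_index=True)
    # The same event can appear in both snapshots; keep the newest copy
    combined = combined.drop_duplicates(subset="id", keep="last")
    return combined.sort_values("created_at").reset_index(drop=True)


# Invented snapshots: event "2" appears in both
previous = pd.DataFrame(
    {
        "id": ["1", "2"],
        "created_at": pd.to_datetime(["2022-01-01", "2022-02-01"]),
        "type": ["IssuesEvent", "IssueCommentEvent"],
    }
)
new = pd.DataFrame(
    {
        "id": ["2", "3"],
        "created_at": pd.to_datetime(["2022-02-01", "2022-03-01"]),
        "type": ["IssueCommentEvent", "PullRequestEvent"],
    }
)

merged = merge_event_snapshots(previous, new)
print(merged)
```

With this approach the stored table grows over time, as long as the job runs often enough that no events fall outside the API window between runs.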

In [24]:
def get_events_df(sig, org, repo):
    ## Get internal members of OSG group
    gth_obj = Github(GITHUB_ACCESS_TOKEN)
    osg = gth_obj.get_organization("open-services-group")
    internal = [m.login for m in osg.get_members()]

    gh_repo_obj = gth_obj.get_repo(f"{org}/{repo}")

    # Define events we are interested in
    issue_event_types = ["IssueCommentEvent", "IssuesEvent"]

    pr_event_types = [
        "PullRequestEvent",
        "PullRequestReviewEvent",
        "PullRequestReviewCommentEvent",
    ]

    # Create the events data frame
    rows = []
    for e in gh_repo_obj.get_events():
        if e.type in issue_event_types or e.type in pr_event_types:
            event_id = e.id
            created_at = e.created_at
            event_contributor_id = e.actor.id
            event_contributor = e.actor.login
            event_type = e.type
            event_action = e.payload["action"]
            if event_type in issue_event_types:
                issue_or_pr_id = e.payload["issue"]["number"]
            else:
                issue_or_pr_id = e.payload["pull_request"]["number"]
            rows.append(
                [
                    event_id,
                    created_at,
                    event_contributor_id,
                    event_contributor,
                    event_type,
                    event_action,
                    issue_or_pr_id,
                ]
            )

    # Add column names for events data frame
    column_name = [
        "id",
        "created_at",
        "contributor_id",
        "contributor_name",
        "type",
        "action",
        "issue_or_pr_id",
    ]

    # Add other required columns
    events_df = pd.DataFrame(data=rows, columns=column_name)
    events_df["org"] = org
    events_df["repo"] = repo
    events_df["sig"] = sig
    events_df["internal_contributor"] = events_df["contributor_name"].apply(
        lambda x: x in internal
    )

    return events_df

In [25]:
events_df = get_events_df(
    sig="sig-data-science", org="os-climate", repo="aicoe-osc-demo"
)
events_df.head()

Unnamed: 0,id,created_at,contributor_id,contributor_name,type,action,issue_or_pr_id,org,repo,sig,internal_contributor
0,21404226591,2022-04-21 19:04:14,90428947,HeatherAck,IssueCommentEvent,created,150,os-climate,aicoe-osc-demo,sig-data-science,False
1,21401943596,2022-04-21 16:38:12,82595185,JeremyGohBNP,IssueCommentEvent,created,150,os-climate,aicoe-osc-demo,sig-data-science,False
2,21401915177,2022-04-21 16:36:28,8916126,Shreyanand,IssueCommentEvent,created,150,os-climate,aicoe-osc-demo,sig-data-science,True
3,21401869003,2022-04-21 16:33:44,82595185,JeremyGohBNP,IssueCommentEvent,created,150,os-climate,aicoe-osc-demo,sig-data-science,False
4,21401696347,2022-04-21 16:24:09,8916126,Shreyanand,IssueCommentEvent,created,150,os-climate,aicoe-osc-demo,sig-data-science,True
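
To illustrate how the `internal_contributor` flag can be used, here is a small sketch on invented rows that counts events by affiliation and event type, and counts distinct contributors per affiliation. The `affiliation` column and the contributor names are introduced only for this example.

```python
import pandas as pd

# Invented sample with a subset of the events data frame's columns
events = pd.DataFrame(
    {
        "contributor_name": ["alice", "bob", "alice", "carol"],
        "type": [
            "IssueCommentEvent",
            "IssuesEvent",
            "PullRequestEvent",
            "IssueCommentEvent",
        ],
        "internal_contributor": [True, False, True, False],
    }
)

# Readable labels instead of True/False
events["affiliation"] = events["internal_contributor"].map(
    {True: "internal", False: "external"}
)

# Events per affiliation and type (0 where a combination never occurs)
counts = events.groupby(["affiliation", "type"]).size().unstack(fill_value=0)

# Distinct contributors per affiliation
contributors = events.groupby("affiliation")["contributor_name"].nunique()

print(counts)
print(contributors)
```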


In [26]:
# read in and process events dfs for all repos
# then upload the processed df to s3 as a parquet file
for sig in sigs_file["sigs"]:
    for sp in sig["subprojects"]:
        for owners_slug in sp["owners"]:
            # parse org, repo names
            org, repo = get_org_repo_from_owners_slug(owners_slug)

            # create events df for org, repo
            events_df = get_events_df(sig=sig["dir"], org=org, repo=repo)

            # upload to bucket
            s3c.upload_df_to_s3(
                df=events_df,
                s3_prefix="open-services-group/metrics/github/events",
                s3_key=f"{org}-{repo}.parquet",
            )

## Create Trino Tables

Now that we have the processed data frames stored as parquet files in S3, we can generate [Trino](https://trino.io/) tables from them so that interactive dashboards can be built in [Superset](https://superset.apache.org/). We will connect to the [Operate First Trino](https://trino.operate-first.cloud/) instance.

In [27]:
# Map pandas column dtypes to the corresponding data types supported in Trino/Superset
_p2smap = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))


# Generate the Trino table schema
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0], t=e[1]) for e in pz])
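
The lookup table above matches dtypes by their exact string name, so variants such as `int32` or a timezone-aware datetime raise a `ValueError`. As an alternative sketch (not the notebook's method), the dtype checks in `pandas.api.types` can classify columns by family instead. The function names `dtype_to_trino` and `schema_pairs` are hypothetical, and falling back to `varchar` for anything unrecognized is an assumption that may not suit every column.

```python
import pandas as pd
from pandas.api import types as ptypes


def dtype_to_trino(dtype):
    """Map a pandas dtype to a Trino type name by dtype family."""
    if ptypes.is_bool_dtype(dtype):
        return "boolean"
    if ptypes.is_integer_dtype(dtype):
        return "bigint"
    if ptypes.is_float_dtype(dtype):
        return "double"
    if ptypes.is_datetime64_any_dtype(dtype):
        return "timestamp"
    # Assumption: everything else (strings, mixed objects) is stored as text
    return "varchar"


def schema_pairs(df):
    """Build the 'name type' lines used inside a CREATE TABLE statement."""
    return ",\n".join(
        f"    {name} {dtype_to_trino(dtype)}" for name, dtype in df.dtypes.items()
    )


sample = pd.DataFrame(
    {
        "id": [1, 2],
        "title": ["a", "b"],
        "score": [0.5, 1.5],
        "created_at": pd.to_datetime(["2022-04-21", "2022-04-22"]),
        "internal": [True, False],
    }
)
print(schema_pairs(sample))
```

The trade-off is that unknown dtypes no longer fail loudly, so the declared Trino type should still be checked against the type actually written to the parquet file.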

In [28]:
# Create a Trino client
conn = trino.dbapi.connect(
    auth=trino.auth.BasicAuthentication(
        os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]
    ),
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    http_scheme="https",
    verify=True,
)
cur = conn.cursor()

In [29]:
# Check if Trino connection was successful
cur.execute("show catalogs")
cur.fetchall()[1]

['data_science_general']

In [30]:
def create_table_from_df(df, table_name, s3_prefix):
    # Create the table with data populated from parquet file
    schema = generate_table_schema_pairs(df)

    tabledef = """create table if not exists data_science_general.default.{table_name}(
    {schema}
    ) with (
        format = 'parquet',
        external_location = 's3a://{s3_bucket}/{s3_prefix}'
    )""".format(
        table_name=table_name,
        schema=schema,
        s3_prefix=s3_prefix,
        s3_bucket=os.environ["S3_BUCKET"],
    )

    cur.execute(tabledef)
    return cur.fetchall()
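
Because the statement is assembled with string formatting, it can help to build and inspect it before sending it to Trino. The sketch below separates SQL construction from execution and rejects table names that are not plain identifiers. `build_create_table_sql` is a hypothetical helper and `my-bucket` is a placeholder bucket name; the catalog and schema defaults mirror the ones used above.

```python
import re


def build_create_table_sql(
    table_name,
    schema,
    s3_bucket,
    s3_prefix,
    catalog="data_science_general",
    db="default",
):
    """Return the CREATE TABLE statement as a string without executing it."""
    # Only allow simple identifiers, since the name is interpolated into SQL
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table_name):
        raise ValueError(f"invalid table name: {table_name!r}")
    return (
        f"create table if not exists {catalog}.{db}.{table_name} (\n"
        f"{schema}\n"
        ") with (\n"
        "    format = 'parquet',\n"
        f"    external_location = 's3a://{s3_bucket}/{s3_prefix}'\n"
        ")"
    )


sql = build_create_table_sql(
    table_name="issues",
    schema="    id bigint,\n    title varchar",
    s3_bucket="my-bucket",  # placeholder, not the real bucket
    s3_prefix="open-services-group/metrics/github/issues",
)
print(sql)
```

The returned string could then be passed to `cur.execute(...)` exactly as in the cell above.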

In [34]:
create_table_from_df(
    issue_df, table_name="issues", s3_prefix="open-services-group/metrics/github/issues"
)

[[True]]

In [35]:
create_table_from_df(
    pr_df, table_name="prs", s3_prefix="open-services-group/metrics/github/prs"
)

[[True]]

In [36]:
create_table_from_df(
    events_df,
    table_name="events",
    s3_prefix="open-services-group/metrics/github/events",
)

[[True]]

## Conclusion

In this notebook we:

- Fetched GitHub Issue/PR data for the repositories listed in `sigs.yaml` using the MI `srcopsmetrics` module
- Fetched recent repository events via the GitHub API to build a contributors table
- Pre-processed the raw data into data frames with relevant columns
- Uploaded the processed data frames as parquet files to an S3 bucket
- Created Trino tables for the generated parquet files

We can now further explore the GitHub data obtained for different repos/orgs and create interactive visualization dashboards in [Superset](https://superset.operate-first.cloud/).