# GitHub Repository (`thoth-station/support`) Metric Analysis

In this notebook, we fetch the GitHub issue and pull request (PR) data for the `support` repository in the [thoth-station](https://github.com/thoth-station/support) organization using the [MI tool](https://github.com/thoth-station/mi), pre-process the raw data into suitable data frames, and store them as parquet files in an S3 bucket.

In [1]:
import os
import yaml
import requests
from dotenv import find_dotenv, load_dotenv
import warnings
import trino
from s3_communication import S3Communication
from github import Github
import pandas as pd

warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

True

In [2]:
# Create a .env file on your local machine with the correct configs
GITHUB_ACCESS_TOKEN = os.getenv("GITHUB_ACCESS_TOKEN")

s3_bucket = os.getenv("S3_BUCKET")
s3_endpoint_url = os.getenv("S3_ENDPOINT")
aws_access_key_id = os.getenv("S3_ACCESS_KEY")
aws_secret_access_key = os.getenv("S3_SECRET_KEY")
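Because `os.getenv` returns `None` for an unset variable, a missing entry in `.env` may only surface later, when a connection is attempted. A minimal sketch of an early check, assuming the variable names used in the cell above; the helper name `missing_env_vars` is hypothetical:

```python
import os

# Variable names taken from the cell above.
REQUIRED_VARS = [
    "GITHUB_ACCESS_TOKEN",
    "S3_BUCKET",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary as `env` makes the helper easy to try out without touching the real environment.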

In [3]:
# Note: The GitHub access token needs to be exported before importing the srcopsmetrics package (current bug)
from srcopsmetrics.entities.issue import Issue
from srcopsmetrics.entities.pull_request import PullRequest

In [4]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url, aws_access_key_id, aws_secret_access_key, s3_bucket
)

In [5]:
# repo from where the data is extracted
org_repo = ["thoth-station/support"]

In [6]:
# Remove any existing old data (-f: no error if the directory does not exist yet)
!rm -rf srcopsmetrics/


for org_repo_slug in org_repo:
    # split "org/repo" into its two parts without reusing the loop variable
    org, repo = org_repo_slug.split("/")
    print(f"******----->>Extracting data from {org}/{repo}")
    !python -m srcopsmetrics.cli -clr $org/$repo -e Issue,PullRequest # noqa: E999

******----->>Extracting data from thoth-station/support
INFO:srcopsmetrics.github_knowledge:Overall repositories found: 1
INFO:srcopsmetrics.bot_knowledge:######################## Analysing thoth-station/support ########################

INFO:srcopsmetrics.utils:No repo identified, creating new directory at /opt/app-root/src/metrics/notebooks/srcopsmetrics/bot_knowledge/thoth-station/support
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Detected entities:
CodeFrequency # Commit # DependencyUpdate # Fork # Issue # IssueEvent # KebechetUpdateManager # License # PullRequest # PullRequestDiscussion # RawIssue # RawPullRequest # ReadMe # Release # Stargazer # ThothAdviseMetrics # ThothMetrics # ThothMetrics # ThothMetrics # ThothVersionManagerMetrics # TrafficClones # TrafficPaths # TrafficPaths # TrafficReferrers # TrafficClones # TrafficViews
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Issue inspec

## Issue

Now, let's fetch the issues for the repository and save them as a data frame in an S3 bucket.

In [7]:
def get_issue_metrics(org, repo):
    # read repo data we fetched in previous step via MI
    issue = Issue(f"{org}/{repo}")
    issue_df = issue.load_previous_knowledge(is_local=True)
    issue_df = issue_df.reset_index()

    # Retain only relevant columns
    issue_cols_to_drop = ["labels", "interactions"]
    issue_df = issue_df.drop(columns=issue_cols_to_drop)

    # add org and repo columns
    issue_df["org"] = org
    issue_df["repo"] = repo

    return issue_df

In [8]:
# let's read a sample df
issue_df = get_issue_metrics(org="thoth-station", repo="support")
issue_df.head()

| | id | title | body | created_by | created_at | closed_by | closed_at | org | repo |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 223 | Kebechet update manager: GitCommandError on ke... | `## Description\nThis is an automated issue gen...` | khebhut[bot] | 2022-06-01 12:49:54 | | NaT | thoth-station | support |
| 1 | 222 | Kebechet version manager: GithubAPIException o... | `## Description\nThis is an automated issue gen...` | khebhut[bot] | 2022-06-01 05:46:15 | | NaT | thoth-station | support |
| 2 | 221 | Kebechet update manager: InternalError on kebe... | `## Description\nThis is an automated issue gen...` | khebhut[bot] | 2022-05-24 12:50:03 | | NaT | thoth-station | support |
| 3 | 220 | Include package osc-ingest-tools in recommenda... | `### Package name\n\nosc-ingest-tools\n\n### Py...` | fridex | 2022-05-23 13:58:23 | | NaT | thoth-station | support |
| 4 | 219 | the URL that points to an advise's results giv... | `**Describe the bug**\r\n\r\nWhen I ask for an ...` | codificat | 2022-05-20 16:06:03 | | NaT | thoth-station | support |


In [9]:
# upload the processed issue df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=issue_df,
    s3_prefix="open-services-group/metrics/thoth-support-github/issues",
    s3_key="thoth_support_issues.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx000000000000000075813-00629a0fc1-2f12b8-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"6d80922111b0ab9fcfb5144d036857ca"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx000000000000000075813-00629a0fc1-2f12b8-ocs-storagecluster-cephobjectstore',
   'date': 'Fri, 03 Jun 2022 13:42:26 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=8a437562f974804482699da2db9fb9f7; path=/; HttpOnly'},
  'RetryAttempts': 2},
 'ETag': '"6d80922111b0ab9fcfb5144d036857ca"'}

## Pull-Request

Now, let's fetch the PRs for the repository and save them as a data frame in an S3 bucket.

In [10]:
def get_pr_metrics(org, repo):
    # read repo data we fetched in previous step via MI
    pr = PullRequest(f"{org}/{repo}")
    pr_df = pr.load_previous_knowledge(is_local=True)
    pr_df = pr_df.reset_index()

    # Retain only relevant columns
    pr_cols_to_drop = ["interactions", "reviews", "labels", "commits", "changed_files"]
    prs_df = pr_df.drop(columns=pr_cols_to_drop)

    # add org and repo columns
    prs_df["org"] = org
    prs_df["repo"] = repo

    return prs_df

In [11]:
pr_df = get_pr_metrics(org="thoth-station", repo="support")
pr_df.head()

| | id | title | body | size | created_by | created_at | closed_at | closed_by | merged_at | merged_by | commits_number | changed_files_number | first_review_at | first_approve_at | org | repo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 195 | Thoth Configuration Initialization | `## Automatic configuration initialization\nThe...` | M | khebhut[bot] | 2022-03-02 07:53:44 | NaT | | NaT | | 1 | 1 | NaT | NaT | thoth-station | support |
| 1 | 146 | Thoth Configuration Initialization | `## Automatic configuration initialization\nThe...` | M | khebhut[bot] | 2021-11-18 14:17:40 | 2022-02-17 09:32:16 | sesheta | NaT | | 1 | 1 | NaT | NaT | thoth-station | support |
| 2 | 116 | Add optional Python package index URL to the p... | `Let's have also this field as per discussion w...` | XS | fridex | 2021-10-18 11:11:44 | 2021-10-18 11:24:04 | fridex | 2021-10-18 11:24:04 | fridex | 1 | 1 | 2021-10-18 11:15:15 | 2021-10-18 11:15:15 | thoth-station | support |
| 3 | 115 | Template for registering a Python package index | `See https://github.com/thoth-station/support/p...` | M | fridex | 2021-10-18 11:02:32 | 2021-10-18 11:24:32 | fridex | 2021-10-18 11:24:32 | fridex | 1 | 1 | 2021-10-18 11:14:45 | 2021-10-18 11:14:45 | thoth-station | support |
| 4 | 99 | Add template for requesting a package in Thoth... | | M | fridex | 2021-10-11 19:42:39 | 2021-10-18 10:53:53 | fridex | 2021-10-18 10:53:53 | fridex | 1 | 1 | 2021-10-18 10:45:16 | 2021-10-18 10:52:07 | thoth-station | support |
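
The timestamp columns make simple review metrics easy to derive. As a worked example, the sketch below computes the time to first review and the time to merge for PR 116 from the values shown in the output above, using only the standard library. It illustrates what the columns allow; it is not a step this notebook performs.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Values copied from the row for PR 116 in the output above.
created_at = datetime.strptime("2021-10-18 11:11:44", FMT)
first_review_at = datetime.strptime("2021-10-18 11:15:15", FMT)
merged_at = datetime.strptime("2021-10-18 11:24:04", FMT)

# Subtracting two datetimes yields a timedelta.
time_to_first_review = first_review_at - created_at
time_to_merge = merged_at - created_at

print(time_to_first_review)  # 0:03:31
print(time_to_merge)  # 0:12:20
```

Rows where `merged_at` or `first_review_at` is `NaT` (such as PRs 195 and 146) would need to be skipped or handled separately in any such calculation.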


In [12]:
# upload the processed PR df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=pr_df,
    s3_prefix="open-services-group/metrics/thoth-support-github/prs",
    s3_key="thoth_support_prs.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx000000000000000075814-00629a0fc2-2f12b8-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"ec919117d21d9345f53053fd2ffc7a30"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx000000000000000075814-00629a0fc2-2f12b8-ocs-storagecluster-cephobjectstore',
   'date': 'Fri, 03 Jun 2022 13:42:26 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=8a437562f974804482699da2db9fb9f7; path=/; HttpOnly'},
  'RetryAttempts': 1},
 'ETag': '"ec919117d21d9345f53053fd2ffc7a30"'}

## Create Trino Tables

Now that we have the processed data frames stored as parquet files in S3, we can generate [Trino](https://trino.io/) tables from them so that interactive dashboards can be built in [Superset](https://superset.apache.org/). We will connect to the [Operate First Trino](https://trino.operate-first.cloud/) instance.

In [13]:
# Map pandas column dtypes to SQL types supported by Trino/Superset
_p2smap = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))


# Generate the Trino table schema
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0], t=e[1]) for e in pz])
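
To make the output of `generate_table_schema_pairs` concrete, here is a pandas-free sketch of the same mapping. It takes `(column, dtype)` string pairs directly instead of a data frame; the function name `schema_from_pairs` and the choice of sample columns are illustrative only, and the dtypes listed are assumed rather than read from the real data frame.

```python
# Same pandas-to-SQL type mapping as in the cell above.
P2S_MAP = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def schema_from_pairs(pairs):
    """Render (column_name, pandas_dtype) pairs as Trino column definitions."""
    lines = []
    for name, dtype in pairs:
        if dtype not in P2S_MAP:
            raise ValueError(f"unexpected pandas column type '{dtype}'")
        lines.append(f"    {name} {P2S_MAP[dtype]}")
    return ",\n".join(lines)


# A few columns from the issue data frame; the dtypes here are assumptions.
sample = [("id", "int64"), ("title", "object"), ("created_at", "datetime64[ns]")]
print(schema_from_pairs(sample))
```

Any dtype outside the map, for example a timezone-aware `datetime64[ns, UTC]`, raises `ValueError`, so such columns would need converting before the schema is generated.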

In [14]:
# Create a Trino client
conn = trino.dbapi.connect(
    auth=trino.auth.BasicAuthentication(
        os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]
    ),
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    http_scheme="https",
    verify=True,
)
cur = conn.cursor()

In [15]:
# Check if Trino connection was successful
cur.execute("show catalogs")
cur.fetchall()[2]

['data_science_general_morty']
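
Indexing the result with `[2]` depends on the order in which Trino lists its catalogs. A slightly more robust check looks the catalog up by name. The sketch below runs against a hard-coded list shaped like the `cur.fetchall()` result (one single-element row per catalog); the catalog names other than `data_science_general_morty` are made up, and `has_catalog` is a hypothetical helper.

```python
def has_catalog(rows, name):
    """Return True if any row of a SHOW CATALOGS result names the given catalog."""
    return any(len(row) > 0 and row[0] == name for row in rows)


# Stand-in for cur.fetchall(); only the last name comes from the notebook output.
rows = [["example_catalog_a"], ["example_catalog_b"], ["data_science_general_morty"]]
print(has_catalog(rows, "data_science_general_morty"))
```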

In [16]:
def create_table_from_df(df, table_name, s3_prefix):
    # Create the table with data populated from parquet file
    schema = generate_table_schema_pairs(df)

    tabledef = """CREATE TABLE IF NOT EXISTS data_science_general_morty.default.{table_name}(
    {schema}
    ) with (
        format = 'parquet',
        external_location = 's3a://{s3_bucket}/{s3_prefix}'
    )""".format(
        table_name=table_name,
        schema=schema,
        s3_prefix=s3_prefix,
        s3_bucket=os.environ["S3_BUCKET"],
    )

    cur.execute(tabledef)
    return cur.fetchall()
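
Before running the statement against Trino, it can help to look at the SQL that will be sent. This sketch fills the same kind of template without executing anything. The bucket name `example-bucket` is a placeholder, and `render_table_ddl` is a hypothetical helper that is not part of the notebook.

```python
DDL_TEMPLATE = """CREATE TABLE IF NOT EXISTS data_science_general_morty.default.{table_name}(
{schema}
) with (
    format = 'parquet',
    external_location = 's3a://{s3_bucket}/{s3_prefix}'
)"""


def render_table_ddl(table_name, schema, s3_bucket, s3_prefix):
    """Fill the CREATE TABLE template and return the SQL string without executing it."""
    return DDL_TEMPLATE.format(
        table_name=table_name,
        schema=schema,
        s3_bucket=s3_bucket,
        s3_prefix=s3_prefix,
    )


ddl = render_table_ddl(
    table_name="thoth_support_issues",
    schema="    id bigint,\n    title varchar",
    s3_bucket="example-bucket",  # placeholder, not the real bucket
    s3_prefix="open-services-group/metrics/thoth-support-github/issues",
)
print(ddl)
```

Note that `external_location` points at the prefix (a directory), not at an individual parquet file, so each table needs its own prefix. The issue and PR files above are uploaded under separate prefixes, which fits this layout.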

In [17]:
create_table_from_df(
    issue_df,
    table_name="thoth_support_issues",
    s3_prefix="open-services-group/metrics/thoth-support-github/issues",
)

[[True]]

In [18]:
create_table_from_df(
    pr_df,
    table_name="thoth_support_prs",
    s3_prefix="open-services-group/metrics/thoth-support-github/prs",
)

[[True]]

## Conclusion

In this notebook we:

- Fetched GitHub Issue/PR data for a specified org/repo using the MI srcopsmetrics module.
- Pre-processed the raw data into data frames with relevant columns.
- Uploaded the processed data frames as parquet files to an S3 bucket.
- Created Trino tables backed by the generated parquet files.

We can now further explore the GitHub data and create interactive visualization dashboards in Superset.