# GitHub Repository (`thoth-station/support`) Metric Analysis

In this notebook, we fetch GitHub issue and pull request (PR) data for the [thoth-station/support](https://github.com/thoth-station/support) repository using the [MI tool](https://github.com/thoth-station/mi), pre-process the raw data into suitable data frames, and store them as parquet files in an S3 bucket.

In [1]:
import os
import yaml
import requests
from dotenv import find_dotenv, load_dotenv
import warnings
import trino
from s3_communication import S3Communication
from github import Github
import pandas as pd
import numpy as np

warnings.filterwarnings("ignore")
load_dotenv(find_dotenv())

True

In [2]:
# Create a local .env file with the correct configuration values
GITHUB_ACCESS_TOKEN = os.getenv("GITHUB_ACCESS_TOKEN")

s3_bucket = os.getenv("S3_BUCKET")
s3_endpoint_url = os.getenv("S3_ENDPOINT")
aws_access_key_id = os.getenv("S3_ACCESS_KEY")
aws_secret_access_key = os.getenv("S3_SECRET_KEY")
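
Because `os.getenv` returns `None` for unset variables, a missing `.env` entry only shows up later as a less obvious connection error. The following is a minimal sketch of an early check; the helper name `missing_env_vars` is hypothetical, and the variable names are the ones read in the cell above.

```python
import os

# Variable names read by the configuration cell above.
REQUIRED_VARS = [
    "GITHUB_ACCESS_TOKEN",
    "S3_BUCKET",
    "S3_ENDPOINT",
    "S3_ACCESS_KEY",
    "S3_SECRET_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing configuration:", ", ".join(missing))
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.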

In [3]:
# Note: The GitHub access token needs to be exported before importing the srcopsmetrics package (current bug)
from srcopsmetrics.entities.issue import Issue
from srcopsmetrics.entities.pull_request import PullRequest

In [4]:
# init s3 connector
s3c = S3Communication(
    s3_endpoint_url, aws_access_key_id, aws_secret_access_key, s3_bucket
)

In [5]:
# repo from where the data is extracted
org_repo = ["thoth-station/support"]

In [6]:
# Remove any existing old data
!rm -rf srcopsmetrics/


for full_name in org_repo:
    org, repo = full_name.split("/")
    print(f"******----->>Extracting data from {org}/{repo}")
    !python -m srcopsmetrics.cli -clr $org/$repo -e Issue,PullRequest # noqa: E999

******----->>Extracting data from thoth-station/support
INFO:srcopsmetrics.github_knowledge:Overall repositories found: 1
INFO:srcopsmetrics.bot_knowledge:######################## Analysing thoth-station/support ########################

INFO:srcopsmetrics.utils:No repo identified, creating new directory at /opt/app-root/src/metrics/notebooks/srcopsmetrics/bot_knowledge/thoth-station/support
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Detected entities:
CodeFrequency # Commit # DependencyUpdate # Fork # Issue # IssueEvent # KebechetUpdateManager # License # PullRequest # PullRequestDiscussion # RawIssue # RawPullRequest # ReadMe # Release # Stargazer # ThothAdviseMetrics # ThothMetrics # ThothMetrics # ThothMetrics # ThothVersionManagerMetrics # TrafficClones # TrafficPaths # TrafficPaths # TrafficReferrers # TrafficClones # TrafficViews
INFO:srcopsmetrics.bot_knowledge:########################
INFO:srcopsmetrics.bot_knowledge:Issue inspec
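
The `!python -m ...` line relies on IPython's shell escape, so it does not run in a plain Python script. Below is a hedged sketch of the same invocation through `subprocess`. The helpers `build_mi_command` and `extract` are hypothetical, and the flags are copied verbatim from the cell above rather than taken from the MI documentation.

```python
import subprocess
import sys


def build_mi_command(org_repo, entities=("Issue", "PullRequest")):
    """Build the argv list for the MI CLI call used in the cell above."""
    return [
        sys.executable, "-m", "srcopsmetrics.cli",
        "-clr", org_repo,
        "-e", ",".join(entities),
    ]


def extract(org_repo):
    """Run the extraction; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_mi_command(org_repo), check=True)


# Only print the command here; call extract(...) to actually run it.
print(build_mi_command("thoth-station/support"))
```

Building the command as a list avoids shell quoting issues, and `check=True` turns a failed extraction into an exception instead of a silently ignored exit code.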

## Issue

Now, let's fetch the issues for the repository and save them as a data frame in an S3 bucket.

In [7]:
def get_issue_metrics(org, repo):
    # read repo data we fetched in previous step via MI
    issue = Issue(f"{org}/{repo}")
    issue_df = issue.load_previous_knowledge(is_local=True)
    issue_df = issue_df.reset_index()

    # add org and repo columns
    issue_df["org"] = org
    issue_df["repo"] = repo

    return issue_df

In [8]:
# let's read a sample df
issue_df = get_issue_metrics(org="thoth-station", repo="support")
issue_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions,first_response_at,commenters_number,comments_number,comments,cross_references,cross_references_number,org,repo
0,238,Kebechet version manager: IssueTrackerDisabled...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-30 15:59:12,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1656...",{'sesheta': 141},"{'created_at': 1656604760, 'created_by': 'sesh...",1,2,"[{'created_at': 1656604760, 'created_by': 'ses...",[],0,thoth-station,support
1,237,thamos advise adds a duplicate index/source,**Describe the bug**\r\n\r\nWhen I request a `...,codificat,2022-06-28 15:19:34,,NaT,"{'kind/bug': {'color': 'e11d21', 'labeled_at':...",{'codificat': 4},"{'created_at': 1656429622, 'created_by': 'codi...",1,1,"[{'created_at': 1656429622, 'created_by': 'cod...",[],0,thoth-station,support
2,236,Kebechet update manager: InternalError on kebe...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-21 18:11:00,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1655...",{'sesheta': 141},"{'created_at': 1655835067, 'created_by': 'sesh...",1,2,"[{'created_at': 1655835067, 'created_by': 'ses...",[],0,thoth-station,support
3,235,Kebechet version manager: GitCommandError on k...,## Description\nThis is an automated issue gen...,shreekarSS,2022-06-13 19:34:48,shreekarSS,2022-06-13 19:37:51,"{'needs-triage': {'color': 'ededed', 'labeled_...",{'sesheta': 141},"{'created_at': 1655148895, 'created_by': 'sesh...",1,2,"[{'created_at': 1655148895, 'created_by': 'ses...",[],0,thoth-station,support
4,234,Kebechet update manager: GithubAPIException on...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-13 12:27:39,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1655...",{'sesheta': 141},"{'created_at': 1655123266, 'created_by': 'sesh...",1,2,"[{'created_at': 1655123266, 'created_by': 'ses...",[],0,thoth-station,support


### Adding "first_comment_at" and "first_comment_by" columns:

In [9]:
df1 = issue_df["first_response_at"].apply(pd.Series)
df1.rename(
    columns={"created_at": "first_comment_at", "created_by": "first_comment_by"},
    inplace=True,
)
df1["first_comment_at"] = pd.to_datetime(df1["first_comment_at"], unit="s")
df1 = df1[["first_comment_at", "first_comment_by"]]

In [10]:
issue_df = pd.concat([issue_df, df1], axis=1)
issue_df.reset_index(drop=True, inplace=True)

In [11]:
issue_df.head()

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions,first_response_at,commenters_number,comments_number,comments,cross_references,cross_references_number,org,repo,first_comment_at,first_comment_by
0,238,Kebechet version manager: IssueTrackerDisabled...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-30 15:59:12,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1656...",{'sesheta': 141},"{'created_at': 1656604760, 'created_by': 'sesh...",1,2,"[{'created_at': 1656604760, 'created_by': 'ses...",[],0,thoth-station,support,2022-06-30 15:59:20,sesheta
1,237,thamos advise adds a duplicate index/source,**Describe the bug**\r\n\r\nWhen I request a `...,codificat,2022-06-28 15:19:34,,NaT,"{'kind/bug': {'color': 'e11d21', 'labeled_at':...",{'codificat': 4},"{'created_at': 1656429622, 'created_by': 'codi...",1,1,"[{'created_at': 1656429622, 'created_by': 'cod...",[],0,thoth-station,support,2022-06-28 15:20:22,codificat
2,236,Kebechet update manager: InternalError on kebe...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-21 18:11:00,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1655...",{'sesheta': 141},"{'created_at': 1655835067, 'created_by': 'sesh...",1,2,"[{'created_at': 1655835067, 'created_by': 'ses...",[],0,thoth-station,support,2022-06-21 18:11:07,sesheta
3,235,Kebechet version manager: GitCommandError on k...,## Description\nThis is an automated issue gen...,shreekarSS,2022-06-13 19:34:48,shreekarSS,2022-06-13 19:37:51,"{'needs-triage': {'color': 'ededed', 'labeled_...",{'sesheta': 141},"{'created_at': 1655148895, 'created_by': 'sesh...",1,2,"[{'created_at': 1655148895, 'created_by': 'ses...",[],0,thoth-station,support,2022-06-13 19:34:55,sesheta
4,234,Kebechet update manager: GithubAPIException on...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-13 12:27:39,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1655...",{'sesheta': 141},"{'created_at': 1655123266, 'created_by': 'sesh...",1,2,"[{'created_at': 1655123266, 'created_by': 'ses...",[],0,thoth-station,support,2022-06-13 12:27:46,sesheta
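
`Series.apply(pd.Series)` works, but it builds one Series per row, which can become slow on larger repositories. The sketch below shows an alternative on a toy frame: the first two rows reuse values from the sample output above, and the third row (id 1, no response) is made up for illustration. It assumes that rows without a first response hold an empty dict or a non-dict value.

```python
import pandas as pd

# Toy data: two rows from the sample output, plus a made-up issue without a response.
toy = pd.DataFrame(
    {
        "id": [238, 237, 1],
        "first_response_at": [
            {"created_at": 1656604760, "created_by": "sesheta"},
            {"created_at": 1656429622, "created_by": "codificat"},
            {},
        ],
    }
)

# Replace anything that is not a dict with an empty dict so the rows stay aligned.
records = [r if isinstance(r, dict) else {} for r in toy["first_response_at"]]
first = pd.DataFrame(records, index=toy.index).rename(
    columns={"created_at": "first_comment_at", "created_by": "first_comment_by"}
)

# Epoch seconds -> timestamps; missing values become NaT.
first["first_comment_at"] = pd.to_datetime(first["first_comment_at"], unit="s")

toy = toy.join(first[["first_comment_at", "first_comment_by"]])
print(toy[["id", "first_comment_at", "first_comment_by"]])
```

Constructing the frame with `index=toy.index` keeps the new columns aligned with the original rows, so no `reset_index` is needed before joining.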


---

### Adding columns for different labels

In [12]:
# Labels to concat to issue_df
labels = set(issue_df["labels"].apply(pd.Series).columns)
labels

{'area/byon',
 'area/prescriptions',
 'area/solver',
 'bot',
 'good first issue',
 'human_intervention_required',
 'kind/bug',
 'kind/documentation',
 'kind/feature',
 'kind/question',
 'lifecycle/active',
 'lifecycle/frozen',
 'lifecycle/rotten',
 'lifecycle/stale',
 'needs-sig',
 'needs-triage',
 'priority/awaiting-more-evidence',
 'priority/backlog',
 'priority/critical-urgent',
 'priority/important-longterm',
 'priority/important-soon',
 'sig/devsecops',
 'sig/stack-guidance',
 'sig/user-experience',
 'triage/accepted',
 'triage/duplicate',
 'triage/needs-information',
 'triage/not-reproducible'}

In [13]:
label_df = pd.DataFrame()
for label in labels:
    df = (
        (issue_df["labels"].apply(pd.Series))[label]
        .apply(pd.Series)["labeled_at"]
        .to_frame()
    )
    df.rename(columns={"labeled_at": label}, inplace=True)
    df[label] = df[label].notnull().astype(int)
    df.reset_index(drop=True, inplace=True)
    label_df = pd.concat([label_df, df], axis=1)
label_df.head(3)

Unnamed: 0,lifecycle/frozen,kind/feature,lifecycle/active,priority/backlog,kind/question,priority/important-soon,bot,triage/accepted,area/solver,triage/needs-information,...,priority/critical-urgent,priority/awaiting-more-evidence,sig/stack-guidance,triage/not-reproducible,area/byon,sig/user-experience,needs-sig,lifecycle/stale,needs-triage,lifecycle/rotten
0,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,0,0,0,0,0,1,0,1,0,0,...,0,0,1,0,0,0,1,0,1,0
2,0,0,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,1,0
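
The loop above re-expands the whole `labels` column once per label. As a sketch of a single-pass alternative, the toy example below builds the same kind of 0/1 indicator columns directly from the label dictionaries. The colours and `labeled_at` values are illustrative, and the approach assumes that the presence of a label key is equivalent to a non-null `labeled_at`.

```python
import pandas as pd

# Toy stand-in for issue_df["labels"]; colours and "labeled_at" values are made up.
toy = pd.DataFrame(
    {
        "labels": [
            {"bot": {"color": "698b69", "labeled_at": 1}},
            {
                "kind/bug": {"color": "e11d21", "labeled_at": 2},
                "triage/accepted": {"color": "ededed", "labeled_at": 3},
            },
            {},
        ]
    }
)

# Treat non-dict values (for example NaN) as "no labels".
rows = [r if isinstance(r, dict) else {} for r in toy["labels"]]
label_names = sorted({name for r in rows for name in r})

# One indicator column per label: 1 if the issue carries it, else 0.
label_df = pd.DataFrame(
    {name: [int(name in r) for r in rows] for name in label_names},
    index=toy.index,
)
print(label_df)
```

Sorting the label names also gives a stable column order, whereas iterating over a `set` can change the order between runs.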


In [14]:
# concat whole dataset
issue_df = pd.concat([issue_df, label_df], axis=1)

In [15]:
issue_df.head(2)

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions,first_response_at,...,priority/critical-urgent,priority/awaiting-more-evidence,sig/stack-guidance,triage/not-reproducible,area/byon,sig/user-experience,needs-sig,lifecycle/stale,needs-triage,lifecycle/rotten
0,238,Kebechet version manager: IssueTrackerDisabled...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-30 15:59:12,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1656...",{'sesheta': 141},"{'created_at': 1656604760, 'created_by': 'sesh...",...,0,0,0,0,0,0,1,0,1,0
1,237,thamos advise adds a duplicate index/source,**Describe the bug**\r\n\r\nWhen I request a `...,codificat,2022-06-28 15:19:34,,NaT,"{'kind/bug': {'color': 'e11d21', 'labeled_at':...",{'codificat': 4},"{'created_at': 1656429622, 'created_by': 'codi...",...,0,0,1,0,0,0,1,0,1,0


Next, we construct indicator columns for `Good_Quality_Issue`, `Triaged_Issue`, `Planned_Issue`, and `Issue_being_worked_on`. Each column is set to 1 for rows that satisfy the corresponding conditions and 0 otherwise, which lets us check the status of any issue.

### Conditions for `Good_Quality_Issue`

The required conditions for "Good_Quality_Issue" are:

- Issue includes a `kind` label.
- Issue includes a `priority` label.
- Issue includes a `sig` label.
- Issue title includes story points in the format `[3pt]`, or an issue descriptor in the format `[EPIC]` or `[SPIKE]`.

In [16]:
issue_df.head(1)

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions,first_response_at,...,priority/critical-urgent,priority/awaiting-more-evidence,sig/stack-guidance,triage/not-reproducible,area/byon,sig/user-experience,needs-sig,lifecycle/stale,needs-triage,lifecycle/rotten
0,238,Kebechet version manager: IssueTrackerDisabled...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-30 15:59:12,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1656...",{'sesheta': 141},"{'created_at': 1656604760, 'created_by': 'sesh...",...,0,0,0,0,0,0,1,0,1,0


In [17]:
condition1 = [
    (
        (
            (issue_df["kind/bug"] == 1)
            | (issue_df["kind/documentation"] == 1)
            | (issue_df["kind/feature"] == 1)
            | (issue_df["kind/question"] == 1)
        )
        & (
            (issue_df["priority/awaiting-more-evidence"] == 1)
            | (issue_df["priority/backlog"] == 1)
            | (issue_df["priority/critical-urgent"] == 1)
            | (issue_df["priority/important-longterm"] == 1)
            | (issue_df["priority/important-soon"] == 1)
        )
        & (
            (issue_df["sig/devsecops"] == 1)
            | (issue_df["sig/stack-guidance"] == 1)
            | (issue_df["sig/user-experience"] == 1)
        )
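        & (
            # story points such as "[3pt]", or an EPIC/SPIKE descriptor
            issue_df["title"].str.contains(r"\[\d+pt\]")
            | issue_df["title"].str.contains("EPIC")
            | issue_df["title"].str.contains("SPIKE")
        )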
    )
]
choice1 = [1]

issue_df["Good_Quality_Issue"] = np.select(condition1, choice1, default=0)

In [18]:
issue_df.head(1)

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions,first_response_at,...,priority/awaiting-more-evidence,sig/stack-guidance,triage/not-reproducible,area/byon,sig/user-experience,needs-sig,lifecycle/stale,needs-triage,lifecycle/rotten,Good_Quality_Issue
0,238,Kebechet version manager: IssueTrackerDisabled...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-30 15:59:12,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1656...",{'sesheta': 141},"{'created_at': 1656604760, 'created_by': 'sesh...",...,0,0,0,0,0,1,0,1,0,0
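
Listing every label column by hand means that a newly introduced `kind/`, `priority/`, or `sig/` label is silently ignored. As a hedged sketch, the four conditions listed above can also be written against label prefixes and a single title pattern. The toy frame, the helper `has_any`, and the regular expression are illustrative assumptions, not part of the MI tooling.

```python
import pandas as pd

# Toy frame with a few indicator columns; titles are made up.
toy = pd.DataFrame(
    {
        "title": [
            "[3pt] Fix resolver",
            "[SPIKE] Evaluate cache",
            "Plain title",
            "[EPIC] No labels",
        ],
        "kind/bug": [1, 0, 1, 0],
        "kind/feature": [0, 1, 0, 0],
        "priority/backlog": [1, 1, 1, 0],
        "sig/stack-guidance": [1, 0, 1, 0],
    }
)


def has_any(df, prefix):
    """True for rows where at least one indicator column with this prefix is 1."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return df[cols].eq(1).any(axis=1)


# Story points like "[3pt]" or a descriptor "[EPIC]" / "[SPIKE]" in the title.
sized = toy["title"].str.contains(r"\[(?:\d+pt|EPIC|SPIKE)\]", regex=True, na=False)

good = has_any(toy, "kind/") & has_any(toy, "priority/") & has_any(toy, "sig/") & sized
toy["Good_Quality_Issue"] = good.astype(int)
print(toy[["title", "Good_Quality_Issue"]])
```

Each condition is a separate boolean Series, so the `&` chain has no precedence surprises, and `astype(int)` gives the same 1/0 encoding as `np.select` with a single condition.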


### Conditions for `Triaged_Issue`

The required conditions for "Triaged_Issue" are:

- Issue includes a `triage/accepted` label.

In [19]:
condition2 = [(issue_df["triage/accepted"] == 1)]
choice2 = [1]

issue_df["Triaged_Issue"] = np.select(condition2, choice2, default=0)

In [20]:
issue_df.head(1)

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions,first_response_at,...,sig/stack-guidance,triage/not-reproducible,area/byon,sig/user-experience,needs-sig,lifecycle/stale,needs-triage,lifecycle/rotten,Good_Quality_Issue,Triaged_Issue
0,238,Kebechet version manager: IssueTrackerDisabled...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-30 15:59:12,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1656...",{'sesheta': 141},"{'created_at': 1656604760, 'created_by': 'sesh...",...,0,0,0,0,1,0,1,0,0,0


### Conditions for `Planned_Issue`

The required conditions for "Planned_Issue" are:

- Issue is tracked in a GitHub project board.
- Issue has an assignee.

**The extracted data contains no information about assignees or GitHub project boards, so this metric cannot be computed here.**

### Conditions for `Issue_being_worked_on`

An issue counts as "Issue_being_worked_on" when at least one of the following holds:

- Issue is in the in-progress column of a GitHub project board.
- Issue has a pull request.
- Issue includes the `lifecycle/active` label.

In this notebook, we identify issues that are being worked on solely through the `lifecycle/active` label.

In [21]:
condition3 = [(issue_df["lifecycle/active"] == 1)]
choice3 = [1]

issue_df["Issue_being_worked_on"] = np.select(condition3, choice3, default=0)

In [22]:
issue_df.head(2)

Unnamed: 0,id,title,body,created_by,created_at,closed_by,closed_at,labels,interactions,first_response_at,...,triage/not-reproducible,area/byon,sig/user-experience,needs-sig,lifecycle/stale,needs-triage,lifecycle/rotten,Good_Quality_Issue,Triaged_Issue,Issue_being_worked_on
0,238,Kebechet version manager: IssueTrackerDisabled...,## Description\nThis is an automated issue gen...,khebhut[bot],2022-06-30 15:59:12,,NaT,"{'bot': {'color': '698b69', 'labeled_at': 1656...",{'sesheta': 141},"{'created_at': 1656604760, 'created_by': 'sesh...",...,0,0,0,1,0,1,0,0,0,0
1,237,thamos advise adds a duplicate index/source,**Describe the bug**\r\n\r\nWhen I request a `...,codificat,2022-06-28 15:19:34,,NaT,"{'kind/bug': {'color': 'e11d21', 'labeled_at':...",{'codificat': 4},"{'created_at': 1656429622, 'created_by': 'codi...",...,0,0,0,1,0,1,0,0,1,0


In [23]:
# Retaining only relevant columns:
issue_df = issue_df[
    [
        "id",
        "created_by",
        "created_at",
        "closed_by",
        "closed_at",
        "org",
        "repo",
        "first_comment_at",
        "first_comment_by",
        "Good_Quality_Issue",
        "Triaged_Issue",
        "Issue_being_worked_on",
    ]
]

Now that we have processed the dataset, the next steps upload it to a public bucket and register it as a Trino table. We can then visualize the metrics in a Superset dashboard.

In [24]:
# upload the processed issue df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=issue_df,
    s3_prefix="open-services-group/metrics/thoth-support-github/issues",
    s3_key="thoth_support_issues.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx00000000000000022b5f6-0062bde94f-2f12b8-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"8d07f36b0c0b4a767465a622a3654df7"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx00000000000000022b5f6-0062bde94f-2f12b8-ocs-storagecluster-cephobjectstore',
   'date': 'Thu, 30 Jun 2022 18:19:59 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=8a437562f974804482699da2db9fb9f7; path=/; HttpOnly'},
  'RetryAttempts': 1},
 'ETag': '"8d07f36b0c0b4a767465a622a3654df7"'}

## Pull-Request

Now, let's fetch the PRs for the repository and save them as a data frame in an S3 bucket.

In [25]:
def get_pr_metrics(org, repo):
    # read repo data we fetched in previous step via MI
    pr = PullRequest(f"{org}/{repo}")
    pr_df = pr.load_previous_knowledge(is_local=True)
    pr_df = pr_df.reset_index()

    # Retain only relevant columns
    pr_cols_to_drop = ["interactions", "reviews", "labels", "commits", "changed_files"]
    prs_df = pr_df.drop(columns=pr_cols_to_drop)

    # add org and repo columns
    prs_df["org"] = org
    prs_df["repo"] = repo

    return prs_df

In [26]:
pr_df = get_pr_metrics(org="thoth-station", repo="support")
pr_df.head()

Unnamed: 0,id,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,first_review_at,first_approve_at,org,repo
0,195,Thoth Configuration Initialization,## Automatic configuration initialization\nThe...,M,khebhut[bot],2022-03-02 07:53:44,NaT,,NaT,,1,1,NaT,NaT,thoth-station,support
1,146,Thoth Configuration Initialization,## Automatic configuration initialization\nThe...,M,khebhut[bot],2021-11-18 14:17:40,2022-02-17 09:32:16,sesheta,NaT,,1,1,NaT,NaT,thoth-station,support
2,116,Add optional Python package index URL to the p...,Let's have also this field as per discussion w...,XS,fridex,2021-10-18 11:11:44,2021-10-18 11:24:04,fridex,2021-10-18 11:24:04,fridex,1,1,2021-10-18 11:15:15,2021-10-18 11:15:15,thoth-station,support
3,115,Template for registering a Python package index,See https://github.com/thoth-station/support/p...,M,fridex,2021-10-18 11:02:32,2021-10-18 11:24:32,fridex,2021-10-18 11:24:32,fridex,1,1,2021-10-18 11:14:45,2021-10-18 11:14:45,thoth-station,support
4,99,Add template for requesting a package in Thoth...,,M,fridex,2021-10-11 19:42:39,2021-10-18 10:53:53,fridex,2021-10-18 10:53:53,fridex,1,1,2021-10-18 10:45:16,2021-10-18 10:52:07,thoth-station,support


In [27]:
# upload the processed PR df to s3 as a parquet file
s3c.upload_df_to_s3(
    df=pr_df,
    s3_prefix="open-services-group/metrics/thoth-support-github/prs",
    s3_key="thoth_support_prs.parquet",
)

{'ResponseMetadata': {'RequestId': 'tx00000000000000022b5f8-0062bde950-2f12b8-ocs-storagecluster-cephobjectstore',
  'HostId': '',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'content-length': '0',
   'etag': '"dd6c8c3a2fdc8aed97c93cc5b792ee5e"',
   'accept-ranges': 'bytes',
   'x-amz-request-id': 'tx00000000000000022b5f8-0062bde950-2f12b8-ocs-storagecluster-cephobjectstore',
   'date': 'Thu, 30 Jun 2022 18:20:00 GMT',
   'set-cookie': 'bbdcd938787a45e68f8d240a4e2dadcf=8a437562f974804482699da2db9fb9f7; path=/; HttpOnly'},
  'RetryAttempts': 1},
 'ETag': '"dd6c8c3a2fdc8aed97c93cc5b792ee5e"'}

## Create Trino Tables

Now that the processed data frames are stored as parquet files in S3, we can generate [Trino](https://trino.io/) tables from them so that interactive dashboards can be built in [Superset](https://superset.apache.org/). We will connect to the [Operate First Trino](https://trino.operate-first.cloud/) instance.

In [28]:
# Map pandas column dtypes to data types supported by Trino/Superset
_p2smap = {
    "object": "varchar",
    "int64": "bigint",
    "float64": "double",
    "datetime64[ns]": "timestamp",
    "bool": "boolean",
}


def pandas_type_to_sql(pt):
    st = _p2smap.get(pt)
    if st is not None:
        return st
    raise ValueError("unexpected pandas column type '{pt}'".format(pt=pt))


# Generate the Trino table schema
def generate_table_schema_pairs(df):
    ptypes = [str(e) for e in df.dtypes.to_list()]
    stypes = [pandas_type_to_sql(e) for e in ptypes]
    pz = list(zip(df.columns.to_list(), stypes))
    return ",\n".join(["    {n} {t}".format(n=e[0], t=e[1]) for e in pz])
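
`_p2smap` matches dtype names exactly, so a column stored as `datetime64[us]` or as a nullable `Int64` would raise a `ValueError`, and a column name such as `kind/bug` would not be a valid unquoted identifier. The sketch below is one possible, more tolerant variant that works on plain dtype-name strings. The helper names and the extra dtype mappings are assumptions; time-zone-aware timestamps are deliberately rejected rather than guessed at.

```python
import re

# Exact dtype names, including pandas' nullable variants (an assumed extension).
_BASE_TYPES = {
    "object": "varchar",
    "string": "varchar",
    "int64": "bigint",
    "Int64": "bigint",
    "float64": "double",
    "Float64": "double",
    "bool": "boolean",
    "boolean": "boolean",
}
# Naive datetimes at any of the resolutions pandas supports.
_NAIVE_DATETIME = re.compile(r"^datetime64\[(s|ms|us|ns)\]$")


def dtype_name_to_trino(dtype_name):
    """Translate a pandas dtype name (str(dtype)) into a Trino type name."""
    if _NAIVE_DATETIME.match(dtype_name):
        return "timestamp"
    try:
        return _BASE_TYPES[dtype_name]
    except KeyError:
        raise ValueError(f"unexpected pandas column type '{dtype_name}'") from None


def quote_identifier(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def schema_pairs(columns):
    """columns: iterable of (column name, dtype name) tuples."""
    return ",\n".join(
        f"    {quote_identifier(n)} {dtype_name_to_trino(t)}" for n, t in columns
    )


print(schema_pairs([("id", "int64"), ("created_at", "datetime64[ns]"), ("kind/bug", "int64")]))
```

For a real data frame, the `(name, dtype name)` pairs could be produced with `zip(df.columns, df.dtypes.astype(str))`.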

In [29]:
# Create a Trino client
conn = trino.dbapi.connect(
    auth=trino.auth.BasicAuthentication(
        os.environ["TRINO_USER"], os.environ["TRINO_PASSWD"]
    ),
    host=os.environ["TRINO_HOST"],
    port=int(os.environ["TRINO_PORT"]),
    http_scheme="https",
    verify=True,
)
cur = conn.cursor()

In [30]:
# Check if Trino connection was successful
cur.execute("show catalogs")
cur.fetchall()[2]

['data_science_general_morty']

In [31]:
def create_table_from_df(df, table_name, s3_prefix):
    # Create the table with data populated from parquet file
    schema = generate_table_schema_pairs(df)

    tabledef = """CREATE TABLE IF NOT EXISTS data_science_general_morty.default.{table_name}(
    {schema}
    ) with (
        format = 'parquet',
        external_location = 's3a://{s3_bucket}/{s3_prefix}'
    )""".format(
        table_name=table_name,
        schema=schema,
        s3_prefix=s3_prefix,
        s3_bucket=os.environ["S3_BUCKET"],
    )

    cur.execute(tabledef)
    return cur.fetchall()
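
Before executing the DDL against Trino, it can help to look at the rendered statement. The sketch below builds a statement of the same shape as the one in `create_table_from_df` without opening a connection. The function `render_create_table` is hypothetical, and `my-bucket` and the one-column schema are placeholders.

```python
def render_create_table(
    table_name,
    schema,
    s3_bucket,
    s3_prefix,
    catalog="data_science_general_morty",
    db="default",
):
    """Return the CREATE TABLE statement as a string without executing it."""
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{db}.{table_name}(\n"
        f"{schema}\n"
        ") with (\n"
        "    format = 'parquet',\n"
        f"    external_location = 's3a://{s3_bucket}/{s3_prefix}'\n"
        ")"
    )


# Placeholder bucket and a one-column schema, for illustration only.
ddl = render_create_table(
    table_name="thoth_support_issues",
    schema="    id bigint",
    s3_bucket="my-bucket",
    s3_prefix="open-services-group/metrics/thoth-support-github/issues",
)
print(ddl)
```

Printing the statement first makes it easier to spot a wrong prefix or an unsupported column type before the table is created.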

In [32]:
create_table_from_df(
    issue_df,
    table_name="thoth_support_issues",
    s3_prefix="open-services-group/metrics/thoth-support-github/issues",
)

[[True]]

In [33]:
create_table_from_df(
    pr_df,
    table_name="thoth_support_prs",
    s3_prefix="open-services-group/metrics/thoth-support-github/prs",
)

[[True]]

## Conclusion

In this notebook we:

- Fetched GitHub Issue/PR data for a specified org/repo using the MI srcopsmetrics module.
- Pre-processed the raw data into data frames with relevant columns.
- Uploaded the processed data frames as parquet files to an S3 bucket.
- Created Trino tables backed by the generated parquet files.

We can now further explore the GitHub data and create interactive visualization dashboards in Superset.