# Run inference on time to merge model trained previously


## What we did previously

In the previous [notebook](./03_model_training.ipynb), we trained machine learning models to classify a PR's `time_to_merge` into one of 10 bins (or "classes"). We then deployed the model with the highest F1 score as a service using Seldon. Refer to deployment docs here

## In this step


The purpose of this notebook is to check whether this service is running as intended and, more specifically, to ensure that the model performance is what we expect it to be. To do so, we use the test set from the aforementioned notebook as the query payload for the service, and then verify that the return values are the same as those obtained during training/testing locally.
# Time to Merge Prediction Inference Service

In the previous notebook, we explored some basic machine learning models for predicting time to merge of a PR.

In [1]:
import os
import sys
import gzip
import json
import boto3
import datetime
import requests
from dotenv import load_dotenv, find_dotenv

import numpy as np
import pandas as pd

from sklearn.metrics import classification_report

metric_template_path = "../../../notebooks/data-sources/TestGrid/metrics"
if metric_template_path not in sys.path:
    sys.path.insert(1, metric_template_path)

from ipynb.fs.defs.metric_template import (  # noqa: E402
    CephCommunication,
)

load_dotenv(find_dotenv(), override=True)

True

In [2]:
## CEPH Bucket variables
## Create a .env file on your local machine with the correct configs.

ORG = os.getenv("GITHUB_ORG")
REPO = os.getenv("GITHUB_REPO")

## S3 bucket credentials
s3_endpoint_url = os.getenv("S3_ENDPOINT_URL")
s3_access_key = os.getenv("AWS_ACCESS_KEY_ID")
s3_secret_key = os.getenv("AWS_SECRET_ACCESS_KEY")
s3_bucket = os.getenv("S3_BUCKET")

s3_input_data_path = os.getenv("CEPH_BUCKET_PREFIX")

REMOTE = os.getenv("REMOTE")
RAW_DATA_PATH = os.path.join(
    s3_input_data_path, "srcopsmetrics/bot_knowledge", ORG, REPO, "PullRequest.json"
)
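
A note on the `REMOTE` flag: `os.getenv` returns a string (or `None`), so a check like `if REMOTE:` is true for any non-empty value, including `"0"` or `"false"`. The sketch below shows one possible way to parse such a flag. The helper name `env_flag` and the set of accepted "true" spellings are assumptions made for illustration, not part of the original notebook.

```python
import os


def env_flag(name, default=False):
    """Interpret an environment variable as a boolean (hypothetical helper)."""
    raw = os.getenv(name)
    if raw is None:
        return default
    # Only these spellings count as "true"; everything else is "false".
    return raw.strip().lower() in {"1", "true", "yes", "on"}


os.environ["REMOTE"] = "0"
print(bool(os.getenv("REMOTE")))  # True: "0" is a non-empty string
print(env_flag("REMOTE"))         # False
```

With a helper like this, setting `REMOTE=0` in the `.env` file would select the local branch instead of the Ceph branch.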

In [3]:
output = []
local_input_data_path = "../../../data/raw/GitHub/PullRequest.json.gz"
if REMOTE:
    print("getting dataset from ceph")
    s3 = boto3.resource(
        "s3",
        endpoint_url=s3_endpoint_url,
        aws_access_key_id=s3_access_key,
        aws_secret_access_key=s3_secret_key,
    )
    content = s3.Object(s3_bucket, RAW_DATA_PATH)
    file = content.get()["Body"].read().decode("utf-8")

    prs = json.loads(file)

    for pr in prs.splitlines():
        output.append(json.loads(pr))

else:
    print("getting dataset from local")
    with gzip.open(local_input_data_path, "r") as f:
        prs = json.loads(f.read().decode("utf-8"))


pr_df = pd.DataFrame(output)

getting dataset from ceph


In [4]:
# github pr dataset collected using thoth's mi-scheduler
pr_df.head()

Unnamed: 0,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at,id
0,Refactor github issue templates.,This commit does the following:\r\n1. Re-Order...,L,HumairAK,1650478984,1651241000.0,HumairAK,1651241000.0,HumairAK,2,18,"{'HumairAK': 3, 'larsks': 1, 'sesheta': 76}","{'948687753': {'author': 'larsks', 'words_coun...","[size/L, lgtm]","[327b84cd624b568e055f7a9043aa4c7a289de430, d6e...","[.github/ISSUE_TEMPLATE/1_question.yaml, .gith...",1650551000.0,,555
1,updating the onboarding_to_cluster docs,"includes changes to available cluster list, up...",M,Gregory-Pereira,1648853117,,,,,1,2,"{'sesheta': 123, 'Gregory-Pereira': 3}","{'930372186': {'author': 'HumairAK', 'words_co...","[size/M, lifecycle/stale]",[ad8c7a34df51110f2bedd4a1489c5480567fac75],[.github/ISSUE_TEMPLATE/onboarding_to_cluster....,1649079000.0,,550
2,Adding Morty cluster option to the onboarding_...,I'm adding the Morty cluster to the dropdown l...,XS,dystewart,1648507481,1648569000.0,4n4nd,1648569000.0,4n4nd,1,1,"{'sesheta': 172, '4n4nd': 2}","{'924847048': {'author': '4n4nd', 'words_count...","[size/XS, approved, lgtm]",[0abb3bd0e836eaa4611dbd76c429b333cf32dfaf],[.github/ISSUE_TEMPLATE/onboarding_to_cluster....,1648569000.0,1648569000.0,548
3,Added cluster introduction and links to suppor...,### Description:\r\n- added section to support...,XS,bryanmontalvan,1643990431,1644414000.0,sesheta,1644414000.0,sesheta,1,1,"{'bryanmontalvan': 3, 'sesheta': 65}","{'876514487': {'author': 'HumairAK', 'words_co...","[size/XS, approved, lgtm]",[5964cb9a76d1f6e6f6e66ea2474395a074e9fe78],[README.md],1644347000.0,1644347000.0,519
4,"Revert ""Onboard Freeze Notice for template""",Reverts operate-first/support#301\r\n\r\n/cc @...,XS,tumido,1643473377,1643646000.0,sesheta,1643646000.0,sesheta,1,1,{'sesheta': 65},"{'868092044': {'author': 'HumairAK', 'words_co...","[size/XS, approved, lgtm]",[8020a353880a2b53577614367bb54308fe13ed0a],[.github/ISSUE_TEMPLATE/onboarding_to_cluster....,1643646000.0,1643646000.0,516


In [5]:
interval = (pr_df["merged_at"] - pr_df["created_at"]).astype("float")
interval = interval.dropna()
interval

0     761991.0
2      61140.0
3     424068.0
4     172735.0
6       1017.0
        ...   
78       216.0
79       676.0
80     60348.0
81       534.0
82     79495.0
Length: 77, dtype: float64

In [6]:
n_buckets = 10

quantiles = interval.quantile(q=np.arange(0, 1 + 1e-100, 1 / n_buckets))
quantiles

0.0       139.0
0.1       572.4
0.2      1553.6
0.3      4582.4
0.4     15089.4
0.5     29664.0
0.6     61447.8
0.7     81339.8
0.8    170869.2
0.9    441258.0
dtype: float64
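
The quantiles above act as the lower edges of the 10 classes. As a minimal sketch, assuming class `k` covers the half-open range from edge `k` up to edge `k + 1` (the training notebook may implement the binning differently), a time to merge in seconds can be mapped to a class with a binary search. The function name `ttm_class` is hypothetical.

```python
from bisect import bisect_right

# Lower bin edges in seconds, copied from the quantile output above.
EDGES_SECONDS = [
    139.0, 572.4, 1553.6, 4582.4, 15089.4,
    29664.0, 61447.8, 81339.8, 170869.2, 441258.0,
]


def ttm_class(seconds, edges=EDGES_SECONDS):
    """Return the index of the last edge <= seconds, clamped to a valid class."""
    idx = bisect_right(edges, seconds) - 1
    return min(max(idx, 0), len(edges) - 1)


print(ttm_class(761991.0))  # 9
print(ttm_class(1017.0))    # 1
```

These two values are the intervals of PRs 0 and 6 from the `interval` output, and the resulting classes agree with the `ttm_class` labels those PRs carry in `y_test`.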

In [7]:
quantiles / 3600

0.0      0.038611
0.1      0.159000
0.2      0.431556
0.3      1.272889
0.4      4.191500
0.5      8.240000
0.6     17.068833
0.7     22.594389
0.8     47.463667
0.9    122.571667
dtype: float64

In [8]:
time_intervals = quantiles / 3600

In [9]:
# remove PRs from train/test which are still open
pr_df = pr_df[pr_df["closed_at"].notna()]
pr_df = pr_df[pr_df["merged_at"].notna()]

In [10]:
pr_df["created_at"] = pr_df["created_at"].apply(
    lambda x: int(datetime.datetime.timestamp(pd.to_datetime(x)))
)
pr_df["closed_at"] = pr_df["closed_at"].apply(
    lambda x: float(datetime.datetime.timestamp(pd.to_datetime(x)))
)
pr_df["merged_at"] = pr_df["merged_at"].apply(
    lambda x: float(datetime.datetime.timestamp(pd.to_datetime(x)))
)
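
A note on the conversion above: `pd.to_datetime` reads a bare integer as nanoseconds since the Unix epoch by default, whereas the raw columns hold seconds. That appears consistent with `created_at` showing up as `1` in the `sample_payload` output further down. The stdlib-only sketch below contrasts the two readings for the first PR in `pr_df.head()`. It is explanatory rather than a suggested change, since the deployed model was presumably trained on the same transformation.

```python
import datetime

created_at = 1650478984  # `created_at` of the first PR shown in `pr_df.head()`

# Reading the value as seconds since the Unix epoch.
as_seconds = datetime.datetime.fromtimestamp(created_at, tz=datetime.timezone.utc)
print(as_seconds.isoformat())  # 2022-04-20T18:23:04+00:00

# Reading the same value as nanoseconds since the epoch, expressed in seconds.
as_nanoseconds = created_at / 1e9
print(int(as_nanoseconds))  # 1
```

The seconds-based reading (day 20, month 4, hour 18) matches the `created_at_day`, `created_at_month`, and `created_at_hour` features of PR 0 in `X_test`.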

In [11]:
TEST_DATA_PATH = os.path.join(s3_input_data_path, ORG, REPO, "test-data")

# read processed and split data created for train/test in the model training notebook
if REMOTE:
    cc = CephCommunication(s3_endpoint_url, s3_access_key, s3_secret_key, s3_bucket)
    X_test = cc.read_from_ceph(TEST_DATA_PATH, "X_test.parquet")
    y_test = cc.read_from_ceph(TEST_DATA_PATH, "y_test.parquet")

else:
    print(
        "The X_test.parquet and y_test.parquet files are not included in the github repo."
    )
    print(
        "Please set REMOTE=1 in the .env file and read this data from the S3 bucket instead."
    )

In [12]:
X_test

Unnamed: 0,size,created_at_day,created_at_month,created_at_weekday,created_at_hour,changed_files_number,body_size,commits_number,filetype_.md,filetype_.png,...,title_wordcount_uptdated,title_wordcount_url,title_wordcount_user,title_wordcount_users,title_wordcount_warning,title_wordcount_website,title_wordcount_workloads,title_wordcount_wrong,title_wordcount_yaml,title_wordcount_zero
6,1.0,18.0,1.0,1.0,14.0,1.0,2.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
40,1.0,3.0,5.0,0.0,12.0,1.0,32.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
14,1.0,7.0,10.0,3.0,13.0,1.0,1.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
0,4.0,20.0,4.0,2.0,18.0,18.0,43.0,2.0,6.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50,1.0,25.0,3.0,3.0,16.0,1.0,3.0,1.0,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
52,3.0,18.0,3.0,3.0,20.0,1.0,2.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
72,4.0,20.0,1.0,2.0,19.0,2.0,56.0,1.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
58,3.0,4.0,3.0,3.0,22.0,2.0,41.0,2.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
55,1.0,11.0,3.0,3.0,15.0,1.0,3.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
33,1.0,11.0,6.0,4.0,16.0,1.0,0.0,1.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [13]:
y_test

Unnamed: 0,ttm_class
6,1
40,2
14,5
0,9
50,1
52,5
72,6
58,6
55,0
33,0


In [14]:
# endpoint from the seldon deployment
# base_url = "http://dep1-route-aiops-tools-workshop.apps.smaug.na.operate-first.cloud/predict"

In [15]:
base_url = "http://mobeloper-route-aiops-tools-workshop.apps.smaug.na.operate-first.cloud/predict"

In [16]:
# let's extract the raw PR data corresponding to the PRs used in the test set
sample_payload = pr_df.reindex(X_test.index)

In [17]:
sample_payload

Unnamed: 0,title,body,size,created_by,created_at,closed_at,closed_by,merged_at,merged_by,commits_number,changed_files_number,interactions,reviews,labels,commits,changed_files,first_review_at,first_approve_at,id
6,Update outdated links in onboarding template.,Related: https://github.com/operate-first/SRE/...,XS,HumairAK,1,1.642516,sesheta,1.642516,sesheta,1,1,{'sesheta': 65},"{'855520150': {'author': '4n4nd', 'words_count...","[size/XS, approved, lgtm]",[41d8386fb5b9611a6ebef7b6a72339967737b1d8],[.github/ISSUE_TEMPLATE/onboarding_to_cluster....,1642517000.0,1642517000.0,510
40,chore: Remove martin from reviewers so he's no...,Let's relieve this poor man's inbox pressure! ...,XS,tumido,1,1.620048,sesheta,1.620048,sesheta,1,1,"{'sesheta': 68, 'tumido': 11, 'HumairAK': 1}","{'650324332': {'author': 'HumairAK', 'words_co...","[size/XS, approved, lgtm]",[2ff53a7ff32bd45b23eb814792a0087970775523],[OWNERS],1620047000.0,1620047000.0,225
14,Add smaug and balrog to cluster list in onboar...,SSIA,XS,HumairAK,1,1.633644,HumairAK,1.633644,HumairAK,1,1,{'sesheta': 76},"{'773952338': {'author': 'harshad16', 'words_c...",[size/XS],[24bd93442dd648940ce7830fe5cdfad983362469],[.github/ISSUE_TEMPLATE/onboarding_to_cluster....,1633615000.0,1633615000.0,415
0,Refactor github issue templates.,This commit does the following:\r\n1. Re-Order...,L,HumairAK,1,1.65124,HumairAK,1.65124,HumairAK,2,18,"{'HumairAK': 3, 'larsks': 1, 'sesheta': 76}","{'948687753': {'author': 'larsks', 'words_coun...","[size/L, lgtm]","[327b84cd624b568e055f7a9043aa4c7a289de430, d6e...","[.github/ISSUE_TEMPLATE/1_question.yaml, .gith...",1650551000.0,,555
50,feat: Show the nice URL for a dashboard on the...,Part of: https://github.com/operate-first/apps...,XS,tumido,1,1.616691,sesheta,1.616691,sesheta,1,1,{'sesheta': 65},"{'621362929': {'author': 'HumairAK', 'words_co...","[size/XS, approved]",[4c62c46c5cb47150a7e79a683fdf821c54e0f439],[README.md],1616692000.0,1616692000.0,151
52,Clarify onboarding issue template.,Fixes: https://github.com/operate-first/suppor...,M,HumairAK,1,1.616156,sesheta,1.616156,sesheta,1,1,"{'HumairAK': 25, 'martinpovolny': 1, 'sesheta'...","{'616343597': {'author': 'tumido', 'words_coun...","[size/M, approved]",[f163066abfee75edaad439e6a9df91a923253712],[.github/ISSUE_TEMPLATE/onboarding_to_cluster.md],1616157000.0,1616157000.0,135
72,Add argocd onboarding docs,As per suggestion from [here](https://github.c...,L,HumairAK,1,1.611251,sesheta,1.611251,sesheta,1,2,{'sesheta': 68},"{'573228143': {'author': 'tumido', 'words_coun...","[approved, size/L]",[f15a722ffcde708dd8a2e6bb3e6ea580c47ab325],"[.github/ISSUE_TEMPLATE/onboarding_argocd.md, ...",1611230000.0,1611247000.0,43
58,Update docs,Updated the docs with new links / info. \r\n\r...,M,HumairAK,1,1.614957,sesheta,1.614957,sesheta,2,2,"{'4n4nd': 1, 'sesheta': 65}","{'604988101': {'author': 'tumido', 'words_coun...","[size/M, approved, lgtm]","[8f0754fe7a0bb1918da1435691ef0d6eac5cd50d, 460...","[.prow.yaml, docs/onboarding_to_argocd.md]",1614939000.0,,108
55,chore: Rename hack to scripts folder,Docs for: https://github.com/operate-first/blu...,XS,tumido,1,1.615478,sesheta,1.615478,sesheta,1,1,{'sesheta': 65},"{'609926098': {'author': 'HumairAK', 'words_co...","[size/XS, approved, lgtm]",[0c5f02e6157657b61613aa4fe4624fa4941d733a],[docs/onboarding_to_cluster.md],1615478000.0,1615478000.0,115
33,Fix typo in quota doc.,,XS,HumairAK,1,1.623429,HumairAK,1.623429,HumairAK,1,1,{'sesheta': 65},"{'682021142': {'author': '4n4nd', 'words_count...","[size/XS, approved]",[4a96a02cf9eaf6e4c37c20a89d20d23ba3ddfeee],[docs/quotas.md],1623429000.0,1623429000.0,269


In [18]:
sample_payload.dtypes

title                    object
body                     object
size                     object
created_by               object
created_at                int64
closed_at               float64
closed_by                object
merged_at               float64
merged_by                object
commits_number            int64
changed_files_number      int64
interactions             object
reviews                  object
labels                   object
commits                  object
changed_files            object
first_review_at         float64
first_approve_at        float64
id                       object
dtype: object

In [20]:
# convert the dataframe into a numpy array and then to a list (required by seldon)
data = {
    "data": {
        "names": sample_payload.columns.tolist(),
        "ndarray": sample_payload.to_numpy().tolist(),
    }
}

# create the query payload
json_data = json.dumps(data)
headers = {"content-Type": "application/json"}
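
One detail worth knowing about this payload: columns such as `first_approve_at` contain missing values, and `json.dumps` serializes `NaN` as the bare token `NaN` by default, which is not strict JSON. Whether the server accepts it depends on its JSON parser. The sketch below uses a toy two-column stand-in for `sample_payload` (the rows are made up for illustration) to show the default behavior and the strict alternative.

```python
import json

# Toy stand-in for `sample_payload`: two columns, two rows, one missing value.
names = ["commits_number", "first_approve_at"]
rows = [[2, float("nan")], [1, 1648569000.0]]

payload = {"data": {"names": names, "ndarray": rows}}

# Default behavior: NaN is written as the non-standard token `NaN`.
body = json.dumps(payload)
print(body)

# Strict mode refuses to serialize NaN and raises ValueError instead.
try:
    json.dumps(payload, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False
print(strict_ok)  # False
```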

In [21]:
class_dict = {pos: str(ele) + " hrs" for pos, ele in enumerate(time_intervals)}
class_dict

{0: '0.03861111111111111 hrs',
 1: '0.15900000000000003 hrs',
 2: '0.43155555555555564 hrs',
 3: '1.2728888888888905 hrs',
 4: '4.1915 hrs',
 5: '8.24 hrs',
 6: '17.068833333333334 hrs',
 7: '22.594388888888886 hrs',
 8: '47.463666666666676 hrs',
 9: '122.57166666666673 hrs'}

In [22]:
# query our inference service
response = requests.post(base_url, data=json_data, headers=headers)
response

<Response [200]>

In [23]:
# parse the JSON body of the response; its "names" field lists the prediction classes
json_response = response.json()

In [24]:
json_response["data"]["names"]

['Class_0',
 'Class_1',
 'Class_2',
 'Class_3',
 'Class_4',
 'Class_5',
 'Class_6',
 'Class_7',
 'Class_8',
 'Class_9']

In [25]:
sample_pr = 10

In [26]:
# probability estimates for each class for a sample PR
json_response["data"]["ndarray"][sample_pr][:10]

[0.015, 0.0, 0.0, 0.04, 0.645, 0.07, 0.085, 0.045, 0.08, 0.02]
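
Before interpreting the response, it can help to confirm that it has the expected shape. The sketch below is one possible sanity check, assuming the service returns one row of per-class probabilities per PR and that each row sums to 1 (the sample row above does). The function name `check_prediction_response` is hypothetical.

```python
import math


def check_prediction_response(json_response, n_rows, n_classes):
    """Raise ValueError if the response is not an n_rows x n_classes probability table."""
    data = json_response["data"]
    if len(data["names"]) != n_classes:
        raise ValueError(f"expected {n_classes} class names, got {len(data['names'])}")
    rows = data["ndarray"]
    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != n_classes:
            raise ValueError(f"row {i} has {len(row)} entries, expected {n_classes}")
        if not math.isclose(sum(row), 1.0, abs_tol=1e-6):
            raise ValueError(f"row {i} sums to {sum(row)}, expected 1.0")


# Example using the class names and the sample row shown above.
example = {
    "data": {
        "names": [f"Class_{i}" for i in range(10)],
        "ndarray": [[0.015, 0.0, 0.0, 0.04, 0.645, 0.07, 0.085, 0.045, 0.08, 0.02]],
    }
}
check_prediction_response(example, n_rows=1, n_classes=10)  # passes silently
```

In the notebook itself, the call would presumably use `n_rows=len(X_test)`.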

In [27]:
# get predicted classes from probabilities for each PR
preds = np.argmax(json_response["data"]["ndarray"], axis=1)
print(
    "The PR belongs to class",
    preds[sample_pr],
    "and it is most likely to be merged in",
    class_dict[preds[sample_pr]],
)

The PR belongs to class 4 and it is most likely to be merged in 4.1915 hrs


In [28]:
print("The PR was actually merged in", class_dict[int(y_test.iloc[sample_pr])])

The PR was actually merged in 22.594388888888886 hrs
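
Note that `class_dict` maps each class to the lower edge of its bin, so "4.1915 hrs" is better read as "at least 4.19 hrs". As a sketch, assuming class `k` spans from its own edge up to the edge of class `k + 1`, with the last class open-ended, the predicted and actual classes can be reported as ranges. The names `class_range` and `describe` are hypothetical.

```python
# Lower bin edges in hours, copied from the `quantiles / 3600` output.
EDGES_HOURS = [
    0.038611, 0.159, 0.431556, 1.272889, 4.1915,
    8.24, 17.068833, 22.594389, 47.463667, 122.571667,
]


def class_range(cls, edges=EDGES_HOURS):
    """Return (lower, upper) in hours; upper is None for the open-ended last class."""
    lower = edges[cls]
    upper = edges[cls + 1] if cls + 1 < len(edges) else None
    return lower, upper


def describe(cls):
    lower, upper = class_range(cls)
    if upper is None:
        return f"more than {lower:.2f} hrs"
    return f"between {lower:.2f} and {upper:.2f} hrs"


print(describe(4))  # between 4.19 and 8.24 hrs
print(describe(7))  # between 22.59 and 47.46 hrs
```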


In [29]:
# evaluate results on the entire test set
print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.00      0.00      0.00         2
           1       0.00      0.00      0.00         2
           2       0.00      0.00      0.00         2
           3       0.00      0.00      0.00         0
           4       0.00      0.00      0.00         1
           5       1.00      0.67      0.80         3
           6       0.00      0.00      0.00         2
           7       0.00      0.00      0.00         1
           9       0.00      0.00      0.00         3

    accuracy                           0.12        16
   macro avg       0.11      0.07      0.09        16
weighted avg       0.19      0.12      0.15        16
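
To make the per-class columns concrete, the sketch below recomputes the row for class 5. The counts are inferred from the report rather than read from the data: a support of 3 with recall 0.67 implies 2 true positives and 1 false negative, and a precision of 1.00 implies no false positives. The function name is hypothetical.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from raw counts, using 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


p, r, f = precision_recall_f1(tp=2, fp=0, fn=1)
print(round(p, 2), round(r, 2), round(f, 2))  # 1.0 0.67 0.8
```

For classes with no predicted or no true samples, a denominator is zero. scikit-learn reports 0.00 in those cases and emits a warning, which is why most rows above are all zeros.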





# Conclusion

This notebook shows how raw PR data can be sent to the deployed Seldon service to get time-to-merge predictions. Additionally, we see that the evaluation scores in the classification report match the ones we saw in the training notebook. So, great, it looks like our inference service and model are working as expected and are ready to predict times to merge for GitHub PRs!