# Fetch GitHub Issues and Compute Embeddings

* This notebook downloads GitHub issues and then computes their embeddings using a trained model.
* [issues_loader.ipynb](../Label_Microservice/notebooks/issues_loader.ipynb) is a very similar notebook.

   * That notebook, however, just uses the IssuesLoader class as a way of hard-coding some paths.

## Running this Notebook

This notebook is run in the container: [hamelsmu/ml-gpu-issue-lang-model](https://github.com/machine-learning-apps/IssuesLanguageModel/blob/master/gpu.Dockerfile)

This container is publicly available [on Dockerhub](https://cloud.docker.com/u/hamelsmu/repository/docker/hamelsmu/ml-gpu-issue-lang-model)

#### Compute: This notebook was run on a [p3.8xlarge](https://aws.amazon.com/ec2/instance-types/p3/) on AWS
4 Tesla V100 GPUs, 32 vCPUs, 244 GB of memory

In [1]:
import logging
import os
from pathlib import Path
import sys

logging.basicConfig(format='%(message)s')
logging.getLogger().setLevel(logging.INFO)

home = str(Path.home())

# Installing the python packages locally doesn't appear to add them to the path
# automatically, so we need to add the directory manually
local_py_path = os.path.join(home, ".local/lib/python3.6/site-packages")

for p in [local_py_path, os.path.abspath("../../py")]:
    if p not in sys.path:
        logging.info("Adding %s to python path", p)
        # Insert at front because we want to override any installed packages
        sys.path.insert(0, p)


Adding /home/jovyan/git_kubeflow-code-intelligence/py to python path


In [72]:
!pip3 install --user -r ../requirements.txt

You should consider upgrading via the 'pip install --upgrade pip' command.


In [2]:
from bs4 import BeautifulSoup
import requests
from fastai.core import parallel, partial

from collections import Counter
from tqdm import tqdm_notebook
import torch
from code_intelligence import embeddings
from code_intelligence import graphql
from code_intelligence import gcs_util
from google.cloud import storage

## Get a list of Kubeflow REPOs

* You will need to either set a GitHub token or use a GitHub App in order to call the API

In [55]:
if not os.getenv("GITHUB_TOKEN"):
    logging.warning("No GitHub token set; defaulting to hardcoded list of Kubeflow repositories")
    
    # The list of repos can be updated using the else block
    repo_names = ['arena', 'batch-predict', 'caffe2-operator', 'chainer-operator', 'code-intelligence', 'common', 'community', 'crd-validation', 'example-seldon', 'examples', 'fairing', 'features', 'frontend', 'homebrew-cask', 'homebrew-core', 'internal-acls', 'katib', 'kfctl', 'kfp-tekton', 'kfserving', 'kubebench', 'kubeflow', 'manifests', 'marketing-materials', 'metadata', 'mpi-operator', 'mxnet-operator', 'pipelines', 'pytorch-operator', 'reporting', 'testing', 'tf-operator', 'triage-issues', 'website', 'xgboost-operator']
else:
    gh_client = graphql.GraphQLClient()
        
    repo_query="""query repoQuery($org: String!) {
       organization(login: $org) {
        repositories(first:100) {
          totalCount 
          edges {
            node {
              name
            }
          }
        }
      }
    }
    """
    variables = {
        "org": "kubeflow",
    }
    results = gh_client.run_query(repo_query, variables)
    repo_nodes = graphql.unpack_and_split_nodes(results, ["data", "organization", "repositories", "edges"])
    repo_names = [n["name"] for n in repo_nodes]

    # Print the names in a form that can be pasted into the hardcoded list above
    names_str = ", ".join([f"'{n}'" for n in sorted(repo_names)])
    print(f"[{names_str}]")

GraphQLClient is defaulting to FixedAccessTokenGenerator based on environment variables. This is deprecated. Caller should explicitly pass in a instance via header_generator. Traceback:
<function extract_stack at 0x7f91e275f6a8>


['arena', 'batch-predict', 'caffe2-operator', 'chainer-operator', 'code-intelligence', 'common', 'community', 'crd-validation', 'example-seldon', 'examples', 'fairing', 'features', 'frontend', 'homebrew-cask', 'homebrew-core', 'internal-acls', 'katib', 'kfctl', 'kfp-tekton', 'kfserving', 'kubebench', 'kubeflow', 'manifests', 'marketing-materials', 'metadata', 'mpi-operator', 'mxnet-operator', 'pipelines', 'pytorch-operator', 'reporting', 'testing', 'tf-operator', 'triage-issues', 'website', 'xgboost-operator']
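
The query above asks for `repositories(first:100)`, so it returns at most 100 repositories. The following is a minimal sketch of cursor-based pagination for an organization with more than 100. It assumes that `run_query` is any callable with the same `(query, variables)` shape as `gh_client.run_query` and that it returns the decoded JSON response as a dict. The names `PAGED_REPO_QUERY` and `fetch_all_repo_names` are illustrative and not part of `code_intelligence`.

```python
from typing import Any, Callable, Dict, List

# Same query as above, extended with pagination fields (pageInfo and after:).
PAGED_REPO_QUERY = """query repoQuery($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      edges { node { name } }
    }
  }
}
"""


def fetch_all_repo_names(
    run_query: Callable[[str, Dict[str, Any]], Dict[str, Any]], org: str
) -> List[str]:
    """Collect repository names page by page until hasNextPage is false."""
    names: List[str] = []
    cursor = None  # a null cursor requests the first page
    while True:
        results = run_query(PAGED_REPO_QUERY, {"org": org, "cursor": cursor})
        repos = results["data"]["organization"]["repositories"]
        names.extend(edge["node"]["name"] for edge in repos["edges"])
        if not repos["pageInfo"]["hasNextPage"]:
            return names
        cursor = repos["pageInfo"]["endCursor"]
```

Under those assumptions, it could be called as `fetch_all_repo_names(gh_client.run_query, "kubeflow")`.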


## Get The Data

In [3]:
%load_ext autoreload
%autoreload 2
import pandas as pd
from inference import InferenceWrapper

## Load Model Artifacts (Download from GCS if Not Available Locally)

* We need to load the model used to compute embeddings

In [4]:
from pathlib import Path
from urllib import request as request_url

def pass_through(x):
    return x

model_url = 'https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/trained_model_22zkdqlr.pkl'
inference_wrapper = embeddings.load_model_artifact(model_url)

#### Warning: The cell below benefits tremendously from parallelism; the more cores your machine has, the better

* The code will fail if you aren't running with a GPU

## Load Data Using BigQuery

* TODO(jlewi): Does the preprocessing match what we do for inference?

In [9]:
# TODO(jlewi): I was encountering all kinds of version mismatches; doing a force upgrade seemed to fix issues with bigquery
# but may have broken pytorch; not sure.
!pip install --user --force --upgrade pandas-gbq google-cloud-bigquery

Collecting pandas-gbq
  Using cached https://files.pythonhosted.org/packages/c3/74/126408f6bdb7b2cb1dcb8c6e4bd69a511a7f85792d686d1237d9825e6194/pandas_gbq-0.13.1-py3-none-any.whl
Collecting google-cloud-bigquery
  Using cached https://files.pythonhosted.org/packages/8f/f7/b6f55e144da37f38a79552a06103f2df4a9569e2dfc6d741a7e2a63d3592/google_cloud_bigquery-1.24.0-py2.py3-none-any.whl
Collecting pydata-google-auth
  Using cached https://files.pythonhosted.org/packages/87/ed/9c9f410c032645632de787b8c285a78496bd89590c777385b921eb89433d/pydata_google_auth-0.3.0-py2.py3-none-any.whl
Collecting google-auth-oauthlib
  Using cached https://files.pythonhosted.org/packages/7b/b8/88def36e74bee9fce511c9519571f4e485e890093ab7442284f4ffaef60b/google_auth_oauthlib-0.4.1-py2.py3-none-any.whl
Collecting setuptools
  Downloading https://files.pythonhosted.org/packages/a0/df/635cdb901ee4a8a42ec68e480c49f85f4c59e8816effbf57d9e6ee8b3588/setuptools-46.1.3-py3-none-any.whl (582kB)

## Get the Data Using BigQuery

* We can use BigQuery to fetch the data from the GitHub Archive
* Here is a list of [GitHub Event Types](https://developer.github.com/v3/activity/events/types/)
  * We need to consider both [IssuesEvent](https://developer.github.com/v3/activity/events/types/#issuesevent) and [IssueCommentEvent](https://developer.github.com/v3/activity/events/types/#issuecommentevent)
* At the time of this writing (2020/04/08), there are approximately 137K events in Kubeflow, and it takes roughly 30 seconds to fetch all of them.

In [1]:
from pandas.io import gbq
import subprocess 
# TODO(jlewi): Get the project using fairing?
PROJECT = subprocess.check_output(["gcloud", "config", "get-value", "project"]).strip().decode()

In [2]:
# TODO(jlewi): Was GBQ prodding me for an oauth token? It should be using workload identity? Is this because
# metadata server is unavailable?
query = '''SELECT *
FROM (
  SELECT
    updated_at
    , MAX(updated_at) OVER (PARTITION BY url) as last_time
    , FORMAT("%T", ARRAY_CONCAT_AGG(labels)) as labels
    , repo, url, title, body, len_labels
  FROM(
      SELECT
          TIMESTAMP(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.updated_at'), "\"", "")) as updated_at
        , REGEXP_EXTRACT(JSON_EXTRACT(payload, '$.issue.url'), r'https://api.github.com/repos/(.*)/issues') as repo
        , JSON_EXTRACT(payload, '$.issue.url') as url
          -- extract the title and body removing parentheses, brackets, and quotes
        , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.title'), r"\\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as title
        , LOWER(TRIM(REGEXP_REPLACE(JSON_EXTRACT(payload, '$.issue.body'), r"\\n|\(|\)|\[|\]|#|\*|`|\"", ' '))) as body
        , REGEXP_EXTRACT_ALL(JSON_EXTRACT(payload, "$.issue.labels"), ',"name\":"(.+?)","color') as labels
        , ARRAY_LENGTH(REGEXP_EXTRACT_ALL(JSON_EXTRACT(payload, "$.issue.labels"), ',"name\":"(.+?)","color')) as len_labels
      FROM `githubarchive.month.20*`
      WHERE 
         type="IssuesEvent"
  )
  WHERE 
    repo = 'kubeflow/kubeflow'
  GROUP BY updated_at, repo, url, title, body, len_labels
)
WHERE last_time = updated_at and len_labels >= 1
'''

# This second assignment overrides the query above; only this query is executed.
query = """SELECT
          JSON_EXTRACT(payload, '$.issue.html_url') as html_url,
          JSON_EXTRACT(payload, '$.issue.title') as title,
          JSON_EXTRACT(payload, '$.issue.body') as body,
          JSON_EXTRACT(payload, "$.issue.labels") as labels,
          JSON_EXTRACT(payload, "$.issue.updated_at") as updated_at,
          org.login,
          type,
      FROM `githubarchive.month.20*`
      WHERE  (type="IssuesEvent" or type="IssueCommentEvent") and org.login = 'kubeflow'"""
issues_and_pulls=gbq.read_gbq(query, dialect='standard', project_id=PROJECT)

Downloading: 100%|██████████| 137061/137061 [00:34<00:00, 3985.16rows/s]


* Pull request comments also get included so we need to filter those out

In [5]:
import re
# Use a raw string so that \d is passed to the regex engine unchanged
pattern = re.compile(r".*issues/[\d]+")
issues_index = issues_and_pulls["html_url"].apply(lambda x: pattern.match(x) is not None)
issues=issues_and_pulls[issues_index]
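
As a small illustration of the filter above: comments on pull requests carry an `html_url` containing `/pull/<number>`, while issues use `/issues/<number>`. The sketch below applies the same pattern to two sample URLs. The values keep their surrounding double quotes because `JSON_EXTRACT` returns JSON-encoded strings. The pull request number is a made-up example, and `is_issue_url` is an illustrative helper name.

```python
import re

# Matches issue URLs such as .../issues/4916; pull request URLs use /pull/<n>.
ISSUE_URL_PATTERN = re.compile(r".*issues/\d+")


def is_issue_url(html_url: str) -> bool:
    """Return True if the (JSON-quoted) URL points at an issue."""
    return ISSUE_URL_PATTERN.match(html_url) is not None


sample_urls = [
    '"https://github.com/kubeflow/kubeflow/issues/4916"',
    '"https://github.com/kubeflow/kubeflow/pull/1234"',  # hypothetical PR number
]
issue_urls = [u for u in sample_urls if is_issue_url(u)]
print(issue_urls)
```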

* We need to group the events by issue and then select the most recent event for each issue, as that should have
  the most up-to-date labels for that issue.
* TODO(jlewi): We should look for the most recent event in the dataset and raise an alert if its age exceeds some
  limit, as that indicates the data isn't up to date.

In [7]:
latest_issues = issues.groupby("html_url", as_index=False).apply(lambda x: x.sort_values(["updated_at"]).iloc[-1])
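
The selection logic, together with the freshness check mentioned in the TODO, can be sketched in plain Python on a few hand-written events. This is an illustration under assumptions, not the notebook's implementation: the helper names (`parse_updated_at`, `latest_event_per_issue`, `is_stale`) and the age limits are hypothetical, and the events are plain dicts rather than DataFrame rows. The timestamps are JSON-quoted strings, as returned by `JSON_EXTRACT`.

```python
import json
from datetime import datetime, timedelta, timezone


def parse_updated_at(raw: str) -> datetime:
    """Parse a JSON-quoted timestamp such as '"2020-04-04T23:48:44Z"'."""
    value = json.loads(raw)  # strips the surrounding JSON quotes
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def latest_event_per_issue(events):
    """Map each html_url to the event with the newest updated_at."""
    latest = {}
    for event in events:
        url = event["html_url"]
        current = latest.get(url)
        if current is None or (
            parse_updated_at(event["updated_at"])
            > parse_updated_at(current["updated_at"])
        ):
            latest[url] = event
    return latest


def is_stale(events, now: datetime, max_age: timedelta) -> bool:
    """Return True if even the newest event is older than max_age."""
    newest = max(parse_updated_at(e["updated_at"]) for e in events)
    return now - newest > max_age


url = '"https://github.com/kubeflow/kubeflow/issues/4916"'
events = [
    {"html_url": url, "updated_at": '"2020-04-03T14:20:36Z"', "type": "IssuesEvent"},
    {"html_url": url, "updated_at": '"2020-04-04T23:48:44Z"', "type": "IssueCommentEvent"},
]
latest = latest_event_per_issue(events)
now = datetime(2020, 4, 8, tzinfo=timezone.utc)
print(latest[url]["type"], is_stale(events, now, timedelta(days=7)))
```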

In [12]:
# Example of fetching a specific issue
# This allows easy spot checking of the data
some_issue = "https://github.com/kubeflow/kubeflow/issues/4916"
test_issue = latest_issues.loc[latest_issues["html_url"]==f'"{some_issue}"']
test_issue

Unnamed: 0,html_url,title,body,labels,updated_at,login,type
4299,"""https://github.com/kubeflow/kubeflow/issues/4...","""Open Data Hub & Kubeflow relationship""","""/kind question\r\n\r\nHi all,\r\n\r\nAs some ...","[{""id"":1182962369,""node_id"":""MDU6TGFiZWwxMTgyO...","""2020-04-04T23:48:44Z""",kubeflow,IssueCommentEvent


In [13]:
test_issue["labels"]

4299    [{"id":1182962369,"node_id":"MDU6TGFiZWwxMTgyO...
Name: labels, dtype: object

In [64]:
query = """SELECT          
          JSON_EXTRACT(payload, '$.issue.html_url') as html_url,
          JSON_EXTRACT(payload, '$.issue.title') as title,
          JSON_EXTRACT(payload, '$.issue.body') as body,
          JSON_EXTRACT(payload, "$.issue.labels") as labels,
          JSON_EXTRACT(payload, "$.issue.updated_at") as updated_at, 
          type,
      FROM `githubarchive.month.20*`
      WHERE  org.login='kubeflow' and JSON_EXTRACT(payload, '$.issue.html_url') = '"https://github.com/kubeflow/kubeflow/issues/4916"' """
issue_rows=gbq.read_gbq(query, dialect='standard', project_id=PROJECT)

Downloading: 100%|██████████| 6/6 [00:00<00:00, 30.18rows/s]


In [65]:
issue_rows

Unnamed: 0,html_url,title,body,labels,updated_at,type
0,"""https://github.com/kubeflow/kubeflow/issues/4...","""Open Data Hub & Kubeflow relationship""","""/kind question\r\n\r\nHi all,\r\n\r\nAs some ...","[{""id"":1182962369,""node_id"":""MDU6TGFiZWwxMTgyO...","""2020-04-04T23:48:44Z""",IssueCommentEvent
1,"""https://github.com/kubeflow/kubeflow/issues/4...","""Open Data Hub & Kubeflow relationship""","""/kind question\r\n\r\nHi all,\r\n\r\nAs some ...","[{""id"":765839028,""node_id"":""MDU6TGFiZWw3NjU4Mz...","""2020-04-03T15:08:46Z""",IssueCommentEvent
2,"""https://github.com/kubeflow/kubeflow/issues/4...","""Open Data Hub & Kubeflow relationship""","""/kind question\r\n\r\nHi all,\r\n\r\nAs some ...",[],"""2020-04-03T14:20:36Z""",IssuesEvent
3,"""https://github.com/kubeflow/kubeflow/issues/4...","""Open Data Hub & Kubeflow relationship""","""/kind question\r\n\r\nHi all,\r\n\r\nAs some ...","[{""id"":765839028,""node_id"":""MDU6TGFiZWw3NjU4Mz...","""2020-04-04T02:56:24Z""",IssueCommentEvent
4,"""https://github.com/kubeflow/kubeflow/issues/4...","""Open Data Hub & Kubeflow relationship""","""/kind question\r\n\r\nHi all,\r\n\r\nAs some ...","[{""id"":765839028,""node_id"":""MDU6TGFiZWw3NjU4Mz...","""2020-04-03T14:20:53Z""",IssueCommentEvent
5,"""https://github.com/kubeflow/kubeflow/issues/4...","""Open Data Hub & Kubeflow relationship""","""/kind question\r\n\r\nHi all,\r\n\r\nAs some ...","[{""id"":765839028,""node_id"":""MDU6TGFiZWw3NjU4Mz...","""2020-04-03T15:40:27Z""",IssueCommentEvent


In [32]:

import json
json.loads(issues.loc[8]["labels"])

[{'id': 892266354,
  'node_id': 'MDU6TGFiZWw4OTIyNjYzNTQ=',
  'url': 'https://api.github.com/repos/kubeflow/website/labels/improvement/enhancement',
  'name': 'improvement/enhancement',
  'color': '00daff',
  'default': False}]
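
Because `JSON_EXTRACT` returns JSON-encoded strings, the `labels` column has to be decoded before the label names can be used. A minimal sketch, where `label_names` is an illustrative helper name and the sample string is an abbreviated form of the label shown above:

```python
import json


def label_names(labels_json: str):
    """Decode a JSON-encoded list of label objects and return their names."""
    return [label["name"] for label in json.loads(labels_json)]


raw_labels = (
    '[{"id": 892266354, "name": "improvement/enhancement", '
    '"color": "00daff", "default": false}]'
)
print(label_names(raw_labels))  # ['improvement/enhancement']
print(label_names("[]"))        # []

# The same decoding removes the extra quotes around scalar columns such as html_url.
print(json.loads('"https://github.com/kubeflow/kubeflow/issues/4916"'))
```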

## Compute Embeddings

* For each repo compute the embeddings and save to GCS
* TODO(jlewi): Can we use the metadata storage to keep track of artifacts?

In [87]:
embeddings_dir = "gs://repo-embeddings/kubeflow/20200427"

In [103]:
%%time
import dill as dpickle
import tempfile

repo_names.sort()
for repo in repo_names:
    # TODO(https://github.com/kubeflow/code-intelligence/issues/123): Use HDF5 files
    embeddings_file = os.path.join(embeddings_dir, repo + ".pkl")
    if gcs_util.check_gcs_object(embeddings_file):
        logging.info(f"Skipping repo {repo}; File {embeddings_file} exists")
        continue
        
    logging.info(f"Processing issues for repo {repo}")
    try:
        # TODO(jlewi): get_all_issue_text should be refactored. We shouldn't couple computing inference with
        # fetching of the issues. We should decouple the calls so its easier to change how we fetch the issues.
        # We might want to support fetching the data either via the GraphQL API and/or BigQuery.
        # Using BigQuery might be faster for bulk pulls but might only work for public repositories.
        # The notebook 07_Get_Repo_TrainingData_BigQuery.ipynb has code for fetching using BigQuery.
        #
        # There is code here: https://github.com/kubeflow/code-intelligence/blob/9bbdce34fc0d81bfb9a63493941763771d2a0746/Issue_Triage/notebooks/triage.ipynb
        # For downloading all the issues with the GitHub API. That will likely run into API limits.
        repo_embeddings = embeddings.get_all_issue_text(owner='kubeflow', repo=repo, inf_wrapper=inference_wrapper)
    except TypeError as e:
        logging.error(f"Exception {e} occurred for repo {repo}")
        # Skip this repo; repo_embeddings is undefined or stale after a failure
        continue
    local_file = None
    with tempfile.NamedTemporaryFile(delete=False, mode='wb') as f:
        dpickle.dump(repo_embeddings, f)
        local_file = f.name
        
    gcs_util.copy_to_gcs(local_file, embeddings_file)

Skipping repo arena; File gs://repo-embeddings/kubeflow/20200427/arena.pkl exists
Skipping repo batch-predict; File gs://repo-embeddings/kubeflow/20200427/batch-predict.pkl exists
Skipping repo caffe2-operator; File gs://repo-embeddings/kubeflow/20200427/caffe2-operator.pkl exists
Skipping repo chainer-operator; File gs://repo-embeddings/kubeflow/20200427/chainer-operator.pkl exists
Skipping repo code-intelligence; File gs://repo-embeddings/kubeflow/20200427/code-intelligence.pkl exists
Skipping repo common; File gs://repo-embeddings/kubeflow/20200427/common.pkl exists
Procesing issues for repo community


HTTPError: 429 Client Error: too many requests for url: https://github.com/kubeflow/community/issues
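
The run above stopped on an HTTP 429 (too many requests) response. One possible mitigation, sketched here as an assumption rather than something this notebook does, is to retry the failing call with exponential backoff. `call_with_backoff` and its parameters are hypothetical names. The `sleep` argument is injectable so the function can be exercised without waiting.

```python
import time


def call_with_backoff(func, is_retryable, max_attempts=5, base_delay=1.0,
                      sleep=time.sleep):
    """Call func(), retrying retryable errors with exponentially growing delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception as e:
            # Give up on non-retryable errors or once the attempts are used up.
            if not is_retryable(e) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ... by default
```

In this notebook the wrapped call might be `lambda: embeddings.get_all_issue_text(owner='kubeflow', repo=repo, inf_wrapper=inference_wrapper)`, with `is_retryable` checking for a `requests.HTTPError` whose `response.status_code` is 429. The `except TypeError` handler in the loop would still be needed for other failures.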

In [102]:
import importlib
importlib.reload(embeddings)

<module 'code_intelligence.embeddings' from '/home/jovyan/git_kubeflow-code-intelligence/py/code_intelligence/embeddings.py'>

# Notes

It takes 4 minutes to retrieve embeddings and labels for `kubeflow/kubeflow`. This time can likely be brought down to 1 minute by batching the text instead of feeding issues to the language model one by one.
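
As a rough sketch of the batching idea, the issue texts could be split into fixed-size chunks, with each chunk passed to the model in a single call. Only the chunking is shown here. The batch size is arbitrary, `batched` is an illustrative helper name, and a model function that accepts a list of texts is an assumption not confirmed by this notebook.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


texts = ["issue one", "issue two", "issue three", "issue four", "issue five"]
for batch in batched(texts, batch_size=2):
    # A batch-capable embedding call would go here instead of print.
    print(len(batch), batch)
```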