# Fetch GitHub Issues and Compute Embeddings

* This notebook downloads GitHub Issues and then computes the embeddings using a trained model
* [issues_loader.ipynb](../Label_Microservice/notebooks/issues_loader.ipynb) is a very similar notebook

   * That notebook, however, only uses the IssuesLoader class as a way of hard-coding some paths.

## Running this Notebook

This notebook is run in the container: [hamelsmu/ml-gpu-issue-lang-model](https://github.com/machine-learning-apps/IssuesLanguageModel/blob/master/gpu.Dockerfile)

This container is publicly available [on Dockerhub](https://cloud.docker.com/u/hamelsmu/repository/docker/hamelsmu/ml-gpu-issue-lang-model)

#### Compute: This notebook was run on a [p3.8xlarge](https://aws.amazon.com/ec2/instance-types/p3/) on AWS
Tesla V100 GPU, 32 vCPUs, 244 GB of memory

In [77]:
import logging
import os
from pathlib import Path
import sys

logging.basicConfig(format='%(message)s')
logging.getLogger().setLevel(logging.INFO)

home = str(Path.home())

# Installing the python packages locally doesn't appear to add them to the path
# automatically, so we need to add the directory manually
local_py_path = os.path.join(home, ".local/lib/python3.6/site-packages")

for p in [local_py_path, os.path.abspath("../../py")]:
    if p not in sys.path:
        logging.info("Adding %s to python path", p)
        # Insert at front because we want to override any installed packages
        sys.path.insert(0, p)


In [72]:
!pip3 install --user -r ../requirements.txt

You should consider upgrading via the 'pip install --upgrade pip' command.


In [57]:
from bs4 import BeautifulSoup
import requests
from fastai.core import parallel, partial

from collections import Counter
from tqdm import tqdm_notebook
import torch
from code_intelligence import embeddings
from code_intelligence import graphql
from code_intelligence import gcs_util
from google.cloud import storage

## Get a List of Kubeflow Repos

* You will need to either set a GitHub token or use a GitHub App in order to call the API

In [55]:
if not os.getenv("GITHUB_TOKEN"):
    logging.warning("No GitHub token set; defaulting to hardcoded list of Kubeflow repositories")
    
    # The list of repos can be updated using the else block
    repo_names = ['arena', 'batch-predict', 'caffe2-operator', 'chainer-operator', 'code-intelligence', 'common', 'community', 'crd-validation', 'example-seldon', 'examples', 'fairing', 'features', 'frontend', 'homebrew-cask', 'homebrew-core', 'internal-acls', 'katib', 'kfctl', 'kfp-tekton', 'kfserving', 'kubebench', 'kubeflow', 'manifests', 'marketing-materials', 'metadata', 'mpi-operator', 'mxnet-operator', 'pipelines', 'pytorch-operator', 'reporting', 'testing', 'tf-operator', 'triage-issues', 'website', 'xgboost-operator']
else:
    gh_client = graphql.GraphQLClient()
        
    repo_query="""query repoQuery($org: String!) {
       organization(login: $org) {
        repositories(first:100) {
          totalCount 
          edges {
            node {
              name
            }
          }
        }
      }
    }
    """
    variables = {
        "org": "kubeflow",
    }
    results = gh_client.run_query(repo_query, variables)
    repo_nodes = graphql.unpack_and_split_nodes(results, ["data", "organization", "repositories", "edges"])
    repo_names = [n["name"] for n in repo_nodes]

    # Print the sorted names so the hardcoded list above can be refreshed
    names_str = ", ".join([f"'{n}'" for n in sorted(repo_names)])
    print(f"[{names_str}]")

GraphQLClient is defaulting to FixedAccessTokenGenerator based on environment variables. This is deprecated. Caller should explicitly pass in a instance via header_generator. Traceback:
<function extract_stack at 0x7f91e275f6a8>


['arena', 'batch-predict', 'caffe2-operator', 'chainer-operator', 'code-intelligence', 'common', 'community', 'crd-validation', 'example-seldon', 'examples', 'fairing', 'features', 'frontend', 'homebrew-cask', 'homebrew-core', 'internal-acls', 'katib', 'kfctl', 'kfp-tekton', 'kfserving', 'kubebench', 'kubeflow', 'manifests', 'marketing-materials', 'metadata', 'mpi-operator', 'mxnet-operator', 'pipelines', 'pytorch-operator', 'reporting', 'testing', 'tf-operator', 'triage-issues', 'website', 'xgboost-operator']
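
The query in the cell above requests `repositories(first:100)`, so it returns at most 100 repositories. If the organization grows past that, the GitHub GraphQL API can be paged with `pageInfo` and an `after` cursor. The following is a minimal sketch, not the notebook's code: `list_all_repo_names` is a hypothetical helper, and it assumes that `run_query(query, variables)` returns the decoded JSON response as a dict, which is how `gh_client.run_query` appears to behave above.

```python
from typing import Any, Callable, Dict, List, Optional

# Same shape as the notebook's query, extended with a cursor and pageInfo.
PAGED_REPO_QUERY = """query repoQuery($org: String!, $cursor: String) {
  organization(login: $org) {
    repositories(first: 100, after: $cursor) {
      pageInfo { hasNextPage endCursor }
      edges { node { name } }
    }
  }
}
"""


def list_all_repo_names(
    run_query: Callable[[str, Dict[str, Any]], Dict[str, Any]], org: str
) -> List[str]:
    """Collect repository names across all pages (hypothetical helper).

    Assumes run_query returns the decoded JSON response as a dict.
    """
    names: List[str] = []
    cursor: Optional[str] = None
    while True:
        results = run_query(PAGED_REPO_QUERY, {"org": org, "cursor": cursor})
        repos = results["data"]["organization"]["repositories"]
        names.extend(edge["node"]["name"] for edge in repos["edges"])
        if not repos["pageInfo"]["hasNextPage"]:
            return names
        # Continue from where the previous page ended.
        cursor = repos["pageInfo"]["endCursor"]
```

Under that assumption, `list_all_repo_names(gh_client.run_query, "kubeflow")` could replace the single-page query.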


## Get The Data

In [3]:
%load_ext autoreload
%autoreload 2
import pandas as pd
from inference import InferenceWrapper

## Load Model Artifacts (Download from GCS if Not Available Locally)

* We need to load the model used to compute embeddings

In [4]:
from pathlib import Path
from urllib import request as request_url

def pass_through(x):
    return x

model_url = 'https://storage.googleapis.com/issue_label_bot/model/lang_model/models_22zkdqlr/trained_model_22zkdqlr.pkl'
inference_wrapper = embeddings.load_model_artifact(model_url)

#### Warning: The cell below benefits tremendously from parallelism; the more cores your machine has, the better

* The code will fail if you aren't running with a GPU
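
Since the embedding step needs a GPU, it can help to check for one before starting a long run. The following is a small sketch, not part of the original notebook: `gpu_available` is a hypothetical helper, and it relies only on `torch.cuda.is_available()`.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed, so there is no usable GPU for the model.
        return False
    import torch

    return torch.cuda.is_available()


if not gpu_available():
    print("No CUDA GPU detected; the embedding cell below is expected to fail.")
```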

## Compute Embeddings

* For each repo compute the embeddings and save to GCS
* TODO(jlewi): Can we use the metadata storage to keep track of artifacts?
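
The "compute, then save" step can be sketched in isolation as follows. This is an illustration under stated assumptions, not the notebook's implementation: `save_via_temp_file` and its `upload` callback are hypothetical names, and the standard-library `pickle` stands in for `dill`. Unlike a bare `NamedTemporaryFile(delete=False)`, it removes the local file once the upload finishes.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def save_via_temp_file(obj: Any, upload: Callable[[str], None]) -> None:
    """Pickle obj to a temporary file, pass its path to upload, then clean up.

    `upload` is a hypothetical callback that receives the local file path.
    """
    with tempfile.NamedTemporaryFile(delete=False, mode="wb", suffix=".pkl") as f:
        pickle.dump(obj, f)
        local_file = f.name
    try:
        # The file is closed at this point, so it is safe to read from it.
        upload(local_file)
    finally:
        os.remove(local_file)
```

In this notebook the callback could wrap the existing helper, for example `lambda path: gcs_util.copy_to_gcs(path, embeddings_file)`.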

In [87]:
embeddings_dir = "gs://repo-embeddings/kubeflow/20200427"

In [None]:
%%time
import dill as dpickle
import tempfile

for repo in repo_names:
    # TODO(https://github.com/kubeflow/code-intelligence/issues/123): Use HDF5 files
    embeddings_file = os.path.join(embeddings_dir, repo + ".pkl")
    if gcs_util.check_gcs_object(embeddings_file):
        logging.info(f"Skipping repo {repo}; File {embeddings_file} exists")
        continue
        
    logging.info(f"Processing issues for repo {repo}")
    repo_embeddings = embeddings.get_all_issue_text(owner='kubeflow', repo=repo, inf_wrapper=inference_wrapper)

    local_file = None
    with tempfile.NamedTemporaryFile(delete=False, mode='wb') as f:
        dpickle.dump(repo_embeddings, f)
        local_file = f.name

    gcs_util.copy_to_gcs(local_file, embeddings_file)

Processing issues for repo tf-operator


In [88]:
embeddings_file = os.path.join(embeddings_dir, repo + ".pkl")

In [93]:
gcs_util.copy_to_gcs(local_file, embeddings_file)

In [82]:
import importlib

importlib.reload(gcs_util)

<module 'code_intelligence.gcs_util' from '/home/jovyan/git_kubeflow-code-intelligence/py/code_intelligence/gcs_util.py'>


# Notes

It takes 4 minutes to retrieve embeddings and labels for `kubeflow/kubeflow`. This time can likely be brought down to 1 minute by batching the text instead of feeding issues to the language model one at a time.
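
The batching idea can be sketched as below. This is only an illustration: `batched`, `embed_in_batches`, and the `embed_batch` callback are hypothetical names, and this notebook does not show whether the language model wrapper exposes a batch interface.

```python
from typing import Callable, Iterable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
E = TypeVar("E")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final, possibly smaller, batch.
        yield batch


def embed_in_batches(
    texts: Iterable[str],
    embed_batch: Callable[[List[str]], Sequence[E]],
    batch_size: int = 32,
) -> List[E]:
    """Call the (hypothetical) embed_batch once per batch, preserving order."""
    results: List[E] = []
    for batch in batched(texts, batch_size):
        results.extend(embed_batch(batch))
    return results
```

The batch size of 32 is an arbitrary default; a suitable value depends on GPU memory and issue length.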