# Code Search on Kubeflow

This notebook implements end-to-end semantic code search on top of [Kubeflow](https://www.kubeflow.org/): given an input query string, it returns a list of code snippets semantically similar to that string.

**NOTE**: If you haven't already, see [kubeflow/examples/code_search](https://github.com/kubeflow/examples/tree/master/code_search) for instructions on how to get this notebook.

## Install dependencies

Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This will take a while the first time.

### Verify Version Information

In [None]:
%%bash

echo "Pip Version Info: " && python2 --version && python2 -m pip --version && echo
echo "Google Cloud SDK Info: " && gcloud --version && echo
echo "Kubectl Version Info: " && kubectl version

### Install Pip Packages

In [None]:
! python2 -m pip install -U pip

In [None]:
# Code Search dependencies
! python2 -m pip install --user https://github.com/kubeflow/batch-predict/tarball/master
! python2 -m pip install --user -r src/requirements.ui.txt
! python2 -m pip install --user -r src/requirements.nmslib.txt
! python2 -m pip install --user -r src/requirements.dataflow.txt

In [None]:
# BigQuery Cell Dependencies
! python2 -m pip install --user pandas-gbq

In [None]:
# NOTE: The RuntimeWarnings (if any) are harmless. See ContinuumIO/anaconda-issues#6678.
from pandas.io import gbq

## Set Up the Environment

This involves setting up the Ksonnet application, utility environment variables for various CLI steps, GCS bucket, and BigQuery dataset.

### Set Up Authorization

In a Kubeflow cluster on GKE, we already have the Google Application Credentials mounted onto each Pod. We can simply point `gcloud` to activate that service account.

In [None]:
%%bash

# Activate Service Account provided by Kubeflow.
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}

Additionally, to interact with the underlying cluster, we configure `kubectl`.

In [None]:
%%bash

kubectl config set-cluster kubeflow --server=https://kubernetes.default --certificate-authority=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
kubectl config set-credentials jupyter --token "$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
kubectl config set-context kubeflow --cluster kubeflow --user jupyter
kubectl config use-context kubeflow

Collectively, these allow us to interact with Google Cloud Services as well as the Kubernetes Cluster directly to submit `TFJob`s and execute `Dataflow` pipelines.

### Create PROJECT variable

Set PROJECT to the GCP project you want to use.
* If gcloud has a project set, this will be used by default.
* To use a different project or if gcloud doesn't have a project set, you will need to configure one explicitly.

In [None]:
import subprocess

PROJECT = subprocess.check_output(["gcloud", "config", "get-value", "project"]).strip()

# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env PROJECT $PROJECT

### Create a storage bucket

Create a GCS bucket to store data.

In [None]:
%%bash

gsutil mb -p $PROJECT gs://code-search

## Define an experiment

* This solution consists of multiple jobs and servers that need to share parameters.
* To facilitate this we use [params.json](https://github.com/kubeflow/examples/blob/master/code_search/kubeflow/components/params.jason) to define sets of parameters.
* You configure an experiment to run by defining a set of parameters in the **current** field.
* You can save an old experiment's parameters in a different field (not **current**); the name of that field does not matter because the old experiment is not being run.

To get started, define your experiment:

* Create or modify the **current** entry containing a set of values to be used for your experiment.
* Set the following values:

    * **project**: Set this to the GCP project you are working on.
    * **experiment**: Experiment's name.
    * **bq_target_dataset**: BigQuery dataset to store data.
    * **data_dir**: The data directory in GCS to be used by T2T.
    * **working_dir**: The working directory in GCS to store temporary data, stages, and output models.
    * **model_dir**: 
        * After training your model, set this to a GCS directory containing the exported model.
        * e.g. `gs://code-search/20190402/working/output/export/1533685294`
    * **embedding_dir**: The embedding directory in GCS to store functions' embeddings.
    * **t2t_problem**: Set this to "kf_github_function_docstring".
    * **t2t_model**: Set this to "kf_similarity_transformer".
    * **train_steps**: Number of training steps.
    * **eval_steps**: Number of steps to be used for eval.
    * **hparams_set**: The set of hyperparameters to use; see some suggestions [here](https://github.com/tensorflow/tensor2tensor#language-modeling).
    * **lookup_file**: Set this to the GCS location of the CSV produced by the job to create the nmslib index of the embeddings for all GitHub data.
    * **index_file**: Set this to the GCS location of the nmslib index for all the data in GitHub.
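
Before launching any job, it can help to confirm that the **current** entry defines every value listed above. The following is a minimal sketch of such a check, not part of the example repository: the helper name `missing_params` is hypothetical, and treating all listed keys as required is an assumption.

```python
import json

# Keys taken from the list above; treating all of them as required is an
# assumption made for this sketch.
REQUIRED_KEYS = [
    "project", "experiment", "bq_target_dataset", "data_dir", "working_dir",
    "model_dir", "embedding_dir", "t2t_problem", "t2t_model", "train_steps",
    "eval_steps", "hparams_set", "lookup_file", "index_file",
]


def missing_params(config_text, experiment="current"):
    """Return the sorted list of required keys absent from one experiment entry."""
    data = json.loads(config_text)
    params = data.get(experiment, {})
    return sorted(key for key in REQUIRED_KEYS if key not in params)


# Hypothetical, deliberately incomplete configuration used to exercise the check.
example = json.dumps({
    "current": {
        "project": "my-project",
        "experiment": "demo",
        "working_dir": "gs://code-search/demo/working",
    }
})
print(missing_params(example))
```

In the notebook, the same function could be called on the contents of `params.json` before the environment variables are set.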

### Set up environment variables

Set up environment variables according to `params.json`.

In [None]:
import json
import os

params_file = 'params.json'

with open(params_file) as json_file:
    data = json.load(json_file)
    params = data['current']
    WORKING_DIR = params['working_dir']
    DATA_DIR = params['data_dir']
    EMBEDDING_DIR = params['embedding_dir']
    TARGET_DATASET = params['bq_target_dataset']
    
# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env WORKING_DIR $WORKING_DIR
%env DATA_DIR $DATA_DIR
%env TARGET_DATASET $TARGET_DATASET
%env EMBEDDING_DIR $EMBEDDING_DIR

### Create the BigQuery dataset

In [None]:
%%bash
bq mk ${PROJECT}:${TARGET_DATASET}

## View GitHub Files

This is the query run as the first step of the pre-processing pipeline; its results are then sent through a set of transformations. The output illustrates the rows processed by the pipeline we trigger next.

**WARNING**: The table is large and the query can take a few minutes to complete.

In [None]:
query = """
  SELECT
    MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,
    c.content
  FROM
    `bigquery-public-data.github_repos.files` AS f
  JOIN
    `bigquery-public-data.github_repos.contents` AS c
  ON
    f.id = c.id
  JOIN (
      --this part of the query makes sure repo is watched at least twice since 2017
    SELECT
      repo
    FROM (
      SELECT
        repo.name AS repo
      FROM
        `githubarchive.year.2017`
      WHERE
        type="WatchEvent"
      UNION ALL
      SELECT
        repo.name AS repo
      FROM
        `githubarchive.month.2018*`
      WHERE
        type="WatchEvent" )
    GROUP BY
      1
    HAVING
      COUNT(*) >= 2 ) AS r
  ON
    f.repo_name = r.repo
  WHERE
    f.path LIKE '%.py' AND --with python extension
    c.size < 15000 AND --get rid of ridiculously long files
    REGEXP_CONTAINS(c.content, r'def ') --contains function definition
  GROUP BY
    c.content
  LIMIT
    10
"""

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
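
To make the `WHERE` clause of the query concrete, here is a small sketch that applies the same three conditions to rows held in memory. It is illustrative only: the function name `keep_file` and the sample rows are invented, and BigQuery itself evaluates these conditions server-side.

```python
import re

MAX_SIZE_BYTES = 15000  # same threshold as `c.size < 15000` in the query


def keep_file(path, size, content):
    """Mirror the WHERE clause: Python file, not too long, defines a function."""
    return (
        path.endswith(".py")                         # f.path LIKE '%.py'
        and size < MAX_SIZE_BYTES                    # c.size < 15000
        and re.search(r"def ", content) is not None  # REGEXP_CONTAINS(c.content, r'def ')
    )


# Hypothetical rows, invented for illustration.
rows = [
    ("utils/io.py", 120, "def read(path):\n    return open(path).read()\n"),
    ("README.md", 80, "def is mentioned here, but this is not Python"),
    ("big/generated.py", 20000, "def f():\n    pass\n"),
    ("pkg/__init__.py", 10, "VERSION = '1.0'\n"),
]
kept = [path for path, size, content in rows if keep_file(path, size, content)]
print(kept)
```

Only the first row passes all three conditions; the others fail on extension, size, and missing function definition respectively.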

## Pre-Processing GitHub Files

In this step, we use [Google Cloud Dataflow](https://cloud.google.com/dataflow/) to preprocess the data.

* We use a K8s Job to run a Python program, `code_search.dataflow.cli.preprocess_github_dataset`, that submits the Dataflow job.
* Once the job has been created, it can be monitored using the Dataflow console.

### Submit the Dataflow Job

Configure the job description. This script produces the `jobs/submit-preprocess-job.yaml` file.

In [None]:
%%bash

python configure_job.py kubeflow/components/submit-preprocess-job.yaml

Create and run the job.

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/submit-preprocess-job.yaml

When completed successfully, this should create a dataset in `BigQuery` named `$TARGET_DATASET`. It also writes CSV files into `$DATA_DIR` which contain training samples (pairs of functions and docstrings) for our TensorFlow model. A representative set of results can be viewed using the following query.

In [None]:
query = """
  SELECT * 
  FROM 
    {}.token_pairs
  LIMIT
    10
""".format(TARGET_DATASET)

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
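
As a rough illustration of what a function and docstring pair is, the sketch below pulls such pairs out of a piece of Python source with the standard `ast` module. This is an assumption-laden stand-in, not the Dataflow pipeline's actual implementation: the helper name `function_docstring_pairs` is hypothetical, and the real pipeline also tokenizes its output.

```python
import ast


def function_docstring_pairs(source):
    """Return (function name, docstring) for each documented function in `source`."""
    pairs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            docstring = ast.get_docstring(node)
            if docstring:  # skip functions without a docstring
                pairs.append((node.name, docstring))
    return pairs


# Hypothetical source file contents, invented for illustration.
sample_source = '''
def add(a, b):
    """Add two numbers."""
    return a + b

def helper(x):
    return x
'''
print(function_docstring_pairs(sample_source))
```

Functions without a docstring, such as `helper` here, yield no pair and so contribute no training sample in this sketch.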

This pipeline also writes a set of CSV files which contain function and docstring pairs delimited by a comma. Here, we list a subset of them.

In [None]:
%%bash

LIMIT=10

gsutil ls ${DATA_DIR}/*.csv | head -n ${LIMIT}
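
Once one of these CSV files has been copied locally, its rows can be inspected with the standard `csv` module. The sketch below assumes two columns, function first and docstring second; that column order, the helper name `read_pairs`, and the sample contents are assumptions made for illustration.

```python
import csv


def read_pairs(csv_text):
    """Parse comma-delimited rows into (function, docstring) tuples."""
    reader = csv.reader(csv_text.splitlines())
    return [(row[0], row[1]) for row in reader if len(row) >= 2]


# Hypothetical file contents; the real column order and tokenization may differ.
sample = (
    '"def add a b return a + b",add two numbers\n'
    '"def greet name print hello , name",print a greeting\n'
)
for function_tokens, docstring_tokens in read_pairs(sample):
    print("{} -> {}".format(docstring_tokens, function_tokens))
```

Using `csv.reader` rather than a plain `split(",")` matters because function bodies can themselves contain commas, which the writer must quote.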

## Prepare Dataset for Training

We will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format.

In [None]:
%%bash

python configure_job.py kubeflow/components/t2t-code-search-datagen.yaml

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/t2t-code-search-datagen.yaml

Once this job finishes, the data directory should have a vocabulary file and a list of `TFRecords` prefixed by the problem name, which in our case is `github_function_docstring_extended`. Here, we list a subset of them.

In [None]:
%%bash

LIMIT=10

gsutil ls ${DATA_DIR}/vocab*
gsutil ls ${DATA_DIR}/*train* | head -n ${LIMIT}

## Execute TensorFlow Training

Once the `TFRecords` are generated, we will use `t2t-trainer` to execute the training.

In [None]:
%%bash

python configure_job.py kubeflow/components/t2t-code-search-trainer.yaml

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/t2t-code-search-trainer.yaml

This will generate TensorFlow model checkpoints, which are listed below.

In [None]:
%%bash

gsutil ls ${WORKING_DIR}/output/*ckpt*

## Export TensorFlow Model

We now use `t2t-exporter` to export the `TFModel`.

In [None]:
%%bash

python configure_job.py kubeflow/components/t2t-code-search-exporter.yaml

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/t2t-code-search-exporter.yaml

Once completed, this will generate a TensorFlow `SavedModel` which we will use for both online inference (via `TF Serving`) and offline inference (via `Kubeflow Batch Prediction`). TF Serving expects this directory to consist of numeric subdirectories corresponding to different versions of the model. Each subdirectory contains the saved model in protocol buffer format along with the weights.

In [None]:
%%bash

gsutil ls ${WORKING_DIR}/output/export

## Compute Function Embeddings

In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below.

### Configuration

First, select an exported model version from `${WORKING_DIR}/output/export`, as listed above. This should be the name of a folder that is a UNIX timestamp in seconds, such as `1533685294`. Below, we do this automatically by selecting the folder with the latest timestamp.

In [None]:
%%bash --out EXPORT_DIR_LS

gsutil ls ${WORKING_DIR}/output/export | grep -oE "([0-9]+)/$"

In [None]:
# WARNING: This routine will fail if no export has been completed successfully.
MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\n') if ts])

# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env MODEL_VERSION $MODEL_VERSION

Modify `params.json` and set **model_dir** to the directory computed above.
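
The value for **model_dir** is the export directory joined with the selected version. As a sketch, the helper below picks the newest numeric subdirectory from a `gsutil ls` style listing and builds that path; the function name `latest_export_dir` and the sample listing are hypothetical.

```python
import re


def latest_export_dir(listing, export_root):
    """Pick the newest numeric version from a listing and build its full path."""
    versions = []
    for line in listing.splitlines():
        # Match a trailing path component made only of digits, e.g. ".../1533685294/".
        match = re.search(r"/([0-9]+)/?$", line.strip())
        if match:
            versions.append(int(match.group(1)))
    if not versions:
        raise ValueError("no numeric export directories found")
    return "{}/{}".format(export_root.rstrip("/"), max(versions))


# Hypothetical listing, shaped like `gsutil ls` output for the export directory.
listing = """gs://code-search/20190402/working/output/export/
gs://code-search/20190402/working/output/export/1533685294/
gs://code-search/20190402/working/output/export/1533699999/
"""
print(latest_export_dir(listing, "gs://code-search/20190402/working/output/export"))
```

Raising an error on an empty listing makes the failure explicit when no export has completed, rather than surfacing later as a confusing `max()` error.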

### Run the Dataflow Job for Function Embeddings

In [None]:
%%bash

python configure_job.py kubeflow/components/submit-code-embeddings-job.yaml

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/submit-code-embeddings-job.yaml

When completed successfully, this should create another table in the same `BigQuery` dataset which contains the function embeddings for each data sample produced by the previous Dataflow job. It also writes a CSV file containing metadata for each function and its embedding. A representative query result is shown below.

In [None]:
query = """
  SELECT * 
  FROM 
    {}.function_embeddings
  LIMIT
    10
""".format(TARGET_DATASET)

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)

The pipeline also generates a set of CSV files which will be useful to generate the search index.

In [None]:
%%bash

LIMIT=10

gsutil ls ${EMBEDDING_DIR}/*index*.csv | head -n ${LIMIT}

## Create Search Index

We now create the search index from the computed embeddings. This facilitates k-nearest-neighbor search for semantically similar results.

In [None]:
%%bash

python configure_job.py kubeflow/components/search-index-creator.yaml

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/search-index-creator.yaml

Using the CSV files generated in the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). It also creates a unified CSV file containing all the code examples, which provides a human-readable reverse lookup at query time.

In [None]:
%%bash

gsutil ls ${WORKING_DIR}/code_search_index*
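
To show what the index and the lookup file are for, here is a brute-force sketch of k-nearest-neighbor search over a handful of embeddings. It is not NMSLib and does not use its API: it scans every vector, which libraries such as NMSLib are designed to avoid at scale. The use of cosine similarity, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(query, embeddings, lookup, k=2):
    """Return the k lookup entries whose embeddings are most similar to `query`."""
    scored = sorted(
        ((cosine_similarity(query, emb), idx) for idx, emb in enumerate(embeddings)),
        reverse=True,
    )
    return [(lookup[idx], round(score, 3)) for score, idx in scored[:k]]


# Toy 2-dimensional embeddings and a matching lookup list, invented for illustration.
embeddings = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
lookup = ["def read_file(path)", "def read_json(path)", "def send_email(to)"]
print(nearest([0.9, 0.1], embeddings, lookup))
```

The `lookup` list plays the role of the unified CSV file: the search returns positions in the index, and the lookup maps each position back to readable code.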

# Deploy the Web App

The included web app provides a simple way for users to submit queries.
* The first web app includes two pieces.
     * A Flask app that serves a simple UI for sending queries and uses nmslib to provide fast lookups
     * A TF Serving instance to compute the embeddings for search queries
* The second web app is used so that we can optionally use [ArgoCD](https://github.com/argoproj/argo-cd) to keep the serving components up to date.

## Deploy an Inference Server

We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported TensorFlow model above using [TensorFlow Serving](https://www.tensorflow.org/serving/).

In [None]:
%%bash

python configure_job.py web-app/query-embed-server.yaml

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/query-embed-server.yaml

## Deploy Search UI

We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed.

In [None]:
%%bash

python configure_job.py web-app/search-index-server.yaml

In [None]:
%%bash

kubectl -n kubeflow apply -f jobs/search-index-server.yaml

The service should now be available at the FQDN of the Kubeflow cluster at path `/code-search/`.