# Code Search on Kubeflow

This notebook implements end-to-end semantic code search on top of [Kubeflow](https://www.kubeflow.org/): given an input query string, it returns a list of code snippets that are semantically similar to the query.

## Prerequisites

**NOTE**: If using the JupyterHub Spawner on a Kubeflow cluster, use the Docker image `gcr.io/kubeflow-images-public/kubeflow-codelab-notebook:v20180808-v0.2-22-gcfdcb12`, which has all the prerequisites baked in.

* `Kubeflow v0.2.2`
  This notebook assumes a Kubeflow cluster is already deployed. See [Getting Started with Kubeflow](https://www.kubeflow.org/docs/started/getting-started/).

* `Python 2.7` (bundled with `pip`) 
  For this demo, we will use Python 2.7. This restriction is due to [Apache Beam](https://beam.apache.org/), which
  does not support Python 3 yet (See [BEAM-1251](https://issues.apache.org/jira/browse/BEAM-1251)).

* `Google Cloud SDK`
  This example will use tools from the [Google Cloud SDK](https://cloud.google.com/sdk/). The SDK must be
  authenticated and authorized. See [Authentication Overview](https://cloud.google.com/docs/authentication/).
  
* `Ksonnet 0.12`
  We use [Ksonnet](https://ksonnet.io/) to write Kubernetes jobs in a declarative manner to be run on top of Kubeflow.

In [None]:
%%bash

echo "Pip Version Info: " && pip2 --version && echo
echo "Google Cloud SDK Info: " && gcloud --version && echo
echo "Ksonnet Version Info: " && ks version && echo
echo "Kubectl Version Info: " && kubectl version
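The same check can be made from Python before running the later cells. The following is a minimal sketch and not part of the Code Search source: the helper name `missing_tools` is hypothetical, and the tool list is an assumption taken from the commands in the cell above. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

try:
    from shutil import which  # Python 3.3+
except ImportError:
    # Python 2.7 fallback with the same "path or None" behaviour.
    from distutils.spawn import find_executable as which

# Assumption: these are the CLI tools the notebook calls.
REQUIRED_TOOLS = ('pip2', 'gcloud', 'ks', 'kubectl')


def missing_tools(tools=REQUIRED_TOOLS, lookup=which):
    """Return the tools from `tools` that `lookup` cannot find on PATH."""
    return [tool for tool in tools if lookup(tool) is None]


if __name__ == '__main__':
    missing = missing_tools()
    if missing:
        print('Missing tools:', ', '.join(missing))
    else:
        print('All required tools were found on PATH.')
```

The `lookup` parameter exists only so the function can be exercised without touching the real `PATH`.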

## Get the Source Code

Let us clone the source code for Code Search from [kubeflow/examples](https://github.com/kubeflow/examples).

In [None]:
%%bash

##
# NOTE: This cell must only be run once or the clone will complain.
# Only if necessary, uncomment the line below.

# rm -rf examples
git clone --depth=1 https://github.com/kubeflow/examples

We also change the working directory to the root directory of our Code Search example. This lets us perform all downstream operations relative to the `code_search` project source code, without long-winded paths.

In [None]:
import os
os.chdir('examples/code_search')

## Install dependencies

Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This may take a while.

**NOTE**: The Kubeflow Batch Prediction dependency is installed from a fork for the reasons given in [kubeflow/batch-predict#9](https://github.com/kubeflow/batch-predict/pull/9) and the corresponding issue [kubeflow/batch-predict#10](https://github.com/kubeflow/batch-predict/pull/10).

In [None]:
%%bash

pip2 install -r src/requirements.txt
pip2 install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider

## Configure Variables

This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps.

In [1]:
# Configuration Variables. Modify as desired.

PROJECT = 'kubeflow-dev'
CLUSTER_NAME = 'kubeflow-latest'
CLUSTER_REGION = 'us-central1-a'
CLUSTER_NAMESPACE = 'kubeflow-latest'

TARGET_DATASET = 'code_search'
WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/20180809'
WORKER_MACHINE_TYPE = 'n1-highcpu-32'
NUM_WORKERS = 16

# DO NOT MODIFY. The following statements are needed to support
# environment variables in a bash cell.
%env PROJECT $PROJECT
%env CLUSTER_NAME $CLUSTER_NAME
%env CLUSTER_REGION $CLUSTER_REGION

%env TARGET_DATASET $TARGET_DATASET
%env WORKING_DIR $WORKING_DIR
%env WORKER_MACHINE_TYPE $WORKER_MACHINE_TYPE
%env NUM_WORKERS $NUM_WORKERS

env: PROJECT=kubeflow-dev
env: CLUSTER_NAME=kubeflow-latest
env: CLUSTER_REGION=us-central1-a
env: TARGET_DATASET=code_search
env: WORKING_DIR=gs://kubeflow-examples/t2t-code-search/20180809
env: WORKER_MACHINE_TYPE=n1-highcpu-32
env: NUM_WORKERS=16
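Several later cells build Cloud Storage locations by appending fixed suffixes to `WORKING_DIR`. The sketch below collects those suffixes in one place so they can be inspected before launching a job. The function name `dataflow_paths` and the dictionary keys are hypothetical; the suffixes themselves are copied from the Dataflow commands in this notebook.

```python
def dataflow_paths(working_dir, model_version=None):
    """Return the GCS locations derived from WORKING_DIR as a dict.

    `model_version` is optional; when given, the exported model
    directory for that version is included as well.
    """
    if not working_dir.startswith('gs://'):
        raise ValueError('WORKING_DIR must be a gs:// location')
    root = working_dir.rstrip('/')
    paths = {
        'data_dir': root + '/data',
        'temp_location': root + '/data/dataflow/temp',
        'staging_location': root + '/data/dataflow/staging',
    }
    if model_version is not None:
        paths['saved_model_dir'] = (
            root + '/output/export/Servo/' + str(model_version))
    return paths


if __name__ == '__main__':
    example = dataflow_paths(
        'gs://kubeflow-examples/t2t-code-search/20180809')
    for name in sorted(example):
        print(name, '=', example[name])
```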


Set up the Kubernetes cluster credentials and the Ksonnet application (present in the `examples/code_search/kubeflow` directory), and update its parameters.

In [13]:
%%bash

# FIXME: The deployment must set up the service account with GKE permissions
# gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}

# gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${CLUSTER_REGION}
# kubectl config view
# ks env add code-search --context=$(kubectl config current-context) --namespace=kubeflow
# Do not print the key file: it contains a private key. Only check that it exists.
test -f "${GOOGLE_APPLICATION_CREDENTIALS}" && echo "Service account key file found."

## Pre-Processing GitHub Files

In this step, we will run a [Google Cloud Dataflow](https://cloud.google.com/dataflow/) pipeline (based on Apache Beam). A `Python 2` module `code_search.dataflow.cli.preprocess_github_dataset` has been provided which builds an Apache Beam pipeline. A list of all possible arguments can be seen via the following command.

In [None]:
%%bash

python2 -m code_search.dataflow.cli.preprocess_github_dataset -h

### Configuration

We use a subset of the available options to run our Dataflow job. All of the variables used below are required; modify the preset values as needed.

### Run the Dataflow Job for Pre-Processing

In [None]:
%%bash

# See help for a short description of each argument.

python2 -m code_search.dataflow.cli.preprocess_github_dataset \
        --runner DataflowRunner \
        --project "${PROJECT}" \
        --target_dataset "${TARGET_DATASET}" \
        --data_dir "${WORKING_DIR}/data" \
        --job_name "preprocess-github-dataset" \
        --temp_location "${WORKING_DIR}/data/dataflow/temp" \
        --staging_location "${WORKING_DIR}/data/dataflow/staging" \
        --worker_machine_type "${WORKER_MACHINE_TYPE}" \
        --num_workers "${NUM_WORKERS}"

When the job completes successfully, it should create a dataset in `BigQuery` with the name passed as `target_dataset`. It also writes CSV files to `data_dir` containing the training samples (pairs of functions and docstrings) for our TensorFlow model.
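To make "pairs of functions and docstrings" concrete, here is a minimal sketch that extracts such pairs from a Python source string with the standard `ast` module. It is an illustration only: the function name `function_docstring_pairs` is hypothetical, and the actual Dataflow pipeline in `code_search` may tokenize and filter the data differently.

```python
import ast


def function_docstring_pairs(source):
    """Return (function_name, docstring) for every documented function."""
    tree = ast.parse(source)
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            docstring = ast.get_docstring(node)
            # Functions without a docstring give no training signal here.
            if docstring:
                pairs.append((node.name, docstring))
    return pairs


SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b


def undocumented(x):
    return x
'''

if __name__ == '__main__':
    for name, doc in function_docstring_pairs(SAMPLE):
        print(name, '->', doc)
```

Only `add` yields a pair in this sample, because `undocumented` has no docstring to pair with its body.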

## Prepare Dataset for Training

In this step we will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format. We will run this job on the Kubeflow cluster.

In [None]:
%%bash

ks apply code-search -c t2t-code-search-datagen

## Execute TensorFlow Training

In [None]:
%%bash

ks apply code-search -c t2t-code-search-trainer

## Export TensorFlow Model

In [None]:
%%bash

ks apply code-search -c t2t-code-search-exporter

## Compute Function Embeddings

In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below.

In [None]:
%%bash

python2 -m code_search.dataflow.cli.create_function_embeddings -h

### Configuration

First, select an exported model version from `${WORKING_DIR}/output/export/Servo`. This should be the name of a folder that is a UNIX timestamp in seconds, like `1533685294`.

In [None]:
MODEL_VERSION = '1533685294'

# DO NOT MODIFY. The following statements are needed to support
# environment variables in a bash cell.
%env MODEL_VERSION $MODEL_VERSION
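Because the version folders are named with UNIX timestamps, the newest export is simply the largest number. The sketch below picks it from a directory listing, for example the lines printed by `gsutil ls` on the `Servo` directory. The helper name `latest_model_version` is hypothetical, and the assumption that the newest export is the one you want may not hold if you are comparing models.

```python
def latest_model_version(entries):
    """Return the largest timestamp-named folder from a listing.

    `entries` may be bare folder names or full paths with a
    trailing slash; anything that is not all digits is ignored.
    """
    names = [entry.rstrip('/').rsplit('/', 1)[-1] for entry in entries]
    versions = [name for name in names if name.isdigit()]
    if not versions:
        raise ValueError('no timestamped export directories found')
    return max(versions, key=int)


if __name__ == '__main__':
    listing = [
        'gs://bucket/output/export/Servo/1533600000/',
        'gs://bucket/output/export/Servo/1533685294/',
    ]
    print(latest_model_version(listing))
```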

### Run the Dataflow Job for Function Embeddings

In [None]:
%%bash

python2 -m code_search.dataflow.cli.create_function_embeddings \
        --runner DataflowRunner \
        --project "${PROJECT}" \
        --target_dataset "${TARGET_DATASET}" \
        --problem github_function_docstring \
        --data_dir "${WORKING_DIR}/data" \
        --saved_model_dir "${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}" \
        --job_name compute-function-embeddings \
        --temp_location "${WORKING_DIR}/data/dataflow/temp" \
        --staging_location "${WORKING_DIR}/data/dataflow/staging" \
        --worker_machine_type "${WORKER_MACHINE_TYPE}" \
        --num_workers "${NUM_WORKERS}"

When the job completes successfully, it should create another table in the same `BigQuery` dataset, containing the function embeddings for each data sample produced by the previous Dataflow job. It also writes a CSV file containing metadata for each function and its embedding.

## Create Search Index

We now create the search index from the computed embeddings, so that at query time we can run a k-nearest-neighbor search and return semantically similar results.

In [None]:
%%bash

ks apply code-search -c search-index-creator

Using the CSV files generated in the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). A unified CSV file containing all the code examples, used for a human-readable reverse lookup at query time, is also created in `WORKING_DIR`.
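The idea behind the index can be illustrated with a brute-force k-nearest-neighbor search in plain Python. This is a sketch of the concept, not of NMSLib: NMSLib builds an approximate index that scales far better, and the use of cosine similarity here is an assumption, since this notebook does not state which distance the index uses. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to `query`."""
    scored = [(cosine_similarity(query, emb), idx)
              for idx, emb in enumerate(embeddings)]
    # Highest similarity first; ties are broken by the lower index.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [idx for _, idx in scored[:k]]


if __name__ == '__main__':
    function_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
    print(nearest_neighbors([1.0, 0.0], function_embeddings, k=2))
```

The returned indices play the role of row numbers in the unified CSV file, which is what makes the reverse lookup from an embedding back to readable code possible.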

## Deploy an Inference Server

We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported TensorFlow model using [TensorFlow Serving](https://www.tensorflow.org/serving/).

In [None]:
%%bash

ks apply code-search -c t2t-code-search-serving

## Deploy Search UI

We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed.

In [None]:
%%bash

ks apply code-search -c search-index-server

The service should now be available at the FQDN of the Kubeflow cluster, at the path `/code-search/`.