# Code Search on Kubeflow

This notebook implements end-to-end semantic code search on top of [Kubeflow](https://www.kubeflow.org/): given an input query string, it returns a list of code snippets that are semantically similar to the query.


# TODO

- Appropriate Working Directories for each of the bash commands
- `ks param set` requires a small refactor of the Ksonnet Application.
- Verify everything runs.

## Prerequisites

**NOTE**: If you are using this notebook on a Kubeflow cluster, all the prerequisites have already been set up.

* `Kubeflow v0.2.2`
  This notebook assumes a Kubeflow cluster is already deployed. See [Getting Started with Kubeflow](https://www.kubeflow.org/docs/started/getting-started/).

* `Python 2.7` (bundled with `pip`) 
  For this demo, we will use Python 2.7. This restriction is due to [Apache Beam](https://beam.apache.org/), which
  does not support Python 3 yet (See [BEAM-1251](https://issues.apache.org/jira/browse/BEAM-1251)).

* `Google Cloud SDK`
  This example will use tools from the [Google Cloud SDK](https://cloud.google.com/sdk/). The SDK must be
  authenticated and authorized. See [Authentication Overview](https://cloud.google.com/docs/authentication/).
  
* `Ksonnet 0.12`
  We use [Ksonnet](https://ksonnet.io/) to write Kubernetes jobs in a declarative manner to be run on top of Kubeflow.

In [None]:
%%bash

echo "Python Version Info: " && python2 --version && echo
echo "Google Cloud SDK Info: " && gcloud --version && echo
echo "Kubectl Version Info: " && kubectl version && echo
echo "Ksonnet Version Info: " && ks version
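
The version check above fails with a shell error if one of the tools is not installed. As a minimal sketch, the same prerequisite list can be checked from a Python cell before running anything else. The helper names `find_on_path` and `missing_tools` are hypothetical and not part of the Code Search example; the tool names are taken from the cell above.

```python
import os


def find_on_path(tool, path=None):
    """Return the full path of an executable `tool`, or None if not found.

    `path` defaults to the PATH environment variable. (Hypothetical helper,
    written to behave the same on Python 2 and Python 3.)
    """
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, tool)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(tools, path=None):
    """Return the subset of `tools` that cannot be found on the search path."""
    return [tool for tool in tools if find_on_path(tool, path) is None]


# The CLI tools invoked by the bash cell above.
REQUIRED_TOOLS = ["python2", "gcloud", "kubectl", "ks"]

missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Missing prerequisites: %s" % ", ".join(missing))
else:
    print("All prerequisite tools were found on PATH.")
```

This only confirms that the executables exist; it does not check their versions, which the bash cell above reports.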

## Get the Source Code

Let us clone the source code for Code Search from [kubeflow/examples](https://github.com/kubeflow/examples).

In [None]:
%%bash

# NOTE: This cell must only be run once or the clone will complain.
git clone --depth=1 https://github.com/kubeflow/examples

## Install dependencies

Let us install all the Python dependencies. Note that everything must be done with `Python 2`.

In [None]:
%%bash

pip2 install -r examples/code_search/src/requirements.txt
pip2 install https://github.com/activatedgeek/batch-predict/tarball/fix-value-provider

## Configure Variables

This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps.

In [None]:
# Configuration Variables. Modify as desired.

PROJECT = 'kubeflow-dev'
TARGET_DATASET = 'code_search'
WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/20180807'

WORKER_MACHINE_TYPE = 'n1-highcpu-32'
NUM_WORKERS = 16

# DO NOT MODIFY. The following statements are needed to support
# environment variables in a bash cell.
%env PROJECT $PROJECT
%env TARGET_DATASET $TARGET_DATASET
%env WORKING_DIR $WORKING_DIR
%env WORKER_MACHINE_TYPE $WORKER_MACHINE_TYPE
%env NUM_WORKERS $NUM_WORKERS
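
Several later steps build Cloud Storage locations from `WORKING_DIR`. The sketch below collects those locations in one place so they can be inspected before launching any job. The function name `derived_paths` is hypothetical, and the check that `WORKING_DIR` is a `gs://` URI is an assumption based on the example value used in this notebook.

```python
def derived_paths(working_dir):
    """Return the locations that later cells derive from WORKING_DIR.

    Hypothetical helper: the suffixes mirror the arguments passed to the
    Dataflow jobs and the export directory referenced in this notebook.
    """
    # Assumption: the working directory lives on Google Cloud Storage.
    if not working_dir.startswith("gs://"):
        raise ValueError("Expected a gs:// URI, got %r" % working_dir)
    base = working_dir.rstrip("/")
    return {
        "data_dir": base + "/data",
        "temp_location": base + "/data/dataflow/temp",
        "staging_location": base + "/data/dataflow/staging",
        "export_dir": base + "/output/export/Servo",
    }


paths = derived_paths("gs://kubeflow-examples/t2t-code-search/20180807")
for name in sorted(paths):
    print("%s = %s" % (name, paths[name]))
```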

Add the Kubernetes environment to the Ksonnet application and update parameters.

In [None]:
%%bash

ks env add code-search --context=$(kubectl config current-context) --namespace=kubeflow

# TODO: refactor Kubeflow application to allow setting common variables.
# ks param set t2t-job workingDir ${WORKING_DIR}

## Pre-Processing GitHub Files

In this step, we will run a [Google Cloud Dataflow](https://cloud.google.com/dataflow/) pipeline (based on Apache Beam). A `Python 2` module `code_search.dataflow.cli.preprocess_github_dataset` has been provided which builds an Apache Beam pipeline. A list of all possible arguments can be seen via the following command.

In [None]:
%%bash

python2 -m code_search.dataflow.cli.preprocess_github_dataset -h

### Configuration

We run the Dataflow job with a subset of the available options. All of the variables used below are required, so make sure the preset values in the configuration cell above are set as desired.

### Run the Dataflow Job for Pre-Processing

In [None]:
%%bash

# See help for a short description of each argument.

python2 -m code_search.dataflow.cli.preprocess_github_dataset \
        --runner DataflowRunner \
        --project "${PROJECT}" \
        --target_dataset "${TARGET_DATASET}" \
        --data_dir "${WORKING_DIR}/data" \
        --job_name "preprocess-github-dataset" \
        --temp_location "${WORKING_DIR}/data/dataflow/temp" \
        --staging_location "${WORKING_DIR}/data/dataflow/staging" \
        --worker_machine_type "${WORKER_MACHINE_TYPE}" \
        --num_workers "${NUM_WORKERS}"

When completed successfully, this should create a `BigQuery` dataset with the name given by `TARGET_DATASET`. It also writes CSV files to `data_dir`; these contain the training samples (pairs of functions and docstrings) for our TensorFlow model.

## Prepare Dataset for Training

In this step we will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format. We will run this job on the Kubeflow cluster.

In [None]:
%%bash

ks apply code-search -c t2t-code-search-datagen

## Execute Tensorflow Training

In [None]:
%%bash

ks apply code-search -c t2t-code-search-trainer

## Export Tensorflow Model

In [None]:
%%bash

ks apply code-search -c t2t-code-search-exporter

## Compute Function Embeddings

In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. First, select an exported model version from `${WORKING_DIR}/output/export/Servo`. This should be the name of a folder that is a UNIX timestamp in seconds, such as `1533685294`.

In [None]:
MODEL_VERSION = '1533685294'

# DO NOT MODIFY. The following statements are needed to support
# environment variables in a bash cell.
%env MODEL_VERSION $MODEL_VERSION
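
When the export directory contains several versions, the folder names can be compared as integers to find the most recent one. The sketch below assumes you have already obtained the folder names under `${WORKING_DIR}/output/export/Servo` (for example by listing the directory). The function name `latest_model_version` is hypothetical, and picking the newest version is only one reasonable choice; the notebook itself just asks you to select a version.

```python
def latest_model_version(entries):
    """Return the newest version from a list of export folder names.

    Folder names are expected to be UNIX timestamps in seconds. Names that
    are not purely numeric (after stripping slashes) are ignored.
    """
    versions = [e.strip("/") for e in entries if e.strip("/").isdigit()]
    if not versions:
        raise ValueError("No timestamp-named model versions found.")
    # Compare numerically rather than as strings.
    return max(versions, key=int)


# Illustrative folder names only; the real ones depend on your export runs.
print(latest_model_version(["1533600000/", "1533685294/", "temp"]))
```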

### Run the Dataflow Job for Function Embeddings

In [None]:
%%bash

python2 -m code_search.dataflow.cli.create_function_embeddings \
        --runner DataflowRunner \
        --project "${PROJECT}" \
        --target_dataset "${TARGET_DATASET}" \
        --problem github_function_docstring \
        --data_dir "${WORKING_DIR}/data" \
        --saved_model_dir "${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}" \
        --job_name compute-function-embeddings \
        --temp_location "${WORKING_DIR}/data/dataflow/temp" \
        --staging_location "${WORKING_DIR}/data/dataflow/staging" \
        --worker_machine_type "${WORKER_MACHINE_TYPE}" \
        --num_workers "${NUM_WORKERS}"

When completed successfully, this should create another table in the same `BigQuery` dataset, containing the function embeddings for every data sample produced by the previous Dataflow job. It also writes a CSV file containing the metadata for each function along with its embedding.

## Create Search Index

We now create the search index from the computed embeddings, so that at query time we can run a k-nearest-neighbor search and return semantically similar results.

In [None]:
%%bash

ks apply code-search -c search-index-creator

Using the CSV files generated in the previous step, this creates an index with [NMSLib](https://github.com/nmslib/nmslib). It also creates a unified CSV file in `WORKING_DIR` containing all the code examples, which is used for a human-readable reverse lookup at query time.
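
To illustrate what a k-nearest-neighbor lookup over embeddings does, here is a brute-force sketch in pure Python. It is a stand-in for the idea only: it does not use NMSLib, which builds an index to avoid scanning every vector. Cosine similarity is an assumption made for this example, and the function names and toy vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def k_nearest(query, embeddings, k=2):
    """Return the indices of the k embeddings most similar to `query`.

    Brute force: scores every embedding, then keeps the top k.
    """
    scored = sorted(
        ((cosine_similarity(query, vec), idx) for idx, vec in enumerate(embeddings)),
        reverse=True,
    )
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "function embeddings"; real embeddings come from the model.
function_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
query_embedding = [1.0, 0.0]
print(k_nearest(query_embedding, function_embeddings, k=2))
```

The returned indices play the role of row numbers in the unified CSV file, which is how a result is mapped back to a readable code example.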

## Deploy an Inference Server

We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported TensorFlow model above using [TensorFlow Serving](https://www.tensorflow.org/serving/).

In [None]:
%%bash

ks apply code-search -c t2t-code-search-serving

## Deploy Search UI

We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed.

In [None]:
%%bash

ks apply code-search -c search-index-server

The service should now be available at the FQDN of the Kubeflow cluster, under the path `/code-search/`.