# Code Search on Kubeflow

This notebook implements end-to-end semantic code search on top of [Kubeflow](https://www.kubeflow.org/): given an input query string, it returns a list of code snippets that are semantically similar to the query.

**NOTE**: If you haven't already, see [kubeflow/examples/code_search](https://github.com/kubeflow/examples/tree/master/code_search) for instructions on how to get this notebook.

## Install dependencies

Let us install all the Python dependencies. Note that everything must be done with `Python 2`. This will take a while and only needs to be run once.

In [None]:
! pip2 install https://github.com/kubeflow/batch-predict/tarball/master

! pip2 install -r src/requirements.txt

In [None]:
# Only for BigQuery cells
! pip2 install pandas-gbq

In [None]:
from pandas.io import gbq

### Configure Variables

This involves setting up the Ksonnet application as well as utility environment variables for various CLI steps.

In [None]:
# Configuration Variables. Modify as desired.

PROJECT = 'kubeflow-dev'
CLUSTER_NAME = 'kubeflow-latest'
CLUSTER_REGION = 'us-east1-d'
CLUSTER_NAMESPACE = 'kubeflow-latest'

TARGET_DATASET = 'code_search'
WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/notebook-demo'
WORKER_MACHINE_TYPE = 'n1-highcpu-32'
NUM_WORKERS = 16

# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env PROJECT $PROJECT
%env CLUSTER_NAME $CLUSTER_NAME
%env CLUSTER_REGION $CLUSTER_REGION
%env CLUSTER_NAMESPACE $CLUSTER_NAMESPACE

%env TARGET_DATASET $TARGET_DATASET
%env WORKING_DIR $WORKING_DIR
%env WORKER_MACHINE_TYPE $WORKER_MACHINE_TYPE
%env NUM_WORKERS $NUM_WORKERS
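
Several later cells build Google Cloud Storage locations by appending fixed suffixes to `WORKING_DIR`. The sketch below collects those suffixes in one place so the layout is easier to see. The helper name `derived_paths` is hypothetical and is not part of the `code_search` package; the suffixes themselves are the ones used by the Dataflow and export cells in this notebook.

```python
# Sketch only: shows how later cells derive GCS paths from WORKING_DIR.
WORKING_DIR = 'gs://kubeflow-examples/t2t-code-search/notebook-demo'


def derived_paths(working_dir):
    """Return the sub-paths of working_dir that later steps read or write.

    Hypothetical helper for illustration; the notebook itself builds these
    paths inline in its bash cells.
    """
    base = working_dir.rstrip('/')
    return {
        'data_dir': base + '/data',
        'temp_location': base + '/dataflow/temp',
        'staging_location': base + '/dataflow/staging',
        'export_dir': base + '/output/export/Servo',
    }


paths = derived_paths(WORKING_DIR)
for name in sorted(paths):
    print('{}: {}'.format(name, paths[name]))
```

Changing `WORKING_DIR` therefore relocates the training data, the Dataflow scratch space and the exported models together.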

### Setup Authorization

In a Kubeflow cluster, we already have the key credentials available with each pod and will re-use them to authenticate. This will allow us to submit `TFJob`s and execute `Dataflow` pipelines. We also set the new context for the Code Search Ksonnet application.

In [None]:
%%bash

# Activate Service Account provided by Kubeflow.
gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS}

# Get KUBECONFIG for the desired cluster.
gcloud container clusters get-credentials ${CLUSTER_NAME} --region ${CLUSTER_REGION}

# Set the namespace of the context.
kubectl config set contexts.$(kubectl config current-context).namespace ${CLUSTER_NAMESPACE}

### Setup Ksonnet Application

This will use the context we've set above and provide it as a new environment to the Ksonnet application.

In [None]:
%%bash

cd kubeflow

# Update Ksonnet application to the context set earlier
ks env add code-search --context=$(kubectl config current-context)

# Update the Working Directory of the application
ks param set t2t-code-search workingDir ${WORKING_DIR}

### Verify Version Information

In [None]:
%%bash

echo "Pip Version Info: " && pip2 --version && echo
echo "Google Cloud SDK Info: " && gcloud --version && echo
echo "Ksonnet Version Info: " && ks version && echo
echo "Kubectl Version Info: " && kubectl version

## View GitHub Files

The query below is run as the first step of the Pre-Processing pipeline, and its results are sent through a set of transformations. The rows it returns illustrate what the pipeline we trigger next will process.

In [None]:
query = """
  SELECT
    MAX(CONCAT(f.repo_name, ' ', f.path)) AS repo_path,
    c.content
  FROM
    `bigquery-public-data.github_repos.files` AS f
  JOIN
    `bigquery-public-data.github_repos.contents` AS c
  ON
    f.id = c.id
  JOIN (
      --this part of the query makes sure repo is watched at least twice since 2017
    SELECT
      repo
    FROM (
      SELECT
        repo.name AS repo
      FROM
        `githubarchive.year.2017`
      WHERE
        type="WatchEvent"
      UNION ALL
      SELECT
        repo.name AS repo
      FROM
        `githubarchive.month.2018*`
      WHERE
        type="WatchEvent" )
    GROUP BY
      1
    HAVING
      COUNT(*) >= 2 ) AS r
  ON
    f.repo_name = r.repo
  WHERE
    f.path LIKE '%.py' AND --with python extension
    c.size < 15000 AND --get rid of ridiculously long files
    REGEXP_CONTAINS(c.content, r'def ') --contains function definition
  GROUP BY
    c.content
  LIMIT
    10
"""

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
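
The row-level conditions in the `WHERE` clause can be read as a simple per-file predicate. The sketch below mirrors only those three conditions in plain Python; it does not reproduce the join that requires at least two `WatchEvent`s, and the function name `keep_file` is a hypothetical name chosen for illustration.

```python
import re

MAX_SIZE = 15000  # same threshold as `c.size < 15000` in the query


def keep_file(path, size, content):
    """Mirror the three WHERE conditions of the query for one file.

    Hypothetical helper: `path`, `size` and `content` stand in for the
    `f.path`, `c.size` and `c.content` columns.
    """
    return (
        path.endswith('.py')                        # f.path LIKE '%.py'
        and size < MAX_SIZE                         # c.size < 15000
        and re.search(r'def ', content) is not None  # REGEXP_CONTAINS
    )


print(keep_file('pkg/util.py', 120, 'def add(a, b):\n    return a + b\n'))
print(keep_file('README.md', 50, 'def not python'))
print(keep_file('big.py', 20000, 'def f(): pass'))
print(keep_file('consts.py', 10, 'X = 1\n'))
```

Only the first call passes all three checks; the others fail on extension, size and missing function definition respectively.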

## Pre-Processing GitHub Files

In this step, we will run a [Google Cloud Dataflow](https://cloud.google.com/dataflow/) pipeline (based on Apache Beam). A `Python 2` module `code_search.dataflow.cli.preprocess_github_dataset` has been provided which builds an Apache Beam pipeline. A list of all possible arguments can be seen via the following command.

In [None]:
%%bash

cd src

python2 -m code_search.dataflow.cli.preprocess_github_dataset -h

### Run the Dataflow Job for Pre-Processing

See help above for a short description of each argument. The values are being taken from environment variables defined earlier.

In [None]:
%%bash

cd src

JOB_NAME="preprocess-github-dataset-$(date +'%Y%m%d-%H%M%S')"

python2 -m code_search.dataflow.cli.preprocess_github_dataset \
        --runner DataflowRunner \
        --project "${PROJECT}" \
        --target_dataset "${TARGET_DATASET}" \
        --data_dir "${WORKING_DIR}/data" \
        --job_name "${JOB_NAME}" \
        --temp_location "${WORKING_DIR}/dataflow/temp" \
        --staging_location "${WORKING_DIR}/dataflow/staging" \
        --worker_machine_type "${WORKER_MACHINE_TYPE}" \
        --num_workers "${NUM_WORKERS}"

When completed successfully, this should create a dataset in `BigQuery` with the name given by `target_dataset`. It also writes CSV files into `data_dir` containing training samples (pairs of functions and docstrings) for our TensorFlow model. A representative set of results can be viewed using the following query.

In [None]:
query = """
  SELECT * 
  FROM 
    {}.token_pairs
  LIMIT
    10
""".format(TARGET_DATASET)

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)
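
To make "pairs of functions and docstrings" concrete, here is a minimal sketch that pulls such pairs out of a Python source string with the standard `ast` module. This is an illustration only: the transformations the Dataflow pipeline actually applies (including any tokenization) are not shown in this notebook, and `function_docstring_pairs` is a hypothetical name.

```python
import ast


def function_docstring_pairs(source):
    """Yield (function_name, docstring) for every documented function.

    Illustrative sketch; not the pipeline's own extraction logic.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            docstring = ast.get_docstring(node)
            if docstring:  # skip functions without a docstring
                yield node.name, docstring


# Made-up sample input with one documented and one undocumented function.
SAMPLE = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b


def undocumented(x):
    return x
'''

pairs = list(function_docstring_pairs(SAMPLE))
print(pairs)
```

Functions without a docstring are skipped, since they cannot contribute a (function, docstring) training pair.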

## Prepare Dataset for Training

In this step we will use `t2t-datagen` to convert the transformed data above into the `TFRecord` format. We will run this job on the Kubeflow cluster.

In [None]:
%%bash

cd kubeflow

ks apply code-search -c t2t-code-search-datagen

## Execute TensorFlow Training

In [None]:
%%bash

cd kubeflow

ks apply code-search -c t2t-code-search-trainer

## Export TensorFlow Model

In [None]:
%%bash

cd kubeflow

ks apply code-search -c t2t-code-search-exporter

## Compute Function Embeddings

In this step, we will use the exported model above to compute function embeddings via another `Dataflow` pipeline. A `Python 2` module `code_search.dataflow.cli.create_function_embeddings` has been provided for this purpose. A list of all possible arguments can be seen below.

In [None]:
%%bash

cd src

python2 -m code_search.dataflow.cli.create_function_embeddings -h

### Configuration

First, select an exported model version from `${WORKING_DIR}/output/export/Servo`. This should be the name of a folder that is a UNIX timestamp in seconds, such as `1533685294`. Below, we do that automatically by selecting the folder with the latest timestamp.

In [None]:
%%bash --out EXPORT_DIR_LS

gsutil ls ${WORKING_DIR}/output/export/Servo | grep -oE "([0-9]+)/$"

In [None]:
# WARNING: This routine will fail if no export has been completed successfully.
MODEL_VERSION = max([int(ts[:-1]) for ts in EXPORT_DIR_LS.split('\n') if ts])

# DO NOT MODIFY. These are environment variables to be used in a bash shell.
%env MODEL_VERSION $MODEL_VERSION
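
The selection logic above can also be written as a small function that fails with an explicit message when no export exists. This is a sketch under assumptions: `latest_model_version` is a hypothetical helper, and the bucket paths in the example listing are made up to resemble `gsutil ls` output.

```python
import re


def latest_model_version(listing):
    """Return the largest numeric folder name found in a directory listing.

    Accepts either full `gs://.../<timestamp>/` lines or bare `<timestamp>/`
    lines, one per line. Hypothetical helper for illustration.
    """
    versions = []
    for line in listing.splitlines():
        match = re.search(r'([0-9]+)/$', line.strip())
        if match:
            versions.append(int(match.group(1)))
    if not versions:
        raise ValueError('No exported model versions found in listing.')
    return max(versions)


# Made-up listing for illustration; real input comes from `gsutil ls`.
EXAMPLE_LISTING = (
    'gs://example-bucket/output/export/Servo/1533685294/\n'
    'gs://example-bucket/output/export/Servo/1533690000/\n'
)
print(latest_model_version(EXAMPLE_LISTING))
```

Raising a `ValueError` with a descriptive message makes the "no export completed" case easier to diagnose than the bare error from calling `max` on an empty list.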

### Run the Dataflow Job for Function Embeddings

In [None]:
%%bash

cd src

python2 -m code_search.dataflow.cli.create_function_embeddings \
        --runner DataflowRunner \
        --project "${PROJECT}" \
        --target_dataset "${TARGET_DATASET}" \
        --problem github_function_docstring \
        --data_dir "${WORKING_DIR}/data" \
        --saved_model_dir "${WORKING_DIR}/output/export/Servo/${MODEL_VERSION}" \
        --job_name compute-function-embeddings \
        --temp_location "${WORKING_DIR}/dataflow/temp" \
        --staging_location "${WORKING_DIR}/dataflow/staging" \
        --worker_machine_type "${WORKER_MACHINE_TYPE}" \
        --num_workers "${NUM_WORKERS}"

When completed successfully, this should create another table in the same `BigQuery` dataset, containing the function embeddings for each data sample produced by the previous Dataflow job. It also writes a CSV file containing the metadata for each function along with its embedding. A representative query result is shown below.

In [None]:
query = """
  SELECT * 
  FROM 
    {}.function_embeddings
  LIMIT
    10
""".format(TARGET_DATASET)

gbq.read_gbq(query, dialect='standard', project_id=PROJECT)

## Create Search Index

We now create the Search Index from the computed embeddings so that, at query time, a k-Nearest Neighbor search can return semantically similar results.

In [None]:
%%bash

cd kubeflow

ks apply code-search -c search-index-creator

Using the CSV files generated in the previous step, this creates an index using [NMSLib](https://github.com/nmslib/nmslib). A unified CSV file containing all the code examples is also created in the `WORKING_DIR`, so that query results can be mapped back to human-readable code.
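
To illustrate what a k-Nearest Neighbor lookup over embeddings does, the sketch below runs an exact, brute-force search in plain Python. It is not how the index in this example works internally: NMSLib builds an approximate index for speed, and the similarity measure used by the real index is not stated here, so cosine similarity is an assumption. The function names and the toy vectors are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=2):
    """Return indices of the k embeddings most similar to query.

    Brute force: scores every embedding, then sorts by similarity.
    """
    scored = [(cosine_similarity(query, emb), idx)
              for idx, emb in enumerate(embeddings)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 3-dimensional vectors; real embeddings come from the exported model.
EMBEDDINGS = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(nearest_neighbors([1.0, 0.05, 0.0], EMBEDDINGS, k=2))
```

The returned indices play the role of row numbers in the unified CSV file: each one can be looked up to recover the corresponding code example. Brute force scales linearly with the number of functions, which is why an approximate index is used for the full dataset.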

## Deploy an Inference Server

We've seen offline inference during the computation of embeddings. For online inference, we deploy the exported TensorFlow model above using [TensorFlow Serving](https://www.tensorflow.org/serving/).

In [None]:
%%bash

cd kubeflow

ks apply code-search -c t2t-code-search-serving

## Deploy Search UI

We finally deploy the Search UI which allows the user to input arbitrary strings and see a list of results corresponding to semantically similar Python functions. This internally uses the inference server we just deployed.

In [None]:
%%bash

cd kubeflow

ks apply code-search -c search-index-server

The service should now be available at the FQDN of the Kubeflow cluster, under the path `/code-search/`.