# Chicago Crime Prediction Pipeline

An example notebook that demonstrates how to:
* Download data from BigQuery
* Create a Kubeflow pipeline
* Include Google Cloud AI Platform components to train and deploy the model in the pipeline
* Submit a job for execution
* Query the final deployed model

The model forecasts how many crimes are expected to be reported the next day, based on how many were reported over the previous `n` days.
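
To make the forecasting setup concrete, the sketch below pairs each run of `n` consecutive daily counts with the count on the following day. It is only an illustration of the idea: the helper name `make_windows` and the sample counts are made up, and the trainer package used later may prepare its examples differently.

```python
def make_windows(daily_counts, n):
    """Pair each run of n consecutive daily counts with the next day's count."""
    if n <= 0:
        raise ValueError("n must be positive")
    examples = []
    for start in range(len(daily_counts) - n):
        window = daily_counts[start:start + n]   # the previous n days
        target = daily_counts[start + n]         # the day to forecast
        examples.append((window, target))
    return examples


# Made-up counts, for illustration only (not real Chicago data).
counts = [10, 12, 11, 15, 14, 13]
for window, target in make_windows(counts, n=3):
    print(window, "->", target)
```

A series of length `m` yields `m - n` examples, so a series no longer than `n` yields none.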

## Imports

In [None]:
%%capture

# Install the SDK (comment out these lines if the SDK is already installed)
!python3 -m pip install 'kfp>=0.1.31' --quiet
!python3 -m pip install pandas --upgrade -q

# Restart the kernel for changes to take effect

In [None]:
import json

import kfp
import kfp.components as comp
import kfp.dsl as dsl

import pandas as pd

import time

## Pipeline

### Constants

In [None]:
# Required Parameters
project_id = '<ADD GCP PROJECT HERE>'
output = 'gs://<ADD STORAGE LOCATION HERE>' # No ending slash


In [None]:
# Optional Parameters
REGION = 'us-central1'
RUNTIME_VERSION = '1.13'
PACKAGE_URIS=json.dumps(['gs://chicago-crime/chicago_crime_trainer-0.0.tar.gz'])
TRAINER_OUTPUT_GCS_PATH = output + '/train/output/' + str(int(time.time())) + '/'
DATA_GCS_PATH = output + '/reports.csv'
PYTHON_MODULE = 'trainer.task'
PIPELINE_NAME = 'Chicago Crime Prediction'
PIPELINE_FILENAME_PREFIX = 'chicago'
PIPELINE_DESCRIPTION = ''
MODEL_NAME = 'chicago_pipeline_model' + str(int(time.time()))
MODEL_VERSION = 'chicago_pipeline_model_v1' + str(int(time.time()))
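
The constants above build Cloud Storage paths by string concatenation, which is why `output` must not end with a slash, and they append a Unix timestamp to keep names unique across runs. A minimal sketch of the same idea with a helper that tolerates stray slashes follows; `gcs_join` and `unique_suffix` are hypothetical names, not part of the notebook or of any Google library.

```python
import time


def gcs_join(base, *parts):
    """Join GCS path segments with single slashes, tolerating stray slashes."""
    if not base.startswith("gs://"):
        raise ValueError("expected a gs:// URI, got %r" % base)
    segments = [base.rstrip("/")] + [p.strip("/") for p in parts]
    return "/".join(segments)


def unique_suffix(now=None):
    """Return the current Unix time, truncated to whole seconds, as a string."""
    return str(int(time.time() if now is None else now))


# 'gs://my-bucket/' is a placeholder bucket, used only for illustration.
print(gcs_join("gs://my-bucket/", "train", "output", unique_suffix()))
print(gcs_join("gs://my-bucket", "reports.csv"))
```

Because the suffix has one-second resolution, two names generated within the same second would be identical.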

### Download data

Define a download function that runs the query through the BigQuery component and writes the result to Cloud Storage.

In [None]:
bigquery_query_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/01a23ae8672d3b18e88adf3036071496aca3552d/components/gcp/bigquery/query/component.yaml')

QUERY = """
    SELECT count(*) as count, TIMESTAMP_TRUNC(date, DAY) as day
    FROM `bigquery-public-data.chicago_crime.crime`
    GROUP BY day
    ORDER BY day
"""

def download(project_id, data_gcs_path):

    return bigquery_query_op(
        query=QUERY,
        project_id=project_id,
        output_gcs_path=data_gcs_path
    )
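
The query returns one row per day with two columns, `count` and `day`. The sketch below shows one way such a file could be read back into `(day, count)` pairs. It assumes the export is a CSV with a header row and that `day` begins with a `YYYY-MM-DD` date; both are assumptions about the exported file, and the sample rows are placeholders rather than real query results. `parse_reports` is a hypothetical helper.

```python
import csv
import io


def parse_reports(csv_text):
    """Return (day, count) pairs from CSV text with `count` and `day` columns."""
    rows = []
    for record in csv.DictReader(io.StringIO(csv_text)):
        day = record["day"][:10]  # keep only the YYYY-MM-DD part of the timestamp
        rows.append((day, int(record["count"])))
    return rows


# Placeholder rows, for illustration only (not real query output).
sample = (
    "count,day\n"
    "100,2001-01-01 00:00:00 UTC\n"
    "120,2001-01-02 00:00:00 UTC\n"
)
print(parse_reports(sample))
```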

### Train the model

Run training code that pre-processes the data and then submits a training job to AI Platform.

In [None]:
mlengine_train_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/1.1.0-alpha.1/components/gcp/ml_engine/train/component.yaml')

def train(project_id,
          trainer_args,
          package_uris,
          trainer_output_gcs_path,
          gcs_working_dir,
          region,
          python_module,
          runtime_version):

    return mlengine_train_op(
        project_id=project_id, 
        python_module=python_module,
        package_uris=package_uris,
        region=region,
        args=trainer_args,
        job_dir=trainer_output_gcs_path,
        runtime_version=runtime_version
    )

### Deploy model

Deploy the model saved in the output directory (`job_dir`) of the training step.

In [None]:
mlengine_deploy_op = comp.load_component_from_url(
    'https://raw.githubusercontent.com/kubeflow/pipelines/1.1.0-alpha.1/components/gcp/ml_engine/deploy/component.yaml')

def deploy(
    project_id,
    model_uri,
    model_id,
    model_version,
    runtime_version):
    
    return mlengine_deploy_op(
        model_uri=model_uri,
        project_id=project_id, 
        model_id=model_id, 
        version_id=model_version, 
        runtime_version=runtime_version, 
        replace_existing_version=True, 
        set_default=True)

### Define pipeline

In [None]:
@dsl.pipeline(
    name=PIPELINE_NAME,
    description=PIPELINE_DESCRIPTION
)
def pipeline(
    data_gcs_path=DATA_GCS_PATH,
    gcs_working_dir=output,
    project_id=project_id,
    python_module=PYTHON_MODULE,
    region=REGION,
    runtime_version=RUNTIME_VERSION,
    package_uris=PACKAGE_URIS,
    trainer_output_gcs_path=TRAINER_OUTPUT_GCS_PATH,
):      
    download_task = download(project_id,
                             data_gcs_path)

    train_task = train(project_id,
                       json.dumps(
                           ['--data-file-url',
                            '%s' % download_task.outputs['output_gcs_path'],
                            '--job-dir',
                            output]
                       ),
                       package_uris,
                       trainer_output_gcs_path,
                       gcs_working_dir,
                       region,
                       python_module,
                       runtime_version)
    
    deploy_task = deploy(project_id,
                         train_task.outputs['job_dir'],
                         MODEL_NAME,
                         MODEL_VERSION,
                         runtime_version)    
    return True

# Reference for invocation later
pipeline_func = pipeline
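
The training step receives its command-line flags as a JSON-encoded list of strings. The sketch below isolates that encoding with literal values. In the pipeline itself the data file URL is not a literal path but an output of the download task, which Kubeflow Pipelines fills in when the run executes. `build_trainer_args` is a hypothetical helper and the bucket name is a placeholder.

```python
import json


def build_trainer_args(data_file_url, job_dir):
    """JSON-encode the trainer's command-line flags as a list of strings."""
    return json.dumps(["--data-file-url", data_file_url, "--job-dir", job_dir])


# Placeholder paths, for illustration only.
encoded = build_trainer_args("gs://my-bucket/reports.csv", "gs://my-bucket")
print(encoded)              # a single JSON string
print(json.loads(encoded))  # decodes back to the list of flags and values
```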

### Submit the pipeline for execution

In [None]:
# Pass pipeline_func here: `pipeline` is rebound to the run result below,
# so passing `pipeline` would fail if this cell were re-run.
pipeline = kfp.Client().create_run_from_pipeline_func(pipeline_func, arguments={})

# Run the pipeline on a separate Kubeflow cluster instead
# (use if your notebook is not running in Kubeflow, e.g. if using AI Platform Notebooks)
# pipeline = kfp.Client(host='<ADD KFP ENDPOINT HERE>').create_run_from_pipeline_func(pipeline_func, arguments={})

### Wait for the pipeline to finish

In [None]:
run_detail = pipeline.wait_for_run_completion(timeout=1800)
print(run_detail.run.status)
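
Printing the status does not stop the notebook if the run failed, so the prediction cells below could run against a model that was never deployed. A minimal guard is sketched here. It assumes a successful run reports the status string `Succeeded`; `check_run_status` is a hypothetical helper, not part of the KFP SDK.

```python
def check_run_status(status):
    """Raise if a run did not report success; otherwise return the status."""
    # Assumption: a successful run reports 'Succeeded' (compared case-insensitively).
    if (status or "").lower() != "succeeded":
        raise RuntimeError("Pipeline run ended with status %r" % status)
    return status


print(check_run_status("Succeeded"))
```

In the notebook this could be called as `check_run_status(run_detail.run.status)`.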

### Use the deployed model to predict (online prediction)

In [None]:
import os
os.environ['MODEL_NAME'] = MODEL_NAME
os.environ['MODEL_VERSION'] = MODEL_VERSION

Create normalized input representing the 14 days before the prediction day.

In [None]:
%%writefile test.json
{"lstm_input": [[-1.24344569, -0.71910112, -0.86641698, -0.91635456, -1.04868914, -1.01373283, -0.7690387, -0.71910112, -0.86641698, -0.91635456, -1.04868914, -1.01373283, -0.7690387 , -0.90387016]]}
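
The values in `test.json` are already normalized. The notebook does not show how the trainer scales its inputs, so the sketch below simply assumes z-score normalization with a mean and standard deviation taken from the training data. `normalize` and `build_instance` are hypothetical helpers, and the counts and statistics are made up; check the trainer package before relying on this scheme.

```python
import json


def normalize(counts, mean, std):
    """Z-score each count using a precomputed mean and standard deviation."""
    if std == 0:
        raise ValueError("std must be non-zero")
    return [(c - mean) / std for c in counts]


def build_instance(counts, mean, std, window=14):
    """Wrap one normalized window in the same JSON shape as test.json."""
    if len(counts) != window:
        raise ValueError("expected %d daily counts, got %d" % (window, len(counts)))
    return {"lstm_input": [normalize(counts, mean, std)]}


# Made-up counts and statistics, for illustration only.
raw_counts = [700, 720, 690, 710, 730, 705, 695, 715, 725, 700, 710, 690, 720, 705]
instance = build_instance(raw_counts, mean=710.0, std=15.0)
print(json.dumps(instance))
```

If the trainer does use this scheme, a prediction would be mapped back to a crime count with `prediction * std + mean`.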

In [None]:
!gcloud ai-platform predict --model=$MODEL_NAME --version=$MODEL_VERSION --json-instances=test.json
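
The same prediction can be requested over the AI Platform REST API, which expects a JSON body with an `instances` list. The sketch below only builds the URL and body and does not send anything: a real call also needs an OAuth access token in the `Authorization` header. `build_predict_request` is a hypothetical helper and the argument values are placeholders.

```python
import json


def build_predict_request(project_id, model_name, model_version, instances):
    """Return the (url, body) pair for an AI Platform online prediction call."""
    url = (
        "https://ml.googleapis.com/v1/projects/%s/models/%s/versions/%s:predict"
        % (project_id, model_name, model_version)
    )
    body = json.dumps({"instances": instances})
    return url, body


# Placeholder identifiers and input, for illustration only.
url, body = build_predict_request(
    "my-project", "my_model", "my_model_v1", [{"lstm_input": [[0.0] * 14]}]
)
print(url)
print(body)
```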

### Examine cloud services invoked by the pipeline
- BigQuery query: https://console.cloud.google.com/bigquery?page=queries (click on 'Project History')
- AI Platform training job: https://console.cloud.google.com/ai-platform/jobs
- AI Platform model serving: https://console.cloud.google.com/ai-platform/models


### Clean up models

In [None]:
# !gcloud ai-platform versions delete $MODEL_VERSION --model $MODEL_NAME
# !gcloud ai-platform models delete $MODEL_NAME