# Pipeline for Ames Model

Lightweight Python components do not require you to build a new container image for every code change.
They are intended for fast iteration in a notebook environment.

#### Building a lightweight python component
To build a component, define a stand-alone Python function and then call `kfp.components.func_to_container_op(func)` to convert it to a component that can be used in a pipeline.

There are several requirements for the function:
* The function should be stand-alone. It should not use any code declared outside of the function definition. Any imports should be added inside the main function. Any helper functions should also be defined inside the main function.
* The function can only import packages that are available in the base image. If you need a package that is not available, try to find a container image that already includes it. (As a workaround, you can use the `subprocess` module to run `pip install` for the required package.)
* If the function operates on numbers, the parameters need to have type hints. Supported types are `[int, float, bool]`. Everything else is passed as a string.
* To build a component with multiple output values, use the typing.NamedTuple type hint syntax: ```NamedTuple('MyFunctionOutputs', [('output_name_1', type), ('output_name_2', float)])```
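
The requirements above can be illustrated with a minimal sketch. The function `price_stats`, its parameters, and its output names are hypothetical and are not part of this notebook; the sketch only shows the shape a stand-alone function might take (type hints on numeric parameters, imports and helpers inside the function body, and a `NamedTuple` return annotation for multiple outputs). It runs as plain Python and does not call the KFP SDK.

```python
from typing import NamedTuple


def price_stats(low: float, high: float) -> NamedTuple(
        'PriceStatsOutputs', [('midpoint', float), ('spread', float)]):
    """Hypothetical component: return the midpoint and spread of two prices."""
    # Imports live inside the function so that it stays stand-alone.
    from collections import namedtuple

    # Helper functions are also defined inside the main function.
    def midpoint(a, b):
        return (a + b) / 2.0

    outputs = namedtuple('PriceStatsOutputs', ['midpoint', 'spread'])
    return outputs(midpoint(low, high), abs(high - low))


# Calling the function directly is a quick way to check it before conversion.
result = price_stats(100000.0, 150000.0)
print(result.midpoint, result.spread)
```

Because the function depends on nothing outside its own body except the return annotation, it can be tested locally first and would then be a candidate for `func_to_container_op`.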

In [125]:
EXPERIMENT_NAME = 'Ames'
KFP_PACKAGE = 'https://storage.googleapis.com/ml-pipeline/release/0.1.20/kfp.tar.gz'

In [2]:
# Install the SDK
!pip3 install $KFP_PACKAGE --upgrade
!pip3 install pandas

Collecting https://storage.googleapis.com/ml-pipeline/release/0.1.20/kfp.tar.gz
  Downloading https://storage.googleapis.com/ml-pipeline/release/0.1.20/kfp.tar.gz (67kB)
Collecting PyJWT>=1.6.4 (from kfp==0.1.20)
  Downloading https://files.pythonhosted.org/packages/87/8b/6a9f14b5f781697e51259d81657e6048fd31a113229cf346880bb7545565/PyJWT-1.7.1-py2.py3-none-any.whl
Collecting requests_toolbelt>=0.8.0 (from kfp==0.1.20)
  Downloading https://files.pythonhosted.org/packages/60/ef/7681134338fc097acef8d9b2f8abe0458e4d87559c689a8c306d0957ece5/requests_toolbelt-0.9.1-py2.py3-none-any.whl (54kB)
Collecting kfp-server-api<0.1.19,>=0.1.18 (from kfp==0.1.20)
  Downloading https://files.pythonhosted.org/packages/85/76/8d966b2bbd753aee0ae913973aa89b0eeda3cc574f7a4cbb0fdfcbbd43fb/kfp-server-api-0.1.18.2.tar.gz
Building wheels for collected 

In [3]:
import kfp.components as comp
import kfp.gcp as gcp

Activate the service account credentials used by `gcloud`:

In [34]:
!gcloud auth activate-service-account --key-file=${GOOGLE_APPLICATION_CREDENTIALS} --quiet

Activated service account credentials for: [label-issues-0409-user@code-search-demo.iam.gserviceaccount.com]


In [126]:
#Define a Python function
def train(train_data: str, model_file: str) -> str:
    '''Train the model'''
    test_size=0.25
    n_estimators = 50
    learning_rate = 0.1
        
    import joblib
    import re
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split
    from sklearn.impute import SimpleImputer
    from xgboost import XGBRegressor
    def split_gcs_uri(gcs_uri):
        """Split a GCS URI into bucket and path."""
        GCS_REGEX = re.compile("gs://([^/]*)(/.*)?")
        m = GCS_REGEX.match(gcs_uri)
        bucket = m.group(1)
        path = ""
        if m.group(2):
            path = m.group(2).lstrip("/")
        return bucket, path
    
    train_bucket_name, train_path = split_gcs_uri(train_data)
    
    from google.cloud import storage
    storage_client = storage.Client()
    train_bucket = storage_client.get_bucket(train_bucket_name)   
    train_blob = train_bucket.blob(train_path)

    train_blob.download_to_filename("/tmp/data.csv")
        
    import pandas as pd

    data = pd.read_csv("/tmp/data.csv")
    
    model_bucket_name, model_path = split_gcs_uri(model_file)
    
    
    
    
    data.dropna(axis=0, subset=['SalePrice'], inplace=True)

    y = data.SalePrice
    X = data.drop(['SalePrice'], axis=1).select_dtypes(exclude=['object'])

    train_X, test_X, train_y, test_y = train_test_split(X.values,
                                                        y.values,
                                                        test_size=test_size,
                                                        shuffle=False)

    imputer = SimpleImputer()
    train_X = imputer.fit_transform(train_X)
    test_X = imputer.transform(test_X)
    
    def train_model(train_X,
                train_y,
                test_X,
                test_y,
                n_estimators,
                learning_rate):
        """Train the model using XGBRegressor."""
        model = XGBRegressor(n_estimators=n_estimators, learning_rate=learning_rate)

        model.fit(train_X,
                train_y,
                early_stopping_rounds=40,
                eval_set=[(test_X, test_y)])

        print("Best RMSE on eval: {0} with {1} rounds".format(
                   model.best_score,
                   model.best_iteration+1))
        return model

    def eval_model(model, test_X, test_y):
        """Evaluate the model performance."""
        predictions = model.predict(test_X)
        print("mean_absolute_error=%.2f" % mean_absolute_error(test_y, predictions))

    def save_model(model, model_file):
        """Save XGBoost model for serving."""
        joblib.dump(model, model_file)
        print("Model export success: {0}".format(model_file))
    
    model = train_model(train_X,
                          train_y,
                          test_X,
                          test_y,
                          n_estimators,
                          learning_rate)

    eval_model(model, test_X, test_y)
    save_model(model, "/tmp/model.dat")

    model_bucket = storage_client.get_bucket(model_bucket_name)   
    model_blob = model_bucket.blob(model_path)
    
    model_blob.upload_from_filename("/tmp/model.dat")
    
    print("Model uploaded to {0}".format(model_file))
    return model_file
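
The `split_gcs_uri` helper nested inside `train` can also be exercised on its own before the component is built. The following is a minimal sketch that lifts the helper to module level for testing; the `ValueError` for input that does not start with `gs://` is an added assumption and is not part of the notebook's version, which would fail with an `AttributeError` on such input.

```python
import re

GCS_REGEX = re.compile(r"gs://([^/]*)(/.*)?")


def split_gcs_uri(gcs_uri):
    """Split a GCS URI into (bucket, path)."""
    m = GCS_REGEX.match(gcs_uri)
    if m is None:
        # Assumption: fail loudly on non-GCS input (not in the original helper).
        raise ValueError("Not a GCS URI: {0}".format(gcs_uri))
    bucket = m.group(1)
    # The path group is optional; strip the leading slash when it is present.
    path = m.group(2).lstrip("/") if m.group(2) else ""
    return bucket, path


print(split_gcs_uri("gs://code-search-demo_ames/data/ames_dataset/train.csv"))
print(split_gcs_uri("gs://code-search-demo_ames"))
```

A URI with no object path yields an empty path string, so callers that go on to create a blob may want to check for that case.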


In [127]:
# Run training locally
if False:
    train("gs://code-search-demo_ames/data/ames_dataset/train.csv", "gs://code-search-demo_ames/output/model-local.txt")

Convert the function to a pipeline operation

In [128]:
# Use the docker image built by fairing as the base image
image = "gcr.io/code-search-demo/fairing-job/fairing-job:1FE4890C"
train_op = comp.func_to_container_op(train, base_image=image)

#### Define the pipeline
The pipeline function must be decorated with the `@dsl.pipeline` decorator.

In [129]:
import kfp.dsl as dsl
from kubernetes import client as k8s_client
@dsl.pipeline(
   name='Training pipeline',
   description='A pipeline that trains the model.'
)
def train_pipeline(
   train_data="gs://code-search-demo_ames/data/ames_dataset/train.csv",
   model_file="gs://code-search-demo_ames/output/hello-world1.txt",
):      
    train_task = train_op(train_data, model_file).apply(gcp.use_gcp_secret('user-gcp-sa'))

#### Compile the pipeline

In [130]:
pipeline_func = train_pipeline
pipeline_filename = pipeline_func.__name__ + '.pipeline.zip'
import kfp.compiler as compiler
compiler.Compiler().compile(pipeline_func, pipeline_filename)

#### Submit the pipeline for execution

In [131]:
#Specify pipeline argument values
arguments = {"train_data": "gs://code-search-demo_ames/data/ames_dataset/train.csv",
             "model_file": "gs://code-search-demo_ames/output/hello-world1.txt"}

#Get or create an experiment and submit a pipeline run
import kfp
client = kfp.Client()
experiment = client.create_experiment(EXPERIMENT_NAME)

#Submit a pipeline run
run_name = pipeline_func.__name__ + ' run'
run_result = client.run_pipeline(experiment.id, run_name, pipeline_filename, arguments)

# The link in this cell's output leads to the run information page. (Note: a bug in JupyterLab modifies the URL and makes the link stop working.)

In [132]:
print("Previous completed run https://label-issues-0409.endpoints.code-search-demo.cloud.goog/pipeline/#/runs/details/1bc82d20-7870-11e9-8964-42010a8e00ff")

Previous completed run https://label-issues-0409.endpoints.code-search-demo.cloud.goog/pipeline/#/runs/details/1bc82d20-7870-11e9-8964-42010a8e00ff
