# Churn Project
  --------------------------------------------------------------------

_____________

It's easy to make a business case for customer churn analysis: given some relevant data, we want to spot those scenarios where a measurable intervention has a high likelihood of generating value. From the data scientist's perspective, making that case starts with some questions, some data, and some ideas about how to model that data.

The **[Kaggle Telco Churn dataset](https://www.kaggle.com/blastchar/telco-customer-churn)** is a great starting point, enabling us to set up an almost completely generic pipeline with all the core components of what could eventually become a complex churn prediction and intervention system.  As a bonus, the same setup could be used to develop a **[predictive maintenance system](https://docs.microsoft.com/en-us/archive/msdn-magazine/2019/may/machine-learning-using-survival-analysis-for-predictive-maintenance)**, or provide a key component in a health care and prevention system.

Churn can initially be approached as a binary classification problem.  We start with one or more static feature tables and estimate a prediction function.  In a real-time setting we could also join live aggregates to these static tables and then train a very high resolution churn-detection classifier. The churn detector developed in this notebook has one component that performs this type of static churn classification. It's called **current-state** in the pipeline flow chart below.

However, we can look to **[survival analysis](https://en.wikipedia.org/wiki/Survival_analysis)** if our data is time-stamped in such a way that we can define a duration feature that represents the age of an observation (for data stored in a database, see **[example sql query to get survival data from a table](https://lifelines.readthedocs.io/en/latest/Examples.html#example-sql-query-to-get-survival-data-from-a-table)**). Fortunately, the Telco dataset contains the client's contract tenure in months.  So a second regressor branch trains a number of survivability models that enable us to estimate the timing of events leading to churn; these can be found in **survival-curves** in the pipeline flow chart below.  (For now they are informational only, as this notebook is still in development.)
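To make the role of the duration feature concrete, here is a minimal sketch of a Kaplan-Meier survival estimate in plain Python. It is only an illustration of how tenure and a churn flag combine into a survival curve; the pipeline itself trains Cox proportional hazards models through hub functions, and the function name `kaplan_meier` and the toy data are assumptions made for this example.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Kaplan-Meier estimate of the survival function.

    durations: observed time for each customer (e.g. tenure in months)
    events:    1 if the customer churned at that time, 0 if the customer
               was still active (censored) when observation stopped
    Returns a list of (time, survival probability) at each churn time.
    """
    churned = Counter(t for t, e in zip(durations, events) if e)
    leaving = Counter(durations)  # churned or censored at time t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = churned.get(t, 0)
        if d:
            # fraction of the customers still at risk that survive time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# hypothetical toy data: tenure in months and a churn flag
tenure = [1, 2, 2, 3, 5]
churn = [1, 1, 0, 1, 0]
curve = kaplan_meier(tenure, churn)
for month, prob in curve:
    print(f"month {month}: S(t) = {prob:.2f}")
```

Censored customers (flag 0) never lower the curve directly, but they do shrink the at-risk group for later months, which is what separates this from a plain churn rate.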

## mlrun and nuclio

In developing this churn model we highlight how we can use **[mlrun projects](https://github.com/mlrun)**,  **[nuclio functions](https://nuclio.io/)**, and **[kubeflow pipelines](https://www.kubeflow.org/)** to set up and deploy a realistic churn model in a production environment.  Along the way we will:
1. **write custom data encoders**:  raw data often needs to be processed, some features need to be categorized, others binarized.
2. **summarize data**: look at things like class balance, variable distributions.
3. **define parameters and hyperparameters** for a generic XGBoost training function
4. **train and test** a number of models
5. **deploy a "best" model** into "production" as a nuclio serverless function
6. **test the model servers**

Additionally, we will demonstrate
* how logs and artifacts are collected throughout the entire process
* how results can be compared
* how github can help streamline, and most importantly document and version your entire process
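
The class-balance check in step 2 can be sketched in a few lines of plain Python. This is not the hub `describe` function, only an illustration; the helper names and the toy labels are assumptions. The second helper computes the negative-to-positive ratio that the XGBoost documentation suggests as a typical value to consider for `scale_pos_weight` on imbalanced data.

```python
from collections import Counter


def class_balance(labels):
    """Return the share of each class in a sequence of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def suggested_scale_pos_weight(labels, positive=1):
    """Ratio of negative to positive examples."""
    counts = Counter(labels)
    positives = counts[positive]
    negatives = sum(n for cls, n in counts.items() if cls != positive)
    if positives == 0:
        raise ValueError("no positive examples")
    return negatives / positives


# hypothetical encoded labels: 1 = churned, 0 = stayed
labels = [0, 0, 0, 1, 0, 1, 0, 0]
balance = class_balance(labels)
weight = suggested_scale_pos_weight(labels)
print(balance, weight)
```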

![pipeline](assets/pipeline-3.png)

## further development
### event simulator
Since we only have this one dataset, a "live" demonstration will require either changing datasets, or generating simulated data based on the existing telco sample.  Generative models are becoming more and more popular in research, training and in outright applications, so we will be looking to add one to our list of functions.  Given a clean and encoded original dataset, etc...

### recommendations
who needs attention?  how should we schedule interventions?

### apply this to your data
how can you adapt this project to suit your own needs?

## data science tags
xgboost<br>
cox proportional hazards regression<br>
classifiers<br>
survival analysis<br>

_note_: This notebook was adapted from a number of external sources:
* **[Churn Prediction and Prevention in Python](https://towardsdatascience.com/churn-prediction-and-prevention-in-python-2d454e5fd9a5)**

#### **notebook how-to's**
* Create and test a custom `data_clean` function 
* Examine data using a serverless (containerized) `describe` function
* Train a number of machine learning algorithms
* Tune hyperparameters 
* Create an automated ML pipeline from various library functions
* Run and track the pipeline results and artifacts

## a custom data cleaning function


In [1]:
# nuclio: ignore
import nuclio

In [2]:
%nuclio config kind = "job"
%nuclio config spec.image = "mlrun/ml-models"

%nuclio: setting kind to 'job'
%nuclio: setting spec.image to 'mlrun/ml-models'


In [3]:
import os

import json
import pandas as pd
import numpy as np
from collections import defaultdict

from cloudpickle import dumps, dump, load

from sklearn.preprocessing import (OneHotEncoder,
                                   LabelEncoder)

from mlrun.execution import MLClientCtx
from mlrun.datastore import DataItem

def data_clean(
    context:MLClientCtx, 
    src: DataItem,
    file_ext: str = "csv",
    models_dest: str = "models/encoders",
    cleaned_key: str = "cleaned-data",
    encoded_key: str = "encoded-data"
):
    """process a raw churn data file
    
    Data has 3 states here: `raw`, `cleaned` and `encoded`
    
    * `raw` kept by default, the pipeline begins with a raw data artifact
    * `cleaned` kept for charts, presentations
    * `encoded` is input for a cross validation and training function
    
    steps (not necessarily in correct order, some parallel)
    * column name maps
    * deal with nans and other types of missings/junk
    * label encode binary and ordinal category columns
    * create category ranges from numerical columns
    And finally,
    * test
    
    Why don't we one-hot-encode here? One-hot encoding isn't a necessary
    step for all algorithms. It can also generate a very large feature
    matrix that doesn't need to be serialized (even if sparse).
    So we leave one-hot encoding for the training step.
    
    What about scaling numerical columns? The same reasoning applies. Do
    we scale before the train-test split?  IMHO, no.  Scaling before
    splitting introduces a type of data leakage.  In addition, many
    estimators (tree-based ones, for example) are invariant to the
    monotonic transformations implied by scaling, so why waste the cycles?
    
    TODO: 
        * parallelize where possible
        * more abstraction (more parameters, chain sklearn transformers)
        * convert to marketplace function
        
    :param context:          the function execution context
    :param src:              an artifact or file path
    :param file_ext:         file type for artifacts
    :param models_dest:       label encoders and other preprocessing steps
                             should be saved together with other pipeline
                             models
    :param cleaned_key:      key of cleaned data table in artifact store
    :param encoded_key:      key of encoded data table in artifact store
    """
    df = src.as_df()
    
    # drop columns
    drop_cols_list = ["customerID", "TotalCharges"]
    df.drop(drop_cols_list, axis=1, inplace=True)
    
    # header transformations
    old_cols = df.columns
    rename_cols_map = {
        "SeniorCitizen" : "senior",
        "Partner"       : "partner",
        "Dependents"    : "deps",
        "Churn"         : "labels"
    }
    df.rename(rename_cols_map, axis=1, inplace=True)

    # add drop column to logs:
    for col in drop_cols_list:
        rename_cols_map.update({col: "_DROPPED_"})
    
    # log the op
    tp = os.path.join(models_dest, "preproc-column_map.json")
    context.log_artifact("preproc-column_map.json",
                         body=json.dumps(rename_cols_map),
                         local_path=tp)
    
    # VALUE transformations

    # clean
    # truncate reply to "No"
    df = df.applymap(lambda x: "No" if str(x).startswith("No ") else x)

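    # encode numerical type as category bins (ordinal)
    bins = [0, 12, 24, 36, 48, 60, np.inf]
    labels = [0, 1, 2, 3, 4, 5]
    # pd.cut bins are right-inclusive, so include_lowest=True is needed to
    # keep tenure == 0 in the first bin instead of turning it into NaN
    df["tenure_map"] = pd.cut(df.tenure, bins, labels=False,
                              include_lowest=True)
    # map each bin's lower edge to its ordinal label
    tenure_map = dict(zip(bins, labels))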
    # save this transformation
    tp = os.path.join(models_dest, "preproc-numcat_map.json")
    context.log_artifact("preproc-numcat_map.json", 
                         body=bytes(json.dumps(tenure_map).encode("utf-8")), 
                         local_path=tp)
    
    context.log_dataset(cleaned_key, df=df, format=file_ext, index=False)
    
    # label encoding - generate model for each column saved in dict
    # some of these columns may be hot encoded in the training step
    fix_cols = ["gender", "partner", "deps", "OnlineSecurity", 
                "OnlineBackup", "DeviceProtection", "TechSupport",
                "StreamingTV", "StreamingMovies", "PhoneService",
                "MultipleLines", "PaperlessBilling", "InternetService", 
                "Contract", "PaymentMethod", "labels"]
    
    d = defaultdict(LabelEncoder)
    df[fix_cols] = df[fix_cols].apply(lambda x: d[x.name].fit_transform(x.astype(str)))
    context.log_dataset(encoded_key, df=df, format=file_ext, index=False)

    model_bin = dumps(d)
    context.log_model("model", 
                      body=model_bin,
                      artifact_path=os.path.join(context.artifact_path, 
                                                 models_dest),
                      model_file="model.pkl")
    # would be nice to have a check here on the integrity of all done
    # raw->clean->encoded->clean->raw

In [4]:
# nuclio: end-code
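
The closing comment in `data_clean` wishes for an integrity check along the lines of raw->clean->encoded->clean->raw. A minimal sketch of the encoded->clean leg follows. To keep it dependency-free it uses a small stand-in class instead of sklearn's `LabelEncoder` (same idea: sorted classes mapped to `0..n-1`); the class, the helper names and the sample table are all assumptions made for illustration, not part of the notebook's code.

```python
class SimpleLabelEncoder:
    """Stand-in for sklearn's LabelEncoder: sorted classes -> 0..n-1."""

    def fit_transform(self, values):
        self.classes_ = sorted(set(values))
        self._index = {c: i for i, c in enumerate(self.classes_)}
        return [self._index[v] for v in values]

    def inverse_transform(self, codes):
        return [self.classes_[c] for c in codes]


def encode_columns(table, columns):
    """Encode the given columns; return (encoded table, fitted encoders)."""
    encoders = {}
    encoded = dict(table)
    for col in columns:
        encoders[col] = SimpleLabelEncoder()
        encoded[col] = encoders[col].fit_transform(table[col])
    return encoded, encoders


def decode_columns(encoded, encoders):
    """Invert encode_columns using the fitted encoders."""
    decoded = dict(encoded)
    for col, enc in encoders.items():
        decoded[col] = enc.inverse_transform(encoded[col])
    return decoded


# hypothetical slice of the cleaned table (column -> values)
cleaned = {
    "Contract": ["Month-to-month", "Two year", "One year", "Two year"],
    "labels": ["No", "Yes", "No", "No"],
    "tenure": [1, 34, 2, 45],
}
encoded, encoders = encode_columns(cleaned, ["Contract", "labels"])
roundtrip_ok = decode_columns(encoded, encoders) == cleaned
print("round trip ok:", roundtrip_ok)
```

With the real encoders, the same check would decode each column in `fix_cols` with its fitted `LabelEncoder` and compare the result against the cleaned table.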

## Create a project to host our functions, jobs and artifacts

Projects are used to package multiple functions, workflows, and artifacts. We usually store project code and definitions in a Git archive.

The following code creates a new project in a local directory and initializes git tracking for it.

In [5]:
from os import path
from mlrun import run_local, NewTask, mlconf, import_function, mount_v3io
mlconf.dbpath = mlconf.dbpath or "http://mlrun-api:8080"

# specify artifacts target location
artifact_path = mlconf.artifact_path or path.abspath("./")
project_name = "churn-project"

In [6]:
from mlrun import new_project, code_to_function
project_dir = "./project"
churn_proj = new_project(project_name, project_dir, init_git=True)
# churn_proj.artifact_path = "/User/artifacts/churn"

Since the raw data file sits at a fixed URL, we log it as an artifact right away, ensuring we keep a record of the source data used to generate the cleaned data and results.

In [7]:
churn_proj.log_artifact(
    "raw-data", 
    target_path="https://raw.githubusercontent.com/yjb-ds/testdata/master/data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

[mlrun] 2020-05-22 14:25:47,093 log artifact raw-data at https://raw.githubusercontent.com/yjb-ds/testdata/master/data/WA_Fn-UseC_-Telco-Customer-Churn.csv, size: None, db: Y


<a id="test-locally"></a>
### Run the data cleaning function locally

The function above can be tested locally. Parameters, inputs, and outputs can be specified in the API or the `Task` object.<br>
When using `run_local()`, the function inputs and outputs are automatically recorded by the MLRun experiment and data tracking DB.

In each run we can specify the function, inputs, parameters/hyper-parameters, etc. For more details, see the [mlrun_basics notebook](mlrun_basics.ipynb).
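
A plain parameter holds one value, while a hyper-parameter holds a list of candidates to try. Assuming a grid-style search, one run is generated per combination of candidate values. The sketch below shows that expansion in plain Python; `expand_grid` is a hypothetical helper, not an mlrun call, and the candidate values are made up (the parameter names follow the training step in the workflow further down).

```python
from itertools import product


def expand_grid(fixed, candidates):
    """Yield one parameter dict per combination of candidate values."""
    names = list(candidates)
    for values in product(*(candidates[n] for n in names)):
        yield {**fixed, **dict(zip(names, values))}


# fixed parameters are shared by every run
fixed = {"label_column": "labels", "CLASS_objective": "binary:logistic"}
# hypothetical candidate values for two tuning parameters
candidates = {
    "CLASS_max_depth": [3, 5, 7],
    "CLASS_learning_rate": [0.05, 0.15],
}
grid = list(expand_grid(fixed, candidates))
print(len(grid), "parameter sets")
```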

In [8]:
# run the function locally
from mlrun import NewTask
cleaner = run_local(
    name="data_clean",
    handler=data_clean, 
    project=project_name,
    inputs={"src": "store:///raw-data"},
    params={"file_ext" : "csv",
            "apply_tenure_map": False},
    artifact_path=path.join(mlconf.artifact_path, "churn"))

[mlrun] 2020-05-22 14:25:51,730 starting run data_clean uid=caed2e71528347ecad61af9161d15a01  -> http://mlrun-api:8080
[mlrun] 2020-05-22 14:25:52,066 log artifact preproc-column_map.json at /User/artifacts/churn/models/encoders/preproc-column_map.json, size: 146, db: Y
[mlrun] 2020-05-22 14:25:52,147 log artifact preproc-numcat_map.json at /User/artifacts/churn/models/encoders/preproc-numcat_map.json, size: 53, db: Y
[mlrun] 2020-05-22 14:25:52,402 log artifact cleaned-data at /User/artifacts/churn/cleaned-data.csv, size: 708244, db: Y
[mlrun] 2020-05-22 14:25:52,595 log artifact encoded-data at /User/artifacts/churn/encoded-data.csv, size: 326760, db: Y
[mlrun] 2020-05-22 14:25:52,615 log artifact model at /User/artifacts/churn/models/encoders/, size: 1515, db: Y



| project | uid | iter | start | state | name | labels | inputs | parameters | results | artifacts |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| churn-project | ...61d15a01 | 0 | May 22 14:25:51 | completed | data_clean | v3io_user=admin<br>kind=handler<br>owner=admin<br>host=jupyter-67c88b95d4-crdhq | src | file_ext=csv<br>apply_tenure_map=False | | preproc-column_map.json<br>preproc-numcat_map.json<br>cleaned-data<br>encoded-data<br>model |


to track results use .show() or .logs() or in CLI: 
!mlrun get run caed2e71528347ecad61af9161d15a01 --project churn-project , !mlrun logs caed2e71528347ecad61af9161d15a01 --project churn-project
[mlrun] 2020-05-22 14:25:52,703 run executed, status=completed


#### Convert our local code to a distributed serverless function object 

In [9]:
clean_func = code_to_function(name="clean_data", kind="job")

## Create a Fully Automated ML Pipeline

#### Add more functions to our project to be used in our pipeline (from the functions hub/marketplace)

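Summary statistics (`describe`), XGBoost training and testing (`xgb_trainer`, `xgb_test`), Cox proportional hazards training and testing (`coxph_trainer`, `coxph_test`), and a real-time model server (`churn_server`).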

In [10]:
churn_proj.set_function(clean_func)
churn_proj.set_function("hub://describe", "describe")

churn_proj.set_function("hub://xgb_trainer", "classify")
churn_proj.set_function("hub://xgb_test", "xgbtest")

churn_proj.set_function("hub://coxph_trainer", "survive")
churn_proj.set_function("hub://coxph_test", "coxtest")

churn_proj.set_function("hub://churn_server", "server")

<mlrun.runtimes.function.RemoteRuntime at 0x7f76f71d5358>

#### Define and save a pipeline 

The following workflow definition will be written to a file. It describes a Kubeflow execution graph (DAG)<br>
and how functions and data are connected to form an end-to-end pipeline.

* Build the data cleaning (`clean-data`) function container
* Clean and encode the raw churn data
* Analyze the dataset (describe)
* Train and test the model
* Deploy the model as a real-time serverless function
* Test the serverless function REST API with test dataset

Check the code below to see how function objects are initialized and used (by name) inside the workflow.<br>
The `workflow.py` file has two parts: initializing the function objects and defining the pipeline DSL (connecting the function inputs and outputs).
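
Connecting inputs to outputs is what gives the pipeline its execution order. As a rough illustration (standard-library `graphlib`, not `kfp`), the dependencies in the workflow cell that follows can be written as a graph and sorted topologically. The names `build-cleaner`, `build-xgb` and `deploy-server` are labels invented for this sketch; the other names are the step names used in the workflow.

```python
from graphlib import TopologicalSorter

# step -> set of steps it depends on, read off the workflow definition
dependencies = {
    "build-cleaner": set(),
    "build-xgb": set(),
    "clean-data": {"build-cleaner"},                  # uses the built image
    "summary": {"clean-data"},                        # reads encoded-data
    "current-state": {"build-xgb", "clean-data"},     # XGBoost classifier
    "survival-curves": {"clean-data"},                # Cox model
    "test classifier": {"current-state"},
    "test regressor": {"survival-curves"},
    "deploy-server": {"current-state", "survival-curves"},
}

# any order produced here runs every step after its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(" -> ".join(order))
```

Steps with no path between them, such as `summary` and `survival-curves`, can run in parallel.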

> Note: the pipeline can include CI steps like building container images and deploying models as illustrated  in the following example.


In [11]:
%%writefile project/workflow.py
from kfp import dsl
from mlrun import mount_v3io

funcs = {}

GPUS = False

DATA_REPO = "https://raw.githubusercontent.com/yjb-ds/testdata/master"
DATA_PATH = "data/WA_Fn-UseC_-Telco-Customer-Churn.csv"
RAW_CHURN_DATA = f"{DATA_REPO}/{DATA_PATH}"

# init functions is used to configure function resources and local settings
def init_functions(functions: dict, project=None, secrets=None):
    for f in functions.values():
        f.apply(mount_v3io())
        
    functions["server"].set_env("INFERENCE_STREAM", "users/admin/artifacts/churn/model_stream")

    
@dsl.pipeline(
    name="Demo training pipeline",
    description="Shows how to use mlrun."
)
def kfpipeline():
    
    # build our cleaner function (container image)
    builder_cleaner = funcs["clean-data"].deploy_step(skip_deployed=True, 
                                                      with_mlrun=False)
    # use mlrun<=0.4.7 only:
    builder_xgb = funcs["classify"].deploy_step(skip_deployed=True, 
                                                with_mlrun=False)
    
    # run the ingestion function with the new image and params
    clean = funcs["clean-data"].as_step(
        name="clean-data",
        handler="data_clean",
        image=builder_cleaner.outputs["image"],
        params={"file_ext": "csv",
                "models_dest": "models/encoders"},
        inputs={"src": RAW_CHURN_DATA},
        outputs=["preproc-column_map",
                 "preproc-numcat_map",
                 "preproc-label_encoders",
                 "cleaned-data",
                 "encoded-data",
                 "tenured-data"])

    # analyze our dataset
    describe = funcs["describe"].as_step(
        name="summary",
        params={"label_column"  : "labels"},
        inputs={"table": clean.outputs["encoded-data"]},
        outputs=["histograms", 
                 "imbalance",
                 "correlation",
                 "correlation-matrix"])
    
    # train with hyper-parameters
    xgb = funcs["classify"].as_step(
        name="current-state",
        handler="train_model",
        image=builder_xgb.outputs["image"],
        params={"sample"                  : -1, 
                "label_column"            : "labels",
                "model_type"              : "classifier",
                # xgb class initializers (tuning candidates):
                "CLASS_tree_method"       : "gpu_hist" if GPUS else "hist",
                "CLASS_objective"         : "binary:logistic",
                "CLASS_n_estimators"      : 50,
                "CLASS_max_depth"         : 5,
                "CLASS_learning_rate"     : 0.15,
                "CLASS_colsample_bylevel" : 0.7,
                "CLASS_colsample_bytree"  : 0.8,
                "CLASS_gamma"             : 1.0,
                "CLASS_max_delta_step"    : 3,
                "CLASS_min_child_weight"  : 1.0,
                "CLASS_reg_lambda"        : 10.0,
                "CLASS_scale_pos_weight"  : 1,
                "FIT_verbose"             : 0,
                "CLASS_subsample"         : 0.9,
                "CLASS_booster"           : "gbtree",
                "CLASS_random_state"      : 1,
                # encoding:
                "encode_cols"        : {"InternetService": "ISP",
                                        "Contract"       : "Contract",
                                        "PaymentMethod"   : "Payment"},
                # outputs
                "models_dest"        : "models",
                "plots_dest"         : "plots",
                "file_ext"           : "csv"
               },
        inputs={"dataset"   : clean.outputs["encoded-data"]},
        outputs=["model", 
                 "test-set"])

    cox = funcs["survive"].as_step(
        name="survival-curves",
        params={"sample"                  : -1, 
                "event_column"            : "labels",
                "strata_cols" : ['InternetService', 'StreamingMovies', 
                                 'StreamingTV', 'PhoneService'],
                "encode_cols" : {"Contract"       : "Contract",
                                 "PaymentMethod"  : "Payment"},
                # outputs
                "models_dest"        : "models/cox",
                "plots_dest"         : "plots",
                "file_ext"           : "csv"
               },
        inputs={"dataset"   : clean.outputs["encoded-data"]},
        outputs=["cx-model",
                 "coxhazard-summary",
                 "tenured-test-set"])

    test_xgb = funcs["xgbtest"].as_step(
        name="test classifier",
        params={"label_column": "labels",
                "plots_dest"  : "churn/test/xgb"},
        inputs={"models_path"  : xgb.outputs["model"],
                "test_set"    : xgb.outputs["test-set"]})

    test_cox = funcs["coxtest"].as_step(
        name="test regressor",
        params={"label_column": "labels",
                "plots_dest"  : "churn/test/cox"},
        inputs={"models_path"  : cox.outputs["cx-model"],
                "test_set"    : cox.outputs["tenured-test-set"]})

    # deploy our model as a serverless function
    deploy_xgb = funcs["server"].deploy_step(
        models={"churn_server_v1": xgb.outputs["model"]})
    deploy_xgb.after(cox)

Overwriting project/workflow.py


In [12]:
# register the workflow file as "main"
churn_proj.set_workflow("main", "workflow.py", embed=True)

Save the project definitions to a file (project.yaml). It is recommended to commit all changes to a Git repo.

In [13]:
churn_proj.save()

In [14]:
#print(churn_proj.to_yaml())

<a id="run-pipeline"></a>
## Run a pipeline workflow
Use the `run` method to execute a workflow. You can provide alternative arguments and specify the default target for workflow artifacts.<br>
The workflow ID is returned and can be used to track progress, or you can use the hyperlinks.

> Note: The same command can be issued through CLI commands:<br>
    `mlrun project my-proj/ -r main -p "v3io:///users/admin/mlrun/kfp/{{workflow.uid}}/"`

The `dirty` flag allows us to run a project with uncommitted changes (when the notebook is in the same git dir it will always be dirty).

In [15]:
artifact_path = os.path.join(mlconf.artifact_path, "churn")
run_id = churn_proj.run(
    "main",
    arguments={},
    artifact_path=artifact_path,
    dirty=True)

[mlrun] 2020-05-22 14:26:01,408 Pipeline run id=e2067b83-cf97-4415-ae57-800df70fe327, check UI or DB for progress


#### Track pipeline results

In [16]:
# from mlrun import get_run_db
# db = get_run_db().connect()
# db.list_runs(project=churn_proj.name).show() #, labels=f"workflow={run_id}").show()

**[back to top](#top)**