# 03 C3 Platform Machine Learning Pipelines

## Table of Contents 

* [Machine Learning Pipelines on the C3 AI Suite](#1)
    * [0. Import packages and Helper Functions](#1.1)
    * [1. Retrieve Data](#1.2)
    * [2. Train-Test Split](#1.3)
    * [3. Specify Preprocessing Pipeline](#1.4)
    * [4. Specify Classifier](#1.5)
    * [5. Create Heterogeneous Pipeline](#1.6)
    * [6. Train Pipeline](#1.7)
    * [7. Score Pipeline and Analyze Results](#1.8)



### 0. Import the necessary packages <a class="anchor" id="1.1">

In [9]:
import pandas as pd
import numpy as np 
import collections
from datetime import datetime
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10,6)
%matplotlib inline

Helper methods that you defined in the previous module:

In [10]:
'''
Implemented in previous challenge
'''
def spec_to_emr_to_df(source_type: object,
                      spec: object,
                      on_the_fly: bool,
                      overrideMetrics: list
             ):
        
    return df

In [11]:
'''
Implemented in previous challenge
'''
def plot_metrics(df, ids, expressions):
    return

We would like to explore the timeseries features on the SmartBulb type and develop a supervised learning model that, given a set of observations on any given day, predicts whether the SmartBulb will fail in the following month.

### 1. Retrieve Data <a class="anchor" id="1.2">

The first thing we need to do in order to train our machine learning pipeline is to assemble our training data.

For our binary classification problem, we will use the simple metrics and compound metrics we explored in the previous module as features. The `WillFailNextMonth` compound metric will serve as the label while the `HasEverFailed` metric will serve as a mask to ignore all data after a bulb has failed.

In [12]:
# List of features that we would like to use in our model
features = [
            "AveragePower",
            "AverageTemperature",
]

# We will use this to discard data AFTER a bulb has failed
mask = 'HasEverFailed'
label = 'WillFailNextMonth'
columns = features + [mask, label]
print(columns)

['AveragePower', 'AverageTemperature', 'HasEverFailed', 'WillFailNextMonth']


**Evaluating metrics with `.evalMetrics()` is a synchronous operation and should not be called on large datasets. The helper methods we've defined load the results of the evaluated metrics into the memory of our Jupyter notebook, so we should be conscious of the memory our dataset requires.**
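To get a feel for the memory involved before calling the helper, a back-of-the-envelope estimate can help. The sketch below is only illustrative: the function name `estimate_dataframe_bytes`, the bulb count of 100, and the assumption of 8 bytes per value are all hypothetical, and the estimate ignores the index and any string columns such as the bulb identifier.

```python
from datetime import datetime


def estimate_dataframe_bytes(n_bulbs, start, end, n_columns, bytes_per_value=8):
    """Rough size of a daily-interval result: one row per bulb per day.

    Assumes every value is stored as an 8-byte number (hypothetical default).
    """
    n_days = (end - start).days
    n_rows = n_bulbs * n_days
    return n_rows, n_rows * n_columns * bytes_per_value


# Hypothetical scenario: 100 bulbs, the date range used in this notebook,
# and 4 numeric columns (2 features + mask + label).
n_rows, n_bytes = estimate_dataframe_bytes(
    n_bulbs=100,
    start=datetime(2016, 1, 1),
    end=datetime(2021, 1, 1),
    n_columns=4,
)
print(f"{n_rows:,} rows, roughly {n_bytes / 1024**2:.1f} MiB of numeric data")
```

Scaling `n_bulbs`, the date range, or the interval (for example hourly instead of daily) in this estimate shows how quickly a synchronous call can outgrow notebook memory.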

We retrieve the features along with the label and mask at a daily interval using our helper function on a subset of our total data.

In [14]:
start = datetime(2016,1,1)
end = datetime(2021,1,1)
interval = 'DAY'

my_spec = c3.EvalMetricsSpec(filter = "startsWith(id, 'SMBLB')",
                             expressions = columns,
                             start = start,
                             end = end,
                             interval = interval)

df = spec_to_emr_to_df(source_type=c3.SmartBulb,
                       spec=my_spec,
                       on_the_fly=False,
                       overrideMetrics=[None])
df


### 2. Train-Test Split <a class="anchor" id="1.3">

We have assembled the data we need into a well-formatted dataframe, and now we can process this dataframe to generate training and testing data.

In this example, we make the simplifying assumption that all SmartBulbs are independent of each other, so we can split our timeseries data into train and test sets by bulb ID. This is reasonable here because the classification dataset is completely fabricated. It lets us focus on the mechanics of model building and deployment on the platform without the added complication of timeseries cross-validation.

In [None]:
df = df[df[mask] == 0]

df_train = df[~df['source'].str.contains("SMBLB11|SMBLB12")]
df_test = df[df['source'].str.contains("SMBLB11|SMBLB12")]
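One detail of the split above is worth illustrating: `str.contains` does a regular-expression substring match, so a pattern such as `"SMBLB11|SMBLB12"` also matches any longer ID that contains it. The toy dataframe below is made up for illustration (the real contents of the `source` column are not shown in this notebook, and whether IDs like `SMBLB110` exist is an assumption). It compares the substring match with an exact match via `isin`.

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    "source": ["SMBLB1", "SMBLB1", "SMBLB11", "SMBLB11", "SMBLB12", "SMBLB110"],
    "HasEverFailed": [0, 0, 0, 1, 0, 0],
    "WillFailNextMonth": [0, 0, 1, 1, 0, 0],
})

# Apply the mask first: drop every row recorded after a bulb has failed.
toy = toy[toy["HasEverFailed"] == 0]

# Substring match (as in the cell above) versus exact ID match.
is_test_substring = toy["source"].str.contains("SMBLB11|SMBLB12")
is_test_exact = toy["source"].isin({"SMBLB11", "SMBLB12"})

substring_test_ids = sorted(toy.loc[is_test_substring, "source"].unique())
exact_test_ids = sorted(toy.loc[is_test_exact, "source"].unique())

toy_train = toy[~is_test_exact]
toy_test = toy[is_test_exact]

print("substring match selects:", substring_test_ids)
print("exact match selects:    ", exact_test_ids)
```

If the intent is to hold out exactly two bulbs, the exact match is the safer choice. If the intent is to hold out every bulb whose ID starts with those prefixes, the substring match does that.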


**Note**: Here we convert our train and test dataframes to the C3 `Dataset` type.

In [None]:
X_train = c3.Dataset.fromPython(df_train[features])
y_train = c3.Dataset.fromPython(df_train[label].values)
X_test = c3.Dataset.fromPython(df_test[features])
y_test = c3.Dataset.fromPython(df_test[label].values)

Examine the structure of `X_train`:

In [None]:
X_train

In [None]:
X_train.shape, X_test.shape

### 3. Specify Preprocessing Pipeline <a class="anchor" id="1.4">

In this module, we will construct a machine learning pipeline that consists of a **preprocessing** step followed by a **classification** step. The preprocessing step standardizes the features and then maps the data onto its principal components.

We first define the individual `SklearnPipe`s.

In [15]:
help(c3.SklearnPipe)

**Note** that `SklearnPipe` mixes `MLLeafPipe`

First we define the StandardScaler operation. **Note** the structure of the `SklearnTechnique`.

In [None]:
standardScaler = c3.SklearnPipe(
                    name="standardScaler",
                    technique=c3.SklearnTechnique(
                        # This tells ML pipeline to import sklearn.preprocessing.StandardScaler.
                        name="preprocessing.StandardScaler",
                        # This tells ML pipeline to call "transform" method on sklearn.preprocessing.StandardScaler when we invoke the C3 action process() later.
                        processingFunctionName="transform"
                    )
                 )

Then we define the PCA operation. **Note** how hyperparameters may be passed to the SklearnTechnique.

In [None]:
pca = c3.SklearnPipe(
         name="pca",
         technique=c3.SklearnTechnique(
             name="decomposition.PCA",
             processingFunctionName="transform",
             # hyperParameters are passed to sklearn.decomposition.PCA as kwargs.
             # Uncomment the next line to set them:
             # hyperParameters={"n_components": 2}
         )
      )
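The comments in the two cells above say that the technique `name` determines which class is imported and that `hyperParameters` are passed to it as keyword arguments. The sketch below shows one way that kind of lookup can be written in plain Python. It is not the C3 implementation: `resolve_technique` and its `root` parameter are hypothetical names, and the runnable demonstration uses the standard-library `json` package so that it works without scikit-learn installed.

```python
import importlib


def resolve_technique(name, hyper_parameters=None, root="sklearn"):
    """Import `<root>.<module path>` and instantiate the named class.

    `name` is a dotted string such as "decomposition.PCA"; the last part is
    the class, everything before it is the module path under `root`.
    `hyper_parameters` is forwarded to the constructor as keyword arguments.
    """
    module_path, _, class_name = name.rpartition(".")
    full_module = f"{root}.{module_path}" if module_path else root
    module = importlib.import_module(full_module)
    cls = getattr(module, class_name)
    return cls(**(hyper_parameters or {}))


# Standard-library demonstration: equivalent to json.decoder.JSONDecoder(strict=False).
decoder = resolve_technique("decoder.JSONDecoder", {"strict": False}, root="json")
print(type(decoder).__name__, decoder.strict)

# With scikit-learn installed, the analogous call would be:
# pca = resolve_technique("decomposition.PCA", {"n_components": 2})
```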

Now we can put these pieces together into an `MLSerialPipeline`. **Note** the `steps` field, which consists of an array of `MLStep`s.

In [None]:
preprocess_pipeline = c3.MLSerialPipeline(
                        name="preprocessPipeline",
                        steps=[c3.MLStep(name="standardScaler",
                                         pipe=standardScaler),
                               c3.MLStep(name="pca",
                                         pipe=pca)
                              ]
)
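For intuition about what the two preprocessing steps compute, here is a small NumPy sketch of standardization followed by a projection onto principal components. It is an illustration under assumptions, not what runs on the platform: the data is synthetic, the helper names `standardize` and `pca_project` are hypothetical, and the signs of the components may differ from those produced by scikit-learn's `PCA`.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit (population) variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # guard against division by zero for constant columns
    return (X - mean) / std


def pca_project(X, n_components=None):
    """Project centered data onto its leading principal components via SVD."""
    centered = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]  # rows are principal directions
    return centered @ components.T


# Synthetic, correlated stand-ins for the two features used in this module.
rng = np.random.default_rng(0)
power = rng.normal(60.0, 5.0, size=200)
temperature = 0.8 * power + rng.normal(20.0, 2.0, size=200)
X = np.column_stack([power, temperature])

Z = pca_project(standardize(X))
print("variance per component:", Z.var(axis=0))
print("covariance between components:", np.cov(Z, rowvar=False)[0, 1])
```

The projected columns are uncorrelated and ordered by variance. With only two input features and no `n_components` set, the projection keeps both components, so it rotates the data rather than reducing its dimension.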

### 4. Specify Classifier <a class="anchor" id="1.5">

For our binary classifier, we will use a LightGBM model. LightGBM is a gradient boosting framework designed for efficient, distributed training. The C3 Type `LightGbmPipe` uses the LightGBM framework underneath, and its implementation closely follows that of the `SklearnPipe` Type.

For more code examples of declaring other MLPipes refer to our documentation: [Code Examples for MLPipe Interface](https://developer.c3.ai/docs/7.24.0/topic/mlpipe-code-examples)

In [None]:
classifier = c3.LightGbmPipe(
                    name="lgbmClassifier",
                    technique=c3.LightGbmTechnique(
                        name="LGBMClassifier",
                        processingFunctionName="predict",
                        hyperParameters={
                                            'boosting_type': 'gbdt',
                                            'learning_rate': 0.1,
                                            'n_estimators': 2000,
                                        }),
#                     interpretTechnique=c3.ShapInterpretTechnique(
#                                                 name="TreeExplainer"
#                                         )

)

**Note** the interpret technique is used for generating feature contributions. You can investigate this as part of an Advanced Challenge in this module.
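To make the roles of `n_estimators` and `learning_rate` more concrete, the following is a minimal, pure-Python sketch of gradient boosting for squared-error regression with one-split stumps. It is a teaching toy under stated assumptions, not LightGBM's algorithm (LightGBM grows histogram-based trees and would use a classification loss here). All function names and the tiny dataset are hypothetical.

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_estimators, learning_rate):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    stumps = []
    predictions = [base] * len(y)
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Each new stump contributes only a fraction (the learning rate).
        predictions = [pi + learning_rate * stump(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]
few = boost(x, y, n_estimators=1, learning_rate=0.1)
many = boost(x, y, n_estimators=50, learning_rate=0.1)
print("MSE with 1 stump:  ", mse(few, x, y))
print("MSE with 50 stumps:", mse(many, x, y))
```

A smaller learning rate means each estimator corrects less of the remaining error, so more estimators are needed to reach the same training fit. That trade-off is what the `learning_rate` and `n_estimators` hyperparameters control.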

### 5. Create Heterogeneous Pipeline and Upsert <a class="anchor" id="1.6">

Now that we have our **preprocessing** and **classification** steps defined, we can combine them into an `MLSerialPipeline`.

In [None]:
heterogeneousPipeline = c3.MLSerialPipeline(
                name="myPipeline",
                steps=[c3.MLStep(name="preprocess",
                                 pipe=preprocess_pipeline),
                       c3.MLStep(name="classifier",
                                 pipe=classifier)],
                scoringMetrics=c3.MLScoringMetric.toScoringMetricMap(
                    scoringMetricList=[c3.MLAccuracyMetric(),
                                       c3.MLPrecisionMetric(),
                                       c3.MLF1ScoreMetric()])
             )


Now that we have defined our `MLSerialPipeline`, we can **upsert** it so that it persists on the platform.

In [None]:
upserted_heterogeneousPipeline = heterogeneousPipeline.upsert()

Now that the `MLSerialPipeline` is upserted on the platform, we can retrieve it via an API call. 

**Note** that in this example we're retrieving the machine learning pipeline that we just declared. However, now that it's persisted on the platform, anyone on our team can also retrieve it and use it in their experiments.

In [None]:
retrievedPipeline = upserted_heterogeneousPipeline.get()
retrievedPipeline

### 6. Train Pipeline <a class="anchor" id="1.7">

Now we can train on the platform by calling the `train()` method on the `MLSerialPipeline` and passing in our training `Dataset`s.

**Note**: This method of synchronous training should only be used while experimenting with small datasets. In the upcoming modules, we will explore asynchronous training techniques that use the distributed framework of the C3 AI Suite.

In [None]:
%%time
retrievedPipeline = retrievedPipeline.train(X_train, y_train)

In [None]:
retrievedPipeline

In [None]:
retrievedPipeline.upsert()

### 7. Score and Analyze Results <a class="anchor" id="1.8">

We can score our model on the platform.

In [None]:
train_score = retrievedPipeline.score(input=X_train, targetOutput=y_train)
test_score = retrievedPipeline.score(input=X_test, targetOutput=y_test)
print(f'train score = {train_score}')
print(f'test score = {test_score}')
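The pipeline was configured with accuracy, precision, and F1 scoring metrics. As a reference for reading the scores, the sketch below computes the standard binary-classification definitions of those quantities from a list of true and predicted labels. It assumes the usual textbook formulas with label 1 as the positive class; the exact conventions used by the C3 metric types are not shown in this notebook, and the helper name `binary_scores` and the sample labels are hypothetical.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 1, 0]
scores = binary_scores(y_true_demo, y_pred_demo)
print(scores)
```

On an imbalanced problem such as failure prediction, accuracy can look high even when few failures are caught, which is why precision and F1 are tracked alongside it.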

We can generate predictions on the platform. Note the returned predictions are c3 `Dataset`s.

In [None]:
y_pred_train = retrievedPipeline.process(input=X_train)
y_pred_test = retrievedPipeline.process(input=X_test)

We convert back to a pandas dataframe.

In [None]:
y_pred_train_df = c3.Dataset.toPandas(y_pred_train)
y_pred_test_df = c3.Dataset.toPandas(y_pred_test)

Now we declare a plotting method that plots the precision-recall curves for the test and training sets and reports the average precision of each.

In [None]:
from matplotlib import pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

def plot_prec(y_pred_test, y_true_test, y_pred_train, y_true_train, fp=None, save=False):
    """Plot precision-recall curves for the test and training sets.

    Returns a tuple (average precision on test, average precision on train).
    If save is True, the figure is also written to the path given by fp.
    """
    lw = 1
    plt.figure(figsize=[6, 6])

    # Test set. The arguments are passed positionally (true labels, then
    # predicted scores) because the keyword name for the scores has changed
    # between scikit-learn versions.
    prec, rec, _ = precision_recall_curve(y_true_test, y_pred_test)
    avg_prec_test = average_precision_score(y_true_test, y_pred_test)
    plt.plot(rec, prec, color='darkorange', lw=lw,
             label='Precision-Recall curve (average precision = %0.2f), Testing' % avg_prec_test)

    # Training set, drawn on the same axes for comparison.
    prec, rec, _ = precision_recall_curve(y_true_train, y_pred_train)
    avg_prec_train = average_precision_score(y_true_train, y_pred_train)
    plt.plot(rec, prec, color='blue', lw=lw,
             label='Precision-Recall curve (average precision = %0.2f), Training' % avg_prec_train)

    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall curve')
    plt.legend(loc="lower right")
    if save:
        plt.savefig(fp)
    plt.show()
    return avg_prec_test, avg_prec_train


We can use the helper method to plot the precision-recall curve for the training and test sets.

In [None]:
test_avg_prec, train_avg_prec = plot_prec(y_pred_test=y_pred_test_df,
                                          y_true_test=df_test[label],
                                          y_pred_train=y_pred_train_df,
                                          y_true_train=df_train[label])

In [None]:
print(f'Average Precision Train: {train_avg_prec}')
print(f'Average Precision Test: {test_avg_prec}')
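To show what the average precision number summarizes, here is a small pure-Python sketch that computes it as the recall-weighted sum of precisions over decreasing score thresholds, which is the step-wise definition scikit-learn documents for `average_precision_score`. The function name `average_precision` and the sample data are hypothetical. The second call uses hard 0/1 predictions instead of scores, which is relevant here because the classifier above was configured with `processingFunctionName="predict"`.

```python
def average_precision(y_true, scores):
    """Sum over thresholds of (recall gain) * (precision at that threshold)."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive labels")

    # Walk through predictions from highest to lowest score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    true_positives = 0
    previous_recall = 0.0
    result = 0.0
    i = 0
    while i < len(ranked):
        # Items with identical scores share one threshold.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            true_positives += ranked[j][1]
            j += 1
        precision = true_positives / j
        recall = true_positives / total_positives
        result += (recall - previous_recall) * precision
        previous_recall = recall
        i = j
    return result


labels = [0, 0, 1, 1]
ap_from_scores = average_precision(labels, [0.1, 0.4, 0.35, 0.8])
ap_from_hard_labels = average_precision(labels, [0, 1, 0, 1])
print(f"from scores:      {ap_from_scores:.3f}")
print(f"from hard labels: {ap_from_hard_labels:.3f}")
```

With hard labels there are only two distinct thresholds, so the curve collapses to a couple of points. If the pipeline can return class probabilities, passing those instead gives a more informative precision-recall curve.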


### Use this section for the Advanced Challenge: Model Interpretability and Feature Contributions

In order to complete this challenge, go back up to the cell where you specified your classifier and uncomment the lines to include the InterpretTechnique. Re-run your notebook up until this point.

You can explore the various explainability frameworks we can use in our documentation here: [ML Pipeline Interpretability](https://developer.c3.ai/docs/7.24.0/topic/pipeline-interpretability)

Generate feature contributions using the interpret technique specified on your pipeline:

In [None]:
interpret_result = retrievedPipeline.interpret(input=X_test)


Extract the feature contributions for each prediction:

In [None]:
contributions_dataset = c3.Dataset.fromTensor(
    tensor=interpret_result.contributions.subTensor([":",":","1"]),
    axesAsRow=[0],
    axesAsColumn=[1, 2])
appended = interpret_result.output.extractColumns(["prediction"]).appendColumns(dataset=contributions_dataset)
contributions_df = c3.Dataset.toPandas(dataset=appended)


In [None]:
contributions_df

# Make sure to SYNC your notebook to the server, then CLOSE AND HALT this notebook when you leave.
To sync: go to the File menu, Save and Checkpoint your notebook, and then select "Upload Notebook to C3.ai", or select the notebook in the tree view (check the box) and hit the "Sync" button.