# Machine Learning Experimentation on the C3 AI Suite

In this notebook we will cover:
- Best practices for experimenting with and tuning machine learning pipelines at scale
- Performing hyperparameter optimization using the `MLAutoTuner`
- Preparing a trained and tuned pipeline for production deployment as an `MLModel`

## Table of Contents 

* [Machine Learning Experimentation on the C3 AI Suite](#1)
    * [1. Import the necessary packages](#1.1)
    * [2. Create Train, Test and Validation segments](#1.3)
    * [3. Define how to retrieve features, label, mask](#1.4)
    * [4. Define Machine Learning Pipeline](#1.5)
    * [5. Define Hyperparameter Search Space](#1.6)
    * [6. Define Scoring Metric](#1.7)
    * [7. Define the Validation Technique](#1.8)
    * [8. Define the Hyperparameter Search Technique](#1.9)
    * [9. Define the Execution Criteria](#1.10)
    * [10. Put it all together in MLAutoTunerSearchSpec](#1.11)
    * [11. Submit a MLAutoTuner search job](#1.12)
    * [12. Score and Analyze Results](#1.13)
    * [13. Create MLModel for Deployment](#1.14)



### Step 1. Import the necessary packages <a class="anchor" id="1.1"></a>

In [12]:
import pandas as pd
import numpy as np 
import collections
from datetime import datetime
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = (10,6)
%matplotlib inline

### Step 2. Create Train, Test and Validation segments <a class="anchor" id="1.3"></a>


Choose the C3 Type that is the subject of your ML experiments, that is, the Type on which your features and metrics are defined.

In [13]:
subject_type_name = "SmartBulb"

Here we create our training, validation and test segments:

In [14]:
training_segment = c3.MLPopulationSegment(
    subjectFilter="!(contains(id, 'SMBLB1') || contains(id, 'SMBLB2'))", # Which subjects to include in the group
#     mlProject=project, # use this to attach your model experimentation to an existing ML Project for traceability
    name="TrainingBulbs",
).upsert()


In [15]:
validation_segment = c3.MLPopulationSegment(
    subjectFilter="contains(id, 'SMBLB1')", # Which subjects to include in the group
#     mlProject=project, # use this to attach your model experimentation to an existing ML Project for traceability
    name="ValidationBulbs",
).upsert()


In [16]:
test_segment = c3.MLPopulationSegment(
    subjectFilter="contains(id, 'SMBLB2')", # Which subjects to include in the group
#     mlProject=project, # use this to attach your model experimentation to an existing ML Project for traceability
    name="TestingBulbs",
).upsert()
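
The three `subjectFilter` expressions are meant to split the bulbs into disjoint groups. The sketch below checks that property in plain Python. It assumes that `contains(id, ...)` behaves like a substring match, and the bulb ids are hypothetical placeholders rather than ids read from the platform.

```python
# Hypothetical bulb ids; the real ids live on the SmartBulb type.
bulb_ids = [f"SMBLB{i}" for i in range(1, 31)]

def in_validation(bulb_id: str) -> bool:
    # Mirrors contains(id, 'SMBLB1'), assumed to be a substring match.
    return "SMBLB1" in bulb_id

def in_test(bulb_id: str) -> bool:
    # Mirrors contains(id, 'SMBLB2').
    return "SMBLB2" in bulb_id

def in_training(bulb_id: str) -> bool:
    # Mirrors !(contains(id, 'SMBLB1') || contains(id, 'SMBLB2')).
    return not (in_validation(bulb_id) or in_test(bulb_id))

validation_ids = [b for b in bulb_ids if in_validation(b)]
test_ids = [b for b in bulb_ids if in_test(b)]
training_ids = [b for b in bulb_ids if in_training(b)]

# An id that matched both filters would leak between validation and test.
overlap = set(validation_ids) & set(test_ids)
print(len(training_ids), len(validation_ids), len(test_ids), len(overlap))
# prints: 8 11 11 0
```

Note that a substring filter such as `'SMBLB1'` also matches `SMBLB10` through `SMBLB19`, so with ids like these the validation and test segments are larger than the training segment.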


### Step 3. Define how to retrieve features, label, mask <a class="anchor" id="1.4"></a>

In [17]:
pd.DataFrame(c3.SmartBulb.listMetrics().toJson())

Unnamed: 0,type,name,expression,meta,id,version,srcType,path,variables,tsDecl
0,SimpleMetric,AverageLumens,avg(avg(normalized.data.lumens)),"{'type': 'Meta', 'tenantTagId': 152, 'tenant':...",AverageLumens_SmartBulb,1,"{'type': 'TypeRef', 'typeName': 'SmartBulb'}",bulbMeasurements,,
1,SimpleMetric,AveragePower,avg(avg(normalized.data.power)),"{'type': 'Meta', 'tenantTagId': 152, 'tenant':...",AveragePower_SmartBulb,1,"{'type': 'TypeRef', 'typeName': 'SmartBulb'}",bulbMeasurements,,
2,SimpleMetric,AverageTemperature,avg(avg(normalized.data.temperature)),"{'type': 'Meta', 'tenantTagId': 152, 'tenant':...",AverageTemperature_SmartBulb,1,"{'type': 'TypeRef', 'typeName': 'SmartBulb'}",bulbMeasurements,,
3,SimpleMetric,AverageVoltage,avg(avg(normalized.data.voltage)),"{'type': 'Meta', 'tenantTagId': 152, 'tenant':...",AverageVoltage_SmartBulb,1,"{'type': 'TypeRef', 'typeName': 'SmartBulb'}",bulbMeasurements,,
4,CompoundMetric,DayOfWeek,"timeComponent('DAYOFWEEK', start())","{'type': 'Meta', 'fetchInclude': '[id,name,uni...",DayOfWeek,1,,,,
5,CompoundMetric,DayOfWeekDemo,"timeComponent('DAYOFWEEK', start())","{'type': 'Meta', 'fetchInclude': '[id,name,uni...",DayOfWeekDemo,1,,,,
6,CompoundMetric,DayOfYear,"timeComponent('DAYOFYEAR', start())","{'type': 'Meta', 'fetchInclude': '[id,name,uni...",DayOfYear,1,,,,
7,CompoundMetric,DaysInYear,"timeComponent('YEAR', start()) % 4 != 0 || (ti...","{'type': 'Meta', 'fetchInclude': '[id,name,uni...",DaysInYear,1,,,,
8,CompoundMetric,DurationOnInHours,"rolling('SUM',sum(eval('HOUR', Status)))","{'type': 'Meta', 'fetchInclude': '[id,name,uni...",DurationOnInHours,1,,,,
9,CompoundMetric,HasEverFailed,"rolling('SUM',IsDefective) ? 1 : 0","{'type': 'Meta', 'fetchInclude': '[id,name,uni...",HasEverFailed,1,,,,


Construct an `EvalMetricsDatasetMLDataSourceSpec`, which specifies how to build the dataset for model training.

In [18]:
# List of C3 metric (SimpleMetric or Compound) names
features = [
                "AverageTemperature",
                "AveragePower",
]

# We will use this to discard data AFTER a bulb has failed
mask = 'HasEverFailed'
label = 'WillFailNextMonth'
train_start_date = "2016-01-01" # datetime string for start of training period
train_end_date = "2021-01-01" # datetime string for end of training period
time_series_interval = "DAY" # string specifying interval of data (See Interval type for more info)

source_spec = c3.EvalMetricsDatasetMLDataSourceSpec(
    name="training_smartbulb",
    srcType=subject_type_name,
    features=features,
    maskMetric=mask,
    target=label,
    start=train_start_date,
    end=train_end_date,
    interval=time_series_interval
).upsert()
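
To make the role of the mask concrete, here is a minimal plain-Python sketch of the idea behind `HasEverFailed`, whose expression is `rolling('SUM',IsDefective) ? 1 : 0`. The daily values are hypothetical, and the exact way the platform applies `maskMetric` (for example, whether the failure day itself is kept) is an assumption here.

```python
from itertools import accumulate

# Hypothetical daily values for a single bulb; 1 marks the day a defect is recorded.
is_defective = [0, 0, 0, 1, 0, 0]
avg_temperature = [21.0, 21.5, 22.1, 35.4, 0.0, 0.0]

# Running total of defects, collapsed to 0/1 (the "? 1 : 0" part of the expression).
has_ever_failed = [1 if total > 0 else 0 for total in accumulate(is_defective)]

# Assumption: rows are kept only while the mask is still 0.
kept_rows = [
    temp for temp, failed in zip(avg_temperature, has_ever_failed) if failed == 0
]

print(has_ever_failed)  # prints: [0, 0, 0, 1, 1, 1]
print(kept_rows)        # prints: [21.0, 21.5, 22.1]
```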


Here we use the `EvalMetricsDatasetMLDataSourceSpec` defined above to generate the training, validation and test datasets in a distributed fashion. The features are evaluated and persisted on the platform, and the client receives only a reference to each dataset. We use these references for the remainder of the experiment and never pull the data into our relatively memory-constrained Jupyter client.

This allows us to easily scale to very large datasets limited only by filesystem storage! 

In [19]:
training_dataset_generation_job = source_spec.inputAndTargetDataByReference(sourceFilter=training_segment.get().subjectFilter)

validation_dataset_generation_job = source_spec.inputAndTargetDataByReference(sourceFilter=validation_segment.get().subjectFilter)

test_dataset_generation_job = source_spec.inputAndTargetDataByReference(sourceFilter=test_segment.get().subjectFilter)



AttributeError: 'EvalMetricsDatasetMLDataSourceSpec' object has no attribute 'inputAndTargetDataByReference'

We wait until the dataset generation jobs are complete:

**Note** - `fst` and `snd` represent the jobs for features and target generation respectively

In [None]:
training_dataset_generation_job.fst.waitForCompletion()
validation_dataset_generation_job.fst.waitForCompletion()
test_dataset_generation_job.fst.waitForCompletion()
training_dataset_generation_job.snd.waitForCompletion()
validation_dataset_generation_job.snd.waitForCompletion()
test_dataset_generation_job.snd.waitForCompletion()

Now we can easily construct the X's and y's for our supervised learning problem and get a handle to these generated datasets without pulling the data into our Jupyter client and materializing them:

**Note** - We are accessing the `default` collection since we haven't declared a grouping field for our dataset generation job. We can optionally specify this grouping field and only get a handle to a dataset that is grouped by a certain field.

In [None]:
X_train_dataset_ref = training_dataset_generation_job.fst.results()["default"][0]
y_train_dataset_ref = training_dataset_generation_job.snd.results()["default"][0]

X_val_dataset_ref = validation_dataset_generation_job.fst.results()["default"][0]
y_val_dataset_ref = validation_dataset_generation_job.snd.results()["default"][0]

X_test_dataset_ref = test_dataset_generation_job.fst.results()["default"][0]
y_test_dataset_ref = test_dataset_generation_job.snd.results()["default"][0]

**Note** the structure of the `Dataset` object. It's just a reference to the persisted features!

In [None]:
X_train_dataset_ref

### Step 4. Define Machine Learning Pipeline <a class="anchor" id="1.5"></a>

Define the machine learning pipeline that we wish to tune:

In [None]:
standardScaler = c3.SklearnPipe(
                    name="standardScaler",
                    technique=c3.SklearnTechnique(
                        # This tells ML pipeline to import sklearn.preprocessing.StandardScaler.
                        name="preprocessing.StandardScaler",
                        # This tells ML pipeline to call "transform" method on sklearn.preprocessing.StandardScaler when we invoke the C3 action process() later.
                        processingFunctionName="transform"
                    )
                 )


preprocess_pipeline = c3.MLSerialPipeline(
                        name="preprocessPipeline",
                        steps=[c3.MLStep(name="standardScaler",
                                         pipe=standardScaler),
                              ]
)


classifier = c3.SklearnPipe(
                    name="rfc",
                    technique=c3.SklearnTechnique(
                        name="ensemble.RandomForestClassifier",
                        processingFunctionName="predict",
                    )
)


serialPipeline = c3.MLSerialPipeline(
                name="randomForestPipeline2",
                steps=[c3.MLStep(name="preprocess",
                                 pipe=preprocess_pipeline),
                       c3.MLStep(name="classifier",
                                 pipe=classifier)],
                scoringMetrics=c3.MLScoringMetric.toScoringMetricMap(scoringMetricList=[c3.MLAccuracyMetric(), c3.MLPrecisionMetric()])
             ).upsert()


Or retrieve a previously upserted pipeline:

In [None]:
# # If you would like to use an MLPipeline you created earlier, e.g. in the previous module, you would fetch the
# # id and use it here

# serialPipeline = c3.MLSerialPipeline.get('unique_id_of_previously_upserted_pipeline')

In [None]:
serialPipeline

### Step 5. Define Hyperparameter Search Space <a class="anchor" id="1.6"></a>

Note the syntax for specifying hyperparameter space to search over:

`MLStepName__hyperparameter`

In [None]:
hyperparam_space = {
                        "classifier__n_estimators": c3.IntegerRangeParamSpace(start=500, stop=1000, stepSize=100),
                        "classifier__max_depth": c3.IntegerRangeParamSpace(start=2, stop=5, stepSize=1)
                    }
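
The `MLStepName__hyperparameter` convention can be read as "route this value to the named step". The sketch below only illustrates that reading in plain Python. The helper `group_by_step` and the candidate values are hypothetical, and this is not how the platform itself parses the keys.

```python
from collections import defaultdict

# One hypothetical candidate drawn from the search space above.
candidate = {"classifier__n_estimators": 500, "classifier__max_depth": 3}

def group_by_step(params: dict) -> dict:
    """Split 'step__param' keys into {step: {param: value}}."""
    grouped = defaultdict(dict)
    for key, value in params.items():
        # partition splits on the first "__" only, so the part before it is the step name.
        step_name, _, param_name = key.partition("__")
        grouped[step_name][param_name] = value
    return dict(grouped)

print(group_by_step(candidate))
# prints: {'classifier': {'n_estimators': 500, 'max_depth': 3}}
```

The step name (`classifier`) must match the `name` given to the `MLStep` in the pipeline definition.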

### Step 6. Define Scoring Metric <a class="anchor" id="1.7"></a>

Define the scoring metric to be used for validation:

In [None]:
scoring_metric = c3.MLF1ScoreMetric()
scoring_metric.category()

### Step 7. Define the Validation Technique <a class="anchor" id="1.8"></a>

Types of supported cross-validation techniques:

In [None]:
c3.MetadataStore.tag().typesThatMixin("MLValidationTechnique")

We will use `MLCustomHoldoutTechnique`, which validates each candidate against the held-out validation datasets created earlier:

In [None]:
val_technique = c3.MLCustomHoldoutTechnique(
    customInputValidate=X_val_dataset_ref,
    customTargetOutputValidate=y_val_dataset_ref,
    validationMetric="f1_score", 
    scoringMetrics={
        "f1_score": scoring_metric
    },
)

### Step 8. Define the Hyperparameter Search Technique <a class="anchor" id="1.9"></a>

We will use a simple `Grid` search algorithm, which can be parallelized across the C3 cluster. Try to find out what other algorithms we support!

In [None]:
hyp_technique = c3.MLHyperParamSearchTechniqueChocolate(
    algorithm="Grid"
)
hyp_technique.isSerial()

### Step 9. Define the Execution Criteria <a class="anchor" id="1.10"></a>

In [None]:
exec_spec = c3.MLHyperParamExecSpec(
    keepAllTrainedPipes=True,
    checkAutoEarlyStop=True,
    targetScore=0.97
)
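
One plausible reading of `checkAutoEarlyStop=True` with `targetScore=0.97` is "stop searching as soon as a candidate reaches the target". The sketch below illustrates that reading in plain Python. The function `search_with_target` and the toy scores are hypothetical, and the platform's actual early-stopping semantics may differ.

```python
def search_with_target(candidates, evaluate, target_score):
    """Evaluate candidates in order; stop once the best score reaches the target."""
    best_params, best_score = None, float("-inf")
    evaluated = 0
    for params in candidates:
        score = evaluate(params)
        evaluated += 1
        if score > best_score:
            best_params, best_score = params, score
        if best_score >= target_score:
            break  # target reached, skip the remaining candidates
    return best_params, best_score, evaluated

# Toy validation scores keyed by max_depth (made-up numbers).
toy_scores = {2: 0.91, 3: 0.95, 4: 0.98, 5: 0.96}
best, score, n_evaluated = search_with_target([2, 3, 4, 5], toy_scores.get, 0.97)
print(best, score, n_evaluated)  # prints: 4 0.98 3
```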

### Step 10. Put it all together in MLAutoTunerSearchSpec <a class="anchor" id="1.11"></a>

**Note** - You should change the `maxIterations` field here to account for your search space

In [None]:
search_spec = c3.MLAutoTunerSearchSpec(
    validationTechnique=val_technique,
    hyperParamSearchTechnique=hyp_technique,
    executionSpec=exec_spec,
    maxConcurrentIterations=100,
    maxIterations=1,
    maxSearchTime=600,
    refit=True
)
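
To pick a sensible `maxIterations`, it helps to count the points in the grid from Step 5. The sketch below does so in plain Python. Whether `stop` in `IntegerRangeParamSpace` is inclusive is an assumption either way, so both counts are shown.

```python
from itertools import product

def grid_values(start, stop, step, inclusive):
    """Integer grid from start to stop; stop is included only if inclusive is True."""
    end = stop + 1 if inclusive else stop
    return list(range(start, end, step))

def grid_size(inclusive):
    # Same ranges as the hyperparameter space defined in Step 5.
    n_estimators = grid_values(500, 1000, 100, inclusive)
    max_depth = grid_values(2, 5, 1, inclusive)
    return len(list(product(n_estimators, max_depth)))

print(grid_size(inclusive=False), grid_size(inclusive=True))
# prints: 15 24
```

With `maxIterations=1`, as in the spec above, only a single point of this grid would be evaluated.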

### Step 11. Submit a MLAutoTuner search job <a class="anchor" id="1.12"></a>

In [None]:
auto_tuner = c3.MLAutoTuner(name="MyUniqueMLAutoTunerName").search(
                            # mlProject=project,
                            pipe=serialPipeline.get(),
                            hyperparameterSpace=hyperparam_space,
                            input=X_train_dataset_ref,
                            targetOutput=y_train_dataset_ref,
                            spec=search_spec
)

In [None]:
auto_tuner.get('MyUniqueMLAutoTunerName')

In [None]:
import time

import pandas as pd
from IPython.display import clear_output, display

def check_auto_tuner_status(auto_tuner):
    """Show search progress and return True once the search has completed."""
    clear_output()
    # Fetch the completed iterations once and reuse them below.
    completed_iterations = auto_tuner.get("completedIterations").completedIterations
    if completed_iterations is None:
        display(f"Status: {auto_tuner.state()}")
    else:
        # First validation fold of each iteration, annotated with its hyperparameters.
        results = [i["validationFoldsResults"][0] for i in completed_iterations.toJson()]
        for iteration, res in zip(completed_iterations, results):
            res["HP"] = iteration.hyperParameters.toJson()
        df = pd.json_normalize(results)
        df["timeElapsed"] = [iteration.endTimestamp - iteration.startTimestamp for iteration in completed_iterations]
        display(f"Status: {auto_tuner.state()}")
        display(df[[col for col in df.columns if "Scores" in col or "HP" in col] + ["timeElapsed"]])
        if auto_tuner.state() == "completed":
            display("********** BEST MODEL RESULT RETRAINED ON TRAIN + VAL DATASETS **********")
            display(auto_tuner.bestResult())
            return True
    return False

# Poll every 20 seconds until the search reports completion.
while True:
    if check_auto_tuner_status(auto_tuner):
        break
    time.sleep(20)
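
The polling loop above runs until the search completes, so it never returns if the job stalls. A bounded variant is sketched below. The helper `wait_until` and its parameters are hypothetical and not part of the C3 API, and a fake status function stands in for `check_auto_tuner_status` so the example runs anywhere.

```python
import time

def wait_until(is_done, poll_seconds=20.0, timeout_seconds=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call is_done() repeatedly; return True when it succeeds, False on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        if is_done():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)

# Stand-in for the real status check: reports completion on the third poll.
polls = []
def fake_status_check():
    polls.append(1)
    return len(polls) >= 3

finished = wait_until(fake_status_check, poll_seconds=0.0, timeout_seconds=5.0)
print(finished, len(polls))  # prints: True 3
```

In the notebook, the call would look like `wait_until(lambda: check_auto_tuner_status(auto_tuner))`, followed by a look at `auto_tuner.errors()` if it returns `False`.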

In [None]:
auto_tuner.errors()

### Step 12. Score and Analyze Results <a class="anchor" id="1.13"></a>

We can retrieve the best fit model that has been retrained on both the training and validation set:

In [None]:
retrievedPipeline = auto_tuner.get().retrainedPipe.get()

In [None]:
train_score = retrievedPipeline.score(input=X_train_dataset_ref, targetOutput=y_train_dataset_ref)
test_score = retrievedPipeline.score(input=X_test_dataset_ref, targetOutput=y_test_dataset_ref)
print(f'train score = {train_score}')
print(f'test score = {test_score}')

We can generate predictions on the platform. Note that the returned predictions are C3 `Dataset` objects.

In [None]:
y_pred_train = retrievedPipeline.process(input=X_train_dataset_ref)
y_pred_test = retrievedPipeline.process(input=X_test_dataset_ref)

We convert the predictions back to pandas DataFrames.

In [None]:
y_pred_train_df = c3.Dataset.toPandas(y_pred_train)
y_pred_test_df = c3.Dataset.toPandas(y_pred_test)

Now we declare a helper function that plots the precision-recall curves for the test and training sets and returns the average precision of each.

In [None]:
from matplotlib import pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

def plot_prec(y_pred_test, y_true_test, y_pred_train, y_true_train, fp=None, save=False):
    """Plot precision-recall curves for the test and training sets.

    Returns the average precision on the test set and on the training set.
    Pass save=True together with a file path fp to also write the figure to disk.
    """
    lw = 1
    plt.figure(figsize=[6, 6])

    # Scores are passed positionally because the keyword for them was renamed
    # (probas_pred -> y_score) in newer scikit-learn releases.
    prec, rec, _ = precision_recall_curve(y_true_test, y_pred_test)
    avg_prec_test = average_precision_score(y_true=y_true_test, y_score=y_pred_test)
    plt.plot(rec, prec, color='darkorange',
             lw=lw, label='Precision-Recall curve (average precision = %0.2f), Testing' % avg_prec_test)

    prec, rec, _ = precision_recall_curve(y_true_train, y_pred_train)
    avg_prec_train = average_precision_score(y_true=y_true_train, y_score=y_pred_train)
    plt.plot(rec, prec, color='blue',
             lw=lw, label='Precision-Recall curve (average precision = %0.2f), Training' % avg_prec_train)

    plt.xlim([0.0, 1.0])
    plt.ylim([0.0, 1.05])
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.title('Precision-Recall curve')
    plt.legend(loc="lower right")
    if save:
        plt.savefig(fp)
    plt.show()
    return avg_prec_test, avg_prec_train


We can use the helper function to plot the precision-recall curve for the training and test sets. First we pull the true labels into pandas:

In [None]:
y_true_train_df = c3.Dataset.toPandas(y_train_dataset_ref)
y_true_test_df = c3.Dataset.toPandas(y_test_dataset_ref)

In [None]:
test_avg_prec, train_avg_prec = plot_prec(y_pred_test=y_pred_test_df,
                                          y_true_test=y_true_test_df,
                                          y_pred_train=y_pred_train_df,
                                          y_true_train=y_true_train_df)

In [None]:
print(f'Average Precision Train: {train_avg_prec}')
print(f'Average Precision Test: {test_avg_prec}')
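
Average precision summarizes the precision-recall curve as a weighted mean of the precision at each threshold, where each weight is the gain in recall at that threshold. The plain-Python sketch below computes it for the simple case of scores without ties. It is meant as an illustration of the quantity, not as a replacement for `average_precision_score`, and the labels and scores are made up.

```python
def average_precision(y_true, y_score):
    """Average precision for binary labels, assuming no tied scores."""
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Recall only increases at positives, so only these ranks contribute.
            precision_sum += hits / rank
    return precision_sum / total_positives

# Made-up example: positives ranked first and third.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(round(average_precision(labels, scores), 4))  # prints: 0.8333
```

Because the classifier step in this notebook uses `predict`, the predictions are hard class labels rather than probabilities, so the resulting curve has only a few points. A score-producing function would give a smoother curve.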

### Step 13. Create MLModel for Deployment <a class="anchor" id="1.14"></a>

Now, we iterate over the above steps until we're satisfied with the generalizability of our machine learning pipeline. 

Once you're satisfied with the results and your model achieves a good F1 score, you can uncomment and execute the following cell to create an `MLModel` from the trained and tuned pipeline, which is then ready to be deployed directly in production using the [C3 AI Model Deployment Framework](https://developer.c3.ai/docs/7.29.0/topic/modelDeployment)!

In [None]:
# ml_model = c3.MLModel.createFromPipeline(
#     pipeline=retrievedPipeline.get(), 
#     trainingDataSourceSpec=source_spec.get(),
#     spec=c3.MLModelCreateSpec(
#         predictionDataSourceSpec=source_spec.get(),
#         # mlProject=project # for the DS course, this line should stay commented out;
#                             # you would use it to attach your model experimentation to an existing ML Project for traceability
#     )
# ).upsert(spec=c3.UpsertSpec(returnInclude="this"))


### Get the id of your created MLModel in this cell. You will deploy it in the Capstone notebook and see how your model performs against a "LIVE" model.

In [None]:
# ml_model

ml_model.id

In [11]:
help(c3.EvalMetricsDatasetMLDataSourceSpec)

In [20]:
help(c3.MLPopulationSegment)