<center><h1>PART 2: TRAIN A MODEL ON AZURE MACHINE LEARNING SERVICE</h1></center>
<br>

In this notebook, we will train a non-trivial machine learning model using Azure Machine Learning Service (AMLS).

#### ABOUT THE MODEL & DATA

Using data from Taarifa and the Tanzanian Ministry of Water, we will predict which pumps are functional, which need some repairs, and which don't work at all. The labels encompass three classes, and the training data is based on a number of variables describing what kind of pump is operating, when it was installed, and how it is managed. A good understanding of which waterpoints will fail can improve maintenance operations and ensure that clean, potable water is available to communities across Tanzania. This competition is hosted on [DrivenData](https://www.drivendata.org/competitions/7/pump-it-up-data-mining-the-water-table/).

#### PYTHON DEPENDENCIES
This notebook was developed with the following packages:
```
azureml-sdk[databricks]
category-encoders==1.3.0
numpy==1.15.0
pandas==0.24.1
scikit-learn==0.20.2

```

#### APPROACH
We will use a Machine Learning Compute resource on AMLS to train the model. This is simply a VM. Training a model on Azure Machine Learning Service involves the following steps:
1. Configure the development environment  
2. Upload your data to blob storage
3. Create a training script called `train.py`
4. Submit the job
5. Register the final model

Resource: [How Azure Machine Learning service works: Architecture and concepts](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace)

In [2]:
from __future__ import print_function 
import os
import numpy as np
import pandas as pd
from sklearn.externals import joblib
import azureml.core
from azureml.core import Workspace
from azureml.core.model import Model
from azureml.core.compute import AmlCompute, AksCompute, ComputeTarget
from azureml.train.estimator import Estimator
from azureml.train.hyperdrive import GridParameterSampling, BanditPolicy, choice
from azureml.train.hyperdrive import HyperDriveRunConfig, PrimaryMetricGoal

# set Pandas display options
pd.options.display.max_columns = None

# check Azure SDK version
print("Azure ML SDK Version: ", azureml.core.VERSION)

#### 2.1 CONFIGURE THE DEVELOPMENT ENVIRONMENT
Configuring the development environment involves setting up an Azure ML Workspace and connecting to remote storage and compute resources.

For local disk storage during development, we will use the Databricks FileStore folder. `/FileStore` is a special folder within DBFS where you can save files and from which you can download files to your local machine via a browser.  
Use the Databricks Data menu/UI to upload `pumps_data.csv` and `new_pumps_data.csv` to the `/FileStore/tables/pumps` directory (skip this step if you already uploaded the data in Part 1).
```
/FileStore
  ├── tables                     -> Databricks by default stores data here
  │   └──pumps                   -> we will create this project specific folder
  │      ├── new_pumps_data.csv  -> scoring dataset with no labels
  │      └── pumps_data.csv      -> training dataset with labels
  └── users/jason/pumps          -> we will create this folder as our project root folder
      ├───models                 -> store all pickle files
      │    ├──  local            -> pickle files created by training locally in the notebook
      │    ├──  rf.pkl           -> Random Forest estimator trained on AMLS
      │    ├──  le.pkl           -> Preprocessing transformer trained on AMLS
      │    ├──  ohc.pkl          -> Preprocessing transformer trained on AMLS
      │    ├──  y_le.pkl         -> Preprocessing transformer trained on AMLS
      └── scripts                -> scripts such as train.py and score.py for AMLS
```

Let's create a `Config` class to hold all the pertinent configurations and storage locations. 

Resource: [Configure a development environment for Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-configure-environment)

In [4]:
class Config(object):

    # define Azure ML Workspace configuration variables---------
    SUBSCRIPTION_ID = 'your_key_here'
    RESOURCE_GROUP = 'ML_SANDBOX'
    WORKSPACE_NAME = 'ML_SANDBOX'
    WORKSPACE_REGION = 'East US 2'
    
    # setup Azure Machine Learning Service compute for training
    TRAIN_COMPUTE = 'dev-vm'
    
    # Kubernetes cluster for deployment
    DEPLOY_COMPUTE = 'dev-cluster'
    IMAGE_NAME = 'pumps_rf_image'
    AKS_SERVICE_NAME = 'pumps-aks-service-1'
    
    # define DBFS paths for sub-directories---------------------
    # dbutils expects file paths without the '/dbfs' prefix, so this
    # variable is used mainly with dbutils functions.
    PROJECT_DIR = '/FileStore/users/jason/pumps' 
    
    # for Python to understand filepaths, you need to prefix '/dbfs'
    MODELS_DIR = '/dbfs'+PROJECT_DIR+'/models'
    SCRIPTS_DIR = '/dbfs'+PROJECT_DIR+'/scripts'
    
    # set location for uploading data
    # default location for data is /FileStore/tables but we will use a pumps sub-directory
    DATA_DIR = '/dbfs/FileStore/tables/pumps'  

Once the directories are created, use the Databricks Data menu/UI to upload `pumps_data.csv` and `new_pumps_data.csv` to the `/FileStore/tables/pumps` directory.
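The two path conventions in `Config` are easy to mix up. As a minimal sketch, the hypothetical helpers below (`to_local_path` and `to_dbfs_path` are not part of `dbutils` or any Databricks API) convert between the DBFS-style path that `dbutils` expects and the `/dbfs`-prefixed path that ordinary Python file functions expect:

```python
DBFS_MOUNT = '/dbfs'  # local mount point prefix, as used in Config


def to_local_path(dbfs_path):
    """Prefix a DBFS path with '/dbfs' for use with Python file APIs."""
    if dbfs_path.startswith(DBFS_MOUNT + '/'):
        return dbfs_path  # already a local-style path
    return DBFS_MOUNT + '/' + dbfs_path.lstrip('/')


def to_dbfs_path(local_path):
    """Strip the '/dbfs' prefix for use with dbutils functions."""
    if local_path.startswith(DBFS_MOUNT + '/'):
        return local_path[len(DBFS_MOUNT):]
    return local_path


project_dir = '/FileStore/users/jason/pumps'
print(to_local_path(project_dir + '/models'))
print(to_dbfs_path('/dbfs/FileStore/tables/pumps'))
```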

In [6]:
# create the project directories in FileStore if they do not already exist
dbutils.fs.mkdirs(Config.PROJECT_DIR+'/models')
dbutils.fs.mkdirs(Config.PROJECT_DIR+'/scripts')
# verify
dbutils.fs.ls(Config.PROJECT_DIR)

The Azure Machine Learning Workspace provides a centralized place to work with all the artifacts you create when you use Azure Machine Learning service.

In [8]:
# connect to workspace
try:
    ws = Workspace(subscription_id = Config.SUBSCRIPTION_ID, resource_group = Config.RESOURCE_GROUP, workspace_name = Config.WORKSPACE_NAME)
    # write the workspace details to a configuration file for later reuse
    ws.write_config()
    print("Workspace configuration succeeded.")
except Exception as err:
    print("Workspace not accessible:", err)
    # re-raise so that later cells do not fail on an undefined `ws`
    raise
    
# set up experiment
experiment_name = 'pumps-exp1'
from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

In [9]:
# attach compute cluster
if Config.TRAIN_COMPUTE in ws.compute_targets:
    compute_target = ws.compute_targets[Config.TRAIN_COMPUTE]
    if compute_target and isinstance(compute_target, AmlCompute):
        print('Compute Target Found: ' + Config.TRAIN_COMPUTE)
else:
    print('No cluster found')

#### 2.2 UPLOAD DATA TO BLOB STORAGE
Data has to be uploaded to the Blob storage container linked to Azure Machine Learning Service so that it is accessible to the remote compute clusters. The pumps files are uploaded into a directory named `pumps` at the root of the datastore.
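To make the resulting layout concrete, here is a small sketch of the path mapping, assuming the upload keeps each file's path relative to `src_dir` and places it under `target_path`. The helper `planned_blob_paths` is hypothetical and only computes names locally; it does not call the Azure ML SDK.

```python
import os
import posixpath
import tempfile


def planned_blob_paths(src_dir, target_path):
    """Return the datastore-relative paths expected for files under src_dir."""
    planned = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        for name in files:
            parts = [target_path]
            if rel_root != '.':
                parts.extend(rel_root.split(os.sep))
            parts.append(name)
            planned.append(posixpath.join(*parts))
    return sorted(planned)


# a temporary folder stands in for Config.DATA_DIR
with tempfile.TemporaryDirectory() as tmp:
    for name in ('pumps_data.csv', 'new_pumps_data.csv'):
        open(os.path.join(tmp, name), 'w').close()
    blob_paths = planned_blob_paths(tmp, 'pumps')

print(blob_paths)
```

This matches how `train.py` later locates the training file: it joins the mounted data folder with `pumps` and then with `pumps_data.csv`.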

In [11]:
# upload data to Azure Machine Learning Service datastore
ds = ws.get_default_datastore()
print('Datastore Type : '+ds.datastore_type)
print('Account Name   : '+ds.account_name)
print('Container Name : '+ds.container_name)

ds.upload(src_dir = Config.DATA_DIR, 
          target_path = 'pumps', 
          overwrite = True, 
          show_progress = True)

#### 2.3 CREATE A TRAINING SCRIPT
Training on a remote cluster involves creating a `train.py` script along with pointers to the data location and any other objects such as estimators. When using `%%writefile`, you need to prefix `/dbfs` to the filesystem path.
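The data location and the hyperparameters reach the script as command-line flags. The sketch below is a trimmed illustration of that pattern, not the full script: it passes an explicit argument list to `parse_args` so that it can run inside a notebook, and the mount path `/mnt/datastore` is a made-up placeholder.

```python
import argparse
import os


def build_parser():
    parser = argparse.ArgumentParser()
    # dest maps the hyphenated flag to a valid attribute name
    parser.add_argument('--data-folder', type=str, dest='data_folder')
    parser.add_argument('--n_estimators', type=int, default=100)
    parser.add_argument('--criterion', type=str, default='entropy')
    return parser


# simulate the arguments a remote run would receive
args = build_parser().parse_args(
    ['--data-folder', '/mnt/datastore', '--n_estimators', '120'])
csv_path = os.path.join(args.data_folder, 'pumps', 'pumps_data.csv')

print(args.n_estimators, args.criterion)
print(csv_path)
```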

In [13]:
%%writefile /dbfs/FileStore/users/jason/pumps/scripts/train.py

import os
import argparse
import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.externals import joblib

from azureml.core import Run
# get hold of the current run
run_logger = Run.get_context()

###########################################################################################
# HELPER FUNCTIONS
###########################################################################################

def create_dataframe(x):
    """
    Imports the pumps csv data file directly from AML blob storage.

    :param x: full path to a csv file
              e.g: '/dbfs/mnt/jason/pumps/data/pumps_data.csv'
    :return: two dataframes that split data and labels
    """
    # import raw data
    raw = pd.read_csv(x, index_col=0)
    labels = pd.DataFrame(raw['status_group'])
    data = raw.drop('status_group', axis=1)
    
    return data, labels

def print_nans(df):

    print('Checking for NANs:............................')
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    mis_val_table_ren_columns = mis_val_table.rename(
        columns={0: 'Missing Values', 1: '% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:, 1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
    print("Your selected dataframe has " +
          str(df.shape[1]) +
          " columns and " +
          str(len(df)) +
          " rows \n" + "There are " +
          str(mis_val_table_ren_columns.shape[0]) +
          " columns that have missing values.")
    print('..............................................')
    return mis_val_table_ren_columns


def data_frame_imputer(df):
    fill = pd.Series([df[c].value_counts().index[0]
                      if df[c].dtype == np.dtype('O') else df[c].mean() for c in df],
                     index=df.columns)
    return df.fillna(fill)


def replace_with_grouped_mean(df, value, column, to_groupby):

    invalid_mask = (df[column] == value)

    # get the mean without the invalid value
    means_by_group = (df[~invalid_mask].groupby(to_groupby)[column].mean())

    # get an array of the means for all of the data
    means_array = means_by_group[df[to_groupby].values].values

    # assign the invalid values to means
    df.loc[invalid_mask, column] = means_array[invalid_mask]

    return df


def log_transformer(df, base, c=1):

    if base == 'e' or base == np.e:
        log = np.log

    elif base == '10' or base == 10:
        log = np.log10

    else:
        def log(x): return np.log(x) / np.log(base)

    # apply log(x + c) element-wise to every column
    return df.apply(lambda x: log(x + c))


def stratified_split(x, y, test_size):

    from sklearn.model_selection import StratifiedShuffleSplit

    sss = StratifiedShuffleSplit(n_splits=10, test_size=test_size, random_state=5)
    sss.get_n_splits(x, y)
    data_train = pd.DataFrame()
    data_test = pd.DataFrame()
    label_train = pd.DataFrame()
    label_test = pd.DataFrame()
    for train_index, test_index in sss.split(x, y):
        data_train, data_test = x.iloc[train_index], x.iloc[test_index]
        label_train, label_test = y.iloc[train_index], y.iloc[test_index]
    return data_train, data_test, label_train, label_test
  
###########################################################################################
# PREPROCESSING
###########################################################################################
  
def clean_data(x, y):
    """
    Takes the pumps data and label dataframe and cleans it

    :param x: the pumps dataframe
    :param y: the pumps labels dataframe
    :return:  stratified splits for train and test
    """
  
    useful_columns = ['amount_tsh',
                      'gps_height',
                      'longitude',
                      'latitude',
                      'region',
                      'population',
                      'construction_year',
                      'extraction_type_class',
                      'management_group',
                      'quality_group',
                      'source_type',
                      'waterpoint_type']

    # subset to columns we care about; copy so we do not modify a view of the input
    x = x[useful_columns].copy()

    # for column construction_year, values < 1000 are probably bad
    invalid_rows = x['construction_year'] < 1000
    valid_mean = int(x.construction_year[~invalid_rows].mean())
    x.loc[invalid_rows, "construction_year"] = valid_mean

    # in some columns 0 is an invalid value
    x = replace_with_grouped_mean(df=x, value=0, column='longitude', to_groupby='region')
    x = replace_with_grouped_mean(df=x, value=0, column='population', to_groupby='region')

    # for latitude, the value -2e-8 is treated as invalid
    x = replace_with_grouped_mean(df=x, value=-2e-8, column='latitude', to_groupby='region')

    # replace zero values in amount_tsh
    x = replace_with_grouped_mean(df=x, value=0, column='amount_tsh', to_groupby='region')

    # impute any remaining missing values
    x = data_frame_imputer(df=x)

    # print nans in the dataframe if any
    print_nans(x)

    # log transform numerical columns
    num_cols = ['amount_tsh', 'population']
    x[num_cols] = log_transformer(df=x[num_cols], base='e', c=1)

    # do train/test split
    x_train, x_test, y_train, y_test = stratified_split(x=x, y=y, test_size=0.2)

    return x_train, x_test, y_train, y_test
  
  
def train_pre_processing(x, y):

    """
    Preprocesses the pumps train datasets by applying label
    and one-hot encoding

    :param x: the pumps x_train dataset
    :param y: the pumps y_train dataset
    :return: encoded datasets and the fitted transformers
    """

    # transform categorical variables with encoders
    le_cols = ['region']
    ohc_cols = ['extraction_type_class',
                'management_group',
                'quality_group',
                'source_type',
                'waterpoint_type']

    # define encoders include label encoding for the actual labels
    # using handle_unknown='ignore' will leave out new unseen values so keep
    # monitoring your data for changes
    
    le = ce.OrdinalEncoder(cols=le_cols,
                           return_df=True,
                           handle_unknown='ignore')

    ohc = ce.OneHotEncoder(cols=ohc_cols,
                           return_df=True,
                           use_cat_names=False,
                           handle_unknown='ignore')

    y_le = ce.OrdinalEncoder(return_df=True,
                             handle_unknown='ignore')

    print("x_train shape: ", x.shape)
    print("y_train shape: ", y.shape)
    # apply the encoders
    print("Running label and one-hot encoding on the train data...")
    x = le.fit_transform(x)
    x = ohc.fit_transform(x)
    y = y_le.fit_transform(y)
    # the fitted encoders are returned below so they can be reused on the test data
    print("Final x_train shape: ", x.shape)
    print("Final y_train shape: ", y.shape)
    print("done.")
    
    return x, y, le, ohc, y_le


def test_pre_processing(x, y, le, ohc, y_le):

    """
    Preprocesses the pumps test datasets by applying the fitted label
    and one-hot encoding transformers to the test/validation datasets

    :param x: the x_test dataset
    :param y: the y_test dataset
    :param le: the label encoder fitted from the train_pre_processing() function
    :param ohc: the one-hot encoder fitted from the train_pre_processing() function
    :param y_le: the y label encoder fitted from the train_pre_processing() function
    :return: encoded x_test and y_test
    """
  
    print("x_test shape: ", x.shape)
    print("y_test shape: ", y.shape)
    print("Running label and one-hot encoding on the test data...")
    x = le.transform(x)
    x = ohc.transform(x)
    y = y_le.transform(y)
    print("New x_test shape: ", x.shape)
    print("New y_test shape: ", y.shape)
    print("done.")

    return x, y
  
  
###########################################################################################
# TRAINING
###########################################################################################

def train_and_evaluate(x, n_estimators=100, criterion='entropy', class_weight='balanced_subsample'):
    """
    A full pipeline that cleans, preprocesses data and then fits a
    random forest classifier

    :param x              : full path to the pumps csv file on the mounted datastore
                            e.g: os.path.join(args.data_folder, 'pumps', 'pumps_data.csv')
    :param n_estimators   : random forest parameter for the number of trees
    :param criterion      : random forest parameter for the split quality measure
    :param class_weight   : random forest parameter for weighting the classes
    :return               : the fitted estimator, the test accuracy and the fitted transformers
    """

    # ingest and process data
    data, labels = create_dataframe(x=x)
    x_train, x_test, y_train, y_test = clean_data(x=data, y=labels)
    x_train, y_train, le, ohc, y_le = train_pre_processing(x=x_train, y=y_train)
    x_test, y_test = test_pre_processing(x=x_test, y=y_test, le=le, ohc=ohc, y_le=y_le)
    
    # train classifier
    print("training classifier...")
    rf = RandomForestClassifier(n_estimators=n_estimators,
                                criterion=criterion,
                                class_weight=class_weight)
    rf.fit(x_train, np.ravel(y_train))
    print(" classifier has been trained")
    
    # evaluate on test set
    test_pred = rf.predict(x_test)
    accuracy = accuracy_score(y_test, test_pred)
    print("Test Accuracy: ", accuracy)
    
    # we need to return the transformers also so that it gets captured in the global scope
    # this will allow us to save these models as pickle files
    return rf, accuracy, le, ohc, y_le
  
  
###########################################################################################
# MAIN
###########################################################################################

def main(): 
  
    # create four arguments to specify location of the data and the 
    # Random Forest hyperparameters 
    parser = argparse.ArgumentParser()

    parser.add_argument(
      '--data-folder', 
      help='data folder mounting point',
      type=str, 
      dest='data_folder'
    )

    parser.add_argument(
      '--n_estimators', 
      help='The number of trees in the forest.',
      type=int, 
      dest='n_estimators', 
      default=100
    )

    parser.add_argument(
      '--criterion', 
      help='The function to measure the quality of a split. Supported criteria are “gini” \
      for the Gini impurity and “entropy” for the information gain. ',
      type=str, 
      dest='criterion', 
      default='entropy'
    )

    parser.add_argument(
      '--class_weight', 
      help='Specify class weights. Options are "balanced", "balanced_subsample" or "None".',
      type=str, 
      dest='class_weight', 
      default='balanced_subsample'
    )

    args = parser.parse_args()

    # specify the pumps folder within the AML blob storage as the folder with data
    data_folder = os.path.join(args.data_folder, 'pumps')
    print('Data folder:', data_folder)

    # run the entire pipeline
    rf, accuracy, le, ohc, y_le = train_and_evaluate(x = os.path.join(data_folder, 'pumps_data.csv'), 
                                                     n_estimators = args.n_estimators,
                                                     criterion = args.criterion,
                                                     class_weight = args.class_weight)
    
    # log accuracy
    run_logger.log('accuracy', float(accuracy))

    # save model objects
    # note: files saved in the outputs folder are automatically uploaded to the experiment record
    os.makedirs('outputs', exist_ok=True)
    joblib.dump(value=rf, filename='outputs/rf.pkl')
    joblib.dump(value=le, filename='outputs/le.pkl')
    joblib.dump(value=ohc, filename='outputs/ohc.pkl')
    joblib.dump(value=y_le, filename='outputs/y_le.pkl')

    
if __name__ == "__main__":
    main()

In [14]:
dbutils.fs.ls(Config.PROJECT_DIR+'/scripts')

Files stored in /FileStore are accessible in your web browser. For example, the file you stored in `/FileStore/my-stuff/my-file.txt` is accessible at:  
`https://<your-region>.azuredatabricks.net/files/my-stuff/my-file.txt?o=######`

You will find the value of `o` in the URL of your browser.
To verify the file, you can use the URL endpoint:  
<https://eastus2.azuredatabricks.net/files/users/jason/pumps/scripts/train.py?o=3749457624382033>
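As a worked illustration of the grouped-mean replacement that `replace_with_grouped_mean` performs in `train.py`, the sketch below uses a tiny made-up dataframe. It takes a different route from the script (it masks the invalid values and uses `groupby(...).transform('mean')`), so treat it as the same idea under that assumption rather than as the script's implementation. The function name `fill_invalid_with_group_mean` is hypothetical.

```python
import pandas as pd


def fill_invalid_with_group_mean(df, value, column, to_groupby):
    """Replace `value` in `column` with the mean of the valid values in the same group."""
    # mark invalid entries as missing so they are excluded from the means
    valid = df[column].where(df[column] != value)
    group_means = valid.groupby(df[to_groupby]).transform('mean')
    out = df.copy()
    out[column] = valid.fillna(group_means)
    return out


# made-up data: zero population is the invalid value
toy = pd.DataFrame({
    'region': ['A', 'A', 'A', 'B', 'B'],
    'population': [0.0, 10.0, 30.0, 5.0, 0.0],
})
fixed = fill_invalid_with_group_mean(
    toy, value=0, column='population', to_groupby='region')
print(fixed['population'].tolist())
```

The zero in region `A` becomes the mean of 10 and 30, and the zero in region `B` becomes 5, the only valid value in that group.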

#### 2.4 SUBMIT THE JOB
An estimator object needs to be created before submitting a job to the remote compute cluster. We will also incorporate a grid search for the hyperparameters using `GridParameterSampling`.

Reference: [Train models with Azure Machine Learning using estimator](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-ml-models)  
Reference: [Tune hyperparameters for your model with Azure Machine Learning service](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters#specify-an-early-termination-policy)
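Grid sampling tries every combination of the listed choices. A quick way to see how many runs a full sweep needs is to enumerate the grid locally. The sketch below uses only `itertools` and mirrors the three two-valued choices used in this section; it does not call HyperDrive.

```python
from itertools import product

grid = {
    'n_estimators': [100, 120],
    'criterion': ['entropy', 'gini'],
    'class_weight': ['balanced_subsample', 'balanced'],
}

# every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(combinations))
print(combinations[0])
```

With eight combinations, a `max_total_runs` value below eight covers only part of the grid.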

In [17]:
# setup estimator object and its parameters

# if you don't use hyperdrive, you will need to specify all flags
# including the model hyperparameters
no_hyperdrive_script_params = {
  '--data-folder': ds.as_mount(),
  '--n_estimators': 120,
  '--criterion': 'entropy',
  '--class_weight': 'balanced_subsample'}

script_params = {'--data-folder': ds.as_mount()}

# lock package versions to the Databricks environment
# this is the equivalent of creating a requirements.txt file 
package_list = [
  'category-encoders==1.3.0',
  'numpy==1.15.0',
  'pandas==0.24.1',
  'scikit-learn==0.20.2']

est = Estimator(source_directory=Config.SCRIPTS_DIR,
                script_params=script_params,
                compute_target=Config.TRAIN_COMPUTE,
                entry_script='train.py',
                pip_packages=package_list)

# setup hyperdrive
param_sampling = GridParameterSampling({
  'n_estimators': choice(100, 120),
  'criterion': choice('entropy', 'gini'),
  'class_weight': choice('balanced_subsample', 'balanced')})

early_termination_policy = BanditPolicy(slack_factor=0.1, 
                                        evaluation_interval=1, 
                                        delay_evaluation=4)

hypertune_config = HyperDriveRunConfig(estimator=est,
                                       hyperparameter_sampling=param_sampling,
                                       policy = early_termination_policy,
                                       primary_metric_name='accuracy',
                                       primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                       max_total_runs=5,  # note: the grid has 2 x 2 x 2 = 8 combinations
                                       max_concurrent_runs=2)

hyperdrive_run = exp.submit(config=hypertune_config)
hyperdrive_run

You can monitor the job from the Azure Portal or with the `azureml-widgets` Python library.

In [19]:
# NOTE: not working yet; needs follow-up
from azureml.widgets import RunDetails
RunDetails(hyperdrive_run).show()

Once all the runs are complete, let's select the run that produced the best accuracy.

In [21]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']
print('Best Run Id: ', best_run.id)
print('Accuracy:', best_run_metrics['accuracy'])
print('class_weight:', parameter_values[3])
print('criterion:', parameter_values[5])
print('n_estimators:', parameter_values[7])
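
The cell above reads the hyperparameters by their position in the argument list. Looking them up by flag name is less fragile if the order ever changes. A minimal sketch, assuming the arguments arrive as a flat list of alternating flags and values; `arguments_to_dict` is a hypothetical helper and the sample list is invented for illustration, not real run output:

```python
def arguments_to_dict(arguments):
    """Pair up a flat ['--flag', value, ...] list into a {flag: value} dict."""
    if len(arguments) % 2 != 0:
        raise ValueError('expected alternating flags and values')
    return dict(zip(arguments[0::2], arguments[1::2]))


# invented example of a flat argument list
sample_arguments = [
    '--data-folder', '/mnt/datastore',
    '--class_weight', 'balanced',
    '--criterion', 'gini',
    '--n_estimators', '120',
]
params = arguments_to_dict(sample_arguments)
print(params['--criterion'], params['--n_estimators'])
```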

#### 2.5 REGISTER THE FINAL MODEL
The last step in the training script wrote files such as `outputs/rf.pkl` to a directory named `outputs` on the VM where the job ran. `outputs` is a special directory: all of its content is automatically uploaded to your workspace and appears in the run record of the experiment. Hence, the model file and the transformers are now available in your workspace.

In [23]:
# register model as well as input transformers
pumps_rf = best_run.register_model(model_name='pumps_rf', model_path='outputs/rf.pkl')
pumps_le = best_run.register_model(model_name='pumps_le', model_path='outputs/le.pkl')
pumps_ohc = best_run.register_model(model_name='pumps_ohc', model_path='outputs/ohc.pkl')
pumps_y_le = best_run.register_model(model_name='pumps_y_le', model_path='outputs/y_le.pkl')

print(pumps_rf.name, pumps_rf.id, pumps_rf.version, sep = '\t')
print(pumps_le.name, pumps_le.id, pumps_le.version, sep = '\t')
print(pumps_ohc.name, pumps_ohc.id, pumps_ohc.version, sep = '\t')
print(pumps_y_le.name, pumps_y_le.id, pumps_y_le.version, sep = '\t')

In Part 3 we will take this model we trained and deploy it on a Kubernetes cluster as a web service.