# Tutorial #1: Model training with Azure Machine Learning

In this tutorial, you will train a machine learning model on local and Azure compute resources. You will explore the Azure Machine Learning service and the Azure ML SDK for Python. 
This notebook serves as a hands-on quick start to the Azure Machine Learning service. 

The code here was tested with the following Azure ML SDK versions:
- 1.6.0
- 1.3.0
- 1.0.72 on Microsoft Azure Notebooks with a Python 3.6 kernel 

The use case is based on  
* [Predicting Employee Turnover](https://towardsdatascience.com/predicting-employee-turnover-7ab2b9ecf47e) by Imad Dabbura 
* [Predict Employee Retention](https://towardsdatascience.com/predict-employee-retention-901bfb2c8db5) by Ila Maheshwari

**Note: The focus of this tutorial is not about accuracy of the model or what algorithm to use. You can choose any use case that you want to explore.**

Before you start this tutorial, you need to create a workspace in the Azure portal first.
[Create and manage Azure Machine Learning workspaces in the Azure portal](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace)

The following are covered in this tutorial:
* Store a dataset in the Azure Machine Learning datastore
* Retrieve a dataset from the Azure Machine Learning datastore
* Train a simple logistic regression model on local machine and on Azure compute resources
* Register the model in Azure Machine Learning Workspace

If you are trying out this tutorial for the first time, please run the code cells in this tutorial sequentially.
Tutorial #2 will cover the basics of deploying a model. 

## References

[How Azure Machine Learning works: Architecture and concepts](https://docs.microsoft.com/en-us/azure/machine-learning/concept-azure-machine-learning-architecture)

[Azure Machine Learning documentation](https://docs.microsoft.com/azure/machine-learning/service/tutorial-train-models-with-aml#prerequisites).
                                                                

## Set up your development environment

### Dependencies required for local machine setup in order to use Azure ML SDK.

Step 1. You need to create a [free Azure account](https://azure.microsoft.com/en-gb/free/) first. This tutorial will use Anaconda on your local machine to connect to your Azure account.

Step 2. This notebook was tested in Anaconda Jupyter Notebook. 
Once you have installed Anaconda on your machine, run the following conda and pip commands to install and update the required packages:
    
+ conda install anaconda-client
+ conda update anaconda
+ pip install azureml-sdk[notebooks,automl]
+ pip install azureml-dataprep[pandas]
+ conda update conda

**Note: If you need to upgrade the azureml components, uninstall the old versions before installing the new ones.**

Alternatively, if you don't have Anaconda, you can run this notebook in the free [Microsoft Azure Notebooks](https://notebooks.azure.com/).

### Import Azure Machine Learning SDK for Python 

This step is to check that you have installed Azure Machine Learning SDK for Python.

**Note: If you encounter ModuleNotFoundError, try uninstalling all the azureml components and then reinstalling them.**

In [None]:
import azureml.core

# check core SDK version number (need Python 3.6 kernel if you run this in Microsoft Azure Notebooks)
print("Azure ML SDK Version: ", azureml.core.VERSION)

### Connect Azure Machine Learning Workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

If you see this message:
"Performing interactive authentication. Please follow the instructions on the terminal.
To sign in, use a web browser to open the page https://microsoft.com/devicelogin and enter the code `<token>` to authenticate."

Open the link and enter the given `<token>` to authenticate. After authenticating, run this cell again to load the workspace.

In [None]:
# Load workspace configuration from the config.json file in the current folder.
from azureml.core import Workspace
workspace = Workspace.from_config()
print(workspace.name, workspace.location, workspace.resource_group, sep='\t')
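
`Workspace.from_config()` expects a `config.json` file that identifies the workspace. As a rough sketch, the file holds three keys: `subscription_id`, `resource_group` and `workspace_name`. The values below are placeholders (assumptions), and the file is written to a temporary folder so that a real `config.json` is not overwritten.

```python
import json
import os
import tempfile

# Placeholder values (assumptions); replace them with your own workspace details.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write to a temporary folder so this sketch never touches a real config.json.
config_dir = tempfile.mkdtemp()
config_path = os.path.join(config_dir, "config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=4)

# Read the file back to confirm its structure.
with open(config_path) as f:
    loaded = json.load(f)
print(sorted(loaded))
```

You can also download a ready-made `config.json` from the overview page of your workspace in the Azure portal and place it next to this notebook.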

### Create an Experiment

An Experiment tracks the runs in your workspace. A workspace can have multiple experiments. 

In [None]:
from azureml.core import Experiment

experiment_name = 'predict-employee-retention'
exp = Experiment(workspace=workspace, name=experiment_name)

### Connect to Datastore

From your machine learning Workspace, launch Azure Machine Learning studio. Click "Datastores" (under "Manage" section) on the left menu.

This tutorial will use the default datastore 'workspaceblobstore' to store the datasets.

In [None]:
# Connect to datastore
from azureml.core import Datastore
datastore = Datastore.get(workspace, datastore_name='workspaceblobstore')

### Create Dataset

To upload your data file, launch Azure Machine Learning studio, go to "Datasets" (under the "Assets" section), then click "Create dataset".

Use training-data.csv to create the following datasets (the date and time in the "Upload path" are auto-generated):
1. Name=predict-employee-retention-tabular, Dataset type=Tabular, Upload path=predict-employee-retention-tabular/04-17-2020_033204_UTC/training-data.csv, select "Use headers from the first file". When you upload training-data.csv as a Tabular dataset, make sure to uncheck "Skip data validation" so that the data types are detected automatically.
2. Name=predict-employee-retention-training-data, Dataset type=File, Upload path=predict-employee-retention/04-16-2020_023057_UTC/training-data.csv

### How to download data from datastore

This section shows how to download a dataset (file or tabular).

[Create Azure Machine Learning datasets](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets)

In [None]:
from azureml.core import Dataset
import os

# 1. Download data file from datastore using "Relative path"
# Notice the downloaded file will be stored in a folder that has structure similar to the relative path.
temp_folder = os.path.join(os.getcwd(), "temp")
# Remember to change file relative path.
training_data_file = 'predict-employee-retention/05-21-2020_072626_UTC/training-data.csv'
os.makedirs(temp_folder, exist_ok=True)
datastore.download(target_path=temp_folder,
                   prefix=training_data_file,
                   show_progress=True,
                   overwrite=True)

# 2. Download Datasets from Workspace
# Below is generated sample code found in "Microsoft Azure Machine Learning > Datasets > (choose your dataset) > Consume". You can modify it to suit your need.
# azureml-core of version 1.0.72 or higher is required
from azureml.core import Workspace, Dataset

#subscription_id = 'XXX'
#resource_group = 'XXX'
#workspace_name = 'XXX'
#workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name='predict-employee-retention-training-data')
dataset.download(target_path=temp_folder, overwrite=True)

dataset = Dataset.get_by_name(workspace, name='predict-employee-retention-tabular')
df = dataset.to_pandas_dataframe()
display(df.head())
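
As the comment in the cell above notes, a file downloaded by relative path lands in a folder structure that mirrors that path. Assuming this holds, the expected local location can be worked out as in the minimal sketch below; `expected_local_file` is a name introduced here for illustration.

```python
import os

# Same folder and relative path as in the cell above; change them to match your upload.
temp_folder = os.path.join(os.getcwd(), "temp")
training_data_file = "predict-employee-retention/05-21-2020_072626_UTC/training-data.csv"

# The datastore path uses forward slashes; rebuild it with the local path separator.
expected_local_file = os.path.join(temp_folder, *training_data_file.split("/"))
print(expected_local_file)
print("Downloaded:", os.path.exists(expected_local_file))
```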

## Explore data

This example uses Dataset.get_by_name() to retrieve the registered dataset.

In [None]:
import pandas as pd
from azureml.core import Dataset
import os

# Read tabular data that has been uploaded into datastore.
training_data = Dataset.get_by_name(workspace, name='predict-employee-retention-tabular')
dataset = training_data.to_pandas_dataframe()

# Rename sales feature into department
dataset = dataset.rename(columns={"sales": "department"})

# Map salary into integers
salary_map = {"low": 0, "medium": 1, "high": 2}
dataset["salary"] = dataset["salary"].map(salary_map)
display(dataset.head())

# Create dummy variables for department feature
dataset = pd.get_dummies(dataset, columns=["department"], drop_first=True)
# There are now 17 features after splitting up department
#dataset.columns[dataset.columns != "left"].shape
display(dataset.head())

# Check the data types and whether there are missing values
print("\033[1m" + "\033[94m" + "Data types:\n" + 11 * "-")
print("\033[30m" + "{}\n".format(dataset.dtypes))
print("\033[1m" + "\033[94m" + "Sum of null values in each column:\n" + 35 * "-")
print("\033[30m" + "{}".format(dataset.isnull().sum()))

import matplotlib.pyplot as plt
import seaborn as sns
# Get number of positive and negative examples
pos = dataset[dataset["left"] == 1].shape[0]
neg = dataset[dataset["left"] == 0].shape[0]
print("Positive examples = {}".format(pos))
print("Negative examples = {}".format(neg))
print("Proportion of positive to negative examples = {:.2f}%".format((pos / neg) * 100))
sns.countplot(x=dataset["left"])  # pass the column by keyword; newer seaborn versions expect this
plt.xticks((0, 1), ["Didn't leave", "Left"])
plt.xlabel("Left")
plt.ylabel("Count")
plt.title("Class counts");
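
To see why the department column expands into one fewer indicator columns than it has distinct values, here is a pure-Python sketch of one-hot encoding with the first level dropped. It illustrates the idea behind `drop_first=True` and is not how pandas implements it; the function name and the department values are hypothetical.

```python
def dummies_drop_first(values, prefix):
    """Return (column_names, rows) for a one-hot encoding that drops the first level."""
    levels = sorted(set(values))
    kept = levels[1:]  # the first level becomes the implicit baseline (all zeros)
    columns = ["{}_{}".format(prefix, level) for level in kept]
    rows = [[1 if value == level else 0 for level in kept] for value in values]
    return columns, rows

# Hypothetical department values for illustration.
departments = ["sales", "IT", "hr", "sales", "IT"]
columns, rows = dummies_drop_first(departments, "department")
print(columns)
for department, row in zip(departments, rows):
    print(department, row)
```

With k distinct departments the encoding yields k-1 columns, and a row of all zeros stands for the dropped level.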

## Create a model using SKLearn

First, split the data into training and test sets using an 80/20 split: 80% of the data is used to train the models and 20% to test their performance. Second, upsample the minority class and downsample the majority class. 

For this data set, positive class refers to those that "Left", negative class refers to those that "Didn't leave".


In [None]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Convert dataframe into numpy objects and split them into
# train and test sets: 80/20
X = dataset.loc[:, dataset.columns != "left"].values
y = dataset.loc[:, dataset.columns == "left"].values.flatten()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

# Upsample minority class (When "Left" is minority, "Didn't leave" is majority)
X_train_u, y_train_u = resample(X_train[y_train == 1],
                                y_train[y_train == 1],
                                replace=True,
                                n_samples=X_train[y_train == 0].shape[0],
                                random_state=1)
X_train_u = np.concatenate((X_train[y_train == 0], X_train_u))
y_train_u = np.concatenate((y_train[y_train == 0], y_train_u))

# Downsample majority class ("Didn't leave" is the majority, "Left" is the minority)
X_train_d, y_train_d = resample(X_train[y_train == 0],
                                y_train[y_train == 0],
                                replace=True,
                                n_samples=X_train[y_train == 1].shape[0],
                                random_state=1)
X_train_d = np.concatenate((X_train[y_train == 1], X_train_d))
y_train_d = np.concatenate((y_train[y_train == 1], y_train_d))

print("Original shape:", X_train.shape, y_train.shape)
print("Upsampled shape:", X_train_u.shape, y_train_u.shape)
print("Downsampled shape:", X_train_d.shape, y_train_d.shape)
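
The upsampling step draws from the minority class with replacement until it matches the size of the majority class. The sketch below shows the same idea with only the standard library, as a stand-in for `sklearn.utils.resample`; the function name and the labels are hypothetical.

```python
import random

def upsample_minority(labels, minority=1, seed=1):
    """Return row indices in which the minority class is resampled to the majority size."""
    rng = random.Random(seed)
    minority_idx = [i for i, label in enumerate(labels) if label == minority]
    majority_idx = [i for i, label in enumerate(labels) if label != minority]
    # Sampling with replacement lets the minority class grow to the majority size.
    resampled = rng.choices(minority_idx, k=len(majority_idx))
    return majority_idx + resampled

# Hypothetical labels: 1 = "Left", 0 = "Didn't leave".
y = [0, 0, 0, 0, 0, 0, 1, 1]
indices = upsample_minority(y)
balanced = [y[i] for i in indices]
print(balanced.count(0), balanced.count(1))
```

Only the training split should be resampled, as the cell above does, so that the test set keeps the original class balance.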

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import seaborn as sns

clf=LogisticRegression(solver='liblinear',random_state=0)

# Choose training data
print('Use original data')
clf.fit(X_train,y_train)

#print('Use upsampled data')
#clf.fit(X_train_u,y_train_u)

#print('Use downsampled data')
#clf.fit(X_train_d,y_train_d)

# View the model's coefficients and bias
print(clf.coef_)
print(clf.intercept_)

y_pred_LR=clf.predict(X_test)

cf_matrix = confusion_matrix(y_test, y_pred_LR)

# Display confusion matrix
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                cf_matrix.flatten()]
group_percentages = ["{0:.2%}".format(value) for value in
                     cf_matrix.flatten()/np.sum(cf_matrix)]
labels = [f"{v1}\n{v2}\n{v3}" for v1, v2, v3 in
          zip(group_names,group_counts,group_percentages)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(cf_matrix, annot=labels, fmt='', cmap='Blues')

# Show stats of binary classification.
# Accuracy is sum of diagonal divided by total observations.
accuracy  = np.trace(cf_matrix) / float(np.sum(cf_matrix))

# Alternative way to calculate accuracy on the prediction
#acc = np.average(y_pred_LR == y_test)
#print('Accuracy is', acc)

#Metrics for Binary Confusion Matrices
precision = cf_matrix[1,1] / sum(cf_matrix[:,1])
recall    = cf_matrix[1,1] / sum(cf_matrix[1,:])
f1_score  = 2*precision*recall / (precision + recall)
stats_text = "\n\nAccuracy={:0.3f}\nPrecision (ability to identify only the relevant data points)={:0.3f} [True Pos/(True Pos+False Pos)]\nRecall (ability to find all the data points of interest in a dataset)={:0.3f} [True Pos/(True Pos+False Neg)]\nF1 Score (harmonic mean of precision and recall taking both metrics into account)={:0.3f} [2 x precision*recall/(precision+recall)]".format(
    accuracy, precision, recall, f1_score)
print(stats_text)
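
As a worked example of the formulas above, the following sketch computes the four statistics from a small confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion matrix in scikit-learn's layout:
# rows are actual classes, columns are predicted classes.
tn, fp = 90, 10
fn, tp = 20, 30

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 120 / 150
precision = tp / (tp + fp)                   # 30 / 40
recall = tp / (tp + fn)                      # 30 / 50
f1_score = 2 * precision * recall / (precision + recall)

print("Accuracy={:.3f} Precision={:.3f} Recall={:.3f} F1={:.3f}".format(
    accuracy, precision, recall, f1_score))
```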

## Model Training

For this task, submit a job to train the model on your local machine. To submit a job, you:
* Create directory to store training scripts and data
* Create training scripts
* Create training environment
* Submit a run

The training results will be stored in your Azure ML workspace.

## Step 1. Create directory to store model training scripts and data

In [None]:
import os

script_folder = os.path.join(os.getcwd(), "training")
os.makedirs(script_folder, exist_ok=True)

data_folder = os.path.join(script_folder, "data")
os.makedirs(data_folder, exist_ok=True)

## Step 2. Create training scripts

Create training scripts in the directory you just created. Notice how the script saves the model:
    
+ The training script saves your model into a directory named outputs. <br>
`joblib.dump(value=clf, filename='outputs/predict-employee-retention-model.pkl')`<br>
Anything written in this directory is automatically uploaded into your workspace. You'll access your model from this directory later in the tutorial.

[Run class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py)

[Joblib](https://joblib.readthedocs.io/en/latest/)
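
The save-and-reload round trip can be tried on its own before submitting a run. The sketch below uses `pickle` from the standard library as a stand-in for joblib so that it has no extra dependencies; the dictionary is a hypothetical placeholder for a trained model, and a temporary folder is used instead of the real outputs directory.

```python
import os
import pickle
import tempfile

# A stand-in for a trained model; any picklable Python object works here.
model = {"coef": [0.5, -1.2], "intercept": 0.1}

# A temporary folder keeps this sketch away from the real outputs directory.
outputs_dir = os.path.join(tempfile.mkdtemp(), "outputs")
os.makedirs(outputs_dir, exist_ok=True)
model_path = os.path.join(outputs_dir, "predict-employee-retention-model.pkl")

# Save the object, then load it back and compare.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```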

In [None]:
%%writefile train_csv.py

import argparse
import os
from azureml.core import Run
import pandas as pd
from predict_employee_retention import generate_model
from joblib import dump

# retrieve the argument passed through `arguments` in ScriptRunConfig
parser = argparse.ArgumentParser()
parser.add_argument('--dataset-file', type=str, dest='dataset_file', help='path and name of the dataset file')
args = parser.parse_args()

# get hold of the current run
run = Run.get_context()

training_dataset_file = args.dataset_file
run.log('Dataset name', training_dataset_file)

# Read dataset
dataset = pd.read_csv(training_dataset_file)
run.log('Read training data from file', training_dataset_file)

# Generate model
clf = generate_model(dataset, run)

# note file saved in the outputs folder is automatically uploaded into experiment record
os.makedirs('outputs', exist_ok=True)
dump(value=clf, filename='outputs/predict-employee-retention-model.pkl')
run.log('End of run','Training completed')
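
To check the argument handling of train_csv.py without submitting a run, the same parser can be exercised directly. This is a minimal sketch; the path is the one passed to the script later in this tutorial.

```python
import argparse

# Same parser definition as in train_csv.py.
parser = argparse.ArgumentParser()
parser.add_argument('--dataset-file', type=str, dest='dataset_file',
                    help='path and name of the dataset file')

# Pass the arguments explicitly instead of reading them from the command line.
args = parser.parse_args(['--dataset-file', './data/training-data.csv'])
print(args.dataset_file)
```

Because of `dest='dataset_file'`, the value of `--dataset-file` is available as `args.dataset_file`.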

In [None]:
%%writefile predict_employee_retention.py

from azureml.core import Run
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def generate_model(dataset, run):
    # Rename sales feature into department
    dataset = dataset.rename(columns={"sales": "department"})

    # Map salary into integers
    salary_map = {"low": 0, "medium": 1, "high": 2}
    dataset["salary"] = dataset["salary"].map(salary_map)

    # Create dummy variables for department feature
    dataset = pd.get_dummies(dataset, columns=["department"], drop_first=True)

    # Get number of positive and negative examples
    pos = dataset[dataset["left"] == 1].shape[0]
    neg = dataset[dataset["left"] == 0].shape[0]
    run.log('Positive','Positive examples = {}'.format(pos))
    run.log('Negative', 'Negative examples = {}'.format(neg))
    run.log('Proportion','Proportion of positive to negative examples = {:.2f}%'.format((pos / neg) * 100))

    run.log('Begin training','Train with a logistic regression model.')
    # Convert dataframe into numpy objects and split them into
    # train and test sets: 80/20
    X = dataset.loc[:, dataset.columns != "left"].values
    y = dataset.loc[:, dataset.columns == "left"].values.flatten()

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=1)

    clf=LogisticRegression(solver='liblinear',random_state=0)
    clf.fit(X_train,y_train)
    
    # View the model's coefficients and bias
    run.log('Coefficients',clf.coef_)
    run.log('Bias',clf.intercept_)

    run.log('Predict','Predict on the test data set.')
    y_pred_LR=clf.predict(X_test)

    # Display confusion matrix
    cf_matrix = confusion_matrix(y_test, y_pred_LR)
    group_names = ['True Neg','False Pos','False Neg','True Pos']
    group_counts = ["{0:0.0f}".format(value) for value in cf_matrix.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cf_matrix.flatten()/np.sum(cf_matrix)]
    labels = [f"{v1} {v2} ({v3})" for v1, v2, v3 in zip(group_names,group_counts,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    run.log('Confusion Matrix',labels)

    # Display statistics
    accuracy  = np.trace(cf_matrix) / float(np.sum(cf_matrix))
    run.log('Accuracy',accuracy)
    precision = cf_matrix[1,1] / sum(cf_matrix[:,1])
    run.log('Precision',precision)
    recall    = cf_matrix[1,1] / sum(cf_matrix[1,:])
    run.log('Recall',recall)
    f1_score  = 2*precision*recall / (precision + recall)
    run.log('F1-score',f1_score)
    stats_text = "\n\nAccuracy={:0.3f}\nPrecision={:0.3f}\nRecall={:0.3f}\nF1 Score={:0.3f}".format(
    accuracy,precision,recall,f1_score)
    run.log('Show stats',stats_text)
    return clf

### Copy scripts to the folder used for training

In [None]:
import shutil
shutil.copy('train_csv.py', script_folder)
shutil.copy('predict_employee_retention.py', script_folder)

## Option 1: Download data file (.csv) from datastore

In this example, you download a FileDataset (i.e. training-data.csv) from the datastore and use it as an input to your training script.

In [None]:
# Refer to the section "How to download data from datastore" for details.
from azureml.core import Workspace, Dataset

dataset = Dataset.get_by_name(workspace, name='predict-employee-retention-training-data')
dataset.download(target_path=data_folder, overwrite=True)

## Step 3. Create training environment in local machine

These steps create a local training environment, for example to make use of the Anaconda installation on your local machine. You can also run this "locally" in Microsoft Azure Notebooks.

Details are provided at https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-environment#local
    
[What are Azure Machine Learning environments?](https://docs.microsoft.com/en-us/azure/machine-learning/concept-environments)

### Define a user-managed environment
By default, Azure Machine Learning service will build a Conda environment with dependencies you specified, and will execute the run in that environment instead of using any Python libraries that you installed on the base image.
A later example in this tutorial demonstrates the use of Environment when training the model on Azure. 

In some situations, your custom base image may already contain a Python environment with packages that you want to use.

When using a user-managed environment for local training, you are responsible for ensuring that all the necessary packages are available in the Python environment you choose to run the script in.

+ Create and attach: There's no need to create or attach a compute target to use your local computer as the training environment.
+ Configure: When you use your local computer as a compute target, the training code is run in your development environment. If that environment already has the Python packages you need, use the user-managed environment.

To use your own installed packages, set the parameter Environment.python.user_managed_dependencies = True. Ensure that the base image contains a Python interpreter, and has the packages your training script needs.
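
Since a user-managed environment leaves package management to you, it can help to confirm that the packages imported by the training scripts are visible to the interpreter. The sketch below is one possible check; `missing_packages` is a hypothetical helper, and the list of names is an assumption based on the imports used in this tutorial.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level package names that the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Top-level packages imported by the training scripts in this tutorial (assumed list).
required = ["azureml", "pandas", "numpy", "sklearn", "joblib"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

Run the check with the same interpreter that will execute the training script, otherwise the result may not reflect the training environment.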

In [None]:
from azureml.core import Environment

# Create a 'user-managed environment' environment.
user_managed_env = Environment("user-managed-env")

user_managed_env.python.user_managed_dependencies = True

# You can choose a specific Python environment by pointing to a Python path 
#user_managed_env.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'

### Create ScriptRunConfig

However you manage your environment, you need to use the ScriptRunConfig class. ScriptRunConfig identifies the training script to run in the experiment and the environment in which to run it. 

ScriptRunConfig includes
+ source_directory: The source directory that contains your training script
+ script: Identify the training script
+ run_config: The run configuration, which in turn defines where the training will occur

Note: ScriptRunConfig doesn't allow you to pass a dataset to the training script.


In [None]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(source_directory=script_folder, script='train_csv.py', arguments=['--dataset-file','./data/training-data.csv'])
src.run_config.environment = user_managed_env

Alternatively, you can use a local target. You can switch the same experiment to run in a different compute target by using a different run configuration.

In [None]:
import os 
from azureml.core import ScriptRunConfig
from azureml.core.runconfig import RunConfiguration

# Edit a run configuration property on the fly.
run_local = RunConfiguration()
run_local.environment.python.user_managed_dependencies = True

src = ScriptRunConfig(source_directory=script_folder, script='train_csv.py', run_config=run_local, arguments=['--dataset-file','./data/training-data.csv'])

## Step 4. Submit a run

After you create a run configuration, you use it to run your experiment. An experiment is a logical container in an Azure ML Workspace. It contains a series of trials called Runs. As such, it hosts run records such as run metrics, logs, and other output artifacts from your experiments.

The code pattern to submit a training run is the same for all types of compute targets:
+ Create an experiment to run.
+ Submit the run.
+ Wait for the run to complete.
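
The submit and wait steps can also be combined in a small helper. The sketch below relies only on the `submit`, `wait_for_completion` and `get_metrics` calls used in this tutorial; `submit_and_wait` is a hypothetical name, and the experiment is assumed to exist already.

```python
def submit_and_wait(experiment, run_config, show_output=False):
    """Submit a configuration, block until the run finishes, and return its metrics."""
    run = experiment.submit(run_config)
    run.wait_for_completion(show_output=show_output)
    return run.get_metrics()

# Usage with the objects defined earlier in this notebook:
# metrics = submit_and_wait(exp, src)
```

Because the helper only depends on these three methods, the same call works whether the run configuration targets the local machine or a cloud compute target.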



### Submit the run

In [None]:
run = exp.submit(src)
run

### Wait for the run to complete

After you submit the run, you can immediately execute this code to watch the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10 to 15 seconds until the job finishes:

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

### Get log results upon completion

Model training and monitoring happen in the background. Wait until the model has finished training before you run more code. Use wait_for_completion to show when the model training is finished:

In [None]:
run.wait_for_completion(show_output=False)  # specify True for a verbose log

Note: All these calculations were run on your local machine, in the conda environment you defined above. You can find the results in:

+ ~/.azureml/envs/azureml_xxxx for the conda environment you just created
+ ~/AppData/Local/Temp/azureml_runs/train-on-local_xxxx for the machine learning models you trained (this path may differ depending on the platform you use). This folder also contains
    - Logs (under azureml_logs/)
    - Output pickled files (under outputs/)
    - The configuration files (credentials, local and docker image setups)
    - The training scripts (train_csv.py and predict_employee_retention.py in this tutorial)
    - The current notebook

Take a few minutes to examine the output of the cell above. It shows the content of some of the log files, and extra information on the conda environment used.


### Display run results

Display the information captured by run.log(). Results appear only after the run has completed.

In [None]:
print(run.get_metrics())

## Register Model

The last step in the training script wrote the file predict-employee-retention-model.pkl to a directory named outputs on the machine where the job ran (your local machine in this case, or a VM of the cluster when you train on Azure). "outputs" is a special directory: all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace, so the model file is now also available in your workspace.

You can see files associated with that run:

In [None]:
print(run.get_file_names())

Register the model in the workspace so that you or other collaborators can later query, examine, and deploy it. You can store the metrics you captured in the "tags" of the Model object. 

In [None]:
# Add metrics to tags so that this information can be used for model comparison.
metrics = ['Accuracy','Precision','Recall','F1-score']
tags = {}
for key in metrics:
    tags[key] = run.get_metrics(key).get(key)

# register model, note the metric values are stored in "tags".
model = run.register_model(model_name='predict-employee-retention-model',
                           model_path='outputs/predict-employee-retention-model.pkl',
                           tags=tags
                          )
print(model.name, model.id, model.version, model.tags, sep='\t')
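
If a metric was never logged, `run.get_metrics(key).get(key)` yields `None`, which then ends up in the tags. A slightly more defensive variation is sketched below: it skips missing metrics and stores the values as short strings, which may be easier to read when comparing models. `metrics_to_tags` and the metric values are hypothetical.

```python
def metrics_to_tags(all_metrics, keys):
    """Pick the selected metrics and store them as strings rounded to three decimals."""
    tags = {}
    for key in keys:
        value = all_metrics.get(key)
        if value is not None:  # skip metrics that were not logged
            tags[key] = "{:.3f}".format(value)
    return tags

# Hypothetical values standing in for the dictionary returned by run.get_metrics().
example_metrics = {"Accuracy": 0.8, "Precision": 0.75, "Recall": 0.6, "F1-score": 0.6667}
print(metrics_to_tags(example_metrics, ["Accuracy", "Precision", "Recall", "F1-score"]))
```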

Once you have registered the model, you can proceed to Tutorial #2 to deploy the model.

## Option 2: Use input dataset to pass TabularDataset to training scripts

In this example, you retrieve a TabularDataset from the datastore and pass it directly to your estimator object for training on your local machine.

TabularDataset objects provide the ability to load the data into a pandas or spark DataFrame so that you can work with familiar data preparation and training libraries. To leverage this capability, you can pass a TabularDataset as the input in your training configuration, and then retrieve it in your script.

To do so, access the input dataset through the Run object in your training script and use the to_pandas_dataframe() method.

### Create training script

In [None]:
%%writefile train_TabularDataset.py

import argparse
import os
from azureml.core import Run
import pandas as pd
from predict_employee_retention import generate_model

import joblib

from azureml.core import Run, Dataset

# retrieve argument configured through script_params in estimator
parser = argparse.ArgumentParser()
parser.add_argument('--dataset-name', type=str, dest='dataset_name', help='Name of dataset')
args = parser.parse_args()

# get hold of the current run
run = Run.get_context()

training_dataset_name = args.dataset_name
run.log('Dataset name', training_dataset_name)

# get the input dataset by name
training_data = run.input_datasets[training_dataset_name]
# load the TabularDataset to pandas DataFrame
dataset = training_data.to_pandas_dataframe()

# Generate model
clf = generate_model(dataset, run)

# note file saved in the outputs folder is automatically uploaded into experiment record
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=clf, filename='outputs/predict-employee-retention-model.pkl')
run.log('End of run','Training completed')

In [None]:
import shutil
shutil.copy('train_TabularDataset.py', script_folder)

### Download training data from data source

This step shows how to retrieve a TabularDataset from the datastore. 

In [None]:
from azureml.core import Dataset

tabular_dataset = Dataset.get_by_name(workspace, name='predict-employee-retention-tabular')
display(tabular_dataset.to_pandas_dataframe().head())

## Configure an Estimator object
An Estimator object is used to submit the experiment run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic estimator.

This code creates a SKLearn estimator object, est, that specifies
+ A script directory for your scripts. All the files in this directory are uploaded into the cluster nodes for execution.
+ Parameters required from the training script 
+ The environment definition for the experiment.
+ The training script i.e. train_TabularDataset.py.
+ The compute target for the experiment. In this case use 'local'.
+ The input dataset for training, named 'training_data'. as_named_input() is required so that the input dataset can be referenced by the assigned name in your training script.

### Define a user-managed environment

In [None]:
from azureml.core import Environment

# Create a 'user-managed environment' environment.
user_managed_env = Environment("user-managed-env")

user_managed_env.python.user_managed_dependencies = True

# You can choose a specific Python environment by pointing to a Python path 
#user_managed_env.python.interpreter_path = '/home/johndoe/miniconda3/envs/myenv/bin/python'

### Create a SKLearn estimator object 

In [None]:
from azureml.train.sklearn import SKLearn

script_params = {
    '--dataset-name': 'training_data'
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              environment_definition=user_managed_env,
              entry_script='train_TabularDataset.py',
              compute_target='local',
              # pass dataset object as an input with name 'training_data'
              inputs=[tabular_dataset.as_named_input('training_data')]
              )

### Submit the run

In [None]:
# Submit the estimator as part of your experiment run
run = exp.submit(est)
run

### Wait for the run to complete

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

### Get log results upon completion

In [None]:
run.wait_for_completion(show_output=False)

## Register model

In [None]:
# Add metrics to tags so that this information can be used for model comparison.
metrics = ['Accuracy','Precision','Recall','F1-score']
tags = {}
for key in metrics:
    tags[key] = run.get_metrics(key).get(key)

# register model, note the metric values are stored in "tags".
model = run.register_model(model_name='predict-employee-retention-model',
                           model_path='outputs/predict-employee-retention-model.pkl',
                           tags=tags
                          )
print(model.name, model.id, model.version, model.tags, sep='\t')

Once you have registered the model, you can proceed to Tutorial #2 to deploy the model.

# The following examples are for training the model on Azure

This example is similar to the TabularDataset example above. The difference is that the model is trained on a cloud compute resource instead of the local machine.

Note: In the training script, the joblib import changes to `sklearn.externals.joblib`, because the cloud environment uses an older version of scikit-learn.

### Create or Attach existing compute resource
By using Azure Machine Learning Compute, a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. Examples include VMs with GPU support.

In this tutorial, you create an Azure Machine Learning compute cluster as your training environment. Once it is created, you can find 'cpucluster' in your Workspace under **Compute > Training clusters**.

Note:

To avoid charges when no jobs are running, set the minimum nodes to 0. This setting allows Azure Machine Learning to de-allocate the nodes when they aren't in use. Any value larger than 0 will keep that number of nodes running, even if they are not in use.

[Set up and use compute targets for model training](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-set-up-training-targets)

In [None]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "cpucluster")
# environment variables are read as strings, so convert the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 2))

# This example uses a CPU VM. To use a GPU VM, set the SKU to STANDARD_NC6.
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_DS2_V2")


if compute_name in workspace.compute_targets:
    compute_target = workspace.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(
        workspace, compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())
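
Values read with os.environ.get are strings whenever the variable is set, while the node counts passed to the provisioning configuration are meant to be integers. The following is a minimal sketch of a conversion helper. The name `int_from_env` is hypothetical, and the sketch sets one variable itself only to simulate an exported setting.

```python
import os


def int_from_env(name, default):
    """Read an integer setting from the environment, falling back to default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)  # raises ValueError if the text is not a whole number


# Simulate one exported variable and one that is not set.
os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "4"
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)

min_nodes = int_from_env("AML_COMPUTE_CLUSTER_MIN_NODES", 0)
max_nodes = int_from_env("AML_COMPUTE_CLUSTER_MAX_NODES", 2)
print(min_nodes, max_nodes)
```

Failing early with a ValueError on a malformed value is a design choice of this sketch; it surfaces a bad setting before any cluster is requested.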


### Create training script

In [None]:
%%writefile train_TabularDataset2.py

import argparse
import os
import pandas as pd
from predict_employee_retention import generate_model

# Import joblib from sklearn.externals to match the older scikit-learn used on the cloud
# (using joblib from the latest sklearn causes an error).
from sklearn.externals import joblib

from azureml.core import Run, Dataset

# retrieve argument configured through script_params in estimator
parser = argparse.ArgumentParser()
parser.add_argument('--dataset-name', type=str, dest='dataset_name', help='Name of dataset')
args = parser.parse_args()

# get hold of the current run
run = Run.get_context()

training_dataset_name = args.dataset_name
run.log('Dataset name', training_dataset_name)

# get the input dataset by name
training_data = run.input_datasets[training_dataset_name]
# load the TabularDataset to pandas DataFrame
dataset = training_data.to_pandas_dataframe()

# Generate model
clf = generate_model(dataset, run)

# Note: files saved in the outputs folder are automatically uploaded to the experiment record.
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=clf, filename='outputs/predict-employee-retention-model.pkl')
run.log('End of run','Training completed')
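
The training script receives the dataset name as a command-line argument and then uses it as a key into the run's input datasets. The sketch below imitates that flow without Azure: the argument list is passed explicitly, and a plain dictionary with a placeholder value stands in for run.input_datasets (both are assumptions for illustration).

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--dataset-name', type=str, dest='dataset_name',
                    help='Name of dataset')

# Simulate the arguments that script_params would supply on the command line.
args = parser.parse_args(['--dataset-name', 'training_data'])

# Stand-in for run.input_datasets, keyed by the name given to as_named_input().
input_datasets = {'training_data': '<TabularDataset placeholder>'}
training_data = input_datasets[args.dataset_name]
print(args.dataset_name, training_data)
```

The lookup only succeeds when the value passed for --dataset-name matches the name used in as_named_input(); a mismatch raises a KeyError in this sketch.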

In [None]:
import shutil
shutil.copy('train_TabularDataset2.py', script_folder)

### Create an environment object that specifies the dependencies required for training

Note: An older version of scikit-learn is needed because of an incompatibility with the joblib package.

In [None]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

env = Environment('my_env')

# Use an older version of scikit-learn because of an issue with joblib when saving the pickle file.
# The run in https://predictemployeeretention-xxx.notebooks.azure.com/j/notebooks/predict-employee-retention-part1-training.ipynb#Submit-a-run-to-use-the-cloud-compute fails if joblib from the latest sklearn is used.
cd = CondaDependencies.create(pip_packages=['azureml-sdk','scikit-learn==0.20.3','azureml-dataprep[pandas,fuse]>=1.1.14'])

env.python.conda_dependencies = cd

### Submit a run to use the cloud compute

In [None]:
from azureml.train.sklearn import SKLearn

script_params = {
    '--dataset-name': 'training_data'
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              environment_definition=env,
              entry_script='train_TabularDataset2.py',
              compute_target=compute_target,
              # pass dataset object as an input with name 'training_data'
              inputs=[tabular_dataset.as_named_input('training_data')]
              )

# Submit the estimator as part of your experiment run
run = exp.submit(est)
run

### Wait for the run to complete

The run takes longer on the cloud because pip must first install the specified dependencies into the environment. Launch the widget to watch the progress. You can re-run this code to relaunch the widget.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

## Manage environment

Manage environments so that you can update, track, and reuse them across compute targets and with other users of the workspace.

### Register environment

The environment is automatically registered with your workspace when you submit a run or deploy a web service. You can also manually register the environment by using the register() method. This operation makes the environment into an entity that's tracked and versioned in the cloud. The entity can be shared between workspace users.

The following code registers the "my_env" environment to the workspace. Please remember to register this environment, which will be reused later.

[Reuse environments for training and deployment by using Azure Machine Learning](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-environments)

In [None]:
env.register(workspace=workspace)

## Register model

Wait for the run to complete before you register the model. Otherwise you will see a ModelPathNotFoundException.

In [None]:
# Add metrics to tags so that this information can be used for model comparison.
metrics = ['Accuracy','Precision','Recall','F1-score']
tags = {}
for key in metrics:
    tags[key] = run.get_metrics(key).get(key)

# Register the model. Note that the metric values are stored in "tags".
model = run.register_model(model_name='predict-employee-retention-model',
                           model_path='outputs/predict-employee-retention-model.pkl',
                           tags=tags
                          )
print(model.name, model.id, model.version, model.tags, sep='\t')

Once you have registered the model, you can proceed to Tutorial #2 to deploy the model.

## Option 3: Mount a data file to a remote compute target for the training script to access

When you mount a dataset, you attach the files referenced by the dataset to a directory (mount point) and make them available on the compute target.

FileDataset represents a collection of file references in datastores or public URLs to use in Azure Machine Learning.

The actual data loading happens when FileDataset is asked to deliver the data into another storage mechanism (e.g. files downloaded or mounted to local path).

[FileDataset class](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.filedataset?view=azure-ml-py)

[Create Azure Machine Learning datasets](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-register-datasets)


### Download data file from datastore

Retrieve a FileDataset (i.e. training-data.csv) from the datastore, then mount it.

There are two ways to retrieve a FileDataset from the datastore:

In [None]:
from azureml.core import Dataset

# Method 1: this example uses the "Dataset.File.from_files" method.
# To mount a single file with this method, the FileDataset must specify the relative path, including the file name.
# Remember to change the relative file path to match your own! You can check the path under Datasets in Azure Machine Learning Studio.

file_dataset = Dataset.File.from_files(path = [(datastore, 'predict-employee-retention/05-21-2020_072626_UTC/training-data.csv')])

# The following code can retrieve the FileDataset, but the mount does not work:
# file_dataset = Dataset.File.from_files(path = [(datastore, 'predict-employee-retention')])
# You will see this error:
# Error occurred: User program failed with ParserError: Error tokenizing data. C error: Calling read(nbytes) on source failed. Try engine='python'.

print(file_dataset)

# List the files referenced by the dataset.
file_dataset.to_path()

In [None]:
from azureml.core.dataset import Dataset

# Method 2: this example uses the "Dataset.get_by_name" method.
file_dataset = Dataset.get_by_name(workspace, name='predict-employee-retention-training-data')
print(file_dataset)
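
The comments in Method 1 note that pointing the dataset at a folder leads to a ParserError when the training script reads the data. A plausible cause, stated here as an assumption, is that the script then receives a directory instead of a CSV file. The sketch below shows a hypothetical helper, `resolve_csv_path`, that accepts either a file or a folder and returns the single CSV inside it. A temporary directory stands in for the mount point.

```python
import glob
import os
import tempfile


def resolve_csv_path(mounted_path):
    """Return a CSV file path whether mounted_path is a file or a folder."""
    if os.path.isfile(mounted_path):
        return mounted_path
    # Search the folder and its sub-folders for CSV files.
    pattern = os.path.join(mounted_path, "**", "*.csv")
    matches = sorted(glob.glob(pattern, recursive=True))
    if len(matches) != 1:
        raise ValueError("expected exactly one CSV under %s, found %d"
                         % (mounted_path, len(matches)))
    return matches[0]


# Demonstration with a temporary folder standing in for the mount point.
with tempfile.TemporaryDirectory() as mount_point:
    nested = os.path.join(mount_point, "05-21-2020_072626_UTC")
    os.makedirs(nested)
    csv_file = os.path.join(nested, "training-data.csv")
    with open(csv_file, "w") as handle:
        handle.write("a,b\n1,2\n")
    resolved_from_folder = resolve_csv_path(mount_point)
    resolved_from_file = resolve_csv_path(csv_file)
    found_same_file = resolved_from_folder == csv_file == resolved_from_file
print(found_same_file)
```

In a training script, the resolved path could then be passed to pd.read_csv. Raising an error when zero or several CSV files are found keeps the choice of training file explicit.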


### Create training scripts

In [None]:
%%writefile train_FileDataset.py

import argparse
import os
from azureml.core import Run
import pandas as pd
from predict_employee_retention import generate_model

from sklearn.externals import joblib # Matches the older scikit-learn used on the cloud (using joblib from the latest sklearn causes an error)

# retrieve argument configured through script_params in estimator
parser = argparse.ArgumentParser()
parser.add_argument('--dataset-file', type=str, dest='dataset_file', help='path and name of the dataset file')
args = parser.parse_args()

# get hold of the current run
run = Run.get_context()

training_dataset_file = args.dataset_file

run.log('Dataset name', training_dataset_file)

# Read dataset
dataset = pd.read_csv(training_dataset_file)
run.log('Read training data from file', training_dataset_file)

# Generate model
clf = generate_model(dataset, run)

# Note: files saved in the outputs folder are automatically uploaded to the experiment record.
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=clf, filename='outputs/predict-employee-retention-model.pkl')
run.log('End of run','Training completed')

In [None]:
import shutil
shutil.copy('train_FileDataset.py', script_folder)

## Re-use an existing environment for training

This example reuses an existing environment that has been saved to the workspace.

### View a list of environments

View the environments in your workspace by using the Environment.list(workspace) method, which takes the Workspace object.

In [None]:
from azureml.core.environment import Environment

# Get the list of environments
env_list = Environment.list(workspace)
# Get 'my_env' that was saved earlier.
env_list.get('my_env')

You can also get a specific environment by name and version. The following code uses the get() method to retrieve the my_env environment from the workspace. Because no version is specified, the latest version is returned.

In [None]:
from azureml.core.environment import Environment

# Retrieve the latest version of the Environment
env = Environment.get(workspace=workspace,name="my_env")
print(env)

### Create a SKLearn estimator object 

In [None]:
from azureml.train.sklearn import SKLearn

script_params = {
    # mount the dataset on the remote compute and pass the mounted path as an argument to the training script
    '--dataset-file': file_dataset.as_named_input('dataset_file').as_mount('/tmp/training_data'),
}

est = SKLearn(source_directory=script_folder,
              script_params=script_params,
              environment_definition=env,
              entry_script='train_FileDataset.py',
              compute_target=compute_target
              )

### Submit a run to use the cloud compute

In [None]:
# Submit the estimator as part of your experiment run
run = exp.submit(est)
run

### Wait for the run to complete

Launch the widget to watch the progress. You can re-run this code to relaunch the widget.


In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

## Register model

Wait for the run to complete before you register the model. Otherwise you will see a ModelPathNotFoundException.

In [None]:
# Add metrics to tags so that this information can be used for model comparison.
metrics = ['Accuracy','Precision','Recall','F1-score']
tags = {}
for key in metrics:
    tags[key] = run.get_metrics(key).get(key)

# Register the model. Note that the metric values are stored in "tags".
model = run.register_model(model_name='predict-employee-retention-model',
                           model_path='outputs/predict-employee-retention-model.pkl',
                           tags=tags
                          )
print(model.name, model.id, model.version, model.tags, sep='\t')

Once you have registered the model, you can proceed to Tutorial #2 to deploy the model.