# Training

In this notebook, we will learn how to train an AI model in the cloud.  
A few things are special about Cloud AI training, but there are also a lot of similarities with our old-school way of working.

Again, let us start by setting some global parameters.

In [None]:
INITIAL_LEARNING_RATE = 0.01
MAX_EPOCHS = 50
BATCH_SIZE = 32
PATIENCE = 11
model_name = 'animal-cnn'

And of course importing the packages we need! Again, don't forget to set your kernel right in the top-right corner!

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import cv2

import os
from glob import glob
import warnings

warnings.filterwarnings("ignore")
import random
SEED = 42   # set random seed
random.seed(SEED)

from typing import List

In [None]:
## Import AzureML packages
from azureml.core import Workspace
from azureml.core import Dataset
from azureml.data.datapath import DataPath
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

One special import is the Utils script. You can read more about it in the `utils > utils.py` file. I import it here because it contains some helper functions we will need later on.

In [None]:
from utils.utils import *

# Step 1: Connect Workspace

Follow the same steps as the previous notebook, to set up your Workspace configuration!

In [None]:
## Either get environment variables, or a fallback name, which is the second parameter.
## Currently, fill in the fallback values. Later on, we will make sure to work with Environment values. So we're already preparing for it in here!
workspace_name = os.environ.get('WORKSPACE', 'MLOps-Workshop')
subscription_id = os.environ.get('SUBSCRIPTION_ID', '763622cd-d9e1-46f1-84c7-635df9708641')
resource_group = os.environ.get('RESOURCE_GROUP', 'TETRA-Workshop-2406')

In [None]:
ws = Workspace.get(name=workspace_name,
               subscription_id=subscription_id,
               resource_group=resource_group)
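
The "environment variable or fallback" pattern used above can be wrapped in a small helper. This is only a sketch: `get_setting` is a hypothetical name that is not part of the workshop's utils, and treating an empty variable as unset is an assumption made here.

```python
import os


def get_setting(name, fallback, cast=str):
    """Return the environment variable `name` converted with `cast`.

    Falls back to `fallback` when the variable is unset or empty.
    (Hypothetical helper, not part of utils.py.)
    """
    raw = os.environ.get(name)
    if raw is None or raw == "":
        return fallback
    return cast(raw)


# Environment variables are always strings, so numeric settings need a cast.
workspace_name = get_setting("WORKSPACE", "MLOps-Workshop")
max_nodes = get_setting("AML_COMPUTE_CLUSTER_MAX_NODES", 4, cast=int)
```

Keeping the cast next to the lookup avoids passing a string such as `"4"` to an API that expects an integer.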

## Step 1.1 -- Create Compute Cluster

A Compute Cluster is a combination of multiple Compute Instances. Azure scales these machines between the minimum and maximum number of nodes we fill into the configuration.  
Based on the number of Jobs we want to run in parallel, multiple machines will be created.

With a minimum of 0 machines, Azure needs some time to create at least one machine every time we need one.
If you keep the minimum at 1, you always have one that's ready for your development.
The idle time before scaling back down to the minimum can also be configured if required.

In [None]:
import os

# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "segers-nathan-cpu-cluster")
# Environment variables are always strings, so cast the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 1))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_D2_V2")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print("found compute target: " + compute_name)
else:
    print("creating new compute target...")
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # For a more detailed view of current AmlCompute status, use get_status()
    print(compute_target.get_status().serialize())
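
To get a feeling for how the minimum and maximum node counts interact with the number of parallel Jobs, here is a deliberately simplified model. It assumes every Job needs exactly one node and ignores scale-up and scale-down delays; `nodes_in_use` is a hypothetical helper, not an AzureML function.

```python
def nodes_in_use(queued_jobs, min_nodes, max_nodes):
    """Simplified model: one node per job, clamped to [min_nodes, max_nodes]."""
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("expected 0 <= min_nodes <= max_nodes")
    return max(min_nodes, min(queued_jobs, max_nodes))


# Four jobs on a cluster that scales between 0 and 4 nodes use all four nodes.
busy = nodes_in_use(queued_jobs=4, min_nodes=0, max_nodes=4)

# With no jobs queued, a minimum of 1 keeps one node running (and billed).
idle = nodes_in_use(queued_jobs=0, min_nodes=1, max_nodes=4)
```

Under this model, submitting more Jobs than `max_nodes` simply leaves the extra Jobs waiting in the queue.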

## Find and download datasets

In [None]:
datasets = Dataset.get_all(workspace=ws) # Make sure to give our workspace with it
print(datasets)

# Step 2: Create an AI model and training code

We will first create an AI model to use in our training script.  
A basic AI model has been given in the `utils/utils.py` file. You can change it there if you want to.

In this step, we will also configure a Training script. This script is an Executable Python script.  
This is slightly different from our other way of working, where we work with Notebooks.

Because Azure will be launching and running our Python scripts, we need to create one file that can be executed in one go.
It needs all our imports, packages, data ... ready without manual intervention.

We'll store all of these files into a scripts directory. That way we can upload that directory to our training VM later.
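
Since the script has to run without manual intervention, every setting must reach it through command-line arguments. The sketch below shows one possible way to do that with `argparse`. The first three arguments mirror the training script; the remaining flag names (`--learning-rate`, `--batch-size`, `--patience`, `--model-name`, `--seed`) and all default values are assumptions chosen for illustration.

```python
import argparse


def build_parser():
    """Build a parser for the training script (sketch; extra flags are hypothetical)."""
    parser = argparse.ArgumentParser(description="Train the animal classifier")
    parser.add_argument('--training-folder', type=str, dest='training_folder',
                        help='training folder mounting point')
    parser.add_argument('--testing-folder', type=str, dest='testing_folder',
                        help='testing folder mounting point')
    parser.add_argument('--epochs', type=int, dest='epochs', default=50,
                        help='The amount of Epochs to train')
    # Hypothetical additions that could replace hardcoded values.
    parser.add_argument('--learning-rate', type=float, dest='learning_rate', default=0.01)
    parser.add_argument('--batch-size', type=int, dest='batch_size', default=32)
    parser.add_argument('--patience', type=int, dest='patience', default=11)
    parser.add_argument('--model-name', type=str, dest='model_name', default='animal-cnn-test')
    parser.add_argument('--seed', type=int, dest='seed', default=42)
    return parser


# Passing an explicit list makes the parser easy to try out inside a notebook.
args = build_parser().parse_args(['--epochs', '25', '--batch-size', '64'])
```

Arguments that are not passed fall back to their defaults, so existing calls that only set `--epochs` would keep working.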

### Step 2.1 -- Prepare the scripts

In [None]:
script_folder = os.path.join(os.getcwd(), 'scripts')
os.makedirs(script_folder, exist_ok=True)

In [None]:
%%writefile $script_folder/train.py

import argparse
import os
from glob import glob
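import random

# np.save is used at the end of this script; import NumPy explicitly instead of relying on the star import from utils
import numpy as np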

# This time we will need our Tensorflow Keras libraries, as we will be working with the AI training now
from tensorflow import keras
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from sklearn.metrics import classification_report, confusion_matrix

# This AzureML package allows us to log our metrics etc.
from azureml.core import Run

# Important to load in the utils as well!
from utils import *


### HARDCODED VARIABLES FOR NOW
### TODO for the students:
### Make sure to adapt the ArgumentParser below to include these parameters
### You can base your answer on the lines that are already there

SEED = 42
INITIAL_LEARNING_RATE = 0.01
BATCH_SIZE = 32
PATIENCE = 11
model_name = 'animal-cnn-test'


parser = argparse.ArgumentParser()
parser.add_argument('--training-folder', type=str, dest='training_folder', help='training folder mounting point')
parser.add_argument('--testing-folder', type=str, dest='testing_folder', help='testing folder mounting point')
parser.add_argument('--epochs', type=int, dest='epochs', help='The amount of Epochs to train')
args = parser.parse_args()


training_folder = args.training_folder
print('Training folder:', training_folder)

testing_folder = args.testing_folder
print('Testing folder:', testing_folder)

MAX_EPOCHS = args.epochs

# As we're mounting the training_folder and testing_folder onto the `/mnt/data` directories, we can load in the images by using glob.
training_paths = glob(os.path.join('/mnt/data/train', '**', 'processed_animals', '**', '*.jpg'), recursive=True)
testing_paths = glob(os.path.join('/mnt/data/test', '**', 'processed_animals', '**', '*.jpg'), recursive=True)

print("Training samples:", len(training_paths))
print("Testing samples:", len(testing_paths))

# Shuffle with a fixed seed, so the order is the same on every run
random.seed(SEED)
random.shuffle(training_paths)
random.seed(SEED)
random.shuffle(testing_paths)

print(training_paths[:3]) # Examples
print(testing_paths[:3]) # Examples

# Parse to Features and Targets for both Training and Testing. Refer to the Utils package for more information
X_train = getFeatures(training_paths)
y_train = getTargets(training_paths)

X_test = getFeatures(testing_paths)
y_test = getTargets(testing_paths)

print('Shapes:')
print(X_train.shape)
print(X_test.shape)
print(len(y_train))
print(len(y_test))

# Make sure the data is one-hot-encoded
LABELS, y_train, y_test = encodeLabels(y_train, y_test)
print('One Hot Shapes:')

print(y_train.shape)
print(y_test.shape)

# Create an output directory where our AI model will be saved to.
# Everything inside the `outputs` directory will be logged and kept aside for later usage.
model_path = os.path.join('outputs', model_name)
os.makedirs(model_path, exist_ok=True)

## START OUR RUN context.
## We can now log interesting information to Azure, by using these methods.
run = Run.get_context()

# Save the best model, not the last
cb_save_best_model = keras.callbacks.ModelCheckpoint(filepath=model_path,
                                                         monitor='val_loss', 
                                                         save_best_only=True, 
                                                         verbose=1)

# Early stop when the val_loss isn't improving for PATIENCE epochs
cb_early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', 
                                              patience= PATIENCE,
                                              verbose=1,
                                              restore_best_weights=True)

# Reduce the Learning Rate when not learning more for 4 epochs.
cb_reduce_lr_on_plateau = keras.callbacks.ReduceLROnPlateau(factor=.5, patience=4, verbose=1)

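# Define the Optimizer. `learning_rate` replaces the deprecated `lr` argument.
# Note: recent TensorFlow releases only accept `decay` on the legacy optimizers (tf.keras.optimizers.legacy.SGD).
opt = SGD(learning_rate=INITIAL_LEARNING_RATE, decay=INITIAL_LEARNING_RATE / MAX_EPOCHS)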

model = buildModel((64, 64, 3), 3) # Create the AI model as defined in the utils script.

model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])

# Construct & initialize the image data generator for data augmentation
# Image augmentation allows us to construct “additional” training data from our existing training data 
# by randomly rotating, shifting, shearing, zooming, and flipping. This is to avoid overfitting.
# It also allows us to fit AI models using a Generator, so we don't need to capture the whole dataset in memory at once.
aug = ImageDataGenerator(rotation_range=30, width_shift_range=0.1,
                         height_shift_range=0.1, shear_range=0.2, zoom_range=0.2,
                         horizontal_flip=True, fill_mode="nearest")


# Train the network. `fit` accepts generators directly; `fit_generator` is deprecated.
history = model.fit(aug.flow(X_train, y_train, batch_size=BATCH_SIZE),
                    validation_data=(X_test, y_test),
                    steps_per_epoch=len(X_train) // BATCH_SIZE,
                    epochs=MAX_EPOCHS,
                    callbacks=[cb_save_best_model, cb_early_stop, cb_reduce_lr_on_plateau])

print("[INFO] evaluating network...")
predictions = model.predict(X_test, batch_size=BATCH_SIZE)
print(classification_report(y_test.argmax(axis=1), predictions.argmax(axis=1), target_names=['cats', 'dogs', 'panda'])) # Give the target names to easier refer to them.
# If you want, you can enter the target names as a parameter as well, in case you ever adapt your AI model to more animals.

cf_matrix = confusion_matrix(y_test.argmax(axis=1), predictions.argmax(axis=1))
print(cf_matrix)

### TODO for students
### Find a way to log more information to the Run context.

# Save the confusion matrix to the outputs.
np.save('outputs/confusion_matrix.npy', cf_matrix)

print("DONE TRAINING")


In [None]:
# Copy the Utils file into the script_folder
import shutil
shutil.copy('utils/utils.py', script_folder)

### Step 2.2 -- Prepare the environment

The training script we have just defined still needs some more information before we can start it.  
We'll need to define its Anaconda or Pip environment with all the packages that should be installed prior to training.  
We can re-use the environments later, or we can use environments other people have created for us.

You can also customize the Base Docker image to train on, if you prefer. I won't use this in here.

In [None]:
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Create an Environment name for later use
environment_name = os.environ.get('TRAINING_ENV_NAME', 'animals-classification-env-training')
env = Environment(environment_name)

# It's called CondaDependencies, but you can also use pip packages ;-)
env.python.conda_dependencies = CondaDependencies.create(
        # Using opencv-python-headless is interesting to skip the overhead of packages that we don't need in a headless-VM.
        pip_packages=['azureml-dataset-runtime[pandas,fuse]', 'azureml-defaults', 'tensorflow', 'scikit-learn', 'opencv-python-headless']
    )
# Register environment to re-use later
env.register(workspace = ws)
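
Note that pip package names and Python import names are not always the same, which is easy to trip over when defining an environment. The sketch below checks whether a list of modules can be imported in the interpreter it runs in. `missing_modules` and the `IMPORT_NAMES` mapping are hypothetical helpers for illustration. The check only says something about the interpreter it runs in, so it would have to run on the training VM to validate the Azure environment.

```python
import importlib.util

# pip package name -> top-level module name (illustrative mapping)
IMPORT_NAMES = {
    'tensorflow': 'tensorflow',
    'scikit-learn': 'sklearn',
    'opencv-python-headless': 'cv2',
}


def missing_modules(module_names):
    """Return the top-level modules that cannot be found by the import system."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


# Lists whichever of the modules above are absent from the current interpreter.
not_installed = missing_modules(IMPORT_NAMES.values())
```

An empty list means all modules were found, while any returned names point at packages that still need to be added to the environment.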

### Step 2.3 -- Prepare the ScriptRunConfig

A **ScriptRunConfig** is a configuration that contains all the information needed to launch a Job inside an Experiment.
It specifies the directory of scripts to use, the **name** of the script to start,
the **arguments** to pass into that script, the **compute** target to run the script on, and finally the **environment** to run it in.

We then need to attach such a ScriptRunConfig onto an Experiment on Azure.

In [None]:
from azureml.core import ScriptRunConfig
from azureml.core import Experiment

experiment_name = os.environ.get('EXPERIMENT_NAME', 'Animals-Classification')

exp = Experiment(workspace=ws, name=experiment_name) # Create a new experiment

experiment_runs = []

# We can start four experiments for a bunch of different epoch options
for epochs in [25, 50, 75, 100]:
    args = [
        '--training-folder', datasets['animals-training-set'].as_mount('/mnt/data/train'),
        '--testing-folder', datasets['animals-testing-set'].as_mount('/mnt/data/test'),
        '--epochs', epochs]

    script_run_config = ScriptRunConfig(source_directory=script_folder,
                      script='train.py', 
                      arguments=args,
                      compute_target=compute_target,
                      environment=env)

    run = exp.submit(config=script_run_config)
    experiment_runs.append(run) # Append it to our list of experiment runs for now. This is easy for referring later!
    print('Run started!')
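
The argument list is just a flat list of alternating flags and values. As a sketch, it can be built by a small function so that a sweep over several settings stays readable. `build_args` is a hypothetical helper, and the placeholder strings stand in for the mounted datasets used above.

```python
def build_args(training_input, testing_input, epochs):
    """Return the flat flag/value list expected by train.py."""
    return [
        '--training-folder', training_input,
        '--testing-folder', testing_input,
        '--epochs', epochs,
    ]


# One argument list per epoch setting; the strings are placeholders for
# the mounted training and testing datasets.
sweep = {epochs: build_args('<training-mount>', '<testing-mount>', epochs)
         for epochs in [25, 50, 75, 100]}
```

Each value of `sweep` could then be passed as `arguments` to its own ScriptRunConfig.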

    

### Step 2.4 -- Await the results!

Now that our experiment runs are starting, we can await the logs and results.  
It can take a while to run everything, but the 4 jobs should run in parallel if everything was configured correctly!

The cells below can help you in viewing the results, while you head out for a coffee!

I use the `experiment_runs[0]` as our run to log. It's the first one that was started.

There are a few different options below, so everyone can pick the one they prefer :-)

#### Step 2.4.1 -- Plain text output

In [None]:
# specify show_output to True for a verbose log
experiment_runs[0].wait_for_completion(show_output=True) 

#### Step 2.4.2 -- AzureML Widgets

This needs an extra package to be installed, the AzureML widgets.
(Change the environment if you're running this in a different Kernel)

In [None]:
!/anaconda/envs/azureml_py38_PT_TF/bin/python -m pip install azureml-widgets

In [None]:
from azureml.widgets import RunDetails
RunDetails(experiment_runs[0]).show()

# Step 3: The Output!

As a result of our experiments, we should have a trained AI model.  
However, more information was logged or saved as well. You can find all of it in the Run context that we started, which Azure fills in for us.

Use the documentation to find out some more information.

https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.scriptrun?view=azure-ml-py

In [None]:
run = experiment_runs[2]
# Get the files that were logged
print(run.get_file_names())

# Get the metrics
print(run.get_metrics())

In [None]:
# You can always save these details for further referencing!
run.get_details()

In [None]:
# Get a list of all the runs, so you can take one specific run based on its ID
all_runs = list(exp.get_runs())

print(all_runs)
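
Once runs log metrics, picking one run out of the list can be automated. The sketch below works on a plain dictionary of metrics per run ID. The metric name `val_accuracy` and the values are made-up placeholders, since the training script does not log this metric unless you add it yourself. In the notebook, such a dictionary could be built with `{r.id: r.get_metrics() for r in all_runs}`.

```python
def best_run_id(metrics_by_run, metric, higher_is_better=True):
    """Return the ID of the run with the best value for `metric`.

    Runs that did not log the metric are skipped; returns None if no run did.
    """
    candidates = {run_id: metrics[metric]
                  for run_id, metrics in metrics_by_run.items()
                  if metric in metrics}
    if not candidates:
        return None
    choose = max if higher_is_better else min
    return choose(candidates, key=candidates.get)


# Placeholder values, for illustration only.
example_metrics = {
    'run-a': {'val_accuracy': 0.5},
    'run-b': {'val_accuracy': 0.75},
    'run-c': {},  # this run logged nothing
}
best = best_run_id(example_metrics, 'val_accuracy')
```

For a loss-style metric, pass `higher_is_better=False` so that the lowest value wins.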

### Step 3.2 -- Registering and downloading our AI model

In Azure Machine Learning Service, we can register AI models so that they are versioned and kept together with other AI models.  
We keep track of their accuracy based on the runs they were created from.

Just a few lines are needed to register and download the AI model.

In [None]:
model_path = 'outputs/animal-cnn-test/'

In [None]:
run.download_files(prefix=model_path)

In [None]:
run.register_model(model_name,
                model_path=model_path,
                tags={'animals': 'cats, dogs, pandas', 'AI-Model': 'CNN'},
                description="Image classification on animals",
                sample_input_dataset=datasets['animals-testing-set'])