# Part 2 - Model Training
In this section of the lab we will train a multinomial logistic regression model using the bottleneck features created in the previous part of the lab.


![Transfer Learning](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/tllr.png)

We will use `scikit-learn` to configure logistic regression and run training. 
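
For reference, the model can be written out explicitly. The formulas below follow the standard softmax formulation of L2-penalized multinomial logistic regression as described in the scikit-learn documentation; they are not specific to this lab. The symbols $x_i$ (bottleneck feature vector), $y_i$ (class index), $w_k$, $b_k$ (weights and intercept of class $k$), $K$ (number of classes) and $n$ (number of training samples) are notation introduced here for illustration.

$$
p(y = k \mid x) = \frac{\exp(w_k^\top x + b_k)}{\sum_{j=1}^{K} \exp(w_j^\top x + b_j)},
\qquad
\min_{W,\, b} \; \frac{1}{2} \lVert W \rVert_F^2 + C \sum_{i=1}^{n} -\log p(y_i \mid x_i)
$$

`C` is the inverse of the regularization strength, so a larger `C` means weaker regularization. The training script in this section sets `C = 1.0 / reg`, where `reg` is the regularization rate passed on the command line.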

We will run training on a remote CPU VM. `scikit-learn` cannot use a GPU, so we will not reuse the GPU VM created in the previous stage.

![AML Arch](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/amlarch.png)



### Create training script

The trained model will be saved into the `./outputs` folder. This is one of the special folders in AML. The other one is the `./logs` folder. The content in these folders is automatically uploaded to the run history.

The script uses the AML `Run` object to track **validation accuracy**. The metric is captured at the end of training.
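
The accuracy calculation itself is small enough to try in isolation. The sketch below assumes, as the training script does, that labels are stored one-hot encoded; the label and prediction arrays are made-up values for illustration and do not come from the lab's data.

```python
import numpy as np

# Hypothetical one-hot labels for four validation images and three classes
valid_labels = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
])

# The class index is the position of the 1 in each row
y_valid = np.argmax(valid_labels, axis=1)

# Hypothetical predictions, in the form a classifier's predict() returns
y_hat = np.array([0, 1, 1, 1])

# Accuracy is the fraction of predictions that match the true class
acc = float(np.mean(y_hat == y_valid))
print("Validation accuracy:", acc)
```

Three of the four made-up predictions match, so the value logged for this toy input would be 0.75. Converting to a plain Python `float` keeps the logged metric independent of NumPy scalar types.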

#### Create a folder to hold the script

In [10]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

#### Use Jupyter `%%writefile` magic to write the script


In [11]:
%%writefile $script_folder/train.py

import os
import argparse

from azureml.core import Run

import numpy as np
import random
import h5py

from sklearn.linear_model import LogisticRegression
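try:
    import joblib  # standalone package used by newer scikit-learn releases
except ImportError:
    from sklearn.externals import joblib  # bundled copy in older scikit-learn releases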



# Training regime
def train_evaluate(run):
   
    print("Loading bottleneck features")
    train_file_name = os.path.join(args.data_folder, args.training_file_name)
    valid_file_name = os.path.join(args.data_folder, args.validation_file_name)
    
    # Load bottleneck training features and labels
    with h5py.File(train_file_name, "r") as hfile:
        train_features = np.array(hfile.get('features'))
        train_labels = np.array(hfile.get('labels'))
        
        
    # Load bottleneck validation features and labels
    with h5py.File(valid_file_name, "r") as hfile:
        valid_features = np.array(hfile.get('features'))
        valid_labels = np.array(hfile.get('labels'))
        
    # Convert one-hot labels to integers
    y_train = np.argmax(train_labels, axis=1)
    y_valid = np.argmax(valid_labels, axis=1)
    
    # Train logistic regression model
    print("Starting training on")
    print("  Features:", train_features.shape)
    print("  Labels:", y_train.shape)
    clf = LogisticRegression(
        C=1.0/args.reg, 
        multi_class='multinomial',
        solver='lbfgs',
        random_state=42)
    clf.fit(train_features, y_train)
    
    
    # Validate
    print("Starting validation")
    y_hat = clf.predict(valid_features)
    
    # Calculate accuracy 
    acc = np.average(y_hat == y_valid)
    print('Validation accuracy is:', acc)
    
    # Log to AML Experiment
    run.log('regularization_rate', float(args.reg))
    run.log('validation_acc', float(acc))
          
    # Save the trained model to outputs, a standard folder expected by AML
    model_file = 'aerial_sklearn.pkl'
    model_file = os.path.join('outputs', model_file)
    print("Saving the model to: ", model_file)
    os.makedirs('outputs', exist_ok=True)
    joblib.dump(value=clf, filename=model_file)
    

  

if __name__ == '__main__':
    parser = argparse.ArgumentParser("Training, evaluation workflow")

    ### Model parameters
    
    parser.add_argument(
        '--data-folder',
        type=str,
        default = './bottleneck',
        help='Folder with bottleneck features and labels')

    parser.add_argument(
        '--training-file-name',
        type=str,
        default = 'aerial_bottleneck_train.h5',
        help='Training file name')

    parser.add_argument(
        '--validation-file-name',
        type=str,
        default = 'aerial_bottleneck_valid.h5',
        help='Validation file name')

    parser.add_argument(
        '--regularization', 
        type=float, dest='reg', 
        default=0.01, 
        help='regularization rate')
    
    args = parser.parse_args()
    
    # get hold of the current run
    run = Run.get_submitted_run()
    train_evaluate(run)
    

Overwriting ./script/train.py
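
To see how the `--regularization` flag reaches the classifier, the sketch below rebuilds only that one argument from the script and parses a sample command line. The helper name `build_parser` and the sample value `0.8` are introduced here for illustration; the rest mirrors the `add_argument` call in `train.py`.

```python
import argparse


def build_parser():
    """Rebuild only the --regularization argument from train.py."""
    parser = argparse.ArgumentParser("Training, evaluation workflow")
    parser.add_argument(
        '--regularization',
        type=float, dest='reg',  # dest stores the value as args.reg
        default=0.01,
        help='regularization rate')
    return parser


# Parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(['--regularization', '0.8'])

# train.py passes the inverse of the rate to LogisticRegression as C
inverse_strength = 1.0 / args.reg
print("reg =", args.reg, "C =", inverse_strength)

# With no flag on the command line, the default rate applies
default_args = build_parser().parse_args([])
print("default reg =", default_args.reg)
```

Because `C` is the inverse of the rate, a small rate such as the default `0.01` gives a large `C` and therefore weak regularization.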


### Connect to AML workspace


In [12]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /data/home/demouser/notebooks/MTC_AzureAILabs/DataScienceTrack/01-aml-walkthrough-sklearn/aml_config/config.json
jkamllab
jkamllab
eastus2
952a710c-8d9c-40c1-9fec-f752138cc0b3


### Create Experiment

We will create a new experiment to manage training runs on a remote VM.

In [13]:
experiment_name = 'aerial-train-sklearn'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Create compute target

We will create a new CPU VM to run the training. 

In [14]:
from azureml.core.compute import DsvmCompute
from azureml.core.compute_target import ComputeTargetException

compute_target_name = 'cpudsvm'
compute_target_type = 'Standard_DS3_v2'

try:
    dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)
    print('Found existing DSVM:', dsvm_compute.name)
except ComputeTargetException:
    dsvm_config = DsvmCompute.provisioning_configuration(vm_size=compute_target_type)
    dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)
    dsvm_compute.wait_for_completion(show_output=True)

Creating.............................
SucceededProvisioning operation finished, operation "Succeeded"


### Configure datastore

The bottleneck files were uploaded to the workspace's default datastore during the previous step. The training run will download them from the datastore to the compute target.

In [15]:
from azureml.core import Datastore

ds = ws.get_default_datastore()
print("Using the default datastore for training data: ")
print(ds.name, ds.datastore_type, ds.account_name, ds.container_name)


Using the default datastore for training data: 
workspacefilestore AzureFile jkamllab4248191217 azureml-filestore-d531180f-918f-440a-9cbc-968e4cc0ac1b


### Create Conda run configuration

This time we will not run the script in a Docker container. Instead, we will run it in a custom Conda environment on the Data Science Virtual Machine.


In [25]:
from azureml.core.runconfig import DataReferenceConfiguration
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Run
from azureml.core import ScriptRunConfig


# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to our DSVM
conda_run_config.target = dsvm_compute.name

# set the data reference of the run configuration
dr = DataReferenceConfiguration(datastore_name=ds.name, 
                   path_on_datastore='bottleneck', 
                   path_on_compute='bottleneck',
                   mode='download', # download files from datastore to compute target
                   overwrite=True)
#conda_run_config.data_references = {ds.name: dr}

# specify CondaDependencies obj
conda_run_config.environment.python.conda_dependencies = CondaDependencies.create(conda_packages=['scikit-learn', 'h5py'])

src = ScriptRunConfig(source_directory=script_folder, 
                      script='train.py', 
                      run_config=conda_run_config, 
                      # pass the datastore reference as a parameter to the training script
                      arguments=['--data-folder', str(ds.as_download())] 
                     ) 

Submit the run.

In [26]:
tags = {"Run": "sklearn-logistic-regression"}
run = exp.submit(config=src, tags=tags)
run

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-train-sklearn,aerial-train-sklearn_1540955078313,azureml.scriptrun,Running,Link to Azure Portal,Link to Documentation


Monitor the run.

In [27]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

_UserRun()

In [20]:
run.wait_for_completion(show_output=False)

{'runId': 'aerial-train-sklearn_1540954231561',
 'target': 'dsvm',
 'status': 'Failed',
 'startTimeUtc': '2018-10-31T02:54:45.432034Z',
 'endTimeUtc': '2018-10-31T02:54:49.523891Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': '09fa9fc8-3224-4a16-bf57-fc3c314cb0ce'},
 'runDefinition': {'Script': 'train.py',
  'Arguments': ['--data-folder', '$AZUREML_DATAREFERENCE_workspacefilestore'],
  'SourceDirectoryDataStore': None,
  'Framework': 0,
  'Target': 'dsvm',
  'DataReferences': {'workspacefilestore': {'DataStoreName': 'workspacefilestore',
    'Mode': 'Download',
    'PathOnDataStore': 'bottleneck',
    'PathOnCompute': None,
    'Overwrite': True}},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'Environment': {'Python': {'InterpreterPath': 'python',
    'UserManagedDependencies': False,
    'CondaDependencies': {'name': 'project_environment',
     'dependencies': ['python=3.6.2',
      {'pip': ['azureml-defaults==0.

In [None]:
from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': ds.path('bottleneck').as_download(),
    '--training-file-name': 'aerial_bottleneck_train_vgg16.h5',
    '--validation-file-name': 'aerial_bottleneck_valid_vgg16.h5',
    '--regularization': 0.8
}

pip_packages = ['h5py','pillow','scikit-learn']

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=dsvm_compute,
                entry_script='train.py',
                pip_packages=pip_packages
                )

### Run the job

Run the experiment by submitting the estimator object.

In [None]:
tags = {"RunName": "train-sklearn"}

run = exp.submit(config=est, tags=tags)
run

Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started.

### Monitor a remote run

The first run takes **approximately 10 minutes** in total. For subsequent runs, as long as the script dependencies don't change, the same image is reused, so container startup is much faster.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the estimator. The image is uploaded to the workspace. This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than are currently available, additional nodes are added automatically.

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.


You can check the progress of a running job in multiple ways. This tutorial uses a Jupyter widget as well as a `wait_for_completion` method. 
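
If neither option is convenient, a plain polling loop is a third way to follow a run. The sketch below is not part of the lab: the helper `poll_until_finished`, the `TERMINAL_STATES` set and the simulated status sequence are all hypothetical, and the status strings are an assumption about what the run reports. With a real run you would pass a callable that returns its current status string.

```python
import time

# Assumed names of the states in which a run has stopped
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def poll_until_finished(get_status, interval_seconds=10.0, max_polls=100):
    """Call get_status() repeatedly until it returns a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        print("Run status:", status)
        if status in TERMINAL_STATES:
            return status
        time.sleep(interval_seconds)
    raise TimeoutError("Run did not finish within the polling budget")


# Simulated status sequence standing in for a real run
fake_statuses = iter(["Preparing", "Running", "Running", "Completed"])
final_status = poll_until_finished(lambda: next(fake_statuses),
                                   interval_seconds=0)
print("Final status:", final_status)
```

Taking a callable rather than a run object keeps the helper independent of the SDK, so it can be exercised locally before a remote job is submitted.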

### Jupyter widget

Watch the progress of the run with a Jupyter widget.  Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. 

Note: Currently, there is a problem with the `RunDetails` widget on the DSVM.

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

### Get log results upon completion

Model training and monitoring happen in the background. Wait until the model has completed training before running more code. 

In [None]:
run.wait_for_completion(show_output=False) # specify True for a verbose log

### Display run results

The training has completed. You can list the files generated during the run, including the logs, by calling the `Run.get_file_names()` method.

In [None]:
print(run.get_file_names())

## Next Step

You can improve the performance of the model by tuning its hyperparameters. In the next part of the lab you will use an AML feature called `hyperdrive` to try different combinations of hyperparameters by running concurrent training jobs on a GPU cluster.

Continue to `03-hyperdrive`

## Clean up resources

Before you move to the next step, you can delete the CPU VM created in this part of the lab. We will not need it anymore.

In [None]:
dsvm_compute.delete()