# Lab 1 - Experimenting in a local environment
In this lab you will experiment with `scikit-learn` classification algorithms using a small development dataset and local storage and compute. In many cases, your local development environment will not have enough compute and storage resources to train on full datasets. A common machine learning workflow is to develop and debug your training scripts in a local environment and then run training jobs on full datasets using powerful cloud compute resources.

You will use an Azure Machine Learning service Experiment to track your training runs.


In [None]:
# Check core SDK version number
import azureml.core
print("SDK version:", azureml.core.VERSION)

## Connect to AML Workspace

In [None]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

## Create AML Experiment

In Azure Machine Learning service, you can track metrics and other artifacts created during the model development process. The tracked items are stored in *Experiments* and organized in *Runs*. A *run* is a single trial of an experiment. A *run* object stores the output of the trial and lets you analyze results and access artifacts generated by the trial.

In [None]:
from azureml.core import Experiment

# Create AML Experiment
experiment_name = 'aerial-train-in-notebook'
exp = Experiment(workspace=ws, name=experiment_name)

## Experiment with `scikit-learn` classification models

You will experiment with `scikit-learn` models using raw image data as input features. This is a somewhat naive approach: experience shows that simple machine learning models don't perform well on raw image data except in very simple scenarios like the MNIST dataset. Nevertheless, we will use this approach to demonstrate how to track training progress using AML Experiment and Run objects. In the following labs you will utilize Transfer Learning to train much better classifiers.

### Download the development dataset

The datasets used in the labs have been uploaded to a public container in Azure Blob Storage.

Download the small development dataset.

In [None]:
%%sh

wget -nv https://azureailabs.blob.core.windows.net/aerialtar/aerial-tiny.zip -P /tmp
unzip -q /tmp/aerial-tiny.zip -d /tmp
ls -l /tmp/aerial-tiny

The dataset is organized into six folders, each folder containing images of a given land class.
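
Before loading the images, it can help to confirm the folder layout and see how many files each class contains. The sketch below is only an illustration: `count_images_per_class` is a hypothetical helper, not part of the lab, and it assumes every file inside a class folder is an image.

```python
import os

def count_images_per_class(root_dir):
    """Return {class_folder_name: number_of_files} for each subfolder of root_dir."""
    counts = {}
    for folder in sorted(os.listdir(root_dir)):
        folder_path = os.path.join(root_dir, folder)
        # Skip stray files at the top level; only class folders are counted
        if os.path.isdir(folder_path):
            counts[folder] = len(os.listdir(folder_path))
    return counts

# Assumes the archive was extracted to /tmp/aerial-tiny, as in the cell above
if os.path.isdir('/tmp/aerial-tiny'):
    for name, n in count_images_per_class('/tmp/aerial-tiny').items():
        print(name, n)
```

Roughly equal counts across the folders would suggest the classes are balanced, which makes plain accuracy easier to interpret later on.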

### Load and label images

Load the images into a `numpy` array and assign numeric labels representing land classes.

In [None]:
import os
import numpy as np
from PIL import Image

# Define a utility function to load images from a folder
def load_images(input_dir):
    label_to_integer = {
        "Barren": 0,
        "Cultivated": 1,
        "Developed": 2,
        "Forest": 3,
        "Herbaceous": 4,
        "Shrub": 5}
    
    image_list = [(np.array(Image.open(os.path.join(input_dir, folder, filename))), label_to_integer[folder])
             for folder in os.listdir(input_dir)
             for filename in os.listdir(os.path.join(input_dir, folder))]
    
    images, labels = zip(*image_list)
    
    return np.asarray(images), np.asarray(labels)


# Load images
images_dir = '/tmp/aerial-tiny'
images, labels = load_images(images_dir)

print(images.shape)
print(labels.shape)

The images are in the `PNG` format. After loading, the images are represented by rank 3 tensors of shape `(224, 224, 3)`. The color encoding is `RGB`.

In this notebook you will experiment with `sklearn` classification models. Most `sklearn` algorithms require each input sample to be represented by a rank 1 tensor, that is, a vector. Since our images are rank 3 tensors of shape `(224, 224, 3)`, we need to flatten them to shape `(150528,)`.
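
As a quick check of the shape arithmetic, here is a small sketch on synthetic data. The `fake_images` array is a made-up stand-in for the real images, used only so the example runs without the dataset.

```python
import numpy as np

# Synthetic stand-in for the loaded images: 4 RGB images of 224 x 224 pixels
fake_images = np.zeros((4, 224, 224, 3), dtype=np.uint8)

# Keep the first axis (one row per image) and collapse the rest into one vector
flat = fake_images.reshape(fake_images.shape[0], -1)

print(flat.shape)  # (4, 150528), since 224 * 224 * 3 = 150528
```

Passing `-1` lets `numpy` infer the size of the flattened axis, so the same call works if the image dimensions change.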

In [None]:
# Reshape the images
X = images.reshape(images.shape[0], -1)

print("Input data:")
print("  Images: ", X.shape)
print("  Labels: ", labels.shape)

### Train Logistic Regression model

We will start with a simple logistic regression model.

#### Start training
The code snippet below uses AML Experiment and Run objects to log hyperparameter values, evaluation metrics, and the serialized model. Note that the training process runs within the notebook's kernel but the tracked artifacts are pushed to the cloud.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
import joblib
from sklearn.model_selection import train_test_split

# Divide the dataset in training and validation
X_train, X_validate, y_train, y_validate = train_test_split(X, labels,
                                                           test_size=0.2,
                                                           shuffle=True,
                                                           random_state=1,
                                                           stratify=labels)

# Initialize Experiment logging
run = exp.start_logging()

# Log run description and hyper-parameter values
run.tag("Description", "Naive attempt to fit logistic regression to aerial image data")
run.log("Algorithm", "logistic regression")
run.log("Hyperparameter:Solver", "lbfgs")
run.log("Hyperparameter:C", 1.0)

# Train logistic regression
print("Starting training ...")
lr = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    C = 1.0,
    verbose=1)

lr.fit(X_train, y_train)
print("Training completed.")

# Evaluate the model on validation images
print("Starting evaluation")
y_hat = lr.predict(X_validate)
val_accuracy = np.average(y_hat == y_validate)
print("Validation accuracy:", val_accuracy)
run.log('Validation accuracy', val_accuracy)

# Save and upload the model
joblib.dump(value=lr, filename='model.pkl')
run.upload_file(name='outputs/model.pkl', path_or_stream='./model.pkl')

# Finalize the run
run.complete()


You can browse the recorded run in the Azure portal. The run's hyperparameters, performance measures, and the serialized model are all stored in the Experiment.
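
Overall accuracy is a single number and does not show which land classes are confused with each other. One possible way to get a per-class view is sketched below using plain `numpy`. The helper names `confusion_matrix_counts` and `per_class_recall` are hypothetical, and the toy labels are invented for illustration; `sklearn.metrics.confusion_matrix` offers similar functionality.

```python
import numpy as np

def confusion_matrix_counts(y_true, y_pred, n_classes):
    """Count predictions; rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    # np.add.at accumulates correctly even when index pairs repeat
    np.add.at(matrix, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return matrix

def per_class_recall(matrix):
    """Fraction of each true class that was predicted correctly."""
    totals = matrix.sum(axis=1)
    # Classes with no samples get a recall of 0 instead of a division error
    return np.divide(np.diag(matrix), totals,
                     out=np.zeros(len(totals)), where=totals > 0)

# Toy labels, invented for illustration only
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0]
cm = confusion_matrix_counts(y_true, y_pred, n_classes=3)
print(cm)
print(per_class_recall(cm))
```

With the lab's data, the same helpers could be called as `confusion_matrix_counts(y_validate, y_hat, 6)`, assuming the six integer labels defined in `load_images`.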

As shown by the validation accuracy, our model's performance is rather abysmal. Logistic regression can only learn linear decision boundaries and cannot handle a complex dataset like our land images. Let's try an ML algorithm with more capacity - Random Forest.

### Train Random Forest model

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize logging
run = exp.start_logging()

# Log run description and hyper-parameter values
run.tag("Description", "Another naive attempt to train on aerial image data - random forests")
run.log("No of trees", 100)
run.log("Max Depth", 7)

# Train random forest
print("Starting training ...")
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=7,
    verbose=1)

rf.fit(X_train, y_train)
print("Training completed.")

# Evaluate the model on validation images
print("Starting evaluation")
y_hat = rf.predict(X_validate)
val_accuracy = np.average(y_hat == y_validate)
print("Validation accuracy:", val_accuracy)
run.log('Validation accuracy', val_accuracy)

# Save and upload the model
joblib.dump(value=rf, filename='model.pkl')
run.upload_file(name='outputs/model.pkl', path_or_stream='./model.pkl')

# Finalize the run
run.complete()

This is much better than logistic regression but still pretty bad. We could attempt to fine-tune hyperparameters or try other machine learning algorithms, but rather than pursuing this naive approach we will apply a proven technique that has emerged in recent years - Transfer Learning.

## Next Step

In the next lab you will utilize a pre-trained deep neural network to extract powerful features from images and use them to train a better performing classifier.
