# Experimenting in a local environment
In this lab you will experiment with a couple of machine learning algorithms using a small development dataset and local storage and compute. You will use the Azure Machine Learning service to track your training runs.


### Download the development dataset

You will experiment using a small development dataset. In many cases, your local development environment will not have enough computational and storage resources to support experimentation on full datasets. A common machine learning workflow pattern is to develop and debug your training scripts in a local development environment and then run training jobs on full datasets using powerful cloud compute resources.

In [None]:
%%sh

wget -nv https://azureailabs.blob.core.windows.net/aerialtar/aerialtiny.tar.gz
tar -xf aerialtiny.tar.gz
ls -l aerialtiny

In [None]:
%%sh
ls -l aerialtiny/train

### Load the development dataset

Load the training and validation datasets into `numpy` arrays. The images are in `PNG` format. The size is `(224, 224, 3)` and the color encoding is `RGB`.

In [None]:
import os
import numpy as np
from skimage.io import imread

# Define a utility function to load images from a folder
def load_images(input_dir):
    label_to_integer = {
        "Barren": 0,
        "Cultivated": 1,
        "Developed": 2,
        "Forest": 3,
        "Herbaceous": 4,
        "Shrub": 5}
    
    images = [(imread(os.path.join(input_dir, folder, filename)), label_to_integer[folder])
             for folder in os.listdir(input_dir)
             for filename in os.listdir(os.path.join(input_dir, folder))]
    
    images, labels = zip(*images)
    
    return np.asarray(images), np.asarray(labels)


# Load training images
training_images_dir = 'aerialtiny/train'
training_images, training_labels = load_images(training_images_dir)

# Load validation images
validation_images_dir = 'aerialtiny/valid'
validation_images, validation_labels = load_images(validation_images_dir)

print(training_images.shape)
print(training_labels.shape)
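
Before training, it can help to confirm that the loaded arrays have the layout described above. The following is a minimal sketch and not part of the original lab: `summarize_dataset` is a hypothetical helper, and the random arrays are synthetic stand-ins for the output of `load_images`, so the snippet runs without the dataset.

```python
import numpy as np

def summarize_dataset(images, labels, num_classes=6):
    """Check the expected (N, 224, 224, 3) layout and count images per class."""
    if images.ndim != 4 or images.shape[1:] != (224, 224, 3):
        raise ValueError(f"Unexpected image array shape: {images.shape}")
    if images.shape[0] != labels.shape[0]:
        raise ValueError("Number of images and labels differ")
    # np.bincount counts how often each integer label occurs
    return np.bincount(labels, minlength=num_classes)

# Synthetic stand-in data (assumption: the real images are 8-bit RGB)
rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(12, 224, 224, 3), dtype=np.uint8)
fake_labels = np.arange(12) % 6  # two images per class

class_counts = summarize_dataset(fake_images, fake_labels)
print(class_counts)
```

With the real data you would call `summarize_dataset(training_images, training_labels)`; a strongly uneven count would be worth keeping in mind when reading accuracy numbers later.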

### Train a local model
In this step you will train a logistic regression model from `scikit-learn` directly on image pixel data. This is a somewhat naive approach: experience shows that simple machine learning models don't perform well on raw image data except in very simple scenarios like the MNIST dataset. Nevertheless, we will use this approach to demonstrate how to track training progress using AML Experiment and Run entities.

#### Connect to AML Workspace

In [None]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

#### Preprocess images
Since our images are in `(224, 224, 3)` RGB format, we need to flatten each one into a one-dimensional vector to conform to the input shape required by logistic regression and other scikit-learn ML algorithms.

In [None]:
# Reshape training and validation datasets
X_train = training_images.reshape(training_images.shape[0], -1)
X_validate = validation_images.reshape(validation_images.shape[0], -1)
y_train = training_labels
y_valid = validation_labels

# Print the shape of inputs
print("Input data:")
print("  Training images: ", X_train.shape)
print("  Training labels: ", y_train.shape)
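
To make the effect of flattening concrete, here is a small sketch on a synthetic batch (the zero-filled array is an assumption standing in for real images). Each image becomes a single row of 224 × 224 × 3 = 150,528 pixel values.

```python
import numpy as np

# Synthetic stand-in for a batch of four 224 x 224 RGB images
batch = np.zeros((4, 224, 224, 3), dtype=np.uint8)

# Keep the first axis (one row per image) and collapse the remaining axes
flat = batch.reshape(batch.shape[0], -1)
print(flat.shape)

# Flattening only rearranges the layout; the original shape is recoverable
restored = flat.reshape(-1, 224, 224, 3)
```

Note that flattening discards the explicit spatial structure: the model sees 150,528 independent features per image, which is one reason simple models struggle on raw pixels.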

#### Train a model
We will track the model's hyper-parameters, performance, and serialized model file in an AML Experiment.

In [None]:
from azureml.core import Experiment

# Create AML Experiment
experiment_name = 'aerial-train-in-notebook'
exp = Experiment(workspace=ws, name=experiment_name)

# Initialize logging
run = exp.start_logging()

# Log run description and hyper-parameter values
run.tag("Description", "Naive attempt to fit logistic regression to aerial image data")
run.log("Solver", "lbfgs")
run.log("C", 1.0)
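
One way to keep the logged hyper-parameters and the values actually used for training from drifting apart is to define them once in a dictionary. The sketch below is an illustration only: `log_hyperparameters` is a hypothetical helper, and `RecordingRun` is a stand-in that mimics nothing more than a `log(name, value)` method so the snippet runs without a workspace.

```python
class RecordingRun:
    """Hypothetical stand-in exposing only a log(name, value) method."""

    def __init__(self):
        self.metrics = {}

    def log(self, name, value):
        self.metrics[name] = value


def log_hyperparameters(run, params):
    """Log every hyper-parameter in `params` to the given run."""
    for name, value in params.items():
        run.log(name, value)


# Single source of truth for the hyper-parameters
lr_params = {"solver": "lbfgs", "C": 1.0}

demo_run = RecordingRun()
log_hyperparameters(demo_run, lr_params)
print(demo_run.metrics)
```

In the notebook you would pass the real `run` object instead of `demo_run`, and the same dictionary could be unpacked into the estimator, for example `LogisticRegression(**lr_params)`.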

Start training.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib

# Train logistic regression
print("Starting training ...")
lr = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    C = 1.0,
    verbose=1)

lr.fit(X_train, y_train)
print("Training completed.")

# Evaluate the model on validation images
print("Starting evaluation")
y_hat = lr.predict(X_validate)
val_accuracy = np.average(y_hat == y_valid)
print("Validation accuracy:", val_accuracy)
run.log('Validation accuracy', val_accuracy)

# Save and upload the model
joblib.dump(value=lr, filename='model.pkl')
run.upload_file(name='outputs/model.pkl', path_or_stream='./model.pkl')

# Finalize the run
run.complete()


You can browse the recorded run in the Azure portal.
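
The validation accuracy logged for the run is simply the fraction of validation images whose predicted label matches the true label. A useful reference point is the accuracy of always predicting the most frequent class. The sketch below illustrates both on made-up labels; the helper names and the label arrays are assumptions for illustration, not results from this lab.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    counts = np.bincount(np.asarray(y_true))
    return float(counts.max() / counts.sum())

# Made-up labels for six classes (illustration only)
y_true = np.array([0, 1, 2, 3, 4, 5, 0, 1])
y_pred = np.array([0, 1, 2, 0, 0, 0, 0, 5])

acc = accuracy(y_true, y_pred)               # 4 of 8 predictions are correct
base = majority_baseline_accuracy(y_true)    # most frequent class covers 2 of 8
print(acc, base)
```

With six roughly balanced classes the baseline sits near 1/6, so a validation accuracy close to that value means the model has learned very little.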

As shown by the validation accuracy, our model's performance is rather abysmal. Logistic regression can only learn linear decision boundaries. Let's try an ML algorithm with more capacity - Random Forest.

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Initialize logging
run = exp.start_logging()

# Log run description and hyper-parameter values
run.tag("Description", "Another naive attempt to train on aerial image data - random forests")
run.log("No of trees", 100)
run.log("Max Depth", 7)

# Train random forest
print("Starting training ...")
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=7,
    verbose=1)

rf.fit(X_train, y_train)
print("Training completed.")

# Evaluate the model on validation images
print("Starting evaluation")
y_hat = rf.predict(X_validate)
val_accuracy = np.average(y_hat == y_valid)
print("Validation accuracy:", val_accuracy)
run.log('Validation accuracy', val_accuracy)

# Save and upload the model
joblib.dump(value=rf, filename='model.pkl')
run.upload_file(name='outputs/model.pkl', path_or_stream='./model.pkl')

# Finalize the run
run.complete()

This is much better than logistic regression but still pretty bad. We could attempt to fine-tune hyper-parameters or try other machine learning algorithms, but rather than pursuing the approach of training a model on "raw" image data, we will apply a proven technique that has emerged in recent years: Transfer Learning.

## Next Step

In the next lab you will utilize a pre-trained deep neural network to extract powerful features from images and use them to train a better performing classifier.
