# Experimenting in a local environment
In this lab you will experiment with a couple of machine learning algorithms using a small development dataset and local storage and compute. You will use the Azure Machine Learning (AML) service to track your training runs.


### Connect to AML workspace
To connect to the workspace created during the lab setup, load the **config.json** file. `Workspace.from_config()` locates this file and reads the workspace details from it.


In [1]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /home/demouser/repos/MTC_AzureAILabs/DataScienceTrack/01-aml-walkthrough-sklearn/aml_config/config.json
jkaml1
jkaml1
eastus2
952a710c-8d9c-40c1-9fec-f752138cc0b3


### Create Experiment

You will record your training runs in an AML Experiment. A workspace can have multiple experiments.

In [2]:
experiment_name = 'aerial-experiment'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)
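
The experiment object is what training runs are recorded against. As a minimal sketch of how a validation result might be logged, the helper below accepts any object with a `log(name, value)` method, such as the `Run` returned by `exp.start_logging()`. The helper name `log_validation_result` and the `PrintingRun` stand-in are hypothetical and exist only so the sketch can be tried without a workspace connection; the accuracy value is a placeholder.

```python
# Sketch only: `run` is any object exposing log(name, value), for example
# the azureml.core.Run returned by exp.start_logging().
def log_validation_result(run, model_name, accuracy):
    """Record which model was trained and its validation accuracy on a run."""
    run.log("model", model_name)
    run.log("validation_accuracy", float(accuracy))


class PrintingRun:
    """Hypothetical offline stand-in for a Run; it keeps values in a dict."""

    def __init__(self):
        self.logged = {}

    def log(self, name, value):
        self.logged[name] = value
        print(name, "=", value)


# Placeholder value for illustration, not a result from this lab.
offline_run = PrintingRun()
log_validation_result(offline_run, "logistic_regression", 0.4)
```

With a live workspace, the same helper could be called as `run = exp.start_logging()`, then `log_validation_result(run, "logistic_regression", accuracy)`, followed by `run.complete()` to close the run.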

### Download the development dataset

You will experiment using a small development dataset. In many cases, your local development environment will not have enough computational and storage resources to support experimentation on full datasets. A common machine learning workflow pattern is to develop and debug your training scripts in a local development environment and then run training jobs on full datasets using powerful cloud compute resources.

In [3]:
%%sh

wget -nv https://azureailabs.blob.core.windows.net/aerialtar/aerialtiny.tar.gz
tar -xf aerialtiny.tar.gz
ls -l aerialtiny

2018-10-28 01:36:52 URL:https://azureailabs.blob.core.windows.net/aerialtar/aerialtiny.tar.gz [103131426/103131426] -> "aerialtiny.tar.gz" [1]


In [7]:
%%sh
ls -l aerialtiny/train

total 104
drwxrwxr-x 2 demouser demouser 20480 Oct  6 17:46 Barren
drwxrwxr-x 2 demouser demouser 16384 Oct  6 17:46 Cultivated
drwxrwxr-x 2 demouser demouser 16384 Oct  6 17:46 Developed
drwxrwxr-x 2 demouser demouser 16384 Oct  6 17:46 Forest
drwxrwxr-x 2 demouser demouser 20480 Oct  6 17:46 Herbaceous
drwxrwxr-x 2 demouser demouser 16384 Oct  6 17:46 Shrub
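
The listing shows one folder per class. Before training it can be useful to check how many images each class holds, since class balance affects how accuracy should be read. The helper below is a small sketch, assuming every subfolder of a split is a class and every file inside it is an image; the name `count_images_per_class` is hypothetical.

```python
import os


def count_images_per_class(split_dir):
    """Return {class_folder_name: number_of_entries} for one dataset split."""
    counts = {}
    for folder in sorted(os.listdir(split_dir)):
        folder_path = os.path.join(split_dir, folder)
        # Skip stray files that sit next to the class folders.
        if os.path.isdir(folder_path):
            counts[folder] = len(os.listdir(folder_path))
    return counts


# Only runs when the dataset has been extracted into the working directory.
if os.path.isdir("aerialtiny/train"):
    for name, n in count_images_per_class("aerialtiny/train").items():
        print(f"{name}: {n}")
```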


### Load the development dataset

Load the training and validation datasets into `numpy` arrays. The images are in `PNG` format, each with shape `(224, 224, 3)` and `RGB` color encoding.

In [73]:
import os
import numpy as np
from skimage.io import imread

# Define a utility function to load images from a folder
def load_images(input_dir):
    label_to_integer = {
        "Barren": 0,
        "Cultivated": 1,
        "Developed": 2,
        "Forest": 3,
        "Herbaceous": 4,
        "Shrub": 5}
    
    images = [(imread(os.path.join(input_dir, folder, filename)), label_to_integer[folder])
             for folder in os.listdir(input_dir)
             for filename in os.listdir(os.path.join(input_dir, folder))]
    
    images, labels = zip(*images)
    
    return np.asarray(images), np.asarray(labels)


# Load training images
training_images_dir = 'aerialtiny/train'
training_images, training_labels = load_images(training_images_dir)

# Load validation images
validate_images_dir = 'aerialtiny/valid'
validation_images, validation_labels = load_images(validation_images_dir)

print(training_images.shape)
print(training_labels.shape)

(1060, 224, 224, 3)
(1060,)
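
The `label_to_integer` dictionary in `load_images` is written out by hand, and its integer codes follow the alphabetical order of the class folders. As an alternative sketch, the same mapping could be derived from the folder names, assuming every subfolder of the input directory is a class. The function name `build_label_map` is hypothetical.

```python
import os


def build_label_map(input_dir):
    """Map each class subfolder name to an integer, in alphabetical order."""
    class_names = sorted(
        name for name in os.listdir(input_dir)
        if os.path.isdir(os.path.join(input_dir, name)))
    return {name: index for index, name in enumerate(class_names)}


# Only runs when the dataset has been extracted into the working directory.
if os.path.isdir("aerialtiny/train"):
    print(build_label_map("aerialtiny/train"))
```

Sorting the names keeps the mapping independent of the order in which `os.listdir` returns entries, so the training and validation splits receive the same codes as long as they contain the same class folders.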


### Train a local model
Train a simple logistic regression model using `scikit-learn`. Since our images are in `(224, 224, 3)` RGB format, we need to flatten each one into a one-dimensional feature vector to conform to the logistic regression input shape.

In [82]:
from sklearn.linear_model import LogisticRegression

# Reshape training and validation datasets
X_train = training_images.reshape(training_images.shape[0], -1)
X_validate = validation_images.reshape(validation_images.shape[0], -1)
y_train = training_labels
y_valid = validation_labels

# Print the shape of inputs
print("Input data:")
print("  Training images: ", X_train.shape)
print("  Training labels: ", y_train.shape)

# Train logistic regression
print("Starting training ...")
lr = LogisticRegression(
    multi_class='multinomial',
    solver='lbfgs',
    verbose=1)

lr.fit(X_train, y_train)
print("Training completed.")

Input data:
  Training images:  (1060, 150528)
  Training labels:  (1060,)
Starting training ...
Training completed.


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.5min finished
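
Flattening explains the `150528` columns reported above: each image contributes one feature per pixel per channel. The sketch below works through that arithmetic in plain Python and shows where a given pixel lands in the flattened vector, assuming the row-major layout that `numpy` uses by default when reshaping. The helper names are hypothetical.

```python
HEIGHT, WIDTH, CHANNELS = 224, 224, 3

# Number of features per image after flattening.
N_FEATURES = HEIGHT * WIDTH * CHANNELS


def flat_index(row, col, channel):
    """Position of pixel (row, col, channel) in the row-major flattened vector."""
    return (row * WIDTH + col) * CHANNELS + channel


def unflatten_index(index):
    """Inverse of flat_index: recover (row, col, channel) from a position."""
    row, rest = divmod(index, WIDTH * CHANNELS)
    col, channel = divmod(rest, CHANNELS)
    return row, col, channel


print(N_FEATURES)
print(flat_index(0, 1, 0))
```

Neighboring pixels in the same row end up `CHANNELS` positions apart, while vertically adjacent pixels end up `WIDTH * CHANNELS` positions apart, so a linear model sees no explicit notion of spatial closeness.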


### Calculate performance measures


In [83]:
from sklearn import metrics

# Run the model on validation images
y_hat = lr.predict(X_validate)

# Display accuracy
print("Validation accuracy:", np.average(y_hat == y_valid))


Validation accuracy: 0.3978494623655914
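
A single accuracy number does not show which land-cover classes the model struggles with. As a rough sketch, per-class recall can be computed from the true and predicted labels with plain Python. The function name `per_class_recall` is hypothetical, and the labels below are toy values for illustration, not results from this lab.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Fraction of examples of each true class that were predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}


# Toy labels for illustration only.
toy_true = [0, 0, 1, 1, 1, 2]
toy_pred = [0, 1, 1, 1, 0, 2]
print(per_class_recall(toy_true, toy_pred))
```

For the lab's arrays, a call such as `per_class_recall(y_valid.tolist(), y_hat.tolist())` would give one recall value per class code.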


As indicated by the validation accuracy, our model's performance is rather abysmal. Logistic regression can only learn linear decision boundaries. Let's try an ML algorithm with more capacity: Random Forest.

In [85]:
from sklearn.ensemble import RandomForestClassifier

# Train a random forest classifier
print("Starting training ...")
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=7,
    verbose=1)

rf.fit(X_train, y_train)
print("Training completed.")

# Run the model on validation images
y_hat = rf.predict(X_validate)

# Display accuracy
print("Validation accuracy:", np.average(y_hat == y_valid))

Starting training ...
Training completed.
Validation accuracy: 0.6827956989247311


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   11.1s finished
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.0s finished


This is much better than logistic regression but still pretty bad. We could attempt to fine-tune hyperparameters or try other machine learning algorithms, but the experience of the last few decades of machine learning research teaches us that, without sophisticated feature engineering, "classical" machine learning algorithms don't perform well on tasks like non-trivial image classification. Rather than pursuing this approach, you will apply a proven technique that has emerged in recent years: Transfer Learning.
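
To put the two validation accuracies in context, it helps to compare them with a trivial baseline. With six classes, uniform guessing would score about 1/6 (roughly 0.17) if the classes were balanced. The sketch below computes a related baseline, the accuracy of always predicting the most frequent training label. The function name `majority_class_accuracy` is hypothetical, and the labels are toy values, not the lab's data.

```python
from collections import Counter


def majority_class_accuracy(train_labels, valid_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in valid_labels if label == majority_label)
    return hits / len(valid_labels)


# Toy labels for illustration only.
toy_train = [0, 0, 0, 1, 2]
toy_valid = [0, 1, 0, 2]
print(majority_class_accuracy(toy_train, toy_valid))
```

For the lab's arrays, a call such as `majority_class_accuracy(y_train.tolist(), y_valid.tolist())` would give the number to compare against.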

## Next Step

In the next lab you will utilize a pre-trained deep neural network to extract powerful features from images and use them to train a better performing classifier.
