# Problem description

To a large degree, financial data has traditionally been numeric in format.

But in recent years, non-numeric formats like image, text and audio have been introduced.  

Private companies have satellites orbiting the Earth taking photos and offering them to customers.  A financial analyst might be able to extract information from these photos that could aid in predicting the future price of a stock:

- Approximate number of customers visiting each store: count number of cars in parking lot
- Approximate activity in a factory by counting number of supplier trucks arriving and number of delivery trucks leaving
- Approximate demand for a commodity at each location: count cargo ships travelling between ports

In this assignment, we will attempt to recognize ships in satellite photos.
This would be a first step toward counting.

As in any other domain, specific knowledge of the problem area will make you a better analyst.
For this assignment, we will ignore domain-specific information and just try to use a labelled training set (photo plus a binary indicator for whether a ship is present/absent in the photo), assuming that the labels are perfect.

N.B.: It appears that a photo is labelled as having a ship present only if the ship is in the center of the photo.  Perhaps this prevents double-counting.


## Goal: problem set 1

You will need to create Sequential models in Keras to classify satellite photos.
- The features are images: a 3-dimensional collection of pixels
  - 2 spatial dimensions
  - 1 dimension with 3 features for different parts of the color spectrum: Red, Green, Blue
- The labels are either 1 (ship is present) or 0 (ship is not present)
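
A minimal sketch of this layout, using a synthetic batch rather than the real dataset. The 80x80 image size is taken from the reshape in the `json_to_numpy` helper later in this notebook; the batch size of 4, the random pixel values, and the labels are made up for illustration.

```python
import numpy as np

# Synthetic stand-in for a batch of satellite photos (values are random, not real data).
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 80, 80, 3), dtype=np.uint8)

# Axis 0 indexes examples; axes 1 and 2 are the spatial dimensions; axis 3 holds R, G, B.
num_examples, height, width, channels = batch.shape

first_image = batch[0]                # shape (80, 80, 3)
red_channel = first_image[:, :, 0]    # shape (80, 80)
center_pixel = first_image[40, 40]    # 3 values: R, G, B

# Hypothetical labels: 1 = ship present, 0 = no ship.
toy_labels = np.array([1, 0, 0, 1], dtype=np.uint8)

print(batch.shape, red_channel.shape, center_pixel.shape, toy_labels.shape)
```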

You will create several models of increasing complexity:
- A model that implements only a Classification Head (no transformations other than perhaps rearranging the image)
- A model that adds a Dense layer before the head
- (Later assignment) A model that adds Convolutional layers before the Head

## Learning objectives
- Learn how to construct Neural Networks using the Keras Sequential model
- Appreciate how layer choices impact number of weights

# This section is for instructor only: Colab used only to create data set

In [None]:
try:
  from google.colab import drive
  IN_COLAB=True
except ImportError:
  IN_COLAB=False

if IN_COLAB:
  print("We're running Colab")

In [None]:
import tensorflow as tf

print("Running TensorFlow version ",tf.__version__)

# Parse tensorflow version
import re

version_match = re.match(r"([0-9]+)\.([0-9]+)", tf.__version__)
tf_major, tf_minor = int(version_match.group(1)) , int(version_match.group(2))
print("Version {v:d}, minor {m:d}".format(v=tf_major, m=tf_minor) )

In [None]:
if IN_COLAB:
  # Mount the Google Drive at mount
  mount='/content/gdrive'
  print("Colab: mounting Google drive on ", mount)

  drive.mount(mount)
  import os
  drive_root = os.path.join(mount, "My Drive/Colab Notebooks/NYU/demo/edX")
     
  # Create drive_root if it doesn't exist
  create_drive_root = True
  if create_drive_root:
    print("\nColab: making sure ", drive_root, " exists.")
    os.makedirs(drive_root, exist_ok=True)
  
  # Change to the directory
  proj_root = os.path.join(drive_root, "ships_in_satellite_images")
  print("\nColab: Changing directory to ", proj_root)
 
else:
    print("Running locally")
    proj_root="."
    # raise RuntimeError("This notebook should be run from Colab, not on the local machine")

%cd $proj_root 
%pwd

# Standard imports

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import sklearn

import os
import math

%matplotlib inline

# Create the dataset (Instructor runs this offline; the student will be provided with an API to get the data)

In [None]:
DATA_DIR = "./resource/asnlib/publicdata/ships_in_satellite_images/data"
    
json_file =  "shipsnet.json"
json_zip_file = "2869_61115_compressed_" + json_file + ".zip"
if (not os.path.exists( os.path.join(DATA_DIR, json_file) ) ) and os.path.exists( os.path.join(DATA_DIR, json_zip_file) ):
  print("Unzipping ", json_file)
  fquoted = '"{f:s}"'.format(f=json_zip_file)
  %cd $DATA_DIR
  ! unzip $fquoted
  %cd ..



# API for students

We will define some utility routines.

This will simplify problem solving.

More importantly, it adds structure to your submission so that it may be easily graded.

- getData: Get a collection of labelled images, used as follows

  >`data, labels = getData()`
- showData: Visualize labelled images, used as follows

  >`showData(data, labels)`

- train: train a model and visualize its progress, used as follows

  >`train(model, X_train, y_train, model_name, epochs=max_epochs)`


In [None]:

import math

import os
import h5py
from nose.tools import assert_equal
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import to_categorical


import json

def getData():
  data,labels = json_to_numpy( os.path.join(DATA_DIR,json_file) )
  return data, labels

def showData(data, labels, num_cols=5):
  # Plot every image in data on a grid with num_cols columns
  (num_rows, num_cols) = ( math.ceil(data.shape[0]/num_cols), num_cols)

  fig = plt.figure(figsize=(10,10))
  # Plot each image
  for i in range(0, data.shape[0]):
      img, img_label = data[i], labels[i]
      ax  = fig.add_subplot(num_rows, num_cols, i+1)
      _ = ax.set_axis_off()
      ax.set_title(img_label)

      _ = ax.imshow(img)
  fig.tight_layout()

  return fig

def modelPath(modelName):
    return os.path.join(".", "models", modelName)

def saveModel(model, modelName): 
    model_path = modelPath(modelName)
    
    try:
        os.makedirs(model_path)
    except OSError:
        print("Directory {dir:s} already exists, files will be over-written.".format(dir=model_path))
        
    # Save JSON config to disk
    json_config = model.to_json()
    with open(os.path.join(model_path, 'config.json'), 'w') as json_file:
        json_file.write(json_config)
    # Save weights to disk
    model.save_weights(os.path.join(model_path, 'weights.h5'))
    
    print("Model saved in directory {dir:s}; create an archive of this directory and submit with your assignment.".format(dir=model_path))

def loadModel(modelName):
    model_path = modelPath(modelName)
    
    # Reload the model from the 2 files we saved
    with open(os.path.join(model_path, 'config.json')) as json_file:
        json_config = json_file.read()
  
    model = tf.keras.models.model_from_json(json_config)
    model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
    model.load_weights(os.path.join(model_path, 'weights.h5'))
    
    return model

def saveModelNonPortable(model, modelName): 
    model_path = modelPath(modelName)
    
    try:
        os.makedirs(model_path)
    except OSError:
        print("Directory {dir:s} already exists, files will be over-written.".format(dir=model_path))
        
    model.save( model_path )
    
    print("Model saved in directory {dir:s}; create an archive of this directory and submit with your assignment.".format(dir=model_path))
 
def loadModelNonPortable(modelName):
    model_path = modelPath(modelName)
    model = load_model( model_path )
    
    # Reload the model 
    return model

def MyModel(test_dir, model_path):
    # YOU MAY NOT change model after this statement !
    model = loadModel(model_path)
    
    # It should run model to create an array of predictions; we initialize it to the empty array for convenience
    predictions = []
    
    # We need to match your array of predictions with the examples you are predicting
    # The array below (ids) should have a one-to-one correspondence and identify the example you are predicting
    # For Bankruptcy: the Id column
    # For Stock prediction: the date on which you are making a prediction
    ids = []
    
    # YOUR CODE GOES HERE
    
    
    return predictions, ids

def json_to_numpy(json_file):
  # Read the JSON file (the context manager closes it even if parsing fails)
  with open(json_file) as f:
    dataset = json.load(f)

  data = np.array(dataset['data']).astype('uint8')
  labels = np.array(dataset['labels']).astype('uint8')

  # Reshape the data
  data = data.reshape([-1, 3, 80, 80]).transpose([0,2,3,1])

  return data, labels


def scale_data(data, labels):
  # Scale the data
  X = data/255.

  # Make target categorical
  y = to_categorical(labels, num_classes=2)

  return X, y

from sklearn.model_selection import train_test_split  # used by train() below
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
modelName = "Ships_in_satellite_images"
es_callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=.01, patience=2, verbose=0, mode='auto', baseline=None, restore_best_weights=True)

callbacks = [ es_callback,
              ModelCheckpoint(filepath=modelName + ".ckpt", monitor='accuracy', save_best_only=True)
              ]   

max_epochs = 30

def train(model, X, y, model_name, epochs=max_epochs):
  # Describe the model
  model.summary()

  # Compile the model
  model.compile(loss='categorical_crossentropy', metrics=['accuracy'])

  # Fix the validation set (for repeatability, not a great idea, in general)
  X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.20, random_state=42)
  
  print("Train set size: ", X_train.shape[0], ", Validation set size: ", X_valid.shape[0])

  # Use the epochs argument (previously ignored in favor of the global max_epochs)
  history = model.fit(X_train, y_train, epochs=epochs, validation_data=(X_valid, y_valid), callbacks=callbacks)
  fig, axs = plotTrain(history, model_name)

  return history, fig, axs

def plotTrain(history, model_name="???"):
  fig, axs = plt.subplots( 1, 2, figsize=(12, 5) )

  # Plot loss
  axs[0].plot(history.history['loss'])
  axs[0].plot(history.history['val_loss'])
  axs[0].set_title(model_name + " " + 'model loss')
  axs[0].set_ylabel('loss')
  axs[0].set_xlabel('epoch')
  axs[0].legend(['train', 'validation'], loc='upper left')
 
  # Plot accuracy
  axs[1].plot(history.history['accuracy'])
  axs[1].plot(history.history['val_accuracy'])
  axs[1].set_title(model_name + " " +'model accuracy')
  axs[1].set_ylabel('accuracy')
  axs[1].set_xlabel('epoch')
  axs[1].legend(['train', 'validation'], loc='upper left')

  return fig, axs


## Get the data

In [None]:
# Get the data
data, labels = getData()
print("Data shape: ", data.shape)
print("Labels shape: ", labels.shape)
print("Label values: ", np.unique(labels))

In [None]:
# Shuffle the data
data, labels = sklearn.utils.shuffle(data, labels, random_state=42)

In [None]:
showData(data[:25], labels[:25])

## Examine the image/label pairs

In [None]:
# Inspect some data (images)
num_each_label = 10

for lab in np.unique(labels):
  X_lab, y_lab = data[ labels == lab ], labels[ labels == lab]
  fig = showData( X_lab[:num_each_label], [ str(label) for label in y_lab[:num_each_label] ])
  fig.suptitle("Label: "+  str(lab), fontsize=14)
  fig.show()
  print("\n\n")


# Make sure the features are in the range [0,1]  

Before training on image data, the first step is usually scaling. The pixel values in our images lie between 0 and 255, so dividing them by 255 maps them into the range [0, 1]. In addition, we usually one-hot encode the labels. The method you may use is:
- `to_categorical()`, which is in `tensorflow.keras.utils`
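
As a rough illustration of these two steps, here is a NumPy-only sketch on a tiny made-up array. It uses `np.eye` indexing as a stand-in for `to_categorical()`; the toy pixel values and labels are assumptions for the example, and this is not the graded `scale_data` function.

```python
import numpy as np

# Toy "images": 2 examples, 2x2 pixels, 3 channels, with values in [0, 255].
toy_data = np.array([[[[0, 128, 255], [64, 64, 64]],
                      [[255, 255, 255], [0, 0, 0]]],
                     [[[10, 20, 30], [40, 50, 60]],
                      [[70, 80, 90], [100, 110, 120]]]], dtype=np.uint8)
toy_labels = np.array([1, 0], dtype=np.uint8)

# Scaling: dividing by 255 maps every pixel into [0, 1] (and converts to float).
X_toy = toy_data / 255.0

# One-hot encoding: row i of the identity matrix is the encoding of class i.
num_classes = 2
y_toy = np.eye(num_classes, dtype=np.float32)[toy_labels]

print(X_toy.min(), X_toy.max())
print(y_toy)
```

In this sketch, label 1 becomes the row `[0, 1]` and label 0 becomes `[1, 0]`.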

**Question:** Design a function named `scale_data` to scale the input dataset.  

In [None]:
# Scale the data
# Assign values for X, y
#  X: the array of features
#  y: the array of labels
# The length of X and y should be identical and equal to the length of data.
X, y = np.array([]), np.array([])

###
### YOUR CODE HERE
###


In [None]:
print('X shape: ', str(X.shape))
print('y.shape: ', str(y.shape))
print(y[0])

In [None]:
assert (data/255. == X).all()
assert y.shape == (4000, 2)

**DO NOT** shuffle the data until after we have performed the split into train/test sets
- We want everyone to have the **identical** test set for grading
- Do not change this cell


In [None]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)


# Create a model using only Classification, no data transformation (other than reshaping)

You will now build a 1-layer network (the head layer only) with `tensorflow.keras`. The modules you may need to use are:
- `Sequential()`, which is in `tensorflow.keras.models`, to build a sequential model
- `Flatten()`, `Dense()`, which are in `tensorflow.keras.layers`, to create layers for your model. 

**Question:** Build a 1-layer sequential model with 1 dense layer.  
Hints:
1. You may want to use `Flatten()` to turn each input image into a 1-dimensional vector. The input shape of the `Flatten()` layer should be the shape of a single sample.
2. The number of units in your dense layer should equal the number of unique labels. Since this is a classification problem, you may want to use the `softmax` function as your activation function.
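
To make the hints concrete, the following NumPy sketch shows what such a head computes for a batch: flatten each image, apply one affine map, then apply softmax. It is not Keras code and not the model to submit; the toy image size (4x4x3) and the random weights are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy batch: 5 images of size 4x4 with 3 channels, already scaled to [0, 1].
X_toy = rng.random((5, 4, 4, 3))
num_classes = 2

# Flatten: each image becomes a vector of 4*4*3 = 48 features.
X_flat = X_toy.reshape(X_toy.shape[0], -1)

# Dense head: one weight per (feature, class) pair plus one bias per class.
W = rng.normal(scale=0.1, size=(X_flat.shape[1], num_classes))
b = np.zeros(num_classes)
logits = X_flat @ W + b

# Softmax turns each row of logits into probabilities that sum to 1.
shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

print(X_flat.shape, probs.shape)
print(probs.sum(axis=1))
```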


In [None]:

num_cases = np.unique(labels).shape[0]

# Set model0 equal to a Keras Sequential model
model0 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:
# Train the model using the API, `train()`
model_name = "Head only"
history0, fig0, axs0 = train(model0, X_train, y_train, model_name)

**Expected outputs (there may be some differences because we only have one layer):**  
<table> 
    <tr> 
        <td>  
            Training accuracy
        </td>
        <td>
         0.7965
        </td>
    </tr>
    <tr> 
        <td>
            Validation accuracy
        </td>
        <td>
         0.8319
        </td>
    </tr>

</table>

The validation accuracy is slightly higher than the training accuracy, which suggests that our model may be underfitting.

## How many weights are in the model?
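
As a worked check on whatever Keras reports, a Dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` kernel weights plus `n_out` biases, while a Flatten layer has none. The helper below is hypothetical (it is not part of the assignment API) and is shown on toy sizes rather than the assignment's images.

```python
def dense_param_count(n_in: int, n_out: int, use_bias: bool = True) -> int:
    """Number of trainable weights in a fully connected layer (hypothetical helper)."""
    return n_in * n_out + (n_out if use_bias else 0)

# Toy example: a 4x4 RGB image flattened to 48 features, feeding 3 output units.
n_features = 4 * 4 * 3
print(dense_param_count(n_features, 3))  # 48 * 3 + 3 = 147
```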


In [None]:
# Set num_parameters0 equal to the number of weights in the model
num_parameters0 = None

###
### YOUR CODE HERE
###

print("Parameters number in model0: ", num_parameters0)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Evaluate the model

Now that the model is trained, the next step is to evaluate it on the test dataset. The function to use is `evaluate()`.
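
With the loss and metric used by `train()` (`categorical_crossentropy` and `accuracy`), the score returned by `evaluate()` holds the test loss followed by the test accuracy. The NumPy sketch below shows how those two quantities are defined, on made-up predictions; it illustrates the definitions and is not a replacement for `evaluate()`.

```python
import numpy as np

# Made-up one-hot targets and predicted class probabilities for 4 examples.
y_true = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
y_prob = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])

# Categorical cross-entropy: mean negative log-probability of the true class.
eps = 1e-7  # avoids log(0)
loss = -np.mean(np.sum(y_true * np.log(np.clip(y_prob, eps, 1.0)), axis=1))

# Accuracy: fraction of examples whose most probable class is the true class.
accuracy = np.mean(y_prob.argmax(axis=1) == y_true.argmax(axis=1))

print("loss: {l:3.2f} / accuracy: {a:3.2f}".format(l=loss, a=accuracy))
```

Here three of the four made-up predictions pick the correct class, so the accuracy is 0.75.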

In [None]:
score = model0.evaluate(X_test, y_test, verbose=0)
print("{n:s}: Test loss: {l:3.2f} / Test accuracy: {a:3.2f}".format(n=model_name, l=score[0], a=score[1]))

## Save the trained model so we don't need to run training in order to run correctness tests

In [None]:
saveModel(model0, model_name)

In [None]:
saveModelNonPortable(model0, model_name)

In [None]:
## Restore the model (make sure that it works)
model0 = loadModel(model_name)
score = model0.evaluate(X_test, y_test, verbose=0)
print("RESTORED {n:s}: Test loss: {l:3.2f} / Test accuracy: {a:3.2f}".format(n=model_name, l=score[0], a=score[1]))


In [None]:
try:
    modelt = loadModelNonPortable(model_name)
    score = modelt.evaluate(X_test, y_test, verbose=0)
    print("RESTORED NP {n:s}: Test loss: {l:3.2f} / Test accuracy: {a:3.2f}".format(n=model_name, l=score[0], a=score[1]))
except:
    print("Can't restore non-portable format")


# Create a model with a Dense layer providing 512 features, plus the Classification head

**Question:** We will now make some changes to the original model0.
1. Add a Dense layer with 512 units
2. Add a ReLU activation after the new Dense layer
3. Add a Dropout layer after the ReLU activation to prevent overfitting  

Modules you may need to use are:
- `Flatten()`
- `Dense()`
- `Dropout()` in `tensorflow.keras.layers`
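
For intuition about steps 2 and 3, here is a NumPy sketch of what ReLU and dropout do to a layer's outputs during training. It assumes the "inverted dropout" convention, in which kept activations are scaled by `1 / (1 - rate)`, as described in the Keras `Dropout` documentation; the input values and the rate of 0.5 are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up pre-activation outputs of a Dense layer for one example.
z = np.array([-2.0, -0.5, 0.0, 0.5, 1.0, 3.0])

# ReLU keeps positive values and replaces negatives with zero.
a = np.maximum(z, 0.0)

# Inverted dropout (training time only): zero each unit with probability `rate`
# and scale the survivors so the expected activation is unchanged.
rate = 0.5
keep_mask = rng.random(a.shape) >= rate
a_dropped = np.where(keep_mask, a / (1.0 - rate), 0.0)

print(a)
print(a_dropped)
```

At inference time dropout is switched off and the activations pass through unchanged.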

In [None]:
# Set model1 equal to a Keras Sequential model
model1 = None

###
### YOUR CODE HERE
###


In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


In [None]:

# Train the model using the API
model_name = "Dense + Head"
history1, fig1, axs1 = train(model1, X_train, y_train, model_name)

**Expected outputs (there may be some differences):**  
<table> 
    <tr> 
        <td>  
            Training accuracy
        </td>
        <td>
         0.8458
        </td>
    </tr>
    <tr> 
        <td>
            Validation accuracy
        </td>
        <td>
         0.8528
        </td>
    </tr>

</table>

The loss and accuracy graphs of model1 should look similar to this:
<img src='./resource/asnlib/publicdata/ships_in_satellite_images/images/model1_loss_accuracy.png' style="width:600px;height:300px;">

## How many weights are in the model?
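
When Dense layers are stacked, the total is the sum of each layer's kernel and bias weights; activation and Dropout layers add nothing. The helper below is a hypothetical sketch on toy layer widths, intended as a sanity check on the reasoning rather than as the answer for this model.

```python
def mlp_param_count(layer_sizes):
    """Total weights in a stack of Dense layers with biases (hypothetical helper).

    layer_sizes lists the input width followed by the units of each Dense layer.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # kernel plus bias
    return total

# Toy example: 48 input features -> Dense(8) -> Dense(3).
# Activation and Dropout layers between them would contribute nothing.
print(mlp_param_count([48, 8, 3]))  # (48*8 + 8) + (8*3 + 3) = 419
```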

In [None]:
# Set num_parameters1 equal to the number of weights in the model
num_parameters1 = None

###
### YOUR CODE HERE
###

print('Parameters number in model1:', num_parameters1)

In [None]:
###
### AUTOGRADER TEST - DO NOT REMOVE
###


## Evaluate the model

In [None]:
score = model1.evaluate(X_test, y_test, verbose=0)
print("{n:s}: Test loss: {l:3.2f} / Test accuracy: {a:3.2f}".format(n=model_name, l=score[0], a=score[1]))
