# Assignment 2

#### Student ID: *Double click here to fill the Student ID*

#### Name: *Double click here to fill the name*

## Q1: Temperature forecasting

In this question, we will tackle the prediction problem of a multivariate time series. We will use the Jena Climate dataset recorded by the
[Max Planck Institute for Biogeochemistry](https://www.bgc-jena.mpg.de/wetter/). In this dataset, 14 different quantities (such as temperature, pressure, humidity, wind direction, and so on) were recorded every 10 minutes over several years. We will only use a subset of features to speed up the training:

Index| Features      |Format             |Description
-----|---------------|-------------------|-----------------------
1    |Date Time      |01.01.2009 00:10:00|Date-time reference
2    |p (mbar)       |996.52             |Atmospheric pressure. The pascal is the SI derived unit of pressure; meteorological reports typically state atmospheric pressure in millibars.
3    |T (degC)       |-8.02              |Temperature in Celsius
4    |Tpot (K)       |265.4              |Potential temperature in Kelvin
5    |Tdew (degC)    |-8.9               |Dew point temperature in Celsius. The dew point is a measure of the absolute amount of water in the air: it is the temperature at which the air cannot hold all the moisture in it and water condenses.
6    |rh (%)         |93.3               |Relative humidity is a measure of how saturated the air is with water vapor; the %RH determines the amount of water contained within collection objects.
7    |VPmax (mbar)   |3.33               |Saturation vapor pressure
8    |VPact (mbar)   |3.11               |Vapor pressure
9    |VPdef (mbar)   |0.22               |Vapor pressure deficit
10   |sh (g/kg)      |1.94               |Specific humidity
11   |H2OC (mmol/mol)|3.12               |Water vapor concentration
12   |rho (g/m ** 3) |1307.75            |Air density
13   |wv (m/s)       |1.03               |Wind speed
14   |max. wv (m/s)  |1.75               |Maximum wind speed
15   |wd (deg)       |152.3              |Wind direction in degrees
 

The exact formulation of the problem is as follows: **Try to predict the temperature 24 hours in the future, given a time series of hourly measurements of quantities recorded over the past 5 days** by a set of sensors. 


Hint: Recurrent models with very few parameters, like the ones in this assignment, tend to be significantly faster on a multicore CPU than on a GPU because they involve only small matrix multiplications, and the chain of multiplications does not parallelize well because of the for loop over time steps (larger RNNs, however, can benefit significantly from a GPU runtime). Therefore, train for 1 or 2 epochs first to check the correctness of your code before training for more epochs.

First, use the following code snippet to preprocess the dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os
from tensorflow.keras import layers

df = pd.read_csv("climate.csv")

# Select a subset of features as our predictor (Notice the predictor contains temperature) and temperature as our target
temperature = df.iloc[:,2]
raw_data = df.iloc[:,[1, 2, 6, 8, 9, 11, 12]]

# We’ll use the first 20% of the data for training, the following 20% for validation, and the last 60% for testing
num_train_samples = int(0.2 * len(raw_data))
num_val_samples = int(0.2 * len(raw_data))
num_test_samples = len(raw_data) - num_train_samples - num_val_samples

# Normalize the data
mean = raw_data[:num_train_samples].mean(axis=0)
raw_data -= mean
std = raw_data[:num_train_samples].std(axis=0)
raw_data /= std

# Use timeseries_dataset_from_array() to generate input and target on the fly
# `sampling_rate = 6`: Observations will be sampled at one data point per hour: we will only keep one data point out of 6.
# `seq_length = 120`: Observations will go back 5 days (120 hours).
# `delay = sampling_rate * (seq_length + 24 - 1)`: The target for a sequence will be the temperature 24 hours after the end of the sequence.

sampling_rate = 6
seq_length = 120
delay = sampling_rate * (seq_length + 24 - 1)
batch_size = 256

train_dataset = keras.utils.timeseries_dataset_from_array(
    data=raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=seq_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=0,
    end_index=num_train_samples)

val_dataset = keras.utils.timeseries_dataset_from_array(
    data=raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=seq_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=num_train_samples,
    end_index=num_train_samples + num_val_samples)

test_dataset = keras.utils.timeseries_dataset_from_array(
    data=raw_data[:-delay],
    targets=temperature[delay:],
    sampling_rate=sampling_rate,
    sequence_length=seq_length,
    shuffle=True,
    batch_size=batch_size,
    start_index=num_train_samples + num_val_samples)
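
As a sanity check on the `delay` formula above, the following plain-Python sketch works through the index arithmetic for a single window. It assumes the 10-minute recording interval stated earlier and that `targets[i]` is paired with the window starting at row `i`; the helper name `window_indices` is hypothetical and used only for illustration.

```python
# Index arithmetic for one window (illustrative; `window_indices` is a hypothetical helper).
sampling_rate = 6      # keep one reading out of 6, i.e. one per hour
seq_length = 120       # 120 hourly steps = 5 days
delay = sampling_rate * (seq_length + 24 - 1)


def window_indices(start):
    """Return (row indices of the inputs, row index of the target) for a window starting at `start`."""
    inputs = [start + k * sampling_rate for k in range(seq_length)]
    # temperature[delay:][start] is row `start + delay` of the full series.
    target = start + delay
    return inputs, target


inputs, target = window_indices(0)
gap_rows = target - inputs[-1]     # rows between the last input and the target
gap_hours = gap_rows * 10 / 60     # each row covers 10 minutes
print(delay, inputs[-1], gap_rows, gap_hours)  # 858 714 144 24.0
```

The last input row and the target row are 144 rows apart, which is 24 hours at one row every 10 minutes.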

(a) Predict the temperature 24 hours later using the naive approach (i.e., assume the temperature 24 hours from now will be equal to the temperature right now). Report the mean absolute error (MAE) in degrees Celsius.

Hint: Refer to https://www.tensorflow.org/api_docs/python/tf/keras/utils/timeseries_dataset_from_array for more details
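
One point worth keeping in mind for the baseline: the inputs hold normalized features, while the targets appear to remain in raw degrees Celsius because `temperature` is extracted before normalization. Assuming the CSV columns follow the table order, `T (degC)` sits at position 1 of the selected features, so the last time step has to be de-normalized with the training mean and standard deviation before it is compared with the target. The sketch below shows the idea in plain Python on toy numbers; `naive_mae` and all values are made up for illustration and are not taken from the dataset.

```python
def naive_mae(last_step_norm_temps, targets_degc, temp_mean, temp_std):
    """MAE of the 'same as the last observed temperature' forecast, in degrees Celsius."""
    total_abs_err = 0.0
    for norm_temp, target in zip(last_step_norm_temps, targets_degc):
        prediction = norm_temp * temp_std + temp_mean  # undo the normalization
        total_abs_err += abs(prediction - target)
    return total_abs_err / len(targets_degc)


# Toy values only (hypothetical mean/std and targets).
mae = naive_mae([0.0, 1.0, -1.0], [10.0, 20.0, 6.0], temp_mean=10.0, temp_std=8.0)
print(mae)  # 2.0
```

With the real datasets, the normalized last-step temperatures would come from the final time step of each input batch, and the absolute errors would be accumulated over all batches before dividing by the number of samples.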

In [None]:
# coding your answer here.

(b) Build an RNN with one simple recurrent layer of 16 units, followed by a dense output layer with a single neuron. Train the model for 10 epochs with mean squared error (MSE) loss.

Try to manually calculate the number of parameters in your model's architecture and compare it with the one reported by `summary()`. Finally, plot the learning curves (validation and training loss vs. epochs) and report the MAE on the test set in the unit of degrees Celsius.
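
For the manual count, a simple recurrent layer has an input kernel, a recurrent kernel, and a bias, and a dense layer has a kernel and a bias. The sketch below applies these formulas, assuming the 7 input features selected in the preprocessing code; the function names are hypothetical helpers, not Keras APIs.

```python
def simple_rnn_params(input_dim, units):
    """Input kernel + recurrent kernel + bias of a simple recurrent layer."""
    return input_dim * units + units * units + units


def dense_params(input_dim, units):
    """Kernel + bias of a fully connected layer."""
    return input_dim * units + units


n_features = 7  # assumption: the 7 columns selected into raw_data
rnn = simple_rnn_params(n_features, 16)
head = dense_params(16, 1)
print(rnn, head, rnn + head)  # 384 17 401
```

The total can then be compared against the figure printed by `summary()`.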

In [None]:
# coding your answer here.

(c) Build an RNN with two GRU layers of 16 units each, followed by a dense output layer with a single neuron. In addition, apply dropout to the inputs of the dense layer, with the dropout rate set to 0.5. Train the model for 10 epochs with MSE loss.

Finally, plot the learning curves and report the MAE on the test set in the unit of degrees Celsius.

In [None]:
# coding your answer here.

(d) Build an RNN as in (c). But this time, also apply the dropout to the hidden state at each time step and try to **unroll the RNN during training**. The dropout rate should be set to 0.5. Train the model for **one** epoch with MSE loss. 

Report the time required to train the network compared with (c) and make some comments on it.

Hint: See the documentation at https://keras.io/api/layers/recurrent_layers/gru/ for how to apply dropout and unroll the RNN loop.
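
One simple way to measure training time is to wrap the call in a wall-clock timer. The sketch below is a generic helper (`time_call` is a hypothetical name) demonstrated on a stand-in workload; in the notebook, the timed call would be the training call for each model.

```python
import time


def time_call(fn, *args, **kwargs):
    """Run fn(*args, **kwargs) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; replace with the actual training call in the notebook.
total, seconds = time_call(sum, range(100_000))
print(f"elapsed: {seconds:.4f} s")
```

Since (c) trains for 10 epochs and (d) for one, a per-epoch time is probably the fairer basis for the comparison.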



In [None]:
# coding your answer here.

## Q2: Text classification using the preprocessed IMDB dataset

In this question, we will continue to work with the IMDB dataset: a set of 50,000 highly polarized reviews from the Internet Movie Database. It consists of 50% negative and 50% positive reviews. **Our job is to classify the reviews as positive or negative.** We will focus on model building, training, and evaluation. Therefore, we will use the preprocessed dataset from Keras.

https://ai.stanford.edu/~amaas/data/sentiment/.

(a) Load the IMDB dataset with `keras.datasets.imdb.load_data()` and keep only the 10,000 most frequently occurring words in the training data. In addition, split it into a training set (25,000 reviews), a validation set (5,000 reviews), and a test set (20,000 reviews). Finally, pad or truncate each input sequence to 500 words.

Hint: You may find https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences and https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/load_data helpful.
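
To make the padding and truncation step concrete, the plain-Python sketch below mimics what `pad_sequences` does to a single sequence under its default settings, which are assumed here to be padding and truncating at the front (`"pre"`) with a fill value of 0. The function `pad_or_truncate` is a hypothetical stand-in for illustration, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Mimic front ('pre') padding and front truncation on one sequence (maxlen > 0)."""
    if len(seq) >= maxlen:
        return list(seq[-maxlen:])                    # keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + list(seq)  # left-pad with the fill value


print(pad_or_truncate([5, 6, 7], 5))           # [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], 4))  # [3, 4, 5, 6]
```

Under these defaults, a long review keeps its final words and a short review is padded with zeros at the start.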

In [None]:
# coding your answer here.

(b) Build a 1D convolutional neural network using the following architecture, and remember to mask your input in the embedding layer (recall that the embedding layer is capable of generating a "mask" that corresponds to its input data, as described in the laboratory):


|        | Type            | Maps | Activation | Notice |
|--------|-----------------|------|------------|--------|
| Output | Fully connected |      | Sigmoid    |        |
| D3     | Dropout         |      |            | Dropout rate set to 0.75 |
| P1     | 1D max pooling  |      |            | Pooling size set to 2 |
| D2     | Dropout         |      |            | Dropout rate set to 0.75 |
| C1     | 1D convolution  | 32   | ReLU       | Kernel size set to 4, stride set to 2, same padding |
| D1     | Dropout         |      |            | Dropout rate set to 0.75 |
| E1     | Embedding       |      |            | Embedding output dimension set to 128 |
| In     | Input           |      |            | Input padded or truncated to 500 words, with a vocabulary of 10,000 words |

Train the model for 10 epochs with Adam optimizer. Finally, plot the learning curves and report the accuracy on the test set.
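
Reading the table from the bottom (input) to the top (output), the sequence length and parameter counts up to the pooling layer can be worked out by hand. The sketch below does this arithmetic in plain Python, assuming same padding for the convolution and a pooling stride equal to the pooling size; the variable names are illustrative only.

```python
import math

seq_len, vocab_size, embed_dim = 500, 10_000, 128
filters, kernel_size, stride, pool_size = 32, 4, 2, 2

# E1: one 128-dimensional vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# C1: with same padding, the output length is ceil(input length / stride).
conv_out_len = math.ceil(seq_len / stride)
conv_params = kernel_size * embed_dim * filters + filters  # kernel + bias

# P1: assuming the pooling stride equals the pooling size and no padding.
pool_out_len = conv_out_len // pool_size

print(embedding_params, conv_out_len, conv_params, pool_out_len)
# 1280000 250 16416 125
```

Dropout layers change neither the shape nor the parameter count, so they are omitted from the arithmetic.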

In [None]:
# coding your answer here.

(c) There is a [rule of thumb](https://developers.google.com/machine-learning/guides/text-classification/step-2-5#algorithm_for_data_preparation_and_model_building) that you should pay close attention to the **ratio between the number of samples in your training data and the mean number of words per sample** when approaching a new text classification task. If that ratio is smaller than 1,500, the bag-of-bigrams model will perform better (and as a bonus, it will be much faster to train and iterate on). If that ratio is higher than 1,500, you should go with a sequence model. In other words, sequence models work best when lots of training data are available and when each sample is relatively short.

Plot the histogram of the number of words per sample for the IMDB training dataset and calculate the ratio described above. Finally, compare the accuracy we got using bag-of-bigrams in the laboratory with the results you get in (b). Make some comments on the rule of thumb.
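
The ratio itself is a one-line computation once the per-sample lengths are known. The sketch below shows it on a toy list of token sequences; `samples_to_words_ratio` is a hypothetical helper, and the toy data is not from IMDB.

```python
from statistics import mean


def samples_to_words_ratio(sequences):
    """Number of samples divided by the mean number of words per sample."""
    lengths = [len(seq) for seq in sequences]
    return len(sequences) / mean(lengths)


# Toy data: three "reviews" with 3, 2, and 7 tokens (mean length 4).
toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10, 11, 12]]
print(samples_to_words_ratio(toy))  # 0.75
```

The lengths should be measured on the sequences before padding or truncation; otherwise every sample would count as 500 words.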

In [None]:
# coding your answer here.

## Q3: Transfer learning and network architecture search for CIFAR-10 dataset


In this question, we will try to boost the classification performance on the CIFAR-10 dataset. The techniques will include transfer learning and network architecture search.

https://www.cs.toronto.edu/~kriz/cifar.html.

(a) First, load the CIFAR-10 dataset (you may refer to `keras.datasets.cifar10.load_data()`) as in Assignment 1, and split it into a training set (45,000 images), a validation set (5,000 images), and a test set (10,000 images).

In [None]:
# coding your answer here.

(b) EfficientNet is a family of modern convnets obtained from network architecture search. Use the convolutional base of `EfficientNetB0` with weights pretrained on ImageNet, and freeze all the variables in the convolutional base. In addition, add a dropout layer with the dropout rate set to 0.5, followed by a dense layer with softmax activation.

Train the model for 10 epochs using SGD optimizer with a learning rate of 0.01. Finally, plot the learning curves and report the accuracy on the test set.

In [None]:
# coding your answer here.

(c) Use the same architecture as in (b), but this time unfreeze all the layers (i.e., we will fine-tune all the layers!). Train the model for 10 epochs using the SGD optimizer with a learning rate of 0.01. Finally, plot the learning curves and report the accuracy on the test set. Compare the results with (b) and make some comments.


In [None]:
# coding your answer here.

(d) Use Keras Tuner to do the network architecture search. The search space is described as follows:

|        | Type                   | Activation | Notice |
|--------|------------------------|------------|--------|
| Output | Fully connected        | Softmax    |        |
| D1     | Dropout                |            |        |
| F1     | Fully connected        | ReLU       |        |
| PN     | Global average pooling |            |        |
| ...    |                        |            | The convolution block (C1 to P1) may repeat 3 to 5 times |
| P1     | Max pooling            |            | Convolution block (layer 7 of 7) |
| R2     | ReLU                   |            | Convolution block (layer 6 of 7) |
| B2     | Batch normalization    |            | Convolution block (layer 5 of 7) |
| C2     | Convolution            |            | Convolution block (layer 4 of 7) |
| R1     | ReLU                   |            | Convolution block (layer 3 of 7) |
| B1     | Batch normalization    |            | Convolution block (layer 2 of 7) |
| C1     | Convolution            |            | Convolution block (layer 1 of 7) |
| In     | Input                  |            |        |


1. Search the number of convolutional blocks (a single block contains seven layers: (convolution, batch normalization, relu)*2 followed by a pooling layer) from 3 to 5.
2. Search the number of filters used in the convolutional layers of the convolutional blocks from 32 to 256 with the step size set to 32.
3. Search the number of neurons in the first dense layer from 30 to 100 with the step size set to 10.
4. Search the dropout rate from 0 to 0.5 with the step size set to 0.1.
5. Use the Adam optimizer and search the learning rate from 0.0001 to 0.01 with sampling strategy set to "log".

Use Bayesian optimization to search for a maximum of 3 trials with two executions per trial. To speed up the search, **only include the first 1,000 images from the training set**, but evaluate the performance on the whole validation set for 10 epochs.

Finally, report the architecture you find.
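
To get a feel for the size of this search space, the plain-Python sketch below enumerates the discrete choices and draws one learning rate uniformly in log space, which is what the "log" sampling strategy is assumed to mean here. It also assumes a single filter count shared by all convolutional layers; if each layer had its own filter hyperparameter, the space would be larger. The helper `sample_log_uniform` is hypothetical and not a Keras Tuner function.

```python
import math
import random

blocks = list(range(3, 6))                       # 3, 4, 5
filters = list(range(32, 257, 32))               # 32, 64, ..., 256
dense_units = list(range(30, 101, 10))           # 30, 40, ..., 100
dropout = [round(0.1 * k, 1) for k in range(6)]  # 0.0, 0.1, ..., 0.5

# Number of discrete combinations (the learning rate is continuous).
n_discrete = len(blocks) * len(filters) * len(dense_units) * len(dropout)


def sample_log_uniform(low, high, rng):
    """Draw a value whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(math.log10(low), math.log10(high))


rng = random.Random(0)
lr = sample_log_uniform(1e-4, 1e-2, rng)

trainings = 3 * 2  # 3 trials, each executed twice
print(n_discrete, trainings)  # 1152 6
```

With only 3 trials against 1,152 discrete combinations, the search explores a very small part of the space, which is worth keeping in mind when interpreting the architecture it returns.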

In [None]:
# coding your answer here.

(e) Train the model you find in (d) for 10 epochs on the full training set.
Finally, plot the learning curves and report the accuracy on the test set.

In [None]:
# coding your answer here.