# Exercises due by EOD 2018.12.10

## goal

in this homework assignment we will work with deep learning libraries and `gpu` `ec2` instances

## method of delivery

as mentioned in our first lecture, the method of delivery may change from assignment to assignment. we will include this section in every assignment to provide an overview of how we expect homework results to be submitted, and to provide background notes or explanations for "new" delivery concepts or methods.

this week you will be submitting the results of your homework via upload to your `s3` homework bucket

summary:

| exercise | deliverable | method of delivery |
|----------|-------------|--------------------|
| 1 | none | none |
| 2 | an `environment.yml` file | uploaded to your `s3` homework bucket |
| 3 | a `load_train_positions.py` file | uploaded to your `s3` homework bucket |
| 4 | a `boston_keras.py` file | uploaded to your `s3` homework bucket |
| 5 | a `results.csv` file | uploaded to your `s3` homework bucket |
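
if you would rather script the upload than click through the web console, here is a minimal sketch that shells out to the `aws` command line tool. it assumes the `aws` cli is installed and configured with your credentials, and the bucket name `my-homework-bucket` is a placeholder rather than your actual bucket. the helper names are made up for this example.

```python
import subprocess


def build_upload_command(local_path, bucket):
    """build the `aws s3 cp` command that copies one local file to a bucket"""
    return ['aws', 's3', 'cp', local_path, 's3://{}/'.format(bucket)]


def upload(local_path, bucket):
    """run the upload; raises CalledProcessError if the aws cli reports a failure"""
    subprocess.run(build_upload_command(local_path, bucket), check=True)


# 'my-homework-bucket' is a hypothetical name; substitute your own bucket.
# this only prints the command so you can inspect it before running upload()
print(' '.join(build_upload_command('environment.yml', 'my-homework-bucket')))
```

calling `upload('environment.yml', 'my-homework-bucket')` would then perform the copy.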

# exercise 1: execute and read the updated deep learning lecture

we're short on time now at the end of the year, and the deep learning lecture was only very recently finalized. to save us some class time and allow us to move on to `hadoop` before the end of the year, please read the remainder of the deep learning lecture, in which we cover `tensorflow` and `keras`.

for this lecture in particular it will be important to **execute** the cells in the notebook instead of just reading the material.

**choose one** of the following two options to execute the lectures


## 1.1: run the notebook locally

same ol' song and dance at this point -- download the file either directly from the `github` web interface or via `git pull`-ing the repository to your local desktop. then, in the directory containing the lecture (`014_deep_learning.ipynb`), run `jupyter notebook`.

the environment which launches the `jupyter` server with that command must additionally have the following packages installed to properly execute all of the code in that `notebook`:

+ `keras`
+ `numpy`
+ `plotly`
+ `scikit-learn`
+ `tensorflow`


## 1.2: run the notebook using `google` `colab`

either install the `chrome` browser extension ["open in colab"](https://chrome.google.com/webstore/detail/open-in-colab/iogfkhleblhcpcekbiedikdehleodpjo) or navigate to [google colab](https://colab.research.google.com) and open the notebook `url` `https://github.com/rzl-ds/gu511/blob/master/014_deep_learning.ipynb` from `github`

##### there is nothing to submit for this exercise

# exercise 2: install `tensorflow` on an `ec2` instance

on your `ec2` instance, let's install the *non-`gpu`* `tensorflow` package (using `pip`) into a new `conda` environment

+ create a new `conda` environment called `tf` with `python` version 3.6 (*not 3.7!*) and activate that environment
+ install `tensorflow` using `pip` (not `conda install`)
+ export that environment via `conda env export > environment.yml`
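
before exporting, it can be worth confirming that the active environment is what you think it is. the following is a small sanity-check sketch (the function names are invented for this example) that you could run with the `tf` environment activated.

```python
import importlib.util
import sys


def is_python_36(version_info):
    """True if the (major, minor, ...) version tuple is python 3.6"""
    return tuple(version_info[:2]) == (3, 6)


def tensorflow_installed():
    """True if a top-level `tensorflow` package can be found in this environment"""
    return importlib.util.find_spec('tensorflow') is not None


print('python 3.6:', is_python_36(sys.version_info))
print('tensorflow importable:', tensorflow_installed())
```

if either line prints `False`, you are probably not in the environment you meant to export.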

##### upload your `environment.yml` file to your `s3` submission bucket

# exercise 3: load a `csv` as a `tensorflow` `dataset`

let's use the `tensorflow` `dataset` `api` to load a large `csv` file as a tensor and do some simple calculations

## 3.1: acquire the `csv`

we will use the `1GB` `train_positions.csv` file I have made publicly available on `s3`. download it to your `/tmp` directory on your `ubuntu` `ec2` server with the command

```sh
# the -P /tmp will save the resulting file in the /tmp directory
wget https://s3.amazonaws.com/shared.rzl.gu511.com/train_positions/train_positions.csv -P /tmp
```

### 3.1.1: out of disk space?

if you run out of disk space while downloading this file, increase the size of your `ec2`'s hard disk (its `ebs` volume) through the web console. on the `ec2` dashboard, click on the `ec2` instance with the hard drive you wish to expand, and in the bottom panel find the root device link. click on that link and a popup will show the `ebs` id link; click that link as well.

<br><div align="center"><img src="http://drive.google.com/uc?export=view&id=1mW1APVujBcS_C31Vd_kWR1Q7Ox1g1pFo" width="1000px"></div>

that link will have dropped you on the `ebs` id page. right click the volume row in the top panel and choose to modify the volume. modify the disk size by adding at least 1 GB.
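
you can also check ahead of time whether you have room for the download. a minimal sketch using only the standard library follows. the 1 GB threshold comes from the file size quoted above, and the helper name is invented for this example.

```python
import os
import shutil

ONE_GB = 1024 ** 3


def has_room(path, required_bytes=ONE_GB):
    """True if the filesystem holding `path` has at least `required_bytes` free"""
    return shutil.disk_usage(path).free >= required_bytes


# on the ec2 server you would pass '/tmp'; the current directory is used here
# only so that the example runs anywhere
print(has_room(os.getcwd()))
```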


## 3.2: create a `CsvDataset` object

in addition to the core `tensorflow` routines and `api`s, the `tensorflow` developers have a rigorous process for allowing developers to contribute new or experimental features. these new features are often saved in the `tf.contrib` namespace, but for datasets there is a special place for experimental (soon-to-be standard?) methods and classes: `tf.data.experimental`.

one of the classes defined in that namespace is `tf.data.experimental.CsvDataset`. look at [the docstring](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset)

```python
import tensorflow as tf

help(tf.data.experimental.CsvDataset)
```


### 3.2.1: initialization arguments

a quick review of [the *initialization* function documentation](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset#__init__) for this class (the one that is called to build our `CsvDataset` object)

```python
help(tf.data.experimental.CsvDataset.__init__)
```

shows us what arguments we have and gives us an idea of what we have to do to build this object.

```python
__init__(
    filenames,
    record_defaults,
    compression_type=None,
    buffer_size=None,
    header=False,
    field_delim=',',
    use_quote_delim=True,
    na_value='',
    select_cols=None
)
```

+ `filenames`: a `tensor` of filenames as strings (conveniently, it also accepts a single filename string)
+ `record_defaults`: a list of default values for incoming records
    + each feature is represented by either a default value (e.g. '') if it *is not* required, or a `tensorflow` `dtype` if it *is* required
+ `header`: a `bool` indicating whether or not the file has a `header` row

we will need to specify values for those three arguments; the rest of the arguments can be left as defaults.
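
the required-versus-default rule for `record_defaults` can be easier to see in plain `python`. the sketch below is an analogy only: it is not `tensorflow` code, and `parse_record` is a made-up helper. a bare type such as `int` plays the role of a `tensorflow` `dtype` (the field is required), and a concrete value plays the role of a default (the field may be empty).

```python
def parse_record(fields, record_defaults):
    """apply the required / default rule to one row of raw csv strings"""
    parsed = []
    for raw, default in zip(fields, record_defaults):
        if isinstance(default, type):
            # a bare type stands in for a dtype: the field is required
            if raw == '':
                raise ValueError('required field is empty')
            parsed.append(default(raw))
        elif raw == '':
            # empty field with a default available: use the default
            parsed.append(default)
        else:
            # non-empty field: cast to the same type as the default
            parsed.append(type(default)(raw))
    return parsed


# made-up row: the second field is empty and falls back to its default
print(parse_record(['8', '', 'x'], [int, 'n/a', str]))
```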


#### 3.2.1.1: `record_defaults`

check the first few records of the `csv` file with

```sh
head -n20 /tmp/train_positions.csv
```

the following table summarizes the columns, whether or not they contain null values, and the suggested default value. use this table to construct the `record_defaults` list

| column name | contains `null` values | suggested default value |
|-|-|-|
| `carcount` | no | `tf.int32` |
| `circuitid` | no | `tf.int32` |
| `destinationstationcode` | yes | `''` |
| `directionnum` | no | `tf.int32` |
| `linecode` | yes | `''` |
| `secondsatlocation` | no | `tf.int32` |
| `servicetype` | no | `tf.string` |
| `trainid` | no | `tf.string` |
| `timestamp` | no | `tf.string` |
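
if you want to check the null-value column of that table for yourself, the following standard-library sketch counts empty values per column. it assumes that nulls appear in the file as empty strings (which is what the `na_value=''` default expects), and the helper name is made up for this example.

```python
import csv
import io
from collections import Counter


def count_empty_by_column(csv_file):
    """count empty-string values per column in an open csv file with a header row"""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        for column, value in row.items():
            if value == '':
                counts[column] += 1
    return counts


# a tiny made-up sample; on the server you would pass
# open('/tmp/train_positions.csv', newline='') instead
sample = io.StringIO('carcount,linecode\n6,RD\n8,\n')
print(count_empty_by_column(sample))
```

scanning the full 1 GB file this way will take a little while, since it reads every row.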


### 3.2.2: invoking `CsvDataset`

fill in the code below to create your dataset

```python
import tensorflow as tf

filenames = #----------------#
            # FILL THIS IN!! #
            #----------------#
record_defaults = #----------------#
                  # FILL THIS IN!! #
                  #----------------#
header = #----------------#
         # FILL THIS IN!! #
         #----------------#
        
train_positions_dataset = tf.data.experimental.CsvDataset(
    filenames=filenames,
    record_defaults=record_defaults,
    header=header,
)
```


## 3.3: create a `batch`ed `iterator`

using your `train_positions_dataset` object's [`.batch`](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset#batch) and [`.make_one_shot_iterator`](https://www.tensorflow.org/api_docs/python/tf/data/experimental/CsvDataset#make_one_shot_iterator) methods, create an iterator that has a batch size of 3 by filling in the code below

```python
BATCH_SIZE = 3

# make a batched dataset
tp_batched = #----------------#
             # FILL THIS IN!! #
             #----------------#

from tensorflow.python.data.ops.dataset_ops import BatchDataset
assert isinstance(tp_batched, BatchDataset)

# make a one-shot iterator from your batched dataset
tp_batched_oneshot = #----------------#
                     # FILL THIS IN!! #
                     #----------------#

from tensorflow.python.data.ops.iterator_ops import Iterator
assert isinstance(tp_batched_oneshot, Iterator)
```

you can verify that this worked by executing


```python
with tf.Session() as sess:
    iterator = tp_batched_oneshot
    next_element = iterator.get_next()
    elem = sess.run(next_element)
assert elem[1].tolist() == [2009, 1912, 1480]
assert elem[-2].tolist() == [b'067', b'175', b'182']
```
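
the indexing in those `assert` statements makes more sense once you see the shape of a batch: each element is a tuple with one entry per `csv` column, and each entry holds `BATCH_SIZE` values. a plain-`python` sketch of that layout follows. the rows are partly made up (only the `circuitid` and `trainid` values come from the check above), `batch_columns` is an invented helper, and none of this is the `tensorflow` implementation.

```python
from itertools import islice


def batch_columns(rows, batch_size):
    """yield batches of rows, transposed so each batch is a tuple of columns"""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, batch_size))
        if not chunk:
            return
        yield tuple(list(column) for column in zip(*chunk))


# (carcount, circuitid, trainid) rows; carcount and the fourth row are made up
rows = [(6, 2009, '067'), (8, 1912, '175'), (6, 1480, '182'), (8, 1000, '001')]
first = next(batch_columns(rows, 3))
print(first[1])   # the second column of the first batch
print(first[-1])  # the last column of the first batch
```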


## 3.4: put it together

fill in the following block of `python` code and save the results as `load_train_positions.py`

```python
import tensorflow as tf

from tensorflow.python.data.ops.dataset_ops import BatchDataset
from tensorflow.python.data.ops.iterator_ops import Iterator

BATCH_SIZE = 3

def build_train_positions_iterator():
    filenames = #----------------#
                # FILL THIS IN!! #
                #----------------#
    record_defaults = #----------------#
                      # FILL THIS IN!! #
                      #----------------#
    header = #----------------#
             # FILL THIS IN!! #
             #----------------#

    train_positions_dataset = tf.data.experimental.CsvDataset(
        filenames=filenames,
        record_defaults=record_defaults,
        header=header,
    )

    # make a batched dataset
    tp_batched = #----------------#
                 # FILL THIS IN!! #
                 #----------------#

    assert isinstance(tp_batched, BatchDataset)

    # make a one-shot iterator from your batched dataset
    tp_batched_oneshot = #----------------#
                         # FILL THIS IN!! #
                         #----------------#

    assert isinstance(tp_batched_oneshot, Iterator)

    return tp_batched_oneshot


def validate():
    tp_batched_oneshot = build_train_positions_iterator()
    
    with tf.Session() as sess:
        iterator = tp_batched_oneshot
        next_element = iterator.get_next()
        elem = sess.run(next_element)
        
    assert elem[1].tolist() == [2009, 1912, 1480]
    assert elem[-2].tolist() == [b'067', b'175', b'182']


if __name__ == '__main__':
    validate()
```

if everything works as expected, you should be able to run (from the `bash` command line)

```sh
python load_train_positions.py
```

and the result will be nothing -- no `python` errors


##### upload your `load_train_positions.py` file to your `s3` homework submission bucket

# exercise 4: simple `keras` models

let's create a pair of simple `keras` models to predict housing prices.


## 4.1: load the Boston housing price dataset

`keras` provides built-in access to a number of datasets via the [`keras.datasets`](https://keras.io/datasets/#boston-housing-price-regression-dataset) module. we will use that to load train and test data in a format that is immediately consumable in a `keras` model

```python
from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data()
```

additionally, let's normalize the predictor data:

```python
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)
x_train = (x_train - mean) / std
x_test = (x_test - mean) / std
```
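
note that `x_test` is scaled with the *training* mean and standard deviation, so no information from the test set leaks into the preprocessing. here is the same arithmetic for a single column in plain `python`, on made-up numbers (`statistics.pstdev` is the population standard deviation, which matches the `numpy` default):

```python
import statistics


def standardize(values, mean, std):
    """shift by the mean and scale by the standard deviation"""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0]  # made-up training column
test = [2.5, 5.0]             # made-up test column

mean = statistics.mean(train)
std = statistics.pstdev(train)

train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)  # reuses the training statistics

# the scaled training column has mean 0 and standard deviation 1
# (up to floating point error)
print(statistics.mean(train_scaled), statistics.pstdev(train_scaled))
```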


## 4.2: a linear model


### 4.2.1: build the model

we can build a linear regression in `keras` quite easily -- a linear regression is simply a

+ one-layer `Sequential` model
+ in which the one layer
    + is `Dense`
    + takes our `x` datasets as `inputs` (this defines `input_dim`)
    + has only one set of weights (i.e. is only one node tall)
    + has a `linear` activation (this is the default activation value, so no argument is necessary)

fill in the below snippet to create a linear model

```python
linear_model = #----------------#
               # FILL THIS IN!! #
               #----------------#
```

after doing so, you should be able to run

```python
linear_model.summary()
```

and see (`dense_17` may have a different number for you)

```
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_17 (Dense)             (None, 1)                 14        
=================================================================
Total params: 14
Trainable params: 14
Non-trainable params: 0
_________________________________________________________________
```
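
the `14` in that summary is not arbitrary: a `Dense` layer has one weight per input per node plus one bias per node, and the Boston dataset has 13 predictor columns. a quick arithmetic check (`dense_param_count` is a made-up helper, not part of `keras`):

```python
def dense_param_count(layer_sizes):
    """total trainable parameters for a stack of Dense layers with biases

    layer_sizes starts with the input dimension, followed by the number of
    nodes in each layer
    """
    pairs = zip(layer_sizes[:-1], layer_sizes[1:])
    return sum(n_in * n_out + n_out for n_in, n_out in pairs)


# 13 inputs feeding a single node: 13 weights + 1 bias
print(dense_param_count([13, 1]))
```

the same helper works for deeper stacks: pass the input dimension followed by each layer's node count and compare the total against `summary()`.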


### 4.2.2: `compile`

furthermore, we want to `compile` this model to use the `adam` optimizer algorithm to optimize a `mse` `loss` function. let's track the `mean_absolute_error` `metric` as well

```python
linear_model.compile(
    loss=,  # FILL THIS IN!!
    optimizer=,  # FILL THIS IN!!
    metrics=,  # FILL THIS IN!!
)
```

### 4.2.3: `fit`

finally, let's fit our training dataset. let's use validation within each `epoch` with a `validation_split` of 0.05. also, in order to treat both model types on an equal footing, rather than stopping after a fixed number of epochs we will stop once the validation `mse` has stopped improving, and keep the best model seen so far. to do this, we will use the `EarlyStopping` and `ModelCheckpoint` `callback`s.

```python
es_callback = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    min_delta=0.01,
    patience=100
)

mc_callback = keras.callbacks.ModelCheckpoint(
    'linear.hdf5',
    monitor='val_loss',
    save_best_only=True
)

callbacks = [es_callback, mc_callback]
```
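
to get a feel for what `min_delta` and `patience` do, here is a simplified plain-`python` sketch of the stopping rule: an epoch only counts as an improvement if `val_loss` drops by more than `min_delta` below the best value so far, and training stops after `patience` epochs in a row without one. this is an illustration with made-up loss values and an invented function name, not the `keras` source.

```python
def last_epoch_run(val_losses, min_delta=0.01, patience=100):
    """index of the last epoch that runs under a simplified early stopping rule"""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # a real improvement: remember it and reset the counter
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    # never ran out of patience: every epoch ran
    return len(val_losses) - 1


# made-up validation losses: the tiny gains at epochs 2 and 3 do not count
losses = [1.0, 0.5, 0.499, 0.498, 0.3]
print(last_epoch_run(losses, min_delta=0.01, patience=2))
```

with a small `patience` the run stops before it ever sees the later drop to `0.3`, which is one reason to be generous with `patience` when each epoch is cheap.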

fit on the `x` and `y` train datasets with a `validation_split` of 0.05, `verbose` set to 0, 10,000 `epochs`, and the `callbacks` list defined above

```python
linear_model.fit(
    # FILL THIS IN!!
)
```

### 4.2.4: `evaluate`

load the saved best model and view its final error on the held-out test data:

```python
best_linear_model = keras.models.load_model('linear.hdf5')
linear_test_mse, linear_test_mae = best_linear_model.evaluate(x_test, y_test)
print("linear test mse: {}".format(linear_test_mse))
print("linear test mae: {}".format(linear_test_mae))
```
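
`evaluate` returns the loss followed by each compiled metric, which is why it unpacks into `mse` and `mae` here. both are simple averages over the test rows. a plain-`python` sketch on made-up numbers:

```python
def mean_squared_error(y_true, y_pred):
    """average of the squared differences"""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """average of the absolute differences"""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [20.0, 30.0]  # made-up prices
y_pred = [20.0, 34.0]  # made-up predictions
print(mean_squared_error(y_true, y_pred), mean_absolute_error(y_true, y_pred))
```

because the squared error punishes large misses more heavily, the two numbers can rank a pair of models differently.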



## 4.3: a deep neural net model

let's repeat the above but with a neural network architecture. create a new `Sequential` model with the following:

+ several layers
    + one 20-node layer with `relu` activation and `input_dim` determined by the shape of `x_test`
    + one 10-node layer with `relu` activation
    + one 6-node layer with `relu` activation
    + one 1-node output layer with the default activation
+ compile with
    + an `adam` optimizer
    + a `mse` loss
    + a `mean_absolute_error` metric
+ fit with
    + 10,000 `epochs`
    + a `validation_split` of 0.05

```python
dnn_model = #----------------#
            # FILL THIS IN!! #
            #----------------#

dnn_model.compile(
    loss=,  # FILL THIS IN!!
    optimizer=,  # FILL THIS IN!!
    metrics=,  # FILL THIS IN!!
)

es_callback = keras.callbacks.EarlyStopping(
    monitor='val_loss',
    min_delta=0.01,
    patience=100
)

mc_callback = keras.callbacks.ModelCheckpoint(
    'dnn.hdf5',
    monitor='val_loss',
    save_best_only=True
)

callbacks = [es_callback, mc_callback]

dnn_model.fit(
    # FILL THIS IN!!
)

best_dnn_model = keras.models.load_model('dnn.hdf5')
dnn_test_mse, dnn_test_mae = best_dnn_model.evaluate(x_test, y_test)
print("dnn test mse: {}".format(dnn_test_mse))
print("dnn test mae: {}".format(dnn_test_mae))
```


## 4.4: bring it all together

fill in all of the above in one file named `boston_keras.py`

```python
from tensorflow import keras

def main():
    # load boston data
    (x_train, y_train), (x_test, y_test) = keras.datasets.boston_housing.load_data()
    
    # standardize
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    x_train = (x_train - mean) / std
    x_test = (x_test - mean) / std
    
    # linear model ------------------------------------------------------------
    print('{:-<80}'.format('linear model '))
    
    # init
    linear_model = #----------------#
                   # FILL THIS IN!! #
                   #----------------#

    # summary() prints the table itself and returns None, so no print() is needed
    linear_model.summary()
    
    # compile
    linear_model.compile(
        loss=,  # FILL THIS IN!!
        optimizer=,  # FILL THIS IN!!
        metrics=,  # FILL THIS IN!!
    )
    
    # linear callbacks
    es_callback = keras.callbacks.EarlyStopping(
        monitor='val_loss',
        min_delta=0.01,
        patience=1000
    )

    mc_callback = keras.callbacks.ModelCheckpoint(
        'linear.hdf5',
        monitor='val_loss',
        save_best_only=True
    )

    callbacks = [es_callback, mc_callback]

    # fit
    linear_model.fit(
        # FILL THIS IN!!
    )
    
    # evaluate
    best_linear_model = keras.models.load_model('linear.hdf5')
    linear_test_mse, linear_test_mae = best_linear_model.evaluate(x_test, y_test)
    print("linear test mse: {}".format(linear_test_mse))
    print("linear test mae: {}".format(linear_test_mae))
    
    # dnn model ---------------------------------------------------------------
    print('{:-<80}'.format('dnn model '))

    # init
    dnn_model = #----------------#
                # FILL THIS IN!! #
                #----------------#

    # summary() prints the table itself and returns None, so no print() is needed
    dnn_model.summary()
            
    # compile
    dnn_model.compile(
        loss=,  # FILL THIS IN!!
        optimizer=,  # FILL THIS IN!!
        metrics=,  # FILL THIS IN!!
    )
    
    # dnn callbacks
    es_callback = keras.callbacks.EarlyStopping(
        monitor='val_loss',
        min_delta=0.01,
        patience=1000
    )

    mc_callback = keras.callbacks.ModelCheckpoint(
        'dnn.hdf5',
        monitor='val_loss',
        save_best_only=True
    )

    callbacks = [es_callback, mc_callback]

    # fit
    dnn_model.fit(
        # FILL THIS IN!!
    )

    # evaluate
    best_dnn_model = keras.models.load_model('dnn.hdf5')
    dnn_test_mse, dnn_test_mae = best_dnn_model.evaluate(x_test, y_test)
    print("dnn test mse: {}".format(dnn_test_mse))
    print("dnn test mae: {}".format(dnn_test_mae))


if __name__ == '__main__':
    main()
```


##### upload your completed `boston_keras.py` file to your `s3` homework submission bucket

# exercise 5: benchmark differences in performance using `gpu`s

let's spin up a `gpu` `ec2` instance and do a simple benchmark to see the performance improvements available via `gpu`s


## 5.1: `gpu`s are expensive!

go check [the per-hour price](https://aws.amazon.com/ec2/pricing/on-demand/) of `gpu` compute for a `p3.2xlarge` instance in the US East (Virginia) region. as of writing, it is 3.06 USD per hour.

we don't want to leave that on for long, so let's make this quick!
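
to put a number on it, here is the arithmetic using the 3.06 USD per hour figure quoted above (check the pricing page for the current rate; the helper name is made up):

```python
P3_2XLARGE_USD_PER_HOUR = 3.06  # the rate quoted above; it may have changed


def on_demand_cost(hours, rate=P3_2XLARGE_USD_PER_HOUR):
    """cost in USD of running an instance for the given number of hours"""
    return hours * rate


for label, hours in [('one hour', 1), ('one day', 24), ('30 days', 24 * 30)]:
    print('{:>8}: {:,.2f} USD'.format(label, on_demand_cost(hours)))
```

a forgotten instance left running over a weekend adds up quickly, which is why the last step of this exercise is to terminate it.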


## 5.2: spin up a `p3.2xlarge` instance

`aws` has already created a deep learning `ami` for us, so let's use it and save time (and money) on downloads.

+ open the `ec2` web console and create a new instance
+ on page 1 (select `ami`)
    + select the "AWS Marketplace" menu on the left side
    + search for "Deep Learning AMI (Ubuntu)"
    + select the first one, titled "Deep Learning AMI (Ubuntu)" (*doesn't contain "base" in the title*)
    + press "continue"
+ on page 2 (choose an instance type)
    + find `p3.2xlarge` and select it
    + click "review and launch"
+ launch the instance
    + make sure you have your `ssh` key saved somewhere easy!


## 5.3: log in

after your `ec2` instance is up and running, log in to it using username `ubuntu` and providing the path to the private key `.pem` file you either downloaded just now or when you created that key pair for a previous `ec2` instance

if you don't know where this key file is, *terminate the instance* (right click > instance state > terminate) and start over.


## 5.4: download a benchmark

download [this public `gist`](https://gist.github.com/RZachLamberty/fe8e05060b809e90fd2722feeb80fcda) to your new `ec2` instance:

```sh
wget https://gist.githubusercontent.com/RZachLamberty/fe8e05060b809e90fd2722feeb80fcda/raw/0e0834395ba60c317168bdfdeb0c898afac6013b/cpu_gpu_benchmark.py
```


## 5.5: activate an environment

the good folks at `aws` have pre-configured this `ami` with a ton of different deep-learning-capable environments, and the commands for entering any one of them are printed out when you log in to the `ami`. one in particular is for us right now:

```
for TensorFlow(+Keras2) with Python3 (CUDA 9.0 and Intel MKL-DNN) ___________________________ source activate tensorflow_p36
```

run that command to activate that environment


## 5.6: run the benchmark

now that we have everything we need already installed (thanks, `aws` `ami`!), go ahead and run the benchmark script:

```sh
python cpu_gpu_benchmark.py
```

> **note**: the first time you run the benchmark on this machine, initializing `tensorflow` may add enough overhead to cause an error in the benchmarking script. you will see a `ValueError: Empty data passed with indices specified.` error. if you see this, just run the benchmark script again. if you see it more than three times, terminate your instance and send me an email.

the final output will be a dataframe which lists the amount of time it took to create random matrices of increasing sizes and multiply them, as well as the ratio of the speeds for the different devices for each operation.

it will also output a `csv` named `results.csv`. download that file (use `scp` to copy it from your `ec2` to your laptop, or just open it, highlight, and copy-paste to a local file)


## 5.7: TERMINATE YOUR `gpu` INSTANCE!!

don't forget to go back into the `ec2` web console and terminate your instance (right click > instance state > terminate).


##### upload `results.csv` to your `s3` homework submission bucket