##### Copyright 2018 The TensorFlow Authors.

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [None]:
#@title MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

# Proj1A - Basic Regression: Understanding the ADNI Data Using Regression  






<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/tutorials/keras/regression"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/keras/regression.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/regression.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/keras/regression.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

## Instructions

Please make a copy of this notebook and rename it with your name (e.g., Proj1A_Ilmi_Yoon). All graded points should be addressed in the notebook, but some can be answered in a separate PDF file.

*Graded questions will be listed with "Q:" followed by the corresponding points.* 

You will be submitting **a PDF** file containing **the URL of your own Proj1A notebook.**


---



>[Proj1A - Basic Regression: Understanding the ADNI Data Using Regression](#scrollTo=EIdT9iu_Z4Rb)

>>[1. Load in the Data](#scrollTo=gFh9ne3FZ-On)

>>[2. Clean the Data](#scrollTo=3MWuJTKEDM-f)

>>[3. Inspect the Data](#scrollTo=J4ubs136WLNp)

>>[4. Select a Few Features to Work On and Split Features from Labels](#scrollTo=Db7Auq1yXUvh)

>>[5. Normalization](#scrollTo=mRklxK5s388r)

>>>[5.1 The Normalization Layer](#scrollTo=aFJ6ISropeoo)

>>[6. Linear Regression](#scrollTo=6o3CrycBXA2s)

>>>[6.1. One Variable](#scrollTo=lFby9n0tnHkw)

>>>[6.2. Multiple Variables (Features)](#scrollTo=Yk2RmlqPoM9u)

>[Extra Credit: A DNN regression](#scrollTo=SmjdzxKzEu1-)

>>[Instructions](#scrollTo=DT_aHPsrzO1t)

>>[A. Train the Model](#scrollTo=ELz48lsgqC46)

>>>[A1. One Variable](#scrollTo=7T4RP1V36gVn)

>>>[A2. Full Model](#scrollTo=S_2Btebp2e64)

>>[B. Performance](#scrollTo=uiCucdPLfMkZ)

>>[C. Make Predictions](#scrollTo=ft603OzXuEZC)

>[Conclusion](#scrollTo=vgGQuV-yqYZH)



## Table of Contents

Introduction (Points: 30 points)
1. Load in the Data
2. Clean the Data
3. Inspect the Data
4. Select a Few Features to Work On and Split Features from Labels
5. Normalization

  5.1 The Normalization Layer

6. Linear Regression

  6.1 One Variable
 
  6.2 Multiple Variables (Features)

A DNN Regression (Extra Credit: 3 points)

1. Instructions
2. A. Train the Model 

  A1. One Variable 
  
  A2. Full Model

3. B. Performance
4. C. Make Predictions

Conclusion

---



## Introduction

In a **regression** problem, the aim is to *predict the output of a continuous value*, like a price or a probability.

Contrast this with a **classification** problem, where the aim is to *select a class from a list of classes* (for example, where a picture contains an apple or an orange, recognizing which fruit is in the picture).

This example uses the `tf.keras` API; see [this guide](https://www.tensorflow.org/guide/keras) for details.

In [None]:
# Use seaborn for pairplot
!pip install -q seaborn

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


# Make numpy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

In [None]:
import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.layers.experimental import preprocessing

print(tf.__version__)

### 1. Load in the Data
First download and import the dataset using pandas:

In [None]:
url = "https://raw.githubusercontent.com/pleunipennings/CSC508Data/main/PatData.csv" 
data = pd.read_csv(url)

In [None]:
dataset = data.copy()
dataset.tail()

In [None]:
dataset['DX'].value_counts()

Keep only the rows whose `DX` value is one of the following levels: NL, MCI, or Dementia.

*NL = cognitively normal, MCI = mild cognitive impairment.*

In [None]:
index_to_drop = dataset[ (dataset['DX'] != "MCI") & (dataset['DX'] != "NL") & (dataset['DX'] != "Dementia")].index
  
# drop these given row indices from data
dataset = dataset.drop(index_to_drop)

In [None]:
dataset['DX'].value_counts()

### 2. Clean the Data

The dataset contains a few unknown values. To see how many unknown values, use the following code:

In [None]:
dataset.isna().sum()

Drop those rows to keep this initial tutorial simple.
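
As a minimal sketch of what `isna().sum()` and `dropna()` do, consider a toy frame. The values are invented for illustration and are not taken from the ADNI data.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing entries (made-up values).
toy = pd.DataFrame({
    "AGE": [70.0, np.nan, 82.0, 65.0],
    "Hippocampus": [7000.0, 6500.0, np.nan, 7200.0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Drop every row that has at least one missing value.
cleaned = toy.dropna()
print(cleaned)
```

Only the rows without any missing entry survive, so dropping can discard a lot of data when missing values are spread across many rows.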

Q: **(1 point)** What are other ways to process these rows instead of dropping? 

In [None]:
dataset = dataset.dropna()

Categorical data needs to be handled properly, for example with one-hot encoding.
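
A minimal sketch of one-hot encoding with `pd.get_dummies` follows. The `site` column and its values are hypothetical and do not come from the ADNI data; substitute a real categorical column when you answer the questions below.

```python
import pandas as pd

# Hypothetical toy data; "site" is an invented categorical column.
toy = pd.DataFrame({
    "site": ["A", "B", "A", "C"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# Replace "site" with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["site"], prefix="site", dtype=int)
print(encoded)
```

Each row has exactly one indicator set to 1, so no artificial ordering is imposed on the categories.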

Q: **(2 points)** Explain in 200 words what one-hot encoding is and why it is necessary for handling categorical data.

Q: **(1 point)** Convert one more categorical feature to a one-hot encoding.

Q: **(1 point)** Explain why the DX column is mapped to numeric values as below.

In [None]:
cleanup_DX = {"DX": {"NL": 1, "MCI": 2, "Dementia": 3}}
dataset = dataset.replace(cleanup_DX)

In [None]:
train_dataset = dataset.sample(frac=0.8, random_state=0)
test_dataset = dataset.drop(train_dataset.index)

### 3. Inspect the Data

Have a quick look at the joint distribution of a few pairs of columns from the training set. *Can you find the data that show their relationship clearly?*

Q: **(2 points)** Please work with different columns and write what you have learned from the visualization of the data.  

In [None]:
sns.pairplot(train_dataset[["AGE", "Hippocampus", "Ventricles", "WholeBrain", "Entorhinal", "Fusiform", "MidTemp", "ICV", "APOE4", "DX"]], diag_kind="kde")

Also look at the overall statistics. Note how each feature covers a very different range:

In [None]:
train_dataset.describe().transpose()

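The code below lets you look at one group at a time: cognitively normal (DX = 1), mild cognitive impairment (DX = 2), or dementia (DX = 3). As written, it keeps only the DX = 1 rows and overwrites `train_dataset`, so the cells that follow train on that group only.

Q: **(3 points)** Experiment with the full dataset and/or each group, and with regressions on AGE, DX, and other columns.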

In [None]:
index_to_drop = train_dataset[ (train_dataset['DX'] != 1) ].index

# drop these given row indices from data
train_dataset = train_dataset.drop(index_to_drop)
train_dataset.describe().transpose()

### 4. Select a Few Features to Work On and Split Features from Labels

Separate the target value (the "label") from the features. **This label is the value that you will train the model to predict.**

In [None]:
train_features = train_dataset[["Hippocampus", "WholeBrain", "Entorhinal"]]
test_features = test_dataset[["Hippocampus", "WholeBrain", "Entorhinal"]]


train_labels = train_dataset["AGE"]
test_labels = test_dataset["AGE"]

## 5. Normalization

In the table of statistics it's easy to see how different the ranges of each feature are.

Q: **(2 points)** Write in 100 words why normalization is important.

*Note*: There is no advantage to normalizing one-hot features; when they are passed through the layer, it is only for simplicity. For more details on how to use the preprocessing layers, refer to the [Working with preprocessing layers](https://www.tensorflow.org/guide/keras/preprocessing_layers) guide and the [Classify structured data using Keras preprocessing layers](https://www.tensorflow.org/tutorials/structured_data/preprocessing_layers) tutorial.
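
To make the idea concrete, here is a minimal NumPy sketch of per-feature standardization, the same kind of shift and scale that the `Normalization` layer applies: subtract each feature's mean and divide by its standard deviation. The feature values are made up for illustration.

```python
import numpy as np

# Made-up feature matrix: rows are examples, columns are features
# on very different scales.
features = np.array([
    [7000.0, 1.0e6, 3500.0],
    [6500.0, 1.1e6, 3300.0],
    [7500.0, 0.9e6, 3900.0],
    [6000.0, 1.2e6, 3100.0],
])

mean = features.mean(axis=0)  # per-feature mean
std = features.std(axis=0)    # per-feature standard deviation

normalized = (features - mean) / std

print(normalized.mean(axis=0))  # close to 0 for each feature
print(normalized.std(axis=0))   # close to 1 for each feature
```

After this step every column has a comparable scale, regardless of its original units.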

In [None]:
train_features.describe().transpose()[['mean', 'std']]

### 5.1 The Normalization Layer
The `preprocessing.Normalization` layer is a clean and simple way to build that preprocessing into your model.

The first step is to create the layer:

In [None]:
normalizer = preprocessing.Normalization(axis=-1)

Then `.adapt()` it to the data:

In [None]:
normalizer.adapt(np.array(train_features))

Calling `.adapt()` calculates the mean and variance of each feature and stores them in the layer. Print the stored means:

In [None]:
print(normalizer.mean.numpy())

When the layer is called, it returns the input data with each feature independently normalized:

In [None]:
first = np.array(train_features[:1])

with np.printoptions(precision=2, suppress=True):
  print('First example:', first)
  print()
  print('Normalized:', normalizer(first).numpy())

## 6. Linear regression

Before building a DNN model, start with a linear regression.

### 6.1. One Variable

Start with a single-variable linear regression to predict `AGE` from `Hippocampus`.

Q: **(5 points)** Pick a different variable or label to explore the relationships between features. Try and show at least 3 different variations.

Training a model with `tf.keras` typically starts by defining the model architecture.

Here, use a `keras.Sequential` model, which represents a sequence of steps. There are two steps:

* Normalize the input `Hippocampus`.
* Apply a linear transformation ($y = mx+b$) to produce 1 output using `layers.Dense`.

The number of _inputs_ can either be set by the `input_shape` argument, or automatically when the model is run for the first time.

First create the hippocampus `Normalization` layer:

In [None]:
hippocampus = np.array(train_features['Hippocampus'])


hippocampus_normalizer = preprocessing.Normalization(input_shape=[1,], axis=None)
hippocampus_normalizer.adapt(hippocampus)

Build the sequential model:

In [None]:
hippocampus_model = tf.keras.Sequential([
    hippocampus_normalizer,
    layers.Dense(units=1)
])

hippocampus_model.summary()

This model will predict `AGE` from `Hippocampus`.

Run the untrained model on the first 10 `Hippocampus` values. The output won't be good, but you'll see that it has the expected shape, `(10,1)`:

In [None]:
hippocampus_model.predict(hippocampus[:10])

Once the model is built, configure the training procedure using the `Model.compile()` method. The most important arguments to `compile` are the `loss` and the `optimizer`, since these define what will be optimized (`mean_absolute_error`) and how (using `optimizers.Adam`).
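
To illustrate what "loss" and "optimizer" mean here, the sketch below minimizes the mean absolute error of a one-variable linear model with plain (sub)gradient descent in NumPy. This is a simplified stand-in, not Adam, and the synthetic data, learning rate, and step count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y is roughly 3 * x + 2 plus a little noise.
x = rng.normal(size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)

def mae(m, b):
    """Mean absolute error of the line y = m * x + b on the data."""
    return np.mean(np.abs(m * x + b - y))

m, b = 0.0, 0.0
learning_rate = 0.1
initial_loss = mae(m, b)

for _ in range(200):
    residual = m * x + b - y
    sign = np.sign(residual)  # subgradient of |residual|
    m -= learning_rate * np.mean(sign * x)
    b -= learning_rate * np.mean(sign)

final_loss = mae(m, b)
print(initial_loss, final_loss)
```

The loss defines the quantity being reduced, and the optimizer defines the update rule. Adam uses the same gradients but adapts the step size for each parameter.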

In [None]:
hippocampus_model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

Once the training is configured, use `Model.fit()` to execute the training:

Q: **(5 points)** Explore different hyperparameters such as learning rate, epochs, batch sizes. Please document your explorations and reflections.

In [None]:
%%time
history = hippocampus_model.fit(
    train_features['Hippocampus'], train_labels,
    epochs=100,
    # Show per-epoch progress; set verbose=0 to suppress logging
    verbose=1,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2)

Visualize the model's training progress using the stats stored in the `history` object.

In [None]:
hist = pd.DataFrame(history.history)
hist['epoch'] = history.epoch
hist.tail()

In [None]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.ylim([0, 100])
  plt.xlabel('Epoch')
  plt.ylabel('Error [AGE]')
  plt.legend()
  plt.grid(True)

In [None]:
plot_loss(history)

Collect the results on the test set, for later:

In [None]:
test_results = {}

test_results['hippocampus_model'] = hippocampus_model.evaluate(
    test_features['Hippocampus'],
    test_labels, verbose=0)

Since this is a single variable regression it's easy to look at the model's predictions as a function of the input:

Q: **(1 point)** Replace the hard-coded constants with the min & max of this variable to work with other variables without changing it.

In [None]:
# tf.linspace expects floating-point endpoints
x = tf.linspace(3000.0, 11000.0, 100)
y = hippocampus_model.predict(x)

Q: **(1 point)** The feature name 'Hippocampus' and the label name 'AGE' should be replaced with variables, so that exploring different variables is easy and does not require changes every time.

In [None]:
def plot_hippocampus(x, y):
  plt.scatter(train_features['Hippocampus'], train_labels, label='Data')
  plt.plot(x, y, color='k', label='Predictions')
  plt.xlabel('Hippocampus')
  plt.ylabel('AGE')
  plt.legend()

In [None]:
plot_hippocampus(x,y)

### 6.2. Multiple Variables (Features)

You can use an almost identical setup to make predictions based on multiple inputs. This model still does the same $y = mx+b$ except that $m$ is a matrix and $b$ is a vector.
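
The shapes involved can be sketched in NumPy. The batch of 10 examples, the 3 features, and the random weights below are assumptions for illustration; the real values come from the trained `Dense` layer.

```python
import numpy as np

rng = np.random.default_rng(0)

num_examples, num_features = 10, 3
x = rng.normal(size=(num_examples, num_features))  # batch of normalized inputs

m = rng.normal(size=(num_features, 1))  # weight matrix, one row per feature
b = np.zeros(1)                         # bias vector, one entry per output unit

y = x @ m + b  # one prediction per example
print(y.shape)
```

With `units=1` the weight matrix has a single column, so each example yields exactly one output.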

This time use the `Normalization` layer that was adapted to the whole dataset.

In [None]:
linear_model = tf.keras.Sequential([
    normalizer,
    layers.Dense(units=1)
])

linear_model.summary()

When you call this model on a batch of inputs, it produces `units=1` outputs for each example.

In [None]:
linear_model.predict(train_features[:10])

When you call the model, its weight matrices will be built. Now you can see that the `kernel` (the $m$ in $y=mx+b$) has a shape of `(3, 1)`, one row for each of the three selected features.

In [None]:
linear_model.layers[1].kernel

Use the same `compile` and `fit` calls as for the single input `hippocampus` model:

In [None]:
linear_model.compile(
    optimizer=tf.optimizers.Adam(learning_rate=0.1),
    loss='mean_absolute_error')

In [None]:
%%time
history = linear_model.fit(
    train_features, train_labels, 
    epochs=100,
    # suppress logging
    verbose=0,
    # Calculate validation results on 20% of the training data
    validation_split = 0.2)

Plot the loss to check whether using all the inputs achieves a lower training and validation error than the single-input `Hippocampus` model:

In [None]:
plot_loss(history)

Collect the results on the test set, for later:

In [None]:
test_results['linear_model'] = linear_model.evaluate(
    test_features, test_labels, verbose=0)

In [None]:
test_predictions = linear_model.predict(test_features).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [AGE]')
plt.ylabel('Predictions [AGE]')
lims = [60, 90]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)


Q: **(6 points)** Explore different choices of `train_features` and label, and see which relationships help you understand the ADNI dataset better. Show that you explored at least 3 different combinations of features and label, and write your reflection in 200 words or more.

# Extra Credit: A DNN regression

## Instructions

The DNN regression is for extra credit. You don't have to do the parts below.

Q: **(Extra Credit = 3 points)** If you would like to explore, go ahead, compare the results with the linear regression, and write a reflection in 200 words.

The previous section implemented linear models for single and multiple inputs.

This section implements single-input and multiple-input DNN models. The code is basically the same except the model is expanded to include some "hidden" non-linear layers. The word "hidden" here just means not directly connected to the inputs or outputs.

These models will contain a few more layers than the linear model:

* The normalization layer;
* Two hidden, nonlinear, `Dense` layers using the `relu` nonlinearity; and
* A linear single-output layer.
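
A rough NumPy sketch of the forward pass through such a stack is shown below. The weights are random and untrained, and the batch of 5 examples with 3 features is an assumption for illustration; the hidden width of 64 mirrors `build_and_compile_model`.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: keep positive values, zero out the rest."""
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))  # 5 examples, 3 normalized features

# Random, untrained weights for two hidden layers and one output layer.
w1, b1 = rng.normal(size=(3, 64)), np.zeros(64)
w2, b2 = rng.normal(size=(64, 64)), np.zeros(64)
w3, b3 = rng.normal(size=(64, 1)), np.zeros(1)

h1 = relu(x @ w1 + b1)   # first hidden layer
h2 = relu(h1 @ w2 + b2)  # second hidden layer
y = h2 @ w3 + b3         # linear single-output layer
print(y.shape)
```

Without the `relu` calls the three matrix products would collapse into a single linear map, which is why the hidden layers need a nonlinearity.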

Both will use the same training procedure so the `compile` method is included in the `build_and_compile_model` function below.

In [None]:
def build_and_compile_model(norm):
  model = keras.Sequential([
      norm,
      layers.Dense(64, activation='relu'),
      layers.Dense(64, activation='relu'),
      layers.Dense(1)
  ])

  model.compile(loss='mean_absolute_error',
                optimizer=tf.keras.optimizers.Adam(0.001))
  return model

## A. Train the Model

### A1. One Variable

Start with a DNN model for a single input, "hippocampus":

In [None]:
dnn_hippocampus_model = build_and_compile_model(hippocampus_normalizer)

This model has quite a few more trainable parameters than the linear models.

In [None]:
dnn_hippocampus_model.summary()

Train the model:

In [None]:
%%time
history = dnn_hippocampus_model.fit(
    train_features['Hippocampus'], train_labels,
    validation_split=0.2,
    verbose=0, epochs=100)

This model does slightly better than the linear-hippocampus model.

In [None]:
plot_loss(history)

If you plot the predictions as a function of `hippocampus`, you'll see how this model takes advantage of the nonlinearity provided by the hidden layers:

In [None]:
x = tf.linspace(3000.0, 11000, 100)
y = dnn_hippocampus_model.predict(x)

In [None]:
plot_hippocampus(x, y)

Collect the results on the test set, for later:

In [None]:
test_results['dnn_hippocampus_model'] = dnn_hippocampus_model.evaluate(
    test_features['Hippocampus'], test_labels,
    verbose=0)

### A2. Full Model

If you repeat this process using all the inputs it slightly improves the performance on the validation dataset.

In [None]:
dnn_model = build_and_compile_model(normalizer)
dnn_model.summary()

In [None]:
%%time
history = dnn_model.fit(
    train_features, train_labels,
    validation_split=0.2,
    verbose=0, epochs=100)

In [None]:
plot_loss(history)

Collect the results on the test set:

In [None]:
test_results['dnn_model'] = dnn_model.evaluate(test_features, test_labels, verbose=0)

## B. Performance

Now that all the models are trained, check the test-set performance and see how they did:

In [None]:
pd.DataFrame(test_results, index=['Mean absolute error [AGE]']).T

These results match the validation error seen during training.

## C. Make Predictions

Finally, have a look at the errors made by the model when making predictions on the test set:

In [None]:
test_predictions = dnn_model.predict(test_features).flatten()

a = plt.axes(aspect='equal')
plt.scatter(test_labels, test_predictions)
plt.xlabel('True Values [AGE]')
plt.ylabel('Predictions [AGE]')
lims = [60, 90]
plt.xlim(lims)
plt.ylim(lims)
_ = plt.plot(lims, lims)


It looks like the model predicts reasonably well. 

Now take a look at the error distribution:

In [None]:
error = test_predictions - test_labels
plt.hist(error, bins=25)
plt.xlabel('Prediction Error [AGE]')
_ = plt.ylabel('Count')

If you're happy with the model, save it for later use:

In [None]:
dnn_model.save('dnn_model')

If you reload the model, it gives identical output:

In [None]:
reloaded = tf.keras.models.load_model('dnn_model')

test_results['reloaded'] = reloaded.evaluate(
    test_features, test_labels, verbose=0)

In [None]:
pd.DataFrame(test_results, index=['Mean absolute error [AGE]']).T

# Conclusion

This notebook introduced a few techniques to handle a regression problem. Here are a few more tips that may help:

* [Mean Squared Error (MSE)](https://www.tensorflow.org/api_docs/python/tf/losses/MeanSquaredError) and [Mean Absolute Error (MAE)](https://www.tensorflow.org/api_docs/python/tf/losses/MeanAbsoluteError) are common loss functions used for regression problems. Mean Absolute Error is less sensitive to outliers. Different loss functions are used for classification problems.
* Similarly, evaluation metrics used for regression differ from classification.
* When numeric input data features have values with different ranges, each feature should be scaled independently to the same range.
* Overfitting is a common problem for DNN models, though it wasn't a problem in this tutorial. See the [overfit and underfit](overfit_and_underfit.ipynb) tutorial for more help with this.
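
The claim that Mean Absolute Error is less sensitive to outliers can be checked with a small NumPy sketch. The ages and predictions below are made up for illustration.

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean squared error."""
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([70.0, 75.0, 80.0, 85.0])
y_pred = np.array([71.0, 74.0, 81.0, 84.0])          # every error is 1
y_pred_outlier = np.array([71.0, 74.0, 81.0, 65.0])  # one error of 20

# Without the outlier: MAE = 1.0, MSE = 1.0
print(mae(y_true, y_pred), mse(y_true, y_pred))

# With the outlier: MAE = 5.75, MSE = 100.75
print(mae(y_true, y_pred_outlier), mse(y_true, y_pred_outlier))
```

A single large error raises the MAE by a factor of about 6 but the MSE by a factor of about 100, because squaring amplifies large residuals.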
