# You are breaking my heart - exercises

This dataset contains information on 303 patients. Several medically relevant variables are available (age, sex, cholesterol, resting blood pressure...). Our task is to predict the presence of heart disease (column "target": 0 means healthy, 1 means sick).

This dataset is described in detail:

* on Kaggle datasets: https://www.kaggle.com/ronitf/heart-disease-uci
* on its original webpage: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

I've downloaded a copy of the data and made it available at the following url:

In [None]:
DATASET_URL = 'https://raw.githubusercontent.com/ne1s0n/coding_excercises/master/data/datasets_33180_43520_heart.csv'

# Data

In [None]:
import pandas

#pandas can read a csv directly from a url
heart_data = pandas.read_csv(DATASET_URL)
print(heart_data)

In [None]:
#splitting features and target
features = heart_data.iloc[:,:-1]
target = heart_data.iloc[:,-1]

In [None]:
#take a look at what we have done
print(heart_data.columns)
print(features.shape)
print(target.shape) #beware of rank 1 arrays

## Train and Validation sets

In [None]:
#we want to have the same proportion of classes in both train and validation sets
from sklearn.model_selection import StratifiedShuffleSplit

#building a StratifiedShuffleSplit object (sss among friends) with 20% data
#assigned to validation set (here called "test")
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

#the .split() method returns (an iterable over) two lists which can be
#used to index the samples that go into train and validation sets
for train_index, val_index in sss.split(features, target):
    features_train = features.iloc[train_index, :]
    features_val   = features.iloc[val_index, :]
    #the indices are positional, so we use .iloc for the target too
    target_train   = target.iloc[train_index]
    target_val     = target.iloc[val_index]

#let's print some shapes to get an idea of the resulting data structure
print(features_train.shape)
print(features_val.shape)
print(target_train.shape)
print(target_val.shape)

# EXERCISES!

The previous code (`day3_code02 heart disease crossv.ipynb`) crossvalidated the effect of having different units in a single layer. We now want to explore the effect of using more than one layer.

## Exercise 1: expand the network

In the lesson, in section "Improvement, better model" we declared a simple, single-layer model. Let's do something bigger.

**ASSIGNMENT**: you are required to declare a new model with two layers. The first layer will have 10 units, the second 5 units. There will also be the final, output layer, with sigmoid activation function.

In [None]:
######## OLD CODE ########
#old code, copied here for your reference:

if False:
  from keras.models import Sequential
  from keras.layers import Dense

  # binary classifier with a single hidden layer in Keras
  model = Sequential()
  model.add(Dense(10, activation='relu', input_dim=features_train.shape[1]))
  model.add(Dense(1, activation='sigmoid'))

######## YOUR CODE HERE ########

################################

You have now defined the architecture. Let's take a look at it.

**ASSIGNMENT** invoke the [.summary()](https://keras.io/api/models/model/#summary-method) built-in method of your model object. Verify that the resulting network has 201 trainable parameters.
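
As a sanity check on the 201 figure, the parameter count of a stack of dense layers can be worked out by hand: each layer contributes `inputs * units` weights plus `units` biases. The sketch below assumes the dataset has 13 feature columns (every column except `target`). The helper name `dense_param_count` is made up for this illustration and is not part of Keras.

```python
def dense_param_count(layer_widths):
    """Count weights and biases in a stack of fully connected layers.

    layer_widths lists the input size followed by the units of each layer.
    """
    total = 0
    for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]):
        total += n_in * n_out + n_out  # weights + biases
    return total

# assumption: 13 input features, then 10 units, 5 units, 1 output unit
print(dense_param_count([13, 10, 5, 1]))  # 140 + 55 + 6 = 201
```

If `.summary()` reports a different number, the most likely cause is a different number of input columns or a missing layer.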

In [None]:
######## YOUR CODE HERE ########

################################

## Exercise 2: expand the network, programmatically

We are preparing the terrain to do a proper grid-search crossvalidation. This means we aim to investigate the effect of combining a different number of layers and of units per layer. To do so we need a function that, given the number of hidden layers and the number of units per layer, returns a compiled model of the required topology.

**ASSIGNMENT** define a function `build_model` with three input parameters: `n_layers`, `n_units`, `input_size`. The function internally will declare a sequential model of the required shape. ATTENTION: the first layer needs special treatment.

NOTE: Keep in mind the difference between hidden and total layers
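
To make the hidden/total distinction concrete, the following sketch lists the widths of a network with `n_layers` hidden layers without building any Keras object. It assumes a single sigmoid output unit, as in Exercise 1, and 13 input features. `layer_widths` is a hypothetical helper used only for illustration.

```python
def layer_widths(n_layers, n_units, input_size):
    """Return [input size, hidden widths..., output width]."""
    hidden = [n_units] * n_layers   # the hidden layers the caller asked for
    output = [1]                    # assumed: one sigmoid output unit
    return [input_size] + hidden + output

# assumption: 13 input features
widths = layer_widths(n_layers=2, n_units=5, input_size=13)
print(widths)            # [13, 5, 5, 1]
print(len(widths) - 1)   # 3 Dense layers in total: 2 hidden + 1 output
```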

In [None]:
######## YOUR CODE HERE ########

################################

Let's use the function you just declared.

**ASSIGNMENT**: invoke `build_model` with the following parameters:

- `n_layers` = 2
- `n_units` = 5
- `input_size` = features_train.shape[1]

Verify that the resulting number of trainable parameters is 106.

In [None]:
######## YOUR CODE HERE ########
model2 = build_model(n_layers=2, n_units=5, input_size=features_train.shape[1])
model2.summary()
################################

## Exercise 3: explore the hyperparameters

We want to explore the effects of having a different number of layers and of units per layer. In particular we want to investigate:

* number of layers: 1 (as done above, not counting the output layer), 2, 3
* number of units per layer: 2, 4

This gives a total of 6 combinations.

**ASSIGNMENT**: write a loop that, for each combination of layers and units, trains a network on the available training and validation sets. Inside the loop, once the model is trained, print the train and validation losses.
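
The grid of combinations can be enumerated on its own before any training happens. The sketch below uses `itertools.product` from the standard library, which is equivalent to two nested `for` loops. It only lists the pairs and does not train anything.

```python
from itertools import product

layers_list = [1, 2, 3]
units_list = [2, 4]

# every (number of hidden layers, units per layer) pair
combinations = list(product(layers_list, units_list))
for n_layers, n_units in combinations:
    print(n_layers, n_units)

print(len(combinations))  # 6
```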

For compilation/training, use the following:

* optimizer: rmsprop
* loss: binary_crossentropy
* epochs: 20
* verbose=0 (or not, you decide)

In [None]:
#we want to study the combination of these parameters
layers_list = [1, 2, 3]
units_list = [2, 4]

#remember that the datasets have already been declared:
# - features_train
# - features_val
# - target_train
# - target_val

######## YOUR CODE HERE ########
#a double loop to explore the parameters

################################

## Exercise 4: a proper crossvalidation

It's now time to do a proper crossvalidation over our train dataset. For this exercise we ignore the old validation dataset (`features_val`, `target_val`), which could instead be used as a TEST set.

Our training set (`features_train`, `target_train`) needs to be sliced in five parts (i.e., folds). We'll then:

* use the folds number 1, 2, 3, and 4 for training, fold number 5 for validation
* use the folds number 1, 2, 3, and 5 for training, fold number 4 for validation
* use the folds number 1, 2, 4, and 5 for training, fold number 3 for validation
* and so forth
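
The rotation described in the list above can be spelled out in a few lines of plain Python. This is only a sketch of the bookkeeping: the fold numbers are labels, and no data is split here.

```python
n_folds = 5
folds = list(range(1, n_folds + 1))  # folds are numbered 1..5, as in the list above

rotation = []
for val_fold in reversed(folds):  # validation fold goes 5, 4, 3, 2, 1
    train_folds = [f for f in folds if f != val_fold]
    rotation.append((train_folds, val_fold))
    print("train on", train_folds, "validate on", val_fold)
```

Each fold is used for validation exactly once, so every sample contributes to the validation loss one time.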

To slice the dataset we'll use the [StratifiedKFold class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) from the `sklearn.model_selection` subpackage. We used the same object in the lesson, so feel free to refer to that code for reference. As a reminder, to use it properly you'll need to:

* import the class
* declare a StratifiedKFold object telling the constructor how many folds (`n_splits`) you want
* loop over folds via the `.split()` method, which requires the data (features and target) as input and returns the indices of the current split
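
The three steps above can be sketched on a small toy dataset. The arrays `X` and `y` are made up for illustration and are not the heart data. The `shuffle` and `random_state` arguments are optional choices made here for reproducibility.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy data, made up for illustration: 20 samples, 2 features, balanced classes
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

val_sizes = []
for fold, (train_index, val_index) in enumerate(skf.split(X, y), start=1):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    val_sizes.append(len(val_index))
    # each validation fold keeps the 50/50 class balance of y
    print(fold, X_train.shape, X_val.shape, y_val.mean())
```

With the pandas objects used in this notebook, the returned indices are positional, so `.iloc[...]` is the matching accessor.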

**ASSIGNMENT** modify the loop you wrote in the previous exercise so that the model is trained 5 times on different splits of (`features_train`, `target_train`). Print the loss and val_loss averaged over the folds.


In [None]:
######## YOUR CODE HERE ########

################################