# You are breaking my heart - exercises

This dataset contains information on 303 patients. Several medically relevant data are available (age, sex, cholesterol, resting blood pressure...). Our task is to predict the presence of heart disease (column "target", 0 means healty, 1 means sick).

This dataset is described in detail:

* on Kaggle datasets: https://www.kaggle.com/ronitf/heart-disease-uci
* on its original webpage: https://archive.ics.uci.edu/ml/datasets/Heart+Disease

I've downloaded a copy of the data and made it available at the following url:

In [1]:
DATASET_URL = 'https://raw.githubusercontent.com/ne1s0n/coding_excercises/master/data/datasets_33180_43520_heart.csv'

# Data

In [2]:
import pandas

#pandas can read a csv directly from a url
heart_data = pandas.read_csv(DATASET_URL)
print(heart_data)

     age  sex  cp  trestbps  chol  fbs  ...  exang  oldpeak  slope  ca  thal  target
0     63    1   3       145   233    1  ...      0      2.3      0   0     1       1
1     37    1   2       130   250    0  ...      0      3.5      0   0     2       1
2     41    0   1       130   204    0  ...      0      1.4      2   0     2       1
3     56    1   1       120   236    0  ...      0      0.8      2   0     2       1
4     57    0   0       120   354    0  ...      1      0.6      2   0     2       1
..   ...  ...  ..       ...   ...  ...  ...    ...      ...    ...  ..   ...     ...
298   57    0   0       140   241    0  ...      1      0.2      1   0     3       0
299   45    1   3       110   264    0  ...      0      1.2      1   0     3       0
300   68    1   0       144   193    1  ...      0      3.4      1   2     3       0
301   57    1   0       130   131    0  ...      1      1.2      1   1     3       0
302   57    0   1       130   236    0  ...      0      0.0      

In [3]:
#splitting features and target
features = heart_data.iloc[:,:-1]
target = heart_data.iloc[:,-1]

In [4]:
#take a look at what we have done
print(heart_data.columns)
print(features.shape)
print(target.shape) #beware of rank 1 arrays

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal', 'target'],
      dtype='object')
(303, 13)
(303,)


## Train and Validation sets

In [5]:
#we want to have the same proportion of classes in both train and validation sets
from sklearn.model_selection import StratifiedShuffleSplit

#building a StratifiedShuffleSplit object (sss among friends) with 20% data
#assigned to validation set (here called "test")
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)

#the .split() method returns (an iterable over) two lists which can be
#used to index the samples that go into train and validation sets
for train_index, val_index in sss.split(features, target):
    features_train = features.iloc[train_index, :]
    features_val   = features.iloc[val_index, :]
    target_train   = target[train_index]
    target_val     = target[val_index]
    
#let's print some shapes to get an idea of the resulting data structure
print(features_train.shape)
print(features_val.shape)
print(target_train.shape)
print(target_val.shape)

(242, 13)
(61, 13)
(242,)
(61,)


# EXCERCISES!

The code above crossvalidated the effect of having different units in a single layer. We now want to explore the effect of using more than one layer.

## Exercise 1: expand the network

In the lesson, in section "Improvement, better model" we declared a simple, single-layer model. Let's do something bigger.

**ASSIGNMENT**: you are required to declare a new model with two layers. The first layer will have 10 units, the second 5 units. There will also be the final, output layer, with sigmoid activation function.

In [6]:
######## OLD CODE ########
#old code, put here for your reference:

if False:
  from keras.models import Sequential
  from keras.layers import Dense

  # 2-class logistic regression in Keras
  model2 = Sequential()
  model2.add(Dense(10, activation='relu', input_dim=features_train.shape[1]))
  model2.add(Dense(1, activation='sigmoid'))

######## YOUR CODE HERE ########
from keras.models import Sequential
from keras.layers import Dense

model4 = Sequential()
model4.add(Dense(10, activation='relu', input_dim=features_train.shape[1]))
model4.add(Dense(5, activation='relu'))
model4.add(Dense(1, activation='sigmoid'))
################################

You have now defined the architecture. Let's take a look at it.

**ASSIGNMENT** invoke the [.summary()](https://keras.io/api/models/model/#summary-method) built-in method of your model object. Verify that the resulting network has 201 trainable parameters.

In [7]:
######## YOUR CODE HERE ########
model4.summary()
################################

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 10)                140       
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 55        
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 6         
Total params: 201
Trainable params: 201
Non-trainable params: 0
_________________________________________________________________


## Excercise 2: expand the network, programmatically

We are preparing the terrain to do a proper grid-search crossvalidation. This means we aim to investigate the effect of combining a different number of layers and of units per layer. To do so we need a function that, give the number of hidden layers and the number of units per layer, returns a compiled model of the required topography.

**ASSIGNMENT** define a function `build_model` with three input parameters: `n_layers`, `n_units`, `input_size`. The function internally will declare a sequential model of the required shape. ATTENTION: the first layer needs special treatment.

In [8]:
######## YOUR CODE HERE ########
def build_model(n_layers, n_units, input_size):
  #declaring a local model
  m = Sequential()

  #a loop that goes l=0, l=1, l=2, ..., l=(n_layers-1)
  for l in range(n_layers):
    #are we doing the first layer? if yes the declaration has an extra param
    if l == 0:
      m.add(Dense(units = n_units, activation='relu', input_dim=input_size))
    else:
      m.add(Dense(units = n_units, activation='relu'))

  #adding the output layer
  m.add(Dense(1, activation='sigmoid'))
  
  #returning the declared model
  return(m)
################################

Let's use the funtion you just declared.

**ASSIGNMENT**: invoke `build_model` with the following parameters:

- `n_layers` = 2
- `n_units` = 5
- `input_size` = features_train.shape[1]

Verify that the resulting number of trainable parameters is 106.

In [9]:
######## YOUR CODE HERE ########
build_model(2, 5, features_train.shape[1]).summary()
################################

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 5)                 70        
_________________________________________________________________
dense_4 (Dense)              (None, 5)                 30        
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 6         
Total params: 106
Trainable params: 106
Non-trainable params: 0
_________________________________________________________________


## Excercise 3: explore the hyperparameters

We want to explore the effects of having a different number of layers and of units per layer. In particular we want to investigate:

* number of layers: 1 (as done above, not counting the output layer), 2, 3
* number of units per layer: 2, 4

This brings to a total of 6 combinations.

**ASSIGNMENT**: write a loop that, for each combination of layers and units, trains a network on the available feature and validation sets. Inside the loop, once the model is trained, print the train and validation losses.

For compilation/training, use the following:

* optimizer: rmsprop
* loss: binary_crossentropy
* epochs: 20
* verbose=0 (or not, you decide)

In [10]:
#we want to study the combination of these parameters
layers_list = [1, 2, 3]
units_list = [2, 4]

#remember that the datasets have already been declared:
# - features_train
# - features_val
# - target_train
# - target_val

######## YOUR CODE HERE ########
#a double loop to explore the parameters
for layers in layers_list:
  for units in units_list:
    #a little user interface
    print('Doing layers:' + str(layers) + ' units:', str(units))

    #getting the model
    m = build_model(layers, units, features_train.shape[1])

    #compiling
    m.compile(optimizer='rmsprop', loss='binary_crossentropy')
    
    #fitting
    history = m.fit(
        features_train, target_train, 
        epochs=20, 
        validation_data=(features_val, target_val), verbose=0)
    
    #let's just print the final loss
    print(' - train loss     : ' + str(history.history['loss'][-1]))
    print(' - validation loss: ' + str(history.history['val_loss'][-1]))
################################

Doing layers:1 units: 2
 - train loss     : 0.6908085942268372
 - validation loss: 0.6910037398338318
Doing layers:1 units: 4
 - train loss     : 6.628240585327148
 - validation loss: 6.547569274902344
Doing layers:2 units: 2
 - train loss     : 0.6909536123275757
 - validation loss: 0.6911321878433228
Doing layers:2 units: 4
 - train loss     : 0.693126916885376
 - validation loss: 0.6511917114257812
Doing layers:3 units: 2
 - train loss     : 1.6838568449020386
 - validation loss: 1.473851203918457
Doing layers:3 units: 4
 - train loss     : 0.7067516446113586
 - validation loss: 0.5840612649917603


## Exercise 4: a proper crossvalidation

It's now time to do a proper crossvalidation over our train dataset. For this exercise we ignore the old validation dataset (`features_val`, `target_val`) that could be used as a TEST set.

Our training set (`features_train`, `target_train`) needs to be sliced in five parts (i.e., folds). We'll then:

* use the folds number 1, 2, 3, and 4 for training, fold number 5 for validation
* use the folds number 1, 2, 3, and 5 for training, fold number 4 for validation
* use the folds number 1, 2, 4, and 5 for training, fold number 3 for validation
* and so forth

To slice the dataset we'll use the [StratifiedKFold class](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html) from `sklearn.model_selection` subpackage. We used the same object in the lesson, so feel free to refer to that code for reference. As a reminder, to use it porperly you'll need to:

* import the class
* declare a StratifiedKFold object telling the constructor how many folds (`n_splits`) you want
* loop over folds via the `.split()` method, which requires the data (features and target) as input and returns the indices of the current split

**ASSIGNMENT** modify the loop you wrote in the previous exercise so that the model is trained 5 times on different splits of (`features_train`, `target_train`). Print the loss and val_loss averaged over the folds.


In [15]:
######## YOUR CODE HERE ########
from sklearn.model_selection import StratifiedKFold

#same code as exercise 3, up to a point.

#a double loop to explore the parameters
for layers in layers_list:
  for units in units_list:
    #a little user interface
    print('Layers:' + str(layers) + ' units:', str(units))

    #getting the model
    m = build_model(layers, units, features_train.shape[1])

    #compiling
    m.compile(optimizer='rmsprop', loss='binary_crossentropy')
    
    #we'll put the losses for the five folds in these two container
    folds_loss = []
    folds_val_loss = []

    #a counter for the folds, useful for keeping track of what's going on
    f = 0

    #declaring the splitter
    skf = StratifiedKFold(n_splits = 5)

    #loop over folds
    for train_index_cv, val_index_cv in skf.split(features_train, target_train):
      #informing the user
      f += 1
      print('- fold: ' + str(f))

      #extracting the data for this fold
      features_train_cv = features_train.iloc[train_index_cv, :]
      features_val_cv   = features_train.iloc[val_index_cv, :]
      target_train_cv   = target_train.iloc[train_index_cv]
      target_val_cv     = target_train.iloc[val_index_cv]

      #fitting
      history = m.fit(
          features_train_cv, target_train_cv, 
          epochs=20, 
          validation_data=(features_val_cv, target_val_cv), verbose=0)
    
      #we just store the losses at the last epoch
      folds_loss.append(history.history['loss'][-1])
      folds_val_loss.append(history.history['val_loss'][-1])
    
    #we have finished the folds, it's time to print the averaged results
    print(' - train loss     : ' + str(sum(folds_loss) / len(folds_loss)))
    print(' - validation loss: ' + str(sum(folds_val_loss) / len(folds_val_loss)))
################################


Layers:1 units: 2
- fold: 1
- fold: 2
- fold: 3
- fold: 4
- fold: 5
 - train loss     : 1.7836564779281616
 - validation loss: 1.7460849404335022
Layers:1 units: 4
- fold: 1
- fold: 2
- fold: 3
- fold: 4
- fold: 5
 - train loss     : 1.7561102509498596
 - validation loss: 1.2965580701828003
Layers:2 units: 2
- fold: 1
- fold: 2
- fold: 3
- fold: 4
- fold: 5
 - train loss     : 0.6908377051353455
 - validation loss: 0.690997838973999
Layers:2 units: 4
- fold: 1
- fold: 2
- fold: 3
- fold: 4
- fold: 5
 - train loss     : 0.810384726524353
 - validation loss: 0.769555139541626
Layers:3 units: 2
- fold: 1
- fold: 2
- fold: 3
- fold: 4
- fold: 5
 - train loss     : 0.6429208397865296
 - validation loss: 0.6381661653518677
Layers:3 units: 4
- fold: 1
- fold: 2
- fold: 3
- fold: 4
- fold: 5
 - train loss     : 0.6670983672142029
 - validation loss: 0.7045563697814942
