# Model validation exercises
These exercises are about tuning a model and getting familiar with how this is done. Because quite a few steps are involved, this set of exercises requires you to fill in values in places that are marked with `# FILL IN`.

### Warm up exercise

In order to perform tuning, we need to be able to create a parameter grid. There is a built-in class for this in scikit-learn, and this exercise is about understanding what it does.
The `ParameterGrid` class takes a dictionary as its argument. The dictionary should contain the names of the hyperparameters as keys and, for each key, a list of the values to try. If you want to know more about dictionaries in Python, see https://www.tutorialspoint.com/python3/python_dictionary.htm.
Let us see an example:

In [None]:
# Input to ParameterGrid class from sklearn
grid_inp = {'param0': [5,6,7], 'param1': [10,100]}

Putting this into the `ParameterGrid` class, we get an object containing all combinations for the hyperparameters.

In [None]:
# Import class to create parameter grid
from sklearn.model_selection import ParameterGrid

# Create a grid using the ParameterGrid class
grid = ParameterGrid(grid_inp)

# Print values
print("Parameter combinations:")
for elem in grid:
    print(elem)

Notice how the grid contains all combinations of `param0` and `param1`. Here `param0` and `param1` are placeholder names; with a `DecisionTreeClassifier`, a valid hyperparameter name would be, for example, `max_depth`.
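
As a minimal sketch of how such a grid is typically consumed, the example below builds a grid over two `DecisionTreeClassifier` hyperparameters and unpacks each combination into the constructor. The specific values are arbitrary assumptions chosen for illustration.

```python
from sklearn.model_selection import ParameterGrid
from sklearn.tree import DecisionTreeClassifier

# Hypothetical values, chosen only for illustration
tree_grid = ParameterGrid({'max_depth': [2, 4], 'min_samples_split': [2, 10]})

# The grid size is the product of the list lengths: 2 * 2 = 4
n_combinations = len(tree_grid)

# Each element is a dictionary that can be unpacked into the model constructor
models = [DecisionTreeClassifier(**params) for params in tree_grid]
for m in models:
    print(m.get_params()['max_depth'], m.get_params()['min_samples_split'])
```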

# Exercises in model validation
The first exercise is a bit long, since it involves the whole process of fitting and validating a model. Below you will find a code template with some bits and pieces missing. You are supposed to fill out the remaining lines. Everything that should be filled in is marked with a `# FILL IN` comment.

The first step is to load the data and create a validation set.
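
The split is random, so the validation accuracy can change from run to run. The sketch below uses a small made-up DataFrame (not the exercise files) to show that passing `random_state` to `train_test_split` is one way to make the split reproducible. The seed value 0 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data with the same column names as the exercise files
toy = pd.DataFrame({'a': np.arange(10.0),
                    'b': np.arange(10.0) * 2,
                    'group': [0, 1] * 5})

# Same seed -> same split
train_1, vali_1 = train_test_split(toy, test_size=0.30, random_state=0)
train_2, vali_2 = train_test_split(toy, test_size=0.30, random_state=0)

print(len(train_1), len(vali_1))          # 7 rows for training, 3 for validation
print(vali_1.index.equals(vali_2.index))  # True
```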

In [None]:
import numpy as np
import pandas as pd

# Load data
df_train = pd.read_csv('gaussian_2d_train.csv', index_col = 0)
df_test = pd.read_csv('gaussian_2d_test.csv', index_col = 0)

# sklearn has a built in tool to split data.
# Here it is used to split the training set into a training and validation set
from sklearn.model_selection import train_test_split
train, vali = train_test_split(df_train, test_size = 0.30)

# Split into features and labels
X_train = train[['a','b']]
y_train = train['group']
X_vali = vali[['a','b']]
y_vali = vali['group']

Now we need to create a grid of hyperparameters for tuning the model. Create a dictionary with values for `max_depth`. Don't use more than three values, so that it doesn't take too long. Afterwards you can try adding `min_samples_split` to the parameter grid. If you are in doubt, see the example in the warm up exercise at the top of this notebook.

In [None]:
# Import class to create parameter grid
from sklearn.model_selection import ParameterGrid

# ParameterGrid input
param_grid = {} # FILL IN

# Create a grid using the ParameterGrid class
grid = ParameterGrid() # FILL IN

Initialize the model, and create a `DataFrame` with two columns, one for the accuracy and one for the corresponding `max_depth`. The number of rows should be equal to the size of the parameter grid. Afterwards fill out the values needed for the loop to run through different hyperparameters and to fit and predict.
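
Accuracy is the fraction of predictions that match the true labels. A minimal sketch on made-up label arrays shows that the hand-computed fraction agrees with `accuracy_score` from `sklearn.metrics`. The arrays are illustrative assumptions, not exercise data.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels and predictions for illustration
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# Fraction of positions where prediction and label agree
acc_manual = np.mean(y_pred == y_true)

# The same quantity computed by scikit-learn
acc_sklearn = accuracy_score(y_true, y_pred)

print(acc_manual, acc_sklearn)  # both 0.75
```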

In [None]:
# Import model
from sklearn.tree import DecisionTreeClassifier

# Create a DataFrame to store results
res = pd.DataFrame(np.zeros(( , )), columns = ['Acc', 'max_depth'])# FILL IN

# Loop over the grid
for row, elem in enumerate(grid):
    
    # Initialize model, set the correct max_depth
    model = DecisionTreeClassifier(max_depth = )# FILL IN
    
    # Fit model on training data
    model.fit(X = , y = )# FILL IN
    
    # Predict on validation data
    pred = model.predict(X = )# FILL IN
    
    # Calculate accuracy
    accuracy = # FILL IN
    
    # Store results
    res.iloc[row, :] = [accuracy, elem['max_depth']] # For min_samples_split: elem['min_samples_split']

Find the best value of `max_depth` in `res` by looking at which one yielded the best accuracy. Afterwards refit the model with the best hyperparameter on both the training and validation data, and predict on the test data.
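
One possible way to pick the best row is `idxmax`, which returns the index label of the first maximum. The sketch below uses a made-up results table, and the accuracies are invented for illustration. When several rows tie, taking the first one is an arbitrary tie-breaking rule.

```python
import pandas as pd

# Made-up results table with the same columns as `res`
toy_res = pd.DataFrame({'Acc': [0.80, 0.90, 0.90],
                        'max_depth': [1.0, 2.0, 3.0]})

# Index label of the first row with the highest accuracy
best_row = toy_res['Acc'].idxmax()

# The column is stored as float, so cast back to int before passing it to the model
best_depth = int(toy_res.loc[best_row, 'max_depth'])
print(best_depth)  # 2
```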

In [None]:
# Get best hyperparameters
max_score = res['Acc'].max()
best_rows = res.loc[res['Acc'] == max_score, :]
best_max_d = # FILL IN

# Retrain model with best param on full training data
model = DecisionTreeClassifier(max_depth = best_max_d)

# Fit model
model.fit(X = , y = )# FILL IN

# Predict on test
pred_test = model.predict(X = )# FILL IN

# Accuracy
print("Accuracy on test data is " + )# FILL IN

In [None]:
#ANS
import numpy as np
import pandas as pd

# Load data
df_train = pd.read_csv('gaussian_2d_train.csv', index_col = 0)
df_test = pd.read_csv('gaussian_2d_test.csv', index_col = 0)

# sklearn has a built in tool to split data.
# Here it is used to split the training set into a training and validation set
from sklearn.model_selection import train_test_split
train, vali = train_test_split(df_train, test_size = 0.30)

# Split into features and labels
X_train = train[['a','b']]
y_train = train['group']
X_vali = vali[['a','b']]
y_vali = vali['group']

# Import class to create parameter grid
from sklearn.model_selection import ParameterGrid

# ParameterGrid input
param_grid = {'max_depth': [1,2,3]}

# Create a grid using the ParameterGrid class
grid = ParameterGrid(param_grid)

# Import model
from sklearn.tree import DecisionTreeClassifier

# Create a DataFrame to store results
res = pd.DataFrame(np.zeros((len(grid), 2)), columns = ['Acc', 'max_depth'])

# Loop over the grid
for row, elem in enumerate(grid):
    
    print(elem)
    
    # Initialize model
    model = DecisionTreeClassifier(max_depth = elem['max_depth'])
    
    # Fit model on training data
    model.fit(X = X_train, y = y_train)
    
    # Predict on validation data
    pred = model.predict(X = X_vali)
    
    # Calculate accuracy
    accuracy = np.sum(pred == y_vali)/pred.shape[0]
    
    # Store results
    res.iloc[row, 0] = accuracy
    res.iloc[row, 1] = elem['max_depth']

print(res)

# Get best hyperparameters
max_score = res['Acc'].max()
best_rows = res.loc[res['Acc'] == max_score, :]
best_max_d = int(best_rows.iloc[0, 1])

# Retrain model with best param on full training data
model = DecisionTreeClassifier(max_depth = best_max_d)

# Fit model
model.fit(X = df_train[['a','b']], y = df_train['group'])

# Predict on test
pred_test = model.predict(X = df_test[['a','b']])

# Accuracy
print("Accuracy on test data is " + str(np.sum(pred_test == df_test['group'])/pred_test.shape[0]))

# Cross-validation
The next exercise is to try out cross-validation. Use `GridSearchCV` from `sklearn` to perform K-fold cross-validation. The `GridSearchCV` object fits the model on the different folds automatically and identifies the best set of hyperparameters. The best set is stored in the `best_params_` attribute, and the model refit with those hyperparameters on the whole training set is available as `best_estimator_`.
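
As a rough sketch of what a fitted `GridSearchCV` object exposes, the example below runs a search on made-up data and prints `best_params_`, `best_score_` and `best_estimator_`. The data and the grid values are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Made-up two-feature data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Hypothetical grid, chosen only for illustration
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [1, 2]}, cv=5)
search.fit(X_toy, y_toy)

print(search.best_params_)     # dictionary with the winning hyperparameters
print(search.best_score_)      # mean cross-validated accuracy of the winner
print(search.best_estimator_)  # model refit on all the data passed to fit
```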

In [None]:
import pandas as pd
import numpy as np

# Import grid search cross validation
from sklearn.model_selection import GridSearchCV

# Import model
from sklearn.tree import DecisionTreeClassifier

# Initialize model
mod = DecisionTreeClassifier()

# Load data
df_train = pd.read_csv('gaussian_2d_train.csv', index_col = 0)
df_test = pd.read_csv('gaussian_2d_test.csv', index_col = 0)

# Create a parameter grid
param_grid = {} # FILL IN

# Initialize grid search CV
gs_cv = GridSearchCV(mod, param_grid, cv = 5)

# Start cross validation with the fit method
# FILL IN

# Predict on the test set using the predict method
pred = # FILL IN

# Calculate accuracy
print("Accuracy on test set: {0}".format())# FILL IN

In [None]:
#ANS
import pandas as pd
import numpy as np

# Import grid search cross validation
from sklearn.model_selection import GridSearchCV

# Import model
from sklearn.tree import DecisionTreeClassifier

# Initialize model
mod = DecisionTreeClassifier()

# Load data
df_train = pd.read_csv('gaussian_2d_train.csv', index_col = 0)
df_test = pd.read_csv('gaussian_2d_test.csv', index_col = 0)

# Create a parameter grid
param_grid = {'max_depth': [1,2,3]}

# Initialize grid search CV
gs_cv = GridSearchCV(mod, param_grid, cv = 5)

# Start cross validation
gs_cv.fit(X = df_train[['a','b']], y = df_train['group'])

# Predict on the test set
pred = gs_cv.predict(X = df_test[['a', 'b']])

# Calculate accuracy
print("Accuracy on test set: {0}".format(np.sum(pred == df_test['group'])/pred.shape[0]))

# Bonus exercises

## Custom cross-validation
The exercise is about creating your own cross-validation function. The function should take an initialized model, the training data as a pandas DataFrame, the name of the label column and a number `k` as arguments, where `k` indicates the number of folds to use. Use accuracy as the performance measure. You are free to do it in the way you see fit, but here is one way it could be done:

You could make use of the random number generator in numpy called `np.random.randint`. If you want to do 5-fold CV, you could generate a random integer from 0 to 4 for each row in the data frame. Afterwards you could fit a model on all rows assigned 0, 1, 2 or 3 and predict on those assigned 4. Repeat this with each of the other numbers as the hold-out set, and finally calculate the accuracy over all predictions.
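
A minimal sketch of the fold-assignment idea, assuming 5 folds and 20 rows (both numbers are arbitrary). With `np.random.randint` the fold sizes are random and can be uneven; shuffling a repeating pattern of fold numbers is one alternative that keeps the sizes balanced. Neither is claimed to be the intended solution.

```python
import numpy as np

k, n_rows = 5, 20

# Option 1: independent random fold number per row (sizes vary between folds)
fold_random = np.random.randint(0, k, n_rows)  # upper bound k is exclusive

# Option 2: repeat 0..k-1 and shuffle, so every fold gets n_rows / k rows
rng = np.random.default_rng(0)
fold_balanced = rng.permutation(np.arange(n_rows) % k)

print(np.bincount(fold_random, minlength=k))    # uneven counts in general
print(np.bincount(fold_balanced, minlength=k))  # [4 4 4 4 4]

# Boolean masks select the hold-out rows and the rows to fit on
holdout_mask = fold_balanced == 4
fit_mask = ~holdout_mask
```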

Another hint: if you want to select all columns except one, you can do the following:
```python
df.loc[:, df.columns != EXC_COL]
```
where `EXC_COL` contains the name of the column to exclude. This gives the same result as the `drop` method.
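
A small sketch on a made-up DataFrame, comparing the boolean-mask selection with `drop`. The data is an assumption for illustration only.

```python
import pandas as pd

# Made-up frame with the same column names as the exercise data
toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0], 'group': [0, 1]})
EXC_COL = 'group'

# Boolean mask over the column names
features_mask = toy.loc[:, toy.columns != EXC_COL]

# Same result with drop
features_drop = toy.drop(columns=EXC_COL)

print(features_mask.equals(features_drop))  # True
```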

In [None]:
#ANS
def cv_itr(model, data, label_name, k):
    # Find number of rows in data
    rows = data.shape[0]
    
    # Generate a random fold number for each row, with values from 0 to k - 1.
    idx = np.random.randint(0, k, rows)
    
    # Create a vector to store predictions
    pred = np.zeros(rows)
    
    # Go through all folds
    for i in range(k):
        # Create dataset to fit on
        data_fit = data.loc[idx != i, :]
        
        # Create a dataset to predict on
        data_pred = data.loc[idx == i, :]
        
        # Fit model
        model.fit(X = data_fit.loc[:, data_fit.columns != label_name], y = data_fit[label_name])
        
        # Predict
        pred[idx == i] = model.predict(X = data_pred.loc[:, data_pred.columns != label_name])
    
    # Calculate accuracy
    acc = np.sum(pred == data[label_name])/rows
    
    return acc

cv_itr(DecisionTreeClassifier(), df_train, 'group', 20)

Wrap the function `cv_itr` that you just made into a new function that takes a model and a parameter grid. The function should change the hyperparameters of the model to a new value from the grid for each iteration. This can be done quite easily:
```python
model.set_params(max_depth = 2)
```
or if you are storing the hyperparameters in a dictionary:
```python
hyperparam = {"max_depth": 2}
model.set_params(**hyperparam)
```
In the loop a call to `cv_itr` should be made in order to perform cross-validation on the model for the current hyperparameter set. The function should return the DataFrame with hyperparameters and corresponding scores.
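
Note that `set_params` modifies the model in place and returns the same object, so the same model instance can be reused across iterations. A small sketch (the value `min_samples_split = 10` is an arbitrary assumption):

```python
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()

# set_params returns the estimator itself, so the call can be chained
returned = model.set_params(max_depth=2)

# Unpacking a dictionary sets several hyperparameters at once
model.set_params(**{'max_depth': 2, 'min_samples_split': 10})

print(returned is model)                        # True
print(model.get_params()['max_depth'])          # 2
print(model.get_params()['min_samples_split'])  # 10
```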

In [None]:
#ANS
from sklearn.model_selection import ParameterGrid

def custom_cv_acc(model, data, label_name, k, param_grid):
    # Create the actual grid
    grid = ParameterGrid(param_grid)
    
    # Count the number of hyper-parameter combinations in the grid
    n_runs = len(grid) # Or manually if you are cool: np.prod([len(val) for key, val in param_grid.items()])
    
    # Count number of hyperparameters
    n_hyp_param = len(param_grid)
    
    # Get names of hyper params:
    hyp_names = list(param_grid.keys())
    
    # Create dataframe to store hyper-param and score
    res = pd.DataFrame(np.zeros((n_runs, n_hyp_param + 1)), columns = ['acc'] + hyp_names)
    
    # Loop through hyper parameters
    for idx, g in enumerate(grid):
        # Set new hyperparameters on the model
        model.set_params(**g)
        
        # Use the function from last exercise to do the actual CV
        acc = cv_itr(model, data, label_name, k)
        
        # Store result
        res.iloc[idx, 0] = acc
        
        # Store hyper-param
        res.iloc[idx, 1:] = [g[x] for x in hyp_names]
    
    return res

# Run function
custom_cv_acc(DecisionTreeClassifier(), df_train, 'group', 4, param_grid)

## Tuning a model on your own
This exercise is similar to the "Fill-in" exercise, except that now you have to code the whole process yourself. It is recommended that you try to do this without looking too much at the exercises above. That way you can test how much you actually remember.

In this exercise we will be using the breast cancer dataset, which can be found in the file `breast.csv`. The target variable is called `diagnosis`.

You should go through the following steps:
- Load data
- Split into train and validation
- Initialize model and create parameter grid
- Loop over parameters while training on the train dataset and predicting on the validation dataset
- Evaluate the performance of the model on the validation set, to find the best hyperparameter set.
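
For the splitting step, here is a sketch on made-up data with a `diagnosis` column. The label values `'M'` and `'B'` and the class balance are assumptions, not read from `breast.csv`. Passing `stratify` to `train_test_split` keeps the class proportions the same in both parts, which can matter when the classes are imbalanced.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 70 rows labelled 'B' and 30 rows labelled 'M' (hypothetical values)
toy = pd.DataFrame({'feature': np.arange(100.0),
                    'diagnosis': ['B'] * 70 + ['M'] * 30})

# Stratify on the target so both parts keep the 70/30 class ratio
train, vali = train_test_split(toy, test_size=0.30,
                               stratify=toy['diagnosis'], random_state=0)

print(train['diagnosis'].value_counts().to_dict())
print(vali['diagnosis'].value_counts().to_dict())
```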

In [None]:
import pandas as pd
br = pd.read_csv("breast.csv")

In [None]:
#CONFIG
# Hide code tagged with #ANS
from IPython.display import HTML
HTML('''<script>
function code_hide() {
    var cells = IPython.notebook.get_cells()
    cells.forEach(function(x){ if(x.get_text().includes("#ANS")){
        if (x.get_text().includes("#CONFIG")){

        } else{
            x.input.hide()
            x.output_area.clear_output()
        }

        
    }
    })
}
function code_hide2() {
    var cells = IPython.notebook.get_cells();
    cells.forEach(function(x){
    if( x.cell_type != "markdown"){
        x.input.show()      
    }
    
        });
} 
$( document ).ready(code_hide);
$( document ).ready(code_hide2);
</script>
<form action="javascript:code_hide()"><input type="submit" value="Hide answers"></form>
<form action="javascript:code_hide2()"><input type="submit" value="Show answers"></form>''')