# Week 12 Exercises



## Ex 1:  Bias Variance 

-   Do the Bias and Variance terms (two numbers) in the Bias-Variance
    decomposition depend on the learning algorithm?

-   What is the Variance (in the Bias-Variance tradeoff) if we have a hypothesis
    set of size $1$, namely the constant model $h(x) = 2$? The learning
    algorithm always picks this hypothesis no matter the data.

-   What is the Variance (in the Bias-Variance tradeoff) if the simple
    hypothesis from the previous question is replaced by a very
    sophisticated hypothesis?

-   Assume the target function is a second-degree polynomial, and the
    input to your algorithm is always eleven distinct (noiseless) points. Your
    hypothesis set is the set of all degree-10 polynomials, and the
    learning algorithm returns the hypothesis with the best fit
    (minimizing the least squares error) given the data. What is the Bias and what
    is the Variance?
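
For reference, here is a reminder of the quantities these questions refer to, written in the notation of Ex 3 ($h_D$ is the hypothesis learned from data set $D$, $f$ is the target). This form assumes squared error and noiseless targets; check the lecture notes for the exact form used in the course.

$$
\begin{aligned}
\bar{g}(x) &= \mathbb{E}_D\left[h_D(x)\right] \\
\text{bias} &= \mathbb{E}_x\left[\left(\bar{g}(x) - f(x)\right)^2\right] \\
\text{var} &= \mathbb{E}_x\left[\mathbb{E}_D\left[\left(h_D(x) - \bar{g}(x)\right)^2\right]\right] \\
\mathbb{E}_D\left[E_{\text{out}}(h_D)\right] &= \text{bias} + \text{var}
\end{aligned}
$$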




## Ex 2: Regularization with Weight decay
If we use weight decay regularization ($\lambda \|w\|^2$) for some real number $\lambda$ in Linear Regression, what
happens to the optimal weight vector as $\lambda \rightarrow \infty$? (The cost is $\frac{1}{n} \|Xw - y\|^2 + \lambda \|w\|^2$.)

**Hard:** Can you say something about how the behaviour of the optimal solution $w$ changes as $\lambda$ decreases from $0$ towards $-\infty$? When does the optimal cost change from something finite to $-\infty$?
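
To build intuition for the first question, one option is to compute the minimizer for a few increasing values of $\lambda$ and watch its norm. The sketch below assumes $\lambda > 0$, where setting the gradient of the cost to zero gives $(X^\top X + n\lambda I)\,w = X^\top y$. The synthetic data, the helper name `weight_decay_solution`, and the chosen $\lambda$ values are made up for illustration.

```python
import numpy as np


def weight_decay_solution(X, y, lam):
    """Minimizer of (1/n)||Xw - y||^2 + lam * ||w||^2, assuming lam > 0."""
    n, d = X.shape
    # Zero gradient: (X^T X + n * lam * I) w = X^T y
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)


# Synthetic regression data (illustrative assumption, not from the exercise).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

lams = [0.01, 1.0, 100.0, 10000.0]
norms = [np.linalg.norm(weight_decay_solution(X, y, lam)) for lam in lams]
for lam, norm in zip(lams, norms):
    print(f"lambda = {lam:>8}: ||w|| = {norm:.6f}")
```

For negative $\lambda$ the matrix $X^\top X + n\lambda I$ need not be positive definite, so this formula is no longer guaranteed to give a minimizer; that is the regime the hard question asks about.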




## Ex 3: Bias Variance - Hard Exercise
Book Problem 2.24 part (a)

Short Version:
   
  - The target function is $f(x) = x^2$ and the cost is Least Squares.

  - Sample two points $x_1, x_2$ from $[-1, 1]$ uniformly at random to get the data set $D = \{(x_1, x_1^2), (x_2, x_2^2)\}$.

  - Use the hypothesis space $\{h(x) = ax +b\mid a,b\in\Bbb R\}$, i.e., lines. There are two parameters, $a$ and $b$.

  - Given a data set $D = \{(x_1, x_1^2), (x_2, x_2^2)\}$, the algorithm returns the line through these two points.

  - Your task is to write down an analytical expression for $\bar{g} = \mathbb{E}_D [h_D]$, where $h_D$ is the hypothesis learned on $D$.

**Step 1.** What is the in-sample error of $h_D$ and why?

**Step 2.** Given $D$, what are $a, b$ (defined by the line between $(x_1, x_1^2)$ and $(x_2, x_2^2)$)? Hint: $x_2^2- x_1^2 = (x_2-x_1)(x_2 + x_1)$.

**Step 3.** What is the expected value of the slope $a$ over $x_1$ and $x_2$?

**Step 4.** What is the expected value of the intercept $b$ over $x_1$ and $x_2$?




**More hints**
For the uniform distribution over $[-1,1]$ the mean is $0$.
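
Once you have an analytical expression, a quick simulation can serve as a sanity check. The sketch below is not part of the book problem. It draws many data sets, computes the line through the two points with the generic two-point formula, and averages the slopes and intercepts. The number of data sets `n_datasets` and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_datasets = 100_000  # arbitrary; more data sets give less sampling noise

# Each index i is one data set D: two inputs drawn uniformly from [-1, 1].
x1 = rng.uniform(-1.0, 1.0, size=n_datasets)
x2 = rng.uniform(-1.0, 1.0, size=n_datasets)
y1, y2 = x1**2, x2**2  # noiseless targets f(x) = x^2

# Line through the two points, using the generic two-point form.
keep = x1 != x2  # guard against a degenerate data set with x1 == x2
a = (y2[keep] - y1[keep]) / (x2[keep] - x1[keep])
b = y1[keep] - a * x1[keep]

# Averaging the parameters over data sets estimates g_bar(x) = E[a] x + E[b].
a_bar, b_bar = a.mean(), b.mean()
print(f"estimated g_bar(x) = {a_bar:.3f} * x + {b_bar:.3f}")
```

If your expressions for the expected slope and intercept are right, the printed estimates should match them up to sampling noise.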


## Ex 4: Bias Variance Experiment 
In this exercise you must redo the experiment shown in the lectures.
This exercise takes up quite a lot of space, so we have moved it to a separate notebook. Go to [BiasVariance Notebook](BiasVariance.ipynb).

## Ex 5: Grid Search For Regularization and Validation - Sklearn
In this exercise we will optimize a [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) using regularization and validation.
You must use the grid search module [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) from sklearn.

In the cell below we show an example of how to use the grid search module to test two different values of max_depth for a decision tree for wine classification.

Your job is to find good hyperparameters for decision trees for breast cancer detection.

### Task 1:
For the breast cancer data set, find the best (or a very good) combination of max_depth and min_samples_split (second cell below).

The **max_depth** parameter controls the maximum depth of the tree; the deeper the tree, the more complex the model.

The **min_samples_split** parameter controls the minimum number of elements a node must contain before the algorithm that constructs the tree is allowed to split it.
So if a subtree contains fewer than min_samples_split elements, it may not be split into a larger subtree by the algorithm.
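
To see what the two parameters do before searching over them, it can help to fit a few trees by hand and inspect their size. The sketch below is only an illustration: the three settings and `random_state=0` are arbitrary assumptions, and training accuracy is printed merely to show how model complexity changes, not as a validation score.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Illustrative settings: a shallow tree, an unrestricted tree,
# and a tree whose nodes need at least 100 samples to be split.
settings = [
    {"max_depth": 2, "min_samples_split": 2},
    {"max_depth": None, "min_samples_split": 2},
    {"max_depth": None, "min_samples_split": 100},
]

trees = []
for params in settings:
    tree = DecisionTreeClassifier(random_state=0, **params).fit(X, y)
    trees.append(tree)
    print(params,
          "-> depth:", tree.get_depth(),
          "leaves:", tree.get_n_leaves(),
          "training accuracy:", round(tree.score(X, y), 3))
```

Both parameters restrict how far the tree can grow, which is why they act as regularization here.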


### Task 2:
- How long does grid search validation take for $k$ hyperparameters when we test $d$ values for each hyperparameter, the training algorithm uses $f(n)$ time to train on $n$ data points, and we split the data into 5 parts?
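
When reasoning about the running time, it can help to count how many times the training algorithm is called for a concrete grid. The sketch below uses `ParameterGrid` from sklearn to enumerate the combinations. The grid values in `param_grid` and `n_folds = 5` are illustrative assumptions, not a prescribed setup.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: k = 2 hyperparameters with d = 3 values each.
param_grid = {"max_depth": [5, 10, 15], "min_samples_split": [2, 5, 10]}
n_folds = 5

# Every combination is trained once per fold during cross-validation.
n_combinations = len(ParameterGrid(param_grid))
n_fits = n_combinations * n_folds
print("combinations:", n_combinations, "cross-validation fits:", n_fits)
```

With its default `refit=True`, `GridSearchCV` additionally trains once more on the full data set using the best combination.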





In [2]:
from IPython.display import display  # builtin in notebooks; imported so the code also runs as a script
from sklearn.datasets import load_wine, load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def show_result(clf):
    """Show the cross-validation results, best mean validation score first."""
    df = pd.DataFrame(clf.cv_results_)
    df = df.sort_values('mean_test_score', ascending=False)
    display(df)
    print('best parameter found', clf.best_params_)


w_data = load_wine()
wine_data = w_data.data
wine_labels = w_data.target

# grid search validation
reg_parameters = {'max_depth': [1, 30]}  # dict with all parameter values we want to test
clf = GridSearchCV(DecisionTreeClassifier(), reg_parameters, cv=3, return_train_score=True)
clf.fit(wine_data, wine_labels)
# show the results
show_result(clf)



| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.00059 | 5.3e-05 | 0.000255 | 1e-06 | 30 | `{'max_depth': 30}` | 0.85 | 0.833333 | 0.913793 | 0.865169 | 0.03449 | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 0 | 0.000439 | 3.5e-05 | 0.000252 | 9e-06 | 1 | `{'max_depth': 1}` | 0.566667 | 0.583333 | 0.706897 | 0.617978 | 0.062196 | 2 | 0.677966 | 0.669492 | 0.691667 | 0.679708 | 0.009136 |


best parameter found {'max_depth': 30}


In [5]:
cancer_data = load_breast_cancer()
c_data = cancer_data.data
c_labels = cancer_data.target


def decisiontree_model_selection(train_data, labels):
    """Return a fitted GridSearchCV over max_depth and min_samples_split."""
    clf = None
    ### YOUR CODE HERE
    ### END CODE
    return clf


clf = decisiontree_model_selection(c_data, c_labels)
show_result(clf)


| index | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_max_depth | param_min_samples_split | params | split0_test_score | split1_test_score | split2_test_score | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004088 | 0.000395 | 0.00034 | 0.000117 | 5 | 2 | `{'max_depth': 5, 'min_samples_split': 2}` | 0.910526 | 0.947368 | 0.904762 | 0.920914 | 0.018878 | 1 | 0.994723 | 0.997361 | 0.989474 | 0.993853 | 0.003278 |
| 2 | 0.003262 | 0.000288 | 0.000207 | 4e-06 | 5 | 10 | `{'max_depth': 5, 'min_samples_split': 10}` | 0.9 | 0.957895 | 0.904762 | 0.920914 | 0.026256 | 1 | 0.984169 | 0.992084 | 0.986842 | 0.987698 | 0.003288 |
| 4 | 0.003442 | 0.000368 | 0.000211 | 4e-06 | 10 | 5 | `{'max_depth': 10, 'min_samples_split': 5}` | 0.910526 | 0.952632 | 0.899471 | 0.920914 | 0.022906 | 1 | 0.992084 | 0.992084 | 0.997368 | 0.993846 | 0.002491 |
| 6 | 0.003804 | 0.000583 | 0.000285 | 6.9e-05 | 15 | 2 | `{'max_depth': 15, 'min_samples_split': 2}` | 0.905263 | 0.947368 | 0.904762 | 0.919156 | 0.019976 | 4 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 1 | 0.003765 | 0.00025 | 0.000242 | 1e-05 | 5 | 5 | `{'max_depth': 5, 'min_samples_split': 5}` | 0.894737 | 0.952632 | 0.904762 | 0.917399 | 0.025279 | 5 | 0.986807 | 0.992084 | 0.986842 | 0.988578 | 0.002479 |
| 5 | 0.003425 | 0.000399 | 0.000209 | 2e-06 | 10 | 10 | `{'max_depth': 10, 'min_samples_split': 10}` | 0.894737 | 0.952632 | 0.904762 | 0.917399 | 0.025279 | 5 | 0.989446 | 0.992084 | 0.992105 | 0.991212 | 0.001249 |
| 8 | 0.003505 | 0.000491 | 0.000208 | 3e-06 | 15 | 10 | `{'max_depth': 15, 'min_samples_split': 10}` | 0.884211 | 0.957895 | 0.910053 | 0.917399 | 0.03055 | 5 | 0.989446 | 0.992084 | 0.992105 | 0.991212 | 0.001249 |
| 3 | 0.003433 | 0.000379 | 0.000209 | 1e-06 | 10 | 2 | `{'max_depth': 10, 'min_samples_split': 2}` | 0.894737 | 0.947368 | 0.883598 | 0.908612 | 0.027815 | 8 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 |
| 7 | 0.003638 | 0.000433 | 0.000255 | 3.5e-05 | 15 | 5 | `{'max_depth': 15, 'min_samples_split': 5}` | 0.894737 | 0.942105 | 0.851852 | 0.896309 | 0.036846 | 9 | 0.992084 | 0.992084 | 0.997368 | 0.993846 | 0.002491 |


best parameter found {'max_depth': 5, 'min_samples_split': 2}
