# Week 5 Exercises



## Ex 1: Break Points and Growth Functions 

-   Is there always a break point for a finite hypothesis set of $n$
    hypotheses? If so, can you give an upper bound? What is the growth
    function?

-   Does the set of all functions have a break point? What is its growth
    function?

-   What is the (smallest) break point for the hypothesis set consisting
    of circles centered around $(0,0)$? For a given circle, the
    hypothesis returns $1$ for points inside the circle and $-1$ for
    points outside. What is the growth function?

-   What if we move to centered balls in the 3-dimensional space
    ${{\mathbb R}}^3$? Or in general $d$-dimensional space
    ${{\mathbb R}}^d$ (hyperspheres)?

-   Show that the growth function for a singleton hypothesis class $H = \{h\}$ is 1.
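
To experiment with growth functions, it can help to count dichotomies by brute force. The sketch below does this for positive rays $h_a(x) = \textrm{sign}(x - a)$ on the real line, a class that is not part of this exercise and is used here only as an illustration. The function name `positive_ray_dichotomies` is made up for this example, and the points are assumed to be distinct.

```python
def positive_ray_dichotomies(points):
    """Return the set of dichotomies positive rays produce on distinct 1-D points."""
    xs = sorted(points)
    # One threshold below all points, one between each neighboring pair,
    # and one above all points covers every distinct behavior.
    thresholds = [xs[0] - 1.0]
    thresholds += [(a + b) / 2.0 for a, b in zip(xs, xs[1:])]
    thresholds += [xs[-1] + 1.0]
    return {tuple(1 if x > t else -1 for x in points) for t in thresholds}


# Compare the number of realized dichotomies with the 2^n possible ones.
for n in range(1, 6):
    pts = [float(i) for i in range(n)]
    print(n, len(positive_ray_dichotomies(pts)), 2 ** n)
```

The first $n$ where the count falls below $2^n$ for every point set is a break point. The same counting idea can be adapted to the classes in the questions above.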




## Ex 2: VC Dimension 

-   Does the VC dimension depend on the learning algorithm or the actual
    data set given?

-   Does the VC dimension depend on the probability distribution generating
    the data (not the labels)?

-   If $\mathcal{H}_1 \subseteq \mathcal{H}_2$, is
    $VC(\mathcal{H}_1) \leq VC(\mathcal{H}_2)$?

-   Can you give an upper bound on the VC dimension of a finite set of
    $M$ hypotheses?

-   What is the VC Dimension for the hypothesis set consisting of
    circles centered around 0?

-   What if we move to balls in 3 dimensions, or in general to $d$
    dimensions (hyperspheres)?

-   What is the maximal possible VC dimension of the intersection of
    hypothesis sets $\mathcal{H}_1,\dots,\mathcal{H}_n$ with VC
    dimensions $v_1,\dots,v_n$?

-   For the same hypothesis sets as in the previous question, what is
    the minimal possible VC dimension of their union?

-   Show that the VC dimension of the hypothesis set consisting of axis-aligned rectangles in $\mathbb{R}^2$ is 4,
    i.e. find a set of 4 points that you can shatter and argue that no set of 5 points can be shattered.
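
A brute-force checker can be used to test candidate point sets for the rectangle question. The sketch below assumes closed rectangles that label points inside $+1$ and points outside $-1$, and distinct input points; the function name `rectangles_shatter` is made up for this example. It relies on the observation that a labeling is realizable exactly when the bounding box of the positive points contains no negative point.

```python
from itertools import product


def rectangles_shatter(points):
    """Return True if axis-aligned rectangles realize every labeling of `points`."""
    for labels in product([-1, 1], repeat=len(points)):
        positives = [p for p, y in zip(points, labels) if y == 1]
        if not positives:
            continue  # a rectangle placed away from all points labels everything -1
        x_lo = min(p[0] for p in positives)
        x_hi = max(p[0] for p in positives)
        y_lo = min(p[1] for p in positives)
        y_hi = max(p[1] for p in positives)
        # The bounding box is the smallest rectangle containing all positives,
        # so the labeling is realizable iff no negative point lies inside it.
        for p, y in zip(points, labels):
            if y == -1 and x_lo <= p[0] <= x_hi and y_lo <= p[1] <= y_hi:
                return False
    return True


print(rectangles_shatter([(0, 0), (1, 1)]))          # True
print(rectangles_shatter([(0, 0), (1, 1), (2, 2)]))  # False
```

A checker like this can confirm a proposed set of 4 points, but it cannot replace the argument that no set of 5 points works, since that claim is about all point sets.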
    



## Ex 3:  Book Exercise
### Exercise 1.11 in the [LFD] Book 
(These are the exercises inside the text, not the end-of-chapter problems; see page 25.)





## Ex 4: Regularization with Weight decay
If we use weight decay regularization ($\lambda \|w\|^2$) for some real number $\lambda$ in linear regression, what
happens to the optimal weight vector as $\lambda \rightarrow \infty$? (The cost is $\frac{1}{n} \|Xw - y\|^2 + \lambda \|w\|^2$.)
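
One way to check your answer numerically is the one-feature case, where the cost $\frac{1}{n}\sum_i (x_i w - y_i)^2 + \lambda w^2$ has the closed-form minimizer $w = \frac{\sum_i x_i y_i}{\sum_i x_i^2 + n\lambda}$ (set the derivative with respect to $w$ to zero). The sketch below is only an illustration: the data values and the function name `ridge_weight_1d` are made up.

```python
def ridge_weight_1d(xs, ys, lam):
    """Minimizer of (1/n) * sum((x*w - y)^2) + lam * w^2 for a single weight w."""
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)


# Made-up data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Print the optimal weight for increasing regularization strength.
for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    print(lam, ridge_weight_1d(xs, ys, lam))
```

Running the loop with even larger values of `lam` shows the trend that the exercise asks you to explain for a general weight vector.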


## Ex 5: Grid Search For Regularization and Validation - Sklearn
In this exercise we will optimize a [Decision Tree Classifier](http://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) using regularization and validation.
You must use the grid search module [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV) from sklearn.

In the cell below we show an example of how to use the grid search module to test two different values of max_depth for a decision tree for wine classification.

Your job is to find good hyperparameters for decision trees for breast cancer detection.

### Task 1:
For the breast cancer data set, find the best (or very good) combination of max_depth and min_samples_split (cell two below)

The **max_depth** parameter sets the maximum depth of the tree; the deeper the tree, the more complex the model.

The **min_samples_split** parameter sets the minimum number of samples a node must contain before the tree-building algorithm is allowed to split it.
So if a node contains fewer than min_samples_split samples, it may not be split further by the algorithm.


### Task 2:
- How long does grid search with validation take for $k$ hyperparameters when we test $d$ values for each parameter, the training algorithm uses $f(n)$ time to train on $n$ data points, and we split the data into 5 parts for cross-validation?
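
As a starting point for Task 2, the sketch below enumerates a grid with the standard library and counts how many models are trained during cross-validation. The grid values and the function name `count_grid_search_fits` are made up for this example, and it does not call sklearn. It also does not count the one extra fit on the full data set that `GridSearchCV` performs when `refit=True`.

```python
from itertools import product


def count_grid_search_fits(param_grid, n_folds):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    combos = list(product(*param_grid.values()))
    # Every combination is trained once per fold.
    return len(combos), len(combos) * n_folds


# Hypothetical grid: k = 2 hyperparameters with d = 3 values each.
grid = {"max_depth": [1, 5, 10], "min_samples_split": [2, 10, 50]}
n_combos, n_fits = count_grid_search_fits(grid, n_folds=5)
print(n_combos, n_fits)  # 9 45
```

To turn the count into a running time, you still need to account for how many data points each of these fits is trained on.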





In [None]:
from IPython.display import display  # explicit import so the cell also runs outside a notebook
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine, load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def show_result(clf):
    """Show the cross-validation results, best mean test score first, and return them."""
    df = pd.DataFrame(clf.cv_results_)
    df = df.sort_values('mean_test_score', ascending=False)
    display(df)
    print('best parameter found', clf.best_params_)
    return df
    
w_data = load_wine()
wine_data = w_data.data
wine_labels = w_data.target

# grid search validation
reg_parameters = {'max_depth': [1, 30]}  # dict with all parameters we need to test
clf = GridSearchCV(DecisionTreeClassifier(), reg_parameters, cv=3, return_train_score=True)
clf.fit(wine_data, wine_labels)
# code for showing the result
bt = show_result(clf)
                   

In [None]:
cancer_data = load_breast_cancer()
c_data = cancer_data.data
c_labels = cancer_data.target


def decisiontree_model_selection(train_data, labels):
    clf = None
    ### YOUR CODE HERE
    ### END CODE
    return clf
###
clf = decisiontree_model_selection(c_data, c_labels)
bt = show_result(clf)


## Ex 6: VC Dimension of Hyperplanes (Book Exercise 2.4 p. 52)
Consider the input space $\mathcal{X} = \{1\} \times \mathbb{R}^d$ (with the first coordinate being the constant 1). Show that the VC dimension of the hypothesis space $\mathcal{H} = \{\textrm{sign}(w^\intercal x) \mid w\in \mathbb{R}^{d+1} \}$ corresponding to the perceptron is $d+1$.

We need to show:
1. That there exists a data set of size $d+1$ that can be shattered by hyperplanes.
2. That no data set of size $d+2$ can be shattered by hyperplanes.

We will give a few more hints than the book does.
### Shattering d+1 points
As the book hints, you must create an "easy" data set that you store in a matrix $X$.

**Hint:** We suggest you consider as data matrix the $(d+1) \times (d+1)$ matrix $X$ whose first column is all 1s (required since $\mathcal{X} = \{1\} \times \mathbb{R}^d$), whose lower-right $d \times d$ corner is the $d \times d$ identity matrix, and whose remaining entries in the first row are 0.

Show that you can construct any dichotomy $y \in \{-1,+1\}^{d+1}$ using some $h \in \mathcal{H}$ and the data matrix $X$ defined above. That is, you have to show that for any $y \in \{-1,+1\}^{d+1}$, you can find some hypothesis $w$ such that for all $i$, we have $\textrm{sign}(w^\intercal x_i)=y_i$ where $x_i$ is the $i$'th row of $X$.
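
To experiment with the hint, the sketch below builds the suggested matrix and checks whether a given $w$ realizes a given dichotomy. It assumes that the entries of the first row after the leading 1 are zeros, and the names `hint_matrix` and `realizes` are made up for this example. Finding a $w$ for every $y$ is left to you.

```python
def hint_matrix(d):
    """(d+1) x (d+1) matrix: all-1s first column, identity in the lower-right corner."""
    rows = [[1] + [0] * d]  # assumption: the rest of the first row is zero
    for i in range(d):
        rows.append([1] + [1 if j == i else 0 for j in range(d)])
    return rows


def sign(v):
    return 1 if v > 0 else -1 if v < 0 else 0


def realizes(w, X, y):
    """Return True if sign(w . x_i) == y_i for every row x_i of X."""
    return all(
        sign(sum(wi * xi for wi, xi in zip(w, row))) == yi
        for row, yi in zip(X, y)
    )


X = hint_matrix(2)
print(X)                                    # [[1, 0, 0], [1, 1, 0], [1, 0, 1]]
print(realizes([1, -2, 0], X, [1, -1, 1]))  # True
```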


### No Shattering of d+2 points.
We must show that for any $d+2$ points there is a dichotomy that hyperplanes cannot capture.

**Hint:**

Consider an arbitrary set of $d+2$ points $x_1,\dots, x_{d+2}$ and think of them as vectors in $\{1\} \times \mathbb{R}^d \subset \mathbb{R}^{d+1}$.
Since we have more vectors than dimensions, the vectors must be linearly dependent,
i.e. there is a $j$ and coefficients $a_i$, not all zero, such that:
$$
x_j = \sum_{i\neq j} a_i x_i
$$
Since $x_j$ is determined by the other data points, so is $w^\intercal x_j$ for any $w$. This means the classification of point $x_j$ is dictated by the values $w^\intercal x_i$ on the other data points and thus cannot be chosen freely,
i.e.
$$
w^\intercal x_j = w^\intercal \sum_{i\neq j} a_i x_i =\sum_{i\neq j} a_i w^\intercal x_i
$$
Define an impossible dichotomy as follows (for indices $i \neq j$ with $a_i = 0$, the label $y_i$ may be chosen arbitrarily).
$$
y_i = \textrm{sign}(a_i), \quad i\neq j,\ a_i \neq 0, \quad y_j = -1
$$
Show this dichotomy is impossible!
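
The identity $w^\intercal x_j = \sum_{i\neq j} a_i w^\intercal x_i$ can be checked on a small concrete case before you use it in the argument. The sketch below takes $d = 1$ with three made-up points and hand-picked coefficients; all values are illustrative assumptions.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# d = 1: points live in {1} x R, so any d + 2 = 3 of them are linearly dependent.
x1, x2, x3 = (1, 0), (1, 1), (1, 2)
a1, a2 = -1, 2  # chosen so that x3 = a1 * x1 + a2 * x2

# Confirm the linear dependence itself.
assert tuple(a1 * p + a2 * q for p, q in zip(x1, x2)) == x3

# For any w, the value on x3 is fixed by the values on x1 and x2.
for w in [(1, 1), (-3, 2), (2, -5)]:
    lhs = dot(w, x3)
    rhs = a1 * dot(w, x1) + a2 * dot(w, x2)
    print(w, lhs, rhs)
```

Both printed values agree for every `w`, which is the property the impossible dichotomy exploits.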





## Ex 7 (Bonus, If Time): Book Problem 2.18 in Short
Define
$$
\mathcal{H}= \{h_\alpha \mid h_\alpha(x) = (-1)^{\lfloor \alpha
          x\rfloor}, \alpha \in {{\mathbb R}}\}
$$ 

Show that the VC dimension of ${{\mathcal H}}$ is infinite (even though there is only one parameter!)

Hint: Use the point set
$x_1=10,x_2=100,\dots,x_i = 10^i,\dots,x_N=10^N$ and show how to implement any dichotomy $(y_1,\dots,y_N) \in \{-1, +1\}^N$ (find an $\alpha$ that works).
You can safely assume $\alpha >0$.
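
To try out candidate values of $\alpha$, the sketch below evaluates $h_\alpha$ on the points from the hint. It uses `fractions.Fraction` so that $\lfloor \alpha x \rfloor$ is computed exactly instead of with floating-point rounding; the function names are made up for this example, and the value of $\alpha$ shown is just one arbitrary choice.

```python
from fractions import Fraction
from math import floor


def h(alpha, x):
    """Evaluate (-1)^floor(alpha * x) for alpha > 0."""
    k = floor(alpha * x)
    return 1 if k % 2 == 0 else -1


def dichotomy(alpha, n_points):
    """Labels that h_alpha assigns to x_i = 10^i for i = 1..n_points."""
    return [h(alpha, 10 ** i) for i in range(1, n_points + 1)]


print(dichotomy(Fraction(1, 10), 3))  # [-1, 1, 1]
```

Trying a few values of `alpha` and looking at which labels change can suggest how to construct an $\alpha$ for an arbitrary target dichotomy.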
