## Exercises

### Conceptual

----------

#### Question 1

![Screen Shot 2021-06-14 at 22.48.38.png](attachment:0847848b-cb0f-452c-8c0d-1e2913df5c73.png)

<b>Variance</b> refers to the amount by which the estimate of f would change if a different training data set were used. A spline has a higher variance than an inflexible method because it follows the training observations closely, so changing even a few data points can change the estimate of f considerably.
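As a rough illustration of this definition, the sketch below refits an inflexible and a flexible model on many simulated training sets and measures how much the estimate at a single point varies. Everything in it is an assumption made for illustration: the true function `true_f`, the noise level, and the use of degree-1 and degree-8 polynomials as stand-ins for an inflexible method and a spline.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical "true" f, chosen only for illustration.
    return np.sin(2 * x)

x0 = 0.5              # point at which we track the estimate of f
n, n_sets = 30, 200   # observations per training set, number of training sets

preds = {1: [], 8: []}  # polynomial degree -> estimates of f(x0)
for _ in range(n_sets):
    # Draw a fresh training set from the same population.
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, 0.3, n)
    for degree in preds:
        coefs = np.polyfit(x, y, degree)
        preds[degree].append(np.polyval(coefs, x0))

# Variance of the estimate at x0 across training sets, per degree.
var_by_degree = {d: float(np.var(p)) for d, p in preds.items()}
print(var_by_degree)
```

Under these assumptions the degree-8 fit should show a clearly larger variance at `x0` than the straight-line fit, which is the behaviour described above.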

#### Question 2

1. Regression problem. We’re interested in inference. N=500, P=3.
2. Classification problem. We’re interested in prediction. N=20, P=13.
3. Regression problem. We’re interested in prediction of the % change in the USD/Euro exchange rate. N=52 weeks, P=3.

#### Question 3

![Screen Shot 2021-06-16 at 11.52.09.png](attachment:6670dd46-7d11-4656-9bb9-90cafb237ed7.png)

- Variance: increases with flexibility. The estimate of f from a less flexible model changes little when a different training set is used, whereas the estimate from a more flexible model can change considerably.
- Bias: decreases with flexibility. Less flexible models approximate both simple and complex relationships with a simple form (e.g. linear regression), which gives them a higher bias than flexible models.
- Training error: decreases as flexibility increases because the model fits the training set more closely.
- Test error: inflexible models start with a high test error, which decreases as flexibility is added until it reaches a minimum. Beyond that point extra flexibility causes the test error to increase again (overfitting), so the curve is U-shaped.
- Bayes (irreducible) error: an error that no model can remove, so it does not depend on flexibility. Even if the variance and the bias were both reduced to zero, the expected test error could not fall below the Bayes error.
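The curves described above can be approximated with a small simulation. This is only a sketch under stated assumptions: the true function `true_f`, the noise standard deviation `sigma`, and polynomial degree as the measure of flexibility are all hypothetical choices, not part of the exercise.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    # Hypothetical non-linear "true" f.
    return np.sin(2 * np.pi * x)

sigma = 0.3  # noise sd; the irreducible error is sigma**2
x_train = rng.uniform(0, 1, 40)
y_train = true_f(x_train) + rng.normal(0, sigma, 40)
x_test = rng.uniform(0, 1, 2000)
y_test = true_f(x_test) + rng.normal(0, sigma, 2000)

degrees = [1, 3, 5, 9]  # increasing flexibility
train_mse, test_mse = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, d)
    train_mse.append(float(np.mean((np.polyval(coefs, x_train) - y_train) ** 2)))
    test_mse.append(float(np.mean((np.polyval(coefs, x_test) - y_test) ** 2)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d}: train MSE={tr:.3f}, test MSE={te:.3f}, "
          f"irreducible={sigma ** 2:.3f}")
```

The training MSE can only go down as the degree grows because each polynomial contains the lower-degree ones. The test MSE typically falls at first and then rises again, but where the minimum lies depends on the simulated data.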

#### Question 4

- a) A stock market trader could use classification to decide which kind of position to open. The response is binary (buy or sell). The predictors could be data such as the last daily closing price, the daily RSI, the 100-day moving average and the 50-day moving average. The goal of the application would be prediction.
- b) A technology startup would use a regression application to forecast future monthly sales. The response would be the expected sales for a specific month. The predictors could be: past month sales, number of new customers, number of retained customers, average ticket sale per customer. The goal of the application would be prediction.
- c) A recommender system for an online store would benefit from cluster analysis in order to recommend items that a new user is likely to find valuable. A basic approach is to find the cluster of current users whose characteristics are most similar to the new user's, on the assumption that members of the same cluster share a certain degree of similarity in their preferences.

#### Question 5

1. What are the advantages and disadvantages of a very flexible (vs a less flexible) approach for regression or classification?

Answ: A flexible approach for regression fits the training data more closely (less bias), but it is prone to overfitting and may not generalize well (more variance). A less flexible approach will not fit the data points as closely (more bias), but its estimate is more stable and it often generalizes better to new, unseen data (less variance). The same trade-off applies to classification: a flexible classifier can capture a non-linear decision boundary, whereas an inflexible classifier can only approximate it with a simple one.

2. Under what circumstances might a more flexible approach be preferred to a less flexible approach?

Answ: When the data is complex in nature or non-linear. Also if we're more interested in the prediction and not in the interpretability of the results.

3. When is a less flexible approach preferred?

Answ: When a flexible model would have high variance, for example when there are few observations or the data is very noisy. Also when the data shows a linear relationship and when interpretability of the results or of the relationship is important.

#### Question 6

1. Describe the difference between a parametric and a non-parametric statistical learning approach.

Answ: In a parametric approach we first assume a functional form for **f** and then **estimate a set of parameters** of that model. For instance, in a linear model $Y = \beta_0 + \beta_1 X$ fitted by least squares, the training set is used to estimate the parameters $\beta_0$ and $\beta_1$ that best fit the data. In a non-parametric approach, no assumption is made about the functional form of **f**. This approach aims to fit the data as closely as possible without being too rough or wiggly.

2. What are the advantages of a parametric approach to regression or classification (as opposed to a non-parametric approach)?

Answ: Because the functional form of **f** is assumed (e.g. linear), the problem of estimating **f** reduces to estimating a set of parameters. This requires fewer observations than a non-parametric approach, which needs a very large number of observations to obtain an accurate estimate of **f**.

3. What are the disadvantages?

Answ: The model we assume in a parametric approach will usually not match the true unknown form of **f**. If the chosen model is too far from the true **f**, our estimate will be poor. Choosing a more flexible parametric model can help, but it entails estimating a greater number of parameters, which can lead to overfitting.
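To make the contrast concrete, the sketch below estimates the same relationship with a parametric method and a non-parametric one. The simulated data, the helper names `f_hat_linear` and `f_hat_knn`, and the choice of K-nearest-neighbours regression as the non-parametric method are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data whose true relationship is linear: Y = 2 + 0.5 X + noise.
x = rng.uniform(0, 10, 200)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 200)

# Parametric: assume f(X) = beta_0 + beta_1 * X, so estimating f
# reduces to estimating two numbers by least squares.
beta_1, beta_0 = np.polyfit(x, y, 1)

def f_hat_linear(x0):
    return beta_0 + beta_1 * x0

# Non-parametric: K-nearest-neighbours regression assumes no form for f;
# the estimate at x0 is the mean response of the k closest observations.
def f_hat_knn(x0, k=10):
    nearest = np.argsort(np.abs(x - x0))[:k]
    return float(y[nearest].mean())

print(f_hat_linear(5.0), f_hat_knn(5.0))
```

Here the parametric fit uses all 200 observations to pin down two parameters, while the KNN estimate at each point rests on only `k` of them, which hints at why non-parametric methods need more data for the same accuracy.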

#### Question 7

(a) Compute the Euclidean distance.

Answ: Two ways of computing the Euclidean distance for single row vectors are shown below:

In [3]:
import numpy as np

In [15]:
obs_1 = np.array([0, 3, 0])

In [5]:
test_point = np.array([0, 0, 0])

In [14]:
np.sqrt( ((obs_1-test_point)**2).sum() )

3.0

In [12]:
np.linalg.norm(obs_1-test_point)

3.0

Compute the distance for the whole array of observations:

In [17]:
obs = np.array([ [0, 3, 0], [2, 0, 0], [0, 1, 3], [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
obs

array([[ 0,  3,  0],
       [ 2,  0,  0],
       [ 0,  1,  3],
       [ 0,  1,  2],
       [-1,  0,  1],
       [ 1,  1,  1]])

In [24]:
np.linalg.norm(obs-test_point, axis=1).reshape(6,1)

array([[3.        ],
       [2.        ],
       [3.16227766],
       [2.23606798],
       [1.41421356],
       [1.73205081]])

(b) What is our prediction with K = 1? Why?

Answ: **Green**. With K = 1 the prediction is the class of the single closest observation to the test point, which is observation 5 (Euclidean distance 1.41), and observation 5 is Green.

(c) What is our prediction with K = 3? Why?

Answ: **Red**. With K = 3, the three closest observations are 2, 5 and 6. Observations 2 and 6 are both Red and only observation 5 is Green, so the majority vote is Red.

More formally:

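Let $x_0$ be the test point $[0, 0, 0]$ and let $\mathcal{N}_0$ be the set of the K training observations closest to $x_0$.

$P(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)$

For red and green, given that K = 3 and $\mathcal{N}_0 = \{2, 5, 6\}$:

$P(Y = \text{red} \mid X = x_0) = \frac{1}{3} \sum_{i \in \mathcal{N}_0} I(y_i = \text{red}) = \frac{1}{3}(1 + 0 + 1) = \frac{2}{3}$

$P(Y = \text{green} \mid X = x_0) = \frac{1}{3} \sum_{i \in \mathcal{N}_0} I(y_i = \text{green}) = \frac{1}{3}(0 + 1 + 0) = \frac{1}{3}$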
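The same calculation can be sketched in code, reusing the `obs` and `test_point` arrays from part (a). The `labels` array is an assumption: only the classes of observations 2, 5 and 6 are quoted in the answers above, and the remaining ones are filled in from the exercise's data table. The helper names `knn_probabilities` and `knn_predict` are hypothetical.

```python
import numpy as np

obs = np.array([[0, 3, 0], [2, 0, 0], [0, 1, 3],
                [0, 1, 2], [-1, 0, 1], [1, 1, 1]])
# Classes of observations 1..6 (assumed from the exercise's data table).
labels = np.array(["Red", "Red", "Red", "Green", "Green", "Red"])
test_point = np.array([0, 0, 0])

def knn_probabilities(k):
    # Fraction of the k nearest observations that fall in each class.
    distances = np.linalg.norm(obs - test_point, axis=1)
    nearest = np.argsort(distances)[:k]
    return {str(c): float(np.mean(labels[nearest] == c))
            for c in np.unique(labels)}

def knn_predict(k):
    # Predict the class with the highest estimated probability.
    probs = knn_probabilities(k)
    return max(probs, key=probs.get)

print(knn_predict(1), knn_probabilities(1))
print(knn_predict(3), knn_probabilities(3))
```

With these labels the sketch gives Green for K = 1 and Red for K = 3, with estimated probabilities of 2/3 for Red and 1/3 for Green at K = 3, matching the hand calculation.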

(d) Answ: If the Bayes decision boundary is highly non-linear, a smaller K (e.g. K = 1) is the better choice because the KNN decision boundary is flexible enough to follow the Bayes boundary closely. With a larger K (e.g. K = 100), the KNN decision boundary becomes **inflexible** and close to linear.

Other source: If K is large, the boundary becomes almost linear, so a small K is best when the Bayes boundary is highly non-linear.

![Screen Shot 2021-06-16 at 8.53.37.png](attachment:52c12493-3413-423b-8bfd-da557b8d6c1c.png)
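The effect of K in part (d) can be checked with a small simulation. This is a sketch under strong assumptions: the circular boundary in `label`, the noise-free labels, the sample sizes, and the helper `knn_predict` are all hypothetical and chosen only to make the Bayes boundary clearly non-linear.

```python
import numpy as np

rng = np.random.default_rng(3)

def label(points):
    # Hypothetical non-linear Bayes boundary: the unit circle.
    # Class 1 inside the circle, class 0 outside (no label noise).
    return (np.sum(points ** 2, axis=1) < 1.0).astype(int)

x_train = rng.uniform(-2, 2, size=(300, 2))
y_train = label(x_train)
x_test = rng.uniform(-2, 2, size=(1000, 2))
y_test = label(x_test)

def knn_predict(points, k):
    # Pairwise Euclidean distances, shape (n_points, n_train).
    d = np.linalg.norm(points[:, None, :] - x_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    # Majority vote among the k nearest labels (ties go to class 1).
    return (y_train[nearest].mean(axis=1) >= 0.5).astype(int)

test_error = {k: float(np.mean(knn_predict(x_test, k) != y_test))
              for k in (1, 15, 300)}
print(test_error)
```

With K equal to the training set size, every test point receives the overall majority class, so the decision boundary disappears entirely. Because the labels here are noise-free, K = 1 does well. With noisy labels a moderate K would usually be the better choice.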