# Chapter 2 - Exercises

## Conceptual

### 1. For each of parts 1 through 4, indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method

1. The sample size n is extremely large, and the number of predictors p is small
    * The flexible model would likely perform better. The large sample size reduces the risk of overfitting, while the flexibility allows a closer fit to the data. The small number of predictors also reduces the risk of the flexible model succumbing to the curse of dimensionality.
2. The number of predictors p is extremely large, and the number of observations n is small
    * The flexible model would likely perform worse, as the small sample size increases the risk of overfitting. The large number of predictors also increases the risk of chance associations, which the flexible model overfits more easily.
3. The relationship between the predictors and response is highly non-linear
    * Because the relationship is highly non-linear, inflexible approaches would perform worse than flexible methods, which are able to capture such relationships.
4. The variance of the error terms is extremely high
    * The flexible model would likely perform worse, as it tends to fit the noise in the error terms, which leads to overfitting. The inflexible model would therefore be best in this case.


### 2. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally provide *n* and *p*

1. We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.
    * This is a regression problem as the response, CEO salary, is quantitative. The analysis is focused on inference as we seek to understand the factors that affect the response. *n*=500, *p*=3 [profit, employees, industry]

2. We are considering launching a new product and wish to know whether it will be a success or failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price and ten other variables.
    * This is a classification problem as the response variable, whether the product will be a success or failure, is categorical. The analysis is mostly focused on prediction as we wish to predict whether our product will be a success. *n*=20, *p*=13 [price, marketing budget, competition price, 10 additional variables]

3. We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the exchange rate, the % change in the US market, the % change in the Euro market, the % change in the British market and the % change in the German market.
    * This is a regression problem as the response, % change in exchange rate, is quantitative. The analysis is focused on prediction. *n*=52 (weeks in 2012), *p*=4 [%changeUS, %changeEU, %changeGER, %changeBRI]

### 3. We now revisit the bias-variance decomposition

1. Provide a sketch of typical (squared) bias, variance, training error, test error and Bayes (irreducible) error curves on a single plot, going from less flexible methods to more flexible methods.

![image info](./ex3.jpg)

2. Explain why each of the five curves has the shape displayed in part (1)
    * Squared bias: The bias is initially large because the model is too inflexible to capture the true relationship. As flexibility increases, the model captures the relationship better and the bias falls.
    * Variance: The variance is initially low because the model's inflexibility prevents it from fitting the random noise in the data. As flexibility increases, the model increasingly risks fitting spurious relationships, so the variance rises.
    * Training error: The training error starts high because of the bias and falls as the bias is reduced. It keeps falling as the model begins to overfit the random noise in the training data.
    * Test error: The test error initially follows the training error down as the bias falls. Once the increase in variance outweighs the reduction in bias, it rises again, because the test data contain noise that the model has not been fitted to. This gives the characteristic U-shape.
    * Irreducible error: The irreducible error is constant because it does not depend on the model. This is also why it is the lower limit for the expected test error.
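
The shapes described above follow from the standard decomposition of the expected test error at a point $x_0$. The notation is assumed here rather than taken from the answers: $\hat{f}$ is the fitted model and $\varepsilon$ is the error term.

$$ E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon) $$

The first two terms change with flexibility and are never negative, so the expected test error cannot fall below $\mathrm{Var}(\varepsilon)$, the irreducible error.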

### 4. You will now think of some real-life applications for statistical learning

1. Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.
    * Classification of breast tumors. Response: tumor type. Predictors: gene expression measures and metabolomics. The goal can be both prediction and inference: to understand which factors are present in the different tumor types, as well as to predict whether a patient has cancer.
    * Face emotion classification. Response: displayed emotion. Predictors: encoded images of faces. The goal is prediction.
    * Fraudulent transactions. Response: whether a transaction is fraudulent. Predictors: amount, location, purchased object etc. The goal is prediction.

2. Repeat (1) for the regression setting
    * Predict house prices. Response: house price. Predictors: location, floor size, garden, condition etc. The goal is prediction; an example is the Danish ejendomsvurdering (public property valuation) that is used as a basis for property taxes. The goal can also be inference, to see how the variables affect pricing (for example in connection with ghetto plans).
    * Predict unit sales of a product. Response: unit sales. Predictors: sales in previous weeks, whether the product appears in the sales magazine, competitors' prices. The goal is to predict the sales number in order to optimize stocking of the product.
    * Self-tracking to predict weight. Response: weight. Predictors: personal info, daily training, calorie intake, stress, sleep etc. The goal is inference: to understand how these factors affect weight.

3. Describe three real-life applications in which cluster analysis might be useful
    * Clustering of cancer tumors, in order to infer what separates them
    * Population genetic clustering of people based on their genomes, as a basis for hypotheses about ancient migration
    * Market segmentation: cluster a city's population into different market segments to allow optimization of the marketing strategy toward each group

### 5. What are the advantages and disadvantages of a very flexible approach for regression or classification? Under what circumstances is a flexible approach preferred over an inflexible one and vice versa?

The advantage of a flexible model is that it is better able to describe complex and non-linear relationships, giving it a lower bias. It is, however, more susceptible to overfitting, which leads to higher variance in its predictions. It is also often less interpretable than more inflexible methods and may therefore not be preferred in an inference setting.
A flexible approach is preferred when the sample size is large, the number of predictors is not too large relative to the sample size, and the relationship is non-linear. An inflexible approach is preferred when the sample size is small, the noise is high, or interpretability matters.
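
The trade-off described above can be sketched with a small simulation. Everything here is an assumption made for illustration and not part of the exercise: the non-linear truth `true_f`, the simulated data, the noise level, and the use of polynomial degree as a stand-in for flexibility.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

def true_f(x):
    # Hypothetical non-linear "truth", chosen only for illustration
    return np.sin(2 * x)

# A small training set and a larger test set drawn from the same process
x_train = rng.uniform(-2, 2, size=30)
y_train = true_f(x_train) + rng.normal(scale=0.3, size=30)
x_test = rng.uniform(-2, 2, size=500)
y_test = true_f(x_test) + rng.normal(scale=0.3, size=500)

def mse(y, y_hat):
    # Mean squared error between observed and predicted values
    return float(np.mean((y - y_hat) ** 2))

degrees = [1, 3, 5, 9]  # higher degree = more flexible model
train_mse, test_mse = [], []
for d in degrees:
    coefs = np.polyfit(x_train, y_train, deg=d)  # least-squares polynomial fit
    train_mse.append(mse(y_train, np.polyval(coefs, x_train)))
    test_mse.append(mse(y_test, np.polyval(coefs, x_test)))

for d, tr, te in zip(degrees, train_mse, test_mse):
    print(f"degree={d}: train MSE={tr:.3f}, test MSE={te:.3f}")
```

On the training data the error can only fall as the degree grows, since each polynomial contains the lower-degree ones. The test error is what shows whether the extra flexibility actually helps.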

### 6. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages and disadvantages?

A parametric approach assumes a functional form for the underlying function *f*, which reduces the problem to estimating a set of parameters. These models are often simpler, but they introduce bias if the assumed form is far from the true *f*. They therefore work best when we have some indication that the true relationship resembles the assumed form. It is also easier to interpret the direct relationships between predictors and response.
A non-parametric approach assumes nothing about the shape of the underlying function, so it can fit a much wider range of shapes. It requires far more observations to give a low-variance prediction, however, as it is harder to estimate an arbitrary *f* than to estimate the parameters of an assumed functional form.
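
A minimal sketch of the contrast, under assumptions that are not in the exercise: the data are simulated from a linear truth, the parametric model is a straight line fitted by least squares, and the non-parametric model is K-nearest-neighbours regression. The names `predict_linear` and `predict_knn` are hypothetical helpers.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: the true relationship is linear, so the parametric assumption holds
x = rng.uniform(0, 10, size=40)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=40)

# Parametric: assume f(x) = b0 + b1 * x and estimate just two parameters
b1, b0 = np.polyfit(x, y, deg=1)  # polyfit returns the highest power first

def predict_linear(x0):
    return b0 + b1 * x0

# Non-parametric: K-nearest-neighbours regression, no assumed functional form
def predict_knn(x0, k=5):
    nearest = np.argsort(np.abs(x - x0))[:k]  # indices of the k closest training points
    return float(np.mean(y[nearest]))

x0 = 5.0
print(f"linear: {predict_linear(x0):.2f}, "
      f"knn: {predict_knn(x0):.2f}, "
      f"truth: {2.0 + 0.5 * x0:.2f}")
```

The parametric fit summarizes the whole relationship in `b0` and `b1`, which can be read off directly. The KNN prediction has no such summary and depends on how many observations lie near `x0`.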

### 7. The following table contains 6 observations, three predictors and one qualitative response variable. Suppose we wish to make a prediction for Y when X1=X2=X3=0 using K nearest neighbours

```python
import pandas as pd
import numpy as np

# The six training observations from the exercise table
data = {
    'x1': [0, 2, 0, 0, -1, 1],
    'x2': [3, 0, 1, 1, 0, 1],
    'x3': [0, 0, 3, 2, 1, 1],
    'response': ['Red', 'Red', 'Red', 'Green', 'Green', 'Red']
}
A = pd.DataFrame(data)
A
```


| | x1 | x2 | x3 | response |
|---|---|---|---|---|
| 0 | 0 | 3 | 0 | Red |
| 1 | 2 | 0 | 0 | Red |
| 2 | 0 | 1 | 3 | Red |
| 3 | 0 | 1 | 2 | Green |
| 4 | -1 | 0 | 1 | Green |
| 5 | 1 | 1 | 1 | Red |


1. Compute the Euclidean distance between each observation and the test point, X1=X2=X3=0
$$ d(n, (0,0,0)) = \sqrt{X_{1n}^2 + X_{2n}^2 + X_{3n}^2} $$

```python
# Euclidean distance from each observation to the test point (0, 0, 0)
Adistance = np.sqrt(A['x1']**2 + A['x2']**2 + A['x3']**2)
Adistance
```

```
0    3.000000
1    2.000000
2    3.162278
3    2.236068
4    1.414214
5    1.732051
dtype: float64
```

2. What is the prediction with K=1

With K=1 the prediction is 'Green' with an estimated probability of 100%, as only the single nearest neighbour (obs4, distance 1.41) is considered, and it is Green

3. What is the prediction with K=3

With K=3 the prediction is the majority class among the 3 nearest neighbours, i.e. obs4: 'Green', obs5: 'Red' and obs1: 'Red'. The estimated conditional probability of Red is 2/3 (about 67%), so the prediction is 'Red'

4. If the Bayes decision boundary is highly non-linear, would we expect the best value for K to be large or small?

We would expect a small K to be best, as it gives a much more flexible classifier. As K increases, the decision boundary becomes smoother and closer to linear
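
The K=1 and K=3 answers above can be checked with a short, self-contained sketch. It uses only the standard library and re-enters the table by hand. The function name `knn_predict` is a hypothetical helper, and ties between classes are not handled specially.

```python
import math
from collections import Counter

# Observations from the table above: (x1, x2, x3, response), 0-indexed as in the DataFrame
observations = [
    (0, 3, 0, "Red"),
    (2, 0, 0, "Red"),
    (0, 1, 3, "Red"),
    (0, 1, 2, "Green"),
    (-1, 0, 1, "Green"),
    (1, 1, 1, "Red"),
]

def euclidean(obs, point):
    # Distance between the three predictors of an observation and the test point
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(obs[:3], point)))

def knn_predict(test_point, k):
    """Return (predicted class, estimated class probabilities) by majority vote."""
    by_distance = sorted(observations, key=lambda obs: euclidean(obs, test_point))
    votes = Counter(obs[3] for obs in by_distance[:k])  # class counts among the k nearest
    probabilities = {label: count / k for label, count in votes.items()}
    return votes.most_common(1)[0][0], probabilities

for k in (1, 3):
    label, probs = knn_predict((0, 0, 0), k)
    print(f"K={k}: prediction={label}, estimated probabilities={probs}")
```

With K=1 only the observation at distance 1.41 votes, giving Green. With K=3 two of the three nearest neighbours are Red, giving Red with an estimated probability of 2/3.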