# Conceptual

__1.__ For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.  

  __a.__ The sample size $n$ is extremely large, and the number of predictors $p$ is small.  
         _Better. With a very large $n$, a flexible method can fit the data closely without overfitting, so it captures structure that an inflexible method would miss._  
     
  __b.__ The number of predictors $p$ is extremely large, and the number of observations $n$ is small.  
         _Worse. With so few observations and so many predictors, a flexible method would overfit the training data._
  
  __c.__ The relationship between the predictors and response is highly non-linear.  
         _Better. A flexible method has the freedom needed to capture a highly non-linear relationship, whereas an inflexible method would be badly biased._
  
  __d.__ The variance of the error terms, i.e. $\sigma^2 = \text{Var}(\epsilon)$, is extremely high.  
         _Worse. A flexible method would fit the noise in the error terms, increasing its variance without bringing it closer to the true $f$._

---

__2.__ Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide $n$ and $p$.  

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.  

__SOLUTION__ This is a regression problem where the goal is inference. $n = 500$ (the number of firms) and $p = 3$. The predictors are: (1) profit, (2) number of employees, and (3) industry. CEO salary is the response, not a predictor.  

(b) We are considering launching a new product and wish to know whether it will be a _success_ or a _failure_. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.  

__SOLUTION__ This is a classification problem where the goal is prediction. $n = 20$ and $p = 13$. The predictors are: (1) price, (2) marketing budget, (3) competition price, and (4) the ten other variables. The response is whether the product was a success or a failure.

(c) We are interested in predicting the % change in the USD/Euro exchange rate in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the USD/Euro, the % change in the US market, the % change in the British market, and the % change in the German market.  

__SOLUTION__ This is a regression problem where the goal is prediction. $n = 52$ weeks and $p = 3$. The predictors are: (1) % change in the US market, (2) % change in the British market, and (3) % change in the German market. The % change in USD/Euro is the response, not a predictor.

---

__3.__ We now revisit the bias-variance decomposition.  

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The _x_-axis should represent the amount of flexibility in the method, and the _y_-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

__SOLUTION__  
![Bias-Variance](Bias-Variance.png "Bias-Variance")

(b) Explain why each of the five curves has the shape displayed in part (a).  

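As a general rule, as we use more flexible methods, the variance will increase and the (squared) bias will decrease. Bias falls because a more flexible model can follow the true $f$ more closely; variance rises because the fit becomes more sensitive to the particular training set used. The training MSE steadily decreases as flexibility increases, since the model fits the training data more and more closely. The test MSE first decreases, while the drop in bias outweighs the rise in variance, reaches a minimum, and then turns back up in a U shape as the model begins to overfit the training data. The Bayes (irreducible) error is a horizontal line: it does not depend on the method, and the expected test MSE cannot fall below it.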
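
The shapes described above follow from the standard decomposition of the expected test MSE at a point $x_0$. The notation here is an assumption, since it is not defined elsewhere in these solutions: $\hat{f}$ is the fitted model and $y_0$ is the response observed at $x_0$.
\begin{align}
E\left(y_0 - \hat{f}(x_0)\right)^2 = \text{Var}\left(\hat{f}(x_0)\right) + \left[\text{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \text{Var}(\epsilon)
\end{align}
The first two terms change with flexibility, while $\text{Var}(\epsilon)$ is the irreducible error and stays fixed.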

---

__4.__ You will now think of some real-life applications for statistical learning.  

(a) Describe three real-life applications in which _classification_ might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.  

(b) Describe three real-life applications in which _regression_ might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.  

(c) Describe three real-life applications in which _cluster analysis_ might be useful.

---

__5.__ What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?  

__SOLUTION__  
The advantages and disadvantages of very flexible approaches to regression or classification are best described by the tradeoff between accuracy and interpretability. The more flexible a method is, the wider the range of shapes it can fit, so it can represent a complex relationship in the data more accurately. However, highly flexible methods are in danger of overfitting. The other major disadvantage is that interpretability decreases, because the relationship between each predictor and the response is modeled using a curve rather than a single coefficient. Highly flexible, non-linear methods like _bagging_, _support vector machines_, and _neural networks_ are very hard to interpret.  

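More flexible approaches are useful when accuracy is more important than interpretability, that is, when the goal is prediction. Less flexible approaches are preferred when the goal is inference, or when there are too few observations to support a flexible fit.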
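
The overfitting risk mentioned above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true function `f`, the noise level, the sample size, and the use of polynomial degree as a stand-in for "flexibility" are all invented for illustration and do not come from the exercise.

```r
set.seed(1)

# Assumed true relationship and noise level (illustrative only)
f <- function(x) sin(2 * x)
n <- 100
train <- data.frame(x = runif(n, 0, 3))
train$y <- f(train$x) + rnorm(n, sd = 0.3)
test <- data.frame(x = runif(n, 0, 3))
test$y <- f(test$x) + rnorm(n, sd = 0.3)

# Fit polynomials of increasing degree and record training and test MSE
degrees <- 1:10
mse <- t(sapply(degrees, function(d) {
  fit <- lm(y ~ poly(x, d), data = train)
  c(degree = d,
    train = mean((train$y - predict(fit, train))^2),
    test  = mean((test$y - predict(fit, test))^2))
}))

print(round(mse, 3))
```

Because the polynomial models are nested, the training MSE cannot increase with the degree. The test MSE carries no such guarantee, which is why it is the quantity to watch when choosing how flexible a method should be.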

---

__6.__ Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

__SOLUTION__  
A parametric approach to statistical learning involves two steps:  
(1) We make an assumption about the functional form, or shape, of $f$. A simple assumption is that $f$ is linear in $X$:
\begin{align}\tag{2.4}
f(X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p
\end{align}  
(2) We _fit_ or _train_ the model using training data to estimate the $\beta$ parameters.  

A non-parametric method does not make an explicit assumption about the functional form of $f$. Instead, we seek an estimate of $f$ that gets as close to the data points as possible without being too rough or wiggly.

A major advantage of a parametric approach is that it reduces the problem of estimating $f$ to estimating a small set of parameters, so relatively few observations are needed. A non-parametric approach, by contrast, requires a large number of observations (far more than is typically needed for a parametric approach) in order to obtain an accurate estimate for $f$.

A major disadvantage of a parametric approach is that the chosen functional form will usually not match the true, unknown form of $f$. If the chosen form is far from the truth, the estimate will be poor.
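
As a rough illustration of this disadvantage, the sketch below fits a linear model and a local regression to simulated data whose true relationship is quadratic. The data, the choice of $f(x) = x^2$, and the use of `loess` as the non-parametric method are all assumptions made for this example.

```r
set.seed(42)

# Simulated data with an assumed non-linear true f(x) = x^2
n <- 200
dat <- data.frame(x = runif(n, -2, 2))
dat$y <- dat$x^2 + rnorm(n, sd = 0.2)

# Parametric: assume f is linear and estimate two coefficients
fit_lm <- lm(y ~ x, data = dat)

# Non-parametric: local regression, no global functional form assumed
fit_loess <- loess(y ~ x, data = dat)

# Training MSE of each fit
mse_lm <- mean(residuals(fit_lm)^2)
mse_loess <- mean(residuals(fit_loess)^2)
cat("linear:", mse_lm, " loess:", mse_loess, "\n")
```

Here the linear form is badly misspecified, so the non-parametric fit has a much lower training error. These are training errors only; with far fewer observations the comparison could easily go the other way.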

---

__7.__ The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

In [1]:
Obs <- c(1, 2, 3, 4, 5, 6)
X1 <- c(0, 2, 0, 0, -1, 1)
X2 <- c(3, 0, 1, 1, 0, 1)
X3 <- c(0, 0, 3, 2, 1, 1)
Y <- c("Red", "Red", "Red", "Green", "Green", "Red")
(df <- data.frame(X1, X2, X3, Y))

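| Obs | X1 | X2 | X3 | Y     |
|-----|----|----|----|-------|
| 1   | 0  | 3  | 0  | Red   |
| 2   | 2  | 0  | 0  | Red   |
| 3   | 0  | 1  | 3  | Red   |
| 4   | 0  | 1  | 2  | Green |
| 5   | -1 | 0  | 1  | Green |
| 6   | 1  | 1  | 1  | Red   |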


Suppose we wish to use this data set to make a prediction for $Y$ when $X_1=X_2=X_3=0$ using $K$-nearest neighbors.  

(a) Compute the Euclidean distance between each observation and the test point, $X_1=X_2=X_3=0$.  

In [2]:
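# Euclidean distance between two numeric vectors
euclidean <- function(a, b) sqrt(sum((a - b)^2))
tp <- c(0, 0, 0)  # test point X1 = X2 = X3 = 0
# Print the distance from the test point to each observation, in row order
for (r in seq_len(nrow(df))) {
    cat(euclidean(tp, as.numeric(df[r, 1:3])), "\n")
}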

3 
2 
3.162278 
2.236068 
1.414214 
1.732051 
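
As a check on the first and fifth values, note that because the test point is the origin, each distance reduces to the norm of the observation's predictor vector. The notation $x_{ij}$ for the value of predictor $j$ on observation $i$ is an assumption introduced here.
\begin{align}
d_i = \sqrt{x_{i1}^2 + x_{i2}^2 + x_{i3}^2}, \qquad d_1 = \sqrt{0^2 + 3^2 + 0^2} = 3, \qquad d_5 = \sqrt{(-1)^2 + 0^2 + 1^2} = \sqrt{2} \approx 1.414
\end{align}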


(b) What is our prediction with $K=1$? Why?  
The prediction is `Green`. With $K=1$ only the single nearest neighbor counts, and that is observation 5 (distance 1.414), which is `Green`.

(c) What is our prediction with $K=3$? Why?  
With $K=3$ the nearest neighbors are observations 5, 6, and 2 (distances 1.414, 1.732, and 2). Two are `Red` and one is `Green`, so the majority vote gives a prediction of `Red`.
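
The answers to (b) and (c) can be reproduced with a short majority-vote sketch. `knn_predict` is a hypothetical helper written for this example, not a library function, and it breaks ties by taking the first class in alphabetical order, which is an arbitrary choice.

```r
# Training data from the exercise
train <- data.frame(
  X1 = c(0, 2, 0, 0, -1, 1),
  X2 = c(3, 0, 1, 1, 0, 1),
  X3 = c(0, 0, 3, 2, 1, 1),
  Y  = c("Red", "Red", "Red", "Green", "Green", "Red")
)

# Hypothetical helper: predict the class of one test point by majority vote
knn_predict <- function(train_x, train_y, test_point, k) {
  # Subtract the test point from every row, then take row-wise Euclidean norms
  diffs <- sweep(as.matrix(train_x), 2, test_point)
  dists <- sqrt(rowSums(diffs^2))
  # Indices of the k closest observations
  nearest <- order(dists)[seq_len(k)]
  # Count the classes among the neighbors and return the most frequent one
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

test_point <- c(0, 0, 0)
pred_k1 <- knn_predict(train[, 1:3], train$Y, test_point, k = 1)
pred_k3 <- knn_predict(train[, 1:3], train$Y, test_point, k = 3)
cat("K = 1:", pred_k1, "\n")
cat("K = 3:", pred_k3, "\n")
```

For anything beyond a toy data set, an existing implementation such as `knn` from the `class` package would be the more usual choice.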

(d) If the Bayes decision boundary in this problem is highly non-linear, then would we expect the _best_ value for $K$ to be large or small? Why?  
The best value of $K$ would be small. A large $K$ averages over many neighbors and produces a smoother, close-to-linear boundary, which is not flexible enough to follow a highly non-linear Bayes decision boundary.

---