# Statistical Learning 

**Q. For each of parts (a) through (d), indicate whether we would generally expect the performance of a flexible statistical learning method to be better or worse than an inflexible method. Justify your answer.**

(a) The sample size n is extremely large, and the number of predictors p is small.

(b) The number of predictors p is extremely large, and the number of observations n is small.

(c) The relationship between the predictors and response is highly non-linear.

(d) The variance of the error terms, i.e. σ² = Var(ε), is extremely high.

#####################################################################################################

**Q. Explain whether each scenario is a classification or regression problem, and indicate whether we are most interested in inference or prediction. Finally, provide n and p.**

(a) We collect a set of data on the top 500 firms in the US. For each firm we record profit, number of employees, industry and the CEO salary. We are interested in understanding which factors affect CEO salary.

(b) We are considering launching a new product and wish to know whether it will be a success or a failure. We collect data on 20 similar products that were previously launched. For each product we have recorded whether it was a success or failure, price charged for the product, marketing budget, competition price, and ten other variables.

(c) We are interested in predicting the % change in the US dollar in relation to the weekly changes in the world stock markets. Hence we collect weekly data for all of 2012. For each week we record the % change in the dollar, the % change in the US market, the % change in the British market, and the % change in the German market.

#####################################################################################################

Q. The bias-variance decomposition.

(a) Provide a sketch of typical (squared) bias, variance, training error, test error, and Bayes (or irreducible) error curves, on a single plot, as we go from less flexible statistical learning methods towards more flexible approaches. The x-axis should represent the amount of flexibility in the method, and the y-axis should represent the values for each curve. There should be five curves. Make sure to label each one.

(b) Explain why each of the five curves has the shape displayed in part (a).

#####################################################################################################

Q. Think of some real-life applications for statistical learning.

(a) Describe three real-life applications in which classification might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(b) Describe three real-life applications in which regression might be useful. Describe the response, as well as the predictors. Is the goal of each application inference or prediction? Explain your answer.

(c) Describe three real-life applications in which cluster analysis might be useful.

#####################################################################################################

Q. What are the advantages and disadvantages of a very flexible (versus a less flexible) approach for regression or classification? Under what circumstances might a more flexible approach be preferred to a less flexible approach? When might a less flexible approach be preferred?

#####################################################################################################

Q. Describe the differences between a parametric and a non-parametric statistical learning approach. What are the advantages of a parametric approach to regression or classification (as opposed to a nonparametric approach)? What are its disadvantages?

#####################################################################################################

Q. The table below provides a training data set containing six observations, three predictors, and one qualitative response variable.

<table><tr><th>Obs.</th><th>X1</th><th>X2</th><th>X3</th><th>Y</th></tr>
       <tr><td>1</td><td>0</td><td>3</td><td>0</td><td>Red</td></tr>
       <tr><td>2</td><td>2</td><td>0</td><td>0</td><td>Red</td></tr>
       <tr><td>3</td><td>0</td><td>1</td><td>3</td><td>Red</td></tr>
       <tr><td>4</td><td>0</td><td>1</td><td>2</td><td>Green</td></tr>
       <tr><td>5</td><td>-1</td><td>0</td><td>1</td><td>Green</td></tr>
       <tr><td>6</td><td>1</td><td>1</td><td>1</td><td>Red</td></tr>
</table>

Suppose we wish to use this data set to make a prediction for Y when X1 = X2 = X3 = 0 using K-nearest neighbors.

(a) Compute the Euclidean distance between each observation and the test point, X1 = X2 = X3 = 0.

(b) What is our prediction with K = 1? Why?

(c) What is our prediction with K = 3? Why?

(d) If the Bayes decision boundary in this problem is highly nonlinear, then would we expect the best value for K to be large or small? Why?
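
A hand calculation for parts (a) to (c) can be checked with a short script. The following is a minimal sketch, not a reference solution: the names `observations`, `euclidean`, and `knn_predict` are illustrative, and ties are broken by whichever class is counted first, which is an assumption that only matters for even K.

```python
import math
from collections import Counter

# Training data copied from the table above: (X1, X2, X3, Y).
observations = [
    (0, 3, 0, "Red"),
    (2, 0, 0, "Red"),
    (0, 1, 3, "Red"),
    (0, 1, 2, "Green"),
    (-1, 0, 1, "Green"),
    (1, 1, 1, "Red"),
]
test_point = (0, 0, 0)


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(k):
    """Majority vote among the k training observations closest to test_point."""
    ranked = sorted(observations, key=lambda row: euclidean(row[:3], test_point))
    votes = Counter(row[3] for row in ranked[:k])
    return votes.most_common(1)[0][0]


# Part (a): distance from each observation to the test point.
distances = [euclidean(row[:3], test_point) for row in observations]
for i, d in enumerate(distances, start=1):
    print(f"Obs. {i}: distance = {d:.3f}")

# Parts (b) and (c): predictions for K = 1 and K = 3.
for k in (1, 3):
    print(f"K = {k}: predicted class = {knn_predict(k)}")
```

Trying other values of K in `knn_predict` is a quick way to see how the vote changes as more distant neighbors are included, which is the intuition part (d) asks about.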

#####################################################################################################

# Linear Regression

-----------------------------------------------------------------------------------------------------

**Q. Describe the null hypotheses to which the p-values given in the table below correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.**

<table><tr><th> </th><th>Coefficient</th><th>Std. Error</th><th>t-statistic</th><th>p-value</th></tr>
       <tr><td>Intercept</td><td>2.939</td><td>0.3119</td><td>9.42</td><td> &lt; 0.0001 </td></tr>
       <tr><td>TV</td><td>0.046</td><td>0.0014</td><td>32.81</td><td>&lt; 0.0001</td></tr>
       <tr><td>radio</td><td>0.189</td><td>0.0086</td><td>21.89</td><td>&lt; 0.0001</td></tr>
       <tr><td>newspaper</td><td>-0.001</td><td>0.0059</td><td>-0.18</td><td>0.8599</td></tr>
</table>
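
Each t-statistic in the table is the coefficient divided by its standard error, and the p-value measures how extreme that ratio is under the null hypothesis that the coefficient is zero. The sketch below recomputes both from the table entries. It uses a standard normal reference distribution as an approximation (an assumption; the exact p-values come from a t-distribution whose degrees of freedom depend on a sample size that is not given here), and the names `table`, `t_ratios`, and `p_values` are illustrative.

```python
import math

# (coefficient, standard error, reported t-statistic), copied from the table above.
table = {
    "Intercept": (2.939, 0.3119, 9.42),
    "TV": (0.046, 0.0014, 32.81),
    "radio": (0.189, 0.0086, 21.89),
    "newspaper": (-0.001, 0.0059, -0.18),
}


def two_sided_p_normal(t):
    """Two-sided p-value for t under a standard normal reference (an approximation)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Ratio of the rounded table entries. It differs slightly from the reported
# t-statistic because the printed coefficients and standard errors are rounded.
t_ratios = {name: coef / se for name, (coef, se, _) in table.items()}

# Approximate p-values computed from the reported t-statistics.
p_values = {name: two_sided_p_normal(t) for name, (_, _, t) in table.items()}

for name in table:
    print(f"{name:>9}: t ratio = {t_ratios[name]:7.2f}, approx. p = {p_values[name]:.4f}")
```

The approximate p-values will not match the table digit for digit, but they should lead to the same qualitative reading of which predictors show evidence of an association with sales.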

----------------------------------------------------------------------------------------------------

**Q. Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get β̂0 = 50, β̂1 = 20, β̂2 = 0.07, β̂3 = 35, β̂4 = 0.01, β̂5 = −10.**

(a) Which answer is correct, and why? 

1. For a fixed value of IQ and GPA, males earn more on average than females.

2. For a fixed value of IQ and GPA, females earn more on average than males.

3. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.

4. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0. 

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.
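
The fitted model can be written out as a small function to check the arithmetic in parts (a) and (b). This is a sketch that plugs in the coefficient estimates stated in the question; the function names `predicted_salary` and `female_minus_male` are illustrative and not part of the exercise.

```python
def predicted_salary(gpa, iq, female):
    """Fitted starting salary in thousands of dollars.

    female is 1 for Female and 0 for Male, matching the coding of X3.
    """
    return (
        50
        + 20 * gpa             # X1 = GPA
        + 0.07 * iq            # X2 = IQ
        + 35 * female          # X3 = Gender
        + 0.01 * gpa * iq      # X4 = GPA x IQ interaction
        - 10 * gpa * female    # X5 = GPA x Gender interaction
    )


def female_minus_male(gpa, iq):
    """Difference in fitted salary (female minus male) at the same GPA and IQ."""
    return predicted_salary(gpa, iq, 1) - predicted_salary(gpa, iq, 0)


# Part (b): a female with IQ 110 and GPA 4.0.
print(f"Predicted salary: {predicted_salary(4.0, 110, 1):.1f} thousand dollars")

# Part (a): how the female-minus-male gap changes with GPA at a fixed IQ.
for gpa in (2.0, 3.0, 3.5, 4.0):
    print(f"GPA {gpa}: female minus male = {female_minus_male(gpa, 110):+.1f}")
```

The sign of `female_minus_male` across GPA values shows which of the four statements in part (a) holds. For part (c), note that the size of a coefficient depends on the units of its predictor, so this function alone says nothing about statistical evidence.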

----------------------------------------------------------------------------------------------------

**Q. I collect a set of data (n = 100 observations) containing a single predictor and a quantitative response. I then fit a linear regression model to the data, as well as a separate cubic regression, i.e. Y = β0 + β1X + β2X² + β3X³ + ε.**

(a) Suppose that the true relationship between X and Y is linear, i.e. Y = β0 + β1X + ε. Consider the training residual sum of squares (RSS) for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer.

(b) Answer (a) using test rather than training RSS. 

(c) Suppose that the true relationship between X and Y is not linear, but we don’t know how far it is from linear. Consider the training RSS for the linear regression, and also the training RSS for the cubic regression. Would we expect one to be lower than the other, would we expect them to be the same, or is there not enough information to tell? Justify your answer. 

(d) Answer (c) using test rather than training RSS. 
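
A small simulation can help build intuition for parts (a) and (b). The sketch below generates data from a truly linear model, fits both a linear and a cubic polynomial by least squares, and reports training and test RSS. Everything about the setup is an assumption made for illustration: the true coefficients (1 and 2), the noise standard deviation (1), the range of X, the seed, and the helper names `fit_polynomial`, `rss`, and `simulate`. The fit uses the normal equations solved by Gaussian elimination, which is adequate for a low-degree polynomial but not a general-purpose solver.

```python
import random


def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial coefficients (lowest order first) via the normal equations."""
    m = degree + 1
    # Augmented system [X'X | X'y] for the polynomial design matrix.
    a = [
        [sum(x ** (i + j) for x in xs) for j in range(m)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(m)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, m):
            factor = a[r][col] / a[col][col]
            for c in range(col, m + 1):
                a[r][c] -= factor * a[col][c]
    # Back substitution.
    beta = [0.0] * m
    for i in reversed(range(m)):
        rest = sum(a[i][j] * beta[j] for j in range(i + 1, m))
        beta[i] = (a[i][m] - rest) / a[i][i]
    return beta


def rss(xs, ys, beta):
    """Residual sum of squares of the polynomial with coefficients beta."""
    return sum(
        (y - sum(b * x ** k for k, b in enumerate(beta))) ** 2
        for x, y in zip(xs, ys)
    )


def simulate(seed=0, n=100):
    """Return {model: (training RSS, test RSS)} when the truth is linear."""
    rng = random.Random(seed)

    def draw():
        xs = [rng.uniform(-2, 2) for _ in range(n)]
        ys = [1.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]  # assumed true model
        return xs, ys

    train_x, train_y = draw()
    test_x, test_y = draw()
    result = {}
    for name, degree in (("linear", 1), ("cubic", 3)):
        beta = fit_polynomial(train_x, train_y, degree)
        result[name] = (rss(train_x, train_y, beta), rss(test_x, test_y, beta))
    return result


for name, (train_rss, test_rss) in simulate().items():
    print(f"{name:>6}: training RSS = {train_rss:.2f}, test RSS = {test_rss:.2f}")
```

Because the linear model is a special case of the cubic one, the cubic training RSS cannot exceed the linear training RSS on the same data. The test RSS comparison depends on the particular draw, so rerunning `simulate` with several seeds gives a better picture than any single run.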

----------------------------------------------------------------------------------------------------

