# Homework 4: Solutions


<br>

**Conceptual:** Short answer questions. Be concise.

---
1. Consider the problem of classifying a binary response variable (i.e., $y \in \{1,0\}$). If there is no overlap in the values of X when y = 1 and when y=0,  such that there is a large “gap” between the two distributions of X values, then this is problematic for one of the classifiers discussed in class and the text. What classifier does this situation pose a problem for? Explain conceptually why this is a problem and compare it with another classifier approach that does not suffer this limitation.

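**Answer:** Perfect separation causes the coefficient estimates in logistic regression to diverge toward infinity. The model is
$$\ln\left(\frac{P}{1-P}\right) = \beta_0 + \beta_1 X$$
When the classes are perfectly separated, the likelihood keeps improving as the fitted probabilities approach exactly 0 or 1. At $P = 1$ the odds $P/(1-P)$ involve division by 0, so the log-odds, and with them $\beta_1$, grow without bound and the estimates become unstable.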

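Instead of logistic regression, linear discriminant analysis can be used. This works because it models the distribution of X separately within each class (y = 1 and y = 0) and then applies Bayes' theorem. Its parameters (class means and a shared variance) stay finite even when the classes do not overlap, which eliminates the infinity issue.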
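
The contrast can be sketched on simulated data. This is a minimal illustration, not part of the assignment: the toy data frame `toy` and its values are assumptions chosen so that the two classes are perfectly separated.

```r
library(MASS)  # provides lda()

# Toy, perfectly separated data (hypothetical): every x for y = 0 lies below every x for y = 1
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 11, 12, 13, 14, 15),
  y = factor(c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1))
)

# Logistic regression: R warns that fitted probabilities are numerically 0 or 1.
# The slope estimate is large and its standard error is typically enormous.
logit_fit <- suppressWarnings(glm(y ~ x, data = toy, family = binomial))
logit_slope <- unname(coef(logit_fit)["x"])
logit_se <- summary(logit_fit)$coefficients["x", "Std. Error"]

# LDA: estimates class means and a pooled variance, so the fit stays finite
lda_fit <- lda(y ~ x, data = toy)
lda_class <- predict(lda_fit, toy)$class

print(c(slope = logit_slope, std_error = logit_se))
print(table(predicted = lda_class, actual = toy$y))
```

Removing `suppressWarnings()` shows the warnings that `glm` raises in this situation.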



---
2. Compare logistic regression, LDA, and kNN classification approaches. Which are parametric which are non-parametric? For parametric models what functions do they assume? For non-parametric methods, how do the classifiers separate groups? How is the flexibility/bias tradeoff adjusted for each method?




**Answer:**
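<b>Logistic regression:</b> parametric. Assumes the log-odds of y are a linear function of the predictors, along with independence of errors, no multicollinearity, and a dichotomous y. <br>
<b>LDA:</b> parametric. Assumes the predictors follow a multivariate normal distribution within each class, homogeneity of variance/covariance across classes, independence of errors, and no multicollinearity. Flexibility/bias: for both parametric models, flexibility is adjusted through the number of predictors and terms included. As p approaches and overtakes n, the model overfits, so variance (rather than bias) increases.<br>
<b>kNN:</b> non-parametric. It uses consensus to separate groups. It polls a specified number k of nearest neighbors (defined by Euclidean distance) and goes with the majority class. Flexibility/bias: when k is small, the model is more flexible (lower bias, higher variance). When k is large, the model has more bias (and lower variance).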
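
The effect of k on flexibility can be sketched with simulated data. Everything in this block (the data, the seed, the helper `knn_error`, and the chosen values of k) is an assumption for illustration. It compares training and test error as k grows.

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated two-class data (hypothetical): two predictors, class 1 shifted by 1 in both
n <- 200
sim_y <- factor(rep(c(0, 1), each = n / 2))
sim_x <- matrix(rnorm(2 * n), ncol = 2) + as.numeric(sim_y == 1)

# Random half/half split into training and test sets
train_idx <- sample(n, n / 2)
x_train <- sim_x[train_idx, ]
x_test <- sim_x[-train_idx, ]
y_train <- sim_y[train_idx]
y_test <- sim_y[-train_idx]

# Misclassification rate of kNN for one value of k on a given evaluation set
knn_error <- function(k, x_eval, y_eval) {
  pred <- knn(train = x_train, test = x_eval, cl = y_train, k = k)
  mean(pred != y_eval)
}

k_values <- c(1, 5, 25, 75)
errors <- data.frame(
  k = k_values,
  train_error = sapply(k_values, knn_error, x_eval = x_train, y_eval = y_train),
  test_error = sapply(k_values, knn_error, x_eval = x_test, y_eval = y_test)
)
print(errors)
```

With k = 1 the training error is zero because each training point is its own nearest neighbor. Larger k smooths the decision boundary and raises the training error.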


---
3. What is the curse of dimensionality? Why is it especially problematic for kNN classification (i.e., why does kNN fail in high dimensional contexts)?


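**Answer:** As the number of predictors p grows relative to the number of observations n, the data become sparse: a fixed n has to cover a space whose volume grows exponentially with p. This is especially problematic for kNN because very few observations lie close to any given test point. The "nearest" neighbors end up far away, no longer resemble the point being classified, and predictions degrade.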
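
The sparsity can be sketched by simulation. This is a rough illustration: the sample size, the dimensions compared, and the helper `mean_nn_distance` are all assumptions. It measures how far away the nearest neighbor is, on average, as p grows with n held fixed.

```r
set.seed(1)  # arbitrary seed for reproducibility

n_points <- 100            # fixed sample size
dims <- c(1, 2, 10, 100)   # number of predictors p to compare

# Average distance from each point to its nearest neighbor in p dimensions
mean_nn_distance <- function(p) {
  pts <- matrix(runif(n_points * p), ncol = p)  # uniform points in the unit hypercube
  d <- as.matrix(dist(pts))                     # pairwise Euclidean distances
  diag(d) <- Inf                                # ignore each point's distance to itself
  mean(apply(d, 1, min))
}

nn_dist <- sapply(dims, mean_nn_distance)
print(data.frame(p = dims, mean_nn_distance = nn_dist))
```

The nearest-neighbor distance increases with p even though the number of observations does not change.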


---
4. Suppose that we take a data set, divide it into equally-sized training and test sets, and then try out two different classification procedures. First we use logistic regression and get an error rate of 20% on the training data and 30% on the test data. Next we use 1-nearest neighbors (i.e. K = 1) and get an average error rate (averaged over both test and training data sets) of 18%. Based on these results, which method should we prefer to use for classification of new observations? Why?

**Answer:** 
* For KNN with K=1, the training error rate is 0% because the nearest neighbor of any training observation is that observation itself. Because the training and test sets are the same size, an average error rate of 18% means that KNN has a test error rate of 36%. Thus logistic regression is preferred because of its lower test error rate of only 30%.
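
The arithmetic behind the 36% figure, writing the 1-NN training and test error rates as $e_{train}$ and $e_{test}$ (symbols introduced here for illustration only):

$$\frac{e_{train} + e_{test}}{2} = 0.18, \qquad e_{train} = 0 \;\Rightarrow\; e_{test} = 2 \times 0.18 = 0.36$$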

---
**Applied:** Show your code & plots
(Exercises 4.10 and 4.11 from ISLR.)

---

5. This question should be answered using the Weekly data set, which is part of the ISLR package. This data is similar in nature to the Smarket data from this chapter’s lab, except that it contains 1,089 weekly returns for 21 years, from the beginning of 1990 to the end of 2010.

(a) Produce some numerical and graphical summaries of the Weekly data. Do there appear to be any patterns?

(b) Use the full data set to perform a logistic regression with Direction as the response and the five lag variables plus Volume as predictors. Use the summary function to print the results. Do any of the predictors appear to be statistically significant? If so, which ones?

(c) Compute the confusion matrix and overall fraction of correct predictions. Explain what the confusion matrix is telling you about the types of mistakes made by logistic regression.

(d) Now fit the logistic regression model using a training data period from 1990 to 2008, with Lag2 as the only predictor. Compute the confusion matrix and the overall fraction of correct predictions for the held out data (that is, the data from 2009 and 2010).

(e) Repeat (d) using LDA.

(f) Repeat (d) using QDA.

(g) Repeat (d) using KNN with K = 1.

(h) Which of these methods appears to provide the best results on this data?

(i) Experiment with different combinations of predictors, including possible transformations and interactions, for each of the methods. Report the variables, method, and associated confusion matrix that appears to provide the best results on the held out data. Note that you should also experiment with values for K in the KNN classifier.

In [None]:
# ------------------
# Exercise 5
# ------------------

library(lme4)
library(ggplot2)
library(lattice)
library(ISLR)
library(caret)
library(MASS)
data("Weekly")
summary(Weekly)
par(mar=c(1,1,1,1))
plot(Weekly$Volume)
plot(Weekly$Today)
plot(Weekly$Lag5)
#a. After looking at the summary and plots, it appears that Volume is not normally distributed.
#All the lags are relatively similar as well.
modelq5 <- glm(Direction~1+Lag1+Lag2+Lag3+Lag4+Lag5+Volume, data=Weekly, family=binomial)
summary(modelq5)
#b. Lag2 is a statistically significant predictor of direction
glm_prob_df = data.frame(predict(modelq5, type = "response"))
colnames(glm_prob_df) = c('predicted')
n_observations = nrow(glm_prob_df)
glm_prob_df$index = seq(1, n_observations)
ggplot(glm_prob_df, aes(index, predicted)) + geom_point() + xlab('observation') + ylab('predicted response probability')
threshold = 0.50 #binarizing threshold 
glm_prob_df$predicted_binary=rep("Down",n_observations)
glm_prob_df$predicted_binary[glm_prob_df$predicted>threshold]="Up" #find the rows that have prob. > threshold and cast as 'up'
confusion_df = data.frame(glm_prob_df$predicted_binary, Weekly$Direction)
colnames(confusion_df) = c('predicted', 'actual')

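table(confusion_df)
mean(confusion_df$predicted == confusion_df$actual) #overall fraction of correct predictions
#c. Logistic regression predicts "Up" for most weeks, so most of its mistakes are Down weeks misclassified as Up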
train_glm=(Weekly$Year<2009)
glm_test = Weekly[!train_glm,]
glm_train = Weekly[train_glm,]

Direction_test=glm_test$Direction
glm.fit=glm(Direction~Lag2, data=Weekly, family=binomial, subset=train_glm) 
glm.probs=predict(glm.fit, glm_test, type="response")
glm.pred=rep("Down",nrow(glm_test)) 
glm.pred[glm.probs>0.5]="Up"
confusion_df = data.frame(glm.pred, Direction_test)
colnames(confusion_df) = c('predicted', 'actual')
table(confusion_df)
mean(confusion_df$predicted == confusion_df$actual)
# LDA
lda.fit = lda(Direction~Lag2, data=Weekly, subset=train_glm)
lda.fit
plot(lda.fit)
lda.pred = predict(lda.fit, glm_test)
lda.class = lda.pred$class
table(lda.class, Direction_test)
mean(lda.class==Direction_test)
#QDA
qda.fit = qda(Direction~Lag2, data=Weekly, subset=train_glm) #part (f) asks for Lag2 as the only predictor
qda.fit
qda.class=predict(qda.fit, glm_test)$class
table(qda.class, Direction_test)
mean(qda.class==Direction_test)
#kNN
library(class)
n <- knn(data.frame(glm_train$Lag2),data.frame(glm_test$Lag2),glm_train$Direction,k=1)
n
labels_test <- data.frame(glm_test$Direction)
merge <- data.frame(n,labels_test)

names(merge) <- c("Predicted Direction","Observed Direction")
merge

mean(merge$`Predicted Direction`==merge$`Observed Direction`)
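
Reading a confusion matrix, as part (c) asks, can be sketched on a small made-up example. The `actual` and `predicted` vectors below are hypothetical and are not taken from the Weekly data. The sketch separates overall accuracy from the per-class rates that show which kind of mistake dominates.

```r
# Hypothetical predicted and actual labels, for illustration only
actual <- factor(c("Up", "Up", "Up", "Up", "Up", "Up", "Down", "Down", "Down", "Down"),
                 levels = c("Down", "Up"))
predicted <- factor(c("Up", "Up", "Up", "Up", "Up", "Down", "Up", "Up", "Up", "Down"),
                    levels = c("Down", "Up"))

conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Overall fraction of correct predictions: diagonal over total
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# Fraction of actual "Up" observations predicted correctly
up_rate <- conf_mat["Up", "Up"] / sum(conf_mat[, "Up"])

# Fraction of actual "Down" observations predicted correctly
down_rate <- conf_mat["Down", "Down"] / sum(conf_mat[, "Down"])

print(c(accuracy = accuracy, up_rate = up_rate, down_rate = down_rate))
```

A classifier that predicts "Up" almost everywhere can have a reasonable overall accuracy while getting most "Down" observations wrong, which is what the per-class rates expose.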



---


6. In this problem, you will develop a model to predict whether a given car gets high or low gas mileage based on the Auto data set.

(a) Create a binary variable, mpg01, that contains a 1 if mpg contains a value above its median, and a 0 if mpg contains a value below its median. You can compute the median using the median() function. Note you may find it helpful to use the data.frame() function to create a single data set containing both mpg01 and the other Auto variables.

(b) Explore the data graphically in order to investigate the association between mpg01 and the other features. Which of the other features seem most likely to be useful in predicting mpg01? Scatterplots and boxplots may be useful tools to answer this question. Describe your findings.

(c) Split the data into a training set and a test set.

(d) Perform LDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(e) Perform QDA on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(f) Perform logistic regression on the training data in order to predict mpg01 using the variables that seemed most associated with mpg01 in (b). What is the test error of the model obtained?

(g) Perform KNN on the training data, with several values of K, in order to predict mpg01. Use only the variables that seemed most associated with mpg01 in (b). What test errors do you obtain? Which value of K seems to perform the best on this data set?

In [None]:
# ------------------
# Exercise 6
# ------------------


library(stats)
library(miscTools)
data("Auto")
summary(Auto)
#a
Auto$mpg01 <- as.factor(ifelse(Auto$mpg>median(Auto$mpg),1,0))
#b
ggplot(Auto, aes(mpg01,year))+geom_boxplot()
#Median and spread of year are higher for cars with mpg above the median
ggplot(Auto, aes(mpg01,acceleration))+geom_boxplot()
#same is true for acceleration but less so
ggplot(Auto, aes(mpg01,horsepower))+geom_boxplot()
#Horsepower is lower for vehicles with mpg above the median
ggplot(Auto, aes(mpg01,displacement))+geom_boxplot()
ggplot(Auto, aes(mpg01,weight))+geom_boxplot()
#Displacement and weight are lower as well
ggplot(Auto, aes(mpg01,origin))+geom_point()
#there may not be enough data to understand cylinders and origin
#(c)
set.seed(1) #seed the random number generator so the train/test split is reproducible
train=sample(x=392,size=196)
Auto_train <- Auto[train,]
Auto_test <- Auto[-train,]
mpg_test <- Auto_test$mpg01
#LDA
lda.fit = lda(mpg01~horsepower+year+weight+acceleration, data=Auto, subset=train)
lda.fit
plot(lda.fit)
lda.pred = predict(lda.fit, Auto_test)
lda.class = lda.pred$class
table(lda.class, mpg_test)
mean(lda.class==mpg_test)
#Accuracy is high; test error is ~.11
#QDA
qda.fit = qda(mpg01~horsepower+year+weight+acceleration, data=Auto, subset=train)
qda.fit
qda.class=predict(qda.fit, Auto_test)$class
table(qda.class, mpg_test)
mean(qda.class==mpg_test)
#also .11
#logistic
glm.fit=glm(mpg01~horsepower+year+weight+acceleration, data=Auto, family=binomial,subset=train) 
glm.probs=predict(glm.fit, Auto_test, type="response")
glm.pred=rep("0",nrow(Auto_test)) #default prediction: mpg not above the median
glm.pred[glm.probs>0.5]="1" #glm models P(mpg01 = 1), the second factor level
confusion_df = data.frame(glm.pred, mpg_test)
colnames(confusion_df) = c('predicted', 'actual')
table(confusion_df)
mean(confusion_df$predicted == confusion_df$actual)
#Accuracy is ~.9, so test error is ~.1
#KNN
ktest <- knn(data.frame(Auto_train$horsepower),data.frame(Auto_test$horsepower),Auto_train$mpg01,k=1)
ktest
labels_test <- data.frame(Auto_test$mpg01)
merge <- data.frame(ktest,labels_test)

names(merge) <- c("Predicted MPG","Observed MPG")
merge

mean(merge$`Predicted MPG`==merge$`Observed MPG`)
ktest <- knn(data.frame(Auto_train$horsepower),data.frame(Auto_test$horsepower),Auto_train$mpg01,k=5)
ktest
labels_test <- data.frame(Auto_test$mpg01)
merge <- data.frame(ktest,labels_test)

names(merge) <- c("Predicted MPG","Observed MPG")
merge

mean(merge$`Predicted MPG`==merge$`Observed MPG`)

ktest <- knn(data.frame(Auto_train$horsepower),data.frame(Auto_test$horsepower),Auto_train$mpg01,k=27)
ktest
labels_test <- data.frame(Auto_test$mpg01)
merge <- data.frame(ktest,labels_test)

names(merge) <- c("Predicted MPG","Observed MPG")
merge

mean(merge$`Predicted MPG`==merge$`Observed MPG`)
#~27 seems to give the lowest error 
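
The three kNN fits above repeat the same steps for each K and use horsepower alone. A more compact version that loops over K and handles several predictors might look like the sketch below. It runs on simulated stand-in data rather than Auto, and the variable names, the seed, the candidate K values, and the standardization step are all assumptions added for illustration (standardizing matters once predictors on different scales enter the distance).

```r
library(class)  # provides knn()

set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in for Auto (hypothetical): two predictors on very different
# scales, both related to a binary label
n <- 300
label <- factor(rep(c(0, 1), each = n / 2))
big_scale <- rnorm(n, mean = 3000 - 600 * (label == 1), sd = 500)  # weight-like
small_scale <- rnorm(n, mean = 15 + 2 * (label == 1), sd = 2)      # acceleration-like
x <- cbind(big_scale, small_scale)

train_idx <- sample(n, n / 2)

# Standardize with the training means and standard deviations only,
# then apply the same transformation to the test set
x_train <- scale(x[train_idx, ])
x_test <- scale(x[-train_idx, ],
                center = attr(x_train, "scaled:center"),
                scale = attr(x_train, "scaled:scale"))

# Test error for each candidate K
k_values <- c(1, 5, 15, 27, 51)
test_error <- sapply(k_values, function(k) {
  pred <- knn(x_train, x_test, cl = label[train_idx], k = k)
  mean(pred != label[-train_idx])
})

results <- data.frame(k = k_values, test_error = test_error)
print(results)
best_k <- results$k[which.min(results$test_error)]
print(best_k)
```

On the real data, `x` would be the chosen Auto columns and `label` would be `mpg01`. The best K found this way depends on the particular train/test split.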