### HSML 6295 Session 3 (Classification) - ANSWERS

#### I. The `Framingham Heart Study` Data Set

"Cardiovascular disease (CVD) is the leading cause of death and serious illness in the United States. In 1948, the Framingham Heart Study - under the direction of the National Heart Institute (now known as the National Heart, Lung, and Blood Institute or NHLBI) - embarked on an ambitious project in health research. ...
The researchers recruited 5,209 men and women between the ages of 30 and 62 from the town of Framingham, Massachusetts, and began the first round of extensive physical examinations and lifestyle interviews that they would later analyze for common patterns related to CVD development. Since 1948, the subjects have continued to return to the study every two years for a detailed medical history, physical examination, and laboratory tests." 
https://framinghamheartstudy.org/fhs-about/history/

Our objective is to predict the 10-year risk of Coronary Heart Disease (CHD). The response variable `observed.CHD` is 1 if the individual developed CHD over the 10-year observation period and 0 otherwise.

| Predictor | Description
| ---       | ---
| Male | sex of patient
| Age | age in years at first examination
| Education | some HS [1], high school [2], some college [3], college [4]
| Smoker | yes [1], no [0]
| Cigs.per.Day | cigarettes smoked per day
| BP.Medication | on blood pressure medication at time of first examination
| Stroke | has had a stroke
| Hypertension | currently hypertensive
| Diabetes | currently has diabetes 
| Cholesterol | total cholesterol (mg/dL)
| Systolic.BP | systolic blood pressure
| Diastolic.BP | diastolic blood pressure
| BMI | Body Mass Index, weight (kg)/height (m)^2
| Heart.Rate | heart rate (beats/minute)
| Glucose | blood glucose level (mg/dL)

Read in the data set


In [None]:
framingham = read.csv("HSML 6295 s3 Data Set Framingham.csv")
dim(framingham)



The `na.omit()` function removes all the observations with missing values for any variable.


In [None]:
framingham = na.omit(framingham)
dim(framingham)
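
As a small illustration of what `na.omit()` does, the sketch below applies it to a made-up data frame. The values and the names `toy` and `toy_complete` are hypothetical and are not taken from the Framingham file.

```r
# Toy data frame with missing values (hypothetical values)
toy <- data.frame(Age = c(39, 46, NA, 61),
                  Glucose = c(77, NA, 70, 103))

# Keep only the rows in which every variable is observed
toy_complete <- na.omit(toy)

dim(toy)           # 4 rows before
dim(toy_complete)  # 2 rows after: rows 2 and 3 each contain an NA
```

Because a single missing value removes the entire row, comparing `dim()` before and after shows how many observations were lost.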



Load the R package `stargazer` to display the summary statistics succinctly:


In [None]:
library(stargazer)



In [None]:
stargazer(framingham, 
          summary.stat = c("n", "mean", "sd", "min", "p25", "median", "p75", "max"),
          type = "text", title="Descriptive statistics", digits=1)



Tabulate relative frequencies of the values of `observed.CHD` in the full data set


In [None]:
round(prop.table(table(framingham$observed.CHD)),2)




Declare `observed.CHD` a factor variable and assign labels "Yes" and "No" to 1 and 0, respectively:


In [None]:
framingham$observed.CHD = factor(framingham$observed.CHD, 
                             levels = c(0,1), labels = c("No", "Yes"))
round(prop.table(table(framingham$observed.CHD)),2)



Randomly split the `framingham` data set into a *training set* comprising 75% of observations and a *test set* comprising the remainder.


In [None]:
library(caTools)
set.seed(1)
split = sample.split(framingham$observed.CHD, SplitRatio = 0.75)
train = subset(framingham, split==TRUE)
test = subset(framingham, split==FALSE)
dim(framingham)
dim(train)
dim(test)
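
`sample.split()` receives the outcome vector so that the class proportions stay roughly the same in both subsets. The base-R sketch below illustrates that idea on a simulated outcome. It is not the `caTools` implementation, and the names `y`, `train_idx`, and `in_train` are made up for this example.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated outcome with roughly 15% "Yes" cases (hypothetical data)
y <- factor(sample(c("No", "Yes"), size = 200, replace = TRUE, prob = c(0.85, 0.15)))

# Draw 75% of the row indices separately within each class
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))
in_train <- seq_along(y) %in% train_idx

# Class shares are nearly identical in the two subsets
round(prop.table(table(y[in_train])), 2)
round(prop.table(table(y[!in_train])), 2)
```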


#### II. Baseline Prediction 

Without information about any predictors, the best approach is to classify all observations to the most common class in the training data.
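
The rule can be sketched in a few lines. The vectors below are hypothetical toy data, and the names `y_train`, `y_test`, and `baseline_pred` are made up for this illustration.

```r
# Toy outcome vectors (hypothetical data, not the Framingham sample)
y_train <- factor(c("No", "No", "Yes", "No", "No", "Yes", "No", "No"))
y_test  <- factor(c("No", "Yes", "No", "No"), levels = levels(y_train))

# Most common class in the training data
majority_class <- names(which.max(table(y_train)))

# Predict that class for every test observation
baseline_pred <- rep(majority_class, length(y_test))

# Baseline accuracy is the share of test observations in the majority class
baseline_accuracy <- mean(baseline_pred == as.character(y_test))
majority_class      # "No"
baseline_accuracy   # 0.75
```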

Tabulate relative frequencies of the values of `observed.CHD` in the training set


In [None]:
round(prop.table(table(train$observed.CHD)),2)



**Knowledge Check 1.** Declare `Education` a factor variable and assign labels "some HS", "high school", "some college", "college" to values 1, 2, 3, and 4, respectively, in the training set. Then tabulate relative frequencies of the values of `Education` in the training set. What is the baseline prediction for `Education`?



In [None]:
train$Education = factor(train$Education,     # declare Education as factor 
                         levels = c(1, 2, 3, 4), 
                         labels = c("some HS", "high school", "some college", "college"))
round(prop.table(table(train$Education)),2)   # tabulate relative frequencies
train$Education = as.numeric(train$Education) # convert Education from factor back to numeric format
round(prop.table(table(train$Education)),2)   # tabulate relative frequencies


    
**The baseline prediction for `Education` is "some HS" because it is the most common class in the training set.**

Create a vector of baseline ("naïve") predictions: 

Every individual in the test set is predicted to have "No" CHD.


In [None]:
predicted.CHD = rep("No",nrow(test))
round(prop.table(table(predicted.CHD)),2)


Confusion matrix (refer to Table 4.6 in ISLR for the notation):


In [None]:
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm


|                  |       | **Predicted CHD** |      |       |
| ---              | ---   | ---:              | ---: | ---:  |
|                  |       | No                | Yes  | Total |
| **Observed CHD** | No    | 775 TN            | 0 FP | 775 N |
|                  | Yes   | 139 FN            | 0 TP | 139 P |
|                  | Total | 914 N*            | 0 P* | 914 A |


We compute 3 statistics that measure predictive performance

Statistic  
Definition  
Value  

**Accuracy**  (1 - Error Rate)  
(TN+TP)/A  


In [None]:
# No "Yes" predictions, so cm has only the columns "No" and "Total": TP = 0 and A = cm[3,2]
round(100*(cm[1,1]+0)/cm[3,2],1)



**True Positive Rate**  (Sensitivity)  
TP/P  


In [None]:
# TP = 0 for the baseline; P (total of observed "Yes") is cm[2,2]
round(100*0/cm[2,2],1)



**False Positive Rate**  (1 - Specificity)  
FP/N  


In [None]:
# FP = 0 for the baseline; N (total of observed "No") is cm[1,2]
round(100*0/cm[1,2],1)



#### III. Logistic Regression

We obtain the predictions in 3 steps:

1. Fit the logistic regression model using the observations in the training set
2. Use the fitted model to estimate the probability of CHD for each observation in the test set
3. Use the estimated probabilities and a threshold value to classify the test observations

**Step 1.** Estimate a logistic regression model of `observed.CHD` as a function of all predictors using the observations in the *training* set


In [None]:
fit = glm(observed.CHD ~ ., data = train, family=binomial)




Print the results


In [None]:
summary(fit)




**Step 2.** Estimate the probability of CHD $\hat{p}$ for each observation in the *test* set


In [None]:
probability.CHD = predict(fit, type="response", newdata=test)




**Step 3.** Classify each observation in the *test* set for which the probability is greater than 0.5, i.e. $\hat{p} > 0.5$, as `predicted.CHD = "Yes"`; else classify the observation as `predicted.CHD = "No"`. 


In [None]:
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.5] = "Yes"



The predictions from Step 3 yield the confusion matrix and predictive performance statistics


In [None]:
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE) # add row & column totals
cm


Statistic  
Definition  
Value  

**Accuracy**  
(TN+TP)/A  


In [None]:
round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)



**True Positive Rate**  
TP/P  


In [None]:
round(100*cm[2,2]/cm[2,3],1)



**False Positive Rate**  
FP/N  


In [None]:
round(100*cm[1,2]/cm[1,3],1)
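
The three statistics can also be wrapped in a small helper. The sketch below is one possible way to do this; the function name `performance_stats` and its arguments are hypothetical. It converts both vectors to factors with the same two levels, so the confusion matrix is always 2 x 2, even when a threshold produces no "Yes" predictions (as in the baseline of Section II).

```r
# Hypothetical helper: accuracy, TPR, and FPR (in percent) from observed and predicted labels
performance_stats <- function(observed, predicted, classes = c("No", "Yes")) {
  observed  <- factor(observed,  levels = classes)
  predicted <- factor(predicted, levels = classes)
  cm <- table(observed, predicted)   # always 2 x 2: both factors share the same levels
  tn <- cm[1, 1]; fp <- cm[1, 2]
  fn <- cm[2, 1]; tp <- cm[2, 2]
  round(100 * c(accuracy = (tn + tp) / sum(cm),
                tpr = tp / (tp + fn),
                fpr = fp / (fp + tn)), 1)
}

# Check against the baseline confusion matrix of Section II:
# 775 observed "No", 139 observed "Yes", everyone predicted "No"
observed  <- rep(c("No", "Yes"), times = c(775, 139))
predicted <- rep("No", length(observed))
stats <- performance_stats(observed, predicted)
stats   # accuracy 84.8, tpr 0, fpr 0
```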



**Knowledge Check 2.** Compute the confusion matrix and the 3 resulting predictive performance statistics (accuracy, true positive rate, false positive rate) when the probability threshold is 0.48.

**The confusion matrix with threshold of 0.48 and the resulting predictive performance statistics are**


In [None]:
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.48] = "Yes"
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm


Statistic  
Value


**Accuracy**  


In [None]:
 round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)




**True Positive Rate**  


In [None]:
 round(100*cm[2,2]/cm[2,3],1)




**False Positive Rate**  


In [None]:
 round(100*cm[1,2]/cm[1,3],1)



**Knowledge Check 3.** Can you find a probability threshold for which the true positive rate exceeds 50%? What is the classifier's false positive rate in this case?



In [None]:
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.20] = "Yes"
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm


Statistic  
Value


**Accuracy**


In [None]:
 round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)




**True Positive Rate**


In [None]:
 round(100*cm[2,2]/cm[2,3],1)




**False Positive Rate**


In [None]:
 round(100*cm[1,2]/cm[1,3],1)




**A threshold of 0.20 achieves a true positive rate of**


In [None]:
 round(100*cm[2,2]/cm[2,3],1)



**but also raises the false positive rate to**


In [None]:
 round(100*cm[1,2]/cm[1,3],1)
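
Rather than editing the threshold by hand, the trade-off can be tabulated for a grid of thresholds. The sketch below uses simulated outcomes and probabilities, not the Framingham predictions, and all names in it are hypothetical.

```r
set.seed(1)  # arbitrary seed for this sketch
n <- 500
observed <- rbinom(n, size = 1, prob = 0.15)   # 1 = event, 0 = no event

# Simulated probabilities in (0, 1): events tend to receive higher values
p_hat <- plogis(rnorm(n, mean = ifelse(observed == 1, -1, -2)))

thresholds <- c(0.5, 0.4, 0.3, 0.2, 0.1)
sweep_tab <- t(sapply(thresholds, function(th) {
  predicted <- as.integer(p_hat > th)
  c(threshold = th,
    tpr = 100 * sum(predicted == 1 & observed == 1) / sum(observed == 1),
    fpr = 100 * sum(predicted == 1 & observed == 0) / sum(observed == 0))
}))
round(sweep_tab, 1)
```

Each lower threshold labels a superset of the previous "Yes" predictions, so neither the true positive rate nor the false positive rate can fall as the threshold decreases.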



Varying the probability threshold yields the ROC curve and AUC (area under the ROC curve)



In [None]:
library(pROC)
roc_full = roc(test$observed.CHD, probability.CHD, plot=TRUE, legacy.axes = TRUE, 
               xlab = "False Positive Rate", ylab = "True Positive Rate",
               auc.polygon=TRUE, max.auc.polygon=TRUE, print.auc=TRUE)
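
To make the link between thresholds, the ROC curve, and the AUC concrete, the sketch below traces the ROC points by hand for four made-up scores and computes the AUC in two ways: with the trapezoidal rule and with the rank-based (Mann-Whitney) formula. It is a base-R illustration on hypothetical data, not a description of how `pROC` computes its result.

```r
# Toy data (hypothetical): 1 = event, 0 = no event
observed <- c(0, 0, 1, 1)
score    <- c(0.10, 0.40, 0.35, 0.80)

# One ROC point per threshold; Inf gives the (0, 0) corner, -Inf the (1, 1) corner
thresholds <- c(Inf, sort(unique(score), decreasing = TRUE), -Inf)
tpr <- sapply(thresholds, function(th) mean(score[observed == 1] > th))
fpr <- sapply(thresholds, function(th) mean(score[observed == 0] > th))

# Area under the piecewise-linear ROC curve (trapezoidal rule)
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Rank-based formula: probability that a random event outranks a random non-event
n_pos <- sum(observed == 1)
n_neg <- sum(observed == 0)
auc_rank <- (sum(rank(score)[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

c(auc_trapezoid, auc_rank)   # both equal 0.75
```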


**Knowledge Check 4.** Estimate a logistic regression model on the training data but only include the following 5 predictors: Male, Age, Cigs.per.Day, Systolic.BP, Glucose. These are the predictors whose p-values were smaller than 0.05 (had at least one asterisk) in the results table above. Compute the confusion matrix and the 3 resulting predictive performance statistics (accuracy, true positive rate, false positive rate) when the probability threshold is 0.48, 0.47, 0.46, 0.45, and 0.44. If missing a case of CHD were much more costly than mistakenly diagnosing a patient with CHD, which threshold value would you choose?

**Run the following chunk and only replace the threshold value in the following line of code:**
`predicted.CHD[probability.CHD > 0.44] = "Yes"`


In [None]:
fit = glm(observed.CHD ~ Male + Age + Cigs.per.Day + Systolic.BP + Glucose, 
          data = train, family=binomial)
probability.CHD = predict(fit, type="response", newdata=test)
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.44] = "Yes"
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm
accuracy = round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)
accuracy
tpr = round(100*cm[2,2]/cm[2,3],1)
tpr
fpr = round(100*cm[1,2]/cm[1,3],1)
fpr


**For all 5 threshold values, the accuracy is the same. As we lower the threshold, we predict more CHD cases. Thus, we raise both the true positive rate and the false positive rate. If raising the true positive rate is more valuable than lowering the false positive rate, we would prefer the lowest threshold that still maintains the same accuracy level, i.e. 0.44.**
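
One way to formalize "missing a case is much more costly" is to attach a cost to each type of error and pick the threshold with the lowest total cost. The sketch below does this on simulated data. The 5:1 cost ratio, the data, and all names are assumptions chosen purely for illustration.

```r
set.seed(1)  # arbitrary seed for this sketch
n <- 500
observed <- rbinom(n, size = 1, prob = 0.15)   # 1 = event, 0 = no event
p_hat <- plogis(rnorm(n, mean = ifelse(observed == 1, -1, -2)))

cost_fn <- 5   # assumed cost of a false negative (missed event)
cost_fp <- 1   # assumed cost of a false positive

thresholds <- seq(0.05, 0.50, by = 0.05)
total_cost <- sapply(thresholds, function(th) {
  predicted <- as.integer(p_hat > th)
  fn <- sum(predicted == 0 & observed == 1)
  fp <- sum(predicted == 1 & observed == 0)
  cost_fn * fn + cost_fp * fp
})

# Threshold with the lowest total cost under the assumed cost ratio
best_threshold <- thresholds[which.min(total_cost)]
data.frame(threshold = thresholds, total_cost = total_cost)
best_threshold
```

Raising `cost_fn` relative to `cost_fp` penalizes missed cases more heavily, which tends to favor lower thresholds.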

**Knowledge Check 5.** Construct the ROC curve and compute the AUC. How does the AUC of the restricted model (the one using only 5 predictors) compare to the AUC of the full model? Now change the seed in the `set.seed(1)` call that precedes the train/test split from 1 to 1000, rerun the code, and compare the AUC values for the full and restricted models again.


In [None]:
library(pROC)
roc_restricted = roc(test$observed.CHD, probability.CHD, plot=TRUE, legacy.axes = TRUE, 
                     xlab = "False Positive Rate", ylab = "True Positive Rate",
                     auc.polygon=TRUE, max.auc.polygon=TRUE, print.auc=TRUE)



**When the seed is 1, the AUC of the full model is**


In [None]:
 round(roc_full$auc, 3)



**while the AUC of the restricted model is** 


In [None]:
 round(roc_restricted$auc, 3)



**When the seed is 1000, the AUC of the full model is 0.734, while the AUC of the restricted model is 0.739.**
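
The comparison depends on the seed because the seed determines which observations land in the test set. The sketch below repeats a split for several seeds on simulated data and reports the test AUC of a full and a restricted logistic regression model. The data, the simple (non-stratified) split, and the helper `auc()` are assumptions of this illustration and do not reproduce the Framingham results.

```r
# Rank-based AUC for a 0/1 outcome (hypothetical helper)
auc <- function(observed, score) {
  n_pos <- sum(observed == 1)
  n_neg <- sum(observed == 0)
  (sum(rank(score)[observed == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(123)  # seed for simulating the data
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$y <- rbinom(n, size = 1, prob = plogis(-2 + 0.8 * sim$x1 + 0.5 * sim$x2))

seeds <- c(1, 10, 100, 1000)
auc_by_seed <- t(sapply(seeds, function(s) {
  set.seed(s)
  in_train <- sample(n) <= 0.75 * n   # simple random 75/25 split
  fit_full <- glm(y ~ x1 + x2 + noise, data = sim[in_train, ], family = binomial)
  fit_restricted <- glm(y ~ x1 + x2, data = sim[in_train, ], family = binomial)
  test_set <- sim[!in_train, ]
  c(seed = s,
    full = auc(test_set$y, predict(fit_full, newdata = test_set, type = "response")),
    restricted = auc(test_set$y, predict(fit_restricted, newdata = test_set, type = "response")))
}))
round(auc_by_seed, 3)
```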

#### IV. Linear Discriminant Analysis (LDA)

As with logistic regression, with LDA we obtain the predictions in 3 steps:

1. Fit the model on the training set


In [None]:
library(MASS)
fit = lda(observed.CHD ~ ., data = train)



2. Compute the estimated probability of CHD for each observation in the test set


In [None]:
probability.CHD = predict(fit, newdata = test)$posterior[,2]




3. Use the estimated probabilities and a threshold value to classify the test observations


In [None]:
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.5] = "Yes"
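
In Step 2, `predict()` on an `lda` fit returns a `posterior` matrix with one column per class, ordered like the factor levels, so `[,2]` picks the second level ("Yes" here). Selecting the column by name makes that choice explicit. The sketch below shows this on the built-in `iris` data restricted to two species, which serves only as a stand-in data set.

```r
library(MASS)

# Two-class stand-in data: keep two species and drop the unused factor level
two_class <- droplevels(subset(iris, Species != "setosa"))

fit_demo <- lda(Species ~ Sepal.Length + Sepal.Width, data = two_class)
post <- predict(fit_demo, newdata = two_class)$posterior

colnames(post)                       # one column per class, in factor-level order
p_virginica <- post[, "virginica"]   # same values as post[, 2]
all.equal(p_virginica, post[, 2])
```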



The predictions from Step 3 yield the confusion matrix and predictive performance statistics


In [None]:
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm


Statistic  
Value


**Accuracy**


In [None]:
 round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)




**True Positive Rate**


In [None]:
 round(100*cm[2,2]/cm[2,3],1)




**False Positive Rate**


In [None]:
 round(100*cm[1,2]/cm[1,3],1)



Varying the probability threshold yields the ROC curve and AUC (area under the ROC curve)



In [None]:
library(pROC)
roc_lda_full = roc(test$observed.CHD, probability.CHD, plot=TRUE, legacy.axes = TRUE, 
                   xlab = "False Positive Rate", ylab = "True Positive Rate",
                   auc.polygon=TRUE, max.auc.polygon=TRUE, print.auc=TRUE)


**Knowledge Check 6.** Rerun the linear discriminant analysis on the training data but only include the following 5 predictors: Male, Age, Cigs.per.Day, Systolic.BP, Glucose. Compute the confusion matrix and the 3 resulting predictive performance statistics (accuracy, true positive rate, false positive rate) when the probability threshold is 0.5. Construct the ROC curve and compute the AUC.



In [None]:
fit = lda(observed.CHD ~ Male + Age + Cigs.per.Day + Systolic.BP + Glucose, 
          data = train)
probability.CHD = predict(fit, newdata = test)$posterior[,2]
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.5] = "Yes"


In [None]:
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm


Statistic  
Value


**Accuracy**


In [None]:
 round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)




**True Positive Rate**


In [None]:
 round(100*cm[2,2]/cm[2,3],1)




**False Positive Rate**


In [None]:
 round(100*cm[1,2]/cm[1,3],1)



In [None]:
library(pROC)
roc_lda_restricted = roc(test$observed.CHD, probability.CHD, plot=TRUE, legacy.axes = TRUE, 
                         xlab = "False Positive Rate", ylab = "True Positive Rate",
                         auc.polygon=TRUE, max.auc.polygon=TRUE, print.auc=TRUE)


#### V. Quadratic Discriminant Analysis (QDA)



In [None]:
library(MASS)
fit = qda(observed.CHD ~ ., data = train)
probability.CHD = predict(fit, newdata = test)$posterior[,2]
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.5] = "Yes"


In [None]:
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm


Statistic  
Value


**Accuracy**


In [None]:
 round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)




**True Positive Rate**


In [None]:
 round(100*cm[2,2]/cm[2,3],1)




**False Positive Rate**


In [None]:
 round(100*cm[1,2]/cm[1,3],1)



In [None]:
library(pROC)
roc_qda_full = roc(test$observed.CHD, probability.CHD, plot=TRUE, legacy.axes = TRUE, 
                   xlab = "False Positive Rate", ylab = "True Positive Rate",
                   auc.polygon=TRUE, max.auc.polygon=TRUE, print.auc=TRUE)


**Knowledge Check 7.** Rerun the quadratic discriminant analysis on the training data but only include the following 5 predictors: Male, Age, Cigs.per.Day, Systolic.BP, Glucose. Compute the confusion matrix and the 3 resulting predictive performance statistics (accuracy, true positive rate, false positive rate) when the probability threshold is 0.5. Construct the ROC curve and compute the AUC.



In [None]:
fit = qda(observed.CHD ~ Male + Age + Cigs.per.Day + Systolic.BP + Glucose, 
          data = train)
probability.CHD = predict(fit, newdata = test)$posterior[,2]
predicted.CHD = rep("No", nrow(test))
predicted.CHD[probability.CHD > 0.5] = "Yes"


In [None]:
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm


Statistic  
Value


**Accuracy**


In [None]:
 round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)




**True Positive Rate**


In [None]:
 round(100*cm[2,2]/cm[2,3],1)




**False Positive Rate**


In [None]:
 round(100*cm[1,2]/cm[1,3],1)



In [None]:
library(pROC)
roc_qda_restricted = roc(test$observed.CHD, probability.CHD, plot=TRUE, legacy.axes = TRUE,
                         xlab = "False Positive Rate", ylab = "True Positive Rate",
                         auc.polygon=TRUE, max.auc.polygon=TRUE, print.auc=TRUE)


#### VI. K-Nearest Neighbors

Convert the factor variable `observed.CHD` back to numeric format:


In [None]:
train$observed.CHD = as.numeric(train$observed.CHD)-1
test$observed.CHD = as.numeric(test$observed.CHD)-1
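
The `-1` is needed because `as.numeric()` applied to a factor returns the internal level codes, which start at 1. A minimal illustration on a made-up vector (the name `chd` is hypothetical):

```r
# Toy factor built the same way as observed.CHD (hypothetical values)
chd <- factor(c(0, 1, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))

as.numeric(chd)       # internal level codes: 1 2 2 1
as.numeric(chd) - 1   # back to the original 0/1 coding: 0 1 1 0
```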



Run K-NN and print the resulting confusion matrix and accuracy


In [None]:
library(class)
set.seed(1)
predicted.CHD = knn(train[,-1], test[,-1], train[,ncol(train)], k = 1)
predicted.CHD = factor(predicted.CHD, levels = c(0,1), labels = c("No", "Yes"))
test$observed.CHD = factor(test$observed.CHD, levels = c(0,1), labels = c("No", "Yes"))
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm
accuracy = round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)
accuracy
test$observed.CHD = as.numeric(test$observed.CHD)-1


**Knowledge Check 8.** Compute the accuracy of the K-nearest neighbor classifier for K = 2, 4, 8, 16, 32, 64, and 128 and record it in the table below. Which value of K minimizes the test error rate?

K         |    1 |    2 |    4 |    8 |   16 |   32 |   64 |  128 |
---       | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | 
Accuracy  | 76.8 |      |      |      |      |      |      |      |

**Replace the value of `k` in this line of the code above**

`predicted.CHD = knn(train[,-1], test[,-1], train[,ncol(train)], k = 1)`

**with 2, 4, 8, 16, 32, 64, and 128:**


In [None]:
library(class)
set.seed(1)
predicted.CHD = knn(train[,-1], test[,-1], train[,ncol(train)], k = 2)
predicted.CHD = factor(predicted.CHD, levels = c(0,1), labels = c("No", "Yes"))
test$observed.CHD = factor(test$observed.CHD, levels = c(0,1), labels = c("No", "Yes"))
cm = table(test$observed.CHD, predicted.CHD)
cm <- addmargins(cm, FUN = list(Total = sum), quiet = TRUE)
cm
accuracy = round(100*(cm[1,1]+cm[2,2])/cm[3,3],1)
accuracy
test$observed.CHD = as.numeric(test$observed.CHD)-1


K         |    1 |    2 |    4 |    8 |   16 |   32 |   64 |  128 |
---       | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | 
Accuracy  | 76.8 | 77.7 | 81.6 | 84.2 | 85.0 | 85.0 | 84.8 | 84.8 |
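
A loop over K avoids rerunning the chunk by hand. The sketch below uses simulated data, so its accuracies are unrelated to the table above, and all names are hypothetical. Unlike the call above, it passes only predictor columns (outcome excluded) to `knn()` and standardizes them with the training means and standard deviations. Both are choices of this sketch, made because K-NN distances are sensitive to the scale of the predictors.

```r
library(class)

set.seed(1)  # arbitrary seed; also fixes knn()'s random tie-breaking
n <- 400
sim <- data.frame(x1 = rnorm(n, sd = 1), x2 = rnorm(n, sd = 50))  # very different scales
sim$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.5 * sim$x1 + 0.02 * sim$x2))

in_train <- seq_len(n) <= 300
predictors <- c("x1", "x2")   # the outcome column is excluded

# Standardize both subsets with the training means and standard deviations
mu  <- colMeans(sim[in_train, predictors])
sdv <- apply(sim[in_train, predictors], 2, sd)
x_train <- scale(sim[in_train, predictors], center = mu, scale = sdv)
x_test  <- scale(sim[!in_train, predictors], center = mu, scale = sdv)

# Test accuracy (in percent) for each value of K
ks <- c(1, 2, 4, 8, 16, 32, 64, 128)
accuracy_by_k <- sapply(ks, function(k) {
  pred <- knn(x_train, x_test, cl = factor(sim$y[in_train]), k = k)
  round(100 * mean(as.character(pred) == as.character(sim$y[!in_train])), 1)
})
names(accuracy_by_k) <- ks
accuracy_by_k
```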
