### Logistic Regression

In this practice, we will use the white wine quality data set to create a model to predict the quality of the white wine based on the available variables. Let's read the data from 'wine quality/winequality-white.csv'.

In [None]:
wine_quality_data <- read.csv("/dsa/data/all_datasets/wine quality/winequality-white.csv",sep=";",header=TRUE)
head(wine_quality_data)
str(wine_quality_data)

Let's look at the distribution of the quality variable. 

In [None]:
# distribution of quality variable
table(wine_quality_data$quality)

**Activity 1:** Find the distribution of the quality variable using the count() function in the plyr library.

In [None]:
library(plyr)
freq = count(wine_quality_data,'quality')
freq
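
As a cross-check that needs no extra package, the same kind of frequency table can be built with base R. The sketch below uses a small made-up `quality` vector (an assumption for illustration, not the wine data) so that it runs on its own; `freq_base` is a hypothetical name.

```r
# Toy quality scores, invented for illustration only
quality <- c(5, 6, 6, 7, 6, 5, 8)

# table() counts each distinct value; as.data.frame() reshapes the counts
# into one row per value, similar in layout to the plyr result
freq_base <- as.data.frame(table(quality), responseName = "freq")
freq_base
```

On the real data, replacing the toy vector with `wine_quality_data$quality` should give the same counts as `table()` above.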

As we can see, the value 6 for quality dominates the distribution; let's remove that value and label the rest as 'good' or 'bad' to create a binary variable for quality. If the quality is larger than 6, we'll call it 'good' wine, otherwise 'bad' wine. 

**Activity 2:** Remove all observations where quality is equal to 6, so that the resulting subset contains only quality values smaller or larger than 6.

In [None]:
# Complete the partially complete code and execute it..

wine_quality_subset_data <- subset(wine_quality_data, quality < 6 | quality > 6)

# Now create a new column named 'good' with initially all zeros. 
wine_quality_subset_data$good <- 0

# assign 1 to good if quality is larger than 6
wine_quality_subset_data$good[wine_quality_subset_data$quality > 6] <- 1

# Now remove the 'quality' column; we don't want that in the model any more.
wine_quality_subset_data$quality <- NULL

In [None]:
table(wine_quality_subset_data$good)
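
The filter-then-label steps can also be written more compactly. This is a minimal sketch on a made-up vector of scores (an assumption for illustration), showing that dropping the 6s and comparing against 6 yields the same 0/1 coding as the step-by-step version.

```r
# Toy quality scores, invented for illustration only
quality <- c(3, 5, 6, 7, 8, 6, 4, 9)

# Drop the dominant middle value, then code 'good' as 1 when quality > 6
kept <- quality[quality != 6]
good <- as.integer(kept > 6)

table(good)
```

The comparison `kept > 6` returns TRUE/FALSE, and `as.integer()` turns that into 1/0 in a single step.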

So there are now 1640 'bad' white wines and 1060 'good' white wines in the data set. Let's fit a logistic regression model to predict the variable 'good'. We'll start with the whole data set, i.e. **wine_quality_subset_data**, and later split it into training and testing sets.

**Activity 3:** Fit a logistic regression model to predict the variable 'good' in wine_quality_subset_data. 

In [None]:
# Complete the partially complete code and execute it..

wine_quality_log = glm(good ~ ., data=wine_quality_subset_data, family=binomial)
summary(wine_quality_log)

**Activity 4:** Find the accuracy of the above model, wine_quality_log.

In [None]:
# Complete the partially complete code and execute it..

probs = predict(wine_quality_log, type = "response", newdata=wine_quality_subset_data)
preds <- ifelse(probs > 0.5,1,0)
misClassificError <- mean (preds != wine_quality_subset_data$good)
print(paste('Accuracy',1-misClassificError))
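
To make the role of `type = "response"` and the 0.5 threshold more concrete, here is a minimal sketch on simulated data (the data frame `toy` and its coefficients are assumptions for illustration, not the wine data). It checks that the response-scale predictions are the logistic transform of the linear predictor, and then computes accuracy the same way as above.

```r
set.seed(1)

# Simulated predictor and binary outcome, invented for illustration only
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * x))
toy <- data.frame(x = x, y = y)

fit <- glm(y ~ x, data = toy, family = binomial)

# type = "response" returns probabilities; type = "link" returns log-odds
p_response <- predict(fit, newdata = toy, type = "response")
p_manual   <- plogis(predict(fit, newdata = toy, type = "link"))
all.equal(unname(p_response), unname(p_manual))

# Probabilities above 0.5 are classified as 1, the rest as 0
toy_preds <- ifelse(p_response > 0.5, 1, 0)
toy_accuracy <- mean(toy_preds == toy$y)
toy_accuracy
```

Changing the 0.5 cut-off trades false positives against false negatives, which is what the ROC curve later in this notebook summarizes.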

Find the baseline model accuracy. There are 1640 'bad' wines and 1060 'good' wines, so the baseline model should predict the majority class, 'bad', every time.

In [None]:
table(wine_quality_subset_data$good)

In [None]:
print(paste('baseline accuracy =', 1640/(1640+1060)))
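
Rather than typing the counts by hand, the baseline can be derived from the labels themselves. The sketch below rebuilds a label vector from the counts quoted above so that it is self-contained; on the real data, `wine_quality_subset_data$good` would take the place of the toy `good` vector.

```r
# Labels rebuilt from the counts in the text: 1640 'bad' (0) and 1060 'good' (1)
good <- rep(c(0, 1), times = c(1640, 1060))

# The baseline always predicts the most frequent class,
# so its accuracy is the share of that class
class_counts <- table(good)
baseline_accuracy <- max(class_counts) / sum(class_counts)
print(paste('baseline accuracy =', baseline_accuracy))
```

This works out to roughly 0.607, matching the hand calculation.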

Most of the variables are useful for predicting the quality of the wine, except sulfur dioxide and citric acid. Let's see if we can create a model that generalizes well. Generalization refers to a model's ability to predict the outcome accurately for unseen data. We will now create a training set to fit a model, and then test it on testing data the model hasn't 'seen' yet.

**Activity 5:** Split the data in **wine_quality_subset_data** into training and testing sets. Put 70% of the data into the training set and the rest into the testing set.

In [None]:
# Complete the partially complete code and execute it..

library(caTools)
set.seed(1000)

split = sample.split(wine_quality_subset_data$good, SplitRatio=0.7) # PAY ATTENTION TO THE VARIABLE NAME 

wine_quality_train_data = subset(wine_quality_subset_data, split==TRUE)

wine_quality_test_data  = subset(wine_quality_subset_data, split==FALSE)
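
`sample.split()` keeps the ratio of the label classes roughly the same in both parts. The following base-R sketch illustrates that idea on a toy label vector built from the class counts quoted earlier; it is an approximation of the concept under that assumption, not the caTools implementation, and `in_train` is a hypothetical name.

```r
set.seed(1000)

# Toy labels with the class counts from the text: 1640 'bad' (0), 1060 'good' (1)
good <- rep(c(0, 1), times = c(1640, 1060))

# Within each class, mark about 70% of the rows as training rows
in_train <- logical(length(good))
for (cls in unique(good)) {
  idx <- which(good == cls)
  in_train[sample(idx, size = round(0.7 * length(idx)))] <- TRUE
}

# Class proportions should be nearly identical in both parts
prop.table(table(good[in_train]))
prop.table(table(good[!in_train]))
```

Sampling within each class is what keeps the 'good'/'bad' balance stable, so the baseline accuracy is comparable between the training and testing sets.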

**Activity 6:** Fit a logistic regression model to predict the variable 'good' using wine_quality_train_data. Then find the accuracy of the model in predicting the 'good' variable on wine_quality_test_data.

In [None]:
# Complete the partially complete code and execute it..

# Now fit a model on the training data
wine_quality_train_log =  glm(good ~ ., data=wine_quality_train_data, family=binomial)

# now predict on the test data
probs1 = predict(wine_quality_train_log, type = "response", newdata=wine_quality_test_data)

# Now let's use a threshold of 0.5 to turn probabilities into actual predictions
preds1 <- ifelse(probs1 > 0.5,1,0)

#Now, compare this to the correct values for 'good' and compute the accuracy.
misClassificError1 <- mean (preds1 != wine_quality_test_data$good)
print(paste('Accuracy',1-misClassificError1))

The accuracy of the model on unseen data is about 82%, whereas the baseline model has an accuracy of about 61% (1640/2700).

In [None]:
# rows: actual values of 'good'; columns: predicted values (preds1 is already 0/1)
table(wine_quality_test_data$good, preds1)

Sensitivity = TP/(TP+FN)

Specificity = TN/(TN+FP)



**Activity 7:** Find the sensitivity and specificity using the two-way table results above.

In [None]:
# Your answer for activity 7 goes here..

print(paste('sens =', 237/(237+81)))
print(paste('spec =', 431/(431+61)))
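
The same quantities can be read off a confusion matrix by name instead of being typed in. This sketch rebuilds the matrix from the four counts used above (rows are actual values, columns are predictions) and also recovers the overall accuracy; the object name `cm` is hypothetical.

```r
# Confusion matrix rebuilt from the counts above; matrix() fills column by column
cm <- matrix(c(431, 81, 61, 237), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

TP <- cm["1", "1"]  # good wines predicted as good
FN <- cm["1", "0"]  # good wines predicted as bad
TN <- cm["0", "0"]  # bad wines predicted as bad
FP <- cm["0", "1"]  # bad wines predicted as good

sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
accuracy    <- (TP + TN) / sum(cm)

print(paste('sens =', sensitivity))
print(paste('spec =', specificity))
print(paste('accuracy =', accuracy))
```

With these counts the accuracy comes out at about 0.82, consistent with the test-set accuracy reported earlier.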

**Activity 8:** Can you plot an ROC curve for this model? 

In [None]:
# code for activity 8 
library(ROCR)

ROCR_predictions = prediction(probs1, wine_quality_test_data$good)
perf <- performance(ROCR_predictions,"tpr","fpr")

plot(perf,colorize=TRUE)
abline(0,1)
as.numeric(performance(ROCR_predictions,"auc")@y.values)
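
To see what the AUC value means, it can be computed by hand as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The sketch below uses four made-up scores (an assumption for illustration) and the rank-based formula in base R; it is a cross-check of the concept, not a replacement for ROCR.

```r
# Toy labels and predicted scores, invented for illustration only
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)

n_pos <- sum(labels == 1)
n_neg <- sum(labels == 0)

# Rank-sum (Mann-Whitney) form of the AUC: sum the ranks of the positive
# cases, subtract the smallest possible rank sum, divide by the number of pairs
auc_manual <- (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)
auc_manual
```

Here three of the four positive/negative pairs are ordered correctly, so the result is 0.75. An AUC of 0.5 corresponds to the diagonal reference line drawn by `abline(0,1)`.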


# Save your notebook!