**Predicting White Wine Quality **  
This data anlysis is the part of final project for the "Data Analysis with R" course on Udacity.   
*(Course: https://classroom.udacity.com/courses/ud651)   *  
I am using White Wine Quality dataset which is a tidy data set. This data set contains **4,898** white wines with **11 **variables on quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

**Objective:** To identify which chemical properties influence the quality of white wines?

**Laoding the dataset and libraries**  
We will load the dataset and additional libraries and take a glimpse at our dataset. 

In [2]:
library(ggplot2)
library(dplyr)
library(gridExtra)
library(RColorBrewer)
library(randomForest)
library(party)
theme_set(theme_classic())

wine <- read.csv ("../input/wineQualityWhites.csv")
glimpse(wine)

**Variables in the data set**   
**Input variables** (based on physicochemical tests):  
   1 - fixed acidity (tartaric acid - g / dm^3)  
   2 - volatile acidity (acetic acid - g / dm^3)  
   3 - citric acid (g / dm^3)  
   4 - residual sugar (g / dm^3)  
   5 - chlorides (sodium chloride - g / dm^3  
   6 - free sulfur dioxide (mg / dm^3)  
   7 - total sulfur dioxide (mg / dm^3)  
   8 - density (g / cm^3)  
   9 - pH  
   10 - sulphates (potassium sulphate - g / dm3)  
   11 - alcohol (% by volume)  
  ** Output variable** (based on sensory data):   
   12 - quality (score between 0 and 10)  
   
 X is the row number. Quality is discrete variable while all the input variables are continuous. We will first remove "X" from our data frame and then we will see the summary of each variable in the dataframe.

In [3]:
wine$X <- NULL  # removing the first column 'X'
str(wine)
summary(wine)

**Wine Quality Distribution**  
The most important parameter is wine quality and this is also the variable we would like to predict based on other attributes. Let's have a look at the wine quality summary using the table and histogram.

In [4]:
print(" Number of wines for a partcular rating of wine:")
table(wine$quality)
theme_set(theme_minimal())
ggplot(wine,aes(quality)) + geom_histogram(stat="count") +
   xlab("Quality of white wines") + ylab("Number of white wines")

**Classifying the Quality of White Wines**   
Quality of the white wines is normally distributed. Most of the wines are rated **5-7**, while very few are rated **"very poor"** (rating **< 4**) or **"very good"** (rating **> 7**). We will try to make prediction on quality of wine using our 11 input variables. Looking at the dataset we can say that wines rated **8** and **9** are few and can be regarded as the best quality wines. We will identify how physiochemical properties for these wines are different from rest of the wines that make a wine taste better according to the experts.

**Subsetting: ** First we will create a subset of wines rated **8 or above** effectively making quality having two factors.  
Quality rating **> 7 **: wine is **"Good"**   
Quality rating **<=7** : wine is **"Not good"**   

In [4]:
wine$rating   <- ifelse (as.integer(wine$quality) > 7, 1, 0)
glimpse(wine)
table(wine$rating)

**Visualizing the attributes**  
Before making prediction, we will try to visualize the rest of the variables and see if we can find any pattern that could explain the quality of white wines. For our exploration, boxplot is the most useful and suitable type of plot. 

**Correlation: **   
The dataset has 11 physio-chemical attributes and there are chances that some of them are correlated. To check this and also to check how strongly quality is correlated with other variables we will plot correlation matrix.

In [None]:
library(corrplot)
M <- cor(wine)
corrplot(M, method = "number")

**Correlation among variables**    
* Quality is moderately correlated with alcohol  
* Alcohol is moderately correlated with the density of wine apart from being moderaltely correlated with quality  
* Density is strongly correlated with residual sugar quantity and moderately correlated with pH  
* Free sulfur dioxide and total sulfur dioxide are strongly correlated  

Now let's see the variables visually with the help of boxplot first.

In [7]:
# Boxplots of variables
p1 <- ggplot(wine, aes(as.factor(quality),fixed.acidity))+ geom_boxplot() + coord_cartesian(ylim = c(4.5,10.5))
p2 <- ggplot(wine, aes(as.factor(quality),volatile.acidity))+ geom_boxplot()+ coord_cartesian(ylim = c(0,0.6))
p3 <- ggplot(wine, aes(as.factor(quality),citric.acid))+ geom_boxplot()+ coord_cartesian(ylim = c(0,0.5))
p4 <- ggplot(wine, aes(as.factor(quality),residual.sugar))+ geom_boxplot()+ coord_cartesian(ylim = c(0,20))
p5 <- ggplot(wine, aes(as.factor(quality),chlorides))+ geom_boxplot()+ coord_cartesian(ylim = c(0,0.07))
p6 <- ggplot(wine, aes(as.factor(quality),free.sulfur.dioxide))+ geom_boxplot()+ coord_cartesian(ylim = c(0,70))
p7 <- ggplot(wine, aes(as.factor(quality),total.sulfur.dioxide))+ geom_boxplot()+ coord_cartesian(ylim = c(0,220))
p8 <- ggplot(wine, aes(as.factor(quality),density))+ geom_boxplot()+ coord_cartesian(ylim = c(0.98,1.0))
p9 <- ggplot(wine, aes(as.factor(quality),pH))+ geom_boxplot()+ coord_cartesian(ylim = c(2.8,3.6))
p10 <- ggplot(wine, aes(as.factor(quality),sulphates))+ geom_boxplot()+ coord_cartesian(ylim = c(0.3,0.8))
p11 <- ggplot(wine, aes(as.factor(quality),alcohol))+ geom_boxplot()+ coord_cartesian(ylim = c(8,13))
grid.arrange(p1,p2, nrow=1)

**Acidity:** It is diificult to find a certain pattern, but usually better quality wines have lower volatile.acidity. Fixed acidity quantity roughly follow the same pattern except the wines rated 9.

In [8]:
grid.arrange(p3,p4,p5, nrow=1)

Chlorides are clearly lower for highly rated wines and residual sugar is also mostly on the lower side for better wines.

In [9]:
grid.arrange(p6,p7, nrow=1)

**Sulfur dioxide:** Corplot showed that free sulfur dioxide and total sulfur dioxide are moderately correlated and their distribution agianst the quality of the wines are similar. Except the wines rated 4, usually the better wines have low sulfur dioxide.

In [10]:
grid.arrange(p8,p9, nrow=1)

**Density,pH and acidity:** Better wines on an average have lower density and higher pH. As we can see from the correlation matrix pH and acidity are moderately negatively correlated. We saw earlier that better wines had usually lower acidity, so one would expect pH for highly rated wines to be higher.

In [11]:
grid.arrange(p10,p11, nrow=1)

Sulphates don't have clear differentiating pattern but it is clear that better wines have on an average higher alcohol content.

**Linear Regression Model**     
Now let's create a linear regression model and find out which variables are significant. We will improve our model by removing the insignificant variables and checking for improvement in R-Squared value. So, first we will create a model to predict the quality of the wines. For this model we are going to use the first wine dataset that didn't have the derived variable 'rating'. 
First we will have to split our dataset into training and testing sets. Training set wiill be used to craete our model and we will predict over testing set.

In [12]:
# Splitting the data into Training and Testing sets
library(caTools)
set.seed(144)
spl = sample.split(wine$quality, 0.7)
train = subset(wine, spl == TRUE)
test=subset(wine,spl==FALSE)

In [13]:
model1 <- lm(quality ~ .-rating, data = train)
summary(model1)

Adjusted R-squared for this model is **0.2831. **Citric acid, total sulfur dioxide and chlorides are not significant. One peculiar observation is that from our box-plot we saw that chlorides were lower for better wines but according to this model it is not significant. One more peculiar observation is that chloride is moderately correlated with alcohol and alcohol was usually higher for better wines. We will remove the insignificant variables and see if that improves R-squared of the model.

In [14]:
model2 <- lm(quality ~ .-rating-chlorides-total.sulfur.dioxide-citric.acid, data = train)
summary(model2)

Adjusted R-squared for this model is **0.2837**. This model didn't improve on the previous one significantly. One more observation is that now all variables in the model are significant and same level of significance. Let's remove fixed acidity and see if that improves our model as it is the lowest significant variable based on the p-value.

In [15]:
model3 <- lm(quality ~ .-rating-chlorides-total.sulfur.dioxide-citric.acid-fixed.acidity, data = train)
summary(model3)

Removing fixed acidity didn't improve our model. We can see one problem in our model is that predictors are continuous while quality is integer in our dataset. Prediction made by our model will not be integer, so it is difficult to explain the variation in quality using linear model. We can check the prediction made by our best model (model2) over testing set.

In [16]:
predictTest <- predict(model2, newdata = test)
summary(predictTest)

We can see the limitations of our linear regression model. First it is not predicting integer values and maximum value is only **7.2**, so it is not predicting any wine to be of high quality.

We will now change our approach to predict the quality. Instead of predicting a single value, we will predict for the classification of the variable 'rating'. 

**Logistics Regression Model**    
We will make a logistics regression model to predict the quality of wine using input variables. First we will split our dataset into training and testing sets and then create a logistic regression model using the training set.

In [17]:
# Splitting the data into Training and Testing sets
library(caTools)
set.seed(144)
spl = sample.split(wine$rating, 0.7)
train1 = subset(wine, spl == TRUE)
test1=subset(wine,spl==FALSE)

In [18]:
# Creating the logistics regression model
mod = glm(rating ~.-quality ,data=train1,family = "binomial")
summary(mod)

**Significant variables**  
* fixed.acidity    
* volatile.acidity  
* residual.sugar  
* free.sulfur.dioxide  
* density  
* alcohol   
* pH   

**Predicting the quality of wine**   
We will use our logistic model to predict over testing set.

In [19]:
prediction = predict(mod, newdata=test1, type="response")
table(test$rating, prediction > 0.5)

Our model predicts that no wine is in the best rated wine group, but actually **54** of the wines are rated 8 or 9 in the test set. Our prediction is same as the baseline prediction. So, there is not much benefit from this model. Let's check what is the maximum value of prediction over test set. 

In [20]:
print("The maximum value of prediction over testing set is ")
round(max(prediction),3)

As the maximum value is less than **0.5**, our model will always predict that wine quality will be less than 8 on the rating used by the experts.

**Random Forest Model**   
We can remove the insignificant variables and keep modifying the regression model until we get the model with the best accuracy. Instead of repeating this over and over we will use Random Forest model as it is more efficient.

In [21]:
library(randomForest)
set.seed(144)
fit <- randomForest(rating ~ .-quality,data=train1, importance=TRUE, ntree=2000)
varImpPlot(fit)

**Important Variables: ** The first plot represents % increase in mean squared error accuracy without using a variable. This plot tests to see how worse the model performs without each variable, so a high decrease in accuracy would be expected for very predictive variables.  The second tests to see the result if each variable is taken out and a high score means the variable was important. For our case, we can see from both these plots that alcohol and residual sugar are the most important variables. 
Now we will visualize the tree for our model using additional libraries.
Now let's make prediction over test set.


In [22]:
Prediction <- predict(fit, test1)
table(test1$rating,Prediction > 0.5)

Unlike the first logistics model, Random Forest model has predicted **16** wines to be the best ones. Accuracy of the logistics model was 1415/(1415+54)= **0.96.**
Accuracy of the Random Forest model is (1415+16/(1415+16+38+0**)= 0.974**.

**Conditional Inference Trees**  
To make even better prediction we will try using a forest of conditional inference trees.

In [32]:
library(party)
set.seed(144)
fit2 <- cforest(rating ~ .-quality, data = train1,controls=cforest_unbiased(ntree=2000, mtry=3))
Prediction2 <- predict(fit2, newdata=test1, type = "response")
table(test1$rating, Prediction2 > 0.5)

Accuracy of the conditional inference trees model is (1415+52)/(1415+52+2+0) = **0.998.** This model performs the best prediction on test set. So, we are going to use variables used by this model to predict wine quality.

**Quality Rating Adjustment**   
Random Forest and Conditional inference Tree model predicted with very astonishing accuracy. One thing we missed to notice that less than **4%** of the wines were rated good in this case. Let's see if we change the basis of our rating and check how these two model perform then. This time we will rate wine quality above **6 (i.e. 7-9**) to be good wines and quality 6 or less than 6 to be not good.

In [33]:
wine$rating2   <- ifelse (as.integer(wine$quality) > 6, 1, 0)
glimpse(wine)
table(wine$rating2)

This time around **22%** of the wines are rated good. We will make our training and testing set again by splitting the data.

In [39]:
library(caTools)
set.seed(144)
spl = sample.split(wine$rating2, 0.7)
train2 = subset(wine, spl == TRUE)
test2 =subset(wine,spl==FALSE)

In [35]:
# random Forest Model
library(randomForest)
set.seed(144)
fit3 <- randomForest(rating2 ~ .-quality-rating,data=train2, importance=TRUE, ntree=2000)
varImpPlot(fit3)

In [36]:
Prediction3 <- predict(fit, test2)
table(test2$rating2,Prediction3 > 0.5)

Accuracy of the Random Forest model is (1151+45)/(1151+45+273+0) = **0.814.**  Let's now make a conditional inference tree to predict wine rating.

In [38]:
library(party)
set.seed(144)
fit4 <- cforest(rating2 ~ .-quality-rating, data = train2,controls=cforest_unbiased(ntree=2000, mtry=3))
Prediction4 <- predict(fit4, newdata=test2, type = "response")
table(test2$rating2, Prediction4 > 0.5)

Accuracy of the conditional inference tree model is 100 %.  