## Wine Quality Prediction with Linear Regression
- In this notebook, we use red wine data to predict wine quality. As a first step, we examine the correlation between the features, then use the most effective ones in a linear regression model to predict quality.
- As performance metrics, we look at the R-squared, the adjusted R-squared, and the accuracy of the model.
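
To make the two regression metrics concrete, the sketch below computes R-squared and adjusted R-squared by hand and compares them with the values reported by `summary()`. The data are synthetic and the names `toy`, `x1`, `x2`, and `y` are hypothetical stand-ins, not columns of the wine dataset.

```r
# Synthetic data (assumption: stands in for the wine features)
set.seed(1)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 0.5 * toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# R-squared: share of the variance in y explained by the model
rss <- sum(residuals(fit)^2)               # residual sum of squares
tss <- sum((toy$y - mean(toy$y))^2)        # total sum of squares
r2 <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors p
p <- length(coef(fit)) - 1                 # exclude the intercept
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r.squared = r2, adj.r.squared = adj_r2))
```

Because the penalty factor `(n - 1) / (n - p - 1)` is greater than 1, the adjusted value is always slightly below the plain R-squared, which is why it is the fairer metric when comparing models with different numbers of features.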

In [169]:
# This R environment comes with all of CRAN preinstalled, as well as many other helpful packages
# The environment is defined by the kaggle/rstats docker image: https://github.com/kaggle/docker-rstats
# For example, here's several helpful packages to load in 

library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(caret)
library(corrgram) # Correlograms http://www.datavis.ca/papers/corrgram.pdf
library(car) # Companion to Applied Regression: regression diagnostics
library(FNN) # nearest neighbors techniques
library(pROC) # to make ROC curve
library(corrplot)
# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

system("ls ../input")
# Any results you write to the current directory are saved as output.

## Load Data

In [170]:
wd <- read.csv("../input/winequality-red.csv", sep=",")
head(wd, 10) # show only the first 10 rows

fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.880,0.00,2.6,0.098,25,67,0.9968,3.20,0.68,9.8,5
7.8,0.760,0.04,2.3,0.092,15,54,0.9970,3.26,0.65,9.8,5
11.2,0.280,0.56,1.9,0.075,17,60,0.9980,3.16,0.58,9.8,6
7.4,0.700,0.00,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.660,0.00,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5
7.9,0.600,0.06,1.6,0.069,15,59,0.9964,3.30,0.46,9.4,5
7.3,0.650,0.00,1.2,0.065,15,21,0.9946,3.39,0.47,10.0,7
7.8,0.580,0.02,2.0,0.073,9,18,0.9968,3.36,0.57,9.5,7
7.5,0.500,0.36,6.1,0.071,17,102,0.9978,3.35,0.80,10.5,5


In [171]:
summary(wd$quality)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   5.000   6.000   5.636   6.000   8.000 

In [172]:
print("Quality type with data point count.")
table(wd$quality)


[1] "Quality type with data point count."



  3   4   5   6   7   8 
 10  53 681 638 199  18 

## Correlation between features

In [None]:
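M <- cor(wd) # Pairwise Pearson correlations between all columns
# Diverging palette: red for negative, blue for positive correlations
ccol <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
corrplot(M, method = "color", col = ccol(200),
         type = "upper", order = "hclust", number.cex = .7,
         addCoef.col = "black", # Add coefficient of correlation
         tl.col = "black", tl.srt = 90, # Text label color and rotation
         diag = FALSE) # Hide the diagonal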

#### Observation: Based on the correlation matrix, we select alcohol, volatile.acidity, citric.acid, and sulphates for the linear regression model, and compare the R-squared and adjusted R-squared across different combinations of these features.
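
Reading candidates off a correlation plot can also be done numerically. The sketch below ranks features by the absolute value of their correlation with the target. It runs on synthetic data, and the names `toy`, `strong`, `weak`, `noise`, and `target` are hypothetical; on the wine data the same idea would apply to the `quality` column of `cor(wd)`.

```r
# Synthetic data (assumption: three features with different strength of signal)
set.seed(42)
n <- 200
toy <- data.frame(
  strong = rnorm(n),
  weak   = rnorm(n),
  noise  = rnorm(n)
)
toy$target <- 3 * toy$strong + 0.5 * toy$weak + rnorm(n)

M <- cor(toy)

# Take the target's column and drop its correlation with itself
target_cor <- M[, "target"]
target_cor <- target_cor[names(target_cor) != "target"]

# Rank features by the strength of the linear relationship, ignoring the sign
ranked <- sort(abs(target_cor), decreasing = TRUE)
print(ranked)
```

Sorting on the absolute value matters because a strongly negative correlation (such as volatile.acidity against quality) is as useful to a linear model as a strongly positive one.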

In [None]:
linear_quality_1 = lm(quality ~ alcohol, data = wd)
summary(linear_quality_1)


In [None]:
linear_quality_4 = lm(quality ~ alcohol + volatile.acidity + citric.acid + sulphates, data = wd)
summary(linear_quality_4)


#### Observation: The coefficients table indicates that alcohol, volatile.acidity, and sulphates are highly significant, while citric.acid is not significant, so we remove citric.acid from the model.
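
The significance of each term can be extracted programmatically rather than read from the printed summary. This is a minimal sketch on synthetic data; `toy`, `useful`, and `useless` are hypothetical names, and the 0.05 threshold is a conventional choice, not one stated in this notebook.

```r
# Synthetic data (assumption: y depends on `useful` only)
set.seed(7)
n <- 300
toy <- data.frame(useful = rnorm(n), useless = rnorm(n))
toy$y <- 1 + 2 * toy$useful + rnorm(n)

fit <- lm(y ~ useful + useless, data = toy)

# The coefficients table holds estimate, std. error, t value and p-value per term
coefs <- summary(fit)$coefficients
p_values <- coefs[, "Pr(>|t|)"]

# Terms (including the intercept) whose p-value falls below the chosen threshold
significant <- names(p_values)[p_values < 0.05]
print(round(p_values, 4))
print(significant)
```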

In [None]:
linear_quality_5 = lm(quality ~ alcohol + volatile.acidity + sulphates, data = wd)
summary(linear_quality_5)

#### This model does not improve much on the previous one, but it removes one unnecessary parameter. Every remaining parameter is now highly significant.
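
One way to back up the decision to drop a term is to compare the nested models directly. The sketch below uses `anova()` for an F-test and also prints the adjusted R-squared of both fits. The data are synthetic, and `toy`, `a`, `b`, and `extra` are hypothetical names chosen for illustration.

```r
# Synthetic data (assumption: `extra` has no real effect on y)
set.seed(11)
n <- 250
toy <- data.frame(a = rnorm(n), b = rnorm(n), extra = rnorm(n))
toy$y <- 1 + 1.2 * toy$a - 0.8 * toy$b + rnorm(n)

full    <- lm(y ~ a + b + extra, data = toy)
reduced <- lm(y ~ a + b, data = toy)

# F-test: does adding `extra` reduce the residual sum of squares significantly?
comparison <- anova(reduced, full)
print(comparison)

# Adjusted R-squared of both models side by side
adj <- c(reduced = summary(reduced)$adj.r.squared,
         full    = summary(full)$adj.r.squared)
print(adj)
```

A large p-value in the second row of the `anova()` table would support keeping the smaller model. Plain R-squared can never decrease when a term is added, so it is not a useful criterion for this comparison on its own.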

In [None]:
plot(linear_quality_5)

#### Observation: The diagnostic plots above suggest that the data are suitable for a linear regression model. The residuals appear homoscedastic, and no point falls beyond the Cook's distance lines.
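
The Cook's distance check can also be done numerically instead of by eye. The sketch below computes the distances with `cooks.distance()` and flags observations above 0.5, which is the lower of the two contour levels that `plot()` draws by default for an `lm` fit. The data are synthetic and the names `toy`, `x`, and `y` are hypothetical.

```r
# Synthetic, well-behaved data (assumption: stands in for the wine model)
set.seed(3)
n <- 100
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n)

fit <- lm(y ~ x, data = toy)

# Cook's distance measures how much each observation influences the fit
cd <- cooks.distance(fit)

# Observations beyond the 0.5 contour would deserve a closer look
influential <- which(cd > 0.5)
print(max(cd))
print(influential)
```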

## Apply Linear Regression Model 
### Split data into Training and Test Data

In [None]:
set.seed(2018)
train.size <- 0.8 
train.index <- sample.int(length(wd$quality), round(length(wd$quality) * train.size))
train.sample <- wd[train.index,]
test.sample <- wd[-train.index,]



In [None]:
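# Fit on the training data; the name `model` avoids shadowing the lm() function
model <- lm(quality ~ alcohol + volatile.acidity + sulphates, data = train.sample)
# Round the continuous predictions to the nearest integer quality score
results <- round(predict(model, newdata = test.sample))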


In [None]:
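lv <- sort(unique(c(results, test.sample$quality))) # confusionMatrix() needs factors with shared levels
confusionMatrix(factor(results, levels = lv), factor(test.sample$quality, levels = lv))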

## The linear regression model reaches an accuracy of 60%.
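
Accuracy here means the share of test wines whose rounded prediction equals the true quality score. The sketch below shows that calculation on a small hand-made example, alongside the root mean squared error, which measures the regression error before rounding. The vectors `actual` and `predicted` are made-up values for illustration, not results from the wine model.

```r
# Hand-made example values (assumption: not taken from the wine data)
actual    <- c(5, 6, 5, 7, 6, 5, 4, 6)
predicted <- c(5.2, 5.6, 5.4, 6.4, 6.1, 5.7, 5.1, 6.6)

# Round continuous predictions to the nearest integer score
rounded <- round(predicted)

# Accuracy: fraction of exact matches after rounding
accuracy <- mean(rounded == actual)
print(accuracy)  # 0.5

# RMSE: typical size of the prediction error before rounding
rmse <- sqrt(mean((predicted - actual)^2))
print(rmse)
```

Accuracy treats a prediction that is off by one the same as one that is off by three, so reporting the RMSE next to it gives a fuller picture of a regression model used as a classifier.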