# Quality prediction of white wine and red wine based on physicochemical tests


## 1. Introduction


Wine is an alcoholic beverage made from grapes. White wine and red wine are two widely known variants. Red wine is made from dark-colored grape varieties, while white wine is made from non-colored grape pulp (source: Wikipedia, https://en.wikipedia.org/wiki/Wine). Physicochemical properties such as pH and acidity influence the taste and quality of wine. We focus on predicting the quality of white wine and red wine from the results of physicochemical tests.

Two separate datasets are used in this study, one for the red and one for the white variant of the Portuguese "Vinho Verde" wine. Each dataset contains the same 11 physicochemical parameters (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol) and a quality index (0-10).


## 2. Finding/Conclusion

 - Best fit model


## 3. Data 

In [6]:
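# Install e1071 and GGally only if they are not already available
for (pkg in c("e1071", "GGally")) {
  if (!requireNamespace(pkg, quietly = TRUE)) install.packages(pkg)
}
# Load the packages used in this notebook
library(tidyverse)
library(testthat)
library(digest)
library(repr)
library(caret)
library(e1071)
library(GGally)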


Attaching package: ‘GGally’

The following object is masked from ‘package:dplyr’:

    nasa



In [3]:
# Import red wine data (the file is semicolon-delimited)
red_data <- read.csv("winequality-red.csv", sep = ";")
head(red_data)
# Import white wine data (same layout as the red wine file)
white_data <- read.csv("winequality-white.csv", sep = ";")
head(white_data)

fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.8,0.88,0.0,2.6,0.098,25,67,0.9968,3.2,0.68,9.8,5
7.8,0.76,0.04,2.3,0.092,15,54,0.997,3.26,0.65,9.8,5
11.2,0.28,0.56,1.9,0.075,17,60,0.998,3.16,0.58,9.8,6
7.4,0.7,0.0,1.9,0.076,11,34,0.9978,3.51,0.56,9.4,5
7.4,0.66,0.0,1.8,0.075,13,40,0.9978,3.51,0.56,9.4,5


fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
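
The `sep = ";"` argument matters because the wine files separate fields with semicolons rather than commas. The following is a minimal sketch on an inline sample; the three-column layout and the names `sample_text`, `wrong`, and `right` are assumptions made for illustration, with values taken from the first red wine rows above.

```r
# Simplified inline sample in a semicolon-delimited layout
# (three columns only; hypothetical stand-in for the real files)
sample_text <- "fixed.acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5"

# Default separator (comma): each line is read as one single field
wrong <- read.csv(text = sample_text)
ncol(wrong)

# Semicolon separator: the fields are split into separate columns
right <- read.csv(text = sample_text, sep = ";")
ncol(right)
str(right)
```

Checking `ncol()` or `str()` right after import is a quick way to catch a wrong delimiter before any modelling starts.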


In [14]:
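# Scale the predictors. With center = FALSE, scale() does not subtract the mean;
# it divides each column by its root mean square.
red_scaled <- red_data %>%
  select(-quality) %>%
  scale(center = FALSE)
# Re-attach the unscaled quality column as the first column
red_scaled <- data.frame(quality = red_data$quality, red_scaled)
head(red_scaled)
# Same steps for the white wine data
white_scaled <- white_data %>%
  select(-quality) %>%
  scale(center = FALSE)
white_scaled <- data.frame(quality = white_data$quality, white_scaled)
head(white_scaled)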

quality,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol
5,0.8703407,1.255555,0.0,0.6541052,0.7649761,0.5784783,0.5970695,1.0007419,1.0585855,0.82374,0.8968984
5,0.9173862,1.578412,0.0,0.8950914,0.9864165,1.3147233,1.1765782,0.9997389,0.9650922,1.0002557,0.9350642
5,0.9173862,1.363174,0.1198329,0.7918116,0.9260236,0.788834,0.9482869,0.9999395,0.9831877,0.9561268,0.9350642
6,1.3172725,0.502222,1.6776605,0.6541052,0.7549106,0.8940119,1.0536521,1.0009425,0.9530286,0.8531593,0.9350642
5,0.8703407,1.255555,0.0,0.6541052,0.7649761,0.5784783,0.5970695,1.0007419,1.0585855,0.82374,0.8968984
5,0.8703407,1.183809,0.0,0.6196787,0.7549106,0.6836561,0.7024347,1.0007419,1.0585855,0.82374,0.8968984


quality,fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol
6,1.0134309,0.9122795,1.0127692,2.5367895,0.8871633,1.1481392,1.1744078,1.0069072,0.9398009,0.8946063,0.8311998
6,0.9120878,1.0136438,0.9565042,0.1960804,0.9660223,0.3571989,0.9118931,0.9998658,1.033781,0.9741268,0.8973179
6,1.1726843,0.9460676,1.1252991,0.8455965,0.985737,0.7654261,0.6701032,1.0009723,1.0212503,0.8747261,0.9539906
6,1.0423861,0.7771269,0.9002393,1.0416769,1.143455,1.1991676,1.2849402,1.0014753,0.9993216,0.7952056,0.9350997
6,1.0423861,0.7771269,0.9002393,1.0416769,1.143455,1.1991676,1.2849402,1.0014753,0.9993216,0.7952056,0.9350997
6,1.1726843,0.9460676,1.1252991,0.8455965,0.985737,0.7654261,0.6701032,1.0009723,1.0212503,0.8747261,0.9539906
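
To illustrate what `scale(center = FALSE)` computes: according to the R documentation, an uncentered column is divided by its root mean square, defined as `sqrt(sum(x^2) / (n - 1))`. The sketch below checks this on a short vector. The vector holds only the first four red wine `fixed.acidity` values, so its scaled numbers differ from the output above, which is based on all rows; the names `x`, `rms`, `manual`, and `builtin` are illustrative.

```r
# First four fixed.acidity values from the red wine head() output
x <- c(7.4, 7.8, 7.8, 11.2)

# Root mean square as used by scale() when center = FALSE
rms <- sqrt(sum(x^2) / (length(x) - 1))

# Manual scaling versus the built-in function
manual  <- x / rms
builtin <- as.numeric(scale(x, center = FALSE))

# Both approaches should agree
all.equal(manual, builtin)
```

Because the columns are not centered, all scaled values stay positive. Standardizing to mean 0 and standard deviation 1 would instead use `scale()` with its default arguments.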


In [9]:
# See the distribution of quality levels and the total number of rows
red_data %>%
    group_by(quality) %>%
    summarize(n = n())
nrow(red_data)
white_data %>%
    group_by(quality) %>%
    summarize(n = n())
nrow(white_data)


quality,n
3,10
4,53
5,681
6,638
7,199
8,18


quality,n
3,20
4,163
5,1457
6,2198
7,880
8,175
9,5


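From the tables above we can see that every observation has a quality level: no group with an unknown quality appears. The counts sum to 1,599 red and 4,898 white wines. In both datasets most wines are rated 5 or 6, while the lowest and highest quality levels are rare, so the quality levels are far from evenly distributed.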
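
The totals and the share of each quality level can be recomputed directly from the tables above. This is a small sketch that hard-codes the printed counts; the names `red_counts`, `white_counts`, `red_share`, and `white_share` are illustrative. On the full data frames, `sum(is.na(red_data$quality))` would be a more direct check for missing labels.

```r
# Counts copied from the two tables above (names are the quality levels)
red_counts   <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
white_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                  "7" = 880, "8" = 175, "9" = 5)

# Totals, to compare with nrow(red_data) and nrow(white_data)
sum(red_counts)
sum(white_counts)

# Percentage of wines at each quality level
red_share   <- round(100 * red_counts / sum(red_counts), 1)
white_share <- round(100 * white_counts / sum(white_counts), 1)
red_share
white_share
```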

## 4. KNN Regression

 - Red Wine

 - White Wine

## 5. Linear Regression

 - Red Wine

 - White Wine
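
The modelling code for this section is not shown, so the following is only a rough sketch of a linear regression workflow on made-up data, not the analysis itself. The data frame `toy`, the split sizes, and the single `alcohol` predictor are assumptions for illustration. On the real data the formula would more likely be `quality ~ .` on `red_scaled` or `white_scaled`.

```r
set.seed(1)  # reproducible made-up data

# Hypothetical toy data: quality loosely increasing with alcohol
toy <- data.frame(alcohol = runif(100, min = 8, max = 14))
toy$quality <- 2 + 0.35 * toy$alcohol + rnorm(100, sd = 0.5)

# Hold out 25 rows for testing
test_idx  <- sample(nrow(toy), 25)
train_set <- toy[-test_idx, ]
test_set  <- toy[test_idx, ]

# Fit on the training rows, then predict the held-out rows
fit  <- lm(quality ~ alcohol, data = train_set)
pred <- predict(fit, newdata = test_set)

# Root mean squared prediction error on the held-out rows
rmse <- sqrt(mean((test_set$quality - pred)^2))
coef(fit)
rmse
```

Reporting the error on held-out rows rather than on the training rows gives a fairer basis for comparing this model with other approaches.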