### Multivariate Regression

In this exercise, we use the same white wine quality data set to fit a multivariate regression model, that is, a model with more than one response variable. Let's read the data from 'wine quality/winequality-white.csv'.

In [1]:
wq <- read.csv("/dsa/data/all_datasets/wine quality/winequality-white.csv", sep = ";", header = TRUE)
head(wq)
str(wq)

fixed.acidity,volatile.acidity,citric.acid,residual.sugar,chlorides,free.sulfur.dioxide,total.sulfur.dioxide,density,pH,sulphates,alcohol,quality
7.0,0.27,0.36,20.7,0.045,45,170,1.001,3.0,0.45,8.8,6
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6


'data.frame':	4898 obs. of  12 variables:
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
 $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...


Let's look at the pairwise correlations among the last five columns to see whether any variables are strongly correlated.

In [2]:
cor(wq[,8:12])
# alcohol and density are strongly negatively correlated (r = -0.78)

,density,pH,sulphates,alcohol,quality
density,1.0,-0.09359149,0.07449315,-0.78013762,-0.30712331
pH,-0.09359149,1.0,0.1559515,0.1214321,0.09942725
sulphates,0.07449315,0.1559515,1.0,-0.01743277,0.05367788
alcohol,-0.78013762,0.1214321,-0.01743277,1.0,0.43557472
quality,-0.30712331,0.09942725,0.05367788,0.43557472,1.0
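
With more columns, scanning a correlation matrix by eye gets tedious. The sketch below ranks variable pairs by absolute correlation. It runs on a small simulated data frame (`demo`, with made-up columns `a`, `b`, `c`) rather than the wine data, so the numbers it prints are illustrative only; the same steps should apply to `cor(wq[, 8:12])`.

```r
# Simulated stand-in data (an assumption for illustration, not the wine data)
set.seed(1)
x <- rnorm(200)
demo <- data.frame(
  a = x,
  b = -0.8 * x + rnorm(200, sd = 0.5),  # built to correlate negatively with a
  c = rnorm(200)                        # independent noise
)

cm <- cor(demo)

# Keep each pair once: blank out the diagonal and the lower triangle
cm[lower.tri(cm, diag = TRUE)] <- NA

# Reshape the matrix into one row per variable pair
pairs <- as.data.frame(as.table(cm), stringsAsFactors = FALSE)
pairs <- pairs[!is.na(pairs$Freq), ]
names(pairs) <- c("var1", "var2", "r")

# Sort by absolute correlation, strongest pair first
pairs <- pairs[order(-abs(pairs$r)), ]
print(pairs)
```

Sorting on `abs(r)` matters here because a strong negative correlation, such as the one between alcohol and density, is just as informative as a strong positive one.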


Let's remove the quality column, because it reflects the tasters' subjective opinion, and then try to predict alcohol and density from the remaining variables.

In [3]:
wq$quality <- NULL
str(wq)

'data.frame':	4898 obs. of  11 variables:
 $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
 $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
 $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
 $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
 $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
 $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
 $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
 $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
 $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
 $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
 $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...


In [4]:
# Bind alcohol and density into a two-column response matrix and fit the model;
# exclude both from the predictors so they do not appear on both sides.
ad <- cbind(wq$alcohol, wq$density)
wq_mreg <- lm(ad ~ . - alcohol - density, data = wq)
summary(wq_mreg)

Response Y1 :

Call:
lm(formula = Y1 ~ (fixed.acidity + volatile.acidity + citric.acid + 
    residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
    density + pH + sulphates + alcohol) - alcohol - density, 
    data = wq)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2463 -0.6489 -0.0950  0.6104  5.2106 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.202e+01  4.073e-01  29.505  < 2e-16 ***
fixed.acidity        -8.060e-02  1.889e-02  -4.267 2.02e-05 ***
volatile.acidity      1.910e+00  1.419e-01  13.457  < 2e-16 ***
citric.acid           6.314e-01  1.217e-01   5.188 2.21e-07 ***
residual.sugar       -7.842e-02  3.039e-03 -25.803  < 2e-16 ***
chlorides            -1.628e+01  6.481e-01 -25.117  < 2e-16 ***
free.sulfur.dioxide   6.299e-03  1.061e-03   5.934 3.15e-09 ***
total.sulfur.dioxide -9.668e-03  4.529e-04 -21.346  < 2e-16 ***
pH                    1.836e-01  1.048e-01   1.753   0.0796 .  
sulph
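
The output is labelled `Response Y1` because the matrix built with `cbind(wq$alcohol, wq$density)` has no column names, so R falls back to `Y1`, `Y2`. The sketch below shows two things on R's built-in `mtcars` data, which is a stand-in chosen here so the example runs without the wine file: naming the response columns, and the fact that a multi-response `lm` fit yields the same coefficients as fitting each response separately. The variable choices (`mpg`, `disp`, `wt`, `hp`) are arbitrary and only for illustration.

```r
# mtcars ships with R; it stands in for the wine data here (an assumption)
y <- cbind(mpg = mtcars$mpg, disp = mtcars$disp)  # named response columns

# One call fits both responses on the same predictors
fit_multi <- lm(y ~ wt + hp, data = mtcars)

# Separate single-response fits for comparison
fit_mpg  <- lm(mpg  ~ wt + hp, data = mtcars)
fit_disp <- lm(disp ~ wt + hp, data = mtcars)

print(class(fit_multi))  # a multi-response fit has class "mlm" as well as "lm"
print(coef(fit_multi))   # one column of coefficients per named response

# The multi-response coefficients match the separate fits column by column
print(all.equal(coef(fit_multi)[, "mpg"],  coef(fit_mpg)))
print(all.equal(coef(fit_multi)[, "disp"], coef(fit_disp)))
```

Applied to the wine data, the analogous change would be `cbind(alcohol = wq$alcohol, density = wq$density)`, which should label the summary sections by response name instead of `Y1` and `Y2`.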

In [5]:
# Test each predictor against both responses jointly with MANOVA (Pillai's trace by default)
summary(manova(wq_mreg))

                       Df  Pillai approx F num Df den Df    Pr(>F)    
fixed.acidity           1 0.55775   3081.7      2   4887 < 2.2e-16 ***
volatile.acidity        1 0.10708    293.0      2   4887 < 2.2e-16 ***
citric.acid             1 0.11237    309.3      2   4887 < 2.2e-16 ***
residual.sugar          1 0.91492  26277.9      2   4887 < 2.2e-16 ***
chlorides               1 0.16899    496.9      2   4887 < 2.2e-16 ***
free.sulfur.dioxide     1 0.01352     33.5      2   4887 3.599e-15 ***
total.sulfur.dioxide    1 0.19437    589.5      2   4887 < 2.2e-16 ***
pH                      1 0.42298   1791.2      2   4887 < 2.2e-16 ***
sulphates               1 0.07230    190.4      2   4887 < 2.2e-16 ***
Residuals            4888                                             
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
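
Two details of this table are easy to miss: `summary()` on a `manova` object reports Pillai's trace unless another statistic is requested through its `test` argument, and the tests are sequential, so each term is assessed after the terms listed before it. The sketch below illustrates both on the built-in `mtcars` data, used here as a stand-in so the example runs without the wine file; the choice of `mpg`, `disp`, `wt`, and `hp` is arbitrary.

```r
# mtcars ships with R; it stands in for the wine data here (an assumption)
y <- cbind(mpg = mtcars$mpg, disp = mtcars$disp)

# Same model, predictors entered in two different orders
fit_wt_first <- manova(y ~ wt + hp, data = mtcars)
fit_hp_first <- manova(y ~ hp + wt, data = mtcars)

# Default statistic is Pillai's trace
s_wt_first <- summary(fit_wt_first)
s_hp_first <- summary(fit_hp_first)
print(s_wt_first)
print(s_hp_first)  # the rows for wt and hp change with the order of entry

# Other statistics: "Wilks", "Hotelling-Lawley", "Roy"
s_wilks <- summary(fit_wt_first, test = "Wilks")
print(s_wilks)
```

Because the tests are sequential, the large Pillai value for an early term in the wine table partly reflects its position in the formula, so reordering the predictors is a cheap sanity check before interpreting any single row.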