
# Objectives 🔍
In statistical modeling, the main goal is inference: understanding the relationships between a response y and several predictors x. Our aim with this dataset is therefore to characterize the relationship between citric acid and the other variables.

**P.S**: We are not interested in building a model that predicts the amount of citric acid.

----------------------


# Loading Essential Libraries 📚

In [1]:
# we start first by importing essential libraries

# for data manipulation
import pandas as pd
import numpy as np


# for visualization
import plotly.express as px
import plotly.graph_objects as go
import plotly.figure_factory as ff
import matplotlib.pyplot as plt
import matplotlib as mpl

# for statistical tests
import statsmodels.api as sma
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
from statsmodels.tools.tools import add_constant
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.graphics.gofplots import qqplot
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score 

import scipy.stats as stats




# Reading Data Set 👓

In [2]:
url = 'https://raw.githubusercontent.com/linahourieh/Wine_Quality_Multilinear_Reg/main/winequality-red.csv'
df_wine = pd.read_csv(url)

In [3]:
df_wine.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |


In [4]:
df_wine.shape

(1599, 12)

In [5]:
df_wine.describe()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 | 1599.0 |
| mean | 8.319637 | 0.527821 | 0.270976 | 2.538806 | 0.087467 | 15.874922 | 46.467792 | 0.996747 | 3.311113 | 0.658149 | 10.422983 | 5.636023 |
| std | 1.741096 | 0.17906 | 0.194801 | 1.409928 | 0.047065 | 10.460157 | 32.895324 | 0.001887 | 0.154386 | 0.169507 | 1.065668 | 0.807569 |
| min | 4.6 | 0.12 | 0.0 | 0.9 | 0.012 | 1.0 | 6.0 | 0.99007 | 2.74 | 0.33 | 8.4 | 3.0 |
| 25% | 7.1 | 0.39 | 0.09 | 1.9 | 0.07 | 7.0 | 22.0 | 0.9956 | 3.21 | 0.55 | 9.5 | 5.0 |
| 50% | 7.9 | 0.52 | 0.26 | 2.2 | 0.079 | 14.0 | 38.0 | 0.99675 | 3.31 | 0.62 | 10.2 | 6.0 |
| 75% | 9.2 | 0.64 | 0.42 | 2.6 | 0.09 | 21.0 | 62.0 | 0.997835 | 3.4 | 0.73 | 11.1 | 6.0 |
| max | 15.9 | 1.58 | 1.0 | 15.5 | 0.611 | 72.0 | 289.0 | 1.00369 | 4.01 | 2.0 | 14.9 | 8.0 |


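Note that `fixed acidity` has a small standard deviation (1.74) while `free sulfur dioxide` has a much larger one (10.46). This hints that `fixed acidity` might be the more stable predictor for our model, in contrast to `free sulfur dioxide`. The two variables are measured on different scales, though, so this comparison is only suggestive.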
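Because the two variables live on different scales, their raw standard deviations are not directly comparable. One common way to put them on the same footing is the coefficient of variation (standard deviation divided by mean). The sketch below uses the `mean` and `std` rows of the `describe()` output above; the helper name `coefficient_of_variation` is an assumption made for illustration and is not part of the notebook.

```python
# (mean, std) pairs copied from the describe() output above
summary_stats = {
    "fixed acidity": (8.319637, 1.741096),
    "free sulfur dioxide": (15.874922, 10.460157),
}


def coefficient_of_variation(mean, std):
    """Relative spread: the standard deviation as a fraction of the mean."""
    return std / mean


cv = {
    name: coefficient_of_variation(mean, std)
    for name, (mean, std) in summary_stats.items()
}

for name, value in cv.items():
    print(f"{name}: CV = {value:.3f}")
```

On this relative scale `free sulfur dioxide` is still the more dispersed variable, but the gap is roughly threefold rather than the sixfold gap seen in the raw standard deviations.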

In [6]:
df_wine.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [7]:
# no null values are detected
df_wine.isnull().sum()

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

In [8]:
# define some colors
colors = ['#2B463C', '#688F4E', '#B1D182', '#F4F1E9']

# function to generate a continuous color palette from 2 colors
def colorFader(c1,c2,mix=0):
    c1=np.array(mpl.colors.to_rgb(c1))
    c2=np.array(mpl.colors.to_rgb(c2))
    return mpl.colors.to_hex((1-mix)*c1 + mix*c2)

c1=colors[1] 
c2=colors[3] 
n=1000

# list containing color series
c=[]
for x in range(n+1):
    c.append(colorFader(c1,c2,x/n))

In [9]:
y_data = [df_wine['fixed acidity']]
x_data = ['fixed acidity']

fig = go.Figure()

for xd, yd, cls in zip(x_data,y_data, c):
  fig.add_trace(
      go.Box(
              y=yd,
              name=xd,
              jitter=0.5,
              whiskerwidth=0.2,
              fillcolor=cls,
              marker_size=2,
              marker_color = 'black',
              line_width=0.7)
             )
      
fig.update_layout(template="plotly_white",
                  width=600,
                  height=700,
                  font=dict(size=18))
              
fig.show()

In [10]:
y_data = [df_wine['free sulfur dioxide']]
x_data = ['free sulfur dioxide']

fig = go.Figure()

for xd, yd, cls in zip(x_data,y_data, c):
  fig.add_trace(
      go.Box(
              y=yd,
              name=xd,
              jitter=0.5,
              whiskerwidth=0.2,
              fillcolor=cls,
              marker_size=2,
              marker_color = 'black',
              line_width=0.7)
             )
      
fig.update_layout(template="plotly_white",
                  width=600,
                  height=700,
                  font=dict(size=18))
              
fig.show()

# Model Development 🛠

### **Forward Selection**
When picking the independent variables for our model, we should rely mainly on common sense and our background knowledge.

Here is a good [article](https://quantifyinghealth.com/variables-to-include-in-regression/#:~:text=As%20a%20rule%20of%20thumb,sense%20and%20your%20background%20knowledge.) explaining how you should pick your independent variables.


### **Backward Selection**
This is the methodology we follow here; it is the more common one in industry. We start with all variables in the model and then eliminate them one at a time according to their p-values and VIF.
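A single elimination step can be sketched as a small decision rule. This is only an illustration under stated assumptions: the function name `pick_variable_to_drop`, the default thresholds, and the order of the checks (p-value first, then VIF) are choices made for this sketch, and the example values are hypothetical rather than taken from the fitted models.

```python
def pick_variable_to_drop(pvalues, vifs, alpha=0.05, vif_limit=5.0):
    """Return the predictor to remove in one backward-elimination step, or None.

    pvalues, vifs: dicts mapping predictor name -> value (constant excluded).
    Assumed rule: drop the least significant predictor if its p-value exceeds
    alpha; otherwise drop the predictor with the largest VIF if that VIF
    exceeds vif_limit; otherwise stop eliminating.
    """
    worst_p = max(pvalues, key=pvalues.get)
    if pvalues[worst_p] > alpha:
        return worst_p
    worst_vif = max(vifs, key=vifs.get)
    if vifs[worst_vif] > vif_limit:
        return worst_vif
    return None


# Hypothetical values for three predictors, for illustration only
example_p = {"a": 0.001, "b": 0.40, "c": 0.03}
example_vif = {"a": 6.2, "b": 1.3, "c": 2.0}

to_drop = pick_variable_to_drop(example_p, example_vif)
print(to_drop)
```

In practice the model is refitted after every removal, because the p-values and VIF of the remaining variables change once a predictor leaves the model.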


### **Key Metrics**

> **AIC:**
- No absolute value is meaningful on its own. It is a relative measure: the lower, the better.

> **Adjusted R-squared:**
- Target: >= 0.7

> **Individual variable's P-value (P>|t|):**
- Target: <= 0.05

> **Individual variable's VIF:**
- Target: < 5
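To make the first two metrics concrete, here is a minimal sketch of how adjusted R-squared and AIC follow from quantities printed in an OLS summary. The inputs are the values reported for iteration 1 in this notebook (R-squared 0.682, 1119 observations, 11 predictors, log-likelihood 880.23); the function names are assumptions made for this example.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared for a model that includes an intercept."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def aic(log_likelihood, n_params):
    """Akaike information criterion; n_params counts the intercept as well."""
    return 2 * n_params - 2 * log_likelihood


# Values reported in the iteration 1 summary
adj = adjusted_r2(r2=0.682, n_obs=1119, n_predictors=11)
aic_value = aic(log_likelihood=880.23, n_params=12)  # 11 predictors + constant

print(round(adj, 3), round(aic_value))
```

Both results agree with the summary (0.679 and -1736). Adjusted R-squared penalizes every extra predictor, and AIC trades goodness of fit against the number of parameters, which is why both are useful when comparing iterations.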

## Iteration 1 : Backward Selection

In [9]:
# list the column names
df_wine.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [10]:
# prepare the X and the Y for the model
X = df_wine[['fixed acidity', 'volatile acidity', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality']]
Y = df_wine['citric acid']

In [11]:
# split the data
x_train,x_test,y_train,y_test = train_test_split(X,Y,train_size=0.7,random_state=42)

In [12]:
# we add constant value (that is b in y=ax+b)
x_train_new = sma.add_constant(x_train)
x_test_new = sma.add_constant(x_test)



In [13]:
# Build our model and fit
full_model = sma.OLS(y_train, x_train_new)
full_results = full_model.fit()

In [14]:
# print out the results
print(full_results.summary())

                            OLS Regression Results                            
Dep. Variable:            citric acid   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     215.8
Date:                Thu, 17 Feb 2022   Prob (F-statistic):          2.95e-266
Time:                        11:25:56   Log-Likelihood:                 880.23
No. Observations:                1119   AIC:                            -1736.
Df Residuals:                    1107   BIC:                            -1676.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -2.2504 

In [15]:
# calculate vif factor
print('Variance Inflating Factor')
cnames = x_train.columns
for i in np.arange(0,len(cnames)):
  xvars = list(cnames)
  yvar = xvars.pop(i)
  mod = sma.OLS(x_train[yvar], sma.add_constant(x_train_new[xvars]))
  res= mod.fit()
  vif = 1 / (1- res.rsquared)
  print(yvar, round(vif,3))



Variance Inflating Factor
fixed acidity 6.408
volatile acidity 1.347
residual sugar 1.731
chlorides 1.366
free sulfur dioxide 1.914
total sulfur dioxide 2.059
density 5.903
pH 3.224
sulphates 1.473
alcohol 3.084
quality 1.563
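Each VIF above comes from regressing one predictor on all the others and applying VIF = 1 / (1 - R²), exactly as the loop does. Reading the relation backwards shows how much of a predictor is explained by the rest. The helper names below are assumptions for this sketch; the VIF values are copied from the output above.

```python
def vif_from_r2(r2):
    """VIF of a predictor given the R-squared of regressing it on the others."""
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Share of a predictor's variance explained by the other predictors."""
    return 1.0 - 1.0 / vif


# VIF values copied from the iteration 1 output above
reported_vif = {"fixed acidity": 6.408, "density": 5.903, "volatile acidity": 1.347}

for name, vif in reported_vif.items():
    print(f"{name}: VIF {vif} -> R^2 of about {r2_from_vif(vif):.3f}")
```

The usual cutoff of 5 corresponds to an R² of 0.8: once 80% of a predictor's variance can be reproduced from the other predictors, its coefficient estimate becomes unstable.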


Based on the p-values and VIF, we remove `density`.


## Iteration 2 : Backward Selection

In [16]:
# prepare the X for the model
X = df_wine[['fixed acidity', 'volatile acidity', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide',
       'pH', 'sulphates', 'alcohol', 'quality']]
Y = df_wine['citric acid']

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,train_size=0.7,random_state=42)

x_train_new = sma.add_constant(x_train)
x_test_new = sma.add_constant(x_test)

full_model = sma.OLS(y_train, x_train_new)
full_results = full_model.fit()

print(full_results.summary())

                            OLS Regression Results                            
Dep. Variable:            citric acid   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     237.6
Date:                Thu, 17 Feb 2022   Prob (F-statistic):          2.09e-267
Time:                        11:25:56   Log-Likelihood:                 880.12
No. Observations:                1119   AIC:                            -1738.
Df Residuals:                    1108   BIC:                            -1683.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.2024 



In [17]:
print('Variance Inflating Factor')
cnames = x_train.columns
for i in np.arange(0,len(cnames)):
  xvars = list(cnames)
  yvar = xvars.pop(i)
  mod = sma.OLS(x_train[yvar], sma.add_constant(x_train_new[xvars]))
  res= mod.fit()
  vif = 1 / (1- res.rsquared)
  print(yvar, round(vif,3))

Variance Inflating Factor
fixed acidity 2.025
volatile acidity 1.329
residual sugar 1.075
chlorides 1.359
free sulfur dioxide 1.895
total sulfur dioxide 2.048
pH 2.198
sulphates 1.375
alcohol 1.483
quality 1.562




Based on the p-values and VIF, we remove `sulphates`.


## Iteration 3 : Backward Selection

In [20]:
# prepare the X for the model
X = df_wine[['fixed acidity', 'volatile acidity', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide',
       'pH', 'alcohol', 'quality']]
Y = df_wine['citric acid']

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,train_size=0.7,random_state=42)
x_train_new = sma.add_constant(x_train)
x_test_new = sma.add_constant(x_test)
full_model = sma.OLS(y_train, x_train_new)
full_results = full_model.fit()
print(full_results.summary())

                            OLS Regression Results                            
Dep. Variable:            citric acid   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     263.8
Date:                Thu, 17 Feb 2022   Prob (F-statistic):          2.18e-268
Time:                        11:27:00   Log-Likelihood:                 879.56
No. Observations:                1119   AIC:                            -1739.
Df Residuals:                    1109   BIC:                            -1689.
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.1986 



In [21]:
print('Variance Inflating Factor')
cnames = x_train.columns
for i in np.arange(0,len(cnames)):
  xvars = list(cnames)
  yvar = xvars.pop(i)
  mod = sma.OLS(x_train[yvar], sma.add_constant(x_train_new[xvars]))
  res= mod.fit()
  vif = 1 / (1- res.rsquared)
  print(yvar, round(vif,3))

Variance Inflating Factor
fixed acidity 2.016
volatile acidity 1.277
residual sugar 1.072
chlorides 1.128
free sulfur dioxide 1.894
total sulfur dioxide 2.037
pH 2.198
alcohol 1.472
quality 1.512




We remove `quality`.

## Iteration 4 : Backward Selection

In [22]:
# prepare the X for the model
X = df_wine[['fixed acidity', 'volatile acidity', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide',
       'pH', 'alcohol']]
Y = df_wine['citric acid']

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,train_size=0.7,random_state=42)
x_train_new = sma.add_constant(x_train)
x_test_new = sma.add_constant(x_test)
full_model = sma.OLS(y_train, x_train_new)
full_results = full_model.fit()
print(full_results.summary())

                            OLS Regression Results                            
Dep. Variable:            citric acid   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     296.5
Date:                Thu, 17 Feb 2022   Prob (F-statistic):          2.64e-269
Time:                        11:27:56   Log-Likelihood:                 878.78
No. Observations:                1119   AIC:                            -1740.
Df Residuals:                    1110   BIC:                            -1694.
Df Model:                           8                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.2256 



In [23]:
print('Variance Inflating Factor')
cnames = x_train.columns
for i in np.arange(0,len(cnames)):
  xvars = list(cnames)
  yvar = xvars.pop(i)
  mod = sma.OLS(x_train[yvar], sma.add_constant(x_train_new[xvars]))
  res= mod.fit()
  vif = 1 / (1- res.rsquared)
  print(yvar, round(vif,3))

Variance Inflating Factor
fixed acidity 2.016
volatile acidity 1.171
residual sugar 1.072
chlorides 1.125
free sulfur dioxide 1.886
total sulfur dioxide 2.008
pH 2.189
alcohol 1.207




We remove `pH`.

## Iteration 5 : Backward Selection

In [24]:
# prepare the X for the model
X = df_wine[['fixed acidity', 'volatile acidity', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'alcohol']]
Y = df_wine['citric acid']

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,train_size=0.7,random_state=42)
x_train_new = sma.add_constant(x_train)
x_test_new = sma.add_constant(x_test)
full_model = sma.OLS(y_train, x_train_new)
full_results = full_model.fit()
print(full_results.summary())

                            OLS Regression Results                            
Dep. Variable:            citric acid   R-squared:                       0.681
Model:                            OLS   Adj. R-squared:                  0.679
Method:                 Least Squares   F-statistic:                     338.5
Date:                Thu, 17 Feb 2022   Prob (F-statistic):          2.89e-270
Time:                        11:28:22   Log-Likelihood:                 878.03
No. Observations:                1119   AIC:                            -1740.
Df Residuals:                    1111   BIC:                            -1700.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.3649 



In [25]:
print('Variance Inflating Factor')
cnames = x_train.columns
for i in np.arange(0,len(cnames)):
  xvars = list(cnames)
  yvar = xvars.pop(i)
  mod = sma.OLS(x_train[yvar], sma.add_constant(x_train_new[xvars]))
  res= mod.fit()
  vif = 1 / (1- res.rsquared)
  print('--',yvar,'=',round(vif,3))

Variance Inflating Factor
-- fixed acidity = 1.145
-- volatile acidity = 1.14
-- residual sugar = 1.071
-- chlorides = 1.058
-- free sulfur dioxide = 1.867
-- total sulfur dioxide = 1.94
-- alcohol = 1.172




### **Key Metrics**

> **AIC:**
- Reduced from -1736 in iteration 1 to -1740 in iteration 5

> **Adjusted R-squared:**
- Unchanged: 0.679 --> 0.679

> **Individual variable's P-value (P>|t|):**
- It is <= 0.05

> **Individual variable's VIF:**
- It is < 5



# Testing

In [26]:
# Predict on the held-out test set
y_pred = full_results.predict(x_test_new)

# Predictions and observed values side by side (both keep the test-set index)
pred_data = pd.DataFrame({'y_pred': y_pred, 'y_test': y_test})

# R-squared on the test set
rsqd = r2_score(y_test, y_pred)

# note: the training figure printed below is the adjusted R-squared
print("Training R-square value = ", round(full_results.rsquared_adj,4))
print("Test R-square value = ", round(rsqd,4))




Training R-square value =  0.6788
Test R-square value =  0.6731


- The training and test R-squared values are very close to each other (0.6788 vs 0.6731), which suggests the fit carries over to unseen data.

- The relationship between the dependent and the independent variables can therefore reasonably be represented by a linear regression.

- These independent variables explain about 67% of the variability in citric acid values.
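For reference, the R-squared figure quoted above is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it by hand on a tiny, made-up set of observed and predicted values; the function name `r_squared` and the numbers are assumptions for illustration only.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Made-up observed and predicted values, for illustration only
observed = [0.0, 0.2, 0.4, 0.6]
predicted = [0.1, 0.2, 0.3, 0.6]

score = r_squared(observed, predicted)
print(round(score, 3))
```

A score of 1 means the predictions reproduce the observations exactly, while a score of 0 means the model does no better than always predicting the mean.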


# Linear Regression Assumptions ☁️

To be reliable, linear regression requires certain assumptions to hold. Some concern the relationship between x and y; others concern the residuals (error terms). For each assumption, I first define it and explain the reason behind it.

## Assumption 1: Linear Relationship between y and x 📈

Linear regression describes a linear relationship between the dependent and independent variables, so such a relationship should actually exist between them.
It can be:
- y = ax + b
- y = a log(x)

However, it can't be:
- y = ax2x3 + b
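The `y = a log(x)` case still fits the linear framework because, once the predictor is transformed to z = log(x), the model is an ordinary straight line in z. The sketch below shows this on synthetic data generated from y = 2 log(x) + 1; the data, the intercept, and the helper `fit_line` are assumptions for illustration and do not come from the wine dataset.

```python
import math


def fit_line(x, y):
    """Ordinary least squares for y = a*x + b with a single predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic data generated from y = 2*log(x) + 1 (illustration only)
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.0 * math.log(xi) + 1.0 for xi in x]

# Regress y on the transformed predictor z = log(x)
z = [math.log(xi) for xi in x]
a, b = fit_line(z, y)
print(round(a, 6), round(b, 6))
```

The fit recovers the slope and intercept used to generate the data, even though y is not a straight line in x itself.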


We use a scatter matrix (equivalent to a pairplot) to check the relationship of each independent variable with citric acid.

---------------------------

In [27]:
fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='fixed acidity',
                                 values=df_wine['fixed acidity']),
                            dict(label='volatile acidity',
                                 values=df_wine['volatile acidity']),
                            dict(label='free sulfur dioxide',
                                 values=df_wine['free sulfur dioxide']),
                            dict(label='total sulfur dioxide',
                                 values=df_wine['total sulfur dioxide']),
                            dict(label='chlorides',
                                 values=df_wine['chlorides']),
                            dict(label='alcohol',
                                 values=df_wine['alcohol']),
                            dict(label='residual sugar',
                                 values=df_wine['residual sugar']),
                            dict(label='citric acid',
                                 values=df_wine['citric acid'])],
                diagonal_visible=False, # remove plots on diagonal
                marker=dict(color=colors[2],
                            showscale=False, 
                            size=4,  # marker size
                            line_color='white', line_width=0)
                ))

fig.update_layout(
    template="plotly_white",
    title='Scatter Matrix',
    title_x=0.5,
    width=1200,
    height=1100,
    font=dict(size=13)
)
fig.show()

Some variables do not appear to have a linear relationship with the dependent variable. We plot them separately to take a closer look.

In [28]:
fig = go.Figure(data=go.Splom(
                dimensions=[dict(label='free sulfur dioxide',
                                 values=df_wine['free sulfur dioxide']),
                            dict(label='total sulfur dioxide',
                                 values=df_wine['total sulfur dioxide']),
                            dict(label='alcohol',
                                 values=df_wine['alcohol']),
                            dict(label='citric acid',
                                 values=df_wine['citric acid'])],
                diagonal_visible=False, # remove plots on diagonal
                marker=dict(color=colors[1],
                            showscale=False, 
                            size=4,  # marker size
                            line_color='white', line_width=0)
                ))

fig.update_layout(
    template="plotly_white",
    title='Scatter Matrix',
    title_x=0.5,
    width=1200,
    height=1100,
    font=dict(size=13)
)
fig.show()

Looking at the plots, `total sulfur dioxide` forms a roughly linear shape with `citric acid`, although some outliers exist. However, `free sulfur dioxide` and `alcohol` do not show any linear relationship with `citric acid`, so we remove these two variables from the model.

## Tune the model

In [29]:
# prepare the X for the model
X = df_wine[['fixed acidity', 'volatile acidity', 'residual sugar',
       'chlorides', 'total sulfur dioxide']]
Y = df_wine['citric acid']

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(X,Y,train_size=0.7,random_state=42)
x_train_new = sma.add_constant(x_train)
x_test_new = sma.add_constant(x_test)
full_model = sma.OLS(y_train, x_train_new)
res = full_model.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:            citric acid   R-squared:                       0.656
Model:                            OLS   Adj. R-squared:                  0.654
Method:                 Least Squares   F-statistic:                     424.1
Date:                Thu, 17 Feb 2022   Prob (F-statistic):          9.24e-255
Time:                        11:46:48   Log-Likelihood:                 835.89
No. Observations:                1119   AIC:                            -1660.
Df Residuals:                    1113   BIC:                            -1630.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -0.1127 





In [30]:
print('Variance Inflating Factor')
cnames = x_train.columns
for i in np.arange(0,len(cnames)):
  xvars = list(cnames)
  yvar = xvars.pop(i)
  mod = sma.OLS(x_train[yvar], sma.add_constant(x_train_new[xvars]))
  res_1 = mod.fit()
  vif = 1 / (1- res_1.rsquared)
  print('--',yvar,'=',round(vif,3))

Variance Inflating Factor
-- fixed acidity = 1.11
-- volatile acidity = 1.077
-- residual sugar = 1.051
-- chlorides = 1.024
-- total sulfur dioxide = 1.048






## Assumption 2: Check for Homoscedasticity 📊
Homoscedasticity means that the residuals have equal or almost equal variance across the regression line. By plotting the residuals against the predicted terms we can check the presence of any pattern.

------------



### **Graphical Method**


We plot the residuals against the predicted values or the X. If there is a definite pattern (like linear or quadratic or funnel shaped) obtained from the scatter plot then heteroscedasticity is present.

In [31]:
fig = go.Figure(
    layout=go.Layout(
        title="Residuals vs fitted values plot for homoscedasticity check",
        title_x=0.5,
        width=1000,
        height=800,
        font=dict(size=18),
        template="plotly_white",
        autosize=True,
        yaxis_title="Residuals",
        xaxis_title="Predicted Values")
    )


fig.add_trace(go.Scatter(x=res.fittedvalues,
                         y=res.resid,
                         showlegend=False,
                         mode='markers',
                         name='lines',
                         marker=dict(color=colors[2],size = 5)))
fig.add_shape(type="line",
    x0=-0.25, y0=0, x1=0.9, y1=0,
    line=dict(
        color=colors[1],
        width=4
    ))


fig.show()

The graph shows no clear pattern or shape in the residuals, although a few values lie far from the zero line. Graphically, the assumption does not appear to be violated.






### **Statistical Tests**


  




**Goldfeld Quandt Test:**

$$\mathcal{H}_{0}:  Residuals\ are\ homoscedastic\ $$ 

$$\mathcal{H}_{1}:  Residuals\ are\ not\ homoscedastic\ $$

In [32]:
name = ['F statistic', 'p-value']
goldfeld = sms.het_goldfeldquandt(res.resid, x_train_new)
lzip(name, goldfeld)

[('F statistic', 0.9594617635104512), ('p-value', 0.6867050093000607)]
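The F statistic reported here is, in essence, a ratio of residual variances from two parts of the sample: a value near 1 means both parts have a similar spread. The sketch below illustrates only that idea on made-up residuals. It is a simplification: the actual test refits the regression on each sub-sample and accounts for degrees of freedom, whereas this hypothetical helper `variance_ratio` just splits an existing list of residuals.

```python
def variance_ratio(residuals, split=0.5):
    """Ratio of mean squared residuals: second part of the sample over the first."""
    cut = int(len(residuals) * split)
    first, second = residuals[:cut], residuals[cut:]
    ms_first = sum(e ** 2 for e in first) / len(first)
    ms_second = sum(e ** 2 for e in second) / len(second)
    return ms_second / ms_first


# Made-up residuals, for illustration only
steady = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1]      # constant spread
widening = [0.1, -0.1, 0.1, -0.1, 0.3, -0.3, 0.3, -0.3]    # spread grows

ratio_steady = variance_ratio(steady)
ratio_widening = variance_ratio(widening)
print(round(ratio_steady, 3), round(ratio_widening, 3))
```

The statistic of about 0.96 obtained above is close to 1, which is why this particular test does not flag a change in residual variance between the two halves of the training data.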



**Breusch Pagan Test for Heteroscedasticity**:

$$\mathcal{H}_{0}:  Residuals\ variances\ are\ equal\ (Homoscedasticity)$$ 

$$\mathcal{H}_{1}:  Residuals\ variances\ are\ not\ equal\ (Heteroscedasticity)\ $$ 


In [33]:
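import statsmodels.stats.api as sms
from statsmodels.compat import lzip
# the first two values returned by het_breuschpagan are the
# Lagrange multiplier statistic and its p-value
name = ['Lagrange multiplier statistic', 'p-value']
test = sms.het_breuschpagan(res.resid, x_train_new)
lzip(name, test)

[('Lagrange multiplier statistic', 48.06161324596686), ('p-value', 3.450531269363563e-09)]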

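The two tests disagree. The Goldfeld-Quandt p-value (0.687) is above 0.05, so that test does not reject the null hypothesis of homoscedasticity. The Breusch-Pagan p-value (3.45e-09) is far below 0.05, so that test rejects it. The evidence is therefore mixed, and the homoscedasticity assumption cannot be considered fully satisfied. ⚠️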

## Assumption 3: Check for Normality of residuals 🧮
This assumption requires the residuals to be normally distributed.


----------------

### **Graphically**

In [34]:
color = [colors[2]]

hist_data = [res.resid]
group_labels = ['distribution of plot'] # name of the dataset

fig = ff.create_distplot(hist_data, 
                         group_labels,
                         bin_size=0.01,
                         colors= color)

fig.update_layout(
        title="Normality of Residuals",
        title_x=0.5,
        width=800,
        height=700,
        font=dict(size=18),
        template="plotly_white",
        yaxis_title="",
        xaxis_title="Possible residual Values")

fig.show()

In [None]:
qqplot_data = qqplot(res.resid, line='s').gca().lines

In [36]:
fig = go.Figure()

fig.add_trace(go.Scatter(x= qqplot_data[0].get_xdata(),
                         y= qqplot_data[0].get_ydata(),
                         mode='markers',
                         marker=dict(color=colors[2],size =8 )))


fig.add_trace(go.Scatter(x= qqplot_data[1].get_xdata(),
                         y= qqplot_data[1].get_ydata(),
                         showlegend=False,
                         mode='lines',
                         marker=dict(color=colors[1],size = 5)))




fig.update_layout(
        title="Quantile-Quantile Plot",
        title_x=0.5,
        width=800,
        height=700,
        font=dict(size=18),
        showlegend =False,
        template="plotly_white",
        yaxis_title="Sample Quantiles",
        xaxis_title="Theoretical Quantiles")


fig.show()
#py.iplot(fig, filename='normality-QQ')

### **Statistical Tests**

**Anderson Darling Test for checking Normality of Errors:**

$$\mathcal{H}_{0}:  The\ residuals\ follow\ the\ specified\ distribution\ (here,\ normal) $$ 

$$\mathcal{H}_{1}:  The\ residuals\ do\ not\ follow\ the\ specified\ distribution $$ 


In [37]:
anderson_results = stats.anderson(res.resid, dist='norm')
# stats.anderson returns the test statistic and critical values, not p-values
name = ['Test statistic', 'Critical values']
lzip(name,anderson_results)

[('Test statistic', 3.0803763706908285),
 ('Critical values', array([0.574, 0.654, 0.784, 0.915, 1.088]))]
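`scipy.stats.anderson` does not return a p-value: its first two outputs are the test statistic and an array of critical values, one per significance level. The null hypothesis is rejected at a given level when the statistic exceeds the matching critical value. The sketch below applies that rule to the numbers printed above; the significance levels listed are the ones scipy pairs with these critical values for `dist='norm'` (they are the third element of the result, which the cell above does not print).

```python
# Values copied from the output above
statistic = 3.0803763706908285
critical_values = [0.574, 0.654, 0.784, 0.915, 1.088]

# Significance levels (in percent) paired with the critical values for dist='norm'
significance_levels = [15.0, 10.0, 5.0, 2.5, 1.0]

# Normality is rejected at a level when the statistic exceeds its critical value
rejected = {
    level: statistic > crit
    for level, crit in zip(significance_levels, critical_values)
}

for level, is_rejected in rejected.items():
    verdict = "reject normality" if is_rejected else "fail to reject normality"
    print(f"{level}% level: {verdict}")
```

Since 3.08 is larger than every critical value, the test rejects normality of the residuals at all listed levels, including the usual 5% level.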

- The distribution plot shows a roughly bell-shaped curve, slightly skewed to the right, which is acceptable. ✅

- The Q-Q plot shows that most values lie on the straight line. ✅

- The Anderson-Darling statistic (3.08) exceeds every critical value, including 0.784 at the 5% level, so the test rejects the null hypothesis that the residuals follow a normal distribution. The formal test therefore disagrees with the graphical checks. ⚠️

## Assumption 4: Dropping Multicollinear Variables 🔻

In regression, multicollinearity refers to the extent to which independent variables are correlated. Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, precision of the predictions, and the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity

-------------------------

In [38]:
#Calculate the VIF and remove the features with VIF above 5 if you see fit to do so

print('Variance Inflating Factor')
cnames = x_train.columns
for i in np.arange(0,len(cnames)):
  xvars = list(cnames)
  yvar = xvars.pop(i)
  mod = sma.OLS(x_train[yvar], sma.add_constant(x_train_new[xvars]))
  res_1 = mod.fit()
  vif = 1 / (1- res_1.rsquared)
  print('--',yvar,'=',round(vif,3))

Variance Inflating Factor
-- fixed acidity = 1.11
-- volatile acidity = 1.077
-- residual sugar = 1.051
-- chlorides = 1.024
-- total sulfur dioxide = 1.048






In [39]:
def df_to_plotly(df):
    return {'z': df.values.tolist(),
            'x': df.columns.tolist(),
            'y': df.index.tolist()}

In [40]:
z = X
z = z.corr()

In [41]:
fig = go.Figure(data=go.Heatmap(df_to_plotly(z),
                                colorscale = c))

fig.update_layout(
        title="Heat Map",
        title_x=0.5,
        width=800,
        height=700,
        font=dict(size=18),
        showlegend =False,
        template="plotly_white")
fig.show()

- The VIF values show no multicollinearity. ✅
- The heatmap agrees. ✅


## Assumption 5: No Autocorrelation of Residuals 🔗

The linear regression model assumes that the error terms are independent: the error term of one observation is not influenced by that of another. When this does not hold, the residuals are said to be autocorrelated.


**Durbin Watson test is used to check for autocorrelation:**

$$\mathcal{H}_{0}:  Autocorrelation\ is\ absent\  $$ 

$$\mathcal{H}_{1}:  Autocorrelation\ is\ present\ $$ 

In [42]:
from statsmodels.stats.stattools import durbin_watson
durbin_watson(res.resid)

2.093652924882907

The statistic lies between 0 and 4. A value between 1.8 and 2.2 indicates no autocorrelation, a value below 1.8 indicates positive autocorrelation, and a value above 2.2 indicates negative autocorrelation.

- The Durbin-Watson statistic (2.09) indicates no autocorrelation. ✅ 
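The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal sketch on made-up residual sequences shows how it moves away from 2; the function name `durbin_watson_stat` and the sequences are assumptions for illustration.

```python
def durbin_watson_stat(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Made-up residual sequences, for illustration only
alternating = [1.0, -1.0, 1.0, -1.0]   # sign flips every step (negative autocorrelation)
persistent = [1.0, 1.0, -1.0, -1.0]    # neighbours tend to agree (positive autocorrelation)

dw_alternating = durbin_watson_stat(alternating)
dw_persistent = durbin_watson_stat(persistent)
print(dw_alternating, dw_persistent)
```

The statistic is approximately 2(1 - r), where r is the lag-1 correlation of the residuals, so the value of 2.09 obtained above corresponds to an r of roughly -0.05.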


# Final Evaluation 📍

In [44]:
print(res.summary2())

                   Results: Ordinary least squares
Model:                OLS              Adj. R-squared:     0.654     
Dependent Variable:   citric acid      AIC:                -1659.7740
Date:                 2022-02-17 11:54 BIC:                -1629.6529
No. Observations:     1119             Log-Likelihood:     835.89    
Df Model:             5                F-statistic:        424.1     
Df Residuals:         1113             Prob (F-statistic): 9.24e-255 
R-squared:            0.656            Scale:              0.013214  
---------------------------------------------------------------------
                      Coef.  Std.Err.    t     P>|t|   [0.025  0.975]
---------------------------------------------------------------------
const                -0.1127   0.0236  -4.7826 0.0000 -0.1589 -0.0664
fixed acidity         0.0613   0.0021  29.0163 0.0000  0.0572  0.0655
volatile acidity     -0.4726   0.0196 -24.1117 0.0000 -0.5111 -0.4342
residual sugar        0.0101   0.0025  

## Overall Model Accuracy:

This is evaluated by R-squared. Here R-squared = 0.656, so the model explains about 66% of the variance in citric acid. Thus, our model **MAY BE** good enough to deploy on unseen data.



In [45]:
res.params

const                  -0.112672
fixed acidity           0.061313
volatile acidity       -0.472614
residual sugar          0.010075
chlorides               0.762333
total sulfur dioxide    0.000700
dtype: float64
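Each coefficient is the change in the fitted citric acid value for a one-unit increase in that variable, holding the others fixed. As a worked illustration, the sketch below plugs the wine in row 3 of `df_wine.head()` into the coefficients printed above; the dictionary names are assumptions for this example.

```python
# Coefficients copied from res.params above
params = {
    "const": -0.112672,
    "fixed acidity": 0.061313,
    "volatile acidity": -0.472614,
    "residual sugar": 0.010075,
    "chlorides": 0.762333,
    "total sulfur dioxide": 0.000700,
}

# Predictor values of row 3 in df_wine.head()
wine = {
    "fixed acidity": 11.2,
    "volatile acidity": 0.28,
    "residual sugar": 1.9,
    "chlorides": 0.075,
    "total sulfur dioxide": 60.0,
}

# Fitted value = intercept + sum of coefficient * predictor value
fitted = params["const"] + sum(params[name] * value for name, value in wine.items())
print(round(fitted, 3))
```

For this wine the fitted value is about 0.560, against an observed citric acid of 0.56. Most other rows will not match this closely, since the model explains only about two thirds of the variance.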

## Model Significance

Our Model:

$$\mathcal{Y}_{\text{citric acid}} = 0.075\, x_{\text{fixed acidity}} + 0.0008\, x_{\text{free sulfur dioxide}} - 0.33 $$

To show that our linear model is statistically significant, we perform a hypothesis test for every β. Let us assume that:
$$\mathcal{H}_{0}: β_{1} = 0 $$
$$\mathcal{H}_{1}: β_{1} ≠ 0 $$
Simply put, if β1 = 0, the model shows no association between the two variables:
$$\mathcal{Y} = β_{0} + ε $$





To test each coefficient's null hypothesis we use the t statistic. Look at the P>|t| column: these are the p-values of the t-tests. In short, if a p-value is less than the chosen significance level (commonly 0.05), you reject the null hypothesis. Otherwise, you fail to reject the null and should consider dropping that independent variable.
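The t statistic in the summary is simply the coefficient divided by its standard error. The sketch below recomputes it from the rounded values printed by `res.summary2()` above and derives a two-sided p-value. Two assumptions are made here: the p-value uses the normal approximation to the t distribution (reasonable with 1113 residual degrees of freedom), and the function names are illustrative.

```python
import math


def t_statistic(coef, std_err):
    """t statistic for the null hypothesis that the coefficient is zero."""
    return coef / std_err


def two_sided_p_normal(t):
    """Two-sided p-value under the normal approximation to the t distribution."""
    return math.erfc(abs(t) / math.sqrt(2))


# Rounded coefficient and standard error of volatile acidity from the summary above
t_volatile = t_statistic(-0.4726, 0.0196)
p_volatile = two_sided_p_normal(t_volatile)

# A t statistic near 1.96 sits right at the usual 5% threshold
p_borderline = two_sided_p_normal(1.96)

print(round(t_volatile, 2), p_volatile < 0.05, round(p_borderline, 3))
```

The recomputed t statistic differs slightly from the printed one (-24.1117) only because the summary rounds the coefficient and standard error to four decimals.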

Above, at a significance level of 0.05, the p-value of 0.000 is far below the threshold. Therefore, we reject the null hypothesis that the coefficient equals 0 and conclude that `fixed acidity` and `free sulfur dioxide` are important independent variables.

Going back to the assumptions of linear regression, some of them were violated. It seems that free sulfur dioxide is skewing the results. 
Notice the t-values of both variables.
Generally, a t-value greater than +2 or less than -2 is acceptable. The larger the absolute t-value, the greater the confidence we have in the coefficient as a predictor. Low absolute t-values indicate low reliability of that coefficient's predictive power.

- fixed acidity           36.208    very high 🟢
- free sulfur dioxide     2.314     very low 🔴

Therefore, in the model update I will remove free sulfur dioxide from the equation.

