# IN269 Kecerdasan Bisnis (Business Intelligence)
## Session 11: Multicollinearity

## 10.1 Introduction
- In multiple regression, two or more predictor variables might be correlated with each other. 
- This situation is referred to as **collinearity**.

- There is an extreme situation, called **multicollinearity**, where collinearity exists between three or more variables even if no pair of variables has a particularly high correlation. 
- This means that there is redundancy between predictor variables.
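
The point that redundancy can hide behind modest pairwise correlations can be illustrated with a small simulation. This is only a sketch on made-up data: the variables `x1` to `x5`, the sample size, and the noise level are assumptions chosen for illustration, not part of the course material.

```r
# Simulated illustration (all names and values are made up for this sketch)
set.seed(1)
n  <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
x4 <- rnorm(n)
# x5 is almost an exact linear combination of the other four predictors
x5 <- x1 + x2 + x3 + x4 + rnorm(n, sd = 0.1)
sim <- data.frame(x1, x2, x3, x4, x5)

# Largest absolute pairwise correlation among the five predictors
cor_mat <- cor(sim)
max_pairwise <- max(abs(cor_mat[upper.tri(cor_mat)]))

# Share of the variance of x5 explained jointly by the other predictors
r2_x5 <- summary(lm(x5 ~ x1 + x2 + x3 + x4, data = sim))$r.squared

round(c(max_pairwise = max_pairwise, r2_x5 = r2_x5), 3)
```

No single pair of variables looks alarming in the correlation matrix, yet `x5` is almost fully determined by the other four taken together, which is why pairwise correlations alone are not a sufficient check.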

In the presence of multicollinearity, the estimated regression coefficients become unstable: small changes in the data can lead to large changes in the estimates.

For a given predictor ($p$), multicollinearity can be assessed by computing a score called the **variance inflation factor (VIF)**, which measures <u>how much the variance of a regression coefficient is inflated due to multicollinearity in the model</u>.

- The smallest possible value of VIF is one (absence of multicollinearity). 
- As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
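
For reference, the usual textbook definition behind these two bullet points can be written as follows. The notation $R_j^2$ is introduced here for illustration: it denotes the $R^2$ obtained when predictor $j$ is regressed on all the other predictors.

$$
    \text{VIF}_j = \frac{1}{1 - R_j^2}
$$

When predictor $j$ is unrelated to the others, $R_j^2 = 0$ and $\text{VIF}_j = 1$. The thresholds of 5 and 10 correspond to $R_j^2 = 0.8$ and $R_j^2 = 0.9$ respectively.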

When faced with multicollinearity, the variables concerned should be removed, since the presence of multicollinearity implies that the information these variables provide about the response is redundant in the presence of the other variables.

## 10.2 Loading Required R packages
- `tidyverse` for easy data manipulation and visualization
- `caret` for easy machine learning workflow
- `car` for the `vif()` function

In [2]:
library(tidyverse)
library(caret)
library(car)

Loading required package: carData


Attaching package: ‘car’


The following object is masked from ‘package:dplyr’:

    recode


The following object is masked from ‘package:purrr’:

    some




## 10.3 Preparing the data

We'll use the `Boston` data set (in the `MASS` package) to predict the median house value (`medv`) in Boston suburbs from multiple predictor variables.

We'll randomly split the data into a training set (80%, for building a predictive model) and a test set (20%, for evaluating the model).

Make sure to set a seed for reproducibility.

In [3]:
data("Boston", package = "MASS")
set.seed(123)
training.samples <- Boston$medv %>% createDataPartition(p = 0.8, list = FALSE)

In [4]:
train.data <- Boston[training.samples, ]
test.data <- Boston[-training.samples, ]

In [5]:
head(test.data)

| | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 3 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 6 | 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.43 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
| 9 | 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
| 11 | 0.22489 | 12.5 | 7.87 | 0 | 0.524 | 6.377 | 94.3 | 6.3467 | 5 | 311 | 15.2 | 392.52 | 20.45 | 15.0 |
| 14 | 0.62976 | 0.0 | 8.14 | 0 | 0.538 | 5.949 | 61.8 | 4.7075 | 4 | 307 | 21.0 | 396.9 | 8.26 | 20.4 |
| 15 | 0.63796 | 0.0 | 8.14 | 0 | 0.538 | 6.096 | 84.5 | 4.4619 | 4 | 307 | 21.0 | 380.02 | 10.26 | 18.2 |


## 10.4 Building a Regression Model 
The following regression model includes all predictor variables:


In [6]:
# Build the model
model1 <- lm(medv ~ ., data = train.data)

In [7]:
# Make predictions
predictions <- model1 %>% predict(test.data)

In [8]:
# Model performance
data.frame(
  RMSE = RMSE(predictions, test.data$medv),
  MAE = MAE(predictions, test.data$medv)
)

| RMSE | MAE |
|---|---|
| `<dbl>` | `<dbl>` |
| 4.588948 | 3.365475 |


## 10.5 Detecting multicollinearity
The R function `vif()` can be used to detect multicollinearity in a regression model:

In [10]:
vif(model1)

- In our example, the VIF score for the predictor variable `tax` is very high (VIF = 9.16).     
- This might be problematic.
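
To make the score less of a black box, the sketch below recomputes VIFs by hand with base R: each predictor is regressed on all the others, and the resulting $R^2$ is plugged into $1 / (1 - R^2)$. The helper `manual_vif()` is a hypothetical name introduced here, and the sketch uses the full `Boston` data rather than `train.data` so that it runs on its own, so the values will differ somewhat from `vif(model1)`.

```r
# Sketch: compute VIFs by hand with base R (manual_vif is a hypothetical helper)
data("Boston", package = "MASS")

manual_vif <- function(data, response) {
  predictors <- setdiff(names(data), response)
  sapply(predictors, function(p) {
    # Auxiliary regression: predictor p explained by all remaining predictors
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- manual_vif(Boston, response = "medv")
# Predictors sorted from most to least inflated
sort(vifs, decreasing = TRUE)
```

Calling `manual_vif(train.data, response = "medv")` after the split above should correspond to the `vif(model1)` output, since every predictor here is a single numeric column.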

## 10.6 Dealing with multicollinearity
In this section, we'll update our model by removing the predictor variable with a high VIF value:

In [11]:
# Build a model excluding the tax variable
model2 <- lm(medv ~ . - tax, data = train.data)

In [15]:
# Make predictions
predictions <- model2 %>% predict(test.data)

In [16]:
# Model performance
data.frame(RMSE = RMSE(predictions, test.data$medv), MAE = MAE(predictions, test.data$medv))

| RMSE | MAE |
|---|---|
| `<dbl>` | `<dbl>` |
| 4.644396 | 3.479546 |


Removing the `tax` variable has little effect on the model performance metrics: RMSE changes from 4.59 to 4.64 and MAE from 3.37 to 3.48.

- This session describes how to detect and deal with multicollinearity in regression models.
- Multicollinearity arises when the model includes different variables that have a similar predictive relationship with the outcome.
- This can be assessed for each predictor by computing the VIF value.
- Any variable with a high VIF value (above 5 or 10) should be removed from the model. 
- This leads to a simpler model without compromising the model accuracy, which is good.

## Summary of Assumption Tests
1. Test for linearity of the data
2. Test for homogeneity of variance
3. Test for normality of the residuals
4. Test for multicollinearity

In [68]:
lm.fit <- lm(medv ~ lstat + age, data=Boston)

In [69]:
summary(lm.fit)


Call:
lm(formula = medv ~ lstat + age, data = Boston)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.981  -3.978  -1.283   1.968  23.158 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
age          0.03454    0.01223   2.826  0.00491 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.173 on 503 degrees of freedom
Multiple R-squared:  0.5513,	Adjusted R-squared:  0.5495 
F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16


In [70]:
vif(lm.fit)
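
With exactly two predictors the VIF has a simple closed form: the $R^2$ from regressing one predictor on the other equals their squared correlation, so both predictors share the same value $1 / (1 - r^2)$. The following is a minimal sketch of that check on the full `Boston` data; the variable names `vif_from_cor` and `vif_from_lm` are chosen here for illustration.

```r
data("Boston", package = "MASS")

# With two predictors, the auxiliary R-squared is the squared correlation
r <- cor(Boston$lstat, Boston$age)
vif_from_cor <- 1 / (1 - r^2)

# Cross-check through the auxiliary regression of lstat on age
r2_aux <- summary(lm(lstat ~ age, data = Boston))$r.squared
vif_from_lm <- 1 / (1 - r2_aux)

c(correlation = r, vif_from_cor = vif_from_cor, vif_from_lm = vif_from_lm)
```

Both routes give the same number, which should also be the value `vif(lm.fit)` reports for `lstat` and for `age`.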

## Exercises
As an exercise, use the `Advertising` dataset to do the following:
1. Test the four assumptions listed above.
2. Fit the linear regression model
$$
    \text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper}
$$
3. Make predictions and compute the MSE and MAE.
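
For the last step, note that `caret` offers `RMSE()` and `MAE()` but no MSE function; the MSE is simply the RMSE squared. A minimal sketch of both metrics in base R is given below. The helpers `mse()` and `mae()` and the toy vectors are hypothetical, introduced only to show the calls.

```r
# Hypothetical helpers for the two error metrics
mse <- function(pred, obs) mean((obs - pred)^2)
mae <- function(pred, obs) mean(abs(obs - pred))

# Toy vectors, made up purely to illustrate the calls
obs  <- c(1, 2, 3)
pred <- c(1, 2, 5)
c(MSE = mse(pred, obs), MAE = mae(pred, obs))
```

In the exercise, `pred` would be the output of `predict()` on the test set and `obs` the observed `sales` values.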

In [65]:
Advertising <- read_csv("Advertising.csv") 
Advertising <- Advertising %>% select(-c(no))

Rows: 200 Columns: 5
── Column specification ────────────────────────────────────────
Delimiter: ","
dbl (5): no, TV, radio, newspaper, sales

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [66]:
head(Advertising)

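| TV | radio | newspaper | sales |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 230.1 | 37.8 | 69.2 | 22.1 |
| 44.5 | 39.3 | 45.1 | 10.4 |
| 17.2 | 45.9 | 69.3 | 9.3 |
| 151.5 | 41.3 | 58.5 | 18.5 |
| 180.8 | 10.8 | 58.4 | 12.9 |
| 8.7 | 48.9 | 75.0 | 7.2 |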


In [67]:
sample_n(Advertising, 3)

| TV | radio | newspaper | sales |
|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 273.7 | 28.9 | 59.7 | 20.8 |
| 250.9 | 36.5 | 72.3 | 22.2 |
| 7.8 | 38.9 | 50.6 | 6.6 |


<center>
        <h1>The End</h1>
</center>