# Variance Inflation Factor (VIF)

A variance inflation factor (VIF) is a measure of the amount of multicollinearity in regression analysis. Multicollinearity exists when there is a correlation between multiple independent variables in a multiple regression model. This can adversely affect the regression results.

[statsmodels Python library documentation for variance_inflation_factor](https://www.statsmodels.org/dev/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html?highlight=vif)

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor
```

**KEY TAKEAWAYS**

- A variance inflation factor (VIF) provides a measure of multicollinearity among the independent variables in a multiple regression model.
- Detecting multicollinearity is important because while multicollinearity does not reduce the explanatory power of the model, it does reduce the statistical significance of the independent variables.
- A large VIF on an independent variable indicates a highly collinear relationship to the other variables that should be considered or adjusted for in the structure of the model and selection of independent variables.

**Understanding a Variance Inflation Factor (VIF)**

A variance inflation factor is a tool to help identify the degree of multicollinearity. Multiple regression is used when a person wants to test the effect of multiple variables on a particular outcome. The dependent variable is the outcome that is being acted upon by the independent variables, which are the inputs into the model. Multicollinearity exists when there is a linear relationship, or correlation, between two or more of the independent variables or inputs.

### The Problem of Multicollinearity

Multicollinearity creates a problem in the multiple regression model because the inputs influence one another. They are therefore not actually independent, and it becomes difficult to test how much each independent variable, on its own, affects the dependent variable, or outcome, within the regression model.

While multicollinearity does not reduce a model's overall predictive power, it can produce estimates of the regression coefficients that are not statistically significant. In a sense, it can be thought of as a kind of double-counting in the model.

In statistical terms, high multicollinearity in a multiple regression model makes it more difficult to estimate the relationship between each of the independent variables and the dependent variable. In other words, when two or more independent variables are closely related or measure almost the same thing, the underlying effect that they measure is being accounted for twice (or more) across the variables. When the independent variables are closely related, it becomes difficult to say which variable is influencing the dependent variable.
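The effect described in this section can be sketched with a small simulation. The data, seed, and helper name `ols_summary` below are assumptions made for illustration. The same outcome model is fitted twice, once with two independent predictors and once with two nearly identical ones, and the coefficient standard errors are compared.

```python
import numpy as np

def ols_summary(X, y):
    """Return (R^2, coefficient standard errors) for OLS with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = len(y) - A.shape[1]
    sigma2 = resid @ resid / dof              # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)     # covariance of the coefficients
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return r2, np.sqrt(np.diag(cov))[1:]      # skip the intercept

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
noise = rng.normal(size=n)
x2_indep = rng.normal(size=n)                        # unrelated to x1
x2_collinear = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1

r2_indep, se_indep = ols_summary(np.column_stack([x1, x2_indep]),
                                 x1 + x2_indep + noise)
r2_coll, se_coll = ols_summary(np.column_stack([x1, x2_collinear]),
                               x1 + x2_collinear + noise)

print("independent predictors:", r2_indep, se_indep)
print("collinear predictors:  ", r2_coll, se_coll)
```

In a run like this, the overall fit should remain good in both cases, while the standard errors of the individual coefficients should be far larger for the collinear pair.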

### Tests to Detect Multicollinearity

Tests can be run for multicollinearity to check that the model is properly specified. **The variance inflation factor** is one such measuring tool. Using variance inflation factors **helps to identify the severity of any multicollinearity issues** so that the model can be adjusted. The variance inflation factor **measures how much the variance of an independent variable's estimated regression coefficient is inflated by that variable's correlation with the other independent variables**.

Variance inflation factors allow a quick measure of how much a variable is contributing to the standard error in the regression. When significant multicollinearity issues exist, the variance inflation factor will be very large for the variables involved. After these variables are identified, several approaches can be used to eliminate or combine collinear variables, resolving the multicollinearity issue.
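One of the approaches mentioned above, eliminating collinear variables, might be sketched as follows. This is an illustrative procedure, not a prescribed method. The helper names `vifs` and `drop_high_vif`, the default threshold, and the synthetic data are all assumptions. The VIFs are computed using the standard identity that they equal the diagonal of the inverse correlation matrix of the predictors.

```python
import numpy as np

def vifs(X):
    """VIFs of the columns of X: the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))

def drop_high_vif(X, names, threshold=10.0):
    """Repeatedly drop the column with the largest VIF until all are <= threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        values = vifs(X)
        worst = int(np.argmax(values))
        if values[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names

# Synthetic data (an assumption for illustration): height is recorded twice,
# in centimetres and in inches, so the two columns are nearly redundant.
rng = np.random.default_rng(2)
height_cm = rng.normal(170, 10, size=300)
height_in = height_cm / 2.54 + rng.normal(scale=0.1, size=300)
weight = rng.normal(70, 12, size=300)

_, kept = drop_high_vif(np.column_stack([height_cm, height_in, weight]),
                        ["height_cm", "height_in", "weight"])
print(kept)
```

A purely mechanical rule like this one cannot tell which of two redundant variables is more meaningful to keep, so in practice the choice is usually guided by subject-matter knowledge.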

### Formula and Calculation of VIF
The formula for VIF is:
$$ VIF_i = \frac{1}{1 - {R^2}_i} $$

where:
- $ {R^2}_i $ : Unadjusted coefficient of determination for regressing the $ i^{th} $ independent variable on the remaining ones
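The formula can be followed step by step in code. The sketch below assumes NumPy is available and uses a hypothetical helper, `vif_from_formula`, together with synthetic data. For each column it runs the auxiliary regression on the remaining columns, computes the unadjusted $ {R^2}_i $, and applies $ 1 / (1 - {R^2}_i) $.

```python
import numpy as np

def vif_from_formula(X, i):
    """VIF of column i of X, computed as 1 / (1 - R_i^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    # Auxiliary regression of column i on the remaining columns plus an intercept.
    A = np.column_stack([np.ones(len(y)), others])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    r_squared = 1.0 - residuals @ residuals / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r_squared)

# Synthetic data (an assumption for illustration): c depends on both a and b.
rng = np.random.default_rng(1)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
c = a + b + rng.normal(scale=0.5, size=1000)
X = np.column_stack([a, b, c])

vifs = [vif_from_formula(X, i) for i in range(X.shape[1])]
print(vifs)
```

Because `c` is built from both `a` and `b`, it should receive the largest VIF of the three columns.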

**What Can VIF Tell You?**

When $ {R^2}_i $ is equal to $ 0 $, $ VIF_i $ is equal to $ 1 $, as is the tolerance (the reciprocal of the VIF, $ 1 - {R^2}_i $). In that case the $ i^{th} $ independent variable is not correlated with the remaining ones, meaning that multicollinearity does not exist for that variable.
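As a quick worked check of the formula, plugging in a few values of $ {R^2}_i $ (chosen here only for the arithmetic, not taken from any dataset) gives:

$$ {R^2}_i = 0 \Rightarrow VIF_i = \frac{1}{1 - 0} = 1, \qquad {R^2}_i = 0.5 \Rightarrow VIF_i = 2, \qquad {R^2}_i = 0.8 \Rightarrow VIF_i = 5, \qquad {R^2}_i = 0.9 \Rightarrow VIF_i = 10 $$

So a VIF of 5 corresponds to 80% of a variable's variance being explained by the other independent variables, and a VIF of 10 corresponds to 90%.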

In general terms,

- VIF equal to 1 = variables are not correlated
- VIF between 1 and 5 = variables are moderately correlated 
- VIF greater than 5 = variables are highly correlated

The higher the VIF, the more likely it is that multicollinearity is present, and the more further investigation is warranted. A common rule of thumb treats a VIF above 10 as a sign of significant multicollinearity that needs to be corrected.
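The thresholds listed above can be wrapped in a small helper for reporting. The function name `describe_vif`, its labels, and the floating-point tolerance are assumptions made for illustration. The cut-offs at 5 and 10 are the rule-of-thumb values from the text, not hard statistical limits.

```python
import math

def describe_vif(vif, tol=1e-9):
    """Map a VIF value to the rule-of-thumb labels from the text."""
    if math.isclose(vif, 1.0, abs_tol=tol):
        return "not correlated"
    if vif < 1.0:
        # 1 / (1 - R^2) with 0 <= R^2 < 1 can never fall below 1.
        raise ValueError("a VIF cannot be smaller than 1")
    if vif <= 5.0:
        return "moderately correlated"
    if vif <= 10.0:
        return "highly correlated"
    return "highly correlated, correction recommended"

for value in (1.0, 3.2, 7.0, 25.0):
    print(value, "->", describe_vif(value))
```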

--------------------

# Curated Articles - Introduction to Supervised Learning and Regression

- [Linear Regression](https://www.knowledgehut.com/blog/data-science/linear-regression-for-machine-learning): This article explains the linear regression algorithm along with the underlying assumptions. It talks about the various performance measures used to evaluate the model.

- [Maximum Likelihood Estimation](https://brilliant.org/wiki/maximum-likelihood-estimation-mle/): This article explains the concept of maximum likelihood estimation and why it is used in machine learning with the help of examples.

- [Empirical Risk Minimization](https://prateekvjoshi.com/2017/08/19/what-is-empirical-risk-minimization/): This article gives an intuition behind empirical risk minimization.

------------------