In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns',100)
pd.set_option('display.max_rows',100)

plt.style.use('dark_background')

# Variance Inflation Factor (VIF)

The **Variance Inflation Factor (VIF)** quantifies how much the variance of an estimated regression coefficient is inflated due to **linear dependence among predictors**.

## Setup

Suppose we have predictors:

$$
X_1, X_2, \dots, X_n
$$

and we are concerned about collinearity among them.

For each predictor $X_j$, we fit an **auxiliary regression**:

$$
\hat X_j = \sum_{\substack{k=1 \\ k \neq j}}^n \beta_k X_k
$$

That is, we try to explain $X_j$ using all of the other predictors.

## Definition

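Let $R_j^2$ be the coefficient of determination of this auxiliary regression. When the regression includes an intercept, this equals the squared correlation between the predicted values $\hat X_j$ and the true values $X_j$.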

The Variance Inflation Factor for $X_j$ is:

$$
\text{VIF}(X_j) = \frac{1}{1 - R_j^2}
$$
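
The definition can be checked by hand. The following is a minimal sketch, not the statsmodels implementation: the helper name `vif_manual`, its `add_intercept` flag, and the toy data are assumptions made for illustration. It fits the auxiliary regression by least squares, computes the centered $R_j^2$, and returns $1/(1 - R_j^2)$.

```python
import numpy as np

def vif_manual(X, j, add_intercept=True):
    """VIF of column j of X, from its auxiliary regression on the other columns.

    Hypothetical helper for illustration; uses the centered R^2.
    """
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    if add_intercept:
        # Prepend a column of ones so the auxiliary regression has an intercept
        others = np.column_stack([np.ones(len(y)), others])
    # Least-squares fit of X_j on the remaining predictors
    beta, *_ = np.linalg.lstsq(others, y, rcond=None)
    resid = y - others @ beta
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 / (1.0 - r2)

# Toy data (assumed): c is nearly a linear combination of a and b
rng = np.random.default_rng(0)
a = rng.standard_normal(500)
b = rng.standard_normal(500)
c = a + b + 0.1 * rng.standard_normal(500)
X = np.column_stack([a, b, c])

vifs = [vif_manual(X, j) for j in range(X.shape[1])]
print(vifs)
```

All three VIFs come out large, because each column can be reconstructed almost exactly from the other two.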

## Interpretation

- If $R_j^2 \approx 0$, then $\text{VIF}(X_j) \approx 1$:  
  the predictor is not explained by the others, so the variance of its coefficient is **not inflated by multicollinearity**.

- If $R_j^2$ is large, then $\text{VIF}(X_j)$ is much greater than 1:  
  the predictor can be well explained by the others, so its coefficient in a regression will have **inflated variance**.

- If $R_j^2 = 1$, then $\text{VIF}(X_j) = \infty$:  
  predictor $X_j$ is an **exact linear combination** of the others (perfect multicollinearity).
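
To make the scale concrete, here are a few worked values of the formula. The chosen values of $R_j^2$ are illustrative only. Since the VIF multiplies the variance of the coefficient estimate, the standard error grows by a factor of $\sqrt{\text{VIF}}$.

$$
\begin{aligned}
R_j^2 = 0.5 &\;\Rightarrow\; \text{VIF} = \frac{1}{1 - 0.5} = 2, & \sqrt{\text{VIF}} &\approx 1.41 \\
R_j^2 = 0.8 &\;\Rightarrow\; \text{VIF} = \frac{1}{1 - 0.8} = 5, & \sqrt{\text{VIF}} &\approx 2.24 \\
R_j^2 = 0.9 &\;\Rightarrow\; \text{VIF} = \frac{1}{1 - 0.9} = 10, & \sqrt{\text{VIF}} &\approx 3.16 \\
R_j^2 = 0.99 &\;\Rightarrow\; \text{VIF} = \frac{1}{1 - 0.99} = 100, & \sqrt{\text{VIF}} &= 10
\end{aligned}
$$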

## Rule of Thumb

- $\text{VIF} < 5$: Generally acceptable.  
- $5 \leq \text{VIF} < 10$: Indicates moderate multicollinearity.  
- $\text{VIF} \geq 10$: Suggests serious multicollinearity issues.
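
These thresholds can be encoded as a small helper. This is a sketch only: the function name `classify_vif` and the label strings are assumptions, and the cutoffs are the rule-of-thumb values listed above rather than a formal test.

```python
def classify_vif(vif):
    """Map a VIF value to the rule-of-thumb category (hypothetical helper)."""
    if vif < 5:
        return "acceptable"
    if vif < 10:
        return "moderate"
    # Covers VIF >= 10, including infinity for perfect collinearity
    return "serious"

for value in [1.2, 7.5, 1612.9, float("inf")]:
    print(value, "->", classify_vif(value))
```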

# Create data

In [10]:
n = 1000  # number of observations (sample size)
x1 = np.random.randn(n)
x2 = np.random.randn(n)

# Create a third variable that is a linear combination of x1 and x2
#x3 = np.random.randn(n)                    # Uncorrelated with x1, x2
x3 = 2*x1 - 3*x2 + 0.05*np.random.randn(n) # Correlated with x1, x2
#x3 = 2*x1 - 3*x2                          # Perfectly collinear with x1, x2

# Put in DataFrame
df = pd.DataFrame({
    "x1": x1,
    "x2": x2,
    "x3": x3
})

df.corr()

| | x1 | x2 | x3 |
|---|---|---|---|
| x1 | 1.0 | 0.009533 | 0.53975 |
| x2 | 0.009533 | 1.0 | -0.836531 |
| x3 | 0.53975 | -0.836531 | 1.0 |


# Compute VIF

In [11]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

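# Compute VIFs without adding a constant.
# Note: variance_inflation_factor does not add an intercept itself. That is
# reasonable here because x1, x2, x3 all have means close to zero. For data
# with nonzero means, add a constant column first
# (e.g. with statsmodels.api.add_constant).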
vif = pd.Series(
    [variance_inflation_factor(df.values, i) 
     for i in range(df.shape[1])],
    index=df.columns
)

print("Variance Inflation Factors (no constant):")
print(vif)


Variance Inflation Factors (no constant):
x1    1612.879658
x2    3794.538308
x3    5354.955004
dtype: float64
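
As a cross-check that does not rely on statsmodels, the VIFs from auxiliary regressions that include an intercept equal the diagonal of the inverse of the predictors' correlation matrix. The sketch below regenerates data with the same recipe as above, but with a fixed seed (an assumption, since the notebook does not set one), so the values will be close to the output above without matching it exactly.

```python
import numpy as np

# Same construction as the notebook, with an assumed seed for reproducibility
rng = np.random.default_rng(42)
n = 1000
x1 = rng.standard_normal(n)
x2 = rng.standard_normal(n)
x3 = 2 * x1 - 3 * x2 + 0.05 * rng.standard_normal(n)
X = np.column_stack([x1, x2, x3])

# Centered VIFs: diagonal of the inverse correlation matrix
corr = np.corrcoef(X, rowvar=False)
vif_centered = np.diag(np.linalg.inv(corr))

for name, v in zip(["x1", "x2", "x3"], vif_centered):
    print(f"{name}: {v:.1f}")
```

Because the simulated predictors have means near zero, these centered values should be of the same order as the no-constant VIFs reported above.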
