# Topic 03 - Problem 7: Handle Multicollinearity in a Dataset

---

## 1. About the Problem

This problem asks me to detect and handle **multicollinearity** in a dataset.  
**Multicollinearity** occurs when two or more features are highly correlated with each other, which makes the coefficient estimates of a regression model unstable and hard to interpret.  
To check for it, I can inspect the **correlation matrix** and calculate the **Variance Inflation Factor (VIF)** of each feature; the solution below relies on VIF.  
If a feature is highly correlated with the others, I can remove it to reduce multicollinearity.
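
The correlation-matrix check mentioned above is not part of the solution code, so here is a minimal sketch of how it could look with pandas. The helper name `high_correlation_pairs`, the toy frame `toy`, and the 0.8 cutoff are all assumptions made for illustration, not values taken from the problem statement.

```python
import pandas as pd


def high_correlation_pairs(data, threshold=0.8):
    """Return (feature_a, feature_b, r) for every pair with |Pearson r| > threshold."""
    corr = data.corr()  # pairwise Pearson correlation between columns
    cols = corr.columns
    pairs = []
    # Only walk the upper triangle so each pair is reported once
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


# Hypothetical toy data: x2 is an exact multiple of x1, x3 is only weakly related
toy = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10],
    "x3": [5, 1, 4, 2, 3],
})
pairs = high_correlation_pairs(toy)
print(pairs)
```

On this toy frame only the `x1`/`x2` pair exceeds the cutoff. A pairwise check like this can miss a feature that is a combination of several others, which is one reason to look at VIF as well.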

---


## 2. Solution Code

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def calculate_vif(data):
    """Return a DataFrame listing the VIF of every column in `data`."""
    # VIF needs at least two features to compare against each other
    if len(data.columns) < 2:
        return pd.DataFrame(columns=["features", "VIF"])

    vif_data = pd.DataFrame()
    vif_data["features"] = data.columns
    # variance_inflation_factor does not add an intercept column itself;
    # the features are passed as-is, so no constant term enters the regressions.
    vif_data["VIF"] = [
        variance_inflation_factor(data.values, i) for i in range(len(data.columns))
    ]
    return vif_data


def remove_multicollinearity(data, threshold=5.0):
    """Repeatedly drop the feature with the highest VIF until all VIFs are <= threshold."""
    vif_data = calculate_vif(data)

    # Stop once every VIF is within the threshold or only one feature remains
    while len(data.columns) > 1 and vif_data["VIF"].max() > threshold:
        remove = vif_data.loc[vif_data["VIF"].idxmax(), "features"]
        print(f"Removing feature: {remove} with VIF: {vif_data['VIF'].max()}")
        data = data.drop(columns=remove)
        vif_data = calculate_vif(data)

    return data


# Testing the function with a sample dataset
data = {
    "age": [25, 30, 28, 60, 70, 55, 65],
    "salary": [50000, 60000, 55000, 68000, 75000, 57000, 63000],
    "experience": [2, 5, 3, 10, 15, 8, 12],
    "years_in_industry": [1, 3, 2, 8, 10, 5, 7],
}

df = pd.DataFrame(data)

# Removing features with high multicollinearity
cleaned_data = remove_multicollinearity(df, threshold=5.0)
print("\nCleaned Dataset after removing multicollinearity:\n", cleaned_data)
```



```text
Removing feature: experience with VIF: 124.52036088076348
Removing feature: age with VIF: 69.55615830317608
Removing feature: salary with VIF: 5.921950115943668

Cleaned Dataset after removing multicollinearity:
    years_in_industry
0                  1
1                  3
2                  2
3                  8
4                 10
5                  5
6                  7
```

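One thing worth noting about the run above: statsmodels' `variance_inflation_factor` uses the columns exactly as given and does not add an intercept, so the regressions behind those VIF values have no constant term. The textbook definition, $\text{VIF}_j = 1 / (1 - R_j^2)$ with $R_j^2$ taken from regressing feature $j$ on the other features plus an intercept, can give quite different numbers. As a rough sketch of that centered version, the VIFs equal the diagonal of the inverse of the feature correlation matrix. The names `centered_vif` and `toy` below are hypothetical, and the toy data is made up for illustration.

```python
import numpy as np
import pandas as pd


def centered_vif(data):
    """Centered VIFs: diagonal of the inverse of the feature correlation matrix."""
    corr = data.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=data.columns, name="VIF")


# Hypothetical toy data: x2 is x1 plus a small perturbation,
# x3 is constructed to be uncorrelated with both x1 and x2.
toy = pd.DataFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0],
    "x2": [1.1, 1.9, 2.9, 4.1, 0.9, 2.1, 3.1, 3.9],
    "x3": [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 2.0],
})
vif = centered_vif(toy)
print(vif)
```

Here `x1` and `x2` come out with a VIF of roughly 126 each, while `x3` stays at about 1. An alternative that keeps the statsmodels call is to add a constant column (for example with `statsmodels.api.add_constant`) before computing the VIFs and to ignore the value reported for that constant.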

---

## 3. Summary / Takeaways

By solving this problem, I learned how to detect **multicollinearity** using the **Variance Inflation Factor (VIF)**.  
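I understood that **multicollinearity** inflates the variance of a linear model's coefficient estimates, making them less reliable.  
Removing features with high VIF makes the remaining coefficients more stable and easier to interpret, although an aggressive threshold can discard most of the dataset: in the example above only `years_in_industry` survived.  
Handling multicollinearity matters most when the coefficients of a linear model need to be interpreted; predictive accuracy is usually affected much less.  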
Next, I want to explore **feature scaling** techniques to further optimize the dataset for machine learning.
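
To make the point about unreliable coefficients concrete, here is a small simulation sketch using only NumPy. It fits ordinary least squares on many synthetic datasets and measures how much the estimated coefficient of the first feature varies, once with uncorrelated features and once with a correlation of 0.99. The function name `coefficient_spread`, the true coefficients (2.0 and 1.0), the sample sizes, and the seed are all assumptions chosen for illustration.

```python
import numpy as np


def coefficient_spread(rho, n_samples=100, n_trials=300, seed=0):
    """Std. dev. of the fitted x1 coefficient across repeated noisy datasets."""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])  # feature correlation is rho
    estimates = []
    for _ in range(n_trials):
        X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=n_samples)
        # Hypothetical true model: y = 2*x1 + 1*x2 + noise
        y = 2.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=1.0, size=n_samples)
        design = np.column_stack([np.ones(n_samples), X])  # add intercept column
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)
        estimates.append(beta[1])  # coefficient of x1
    return float(np.std(estimates))


spread_uncorrelated = coefficient_spread(rho=0.0)
spread_correlated = coefficient_spread(rho=0.99)
print(spread_uncorrelated, spread_correlated)
```

With strongly correlated features the estimates scatter far more widely around the true value, which is the instability that a high VIF signals.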
