**"Proxy features"** in AI ethics are features that stand in for protected characteristics, letting a model learn hidden biases from the data and carry them into its predictions or outcomes. Even if an AI model does not explicitly use protected characteristics such as race, gender, or income, it can rely on other data points that indirectly represent them, such as ZIP codes, job titles, or buying behaviour. This can result in discriminatory decisions, for example discrimination by colour, even when discrimination is not the goal. The risk is that the AI gives inconsistent judgments and evaluates individuals unfairly on the basis of an indirect signal. This is why fairness and transparency are two main concerns in AI development.

## Variance Inflation Factor (VIF)

1. VIF is based on the coefficient of determination (R²) obtained by regressing each explanatory variable on all the others: VIF = 1 / (1 − R²). It is widely used to detect and remove collinear or multicollinear features in statistical modelling.

2. The same method can be used to detect and remove features that are proxies for a given protected/sensitive feature.
3. Typically, it is used for dropping multicollinear (redundant) features. Internally, VIF fits one regression per feature against all the remaining features and returns a score for each feature; features with high scores are strongly correlated with the others.
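The definition above can be sketched by hand to show what the score measures. This is a minimal illustration on synthetic data, not the statsmodels implementation: the helper name `manual_vif` and the columns `a`, `b`, `c` are hypothetical. The helper adds its own intercept, whereas statsmodels' `variance_inflation_factor` expects the constant column to be supplied by the caller.

```python
import numpy as np

def manual_vif(X: np.ndarray) -> np.ndarray:
    """Return VIF_j = 1 / (1 - R_j^2) for every column j of X."""
    n_rows, n_cols = X.shape
    vifs = np.empty(n_cols)
    for j in range(n_cols):
        target = X[:, j]
        # Regress column j on all other columns plus an intercept
        others = np.column_stack([np.ones(n_rows), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = np.sum(residuals ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs

# Synthetic demo: `c` is nearly a copy of `a`, `b` is independent
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
demo_vifs = manual_vif(np.column_stack([a, b, c]))
print(dict(zip(["a", "b", "c"], demo_vifs.round(2))))
```

In this toy setup `a` and `c` should receive large scores because each is almost fully explained by the other, while `b` should stay close to 1.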


In [1]:
import seaborn as sns
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/loan-approvals/loan_data.csv


In [2]:
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

In [3]:
df = pd.read_csv("/kaggle/input/loan-approvals/loan_data.csv")

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Gender              601 non-null    object 
 1   Married             611 non-null    object 
 2   Dependents          599 non-null    object 
 3   Education           614 non-null    object 
 4   Self_Employed       582 non-null    object 
 5   Applicant_Income    614 non-null    int64  
 6   Coapplicant_Income  614 non-null    float64
 7   Loan_Amount         614 non-null    int64  
 8   Term                600 non-null    float64
 9   Credit_History      564 non-null    float64
 10  Area                614 non-null    object 
 11  Status              614 non-null    object 
dtypes: float64(3), int64(2), object(7)
memory usage: 57.7+ KB


In [5]:
df = df.dropna()

In [6]:
# Convert categorical columns to numerical using one-hot encoding
df_encoded = pd.get_dummies(df, drop_first=True)

In [7]:
# Ensure all data is numeric
df_encoded = df_encoded.astype(float)

In [8]:
# Add constant column (required for VIF computation)
X = add_constant(df_encoded)

In [9]:
# Compute VIF for each feature
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

In [10]:
# Print VIF values
print(vif_data)

                  Variable        VIF
0                    const  50.704496
1         Applicant_Income   1.419236
2       Coapplicant_Income   1.136774
3              Loan_Amount   1.477853
4                     Term   1.052287
5           Credit_History   1.400720
6              Gender_Male   1.208174
7              Married_Yes   1.370920
8             Dependents_1   1.207056
9             Dependents_2   1.250409
10           Dependents_3+   1.171462
11  Education_Not Graduate   1.082594
12       Self_Employed_Yes   1.049295
13          Area_Semiurban   1.501173
14              Area_Urban   1.488556
15                Status_Y   1.463456


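**Key Observations:**

1. *Constant (const) → 50.70*: This is expected to be high. It represents the intercept, and we can ignore this value.

2. *Independent variables (all ≈ 1.5 or lower)*: All features have VIF well below 5, meaning no significant multicollinearity is present. This suggests that each variable provides largely independent information to the model.

3. *Highest VIF values (still acceptable)*: Area_Semiurban (1.50), Area_Urban (1.49), Loan_Amount (1.48). These values are far below 5, so there is no severe collinearity.

For proxy detection, the relevant row is the protected feature itself. Gender_Male has a VIF of 1.21, which corresponds to R² = 1 − 1/1.21 ≈ 0.17: the remaining columns explain only a small share of its variance in a linear regression, so no strong linear proxy for gender shows up here. Note that the target column Status_Y was included in this VIF run, so its row and its contribution to the other scores should be read with that in mind.


✅ No strong multicollinearity detected → collinearity should not destabilise a model fitted on these features.

✅ No need to drop or transform variables on collinearity grounds → since all VIFs are low.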
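A more targeted proxy check regresses the protected feature on the remaining features and looks at the resulting R² directly. The sketch below uses synthetic data, not the loan dataset; the column names `Job_Title_Code` and `Income_Z` and the helper `proxy_r2` are hypothetical. It also illustrates a caveat: the common VIF threshold of 5 corresponds to R² = 0.8, so a feature can track a protected attribute quite closely and still fall under that threshold.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Synthetic, hypothetical data (assumption: binary protected attribute)
gender_male = rng.integers(0, 2, size=n).astype(float)
toy = pd.DataFrame({
    "Gender_Male": gender_male,
    # Hypothetical proxy: matches the protected attribute about 90% of the time
    "Job_Title_Code": np.where(rng.random(n) < 0.9, gender_male, 1 - gender_male),
    # Hypothetical unrelated feature (standardised income)
    "Income_Z": rng.normal(size=n),
})

def proxy_r2(frame: pd.DataFrame, protected: str) -> float:
    """R^2 of a linear regression of the protected column on all other columns."""
    y = frame[protected].to_numpy()
    X = frame.drop(columns=[protected]).to_numpy()
    X = np.column_stack([np.ones(len(X)), X])  # add intercept
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    return 1.0 - residuals.var() / y.var()

r2 = proxy_r2(toy, "Gender_Male")
implied_vif = 1.0 / (1.0 - r2)

# Absolute correlation of each feature with the protected attribute
per_feature = toy.drop(columns=["Gender_Male"]).corrwith(toy["Gender_Male"]).abs()

print(f"R^2 = {r2:.2f}, implied VIF = {implied_vif:.2f}")
print(per_feature.sort_values(ascending=False))
```

Here the proxy column should rank first by correlation while the implied VIF stays below 5, which is why inspecting R² or per-feature correlations against the protected attribute can be more informative than a single VIF cutoff.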


In [11]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [12]:
# Target variable (loan approval status)
y = df_encoded['Status_Y']

# Drop the target variable to get the independent variables
X = df_encoded.drop(['Status_Y'], axis=1)

# Initialize the Decision Tree Regressor model
tree_regressor = DecisionTreeRegressor(random_state=42)

# Store the explained variance (R^2) for each feature
explained_variance = {}

# Fit the model for each feature individually and calculate R^2
for feature in X.columns:
    # Reshape feature to 2D array for fitting
    X_feature = X[[feature]]
    
    # Split into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X_feature, y, test_size=0.3, random_state=42)
    
    # Fit the decision tree model
    tree_regressor.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = tree_regressor.predict(X_test)
    
    # Calculate R^2 (coefficient of determination)
    r2 = r2_score(y_test, y_pred)
    
    # Store the result
    explained_variance[feature] = r2

# Display the explained variance (R^2) for each feature
explained_variance_df = pd.DataFrame(list(explained_variance.items()), columns=['Feature', 'R^2'])
print(explained_variance_df)

                   Feature       R^2
0         Applicant_Income -0.534105
1       Coapplicant_Income -0.287240
2              Loan_Amount -0.539080
3                     Term -0.076980
4           Credit_History  0.387557
5              Gender_Male -0.036321
6              Married_Yes -0.020374
7             Dependents_1 -0.033771
8             Dependents_2 -0.026825
9            Dependents_3+ -0.037334
10  Education_Not Graduate -0.026508
11       Self_Employed_Yes -0.036095
12          Area_Semiurban -0.016739
13              Area_Urban -0.034599


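Credit_History is the only feature that predicts loan approval status on its own, with a moderate positive R² (0.39).

All other features have a negative R² on the test set, meaning that a tree trained on that feature alone predicts worse than simply using the mean approval rate. This indicates that they have little individual predictive relationship with loan approval. The strongly negative values for the continuous features (Applicant_Income, Coapplicant_Income, Loan_Amount) most likely reflect an unconstrained tree overfitting the training data. A low R² against the target does not by itself show whether a feature is a proxy: that depends on how well the feature predicts the protected attribute, not the loan status.

In a practical scenario, this suggests that credit history carries most of the single-feature signal, and that the model should avoid overfitting to features that explain little of the target variable’s variance.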
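To see how a negative test R² can arise from overfitting alone, the following sketch fits a fully grown tree and a depth-limited tree on a synthetic feature that is pure noise. The data, the 70% approval rate, and the helper `holdout_r2` are assumptions made for illustration; they are not taken from the loan dataset.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 600

# Synthetic data: a continuous feature unrelated to a binary target
noise_feature = rng.normal(size=(n, 1))
approved = (rng.random(n) < 0.7).astype(float)  # assumed ~70% approval rate

X_train, X_test, y_train, y_test = train_test_split(
    noise_feature, approved, test_size=0.3, random_state=42
)

def holdout_r2(max_depth):
    """Fit a tree with the given depth limit and return R^2 on the held-out set."""
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
    tree.fit(X_train, y_train)
    return r2_score(y_test, tree.predict(X_test))

deep_r2 = holdout_r2(None)   # fully grown tree memorises the training set
shallow_r2 = holdout_r2(1)   # a single split stays close to the mean prediction

print(f"unconstrained tree R^2: {deep_r2:.2f}")
print(f"depth-1 tree R^2:       {shallow_r2:.2f}")
```

The unconstrained tree should score clearly below zero even though the feature carries no information, while the depth-limited tree should stay near zero. Limiting depth, or using cross-validation, may therefore give a fairer single-feature comparison than a fully grown tree.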