![Logo](logo.jpg)

# Interpretable Analysis of California Housing Prices Using EBM

In this notebook, we will use the Explainable Boosting Machine (EBM) model to predict housing prices in California and understand which factors have the greatest impact on those prices.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from interpret.glassbox import ExplainableBoostingRegressor
from interpret import show
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
%matplotlib inline

## Loading and Preparing the Data

We use the California Housing dataset, which includes the following features:

- MedInc: median income in the block group
- HouseAge: median age of houses in the block group
- AveRooms: average number of rooms per household
- AveBedrms: average number of bedrooms per household
- Population: population of the block group
- AveOccup: average number of occupants per household
- Latitude: geographic latitude of the block group
- Longitude: geographic longitude of the block group

Target variable:

- Median house value in the block group (in hundreds of thousands of dollars)

In [10]:
# Loading data
california = fetch_california_housing()
data = pd.DataFrame(california.data, columns=california.feature_names)
data['price'] = california.target

print("Example data:")
data.head()

Example data:


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


## Descriptive Statistics

In [11]:
print("\nDescriptive statistics:")
data.describe()


Descriptive statistics:


| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | price |
|---|---|---|---|---|---|---|---|---|---|
| count | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 | 20640.0 |
| mean | 3.870671 | 28.639486 | 5.429 | 1.096675 | 1425.476744 | 3.070655 | 35.631861 | -119.569704 | 2.068558 |
| std | 1.899822 | 12.585558 | 2.474173 | 0.473911 | 1132.462122 | 10.38605 | 2.135952 | 2.003532 | 1.153956 |
| min | 0.4999 | 1.0 | 0.846154 | 0.333333 | 3.0 | 0.692308 | 32.54 | -124.35 | 0.14999 |
| 25% | 2.5634 | 18.0 | 4.440716 | 1.006079 | 787.0 | 2.429741 | 33.93 | -121.8 | 1.196 |
| 50% | 3.5348 | 29.0 | 5.229129 | 1.04878 | 1166.0 | 2.818116 | 34.26 | -118.49 | 1.797 |
| 75% | 4.74325 | 37.0 | 6.052381 | 1.099526 | 1725.0 | 3.282261 | 37.71 | -118.01 | 2.64725 |
| max | 15.0001 | 52.0 | 141.909091 | 34.066667 | 35682.0 | 1243.333333 | 41.95 | -114.31 | 5.00001 |


## Preparing Data for Modeling

In [12]:
# Split into features and target variable
X = data.drop('price', axis=1)
y = data['price']

# Feature names
feature_names = list(X.columns)

# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scaling features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert to DataFrame with feature names
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_names)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=feature_names)

print("Dimensions of datasets:")
print(f"Training set: {X_train_scaled.shape}")
print(f"Test set: {X_test_scaled.shape}")

Dimensions of datasets:
Training set: (16512, 8)
Test set: (4128, 8)
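
Because the model is trained on standardized features, the feature axes in the EBM plots are expressed in standard-deviation units rather than in the original units. The sketch below shows the forward and inverse standardization on a small toy array (the values are illustrative, not taken from the dataset). With the fitted `scaler` from the cell above, `scaler.inverse_transform` should perform the same mapping using `scaler.mean_` and `scaler.scale_`.

```python
import numpy as np

# Toy stand-in for two feature columns (illustrative values only).
X_train_toy = np.array([[2.0, 10.0],
                        [4.0, 30.0],
                        [6.0, 50.0]])

# StandardScaler-style statistics: per-column mean and population std (ddof=0).
mean = X_train_toy.mean(axis=0)
std = X_train_toy.std(axis=0)

# Forward transform: z = (x - mean) / std
Z = (X_train_toy - mean) / std

# Inverse transform: x = z * std + mean
X_back = Z * std + mean

print("Standardized first column:", Z[:, 0])
print("Recovered first column:   ", X_back[:, 0])
```

A value read off a plot axis, such as a scaled MedInc of 1.0, can be converted back the same way: multiply by the column's standard deviation and add its mean.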


## Model Training

We use `ExplainableBoostingRegressor` because the target, the median house price, is a continuous value, which makes this a regression problem.

In [13]:
# Initialize and train the EBM
ebm = ExplainableBoostingRegressor(random_state=42, feature_names=feature_names)
ebm.fit(X_train_scaled, y_train)

# Model evaluation
from sklearn.metrics import r2_score, mean_squared_error
y_pred = ebm.predict(X_test_scaled)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"R² coefficient: {r2:.3f}")
print(f"RMSE: {rmse:.3f} (in hundreds of thousands of dollars)")

R² coefficient: 0.831
RMSE: 0.470 (in hundreds of thousands of dollars)
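
To make the two metrics concrete, here is a minimal sketch that computes RMSE and R² directly from their definitions and converts the RMSE into dollars. The toy values and the helper names `rmse`, `r2`, and `UNIT_DOLLARS` are illustrative assumptions. The only numbers taken from the notebook are the reported RMSE of 0.470 and the target unit of $100,000.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy prices in units of $100,000 (illustrative, not from the dataset).
y_true_toy = [1.0, 2.0, 3.0, 4.0]
y_pred_toy = [1.5, 2.0, 2.5, 4.0]

toy_rmse = rmse(y_true_toy, y_pred_toy)
toy_r2 = r2(y_true_toy, y_pred_toy)

UNIT_DOLLARS = 100_000  # the target is expressed in units of $100,000
print(f"Toy RMSE: {toy_rmse:.3f} -> ${toy_rmse * UNIT_DOLLARS:,.0f}")
print(f"Toy R²: {toy_r2:.3f}")

# The RMSE reported above, converted to dollars.
reported_rmse_dollars = 0.470 * UNIT_DOLLARS
print(f"Reported RMSE: about ${reported_rmse_dollars:,.0f}")
```

Read this way, the reported RMSE of 0.470 corresponds to a typical prediction error of roughly $47,000.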


## Global Interpretation

Let's see which features contribute most to the model's house price predictions overall:

In [None]:
# Global interpretation
global_explanation = ebm.explain_global()
show(global_explanation)
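
The overall ranking in a global explanation can be understood as a summary of per-sample contributions. The sketch below assumes that a term's importance is the mean absolute contribution of that term across samples. This is our reading of the default importance measure and should be checked against the library documentation. The contribution matrix is made up for illustration and does not come from the fitted model.

```python
import numpy as np

# Hypothetical per-sample contributions (rows: samples, columns: terms),
# in units of $100,000. These numbers are illustrative only.
term_names = ["MedInc", "HouseAge", "Latitude"]
contributions = np.array([
    [ 0.8, -0.1,  0.3],
    [-0.6,  0.1, -0.3],
    [ 0.4,  0.0,  0.3],
    [-0.2, -0.2, -0.3],
])

# Assumed importance measure: mean absolute contribution per term.
importances = np.abs(contributions).mean(axis=0)

# Rank terms from most to least important.
ranking = sorted(zip(term_names, importances), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(f"{name:10s} {score:.2f}")
```

Taking the absolute value matters here: a feature that pushes some prices up and others down by large amounts is still important, even though its contributions average out near zero.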

## Local Interpretation

Let's examine how individual features contribute to the predicted price for the first 14 test samples:

In [18]:
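# The model was trained on scaled features, so explain the scaled test rows
sample = X_test_scaled.iloc[:14]
y_pred_sample = ebm.predict(sample)
# Ground truth for the same rows (positional slice)
y_true_sample = y_test.iloc[:14]
local_explanation = ebm.explain_local(sample, y_true_sample)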
show(local_explanation)
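
A local explanation rests on the additive structure of the model: a prediction is an intercept plus one contribution per term, and each contribution is looked up from that term's shape function. The sketch below illustrates the idea with two hypothetical step-shaped functions. The intercept, bin edges, scores, and the helper `explain_one` are all assumptions made for illustration and are not read from the fitted `ebm`. A real EBM may also include pairwise interaction terms, which are omitted here.

```python
import numpy as np

# Hypothetical baseline prediction (units of $100,000).
intercept = 2.0

# Hypothetical shape functions over scaled features:
# bin edges plus one score per bin (len(scores) == len(edges) + 1).
shape_functions = {
    "MedInc":   {"edges": np.array([-0.5, 0.5]), "scores": np.array([-0.6, 0.0, 0.9])},
    "HouseAge": {"edges": np.array([0.0]),       "scores": np.array([-0.1, 0.1])},
}

def explain_one(sample):
    """Return the prediction and the per-feature contributions for one sample."""
    contributions = {}
    for name, fn in shape_functions.items():
        # Find the bin the feature value falls into, then look up its score.
        bin_index = np.digitize(sample[name], fn["edges"])
        contributions[name] = float(fn["scores"][bin_index])
    prediction = intercept + sum(contributions.values())
    return prediction, contributions

prediction, contributions = explain_one({"MedInc": 1.2, "HouseAge": -0.3})
print("Contributions:", contributions)
print(f"Prediction: {prediction:.2f}")
```

In this toy case the high income pushes the prediction up by 0.9 and the house age pulls it down by 0.1, which is the kind of breakdown the local explanation plot shows for each test sample.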

## Conclusions

The EBM model allows us to understand:
1. Which factors have the greatest impact on house prices (global interpretation)
2. How the feature values of a specific house affect its predicted price (local interpretation)
3. Which nonlinear relationships exist between the features and house prices