The California Housing dataset describes housing in California at the block group level. Each row holds the features of one block group together with its median house value.

The dataset consists of 20,640 samples (rows).
There are 8 features (columns) that describe the characteristics of each block group, plus one target column.

Features: The dataset contains the following 8 features:

- MedInc: Median income in the block group (in $10,000s).
- HouseAge: Median house age (in years).
- AveRooms: Average number of rooms per household.
- AveBedrms: Average number of bedrooms per household.
- Population: Population of the block group.
- AveOccup: Average number of occupants per household.
- Latitude: Latitude of the block group (geographical location).
- Longitude: Longitude of the block group (geographical location).

Target: The value to predict is stored in a separate column:

- MedHouseVal: Median house value (in $100,000s).
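Because MedInc and MedHouseVal are stored in scaled units, it can help to convert a value back to dollars when reading the table. A minimal sketch, using the MedInc and MedHouseVal values from row 0 of the table printed by the first cell (the variable names are illustrative only):

```python
# Values copied from row 0 of the displayed table
med_inc = 8.3252        # MedInc, stored in units of $10,000
med_house_val = 4.526   # MedHouseVal, stored in units of $100,000

# Undo the scaling; round() absorbs floating-point noise
income_usd = round(med_inc * 10_000)
house_value_usd = round(med_house_val * 100_000)

print(f"Median income: ${income_usd:,}")
print(f"Median house value: ${house_value_usd:,}")
```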

In [59]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.feature_selection import RFE
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.metrics import mean_squared_error
import numpy as np  # needed for np.array when printing the selected feature names
import pandas as pd

data = fetch_california_housing(as_frame=True)
# Show the 8 features next to the MedHouseVal target (concat already returns a DataFrame)
pd.concat([data.data, data.target], axis=1)

,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedHouseVal
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847
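The next cell uses SelectKBest with f_regression, which scores each feature on its own. As far as the scikit-learn documentation describes it, the score is the Pearson correlation r between the feature and the target, converted to an F-statistic, roughly F = r² / (1 − r²) × (n − 2). The sketch below reproduces that calculation in plain Python on only the first five rows shown above, so the numbers are illustrative and not the scores for the full dataset. The helper names pearson_r and f_score are made up for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def f_score(xs, ys):
    """Univariate F-statistic derived from the correlation (n - 2 degrees of freedom)."""
    r = pearson_r(xs, ys)
    dof = len(xs) - 2
    return r ** 2 / (1 - r ** 2) * dof

# Rows 0-4 of the table above
med_inc = [8.3252, 8.3014, 7.2574, 5.6431, 3.8462]
house_age = [41.0, 21.0, 52.0, 52.0, 52.0]
target = [4.526, 3.585, 3.521, 3.413, 3.422]

f_med_inc = f_score(med_inc, target)
f_house_age = f_score(house_age, target)
print("MedInc F-score:", f_med_inc)
print("HouseAge F-score:", f_house_age)
```

Because each feature is scored independently, SelectKBest cannot account for redundancy or interactions between features, which is one reason its selection can differ from RFE and SFS.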


In [60]:
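X = data.data    # feature matrix (8 columns)
y = data.target  # MedHouseVal

# Standardize the features to zero mean and unit variance.
# Note: the scaler is fit on all rows before the split, so test-set statistics
# influence the scaling; fitting it on X_train only would avoid this leakage.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)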
model = LinearRegression()

# ------------------ SelectKBest ------------------
select_k_best = SelectKBest(f_regression, k=5)
X_train_kbest = select_k_best.fit_transform(X_train, y_train)
X_test_kbest = select_k_best.transform(X_test)
# Train the model on selected features
model.fit(X_train_kbest, y_train)
y_pred_kbest = model.predict(X_test_kbest)
mse_kbest = mean_squared_error(y_test, y_pred_kbest)

# ------------------ Recursive Feature Elimination (RFE) ------------------
rfe = RFE(estimator=model, n_features_to_select=5)
X_train_rfe = rfe.fit_transform(X_train, y_train)
X_test_rfe = rfe.transform(X_test)
# Train the model on selected features
model.fit(X_train_rfe, y_train)
y_pred_rfe = model.predict(X_test_rfe)
mse_rfe = mean_squared_error(y_test, y_pred_rfe)

# ------------------ Sequential Feature Selection (SFS) ------------------
sfs = SequentialFeatureSelector(estimator=model, n_features_to_select=5, direction='forward', cv=5)
X_train_sfs = sfs.fit_transform(X_train, y_train)
X_test_sfs = sfs.transform(X_test)
# Train the model on selected features
model.fit(X_train_sfs, y_train)
y_pred_sfs = model.predict(X_test_sfs)
mse_sfs = mean_squared_error(y_test, y_pred_sfs)

print("SelectKBest MSE:", mse_kbest)
print("RFE MSE:", mse_rfe)
print("SFS MSE:", mse_sfs)

print("Selected features for SelectKBest:", np.array(data["feature_names"])[select_k_best.get_support()])
print("Selected features for RFE:", np.array(data["feature_names"])[rfe.get_support()])
print("Selected features for SFS:", np.array(data["feature_names"])[sfs.get_support()])
# or use sfs.get_feature_names_out(data["feature_names"])

SelectKBest MSE: 0.638256544155592
RFE MSE: 0.5667695170781499
SFS MSE: 0.5516459297026114
Selected features for SelectKBest: ['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Latitude']
Selected features for RFE: ['MedInc' 'AveRooms' 'AveBedrms' 'Latitude' 'Longitude']
Selected features for SFS: ['MedInc' 'HouseAge' 'AveBedrms' 'Latitude' 'Longitude']
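Since MedHouseVal is expressed in $100,000s, the MSE values above are in squared units of $100,000. Taking the square root and rescaling gives a rough typical error in dollars. A small sketch using the three MSE values printed above (the dictionary and variable names are illustrative):

```python
import math

# MSE values copied from the output above (target is in units of $100,000)
mse_by_method = {
    "SelectKBest": 0.638256544155592,
    "RFE": 0.5667695170781499,
    "SFS": 0.5516459297026114,
}

# RMSE in dollars: square root of the MSE, then undo the $100,000 scaling
rmse_usd = {name: math.sqrt(mse) * 100_000 for name, mse in mse_by_method.items()}

for name, value in rmse_usd.items():
    print(f"{name}: RMSE of about ${value:,.0f}")

# Method with the lowest test error on this particular split
best_method = min(mse_by_method, key=mse_by_method.get)
print("Lowest MSE:", best_method)
```

The ranking comes from a single train/test split with random_state=42, so the gaps between methods may not hold for other splits.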

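The cell above scales the full dataset before splitting it. One possible alternative is to wrap scaling, selection, and the model in a scikit-learn Pipeline, so that every step is fit on the training rows only. The sketch below is not the notebook's method: it uses synthetic data from make_regression (so it runs without downloading anything), and the step names "scale", "select", and "model" are arbitrary labels chosen for this example.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing data: 8 features, 5 of them informative
X, y = make_regression(n_samples=500, n_features=8, n_informative=5,
                       noise=10.0, random_state=42)

# Split first, so the test rows never influence scaling or selection
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),                 # fit on training rows only
    ("select", SelectKBest(f_regression, k=5)),  # keep the 5 highest-scoring features
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

mse_pipe = mean_squared_error(y_test, pipe.predict(X_test))
n_selected = int(pipe.named_steps["select"].get_support().sum())
print("Pipeline MSE:", mse_pipe)
print("Features kept:", n_selected)
```

The same pattern should work with RFE or SequentialFeatureSelector in place of the "select" step, since both are transformers as well.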