# Try various models
This exercise:
1. Prepares the data for the models
2. Runs various models and error estimators

## Prepare data for models

Using the data from step 1, we create X_train and X_test samples for testing the models.
We shall look at random and stratified train/test samples.

Here are the steps we will perform:

1. Create a numeric pipeline that:
    - Imputes missing values in the numerical columns using the median strategy
    - Adds three more columns (add custom transformer):
        - rooms_per_household = total_rooms / households
        - population_per_household = population / households
        - bedrooms_per_room = total_bedrooms / total_rooms
    - Scales all numerical columns
2. Perform One Hot Encoding on ocean_proximity 
3. Drop ocean_proximity from training and labels
4. Perform stratification on "income_cat"
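
The steps above can be sketched as follows. This is a minimal sketch on made-up toy rows, not the implementation of `ca_housing_data_transformer` (that function is imported from `step_2_data_transformer` and is not shown in this notebook). The class name `RatioFeaturesAdder`, its column-index parameters, and the toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class RatioFeaturesAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer that appends the three ratio columns."""

    def __init__(self, rooms_ix, bedrooms_ix, population_ix, households_ix):
        self.rooms_ix = rooms_ix
        self.bedrooms_ix = bedrooms_ix
        self.population_ix = population_ix
        self.households_ix = households_ix

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        rooms_per_household = X[:, self.rooms_ix] / X[:, self.households_ix]
        population_per_household = X[:, self.population_ix] / X[:, self.households_ix]
        bedrooms_per_room = X[:, self.bedrooms_ix] / X[:, self.rooms_ix]
        return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]


# Toy rows (made-up values) with one missing total_bedrooms entry
toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "population": [322.0, 2401.0, 496.0, 558.0],
    "households": [126.0, 1138.0, 177.0, 219.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})
num_cols = ["total_rooms", "total_bedrooms", "population", "households"]

# Sequential numeric steps: impute, add ratio columns, then scale
numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("ratios", RatioFeaturesAdder(rooms_ix=0, bedrooms_ix=1,
                                  population_ix=2, households_ix=3)),
    ("scaler", StandardScaler()),
])

# Independent branches whose outputs are concatenated column-wise
preprocessor = ColumnTransformer(
    [
        ("num", numeric_pipeline, num_cols),
        ("cat", OneHotEncoder(categories=[["NEAR BAY", "INLAND"]]), ["ocean_proximity"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

toy_prepared = preprocessor.fit_transform(toy)
print(toy_prepared.shape)  # (4, 9): 4 numeric + 3 ratio + 2 one-hot columns
```

The column indices passed to the ratio transformer refer to positions in `num_cols`, because `SimpleImputer` hands a plain NumPy array to the next step.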

In [1]:
# Set logging first
import logging

# Configure logging for the notebook
logging.basicConfig(
    level=logging.WARNING,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    datefmt='%Y-%m-%d %H:%M:%S'
)

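# Optionally, add a console handler explicitly, but only if the root logger
# has none; basicConfig normally installs one, and a second handler would
# print every message twice
root_logger = logging.getLogger()
if not root_logger.handlers:
    console = logging.StreamHandler()
    console.setLevel(logging.WARNING)
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s')
    console.setFormatter(formatter)
    root_logger.addHandler(console)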

In [2]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline # Pipeline
from sklearn.model_selection import StratifiedShuffleSplit # Stratified split

In [3]:
# Read saved data
df = pd.read_parquet("../../data/housing-geron.parquet")
df.head(2)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity,income_cat
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY,5
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY,5


In [4]:
# We will use "income_cat" for stratified split
income_categories_col = df['income_cat']
median_house_value_labels = df['median_house_value'] # Series
df = df.drop(columns=['income_cat', 'median_house_value'])

In [5]:
df.head(2)

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
0,-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,NEAR BAY
1,-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,NEAR BAY


In [6]:
median_house_value_labels[:2]

0    452600.0
1    358500.0
Name: median_house_value, dtype: float64

In [7]:
# These are the columns we will use for the numerical pipeline
# The column order in the NumPy output is the same as in the DataFrame
numerical_cols = df.drop(columns=['ocean_proximity']).columns.tolist()
categorical_col = ['ocean_proximity']
ocean_categories = df['ocean_proximity'].unique().tolist()

In [8]:
from step_2_data_transformer import ca_housing_data_transformer

# Prepare the data for the models
data_prepared = ca_housing_data_transformer(
    numerical_cols, categorical_col, ocean_categories)
data_prepared = data_prepared.fit_transform(df)

In [9]:
# Split the prepared features and the labels into train and test sets,
# stratifying on income_cat (data_prepared itself holds features only)
stratified_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=282)

# Get the positional indices of the training and test sets and split the data
for train_index, test_index in stratified_split.split(X=data_prepared,
                                                      y=income_categories_col):
    X_train = data_prepared[train_index]
    y_train = median_house_value_labels.iloc[train_index]  # positional indexing
    X_test = data_prepared[test_index]
    y_test = median_house_value_labels.iloc[test_index]

In [10]:
# Print shapes of X_train, X_test, y_train, y_test
print(f"X_train.shape: {X_train.shape}, X_test.shape: {X_test.shape}")
print(f"y_train.shape: {y_train.shape}, y_test.shape: {y_test.shape}")
print(f"data_prepared.shape: {data_prepared.shape}")

X_train.shape: (16512, 16), X_test.shape: (4128, 16)
y_train.shape: (16512,), y_test.shape: (4128,)
data_prepared.shape: (20640, 16)


__Remark:__ At this point we have X_train, X_test, y_train, and y_test for data in which:
1. Three new ratio columns were added: rooms_per_household, population_per_household, and bedrooms_per_room
2. ocean_proximity was encoded using one-hot encoding (OHE), which added 5 more columns
3. X_train and X_test were drawn by stratified sampling.

What still remains is handling the _clipped_ median house value and housing median age.
 
__Important scikit-learn points:__

1. ColumnTransformer entries are independent; they cannot depend on each other. At the end, the outputs of all entries are concatenated.
2. Pipeline steps, on the other hand, are executed sequentially. The output of `step-n` is fed to the input of `step-n+1`.
3. In our pipeline:
    - SimpleImputer:
        - Takes in raw numeric data (with possible NaNs)
        - Replaces missing values with the median
        - Outputs a NumPy array of the same shape
    - PerHouseholdFeaturesAdder (custom transformer):
        - Takes in the imputed NumPy array
        - Adds new columns: e.g., rooms_per_household, etc.
        - Outputs an array with more columns
    - StandardScaler:
        - Receives the expanded feature matrix
        - Scales each feature to have zero mean and unit variance
        - Outputs a fully scaled feature matrix (the labels were separated earlier and are not scaled)
4. Stratified sampling needs a reference variable that defines the strata. In this case it was the income category.
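
Point 4 can be checked directly by comparing category proportions before and after a split. The sketch below uses a synthetic stand-in for `income_cat` (the proportions are made up, not taken from the housing data) and the same `StratifiedShuffleSplit` settings as the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(282)
# Synthetic stand-in for income_cat: five imbalanced categories
income_cat = pd.Series(
    rng.choice([1, 2, 3, 4, 5], size=10_000, p=[0.04, 0.32, 0.35, 0.18, 0.11]))
features = np.zeros((len(income_cat), 1))  # placeholder; only the row count matters

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=282)
train_index, test_index = next(splitter.split(features, income_cat))

# Share of each category overall, in the training part, and in the test part
comparison = pd.DataFrame({
    "overall": income_cat.value_counts(normalize=True),
    "train": income_cat.iloc[train_index].value_counts(normalize=True),
    "test": income_cat.iloc[test_index].value_counts(normalize=True),
}).sort_index()
print(comparison.round(4))
```

The three columns should agree closely, which is the property a purely random split does not guarantee for small categories.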


## Models

First, let us train the models on all the data (not just the training split) to see how LinearRegression and DecisionTreeRegressor behave. Because X_test is part of that data, the errors measured this way are optimistic.

In [11]:
# Import all the models
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, KFold
from sklearn.metrics import mean_squared_error, r2_score

In [12]:
def print_parameters(model, model_name):
    # Print the parameters of the model
    parameters = model.get_params()
    print(f"{model_name} parameters:")
    for key, value in parameters.items():
        print(f"  {key}: {value}")

In [13]:
# Collect results as lists of dictionaries
prediction_results = []
cv_scores = []
kfold = KFold(n_splits=10, shuffle=True, random_state=282)

In [14]:
def print_and_save_prediction_errors(y_test, y_pred, model_name):
    # Print the errors of the model
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    print(f"{model_name} MSE: {mse:,.2f}, RMSE: $ {rmse:,.2f}, R2 score: {r2:,.4f}")
    prediction_results.append({"model_name": model_name, "rmse": rmse, "r2": r2, "mse": mse})

In [15]:
def print_and_save_cv_scores(scores, model_name):
    # Print the scores of the model
    rmse_scores = np.sqrt(-scores)

    print(f"{model_name}:")
    mean, std = rmse_scores.mean(), rmse_scores.std()
    print(f"  Scores: {rmse_scores.round(2)}")
    print(f"  Mean: $ {mean:,.2f}", end=" ")
    print(f"  Std Dev: $ {std:,.2f}")
    cv_scores.append({"model_name": model_name, "mean": mean, "std": std})
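
The helper above takes `np.sqrt(-scores)` because scikit-learn scorers follow a "higher is better" convention, so the `neg_mean_squared_error` scorer returns the MSE with its sign flipped. A minimal sketch on synthetic data (the `*_demo` names are illustrative and not part of the notebook):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data, for illustration only
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=10.0, random_state=282)
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=282)

neg_mse = cross_val_score(LinearRegression(), X_demo, y_demo,
                          scoring="neg_mean_squared_error", cv=demo_kfold)
rmse_per_fold = np.sqrt(-neg_mse)  # flip the sign, then take the square root

print(neg_mse.round(2))        # one non-positive value per fold
print(rmse_per_fold.round(2))  # RMSE per fold, in the units of the target
```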

## Linear Regression

Run LinearRegression on all the data and look at the errors.

In [16]:
# Train on all the data
lin_reg = LinearRegression().fit(data_prepared, median_house_value_labels)
print(f"lin_reg.coef_: {lin_reg.coef_.round(2)}")
print(f"lin_reg.intercept_: {lin_reg.intercept_.round(2)}")
print(f"lin_reg.n_features_in_: {lin_reg.n_features_in_}")
print_parameters(lin_reg, "LinearRegression")

lin_reg.coef_: [-55320.72 -56255.15  13364.74  -1882.43   7465.25 -46331.97  45752.37
  74791.32   6372.1     863.34   9613.22 -27120.68 -23233.93 -60499.22
 -18806.89 129660.72]
lin_reg.intercept_: 241741.59
lin_reg.n_features_in_: 16
LinearRegression parameters:
  copy_X: True
  fit_intercept: True
  n_jobs: None
  positive: False


In [17]:
# Evaluate on X_test; the model was trained on all the data, so it has already seen these rows
y_pred = lin_reg.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "LinearRegression (All Data)")

LinearRegression (All Data) MSE: 4,518,787,119.35, RMSE: $ 67,221.92, R2 score: 0.6435


## Decision Tree Regression

__Remark:__ The linear regression predictions have an RMSE of about $67K. Let us try a `DecisionTreeRegressor`, trained on all the data. With default settings this model is very flexible and will overfit the data. Note the errors.

In [18]:
# Fit the model on all the data
tree_reg_model = DecisionTreeRegressor(
    random_state=282).fit(data_prepared, median_house_value_labels)
print_parameters(tree_reg_model, "DecisionTreeRegressor")

DecisionTreeRegressor parameters:
  ccp_alpha: 0.0
  criterion: squared_error
  max_depth: None
  max_features: None
  max_leaf_nodes: None
  min_impurity_decrease: 0.0
  min_samples_leaf: 1
  min_samples_split: 2
  min_weight_fraction_leaf: 0.0
  monotonic_cst: None
  random_state: 282
  splitter: best


In [19]:
# Predict on the test set
y_pred = tree_reg_model.predict(X_test)
print_and_save_prediction_errors(
    y_test, y_pred, "DecisionTreeRegressor (All Data)")

DecisionTreeRegressor (All Data) MSE: 0.00, RMSE: $ 0.00, R2 score: 1.0000
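
A zero error like the one above is what an unconstrained tree produces when the evaluation rows were part of its training data. The sketch below reproduces the effect on synthetic data (all names and values are illustrative): the same test rows are scored by a tree that has seen them and by a tree that has not.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(282)
X_demo = rng.normal(size=(500, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=500)  # noisy target

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=282)

tree_all = DecisionTreeRegressor(random_state=282).fit(X_demo, y_demo)  # has seen X_te
tree_train = DecisionTreeRegressor(random_state=282).fit(X_tr, y_tr)    # has not

rmse_seen = np.sqrt(mean_squared_error(y_te, tree_all.predict(X_te)))
rmse_unseen = np.sqrt(mean_squared_error(y_te, tree_train.predict(X_te)))
print(f"RMSE when test rows were in the training data: {rmse_seen:.4f}")
print(f"RMSE when test rows were held out:             {rmse_unseen:.4f}")
```

Only the second number says anything about how the tree generalizes.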


## Lasso Model
Run Lasso on all the data.

In [20]:
lasso_reg_model = Lasso(alpha=0.1, max_iter=100_000, random_state=282)
lasso_reg_model = lasso_reg_model.fit(data_prepared, median_house_value_labels)
print_parameters(lasso_reg_model, "Lasso")

Lasso parameters:
  alpha: 0.1
  copy_X: True
  fit_intercept: True
  max_iter: 100000
  positive: False
  precompute: False
  random_state: 282
  selection: cyclic
  tol: 0.0001
  warm_start: False


In [21]:
y_pred = lasso_reg_model.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "Lasso (All Data)")

Lasso (All Data) MSE: 4,518,787,305.01, RMSE: $ 67,221.93, R2 score: 0.6435


## Ridge Model

In [22]:
# Fit Ridge on all the data
ridge_reg_model = Ridge(alpha=0.1).fit(data_prepared, median_house_value_labels)
print_parameters(ridge_reg_model, "Ridge")

Ridge parameters:
  alpha: 0.1
  copy_X: True
  fit_intercept: True
  max_iter: None
  positive: False
  random_state: None
  solver: auto
  tol: 0.0001


In [23]:
y_pred = ridge_reg_model.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "Ridge (All Data)")

Ridge (All Data) MSE: 4,518,792,169.70, RMSE: $ 67,221.96, R2 score: 0.6435
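
The Lasso and Ridge errors above are almost identical to the LinearRegression error. One plausible reason is that `alpha=0.1` is a very weak penalty relative to the scale of the target. The sketch below illustrates this on synthetic data with a house-price-like scale; the coefficients, noise level, and the second alpha value are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(282)
X_demo = StandardScaler().fit_transform(rng.normal(size=(1000, 4)))
# Synthetic target on a house-price-like scale (made-up coefficients and noise)
true_coef = np.array([50_000.0, -30_000.0, 10_000.0, 5_000.0])
y_demo = 200_000 + X_demo @ true_coef + rng.normal(scale=20_000, size=1000)

ols = LinearRegression().fit(X_demo, y_demo)
gaps = {}  # alpha -> (largest Ridge deviation, largest Lasso deviation) from OLS
for alpha in (0.1, 1_000.0):
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    lasso = Lasso(alpha=alpha, max_iter=100_000).fit(X_demo, y_demo)
    gaps[alpha] = (np.abs(ridge.coef_ - ols.coef_).max(),
                   np.abs(lasso.coef_ - ols.coef_).max())
    print(f"alpha={alpha}: max |ridge - ols| = {gaps[alpha][0]:,.2f}, "
          f"max |lasso - ols| = {gaps[alpha][1]:,.2f}")
```

With the small alpha the regularized coefficients stay close to the ordinary least squares ones; the larger alpha moves them noticeably. Whether a stronger penalty helps on the housing data would have to be checked with cross-validation.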


## Random Forest Regressor

Run the Random Forest on all the data.

In [24]:
rf_reg_model = RandomForestRegressor(
    random_state=282).fit(data_prepared, median_house_value_labels)
print_parameters(rf_reg_model, "RandomForestRegressor")

RandomForestRegressor parameters:
  bootstrap: True
  ccp_alpha: 0.0
  criterion: squared_error
  max_depth: None
  max_features: 1.0
  max_leaf_nodes: None
  max_samples: None
  min_impurity_decrease: 0.0
  min_samples_leaf: 1
  min_samples_split: 2
  min_weight_fraction_leaf: 0.0
  monotonic_cst: None
  n_estimators: 100
  n_jobs: None
  oob_score: False
  random_state: 282
  verbose: 0
  warm_start: False


In [25]:
# Predict on the test set
y_pred = rf_reg_model.predict(X_test)
print_and_save_prediction_errors(
    y_test, y_pred, "RandomForestRegressor (All Data)")

RandomForestRegressor (All Data) MSE: 327,214,542.00, RMSE: $ 18,089.07, R2 score: 0.9742


## Try all the above models with X_train

In [None]:
# Train all the above models on the training data and test on the test data
train_lin_reg = LinearRegression().fit(X_train, y_train)
train_decision_tree_reg_model = DecisionTreeRegressor(random_state=282).fit(X_train, y_train)
train_lasso_reg_model = Lasso(alpha=0.1, max_iter=100_000, random_state=282)
train_lasso_reg_model = train_lasso_reg_model.fit(X_train, y_train)
train_ridge_reg_model = Ridge(alpha=0.1).fit(X_train, y_train)
train_rf_reg_model = RandomForestRegressor(random_state=282).fit(X_train, y_train)
y_pred = train_lin_reg.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "LinearRegression (Train)")
y_pred = train_decision_tree_reg_model.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "DecisionTreeRegressor (Train)")
y_pred = train_lasso_reg_model.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "Lasso (Train)")
y_pred = train_ridge_reg_model.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "Ridge (Train)")
y_pred = train_rf_reg_model.predict(X_test)
print_and_save_prediction_errors(y_test, y_pred, "RandomForestRegressor (Train)")

## Linear Regression with KFold=10

Try cross-validation (CV) with the Linear Regression model. As with the Decision Tree regressor, redo the scaling at the fold level.

In [None]:
full_pipeline = ca_housing_data_transformer(
    numerical_cols, categorical_col, ocean_categories)

In [None]:
lr_cv_pipeline = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', LinearRegression())
])
scores = cross_val_score(lr_cv_pipeline, df, median_house_value_labels,
                         scoring='neg_mean_squared_error', cv=kfold)
print_and_save_cv_scores(scores, "Linear Regression")

## Decision Tree Regression with KFold=10
__Remark:__ The error after training the DecisionTreeRegressor on the whole data is ZERO. Clearly the model is overfitting. Let us test that by running k-fold cross-validation.

The scaling parameters above were computed on the entire dataset, so each fold's held-out rows would influence the preprocessing. We create a new pipeline and rerun it on the original data so that scaling is done at the fold level.
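
To see what "scaling at the fold level" means, the sketch below puts a `StandardScaler` inside a pipeline and inspects the scaler fitted in each fold. It uses synthetic data and a plain scaler rather than the notebook's full preprocessing; `cross_validate` with `return_estimator=True` is used here only to get at the fitted pipelines.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(282)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))  # synthetic, unscaled features
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LinearRegression()),
])
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=282)

# return_estimator=True keeps the pipeline fitted on each fold's training part
cv_result = cross_validate(demo_pipeline, X_demo, y_demo, cv=demo_kfold,
                           scoring="neg_mean_squared_error",
                           return_estimator=True)

global_mean = X_demo.mean(axis=0)
fold_means = np.array([est.named_steps["scaler"].mean_
                       for est in cv_result["estimator"]])
print("mean over all rows:  ", global_mean.round(3))
for i, fold_mean in enumerate(fold_means):
    print(f"fold {i} training mean:", fold_mean.round(3))
```

Each fold's scaler learns slightly different means because it only sees that fold's training rows; the held-out rows never contribute to the preprocessing statistics.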

In [None]:
# Create the complete pipeline including the model
# (random_state fixed for reproducibility, as in the earlier cells)
tree_cv_pipeline = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', DecisionTreeRegressor(random_state=282))
])
scores = cross_val_score(tree_cv_pipeline, df, median_house_value_labels,
                         scoring='neg_mean_squared_error', cv=kfold)
print_and_save_cv_scores(scores, "Decision Tree")

## Lasso Model with KFold=10

In [None]:
lasso_cv_pipeline = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', Lasso(alpha=0.1, max_iter=100_000, random_state=282))
])
# Apply k-fold cross-validation to the lasso model
scores = cross_val_score(lasso_cv_pipeline, df, median_house_value_labels,
                         scoring='neg_mean_squared_error', cv=kfold)
print_and_save_cv_scores(scores, "Lasso")

## Ridge Model with KFold=10

In [None]:
# Ridge Model with KFold = 10
ridge_cv_pipeline = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', Ridge(alpha=0.1))
])
# Apply k-fold cross-validation to the ridge model
scores = cross_val_score(ridge_cv_pipeline, df, median_house_value_labels,
                         scoring='neg_mean_squared_error', cv=kfold)
print_and_save_cv_scores(scores, "Ridge")

## Random Forest with KFold=10

In [None]:

# Cross-validate the Random Forest on the original (unprepared) data
rf_cv_pipeline = Pipeline([
    ('preprocessing', full_pipeline),
    ('model', RandomForestRegressor())
])
scores = cross_val_score(rf_cv_pipeline, df, median_house_value_labels,
                         scoring='neg_mean_squared_error', cv=kfold)
print_and_save_cv_scores(scores, "Random Forest")

## Compare Results
The results of different approaches are:

In [None]:
from tabulate import tabulate
# Print the results
print("Prediction Results:")
prediction_results_sorted = sorted(prediction_results, key=lambda x: x['model_name'])
print(tabulate(prediction_results_sorted,
      headers='keys', tablefmt='grid', floatfmt=",.2f"))

In [None]:
print("\nCV Scores:")
cv_scores_sorted = sorted(cv_scores, key=lambda x: x['model_name'])
print(tabulate(cv_scores_sorted, headers='keys', tablefmt='grid', floatfmt=",.2f"))

## Save all processed data for model fine-tuning

In [35]:
# Finally save the processed data and labels
df_processed_colnames = ['longitude', 'latitude', 'housing_median_age',
                         'total_rooms','total_bedrooms', 'population',
                         'households', 'median_income', 'rooms_per_household',
                         'population_per_household', 'bedrooms_per_room',
                         'NEAR BAY', '<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'ISLAND']
df_processed = pd.DataFrame(data_prepared, columns=df_processed_colnames)
df_processed.to_parquet("../../data/housing-geron-processed.parquet")
np.save("../../data/median_house_value_labels.npy", median_house_value_labels)