RMSE (Train), RMSE (Test) and MAE (Test)

In [3]:
import argparse
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error

# -----------------------------
# Argument parser
# -----------------------------
parser = argparse.ArgumentParser()
parser.add_argument("--data", type=str, required=True)
parser.add_argument("--test_size", type=float, default=0.2)
parser.add_argument("--alpha", type=float, default=1.0)

# Pass arguments explicitly when running in a notebook
args = parser.parse_args(["--data", "/content/housing.csv"])

# -----------------------------
# Load data
# -----------------------------
df = pd.read_csv(args.data)

# -----------------------------
# Handle missing values
# -----------------------------
df["total_bedrooms"] = df["total_bedrooms"].fillna(df["total_bedrooms"].median())

# -----------------------------
# Encode categorical variable
# -----------------------------
df = pd.get_dummies(df, columns=["ocean_proximity"], drop_first=True)

# -----------------------------
# Split features & target
# -----------------------------
X = df.drop("median_house_value", axis=1)
y = df["median_house_value"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=args.test_size, random_state=42
)

# -----------------------------
# Feature Scaling
# -----------------------------
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# -----------------------------
# Models
# -----------------------------
models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=args.alpha),
    "Decision Tree": DecisionTreeRegressor(max_depth=10, random_state=42)
}

results = []

# -----------------------------
# Train & Evaluate
# -----------------------------
for name, model in models.items():
    if name == "Decision Tree":
        model.fit(X_train, y_train)
        train_pred = model.predict(X_train)
        test_pred = model.predict(X_test)
    else:
        model.fit(X_train_scaled, y_train)
        train_pred = model.predict(X_train_scaled)
        test_pred = model.predict(X_test_scaled)

    rmse_train = np.sqrt(mean_squared_error(y_train, train_pred))
    rmse_test = np.sqrt(mean_squared_error(y_test, test_pred))
    mae_test = mean_absolute_error(y_test, test_pred)

    results.append([name, rmse_train, rmse_test, mae_test])

# -----------------------------
# Results table
# -----------------------------
results_df = pd.DataFrame(
    results,
    columns=["Model", "RMSE (Train)", "RMSE (Test)", "MAE (Test)"]
)

print("\nModel Comparison:\n")
print(results_df)



Model Comparison:

               Model  RMSE (Train)   RMSE (Test)    MAE (Test)
0  Linear Regression  68433.937367  70060.521845  50670.738241
1   Ridge Regression  68433.944757  70057.416870  50668.122931
2      Decision Tree  48594.141547  61444.631799  40772.481492
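
In the output above, test RMSE is higher than test MAE for every model (for example, about 61,445 versus 40,772 for the Decision Tree). RMSE squares each error before averaging, so a few large misses raise it more than they raise MAE. The following is a minimal sketch with made-up residuals; the `typical` and `with_outlier` lists are hypothetical and are not taken from the housing data.

```python
import math


def rmse(errors):
    """Root mean squared error of a list of residuals."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def mae(errors):
    """Mean absolute error of a list of residuals."""
    return sum(abs(e) for e in errors) / len(errors)


# Hypothetical residuals in dollars (illustrative only, not from housing.csv).
typical = [10_000, -10_000, 10_000, -10_000]
with_outlier = [10_000, -10_000, 10_000, -110_000]

for label, errs in [("typical", typical), ("with outlier", with_outlier)]:
    print(f"{label:>12}: RMSE={rmse(errs):,.0f}  MAE={mae(errs):,.0f}")
```

When all residuals have the same magnitude, the two metrics agree. Once one residual is much larger than the rest, RMSE moves further from MAE, which is why reporting both is informative.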


A comparison table for at least 3 models, such as:

- Linear Regression (baseline)
- Ridge/Lasso Regression
- Decision Tree Regressor (or Random Forest as an extension)

| Model                       | RMSE (Train) | RMSE (Test) | MAE (Test) | Observation                                                          |
| --------------------------- | ------------ | ----------- | ---------- | -------------------------------------------------------------------- |
| **Linear Regression**       | 68,434       | 70,061      | 50,671     | Underfitting (high bias): train and test errors both high and close  |
| **Ridge Regression**        | 68,434       | 70,057      | 50,668     | Almost identical to Linear Regression at alpha = 1.0                 |
| **Decision Tree Regressor** | 48,594       | 61,445      | 40,772     | Lowest test error, but the largest train-test gap (higher variance)  |


# Underfitting vs Overfitting Explanation

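Underfitting is observed in Linear Regression, where training and test RMSE are both high (about 68,400 and 70,100) and close together. The model is too simple to capture the non-linear relationship between housing features and prices, which is the signature of high bias.
Ridge Regression adds L2 regularization, but with alpha = 1.0 its errors are nearly identical to those of Linear Regression (test RMSE of about 70,057 versus 70,061). Regularization reduces variance, and variance is not what limits the linear model here, so there is little for it to improve.
The Decision Tree Regressor (max_depth=10) shows signs of overfitting: its training RMSE (about 48,600) is well below its test RMSE (about 61,400), a gap of about 12,850 compared with about 1,600 for the two linear models. This gap indicates that the tree fits some noise in the training data, which means higher variance.
Even so, the Decision Tree has the lowest test RMSE and test MAE of the three models, so it generalizes best in this experiment despite the larger gap. Ridge Regression gives stable train-test behavior but does not reduce the error relative to the baseline.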
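
One way to quantify underfitting and overfitting is the gap between test and train RMSE, read alongside the absolute test error. The following is a minimal sketch using the values printed in the Model Comparison output, rounded to two decimals. The `reported` dictionary and the `generalization_gap` helper are illustrative names, not part of the notebook.

```python
# Metrics copied from the "Model Comparison" output, rounded to 2 decimals.
reported = {
    "Linear Regression": {"rmse_train": 68433.94, "rmse_test": 70060.52},
    "Ridge Regression": {"rmse_train": 68433.94, "rmse_test": 70057.42},
    "Decision Tree": {"rmse_train": 48594.14, "rmse_test": 61444.63},
}


def generalization_gap(metrics):
    """Test RMSE minus train RMSE; a larger gap suggests more overfitting."""
    return metrics["rmse_test"] - metrics["rmse_train"]


gaps = {name: generalization_gap(m) for name, m in reported.items()}
best_on_test = min(reported, key=lambda name: reported[name]["rmse_test"])
largest_gap = max(gaps, key=gaps.get)

for name, gap in gaps.items():
    test_rmse = reported[name]["rmse_test"]
    print(f"{name:<18} gap={gap:>10,.2f}  test RMSE={test_rmse:,.2f}")
print("Lowest test RMSE:", best_on_test)
print("Largest train-test gap:", largest_gap)
```

A small gap combined with a high test error points to bias, while a large gap points to variance. Both quantities are needed, because a model can have the largest gap and still have the lowest test error.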

# Non-linearity & Outliers
House prices are influenced by complex, non-linear factors such as location desirability, nearby infrastructure, and economic conditions. Linear models cannot capture this complexity, while an unconstrained decision tree can overreact to outliers and give unstable predictions. This highlights the need for regularization (for example, limiting tree depth) or ensemble models in real-world regression tasks.
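
As a rough illustration of the ensemble idea, the sketch below compares a single depth-limited tree with a Random Forest using scikit-learn. The data is synthetic, with a non-linear target and a few injected outliers; it is an assumption made for illustration and is not the housing dataset. The hyperparameters (`n_estimators=100`, `max_depth=10`) are also illustrative choices rather than tuned values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic non-linear data with injected outliers (illustrative assumption).
rng = np.random.default_rng(42)
X = rng.uniform(-3, 3, size=(1000, 2))
y = np.sin(X[:, 0]) * 50 + X[:, 1] ** 2 * 10 + rng.normal(0, 5, size=1000)
outlier_idx = rng.choice(1000, size=20, replace=False)
y[outlier_idx] += 200  # shift a few targets far from the underlying trend

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Decision Tree": DecisionTreeRegressor(max_depth=10, random_state=42),
    "Random Forest": RandomForestRegressor(
        n_estimators=100, max_depth=10, random_state=42
    ),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    rmse_train = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    rmse_test = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    scores[name] = (rmse_train, rmse_test)
    print(f"{name:<14} RMSE train={rmse_train:.2f}  test={rmse_test:.2f}")
```

A forest averages many trees trained on bootstrap samples, which tends to reduce variance compared with a single tree. Whether it helps on the housing data would need to be measured by adding the model to the `models` dictionary in the notebook.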