# Housing Prices
This notebook walks through a baseline Ridge regression model for housing price prediction. The dataset comes from Kaggle: https://www.kaggle.com/datasets/yasserh/housing-prices-dataset

## Dependencies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import root_mean_squared_error
import joblib


sns.set_style('darkgrid')

## Data Exploration

In [None]:
def check_basic_info(df):
    """No return, prints basic information & missing values in the given DataFrame."""
    print(df.head())
    print("\nInfo: ")
    print(df.info())
    print("\nDescription: ")
    print(df.describe())
    print('\nMissing values: ')
    print(df.isnull().sum())

From the initial summary of the data, we can see that there are 545 rows and no missing values. There are six numerical columns: price, area, bedrooms, bathrooms, stories, and parking. There are seven categorical columns of object type: mainroad, guestroom, basement, hotwaterheating, airconditioning, prefarea, and furnishingstatus.

The summary statistics suggest that the target variable `price` may be skewed, and it has a large standard deviation.

Some categorical features, such as `prefarea`, are effectively binary and take only the values yes/no.
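As a small illustration of what "effectively binary" means in practice, the sketch below maps yes/no strings to 1/0 on a toy DataFrame. The toy data and the exact `'yes'`/`'no'` string values are assumptions for illustration; the pipeline later in this notebook uses `OneHotEncoder` instead, so this is only an alternative view of the same information.

```python
import pandas as pd

# Hypothetical toy frame; the 'yes'/'no' string values are assumed here
toy = pd.DataFrame({
    'prefarea': ['yes', 'no', 'yes'],
    'mainroad': ['no', 'no', 'yes'],
})

binary_cols = ['prefarea', 'mainroad']

# Work on a copy so the original toy frame is left untouched
encoded = toy.copy()
for col in binary_cols:
    # Map each yes/no label to a 1/0 indicator
    encoded[col] = encoded[col].map({'yes': 1, 'no': 0})

print(encoded)
```

Any value other than the two mapped labels would become `NaN` under `map`, which makes unexpected categories easy to spot.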


---

The next step is to visualize the data to check the distribution of the target `price`, spot any outliers, and look at relationships between the features and the target.


In [None]:
def plot_num_distribution(df, col):
    """Plots a Histogram/KDE of a given col, designed for numericals."""
    plt.figure(figsize=(8,5))
    sns.histplot(df[col], bins=30, kde=True)
    plt.xlabel(col)
    plt.title(f"Distribution of {col}")
    plt.ylabel("Count")
    plt.show()


def plot_cat_distribution(df, col):
    """Plots countplots of a given col, designed for categoricals."""
    plt.figure(figsize=(8,5))
    sns.countplot(data=df, x=col)
    plt.xlabel(col)
    plt.ylabel("Count")
    plt.title(f"Distribution of {col}")
    plt.show()


def plot_cat_to_target(df, col, target='price'):
    """Plots boxplot of target grouped by (categorical) col."""
    plt.figure(figsize=(10,6))
    sns.boxplot(data=df, x=col, y=target)
    plt.xticks(rotation=45)
    plt.title(f"{target} by {col}")
    plt.show()
    

def plot_num_to_target(df, col, target='price'):
    """Plots scatterplots of (numeric) col vs target."""
    plt.figure(figsize=(8,5))
    sns.regplot(data=df, x=col, y=target, scatter_kws={'alpha':.5})
    plt.xlabel(col)
    plt.ylabel(target)
    plt.title(f"{target} by {col}")
    plt.show()



def eda_utility(df, cat_cols, num_cols, target='price'):
    """Loops through all EDA functions above."""
    # Use the target argument rather than a hardcoded column name
    plot_num_distribution(df, target)
    for col in num_cols:
        plot_num_distribution(df, col)
        plot_num_to_target(df, col, target=target)
    
    for col in cat_cols:
        plot_cat_distribution(df, col)
        plot_cat_to_target(df, col, target=target)


From the EDA, we can see several key insights, including:

- The target variable `price` is right-skewed, so a log-transformation might improve modeling.  
- `area` is the only truly continuous numerical feature; others such as `bedrooms` and `parking` are discrete counts, so they are plotted as categorical in the EDA.  
- Furnished houses tend to have higher prices.  
- Houses located on the main road generally command higher prices.  
- More stories, bedrooms, and bathrooms correlate with higher prices.  
- Larger area is associated with more expensive houses.  
- `hotwaterheating` is "no" for most houses, which limits its predictive usefulness.
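To illustrate why a log transformation helps with a right-skewed target, the sketch below compares skewness before and after taking logs. It uses synthetic log-normal values as a stand-in for `price` (an assumption for illustration, not the actual dataset), so the numbers only demonstrate the effect.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical stand-in for the price column: log-normal draws are right-skewed
synthetic_price = pd.Series(
    rng.lognormal(mean=15.0, sigma=0.4, size=545), name='price'
)

# Sample skewness: positive values indicate a long right tail
raw_skew = synthetic_price.skew()
log_skew = np.log(synthetic_price).skew()

print(f"Skewness of raw values: {raw_skew:.2f}")
print(f"Skewness after log:     {log_skew:.2f}")
```

On the real data, the same two `skew()` calls on `df['price']` would show whether the transformation is worthwhile.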


## Creating model pipeline

We will use Ridge regression for this dataset because:

- Some features, like number of bedrooms and bathrooms, are likely correlated, causing multicollinearity issues.

- Ridge regression applies L2 regularization, which adds penalties on large coefficients and helps prevent overfitting.

- Unlike Lasso, Ridge shrinks coefficients without dropping features entirely, preserving the influence of all variables.
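The contrast between Ridge and Lasso can be sketched on synthetic data with two nearly identical features and one irrelevant feature. The data, the `alpha` values, and the variable names are assumptions chosen for illustration; they are not tuned for the housing dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n = 200

x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)  # nearly a copy of x1 (strong collinearity)
x3 = rng.normal(size=n)              # unrelated to the target
X = np.column_stack([x1, x2, x3])
y = 3 * x1 + 3 * x2 + rng.normal(size=n)

# L2 penalty: shrinks coefficients but keeps every feature in the model
ridge = Ridge(alpha=1.0).fit(X, y)
# L1 penalty: can set some coefficients exactly to zero
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge coefficients:", ridge.coef_)
print("Lasso coefficients:", lasso.coef_)
print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))
```

With correlated inputs, Ridge tends to spread weight across the correlated features, whereas Lasso tends to keep one and drop the rest.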


---

To create the pipeline, we build a preprocessor that scales the numeric features and one-hot encodes the categorical features. The full pipeline is first scored with 5-fold cross-validation. The data is then split into train and test sets, the model is fit on the training set, and it is evaluated on the held-out test set.

In [None]:
def build_preprocessor(num_cols, cat_cols):
    """
    Builds the preprocessing pipelines for numeric and categorical features. 
    Scales and encodes columns. Returns preprocessing transformer.
    """
    numeric_transformer = Pipeline([
        ('scaler', StandardScaler())
    ])

    # One-hot encoding is used so that no order is implied among categories
    categorical_transformer = Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    return ColumnTransformer([
        ('num', numeric_transformer, num_cols),
        ('cat', categorical_transformer, cat_cols)
        ])


def build_model(preprocessor):
    """Returns the pipeline of preprocessing & the Ridge regression model."""
    model = Pipeline([
        ('preprocessor', preprocessor),
        ('regressor', Ridge(random_state=42))
    ])
    
    return model


def train_model(model, X, y):
    """Splits data into train and test sets, fits model, and returns the trained pipeline along with test sets."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=.2, random_state=42)
    
    model.fit(X_train, y_train)
    
    return model, X_test, y_test




def evaluate_model_cv(model, X, y, cv=5):
    """Evaluates pipeline using cross validation and prints the average RMSE."""
    cv_scores = cross_val_score(model, X, y, scoring='neg_root_mean_squared_error', cv=cv)

    # scikit-learn reports RMSE as a negated score (higher is better), so flip the sign
    rmse_scores = -cv_scores

    print(f"CV RMSE scores: {rmse_scores}")
    print(f"Average CV RMSE: {rmse_scores.mean():.4f}")




def evaluate_model(model, X_test, y_test):
    """Evaluates the trained model on the holdout test set."""

    y_pred = model.predict(X_test)
    rmse = root_mean_squared_error(y_test, y_pred)
    print(f"Test RMSE: {rmse:.4f}")


    # Scatterplot of actual vs predicted
    plt.figure(figsize=(8,6))
    plt.scatter(y_test, y_pred, alpha=0.5)
    plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
    plt.xlabel('Actual Price')
    plt.ylabel('Predicted Price')
    plt.title('Actual vs Predicted Prices')
    plt.show()


    # Residual plot
    residuals = y_test - y_pred
    plt.figure(figsize=(8,6))
    plt.scatter(y_pred, residuals, alpha=0.5)
    plt.axhline(0, color='r', linestyle='--')
    plt.xlabel('Predicted Price')
    plt.ylabel('Residuals')
    plt.title('Residuals vs Predicted Prices')
    plt.show()


    # Residual distribution
    plt.figure(figsize=(8,6))
    sns.histplot(residuals, kde=True)
    plt.title('Residuals Distribution')
    plt.show()


In [None]:
def run_model_pipeline(df, target='price'):
    """
    Driver function that runs all modeling steps above, including preprocessing, training, and evaluation.
    Returns model.
    """

    # log-transform the target on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df[target] = np.log(df[target])
    X = df.drop(columns=[target, 'hotwaterheating'])
    y = df[target]
    

    model_num_cols = ['area', 'bedrooms', 'bathrooms', 'stories', 'parking']
    model_cat_cols = ['mainroad', 'guestroom', 'basement', 'airconditioning', 'prefarea', 'furnishingstatus']

    preprocessor = build_preprocessor(model_num_cols, model_cat_cols)
    model = build_model(preprocessor)


    # cross-validated RMSE over the full dataset
    evaluate_model_cv(model, X, y, cv=5)

    # holdout set
    model, X_test, y_test = train_model(model, X, y)
    evaluate_model(model, X_test, y_test)
    
    return model



In [None]:

def main():

    # load and inspect data
    df = pd.read_csv('Housing.csv')
    check_basic_info(df)

    # define numerical & categorical cols for EDA, excluding target price. Discrete numerical counts are treated as categorical in visualization.
    eda_num_cols = ['area']
    eda_cat_cols = ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 
                'prefarea', 'furnishingstatus', 'bedrooms', 'bathrooms', 'stories', 'parking']
    
    # EDA
    eda_utility(df, eda_cat_cols, eda_num_cols, target='price')

    # creates & evaluates model
    model = run_model_pipeline(df)

    # saves model
    joblib.dump(model, 'ridge_model.joblib')


    
if __name__ == "__main__":
    main()

## Interpretation

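The cross-validation RMSE scores range from about 0.18 to 0.36, so performance varies noticeably with the data split. The test RMSE is 0.26. Because the target was log-transformed, these errors are on the log scale: an RMSE of 0.26 corresponds to a typical multiplicative error factor of about exp(0.26) ≈ 1.30, meaning predictions typically land between roughly 23% below and 30% above the actual price. The spread in CV scores indicates sensitivity to the data subsets, likely due to differences between the folds or the small dataset (545 rows). Note that `cross_val_score` with an integer `cv` uses unshuffled K-fold splits for regressors, so any ordering of the rows in the CSV carries over into the folds.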
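The conversion from a log-scale RMSE to a percentage band can be checked with a few lines of arithmetic. This is a rough sketch that treats the RMSE as a typical residual size on the log scale; the variable names are illustrative.

```python
import math

log_rmse = 0.26  # test RMSE reported above, on the log-price scale

# A residual of +r in log space multiplies the price by exp(r);
# a residual of -r multiplies it by exp(-r)
upper_factor = math.exp(log_rmse)
lower_factor = math.exp(-log_rmse)

print(f"Typical over-prediction:  +{(upper_factor - 1) * 100:.1f}%")
print(f"Typical under-prediction: -{(1 - lower_factor) * 100:.1f}%")
```

The band is asymmetric because exponentiation turns an additive error in log space into a multiplicative error in price.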



## Steps to Improve Performance

- Explore more flexible models like Random Forests or Gradient Boosting to capture complex patterns.  
- Perform feature engineering to add interaction terms or polynomial features that may improve predictions.  
- Collect more data, if possible, to stabilize model performance.  
- Tune the Ridge regularization strength (`alpha`), which is currently left at its default of 1.0.  
- Investigate and address potential outliers or influential data points that might be affecting performance.  
- Continue residual analysis to identify and correct systematic errors.
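
As one possible starting point for the regularization tuning mentioned above, the sketch below runs a grid search over `alpha` with shuffled K-fold splits. It uses synthetic data from `make_regression` and a simplified scaler-plus-Ridge pipeline as stand-ins (assumptions for illustration); the grid values are arbitrary, and the step name `regressor` mirrors the pipeline built earlier in this notebook.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features and log-price target
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', Ridge()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
param_grid = {'regressor__alpha': [0.01, 0.1, 1.0, 10.0, 100.0]}

# Shuffle before splitting so row order in the data does not shape the folds
cv = KFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    pipeline, param_grid, scoring='neg_root_mean_squared_error', cv=cv
)
search.fit(X, y)

best_alpha = search.best_params_['regressor__alpha']
best_rmse = -search.best_score_  # flip the sign of the negated RMSE score

print(f"Best alpha: {best_alpha}")
print(f"Best CV RMSE: {best_rmse:.4f}")
```

The same `param_grid` key should work with the full pipeline returned by `build_model`, since its final step is also named `regressor`.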
