**Project 3: House Price Prediction**

**Workshop:** Geeks for Geeks 21 Projects, 21 Days: ML, Deep Learning & GenAI

**Date:** October 15, 2025

**Author:** Harsh Bhanushali

**Objective:** Build a regression model to predict house sale prices, covering EDA, preprocessing, feature engineering, categorical encoding, model training (Linear Regression and XGBoost), and evaluation using RMSE, MAE, and R-squared.

**Step 1: Import Libraries**

Import libraries for data processing, visualization, and modeling.

In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
from scipy.stats import skew
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
import xgboost as xgb

**Step 2: Load the Dataset**

Load the House Prices dataset from the Kaggle competition.

In [2]:
train_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv')
test_df = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv')
print("Training Data Loaded")
print("\nFirst 5 Rows:")
display(train_df.head())
print("\nShape:", train_df.shape)

Training Data Loaded

First 5 Rows:


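| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |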



Shape: (1460, 81)


**Step 3: Target Variable Analysis**

Analyze and transform the target variable SalePrice to handle skewness.

In [3]:
fig = px.histogram(train_df, x='SalePrice', title='SalePrice Distribution',
                  labels={'SalePrice': 'Sale Price'}, nbins=50)
fig.update_layout(title_x=0.5, xaxis_title='Sale Price', yaxis_title='Count')
fig.show()

print("Skewness of SalePrice:", skew(train_df['SalePrice']))
train_df['SalePrice'] = np.log1p(train_df['SalePrice'])
print("Skewness after log transform:", skew(train_df['SalePrice']))

fig = px.histogram(train_df, x='SalePrice', title='Log-Transformed SalePrice Distribution',
                  labels={'SalePrice': 'Log Sale Price'}, nbins=50)
fig.update_layout(title_x=0.5, xaxis_title='Log Sale Price', yaxis_title='Count')
fig.show()

Skewness of SalePrice: 1.880940746034036
Skewness after log transform: 0.12122191311528359


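**Finding:** SalePrice is right-skewed (skewness of about 1.88). The log1p transformation reduces the skewness to about 0.12, giving a more symmetric target that is generally easier for regression models to fit. All later metrics are therefore computed on the log scale.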
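
As a small illustration of the transform and of how it is undone later, the sketch below applies `np.log1p` to synthetic right-skewed prices and checks that `np.expm1` recovers the original values. The log-normal sample is an assumption made only for this example; it is not the Kaggle data.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "prices" (illustrative only, not the Kaggle data)
rng = np.random.default_rng(0)
prices = rng.lognormal(mean=12.0, sigma=0.4, size=5000)

log_prices = np.log1p(prices)    # log(1 + x), also defined at x = 0
restored = np.expm1(log_prices)  # inverse transform: exp(y) - 1

skew_before = skew(prices)
skew_after = skew(log_prices)
round_trip_ok = np.allclose(prices, restored)

print(f"Skewness before: {skew_before:.3f}, after: {skew_after:.3f}")
print("Round trip OK:", round_trip_ok)
```

The same `np.expm1` call is what converts log-scale predictions back to prices at submission time.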

**Step 4: Data Preprocessing**

Handle missing values and encode categorical features.

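    Missing Values (Numerical): Fill with the column median.

    Missing Values (Categorical): Fill with the placeholder 'Unknown'.

    Categorical Encoding: Apply LabelEncoder to the ordinal feature LotShape and One-Hot Encoding to the nominal features. Note that LabelEncoder assigns codes in alphabetical label order, which does not necessarily match a feature's natural order.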

In [4]:
# Combine train and test for consistent preprocessing
all_data = pd.concat([train_df.drop('SalePrice', axis=1), test_df], axis=0)

# Numerical missing values
numerical_cols = all_data.select_dtypes(include=['int64', 'float64']).columns
all_data[numerical_cols] = all_data[numerical_cols].fillna(all_data[numerical_cols].median())

# Categorical missing values
categorical_cols = all_data.select_dtypes(include=['object']).columns
all_data[categorical_cols] = all_data[categorical_cols].fillna('Unknown')

# Encode ordinal features (example: LotShape)
# Note: LabelEncoder codes follow alphabetical label order, not a domain-defined order
ordinal_cols = ['LotShape']
for col in ordinal_cols:
    le = LabelEncoder()
    all_data[col] = le.fit_transform(all_data[col])

# One-Hot Encoding for nominal features (every categorical column except the ordinal ones)
nominal_cols = [col for col in categorical_cols if col not in ordinal_cols]
all_data = pd.get_dummies(all_data, columns=nominal_cols, drop_first=True)

# Split back into train and test with explicit copies
train_processed = all_data.iloc[:len(train_df)].copy()
test_processed = all_data.iloc[len(train_df):].copy()
train_processed['SalePrice'] = train_df['SalePrice']
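
Because `LabelEncoder` sorts labels alphabetically, the integer codes it produces for LotShape do not run from regular to irregular. The sketch below contrasts it with an explicit mapping on a toy column. The order Reg < IR1 < IR2 < IR3 is taken from the dataset's data description, and the name `lot_shape_order` is hypothetical, introduced only for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column containing the four LotShape levels (illustrative data)
df = pd.DataFrame({'LotShape': ['Reg', 'IR1', 'IR2', 'IR3', 'Reg']})

# LabelEncoder sorts labels alphabetically: IR1=0, IR2=1, IR3=2, Reg=3
le_codes = LabelEncoder().fit_transform(df['LotShape'])

# Hypothetical explicit order, from most regular to most irregular
lot_shape_order = {'Reg': 0, 'IR1': 1, 'IR2': 2, 'IR3': 3}
ordinal_codes = df['LotShape'].map(lot_shape_order)

print("LabelEncoder:", list(le_codes))
print("Explicit map:", list(ordinal_codes))
```

Tree-based models such as XGBoost are fairly tolerant of arbitrary integer codes, but a linear model treats the codes as magnitudes, so the explicit order may matter more there.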

**Step 5: Feature Engineering**

Create new features to enhance model performance.

    TotalSF: Total square footage (above-ground living area plus total basement area).

    TotalBath: Total above-ground bathrooms, counting each half bath as 0.5.

    Age: Age of the house at the time of sale (YrSold minus YearBuilt).

In [5]:
train_processed.loc[:, 'TotalSF'] = train_processed['GrLivArea'] + train_processed['TotalBsmtSF'].fillna(0)
train_processed.loc[:, 'TotalBath'] = train_processed['FullBath'] + 0.5 * train_processed['HalfBath']
train_processed.loc[:, 'Age'] = train_processed['YrSold'] - train_processed['YearBuilt']
test_processed.loc[:, 'TotalSF'] = test_processed['GrLivArea'] + test_processed['TotalBsmtSF'].fillna(0)
test_processed.loc[:, 'TotalBath'] = test_processed['FullBath'] + 0.5 * test_processed['HalfBath']
test_processed.loc[:, 'Age'] = test_processed['YrSold'] - test_processed['YearBuilt']
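
The same three formulas are applied to both frames, so one option is to wrap them in a helper and call it twice. The sketch below is one way to do that; `add_features` is a hypothetical name, and the toy row uses made-up values rather than a record from the dataset. The `fillna(0)` is omitted here on the assumption that missing values have already been imputed, as in Step 4.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with TotalSF, TotalBath, and Age added."""
    out = df.copy()
    out['TotalSF'] = out['GrLivArea'] + out['TotalBsmtSF']
    out['TotalBath'] = out['FullBath'] + 0.5 * out['HalfBath']
    out['Age'] = out['YrSold'] - out['YearBuilt']
    return out

# Toy input row (illustrative values only)
toy = pd.DataFrame({
    'GrLivArea': [1710], 'TotalBsmtSF': [856],
    'FullBath': [2], 'HalfBath': [1],
    'YrSold': [2008], 'YearBuilt': [2003],
})
result = add_features(toy)
print(result[['TotalSF', 'TotalBath', 'Age']])
```

With such a helper, the cell above would reduce to `train_processed = add_features(train_processed)` and the equivalent call for `test_processed`.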

**Step 6: Model Training**

Train Linear Regression and XGBoost models.

In [6]:
X = train_processed.drop(['Id', 'SalePrice'], axis=1)
y = train_processed['SalePrice']
X_test = test_processed.drop('Id', axis=1)

# Split data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train_scaled, y_train)
lr_pred = lr_model.predict(X_val_scaled)

# XGBoost
xgb_model = xgb.XGBRegressor(random_state=42)
xgb_model.fit(X_train_scaled, y_train)
xgb_pred = xgb_model.predict(X_val_scaled)
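
`LinearRegression` applies no regularization, so it can be fragile when the feature matrix is wide and partly collinear, as a one-hot encoded matrix often is. A regularized baseline such as `Ridge` is a common alternative. The sketch below is not part of the original notebook: it uses synthetic data with a deliberately near-duplicated column, and `alpha=1.0` is an untuned assumption.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic features plus a near-duplicate of column 0 to mimic collinearity
X_base = rng.normal(size=(300, 5))
near_dup = X_base[:, 0] + 1e-8 * rng.normal(size=300)
X_demo = np.column_stack([X_base, near_dup])
y_demo = X_base @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=300)

X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit the scaler on the training split only, then reuse it for validation
demo_scaler = StandardScaler().fit(X_tr)
ridge_model = Ridge(alpha=1.0)  # alpha is an untuned, illustrative choice
ridge_model.fit(demo_scaler.transform(X_tr), y_tr)

ridge_r2 = r2_score(y_va, ridge_model.predict(demo_scaler.transform(X_va)))
max_coef = np.abs(ridge_model.coef_).max()
print(f"Ridge R-squared: {ridge_r2:.4f}, largest |coefficient|: {max_coef:.3f}")
```

The L2 penalty keeps the coefficients bounded even though two columns are almost identical. In the notebook, the equivalent change would be swapping `LinearRegression()` for `Ridge(...)`, ideally with `alpha` chosen by cross-validation.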

**Step 7: Model Evaluation**

Evaluate models using RMSE, MAE, and R-squared.

In [7]:
# All metrics are computed on the log1p(SalePrice) scale used for training
lr_rmse = np.sqrt(mean_squared_error(y_val, lr_pred))
lr_mae = mean_absolute_error(y_val, lr_pred)
lr_r2 = r2_score(y_val, lr_pred)
xgb_rmse = np.sqrt(mean_squared_error(y_val, xgb_pred))
xgb_mae = mean_absolute_error(y_val, xgb_pred)
xgb_r2 = r2_score(y_val, xgb_pred)

print("Linear Regression Metrics:")
print(f"RMSE: {lr_rmse:.4f}, MAE: {lr_mae:.4f}, R-squared: {lr_r2:.4f}")
print("XGBoost Metrics:")
print(f"RMSE: {xgb_rmse:.4f}, MAE: {xgb_mae:.4f}, R-squared: {xgb_r2:.4f}")

Linear Regression Metrics:
RMSE: 1638646167.3743, MAE: 135621667.5054, R-squared: -14389084384327854080.0000
XGBoost Metrics:
RMSE: 0.1449, MAE: 0.0965, R-squared: 0.8875


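**Finding:** XGBoost clearly outperforms Linear Regression on the validation split (RMSE 0.1449, MAE 0.0965, R-squared 0.8875, all on the log1p(SalePrice) scale). The Linear Regression metrics (an RMSE of about 1.6e9 and a hugely negative R-squared) indicate a numerically unstable fit rather than a meaningful baseline. A likely cause is the wide, partly collinear one-hot feature matrix, which a regularized linear model such as Ridge would probably handle better.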
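
Because the models are trained on log1p(SalePrice), the reported RMSE and MAE are in log units. As a rough guide, a log-scale RMSE of about 0.145 corresponds to a typical relative error on the order of 15%. The sketch below computes the three metrics by hand on toy values and then back-transforms with `np.expm1` to get an RMSE in price units. The numbers are illustrative and are not taken from the notebook's validation split.

```python
import numpy as np

# Toy targets and predictions, stored on the log1p scale (illustrative values)
y_true_log = np.log1p(np.array([100000.0, 200000.0, 300000.0]))
y_pred_log = np.log1p(np.array([110000.0, 190000.0, 300000.0]))

# Metrics on the log scale, written out from their definitions
err = y_true_log - y_pred_log
rmse_log = np.sqrt(np.mean(err ** 2))
mae_log = np.mean(np.abs(err))
r2_log = 1 - np.sum(err ** 2) / np.sum((y_true_log - y_true_log.mean()) ** 2)

# Back-transform to prices before computing an RMSE in currency units
y_true_price = np.expm1(y_true_log)
y_pred_price = np.expm1(y_pred_log)
rmse_price = np.sqrt(np.mean((y_true_price - y_pred_price) ** 2))

print(f"Log scale   -> RMSE: {rmse_log:.4f}, MAE: {mae_log:.4f}, R-squared: {r2_log:.4f}")
print(f"Price scale -> RMSE: {rmse_price:.2f}")
```

Reporting both scales makes it easier to compare against the competition metric (which is log-based) while still giving an error figure in currency units.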

**Step 8: Generate Predictions**

Predict test-set prices with the XGBoost model, reverse the log transform, and create the submission file.

In [8]:
test_pred = np.expm1(xgb_model.predict(X_test_scaled))  # Reverse log transform
submission = pd.DataFrame({'Id': test_df['Id'], 'SalePrice': test_pred})
submission.to_csv('submission.csv', index=False)
print("Submission file saved as 'submission.csv'")

Submission file saved as 'submission.csv'
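
Before uploading, it can be worth checking the submission frame for obvious problems. The sketch below is one possible check, not part of the original notebook; `submission_problems` is a hypothetical helper, and the toy IDs and prices are made up for the example.

```python
import numpy as np
import pandas as pd

def submission_problems(sub: pd.DataFrame, expected_ids: pd.Series) -> list:
    """Return a list of human-readable problems; an empty list means none found."""
    if list(sub.columns) != ['Id', 'SalePrice']:
        return ["columns must be exactly ['Id', 'SalePrice']"]
    problems = []
    if sub['Id'].tolist() != expected_ids.tolist():
        problems.append("Id column does not match the test set")
    if not np.isfinite(sub['SalePrice']).all():
        problems.append("SalePrice contains NaN or infinite values")
    elif not (sub['SalePrice'] > 0).all():
        problems.append("SalePrice contains non-positive values")
    return problems

# Toy submission (illustrative values only)
toy_ids = pd.Series([1461, 1462, 1463])
toy_sub = pd.DataFrame({'Id': toy_ids, 'SalePrice': [120000.0, 155000.5, 98000.0]})
good_result = submission_problems(toy_sub, toy_ids)

# Same frame with one missing prediction
bad_sub = toy_sub.copy()
bad_sub.loc[0, 'SalePrice'] = np.nan
bad_result = submission_problems(bad_sub, toy_ids)

print("Good submission problems:", good_result)
print("Bad submission problems:", bad_result)
```

In the notebook, the call would be `submission_problems(submission, test_df['Id'])`, run just before `to_csv`.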
