Introduction

For this project, I used the Kaggle House Prices dataset to predict the sale price of houses based on various property features. The goal is to build a regression model that can estimate the price of a home given characteristics like overall quality, living area, and year built. Regression is appropriate because the target variable (SalePrice) is continuous. Predicting house prices can help identify which property features have the biggest impact on value and improve decision-making in real estate.


Data Cleaning and Preprocessing

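The dataset contained missing values, non-numeric columns, and features on different scales.
To clean and prepare the data, I:
-Dropped the Id column, which carries no predictive information.
-Replaced missing values in numeric columns with the column mean.
-Filled missing values in categorical columns with the placeholder 'Missing'.
-Encoded categorical features as integer codes using LabelEncoder.
-Split the dataset into training (80%) and testing (20%) subsets.
-Scaled the features using StandardScaler, fitted on the training set only and then applied to the test set.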

In [2]:
# Project 3: Regression – Predicting House Prices (simple version)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Load the dataset
df = pd.read_csv("train.csv")

# Drop the ID column if it exists
if 'Id' in df.columns:
    df = df.drop('Id', axis=1)

# Fill missing numeric values with the column mean
df = df.fillna(df.mean(numeric_only=True))

# Fill missing categorical values with a placeholder
df = df.fillna('Missing')

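# Encode all categorical columns as integer codes (the encoder is refit on each column;
# codes follow sorted label order, so they do not reflect any real ranking)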
encoder = LabelEncoder()
for col in df.select_dtypes(include=['object']).columns:
    df[col] = encoder.fit_transform(df[col].astype(str))

# Split data into features and target
X = df.drop('SalePrice', axis=1)
y = df['SalePrice']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("=== Linear Regression Results ===")
print("MAE:", round(mae, 2))
print("MSE:", round(mse, 2))
print("R²:", round(r2, 4))


=== Linear Regression Results ===
MAE: 21687.38
MSE: 1197632941.51
R²: 0.8439


Results and Interpretation

The Linear Regression model performed well on the held-out test set, achieving an R² score of 0.84. This means the model explains about 84% of the variance in test-set sale prices using the features provided. The Mean Absolute Error (MAE) of 21,687 indicates that, on average, the model’s predictions are off by roughly $21,700. The MSE is expressed in squared dollars, so it is harder to read directly than the MAE. While this is a good start, there is room to improve accuracy by testing additional models or performing feature engineering.
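
To make the metrics above concrete, here is a minimal sketch of how MAE, MSE, RMSE, and R² are defined, written in plain Python. The helper name regression_metrics and the toy prices are hypothetical and are not taken from the dataset; only the MSE value at the end comes from the output above.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, and R² for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n          # average absolute miss
    mse = sum(e * e for e in errors) / n           # average squared miss
    rmse = math.sqrt(mse)                          # back in the target's units
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)            # unexplained variation
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total variation
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical toy prices (illustration only, not from train.csv)
toy_true = [100_000, 150_000, 200_000, 250_000]
toy_pred = [110_000, 140_000, 210_000, 240_000]
toy = regression_metrics(toy_true, toy_pred)
print("Toy metrics:", toy)

# RMSE implied by the MSE printed in the results above
reported_mse = 1197632941.51
reported_rmse = math.sqrt(reported_mse)
print("RMSE from reported MSE:", round(reported_rmse, 2))
```

The RMSE implied by the reported MSE comes out at roughly $34,600, noticeably higher than the MAE. Because squaring weights large errors more heavily, a gap like this usually suggests that a few houses are predicted much less accurately than the typical one.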

Conclusion

This first experiment successfully used a regression model to predict house prices. The data cleaning and preprocessing steps helped prepare the dataset for training, and the Linear Regression model gave strong results as a baseline.
For the final version of the project, I plan to:

-Try other regression techniques like Ridge, Lasso, or Random Forest.

-Add feature engineering to capture more patterns in the data.

-Compare models using the same evaluation metrics to find the best performer.
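
The planned comparison could look something like the sketch below. It runs on synthetic data from make_regression as a stand-in (an assumption, so the numbers it prints say nothing about the house price data), and the hyperparameters (alpha=1.0, n_estimators=100) are arbitrary starting values rather than tuned choices. In the real project, X and y would come from the preprocessed DataFrame instead.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption); replace with the real features and SalePrice
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Linear models get a scaler in front; tree ensembles do not need scaling
candidates = {
    "Linear": make_pipeline(StandardScaler(), LinearRegression()),
    "Ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "Lasso": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
    "RandomForest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit every model on the same split and score it with the same three metrics
results = {}
for name, estimator in candidates.items():
    estimator.fit(X_train, y_train)
    y_pred = estimator.predict(X_test)
    results[name] = {
        "mae": mean_absolute_error(y_test, y_pred),
        "mse": mean_squared_error(y_test, y_pred),
        "r2": r2_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(f"{name:>12}  MAE={scores['mae']:.2f}  "
          f"MSE={scores['mse']:.2f}  R²={scores['r2']:.4f}")
```

Wrapping the scaler and the model in one pipeline keeps the scaling fitted on the training data only, which matches how the baseline above treats the test set.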