# **PREDICTING HOUSE PRICES USING REGRESSION**

## **📥 Downloading and Extracting the Dataset**
Before we start building our house price prediction model, we need to download the dataset from Kaggle and extract its contents.

### **1️⃣ Install Kaggle API**  
The Kaggle API is required to download datasets directly from Kaggle. If you haven't installed it yet, run the command below:  

In [None]:
%pip install kaggle

### **2️⃣ Download the Dataset**

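Use the following command to download the dataset from the Kaggle competition. The archive is saved to the current working directory, so move it to `../data/` (or pass `-p ../data` to the command) before running the extraction step: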

In [1]:
!kaggle competitions download -c house-prices-advanced-regression-techniques

Downloading house-prices-advanced-regression-techniques.zip to c:\Users\Juls\Desktop\houseprice\notebooks
100%|██████████| 199k/199k [00:00<00:00, 336kB/s]


### **3️⃣ Extract the Dataset**

The dataset is stored as a ZIP file. We need to extract it before using it in our model.

In [2]:
import zipfile
import os

dataset_zip = "../data/house-prices-advanced-regression-techniques.zip"
dataset_folder = "../data/raw"

with zipfile.ZipFile(dataset_zip, 'r') as folder:
    folder.extractall(dataset_folder)
    

os.listdir(dataset_folder)

['data_description.txt', 'sample_submission.csv', 'test.csv', 'train.csv']
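
The extraction step can also be wrapped in a small helper that fails early when the archive is missing and creates the target folder if needed. The sketch below is only an illustration: `extract_dataset` is a hypothetical helper name, and the demo builds a throwaway archive in a temporary directory so that it runs without the Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_dataset(zip_path, out_dir):
    """Extract zip_path into out_dir and return the sorted names of the extracted entries."""
    zip_path, out_dir = Path(zip_path), Path(out_dir)
    if not zip_path.is_file():
        raise FileNotFoundError(f"Archive not found: {zip_path}")
    out_dir.mkdir(parents=True, exist_ok=True)  # create the target folder if it is missing
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(out_dir)
    return sorted(p.name for p in out_dir.iterdir())


# Demo on a tiny made-up archive (not the real competition data).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "Id,SalePrice\n1,208500\n")
        archive.writestr("test.csv", "Id\n1461\n")
    extracted = extract_dataset(demo_zip, Path(tmp) / "raw")
    print(extracted)
```

With the real data, the call would look like `extract_dataset(dataset_zip, dataset_folder)`.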

## **📊 Loading the Dataset**
Now that we have extracted the dataset, let's load it into Pandas DataFrames for further analysis.

### **1️⃣ Install Pandas**  
Pandas is a powerful library for data manipulation and analysis. If you haven't installed it yet, you can do so using:


In [None]:
%pip install pandas

In [3]:
import pandas as pd

### **2️⃣ Load the Dataset into Pandas**

- We use `pd.read_csv()` to load both the training and test datasets.
- The `.head()` method previews the first five rows of the training dataset. This helps us understand the structure of the data, including the features and the target variable.

In [216]:
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')

train_df.head()

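| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |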


## **🧼 Data Cleaning**

This section focuses on handling missing values, detecting outliers, and preparing the dataset for modeling.

### **1️⃣ Find columns with missing values.**

In [217]:
missing_values = train_df.isnull().sum()
missing_values = missing_values[missing_values > 0]

print(missing_values)

LotFrontage      259
Alley           1369
MasVnrType       872
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64


### **2️⃣ Identify feature types.**

In [218]:
categorical_columns = train_df.select_dtypes(include=['object']).columns
numerical_columns = train_df.select_dtypes(include=['int64', 'float64']).columns

print(categorical_columns)
print(numerical_columns)

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 

### **3️⃣ Handle Missing Values.**

- Drop columns in which more than 30% of the values are missing. The list of columns is computed on the training set and applied to both datasets.

In [219]:
threshold = 0.3
missing_values = train_df.isnull().sum()
missing_values_test = test_df.isnull().sum()

missing_ratio = missing_values / len(train_df)
missing_ratio_test = missing_values_test / len(test_df)

drop_cols = missing_ratio[missing_ratio > threshold].index
train_df.drop(columns=drop_cols, inplace=True)
test_df.drop(columns=drop_cols, inplace=True)

print("Dropped columns:", list(drop_cols))

Dropped columns: ['Alley', 'MasVnrType', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
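
The same thresholding logic can be packaged as a function and tried on a toy frame. This is a minimal sketch, not the notebook's code: `drop_sparse_columns` is a hypothetical helper, and the toy values are made up. It uses `isnull().mean()`, which is equivalent to dividing the missing count by the number of rows.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(train, test, threshold=0.3):
    """Drop columns whose share of missing values in `train` exceeds `threshold`."""
    missing_ratio = train.isnull().mean()  # same as isnull().sum() / len(train)
    drop_cols = missing_ratio[missing_ratio > threshold].index
    # errors="ignore" keeps the call safe if a column is absent from `test`.
    return (
        train.drop(columns=drop_cols),
        test.drop(columns=drop_cols, errors="ignore"),
        list(drop_cols),
    )


toy_train = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],   # 75% missing, above the threshold
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],   # 25% missing, below the threshold
})
toy_test = pd.DataFrame({"LotArea": [11622], "Alley": [np.nan], "LotFrontage": [80.0]})

toy_train, toy_test, dropped = drop_sparse_columns(toy_train, toy_test)
print("Dropped columns:", dropped)
```

Returning new frames instead of using `inplace=True` makes it easier to rerun the cell without side effects.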


- Replace missing values in categorical columns with the mode (the most frequent value).

In [220]:
for col in categorical_columns:
    if col in train_df.columns:
        train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
    if col in test_df.columns:
        test_df[col] = test_df[col].fillna(test_df[col].mode()[0])
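
The cell above fills the test set with the test set's own mode. A common alternative, shown here only as a sketch on made-up data, is to compute the mode on the training set and reuse it for the test set, so that both datasets are imputed with the same value.

```python
import numpy as np
import pandas as pd

# Toy data (made up): the most frequent zoning differs between train and test.
toy_train = pd.DataFrame({"MSZoning": ["RL", "RL", "RM", np.nan]})
toy_test = pd.DataFrame({"MSZoning": ["RM", "RM", np.nan]})

for col in ["MSZoning"]:
    train_mode = toy_train[col].mode()[0]            # learned from the training set only
    toy_train[col] = toy_train[col].fillna(train_mode)
    toy_test[col] = toy_test[col].fillna(train_mode)  # reuse the training mode

print(toy_test["MSZoning"].tolist())
```

Whether this is preferable depends on the goal; it mirrors how a fitted transformer would behave at prediction time.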

- Check whether each numerical column is roughly normally distributed or skewed.

In [None]:
%pip install matplotlib
%pip install seaborn

In [139]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
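
One way to run the skewness check numerically is pandas' `DataFrame.skew()`. The sketch below uses made-up values, and the cutoff of 1 is only a rule-of-thumb assumption, not something the dataset prescribes.

```python
import pandas as pd

# Toy data (made up): one symmetric column and one with a long right tail.
toy = pd.DataFrame({
    "symmetric": [1, 2, 3, 4, 5],
    "right_skewed": [1, 1, 1, 2, 50],
})

skewness = toy.skew()  # sample skewness per column; about 0 for symmetric data
skew_cutoff = 1        # assumed rule-of-thumb threshold
skewed_cols = skewness[skewness.abs() > skew_cutoff].index.tolist()

print(skewness)
print("Skewed columns:", skewed_cols)
```

On the real data, `train_df[numerical_columns].skew()` would give the same kind of per-column summary.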

- Create a box plot to check for outliers in the numerical data.

In [None]:
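def numerical_box_plot(df, cols_per_row):
    """Draw one box plot per numerical column of df, cols_per_row plots per row."""
    # Only plot numerical columns that are still present in df.
    cols = [col for col in numerical_columns if col in df.columns]
    num_cols = len(cols)
    num_rows = int(np.ceil(num_cols / cols_per_row))

    fig, ax = plt.subplots(num_rows, cols_per_row, figsize=(20, num_rows * 5))
    ax = np.atleast_1d(ax).flatten()

    for i, col in enumerate(cols):
        sns.boxplot(y=df[col], ax=ax[i])
        ax[i].set_title(col)

    # Remove the unused axes in the last row.
    for i in range(num_cols, len(ax)):
        fig.delaxes(ax[i])

    plt.tight_layout()
    plt.show()

numerical_box_plot(train_df, 6)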

### **4️⃣ Handle Outliers** 
- Using the interquartile range (IQR), count the outliers in each numerical column.
- If a column contains more than two outliers, fill its missing values with the median, which is robust to extreme values.
- If a column contains two or fewer outliers, fill its missing values with the mean.

In [221]:
# Ensure SalePrice is not in test_df
if 'SalePrice' in test_df.columns:
    test_df.drop(columns=['SalePrice'], inplace=True)

for col in numerical_columns:
    if col == 'SalePrice':  # Skip the target: it exists only in train_df and is not imputed
        continue

    if train_df[col].dtype in ['int64', 'float64']:  # Ensure numerical columns
        mean = train_df[col].mean()
        median = train_df[col].median()
        q1 = train_df[col].quantile(0.25)
        q3 = train_df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # Count outliers in train_df
        outliers = train_df[(train_df[col] < lower_bound) | (train_df[col] > upper_bound)].shape[0]
        
        # Choose filling value based on outliers
        fill_value = mean if outliers <= 2 else median

        # Assign directly instead of inplace=True (avoids FutureWarning)
        train_df[col] = train_df[col].fillna(fill_value)

        # Ensure test_df has the column before filling
        if col in test_df.columns:
            test_df[col] = test_df[col].fillna(fill_value)
    else:
        print(f"Skipping non-numeric column: {col}")
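
To see the rule in isolation, the choice between mean and median can be sketched as a small function. `choose_fill_value` is a hypothetical helper written for this illustration, and the two series are made up.

```python
import numpy as np
import pandas as pd


def choose_fill_value(series, max_outliers=2):
    """Return the mean if the series has at most `max_outliers` IQR outliers, else the median."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Comparisons with NaN are False, so missing values are never counted as outliers.
    n_outliers = int(((series < lower) | (series > upper)).sum())
    return series.mean() if n_outliers <= max_outliers else series.median()


clean = pd.Series([1, 2, 3, 4, 5, np.nan])                                   # no outliers
heavy_tail = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 100, 200, 300, np.nan])  # three outliers

clean_fill = choose_fill_value(clean)        # mean of 1..5
heavy_fill = choose_fill_value(heavy_tail)   # median of the 13 observed values

print(clean_fill, heavy_fill)
```

In the second series the mean would be pulled far upward by the three large values, which is why the median is the safer fill value there.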

## **⚙️ Data Preprocessing**

This section focuses on preparing the dataset for modeling by encoding categorical features and scaling numerical data.

In [None]:
%pip install scikit-learn

In [222]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import pandas as pd

### **1️⃣ Scale Numerical Features**

- Since `SalePrice` is not in the test data, let's exclude it from the standardization step.

In [None]:
scaler = StandardScaler()

numerical_features = [col for col in numerical_columns if col in train_df.columns and col != "SalePrice"]

train_df[numerical_features] = scaler.fit_transform(train_df[numerical_features])
test_df[numerical_features] = scaler.transform(test_df[numerical_features])

print("Feature Scaling Done!")
print(train_df.shape, test_df.shape)

Feature Scaling Done!
(1460, 75) (1459, 74)
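
The key point of this step is that the scaler learns the mean and standard deviation from the training set only and then reuses them for the test set. A minimal sketch on made-up living-area values, checked against the manual formula `(x - mean) / std`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy values (made up): one feature, three training rows, one test row.
train_area = np.array([[1000.0], [1500.0], [2000.0]])
test_area = np.array([[2500.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_area)  # learns mean and std from train
test_scaled = scaler.transform(test_area)        # reuses the training statistics

# StandardScaler uses the population standard deviation (ddof=0), like np.std's default.
manual = (test_area - train_area.mean(axis=0)) / train_area.std(axis=0)

print(test_scaled, manual)
```

Calling `fit_transform` on the test set instead would scale it with its own statistics and make the two datasets incomparable.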


### **2️⃣ Encode Categorical Features**

- Rebuild the list of categorical features, since some columns were dropped earlier, then one-hot encode them.

In [None]:
categorical_features = [col for col in categorical_columns if col in train_df.columns]

train_df = pd.get_dummies(train_df, columns=categorical_features, drop_first=True)
test_df = pd.get_dummies(test_df, columns=categorical_features, drop_first=True)

print("Categorical Encoding Done!")
print(train_df.shape, test_df.shape)

Categorical Encoding Done!
(1460, 231) (1459, 214)
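
Note that the two shapes above differ (231 vs. 214 columns): when `pd.get_dummies` is applied to each dataset separately, a category that appears in only one of them produces a column in only that dataset. One possible way to align them, sketched here on made-up data, is to reindex the test columns to the training columns. The sketch leaves out `drop_first=True`, because encoding the datasets separately with that option can drop a different baseline category in each one.

```python
import pandas as pd

# Toy data (made up): "Grvl" appears only in the training set.
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"], "LotArea": [8450, 9600, 11250]})
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"], "LotArea": [11622, 14267]})

train_enc = pd.get_dummies(toy_train, columns=["Street"], dtype=int)
test_enc = pd.get_dummies(toy_test, columns=["Street"], dtype=int)

# Give the test set exactly the training columns; unseen categories become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
```

With the real data the target column would need to be excluded before reindexing, since `SalePrice` exists only in the training set.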


In [223]:
categorical_features = [col for col in categorical_columns if col in train_df.columns]
numerical_features = [col for col in numerical_columns if col in train_df.columns and col != "SalePrice"]

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numerical_features),  
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), categorical_features)  
])

## **📌 Training the Linear Regression Model**  

In [224]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import cross_val_score
import numpy as np

### **1️⃣ Define the target and train the Linear Regression Model**
- The target variable is **`SalePrice`**.
- We use **Scikit-Learn's `LinearRegression`** to train our model.
- The model learns relationships between the features and house prices.

In [225]:
linear_regression = LinearRegression()

x_train = train_df.drop(columns=['SalePrice'])
y_train = train_df['SalePrice']
x_test = test_df.copy()

x_train = preprocessor.fit_transform(x_train)
x_test = preprocessor.transform(x_test)

print("✅ OneHotEncoding & Scaling Done!")
print("X_train shape:", x_train.shape)
print("X_test shape:", x_test.shape)

✅ OneHotEncoding & Scaling Done!
X_train shape: (1460, 230)
X_test shape: (1459, 230)


In [226]:
linear_regression.fit(x_train, y_train)
print("Linear Regression Model Trained!")

Linear Regression Model Trained!


### **2️⃣ Linear Regression Model Evaluation**

After training the Linear Regression model, we evaluate its performance using **MAE, MSE, RMSE, and R² score**.  

#### **📌 Evaluation Metrics**  
- **Mean Absolute Error (MAE)**: Measures the average absolute difference between predicted and actual values. Lower is better.  
- **Mean Squared Error (MSE)**: Similar to MAE but squares the differences, penalizing larger errors more. Lower is better.  
- **Root Mean Squared Error (RMSE)**: The square root of MSE, keeping units consistent with the target variable. Lower is better.  
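- **R² Score (R-squared)**: Indicates how much of the variance in the target the model explains. **1 is a perfect fit**, 0 means the model does no better than always predicting the mean, and the score can be negative for a model that does worse than that.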
- **Cross-Validation RMSE**: Evaluates model stability by splitting the training set into multiple folds and averaging RMSE across them. 
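
To make the definitions concrete, the first four metrics can be computed by hand with NumPy. The prices below are made up for illustration; the last lines show that R² drops below zero when the predictions are worse than simply predicting the mean.

```python
import numpy as np

# Toy actual and predicted prices (made up).
y_true = np.array([200000.0, 150000.0, 250000.0, 180000.0])
y_pred = np.array([210000.0, 140000.0, 240000.0, 200000.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))   # average absolute error
mse = np.mean(errors ** 2)      # average squared error
rmse = np.sqrt(mse)             # back in the units of the target
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# A deliberately bad model: always predict 400,000.
y_bad = np.full_like(y_true, 400000.0)
r2_bad = 1 - np.sum((y_true - y_bad) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2, r2_bad)
```

These hand-written formulas should agree with `mean_absolute_error`, `mean_squared_error`, and `r2_score` from scikit-learn.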

In [227]:
y_train_pred = linear_regression.predict(x_train)

mae = mean_absolute_error(y_train, y_train_pred)
mse = mean_squared_error(y_train, y_train_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_train, y_train_pred)

cv_scores = cross_val_score(linear_regression, x_train, y_train, cv=5, scoring='neg_root_mean_squared_error')

print(f"Training MAE: {mae}")
print(f"Training MSE: {mse}")
print(f"Training RMSE: {rmse}")
print(f"Training R^2: {r2}")
print(f"Cross-Validation RMSE: {-cv_scores.mean()}")

Training MAE: 13509.327849941874
Training MSE: 440691048.06912225
Training RMSE: 20992.642712843997
Training R^2: 0.9301243347378332
Cross-Validation RMSE: 42412.1700414257
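
The cross-validation RMSE is roughly twice the training RMSE, which suggests that the model fits the training data better than it generalizes. When cross-validating, it can also help to wrap the preprocessing and the model in a single `Pipeline`, so that scaling and encoding are refit inside each fold rather than once on the full training set. The following is a sketch on synthetic data; the column names, coefficients, and noise level are made up.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data (made up): price grows linearly with living area, plus noise.
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "GrLivArea": rng.uniform(800, 3000, n),
    "Street": rng.choice(["Pave", "Grvl"], n),
})
price = 50000 + 100 * toy["GrLivArea"] + rng.normal(0, 5000, n)

toy_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["GrLivArea"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Street"]),
])

# Preprocessing and model are fit together on the training part of each fold.
model = Pipeline([("prep", toy_preprocessor), ("reg", LinearRegression())])
cv_rmse = -cross_val_score(model, toy, price, cv=5, scoring="neg_root_mean_squared_error")

print("Per-fold RMSE:", cv_rmse)
print("Mean RMSE:", cv_rmse.mean())
```

The scorer returns negated RMSE values (higher is better by convention), hence the minus sign.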


### **3️⃣ Generating Predictions and Saving to CSV**

After training the Linear Regression model, we use it to make predictions on the test dataset and save the results to a CSV file for submission to [Kaggle](https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview).  

In [None]:
test_pred = linear_regression.predict(x_test)

# Ids in test.csv run consecutively from 1461, so they can be rebuilt from the row index.
results = pd.DataFrame({"Id": test_df.index + 1461, "SalePrice": test_pred})
results.to_csv("../data/results.csv", index=False)

print("Predictions saved to results.csv")

Predictions saved to results.csv
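
Rebuilding the Ids from the row index works only while the rows keep their original order. A slightly more defensive variant, sketched here with made-up values, is to set the `Id` column aside before preprocessing and reuse it when building the submission frame.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw test set (values made up).
raw_test = pd.DataFrame({"Id": [1461, 1462, 1463], "LotArea": [11622, 14267, 13830]})

test_ids = raw_test["Id"].copy()                      # keep the Ids before any scaling
toy_pred = np.array([120000.0, 160000.0, 185000.0])   # placeholder predictions

submission = pd.DataFrame({"Id": test_ids, "SalePrice": toy_pred})
print(submission)
```

The resulting frame has the two columns used in `sample_submission.csv`: `Id` and `SalePrice`.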


- Save the training predictions to a CSV file.

In [229]:
# Ids in train.csv start at 1, while the DataFrame index starts at 0.
y_train_results = pd.DataFrame({"Id": train_df.index + 1, "SalePrice": y_train_pred})
y_train_results.to_csv("../data/y_results.csv", index=False)

print("y_train_pred predictions saved to y_results.csv")

y_train_pred predictions saved to y_results.csv
