# **PREDICTING HOUSE PRICES USING REGRESSION**

## **📥 Downloading and Extracting the Dataset**
Before we start building our house price prediction model, we need to download the dataset from Kaggle and extract its contents.

### **1️⃣ Install Kaggle API**  
The Kaggle API is required to download datasets directly from Kaggle. If you haven't installed it yet, run the command below:  

In [None]:
%pip install kaggle

### **2️⃣ Download the Dataset**

Use the following command to download the dataset from the Kaggle competition:

In [2]:
!kaggle competitions download -c house-prices-advanced-regression-techniques

Downloading house-prices-advanced-regression-techniques.zip to c:\Users\Juls\Desktop\houseprice\notebooks




  0%|          | 0.00/199k [00:00<?, ?B/s]
100%|██████████| 199k/199k [00:00<00:00, 361kB/s]


### **3️⃣ Extract the Dataset**

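The dataset is downloaded as a ZIP file, so we need to extract it before using it in our model. The cell below expects the archive at `../data/`; if it was downloaded into the notebook folder, as in the output above, move it there first.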

In [4]:
import zipfile
import os

dataset_zip = "../data/house-prices-advanced-regression-techniques.zip"
dataset_folder = "../data/raw"

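# Extract every file in the archive into the raw-data folder
with zipfile.ZipFile(dataset_zip, 'r') as archive:
    archive.extractall(dataset_folder)

# List the extracted files to confirm the extraction worked
os.listdir(dataset_folder)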

['data_description.txt', 'sample_submission.csv', 'test.csv', 'train.csv']
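
A slightly more defensive variant of the extraction step might check that the archive exists and create the target folder before extracting. The sketch below is only an illustration: `extract_archive` is a hypothetical helper, and the demo builds a tiny throwaway archive (`demo.zip` containing a one-row `train.csv`) in a temporary directory so that it runs without the real Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_archive(zip_path, target_dir):
    """Extract zip_path into target_dir and return the sorted file names."""
    zip_path, target_dir = Path(zip_path), Path(target_dir)
    if not zip_path.is_file():
        raise FileNotFoundError(f"Archive not found: {zip_path}")
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(target_dir)
    return sorted(p.name for p in target_dir.iterdir())


# Demo with a throwaway archive (placeholder content, not the Kaggle files).
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "Id,SalePrice\n1,208500\n")
    extracted = extract_archive(demo_zip, Path(tmp) / "raw")
    print(extracted)
```

With the real data, the same helper would be called with `dataset_zip` and `dataset_folder` from the cell above.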

## **📊 Loading the Dataset**
Now that we have extracted the dataset, let's load it into Pandas DataFrames for further analysis.

### **1️⃣ Install Pandas**  
Pandas is a powerful library for data manipulation and analysis. If you haven't installed it yet, you can do so using:


In [None]:
%pip install pandas

In [2]:
import pandas as pd

### **2️⃣ Load the Dataset into Pandas**

- We will use `pd.read_csv()` to load both the training and test datasets.
- The `.head()` method previews the first five rows of the training dataset. This helps us understand the structure of the data, including the features and the target variable (`SalePrice`).

In [3]:
train_df = pd.read_csv('../data/raw/train.csv')
test_df = pd.read_csv('../data/raw/test.csv')

train_df.head()

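|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|----|------------|----------|-------------|---------|--------|-------|----------|-------------|-----------|-----|----------|--------|-------|-------------|---------|--------|--------|----------|---------------|-----------|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |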
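
Beyond `.head()`, it can help to confirm how the two files differ in shape and columns. The sketch below uses two small made-up frames (`toy_train` and `toy_test` are stand-ins, not the real data) to show one way of listing the columns that exist only in the training set.

```python
import pandas as pd

# Toy stand-ins for train_df and test_df; only the column layout matters here.
toy_train = pd.DataFrame(
    {"Id": [1, 2], "LotArea": [8450, 9600], "SalePrice": [208500, 181500]}
)
toy_test = pd.DataFrame({"Id": [3, 4], "LotArea": [11250, 9550]})

# Columns present in the training frame but absent from the test frame.
only_in_train = toy_train.columns.difference(toy_test.columns).tolist()

print(toy_train.shape, toy_test.shape)
print(only_in_train)
```

Running the same two lines on `train_df` and `test_df` should single out the target column, since the competition's test file does not include `SalePrice`.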


## **🧼 Data Cleaning**

This section focuses on handling missing values, detecting outliers, and preparing the dataset for modeling.

### **1️⃣ Find Columns with Missing Values**

In [4]:
missing_values = train_df.isnull().sum()
missing_values = missing_values[missing_values > 0]

print(missing_values)

LotFrontage      259
Alley           1369
MasVnrType       872
MasVnrArea         8
BsmtQual          37
BsmtCond          37
BsmtExposure      38
BsmtFinType1      37
BsmtFinType2      38
Electrical         1
FireplaceQu      690
GarageType        81
GarageYrBlt       81
GarageFinish      81
GarageQual        81
GarageCond        81
PoolQC          1453
Fence           1179
MiscFeature     1406
dtype: int64
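
Raw counts are easier to judge as percentages of the total row count. As a minimal sketch on a made-up four-row frame (`toy_df` is an assumption for illustration), `isnull().mean()` gives the fraction of missing values per column directly:

```python
import numpy as np
import pandas as pd

# Toy frame with a few gaps; the values are placeholders.
toy_df = pd.DataFrame(
    {
        "LotFrontage": [65.0, np.nan, 68.0, 60.0],
        "Alley": [np.nan, np.nan, np.nan, "Grvl"],
        "LotArea": [8450, 9600, 11250, 9550],
    }
)

# The mean of a boolean mask is the share of True values, i.e. the missing ratio.
missing_pct = toy_df.isnull().mean() * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)

print(missing_pct)
```

Sorting in descending order puts the sparsest columns at the top, which makes candidates for dropping easy to spot.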


### **2️⃣ Identify Feature Types**

In [5]:
categorical_columns = train_df.select_dtypes(include=['object']).columns
numerical_columns = train_df.select_dtypes(include=['int64', 'float64']).columns

print(categorical_columns)
print(numerical_columns)

Index(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities',
       'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2',
       'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
       'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation',
       'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
       'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
       'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
       'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
       'SaleType', 'SaleCondition'],
      dtype='object')
Index(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual',
       'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 

### **3️⃣ Handle Missing Values**

- Drop columns in which more than 30% of the values are missing.

In [6]:
threshold = 0.3
missing_ratio = missing_values / len(train_df)

drop_cols = missing_ratio[missing_ratio > threshold].index
train_df.drop(columns=drop_cols, inplace=True)
test_df.drop(columns=drop_cols, inplace=True)

print("Dropped columns:", list(drop_cols))

Dropped columns: ['Alley', 'MasVnrType', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
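
The threshold can be checked by hand against the counts printed in step 1. The training set has 1,460 rows, so 30% corresponds to roughly 438 missing values. The sketch below redoes that arithmetic on a subset of those counts, copied into a plain dictionary (`missing_counts` is just a helper name for this illustration).

```python
# Missing-value counts copied from the step 1 output (subset only).
missing_counts = {
    "LotFrontage": 259,
    "Alley": 1369,
    "MasVnrType": 872,
    "FireplaceQu": 690,
    "GarageType": 81,
    "PoolQC": 1453,
    "Fence": 1179,
    "MiscFeature": 1406,
}
n_rows = 1460      # number of rows in the training set
threshold = 0.3    # same threshold as in the cell above

# Keep only the columns whose missing ratio exceeds the threshold.
to_drop = [col for col, n in missing_counts.items() if n / n_rows > threshold]

for col, n in missing_counts.items():
    print(f"{col:<12} {n / n_rows:6.1%}")
print("Would drop:", to_drop)
```

`LotFrontage` (about 18% missing) and `GarageType` (about 6%) stay below the threshold, which is why they are imputed later rather than dropped.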


- Replace the remaining missing values.
- In categorical columns, fill missing values with the mode (the most frequent value).

In [7]:
for col in categorical_columns:
    if col in train_df.columns:
        train_df[col] = train_df[col].fillna(train_df[col].mode()[0])
    if col in test_df.columns:
        test_df[col] = test_df[col].fillna(test_df[col].mode()[0])
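
The cell above fills each dataset with its own mode. One common alternative, which is not what this notebook does, is to reuse the training mode for the test set so that both frames are imputed consistently. A minimal sketch on made-up data (`toy_train` and `toy_test` are placeholders):

```python
import numpy as np
import pandas as pd

# Toy frames: the most frequent value differs between train ("RL") and test ("RM").
toy_train = pd.DataFrame({"MSZoning": ["RL", "RL", "RM", np.nan]})
toy_test = pd.DataFrame({"MSZoning": ["RM", np.nan, "RM"]})

for col in ["MSZoning"]:
    # mode() returns a Series because ties are possible; [0] takes the first mode.
    train_mode = toy_train[col].mode()[0]
    toy_train[col] = toy_train[col].fillna(train_mode)
    toy_test[col] = toy_test[col].fillna(train_mode)  # reuse the training mode

print(toy_test["MSZoning"].tolist())
```

Here the gap in the test frame is filled with `RL`, the training mode, even though `RM` is more frequent within the test frame itself.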

- Check whether each numerical column is roughly normally distributed or skewed.

In [None]:
%pip install matplotlib
%pip install seaborn

In [9]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

- Create a box plot to check for outliers in the numerical data.

In [None]:
def numerical_box_plot(df, cols_per_row):
    """Draw one box plot per numerical column, laid out on a grid."""
    num_cols = len(numerical_columns)
    num_rows = int(np.ceil(num_cols / cols_per_row))

    # squeeze=False always returns a 2-D array of axes, so flatten() is safe
    fig, ax = plt.subplots(num_rows, cols_per_row, figsize=(20, num_rows * 5), squeeze=False)
    ax = ax.flatten()

    for i, col in enumerate(numerical_columns):
        sns.boxplot(y=df[col], ax=ax[i])
        ax[i].set_title(col)

    # Remove the unused axes in the last row of the grid
    for i in range(num_cols, len(ax)):
        fig.delaxes(ax[i])

    plt.tight_layout()
    plt.show()

numerical_box_plot(train_df, 6)
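
Box plots show outliers visually; skewness can also be quantified with the pandas `skew()` method. The sketch below runs it on two made-up columns. The cut-off of 1.0 is only a rule-of-thumb value chosen for this illustration, not something the notebook defines.

```python
import pandas as pd

# Toy columns: one symmetric, one with a long right tail.
toy_df = pd.DataFrame(
    {
        "symmetric": [1, 2, 3, 4, 5],
        "right_skewed": [1, 1, 1, 2, 10],
    }
)

# Sample skewness per column: near 0 is symmetric, positive means a right tail.
skewness = toy_df.skew()

# Flag columns whose absolute skewness exceeds an illustrative cut-off.
skew_cutoff = 1.0
skewed_columns = skewness[skewness.abs() > skew_cutoff].index.tolist()

print(skewness)
print("Skewed columns:", skewed_columns)
```

For skewed columns the median is usually a more representative fill value than the mean, which motivates the outlier-based rule in the next step.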

### **4️⃣ Handle Outliers** 
- Using the interquartile range (IQR), count the outliers in each numerical column.
- If a column contains more than two outliers, fill its missing values with the median, which is robust to extreme values.
- If a column contains two or fewer outliers, fill its missing values with the mean.

In [None]:
for col in numerical_columns:
    mean = train_df[col].mean()
    median = train_df[col].median()

    # IQR fences: values more than 1.5 * IQR beyond the quartiles count as outliers
    q1 = train_df[col].quantile(0.25)
    q3 = train_df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Number of rows below the lower fence or above the upper fence
    outliers = ((train_df[col] < lower_bound) | (train_df[col] > upper_bound)).sum()

    # Two or fewer outliers: use the mean; otherwise use the more robust median
    fill_value = mean if outliers <= 2 else median
    train_df[col] = train_df[col].fillna(fill_value)

    # Fill the test set with the same training statistic (SalePrice is absent there)
    if col in test_df.columns:
        test_df[col] = test_df[col].fillna(fill_value)
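
To make the rule concrete, here is a small self-contained sketch on two made-up series. The helper names `count_iqr_outliers` and `choose_fill_value` are hypothetical and exist only for this illustration.

```python
import pandas as pd


def count_iqr_outliers(series):
    """Count values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]; NaN is ignored."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return int(((series < lower) | (series > upper)).sum())


def choose_fill_value(series, max_outliers=2):
    """Return the mean if there are at most max_outliers outliers, else the median."""
    if count_iqr_outliers(series) <= max_outliers:
        return series.mean()
    return series.median()


# One extreme value: the rule keeps the mean.
few_outliers = pd.Series([10, 11, 11, 12, 12, 13, 100, None])
# Three extreme values: the rule switches to the median.
many_outliers = pd.Series([-50, 10, 11, 11, 11, 12, 12, 12, 13, 100, 120, None])

fill_few = choose_fill_value(few_outliers)
fill_many = choose_fill_value(many_outliers)

print(count_iqr_outliers(few_outliers), fill_few)
print(count_iqr_outliers(many_outliers), fill_many)
```

In the first series a single extreme value already pulls the mean well above the typical values, so the choice of cut-off (two outliers here) is a judgment call rather than a fixed rule.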
        

## **⚙️ Data Preprocessing**

This section focuses on preparing the dataset for modeling by encoding categorical features and scaling numerical data.

In [None]:
%pip install scikit-learn

In [45]:
from sklearn.preprocessing import StandardScaler

### **1️⃣ Scale Numerical Features**

- Since `SalePrice` is not in the test data, let's exclude it from standardization.

In [46]:
scaler = StandardScaler()

numerical_features = [col for col in numerical_columns if col != "SalePrice"]

train_df[numerical_features] = scaler.fit_transform(train_df[numerical_features])
test_df[numerical_features] = scaler.transform(test_df[numerical_features])

print("Feature Scaling Done!")

Feature Scaling Done!
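
`StandardScaler` standardizes each column as z = (x - mean) / std, using the population standard deviation. The statistics are learned from the training data with `fit_transform` and then reused for the test data with `transform`. The sketch below reproduces that arithmetic by hand with pandas on made-up numbers, purely to show what happens to each frame.

```python
import pandas as pd

# Toy stand-ins for the numerical features of train_df and test_df.
toy_train = pd.DataFrame({"LotArea": [8000.0, 10000.0, 12000.0]})
toy_test = pd.DataFrame({"LotArea": [10000.0, 14000.0]})

# Statistics come from the training data only.
train_mean = toy_train.mean()
train_std = toy_train.std(ddof=0)  # ddof=0: population standard deviation

# The same mean and std are applied to both frames.
train_scaled = (toy_train - train_mean) / train_std
test_scaled = (toy_test - train_mean) / train_std

print(train_scaled["LotArea"].round(3).tolist())
print(test_scaled["LotArea"].round(3).tolist())
```

The scaled training column has mean 0 and standard deviation 1 by construction. The test column generally does not, because it is measured against the training statistics.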


### **2️⃣ Encode Categorical Features**

- Rebuild the list of categorical features, since some columns were dropped earlier.

In [47]:
categorical_features = [col for col in categorical_columns if col in train_df.columns]

train_df = pd.get_dummies(train_df, columns=categorical_features, drop_first=True)
test_df = pd.get_dummies(test_df, columns=categorical_features, drop_first=True)

print("Categorical Encoding Done!")
print(train_df.shape, test_df.shape)

Categorical Encoding Done!
(1460, 231) (1459, 231)
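
Both frames report 231 columns, but the training frame still contains `SalePrice`, so the two sets of feature columns cannot be identical. This can happen when `pd.get_dummies` is applied to each frame separately and a category appears in only one of them. One possible way to align them, shown here as a sketch on made-up data rather than as the notebook's method, is to reindex the test frame to the training feature columns. The example leaves out `drop_first=True`, because the first category of each frame may differ when they are encoded separately.

```python
import pandas as pd

# Toy frames: "Grvl" appears only in the training data.
toy_train = pd.DataFrame(
    {"Street": ["Pave", "Grvl", "Pave"], "SalePrice": [208500, 181500, 223500]}
)
toy_test = pd.DataFrame({"Street": ["Pave", "Pave"]})

train_encoded = pd.get_dummies(toy_train, columns=["Street"], dtype=int)
test_encoded = pd.get_dummies(toy_test, columns=["Street"], dtype=int)

# Use the training feature columns (everything except the target) as the reference.
feature_columns = train_encoded.columns.drop("SalePrice")

# Add dummy columns missing from the test frame (filled with 0) and drop extras.
test_aligned = test_encoded.reindex(columns=feature_columns, fill_value=0)

print(list(test_encoded.columns))
print(list(test_aligned.columns))
```

After alignment, the test frame has exactly the same feature columns, in the same order, as the training frame, which a fitted model will expect at prediction time.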
