# Neural Networks

In this notebook, we aim to surpass the **92%** accuracy achieved with the scikit-learn models. Our goal is to reach **98% or higher** using a neural network.

Since we have already conducted Exploratory Data Analysis (EDA) and visualizations in the scikit-learn part of the project, we will skip those steps here. Instead, we will directly focus on:
- Handling missing values,
- Encoding the dataset,
- Training a neural network model using TensorFlow and Keras.

Let’s get started!

**Checking GPU support (optional)**

In [1]:
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Is GPU available:", tf.config.list_physical_devices('GPU'))

TensorFlow version: 2.10.0
Is GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Preparing the Data

**Import the Data**

In [2]:
import pandas as pd

In [3]:
train_df = pd.read_csv("../../data/train.csv")
test_df = pd.read_csv("../../data/test.csv")

ids = test_df['Id']

# Drop Ids
train_df = train_df.drop(columns=['Id'])
test_df = test_df.drop(columns=['Id'])

print("Training Data:")
train_df.head()

Training Data:


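|  | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |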


In [4]:
print("Testing Data:")
test_df.head()

Testing Data:


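|  | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | Inside | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |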


**Handle the Missing Data**

In [5]:
train_df.isna().sum().sort_values(ascending=False)

PoolQC         1453
MiscFeature    1406
Alley          1369
Fence          1179
MasVnrType      872
               ... 
Heating           0
HeatingQC         0
MSZoning          0
1stFlrSF          0
SalePrice         0
Length: 80, dtype: int64

In [6]:
test_df.isna().sum().sort_values(ascending=False)

PoolQC           1456
MiscFeature      1408
Alley            1352
Fence            1169
MasVnrType        894
                 ... 
Electrical          0
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF        0
SaleCondition       0
Length: 79, dtype: int64

In [7]:
# Import necessary libraries
from sklearn.impute import SimpleImputer

# Handling missing values for categorical features
categorical_imputer = SimpleImputer(strategy="constant", fill_value="NoFeature")

# List of categorical columns with potential missing values
categorical_cols = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'GarageType', 'GarageQual', 
    'GarageCond', 'BsmtQual', 'BsmtCond', 'MasVnrType', 'FireplaceQu', 
    'GarageFinish', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical',
    'MSZoning', 'Utilities', 'Functional', 'Exterior1st', 'Exterior2nd', 'KitchenQual', 'SaleType'
]
train_df[categorical_cols] = categorical_imputer.fit_transform(train_df[categorical_cols])
test_df[categorical_cols] = categorical_imputer.transform(test_df[categorical_cols])

# Handling missing values for numerical features
numerical_imputer = SimpleImputer(strategy="median")

# List of numerical columns with potential missing values
numerical_cols = [
    'LotFrontage', 'GarageYrBlt', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'GarageCars', 'GarageArea'
]
train_df[numerical_cols] = numerical_imputer.fit_transform(train_df[numerical_cols])
test_df[numerical_cols] = numerical_imputer.transform(test_df[numerical_cols])

Let's check whether any missing values remain.

In [8]:
print("Missing values in train_df:", train_df.isnull().sum().sum())
print("Missing values in test_df:", test_df.isnull().sum().sum())

Missing values in train_df: 0
Missing values in test_df: 0
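
The imputation cell above learns its fill values from the training data and reuses them on the test data. A minimal pandas-only sketch of that fit-then-transform idea follows; the toy frames and their values are made up for illustration and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_df / test_df (values are made up).
toy_train = pd.DataFrame({"LotFrontage": [60.0, 80.0, np.nan, 70.0],
                          "Alley": ["Pave", None, None, "Grvl"]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 90.0],
                         "Alley": [None, "Pave"]})

# "Fit": learn the numeric fill value from the training rows only.
train_median = toy_train["LotFrontage"].median()

# "Transform": apply the same fill values to both frames.
for frame in (toy_train, toy_test):
    frame["LotFrontage"] = frame["LotFrontage"].fillna(train_median)
    frame["Alley"] = frame["Alley"].fillna("NoFeature")

print(train_median)
print(toy_test)
```

Filling the test rows with the training median, rather than a median computed on the test set, keeps the preprocessing identical to what the model saw during training.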


Both DataFrames are now free of missing values.

In [9]:
# Export the cleaned DataFrames to CSV files
train_df.to_csv("../../data/cleaned_train.csv", index=False)
test_df.to_csv("../../data/cleaned_test.csv", index=False)

print("Cleaned train_df exported to cleaned_train.csv")
print("Cleaned test_df exported to cleaned_test.csv")

Cleaned train_df exported to cleaned_train.csv
Cleaned test_df exported to cleaned_test.csv


**Encode the Data**

In [10]:
# Import necessary Encoders
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Define nominal and ordinal features
nominal_features = [
    'MSZoning', 'Street', 'Alley', 'LotConfig', 'Neighborhood', 'Condition1', 
    'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 
    'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 
    'GarageType', 'SaleType', 'SaleCondition', 'LotShape', 'LandContour',
    'Utilities', 'LandSlope', 'CentralAir', 'Electrical', 'Functional',
    'GarageFinish', 'PavedDrive', 'MiscFeature'
]

ordinal_features = [
    'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 
    'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
    'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'
]

# Define ordinal encoding mapping for ordinal features
ordinal_mapping = {
    'ExterQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'ExterCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'HeatingQC': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'KitchenQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'FireplaceQu': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'PoolQC': ['NoFeature', 'Fa', 'TA', 'Gd', 'Ex'],
    'Fence': ['NoFeature', 'MnWw', 'MnPrv', 'GdWo', 'GdPrv'],
    'BsmtExposure': ['NoFeature', 'No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['NoFeature', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'BsmtFinType2': ['NoFeature', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
}

# Convert ordinal_mapping dictionary to a list of lists in the same order as ordinal_features
ordinal_categories = [ordinal_mapping[feature] for feature in ordinal_features]

# 1. One-Hot Encoding for Nominal Features
nominal_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
nominal_encoder.fit(train_df[nominal_features])

# Transform nominal features with the encoder fitted on the training data
nominal_encoded_train = nominal_encoder.transform(train_df[nominal_features])
nominal_encoded_test = nominal_encoder.transform(test_df[nominal_features])

# Create a DataFrame for the one-hot encoded features
nominal_encoded_train_df = pd.DataFrame(
    nominal_encoded_train,
    columns=nominal_encoder.get_feature_names_out(nominal_features),
    index=train_df.index
)
nominal_encoded_test_df = pd.DataFrame(
    nominal_encoded_test,
    columns=nominal_encoder.get_feature_names_out(nominal_features),
    index=test_df.index
)

# Drop original nominal columns and concatenate one-hot encoded features
train_df = train_df.drop(columns=nominal_features).join(nominal_encoded_train_df)
test_df = test_df.drop(columns=nominal_features).join(nominal_encoded_test_df)

# 2. Ordinal Encoding for Ordinal Features
ordinal_encoder = OrdinalEncoder(categories=ordinal_categories)
ordinal_encoder.fit(train_df[ordinal_features])

# Transform ordinal features with the encoder fitted on the training data
ordinal_encoded_train = ordinal_encoder.transform(train_df[ordinal_features])
ordinal_encoded_test = ordinal_encoder.transform(test_df[ordinal_features])

# Create a DataFrame for the ordinal encoded features
ordinal_encoded_train_df = pd.DataFrame(
    ordinal_encoded_train,
    columns=ordinal_features,
    index=train_df.index
)
ordinal_encoded_test_df = pd.DataFrame(
    ordinal_encoded_test,
    columns=ordinal_features,
    index=test_df.index
)

# Drop original ordinal columns and concatenate ordinal encoded features
train_df = train_df.drop(columns=ordinal_features).join(ordinal_encoded_train_df)
test_df = test_df.drop(columns=ordinal_features).join(ordinal_encoded_test_df)
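
To make the two encodings above concrete, here is a small pandas-only sketch (it does not use the scikit-learn encoders themselves). It assumes the same category order as `ordinal_mapping`; the street categories and the `"Unknown"` value are hypothetical.

```python
import pandas as pd

quality_order = ["NoFeature", "Po", "Fa", "TA", "Gd", "Ex"]
values = pd.Series(["Gd", "TA", "NoFeature", "Ex"])

# An ordered categorical assigns each value the index of its category,
# which is the integer an ordinal encoding with this order produces.
codes = pd.Categorical(values, categories=quality_order, ordered=True).codes
print(codes.tolist())

# One-hot encoding against a fixed, training-time category list: a value
# that was not seen in training ("Unknown" here) becomes an all-zero row.
train_categories = ["Pave", "Grvl"]
test_values = pd.Series(pd.Categorical(["Pave", "Unknown"], categories=train_categories))
one_hot = pd.get_dummies(test_values, dtype=float)
print(one_hot)
```

The all-zero row mirrors what `handle_unknown='ignore'` does in the cell above, so unseen test categories do not raise an error.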

Now let's verify that the encoding was successful.

In [11]:
missing_cols = [col for col in train_df.columns if col not in test_df.columns]
missing_cols

['SalePrice']

In [12]:
train_df.head()

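|  | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | HeatingQC | KitchenQual | FireplaceQu | GarageQual | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706.0 | 0.0 | ... | 5.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 6.0 | 1.0 |
| 1 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978.0 | 0.0 | ... | 5.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 4.0 | 5.0 | 1.0 |
| 2 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486.0 | 0.0 | ... | 5.0 | 4.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 6.0 | 1.0 |
| 3 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216.0 | 0.0 | ... | 4.0 | 4.0 | 4.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1.0 |
| 4 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655.0 | 0.0 | ... | 5.0 | 4.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 6.0 | 1.0 |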


In [13]:
test_df.head()

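|  | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | HeatingQC | KitchenQual | FireplaceQu | GarageQual | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 20 | 80.0 | 11622 | 5 | 6 | 1961 | 1961 | 0.0 | 468.0 | 144.0 | ... | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 2.0 | 1.0 | 3.0 | 2.0 |
| 1 | 20 | 81.0 | 14267 | 6 | 6 | 1958 | 1958 | 108.0 | 923.0 | 0.0 | ... | 3.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1.0 |
| 2 | 60 | 74.0 | 13830 | 5 | 5 | 1997 | 1998 | 0.0 | 791.0 | 0.0 | ... | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 | 1.0 | 6.0 | 1.0 |
| 3 | 60 | 78.0 | 9978 | 6 | 6 | 1998 | 1998 | 20.0 | 602.0 | 0.0 | ... | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 6.0 | 1.0 |
| 4 | 120 | 43.0 | 5005 | 8 | 5 | 1992 | 1992 | 0.0 | 263.0 | 0.0 | ... | 5.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1.0 |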


In [14]:
# Check for non-numeric columns
non_numeric_columns_train_df = train_df.select_dtypes(include=['object']).columns
if non_numeric_columns_train_df.empty:
    print("All columns are numeric!")
else:
    print("Non-numeric columns:", non_numeric_columns_train_df)

All columns are numeric!


In [15]:
# Check for non-numeric columns
non_numeric_columns_test_df = test_df.select_dtypes(include=['object']).columns
if non_numeric_columns_test_df.empty:
    print("All columns are numeric!")
else:
    print("Non-numeric columns:", non_numeric_columns_test_df)

All columns are numeric!


In [16]:
# Export the encoded DataFrames to CSV files
train_df.to_csv("../../data/encoded_train.csv", index=False)
test_df.to_csv("../../data/encoded_test.csv", index=False)

print("Encoded train_df exported to encoded_train.csv")
print("Encoded test_df exported to encoded_test.csv")

Encoded train_df exported to encoded_train.csv
Encoded test_df exported to encoded_test.csv


Everything looks fine! Now it is time to move on to feature engineering.

## Feature Engineering

**Addressing Skewness**

In [17]:
import numpy as np

# Select only numerical columns for train and test
numerical_features_train = train_df.select_dtypes(include=['float64', 'int64'])
numerical_features_test = test_df.select_dtypes(include=['float64', 'int64'])

# Identifying skewed numerical features in train
skewed_features = numerical_features_train.skew().sort_values(ascending=False)
high_skew = skewed_features[skewed_features > 0.5].index

# Exclude 'SalePrice' from high_skew
if 'SalePrice' in high_skew:
    high_skew = high_skew.drop('SalePrice')

# Ensure all high_skew features contain non-negative values
assert (train_df[high_skew] >= 0).all().all(), "Negative values found in train data"
assert (test_df[high_skew] >= 0).all().all(), "Negative values found in test data"

# Apply log1p transformation to reduce skewness in train and test
train_df[high_skew] = train_df[high_skew].apply(np.log1p)
test_df[high_skew] = test_df[high_skew].apply(np.log1p)

# Check skewness after transformation (optional)
print("Skewness in train after transformation:")
print(train_df[high_skew].skew().sort_values(ascending=False))

print("\nSkewness in test after transformation:")
print(test_df[high_skew].skew().sort_values(ascending=False))

Skewness in train after transformation:
Condition2_RRAe        38.209946
Heating_Floor          38.209946
Exterior1st_AsphShn    38.209946
RoofMatl_Membran       38.209946
Exterior2nd_Other      38.209946
                         ...    
OverallCond            -0.254015
BsmtFinSF1             -0.618410
LotFrontage            -0.870006
BsmtUnfSF              -2.186504
TotalBsmtSF            -5.154670
Length: 198, dtype: float64

Skewness in test after transformation:
Functional_Sev         38.196859
RoofMatl_WdShngl       38.196859
Exterior2nd_Stone      38.196859
Exterior1st_CBlock     38.196859
Exterior1st_AsphShn    38.196859
                         ...    
LotArea                -0.915598
LotFrontage            -1.138210
OverallCond            -1.160771
BsmtUnfSF              -2.136927
TotalBsmtSF            -4.832565
Length: 198, dtype: float64
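
The largest skewness values that remain after the transformation appear to belong to one-hot columns such as `Condition2_RRAe`. For a 0/1 column, `log1p` only rescales the ones to `ln 2`, so the skewness cannot change. The sketch below illustrates this with a made-up rare flag and shows one possible filter (an assumption, not what the notebook does) that keeps only columns with more than two distinct values.

```python
import numpy as np
import pandas as pd

# Hypothetical one-hot column: 3 ones among 100 rows (a rare category).
rare_flag = pd.Series([1.0] * 3 + [0.0] * 97)

skew_before = rare_flag.skew()
skew_after = np.log1p(rare_flag).skew()

# log1p maps {0, 1} to {0, ln 2}: a pure rescaling, so skewness is unchanged.
print(round(skew_before, 4), round(skew_after, 4))

# One possible filter: only consider columns with more than two distinct values.
toy = pd.DataFrame({"rare_flag": rare_flag, "area": np.arange(100.0) ** 3})
non_binary = [col for col in toy.columns if toy[col].nunique() > 2]
print(non_binary)
```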


**Creating New Features**

In [18]:
# Creating new features for train_df and test_df
# Replace LotArea == 0 with NaN to avoid division errors.
# Assign the result back instead of calling replace(..., inplace=True) on a
# column selection: that pattern raises a pandas FutureWarning and will stop
# working in pandas 3.0.
train_df['LotArea'] = train_df['LotArea'].replace(0, np.nan)
test_df['LotArea'] = test_df['LotArea'].replace(0, np.nan)

# Define new features
new_features_train = pd.DataFrame({
    'TotalBathrooms': (
        train_df['FullBath'] + train_df['HalfBath'] * 0.5 +
        train_df['BsmtFullBath'] + train_df['BsmtHalfBath'] * 0.5
    ),
    'TotalSF': train_df['TotalBsmtSF'] + train_df['1stFlrSF'] + train_df['2ndFlrSF'],
    'GrLivAreaToLotArea': train_df['GrLivArea'] / train_df['LotArea'],
    'GarageAreaToLotArea': train_df['GarageArea'] / train_df['LotArea']
})

new_features_test = pd.DataFrame({
    'TotalBathrooms': (
        test_df['FullBath'] + test_df['HalfBath'] * 0.5 +
        test_df['BsmtFullBath'] + test_df['BsmtHalfBath'] * 0.5
    ),
    'TotalSF': test_df['TotalBsmtSF'] + test_df['1stFlrSF'] + test_df['2ndFlrSF'],
    'GrLivAreaToLotArea': test_df['GrLivArea'] / test_df['LotArea'],
    'GarageAreaToLotArea': test_df['GarageArea'] / test_df['LotArea']
})

# Handle missing values in new features
new_features_train.fillna(0, inplace=True)
new_features_test.fillna(0, inplace=True)

# Concatenating new features with the original DataFrames
train_df = pd.concat([train_df, new_features_train], axis=1)
test_df = pd.concat([test_df, new_features_test], axis=1)

# Validation
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

# Check for any remaining missing values (optional)
print("Missing values in train:", train_df.isnull().sum().sum())
print("Missing values in test:", test_df.isnull().sum().sum())

Train shape: (1460, 248)
Test shape: (1459, 247)
Missing values in train: 0
Missing values in test: 0


In [19]:
# Checking new features
print("Train DataFrame:")
train_df[['TotalBathrooms', 'TotalSF', 'GrLivAreaToLotArea', 'GarageAreaToLotArea']].head()

Train DataFrame:


|  | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
| --- | --- | --- | --- | --- |
| 0 | 3.039721 | 20.257977 | 0.823358 | 60.605792 |
| 1 | 2.346574 | 14.28249 | 0.778794 | 50.165642 |
| 2 | 3.039721 | 20.415959 | 0.802758 | 65.17862 |
| 3 | 1.693147 | 20.127741 | 0.81281 | 70.053677 |
| 4 | 3.039721 | 21.048414 | 0.804551 | 87.399393 |


In [20]:
print("\nTest DataFrame:")
test_df[['TotalBathrooms', 'TotalSF', 'GrLivAreaToLotArea', 'GarageAreaToLotArea']].head()


Test DataFrame:


|  | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
| --- | --- | --- | --- | --- |
| 0 | 1.0 | 13.582381 | 0.726337 | 77.985278 |
| 1 | 1.346574 | 14.385868 | 0.751945 | 32.616282 |
| 2 | 2.346574 | 20.222151 | 0.775731 | 50.552365 |
| 3 | 2.346574 | 20.184528 | 0.801552 | 51.041252 |
| 4 | 2.0 | 14.310793 | 0.839994 | 59.400879 |
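
In the tables above, `TotalSF` sits between roughly 14 and 21, which suggests that its inputs had already been `log1p`-transformed in the skewness step, so the feature is a sum of logarithms rather than a total area. The sketch below shows the difference using hypothetical raw areas (not taken from the dataset). Building such features from raw values before any transformation is one possible alternative.

```python
import numpy as np

# Hypothetical raw areas in square feet (not taken from the dataset).
bsmt_sf, first_sf, second_sf = 800.0, 900.0, 700.0

# Summing after log1p adds logarithms, which is not the log of the total.
sum_of_logs = np.log1p(bsmt_sf) + np.log1p(first_sf) + np.log1p(second_sf)

# Summing the raw areas first, then transforming, keeps "total area" meaningful.
log_of_sum = np.log1p(bsmt_sf + first_sf + second_sf)

print(round(float(sum_of_logs), 3), round(float(log_of_sum), 3))
```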


**Scaling Numerical Features**

In [21]:
from sklearn.preprocessing import StandardScaler

# Selecting numerical features for scaling (excluding 'SalePrice')
numerical_features = train_df.select_dtypes(include=['float64', 'int64']).columns.drop('SalePrice')

# Verify that numerical_features exist in both train and test datasets
assert set(numerical_features).issubset(train_df.columns)
assert set(numerical_features).issubset(test_df.columns)

# Initializing and fitting the scaler on the training data
scaler = StandardScaler()
train_df[numerical_features] = scaler.fit_transform(train_df[numerical_features])

# Transforming the test data using the same scaler
test_df[numerical_features] = scaler.transform(test_df[numerical_features])

# Validation (optional)
print("Mean and Std of scaled features in train:")
print(train_df[numerical_features].mean().round(2))  # Should be ~0
print(train_df[numerical_features].std().round(2))   # Should be ~1

Mean and Std of scaled features in train:
MSSubClass            -0.0
LotFrontage            0.0
LotArea               -0.0
OverallQual            0.0
OverallCond            0.0
                      ... 
BsmtFinType2          -0.0
TotalBathrooms        -0.0
TotalSF                0.0
GrLivAreaToLotArea    -0.0
GarageAreaToLotArea   -0.0
Length: 247, dtype: float64
MSSubClass             1.0
LotFrontage            1.0
LotArea                1.0
OverallQual            1.0
OverallCond            1.0
                      ... 
BsmtFinType2           1.0
TotalBathrooms         1.0
TotalSF                1.0
GrLivAreaToLotArea     1.0
GarageAreaToLotArea    1.0
Length: 247, dtype: float64
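
As a rough illustration of what the scaling cell does, the sketch below standardizes a single column by hand, using the first few `LotArea` values shown at the top of the notebook. It assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses, whereas pandas `.std()` defaults to `ddof=1`.

```python
import pandas as pd

toy_train = pd.DataFrame({"LotArea": [8450.0, 9600.0, 11250.0, 9550.0]})
toy_test = pd.DataFrame({"LotArea": [11622.0, 14267.0]})

# Statistics come from the training rows only (population std, ddof=0).
mean = toy_train["LotArea"].mean()
std = toy_train["LotArea"].std(ddof=0)

scaled_train = (toy_train["LotArea"] - mean) / std
scaled_test = (toy_test["LotArea"] - mean) / std  # reuses the training statistics

print(round(scaled_train.mean(), 6), round(scaled_train.std(ddof=0), 6))
print(round(scaled_train.std(), 4))  # pandas default ddof=1, slightly above 1
```

With only four rows the `ddof=1` value is visibly above 1. With 1,460 rows the difference is small enough that the printed standard deviations round to 1.0.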


In [22]:
train_df['SalePrice'] = np.log1p(train_df['SalePrice'])
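
The target is trained on the `log1p` scale, so predictions must later be mapped back with `expm1`. A quick round-trip check on the first few `SalePrice` values shown earlier:

```python
import numpy as np

prices = np.array([208500.0, 181500.0, 223500.0])  # first SalePrice values above

log_prices = np.log1p(prices)     # the scale the network is trained on
recovered = np.expm1(log_prices)  # inverse transform, back to the price scale

print(log_prices.round(4))
print(np.allclose(recovered, prices))
```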

In [23]:
print("Scaled train data:")
train_df[numerical_features].head()

Scaled train data:


|  | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.424462 | -0.078896 | -0.13327 | 0.651479 | -0.460408 | 1.050994 | 0.878668 | 1.203619 | 0.779431 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | -0.553434 | 1.164712 | -0.237308 | 1.546225 | 1.081632 | 0.500534 | 0.396376 |
| 1 | -1.125202 | 0.572719 | 0.113413 | -0.071836 | 1.948163 | 0.156734 | -0.429577 | -0.806841 | 0.888257 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | 1.949086 | 0.690115 | -0.237308 | 0.507699 | -0.69674 | -0.439243 | -0.06546 |
| 2 | 0.424462 | 0.062541 | 0.420049 | 0.651479 | -0.460408 | 0.984752 | 0.830215 | 1.131524 | 0.654803 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | 0.553949 | 1.164712 | -0.237308 | 1.546225 | 1.12865 | 0.066114 | 0.598663 |
| 3 | 0.645073 | -0.329561 | 0.103317 | 0.651479 | -0.460408 | -1.863632 | -0.720298 | -0.806841 | 0.384539 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | -0.553434 | 0.690115 | -0.237308 | -0.471315 | 1.042873 | 0.2781 | 0.814319 |
| 4 | 0.424462 | 0.726089 | 0.878431 | 1.374795 | -0.460408 | 0.951632 | 0.733308 | 1.423411 | 0.7544 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | 1.339649 | 1.164712 | -0.237308 | 1.546225 | 1.316875 | 0.103932 | 1.581634 |


In [24]:
print("Scaled test data:")
test_df[numerical_features].head()

Scaled test data:


|  | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -1.125202 | 0.572719 | 0.482944 | -0.795151 | 0.455288 | -0.340077 | -1.15638 | -0.806841 | 0.642211 | 2.342933 | ... | 0.265618 | -0.066618 | 1.711993 | -0.553434 | -0.259078 | 1.190471 | -1.509841 | -0.905101 | -1.54546 | 1.165186 |
| 1 | -1.125202 | 0.61176 | 0.87938 | -0.071836 | 0.455288 | -0.43944 | -1.30174 | 0.978395 | 0.868926 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | -0.553434 | 0.690115 | -0.237308 | -0.990578 | -0.665974 | -1.005443 | -0.841784 |
| 2 | 0.424462 | 0.327844 | 0.819235 | -0.795151 | -0.460408 | 0.852269 | 0.6364 | -0.806841 | 0.817388 | -0.355342 | ... | 0.265618 | -0.066618 | 1.711993 | -0.553434 | 1.164712 | -0.237308 | 0.507699 | 1.07097 | -0.503836 | -0.048353 |
| 3 | 0.424462 | 0.49317 | 0.188077 | -0.071836 | 0.455288 | 0.88539 | 0.6364 | 0.351715 | 0.726234 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | -0.553434 | 1.164712 | -0.237308 | 0.507699 | 1.059773 | 0.040683 | -0.026726 |
| 4 | 1.41981 | -1.369005 | -1.145753 | 1.374795 | -0.460408 | 0.686666 | 0.345679 | -0.806841 | 0.450086 | -0.355342 | ... | 0.265618 | -0.066618 | -0.477705 | -0.553434 | 0.690115 | -0.237308 | -0.011565 | -0.688317 | 0.85136 | 0.343075 |


In [25]:
# Check for zeros in divisor columns
print(f"LotArea with zeros in train_df: {sum(train_df['LotArea'] == 0)}")
print(f"LotArea with zeros in test_df: {sum(test_df['LotArea'] == 0)}")

LotArea with zeros in train_df: 0
LotArea with zeros in test_df: 0


In [26]:
# Example: Using IQR to detect outliers in SalePrice (train only)
Q1 = train_df['SalePrice'].quantile(0.25)
Q3 = train_df['SalePrice'].quantile(0.75)
IQR = Q3 - Q1
outliers = train_df[(train_df['SalePrice'] < Q1 - 1.5 * IQR) | (train_df['SalePrice'] > Q3 + 1.5 * IQR)]
print(f"Number of outliers in SalePrice: {len(outliers)}")

Number of outliers in SalePrice: 28
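
Note that `SalePrice` was already `log1p`-transformed in cell 22, so the IQR fences above are computed on the log scale. The rule itself can be seen on a small made-up series:

```python
import pandas as pd

toy = pd.Series([10, 11, 12, 12, 13, 13, 14, 15, 40])  # 40 is an obvious outlier

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Anything outside [lower, upper] is flagged as an outlier.
outlier_mask = (toy < lower) | (toy > upper)
print(toy[outlier_mask].tolist())
```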


In [27]:
# Export the feature-engineered DataFrames to CSV files
train_df.to_csv("../../data/featured_train.csv", index=False)
test_df.to_csv("../../data/featured_test.csv", index=False)

print("Feature Engineered train_df exported to featured_train.csv")
print("Feature Engineered test_df exported to featured_test.csv")

Feature Engineered train_df exported to featured_train.csv
Feature Engineered test_df exported to featured_test.csv


So far so good! Now it is time to train our neural network!

## Model Training

**Split the Data into X and y and Prepare Tensors**

In [28]:
from sklearn.model_selection import train_test_split

X = train_df.drop(columns=['SalePrice'])
y = train_df['SalePrice']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [29]:
# Convert data to TensorFlow tensors directly
import tensorflow as tf

X_train_tensor = tf.convert_to_tensor(X_train, dtype=tf.float32)
y_train_tensor = tf.convert_to_tensor(y_train, dtype=tf.float32)
X_test_tensor = tf.convert_to_tensor(X_test, dtype=tf.float32)
y_test_tensor = tf.convert_to_tensor(y_test, dtype=tf.float32)

# Validate tensor shapes
print("X_train_tensor shape:", X_train_tensor.shape)
print("y_train_tensor shape:", y_train_tensor.shape)
print("X_test_tensor shape:", X_test_tensor.shape)
print("y_test_tensor shape:", y_test_tensor.shape)

# Optional: Create TensorFlow datasets for batching
train_dataset = tf.data.Dataset.from_tensor_slices((X_train_tensor, y_train_tensor)).batch(32)
test_dataset = tf.data.Dataset.from_tensor_slices((X_test_tensor, y_test_tensor)).batch(32)

X_train_tensor shape: (1168, 247)
y_train_tensor shape: (1168,)
X_test_tensor shape: (292, 247)
y_test_tensor shape: (292,)


In [30]:
type(X_train_tensor), type(X_test_tensor), type(y_train_tensor), type(y_test_tensor)

(tensorflow.python.framework.ops.EagerTensor,
 tensorflow.python.framework.ops.EagerTensor,
 tensorflow.python.framework.ops.EagerTensor,
 tensorflow.python.framework.ops.EagerTensor)

**Create and Train the Model**

In [31]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.optimizers import Adam

# Define the model
model = Sequential([
    Dense(128, activation='relu', input_shape=(X_train_tensor.shape[1],)),
    BatchNormalization(),
    Dense(64, activation='relu'),
    BatchNormalization(),
    Dense(32, activation='relu'),
    Dense(1)
])

# Create optimizer
optimizer = Adam(learning_rate=0.0005)

# Compile the model
model.compile(optimizer=optimizer, loss='mse', metrics=['mae'])

# Summary of the model
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 128)               31744     
                                                                 
 batch_normalization (BatchN  (None, 128)              512       
 ormalization)                                                   
                                                                 
 dense_1 (Dense)             (None, 64)                8256      
                                                                 
 batch_normalization_1 (Batc  (None, 64)               256       
 hNormalization)                                                 
                                                                 
 dense_2 (Dense)             (None, 32)                2080      
                                                                 
 dense_3 (Dense)             (None, 1)                 33
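
The parameter counts in the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit, and `BatchNormalization` keeps four values per feature. A short sketch with two illustrative helper functions (the function names are not part of Keras):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    # Weights plus one bias per unit.
    return n_inputs * n_units + n_units


def batchnorm_params(n_features: int) -> int:
    # gamma, beta, moving mean and moving variance per feature.
    return 4 * n_features


layer_params = [
    dense_params(247, 128),   # dense
    batchnorm_params(128),    # batch_normalization
    dense_params(128, 64),    # dense_1
    batchnorm_params(64),     # batch_normalization_1
    dense_params(64, 32),     # dense_2
    dense_params(32, 1),      # dense_3
]
total_params = sum(layer_params)
print(layer_params, total_params)
```

The final layer works out to 32 weights plus 1 bias, i.e. 33 parameters.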

In [32]:
# Create EarlyStopping Callback
from tensorflow.keras.callbacks import EarlyStopping

early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=20,
    restore_best_weights=True,
    verbose=1
)
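
To illustrate what `patience` means, here is a simplified pure-Python simulation of the early-stopping rule. It ignores options such as `min_delta`, the validation losses are made up, and the helper name is hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), 1-based; stop_epoch is None if never triggered."""
    best_loss, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


made_up_losses = [0.9, 0.5, 0.4, 0.45, 0.41, 0.42, 0.43]
print(early_stopping_epoch(made_up_losses, patience=3))
```

With `restore_best_weights=True`, the weights from the best epoch are restored once training stops.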

In [33]:
from tensorflow.keras.callbacks import ReduceLROnPlateau

lr_scheduler = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=5, verbose=1, min_lr=1e-6)
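
The scheduler halves the learning rate whenever `val_loss` has not improved for `patience` epochs. The sketch below is a simplified simulation of that rule (it ignores `min_delta` and cooldown), run on made-up losses with the same `factor`, `patience`, and `min_lr` as above.

```python
def reduce_on_plateau(val_losses, initial_lr, factor, patience, min_lr):
    """Return the learning rate in effect after each epoch (simplified)."""
    lr, best, wait, history = initial_lr, float("inf"), 0, []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # never drop below min_lr
                wait = 0
        history.append(lr)
    return history


made_up_losses = [0.5, 0.4] + [0.41] * 5
lrs = reduce_on_plateau(made_up_losses, initial_lr=0.0005, factor=0.5,
                        patience=5, min_lr=1e-6)
print(lrs[-1])
```

One reduction from 0.0005 gives 0.00025, which matches the value logged at epoch 67 in the training output up to float32 rounding.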

In [34]:
# Create TensorBoard for Visualization
import datetime

log_dir = "../logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

In [35]:
# Train the Model
history = model.fit(
    X_train_tensor, y_train_tensor, epochs=1000,
    batch_size=32, validation_split=0.2,
    callbacks=[early_stopping, lr_scheduler, tensorboard_callback]
)

Epoch 1/1000
Epoch 2/1000
Epoch 3/1000
Epoch 4/1000
Epoch 5/1000
Epoch 6/1000
Epoch 7/1000
Epoch 8/1000
Epoch 9/1000
Epoch 10/1000
Epoch 11/1000
Epoch 12/1000
Epoch 13/1000
Epoch 14/1000
Epoch 15/1000
Epoch 16/1000
Epoch 17/1000
Epoch 18/1000
Epoch 19/1000
Epoch 20/1000
Epoch 21/1000
Epoch 22/1000
Epoch 23/1000
Epoch 24/1000
Epoch 25/1000
Epoch 26/1000
Epoch 27/1000
Epoch 28/1000
Epoch 29/1000
Epoch 30/1000
Epoch 31/1000
Epoch 32/1000
Epoch 33/1000
Epoch 34/1000
Epoch 35/1000
Epoch 36/1000
Epoch 37/1000
Epoch 38/1000
Epoch 39/1000
Epoch 40/1000
Epoch 41/1000
Epoch 42/1000
Epoch 43/1000
Epoch 44/1000
Epoch 45/1000
Epoch 46/1000
Epoch 47/1000
Epoch 48/1000
Epoch 49/1000
Epoch 50/1000
Epoch 51/1000
Epoch 52/1000
Epoch 53/1000
Epoch 54/1000
Epoch 55/1000
Epoch 56/1000
Epoch 57/1000
Epoch 58/1000
Epoch 59/1000
Epoch 60/1000
Epoch 61/1000
Epoch 62/1000
Epoch 63/1000
Epoch 64/1000
Epoch 65/1000
Epoch 66/1000
Epoch 67/1000
Epoch 67: ReduceLROnPlateau reducing learning rate to 0.000250000011874
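
The number of rows used in each epoch follows from the two splits: `train_test_split` holds out 20% of the 1,460 rows, and `validation_split=0.2` then sets aside the last 20% of the remaining rows (Keras takes them from the end of the arrays, before shuffling). A quick arithmetic sketch:

```python
import math

n_rows = 1460                            # rows in train_df
n_holdout = math.ceil(0.2 * n_rows)      # test_size=0.2 in train_test_split
n_fit = n_rows - n_holdout               # rows passed to model.fit

# validation_split=0.2: the last 20% of the rows passed to fit are held out.
n_train = math.floor(n_fit * (1 - 0.2))
n_val = n_fit - n_train
steps_per_epoch = math.ceil(n_train / 32)  # batch_size=32

print(n_holdout, n_fit, n_train, n_val, steps_per_epoch)
```

The first two numbers agree with the tensor shapes printed earlier (292 and 1168 rows).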

Run `TensorBoard` in Jupyter Notebook

In [36]:
test_loss, test_mae = model.evaluate(X_test_tensor, y_test_tensor)
print(f"Test Loss: {test_loss}, Test MAE: {test_mae}")

Test Loss: 0.362118661403656, Test MAE: 0.4178880751132965


In [37]:
%load_ext tensorboard

In [38]:
%tensorboard --logdir ../logs/fit

Reusing TensorBoard on port 6006 (pid 19808), started 3:02:43 ago. (Use '!kill 19808' to kill it.)

## Evaluate the Model

In [39]:
loss, mae = model.evaluate(X_test_tensor, y_test_tensor, verbose=0)
print(f"Test MAE: {mae:.4f}")

Test MAE: 0.4179
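
Because the target was `log1p`-transformed, this MAE is measured in log units, not in dollars. As a rough guide, a log-scale error of `d` corresponds to a multiplicative factor of about `exp(d)` on the price. The sketch below converts the reported value and shows how a dollar-scale MAE could be computed; the two small arrays are placeholders, not model output.

```python
import numpy as np

log_mae = 0.4179  # Test MAE reported above, in log1p(SalePrice) units

# A log-scale error of d corresponds to a multiplicative factor of exp(d).
typical_ratio = float(np.exp(log_mae))
print(f"Typical multiplicative error: about {typical_ratio:.2f}x")

# With predictions and targets in hand, a dollar-scale MAE can be computed
# after inverting the transform (placeholder values, not model output).
y_true_log = np.log1p(np.array([200000.0, 150000.0]))
y_pred_log = np.log1p(np.array([210000.0, 120000.0]))
mae_dollars = float(np.mean(np.abs(np.expm1(y_pred_log) - np.expm1(y_true_log))))
print(round(mae_dollars, 2))
```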


## File Submission

In [41]:
# Use the preprocessed test data ('Id' was already dropped when loading).
# Copy it so that adding columns below does not modify test_df.
X_submission = test_df.copy()

# Apply consistent preprocessing to X_submission
missing_features = set(X_train.columns) - set(X_submission.columns)
for feature in missing_features:
    X_submission[feature] = 0
X_submission = X_submission[X_train.columns]

# Convert to TensorFlow tensor
X_submission_tensor = tf.convert_to_tensor(X_submission, dtype=tf.float32)

# Make predictions
predictions = model.predict(X_submission_tensor)
final_predictions = np.expm1(predictions)

# Create submission DataFrame with original Ids
submission = pd.DataFrame({
    "Id": ids,
    "SalePrice": final_predictions.flatten()
})

# Save to CSV
submission.to_csv("../../data/submission_nn.csv", index=False)
print("Submission file created: submission_nn.csv")

Submission file created: submission_nn.csv
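
The column-alignment step in the cell above (add missing columns as zeros, then reorder) can also be expressed with a single `reindex` call. This is an alternative sketch on a toy frame with hypothetical column names, not the notebook's own code:

```python
import pandas as pd

train_columns = ["LotArea", "Street_Grvl", "Street_Pave"]  # hypothetical
toy_submission = pd.DataFrame({"Street_Pave": [1.0, 1.0], "LotArea": [0.3, -1.2]})

# reindex returns a new frame: missing columns are created and filled with 0,
# and the column order follows train_columns.
aligned = toy_submission.reindex(columns=train_columns, fill_value=0)
print(aligned)
```

Since `reindex` returns a new DataFrame, the original frame is left unchanged.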


In [42]:
submission_df = pd.read_csv("../../data/submission_nn.csv")
submission_df.head()

|  | Id | SalePrice |
| --- | --- | --- |
| 0 | 1461 | 120250.54 |
| 1 | 1462 | 65283.05 |
| 2 | 1463 | 179427.02 |
| 3 | 1464 | 377754.3 |
| 4 | 1465 | 364144.34 |
