# Neural Networks

In this notebook, we aim to surpass the accuracy achieved with the scikit-learn models, which was **92%**. Our goal is to reach **at least 98%** by leveraging neural networks.

Since we have already conducted Exploratory Data Analysis (EDA) and visualizations in the scikit-learn part of the project, we will skip those steps here. Instead, we will directly focus on:
- Handling missing values,
- Encoding the dataset,
- Training a neural network model using TensorFlow and Keras.

Let’s get started!

**Checking GPU support (optional)**

In [1]:
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Is GPU available:", tf.config.list_physical_devices('GPU'))

TensorFlow version: 2.10.0
Is GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Preparing the Data

**Import the Data**

In [2]:
import pandas as pd

In [3]:
train_df = pd.read_csv("../../data/train.csv")
test_df = pd.read_csv("../../data/test.csv")

print("Training Data:")
train_df.head()

Training Data:


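| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |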


In [4]:
print("Testing Data:")
test_df.head()

Testing Data:


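| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |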


**Handle the Missing Data**

In [5]:
train_df.isna().sum().sort_values(ascending=False)

PoolQC         1453
MiscFeature    1406
Alley          1369
Fence          1179
MasVnrType      872
               ... 
ExterQual         0
Exterior2nd       0
Exterior1st       0
RoofMatl          0
SalePrice         0
Length: 81, dtype: int64

In [6]:
test_df.isna().sum().sort_values(ascending=False)

PoolQC           1456
MiscFeature      1408
Alley            1352
Fence            1169
MasVnrType        894
                 ... 
Electrical          0
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF        0
SaleCondition       0
Length: 80, dtype: int64

In [7]:
# Import necessary libraries
from sklearn.impute import SimpleImputer

# Handling missing values for categorical features
categorical_imputer = SimpleImputer(strategy="constant", fill_value="NoFeature")

# List of categorical columns with potential missing values
categorical_cols = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'GarageType', 'GarageQual', 
    'GarageCond', 'BsmtQual', 'BsmtCond', 'MasVnrType', 'FireplaceQu', 
    'GarageFinish', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical',
    'MSZoning', 'Utilities', 'Functional', 'Exterior1st', 'Exterior2nd', 'KitchenQual', 'SaleType'
]
# Fit on the training data only, then reuse the fitted imputer on the test data
train_df[categorical_cols] = categorical_imputer.fit_transform(train_df[categorical_cols])
test_df[categorical_cols] = categorical_imputer.transform(test_df[categorical_cols])

# Handling missing values for numerical features
numerical_imputer = SimpleImputer(strategy="median")

# List of numerical columns with potential missing values
numerical_cols = [
    'LotFrontage', 'GarageYrBlt', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'GarageCars', 'GarageArea'
]
# Medians are learned from the training data and applied unchanged to the test data
train_df[numerical_cols] = numerical_imputer.fit_transform(train_df[numerical_cols])
test_df[numerical_cols] = numerical_imputer.transform(test_df[numerical_cols])
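
An imputer learns its fill value from whatever data it is fitted on. The following is a minimal sketch on toy data (the values are made up, not taken from the dataset) of fitting on the training set and reusing the fitted imputer on the test set, so that both sets are filled with the same statistic:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames; the values are illustrative only
toy_train = pd.DataFrame({"LotFrontage": [60.0, 70.0, 80.0, np.nan]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 100.0, 120.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(toy_train)  # the median is learned from the training data only

train_filled = imputer.transform(toy_train)
test_filled = imputer.transform(toy_test)  # filled with the training median

print("Learned median:", imputer.statistics_[0])
print("Imputed test value:", test_filled[0, 0])
```

Had the imputer been refitted on the toy test frame, the missing test value would have been filled with the test median instead, so the two sets would no longer share a reference statistic.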

Let's check whether any missing values remain.

In [8]:
print("Missing values in train_df:", train_df.isnull().sum().sum())
print("Missing values in test_df:", test_df.isnull().sum().sum())

Missing values in train_df: 0
Missing values in test_df: 0


Both DataFrames are now free of missing values.

In [9]:
# Export the cleaned train_df and test_df to CSV files
train_df.to_csv("../../data/cleaned_train.csv", index=False)
test_df.to_csv("../../data/cleaned_test.csv", index=False)

print("Cleaned train_df exported to cleaned_train.csv")
print("Cleaned test_df exported to cleaned_test.csv")

Cleaned train_df exported to cleaned_train.csv
Cleaned test_df exported to cleaned_test.csv


**Encode the Data**

In [10]:
# Import necessary Encoders
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Define nominal and ordinal features
nominal_features = [
    'MSZoning', 'Street', 'Alley', 'LotConfig', 'Neighborhood', 'Condition1', 
    'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 
    'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 
    'GarageType', 'SaleType', 'SaleCondition', 'LotShape', 'LandContour',
    'Utilities', 'LandSlope', 'CentralAir', 'Electrical', 'Functional',
    'GarageFinish', 'PavedDrive', 'MiscFeature'
]

ordinal_features = [
    'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 
    'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
    'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'
]

# Define ordinal encoding mapping for ordinal features
ordinal_mapping = {
    'ExterQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'ExterCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'HeatingQC': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'KitchenQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'FireplaceQu': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'PoolQC': ['NoFeature', 'Fa', 'TA', 'Gd', 'Ex'],
    'Fence': ['NoFeature', 'MnWw', 'MnPrv', 'GdWo', 'GdPrv'],
    'BsmtExposure': ['NoFeature', 'No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['NoFeature', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'BsmtFinType2': ['NoFeature', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
}

# Convert ordinal_mapping dictionary to a list of lists in the same order as ordinal_features
ordinal_categories = [ordinal_mapping[feature] for feature in ordinal_features]

# 1. One-Hot Encoding for Nominal Features
nominal_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
nominal_encoder.fit(train_df[nominal_features])

# Transform nominal features with the encoder fitted on the training data
nominal_encoded_train = nominal_encoder.transform(train_df[nominal_features])
nominal_encoded_test = nominal_encoder.transform(test_df[nominal_features])

# Create a DataFrame for the one-hot encoded features
nominal_encoded_train_df = pd.DataFrame(
    nominal_encoded_train,
    columns=nominal_encoder.get_feature_names_out(nominal_features),
    index=train_df.index
)
nominal_encoded_test_df = pd.DataFrame(
    nominal_encoded_test,
    columns=nominal_encoder.get_feature_names_out(nominal_features),
    index=test_df.index
)

# Drop original nominal columns and concatenate one-hot encoded features
train_df = train_df.drop(columns=nominal_features).join(nominal_encoded_train_df)
test_df = test_df.drop(columns=nominal_features).join(nominal_encoded_test_df)

# 2. Ordinal Encoding for Ordinal Features
ordinal_encoder = OrdinalEncoder(categories=ordinal_categories)
ordinal_encoder.fit(train_df[ordinal_features])

# Transform ordinal features with the encoder fitted on the training data
ordinal_encoded_train = ordinal_encoder.transform(train_df[ordinal_features])
ordinal_encoded_test = ordinal_encoder.transform(test_df[ordinal_features])

# Create a DataFrame for the ordinal encoded features
ordinal_encoded_train_df = pd.DataFrame(
    ordinal_encoded_train,
    columns=ordinal_features,
    index=train_df.index
)
ordinal_encoded_test_df = pd.DataFrame(
    ordinal_encoded_test,
    columns=ordinal_features,
    index=test_df.index
)

# Drop original ordinal columns and concatenate ordinal encoded features
train_df = train_df.drop(columns=ordinal_features).join(ordinal_encoded_train_df)
test_df = test_df.drop(columns=ordinal_features).join(ordinal_encoded_test_df)
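
To illustrate what the two encoders do, here is a small sketch on toy data (the rows and the unseen category "Dirt" are invented for illustration). With `handle_unknown="ignore"`, a category that was not seen during fitting becomes an all-zero row, and `OrdinalEncoder` maps each value to its position in the supplied category list:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy frames; "Dirt" is an invented category that never appears in training
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"],
                          "ExterQual": ["TA", "Gd", "Ex"]})
toy_test = pd.DataFrame({"Street": ["Pave", "Dirt"],
                         "ExterQual": ["Fa", "Gd"]})

# One-hot: fit on train, transform test; .toarray() densifies the sparse result
onehot = OneHotEncoder(handle_unknown="ignore")
onehot.fit(toy_train[["Street"]])
street_test = onehot.transform(toy_test[["Street"]]).toarray()

# Ordinal: the explicit list defines the order, so "Fa" -> 2 and "Gd" -> 4
quality_order = ["NoFeature", "Po", "Fa", "TA", "Gd", "Ex"]
ordinal = OrdinalEncoder(categories=[quality_order])
ordinal.fit(toy_train[["ExterQual"]])
qual_test = ordinal.transform(toy_test[["ExterQual"]])

print("One-hot categories:", list(onehot.categories_[0]))
print("Encoded Street (test):", street_test.tolist())
print("Encoded ExterQual (test):", qual_test.ravel().tolist())
```

Because the category order is given explicitly, a test value such as "Fa" is encoded correctly even though it never occurs in the toy training frame.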

Now let's verify that the encoding was successful. The only column present in `train_df` but not in `test_df` should be the target, `SalePrice`.

In [11]:
missing_cols = [col for col in train_df.columns if col not in test_df.columns]
missing_cols

['SalePrice']

In [13]:
train_df.head()

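| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | HeatingQC | KitchenQual | FireplaceQu | GarageQual | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706.0 | ... | 5.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 6.0 | 1.0 |
| 1 | 2 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978.0 | ... | 5.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 4.0 | 5.0 | 1.0 |
| 2 | 3 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486.0 | ... | 5.0 | 4.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 2.0 | 6.0 | 1.0 |
| 3 | 4 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216.0 | ... | 4.0 | 4.0 | 4.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1.0 |
| 4 | 5 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655.0 | ... | 5.0 | 4.0 | 3.0 | 3.0 | 3.0 | 0.0 | 0.0 | 3.0 | 6.0 | 1.0 |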


In [14]:
test_df.head()

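| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | HeatingQC | KitchenQual | FireplaceQu | GarageQual | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | 80.0 | 11622 | 5 | 6 | 1961 | 1961 | 0.0 | 468.0 | ... | 3.0 | 3.0 | 0.0 | 3.0 | 3.0 | 0.0 | 2.0 | 1.0 | 3.0 | 2.0 |
| 1 | 1462 | 20 | 81.0 | 14267 | 6 | 6 | 1958 | 1958 | 108.0 | 923.0 | ... | 3.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1.0 |
| 2 | 1463 | 60 | 74.0 | 13830 | 5 | 5 | 1997 | 1998 | 0.0 | 791.0 | ... | 4.0 | 3.0 | 3.0 | 3.0 | 3.0 | 0.0 | 2.0 | 1.0 | 6.0 | 1.0 |
| 3 | 1464 | 60 | 78.0 | 9978 | 6 | 6 | 1998 | 1998 | 20.0 | 602.0 | ... | 5.0 | 4.0 | 4.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 6.0 | 1.0 |
| 4 | 1465 | 120 | 43.0 | 5005 | 8 | 5 | 1992 | 1992 | 0.0 | 263.0 | ... | 5.0 | 4.0 | 0.0 | 3.0 | 3.0 | 0.0 | 0.0 | 1.0 | 5.0 | 1.0 |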


In [18]:
# Check for non-numeric columns
non_numeric_columns_train_df = train_df.select_dtypes(include=['object']).columns
if non_numeric_columns_train_df.empty:
    print("All columns are numeric!")
else:
    print("Non-numeric columns:", non_numeric_columns_train_df)

All columns are numeric!


In [19]:
# Check for non-numeric columns
non_numeric_columns_test_df = test_df.select_dtypes(include=['object']).columns
if non_numeric_columns_test_df.empty:
    print("All columns are numeric!")
else:
    print("Non-numeric columns:", non_numeric_columns_test_df)

All columns are numeric!


In [16]:
# Export the encoded train_df and test_df to CSV files
train_df.to_csv("../../data/encoded_train.csv", index=False)
test_df.to_csv("../../data/encoded_test.csv", index=False)

print("Encoded train_df exported to encoded_train.csv")
print("Encoded test_df exported to encoded_test.csv")

Encoded train_df exported to encoded_train.csv
Encoded test_df exported to encoded_test.csv


Everything looks fine. Time to move on to feature engineering.

## Feature Engineering

**Addressing Skewness**

In [21]:
import numpy as np

# Select only numerical columns for train and test
numerical_features_train = train_df.select_dtypes(include=['float64', 'int64'])
numerical_features_test = test_df.select_dtypes(include=['float64', 'int64'])

# Identifying skewed numerical features in train
skewed_features = numerical_features_train.skew().sort_values(ascending=False)
high_skew = skewed_features[skewed_features > 0.5].index

# Exclude 'SalePrice' from high_skew
if 'SalePrice' in high_skew:
    high_skew = high_skew.drop('SalePrice')

# Apply log1p transformation to reduce skewness in the training data
train_df[high_skew] = np.log1p(train_df[high_skew])

# Apply the same transformation to the same columns in the test data
test_df[high_skew] = np.log1p(test_df[high_skew])

# Check skewness after transformation (optional)
print("Skewness in train after transformation:")
print(train_df[high_skew].skew().sort_values(ascending=False))

print("\nSkewness in test after transformation:")
print(test_df[high_skew].skew().sort_values(ascending=False))

Skewness in train after transformation:
Exterior2nd_Other    38.209946
Heating_Floor        38.209946
Utilities_NoSeWa     38.209946
Condition2_RRAe      38.209946
RoofMatl_Metal       38.209946
                       ...    
HalfBath              0.540108
ExterQual             0.436780
MasVnrArea            0.420593
BsmtFinType2          0.353697
BsmtExposure         -0.113315
Length: 182, dtype: float64

Skewness in test after transformation:
Functional_Sev          38.196859
Exterior2nd_Stone       38.196859
RoofMatl_WdShngl        38.196859
Exterior1st_AsphShn     38.196859
Exterior2nd_AsphShn     38.196859
                          ...    
Electrical_NoFeature     0.000000
Exterior1st_Stone        0.000000
Heating_OthW             0.000000
Condition2_RRNn          0.000000
Condition2_RRAe          0.000000
Length: 182, dtype: float64
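
In the output above, the largest skew values belong to one-hot columns such as `Utilities_NoSeWa`. For a 0/1 indicator, `log1p` only rescales the ones to log 2, which leaves the skewness unchanged. One possible refinement, sketched here under the assumption that only non-binary, non-negative numeric columns should be transformed (the helper `skewed_continuous_columns` and the toy values are hypothetical):

```python
import numpy as np
import pandas as pd

def skewed_continuous_columns(df, threshold=0.5, exclude=()):
    """Pick numeric columns that are non-binary, non-negative and skewed."""
    numeric = df.select_dtypes(include="number")
    numeric = numeric.drop(columns=list(exclude), errors="ignore")
    # Skip 0/1 indicator columns, where log1p cannot change the skewness
    non_binary = [c for c in numeric.columns if numeric[c].nunique() > 2]
    # log1p is undefined for values <= -1, so keep non-negative columns only
    non_negative = [c for c in non_binary if (numeric[c] >= 0).all()]
    skew = numeric[non_negative].skew()
    return skew[skew > threshold].index.tolist()

# Toy frame; the values are illustrative only
toy = pd.DataFrame({
    "LotArea": [1000.0, 1200.0, 1500.0, 2000.0, 50000.0],
    "Flag": [0, 0, 0, 0, 1],
    "OverallQual": [4, 5, 6, 7, 8],
    "SalePrice": [100, 120, 150, 200, 900],
})

cols = skewed_continuous_columns(toy, exclude=["SalePrice"])
toy[cols] = np.log1p(toy[cols])
print("Transformed columns:", cols)
```

The column list would be computed on the training frame and then reused for the test frame, as in the cell above.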


**Creating New Features**

In [22]:
# Creating new features for train_df and test_df
new_features_train = pd.DataFrame({
    'TotalBathrooms': (
        train_df['FullBath'] + train_df['HalfBath'] * 0.5 + train_df['BsmtFullBath'] + train_df['BsmtHalfBath'] * 0.5
    ),
    'TotalSF': train_df['TotalBsmtSF'] + train_df['1stFlrSF'] + train_df['2ndFlrSF'],
    'GrLivAreaToLotArea': train_df['GrLivArea'] / train_df['LotArea'],
    'GarageAreaToLotArea': train_df['GarageArea'] / train_df['LotArea']
})

new_features_test = pd.DataFrame({
    'TotalBathrooms': (
        test_df['FullBath'] + test_df['HalfBath'] * 0.5 + test_df['BsmtFullBath'] + test_df['BsmtHalfBath'] * 0.5
    ),
    'TotalSF': test_df['TotalBsmtSF'] + test_df['1stFlrSF'] + test_df['2ndFlrSF'],
    'GrLivAreaToLotArea': test_df['GrLivArea'] / test_df['LotArea'],
    'GarageAreaToLotArea': test_df['GarageArea'] / test_df['LotArea']
})

# Concatenating new features with the original DataFrames
train_df = pd.concat([train_df, new_features_train], axis=1)
test_df = pd.concat([test_df, new_features_test], axis=1)

In [23]:
# Checking new features
print("Train DataFrame:")
train_df[['TotalBathrooms', 'TotalSF', 'GrLivAreaToLotArea', 'GarageAreaToLotArea']].head()

Train DataFrame:


| | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
|---|---|---|---|---|
| 0 | 2.956442 | 20.257977 | 0.823358 | 60.605792 |
| 1 | 2.263295 | 14.28249 | 0.778794 | 50.165642 |
| 2 | 2.956442 | 20.415959 | 0.802758 | 65.17862 |
| 3 | 1.693147 | 20.127741 | 0.81281 | 70.053677 |
| 4 | 2.956442 | 21.048414 | 0.804551 | 87.399393 |


In [24]:
print("\nTest DataFrame:")
test_df[['TotalBathrooms', 'TotalSF', 'GrLivAreaToLotArea', 'GarageAreaToLotArea']].head()


Test DataFrame:


| | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
|---|---|---|---|---|
| 0 | 1.0 | 1778.0 | 0.077095 | 0.062812 |
| 1 | 1.346574 | 2658.0 | 0.093152 | 0.021869 |
| 2 | 2.346574 | 2557.0 | 0.117787 | 0.034852 |
| 3 | 2.346574 | 2530.0 | 0.160754 | 0.047104 |
| 4 | 2.0 | 2560.0 | 0.255744 | 0.101099 |
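
The train and test previews above are on visibly different scales (for example, `TotalSF` is about 20 in train and about 1778 in test), which would be consistent with the earlier transformations not having been applied identically to both frames when the cells were run. One way to reduce that risk is to build the features with a single function that is called on each frame. A minimal sketch, using a hypothetical helper `add_engineered_features` and made-up input values:

```python
import pandas as pd

def add_engineered_features(df):
    """Return a copy of df with the four engineered columns appended."""
    out = df.copy()
    out["TotalBathrooms"] = (
        out["FullBath"] + 0.5 * out["HalfBath"]
        + out["BsmtFullBath"] + 0.5 * out["BsmtHalfBath"]
    )
    out["TotalSF"] = out["TotalBsmtSF"] + out["1stFlrSF"] + out["2ndFlrSF"]
    out["GrLivAreaToLotArea"] = out["GrLivArea"] / out["LotArea"]
    out["GarageAreaToLotArea"] = out["GarageArea"] / out["LotArea"]
    return out

# One illustrative row of raw (untransformed) values
toy = pd.DataFrame({
    "FullBath": [2], "HalfBath": [1], "BsmtFullBath": [1], "BsmtHalfBath": [0],
    "TotalBsmtSF": [856], "1stFlrSF": [856], "2ndFlrSF": [854],
    "GrLivArea": [1710], "LotArea": [8450], "GarageArea": [548],
})

featured = add_engineered_features(toy)
print(featured[["TotalBathrooms", "TotalSF"]])
```

Calling the same function on `train_df` and `test_df` at the same point in the pipeline keeps the definitions, and the scale of the inputs, aligned between the two.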


**Scaling Numerical Features**

In [28]:
from sklearn.preprocessing import StandardScaler

# Selecting numerical features for scaling (excluding 'SalePrice')
numerical_features = train_df.select_dtypes(include=['float64', 'int64']).columns.drop('SalePrice')

# Initializing and fitting the scaler on the training data
scaler = StandardScaler()
train_df[numerical_features] = scaler.fit_transform(train_df[numerical_features])

# Transforming the test data using the same scaler
test_df[numerical_features] = scaler.transform(test_df[numerical_features])
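
As a quick sanity check on this step, here is a sketch on toy data (the values are illustrative). After fitting on the training set, the training column should have a mean close to 0 and a standard deviation close to 1, while the test column is shifted and scaled by the training statistics:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frames; the values are illustrative only
toy_train = pd.DataFrame({"LotArea": [8000.0, 9000.0, 10000.0, 13000.0]})
toy_test = pd.DataFrame({"LotArea": [10000.0, 11000.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(toy_train)  # learns mean and std from train
test_scaled = scaler.transform(toy_test)        # reuses the training statistics

print("Training mean used by the scaler:", scaler.mean_[0])
print("Train mean after scaling:", train_scaled.mean())
print("Train std after scaling:", train_scaled.std())
print("Test values after scaling:", test_scaled.ravel())
```

A scaled test column that still shows raw magnitudes (for example, values in the thousands) would be a sign that the transform was not applied to it.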

In [29]:
print("Scaled train data:")
train_df[numerical_features].head()

Scaled train data:


| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.730865 | 0.424462 | -0.078896 | -0.13327 | 0.651479 | -0.460408 | 1.050994 | 0.878668 | 1.225422 | 0.779431 | ... | 0.265618 | -0.067785 | -0.483083 | -0.493361 | 1.164712 | -0.181493 | 1.505663 | 1.081632 | 0.500534 | 0.396376 |
| 1 | -1.728492 | -1.125202 | 0.572719 | 0.113413 | -0.071836 | 1.948163 | 0.156734 | -0.429577 | -0.819538 | 0.888257 | ... | 0.265618 | -0.067785 | -0.483083 | 1.760085 | 0.690115 | -0.181493 | 0.446679 | -0.69674 | -0.439243 | -0.06546 |
| 2 | -1.72612 | 0.424462 | 0.062541 | 0.420049 | 0.651479 | -0.460408 | 0.984752 | 0.830215 | 1.191356 | 0.654803 | ... | 0.265618 | -0.067785 | -0.483083 | 0.625102 | 1.164712 | -0.181493 | 1.505663 | 1.12865 | 0.066114 | 0.598663 |
| 3 | -1.723747 | 0.645073 | -0.329561 | 0.103317 | 0.651479 | -0.460408 | -1.863632 | -0.720298 | -0.819538 | 0.384539 | ... | 0.265618 | -0.067785 | -0.483083 | -0.493361 | 0.690115 | -0.181493 | -0.424387 | 1.042873 | 0.2781 | 0.814319 |
| 4 | -1.721374 | 0.424462 | 0.726089 | 0.878431 | 1.374795 | -0.460408 | 0.951632 | 0.733308 | 1.323273 | 0.7544 | ... | 0.265618 | -0.067785 | -0.483083 | 1.294371 | 1.164712 | -0.181493 | 1.505663 | 1.316875 | 0.103932 | 1.581634 |


In [30]:
print("Scaled test data:")
test_df[numerical_features].head()

Scaled test data:


| | Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | ... | GarageCond | PoolQC | Fence | BsmtExposure | BsmtFinType1 | BsmtFinType2 | TotalBathrooms | TotalSF | GrLivAreaToLotArea | GarageAreaToLotArea |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461.0 | 20.0 | 80.0 | 11622.0 | 5.0 | 6.0 | 1961.0 | 1961.0 | -7.543433000000001e-17 | 468.0 | ... | 3.0 | 1.9466920000000002e-17 | 1.098612 | 0.693147 | 3.0 | 1.098612 | 1.0 | 1778.0 | 0.077095 | 0.062812 |
| 1 | 1462.0 | 20.0 | 81.0 | 14267.0 | 6.0 | 6.0 | 1958.0 | 1958.0 | 4.691348 | 923.0 | ... | 3.0 | 1.9466920000000002e-17 | -7.421765e-17 | 0.693147 | 5.0 | 0.693147 | 1.346574 | 2658.0 | 0.093152 | 0.021869 |
| 2 | 1463.0 | 60.0 | 74.0 | 13830.0 | 5.0 | 5.0 | 1997.0 | 1998.0 | -7.543433000000001e-17 | 791.0 | ... | 3.0 | 1.9466920000000002e-17 | 1.098612 | 0.693147 | 6.0 | 0.693147 | 2.346574 | 2557.0 | 0.117787 | 0.034852 |
| 3 | 1464.0 | 60.0 | 78.0 | 9978.0 | 6.0 | 6.0 | 1998.0 | 1998.0 | 3.044522 | 602.0 | ... | 3.0 | 1.9466920000000002e-17 | -7.421765e-17 | 0.693147 | 6.0 | 0.693147 | 2.346574 | 2530.0 | 0.160754 | 0.047104 |
| 4 | 1465.0 | 120.0 | 43.0 | 5005.0 | 8.0 | 5.0 | 1992.0 | 1992.0 | -7.543433000000001e-17 | 263.0 | ... | 3.0 | 1.9466920000000002e-17 | -7.421765e-17 | 0.693147 | 5.0 | 0.693147 | 2.0 | 2560.0 | 0.255744 | 0.101099 |


In [31]:
# Check for zeros in divisor columns
print(f"LotArea with zeros in train_df: {sum(train_df['LotArea'] == 0)}")
print(f"LotArea with zeros in test_df: {sum(test_df['LotArea'] == 0)}")

LotArea with zeros in train_df: 0
LotArea with zeros in test_df: 0
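
Because the ratio features divide by `LotArea`, a zero check is most informative on the raw values, before the division and before any scaling. A minimal sketch of a guarded division that yields NaN rather than infinity for a zero denominator (`safe_ratio` is a hypothetical helper and the values are made up):

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Divide two Series, returning NaN where the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Toy frame; the second row has an invented zero LotArea
toy = pd.DataFrame({"GarageArea": [548.0, 460.0], "LotArea": [8450.0, 0.0]})
toy["GarageAreaToLotArea"] = safe_ratio(toy["GarageArea"], toy["LotArea"])
print(toy)
```

Any resulting NaN values could then be handled by the same imputation strategy used for the other numerical columns.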


In [32]:
# Example: Using IQR to detect outliers in SalePrice (train only)
Q1 = train_df['SalePrice'].quantile(0.25)
Q3 = train_df['SalePrice'].quantile(0.75)
IQR = Q3 - Q1
outliers = train_df[(train_df['SalePrice'] < Q1 - 1.5 * IQR) | (train_df['SalePrice'] > Q3 + 1.5 * IQR)]
print(f"Number of outliers in SalePrice: {len(outliers)}")

Number of outliers in SalePrice: 28
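
The same IQR rule can be wrapped in a small reusable function. This is a sketch with a hypothetical helper `iqr_outlier_mask` and invented prices; it returns a boolean mask, so the flagged rows can be inspected or filtered as needed:

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Toy prices; the last value is an obvious outlier
prices = pd.Series([100, 110, 120, 130, 140, 150, 1000])
mask = iqr_outlier_mask(prices)
print("Outliers:", prices[mask].tolist())
print("Number of outliers:", int(mask.sum()))
```

On the notebook's data, `train_df[iqr_outlier_mask(train_df['SalePrice'])]` would select the same rows as the cell above.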


In [33]:
# Export the feature-engineered train_df and test_df to CSV files
train_df.to_csv("../../data/featured_train.csv", index=False)
test_df.to_csv("../../data/featured_test.csv", index=False)

print("Feature Engineered train_df exported to featured_train.csv")
print("Feature Engineered test_df exported to featured_test.csv")

Feature Engineered train_df exported to featured_train.csv
Feature Engineered test_df exported to featured_test.csv


So far so good! Now it is time to train our neural network!