# Neural Networks

In this notebook, we aim to surpass the **92%** accuracy achieved with the scikit-learn models. Our goal is to reach at least **98%** using neural networks.

Since we have already conducted Exploratory Data Analysis (EDA) and visualizations in the scikit-learn part of the project, we will skip those steps here. Instead, we will directly focus on:
- Handling missing values,
- Encoding the dataset,
- Training a neural network model using TensorFlow and Keras.

Let’s get started!

**Checking GPU support (optional)**

In [1]:
import tensorflow as tf

print("TensorFlow version:", tf.__version__)
print("Is GPU available:", tf.config.list_physical_devices('GPU'))

TensorFlow version: 2.10.0
Is GPU available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Preparing the Data

**Import the Data**

In [2]:
import pandas as pd

In [3]:
train_df = pd.read_csv("../../data/train.csv")
test_df = pd.read_csv("../../data/test.csv")

print("Training Data:")
train_df.head()

Training Data:


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
print("Testing Data:")
test_df.head()

Testing Data:


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


**Handle the Missing Data**

In [5]:
train_df.isna().sum().sort_values(ascending=False)

PoolQC         1453
MiscFeature    1406
Alley          1369
Fence          1179
MasVnrType      872
               ... 
ExterQual         0
Exterior2nd       0
Exterior1st       0
RoofMatl          0
SalePrice         0
Length: 81, dtype: int64

In [6]:
test_df.isna().sum().sort_values(ascending=False)

PoolQC           1456
MiscFeature      1408
Alley            1352
Fence            1169
MasVnrType        894
                 ... 
Electrical          0
1stFlrSF            0
2ndFlrSF            0
LowQualFinSF        0
SaleCondition       0
Length: 80, dtype: int64

In [7]:
# Import necessary libraries
from sklearn.impute import SimpleImputer

# Handling missing values for categorical features
categorical_imputer = SimpleImputer(strategy="constant", fill_value="NoFeature")

# List of categorical columns with potential missing values
categorical_cols = [
    'PoolQC', 'MiscFeature', 'Alley', 'Fence', 'GarageType', 'GarageQual', 
    'GarageCond', 'BsmtQual', 'BsmtCond', 'MasVnrType', 'FireplaceQu', 
    'GarageFinish', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Electrical',
    'MSZoning', 'Utilities', 'Functional', 'Exterior1st', 'Exterior2nd', 'KitchenQual', 'SaleType'
]
# Fit on the training set only, then apply the same imputer to the test set
train_df[categorical_cols] = categorical_imputer.fit_transform(train_df[categorical_cols])
test_df[categorical_cols] = categorical_imputer.transform(test_df[categorical_cols])

# Handling missing values for numerical features
numerical_imputer = SimpleImputer(strategy="median")

# List of numerical columns with potential missing values
numerical_cols = [
    'LotFrontage', 'GarageYrBlt', 'MasVnrArea', 'BsmtFullBath', 'BsmtHalfBath',
    'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'GarageCars', 'GarageArea'
]
# Reuse the training medians for the test set so both sets are imputed consistently
train_df[numerical_cols] = numerical_imputer.fit_transform(train_df[numerical_cols])
test_df[numerical_cols] = numerical_imputer.transform(test_df[numerical_cols])
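
A common convention is to learn imputation statistics from the training set and reuse them on the test set. The following is a minimal sketch of that idea on toy data; the frames `toy_train` and `toy_test` and their values are hypothetical and not taken from the housing dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical values, not taken from the housing dataset)
toy_train = pd.DataFrame({"LotFrontage": [60.0, np.nan, 80.0, 100.0]})
toy_test = pd.DataFrame({"LotFrontage": [np.nan, 500.0]})

imputer = SimpleImputer(strategy="median")

# Learn the median from the training data only (median of 60, 80, 100 is 80)
toy_train[["LotFrontage"]] = imputer.fit_transform(toy_train[["LotFrontage"]])

# Reuse the training median for the test data; calling fit_transform here
# would instead fill the gap with the test median (500)
toy_test[["LotFrontage"]] = imputer.transform(toy_test[["LotFrontage"]])

print(toy_train["LotFrontage"].tolist())  # [60.0, 80.0, 80.0, 100.0]
print(toy_test["LotFrontage"].tolist())   # [80.0, 500.0]
```

With the `"constant"` strategy used for the categorical columns, fitting on either set gives the same fill value, so the distinction mainly matters for statistics such as the median.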

Let's check whether any missing values remain.

In [8]:
print("Missing values in train_df:", train_df.isnull().sum().sum())
print("Missing values in test_df:", test_df.isnull().sum().sum())

Missing values in train_df: 0
Missing values in test_df: 0


We have now removed all missing values from both datasets.

In [9]:
# Export the cleaned train_df and test_df to CSV files
train_df.to_csv("../../data/cleaned_train.csv", index=False)
test_df.to_csv("../../data/cleaned_test.csv", index=False)

print("Cleaned train_df exported to cleaned_train.csv")
print("Cleaned test_df exported to cleaned_test.csv")

Cleaned train_df exported to cleaned_train.csv
Cleaned test_df exported to cleaned_test.csv


**Encode the Data**

In [10]:
# Import necessary Encoders
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Define nominal and ordinal features
nominal_features = [
    'MSZoning', 'Street', 'Alley', 'LotConfig', 'Neighborhood', 'Condition1', 
    'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 
    'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Foundation', 'Heating', 
    'GarageType', 'SaleType', 'SaleCondition', 'LotShape', 'LandContour',
    'Utilities', 'LandSlope', 'CentralAir', 'Electrical', 'Functional',
    'GarageFinish', 'PavedDrive', 'MiscFeature'
]

ordinal_features = [
    'ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC', 
    'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC', 'Fence',
    'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'
]

# Define ordinal encoding mapping for ordinal features
ordinal_mapping = {
    'ExterQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'ExterCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'BsmtCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'HeatingQC': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'KitchenQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'FireplaceQu': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageQual': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'GarageCond': ['NoFeature', 'Po', 'Fa', 'TA', 'Gd', 'Ex'],
    'PoolQC': ['NoFeature', 'Fa', 'TA', 'Gd', 'Ex'],
    'Fence': ['NoFeature', 'MnWw', 'MnPrv', 'GdWo', 'GdPrv'],
    'BsmtExposure': ['NoFeature', 'No', 'Mn', 'Av', 'Gd'],
    'BsmtFinType1': ['NoFeature', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
    'BsmtFinType2': ['NoFeature', 'Unf', 'LwQ', 'Rec', 'BLQ', 'ALQ', 'GLQ'],
}

# Convert ordinal_mapping dictionary to a list of lists in the same order as ordinal_features
ordinal_categories = [ordinal_mapping[feature] for feature in ordinal_features]

# 1. One-Hot Encoding for Nominal Features
nominal_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
nominal_encoder.fit(train_df[nominal_features])

# Transform nominal features with the encoder fitted on the training set
nominal_encoded_train = nominal_encoder.transform(train_df[nominal_features])
nominal_encoded_test = nominal_encoder.transform(test_df[nominal_features])

# Create a DataFrame for the one-hot encoded features
nominal_encoded_train_df = pd.DataFrame(
    nominal_encoded_train,
    columns=nominal_encoder.get_feature_names_out(nominal_features),
    index=train_df.index
)
nominal_encoded_test_df = pd.DataFrame(
    nominal_encoded_test,
    columns=nominal_encoder.get_feature_names_out(nominal_features),
    index=test_df.index
)

# Drop original nominal columns and concatenate one-hot encoded features
train_df = train_df.drop(columns=nominal_features).join(nominal_encoded_train_df)
test_df = test_df.drop(columns=nominal_features).join(nominal_encoded_test_df)

# 2. Ordinal Encoding for Ordinal Features
ordinal_encoder = OrdinalEncoder(categories=ordinal_categories)
ordinal_encoder.fit(train_df[ordinal_features])

# Transform ordinal features with the encoder fitted on the training set
ordinal_encoded_train = ordinal_encoder.transform(train_df[ordinal_features])
ordinal_encoded_test = ordinal_encoder.transform(test_df[ordinal_features])

# Create a DataFrame for the ordinal encoded features
ordinal_encoded_train_df = pd.DataFrame(
    ordinal_encoded_train,
    columns=ordinal_features,
    index=train_df.index
)
ordinal_encoded_test_df = pd.DataFrame(
    ordinal_encoded_test,
    columns=ordinal_features,
    index=test_df.index
)

# Drop original ordinal columns and concatenate ordinal encoded features
train_df = train_df.drop(columns=ordinal_features).join(ordinal_encoded_train_df)
test_df = test_df.drop(columns=ordinal_features).join(ordinal_encoded_test_df)

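Now let's verify that the encoding was successful. The only column present in the training set but absent from the test set should be the target, `SalePrice`.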

In [11]:
missing_cols = [col for col in train_df.columns if col not in test_df.columns]
missing_cols

['SalePrice']
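
The check above only looks in one direction. A slightly stricter variant, sketched below, also looks for columns that exist only in the test set and confirms that the feature columns appear in the same order. The helper `check_feature_alignment` and the toy frames are hypothetical names introduced for this illustration.

```python
import pandas as pd

def check_feature_alignment(train, test, target="SalePrice"):
    """Return (only_in_train, only_in_test, same_order) for the feature columns."""
    train_features = [c for c in train.columns if c != target]
    only_in_train = [c for c in train_features if c not in test.columns]
    only_in_test = [c for c in test.columns if c not in train.columns]
    same_order = train_features == list(test.columns)
    return only_in_train, only_in_test, same_order

# Toy frames (hypothetical values) standing in for the encoded datasets
toy_train = pd.DataFrame({"LotArea": [8450], "Street_Pave": [1.0], "SalePrice": [208500]})
toy_test = pd.DataFrame({"LotArea": [11622], "Street_Pave": [1.0]})

print(check_feature_alignment(toy_train, toy_test))  # ([], [], True)
```

Matching column order matters once the frames are converted to arrays for a neural network, since the model identifies features by position rather than by name.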

In [13]:
train_df.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,HeatingQC,KitchenQual,FireplaceQu,GarageQual,GarageCond,PoolQC,Fence,BsmtExposure,BsmtFinType1,BsmtFinType2
0,1,60,65.0,8450,7,5,2003,2003,196.0,706.0,...,5.0,4.0,0.0,3.0,3.0,0.0,0.0,1.0,6.0,1.0
1,2,20,80.0,9600,6,8,1976,1976,0.0,978.0,...,5.0,3.0,3.0,3.0,3.0,0.0,0.0,4.0,5.0,1.0
2,3,60,68.0,11250,7,5,2001,2002,162.0,486.0,...,5.0,4.0,3.0,3.0,3.0,0.0,0.0,2.0,6.0,1.0
3,4,70,60.0,9550,7,5,1915,1970,0.0,216.0,...,4.0,4.0,4.0,3.0,3.0,0.0,0.0,1.0,5.0,1.0
4,5,60,84.0,14260,8,5,2000,2000,350.0,655.0,...,5.0,4.0,3.0,3.0,3.0,0.0,0.0,3.0,6.0,1.0


In [14]:
test_df.head()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,HeatingQC,KitchenQual,FireplaceQu,GarageQual,GarageCond,PoolQC,Fence,BsmtExposure,BsmtFinType1,BsmtFinType2
0,1461,20,80.0,11622,5,6,1961,1961,0.0,468.0,...,3.0,3.0,0.0,3.0,3.0,0.0,2.0,1.0,3.0,2.0
1,1462,20,81.0,14267,6,6,1958,1958,108.0,923.0,...,3.0,4.0,0.0,3.0,3.0,0.0,0.0,1.0,5.0,1.0
2,1463,60,74.0,13830,5,5,1997,1998,0.0,791.0,...,4.0,3.0,3.0,3.0,3.0,0.0,2.0,1.0,6.0,1.0
3,1464,60,78.0,9978,6,6,1998,1998,20.0,602.0,...,5.0,4.0,4.0,3.0,3.0,0.0,0.0,1.0,6.0,1.0
4,1465,120,43.0,5005,8,5,1992,1992,0.0,263.0,...,5.0,4.0,0.0,3.0,3.0,0.0,0.0,1.0,5.0,1.0


In [18]:
# Check for non-numeric columns in train_df
non_numeric_columns_train_df = train_df.select_dtypes(include=['object']).columns
if non_numeric_columns_train_df.empty:
    print("All columns are numeric!")
else:
    print("Non-numeric columns:", non_numeric_columns_train_df)

All columns are numeric!


In [19]:
# Check for non-numeric columns in test_df
non_numeric_columns_test_df = test_df.select_dtypes(include=['object']).columns
if non_numeric_columns_test_df.empty:
    print("All columns are numeric!")
else:
    print("Non-numeric columns:", non_numeric_columns_test_df)

All columns are numeric!
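
Selecting only `object` columns can miss other non-numeric dtypes such as `category`. As a hedged alternative, the sketch below excludes numeric dtypes instead; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical values) with one numeric, one categorical and one string column
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "CentralAir": pd.Categorical(["Y", "N"]),
    "Street": ["Pave", "Grvl"],
})

# Everything that is not a numeric dtype is reported, whatever its exact type
non_numeric = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)  # ['CentralAir', 'Street']
```

Note that boolean columns are not counted as `"number"` by `select_dtypes`, so they would also be listed by this check.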


In [16]:
# Export the encoded train_df and test_df to CSV files
train_df.to_csv("../../data/encoded_train.csv", index=False)
test_df.to_csv("../../data/encoded_test.csv", index=False)

print("Encoded train_df exported to encoded_train.csv")
print("Encoded test_df exported to encoded_test.csv")

Encoded train_df exported to encoded_train.csv
Encoded test_df exported to encoded_test.csv


Everything looks fine! Time to move on.