# Neural Network Kaggle Exercise (Core)

## Assignment

Now, put neural networks into action. You are tasked with building a neural network using data from the Kaggle competition linked below. To complete the assignment, train and evaluate your model using only train.csv. Remember that in Kaggle competitions, test.csv does not include values for the target. It is used only for the competition, so you cannot evaluate your model on test.csv without submitting your predictions to Kaggle.

https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data?select=sample_submission.csv

## Required Task

### 1. Be sure to perform a train/test split on train.csv so you can evaluate your models.

In [112]:
# import libraries

# general
import pandas as pd

# preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import make_column_selector

# modeling
from tensorflow.keras import Sequential
from tensorflow.keras import metrics
from tensorflow.keras.layers import Dense, Dropout

# visualization and evaluation
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

In [101]:
# load train.csv
df = pd.read_csv('Data/house_kaggle.csv', index_col = 'Id')

In [102]:
# inspect
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 80 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   MSZoning       1460 non-null   object 
 2   LotFrontage    1201 non-null   float64
 3   LotArea        1460 non-null   int64  
 4   Street         1460 non-null   object 
 5   Alley          91 non-null     object 
 6   LotShape       1460 non-null   object 
 7   LandContour    1460 non-null   object 
 8   Utilities      1460 non-null   object 
 9   LotConfig      1460 non-null   object 
 10  LandSlope      1460 non-null   object 
 11  Neighborhood   1460 non-null   object 
 12  Condition1     1460 non-null   object 
 13  Condition2     1460 non-null   object 
 14  BldgType       1460 non-null   object 
 15  HouseStyle     1460 non-null   object 
 16  OverallQual    1460 non-null   int64  
 17  OverallCond    1460 non-null   int64  
 18  YearBuil

In [103]:
# save copy of data
df2 = df.copy()

# df (first pass) with no feature engineering

In [104]:
# check for duplicates
# drop_duplicates() returns a new DataFrame, so assign it back to keep the result
df = df.drop_duplicates()
df.duplicated().sum()

0

In [105]:
# check for missing values
missing = df.isna().sum()

for index, value in missing.items():
    if value != 0:
        print(f"{index} missing {value} values")

LotFrontage missing 259 values
Alley missing 1369 values
MasVnrType missing 8 values
MasVnrArea missing 8 values
BsmtQual missing 37 values
BsmtCond missing 37 values
BsmtExposure missing 38 values
BsmtFinType1 missing 37 values
BsmtFinType2 missing 38 values
Electrical missing 1 values
FireplaceQu missing 690 values
GarageType missing 81 values
GarageYrBlt missing 81 values
GarageFinish missing 81 values
GarageQual missing 81 values
GarageCond missing 81 values
PoolQC missing 1453 values
Fence missing 1179 values
MiscFeature missing 1406 values
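
To judge which of these columns are mostly empty, the counts can be expressed as fractions of the 1,460 rows. The sketch below hard-codes a few of the counts printed above. `missing_fraction` is a hypothetical helper introduced for illustration; with the DataFrame in hand, `df.isna().mean()` gives the same ratios directly.

```python
# Counts copied from the output above; the dataset has 1460 rows.
N_ROWS = 1460
missing_counts = {
    "LotFrontage": 259,
    "Alley": 1369,
    "FireplaceQu": 690,
    "PoolQC": 1453,
    "Fence": 1179,
    "MiscFeature": 1406,
}


def missing_fraction(counts, n_rows):
    """Return the share of rows missing for each column (hypothetical helper)."""
    return {col: n / n_rows for col, n in counts.items()}


fractions = missing_fraction(missing_counts, N_ROWS)

# List the columns from most to least missing.
for col, frac in sorted(fractions.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col}: {frac:.1%} missing")
```

Columns that are almost entirely empty are the ones where the constant `'missing'` category does most of the work after imputation.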


In [106]:
# check values of categorical columns
cat_cols = list(df.select_dtypes(include = 'object'))
for col in cat_cols:
    print(col)
    print(df[col].value_counts(dropna = False))
    print()

MSZoning
RL         1151
RM          218
FV           65
RH           16
C (all)      10
Name: MSZoning, dtype: int64

Street
Pave    1454
Grvl       6
Name: Street, dtype: int64

Alley
NaN     1369
Grvl      50
Pave      41
Name: Alley, dtype: int64

LotShape
Reg    925
IR1    484
IR2     41
IR3     10
Name: LotShape, dtype: int64

LandContour
Lvl    1311
Bnk      63
HLS      50
Low      36
Name: LandContour, dtype: int64

Utilities
AllPub    1459
NoSeWa       1
Name: Utilities, dtype: int64

LotConfig
Inside     1052
Corner      263
CulDSac      94
FR2          47
FR3           4
Name: LotConfig, dtype: int64

LandSlope
Gtl    1382
Mod      65
Sev      13
Name: LandSlope, dtype: int64

Neighborhood
NAmes      225
CollgCr    150
OldTown    113
Edwards    100
Somerst     86
Gilbert     79
NridgHt     77
Sawyer      74
NWAmes      73
SawyerW     59
BrkSide     58
Crawfor     51
Mitchel     49
NoRidge     41
Timber      38
IDOTRR      37
ClearCr     28
StoneBr     25
SWISU       25
Meado

In [107]:
# check values of numerical columns
num_cols = list(df.select_dtypes(include = 'number'))

df[num_cols].describe()

| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460.0 | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1452.0 | 1460.0 | 1460.0 | ... | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 56.89726 | 70.049958 | 10516.828082 | 6.099315 | 5.575342 | 1971.267808 | 1984.865753 | 103.685262 | 443.639726 | 46.549315 | ... | 94.244521 | 46.660274 | 21.95411 | 3.409589 | 15.060959 | 2.758904 | 43.489041 | 6.321918 | 2007.815753 | 180921.19589 |
| std | 42.300571 | 24.284752 | 9981.264932 | 1.382997 | 1.112799 | 30.202904 | 20.645407 | 181.066207 | 456.098091 | 161.319273 | ... | 125.338794 | 66.256028 | 61.119149 | 29.317331 | 55.757415 | 40.177307 | 496.123024 | 2.703626 | 1.328095 | 79442.502883 |
| min | 20.0 | 21.0 | 1300.0 | 1.0 | 1.0 | 1872.0 | 1950.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006.0 | 34900.0 |
| 25% | 20.0 | 59.0 | 7553.5 | 5.0 | 5.0 | 1954.0 | 1967.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 2007.0 | 129975.0 |
| 50% | 50.0 | 69.0 | 9478.5 | 6.0 | 5.0 | 1973.0 | 1994.0 | 0.0 | 383.5 | 0.0 | ... | 0.0 | 25.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 | 2008.0 | 163000.0 |
| 75% | 70.0 | 80.0 | 11601.5 | 7.0 | 6.0 | 2000.0 | 2004.0 | 166.0 | 712.25 | 0.0 | ... | 168.0 | 68.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 8.0 | 2009.0 | 214000.0 |
| max | 190.0 | 313.0 | 215245.0 | 10.0 | 9.0 | 2010.0 | 2010.0 | 1600.0 | 5644.0 | 1474.0 | ... | 857.0 | 547.0 | 552.0 | 508.0 | 480.0 | 738.0 | 15500.0 | 12.0 | 2010.0 | 755000.0 |


In [108]:
# X, y split
target = ['SalePrice']
y = df[target]
X = df.drop(columns = target)

# check
print(f"y:\n{y}")
print(f"X:\n{X}")

y:
      SalePrice
Id             
1        208500
2        181500
3        223500
4        140000
5        250000
...         ...
1456     175000
1457     210000
1458     266500
1459     142125
1460     147500

[1460 rows x 1 columns]
X:
      MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
Id                                                                      
1             60       RL         65.0     8450   Pave   NaN      Reg   
2             20       RL         80.0     9600   Pave   NaN      Reg   
3             60       RL         68.0    11250   Pave   NaN      IR1   
4             70       RL         60.0     9550   Pave   NaN      IR1   
5             60       RL         84.0    14260   Pave   NaN      IR1   
...          ...      ...          ...      ...    ...   ...      ...   
1456          60       RL         62.0     7917   Pave   NaN      Reg   
1457          20       RL         85.0    13175   Pave   NaN      Reg   
1458          70       RL      

In [109]:
# train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42)

# check
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")

X_train shape: (1095, 79)
X_test shape: (365, 79)
y_train shape: (1095, 1)
y_test shape: (365, 1)
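
The 1095/365 split follows from the defaults of `train_test_split`: when neither `test_size` nor `train_size` is passed, scikit-learn holds out 25% of the rows, rounding the test count up. A small arithmetic sketch of that rule, using only the row count reported above:

```python
import math

n_samples = 1460   # rows in train.csv, per df.info() above
test_size = 0.25   # scikit-learn's default when no size is specified

# The test count is rounded up; the remaining rows go to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

Passing an explicit `test_size` to `train_test_split` changes these counts in the same way.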


In [110]:
# create preprocessor

# categorical features
cat_cols = make_column_selector(dtype_include = 'object')
missing_imputer = SimpleImputer(strategy = 'constant', 
                               fill_value = 'missing')
ohe = OneHotEncoder(handle_unknown = 'ignore')
imp_cat_pipe = make_pipeline(missing_imputer, ohe)
cat_tuple = (imp_cat_pipe, cat_cols)

# numeric features
num_cols = make_column_selector(dtype_include = 'number')
median_imputer = SimpleImputer(strategy = 'median')
scaler = StandardScaler()
imp_num_pipe = make_pipeline(median_imputer, scaler)
num_tuple = (imp_num_pipe, num_cols)

preprocessor = make_column_transformer(cat_tuple,
                                      num_tuple,
                                      remainder = 'drop')

# check
preprocessor

In [111]:
# fit and transform
X_train_proc = preprocessor.fit_transform(X_train)
X_test_proc = preprocessor.transform(X_test)

# get shape
X_train_proc.shape

(1095, 300)

### 2. Create and evaluate 3 iterations of a deep learning model to predict housing prices using the techniques you have learned to optimize your model's performance. Be sure to include some form of regularization with at least one model.

#### Iteration 1: 4 hidden layers, each narrower than the last by 25% of the input width, no regularization

In [119]:
# create model architecture
input_features = X_train_proc.shape[1]

model1 = Sequential()

# first layer
model1.add(Dense(input_features, 
                 input_dim = input_features, 
                 activation = 'relu'))

# second layer
# Dense expects an integer number of units, so cast the scaled width to int
model1.add(Dense(int(input_features * 0.75),
                activation = 'relu'))

# third layer
model1.add(Dense(int(input_features * 0.5),
                activation = 'relu'))

# fourth layer
model1.add(Dense(int(input_features * 0.25),
                activation = 'relu'))

# output layer
model1.add(Dense(1, 
                 activation = 'linear'))

# check
model1.summary()

Model: "sequential_2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_10 (Dense)            (None, 300)               90300     
                                                                 
 dense_11 (Dense)            (None, 225)               67725     
                                                                 
 dense_12 (Dense)            (None, 150)               33900     
                                                                 
 dense_13 (Dense)            (None, 75)                11325     
                                                                 
 dense_14 (Dense)            (None, 1)                 76        
                                                                 
Total params: 203,326
Trainable params: 203,326
Non-trainable params: 0
_________________________________________________________________
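
The parameter counts in the summary can be checked by hand: a `Dense` layer with `n_inputs` inputs and `n_units` neurons has one weight per input-neuron pair plus one bias per neuron. The sketch below reproduces the numbers above; `dense_params` is a hypothetical helper, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus biases (n_units) of a Dense layer."""
    return (n_inputs + 1) * n_units


# Input width followed by the width of each Dense layer in model1.
layer_widths = [300, 300, 225, 150, 75, 1]

per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_widths, layer_widths[1:])
]
total = sum(per_layer)

print(per_layer)
print(total)
```

With roughly 200,000 trainable parameters and only 1,095 training rows, the network has far more capacity than data, which is one reason the assignment asks for regularization in at least one iteration.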


In [120]:
# compile model, add additional metrics

model1.compile(loss = 'mse',
              optimizer = 'adam',
              metrics = [metrics.MeanAbsoluteError(),
                        metrics.RootMeanSquaredError()])

In [125]:
X_test_proc

<365x300 sparse matrix of type '<class 'numpy.float64'>'
	with 28830 stored elements in Compressed Sparse Row format>

In [128]:
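# fit model and save learning history
# Keras cannot take a SciPy sparse matrix paired with a pandas DataFrame; the original
# call failed with "ValueError: Failed to find data adapter that can handle input:
# <class 'scipy.sparse._csr.csr_matrix'>, <class 'pandas.core.frame.DataFrame'>".
# Converting the features to dense arrays and the targets to NumPy arrays avoids it.
history = model1.fit(X_train_proc.toarray(), y_train.to_numpy(),
                    validation_data = (X_test_proc.toarray(), y_test.to_numpy()),
                    epochs = 100,
                    verbose = 0)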

In [None]:
# plot learning history
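
One possible way to fill in this cell is a small helper that draws the training and validation curves for a chosen metric. This is a sketch, not the notebook author's code: `plot_history` is a hypothetical name, and it assumes the dictionary stored in `history.history` uses Keras's usual `val_` prefix for validation metrics.

```python
import matplotlib.pyplot as plt


def plot_history(history_dict, metric="loss"):
    """Plot one metric from a Keras-style history dict (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.plot(history_dict[metric], label=f"train {metric}")

    # Validation curves are stored under the same name with a "val_" prefix.
    val_key = f"val_{metric}"
    if val_key in history_dict:
        ax.plot(history_dict[val_key], label=f"validation {metric}")

    ax.set_xlabel("epoch")
    ax.set_ylabel(metric)
    ax.legend()
    return fig, ax
```

After a successful fit, `plot_history(history.history, "loss")` would draw the loss curves. The same call should work for the extra metrics once their key names are read off `history.history.keys()`.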

In [None]:
# evaluate model
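
A minimal sketch of the evaluation step, computing the three regression metrics the notebook imports (MAE, RMSE via MSE, and R²) directly with NumPy so each formula is visible. `regression_scores` is a hypothetical helper; `mean_absolute_error`, `mean_squared_error`, and `r2_score` from scikit-learn should give the same values.

```python
import numpy as np


def regression_scores(y_true, y_pred):
    """Return MAE, RMSE and R^2 for two equal-length arrays (hypothetical helper)."""
    # Flatten so (n, 1) predictions from a network line up with (n,) targets.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()

    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))

    # R^2 compares the model's squared error to that of predicting the mean.
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "RMSE": rmse, "R2": r2}
```

To use it here, pass `y_test` and the model's predictions on the processed test features (converted to a dense array if they are sparse), then repeat on the training data to see how far apart the two sets of scores are.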

#### Iteration 2: Iteration 1 copy with regularization (dropout)
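
A sketch of what this iteration could look like: the same four hidden layers as Iteration 1, with a `Dropout` layer after each one. The function name `build_dropout_model` and the dropout rate of 0.2 are assumptions chosen for illustration, not values from the notebook.

```python
from tensorflow.keras import Input, Sequential
from tensorflow.keras.layers import Dense, Dropout


def build_dropout_model(n_features, rate=0.2):
    """Iteration 1's architecture plus dropout (rate is an assumed value)."""
    model = Sequential()
    model.add(Input(shape=(n_features,)))

    # Hidden widths shrink by 25% of the input width per layer: 100%, 75%, 50%, 25%.
    for fraction in (1.0, 0.75, 0.5, 0.25):
        model.add(Dense(int(n_features * fraction), activation="relu"))
        # Randomly zero a share of activations during training only.
        model.add(Dropout(rate))

    # Single linear output for the regression target.
    model.add(Dense(1, activation="linear"))
    model.compile(loss="mse", optimizer="adam")
    return model


model2 = build_dropout_model(300)
model2.summary()
```

Dropout adds no trainable parameters, so the summary should report the same parameter total as Iteration 1. The rate is worth tuning, since too much dropout can leave the network underfitting.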

#### Iteration 3

### 3. Select your best model!
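
One simple way to make the choice explicit is to collect each iteration's test metrics in a dictionary and pick the model with the lowest error. The helper name `pick_best` and the numbers below are placeholders for illustration only; they are not results from this notebook.

```python
def pick_best(results, metric="rmse"):
    """Return the name of the model with the lowest value of `metric`."""
    return min(results, key=lambda name: results[name][metric])


# Placeholder values, NOT real results: replace with each model's test scores.
example_results = {
    "model1": {"rmse": 3.0, "mae": 2.0},
    "model2": {"rmse": 2.5, "mae": 2.1},
    "model3": {"rmse": 2.8, "mae": 1.9},
}

best = pick_best(example_results)
print(f"lowest RMSE: {best}")
```

Different metrics can disagree, as the placeholder MAE column shows, so it helps to state which metric drives the decision and why.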

## Optional

- Use your best model to make predictions using the features in test.csv.
- Submit to the Kaggle competition to see how you did!
- Include a screenshot of your results from the Kaggle competition inserted in a markdown cell at the bottom of your notebook.