# House Price Prediction

## About the Competition
### Description
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

### Goal
It is your job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable. 

### Evaluation
Submissions are evaluated on **Root-Mean-Squared-Error (RMSE)** between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
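As a rough illustration of the metric described above, the following sketch computes the RMSE between the logarithms of predicted and observed prices. The function name `rmse_of_logs` and the prices are made up for this example; it is not the competition's official scoring code.

```python
import math

def rmse_of_logs(predicted, observed):
    """RMSE between log(predicted) and log(observed); all prices must be positive."""
    squared_errors = [
        (math.log(p) - math.log(o)) ** 2 for p, o in zip(predicted, observed)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# A 10% overestimate costs the same on a cheap house and on an expensive one.
cheap = rmse_of_logs([110_000.0], [100_000.0])      # illustrative prices
expensive = rmse_of_logs([550_000.0], [500_000.0])  # illustrative prices
print(cheap, expensive)
```

Both calls return the same value, `log(1.1)`, which is the sense in which errors on expensive and cheap houses are weighted equally.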

# Training The Model

## 1. Check GPU For Training

In [0]:
# Load the TensorBoard notebook extension
%load_ext tensorboard

import os
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models, regularizers, callbacks, optimizers

tf.keras.backend.clear_session() 

print(tf.__version__)

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')

2.1.0
Found GPU at: /device:GPU:0


## 2. Loading Data

### 2.1 Download

#### Mounted Data
Skip this if you are using your own computer

In [0]:
# from google.colab import drive
# drive.mount('/content/gdrive')

#### Set Path
Change this path according to your dataset folder

In [0]:
folderPath = './'
dataPath = './data/'

### 2.2 Parse
First parse the data with `pandas`, then split `X_train` into the two feature groups discussed in the preprocessing section (used by the complex and deep-and-wide models).

In [0]:
import pandas as pd
import re
import numpy as np

np.random.seed(1)

X_train = pd.read_csv(dataPath + 'preprocessed_train_x.csv')
y_train = pd.read_csv(dataPath + 'preprocessed_train_y.csv')

# Remove ID
X_train = X_train.drop('Id', axis=1)
y_train = y_train.drop('Id', axis=1)

# Divide the input into 2 groups
X_related = X_train[['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']]
X_effected = X_train.drop(['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt'], axis=1)

X_train.shape


(1457, 219)
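The split above leaves 6 columns in the first group and the remaining 213 of the 219 columns in the second. A minimal sketch of the same partition on plain column-name lists is shown here; `split_columns` is a hypothetical helper, and the two extra column names are placeholders rather than a claim about what the preprocessed file contains.

```python
# The six column names are taken from the cell above.
RELATED = ['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']

def split_columns(all_columns, related=RELATED):
    """Return (related, remaining) column lists, preserving the original order."""
    missing = [c for c in related if c not in all_columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    related_set = set(related)
    remaining = [c for c in all_columns if c not in related_set]
    return list(related), remaining

columns = RELATED + ['LotArea', 'Fireplaces']  # hypothetical extra columns
related_cols, other_cols = split_columns(columns)
print(len(related_cols), len(other_cols))
```

Checking for missing names up front gives a clearer error than a failed lookup later in the pipeline.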

### 2.3 Scaling Data
To make training feasible, we use `StandardScaler` to standardize the values so that the model converges more easily. scikit-learn provides several scalers; after trying a few of them, we found that `StandardScaler` gave the best results for our model.

In [0]:
from sklearn.preprocessing import StandardScaler
import datetime
from sklearn.model_selection import KFold

X_train = StandardScaler().fit_transform(X_train)
y_train = StandardScaler().fit_transform(y_train)
X_related = StandardScaler().fit_transform(X_related)
X_effected = StandardScaler().fit_transform(X_effected)
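For reference, standardization rescales each column to zero mean and unit variance. The sketch below reproduces that arithmetic with NumPy on made-up values; `standardize` is a hypothetical helper written for this illustration, not part of scikit-learn.

```python
import numpy as np

def standardize(X):
    """Column-wise z-score: subtract the mean, divide by the population std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # ddof=0, the convention StandardScaler uses
    std[std == 0.0] = 1.0    # constant columns become zeros instead of dividing by 0
    return (X - mean) / std, mean, std

X_demo = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])  # made-up values
Z, mean, std = standardize(X_demo)
print(Z.mean(axis=0), Z.std(axis=0))
```

Returning `mean` and `std` alongside the scaled array makes it possible to apply or invert the same transformation later.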



In [0]:
import seaborn as sns
# Histogram to check the distribution of the price after scaling
sns.distplot(y_train);

## 3. Finding Model
Next, we tried to find the best model by running experiments with several different architectures.

### Basic Model



In [0]:
def baseline_model():
    # create model
    model = models.Sequential()
    model.add(layers.Dense(1, input_dim=X_train.shape[1], activity_regularizer=regularizers.l1(0.001)))

    return model

In [0]:
# define the model
def larger_model():
    # create model
    model = models.Sequential()
    model.add(layers.Dense(X_train.shape[1], input_dim=X_train.shape[1], activation='relu'))
    model.add(layers.Dense(X_train.shape[1]//2, activation='relu'))
    model.add(layers.Dense(1, activity_regularizer=regularizers.l1(0.001)))
 
    return model

In [0]:
# define wider model
def wider_model():
    # create model
    model = models.Sequential()
    model.add(layers.Dense(256, input_dim=X_train.shape[1], activation='relu'))
    model.add(layers.Dense(1, activity_regularizer=regularizers.l1(0.001)))

    return model

In [0]:
def more_larger_model():
    model = models.Sequential()
    model.add(layers.Dense(200, input_dim=X_train.shape[1], kernel_initializer='normal', activation='relu'))
    model.add(layers.Dense(100, kernel_initializer='normal', activation='relu'))
    model.add(layers.Dense(50, kernel_initializer='normal', activation='relu'))
    model.add(layers.Dense(25, kernel_initializer='normal', activation='relu'))
    model.add(layers.Dense(1, kernel_initializer='normal', activity_regularizer=regularizers.l1(0.001)))

    return model

### Deep and Wide Model

In [0]:
def deep_and_wide_model():
    inp = layers.Input(shape=(X_train.shape[1],))
    
    # Deep
    hidden = layers.Dense(200, kernel_initializer='normal', activation='relu')(inp)
    hidden = layers.Dense(100, kernel_initializer='normal', activation='relu')(hidden)
    hidden = layers.Dense(50, kernel_initializer='normal', activation='relu')(hidden)
    hidden = layers.Dense(25, kernel_initializer='normal', activation='relu')(hidden)
    
    # Concatenate
    output = layers.concatenate([hidden, inp])
    output = layers.Dense(200, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(100, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(50, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(25, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(1, kernel_initializer='normal', activity_regularizer=regularizers.l1(0.001))(output)
    
    model = models.Model(inputs=inp, outputs=output)
    
    return model

### Complex Model

In [0]:
def balance_complex_model():
    input_related = layers.Input(shape=(X_related.shape[1],))
    input_effected = layers.Input(shape=(X_effected.shape[1],))
    
    # Related Side
    dense_re = layers.Dense(200, kernel_initializer='normal', activation='relu')(input_related)
    dense_re = layers.Dense(100, kernel_initializer='normal', activation='relu')(dense_re)

    # Effected Side
    dense_eff = layers.Dense(200, kernel_initializer='normal', activation='relu')(input_effected)
    dense_eff = layers.Dense(100, kernel_initializer='normal', activation='relu')(dense_eff)
    
    # Concatenate
    output = layers.concatenate([dense_re, dense_eff])
    output = layers.Dense(200, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(100, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(50, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(25, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(1, kernel_initializer='normal', activity_regularizer=regularizers.l1(0.001))(output)
    
    model = models.Model(inputs=[input_related, input_effected], outputs=output)
    
    return model

In [0]:
def larger_complex_model():
    input_related = layers.Input(shape=(X_related.shape[1],))
    input_effected = layers.Input(shape=(X_effected.shape[1],))
    
    # Related Side
    dense_re = layers.Dense(200, kernel_initializer='normal', activation='relu')(input_related)
    dense_re = layers.Dense(100, kernel_initializer='normal', activation='relu')(dense_re)
    dense_re = layers.Dense(200, kernel_initializer='normal', activation='relu')(dense_re)

    # Effected Side
    dense_eff = layers.Dense(200, kernel_initializer='normal', activation='relu')(input_effected)
    dense_eff = layers.Dense(200, kernel_initializer='normal', activation='relu')(dense_eff)
    
    # Concatenate
    output = layers.concatenate([dense_re, dense_eff])
    output = layers.Dense(400, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(200, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(100, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(50, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(25, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(1, kernel_initializer='normal', activity_regularizer=regularizers.l1(0.001))(output)
    
    model = models.Model(inputs=[input_related, input_effected], outputs=output)
    
    return model

In [0]:
def single_side_complex_model():
    input_related = layers.Input(shape=(X_related.shape[1],))
    input_effected = layers.Input(shape=(X_effected.shape[1],))

    # Effected Side
    dense_eff = layers.Dense(200, kernel_initializer='normal', activation='relu')(input_effected)
    dense_eff = layers.Dense(200, kernel_initializer='normal', activation='relu')(dense_eff)
    
    # Concatenate
    output = layers.concatenate([input_related, dense_eff])
    output = layers.Dense(200, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(100, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(50, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(25, kernel_initializer='normal', activation='relu')(output)
    output = layers.Dense(1, kernel_initializer='normal', activity_regularizer=regularizers.l1(0.001))(output)
    
    model = models.Model(inputs=[input_related, input_effected], outputs=output)
    
    return model

## 4. Training the Model

In [0]:
# Clear any logs from previous runs
!rm -rf ./logs/ 

In [0]:
kfold = KFold(n_splits=5, random_state=2020, shuffle=True)
fold = 1
scores = {'loss': 0, 'rmse': 0}
for train_index, test_index in kfold.split(X_train):
    print('Fold: ', fold)
    X_tr, y_tr = X_train[train_index], y_train[train_index]
    X_val, y_val = X_train[test_index], y_train[test_index]
    
    X_tr_re, X_tr_eff = X_related[train_index], X_effected[train_index]
    X_val_re, X_val_eff = X_related[test_index], X_effected[test_index]

    model = larger_complex_model() # The model can be changed here; see section 3 for the options
    adam = optimizers.Adam(lr=0.001)
    model.compile(optimizer=adam,
              loss=tf.keras.metrics.mean_squared_error,
              metrics=[tf.keras.metrics.RootMeanSquaredError(name='rmse')])
    
    # Create log directory for tensorboard
    log_dir = os.path.join(
        "logs",
        "fit",
        datetime.datetime.now().strftime("%Y%m%d-%H%M%S"),
    )
    tensorboard_callback = callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)
    
    # Note: use this fit/evaluate pair for the two-input (complex) models
    model.fit(x=[X_tr_re, X_tr_eff], 
          y=y_tr, 
          validation_data = ([X_val_re, X_val_eff], y_val),
          epochs=100, verbose=0, callbacks=[tensorboard_callback], batch_size=128)
    
    score = model.evaluate([X_val_re, X_val_eff], y_val, verbose=0)

    # Note: for single-input models (the basic models, and deep_and_wide_model as defined in section 3), uncomment these two calls instead
#     model.fit(x=X_tr, 
#           y=y_tr, 
#           validation_data = (X_val, y_val),
#           epochs=100, verbose=0, callbacks=[tensorboard_callback], batch_size=128)
    
#     score = model.evaluate(X_val, y_val, verbose=0)
    print('Val loss: ', score[0])
    print('Val RMSE: ', score[1])
    print('-'*20)

    scores['loss'] = scores['loss'] + score[0]
    scores['rmse'] = scores['rmse'] + score[1]

    fold += 1

print('Mean loss: %.3f' % (scores['loss']/5))
print('Mean RMSE: %.3f' % (scores['rmse']/5))

Fold:  1
Val loss:  0.08198038830536686
Val RMSE:  0.28497294
--------------------
Fold:  2
Val loss:  0.09519042350249747
Val RMSE:  0.3072554
--------------------
Fold:  3
Val loss:  0.08294809908570908
Val RMSE:  0.2867522
--------------------
Fold:  4
Val loss:  0.10181516836496562
Val RMSE:  0.3178539
--------------------
Fold:  5
Val loss:  0.10353426281938848
Val RMSE:  0.32069647
--------------------
Mean loss: 0.093
Mean RMSE: 0.304
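The loop above relies on `KFold` to partition the rows and then averages the per-fold scores. The sketch below shows the same idea with plain NumPy. `kfold_indices` is a hypothetical helper; it will not reproduce scikit-learn's exact splits for the same seed, and only the per-fold RMSE values are copied from the output above.

```python
import numpy as np

def kfold_indices(n_samples, n_splits=5, seed=2020):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, n_splits)
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, val_idx

splits = list(kfold_indices(1457))
fold_sizes = [len(val) for _, val in splits]

# Per-fold validation RMSE values copied from the output above.
fold_rmse = [0.28497294, 0.3072554, 0.2867522, 0.3178539, 0.32069647]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
print(fold_sizes, round(mean_rmse, 3))
```

Dividing by `len(fold_rmse)` rather than a hard-coded 5 keeps the average correct if the number of splits changes.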


### Experiment Results
Here is the list of experiments we conducted:

**Experiments Report**


*5-fold cross-validation*
- 6 Important variables + baseline: RMSE = 0.549
- 6 Important variables + larger: RMSE = 0.517
- 6 Important variables + wider: RMSE = 0.383
- 6 Important variables + more_larger: RMSE = 0.387


- 6 Important variables + more_larger + tanh: RMSE = 0.411
- 6 Important variables + more_larger + leakyReLU(0.1): RMSE = 0.387
- 6 Important variables + more_larger + leakyReLU(0.3): RMSE = 0.391
- 6 Important variables + more_larger + elu: RMSE = 0.392


- other variables + more_larger + relu: RMSE = 0.367
- other variables + more_larger + tanh: RMSE = 0.399
- other variables + more_larger + elu: RMSE = 0.428


- all variables + baseline: RMSE = 0.559
- all variables + larger: RMSE = 0.367
- all variables + wider: RMSE = 0.480
- all variables + more_larger: RMSE = 0.328


**Deep and Wide**
- small: RMSE = 0.393
- large: RMSE = 0.315


**Complex**
- balance: RMSE = 0.329
- larger+all relu: RMSE = 0.315
- larger+all relu+1layer on related side: RMSE = 0.304
- single-side 1 layer: RMSE = 0.324
- single-side 2 layer: RMSE = 0.316



### Graph with Tensorboard

In [0]:
%tensorboard --logdir logs/fit

Reusing TensorBoard on port 6006 (pid 34696), started 0:01:43 ago. (Use '!kill 34696' to kill it.)

## For Submission

### Training the Model
Use all of the training data

In [0]:
model = larger_complex_model()
adam = optimizers.Adam(lr=0.001)
model.compile(optimizer=adam,
              loss=tf.keras.metrics.mean_squared_error,
              metrics=[tf.keras.metrics.RootMeanSquaredError(name='rmse')])
model.summary()

# Note: for single-input models (the basic models), uncomment this call instead
# model.fit(x=X_train, 
#           y=y_train, 
#           epochs=100, batch_size=128)

# Note: use this call for the two-input (complex) models
model.fit(x=[X_related, X_effected], 
          y=y_train, 
          epochs=100, batch_size=128)

Model: "model_35"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_71 (InputLayer)           [(None, 6)]          0                                            
__________________________________________________________________________________________________
dense_385 (Dense)               (None, 200)          1400        input_71[0][0]                   
__________________________________________________________________________________________________
input_72 (InputLayer)           [(None, 213)]        0                                            
__________________________________________________________________________________________________
dense_386 (Dense)               (None, 100)          20100       dense_385[0][0]                  
___________________________________________________________________________________________

<tensorflow.python.keras.callbacks.History at 0x2389fc7f978>
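The `Param #` column in the summary can be checked by hand: a fully connected layer has one weight per input-unit pair plus one bias per unit. A small sketch, where `dense_params` is a hypothetical helper rather than a Keras function:

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

first_related = dense_params(6, 200)     # dense_385 in the summary
second_related = dense_params(200, 100)  # dense_386 in the summary
print(first_related, second_related)
```

These match the 1400 and 20100 parameters reported for `dense_385` and `dense_386`.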

In [0]:
# Save model for further use
model.save(folderPath + "model/model.h5")
print("Saved model to disk")

Saved model to disk


### Making a Prediction

#### Preparing Testing Data

In [0]:
X_test = pd.read_csv(dataPath + 'preprocessed_test.csv')
y_train = pd.read_csv(dataPath + 'preprocessed_train_y.csv')
# Remove ID
X_test = X_test.drop('Id', axis=1)
y_train = y_train.drop('Id', axis=1)
# Divide the input into 2 groups
X_related_test = X_test[['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']]
X_effected_test = X_test.drop(['OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt'], axis=1)
# Scale Data
X_test = StandardScaler().fit_transform(X_test)
X_related_test = StandardScaler().fit_transform(X_related_test)
X_effected_test = StandardScaler().fit_transform(X_effected_test)
scaler = StandardScaler().fit(y_train)
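Note that the cell above fits new scalers on the test features, so the test set is standardized with its own mean and standard deviation rather than the ones the model saw during training. A common alternative, sketched here on synthetic arrays with made-up variable names, is to compute the statistics on the training data only and reuse them for the test data. Whether this improves the score for this dataset has not been checked here.

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic stand-in for training features
test = rng.normal(loc=55.0, scale=10.0, size=(50, 3))    # synthetic stand-in for test features

# Fit the statistics once, on the training data only.
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)

# Apply the same transformation to both sets.
train_scaled = (train - train_mean) / train_std
test_scaled = (test - train_mean) / train_std
print(test_scaled.mean(axis=0))
```

With scikit-learn, the equivalent pattern is to call `fit` on the training features and `transform` on the test features with the same scaler object.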



#### Predict

In [0]:
result = model.predict([X_related_test, X_effected_test])
result.shape

(1459, 1)

Invert the transformation

In [0]:
# Inverse transformation
result = scaler.inverse_transform(result)
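The `np.exp` call in `save_to_file` suggests that the target stored in `preprocessed_train_y.csv` is the logarithm of the sale price. That is an assumption here, since the preprocessing step is not shown. Under that assumption, recovering a price takes two steps: undo the standardization, then undo the log. A round-trip sketch on made-up prices:

```python
import numpy as np

prices = np.array([120_000.0, 180_000.0, 250_000.0])  # made-up sale prices

# Forward: log-transform, then standardize (the form the network is trained on).
log_prices = np.log(prices)
mean, std = log_prices.mean(), log_prices.std()
scaled = (log_prices - mean) / std

# Backward: undo the standardization, then undo the log.
recovered = np.exp(scaled * std + mean)
print(recovered)
```

The order matters: applying `exp` before inverting the scaling would not return the original prices.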

### Saving to File

In [0]:
def save_to_file(result, filename='submission.csv'):
    row_id = 1461
    results = {'Id': [], 'SalePrice': []}
    for r in result:
        price = np.exp(r[0])
        results['Id'].append(row_id)
        results['SalePrice'].append(price)
        row_id += 1

    df = pd.DataFrame(data=results)
    df.to_csv(filename, index=False)

In [0]:
save_to_file(result, filename=folderPath+'submission.csv')