# House Prices Prediction using TensorFlow Decision Forests

## Import the library

In [None]:
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import ElasticNet, Lasso,  BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor,  GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

## Load the dataset


In [None]:
train_file_path = "../input/house-prices-advanced-regression-techniques/train.csv"
dataset_df = pd.read_csv(train_file_path)
print("Full train dataset shape is {}".format(dataset_df.shape))

The data is composed of 81 columns and 1460 entries. We can see all 81 dimensions of our dataset by printing out the first 3 entries using the following code:

In [None]:
dataset_df.head(3)

* There are 79 feature columns. Using these features your model has to predict the house sale price indicated by the label column named `SalePrice`.

We will drop the `Id` column as it is not necessary for model training.

In [None]:
dataset_df = dataset_df.drop('Id', axis=1)
dataset_df.head(3)

We can inspect the types of feature columns using the following code:

In [None]:
dataset_df.info()

## House Price Distribution

Now let us take a look at how the house prices are distributed.

In [None]:
print(dataset_df['SalePrice'].describe())
plt.figure(figsize=(9, 8))
sns.distplot(dataset_df['SalePrice'], color='g', bins=100, hist_kws={'alpha': 0.4});

## Numerical data distribution

We will now take a look at how the numerical features are distributed. In order to do this, let us first list all the types of data from our dataset and select only the numerical ones.

In [None]:
list(set(dataset_df.dtypes.tolist()))

In [None]:
df_num = dataset_df.select_dtypes(include = ['float64', 'int64'])
df_num.head()

Now let us plot the distribution for all the numerical features.

In [None]:
df_num.hist(figsize=(16, 20), bins=50, xlabelsize=8, ylabelsize=8);

## Prepare the dataset

This dataset contains a mix of numeric and categorical features, some of which have missing values. TF-DF supports all of these natively, so no preprocessing is required. This is one advantage of tree-based models, and it makes them a great entry point to TensorFlow and ML.

Now let us split the dataset into training and testing datasets:

In [None]:
import numpy as np

def split_dataset(dataset, test_ratio=0.30):
  test_indices = np.random.rand(len(dataset)) < test_ratio
  return dataset[~test_indices], dataset[test_indices]

train_ds_pd, valid_ds_pd = split_dataset(dataset_df)
print("{} examples in training, {} examples in testing.".format(
    len(train_ds_pd), len(valid_ds_pd)))
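Because `np.random.rand` is unseeded, the split above changes on every run, and so do all downstream scores. A minimal sketch of a reproducible variant follows. The helper name `split_dataset_seeded`, the seed value 42, and the toy DataFrame are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def split_dataset_seeded(dataset, test_ratio=0.30, seed=42):
    """Hypothetical reproducible variant of split_dataset."""
    rng = np.random.default_rng(seed)  # seeded generator: same mask on every call
    test_mask = rng.random(len(dataset)) < test_ratio
    return dataset[~test_mask], dataset[test_mask]

# Toy frame standing in for dataset_df.
toy_df = pd.DataFrame({"x": range(1000)})
train_a, valid_a = split_dataset_seeded(toy_df)
train_b, valid_b = split_dataset_seeded(toy_df)

print(len(train_a), len(valid_a))
print("Same split on both calls:", train_a.index.equals(train_b.index))
```

Note that the ratio is only approximate: each row is assigned independently, so the validation set holds roughly, not exactly, 30% of the rows.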

There's one more step required before we can train the model. We need to convert the dataset from Pandas format (`pd.DataFrame`) into TensorFlow Datasets format (`tf.data.Dataset`).

[TensorFlow Datasets](https://www.tensorflow.org/datasets/overview) is a high performance data loading library which is helpful when training neural networks with accelerators like GPUs and TPUs.

By default the Random Forest Model is configured to train classification tasks. Since this is a regression problem, we will specify the type of the task (`tfdf.keras.Task.REGRESSION`) as a parameter here.

In [None]:
label = 'SalePrice'
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=label, task = tfdf.keras.Task.REGRESSION)
valid_ds = tfdf.keras.pd_dataframe_to_tf_dataset(valid_ds_pd, label=label, task = tfdf.keras.Task.REGRESSION)

## Select a Model

There are several tree-based models for you to choose from.

* RandomForestModel
* GradientBoostedTreesModel
* CartModel
* DistributedGradientBoostedTreesModel

To start, we'll work with a Random Forest. This is the most well-known of the Decision Forest training algorithms.

A Random Forest is a collection of decision trees, each trained independently on a random subset of the training dataset (sampled with replacement). Because the predictions of many such trees are averaged, the algorithm is comparatively robust to overfitting, and it is easy to use.

We can list all the available models in TensorFlow Decision Forests using the following code:

In [None]:
tfdf.keras.get_all_models()

## How can I configure them?

TensorFlow Decision Forests provides good defaults for you (e.g. the top ranking hyperparameters on our benchmarks, slightly modified to run in reasonable time). If you would like to configure the learning algorithm, you will find many options you can explore to get the highest possible accuracy.

You can select a template and/or set parameters as follows:

```
rf = tfdf.keras.RandomForestModel(hyperparameter_template="benchmark_rank1", task=tfdf.keras.Task.REGRESSION)
```

Read more [here](https://www.tensorflow.org/decision_forests/api_docs/python/tfdf/keras/RandomForestModel).

## Create a Random Forest

First, we will use the defaults to create the Random Forest Model while specifying the task type as `tfdf.keras.Task.REGRESSION`.

In [None]:
rf = tfdf.keras.RandomForestModel(task = tfdf.keras.Task.REGRESSION)
rf.compile(metrics=["mse"]) # Optional, you can use this to include a list of eval metrics

## Train the model

We will train the model using a one-liner.

Note: you may see a warning about Autograph. You can safely ignore this; it will be fixed in the next release.

In [None]:
rf.fit(x=train_ds)

## Visualize the model
One benefit of tree-based models is that you can easily visualize them. The default number of trees used in the Random Forests is 300. We can select a tree to display below.

In [None]:
tfdf.model_plotter.plot_model_in_colab(rf, tree_idx=0, max_depth=3)

## Evaluate the model on the Out of bag (OOB) data and the validation dataset

Before training the model, we manually set aside roughly 30% of the dataset for validation, named `valid_ds`.

We can also use the out-of-bag (OOB) score to validate our RandomForestModel.
To train each tree of a Random Forest, the algorithm draws a random sample of the training set with replacement. The examples that are not drawn for a given tree are known as that tree's out-of-bag (OOB) data, and they can be used to evaluate the model without a separate validation set.
The OOB score is computed on the OOB data.

Read more about OOB data [here](https://developers.google.com/machine-learning/decision-forests/out-of-bag).
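To make the idea of out-of-bag data concrete, here is a small NumPy sketch of a single bootstrap draw. It illustrates the sampling idea only and is not TF-DF's internal implementation; the seed and the toy dataset size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for repeatability only
n_examples = 1000                # toy training-set size

# One tree's bootstrap sample: n draws with replacement from the row indices.
bootstrap_idx = rng.integers(0, n_examples, size=n_examples)

# Rows that were never drawn are "out of bag" for this tree.
in_bag = np.zeros(n_examples, dtype=bool)
in_bag[bootstrap_idx] = True
oob_idx = np.flatnonzero(~in_bag)

oob_fraction = len(oob_idx) / n_examples
print(f"Out-of-bag fraction: {oob_fraction:.3f}")
```

For large datasets the expected out-of-bag fraction approaches 1/e, about 0.368, so each tree leaves roughly a third of the rows available for evaluation.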

The training logs show the Root Mean Squared Error (RMSE) evaluated on the out-of-bag dataset according to the number of trees in the model. Let us plot this.

Note: RMSE is an error metric, so smaller values are better.
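As a reminder of what the RMSE measures, the following sketch computes it by hand with NumPy. The function name `rmse` and the three prices are made up for illustration.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical sale prices and predictions, in dollars.
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]
print(f"RMSE: {rmse(actual, predicted):.2f}")
```

Since the errors are squared before averaging, the single 20,000 dollar miss weighs more than the two 10,000 dollar misses combined, and the result is expressed in the same unit as `SalePrice`.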

In [None]:
import matplotlib.pyplot as plt
logs = rf.make_inspector().training_logs()
plt.plot([log.num_trees for log in logs], [log.evaluation.rmse for log in logs])
plt.xlabel("Number of trees")
plt.ylabel("RMSE (out-of-bag)")
plt.show()

We can also see some general stats on the OOB dataset:

In [None]:
inspector = rf.make_inspector()
inspector.evaluation()

Now, let us run an evaluation using the validation dataset.

In [None]:
evaluation = rf.evaluate(x=valid_ds,return_dict=True)

for name, value in evaluation.items():
  print(f"{name}: {value:.4f}")

## Variable importances

Variable importances generally indicate how much a feature contributes to the model predictions or quality. There are several ways to identify important features using TensorFlow Decision Forests.
Let us list the available `Variable Importances` for Decision Trees:

In [None]:
print(f"Available variable importances:")
for importance in inspector.variable_importances().keys():
  print("\t", importance)

As an example, let us display the important features for the Variable Importance `NUM_AS_ROOT`.

The larger the importance score for `NUM_AS_ROOT`, the more impact it has on the outcome of the model.

By default, the list is sorted from the most important to the least. From the output you can infer that the feature at the top of the list is used as the root node in more trees of the random forest than any other feature.
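The counting behind `NUM_AS_ROOT` can be sketched in plain Python. The list of root features below is invented for a hypothetical six-tree forest; in practice the inspector derives these counts from the trained model.

```python
from collections import Counter

# Hypothetical root-split features of a six-tree forest (made-up values).
root_features = ["OverallQual", "GrLivArea", "OverallQual",
                 "ExterQual", "OverallQual", "GrLivArea"]

# NUM_AS_ROOT-style count: how many trees use each feature at the root.
num_as_root = Counter(root_features)

# Sort from most to least frequent, mirroring the ordering described above.
ranking = num_as_root.most_common()
print(ranking)
```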

In [None]:
inspector.variable_importances()["NUM_AS_ROOT"]

Now let us plot the variable importances from the inspector using Matplotlib:

In [None]:
plt.figure(figsize=(12, 4))

# Number of trees in which each feature is used as the root split.
variable_importance_metric = "NUM_AS_ROOT"
variable_importances = inspector.variable_importances()[variable_importance_metric]

# Extract the feature name and importance values.
#
# `variable_importances` is a list of <feature, importance> tuples.
feature_names = [vi[0].name for vi in variable_importances]
feature_importances = [vi[1] for vi in variable_importances]
# The feature are ordered in decreasing importance value.
feature_ranks = range(len(feature_names))

bar = plt.barh(feature_ranks, feature_importances, label=[str(x) for x in feature_ranks])
plt.yticks(feature_ranks, feature_names)
plt.gca().invert_yaxis()

# TODO: Replace with "plt.bar_label()" when available.
# Label each bar with values
for importance, patch in zip(feature_importances, bar.patches):
  plt.text(patch.get_x() + patch.get_width(), patch.get_y(), f"{importance:.4f}", va="top")

plt.xlabel(variable_importance_metric)
plt.title("NUM_AS_ROOT variable importance")
plt.tight_layout()
plt.show()

## Create a GradientBoostedTrees Model

In [None]:
gbtm = tfdf.keras.GradientBoostedTreesModel()

In [None]:
gbtm = tfdf.keras.GradientBoostedTreesModel(task = tfdf.keras.Task.REGRESSION)

In [None]:
gbtm = tfdf.keras.GradientBoostedTreesModel(
    task = tfdf.keras.Task.REGRESSION,
    features = None,
    num_threads = None,
    tuner = None,
    discretize_numerical_features = False,
    allow_na_conditions = False,
    categorical_algorithm = 'CART',
    growing_strategy = 'LOCAL',
    loss = 'SQUARED_ERROR',
    max_depth = 5,
    max_num_nodes = None,
    min_examples = 5,
    missing_value_policy = 'GLOBAL_IMPUTATION',
)

In [None]:
gbtm.fit(x=train_ds)

In [None]:
gbtm.summary()  # summary() prints the report itself and returns None

SE:
* max_depth 5 = 26509.7
* max_depth 4 = 27684.6
* max_depth 5 with local_imputation = 27873.8 (we will use global_imputation)
* max_depth 5 with min_examples 4 = 27262.3
* max_depth 5 with min_examples 6 = 29150.1 (we will use min_examples 5)

**The GradientBoostedTreesModel has better results, so it is now our preliminary choice over the RandomForestModel for prediction and submission.**

### Now we are moving on from TensorFlow and as such must do our own data cleanup.

## Data Cleanup

We will now check the dataset for missing values.

In [None]:
plt.figure(figsize=(25,8))
plt.title('Number of missing rows')
missing_count = pd.DataFrame(dataset_df.isnull().sum(), columns=['sum']).sort_values(by=['sum'],ascending=False).head(20).reset_index()
missing_count.columns = ['features','sum']
sns.barplot(x='features',y='sum', data = missing_count)

In [None]:
# drop columns with high number of missing values
dataset_df.drop(['PoolQC','MiscFeature','Alley'], axis=1, inplace=True)

In [None]:
# check remaining missing values
pd.DataFrame(dataset_df.isnull().sum(), columns=['sum']).sort_values(by=['sum'],ascending=False).head(15)

We will fill in missing values according to variable type, first being ordinal data.

### Ordinal data

In [None]:
cat = ['GarageType','GarageFinish','BsmtFinType2','BsmtExposure','BsmtFinType1', 
       'GarageCond','GarageQual','BsmtCond','BsmtQual','FireplaceQu','Fence',"KitchenQual",
       "HeatingQC",'ExterQual','ExterCond']

dataset_df[cat] = dataset_df[cat].fillna("NA")

### Categorical data

In [None]:
cols = ["MasVnrType", "MSZoning", "Exterior1st", "Exterior2nd", "SaleType", "Electrical", "Functional"]
dataset_df[cols] = dataset_df.groupby("Neighborhood")[cols].transform(lambda x: x.fillna(x.mode()[0]))

### Numerical data

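This is a bit more complicated for variables such as LotFrontage because they vary widely across the data. We first print the overall means for reference, then fill LotFrontage and GarageArea with the mean of each house's neighborhood, and fill the remaining numerical columns with their overall column means.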

In [None]:
print("Mean of LotFrontage: ", dataset_df['LotFrontage'].mean())
print("Mean of GarageArea: ", dataset_df['GarageArea'].mean())

In [None]:
#for correlated relationship
dataset_df['LotFrontage'] = dataset_df.groupby('Neighborhood')['LotFrontage'].transform(lambda x: x.fillna(x.mean()))
dataset_df['GarageArea'] = dataset_df.groupby('Neighborhood')['GarageArea'].transform(lambda x: x.fillna(x.mean()))
dataset_df['MSZoning'] = dataset_df.groupby('MSSubClass')['MSZoning'].transform(lambda x: x.fillna(x.mode()[0]))

#numerical
cont = ["BsmtHalfBath", "BsmtFullBath", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea"]
dataset_df[cont] = dataset_df[cont].fillna(dataset_df[cont].mean())
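The difference between filling with an overall mean and filling with a per-neighborhood mean is easiest to see on a toy example. The six rows below are invented for illustration; only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data: one missing LotFrontage value in each neighborhood.
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B", "B"],
    "LotFrontage": [60.0, 80.0, np.nan, 20.0, np.nan, 40.0],
})

# Overall mean: every gap receives the same value.
global_fill = toy["LotFrontage"].fillna(toy["LotFrontage"].mean())

# Group-wise mean: each gap receives its own neighborhood's mean.
group_fill = toy.groupby("Neighborhood")["LotFrontage"].transform(
    lambda s: s.fillna(s.mean())
)

print(pd.DataFrame({"global": global_fill, "by_group": group_fill}))
```

In this toy frame the overall mean is 50, while the group means are 70 for neighborhood A and 30 for neighborhood B, so the group-wise fill stays closer to the values typical of each neighborhood.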

## Feature Selection

In [None]:
numeric_train = dataset_df.select_dtypes(exclude=['object'])
correlation = numeric_train.corr()
correlation[['SalePrice']].sort_values(['SalePrice'], ascending=False)

Let us also check for linear relationships

In [None]:
fig = plt.figure(figsize=(20,20))
for index in range(len(numeric_train.columns)):
    plt.subplot(10,5,index+1)
    sns.scatterplot(x=numeric_train.iloc[:,index], y='SalePrice', data=numeric_train.dropna())
fig.tight_layout(pad=1.0)

Let us drop the columns that show evidence of high collinearity with other features:

In [None]:
dataset_df.drop(['GarageYrBlt','TotRmsAbvGrd','1stFlrSF','GarageCars'], axis=1, inplace=True)
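One way to spot collinear candidates systematically is to scan the feature-to-feature correlation matrix for pairs above a cut-off. The sketch below runs on synthetic data; the 0.8 threshold and the generated columns are assumptions and do not reproduce how the columns above were chosen.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed
n = 200
garage_cars = rng.integers(0, 4, size=n).astype(float)
toy = pd.DataFrame({
    "GarageCars": garage_cars,
    # Built to track GarageCars closely (synthetic, by design).
    "GarageArea": 250 * garage_cars + rng.normal(0, 20, size=n),
    "LotArea": rng.normal(10000, 2000, size=n),
})

corr = toy.corr().abs()
# Keep only the upper triangle so that each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
threshold = 0.8  # assumed cut-off, not taken from the notebook
pairs = upper.stack()
high_pairs = pairs[pairs > threshold]
print(high_pairs)
```

For each flagged pair, one would typically keep the feature that correlates more strongly with `SalePrice` and drop the other.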

In [None]:
fig,axes = plt.subplots(1,2, figsize=(15,5))
sns.regplot(x=numeric_train['MoSold'], y='SalePrice', data=numeric_train, ax = axes[0], line_kws={'color':'black'})
sns.regplot(x=numeric_train['YrSold'], y='SalePrice', data=numeric_train, ax = axes[1],line_kws={'color':'black'})
fig.tight_layout(pad=2.0)

In [None]:
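# Inspect the features least correlated with SalePrice, then drop MoSold and
# YrSold, which show no linear relationship with the target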
correlation[['SalePrice']].sort_values(['SalePrice'], ascending=False).tail(10)

dataset_df.drop(['MoSold','YrSold'], axis=1, inplace=True)

Some numeric columns are really category codes, so we convert them to strings so that they are treated as categorical variables.

In [None]:
#MSSubClass=The building class
dataset_df['MSSubClass'] = dataset_df['MSSubClass'].apply(str)


#Changing OverallCond into a categorical variable
dataset_df['OverallCond'] = dataset_df['OverallCond'].astype(str)

Label encoder

In [None]:
from sklearn.preprocessing import LabelEncoder
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond', 
        'ExterQual', 'ExterCond','HeatingQC', 'KitchenQual', 'BsmtFinType1', 
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'CentralAir', 'MSSubClass', 'OverallCond')
# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder() 
    lbl.fit(list(dataset_df[c].values)) 
    dataset_df[c] = lbl.transform(list(dataset_df[c].values))

# shape        
print('Shape all_data: {}'.format(dataset_df.shape))

Get dummy variables

In [None]:
dataset_df = pd.get_dummies(dataset_df)
print(dataset_df.shape)
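The two encodings used above behave differently, which a toy column makes visible. This sketch uses pandas category codes as a stand-in for `LabelEncoder` (both assign integers in sorted label order); the four values are invented.

```python
import pandas as pd

toy = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex", "TA"]})

# Integer codes, assigned in sorted order of the labels.
codes = toy["ExterQual"].astype("category").cat.codes

# One-hot encoding: one indicator column per distinct label.
dummies = pd.get_dummies(toy["ExterQual"], prefix="ExterQual")

print(codes.tolist())
print(dummies.columns.tolist())
```

Note that the integer codes follow alphabetical order (`Ex`, `Gd`, `TA`), which does not necessarily match the quality ranking of the labels. A linear model will read the codes as an ordered scale, whereas one-hot columns make no ordering assumption.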

## Fit regression model

First, we log-transform `SalePrice`.

In [None]:
from scipy import stats
from scipy.stats import norm

# np.log1p applies log(1 + x) to every element of the column
train_ds_pd["SalePrice"] = np.log1p(train_ds_pd["SalePrice"])

# Check the new distribution, overlaying a fitted normal curve
sns.distplot(train_ds_pd['SalePrice'], fit=norm);

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train_ds_pd['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)],
            loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

#Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train_ds_pd['SalePrice'], plot=plt)
plt.show()
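A model fitted on the log-transformed target predicts log prices, so its outputs need the inverse transform, `np.expm1`, before they can be read as dollar amounts. A minimal round-trip sketch with hypothetical prices:

```python
import numpy as np

prices = np.array([100000.0, 250000.0, 755000.0])  # hypothetical sale prices

log_prices = np.log1p(prices)      # forward transform: log(1 + x)
recovered = np.expm1(log_prices)   # inverse transform: exp(y) - 1

print(log_prices.round(3))
print("Round trip recovers the prices:", np.allclose(recovered, prices))
```

The transform compresses the long right tail of the price distribution, and an RMSE computed on log prices penalizes relative rather than absolute errors.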

In [None]:
#Validation function
n_folds = 5

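def rmsle_cv(model):
    # Pass the KFold object itself as `cv`. Calling get_n_splits() would return
    # only the integer fold count, so the shuffling would be silently dropped.
    kf = KFold(n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train, scoring="neg_mean_squared_error", cv=kf))
    return rmse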

In [None]:
ntrain = train_ds_pd.shape[0]
ntest = valid_ds_pd.shape[0]
y_train = train_ds_pd.SalePrice.values

In [None]:
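# Select rows by the index of the earlier random split so that the features line
# up with y_train, and drop the target so it does not leak into the features.
train = dataset_df.loc[train_ds_pd.index].drop(columns=['SalePrice'])
test = dataset_df.loc[valid_ds_pd.index].drop(columns=['SalePrice'])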

In [None]:
lasso = make_pipeline(RobustScaler(), Lasso(alpha =0.0005, random_state=1))

In [None]:
score = rmsle_cv(lasso)
print("Lasso score: {:.4f} ({:.4f})\n" .format(score.mean(), score.std()))

## Submission
Finally, we predict on the competition test data using the GradientBoostedTreesModel.

In [None]:
test_file_path = "../input/house-prices-advanced-regression-techniques/test.csv"
test_data = pd.read_csv(test_file_path)
ids = test_data.pop('Id')

test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(
    test_data,
    task = tfdf.keras.Task.REGRESSION)

preds = gbtm.predict(test_ds)
output = pd.DataFrame({'Id': ids,
                       'SalePrice': preds.squeeze()})

output.head()


In [None]:
submission = pd.read_csv('../input/house-prices-advanced-regression-techniques/sample_submission.csv')
submission['SalePrice'] = preds.squeeze()  # reuse the predictions above, flattened to 1-D
submission.to_csv('/kaggle/working/submission.csv', index=False)
submission.head()