# House Prices: Advanced Regression Techniques (Kaggle)

## 04-pipeline-for-preprocessing

Sources:
* Kaggle competition: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
* Check missing values: https://www.kaggle.com/willkoehrsen/start-here-a-gentle-introduction by Will Koehrsen
* Neural net implementation: https://yashuseth.blog/2018/07/22/pytorch-neural-network-for-tabular-data-with-categorical-embeddings/ by Yashu Seth
* Sklearn pipelines: https://medium.com/dunder-data/from-pandas-to-scikit-learn-a-new-exciting-workflow-e88e2271ef62
* Pipelines with dataframes: https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/

## Problem description

**Previous**:

**kaggle-houseprice-01-linear-model-and-continuous-imputation.ipynb**
We try to predict house prices based on a number of continuous and categorical variables.
In the first step, the prediction will be made using only a small selection of continuous variables:

* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* TotalBsmtSF: Total square feet of basement area
* SalePrice: target variable

We will use a very simple network: a linear network with a single non-linearity.

**kaggle-houseprice-02-data-scaling.ipynb**

In order to make it a little easier for gradient descent to converge to a minimum, we will scale the input data to have 0 mean and a standard deviation of 1. For a discussion on why it is useful to scale input data, see https://stats.stackexchange.com/questions/249378/is-scaling-data-0-1-necessary-when-batch-normalization-is-used. We will not scale the target data, following this discussion: https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re.
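A minimal sketch of the scaling step described above, using made-up numbers (the arrays and variable names are illustrative and not taken from the notebook). The mean and standard deviation are learned from the training rows only and then reused for new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix (hypothetical values): two columns on very different scales
X_train = np.array([[60.0, 8000.0],
                    [70.0, 9500.0],
                    [80.0, 11000.0],
                    [90.0, 14000.0]])
X_new = np.array([[75.0, 10000.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from the training data
X_new_scaled = scaler.transform(X_new)          # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # close to 0 for every column
print(X_train_scaled.std(axis=0))   # close to 1 for every column
```

Fitting the scaler on the training data alone keeps information from validation or test rows out of the preprocessing statistics.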

**kaggle-houseprice-03-one-hot-for-missing-continuous.ipynb**

Instead of just replacing missing values in our dataset with the mean or the median of the respective column, we will now create a *one-hot encoded vector* (a binary indicator column) to mark the previously *missing data* and add it to the data set. For the same reason that we used the *sklearn.preprocessing StandardScaler*, we will now use the *sklearn.impute SimpleImputer* to replace missing values. Also, to make this part of the data processing a little easier to reuse, we will refactor the code into a function.

* missing_LotFrontage: one-hot vector with 1 for each missing value in LotFrontage and 0 else
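As a rough illustration of the indicator idea, here is a small pandas sketch on a toy frame (the values are hypothetical; only the column names follow the dataset). The missing entries are flagged first and imputed afterwards, so the fact that a value was missing is preserved.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing LotFrontage entry
df = pd.DataFrame({'LotFrontage': [65.0, np.nan, 80.0],
                   'LotArea': [8450, 9600, 11250]})

# Flag the missing entries before imputing, so the information is not lost
df['missing_LotFrontage'] = df['LotFrontage'].isna().astype(int)

# Replace the missing entries with the column median
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].median())

print(df)
```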

**Now:**

**kaggle-houseprice-04-pipeline-for-preprocessing.ipynb**

Instead of relying on self-written code for processing our continuous variables, we will now delegate this part of the processing to sklearn transformers. Additionally, those transformers will be put in a pipeline so that they don't have to be called individually every time. This will help keep our code simple and clean, and produce consistent results when processing multiple datasets.

In [34]:
from pathlib import Path
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

In [2]:
# Show more rows and columns in the pandas output
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
#pd.set_option('display.width', 1000)

In [3]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

## Helpers

In [4]:
def show_missing(df, show_all=True):
    """    
    Shows absolute and relative number of missing values for each column of a dataframe,
    show_all=True also shows columns with no missing values.
    """
    mis_val_abs = df.isnull().sum()
    mis_val_rel = df.isnull().sum()/df.shape[0]
    mis_val_table = pd.concat([df.dtypes, mis_val_abs, mis_val_rel], axis=1)
    mis_val_table = mis_val_table.rename(columns={0: 'dtype', 1: 'Missing abs', 2: 'Missing rel'})

    if show_all:
        # Sort table descending by relative amount missing
        mis_val_table = mis_val_table.sort_values('Missing rel', ascending=False).round(3)
    else:
        # Sort table descending by relative amount missing, remove columns where no values are missing
        mis_val_table = mis_val_table[mis_val_table.iloc[:, 1] != 0].sort_values('Missing rel', ascending=False).round(3)
    
    return mis_val_table

In [5]:
def proc_cont(cont_names, df_train, df_val=pd.DataFrame(), df_test=pd.DataFrame(), strat='median'):
    if df_val.empty:
        df_val = pd.DataFrame(columns=df_train.columns)
    if df_test.empty:
        df_test = pd.DataFrame(columns=df_train.columns)
    
    # Add one-hot vectors for missing values
    found_missing = False
    df_train_miss = pd.DataFrame()
    df_val_miss = pd.DataFrame()
    df_test_miss = pd.DataFrame()
    missing_cols = {}
    for name in cont_names:
        miss_train = pd.isna(df_train[name])*1
        miss_val = pd.isna(df_val[name])*1
        miss_test = pd.isna(df_test[name])*1
        if sum(miss_train) + sum(miss_val) + sum(miss_test) > 0:
            found_missing = True
            # Get one-hot name
            missing_cols[name] = 'missing_'+name
            
            # Add one-hot vector to temporary dataframe
            df_train_miss = pd.concat([df_train_miss, miss_train], axis=1)
            df_val_miss = pd.concat([df_val_miss, miss_val], axis=1)
            df_test_miss = pd.concat([df_test_miss, miss_test], axis=1)
    
    if found_missing:
        # Rename new one-hot encoded columns
        df_train_miss = df_train_miss.rename(columns=missing_cols)
        df_val_miss = df_val_miss.rename(columns=missing_cols)
        df_test_miss = df_test_miss.rename(columns=missing_cols)
        
        # Add new columns to dataframes
        df_train = pd.concat([df_train, df_train_miss], axis=1)
        df_val = pd.concat([df_val, df_val_miss], axis=1)
        df_test = pd.concat([df_test, df_test_miss], axis=1)

    # Impute missing values
    sk_imputer = SimpleImputer(strategy=strat)
    sk_imputer.fit(df_train[cont_names])
    df_train[cont_names] = sk_imputer.transform(df_train[cont_names])
    
    # Scale variables to have 0 mean and 1 std
    sk_scaler = StandardScaler()
    sk_scaler.fit(df_train[cont_names])
    df_train[cont_names] = sk_scaler.transform(df_train[cont_names])
    
    # Apply to validation and test data
    if not df_val.empty:
        df_val[cont_names] = sk_imputer.transform(df_val[cont_names])
        df_val[cont_names] = sk_scaler.transform(df_val[cont_names])
    if not df_test.empty:
        df_test[cont_names] = sk_imputer.transform(df_test[cont_names])
        df_test[cont_names] = sk_scaler.transform(df_test[cont_names])
        
    return df_train, df_val, df_test

## Load data

In [6]:
PATH = Path('../data/houseprice/')
#!dir {PATH}  # For Windows
!ls {PATH}

The command "ls" is either misspelled or
could not be found.


In [14]:
# Import data
dep = ['SalePrice']
df_train = pd.read_csv(PATH/'train.csv', sep=',', header=0,
                       usecols=['LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF',
                                'TotalBsmtSF', 'SalePrice'])
df_y = df_train[dep]
df_train = df_train.drop(dep, axis=1)
df_train.shape

(1460, 5)

In [9]:
# New
cont_names = ['LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF', 'TotalBsmtSF']

## Pre-processing

First, we take a look at a couple of rows and some descriptive statistics. This gives us an idea about the scale of values, and helps to decide if some continuous variables should perhaps be treated as categorical. In this case all variables will be treated as continuous.

We also check for missing values. If we find any, we have two options: remove the rows that contain missing values (which might lead to losing a lot of observations), or replace them with other values so that the network can use them. Common values used as a replacement are the mean or the median of the series, or some constant.
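The two options can be compared on a toy column. This is only a sketch with made-up values; the strategies shown are the ones *SimpleImputer* offers for numeric data (mean, median, constant).

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values) with two missing entries
df = pd.DataFrame({'LotFrontage': [60.0, np.nan, 70.0, 110.0, np.nan]})

# Option 1: drop incomplete rows (here 2 of 5 observations are lost)
dropped = df.dropna()
print(len(df), '->', len(dropped))

# Option 2: keep all rows and fill the gaps with a replacement value
for strategy, kwargs in [('mean', {}), ('median', {}), ('constant', {'fill_value': 0.0})]:
    imputer = SimpleImputer(strategy=strategy, **kwargs)
    filled = imputer.fit_transform(df)
    print(strategy, filled.ravel())
```

The median is less sensitive to outliers than the mean, which is one reason it is a common default for skewed columns.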

In [16]:
df_train.head()

| | LotFrontage | LotArea | TotalBsmtSF | 1stFlrSF | 2ndFlrSF |
|---|---|---|---|---|---|
| 0 | 65.0 | 8450 | 856 | 856 | 854 |
| 1 | 80.0 | 9600 | 1262 | 1262 | 0 |
| 2 | 68.0 | 11250 | 920 | 920 | 866 |
| 3 | 60.0 | 9550 | 756 | 961 | 756 |
| 4 | 84.0 | 14260 | 1145 | 1145 | 1053 |


In [17]:
df_train[cont_names].describe()

| | LotFrontage | LotArea | 1stFlrSF | 2ndFlrSF | TotalBsmtSF |
|---|---|---|---|---|---|
| count | 1201.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 70.049958 | 10516.828082 | 1162.626712 | 346.992466 | 1057.429452 |
| std | 24.284752 | 9981.264932 | 386.587738 | 436.528436 | 438.705324 |
| min | 21.0 | 1300.0 | 334.0 | 0.0 | 0.0 |
| 25% | 59.0 | 7553.5 | 882.0 | 0.0 | 795.75 |
| 50% | 69.0 | 9478.5 | 1087.0 | 0.0 | 991.5 |
| 75% | 80.0 | 11601.5 | 1391.25 | 728.0 | 1298.25 |
| max | 313.0 | 215245.0 | 4692.0 | 2065.0 | 6110.0 |


In [18]:
show_missing(df_train)

| | dtype | Missing abs | Missing rel |
|---|---|---|---|
| LotFrontage | float64 | 259 | 0.177 |
| LotArea | int64 | 0 | 0.0 |
| TotalBsmtSF | int64 | 0 | 0.0 |
| 1stFlrSF | int64 | 0 | 0.0 |
| 2ndFlrSF | int64 | 0 | 0.0 |


In [20]:
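# Load the test data; assumes test.csv sits next to train.csv in PATH
df_test = pd.read_csv(PATH/'test.csv', sep=',', header=0, usecols=cont_names)
show_missing(df_test)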

| | dtype | Missing abs | Missing rel |
|---|---|---|---|
| LotFrontage | float64 | 227 | 0.156 |
| TotalBsmtSF | float64 | 1 | 0.001 |
| LotArea | int64 | 0 | 0.0 |
| 1stFlrSF | int64 | 0 | 0.0 |
| 2ndFlrSF | int64 | 0 | 0.0 |


### Pre-processing pipeline

Now we need to

1. Transform all continuous variables to the *float64* dtype to make sure the transformers in the pipeline can handle the data.
2. Define the pipeline
3. Fit the transformers of the pipeline to our training data
4. Apply the transformers to the training data

In [19]:
cont_names

['LotFrontage', 'LotArea', '1stFlrSF', '2ndFlrSF', 'TotalBsmtSF']

In [42]:
# Convert all continuous columns to float64
df_train[cont_names] = df_train[cont_names].astype('float64')

The pipeline consists of 4 transformers:

1. The *MissingIndicator* creates a binary indicator vector for every column that contains missing values. Because it is listed first in the *FeatureUnion*, these vectors appear before the original columns in the resulting array.
2. The *SimpleImputer* replaces the missing values in the original columns based on a strategy, in this case 'median'
3. The *StandardScaler* scales the input data to have a mean of 0 and a standard deviation of 1
4. The *FeatureUnion* combines the results above
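To make the column order of the combined output concrete, the following sketch runs the same kind of union on a toy frame. The frame `toy`, its columns `a` and `b`, and the `mis_` prefix are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values): column 'a' has a missing entry, 'b' does not
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 5.0],
                    'b': [10.0, 20.0, 30.0, 40.0]})

union = FeatureUnion(transformer_list=[
    ('missing_features', MissingIndicator(missing_values=np.nan)),
    ('numeric_features', make_pipeline(SimpleImputer(strategy='median'),
                                       StandardScaler())),
])

# The result is a plain NumPy array, not a dataframe
out = union.fit_transform(toy)

# Indicator columns come first (one per column with missing values),
# followed by the imputed and scaled original columns
names = [f'mis_{c}' for c in toy.columns if toy[c].isnull().any()] + list(toy.columns)
print(pd.DataFrame(out, columns=names))
```

Only `a` gets an indicator column here, because the default `features='missing-only'` setting skips columns without missing values at fit time.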

In [25]:
# Creating the pre-processing pipeline
preprocessing_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        ('missing_features', make_pipeline(
            MissingIndicator(missing_values=np.nan)            
        )),
        ('numeric_features', make_pipeline(
            SimpleImputer(strategy='median'),
            StandardScaler()
        ))
    ])
)

In [43]:
# Fit the pipeline to the training data
preprocessing_pipeline.fit(df_train)

Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=None,
       transformer_list=[('missing_features', Pipeline(memory=None,
     steps=[('missingindicator', MissingIndicator(error_on_new=True, features='missing-only',
         missing_values=nan, sparse='auto'))])), ('numeric_features', Pipeline(memory=No...', StandardScaler(copy=True, with_mean=True, with_std=True))]))],
       transformer_weights=None))])

In [38]:
# Create a list of names for the columns with missing values and append the column names from our training dataframe.
mis_names = [f'mis_{name}' for name in df_train.columns if df_train[name].isnull().any()]
names = mis_names + list(df_train.columns)

In [44]:
proc_train = pd.DataFrame(preprocessing_pipeline.transform(df_train), columns=names)

In [45]:
proc_train.head(10)

| | mis_LotFrontage | LotFrontage | LotArea | TotalBsmtSF | 1stFlrSF | 2ndFlrSF |
|---|---|---|---|---|---|---|
| 0 | 0.0 | -0.220875 | -0.207142 | -0.459303 | -0.793434 | 1.161852 |
| 1 | 0.0 | 0.460320 | -0.091886 | 0.466465 | 0.257140 | -0.795163 |
| 2 | 0.0 | -0.084636 | 0.073480 | -0.313369 | -0.627826 | 1.189351 |
| 3 | 0.0 | -0.447940 | -0.096897 | -0.687324 | -0.521734 | 0.937276 |
| 4 | 0.0 | 0.641972 | 0.375148 | 0.199680 | -0.045611 | 1.617877 |
| 5 | 0.0 | 0.687385 | 0.360616 | -0.596115 | -0.948691 | 0.501875 |
| 6 | 0.0 | 0.233255 | -0.043379 | 1.433276 | 1.374993 | -0.795163 |
| 7 | 1.0 | -0.039223 | -0.013513 | 0.113032 | -0.143941 | 1.457466 |
| 8 | 0.0 | -0.856657 | -0.440659 | -0.240402 | -0.363889 | 0.928110 |
| 9 | 0.0 | -0.902070 | -0.310370 | -0.151473 | -0.221569 | -0.795163 |


In [46]:
proc_train.describe()

| | mis_LotFrontage | LotFrontage | LotArea | TotalBsmtSF | 1stFlrSF | 2ndFlrSF |
|---|---|---|---|---|---|---|
| count | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 | 1460.0 |
| mean | 0.177397 | 2.79837e-16 | -5.840077e-17 | 2.457699e-16 | 6.509253e-17 | -1.825024e-17 |
| std | 0.382135 | 1.000343 | 1.000343 | 1.000343 | 1.000343 | 1.000343 |
| min | 0.0 | -2.219047 | -0.9237292 | -2.411167 | -2.144172 | -0.7951632 |
| 25% | 0.0 | -0.44794 | -0.2969908 | -0.5966855 | -0.7261556 | -0.7951632 |
| 50% | 0.0 | -0.03922314 | -0.1040633 | -0.1503334 | -0.1956933 | -0.7951632 |
| 75% | 0.0 | 0.4149067 | 0.108708 | 0.5491227 | 0.5915905 | 0.8731117 |
| max | 1.0 | 11.04155 | 20.51827 | 11.52095 | 9.132681 | 3.936963 |


In [47]:
show_missing(proc_train)

| | dtype | Missing abs | Missing rel |
|---|---|---|---|
| mis_LotFrontage | float64 | 0 | 0.0 |
| LotFrontage | float64 | 0 | 0.0 |
| LotArea | float64 | 0 | 0.0 |
| TotalBsmtSF | float64 | 0 | 0.0 |
| 1stFlrSF | float64 | 0 | 0.0 |
| 2ndFlrSF | float64 | 0 | 0.0 |


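As we can see, missing values are indicated in the *mis_LotFrontage* column, missing values have been replaced, and the data has been scaled. The standard deviation reported by describe() is 1.000343 rather than exactly 1 because pandas computes the sample standard deviation (ddof=1), whereas the *StandardScaler* uses the population standard deviation (ddof=0). We can now replace our self-written proc_cont() function with just a couple of lines of code.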
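One caveat when reusing the fitted pipeline on other data: the test set has a missing *TotalBsmtSF* value, while the training set has none in that column. With the *MissingIndicator* settings shown in the fitted pipeline's repr (`features='missing-only'`, `error_on_new=True`), `transform` raises a `ValueError` in that situation. The sketch below reproduces this on toy data and shows one possible workaround, `error_on_new=False`; the helper `build_pipeline` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import StandardScaler

def build_pipeline(error_on_new):
    # Hypothetical helper: same layout as the pre-processing pipeline above
    return FeatureUnion(transformer_list=[
        ('missing_features', MissingIndicator(missing_values=np.nan,
                                              error_on_new=error_on_new)),
        ('numeric_features', make_pipeline(SimpleImputer(strategy='median'),
                                           StandardScaler())),
    ])

# Toy data: 'b' is complete in train but has a missing value in test
train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [10.0, 20.0, 30.0]})
test = pd.DataFrame({'a': [2.0, np.nan], 'b': [np.nan, 25.0]})

# Default behaviour: new missing values in 'b' trigger an error
strict = build_pipeline(error_on_new=True).fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True
print('strict pipeline raised:', raised)

# Workaround: ignore missing values in columns that were complete during fit
lenient = build_pipeline(error_on_new=False).fit(train)
test_out = lenient.transform(test)  # transform only, statistics come from train
print(test_out)
```

With `error_on_new=False` the missing value in `b` is still imputed by the *SimpleImputer*, but no indicator column is produced for it. Setting `features='all'` instead would create an indicator for every column.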

# PyTorch

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

## Dataset, dataloader

In [None]:
# Convert all data containers to tensors
t_train = torch.tensor(proc_train.values, dtype=torch.float32, device=device)  # pre-processed data: 6 columns, no NaNs
t_y = torch.tensor(df_y.values, dtype=torch.float32, device=device)
#t_y = (t_y-t_y.mean())/t_y.std()

In [None]:
# Dataset
train_ds = TensorDataset(t_train, t_y)

In [None]:
# Dataloader
batch_size=64
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)

## Model

In [None]:
class LinearNet(nn.Module):
    def __init__(self):
        super().__init__()
        
        # Layers
        self.linear1 = nn.Linear(6, 100)
        self.act1 = nn.ReLU()
        self.linear2 = nn.Linear(100, 1)        
    
    def forward(self, x):        
        x = self.linear1(x)
        x = self.act1(x)
        x = self.linear2(x)
        
        return x

In [None]:
# Instantiate the model
model = LinearNet().to(device)

## Optimizer

In [None]:
lr = 0.1
opt = torch.optim.Adam(model.parameters(), lr=lr)

## Loss

In [None]:
loss_fn = F.mse_loss

## Train

In [None]:
losses = []
def fit(num_epochs, model, loss_fn, opt):    
    for epoch in range(num_epochs):
        for xb, yb in train_dl:
            # Forward            
            preds = model(xb)
            loss = loss_fn(preds, yb)
            losses.append(loss.item())  # store a plain float, not the graph-attached tensor
            
            # Gradient descent
            loss.backward()
            opt.step()
            opt.zero_grad()
            
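        if epoch%20==0:
            # No gradients are needed for reporting the loss on the full training set
            with torch.no_grad():
                print('Training loss:', loss_fn(model(t_train), t_y).item())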

In [None]:
# Train for 300 epochs
fit(num_epochs=300, model=model, loss_fn=loss_fn, opt=opt)

In [None]:
plt.plot(losses)

In [None]:
preds = model(t_train)

In [None]:
torch.cat([preds, t_y.reshape(-1,1)], dim=1)[:10, :]

In [None]:
plt.scatter(preds.detach().cpu().numpy(), t_y.reshape(-1,1).cpu().numpy())