# XGBoost in a pipeline using the Ames Housing dataset

by Héctor Ramírez
<hr>

Throughout this example, we will be working with the Ames Housing prices dataset introduced in http://jse.amstat.org/v19n3/decock.pdf. As the abstract states: "[...] a data set describing the sale of individual residential property in Ames, Iowa from 2006 to 2010. The data set contains observations and a large number of explanatory variables involved in assessing home values."
<hr>

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.feature_extraction import DictVectorizer
from sklearn.pipeline import Pipeline
import xgboost as xgb
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score
import warnings; warnings.simplefilter(action='ignore', category=FutureWarning)
# import warnings; warnings.simplefilter('ignore')

In [3]:
df = pd.read_csv('ames_housing_data.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 21 columns):
MSSubClass      1460 non-null int64
MSZoning        1460 non-null object
LotFrontage     1201 non-null float64
LotArea         1460 non-null int64
Neighborhood    1460 non-null object
BldgType        1460 non-null object
HouseStyle      1460 non-null object
OverallQual     1460 non-null int64
OverallCond     1460 non-null int64
YearBuilt       1460 non-null int64
Remodeled       1460 non-null int64
GrLivArea       1460 non-null int64
BsmtFullBath    1460 non-null int64
BsmtHalfBath    1460 non-null int64
FullBath        1460 non-null int64
HalfBath        1460 non-null int64
BedroomAbvGr    1460 non-null int64
Fireplaces      1460 non-null int64
GarageArea      1460 non-null int64
PavedDrive      1460 non-null object
SalePrice       1460 non-null int64
dtypes: float64(1), int64(15), object(5)
memory usage: 239.7+ KB


<hr>

## Preprocessing data

To train a model on this dataset and use it in a pipeline, we need to do some preprocessing first. Missing values in the LotFrontage column need to be filled in, and categorical columns (those of object type) need to be encoded as numeric values.

We follow two different approaches: LabelEncoder + OneHotEncoder and DictVectorizer.

### Encoding categorical columns I: LabelEncoder + OneHotEncoder

First, we fill in the missing values. Then, we one-hot encode the categorical columns so that they are represented numerically.

The data has five categorical columns: MSZoning, PavedDrive, Neighborhood, BldgType, and HouseStyle. Scikit-learn has a LabelEncoder class that converts the values in a categorical column into integers.
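
To make the integer mapping concrete, here is a minimal pure-Python sketch of the idea: sort the distinct labels and number them from 0, which is the convention LabelEncoder follows. The `neighborhoods` list is a hypothetical sample, not the full Ames column, so the codes differ from those in the notebook output.

```python
# Hypothetical sample of neighborhood labels (not the full Ames column)
neighborhoods = ["CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge"]

# Sort the distinct labels and number them from 0
classes = sorted(set(neighborhoods))
label_to_code = {label: code for code, label in enumerate(classes)}

# Replace every label with its integer code
codes = [label_to_code[label] for label in neighborhoods]

print(classes)  # ['CollgCr', 'Crawfor', 'NoRidge', 'Veenker']
print(codes)    # [0, 3, 0, 1, 2]
```

The codes depend only on alphabetical position, which is why they carry no information about the houses themselves.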
<hr>

In [4]:
# Fill missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

# Create a boolean mask for categorical columns and get list of categorical column names
categorical_mask = (df.dtypes == object)
categorical_columns = df.columns[categorical_mask].tolist()

print(df[categorical_columns].head())

# Create LabelEncoder object and apply LabelEncoder to categorical columns
le = LabelEncoder()
df[categorical_columns] = df[categorical_columns].apply(lambda x: le.fit_transform(x))

print(df[categorical_columns].head())

  MSZoning Neighborhood BldgType HouseStyle PavedDrive
0       RL      CollgCr     1Fam     2Story          Y
1       RL      Veenker     1Fam     1Story          Y
2       RL      CollgCr     1Fam     2Story          Y
3       RL      Crawfor     1Fam     2Story          Y
4       RL      NoRidge     1Fam     2Story          Y
   MSZoning  Neighborhood  BldgType  HouseStyle  PavedDrive
0         3             5         0           5           2
1         3            24         0           2           2
2         3             5         0           5           2
3         3             6         0           5           2
4         3            15         0           5           2


<hr>
The integer codes imply an ordering that does not exist in the data. For example, LabelEncoder encoded the CollgCr neighborhood as 5, Veenker as 24, and Crawfor as 6. Is Veenker "greater" than Crawfor and CollgCr? No, and allowing the model to assume such an ordering may result in poor performance.
<br><br>
As a result, one more step is needed: we have to apply one-hot encoding to create binary, or "dummy", variables. We can do this using scikit-learn's OneHotEncoder.
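
As an illustration of what one-hot encoding produces, the following pure-Python sketch expands a list of integer codes into indicator rows. The `codes` list is a hypothetical column with four categories; this shows the concept only and is not how OneHotEncoder is implemented.

```python
# Hypothetical integer codes for one categorical column with 4 categories
codes = [0, 3, 0, 1, 2]
n_categories = max(codes) + 1

# One row per observation, one column per category;
# a single 1 marks the category the observation belongs to
one_hot = [[1 if code == category else 0 for category in range(n_categories)]
           for code in codes]

for row in one_hot:
    print(row)
```

Each row contains exactly one 1, so no category is numerically "greater" than another.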
<hr>

In [5]:
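# Create OneHotEncoder and apply it to categorical columns
# Note: this call targets older scikit-learn releases; `categorical_features` was removed
# in 0.22 and `sparse` was renamed to `sparse_output` in 1.2
ohe = OneHotEncoder(categorical_features=categorical_mask, sparse=False)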
df_encoded = ohe.fit_transform(df)

print(df_encoded[:5, :])
print(df.shape)
print(df_encoded.shape)

[[0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 1.000e+00 6.000e+01 6.500e+01 8.450e+03
  7.000e+00 5.000e+00 2.003e+03 0.000e+00 1.710e+03 1.000e+00 0.000e+00
  2.000e+00 1.000e+00 3.000e+00 0.000e+00 5.480e+02 2.085e+05]
 [0.000e+00 0.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
  0.000e+00 1.000e+00 1.000e+00 0.000e+00 0.000e+00 0.000e+00 0.000e+00
 



<hr>
After one-hot encoding, which creates binary variables out of the categorical variables, there are now 62 columns. Furthermore, notice that df_encoded is no longer a DataFrame but a NumPy array.

### Encoding categorical columns II: DictVectorizer

The two-step process we just went through, LabelEncoder followed by OneHotEncoder, can be simplified by using a DictVectorizer.

Using a DictVectorizer on a DataFrame that has been converted to a dictionary allows us to get label encoding as well as one-hot encoding in one go.
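
The sketch below imitates this behaviour in pure Python to show why one step is enough: string values become `key=value` indicator features, while numeric values pass through under their own key. The two `records` and the helper `feature_name` are hypothetical, and this is a simplified illustration rather than the library's implementation.

```python
# Hypothetical records, shaped like the output of DataFrame.to_dict('records')
records = [
    {"MSZoning": "RL", "LotArea": 8450},
    {"MSZoning": "RM", "LotArea": 9600},
]

def feature_name(key, value):
    # String values become "key=value" indicator features; numbers keep their key
    return f"{key}={value}" if isinstance(value, str) else key

# Assign each feature name a column index, in sorted order
names = sorted({feature_name(k, v) for rec in records for k, v in rec.items()})
vocabulary = {name: index for index, name in enumerate(names)}

# Build the dense matrix: 1.0 for a matching string value, the number itself otherwise
matrix = []
for rec in records:
    row = [0.0] * len(vocabulary)
    for k, v in rec.items():
        row[vocabulary[feature_name(k, v)]] = 1.0 if isinstance(v, str) else float(v)
    matrix.append(row)

print(vocabulary)  # {'LotArea': 0, 'MSZoning=RL': 1, 'MSZoning=RM': 2}
print(matrix)      # [[8450.0, 1.0, 0.0], [9600.0, 0.0, 1.0]]
```

Note that the one-hot expansion only happens for string values; a column that has already been converted to integers is passed through as a single numeric feature.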
<hr>

In [6]:
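# Convert df into a list of record dictionaries, then create the DictVectorizer and apply it
# Note: df's categorical columns were label-encoded in In [4], so they are treated as numeric here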
df_dict = df.to_dict('records')
dv = DictVectorizer(sparse=False)
df_encoded = dv.fit_transform(df_dict)

print(df_encoded[:5,:])
print(dv.vocabulary_)

[[3.000e+00 0.000e+00 1.000e+00 0.000e+00 0.000e+00 2.000e+00 5.480e+02
  1.710e+03 1.000e+00 5.000e+00 8.450e+03 6.500e+01 6.000e+01 3.000e+00
  5.000e+00 5.000e+00 7.000e+00 2.000e+00 0.000e+00 2.085e+05 2.003e+03]
 [3.000e+00 0.000e+00 0.000e+00 1.000e+00 1.000e+00 2.000e+00 4.600e+02
  1.262e+03 0.000e+00 2.000e+00 9.600e+03 8.000e+01 2.000e+01 3.000e+00
  2.400e+01 8.000e+00 6.000e+00 2.000e+00 0.000e+00 1.815e+05 1.976e+03]
 [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 6.080e+02
  1.786e+03 1.000e+00 5.000e+00 1.125e+04 6.800e+01 6.000e+01 3.000e+00
  5.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 2.235e+05 2.001e+03]
 [3.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 1.000e+00 6.420e+02
  1.717e+03 0.000e+00 5.000e+00 9.550e+03 6.000e+01 7.000e+01 3.000e+00
  6.000e+00 5.000e+00 7.000e+00 2.000e+00 1.000e+00 1.400e+05 1.915e+03]
 [4.000e+00 0.000e+00 1.000e+00 0.000e+00 1.000e+00 2.000e+00 8.360e+02
  2.198e+03 1.000e+00 5.000e+00 1.426e+04 8.400e+01 6.000e+0

<hr>
Besides simplifying the process into one step, DictVectorizer has useful attributes such as vocabulary_, which maps feature names to their column indices.

### Preprocessing within a pipeline

We will use the cleaner and more succinct DictVectorizer approach and place it alongside an XGBRegressor inside a scikit-learn pipeline: first using the fit/predict methods, and then cross-validation.
<hr>

In [7]:
# Fill LotFrontage missing values with 0
df.LotFrontage = df.LotFrontage.fillna(0)

# Setup the pipeline steps, create and fit the pipeline
steps = [("ohe_onestep", DictVectorizer(sparse=False)),
         ("xgb_model", xgb.XGBRegressor())]

xgb_pipeline = Pipeline(steps)

xgb_pipeline.fit(df.iloc[:,:-1].to_dict('records'), df.SalePrice)
# Predict on the same rows used for fitting, so this RMSE is an in-sample estimate
preds = xgb_pipeline.predict(df.iloc[:,:-1].to_dict('records'))
rmse = np.sqrt(mean_squared_error(df.SalePrice, preds))
print("\nRMSE: %f" % (rmse))


RMSE: 19008.073730


In [8]:
cross_val_scores = cross_val_score(xgb_pipeline, df.iloc[:,:-1].to_dict('records'), df.SalePrice,
                                   cv=10, scoring='neg_mean_squared_error')

print("\n10-fold RMSE: ", np.mean(np.sqrt(np.abs(cross_val_scores))))


10-fold RMSE:  28440.796697477967
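
The conversion in the last line can be sketched on its own. With `scoring='neg_mean_squared_error'`, each fold's score is the negated MSE, so the sign is flipped before taking the square root, and the per-fold RMSE values are then averaged. The three scores below are hypothetical numbers chosen for easy arithmetic, not results from the Ames data.

```python
import math

# Hypothetical per-fold scores in the 'neg_mean_squared_error' sign convention
neg_mse_scores = [-4.0, -9.0, -16.0]

# Flip the sign to recover each fold's MSE, then take the square root
fold_rmse = [math.sqrt(-score) for score in neg_mse_scores]

# Average the per-fold RMSE values
mean_rmse = sum(fold_rmse) / len(fold_rmse)

print(fold_rmse)  # [2.0, 3.0, 4.0]
print(mean_rmse)  # 3.0
```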
