# Ames Housing - Auto ML
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load the required packages and set up the workspace, e.g., set the plotting theme. Random seeds are set later, directly in the H2O calls.

In [17]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

import h2o
from h2o.automl import H2OAutoML

In [18]:
plt.style.use('fivethirtyeight')

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## AutoML with H2O

H2O usually runs on a server. Here, we emulate the server on the local machine: `h2o.init()` connects to a running instance or, if none is found, starts a local one, and `h2o.shutdown()` stops it.

In [23]:
h2o.init()

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_381"; Java(TM) SE Runtime Environment (build 1.8.0_381-b09); Java HotSpot(TM) 64-Bit Server VM (build 25.381-b09, mixed mode)
  Starting server from /Users/oliver/miniconda3/envs/prodok/lib/python3.12/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/4d/0rd3mwcn4y7gv39shh0dy_0w0000gn/T/tmp74eywmn0
  JVM stdout: /var/folders/4d/0rd3mwcn4y7gv39shh0dy_0w0000gn/T/tmp74eywmn0/h2o_oliver_started_from_python.out
  JVM stderr: /var/folders/4d/0rd3mwcn4y7gv39shh0dy_0w0000gn/T/tmp74eywmn0/h2o_oliver_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


0,1
H2O_cluster_uptime:,03 secs
H2O_cluster_timezone:,Europe/Berlin
H2O_data_parsing_timezone:,UTC
H2O_cluster_version:,3.44.0.3
H2O_cluster_version_age:,1 month and 22 days
H2O_cluster_name:,H2O_from_python_oliver_taccox
H2O_cluster_total_nodes:,1
H2O_cluster_free_memory:,3.540 Gb
H2O_cluster_total_cores:,8
H2O_cluster_allowed_cores:,8


### Load and prepare data

Next, we have to "upload" the data to the server. The data is loaded with the `h2o.import_file()` command.

In [73]:
df = h2o.import_file('data/train.csv')

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [74]:
df.describe()

Unnamed: 0,house_id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,FirstFlrSF,SecondFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,ThreeSsnPorch,ScreenPorch,PoolArea,Fence,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
type,int,int,enum,int,int,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,enum,int,int,int,int,enum,enum,enum,enum,enum,int,enum,enum,enum,enum,enum,enum,enum,int,enum,int,int,int,enum,enum,enum,enum,int,int,int,int,int,int,int,int,int,int,enum,int,enum,int,enum,enum,enum,int,int,enum,enum,enum,int,int,int,int,int,int,enum,int,int,int,enum,enum,int
mins,1.0,20.0,,0.0,1300.0,,,,,,,,,,,,,1.0,1.0,1872.0,1950.0,,,,,,0.0,,,,,,,,0.0,,0.0,0.0,0.0,,,,,334.0,0.0,0.0,334.0,0.0,0.0,0.0,0.0,0.0,0.0,,2.0,,0.0,,,,0.0,0.0,,,,0.0,0.0,0.0,0.0,0.0,0.0,,0.0,1.0,2006.0,,,12789.0
mean,1172.5,57.3080204778157,,57.60409556313993,10127.857508532423,,,,,,,,,,,,,6.064419795221843,5.581911262798636,1970.5068259385666,1983.9236348122868,,,,,,97.74104095563139,,,,,,,,442.3425767918089,,50.0759385665529,554.1569965870307,1046.5755119453925,,,,,1154.8144197952217,333.68813993174064,4.3570819112627985,1492.8596416382252,0.43131399317406144,0.05930034129692833,1.5571672354948805,0.37073378839590443,2.845136518771331,1.043088737201365,,6.409129692832765,,0.5989761092150171,,,,1.746160409556314,468.8788395904437,,,,94.11305460750853,47.34172354948805,24.73165529010239,2.42278156996587,16.01023890784983,2.582764505119454,,58.055034129692835,6.184726962457338,2007.794795221843,,,178582.20776450512
maxs,2344.0,190.0,,313.0,215245.0,,,,,,,,,,,,,10.0,9.0,2010.0,2010.0,,,,,,1290.0,,,,,,,,5644.0,,1526.0,2336.0,6110.0,,,,,5095.0,2065.0,1064.0,5642.0,3.0,2.0,4.0,2.0,8.0,3.0,,15.0,,4.0,,,,4.0,1488.0,,,,870.0,742.0,1012.0,508.0,576.0,800.0,,17000.0,12.0,2010.0,,,755000.0
sigma,676.7988376664566,42.80255520479277,,33.54268403124423,8050.9081315860585,,,,,,,,,,,,,1.3885195724365862,1.105658637771616,30.341433709809888,20.786286586191313,,,,,,171.76682948029887,,,,,,,,452.21909238372723,,170.36377498493337,433.8468953028434,437.009368882394,,,,,385.11426946187606,427.1411905726988,44.32399300764559,504.6196755936108,0.525468127735148,0.24159561045811204,0.5521616197346733,0.49960818531350026,0.8203516095298484,0.21133887922134537,,1.563650890639959,,0.6529873381551253,,,,0.7471572083696859,212.6083240164343,,,,124.85186961842751,68.0374645088839,67.03094313263323,24.524361606192276,55.820882046949514,38.324144514778006,,623.3751214138194,2.708407833789809,1.3151172158333897,,,77125.07271273079
zeros,0,0,,393,0,,,,,,,,,,,,,0,0,0,0,,,,,,1422,,,,,,,,738,,2059,196,61,,,,,0,1346,2313,0,1367,2208,10,1494,5,2,,0,,1144,,,,121,121,,,,1215,1048,1956,2316,2139,2332,,2253,0,0,,,0
missing,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
0,1.0,20.0,RL,80.0,10400.0,Pave,none,Reg,Lvl,AllPub,Inside,Gtl,NWAmes,Norm,Norm,1Fam,1Story,7.0,5.0,1976.0,1976.0,Gable,CompShg,HdBoard,HdBoard,BrkFace,189.0,TA,TA,CBlock,Gd,TA,No,Unf,0.0,Unf,0.0,1090.0,1090.0,GasA,TA,Y,SBrkr,1370.0,0.0,0.0,1370.0,0.0,0.0,2.0,0.0,3.0,1.0,TA,6.0,Typ,1.0,TA,Attchd,RFn,2.0,479.0,TA,TA,Y,0.0,0.0,0.0,0.0,0.0,0.0,MnPrv,0.0,6.0,2009.0,WD,Family,152000.0
1,2.0,60.0,RL,0.0,28698.0,Pave,none,IR2,Low,AllPub,CulDSac,Sev,ClearCr,Norm,Norm,1Fam,2Story,5.0,5.0,1967.0,1967.0,Flat,Tar&Grv,Plywood,Plywood,,0.0,TA,TA,PConc,TA,Gd,Gd,LwQ,249.0,ALQ,764.0,0.0,1013.0,GasA,TA,Y,SBrkr,1160.0,966.0,0.0,2126.0,0.0,1.0,2.0,1.0,3.0,1.0,TA,7.0,Min2,0.0,none,Attchd,Fin,2.0,538.0,TA,TA,Y,486.0,0.0,0.0,0.0,225.0,0.0,none,0.0,6.0,2009.0,WD,Abnorml,185000.0
2,3.0,90.0,RL,70.0,9842.0,Pave,none,Reg,Lvl,AllPub,FR2,Gtl,NAmes,Norm,Norm,Duplex,1Story,4.0,5.0,1962.0,1962.0,Gable,CompShg,HdBoard,HdBoard,,0.0,TA,TA,Slab,none,none,none,none,0.0,none,0.0,0.0,0.0,GasA,TA,Y,SBrkr,1224.0,0.0,0.0,1224.0,0.0,0.0,2.0,0.0,2.0,2.0,TA,6.0,Typ,0.0,none,CarPort,Unf,2.0,462.0,TA,TA,Y,0.0,0.0,0.0,0.0,0.0,0.0,none,0.0,3.0,2007.0,WD,Normal,101800.0


We need to do some minimal data preparation, that is, identify the response variable and split the data into train and test sets.

In [27]:
y = "SalePrice"

In [28]:
splits = df.split_frame(ratios = [0.8], seed = 42)
train = splits[0]
test = splits[1]
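
One detail worth knowing: to our knowledge, `split_frame` assigns rows to the splits probabilistically rather than cutting at an exact row count, so the 80/20 ratio is only approximate. The sketch below imitates that idea in plain Python. `probabilistic_split` is a hypothetical helper for illustration; it is not part of the H2O API and does not reproduce H2O's actual algorithm or its random number stream.

```python
import random


def probabilistic_split(rows, ratio=0.8, seed=42):
    """Assign each row to the first split with probability `ratio`.

    Hypothetical helper for illustration only; it mimics the idea of a
    probabilistic split, not H2O's implementation.
    """
    rng = random.Random(seed)
    first, second = [], []
    for row in rows:
        # One independent random draw per row decides where the row goes.
        (first if rng.random() < ratio else second).append(row)
    return first, second


rows = list(range(2344))  # as many rows as the training file loaded above
train_rows, test_rows = probabilistic_split(rows)

# The sizes land near 80/20 but are usually not exactly 1875 and 469.
print(len(train_rows), len(test_rows))
```

In the notebook itself, `train.nrow` and `test.nrow` show the sizes that H2O actually produced.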

### Train AutoML model

Now we can start the AutoML process. It's really nothing more than specifying the maximum runtime and handing over the data. The AutoML process will then try a variety of learning algorithms and hyperparameter combinations to find the best model.
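
To make the idea of a time-budgeted model search concrete, here is a toy sketch in plain Python. It is not how H2O implements AutoML: `toy_automl`, the two candidate "algorithms", and the data are all made up for illustration. The loop fits candidates until the budget is used up and ranks them by RMSE on a validation set.

```python
import math
import time


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def fit_mean(x, y):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(y) / len(y)
    return lambda xs: [mean_y for _ in xs]


def fit_linear(x, y):
    """Simple linear regression with one feature (closed-form least squares)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return lambda xs: [my + slope * (v - mx) for v in xs]


def toy_automl(candidates, train, valid, max_runtime_secs):
    """Fit candidates until the time budget runs out; return them sorted by RMSE."""
    deadline = time.monotonic() + max_runtime_secs
    leaderboard = []
    for name, fit in candidates:
        if time.monotonic() > deadline:
            break  # budget exhausted: stop trying further candidates
        model = fit(*train)
        leaderboard.append((name, rmse(valid[1], model(valid[0]))))
    return sorted(leaderboard, key=lambda entry: entry[1])


# Made-up data: a target that grows roughly linearly with one feature.
x = [50, 80, 100, 120, 150, 200]
y = [100, 155, 205, 240, 310, 395]
train, valid = (x[:4], y[:4]), (x[4:], y[4:])

leaderboard = toy_automl([("mean", fit_mean), ("linear", fit_linear)], train, valid, max_runtime_secs=60)
for name, score in leaderboard:
    print(name, round(score, 2))
```

The real call below follows the same pattern at a much larger scale: a budget, a training frame, and a leaderboard sorted by an error metric.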

In [29]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 42, project_name = "ames")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%


key,value
Stacking strategy,cross_validation
Number of base models (used / total),4/6
# GBM base models (used / total),1/1
# XGBoost base models (used / total),1/1
# DeepLearning base models (used / total),1/1
# DRF base models (used / total),1/2
# GLM base models (used / total),0/1
Metalearner algorithm,GLM
Metalearner fold assignment scheme,Random
Metalearner nfolds,5

Unnamed: 0,mean,sd,cv_1_valid,cv_2_valid,cv_3_valid,cv_4_valid,cv_5_valid
mae,14361.798,1278.871,14635.716,13841.79,12471.604,15016.472,15843.409
mean_residual_deviance,611797120.0,215029328.0,675830780.0,517307552.0,312344064.0,658790400.0,894712960.0
mse,611797120.0,215029328.0,675830780.0,517307552.0,312344064.0,658790400.0,894712960.0
null_deviance,2237628150000.0,346971537000.0,1995348120000.0,2463938900000.0,1837940080000.0,2695961250000.0,2194953340000.0
r2,0.8971342,0.0363637,0.8706957,0.9227553,0.9346422,0.9092348,0.8483432
residual_deviance,228905599000.0,79462572000.0,257491534000.0,189851861000.0,119940112000.0,244411236000.0,332833227000.0
rmse,24398.613,4542.14,25996.746,22744.396,17673.258,25666.912,29911.752
rmsle,0.1260067,0.0175305,0.1273246,0.1054842,0.1121242,0.1483059,0.1367947
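
A note on reading this table: within each fold, `rmse` is the square root of `mse`, but the `mean` column averages the per-fold values. The mean RMSE is therefore not the square root of the mean MSE. The following sketch checks this with the fold values printed above (copied by hand, so they carry the table's rounding).

```python
import math

# Per-fold values copied from the cross-validation table above.
fold_mse = [675830780.0, 517307552.0, 312344064.0, 658790400.0, 894712960.0]
fold_rmse = [25996.746, 22744.396, 17673.258, 25666.912, 29911.752]

# Within each fold, RMSE is the square root of MSE (up to display rounding).
for mse, rmse in zip(fold_mse, fold_rmse):
    assert math.isclose(math.sqrt(mse), rmse, rel_tol=1e-4)

mean_of_rmse = sum(fold_rmse) / len(fold_rmse)
rmse_of_mean_mse = math.sqrt(sum(fold_mse) / len(fold_mse))

print(round(mean_of_rmse, 1))      # agrees with the `mean` column of the rmse row
print(round(rmse_of_mean_mse, 1))  # larger, because the square root is concave
```

This matters when comparing numbers across tools: some report the square root of the pooled MSE, others the average of the per-fold RMSE values.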


### Display leaderboard

The metrics above stem from H2O's internal cross-validation on the training split. The leaderboard below instead shows how the models perform on our held-out test split, which we passed as `leaderboard_frame`.

In [30]:
aml.leaderboard.head()

model_id,rmse,mse,mae,rmsle,mean_residual_deviance
StackedEnsemble_BestOfFamily_3_AutoML_1_20240212_133707,21662.3,469255000.0,13966.2,0.138664,469255000.0
StackedEnsemble_AllModels_2_AutoML_1_20240212_133707,22212.2,493381000.0,14123.0,0.141379,493381000.0
StackedEnsemble_AllModels_1_AutoML_1_20240212_133707,22874.5,523242000.0,14475.3,0.142651,523242000.0
StackedEnsemble_BestOfFamily_2_AutoML_1_20240212_133707,22899.7,524397000.0,14634.4,0.140837,524397000.0
GBM_2_AutoML_1_20240212_133707,22995.3,528783000.0,14671.5,0.140301,528783000.0
StackedEnsemble_BestOfFamily_1_AutoML_1_20240212_133707,24124.5,581993000.0,15493.6,0.151572,581993000.0
GBM_3_AutoML_1_20240212_133707,24359.1,593364000.0,15544.3,0.151266,593364000.0
GBM_1_AutoML_1_20240212_133707,24713.6,610760000.0,16212.9,0.159239,610760000.0
GBM_4_AutoML_1_20240212_133707,24843.6,617205000.0,15611.3,0.146629,617205000.0
XGBoost_2_AutoML_1_20240212_133707,25070.8,628546000.0,17213.4,0.159269,628546000.0


### Predict on Kaggle test set

The only thing we have to do now is to make predictions on the Kaggle test set and upload them.

In [62]:
test_kaggle = h2o.import_file('data/test.csv')

Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%


In [63]:
pred = aml.predict(test_kaggle)

stackedensemble prediction progress: |███████████████████████████████████████████| (done) 100%


Put everything together in one dataframe.

In [70]:
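# Convert the H2O frames to pandas DataFrames.
pred = h2o.as_list(pred)
ids = h2o.as_list(test_kaggle["house_id"])  # `ids` avoids shadowing the built-in `id`
# Combine IDs and predictions column-wise and rename the columns.
my_submission = pd.concat([ids, pred], axis = 1)
my_submission.columns = ['HouseId', 'SalePrice']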

In [71]:
my_submission.head()

Unnamed: 0,HouseId,SalePrice
0,2345,157281.66982
1,2346,114220.247043
2,2347,185537.502271
3,2348,128552.039232
4,2349,116962.245662


Save the dataframe to a CSV file and manually upload it to Kaggle.

In [75]:
my_submission.to_csv('submission.csv', index=False)
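
Before uploading, it can be worth re-reading the file and checking its shape. Below is a minimal sketch using only the standard library. `check_submission` is a hypothetical helper, and the expected header simply mirrors the column names set above; the actual requirements of the competition may differ. The demo writes a small stand-in file so the snippet runs on its own.

```python
import csv
import math
import os
import tempfile


def check_submission(path, expected_header=("HouseId", "SalePrice")):
    """Return the number of data rows; raise ValueError on obvious problems."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        if tuple(header) != tuple(expected_header):
            raise ValueError(f"unexpected header: {header}")
        seen_ids = set()
        for house_id, price in reader:
            if house_id in seen_ids:
                raise ValueError(f"duplicate id: {house_id}")
            seen_ids.add(house_id)
            value = float(price)  # raises ValueError for empty or non-numeric cells
            if not math.isfinite(value) or value <= 0:
                raise ValueError(f"implausible price for id {house_id}: {price}")
    return len(seen_ids)


# Stand-in file containing the first two rows of the submission shown above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "submission.csv")
    with open(demo_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["HouseId", "SalePrice"])
        writer.writerows([[2345, 157281.66982], [2346, 114220.247043]])
    n_rows = check_submission(demo_path)

print(n_rows)
```

In the notebook, the corresponding call would be `check_submission('submission.csv')`.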

Stop the H2O server.

In [76]:
h2o.shutdown()

H2O session _sid_9305 closed.
