# Ames Housing - Auto ML
- Author: Oliver Mueller
- Last update: 26.01.2024

## Initialize notebook
Load required packages. Set up workspace, e.g., set theme for plotting and initialize the random number generator.

In [None]:
# Install packages that are not already installed on Colab
#!pip install h2o

In [None]:
import warnings
warnings.simplefilter('ignore')

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split

import h2o
from h2o.automl import H2OAutoML

In [None]:
plt.style.use('fivethirtyeight')

## Problem description

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 76 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this dataset challenges you to predict the final price of each home. More: <https://www.kaggle.com/c/house-prices-advanced-regression-techniques>


## AutoML with H2O

H2O is usually executed on a server. Here, we emulate the server on the local machine. The server is started with the `h2o.init()` command. The server is stopped with the `h2o.shutdown()` command.

In [None]:
h2o.init()

### Load and prepare data

Next, we have to "upload" the data to the server. The data is loaded with the `h2o.import_file()` command.

In [None]:
df = h2o.import_file('https://raw.githubusercontent.com/olivermueller/vhbprodok_datascience/main/ames_housing/data/train.csv')

In [None]:
df.describe()

We need to do some minimal data preparation, that is, identify the response variable and split the data into train and test sets.

In [None]:
y = "SalePrice"

In [None]:
splits = df.split_frame(ratios = [0.8], seed = 42)
train = splits[0]
test = splits[1]
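
Note that `split_frame()` does not produce an exact 80/20 split: H2O assigns rows probabilistically, so the resulting frame sizes are only approximately in the requested ratio. The following plain-Python sketch illustrates the idea; it is not H2O's actual implementation, and the function name and the row count of 1000 are assumptions made for illustration.

```python
import random

def probabilistic_split(n_rows, ratio, seed):
    """Assign each row index to the train set with probability `ratio`,
    otherwise to the test set (illustrative sketch only)."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for i in range(n_rows):
        if rng.random() < ratio:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx

# Hypothetical frame with 1000 rows; the sizes come out close to,
# but usually not exactly, 800 and 200.
train_idx, test_idx = probabilistic_split(n_rows=1000, ratio=0.8, seed=42)
print(len(train_idx), len(test_idx))
```

In the notebook itself, comparing `train.nrows` and `test.nrows` shows the sizes actually obtained.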

### Train AutoML model

Now we can start the AutoML process. It's really nothing more than specifying the maximum runtime and handing over the data. The AutoML process will then try a variety of learning algorithms and hyperparameter combinations to find the best model.

In [None]:
aml = H2OAutoML(max_runtime_secs = 60, seed = 42, project_name = "ames")
aml.train(y = y, training_frame = train, leaderboard_frame = test)

### Display leaderboard

The metrics printed above stem from H2O's internal cross-validation on the training data. The leaderboard below shows the performance of the models on the held-out test split, because that split was passed as `leaderboard_frame`.

In [None]:
aml.leaderboard.head()
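
For a regression target such as `SalePrice`, the leaderboard reports error metrics such as RMSE and RMSLE. As a minimal sketch of what these two numbers measure, the functions below compute them by hand in plain Python. The function names and the three price pairs are made up for illustration and are not taken from the dataset.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error on the original price scale."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def rmsle(actual, predicted):
    """Root mean squared error on the log(1 + price) scale.
    Penalizes relative rather than absolute deviations."""
    squared = [(math.log1p(a) - math.log1p(p)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up example values (not from the Ames data)
actual = [200000, 150000, 320000]
predicted = [210000, 140000, 300000]

print(rmse(actual, predicted))
print(rmsle(actual, predicted))
```

Because RMSLE works on the log scale, an error of 10,000 on a cheap house weighs more than the same error on an expensive one, whereas RMSE treats both alike.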

### Predict on Kaggle test set

The only thing we have to do now is to make predictions on the Kaggle test set and upload them.

In [None]:
test_kaggle = h2o.import_file('data/test.csv')

In [None]:
pred = aml.predict(test_kaggle)

Put everything together in one dataframe.

In [None]:
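# Convert the H2O frames to pandas DataFrames
pred = h2o.as_list(pred)
house_ids = h2o.as_list(test_kaggle["house_id"])  # named house_ids to avoid shadowing the built-in id()
my_submission = pd.concat([house_ids, pred], axis = 1)
my_submission.columns = ['HouseId', 'SalePrice']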

In [None]:
my_submission.head()

Save the dataframe to a CSV file and manually upload it to Kaggle.

In [None]:
my_submission.to_csv('submission.csv', index=False)
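
Before uploading, it can be worth checking the file for obvious problems such as missing or non-positive predictions. The following is a minimal sketch using only the standard library. The helper `check_submission` and the sample text are hypothetical; only the column names `HouseId` and `SalePrice` come from the notebook.

```python
import csv
import io
import math

def check_submission(csv_text, expected_columns):
    """Return a list of problems found in a submission CSV.
    An empty list means no problem was detected by these checks."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != list(expected_columns):
        return [f"unexpected columns: {reader.fieldnames}"]
    id_col, price_col = expected_columns
    problems = []
    seen_ids = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if row[id_col] in seen_ids:
            problems.append(f"line {line_no}: duplicate id {row[id_col]}")
        seen_ids.add(row[id_col])
        try:
            price = float(row[price_col])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: missing or non-numeric price")
            continue
        if math.isnan(price) or price <= 0:
            problems.append(f"line {line_no}: price must be positive")
    return problems

# Hypothetical sample with one missing and one negative prediction
sample = "HouseId,SalePrice\n1,181000.5\n2,\n3,-5\n"
for problem in check_submission(sample, ["HouseId", "SalePrice"]):
    print(problem)
```

For the real file, the text can be read with `open('submission.csv').read()` and passed to the same function.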

Stop the H2O server.

In [None]:
h2o.shutdown()