# Michelle's Initial Testing of h2o_wave.ml Functionality

## Setup
```
git clone https://github.com/h2oai/wave.git
cd wave
git fetch
git checkout feat/common-api

cd py
make setup

./venv/bin/pip install h2o
./venv/bin/pip install datatable

./venv/bin/pip install ipykernel
./venv/bin/python -m ipykernel install

./venv/bin/jupyter notebook

```
I am using the `h2o_wave` package built from the branch above, but my `waved` server is version 0.9.1. I don't know whether that pairing is correct.

In [1]:
from h2o_wave import site, data, ui
from h2o_wave.ml import build_model
import datatable as dt
import pandas as pd

I already had an H2O cluster running at version 3.32.0.1, which is not on PyPI yet, so the version mismatch caused problems. Shutting down my cluster fixed it.

In [None]:
# import h2o
# h2o.init(strict_version_check = False)
# h2o.shutdown()
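
The mismatch described above can be caught before calling `build_model` by comparing the two version strings. This is only a minimal sketch: `same_h2o_version` is a hypothetical helper, not part of `h2o` or `h2o_wave`, and the example values are the two versions that appear in this notebook.

```python
def same_h2o_version(client_version: str, cluster_version: str) -> bool:
    """Return True when both dotted version strings have identical numeric parts."""
    def parse(version: str):
        return tuple(int(part) for part in version.strip().split("."))

    return parse(client_version) == parse(cluster_version)


# Example values taken from this notebook: the cluster the notebook later
# connects to reports 3.30.1.3, while the old cluster was 3.32.0.1.
# With h2o installed, the strings would likely come from h2o.__version__ and
# h2o.cluster().version (an assumption; check against your h2o release).
print(same_h2o_version("3.30.1.3", "3.32.0.1"))
```

If the versions differ, the options are the ones already tried here: shut the old cluster down, or connect with `strict_version_check=False` and accept the risk.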

## Build Model

In [2]:
target = 'Fuel_Price'
train_path = '/Users/mtanco/Downloads/walmart_train.csv'
test_path = '/Users/mtanco/Downloads/walmart_test.csv'

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

train_dt = dt.fread(train_path)
test_dt = dt.fread(test_path)

### Thoughts
* Following the original notebook, I read the data into a datatable frame, but `build_model` only needed the path to the file
* The ability to pass either a path or a frame would be nice
* There doesn't seem to be much to do with the model object; I would like to see the confusion matrix or standard regression validation metrics
* I would LOVE standard UI templates for visualizing model performance, for example a confusion matrix with a slider for changing the threshold
* `model.type` only says it's H2O-3; I want to know which algorithm it is (GBM, RF, etc.)
* `model.predict(test_dt.to_tuples())` failed for me
* `model.predict(test_path)` doesn't work, which is inconsistent with `build_model` accepting a path
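
To make the confusion-matrix-with-slider idea concrete, here is a rough sketch of the computation such a template would rerun each time the threshold moves. It is plain Python, not a Wave component: `confusion_at_threshold` is a hypothetical helper and the labels and scores are invented for illustration.

```python
from typing import Dict, Sequence


def confusion_at_threshold(
    y_true: Sequence[int], scores: Sequence[float], threshold: float
) -> Dict[str, int]:
    """Count TP/FP/TN/FN when scores >= threshold are treated as class 1."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1:
            counts["tp" if actual == 1 else "fp"] += 1
        else:
            counts["fn" if actual == 1 else "tn"] += 1
    return counts


# Invented binary labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.6, 0.4, 0.8, 0.2]

# Each slider position would trigger one call like these.
for threshold in (0.3, 0.5, 0.7):
    print(threshold, confusion_at_threshold(y_true, scores, threshold))
```

Raising the threshold trades false positives for false negatives, which is exactly what the slider would let a user explore.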

In [3]:
model = build_model(train_path, target=target) 

Checking whether there is an H2O instance running at http://localhost:54321 . connected.


| Property | Value |
|---|---|
| H2O_cluster_uptime | 2 days 18 hours 56 mins |
| H2O_cluster_timezone | America/Los_Angeles |
| H2O_data_parsing_timezone | UTC |
| H2O_cluster_version | 3.30.1.3 |
| H2O_cluster_version_age | 1 month and 14 days |
| H2O_cluster_name | H2O_from_python_mtanco_6p668a |
| H2O_cluster_total_nodes | 1 |
| H2O_cluster_free_memory | 3.693 Gb |
| H2O_cluster_total_cores | 12 |
| H2O_cluster_allowed_cores | 12 |


Parse progress: |█████████████████████████████████████████████████████████| 100%
AutoML progress: |████████████████████████████████████████████████████████| 100%


In [8]:
model.type

<WaveModelType.H2O3: 1>

In [5]:
predict = model.predict(test_df)

Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: |████████████████████████████████████| 100%
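
Since the model object exposes no validation metrics, the standard regression ones can be computed by hand from the predictions. The sketch below assumes the true target values are available for the scored rows, which may not hold for this test file. `regression_metrics` is a hypothetical helper and the numbers are invented.

```python
import math
from typing import Dict, Sequence


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Dict[str, float]:
    """Return mean absolute error and root mean squared error."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal length")
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}


# Invented fuel prices; in this notebook the predictions would be
# [row[0] for row in predict], since each row holds the value first.
actual = [2.5, 2.6, 2.7]
predicted = [2.4, 2.6, 2.9]
print(regression_metrics(actual, predicted))
```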


In [19]:
# Datatype issue: this call fails because 'Unemployment' is treated as categorical
predict = model.predict(test_dt.to_tuples())

Parse progress: |█████████████████████████████████████████████████████████| 100%
stackedensemble prediction progress: | (failed)


OSError: Job with key $03017f00000132d4ffffffff$_93bf92eec9079a23639ed057aaa44ec3 failed with an exception: java.lang.IllegalArgumentException: Test/Validation dataset has categorical column 'Unemployment' which is real-valued in the training data
stacktrace: 
java.lang.IllegalArgumentException: Test/Validation dataset has categorical column 'Unemployment' which is real-valued in the training data
	at hex.Model.adaptTestForTrain(Model.java:1453)
	at hex.Model.adaptTestForTrain(Model.java:1300)
	at hex.Model.score(Model.java:1588)
	at water.api.ModelMetricsHandler$1.compute2(ModelMetricsHandler.java:396)
	at water.H2O$H2OCountedCompleter.compute(H2O.java:1563)
	at jsr166y.CountedCompleter.exec(CountedCompleter.java:468)
	at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
	at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
	at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
	at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
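
The error says `Unemployment` arrived as categorical even though it was real-valued in training. One possible cause, which I have not confirmed, is that some values in the tuples cannot be parsed as numbers. The sketch below scans rows for such values. `non_numeric_values` is a hypothetical helper, and the sample rows are invented.

```python
from typing import Any, Dict, List, Sequence


def non_numeric_values(
    rows: Sequence[Sequence[Any]],
    column_names: Sequence[str],
    numeric_columns: Sequence[str],
) -> Dict[str, List[Any]]:
    """Map each expected-numeric column to the values float() cannot parse."""
    index = {name: i for i, name in enumerate(column_names)}
    offenders: Dict[str, List[Any]] = {}
    for row in rows:
        for name in numeric_columns:
            value = row[index[name]]
            try:
                float(value)
            except (TypeError, ValueError):
                offenders.setdefault(name, []).append(value)
    return offenders


# Invented rows: one clean value and one placeholder string.
columns = ["Date", "Unemployment"]
rows = [("2012-11-02", "6.573"), ("2012-11-09", "NA")]
print(non_numeric_values(rows, columns, ["Unemployment"]))
```

In this notebook the call would look like `non_numeric_values(test_dt.to_tuples(), test_dt.names, ['Unemployment'])`. An empty result would rule this explanation out and point toward how the tuples are parsed instead.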


## Build Site

In [18]:
page = site['/predictions']

n_train = 10
n_test = 10
n_total = n_train + n_test
v = page.add('example', ui.plot_card(
    box='1 1 4 5',
    title='Line',
    data=data('date price', n_total),
    plot=ui.plot([ui.mark(type='line', x_scale='time', x='=date', y='=price', y_min=0)])
))

# We are taking the last `n_train` values from the train set.
values_train = [(train_dt[-i, 'Date'], train_dt[-i, 'Fuel_Price']) for i in reversed(range(1, n_train + 1))]
values_test = [(test_dt[i, 'Date'], predict[i][0]) for i in range(n_test)]

v.data = values_train + values_test

page.save()
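
The indexing in the cell above is easier to follow with plain lists: take the last `n_train` observed points, then append the first `n_test` predicted points. This is a sketch of the same stitching logic without datatable or Wave. `stitch_series` is a hypothetical helper and the sample values are invented.

```python
from typing import Any, List, Sequence, Tuple


def stitch_series(
    train_rows: Sequence[Tuple[Any, float]],
    test_dates: Sequence[Any],
    predictions: Sequence[Sequence[float]],
    n_train: int,
    n_test: int,
) -> List[Tuple[Any, float]]:
    """Join the last n_train (date, value) pairs with the first n_test predictions."""
    # A slice of [-0:] would return everything, so handle n_train == 0 explicitly.
    tail = list(train_rows[-n_train:]) if n_train > 0 else []
    # Each prediction row carries the predicted value as its first item.
    head = [(test_dates[i], predictions[i][0]) for i in range(n_test)]
    return tail + head


# Invented data: three observed points and two predictions.
train_rows = [("d1", 1.0), ("d2", 2.0), ("d3", 3.0)]
test_dates = ["d4", "d5"]
predictions = [(4.0,), (5.0,)]
print(stitch_series(train_rows, test_dates, predictions, n_train=2, n_test=1))
```

The slice `train_rows[-n_train:]` yields the same rows, in the same order, as iterating `-i` over `reversed(range(1, n_train + 1))`.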

In [8]:
print('http://localhost:55555/predictions')

http://localhost:55555/predictions
