# How to use the h2o Python API on a Hadoop cluster

## 1. Running h2o on Hadoop

First, download the latest release of h2o-dev, unzip it, and change into the extracted directory. Then launch the h2o nodes so that they form a cluster on the Hadoop cluster. The launch command accepts the following options:

    -nodes : number of h2o nodes
    -mapperXmx : amount of RAM per node
    -output : HDFS output directory
    -nthreads : number of CPUs to use (-1 for all)
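
The options above can be assembled into a launch command. The sketch below only builds the argument list and does not run anything. The jar name `h2odriver.jar`, the node count, the memory size, and the output directory are illustrative assumptions, not values taken from this notebook; check the driver jar shipped with your h2o release and Hadoop distribution.

```python
def build_h2o_launch_command(jar, nodes, mapper_xmx, output_dir, nthreads=-1):
    """Build the argument list for launching h2o through `hadoop jar`.

    All values passed in are placeholders chosen by the caller.
    """
    return [
        "hadoop", "jar", jar,
        "-nodes", str(nodes),        # number of h2o nodes
        "-mapperXmx", mapper_xmx,    # amount of RAM per node, e.g. "10g"
        "-output", output_dir,       # HDFS output directory
        "-nthreads", str(nthreads),  # number of CPUs (-1 for all)
    ]


# Hypothetical values, for illustration only
command = build_h2o_launch_command("h2odriver.jar", 9, "10g", "h2o_output")
print(" ".join(command))
```

Passing the list to `subprocess.call(command)` on a machine with the Hadoop client installed would be one way to start the cluster from Python. The HDFS output directory generally must not already exist.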

## 2. Import the h2o module and connect to the Hadoop cluster

In [22]:
import h2o
import time

To connect to the h2o cluster, initialize the connection with the IP address of any one of the launched instances:

In [23]:
serverH2O = h2o.init(ip="10.235.249.34", port=54321)


--------------------------  ----------------------------------------------
H2O cluster uptime:         2 hours 59 minutes 56 seconds 224 milliseconds
H2O cluster version:        0.3.0.1096
H2O cluster name:           H2O_88463
H2O cluster total nodes:    9
H2O cluster total memory:   89.47 GB
H2O cluster total cores:    36
H2O cluster allowed cores:  36
H2O cluster healthy:        True
--------------------------  ----------------------------------------------



## 3. Import data from HDFS

Here, our data are CSV files stored on HDFS:

In [24]:
testDataPath = ["hdfs://nameservice1/user/bcourbe/data/mnist_test.csv"]
trainDataPath = ["hdfs://nameservice1/user/bcourbe/data/mnist_train.csv"]

Import the data from HDFS into the h2o cluster:

In [25]:
testMnist = h2o.import_frame(path=testDataPath)


Parse Progress: [##################################################] 100%
Veckeys [{u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ff01440000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_test.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ff02440000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_test.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ff03440000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_test.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ff04440000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_test.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>', u'n

In [26]:
trainMnist = h2o.import_frame(path=trainDataPath)


Parse Progress: [##################################################] 100%
Veckeys [{u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ffb3330000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_train.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ffb4330000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_train.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ffb5330000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_train.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>', u'name': u'$04ffb6330000ffffffff$hdfs://nameservice1/user/bcourbe/data/mnist_train.csv', u'__meta': {u'schema_name': u'VecKeyV1', u'schema_version': 1, u'schema_type': u'Key<Vec>'}}, {u'URL': None, u'type': u'Key<Vec>',

Specify which column holds the labels (here, the column at index 784):

In [27]:
trainMnist[784]._name = "label"
testMnist[784]._name = "label"
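
The index 784 follows from the MNIST layout: each image is 28 x 28 pixels, so with zero-based indexing the pixel columns occupy indices 0 to 783 and the label comes next. The short sketch below spells out that arithmetic. It assumes the CSV files store the pixels first and the label last, which is what the cell above implies.

```python
# MNIST images are 28 x 28 grayscale pixels
IMAGE_SIDE = 28
N_PIXELS = IMAGE_SIDE * IMAGE_SIDE

# Assumption: pixel columns come first, the label column comes last
pixel_indices = list(range(N_PIXELS))  # zero-based indices of the pixel columns
LABEL_INDEX = N_PIXELS                 # index of the column right after the pixels

print("pixel columns: %d to %d" % (pixel_indices[0], pixel_indices[-1]))
print("label column: %d" % LABEL_INDEX)
```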

Take a look at the labels:

In [33]:
trainMnist["label"].summary()

{u'__meta': {u'schema_name': u'ColV2',
  u'schema_type': u'Vec',
  u'schema_version': 2},
 u'data': [2.0,
  3.0,
  0.0,
  0.0,
  2.0,
  7.0,
  5.0,
  2.0,
  6.0,
  8.0,
  7.0,
  4.0,
  1.0,
  7.0,
  8.0,
  8.0,
  4.0,
  7.0,
  7.0,
  4.0,
  7.0,
  7.0,
  3.0,
  0.0,
  4.0,
  6.0,
  4.0,
  7.0,
  1.0,
  0.0,
  9.0,
  1.0,
  0.0,
  2.0,
  0.0,
  5.0,
  1.0,
  9.0,
  1.0,
  2.0,
  3.0,
  3.0,
  7.0,
  2.0,
  7.0,
  1.0,
  9.0,
  1.0,
  4.0,
  2.0,
  8.0,
  6.0,
  6.0,
  9.0,
  2.0,
  6.0,
  8.0,
  4.0,
  8.0,
  7.0,
  9.0,
  3.0,
  3.0,
  4.0,
  6.0,
  2.0,
  4.0,
  4.0,
  8.0,
  1.0,
  5.0,
  3.0,
  3.0,
  3.0,
  3.0,
  3.0,
  2.0,
  9.0,
  4.0,
  7.0,
  1.0,
  4.0,
  5.0,
  3.0,
  7.0,
  3.0,
  2.0,
  5.0,
  1.0,
  5.0,
  2.0,
  9.0,
  8.0,
  3.0,
  3.0,
  0.0,
  1.0,
  3.0,
  9.0,
  8.0],
 u'domain': None,
 u'histogram_base': 0.0,
 u'histogram_bins': None,
 u'histogram_stride': 0.0,
 u'label': u'C1',
 u'maxs': [9.0, 9.0, 9.0, 9.0, 9.0],
 u'mean': 4.453933333333333,
 u'mins': [0.0, 0.0,

## 4. Train models

First, let's start by training a deep learning model:

In [28]:
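# Record the wall-clock time at which training starts
tic = time.strftime("%H:%M:%S", time.localtime())
print("Train start : "+tic)

# Train a classifier with two hidden layers of 200 units each.
# The label column is converted to a factor so it is treated as categorical,
# and the test set is used as the validation set.
mnistModel = h2o.deeplearning(x=trainMnist.drop("label"),
                              y=trainMnist["label"].asfactor(),
                              validation_x=testMnist.drop("label"),
                              validation_y=testMnist["label"].asfactor(),
                              hidden=[200,200],
                              activation="RectifierWithDropout",
                              l1=1e-5,
                              do_classification=True
                             )

# Record the wall-clock time at which training finishes
toc = time.strftime("%H:%M:%S", time.localtime())
print("Train finish : "+toc)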

Train start : 14:11:17

deeplearning Model Build Progress: [##################################################] 100%
Train finish : 14:16:21
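
The two timestamps printed above are plain strings, so the training duration has to be worked out separately. As a minimal sketch, the helper below (a hypothetical function, not part of h2o) parses both strings and returns the difference in seconds. It assumes both times fall on the same day.

```python
from datetime import datetime


def elapsed_seconds(start, finish, fmt="%H:%M:%S"):
    """Return the number of seconds between two same-day clock times."""
    delta = datetime.strptime(finish, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


# Timestamps copied from the output above
elapsed = elapsed_seconds("14:11:17", "14:16:21")
print("Training took %d seconds" % elapsed)
```

For the run above this gives 304 seconds, a little over five minutes. Calling `time.time()` before and after the training call and subtracting the two values avoids the parsing step and also works across midnight.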


Now, take a look at your model:

In [29]:
mnistModel.show()


Model Details:


Scoring History:

    Timestamp            Training Duration    Training Speed     Training Epochs    Training Samples    Training MSE    Training R^2    Training Classification Error    Validation MSE    Validation R^2    Validation Classification Error
--  -------------------  -------------------  -----------------  -----------------  ------------------  --------------  --------------  -------------------------------  ----------------  ----------------  ---------------------------------
    2015-03-12 14:14:04  0.000 sec                                                                      inf             nan             1                                inf               nan               1
    2015-03-12 14:14:42  38.313 sec           1569.546 rows/sec  1.00223338604      60134.0             0.0889725       0.989339        0.0970579                        0.0875272         0.989563          0.0984
    2015-03-12 14:15:48  1 min 44.012 sec     6974.493 rows/sec  12.0

In [30]:
mnistModelPerf = mnistModel.model_performance(testMnist)

In [31]:
predictMnist = mnistModel.predict(testMnist)

Since no metrics are available yet for multinomial models, we can check how well the model fits by placing the predictions next to the true labels:

In [56]:
predAc = h2o.H2OFrame(vecs=[predictMnist[0],testMnist[784].asfactor()])
predAc.show()

Displaying 10 row(s):
  Row ID    predict    label
--------  ---------  -------
       1          8        8
       2          3        3
       3          8        8
       4          0        0
       5          1        1
       6          5        5
       7          0        0
       8          1        1
       9          5        5
      10          2        2
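
A side-by-side table like this can be summarized as a single number. The sketch below computes the fraction of matching rows in plain Python, using the ten rows displayed above as input. The `accuracy` helper is a hypothetical illustration and not an h2o function. Ten rows are far too few to judge the model, so the same computation would need to run over the full prediction and label columns.

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where predicted equals actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / float(len(predicted))


# The ten (predict, label) pairs shown in the table above
predicted = [8, 3, 8, 0, 1, 5, 0, 1, 5, 2]
actual = [8, 3, 8, 0, 1, 5, 0, 1, 5, 2]

sample_accuracy = accuracy(predicted, actual)
print("Accuracy on the displayed rows: %.2f" % sample_accuracy)
```

All ten displayed rows match, so the sample accuracy is 1.0. The validation classification error reported in the scoring history is the more representative figure.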

