# Getting Started with GOAI

## Loading data into a GPU DataFrame (GDF)

### Loading data into a Pandas DataFrame

It's easy to load almost any sort of data (JSON, CSV, etc.) into a Pandas DataFrame. For example, to import a CSV file from disk:

In [1]:
import pandas

# read data from csv file into pandas dataframe
df = pandas.read_csv('data/ipums/ipums_easy.csv')

[Read more on using a Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html)
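
As a minimal sketch of the same call without a file on disk, the snippet below feeds `pandas.read_csv` an in-memory CSV. The column names and the three rows reuse values displayed later in this walkthrough; the text itself is illustrative and is not read from `data/ipums/ipums_easy.csv`.

```python
import io

import pandas

# Illustrative CSV text; the rows reuse values shown later in this
# walkthrough, they are not read from data/ipums/ipums_easy.csv.
csv_text = """INCEARN,PERWT,ADJUST
4000,618,1.018516
36700,684,1.018516
54000,618,1.018516
"""

# read_csv accepts a path or any file-like object
sample_df = pandas.read_csv(io.StringIO(csv_text))

# pandas infers int64 for whole numbers and float64 for decimals
print(sample_df.dtypes)
print(sample_df.shape)
```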

### Converting a Pandas DataFrame to a GDF

In [2]:
import pygdf

# convert the pandas dataframe into a gpu dataframe
gdf = pygdf.DataFrame.from_pandas(df)

## Working with the GDF
See the [pygdf documentation](http://pygdf.readthedocs.io/en/latest/index.html) for more.

### Take a look at the columns and their data types

In [3]:
# print the columns and their datatypes in this gdf
gdf.dtypes

RECTYPE          int64
YEAR             int64
DATANUM          int64
SERIAL           int64
NUMPREC          int64
SUBSAMP          int64
HHWT             int64
HHTYPE           int64
REPWT          float64
CLUSTER          int64
ADJUST         float64
CPI99          float64
REGION           int64
STATEICP         int64
STATEFIP         int64
COUNTY         float64
COUNTYFIPS     float64
METRO          float64
METAREA        float64
METAREAD       float64
MET2013        float64
MET2013ERR     float64
CITY           float64
CITYERR        float64
CITYPOP        float64
PUMA           float64
PUMARES2MIG    float64
STRATA           int64
PUMASUPR       float64
CONSPUMA       float64
                ...   
REPWTP51       float64
REPWTP52       float64
REPWTP53       float64
REPWTP54       float64
REPWTP55       float64
REPWTP56       float64
REPWTP57       float64
REPWTP58       float64
REPWTP59       float64
REPWTP60       float64
REPWTP61       float64
REPWTP62       float64
REPWTP63   

### Slice the GDF

Whoa! This GDF has a lot of columns. Let's make it more manageable...

In [4]:
# only select certain columns (and overwrite the gdf)
gdf = gdf.loc[0:, [
    'INCEARN', 'PERWT', 'ADJUST', 'STATEICP', 'ROOMS', 'BEDROOMS',
    'PHONE', 'VEHICLES', 'RACE', 'SEX', 'AGE', 'VETSTAT'
]]

# show the first 5 records of each column
gdf.head(5)

  INCEARN PERWT   ADJUST STATEICP ROOMS BEDROOMS PHONE VEHICLES RACE  SEX  AGE VETSTAT
0    4000   618 1.018516       21     7        4     2        3    1    2   66       1
1   36700   684 1.018516       21     7        4     2        3    1    1   40       1
2   54000   618 1.018516       49     5        4     2        3    1    1   51       2
3     900   609 1.018516       49     5        4     2        3    1    2   48       1
4    2000   621 1.018516       49     5        4     2        3    1    1   19       1

### Modify data types

In [5]:
gdf.dtypes

INCEARN       int64
PERWT         int64
ADJUST      float64
STATEICP      int64
ROOMS         int64
BEDROOMS      int64
PHONE         int64
VEHICLES      int64
RACE          int64
SEX           int64
AGE           int64
VETSTAT       int64
dtype: object

Looks like `INCEARN` and `PERWT` are integers when they should be floats. Let's fix that...

In [6]:
import numpy as np

# force float64 instead of int64
gdf['INCEARN'] = gdf['INCEARN'].astype(np.float64)
gdf['PERWT'] = gdf['PERWT'].astype(np.float64)

# take another look
gdf.dtypes

INCEARN     float64
PERWT       float64
ADJUST      float64
STATEICP      int64
ROOMS         int64
BEDROOMS      int64
PHONE         int64
VEHICLES      int64
RACE          int64
SEX           int64
AGE           int64
VETSTAT       int64
dtype: object
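
A CPU-side sketch of the same cast with plain NumPy arrays (an analogue for illustration, not pygdf's implementation). The values are the first five `PERWT` entries shown earlier.

```python
import numpy as np

# values taken from the PERWT column shown earlier
perwt = np.array([618, 684, 618, 609, 621], dtype=np.int64)

# astype returns a converted copy; the original array keeps its dtype
perwt_float = perwt.astype(np.float64)

print(perwt.dtype, perwt_float.dtype)
```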

### Manipulate data with a user-defined function (UDF)

`INCEARN` is not a true representation of income earned. Let's adjust it by multiplying it by `ADJUST`, which is treated as a constant here.

In [7]:
# define a function to adjust the incearn var
# so it more accurately represents income earned
adjust = gdf['ADJUST'][0]
def adjust_incearn(incearn):
    return adjust * incearn

# apply it to the 'INCEARN' column
gdf['INCEARN'] = gdf['INCEARN'].applymap(adjust_incearn)

# drop the ADJUST column
gdf.drop_column('ADJUST')

# compute the mean
gdf['INCEARN'].mean()

18637.0999154208
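
The UDF reads `ADJUST` from the first row only, which is safe only if the column holds a single value. A minimal CPU sketch, using plain Python lists in place of GPU columns and the first five `INCEARN` values shown earlier, checks that assumption before scaling. It illustrates the element-wise idea and is not how pygdf executes `applymap`.

```python
# first five INCEARN values and their ADJUST factor, as displayed earlier
incearn = [4000, 36700, 54000, 900, 2000]
adjust_col = [1.018516] * len(incearn)

# the scalar shortcut is only valid when every row shares one factor
assert len(set(adjust_col)) == 1, "ADJUST is not constant"
adjust = adjust_col[0]


def adjust_incearn(value):
    """Scale one income value by the adjustment factor."""
    return adjust * value


# apply the function element by element
adjusted = [adjust_incearn(v) for v in incearn]
print(adjusted)
```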

### Sort the data

In [8]:
# sort the gdf by the INCEARN column
gdf = gdf.sort_values(by='INCEARN', ascending=True)
# reset the index so we can use loc slicing later
gdf = gdf.reset_index()
gdf.head(5)

        INCEARN PERWT STATEICP ROOMS BEDROOMS PHONE VEHICLES RACE  SEX  AGE VETSTAT
0 -10184.141484 538.0       53     4        3     2        2    1    1   35       1
1 -10184.141484 614.0       71     7        4     2        2    5    2   57       1
2 -10184.141484 511.0       45     9        5     2        2    1    2   48       1
3 -10184.141484 453.0       45     9        5     2        2    1    1   57       1
4 -10184.141484 593.0       53     9        5     2        5    1    2   55       1

Looks like we have some negative income values. Let's filter those out...

### Filter the data

In [9]:
# how many records do we have?
print("{} = Original # of records".format(len(gdf)))

# filter out
gdf = gdf.query('INCEARN >= 0')

# how many records do we have left?
print("{} = New # of records".format(len(gdf)))

# sanity check...
gdf.head(5)

10000 = Original # of records
9985 = New # of records


  INCEARN PERWT STATEICP ROOMS BEDROOMS PHONE VEHICLES RACE  SEX  AGE VETSTAT
15     0.0 559.0       49     5        4     2        3    1    2   17       1
16     0.0 589.0       43     8        4     2        3    1    1   21       1
17     0.0 617.0       43     5        3     2        1    1    2   66       1
18     0.0 574.0       43     6        4     2        1    1    1   80       2
19     0.0 616.0       43     6        4     2        1    1    2   72       1
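
Note that the filtered result keeps its original row labels: the output starts at 15, not 0. A small pandas sketch of the same behaviour, as a CPU analogue with made-up values:

```python
import pandas

# toy frame: two negative incomes followed by three valid ones (made-up values)
toy = pandas.DataFrame({"INCEARN": [-9999.0, -9999.0, 0.0, 0.0, 1500.0]})

# same query expression as in the cell, evaluated by pandas on the CPU
kept = toy.query("INCEARN >= 0")

# the surviving rows keep their original labels
print(list(kept.index))

# resetting the index renumbers them from zero
print(list(kept.reset_index(drop=True).index))
```

Label-based slicing with `.loc` after such a filter counts from the surviving labels, so resetting the index first can avoid surprises.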

### One hot encode categorical columns

In [10]:
# define the categorical columns
cat_cols = set(['STATEICP', 'RACE', 'SEX', 'VETSTAT'])
# store the unique values for each category column
uniques = {}

# iterate through each categorical column and one-hot
# encode it using the unique values it has
for k in cat_cols:
    uniques[k] = gdf[k].unique_k(k=1000)
    cats = uniques[k][1:]  # drop first
    gdf = gdf.one_hot_encoding(k, prefix=k, cats=cats)
    del gdf[k]
    
# we should see many more columns since the categorical
# columns will get expanded due to one-hot encoding
gdf.dtypes

INCEARN        float64
PERWT          float64
ROOMS            int64
BEDROOMS         int64
PHONE            int64
VEHICLES         int64
AGE              int64
VETSTAT_1      float64
VETSTAT_2      float64
STATEICP_2     float64
STATEICP_3     float64
STATEICP_4     float64
STATEICP_5     float64
STATEICP_6     float64
STATEICP_11    float64
STATEICP_12    float64
STATEICP_13    float64
STATEICP_14    float64
STATEICP_21    float64
STATEICP_22    float64
STATEICP_23    float64
STATEICP_24    float64
STATEICP_25    float64
STATEICP_31    float64
STATEICP_32    float64
STATEICP_33    float64
STATEICP_34    float64
STATEICP_35    float64
STATEICP_36    float64
STATEICP_37    float64
                ...   
STATEICP_48    float64
STATEICP_49    float64
STATEICP_51    float64
STATEICP_52    float64
STATEICP_53    float64
STATEICP_54    float64
STATEICP_56    float64
STATEICP_61    float64
STATEICP_62    float64
STATEICP_63    float64
STATEICP_64    float64
STATEICP_65    float64
STATEICP_66
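
Dropping the first category (`uniques[k][1:]`) presumably avoids a redundant column: a row that is zero in every remaining indicator must belong to the dropped category. The following is a plain-Python sketch of that encoding, not pygdf's implementation. The helper name `one_hot_drop_first` is hypothetical, and sorting the categories is an assumption made here for determinism.

```python
def one_hot_drop_first(values, prefix):
    """Return {column_name: [0.0/1.0, ...]} for every category but the first."""
    categories = sorted(set(values))  # assumption: categories in sorted order
    kept = categories[1:]  # drop the first category, as in the loop
    return {
        "{}_{}".format(prefix, cat): [1.0 if v == cat else 0.0 for v in values]
        for cat in kept
    }


# SEX values from the first five rows shown earlier
sex = [2, 1, 1, 2, 1]
encoded = one_hot_drop_first(sex, "SEX")
print(encoded)
```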

### Split the data into training and validation sets

In [11]:
# enforce float64 data type on ALL columns
for k in gdf.columns:
    gdf[k] = gdf[k].astype(np.float64)

# set the fractions for training and validation
fractions = {
    "train": 0.8,
    "valid": 0.2
}

# validation splitpoint
splitpoint = int(len(gdf) * fractions["train"])
print('splitpoint: {} of {} is {}'.format(fractions["train"], len(gdf), splitpoint))

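# break the gdf up into training and validation sets
# (note: .loc slices by label and includes both endpoints, so the
# row labelled `splitpoint` ends up in both sets)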
gdfs = {
    "train": gdf.loc[:splitpoint],
    "valid": gdf.loc[splitpoint:]
}
print('gdfs["train"] has {} rows'.format(len(gdfs["train"])))
print('gdfs["valid"] has {} rows'.format(len(gdfs["valid"])))

splitpoint: 0.8 of 9985 is 7988
gdfs["train"] has 7974 rows
gdfs["valid"] has 2012 rows
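
The two row counts add up to 9986, one more than the 9985 rows in the GDF. That is consistent with `.loc` slicing by label with both endpoints included, over labels that start at 15 after the filter. The sketch below reproduces the arithmetic with a plain list of labels and contrasts it with a positional split that has no overlap. It illustrates the indexing behaviour only and is not pygdf code.

```python
# labels that survive the filter: 15 .. 9999 (9985 rows)
labels = list(range(15, 10000))
splitpoint = 7988

# label-based slicing, inclusive on both ends
train_labels = [label for label in labels if label <= splitpoint]
valid_labels = [label for label in labels if label >= splitpoint]
print(len(train_labels), len(valid_labels))

# positional, half-open split: every row lands in exactly one set
train_pos = labels[:splitpoint]
valid_pos = labels[splitpoint:]
print(len(train_pos), len(valid_pos))
```

With the positional split the training set holds exactly 80% of the rows, and no row is shared between the two sets.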


### Turn the GDFs into matrices

In [12]:
# produce gpu matrices (to input to ml libraries, etc)
# this step should not be necessary in the near future
# (should be able to use gdf as input)
matrices = {
    "train": {
        "x": gdfs["train"].as_gpu_matrix(columns=gdf.columns[1:]),
        "y": gdfs["train"].as_gpu_matrix(columns=[gdf.columns[0]])
    },
    "valid": {
        "x": gdfs["valid"].as_gpu_matrix(columns=gdf.columns[1:]),
        "y": gdfs["valid"].as_gpu_matrix(columns=[gdf.columns[0]])
    }
}

# check the matrix shapes (sanity check)
print('matrices["train"]["x"] shape:', matrices["train"]["x"].shape)
print('matrices["train"]["y"] shape:', matrices["train"]["y"].shape)
print('matrices["valid"]["x"] shape:', matrices["valid"]["x"].shape)
print('matrices["valid"]["y"] shape:', matrices["valid"]["y"].shape)

matrices["train"]["x"] shape: (7974, 67)
matrices["train"]["y"] shape: (7974, 1)
matrices["valid"]["x"] shape: (2012, 67)
matrices["valid"]["y"] shape: (2012, 1)


### Obtain pointers to the matrices

In [13]:
# get pointers (so we can keep data on gpu)
# this step should not be necessary in the near future
# (should be able to use gdf as input)

from ctypes import c_void_p

def get_pointer(matrix):
    return c_void_p(matrix.device_ctypes_pointer.value)
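
`c_void_p` simply wraps an integer address. As a CPU-only sketch (no GPU memory involved), the snippet below wraps the address of a host-side ctypes buffer and shows that `c_void_p(0)` is a null pointer. The buffer is a stand-in chosen for illustration.

```python
import ctypes

# a small host-side buffer standing in for a device allocation
buf = (ctypes.c_double * 4)(1.0, 2.0, 3.0, 4.0)

# wrap its raw address, much as get_pointer wraps a device address
ptr = ctypes.c_void_p(ctypes.addressof(buf))
print(ptr.value == ctypes.addressof(buf))

# a null pointer reports its value as None
null_ptr = ctypes.c_void_p(0)
print(null_ptr.value)
```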

## Train a model with H2O4GPU
See the [H2O4GPU README](https://github.com/h2oai/h2o4gpu) for more.

### Set up an ElasticNet model

See [elastic_net.py](https://github.com/h2oai/h2o4gpu/blob/master/src/interface_py/h2o4gpu/solvers/elastic_net.py) for more documentation.

In [14]:
# uncomment the following to see documentation
# (especially for available params)

# h2o4gpu.ElasticNetH2O?

In [15]:
import h2o4gpu

# set up the solver
model = h2o4gpu.ElasticNetH2O(
    double_precision = 1,                         # double precision to use (float64)
    order = 'c',                                  # order of data (c = column-major, r = row-major)
    n_gpus = 1,                                   # number of gpus to use
    n_threads = 1,                                # 1 thread per gpu is optimal
    family = "elasticnet",                        # use "logistic" for classification, "elasticnet" for regression
    n_folds = 5,                                  # number of cross-validation folds (default is 1)
    n_alphas = 8                                  # number of alphas to be used in search (default is 5)
)

model

<h2o4gpu.solvers.elastic_net.ElasticNetH2O at 0x7fdfb003e860>

### Fit and predict the ElasticNet model

The following fits the model and generates predictions in a single call. Alternative methods are available to split up fitting and predicting:

`model.fit_ptr` - fitting with pointer inputs

`model.fit` - fitting with numpy array inputs (does not use gpu mem currently)

`model.predict_ptr` - predicting with pointer inputs

`model.predict` - predicting with numpy array inputs (does not use gpu mem currently)

In [16]:
# uncomment the following to see documentation
# (especially for available params)

# model.fit_predict_ptr?

In [17]:
# fit the model
train_shape = matrices["train"]["x"].shape
valid_shape = matrices["valid"]["x"].shape
nada_pointer = c_void_p(0)
predictions = model.fit_predict_ptr(
    double_precision = 1,                         # double precision to use (float64)
    order = 'c',                                  # order of data (c = column-major, r = row-major)
    m_train = train_shape[0],                     # number of rows in training set
    n = train_shape[1],                           # number of columns in training set
    m_valid = valid_shape[0],                     # number of rows in validation set
    a = get_pointer(matrices["train"]["x"]),      # pointer to training features matrix
    b = get_pointer(matrices["train"]["y"]),      # pointer to training response matrix
    c = get_pointer(matrices["valid"]["x"]),      # pointer to validation features matrix
    d = get_pointer(matrices["valid"]["y"]),      # pointer to validation response matrix
    e = nada_pointer                              # pointer to weight column (not using in this case)
)

# show the summary
model.summary()

RMSE per alpha value (-1.00 = missing)

|   Alphas |   Train |       CV |    Valid |
|---------:|--------:|---------:|---------:|
|     0.00 | 5181.31 | 24937.40 | 84390.10 |
|     0.14 | 5181.31 | 24937.40 | 84390.10 |
|     0.29 | 5181.31 | 24937.40 | 84390.10 |
|     0.43 | 5181.31 | 24937.40 | 84390.10 |
|     0.57 | 5181.31 | 24937.40 | 84390.10 |
|     0.71 | 5181.32 | 24937.40 | 84390.11 |
|     0.86 | 5181.32 | 24937.40 | 84390.11 |
|     1.00 | 5181.32 | 24937.41 | 84390.11 |
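
The eight alpha values in the table are consistent with an evenly spaced grid from 0 to 1 (`n_alphas = 8`), and each cell is a root-mean-square error. The sketch below reproduces such a grid and defines RMSE in plain Python. How h2o4gpu builds its grid internally is an assumption here, inferred only from the printed values.

```python
import math

n_alphas = 8

# evenly spaced mixing parameters between 0 and 1 inclusive
alphas = [i / (n_alphas - 1) for i in range(n_alphas)]
print([round(a, 2) for a in alphas])


def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# toy check: errors of 3 and 4 give sqrt((9 + 16) / 2)
print(rmse([0.0, 0.0], [3.0, 4.0]))
```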


## Other things you can do with H2O4GPU

### Clustering with K-means

See [kmeans.py](https://github.com/h2oai/h2o4gpu/blob/master/src/interface_py/h2o4gpu/solvers/kmeans.py) for more.

In [18]:
# note: does not use gpu data currently, but does train on gpu

# get the data as a numpy ndarray
cpu_matrix = gdf.as_matrix()

# set up the k-means model
kmeans = h2o4gpu.KMeans(
    n_clusters = 5,
    n_gpus = 1,
    max_iter = 1000
)

# fit and predict (can be broken up into separate `fit` and `predict` methods)
predicted_clusters = kmeans.fit_predict(cpu_matrix)

# copy the cluster results onto a new gdf as a new column called `CLUSTER`
gdf_with_clusters = gdf.loc[:, ["INCEARN", "ROOMS", "BEDROOMS", "PHONE", "VEHICLES", "AGE"]].copy()
gdf_with_clusters.add_column("CLUSTER", predicted_clusters)

# show the last 85 rows of results
gdf_with_clusters.loc[9900:9985].head(85)

         INCEARN ROOMS BEDROOMS PHONE VEHICLES  AGE CLUSTER
9900      152777.4   5.0      4.0   2.0      4.0 47.0       1
9901      152777.4   8.0      5.0   2.0      2.0 58.0       1
9902      152777.4   8.0      4.0   2.0      2.0 59.0       1
9903      152777.4   9.0      5.0   2.0      2.0 38.0       1
9904      152777.4   6.0      4.0   2.0      2.0 47.0       1
9905      152777.4   9.0      5.0   2.0      2.0 72.0       1
9906      152777.4   9.0      5.0   2.0      3.0 57.0       1
9907      152777.4   9.0      5.0   2.0      2.0 59.0       1
9908      152777.4   3.0      2.0   2.0      9.0 47.0       1
9909      152777.4   7.0      4.0   2.0      2.0 42.0       1
9910      152777.4   7.0      4.0   2.0      2.0 47.0       1
9911      152777.4   9.0      6.0   2.0      2.0 61.0       1
9912      152777.4   5.0      4.0   2.0      2.0 33.0       1
9913      152777.4   7.0      4.0   2.0      2.0 52.0       1
9914    154814.432   2.0      2.0   2.0      3.0 25.0       1
9915     1

### PCA

See [pca.py](https://github.com/h2oai/h2o4gpu/blob/master/src/interface_py/h2o4gpu/solvers/pca.py) for more.

In [19]:
from h2o4gpu.solvers.pca import PCAH2O

# reduce down to the top 3 dimensions
pca = PCAH2O(
    n_components = 3
)

# use `cpu_matrix` from the previous use case
# recall that it is a numpy ndarray of the input data
pca.fit(cpu_matrix)

# components
# print("components", pca.components_)

# explained variance
print("explained variance", pca.explained_variance_)

# explained variance ratio
print("explained variance ratio", pca.explained_variance_ratio_)

explained variance [  1.20293687e+09   2.44333391e+05   4.94523652e+02]
explained variance ratio [ 0.06315957  0.01398424  0.01398268]
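
For reference, here is a minimal NumPy sketch of what explained variance and its ratio conventionally mean: center the data, take a singular value decomposition, and normalise the squared singular values. It runs on the CPU with randomly generated data and is not a description of how `PCAH2O` computes its attributes.

```python
import numpy as np

rng = np.random.RandomState(0)
# toy data: 200 rows, 4 columns with very different scales
X = rng.normal(size=(200, 4)) * np.array([100.0, 10.0, 1.0, 0.1])

# center each column, then decompose
X_centered = X - X.mean(axis=0)
_, singular_values, _ = np.linalg.svd(X_centered, full_matrices=False)

# variance captured by each principal component
explained_variance = singular_values ** 2 / (X.shape[0] - 1)
# share of the total variance, computed over all components
explained_variance_ratio = explained_variance / explained_variance.sum()

print(explained_variance_ratio[:3])
```

Under this convention the ratios over all components sum to 1, and a column with a much larger scale dominates the first component.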


### Regression via XGBoost

See [xgboost.py](https://github.com/h2oai/h2o4gpu/blob/master/src/interface_py/h2o4gpu/solvers/xgboost.py) for more, including classification classes.

In [20]:
from h2o4gpu import GradientBoostingRegressor

# ensure that we use the h2o4gpu backend (not sklearn)
xgb = GradientBoostingRegressor(backend = "h2o4gpu")

# convert input data from gdf to cpu matrices (numpy ndarrays)
cpu_matrices = {
    "train": {
        "x": gdfs["train"].as_matrix(columns=gdf.columns[1:]),
        "y": gdfs["train"].as_matrix(columns=[gdf.columns[0]]).flatten()
    },
    "test": {
        "x": gdfs["valid"].as_matrix(columns=gdf.columns[1:]),
        "y": gdfs["valid"].as_matrix(columns=[gdf.columns[0]]).flatten()
    }
}

# set the base parameters
num_rounds = 10
xgb_params = {
    "learning_rate": 0.1,
    "n_estimators": 100,
    "subsample": 1.0,
    "n_gpus": 1
}
xgb.set_params(**xgb_params)

# fit the model
xgb.fit(X = cpu_matrices["train"]["x"], y = cpu_matrices["train"]["y"])

# predict based upon test values
xgb_predictions = xgb.model.predict(cpu_matrices["test"]["x"])

# show the first 20 results
print(xgb_predictions[0:20])

Running h2o4gpu GradientBoostingRegressor
[  6797.90087891   3378.71459961   9183.80957031  16129.484375
  14172.34765625  15765.94824219  15628.46289062  14027.58789062
  18759.87304688  18161.1796875   13734.5078125   18248.9921875
  14172.34765625  17803.74023438  16812.0625      14660.29882812
  19444.0390625   18712.04492188   6858.76806641  17068.14453125]


### Additional H2O4GPU Resources

* [Solvers (code)](https://github.com/h2oai/h2o4gpu/tree/master/src/interface_py/h2o4gpu/solvers)
* [Python examples](https://github.com/h2oai/h2o4gpu/tree/master/examples/py)
* [Tests (useful for examples)](https://github.com/h2oai/h2o4gpu/tree/master/tests_open)