# RAPIDS cuML
## https://docs.rapids.ai/api/cuml/stable/

cuML is a suite of fast, GPU-accelerated machine learning algorithms designed for data science and analytical tasks. Our API mirrors scikit-learn's, and we provide practitioners with the familiar fit-predict-transform paradigm without ever having to write GPU code.

## Random Forest 
The Random Forest algorithm is a classification method that builds several decision trees and aggregates their outputs to make a prediction.
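As a rough illustration of the aggregation step described above, the sketch below combines per-tree class predictions by majority vote. The tree outputs are made-up toy values and `majority_vote` is a hypothetical helper; real libraries may aggregate differently (for example, by averaging class probabilities), so treat this as one possible scheme rather than what scikit-learn or cuML does internally.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the most common label among the per-tree predictions for one sample."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical outputs of five trees (rows) for three samples (columns).
per_tree = [
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
    [0, 1, 1],
]

# zip(*per_tree) regroups the votes by sample instead of by tree.
forest_prediction = [majority_vote(votes) for votes in zip(*per_tree)]
print(forest_prediction)  # [0, 1, 0]
```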

In this notebook we will train a scikit-learn and a cuML Random Forest classification model. Then we save the cuML model for future use with Python's `pickle` module and demonstrate how to reload it for prediction. We also compare the results of the scikit-learn, non-pickled, and pickled cuML models.

Note that the underlying algorithm in cuML for tree node splits differs from that used in scikit-learn.

For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).

For additional information on cuML's random forest model, see https://docs.rapids.ai/api/cuml/stable/api.html#random-forest
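The save-and-reload step mentioned in the introduction can be sketched with the standard `pickle` module. The helper names `save_model` and `load_model`, the file name `rf_model.pkl`, and the dictionary standing in for a fitted model are all assumptions made for illustration; in the notebook the fitted cuML model would presumably be passed in place of the stand-in.

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Serialize a Python object to disk with pickle."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Load a previously pickled object from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Stand-in object (hypothetical); a fitted model would be used here instead.
stand_in_model = {"n_estimators": 40, "max_depth": 16}

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "rf_model.pkl"  # hypothetical file name
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)

print(restored == stand_in_model)  # True
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.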

In [1]:
import cudf
import numpy as np
import pandas as pd
import pickle

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

## Define Parameters

In [2]:
# The speedup obtained by using cuML's Random Forest implementation
# becomes much higher when using larger datasets. Uncomment and use the n_samples
# value provided below to see the difference in the time required to run
# scikit-learn's vs cuML's implementation with a large dataset.

# n_samples = 2**17
n_samples = 2**14
n_features = 399
n_info = 300
data_type = np.float32

In [3]:
n_samples

16384

## Generate Data

### Host

In [4]:
%%time
X, y = make_classification(n_samples=n_samples,
                           n_features=n_features,
                           n_informative=n_info,
                           random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
# cuML Random Forest Classifier requires the labels to be integers
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

CPU times: user 7.36 s, sys: 641 ms, total: 8 s
Wall time: 1.07 s


In [5]:
X.head()

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 389 | 390 | 391 | 392 | 393 | 394 | 395 | 396 | 397 | 398 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.002012 | 3.137154 | -0.175892 | 1.642969 | -13.864321 | 3.10494 | 0.291729 | 4.270051 | 5.143433 | 25.099476 | ... | 11.201533 | 2.463232 | 13.170018 | 0.818072 | -10.126958 | 3.329729 | -7.845697 | 7.462924 | 1.405313 | -0.481213 |
| 1 | -8.690502 | -7.17755 | 1.060494 | -0.881712 | 6.042475 | -6.249778 | -0.012579 | 0.080355 | -11.832709 | 13.821424 | ... | 6.957514 | -11.299376 | -10.264004 | 1.07897 | 4.359313 | 9.783484 | 0.019618 | 8.896317 | -0.406758 | 18.086494 |
| 2 | -0.263046 | -3.670474 | -0.554353 | -0.165402 | 3.03026 | -10.715601 | -0.285615 | 19.295618 | -19.237207 | 8.104156 | ... | 15.826721 | 12.024781 | -5.576826 | 13.670948 | -18.080793 | -21.687071 | 3.039463 | -0.893758 | 0.450848 | -1.3738 |
| 3 | 4.909552 | 3.377095 | 0.166382 | 0.161322 | -2.612847 | 13.082252 | -1.996896 | 6.375703 | -3.963556 | -8.378668 | ... | -18.815842 | -0.240144 | -0.335867 | -7.375377 | -7.362492 | -1.384158 | -5.03215 | -13.949319 | 0.245071 | -3.324761 |
| 4 | -9.864326 | 6.547871 | 0.264755 | -0.62433 | -9.511337 | 17.867865 | -0.122875 | -15.690889 | -2.216547 | 0.616358 | ... | 7.892476 | -2.145288 | -14.656464 | -7.131478 | 2.950855 | -11.903608 | 8.251114 | -7.561745 | -1.011766 | -10.917574 |


### GPU

In [6]:
%%time
X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)

y_cudf_train = cudf.Series(y_train.values)

CPU times: user 10.5 s, sys: 3.46 s, total: 13.9 s
Wall time: 14.4 s


## Scikit-learn Model

### Fit

In [7]:
%%time
sk_model = skrfc(n_estimators=40,
                 max_depth=16,
                 max_features=1.0,
                 random_state=10)

sk_model.fit(X_train, y_train)

CPU times: user 4min 41s, sys: 33.9 ms, total: 4min 41s
Wall time: 4min 41s


### Evaluate

In [18]:
%%time
sk_predict = sk_model.predict(X_test)
sk_acc = accuracy_score(y_test, sk_predict)

CPU times: user 105 ms, sys: 0 ns, total: 105 ms
Wall time: 103 ms


## cuML Model

### Fit

In [9]:
%%time
cuml_model = curfc(n_estimators=40,
                   max_depth=16,
                   max_features=1.0,
                   random_state=10)

cuml_model.fit(X_cudf_train, y_cudf_train)

CPU times: user 58.7 s, sys: 48.5 s, total: 1min 47s
Wall time: 34.5 s


### Evaluate

In [17]:
%%time
fil_preds_orig = cuml_model.predict(X_cudf_test)

fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)

CPU times: user 688 ms, sys: 36.2 ms, total: 724 ms
Wall time: 351 ms
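Both Evaluate cells reduce the predictions to a single accuracy value. As a rough illustration of what that metric measures, the sketch below computes the fraction of matching labels in plain Python. The `accuracy` helper and the toy labels are hypothetical and are not the `cuml.metrics.accuracy_score` implementation.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


# Toy labels (made up for illustration): four of five predictions are correct.
toy_true = [0, 1, 1, 0, 1]
toy_pred = [0, 1, 0, 0, 1]
toy_acc = accuracy(toy_true, toy_pred)
print(toy_acc)  # 0.8
```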


In [13]:
# Fit speedup: scikit-learn wall time / cuML wall time, both in seconds
# scikit-learn: 4min 41s = 281 s; cuML: 34.5 s
round((4 * 60 + 41) / 34.5, 2)

8.14

In [19]:
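# Prediction time ratio: scikit-learn wall time / cuML wall time, both in ms
# A value below 1 means scikit-learn was faster for this predict call
round(103 / 351, 2)

0.29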
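The wall times reported by `%%time` mix units (minutes, seconds, milliseconds), so it is easy to divide mismatched quantities by hand. A minimal sketch of a safer approach is to convert each reported string to seconds first. The helper `wall_time_to_seconds` is hypothetical and only handles the simple formats seen in this notebook's output.

```python
import re

# Seconds per unit for the suffixes that appear in %%time wall-time output.
_UNIT_SECONDS = {"h": 3600.0, "min": 60.0, "s": 1.0,
                 "ms": 1e-3, "us": 1e-6, "µs": 1e-6, "ns": 1e-9}


def wall_time_to_seconds(text):
    """Convert a string such as '4min 41s' or '103 ms' to seconds."""
    total = 0.0
    # Longer unit names are listed before 's' so 'ms' and 'min' match first.
    for value, unit in re.findall(r"([\d.]+)\s*(h|min|ms|µs|us|ns|s)", text):
        total += float(value) * _UNIT_SECONDS[unit]
    return total


# Wall times copied from the cell outputs above.
fit_speedup = wall_time_to_seconds("4min 41s") / wall_time_to_seconds("34.5 s")
predict_ratio = wall_time_to_seconds("103 ms") / wall_time_to_seconds("351 ms")

print(f"{fit_speedup:.2f}")    # 8.14
print(f"{predict_ratio:.2f}")  # 0.29
```

With these inputs, cuML fits roughly 8 times faster, while the single `predict` call on this small test set took longer on the GPU than on the CPU.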