# Pandas-sklearn vs cuDF-cuML comparison: Random Forest training and inference 

[Disclaimer: This tutorial was adapted from the official [cuML Random Forest tutorial](https://github.com/rapidsai/cuml/blob/branch-22.06/notebooks/random_forest_demo.ipynb).]

The Random Forest algorithm is a classification method that builds several decision trees and aggregates their outputs to make a prediction.
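
As a rough illustration of the aggregation step, the sketch below combines the class labels predicted by a few trees through a simple majority vote. The function name `majority_vote` and the toy predictions are hypothetical, and real implementations may aggregate differently (scikit-learn, for instance, averages the trees' predicted class probabilities rather than counting hard votes).

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Combine per-tree class labels into one label per sample.

    tree_predictions: list of lists, one inner list of labels per tree.
    """
    n_samples = len(tree_predictions[0])
    forest_prediction = []
    for i in range(n_samples):
        # Count how many trees voted for each label on sample i
        votes = Counter(tree[i] for tree in tree_predictions)
        # most_common(1) returns [(label, count)]; ties go to the label seen first
        forest_prediction.append(votes.most_common(1)[0][0])
    return forest_prediction

# Three hypothetical trees voting on four samples
tree_predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
print(majority_vote(tree_predictions))  # [0, 1, 1, 0]
```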

In this notebook we train a scikit-learn and a cuML Random Forest classification model. We then save the cuML model for future use with Python's `pickle` module and demonstrate how to reload it for prediction. Finally, we compare the results of the scikit-learn model with those of the cuML model before and after pickling.

Note that the underlying algorithm in cuML for tree node splits differs from that used in scikit-learn.
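
To give a feel for how split-finding strategies can differ, the sketch below contrasts two ways of proposing candidate thresholds for one feature: an exhaustive search over every midpoint between distinct values, and a coarser set of cut points taken from the feature's quantiles. The cuML documentation describes its default as a quantile-based method tuned by `n_bins`; the helper names here are hypothetical, and this is a simplified illustration, not cuML's or scikit-learn's actual implementation.

```python
import statistics

def exact_thresholds(values):
    """Every midpoint between consecutive distinct sorted values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def quantile_thresholds(values, n_bins):
    """At most n_bins - 1 cut points taken from the feature's quantiles."""
    return sorted(set(statistics.quantiles(values, n=n_bins)))

# Toy feature column with ten distinct values (illustrative data only)
feature = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.2, 3.1, 4.8]

print(len(exact_thresholds(feature)))        # 9 candidates, one per gap
print(len(quantile_thresholds(feature, 4)))  # at most 3 candidates
```

Fewer candidates mean less work per node, at the cost of possibly missing the exact best threshold, which is one reason two libraries can reach slightly different accuracies on the same data.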

For information on converting your dataset to cuDF format, refer to the [cuDF documentation](https://docs.rapids.ai/api/cudf/stable).

For additional information on cuML's Random Forest model, see https://docs.rapids.ai/api/cuml/stable/api.html#random-forest

In [1]:
import cudf
import numpy as np
import pandas as pd
import pickle

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

## Define Parameters

In [4]:
# The speedup obtained by using cuML's Random Forest implementation
# grows with the size of the dataset. Increase the n_samples value
# below to see the difference in the time required to run
# scikit-learn's vs cuML's implementation on a large dataset.

n_samples = 2**12
n_features = 399
n_info = int(n_features/3)
data_type = np.float32

n_samples

4096

## Generate Data

### Host

In [5]:
%%time
X, y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
# cuML Random Forest Classifier requires the labels to be integers
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    random_state=0)

CPU times: user 335 ms, sys: 1.95 s, total: 2.28 s
Wall time: 180 ms


### GPU

In [6]:
%%time
X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)

y_cudf_train = cudf.Series(y_train.values)

CPU times: user 2.98 s, sys: 651 ms, total: 3.64 s
Wall time: 3.77 s


## Scikit-learn Model

### Fit

In [7]:
%%time
sk_model = skrfc(n_estimators=40,
                 max_depth=16,
                 max_features=1.0,
                 random_state=10)

sk_model.fit(X_train, y_train)

CPU times: user 24.9 s, sys: 0 ns, total: 24.9 s
Wall time: 24.9 s


RandomForestClassifier(max_depth=16, max_features=1.0, n_estimators=40,
                       random_state=10)

### Evaluate

In [8]:
%%time
sk_predict = sk_model.predict(X_test)
sk_acc = accuracy_score(y_test, sk_predict)

CPU times: user 18.7 ms, sys: 0 ns, total: 18.7 ms
Wall time: 16.9 ms


## cuML Model

### Fit

In [9]:
%%time
cuml_model = curfc(n_estimators=40,
                   max_depth=16,
                   max_features=1.0,
                   random_state=42)

cuml_model.fit(X_cudf_train, y_cudf_train)


CPU times: user 3.35 s, sys: 464 ms, total: 3.81 s
Wall time: 1.41 s


RandomForestClassifier()

### Evaluate

In [10]:
%%time
fil_preds_orig = cuml_model.predict(X_cudf_test)
fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)

CPU times: user 2.58 s, sys: 221 ms, total: 2.8 s
Wall time: 227 ms


## Pickle the cuML random forest classification model

In [11]:
filename = 'cuml_random_forest_model.sav'
# save the trained cuML model into a file; the with-block closes the handle
with open(filename, 'wb') as f:
    pickle.dump(cuml_model, f)
# delete the previous model to ensure that there is no leakage of pointers.
# this is not strictly necessary but just included here for demo purposes.
del cuml_model
# load the previously saved cuML model from the file
with open(filename, 'rb') as f:
    pickled_cuml_model = pickle.load(f)


### Predict using the pickled model

In [12]:
%%time
pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)

fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)

CPU times: user 3.19 s, sys: 235 ms, total: 3.43 s
Wall time: 266 ms


## Compare Results

In [13]:
print("CUML accuracy of the RF model before pickling: %s" % fil_acc_orig)
print("CUML accuracy of the RF model after pickling: %s" % fil_acc_after_pickling)

CUML accuracy of the RF model before pickling: 0.7512195110321045
CUML accuracy of the RF model after pickling: 0.7512195110321045
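
The identical accuracies above suggest that the round trip preserved the model. A stricter check compares the predictions themselves rather than a summary score. The sketch below shows the idea with a plain dictionary standing in for a trained model; `predict`, `model`, and `rows` are hypothetical names, not part of cuML.

```python
import pickle

def predict(model, rows):
    # Label a row 1 when its first feature exceeds the stored threshold
    return [int(row[0] > model["threshold"]) for row in rows]

# Hypothetical stand-in for a trained model: a dict of learned parameters
model = {"threshold": 0.5}
rows = [[0.2], [0.7], [0.5], [1.4]]
before = predict(model, rows)

# Serialize to bytes and restore, mirroring the save/load steps above
restored = pickle.loads(pickle.dumps(model))
after = predict(restored, rows)

# Element-wise equality is a stronger guarantee than equal accuracy
assert before == after, "predictions changed after the pickle round trip"
print(before)  # [0, 1, 0, 1]
```

With the real models, the same comparison would be made between `fil_preds_orig` and `pred_after_pickling`.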


In [14]:
print("SKL accuracy: %s" % sk_acc)
print("CUML accuracy before pickling: %s" % fil_acc_orig)

SKL accuracy: 0.7804877758026123
CUML accuracy before pickling: 0.7512195110321045
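
To put the gap between the two scores in perspective, accuracy is simply the fraction of test rows labeled correctly. The sketch below reconstructs the arithmetic under two assumptions: that the 20% split of 4096 rows yields 820 test rows (scikit-learn rounds a fractional test size up), and that the correct-prediction counts are the ones implied by the accuracies printed above. The counts are inferred, not taken from the notebook's output.

```python
import math

n_samples = 2**12
test_size = 0.2
# A fractional test size is rounded up to a whole number of rows
n_test = math.ceil(test_size * n_samples)

def accuracy(n_correct, n_total):
    """Fraction of rows whose predicted label matches the true label."""
    return n_correct / n_total

# Counts inferred from the printed accuracies (assumption, not notebook output)
sk_correct = 640
cuml_correct = 616

print(n_test)                           # 820
print(accuracy(sk_correct, n_test))     # close to the 0.7805 reported above
print(accuracy(cuml_correct, n_test))   # close to the 0.7512 reported above
print(sk_correct - cuml_correct)        # 24 rows separate the two models
```

Under these assumptions the two models disagree in correctness on only 24 of 820 rows. Since they were also trained with different `random_state` values and different split algorithms, a difference of this size is not surprising.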


## Random Forests Multi-node/Multi-GPU (MNMG) demo 

Today's demo runs on a single GPU. Models can also be trained with GPU acceleration across multiple GPUs, and even multiple nodes, using Dask-cuDF and Dask-ML. For a tutorial on MNMG Random Forest training, please visit:

https://github.com/rapidsai/cuml/blob/branch-22.06/notebooks/random_forest_mnmg_demo.ipynb