Based on https://colab.research.google.com/drive/13sspqiEZwso4NYTbsflpPyNFaVAAxUgr#scrollTo=Dlsyk9m9NN2K
and https://docs.rapids.ai/deployment/stable/platforms/colab/

# Environment Sanity Check #

Click the _Runtime_ dropdown at the top of the page, then _Change Runtime Type_ and confirm the instance type is _GPU_.

Run `!nvidia-smi` to see which GPU you have; uncomment the cell below if you'd like to do that.  Currently, RAPIDS runs on all available Colab GPU instances.

In [None]:
# !nvidia-smi
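
If you would rather do the same check from Python, the following is a minimal sketch. It assumes `nvidia-smi` is on the `PATH` when a GPU runtime is attached; the helper name `query_gpus` is hypothetical and is not part of RAPIDS or Colab.

```python
import shutil
import subprocess


def query_gpus():
    """Return a list of (name, total_memory) tuples, or None if no GPU tooling is found.

    Hypothetical helper: it only wraps the nvidia-smi command-line tool.
    """
    if shutil.which("nvidia-smi") is None:
        # nvidia-smi is not on PATH, so this is probably a CPU-only runtime.
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    gpus = []
    for line in result.stdout.strip().splitlines():
        # Each line looks like "<gpu name>, <memory>"; split on the last comma.
        name, _, memory = line.rpartition(",")
        gpus.append((name.strip(), memory.strip()))
    return gpus


gpus = query_gpus()
if gpus is None:
    print("No GPU detected. Check Runtime > Change Runtime Type.")
else:
    for name, memory in gpus:
        print(f"GPU: {name} ({memory})")
```

Returning `None` instead of raising keeps the cell safe to run on a CPU runtime, where it simply prints the reminder.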

# Setup #
This setup script:

1. Checks to make sure that the GPU is RAPIDS compatible
1. Installs the **current stable version** of RAPIDSAI's core libraries using pip, which are:
  1. cuDF
  1. cuML
  1. cuGraph
  1. cuSpatial
  1. cuxFilter
  1. cuCIM
  1. xgboost

**This will complete in about 5-6 minutes**

If you need the **nightly** releases of RAPIDSAI, please use the [RAPIDS Conda Colab Template notebook](https://colab.research.google.com/drive/1TAAi_szMfWqRfHVfjGSqnGVLr_ztzUM9) and select the nightly parameter option when running the RAPIDS installation cell.


In [None]:
# This gets the RAPIDS-Colab install files and checks your GPU.  Run this and the next cell only.
# Please read the output of this cell.  If your Colab instance is not RAPIDS compatible, it will warn you and give you remediation steps.
!git clone https://github.com/rapidsai/rapidsai-csp-utils.git
!python rapidsai-csp-utils/colab/pip-install.py


# RAPIDS is now installed on Colab.  
You can copy your code into the cells below, or run them as they are to validate your RAPIDS installation and versions.  
# Enjoy!

In [None]:
import cudf
cudf.__version__

'23.12.01'

In [None]:
import cuml
cuml.__version__

'23.12.00'

In [None]:
import cugraph
cugraph.__version__

'23.12.00'

In [None]:
import cuspatial
cuspatial.__version__

'23.12.01'

In [None]:
import cuxfilter
cuxfilter.__version__

'23.12.00'
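
The per-library checks above can also be collapsed into one loop. This is a sketch only: `report_versions` is a hypothetical helper, and it assumes each library exposes a `__version__` attribute, as the cells above show for these five.

```python
import importlib


def report_versions(module_names):
    """Map each module name to its version string, or None if it cannot be imported."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # Covers a missing package as well as import-time failures
            # (for example, no usable GPU runtime).
            versions[name] = None
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


rapids_modules = ["cudf", "cuml", "cugraph", "cuspatial", "cuxfilter"]
for name, version in report_versions(rapids_modules).items():
    print(f"{name}: {version if version is not None else 'not available'}")
```

A `None` entry means the import failed, which usually points back to the install cell.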

# Next Steps #

For an overview of how you can access and work with your own datasets in Colab, check out [this guide](https://towardsdatascience.com/3-ways-to-load-csv-files-into-colab-7c14fcbdcb92).

For more RAPIDS examples, check out our RAPIDS notebooks repos:
1. https://github.com/rapidsai/notebooks
2. https://github.com/rapidsai/notebooks-contrib

In [None]:
import cudf
import numpy as np
import pandas as pd
import pickle

from cuml.ensemble import RandomForestClassifier as curfc
from cuml.metrics import accuracy_score

from sklearn.ensemble import RandomForestClassifier as skrfc
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

In [None]:
# The speedup obtained by using cuML's Random Forest implementation
# becomes much higher when using larger datasets. Uncomment and use the n_samples
# value provided below to see the difference in the time required to run
# Scikit-learn's vs cuML's implementation with a large dataset.

# n_samples = 2**17
n_samples = 2**12
n_features = 399
n_info = 300
data_type = np.float32

In [None]:
%%time
X,y = make_classification(n_samples=n_samples,
                          n_features=n_features,
                          n_informative=n_info,
                          random_state=123, n_classes=2)

X = pd.DataFrame(X.astype(data_type))
# cuML Random Forest Classifier requires the labels to be integers
y = pd.Series(y.astype(np.int32))

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state=0)

CPU times: user 162 ms, sys: 66.9 ms, total: 229 ms
Wall time: 231 ms


In [None]:
%%time
X_cudf_train = cudf.DataFrame.from_pandas(X_train)
X_cudf_test = cudf.DataFrame.from_pandas(X_test)

y_cudf_train = cudf.Series(y_train.values)

CPU times: user 569 ms, sys: 16.6 ms, total: 586 ms
Wall time: 617 ms


In [None]:
%%time
sk_model = skrfc(n_estimators=40,
                 max_depth=16,
                 max_features=1.0,
                 random_state=10)

sk_model.fit(X_train, y_train)

CPU times: user 38.7 s, sys: 41.8 ms, total: 38.7 s
Wall time: 38.7 s


In [None]:
%%time
sk_predict = sk_model.predict(X_test)
sk_acc = accuracy_score(y_test, sk_predict)

CPU times: user 53.5 ms, sys: 6.12 ms, total: 59.6 ms
Wall time: 152 ms


In [None]:
%%time
cuml_model = curfc(n_estimators=40,
                   max_depth=16,
                   max_features=1.0,
                   random_state=10,
                   n_streams=1)

cuml_model.fit(X_cudf_train, y_cudf_train)

CPU times: user 571 ms, sys: 204 ms, total: 776 ms
Wall time: 952 ms


In [None]:
%%time
fil_preds_orig = cuml_model.predict(X_cudf_test)

fil_acc_orig = accuracy_score(y_test.to_numpy(), fil_preds_orig)

CPU times: user 1min 26s, sys: 1.3 s, total: 1min 27s
Wall time: 1min 28s
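
A single `%%time` reading can be dominated by one-time overhead. The first cuML `predict` call above took far longer than the later call on the reloaded model, and this notebook does not establish why. One way to look at this, sketched below with hypothetical helpers `time_calls` and `speedup`, is to time the same call several times and compare the runs.

```python
import time


def time_calls(fn, *args, repeats=3):
    """Call fn(*args) `repeats` times and return the wall time of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        durations.append(time.perf_counter() - start)
    return durations


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


# Fit wall times recorded above: scikit-learn 38.7 s, cuML 952 ms.
print(f"fit speedup: {speedup(38.7, 0.952):.1f}x")

# In this notebook one could run, for example:
# print(time_calls(cuml_model.predict, X_cudf_test))
```

If the first entry in the returned list is much larger than the rest, the extra time is likely setup cost rather than steady-state prediction time.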


In [None]:
from google.colab import drive
drive.mount('/content/drive')

folder_path = '/content/drive/My Drive/Colab Notebooks/Ch4_HR_HRV_Generation/'

Mounted at /content/drive


In [None]:
filename = 'cuml_random_forest_model.sav'
# Save the trained cuML model to a file (uncomment on the first run).
# with open(folder_path + filename, 'wb') as f:
#     pickle.dump(cuml_model, f)
# Delete the in-memory model to ensure that there is no leakage of pointers.
# This is not strictly necessary; it is included here for demo purposes.
# del cuml_model
# Load the previously saved cuML model from the file.
with open(folder_path + filename, 'rb') as f:
    pickled_cuml_model = pickle.load(f)
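
The save, delete, and reload pattern in the cell above can be tried without a GPU or Google Drive. The sketch below is an illustration only: a plain dictionary stands in for the trained model, and a temporary directory stands in for `folder_path`.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model object.
model = {"n_estimators": 40, "max_depth": 16}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")

    # Save the object; the context manager closes the file handle.
    with open(path, "wb") as f:
        pickle.dump(model, f)

    # Drop the in-memory copy so the reload cannot reuse it by accident.
    del model

    # Load the object back from disk.
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored)
```

Only load pickle files that you created yourself, since unpickling can execute arbitrary code.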

In [None]:
%%time
pred_after_pickling = pickled_cuml_model.predict(X_cudf_test)

fil_acc_after_pickling = accuracy_score(y_test.to_numpy(), pred_after_pickling)

CPU times: user 166 ms, sys: 41.1 ms, total: 207 ms
Wall time: 173 ms


In [None]:
print("CUML accuracy of the RF model before pickling: %s" % fil_acc_orig)
print("CUML accuracy of the RF model after pickling: %s" % fil_acc_after_pickling)

CUML accuracy of the RF model before pickling: 0.7512195110321045
CUML accuracy of the RF model after pickling: 0.7512195110321045


In [None]:
print("SKL accuracy: %s" % sk_acc)
print("CUML accuracy before pickling: %s" % fil_acc_orig)

SKL accuracy: 0.6926829218864441
CUML accuracy before pickling: 0.7512195110321045
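
Accuracy here is the fraction of test labels that were predicted correctly. The plain-Python sketch below shows that calculation; `accuracy` is a hypothetical helper, not the cuML function. The printed scores are consistent with 616 and 568 correct predictions out of 820 test rows, which is an inference from the output above rather than something the notebook states.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return matches / len(y_true)


# Toy labels: three of the four predictions match.
print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))

# Ratios that would be consistent with the scores printed above.
print(round(616 / 820, 4), round(568 / 820, 4))
```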
