# Machine Learning with PyCaret and cuML

## 1.0 RAPIDS

### What is RAPIDS?

RAPIDS is a suite of open-source software libraries and APIs that lets you run end-to-end data science and analytics pipelines entirely on GPUs. With RAPIDS you can use the GPU at every stage, from data preprocessing to model training and deployment. To learn more, see the [RAPIDS website](https://rapids.ai/index.html).

### cuML: a RAPIDS library to speed up your ML projects

cuML is a suite of fast, GPU-accelerated machine learning algorithms designed for data science and analytical tasks. Its API mirrors scikit-learn's and gives practitioners the familiar fit-predict-transform paradigm without ever having to program a GPU directly.
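
To make the "mirrored API" point concrete, here is a minimal sketch. It assumes either cuML or scikit-learn is installed and falls back gracefully if neither is; the helper names `load_linear_regression` and `fit_and_predict` and the toy data are made up for illustration and are not part of this notebook.

```python
# Sketch: the same fit/predict code runs on cuML (GPU) or scikit-learn (CPU).
# Which backend gets picked depends on what is installed on your machine.

def load_linear_regression():
    """Return (estimator class, backend name), preferring cuML when it imports."""
    try:
        from cuml.linear_model import LinearRegression  # GPU implementation
        return LinearRegression, "cuml"
    except Exception:  # cuML missing, or no usable GPU/CUDA runtime
        pass
    try:
        from sklearn.linear_model import LinearRegression  # CPU implementation
        return LinearRegression, "sklearn"
    except ImportError:
        return None, "none"


def fit_and_predict():
    """Fit y = 2x + 1 on toy data and predict for x = 4."""
    estimator_cls, backend = load_linear_regression()
    if estimator_cls is None:
        return backend, None
    import numpy as np  # both backends depend on NumPy

    X = np.array([[0.0], [1.0], [2.0], [3.0]], dtype=np.float32)
    y = np.array([1.0, 3.0, 5.0, 7.0], dtype=np.float32)
    model = estimator_cls()
    model.fit(X, y)
    x_new = np.array([[4.0]], dtype=np.float32)
    prediction = float(np.asarray(model.predict(x_new))[0])
    return backend, prediction


backend, prediction = fit_and_predict()
print(backend, prediction)
```

Only the import line differs between the two backends; the calls to `fit` and `predict` stay the same.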

This is most useful when you run ML algorithms on large datasets.

cuML can only be installed on Linux. If you are on Windows, you can install WSL2 (Windows Subsystem for Linux) and enable it on your machine. In this notebook we install only cuML from RAPIDS and then use the GPU to train models through the PyCaret library.


## 1.1 PyCaret Python Module

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that significantly speeds up the experiment cycle and makes you more productive. [Learn more](https://pycaret.readthedocs.io/en/latest/index.html)

## 1.2 PyCaret meets cuML

PyCaret makes a machine learning project simpler through its low-code features, covering data preprocessing, model training and tuning, and model deployment. There are plenty of ML algorithms you could try on your data, and PyCaret lets you experiment with most of the available models in a single line of code. This saves resources and time and makes you more productive.

While PyCaret cuts the time you spend experimenting with different algorithms, cuML speeds up training by using the GPUs on your machine. To me, the two make a great combination. As your data gets bigger, depending solely on the CPU can become a problem unless you have very expensive hardware to begin with.

## 1.3 Setting Up the Environment

Before installing, make sure your system meets the requirements shown in the figure below.

![image.png](attachment:15553510-d735-4635-9712-2cc956651724.png) Source: [RAPIDS on WSL2](https://rapids.ai/wsl2.html)

#### 1.3.1 Install and enable WSL

To install WSL on your Windows machine, follow Microsoft's [installation guide](https://learn.microsoft.com/en-us/windows/wsl/install).

#### 1.3.2 Download NVIDIA Driver

Download the latest [NVIDIA driver](https://www.nvidia.com/download/index.aspx) for Windows (Game Ready Driver) and install it on your computer.

#### 1.3.3 Download Anaconda/Miniconda

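Once your system meets the requirements, install conda inside the WSL environment. The command below downloads the Miniconda installer; full Anaconda installers in the [Anaconda archive](https://repo.anaconda.com/archive/) are named by release version rather than `latest`, so substitute that file name if you prefer Anaconda.

```
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
```

Then run the installer:

```
$ bash Miniconda3-latest-Linux-x86_64.sh
```

When the installer asks, allow it to run `conda init` (recommended).

Restart the terminal to initialize conda.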

#### 1.3.4 Download and Install CUDA Toolkit

I recommend installing CUDA Toolkit 11.2, which matches the `cudatoolkit=11.2` pin used in section 1.3.5. Refer to this [download page](https://developer.nvidia.com/cuda-11.2.0-download-archive?target_os=Linux&target_arch=x86_64&target_distro=WSLUbuntu&target_version=20&target_type=deblocal) and run the commands provided there one by one.
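
Before moving on, it can help to confirm that the pieces installed so far are visible from inside WSL. The sketch below uses only the Python standard library; the function name `check_gpu_environment` is made up for illustration, and the WSL check relies on the assumption that WSL kernels report a release string containing "microsoft".

```python
import platform
import shutil


def check_gpu_environment():
    """Report whether we are in WSL and which NVIDIA tools are on PATH."""
    release = platform.uname().release.lower()
    return {
        # Assumption: WSL kernels report a release string containing "microsoft"
        "running_in_wsl": "microsoft" in release,
        # nvidia-smi ships with the driver, nvcc with the CUDA Toolkit
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
        "nvcc_found": shutil.which("nvcc") is not None,
    }


report = check_gpu_environment()
for name, ok in report.items():
    print(f"{name}: {ok}")
```

A `False` for `nvcc_found` does not necessarily mean the toolkit is missing; it may simply not be on your `PATH` yet.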

#### 1.3.5 Install cuML and PyCaret

First, create a conda environment. (Because we are working in WSL, make sure you use the conda installed inside your WSL machine.) Run the following command to create the environment:
```
$ conda create -n pycaret-cuML
```
Then activate the environment and install cuML:
```
$ conda activate pycaret-cuML
$ conda install -c rapidsai -c nvidia -c numba -c conda-forge \
    cuml=22.06 python=3.8 cudatoolkit=11.2
```
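Wait for the installation to finish, then install PyCaret. This notebook was run with PyCaret 2.3.10 (see the version check in section 1.4); to reproduce that version, append `==2.3.10` to the package name.
```
$ pip install pycaret
```
or, to include the optional dependencies:
```
$ pip install pycaret[full]
```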
Then create a notebook kernel connected to the conda environment:
```
$ python -m ipykernel install --user --name <yourenvname> --display-name <yourenvname>
```

Open your notebook and change the kernel to this environment.
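
Once the notebook is running on the new kernel, a quick check like the following can confirm that the packages installed above are visible to it. This is a small sketch using only the standard library; the helper name `installed_packages` is made up for illustration.

```python
import importlib.util


def installed_packages(names=("cuml", "pycaret", "ipykernel")):
    """Map each package name to True if Python can locate it in this environment."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = installed_packages()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

If something shows up as missing, the notebook is probably still attached to a different kernel than the conda environment you created.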

## 1.4 Regression Tutorial: PyCaret with GPU

In [1]:
# check pycaret version
import pycaret

from pycaret.utils import version
version()

'2.3.10'

In [2]:
from pycaret.datasets import get_data
dataset = get_data('diamond')

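|   | Carat Weight | Cut | Color | Clarity | Polish | Symmetry | Report | Price |
|---|---|---|---|---|---|---|---|---|
| 0 | 1.1 | Ideal | H | SI1 | VG | EX | GIA | 5169 |
| 1 | 0.83 | Ideal | H | VS1 | ID | ID | AGSL | 3470 |
| 2 | 0.85 | Ideal | H | SI1 | EX | EX | GIA | 3183 |
| 3 | 0.91 | Ideal | E | SI1 | VG | VG | GIA | 4370 |
| 4 | 0.83 | Ideal | G | SI1 | EX | EX | GIA | 3171 |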


In [3]:
# check the shape of the dataset
dataset.shape

(6000, 8)

In [4]:
# hold out 10% of the rows as unseen data for prediction later;
# the remaining 90% is used for training and testing

data = dataset.sample(frac=0.9, random_state=123)
data_unseen = dataset.drop(data.index)

data.reset_index(drop=True, inplace=True)
data_unseen.reset_index(drop=True, inplace=True)

print(f"Data for Modeling: {data.shape}")
print(f"Unseen Data For Prediction : {data_unseen.shape}")

Data for Modeling: (5400, 8)
Unseen Data For Prediction : (600, 8)
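
The split above samples 90% of the rows and keeps the rest as unseen data. The same logic can be sketched without pandas as follows; the function name `holdout_split` is made up for illustration, and because it uses Python's own random generator it will not pick the same rows as `dataset.sample`.

```python
import random


def holdout_split(n_rows, frac=0.9, seed=123):
    """Return (modeling_idx, unseen_idx): a random frac of rows and the remainder."""
    rng = random.Random(seed)
    modeling = set(rng.sample(range(n_rows), round(n_rows * frac)))
    # Everything that was not sampled becomes the unseen set
    unseen = [i for i in range(n_rows) if i not in modeling]
    return sorted(modeling), unseen


modeling_idx, unseen_idx = holdout_split(6000)
print(len(modeling_idx), len(unseen_idx))  # prints: 5400 600
```

The two index lists never overlap, which is what keeps the unseen data a fair test for the final model.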


##### Train Model with GPU

In [5]:
# use the setup function to initialize the experiment and build the data transformation pipeline

from pycaret.regression import *

reg = setup(data=data, target='Price', session_id=123,
            normalize=True, transformation=True, transform_target=True,
            combine_rare_levels=True, rare_level_threshold=0.05,
            remove_multicollinearity=True, multicollinearity_threshold=0.95,
            bin_numeric_features=['Carat Weight'],
            log_experiment=True, experiment_name='diamond1',
            use_gpu=True)  # train on the GPU where an estimator supports it

|   | Description | Value |
|---|---|---|
| 0 | session_id | 123 |
| 1 | Target | Price |
| 2 | Original Data | (5400, 8) |
| 3 | Missing Values | False |
| 4 | Numeric Features | 1 |
| 5 | Categorical Features | 6 |
| 6 | Ordinal Features | False |
| 7 | High Cardinality Features | False |
| 8 | High Cardinality Method |  |
| 9 | Transformed Train Set | (3779, 39) |
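
Of the `setup` options above, `combine_rare_levels` with `rare_level_threshold=0.05` is perhaps the least obvious: categories that cover less than 5% of the rows are merged into a single level. Here is a rough sketch of that idea in plain Python. The function, the `"rare"` label, and the toy column are assumptions for illustration; PyCaret's own implementation may differ in its details.

```python
from collections import Counter


def combine_rare_levels(values, threshold=0.05, rare_label="rare"):
    """Replace categories whose share of rows is below threshold with rare_label."""
    counts = Counter(values)
    total = len(values)
    rare = {level for level, n in counts.items() if n / total < threshold}
    return [rare_label if v in rare else v for v in values]


# Made-up column: 60% "Ideal", 37% "Good", 3% "Fair"
cuts = ["Ideal"] * 60 + ["Good"] * 37 + ["Fair"] * 3
combined = combine_rare_levels(cuts)
print(Counter(combined))
```

In this toy column only "Fair" falls below the 5% threshold, so it is the only level that gets merged.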


In [6]:
# Model training
# Keep an eye on your GPU utilization while this cell runs:
# on Windows, open Task Manager and watch the GPU graph

import time

start_time = time.time()

best = compare_models(exclude= ['ransac'], n_select=5)

end_time = time.time()

print(f"Time taken for the training using GPU: {end_time - start_time}")

|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE | TT (Sec) |
|---|---|---|---|---|---|---|---|---|
| lightgbm | Light Gradient Boosting Machine | 787.5566 | 3118472.2493 | 1735.1266 | 0.9701 | 0.0797 | 0.0581 | 0.193 |
| ridge | Ridge Regression | 943.8508 | 3128698.4695 | 1761.1021 | 0.9697 | 0.0956 | 0.0712 | 0.028 |
| huber | Huber Regressor | 951.4191 | 3318263.2111 | 1810.9574 | 0.9679 | 0.0959 | 0.0709 | 0.272 |
| br | Bayesian Ridge | 953.4396 | 3347366.194 | 1818.2586 | 0.9676 | 0.0957 | 0.0713 | 0.04 |
| rf | Random Forest Regressor | 918.0174 | 3743097.8964 | 1905.6855 | 0.9638 | 0.0953 | 0.0696 | 1.356 |
| et | Extra Trees Regressor | 997.2082 | 4768284.492 | 2152.7785 | 0.9535 | 0.1051 | 0.0759 | 0.485 |
| gbr | Gradient Boosting Regressor | 1123.3588 | 5204613.2764 | 2240.561 | 0.9502 | 0.109 | 0.0827 | 0.237 |
| dt | Decision Tree Regressor | 1029.2659 | 6015516.0996 | 2383.8485 | 0.9412 | 0.1079 | 0.0774 | 0.024 |
| par | Passive Aggressive Regressor | 2126.3425 | 13072419.3013 | 3514.6981 | 0.8748 | 0.2005 | 0.155 | 0.044 |
| knn | K Neighbors Regressor | 3369.9164 | 44807223.6349 | 6671.111 | 0.5671 | 0.3864 | 0.2555 | 0.023 |


Time taken for the training using GPU: 55.401522159576416


All of these regression algorithms were trained and compared in PyCaret with a single line of code!
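
If you time several steps, wrapping the start/stop bookkeeping in a small context manager keeps the cells tidy. The sketch below uses only the standard library; `timed` is a made-up helper name, and the toy workload stands in for a real training call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the wrapped block took, using a monotonic clock."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")


# Toy workload standing in for a training call
with timed("toy workload"):
    total = sum(i * i for i in range(100_000))
```

In this notebook you could put the `compare_models(...)` call inside the `with` block, once with `use_gpu=True` and once without, to compare the two runs.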

In [7]:
print(best)

[PowerTransformedTargetRegressor(boosting_type='gbdt', class_weight=None,
                                colsample_bytree=1.0, importance_type='split',
                                learning_rate=0.1, max_depth=-1,
                                min_child_samples=20, min_child_weight=0.001,
                                min_split_gain=0.0, n_estimators=100, n_jobs=-1,
                                num_leaves=31, objective=None,
                                power_transformer_method='box-cox',
                                power_transformer_standardize=True,
                                random_state=1...
                                                        importance_type='split',
                                                        learning_rate=0.1,
                                                        max_depth=-1,
                                                        min_child_samples=20,
                                                        min_child_weig

In [8]:
# create_model trains an estimator and evaluates it with cross-validation
# (PyCaret uses 10 folds by default)
# here we pick two of the models from the comparison above

# first model
rf = create_model('rf', verbose=False)

In [9]:
# second model
lgbm = create_model('lightgbm', verbose=False)
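
To see what cross-validation does with the training rows, here is a minimal sketch of how a 10-fold split can assign row indices. It is plain Python written for illustration, not PyCaret's code: it skips shuffling, and the name `k_fold_indices` is made up. The row count 3779 is the transformed train set size reported by `setup` above.

```python
def k_fold_indices(n_rows, n_folds=10):
    """Yield (train_idx, valid_idx) pairs; each row is validated exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one
    sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
             for i in range(n_folds)]
    start = 0
    for size in sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, valid
        start += size


folds = list(k_fold_indices(3779))
print(len(folds), [len(valid) for _, valid in folds])
```

Each model is trained ten times, once per fold, and the per-fold scores are what the fold tables in the following cells report.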

In [10]:
# you can tune a model's hyperparameters with the tune_model function
# tune the first model

tuned_rf = tune_model(rf)

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 2189.4764 | 15505675.035 | 3937.7246 | 0.8241 | 0.2169 | 0.1705 |
| 1 | 2391.031 | 18706301.5044 | 4325.0782 | 0.7978 | 0.2405 | 0.189 |
| 2 | 2887.72 | 32696987.5415 | 5718.128 | 0.7224 | 0.2511 | 0.179 |
| 3 | 2507.2203 | 23518586.9283 | 4849.5966 | 0.7648 | 0.2379 | 0.1752 |
| 4 | 2578.2038 | 29534057.4413 | 5434.5246 | 0.7173 | 0.2428 | 0.1806 |
| 5 | 2801.8834 | 35692541.3765 | 5974.3235 | 0.7057 | 0.2469 | 0.178 |
| 6 | 2496.5384 | 26971159.8812 | 5193.3765 | 0.7383 | 0.2468 | 0.1856 |
| 7 | 2436.7576 | 28495974.8685 | 5338.1621 | 0.6927 | 0.2358 | 0.1793 |
| 8 | 2822.2804 | 30409469.2153 | 5514.4781 | 0.7063 | 0.2644 | 0.1938 |
| 9 | 2481.1003 | 24424494.9619 | 4942.1144 | 0.7698 | 0.2298 | 0.1725 |


In [11]:
# tune the second model
tuned_lgbm = tune_model(lgbm)

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 993.9446 | 4895293.264 | 2212.531 | 0.9445 | 0.097 | 0.0709 |
| 1 | 928.8535 | 3541278.6427 | 1881.8285 | 0.9617 | 0.0942 | 0.0707 |
| 2 | 1347.9678 | 9593573.5512 | 3097.3494 | 0.9185 | 0.1247 | 0.0846 |
| 3 | 967.337 | 3531806.8269 | 1879.3102 | 0.9647 | 0.1017 | 0.0722 |
| 4 | 1070.1763 | 7099651.443 | 2664.5171 | 0.9321 | 0.1024 | 0.0749 |
| 5 | 1295.7907 | 12400389.8402 | 3521.4187 | 0.8978 | 0.1166 | 0.0792 |
| 6 | 1102.8269 | 8038156.6696 | 2835.1643 | 0.922 | 0.1076 | 0.0787 |
| 7 | 997.9812 | 9099924.6338 | 3016.6081 | 0.9019 | 0.1077 | 0.0758 |
| 8 | 1212.32 | 5607084.3918 | 2367.9283 | 0.9459 | 0.1119 | 0.0808 |
| 9 | 1199.7918 | 6593894.8148 | 2567.858 | 0.9378 | 0.1076 | 0.0794 |


In [12]:
# show the hyperparameters of the tuned random forest
plot_model(tuned_rf, plot='parameter')

|   | Parameters |
|---|---|
| handle | <raft.common.handle.Handle object at 0x7f5d588... |
| verbose | 4 |
| output_type | input |
| n_estimators | 60 |
| max_depth | 10 |
| max_features | 1.0 |
| n_bins | 128 |
| split_criterion | 2 |
| min_samples_leaf | 1 |
| min_samples_split | 2 |


In [13]:
# show the hyperparameters of the tuned LightGBM model
plot_model(tuned_lgbm, plot='parameter')

|   | Parameters |
|---|---|
| boosting_type | gbdt |
| class_weight |  |
| colsample_bytree | 1.0 |
| importance_type | split |
| learning_rate | 0.4 |
| max_depth | -1 |
| min_child_samples | 6 |
| min_child_weight | 0.001 |
| min_split_gain | 0.3 |
| n_estimators | 20 |


In [14]:
# blend the two tuned models
blender = blend_models(estimator_list = [tuned_rf, tuned_lgbm])

| Fold | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|
| 0 | 1300.1692 | 7255406.5953 | 2693.5862 | 0.9177 | 0.1337 | 0.0992 |
| 1 | 1377.1351 | 7358642.1133 | 2712.6817 | 0.9205 | 0.1439 | 0.1094 |
| 2 | 1909.6344 | 17981479.2538 | 4240.4574 | 0.8473 | 0.1677 | 0.1137 |
| 3 | 1507.2092 | 10248853.2611 | 3201.383 | 0.8975 | 0.1476 | 0.1032 |
| 4 | 1597.6307 | 15797039.3326 | 3974.5489 | 0.8488 | 0.1516 | 0.1087 |
| 5 | 1783.1549 | 20590381.2694 | 4537.6625 | 0.8302 | 0.1597 | 0.109 |
| 6 | 1545.6572 | 14628639.101 | 3824.7404 | 0.8581 | 0.1537 | 0.1105 |
| 7 | 1510.7804 | 16239588.8031 | 4029.8373 | 0.8249 | 0.1499 | 0.1086 |
| 8 | 1820.6324 | 13981033.3238 | 3739.122 | 0.865 | 0.1662 | 0.1215 |
| 9 | 1574.0616 | 11056170.8434 | 3325.0821 | 0.8958 | 0.144 | 0.1068 |
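
The blended model is reported further down as a "Voting Regressor", which in essence averages the predictions of its member models. A minimal sketch of that averaging, in plain Python with made-up numbers (the function name `blend_predictions` is hypothetical and this is not PyCaret's code):

```python
def blend_predictions(per_model_predictions, weights=None):
    """Average the predictions of several models, row by row."""
    n_models = len(per_model_predictions)
    if weights is None:
        weights = [1.0] * n_models  # equal weights give a plain average
    total_weight = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total_weight
        for row in zip(*per_model_predictions)
    ]


rf_pred = [5000.0, 3400.0, 3100.0]    # made-up predictions from model A
lgbm_pred = [5200.0, 3500.0, 3200.0]  # made-up predictions from model B
blended = blend_predictions([rf_pred, lgbm_pred])
print(blended)  # prints: [5100.0, 3450.0, 3150.0]
```

Because the result is an average, a weak member pulls the blend toward its own errors, which is worth keeping in mind when you choose which models to blend.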


In [15]:
# finalize model
model_final = finalize_model(blender)

# predict on the unseen data with the final model
predictions = predict_model(model_final, data = data_unseen)



|   | Model | MAE | MSE | RMSE | R2 | RMSLE | MAPE |
|---|---|---|---|---|---|---|---|
| 0 | Voting Regressor | 1478.401 | 9223053.4744 | 3036.9481 | 0.8967 | 0.1428 | 0.1066 |
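
For reference, the metric columns can be computed by hand as in the sketch below. These are the standard textbook definitions, which I assume match PyCaret's columns. The first list reuses the four `Price` values from the dataset preview; the predictions are made up. The sketch requires positive target values that are not all equal.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE, R2, RMSLE and MAPE for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot  # 1 minus residual sum of squares over total
    rmsle = math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)) / n
    )
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n  # a fraction, not %
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse),
            "R2": r2, "RMSLE": rmsle, "MAPE": mape}


actual = [5169.0, 3470.0, 3183.0, 4370.0]     # prices from the dataset preview
predicted = [5000.0, 3500.0, 3300.0, 4200.0]  # made-up predictions
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are in the units of the target (here, price), while R2, RMSLE and MAPE are unitless, which makes them easier to compare across datasets.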


In [16]:
# save model for deployment
save_model(model_final, "Final Model")

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[], ml_usecase='regression',
                                       numerical_features=[], target='Price',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_strategy='...
                                                                                                       learning_rate=0.4,
                                                                                                       max_depth=-1,
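
To reuse the saved pipeline later, for example in a deployment script, you can load it back with PyCaret's `load_model`. The wrapper below is a cautious sketch: the function name `load_saved_pipeline` is made up, and it returns `None` when the file or PyCaret itself is not available instead of raising.

```python
import os


def load_saved_pipeline(model_name="Final Model"):
    """Load a pipeline written by save_model, or return None if unavailable."""
    # save_model appends the .pkl extension to the name it is given
    if not os.path.exists(f"{model_name}.pkl"):
        return None
    try:
        from pycaret.regression import load_model
    except ImportError:
        return None
    return load_model(model_name)


pipeline = load_saved_pipeline()
print("loaded" if pipeline is not None else "no saved pipeline found")
```

With a loaded pipeline, `predict_model(pipeline, data=data_unseen)` should apply the same transformations and the final model to new rows.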
                                                                          

# Conclusion

Here I list the advantages and disadvantages I found while using PyCaret with cuML.

__Advantages__
1. Easy to use: the code is simple and easy to understand.
2. PyCaret is easy to install and is available via pip.
3. PyCaret saves a lot of time on experimenting and coding, since you can train multiple models with a single line of code.
4. When working with a lot of data, cuML lets you use the GPU for machine learning training.

__Disadvantages__
1. Setting up the GPU is hard.
2. cuML and the RAPIDS integrations only support Linux. Fortunately, Windows users can install WSL and run Ubuntu on a Windows machine.

These are the pros and cons I have seen so far. If you have your own thoughts, try it out and judge for yourself whether it is a good and convenient fit for your task or case study.

End of Notebook