<div style="background-color:rgba(0, 167, 255, 0.6);border-radius:5px;display:fill">
    <h1><center>Tabular Playground Series - Feb 2022</center></h1>
</div>

<center><a><img src="https://i.ibb.co/PWvpT9F/header.png" alt="header" border="0" width=800 height=400 class="center"></a></center>

<h1> Fast Random Forest and Intel® Extension for Scikit-learn* - Kaggle Tabular Playground Series - February 2022 </h1>

For classical machine learning algorithms, we often use Scikit-learn, one of the most popular Python libraries. With Scikit-learn you can fit models and search for optimal parameters, but these steps can sometimes run for hours. Speeding up this process is something anyone who uses Scikit-learn would be interested in.

I want to show you how to use the Scikit-learn library and get results faster without changing your code. To do this, we will use another Python library, [**Intel® Extension for Scikit-learn***](https://github.com/intel/scikit-learn-intelex). It accelerates Scikit-learn and does not require you to change code written for Scikit-learn.

I will show you how to **speed up** your kernel without changing your code!

### Intel® Extension for Scikit-learn installation:

In [None]:
!pip install scikit-learn-intelex -q --progress-bar off

### Import Libraries

In [None]:
import pandas as pd
import numpy as np
import warnings
import gc
from IPython.display import HTML
warnings.filterwarnings("ignore")

from timeit import default_timer as timer
import matplotlib.pyplot as plt

random_state = 42

### Reading Data

In [None]:
PATH_TRAIN      = '../input/tabular-playground-series-feb-2022/train.csv'
PATH_TEST       = '../input/tabular-playground-series-feb-2022/test.csv'
PATH_SUBMISSION = '../input/tabular-playground-series-feb-2022/sample_submission.csv'

In [None]:
tPF = timer()
train_data = pd.read_csv(PATH_TRAIN)
test_data  = pd.read_csv(PATH_TEST)
submission = pd.read_csv(PATH_SUBMISSION)
tPS = timer()

In [None]:
print("Data reading with default pandas time: {}".format(tPS - tPF))

### Fast Reading Data

<center><a><img src="https://modin.readthedocs.io/en/stable/_static/MODIN_ver2.png" alt="header" border="0" width=300 height=200 class="center"></a></center>

Modin is a drop-in replacement for pandas. While pandas is single-threaded, Modin lets you instantly speed up your workflows by scaling pandas so it uses all of your cores. Modin works especially well on larger datasets, where pandas becomes painfully slow or runs out of memory.

### Modin installation:

In [None]:
!pip install modin

In [None]:
import modin.pandas as pd

In [None]:
tMF = timer()
train_data = pd.read_csv(PATH_TRAIN)
test_data  = pd.read_csv(PATH_TEST)
submission = pd.read_csv(PATH_SUBMISSION)
tMS = timer()

In [None]:
print("Data reading with Modin time: {}".format(tMS - tMF))

In [None]:
modin_speedup = round((tPS - tPF) / (tMS - tMF), 2)
HTML(f'<h2>Reading data speedup: {modin_speedup}x</h2>'
     f'(from {round((tPS - tPF), 2)} to {round((tMS - tMF), 2)} seconds)')
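
The start/stop `timer()` pairs used throughout this notebook can also be wrapped in a small helper. The sketch below is only an illustration: the names `timed`, `speedup`, and `timings` are hypothetical and do not appear in the notebook, and the two `sum(...)` calls are stand-ins for a slow and a fast step.

```python
from contextlib import contextmanager
from timeit import default_timer as timer

@contextmanager
def timed(label, results):
    """Store the wall-clock duration of the enclosed block in results[label]."""
    start = timer()
    try:
        yield
    finally:
        results[label] = timer() - start

def speedup(results, baseline, candidate):
    """How many times faster `candidate` ran compared with `baseline`."""
    return round(results[baseline] / results[candidate], 2)

timings = {}
with timed("slow", timings):
    sum(i * i for i in range(200_000))   # stand-in for the baseline step
with timed("fast", timings):
    sum(i * i for i in range(20_000))    # stand-in for the accelerated step

print(f"speedup: {speedup(timings, 'slow', 'fast')}x")
```

Keeping all durations in one dictionary makes it harder to mix up the start and stop variables when several steps are compared.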

In [None]:
X, y = train_data.drop(['target'], axis = 1), train_data['target']

## Scikit-learn-intelex

With Intel® Extension for Scikit-learn you can accelerate your Scikit-learn applications and still have full conformance with all Scikit-learn APIs and algorithms. Intel® Extension for Scikit-learn* is a free software AI accelerator that can deliver **10-100X acceleration** across a variety of applications.

You can find more information in [Introduction to scikit-learn-intelex](https://www.kaggle.com/lordozvlad/introduction-to-scikit-learn-intelex/notebook).

### Accelerate Scikit-learn with two lines of code:

In [None]:
from sklearnex import patch_sklearn
patch_sklearn()

Set up logging to track accelerated cases:

In [None]:
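import logging

# Send log records to log.txt so accelerated calls can be listed later
logger = logging.getLogger()
fh = logging.FileHandler('log.txt')
fh.setLevel(logging.DEBUG)  # same as the numeric level 10
logger.addHandler(fh)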

## Bayesian methods of hyperparameter optimization

Bayesian optimization works by constructing a posterior distribution of functions (a Gaussian process) that best describes the function you want to optimize. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not, as seen in the picture below.

<img src="https://github.com/fmfn/BayesianOptimization/blob/master/examples/bo_example.png?raw=true" />
As you iterate over and over, the algorithm balances exploration and exploitation, taking into account what it knows about the target function. At each step a Gaussian process is fitted to the known samples (points previously explored), and the posterior distribution, combined with an exploration strategy such as UCB (Upper Confidence Bound) or EI (Expected Improvement), is used to determine the next point that should be explored (see the gif below).
<img src="https://github.com/fmfn/BayesianOptimization/raw/master/examples/bayesian_optimization.gif" />
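
The loop described above can be sketched in a few lines of NumPy. This is a minimal one-dimensional illustration, not the implementation used by the `bayes_opt` package. The kernel choice, the length scale, the candidate grid, and every function name (`rbf_kernel`, `gp_posterior`, `expected_improvement`, `toy_target`, `bayes_opt_1d`) are assumptions made for the example.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_new."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_on = rbf_kernel(x_obs, x_new)
    solved = np.linalg.solve(k_oo, k_on)           # K^-1 k*
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_on * solved, axis=0)      # prior variance is 1
    return mean, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mean, std, best, xi=0.01):
    """Expected Improvement acquisition for a maximization problem."""
    gain = mean - best - xi
    z = gain / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return gain * cdf + std * pdf

def toy_target(x):
    """Toy function to maximize; its maximum is at x = 0.7."""
    return -(x - 0.7) ** 2

def bayes_opt_1d(target, n_init=2, n_iter=10, seed=0):
    """Random initial points, then repeatedly sample where EI is highest."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)              # candidate points
    x_obs = rng.uniform(0.0, 1.0, size=n_init)
    y_obs = target(x_obs)
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        ei = expected_improvement(mean, std, y_obs.max())
        x_next = grid[np.argmax(ei)]               # next point to explore
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, target(x_next))
    return x_obs[np.argmax(y_obs)], y_obs.max(), len(x_obs)

best_x, best_y, n_evaluations = bayes_opt_1d(toy_target)
print(best_x, best_y, n_evaluations)
```

Here `n_init` and `n_iter` play the same roles as `init_points` and `n_iter` in the optimizer call later in this notebook: a few random evaluations first, then evaluations chosen by the acquisition function.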

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = random_state)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

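def bo_params_rf(max_samples, max_features):
    """Objective for Bayesian optimization: hold-out accuracy of a Random Forest."""
    params = {
        'max_samples' : max_samples,
        'max_features' : max_features,
    }

    clf = RandomForestClassifier(**params, random_state = random_state)
    clf.fit(X_train, y_train)

    # Score on the hold-out split; the optimizer maximizes this value
    score = accuracy_score(y_test, clf.predict(X_test))

    return score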

In [None]:
from bayes_opt import BayesianOptimization
rf_bo = BayesianOptimization(bo_params_rf, {
                                             'max_samples': (0.5, 0.9),
                                             'max_features':(0.5, 0.9)
                                            })

In [None]:
results = rf_bo.maximize(n_iter = 2, init_points = 2, acq = 'ei')

## RandomForest with optimized Scikit-learn

In [None]:
params = rf_bo.max['params']

slfOpt = RandomForestClassifier(**params, n_estimators = 300, random_state = 42)

tFO = timer()
slfOpt.fit(X, y)
tSO = timer()

In [None]:
print("Total fitting Random Forest time with optimized Scikit-learn: {} seconds".format(tSO - tFO))

### List of algorithms which are accelerated by sklearnex

In [None]:
!cat log.txt | grep 'running accelerated version' | sort | uniq
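
If shell tools such as `grep` are not available, roughly the same filtering can be done in Python. This is a sketch under the assumption that `log.txt` is a plain-text file in the working directory; the helper name `accelerated_calls` is hypothetical, and Python's default string ordering may differ slightly from the locale-aware ordering of `sort`.

```python
from pathlib import Path

def accelerated_calls(log_path, marker="running accelerated version"):
    """Return the sorted, de-duplicated log lines that contain `marker`."""
    path = Path(log_path)
    if not path.exists():
        return []                      # nothing has been logged yet
    lines = path.read_text().splitlines()
    return sorted({line for line in lines if marker in line})

for line in accelerated_calls("log.txt"):
    print(line)
```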

## RandomForest with default Scikit-learn

In [None]:
from sklearnex import unpatch_sklearn
unpatch_sklearn()
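
`RandomForestClassifier` is imported again in the next cell because a name imported while the patch was active may still point to the accelerated class. The toy sketch below illustrates that effect, assuming that patching works by swapping names on the library's modules. `toy_lib`, `StockForest`, `FastForest`, `patch`, and `unpatch` are hypothetical stand-ins, not scikit-learn or sklearnex code.

```python
import types

# Toy stand-in for a library module; NOT scikit-learn or sklearnex.
toy_lib = types.ModuleType("toy_lib")

class StockForest:
    backend = "stock"

class FastForest:
    backend = "accelerated"

toy_lib.Forest = StockForest

def patch():
    toy_lib.Forest = FastForest      # swap the attribute on the module

def unpatch():
    toy_lib.Forest = StockForest     # restore the original attribute

patch()
Forest = toy_lib.Forest              # like `from toy_lib import Forest`
unpatch()

stale = Forest.backend               # name bound before unpatch()
Forest = toy_lib.Forest              # re-import after unpatch()
fresh = Forest.backend
print(stale, fresh)                  # prints "accelerated stock"
```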

In [None]:
from sklearn.ensemble import RandomForestClassifier

params = rf_bo.max['params']

slf = RandomForestClassifier(**params, n_estimators = 300, random_state = 42)

tFD = timer()
slf.fit(X, y)
tSD = timer()

In [None]:
print("Total fitting Random Forest time with default Scikit-learn: {} seconds".format(tSD - tFD))

In [None]:
rf_speedup = round((tSD - tFD) / (tSO - tFO), 2)
HTML(f'<h2>RandomForest speedup: {rf_speedup}x</h2>'
     f'(from {round((tSD - tFD), 2)} to {round((tSO - tFO), 2)} seconds)')

# Prediction

In [None]:
predictions = slfOpt.predict(test_data)
submission['target'] = predictions
submission[:5]

In [None]:
submission.to_csv("submission.csv", index = False)

# Conclusion

**Intel® Extension for Scikit-learn** gives you opportunities to:
* Use your Scikit-learn code for training and inference without modification.
* Speed up your kernel.

*Please upvote if you liked it.*

# Other notebooks with scikit-learn-intelex usage

### [[predict sales] Stacking with scikit-learn-intelex](https://www.kaggle.com/alexeykolobyanin/predict-sales-stacking-with-scikit-learn-intelex)

### [[TPS-Aug] NuSVR with Intel Extension for Sklearn](https://www.kaggle.com/alexeykolobyanin/tps-aug-nusvr-with-intel-extension-for-sklearn)

### [Using scikit-learn-intelex for What's Cooking](https://www.kaggle.com/kppetrov/using-scikit-learn-intelex-for-what-s-cooking?scriptVersionId=58739642)

### [Fast KNN using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-knn-using-scikit-learn-intelex-for-mnist?scriptVersionId=58738635)

### [Fast SVC using scikit-learn-intelex for MNIST](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-mnist?scriptVersionId=58739300)

### [Fast SVC using scikit-learn-intelex for NLP](https://www.kaggle.com/kppetrov/fast-svc-using-scikit-learn-intelex-for-nlp?scriptVersionId=58739339)

### [Fast AutoML with Intel Extension for Scikit-learn](https://www.kaggle.com/lordozvlad/fast-automl-with-intel-extension-for-scikit-learn)

### [[Titanic] AutoML with Intel Extension for Sklearn](https://www.kaggle.com/lordozvlad/titanic-automl-with-intel-extension-for-sklearn)