# Warning

CatBoost is hosted in Russia via Yandex servers. Please access [catboost.ai](https://catboost.ai) with care.

# CatBoost Test

This notebook contains a sample use of the CatBoost library to assess its feasibility for the study

**[Evaluation and Comparison of Boosted ML Models in Behavior-Based Malware Detection]**

## GPU Support

CatBoost supports training on GPUs. Refer to this [link](https://catboost.ai/en/docs/features/training-on-gpu) for more information.
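As a minimal sketch of how GPU training might be requested, the helper below builds keyword arguments for `catb.CatBoostClassifier`. `task_type` and `devices` both appear in the classifier's parameter list in section 4; the helper name `gpu_params` and the device ID `"0"` are assumptions made for illustration only.

```python
def gpu_params(use_gpu, device_ids="0"):
    """Return CatBoost keyword arguments selecting CPU or GPU training.

    Hypothetical helper for this notebook, not part of CatBoost itself.
    """
    if use_gpu:
        # task_type="GPU" requests GPU training;
        # devices picks the GPU IDs (assumed here: the first GPU, "0").
        return {"task_type": "GPU", "devices": device_ids}
    return {"task_type": "CPU"}


print(gpu_params(True))
print(gpu_params(False))

# Usage, once catboost is imported as catb and a supported GPU is present:
# model = catb.CatBoostClassifier(**gpu_params(True))
```

Keeping the device choice in one place makes it easy to fall back to CPU on machines without a supported GPU.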

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split #For Splitting Datasets
from sklearn import preprocessing #For LabelEncoding
from sklearn.metrics import classification_report #For Classification Report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay #For Confusion Matrix
import matplotlib.pyplot as plt #For figure plotting.
from sklearn.model_selection import RandomizedSearchCV #For automated hyperparameter tuning; Would be better if it was GridSearchCV

# 1. Installation

*This test uses only the Python version of CatBoost. It can be installed either through Python's pip or through Conda (via Anaconda). For this test, we use the Python distribution that ships with Anaconda and install the package with pip from the Anaconda terminal.*

1. Open your Anaconda Terminal
2. Enter `pip install catboost`

# 2. Verifying Library Installation

*If the import below raises no error, the library is installed correctly.*

In [2]:
#Verifying installation of CatBoost
import catboost as catb
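If a hard import error is not wanted, the check can also be done without importing the package. The sketch below uses the standard library's `importlib.util.find_spec`; the helper name `is_installed` is an assumption for illustration.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# 'json' ships with Python, so this check always succeeds.
print("json installed:", is_installed("json"))

# 'catboost' is only found after it has been installed (e.g. `pip install catboost`).
print("catboost installed:", is_installed("catboost"))
```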

# 3. Sample Dataset

*This demo uses the [crops dataset](https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset), similar to the one used in [this notebook](https://github.com/jm55/CSINTSY-MCO-5/blob/main/Machine%20Learning/notebook-v2.2.ipynb).*

In [3]:
crops_df = pd.read_csv('crops_dataset.csv')
crops_df

| Unnamed: 0 | nitrogen | phosphorus | potassium | temperature | humidity | ph | rainfall | label |
|---|---|---|---|---|---|---|---|---|
| 0 | 90 | 42 | 43 | 20.879744 | 82.002744 | 6.502985 | 202.935536 | rice |
| 1 | 85 | 58 | 41 | 21.770462 | 80.319644 | 7.038096 | 226.655537 | rice |
| 2 | 60 | 55 | 44 | 23.004459 | 82.320763 | 7.840207 | 263.964248 | rice |
| 3 | 74 | 35 | 40 | 26.491096 | 80.158363 | 6.980401 | 242.864034 | rice |
| 4 | 78 | 42 | 42 | 20.130175 | 81.604873 | 7.628473 | 262.717340 | rice |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2195 | 107 | 34 | 32 | 26.774637 | 66.413269 | 6.780064 | 177.774507 | coffee |
| 2196 | 99 | 15 | 27 | 27.417112 | 56.636362 | 6.086922 | 127.924610 | coffee |
| 2197 | 118 | 33 | 30 | 24.131797 | 67.225123 | 6.362608 | 173.322839 | coffee |
| 2198 | 117 | 32 | 34 | 26.272418 | 52.127394 | 6.758793 | 127.175293 | coffee |


In [4]:
len(crops_df['label'].unique())

22

## Reminder

Note that this is a multi-class dataset, so the output is not simply 0 or 1 as it is in the official thesis document. Expect some differences in the real study.
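One practical difference can be sketched as follows: CatBoost's `loss_function` parameter accepts `'Logloss'` for binary targets and `'MultiClass'` for multi-class targets. The helper `pick_loss_function` is hypothetical and only makes the distinction explicit; it is not how the study itself is configured.

```python
def pick_loss_function(n_classes):
    """Return a CatBoost loss_function name suited to the number of classes.

    Hypothetical helper for illustration only.
    """
    if n_classes < 2:
        raise ValueError("A classifier needs at least two classes.")
    # Binary targets use Logloss; three or more classes use MultiClass.
    return "Logloss" if n_classes == 2 else "MultiClass"


print(pick_loss_function(22))  # the crops dataset has 22 labels
print(pick_loss_function(2))   # a binary malware / benign target

# Usage, once catboost is imported as catb:
# model = catb.CatBoostClassifier(loss_function=pick_loss_function(22))
```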

# 4. Implementing CatBoost Classifier

*This demonstrates the scikit-learn-like use of CatBoost. The remaining parameters, along with related functions and properties, are described in this [link](https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier)*.

Complete parameter list (not all are used in the example):

`catb.CatBoostClassifier(iterations=None, learning_rate=None, depth=None, l2_leaf_reg=None, model_size_reg=None, rsm=None, loss_function=None, border_count=None, feature_border_type=None, per_float_feature_quantization=None, input_borders=None, output_borders=None, fold_permutation_block=None, od_pval=None, od_wait=None, od_type=None, nan_mode=None, counter_calc_method=None, leaf_estimation_iterations=None, leaf_estimation_method=None, thread_count=None, random_seed=None, use_best_model=None, verbose=None, logging_level=None, metric_period=None, ctr_leaf_count_limit=None, store_all_simple_ctr=None, max_ctr_complexity=None, has_time=None, allow_const_label=None, classes_count=None, class_weights=None, auto_class_weights=None, one_hot_max_size=None, random_strength=None, name=None, ignored_features=None, train_dir=None, custom_loss=None, custom_metric=None, eval_metric=None, bagging_temperature=None, save_snapshot=None, snapshot_file=None, snapshot_interval=None, fold_len_multiplier=None, used_ram_limit=None, gpu_ram_part=None, allow_writing_files=None, final_ctr_computation_mode=None, approx_on_full_history=None, boosting_type=None, simple_ctr=None, combinations_ctr=None, per_feature_ctr=None, task_type=None, device_config=None, devices=None, bootstrap_type=None, subsample=None, sampling_unit=None, dev_score_calc_obj_block_size=None, max_depth=None, n_estimators=None, num_boost_round=None, num_trees=None, colsample_bylevel=None, random_state=None, reg_lambda=None, objective=None, eta=None, max_bin=None, scale_pos_weight=None, gpu_cat_features_storage=None, data_partition=None, metadata=None, early_stopping_rounds=None, cat_features=None, grow_policy=None, min_data_in_leaf=None, min_child_samples=None, max_leaves=None, num_leaves=None, score_function=None, leaf_estimation_backtracking=None, ctr_history_unit=None, monotone_constraints=None, feature_weights=None, penalties_coefficient=None, first_feature_use_penalties=None, model_shrink_rate=None, model_shrink_mode=None, langevin=None, diffusion_temperature=None, posterior_sampling=None, boost_from_average=None, text_features=None, tokenizers=None, dictionaries=None, feature_calcers=None, text_processing=None, fixed_binary_splits=None)`

## 4.1. Loading Classifier

In [5]:
#Loading CatBoostClassifier as an object
catbClassifier = catb.CatBoostClassifier()

## 4.2. Splitting Datasets to Train and Test Datasets

In [6]:
#Splitting datasets to train and test datasets
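#Use every column except the saved row index ('Unnamed: 0') and the target ('label').
#The previous slice [0:len(features)-2] kept the row index and dropped 'rainfall'.
features = [col for col in crops_df.columns if col not in ('Unnamed: 0', 'label')]

le = preprocessing.LabelEncoder()
labels = le.fit_transform(crops_df['label']) #Converting string labels to integer codes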

X,y = crops_df[features],labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)
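To make the label-encoding step concrete, here is a pure-Python sketch of what `LabelEncoder.fit_transform` and `inverse_transform` do conceptually: the distinct labels are sorted, and each label is replaced by its position in that sorted list. The function name `fit_label_codes` and the four-item sample are assumptions for illustration; the notebook itself uses scikit-learn's encoder.

```python
def fit_label_codes(labels):
    """Return (sorted class names, name -> integer code mapping)."""
    classes = sorted(set(labels))
    to_code = {name: code for code, name in enumerate(classes)}
    return classes, to_code


# Tiny illustrative sample using two labels visible in the dataset preview.
sample = ["rice", "coffee", "rice", "coffee"]

classes, to_code = fit_label_codes(sample)
encoded = [to_code[name] for name in sample]   # like fit_transform
decoded = [classes[code] for code in encoded]  # like inverse_transform

print(classes)
print(encoded)
print(decoded == sample)
```

Because the codes depend on sorted order, the same encoder object must be reused to decode predictions, which is why `le` is kept around for the results section.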

## 4.3. Loading Hyperparameter Tuning

Parameters: https://catboost.ai/en/docs/concepts/python-reference_catboostclassifier

Note that the hyperparameter values in the tuning choices here are not exhaustive: some values were omitted because they caused errors or to improve tuning speed on the available hardware.

Booster `'dart'` may perform better, but at an extreme cost in time: tests suggest it takes ~30 minutes per iteration during RandomizedSearchCV, so it was removed from this demo. Note that `'dart'` is an XGBoost booster option and is not a CatBoost parameter.

In [7]:
param = {'sampling_frequency':['PerTree','PerTreeLevel'], 'learning_rate': [0.1], 'depth': [None], 
         'l2_leaf_reg': [0,0.1,1], 'thread_count':[-1], 'classes_count':[22], 
         'grow_policy':['SymmetricTree','Depthwise','Lossguide'],
         'auto_class_weights':['Balanced','SqrtBalanced'],}

tuner = RandomizedSearchCV(catbClassifier, param, verbose=2, n_jobs=2, cv=3, refit=True, error_score=0, random_state=1)
tuner.fit(X_train,y_train)
print("Best Score:", tuner.best_score_)
print("Best Params:", tuner.best_params_)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


9 fits failed out of a total of 30.
The score on these train-test partitions for these parameters will be set to 0.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
6 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\ejose\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\ejose\anaconda3\Lib\site-packages\catboost\core.py", line 5100, in fit
    self._fit(X, y, cat_features, text_features, embedding_features, None, sample_weight, None, None, None, None, baseline, use_best_model,
  File "C:\Users\ejose\anaconda3\Lib\site-packages\catboost\core.py", line 2319, in _fit
    self._train(
  File "C:\Users\ejose\anaconda3\Lib\site-packages\catboost\core.py", line 1723, in _t

KeyError: 'depth'
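The fit counts in the log above can be checked with a little arithmetic. The sketch below counts the combinations in the `param` dictionary from the cell above and applies `RandomizedSearchCV`'s default of `n_iter=10` sampled candidates with `cv=3` folds. The variable names are assumptions for illustration; the failure count of 9 is copied from the log.

```python
from math import prod

# Same search space as the tuning cell above.
param = {
    'sampling_frequency': ['PerTree', 'PerTreeLevel'],
    'learning_rate': [0.1],
    'depth': [None],
    'l2_leaf_reg': [0, 0.1, 1],
    'thread_count': [-1],
    'classes_count': [22],
    'grow_policy': ['SymmetricTree', 'Depthwise', 'Lossguide'],
    'auto_class_weights': ['Balanced', 'SqrtBalanced'],
}

# Every combination is one value picked from each list.
n_combinations = prod(len(values) for values in param.values())

n_iter = 10  # RandomizedSearchCV default number of sampled candidates
cv = 3       # folds used in the tuning cell

# The search cannot sample more candidates than the grid contains.
n_candidates = min(n_iter, n_combinations)
total_fits = n_candidates * cv

failed_fits = 9  # reported in the log above
successful_fits = total_fits - failed_fits

print("combinations in grid:", n_combinations)
print("total fits:", total_fits)
print("successful fits:", successful_fits)
```

Since the search samples only 10 of the 36 combinations, a different `random_state` may pick candidates that fail more or less often. With `error_score=0`, each failed fit is scored 0, which lowers the mean score of the affected candidates.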

## 4.4. Loading Tuned Parameters to Model

In [None]:
#Reloading model with better parameters
catbClassifier = catb.CatBoostClassifier(**tuner.best_params_)

#Fitting/Training model
catbClassifier.fit(X_train, y_train)

## 4.5. Results

In [None]:
#Testing Predictions
y_pred = catbClassifier.predict(X_test)

#Create confusion matrix
catb_cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=catb_cm)
disp.plot()
plt.show()

#Converting LabelEncoded to String Labels
y_pred_str = le.inverse_transform(y_pred)
y_test_str = le.inverse_transform(y_test)

#Create classification report
catb_cr = classification_report(y_test_str, y_pred_str, digits=4)
print(catb_cr)
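To show how the classification report relates to the confusion matrix, the sketch below derives per-class precision and recall from a small matrix in scikit-learn's layout (rows are true classes, columns are predicted classes). The 3x3 matrix is made-up illustrative data, not a result from this notebook, and the function names are assumptions.

```python
def per_class_metrics(cm):
    """Return a list of (precision, recall) tuples, one per class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        true_pos = cm[k][k]
        predicted_k = sum(cm[row][k] for row in range(n))  # column sum
        actual_k = sum(cm[k])                              # row sum
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        metrics.append((precision, recall))
    return metrics


def accuracy(cm):
    """Return the share of samples on the diagonal (correct predictions)."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


# Hypothetical confusion matrix for three classes (illustrative only).
example_cm = [
    [5, 1, 0],
    [0, 4, 1],
    [1, 0, 3],
]

for label, (precision, recall) in enumerate(per_class_metrics(example_cm)):
    print(f"class {label}: precision={precision:.4f} recall={recall:.4f}")
print(f"accuracy={accuracy(example_cm):.4f}")
```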

# 5. Saving and Loading Model

In [None]:
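#Saving the trained model to JSON using CatBoost's own serializer
catbClassifier.save_model("saved.json", format="json")

#Loading the saved model into a fresh classifier object
loaded_model = catb.CatBoostClassifier()
loaded_model.load_model("saved.json", format="json")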

y_pred = loaded_model.predict(X_test)

#Converting LabelEncoded to String Labels
y_pred_str = le.inverse_transform(y_pred)
y_test_str = le.inverse_transform(y_test)

#Create classification report
loaded_cr = classification_report(y_test_str, y_pred_str, digits=4)
print(loaded_cr)