# $$CatBoost\ Feature\ Importance\ Tutorial$$

*Original source of the notebook: https://github.com/catboost/tutorials/blob/master/model_analysis/feature_importance_tutorial.ipynb*
**Credits to the creators of the original notebook.**

### Weights and Biases

In [None]:
# Install W&B client
%%bash
git clone https://github.com/wandb/client/
cd client
python setup.py install

# when the functionality is released on PyPI you would do the below:
#!pip install wandb

In [None]:
# Go to https://www.wandb.com/, create an account or login if you already have an account
# Create your project and ensure you have obtained a W&B Token (after logging in and creating your project)

!wandb login ${WANDB_TOKEN}

In [1]:
import wandb
from wandb.catboost import plot_feature_importances

In [2]:
# Here you would enter the name of your project created in the above step

wandb.init(project='catboost_plot_feature_importances')

[34m[1mwandb[0m: Wandb version 0.8.36 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


W&B Run: https://app.wandb.ai/neomatrix369/catboost_plot_feature_importances/runs/3u8me0as

### Catboost Feature Importance

#### Sometimes it is very important to understand which feature made the greatest contribution to the final result. To do this, the CatBoost model has a get_feature_importance method.

In [3]:
import numpy as np
from catboost import CatBoostRegressor, Pool
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

#### First, let's prepare the dataset:

In [4]:
%%time
full_dataset_dict = load_iris(False)

CPU times: user 2.59 ms, sys: 0 ns, total: 2.59 ms
Wall time: 1.84 ms


In [5]:
full_dataset_dict.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])

In [6]:
X, y = np.array(full_dataset_dict['data'])[:1000], np.array(full_dataset_dict['data'])[:1000]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train = X_train.reshape(-1,1)
X_test = X_test.reshape(-1,1)
y_train = y_train.reshape(-1,1)
y_test = y_test.reshape(-1,1)

train_pool = Pool(X_train, label=y_train)
test_pool = Pool(X_test, label=y_test)

#### Let's train CatBoost:

In [10]:
%%time
params = {'iterations': 20, 'learning_rate': 0.07, 'depth': 16, 'eval_metric': 'RMSE',
          'od_wait': 100, 'allow_writing_files': True, 'verbose': True} #  'n_estimators': 1000', #'grow_policy': 'Lossguide'
model = CatBoostRegressor(random_seed=1234, **params)
best_model = model.fit(train_pool, eval_set=test_pool, use_best_model=True, verbose=True)

0:	learn: 1.8330124	test: 1.8892445	best: 1.8892445 (0)	total: 48.3ms	remaining: 917ms
1:	learn: 1.7183943	test: 1.7743163	best: 1.7743163 (1)	total: 49ms	remaining: 441ms
2:	learn: 1.6143398	test: 1.6673153	best: 1.6673153 (2)	total: 195ms	remaining: 1.11s
3:	learn: 1.5106876	test: 1.5604859	best: 1.5604859 (3)	total: 197ms	remaining: 786ms
4:	learn: 1.4161375	test: 1.4631750	best: 1.4631750 (4)	total: 203ms	remaining: 610ms
5:	learn: 1.3273977	test: 1.3724472	best: 1.3724472 (5)	total: 205ms	remaining: 479ms
6:	learn: 1.2439417	test: 1.2854552	best: 1.2854552 (6)	total: 206ms	remaining: 383ms
7:	learn: 1.1689810	test: 1.2083937	best: 1.2083937 (7)	total: 207ms	remaining: 311ms
8:	learn: 1.0967593	test: 1.1355992	best: 1.1355992 (8)	total: 383ms	remaining: 468ms
9:	learn: 1.0306900	test: 1.0702588	best: 1.0702588 (9)	total: 383ms	remaining: 383ms
10:	learn: 0.9674108	test: 1.0048687	best: 1.0048687 (10)	total: 488ms	remaining: 399ms
11:	learn: 0.9107697	test: 0.9458671	best: 0.9458671

#### Catboost provides several types of feature importances. One of them is PredictionDiff: A vector with contributions of each feature to the RawFormulaVal difference for each pair of objects.

#### Let's find two objects with incorrect labels on test data:

In [11]:
%%time
prediction = model.predict(X_test)

CPU times: user 4.62 ms, sys: 3.71 ms, total: 8.33 ms
Wall time: 2.71 ms


#### As you can see, feature 25  is most important for getting the right prediction.

### W&B Feature Importance visualisation

In [12]:
plot_feature_importances(model, feature_names=full_dataset_dict['feature_names'])

[34m[1mwandb[0m: Wandb version 0.8.36 is available!  To upgrade, please run:
[34m[1mwandb[0m:  $ pip install wandb --upgrade


Go to your dashboard to see the Catboost Feature Importance graph plotted.


<wandb.viz.Visualize at 0x7f305cf3c400>

## You should see a nice bar chart with the title Feature importances after training with the CatBoost model, in your W&B Dashboard on https://app.wandb.ai/[your username]/[your project name]/runs/[your run id]
