# CatBoost json tutorial

This notebook focuses on binary classification, but it can be used as a basis to extend these concepts to other uses of CatBoost.

Other resources:
1. [CatBoost JSON model tutorial](https://github.com/catboost/tutorials/blob/master/model_analysis/model_export_as_json_tutorial.ipynb) - on `catboost` repo
2. [Explanation of Json model format of CatBoost](https://parasmalik.blogspot.com/2020/07/explanation-of-json-model-format-of.html) - blogpost by Paras Malik
3. [CatBoost documentation](https://catboost.ai/docs/en/)

In [1]:
import json
from copy import deepcopy

import numpy as np
import pandas as pd
from catboost import CatBoostClassifier, Pool
from clickhouse_cityhash.cityhash import CityHash64
from scipy.special import expit
from tqdm import tqdm
from ucimlrepo import fetch_ucirepo
from pprint import pprint

## Load and process dataset

As an example we will use the [Adult "Census Income" dataset](https://archive.ics.uci.edu/dataset/2/adult), as it has many categorical features.

In [2]:
adult = fetch_ucirepo(id=2)
X = adult.data.features
y = adult.data.targets

y_unique = y["income"].unique()
mapping = {value: 1 if ">" in value else 0 for value in y_unique}  # labels need cleaning: 1 == income > 50K, 0 == income <= 50K
y = y["income"].map(mapping)

CatBoost does not handle missing values in categorical features, so we need to fill them.

In [3]:
cat_features = X.select_dtypes(include="object").columns.to_list()
X.loc[:, cat_features] = X[cat_features].fillna("missing")

To see how CatBoost handles missing values in numerical features we need to insert them artificially, since the numerical features in the "Census Income" dataset have no missing values.

In [4]:
X.loc[:,"capital-gain"] = X["capital-gain"].astype(np.float64)
to_nans = X["capital-gain"].sample(frac=0.02).index
X.loc[to_nans, "capital-gain"] = None
X.loc[:,"capital-loss"] = X["capital-loss"].astype(np.float64)
to_nans = X["capital-loss"].sample(frac=0.02).index
X.loc[to_nans, "capital-loss"] = None

In [5]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             48842 non-null  int64  
 1   workclass       48842 non-null  object 
 2   fnlwgt          48842 non-null  int64  
 3   education       48842 non-null  object 
 4   education-num   48842 non-null  int64  
 5   marital-status  48842 non-null  object 
 6   occupation      48842 non-null  object 
 7   relationship    48842 non-null  object 
 8   race            48842 non-null  object 
 9   sex             48842 non-null  object 
 10  capital-gain    47865 non-null  float64
 11  capital-loss    47865 non-null  float64
 12  hours-per-week  48842 non-null  int64  
 13  native-country  48842 non-null  object 
dtypes: float64(2), int64(4), object(8)
memory usage: 5.2+ MB


## Train model

We need to specify which features are categorical

In [6]:
model = CatBoostClassifier(cat_features=cat_features, random_state=42, allow_writing_files=False, verbose=False)
model.fit(X, y)

<catboost.core.CatBoostClassifier at 0x7b74e4fb5be0>

## Export model to json

We can export the model to different formats; JSON is human readable. If we also pass the training dataset in the `pool` parameter, hash values of the categories will be saved (more on that later).

In [7]:
X_pool = Pool(X, cat_features=cat_features)
model.save_model("catboost_model.json", format="json", pool=X_pool)

## Explore CatBoost json

Let's load back json file to a dictionary.

In [8]:
with open("catboost_model.json", mode="r") as f:
    model_json = json.load(f)

`model_json` contains 5 keys:

In [9]:
print(*model_json.keys(), sep="\n")

ctr_data
features_info
model_info
oblivious_trees
scale_and_bias


Brief description:
- `ctr_data` - data used to calculate CTRs (more on that later)
- `features_info` - feature names, types, borders used for splits (more on that later)
- `model_info` - hyperparameters and other information
- `oblivious_trees` - splits, leaf values and leaf weights of all trees
- `scale_and_bias` - final modification of prediction

### Numerical features

Information about numerical features is kept in `model_json["features_info"]["float_features"]`, which is a list with one dictionary per numerical feature. Let's see an example:

In [10]:
pprint(model_json["features_info"]["float_features"][0])

{'borders': [18.5,
             23.5,
             24.5,
             25.5,
             26.5,
             27.5,
             28.5,
             29.5,
             30.5,
             31.5,
             32.5,
             33.5,
             34.5,
             35.5,
             36.5,
             37.5,
             38.5,
             39.5,
             40.5,
             41.5,
             42.5,
             43.5,
             44.5,
             45.5,
             46.5,
             47.5,
             48.5,
             49.5,
             50.5,
             51.5,
             52.5,
             53.5,
             54.5,
             55.5,
             56.5,
             57.5,
             58.5,
             59.5,
             60.5,
             61.5,
             63.5,
             64.5,
             65.5,
             66.5,
             68.5,
             69.5,
             70.5,
             71.5,
             72.5,
             73.5,
             76.5,
             77.5,
            

- `borders` - values of all borders in splits which used this feature (if there are `n_b` borders it means that model divided this feature into `n_b+1` bins)
- `feature_id`, `feature_name` - feature name
- `feature_index` - feature index BUT among float features
- `flat_feature_index` - feature index BUT among all features
- `has_nans` - if missing values were seen during training
- `nan_value_treatment` - how to handle missing values
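
To illustrate how `borders` divide a feature into bins, here is a minimal sketch (not CatBoost's own code). It assumes the split rule `value > border`, which is the rule used in the prediction code later in this notebook; `bin_index` is a hypothetical helper name.

```python
from bisect import bisect_left


def bin_index(value: float, borders: list[float]) -> int:
    """Return how many borders lie strictly below `value`.

    With the rule `value > border`, this count is the bin the value falls into.
    `borders` must be sorted in ascending order.
    """
    return bisect_left(borders, value)


borders = [18.5, 23.5, 24.5]  # first three `age` borders from the output above

print(bin_index(17.0, borders))  # 0 -> below every border
print(bin_index(18.5, borders))  # 0 -> 18.5 > 18.5 is False
print(bin_index(20.0, borders))  # 1
print(bin_index(30.0, borders))  # 3 -> n_b borders give n_b + 1 bins
```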

#### Missing values
CatBoost has different strategies for handling missing values in numerical features based on parameter `nan_mode`, which is kept in
```python
model_json["model_info"]["params"]["data_processing_options"]["float_features_binarization"]["nan_mode"]
```
By default `nan_mode="Min"`, which encodes missing values as smaller than any other value. From a practical point of view we can recreate this behaviour with a function that replaces missing values with e.g. `min(borders)-1`.

In [11]:
def preprocessing_nans(X: pd.DataFrame, model_json: dict) -> pd.DataFrame:
    X = X.copy(deep=True)
    nan_mode = model_json["model_info"]["params"]["data_processing_options"][
        "float_features_binarization"
    ]["nan_mode"]
    float_features = model_json["features_info"].get("float_features", [])
    for feature_dict in float_features:
        feature = feature_dict["feature_id"]
        if nan_mode == "Forbidden":
            if X.loc[:, feature].isna().sum() != 0:
                raise ValueError("NAs are not allowed!")
        elif nan_mode == "Max":
            replace_value = max(feature_dict["borders"]) + 1
            X.loc[:, feature] = X[feature].fillna(replace_value)
        else:
            # "Min" (default): NaNs are placed below the smallest border
            replace_value = min(feature_dict["borders"]) - 1
            X.loc[:, feature] = X[feature].fillna(replace_value)
    return X

In [12]:
X_cb = preprocessing_nans(X, model_json=model_json)
X_cb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             48842 non-null  int64  
 1   workclass       48842 non-null  object 
 2   fnlwgt          48842 non-null  int64  
 3   education       48842 non-null  object 
 4   education-num   48842 non-null  int64  
 5   marital-status  48842 non-null  object 
 6   occupation      48842 non-null  object 
 7   relationship    48842 non-null  object 
 8   race            48842 non-null  object 
 9   sex             48842 non-null  object 
 10  capital-gain    48842 non-null  float64
 11  capital-loss    48842 non-null  float64
 12  hours-per-week  48842 non-null  int64  
 13  native-country  48842 non-null  object 
dtypes: float64(2), int64(4), object(8)
memory usage: 5.2+ MB


### Categorical features

CatBoost uses CTR (click-through rate) or one-hot encoding for categorical features (depending on the parameter `one_hot_max_size`). It can combine a categorical feature with other categorical features or numerical splits to create a *feature combination*. Data similar to that for numerical features is in
```python
model_json["features_info"]["categorical_features"]
```
However, there is no information about `borders`. Instead, if a given feature was one-hot encoded, a special hash (more on that later) is stored under the key `values`.
Here is an example:

In [13]:
pprint(model_json["features_info"]["categorical_features"][6])

{'feature_id': 'sex',
 'feature_index': 6,
 'feature_name': 'sex',
 'flat_feature_index': 9,
 'values': [-2114564283]}


#### Missing values
CatBoost does not allow missing values in categorical features. As you've seen there is no `nan_value_treatment` key.

#### CTR
CTRs can be thought of as new numerical features. Information about them is stored in:
```python
model_json["features_info"]["ctrs"]
```
and in
```python
model_json["ctr_data"]
```

For simplification we will describe CTRs for binary classification.
 
There are several different types of CTRs considered by CatBoost. Two main ones are:
1. Borders
$$
ctr = \frac{numberOfSuccessInCategory + prior_{numerator}}{numberOfSuccessInCategory + numberOfFailuresInCategory + prior_{denominator}}
$$
- It is basically [TargetEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.TargetEncoder.html) with prior
- It encodes features as mean of the target conditioned on category
- It encodes *quality* of the category - how good it is
2. Counter
$$
ctr = \frac{numberOfInstancesInCategory + prior_{numerator}}{numberOfInstancesInLargestCategory + prior_{denominator}}
$$
- It encodes frequency of the feature - how often it appeared in the training data
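
As a worked example of the two formulas above, here is a minimal sketch with made-up counts. The function and parameter names are hypothetical and chosen only to mirror the symbols in the formulas.

```python
def borders_ctr(successes: float, failures: float, prior_num: float, prior_denom: float) -> float:
    """`Borders` CTR: smoothed share of successes within one category."""
    return (successes + prior_num) / (successes + failures + prior_denom)


def counter_ctr(count: float, largest_count: float, prior_num: float, prior_denom: float) -> float:
    """`Counter` CTR: category frequency relative to the largest category."""
    return (count + prior_num) / (largest_count + prior_denom)


# Toy category with 3 successes and 1 failure: (3 + 0.5) / (3 + 1 + 1)
print(borders_ctr(successes=3, failures=1, prior_num=0.5, prior_denom=1))  # 0.7

# Toy category seen 4 times while the largest category was seen 10 times: 4 / 11
print(counter_ctr(count=4, largest_count=10, prior_num=0, prior_denom=1))
```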

Some notes on how CatBoost uses CTRs:
- During training each tree sees a different permutation of instances and calculates CTRs *online*, meaning that the value for each instance depends on the instances before it in the given permutation.
- During inference the values of CTRs are fixed.
- For regression, target values are quantized and treated as a multiclass classification problem; CTRs are calculated separately for each class.
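
The *online* calculation can be sketched as follows. This is a simplified illustration of the idea, not CatBoost's implementation: the row order stands in for one permutation, and `online_borders_ctr` and its default priors are assumptions made for the example.

```python
from collections import defaultdict


def online_borders_ctr(categories, targets, prior_num=0.5, prior_denom=1.0):
    """Compute a `Borders`-style CTR for each row using only the rows before it."""
    successes = defaultdict(int)  # running count of target == 1 per category
    totals = defaultdict(int)     # running count of rows per category
    values = []
    for cat, target in zip(categories, targets):
        # The current row's own target is not used for its own CTR value
        values.append((successes[cat] + prior_num) / (totals[cat] + prior_denom))
        successes[cat] += target
        totals[cat] += 1
    return values


print(online_borders_ctr(["a", "a", "b", "a"], [1, 0, 1, 1]))
# [0.5, 0.75, 0.5, 0.5]
```

The first occurrence of each category gets the pure prior `0.5 / 1`, and later occurrences are pulled towards the target mean observed so far.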

##### Feature combination
CatBoost also considers *feature combinations*: it creates a new categorical feature by combining several categorical features, e.g. `feature_1 = "red"` and `feature_2="square"` $\Rightarrow$ `feature_3="red|square"` (well, not exactly - we will see how exactly it is done later).

A combination can also include numerical splits, each of which can be interpreted as a binary categorical variable, e.g. `feature_4 > 3.5` $\Rightarrow$ `0` or `1`, which is then combined with a categorical feature, e.g. `feature_5="red|1"` (again - we will see how it is done exactly later).

##### Expand model json
CatBoost stores all the necessary information in the `json` export. However, some of it is in *raw* form, so if we want to analyze the model we can expand it to make our life easier.

First, we can add names of the *effective* CTR features. Information about elements are stored in
```python
model_json["features_info"]["ctrs"][ctr_idx]["elements"]
```
while CTR type (`Borders` or `Counter`) and $prior_{numerator}$ are in `ctr_type` and `prior_numerator` key, respectively.
Here is an example:

In [14]:
pprint(model_json["features_info"]["ctrs"][200])

{'borders': [4.999999046325684],
 'ctr_type': 'Borders',
 'elements': [{'cat_feature_index': 1,
               'combination_element': 'cat_feature_value'},
              {'border': 40.5,
               'combination_element': 'float_feature',
               'float_feature_index': 5},
              {'border': 45.5,
               'combination_element': 'float_feature',
               'float_feature_index': 0}],
 'identifier': '{"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":40.5,"combination_element":"float_feature","float_feature_index":5},{"border":45.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"}',
 'prior_denomerator': 1,
 'prior_numerator': 0,
 'scale': 15,
 'shift': 0,
 'target_border_idx': 0}


Here is a function that adds `feature_id` to CTRs. This `feature_id` is constructed so that it encodes all the information needed to identify a given CTR, e.g. `workclass|race|age>24:Borders_0.5` means a CTR which has the combination elements `workclass`, `race`, `age>24` and is of type `Borders` with prior numerator equal to `0.5`.

In [15]:
def add_ctr_feature_id(model_json: dict)-> dict:
    model_json = deepcopy(model_json)
    cat_features = model_json["features_info"].get("categorical_features", [])
    float_features = model_json["features_info"].get("float_features",[])
    ctrs = model_json["features_info"].get("ctrs",[])

    for ctr in ctrs:
        elements = []
        for elem in ctr["elements"]:
            if elem["combination_element"] == "cat_feature_value":
                elements.append(cat_features[elem["cat_feature_index"]]["feature_id"])
            elif elem["combination_element"] == "float_feature":
                # single quotes inside the f-strings keep this valid on Python < 3.12
                elements.append(f"{float_features[elem['float_feature_index']]['feature_id']}>{elem['border']}")
            elif elem["combination_element"] == "cat_feature_exact_value":
                elements.append(f"{cat_features[elem['cat_feature_index']]['feature_id']}=={elem['value']}")
        feature_id = "|".join(elements)
        feature_id += f":{ctr['ctr_type']}_{ctr['prior_numerator']}"
        ctr["feature_id"] = feature_id
    return model_json

In [16]:
model_json_extended = add_ctr_feature_id(model_json=model_json)
ctr = model_json_extended["features_info"]["ctrs"][200]
print(ctr["feature_id"])

education|hours-per-week>40.5|age>45.5:Borders_0


Values needed to calculate CTRs are in:
```python
model_json["ctr_data"]
```
which is keyed by the lengthy `identifier` value of the CTR `dict`. Example:

In [17]:
identifier = ctr["identifier"]
print("Lengthy identifier:", identifier,"\n")
print("Corresponding 'ctr_data':\n")
pprint(model_json["ctr_data"][identifier])

Lengthy identifier: {"identifier":[{"cat_feature_index":1,"combination_element":"cat_feature_value"},{"border":40.5,"combination_element":"float_feature","float_feature_index":5},{"border":45.5,"combination_element":"float_feature","float_feature_index":0}],"type":"Borders"} 

Corresponding 'ctr_data':

{'counter_denominator': 0,
 'hash_map': ['12149678890856442368',
              5682,
              633,
              '18446744073709551615',
              27,
              146,
              '16943509975464877063',
              42,
              112,
              '4308412934939786119',
              507,
              329,
              '18416576814848222218',
              48,
              14,
              '13748161776802938637',
              90,
              26,
              '16466277097607122449',
              33,
              1,
              '6920013252264861585',
              58,
              58,
              '2246878560390687890',
              1170,
              1

- `counter_denominator` - contains $numberOfInstancesInLargestCategory$ for `Counter` CTRs (is equal to `0` for `Borders` CTRs)
- `hash_stride` - contains number of entries in `hash_map` list related to single category
- `hash_map` contains
  - `[category_1_hash, numberOfFailuresInCategory_1, numberOfSuccessesInCategory_1, category_2_hash, numberOfFailuresInCategory_2, ...]` for `Borders` CTRs
  - `[category_1_hash, numberOfInstancesInCategory_1, category_2_hash, numberOfInstancesInCategory_2, ...]` for `Counter` CTRs
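
The flat `hash_map` layout can be unpacked as in the sketch below. It uses the first two entries of the output above and assumes a stride of 3 for a `Borders` CTR (the `hash_stride` value itself is cut off in that output); `parse_borders_hash_map` is a hypothetical helper name.

```python
def parse_borders_hash_map(hash_map: list, stride: int = 3) -> dict:
    """Group a flat `Borders` hash_map into {hash: {"failures": ..., "successes": ...}}."""
    parsed = {}
    for idx in range(0, len(hash_map), stride):
        category_hash = int(hash_map[idx])  # hashes are stored as strings
        parsed[category_hash] = {
            "failures": hash_map[idx + 1],
            "successes": hash_map[idx + 2],
        }
    return parsed


hash_map = ["12149678890856442368", 5682, 633, "18446744073709551615", 27, 146]
parsed = parse_borders_hash_map(hash_map)

print(parsed[12149678890856442368])  # {'failures': 5682, 'successes': 633}
print(len(parsed))                   # 2
```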

To calculate CTR values we also need the prior numerator and prior denominator, which are in `model_json["features_info"]["ctrs"]`. These raw CTR values are then scaled and shifted (the values of scale and shift are also found in the CTR dictionary):
$$
finalCTR = scale\times ctr + shift
$$
Putting everything together we can create dictionary for each CTR which assigns CTR value to category hash. Here is a function that does precisely that:

In [18]:
def add_ctr_values(model_json: dict) -> dict:
    model_json = deepcopy(model_json)
    ctrs = model_json["features_info"].get("ctrs", [])
    ctr_data = model_json.get("ctr_data", {})  # dict keyed by CTR identifier

    for ctr in ctrs:
        identifier = ctr["identifier"]
        data = ctr_data[identifier]
        stride = data["hash_stride"]
        hash_map = data["hash_map"]
        prior_denom = ctr["prior_denomerator"] # here is funny typo which haunts CatBoost jsons for many years
        prior_num = ctr["prior_numerator"]

        ctr_values = {}
        if ctr["ctr_type"] == "Borders":
            for idx in range(0, len(hash_map), stride):
                ctr_values[int(hash_map[idx])] = (
                    ctr["scale"]
                    * (hash_map[idx + 2] + prior_num)
                    / (hash_map[idx + 1] + hash_map[idx + 2] + prior_denom)
                    + ctr["shift"]
                )
        elif ctr["ctr_type"] == "Counter":
            denom = data["counter_denominator"]
            for idx in range(0, len(hash_map), stride):
                ctr_values[int(hash_map[idx])] = (
                    ctr["scale"]
                    * (hash_map[idx + 1] + prior_num)
                    / (denom + prior_denom)
                    + ctr["shift"]
                )
        ctr["ctr_values"] = ctr_values
    return model_json

In [19]:
model_json_extended = add_ctr_values(model_json_extended)
model_json_extended["features_info"]["ctrs"][0]["ctr_values"]

{18446744073709551615: 4.0110998990918265,
 15379737126276794113: 5.872295882763433,
 1746527532669166693: 0.0,
 7024059537692152076: 8.295990566037736,
 15472181234288693070: 4.181982914833031,
 14256903225472974739: 1.5596080566140447,
 18048946643763804916: 4.0110998990918265,
 2051959227349154549: 4.432578897035384,
 6285290715428032055: 1.1514522821576763,
 13863729436386866842: 1.3636363636363635,
 8864790892067322495: 3.2679092812693544}

#### Hashing

If we also pass the training dataset to the `pool` parameter during json export, hash values are found in `model_json["features_info"]["cat_features_hash"]`. However, this key is not populated if we don't pass the training dataset. Fortunately, we can recalculate these hashes. CatBoost uses a particular implementation of [CityHash64](https://clickhouse.com/docs/native-protocol/hash). Even though this hash has 64 bits, CatBoost uses only the 32 least significant bits, represented as a (signed) `int32`. Here is a Python version of this hashing function:

In [20]:
def hash_str(value: str) -> int:
    """Calculates hash of string (categorical) variables as used in CatBoost.

    Uses CityHash64 (`clickhouse_cityhash` implementation).

    32 least significant bits are used and represented as (signed) int32.

    Parameters
    ----------
    value : str
        String to be hashed

    Returns
    -------
    int
        Hash value
    """
    ch = CityHash64(value)
    return int((np.uint64(ch) & np.uint64(0xFFFFFFFF)).astype(np.int32))

In [21]:
category_hash_examples = model_json_extended["features_info"]["cat_features_hash"][0]
category_hash_examples

{'hash': 1601642347, 'value': 'Nicaragua'}

In [22]:
print(f'{category_hash_examples["value"]} has hash={hash_str(category_hash_examples["value"])}')
print(f'Is it the same as in model_json? {hash_str(category_hash_examples["value"])==category_hash_examples["hash"]}')

Nicaragua has hash=1601642347
Is it the same as in model_json? True


These hashes are used as a numerical representation of a category.
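
The "32 least significant bits as signed `int32`" step can also be written without NumPy. The sketch below is a pure-Python equivalent of that truncation only (it does not compute CityHash64); `low32_as_int32` is a hypothetical helper name.

```python
def low32_as_int32(hash64: int) -> int:
    """Keep the 32 least significant bits and reinterpret them as a signed int32."""
    low = hash64 & 0xFFFFFFFF
    # Values with the top bit set represent negative numbers in two's complement
    return low - 0x1_0000_0000 if low >= 0x8000_0000 else low


print(low32_as_int32(0x1_0000_0001))  # 1
print(low32_as_int32(0xFFFFFFFF))     # -1

# Round trip for the `State-gov` hash that appears later in this notebook
unsigned = -447941100 & 0xFFFFFFFF
print(unsigned)                  # 3847026196
print(low32_as_int32(unsigned))  # -447941100
```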

Next step is to calculate final hash which is a result of "mixing" of all combination elements. If CTR uses only one categorical feature the mixing will be done on a sequence
```
[0, category_hash]
```
If there are several categorical features as combination elements, the sequence will be
```
[0, feature_1_category_hash, feature_2_category_hash, ...]
```
Finally, if a one-hot encoded feature or a numerical split is among the combination elements, the binary representation of the boolean result of that split is the value used in the sequence. Here is an example with `1` for the first numerical/one-hot split and `0` for the second:
```
[0, category_hash, 1, 0]
```
Below are functions that perform this sequential mixing

In [23]:
def calc_hash(
    value: int,
    starting_hash: int = 0,
    max_int: int = 0xFFFFFFFFFFFFFFFF,
    magic_mult: int = 0x4906BA494954CB65,
) -> int:
    """Mixes `value` into existing hash (`starting_hash`) to produce new hash.

    Default values of `max_int` and `magic_mult` are the same as in CatBoost.

    Parameters
    ----------
    value : int
        Value to be mixed into the `starting_hash`.
    starting_hash : int, optional
        Hash that the `value` should be mixed into, by default 0
    max_int : hexadecimal, optional
        Special value for mixing, by default 0xFFFFFFFFFFFFFFFF
    magic_mult : hexadecimal, optional
        Special value for mixing, by default 0x4906BA494954CB65

    Returns
    -------
    int
        Mixed hash.
    """
    value = int(value)
    starting_hash = int(starting_hash)
    return (magic_mult * ((starting_hash + magic_mult * value) & max_int)) & max_int


def calc_hash_combination(combination_elements: list)->int:
    """Given combination elements of CatBoost CTR, calculate hash.

    - For categorical features element is their hash value
    - For float features element is binary encoding of split condition
    - For one-hot feature element is binary encoding of split equality
    
    Parameters
    ----------
    combination_elements : list
        List of combination elements

    Returns
    -------
    int
        Final hash value
    """    
    hash_value = 0
    for element in combination_elements:
        hash_value = calc_hash(element, starting_hash=hash_value)
    return hash_value

Let's try to calculate one of the hashes

In [24]:
ctr = model_json_extended["features_info"]["ctrs"][9] # since CatBoost is not deterministic even with set random_state, this particular CTR can contain different features if you run it again
print("CTR feature:", ctr["feature_id"])

cat_value = X["workclass"].unique()[0] 
cat_hash = hash_str(cat_value)

print("Category value of first element of combination:", cat_value, ". Its hash is:", cat_hash)

final_hash = calc_hash_combination([cat_hash, 1]) # first value is implicitly 0 so we can skip it
print("Final hash after mixing with 1 (example of binary encoding of numerical split) is:", final_hash)
print(f"Is it one of the hashes in model_json for feature '{ctr['feature_id']}'?", final_hash in ctr["ctr_values"])

CTR feature: workclass|age>45.5:Borders_0.5
Category value of first element of combination: State-gov . Its hash is: -447941100
Final hash after mixing with 1 (example of binary encoding of numerical split) is: 8193958724117795869
Is it one of the hashes in model_json for feature 'workclass|age>45.5:Borders_0.5'? True


Now we can wrap everything into functions which will transform dataset by hashing categorical features and extracting CTR features.
1. `map_categories_to_hashes` simply applies hashing function to category values
2. `create_ctr_features`
    - for each CTR feature collects combination elements (category hashes, binary encodings of numerical splits or one-hot splits)
    - calculates sequential hash mixing of these elements
    - maps final hash values to respective CTR values

In [25]:
def map_categories_to_hashes(X: pd.DataFrame) -> pd.DataFrame:
    X = X.copy(deep=True)
    cat_features = X.select_dtypes(include=["object", "category"]).columns
    X[cat_features] = X[cat_features].map(hash_str)
    return X

def create_ctr_features(X_cb: pd.DataFrame, model_json: dict) -> pd.DataFrame:
    X_cb = X_cb.copy(deep=True)
    ctrs = model_json["features_info"].get("ctrs", [])
    cat_features = model_json["features_info"].get("categorical_features",[])
    float_features = model_json["features_info"].get("float_features",[])

    ctr_features_list = []
    for ctr in tqdm(ctrs):
        feature_id = ctr["feature_id"]
        combination_elements_list =[]
        for elem in ctr["elements"]:
            if elem["combination_element"] == "cat_feature_value":
                feature_element_id = cat_features[elem["cat_feature_index"]][
                    "feature_id"
                ]
                value = X_cb[feature_element_id] # here category hash value
            elif elem["combination_element"] == "float_feature":
                feature_element_id = float_features[elem["float_feature_index"]][
                    "feature_id"
                ]
                value = (X_cb[feature_element_id] > elem["border"]).astype(int) # binary encoding of numerical split
            elif elem["combination_element"] == "cat_feature_exact_value":
                feature_element_id = cat_features[elem["cat_feature_index"]][
                    "feature_id"
                ]
                value = (X_cb[feature_element_id] == elem["value"]).astype(int) # binary encoding of one-hot split

            combination_elements_list.append(value) # collect all combination elements

        combination_elements_df = pd.concat(combination_elements_list, axis=1)            
        series_temp = combination_elements_df.apply(calc_hash_combination, axis=1) 
        series_temp = series_temp.map(ctr["ctr_values"]) # map hashes to ctr values
        series_temp.name = feature_id
        ctr_features_list.append(series_temp)

    X_cb = pd.concat([X_cb, *ctr_features_list], axis=1)
    return X_cb

In [26]:
X_cb = map_categories_to_hashes(X_cb)
X_cb = create_ctr_features(X_cb, model_json=model_json_extended)

100%|██████████| 610/610 [01:07<00:00,  8.99it/s]


The dataset is now expanded with CTR features and consists of numerical features only. Conceptually CatBoost does the same thing internally (but faster and more efficiently).

In "Census Income" there are quite a few categorical features. As a result CatBoost created many CTRs.

In [27]:
X_cb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Columns: 624 entries, age to native-country|hours-per-week>73.5:Borders_1
dtypes: float64(612), int64(12)
memory usage: 232.5 MB


### Trees

CatBoost uses oblivious trees: the same split is applied to all nodes at a given level. This means that a decision path can be encoded as a binary vector, which in turn can be read as an integer index into the list of all leaf values. Leaf values are the possible predictions of that tree. For example, if the path an instance took through a tree of `depth=3` is `[Greater, Not greater, Greater]`, it is encoded as `101`, so the corresponding leaf value is at index `0b101=5`. In the prediction code later in this notebook, the first split contributes the least significant bit.
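
As a small illustration of this encoding, the sketch below turns a list of split outcomes into a leaf index. It follows the bit order used by the `predict` function later in this notebook (first split is the least significant bit); `leaf_index` is a hypothetical helper name.

```python
def leaf_index(decisions: list[bool]) -> int:
    """Convert per-level split outcomes into an index into `leaf_values`."""
    index = 0
    for level, passed in enumerate(decisions):
        # The split at `level` contributes 2**level when its condition is True
        index += (1 << level) * int(passed)
    return index


print(leaf_index([True, False, True]))   # 5
print(leaf_index([True, False, False]))  # 1
print(leaf_index([False, False, True]))  # 4
```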

Let's look at example of CatBoost tree:

In [28]:
pprint(model_json["oblivious_trees"][42])

{'leaf_values': [-0.053246387294678704,
                 0.04344816195199846,
                 -0.03889367593239945,
                 0.011376445792702607,
                 0,
                 0,
                 0,
                 0,
                 -0.049786560982482714,
                 -0.07900743970240365,
                 -0.014231091855371053,
                 -0.03105123577189302,
                 0,
                 0,
                 0,
                 0,
                 -0.02992219933327188,
                 -0.008388896899044455,
                 -0.00948424520345465,
                 -0.015603536432856592,
                 0,
                 0,
                 0,
                 0,
                 -0.028782229264519957,
                 -0.04281970650040307,
                 0.038200608259269714,
                 -0.013399462043038334,
                 0,
                 0,
                 0,
                 0,
                 -0.02922295387235034,
           

Each tree contains:
- `leaf_values` - list of all leaf values; number of leaves is $2^{depth}$ (if there are more classes, this is multiplied by $n_{classes}$)
- `leaf_weights` - number of instances that landed in this leaf (if instances were weighted this is taken into account here)
- `splits` - list of all splits in this tree

#### Splits
Splits have their own structure:
- `split_type` - one of `FloatFeature` (numerical split), `OnlineCtr` (CTR feature split), `OneHotFeature` (categorical one-hot split)
- `split_index` - index in the list of all borders in the whole model which are concatenated in the following order (they are not explicitly saved in `model_json`)
    - `float_features` - all borders for each feature
    - `categorical_features` - only values of one-hot encoded features used as split conditions
    - `ctrs` - all borders for each CTR
- `border` (for numerical splits or CTRs) or `value` (for one-hot splits) - value used in this split condition
- `float_feature_index` or `cat_feature_index` - index among given type of features
- `ctr_target_border_idx` - always `0` for binary classification

For numerical splits, we can quickly determine which feature must be used for this split. However, for CTRs we need to look for correct one using `split_index`. To make our life easier we can add list of all splits, which will contain feature names (for extracted CTR features as well). Here is a function that does that:

In [29]:
def add_all_splits(model_json: dict)-> dict:
    model_json = deepcopy(model_json)
    cat_features = model_json["features_info"].get("categorical_features", [])
    float_features = model_json["features_info"].get("float_features",[])
    ctrs = model_json["features_info"].get("ctrs",[])

    model_json["all_splits"] = []
    for idx, feature_dict in enumerate(float_features):
        for border in feature_dict["borders"]:
            split_dict = {
                "feature_id": feature_dict["feature_id"],
                "border": border,
                "split_type": "FloatFeature",
                "feature_idx_in_type": idx,
            }
            model_json["all_splits"].append(split_dict)

    for idx, feature_dict in enumerate(cat_features):
        if "values" in feature_dict.keys():
            for value in feature_dict["values"]:
                split_dict = {
                    "feature_id": feature_dict["feature_id"],
                    "border": value,
                    "split_type": "OneHotFeature",
                    "feature_idx_in_type": idx,
                }
                model_json["all_splits"].append(split_dict)

    for idx, feature_dict in enumerate(ctrs):
        for border in feature_dict["borders"]:
            split_dict = {
                "feature_id": feature_dict["feature_id"],
                "border": border,
                "split_type": "OnlineCtr",
                "feature_idx_in_type": idx,
            }
            model_json["all_splits"].append(split_dict)
    return model_json

In [30]:
model_json_extended = add_all_splits(model_json_extended)
print(f"There are {len(model_json_extended['all_splits'])} unique splits in the model.")

There are 1818 unique splits in the model.


### Prediction

Now that all features are numerical and we know how to turn splits into a binary decision vector, we can recalculate the prediction.

Tree ensemble models simply sum up the leaf values of all trees, possibly scale them and add a bias. For a classifier, the model's raw prediction is in log-odds space (because log-odds are additive). To get a more intuitive probability, the raw prediction needs to be passed through the `expit` function.
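
The arithmetic can be illustrated on toy numbers before applying it to the whole model. The leaf values below are made up, and `toy_predict_proba` is a hypothetical helper that uses `math.exp` in place of `scipy.special.expit`.

```python
import math


def toy_predict_proba(selected_leaf_values, scale=1.0, bias=0.0):
    """Sum one leaf value per tree, apply scale and bias, then map log-odds to probability."""
    raw = scale * sum(selected_leaf_values) + bias  # raw prediction in log-odds space
    return 1.0 / (1.0 + math.exp(-raw))             # logistic function, same as expit


# One selected leaf value per tree for a single instance (made-up numbers)
print(toy_predict_proba([0.0]))             # 0.5 -> log-odds of 0 means even odds
print(toy_predict_proba([0.2, -0.1, 0.3]))  # above 0.5 because the sum is positive
```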

Here is a function that recalculates prediction based on expanded `model_json` structure and extracted CTR features.

In [31]:
def predict(X_cb: pd.DataFrame, model_json: dict) -> pd.Series:
    prediction = np.zeros_like(X_cb.index, dtype=np.float64)
    scale, bias = model_json["scale_and_bias"]
    trees = model_json["oblivious_trees"]
    all_splits = model_json["all_splits"]

    for tree in trees:
        splits = tree.get("splits",[])
        splits = splits if splits is not None else [] 
        leaf_idx = np.zeros_like(prediction, dtype=np.int32)
        for idx, split in enumerate(splits):
            split = all_splits[split["split_index"]]
            if split["split_type"] == "OneHotFeature":
                bin_value = X_cb[split["feature_id"]] == split["border"]
            else:
                bin_value = X_cb[split["feature_id"]] > split["border"]
            
            leaf_idx += (2**idx)*bin_value
        leaf_values = np.array(tree["leaf_values"])
        prediction += leaf_values[leaf_idx]

    return scale*prediction + bias

In [32]:
pred_recalc = expit(predict(X_cb, model_json_extended))
pred_original = model.predict_proba(X)[:,1]
np.max(np.abs(pred_original - pred_recalc))

np.float64(0.0)

The prediction matches exactly!

### Some fun with the model

We can use only some of the trees by simply removing the others from the json.

In [33]:
model_small_json = deepcopy(model_json)
model_small_json["oblivious_trees"][:250] = []
model_small_json["oblivious_trees"][-250:] = []

# do the same to extended version
model_small_json_extended = deepcopy(model_json_extended)
model_small_json_extended["oblivious_trees"][:250] = []
model_small_json_extended["oblivious_trees"][-250:] = []
print(f"Number of trees in original model is {len(model_json['oblivious_trees'])}")
print(f"Number of trees in small model is {len(model_small_json['oblivious_trees'])}")

Number of trees in original model is 1000
Number of trees in small model is 500


In [34]:
with open("catboost_model_small.json", mode="w+") as f:
    json.dump(model_small_json, f, indent=2)
model_small = CatBoostClassifier()
model_small.load_model("catboost_model_small.json", format="json")

<catboost.core.CatBoostClassifier at 0x7b744587ce10>

In [35]:
pred_recalc = expit(predict(X_cb, model_small_json_extended))
pred_original = model_small.predict_proba(X)[:,1]
np.max(np.abs(pred_original - pred_recalc))

np.float64(2.220446049250313e-16)

Here we have a small numerical error, but it is exactly the same as if we were to use CatBoost's built-in solution for choosing trees for prediction.

In [36]:
np.max(np.abs(model.predict_proba(X, ntree_start=250, ntree_end=750)[:,1] - pred_original))

np.float64(2.220446049250313e-16)

We can also flip the model so it predicts who earns *less than* instead of *more than* 50k. To do that, we need to multiply each leaf value by `-1`.

In [37]:
model_flipped_json_extended = deepcopy(model_json_extended)
for tree in model_flipped_json_extended["oblivious_trees"]:
    for i in range(len(tree["leaf_values"])):
        tree["leaf_values"][i] *= -1

pred_recalc = expit(predict(X_cb, model_flipped_json_extended))
pred_original = model.predict_proba(X)[:,1]
np.max(np.abs(1-(pred_original + pred_recalc))) # original predictions + flipped prediction must add-up to 1

np.float64(2.220446049250313e-16)

Again, only small numerical errors.
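
Why flipping works can be checked on toy numbers. The sketch below relies on the identity `expit(-x) = 1 - expit(x)` and on the formula `scale * sum + bias` from `predict`, with scale assumed to be 1. It suggests that negating only the leaf values gives complementary probabilities when the bias is zero, which appears to be the case for this model; with a non-zero bias, the bias would need its sign flipped as well. All numbers are made up.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, equivalent to scipy.special.expit for a scalar."""
    return 1.0 / (1.0 + math.exp(-x))


leaf_sum = 0.8  # hypothetical sum of leaf values for one instance

# Zero bias: negating the leaf values yields the complementary probability
print(math.isclose(sigmoid(leaf_sum) + sigmoid(-leaf_sum), 1.0))  # True

# Non-zero bias: negating only the leaf values is not enough
bias = 0.3
p = sigmoid(leaf_sum + bias)
print(math.isclose(p + sigmoid(-leaf_sum + bias), 1.0))  # False
print(math.isclose(p + sigmoid(-leaf_sum - bias), 1.0))  # True
```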