# EBM Internals - Binary classification

This is part 2 of a 3-part series describing EBM internals and how to make predictions. For part 1, click [here](./ebm-internals-regression.ipynb). For part 3, click [here](./ebm-internals-multiclass.ipynb).

In this part 2 we'll cover binary classification, interactions, missing values, ordinals, and bin discretization resolutions for interactions. Before reading this part you should be familiar with the internals of EBM regression from [part 1](./ebm-internals-regression.ipynb).

In [None]:
# boilerplate
from interpret import show
from interpret.glassbox import ExplainableBoostingClassifier
import numpy as np

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

In [None]:
# make a dataset composed of a nominal categorical feature, and a continuous feature 
X = np.array([["low", 1.75], ["medium", 0.75], ["high", 2.75], [None, None]])
y = np.array(["apples", "apples", "oranges", "oranges"])

# Fit an EBM with interactions
# Define an ordinal feature with a specified ordering
# Eliminate the validation set to handle the small dataset
ebm = ExplainableBoostingClassifier(
    feature_types=[["low", "medium", "high"], 'continuous'], 
    max_interaction_bins=4,
    validation_size=0, early_stopping_rounds=1000, min_samples_leaf=1)
ebm.fit(X, y)
show(ebm.explain_global())



In [None]:
print(ebm.classes_)

Per scikit-learn convention, we store the list of classes in the ebm.classes_ attribute as a sorted array. In this example our classes are strings, but we also accept integers as we'll see in part 3.

In [None]:
print(ebm.feature_types)

In this example we passed feature_types into the \_\_init\_\_ function. Per scikit-learn convention, this is recorded unmodified in the ebm object.

In [None]:
print(ebm.feature_types_in_)

We translated the feature_types passed to \_\_init\_\_ into actualized feature types. Following scikit-learn's [SLEP007 convention](https://scikit-learn-enhancement-proposals.readthedocs.io/en/latest/slep007/proposal.html), we recorded this in ebm.feature_types_in_.

In [None]:
print(ebm.feature_names_in_)

Since we passed in a numpy array without specifying column names, the EBM created some default names. If we had passed feature_names to the \_\_init\_\_ function, or if we had used a Pandas dataframe, then feature_names_in_ would have contained those names.

In [None]:
print(ebm.term_features_)

Our model contains 3 additive terms. The first two terms are the main effect features, and the 3rd term is the pairwise interaction between the individual features. EBMs are not limited to only main effects and pairs. We also support 3-way interactions, 4-way interactions, and higher dimensions as well. If there were any higher order interactions in the model, they would be listed in ebm.term_features_

In [None]:
print(ebm.term_names_)

ebm.term_names_ is a convenience attribute that joins ebm.term_features_ and ebm.feature_names_in_ to create names for each of the additive terms.

ebm.term_names_ is always the result of:
term_names = [" & ".join(ebm.feature_names_in_[i] for i in grp) for grp in ebm.term_features_]
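As a quick illustration of that expression, the sketch below applies it to stand-in values rather than to a fitted model. The variables `feature_names_in` and `term_features` are hypothetical local copies, and the default-style feature names are assumed for illustration only.

```python
# Stand-in values, assumed for illustration (not read from a fitted EBM).
feature_names_in = ["feature_0000", "feature_0001"]
term_features = [(0,), (1,), (0, 1)]  # two main effects and one pair

# Same join expression as above, applied to the stand-in values.
term_names = [
    " & ".join(feature_names_in[i] for i in grp) for grp in term_features
]
print(term_names)
```

Main effects keep the plain feature name, while the pair joins both names with " & ".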

In [None]:
print(ebm.bins_)

ebm.bins_ is a per-feature attribute. As described in [part 1](./ebm-internals-regression.ipynb), ebm.bins_ defines both categorical ('nominal' or 'ordinal') and 'continuous' features.

For categorical features we use a dictionary that maps the category strings to bin indexes.

Continuous features are defined with a list of cut points that partition the continuous range. In this case, our dataset has 3 unique values for the continuous feature: 0.75, 1.75, and 2.75. As described in [part 1](./ebm-internals-regression.ipynb), the main effects have 2 bin cuts that separate these into 3 regions. In the call to \_\_init\_\_ for the ExplainableBoostingClassifier(...) we specified max_interaction_bins=4, which limits us to just 4 bins for pairwise and higher interactions. As described in [part 1](./ebm-internals-regression.ipynb), we have reserved 2 of those bins for 'missing' and 'unknown', which leaves us 2 remaining bins for interaction terms. We have 3 unique values though, so the EBM needs to decide which to group together and then choose a cut point that separates them into separate partitions. In this case, the EBM could have chosen any cut point between 0.75 and 2.75. It chose 2.25, which puts 0.75 and 1.75 in the lower bin and 2.75 in the upper bin.

In this example, we included missing values. Missing values are always put into the 0th bin.
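To make the two resolutions concrete, here is a minimal sketch that bins the three values (plus a missing value) with a main-effect cut list and an interaction cut list. The interaction cut 2.25 comes from the text above; the main-effect cuts 1.25 and 2.25 are assumed for illustration, and `continuous_bin` is a hypothetical helper, not part of the interpret API.

```python
from bisect import bisect_right

# Assumed cut points: only the interaction cut (2.25) is stated in the text.
main_cuts = [1.25, 2.25]
interaction_cuts = [2.25]


def continuous_bin(value, cuts):
    """Return the bin index for a value; index 0 is reserved for 'missing'."""
    if value is None:
        return 0
    # bisect_right counts the cuts that are <= value, which matches the
    # default behaviour of np.digitize; add 1 to skip the 'missing' bin
    return bisect_right(cuts, value) + 1


values = [0.75, 1.75, 2.75, None]
main_bins = [continuous_bin(v, main_cuts) for v in values]
interaction_bins = [continuous_bin(v, interaction_cuts) for v in values]
print(main_bins)
print(interaction_bins)
```

With these cuts each value gets its own bin at the main-effect resolution, while 0.75 and 1.75 share the lower bin at the interaction resolution.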

In [None]:
print(ebm.term_scores_[0])

ebm.term_scores_ contains the lookup tables that we use to look up the scores after we have binned the feature values. ebm.term_scores_[0] is the lookup table for the first term, which in this example is the first feature.

Since the first feature is an ordinal categorical, we use the dictionary {'low': 1, 'medium': 2, 'high': 3} to look up which bin to use for each categorical string. If the feature value is None, then we use the score at index 0. If the feature value is "low", we use the score at index 1. If the feature value is "medium", we use the score at index 2. If the feature value is "high", we use the score at index 3. If it were any other string, we'd use the score at index 4.
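The lookup rules above can be sketched as a small function. This is an illustrative helper (`categorical_bin` is a hypothetical name, not an interpret function), and it assumes the unknown bin sits one past the largest index in the dictionary, as it does in this example.

```python
# The mapping quoted above; index 0 is 'missing', the last index is 'unknown'.
categories = {"low": 1, "medium": 2, "high": 3}


def categorical_bin(value, mapping):
    """Map a category to its bin index, with fallbacks for missing/unknown."""
    if value is None:
        return 0  # missing values always use bin 0
    # unseen categories fall through to the final 'unknown' bin
    unknown_idx = max(mapping.values()) + 1
    return mapping.get(value, unknown_idx)


for value in [None, "low", "medium", "high", "purple"]:
    print(value, "->", categorical_bin(value, categories))
```

The string "purple" was never seen during fitting, so it lands in the unknown bin at index 4.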

In [None]:
print(ebm.term_scores_[1])

ebm.term_scores_[1] is for the continuous feature in our dataset. As with categoricals, the 0th index is reserved to hold the score for the 'missing' value, and the last index is reserved to hold the score for 'unknown' feature values. In the context of a continuous feature, the unknown bin is anything that cannot be expressed as a float, so if we had been given the string "BAD_VALUE" instead of a float, then we'd use the bin at index 4. This bin can optionally be set to np.nan if non-continuous values are illegal.
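One way to picture the missing and unknown handling for a continuous feature is the sketch below. It is not how interpret implements it; `continuous_bin_with_unknown` is a hypothetical helper, and the main-effect cuts 1.25 and 2.25 are assumed for illustration.

```python
import math
from bisect import bisect_right

main_cuts = [1.25, 2.25]  # assumed main-effect cuts for illustration


def continuous_bin_with_unknown(value, cuts):
    """Bin a raw value: 0 = missing, 1..len(cuts)+1 = regions, last = unknown."""
    if value is None:
        return 0
    try:
        x = float(value)
    except (TypeError, ValueError):
        # cannot be expressed as a float, so use the final 'unknown' bin
        return len(cuts) + 2
    if math.isnan(x):
        return 0  # NaN is treated as missing, like None
    return bisect_right(cuts, x) + 1


print(continuous_bin_with_unknown("BAD_VALUE", main_cuts))
print(continuous_bin_with_unknown(float("nan"), main_cuts))
print(continuous_bin_with_unknown(1.75, main_cuts))
```

With two cuts there are three regular regions at indexes 1 to 3, so the unknown bin is index 4.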

Our ebm.bins_[1] attribute contains a list having 2 arrays of cut points. If we're binning a main effect then we use the cut points at index 0, and if we're binning an interaction we use the cut points at index 1.
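Choosing the resolution can be sketched as a small selection function. `select_bin_level` is a hypothetical helper, the cut values are assumed, and extending the rule to terms with more than two features (fall back to the last available level) is an assumption rather than something stated here.

```python
def select_bin_level(bin_levels, n_term_features):
    """Pick the binning definition a feature should use inside a term."""
    # main effects (1 feature) use level 0; pairs use level 1 when it
    # exists; otherwise fall back to the last level that is available
    return bin_levels[min(len(bin_levels), n_term_features) - 1]


# Stand-ins shaped like ebm.bins_ entries (values assumed for illustration).
ordinal_levels = [{"low": 1, "medium": 2, "high": 3}]  # one level only
continuous_levels = [[1.25, 2.25], [2.25]]  # main cuts, interaction cuts

print(select_bin_level(continuous_levels, 1))  # main effect
print(select_bin_level(continuous_levels, 2))  # pair
print(select_bin_level(ordinal_levels, 2))  # pair, single level available
```

The ordinal feature has a single level, so the same dictionary is used for both the main effect and the pair.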

In [None]:
print(ebm.term_scores_[2])

Since ebm.term_scores_ is a per-term attribute, term_scores_[2] will contain the lookup table for the pair (0, 1) which was found in the ebm.term_features_ attribute. Our lookup table needs to be two dimensional for this pair with the index being:

pair_scores = ebm.term_scores_[2]
sample_score = pair_scores[feature_0_index, feature_1_index]
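A toy version of this two-dimensional lookup is sketched below using nested lists. The scores are made up for illustration, and `lookup` is a hypothetical helper; with a NumPy array the same lookup is `pair_scores[tuple(index)]`.

```python
# Made-up 5 x 4 score table. Rows index feature 0 (missing, low, medium,
# high, unknown); columns index feature 1 at interaction resolution
# (missing, lower bin, upper bin, unknown).
pair_scores = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, -0.2, 0.1, 0.0],
    [0.0, -0.3, 0.2, 0.0],
    [0.0, 0.4, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]


def lookup(table, index):
    """Walk a nested list with a multi-dimensional index, one axis at a time."""
    for i in index:
        table = table[i]
    return table


# the sample ("high", 2.75) bins to feature_0_index=3, feature_1_index=2
sample_score = lookup(pair_scores, (3, 2))
print(sample_score)
```

The table has one axis per feature in the term, which is why a pair needs a two-part index.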

Let's put this together now and make predictions for the original dataset.

In [None]:
# this example is ONLY meant to work for the example above. If you need code that 
# works with all inputs, use the code in the multiclass example, which handles everything

sample_scores = []
# we have 4 samples in X, so loop 4 times
for sample in X:
    # start from the intercept for each sample
    # np.ravel accepts either a scalar or a 1-element array
    score = float(np.ravel(ebm.intercept_)[0])
    
    # We have 3 terms: two main effects, and 1 pair interaction
    for term_idx, features in enumerate(ebm.term_features_):
        # We'll be indexing into a tensor, so our index needs to be multi-dimensional
        tensor_index = []

        # for each feature that is a component of the term
        # main effects will have only 1, pairs will have 2
        for feature_idx in features:
            feature_val = sample[feature_idx]

            if feature_val is None or feature_val is np.nan:
                # missing values are always in the 0th bin
                bin_idx = 0
            else:
                # we bin differently for main effects and pairs, so first 
                # get the list containing the bins for different resolutions

                if len(features) == 1 or len(ebm.bins_[feature_idx]) == 1:
                    # this is a main effect or only one bin level
                    # is available, so use the highest resolution bins
                    bins = ebm.bins_[feature_idx][0]
                else:
                    # pair with lower resolution bins available
                    bins = ebm.bins_[feature_idx][1]

                if isinstance(bins, dict):
                    # categorical feature
                    bin_idx = bins[feature_val]
                else:
                    # continuous feature
                    # add 1 because the 0th bin is reserved for 'missing'
                    bin_idx = np.digitize(feature_val, bins) + 1

            tensor_index.append(bin_idx)
        score_tensor = ebm.term_scores_[term_idx]
        score += score_tensor[tuple(tensor_index)]
    sample_scores.append(score)

logits = np.array(sample_scores)

# use the sigmoid function to convert logits to probabilities
probabilities = 1 / (1 + np.exp(-logits))

print("probability of " + ebm.classes_[1])
print(ebm.predict_proba(X)[:, 1])
print(probabilities)

For regression, our default link function was the identity link function, so the scores were the predictions.

For classification, the scores are logits and we need to apply the inverse link function to calculate probabilities, which for binary classification is the sigmoid function.
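As a small standalone illustration of the inverse link, the sketch below converts a few made-up logits into probabilities and class labels. The logits are hypothetical, the class names follow the example above, and picking the class by comparing the probability to 0.5 is a simplification for illustration.

```python
import math


def sigmoid(logit):
    """Numerically stable sigmoid: avoids overflow for large negative logits."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    e = math.exp(logit)
    return e / (1.0 + e)


classes = ["apples", "oranges"]  # sorted, as in ebm.classes_
logits = [-2.0, 0.5, 2.0]  # made-up scores for illustration

predicted = []
for logit in logits:
    p_positive = sigmoid(logit)  # probability of classes[1]
    p_negative = 1.0 - p_positive  # probability of classes[0]
    predicted.append(classes[1] if p_positive > 0.5 else classes[0])
    print(f"{logit:+.1f} -> P({classes[1]}) = {p_positive:.3f}")

print(predicted)
```

A positive logit favors the second class in ebm.classes_, and a negative logit favors the first.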