# EBM Internals - Regression

This is part 1 of a 3-part series describing EBM internals and how to make predictions. For part 2, click [here](./ebm-internals-classification.ipynb). For part 3, click [here](./ebm-internals-multiclass.ipynb).

In part 1 we'll cover the simplest useful EBM: a regression model that does not have interactions, missing values, or other complications.

At their core, EBMs are additive models where the score contributions from the individual features and interactions are added together to make a prediction. Each individual feature's score is determined using a lookup table. Before doing the lookup, we first need to discretize continuous features and assign bin indexes to categorical features.
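As a minimal sketch of this additive structure, the snippet below sums an intercept and one table lookup per feature. All numbers and the names `intercept`, `country_scores`, and `size_scores` are hypothetical values chosen for illustration; they are not taken from a fitted model.

```python
import numpy as np

# Hypothetical values for illustration only; not from a fitted EBM.
intercept = 7.5
country_scores = np.array([0.0, 1.0, -0.5, 0.0])       # one score per bin of feature 0
size_scores = np.array([0.0, 0.25, 1.0, -1.25, 0.0])   # one score per bin of feature 1

# Bin indexes for a single sample, assumed to be already computed
country_bin = 2
size_bin = 3

# The prediction is the intercept plus one looked-up score per feature
prediction = intercept + country_scores[country_bin] + size_scores[size_bin]
print(prediction)  # 5.75
```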

Regression is the simplest EBM model type because the final sum is the actual prediction without needing a link function.

In [None]:
# boilerplate
from interpret import show
from interpret.glassbox import ExplainableBoostingRegressor
import numpy as np

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

In [None]:
# make a dataset composed of a nominal categorical feature, and a continuous feature 
X = np.array([["Sudan", 0.75], ["Germany", 1.75], ["Sudan", 2.75]])
y = np.array([7.5, 9.5, 5.5])

# fit an EBM with no interactions and eliminate the validation set to handle the small dataset
ebm = ExplainableBoostingRegressor(interactions=0, validation_size=0, early_stopping_rounds=1000, min_samples_leaf=1)
ebm.fit(X, y)
show(ebm.explain_global())


Let's have a look at some of the most important attributes of the ExplainableBoostingRegressor.

In [None]:
print(ebm.feature_types_in_)

Since we did not pass a feature_types argument to the \_\_init\_\_ function, some reasonable feature type guesses were assigned. Those guesses were recorded in ebm.feature_types_in_. We support the following feature types: 'continuous', 'nominal', and 'ordinal'. For evaluation purposes, 'nominal' and 'ordinal' can be treated identically, as they are both categoricals and are represented the same way in the model. 'nominal' and 'ordinal' are treated differently during training, but we'll focus just on prediction here.

In [None]:
print(ebm.bins_)

ebm.bins_ defines how to bin both categorical ('nominal' and 'ordinal') and 'continuous' features.

For categorical features we use a dictionary that maps the category strings to bin indexes.

Continuous features are defined with a list of cut points that partition the continuous range. In this case, our dataset has 3 unique values for the continuous feature: 0.75, 1.75, and 2.75. To separate these 3 values into 3 bins, we require 2 cut points. The EBM has chosen the cut points 1.25 and 2.25, but it could have chosen any value between 0.75 and 1.75 for the first cut, and any value between 1.75 and 2.75 for the second.

When making predictions for continuous features, the values we receive can be anything in the range -infinity to +infinity. With these bin cuts we have the following regions: bin #1 is [-np.inf, 1.25), bin #2 is [1.25, 2.25), and bin #3 is [2.25, +np.inf].

If there are any feature values that are equal to the bin cut value, they are placed into the higher bin. To convert a continuous feature into bins, we can use the numpy.digitize function with a slight adjustment that we'll see below.
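The binning rule can be checked directly with the cut points from this example. The sketch below assumes the adjustment is simply adding 1 to the result of numpy.digitize, so that index 0 stays free for the missing bin described later in this section; the test values are chosen for illustration.

```python
import numpy as np

cuts = np.array([1.25, 2.25])  # the cut points discussed above
values = np.array([0.75, 1.25, 1.75, 2.25, 2.75])

# np.digitize returns 0 for values below the first cut, and a value equal
# to a cut lands in the higher bin (right=False is the default).
# Adding 1 shifts every index up so that index 0 is never produced here.
bin_idx = np.digitize(values, cuts) + 1
print(bin_idx)  # [1 2 2 3 3]
```

Note that 1.25 and 2.25, which sit exactly on a cut, end up in bins 2 and 3 respectively.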

For this first example, we are going to focus solely on main effect models that do not have interactions. We allow higher order interactions to be binned with less resolution than main effects. We'll cover this in more detail later.

We also define 2 special bins: the missing bin and the unknown bin. The missing bin is where we look up the score for missing values. The unknown bin is used whenever we see a categorical value that was not in the training set. For example, if our testing dataset had the category "Vietnam" or "Brazil", then we would look up the unknown bin value, since neither of those appeared in the training set.

The missing bin is always located in the 0th index, and the unknown bin is always in the last index. 

In [None]:
print(ebm.term_scores_[0])

term_scores_ contains the lookup tables for each additive term. ebm.term_scores_[0] is the lookup table for the first feature.

Since the first feature is a nominal categorical, we use the dictionary {'Germany': 1, 'Sudan': 2} to look up which bin to use for categorical strings. If we received a missing feature value such as None, we'd use the score value at index 0. If the feature value was "Germany", we'd use the value at index 1. If the feature value was "Sudan", we'd use the value at index 2. If it was any other string, we'd use the value at index 3.
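The lookup rules above can be sketched as a small function. `categorical_bin_index` is a hypothetical helper written for this walkthrough, not part of the interpret API, and it assumes the unknown bin sits one past the largest category index, which matches the table layout described here.

```python
def categorical_bin_index(value, mapping):
    """Hypothetical helper: return the bin index for a categorical value."""
    if value is None:
        return 0  # missing bin is always index 0
    # assumption: the unknown bin is one past the largest category index
    unknown_idx = max(mapping.values()) + 1
    return mapping.get(value, unknown_idx)

mapping = {"Germany": 1, "Sudan": 2}
indexes = [categorical_bin_index(v, mapping) for v in [None, "Germany", "Sudan", "Vietnam"]]
print(indexes)  # [0, 1, 2, 3]
```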

In [None]:
print(ebm.term_scores_[1])

term_scores_[1] is the lookup table for the continuous feature in our dataset. As with categoricals, the 0th index is for missing values, and the last index (index 4 here) is for unknown values. In the context of a continuous feature, the unknown bin is anything that cannot be expressed as a float, so if we had been given the string "BAD_VALUE" instead of a number, we'd use the bin at index 4. This bin can optionally be set to np.nan if you would prefer to raise an error for non-continuous values.
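A sketch of how the continuous case could handle all three situations (missing, regular, and unknown values) follows. `continuous_bin_index` is a hypothetical helper, not part of the interpret API, and treating None and NaN as missing is an assumption made for this illustration.

```python
import math
import numpy as np

def continuous_bin_index(value, cuts):
    """Hypothetical helper: return the bin index for a continuous feature value."""
    # layout: index 0 = missing, then len(cuts) + 1 regular bins, then unknown
    unknown_idx = len(cuts) + 2
    if value is None:
        return 0  # assumption: None counts as missing
    try:
        x = float(value)
    except (TypeError, ValueError):
        return unknown_idx  # cannot be expressed as a float
    if math.isnan(x):
        return 0  # assumption: NaN counts as missing
    # add 1 because the 0th bin is reserved for missing values
    return int(np.digitize(x, cuts)) + 1

cuts = np.array([1.25, 2.25])
indexes = [continuous_bin_index(v, cuts) for v in [None, 0.75, "1.75", 2.25, "BAD_VALUE"]]
print(indexes)  # [0, 1, 2, 3, 4]
```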

In [None]:
print(ebm.intercept_)

For an EBM, the intercept_ should be very close to the base rate, which for a regression model is roughly the mean of the training targets. When making predictions, we add the intercept value to the scores from each feature to obtain the prediction.

Let's put what we've covered above together to make some predictions. This example is designed to be the simplest useful example possible, so it does not handle things like interactions, missing values, unknown values, or classification. For a more complete example, see the multiclass example which handles all of these things.

In [None]:
sample_scores = []
# we have 3 samples in X, so loop 3 times
for sample in X:
    # start from the intercept for each sample
    score = ebm.intercept_

    # we have 2 features, so add their score contributions
    for feature_idx in range(len(sample)):
        feature_val = sample[feature_idx]
        bins = ebm.bins_[feature_idx][0]
        if isinstance(bins, dict):
            # categorical feature
            bin_idx = bins[feature_val]
        else:
            # continuous feature. bins is an array of cut points
            # X is a string array (np.array coerced the mixed rows to str),
            # so convert the value to float before binning
            # add 1 because the 0th bin is reserved for 'missing'
            bin_idx = np.digitize(float(feature_val), bins) + 1

        score += ebm.term_scores_[feature_idx][bin_idx]
    sample_scores.append(score)

print(ebm.predict(X))
print(np.array(sample_scores))
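To follow the arithmetic by hand without fitting a model, the same loop can be run against hand-built tables. In the sketch below, `predict_main_effects` is a hypothetical helper, and the `bins`, `term_scores`, and `intercept` values are made up to mirror the layout of `ebm.bins_`, `ebm.term_scores_`, and `ebm.intercept_`; they were picked so the sums reproduce y and are not the values a fitted EBM would learn.

```python
import numpy as np

def predict_main_effects(X, bins, term_scores, intercept):
    """Sum the intercept and one lookup-table score per feature for each sample."""
    predictions = []
    for sample in X:
        score = intercept
        for feature_idx, feature_val in enumerate(sample):
            feature_bins = bins[feature_idx][0]
            if isinstance(feature_bins, dict):
                # categorical feature: map the string to its bin index
                bin_idx = feature_bins[feature_val]
            else:
                # continuous feature: add 1 because bin 0 is reserved for missing
                bin_idx = np.digitize(float(feature_val), feature_bins) + 1
            score += term_scores[feature_idx][bin_idx]
        predictions.append(score)
    return np.array(predictions)

# Hypothetical tables for illustration only; not from a fitted EBM.
bins = [[{"Germany": 1, "Sudan": 2}], [np.array([1.25, 2.25])]]
term_scores = [
    np.array([0.0, 1.0, -0.5, 0.0]),       # missing, Germany, Sudan, unknown
    np.array([0.0, 0.5, 1.0, -1.5, 0.0]),  # missing, 3 regular bins, unknown
]
intercept = 7.5

X_demo = [["Sudan", 0.75], ["Germany", 1.75], ["Sudan", 2.75]]
preds = predict_main_effects(X_demo, bins, term_scores, intercept)
print(preds)  # [7.5 9.5 5.5]
```

Like the loop above, this sketch does not handle missing or unknown values; a category that is absent from the dictionary raises a KeyError.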