
How to plot EBM explain_global().data() with custom code: EBM shape plot is cut at a certain x-value; ebm_global.data(i)["names"] only contains a subset of potential x values #325

Closed
NicoHambauer opened this issue Mar 12, 2022 · 4 comments

Comments


NicoHambauer commented Mar 12, 2022

Hi dear interpretML team!
I love your work!
However, I am facing an issue while trying to correct a plot for a paper that is in the publishing process (conditional accept).

Here is an image of the plot I have (the feature capital.gain is under review):

[screenshot: custom shape plot of capital.gain, cut off before the feature maximum]

I saw that there is a dependence on the parameter max_bins, which defaults to 256. If I increase that parameter, I get a wider range of plot values on the x-axis, but the whole range of this feature is still not included in the shape plot.

The number of x values is 79, but they only range up to 70654.5, while the maximum value of this feature is 99,999 (using max_bins=256); the GAM splines model also uses 99,999 as the maximum value of this feature:

[screenshot: shape plot with max_bins=256, x-axis ending at 70654.5]
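
To check this, one can inspect the bin edges that the EBM exposes for the feature. Below is a minimal sketch; it assumes X and y hold the raw adult features and labels, prepared as in the data-loading code further below (before scaling):

from interpret.glassbox import ExplainableBoostingClassifier

for bins in (256, 1024):
    ebm = ExplainableBoostingClassifier(max_bins=bins, interactions=0)
    ebm.fit(X, y)
    ebm_global = ebm.explain_global()
    i = ebm_global.data()['names'].index('capital.gain')
    edges = ebm_global.data(i)['names']
    # how many x values the explanation contains and how far the largest one extends
    print(f"max_bins={bins}: {len(edges)} x values, largest = {edges[-1]}")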

The code I use for plotting is as follows:

Defining a function that takes my data and produces my custom plots:

# Imports used across the snippets in this issue (make_plot_interaction, make_plot,
# feature_importance_visualize and load_adult_data are my own helper functions):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler
from sklearn.utils import shuffle
from interpret.glassbox import ExplainableBoostingClassifier, ExplainableBoostingRegressor


def EBM(X, y, dataset_name, model_name='EBM'):
    if task == "classification":
        ebm = ExplainableBoostingClassifier(interactions=10, max_bins=200000)
    else:
        ebm = ExplainableBoostingRegressor(interactions=10, max_bins=200000)
    ebm.fit(X, y)
    ebm_global = ebm.explain_global()

    for i in range(len(ebm_global.data()['names'])):
        data_names = ebm_global.data()
        feature_name = data_names['names'][i]
        shape_data = ebm_global.data(i)

        if shape_data['type'] == 'interaction':
            x_name, y_name = feature_name.split('x')
            x_name = x_name.replace(' ', '')
            y_name = y_name.replace(' ', '')
            make_plot_interaction(shape_data['left_names'], shape_data['right_names'],
                                  np.transpose(shape_data['scores']),
                                  x_name, y_name, model_name, dataset_name)
            continue
        if len(shape_data['names']) == 2:
            pass
            # make_one_hot_plot(shape_data['scores'][0], shape_data['scores'][1], feature_name, model_name, dataset_name)
        else:
            make_plot(shape_data['names'][:-1], shape_data['scores'], shape_data['scores'],
                      shape_data['scores'], feature_name, model_name, dataset_name)

    feat_for_vis = dict()
    for i, n in enumerate(ebm_global.data()['names']):
        feat_for_vis[n] = {'importance': ebm_global.data()['scores'][i]}
    feature_importance_visualize(feat_for_vis, save_png=True, folder='.', name='ebm_feat_imp')

Data used, including a custom loader for the well-known adult dataset:

random_state = 1
task = 'classification'  # regression or classification

dataset, y_word_dict = load_adult_data()
dataset_name = 'adult'

X = pd.DataFrame(dataset['full']['X'])
y = np.array(dataset['full']['y'])
X, y = shuffle(X, y, random_state=random_state)

is_cat = np.array([dt.kind == 'O' for dt in X.dtypes])

num_cols = X.columns.values[~is_cat]
# one hot encoder pipeline replaced by pd.getdummies to make sure column names are concatenated
# cat_ohe_step = ('ohe', OneHotEncoder(sparse=False, handle_unknown='ignore'))

# Handle unknown data as ignore
X = pd.get_dummies(X)
X = X.reindex(columns=X.columns, fill_value=0)
dummy_column_names = X.columns

# cat_pipe = Pipeline([cat_ohe_step])
num_pipe = Pipeline([('identity', FunctionTransformer())])  # , ('scaler', RobustScaler())])
transformers = [
    # ('cat', cat_pipe, cat_cols) replaced by pd.getdummies
    ('num', num_pipe, num_cols)
]
ct = ColumnTransformer(transformers=transformers, remainder='passthrough')
ct.fit(X)
X = ct.transform(X)

X = pd.DataFrame(X, columns=dummy_column_names)

scaler_dict = {}
for c in num_cols:
    scaler = RobustScaler()
    X[c] = scaler.fit_transform(X[c].values.reshape(-1, 1))
    scaler_dict[c] = scaler
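
With the data prepared, the EBM function from the first snippet is then called on it, e.g.:

EBM(X, y, dataset_name)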

The function that makes the actual custom plot used above:

def make_plot(x, mean, upper_bounds, lower_bounds, feature_name, model_name, dataset_name, num_epochs='', debug=False):
    x = np.array(x)
    if debug:
        print("Num cols:", num_cols)
    if feature_name in num_cols:
        if debug:
            print("Feature to scale back:", feature_name)
        if feature_name == "capital.gain":
            pass  # convenient spot to halt the debugger for this feature
        x = scaler_dict[feature_name].inverse_transform(x.reshape(-1, 1)).squeeze()
    else:
        if debug:
            print("Feature not to scale back:", feature_name)

    plt.plot(x, mean, color='black')
    plt.fill_between(x, lower_bounds, mean, color='gray')
    plt.fill_between(x, mean, upper_bounds, color='gray')
    plt.xlabel('Feature value')
    plt.ylabel('Feature effect on model output')
    plt.title(f'Feature: {feature_name}')
    plt.savefig(f'plots/{model_name}_{dataset_name}_shape_{feature_name}_{num_epochs}epochs.pdf')
    plt.show()
interpret-ml (Collaborator) commented:

Hi @NicoHambauer --

Just looking at your graphs, it appears that your plotting code is drawing straight lines between points located at the centers of the bins. That isn't an exact representation of how the model works, which is what leads to your discrepancy at the boundaries. Let's say you had 3 samples and a feature with the values (1, 2, 3). The EBM binning code would separate these values into 3 bins by putting cut points at 1.5 and 2.5. When evaluating an EBM, and during graphing, there should be a constant score in the ranges between -inf and 1.5, between 1.5 and 2.5, and between 2.5 and +inf. We only display our graphs between the min and max feature value, though, since it isn't helpful to graph between -inf and +inf.

I believe your graphing code is instead putting a point at 1.25 (the average of 1 and 1.5), a point at 2 (the average of 1.5 and 2.5) and a point at 2.75 (the average of 2.5 and 3). If that's true, the reason your graph does not go up to 99999 is that your upper point is located between 99999 and the highest cut value. In my example above, that would be equivalent to having the graph go up to the 2.75 value instead of 3.
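
To make that concrete, here is a minimal numpy sketch of the binning in that example (the cut points and per-bin scores are illustrative, not taken from a fitted model):

import numpy as np

values = np.array([1.0, 2.0, 3.0])       # the three sample values from the example
cuts = np.array([1.5, 2.5])              # cut points placed between the samples
bin_scores = np.array([-0.2, 0.1, 0.4])  # one constant score per bin (made up)

# np.digitize maps each value to its bin: (-inf, 1.5), [1.5, 2.5), [2.5, +inf)
print(bin_scores[np.digitize(values, cuts)])  # -> [-0.2  0.1  0.4]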

-InterpretML team


NicoHambauer commented Mar 13, 2022

Dear interpret-ml team,
thanks for the valuable comment and useful example!
I also plotted the same feature using the built-in show() function:

[screenshot: the same feature plotted with the built-in show() dashboard]

As I see it, the problem therefore seems to be in my plotting. Since we are publishing this paper at an information systems conference, we wanted a unified plot layout across the different models.

Are there any hints you could give me to improve my implementation, e.g. pointing me towards where I can find the actual implementation of the plot that show() uses?

I understand now that the difference in my custom plot is that, instead of connecting the points (x, y) stored in the respective ebm_global.data() object, I need to keep the y value constant across each span between the points in the x array (ebm_global.data(i)["names"][:-1]). This will probably fix my custom plot, but maybe I can also refer to that implementation and reuse it for my custom plotting.
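
In code, the difference boils down to roughly this (a sketch with made-up edges and scores):

import numpy as np
import matplotlib.pyplot as plt

edges = np.array([0.0, 1.5, 2.5, 3.0])   # bin edges as in data(i)["names"] (illustrative)
scores = np.array([-0.2, 0.1, 0.4])      # one score per bin

# connecting the points with straight lines (my original approach)
plt.plot(edges[:-1], scores, label='connected points')

# holding each score constant across its bin (what the EBM represents)
plt.step(edges, np.r_[scores, scores[-1]], where='post', label='step function')
plt.legend()
plt.show()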

Best regards and thanks in advance!!
Nico

interpret-ml (Collaborator) commented:

@NicoHambauer --

Here's the function for plotting continuous features: https://github.com/interpretml/interpret/blob/2327384678bd365b2c22e014f8591e6ea656263a/python/interpret-core/interpret/visual/plot.py#L115

-InterpretML team

NicoHambauer (Author) commented:

Dear InterpretML team!
Much appreciation for your responses and quick help!!
I was able to resolve my issue, and the plot is now correct, as I verified by comparing it to the show(..) dashboard!

[screenshots: the corrected custom plot and the corresponding show(..) dashboard plot]

For anyone else reading this issue, here is the code I used for custom plotting:

def make_plot_ebm(data_dict, feature_name, model_name, dataset_name, num_epochs='', debug=False):
    x_vals = data_dict["names"].copy()
    y_vals = data_dict["scores"].copy()

    # "names" holds the bin edges, so it has one more entry than "scores".
    # Repeat the last score so that len(y_vals) == len(x_vals); otherwise the last
    # bin would be missing from the step plot (or drop to zero with plt.stairs).
    y_vals = np.r_[y_vals, y_vals[np.newaxis, -1]]

    # This is the code interpretml also uses: https://github.com/interpretml/interpret/blob/2327384678bd365b2c22e014f8591e6ea656263a/python/interpret-core/interpret/visual/plot.py#L115

    # main_line = go.Scatter(
    #     x=x_vals,
    #     y=y_vals,
    #     name="Main",
    #     mode="lines",
    #     line=dict(color="rgb(31, 119, 180)", shape="hv"),
    #     fillcolor="rgba(68, 68, 68, 0.15)",
    #     fill="none",
    # )
    #
    # main_fig = go.Figure(data=[main_line])
    # main_fig.show()
    # main_fig.write_image(f'plots/{model_name}_{dataset_name}_shape_{feature_name}_{num_epochs}epochs.pdf')
    
    
    # This is my custom code used for plotting
    x = np.array(x_vals)
    if debug:
        print("Num cols:", num_cols)
    if feature_name in num_cols:
        if debug:
            print("Feature to scale back:", feature_name)
        x = scaler_dict[feature_name].inverse_transform(x.reshape(-1, 1)).squeeze()
    else:
        if debug:
            print("Feature not to scale back:", feature_name)

    plt.step(x, y_vals, where="post", color='black')
    # plt.fill_between(x, lower_bounds, mean, color='gray')
    # plt.fill_between(x, mean, upper_bounds, color='gray')
    plt.xlabel('Feature value')
    plt.ylabel('Feature effect on model output')
    plt.title(f'Feature: {feature_name}')
    plt.savefig(f'plots/{model_name}_{dataset_name}_shape_{feature_name}_{num_epochs}epochs.pdf')
    plt.show()
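
For reference, this is roughly how the function above slots into the loop from my first post (interactions are still handled separately, as before):

for i in range(len(ebm_global.data()['names'])):
    feature_name = ebm_global.data()['names'][i]
    shape_data = ebm_global.data(i)
    if shape_data['type'] == 'interaction':
        continue
    make_plot_ebm(shape_data, feature_name, model_name, dataset_name)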

Again huge thanks!
Best regards,
Nico Hambauer
