GAMChanger fails to load data in some cases #4

aaronsnoswell · 2022-05-24T02:59:47Z

I've found a bug where GAMChanger sometimes doesn't populate the 'metrics' / 'feature' / 'history' panel. It seems that when this happens, the GAMChanger interface has failed to load the validation samples, because the status bar says "0/0 validation samples selected".

This seems to occur sometimes based on the data that is provided, and might have something to do with missing data points, but I'm struggling to figure out exactly what the cause is.

~~Below is the smallest reproducing example I can come up with.~~

See following comment for a better MWE.

import pandas as pd
import gamchanger as gc
from interpret.glassbox import ExplainableBoostingRegressor

# Works
X = pd.read_csv('demo-X-succeed.csv')
y = pd.read_csv('demo-y-succeed.csv')['OrderedFractionOfEstate']

# Doesn't work
#X = pd.read_csv('demo-X-fail.csv')
#y = pd.read_csv('demo-y-fail.csv')['OrderedFractionOfEstate']

ebm = ExplainableBoostingRegressor(interactions=False)
ebm.fit(X, y)

gc.visualize(ebm, X, y)

I've attached the CSV files, which differ in that the 'succeed' files have a single extra data point. That is, when loading 'demo-[X|y]-fail.csv' the GamChanger interface loads, but the side panel doesn't populate (unexpected behaviour). When loading 'demo-[X|y]-succeed.csv', the GamChanger interface loads and the side panel populates the metrics as expected.

demo-X-fail.csv
demo-X-succeed.csv
demo-y-fail.csv
demo-y-succeed.csv

aaronsnoswell · 2022-05-24T03:57:54Z

I have produced a more compact MWE;

import numpy as np
import pandas as pd
import gamchanger as gc

from interpret.glassbox import ExplainableBoostingRegressor

size = 5
x1 = np.linspace(0, 10, size)
y = -1.0 * x1.copy() + 3.0

# Introduce missing data
x1[1] = np.nan
x1[2] = np.nan

# With only two missing datapoints, the GAMChanger interface loads fine
# If we introduce a third missing feature value by un-commenting the below
# line, the validation data fails to load
#x1[3] = np.nan

df = pd.DataFrame(
    data={
        'x1': x1,
        'y' : y
    }
)

X = df[['x1']]
y = df['y']

print(df)

# Train model
ebm = ExplainableBoostingRegressor(interactions=False)
ebm.fit(X, y)

gc.visualize(ebm, X, y)

aaronsnoswell · 2022-05-24T04:35:21Z

...update... based on the above MWE, I have been able to narrow down the error to this javascript uncaught error in the Firefox JS console;

...which I believe is coming from the variable messenger_js_base64 at gamchanger.py:528.

For the failing case, this javascript (before base64 encoding) looks like this;

(function() {
    let data = {
        "model": {
            "intercept": -0.849715269828704,
            "isClassifier": false,
            "features": [
                {
                    "name": "x1",
                    "type": "continuous",
                    "importance": 0.14849663043758726,
                    "additive": [-0.1856, -0.1856],
                    "error": [0.7972, 0.7972],
                    "id": [0],
                    "count": [1, 1],
                    "binEdge": [0.0, 5.0, 10.0],
                    "histEdge": [0.0, 10.0],
                    "histCount": [2]
                }
            ],
            "labelEncoder": {},
            "scoreRange": [-0.9828, 0.6905]
        },
        "sample": {
            "featureNames": ["x1"],
            "featureTypes": ["continuous"],
            "samples": [[0.0], [NaN], [NaN], [NaN], [10.0]],
            "labels": [3.0, 0.5, -2.0, -4.5, -7.0]
        }
    };
    let event = new Event('gamchangerData');
    event.data = data;
    console.log('before');
    console.log(data);
    document.dispatchEvent(event);
}())
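
A note on the bare `NaN` tokens in `samples`: this is what Python's standard `json` module emits for `float('nan')` by default. Whether gamchanger.py actually builds this payload with `json.dumps` is an assumption here, not something confirmed in this thread. The sketch below only shows the standard-library behaviour, plus a strict mode that would surface missing values on the Python side before they reach the browser.

```python
import json
import math

# Sample rows shaped like the "samples" array in the payload above.
samples = [[0.0], [math.nan], [math.nan], [math.nan], [10.0]]

# By default json.dumps writes NaN as the bare token NaN, which is valid
# JavaScript but not strict JSON.
payload = json.dumps(samples)
print(payload)

# With allow_nan=False the same call raises ValueError instead.
try:
    json.dumps(samples, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False
print(strict_ok)
```

Because the payload is evaluated as JavaScript rather than parsed as JSON, the `NaN` tokens themselves are legal here; any failure would have to come from code that consumes the values later.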

aaronsnoswell · 2022-05-24T04:49:42Z

Following the rabbit trail from the gamchangerData event down, I can see that this event is intercepted at GAM.svelte:663, which calls initDataLoaded, which calls initGAMView then initSidebar.

Of these two functions, initSidebar is the only one that uses a Promise (which is mentioned in the JS error), so perhaps initSidebar at

gam-changer/src/GAM.svelte

Line 447 in ec85c7a

await new Promise(r => setTimeout(r, 500));

is the culprit here?

At this point, my limited knowledge of TypeScript and WASM is stopping me from investigating this bug further. I suspect the issue is coming from initGAMView or initSidebar, but the GAMChanger code I have is minified and base64 encoded, and without a readable version to debug and iterate on I can't look into this more.

I would very much appreciate help from the devs to track down this bug! Presently, this is preventing me from using GAMChanger with my application (predicting court case outcomes).

xiaohk · 2022-05-25T05:53:33Z

Wow @aaronsnoswell thank you so much for your detailed report and effort in debugging this issue!

I tried to reproduce this error using your example, but I got a ValueError when fitting an EBM model with missing values. I believe EBM does not support missing values yet? My interpret version is 0.2.7.

     x1    y
0   0.0  3.0
1   NaN  0.5
2   NaN -2.0
3   7.5 -4.5
4  10.0 -7.0
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-13a640a7d74a> in <module>
     32 # Train model
     33 ebm = ExplainableBoostingRegressor(interactions=False)
---> 34 ebm.fit(X, y)

~/miniconda3/envs/gam/lib/python3.7/site-packages/interpret/glassbox/ebm/ebm.py in fit(self, X, y, sample_weight)
    822         #     AND add some tests for the X.dim == 1 scenario
    823 
--> 824         # TODO PK write an efficient striping converter for X that replaces unify_data for EBMs
    825         # algorithm: grap N columns and convert them to rows then process those by sending them to C
    826 

~/miniconda3/envs/gam/lib/python3.7/site-packages/interpret/utils/all.py in unify_data(data, labels, feature_names, feature_types, missing_data_allowed)
    340         msg = "Missing values are currently not supported."
    341         log.error(msg)
--> 342         raise ValueError(msg)
    343 
    344     return new_data, new_labels, new_feature_names, new_feature_types

ValueError: Missing values are currently not supported.

aaronsnoswell · 2022-05-26T01:18:22Z

Hi @xiaohk thanks for getting back to me!

The latest versions of Interpret have experimental support for missing values - I forgot to mention that I am using this experimental code. To enable it, you need to change a few places in the interpret source code. See this comment on interpretml/interpret#18 for the details.

So for instance, I checked pip show interpret to get the install location, then opened ebm.py, and changed all instances of missing_data_allowed=False to missing_data_allowed=True in ebm.py.
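
The manual edit can also be scripted. This is a minimal sketch, assuming the flag appears literally as `missing_data_allowed=False` in the file; the helper names `enable_missing_data` and `patch_file` are hypothetical, and the demo string is a made-up call, not real interpret source.

```python
from pathlib import Path
from typing import Tuple

OLD = "missing_data_allowed=False"
NEW = "missing_data_allowed=True"


def enable_missing_data(source: str) -> Tuple[str, int]:
    """Return the patched source and the number of replacements made."""
    return source.replace(OLD, NEW), source.count(OLD)


def patch_file(path: Path) -> int:
    """Patch a file in place, keeping a .bak copy; returns replacements made."""
    original = path.read_text(encoding="utf-8")
    patched, count = enable_missing_data(original)
    if count:
        path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
        path.write_text(patched, encoding="utf-8")
    return count


# Demonstration on an in-memory snippet (hypothetical call, not interpret code).
demo = "X, y = unify_data(X, y, missing_data_allowed=False)\n"
patched_demo, n = enable_missing_data(demo)
print(n, patched_demo.strip())
```

To apply it for real, call `patch_file` with the path to ebm.py under the install location reported by `pip show interpret`. The `.bak` copy makes it easy to revert, since this modifies an installed package.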

After doing that, the example should work.

Thanks again!


xiaohk · 2022-05-30T00:00:34Z

I see, thanks!

EBM's experimental support for missing values introduces a separate additive_scores[0] that is only used when computing the prediction score for missing values. To fully support this, GAM Changer has to visualize the missing value "bin" in the GAM Canvas; I will leave it for future work. For now, GAM Changer will remove all rows in X that contain any missing values in get_sample_data().
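
A minimal sketch of the row-dropping idea, written against plain lists shaped like the `samples` and `labels` arrays in the payload earlier in the thread. The helper name `drop_missing_rows` is hypothetical, and this is not the code in `get_sample_data()`; it only illustrates that a row and its label are dropped together.

```python
import math
from typing import List, Sequence, Tuple


def drop_missing_rows(
    samples: Sequence[Sequence[float]], labels: Sequence[float]
) -> Tuple[List[List[float]], List[float]]:
    """Keep only rows in which no feature value is NaN or None."""
    kept_samples, kept_labels = [], []
    for row, label in zip(samples, labels):
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in row):
            continue  # skip the row and its label together
        kept_samples.append(list(row))
        kept_labels.append(label)
    return kept_samples, kept_labels


samples = [[0.0], [math.nan], [math.nan], [math.nan], [10.0]]
labels = [3.0, 0.5, -2.0, -4.5, -7.0]
clean_samples, clean_labels = drop_missing_rows(samples, labels)
print(clean_samples, clean_labels)
```

With a pandas DataFrame `X` and Series `y`, the same effect can be had with a boolean mask such as `mask = X.notna().all(axis=1)` followed by `X[mask]` and `y[mask]`.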

fb7ba18#diff-de8698f459a11697fd2d6614444871f69e802ae1af354cd35aba32e62e6698bbR267-R277

I will close this issue for now. Let me know if it doesn't work for you @aaronsnoswell. Thanks for reaching out to me!

aaronsnoswell · 2022-05-30T01:31:23Z

Thanks for looking into this, @xiaohk!

fb7ba18 seems like a good patch for now. Dropping all rows with missing values is pretty rough for users with real-world data though in the longer run :D

I'd be happy to take a stab at adding proper support for missing values if you can provide a little guidance for me. E.g. could you draw a sketch / doodle of what the GAMChanger interface should look like to show the missing value bin (where this would go in the interface?). Also, is there any documentation about setting up a development environment for GAMChanger?

xiaohk · 2022-06-01T05:29:44Z

Thank you so much for your interest! I believe supporting missing values will be super helpful. Adding this feature might sound straightforward, but I am sure it would require A LOT of work. 😅

Some high-level steps:

1. Support missing value in the EBM inference in WebAssembly
   1. Need to handle continuous, categorical features, and interaction terms (implementations are different)
2. Visualize the missing value
   1. Continuous feature: a separate dot on the line chart
   2. Categorical feature: a separate bin in the bar chart
   3. Cont X Cont interaction: a new row and a new column in the matrix
   4. Cont X Cat interaction: many new separate bars (when NA happens in cont) or a new bar (when NA happens in cat)
   5. Cat X Cat interaction: a new row/column of dots
3. Interaction with the missing value
   1. Integrate editing tools to support missing values: e.g., align/interpolate/monotonicity do not really make sense.
   2. We need to specially handle/prevent users from selecting regular bins and NA bin all at once
4. Integration with other views
   1. Feature panel
   2. History panel: new event logging when editing missing values
   3. Footer: new event name when editing missing values
5. Loading .gamchanger files
   1. Need to load missing value related parameters
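
For the inference step, here is a rough sketch of what a per-feature score lookup with a dedicated missing-value slot could look like for one continuous feature. It follows the description earlier in the thread that index 0 of the additive scores is reserved for missing values. The function name, the bin-edge convention (right-open bins, with out-of-range values clamped to the first or last bin), and the score values are all assumptions for illustration, not the ebm.js or interpret implementation.

```python
import math
from bisect import bisect_right
from typing import Optional, Sequence


def feature_score(
    x: Optional[float],
    bin_edges: Sequence[float],
    scores_with_missing: Sequence[float],
) -> float:
    """Look up the additive score for one continuous feature value.

    scores_with_missing[0] is the missing-value score; the remaining entries
    hold one score per regular bin, so the list has len(bin_edges) entries.
    """
    if x is None or math.isnan(x):
        return scores_with_missing[0]
    n_bins = len(bin_edges) - 1
    # bisect_right gives the 1-based index of the bin [edge_i, edge_{i+1}).
    idx = bisect_right(bin_edges, x)
    # Clamp values below the first edge or at/above the last edge.
    idx = min(max(idx, 1), n_bins)
    return scores_with_missing[idx]


bin_edges = [0.0, 5.0, 10.0]
scores = [0.25, -0.5, 0.75]  # hypothetical values: [missing, bin 1, bin 2]
results = [feature_score(v, bin_edges, scores) for v in (0.0, math.nan, 7.5, 10.0)]
print(results)
```

Categorical features and interaction terms would need their own lookups, which is why the first step notes that the implementations differ.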

To set up a development environment for GAM Changer:

# Clone the repository and enter it
git clone git@github.com:interpretml/gam-changer.git
cd gam-changer

# Install the dependencies
npm install

# Start a development server
npm run dev

You might have noticed that the EBM inference and isotonic regression WebAssembly code are shipped as binaries in this repo. Their source code is at xiaohk/ebm.js and xiaohk/isotonic.js, respectively.

If you are interested, I am happy to provide any sketches, feedback, and guides that can help you! It would be a hard and rewarding contribution to GAM Changer!

aaronsnoswell · 2022-06-03T04:15:40Z

Wow :) That does sound like a lot of work.

A first point - interpretml doesn't currently support missing values for continuous features, and I don't believe they plan to - that would potentially reduce some of the interactions you mention above and simplify the workload.

Perhaps a good starting point is to figure out how stable the interpretml missing value support is. Assuming I can rely on it not changing too much, I could potentially bite off part of this work list in a new branch to get the ball rolling. I will inquire over there and report back.

P.S. Thanks for the dev environment instructions.

xiaohk added a commit that referenced this issue May 29, 2022

Handle missing values #4 thanks @aaronsnoswell

fb7ba18

Signed-off-by: Jay Wang <jay@zijie.wang>

xiaohk closed this as completed May 30, 2022

xiaohk mentioned this issue Jun 14, 2022

Version 0.1.4 breaks gam-changer-adult.ipynb notebook #5

Closed
