GAMChanger fails to load data in some cases #4

Closed
aaronsnoswell opened this issue May 24, 2022 · 9 comments

@aaronsnoswell

aaronsnoswell commented May 24, 2022

I've found a bug where GAMChanger sometimes doesn't populate the 'metrics' / 'feature' / 'history' panel. It seems that when this happens, the GAMChanger interface has failed to load the validation samples, because the status bar says "0/0 validation samples selected".

This seems to occur sometimes based on the data that is provided, and might have something to do with missing data points, but I'm struggling to figure out exactly what the cause is.

Below is the smallest reproducing example I can come up with.

See following comment for a better MWE.

import pandas as pd
import gamchanger as gc
from interpret.glassbox import ExplainableBoostingRegressor

# Works
X = pd.read_csv('demo-X-succeed.csv')
y = pd.read_csv('demo-y-succeed.csv')['OrderedFractionOfEstate']

# Doesn't work
#X = pd.read_csv('demo-X-fail.csv')
#y = pd.read_csv('demo-y-fail.csv')['OrderedFractionOfEstate']

ebm = ExplainableBoostingRegressor(interactions=False)
ebm.fit(X, y)

gc.visualize(ebm, X, y)

I've attached the CSV files, which differ only in that the 'succeed' files have a single extra data point. That is, when loading 'demo-[X|y]-fail.csv' the GAMChanger interface loads, but the side panel doesn't populate (unexpected behaviour). When loading 'demo-[X|y]-succeed.csv', the GAMChanger interface loads and the side panel populates the metrics as expected.

demo-X-fail.csv
demo-X-succeed.csv
demo-y-fail.csv
demo-y-succeed.csv

@aaronsnoswell
Author

I have produced a more compact MWE:

import numpy as np
import pandas as pd
import gamchanger as gc

from interpret.glassbox import ExplainableBoostingRegressor

size = 5
x1 = np.linspace(0, 10, size)
y = -1.0 * x1.copy() + 3.0

# Introduce missing data
x1[1] = np.nan
x1[2] = np.nan

# With only two missing datapoints, the GAMChanger interface loads fine.
# If we introduce a third missing feature value by un-commenting the line
# below, the validation data fails to load.
#x1[3] = np.nan

df = pd.DataFrame(
    data={
        'x1': x1,
        'y' : y
    }
)

X = df[['x1']]
y = df['y']

print(df)

# Train model
ebm = ExplainableBoostingRegressor(interactions=False)
ebm.fit(X, y)

gc.visualize(ebm, X, y)

@aaronsnoswell
Author

aaronsnoswell commented May 24, 2022

...update... based on the above MWE, I have been able to narrow down the error to this uncaught JavaScript error in the Firefox JS console:

[Screenshot: uncaught JavaScript error in the Firefox JS console]

...which I believe is coming from the variable messenger_js_base64 at gamchanger.py:528.

For the failing case, this javascript (before base64 encoding) looks like this;

(function() {
    let data = {
        "model": {
            "intercept": -0.849715269828704,
            "isClassifier": false,
            "features": [
                {
                    "name": "x1",
                    "type": "continuous",
                    "importance": 0.14849663043758726,
                    "additive": [-0.1856, -0.1856],
                    "error": [0.7972, 0.7972],
                    "id": [0],
                    "count": [1, 1],
                    "binEdge": [0.0, 5.0, 10.0],
                    "histEdge": [0.0, 10.0],
                    "histCount": [2]
                }
            ],
            "labelEncoder": {},
            "scoreRange": [-0.9828, 0.6905]
        },
        "sample": {
            "featureNames": ["x1"],
            "featureTypes": ["continuous"],
            "samples": [[0.0], [NaN], [NaN], [NaN], [10.0]],
            "labels": [3.0, 0.5, -2.0, -4.5, -7.0]
        }
    };
    let event = new Event('gamchangerData');
    event.data = data;
    console.log('before');
    console.log(data);
    document.dispatchEvent(event);
}())
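For anyone wondering how the bare NaN entries can end up in a payload like this: Python's standard json module emits NaN tokens by default. A NaN literal is valid inside a JavaScript object literal like the one above, but it is not strict JSON. The sketch below only illustrates that behaviour with the standard library; it is not taken from gamchanger.py, and nan_to_none is a hypothetical helper name.

```python
import json
import math


def nan_to_none(value):
    """Recursively replace float NaN with None (hypothetical helper)."""
    if isinstance(value, float) and math.isnan(value):
        return None
    if isinstance(value, list):
        return [nan_to_none(v) for v in value]
    if isinstance(value, dict):
        return {k: nan_to_none(v) for k, v in value.items()}
    return value


# A cut-down version of the "sample" section shown above.
sample = {"samples": [[0.0], [float("nan")], [10.0]], "labels": [3.0, 0.5, -7.0]}

# Default behaviour: NaN is written as a bare NaN token.
default_text = json.dumps(sample)

# Strict behaviour: NaN must be replaced first, otherwise
# json.dumps(..., allow_nan=False) raises ValueError.
strict_text = json.dumps(nan_to_none(sample), allow_nan=False)

print(default_text)
print(strict_text)
```

Whether the front end can do anything sensible with null (or NaN) samples is a separate question; this only shows where the tokens come from.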

@aaronsnoswell
Author

aaronsnoswell commented May 24, 2022

Following the rabbit trail from the gamchangerData event down, I can see that this event is intercepted at GAM.svelte:663, which calls initDataLoaded, which calls initGAMView then initSidebar.

Of these two functions, initSidebar is the only one that uses a Promise (which is mentioned in the JS error), so perhaps initSidebar at

await new Promise(r => setTimeout(r, 500));
is the culprit here?

At this point, my limited knowledge of TypeScript and WASM is stopping me from investigating this bug further. I suspect the issue is coming from initGAMView or initSidebar, but without a non-minified build of GAMChanger that isn't base64 encoded, I can't debug and iterate to look into this more.

I would very much appreciate help from the devs to track down this bug! Presently, this is preventing me from using GAMChanger with my application (predicting court case outcomes).

@xiaohk
Collaborator

xiaohk commented May 25, 2022

Wow @aaronsnoswell thank you so much for your detailed report and effort in debugging this issue!

I tried to reproduce this error using your example, but I got a ValueError when fitting an EBM model with missing values. I believe EBM does not support missing values yet? My interpret version is 0.2.7.

     x1    y
0   0.0  3.0
1   NaN  0.5
2   NaN -2.0
3   7.5 -4.5
4  10.0 -7.0
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-13a640a7d74a> in <module>
     32 # Train model
     33 ebm = ExplainableBoostingRegressor(interactions=False)
---> 34 ebm.fit(X, y)

~/miniconda3/envs/gam/lib/python3.7/site-packages/interpret/glassbox/ebm/ebm.py in fit(self, X, y, sample_weight)
    822         #     AND add some tests for the X.dim == 1 scenario
    823 
--> 824         # TODO PK write an efficient striping converter for X that replaces unify_data for EBMs
    825         # algorithm: grap N columns and convert them to rows then process those by sending them to C
    826 

~/miniconda3/envs/gam/lib/python3.7/site-packages/interpret/utils/all.py in unify_data(data, labels, feature_names, feature_types, missing_data_allowed)
    340         msg = "Missing values are currently not supported."
    341         log.error(msg)
--> 342         raise ValueError(msg)
    343 
    344     return new_data, new_labels, new_feature_names, new_feature_types

ValueError: Missing values are currently not supported.

@aaronsnoswell
Author

Hi @xiaohk thanks for getting back to me!

The latest versions of Interpret have experimental support for missing values - I forgot to mention that I am using this experimental code. To enable it, you need to change a few places in the interpret source code. See this comment on interpretml/interpret#18 for the details.

So for instance, I checked pip show interpret to get the install location, then opened ebm.py, and changed all instances of missing_data_allowed=False to missing_data_allowed=True in ebm.py.

After doing that, the example should work.
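If you would rather not edit the file by hand, the substitution can be sketched as a small text transform. This is only an illustration of the manual change described above: enable_missing_data is a hypothetical helper, the whitespace tolerance in the pattern is an assumption about how the argument is written in ebm.py, and the experimental code path may change between interpret versions.

```python
import re
from typing import Tuple

# Matches the keyword argument described above; tolerating spaces around
# "=" is an assumption about the formatting used in ebm.py.
_PATTERN = re.compile(r"missing_data_allowed\s*=\s*False")


def enable_missing_data(source: str) -> Tuple[str, int]:
    """Return (patched_source, number_of_replacements). Hypothetical helper."""
    return _PATTERN.subn("missing_data_allowed=True", source)


# Demonstration on a made-up line of source text, not the real ebm.py.
example = "unify_data(X, y, missing_data_allowed=False)\n"
patched, count = enable_missing_data(example)
print(count, patched, end="")
```

To apply it for real you would read ebm.py from the install location reported by pip show interpret, keep a backup copy, and write the patched text back.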

Thanks again!

xiaohk added a commit that referenced this issue May 29, 2022
Signed-off-by: Jay Wang <jay@zijie.wang>
@xiaohk
Collaborator

xiaohk commented May 30, 2022

I see, thanks!

EBM's experimental support for missing values introduces a separate additive_scores[0] that is only used when computing the prediction score for missing values. To fully support this, GAM Changer has to visualize the missing value "bin" in the GAM Canvas; I will leave that for future work. For now, GAM Changer will remove all rows in X that contain any missing values in get_sample_data().

fb7ba18#diff-de8698f459a11697fd2d6614444871f69e802ae1af354cd35aba32e62e6698bbR267-R277
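As a rough picture of what "remove all rows in X that contain any missing values" means for the MWE in this thread, here is a minimal pandas sketch. It is not the code from the commit linked above; drop_rows_with_missing is a hypothetical name, and it assumes X is a DataFrame and y a Series sharing the same index.

```python
import numpy as np
import pandas as pd


def drop_rows_with_missing(X, y):
    """Return X and y without the rows where any feature in X is missing.

    Hypothetical helper; assumes X (DataFrame) and y (Series) share an index.
    """
    keep = X.notna().all(axis=1)  # True for rows with no missing feature
    return (
        X.loc[keep].reset_index(drop=True),
        y.loc[keep].reset_index(drop=True),
    )


# Same shape as the failing MWE: three of five feature values are missing.
x1 = np.linspace(0, 10, 5)
x1[[1, 2, 3]] = np.nan
X = pd.DataFrame({"x1": x1})
y = pd.Series(-1.0 * np.linspace(0, 10, 5) + 3.0, name="y")

X_clean, y_clean = drop_rows_with_missing(X, y)
print(X_clean)
print(y_clean)
```

Note that the labels have to be filtered with the same mask as the features, otherwise the validation samples and labels no longer line up.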

I will close this issue for now. Let me know if it doesn't work for you @aaronsnoswell. Thanks for reaching out to me!

@xiaohk xiaohk closed this as completed May 30, 2022
@aaronsnoswell
Author

Thanks for looking into this, @xiaohk!

fb7ba18 seems like a good patch for now. Dropping all rows with missing values is pretty rough for users with real-world data though in the longer run :D

I'd be happy to take a stab at adding proper support for missing values if you can provide a little guidance for me. E.g. could you draw a sketch / doodle of what the GAMChanger interface should look like to show the missing value bin (where this would go in the interface?). Also, is there any documentation about setting up a development environment for GAMChanger?

@xiaohk
Collaborator

xiaohk commented Jun 1, 2022

Thank you so much for your interest! I believe supporting missing values will be super helpful. Adding this feature might sound straightforward, but I am sure it would require A LOT of work. 😅

Some high-level steps:

  1. Support missing values in the EBM inference in WebAssembly
    1. Need to handle continuous features, categorical features, and interaction terms (implementations are different)
  2. Visualize the missing value
    1. Continuous feature: a separate dot on the line chart
    2. Categorical feature: a separate bin in the bar chart
    3. Cont X Cont interaction: a new row and a new column in the matrix
    4. Cont X Cat interaction: many new separate bars (when NA happens in cont) or a new bar (when NA happens in cat)
    5. Cat X Cat interaction: a new row/column of dots
  3. Interaction with the missing value
    1. Integrate editing tools to support missing values: e.g., align/interpolate/monotonicity do not really make sense.
    2. We need to specially handle/prevent users from selecting regular bins and NA bin all at once
  4. Integration with other views
    1. Feature panel
    2. History panel: new event logging when editing missing values
    3. Footer: new event name when editing missing values
  5. Loading .gamchanger files
    1. Need to load missing value related parameters
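To make step 1 a bit more concrete for a single continuous feature, here is a minimal Python sketch of a bin lookup with a separate score for missing values. It is not the WebAssembly code from ebm.js and not interpret's implementation: score_continuous and the missing_score parameter are hypothetical, and the layout (one additive score per bin, bin edges including both endpoints) is assumed from the data dump earlier in this thread.

```python
import numpy as np


def score_continuous(values, bin_edges, additive, missing_score):
    """Look up the additive score of each value; NaN gets missing_score.

    Hypothetical helper. Assumes len(additive) == len(bin_edges) - 1.
    """
    values = np.asarray(values, dtype=float)
    additive = np.asarray(additive, dtype=float)
    # Only the interior edges are needed to decide which bin a value is in.
    inner = np.asarray(bin_edges, dtype=float)[1:-1]
    # np.digitize places NaN in the last bin; that is overwritten below.
    idx = np.digitize(values, inner)
    scores = additive[idx]
    scores[np.isnan(values)] = missing_score
    return scores


scores = score_continuous(
    values=[0.0, np.nan, 7.5],
    bin_edges=[0.0, 5.0, 10.0],
    additive=[-1.0, 2.0],
    missing_score=0.5,
)
print(scores)
```

Categorical features and interaction terms would need their own lookups, which is part of why the list above is long.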

To set up a development environment for GAM Changer:

# Clone the repository and enter it
git clone git@github.com:interpretml/gam-changer.git
cd gam-changer

# Install the dependencies
npm install

# Start a development server
npm run dev

You might have noticed that the EBM inference and isotonic regression WebAssembly code are shipped as binaries in this repo. Their source code is at xiaohk/ebm.js and xiaohk/isotonic.js, respectively.

If you are interested, I am happy to provide any sketches, feedback, and guides that can help you! It would be a hard and rewarding contribution to GAM Changer!

@aaronsnoswell
Author

aaronsnoswell commented Jun 3, 2022

Wow :) That does sound like a lot of work.

A first point - interpretml doesn't currently support missing values for continuous features, and I don't believe they plan to - that would potentially reduce some of the interactions you mention above and simplify the workload.

Perhaps a good starting point is to figure out how stable the interpretml missing value support is. Assuming I can rely on it not changing too much, I could potentially bite off part of this work list in a new branch to get the ball rolling. I will inquire over there and report back.

P.S. Thanks for the dev environment instructions.
