GLM Model sometimes returns incorrect predictions #7457

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 2 comments

Comments

@exalate-issue-sync

The GLM ordinal model has an interesting bug: it returns the incorrect prediction label whenever no class has a probability > 0.50.

It can take a few tries to reproduce because you need to build a model and then feed it a row of data for which no class probability exceeds 0.5. I've managed to reproduce it reliably with the code snippet below, which builds a model and predicts with it in a loop of 100 iterations. Eventually, it fails.

{code:python}import h2o
import pandas as pd
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
cars = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"
)

def test_model():
    cars["cylinders"] = cars["cylinders"].asfactor()
    cars.rename(columns={"year": "year_make"})
    r = cars[0].runif()
    train = cars[r > 0.2]
    valid = cars[r <= 0.2]
    response = "cylinders"
    predictors = [
        "displacement",
        "power",
        "weight",
        "acceleration",
        "year_make",
    ]

    model = H2OGeneralizedLinearEstimator(seed=1234, family="ordinal")
    model.train(
        x=predictors, y=response, training_frame=train, validation_frame=valid
    )

    features = h2o.H2OFrame(pd.DataFrame([[18,101,22,23.142,1]], columns=predictors))

    model_raw_preds = model.predict(features).as_data_frame().values.tolist()[0]

    model_pred = model_raw_preds[0]  # Label
    probs = model_raw_preds[1:]  # Probabilities
    labels = [3, 4, 5, 6, 8]

    max_prob = max(probs)
    max_prob_index = probs.index(max_prob)
    prob_pred = labels[max_prob_index]

    label_probs = dict(zip(labels, probs))

    print(f'Model pred: {model_pred}, probabilities: {label_probs}')

    assert prob_pred==model_pred, f'Predictions are wrong, model gave {model_pred} but max prob was {prob_pred} with probability {max_prob}. All probs: {label_probs}'

for _ in range(100):
    test_model(){code}

An example error output:

{noformat}---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
in
49
50 for _ in range(100):
---> 51 test_model()

in test_model()
46 print(f'Model pred: {model_pred}, probabilities: {label_probs}')
47
---> 48 assert prob_pred==model_pred, f'Predictions are wrong, model gave {model_pred} but max prob was {prob_pred} with probability {max_prob}. All probs: {label_probs}'
49
50 for _ in range(100):

AssertionError: Predictions are wrong, model gave 6.0 but max prob was 8 with probability 0.4285287291223623. All probs: {3: 0.0010826152119126613, 4: 0.2415455664644905, 5: 0.011730752683124124, 6: 0.3171123365181104, 8: 0.4285287291223623}
{noformat}
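
Until this is fixed, one possible workaround is to ignore the label in the first column of the prediction frame and derive it from the per-class probabilities yourself, exactly as the assertion in the repro does. A minimal sketch, assuming the `model`, `features`, and class levels from the snippet above:

{code:python}# Workaround sketch (assumes model/features from the repro above):
# ignore the label in column 0 and take the argmax of the probability columns.
raw = model.predict(features).as_data_frame().values.tolist()[0]

labels = [3, 4, 5, 6, 8]   # class levels of "cylinders", as in the repro
probs = raw[1:]            # per-class probabilities
corrected_pred = labels[probs.index(max(probs))]
print(corrected_pred){code}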

@h2o-ops

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8194
Assignee: Wendy
Reporter: Ben Epstein
State: Resolved
Fix Version: 3.32.1.4
Attachments: N/A
Development PRs: Available

@h2o-ops

h2o-ops commented May 14, 2023

Linked PRs from JIRA

#5544

@h2o-ops h2o-ops closed this as completed May 14, 2023