GLM Model sometimes returns incorrect predictions #7457

Closed
exalate-issue-sync bot opened this issue May 11, 2023 · 2 comments

Comments

@exalate-issue-sync

The GLM ordinal model has an interesting bug: it returns the incorrect prediction label whenever no class has a probability > 0.50.

It can take a few tries to reproduce because you need to build a model and then feed it a row of data for which no class probability exceeds 0.5. I've managed to reproduce it reliably with the code snippet below, which builds a model and predicts with it in a loop of 100 iterations. Eventually, it fails.

{code:python}import h2o
import pandas as pd
from h2o.estimators import H2OGeneralizedLinearEstimator

h2o.init()
cars = h2o.import_file(
    "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"
)

def test_model():
    cars["cylinders"] = cars["cylinders"].asfactor()
    cars.rename(columns={"year": "year_make"})
    r = cars[0].runif()
    train = cars[r > 0.2]
    valid = cars[r <= 0.2]
    response = "cylinders"
    predictors = [
        "displacement",
        "power",
        "weight",
        "acceleration",
        "year_make",
    ]

    model = H2OGeneralizedLinearEstimator(seed=1234, family="ordinal")
    model.train(
        x=predictors, y=response, training_frame=train, validation_frame=valid
    )

    features = h2o.H2OFrame(pd.DataFrame([[18,101,22,23.142,1]], columns=predictors))

    model_raw_preds = model.predict(features).as_data_frame().values.tolist()[0]

    model_pred = model_raw_preds[0]  # Label
    probs = model_raw_preds[1:]  # Probabilities
    labels = [3, 4, 5, 6, 8]

    max_prob = max(probs)
    max_prob_index = probs.index(max_prob)
    prob_pred = labels[max_prob_index]

    label_probs = dict(zip(labels, probs))

    print(f'Model pred: {model_pred}, probabilities: {label_probs}')

    assert prob_pred==model_pred, f'Predictions are wrong, model gave {model_pred} but max prob was {prob_pred} with probability {max_prob}. All probs: {label_probs}'

for _ in range(100):
    test_model(){code}

An example error output:

{noformat}---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
in
49
50 for _ in range(100):
---> 51 test_model()

in test_model()
46 print(f'Model pred: {model_pred}, probabilities: {label_probs}')
47
---> 48 assert prob_pred==model_pred, f'Predictions are wrong, model gave {model_pred} but max prob was {prob_pred} with probability {max_prob}. All probs: {label_probs}'
49
50 for _ in range(100):

AssertionError: Predictions are wrong, model gave 6.0 but max prob was 8 with probability 0.4285287291223623. All probs: {3: 0.0010826152119126613, 4: 0.2415455664644905, 5: 0.011730752683124124, 6: 0.3171123365181104, 8: 0.4285287291223623}
{noformat}
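
Until this is fixed, one possible workaround is to ignore the label in the first column of the prediction frame and derive it from the per-class probabilities yourself, exactly as the assertion in the repro does. A minimal sketch, assuming the `model`, `features`, and class levels from the snippet above:

{code:python}# Workaround sketch (assumes model/features from the repro above):
# ignore the label in column 0 and take the argmax of the probability columns.
raw = model.predict(features).as_data_frame().values.tolist()[0]

labels = [3, 4, 5, 6, 8]   # class levels of "cylinders", as in the repro
probs = raw[1:]            # per-class probabilities
corrected_pred = labels[probs.index(max(probs))]
print(corrected_pred){code}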

@h2o-ops

h2o-ops commented May 14, 2023

JIRA Issue Details

Jira Issue: PUBDEV-8194
Assignee: Wendy
Reporter: Ben Epstein
State: Resolved
Fix Version: 3.32.1.4
Attachments: N/A
Development PRs: Available

@h2o-ops

h2o-ops commented May 14, 2023

Linked PRs from JIRA

#5544

@h2o-ops h2o-ops closed this as completed May 14, 2023