
h2o.relevel_by_frequency produces invalid results #15761

Closed
hutch3232 opened this issue Sep 13, 2023 · 1 comment

@hutch3232

H2O version, Operating System and Environment
Tested on the latest h2o (3.42.0.3) in R on both Windows and Linux.

Actual behavior
h2o.relevel_by_frequency appears to reassign the values of the column instead of changing only the underlying level index, so records end up recoded to the wrong level.

Expected behavior
For a given record, the value should not change when releveling; only the underlying index should change.
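
As a point of comparison, here is a minimal base R sketch of the expected semantics. It does not use h2o; the vector `x` and the names `by_freq` and `x_relevel` are made up for illustration. Reordering the levels by frequency should change the integer codes while leaving every record's label untouched.

```r
# Base R sketch (no h2o required): what "relevel by frequency" is expected to do.
# The data below is a small hypothetical example, not the HairEyeColor data.
x <- factor(c("Black", "Brown", "Brown", "Red", "Brown", "Blond", "Blond"))

# Levels ordered from most to least frequent
by_freq <- names(sort(table(x), decreasing = TRUE))

# factor() with explicit levels remaps the underlying integer codes,
# so each record keeps its original label
x_relevel <- factor(x, levels = by_freq)

levels(x_relevel)[1]                                   # most frequent level comes first
identical(as.character(x), as.character(x_relevel))    # labels per record are unchanged
identical(as.integer(x), as.integer(x_relevel))        # underlying codes are not
```

The two `identical()` checks capture the invariant: the labels agree record by record, while the integer codes differ.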

Steps to reproduce

library(data.table)
library(h2o)
h2o.init()

hair_dt <- as.data.table(HairEyeColor)
# expand the frequency table so that each combination appears N times
hair_dt <- hair_dt[rep(seq(.N), N),][, N := NULL]
hair <- as.h2o(hair_dt)
hair <- h2o.asfactor(hair)

hair
#    Hair   Eye  Sex
# 1 Black Brown Male
# 2 Black Brown Male
# 3 Black Brown Male
# 4 Black Brown Male
# 5 Black Brown Male
# 6 Black Brown Male
#
# [592 rows x 3 columns]

h2o.levels(hair$Hair)
# [1] "Black" "Blond" "Brown" "Red" 

h2o.group_by(data = hair, by = "Hair", nrow(1))
#    Hair nrow
# 1 Black  108
# 2 Blond  127
# 3 Brown  286
# 4   Red   71
#
# [4 rows x 2 columns]

hair$Hair_relevel <- h2o.relevel_by_frequency(hair$Hair)
h2o.levels(hair$Hair_relevel)
# [1] "Brown" "Blond" "Black" "Red"

h2o.group_by(data = hair, by = "Hair_relevel", nrow(1))
#   Hair_relevel nrow
# 1        Brown  108
# 2        Blond   71
# 3        Black  286
# 4          Red  127
#
# [4 rows x 2 columns]

hair
#    Hair   Eye  Sex Hair_relevel
# 1 Black Brown Male        Brown
# 2 Black Brown Male        Brown
# 3 Black Brown Male        Brown
# 4 Black Brown Male        Brown
# 5 Black Brown Male        Brown
# 6 Black Brown Male        Brown
# 
# [592 rows x 4 columns]

We can see that the labels moved but the underlying indices did not, so nrow for index 1 is 108 in both cases (Black in the original, and mistakenly Brown in the releveled column). Looking at the transformed hair data, Hair_relevel shows Brown for records whose Hair is Black, which is certainly wrong.
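
The general kind of error described above can be illustrated in base R. This is only a sketch of the difference between overwriting level labels and remapping codes; it is not H2O's internal logic, and the issue does not state the actual cause. The names `f`, `new_levels`, `relabeled`, and `remapped` are hypothetical.

```r
# Base R illustration: relabeling levels vs. remapping codes.
# This does not reproduce H2O's implementation.
f <- factor(c("Black", "Black", "Brown", "Brown", "Brown", "Red"))
new_levels <- c("Brown", "Black", "Red")   # desired order, most frequent first

# Problematic pattern: overwrite the labels but keep the integer codes.
# Records that were "Black" now read "Brown", and vice versa.
relabeled <- f
levels(relabeled) <- new_levels

# Expected pattern: reorder the levels and remap the codes,
# so every record keeps its original label.
remapped <- factor(f, levels = new_levels)

table(f)
table(relabeled)   # counts stay attached to the old index positions
table(remapped)    # counts follow the labels

# A simple invariant check that could be applied to any releveled column
identical(as.character(f), as.character(relabeled))   # FALSE
identical(as.character(f), as.character(remapped))    # TRUE
```

The same invariant check can be applied to the reproduction above by pulling both columns into R and comparing them as character vectors.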

Additional context
A relevant issue that I believe was incorrectly closed: #6853
Its associated Stack Overflow post: https://stackoverflow.com/questions/74294256/h2o-python-relevel-vs-relevel-by-frequency-for-factor-columns

@maurever
Contributor

@hutch3232 Thanks a lot for reporting this bug. The fix is being reviewed and will be available in the next fix release.

maurever added a commit that referenced this issue Nov 3, 2023
* Fix relevel by freq bug

* Fix AstRelevelByFreq test to respect changes in topN method.

* Update topN function to reflect changes

@maurever closed this as completed Nov 3, 2023