Investigate why GAM fails with dataset of certain size. #16149

Closed
wendycwong opened this issue Apr 9, 2024 · 0 comments

@wendycwong
Contributor

Here is the code:

library(data.table)
library(dplyr)
library(h2o) # or load your h2o in a different way
library(ggplot2)

h2o.init(max_mem_size = '32G')

# for n=1000 ==========================================================================================

n <- 1000
sum_insured <- seq(1,200000, length.out=n)

d <- data.table(sum_insured=sum_insured, sqrt = sqrt(sum_insured), sine = sin(2*pi*sum_insured/40000))
d[, sine := 0.3*sqrt*sine ,]
d[, y := pmax(0,sqrt + sine) ,]

d[, x := sum_insured]
d[, x2 := rev(x) ,] # flip axis

# visualise target

ggplot(d) + geom_line(aes(x=x2,y=y)) + scale_x_continuous(breaks = seq(0, 200000, by = 20000))

# import the dataset

h2o_data <- as.h2o(d)

model <- h2o.gam(y = "y", gam_columns = c("x2"), bs = c(2), spline_orders = c(3),
                 splines_non_negative = c(FALSE),
                 training_frame = h2o_data, family = "tweedie", tweedie_variance_power = 1.1,
                 scale = c(0),
                 lambda = 0, alpha = 0,
                 keep_gam_cols = TRUE,
                 num_knots = c(10))

pred <- h2o.predict(object = model, newdata = h2o_data) %>% as.vector

# plot result

d$pred <- pred
ggplot(d) + geom_line(aes(x=x2,y=y)) + geom_line(aes(x=x2,y=pred),colour='red')+ scale_x_continuous(breaks = seq(0, 200000, by = 20000))
ggsave('H:/h2o gam monotonic decreasing.png')

# for n=1001 ==========================================================================================

n <- 1001
sum_insured <- seq(1,200000, length.out=n)

d2 <- data.table(sum_insured=sum_insured, sqrt = sqrt(sum_insured), sine = sin(2*pi*sum_insured/40000))
d2[, sine := 0.3*sqrt*sine ,]
d2[, y := pmax(0,sqrt + sine) ,]

d2[, x := sum_insured]
d2[, x2 := rev(x) ,] # flip axis

# import the dataset

h2o_data2 <- as.h2o(d2)

model2 <- h2o.gam(y = "y", gam_columns = c("x2"), bs = c(2), spline_orders = c(3),
                  splines_non_negative = c(FALSE),
                  training_frame = h2o_data2, family = "tweedie", tweedie_variance_power = 1.1,
                  scale = c(0),
                  lambda = 0, alpha = 0,
                  keep_gam_cols = TRUE,
                  num_knots = c(10))

pred2 <- h2o.predict(object = model2, newdata = h2o_data2) %>% as.vector

# plot result

d2$pred <- pred2
ggplot(d2) + geom_line(aes(x=x2,y=y)) + geom_line(aes(x=x2,y=pred),colour='red')+ scale_x_continuous(breaks = seq(0, 200000, by = 20000))
ggsave('H:/h2o gam monotonic decreasing.png')
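The two runs above differ only in `n`, so the comparison can be condensed into one parameterised script. What follows is a minimal sketch, not part of the original report: `make_gam_data` and `try_gam` are hypothetical helper names, the data is built with base R instead of data.table, and the report does not say how the failure shows up, so catching an R error is only an assumption about how to detect it.

```r
# Hypothetical helper (not in the original report): builds the same synthetic
# columns as the script above for a given row count n, using base R only.
make_gam_data <- function(n) {
  sum_insured <- seq(1, 200000, length.out = n)
  root <- sqrt(sum_insured)
  sine <- 0.3 * root * sin(2 * pi * sum_insured / 40000)
  data.frame(
    sum_insured = sum_insured,
    sqrt        = root,
    sine        = sine,
    y           = pmax(0, root + sine),
    x           = sum_insured,
    x2          = rev(sum_insured)   # flipped axis, as in the report
  )
}

# Hypothetical wrapper: fits the GAM from the report for one n and returns
# "ok" or the error message, so several sizes can be compared in one run.
# Assumes h2o is installed and h2o.init() has already been called.
try_gam <- function(n) {
  frame <- h2o::as.h2o(make_gam_data(n))
  tryCatch({
    h2o::h2o.gam(y = "y", gam_columns = c("x2"), bs = c(2), spline_orders = c(3),
                 splines_non_negative = c(FALSE),
                 training_frame = frame, family = "tweedie",
                 tweedie_variance_power = 1.1,
                 scale = c(0), lambda = 0, alpha = 0,
                 keep_gam_cols = TRUE, num_knots = c(10))
    "ok"
  }, error = function(e) conditionMessage(e))
}

# Usage, once a cluster is running:
#   sapply(c(1000, 1001), try_gam)
```

Wrapping the fit in `tryCatch` keeps the loop going after a failing size, which makes it easier to scan a range of row counts around 1000.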

@wendycwong wendycwong added the bug label Apr 9, 2024
@wendycwong wendycwong self-assigned this Apr 9, 2024
@wendycwong wendycwong modified the milestones: 3.46.0.1, 3.46.0.2 Apr 11, 2024
wendycwong added a commit that referenced this issue Apr 22, 2024
* add R test that reproduces the error.

* add more to test.

* GH-16149: fixed key passed to rebalance dataset to avoid collision.

* Adopt adam code review comments.

Co-authored-by: Veronika Maurerová <maurever@users.noreply.github.com>