Difficulty estimator with categorical dtype support #27
-
Hi, I'm learning about CPD and wanted to try it with an XGBoost model on a Kaggle dataset. My current model uses categorical features without one-hot encoding. I'm wondering if anyone has a solution for applying the DifficultyEstimator when the training data has categorical features, so that it can be used as sigma for calibrating a WrapRegressor. Should I just include only numeric features when fitting the difficulty estimator? Thanks!
Replies: 1 comment 1 reply
-
Hi,
This is a good question, as the construction of good difficulty estimators is a central challenge. In addition to what you propose, one possible (but not necessarily good) option is to employ one-hot encoding, which would allow you to use the k-NN approach for the difficulty estimator, which relies on the Euclidean distance. This may, however, not scale well as the number of unique categorical values grows. Another option would be to compute the difficulty outside the DifficultyEstimator, implementing a tailored distance function that combines categorical and numerical features, such as Gower's distance.
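To make the two options concrete, here is a minimal numpy/scikit-learn sketch (this is not crepes code; the toy data, the choice of k = 5, and the per-feature averaging in the Gower-style function are all illustrative assumptions). It computes difficulty scores first via one-hot encoding plus Euclidean k-NN, and then via a hand-rolled Gower-style distance:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X_num = rng.normal(size=(100, 2))                   # two toy numeric features
X_cat = rng.choice(["a", "b", "c"], size=(100, 1))  # one toy categorical feature

# Option 1: one-hot encode the categorical column, then use the mean
# Euclidean distance to the k nearest neighbours as the difficulty.
X_enc = np.hstack([X_num, OneHotEncoder().fit_transform(X_cat).toarray()])
dists, _ = NearestNeighbors(n_neighbors=6).fit(X_enc).kneighbors(X_enc)
sigmas_knn = dists[:, 1:].mean(axis=1)  # column 0 is each point's self-distance

# Option 2: a Gower-style distance mixing range-normalised numeric
# differences with 0/1 categorical mismatches.
def gower_distances(x_num, x_cat, X_num, X_cat, ranges):
    num_part = np.abs(X_num - x_num) / ranges  # each column in [0, 1]
    cat_part = (X_cat != x_cat).astype(float)  # 1 where categories differ
    return np.hstack([num_part, cat_part]).mean(axis=1)

ranges = X_num.max(axis=0) - X_num.min(axis=0)  # assumes non-constant columns
D = np.array([gower_distances(X_num[i], X_cat[i], X_num, X_cat, ranges)
              for i in range(len(X_num))])
sigmas_gower = np.sort(D, axis=1)[:, 1:6].mean(axis=1)  # 5 nearest, excluding self
```

Either vector of scores could then, in principle, be supplied as the sigmas when calibrating a WrapRegressor, as in the original question; for real data you would of course fit the encoder and neighbour model on the proper training set and apply them to the calibration and test sets.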
Best regards,
Henrik