Difficulty estimator with categorical dtype support #27
-
Hi, I'm learning about CPD and wanted to try it with an XGBoost model on a Kaggle dataset. My current model uses categorical features without one-hot encoding. I'm wondering if anyone has a solution for applying the DifficultyEstimator when the training data has categorical features, so that it can be used as sigma for calibrating a WrapRegressor. Should I just include only numeric features when fitting the difficulty estimator? Thanks!
Replies: 1 comment 1 reply
-
Hi,
This is a good question, as the construction of good difficulty estimators is a central challenge. In addition to what you propose, one possible (but not necessarily good) option is to employ one-hot encoding, which would allow you to use the k-NN approach for the difficulty estimator, which relies on the Euclidean distance. This may, however, not scale well as the number of unique categorical values grows. Another option would be to compute the difficulty outside the DifficultyEstimator, implementing a tailored distance function that combines categorical and numerical features, such as Gower's distance.
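To make the two options concrete, here is a minimal numpy/scikit-learn sketch (this is not crepes code; the toy data, the choice of k = 5, and the per-feature averaging in the Gower-style function are all illustrative assumptions). It computes difficulty scores first via one-hot encoding plus Euclidean k-NN, and then via a hand-rolled Gower-style distance:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X_num = rng.normal(size=(100, 2))                   # two toy numeric features
X_cat = rng.choice(["a", "b", "c"], size=(100, 1))  # one toy categorical feature

# Option 1: one-hot encode the categorical column, then use the mean
# Euclidean distance to the k nearest neighbours as the difficulty.
X_enc = np.hstack([X_num, OneHotEncoder().fit_transform(X_cat).toarray()])
dists, _ = NearestNeighbors(n_neighbors=6).fit(X_enc).kneighbors(X_enc)
sigmas_knn = dists[:, 1:].mean(axis=1)  # column 0 is each point's self-distance

# Option 2: a Gower-style distance mixing range-normalised numeric
# differences with 0/1 categorical mismatches.
def gower_distances(x_num, x_cat, X_num, X_cat, ranges):
    num_part = np.abs(X_num - x_num) / ranges  # each column in [0, 1]
    cat_part = (X_cat != x_cat).astype(float)  # 1 where categories differ
    return np.hstack([num_part, cat_part]).mean(axis=1)

ranges = X_num.max(axis=0) - X_num.min(axis=0)  # assumes non-constant columns
D = np.array([gower_distances(X_num[i], X_cat[i], X_num, X_cat, ranges)
              for i in range(len(X_num))])
sigmas_gower = np.sort(D, axis=1)[:, 1:6].mean(axis=1)  # 5 nearest, excluding self
```

Either vector of scores could then, in principle, be supplied as the sigmas when calibrating a WrapRegressor, as in the original question; for real data you would of course fit the encoder and neighbour model on the proper training set and apply them to the calibration and test sets.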
Best regards,
Henrik