
Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge #9

utterances-bot opened this issue Mar 9, 2021 · 95 comments


@utterances-bot

Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge

Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.

https://juliasilge.com/blog/xgboost-tune-volleyball/


Hi, Julia! Thank you so much for your wonderful tidymodels series. It is very informative and impressive. Nice job! For this XGBoost tuning blog, I found a weird result in the ROC curve part. Everything except the ROC curve works well: I got the same accuracy and AUC as yours, but my ROC curve is flipped about the diagonal, which is really weird. Since my curve is below the diagonal, the AUC should be less than 1/2 by definition; however, my AUC is the same as yours. Is it possible that something is wrong with the roc_curve() function? The version of yardstick I am using is 0.0.7. Thank you in advance.

Owner

Yes, since I published this blog post, there was a change in yardstick (in version 0.0.7) to how the "event" level (win or lose) is chosen. You can change this by using the event_level argument of functions like roc_curve().
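A minimal sketch of this fix, using the object names from the blog post (final_res with a two-level outcome win and a probability column .pred_win; adjust to your own names):

```r
library(yardstick)

# With yardstick >= 0.0.7 the *first* factor level is treated as the
# event by default; if your event level comes second, say so explicitly:
final_res %>%
  collect_predictions() %>%
  roc_curve(win, .pred_win, event_level = "second") %>%
  autoplot()
```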


Got it. Thank you.


Mr-Hadoop-Hotshot commented Mar 24, 2021

Hi Julia,

Great tutorial. Thank you for your support.

I am facing two problems:

  1. My code:

final_res %>%
  collect_predictions() %>%
  roc_curve(y, .pred_1, event_level = "second") %>%
  autoplot()

Error: The number of levels in truth (3) must match the number of columns supplied in ... (1).

  2. How do I deploy the model for real-time data? As in, how can I run this model against another dataframe?

Appreciate your time.
Thanks in advance.

Owner

@Mr-Hadoop-Hotshot it sounds like something has gone a bit wrong somewhere in predictions, maybe some NA values are being generated? I would look at the output of collect_predictions() and see what is happening there.

The output of last_fit() contains a workflow that you can use for prediction on new data. I show how to do that in this post and this post.
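A minimal sketch of that prediction step, assuming a last_fit() result named final_res and a hypothetical data frame new_df with the same predictor columns:

```r
# The last_fit() result carries a fitted workflow you can reuse for new data.
fitted_wf <- final_res$.workflow[[1]]   # newer tune versions: extract_workflow(final_res)

predict(fitted_wf, new_data = new_df)                  # class predictions
predict(fitted_wf, new_data = new_df, type = "prob")   # class probabilities
```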

@Mr-Hadoop-Hotshot

Hi Julia,

Thank you for your reply. Your other tutorials are also excellent, as always.
I found a solution for the second problem.

But, the 1st one remains the same.

a. My original data frame's target variable had three levels (i.e., 1, 2 & 3). I applied filter() to use only 1 & 3.
b. Before doing initial_split() I used droplevels() and then applied last_fit().
c. Strangely, when I applied conf_mat(), no errors popped up, but the "2" level was still present, with both "Actual" & "Predicted" counts of 0.
d. I suspect this is what is stopping me from generating the ROC curve. But when I check the levels of the variable and inspect it visually, it's nowhere to be found.
e. collect_predictions() also returned a column for .pred_2. Very confused!!!

Any suggestions on this? Note, all NAs have also been addressed.

Appreciate your time.
Thanks in advance.

Owner

@Mr-Hadoop-Hotshot Ah gotcha, I would go back to the very beginning and make sure that your initial data set only has two levels in your outcome; this sounds like however you are trying to filter and remove a level is not working. If you would like somewhere to ask for help, I recommend RStudio Community; be sure to create a reprex showing the problem.
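A hypothetical sketch of that filtering step (df and outcome y stand in for your own data): the key is to drop the unused factor level before initial_split() and confirm with levels().

```r
library(dplyr)

# Keep two classes, then drop the now-unused "2" level from the factor
df2 <- df %>%
  filter(y %in% c("1", "3")) %>%
  mutate(y = droplevels(y))

levels(df2$y)  # should now show exactly two levels
```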

@Mr-Hadoop-Hotshot

Hi @juliasilge

Yeah, sure, I tried that. Just wanted to let you know that your blog is full of quality information.
Do you have any materials related to sentiment analysis in R?

Thank you once again.

@juliasilge
Owner

Check out this chapter of my book on text mining for info on sentiment analysis.

@Mr-Hadoop-Hotshot

Hey, this book was recommended by UT Austin when I was doing my PG program in data science and business analytics.
Great book! I used it as my reference source for my research. However, what are your thoughts on using the sentimentr package directly on customer-feedback scenarios, rather than the general NLP procedure of comparing against sentiment lexicons as described in the book? I know it requires a lot of your effort and time to make a video, but it would be great to learn NLP techniques from your videos. Thanks.


jderazoa commented Apr 3, 2021

Hi Julia, thank you very much for sharing your work, it is very good. I am a follower of yours and I really like the pauses you take when explaining each detail of the code. Excellent; you are very pretty.


Thanks for the tutorial! I wonder why we create vb_test if we never use it. Am I missing something?

@juliasilge
Owner

@graco-roza I think I discuss this in the video, but the main idea there is to demonstrate how to prep() and bake() a recipe for debugging and problem-solving purposes. If you use a workflow() you don't technically need those steps, but it can be helpful to know what is going on under the hood and to be able to troubleshoot if/when things go wrong.

@Mr-Hadoop-Hotshot

Hi @juliasilge
Hope you are doing well and safe!

Hey, I recently started to encounter a problem executing the line predict(final$.workflow[[1]], my_dataframe[,]) after upgrading R from 4.0.4 to 4.1.0.

ERROR MESSAGE : R Session Aborted. R encountered a fatal error.

Tried running that code line in console window directly and R throws the same error back.

Any suggestions on this issue?

Appreciate your time.
Thanks in advance.

@juliasilge
Owner

@Mr-Hadoop-Hotshot Hmmm, most things are working well on R 4.1.0 but we have run into a few small issues so far that we've needed to fix. I can't tell from just this what the problem might be. Can you create a reprex and post it with the details of your problem on RStudio Community? I think that will be the best way to find the solution.


Hey Julia, thank you very much for the amazing work! I am a new Big Data student, and I want to use this code in my project; however, I have already split and balanced my data for the other models I built. For the purposes of the project I want to continue with the same split.

Is there any way I can put my prepared data into those split functions? I also built my random forest model with your code, but now I don't know how I can use my validation data for both models. Can you please help me? :)

@juliasilge
Owner

@canlikala Yes, you can use existing training/testing splits in tidymodels; you will need to create your own split object manually, as shown here and in the links in that issue. If you have, say, existing training, validation, and testing data sets, you can definitely use them across multiple types of models.

This case study shows how we treat a validation set as a single set of resampling.
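A sketch of building such a split object by hand with rsample's make_splits(); here df and the logical column is_train marking your existing training rows are assumptions standing in for your own data:

```r
library(rsample)

# Reconstruct an rsample split from a pre-existing train/test partition
ind <- list(
  analysis   = which(df$is_train),    # training row indices
  assessment = which(!df$is_train)    # testing row indices
)
my_split <- make_splits(ind, data = df)

training(my_split)  # same rows as your original training set
testing(my_split)   # same rows as your original test set
```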


Hi Julia,
do you know if in parsnip I can estimate an ensemble model with XGBoost for regression, but with a linear booster?
Thanks in advance,
have a nice day,
MC

@juliasilge
Owner

@martinocrippa We don't currently make use of the linear booster in parsnip but we are tracking interest in that feature here. If you would like to either add a 👍 or add any helpful context for your use case there, that would be great.

@martinocrippa

martinocrippa commented Jun 21, 2021 via email


Dear Julia,

I get the following error, "Error: The provided grid is missing the following parameter columns that have been marked for tuning by tune(): 'trees'.", when using the grid_latin_hypercube function to tune my XGBoost grid for a regression exercise. I looked everywhere for an answer, no luck. Any idea? I think it has something to do with the "trees" definition.


Sorry, I found the reason: I forgot to set trees = 1000. Now it works. However, I get this error in my XGBoost tuning:

"Fold01, Repeat1: preprocessor 1/1, model 30/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use...

! Fold01, Repeat1: internal: A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 ...

x Fold02, Repeat1: preprocessor 1/1, model 2/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use ..."

Anyone having experience with this?


Thanks for this great example. I have a question.

In this example you are using XGBoost in a classification model and you naturally evaluate model performance in the end with a ROC curve.

My question is: what performance metric would you use when XGBoost is used for regression?

@juliasilge
Owner

@kamaulindhardt You can check out metrics that are appropriate for regression, and see some example ways to evaluate regression models in this chapter.


Dear Julia and all,
I got a lot of help from this tutorial, and from the comments as well, in managing all the errors I was getting during the analysis.

I have one problem which I could not solve: I need to get the variable importance values as exact numbers, not only in the plot.

Could you please be so kind and guide me on this issue?

Kind regards
Tamara

@juliasilge
Owner

@TotorosForest You can use the vip::vi() function for that.


Dear Julia!
Thank you so much! I think I have managed to solve the problem based on your comment:

mm_final_xgb %>%
  fit(data = df_mm_train) %>%
  pull_workflow_fit() %>%
  vip::vi()

I hope I have not written "hubble bubble" code :)

My goal is to select some variables from the 10 variables examined (8 ordinal, 2 binary). What would you recommend as a cutoff coefficient if one wanted to select only a few of these 10?

Moreover, what is this importance value? Is it an information gain value? A Gini index? Regression coefficients? What would I call it in the report?

Thank you.

@juliasilge
Owner

@TotorosForest You can look here at the vip::vi() documentation to see how the importance scoring works for various models. I think a cutoff decision would be very domain and data specific. Good luck!


Dear all,
I have one more question about this part of the tutorial:

"It’s time to go back to the testing set! Let’s use last_fit() to fit our model one last time on the training data and evaluate our model one last time on the testing set. Notice that this is the first time we have used the testing data during this whole modeling analysis.

final_res <- last_fit(final_xgb, vb_split)"

My question: since we aim to test the results on the testing set, should the argument not be "vb_test" instead of "vb_split"?

As I understand it, vb_split is the result of the initial 75%/25% partition of the data, so if we want to test on the test set, should we not choose "vb_test"?

Thank you for understanding of my confusion.

Kind regards,
Tamara

@juliasilge
Owner

@TotorosForest You can check out the documentation for last_fit(); notice that it takes the split as the argument so that it can train one final time on the training data and evaluate on the testing data. You don't want to fit to the testing data.
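To make the flow concrete, a short sketch with the object names from the blog post (final_xgb, vb_split):

```r
# last_fit() takes the *split* object: it fits once on training(vb_split)
# and then evaluates once on testing(vb_split); you never pass vb_test in.
final_res <- last_fit(final_xgb, vb_split)

collect_metrics(final_res)  # these metrics are computed on the held-out test set
```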


Hi Julia,
Thanks for the reply. I think I figured out what the problem was.


I had many 0s in the data; it's running now, but tune_grid is taking so long, ~12 hours and still running. I am wondering if this is normal?
Thanks again,
Sami

@juliasilge
Owner

@SamiFarashi I would say generally no, but it's hard to say without other information. If you are looking at a very long-running model, I recommend starting out with very few tuning parameters, few resamples or a subset of your data, and then scaling up to achieve the best model in a reasonable timeframe. If you can describe your situation in more detail, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions.


Great post and package! Thanks so much!

@MonkeyCousin

Hi Julia,
excellent tutorial, thanks. I want to do multiclass classification; how is that possible, please?

@juliasilge
Owner

Several of the models in tidymodels support multiclass classification! You can see some of them here, but also some models support this natively, like ranger.

@MonkeyCousin

Thank you. Does that mean that xgboost as included in tidymodels does not support multiclass classification? I have seen examples where num_class is set along with other params, e.g. with objective = "multi:softprob".
I am keen both to continue my foray into tidymodels and, for consistency across my project, to use xgboost.

@juliasilge
Owner

@MonkeyCousin xgboost does support multiclass, yep. You can see an example here.
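A minimal sketch of the parsnip side of this: when the outcome factor has more than two levels, the xgboost engine handles the multiclass setup (multi:softprob, num_class) for you, so the spec looks the same as in the binary case. trees = 500 here is just an illustrative value.

```r
library(parsnip)

# Same spec works for binary or multiclass outcomes;
# parsnip configures the xgboost objective from the outcome's levels.
xgb_spec <- boost_tree(trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```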


wcwr commented Jan 31, 2023

Hi Julia,

Thanks for this tutorial! When I run this with an XGBoost regression on my own data, everything works! However, the default model (setting trees = 1000 and nothing more) performs slightly better than my tuned model!

Any idea if this is common? I'm wondering because I plan to implement this tuning step in many other areas of my code.

If relevant, I did choose the best parameters based on "rsq" rather than "RMSE" (which seem to be the choices for a regression-based xgb, compared to "auc" in the classification version).

@juliasilge
Owner

@wcwr Take a look at this chapter to understand what might be happening by optimizing $R^2$ instead of RMSE. In general, I would be surprised if an untuned model with default parameters performed better than a model with tuned hyperparameters and I would double check that you're comparing models in a consistent way.


wcwr commented Feb 6, 2023

Hi Julia,

In the tune_grid step, the resamples parameter was set to vb_folds, and the output of this final tuned model is xgb_res. Does this mean that the final model uses the hyperparameters that produced the best metric (AUC/RMSE/r) over the average of the 10 folds? Or could it be the single best fold? Or median perhaps?

Looked for this info in the tune_grid section of tune.tidymodels.org but I don't think I found it.

Thanks for the wonderful tutorial!

@juliasilge
Owner

@wcwr The xgb_res object does not contain any final model. It contains the model performance results that you get across all the model configurations that were tried, estimated using the 10 folds. The next step is to choose the model you want (I did it here with select_best()) and then to train the model using that specific model configuration chosen via tuning on the whole training set with finalize_workflow() and last_fit(). You may want to read this "Getting Started" article on tidymodels.org.
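The tuning-to-final-model steps can be sketched as below, with object names as in the blog post (xgb_res, xgb_wf, vb_split):

```r
# Pick the configuration with the best resampled ROC AUC,
# plug it into the workflow, then do the final fit + test-set evaluation
best_auc  <- select_best(xgb_res, "roc_auc")
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_res <- last_fit(final_xgb, vb_split)
```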


Hi Julia,

Thanks for the blog post and all your videos!

How can you assess accuracy comparisons between train and test sets from the collect_metrics() call on the final fit?

@juliasilge
Owner

@jlecornu3 We don't recommend measuring model performance using the training set as a whole for the reasons outlined in this section and there purposefully isn't fluent tooling in tidymodels to do so using a final tuned model. However, if you look at this blog post, the metrics you see with collect_metrics(xgb_res) are metrics computed using resamples of the training set; this is what we do recommend.

@jlecornu3

So do you feel this collect_metrics(xgb_res) is reflective of the true model performance on a test set? Or would you advise computing accuracy / RMSE / some other metric on both the resamples and a test set not used in resampling? If the latter, does tidymodels offer this?

@juliasilge
Owner

@jlecornu3 Ah, maybe I misunderstood what you were asking. In this blog post:

  • collect_metrics(xgb_res) computes metrics from resamples of the training set
  • collect_metrics(final_res) computes metrics from the test set

You might want to check out this chapter on "spending your data budget" and how to use the training set vs. test set, as well as how last_fit() works.

@jlecornu3

Thanks Julia -- super clear!


Hi Julia, I know this is not the appropriate place to ask this question, but I am trying to use mlflow in RStudio and I always run into this error, and I have not found any solution:
""" Error in process_initialize(self, private, command, args, stdin, stdout, …:
! Native call to processx_exec failed
Caused by error in chain_call(c_processx_exec, command, c(command, args), pty, pty_options, …:
! Command 'C:/Users/TAKKOUK/AppData/Local/MICROS1/WINDOW1/Scripts/mlflow' not found @win/processx.c:982 (processx_exec) """

@juliasilge
Owner

@mohamedelhilaltek I recommend that you create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then posting on Posit Community. I know there aren't a ton of mlflow users but generally it's a great forum for getting help with these kinds of questions. Good luck! 🙌


Hi Julia, I have a regression problem where the target variable is more than 50 percent zeroes. How can I handle this with xgboost? Is there a step for it?

@juliasilge
Owner

@Hamza-Gouaref Hmmmm, if you have counts with a lot of zeroes, I would suggest that you use zero-inflated Poisson, like in this post. Can you formulate it as a Poisson problem? That would be my main suggestion.
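A hedged sketch of a zero-inflated Poisson spec using the poissonreg package from the tidymodels ecosystem; n_events ~ . and df here are hypothetical stand-ins for your own formula and data:

```r
library(poissonreg)  # tidymodels engine for Poisson-family models

# Zero-inflated Poisson via the "zeroinfl" engine
zip_spec <- poisson_reg() %>%
  set_engine("zeroinfl")

zip_fit <- fit(zip_spec, n_events ~ ., data = df)
```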


Hi Julia, thanks for the great info! Very useful! One question: I've been trying, unsuccessfully, to create a couple of partial dependence plots for your example (for both numeric and categorical predictors). I think it's because I'm very unfamiliar with the tidyverse approach to predictive modeling and how/where objects are located. Could you direct me to a source that might be helpful (or a short code example)? I've been trying to use the pdp and DALEXtra packages. Thanks very much, Joe

@juliasilge
Owner

@retzerjj Check out this chapter of our book that shows how to make partial dependence plots with DALEXtra. If you are wanting to figure out how to pull out various components of a tidymodels workflow, check out these methods, which can help you extract out the workflow, the parsnip model, the underlying engine model, and so forth.
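A short sketch of that DALEXtra approach; the fitted workflow fitted_wf, the training data vb_train, the outcome win, and the predictor "gender" are assumptions matched loosely to the blog post's volleyball example:

```r
library(DALEXtra)

# Wrap a fitted tidymodels workflow in a DALEX explainer
explainer <- explain_tidymodels(
  fitted_wf,
  data = dplyr::select(vb_train, -win),  # predictors only
  y    = vb_train$win == "win"           # outcome as 0/1
)

# Partial dependence profile for one predictor
pdp <- model_profile(explainer, variables = "gender", N = 500)
plot(pdp)
```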


mjwera commented Aug 18, 2023

Thank you for the great video and help. My question is about the vip package to see the variable importance. When I try to install the package I get the error message, "package 'vip' is not available for this version of R". I'm using 4.2.2. Has vip been replaced by another package? Thanks.

@juliasilge
Owner

@mjwera Ooooof, looks like it was archived from CRAN. You can read about their plans here and in the meantime you can install from GitHub.

@bgreenwell

@mjwera apologies, looks like vip was orphaned for some failed tests from some of the last changes we made, but we never got the warning! Should be back up and running soon!

@mjwera

mjwera commented Aug 18, 2023 via email


HanLum commented Jul 26, 2024

Hi Julia,

Thank you for another great video!
Just wondering, is there any way to retrieve the evaluation logs for the training and testing sets from last_fit() using tidymodels? I can only see the training evaluation log!
Native XGBoost in R has evals_result, and in Python model.evals_result() can be used; is there an equivalent?

Best wishes,
Hannah

@juliasilge
Owner

@HanLum I am not aware of how to do that, but you might want to create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then posting on Posit Community. It's a great forum for getting help with these kinds of modeling questions. Good luck! 🙌


HanLum commented Jul 29, 2024

@juliasilge Thank you for getting back to me so fast! I will give Posit a try :) Thank you!
