
Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge #9

utterances-bot opened this issue Mar 9, 2021 · 95 comments


@utterances-bot

Tune XGBoost with tidymodels and #TidyTuesday beach volleyball | Julia Silge

Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.

https://juliasilge.com/blog/xgboost-tune-volleyball/


Hi, Julia! Thank you so much for your wonderful tidymodels series. It is very informative and impressive. Nice job! For this XGBoost tuning blog, I found a weird result in the ROC curve part. Everything except the ROC curve works well: I got the same accuracy and AUC as yours, but my ROC curve is flipped about the diagonal, which is really weird. Since my curve is below the diagonal, the AUC should be less than 1/2 by definition; however, my AUC is the same as yours. Is it possible that something is wrong with the roc_curve() function? The version of yardstick I am using is 0.0.7. Thank you in advance.

Owner

Yes, since I published this blog post, there was a change in yardstick (in version 0.0.7) to how the "event" level (win or lose) is chosen. You can change this by using the event_level argument of functions like roc_curve().
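A minimal sketch of this fix, using the object names from the blog post (final_res with a two-level outcome win and a probability column .pred_win; adjust to your own names):

```r
library(yardstick)

# With yardstick >= 0.0.7 the *first* factor level is treated as the
# event by default; if your event level comes second, say so explicitly:
final_res %>%
  collect_predictions() %>%
  roc_curve(win, .pred_win, event_level = "second") %>%
  autoplot()
```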


Got it. Thank you.


Mr-Hadoop-Hotshot commented Mar 24, 2021

Hi Julia,

Great tutorial. Thank you for your support.

I am facing two problems:

  1. My code:

final_res %>%
  collect_predictions() %>%
  roc_curve(y, .pred_1, event_level = "second") %>%
  autoplot()

Error: The number of levels in truth (3) must match the number of columns supplied in ... (1).

  2. How do I deploy the model for real-time data? As in, how can I run this model against another dataframe?

Appreciate your time.
Thanks in advance.

Owner

@Mr-Hadoop-Hotshot it sounds like something has gone a bit wrong somewhere in predictions, maybe some NA values are being generated? I would look at the output of collect_predictions() and see what is happening there.

The output of last_fit() contains a workflow that you can use for prediction on new data. I show how to do that in this post and this post.
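A minimal sketch of that prediction step, assuming a last_fit() result named final_res and a hypothetical data frame new_df with the same predictor columns:

```r
# The last_fit() result carries a fitted workflow you can reuse for new data.
fitted_wf <- final_res$.workflow[[1]]   # newer tune versions: extract_workflow(final_res)

predict(fitted_wf, new_data = new_df)                  # class predictions
predict(fitted_wf, new_data = new_df, type = "prob")   # class probabilities
```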

@Mr-Hadoop-Hotshot

Hi Julia,

Thank you for your reply. Your other tutorials are also excellent, as always.
I found a solution for the second problem.

But, the 1st one remains the same.

a. My original data frame's target variable had three levels (i.e., 1, 2 & 3). I applied filter() to use only 1 & 3.
b. Before doing initial_split() I used droplevels() and then applied last_fit().
c. Strangely, when I applied conf_mat(), no errors popped up, but the "2" level was still present, with both "Actual" & "Predicted" counts of 0.
d. I suspect this is what is stopping me from generating the ROC curve. But when I check the levels of the variable and inspect it visually, it's nowhere to be found.
e. collect_predictions() also returned a column for .pred_2. Very confused!!!

Any suggestions on this? Note, all NAs have also been addressed.

Appreciate your time.
Thanks in advance.

Owner

@Mr-Hadoop-Hotshot Ah gotcha, I would go back to the very beginning and make sure that your initial data set only has two levels in your outcome; this sounds like however you are trying to filter and remove a level is not working. If you would like somewhere to ask for help, I recommend RStudio Community; be sure to create a reprex showing the problem.
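A hypothetical sketch of that filtering step (df and outcome y stand in for your own data): the key is to drop the unused factor level before initial_split() and confirm with levels().

```r
library(dplyr)

# Keep two classes, then drop the now-unused "2" level from the factor
df2 <- df %>%
  filter(y %in% c("1", "3")) %>%
  mutate(y = droplevels(y))

levels(df2$y)  # should now show exactly two levels
```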

@Mr-Hadoop-Hotshot

Hi @juliasilge

Yeah, sure, I tried that. Just wanted to let you know that your blog is full of quality information.
Do you have any materials related to sentiment analysis in R?

Thank you once again.

@juliasilge
Owner

Check out this chapter of my book on text mining for info on sentiment analysis.

@Mr-Hadoop-Hotshot

Hey, this book was recommended by UT Austin when I was doing my PG program in data science and business analytics.
Great book! I used it as my reference source for my research. However, what are your thoughts on using the sentimentr package directly on customer-feedback scenarios, rather than the general NLP procedure of comparing against sentiment lexicons as described in the book? I know it requires a lot of your effort and time to make a video, but it would be great to learn NLP techniques from your videos. Thanks.


jderazoa commented Apr 3, 2021

Hi Julia, thank you very much for sharing your work, it is very good. I am a follower of yours and I really like the pauses you take when explaining each detail of the code. Excellent; you are very pretty.


Thanks for the tutorial! I wonder why we create vb_test if we never use it. Am I missing something?

@juliasilge
Owner

@graco-roza I think I discuss this in the video, but the main idea there is to demonstrate how to prep() and bake() a recipe for debugging and problem-solving purposes. If you use a workflow() you don't technically need those steps, but it can be helpful to know what is going on under the hood and to be able to troubleshoot if/when things go wrong.

@Mr-Hadoop-Hotshot

Hi @juliasilge
Hope you are doing well and safe!

Hey, I recently started to encounter a problem executing the line predict(final$.workflow[[1]], my_dataframe[,]) after upgrading R from 4.0.4 to 4.1.0.

ERROR MESSAGE : R Session Aborted. R encountered a fatal error.

Tried running that code line in console window directly and R throws the same error back.

Any suggestions on this issue?

Appreciate your time.
Thanks in advance.

@juliasilge
Owner

@Mr-Hadoop-Hotshot Hmmm, most things are working well on R 4.1.0 but we have run into a few small issues so far that we've needed to fix. I can't tell from just this what the problem might be. Can you create a reprex and post it with the details of your problem on RStudio Community? I think that will be the best way to find the solution.


Hey Julia, thank you very much for the amazing work! I am a new Big Data student, and I want to use this code in my project; however, I have already split and balanced my data for the other models I built. For the purposes of the project I want to continue with the same split.

Is there any way I can put my prepared data into those split functions? I also built my random forest model with your code, but now I don't know how I can use my validation data for both models. Can you please help me? :)

@juliasilge
Owner

@canlikala Yes, you can use existing training/testing splits in tidymodels; you will need to create your own split object manually, as shown here and in the links in that issue. If you have, say, existing training, validation, and testing data sets, you can definitely use them across multiple types of models.

This case study shows how we treat a validation set as a single set of resampling.
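A sketch of building such a split object by hand with rsample's make_splits(); here df and the logical column is_train marking your existing training rows are assumptions standing in for your own data:

```r
library(rsample)

# Reconstruct an rsample split from a pre-existing train/test partition
ind <- list(
  analysis   = which(df$is_train),    # training row indices
  assessment = which(!df$is_train)    # testing row indices
)
my_split <- make_splits(ind, data = df)

training(my_split)  # same rows as your original training set
testing(my_split)   # same rows as your original test set
```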


Hi Julia,
do you know if in parsnip I can estimate an ensemble model with XGBoost for regression, but with a linear booster?
Thanks in advance,
have a nice day,
MC

@juliasilge
Owner

@martinocrippa We don't currently make use of the linear booster in parsnip but we are tracking interest in that feature here. If you would like to either add a 👍 or add any helpful context for your use case there, that would be great.

@martinocrippa

martinocrippa commented Jun 21, 2021 via email


Dear Julia,

I get the following error, "Error: The provided grid is missing the following parameter columns that have been marked for tuning by tune(): 'trees'.", when using the grid_latin_hypercube function to tune my XGBoost grid for a regression exercise. I looked everywhere for an answer, no luck. Any idea? I think it has something to do with the "trees" definition.


Sorry, I found the reason: I forgot to set trees = 1000. Now it works. However, I get this error in my XGBoost tuning:

"Fold01, Repeat1: preprocessor 1/1, model 30/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use...

! Fold01, Repeat1: internal: A correlation computation is required, but estimate is constant and has 0 standard deviation, resulting in a divide by 0 ...

x Fold02, Repeat1: preprocessor 1/1, model 2/30: Error: The option counts = TRUE was used but parameter colsample_bynode was given as 0. Please use ..."

Anyone having experience with this?


Thanks for this great example. I have a question.

In this example you are using XGBoost in a classification model and you naturally evaluate model performance in the end with a ROC curve.

My question is: what performance metric would you use when XGBoost is used for regression?

@juliasilge
Owner

@kamaulindhardt You can check out metrics that are appropriate for regression, and see some example ways to evaluate regression models in this chapter.


Dear Julia and all,
I got a lot of help from this tutorial, and from the comments as well, in managing all the errors I was getting during the analysis.

I have one problem which I could not solve: I need to get the variable importance values as exact numbers, not only in the plot.

Could you please be so kind and guide me on this issue?

Kind regards
Tamara

@juliasilge
Owner

@TotorosForest You can use the vip::vi() function for that.


Dear Julia!
Thank you so much! I think I have managed to solve the problem based on your comment:

mm_final_xgb %>%
  fit(data = df_mm_train) %>%
  pull_workflow_fit() %>%
  vip::vi()

I hope I have not written "hubble bubble" code :)

My goal is to select some variables from the 10 variables examined (8 ordinal, 2 binary). What would you recommend as a cutoff coefficient if one wanted to select only a few of these 10?

Moreover, what is this importance value? Is it an information gain value? A Gini index? Regression coefficients? What would I call it in the report?

Thank you.

@juliasilge
Owner

@TotorosForest You can look here at the vip::vi() documentation to see how the importance scoring works for various models. I think a cutoff decision would be very domain and data specific. Good luck!


Dear all,
I have one more question about this part of the tutorial:

"It’s time to go back to the testing set! Let’s use last_fit() to fit our model one last time on the training data and evaluate our model one last time on the testing set. Notice that this is the first time we have used the testing data during this whole modeling analysis.

final_res <- last_fit(final_xgb, vb_split)"

My question: since we aim to test the results on the testing set, should the argument not be "vb_test" instead of "vb_split"?

As I understand it, vb_split is the result of the initial 75%/25% partition of the data, so if we want to test on the test set, should we not choose "vb_test"?

Thank you for understanding of my confusion.

Kind regards,
Tamara

@juliasilge
Owner

@TotorosForest You can check out the documentation for last_fit(); notice that it takes the split as the argument so that it can train one final time on the training data and evaluate on the testing data. You don't want to fit to the testing data.
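To make the flow concrete, a short sketch with the object names from the blog post (final_xgb, vb_split):

```r
# last_fit() takes the *split* object: it fits once on training(vb_split)
# and then evaluates once on testing(vb_split); you never pass vb_test in.
final_res <- last_fit(final_xgb, vb_split)

collect_metrics(final_res)  # these metrics are computed on the held-out test set
```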


Hi Julia,
Thanks for the reply. I think I figured out what the problem was.


I had many 0s in the data; it's running now, but tune_grid is taking so long, ~12 hours and still running. I am wondering if this is normal?
Thanks again,
Sami

@juliasilge
Owner

@SamiFarashi I would say generally no, but it's hard to say without other information. If you are looking at a very long-running model, I recommend starting out with very few tuning parameters, few resamples or a subset of your data, and then scaling up to achieve the best model in a reasonable timeframe. If you can describe your situation in more detail, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions.


Great post and package! Thanks so much!

@MonkeyCousin

Hi Julia,
excellent tutorial, thanks. I want to do multiclass classification; how is that possible, please?

@juliasilge
Owner

Several of the models in tidymodels support multiclass classification! You can see some of them here, but also some models support this natively, like ranger.

@MonkeyCousin

Thank you. Does that mean that xgboost as included in tidymodels does not support multiclass classification? I have seen examples where num_class is set along with other params, e.g. with objective = "multi:softprob".
I am keen both to continue my foray into tidymodels and, for consistency across my project, to use xgboost.

@juliasilge
Owner

@MonkeyCousin xgboost does support multiclass, yep. You can see an example here.
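A minimal sketch of the parsnip side of this: when the outcome factor has more than two levels, the xgboost engine handles the multiclass setup (multi:softprob, num_class) for you, so the spec looks the same as in the binary case. trees = 500 here is just an illustrative value.

```r
library(parsnip)

# Same spec works for binary or multiclass outcomes;
# parsnip configures the xgboost objective from the outcome's levels.
xgb_spec <- boost_tree(trees = 500) %>%
  set_engine("xgboost") %>%
  set_mode("classification")
```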


wcwr commented Jan 31, 2023

Hi Julia,

Thanks for this tutorial! When I run this with an XGBoost regression on my own data, everything works! However, the default model (setting trees = 1000 and nothing more) performs slightly better than my tuned model!

Any idea if this is common? I'm wondering because I plan to implement this tuning step in many other areas of my code.

If relevant, I did choose the best parameters based on "rsq" rather than "RMSE" (which seem to be the choices for a regression-based xgb, compared to "auc" in the classification version).

@juliasilge
Owner

@wcwr Take a look at this chapter to understand what might be happening by optimizing $R^2$ instead of RMSE. In general, I would be surprised if an untuned model with default parameters performed better than a model with tuned hyperparameters and I would double check that you're comparing models in a consistent way.


wcwr commented Feb 6, 2023

Hi Julia,

In the tune_grid step, the resamples parameter was set to vb_folds, and the output of this final tuned model is xgb_res. Does this mean that the final model uses the hyperparameters that produced the best metric (AUC/RMSE/r) over the average of the 10 folds? Or could it be the single best fold? Or median perhaps?

Looked for this info in the tune_grid section of tune.tidymodels.org but I don't think I found it.

Thanks for the wonderful tutorial!

@juliasilge
Owner

@wcwr The xgb_res object does not contain any final model. It contains the model performance results that you get across all the model configurations that were tried, estimated using the 10 folds. The next step is to choose the model you want (I did it here with select_best()) and then to train the model using that specific model configuration chosen via tuning on the whole training set with finalize_workflow() and last_fit(). You may want to read this "Getting Started" article on tidymodels.org.
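The tuning-to-final-model steps can be sketched as below, with object names as in the blog post (xgb_res, xgb_wf, vb_split):

```r
# Pick the configuration with the best resampled ROC AUC,
# plug it into the workflow, then do the final fit + test-set evaluation
best_auc  <- select_best(xgb_res, "roc_auc")
final_xgb <- finalize_workflow(xgb_wf, best_auc)
final_res <- last_fit(final_xgb, vb_split)
```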


Hi Julia,

Thanks for the blog post and all your videos!

How can you assess accuracy comparisons between train and test sets from the collect_metrics() call on the final fit?

@juliasilge
Owner

@jlecornu3 We don't recommend measuring model performance using the training set as a whole for the reasons outlined in this section and there purposefully isn't fluent tooling in tidymodels to do so using a final tuned model. However, if you look at this blog post, the metrics you see with collect_metrics(xgb_res) are metrics computed using resamples of the training set; this is what we do recommend.

@jlecornu3

So do you feel this collect_metrics(xgb_res) is reflective of the true model performance on a test set? Or would you advise computing accuracy / RMSE / some other metric on both the resamples and a test set not used in resampling? If the latter, does tidymodels offer this?

@juliasilge
Owner

@jlecornu3 Ah, maybe I misunderstood what you were asking. In this blog post:

  • collect_metrics(xgb_res) computes metrics from resamples of the training set
  • collect_metrics(final_res) computes metrics from the test set

You might want to check out this chapter on "spending your data budget" and how to use the training set vs. test set, as well as how last_fit() works.

@jlecornu3

Thanks Julia -- super clear!


Hi Julia, I know this is not the appropriate place to ask this question, but I am trying to use mlflow in RStudio and I always run into this error, and I have not found any solution:
""" Error in process_initialize(self, private, command, args, stdin, stdout, …:
! Native call to processx_exec failed
Caused by error in chain_call(c_processx_exec, command, c(command, args), pty, pty_options, …:
! Command 'C:/Users/TAKKOUK/AppData/Local/MICROS1/WINDOW1/Scripts/mlflow' not found @win/processx.c:982 (processx_exec) """

@juliasilge
Owner

@mohamedelhilaltek I recommend that you create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then posting on Posit Community. I know there aren't a ton of mlflow users but generally it's a great forum for getting help with these kinds of questions. Good luck! 🙌


Hi Julia, I have a regression problem where the target variable is more than 50 percent zeroes. How can I handle this with xgboost? Is there a step for it?

@juliasilge
Owner

@Hamza-Gouaref Hmmmm, if you have counts with a lot of zeroes, I would suggest that you use zero-inflated Poisson, like in this post. Can you formulate it as a Poisson problem? That would be my main suggestion.
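A hedged sketch of a zero-inflated Poisson spec using the poissonreg package from the tidymodels ecosystem; n_events ~ . and df here are hypothetical stand-ins for your own formula and data:

```r
library(poissonreg)  # tidymodels engine for Poisson-family models

# Zero-inflated Poisson via the "zeroinfl" engine
zip_spec <- poisson_reg() %>%
  set_engine("zeroinfl")

zip_fit <- fit(zip_spec, n_events ~ ., data = df)
```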


Hi Julia, thanks for the great info! Very useful! One question: I've been trying, unsuccessfully, to create a couple of partial dependence plots for your example (for both numeric and categorical predictors). I think it's because I'm very unfamiliar with the tidyverse approach to predictive modeling and how/where objects are located. Could you direct me to a source that might be helpful (or a short code example)? I've been trying to use the pdp and DALEXtra packages. Thanks very much, Joe

@juliasilge
Owner

@retzerjj Check out this chapter of our book that shows how to make partial dependence plots with DALEXtra. If you are wanting to figure out how to pull out various components of a tidymodels workflow, check out these methods, which can help you extract out the workflow, the parsnip model, the underlying engine model, and so forth.
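A short sketch of that DALEXtra approach; the fitted workflow fitted_wf, the training data vb_train, the outcome win, and the predictor "gender" are assumptions matched loosely to the blog post's volleyball example:

```r
library(DALEXtra)

# Wrap a fitted tidymodels workflow in a DALEX explainer
explainer <- explain_tidymodels(
  fitted_wf,
  data = dplyr::select(vb_train, -win),  # predictors only
  y    = vb_train$win == "win"           # outcome as 0/1
)

# Partial dependence profile for one predictor
pdp <- model_profile(explainer, variables = "gender", N = 500)
plot(pdp)
```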


mjwera commented Aug 18, 2023

Thank you for the great video and help. My question is about the vip package to see the variable importance. When I try to install the package I get the error message, "package 'vip' is not available for this version of R". I'm using 4.2.2. Has vip been replaced by another package? Thanks.

@juliasilge
Owner

@mjwera Ooooof, looks like it was archived from CRAN. You can read about their plans here and in the meantime you can install from GitHub.

@bgreenwell

@mjwera apologies, looks like vip was orphaned for some failed tests from some of the last changes we made, but we never got the warning! Should be back up and running soon!

@mjwera

mjwera commented Aug 18, 2023 via email


HanLum commented Jul 26, 2024

Hi Julia,

Thank you for another great video!
Just wondering, is there any way to retrieve the evaluation logs for the training and testing sets from last_fit() using tidymodels? I can only see the training evaluation log!
Native XGBoost in R has evals_result, and in Python model.evals_result() can be used; is there an equivalent?

Best wishes,
Hannah

@juliasilge
Owner

@HanLum I am not aware of how to do that, but you might want to create a reprex (a minimal reproducible example) showing what you want to do and any problems you run into with it, then posting on Posit Community. It's a great forum for getting help with these kinds of modeling questions. Good luck! 🙌


HanLum commented Jul 29, 2024

@juliasilge Thank you for getting back to me so fast! I will give Posit a try :) Thank you!
