
Multinomial classification with tidymodels and #TidyTuesday volcano eruptions | Julia Silge #57

utterances-bot opened this issue Dec 14, 2021 · 12 comments

@utterances-bot

Multinomial classification with tidymodels and #TidyTuesday volcano eruptions | Julia Silge

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast demonstrates how to implement multiclass or multinomial classification using this week’s #TidyTuesday dataset on volcanoes.

https://juliasilge.com/blog/multinomial-volcano-eruptions/


CelMcC commented Dec 14, 2021

Thank you so much Dr Silge, this is exactly what I've been hunting for!

@conlelevn

Thanks Julia,

In this model I didn't see you tune any hyperparameters for the random forest model; is there a specific reason for that? In practice, do you usually see a significant difference in model performance before and after tuning?

@juliasilge (Owner)

@conlelevn Random forest models tend to perform pretty well without tuning, as long as you use "enough" trees (like 1000 or so). You can tune a random forest if you want to eke out a little more performance; I demonstrate how to do that here but typically you don't see a ton of dramatic improvement (unlike when you tune an xgboost model).
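
For reference, a rough sketch of the difference between the two approaches in parsnip; volcano_rec and volcano_folds below are placeholder names, not objects from the post:

library(tidymodels)

## Untuned spec: rely on "enough" trees plus the default mtry/min_n
rf_spec <- rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")

## Tunable spec: mark mtry and min_n for tuning instead
rf_tune_spec <- rand_forest(trees = 1000, mtry = tune(), min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_tune_wf <- workflow() %>%
  add_recipe(volcano_rec) %>%   ## placeholder recipe
  add_model(rf_tune_spec)

set.seed(123)
rf_tune_res <- tune_grid(rf_tune_wf, resamples = volcano_folds, grid = 11)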

@Wenyu1024

Hi Julia,

Many thanks for the wonderful blog! In your example you show how a recipe works on the training data as a whole (since you don't tune hyperparameters). I am wondering if you can shed some light on how recipe preprocessing is applied within the resampling object used for parameter tuning.

For example, given a nested_cv process where each training set from the outer loop is used to generate a resampling object for hyperparameter tuning, how can I confirm that the upsampling is working properly, i.e. applied only to the analysis sets and not the assessment sets?

@juliasilge (Owner)

@Wenyu1024 You can read about how preprocessing works over resamples (in the context of parallel processing) in this section of Tidy Modeling with R; note the difference between parallel_over = "resamples" and parallel_over = "everything". If you are tuning in serial, it will, as expected, preprocess then fit for the resamples sequentially.

If you are using a nested resampling scheme, then you will need to set some of this up yourself, as outlined here.
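
As a rough way to sanity-check the upsampling on a single resample (a sketch only; volcano_folds and volcano_type are placeholder names, and themis::step_upsample() keeps its default skip = TRUE):

library(tidymodels)
library(themis)

## Pull one split out of a (placeholder) set of resamples
split <- volcano_folds$splits[[1]]

rec <- recipe(volcano_type ~ ., data = analysis(split)) %>%
  step_upsample(volcano_type)   ## skip = TRUE by default

prepped <- prep(rec, training = analysis(split))

## The upsampled rows show up in the data the recipe was trained on ...
bake(prepped, new_data = NULL) %>% count(volcano_type)

## ... but not when baking the assessment set, because the step is skipped there
bake(prepped, new_data = assessment(split)) %>% count(volcano_type)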


aousabdo commented Dec 8, 2022

Very useful post as always, Dr. Silge. I have learned a lot about tidymodels from your posts. Thank you very much!

@smithhelen

Hello Julia.
I was wondering how to use the vip "permute" method discussed here (koalaverse/vip#131) with multiple classes, like the volcano data. Is it possible with metric = "mauc" and then somehow specifying the pred_fun to average over the classes; or would I need to use probability = FALSE and metric = "accuracy"; or something else entirely?
Many thanks :-)

@juliasilge (Owner)

@smithhelen Hmmm, I'm not sure. Can you create a small reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for folks to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌


smithhelen commented Mar 30, 2023

Thank you Julia

I'll make my question a bit clearer :-)

In this volcano example you generate vi scores using the inbuilt importance="permutation" option via set_engine. Even though a probability forest (rather than a classification forest) is grown, these vi scores are measured from the change in classification accuracy (as per the ranger documentation).

In a different example, using bivariate data, you generate vi scores using the method = "permute" option in the vip package and do not specify importance = "permutation" within set_engine. Now, for a probability forest, the vi scores are calculated with the AUC metric (metric = "auc"), versus metric = "accuracy" for a classification forest (i.e. when set_engine(..., probability = FALSE)). For the AUC method, a reference class needs to be specified for both the pred_wrapper and vi().

Here is your code for the bivariate data, where you choose the reference class to be "One" (i.e. $.pred_One and reference_class = "One"):

pred_fun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "prob")$.pred_One
}
 
ranger_fit %>%
  vi(method = "permute", target = "Class", metric = "auc", nsim = 10,
     pred_wrapper = pred_fun, train = bivariate_train, reference_class = "One")

An advantage of using the vip approach is that multiple simulations can be run and a boxplot produced.

My questions:

  1. Is it possible to use the vip (method = "permute", ...) approach to calculate vi scores when there are more than two classes (as for the volcano example)? If so, what would the reference_class be?
  2. If 1. is not possible, is it sensible to grow a classification forest and use vip with metric = "accuracy" instead?

Thank you!

@juliasilge (Owner)

Ah OK @smithhelen, I don't know how/if the vip package "permute" method works for multinomial classification (although you can ask over at the vip GH repo, so maybe they can clarify). You probably want to use something like DALEX instead; you can read more about using DALEX with tidymodels here.
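
A rough sketch of what that might look like with DALEXtra's explain_tidymodels(); rf_fit, volcano_train, and volcano_type are placeholder names, and loss_cross_entropy() is just one possible loss for multiclass probability predictions:

library(DALEXtra)   ## provides explain_tidymodels() and loads DALEX

explainer <- explain_tidymodels(
  rf_fit,                                            ## placeholder fitted workflow
  data  = dplyr::select(volcano_train, -volcano_type),
  y     = volcano_train$volcano_type,
  label = "random forest"
)

set.seed(345)
vip_multi <- DALEX::model_parts(
  explainer,
  loss_function = DALEX::loss_cross_entropy,
  type = "variable_importance"
)

plot(vip_multi)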

@smithhelen

smithhelen commented Apr 3, 2023 via email

@bgreenwell

bgreenwell commented May 8, 2023

@juliasilge and @smithhelen sorry I'm late to the party. Starting work on the next version of vip now. In short, permutation importance works the same way for multiclass problems as it does for the binary and regression cases. In fact, the vip, iml, and ingredients (the DALEX package for variable importance) packages are all flexible enough to support ANY type of model; even ones built in Python. You just need to supply a suitable metric function and a corresponding prediction wrapper. Here's a somewhat minimal example using a multiclass random forest and the Brier score metric via yardstick:

library(ranger)

set.seed(1028)
rfo <- ranger(Species ~ ., data = iris, probability = TRUE)

p <- predict(rfo, data = iris)$predictions

head(p)
#         setosa   versicolor    virginica
# [1,] 1.0000000 0.0000000000 0.0000000000
# [2,] 0.9963333 0.0030000000 0.0006666667
# [3,] 1.0000000 0.0000000000 0.0000000000
# [4,] 1.0000000 0.0000000000 0.0000000000
# [5,] 1.0000000 0.0000000000 0.0000000000
# [6,] 0.9994286 0.0005714286 0.0000000000

# Multiclass Brier score
yardstick::brier_class_vec(iris$Species, estimate = p)

# Prediction wrapper; to use the multiclass Brier score, it needs to return a
# matrix of predicted probabilities
pfun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}

# Metric function; just a thin wrapper around yardstick's Brier score function
mfun <- function(actual, predicted) {
  yardstick::brier_class_vec(actual, estimate = predicted)
}

# Compute permutation importance
vi_permute(
  rfo, 
  train = iris, 
  target = "Species", 
  metric = mfun, 
  pred_wrapper = pfun,  # tells vip how to get predictions from this model
  smaller_is_better = TRUE,  # vip has no idea if smaller or larger is better
  nsim = 10
)
# # A tibble: 4 × 3
#   Variable     Importance    StDev
#   <chr>             <dbl>    <dbl>
# 1 Sepal.Length    0.00867 0.00103 
# 2 Sepal.Width     0.00223 0.000552
# 3 Petal.Length    0.149   0.00885 
# 4 Petal.Width     0.171   0.00798 

# Same, but with sorted output
vi(
  rfo, 
  method = "permute",
  train = iris, 
  target = "Species", 
  metric = mfun, 
  pred_wrapper = pfun,  # tells vip how to get predictions from this model
  smaller_is_better = TRUE,  # vip has no idea if smaller or larger is better
  nsim = 10
)
# # A tibble: 4 × 3
#   Variable     Importance    StDev
#   <chr>             <dbl>    <dbl>
# 1 Petal.Width     0.178   0.0116  
# 2 Petal.Length    0.151   0.0120  
# 3 Sepal.Length    0.00921 0.000942
# 4 Sepal.Width     0.00232 0.000662

Note that I am working to incorporate yardstick into the package to make this a bit easier, so you won't have to write your own metric function each time (but that's where the flexibility comes in). Also, I wrote vip with scale in mind, and it's seemingly faster than the alternatives, so keep that in mind. A simple benchmark can be found in our R Journal article (Figure 16). It's also parallelizable via the foreach package for larger problems.
