#TidyTuesday hotel bookings and recipes | Julia Silge #26

utterances-bot · 2021-05-04T01:35:32Z

#TidyTuesday hotel bookings and recipes | Julia Silge

Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I’m using this week’s #TidyTuesday dataset on hotel bookings to show how to use one of the tidymodels packages, recipes, with some simple models!

https://juliasilge.com/blog/hotels-recipes/

jstello · 2021-05-04T01:35:32Z

Thank you for sharing these amazing techniques! I loved the skim function in particular. I got stuck on the GGally part though; I wasn't able to install it from GitHub by running:
library(devtools)
install_github("ggobi/ggally")

I'm new to RStudio, but I hope to learn more from your amazing videos. Cheers,

juliasilge · 2021-05-04T01:43:52Z

@jstello Try installing it straight from CRAN via install.packages("GGally")
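A minimal sketch of the CRAN route suggested above. The `requireNamespace()` guard and the use of the built-in `iris` data as a stand-in for the hotel data are assumptions made for illustration; they are not from the original post.

```r
# Install GGally from CRAN only if it is not already available.
if (!requireNamespace("GGally", quietly = TRUE)) {
  install.packages("GGally")
}

library(GGally)

# Pairs plot of the four numeric iris columns (stand-in data).
pairs_plot <- ggpairs(iris, columns = 1:4)
pairs_plot
```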

ntihemuka · 2021-05-24T12:19:46Z

Hey Julia, how do you get your code to look so neat and formatted? Is there an RStudio feature that helps format your code as you type?

ntihemuka · 2021-05-24T14:12:01Z

Error: The first argument to [fit_resamples()] should be either a model or workflow.

I don't know how to shake this error, even when I copy your code exactly.

juliasilge · 2021-05-24T15:07:10Z

@ntihemuka I do make heavy use of one of the RStudio shortcuts to reindent lines, which helps a lot with how code looks. I select all (command-A on a Mac) and then reindent (command-I). You can see lots of shortcuts here (https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts). The other thing I do is try to follow tidyverse style (https://style.tidyverse.org/) most of the time, but I'm not perfect on that.

This blog post is older and predates a change in tune: the first argument to functions like tune_grid() or fit_resamples() now needs to be a model or a workflow, so be sure to put that first. If you want to see an updated version of this analysis, check out this Get Started article on tidymodels.org (https://www.tidymodels.org/start/case-study/).
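The argument order described above can be sketched as follows. This is an illustration, not the code from the post: the `mtcars` data, the linear model, and the names `folds`, `lm_spec`, and `lm_res` are stand-ins chosen so the example runs on its own.

```r
library(tidymodels)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

lm_spec <- linear_reg() %>%
  set_engine("lm")

# The model specification (or a workflow) goes first;
# the formula or recipe comes second, then the resamples.
lm_res <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp,
  resamples = folds
)

collect_metrics(lm_res)
```

Putting the formula or recipe first, as older versions of the post do, is what triggers the "first argument ... should be either a model or workflow" error quoted earlier in the thread.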

ntihemuka · 2021-05-24T15:11:38Z

thanks!


gunnergalactico · 2021-08-18T00:36:23Z

Hi Dr. Silge,

I tried this example from the website https://www.tidymodels.org/start/case-study/ and noticed an issue with the engine arguments. It appears you can't pass engine-specific arguments like num.threads or importance = "impurity" with the new workflow syntax. It does work with the old set_engine syntax.

juliasilge · 2021-08-18T17:14:53Z

@gunnergalactico That is correct and as expected; you can only set engine-specific arguments within set_engine().
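A short sketch of where engine-specific arguments go when a workflow is used. The `iris` data, the formula, and the values for `trees` and `num.threads` are assumptions for illustration; the example also assumes the ranger package is installed if the workflow is later fitted.

```r
library(tidymodels)

# Engine-specific arguments (here for ranger) are set in set_engine(),
# not in rand_forest() and not in the workflow itself.
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger", num.threads = 2, importance = "impurity") %>%
  set_mode("classification")

# The workflow then carries the fully specified model.
rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(Species ~ .)

rf_wf
```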

nguyenlovesrpy · 2021-09-07T23:45:57Z

Hi, I think that KNN is only for classification on the training data, and it shouldn't be used to predict on a new dataset (testing data). What do you think? Thank you and best regards.

juliasilge · 2021-09-08T01:08:34Z

@nguyenlovesrpy A nearest neighbor model can definitely be used to predict for a new dataset; check out examples here for both regression and classification.
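As a small illustration of predicting on data the model has not seen, the sketch below fits a nearest neighbor classifier on a training split and predicts on the held-out split. The `iris` data, the 80/20 split, and `neighbors = 5` are assumptions for the example, and the kknn package needs to be installed.

```r
library(tidymodels)

set.seed(234)
split <- initial_split(iris, prop = 0.8, strata = Species)

knn_fit <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification") %>%
  fit(Species ~ ., data = training(split))

# Predictions for rows that were not used in fitting.
knn_preds <- predict(knn_fit, new_data = testing(split))
knn_preds
```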

Cidree · 2022-09-18T16:53:08Z

Hello. First of all, thank you for all these videos, they are really helpful!

I have a question about the outcome in the confusion matrix. What are we evaluating exactly? Because when I sum the observations in the CF there are 22,900 observations, whereas the test set has 18,792 and the training set has 56,374. Why is this?

Cidree · 2022-09-18T17:30:28Z

Hello again. I think I figured it out. It is because of the Monte Carlo CV, which in this case holds out 10% of the data for validation 25 times, so we end up with predictions for 250% of the observations in the resampled training set.
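The counts in this exchange can be checked with a little arithmetic. The sketch assumes the defaults implied by the thread (25 Monte Carlo resamples with `prop = 0.9`) and that the data being resampled is the juiced, downsampled training set, which would explain why it is smaller than the 56,374-row training set; the variable names are made up for the example.

```r
total_preds <- 22900  # observations summed in the confusion matrix
times <- 25           # number of Monte Carlo resamples
prop <- 0.9           # share of each resample used for analysis

# Held-out predictions contributed by each resample.
per_resample <- total_preds / times
per_resample
# 916

# Implied size of the data set that was resampled
# (approximate, since stratified splits round per stratum).
implied_rows <- round(per_resample / (1 - prop))
implied_rows
# 9160
```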

juliasilge · 2022-09-18T19:26:11Z

Yep, those predictions that are used in the confusion matrix are from the 25-fold resampling, where the predictions are on the held out (or "assessment") observations in each resample. You may be interested in trying out the conf_mat_resampled() function.
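A hedged sketch of how `conf_mat_resampled()` might be used. It relies on predictions having been saved with `save_pred = TRUE`; the `two_class_dat` data from modeldata and the logistic regression model are stand-ins, not the hotel analysis from the post.

```r
library(tidymodels)

data(two_class_dat, package = "modeldata")

set.seed(345)
mc_splits <- mc_cv(two_class_dat, prop = 0.9, times = 25, strata = Class)

res <- fit_resamples(
  logistic_reg(),
  Class ~ .,
  resamples = mc_splits,
  control = control_resamples(save_pred = TRUE)
)

# Average cell counts across the 25 resamples, instead of the
# summed counts obtained by pooling all held-out predictions.
avg_cm <- conf_mat_resampled(res)
avg_cm
```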

ghost · 2022-12-26T15:03:56Z

Hi Julia, how does the KNN model choose the number of neighbors k? Does the model use a default value?

juliasilge · 2022-12-26T16:04:57Z

@rcientificos You can check out details like that in the documentation for nearest_neighbor().
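Alongside the documentation, one way to inspect what the engine will receive is `translate()`, which prints the underlying call template for a model specification. The sketch below also shows setting the number of neighbors explicitly or marking it for tuning; the object names and the value 11 are arbitrary choices for illustration.

```r
library(tidymodels)

# No neighbors supplied: translate() shows what is passed to kknn.
knn_default <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

translate(knn_default)

# Set k explicitly ...
knn_k11 <- nearest_neighbor(neighbors = 11) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# ... or mark it for tuning with tune_grid().
knn_tune <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
```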

ghost · 2022-12-26T22:26:26Z

Thank you! What is the alternative to step_downsample() in recipes? Or do I have to use the themis package?

juliasilge · 2022-12-27T08:22:04Z

@rcientificos Yes, that's right. The step_downsample() function moved from recipes to themis.
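A minimal sketch of using `step_downsample()` from themis inside a recipe. The `two_class_dat` data from modeldata is a stand-in for the hotel data, and the object names are made up for the example.

```r
library(tidymodels)
library(themis)  # step_downsample() now lives here

data(two_class_dat, package = "modeldata")

rec <- recipe(Class ~ ., data = two_class_dat) %>%
  step_downsample(Class)

# After prepping, the training data has equal class counts.
class_counts <- rec %>%
  prep() %>%
  bake(new_data = NULL) %>%
  count(Class)

class_counts
```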

RaymondBalise · 2024-01-14T14:45:28Z

Hello Julia,

I noticed that you use the juiced data when you make the resamples in this vlog:

mc_cv(juice(hotel_rec), prop = 0.9, strata = children)

Am I correct that, to avoid leakage caused by step_normalize() in the recipe, it would be best to feed mc_cv() the unprocessed hotel_train data and then use the recipe when you fit the resamples?

It is a small point but I am thinking this is the modern simple example code:

# Packages assumed: tidymodels, plus themis for step_downsample()
library(tidymodels)
library(themis)

# Resample the raw training data rather than the juiced, prepped data
validation_splits <- mc_cv(hotel_train, prop = 0.9, strata = children)

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

hotel_rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric())

# Pass the full recipe and the unprocessed resamples
knn_res <- fit_resamples(
  knn_spec,
  hotel_rec,          # full recipe here instead of just children ~ .
  validation_splits,  # not pre-baked splits
  control = control_resamples(save_pred = TRUE)
)

Do I have this right?

juliasilge · 2024-01-14T21:12:51Z

Yes @RaymondBalise that's right. You can see that the article here using the same hotel data takes an approach more like what you describe than what I have here.
