#TidyTuesday hotel bookings and recipes | Julia Silge #26

Open
utterances-bot opened this issue May 4, 2021 · 20 comments

Comments

@utterances-bot

#TidyTuesday hotel bookings and recipes | Julia Silge

Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I’m using this week’s #TidyTuesday dataset on hotel bookings to show how to use one of the tidymodels packages, recipes, with some simple models!

https://juliasilge.com/blog/hotels-recipes/

jstello commented May 4, 2021

Thank you for sharing these amazing techniques! I loved the skim function in particular. I got stuck on the GGally part, though: I wasn't able to install it by running

# GitHub
library(devtools)
install_github("ggobi/ggally")

I'm new to RStudio, but I hope to learn more from your amazing videos. Cheers,

@juliasilge
Owner

@jstello Try installing it straight from CRAN via install.packages("GGally")

Hey Julia, how do you get your code to look so neat and formatted? Is there an RStudio feature that helps format your code as you type?

Error: The first argument to [fit_resamples()] should be either a model or workflow.

I don't know how to shake this error, even when I copy your code exactly.

@juliasilge
Owner

@ntihemuka I do make heavy use of one of the RStudio shortcuts to reindent lines, which helps with how code looks a lot. I select all (command-A on a mac) and then reindent (command-I). You can see lots of shortcuts here. The other thing I do is try to follow tidyverse style most of the time, but I'm not perfect on that.

This blog post is older and predates a change in tune where now the first argument to functions like tune_grid() or fit_resamples() needs to be a model or a workflow; be sure to put that first now. If you want to see an updated version of this analysis, check out this Get Started article on tidymodels.org.
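For readers hitting the same error, here is a minimal sketch of the current argument order. It uses the built-in mtcars data and a plain linear model as stand-ins (both are assumptions for illustration, not the hotel analysis from the post); the point is only that the model specification comes first, followed by the preprocessor and then the resamples.

```r
library(tidymodels)

set.seed(123)
# mtcars stands in for the hotel data (illustrative assumption)
folds <- vfold_cv(mtcars, v = 5)

lm_spec <- linear_reg() %>%
  set_engine("lm")

# Model (or workflow) first, then the preprocessor, then the resamples
lm_res <- fit_resamples(
  lm_spec,
  mpg ~ wt + hp,
  resamples = folds
)

collect_metrics(lm_res)
```

Putting a formula or recipe in the first position is what triggers the "should be either a model or workflow" error in newer versions of tune.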

@ntihemuka

ntihemuka commented May 24, 2021 via email

Copy link

Hi Dr. Silge,

I tried this example from the website https://www.tidymodels.org/start/case-study/ and noticed an issue with the engine arguments. It appears you can't pass engine specific arguments like "num.threads" or "importance = impurity" with the new workflow syntax. It does work with the old set_engine syntax.

@gunnergalactico

hotel_stays

@juliasilge
Owner

@gunnergalactico That is correct and as expected; you can only set engine-specific arguments within set_engine().
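As a rough sketch of what that looks like in practice, the engine-specific arguments go inside set_engine(), and the resulting specification is then added to a workflow as usual. The iris data, the number of trees, and the thread count below are illustrative assumptions, and the ranger package is assumed to be installed.

```r
library(tidymodels)

# Engine-specific arguments (num.threads, importance) belong in set_engine()
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger", num.threads = 2, importance = "impurity") %>%
  set_mode("classification")

# The workflow only bundles the preprocessor and the finished model spec
rf_wf <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(rf_spec)

rf_fit <- fit(rf_wf, data = iris)

# Variable importance is available because importance was set on the engine
extract_fit_engine(rf_fit)$variable.importance
```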

Hi, I think that KNN is only for classification on training data, and it shouldn't be used to predict on a new dataset (testing data). What do you think? Thank you and best regards.

@juliasilge
Owner

@nguyenlovesrpy A nearest neighbor model can definitely be used to predict for a new dataset; check out examples here for both regression and classification.
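A small sketch of predicting on held-out data with a nearest neighbor model follows. The iris data, the 80/20 split, and neighbors = 5 are assumptions chosen for illustration, and the kknn package is assumed to be installed.

```r
library(tidymodels)

set.seed(234)
# Hold out 20% of the rows as "new" data the model never sees during fitting
iris_split <- initial_split(iris, prop = 0.8, strata = Species)
iris_train <- training(iris_split)
iris_test <- testing(iris_split)

knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

knn_fit <- fit(knn_spec, Species ~ ., data = iris_train)

# Each test row is classified from its nearest neighbors in the training set
knn_preds <- predict(knn_fit, new_data = iris_test)
knn_preds
```

Switching set_mode() to "regression" with a numeric outcome gives the regression case in the same way.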

Cidree commented Sep 18, 2022

Hello. First of all, thank you for all these videos; they are really helpful!

I have a question about the outcome in the confusion matrix. What are we evaluating exactly? When I sum the observations in the confusion matrix there are 22,900 observations, whereas the test set has 18,792 and the training set has 56,374. Why is this?

Cidree commented Sep 18, 2022

Hello again. I think I figured it out. It is because of the Monte Carlo CV, which in this case holds out 10% of the data for validation 25 times, so we have 250% of the observations of the training set.
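The arithmetic can be checked directly. The sketch below works backwards from the 22,900 figure, assuming 25 resamples that each hold out 10% of the rows. The implied row count is much smaller than 56,374, which would be consistent with the resamples having been drawn from the juiced (downsampled) data, as in the mc_cv(juice(hotel_rec), prop = 0.9, ...) call quoted later in this thread; that reading is an inference, not something stated in the comments.

```r
n_resamples <- 25      # number of Monte Carlo resamples
assess_prop <- 0.1     # share of rows held out in each resample (1 - 0.9)
n_confusion <- 22900   # total observations in the pooled confusion matrix

# Held-out predictions per resample
per_resample <- n_confusion / n_resamples

# Number of rows in the data that was resampled
implied_rows <- round(per_resample / assess_prop)

per_resample
implied_rows
```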

@juliasilge
Owner

Yep, those predictions that are used in the confusion matrix are from the 25-fold resampling, where the predictions are on the held out (or "assessment") observations in each resample. You may be interested in trying out the conf_mat_resampled() function.
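To make the difference concrete, here is a hedged sketch comparing a pooled confusion matrix with conf_mat_resampled(). It uses the two_class_dat example data from modeldata and a logistic regression as stand-ins (assumptions for illustration, not the hotel models from the post).

```r
library(tidymodels)

set.seed(345)
# two_class_dat ships with modeldata; it stands in for the hotel data
data(two_class_dat, package = "modeldata")

mc_splits <- mc_cv(two_class_dat, prop = 0.9, times = 25, strata = Class)

glm_res <- fit_resamples(
  logistic_reg(),
  Class ~ A + B,
  resamples = mc_splits,
  control = control_resamples(save_pred = TRUE)  # keep held-out predictions
)

# Pooled: every held-out prediction from all 25 resamples is counted
glm_res %>%
  collect_predictions() %>%
  conf_mat(truth = Class, estimate = .pred_class)

# Averaged: cell counts are averaged across the resamples
conf_mat_resampled(glm_res)
```

The pooled matrix sums to roughly 25 times the assessment set size, while the averaged version stays on the scale of a single assessment set.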

ghost commented Dec 26, 2022

Hi Julia, how does the KNN model choose the right number of neighbors k? Does the model use a default value?

@juliasilge
Owner

@rcientificos You can check out details like that in the documentation for nearest_neighbor().
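As a sketch of the two usual options: the number of neighbors is a hyperparameter rather than something estimated during fitting, so it can either be fixed explicitly or marked for tuning. The iris data and the candidate values of k below are illustrative assumptions; when neighbors is left unset, the engine's documented default applies.

```r
library(tidymodels)

# Option 1: fix k explicitly instead of relying on the engine default
knn_fixed <- nearest_neighbor(neighbors = 7) %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Option 2: mark k for tuning and let resampling compare candidate values
knn_tune <- nearest_neighbor(neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")

set.seed(456)
folds <- vfold_cv(iris, v = 5, strata = Species)

knn_grid <- tune_grid(
  knn_tune,
  Species ~ .,
  resamples = folds,
  grid = tibble(neighbors = c(3L, 5L, 9L, 15L))
)

show_best(knn_grid, metric = "accuracy")
```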

@ghost

ghost commented Dec 26, 2022

Thank you! What is the alternative to step_downsample() in recipes? Or do I have to use the themis package?

@juliasilge
Owner

@rcientificos Yes, that's right. The step_downsample() function moved from recipes to themis.
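A minimal sketch of the themis version, using the two_class_dat example data from modeldata as a stand-in (an assumption for illustration):

```r
library(tidymodels)
library(themis)  # step_downsample() now lives here

data(two_class_dat, package = "modeldata")

rec <- recipe(Class ~ ., data = two_class_dat) %>%
  step_downsample(Class)

# Prep the recipe and pull out the processed training data
balanced <- rec %>%
  prep() %>%
  bake(new_data = NULL)

# With the default ratio, both classes end up with the same number of rows
count(balanced, Class)
```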

Contributor

Hello Julia,

I noticed that you use the juiced data when you make the resamples in this vlog:

mc_cv(juice(hotel_rec), prop = 0.9, strata = children)

Am I correct that, to avoid leakage caused by step_normalize() in the recipe, it would be best to feed mc_cv() the unprocessed hotel_train data and then use the recipe when you fit the resamples?

It is a small point but I am thinking this is the modern simple example code:

# Load tidymodels; step_downsample() now lives in themis
library(tidymodels)
library(themis)

# Resample the raw training data rather than the juiced (prepped) data
validation_splits <- mc_cv(hotel_train, prop = 0.9, strata = children)

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

hotel_rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric())

# Pass the full recipe and the unprocessed resamples, so the recipe is
# re-estimated within each resample
knn_res <- fit_resamples(
  knn_spec,
  hotel_rec,  # full recipe here instead of just children ~ .
  validation_splits,  # not pre-baked splits
  control = control_resamples(save_pred = TRUE)
)

Do I have this right?

@juliasilge
Owner

Yes @RaymondBalise that's right. You can see that the article here using the same hotel data takes an approach more like what you describe than what I have here.
