## Activity 4 - Regression Trees

**Due** on *Monday, October 30th* by 11:59 pm

You will be asked to complete a short survey on ICON about the output generated below. Additional questions to consider are sprinkled throughout the notebook; these do not need to be answered explicitly, but they can serve as a guide for thinking about and interpreting the statistical output.

## Setup

This first code cell needs to be executed ("Run") every time this notebook is opened. For example, if you stop working on this activity and come back to it later, the first code cell must be executed again to load the data, even though output from your previous session may still be displayed.

The data describe Starbucks drinks. The attributes in the data are:

|variable        |class     |description |
|:---------------|:---------|:-----------|
|product_name    |character | Product Name |
|size            |character | Size of drink (short, tall, grande, venti) |
|milk            |double    | Type of milk used: `0` = none, `1` = nonfat, `2` = 2%, `3` = soy, `4` = coconut, `5` = whole |
|whip            |double    | Whip added or not (binary, 0 = no; 1 = yes) |
|serv_size_m_l    |double    | Serving size in ml |
|calories        |double    | Calories (kcal) |
|total_fat_g     |double    | Total fat grams |
|saturated_fat_g |double    | Saturated fat grams |
|trans_fat_g     |character | Trans fat grams |
|cholesterol_mg  |double    | Cholesterol mg |
|sodium_mg       |double    | Sodium milligrams |
|total_carbs_g   |double    | Total Carbs grams |
|fiber_g         |character | Fiber grams |
|sugar_g         |double    | Sugar grams  |
|caffeine_mg     |double    | Caffeine in milligrams |

### Guiding question for the activity
1. How accurately can a regression tree model predict the number of Calories in the drink using other features of the drink?

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)
library(rpart)
library(rpart.plot)
library(rsample)

theme_set(theme_bw(base_size = 18))

starbucks <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv') %>%
   filter(size %in% c('grande', 'tall', 'venti'))

head(starbucks)

## Question 1

Explore the number of calories in Starbucks drinks by the size of the drink.

Fill in the primary attribute of interest (i.e., `calories`) in place of "^^" and the size of the drink (i.e., `size`) in place of "@@". Also, fill in appropriate axis labels and a plot title in place of "%%".

In [None]:
gf_violin(^^ ~ @@, data = starbucks, draw_quantiles = c(0.1, .5, 0.9), fill = 'gray85') %>%
  gf_refine(coord_flip()) %>%
  gf_labs(x = '%%',
          y = '%%',
          title = '%%')

### Questions to think about

1. What are the shape, center, and variation for each of the three size groups?
2. Does it appear that the size of the drink is important in differentiating the number of calories for a drink?
3. What other attributes may also be helpful to accurately predict the number of calories in a Starbucks drink?

## Question 2

Interpret the correlations found between calories, caffeine, and sugar of the Starbucks drinks.

Note: the code below returns a correlation matrix with 1's on the diagonal (these can be ignored). Each off-diagonal element is the correlation between the attribute named in its row and the attribute named in its column. For example, the value in the row `calories` and the column `caffeine_mg` is the correlation between calories and caffeine.
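To make the layout of a correlation matrix concrete, here is a minimal sketch using the built-in `mtcars` data as a stand-in (an assumption for illustration; it is not the Starbucks data). With three attributes there are three unique correlations, each appearing twice because the matrix is symmetric.

```r
# Stand-in data: three numeric columns from the built-in mtcars data set
# (not the Starbucks data used in this activity).
cor_mat <- cor(mtcars[, c("mpg", "wt", "hp")])

# Full matrix: 1's on the diagonal, symmetric off the diagonal
round(cor_mat, 2)

# A single correlation is looked up by row name and column name
cor_mat["mpg", "wt"]

# The same value appears in the mirrored position
cor_mat["wt", "mpg"]
```

A negative value, such as the one between `mpg` and `wt` here, means that one attribute tends to decrease as the other increases.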

In [None]:
starbucks %>%
  select(calories, caffeine_mg, sugar_g) %>%
  cor()

### Questions to consider

1. Interpret the 3 unique correlations between calories, caffeine, and sugar. 
2. Which attribute would help predict the number of calories in a drink given the 3 correlation values? 

## Question 3

Fit a regression tree to predict the number of calories in a Starbucks drink from the other data attributes. **Note:** you can use any attribute as a predictor except the product name and calories itself.

The outcome (i.e., `calories`) is already filled in on the left side of the "~". Place the predictor attributes in place of the "@@", separating multiple attributes with the "+" symbol.
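As an illustration of the formula syntax only, the sketch below fits a regression tree to the built-in `mtcars` data (a stand-in, not the Starbucks data), with `mpg` as the outcome and three predictors joined by "+". The object name `toy_tree` is hypothetical.

```r
library(rpart)  # rpart ships with standard R installations

# Stand-in data: mtcars, with mpg as the outcome and three predictors.
# Predictors are joined with "+" on the right-hand side of the formula.
toy_tree <- rpart(mpg ~ wt + hp + cyl, data = mtcars, method = "anova")

# Text summary of the splits that were chosen
print(toy_tree)

# Predictors that were actually used in at least one split
unique(as.character(toy_tree$frame$var[toy_tree$frame$var != "<leaf>"]))
```

Listing a predictor in the formula only makes it available to the tree; a predictor that never improves a split will not appear in the fitted tree.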

In [None]:
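# Fix the random seed so the train/test split is reproducible
set.seed(2022)

# Hold out 25% of the drinks for testing; train on the other 75%
starbucks_split <- initial_split(starbucks, prop = .75)
starbucks_train <- training(starbucks_split)
starbucks_test <- testing(starbucks_split)

# Fit the regression tree on the training data only
calories_tree <- rpart(calories ~ @@, data = starbucks_train, method = 'anova')

# Draw the fitted tree
rpart.plot(calories_tree, roundint = FALSE, type = 3, branch = .3)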

### Questions to consider

1. What attribute is the strongest predictor of the number of calories in a drink?
2. Were all the attributes useful in predicting the number of calories?

## Question 4

Interpret the accuracy of the model in predicting the number of calories. Note that the following code depends on the code in Question 3 having run successfully.

The error summarized below is the absolute error, that is, the absolute value of the difference between the actual calories and the model-predicted calories.
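A small worked example may help show why the absolute value is taken. The numbers below are made up for illustration and do not come from the Starbucks data.

```r
# Made-up values for illustration only (not from the Starbucks data)
actual    <- c(100, 250, 400)
predicted <- c(120, 240, 330)

error     <- actual - predicted   # signed error: -20, 10, 70
abs_error <- abs(error)           # absolute error: 20, 10, 70

mean(error)        # signed errors partly cancel each other
mean(abs_error)    # mean absolute error
median(abs_error)  # median absolute error
```

Because over-predictions and under-predictions have opposite signs, averaging the signed error can understate how far off the predictions typically are; the absolute error avoids that cancellation.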

In [None]:
# Add predictions, signed error, and calorie groups to the test data
starbucks_test <- starbucks_test %>%
    mutate(calorie_pred = predict(calories_tree, newdata = .),
           error = calories - calorie_pred,
           calorie_group = cut_number(calories, n = 5))

# Summarize the absolute error on the test data
df_stats(~ abs(error), data = starbucks_test, mean, median, sd, IQR, min, max)

### Questions to consider

1. What metric is the error here? More specifically, how is the mean or median value interpreted here? 
2. Evaluate how well the model is performing. Is the model doing well at predicting calories or is it not doing a good job? 
3. What additional information is helpful in answering #2 to provide relevant context to evaluate overall model performance? 

## Conditional Error

Evaluate the conditional error within the following table and figure. 

*Note:* The calories were split into 5 groups of about equal size. The range of each calorie group is shown in the column `calorie_group`.
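The idea of splitting a numeric attribute into groups of about equal size can be sketched in base R by cutting at quantiles. This is a rough analogue shown on made-up values, not necessarily the exact procedure `cut_number()` uses, so the group boundaries it reports may differ.

```r
# Made-up calorie values for illustration only
toy_calories <- c(0, 5, 80, 90, 110, 150, 190, 200, 260, 300,
                  320, 350, 380, 410, 450, 470, 500, 520, 590, 640)

# Break points at the 0%, 20%, ..., 100% quantiles give 5 groups
breaks <- quantile(toy_calories, probs = seq(0, 1, by = 0.2))

# Assign each value to the interval it falls in
toy_group <- cut(toy_calories, breaks = breaks, include.lowest = TRUE)

# Each group holds about the same number of drinks
table(toy_group)
```

Note that the groups have equal counts, not equal widths: a group covering a sparse part of the calorie range spans a wider interval.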

The figure shows the range of the absolute error for each calorie group: the left end of each line is the minimum error and the right end is the maximum error. The circle marks the mean absolute error.
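The three values drawn for each group (minimum, mean, maximum) can be computed by hand on a toy example. The data frame `toy` and its group labels below are made up for illustration.

```r
# Made-up absolute errors and group labels for illustration only
toy <- data.frame(
  group     = c("low", "low", "low", "high", "high", "high"),
  abs_error = c(5, 10, 30, 20, 40, 90)
)

# One row per group with the three values the figure draws
toy_summary <- do.call(
  rbind,
  lapply(split(toy$abs_error, toy$group), function(x) {
    c(min = min(x), mean = mean(x), max = max(x))
  })
)
toy_summary
```

In this toy case the "high" group has both a larger mean and a wider range of absolute error than the "low" group, which is the kind of pattern to look for in the figure.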

In [None]:
conditional_error <- starbucks_test %>%
   df_stats(abs(error) ~ calorie_group, mean, median, sd, IQR, min, max)

conditional_error

In [None]:
gf_pointrangeh(calorie_group ~ mean + min + max, data = conditional_error) %>%
  gf_labs(y = 'Calories',
          x = 'Absolute Error')

### Questions to consider

1. How does the conditional error change across the 5 different groups? 
2. Do there appear to be differences in the absolute error across the calorie groups?