 ## Decision Trees - Regression Trees Supplemental


 ## Setup

 In this example, we will explore data on major league baseball players from the `Hitters` data set that comes with the R package `ISLR`. The data describe hitters' performance in the 1986 season along with their salary at the start of the 1987 season. Players with a missing salary are dropped from the data.

 ### Loading R packages
 The packages are loaded first, and the `Hitters` data are then processed to drop any rows with a missing `Salary` value. Finally, the first few rows of the data are shown with the `head()` function.

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)
library(ISLR)
library(rpart)
library(rsample)
library(rpart.plot)

theme_set(theme_bw(base_size = 16))

Hitters <- Hitters %>%
  drop_na(Salary)

head(Hitters)

 ## Fit Untransformed model
 In the section 8 notes, the model was fitted to the log of salary rather than to salary on its original scale. We discussed how to back-transform the predicted values using the exponential function. Here, let's explore what happens if the analysis is done on the original salary metric, which is highly skewed. For this analysis, I'm only going to change the outcome and keep the two attributes that we used before: the number of home runs and the number of hits in the prior season. Both models are shown below.

 ### Transformed model

 This model first applies the log transformation to player salary so that the outcome is closer to symmetric instead of heavily skewed.
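
 One quick way to see the skew that motivates the log transformation is to compare the mean and median of salary before and after taking logs. The following is a minimal sketch, not part of the original notes; the object name `salary` is introduced here only for illustration.

```r
library(ISLR)

# Salary has missing values in the raw data, so drop them first
salary <- na.omit(Hitters$Salary)

# A right-skewed variable has a mean well above its median
c(mean = mean(salary), median = median(salary))

# After the log transformation the two should sit much closer together
c(mean = mean(log(salary)), median = median(log(salary)))
```

 A large gap between mean and median on the original scale, and a small one on the log scale, is consistent with the heavy right skew described above.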

In [None]:
hit_reg <- rpart(log(Salary) ~ HmRun + Hits, data = Hitters, method = "anova", cp = .012)

rpart.plot(hit_reg, roundint = FALSE, type = 3, branch = .3)
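
 The back-transformation mentioned above can be sketched as follows. This assumes the same model specification as the cell above; the names `hitters_complete`, `log_fit`, `log_pred`, and `salary_pred` are introduced here for illustration and are not used elsewhere in the notes.

```r
library(ISLR)
library(rpart)

# Keep only players with a recorded salary
hitters_complete <- Hitters[!is.na(Hitters$Salary), ]

# Same specification as the transformed model above
log_fit <- rpart(log(Salary) ~ HmRun + Hits, data = hitters_complete,
                 method = "anova", cp = .012)

# Predictions come back on the log scale ...
log_pred <- predict(log_fit)

# ... and exp() returns them to the salary scale (thousands of dollars)
salary_pred <- exp(log_pred)

head(data.frame(Salary = hitters_complete$Salary, log_pred, salary_pred))
```

 Note that `exp()` of a leaf's mean log salary is the geometric mean of the salaries in that leaf, which is never larger than their arithmetic mean. The back-transformed predictions will therefore tend to sit below the leaf averages that the untransformed model reports.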

 ### Untransformed Model

 This model uses the `Salary` attribute as originally recorded, with its positively (right) skewed distribution.

In [None]:
hit_reg <- rpart(Salary ~ HmRun + Hits, data = Hitters, method = "anova", cp = .012)

rpart.plot(hit_reg, roundint = FALSE, type = 3, branch = .3)

 ## Visualize Differences

 This section explores the differences in the predictions from the two models. The first plot shows the predictions from the transformed model and the second shows the predictions from the untransformed model.

 ### Transformed Model

In [None]:
gf_point(HmRun ~ Hits, data = Hitters, color = ~ log(Salary)) %>% 
    gf_vline(xintercept = c(118, 146), size = 1) %>%
    gf_segment(8.5 + 8.5 ~ 0 + 118, size = 0.75, color = "black") %>%
    gf_segment(8.5 + 8.5 ~ 146 + 238, size = 0.75, color = "black") %>%
    gf_text(x = 1, y = 3, label = "5.4", color = "red", size = 5) %>%
    gf_text(x = 128, y = 3, label = "6.3", color = "red", size = 5) %>%
    gf_text(x = 170, y = 3, label = "6.1", color = "red", size = 5) %>%
    gf_text(x = 50, y = 35, label = "5.9", color = "red", size = 5) %>%
    gf_text(x = 200, y = 35, label = "6.7", color = "red", size = 5) %>%
    gf_refine(scale_color_distiller(palette = "BuGn")) %>%
    gf_labs(x = "Number of Hits",
            y = "Number of Home Runs",
            title = "Log salary by number of home runs and hits")

 ### Untransformed Model

In [None]:
gf_point(HmRun ~ Hits, data = Hitters, color = ~ Salary, size = 3) %>% 
    gf_vline(xintercept = 123, size = 1) %>%
    gf_hline(yintercept = 8.5, size = 1) %>%
    gf_segment(8.5 + 40 ~ 130 + 130, size = 0.75, color = "black") %>% 
    gf_segment(8.5 + 40 ~ 146 + 146, size = 0.75, color = "black") %>% 
    gf_segment(8.5 + 40 ~ 152 + 152, size = 0.75, color = "black") %>%
    gf_segment(8.5 + 40 ~ 160 + 160, size = 0.75, color = "black") %>%
    gf_text(x = 2, y = 4, label = "311", color = "red", size = 5) %>%
    gf_text(x = 100, y = 35, label = "455", color = "red", size = 5) %>%
    gf_text(x = 170, y = 3, label = "601", color = "red", size = 5) %>%
    gf_text(x = 137, y = 35, label = "605", color = "red", size = 5) %>%
    gf_text(x = 155, y = 35, label = "616", color = "red", size = 5) %>% 
    gf_text(x = 200, y = 35, label = "980", color = "red", size = 5) %>%
    gf_text(x = 150, y = 38, label = "1151", color = "red", size = 5) %>%
    gf_text(x = 127, y = 40, label = "1204", color = "red", size = 5) %>%
    gf_refine(scale_color_distiller(palette = "BuGn")) %>%
    gf_labs(x = "Number of Hits",
            y = "Number of Home Runs",
            title = "Salary by number of home runs and hits")
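
 The red labels in the two plots above are typed in by hand from the tree diagrams. As a hedged sketch, the same leaf predictions can be read from the fitted object instead; `hitters_complete`, `salary_fit`, and `leaves` are names introduced here for illustration.

```r
library(ISLR)
library(rpart)

# Keep only players with a recorded salary
hitters_complete <- Hitters[!is.na(Hitters$Salary), ]

# Same specification as the untransformed model above
salary_fit <- rpart(Salary ~ HmRun + Hits, data = hitters_complete,
                    method = "anova", cp = .012)

# The frame has one row per node; terminal nodes are marked "<leaf>"
leaves <- salary_fit$frame[salary_fit$frame$var == "<leaf>", c("n", "yval")]
leaves

# The distinct fitted values are the same leaf means
sort(unique(predict(salary_fit)))

# Printing the tree lists the split rules that define each region
print(salary_fit)
```

 Here `n` is the number of players in each terminal node and `yval` is the predicted salary for that region, which should correspond to the labels placed on the plot.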

 ## Error of Salary Model

 This section summarizes the prediction error from the model that did not transform salary. The errors are computed on the same data that were used to fit the tree.

In [None]:
Hitters <- Hitters %>%
  mutate(salary_pred = predict(hit_reg),
         error = Salary - salary_pred)

Hitters %>%
  df_stats(~ error, mean, median, min, max, sd)

Hitters %>%
  df_stats(~ abs(error), mean, median, min, max, sd)
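
 To put the two models on a common footing, the errors of the log model can be back-transformed and summarized on the salary scale as well. The sketch below is an illustration under the same model settings as above; `error_summary()` is a hypothetical helper written for this example, not a function from any of the loaded packages.

```r
library(ISLR)
library(rpart)

# Keep only players with a recorded salary
hitters_complete <- Hitters[!is.na(Hitters$Salary), ]

salary_fit <- rpart(Salary ~ HmRun + Hits, data = hitters_complete,
                    method = "anova", cp = .012)
log_fit <- rpart(log(Salary) ~ HmRun + Hits, data = hitters_complete,
                 method = "anova", cp = .012)

# Hypothetical helper: summarize errors on the salary scale
error_summary <- function(observed, predicted) {
  error <- observed - predicted
  c(mean_error = mean(error),
    mae = mean(abs(error)),
    rmse = sqrt(mean(error^2)))
}

untransformed <- error_summary(hitters_complete$Salary, predict(salary_fit))
log_backtransformed <- error_summary(hitters_complete$Salary,
                                     exp(predict(log_fit)))

rbind(untransformed, log_backtransformed)
```

 For the untransformed tree the mean error on the fitting data is zero up to rounding, because each leaf predicts the mean salary of its own players. The mean absolute error and root mean squared error are therefore the more informative columns when comparing the two models.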

 ## Use more variables
 Let's add more of the attributes in `Hitters` to the untransformed salary model to see how the tree and its prediction errors change.

In [None]:
hit_reg <- rpart(Salary ~ HmRun + Hits + CAtBat + CHits + CHmRun + CRuns + CRBI + CWalks + League + Division + PutOuts + Assists + Errors + Years + AtBat + Runs + RBI + Walks, 
   data = Hitters, method = "anova", cp = .012)

rpart.plot(hit_reg, roundint = FALSE, type = 3, branch = .3)

Hitters <- Hitters %>%
  mutate(salary_pred = predict(hit_reg),
         error = Salary - salary_pred)

Hitters %>%
  df_stats(~ error, mean, median, min, max, sd)

Hitters %>%
  df_stats(~ abs(error), mean, median, min, max, sd)
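
 The errors above are computed on the same players used to grow the tree, so they tend to look more favorable as variables are added. One way to check this is to hold some players back with the `rsample` package loaded in the setup. The sketch below is an assumption-laden illustration: the seed, the 80/20 split, and the object names are choices made here and do not come from the original notes.

```r
library(ISLR)
library(rpart)
library(rsample)

# Keep only players with a recorded salary
hitters_complete <- Hitters[!is.na(Hitters$Salary), ]

set.seed(2021)  # arbitrary seed, used only to make the split repeatable
hit_split <- initial_split(hitters_complete, prop = 0.8)
hit_train <- training(hit_split)
hit_test <- testing(hit_split)

# Same predictors as the larger model above, fitted to the training rows only
full_fit <- rpart(Salary ~ HmRun + Hits + CAtBat + CHits + CHmRun + CRuns +
                    CRBI + CWalks + League + Division + PutOuts + Assists +
                    Errors + Years + AtBat + Runs + RBI + Walks,
                  data = hit_train, method = "anova", cp = .012)

# Mean absolute error on the rows used for fitting and on the held-out rows
mae_train <- mean(abs(hit_train$Salary - predict(full_fit)))
mae_test <- mean(abs(hit_test$Salary - predict(full_fit, newdata = hit_test)))

c(train = mae_train, test = mae_test)
```

 If the held-out error is clearly larger than the training error, the extra variables are partly fitting noise in the training players rather than improving predictions for new ones.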
