 ## Assignment 2

 This assignment is designed to give you practice using Jupyter Notebooks and R, and interpreting statistical output from real-world data. You will use the notebook below to generate the statistical output, and you will write up answers to specific questions about that output. The assignment is guided and much of the R code is provided for you, but you will be asked to interact with specific parts of the code and decide on appropriate values to include. Run the notebook in sequential order starting from the first code cell; the beginning cells load the R packages needed for the assignment and read in the data.

 You may work in groups of up to 3 to complete the assignment. If you do, please turn in one assignment in ICON with all group members' names on the submission.

 *Assignment 2 Due*: **Monday, October 14, by 11:59 pm**

 ## Description of the Data

 These are housing data from Ames, Iowa for the years 2006 through 2009 and through July of 2010. The data come from an R package named `AmesHousing` and contain a variety of variables about the houses sold during these years. The data have been simplified for this assignment by focusing on single-family houses (i.e. omitting condos, multifamily homes, apartments, etc.) and retaining a subset of the attributes (i.e. variables) from the full data. These are described in some detail below.

 + **SalePrice**: The home sale price in US dollars.
 + **price_above_60**: Whether the home sold above the 60th percentile of sale price: TRUE = above the 60th percentile, FALSE = at or below the 60th percentile.
 + **Yr_Sold**: Year the home was sold
 + **Mo_Sold**: Month the home was sold, represented as number, e.g. 1 = January
 + **Neighborhood**: Name of the neighborhood in Ames
 + **Lot_Config**: The configuration of the lot, whether it is a corner lot, on a cul-de-sac, inside lot, etc.
 + **Lot_Area**: Square footage of the lot the house resides on.
 + **Overall_Qual**: The overall quality of the home with respect to material and finish: 1 = worst quality, 10 = best quality.
 + **Overall_Cond**: The overall condition of the home: 1 = worst condition, 10 = best condition.
 + **Year_Built**: The year the home was built.
 + **Gr_Liv_Area**: Total above-ground square footage of the home; does not include basement square footage.
 + **Bedroom_AbvGr**: Number of bedrooms above ground; does not include any bedrooms in the basement.
 + **num_baths**: Number of bathrooms.
 + **Fireplaces**: Number of fireplaces
 + **Garage_Cars**: Number of cars the garage can hold
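
 As a rough illustration of how a flag like `price_above_60` is built, the sketch below applies the same idea to a handful of made-up prices. The names `toy_prices`, `cutoff`, and `toy_above_60` are hypothetical and are not part of the assignment data.

```r
# Made-up sale prices (illustrative only, not taken from the Ames data)
toy_prices <- c(100000, 120000, 135000, 150000, 180000,
                210000, 250000, 300000, 320000, 400000)

# 60th percentile of the toy prices
cutoff <- quantile(toy_prices, 0.6)

# TRUE when a price is strictly above the cutoff, FALSE otherwise
toy_above_60 <- toy_prices > cutoff

cutoff
table(toy_above_60)
```

 Because the comparison is strict, a home priced exactly at the cutoff would be labeled FALSE.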

 Please don't hesitate to reach out with any questions about the structure and interpretation of the variables in the data.

 ## Assignment Setup
 **Run this cell the first time you open the notebook. This cell only needs to be run once.**

In [0]:
.libPaths('../RPackages')

install.packages("AmesHousing")


 **Run this cell first every time you open the notebook.** This cell loads the R packages and prepares the data for you.

In [0]:
.libPaths('../RPackages')

library(tidyverse)
library(ggformula)
library(mosaic)
library(rpart)
library(rpart.plot)
library(AmesHousing)
library(rsample)

theme_set(theme_bw(base_size = 14))

ames <- ames_raw
names(ames) <- gsub("\\s", "_", names(ames))

ames <- ames %>% 
  filter(Bldg_Type == '1Fam') %>%
  mutate(num_baths = Full_Bath + .5 * Half_Bath,
         price_above_60 = SalePrice > quantile(SalePrice, .6)) %>%
  select(SalePrice, price_above_60, Yr_Sold, Mo_Sold, 
         Neighborhood, Lot_Config, Lot_Area, Overall_Qual, Overall_Cond, 
         Year_Built, Gr_Liv_Area, Bedroom_AbvGr, num_baths, Fireplaces, Garage_Cars)



 ## Question 1
 View the first few rows of the data by completing the code chunk below: replace the "??" with the name of the object in which the data are stored in the assignment setup step above. **1 pt**

In [0]:
head(??)


 ## Question 2
 Explore the `SalePrice` distribution visually using the code provided below.

 Complete the code by filling in the appropriate variable where the "^^" are and fill in the visualization type you are most comfortable with where "??" are located. Finally, replace the "$$" with an appropriate plot title and x-axis label that are descriptive. **2 pts**

In [0]:
gf_??(~ ^^, data = ames) %>%
  gf_labs(title = "$$", 
          x = "$$") 


 ## Question 3
 Based on the figure created in question 2 above, estimate the mean and median from the figure. Provide a sentence of rationale for why you think the mean and median are approximately what you estimated. Be as specific as you can about features in the figure that you considered when picking values for the mean and median. Note that there is no need to be exact or to compute the statistics yet (this will come later); rather, provide your best guess solely from the figure created in question 2. **1 pt**

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*

 ## Question 4
 Based on the figure created in question 2 above, estimate the sd and IQR from the figure. Provide a sentence of rationale for why you think the sd and IQR are approximately what you estimated. Be as specific as you can about features in the figure that you considered when picking values for the sd and IQR. Note that there is no need to be exact or to compute the statistics yet (this will come later); rather, provide your best guess solely from the figure created in question 2. **1 pt**
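
 As a reminder of what these summaries respond to, here is a small base R sketch on made-up numbers (the vector `toy` is hypothetical and unrelated to the Ames data). It compares the statistics with and without one extreme value.

```r
# Made-up values with one extreme observation at the end
toy <- c(90, 100, 105, 110, 115, 120, 125, 130, 140, 400)

# Center: the mean is pulled toward the extreme value, the median is not
c(mean = mean(toy), median = median(toy))

# Spread: sd uses every value, IQR only the middle half of the data
c(sd = sd(toy), IQR = IQR(toy))

# Same summaries after dropping the extreme value
toy_trimmed <- toy[toy < 400]
c(mean = mean(toy_trimmed), median = median(toy_trimmed),
  sd = sd(toy_trimmed), IQR = IQR(toy_trimmed))
```

 Comparing the two sets of output shows which statistics are sensitive to a long tail, which is worth keeping in mind when reading values off a skewed figure.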

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*

 ## Question 5
 Use the following code to compute descriptive statistics that summarize the center and variability for the `SalePrice` variable.

 Complete the code by filling in the appropriate descriptive functions where the "&&" are located in the code below. Functions that may be useful here could include: `mean`, `median`, `sd`, `var`, `IQR`, `min`, `max`, `length`. **1 pt**

In [0]:
ames %>%
  df_stats(~ SalePrice, &&)


 ## Question 6
 For each of the statistics you calculated in question 5 above, provide a one or two sentence discussion of what this statistic tells us about the `SalePrice` variable. More specifically, interpret the statistics in the context of this particular data. **2 pts**

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*

 ## Question 7
 Hypothesize which variables may help to explain variation in the `SalePrice` variable. Put another way, what are some important variables that may account for differences in the sale price of the home? These may include variables that are in the data or other variables that are not in the data; think big picture here about other things that may affect the home sale price. **2 pts**

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*

 ## Question 8
 Compute conditional multivariate descriptive statistics for the `SalePrice` variable based on a variable you think may help to explain differences in the sale price of the home. This may include variables you discussed in question 7 above.
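
 The idea of a conditional statistic (a summary of an outcome computed separately within each group) can be sketched in base R on the built-in `mtcars` data. This uses `aggregate()` rather than `df_stats()`, and the object name `by_cyl` is hypothetical.

```r
# mtcars ships with R: mpg is the outcome, cyl is the grouping variable
by_cyl <- aggregate(
  mpg ~ cyl,
  data = mtcars,
  FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
)

# One row per value of cyl, with the summaries of mpg inside each group
by_cyl
```

 The formula reads "outcome ~ grouping variable", the same layout used by the assignment code.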

 Complete the code by filling in the appropriate outcome variable in place of "^^", the variable(s) of interest in place of "$$", and the descriptive functions where the "&&" are located in the code below. Functions that may be useful here could include: `mean`, `median`, `sd`, `var`, `IQR`, `min`, `max`, `length`. **1 pt**

In [0]:
ames %>%
  df_stats(^^ ~ $$, &&)


 ## Question 9
 Based on the descriptive statistics computed in question 8 above, interpret important differences in the descriptive statistics by the variable(s) that you explored in question 8. Be as specific as possible about the differences and interpret the differences in the context of the data. **2 pts**

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*

 ## Question 10
 We are now going to explore which data attributes (i.e. house attributes) are most important in identifying houses that sell above the 60th percentile. A new variable named `price_above_60` was created to represent whether the house sale price was above the 60th percentile. If the house sold above the 60th percentile, the variable is labeled TRUE; if not, it is labeled FALSE. To see which attributes are important, we will fit a classification tree to these data using the `rpart()` function.
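
 For orientation, here is a minimal sketch of the `rpart()` formula syntax on the built-in `mtcars` data, assuming the `rpart` package is installed (it is loaded in the setup cell). The names `toy_cars`, `manual`, and `toy_tree` are hypothetical, and the predictors are chosen arbitrarily.

```r
library(rpart)

# Build a TRUE/FALSE outcome from a built-in data set
toy_cars <- mtcars
toy_cars$manual <- factor(toy_cars$am == 1)

# Outcome on the left of ~, predictors on the right separated by +
toy_tree <- rpart(manual ~ mpg + wt + hp,
                  data = toy_cars, method = "class")

# Text version of the fitted tree: each line is a split or a leaf
print(toy_tree)
```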
 Complete the code within the `rpart()` function below by replacing "$$" with variables included in the data that you think would be important in distinguishing between a house that sells above the 60th percentile vs houses that do not sell above the 60th percentile. Note, separate variables with a `+` symbol. **1 pt**

In [0]:
set.seed(2019)
ames_split <- initial_split(ames, prop = .7)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)


# Fit classification tree
ames_class <- rpart(price_above_60 ~ $$, 
                    method = 'class', data = ames_train)

rpart.plot(ames_class, roundint = FALSE, type = 3, branch = .3)


 ## Question 11
 Based on the classification tree figure created from question 10 above, what are the most important variables that help to differentiate between houses that sell above the 60th percentile? Which variable was the most important? **2 pts**

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*

 ## Question 12
 The code below will use the model fitted above and produce predicted values for the withheld test data. A bar chart is created that shows the accuracy of the predictions compared to the actual observed data.

In [0]:
# Predict on the test data with the tree fitted in question 10
ames_test <- ames_test %>%
  mutate(price_60_predict = predict(ames_class, 
                                    newdata = ames_test, 
                                    type = 'class'))
ames_test %>%
  count(price_above_60, price_60_predict)

gf_bar(~ price_above_60, fill = ~price_60_predict, data = ames_test, 
       position = 'fill')


 Do you feel the classification tree did a good job at accurately predicting which houses were above vs below the 60th percentile? Be as specific as possible citing the bar chart created above. **1 pt**
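
 One way to put a number on prediction accuracy is to cross-tabulate actual against predicted labels and compute the share that agree. The sketch below does this on made-up vectors; `actual`, `predicted`, `confusion`, and `accuracy` are hypothetical names, not objects from the assignment.

```r
# Made-up actual and predicted labels for ten houses
actual    <- c(TRUE, TRUE,  FALSE, FALSE, TRUE,
               FALSE, FALSE, TRUE, FALSE, FALSE)
predicted <- c(TRUE, FALSE, FALSE, FALSE, TRUE,
               TRUE,  FALSE, TRUE, FALSE, FALSE)

# Rows are the actual labels, columns are the predicted labels;
# correct predictions sit on the diagonal
confusion <- table(actual = actual, predicted = predicted)
confusion

# Proportion of houses whose predicted label matches the actual label
accuracy <- mean(actual == predicted)
accuracy
```

 The off-diagonal cells separate the two kinds of mistakes, which a single accuracy value hides.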

 ## Question 13
 The code chunk in question 14 fits two classification models and repeats the fit and evaluation over 100 random training/test splits. The first model includes most of the variables in the data: Yr_Sold, Mo_Sold, Lot_Area, Overall_Qual, Overall_Cond, Year_Built, Gr_Liv_Area, Bedroom_AbvGr, num_baths, Fireplaces, Garage_Cars. The second model is simplified to include only the following variables: Yr_Sold, Mo_Sold, Lot_Area, Bedroom_AbvGr, Fireplaces, Garage_Cars. Prior to running that code, hypothesize how much lower the classification accuracy will be for the model that includes fewer variables. To evaluate this, it may be worth thinking about how important the variables omitted from the second model are for classifying the houses. **1 pt**
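
 Evaluating a model over many random training/test splits, as the question 14 code does, can be sketched in a few lines of base R plus `rpart` on the built-in `iris` data. This is a simplified illustration under stated assumptions: it uses `sample()` instead of `initial_split()`, fits a single model, runs 20 repetitions, and all object names are hypothetical.

```r
library(rpart)

# Two-class outcome built from a data set that ships with R
dat <- iris
dat$is_virginica <- factor(dat$Species == "virginica")

set.seed(1)
n_reps <- 20
accuracy <- numeric(n_reps)

for (i in seq_len(n_reps)) {
  # Random 70/30 split into training and test rows
  train_rows <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
  train <- dat[train_rows, ]
  test  <- dat[-train_rows, ]

  # Fit on the training rows, predict the held-out rows
  fit  <- rpart(is_virginica ~ Petal.Length + Petal.Width,
                data = train, method = "class")
  pred <- predict(fit, newdata = test, type = "class")

  # Share of test rows classified correctly in this split
  accuracy[i] <- mean(pred == test$is_virginica)
}

# Accuracy varies from split to split, so look at the whole distribution
summary(accuracy)
```

 Repeating the split gives a distribution of accuracy values rather than a single number, which makes a comparison between two models less dependent on one lucky or unlucky split.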

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*

 ## Question 14
 Run the code below to generate a figure that compares model 1 and model 2 as described in question 13. Model 1 is the classification model with more variables included, and model 2 is the model with the reduced set of variables. Note that this command may take 20 to 30 seconds to complete.

In [0]:
set.seed(2019)
accuracy_list <- vector("list", 100)

for(i in 1:100) {
  ames_split <- initial_split(ames, prop = .7)
  ames_train <- training(ames_split)
  ames_test <- testing(ames_split)
  
  ames_class <- rpart(price_above_60 ~ Yr_Sold + Mo_Sold + Lot_Area + Overall_Qual + Overall_Cond + Year_Built + Gr_Liv_Area + Bedroom_AbvGr + num_baths + Fireplaces + Garage_Cars, 
                      method = 'class', data = ames_train)
  ames_class_red <- rpart(price_above_60 ~ Yr_Sold + Mo_Sold + Lot_Area + Bedroom_AbvGr + Fireplaces + Garage_Cars, 
                      method = 'class', data = ames_train)
  ames_test <- ames_test %>%
    mutate(price_60_predict = predict(ames_class, 
                                      newdata = ames_test, 
                                      type = 'class'),
           price_60_predict_red = predict(ames_class_red, 
                                      newdata = ames_test, 
                                      type = 'class'))
  
   full_mod <- ames_test %>%
    mutate(same_class = ifelse(price_above_60 == price_60_predict, 1, 0)) %>%
    df_stats(~ same_class, mean, sum) %>% 
    mutate(iteration = i,
           model = 'Model 1')
   red_mod <- ames_test %>%
     mutate(same_class = ifelse(price_above_60 == price_60_predict_red, 1, 0)) %>%
     df_stats(~ same_class, mean, sum) %>% 
     mutate(iteration = i, 
            model = 'Model 2')
   accuracy_list[[i]] <- rbind(full_mod, red_mod)
}

# Combine the per-iteration results and plot the accuracy distribution for each model
eval_accuracy <- bind_rows(accuracy_list)
gf_density(~ mean_same_class, data = eval_accuracy) %>%
  gf_facet_wrap(~ model)


 Summarise the difference in classification accuracy based on the figure created. Be as specific as possible about notable differences in the classification accuracy based on the two models. Are you surprised by any differences in classification accuracy observed? **2 pts**

 *Write your response in this cell by double clicking on this text. When finished typing your response, hit control + enter to convert the text.*