 ## Assignment 3

 This assignment gives you practice using Jupyter Notebooks and R, and interpreting statistical output from real-world data. You will use the notebook below to generate the statistical output and will write up answers to specific questions about that output. The assignment is guided and much of the R code is provided for you, but you will be asked to complete specific parts of the code and decide on appropriate values to include. Run the notebook in sequential order, starting from the first code cell: the beginning cells load the R packages needed for the assignment and read in the data, so later cells will not work without them.

 You may work in groups of up to 3 to complete the assignment. If you do, please turn in one assignment in ICON with all group members' names on the submission.

 *Assignment 3 Due*: **Friday, November 19th, by 11:59 pm**

 ## Description of the Data

 These are housing data from Ames, Iowa, for homes sold from 2006 through July 2010. The data come from an R package named `AmesHousing` and contain a variety of variables about the houses that were sold during these years. The data have been simplified for this assignment by keeping only single-family houses (i.e., omitting condos, multifamily homes, apartments, etc.) and retaining a subset of the attributes (i.e., variables) from the full data. These attributes are described below.

 + **SalePrice**: The home sale price in US dollars.
 + **price_above_60**: Whether the home sold above the 60th percentile of sale price: TRUE = above the 60th percentile, FALSE = at or below the 60th percentile
 + **Yr_Sold**: Year the home was sold
 + **Mo_Sold**: Month the home was sold, represented as a number, e.g. 1 = January
 + **Neighborhood**: Name of the neighborhood in Ames
 + **Lot_Config**: The configuration of the lot, whether it is a corner lot, on a cul-de-sac, inside lot, etc.
 + **Lot_Area**: Square footage of the lot the house resides on.
 + **Overall_Qual**: The overall quality of the home with respect to material and finish: 1 = worst quality, 10 = best quality
 + **Overall_Cond**: The overall condition of the home: 1 = worst condition, 10 = best condition.
 + **Year_Built**: The year the home was built.
 + **Gr_Liv_Area**: Total square feet of the home that is above ground; does not include basement square footage.
 + **Bedroom_AbvGr**: Number of bedrooms above ground; does not include any bedrooms in the basement.
 + **num_baths**: Number of bathrooms, computed as full baths plus 0.5 for each half bath.
 + **Fireplaces**: Number of fireplaces
 + **Garage_Cars**: Number of cars the garage can hold

 Please don't hesitate to reach out with any questions about the structure and interpretation of the variables in the data.

## Assignment Setup

 **Run this cell first every time you open the notebook.** This cell loads the R packages and prepares the data for you.

In [None]:
library(tidyverse)
library(ggformula)
library(mosaic)
library(rpart)
library(rpart.plot)
library(AmesHousing)
library(rsample)
library(corrr)

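# Print large numbers (such as sale prices) without scientific notation
options(scipen = 999)

# Use a black-and-white plot theme with larger text
theme_set(theme_bw(base_size = 16))

# Replace whitespace in the column names with underscores
ames <- ames_raw
names(ames) <- gsub("\\s", "_", names(ames))

# Keep single-family houses, derive the bathroom count and the
# above-60th-percentile indicator, then keep the attributes described above
ames <- ames %>%
  filter(Bldg_Type == '1Fam') %>%
  mutate(num_baths = Full_Bath + .5 * Half_Bath,
         price_above_60 = SalePrice > quantile(SalePrice, .6)) %>%
  select(SalePrice, price_above_60, Yr_Sold, Mo_Sold, 
         Neighborhood, Lot_Config, Lot_Area, Overall_Qual, Overall_Cond, 
         Year_Built, Gr_Liv_Area, Bedroom_AbvGr, num_baths, Fireplaces, Garage_Cars)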
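
 As a rough illustration of the two derived columns created in the setup cell, the sketch below applies the same expressions to a small made-up data frame. The `toy` data frame and its values are hypothetical and are not taken from the Ames data; only base R is used.

```r
# Toy data (hypothetical values, not from the Ames data)
toy <- data.frame(
  SalePrice = c(100000, 120000, 150000, 180000, 250000),
  Full_Bath = c(1, 1, 2, 2, 3),
  Half_Bath = c(0, 1, 0, 1, 1)
)

# Each half bath counts as 0.5 of a bathroom
toy$num_baths <- toy$Full_Bath + 0.5 * toy$Half_Bath

# 60th percentile of the toy sale prices
cutoff <- quantile(toy$SalePrice, 0.6)

# TRUE only when the price is strictly greater than the cutoff
toy$price_above_60 <- toy$SalePrice > cutoff

toy
```

 Because the comparison uses `>`, a house priced exactly at the cutoff would be marked FALSE.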

 ## Question 1
 Let's first explore a classification model to predict whether a house sells for more than the 60th percentile of sale prices for the houses in the data.
 Complete the code within the `rpart()` function below by replacing "$$" with the variables in the data that you think are important for distinguishing houses that sell above the 60th percentile from houses that do not. Note: separate variables with a `+` symbol. **1 pt**

In [None]:
set.seed(1985)
ames_split <- initial_split(ames, prop = .7)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)


# Fit classification tree
ames_class <- rpart(price_above_60 ~ $$, 
                    method = 'class', data = ames_train)

rpart.plot(ames_class, roundint = FALSE, type = 3, branch = .3)

ames_test <- ames_test %>%
  mutate(price_60_predict = predict(ames_class, 
                                    newdata = ames_test, 
                                    type = 'class'))

gf_bar(~ price_above_60, fill = ~price_60_predict, data = ames_test, 
       position = 'fill')
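
 To show the formula syntax and one way to read classification results, here is a minimal sketch that uses the `kyphosis` example data shipped with the `rpart` package rather than the housing data. The names `toy_tree`, `toy_pred`, `conf_table`, and `accuracy` are made up for this illustration, and for brevity the predictions are made on the same rows used to fit the tree, whereas the assignment predicts on the withheld test data.

```r
library(rpart)  # provides rpart() and the kyphosis example data

# Predictors are joined with `+` on the right-hand side of the formula
toy_tree <- rpart(Kyphosis ~ Age + Number + Start,
                  data = kyphosis, method = "class")

# Predicted class for each row
toy_pred <- predict(toy_tree, newdata = kyphosis, type = "class")

# Cross-tabulate observed versus predicted classes
conf_table <- table(observed = kyphosis$Kyphosis, predicted = toy_pred)
conf_table

# Proportion of each observed class that received each predicted class
prop.table(conf_table, margin = 1)

# Proportion of rows classified correctly
accuracy <- sum(diag(conf_table)) / sum(conf_table)
accuracy
```

 The row proportions from `prop.table()` are the same kind of quantity that a filled bar chart displays: within each observed class, the share of rows assigned to each predicted class.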

 ## Question 2
 
 For the remainder of the assignment, we'll explore the `SalePrice` attribute, which represents how much the house sold for. Pick a **continuous, quantitative** attribute and complete the code below to create a scatterplot of the relationship between `SalePrice` and that attribute. Put the attribute of your choosing in place of "$$" and add appropriate axis labels in place of "@@".

 Examples of attributes to use in place of "$$" include: `Lot_Area`, `Overall_Qual`, `Overall_Cond`, `Year_Built`, `Gr_Liv_Area`, `Bedroom_AbvGr`.

In [None]:
gf_point(SalePrice ~ $$, data = ames_train, size = 3) %>%
  gf_smooth(method = 'lm', linetype = 2, size = 1) %>%
  gf_labs(x = "@@",
          y = "@@")
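
 The dashed line added by `gf_smooth(method = 'lm')` is a straight line fitted by least squares. As a hedged sketch of what that line represents, the example below fits the same kind of line with base R on the built-in `mtcars` data, which stands in for the housing data here. The name `toy_fit` is made up for this illustration.

```r
# Built-in mtcars data stands in for the housing data
toy_fit <- lm(mpg ~ wt, data = mtcars)

# Intercept and slope of the fitted straight line
coef(toy_fit)

# Base R version of a scatterplot with the fitted line overlaid
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
abline(toy_fit, lty = 2)
```

 The sign of the slope gives the direction of the relationship: a negative slope means the outcome tends to decrease as the attribute increases.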

 ## Question 3
 Complete the code below that computes the correlations between the `SalePrice` attribute and a few other attributes. 
 
Attributes you could use in place of "&&" include: `Garage_Cars`, `Fireplaces`, `num_baths`.

Use the same attribute from Question 2 in place of "$$".

Note: If a correlation shows up as NA, the group most likely contains only 1 data point, and a correlation cannot be computed from a single point. The number of data records going into each correlation is shown in the `num_data` column. 

In [None]:
ames_train %>%
  group_by(&&) %>%
  summarise(cor = cor(SalePrice ~ $$), 
            num_data = n())
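
 The sketch below illustrates grouped correlations and the NA case on a small made-up data frame. It uses base R (`split()` with the two-argument `cor(x, y)`) instead of the pipeline above, and the `toy` data, the `by_group` list, and the `toy_cor` result are hypothetical names and values chosen for this illustration.

```r
# Toy data (hypothetical values): group "c" has only one row
toy <- data.frame(
  group = c("a", "a", "a", "b", "b", "b", "c"),
  x     = c(1, 2, 3, 1, 2, 3, 5),
  y     = c(2, 4, 6, 3, 2, 1, 7)
)

# Correlation and row count within each group
by_group <- lapply(split(toy, toy$group), function(d) {
  data.frame(cor = suppressWarnings(cor(d$x, d$y)), num_data = nrow(d))
})
toy_cor <- do.call(rbind, by_group)
toy_cor
```

 Group "a" is perfectly increasing, group "b" is perfectly decreasing, and group "c" has a single row, so its correlation is NA.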

 ## Question 4
 Now fit a regression tree to the Ames housing training data to predict the sale price of the house. In place of "$$", add the data attributes that you feel will help predict the sale price. Note: separate variables with a `+` symbol. Do not include the `price_above_60` attribute, because it is computed directly from `SalePrice`. I also recommend not using the `Neighborhood` attribute, as the resulting figure will be extremely small. 

In [None]:
ames_saleprice <- rpart(SalePrice ~ $$, data = ames_train, method = 'anova')

rpart.plot(ames_saleprice, roundint = FALSE, type = 3, branch = .3)
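
 As a minimal sketch of how a regression tree produces predictions, the example below fits one to the built-in `mtcars` data, which stands in for the housing data. The names `toy_reg`, `toy_fitted`, and `n_leaves` are made up for this illustration.

```r
library(rpart)

# Regression tree: numeric outcome, predictors joined with `+`
toy_reg <- rpart(mpg ~ wt + hp + disp, data = mtcars, method = "anova")

# Each prediction is the mean outcome of the leaf that a row falls into
toy_fitted <- predict(toy_reg, newdata = mtcars)

# Number of leaves (terminal nodes) in the fitted tree
n_leaves <- sum(toy_reg$frame$var == "<leaf>")
n_leaves

# Number of distinct predicted values, which cannot exceed the number of leaves
length(unique(toy_fitted))
```

 Because every row in a leaf gets the same predicted value, a regression tree with few leaves can only produce a few distinct predictions.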

 ## Question 5
 Complete the following code to evaluate prediction accuracy on the withheld test data. Fill in appropriate descriptive functions (e.g., `mean`, `median`, `sd`, `min`, `max`, `IQR`, `length`) in place of "@@" to calculate statistics that show how well the model predicted the sale prices in the test data.

In [None]:
ames_test <- ames_test %>%
  mutate(sale_predicted = predict(ames_saleprice, newdata = ames_test),
         error = SalePrice - sale_predicted,
         absolute_error = abs(error))

ames_test %>%
  df_stats(~ absolute_error, @@)
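
 To make the error columns concrete, here is a small worked sketch with made-up numbers. The `observed` and `predicted` vectors are hypothetical, and the summaries are computed with base R functions rather than `df_stats()`.

```r
# Hypothetical observed and predicted prices for five houses
observed  <- c(150000, 200000, 175000, 300000, 120000)
predicted <- c(160000, 190000, 175000, 250000, 130000)

error          <- observed - predicted  # positive = model predicted too low
absolute_error <- abs(error)            # size of the miss, ignoring direction

# A few summaries of how far off the predictions are
c(mean   = mean(absolute_error),    # 16000
  median = median(absolute_error),  # 10000
  max    = max(absolute_error))     # 50000
```

 The mean is pulled upward by the single large miss of 50000, while the median is not, which is one reason to look at more than one statistic.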

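 ## Question 6
 The following code computes the absolute prediction error conditional on sale price, using 10 ranges of sale price that each contain roughly the same number of houses. In the resulting figure, each horizontal line spans the minimum to the maximum absolute error within a range, and the single point on the line marks the mean absolute error. Each range is labeled by its upper bound.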

In [None]:
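# Mean, minimum, and maximum absolute error within 10 sale price ranges
cond_error <- ames_test %>%
  df_stats(absolute_error ~ cut_number(SalePrice, 10), mean, min, max) %>%
  rename(saleprice = `cut_number(SalePrice, 10)`)  %>% 
  # Label each range by its upper bound, e.g. "(a,b]" becomes "b"
  mutate(saleprice2 = gsub("]", "", str_split_fixed(as.character(saleprice), ",", n = 2)[,2]))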

gf_pointrangeh(saleprice2 ~ mean + min + max, data = cond_error) %>%
  gf_labs(y = "Sale Price", 
                 x = "Absolute Error")
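
 The binning step can be sketched in base R as follows. This is an approximation of what `cut_number()` does, using quantile breaks with `cut()`, and it uses four bins and made-up data instead of 10 bins of real sale prices. The names `price`, `abs_error`, `breaks`, `bin`, and `cond` are hypothetical.

```r
# Hypothetical prices and absolute errors for 20 houses
set.seed(1)
price     <- seq(100000, 290000, by = 10000)
abs_error <- runif(20, min = 0, max = 30000)

# Four bins with equal numbers of houses, cut at the quartiles
breaks <- quantile(price, probs = seq(0, 1, by = 0.25))
bin    <- cut(price, breaks = breaks, include.lowest = TRUE, dig.lab = 10)

# Mean, minimum, and maximum absolute error within each bin
cond <- data.frame(
  bin  = levels(bin),
  mean = as.vector(tapply(abs_error, bin, mean)),
  min  = as.vector(tapply(abs_error, bin, min)),
  max  = as.vector(tapply(abs_error, bin, max))
)

# Keep only the upper bound of each interval label, e.g. "(a,b]" becomes "b"
cond$upper <- sub("]", "", sub(".*,", "", cond$bin), fixed = TRUE)
cond
```

 Comparing the mean error across bins shows whether the model predicts some price ranges better than others.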