# Group Project Proposal: Home Price Regression

## 1. Introduction

#### 1.1 Background Information

Housing prices depend on many factors, including location, house size, and lot size, to name only a few. The extent to which each factor correlates with price is not cut-and-dried. As a result, it can be difficult to decide what price to ask when selling a house or what price to offer when buying one. This project aims to estimate these prices and to identify the variables that most affect them.

#### 1.2 Objective Question

Which real estate factors are most useful for accurately predicting the selling price of a home? Can we build a regression model that accurately predicts home prices from a given set of real estate factors?

#### 1.3 Dataset Description

The dataset comprises 1,460 house sales in the city of Ames, Iowa, from 2006 to 2010. The response variable is the sale price of each home. Each observation also has approximately 80 descriptive real estate factors, including living area, number of bedrooms and bathrooms, and lot size.

In [1]:
#Place fun house picture here if possible!

## 2. Preliminary exploratory data analysis

#### 2.1 Reading in data

In [2]:
#TODO Check which packages are included in base R
install.packages('gsheet')
library(tidyverse)
#library(cowplot)
#library(scales)
library(tidymodels)
library(gsheet)
library(repr)
options(repr.matrix.max.rows = 8)
data <- gsheet2tbl('docs.google.com/spreadsheets/d/1nNlzfwXkHVk2i946pgf3247KT2vqCDfHZjrBjT-losg/edit?usp=sharing')

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ readr   1.3.1     ✔ forcats 0.5.0
✔ stringr 1.4.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ readr::col_factor() masks scales::col_factor()
✖ purrr::discard()    masks scales::discard()
✖ dplyr::filter()     masks stats::filter()
✖ stringr::fixed()    masks recipes::fixed()
✖ dplyr::lag()        masks stats::lag()
✖ readr::spec()       masks yardstick::spec()



#### 2.2 Clean and wrangle data

In [8]:
# Fix the random seed so that the train/test split is reproducible
set.seed(1234)

# Rename column 70 (the three-season porch area) to a syntactic R name
names(data)[70] <- "ThreeSsnPorch"

# Select the variables that we believe will be useful in predicting SalePrice,
# then replace the deck, porch, and pool areas with logical indicator columns
data_selected <- data %>%
    select(Id, MSSubClass, LotArea, LotConfig, Neighborhood, BldgType, HouseStyle,
           OverallQual, OverallCond, YearBuilt, YearRemodAdd, Exterior1st, Exterior2nd,
           TotalBsmtSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath,
           KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageCars, GarageArea,
           WoodDeckSF, OpenPorchSF, EnclosedPorch, ThreeSsnPorch, ScreenPorch,
           PoolArea, MoSold, YrSold, SalePrice) %>%
    mutate(HasDeck = WoodDeckSF > 0,
           HasPorch = (OpenPorchSF > 0 | EnclosedPorch > 0 | ThreeSsnPorch > 0 | ScreenPorch > 0),
           HasPool = PoolArea > 0) %>%
    select(-WoodDeckSF, -OpenPorchSF, -EnclosedPorch, -ThreeSsnPorch, -ScreenPorch, -PoolArea)

data_selected <- as.data.frame(data_selected)

# Split the selected data into a training set (75%) and a testing set (25%)
data_split <- initial_split(data_selected, prop = 3/4, strata = NULL)
data_train <- training(data_split)
data_test <- testing(data_split)

# Keep only the numeric columns of the training data for the summary table
data_selected_numeric <- select_if(data_train, is.numeric)

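| | Id | MSSubClass | LotArea | LotConfig | Neighborhood | BldgType | HouseStyle | OverallQual | OverallCond | YearBuilt | YearRemodAdd | ⋯ | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageCars | GarageArea | MoSold | YrSold | SalePrice | HasDeck | HasPorch | HasPool |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<lgl>` | `<lgl>` | `<lgl>` |
| 1 | 1 | 60 | 8450 | Inside | CollgCr | 1Fam | 2Story | 7 | 5 | 2003 | 2003 | ⋯ | 1 | 8 | 0 | 2 | 548 | 2 | 2008 | 208500 | FALSE | TRUE | FALSE |
| 2 | 2 | 20 | 9600 | FR2 | Veenker | 1Fam | 1Story | 6 | 8 | 1976 | 1976 | ⋯ | 1 | 6 | 1 | 2 | 460 | 5 | 2007 | 181500 | TRUE | FALSE | FALSE |
| 3 | 3 | 60 | 11250 | Inside | CollgCr | 1Fam | 2Story | 7 | 5 | 2001 | 2002 | ⋯ | 1 | 6 | 1 | 2 | 608 | 9 | 2008 | 223500 | FALSE | TRUE | FALSE |
| 5 | 5 | 60 | 14260 | FR2 | NoRidge | 1Fam | 2Story | 8 | 5 | 2000 | 2000 | ⋯ | 1 | 9 | 1 | 3 | 836 | 12 | 2008 | 250000 | TRUE | TRUE | FALSE |
| ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋱ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
| 1457 | 1457 | 20 | 13175 | Inside | NWAmes | 1Fam | 1Story | 6 | 6 | 1978 | 1988 | ⋯ | 1 | 7 | 2 | 2 | 500 | 2 | 2010 | 210000 | TRUE | FALSE | FALSE |
| 1458 | 1458 | 70 | 9042 | Inside | Crawfor | 1Fam | 2Story | 7 | 9 | 1941 | 2006 | ⋯ | 1 | 9 | 2 | 1 | 252 | 5 | 2010 | 266500 | FALSE | TRUE | FALSE |
| 1459 | 1459 | 20 | 9717 | Inside | NAmes | 1Fam | 1Story | 5 | 6 | 1950 | 1996 | ⋯ | 1 | 5 | 0 | 1 | 240 | 4 | 2010 | 142125 | TRUE | TRUE | FALSE |
| 1460 | 1460 | 20 | 9937 | Inside | Edwards | 1Fam | 1Story | 5 | 6 | 1965 | 1965 | ⋯ | 1 | 6 | 0 | 1 | 276 | 6 | 2008 | 147500 | TRUE | TRUE | FALSE |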


#### 2.3 Tabular Summary of Data

In [7]:
#Using only training data, summarize the data in at least one table (this is exploratory data analysis). 
#An example of a useful table could be one that reports the number of observations in each class, 
#the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
options(repr.matrix.max.cols = 22)

#Summarize the data based on Max, Min, Mean, and unique observations.
data_max <- summarize_all(data_selected_numeric,max, na.rm = TRUE)
data_min <- summarize_all(data_selected_numeric,min, na.rm = TRUE)
data_mean <- summarize_all(data_selected_numeric,mean, na.rm = TRUE)
data_unique_counts <- summarize_all(data_selected_numeric,n_distinct,na.rm=TRUE)

#Compile all summaries into one data frame.
data_observations <- data_max %>%
    rbind(data_min) %>%
    rbind(data_mean) %>%
    rbind(data_unique_counts) %>%
    signif(4)
data_observations <- cbind(ObservationType = c("Maximum value","Minimum value", "Mean value", "Number of unique observations"), data_observations)


data_observations


| ObservationType | Id | MSSubClass | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | TotalBsmtSF | GrLivArea | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | KitchenAbvGr | TotRmsAbvGrd | Fireplaces | GarageCars | GarageArea | MoSold | YrSold | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| Maximum value | 1460.0 | 190.0 | 215200 | 10.0 | 9.0 | 2009 | 2010 | 6110 | 5642 | 3.0 | 2.0 | 3.0 | 2.0 | 2.0 | 14.0 | 3.0 | 4.0 | 1418.0 | 12.0 | 2010 | 755000 |
| Minimum value | 1.0 | 20.0 | 1300 | 1.0 | 1.0 | 1872 | 1950 | 0 | 520 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2006 | 34900 |
| Mean value | 737.6 | 57.78 | 10560 | 6.113 | 5.568 | 1971 | 1985 | 1057 | 1523 | 0.4283 | 0.06119 | 1.565 | 0.3781 | 1.042 | 6.542 | 0.6155 | 1.769 | 472.2 | 6.353 | 2008 | 181200 |
| Number of unique observations | 1095.0 | 15.0 | 831 | 9.0 | 9.0 | 110 | 61 | 598 | 722 | 4.0 | 3.0 | 4.0 | 3.0 | 3.0 | 11.0 | 4.0 | 5.0 | 382.0 | 12.0 | 5 | 547 |
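
The summary table does not yet report missing data, which the comments in the cell above suggest including. A minimal sketch of one way to count missing values with base R follows. It runs on a small made-up data frame: `toy_train` and its values are hypothetical stand-ins for `data_train`, not real observations.

```r
# Hypothetical stand-in for data_train; the values are made up for illustration
toy_train <- data.frame(
    LotArea    = c(8000, 9500, NA, 14000),
    GarageArea = c(500, NA, 600, 800),
    SalePrice  = c(200000, 180000, 220000, 250000)
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy_train))

# Number of rows that have at least one missing value
rows_with_missing <- sum(!complete.cases(toy_train))

missing_per_column
rows_with_missing
```

Applied to the real training data, the per-column counts could be added as one more row of the summary table.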


#### 2.4 Visual Summary of Data

In [8]:
#Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis).
#An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your 
#analysis.
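
One possible starting point for this cell, offered as a sketch rather than the final figure: a scatterplot of a single predictor against the response, and a histogram of the response. The data frame `toy_train` is a hypothetical stand-in for `data_train` with made-up values, and the choice of `GrLivArea` as the predictor is an assumption for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for data_train; the values are made up for illustration
toy_train <- data.frame(
    GrLivArea = c(900, 1200, 1500, 1800, 2400, 3000),
    SalePrice = c(110000, 140000, 175000, 210000, 280000, 340000)
)

# Scatterplot of above-ground living area against sale price
area_vs_price <- ggplot(toy_train, aes(x = GrLivArea, y = SalePrice)) +
    geom_point(alpha = 0.6) +
    labs(x = "Above-ground living area (sq ft)", y = "Sale price (USD)")

# Histogram of the response variable to check for skew
price_hist <- ggplot(toy_train, aes(x = SalePrice)) +
    geom_histogram(bins = 5) +
    labs(x = "Sale price (USD)", y = "Number of homes")

area_vs_price
price_hist
```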

## 3. Methods

//Explain how you will conduct your data analysis and which variables/columns you will use. 
//Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. 
//For each variable think: is this a useful variable for prediction?
//TODO Describe at least one way in which you will visualize results

(DRAFT)
To determine which real estate factors are most useful for accurately predicting the selling price of a home, 
we will construct a multiple linear regression model. Below is an overview of our proposed data analysis methods:
- Reduce the number of potential explanatory variables from ~80 to [!!! INSERT HERE] (refer to section 2.2)
- Clean, organize, and summarize the dataset (sections 2.2-2.4)
- Explore and document pairwise relationships using graphical (scatterplot) and statistical (correlation) analysis
- Conduct variable selection analysis to narrow down to a "best subset" of explanatory variables, leveraging the R 'leaps' and 'caret' packages
- Tune model for a final result
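
The pairwise-correlation and variable-selection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the planned implementation: the data are simulated (`toy_train` and every value in it are synthetic), and base R's `step()`, which performs AIC-based stepwise selection, stands in for the best-subset search we intend to run with 'leaps'.

```r
set.seed(1234)

# Simulated stand-in for the training data; every value here is synthetic
n <- 200
toy_train <- data.frame(
    GrLivArea   = rnorm(n, mean = 1500, sd = 400),
    OverallQual = sample(1:10, n, replace = TRUE),
    MoSold      = sample(1:12, n, replace = TRUE)
)
# By construction, price depends on area and quality but not on month sold
toy_train$SalePrice <- 20000 + 80 * toy_train$GrLivArea +
    15000 * toy_train$OverallQual + rnorm(n, sd = 20000)

# Pairwise correlation of each candidate predictor with the response
predictors <- setdiff(names(toy_train), "SalePrice")
price_cor <- sapply(toy_train[predictors], cor, y = toy_train$SalePrice)
round(price_cor, 2)

# Backward stepwise selection by AIC, starting from the full model
full_model <- lm(SalePrice ~ ., data = toy_train)
selected_model <- step(full_model, direction = "backward", trace = 0)
formula(selected_model)

# Root mean squared error on the same data used for fitting
rmse <- sqrt(mean(residuals(selected_model)^2))
rmse
```

In the actual analysis, the error would be measured on `data_test` rather than on the data used to fit the model, since in-sample error understates prediction error.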

Throughout the process, we will evaluate not only statistical significance but also the interpretability and meaningfulness of our results. Our priority is to produce a final result that we can explain in plain English.

## 4. Expected outcomes and significance

#### 4.1 Expected findings

// What do you expect to find?

(DRAFT)

We expect that measures of a home's size (square footage, number of bedrooms, lot size, etc.) will be the most positively correlated with selling price. Furthermore, we anticipate that recently renovated homes will have higher predicted prices than non-renovated homes. Finally, we expect some surprises: we hypothesize that some factors we did not anticipate will contribute significantly to the selling price of a house.

#### 4.2 Potential impact

(DRAFT)

In our view, there are two potential impacts from this work: 
1. Real estate agents and developers could gain a better understanding of how strongly each factor drives home selling prices.
2. We can carry key valuation insights into our personal lives for when we one day purchase homes of our own.

#### 4.3 Potential future questions

In [15]:
Our analysis of the factors contributing to the selling price of a home could spark future questions, including:
- TEAM TO FILL IN