# Group Project Proposal: Home Price Regression

## 1. Introduction

#### 1.1 Background Information

Housing prices vary with many factors, including location, house size, and lot size, to name only a few. There is no cut-and-dried formula for how strongly each factor correlates with price. As a result, it can be difficult to decide what price to ask when selling a house or what price to offer when buying one. This project aims to help estimate these prices and identify the variables that affect them.

#### 1.2 Objective Question

Which real estate factors best predict the selling price of a home? Can we build a regression model that accurately predicts home prices from a given set of real estate factors?

#### 1.3 Dataset Description

The dataset comprises 1,460 house sales in the city of Ames, Iowa, from 2006 to 2010. The response variable is the sale price of each home. Each observation also includes approximately 80 descriptive real estate factors, such as living area, number of bedrooms and bathrooms, and lot size.

In [1]:
#Place fun house picture here if possible!

## 2. Preliminary exploratory data analysis

#### 2.1 Reading in data

In [16]:
#TODO Check which packages are included in base R
# Install gsheet only if it is not already available
if (!requireNamespace("gsheet", quietly = TRUE)) {
    install.packages("gsheet")
}
library(tidyverse)
#library(cowplot)
#library(scales)
library(tidymodels)
library(gsheet)
library(repr)
# Show at most 8 rows when a table is printed
options(repr.matrix.max.rows = 8)
# Read the house sales data from the shared Google Sheet
data <- gsheet2tbl('docs.google.com/spreadsheets/d/1nNlzfwXkHVk2i946pgf3247KT2vqCDfHZjrBjT-losg/edit?usp=sharing')



#### 2.2 Clean and wrangle data

In [27]:
# Rename column 70 (by position) to ThreeSsnPorch
names(data)[70] <- "ThreeSsnPorch"
# Keep the candidate predictors; replace the deck, porch, and pool areas with logical flags
data_selected <- data %>%
    select(Id, MSSubClass, LotArea, LotConfig, Neighborhood, BldgType, HouseStyle,
           OverallQual, OverallCond, YearBuilt, YearRemodAdd, Exterior1st, Exterior2nd,
           TotalBsmtSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath,
           KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageCars, GarageArea,
           WoodDeckSF, OpenPorchSF, EnclosedPorch, ThreeSsnPorch, ScreenPorch, PoolArea,
           MoSold, YrSold, SalePrice) %>%
    mutate(HasDeck = WoodDeckSF > 0,
           HasPorch = OpenPorchSF > 0 | EnclosedPorch > 0 | ThreeSsnPorch > 0 | ScreenPorch > 0,
           HasPool = PoolArea > 0) %>%
    select(-WoodDeckSF, -OpenPorchSF, -EnclosedPorch, -ThreeSsnPorch, -ScreenPorch, -PoolArea)
data_selected

# Summarize the numeric columns: maximum, minimum, mean, and non-missing count
data_selected_numeric <- select_if(data_selected, is.numeric)
data_max <- summarize_all(data_selected_numeric, max, na.rm = TRUE)
data_max
data_min <- summarize_all(data_selected_numeric, min, na.rm = TRUE)
data_mean <- summarize_all(data_selected_numeric, mean, na.rm = TRUE)
# Count the non-missing values in each column
data_count <- summarize_all(data_selected_numeric, ~ sum(!is.na(.)))

Id,MSSubClass,LotArea,LotConfig,Neighborhood,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,⋯,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,MoSold,YrSold,SalePrice,HasDeck,HasPorch,HasPool
<dbl>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>
1,60,8450,Inside,CollgCr,1Fam,2Story,7,5,2003,⋯,8,0,2,548,2,2008,208500,FALSE,TRUE,FALSE
2,20,9600,FR2,Veenker,1Fam,1Story,6,8,1976,⋯,6,1,2,460,5,2007,181500,TRUE,FALSE,FALSE
3,60,11250,Inside,CollgCr,1Fam,2Story,7,5,2001,⋯,6,1,2,608,9,2008,223500,FALSE,TRUE,FALSE
4,70,9550,Corner,Crawfor,1Fam,2Story,7,5,1915,⋯,7,1,3,642,2,2006,140000,FALSE,TRUE,FALSE
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
1457,20,13175,Inside,NWAmes,1Fam,1Story,6,6,1978,⋯,7,2,2,500,2,2010,210000,TRUE,FALSE,FALSE
1458,70,9042,Inside,Crawfor,1Fam,2Story,7,9,1941,⋯,9,2,1,252,5,2010,266500,FALSE,TRUE,FALSE
1459,20,9717,Inside,NAmes,1Fam,1Story,5,6,1950,⋯,5,0,1,240,4,2010,142125,TRUE,TRUE,FALSE
1460,20,9937,Inside,Edwards,1Fam,1Story,5,6,1965,⋯,6,0,1,276,6,2008,147500,TRUE,TRUE,FALSE


Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,TotalBsmtSF,GrLivArea,BsmtFullBath,⋯,FullBath,HalfBath,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,MoSold,YrSold,SalePrice
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1460,190,215245,10,9,2010,2010,6110,5642,3,⋯,3,2,3,14,3,4,1418,12,2010,755000


In [20]:
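library(tidymodels)
set.seed(1234)

# Put 75% of the rows in a training set and 25% in a test set.
# NOTE: the split is applied to the full table (data), so the column
# selection from section 2.2 still has to be applied to both sets.
data_split <- initial_split(data, prop = 3/4)
data_train <- training(data_split)
data_test <- testing(data_split)

data_train
data_test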


Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,⋯,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
1,60,RL,65,8450,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,Pave,,IR1,Lvl,AllPub,⋯,0,,,,0,9,2008,WD,Normal,223500
5,60,RL,84,14260,Pave,,IR1,Lvl,AllPub,⋯,0,,,,0,12,2008,WD,Normal,250000
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
1457,20,RL,85,13175,Pave,,Reg,Lvl,AllPub,⋯,0,,MnPrv,,0,2,2010,WD,Normal,210000
1458,70,RL,66,9042,Pave,,Reg,Lvl,AllPub,⋯,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1459,20,RL,68,9717,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,4,2010,WD,Normal,142125
1460,20,RL,75,9937,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,6,2008,WD,Normal,147500


Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,⋯,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
<dbl>,<dbl>,<chr>,<dbl>,<dbl>,<chr>,<chr>,<chr>,<chr>,<chr>,⋯,<dbl>,<chr>,<chr>,<chr>,<dbl>,<dbl>,<dbl>,<chr>,<chr>,<dbl>
4,70,RL,60,9550,Pave,,IR1,Lvl,AllPub,⋯,0,,,,0,2,2006,WD,Abnorml,140000
10,190,RL,50,7420,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,1,2008,WD,Normal,118000
11,20,RL,70,11200,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,2,2008,WD,Normal,129500
13,20,RL,,12968,Pave,,IR2,Lvl,AllPub,⋯,0,,,,0,9,2008,WD,Normal,144000
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
1438,20,RL,96,12444,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,11,2008,New,Partial,394617
1445,20,RL,63,8500,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,11,2007,WD,Normal,179600
1447,20,RL,,26142,Pave,,IR1,Lvl,AllPub,⋯,0,,,,0,4,2010,WD,Normal,157900
1452,20,RL,78,9262,Pave,,Reg,Lvl,AllPub,⋯,0,,,,0,5,2009,New,Partial,287090


#### 2.3 Tabular Summary of Data

In [28]:
#Using only training data, summarize the data in at least one table (this is exploratory data analysis). 
#An example of a useful table could be one that reports the number of observations in each class, 
#the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 


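# Summary statistics for the numeric columns
# NOTE: data_selected_numeric is built from the full data set in section 2.2, not from data_train
data_max <- summarize_all(data_selected_numeric, max, na.rm = TRUE)
data_max
data_min <- summarize_all(data_selected_numeric, min, na.rm = TRUE)
data_mean <- summarize_all(data_selected_numeric, mean, na.rm = TRUE)
# Number of distinct values in each column
data_unique_counts <- summarize_all(data_selected_numeric, n_distinct, na.rm = TRUE)
data_unique_counts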


Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,TotalBsmtSF,GrLivArea,BsmtFullBath,⋯,FullBath,HalfBath,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,MoSold,YrSold,SalePrice
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1460,190,215245,10,9,2010,2010,6110,5642,3,⋯,3,2,3,14,3,4,1418,12,2010,755000


Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,TotalBsmtSF,GrLivArea,BsmtFullBath,⋯,FullBath,HalfBath,KitchenAbvGr,TotRmsAbvGrd,Fireplaces,GarageCars,GarageArea,MoSold,YrSold,SalePrice
<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,⋯,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>,<int>
1460,15,1073,10,9,112,61,721,861,4,⋯,4,3,4,12,4,5,441,12,5,663
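
The summary prompt above also asks how many rows have missing data. A minimal sketch of one way to count missing values per column is shown below. The helper name `count_missing` and the toy table `toy_train` are hypothetical and use made-up values; in the notebook the helper would presumably be applied to `data_train` instead.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: count the missing values in every column of a data
# frame and return one row per column, with the most incomplete columns first.
count_missing <- function(df) {
  df %>%
    summarize(across(everything(), ~ sum(is.na(.x)))) %>%
    pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
    arrange(desc(n_missing))
}

# Toy stand-in for data_train (made-up values, not the Ames data)
toy_train <- tibble(
  LotFrontage = c(60, NA, 70, NA),
  LotArea     = c(8000, 9000, 10000, 11000),
  Alley       = c(NA_character_, NA_character_, NA_character_, "Pave")
)

missing_summary <- count_missing(toy_train)
missing_summary
```

Columns with many missing values would be candidates to drop or impute before modelling.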


#### 2.4 Visual Summary of Data

In [8]:
#Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis).
#An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your 
#analysis.
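
One candidate plot for this section is a scatterplot of sale price against living area. The sketch below is only an illustration under stated assumptions: `toy_train` is simulated data that borrows the column names `GrLivArea` and `SalePrice` from the dataset, and the linear relationship in it is invented. In the notebook the same `ggplot` call would presumably take `data_train`.

```r
library(ggplot2)

# Simulated stand-in for data_train (random values, not the Ames data)
set.seed(1234)
toy_train <- data.frame(GrLivArea = runif(200, min = 500, max = 4000))
toy_train$SalePrice <- 20000 + 100 * toy_train$GrLivArea + rnorm(200, sd = 25000)

# Scatterplot of the response against one predictor, with a fitted straight line
price_vs_area <- ggplot(toy_train, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Above-ground living area (square feet)",
       y = "Sale price (USD)",
       title = "Sale price vs. living area (simulated data)")
price_vs_area
```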

## 3. Methods

//Explain how you will conduct your data analysis and which variables/columns you will use. 
//Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. 
//For each variable think: is this a useful variable for prediction?
//TODO Describe at least one way in which you will visualize results

(DRAFT)
To determine which real estate factors best predict the selling price of a home, 
we will construct a multiple linear regression model. Below is an overview of our proposed data analysis methods:
- Reduce the number of potential explanatory variables from ~80 to [!!! INSERT HERE] (refer to section 2.2)
- Clean, organize, and summarize the dataset (sections 2.2-2.4)
- Explore and document pairwise relationships using graphical (scatterplot) and statistical (correlation) analysis
- Conduct variable selection analysis to narrow down to a "best subset" of explanatory variables, leveraging the R 'leaps' and 'caret' packages
- Tune model for a final result
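
The correlation and variable selection steps above can be sketched roughly as follows. This is a minimal illustration, not the final analysis: `toy_train` is simulated, the coefficients used to generate it are invented, and only the `leaps` package is used (the `caret` part of the plan is not shown). By construction, only `GrLivArea` and `OverallQual` drive the simulated price.

```r
library(leaps)

# Simulated stand-in for the training data (random values, not the Ames data)
set.seed(1234)
n_obs <- 300
toy_train <- data.frame(
  GrLivArea   = runif(n_obs, 500, 4000),
  OverallQual = sample(1:10, n_obs, replace = TRUE),
  LotArea     = runif(n_obs, 2000, 20000),
  MoSold      = sample(1:12, n_obs, replace = TRUE)
)
toy_train$SalePrice <- 10000 + 80 * toy_train$GrLivArea +
  15000 * toy_train$OverallQual + rnorm(n_obs, sd = 20000)

# Step 1: correlation of every column with the response
price_cor <- cor(toy_train)[, "SalePrice"]
sort(price_cor, decreasing = TRUE)

# Step 2: exhaustive best-subset search over models with 1 to 4 predictors
subsets <- regsubsets(SalePrice ~ ., data = toy_train, nvmax = 4)
subset_summary <- summary(subsets)

# Pick the model size with the highest adjusted R-squared and show its coefficients
best_size <- which.max(subset_summary$adjr2)
coef(subsets, id = best_size)
```

Adjusted R-squared is only one possible selection criterion; `subset_summary` also reports `bic` and `cp`, which penalize model size more heavily.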

Throughout the process, we will be evaluating not only statistical significance but also interpretability and meaningfulness of our results. Our imperative is to produce a final result that we can explain in plain English.
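
One way to check both accuracy and interpretability is to fit the chosen model on the training set, inspect its coefficients, and compare predictions with actual prices on the test set. The sketch below assumes the `tidymodels` workflow already loaded in section 2.1. The data are simulated, and the two predictors are placeholders rather than the final variable set.

```r
library(tidymodels)

# Simulated stand-in for the cleaned data (random values, not the Ames data)
set.seed(1234)
n_obs <- 400
toy <- tibble(
  GrLivArea   = runif(n_obs, 500, 4000),
  OverallQual = sample(1:10, n_obs, replace = TRUE)
) %>%
  mutate(SalePrice = 10000 + 80 * GrLivArea + 15000 * OverallQual +
           rnorm(n_obs, sd = 20000))

toy_split <- initial_split(toy, prop = 3/4)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Fit a multiple linear regression on the training set only
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(SalePrice ~ GrLivArea + OverallQual, data = toy_train)

# Coefficient table: each estimate is the change in price per unit of the predictor
tidy(lm_fit)

# Predict on the held-out test set and compute the root mean squared error
toy_results <- predict(lm_fit, new_data = toy_test) %>%
  bind_cols(toy_test)
test_rmse <- rmse(toy_results, truth = SalePrice, estimate = .pred)
test_rmse

# Predicted vs. actual prices; points near the dashed line are predicted well
pred_plot <- ggplot(toy_results, aes(x = SalePrice, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  labs(x = "Actual sale price (USD)", y = "Predicted sale price (USD)")
pred_plot
```

A predicted-versus-actual plot of this kind is one option for visualizing results.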

## 4. Expected outcomes and significance

#### 4.1 Expected findings

// What do you expect to find?

(DRAFT)

We expect that size measures of the home (square footage, number of bedrooms, lot size, etc.) will be the most strongly positively correlated with selling price. Furthermore, we anticipate that recently renovated homes will have higher predicted prices than non-renovated homes. Finally, we expect an element of surprise: we hypothesize that some factors we did not expect to matter will contribute significantly to the selling price of a house.

#### 4.2 Potential impact

(DRAFT)

In our view, there are two potential impacts from this work: 
1. Real estate agents and developers could gain a better understanding of the magnitude of each driver of home selling prices.
2. We can carry key valuation insights into our personal lives for when we one day purchase homes of our own.

#### 4.3 Potential future questions

Our analysis of the factors contributing to the selling price of a home could spark future questions, including:
- TEAM TO FILL IN