## Ingest the Data

This notebook contains the basic commands required to ingest the data for our work. All of these commands were also added to the file `src/load_data-01.r`, so that subsequent notebooks can load the data by running that script.

### Join the Data Sets

Data describing the same instances often arrives from multiple sources. The original Ames, Iowa housing data has been arbitrarily split into three files so that we can practice joining data from different sources.

In [1]:
# Read the three source files; each one carries the shared key column Id.
zoning_df <- read.csv('data/zoning.csv')
listing_df <- read.csv('data/listing.csv')
sale_df <- read.csv('data/sale.csv')
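
Before joining tables from different sources, it can help to confirm that the key column is clean. The following is a minimal sketch on made-up toy data, not part of the original notebook; `check_key`, `a_df`, and `b_df` are hypothetical names introduced only for illustration.

```r
# Toy stand-ins for two source tables; the values are made up.
a_df <- data.frame(Id = c(1, 2, 3), LotArea = c(8450, 9600, 11250))
b_df <- data.frame(Id = c(1, 2, 4), SalePrice = c(208500, 181500, 140000))

# check_key() is a hypothetical helper: TRUE when the key column has
# no duplicates and no missing values.
check_key <- function(df, key = "Id") {
  anyDuplicated(df[[key]]) == 0 && !anyNA(df[[key]])
}

check_key(a_df)
check_key(b_df)

# Keys found in only one table have no partner row in the other table.
only_in_a <- setdiff(a_df$Id, b_df$Id)
only_in_b <- setdiff(b_df$Id, a_df$Id)
only_in_a
only_in_b
```

If both `setdiff` results are empty for the real tables, every row has a partner in the other table and a join on `Id` should preserve the row count.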

In [2]:
head(zoning_df)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle
1,60,RL,65,8450,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story
2,20,RL,80,9600,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story
3,60,RL,68,11250,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story
4,70,RL,60,9550,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story
5,60,RL,84,14260,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story
6,50,RL,85,14115,IR1,Lvl,AllPub,Inside,Gtl,Mitchel,Norm,Norm,1Fam,1.5Fin


In [3]:
head(listing_df)

Id,Street,Alley,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,⋯,WoodDeckSF,OpenPorchSF,EnclosedPorch,ThreeSsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal
1,Pave,,7,5,2003,2003,Gable,CompShg,VinylSd,⋯,0,61,0,0,0,0,,,,0
2,Pave,,6,8,1976,1976,Gable,CompShg,MetalSd,⋯,298,0,0,0,0,0,,,,0
3,Pave,,7,5,2001,2002,Gable,CompShg,VinylSd,⋯,0,42,0,0,0,0,,,,0
4,Pave,,7,5,1915,1970,Gable,CompShg,Wd Sdng,⋯,0,35,272,0,0,0,,,,0
5,Pave,,8,5,2000,2000,Gable,CompShg,VinylSd,⋯,192,84,0,0,0,0,,,,0
6,Pave,,5,5,1993,1995,Gable,CompShg,VinylSd,⋯,40,30,0,320,0,0,,MnPrv,Shed,700


In [4]:
head(sale_df)

Id,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,2,2008,WD,Normal,208500
2,5,2007,WD,Normal,181500
3,9,2008,WD,Normal,223500
4,2,2006,WD,Abnorml,140000
5,12,2008,WD,Normal,250000
6,10,2009,WD,Normal,143000


Here, we join the three data sets with the `merge` function, using the column `Id` as the key. By default, `merge` keeps only the rows whose `Id` appears in both inputs.

In [5]:
housing_df <- merge(zoning_df, listing_df, by = "Id")
housing_df <- merge(housing_df, sale_df, by = "Id")
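
To illustrate how the choice of join type affects the result, here is a small sketch on made-up toy tables whose keys only partly overlap. The names `left_df` and `right_df` are assumptions for this example; the real housing tables may well share every `Id`, in which case all join types return the same rows.

```r
# Toy tables with partly overlapping keys; the values are made up.
left_df  <- data.frame(Id = c(1, 2, 3), MSZoning = c("RL", "RL", "RM"))
right_df <- data.frame(Id = c(1, 2, 4), SalePrice = c(208500, 181500, 140000))

# Default (inner) join: only Ids present in both tables survive.
inner_df <- merge(left_df, right_df, by = "Id")
nrow(inner_df)

# Left join: every row of left_df is kept, unmatched values become NA.
left_join_df <- merge(left_df, right_df, by = "Id", all.x = TRUE)
nrow(left_join_df)

# Full outer join: every Id from either table is kept.
outer_df <- merge(left_df, right_df, by = "Id", all = TRUE)
nrow(outer_df)
```

Comparing `nrow()` before and after a merge is a quick way to notice rows that were silently dropped by an inner join.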

In [6]:
head(housing_df)

Id,MSSubClass,MSZoning,LotFrontage,LotArea,LotShape,LandContour,Utilities,LotConfig,LandSlope,⋯,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
1,60,RL,65,8450,Reg,Lvl,AllPub,Inside,Gtl,⋯,0,,,,0,2,2008,WD,Normal,208500
2,20,RL,80,9600,Reg,Lvl,AllPub,FR2,Gtl,⋯,0,,,,0,5,2007,WD,Normal,181500
3,60,RL,68,11250,IR1,Lvl,AllPub,Inside,Gtl,⋯,0,,,,0,9,2008,WD,Normal,223500
4,70,RL,60,9550,IR1,Lvl,AllPub,Corner,Gtl,⋯,0,,,,0,2,2006,WD,Abnorml,140000
5,60,RL,84,14260,IR1,Lvl,AllPub,FR2,Gtl,⋯,0,,,,0,12,2008,WD,Normal,250000
6,50,RL,85,14115,IR1,Lvl,AllPub,Inside,Gtl,⋯,0,,MnPrv,Shed,700,10,2009,WD,Normal,143000


In [7]:
dim(housing_df)

In [8]:
str(Filter(is.numeric, housing_df))

'data.frame':	1460 obs. of  38 variables:
 $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
 $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
 $ LotFrontage  : num  65 80 68 60 84 85 75 NA 51 50 ...
 $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
 $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
 $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
 $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
 $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
 $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
 $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
 $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
 $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
 $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
 $ FirstFlrSF   : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
 $ SecondFlrSF  : int  854 0 866 756 1053 566 0 983 752 0 ...
 $ LowQualFinSF : int 

In [9]:
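# Use Id as the row name so each house can be looked up by its identifier.
rownames(housing_df) <- housing_df$Id
# Drop the now-redundant Id column so it is not treated as a feature.
housing_df$Id <- NULL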
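
As a small illustration of what moving `Id` into the row names changes, the sketch below uses a made-up toy data frame (`toy_df` is a hypothetical name). Row names are stored as character strings, so indexing with `"102"` selects by identifier, whereas indexing with `2` selects by position.

```r
# Toy stand-in for housing_df; the values are made up.
toy_df <- data.frame(
  Id        = c(101, 102, 103),
  YrSold    = c(2008, 2007, 2008),
  SalePrice = c(208500, 181500, 223500)
)

# Same two steps as in the cell above, applied to the toy data.
rownames(toy_df) <- toy_df$Id
toy_df$Id <- NULL

# Look up a row by its identifier (a character row name).
by_name <- toy_df["102", "SalePrice"]

# Look up a row by its position in the table.
by_position <- toy_df[2, "SalePrice"]

by_name
by_position
names(toy_df)
```

Here the two lookups agree only because the row with identifier 102 happens to sit in the second position.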

### Typecast Categorical Features

Several features are categorical in nature even though they are stored as integer values. We must explicitly cast these features to the `factor` type.
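
The difference this cast makes can be seen on a tiny made-up example. In the sketch below, `toy` and its `code` column are hypothetical and only mimic an integer-coded class label.

```r
# Toy data: a class code stored as integers (values are made up).
toy <- data.frame(code = c(20L, 60L, 20L, 70L, 60L, 20L))

# As an integer, the code enters a model formula as one numeric column.
n_cols_integer <- ncol(model.matrix(~ code, data = toy))

# After the cast, each distinct code becomes its own level.
toy$code <- as.factor(toy$code)
levels(toy$code)   # "20" "60" "70"
table(toy$code)

# As a factor, the formula expands to an intercept plus one indicator
# column per non-reference level.
n_cols_factor <- ncol(model.matrix(~ code, data = toy))

# Caution: as.integer() on a factor returns level positions, not the
# original codes. Convert through character to recover the codes.
as.integer(toy$code)                 # 1 2 1 3 2 1
as.integer(as.character(toy$code))   # 20 60 20 70 60 20
```

The design point is that a factor tells modelling functions the values are labels, so no ordering or spacing between codes such as 20 and 60 is assumed.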

In [10]:
housing_df$MSSubClass <- as.factor(housing_df$MSSubClass)
housing_df$OverallQual <- as.factor(housing_df$OverallQual)
housing_df$OverallCond <- as.factor(housing_df$OverallCond)
housing_df$BsmtFullBath <- as.factor(housing_df$BsmtFullBath)
housing_df$BsmtHalfBath <- as.factor(housing_df$BsmtHalfBath)
housing_df$FullBath <- as.factor(housing_df$FullBath)
housing_df$HalfBath <- as.factor(housing_df$HalfBath)
housing_df$BedroomAbvGr <- as.factor(housing_df$BedroomAbvGr)
housing_df$KitchenAbvGr <- as.factor(housing_df$KitchenAbvGr)
housing_df$TotRmsAbvGrd <- as.factor(housing_df$TotRmsAbvGrd)
housing_df$Fireplaces <- as.factor(housing_df$Fireplaces)
housing_df$GarageCars <- as.factor(housing_df$GarageCars)
housing_df$MoSold <- as.factor(housing_df$MoSold)
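
The same casts can be written more compactly by listing the column names once and converting them in a single step. This is only a sketch of an equivalent pattern on a made-up toy data frame; `toy_df` and `categorical_cols` are hypothetical names, and the list would need to be extended to all thirteen columns above.

```r
# Toy stand-in for housing_df; the values are made up.
toy_df <- data.frame(
  MSSubClass  = c(60L, 20L, 60L),
  OverallQual = c(7L, 6L, 7L),
  LotArea     = c(8450L, 9600L, 11250L)
)

# Columns to treat as categorical (a subset of the list used above).
categorical_cols <- c("MSSubClass", "OverallQual")

# lapply() converts each selected column, and the resulting list of
# factors is assigned back to the same columns.
toy_df[categorical_cols] <- lapply(toy_df[categorical_cols], as.factor)

# Inspect the resulting column classes.
sapply(toy_df, class)
```

Keeping the names in one vector makes it easier to review which features are being treated as categorical and to reuse the list elsewhere.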