-
Notifications
You must be signed in to change notification settings - Fork 0
/
Individual R Project (MartinezAndrew).Rmd
466 lines (341 loc) · 18.9 KB
/
Individual R Project (MartinezAndrew).Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
---
title: "Individual R Project (MartinezAndrew)"
author: "Andrew Martinez"
date: "5/8/2019"
output: html_document
---
This analysis utilizes the data found in a Kaggle competition where competitors seek to predict housing prices in King County, WA, USA (https://bit.ly/2lRv48E). The data set provides fields including the number of bedrooms, number of bathrooms, number of floors, square footage of the living room, square footage of the overall lot, among others. The time frame this data set covers are the years 2014 through 2015. Therefore, this analysis will utilize a number of machine learning models and techniques in order to achieve the best possible models (in this instance measured using MAPE, MAE, MSE, and R-Squared with emphasis on the MAPE score).
```{r, include=FALSE}
#Check if necessary packages are installed and loaded
if(!"pacman" %in% installed.packages()) {install.packages("pacman")}
pacman::p_load("ggplot2","data.table","lubridate","bit64","rpart","partykit","dplyr","leaflet","chron","proj4","ranger", "ggplot2", "ggthemes", "tidyr", "dplyr", "xts", "glmnet", "xgboost", "tuneRanger", "mlr", "Hmisc", "gridExtra", "breakDown", "caret")
source('code/regression_metrics.R')
source('code/f_partition.R')
source('code/graph_helpers.R')
source('code/split_and_train.R')
# Source for tuneRanger library
# @ARTICLE{tuneRanger,
# author = {Probst, Philipp and Wright, Marvin and Boulesteix, Anne-Laure},
# title = {Hyperparameters and Tuning Strategies for Random Forest},
# journal = {ArXiv preprint arXiv:1804.03515},
# archivePrefix = "arXiv",
# eprint = {1804.03515},
# primaryClass = "stat.ML",
# keywords = {Statistics - Machine Learning, Computer Science - Learning},
# year = 2018,
# url = {https://arxiv.org/abs/1804.03515}
# }
```
## Data Import
```{r}
raw_train_df <- fread('Data/house_price_train.csv', stringsAsFactors = F)
raw_test_df <- fread('Data/house_price_test.csv', stringsAsFactors = F)
str(raw_train_df)
summary(raw_train_df)
head(raw_train_df)
#Check for null values
sum(is.na(raw_train_df))
sum(is.na(raw_test_df))
```
## Initial Cleaning
As time series modeling will not be utilized in this analysis, the day, month, and year of each purchase will be individually parsed out rather than using the date time field.
```{r}
clean_train_df <- raw_train_df
clean_test_df <- raw_test_df
# Train Data set
clean_train_df$date <- as.Date(raw_train_df$date, "%m/%d/%Y")
clean_train_df$year <- year(clean_train_df[,clean_train_df$date])
clean_train_df$month <- month(clean_train_df[,clean_train_df$date])
clean_train_df$day <- day(clean_train_df[,clean_train_df$date])
clean_train_df$day_of_week <- as.POSIXlt(as.Date(clean_train_df$date, "%m/%d/%Y"))$wday
# Test Data Set
clean_test_df$date <- as.Date(clean_test_df$date, "%m/%d/%Y")
clean_test_df$year <- year(clean_test_df[,clean_test_df$date])
clean_test_df$month <- month(clean_test_df[,clean_test_df$date])
clean_test_df$day <- day(clean_test_df[,clean_test_df$date])
clean_test_df$day_of_week <- as.POSIXlt(as.Date(clean_test_df$date, "%m/%d/%Y"))$wday
```
## Exploratory Analysis
### Histograms
In order readable visualizations, a random selection of 1000 houses will be taken for all subsequent graphs/charts. Based on the histogram outputs, it appears that at least some of the variables are not normally distributed.
```{r}
set.seed(12345)
sqft_hist <- c('sqft_living', 'sqft_lot', 'sqft_above', 'sqft_living15', 'sqft_lot15')
stats_hist <- c('bedrooms', 'floors','condition', 'grade')
#Randomly Sample 1000 values
df.1000 <- clean_train_df[sample(nrow(clean_train_df), 1000),]
multiple_hist(df.1000, sqft_hist)
multiple_hist(df.1000, stats_hist)
single_hist(df.1000$yr_built, "Year Built")
single_hist(df.1000$yr_renovated, "Year Renovated")
single_hist(df.1000$price, "Price")
single_hist(df.1000$sqft_basement, "Basement Area")
```
### Scatterplots
The following scatter plots seek to visual describe the relationship between the numerical variables and the target price variable. From this sample, it can be seen that the area of the living room and number of bathrooms in particular have a strong positive relationship (an intuitive observation from a business perspective as they factors are often explicitly taken into account with regards to pricing).
```{r}
#Create dataframe with only numerical variables
numerical_var <- c('bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'yr_built', 'sqft_living15', 'sqft_lot15', 'price')
scatter_df <- clean_train_df[,..numerical_var]
var_list <- names(scatter_df)[1:(length(scatter_df)-1)]
#Create list of ggplots of each numerical variable against price
plot_list <- lapply(var_list, gg_scatter, df = scatter_df)
do.call(grid.arrange, plot_list)
```
### Geographic Analysis
The following map allows for the visualization of the where houses of different price bands (based on quartiles) are located.
```{r}
#Bin into quartiles for data visualization
df.1000$bin <- factor(Hmisc::cut2(df.1000$price, g = 4), labels = c(1:4))
colorsmap <- colors()[c(490,24,100,657)]
map <- leaflet(data.frame(df.1000)) %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addCircleMarkers(lng=~long, lat=~lat,
popup= paste0("Number of Bedrooms: ", df.1000$bedrooms, sep="\n",
"Number of Bathrooms: ", df.1000$bathrooms, sep="\n",
"Living Room Size: " , df.1000$sqft_living, sep="\n",
"Lot Size: ", df.1000$sqft_lot, sep="\n",
"Number of Floors: ", df.1000$floors, sep="\n",
"Current Condition: ", df.1000$condition),
color= ~colorsmap,
group= unique(df.1000$bin)) #%>%
# This seems to be no longer supported
# addLegend(position = 'bottomright', colors = colorsmap, labels = unique(df.1000$bin))
# addLegend(map, position = 'bottomright', colors = colorsmap, labels = unique(df.1000$bin))
map
```
### Outlier Analysis
#### Univariate Analysis
The following charts look at the distributions of each numerical value to visually see outliers. In this sample there are clear outliers for all variables other than the number of floors and the year the home was built.
```{r}
for (var in numerical_var[1:(length(numerical_var)-1)]){
univariate_outlier(clean_test_df, var)
}
```
#### Bivariate Analysis
The following charts look at the distributions of each numerical value to visually see outliers on a monthly and daily basis to help understand any temporal patterns. Once again, outliers occur in all variables other than the number of floors and the year built. It is interesting to note that relatively speaking, a greater proportion of outliers are occur on weekdays.
```{r}
for (var in numerical_var[1:(length(numerical_var)-1)]){
bivariate_outlier(clean_test_df, var)
}
```
### Clip Outliers
As the above charts demonstrate that outliers are present, they will be clipped if they lie outside a 95% distribution band.
```{r}
clipped_outliers <- lapply(clean_train_df[,..numerical_var], clip, lower_bound = .025, upper_bound = .975)
clipped_outliers_df <- as.data.table(matrix(unlist(clipped_outliers), nrow=length(unlist(clipped_outliers[1]))))
clean_train_df[,numerical_var] <- clipped_outliers_df
head(clean_train_df)
```
## Correlation Heatmap
```{r}
heatmap_data<-clean_train_df[, !c('id','date')]
d3heatmap::d3heatmap(cor(heatmap_data))
```
## Baseline Model Comparisons
This analysis will initially compare the results of Lasso Linear Regression, Ranger's implementation of Random Forest, and XG Boost to determine which algorithm will be used going forward.
### Lasso Linear Regression
```{r}
split_clean_train_df <- f_partition(clean_train_df, test_proportion = 0.2, seed = 123456)
split_clean_train_df$train$date = NULL
split_clean_train_df$test$date = NULL
glmnet_cv<-cv.glmnet(x = data.matrix(split_clean_train_df$train[, !c('id','price')]), nfolds = 5,
y = split_clean_train_df$train[['price']],
alpha=1, family = 'gaussian', standardize = T)
plot.cv.glmnet(glmnet_cv)
glmnet_cv$lambda.min
glmnet_0<-glmnet(x = data.matrix(split_clean_train_df$train[, !c('id','price')]),
y = split_clean_train_df$train[['price']],
family = 'gaussian',
alpha=1, lambda = glmnet_cv$lambda.min)
print(glmnet_0)
glmnet_0$beta
test_glmnet<-predict(glmnet_0, newx = data.matrix(split_clean_train_df$test[,!c('id','price')]))
df_pred<-split_clean_train_df$test[, .(id=1:.N, price, test_glmnet)]
str(df_pred)
rmse_glmnet<-rmse(real=split_clean_train_df$test$price, predicted = test_glmnet)
mae_glmnet<-mae(real=split_clean_train_df$test$price, predicted = test_glmnet)
mape_glmnet<-mape(real=split_clean_train_df$test$price, predicted = test_glmnet)
mape_glmnet
rsq_glment<-custom_rsq(real=split_clean_train_df$test$price, predicted = test_glmnet)
rsq_glment
```
## Ranger Random Forest
```{r}
baseline_rf <- ranger(formula = as.formula(price~.), data=split_clean_train_df$train[,!c('id')], importance = 'impurity')
print(baseline_rf)
test_rf<-predict(baseline_rf, data = split_clean_train_df$test, type='response')$predictions
df_pred<-cbind(df_pred, test_rf)
str(df_pred)
rmse_rf<-rmse(real=split_clean_train_df$test$price, predicted = test_rf)
mae_rf<-mae(real=split_clean_train_df$test$price, predicted = test_rf)
mape_rf<-mape(real=split_clean_train_df$test$price, predicted = test_rf)
mape_rf
rsq_rf<-custom_rsq(real=split_clean_train_df$test$price, predicted = test_rf)
rsq_rf
```
## XG Boost
```{r}
xgb_reg_0<-xgboost(booster='gblinear',
data=data.matrix(split_clean_train_df$train[, !c('id','price'), with=F]),
label=split_clean_train_df$train$price,
nrounds = 100,
objective='reg:linear')
print(xgb_reg_0)
test_xgb<-predict(xgb_reg_0, newdata = data.matrix(split_clean_train_df$test[, !c('id','price'), with=F]),
type='response')
df_pred<-cbind(df_pred, test_xgb)
str(df_pred)
rmse_xgb<-rmse(real=split_clean_train_df$test$price, predicted = test_xgb)
mae_xgb<-mae(real=split_clean_train_df$test$price, predicted = test_xgb)
mape_xgb<-mape(real=split_clean_train_df$test$price, predicted = test_xgb)
mape_xgb
rsq_xgb<-custom_rsq(real=split_clean_train_df$test$price, predicted = test_xgb)
rsq_xgb
```
## Model Comparison
As can be seen from the following charts outlining each algorithm's prediction metrics, Random Forest proved to have superior results when compared to the other two and will be used for subsequent feature engineering and tuning.
```{r}
metrics_plot(df_pred, c('glmnet','rf','xgb_reg'), verbose = T)
```
## Feature Engineering
A number of features will be created in the hopes their inclusion into the model will improve the overall prediction abilities. The features created include 1) Weekday/Weekend flag, 2) Holiday flag, 3) Renovation flag (defined as when the 2015 area of either the lot or living room is different from the original area), 4) Missing Renovation Year flag (as the presence of a renovation year should correspond to a positive renovation flag), and 5) House Age. It was found that the inclusion of the first, third, and fourth features actually improved model performance as seen from the below graphs.
### 1. Weekday/Weekend
```{r}
df_pipeline_pred<-split_clean_train_df$test[, .(id=1:.N, price, test_rf)]
colnames(df_pipeline_pred) <-c('id','price','baseline')
fe_train_df1 <- clean_train_df
fe_test_df1 <- clean_test_df
fe_train_df1$weekend <-as.logical(is.weekend(clean_train_df$date))
fe_test_df1$weekend <-as.logical(is.weekend(clean_test_df$date))
fe_train_df1$date = NULL
fe_output_1 <- split_and_train(fe_train_df1, df_pipeline_pred)
metrics_plot(fe_output_1[[3]], c('baseline','fe1'), verbose = T)
```
### 2. Holiday
```{r}
fe_train_df2 <- clean_train_df
fe_test_df2 <- clean_test_df
fe_train_df2$holiday <-as.logical(is.holiday(clean_train_df$date))
fe_test_df2$holiday <-as.logical(is.holiday(clean_test_df$date))
fe_train_df2$date = NULL
fe_output_2 <- split_and_train(fe_train_df2, fe_output_1[[3]])
metrics_plot(fe_output_2[[3]], c('baseline','fe1','fe2'), verbose = T)
clean_test_df$date = NULL
```
### 3. Renovation Flag
```{r}
fe_train_df3 <- clean_train_df
fe_test_df3 <- clean_test_df
fe_train_df3$renovated <- ifelse(((fe_train_df3$sqft_living != fe_train_df3$sqft_living15) |
(fe_train_df3$sqft_lot != fe_train_df3$sqft_lot15)), 1, 0)
fe_test_df3$rennovated <- ifelse(((fe_test_df3$sqft_living != fe_test_df3$sqft_living15) |
(fe_test_df3$sqft_lot != fe_test_df3$sqft_lot15)), 1, 0)
fe_output_3 <- split_and_train(fe_train_df3, fe_output_2[[3]])
metrics_plot(fe_output_3[[3]], c('baseline','fe1','fe2','fe3'), verbose = T)
```
### 4. Missing Renovation Year
```{r}
fe_train_df4 <- fe_train_df3
fe_test_df4 <- fe_test_df3
fe_train_df4$missing_ren_year <- ifelse(((fe_train_df4$yr_renovated == 0) & (fe_train_df4$renovated == T)), 1, 0)
fe_test_df4$missing_ren_year <- ifelse(((fe_test_df4$yr_renovated == 0) & (fe_test_df4$renovated == T)), 1, 0)
fe_output_4 <- split_and_train(fe_train_df4, fe_output_3[[3]])
metrics_plot(fe_output_4[[3]], c('baseline','fe1','fe2','fe3','fe4'), verbose = T)
```
### 5. House Age
```{r}
fe_train_df5 <- clean_train_df
fe_test_df5 <- clean_test_df
fe_train_df5$house_age <- year(Sys.Date()) - fe_train_df5$yr_built
fe_test_df5$house_age <- year(Sys.Date()) - fe_test_df5$yr_built
fe_output_5 <- split_and_train(fe_train_df5, fe_output_4[[3]])
metrics_plot(fe_output_5[[3]], c('baseline','fe1','fe2','fe3','fe4','fe5'), verbose = T)
```
#### Combine Best Features
A combination of the first, third, and fourth engineered features are combined to train on the next model iteration. This combined model is found to be superior to any of the other models separately and therefore these features will be included in the tuning phase.
```{r}
fe_train_df_final <- fe_train_df1
fe_test_df_final<- fe_test_df1
fe_train_df_final$renovated <- ifelse(((fe_train_df_final$sqft_living != fe_train_df_final$sqft_living15) |
(fe_train_df_final$sqft_lot != fe_train_df_final$sqft_lot15)), 1, 0)
fe_test_df_final$renovated <- ifelse(((fe_test_df_final$sqft_living != fe_test_df_final$sqft_living15) |
(fe_test_df_final$sqft_lot != fe_test_df_final$sqft_lot15)), 1, 0)
fe_train_df_final$missing_ren_year <- ifelse(((fe_train_df_final$yr_renovated == 0)
& (fe_train_df_final$renovated == T)), 1, 0)
fe_test_df_final$missing_ren_year <- ifelse(((fe_test_df_final$yr_renovated == 0)
& (fe_test_df_final$renovated == T)), 1, 0)
fe_output_final <- split_and_train(fe_train_df_final, fe_output_5[[3]])
metrics_plot(fe_output_final[[3]], c('baseline','fe1','fe2','fe3','fe4','fe5','final_fe'), verbose = T)
```
## Hyperparameter Tuning
Per the documentation, the TuneRanger package seeks to find the optimal minimum node size, sample fraction, and mtry for a given Ranger model. Rather than using the standard cross-validation folds, out-of-bag (OOB) predictions are used for faster performance time. This final tuned model further improves the performance from previous iterations with a MAPE on the holdout of approximately **0.1152482**.
```{r}
# test_rf_tuned <- csrf(
# formula = as.formula(price~.),
# training_data = split_clean_train_df$train[,!c('id')],
# test_data = split_clean_train_df$test[,!c('id')],
# params1 = list(importance = 'impurity'),
# params2 = list(num.trees = 50)
# )
##################################################### TESTING ######################################################
####################################################################################################################
####################################################################################################################
final_train_df <- fe_output_final[[1]]
final_test_df <- fe_test_df_final
#Need to convert to integers as task doesn't support categoricals
#Train/Test Split
final_train_df$train$weekend <- ifelse((final_train_df$train$weekend), 1, 0)
final_train_df$test$weekend <- ifelse((final_train_df$test$weekend), 1, 0)
#Validation Split
final_test_df$weekend <- ifelse((final_test_df$weekend), 1, 0)
task = makeRegrTask(data = final_train_df$train[,!c('id')], target = "price")
# Estimate runtime
estimateTimeTuneRanger(task)
# Tuning
res = tuneRanger(task, num.trees = 500, num.threads = 2, iters = 50, save.file.path = NULL)
# Model with the new tuned hyperparameters
res$model
# Prediction
final <- predict(res$model, newdata = final_train_df$test[,!c('id')])$data$response
df_pipeline_final<-cbind(fe_output_final[[3]], final)
metrics_plot(df_pipeline_final, c('baseline','fe1','fe2','fe3','fe4','fe5','final_fe','final_model'), verbose = T)
```
## Retrain Model On Entire Data Set and Predict on Test Set
The following sections retrain the model using the feature engineering and optimal hyperparamters from previous sections on the entire training set. This retrained model is then used to make predictions on the validation set, which is finally prepared for the final CSV format.
```{r}
final_total_train<- rbind(final_train_df$train, final_train_df$test)
#Running this was computationally expensive and ultimately unsuccessful
# fit_control <- trainControl(method = "cv", number = 3, verboseIter = TRUE, search = "random")
# final_rf <- train(as.factor(price) ~ ., data = final_total_train[,!c('id')],
# method = "ranger",
# trControl = fit_control)
final_rf <- ranger(formula = as.formula(price~.), data=final_total_train[,!c('id')],
importance = 'impurity',
mtry = res$recommended.pars$mtry,
min.node.size = res$recommended.pars$min.node.size,
sample.fraction = res$recommended.pars$sample.fraction)
print(final_rf)
final_test_rf<-predict(final_rf, data = final_test_df, type='response')$predictions
prediction<-clean_test_df[, .(id=id,final_test_rf)]
head(prediction)
```
## Variable Importance
This chart displays the variable importance sorted by node impurity (i.e. the variation generated when observations reach that variable). Many of the most important factors influencing house pricing are intuitive (i.e. the size/area has a clear positive relationship with price).
```{r}
importance_df <- data.frame(final_rf$variable.importance)
setDT(importance_df, keep.rownames = TRUE)[]
colnames(importance_df) <- c('variable', 'importance')
ggplot(importance_df, aes(x=reorder(variable,importance), y=importance, fill=importance)) +
geom_bar(stat="identity", position="dodge")+ coord_flip()+
ylab("Variable Importance")+
xlab("")+
ggtitle("Information Value Summary")+
guides(fill=F)+
scale_fill_gradient(low="red", high="blue")
```
## CSV Output
```{r}
colnames(prediction) <- c('id', 'target')
write.csv(prediction, file = "output.csv")
```