Individual R Project (MartinezAndrew).Rmd

---
title: "Individual R Project (MartinezAndrew)"
author: "Andrew Martinez"
date: "5/8/2019"
output: html_document
---

This analysis utilizes the data found in a Kaggle competition where competitors seek to predict housing prices in King County, WA, USA (https://bit.ly/2lRv48E). The data set provides fields including the number of bedrooms, number of bathrooms, number of floors, square footage of the living room, square footage of the overall lot, among others. The time frame this data set covers are the years 2014 through 2015. Therefore, this analysis will utilize a number of machine learning models and techniques in order to achieve the best possible models (in this instance measured using MAPE, MAE, MSE, and R-Squared with emphasis on the MAPE score).

```{r, include=FALSE}
#Check if necessary packages are installed and loaded 
if(!"pacman" %in% installed.packages()) {install.packages("pacman")}
pacman::p_load("ggplot2","data.table","lubridate","bit64","rpart","partykit","dplyr","leaflet","chron","proj4","ranger", "ggplot2", "ggthemes", "tidyr", "dplyr", "xts", "glmnet", "xgboost", "tuneRanger", "mlr", "Hmisc", "gridExtra", "breakDown", "caret")

source('code/regression_metrics.R')
source('code/f_partition.R')
source('code/graph_helpers.R')
source('code/split_and_train.R')

# Source for tuneRanger library
# @ARTICLE{tuneRanger,
#   author = {Probst, Philipp and Wright, Marvin and Boulesteix, Anne-Laure}, 
#   title = {Hyperparameters and Tuning Strategies for Random Forest},
#   journal = {ArXiv preprint arXiv:1804.03515},
#   archivePrefix = "arXiv",
#   eprint = {1804.03515},
#   primaryClass = "stat.ML",
#   keywords = {Statistics - Machine Learning, Computer Science - Learning},
#   year = 2018,
#   url = {https://arxiv.org/abs/1804.03515}
# }
```

## Data Import

```{r}
raw_train_df <- fread('Data/house_price_train.csv', stringsAsFactors = F)
raw_test_df <- fread('Data/house_price_test.csv', stringsAsFactors = F)

str(raw_train_df)
summary(raw_train_df)
head(raw_train_df)

#Check for null values
sum(is.na(raw_train_df))
sum(is.na(raw_test_df))
```

## Initial Cleaning

As time series modeling will not be utilized in this analysis, the day, month, and year of each purchase will be individually parsed out rather than using the date time field.

```{r}
clean_train_df <- raw_train_df
clean_test_df <- raw_test_df

# Train Data set
clean_train_df$date <- as.Date(raw_train_df$date, "%m/%d/%Y")

clean_train_df$year <- year(clean_train_df[,clean_train_df$date])
clean_train_df$month <- month(clean_train_df[,clean_train_df$date])
clean_train_df$day <- day(clean_train_df[,clean_train_df$date])
clean_train_df$day_of_week <- as.POSIXlt(as.Date(clean_train_df$date, "%m/%d/%Y"))$wday

# Test Data Set
clean_test_df$date <- as.Date(clean_test_df$date, "%m/%d/%Y")

clean_test_df$year <- year(clean_test_df[,clean_test_df$date])
clean_test_df$month <- month(clean_test_df[,clean_test_df$date])
clean_test_df$day <- day(clean_test_df[,clean_test_df$date])
clean_test_df$day_of_week <- as.POSIXlt(as.Date(clean_test_df$date, "%m/%d/%Y"))$wday
```

## Exploratory Analysis
### Histograms

In order readable visualizations, a random selection of 1000 houses will be taken for all subsequent graphs/charts. Based on the histogram outputs, it appears that at least some of the variables are not normally distributed.

```{r}
set.seed(12345)

sqft_hist <- c('sqft_living', 'sqft_lot', 'sqft_above', 'sqft_living15', 'sqft_lot15')
stats_hist <- c('bedrooms', 'floors','condition', 'grade')

#Randomly Sample 1000 values
df.1000 <- clean_train_df[sample(nrow(clean_train_df), 1000),]

multiple_hist(df.1000, sqft_hist)
multiple_hist(df.1000, stats_hist)

single_hist(df.1000$yr_built, "Year Built")
single_hist(df.1000$yr_renovated, "Year Renovated")
single_hist(df.1000$price, "Price")
single_hist(df.1000$sqft_basement, "Basement Area")

```

### Scatterplots

The following scatter plots seek to visual describe the relationship between the numerical variables and the target price variable. From this sample, it can be seen that the area of the living room and number of bathrooms in particular have a strong positive relationship (an intuitive observation from a business perspective as they factors are often explicitly taken into account with regards to pricing).

```{r}
#Create dataframe with only numerical variables
numerical_var <- c('bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'sqft_above', 'sqft_basement', 'yr_built', 'sqft_living15', 'sqft_lot15', 'price')
scatter_df <- clean_train_df[,..numerical_var]
var_list <- names(scatter_df)[1:(length(scatter_df)-1)]

#Create list of ggplots of each numerical variable against price
plot_list <- lapply(var_list, gg_scatter, df = scatter_df)
do.call(grid.arrange, plot_list)

```

### Geographic Analysis

The following map allows for the visualization of the where houses of different price bands (based on quartiles) are located.  

```{r}
#Bin into quartiles for data visualization
df.1000$bin <- factor(Hmisc::cut2(df.1000$price, g = 4), labels = c(1:4))

colorsmap <- colors()[c(490,24,100,657)]
map <- leaflet(data.frame(df.1000)) %>%
  addTiles() %>%  # Add default OpenStreetMap map tiles
  addCircleMarkers(lng=~long, lat=~lat,
                   popup= paste0("Number of Bedrooms: ", df.1000$bedrooms, sep="\n",
                                 "Number of Bathrooms: ", df.1000$bathrooms, sep="\n",
                                 "Living Room Size: " , df.1000$sqft_living, sep="\n",
                                 "Lot Size: ", df.1000$sqft_lot, sep="\n",
                                 "Number of Floors: ", df.1000$floors, sep="\n",
                                 "Current Condition: ", df.1000$condition),
                   color= ~colorsmap,
                   group= unique(df.1000$bin)) #%>% 
# This seems to be no longer supported
  # addLegend(position = 'bottomright', colors = colorsmap, labels = unique(df.1000$bin))

# addLegend(map, position = 'bottomright', colors = colorsmap, labels = unique(df.1000$bin))

map
```

### Outlier Analysis
#### Univariate Analysis
The following charts look at the distributions of each numerical value to visually see outliers. In this sample there are clear outliers for all variables other than the number of floors and the year the home was built.

```{r}
for (var in numerical_var[1:(length(numerical_var)-1)]){
  univariate_outlier(clean_test_df, var)
}
```

#### Bivariate Analysis
The following charts look at the distributions of each numerical value to visually see outliers on a monthly and daily basis to help understand any temporal patterns. Once again, outliers occur in all variables other than the number of floors and the year built. It is interesting to note that relatively speaking, a greater proportion of outliers are occur on weekdays.

```{r}
for (var in numerical_var[1:(length(numerical_var)-1)]){
  bivariate_outlier(clean_test_df, var)
}
```

### Clip Outliers

As the above charts demonstrate that outliers are present, they will be clipped if they lie outside a 95% distribution band.

```{r}
clipped_outliers <- lapply(clean_train_df[,..numerical_var], clip, lower_bound = .025, upper_bound = .975)
clipped_outliers_df <- as.data.table(matrix(unlist(clipped_outliers), nrow=length(unlist(clipped_outliers[1]))))
clean_train_df[,numerical_var] <- clipped_outliers_df

head(clean_train_df)
```

## Correlation Heatmap

```{r}
heatmap_data<-clean_train_df[, !c('id','date')]
d3heatmap::d3heatmap(cor(heatmap_data))
```

## Baseline Model Comparisons

This analysis will initially compare the results of Lasso Linear Regression, Ranger's implementation of Random Forest, and XG Boost to determine which algorithm will be used going forward.

### Lasso Linear Regression

```{r}
split_clean_train_df <- f_partition(clean_train_df, test_proportion = 0.2, seed = 123456)
split_clean_train_df$train$date = NULL
split_clean_train_df$test$date = NULL

glmnet_cv<-cv.glmnet(x = data.matrix(split_clean_train_df$train[, !c('id','price')]), nfolds = 5, 
                     y = split_clean_train_df$train[['price']],
                     alpha=1, family = 'gaussian', standardize = T)
plot.cv.glmnet(glmnet_cv)

glmnet_cv$lambda.min

glmnet_0<-glmnet(x = data.matrix(split_clean_train_df$train[, !c('id','price')]), 
                 y = split_clean_train_df$train[['price']],
                 family = 'gaussian',
                 alpha=1, lambda = glmnet_cv$lambda.min)

print(glmnet_0)
glmnet_0$beta

test_glmnet<-predict(glmnet_0, newx = data.matrix(split_clean_train_df$test[,!c('id','price')]))

df_pred<-split_clean_train_df$test[, .(id=1:.N, price, test_glmnet)]
str(df_pred)

rmse_glmnet<-rmse(real=split_clean_train_df$test$price, predicted = test_glmnet)
mae_glmnet<-mae(real=split_clean_train_df$test$price, predicted = test_glmnet)
mape_glmnet<-mape(real=split_clean_train_df$test$price, predicted = test_glmnet)
mape_glmnet

rsq_glment<-custom_rsq(real=split_clean_train_df$test$price, predicted = test_glmnet)
rsq_glment
```

## Ranger Random Forest

```{r}
baseline_rf <- ranger(formula = as.formula(price~.), data=split_clean_train_df$train[,!c('id')], importance = 'impurity')
print(baseline_rf)

test_rf<-predict(baseline_rf, data = split_clean_train_df$test, type='response')$predictions

df_pred<-cbind(df_pred, test_rf)
str(df_pred)

rmse_rf<-rmse(real=split_clean_train_df$test$price, predicted = test_rf)
mae_rf<-mae(real=split_clean_train_df$test$price, predicted = test_rf)
mape_rf<-mape(real=split_clean_train_df$test$price, predicted = test_rf)
mape_rf

rsq_rf<-custom_rsq(real=split_clean_train_df$test$price, predicted = test_rf)
rsq_rf
```

## XG Boost

```{r}
xgb_reg_0<-xgboost(booster='gblinear',
                   data=data.matrix(split_clean_train_df$train[, !c('id','price'), with=F]),
                   label=split_clean_train_df$train$price,
                   nrounds = 100,
                   objective='reg:linear')
print(xgb_reg_0)

test_xgb<-predict(xgb_reg_0, newdata = data.matrix(split_clean_train_df$test[, !c('id','price'), with=F]), 
                  type='response')

df_pred<-cbind(df_pred, test_xgb)
str(df_pred)

rmse_xgb<-rmse(real=split_clean_train_df$test$price, predicted = test_xgb)
mae_xgb<-mae(real=split_clean_train_df$test$price, predicted = test_xgb)
mape_xgb<-mape(real=split_clean_train_df$test$price, predicted = test_xgb)
mape_xgb

rsq_xgb<-custom_rsq(real=split_clean_train_df$test$price, predicted = test_xgb)
rsq_xgb
```

## Model Comparison

As can be seen from the following charts outlining each algorithm's prediction metrics, Random Forest proved to have superior results when compared to the other two and will be used for subsequent feature engineering and tuning.

```{r}
metrics_plot(df_pred, c('glmnet','rf','xgb_reg'), verbose = T)
```

## Feature Engineering

A number of features will be created in the hopes their inclusion into the model will improve the overall prediction abilities. The features created include 1) Weekday/Weekend flag, 2) Holiday flag, 3) Renovation flag (defined as when the 2015 area of either the lot or living room is different from the original area), 4) Missing Renovation Year flag (as the presence of a renovation year should correspond to a positive renovation flag), and 5) House Age. It was found that the inclusion of the first, third, and fourth features actually improved model performance as seen from the below graphs. 

### 1. Weekday/Weekend

```{r}
df_pipeline_pred<-split_clean_train_df$test[, .(id=1:.N, price, test_rf)]
colnames(df_pipeline_pred) <-c('id','price','baseline')

fe_train_df1 <- clean_train_df
fe_test_df1 <- clean_test_df

fe_train_df1$weekend <-as.logical(is.weekend(clean_train_df$date))
fe_test_df1$weekend <-as.logical(is.weekend(clean_test_df$date))
fe_train_df1$date = NULL

fe_output_1 <- split_and_train(fe_train_df1, df_pipeline_pred)
metrics_plot(fe_output_1[[3]], c('baseline','fe1'), verbose = T)
```

### 2. Holiday

```{r}
fe_train_df2 <- clean_train_df
fe_test_df2 <- clean_test_df

fe_train_df2$holiday <-as.logical(is.holiday(clean_train_df$date))
fe_test_df2$holiday <-as.logical(is.holiday(clean_test_df$date))
fe_train_df2$date = NULL

fe_output_2 <- split_and_train(fe_train_df2, fe_output_1[[3]])
metrics_plot(fe_output_2[[3]], c('baseline','fe1','fe2'), verbose = T)

clean_test_df$date = NULL
```

### 3. Renovation Flag

```{r}
fe_train_df3 <- clean_train_df
fe_test_df3 <- clean_test_df

fe_train_df3$renovated <- ifelse(((fe_train_df3$sqft_living != fe_train_df3$sqft_living15) | 
                                     (fe_train_df3$sqft_lot != fe_train_df3$sqft_lot15)), 1, 0)
fe_test_df3$rennovated <- ifelse(((fe_test_df3$sqft_living != fe_test_df3$sqft_living15) | 
                                     (fe_test_df3$sqft_lot != fe_test_df3$sqft_lot15)), 1, 0)

fe_output_3 <- split_and_train(fe_train_df3, fe_output_2[[3]])
metrics_plot(fe_output_3[[3]], c('baseline','fe1','fe2','fe3'), verbose = T)
```

### 4. Missing Renovation Year

```{r}
fe_train_df4 <- fe_train_df3
fe_test_df4 <- fe_test_df3

fe_train_df4$missing_ren_year <- ifelse(((fe_train_df4$yr_renovated == 0) & (fe_train_df4$renovated == T)), 1, 0)
fe_test_df4$missing_ren_year <- ifelse(((fe_test_df4$yr_renovated == 0) & (fe_test_df4$renovated == T)), 1, 0)

fe_output_4 <- split_and_train(fe_train_df4, fe_output_3[[3]])
metrics_plot(fe_output_4[[3]], c('baseline','fe1','fe2','fe3','fe4'), verbose = T)
```

### 5. House Age

```{r}
fe_train_df5 <- clean_train_df
fe_test_df5 <- clean_test_df

fe_train_df5$house_age <- year(Sys.Date()) - fe_train_df5$yr_built
fe_test_df5$house_age <- year(Sys.Date()) - fe_test_df5$yr_built

fe_output_5 <- split_and_train(fe_train_df5, fe_output_4[[3]])
metrics_plot(fe_output_5[[3]], c('baseline','fe1','fe2','fe3','fe4','fe5'), verbose = T)
```

#### Combine Best Features

A combination of the first, third, and fourth engineered features are combined to train on the next model iteration. This combined model is found to be superior to any of the other models separately and therefore these features will be included in the tuning phase. 

```{r}
fe_train_df_final <- fe_train_df1
fe_test_df_final<- fe_test_df1

fe_train_df_final$renovated <- ifelse(((fe_train_df_final$sqft_living != fe_train_df_final$sqft_living15) | 
                                     (fe_train_df_final$sqft_lot != fe_train_df_final$sqft_lot15)), 1, 0)
fe_test_df_final$renovated <- ifelse(((fe_test_df_final$sqft_living != fe_test_df_final$sqft_living15) | 
                                     (fe_test_df_final$sqft_lot != fe_test_df_final$sqft_lot15)), 1, 0)

fe_train_df_final$missing_ren_year <- ifelse(((fe_train_df_final$yr_renovated == 0) 
                                              & (fe_train_df_final$renovated == T)), 1, 0)
fe_test_df_final$missing_ren_year <- ifelse(((fe_test_df_final$yr_renovated == 0) 
                                             & (fe_test_df_final$renovated == T)), 1, 0)

fe_output_final <- split_and_train(fe_train_df_final, fe_output_5[[3]])
metrics_plot(fe_output_final[[3]], c('baseline','fe1','fe2','fe3','fe4','fe5','final_fe'), verbose = T)
```

## Hyperparameter Tuning

Per the documentation, the TuneRanger package seeks to find the optimal minimum node size, sample fraction, and mtry for a given Ranger model. Rather than using the standard cross-validation folds, out-of-bag (OOB) predictions are used for faster performance time. This final tuned model further improves the performance from previous iterations with a MAPE on the holdout of approximately **0.1152482**.

```{r}
# test_rf_tuned <- csrf(
#   formula = as.formula(price~.),
#   training_data = split_clean_train_df$train[,!c('id')],
#   test_data = split_clean_train_df$test[,!c('id')],
#   params1 = list(importance = 'impurity'),
#   params2 = list(num.trees = 50)
# )

##################################################### TESTING ###################################################### 
####################################################################################################################
####################################################################################################################
final_train_df <- fe_output_final[[1]]
final_test_df <- fe_test_df_final

#Need to convert to integers as task doesn't support categoricals
#Train/Test Split
final_train_df$train$weekend <- ifelse((final_train_df$train$weekend), 1, 0)
final_train_df$test$weekend <- ifelse((final_train_df$test$weekend), 1, 0)

#Validation Split
final_test_df$weekend <- ifelse((final_test_df$weekend), 1, 0)

task = makeRegrTask(data = final_train_df$train[,!c('id')], target = "price")
 
# Estimate runtime
estimateTimeTuneRanger(task)

# Tuning
res = tuneRanger(task, num.trees = 500, num.threads = 2, iters = 50, save.file.path = NULL)
  
# Model with the new tuned hyperparameters
res$model

# Prediction
final <- predict(res$model, newdata = final_train_df$test[,!c('id')])$data$response
df_pipeline_final<-cbind(fe_output_final[[3]], final)

metrics_plot(df_pipeline_final, c('baseline','fe1','fe2','fe3','fe4','fe5','final_fe','final_model'), verbose = T)
```

## Retrain Model On Entire Data Set and Predict on Test Set

The following sections retrain the model using the feature engineering and optimal hyperparamters from previous sections on the entire training set. This retrained model is then used to make predictions on the validation set, which is finally prepared for the final CSV format.

```{r}
final_total_train<- rbind(final_train_df$train, final_train_df$test)

#Running this was computationally expensive and ultimately unsuccessful
# fit_control <- trainControl(method = "cv", number = 3, verboseIter = TRUE, search = "random")
# final_rf <- train(as.factor(price) ~ ., data = final_total_train[,!c('id')],
#                  method = "ranger",
#                  trControl = fit_control)

final_rf <- ranger(formula = as.formula(price~.), data=final_total_train[,!c('id')], 
                   importance = 'impurity',
                   mtry = res$recommended.pars$mtry, 
                   min.node.size = res$recommended.pars$min.node.size,
                   sample.fraction = res$recommended.pars$sample.fraction)
print(final_rf)

final_test_rf<-predict(final_rf, data = final_test_df, type='response')$predictions
prediction<-clean_test_df[, .(id=id,final_test_rf)]
head(prediction)
```

## Variable Importance

This chart displays the variable importance sorted by node impurity (i.e. the variation generated when observations reach that variable). Many of the most important factors influencing house pricing are intuitive (i.e. the size/area has a clear positive relationship with price).

```{r}
importance_df <- data.frame(final_rf$variable.importance)
setDT(importance_df, keep.rownames = TRUE)[]
colnames(importance_df) <- c('variable', 'importance')

ggplot(importance_df, aes(x=reorder(variable,importance), y=importance, fill=importance)) + 
    geom_bar(stat="identity", position="dodge")+ coord_flip()+
    ylab("Variable Importance")+
    xlab("")+
    ggtitle("Information Value Summary")+
    guides(fill=F)+
    scale_fill_gradient(low="red", high="blue")
```

## CSV Output

```{r}
colnames(prediction) <- c('id', 'target')
write.csv(prediction, file = "output.csv")
```