In [128]:
# This R environment comes with many helpful analytics packages installed
# It is defined by the kaggle/rstats Docker image: https://github.com/kaggle/docker-rstats
# For example, here's a helpful package to load

library(tidyverse) # metapackage of all tidyverse packages
library(xgboost)
library(caret)
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

list.files(path = "../input")


# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift


The following object is masked from ‘package:httr’:

    progress




In [63]:
train = read.csv('../input/house-prices-advanced-regression-techniques/train.csv', stringsAsFactors = FALSE)
test = read.csv('../input/house-prices-advanced-regression-techniques/test.csv', stringsAsFactors = FALSE)


Split the training data into predictors (x) and the target (y).

In [64]:
train_x = train %>% select(-SalePrice)
train_y = train %>% select(SalePrice)

Start the categorical imputation: list the categorical variables, then select those columns from the training predictors.

In [65]:
categorical = c('SaleCondition', 'SaleType', 'MiscFeature', 'Fence',
                'PoolQC', 'PavedDrive', 'GarageCond', 'GarageQual', 
                'GarageFinish', 'GarageType', 'FireplaceQu',
                'Functional', 'KitchenQual', 'Electrical', 
                'CentralAir', 'HeatingQC', 'Heating','BsmtFinType2',
                'BsmtFinType1', 'BsmtExposure', 'BsmtCond',
                'BsmtQual', 'Foundation', 'ExterCond',
                'ExterQual', 'MasVnrType', 'Exterior2nd',
                'Exterior1st', 'RoofMatl', 'RoofStyle',
                'HouseStyle', 'BldgType', 'Condition2', 'Condition1',
                'Neighborhood','LandSlope', 'LotConfig', 'Utilities','LandContour',
                'LotShape', 'Alley', 'Street', 'MSZoning')
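# Note: naming the selection `.dots` renames the selected columns to .dots1, .dots2, ...
# in the order listed above, so KitchenQual becomes .dots13.
cat_train = train_x %>% select(.dots = categorical)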



Do the same for test, split off the remaining (non-categorical) columns, and replace missing categorical values (NA) with the string 'None'.

In [68]:
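cat_test = test %>% select(.dots = categorical)
# remaining columns: everything that is not in the categorical list
cont_train = train_x %>% select(-(.dots = categorical))
cont_test = test %>% select(-(.dots = categorical))
# missing categorical values become their own level, 'None'
cat_train[is.na(cat_train)] = 'None'
cat_test[is.na(cat_test)] = 'None'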
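
As a minimal sketch of the NA replacement above, here it is on a made-up data frame. The `toy` name and its values are hypothetical and not taken from the competition data. Logical indexing with `is.na()` overwrites every missing cell of a character data frame in one step.

```r
# Hypothetical toy data; column names borrowed from the list above, values invented.
toy <- data.frame(
  Alley = c("Grvl", NA, "Pave"),
  Fence = c(NA, "MnPrv", NA),
  stringsAsFactors = FALSE
)

# is.na(toy) is a logical matrix; assigning through it fills every NA cell.
toy[is.na(toy)] <- "None"

print(toy)
```

This relies on the columns being character (hence `stringsAsFactors = FALSE` when reading the CSVs); with factor columns, 'None' would first have to be added as a level.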

Replace any level that covers less than 1% of the training observations with 'other', and record the levels that were kept so the same mapping can be applied to the test set.

In [69]:

# x collects, for every categorical column, the levels that are kept as-is
x = data.frame()
for (i in seq(ncol(cat_train))){
  # share of training rows per level; levels under 1% are lumped into 'other'
  shares = table(cat_train[i], useNA = 'ifany')/nrow(train)
  below = names(which(shares < .01))
  above = names(which(shares >= .01))
  y = data.frame(col = categorical[i], dots_col = names(cat_train[i]), keep = above)
  x = rbind(x, y)
  cat_train[i] = lapply(cat_train[i], function(v) replace(v, v %in% below, 'other'))
}
# apply the same mapping to test: any level not kept in train becomes 'other'
`%notin%` <- Negate(`%in%`)
for (i in seq(ncol(cat_test))){
  testthing = x %>% filter(dots_col == names(cat_test[i])) %>% select(keep)
  cat_test[i] = lapply(cat_test[i], function(v) replace(v, v %notin% testthing$keep, 'other'))
}

col,dots_col,keep
<fct>,<fct>,<fct>
KitchenQual,.dots13,Ex
KitchenQual,.dots13,Fa
KitchenQual,.dots13,Gd
KitchenQual,.dots13,TA
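
The rare-level step can be sketched on toy data as follows. The data frames, the `Qual` column and the `lump_rare` helper are all hypothetical; only the 1% threshold and the 'other' label come from the cell above. The point is that the set of kept levels is learned from train only, so a level that test has but train does not ends up as 'other' as well.

```r
# Hypothetical toy data: 200 training rows, one level ("Po") seen only once.
toy_train <- data.frame(
  Qual = c(rep("TA", 120), rep("Gd", 79), "Po"),
  stringsAsFactors = FALSE
)
toy_test <- data.frame(Qual = c("TA", "Gd", "Po", "Ex"), stringsAsFactors = FALSE)

# Share of training rows per level; keep the levels at or above 1%.
shares <- table(toy_train$Qual) / nrow(toy_train)
keep <- names(shares)[as.vector(shares >= 0.01)]

# Hypothetical helper: anything outside `keep` becomes "other".
lump_rare <- function(values, keep) ifelse(values %in% keep, values, "other")

toy_train$Qual <- lump_rare(toy_train$Qual, keep)
toy_test$Qual <- lump_rare(toy_test$Qual, keep)

print(keep)
print(toy_test$Qual)
```

Here "Po" is below the threshold in train and "Ex" never appears in train, so both are mapped to "other" in test.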


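Build the model matrix and DMatrix for the xgboost package, starting with the test set. The `.dots13other` dummy (KitchenQual recoded to 'other') is dropped, presumably because that level never occurs in the training data, so the training matrix has no matching column.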

In [70]:
dummy_test = model.matrix(~.,cat_test)
test_dummy = cbind(dummy_test, cont_test) %>% select(-.dots13other) %>% as.matrix()
DM_test = xgb.DMatrix(data = test_dummy)

Same for train.

In [74]:
dummy_train_x = model.matrix(~., cat_train)
dummy_train = as.matrix(cbind(dummy_train_x, cont_train))
data = xgb.DMatrix(data = dummy_train, label = train$SalePrice)
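
The train and test matrices must have the same columns in the same order for prediction to line up. Below is a small base R sketch of how a test-only level creates an extra dummy column and one way to align on the training columns. The toy frames and the `Qual`/`Area` names are hypothetical; the notebook itself drops the one extra column by name instead.

```r
# Hypothetical toy data: "other" appears in test but not in train.
toy_train <- data.frame(Qual = c("Gd", "TA", "Gd"), Area = c(10, 20, 30),
                        stringsAsFactors = FALSE)
toy_test <- data.frame(Qual = c("Gd", "other", "TA"), Area = c(15, 25, 35),
                       stringsAsFactors = FALSE)

mm_train <- model.matrix(~ ., toy_train)
mm_test <- model.matrix(~ ., toy_test)

# Dummy columns that exist only in the test matrix.
extra <- setdiff(colnames(mm_test), colnames(mm_train))
print(extra)

# Keep exactly the training columns, in the training order.
# This assumes every training column also exists in the test matrix.
mm_test_aligned <- mm_test[, colnames(mm_train), drop = FALSE]
print(colnames(mm_test_aligned))
```

Note that a row whose dummy column is dropped has zeros in all remaining dummies for that variable, so the model treats it like the baseline level.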


Grid Search to find the best tuning parameters. 

In [130]:
tune_grid <- expand.grid(
  nrounds = seq(from = 400, to = 500, by = 50),
  eta = c(0.1,  0.05),
  max_depth = c(5,  6,4),
  gamma = c(.1,.15),
  colsample_bytree = c(.25,.3),
  subsample = c(.6,.7),
  min_child_weight = 1
)

tune_control <- caret::trainControl(
  method = "cv", # cross-validation
  number = 5, # with 5 folds
  #index = createFolds(tr_treated$Id_clean), # fix the folds
  verboseIter = FALSE, # no training log
  allowParallel = TRUE # FALSE for reproducible results 
)

xgb_tune <- caret::train(
  x = dummy_train,
  y = train$SalePrice,
  trControl = tune_control,
  tuneGrid = tune_grid,
  method = "xgbTree"
)
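
To get a feel for the cost of this search, the grid size can be counted directly. The sketch below rebuilds the same grid and multiplies by the fold count; `n_combos` and `n_folds` are names introduced here for illustration, and the product is only an upper bound on the number of fold-level fits, since the tuning library may not need a separate fit for every row.

```r
# Same grid as in the cell above.
tune_grid <- expand.grid(
  nrounds = seq(from = 400, to = 500, by = 50),
  eta = c(0.1, 0.05),
  max_depth = c(5, 6, 4),
  gamma = c(.1, .15),
  colsample_bytree = c(.25, .3),
  subsample = c(.6, .7),
  min_child_weight = 1
)

n_combos <- nrow(tune_grid)  # 3 * 2 * 3 * 2 * 2 * 2 * 1
n_folds <- 5                 # matches number = 5 in trainControl

cat("combinations:", n_combos, "\n")
cat("upper bound on fold-level fits:", n_combos * n_folds, "\n")
```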



In [133]:
xgb_tune$bestTune

,nrounds,max_depth,eta,gamma,colsample_bytree,min_child_weight,subsample
,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
21,500,4,0.05,0.15,0.3,1,0.6


In [134]:
# Refit on the full training set with the best parameters from the grid search
bst = xgboost(data, nrounds = 500, eta = .05, max.depth = 4, gamma = .15,
              colsample_bytree = .3, subsample = .6)
pred = predict(bst, test_dummy)

[1]	train-rmse:188414.765625 
[2]	train-rmse:179641.656250 
[3]	train-rmse:171204.156250 
[4]	train-rmse:163219.593750 
[5]	train-rmse:155726.468750 
[6]	train-rmse:148452.468750 
[7]	train-rmse:141469.671875 
[8]	train-rmse:135164.562500 
[9]	train-rmse:129019.007812 
[10]	train-rmse:123207.046875 
[11]	train-rmse:117738.039062 
[12]	train-rmse:112624.609375 
[13]	train-rmse:107659.906250 
[14]	train-rmse:102746.515625 
[15]	train-rmse:98167.929688 
[16]	train-rmse:93877.750000 
[17]	train-rmse:89857.000000 
[18]	train-rmse:86081.664062 
[19]	train-rmse:82490.679688 
[20]	train-rmse:78958.789062 
[21]	train-rmse:75581.039062 
[22]	train-rmse:72381.101562 
[23]	train-rmse:69328.570312 
[24]	train-rmse:66505.156250 
[25]	train-rmse:63828.085938 
[26]	train-rmse:61267.074219 
[27]	train-rmse:58886.511719 
[28]	train-rmse:56568.031250 
[29]	train-rmse:54536.027344 
[30]	train-rmse:52448.457031 
[31]	train-rmse:50487.531250 
[32]	train-rmse:48665.886719 
[33]	train-rmse:46982.031250 
[34]	

In [135]:
test2 = read.csv('../input/house-prices-advanced-regression-techniques/test.csv', stringsAsFactors = FALSE)

In [138]:
submission = data.frame(Id = test2$Id, SalePrice = pred)
write_csv(submission, 'submission2.csv')


My best score was .1259; this last CSV was not the one that produced it. I forgot to set the seed.
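
A minimal sketch of what setting the seed buys, using base R only. The `assign_folds` helper and the seed value 2020 are hypothetical. In the notebook the random parts are the cross-validation folds and the row and column subsampling, so the usual approach would be to call `set.seed()` before `caret::train()` and again before `xgboost()`. Whether that fully pins down a run with `allowParallel = TRUE` is an assumption that is not checked here.

```r
# Hypothetical helper: assign n rows to k folds at random.
assign_folds <- function(n, k) sample(rep(seq_len(k), length.out = n))

set.seed(2020)  # arbitrary seed value, chosen only for this sketch
folds_a <- assign_folds(20, 5)

set.seed(2020)  # same seed again: the random stream restarts
folds_b <- assign_folds(20, 5)

# No reseed here: this draw continues the stream and will generally differ.
folds_c <- assign_folds(20, 5)

print(identical(folds_a, folds_b))
```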