## Random Forest on Healthstatus

The goal of this notebook is to fit a random forest classifier that predicts the healthstatus attribute of our primary dataset.

In [1]:
library(randomForest)
library(ggplot2)
library(caret)
library(here)

data = read.csv(here("data","2015_data.csv"))

randomForest 4.6-14
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'ggplot2'

The following object is masked from 'package:randomForest':

    margin

Loading required package: lattice
"package 'here' was built under R version 3.6.3"
here() starts at D:/nyctrees


### Downsampling Data
As with our [CART Classification](https://github.com/kbfoerster/nyctrees/blob/master/code/CART_Raw_Data.ipynb), we downsample and then upsample the data so that the healthstatus classes are balanced before fitting.

In [2]:
summary(data$healthstatus)

In [6]:
set.seed(42)
down_data = downSample(data, data$healthstatus)
part = createDataPartition(down_data$healthstatus, p = 0.80, list = FALSE)
train = down_data[part,]
test = down_data[-part,]
summary(train$healthstatus)
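
To make the downsampling step concrete, here is a minimal base-R sketch of the idea behind caret's downSample: every class is randomly reduced to the size of the smallest class. The toy data frame and its class counts are hypothetical, and this illustrates the technique only, not caret's actual implementation.

```r
set.seed(42)

# Hypothetical imbalanced toy data standing in for the tree census.
toy <- data.frame(
  x = seq_len(60),
  healthstatus = factor(rep(c("Good", "Poor", "Dead"), times = c(40, 15, 5)))
)

# Size of the smallest class; every class is reduced to this many rows.
n_min <- min(table(toy$healthstatus))

# Row indices grouped by class, then a random draw of n_min rows from each group.
rows_by_class <- split(seq_len(nrow(toy)), toy$healthstatus)
keep <- unlist(lapply(rows_by_class, function(rows) {
  rows[sample.int(length(rows), n_min)]
}))

toy_down <- toy[keep, ]
table(toy_down$healthstatus)  # every class now has n_min rows
```

The price of balancing this way is that most rows of the larger classes are discarded, which is why the downsampled training set is much smaller than the raw data.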

In [10]:
down_rf = randomForest(healthstatus ~ latitude + longitude + zipcode + st_assem + sidw_crack + st_senate + inf_guard, data=train)

ERROR: Error in randomForest.default(healthstatus ~ latitude + longitude + zipcode + : NA not permitted in predictors
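
The error above indicates that at least one predictor contains missing values. The notebook does not show how this was resolved, so the following is only a sketch of one common approach: count the NAs per predictor, then keep rows that are complete for the label and the predictors. The toy data frame is hypothetical and uses two of the predictor names from the formula.

```r
# Hypothetical toy frame; column names mirror two predictors from the formula above.
toy <- data.frame(
  healthstatus = factor(c("Good", "Poor", "Dead", "Good")),
  latitude = c(40.7, NA, 40.6, 40.8),
  zipcode = c(10001, 10002, NA, 10003)
)
predictors <- c("latitude", "zipcode")

# Count missing values per predictor to see which columns cause the error.
na_counts <- colSums(is.na(toy[, predictors]))
na_counts

# Keep only rows with no missing value in the label or any predictor.
complete <- toy[complete.cases(toy[, c("healthstatus", predictors)]), ]
nrow(complete)
```

When the formula interface is used, passing na.action = na.omit to randomForest is an alternative that drops incomplete rows at fit time.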


In [15]:
down_rf


Call:
 randomForest(formula = healthstatus ~ latitude + longitude +      zipcode + st_assem + sidw_crack + st_senate + inf_guard,      data = train) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 51.07%
Confusion matrix:
     Dead Good Poor class.error
Dead 9205  956  939   0.1707207
Good 5105 3979 2016   0.6415315
Poor 5275 2714 3111   0.7197297
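
The OOB error rate and the class.error column can both be recovered from the confusion matrix printed above. The sketch below copies those counts by hand (rows are actual classes, columns are predicted classes) and redoes the arithmetic.

```r
# Counts copied from the OOB confusion matrix above (rows = actual, columns = predicted).
oob <- matrix(
  c(9205,  956,  939,
    5105, 3979, 2016,
    5275, 2714, 3111),
  nrow = 3, byrow = TRUE,
  dimnames = list(actual = c("Dead", "Good", "Poor"),
                  predicted = c("Dead", "Good", "Poor"))
)

# Overall OOB error: share of counts that fall off the diagonal.
oob_error <- 1 - sum(diag(oob)) / sum(oob)
round(100 * oob_error, 2)  # 51.07

# Per-class error: 1 - correct / row total, matching the class.error column.
class_error <- 1 - diag(oob) / rowSums(oob)
round(class_error, 4)
```

The per-class view shows that most of the error comes from Good and Poor trees being predicted as Dead, while Dead trees are classified correctly about 83% of the time.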

In [14]:
#Train prediction accuracy
train_predict = predict(down_rf, train, type = "class")
mean(train_predict == train$healthstatus) 

#Test prediction accuracy
test_predict = predict(down_rf, test, type = "class")
mean(test_predict == test$healthstatus)

In [17]:
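# Confusion matrix on the test set (rows = predicted, columns = actual)
confusion_matrix <- table(predicted = test_predict, actual = test$healthstatus)
confusion_matrix

# Precision for the "Good" class (index 2): correct Good / all predicted Good - 0.53
precision <- confusion_matrix[2,2]/sum(confusion_matrix[2,])
precision

# Recall for the "Good" class: correct Good / all actual Good - 0.35
recall <- confusion_matrix[2,2]/sum(confusion_matrix[,2])
recall

# F1 score for the "Good" class: harmonic mean of precision and recall - 0.42
F1score <- 2 * ((precision * recall)/(precision + recall))
F1score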

         actual
predicted Dead Good Poor
     Dead 2330 1309 1311
     Good  193  972  672
     Poor  251  493  791
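
The precision, recall, and F1 values in the cell above use index 2 of the confusion matrix, so they describe the Good class only. As a sketch, the same formulas can be applied to all three classes at once by working on the diagonal and the row and column totals. The counts are copied by hand from the matrix above.

```r
# Test-set confusion matrix copied from above (rows = predicted, columns = actual).
classes <- c("Dead", "Good", "Poor")
cm <- matrix(
  c(2330, 1309, 1311,
     193,  972,  672,
     251,  493,  791),
  nrow = 3, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

# Precision: correct predictions / all predictions of that class (row totals).
precision <- diag(cm) / rowSums(cm)
# Recall: correct predictions / all actual members of that class (column totals).
recall <- diag(cm) / colSums(cm)
# F1: harmonic mean of precision and recall, per class.
f1 <- 2 * precision * recall / (precision + recall)

round(cbind(precision, recall, f1), 2)

# Overall test accuracy: correct predictions / all predictions.
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

The Good row reproduces the 0.53, 0.35, and 0.42 noted in the cell above.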

### Upsampling Data

In [17]:
set.seed(42)
up_data = upSample(data, data$healthstatus)
summary(up_data$healthstatus)

# Keep a stratified 20% subsample of the upsampled data
sub_part = createDataPartition(up_data$healthstatus, p = 0.20, list = FALSE)
temp_data = up_data[sub_part,]
summary(temp_data$healthstatus)

part = createDataPartition(temp_data$healthstatus, p=0.8, list = FALSE)
up_train = temp_data[part,]
up_test = temp_data[-part,]
summary(up_train$healthstatus)
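
Because upSample duplicates minority-class rows, partitioning after upsampling can place copies of the same tree in both the training and test sets, which may inflate test accuracy. A common alternative, sketched below under that assumption, is to partition first and upsample only the training rows. The toy data frame, its id column, and the toy_ variable names are hypothetical and are not part of the notebook's pipeline.

```r
library(caret)

set.seed(42)

# Hypothetical imbalanced toy data; id lets us check for overlap afterwards.
toy <- data.frame(
  id = 1:300,
  x = rnorm(300),
  healthstatus = factor(rep(c("Good", "Poor", "Dead"), times = c(240, 45, 15)))
)

# 1. Split first, stratified by class, so the test rows stay untouched.
idx <- createDataPartition(toy$healthstatus, p = 0.80, list = FALSE)
toy_train <- toy[idx, ]
toy_test <- toy[-idx, ]

# 2. Upsample only the training rows; yname keeps the label column name.
predictors <- toy_train[, setdiff(names(toy_train), "healthstatus")]
toy_train_up <- upSample(predictors, toy_train$healthstatus, yname = "healthstatus")

table(toy_train_up$healthstatus)                 # balanced classes
length(intersect(toy_train_up$id, toy_test$id))  # rows shared by both sets
```

With this ordering the test set keeps the original class imbalance, so per-class metrics are more informative than overall accuracy.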

In [18]:
up_rf = randomForest(healthstatus ~ latitude + longitude + zipcode + st_assem + sidw_crack + st_senate + inf_guard, data=up_train)

In [19]:
up_rf


Call:
 randomForest(formula = healthstatus ~ latitude + longitude +      zipcode + st_assem + sidw_crack + st_senate + inf_guard,      data = up_train) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 44.28%
Confusion matrix:
      Dead  Good  Poor class.error
Dead 81865  6860  5748   0.1334561
Good 40240 39481 14752   0.5820922
Poor 40908 17000 36565   0.6129582

In [20]:
#Train prediction accuracy
train_predict = predict(up_rf, up_train, type = "class")
mean(train_predict == up_train$healthstatus) 

#Test prediction accuracy
test_predict = predict(up_rf, up_test, type = "class")
mean(test_predict == up_test$healthstatus)

In [21]:
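# Confusion matrix on the test set (rows = predicted, columns = actual)
confusion_matrix <- table(predicted = test_predict, actual = up_test$healthstatus)
confusion_matrix

# Precision for the "Good" class (index 2): correct Good / all predicted Good - 0.62
precision <- confusion_matrix[2,2]/sum(confusion_matrix[2,])
precision

# Recall for the "Good" class: correct Good / all actual Good - 0.42
recall <- confusion_matrix[2,2]/sum(confusion_matrix[,2])
recall

# F1 score for the "Good" class: harmonic mean of precision and recall - 0.50
F1score <- 2 * ((precision * recall)/(precision + recall))
F1score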

         actual
predicted  Dead  Good  Poor
     Dead 20522  9997 10260
     Good  1778  9856  4199
     Poor  1318  3765  9159
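
To put the two models side by side, the sketch below recomputes accuracy and per-class F1 from the two test-set confusion matrices printed in this notebook. The helper names make_cm and summarise_cm are hypothetical. The two test sets were built differently (the upsampled one contains duplicated rows), so the comparison is indicative only.

```r
# Test-set confusion matrices copied from the outputs above (rows = predicted, columns = actual).
classes <- c("Dead", "Good", "Poor")
make_cm <- function(counts) {
  matrix(counts, nrow = 3, byrow = TRUE,
         dimnames = list(predicted = classes, actual = classes))
}
down_cm <- make_cm(c(2330, 1309, 1311,  193,  972,  672,  251,  493,  791))
up_cm   <- make_cm(c(20522, 9997, 10260, 1778, 9856, 4199, 1318, 3765, 9159))

# Hypothetical helper: overall accuracy plus per-class F1 from one confusion matrix.
summarise_cm <- function(cm) {
  precision <- diag(cm) / rowSums(cm)
  recall <- diag(cm) / colSums(cm)
  f1 <- 2 * precision * recall / (precision + recall)
  c(accuracy = sum(diag(cm)) / sum(cm), f1)
}

comparison <- rbind(
  downsampled = summarise_cm(down_cm),
  upsampled = summarise_cm(up_cm)
)
round(comparison, 2)
```

The Good column reproduces the F1 scores of 0.42 and 0.50 noted in the metric cells.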