# Hospital Length of Stay

To optimize resource allocation, hospitals need to predict accurately how long a newly admitted patient will stay.

This notebook uses SQL Server and RevoScaleR (Microsoft R Server). All tables are stored in SQL Server, and most computations load the data into memory chunk by chunk rather than loading the whole data set at once.

It does the following: 

 * **Step 0: Packages and Compute Contexts**
 * **Step 1: Pre-Processing and Cleaning**
 * **Step 2: Feature Engineering**
 * **Step 3-A: Training and Evaluating a Random Forest (Classification Approach)**
 * **Step 3-B: Training and Evaluating a Random Forest (Regression Approach)**

## Step 0: Packages and Compute Contexts

#### In this step, we set up the connection string to access a SQL Server Database and load the necessary library. 

In [133]:
# WARNING.
# We recommend not using Internet Explorer as it does not support plotting, and may crash your session.

In [1]:
# Load package.
library(RevoScaleR)

In [2]:
# Choose a database name and create it. 
db <- "Hospital"

## Connect to the master database only to create a new database. Change UID and PWD if you modified them. 
connection_string <- "Driver=SQL Server;Server=localhost;Database=master;UID=rdemo;PWD=D@tascience"

## Open a connection with SQL Server to be able to write queries with the rxExecuteSQLDDL function.
outOdbcDS <- RxOdbcData(table = "NewData", connectionString = connection_string, useFastRead=TRUE)
rxOpen(outOdbcDS, "w")

query <- sprintf( "if not exists(SELECT * FROM sys.databases WHERE name = '%s') CREATE DATABASE %s;", db, db)

## Create database. 
rxExecuteSQLDDL(outOdbcDS, sSQLString = query)

In [3]:
# Define Compute Contexts: user to input Server Name, database name, UID and Password. 
connection_string <- sprintf("Driver=SQL Server;Server=localhost;Database=%s;UID=rdemo;PWD=D@tascience", db)
sql <- RxInSqlServer(connectionString = connection_string)
local <- RxLocalSeq()

## Open a connection with SQL Server to be able to write queries with the rxExecuteSQLDDL function in the new database.
outOdbcDS <- RxOdbcData(table = "NewData", connectionString = connection_string, useFastRead=TRUE)
rxOpen(outOdbcDS, "w")

#### The function below can be used to get the top n rows of a table stored on SQL Server. 
#### You can execute this cell throughout your progress by removing the comment "#", and inputting:
#### - the table name.
#### - the number of rows you want to display.

In [4]:
 display_head <- function(table_name, n_rows){
   table_sql <- RxSqlServerData(sqlQuery = sprintf("SELECT TOP(%s) * FROM %s", n_rows, table_name), connectionString = connection_string)
   table <- rxImport(table_sql)
   print(table)
}

# table_name <- "insert_table_name"
# n_rows <- 10
# display_head(table_name, n_rows)
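
To see what `display_head` sends to SQL Server without opening a connection, the query string can be built on its own. The helper name `build_head_query` below is hypothetical and only mirrors the `sprintf` call used above; it is a sketch, not part of RevoScaleR.

```r
# Hypothetical helper: builds the same TOP(n) query string that display_head passes to RxSqlServerData.
build_head_query <- function(table_name, n_rows) {
  sprintf("SELECT TOP(%s) * FROM %s", n_rows, table_name)
}

head_query <- build_head_query("LengthOfStay", 10)
print(head_query)
```

Because the table name is pasted directly into the SQL text, it should only ever come from a trusted source such as the notebook itself.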

## Step 1: Pre-Processing and Cleaning

In this step, we: 

**1.** Upload the data set to SQL.

**2.** Clean the data set: we replace NAs with the mode (categorical variables) or the mean (continuous variables).

**Input:**  Data Set LengthOfStay.csv

**Output:** Cleaned data set LoS0. When no missing values are found, the LengthOfStay table is used as is.

In [5]:
# Set the compute context to Local. 
rxSetComputeContext(local)

In [7]:
# Upload the data set to SQL.

## Specify the desired column types. 
## When uploading to SQL, Character and Factor are converted to nvarchar(255), Integer to Integer and Numeric to Float. 
column_types <-  c(eid = "integer",               
                   vdate = "character",           
                   rcount = "character",        
                   gender = "factor",            
                   dialysisrenalendstage = "factor",             
                   asthma = "factor",                
                   irondef = "factor",                   
                   pneum = "factor",                 
                   substancedependence = "factor",                  
                   psychologicaldisordermajor = "factor",             
                   depress = "factor",           
                   psychother = "factor",        
                   fibrosisandother = "factor",          
                   malnutrition = "factor",                               
                   hemo = "numeric",            
                   hematocritic = "numeric",           
                   neutrophils = "numeric",           
                   sodium = "numeric",          
                   glucose = "numeric",             
                   bloodureanitro = "numeric",                 
                   creatinine = "numeric",                 
                   bmi = "numeric",                 
                   pulse = "numeric",                  
                   respiration = "numeric",                  
                   secondarydiagnosisnonicd9 = "factor",
                   discharged = "character",
                   facid = "factor",
                   lengthofstay = "integer")


## Point to the input data set while specifying the classes.
LoS_text <- RxTextData(file = "LengthOfStay.csv", colClasses = column_types)

## Upload the table to SQL. 
LengthOfStay_sql <- RxSqlServerData(table = "LengthOfStay", connectionString = connection_string)
rxDataStep(inData = LoS_text, outFile = LengthOfStay_sql, overwrite = TRUE)

print("Data exported to SQL")

Rows Read: 100000, Total Rows Processed: 100000, Total Chunk Time: 2.172 seconds 

Elapsed time to compute low/high values and/or factor levels: 2.441 secs.
 
Total Rows written: 100000, Total time: 7.719
Rows Read: 100000, Total Rows Processed: 100000, Total Chunk Time: 10.266 seconds 
[1] "Data exported to SQL"


In [9]:
# Determine if LengthOfStay has missing values

table <- "LengthOfStay"

# First, get the names and types of the variables to be treated.
data_sql <- RxSqlServerData(table = table, connectionString = connection_string)
col <- rxCreateColInfo(data_sql)

# Then, get the names of the variables that actually have missing values. Assumption: no NA in eid. 
colnames <- names(col)
var <- colnames[!colnames %in% c("eid")]
formula <- as.formula(paste("~", paste(var, collapse = "+")))
summary <- rxSummary(formula, data_sql, byTerm = TRUE)
var_with_NA <- summary$sDataFrame[summary$sDataFrame$MissingObs > 0, 1] 

if(length(var_with_NA) == 0){
  print("No missing values.")
  missing <- 0
  
} else{
  print("Variables containing missing values are:")
  print(var_with_NA)
  print("The NAs will be replaced with the mode or mean.")
  missing <- 1
}    

Rows Read: 50000, Total Rows Processed: 50000, Total Chunk Time: 1.203 seconds
Rows Read: 50000, Total Rows Processed: 100000, Total Chunk Time: 1.188 seconds 
Computation time: 2.546 seconds.
[1] "No missing values."
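
The same check can be tried locally on a small data frame with base R only. This is a sketch under the assumption that the data fits in memory; the toy values and the name `var_with_NA_toy` are made up for illustration, whereas the cell above relies on `rxSummary` to count `MissingObs` chunk by chunk.

```r
# Toy data frame (made-up values) with one NA in a categorical and one in a numeric column.
toy <- data.frame(
  eid = 1:5,
  gender = c("F", "M", NA, "F", "F"),
  glucose = c(140.2, NA, 151.0, 138.7, 145.1),
  pulse = c(72, 80, 65, 90, 77),
  stringsAsFactors = FALSE
)

# Count missing values per column, leaving out the identifier as in the notebook.
candidates <- setdiff(names(toy), "eid")
missing_counts <- colSums(is.na(toy[, candidates, drop = FALSE]))

# Keep only the names of the columns that actually contain NAs.
var_with_NA_toy <- names(missing_counts[missing_counts > 0])
print(var_with_NA_toy)
```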


In [14]:
# If applicable, NULL is replaced with the mode (categorical variables: integer or character) or mean (continuous variables).

if(missing == 0){
    print("Nothing to clean")
    LengthOfStay_cleaned_sql <- RxSqlServerData(table = table, connectionString = connection_string)
} else{
# Get the variable types (categorical vs. continuous).
categ_names <- c()
contin_names <- c()
  for(name in var_with_NA){
    if(col[[name]]$type == "numeric"){
      contin_names[length(contin_names) + 1] <- name
    } else{
      categ_names[length(categ_names) + 1] <- name
    }
  }
# For Categoricals: Compute the mode of the variables with SQL queries in table Modes. We then import Modes. 
rxExecuteSQLDDL(outOdbcDS, sSQLString = paste("DROP TABLE if exists Modes;"
                                              , sep=""))

rxExecuteSQLDDL(outOdbcDS, sSQLString = paste("CREATE TABLE Modes
                                              (name varchar(30),
                                              mode varchar(30));"
                                              , sep=""))

for(name in categ_names){
  rxExecuteSQLDDL(outOdbcDS, sSQLString = sprintf("INSERT INTO Modes
                                                  SELECT '%s', mode
                                                  FROM (SELECT TOP(1) %s as mode, count(*) as cnt
                                                  FROM %s
                                                  GROUP BY %s 
                                                  ORDER BY cnt desc) as t;",name, name, table, name))
}
Modes_sql <- RxSqlServerData(table = "Modes", connectionString = connection_string) 
Modes <- rxImport(Modes_sql)

# For Continuous: Compute the mean of the variables with SQL queries in table Means. We then import Means. 
rxExecuteSQLDDL(outOdbcDS, sSQLString = paste("DROP TABLE if exists Means;"
                                              , sep=""))

rxExecuteSQLDDL(outOdbcDS, sSQLString = paste("CREATE TABLE Means
                                              (name varchar(30),
                                              mean float);"
                                              , sep=""))

for(name in contin_names){
  rxExecuteSQLDDL(outOdbcDS, sSQLString = sprintf("INSERT INTO Means
                                                  SELECT '%s', mean
                                                  FROM (SELECT AVG(%s) as mean
                                                  FROM %s) as t;",name, name, table))
}
Means_sql <- RxSqlServerData(table = "Means", connectionString = connection_string) 
Means <- rxImport(Means_sql)
 
# Function to replace missing values with the mode (categorical variables) or mean (continuous variables)
fill_NA_mode_mean <- function(data){
  data <- data.frame(data)
  # seq_along() skips the loop entirely when there is no variable of that kind.
  for(j in seq_along(categ)){
    row_na <- which(is.na(data[,categ[j]]))
    if(length(row_na) > 0){
      data[row_na, categ[j]] <- subset(Mode, name == categ[j])[1,2]
    }
  }
  for(j in seq_along(contin)){
    row_na <- which(is.na(data[,contin[j]]))
    if(length(row_na) > 0){
      data[row_na, contin[j]] <- subset(Mean, name == contin[j])[1,2]
    }
  }
  return(data)
}

# Apply this function to LengthOfStay by wrapping it in rxDataStep. Output is written to LoS0.
LoS0_sql <- RxSqlServerData(table = "LoS0", connectionString = connection_string)
rxDataStep(inData = LengthOfStay_sql , outFile = LoS0_sql, overwrite = TRUE, transformFunc = fill_NA_mode_mean, 
           transformObjects = list(categ = categ_names, contin = contin_names, Mode = Modes, Mean = Means))
   
LengthOfStay_cleaned_sql <- RxSqlServerData(table = "LoS0", connectionString = connection_string)    
    
print("Data cleaned")
}


[1] "Nothing to clean"
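
Since the data set used here has no missing values, the cleaning branch never runs. The replacement rule itself can be exercised on a toy data frame with base R. This is a minimal in-memory sketch; the function name `impute_mode_mean` and the toy values are assumptions for illustration, and the notebook instead computes modes and means with SQL queries and applies them through `rxDataStep`.

```r
# Toy data (made-up values): one missing category and one missing measurement.
toy <- data.frame(
  gender = c("F", "M", NA, "F", "F"),
  glucose = c(140, NA, 150, 130, 140),
  stringsAsFactors = FALSE
)

# Hypothetical helper: mean for numeric columns, most frequent value otherwise.
impute_mode_mean <- function(data) {
  for (name in names(data)) {
    missing_rows <- is.na(data[[name]])
    if (!any(missing_rows)) next
    if (is.numeric(data[[name]])) {
      data[missing_rows, name] <- mean(data[[name]], na.rm = TRUE)
    } else {
      counts <- table(data[[name]])            # table() ignores NA by default
      data[missing_rows, name] <- names(counts)[which.max(counts)]
    }
  }
  data
}

toy_clean <- impute_mode_mean(toy)
print(toy_clean)
```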


## Step 2: Feature Engineering

In this step, we:

**1.** Standardize the continuous variables (Z-score).

**2.** Create the variable number_of_issues: the number of preidentified medical conditions.

**3.** Create the variable lengthofstay_bucket: bucketed version of the target variable for classification.
 

**Input:** Data set before feature engineering LengthOfStay.

**Output:** Data set with new features LoS.
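
The three transformations can be sketched on a toy data frame with base R. The values below are made up, only two of the ten condition flags are included, and `cut` is swapped in for the nested `ifelse` used in the notebook; the bucket boundaries (below 4, below 7, below 10, otherwise) are the same.

```r
# Toy data (made-up values); condition flags are stored as "0"/"1" strings.
toy <- data.frame(
  asthma = c("0", "1", "1", "0"),
  depress = c("0", "0", "1", "0"),
  glucose = c(100, 120, 140, 160),
  lengthofstay = c(2L, 5L, 8L, 12L),
  stringsAsFactors = FALSE
)

# 1. Z-score: subtract the mean and divide by the (sample) standard deviation.
toy$glucose <- (toy$glucose - mean(toy$glucose)) / sd(toy$glucose)

# 2. number_of_issues: sum of the condition flags.
toy$number_of_issues <- as.numeric(toy$asthma) + as.numeric(toy$depress)

# 3. lengthofstay_bucket: left-closed intervals [.,4), [4,7), [7,10), [10,.).
toy$lengthofstay_bucket <- as.character(cut(
  toy$lengthofstay,
  breaks = c(-Inf, 4, 7, 10, Inf),
  labels = c("1", "2", "3", "4"),
  right = FALSE
))

print(toy)
```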

In [15]:
# Get the mean and standard deviation of the variables to standardize.
names <- c("hemo", "hematocritic", "neutrophils", "sodium", "glucose", "bloodureanitro",
           "creatinine", "bmi", "pulse", "respiration")
summary <- rxSummary(formula = ~., LengthOfStay_cleaned_sql, byTerm = TRUE)$sDataFrame
Statistics <- summary[summary$Name %in% names,c("Name","Mean","StdDev")]

# Function to standardize.
standardize <- function(data){
  data <- data.frame(data)
  for(n in 1:nrow(Stats)){
    data[[Stats[n,1]]] <- (data[[Stats[n,1]]] - Stats[n,2])/Stats[n,3]
    }
  return(data)
}

# Apply this function to the cleaned table by wrapping it up in rxDataStep. Output is written to LoS.  
# At the same time, we create number_of_issues as the number of preidentified medical conditions.
# We also create lengthofstay_bucket as the bucketed version of lengthofstay for classification. 

LoS_sql <- RxSqlServerData(table = "LoS", connectionString = connection_string)
rxDataStep(inData = LengthOfStay_cleaned_sql , outFile = LoS_sql, overwrite = TRUE, transformFunc = standardize, 
           transformObjects = list(Stats = Statistics), transforms = list(
             number_of_issues = as.numeric(dialysisrenalendstage) + as.numeric(asthma) + as.numeric(irondef) + 
                                as.numeric(pneum) + as.numeric(substancedependence) +
                                as.numeric(psychologicaldisordermajor) + as.numeric(depress) + as.numeric(psychother) + 
                                as.numeric(fibrosisandother) + as.numeric(malnutrition),
             lengthofstay_bucket = ifelse(lengthofstay < 4, "1",
                                          ifelse(lengthofstay < 7, "2",
                                                 ifelse(lengthofstay < 10, "3",
                                                        "4")))))

           
# Converting number_of_issues to character with a SQL query because as.character in rxDataStep is crashing.           

## Open a connection with SQL Server to be able to write queries with the rxExecuteSQLDDL function.
outOdbcDS <- RxOdbcData(table = "NewData", connectionString = connection_string, useFastRead=TRUE)
rxOpen(outOdbcDS, "w") 

## Convert number_of_issues to character.
rxExecuteSQLDDL(outOdbcDS, sSQLString = paste("ALTER TABLE LoS ALTER COLUMN number_of_issues varchar(2);", sep=""))

print("Feature Engineering Completed")

Rows Read: 50000, Total Rows Processed: 50000, Total Chunk Time: 1.265 seconds
Rows Read: 50000, Total Rows Processed: 100000, Total Chunk Time: 1.203 seconds 
Computation time: 2.609 seconds.
Total Rows written: 50000, Total time: 4.375
Rows Read: 50000, Total Rows Processed: 50000, Total Chunk Time: 6.219 seconds
Total Rows written: 50000, Total time: 4.359
Rows Read: 50000, Total Rows Processed: 100000, Total Chunk Time: 6.109 seconds 


[1] "Feature Engineering Completed"


## Step 3-A: Training and Evaluating the Models: Classification Approach

In this step we:

**1.** Split LoS into a training set, LoS_Train, and a testing set, LoS_Test.

**2.** Train a classification Random Forest (RF) on LoS_Train, and save it to SQL.

**3.** Score the RF on LoS_Test.

**Input:** Data set LoS.

**Output:** Random forest saved to SQL and performance metrics.  

In [16]:
# Point to the SQL table with the data set for modeling. Strings will be converted to factors.
LoS <- RxSqlServerData(table = "LoS", connectionString = connection_string, stringsAsFactors = T)

# Get variable names, types, and levels for factors and reorder the factors for clarity during evaluation.
column_info <- rxCreateColInfo(LoS)
column_info$lengthofstay_bucket$levels <- c("1", "2", "3", "4")

print("Column information received")

[1] "Column information received"


In [17]:
# Randomly split the data into a training set and a testing set according to a splitting percentage p.
# p% of the rows go to the training set and the rest go to the testing set. Default is 70%.

p <- "70" 

## Open a connection with SQL Server to be able to write queries with the rxExecuteSQLDDL function.
outOdbcDS <- RxOdbcData(table = "NewData", connectionString = connection_string, useFastRead=TRUE)
rxOpen(outOdbcDS, "w")

## Create the Train_Id table containing the eid of each row in the training set.
rxExecuteSQLDDL(outOdbcDS, sSQLString = paste("DROP TABLE if exists Train_Id;", sep=""))

rxExecuteSQLDDL(outOdbcDS, sSQLString = sprintf(
  "SELECT eid
   INTO Train_Id
   FROM LoS
   WHERE ABS(CAST(BINARY_CHECKSUM(eid, NEWID()) as int)) %s < %s ;"
  ,"% 100", p ))

## Point to the training set. It will be created on the fly when training models. 
LoS_Train <- RxSqlServerData(  
  sqlQuery = "SELECT *   
              FROM LoS 
              WHERE eid IN (SELECT eid from Train_Id)",
  connectionString = connection_string, colInfo = column_info)

## Point to the testing set. It will be created on the fly when testing models. 
LoS_Test <- RxSqlServerData(  
  sqlQuery = "SELECT *   
              FROM LoS 
              WHERE eid NOT IN (SELECT eid from Train_Id)",
  connectionString = connection_string, colInfo = column_info)

print("Splitting completed")

[1] "Splitting completed"
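
A local analogue of this split, using base R only, might look as follows. It is a sketch: the ids are made up, the seed is arbitrary, and `sample` stands in for the `BINARY_CHECKSUM(eid, NEWID()) % 100` expression that SQL Server evaluates per row. As in the notebook, the share of rows in the training set is only approximately p%.

```r
set.seed(42)                       # arbitrary seed so the sketch is reproducible
eid <- 1:1000                      # made-up patient ids
p <- 70                            # percentage of rows sent to the training set

# Draw a value in 0..99 per row and keep the rows whose draw falls below p.
in_train <- sample(0:99, length(eid), replace = TRUE) < p
train_id <- eid[in_train]

# The testing set is defined by exclusion, like the NOT IN query above.
test_id <- eid[!(eid %in% train_id)]

cat("Training rows:", length(train_id), " Testing rows:", length(test_id), "\n")
```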


In [19]:
# Write the formula after removing variables not used in the modeling.
variables_all <- rxGetVarNames(LoS)
variables_to_remove <- c("eid", "vdate", "discharged", "lengthofstay", "facid")
training_variables <- variables_all[!(variables_all %in% c("lengthofstay_bucket", variables_to_remove))]
formula <- as.formula(paste("lengthofstay_bucket ~", paste(training_variables, collapse = "+")))

# To deal with class imbalance, we use stratified sampling.
# We keep all observations in the smallest class and sample from the three other classes to get the same number from each.
summary <- rxSummary(formula = ~ lengthofstay_bucket, LoS_Train)$categorical[[1]]
strat_sampling <- function(){
  min <- which.min(summary[,2])
  return(c(summary[min,2]/summary[1,2], summary[min,2]/summary[2,2], summary[min,2]/summary[3,2],
           summary[min,2]/summary[4,2]))
}
sampling_rate <- strat_sampling()

print("Formula written and sampling rates computed")

Rows Read: 50000, Total Rows Processed: 50000, Total Chunk Time: 0.109 seconds
Rows Read: 20142, Total Rows Processed: 70142, Total Chunk Time: 0.016 seconds 
Computation time: 0.188 seconds.
[1] "Formula written and sampling rates computed"
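
The sampling rates amount to dividing the size of the smallest class by the size of each class. The class counts below are hypothetical (the notebook does not print the training counts), and the vectorized expression is a compact equivalent of `strat_sampling`, not the notebook's own code.

```r
# Hypothetical training counts per lengthofstay_bucket.
class_counts <- c("1" = 14000, "2" = 38000, "3" = 15000, "4" = 2300)

# Smallest class is kept entirely (rate 1); the others are down-sampled to its size.
sampling_rate_toy <- min(class_counts) / class_counts
print(round(sampling_rate_toy, 4))

# Expected number of rows drawn from each class.
print(class_counts * sampling_rate_toy)
```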


In [20]:
# Compute Context is set to SQL for model training.
rxSetComputeContext(sql)

In [21]:
# Train the Random Forest. 
forest_model_class <- rxDForest(formula = formula,
                                data = LoS_Train,
                                nTree = 40,
                                minSplit = 10,
                                minBucket = 5,
                                cp = 0.00005,
                                seed = 5, 
                                strata = c("lengthofstay_bucket"),
                                sampRate = sampling_rate)

print("Training Classification RF done")

[1] "Training Classification RF done"


In [22]:
# Save the Random Forest in SQL. The compute context is set to Local in order to export the model. 
rxSetComputeContext(local)
saveRDS(forest_model_class, file = "forest_model_class.rds")
forest_model_class_raw <- readBin("forest_model_class.rds", "raw", n = file.size("forest_model_class.rds"))
forest_model_class_char <- as.character(forest_model_class_raw)
forest_model_class_sql <- RxSqlServerData(table = "Models_Class", connectionString = connection_string) 
rxDataStep(inData = data.frame(x = forest_model_class_char ), outFile = forest_model_class_sql, overwrite = TRUE)

print("Classification RF model uploaded to SQL")

Rows Read: 880470, Total Rows Processed: 880470
Total Rows written: 100000, Total time: 2.593
Total Rows written: 200000, Total time: 5.234
Total Rows written: 300000, Total time: 7.906
Total Rows written: 400000, Total time: 10.531
Total Rows written: 500000, Total time: 13.093
Total Rows written: 600000, Total time: 15.656
Total Rows written: 700000, Total time: 18.203
Total Rows written: 800000, Total time: 20.828
Total Rows written: 880470, Total time: 22.968
, Total Chunk Time: 23.109 seconds 
[1] "Classification RF model uploaded to SQL"
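
The model is stored as one row per byte, each byte written as a two-digit hexadecimal string. A round trip can be sketched locally with base R. The `lm` model on the built-in `cars` data is only a small stand-in for the forest, and the restore step (`strtoi`, `writeBin`, `readRDS`) is an assumption about how the stored rows could be turned back into a model; it is not shown in this notebook.

```r
# Small stand-in model; `cars` ships with base R.
model <- lm(dist ~ speed, data = cars)

# Same serialization steps as above: RDS file -> raw bytes -> character vector.
model_path <- tempfile(fileext = ".rds")
saveRDS(model, file = model_path)
model_raw <- readBin(model_path, "raw", n = file.size(model_path))
model_char <- as.character(model_raw)   # one two-digit hex string per byte

# Reverse direction: hex strings -> raw bytes -> RDS file -> model object.
restored_raw <- as.raw(strtoi(model_char, base = 16L))
restored_path <- tempfile(fileext = ".rds")
writeBin(restored_raw, restored_path)
restored_model <- readRDS(restored_path)

print(coef(restored_model))
```

Storing one row per byte explains the large row counts in the output above; it is simple but not compact.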


In [23]:
# Multi-class classification model evaluation metrics

evaluate_model_class <- function(observed, predicted, model) {
  confusion <- table(observed, predicted)
  num_classes <- nlevels(observed)
  tp <- rep(0, num_classes)
  fn <- rep(0, num_classes)
  fp <- rep(0, num_classes)
  tn <- rep(0, num_classes)
  accuracy <- rep(0, num_classes)
  precision <- rep(0, num_classes)
  recall <- rep(0, num_classes)
  for(i in 1:num_classes) {
    # Rows of the confusion table are observed classes and columns are predicted classes.
    tp[i] <- sum(confusion[i, i])
    fn[i] <- sum(confusion[i, -i])   # observed i, predicted as another class
    fp[i] <- sum(confusion[-i, i])   # observed as another class, predicted i
    tn[i] <- sum(confusion[-i, -i])
    accuracy[i] <- (tp[i] + tn[i]) / (tp[i] + fn[i] + fp[i] + tn[i])
    precision[i] <- tp[i] / (tp[i] + fp[i])
    recall[i] <- tp[i] / (tp[i] + fn[i])
  }
  overall_accuracy <- sum(tp) / sum(confusion)
  average_accuracy <- sum(accuracy) / num_classes
  micro_precision <- sum(tp) / (sum(tp) + sum(fp))
  macro_precision <- sum(precision) / num_classes
  micro_recall <- sum(tp) / (sum(tp) + sum(fn))
  macro_recall <- sum(recall) / num_classes
  metrics <- c("Overall accuracy" = overall_accuracy,
               "Average accuracy" = average_accuracy,
               "Micro-averaged Precision" = micro_precision,
               "Macro-averaged Precision" = macro_precision,
               "Micro-averaged Recall" = micro_recall,
               "Macro-averaged Recall" = macro_recall)
  print(model)
  print(metrics)
  print(confusion)
  return(metrics)
}

In [25]:
# Classification Random Forest Scoring

# Make Predictions, then import them into R. The observed lengthofstay_bucket and eid are kept through the argument extraVarsToWrite.
Prediction_Table_RF_Class <- RxSqlServerData(table = "Forest_Prediction_Class", stringsAsFactors = T, connectionString = connection_string)
rxPredict(forest_model_class, data = LoS_Test, outData = Prediction_Table_RF_Class, overwrite = T, type = "prob",
          extraVarsToWrite = c("lengthofstay_bucket", "eid"))

Prediction_RF_Class <- rxImport(inData = Prediction_Table_RF_Class, stringsAsFactors = T, outFile = NULL)

# Compute the performance metrics of the model.
Metrics_RF_Class <- evaluate_model_class(observed = factor(Prediction_RF_Class$lengthofstay_bucket, levels = c("1","2","3","4")),
                                         predicted = factor(Prediction_RF_Class$lengthofstay_bucket_Pred, levels = c("1","2","3","4")),
                                         model = "RF")

print("Scoring Classification RF done")

Rows Read: 29858, Total Rows Processed: 29858, Total Chunk Time: 1.656 seconds
Total Rows written: 29858, Total time: 0.921
 
Rows Read: 29858, Total Rows Processed: 29858, Total Chunk Time: 0.203 seconds 
[1] "RF"
        Overall accuracy         Average accuracy Micro-averaged Precision 
               0.7099605                0.8549802                0.7099605 
Macro-averaged Precision    Micro-averaged Recall    Macro-averaged Recall 
               0.6466639                0.7099605                0.7703876 
        predicted
observed     1     2     3     4
       1  6013    13     0     0
       2  4308 10120  1978    14
       3    70  1124  4288   961
       4     1    60   131   777
[1] "Scoring Classification RF done"
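
As a cross-check of the metric definitions, the confusion matrix printed above can be re-entered by hand and the per-class precision and recall recomputed with base R. This is a sketch: with observed classes in rows and predicted classes in columns, recall divides the diagonal by the row sums and precision divides it by the column sums.

```r
# Confusion matrix copied from the output above (rows: observed, columns: predicted).
confusion <- matrix(
  c(6013,    13,    0,   0,
    4308, 10120, 1978,  14,
      70,  1124, 4288, 961,
       1,    60,  131, 777),
  nrow = 4, byrow = TRUE,
  dimnames = list(observed = c("1", "2", "3", "4"),
                  predicted = c("1", "2", "3", "4"))
)

true_positives <- diag(confusion)
recall_by_class <- true_positives / rowSums(confusion)      # share of each observed class recovered
precision_by_class <- true_positives / colSums(confusion)   # share of each predicted class that is right
overall_accuracy <- sum(true_positives) / sum(confusion)

print(round(rbind(precision = precision_by_class, recall = recall_by_class), 4))
print(c(overall_accuracy = overall_accuracy,
        macro_precision = mean(precision_by_class),
        macro_recall = mean(recall_by_class)))
```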


## Step 3-B: Training and Evaluating the Models: Regression Approach

In this step we:
 
**1.** Train a regression Random Forest (RF) on LoS_Train, and save it to SQL.

**2.** Score the RF on LoS_Test.

**Input:** Data set LoS.

**Output:** Random forest model saved to SQL and performance metrics. 

In [33]:
# Write the formula after removing variables not used in the modeling.
variables_all <- rxGetVarNames(LoS)
variables_to_remove <- c("eid", "vdate", "discharged", "lengthofstay_bucket", "facid")
training_variables <- variables_all[!(variables_all %in% c("lengthofstay", variables_to_remove))]
formula <- as.formula(paste("lengthofstay ~", paste(training_variables, collapse = "+")))

print("Formula written")

[1] "Formula written"
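
The formula construction can be tried without a SQL connection by substituting a short, hand-written list of column names for `rxGetVarNames(LoS)`. The names ending in `_toy` and the shortened variable list are assumptions for illustration only.

```r
# Hand-written subset of the LoS columns, standing in for rxGetVarNames(LoS).
variables_all_toy <- c("eid", "vdate", "rcount", "gender", "glucose",
                       "lengthofstay", "lengthofstay_bucket")

# Drop identifiers, dates and the bucketed target, then drop the target itself.
excluded_toy <- c("eid", "vdate", "lengthofstay_bucket")
predictors_toy <- setdiff(variables_all_toy, c("lengthofstay", excluded_toy))

# Paste the predictors together and convert the string to a formula object.
formula_toy <- as.formula(paste("lengthofstay ~", paste(predictors_toy, collapse = " + ")))
print(formula_toy)
```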


In [34]:
# Compute Context is set to SQL for model training.
rxSetComputeContext(sql)

In [35]:
# Train the Random Forest.
forest_model_reg <- rxDForest(formula = formula,
                              data = LoS_Train,
                              nTree = 40,
                              minSplit = 10,
                              minBucket = 5,
                              cp = 0.00005,
                              seed = 5)

print("Training Regression RF done")

[1] "Training Regression RF done"


In [36]:
# Save the Random Forest in SQL. The compute context is set to Local in order to export the model. 
rxSetComputeContext(local)
saveRDS(forest_model_reg, file = "forest_model_reg.rds")
forest_model_reg_raw <- readBin("forest_model_reg.rds", "raw", n = file.size("forest_model_reg.rds"))
forest_model_reg_char <- as.character(forest_model_reg_raw)
forest_model_reg_sql <- RxSqlServerData(table = "Models_Reg", connectionString = connection_string) 
rxDataStep(inData = data.frame(x = forest_model_reg_char ), outFile = forest_model_reg_sql, overwrite = TRUE)

# Set back the compute context to SQL.
rxSetComputeContext(sql)

print("RF Regression model uploaded to SQL")

Rows Read: 1065856, Total Rows Processed: 1065856
Total Rows written: 100000, Total time: 2.562
Total Rows written: 200000, Total time: 5.125
Total Rows written: 300000, Total time: 7.656
Total Rows written: 400000, Total time: 10.312
Total Rows written: 500000, Total time: 13.078
Total Rows written: 600000, Total time: 15.624
Total Rows written: 700000, Total time: 18.187
Total Rows written: 800000, Total time: 20.749
Total Rows written: 900000, Total time: 23.296
Total Rows written: 1000000, Total time: 26.015
Total Rows written: 1065856, Total time: 27.812
, Total Chunk Time: 27.874 seconds 
[1] "RF Regression model uploaded to SQL"


In [37]:
# Write a function that computes regression performance metrics. 
evaluate_model_reg <- function(observed, predicted, model) {
  mean_observed <- mean(observed)
  se <- (observed - predicted)^2
  ae <- abs(observed - predicted)
  sem <- (observed - mean_observed)^2
  aem <- abs(observed - mean_observed)
  mae <- mean(ae)
  rmse <- sqrt(mean(se))
  rae <- sum(ae) / sum(aem)
  rse <- sum(se) / sum(sem)
  rsq <- 1 - rse
  metrics <- c("Mean Absolute Error" = mae,
               "Root Mean Squared Error" = rmse,
               "Relative Absolute Error" = rae,
               "Relative Squared Error" = rse,
               "Coefficient of Determination" = rsq)
  print(model)
  print(metrics)
  print("Summary statistics of the absolute error")
  print(summary(abs(observed-predicted)))
  return(metrics)
}

In [38]:
# Regression Random Forest Scoring 

# Make Predictions, then import them into R. The observed lengthofstay and eid are kept through the argument extraVarsToWrite.
Prediction_Table_RF_Reg <- RxSqlServerData(table = "Forest_Prediction_Reg", stringsAsFactors = T, connectionString = connection_string)
rxPredict(forest_model_reg, data = LoS_Test, outData = Prediction_Table_RF_Reg, overwrite = T, type = "response",
          extraVarsToWrite = c("lengthofstay", "eid"))

Prediction_RF_Reg <- rxImport(inData = Prediction_Table_RF_Reg, stringsAsFactors = T, outFile = NULL)

# Compute the performance metrics of the model.
Metrics_RF_Reg <- evaluate_model_reg(observed = Prediction_RF_Reg$lengthofstay,
                                    predicted = Prediction_RF_Reg$lengthofstay_Pred,
                                    model = "RF")
print("Scoring Regression RF done")

Rows Read: 29858, Total Rows Processed: 29858, Total Chunk Time: 0.015 seconds 
[1] "RF"
         Mean Absolute Error      Root Mean Squared Error 
                   0.6274529                    0.8466414 
     Relative Absolute Error       Relative Squared Error 
                   0.3837478                    0.1751443 
Coefficient of Determination 
                   0.8248557 
[1] "Summary statistics of the absolute error"
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.000181 0.298100 0.541900 0.627500 0.765400 6.907000 
[1] "Scoring Regression RF done"
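
To make the regression metrics concrete, they can be computed by hand on four made-up observations. This is a sketch that follows the same definitions as `evaluate_model_reg`; the numbers are illustrative and unrelated to the results above.

```r
# Made-up observed and predicted lengths of stay.
observed <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 4)

abs_error <- abs(observed - predicted)
sq_error <- (observed - predicted)^2

mae <- mean(abs_error)                                      # mean absolute error
rmse <- sqrt(mean(sq_error))                                # root mean squared error
rse <- sum(sq_error) / sum((observed - mean(observed))^2)   # relative squared error
rsq <- 1 - rse                                              # coefficient of determination

print(c(mae = mae, rmse = rmse, rse = rse, rsq = rsq))
```

The relative squared error compares the model with a baseline that always predicts the mean of the observed values, which is why the coefficient of determination is one minus that ratio.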
