-
Notifications
You must be signed in to change notification settings - Fork 17
Aggregating results of multiple ensembles
Aggregating the results of multiple ensembles may be helpful if multiple ensembles have been fitted to differect (versions of a) dataset. This can occur, for example, when using multiple imputation*.
Below, an example and experimental functions are provided allowing for aggregation of results, when PREs are fitted to multiple datasets:
We will use an example dataset based on the airquality dataset, with the response (Ozone) converted to a binary factor. We will create a list of datasets, containing subsamples of the original dataset. Note that we employ a binary outcome here, to illustrate how different approaches can be taken to generating and aggregating the predictions. For a continuous outcome, there is only one possible approach.
We first prepare the original dataset:
airq <- airquality[complete.cases(airquality), ]
airq$Ozone <- factor(airq$Ozone > median(airq$Ozone))Create a list with several different versions of the dataset:
set.seed(42)
datasets <- list()
for (i in 1:5) datasets[[i]] <- airq[sample(1:nrow(airq), nrow(airq)*.8),]Create a function that fits PREs to several datasets contained in a list:
pre.agg <- function(datasets, ...) {
result <- list()
for (i in 1:length(datasets)) {
result[[i]] <- pre(datasets[[i]], ...)
}
result
}Load library pre:
library("pre")Illustrate new function:
set.seed(43)
airq.agg <- pre.agg(datasets, formula = Ozone ~ ., family = "binomial")Note that we can used the ellipsis (...) to pass arguments to pre (see ?pre for an overview of arguments that can be specified).
Again, we can use the ellipsis (...) to pass arguments to print.pre (see ?pre:::print.pre):
print.agg <- function(object, ...) {
result <- list()
sink("NULL")
for (i in 1:length(object)) {
result[[i]] <- print(object[[i]], ...)
}
sink()
print(result)
}
print.agg(airq.agg)## [[1]]
## rule coefficient description
## 62 (Intercept) 1.0659771 1
## 1 rule1 -0.7583351 Temp <= 80
## 2 rule2 -0.6683016 Temp <= 77
## 3 rule3 -0.5956491 Temp <= 82
##
## [[2]]
## rule coefficient description
## 76 (Intercept) 1.40983290 1
## 45 rule46 -1.12542864 Wind > 7.4 & Temp <= 82
## 5 rule5 -1.02132310 Temp <= 77
## 6 rule6 -0.72477893 Temp <= 82
## 8 rule8 0.42983680 Temp > 78 & Solar.R > 157
## 61 rule64 0.33538374 Solar.R > 148 & Month <= 8
## 30 rule30 0.17304579 Wind <= 10.3 & Temp > 76
## 22 rule22 -0.10733325 Wind > 7.4 & Temp <= 80
## 2 rule2 -0.04783606 Temp <= 78
## 67 rule70 -0.04162906 Wind > 8 & Temp <= 76
##
## [[3]]
## rule coefficient description
## 72 (Intercept) 0.10999535 1
## 14 rule14 1.06035821 Temp > 78 & Solar.R > 148
## 2 rule2 -0.55207831 Temp <= 78
## 1 rule1 -0.34416744 Temp <= 82
## 3 rule3 -0.16654473 Temp <= 77
## 33 rule36 0.15287944 Temp > 77 & Solar.R > 148
## 8 rule8 -0.05862083 Temp <= 80
##
## [[4]]
## rule coefficient description
## 77 (Intercept) 0.59725073 1
## 58 rule59 0.91439182 Temp > 77 & Solar.R > 148
## 41 rule42 -0.72497066 Wind > 8 & Temp <= 80
## 6 rule6 -0.59529999 Temp <= 82
## 67 rule68 -0.39683663 Wind > 8 & Temp <= 82
## 1 rule1 -0.03881522 Temp <= 78
##
## [[5]]
## rule coefficient description
## 98 (Intercept) 1.359001129 1
## 66 rule67 -0.977975550 Temp <= 82 & Day <= 23
## 2 rule2 -0.888961378 Temp <= 82
## 12 rule12 0.664921163 Solar.R > 149 & Temp > 77
## 17 rule17 -0.554631889 Temp <= 80 & Wind > 9.2
## 47 rule47 -0.446569958 Temp <= 77 & Day > 17
## 39 rule39 0.352016861 Solar.R > 175 & Month <= 8
## 86 rule88 -0.345194349 Solar.R <= 273 & Wind > 7.4
## 1 rule1 -0.239940247 Temp <= 77
## 88 rule90 0.198040668 Wind <= 10.3 & Solar.R > 148
## 53 rule53 -0.142824604 Temp <= 77 & Wind <= 11.5
## 57 rule57 -0.038575297 Temp <= 80 & Month > 6
## 29 rule29 0.010655689 Temp > 78 & Solar.R > 201
## 25 rule25 -0.005047176 Temp <= 77 & Day > 13
Again, we can use the ellipsis (...) to pass any argument to summary.pre (see ?pre:::summary.pre for the arguments that can be specified).
summary.agg <- function(object, ...) {
for (i in 1:length(object)) summary(object[[i]], ...)
}
summary.agg(airq.agg)##
## Final ensemble with cv error within 1se of minimum:
## lambda = 0.07568624
## number of terms = 3
## mean cv error (se) = 0.9909853 (0.06623608)
##
## cv error type : Binomial Deviance
##
##
## Final ensemble with cv error within 1se of minimum:
## lambda = 0.036382
## number of terms = 9
## mean cv error (se) = 0.8142007 (0.09033872)
##
## cv error type : Binomial Deviance
##
##
## Final ensemble with cv error within 1se of minimum:
## lambda = 0.07049046
## number of terms = 6
## mean cv error (se) = 0.9651525 (0.1009328)
##
## cv error type : Binomial Deviance
##
##
## Final ensemble with cv error within 1se of minimum:
## lambda = 0.05940188
## number of terms = 5
## mean cv error (se) = 0.9504962 (0.08500335)
##
## cv error type : Binomial Deviance
##
##
## Final ensemble with cv error within 1se of minimum:
## lambda = 0.03194007
## number of terms = 13
## mean cv error (se) = 0.8168138 (0.09947524)
##
## cv error type : Binomial Deviance
For averaging over predictions, there is only one option for continuous outcomes. For non-continuous outcomes, we can average over the linear
predictor, or over the predicted values on the scale of the response. I do not know which would be more appropriate. The resulting predicted
values will be highly correlated, though. Again, the ellipsis (...) can be used to pass arguments to predict.pre (see ?pre:::predict.pre for the arguments that can be specified):
predict.agg <- function(object, newdata, ...) {
rowMeans(sapply(object, predict, newdata = newdata, ...))
}Return predictions on the scale of the linear predictor (default in
predict.pre):
lin_preds <- predict.agg(airq.agg, newdata = airq)
lin_preds## 1 2 3 4 7 8
## -0.47879075 -0.97147025 -1.21944223 -1.10856491 -0.95803040 -1.28651898
## 9 12 13 14 15 16
## -1.28651898 -1.13799564 -0.95803040 -1.10957434 -1.28752841 -1.10957434
## 17 18 19 20 21 22
## -1.08100942 -1.37684240 -1.19888833 -1.40540732 -1.40540732 -1.17032341
## 23 24 28 29 30 31
## -1.40540732 -1.18124729 -1.18124729 0.68923689 0.97355492 -0.37351906
## 38 40 41 44 47 48
## -0.25583780 1.69250019 1.69250019 -0.17647048 -1.14172248 -1.08100942
## 49 50 51 62 63 64
## -1.29448095 -1.40540732 -1.40540732 1.76671748 1.69767862 0.56785908
## 66 67 68 69 70 71
## 1.69418297 1.69250019 1.76671748 1.76671748 1.76671748 1.69418297
## 73 74 76 77 78 79
## -1.15675392 0.42110727 -0.73894037 0.94135100 0.63689795 1.76671748
## 80 81 82 85 86 87
## 1.76458635 1.62346132 -0.75391749 1.76671748 1.69767862 -0.06024269
## 88 89 90 91 92 93
## 0.83937258 1.76671748 1.76671748 1.76671748 0.76345419 0.11765412
## 94 95 99 100 101 104
## -0.29044696 0.11765412 1.76671748 1.69767862 1.69767862 1.62133019
## 105 106 108 109 110 111
## 0.49364178 -0.03913620 -1.28087342 -0.05345213 -0.75391749 -0.38276768
## 112 113 114 116 117 118
## -0.30855039 -1.23875153 -1.38455746 0.31496078 1.13694611 1.69767862
## 120 121 122 123 124 125
## 1.69767862 1.76671748 1.76671748 1.76458635 1.62710623 1.62710623
## 126 127 128 129 130 131
## 1.62710623 1.62710623 0.94302061 0.83937258 -0.09233174 -0.44603051
## 132 133 134 135 136 137
## -1.32279896 -1.28319082 0.35616166 -1.29423404 -0.58937677 -1.32380839
## 138 139 140 141 142 143
## -1.32380839 0.20484857 -1.29524347 -1.41312238 -1.37351425 0.50761514
## 144 145 146 147 148 149
## -1.38455746 -1.30219601 -0.25583780 -1.21752727 -1.18896235 -0.51871424
## 151 152 153
## -1.18896235 -0.87391363 -1.21752727
Return predictions on the scale of the response:
prob_preds <- predict.agg(airq.agg, newdata = airq, type = "response")
prob_preds## 1 2 3 4 7 8 9 12
## 0.3877162 0.2851789 0.2309900 0.2489050 0.2818367 0.2209226 0.2209226 0.2435769
## 13 14 15 16 17 18 19 20
## 0.2818367 0.2487154 0.2207862 0.2487154 0.2542605 0.2104713 0.2338699 0.2078445
## 21 22 23 24 28 29 30 31
## 0.2078445 0.2382024 0.2078445 0.2377911 0.2377911 0.6619755 0.7076285 0.4138447
## 38 40 41 44 47 48 49 50
## 0.4384131 0.8333282 0.8333282 0.4580023 0.2429138 0.2542605 0.2198593 0.2078445
## 51 62 63 64 66 67 68 69
## 0.2078445 0.8391008 0.8338747 0.6366888 0.8335685 0.8333282 0.8391008 0.8391008
## 70 71 73 74 76 77 78 79
## 0.8391008 0.8335685 0.2403972 0.6006015 0.3317624 0.7128066 0.6526960 0.8391008
## 80 81 82 85 86 87 88 89
## 0.8389612 0.8271835 0.3282217 0.8391008 0.8338747 0.4848920 0.6907667 0.8391008
## 90 91 92 93 94 95 99 100
## 0.8391008 0.8391008 0.6771176 0.5271226 0.4300309 0.5271226 0.8391008 0.8338747
## 101 104 105 106 108 109 110 111
## 0.8338747 0.8269658 0.6186853 0.4881552 0.2223857 0.4850475 0.3282217 0.4061816
## 112 113 114 116 117 118 120 121
## 0.4243932 0.2288212 0.2097322 0.5739233 0.7473804 0.8338747 0.8338747 0.8391008
## 122 123 124 125 126 127 128 129
## 0.8391008 0.8389612 0.8274370 0.8274370 0.8274370 0.8274370 0.7083077 0.6907667
## 130 131 132 133 134 135 136 137
## 0.4758638 0.3910778 0.2163077 0.2213757 0.5845020 0.2198919 0.3614919 0.2161876
## 138 139 140 141 142 143 144 145
## 0.2161876 0.5478032 0.2197590 0.2071852 0.2107970 0.6199352 0.2097322 0.2188564
## 146 147 148 149 151 152 153
## 0.4384131 0.2313794 0.2363751 0.3778404 0.2363751 0.3043952 0.2313794
The two averaging methods yield highly correlated, but not identical, predicted probabilities:
cor(exp(lin_preds) / (1 + exp(lin_preds)), prob_preds)## [1] 0.9999314
The coef method should return the averaged / aggregated final PRE. That is, it returns:
-
One averaged intercept;
-
All rules and linear terms, with their coefficients scaled by the number of datasets;
-
In presence of identical rules and linear terms, it aggregates those rules and their coefficients into one rule / term, and adds together the scaled coefficients.
Note that linear terms that do not have the same winsorizing points will not be
aggregated. Note that the labels of rules and variables may overlap between different datasets (e.g., the label rule 12 or may appear multiple times in the aggregated ensemble, but each rule 12 will have different conditions).
Again, using the ellipsis (...), arguments can be passed to coef.pre:
coef.agg <- function(object, ...) {
coefs <- coef(object[[1]], ...)
coefs <- coefs[coefs$coefficient != 0,]
for (i in 2:length(object)) {
coefs_tmp <- coef(object[[i]], ...)
coefs_tmp <- coefs_tmp[coefs_tmp$coefficient != 0,]
## Add intercepts:
coefs[coefs$rule == "(Intercept)", "coefficient"] <-
coefs[coefs$rule == "(Intercept)", "coefficient"] +
coefs_tmp[coefs_tmp$rule == "(Intercept)", "coefficient"]
## Append other terms rest to coefs:
coefs <- rbind(coefs, coefs_tmp[coefs_tmp$rule!= "(Intercept)", ])
}
## Divide coefficients by the number of datasets:
coefs$coefficient <- coefs$coefficient / length(object)
## Identify identical rules:
duplicates <- which(duplicated(coefs$description))
for (i in duplicates) {
first_match <- which(coefs$description == coefs$description[i])[1]
## Add the coefficients:
coefs$coefficient[first_match] <-
coefs$coefficient[first_match] + coefs$coefficient[i]
}
## Remove duplicates:
coefs <- coefs[-duplicates, ]
## Return results:
coefs
}
coef.agg(airq.agg)## rule coefficient description
## 62 (Intercept) 0.908411448 1
## 1 rule1 -0.163391189 Temp <= 80
## 2 rule2 -0.419221939 Temp <= 77
## 3 rule3 -0.629771375 Temp <= 82
## 45 rule46 -0.225085728 Wind > 7.4 & Temp <= 82
## 8 rule8 0.085967360 Temp > 78 & Solar.R > 157
## 61 rule64 0.067076748 Solar.R > 148 & Month <= 8
## 30 rule30 0.034609158 Wind <= 10.3 & Temp > 76
## 22 rule22 -0.021466650 Wind > 7.4 & Temp <= 80
## 21 rule2 -0.127745917 Temp <= 78
## 67 rule70 -0.008325812 Wind > 8 & Temp <= 76
## 14 rule14 0.212071642 Temp > 78 & Solar.R > 148
## 33 rule36 0.213454252 Temp > 77 & Solar.R > 148
## 41 rule42 -0.144994131 Wind > 8 & Temp <= 80
## 671 rule68 -0.079367326 Wind > 8 & Temp <= 82
## 66 rule67 -0.195595110 Temp <= 82 & Day <= 23
## 121 rule12 0.132984233 Solar.R > 149 & Temp > 77
## 17 rule17 -0.110926378 Temp <= 80 & Wind > 9.2
## 47 rule47 -0.089313992 Temp <= 77 & Day > 17
## 39 rule39 0.070403372 Solar.R > 175 & Month <= 8
## 86 rule88 -0.069038870 Solar.R <= 273 & Wind > 7.4
## 88 rule90 0.039608134 Wind <= 10.3 & Solar.R > 148
## 53 rule53 -0.028564921 Temp <= 77 & Wind <= 11.5
## 57 rule57 -0.007715059 Temp <= 80 & Month > 6
## 29 rule29 0.002131138 Temp > 78 & Solar.R > 201
## 25 rule25 -0.001009435 Temp <= 77 & Day > 13