new dataset - examples for classification

ModelOriented · Aug 3, 2018 · d9b6af9
1 parent 4972bbf
commit d9b6af9
Showing 6 changed files with 91 additions and 1 deletion.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,6 +1,6 @@
 Package: DALEX
 Title: Descriptive mAchine Learning EXplanations
-Version: 0.2.3
+Version: 0.2.4
 Authors@R: person("Przemyslaw", "Biecek", email = "przemyslaw.biecek@gmail.com", role = c("aut", "cre"))
 Description: Machine Learning (ML) models are widely used and have various applications in classification 
   or regression. Models created with boosting, bagging, stacking or similar techniques are often

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,7 @@
+DALEX 0.2.4
+----------------------------------------------------------------
+* New datasets `HR` and `HRTest`. The target variable is a factor with three levels. They are used in examples for classification.
+
 DALEX 0.2.3
 ----------------------------------------------------------------
 * Small fixes in `variable_response()` to better support `gbm` models (c8393120ffb05e2f3c70b0143c4e92dc91f6c823).

diff --git a/R/HR.R b/R/HR.R
@@ -0,0 +1,54 @@
+#' Human Resources Data
+#'
+#' Datasets \code{HR} and \code{HRTest} are artificial, generated from the same model.
+#' The structure of the dataset is based on real data from a Human Resources department, with
+#' information on which employees were promoted and which were fired.
+#'
+#' Values are generated so as to:
+#' - have an interaction between age and gender for the 'fired' variable
+#' - have a non-monotonic relation for the salary variable
+#' - have linear effects for hours and evaluation.
+#'
+#' \itemize{
+#' \item gender - gender of an employee.
+#' \item age - age of an employee at the moment of evaluation.
+#' \item hours - average number of working hours per week.
+#' \item evaluation - evaluation on a scale from 2 (bad) to 5 (very good).
+#' \item salary - level of salary on a scale from 0 (lowest) to 5 (highest).
+#' \item status - target variable, one of `fired`, `promoted`, or `ok`.
+#' }
+#'
+#' @aliases HRTest
+#' @docType data
+#' @keywords HR
+#' @name HR
+#' @usage data(HR)
+#' @format a data frame with 10000 rows and 6 columns
+NULL
+
+
+# N <- 10000
+# set.seed(1313)
+#
+# gender <- rbinom(N, size = 1, prob = 0.5)
+# age    <- runif(N, 20, 60)
+# hours  <- 35 + 45*runif(N, 0, 1)^2
+# evaluation <- floor(runif(N, 0, 4)) + 2
+# salary <- floor(runif(N, 0, 6))
+#
+#
+# score1 <- 2*(gender - 0.5)*(age-40)/15 + 0.35*(salary - 2.5)^2 - 1.6*(hours > 45)
+# score2 <- 2*(evaluation > 3.5) + (hours-50)/15
+#
+# y1 <- runif(N) < pnorm(score1 - mean(score1))
+# y2 <- runif(N) < pnorm(score2 - mean(score2))
+#
+# HR <- data.frame(gender = factor(ifelse(gender == 0, "female", "male")),
+#                  age, hours, evaluation, salary,
+#                  status = factor(ifelse(y1 == 1, "fired",
+#                                         ifelse(y2 == 1, "promoted",
+#                                                "ok"))),
+#                  y1 = factor(y1),
+#                  y2 = factor(y2))
+# HR <- HR[!(y1&y2),1:6]
+
diff --git a/data/HR.rda b/data/HR.rda
diff --git a/data/HRTest.rda b/data/HRTest.rda
diff --git a/man/HR.Rd b/man/HR.Rd
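
The commented-out generator in R/HR.R can be run directly to inspect the properties its documentation describes. The following is a minimal sketch under assumptions, not the author's verification procedure: the names `HR_sim`, `fired_by_salary` and `fired_by_gender_age` are illustrative, the helper columns `y1` and `y2` are left out of the data frame, and nothing here confirms that the shipped `HR.rda` or `HRTest.rda` were produced with exactly this seed.

```r
# Regenerate the data following the commented script in R/HR.R.
N <- 10000
set.seed(1313)

gender     <- rbinom(N, size = 1, prob = 0.5)
age        <- runif(N, 20, 60)
hours      <- 35 + 45 * runif(N, 0, 1)^2
evaluation <- floor(runif(N, 0, 4)) + 2
salary     <- floor(runif(N, 0, 6))

# Latent scores: score1 drives "fired", score2 drives "promoted".
score1 <- 2 * (gender - 0.5) * (age - 40) / 15 +
  0.35 * (salary - 2.5)^2 - 1.6 * (hours > 45)
score2 <- 2 * (evaluation > 3.5) + (hours - 50) / 15

y1 <- runif(N) < pnorm(score1 - mean(score1))
y2 <- runif(N) < pnorm(score2 - mean(score2))

# HR_sim is an illustrative name; the package object is called HR.
HR_sim <- data.frame(
  gender = factor(ifelse(gender == 0, "female", "male")),
  age, hours, evaluation, salary,
  status = factor(ifelse(y1, "fired", ifelse(y2, "promoted", "ok")))
)
# Drop observations flagged as both fired and promoted, as the script does.
HR_sim <- HR_sim[!(y1 & y2), ]

# Share of fired employees per salary level (non-monotonic relation).
fired_by_salary <- tapply(HR_sim$status == "fired", HR_sim$salary, mean)

# Share of fired employees by gender and age band (interaction).
age_band <- cut(HR_sim$age, breaks = c(20, 40, 60))
fired_by_gender_age <- tapply(HR_sim$status == "fired",
                              list(HR_sim$gender, age_band), mean)

print(round(fired_by_salary, 2))
print(round(fired_by_gender_age, 2))
```

If the generator behaves as documented, `fired_by_salary` should be highest at the extreme salary levels, and `fired_by_gender_age` should show the fired rate rising with age for one gender and falling for the other. Because rows with both `y1` and `y2` set are dropped, the result has fewer than `N` rows.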