# CPO

In [24]:
library("mlr")
devtools::load_all("..")

Loading mlr


In [25]:
df = data.frame(a = 1:3, b = -(1:3) * 10)

**CPO**s are first-class objects in R that represent data manipulation. They can be combined to form networks of operation, they can be attached to `mlr` `Learner`s, and they have tunable Hyperparameters that influence their behaviour.

# Lifecycle of a CPO

## CPO Constructor

In [26]:
print(cpoPca)  # example CPOConstructor

<<CPO pca()>>


In [27]:
class(cpoPca)

CPO constructors have parameters that
* set the CPO Hyperparameters
* set the CPO ID (default NULL)
* resetrict the data columns a CPO operates on (`affect.*` parameters)

In [28]:
names(formals(cpoPca))

## CPO

In [29]:
(cpo = cpoScale()) # construct CPO with default Hyperparameter values

scale(center = TRUE, scale = TRUE)

In [30]:
class(cpo)  # CPOs that are not compound are "CPOPrimitive"

In [31]:
summary(cpo)  # detailed printing

Retrafo chain of 1 elements:
scale(center = TRUE, scale = TRUE)

                Type len  Def Constr Req Tunable Trafo
scale.center logical   - TRUE      -   -    TRUE     -
scale.scale  logical   - TRUE      -   -    TRUE     -

In [32]:
# Functions that work on CPOs:
getParamSet(cpo)

                Type len  Def Constr Req Tunable Trafo
scale.center logical   - TRUE      -   -    TRUE     -
scale.scale  logical   - TRUE      -   -    TRUE     -

In [33]:
getHyperPars(cpo)

In [34]:
setHyperPars(cpo, scale.center = FALSE)

scale(center = FALSE, scale = TRUE)

In [35]:
getCPOId(cpo)

In [36]:
setCPOId(cpo, "MYID")

MYID<scale>(center = TRUE, scale = TRUE)

In [37]:
getCPOName(cpo)
getCPOName(setCPOId(cpo, "MYID"))  # the name includes the ID

In [38]:
getCPOAffect(cpo)  # empty, since no affect set
getCPOAffect(cpoPca(affect.pattern = "Width$"))

In [39]:
getCPOProperties(cpo)  # see properties explanation below

In [40]:
# these are internals
getCPOKind(cpo)  # trafo, retrafo, inverter
getCPOBound(cpo)  # databound, targetbound, both

### Exporting Parameters
Sometimes when using many CPOs, their hyperparameters may get messy. CPO enables the user to control which hyperparameter get exported. The parameter "export" can be one of "export.default", "export.set", "export.unset", "export.default.set", "export.default.unset", "export.all", "export.none". "all" and "none" do what one expects; "default" exports the "recommended" parameters; "set" and "unset" export the values that have not been set, or only the values that were set (and are not left as default). "default.set" and "default.unset" work as "set" and "unset", but restricted to the default exported parameters.

In [162]:
(sc = cpoScale())
getParamSet(sc)
cat("---\n")
(sc = cpoScale(export = "export.none"))
getParamSet(sc)
cat("---\n")
(sc = cpoScale(scale = FALSE, export = "export.unset"))
getParamSet(sc)

scale(center = TRUE, scale = TRUE)

                Type len  Def Constr Req Tunable Trafo
scale.center logical   - TRUE      -   -    TRUE     -
scale.scale  logical   - TRUE      -   -    TRUE     -

---


scale()[not exp'd: center = TRUE, scale = TRUE]

[1] "Empty parameter set."

---


scale(center = TRUE)[not exp'd: scale = FALSE]

                Type len  Def Constr Req Tunable Trafo
scale.center logical   - TRUE      -   -    TRUE     -

### CPO Application using `%>>%` or `applyCPO`
`CPO`s can be applied to `data.frame` and `Task` objects.

In [41]:
head(iris) %>>% cpoPca()
# head(getTaskData(applyCPO(cpoPca(), iris.task)))

Species,PC1,PC2,PC3,PC4
setosa,-6.344251,3.699099e-05,-0.10237713,-0.001527648
setosa,-5.909522,-0.29391,0.0139843,0.031126691
setosa,-5.835572,-0.01780612,-0.07507399,0.011978402
setosa,-5.747518,-0.0519258,0.13484436,-0.071787045
setosa,-6.319018,0.135989,-0.10270328,-0.031196772
setosa,-6.882318,0.1859359,0.12770825,0.053118356


### CPO Composition using `%>>%` or `composeCPO`
`CPO` composition results in a new CPO which mostly behaves like a primitive CPO. Exceptions are:
* Compound CPOs have no `id`
* Affect of compound CPOs cannot be retrieved

In [42]:
scale1 = cpoScale()
scale2 = cpoScale()
# scale1 %>>% scale2  # error! parameters 'center' and 'scale' occur in both
compound = setCPOId(scale1, "scale1") %>>% setCPOId(scale2, "scale2")
composeCPO(setCPOId(scale1, "scale1"), setCPOId(scale2, "scale2"))  # same

(scale1<scale> >> scale2<scale>)(scale1.center = TRUE, scale1.scale = TRUE, scale2.center = TRUE, scale2.scale = TRUE)

In [43]:
class(compound)

In [44]:
summary(compound)

Retrafo chain of 2 elements:
scale1<scale>(center = TRUE, scale = TRUE)

                 Type len  Def Constr Req Tunable Trafo
scale1.center logical   - TRUE      -   -    TRUE     -
scale1.scale  logical   - TRUE      -   -    TRUE     -
  ====>
scale2<scale>(center = TRUE, scale = TRUE)

                 Type len  Def Constr Req Tunable Trafo
scale2.center logical   - TRUE      -   -    TRUE     -
scale2.scale  logical   - TRUE      -   -    TRUE     -

In [45]:
getCPOName(compound)

In [46]:
getCPOId(compound)  # error: no ID for compound CPOs
getCPOAffect(compound)  # error: no affect for compound CPOs

ERROR: Error in getCPOId.CPO(compound): Compound CPOs have no IDs.


In [47]:
getParamSet(compound)

                 Type len  Def Constr Req Tunable Trafo
scale1.center logical   - TRUE      -   -    TRUE     -
scale1.scale  logical   - TRUE      -   -    TRUE     -
scale2.center logical   - TRUE      -   -    TRUE     -
scale2.scale  logical   - TRUE      -   -    TRUE     -

In [48]:
getHyperPars(compound)

In [49]:
setHyperPars(compound, scale1.center = TRUE, scale2.center = FALSE)

(scale1<scale> >> scale2<scale>)(scale1.center = TRUE, scale1.scale = TRUE, scale2.center = FALSE, scale2.scale = TRUE)

### Compound CPO decomposition, CPO chaining

In [50]:
as.list(compound)

[[1]]
scale1<scale>(center = TRUE, scale = TRUE)

[[2]]
scale2<scale>(center = TRUE, scale = TRUE)


In [51]:
chainCPO(as.list(compound))  # chainCPO: list CPO -> CPO

(scale1<scale> >> scale2<scale>)(scale1.center = TRUE, scale1.scale = TRUE, scale2.center = TRUE, scale2.scale = TRUE)

### CPO - Learner attachment using `%>>%` or `attachCPO`

In [52]:
lrn = makeLearner("classif.logreg")

In [53]:
(cpolrn = cpo %>>% lrn)  # the new learner has the CPO hyperparameters

Learner classif.logreg.scale from package stats
Type: classif
Name: ; Short name: 
Class: CPOLearner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,scale.center=TRUE,scale.scale=TRUE


In [54]:
attachCPO(compound, lrn)  # attaching compound CPO

Learner classif.logreg.scale.scale from package stats
Type: classif
Name: ; Short name: 
Class: CPOLearner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,scale1.center=TRUE,scale1.scale=TRUE,scale2.center=TRUE,scale2.scale=TRUE


In [55]:
# CPO learner decomposition
getLearnerCPO(cpolrn)  # the CPO
getLearnerBare(cpolrn)  # the Learner

scale(center = TRUE, scale = TRUE)

Learner classif.logreg from package stats
Type: classif
Name: Logistic Regression; Short name: logreg
Class: classif.logreg
Properties: twoclass,numerics,factors,prob,weights
Predict-Type: response
Hyperparameters: model=FALSE


## Retrafo
CPOs perform data-dependent operation. However, when this operation becomes part of a machine-learning process, the operation on predict-data must depend only on the training data.

The `Retrafo` object represents the re-application of a trained CPO

In [56]:
transformed = iris %>>% cpo
head(transformed)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa


In [57]:
retrafo(transformed)

CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

In [58]:
# retrafos are stored as attributes
attributes(transformed)

$names
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"     

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150

$class
[1] "data.frame"

$retrafo
CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]


### Retrafo Inspection
`Retrafo` objects can be inspected using `getRetrafoState`. The state contains the hyperparameters, the `control` object (CPO dependent data representing the data information needed to re-apply the operation), and information about the `Task` / `data.frame` layout used for training (column names, column types) in `data$shapeinfo.input` and `data$shapeinfo.output`.

The state can be manipulated and used to create new `Retrafo`s, using `makeRetrafoFromState`.

In [59]:
(state = getRetrafoState(retrafo(iris %>>% cpoScale())))

$scale.center
[1] TRUE

$scale.scale
[1] TRUE

$control
$control$center
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333 

$control$scale
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.8280661    0.4358663    1.7652982    0.7622377 


$data
$data$shapeinfo.input
<ShapeInfo (input) Sepal.Length: num, Sepal.Width: num, Petal.Length: num, Petal.Width: num, Species: fac>

$data$shapeinfo.output
<ShapeInfo (output)>:
numeric:
<ShapeInfo Sepal.Length: num, Sepal.Width: num, Petal.Length: num, Petal.Width: num>
factor:
<ShapeInfo Species: fac>
other:
<ShapeInfo (empty)>



In [60]:
state$control$center[1] = 1000  # will now subtract 1000 from the first column
new.retrafo = makeRetrafoFromState(cpoScale, state)
head(iris %>>% new.retrafo)

ERROR: Error in IRkernel::main(): Assertion on 'names(bare$par.vals)' failed: Must be a subset of {'center','scale'}.


### Application of Retrafo using `%>>%`, `applyCPO`, or `predict`

In [62]:
head(iris) %>>% retrafo(transformed)
# should give the same as head(transformed), since the same data was used.
# same:
invisible(applyCPO(retrafo(transformed), head(iris)))
invisible(predict(retrafo(transformed), head(iris)))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa


### Retrafos from CPO Learners

In [63]:
cpomodel = train(cpolrn, pid.task)

In [64]:
retrafo(cpomodel)

CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

In [65]:
head(getTaskData(pid.task %>>% retrafo(cpomodel)))
# this is what the model would see, if we predict() it with the model

pregnant,glucose,pressure,triceps,insulin,mass,pedigree,age,diabetes
0.6395305,0.8477713,0.1495433,0.9066791,-0.6924393,0.2038799,0.4681869,1.42506672,pos
-0.8443348,-1.1226647,-0.1604412,0.5305558,-0.6924393,-0.6839762,-0.364823,-0.19054773,neg
1.2330766,1.942458,-0.2637694,-1.2873733,-0.6924393,-1.102537,0.6040037,-0.10551539,pos
-0.8443348,-0.9975577,-0.1604412,0.1544326,0.1232213,-0.4937213,-0.920163,-1.04087112,neg
-1.1411079,0.5037269,-1.5037073,0.9066791,0.7653372,1.4088275,5.481337,-0.02048305,pos
0.3427574,-0.1530851,0.2528715,-1.2873733,-0.6924393,-0.8108128,-0.8175458,-0.27558007,neg


### Retrafos are automatically chained when applying CPOs (!!!)
When executing `data %>>% CPO`, the result has an associated `Retrafo` object. When applying another `CPO`, the `Retrafo` will be the chained operation. This is to make `data %>>% CPO1 %>>% CPO2` the way one expects it to work.

In [66]:
data = head(iris) %>>% pca
retrafo(data)

ERROR: Error in "NULLCPO" %in% class(cpo): object 'pca' not found


In [68]:
data2 = data %>>% cpoScale()
# retrafo(data2) is the same as retrafo(data %>>% pca %>>% scale)
retrafo(data2)

CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

In [70]:
# to interrupt this chain, set retrafo to NULL
retrafo(data) = NULL
data2 = data %>>% cpoScale()
retrafo(data2)

CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

### Retrafo Composition, Decomposition, Chaining

In [71]:
compound.retrafo = retrafo(head(iris) %>>% compound)
compound.retrafo

CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]=>[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

In [72]:
(retrafolist = as.list(compound.retrafo))

[[1]]
CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

[[2]]
CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]


In [73]:
retrafolist[[1]] %>>% retrafolist[[2]]

CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]=>[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

In [74]:
chainCPO(retrafolist)

CPO Retrafo chain
[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]=>[RETRAFO scale(scale.center = TRUE, scale.scale = TRUE)]

## Inverter
Inverters represent the operation of inverting transformations done to prediction columns. They are not usually exposed outside of `Learner` objects, but can be retrieved when retransformed data is tagged using `tagInverse`.

Inverters are currently not fully functional.

In [None]:
# there is currently no example targetbound cpo
logtransform = makeCPOTargetOp("logtransform", .data.dependent = FALSE,
                               .stateless = TRUE, .type = "regr",
  cpo.trafo = {
    target[[1]] = log(target[[1]])
    target
  }, cpo.retrafo = { print(match.call()) })


In [50]:
log.retrafo = retrafo(bh.task %>>% logtransform())  # get a target-bound retrafo
getCPOKind(log.retrafo)  # logtransform is *stateless*, so it is a retrafo *and* an inverter
getCPOBound(log.retrafo)

In [51]:
inverter(bh.task %>>% log.retrafo)

NULLCPO

In [52]:
#inverter(tagInvert(bh.task) %>>% log.retrafo)
# currently not implemented :-/

Inverting is done with the `invert` function.

In [53]:
log.bh = bh.task %>>% logtransform()
log.prediction = predict(train("regr.lm", log.bh), log.bh)

In [54]:
# invert(retrafo(log.bh), log.prediction)  # not implemented :-/
# invert(retrafo(log.bh), log.prediction$data["response"])  # not implemented :-/


# CPO Properties
CPOs contain information about the kind of data they can work with, and what kind of data they produce. `getCPOProperties` returns a list with the slots `properties`, `properties.data`, `properties.needed`, `properties.adding`, indicating the kind of data a CPO can handle, the kind of data it needs the data receiver (e.g. attached learner) to have, and the properties it adds to a given learner. An example is a CPO that converts factors to numerics: The receiving learner needs to handle numerics, so `properties.needed = "numerics"`, but it *adds* the ability to handle factors (since they are converted), so `properties.adding = c("factors", "ordered")`. `properties.data` is only different from `properties` if `affect.*` parameters are given. In that case, `properties.data` determines what properties the selected subset of columns must have.

In [76]:
getCPOProperties(cpoDummyEncode())

In [77]:
train("classif.geoDA", bc.task)  # gives an error

ERROR: Error in checkLearnerBeforeTrain(task, learner, weights): Task 'BreastCancer-example' has factor inputs in 'Cl.thickness, Cell.size, Cell.shape, Marg.adhes...', but learner 'classif.geoDA' does not support that!


In [78]:
train(cpoDummyEncode(reference.cat = TRUE) %>>% makeLearner("classif.geoDA"), bc.task)

Model for learner.id=classif.geoDA.dummyencode; learner.class=CPOLearner
Trained on: task.id = BreastCancer-example; obs = 683; features = 9
Hyperparameters: validation=NULL,dummyencode.reference.cat=TRUE

In [79]:
getLearnerProperties("classif.geoDA")

In [80]:
getLearnerProperties(cpoDummyEncode(TRUE) %>>% makeLearner("classif.geoDA"))

# Special CPOs

## NULLCPO
`NULLCPO` is the neutral element of `%>>%`. It is returned by some functions when no other CPO or Retrafo is present.

In [81]:
NULLCPO

NULLCPO

In [82]:
is.nullcpo(NULLCPO)

In [83]:
NULLCPO %>>% cpoScale()

scale(center = TRUE, scale = TRUE)

In [84]:
NULLCPO %>>% NULLCPO

NULLCPO

In [85]:
print(as.list(NULLCPO))

list()


In [86]:
chainCPO(list())

NULLCPO

## CPO Applicator
A simple CPO with one parameter which gets applied to the data as CPO. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument's parameters to the outside.

In [89]:
cpa = cpoApply()
summary(cpa)

Retrafo chain of 1 elements:
apply()

             Type len Def Constr Req Tunable Trafo
apply.cpo untyped   -   -      -   -    TRUE     -

In [91]:
head(iris %>>% setHyperPars(cpa, apply.cpo = cpoScale()))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa


In [93]:
head(iris %>>% setHyperPars(cpa, apply.cpo = cpoPca()))

Species,PC1,PC2,PC3,PC4
setosa,-5.912747,2.302033,0.007401536,0.003087706
setosa,-5.572482,1.971826,0.244592251,0.097552888
setosa,-5.446977,2.095206,0.015029262,0.018013331
setosa,-5.436459,1.870382,0.02050488,-0.078491501
setosa,-5.875645,2.32829,-0.110338269,-0.060719326
setosa,-6.477598,2.32465,-0.237202487,-0.021419633


In [94]:
# attaching the cpo applicator to a learner gives this learner a "cpo" hyperparameter
# that can be set to any CPO.
getParamSet(cpoApply() %>>% makeLearner("classif.logreg"))

             Type len  Def Constr Req Tunable Trafo
apply.cpo untyped   -    -      -   -    TRUE     -
model     logical   - TRUE      -   -   FALSE     -

## CPO Multiplexer
Combine many CPOs into one, with an extra `selected.cpo` parameter that chooses between them.

In [95]:
cpm = cpoMultiplex(list(cpoScale, cpoPca))
summary(cpm)

Retrafo chain of 1 elements:
multiplex(selected.cpo = scale, scale.center = TRUE, scale.scale = TRUE)

                           Type len   Def    Constr Req Tunable Trafo
multiplex.selected.cpo discrete   - scale scale,pca   -    TRUE     -
multiplex.scale.center  logical   -  TRUE         -   Y    TRUE     -
multiplex.scale.scale   logical   -  TRUE         -   Y    TRUE     -

In [97]:
head(iris %>>% setHyperPars(cpm, multiplex.selected.cpo = "scale"))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa


In [98]:
# every CPO's Hyperparameters are exported
head(iris %>>% setHyperPars(cpm, multiplex.selected.cpo = "scale", multiplex.scale.center = FALSE))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0.8613268,1.1296201,0.3362663,0.140405,setosa
0.8275493,0.9682458,0.3362663,0.140405,setosa
0.7937718,1.0327956,0.3122473,0.140405,setosa
0.776883,1.0005207,0.3602853,0.140405,setosa
0.844438,1.161895,0.3362663,0.140405,setosa
0.9119931,1.2587196,0.4083234,0.28081,setosa


In [100]:
head(iris %>>% setHyperPars(cpm, multiplex.selected.cpo = "pca"))

Species,PC1,PC2,PC3,PC4
setosa,-5.912747,2.302033,0.007401536,0.003087706
setosa,-5.572482,1.971826,0.244592251,0.097552888
setosa,-5.446977,2.095206,0.015029262,0.018013331
setosa,-5.436459,1.870382,0.02050488,-0.078491501
setosa,-5.875645,2.32829,-0.110338269,-0.060719326
setosa,-6.477598,2.32465,-0.237202487,-0.021419633


## Meta-CPO
A CPO that builds data-dependent CPO networks. This is a generalized CPO-Multiplexer that takes a function which decides (from the data, and from user-specified hyperparameters) what CPO operation to perform. Besides optional arguments, the used CPO's Hyperparameters are exported as well. This is a generalization of `cpoMultiplex`; however, `requires` of the involved parameters are not adjusted, since this is impossible in principle.

In [102]:
s.and.p = cpoMeta(logical.param: logical,
.export = list(cpoScale(id = "scale"), 
  cpoPca(id = "pca")),
cpo.build = function(data, target, logical.param, scale, pca) {
  if (logical.param || mean(data[[1]]) > 10) {
    scale %>>% pca
  } else {
    pca %>>% scale
  }
})

In [103]:
 summary(s.and.p())

Retrafo chain of 1 elements:
meta(scale.center = TRUE, scale.scale = TRUE)

                      Type len  Def Constr Req Tunable Trafo
meta.logical.param logical   -    -      -   -    TRUE     -
meta.scale.center  logical   - TRUE      -   -    TRUE     -
meta.scale.scale   logical   - TRUE      -   -    TRUE     -

The resulting CPO `s.and.p` performs scaling and PCA, with the order depending on the parameter `logical.param` and on whether the mean of the data's first column exceeds 10. If either of those is true, the data will be first scaled, then PCA'd, otherwise the order is reversed.
The all CPOs listed in `.export` are passed to the `cpo.build`.

## CBind CPO
`cbind` other CPOs as operation. The `cbinder` makes it possible to build DAGs of CPOs that perform different operations on data and paste the results next to each other.

In [104]:
scale = cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scaled = scale, pcad = scale.pca, original = NULLCPO)

In [106]:
# cpoCbind recognises that "scale.scale" happens before "pca.pca" but is also fed to the
# result directly. The summary draws a (crude) ascii-art graph.
summary(cbinder)

In [107]:
head(iris %>>% cbinder)

scaled.Sepal.Length,scaled.Sepal.Width,scaled.Petal.Length,scaled.Petal.Width,scaled.Species,pcad.Species,pcad.PC1,pcad.PC2,pcad.PC3,pcad.PC4,original.Sepal.Length,original.Sepal.Width,original.Petal.Length,original.Petal.Width,original.Species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa,setosa,-2.257141,-0.4784238,0.12727962,0.024087508,5.1,3.5,1.4,0.2,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa,setosa,-2.074013,0.6718827,0.23382552,0.102662845,4.9,3.0,1.4,0.2,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa,setosa,-2.356335,0.3407664,-0.0440539,0.028282305,4.7,3.2,1.3,0.2,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa,setosa,-2.291707,0.5953999,-0.0909853,-0.06573534,4.6,3.1,1.5,0.2,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa,setosa,-2.381863,-0.6446757,-0.01568565,-0.03580287,5.0,3.6,1.4,0.2,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa,setosa,-2.068701,-1.4842053,-0.02687825,0.006586116,5.4,3.9,1.7,0.4,setosa


In [108]:
# the unnecessary copies of "Species" are unfortunate. Remove them with cpoSelect:
selector = mlr:::cpoSelect(type = "numeric")
cbinder.select = cpoCbind(scaled = selector %>>% scale, pcad = selector %>>% scale.pca, original = NULLCPO)
cbinder.select
head(iris %>>% cbinder)

scaled.Sepal.Length,scaled.Sepal.Width,scaled.Petal.Length,scaled.Petal.Width,scaled.Species,pcad.Species,pcad.PC1,pcad.PC2,pcad.PC3,pcad.PC4,original.Sepal.Length,original.Sepal.Width,original.Petal.Length,original.Petal.Width,original.Species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa,setosa,-2.257141,-0.4784238,0.12727962,0.024087508,5.1,3.5,1.4,0.2,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa,setosa,-2.074013,0.6718827,0.23382552,0.102662845,4.9,3.0,1.4,0.2,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa,setosa,-2.356335,0.3407664,-0.0440539,0.028282305,4.7,3.2,1.3,0.2,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa,setosa,-2.291707,0.5953999,-0.0909853,-0.06573534,4.6,3.1,1.5,0.2,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa,setosa,-2.381863,-0.6446757,-0.01568565,-0.03580287,5.0,3.6,1.4,0.2,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa,setosa,-2.068701,-1.4842053,-0.02687825,0.006586116,5.4,3.9,1.7,0.4,setosa


In [109]:
# alternatively, we apply the cbinder only to numerical data
head(iris %>>% cpoApply(cbinder, affect.type = "numeric"))

Species,scaled.Sepal.Length,scaled.Sepal.Width,scaled.Petal.Length,scaled.Petal.Width,pcad.PC1,pcad.PC2,pcad.PC3,pcad.PC4,original.Sepal.Length,original.Sepal.Width,original.Petal.Length,original.Petal.Width
setosa,-0.8976739,1.01560199,-1.335752,-1.311052,-2.257141,-0.4784238,0.12727962,0.024087508,5.1,3.5,1.4,0.2
setosa,-1.1392005,-0.13153881,-1.335752,-1.311052,-2.074013,0.6718827,0.23382552,0.102662845,4.9,3.0,1.4,0.2
setosa,-1.3807271,0.32731751,-1.392399,-1.311052,-2.356335,0.3407664,-0.0440539,0.028282305,4.7,3.2,1.3,0.2
setosa,-1.5014904,0.09788935,-1.279104,-1.311052,-2.291707,0.5953999,-0.0909853,-0.06573534,4.6,3.1,1.5,0.2
setosa,-1.0184372,1.24503015,-1.335752,-1.311052,-2.381863,-0.6446757,-0.01568565,-0.03580287,5.0,3.6,1.4,0.2
setosa,-0.535384,1.93331463,-1.165809,-1.048667,-2.068701,-1.4842053,-0.02687825,0.006586116,5.4,3.9,1.7,0.4


# Builtin CPOs

## Listing CPOs
Builtin CPOs can be listed with `listCPO()`.

In [110]:
listCPO()

Unnamed: 0,name,cponame,category,subcategory,description
18,cpoDropConstants,dropconst,data,cleanup,Drop constant or near-constant Features.
19,cpoFixFactors,fixfactors,data,cleanup,Clean up Factorial Features.
12,cpoCollapseFact,collapse.fact,data,factor data preprocessing,Combine rare factors.
13,cpoCollapseFact,collapse.fact,data,feature conversion,Convert Numerics to Ordered by binning.
14,cpoCollapseFact,collapse.fact,data,feature conversion,Convert all Features to Numerics using as.numeric.
16,cpoDummyEncode,dummyencode,data,feature conversion,Convert factorial columns to numeric columns by dummy encoding them
10,cpoImpactEncodeClassif,impact.encode.classif,data,feature conversion,Convert factorial columns in classification tasks to numeric columns by impact encoding them
11,cpoImpactEncodeRegr,impact.encode.regr,data,feature conversion,Convert factorial columns in regression tasks to numeric columns by impact encoding them
9,cpoProbEncode,prob.encode,data,feature conversion,Convert factorial columns in classification tasks to numeric columns by probability encoding them
17,cpoSelect,select,data,feature selection,"Select features from a data set by type, column name, or column index."


## cpoScale
Implements the `base::scale` function.

In [111]:
df %>>% cpoScale()

a,b
-1,1
0,0
1,-1


In [112]:
df %>>% cpoScale(scale = FALSE)  # center = TRUE

a,b
-1,10
0,0
1,-10


## cpoPca
Implements `stats::prcomp`. No scaling or centering is performed.

In [113]:
df %>>% cpoPca()

PC1,PC2
-10.04988,0
-20.09975,0
-30.14963,0


In [115]:
df %>>% cpoPca()

PC1,PC2
-10.04988,0
-20.09975,0
-30.14963,0


## cpoDummyEncode
Dummy encoding of factorial variables. Optionally uses the first factor as reference variable.

In [116]:
head(iris %>>% cpoDummyEncode())

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Speciessetosa,Speciesversicolor,Speciesvirginica
5.1,3.5,1.4,0.2,1,0,0
4.9,3.0,1.4,0.2,1,0,0
4.7,3.2,1.3,0.2,1,0,0
4.6,3.1,1.5,0.2,1,0,0
5.0,3.6,1.4,0.2,1,0,0
5.4,3.9,1.7,0.4,1,0,0


In [117]:
head(iris %>>% cpoDummyEncode(reference.cat = TRUE))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Speciesversicolor,Speciesvirginica
5.1,3.5,1.4,0.2,0,0
4.9,3.0,1.4,0.2,0,0
4.7,3.2,1.3,0.2,0,0
4.6,3.1,1.5,0.2,0,0
5.0,3.6,1.4,0.2,0,0
5.4,3.9,1.7,0.4,0,0


## cpoSelect
Select to use only certain columns of a dataset. Select by column index, name, or regex pattern.

In [118]:
head(iris %>>% cpoSelect(pattern = "Width"))

Sepal.Width,Petal.Width
3.5,0.2
3.0,0.2
3.2,0.2
3.1,0.2
3.6,0.2
3.9,0.4


In [119]:
# selection is additive
head(iris %>>% cpoSelect(pattern = "Width", type = "factor"))

Sepal.Width,Petal.Width,Species
3.5,0.2,setosa
3.0,0.2,setosa
3.2,0.2,setosa
3.1,0.2,setosa
3.6,0.2,setosa
3.9,0.4,setosa


## cpoDropConstants
Drops constant features or numerics, with variable tolerance

In [120]:
head(iris) %>>% cpoDropConstants()  # drops 'species'
head(iris) %>>% cpoDropConstants(abs.tol = 0.2)  # also drops 'Petal.Width'

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4


Sepal.Length,Sepal.Width,Petal.Length
5.1,3.5,1.4
4.9,3.0,1.4
4.7,3.2,1.3
4.6,3.1,1.5
5.0,3.6,1.4
5.4,3.9,1.7


## cpoFixFactors
Drops unused factors and makes sure prediction data has the same factor levels as training data.

In [121]:
levels(iris$Species)

In [122]:
irisfix = head(iris) %>>% cpoFixFactors()  # Species only has level 'setosa' in train
levels(irisfix$Species)

In [123]:
rf = retrafo(irisfix)
iris[c(1, 100, 140), ]
iris[c(1, 100, 140), ] %>>% rf

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
100,5.7,2.8,4.1,1.3,versicolor
140,6.9,3.1,5.4,2.1,virginica


Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
100,5.7,2.8,4.1,1.3,
140,6.9,3.1,5.4,2.1,


## cpoMissingIndicators
Creates columns indicating missing data. Most useful in combination with cpoCbind.

In [124]:
impdata = df
impdata[[1]][1] = NA
impdata

a,b
,-10
2.0,-20
3.0,-30


In [125]:
impdata %>>% cpoMissingIndicators()
impdata %>>% cpoCbind(NULLCPO, dummy = cpoMissingIndicators())

a
True
False
False


a,b,dummy.a
,-10,True
2.0,-20,False
3.0,-30,False


## cpoApplyFun
Apply an univariate function to data columns

In [164]:
head(iris %>>% cpoApplyFun(function(x) sqrt(x) - 10, affect.type = "numeric"))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
-7.741682,-8.129171,-8.816784,-9.552786,setosa
-7.786406,-8.267949,-8.816784,-9.552786,setosa
-7.832052,-8.211146,-8.859825,-9.552786,setosa
-7.855239,-8.239318,-8.775255,-9.552786,setosa
-7.763932,-8.102633,-8.816784,-9.552786,setosa
-7.67621,-8.025158,-8.69616,-9.367544,setosa


## cpoAsNumeric
Convert (non-numeric) features to numeric

In [166]:
head(iris[sample(nrow(iris), 10), ] %>>% cpoAsNumeric())

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
44,5.0,3.5,1.6,0.6,1
118,7.7,3.8,6.7,2.2,3
61,5.0,2.0,3.5,1.0,2
130,7.2,3.0,5.8,1.6,3
138,6.4,3.1,5.5,1.8,3
7,4.6,3.4,1.4,0.3,1


## cpoCollapseFact
Combine low prevalence factors. Set `max.collapsed.class.prevalence` how big the combined factor level may be.

In [180]:
iris2 = iris
iris2$Species = factor(c("a", "b", "c", "b", "b", "c", "b", "c",
                        as.character(iris2$Species[-(1:8)])))
head(iris2, 10)
head(iris2 %>>% cpoCollapseFact(max.collapsed.class.prevalence = 0.2), 10)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,a
4.9,3.0,1.4,0.2,b
4.7,3.2,1.3,0.2,c
4.6,3.1,1.5,0.2,b
5.0,3.6,1.4,0.2,b
5.4,3.9,1.7,0.4,c
4.6,3.4,1.4,0.3,b
5.0,3.4,1.5,0.2,c
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,collapsed
4.9,3.0,1.4,0.2,collapsed
4.7,3.2,1.3,0.2,collapsed
4.6,3.1,1.5,0.2,collapsed
5.0,3.6,1.4,0.2,collapsed
5.4,3.9,1.7,0.4,collapsed
4.6,3.4,1.4,0.3,collapsed
5.0,3.4,1.5,0.2,collapsed
4.4,2.9,1.4,0.2,setosa
4.9,3.1,1.5,0.1,setosa


## cpoModelMatrix
Specify which columns get used, and how they are transformed, using a `formula`.

In [185]:
head(iris %>>% cpoModelMatrix(~0 + Species:Petal.Width))
# use . + ... to retain originals
head(iris %>>% cpoModelMatrix(~0 + . + Species:Petal.Width))

Speciessetosa:Petal.Width,Speciesversicolor:Petal.Width,Speciesvirginica:Petal.Width
0.2,0,0
0.2,0,0
0.2,0,0
0.2,0,0
0.2,0,0
0.4,0,0


Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Speciessetosa,Speciesversicolor,Speciesvirginica,Petal.Width:Speciesversicolor,Petal.Width:Speciesvirginica
5.1,3.5,1.4,0.2,1,0,0,0,0
4.9,3.0,1.4,0.2,1,0,0,0,0
4.7,3.2,1.3,0.2,1,0,0,0,0
4.6,3.1,1.5,0.2,1,0,0,0,0
5.0,3.6,1.4,0.2,1,0,0,0,0
5.4,3.9,1.7,0.4,1,0,0,0,0


## cpoScaleRange
scale values to a given range

In [187]:
head(iris %>>% cpoScaleRange(-1, 1))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
-0.5555556,0.25,-0.8644068,-0.9166667,setosa
-0.6666667,-0.16666667,-0.8644068,-0.9166667,setosa
-0.7777778,0.0,-0.8983051,-0.9166667,setosa
-0.8333333,-0.08333333,-0.8305085,-0.9166667,setosa
-0.6111111,0.33333333,-0.8644068,-0.9166667,setosa
-0.3888889,0.58333333,-0.7627119,-0.75,setosa


## cpoScaleMaxAbs
Multiply features to set the maximum absolute value.

In [191]:
head(iris %>>% cpoScaleMaxAbs(0.1))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0.06455696,0.07954545,0.02028986,0.008,setosa
0.06202532,0.06818182,0.02028986,0.008,setosa
0.05949367,0.07272727,0.01884058,0.008,setosa
0.05822785,0.07045455,0.02173913,0.008,setosa
0.06329114,0.08181818,0.02028986,0.008,setosa
0.06835443,0.08863636,0.02463768,0.016,setosa


## cpoSpatialSign
Normalize values row-wise

In [192]:
head(iris %>>% cpoSpatialSign())

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0.8037728,0.5516088,0.2206435,0.0315205,setosa
0.8281329,0.5070201,0.2366094,0.03380134,setosa
0.8053331,0.5483119,0.2227517,0.03426949,setosa
0.8000302,0.5391508,0.2608794,0.03478392,setosa
0.790965,0.5694948,0.2214702,0.0316386,setosa
0.784175,0.5663486,0.2468699,0.05808704,setosa


## Imputation
There are two *general* and many *specialised* imputation CPOs. The general imputation CPOs have parameters that let them use different imputation methods on different columns. They are a thin wrapper around `mlr`'s `impute()` and `reimpute()` functions. The specialised imputation CPOs each implement exactly one imputation method and are closer to the behaviour of typical CPOs.

### General Imputation Wrappers
`cpoImpute` and `cpoImputeAll` both have parameters very much like `impute()`. The latter assumes that *all* columns of its input is somehow being imputed and can be preprended to a learner to give it the ability to work with missing data. It will, however, throw an error if data is missing after imputation.

In [126]:
impdata %>>% cpoImpute(cols = list(a = imputeMedian()))

a,b
2.5,-10
2.0,-20
3.0,-30


In [127]:
impdata %>>% cpoImpute(cols = list(b = imputeMedian()))  # NAs remain
#impdata %>>% cpoImputeAll(cols = list(b = imputeMedian()))  # error, since NAs remain

a,b
,-10
2.0,-20
3.0,-30


In [128]:
missing.task = makeRegrTask("missing.task", impdata, target = "b")
# the following gives an error, since 'cpoImpute' does not make sure all missings are removed
# and hence does not add the 'missings' property.
#train(cpoImpute(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)
# instead, the following works:
train(cpoImputeAll(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)

Model for learner.id=regr.lm.impute; learner.class=CPOLearner
Trained on: task.id = missing.task; obs = 3; features = 1
Hyperparameters: impute.target.cols=character(0),impute.classes=,impute.cols=a=<ImputeMethod>,impute.dummy.classes=character(0),impute.dummy.cols=character(0),impute.dummy.type=factor,impute.force.dummies=FALSE,impute.impute.new.levels=TRUE,impute.recode.factor.levels=TRUE

### Specialised Imputation Wrappers
There is one for each imputation method.

In [129]:
impdata %>>% cpoImputeConstant(10)

a,b
10,-10
2,-20
3,-30


In [131]:
getTaskData(missing.task %>>% cpoImputeMedian())

a,b
2.5,-10
2.0,-20
3.0,-30


In [132]:
# The specialised impute CPOs are:
listCPO()[listCPO()$category == "imputation" & listCPO()$subcategory == "specialised",
          c("name", "description")]

Unnamed: 0,name,description
44,cpoImputeConstant,Imputation using a constant value.
52,cpoImputeHist,Imputation using random values with probabilities approximating the data.
53,cpoImputeLearner,Imputation using the response of a classification or regression learner.
49,cpoImputeMax,Imputation using constant values shifted above the maximum.
46,cpoImputeMean,Imputation using the mean.
45,cpoImputeMedian,Imputation using the median.
48,cpoImputeMin,Imputation using constant values shifted below the minimum.
47,cpoImputeMode,Imputation using the mode.
51,cpoImputeNormal,Imputation using normally distributed random values.
50,cpoImputeUniform,Imputation using uniformly distributed random values.


## Feature Filtering
There is one *general* and many *specialised* feature filtering CPOs. The general filtering CPO, `cpoFilterFeatures`, is a thin wrapper around `filterFeatures` and takes the filtering method as its argument. The specialised CPOs each call a specific filtering method.

Most arguments of `filterFeatures` are reflected in the CPOs. The exceptions being:
1. for `filterFeatures`, the filter method arguments are given in a list `filter.args`, instead of in `...`
2. The argument `fval` was dropped for the specialised filter CPOs.
3. The argument `mandatory.feat` was dropped. Use `affect.*` parameters to prevent features from being filtered.

In [133]:
head(getTaskData(iris.task %>>% cpoFilterFeatures(method = "variance", perc = 0.5)))

Sepal.Length,Petal.Length,Species
5.1,1.4,setosa
4.9,1.4,setosa
4.7,1.3,setosa
4.6,1.5,setosa
5.0,1.4,setosa
5.4,1.7,setosa


In [134]:
head(getTaskData(iris.task %>>% cpoFilterVariance(perc = 0.5)))

Sepal.Length,Petal.Length,Species
5.1,1.4,setosa
4.9,1.4,setosa
4.7,1.3,setosa
4.6,1.5,setosa
5.0,1.4,setosa
5.4,1.7,setosa


In [135]:
# The specialised filter CPOs are:
listCPO()[listCPO()$category == "featurefilter" & listCPO()$subcategory == "specialised",
          c("name", "description")]

Unnamed: 0,name,description
38,cpoFilterAnova,Filter features using analysis of variance.
24,cpoFilterCarscore,Filter features using correlation-adjusted marginal correlation.
34,cpoFilterChiSquared,Filter features using chi-squared test.
32,cpoFilterGainRatio,Filter features using entropy-based information gain ratio
31,cpoFilterInformationGain,Filter features using entropy-based information gain.
39,cpoFilterKruskal,Filter features using the Kruskal-Wallis rank sum test.
29,cpoFilterLinearCorrelation,Filter features using Pearson correlation.
23,cpoFilterMrmr,"Filter features using 'minimum redundancy, maximum relevance'."
36,cpoFilterOneR,Filter features using the OneR learner.
41,cpoFilterPermutationImportance,Filter features using predictiveness loss upon permutation of a variable.


# Creating Custom CPOs

In [155]:
names(formals(makeCPO))  # see help(makeCPO) for explanation of arguments

In [142]:
# an example 'pca' CPO
# demonstrates the (object based) "separate" CPO API
pca = makeCPO("pca",  # name
  center = TRUE: logical,  # one logical parameter 'center'
  .datasplit= "numeric",  # only handle numeric columns
  .retrafo.format = "separate",  # default, can be omitted
  # cpo.trafo is given as a function body. The function head is added
  # automatically, containing 'data', 'target', and 'center'
  # (since a 'center' parameter was defined)
  cpo.trafo = {
    pcr = prcomp(as.matrix(data), center = center)
    # The following line creates a 'control' object, which will be given
    # to retrafo.
    control = list(rotation = pcr$rotation, center = pcr$center)
    pcr$x  # returning a matrix is ok
  # Just like cpo.trafo, cpo.retrafo is a function body, with implicit
  # arguments 'data', 'control', and 'center'.
  }, cpo.retrafo = {
    scale(as.matrix(data), center = control$center, scale = FALSE) %*%
      control$rotation
  })
head(iris %>>% pca())

Species,PC1,PC2,PC3,PC4
setosa,-2.684126,-0.3193972,0.02791483,0.002262437
setosa,-2.714142,0.1770012,0.21046427,0.09902655
setosa,-2.888991,0.1449494,-0.01790026,0.01996839
setosa,-2.745343,0.318299,-0.03155937,-0.075575817
setosa,-2.728717,-0.3267545,-0.09007924,-0.061258593
setosa,-2.28086,-0.7413304,-0.16867766,-0.024200858


In [151]:
# an example 'scale' CPO
# demonstrates the (functional) "separate" CPO API
scaleC = makeCPO("scale",
  .datasplit = "numeric",
  # .retrafo.format = "separate" is implicit
  cpo.trafo = function(data, target) {
    result = scale(as.matrix(data))
    cpo.retrafo = function(data) {
      # here we can use the 'result' object generated in cpo.trafo
      scale(as.matrix(data), attr(result, "scaled:center"),
	attr(result, "scaled:scale"))
    }
    result
  }, cpo.retrafo = NULL)
head(iris) %>>% scaleC()

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0.5206576,0.3401105,-0.3627381,-0.4082483,setosa
-0.1735525,-1.117506,-0.3627381,-0.4082483,setosa
-0.8677627,-0.5344594,-1.0882144,-0.4082483,setosa
-1.2148677,-0.8259827,0.3627381,-0.4082483,setosa
0.1735525,0.6316338,-0.3627381,-0.4082483,setosa
1.5619728,1.5062037,1.8136906,2.0412415,setosa


In [147]:
# an example constant feature remover CPO
# demonstrates the "combined" CPO API
constFeatRem = makeCPO("constFeatRem",
  .datasplit = "target",
  .retrafo.format = "combined",
  cpo.trafo = function(data, target) {
    cols.keep = names(Filter(function(x) {
	length(unique(x)) > 1
      }, data))
    # the following function will do both the trafo and retrafo
    result = function(data) {
      data[cols.keep]
    }
    result
  }, cpo.retrafo = NULL)
head(iris) %>>% constFeatRem()

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4


In [148]:

# an example 'square' CPO
# demonstrates the "stateless" CPO API
square = makeCPO("scale",
  .datasplit = "numeric",
  .retrafo.format = "stateless",
  cpo.trafo = NULL, # optional, we don't need it since trafo & retrafo same
  cpo.retrafo = function(data) {
    as.matrix(data) * 2
  })
head(iris) %>>% square()

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
10.2,7.0,2.8,0.4,setosa
9.8,6.0,2.8,0.4,setosa
9.4,6.4,2.6,0.4,setosa
9.2,6.2,3.0,0.4,setosa
10.0,7.2,2.8,0.4,setosa
10.8,7.8,3.4,0.8,setosa
