# CPO

In [1]:
library("mlr")

Loading required package: ParamHelpers


In [2]:
df = data.frame(a = 1:3, b = -(1:3) * 10)

**CPO**s are first-class objects in R that represent data manipulation operations. They can be combined to form networks of operations, they can be attached to `mlr` `Learner`s, and they have tunable Hyperparameters that influence their behaviour.

# Lifecycle of a CPO

## CPO Constructor

In [3]:
print(cpoPca)  # example CPOConstructor

<<CPO pca(center = TRUE, scale = FALSE)>>


In [4]:
class(cpoPca)

CPO constructors have parameters that
* set the CPO Hyperparameters
* set the CPO ID (default NULL)
* restrict the data columns a CPO operates on (`affect.*` parameters)

In [5]:
names(formals(cpoPca))

## CPO

In [6]:
(cpo = cpoPca()) # construct CPO with default Hyperparameter values

pca(center = TRUE, scale = FALSE)

In [7]:
class(cpo)  # CPOs that are not compound are "CPOPrimitive"

In [8]:
summary(cpo)  # detailed printing

Retrafo chain of 1 elements:
pca(center = TRUE, scale = FALSE)

          Type len   Def Constr Req Tunable Trafo
center logical   -  TRUE      -   -    TRUE     -
scale  logical   - FALSE      -   -    TRUE     -

In [9]:
# Functions that work on CPOs:
getParamSet(cpo)

          Type len   Def Constr Req Tunable Trafo
center logical   -  TRUE      -   -    TRUE     -
scale  logical   - FALSE      -   -    TRUE     -

In [10]:
getHyperPars(cpo)

In [11]:
setHyperPars(cpo, center = FALSE)

pca(center = FALSE, scale = FALSE)

In [12]:
getCPOId(cpo)

NULL

In [13]:
setCPOId(cpo, "MYID")

pca.MYID(MYID.center = TRUE, MYID.scale = FALSE)

In [14]:
getCPOName(cpo)
getCPOName(setCPOId(cpo, "MYID"))  # the name includes the ID

In [15]:
getCPOAffect(cpo)  # empty, since no affect set
getCPOAffect(cpoPca(affect.pattern = "Width$"))

In [16]:
getCPOProperties(cpo)  # see properties explanation below

In [17]:
getCPOKind(cpo)  # trafo, retrafo, inverter
getCPOBound(cpo)  # databound, targetbound, both

### CPO Application using `%>>%` or `applyCPO`
`CPO`s can be applied to `data.frame` and `Task` objects.

In [18]:
head(iris) %>>% cpoPca()
head(getTaskData(applyCPO(cpoPca(), iris.task)))

Species,PC1,PC2,PC3,PC4
setosa,-0.1634147,0.017230444,-0.11038321,-0.0231625616
setosa,0.332497,-0.189351624,-0.08152883,0.0005612917
setosa,0.3268659,0.101103375,-0.02238439,0.046453773
setosa,0.4202367,0.005523981,0.17106514,-0.0222757931
setosa,-0.1768684,0.140149101,-0.04185224,-0.0194870755
setosa,-0.7393165,-0.074655279,0.08508352,0.0179103657


Species,PC1,PC2,PC3,PC4
setosa,-2.684126,-0.3193972,0.02791483,0.002262437
setosa,-2.714142,0.1770012,0.21046427,0.09902655
setosa,-2.888991,0.1449494,-0.01790026,0.01996839
setosa,-2.745343,0.318299,-0.03155937,-0.075575817
setosa,-2.728717,-0.3267545,-0.09007924,-0.061258593
setosa,-2.28086,-0.7413304,-0.16867766,-0.024200858


### CPO Composition using `%>>%` or `composeCPO`
`CPO` composition results in a new CPO which mostly behaves like a primitive CPO. Exceptions are:
* Compound CPOs have no `id`
* Affect of compound CPOs cannot be retrieved

In [19]:
pca = cpoPca(center = FALSE, scale = FALSE)
scale = cpoScale()
# pca %>>% scale  # error! parameters 'center' and 'scale' occur in both
compound = setCPOId(pca, "pca") %>>% setCPOId(scale, "scale")
composeCPO(setCPOId(pca, "pca"), setCPOId(scale, "scale"))  # same

(pca.pca >> scale.scale)(pca.center = FALSE, pca.scale = FALSE, scale.center = TRUE, scale.scale = TRUE)

In [20]:
class(compound)

In [21]:
summary(compound)

Retrafo chain of 2 elements:
pca.pca(pca.center = FALSE, pca.scale = FALSE)

              Type len   Def Constr Req Tunable Trafo
pca.center logical   -  TRUE      -   -    TRUE     -
pca.scale  logical   - FALSE      -   -    TRUE     -
  ====>
scale.scale(scale.center = TRUE, scale.scale = TRUE)

                Type len  Def Constr Req Tunable Trafo
scale.center logical   - TRUE      -   -    TRUE     -
scale.scale  logical   - TRUE      -   -    TRUE     -

In [22]:
getCPOName(compound)

In [23]:
getCPOId(compound)  # error: no ID for compound CPOs
getCPOAffect(compound)  # error: no affect for compound CPOs

ERROR: Error in getCPOId.CPO(compound): Compound CPOs have no IDs.


In [24]:
getParamSet(compound)

                Type len   Def Constr Req Tunable Trafo
pca.center   logical   -  TRUE      -   -    TRUE     -
pca.scale    logical   - FALSE      -   -    TRUE     -
scale.center logical   -  TRUE      -   -    TRUE     -
scale.scale  logical   -  TRUE      -   -    TRUE     -

In [25]:
getHyperPars(compound)

In [26]:
setHyperPars(compound, pca.center = TRUE, scale.center = TRUE)

(pca.pca >> scale.scale)(pca.center = TRUE, pca.scale = FALSE, scale.center = TRUE, scale.scale = TRUE)

### Compound CPO decomposition, CPO chaining

In [27]:
as.list(compound)

[[1]]
pca.pca(pca.center = FALSE, pca.scale = FALSE)

[[2]]
scale.scale(scale.center = TRUE, scale.scale = TRUE)


In [28]:
chainCPO(as.list(compound))  # chainCPO: list of CPOs -> CPO

(pca.pca >> scale.scale)(pca.center = FALSE, pca.scale = FALSE, scale.center = TRUE, scale.scale = TRUE)

### CPO - Learner attachment using `%>>%` or `attachCPO`

In [29]:
lrn = makeLearner("classif.logreg")

In [30]:
(cpolrn = cpo %>>% lrn)  # the new learner has the CPO hyperparameters

Learner classif.logreg.pca from package stats
Type: classif
Name: ; Short name: 
Class: CPOS3Learner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,center=TRUE,scale=FALSE


In [31]:
attachCPO(compound, lrn)  # attaching compound CPO

Learner classif.logreg.scale.pca from package stats
Type: classif
Name: ; Short name: 
Class: CPOS3Learner
Properties: numerics,factors,prob,twoclass
Predict-Type: response
Hyperparameters: model=FALSE,pca.center=FALSE,pca.scale=FALSE,scale.center=TRUE,scale.scale=TRUE


In [32]:
# CPO learner decomposition
getLearnerCPO(cpolrn)  # the CPO
getLearnerBare(cpolrn)  # the Learner

pca(center = TRUE, scale = FALSE)

Learner classif.logreg from package stats
Type: classif
Name: Logistic Regression; Short name: logreg
Class: classif.logreg
Properties: twoclass,numerics,factors,prob,weights
Predict-Type: response
Hyperparameters: model=FALSE


## Retrafo
CPOs perform data-dependent operations. However, when such an operation becomes part of a machine-learning process, the operation on predict-data must depend only on the training data.

The `Retrafo` object represents the re-application of a trained CPO.
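
The idea can be sketched in base R without any CPO functions, using `stats::prcomp` (which the builtin `cpoPca` is listed as using). The split into `train.data` and `new.data` is an arbitrary choice made for this illustration only: centering and rotation are estimated on the training rows and then re-applied unchanged to the new rows.

```r
# Base-R sketch of "train once, re-apply later"; no CPO functions are used here.
train.idx = seq(1, nrow(iris), by = 2)   # arbitrary split, for illustration only
train.data = iris[train.idx, 1:4]
new.data = iris[-train.idx, 1:4]

# "training": estimate centering and rotation from the training rows
fit = prcomp(train.data, center = TRUE, scale. = FALSE)

# "re-application": new rows are centered with the *training* means and
# rotated with the *training* rotation; nothing is re-estimated
new.pcs = predict(fit, new.data)

# re-applying to the training data reproduces the training result
stopifnot(isTRUE(all.equal(predict(fit, train.data), fit$x, check.attributes = FALSE)))
head(new.pcs)
```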

In [33]:
transformed = iris %>>% cpo
head(transformed)

Species,PC1,PC2,PC3,PC4
setosa,-2.684126,-0.3193972,0.02791483,0.002262437
setosa,-2.714142,0.1770012,0.21046427,0.09902655
setosa,-2.888991,0.1449494,-0.01790026,0.01996839
setosa,-2.745343,0.318299,-0.03155937,-0.075575817
setosa,-2.728717,-0.3267545,-0.09007924,-0.061258593
setosa,-2.28086,-0.7413304,-0.16867766,-0.024200858


In [34]:
retrafo(transformed)

CPO Retrafo chain
[RETRAFO pca(center = TRUE, scale = FALSE)]

In [35]:
# retrafos are stored as attributes
attributes(transformed)

$names
[1] "Species" "PC1"     "PC2"     "PC3"     "PC4"    

$row.names
  [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
 [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
 [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
 [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
 [91]  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108
[109] 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126
[127] 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144
[145] 145 146 147 148 149 150

$class
[1] "data.frame"

$retrafo
CPO Retrafo chain
[RETRAFO pca(center = TRUE, scale = FALSE)]


### Retrafo Inspection
`Retrafo` objects can be inspected using `getRetrafoState`. The state contains the hyperparameters, the `control` object (CPO-dependent data holding the information needed to re-apply the operation), and information about the layout of the `Task` / `data.frame` used for training (column names, column types) in `data$shapeinfo.input` and `data$shapeinfo.output`.

The state can be manipulated and used to create new `Retrafo`s, using `makeRetrafoFromState`.

In [36]:
(state = getRetrafoState(retrafo(iris %>>% cpoScale())))

$center
[1] TRUE

$scale
[1] TRUE

$control
$control$center
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
    5.843333     3.057333     3.758000     1.199333 

$control$scale
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
   0.8280661    0.4358663    1.7652982    0.7622377 


$data
$data$shapeinfo.input
<ShapeInfo (input) Sepal.Length: num, Sepal.Width: num, Petal.Length: num, Petal.Width: num, Species: fac>

$data$shapeinfo.output
<ShapeInfo (output)>:
numeric:
<ShapeInfo Sepal.Length: num, Sepal.Width: num, Petal.Length: num, Petal.Width: num>
factor:
<ShapeInfo Species: fac>
other:
<ShapeInfo (empty)>



In [37]:
state$control$center[1] = 1000  # will now subtract 1000 from the first column
new.retrafo = makeRetrafoFromState(cpoScale, state)
head(iris %>>% new.retrafo)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
-1201.474,1.01560199,-1.335752,-1.311052,setosa
-1201.716,-0.13153881,-1.335752,-1.311052,setosa
-1201.957,0.32731751,-1.392399,-1.311052,setosa
-1202.078,0.09788935,-1.279104,-1.311052,setosa
-1201.595,1.24503015,-1.335752,-1.311052,setosa
-1201.112,1.93331463,-1.165809,-1.048667,setosa
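
What the modified state does can be checked by hand in base R. The list `state.sketch` below is a simplified stand-in for the `control` part of the real state (an assumption for illustration, not the actual object layout): it holds the column means and standard deviations of the training data, and the first center is then overwritten with 1000.

```r
# Base-R sketch of re-applying a scaling step from a manipulated state.
# `state.sketch` is a hypothetical, simplified stand-in for state$control.
num = as.matrix(iris[1:4])
state.sketch = list(center = colMeans(num), scale = apply(num, 2, sd))

state.sketch$center[1] = 1000  # same manipulation as in the cell above

# subtract the stored centers, then divide by the stored scales
modified = sweep(sweep(num, 2, state.sketch$center, "-"), 2, state.sketch$scale, "/")
head(modified)
```

The first column now starts at about -1201.474, i.e. `(5.1 - 1000) / 0.8280661`, while the other columns are scaled as usual.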


### Application of Retrafo using `%>>%`, `applyCPO`, or `predict`

In [38]:
head(iris) %>>% retrafo(transformed)
# should give the same as head(transformed), since the same data was used.
# same:
invisible(applyCPO(retrafo(transformed), head(iris)))
invisible(predict(retrafo(transformed), head(iris)))

Species,PC1,PC2,PC3,PC4
setosa,-2.684126,-0.3193972,0.02791483,0.002262437
setosa,-2.714142,0.1770012,0.21046427,0.09902655
setosa,-2.888991,0.1449494,-0.01790026,0.01996839
setosa,-2.745343,0.318299,-0.03155937,-0.075575817
setosa,-2.728717,-0.3267545,-0.09007924,-0.061258593
setosa,-2.28086,-0.7413304,-0.16867766,-0.024200858


### Retrafos from CPO Learners

In [39]:
cpomodel = train(cpolrn, pid.task)

In [40]:
retrafo(cpomodel)

CPO Retrafo chain
[RETRAFO pca(center = TRUE, scale = FALSE)]

In [41]:
head(getTaskData(pid.task %>>% retrafo(cpomodel)))
# this is the data the wrapped learner would see if we called predict() on the model with pid.task

diabetes,PC1,PC2,PC3,PC4,PC5,PC6,PC7,PC8
pos,-75.71465,35.950783,-7.26078895,-15.669269,-16.506541,-3.460442,-0.702047,-0.09497708
neg,-82.35827,-28.908213,-5.49667139,-9.004554,-3.481527,-5.590262,-2.5720149,0.09153472
pos,-74.63064,67.906496,19.46180812,5.653056,10.300113,-7.144367,4.279067,-0.27101062
neg,11.07742,-34.898486,-0.05301779,-1.314873,7.619414,-2.583855,-0.8098285,0.27330484
pos,89.74379,2.746937,25.21285861,-18.994237,-8.522694,9.486986,-3.6264099,-1.67434826
neg,-80.97792,3.946887,0.64139494,15.117736,8.976962,-2.314746,1.5693795,0.16000894


### Retrafos are automatically chained when applying CPOs (!!!)
When executing `data %>>% CPO`, the result has an associated `Retrafo` object. When another `CPO` is applied to that result, the new `Retrafo` is the chained operation. This makes `data %>>% CPO1 %>>% CPO2` work the way one expects.

In [42]:
data = head(iris) %>>% pca
retrafo(data)

CPO Retrafo chain
[RETRAFO pca(center = FALSE, scale = FALSE)]

In [43]:
data2 = data %>>% scale
# retrafo(data2) is the same as retrafo(head(iris) %>>% pca %>>% scale)
retrafo(data2)

CPO Retrafo chain
[RETRAFO pca(center = FALSE, scale = FALSE)]=>[RETRAFO scale(center = TRUE, scale = TRUE)]

In [44]:
# to interrupt this chain, set retrafo to NULL
retrafo(data) = NULL
data2 = data %>>% scale
retrafo(data2)

CPO Retrafo chain
[RETRAFO scale(center = TRUE, scale = TRUE)]
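
The chaining mechanism can be modelled with a toy implementation in base R. Everything here (`applyStep`, the `"retrafo.sketch"` attribute, the two trainers) is hypothetical and only mimics the behaviour described above; it is not how the CPO framework is implemented.

```r
# Toy model of retrafo chaining. All names are hypothetical, not the CPO API.
# A "trainer" inspects the data and returns a function that re-applies the fitted step.
centerTrainer = function(data) {
  mu = colMeans(data)
  function(newdata) sweep(newdata, 2, mu, "-")
}
scaleTrainer = function(data) {
  s = apply(data, 2, sd)
  function(newdata) sweep(newdata, 2, s, "/")
}

applyStep = function(data, trainer) {
  previous = attr(data, "retrafo.sketch")  # re-application function of earlier steps, if any
  attr(data, "retrafo.sketch") = NULL      # work on the bare matrix
  step = trainer(data)
  result = step(data)
  # chain: first everything that happened before, then the new step
  attr(result, "retrafo.sketch") = if (is.null(previous)) {
    step
  } else {
    function(newdata) step(previous(newdata))
  }
  result
}

x = as.matrix(head(iris)[1:4])
out = applyStep(applyStep(x, centerTrainer), scaleTrainer)
chained = attr(out, "retrafo.sketch")

# the chained function performs both steps on fresh data
stopifnot(isTRUE(all.equal(chained(x), scale(x), check.attributes = FALSE)))
```

Removing the attribute before the second `applyStep` call would start a new chain, which corresponds to setting `retrafo(data) = NULL` above.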

### Retrafo Composition, Decomposition, Chaining

In [45]:
compound.retrafo = retrafo(head(iris) %>>% compound)
compound.retrafo

CPO Retrafo chain
[RETRAFO pca(center = FALSE, scale = FALSE)]=>[RETRAFO scale(center = TRUE, scale = TRUE)]

In [46]:
(retrafolist = as.list(compound.retrafo))

[[1]]
CPO Retrafo chain
[RETRAFO pca(center = FALSE, scale = FALSE)]

[[2]]
CPO Retrafo chain
[RETRAFO scale(center = TRUE, scale = TRUE)]


In [47]:
retrafolist[[1]] %>>% retrafolist[[2]]

CPO Retrafo chain
[RETRAFO pca(center = FALSE, scale = FALSE)]=>[RETRAFO scale(center = TRUE, scale = TRUE)]

In [48]:
chainCPO(retrafolist)

CPO Retrafo chain
[RETRAFO pca(center = FALSE, scale = FALSE)]=>[RETRAFO scale(center = TRUE, scale = TRUE)]

## Inverter
Inverters represent the operation of inverting transformations done to prediction columns. They are not usually exposed outside of `Learner` objects, but can be retrieved when retransformed data is tagged using `tagInvert`.

Inverters are currently not fully functional.

In [49]:
# there is currently no example targetbound cpo
logtransform = makeCPOTargetOp("logtransform", .data.dependent = FALSE,
                               .stateless = TRUE, .type = "regr",
  cpo.trafo = {
    target[[1]] = log(target[[1]])
    target
  }, cpo.retrafo = { print(match.call()) })


In [50]:
log.retrafo = retrafo(bh.task %>>% logtransform())  # get a target-bound retrafo
getCPOKind(log.retrafo)  # logtransform is *stateless*, so it is a retrafo *and* an inverter
getCPOBound(log.retrafo)

In [51]:
inverter(bh.task %>>% log.retrafo)

NULLCPO

In [52]:
#inverter(tagInvert(bh.task) %>>% log.retrafo)
# currently not implemented :-/

Inverting is done with the `invert` function.

In [53]:
log.bh = bh.task %>>% logtransform()
log.prediction = predict(train("regr.lm", log.bh), log.bh)

In [54]:
# invert(retrafo(log.bh), log.prediction)  # not implemented :-/
# invert(retrafo(log.bh), log.prediction$data["response"])  # not implemented :-/
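
Since `invert` is not yet usable here, the following base-R sketch only illustrates what inversion means for a log-transformed target. It uses the builtin `cars` dataset and plain `lm` instead of `bh.task` and the `"regr.lm"` learner; both substitutions are assumptions made so that the example is self-contained.

```r
# Base-R sketch of target transformation and inversion; no CPO functions are used.
# "trafo": the model is fitted on the log of the target
fit = lm(log(dist) ~ speed, data = cars)

# predictions are on the log scale ...
log.pred = predict(fit, cars)

# ... "inversion" maps them back to the scale of the original target
pred = exp(log.pred)
head(cbind(observed = cars$dist, predicted = pred))
```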


# CPO Properties
CPOs contain information about the kind of data they can work with, and what kind of data they produce. `getCPOProperties` returns a list with the following slots:
* `properties`: the kind of data the CPO can handle.
* `properties.data`: only different from `properties` if `affect.*` parameters are given. In that case, `properties.data` determines what properties the selected subset of columns must have.
* `properties.needed`: the kind of data the data receiver (e.g. an attached learner) needs to be able to handle.
* `properties.adding`: the properties the CPO adds to a given learner.

An example is a CPO that converts factors to numerics: the receiving learner needs to handle numerics, so `properties.needed = "numerics"`, but the CPO *adds* the ability to handle factors (since they are converted), so `properties.adding = c("factors", "ordered")`.

In [55]:
getCPOProperties(cpoDummyEncode())

In [56]:
train("classif.geoDA", bc.task)  # gives an error

ERROR: Error in checkLearnerBeforeTrain(task, learner, weights): Task 'BreastCancer-example' has factor inputs in 'Cl.thickness, Cell.size, Cell.shape, Marg.adhes...', but learner 'classif.geoDA' does not support that!


In [57]:
train(cpoDummyEncode(reference.cat = TRUE) %>>% makeLearner("classif.geoDA"), bc.task)

Model for learner.id=classif.geoDA.dummyencode; learner.class=CPOS3Learner
Trained on: task.id = BreastCancer-example; obs = 683; features = 9
Hyperparameters: validation=NULL,reference.cat=TRUE

In [58]:
getLearnerProperties("classif.geoDA")

In [59]:
getLearnerProperties(cpoDummyEncode(TRUE) %>>% makeLearner("classif.geoDA"))

# Special CPOs

## NULLCPO
`NULLCPO` is the neutral element of `%>>%`. It is returned by some functions when no other CPO or Retrafo is present.

In [60]:
NULLCPO

NULLCPO

In [61]:
is.nullcpo(NULLCPO)

In [62]:
NULLCPO %>>% cpoScale()

scale(center = TRUE, scale = TRUE)

In [63]:
NULLCPO %>>% NULLCPO

NULLCPO

In [64]:
print(as.list(NULLCPO))

list()


In [65]:
chainCPO(list())

NULLCPO
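
The role of a neutral element can be illustrated with plain function composition in base R, where `identity` plays the part of `NULLCPO`. The helpers `compose` and `chainSketch` are hypothetical names for this sketch and are not part of the CPO API.

```r
# Sketch: composition with a neutral element. Names are hypothetical.
compose = function(f, g) {
  force(f)
  force(g)
  function(x) g(f(x))   # first f, then g, like f %>>% g
}

# chaining a list of steps; an empty list yields the neutral element
chainSketch = function(steps) Reduce(compose, steps, identity)

addOne = function(x) x + 1
double = function(x) x * 2

chainSketch(list())(5)                # the empty chain leaves its input unchanged
chainSketch(list(addOne, double))(3)  # (3 + 1) * 2
compose(identity, addOne)(3)          # composing with the neutral element changes nothing
```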

## CPO Applicator
A simple CPO with a single parameter, `cpo`, whose value is itself a CPO that gets applied to the data. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, the Hyperparameters of the CPO given as argument are not exposed to the outside.

In [66]:
cpa = cpoApply()
summary(cpa)

Retrafo chain of 1 elements:
apply()

       Type len Def Constr Req Tunable Trafo
cpo untyped   -   -      -   -    TRUE     -

In [67]:
head(iris) %>>% setHyperPars(cpa, cpo = cpoScale())

“Empty factor levels were dropped for columns: Species”

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0.5206576,0.3401105,-0.3627381,-0.4082483,setosa
-0.1735525,-1.117506,-0.3627381,-0.4082483,setosa
-0.8677627,-0.5344594,-1.0882144,-0.4082483,setosa
-1.2148677,-0.8259827,0.3627381,-0.4082483,setosa
0.1735525,0.6316338,-0.3627381,-0.4082483,setosa
1.5619728,1.5062037,1.8136906,2.0412415,setosa


In [68]:
head(iris) %>>% setHyperPars(cpa, cpo = cpoPca())

“Empty factor levels were dropped for columns: Species”

Species,PC1,PC2,PC3,PC4
setosa,-0.1634147,0.017230444,-0.11038321,-0.0231625616
setosa,0.332497,-0.189351624,-0.08152883,0.0005612917
setosa,0.3268659,0.101103375,-0.02238439,0.046453773
setosa,0.4202367,0.005523981,0.17106514,-0.0222757931
setosa,-0.1768684,0.140149101,-0.04185224,-0.0194870755
setosa,-0.7393165,-0.074655279,0.08508352,0.0179103657


In [69]:
# attaching the cpo applicator to a learner gives this learner a "cpo" hyperparameter
# that can be set to any CPO.
getParamSet(cpoApply() %>>% makeLearner("classif.logreg"))

         Type len  Def Constr Req Tunable Trafo
cpo   untyped   -    -      -   -    TRUE     -
model logical   - TRUE      -   -   FALSE     -

## CPO Multiplexer
Combine many CPOs into one, with an extra `selected.cpo` parameter that chooses between them.

In [70]:
cpm = cpoMultiplex(list(cpoScale, cpoPca))
summary(cpm)

Retrafo chain of 1 elements:
multiplex(selected.cpo = scale, scale.center = TRUE, scale.scale = TRUE, pca.center = TRUE, pca.scale = FALSE)

                 Type len   Def    Constr Req Tunable Trafo
selected.cpo discrete   - scale scale,pca   -    TRUE     -
scale.center  logical   -  TRUE         -   Y    TRUE     -
scale.scale   logical   -  TRUE         -   Y    TRUE     -
pca.center    logical   -  TRUE         -   Y    TRUE     -
pca.scale     logical   - FALSE         -   Y    TRUE     -

In [71]:
head(iris) %>>% setHyperPars(cpm, selected.cpo = "scale")

“Empty factor levels were dropped for columns: Species”

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0.5206576,0.3401105,-0.3627381,-0.4082483,setosa
-0.1735525,-1.117506,-0.3627381,-0.4082483,setosa
-0.8677627,-0.5344594,-1.0882144,-0.4082483,setosa
-1.2148677,-0.8259827,0.3627381,-0.4082483,setosa
0.1735525,0.6316338,-0.3627381,-0.4082483,setosa
1.5619728,1.5062037,1.8136906,2.0412415,setosa


In [72]:
# every CPO's Hyperparameters are exported
head(iris) %>>% setHyperPars(cpm, selected.cpo = "scale", scale.center = FALSE)

“Empty factor levels were dropped for columns: Species”

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
0.939209,0.9403303,0.8780925,0.745356,setosa
0.9023773,0.8059974,0.8780925,0.745356,setosa
0.8655455,0.8597306,0.8153716,0.745356,setosa
0.8471297,0.832864,0.9408134,0.745356,setosa
0.9207931,0.9671969,0.8780925,0.745356,setosa
0.9944566,1.0477967,1.0662552,1.490712,setosa


In [73]:
head(iris) %>>% setHyperPars(cpm, selected.cpo = "pca")

“Empty factor levels were dropped for columns: Species”

Species,PC1,PC2,PC3,PC4
setosa,-0.1634147,0.017230444,-0.11038321,-0.0231625616
setosa,0.332497,-0.189351624,-0.08152883,0.0005612917
setosa,0.3268659,0.101103375,-0.02238439,0.046453773
setosa,0.4202367,0.005523981,0.17106514,-0.0222757931
setosa,-0.1768684,0.140149101,-0.04185224,-0.0194870755
setosa,-0.7393165,-0.074655279,0.08508352,0.0179103657
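
Conceptually, a multiplexer is a switch over several operations. The base-R function below is a hypothetical sketch of that idea using `base::scale` and `stats::prcomp` directly; it does not export per-operation hyperparameters the way `cpoMultiplex` does.

```r
# Sketch of the multiplexing idea in base R. `multiplexSketch` is a hypothetical helper.
multiplexSketch = function(data, selected = c("scale", "pca")) {
  selected = match.arg(selected)
  switch(selected,
    scale = scale(data),     # center and scale each column
    pca = prcomp(data)$x)    # principal component scores
}

x = as.matrix(head(iris)[1:4])
multiplexSketch(x, "scale")
multiplexSketch(x, "pca")
```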


## Meta-CPO
A CPO that builds data-dependent CPO networks. It generalizes `cpoMultiplex`: it takes a function which decides (from the data, and from user-specified hyperparameters) what CPO operation to perform. Besides optional arguments, the Hyperparameters of the CPOs used are exported as well. Unlike with `cpoMultiplex`, however, the `requires` of the involved parameters are not adjusted, since this is impossible in principle.

In [74]:
s.and.p = cpoMeta(logical.param: logical,
.export = list(cpoScale(id = "scale"), 
  cpoPca(id = "pca", scale = FALSE, center = FALSE)),
cpo.build = function(data, target, logical.param, scale, pca) {
  if (logical.param || mean(data[[1]]) > 10) {
    scale %>>% pca
  } else {
    pca %>>% scale
  }
})

In [75]:
summary(s.and.p())

Retrafo chain of 1 elements:
meta(scale.center = TRUE, scale.scale = TRUE, pca.center = FALSE, pca.scale = FALSE)

                 Type len   Def Constr Req Tunable Trafo
logical.param logical   -     -      -   -    TRUE     -
scale.center  logical   -  TRUE      -   -    TRUE     -
scale.scale   logical   -  TRUE      -   -    TRUE     -
pca.center    logical   -  TRUE      -   -    TRUE     -
pca.scale     logical   - FALSE      -   -    TRUE     -

The resulting CPO `s.and.p` performs scaling and PCA, with the order depending on the parameter `logical.param` and on whether the mean of the data's first column exceeds 10. If either of those is true, the data is first scaled and then PCA'd; otherwise the order is reversed.
All CPOs listed in `.export` are passed to `cpo.build`.

## CBind CPO
Applies several CPOs to the same data and `cbind`s their results. The `cbinder` makes it possible to build DAGs of CPOs that perform different operations on the data and paste the results next to each other.

In [76]:
scale = cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca(center = FALSE, scale = FALSE, id = "pca")
cbinder = cpoCbind(scaled = scale, pcad = scale.pca, original = NULLCPO)

In [77]:
# cpoCbind recognises that "scale.scale" happens before "pca.pca" but is also fed to the
# result directly. The summary draws a (crude) ascii-art graph.
summary(cbinder)

Retrafo chain of 1 elements:
cbind(scale.center = TRUE, scale.scale = TRUE, pca.center = FALSE, pca.scale = FALSE, .CPO = <unnamed>=<CPOGraphItem>, <unnamed>=<CPOGraphItem>, <unnamed>=<CPOGraphItem>, <unnamed>=<CPOGraphItem>)
O>+   scale.scale(scale.center = TRUE, scale.scale = TRUE)
| |  
+<O   pca.pca(pca.center = FALSE, pca.scale = FALSE)
|  
O   CBIND[scaled,pcad,original]
 

                Type len   Def Constr Req Tunable Trafo
scale.center logical   -  TRUE      -   -    TRUE     -
scale.scale  logical   -  TRUE      -   -    TRUE     -
pca.center   logical   -  TRUE      -   -    TRUE     -
pca.scale    logical   - FALSE      -   -    TRUE     -
.CPO         untyped   -     -      -   -    TRUE     -
O>+   scale.scale(scale.center = TRUE, scale.scale = TRUE)
| |  
+<O   pca.pca(pca.center = FALSE, pca.scale = FALSE)
|  
O   CBIND[scaled,pcad,original]
 

In [78]:
head(iris %>>% cbinder)

scaled.Sepal.Length,scaled.Sepal.Width,scaled.Petal.Length,scaled.Petal.Width,scaled.Species,pcad.Species,pcad.PC1,pcad.PC2,pcad.PC3,pcad.PC4,original.Sepal.Length,original.Sepal.Width,original.Petal.Length,original.Petal.Width,original.Species
-0.8976739,1.01560199,-1.335752,-1.311052,setosa,setosa,-2.257141,-0.4784238,0.12727962,0.024087508,5.1,3.5,1.4,0.2,setosa
-1.1392005,-0.13153881,-1.335752,-1.311052,setosa,setosa,-2.074013,0.6718827,0.23382552,0.102662845,4.9,3.0,1.4,0.2,setosa
-1.3807271,0.32731751,-1.392399,-1.311052,setosa,setosa,-2.356335,0.3407664,-0.0440539,0.028282305,4.7,3.2,1.3,0.2,setosa
-1.5014904,0.09788935,-1.279104,-1.311052,setosa,setosa,-2.291707,0.5953999,-0.0909853,-0.06573534,4.6,3.1,1.5,0.2,setosa
-1.0184372,1.24503015,-1.335752,-1.311052,setosa,setosa,-2.381863,-0.6446757,-0.01568565,-0.03580287,5.0,3.6,1.4,0.2,setosa
-0.535384,1.93331463,-1.165809,-1.048667,setosa,setosa,-2.068701,-1.4842053,-0.02687825,0.006586116,5.4,3.9,1.7,0.4,setosa


In [79]:
# the unnecessary copies of "Species" are unfortunate. Remove them with cpoSelect:
selector = mlr:::cpoSelect(type = "numeric")
cbinder.select = cpoCbind(scaled = selector %>>% scale, pcad = selector %>>% scale.pca, original = NULLCPO)
cbinder.select
head(iris %>>% cbinder.select)  # apply the new cbinder.select, not the original cbinder

cbind(type = numeric, index = integer(0), names = character(0), pattern = <NULL>, pattern.ignore.case = FALSE, pattern.perl = FALSE, pattern.fixed = FALSE, invert = FALSE, scale.center = TRUE, scale.scale = TRUE, pca.center = FALSE, pca.scale = FALSE, .CPO = <unnamed>=<CPOGraphItem>, <unnamed>=<CPOGraphItem>, <unnamed>=<CPOGraphItem>, <unnamed>=<CPOGraphItem>, <unnamed>=<CPOGraphItem>)
O     select(type = numeric, index = integer(0), names = character(0), pattern
|    = <NULL>, pattern.ignore.case = FALSE, pattern.perl = FALSE,
|    pattern.fixed = FALSE, invert = FALSE)
|    
O>+   scale.scale(scale.center = TRUE, scale.scale = TRUE)
| |  
+<O   pca.pca(pca.center = FALSE, pca.scale = FALSE)
|  
O   CBIND[scaled,pcad,original]
 


In [80]:
# alternatively, we apply the cbinder only to numerical data
head(iris %>>% cpoApply(cbinder, affect.type = "numeric"))

Species,scaled.Sepal.Length,scaled.Sepal.Width,scaled.Petal.Length,scaled.Petal.Width,pcad.PC1,pcad.PC2,pcad.PC3,pcad.PC4,original.Sepal.Length,original.Sepal.Width,original.Petal.Length,original.Petal.Width
setosa,-0.8976739,1.01560199,-1.335752,-1.311052,-2.257141,-0.4784238,0.12727962,0.024087508,5.1,3.5,1.4,0.2
setosa,-1.1392005,-0.13153881,-1.335752,-1.311052,-2.074013,0.6718827,0.23382552,0.102662845,4.9,3.0,1.4,0.2
setosa,-1.3807271,0.32731751,-1.392399,-1.311052,-2.356335,0.3407664,-0.0440539,0.028282305,4.7,3.2,1.3,0.2
setosa,-1.5014904,0.09788935,-1.279104,-1.311052,-2.291707,0.5953999,-0.0909853,-0.06573534,4.6,3.1,1.5,0.2
setosa,-1.0184372,1.24503015,-1.335752,-1.311052,-2.381863,-0.6446757,-0.01568565,-0.03580287,5.0,3.6,1.4,0.2
setosa,-0.535384,1.93331463,-1.165809,-1.048667,-2.068701,-1.4842053,-0.02687825,0.006586116,5.4,3.9,1.7,0.4
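
The numeric part of this result can be reproduced in base R, which may help to see what the three branches of the `cbinder` compute. This is a sketch using `base::scale` and `stats::prcomp` directly, restricted to the numeric columns; the sign of the principal components is not guaranteed to match the output above.

```r
# Base-R sketch of the three cbind branches on the numeric iris columns.
num = iris[1:4]

scaled = scale(num)                                        # branch "scaled"
pcad = prcomp(scaled, center = FALSE, scale. = FALSE)$x    # branch "pcad": PCA of the scaled data
combined = data.frame(scaled = scaled, pcad = pcad, original = num)

names(combined)
head(combined)
```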


# Builtin CPOs

## Listing CPOs
Builtin CPOs can be listed with `listCPO()`.

In [81]:
listCPO()

Unnamed: 0,name,cponame,category,subcategory,description
8,cpoDropConstants,dropconst,data,cleanup,Drop constant or near-constant Features.
9,cpoFixFactors,fixfactors,data,cleanup,Clean up Factorial Features.
6,cpoDummyEncode,dummyencode,data,feature conversion,Convert factorial columns to numeric columns by dummy encoding them
7,cpoSelect,select,data,feature selection,"Select features from a data set by type, column name, or column index."
4,cpoPca,pca,data,numeric data preprocessing,Perform Principal Component Analysis (PCA) using stats::prcomp.
5,cpoScale,scale,data,numeric data preprocessing,Center and / or scale the data using base::scale.
11,cpoFilterFeatures,filterFeatures,featurefilter,general,Filter features using a provided method.
27,cpoFilterAnova,anova.test,featurefilter,specialised,Filter features using analysis of variance.
13,cpoFilterCarscore,carscore,featurefilter,specialised,Filter features using correlation-adjusted marginal correlation.
23,cpoFilterChiSquared,chi.squared,featurefilter,specialised,Filter features using chi-squared test.


## cpoScale
Implements the `base::scale` function.

In [82]:
df %>>% cpoScale()

a,b
-1,1
0,0
1,-1


In [83]:
df %>>% cpoScale(scale = FALSE)  # center = TRUE

a,b
-1,10
0,0
1,-10
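
For comparison, the same numbers come out of calling `base::scale` directly on the small example `data.frame` (redefined here so that the snippet is self-contained). Note that `scale` returns a matrix with extra attributes rather than a `data.frame`.

```r
# Direct use of base::scale on the example data from the top of this document.
df = data.frame(a = 1:3, b = -(1:3) * 10)

scaled = scale(df)                    # center and scale
centered = scale(df, scale = FALSE)   # center only

scaled[, c("a", "b")]
centered[, c("a", "b")]
```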


## cpoPca
Implements `stats::prcomp`.

In [84]:
df %>>% cpoPca()

PC1,PC2
-10.04988,4.440892e-16
0.0,0.0
10.04988,-4.440892e-16


In [85]:
df %>>% cpoPca(scale = TRUE)

PC1,PC2
-1.414214,1.110223e-16
0.0,0.0
1.414214,-1.110223e-16
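
The corresponding direct calls to `stats::prcomp` look as follows. The sign of each principal component is arbitrary and may differ between platforms, so only absolute values should be compared with the output above.

```r
# Direct use of stats::prcomp on the example data from the top of this document.
df = data.frame(a = 1:3, b = -(1:3) * 10)

pcs = prcomp(df, center = TRUE, scale. = FALSE)$x
pcs.scaled = prcomp(df, center = TRUE, scale. = TRUE)$x

abs(pcs[, "PC1"])         # compare with the first cell, up to sign
abs(pcs.scaled[, "PC1"])  # compare with the second cell, up to sign
```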


## cpoDummyEncode
Dummy encoding of factorial variables. Optionally uses the first factor level as the reference category (`reference.cat = TRUE`), for which no column is created.

In [86]:
head(iris %>>% cpoDummyEncode())

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Speciessetosa,Speciesversicolor,Speciesvirginica
5.1,3.5,1.4,0.2,1,0,0
4.9,3.0,1.4,0.2,1,0,0
4.7,3.2,1.3,0.2,1,0,0
4.6,3.1,1.5,0.2,1,0,0
5.0,3.6,1.4,0.2,1,0,0
5.4,3.9,1.7,0.4,1,0,0


In [87]:
head(iris %>>% cpoDummyEncode(reference.cat = TRUE))

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Speciesversicolor,Speciesvirginica
5.1,3.5,1.4,0.2,0,0
4.9,3.0,1.4,0.2,0,0
4.7,3.2,1.3,0.2,0,0
4.6,3.1,1.5,0.2,0,0
5.0,3.6,1.4,0.2,0,0
5.4,3.9,1.7,0.4,0,0
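
The two encodings can be reproduced with `stats::model.matrix`. Whether `cpoDummyEncode` uses `model.matrix` internally is an assumption; the sketch only shows that the resulting column names and values agree with the output above.

```r
# Base-R sketch of dummy encoding with and without a reference category.
full = model.matrix(~ 0 + Species, data = iris)         # one column per level
reference = model.matrix(~ Species, data = iris)[, -1]  # first level is the reference; drop the intercept

colnames(full)
colnames(reference)
head(reference)   # all zero for the reference level "setosa"
```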


## cpoSelect
Selects certain columns of a dataset by type, column index, name, or regex pattern.

In [88]:
head(iris %>>% cpoSelect(pattern = "Width"))

Sepal.Width,Petal.Width
3.5,0.2
3.0,0.2
3.2,0.2
3.1,0.2
3.6,0.2
3.9,0.4


In [89]:
# selection is additive
head(iris %>>% cpoSelect(pattern = "Width", type = "factor"))

Sepal.Width,Petal.Width,Species
3.5,0.2,setosa
3.0,0.2,setosa
3.2,0.2,setosa
3.1,0.2,setosa
3.6,0.2,setosa
3.9,0.4,setosa
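
"Additive" here means that a column is kept if it matches any of the given criteria. In base R this corresponds to combining the individual selections with a logical OR, as in this sketch:

```r
# Base-R sketch of additive column selection: pattern OR type.
by.pattern = grepl("Width", names(iris))          # columns whose name matches the pattern
by.type = vapply(iris, is.factor, logical(1))     # columns of type factor
selected = iris[, by.pattern | by.type]

head(selected)
```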


## cpoDropConstants
Drops constant features, including numeric features that are constant within an adjustable tolerance.

In [90]:
head(iris) %>>% cpoDropConstants()  # drops 'Species'
head(iris) %>>% cpoDropConstants(abs.tol = 0.2)  # also drops 'Petal.Width'

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
5.1,3.5,1.4,0.2
4.9,3.0,1.4,0.2
4.7,3.2,1.3,0.2
4.6,3.1,1.5,0.2
5.0,3.6,1.4,0.2
5.4,3.9,1.7,0.4


Sepal.Length,Sepal.Width,Petal.Length
5.1,3.5,1.4
4.9,3.0,1.4
4.7,3.2,1.3
4.6,3.1,1.5
5.0,3.6,1.4
5.4,3.9,1.7
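
A possible reading of the tolerance is sketched below in base R. The exact criterion used by `cpoDropConstants` is not stated here, so the rule "range of a numeric column is at most `abs.tol`" is an assumption, and `dropConstantsSketch` is a hypothetical helper that ignores missing values.

```r
# Sketch of dropping (near-)constant columns. The range-based rule is an assumption.
dropConstantsSketch = function(data, abs.tol = 0) {
  slack = sqrt(.Machine$double.eps)  # guards against floating-point rounding
  is.const = vapply(data, function(col) {
    if (is.numeric(col)) {
      diff(range(col)) <= abs.tol + slack
    } else {
      length(unique(col)) <= 1
    }
  }, logical(1))
  data[, !is.const, drop = FALSE]
}

names(dropConstantsSketch(head(iris)))                 # 'Species' is constant in the first six rows
names(dropConstantsSketch(head(iris), abs.tol = 0.2))  # 'Petal.Width' only varies by 0.2
```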


## cpoFixFactors
Drops unused factor levels and makes sure prediction data has the same factor levels as training data.

In [91]:
levels(iris$Species)

In [92]:
irisfix = head(iris) %>>% cpoFixFactors()  # Species only has level 'setosa' in train
levels(irisfix$Species)

In [93]:
rf = retrafo(irisfix)
iris[c(1, 100, 140), ]
iris[c(1, 100, 140), ] %>>% rf

Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
100,5.7,2.8,4.1,1.3,versicolor
140,6.9,3.1,5.4,2.1,virginica


Unnamed: 0,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
100,5.7,2.8,4.1,1.3,
140,6.9,3.1,5.4,2.1,
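
The same behaviour can be reproduced in base R: levels are dropped on the training data and the remaining levels are then imposed on new data, so that values with unseen levels become `NA`. This is a sketch of the effect, not of the CPO's implementation.

```r
# Base-R sketch of fixing factor levels on training data and re-applying them.
train = droplevels(head(iris))          # only level 'setosa' remains
train.levels = levels(train$Species)

new = iris[c(1, 100, 140), ]
# impose the training levels; levels not seen during training turn into NA
new$Species = factor(as.character(new$Species), levels = train.levels)
new
```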


## cpoMissingIndicators
Creates columns indicating missing data. Most useful in combination with `cpoCbind`.

In [94]:
impdata = df
impdata[[1]][1] = NA
impdata

a,b
,-10
2.0,-20
3.0,-30


In [95]:
impdata %>>% cpoMissingIndicators()
impdata %>>% cpoCbind(NULLCPO, dummy = cpoMissingIndicators())

a
True
False
False


a,b,dummy.a
,-10,True
2.0,-20,False
3.0,-30,False
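
In base R, the indicator columns amount to calling `is.na` on every column that contains missing values. The sketch below rebuilds `impdata` so that it is self-contained; the `dummy.` prefix is added by hand to mirror the column name in the output above.

```r
# Base-R sketch of missing-value indicator columns.
impdata = data.frame(a = c(NA, 2, 3), b = -(1:3) * 10)

has.na = vapply(impdata, anyNA, logical(1))                # which columns have missing values?
indicators = as.data.frame(lapply(impdata[has.na], is.na)) # one logical column per such column
names(indicators) = paste0("dummy.", names(indicators))

combined = cbind(impdata, indicators)
combined
```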


## Imputation
There are two *general* and many *specialised* imputation CPOs. The general imputation CPOs have parameters that let them use different imputation methods on different columns. They are a thin wrapper around `mlr`'s `impute()` and `reimpute()` functions. The specialised imputation CPOs each implement exactly one imputation method and are closer to the behaviour of typical CPOs.

### General Imputation Wrappers
`cpoImpute` and `cpoImputeAll` both have parameters very much like `impute()`. The latter assumes that *all* columns of its input are being imputed and can be prepended to a learner to give it the ability to work with missing data. It will, however, throw an error if data is still missing after imputation.

In [96]:
impdata %>>% cpoImpute(cols = list(a = imputeMedian()))

a,b
2.5,-10
2.0,-20
3.0,-30
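
What median imputation does, and why the imputed value has to be remembered for later data, can be sketched in base R. The data frame `newdata` is a made-up example for this illustration.

```r
# Base-R sketch of median imputation with re-application to new data.
impdata = data.frame(a = c(NA, 2, 3), b = -(1:3) * 10)

# "training": the median of the observed values is computed once ...
train.median = median(impdata$a, na.rm = TRUE)
imputed = impdata
imputed$a[is.na(imputed$a)] = train.median

# ... and the same value is used for new data, whatever that data's own median is
newdata = data.frame(a = c(100, NA), b = c(-40, -50))  # hypothetical new data
newdata$a[is.na(newdata$a)] = train.median

imputed
newdata
```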


In [97]:
impdata %>>% cpoImpute(cols = list(b = imputeMedian()))  # NAs remain
#impdata %>>% cpoImputeAll(cols = list(b = imputeMedian()))  # error, since NAs remain

a,b
,-10
2.0,-20
3.0,-30


In [98]:
missing.task = makeRegrTask("missing.task", impdata, target = "b")
# the following gives an error, since 'cpoImpute' does not make sure all missings are removed
# and hence does not add the 'missings' property.
#train(cpoImpute(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)
# instead, the following works:
train(cpoImputeAll(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)

Model for learner.id=regr.lm.impute; learner.class=CPOS3Learner
Trained on: task.id = missing.task; obs = 3; features = 1
Hyperparameters: target.cols=character(0),classes=,cols=a=<ImputeMethod>,dummy.classes=character(0),dummy.cols=character(0),dummy.type=factor,force.dummies=FALSE,impute.new.levels=TRUE,recode.factor.levels=TRUE

### Specialised Imputation Wrappers
There is one specialised CPO for each imputation method.

In [99]:
impdata %>>% cpoImputeConstant(10)

a,b,a.dummy
10,-10,True
2,-20,False
3,-30,False


In [100]:
getTaskData(missing.task %>>% cpoImputeMedian(make.dummy.cols = FALSE))

a,b
2.5,-10
2.0,-20
3.0,-30


In [101]:
# The specialised impute CPOs are:
listCPO()[listCPO()$category == "imputation" & listCPO()$subcategory == "specialised",
          c("name", "description")]

Unnamed: 0,name,description
33,cpoImputeConstant,Imputation using a constant value.
41,cpoImputeHist,Imputation using random values with probabilities approximating the data.
42,cpoImputeLearner,Imputation using the response of a classification or regression learner.
38,cpoImputeMax,Imputation using constant values shifted above the maximum.
35,cpoImputeMean,Imputation using the mean.
34,cpoImputeMedian,Imputation using the median.
37,cpoImputeMin,Imputation using constant values shifted below the minimum.
36,cpoImputeMode,Imputation using the mode.
40,cpoImputeNormal,Imputation using normally distributed random values.
39,cpoImputeUniform,Imputation using uniformly distributed random values.


## Feature Filtering
There is one *general* feature filtering CPO and many *specialised* ones. The general filtering CPO, `cpoFilterFeatures`, is a thin wrapper around `filterFeatures` and takes the filtering method as its argument. The specialised CPOs each call a specific filtering method.

Most arguments of `filterFeatures` are reflected in the CPOs, with the following exceptions:
1. For `cpoFilterFeatures`, the filter method arguments are given in a list `filter.args` instead of in `...`.
2. The argument `fval` was dropped for the specialised filter CPOs.
3. The argument `mandatory.feat` was dropped. Use `affect.*` parameters to prevent features from being filtered.

In [102]:
head(getTaskData(iris.task %>>% cpoFilterFeatures(method = "variance", perc = 0.5)))

Sepal.Length,Petal.Length,Species
5.1,1.4,setosa
4.9,1.4,setosa
4.7,1.3,setosa
4.6,1.5,setosa
5.0,1.4,setosa
5.4,1.7,setosa


In [103]:
head(getTaskData(iris.task %>>% cpoFilterVariance(perc = 0.5)))

Sepal.Length,Petal.Length,Species
5.1,1.4,setosa
4.9,1.4,setosa
4.7,1.3,setosa
4.6,1.5,setosa
5.0,1.4,setosa
5.4,1.7,setosa
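
To see why `Sepal.Length` and `Petal.Length` survive, the variance filter with `perc = 0.5` can be reproduced by hand in base R. This is a sketch under the assumption that `perc` keeps the top-scoring fraction of features, rounded to a whole number; it is not the `filterFeatures` implementation, and the names `scores`, `n.keep` and `filtered` are illustrative.

```r
# Conceptual sketch in base R; not the filterFeatures implementation.
feats <- iris[setdiff(names(iris), "Species")]

# Score every feature by its variance.
scores <- vapply(feats, var, numeric(1))

# Keep the top 50% of features by score (assumed meaning of perc = 0.5).
n.keep <- round(0.5 * length(scores))
keep <- names(sort(scores, decreasing = TRUE))[seq_len(n.keep)]

# Retain the selected features in their original order, plus the target.
filtered <- iris[names(iris) %in% keep | names(iris) == "Species"]
head(filtered)
```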


In [104]:
# The specialised filter CPOs are:
listCPO()[listCPO()$category == "featurefilter" & listCPO()$subcategory == "specialised",
          c("name", "description")]

Unnamed: 0,name,description
27,cpoFilterAnova,Filter features using analysis of variance.
13,cpoFilterCarscore,Filter features using correlation-adjusted marginal correlation.
23,cpoFilterChiSquared,Filter features using chi-squared test.
21,cpoFilterGainRatio,Filter features using entropy-based information gain ratio
20,cpoFilterInformationGain,Filter features using entropy-based information gain.
28,cpoFilterKruskal,Filter features using the Kruskal-Wallis rank sum test.
18,cpoFilterLinearCorrelation,Filter features using Pearson correlation.
12,cpoFilterMrmr,"Filter features using 'minimum redundancy, maximum relevance'."
25,cpoFilterOneR,Filter features using the OneR learner.
30,cpoFilterPermutationImportance,Filter features using predictiveness loss upon permutation of a variable.


# Creating Custom CPOs
I will write this up some other time.

In [None]:
print(class(cpoApply()))