In [1]:
library(tidyverse)
library(mlr)
library(mlbench)
library(e1071)
library(xgboost)
library(parallelMap)

count_na = function(df){
    sapply(df, function(x){sum(is.na(x))})
    }

## MLR: Machine Learning in R

Reference site: https://mlr.mlr-org.com/

Take the opportunity to go through the basic tutorial [at this link](https://mlr.mlr-org.com/articles/tutorial/usecase_regression.html).

![workflow](imgs/Selection_047.png)

We will learn the workflow with the `BostonHousing` dataset. It is described in the documentation of the [mlbench](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/BostonHousing) package.

In [2]:
library(mlbench)
library(tidyverse)
library(mlr)



In [5]:
data(BostonHousing)
df = BostonHousing
head(df)

   crim    zn indus  chas   nox    rm   age    dis   rad   tax ptratio      b lstat  medv
  <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl> <dbl> <dbl>
0.00632    18  2.31     0 0.538 6.575  65.2   4.09     1   296    15.3  396.9  4.98  24.0
0.02731     0  7.07     0 0.469 6.421  78.9 4.9671     2   242    17.8  396.9  9.14  21.6
0.02729     0  7.07     0 0.469 7.185  61.1 4.9671     2   242    17.8 392.83  4.03  34.7
0.03237     0  2.18     0 0.458 6.998  45.8 6.0622     3   222    18.7 394.63  2.94  33.4
0.06905     0  2.18     0 0.458 7.147  54.2 6.0622     3   222    18.7  396.9  5.33  36.2
0.02985     0  2.18     0 0.458  6.43  58.7 6.0622     3   222    18.7 394.12  5.21  28.7


In [7]:
?BostonHousing

## 1. Create the task

In [8]:
regr.task = makeRegrTask(data = df, target = 'medv')

In [9]:
regr.task

Supervised task: df
Type: regr
Target: medv
Observations: 506
Features:
   numerics     factors     ordered functionals 
         12           1           0           0 
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE

## 2. Define the learner

Check the available learners on the [mlr site](https://mlr.mlr-org.com/articles/tutorial/integrated_learners.html).
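
The learner table can also be queried from R. This is a minimal sketch, assuming `mlr` is installed; the property filter is only an illustration, chosen because the task above has 12 numeric features and 1 factor.

```r
library(mlr)

# List regression learners that accept both numeric and factor features.
# warn.missing.packages = FALSE silences notes about learners whose
# backing packages are not installed locally.
lrns = listLearners("regr", properties = c("numerics", "factors"),
                    warn.missing.packages = FALSE)

# Show the learner id and the package that implements it
head(lrns[, c("class", "package")])
```

The `class` column holds the string passed to `makeLearner()`, for example `regr.svm`.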

In [10]:
svm_learner = makeLearner(cl='regr.svm', cost = 1)

In [11]:
svm_learner

Learner regr.svm from package e1071
Type: regr
Name: Support Vector Machines (libsvm); Short name: svm
Class: regr.svm
Properties: numerics,factors
Predict-Type: response
Hyperparameters: cost=1


## 3. Train the model

After the first two steps, we can define the resampling strategy and train the model.

Here we create two strategies: `Holdout` and `Cross Validation` with 5 folds.

In [12]:
holdout = makeResampleDesc(method = 'Holdout', split = 0.7)
cv = makeResampleDesc(method = 'CV', iters = 5)

In [13]:
holdout
cv

Resample description: holdout with 0.70 split rate.
Predict: test
Stratification: FALSE

Resample description: cross-validation with 5 iterations.
Predict: test
Stratification: FALSE
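
To make the two descriptions concrete, here is a base R sketch of the index splits they describe for 506 rows. It illustrates the idea only and is not mlr's internal code; the variable names are made up for this example.

```r
set.seed(1)
n = 506  # number of observations in BostonHousing

# Holdout: 70% of the row indices go to training, the rest to testing
train_idx = sample(n, size = floor(0.7 * n))
test_idx = setdiff(seq_len(n), train_idx)

# 5-fold cross validation: every row gets a fold label from 1 to 5;
# in iteration k, fold k is the test set and the other four folds train
fold_id = sample(rep(1:5, length.out = n))
fold_sizes = table(fold_id)

length(train_idx)
length(test_idx)
fold_sizes
```

With holdout each row is used once, either for training or for testing. With cross validation every row is tested exactly once, at the cost of five model fits.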

To train, we use the `resample()` function.

In [14]:
res_holdout = resample(learner = svm_learner, task = regr.task,
                       resampling = holdout, measures = list(mae, mse))

Resampling: holdout
Measures:             mae       mse       
[Resample] iter 1:    2.5856057 20.7799511


Aggregated Result: mae.test.mean=2.5856057,mse.test.mean=20.7799511




In [27]:
set.seed(2019)
res_cv = resample(svm_learner, regr.task, cv, mae)

Resampling: cross-validation
Measures:             mae       
[Resample] iter 1:    2.2784589 
[Resample] iter 2:    1.8464070 
[Resample] iter 3:    2.4946917 
[Resample] iter 4:    2.1577096 
[Resample] iter 5:    2.3452422 


Aggregated Result: mae.test.mean=2.2245019
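
The aggregated value is the plain mean of the per-fold test errors. A short check in base R, with the five values copied from the log above:

```r
# Test MAE of each of the 5 folds, as printed by resample()
fold_mae = c(2.2784589, 1.8464070, 2.4946917, 2.1577096, 2.3452422)

# mae.test.mean is the average over the folds
aggregated_mae = mean(fold_mae)
round(aggregated_mae, 7)
```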




## 3.1 With hyperparameter tuning

In [28]:
parameters_svm = makeParamSet(makeNumericParam("cost",lower = 0.1,
                                              upper = 1),
                             makeNumericParam("gamma", lower = 0.1,
                                              upper = 1))

In [29]:
parameters_svm

         Type len Def   Constr Req Tunable Trafo
cost  numeric   -   - 0.1 to 1   -    TRUE     -
gamma numeric   -   - 0.1 to 1   -    TRUE     -

Define the search strategy; we will use `random search`. More details at this [link](https://mlr.mlr-org.com/articles/tutorial/tune.html).

In [31]:
ctrl  = makeTuneControlRandom(maxit = 100)
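
Conceptually, random search draws each candidate uniformly from the parameter ranges, evaluates it, and keeps the best one. The base R sketch below illustrates that idea; `toy_loss` is a hypothetical stand-in for the cross-validated MAE, not something mlr provides.

```r
set.seed(2019)
n_iter = 100  # same budget as maxit = 100

# Draw 100 (cost, gamma) pairs uniformly from the ranges in the param set
candidates = data.frame(
  cost = runif(n_iter, min = 0.1, max = 1),
  gamma = runif(n_iter, min = 0.1, max = 1)
)

# Hypothetical objective; in mlr this role is played by the resampled measure
toy_loss = function(cost, gamma) (cost - 0.9)^2 + (gamma - 0.15)^2
candidates$loss = with(candidates, toy_loss(cost, gamma))

# The tuning result is the candidate with the lowest loss
best = candidates[which.min(candidates$loss), ]
best
```

Unlike grid search, the budget (`n_iter`) is chosen independently of the number of parameters being tuned.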

In [34]:
# `tr` is created by the tuneParams() call in cell In [32] below; run that cell first
tr$x

In [37]:
tr$resampling

Resample instance for 506 cases.
Resample description: cross-validation with 5 iterations.
Predict: test
Stratification: FALSE

In [39]:
getLearnerParamSet(svm_learner)

                   Type  len            Def                           Constr
type           discrete    - eps-regression     eps-regression,nu-regression
kernel         discrete    -         radial linear,polynomial,radial,sigmoid
degree          integer    -              3                         1 to Inf
gamma           numeric    -              -                         0 to Inf
coef0           numeric    -              0                      -Inf to Inf
cost            numeric    -              1                         0 to Inf
nu              numeric    -            0.5                      -Inf to Inf
cachesize       numeric    -             40                      -Inf to Inf
tolerance       numeric    -          0.001                         0 to Inf
epsilon         numeric    -              -                         0 to Inf
shrinking       logical    -           TRUE                                -
cross           integer    -              0                         0 to Inf

In [32]:
set.seed(2019)
tr = tuneParams(svm_learner, regr.task, cv, mae, parameters_svm, 
               ctrl)

[Tune] Started tuning learner regr.svm for parameter set:
         Type len Def   Constr Req Tunable Trafo
cost  numeric   -   - 0.1 to 1   -    TRUE     -
gamma numeric   -   - 0.1 to 1   -    TRUE     -
With control class: TuneControlRandom
Imputation value: Inf
[Tune-x] 1: cost=0.802; gamma=0.7
[Tune-y] 1: mae.test.mean=3.3302249; time: 0.0 min
[Tune-x] 2: cost=0.409; gamma=0.713
[Tune-y] 2: mae.test.mean=3.7629726; time: 0.0 min
[Tune-x] 3: cost=0.542; gamma=0.509
[Tune-y] 3: mae.test.mean=3.2051902; time: 0.0 min
[Tune-x] 4: cost=0.4; gamma=0.742
[Tune-y] 4: mae.test.mean=3.8274946; time: 0.0 min
[Tune-x] 5: cost=0.529; gamma=0.75
[Tune-y] 5: mae.test.mean=3.6524673; time: 0.0 min
[Tune-x] 6: cost=0.818; gamma=0.51
[Tune-y] 6: mae.test.mean=2.9913404; time: 0.0 min
[Tune-x] 7: cost=0.898; gamma=0.32
[Tune-y] 7: mae.test.mean=2.5550492; time: 0.0 min
[Tune-x] 8: cost=0.428; gamma=0.907
[Tune-y] 8: mae.test.mean=4.0515848; time: 0.0 min
[Tune-x] 9: cost=0.692; gamma=0.142
[Tune-y] 9

[Tune-x] 91: cost=0.245; gamma=0.986
[Tune-y] 91: mae.test.mean=4.6019590; time: 0.0 min
[Tune-x] 92: cost=0.449; gamma=0.162
[Tune-y] 92: mae.test.mean=2.5496946; time: 0.0 min
[Tune-x] 93: cost=0.691; gamma=0.765
[Tune-y] 93: mae.test.mean=3.5130059; time: 0.0 min
[Tune-x] 94: cost=0.92; gamma=0.518
[Tune-y] 94: mae.test.mean=2.9531818; time: 0.0 min
[Tune-x] 95: cost=0.239; gamma=0.937
[Tune-y] 95: mae.test.mean=4.5571643; time: 0.0 min
[Tune-x] 96: cost=0.589; gamma=0.945
[Tune-y] 96: mae.test.mean=3.8780895; time: 0.0 min
[Tune-x] 97: cost=0.377; gamma=0.834
[Tune-y] 97: mae.test.mean=4.0328014; time: 0.0 min
[Tune-x] 98: cost=0.239; gamma=0.817
[Tune-y] 98: mae.test.mean=4.3796056; time: 0.0 min
[Tune-x] 99: cost=0.425; gamma=0.639
[Tune-y] 99: mae.test.mean=3.6075700; time: 0.0 min
[Tune-x] 100: cost=0.924; gamma=0.99
[Tune-y] 100: mae.test.mean=3.6698739; time: 0.0 min
[Tune] Result: cost=0.946; gamma=0.139 : mae.test.mean=2.2119958


Best hyperparameters:

In [40]:
tr$x
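
Once tuning is done, the selected values can be set on the learner to fit a final model. This is a minimal sketch, assuming `mlr`, `mlbench` and `e1071` are installed; the values are copied from the `[Tune] Result` line above instead of being read from `tr$x`, so that the block runs on its own.

```r
library(mlr)
library(mlbench)

data(BostonHousing)
task = makeRegrTask(data = BostonHousing, target = "medv")

# Values reported by the tuning log; with a live result use par.vals = tr$x
best_pars = list(cost = 0.946, gamma = 0.139)
tuned_learner = setHyperPars(makeLearner("regr.svm"), par.vals = best_pars)

# Fit on the whole task and predict the same rows
model = train(tuned_learner, task)
pred = predict(model, task = task)

# MAE on the training data, which is an optimistic estimate
performance(pred, measures = mae)
```

The cross-validated MAE from the tuning log remains the more honest estimate, because here the model is scored on rows it has already seen.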

# Now it's your turn!


![your_turn](imgs/avengers.jpg)




## Do the same with the [Soybean](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/Soybean) dataset from the mlbench package.

### Follow the instructions below:

1. Create a holdout set and DO NOT USE IT DURING CROSS VALIDATION
2. We will compare `xgboost` and `svm`
3. Create a learner for each technique
4. Use CV with 5 folds as the resampling technique
5. Use random search with 100 iterations as the tuning control
6. Find the best hyperparameters for each technique
7. At the end, we will train a model with the best hyperparameters and test it on the set held out in item 1 to compare the performance of the two

## 0. Creating dummy features (0 and 1 for categorical variables)

In [48]:
data(Soybean,package = 'mlbench')
soy = createDummyFeatures(Soybean,target = "Class")
dim(Soybean)
dim(soy)
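
Dummy encoding replaces each factor with one 0/1 column per level. The base R sketch below shows the idea on a made-up data frame (`toy` and its values are hypothetical); `model.matrix()` is used here for illustration and is not necessarily how `createDummyFeatures()` works internally.

```r
# Toy data with one categorical feature and a class label
toy = data.frame(
  leaves = factor(c("norm", "abnorm", "norm")),
  Class = factor(c("a", "b", "a"))
)

# "- 1" drops the intercept so that every level gets its own 0/1 column
dummies = model.matrix(~ leaves - 1, data = toy)
dummies
```

Each row has exactly one 1 among the columns created from a given factor, which is why the number of columns grows from `dim(Soybean)` to `dim(soy)`.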


In [52]:
table(soy$Class)


               2-4-d-injury         alternarialeaf-spot 
                         16                          91 
                anthracnose            bacterial-blight 
                         44                          20 
          bacterial-pustule                  brown-spot 
                         20                          92 
             brown-stem-rot                charcoal-rot 
                         44                          20 
              cyst-nematode diaporthe-pod-&-stem-blight 
                         14                          15 
      diaporthe-stem-canker                downy-mildew 
                         20                          20 
         frog-eye-leaf-spot            herbicide-injury 
                         91                           8 
     phyllosticta-leaf-spot            phytophthora-rot 
                         20                          88 
             powdery-mildew           purple-seed-stain 
                         20   

In [53]:
dim(drop_na(soy))

In [65]:
task = makeClassifTask(data = drop_na(soy), target = 'Class')
set.seed(25)
holdout = makeResampleInstance("Holdout",task, split = 0.7)
tsk_train = subsetTask(task, holdout$train.inds[[1]])
tsk_test = subsetTask(task, holdout$test.inds[[1]])

“Target column 'Class' contains empty factor levels”
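
A quick way to confirm that the holdout rows never leak into the training task is to check that the two index sets are disjoint. This is a minimal sketch, assuming `mlr` is installed; `iris.task`, which ships with mlr, stands in for the Soybean task so the block runs on its own.

```r
library(mlr)

set.seed(25)
# Same call pattern as above, applied to a task bundled with mlr
h = makeResampleInstance("Holdout", iris.task, split = 0.7)
train_inds = h$train.inds[[1]]
test_inds = h$test.inds[[1]]

# No index may appear on both sides of the split
n_shared = length(intersect(train_inds, test_inds))
n_shared

# Together the two sets cover every observation of the task
n_total = length(train_inds) + length(test_inds)
n_total

# The training sub-task contains exactly the training rows
tsk_train_demo = subsetTask(iris.task, train_inds)
getTaskSize(tsk_train_demo)
```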

In [66]:
tsk_train

Supervised task: drop_na(soy)
Type: classif
Target: Class
Observations: 393
Features:
   numerics     factors     ordered functionals 
         99           0           0           0 
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
Classes: 15
   alternarialeaf-spot            anthracnose       bacterial-blight 
                    64                     31                     16 
     bacterial-pustule             brown-spot         brown-stem-rot 
                    14                     62                     28 
          charcoal-rot  diaporthe-stem-canker           downy-mildew 
                    14                     15                     14 
    frog-eye-leaf-spot phyllosticta-leaf-spot       phytophthora-rot 
                    68                     13                     14 
        powdery-mildew      purple-seed-stain   rhizoctonia-root-rot 
                    15                     13                     12 
Positive class: NA

In [64]:
holdout$test.inds[[1]]

In [63]:
length(holdout$train.inds[[1]])

“NA used as a default value for learner parameter missing.
ParamHelpers uses NA as a special value for dependent parameters.”

[Tune] Started tuning learner classif.xgboost for parameter set:
             Type len Def   Constr Req Tunable Trafo
eta       numeric   -   -   0 to 1   -    TRUE     -
lambda    numeric   -   - 0 to 200   -    TRUE     -
max_depth integer   -   -  1 to 20   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
[Tune-x] 1: eta=0.447; lambda=154; max_depth=9
[Tune-y] 1: acc.test.mean=0.8192795; time: 0.0 min
[Tune-x] 2: eta=0.965; lambda=131; max_depth=8
[Tune-y] 2: acc.test.mean=0.8498215; time: 0.0 min
[Tune-x] 3: eta=0.387; lambda=183; max_depth=10
[Tune-y] 3: acc.test.mean=0.8141837; time: 0.0 min
[Tune-x] 4: eta=0.00539; lambda=87.1; max_depth=3
[Tune-y] 4: acc.test.mean=0.7122038; time: 0.0 min
[Tune-x] 5: eta=0.112; lambda=91.3; max_depth=15
[Tune-y] 5: acc.test.mean=0.8270042; time: 0.0 min
[Tune-x] 6: eta=0.992; lambda=14.1; max_depth=9
[Tune-y] 6: acc.test.mean=0.9059396; time: 0.0 min
[Tune-x] 7: eta=0.427; lambda=117; max_depth=17
[Tune-y] 7: acc.test.

[Tune-x] 79: eta=0.471; lambda=9.66; max_depth=17
[Tune-y] 79: acc.test.mean=0.8830250; time: 0.0 min
[Tune-x] 80: eta=0.261; lambda=120; max_depth=6
[Tune-y] 80: acc.test.mean=0.8167478; time: 0.0 min
[Tune-x] 81: eta=0.932; lambda=25.9; max_depth=18
[Tune-y] 81: acc.test.mean=0.8805583; time: 0.0 min
[Tune-x] 82: eta=0.524; lambda=174; max_depth=5
[Tune-y] 82: acc.test.mean=0.8218111; time: 0.0 min
[Tune-x] 83: eta=0.749; lambda=110; max_depth=1
[Tune-y] 83: acc.test.mean=0.6363194; time: 0.0 min
[Tune-x] 84: eta=0.369; lambda=140; max_depth=10
[Tune-y] 84: acc.test.mean=0.8192795; time: 0.0 min
[Tune-x] 85: eta=0.0632; lambda=130; max_depth=5
[Tune-y] 85: acc.test.mean=0.8218436; time: 0.0 min
[Tune-x] 86: eta=0.823; lambda=197; max_depth=16
[Tune-y] 86: acc.test.mean=0.8321000; time: 0.0 min
[Tune-x] 87: eta=0.787; lambda=183; max_depth=8
[Tune-y] 87: acc.test.mean=0.8346641; time: 0.0 min
[Tune-x] 88: eta=0.84; lambda=60.6; max_depth=14
[Tune-y] 88: acc.test.mean=0.8601428; time: 

In [None]:
tr_xgb
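
The cell that produced the xgboost tuning log is not shown. The sketch below is one way such a call could look, with the parameter ranges read from the log (`eta` 0 to 1, `lambda` 0 to 200, `max_depth` 1 to 20). It assumes `mlr`, `mlbench` and `xgboost` are installed and mutually compatible, uses `na.omit()` plus `droplevels()` for the data preparation, and sets `maxit = 5` only to keep the run short; the exercise asks for 100.

```r
library(mlr)
library(mlbench)

# Data preparation: dummy features, complete rows only, unused levels dropped
data(Soybean, package = "mlbench")
soy = createDummyFeatures(Soybean, target = "Class")
soy = droplevels(na.omit(soy))
task = makeClassifTask(data = soy, target = "Class")

# Hold out 30% of the rows; tuning only ever sees tsk_train
set.seed(25)
holdout = makeResampleInstance("Holdout", task, split = 0.7)
tsk_train = subsetTask(task, holdout$train.inds[[1]])

# Search space with the ranges shown in the tuning log
xgb_learner = makeLearner("classif.xgboost")
parameters_xgb = makeParamSet(
  makeNumericParam("eta", lower = 0, upper = 1),
  makeNumericParam("lambda", lower = 0, upper = 200),
  makeIntegerParam("max_depth", lower = 1, upper = 20)
)

# Assumption: maxit = 5 keeps this sketch fast; use maxit = 100 for the exercise
ctrl = makeTuneControlRandom(maxit = 5)
cv = makeResampleDesc("CV", iters = 5)

tr_xgb = tuneParams(learner = xgb_learner, task = tsk_train, resampling = cv,
                    measures = acc, par.set = parameters_xgb, control = ctrl)
tr_xgb$x
```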

[Tune] Started tuning learner classif.svm for parameter set:
         Type len Def   Constr Req Tunable Trafo
cost  numeric   -   - 0.1 to 1   -    TRUE     -
gamma numeric   -   - 0.1 to 1   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
Mapping in parallel: mode = multicore; cpus = 2; elements = 100.


In [None]:
 tr_svm
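
The line `Mapping in parallel: mode = multicore; cpus = 2` in the log suggests the SVM tuning ran through `parallelMap`. The sketch below shows how that could be set up, assuming `mlr`, `parallelMap` and `e1071` are installed; `iris.task` stands in for `tsk_train` to keep the block short, and multicore mode is not available on Windows (there `parallelStartSocket()` is the usual substitute).

```r
library(mlr)
library(parallelMap)

svm_learner = makeLearner("classif.svm")
parameters_svm = makeParamSet(
  makeNumericParam("cost", lower = 0.1, upper = 1),
  makeNumericParam("gamma", lower = 0.1, upper = 1)
)
ctrl = makeTuneControlRandom(maxit = 100)
cv = makeResampleDesc("CV", iters = 5)

# A parallel-safe random number generator helps keep results reproducible
set.seed(2019, kind = "L'Ecuyer-CMRG")

# Evaluate the candidate settings on 2 cores, then release the workers
parallelStartMulticore(cpus = 2)
tr_svm = tuneParams(learner = svm_learner, task = iris.task, resampling = cv,
                    measures = acc, par.set = parameters_svm, control = ctrl)
parallelStop()

tr_svm$x
```

Calling `parallelStop()` matters: without it, later mlr calls in the same session keep using the parallel backend.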

## Train on the full training set

## Test on the test set from step 1

## Accuracy of the two models

## Confusion matrix of the two models
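
The four closing steps (train, test, accuracy, confusion matrix) follow the same pattern for both learners. The sketch below shows that pattern for one SVM, assuming `mlr` and `e1071` are installed. `iris.task` stands in for the Soybean task, and the hyperparameter values are hypothetical placeholders for whatever `tr_svm$x` returns.

```r
library(mlr)

# Split once; the test task is used only for the final evaluation
set.seed(25)
h = makeResampleInstance("Holdout", iris.task, split = 0.7)
tsk_train = subsetTask(iris.task, h$train.inds[[1]])
tsk_test = subsetTask(iris.task, h$test.inds[[1]])

# Hypothetical "best" values; in the exercise use par.vals = tr_svm$x
best_svm = setHyperPars(makeLearner("classif.svm"),
                        par.vals = list(cost = 0.9, gamma = 0.1))

# Train on the full training set
model_svm = train(best_svm, tsk_train)

# Predict the held-out rows
pred_svm = predict(model_svm, task = tsk_test)

# Accuracy and confusion matrix on the held-out rows
acc_svm = performance(pred_svm, measures = acc)
cm_svm = calculateConfusionMatrix(pred_svm)
acc_svm
cm_svm
```

Repeating the same steps with the tuned xgboost learner gives a second accuracy and confusion matrix, computed on the same held-out rows, so the two models can be compared directly.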