In [1]:
library(tidyverse)
library(mlr)
library(mlbench)
library(e1071)
library(xgboost)
library(parallelMap)

# Count the number of missing values (NA) in each column of a data frame
count_na = function(df) {
    sapply(df, function(x) sum(is.na(x)))
}

## MLR: Machine Learning in R

Reference site: https://mlr.mlr-org.com/

The basic tutorial is worth reading first: see [this link](https://mlr.mlr-org.com/articles/tutorial/usecase_regression.html).

![workflow](imgs/Selection_047.png)

We will learn the workflow using the `BostonHousing` dataset, which is described in the [mlbench](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/BostonHousing) package documentation.

In [2]:
data(BostonHousing, package='mlbench')
df = BostonHousing

In [3]:
summary(BostonHousing)

      crim                zn             indus       chas         nox        
 Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850  
 1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490  
 Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380  
 Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547  
 3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240  
 Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710  
       rm             age              dis              rad        
 Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
 1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
 Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
 Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
 3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
 Max.   :8.780   Max.   :100.00   Max.   :12.1

## 1. Create the task

In [4]:
regr.task = makeRegrTask(data = BostonHousing, target = "medv")
regr.task

Supervised task: BostonHousing
Type: regr
Target: medv
Observations: 506
Features:
   numerics     factors     ordered functionals 
         12           1           0           0 
Missings: FALSE
Has weights: FALSE
Has blocking: FALSE
Has coordinates: FALSE
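
Once the task exists, a few accessor functions make it easy to check what mlr stored in it. The following is a minimal sketch that rebuilds the same task under a hypothetical name (`boston_task`, not used elsewhere in this notebook) and queries it.

```r
library(mlr)
data(BostonHousing, package = "mlbench")

# Hypothetical name, used only in this sketch
boston_task = makeRegrTask(data = BostonHousing, target = "medv")

getTaskSize(boston_task)           # number of observations
getTaskTargetNames(boston_task)    # name of the target column
getTaskFeatureNames(boston_task)   # names of the predictor columns
head(getTaskData(boston_task), 3)  # underlying data frame
```

These calls only read from the task, so they are safe to run at any point in the workflow.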

## 2. Define the learner

Check the available learners on the [site](https://mlr.mlr-org.com/articles/tutorial/integrated_learners.html).

In [5]:
svm_learner = makeLearner(cl='regr.svm', cost = 1)
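
Before tuning, it helps to see which hyperparameters a learner exposes and which ones are currently set. A small sketch, assuming the `e1071` package is installed and using a hypothetical learner name (`svm_sketch`):

```r
library(mlr)

# Hypothetical name, used only in this sketch
svm_sketch = makeLearner("regr.svm", cost = 1)

getHyperPars(svm_sketch)              # hyperparameters set explicitly on the learner
svm_params = getParamSet(svm_sketch)  # description of every parameter the learner accepts
getParamIds(svm_params)               # just the parameter names
```

The names returned by `getParamIds()` are the ones that can be used later when building a parameter set for tuning.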

## 3. Train the model

With the first two steps done, we can define the resampling strategy and train the model.

Here we create two strategies: `Holdout` and `Cross Validation` with 5 folds.

In [6]:
holdout = makeResampleDesc(method = 'Holdout', split= 0.7)
cv = makeResampleDesc(method='CV', iters = 5)

To train and evaluate the model under a resampling strategy, we use the `resample()` function.

In [7]:
res_holdout = resample(svm_learner, regr.task, holdout, list(mae,mse))

Resampling: holdout
Measures:             mae       mse       
[Resample] iter 1:    2.5040020 22.2283934


Aggregated Result: mae.test.mean=2.5040020,mse.test.mean=22.2283934




In [8]:
res_cv = resample(svm_learner, regr.task, cv, list(mae,mse))

Resampling: cross-validation
Measures:             mae       mse       
[Resample] iter 1:    2.4475795 13.8910410
[Resample] iter 2:    2.2686768 11.7204118
[Resample] iter 3:    2.3573254 18.8630761
[Resample] iter 4:    2.1473660 11.3037386
[Resample] iter 5:    2.1859449 13.5079347


Aggregated Result: mae.test.mean=2.2813785,mse.test.mean=13.8572405
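
The object returned by `resample()` keeps both the per-fold values and the aggregated ones shown in the log. The sketch below reruns a 5-fold CV under hypothetical names (`task_sketch`, `res_sketch`) with an arbitrary seed, so its numbers will not match the output above exactly.

```r
library(mlr)
data(BostonHousing, package = "mlbench")
set.seed(2019)  # arbitrary seed so the fold assignment is reproducible

# Hypothetical names, used only in this sketch
task_sketch = makeRegrTask(data = BostonHousing, target = "medv")
res_sketch = resample(
  makeLearner("regr.svm", cost = 1),
  task_sketch,
  makeResampleDesc("CV", iters = 5),
  measures = list(mae, mse),
  show.info = FALSE
)

res_sketch$measures.test  # one row per fold
res_sketch$aggr           # mean over the folds, as printed in "Aggregated Result"
```

Looking at `measures.test` is a quick way to see how much the error varies from fold to fold, which a single aggregated number hides.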




## 3.1 With hyperparameter tuning

In [9]:
?makeNumericParamSet

In [10]:
?svm

In [11]:
parameters_svm = makeParamSet(
    makeNumericParam("cost", lower=0.1, upper = 1),
    makeNumericParam("gamma", lower=0.1, upper = 1)
)

In [12]:
parameters_svm

         Type len Def   Constr Req Tunable Trafo
cost  numeric   -   - 0.1 to 1   -    TRUE     -
gamma numeric   -   - 0.1 to 1   -    TRUE     -
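
To get a feel for what a parameter set describes, one can draw a few random configurations from it. This is only an illustration in the spirit of random search, not necessarily the exact mechanism mlr uses internally; `ps_sketch` and `draws` are hypothetical names.

```r
library(ParamHelpers)
set.seed(1)  # arbitrary seed

# Same bounds as parameters_svm, rebuilt under a hypothetical name
ps_sketch = makeParamSet(
  makeNumericParam("cost", lower = 0.1, upper = 1),
  makeNumericParam("gamma", lower = 0.1, upper = 1)
)

# Draw three random configurations within the bounds
draws = sampleValues(ps_sketch, n = 3)
do.call(rbind, lapply(draws, as.data.frame))

# Check whether a hand-picked configuration respects the bounds
isFeasible(ps_sketch, list(cost = 0.5, gamma = 0.2))
```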

Next, define the search strategy; we will use `random search`. More details at this [link](https://mlr.mlr-org.com/articles/tutorial/tune.html).

In [13]:
ctrl = makeTuneControlRandom(maxit = 100)

In [14]:
tr = tuneParams(svm_learner, regr.task, cv, list(mae,mse), parameters_svm, ctrl)

[Tune] Started tuning learner regr.svm for parameter set:
         Type len Def   Constr Req Tunable Trafo
cost  numeric   -   - 0.1 to 1   -    TRUE     -
gamma numeric   -   - 0.1 to 1   -    TRUE     -
With control class: TuneControlRandom
Imputation value: Inf
Imputation value: Inf
[Tune-x] 1: cost=0.585; gamma=0.224
[Tune-y] 1: mae.test.mean=2.5453329,mse.test.mean=19.3417587; time: 0.0 min
[Tune-x] 2: cost=0.819; gamma=0.677
[Tune-y] 2: mae.test.mean=3.2692018,mse.test.mean=31.2177058; time: 0.0 min
[Tune-x] 3: cost=0.904; gamma=0.263
[Tune-y] 3: mae.test.mean=2.4378547,mse.test.mean=17.1055647; time: 0.0 min
[Tune-x] 4: cost=0.334; gamma=0.994
[Tune-y] 4: mae.test.mean=4.3189929,mse.test.mean=51.5324915; time: 0.0 min
[Tune-x] 5: cost=0.442; gamma=0.487
[Tune-y] 5: mae.test.mean=3.2613660,mse.test.mean=32.3009914; time: 0.0 min
[Tune-x] 6: cost=0.371; gamma=0.895
[Tune-y] 6: mae.test.mean=4.0952927,mse.test.mean=47.4855794; time: 0.0 min
[Tune-x] 7: cost=0.383; gamma=0.898
[Tune-

[Tune-x] 71: cost=0.443; gamma=0.245
[Tune-y] 71: mae.test.mean=2.7162507,mse.test.mean=22.5049382; time: 0.0 min
[Tune-x] 72: cost=0.858; gamma=0.118
[Tune-y] 72: mae.test.mean=2.3008972,mse.test.mean=14.6191993; time: 0.0 min
[Tune-x] 73: cost=0.284; gamma=0.619
[Tune-y] 73: mae.test.mean=3.8333863,mse.test.mean=43.4013363; time: 0.0 min
[Tune-x] 74: cost=0.255; gamma=0.284
[Tune-y] 74: mae.test.mean=3.1347271,mse.test.mean=30.3452159; time: 0.0 min
[Tune-x] 75: cost=0.87; gamma=0.349
[Tune-y] 75: mae.test.mean=2.6187838,mse.test.mean=19.8377688; time: 0.0 min
[Tune-x] 76: cost=0.665; gamma=0.358
[Tune-y] 76: mae.test.mean=2.7758032,mse.test.mean=22.8172348; time: 0.0 min
[Tune-x] 77: cost=0.229; gamma=0.326
[Tune-y] 77: mae.test.mean=3.3243715,mse.test.mean=33.9442621; time: 0.0 min
[Tune-x] 78: cost=0.567; gamma=0.598
[Tune-y] 78: mae.test.mean=3.3219370,mse.test.mean=33.1054721; time: 0.0 min
[Tune-x] 79: cost=0.168; gamma=0.825
[Tune-y] 79: mae.test.mean=4.6516303,mse.test.mean=5

Best hyperparameters:

In [15]:
tr$x
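
Besides the best configuration, the tuning result also stores its performance and the full search history. When several measures are passed, `tuneParams()` optimizes the first one (here `mae`) and only reports the others. The sketch below uses hypothetical names and just 5 iterations to keep it fast, so it is not a substitute for the 100-iteration run above.

```r
library(mlr)
data(BostonHousing, package = "mlbench")
set.seed(42)  # arbitrary seed

# Hypothetical names, used only in this sketch
task_sketch = makeRegrTask(data = BostonHousing, target = "medv")
lrn_sketch = makeLearner("regr.svm")
ps_sketch = makeParamSet(
  makeNumericParam("cost", lower = 0.1, upper = 1),
  makeNumericParam("gamma", lower = 0.1, upper = 1)
)

tr_small = tuneParams(lrn_sketch, task_sketch, cv5, list(mae, mse), ps_sketch,
                      makeTuneControlRandom(maxit = 5L), show.info = FALSE)

tr_small$x  # best hyperparameters found
tr_small$y  # their cross-validated performance

# Full search history, sorted by the optimized measure
path = as.data.frame(tr_small$opt.path)
path[order(path$mae.test.mean), c("cost", "gamma", "mae.test.mean", "mse.test.mean")]

# Plug the best values into the learner and fit it on the whole task
tuned_sketch = setHyperPars(lrn_sketch, par.vals = tr_small$x)
model_sketch = train(tuned_sketch, task_sketch)
```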

# Now it's your turn!


![your_turn](imgs/avengers.jpg)




## Do the same with the [Soybean](https://www.rdocumentation.org/packages/mlbench/versions/2.1-1/topics/Soybean) dataset from the mlbench package.

### Follow the instructions below:

1. Create a holdout set and DO NOT USE IT DURING CROSS-VALIDATION
2. We will compare `xgboost` and `svm`
3. Create a learner for each technique
4. Use 5-fold CV as the resampling strategy
5. Use random search with 100 iterations as the tuning control
6. Find the best hyperparameters for each technique
7. Finally, we will train one model per technique with its best hyperparameters and evaluate both on the holdout set from item 1 to compare their performance

## 0. Creating dummy features (0/1 encoding for categorical variables)

In [2]:
data(Soybean,package = 'mlbench')
soy = createDummyFeatures(Soybean,target="Class")
dim(Soybean)
dim(soy)
head(soy,3)
head(Soybean,3)

Class,date.0,date.1,date.2,date.3,date.4,date.5,date.6,plant.stand.0,plant.stand.1,⋯,mold.growth.1,seed.discolor.0,seed.discolor.1,seed.size.0,seed.size.1,shriveling.0,shriveling.1,roots.0,roots.1,roots.2
<fct>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
diaporthe-stem-canker,0,0,0,0,0,0,1,1,0,⋯,0,1,0,1,0,1,0,1,0,0
diaporthe-stem-canker,0,0,0,0,1,0,0,1,0,⋯,0,1,0,1,0,1,0,1,0,0
diaporthe-stem-canker,0,0,0,1,0,0,0,1,0,⋯,0,1,0,1,0,1,0,1,0,0


Class,date,plant.stand,precip,temp,hail,crop.hist,area.dam,sever,seed.tmt,⋯,int.discolor,sclerotia,fruit.pods,fruit.spots,seed,mold.growth,seed.discolor,seed.size,shriveling,roots
<fct>,<fct>,<ord>,<ord>,<ord>,<fct>,<fct>,<fct>,<fct>,<fct>,⋯,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>
diaporthe-stem-canker,6,0,2,1,0,1,1,1,0,⋯,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,4,0,2,1,0,2,0,2,1,⋯,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,3,0,2,1,0,1,0,2,1,⋯,0,0,0,4,0,0,0,0,0,0
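
The effect of `createDummyFeatures()` is easier to see on a tiny example. The data frame below is made up for illustration (`toy`, `label`, `color` and `size` are hypothetical names): the factor column is expanded into one 0/1 column per level, while the target and the numeric column are left alone.

```r
library(mlr)

# Made-up data, used only in this sketch
toy = data.frame(
  label = factor(c("a", "b", "a", "b")),
  color = factor(c("red", "green", "blue", "red")),
  size  = c(1.5, 2.0, 0.7, 3.1)
)

toy_dummy = createDummyFeatures(toy, target = "label")
names(toy_dummy)
toy_dummy
```

Each row has exactly one `1` among the `color.*` columns, which is why the Soybean table grows so much wider after encoding.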


In [3]:
tsk = makeClassifTask(data=drop_na(soy),target="Class")
ho = makeResampleInstance("Holdout",tsk, split = 0.7)
tsk.train = subsetTask(tsk,ho$train.inds[[1]])
tsk.test = subsetTask(tsk,ho$test.inds[[1]])

“Target column 'Class' contains empty factor levels”
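
A quick sanity check on the split confirms that no observation ends up in both sets. The sketch below repeats the split under hypothetical names with an arbitrary seed. It also applies `droplevels()` before creating the task, on the assumption that the warning above comes from classes that lose all their rows once incomplete cases are dropped; that step is optional and changes the set of class levels.

```r
library(mlr)
data(Soybean, package = "mlbench")
set.seed(1)  # arbitrary seed

# Hypothetical names, used only in this sketch
soy_complete = droplevels(na.omit(createDummyFeatures(Soybean, target = "Class")))
task_all = makeClassifTask(data = soy_complete, target = "Class")

split_sketch = makeResampleInstance("Holdout", task_all, split = 0.7)
train_idx = split_sketch$train.inds[[1]]
test_idx = split_sketch$test.inds[[1]]

length(intersect(train_idx, test_idx))  # overlap between the two sets
length(train_idx) + length(test_idx) == getTaskSize(task_all)
```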

In [18]:
ho$test.inds

In [19]:
ho$train.inds

In [4]:
xgb_learner = makeLearner("classif.xgboost", nrounds = 10) # 10 boosting rounds
# About the warning below: https://stackoverflow.com/questions/55545145/what-does-the-warning-na-used-as-a-default-value-for-learner-parameter-missing
svm_learner = makeLearner("classif.svm",scale=FALSE)
cv = makeResampleDesc("CV",iters=5)

“NA used as a default value for learner parameter missing.
ParamHelpers uses NA as a special value for dependent parameters.”

In [5]:
parameters_xgb = makeParamSet(makeNumericParam("eta",0,1),
                              makeNumericParam("lambda",0,200),
                              makeIntegerParam("max_depth",1,20))

parameters_svm = makeParamSet(makeNumericParam("cost", lower=0.1, upper = 1),
                              makeNumericParam("gamma", lower=0.1, upper = 1))

tc = makeTuneControlRandom(budget=100)

In [22]:
tr_xgb = tuneParams(xgb_learner,tsk.train,cv5,acc,parameters_xgb,tc)

[Tune] Started tuning learner classif.xgboost for parameter set:
             Type len Def   Constr Req Tunable Trafo
eta       numeric   -   -   0 to 1   -    TRUE     -
lambda    numeric   -   - 0 to 200   -    TRUE     -
max_depth integer   -   -  1 to 20   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
[Tune-x] 1: eta=0.447; lambda=154; max_depth=9
[Tune-y] 1: acc.test.mean=0.8192795; time: 0.0 min
[Tune-x] 2: eta=0.965; lambda=131; max_depth=8
[Tune-y] 2: acc.test.mean=0.8498215; time: 0.0 min
[Tune-x] 3: eta=0.387; lambda=183; max_depth=10
[Tune-y] 3: acc.test.mean=0.8141837; time: 0.0 min
[Tune-x] 4: eta=0.00539; lambda=87.1; max_depth=3
[Tune-y] 4: acc.test.mean=0.7122038; time: 0.0 min
[Tune-x] 5: eta=0.112; lambda=91.3; max_depth=15
[Tune-y] 5: acc.test.mean=0.8270042; time: 0.0 min
[Tune-x] 6: eta=0.992; lambda=14.1; max_depth=9
[Tune-y] 6: acc.test.mean=0.9059396; time: 0.0 min
[Tune-x] 7: eta=0.427; lambda=117; max_depth=17
[Tune-y] 7: acc.test.

[Tune-x] 79: eta=0.471; lambda=9.66; max_depth=17
[Tune-y] 79: acc.test.mean=0.8830250; time: 0.0 min
[Tune-x] 80: eta=0.261; lambda=120; max_depth=6
[Tune-y] 80: acc.test.mean=0.8167478; time: 0.0 min
[Tune-x] 81: eta=0.932; lambda=25.9; max_depth=18
[Tune-y] 81: acc.test.mean=0.8805583; time: 0.0 min
[Tune-x] 82: eta=0.524; lambda=174; max_depth=5
[Tune-y] 82: acc.test.mean=0.8218111; time: 0.0 min
[Tune-x] 83: eta=0.749; lambda=110; max_depth=1
[Tune-y] 83: acc.test.mean=0.6363194; time: 0.0 min
[Tune-x] 84: eta=0.369; lambda=140; max_depth=10
[Tune-y] 84: acc.test.mean=0.8192795; time: 0.0 min
[Tune-x] 85: eta=0.0632; lambda=130; max_depth=5
[Tune-y] 85: acc.test.mean=0.8218436; time: 0.0 min
[Tune-x] 86: eta=0.823; lambda=197; max_depth=16
[Tune-y] 86: acc.test.mean=0.8321000; time: 0.0 min
[Tune-x] 87: eta=0.787; lambda=183; max_depth=8
[Tune-y] 87: acc.test.mean=0.8346641; time: 0.0 min
[Tune-x] 88: eta=0.84; lambda=60.6; max_depth=14
[Tune-y] 88: acc.test.mean=0.8601428; time: 

In [None]:
tr_xgb

In [None]:
t0 = Sys.time()
tr_svm = tuneParams(svm_learner,tsk.train,cv5,list(acc),parameters_svm,tc)
t1 = Sys.time()

[Tune] Started tuning learner classif.svm for parameter set:
         Type len Def   Constr Req Tunable Trafo
cost  numeric   -   - 0.1 to 1   -    TRUE     -
gamma numeric   -   - 0.1 to 1   -    TRUE     -
With control class: TuneControlRandom
Imputation value: -0
Mapping in parallel: mode = multicore; cpus = 2; elements = 100.


In [None]:
t1-t0

In [None]:
tr_svm

In [None]:
tr_xgb$x

In [None]:
tr_svm$x

In [None]:
tuned_xgb = setHyperPars(xgb_learner,par.vals = tr_xgb$x)
tuned_svm = setHyperPars(svm_learner,par.vals = tr_svm$x)

## Train on the full training set

In [None]:
xgb_model = train(tuned_xgb,tsk.train)

svm_model = train(tuned_svm,tsk.train)

## Test on the holdout test set from step 1

In [None]:
xgb_pred = predict(xgb_model, tsk.test)

svm_pred = predict(svm_model, tsk.test)

## Accuracy of the two models

In [None]:
mean(xgb_pred$data$truth == xgb_pred$data$response)

In [None]:
mean(svm_pred$data$truth == svm_pred$data$response)
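
The same accuracy can be obtained with mlr's `performance()` function, which also makes it easy to request several measures at once. The sketch below uses `iris` and hypothetical names rather than the Soybean models, simply to stay small and self-contained.

```r
library(mlr)
set.seed(7)  # arbitrary seed

# Hypothetical names and a stand-in dataset, used only in this sketch
toy_task = makeClassifTask(data = iris, target = "Species")
toy_split = makeResampleInstance("Holdout", toy_task, split = 0.7)
toy_model = train(makeLearner("classif.svm"),
                  subsetTask(toy_task, toy_split$train.inds[[1]]))
toy_pred = predict(toy_model, subsetTask(toy_task, toy_split$test.inds[[1]]))

manual_acc = mean(toy_pred$data$truth == toy_pred$data$response)
mlr_perf = performance(toy_pred, measures = list(acc, mmce))

manual_acc
mlr_perf  # acc matches manual_acc; mmce is the misclassification rate (1 - acc)
```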

## Confusion matrix of the two models

In [None]:
cm_xgb = calculateConfusionMatrix(xgb_pred)
cm_svm = calculateConfusionMatrix(svm_pred)

In [None]:
cm_xgb$result

In [None]:
cm_svm$result
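
With many classes, absolute counts can be hard to compare, so `calculateConfusionMatrix()` can also return row-wise and column-wise proportions via `relative = TRUE`. A minimal sketch on `iris` with hypothetical names (true classes are in the rows, predicted classes in the columns):

```r
library(mlr)
set.seed(7)  # arbitrary seed

# Hypothetical names and a stand-in dataset, used only in this sketch
toy_task = makeClassifTask(data = iris, target = "Species")
toy_split = makeResampleInstance("Holdout", toy_task, split = 0.7)
toy_model = train(makeLearner("classif.svm"),
                  subsetTask(toy_task, toy_split$train.inds[[1]]))
toy_pred = predict(toy_model, subsetTask(toy_task, toy_split$test.inds[[1]]))

cm_sketch = calculateConfusionMatrix(toy_pred, relative = TRUE)
cm_sketch$result        # absolute counts, with "-err.-" margins
cm_sketch$relative.row  # proportions within each true class

# Correct predictions sit on the diagonal of the class-by-class block
classes = levels(iris$Species)
n_correct = sum(diag(cm_sketch$result[classes, classes]))
```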

## Combine train and test into a single `df`

In [None]:
train = read_csv("../data/train.csv", col_types=cols())
test = read_csv("../data/test.csv", col_types=cols())
df = bind_rows(train,test)

## Save the IDs to split the data again later

In [None]:
Id = test$Id

## Number of NAs per column

In [None]:
sapply(df, function(x) {sum(is.na(x))})
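
On a small made-up data frame the NA count is easy to verify by eye, and base R's `colSums(is.na(...))` gives the same numbers in one call. The names `toy_na`, `by_sapply` and `by_colsums` are hypothetical.

```r
# Made-up data, used only in this sketch
toy_na = data.frame(
  a = c(1, NA, 3),
  b = c("x", NA, NA),
  c = c(TRUE, FALSE, TRUE)
)

by_sapply = sapply(toy_na, function(x) sum(is.na(x)))
by_colsums = colSums(is.na(toy_na))

by_sapply
all.equal(by_sapply, by_colsums)  # both approaches agree
```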

## Separate categorical and numeric variables

In [None]:
glimpse(df)

In [None]:
chr = select_if(df,is.character)
dbl = select_if(df,is.numeric)
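
In newer dplyr releases (1.0.0 and later, which is an assumption about the reader's setup, since this notebook was run with dplyr 0.8), the same selection is usually written with `select()` and `where()`. A sketch on a made-up tibble whose column names are only illustrative:

```r
library(dplyr)

# Made-up data, used only in this sketch
toy_mixed = tibble(
  Id      = 1:3,
  Street  = c("Pave", "Grvl", NA),
  LotArea = c(8450, 9600, 11250),
  Alley   = c(NA, "Pave", NA)
)

chr_sketch = select(toy_mixed, where(is.character))
dbl_sketch = select(toy_mixed, where(is.numeric))

names(chr_sketch)
names(dbl_sketch)
```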

In [None]:
dim(chr)
dim(dbl)

### Handle the missing values of each type

In [None]:
chr[is.na(chr)] = 'Not Available'

In [None]:
sapply(dbl,function(x){sum(is.na(x))})

In [None]:
library(rpart)

In [None]:
na_lotArea = dbl %>% filter(is.na(LotFrontage))  # rows where LotFrontage is missing
dim(na_lotArea)

In [None]:
train_lotArea = dbl %>% filter(!is.na(LotFrontage))  # rows where LotFrontage is known
dim(train_lotArea)
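
Splitting the rows by whether `LotFrontage` is missing suggests model-based imputation: fit a model on the rows where the value is known and predict it for the rows where it is not. The sketch below shows that idea with `rpart` on synthetic data; the data, the single predictor `LotArea`, and all object names are assumptions made for illustration, not the notebook's actual procedure.

```r
library(rpart)
set.seed(123)  # arbitrary seed

# Synthetic data, used only in this sketch
toy = data.frame(LotArea = runif(200, 2000, 20000))
toy$LotFrontage = 20 + 0.005 * toy$LotArea + rnorm(200, sd = 5)
toy$LotFrontage[sample(200, 30)] = NA

known   = toy[!is.na(toy$LotFrontage), ]  # rows used to fit the tree
missing = toy[is.na(toy$LotFrontage), ]   # rows to be imputed

# Regression tree predicting LotFrontage from the other column
fit = rpart(LotFrontage ~ LotArea, data = known, method = "anova")

# Fill the gaps with the tree's predictions (row order is preserved)
toy$LotFrontage[is.na(toy$LotFrontage)] = predict(fit, newdata = missing)

sum(is.na(toy$LotFrontage))
```

A tree is a convenient choice here because it needs no scaling and handles non-linear relationships, but any regression learner could play the same role.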

In [None]:
library(e1071)