UCI Machine Learning Repository
Bank Marketing dataset
http://archive.ics.uci.edu/ml/machine-learning-databases/00222/
bank.zip

Example 2: Defining a preprocessing function

In [1]:
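# Standardize every integer column to mean 0 and standard deviation 1.
makefeature <- function(x)
{
    # TRUE for columns whose class is "integer"
    is.num <- sapply(x,class) == "integer"
    # x[is.num] stays a data frame even when only one column matches;
    # as.numeric() drops the matrix attributes that scale() returns
    x[is.num] <- lapply(x[is.num], function(col) as.numeric(scale(col)))
    
    x
}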
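
As a quick check of what the preprocessing function does, the sketch below applies the same standardization to a small made-up data frame. The data frame `toy`, its columns, and the helper name `standardize_integers` are hypothetical and not part of the Bank Marketing data. Integer columns should end up with mean 0 and standard deviation 1, while other columns are left unchanged.

```r
# Hypothetical toy data: two integer columns and one character column.
toy <- data.frame(age = c(20L, 30L, 40L, 50L),
                  balance = c(0L, 100L, 200L, 700L),
                  job = c("a", "b", "a", "c"),
                  stringsAsFactors = FALSE)

# Same idea as makefeature(): standardize every integer column.
standardize_integers <- function(x) {
    is.int <- sapply(x, is.integer)
    x[is.int] <- lapply(x[is.int], function(col) as.numeric(scale(col)))
    x
}

toy.scaled <- standardize_integers(toy)
sapply(toy.scaled[c("age", "balance")], mean)  # both approximately 0
sapply(toy.scaled[c("age", "balance")], sd)    # both 1
toy.scaled$job                                 # unchanged
```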

P94

Example 3: Creating the training and test data

In [2]:
set.seed(123)

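# R >= 4.0 no longer converts strings to factors by default;
# the classifiers below need the response y to be a factor
bank <- read.csv('bank-full.csv',sep=";",stringsAsFactors=TRUE)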
head(bank,3)

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
1,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
2,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
3,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no


Checking the data type of each column

In [3]:
sapply(bank,class)

Running the preprocessing

In [4]:
bank.proceed <- makefeature(bank)

In [5]:
sapply(bank.proceed,class)

Splitting into training and test data

In [6]:
N <- nrow(bank)
inds.tr <- sample(seq(N), as.integer(0.7*N))
bank.train <- bank.proceed[inds.tr,]
bank.test <- bank.proceed[-inds.tr,]
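
The sizes produced by this 70/30 split can be cross-checked against numbers that appear later in this notebook: caret reports 31647 training samples, and each test confusion matrix sums to 13564. The sketch below works on row indices only, so it does not need the CSV file. The value of `N` is an assumption inferred from those two outputs, and the indices drawn here illustrate the pattern rather than reproduce the exact split above.

```r
# Assumption: N is inferred from outputs in this notebook
# (31647 training samples reported by caret, 13564 test predictions).
N <- 45211
n.train <- as.integer(0.7 * N)   # as.integer() truncates 31647.7 to 31647
n.test <- N - n.train

# Same sampling pattern as above, applied to row indices only.
set.seed(123)
inds.tr <- sample(seq(N), n.train)
inds.te <- setdiff(seq(N), inds.tr)  # what negative indexing keeps

c(train = n.train, test = n.test)
length(intersect(inds.tr, inds.te))  # 0: the two sets never overlap
```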

Example 4: Building and evaluating a predictive model with an RBF-kernel support vector machine

In [7]:
library(kernlab)

fit.svm <- ksvm(y~., data=bank.train)
fit.svm

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 1 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.0761707453834637 

Number of Support Vectors : 7106 

Objective Function Value : -6083.564 
Training error : 0.084463 
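
The printed model uses a Gaussian radial basis function kernel with hyperparameter sigma. As a reminder of what that means, this kernel has the form k(x, y) = exp(-sigma * ||x - y||^2). The base-R function below is an illustrative sketch, not kernlab's implementation; the name `rbf.kernel` and the example vectors are made up.

```r
# Gaussian RBF kernel: similarity decays with squared Euclidean distance.
rbf.kernel <- function(x, y, sigma) {
    exp(-sigma * sum((x - y)^2))
}

sigma <- 0.0761707453834637  # value reported in the model output
x <- c(0, 0)
y <- c(1, 2)                 # squared distance from x is 5

rbf.kernel(x, x, sigma)  # identical points give 1
rbf.kernel(x, y, sigma)  # exp(-5 * sigma), smaller for distant points
```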

Prediction on the test data

In [8]:
pred <- predict(fit.svm, bank.test)

Evaluating the predictions

In [9]:
(conf.mat <- table(pred, bank.test$y))

     
pred     no   yes
  no  11682  1041
  yes   302   539

Precision

In [10]:
(prec <- conf.mat["yes","yes"]/sum(conf.mat["yes",]))  # precision ratio

Recall

In [11]:
(rec <- conf.mat["yes","yes"]/sum(conf.mat[,"yes"]))  # recall ratio

F-value

In [12]:
(f.value <- 2 * prec * rec / ( prec + rec ))  # F-Value

Accuracy

In [13]:
(acc <- sum(diag(conf.mat))/sum(conf.mat))  # accuracy rate
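
The outputs of the four metric cells are not shown, but they can be recomputed from the confusion matrix printed for this model. The sketch below rebuilds that matrix by hand (rows are predictions, columns are actual labels) and applies the same formulas. The rounded values in the comments come from these counts only.

```r
# Confusion matrix printed for the default SVM (filled column by column).
conf.mat <- matrix(c(11682, 302, 1041, 539), nrow = 2,
                   dimnames = list(pred = c("no", "yes"),
                                   actual = c("no", "yes")))

prec <- conf.mat["yes", "yes"] / sum(conf.mat["yes", ])  # 539 / 841
rec <- conf.mat["yes", "yes"] / sum(conf.mat[, "yes"])   # 539 / 1580
f.value <- 2 * prec * rec / (prec + rec)
acc <- sum(diag(conf.mat)) / sum(conf.mat)               # 12221 / 13564

round(c(Precision = prec, Recall = rec, F = f.value, Accuracy = acc), 3)
# approximately 0.641, 0.341, 0.445, 0.901
```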

P95

Example 5: Building and evaluating a predictive model with a random forest

In [14]:
library(randomForest)
set.seed(123)

fit.rf <- randomForest(y~., data=bank.train)
fit.rf

randomForest 4.6-12
Type rfNews() to see new features/changes/bug fixes.



Call:
 randomForest(formula = y ~ ., data = bank.train) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 9.08%
Confusion matrix:
       no  yes class.error
no  26948  990  0.03543561
yes  1882 1827  0.50741440
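
The out-of-bag (OOB) error rate and the per-class errors in this output follow directly from the OOB confusion matrix, where rows are the actual classes and columns the predicted ones. The sketch below re-enters the printed counts by hand and recomputes both; the variable names are illustrative.

```r
# OOB confusion matrix as printed (rows: actual, columns: predicted).
oob <- matrix(c(26948, 1882, 990, 1827), nrow = 2,
              dimnames = list(actual = c("no", "yes"),
                              predicted = c("no", "yes")))

# Per-class error: share of each actual class that was misclassified.
class.error <- 1 - diag(oob) / rowSums(oob)
# Overall OOB error: share of all training rows that were misclassified.
oob.error <- 1 - sum(diag(oob)) / sum(oob)

class.error                # matches the class.error column
round(100 * oob.error, 2)  # 9.08, matching the reported OOB estimate
```

The error for the "yes" class is about 51%, far higher than the overall rate, which is why precision, recall, and the F-value are examined in addition to accuracy.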

Prediction on the test data

In [15]:
pred <- predict(fit.rf,bank.test)

Evaluating the predictions

In [16]:
(conf.mat <- table(pred,bank.test$y))

     
pred     no   yes
  no  11503   804
  yes   481   776

Precision

In [17]:
(prec <- conf.mat["yes","yes"]/sum(conf.mat["yes",]))  # precision ratio

Recall

In [18]:
(rec <- conf.mat["yes","yes"]/sum(conf.mat[,"yes"]))  # recall ratio

F-value

In [19]:
(f.value <- 2 * prec * rec / ( prec + rec ))  # F-Value

Accuracy

In [20]:
(acc <- sum(diag(conf.mat))/sum(conf.mat))  # accuracy rate

P96

Example 6: Writing a custom evaluation function that computes precision, recall, F-value, and accuracy

In [21]:
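# Summary function for caret's trainControl(): data has columns pred and obs.
# The positive class is assumed to be "yes".
my.summary <- function(data,lev=NULL,model=NULL) {
    conf <- table(data$pred, data$obs)           # rows: predicted, columns: observed
    prec <- conf["yes","yes"]/sum(conf["yes",])  # precision
    rec <- conf["yes","yes"]/sum(conf[,"yes"])   # recall
    f.value <- 2 * prec * rec / (prec+rec)       # harmonic mean of the two
    acc <- sum(diag(conf))/sum(conf)             # accuracy
    # The names are what train(metric=...) refers to
    out <- c(Precision=prec, Recall=rec, F=f.value, Accuracy=acc)
    out
}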
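
To see what the summary function returns before handing it to caret, it can be called directly on a data frame with `pred` and `obs` columns, which is the shape caret passes in. The sketch below repeats the definition so that it runs on its own; the eight-row `toy` data frame is made up for illustration.

```r
# Same definition as above, repeated so this snippet is self-contained.
my.summary <- function(data, lev = NULL, model = NULL) {
    conf <- table(data$pred, data$obs)
    prec <- conf["yes", "yes"] / sum(conf["yes", ])
    rec <- conf["yes", "yes"] / sum(conf[, "yes"])
    f.value <- 2 * prec * rec / (prec + rec)
    acc <- sum(diag(conf)) / sum(conf)
    c(Precision = prec, Recall = rec, F = f.value, Accuracy = acc)
}

# Hypothetical predictions: 2 true positives, 1 false positive,
# 1 false negative, 4 true negatives.
lv <- c("no", "yes")
toy <- data.frame(
    obs = factor(c("yes", "yes", "yes", "no", "no", "no", "no", "no"), levels = lv),
    pred = factor(c("yes", "yes", "no", "yes", "no", "no", "no", "no"), levels = lv))

out <- my.summary(toy)
out  # Precision, Recall and F are all 2/3; Accuracy is 0.75
```

Using factors with both levels declared keeps the table 2 x 2 even if one class is never predicted in a fold.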

Example 7: Building a predictive model with the F-value as the evaluation metric

In [22]:
library(caret)

Loading required package: lattice
Loading required package: ggplot2

Attaching package: 'ggplot2'

The following object is masked from 'package:randomForest':

    margin

The following object is masked from 'package:kernlab':

    alpha



In [23]:
set.seed(123)
fit.svm <- train(y~., data=bank.train, method="svmRadial", metric="F", 
                tuneGrid=expand.grid(.C=c(0.5,1.0), .sigma=c(0.05,0.1)),
                trControl=trainControl(summaryFunction=my.summary,method="cv",number=10))
fit.svm

Support Vector Machines with Radial Basis Function Kernel 

31647 samples
   16 predictor
    2 classes: 'no', 'yes' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 28482, 28483, 28482, 28482, 28484, 28482, ... 
Resampling results across tuning parameters:

  C    sigma  Precision  Recall     F          Accuracy   Precision SD
  0.5  0.05   0.6797595  0.2677278  0.3837624  0.8993270  0.03717759  
  0.5  0.10   0.6815328  0.1685124  0.2698623  0.8932602  0.05301848  
  1.0  0.05   0.6583357  0.3308130  0.4399156  0.9012859  0.04081755  
  1.0  0.10   0.6518828  0.2680010  0.3794032  0.8973046  0.04263180  
  Recall SD   F SD        Accuracy SD
  0.01915547  0.02198316  0.002995947
  0.01666804  0.02350491  0.002743951
  0.01701379  0.01949426  0.004046352
  0.01768989  0.02115640  0.003568473

F was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.05 and C = 1. 
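
The selection rule reported in this output (largest cross-validated F) can be reproduced by hand from the printed table. The sketch below builds the same four-row tuning grid with `expand.grid()` and picks the best row from the F-values copied from the table; the data frame `cv` is a manual transcription, not an object returned by caret.

```r
# The four candidate settings searched in Example 7.
grid <- expand.grid(C = c(0.5, 1.0), sigma = c(0.05, 0.1))
nrow(grid)  # 4

# Cross-validated F-values transcribed from the resampling table.
cv <- data.frame(C = c(0.5, 0.5, 1.0, 1.0),
                 sigma = c(0.05, 0.10, 0.05, 0.10),
                 F = c(0.3837624, 0.2698623, 0.4399156, 0.3794032))

# Pick the row with the largest F, as train() does for metric = "F".
best <- cv[which.max(cv$F), ]
best  # C = 1, sigma = 0.05
```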

P97

Example 8: Algorithms available in the caret package and their count

In [24]:
head(modelLookup())

,model,parameter,label,forReg,forClass,probModel
1,ada,iter,#Trees,FALSE,TRUE,TRUE
2,ada,maxdepth,Max Tree Depth,FALSE,TRUE,TRUE
3,ada,nu,Learning Rate,FALSE,TRUE,TRUE
4,AdaBag,mfinal,#Trees,FALSE,TRUE,TRUE
5,AdaBag,maxdepth,Max Tree Depth,FALSE,TRUE,TRUE
6,AdaBoost.M1,mfinal,#Trees,FALSE,TRUE,TRUE


In [25]:
packageVersion("caret")

[1] '6.0.64'

In [26]:
length(unique(modelLookup()$model))

P98

Example 9: The predictive model built on all of the training data with the optimal hyperparameters

In [27]:
fit.svm$finalModel

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 1 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.05 

Number of Support Vectors : 8447 

Objective Function Value : -5544.194 
Training error : 0.0686 

Example 10: Prediction and evaluation on the test data

In [28]:
pred <- predict(fit.svm,bank.test)
(conf.mat <- table(pred,bank.test$y))

     
pred     no   yes
  no  11674  1048
  yes   310   532

In [29]:
(prec <- conf.mat["yes","yes"]/sum(conf.mat["yes",]))  # precision ratio

In [30]:
(rec <- conf.mat["yes","yes"]/sum(conf.mat[,"yes"]))  # recall ratio

In [31]:
(f.value <- 2 * prec * rec / ( prec + rec ))  # F-Value

In [32]:
(acc <- sum(diag(conf.mat))/sum(conf.mat))  # accuracy rate
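
To put the three test-set confusion matrices in this notebook side by side, the sketch below recomputes the four metrics from the printed counts. The helper `metrics()` and the row labels are illustrative names; the counts are copied from the outputs of Examples 4, 5, and 10, with "yes" as the positive class.

```r
# Metrics from raw counts: true/false positives and negatives.
metrics <- function(tp, fp, fn, tn) {
    prec <- tp / (tp + fp)
    rec <- tp / (tp + fn)
    c(Precision = prec, Recall = rec,
      F = 2 * prec * rec / (prec + rec),
      Accuracy = (tp + tn) / (tp + fp + fn + tn))
}

# fp: predicted yes, actual no; fn: predicted no, actual yes.
results <- rbind(
    "SVM (default)" = metrics(tp = 539, fp = 302, fn = 1041, tn = 11682),
    "Random forest" = metrics(tp = 776, fp = 481, fn = 804, tn = 11503),
    "SVM (tuned)" = metrics(tp = 532, fp = 310, fn = 1048, tn = 11674))

round(results, 3)
```

On these counts the random forest has the highest F-value (about 0.547), while all three models have an accuracy of roughly 0.90, so accuracy alone does not separate them.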