# 7장 AdaBoost meta-algorithm


### 7.1 Classifiers using multiple samples of the dataset

#### Pros
- Low generalization error, easy to code, works with most classifiers, no parameters to adjust

#### Cons
- Sensitive to outliers

#### Works with
- Numeric values, nominal values

#### 7.1.1 Building classwifiers from randomly resampled data: bagging
- Bootstrap aggregating, which is known as bagging, is a technique where the data is taken from the original dataset S times to make S new datasets.
- The datasets are the same size as the original.
- Each dataset is built by randomly selecting an example from the original with replacement.
- By "with replacement" I mean that you can select the same example more than once.
- This property allows you to have values in the new dataset that are repeated, and some values from the original won't be present in the new set.

#### 7.1.2 Boosting

##### General approach to AdaBoost
1. Collect: Any method.
2. Prepare: It depends on which type of weak learner you’re going to use. In this
chapter, we’ll use decision stumps, which can take any type of data. You could
use any classifier, so any of the classifiers from chapters 2–6 would work. Simple
classifiers work better for a weak learner.
3. Analyze: Any method.
4. Train: The majority of the time will be spent here. The classifier will train the weak learner multiple times over the same dataset.
5. Test: Calculate the error rate.
6. Use: Like support vector machines, AdaBoost predicts one of two classes. If
you want to use it for classification involving more than two classes, then you’ll
need to apply some of the same methods as for support vector machines.


### 7.2 Train: Improving the classifier by focusing on errors

### 7.3 Creating a weak learner with a decision stump

In [3]:
from numpy import *

def loadSimpData():
    dataMat = matrix([[1., 2.1],
                     [2., 1.1],
                     [1.3, 1.],
                     [1., 1.],
                     [2., 1.]])
    classLabels = [1.0, 1.0, -1.0, -1.0, 1.0]
    return dataMat, classLabels

dataMat, classLabels = loadSimpData()
print dataMat, classLabels

[[ 1.   2.1]
 [ 2.   1.1]
 [ 1.3  1. ]
 [ 1.   1. ]
 [ 2.   1. ]] [1.0, 1.0, -1.0, -1.0, 1.0]


In [5]:
def stumpClassify(dataMatrix, dimen, threshVal, threshIneq):
    retArray = ones((shape(dataMatrix)[0], 1))
    if threshIneq == 'lt':
        retArray[dataMatrix[:,dimen] <= threshVal] = -1.0
    else:
        retArray[dataMatrix[:,dimen] > threshVal] = -1.0
    return retArray

def buildStump(dataArr,classLabels,D):
    dataMatrix = mat(dataArr); labelMat = mat(classLabels).T
    m,n = shape(dataMatrix)
    numSteps = 10.0; bestStump = {}; bestClasEst = mat(zeros((m,1)))
    minError = inf
    for i in range(n):
        rangeMin = dataMatrix[:,i].min(); rangeMax = dataMatrix[:,i].max();
        stepSize = (rangeMax-rangeMin)/numSteps
    for j in range(-1,int(numSteps)+1):
        for inequal in ['lt', 'gt']:
            threshVal = (rangeMin + float(j) * stepSize)
            predictedVals = stumpClassify(dataMatrix,i,threshVal,inequal)
            errArr = mat(ones((m,1)))
            errArr[predictedVals == labelMat] = 0
            weightedError = D.T*errArr
            print "split: dim %d, thresh %.2f, thresh ineqal: \
                %s, the weighted error is %.3f" %\
                (i, threshVal, inequal, weightedError)
            if weightedError < minError:
                minError = weightedError
                bestClasEst = predictedVals.copy()
                bestStump['dim'] = i
                bestStump['thresh'] = threshVal
                bestStump['ineq'] = inequal
    return bestStump,minError,bestClasEst

D = mat(ones((5,1))/5)
buildStump(dataMat, classLabels, D)

split: dim 1, thresh 0.89, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.00, thresh ineqal:                 lt, the weighted error is 0.200
split: dim 1, thresh 1.00, thresh ineqal:                 gt, the weighted error is 0.800
split: dim 1, thresh 1.11, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.11, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.22, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.22, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.33, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.33, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.44, thresh ineqal:                 lt, the weighted error is 0.400
split: dim

({'dim': 1, 'ineq': 'lt', 'thresh': 1.0}, matrix([[ 0.2]]), array([[ 1.],
        [ 1.],
        [-1.],
        [-1.],
        [-1.]]))

### 7.4 Implementing the full AdaBoost algorithm

In [8]:
def adaBoostTrainDS(dataArr,classLabels,numIt=40):
    weakClassArr = []
    m = shape(dataArr)[0]
    D = mat(ones((m,1))/m)
    aggClassEst = mat(zeros((m,1)))
    for i in range(numIt):
        bestStump,error,classEst = buildStump(dataArr,classLabels,D)
        print "D:",D.T
        alpha = float(0.5*log((1.0-error)/max(error,1e-16)))
        bestStump['alpha'] = alpha
        weakClassArr.append(bestStump)
        print "classEst: ",classEst.T
        expon = multiply(-1*alpha*mat(classLabels).T,classEst)
        D = multiply(D,exp(expon))
        D = D/D.sum()
        aggClassEst += alpha*classEst
        print "aggClassEst: ",aggClassEst.T
        aggErrors = multiply(sign(aggClassEst) != mat(classLabels).T,ones((m,1)))
        errorRate = aggErrors.sum()/m
        print "total error: ",errorRate,"\n"
        if errorRate == 0.0: break
    return weakClassArr

classifierArray = adaBoostTrainDS(dataMat, classLabels, 9)
print classifierArray

split: dim 1, thresh 0.89, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.00, thresh ineqal:                 lt, the weighted error is 0.200
split: dim 1, thresh 1.00, thresh ineqal:                 gt, the weighted error is 0.800
split: dim 1, thresh 1.11, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.11, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.22, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.22, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.33, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.33, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.44, thresh ineqal:                 lt, the weighted error is 0.400
split: dim

### 7.5 Tset: classifying with AdaBoost

In [9]:
def adaClassify(datToClass,classifierArr):
    dataMatrix = mat(datToClass)
    m = shape(dataMatrix)[0]
    aggClassEst = mat(zeros((m,1)))
    for i in range(len(classifierArr)):
        classEst = stumpClassify(dataMatrix,classifierArr[i]['dim'],\
            classifierArr[i]['thresh'],\
            classifierArr[i]['ineq'])
        aggClassEst += classifierArr[i]['alpha']*classEst
        print aggClassEst
    return sign(aggClassEst)

In [10]:
datArr, labelArr = loadSimpData()
classifierArr = adaBoostTrainDS(datArr, labelArr, 30)
adaClassify([0, 0], classifierArr)
adaClassify([[5, 5], [0, 0]], classifierArr)

split: dim 1, thresh 0.89, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 0.89, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.00, thresh ineqal:                 lt, the weighted error is 0.200
split: dim 1, thresh 1.00, thresh ineqal:                 gt, the weighted error is 0.800
split: dim 1, thresh 1.11, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.11, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.22, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.22, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.33, thresh ineqal:                 lt, the weighted error is 0.400
split: dim 1, thresh 1.33, thresh ineqal:                 gt, the weighted error is 0.600
split: dim 1, thresh 1.44, thresh ineqal:                 lt, the weighted error is 0.400
split: dim

matrix([[ 1.],
        [-1.]])

### 7.6 Example: AdaBoost on a difficult dataset

1. Collect: Text file provided.
2. Prepare: We need to make sure the class labels are +1 and -1, not 1 and 0.
3. Analyze: Manually inspect the data.
4. Train: We’ll train a series of classifiers on the data using the adaBoost-
TrainDS() function.
5. Test: We have two datasets. With no randomization, we can have an apples-toapples
comparison of the AdaBoost results versus the logistic regression results.
6. Use: We’ll look at the error rates in this example. But you could create a website
that asks a trainer for the horse’s symptoms and then predicts whether
the horse will live or die.

In [12]:
def loadDataSet(fileName):
    numFeat = len(open(fileName).readline().split('\t'))
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat-1):
            lineArr.append(float(curLine[i]))
        dataMat.append(lineArr)
        labelMat.append(float(curLine[-1]))
    return dataMat,labelMat

datArr, labelArr = loadDataSet('data/horseColicTraining2.txt')
classifierArray = adaBoostTrainDS(datArr, labelArr, 10)
testArr, testLabelArr = loadDataSet('data/horseColicTest2.txt')
prediction10 = adaClassify(testArr, classifierArray)
errArr = mat(ones((67, 1)))
errArr[prediction10 != mat(testLabelArr).T].sum()

split: dim 20, thresh -1.01, thresh ineqal:                 lt, the weighted error is 0.405
split: dim 20, thresh -1.01, thresh ineqal:                 gt, the weighted error is 0.595
split: dim 20, thresh 0.00, thresh ineqal:                 lt, the weighted error is 0.585
split: dim 20, thresh 0.00, thresh ineqal:                 gt, the weighted error is 0.415
split: dim 20, thresh 1.01, thresh ineqal:                 lt, the weighted error is 0.609
split: dim 20, thresh 1.01, thresh ineqal:                 gt, the weighted error is 0.391
split: dim 20, thresh 2.02, thresh ineqal:                 lt, the weighted error is 0.605
split: dim 20, thresh 2.02, thresh ineqal:                 gt, the weighted error is 0.395
split: dim 20, thresh 3.03, thresh ineqal:                 lt, the weighted error is 0.609
split: dim 20, thresh 3.03, thresh ineqal:                 gt, the weighted error is 0.391
split: dim 20, thresh 4.04, thresh ineqal:                 lt, the weighted error is 0.5

25.0

### 7.7 Classification imbalance

#### 7.7.1 Alternative performance metrics: precision, recall, and ROC

In [13]:
def plotROC(predStrengths, classLabels):
    import matplotlib.pyplot as plt
    cur = (1.0,1.0)
    ySum = 0.0
    numPosClas = sum(array(classLabels)==1.0)
    yStep = 1/float(numPosClas)
    xStep = 1/float(len(classLabels)-numPosClas)
    sortedIndicies = predStrengths.argsort()
    fig = plt.figure()
    fig.clf()
    ax = plt.subplot(111)
    for index in sortedIndicies.tolist()[0]:
        if classLabels[index] == 1.0:
            delX = 0; delY = yStep;
        else:
            delX = xStep; delY = 0;
        ySum += cur[1]
        ax.plot([cur[0],cur[0]-delX],[cur[1],cur[1]-delY], c='b')
    cur = (cur[0]-delX,cur[1]-delY)
    ax.plot([0,1],[0,1],'b--')
    plt.xlabel('False Positive Rate'); plt.ylabel('True Positive Rate')
    plt.title('ROC curve for AdaBoost Horse Colic Detection System')
    ax.axis([0,1,0,1])
    plt.show()
    print "the Area Under the Curve is: ",ySum*xStep

In [14]:
datArr, labelArr = loadDataSet('data/horseColicTraining2.txt')
classifierArray, aggClassEst = adaBoostTrainDS(datArr, labelArr, 10)
plotROC(aggClassEst.T, labelArr)

split: dim 20, thresh -1.01, thresh ineqal:                 lt, the weighted error is 0.405
split: dim 20, thresh -1.01, thresh ineqal:                 gt, the weighted error is 0.595
split: dim 20, thresh 0.00, thresh ineqal:                 lt, the weighted error is 0.585
split: dim 20, thresh 0.00, thresh ineqal:                 gt, the weighted error is 0.415
split: dim 20, thresh 1.01, thresh ineqal:                 lt, the weighted error is 0.609
split: dim 20, thresh 1.01, thresh ineqal:                 gt, the weighted error is 0.391
split: dim 20, thresh 2.02, thresh ineqal:                 lt, the weighted error is 0.605
split: dim 20, thresh 2.02, thresh ineqal:                 gt, the weighted error is 0.395
split: dim 20, thresh 3.03, thresh ineqal:                 lt, the weighted error is 0.609
split: dim 20, thresh 3.03, thresh ineqal:                 gt, the weighted error is 0.391
split: dim 20, thresh 4.04, thresh ineqal:                 lt, the weighted error is 0.5

ValueError: too many values to unpack

#### 7.7.2 Manipulating the classifier's decision with a cost function