# Ensemble methods. Exercises


This section has only two exercises:

1. Find the best three classifiers for the stacking method, using classifiers from the scikit-learn package.

2. Build the arcing arc-x4 method.

In [1]:
%store -r data_set
%store -r labels
%store -r test_data_set
%store -r test_labels
%store -r unique_labels

## Exercise 1: Find the best three classifiers in the stacking method

Please use the following classifiers:

* Linear regression,
* Nearest Neighbors,
* Linear SVM,
* Decision Tree,
* Naive Bayes,
* QDA.

In [2]:
import numpy as np
from sklearn.metrics import accuracy_score

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [3]:
def build_classifiers():
    
    linear_regression = LinearRegression()
    linear_regression.fit(data_set, labels)
    
    neighbors = KNeighborsClassifier()
    neighbors.fit(data_set, labels)

    svm = SVC()
    svm.fit(data_set,labels)
    
    decision_tree = DecisionTreeClassifier()
    decision_tree.fit(data_set,labels)
    
    naive_bayes = GaussianNB()
    naive_bayes.fit(data_set, labels)
    
    qda = QuadraticDiscriminantAnalysis()
    qda.fit(data_set, labels)


    return linear_regression, neighbors, svm, decision_tree, naive_bayes, qda

The idea is to add a second parameter to build_stacked_classifier, called *stacker*, which is the classifier used to combine the base predictions. We can then call the function inside a for loop that goes over all six candidate stackers. The parameter *classifiers* holds the three selected base classifiers. To enumerate every possible trio, we can use *combinations* from itertools in a second loop nested inside the first one.

Also, you told us that at least one combination gives an accuracy of 1, so I added a few lines that stop both loops as soon as such a combination is found. (I also ran it without these lines to check whether there are more, and there are plenty.)
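As a rough sketch of how large this search is, the snippet below enumerates the trios with `combinations`. The string names are placeholders standing in for the fitted classifiers; they are not objects from the notebook.

```python
from itertools import combinations

# Placeholder names standing in for the six fitted classifiers (illustrative only).
names = ["linear_regression", "neighbors", "svm", "decision_tree", "naive_bayes", "qda"]

# Every unordered choice of three base classifiers.
trios = list(combinations(names, r=3))

# Each trio is tried with each of the six candidate stackers.
total_runs = len(trios) * len(names)

print(len(trios))   # 20
print(total_runs)   # 120
print(trios[0])     # ('linear_regression', 'neighbors', 'svm')
```

So without the early stop, the two nested loops fit the stacker 120 times.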


In [4]:
def build_stacked_classifier(classifiers, stacker):
    # Base-classifier predictions on the training data, one column per
    # classifier: shape (n_samples, n_classifiers).
    output = np.column_stack([np.ravel(classifier.predict(data_set)) for classifier in classifiers])

    # stacked classifier part:
    stacked_classifier = stacker
    stacked_classifier.fit(output, np.ravel(labels))

    # Build the test features the same way, so each row describes one sample.
    test_set = np.column_stack([np.ravel(classifier.predict(test_data_set)) for classifier in classifiers])
    predicted = stacked_classifier.predict(test_set)
    return predicted
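
One detail worth checking is the layout of the stacking input: the stacker expects one row per sample and one column per base classifier. The sketch below uses made-up predictions (not the notebook's data) to compare `reshape` with `np.column_stack` on a list of per-classifier prediction vectors.

```python
import numpy as np

# Made-up predictions of three classifiers for four samples (illustrative only).
preds = [
    [1, 1, 0, 0],  # classifier A
    [0, 1, 0, 1],  # classifier B
    [0, 0, 1, 1],  # classifier C
]

# reshape keeps the flat element order, so a row mixes values from different samples.
reshaped = np.array(preds).reshape((4, 3))

# column_stack gives each classifier its own column: row i describes sample i.
stacked = np.column_stack(preds)

print(reshaped[0])  # [1 1 0] -> classifier A on samples 0, 1, 2
print(stacked[0])   # [1 0 0] -> classifiers A, B, C on sample 0
```

Both arrays have shape `(4, 3)`, but only the second one lines up each row with a single sample, which is the same as transposing the original array.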

In [5]:
from itertools import combinations

classifiers = build_classifiers()
stackers = [LinearRegression(), KNeighborsClassifier(), SVC(), DecisionTreeClassifier(),
            GaussianNB(), QuadraticDiscriminantAnalysis()]

classifier_combinations = list(combinations(classifiers, r=3))

accuracy_aux = 0
stop = False
for stacker in stackers:
    for combination in classifier_combinations:
        predicted = build_stacked_classifier(combination, stacker)
        # The int cast is only necessary for LinearRegression, which returns floats
        accuracy = accuracy_score(test_labels, [int(x) for x in predicted])
        if accuracy > accuracy_aux:
            accuracy_aux = accuracy
            best_classifier = [combination, stacker]
        if accuracy_aux == 1:
            stop = True
            break
    if stop:
        break
print("The highest accuracy", accuracy_aux)
print("The best three classifiers are:", best_classifier[0])
print("With the stacker:", best_classifier[1])


The highest accuracy 1.0
The best three classifiers are: (LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False), KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                     weights='uniform'), SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False))
With the stacker: GaussianNB(priors=None, var_smoothing=1e-09)


## Exercise 2: Build the arcing arc-x4 method

Use the boosting method and change the code to fulfill the following requirements:

* the weights should be calculated as:
$w_{n}^{(t+1)}=\frac{1+ I(y_{n}\neq h_{t}(x_{n}))}{\sum_{i=1}^{N}\left(1+I(y_{i}\neq h_{t}(x_{i}))\right)}$,
* the prediction is done with a voting method.
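
As a small worked example of the weight formula above, assume four samples of which only the second one is misclassified. The labels and predictions below are invented for illustration.

```python
# Invented labels and predictions: only sample 1 is misclassified.
y_true = [0, 1, 1, 0]
y_pred = [0, 0, 1, 0]

# Indicator I(y_n != h_t(x_n)): 1 for a mistake, 0 otherwise.
mistakes = [int(t != p) for t, p in zip(y_true, y_pred)]

# Numerator 1 + I(...), then divide by the sum so the weights add up to 1.
raw = [1 + m for m in mistakes]
new_weights = [r / sum(raw) for r in raw]

print(new_weights)  # [0.2, 0.4, 0.2, 0.2]
```

The misclassified sample ends up with twice the weight of each correctly classified one.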

In [6]:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# prepare data set

def generate_data(sample_number, feature_number, label_number):
    data_set = np.random.random_sample((sample_number, feature_number))
    labels = np.random.choice(label_number, sample_number)
    return data_set, labels

labels = 2
dimension = 2
test_set_size = 1000
train_set_size = 5000
train_set, train_labels = generate_data(train_set_size, dimension, labels)
test_set, test_labels = generate_data(test_set_size, dimension, labels)

# init weights
number_of_iterations = 10
weights = np.ones((test_set_size,)) / test_set_size


def train_model(classifier, weights):
    return classifier.fit(X=test_set, y=test_labels, sample_weight=weights)


def calculate_accuracy_vector(predicted, labels):
    # 1 where the prediction is wrong, 0 where it is correct,
    # i.e. the indicator I(y_n != h_t(x_n)) from the weight formula
    return [0 if p == y else 1 for p, y in zip(predicted, labels)]

def calculate_error(model):
    predicted = model.predict(test_set)
    I=calculate_accuracy_vector(predicted, test_labels)
    Z=np.sum(I)
    return (1+Z)/1.0

Fill in the function below:

In [7]:
def set_new_weights(model):
    # 1 + I(y_n != h_t(x_n)): misclassified samples get 2, the rest get 1
    new_weights = 1 + np.array(calculate_accuracy_vector(model.predict(test_set), test_labels))
    print(len(new_weights))  # not necessary, just for checking
    # normalize so that the weights sum to 1
    return new_weights / np.sum(new_weights)

Train the classifier with the code below:

In [8]:
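from sklearn.base import clone

classifier = DecisionTreeClassifier(max_depth=1, random_state=1)
classifier.fit(X=train_set, y=train_labels)
classifiers = []
for iteration in range(number_of_iterations):
    # fit a fresh copy each round; refitting one object would fill the list
    # with ten references to the same (last) model
    model = train_model(clone(classifier), weights)
    weights = set_new_weights(model)
    classifiers.append(model)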

print(weights)



1000
1000
1000
1000
1000
1000
1000
1000
1000
1000
[0.0013132 0.0013132 0.0013132 0.0013132 0.0006566 0.0013132 0.0013132
 0.0013132 0.0013132 0.0013132 0.0006566 0.0013132 0.0006566 0.0006566
 0.0013132 0.0013132 0.0006566 0.0013132 0.0013132 0.0013132 0.0006566
 0.0013132 0.0013132 0.0006566 0.0006566 0.0006566 0.0013132 0.0006566
 0.0013132 0.0013132 0.0006566 0.0013132 0.0013132 0.0006566 0.0006566
 0.0006566 0.0006566 0.0006566 0.0013132 0.0013132 0.0006566 0.0006566
 0.0013132 0.0013132 0.0013132 0.0006566 0.0013132 0.0006566 0.0013132
 0.0006566 0.0013132 0.0013132 0.0006566 0.0006566 0.0013132 0.0013132
 0.0013132 0.0013132 0.0013132 0.0013132 0.0013132 0.0006566 0.0013132
 0.0013132 0.0013132 0.0006566 0.0013132 0.0013132 0.0013132 0.0013132
 0.0006566 0.0013132 0.0013132 0.0006566 0.0013132 0.0006566 0.0006566
 0.0013132 0.0013132 0.0013132 0.0006566 0.0006566 0.0006566 0.0006566
 0.0006566 0.0013132 0.0006566 0.0006566 0.0006566 0.0013132 0.0006566
 0.0006566 0.0006566 0.0013

Set the validation data set:

In [9]:
validate_x, validate_label = generate_data(1, dimension, labels)

Fill in the prediction code:

In [10]:
def get_prediction(x):
    predictions = []
    for classifier in classifiers:
        predicted = classifier.predict(x)
        predictions.append(int(predicted[0]))  # x holds a single sample
    print(predictions)  # line NOT necessary, just for checking
    # majority vote: the most frequent label wins
    counts = np.bincount(predictions)
    return np.argmax(counts)
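
The voting rule itself can be sketched without any classifier. The helper name `majority_vote` and the votes below are invented for illustration. Ties go to the smallest label, which is also what `np.argmax` over `np.bincount` does.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label; ties go to the smallest label."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)

print(majority_vote([1, 0, 1, 1, 0]))  # 1
print(majority_vote([0, 1, 1, 0]))     # 0 (tie, smallest label wins)
```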

Test it:

In [11]:
prediction = get_prediction(validate_x)

print(prediction)

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
1
