# School of Computing and Information Systems
### The University of Melbourne
### COMP30027 MACHINE LEARNING (Semester 1, 2019)

Practical exercises: Week 10
Following on from our examination of the simple voting ensemble method from last week:

# 1. 
Bagging is often associated with Decision Trees, but in scikit-learn it is a meta-classifier that
can be applied to any learner, for example:
```python
>>> from sklearn.ensemble import BaggingClassifier
>>> from sklearn.neighbors import KNeighborsClassifier
>>> bagging = BaggingClassifier(KNeighborsClassifier(),max_samples=0.5, max_features=0.5)
```

What is the significance of max_samples and max_features, and why might we wish to use
values less than 1.0?

Look at the documentation https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html.

The bagging classifier builds N base classifiers, each fitted on a random subset of the samples and features.
For each base classifier:
* Randomly select max_features * X.shape[1] of the features.
* Randomly select max_samples * X.shape[0] of the samples (drawn with replacement by default, since bootstrap=True).
* Create a new X_base from the selected features and samples, with y_base holding the matching labels.
* Fit the base classifier on X_base and y_base.

Then combine the base classifiers' predictions for X_test by voting or by averaging their predicted probabilities.

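**If max_features=1.0 and max_samples=1.0 and the samples are drawn without replacement (bootstrap=False), every base classifier is trained on the same data, so the base classifiers will be near-identical and there is no point in combining them. With the default bootstrap=True each base classifier still sees a different resample, but values below 1.0 make the base classifiers more diverse (and cheaper to train).**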
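The steps listed above can be sketched by hand. This is a minimal illustration, not scikit-learn's actual implementation: the helper names `fit_bagged` and `predict_bagged` are made up for this example, the data is synthetic, and the predictions are combined with a simple majority vote.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier


def fit_bagged(X, y, n_estimators=10, max_samples=0.5, max_features=0.5, seed=0):
    """Fit n_estimators kNN models, each on a random subset of rows and columns."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    k_samples = max(1, int(max_samples * n_samples))
    k_features = max(1, int(max_features * n_features))
    members = []
    for _ in range(n_estimators):
        # Rows are drawn with replacement (a bootstrap sample), columns without.
        rows = rng.choice(n_samples, size=k_samples, replace=True)
        cols = rng.choice(n_features, size=k_features, replace=False)
        clf = KNeighborsClassifier().fit(X[np.ix_(rows, cols)], y[rows])
        members.append((clf, cols))
    return members


def predict_bagged(members, X):
    """Combine the members' predictions with a majority vote."""
    votes = np.stack([clf.predict(X[:, cols]) for clf, cols in members])
    preds = []
    for column in votes.T:  # one column of votes per test instance
        values, counts = np.unique(column, return_counts=True)
        preds.append(values[np.argmax(counts)])
    return np.array(preds)


# Synthetic two-class data, used only to exercise the sketch.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)
members = fit_bagged(X_toy, y_toy)
preds = predict_bagged(members, X_toy)
print('features per base classifier:', [len(cols) for _, cols in members])
print('bagged train acc:', np.mean(preds == y_toy))
```

Each base classifier here sees 100 of the 200 rows and 5 of the 10 columns, which is what max_samples=0.5 and max_features=0.5 ask for.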

# 2. 
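Stacking is one strategy that scikit-learn didn't directly support when this worksheet was written (releases from 0.22 onwards provide sklearn.ensemble.StackingClassifier).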
It’s not too hard, though:
* We need to train each of our models (using fit() ),
* And then classify each training instance (using predict() ),
* We build up a matrix where each instance is composed of attributes that correspond to
the predictions of each model on that training instance.
* We then train our final learner on this matrix of predictions.
### (a)
Implement a stacking classifier.
### (b)
Think about which classifier is most suited to being the final meta-classifier in this situation.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

np.random.seed(1)


class StackingClassifier():

    def __init__(self, classifiers, metaclassifier):
        self.classifiers = classifiers
        self.metaclassifier = metaclassifier

    def fit(self, X, y):
        # Train every base classifier on the full training set.
        for clf in self.classifiers:
            clf.fit(X, y)
        # The meta-classifier is trained on the base classifiers' outputs.
        X_meta = self._predict_base(X)
        self.metaclassifier.fit(X_meta, y)
        return self

    def _predict_base(self, X):
        # One block of class probabilities per base classifier, joined column-wise.
        yhats = [clf.predict_proba(X) for clf in self.classifiers]
        yhats = np.concatenate(yhats, axis=1)
        assert yhats.shape[0] == X.shape[0]
        return yhats

    def predict(self, X):
        X_meta = self._predict_base(X)
        return self.metaclassifier.predict(X_meta)

    def score(self, X, y):
        yhat = self.predict(X)
        return accuracy_score(y, yhat)


classifiers = [LogisticRegression(),
               KNeighborsClassifier(),
               GaussianNB(),
               MultinomialNB(),
               DecisionTreeClassifier()]

meta_classifier = DecisionTreeClassifier()
stacker = StackingClassifier(classifiers, meta_classifier)


def load_car_data(car_file):
    X = []
    y = []
    with open(car_file, mode='r') as fin:
        for line in fin:
            atts = line.strip().split(",")
            X.append(atts[:-1])  # all attributes except the last one
            y.append(atts[-1])   # the last attribute is the class label
    # All attributes are categorical, so one-hot encode them.
    onehot = OneHotEncoder()
    X = onehot.fit_transform(X).toarray()
    return X, y


X, y = load_car_data('car.data')
print('labels:', set(y))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
stacker.fit(X_train, y_train)
print('stacker acc:', stacker.score(X_test, y_test))
```


```
labels: {'good', 'vgood', 'acc', 'unacc'}
stacker acc: 0.9544658493870403
```

A final meta-classifier for stacking should probably be non-linear, and because its input features (the base classifiers' predictions) are likely to be strongly dependent on one another, a Naive Bayes classifier might not be a good idea. So a Decision Tree or a *nonlinear* SVC is a natural choice.
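The StackingClassifier above trains its meta-classifier on predictions the base models make for the same instances they were fitted on, which can make those predictions look more reliable than they are. One common variant, sketched here on synthetic data rather than the car data, builds the meta-features from out-of-fold predictions with cross_val_predict. The variable names are illustrative, and this is one possible approach rather than the worksheet's intended solution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, used only to exercise the sketch.
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.33, random_state=42)

base_models = [LogisticRegression(max_iter=1000),
               KNeighborsClassifier(),
               GaussianNB()]

# Out-of-fold class probabilities: every training row is predicted by a
# copy of the model that did not see that row during fitting.
meta_train = np.hstack([
    cross_val_predict(clf, X_tr, y_tr, cv=5, method='predict_proba')
    for clf in base_models
])

# At test time the base models are refitted on all of the training data.
for clf in base_models:
    clf.fit(X_tr, y_tr)
meta_test = np.hstack([clf.predict_proba(X_te) for clf in base_models])

meta_model = DecisionTreeClassifier(random_state=0).fit(meta_train, y_tr)
stacked_acc = meta_model.score(meta_test, y_te)
print('meta-feature matrix shape:', meta_train.shape)
print('stacked acc (out-of-fold meta-features):', stacked_acc)
```

Swapping the DecisionTreeClassifier in the last step for another learner is an easy way to compare candidate meta-classifiers on the same meta-features.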

# 3. 
Remember the UCI Car Evaluation dataset from way back?
### (a) 
By examining the data, remind yourself in what way this is an artificial dataset. Hypothesise
what a Decision Tree classifier built on this dataset would probably look like.

The features are subjective and artificial, so a Decision Tree needs to be deeper than usual to predict well on the training data, which risks overfitting (poor performance on test data).
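One way to explore this hypothesis is to fit a tree on a rule-generated dataset and inspect its size. The sketch below does not use the car data: it enumerates every combination of six categorical attributes with 4, 4, 4, 3, 3 and 3 levels (1728 rows, the same shape as the UCI description of the car data) and labels them with `toy_label`, a made-up rule that is purely an assumption for illustration.

```python
from itertools import product

import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Every combination of six categorical attributes, coded as integers.
levels = [4, 4, 4, 3, 3, 3]
grid = np.array(list(product(*[range(k) for k in levels])))


def toy_label(row):
    """Hypothetical deterministic rule; NOT the rule behind the real car data."""
    buying, maint, doors, persons, lug_boot, safety = row
    if persons == 0 or safety == 0:
        return 'unacc'
    if buying + maint >= 5:
        return 'unacc'
    return 'acc' if lug_boot + safety < 3 else 'good'


toy_labels = [toy_label(row) for row in grid]
toy_X = OneHotEncoder().fit_transform(grid).toarray()

toy_tree = DecisionTreeClassifier(random_state=0).fit(toy_X, toy_labels)
print('rows:', grid.shape[0])
print('train acc:', toy_tree.score(toy_X, toy_labels))
print('depth:', toy_tree.get_depth(), 'leaves:', toy_tree.get_n_leaves())
```

Because the labels are a deterministic function of the attributes and no row is repeated, an unrestricted tree can reproduce the rule exactly on the training data. Calling get_depth() and get_n_leaves() on the tree fitted to the real car data gives the corresponding picture there.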

### (b) 
Compare the training accuracy and cross-validation accuracy of a Decision Tree Classifier
and a Random Forest on this dataset. What do you notice?
### (c) 
Based on your observations, hypothesise on what sorts of datasets Random Forests would be
more effective or less effective than the regular deterministic Decision Tree method.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

dt = DecisionTreeClassifier()
rf = RandomForestClassifier()
# The base estimator is passed positionally: the keyword was renamed from
# base_estimator to estimator in later scikit-learn releases.
bagging = BaggingClassifier(dt, max_samples=0.8, max_features=0.8)

dt.fit(X, y)
rf.fit(X, y)
bagging.fit(X, y)

print('dt train acc:', dt.score(X, y), 'rf train acc:', rf.score(X, y), 'bagging train acc', bagging.score(X, y))
cv = 30
print('dt cross-val acc:', np.mean(cross_val_score(dt, X, y, cv=cv)),
      'rf cross-val acc:', np.mean(cross_val_score(rf, X, y, cv=cv)),
      'bagging cross-val acc:', np.mean(cross_val_score(bagging, X, y, cv=cv)))
```



```
dt train acc: 1.0 rf train acc: 0.9976851851851852 bagging train acc 0.9971064814814815
dt cross-val acc: 0.9456285487730626 rf cross-val acc: 0.9068106939372588 bagging cross-val acc: 0.943726329509728
```
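The Random Forest above uses the library's default number of trees, which was 10 in scikit-learn releases before 0.22 and is 100 from 0.22 onwards, so the comparison may depend on the installed version. A minimal sketch of checking how the forest size affects cross-validation accuracy, on synthetic data and with illustrative variable names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data, used only to exercise the sketch.
X_syn, y_syn = make_classification(n_samples=500, n_features=20,
                                   n_informative=5, random_state=0)

forest_scores = {}
for n_trees in (1, 10, 100):
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    forest_scores[n_trees] = np.mean(cross_val_score(forest, X_syn, y_syn, cv=5))
    print('n_estimators =', n_trees, 'cross-val acc:', forest_scores[n_trees])
```

Running the same loop with the car data's X and y in place of the synthetic arrays would show whether the gap to the single Decision Tree narrows as trees are added.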
