# Bootstrapping and Permutation Tests
In class we learned how to perform bootstrapping and permutation tests. In today's lab we'll use both of these methods to 1) build a random forest almost from scratch, and 2) assess whether a machine learning model's performance is significantly different from luck or random chance. After this lab you should have a deeper understanding of how both of these methods are implemented by libraries in Python.

In [1]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
import random

In [2]:
breastCancer = datasets.load_breast_cancer()
X_all = breastCancer.data
# randomNumGen = np.random.RandomState(seed=23)
# E = randomNumGen.normal(size=X.shape, loc = 0.0, scale = 5)
# Add noisy data to the informative features to make the task harder
# X = X + E
y_all = breastCancer.target
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.25, random_state = 23)

In [3]:
# print(E.shape)
# print(E.mean())

## Part 1: Bootstrapping, Aggregation, and Random Forests
In this section you'll implement a random forest model. Don't worry, you'll have ample access to scikit-learn libraries, but you won't be able to use the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) because that would defeat the purpose of learning some of the ins and outs of this model.
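
As a warm-up, the bootstrapping step can be sketched with only the standard library: draw `n` row indices with replacement, so some rows repeat and others are left out. The helper name `bootstrap_indices`, the sample size, and the seed are assumptions made for illustration; they are not part of the lab code.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices uniformly with replacement (hypothetical helper)."""
    return rng.choices(range(n_samples), k=n_samples)

rng = random.Random(23)  # seed chosen arbitrarily so the sketch is reproducible
n_samples = 1000
idxs = bootstrap_indices(n_samples, rng)

# Rows that never appear in the sample are the "out-of-bag" rows
unique_fraction = len(set(idxs)) / n_samples
print(f"{len(idxs)} draws, {unique_fraction:.1%} of rows appear at least once")
```

For large `n`, roughly 63% of the rows (1 - 1/e) are expected to appear in any one bootstrap sample, which is why each estimator in the ensemble ends up seeing a slightly different training set.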

In [67]:
from sklearn.base import BaseEstimator, ClassifierMixin
# scikit-learn bootstrap
from sklearn.utils import resample
class ensembleLearner(BaseEstimator, ClassifierMixin):
    """Random-forest-style ensemble: bagged logistic regression classifiers"""

    def __init__(self, nEstimators):
        """
        Called when initializing the classifier
        """
        self.nEstimators = nEstimators



    def generateBootStrapData(self, X, y):
        bootStrapSampleIdxs = np.array(random.choices(np.arange(X.shape[0]), k = X.shape[0]))
        return(X[bootStrapSampleIdxs], y[bootStrapSampleIdxs])

    def fit(self, X, y):
        """
        Fit the classifier. All the "work" is done here.

        Each estimator is trained on its own bootstrap sample of (X, y).
        With a single estimator, the full training set is used as-is.
        """
        self.estimators_ = [LogisticRegression(solver = "liblinear") for i in range(self.nEstimators)]

        if self.nEstimators > 1:
            for estimator in self.estimators_:
                X_bootStrap, y_bootStrap = self.generateBootStrapData(X, y)
                estimator.fit(X_bootStrap, y_bootStrap)
        else:
            self.estimators_[0].fit(X, y)

        return(self)


    def predict(self, X):
        # Fail early with a clear message if fit() has not been called yet
        if not hasattr(self, "estimators_"):
            raise RuntimeError("You must train classifier before predicting data!")
        # One row of 0/1 predictions per estimator
        predictions = np.array([estimator.predict(X) for estimator in self.estimators_])
        # Majority vote: the mean is the fraction of estimators voting 1 (ties go to class 1)
        votes = np.mean(predictions, axis = 0)
        return((votes >= 0.5).astype(int))

    def score(self, X, y):
        predictions = self.predict(X)
        accuracy = np.mean(predictions == y)
        return(accuracy) 
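
The aggregation step in `predict` can be checked by hand on a toy array. The sketch below uses made-up votes (not output from the lab's models) to show how averaging 0/1 predictions and thresholding at 0.5 amounts to a majority vote, and that with an even number of estimators a tie resolves to class 1.

```python
import numpy as np

# Made-up 0/1 predictions: one row per estimator, one column per sample
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 1],
])

# The column mean is the fraction of estimators that voted for class 1
fraction_ones = votes.mean(axis=0)
majority = (fraction_ones >= 0.5).astype(int)
print(majority)  # [1 0 0 1]

# With two estimators, a 1-1 split has mean exactly 0.5, so `>=` sends the tie to class 1
tie_votes = np.array([[1, 0], [0, 0]])
tie_result = (tie_votes.mean(axis=0) >= 0.5).astype(int)
print(tie_result)  # [1 0]
```

The tie rule is worth keeping in mind when reading results for even values of `nEstimators`: with two estimators, the ensemble predicts class 1 whenever either estimator does.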

In [68]:
clf_ens = ensembleLearner(nEstimators = 10)
clf_ens.fit(X = X_train, y = y_train)
score = clf_ens.score(X_test, y_test)
print(score)

0.951048951048951


In [69]:
def testClassifier(X_train, y_train, X_test, y_test, nEstimators):
    allScores = []
    nRuns = 100
    for i in range(nRuns):
        clf_ens = ensembleLearner(nEstimators = nEstimators)
        clf_ens.fit(X = X_train, y = y_train)
        score = clf_ens.score(X_test, y_test)
        allScores.append(score)
    mean = np.mean(allScores)
    std = np.std(allScores)
    print("Average score for {} runs and {} estimators: {} (std:{})".format(nRuns, nEstimators, mean, std))

In [None]:
# Repeat the experiment for ensembles of increasing size
for nEstimators in [1, 2, 10, 20, 30, 40, 50, 100, 200, 500]:
    testClassifier(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, nEstimators = nEstimators)

Average score for 100 runs and 1 estimators: 0.9510489510489509 (std:1.1102230246251565e-16)
Average score for 100 runs and 2 estimators: 0.9516083916083916 (std:0.006900081892891739)
Average score for 100 runs and 10 estimators: 0.9532167832167832 (std:0.004276039369569347)
Average score for 100 runs and 20 estimators: 0.952867132867133 (std:0.0033711806414528462)
Average score for 100 runs and 30 estimators: 0.952027972027972 (std:0.0024264827374681722)
Average score for 100 runs and 40 estimators: 0.952027972027972 (std:0.002426482737468173)
Average score for 100 runs and 50 estimators: 0.952027972027972 (std:0.0024264827374681722)
Average score for 100 runs and 100 estimators: 0.9516083916083915 (std:0.0018971552400350359)
Average score for 100 runs and 200 estimators: 0.9510489510489509 (std:1.1102230246251565e-16)


## Part 2: Permutation Tests