# Bootstrapping and Permutation Tests
In class we learned how to perform bootstrapping and permutation tests. In today's lab we'll use both methods to 1) build a random forest almost from scratch, and 2) assess how significantly a machine learning model's performance differs from luck or random chance. After this lab you should have a deeper understanding of how both methods are implemented by libraries in Python.

In [73]:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
import random

In [89]:
breastCancer = datasets.load_breast_cancer()
X_all = breastCancer.data
randomNumGen = np.random.RandomState(seed=23)
E = randomNumGen.normal(size=X_all.shape, loc = 0.0, scale = 1000)
# Add noise to the informative features to make the task harder
X_all += E
y_all = breastCancer.target
X_train, X_test, y_train, y_test = train_test_split(X_all, y_all, test_size = 0.25, random_state = 23)

In [90]:
print(E.shape)
print(E.mean())

(569, 30)
-14.248636442310712


## Part 1: Bootstrapping, Aggregation, and Random Forests
In this section you'll implement a random forest model. Don't worry, you'll have ample access to scikit-learn libraries, but you won't be able to use the [RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html), because that would defeat the purpose of learning some of the ins and outs of this model.
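
Before building the ensemble, it may help to see a single bootstrap draw in isolation. The sketch below uses `sklearn.utils.resample`, which this notebook imports, on a small toy array; `X_toy` and `y_toy` are hypothetical names made up for illustration and are not part of the lab data.

```python
import numpy as np
from sklearn.utils import resample

# Toy data (hypothetical, for illustration only): 10 rows, 2 features.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2

# One bootstrap sample: same size as the original, drawn with replacement.
X_boot, y_boot = resample(
    X_toy, y_toy, replace=True, n_samples=X_toy.shape[0], random_state=0
)

# Rows stay paired with their labels, and some rows typically repeat,
# so the number of distinct rows is usually smaller than 10.
n_unique = len(np.unique(X_boot[:, 0]))
print(X_boot.shape, n_unique)
```

Because sampling is done with replacement, each bootstrap sample leaves out some of the original rows and repeats others, which is what makes the individual estimators in the ensemble differ from one another.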

In [77]:
from sklearn.base import BaseEstimator, ClassifierMixin
# scikit-learn bootstrap
from sklearn.utils import resample
class ensembleLearner(BaseEstimator, ClassifierMixin):
    """Random forest-style classifier: a bagged ensemble of logistic regressions"""

    def __init__(self, nEstimators):
        """
        Called when initializing the classifier
        """
        self.nEstimators = nEstimators



    def generateBootStrapData(self, X, y):
        """Draw a bootstrap sample: X.shape[0] rows sampled with replacement."""
        bootStrapSampleIdxs = np.array(random.choices(np.arange(X.shape[0]), k = X.shape[0]))
        return(X[bootStrapSampleIdxs], y[bootStrapSampleIdxs])

    def fit(self, X, y):
        """
        Fit the classifier. All the "work" is done here.

        With more than one estimator, each estimator is fit on its own
        bootstrap sample of (X, y). A single estimator is fit on the
        full training set instead.
        """
        self.estimators_ = [LogisticRegression(solver = "liblinear") for i in range(self.nEstimators)]

        if self.nEstimators > 1:
            for estimator in self.estimators_:
                X_bootStrap, y_bootStrap = self.generateBootStrapData(X, y)
                estimator.fit(X_bootStrap, y_bootStrap)
        else:
            self.estimators_[0].fit(X, y)

        return(self)


    def predict(self, X):
        # estimators_ only exists after fit() has been called
        if not hasattr(self, "estimators_"):
            raise RuntimeError("You must train the classifier before predicting data!")
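        # Each row holds one estimator's 0/1 predictions for every sample
        predictions = [estimator.predict(X) for estimator in self.estimators_]
        predictions = np.array(predictions)
        # Majority vote: average the votes and threshold at 0.5 (ties go to class 1)
        predictions = np.mean(predictions, axis = 0)
        predictions[predictions >= 0.5] = 1
        predictions[predictions < 0.5] = 0
        return(predictions)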

    def score(self, X, y):
        predictions = self.predict(X)
        accuracy = np.mean(predictions == y)
        return(accuracy) 
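
The aggregation step in `predict` amounts to a majority vote over the estimators. As a minimal sketch of that step on its own, the example below uses a small hand-written `votes` array (hypothetical values, not output from the lab's models) in place of real estimator predictions.

```python
import numpy as np

# Hypothetical 0/1 predictions from three estimators (rows) on four samples (columns).
votes = np.array([
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
])

# Fraction of estimators voting for class 1, per sample.
vote_share = votes.mean(axis=0)

# Majority vote; a tie (possible with an even number of estimators) goes to class 1.
majority = (vote_share >= 0.5).astype(int)
print(majority)  # [1 0 1 0]
```

Thresholding the mean at 0.5 gives the same result as counting votes for binary 0/1 labels, which is why the class can aggregate with a single `np.mean` call.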

In [91]:
clf_ens = ensembleLearner(nEstimators = 1)
clf_ens.fit(X = X_train, y = y_train)
score = clf_ens.score(X_test, y_test)
print(score)

0.6853146853146853


In [92]:
def testClassifier(X_train, y_train, X_test, y_test, nEstimators):
    allScores = []
    nRuns = 100
    for i in range(nRuns):
        clf_ens = ensembleLearner(nEstimators = nEstimators)
        clf_ens.fit(X = X_train, y = y_train)
        score = clf_ens.score(X_test, y_test)
        allScores.append(score)
    mean = np.mean(allScores)
    std = np.std(allScores)
    print("Average score for {} runs and {} estimators: {} (std:{})".format(nRuns, nEstimators, mean, std))

In [None]:
for i in [1, 10, 20, 30, 40, 50, 100]:
    testClassifier(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, nEstimators = i)

Average score for 100 runs and 1 estimators: 0.6853146853146849 (std:4.440892098500626e-16)
Average score for 100 runs and 10 estimators: 0.7004895104895104 (std:0.016039921588081783)
Average score for 100 runs and 20 estimators: 0.6989510489510488 (std:0.010759917869530858)
Average score for 100 runs and 30 estimators: 0.6950349650349651 (std:0.01105227001580774)
Average score for 100 runs and 40 estimators: 0.6958741258741257 (std:0.009036686060078005)
Average score for 100 runs and 50 estimators: 0.6931468531468529 (std:0.008637425262458455)


In [93]:
for i in np.arange(1, 20, 1):
    testClassifier(X_train = X_train, y_train = y_train, X_test = X_test, y_test = y_test, nEstimators = i)

Average score for 100 runs and 1 estimators: 0.6853146853146849 (std:4.440892098500626e-16)
Average score for 100 runs and 2 estimators: 0.705104895104895 (std:0.01727301038809905)
Average score for 100 runs and 3 estimators: 0.6916783216783216 (std:0.02005164733406758)
Average score for 100 runs and 4 estimators: 0.7093006993006994 (std:0.017890484651022238)
Average score for 100 runs and 5 estimators: 0.6941958041958043 (std:0.01633596967815023)
Average score for 100 runs and 6 estimators: 0.7055944055944056 (std:0.016592741982803746)
Average score for 100 runs and 7 estimators: 0.7008391608391605 (std:0.01424858423396874)
Average score for 100 runs and 8 estimators: 0.7037062937062936 (std:0.013292416884013862)
Average score for 100 runs and 9 estimators: 0.6976223776223776 (std:0.015132065694947117)
Average score for 100 runs and 10 estimators: 0.7022377622377621 (std:0.013745484145452281)
Average score for 100 runs and 11 estimators: 0.6988111888111888 (std:0.01484281178105426)
Av

## Part 2: Permutation Tests