In [1]:
import numpy as np

In this file, we will set up a pipeline for
1. Collecting data for an experiment
2. Processing the data
3. Running a machine learning algorithm
4. Evaluating the algorithm
5. Choosing the best algorithm and hyperparameters

The main questions we need to answer are
1. How will data be collected?
2. How much data would we need?
3. What algorithm would be best? 
4. How would the amount of data influence algorithm selection?
5. How robust is our procedure to assumptions?
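
Questions 2 and 4 are empirical: one way to probe them is to repeat the same experiment at several sample sizes and compare the average test accuracy. The block below is a minimal sketch under stated assumptions. `sample_two_gaussians` and `mean_accuracy` are hypothetical helpers, and the two-Gaussian simulator is a stand-in, not the `data_generator` module used later in this file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def sample_two_gaussians(n_data, rng):
    """Hypothetical stand-in simulator: two unit-variance Gaussian classes."""
    Y = rng.choice(2, size=n_data, p=[0.3, 0.7])
    # Class 1 is shifted by 1.5 along both axes relative to class 0.
    X = rng.normal(size=(n_data, 2)) + 1.5 * Y[:, None]
    return X, Y

def mean_accuracy(n_data, n_repeats=20, seed=0):
    """Average held-out accuracy of logistic regression over repeated draws."""
    rng = np.random.default_rng(seed)
    scores = []
    for r in range(n_repeats):
        X, Y = sample_two_gaussians(n_data, rng)
        X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=r)
        clf = LogisticRegression().fit(X_train, y_train)
        scores.append(clf.score(X_test, y_test))
    return float(np.mean(scores))

sample_sizes = [40, 160, 640]
accuracy_by_size = {n: mean_accuracy(n) for n in sample_sizes}
for n, acc in accuracy_by_size.items():
    print(n, round(acc, 3))
```

Running the same sweep for each candidate model would show whether the ranking of algorithms changes with the amount of data.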

# Data collection

Here we want to call a simulator that collects data for us. The simulator can be arbitrary, but we would normally wish to have a specific API for calling it. In the simplest case, the only input parameter is the amount of data.
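
To make the idea of a fixed simulator API concrete, here is a minimal sketch of a generator whose only run-time input is the amount of data. `ToyGaussianGenerator` is a hypothetical class written for illustration; it is an assumption about what such an interface could look like, not the implementation of the `data_generator` module imported below.

```python
import numpy as np

class ToyGaussianGenerator:
    """Hypothetical two-class generator exposing a single generate(n) call."""

    def __init__(self, n_features, class_probs, seed=None):
        self.n_features = n_features
        self.class_probs = np.asarray(class_probs, dtype=float)
        self.rng = np.random.default_rng(seed)
        # One mean vector per class, drawn once so the task stays fixed.
        self.means = self.rng.normal(size=(len(self.class_probs), n_features))

    def generate(self, n_data):
        """Return (X, Y): n_data feature rows and their integer class labels."""
        Y = self.rng.choice(len(self.class_probs), size=n_data, p=self.class_probs)
        X = self.means[Y] + self.rng.normal(size=(n_data, self.n_features))
        return X, Y

toy_generator = ToyGaussianGenerator(2, [0.3, 0.7], seed=0)
X_toy, Y_toy = toy_generator.generate(100)
print(X_toy.shape, Y_toy.shape)
```

Keeping the interface this narrow means any simulator with a `generate(n)` method can be swapped in without touching the rest of the pipeline.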

In [2]:
import data_generator

# Create the data generator; data is drawn from it later with generator.generate(n).
# In this case, the data generator gives us a random sample. Other sampling methodologies are possible, of course.
generator = data_generator.GaussianClassificationGenerator(2, np.array([0.3, 0.7]))

# Data processing

Here we perform preliminary processing of the data, if necessary. In particular, we may want to split the data into training, validation and test sets. Other standard pre-processing includes scaling the data, dealing with missing data, and removing problematic data points. However, all of these could in principle be handled within the learning algorithm itself.
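
As an illustration of the three-way split and scaling mentioned above, the following sketch holds out a test set, carves a validation set from the remainder, and fits a scaler on the training portion only. `split_and_scale` is a hypothetical helper and the 60/20/20 proportions are an assumption; the random demo data exists only to exercise the function.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def split_and_scale(X, Y, seed=0):
    """Hypothetical helper: 60/20/20 train/validation/test split with scaling."""
    # First hold out 20% as the test set, then 25% of the rest as validation.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, Y, test_size=0.2, random_state=seed)
    X_train, X_val, y_train, y_val = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=seed)
    # Fit the scaler on the training set only, so no information leaks
    # from the validation or test sets into the preprocessing.
    scaler = StandardScaler().fit(X_train)
    return (scaler.transform(X_train), scaler.transform(X_val),
            scaler.transform(X_test), y_train, y_val, y_test)

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=3.0, size=(100, 2))
Y_demo = rng.integers(0, 2, size=100)
X_tr, X_va, X_te, y_tr, y_va, y_te = split_and_scale(X_demo, Y_demo)
print(X_tr.shape, X_va.shape, X_te.shape)
```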

In [3]:
import sklearn.model_selection

def process_data(X, Y):
    # Split into training and test sets; by default 25% of the data is held out for testing.
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y)
    return X_train, X_test, y_train, y_test


# Evaluation

Here we fit an ML algorithm on the training data to find appropriate parameters, then measure its accuracy on the test data. It is best to have a unified interface for this as well.

In [4]:
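from sklearn.metrics import accuracy_score

def Evaluate(clf, X_train, y_train, X_test, y_test):
    # Fit on the training set, then score predictions on the held-out test set.
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # accuracy_score expects (y_true, y_pred).
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy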

In [5]:
# The experiment setup

In [25]:
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

n_data = 100         # samples drawn per experiment
n_experiments = 100  # independent repetitions
models = [svm.SVC(gamma=0.01, C=100),
          LogisticRegression(random_state=0),
          MLPClassifier(random_state=1, max_iter=300)]
n_models = len(models)
# score[k, i] is the test accuracy of model k in experiment i.
score = np.zeros([n_models, n_experiments])

for i in range(n_experiments):
    print("Experiment ", i)
    print("------------")
    X, Y = generator.generate(n_data)
    X_train, X_test, y_train, y_test = process_data(X, Y)
    for k, model in enumerate(models):
        score[k, i] = Evaluate(model, X_train, y_train, X_test, y_test)
        print(model, score[k, i])

Experiment  0
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.68
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  1
------------
SVC(C=100, gamma=0.01) 1.0
LogisticRegression(random_state=0) 0.96
MLPClassifier(max_iter=300, random_state=1) 1.0
Experiment  2
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  3
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  4
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  5
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  6
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  7
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  8
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  9
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  10
------------
SVC(C=100, gamma=0.01) 0.68
LogisticRegression(random_state=0) 0.6
MLPClassifier(max_iter=300, random_state=1) 0.76
Experiment  11
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.68
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  12
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  13
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  14
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  15
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  16
------------
SVC(C=100, gamma=0.01) 0.68
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  17
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  18
------------
SVC(C=100, gamma=0.01) 0.72
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  19
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  20
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  21
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  22
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  23
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  24
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  25
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  26
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.68
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  27
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  28
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.96
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  29
------------
SVC(C=100, gamma=0.01) 1.0
LogisticRegression(random_state=0) 1.0
MLPClassifier(max_iter=300, random_state=1) 1.0
Experiment  30
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  31
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  32
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  33
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  34
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  35
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  36
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.6
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  37
------------
SVC(C=100, gamma=0.01) 0.68
LogisticRegression(random_state=0) 0.68
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  38
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  39
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  40
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  41
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  42
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  43
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  44
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.96
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  45
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  46
------------
SVC(C=100, gamma=0.01) 0.64
LogisticRegression(random_state=0) 0.64
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  47
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.64
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  48
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  49
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  50
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  51
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  52
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  53
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.64
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  54
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  55
------------
SVC(C=100, gamma=0.01) 0.72
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  56
------------
SVC(C=100, gamma=0.01) 0.72
LogisticRegression(random_state=0) 0.64
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  57
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  58
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  59
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  60
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  61
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.76
Experiment  62
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  63
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  64
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  65
------------
SVC(C=100, gamma=0.01) 0.68
LogisticRegression(random_state=0) 0.68
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  66
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  67
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  68
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  69
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  70
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  71
------------
SVC(C=100, gamma=0.01) 0.52
LogisticRegression(random_state=0) 0.44
MLPClassifier(max_iter=300, random_state=1) 0.76
Experiment  72
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  73
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  74
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  75
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 1.0
Experiment  76
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.96
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  77
------------
SVC(C=100, gamma=0.01) 0.72
LogisticRegression(random_state=0) 0.64
MLPClassifier(max_iter=300, random_state=1) 0.76
Experiment  78
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  79
------------
SVC(C=100, gamma=0.01) 0.64
LogisticRegression(random_state=0) 0.6
MLPClassifier(max_iter=300, random_state=1) 0.72
Experiment  80
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 1.0
Experiment  81
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 1.0
Experiment  82
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  83
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  84
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  85
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  86
------------
SVC(C=100, gamma=0.01) 0.88
LogisticRegression(random_state=0) 0.88
MLPClassifier(max_iter=300, random_state=1) 0.88
Experiment  87
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 1.0
Experiment  88
------------
SVC(C=100, gamma=0.01) 0.72
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  89
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.76
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  90
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.8
Experiment  91
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.76
Experiment  92
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 0.92
Experiment  93
------------
SVC(C=100, gamma=0.01) 0.64
LogisticRegression(random_state=0) 0.64
MLPClassifier(max_iter=300, random_state=1) 0.76
Experiment  94
------------
SVC(C=100, gamma=0.01) 0.96
LogisticRegression(random_state=0) 0.92
MLPClassifier(max_iter=300, random_state=1) 0.96
Experiment  95
------------
SVC(C=100, gamma=0.01) 0.76
LogisticRegression(random_state=0) 0.68
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  96
------------
SVC(C=100, gamma=0.01) 0.64
LogisticRegression(random_state=0) 0.64
MLPClassifier(max_iter=300, random_state=1) 0.68
Experiment  97
------------
SVC(C=100, gamma=0.01) 0.84
LogisticRegression(random_state=0) 0.8
MLPClassifier(max_iter=300, random_state=1) 0.84
Experiment  98
------------
SVC(C=100, gamma=0.01) 0.92
LogisticRegression(random_state=0) 0.84
MLPClassifier(max_iter=300, random_state=1) 1.0
Experiment  99
------------
SVC(C=100, gamma=0.01) 0.8
LogisticRegression(random_state=0) 0.72
MLPClassifier(max_iter=300, random_state=1) 0.88

In [29]:
np.mean(score[2])

0.8820000000000001
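
The mean for one model says little on its own; comparing all models, together with an estimate of uncertainty, is more informative. The sketch below computes the mean and its standard error per model. `summarize` is a hypothetical helper, and to keep the block self-contained it is applied to the scores of experiments 0 to 4 copied from the printed output above rather than to the full `score` array.

```python
import numpy as np

def summarize(score, names):
    """Return {name: (mean, standard error)} for each row of score."""
    means = score.mean(axis=1)
    # Standard error of the mean: sample std (ddof=1) over sqrt(#experiments).
    sems = score.std(axis=1, ddof=1) / np.sqrt(score.shape[1])
    return {name: (float(m), float(s)) for name, m, s in zip(names, means, sems)}

# Scores of experiments 0 to 4, copied from the printed output above.
score_head = np.array([[0.80, 1.00, 0.84, 0.76, 0.88],   # SVC
                       [0.68, 0.96, 0.84, 0.76, 0.88],   # LogisticRegression
                       [0.92, 1.00, 0.92, 0.84, 0.92]])  # MLPClassifier
summary = summarize(score_head, ["SVC", "LogisticRegression", "MLPClassifier"])
for name, (mean, sem) in summary.items():
    print(f"{name}: {mean:.3f} +/- {sem:.3f}")
```

Because every model is evaluated on the same split within an experiment, the per-experiment differences between two rows of `score` are a natural basis for a paired comparison.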