In [1]:
import numpy as np

In this file, we will set up a pipeline for
1. Collecting data for an experiment
2. Processing the data
3. Running a machine learning algorithm
4. Evaluating the algorithm
5. Choosing the best algorithm and hyperparameters

The main questions we need to answer are
1. How will data be collected?
2. How much data would we need?
3. What algorithm would be best? 
4. How would the amount of data influence algorithm selection?
5. How robust is our procedure to assumptions?

# Data collection

Here we want to call a simulator that collects data for us. The simulator can be arbitrary, but we would normally wish to have a specific API for calling it. In the simplest case, the only input parameter is the amount of data.

In [2]:
import data_generator

# Obtain data from the data generator. 
# In this case, the data generator gives us a random sample. Other sampling methodologies are possible, of course.
generator = data_generator.GaussianClassificationGenerator(2, np.array([0.3, 0.7]))
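
The `data_generator` module itself is not shown here. As a rough illustration of the kind of interface assumed above, the following sketch defines a hypothetical `SketchGaussianGenerator` with a `generate(n)` method that returns features and labels. The reading of the constructor arguments (number of features, class prior), the `seed` parameter and the class-dependent means are assumptions made for this sketch, not the actual implementation of `GaussianClassificationGenerator`.

```python
import numpy as np

class SketchGaussianGenerator:
    """Hypothetical stand-in for a data generator (not the real data_generator module)."""

    def __init__(self, n_features, class_prior, seed=None):
        self.n_features = n_features
        self.class_prior = np.asarray(class_prior, dtype=float)
        self.rng = np.random.default_rng(seed)
        n_classes = len(self.class_prior)
        # Assumption: each class has its own randomly drawn mean and unit variance
        self.means = self.rng.normal(size=(n_classes, n_features))

    def generate(self, n_data):
        # Draw labels from the class prior, then features from the class-conditional Gaussian
        Y = self.rng.choice(len(self.class_prior), size=n_data, p=self.class_prior)
        X = self.means[Y] + self.rng.normal(size=(n_data, self.n_features))
        return X, Y

sketch_generator = SketchGaussianGenerator(2, np.array([0.3, 0.7]), seed=0)
X_sketch, Y_sketch = sketch_generator.generate(5)
```

Keeping the amount of data as the only argument of `generate` means the rest of the pipeline does not need to know how the data is produced.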

# Data processing

Here we perform preliminary processing of the data, if necessary. In particular, we may want to split the data into training, validation and test sets. Other standard pre-processing includes scaling the data, dealing with missing data, and removing problematic data points. However, all of these could in principle be handled within the learning algorithm itself.

In [3]:
import sklearn.model_selection

def process_data(X, Y):
    # Split into training and test sets (sklearn's default holds out 25% of the data for testing)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, Y)
    return X_train, X_test, y_train, y_test
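
The function above only produces a training and a test set. As a sketch of the three-way split and the scaling step mentioned in the text, the hypothetical helper `split_and_scale` below shuffles the data, cuts it into training, validation and test parts, and standardizes the features using statistics from the training part only. The function name, the split fractions and the use of plain NumPy are assumptions chosen for illustration; applying `train_test_split` twice would be another way to get the same kind of split.

```python
import numpy as np

def split_and_scale(X, Y, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Hypothetical helper: train/validation/test split with standardization."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y)
    n = len(X)
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)  # shuffle indices before cutting
    n_test = int(round(test_fraction * n))
    n_val = int(round(val_fraction * n))
    test_idx = order[:n_test]
    val_idx = order[n_test:n_test + n_val]
    train_idx = order[n_test + n_val:]
    # Scaling statistics come from the training part only, to avoid leaking information
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    std[std == 0] = 1.0  # constant features are only centered, not rescaled

    def scale(A):
        return (A - mean) / std

    return (scale(X[train_idx]), Y[train_idx],
            scale(X[val_idx]), Y[val_idx],
            scale(X[test_idx]), Y[test_idx])
```

The validation part would be used for choosing hyperparameters, leaving the test part untouched until the final evaluation.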


# Evaluation

Here we run an ML algorithm, using the training data to fit its parameters, and then measure its accuracy on held-out data. It is best to have a unified interface for both steps.

In [4]:
# get a Trained model
def Train(clf, X_train, y_train):
    clf.fit(X_train, y_train) # Common API for classifiers in sklearn
    return clf

In [5]:
from sklearn.metrics import accuracy_score
def Evaluate(clf, X_test, y_test):
    y_pred = clf.predict(X_test) # Common API
    accuracy = accuracy_score(y_test, y_pred)  # argument order is (y_true, y_pred)
    return accuracy
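
The accuracy returned by `Evaluate` is an estimate computed from a finite test set, so it carries sampling noise. Assuming independent test points, a binomial approximation gives a standard error of $\sqrt{a(1-a)/n}$ for an accuracy $a$ measured on $n$ points. The sketch below computes this quantity; the function name and the example accuracy of 0.8 are illustrative assumptions, not results from this notebook.

```python
import math

def accuracy_standard_error(accuracy, n_test):
    """Approximate standard error of an accuracy estimated from n_test independent points."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)

# Illustrative values: an accuracy of 0.8 measured on 25 versus 1,000,000 points
se_small = accuracy_standard_error(0.8, 25)
se_large = accuracy_standard_error(0.8, 1_000_000)
```

For a test set of 25 points the standard error is about 0.08, whereas for a million points it is about 0.0004, so accuracies measured on a small test split can differ noticeably from the true accuracy.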

In [6]:
# The experiment setup

In [7]:
n_data = 100  # amount of data per experiment; vary this to see how much data we need
n_experiments = 100  # more experiments give us more confidence in the result
n_evaluation_data = 1000000  # size of the large sample used only to evaluate the pipeline

## Setup
from sklearn import svm
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
models = [svm.SVC(gamma=0.01, C=100),
         LogisticRegression(random_state=0),
         MLPClassifier(random_state=1, max_iter=300)]
n_models = len(models)
score = np.zeros([n_models, n_experiments])
real_score = np.zeros([n_models, n_experiments])

[X_eval, Y_eval] = generator.generate(n_evaluation_data) # data is generated here only to evaluate the pipeline



# The point of the experiment is to evaluate the models and select the best
for i in range(n_experiments):
    print("Experiment ", i)
    print("------------")
    [X, Y] = generator.generate(n_data) # data is generated here
    [X_train, X_test, y_train, y_test] = process_data(X, Y) # processing also splits the data in two parts

    for k, model in enumerate(models):
        Train(model, X_train, y_train)
        score[k,i] = Evaluate(model, X_test, y_test) # accuracy on the small held-out split
        real_score[k,i] = Evaluate(model, X_eval, Y_eval) # accuracy on the large evaluation sample
        print(model, score[k,i], real_score[k,i])

In [None]:
np.mean(score[2])
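
Looking at the mean score of a single model is only a first step towards choosing the best algorithm. As a sketch of how the two score arrays could be compared, the hypothetical helper `summarize_scores` below reports, per model, the mean and standard error of the held-out accuracy, the mean accuracy on the large evaluation sample, and how often the model that looks best on the held-out split is also best on the large sample. The function name and the returned keys are assumptions for illustration; ties are resolved in favour of the first model, which is how `argmax` behaves.

```python
import numpy as np

def summarize_scores(score, real_score):
    """Hypothetical helper: compare held-out estimates with large-sample accuracy per model.

    Both inputs have shape (n_models, n_experiments), with at least two experiments.
    """
    score = np.asarray(score, dtype=float)
    real_score = np.asarray(real_score, dtype=float)
    n_experiments = score.shape[1]
    return {
        "mean_score": score.mean(axis=1),
        "stderr_score": score.std(axis=1, ddof=1) / np.sqrt(n_experiments),
        "mean_real_score": real_score.mean(axis=1),
        # Fraction of experiments in which the model chosen on the test split
        # is also the best model on the large evaluation sample
        "selection_agreement": float(np.mean(score.argmax(axis=0) == real_score.argmax(axis=0))),
    }
```

In this notebook it would be called as `summarize_scores(score, real_score)` once the experiment loop has finished.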