Iris Simple Example
=====

This example shows how to use QuickExperiment to define an experiment suite for quick iterations and high result reproducibility. It uses the sklearn Iris dataset and a simple Logistic Regression classifier.

First, we load the dataset, which is already preprocessed into a numeric matrix and an array of labels.

In [1]:
from sklearn import datasets
iris = datasets.load_iris()

As a first example, we will only train a classifier, evaluate it on a test portion, and save the results. For this, we need an instance of BaseDataset and an Experiment configuration.

Our dataset consists of a 2-D numpy array representing the instances and a vector representing the labels. We can use the class SimpleDataset to model our data.

BaseDatasets are designed to optimize the information stored for a dataset that will be used many times and is likely to be partitioned in many ways. Over the course of an investigation, numerous experiments will be run and re-run on a dataset, each time creating training and evaluation partitions. Instead of saving a full copy of the matrices for every partition, a BaseDataset stores the matrix only once and keeps the indices of the instances assigned to each partition. This also makes it easier to compare results between experiments and to keep track of where the instances of the original dataset are being used.
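
To make the idea concrete, here is a minimal sketch of index-based partitioning with plain numpy. The class `IndexedPartitions` and its methods are hypothetical names chosen for illustration; this is not the quick_experiment implementation, only one way the "store once, keep indices" idea could look.

```python
import numpy as np


class IndexedPartitions:
    """Hypothetical illustration; not the quick_experiment implementation."""

    def __init__(self, instances, labels):
        # The full matrix and the label vector are stored exactly once.
        self.instances = np.asarray(instances)
        self.labels = np.asarray(labels)
        self.indices = {}

    def add_partition(self, name, index):
        # Only the row indices are stored for each partition.
        self.indices[name] = np.asarray(index, dtype=int)

    def partition(self, name):
        # Rows are selected on demand from the single stored matrix.
        idx = self.indices[name]
        return self.instances[idx], self.labels[idx]


data = np.arange(20).reshape(10, 2)
labels = np.array([0, 1] * 5)

store = IndexedPartitions(data, labels)
store.add_partition('train', [0, 1, 2, 3, 4, 5, 6, 7])
store.add_partition('test', [8, 9])

train_x, train_y = store.partition('train')
test_x, test_y = store.partition('test')
```

Because each partition is just an index array, a new split of the same data costs a few integers per instance rather than another copy of the matrix, and the indices can be compared across experiments to see which instances were used where.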

Let's create our Dataset instance: first, we need to create the train/test split of the iris data.

In [2]:
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2)
train_index, test_index = next(splitter.split(iris.data, iris.target))
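
Since reproducibility is one of the goals here, it may help to fix the random seed of the splitter. The sketch below is an optional variation on the cell above, not part of the original notebook: the helper `make_split` and the seed value are assumptions made for illustration, while `random_state` is a standard `StratifiedShuffleSplit` parameter.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import StratifiedShuffleSplit

iris = datasets.load_iris()


def make_split(seed):
    # Hypothetical helper: one stratified 80/20 split with a fixed seed.
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    return next(splitter.split(iris.data, iris.target))


train_a, test_a = make_split(seed=42)
train_b, test_b = make_split(seed=42)

# The same seed should yield identical index arrays on both calls.
same_split = np.array_equal(train_a, train_b) and np.array_equal(test_a, test_b)

# Stratification keeps the class proportions in the test portion.
test_class_counts = np.bincount(iris.target[test_a])
```

With the seed fixed, the indices stored in the dataset can be regenerated later instead of only being read back from disk.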

Now we can use these indices to create an instance of BaseDataset.

In [None]:
# add parent directory to python path
import sys
sys.path.append('../')

from quick_experiment.dataset import SimpleDataset

iris_dataset = SimpleDataset()
indices = {'train': train_index, 'test': test_index}
iris_dataset.create_from_matrixes(iris.data, indices, iris.target)

We now define the experiment we want to run using a configuration dictionary. The class model.SkleanrModel provides a simple API to create models wrapping scikit-learn learners.

In [15]:
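from quick_experiment import model
try:
    from importlib import reload  # Python 3.4+; reload is a builtin on Python 2
except ImportError:
    pass
model = reload(model)  # pick up edits made to the module since it was imported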
from sklearn.linear_model import LogisticRegression

config = {
    'model': model.SkleanrModel,
    'model_arguments': {'model_class': LogisticRegression, 'sklearn_model_arguments': {'C': 0.5, 'n_jobs': 2}}
}

In [16]:
from quick_experiment import experiment
experiment = reload(experiment)
lr_experiment = experiment.Experiment(iris_dataset, config=config)

We now run the experiment. The simple experiment just trains the classifier on the 'train' partition and prints the classification report for the predictions on the 'test' partition.

In [17]:
lr_experiment.run()

ValueError: Found input variables with inconsistent numbers of samples: [30, 2]
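
For reference, the train-and-report step described for this experiment can be sketched directly in scikit-learn. This is a standalone sketch and not the quick_experiment implementation: the fixed `random_state`, the `max_iter` value, and the omission of `n_jobs` are assumptions made so the snippet runs quietly on recent scikit-learn versions.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import StratifiedShuffleSplit

iris = datasets.load_iris()

# Same kind of stratified 80/20 split as in the earlier cell (seed is an assumption).
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(splitter.split(iris.data, iris.target))

# Train on the 'train' indices only.
classifier = LogisticRegression(C=0.5, max_iter=1000)
classifier.fit(iris.data[train_index], iris.target[train_index])

# Predict on the 'test' indices and build the per-class report.
predictions = classifier.predict(iris.data[test_index])
report = classification_report(iris.target[test_index], predictions)
print(report)
```

Note that both `fit` and `classification_report` receive arrays with one entry per instance, so the feature matrix and the label vector must be indexed with the same partition indices.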

Iris Sampled Example
===

The framework also allows running experiments over multiple samples with a single command and obtaining global metrics. For this, we will use the SimpleSampledDataset and SampledExperiment classes. These classes create the samples for us and train and evaluate the classifier on each of them.

In [7]:
partition_sizes = {'train': 0.8, 'test': 0.2}
samples = 5

In [3]:
from quick_experiment.dataset import SimpleSampledDataset

iris_sampled_dataset = SimpleSampledDataset()
iris_sampled_dataset.create_samples(iris.data, iris.target, samples, partition_sizes)

In [13]:
experiment = reload(experiment)
# We use the same config as before
config = {
    'model': model.SkleanrModel,
    'model_arguments': {'model_class': LogisticRegression, 'sklearn_model_arguments': {'C': 0.5, 'n_jobs': 2}}
}
iris_sampled_experiment = experiment.SampledExperiment(iris_sampled_dataset, config=config)

In [14]:
iris_sampled_experiment.run()

INFO:root:
	Precision	Recall	F1 Score
mean	0.973333333333	0.973333333333	0.973333333333
std	0.0133333333333	0.0133333333333	0.0133333333333
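
The aggregation over samples can be sketched with scikit-learn alone. This is a rough illustration under stated assumptions, not the SampledExperiment implementation: the averaging mode (`average='weighted'`), the fixed `random_state`, and `max_iter` are choices made here, so the numbers it prints need not match the output above.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedShuffleSplit

iris = datasets.load_iris()

# Five independent stratified 80/20 samples of the same dataset.
splitter = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

scores = []
for train_index, test_index in splitter.split(iris.data, iris.target):
    # A fresh classifier is trained on each sample.
    classifier = LogisticRegression(C=0.5, max_iter=1000)
    classifier.fit(iris.data[train_index], iris.target[train_index])
    predictions = classifier.predict(iris.data[test_index])
    precision, recall, f1, _ = precision_recall_fscore_support(
        iris.target[test_index], predictions, average='weighted')
    scores.append((precision, recall, f1))

# One row per sample, one column per metric.
scores = np.array(scores)
mean_scores = scores.mean(axis=0)
std_scores = scores.std(axis=0)
print('mean', mean_scores)
print('std', std_scores)
```

Reporting the standard deviation alongside the mean gives a sense of how much the metrics depend on the particular train/test split.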
