# auto-sklearn tutorial
auto-sklearn is an automated machine learning toolkit and a drop-in replacement for a scikit-learn estimator.

## Simple classification example
Auto-build a machine learning model for the `digits` toy dataset from sklearn.

In [4]:
import autosklearn.classification
import sklearn.model_selection
import sklearn.datasets
import sklearn.metrics
X, y = sklearn.datasets.load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=1)

Default setup. With the default time budget this takes about an hour to run, so only do it when you have the time.

In [None]:
automl = autosklearn.classification.AutoSklearnClassifier()
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

Limiting resources: overall time, time per model, memory limit

In [5]:
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=60,  # seconds: total time budget for the whole search
    per_run_time_limit=15,  # seconds: a single model fit is stopped after this long
    ml_memory_limit=1024,  # MB: memory limit for each call to an ML algorithm (renamed `memory_limit` in newer auto-sklearn releases)
)
automl.fit(X_train, y_train)
y_hat = automl.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, y_hat))

Accuracy score 0.993333333333


Let's take a look at the overall model. auto-sklearn builds an ensemble using forward selection over the models discovered during the optimization.  
For details, see: http://www.cs.cornell.edu/~alexn/papers/shotgun.icml04.revised.rev2.pdf
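
The forward selection idea can be illustrated with a minimal sketch. This is not auto-sklearn's own implementation: the function names `accuracy` and `forward_selection`, the toy validation predictions, and the tie-breaking rule (first model wins) are all assumptions made for illustration. The sketch greedily adds, with replacement, whichever model most improves the accuracy of the averaged predictions, and reports how often each model was picked as its ensemble weight.

```python
from collections import Counter


def accuracy(probabilities, labels):
    """Fraction of rows whose highest-probability class matches the label."""
    hits = sum(
        1 for row, label in zip(probabilities, labels)
        if row.index(max(row)) == label
    )
    return hits / len(labels)


def add_rows(left, right):
    """Element-wise sum of two lists of per-class probability rows."""
    return [[a + b for a, b in zip(l_row, r_row)]
            for l_row, r_row in zip(left, right)]


def forward_selection(model_probabilities, labels, ensemble_size):
    """Hypothetical helper: greedy ensemble selection with replacement.

    model_probabilities holds one entry per model; each entry is a list of
    per-class probability rows predicted on a held-out validation set.
    Returns a dict mapping model index -> ensemble weight.
    """
    n_classes = len(model_probabilities[0][0])
    running_sum = [[0.0] * n_classes for _ in labels]
    chosen = []
    for _ in range(ensemble_size):
        best_index, best_score = None, -1.0
        for index, probs in enumerate(model_probabilities):
            # Summed probabilities have the same argmax as averaged ones.
            score = accuracy(add_rows(running_sum, probs), labels)
            if score > best_score:  # ties keep the earlier model
                best_index, best_score = index, score
        chosen.append(best_index)
        running_sum = add_rows(running_sum, model_probabilities[best_index])
    counts = Counter(chosen)
    return {index: count / ensemble_size for index, count in counts.items()}


# Toy validation set: four samples, two classes, three candidate models.
labels = [0, 1, 1, 0]
model_a = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.8, 0.2]]
model_b = [[0.6, 0.4], [0.2, 0.8], [0.1, 0.9], [0.45, 0.55]]
model_c = [[0.3, 0.7], [0.6, 0.4], [0.5, 0.5], [0.4, 0.6]]

weights = forward_selection([model_a, model_b, model_c], labels, ensemble_size=4)
print(weights)
```

The weights play the same role as the leading numbers (such as `0.120000`) in the `show_models()` output: models that were selected more often contribute more to the final prediction, and models that were never selected are left out.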


In [6]:
# Print the final ensemble constructed by auto-sklearn.
print(automl.show_models())
predictions = automl.predict(X_test)
# Print statistics about the auto-sklearn run, such as the number of
# iterations and the number of models that failed with a timeout.
print(automl.sprint_statistics())
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

[(0.120000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'categorical_encoding:__choice__': 'one_hot_encoding', 'classifier:__choice__': 'extra_trees', 'imputation:strategy': 'mean', 'preprocessor:__choice__': 'no_preprocessing', 'rescaling:__choice__': 'none', 'categorical_encoding:one_hot_encoding:use_minimum_fraction': 'False', 'classifier:extra_trees:bootstrap': 'False', 'classifier:extra_trees:criterion': 'gini', 'classifier:extra_trees:max_depth': 'None', 'classifier:extra_trees:max_features': 0.5, 'classifier:extra_trees:max_leaf_nodes': 'None', 'classifier:extra_trees:min_impurity_decrease': 0.0, 'classifier:extra_trees:min_samples_leaf': 1, 'classifier:extra_trees:min_samples_split': 2, 'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'classifier:extra_trees:n_estimators': 100},
dataset_properties={
  'task': 2,
  'sparse': False,
  'multilabel': False,
  'multiclass': True,
  'target_type': 'classification',
  'signed': False})),
(0.100000, SimpleClassif

auto-sklearn results:
  Dataset name: d74860caaa557f473ce23908ff7ba369
  Metric: accuracy
  Best validation score: 0.991011
  Number of target algorithm runs: 19
  Number of successful target algorithm runs: 17
  Number of crashed target algorithm runs: 1
  Number of target algorithms that exceeded the time limit: 1
  Number of target algorithms that exceeded the memory limit: 0

Accuracy score 0.993333333333


## Run auto-sklearn on OpenML datasets
auto-sklearn needs a bit of meta-data, such as which features are categorical and which are numerical, to build models correctly. OpenML provides this meta-data for every dataset, so you can easily run auto-sklearn on any OpenML dataset.

In [13]:
import openml

# Get an OpenML task
task = openml.tasks.get_task(3954)

# Extract the outer train-test splits (cross-validation)
# We need to pass these explicitly
train_indices, test_indices = task.get_train_test_split_indices()
X, y = task.get_X_and_y()
X_train, y_train = X[train_indices], y[train_indices]
X_test, y_test = X[test_indices], y[test_indices]

# Get the data types for all features (numeric, categorical,...)
dataset = task.get_dataset()
_, _, categorical_indicator = dataset.\
    get_data(target=task.target_name, return_categorical_indicator=True)
feat_type = ['Categorical' if ci else 'Numerical'
             for ci in categorical_indicator]
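
For data that does not come from OpenML, the same kind of `feat_type` list has to be built by hand. A minimal sketch, assuming you already know which columns are categorical: the helper `make_feat_type` and the column names are hypothetical and only illustrate the expected format, one `'Categorical'` or `'Numerical'` entry per feature, in column order.

```python
def make_feat_type(column_names, categorical_columns):
    """Hypothetical helper: one 'Categorical'/'Numerical' label per column."""
    unknown = set(categorical_columns) - set(column_names)
    if unknown:
        # Catch typos early instead of silently treating a column as numerical.
        raise ValueError(f"Unknown categorical columns: {sorted(unknown)}")
    return ['Categorical' if name in categorical_columns else 'Numerical'
            for name in column_names]


# Illustrative column names, not taken from any real dataset.
column_names = ['age', 'colour', 'height', 'country']
example_feat_type = make_feat_type(column_names, {'colour', 'country'})
print(example_feat_type)
```

The resulting list can then be passed to `fit` through the `feat_type` argument in the same way as the OpenML-derived list.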

Run auto-sklearn

In [14]:
cls = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=120,
    per_run_time_limit=30,
)
cls.fit(X_train, y_train, feat_type=feat_type)

# Score
predictions = cls.predict(X_test)
print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))

Accuracy score 0.885909568875


### Exercise: MAGIC
Run auto-sklearn on the MAGIC dataset, [task 3954](https://www.openml.org/t/3954)

## More examples
For more examples, such as how to set up auto-sklearn to run in parallel, see the [auto-sklearn example gallery](http://automl.github.io/auto-sklearn/stable/examples/index.html).