# Auto-Sklearn
![](https://miro.medium.com/max/512/1*s2myX8bJIp9mQ2V_htcEpw.png)

[Auto-sklearn](https://automl.github.io/auto-sklearn/master/#) runs a large selection of machine learning algorithms while also searching over their hyperparameters. It relies on Bayesian optimization, meta-learning, and ensemble construction. For more details on the optimization system, see the [paper](http://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf) by the authors of Auto-Sklearn.

In short, Auto-Sklearn builds a pipeline that is optimized through Bayesian search. Two components are added to the Bayesian optimization of an ML framework's hyperparameters:

- Meta-learning to initialize the Bayesian optimizer
- Automated construction of ensembles from the configurations evaluated during optimization.
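
The second component can be pictured with a small sketch of greedy ensemble selection: at each step, the model whose addition most improves the averaged validation prediction is added (with replacement). This is a simplified illustration written in plain Python, not Auto-Sklearn's actual implementation; the function names and the validation probabilities are hypothetical.

```python
from typing import List, Sequence


def accuracy(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of examples where thresholding the probability at 0.5 matches the label."""
    hits = sum(int(p >= 0.5) == y for p, y in zip(probs, labels))
    return hits / len(labels)


def greedy_ensemble(
    model_probs: List[List[float]], labels: Sequence[int], ensemble_size: int
) -> List[int]:
    """Pick model indices (with replacement) that greedily maximise validation accuracy."""
    chosen: List[int] = []
    running_sum = [0.0] * len(labels)  # summed probabilities of the models chosen so far
    for _ in range(ensemble_size):
        best_idx, best_score = -1, -1.0
        for idx, probs in enumerate(model_probs):
            # Averaged prediction if this candidate were added to the ensemble.
            avg = [(s + p) / (len(chosen) + 1) for s, p in zip(running_sum, probs)]
            score = accuracy(avg, labels)
            if score > best_score:  # ties keep the earlier model
                best_idx, best_score = idx, score
        chosen.append(best_idx)
        running_sum = [s + p for s, p in zip(running_sum, model_probs[best_idx])]
    return chosen


# Hypothetical validation-set probabilities of the positive class for three models.
labels = [1, 0, 1, 1, 0]
model_probs = [
    [0.9, 0.6, 0.45, 0.8, 0.2],
    [0.6, 0.1, 0.7, 0.3, 0.4],
    [0.2, 0.3, 0.9, 0.6, 0.7],
]
members = greedy_ensemble(model_probs, labels, ensemble_size=3)
# A model's weight is the share of ensemble slots it occupies.
weights = [members.count(i) / len(members) for i in range(len(model_probs))]
print(members, weights)
```

Because models are picked with replacement, a strong model can occupy several slots, which is how the ensemble ends up with non-uniform weights.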

In terms of results, it is aimed at small and medium-sized datasets, not at modern deep learning systems.


### Some of the key features of Auto-Sklearn:

- Memory and run-time limits can be configured

- The model search can be restricted by including or excluding specific preprocessors or estimators

- Resampling strategies can be specified (5-fold, cv, etc.)

- Parallel computation with the SMAC algorithm (sequential model-based algorithm configuration), which uses a distributed file system

- Compatibility with Scikit-learn
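
As a rough illustration of how these features map onto constructor arguments, the sketch below collects them in a dictionary. The argument names follow the older Auto-Sklearn API used elsewhere in this notebook (for example `include_estimators`); newer releases may name them differently, so treat them as assumptions to check against the installed version. The helper functions `build_automl_settings` and `make_classifier` are hypothetical.

```python
def build_automl_settings(total_seconds: int = 3600, folds: int = 5) -> dict:
    """Collect keyword arguments for AutoSklearnClassifier (names may vary by version)."""
    return {
        "time_left_for_this_task": total_seconds,   # overall search budget, in seconds
        "per_run_time_limit": total_seconds // 10,  # budget for a single model fit
        "memory_limit": 3072,                       # memory cap per job, in MB
        "resampling_strategy": "cv",
        "resampling_strategy_arguments": {"folds": folds},
        # Restrict the search space to two estimators (assumed component names).
        "include_estimators": ["random_forest", "libsvm_svc"],
        "n_jobs": 2,                                # number of parallel workers
    }


def make_classifier(settings: dict):
    """Instantiate the classifier; the import is deferred so this code loads without auto-sklearn."""
    import autosklearn.classification

    return autosklearn.classification.AutoSklearnClassifier(**settings)


settings = build_automl_settings(total_seconds=600, folds=5)
print(settings)
```

Calling `make_classifier(settings)` would then return an estimator with the usual Scikit-learn `fit` and `predict` methods, provided Auto-Sklearn is installed.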

In [1]:
import numpy as np
import pandas as pd
import os        
import sklearn
from sklearn.model_selection import train_test_split        

In [2]:
df = pd.read_csv("./heartDisease/heart.csv")
df.head()

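| Unnamed: 0 | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |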


In [3]:
y = df.target.values
x_data = df.drop(['target'], axis = 1)

# Normalization
#x = (x_data - np.min(x_data)) / (np.max(x_data) - np.min(x_data)).values
#x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.2,random_state=0)

## Installation

There are two ways to install Auto-Sklearn.

### Manual installation

The first is to install the required packages and dependencies with pip.

The system requirements are:

- Linux operating system (for example, Ubuntu) __[Linux](https://www.wikihow.com/Install-Linux)__.

- Python (>=3.6) __[Python](https://www.python.org/downloads/)__.

- C++ compiler (with C++11 support) __[GCC](https://www.tutorialspoint.com/How-to-Install-Cplusplus-Compiler-on-Linux)__.

- SWIG (version 3.0.* required; >=4.0.0 is not compatible) __[SWIG](http://www.swig.org/survey.html)__.
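
Before installing, the requirements listed above can be checked quickly from Python. The following is a minimal sketch using only the standard library; the helper name `check_requirements` is hypothetical, and it only checks that a `swig` executable is present on `PATH`, not that its version is 3.0.*.

```python
import platform
import shutil
import sys


def check_requirements() -> dict:
    """Report whether each requirement appears to be met on this machine."""
    return {
        "linux": platform.system() == "Linux",
        "python>=3.6": sys.version_info >= (3, 6),
        # Look for a C++ compiler on PATH; g++ and clang++ are common choices.
        "c++ compiler": any(shutil.which(cc) for cc in ("g++", "clang++", "c++")),
        "swig": shutil.which("swig") is not None,
    }


report = check_requirements()
for name, ok in report.items():
    print(f"{name:>14}: {'found' if ok else 'MISSING'}")
```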

In [4]:
#!curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip3 install
#!pip3 install auto-sklearn

### Using Docker

The second option is to use the __[Docker image](https://hub.docker.com/r/mfeurer/auto-sklearn/)__ provided by the Auto-Sklearn project itself.

- docker pull mfeurer/auto-sklearn:master
- docker run -it -v $PWD:/opt/nb -p 8888:8888 mfeurer/auto-sklearn:master /bin/bash -c "mkdir -p /opt/nb && jupyter notebook --notebook-dir=/opt/nb --ip='0.0.0.0' --port=8888 --no-browser --allow-root"


## Tests

In [5]:
import sklearn.metrics
import sklearn.model_selection
import autosklearn.classification

# Hold out a test set (train_test_split keeps 25% for testing by default)
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(x_data, y, random_state=1)

# One-hour search budget, at most 5 minutes per model, 5-fold cross-validation
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,
    per_run_time_limit=300,
    resampling_strategy='cv',
    resampling_strategy_arguments={'folds': 5},
    include_preprocessors=["no_preprocessing"],
    ensemble_size=2,
)

automl.fit(X_train, y_train)

# Rebuild the ensemble from the models evaluated during fit()
automl.fit_ensemble(y_train, ensemble_size=50)

print(automl.show_models())

predictions = automl.predict(X_test)

print(automl.sprint_statistics())

print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))




[(0.500000, SimpleClassificationPipeline({'balancing:strategy': 'weighting', 'classifier:__choice__': 'libsvm_svc', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'minority_coalescer', 'data_preprocessing:numerical_transformer:imputation:strategy': 'median', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'quantile_transformer', 'feature_preprocessor:__choice__': 'no_preprocessing', 'classifier:libsvm_svc:C': 1.6607760071674351, 'classifier:libsvm_svc:gamma': 0.0010325252746543816, 'classifier:libsvm_svc:kernel': 'rbf', 'classifier:libsvm_svc:max_iter': -1, 'classifier:libsvm_svc:shrinking': 'False', 'classifier:libsvm_svc:tol': 0.0004738517446166401, 'data_preprocessing:categorical_transformer:category_coalescence:minority_coalescer:minimum_fraction': 0.01666535859065419, 'data_preprocessing:numerical_transformer:rescaling:quantile_transform

In [6]:
print(automl.cv_results_)

{'mean_test_score': array([0.87665198, 0.86784141, 0.81057269, 0.79735683, 0.81497797,
       0.84581498, 0.81938326, 0.54625551, 0.8061674 , 0.74889868,
       0.79735683, 0.86784141, 0.84581498, 0.86343612, 0.54625551,
       0.81057269, 0.80176211, 0.85903084, 0.82378855, 0.71365639,
       0.85022026, 0.7753304 , 0.54625551, 0.82378855, 0.80176211,
       0.55506608, 0.83259912, 0.85903084, 0.81057269, 0.84581498,
       0.52863436, 0.84581498, 0.81938326, 0.79295154, 0.60352423,
       0.85022026, 0.82378855, 0.84581498, 0.82378855, 0.81938326,
       0.84581498, 0.77973568, 0.84581498, 0.80176211, 0.82819383,
       0.79735683, 0.84140969, 0.82819383, 0.66960352, 0.83259912,
       0.81057269, 0.84581498, 0.81057269, 0.85462555, 0.66519824,
       0.82378855, 0.84581498, 0.84140969, 0.85462555, 0.86784141,
       0.8061674 , 0.85462555, 0.66519824, 0.85462555, 0.81057269,
       0.75330396, 0.83259912, 0.49339207, 0.85462555, 0.83259912,
       0.83700441, 0.86343612, 0.81938326,
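
To make sense of this output, the scores can be ranked to find the best evaluated configuration. The sketch below uses only the first five `mean_test_score` values copied from the output above, stored in a plain list rather than a NumPy array, and assumes the scores are accuracies (the metric reported at the end of the previous cell).

```python
# First five mean_test_score values copied from the output above.
cv_results = {
    "mean_test_score": [0.87665198, 0.86784141, 0.81057269, 0.79735683, 0.81497797],
}

scores = cv_results["mean_test_score"]
# Indices of the configurations, ordered from best to worst score.
ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
best = ranked[0]
print(f"best configuration: #{best} with mean CV score {scores[best]:.3f}")
```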

### Pros:

- Easy to use, very similar to sklearn

- Supports model persistence and parallel computation (Dask)

### Cons:

- A large number of dependencies

- Long computation time before a result is obtained

- Only accepts Int or Float as input

## Titanic dataset

As mentioned above, one of Auto-Sklearn's limitations is that it does not accept features containing strings, so datasets with such features must be preprocessed first.

To illustrate this situation in this framework and the following ones, the Titanic dataset from [Kaggle](https://www.kaggle.com/) is used. It was chosen for its simplicity and because it is the one used in the tutorials of the [SITC](https://github.com/gsi-upm/sitc) course.

In [8]:
import numpy as np
import pandas as pd
import sklearn.metrics
import sklearn.model_selection
import autosklearn.classification

# Use the URL of the raw file content (not the HTML page)
url = "https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv"
df = pd.read_csv(url)

# Fill missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Sex'] = df['Sex'].fillna('male')
df['Embarked'] = df['Embarked'].fillna('S')

# Encode categorical variables as integers
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1}).astype(np.int64)
df['Embarked'] = df['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(np.int64)

# Drop columns that are not used as features
df = df.drop(['Cabin', 'Ticket', 'Name'], axis=1)

# Features of the model
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
# Convert the dataframe into NumPy arrays
x_data = df[features].values
y = df['Survived'].values

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(x_data, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=5*60,
per_run_time_limit=300,resampling_strategy='cv', resampling_strategy_arguments={'folds': 5},ensemble_size=2)

# Do not construct ensembles in parallel to avoid using more than one
# core at a time. The ensemble will be constructed after auto-sklearn
# finished fitting all machine learning models.

automl.fit(X_train, y_train)

# This call to fit_ensemble uses all models trained in the previous call
# to fit to build an ensemble which can be used with automl.predict()

automl.fit_ensemble(y_train, ensemble_size=50)

print(automl.show_models())

predictions = automl.predict(X_test)

print(automl.sprint_statistics())

print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))    

[(0.500000, SimpleClassificationPipeline({'balancing:strategy': 'none', 'classifier:__choice__': 'extra_trees', 'data_preprocessing:categorical_transformer:categorical_encoding:__choice__': 'one_hot_encoding', 'data_preprocessing:categorical_transformer:category_coalescence:__choice__': 'no_coalescense', 'data_preprocessing:numerical_transformer:imputation:strategy': 'median', 'data_preprocessing:numerical_transformer:rescaling:__choice__': 'none', 'feature_preprocessor:__choice__': 'select_rates_classification', 'classifier:extra_trees:bootstrap': 'False', 'classifier:extra_trees:criterion': 'entropy', 'classifier:extra_trees:max_depth': 'None', 'classifier:extra_trees:max_features': 0.562561668029056, 'classifier:extra_trees:max_leaf_nodes': 'None', 'classifier:extra_trees:min_impurity_decrease': 0.0, 'classifier:extra_trees:min_samples_leaf': 2, 'classifier:extra_trees:min_samples_split': 15, 'classifier:extra_trees:min_weight_fraction_leaf': 0.0, 'feature_preprocessor:select_rates_

The next example shows that, without prior preprocessing of the dataset's features, Auto-Sklearn fails.

In [9]:
import sklearn
import autosklearn
from sklearn import model_selection, metrics
import autosklearn.classification

#We get a URL with raw content (not HTML one)
url="https://raw.githubusercontent.com/gsi-upm/sitc/master/ml2/data-titanic/train.csv"
df = pd.read_csv(url)

# Convert the dataframe into NumPy arrays (string columns are left untouched on purpose)
x_data = df.drop(['Survived'], axis=1).values
y = df['Survived'].values

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(x_data, y, random_state=1)

automl = autosklearn.classification.AutoSklearnClassifier(time_left_for_this_task=5*60,
per_run_time_limit=300,resampling_strategy='cv', resampling_strategy_arguments={'folds': 5},ensemble_size=2)

# Do not construct ensembles in parallel to avoid using more than one
# core at a time. The ensemble will be constructed after auto-sklearn
# finished fitting all machine learning models.

automl.fit(X_train, y_train)

# This call to fit_ensemble uses all models trained in the previous call
# to fit to build an ensemble which can be used with automl.predict()

automl.fit_ensemble(y_train, ensemble_size=50)

print(automl.show_models())

predictions = automl.predict(X_test)

print(automl.sprint_statistics())

print("Accuracy score", sklearn.metrics.accuracy_score(y_test, predictions))    

ValueError: When providing a numpy array to Auto-sklearn, the only valid dtypes are numerical ones. The provided data type <class 'numpy.object_'> is not supported.
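
One way to anticipate this error is to list the non-numeric columns before calling `fit()`. The sketch below uses pandas' `select_dtypes` on a small illustrative frame with the same kinds of columns as the Titanic data; the three rows are for demonstration only and the variable names are hypothetical.

```python
import pandas as pd

# Small illustrative frame mixing numeric and string columns.
df = pd.DataFrame(
    {
        "Pclass": [3, 1, 3],
        "Sex": ["male", "female", "female"],
        "Age": [22.0, 38.0, 26.0],
        "Embarked": ["S", "C", "S"],
        "Survived": [0, 1, 1],
    }
)

features = df.drop(columns=["Survived"])
# Columns whose dtype is not numeric must be encoded or dropped first.
non_numeric = features.select_dtypes(exclude="number").columns.tolist()
if non_numeric:
    print("Encode or drop these columns before calling fit():", non_numeric)

# Converting a mixed frame to NumPy yields dtype=object, the type rejected in the error above.
mixed_dtype = features.to_numpy().dtype
print(mixed_dtype)
```

If the list is empty, the frame can be converted to a NumPy array with a numeric dtype and passed to Auto-Sklearn as in the earlier Titanic cell.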