__XGBoost__ is one of the most popular machine-learning algorithms, but it exposes so many parameters that the number of possible combinations is practically unlimited:
- booster [default=gbtree]: gbtree, gblinear or dart; gbtree and dart use tree-based models, while gblinear uses linear functions
- disable_default_eval_metric 
- eta [default=0.3, alias: learning_rate]
- gamma [default=0, alias: min_split_loss]
- max_depth [default=6]
- min_child_weight [default=1]
- max_delta_step [default=0]
- subsample [default=1]
- colsample_bytree 
- colsample_bylevel 
- colsample_bynode 
- lambda [default=1, alias: reg_lambda]
- alpha [default=0, alias: reg_alpha]
- tree_method [default=auto]
- sketch_eps [default=0.03]
- scale_pos_weight [default=1]
- refresh_leaf [default=1]
- process_type [default=default]
- grow_policy [default=depthwise]
- max_leaves [default=0]
- max_bin [default=256]
- sample_type [default=uniform]
- normalize_type [default=tree]
- rate_drop [default=0.0]
- one_drop [default=0]
- skip_drop [default=0.0]
- updater [default=shotgun]
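
To get a feeling for how quickly the search space grows, the following sketch counts the points in a deliberately coarse grid over a handful of the parameters listed above. The candidate values are illustrative assumptions, not recommended settings.

```python
import itertools
import math

# Hypothetical coarse grid: only a few candidate values per parameter.
grid = {
    "booster": ["gbtree", "dart"],
    "eta": [0.01, 0.05, 0.1, 0.3],
    "gamma": [0, 1, 5],
    "max_depth": [3, 6, 9],
    "min_child_weight": [1, 5, 10],
    "subsample": [0.5, 0.75, 1.0],
    "colsample_bytree": [0.5, 0.75, 1.0],
    "lambda": [0.1, 1, 10],
    "alpha": [0, 0.1, 1],
}

# The number of grid points is the product of the per-parameter counts.
n_combinations = math.prod(len(values) for values in grid.values())
print(n_combinations)

# Enumerate lazily; materialising the full grid is rarely a good idea.
first = dict(zip(grid, next(itertools.product(*grid.values()))))
print(first)
```

Even this small grid over nine parameters has 17,496 points; with 5-fold cross-validation an exhaustive search would need 87,480 model fits.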



AutoML platforms that automate this search:

 - [BigML](https://bigml.com/)
 - [H2O.ai](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html)
 - [RapidMiner](https://rapidminer.com/products/go/)
 - [DataRobot](https://www.datarobot.com/solutions/data-scientists/)
 - [Microsoft Azure](https://azure.microsoft.com/en-us/services/machine-learning/automatedml/)
 - [Google Cloud AutoML](https://cloud.google.com/automl)
 - [Amazon AutoML](https://aws.amazon.com/blogs/machine-learning/code-free-machine-learning-automl-with-autogluon-amazon-sagemaker-and-aws-lambda/)

A competitor from Zurich:
 - [Modulos.ai](https://www.modulos.ai/)

<img alt="" caption="Bayesian Optimization: surrogate function (black, blue) and acquisition function (green)" 
id="bayesian_optimization" src="../images/image4.png" width="320" height="320">


<img alt="" caption="Auto-Sklearn" 
id="auto-sklearn" src="../images/image3.png" width="720" height="520">


__SMAC__ (sequential model-based algorithm configuration)

 - The data set is divided into n folds.
 - For each fold, characteristics of the data are determined and a signature of the fold is computed with PCA.
 - Applying a hyperparameter configuration to a fold yields a cost c: `[h1, h2, h3, h4, h5][s1, s2, s3] -> c`
 - Initially, random combinations of hyperparameter configurations and data folds are evaluated to obtain measurement points.
 - A random forest is fitted to these measurement points.
 - New configurations (candidates) are combined with the signatures of all data folds and passed through the random forest.
 - The leaf predictions of each tree are averaged over all data-fold signatures, and these per-tree results are combined over all trees in the forest. This yields the mean values and variances used in the acquisition function (max. objective, min. uncertainty).
 - In this way, many different parameter combinations can be assessed without training the actual ML algorithm on each new configuration.
 - The hyperparameter combinations with the highest acquisition values are then run on the actual ML algorithm and compared against the incumbent (the best combination so far). This creates new measurement points, and the random forest is retrained.
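
The loop described above can be sketched in a few lines. This is a simplified illustration, not SMAC's actual implementation: it omits the data-fold signatures, evaluates a single challenger per iteration without a race against the incumbent, and uses a plain lower confidence bound as acquisition function, which is an assumption made for brevity. The `cost` function and both hyperparameters are hypothetical stand-ins for training a real model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)


def cost(config):
    """Hypothetical stand-in for training and validating a real model."""
    learning_rate, depth = config
    return (np.log10(learning_rate) + 1.0) ** 2 + 0.05 * (depth - 5) ** 2


def sample_configs(n):
    """Draw n random configurations: log-uniform learning rate, integer depth."""
    learning_rate = 10 ** rng.uniform(-3, 0, size=n)
    depth = rng.integers(2, 11, size=n)
    return np.column_stack([learning_rate, depth])


# 1. Evaluate a few random configurations to obtain initial measurement points.
n_init, n_iter = 5, 15
configs = sample_configs(n_init)
costs = np.array([cost(c) for c in configs])

for _ in range(n_iter):
    # 2. Fit the random-forest surrogate on all points measured so far.
    surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
    surrogate.fit(configs, costs)

    # 3. Score many candidates with the cheap surrogate only.
    candidates = sample_configs(500)
    per_tree = np.stack([tree.predict(candidates) for tree in surrogate.estimators_])
    mean, std = per_tree.mean(axis=0), per_tree.std(axis=0)

    # 4. Acquisition (assumed): trade predicted cost against surrogate uncertainty.
    acquisition = mean - std
    challenger = candidates[np.argmin(acquisition)]

    # 5. Evaluate the challenger on the real objective and add it to the data.
    configs = np.vstack([configs, challenger])
    costs = np.append(costs, cost(challenger))

incumbent = configs[np.argmin(costs)]
print("best configuration:", incumbent, "cost:", costs.min())
```

Only step 5 calls the expensive objective; the 500 candidates per iteration are judged by the surrogate alone, which is what makes the approach cheap.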


[install auto-sklearn](https://automl.github.io/auto-sklearn/master/installation.html)

In [1]:
# curl https://raw.githubusercontent.com/automl/auto-sklearn/master/requirements.txt | xargs -n 1 -L 1 pip3 install
! cat auto-sklearn-requirements.txt | xargs -n 1 -L 1 pip3 install





In [2]:
!pip install -U auto-sklearn==0.12.4



In [3]:
!pip install -U scikit-learn==0.24.0



In [4]:
import sklearn.metrics
import sklearn.model_selection  # used below for train_test_split
import autosklearn.regression
import pandas as pd
import numpy as np



## let's tackle our house-prices example

In [87]:
!pwd

/home/martin/python/fhnw_lecture/notebooks


In [5]:
train = pd.read_csv('../data/train.csv', sep=",")
test = pd.read_csv('../data/test.csv')

In [6]:
# Separate the target from the features.
y = train['SalePrice']
X = train.drop('SalePrice', axis=1)

# Split the columns into categorical (object dtype) and numerical ones.
categorical = [var for var in X.columns if X[var].dtype == 'O']
numerical = [var for var in X.columns if X[var].dtype != 'O']
X[categorical] = X[categorical].fillna('None')

# auto-sklearn cannot deal with categorical variables, so one-hot encode them
X = pd.concat([pd.get_dummies(X[categorical], dummy_na=True), X[numerical]], axis=1)

# Train on log1p(SalePrice); predictions and errors are therefore on the log scale.
y = np.log1p(y)
X_train, X_test, y_train, y_test = \
    sklearn.model_selection.train_test_split(X, y, random_state=42, test_size=0.2)

[Parameters](https://automl.github.io/auto-sklearn/master/api.html#regression)

In [7]:
! rm -rf /tmp/autosklearn_*
automl = autosklearn.regression.AutoSklearnRegressor(
    time_left_for_this_task=600,  # total search budget in seconds
    per_run_time_limit=30,  # time limit for a single model fit in seconds
    memory_limit=4096,  # in MB
    ensemble_size=8,
    ensemble_nbest=4,
    max_models_on_disc=16,
    n_jobs=2,
    include_estimators=['gradient_boosting', 'ard_regression', 'sgd'],
    resampling_strategy='cv',
    # include_preprocessors=["no_preprocessing"],
    tmp_folder='/tmp/autosklearn_regression_example_tmp',
    output_folder='/tmp/autosklearn_regression_example_out',
    delete_tmp_folder_after_terminate=True,
    delete_output_folder_after_terminate=False
)

In [8]:
automl.fit(X_train, y_train, dataset_name='house-prices')

AutoSklearnRegressor(delete_output_folder_after_terminate=False,
                     ensemble_nbest=4, ensemble_size=8,
                     include_estimators=['gradient_boosting', 'ard_regression',
                                         'sgd'],
                     max_models_on_disc=16, memory_limit=4096, n_jobs=2,
                     output_folder='/tmp/autosklearn_regression_example_out',
                     per_run_time_limit=30, resampling_strategy='cv',
                     time_left_for_this_task=600,
                     tmp_folder='/tmp/autosklearn_regression_example_tmp')

In [None]:
print(automl.show_models())

In [9]:
autosklearn.__version__

'0.12.4'

In [10]:
predictions = automl.predict(X_test)
print("R2 score:", sklearn.metrics.r2_score(y_test, predictions))
print("root-mean-squared-error:", sklearn.metrics.mean_squared_error(y_test, predictions, squared=False))

R2 score: 0.8929860278928468
root-mean-squared-error: 0.14131519071831905
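
Because the target was transformed with `np.log1p` before training, the error printed above (computed with `squared=False`, i.e. a root-mean-squared error) is measured on the log scale. It therefore corresponds to the root-mean-squared logarithmic error (RMSLE) of the original prices. The sketch below illustrates the relationship with made-up prices; the numbers are assumptions for illustration only and are not taken from the data set.

```python
import numpy as np

# Hypothetical sale prices and predictions on the original price scale.
price_true = np.array([208500.0, 181500.0, 223500.0, 140000.0])
price_pred = np.array([200000.0, 190000.0, 230000.0, 150000.0])

# The notebook trains on log1p(price), so the model predicts on the log scale.
log_true = np.log1p(price_true)
log_pred = np.log1p(price_pred)

# RMSE on the log scale is the RMSLE of the prices.
rmse_log = np.sqrt(np.mean((log_true - log_pred) ** 2))

# Back-transform log-scale predictions with expm1 to recover prices.
recovered = np.expm1(log_pred)
rmse_price = np.sqrt(np.mean((price_true - recovered) ** 2))

print("RMSE on log scale (= RMSLE of prices):", rmse_log)
print("RMSE in price units:", rmse_price)
```

A log-scale RMSE can be read roughly as a relative error, whereas the back-transformed RMSE is in the currency unit of the prices.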


# autogluon.tabular

In [11]:
!pip install mxnet==1.7.0.post1



In [12]:
!pip install autogluon-core==0.0.16b20210114 autogluon-tabular==0.0.16b20210114

Collecting scikit-learn<0.24,>=0.22.0
  Using cached scikit_learn-0.23.2-cp37-cp37m-manylinux1_x86_64.whl (6.8 MB)




Installing collected packages: scikit-learn
  Attempting uninstall: scikit-learn
    Found existing installation: scikit-learn 0.24.0
    Uninstalling scikit-learn-0.24.0:
      Successfully uninstalled scikit-learn-0.24.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
auto-sklearn 0.12.4 requires scikit-learn<0.25.0,>=0.24.0, but you have scikit-learn 0.23.2 which is incompatible.
Successfully installed scikit-learn-0.23.2


In [13]:
from autogluon.tabular import TabularPrediction as task
label_column = 'SalePrice'
# AutoGluon expects the label as a column of the training DataFrame
X_train = pd.concat([X_train, y_train], axis=1)



In [16]:
predictor = task.fit(train_data=X_train, label='SalePrice',
                    time_limits=600, output_directory='/tmp/autogluon_regression_example_tmp',
                    presets='best_quality', problem_type='regression',
                    eval_metric='root_mean_squared_error', auto_stack=True, save_space=True,
                    cache_data=False,
                    excluded_model_types = ['XT', 'NN', 'FASTAI'],
                    nthreads_per_trial=2)
y_pred = predictor.predict(X_test)
perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred)
predictor.fit_summary()

[1000]	train_set's rmse: 0.00456947	valid_set's rmse: 0.133531
[2000]	train_set's rmse: 0.0010747	valid_set's rmse: 0.133358
[1000]	train_set's rmse: 0.00732903	valid_set's rmse: 0.165941
[2000]	train_set's rmse: 0.00174502	valid_set's rmse: 0.163439
[3000]	train_set's rmse: 0.000438622	valid_set's rmse: 0.162985
[4000]	train_set's rmse: 0.000112624	valid_set's rmse: 0.162895
[5000]	train_set's rmse: 2.95958e-05	valid_set's rmse: 0.162879
[6000]	train_set's rmse: 7.70912e-06	valid_set's rmse: 0.162875
[7000]	train_set's rmse: 2.05131e-06	valid_set's rmse: 0.162874
[8000]	train_set's rmse: 5.91651e-07	valid_set's rmse: 0.162873
[9000]	train_set's rmse: 1.94868e-07	valid_set's rmse: 0.162873
[10000]	train_set's rmse: 6.65949e-08	valid_set's rmse: 0.162873
[1000]	train_set's rmse: 0.00491533	valid_set's rmse: 0.122266
[2000]	train_set's rmse: 0.00101376	valid_set's rmse: 0.119548
[3000]	train_set's rmse: 0.00023835	valid_set's rmse: 0.118962
[4000]	train_set's rmse: 5.69615e-05	valid_set'



{'model_types': {'RandomForestMSE_BAG_L0': 'StackerEnsembleModel_RF',
  'KNeighborsUnif_BAG_L0': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L0': 'StackerEnsembleModel_KNN',
  'LightGBM_BAG_L0': 'StackerEnsembleModel_LGB',
  'LightGBMXT_BAG_L0': 'StackerEnsembleModel_LGB',
  'CatBoost_BAG_L0': 'StackerEnsembleModel_CatBoost',
  'XGBoost_BAG_L0': 'StackerEnsembleModel_XGBoost',
  'LightGBMCustom_BAG_L0': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L1': 'WeightedEnsembleModel',
  'RandomForestMSE_BAG_L1': 'StackerEnsembleModel_RF',
  'KNeighborsUnif_BAG_L1': 'StackerEnsembleModel_KNN',
  'KNeighborsDist_BAG_L1': 'StackerEnsembleModel_KNN',
  'LightGBM_BAG_L1': 'StackerEnsembleModel_LGB',
  'LightGBMXT_BAG_L1': 'StackerEnsembleModel_LGB',
  'CatBoost_BAG_L1': 'StackerEnsembleModel_CatBoost',
  'XGBoost_BAG_L1': 'StackerEnsembleModel_XGBoost',
  'LightGBMCustom_BAG_L1': 'StackerEnsembleModel_LGB',
  'WeightedEnsemble_L2': 'WeightedEnsembleModel'},
 'model_performance': {'RandomFo

In [18]:
perf

0.13033138813696107

## other optimization frameworks

### scikit-optimize

In [None]:
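from sklearn.ensemble import GradientBoostingRegressor
from skopt.space import Real, Categorical, Integer
from skopt import BayesSearchCV

regressor = BayesSearchCV(
    GradientBoostingRegressor(),
    {
        'learning_rate': Real(0.1, 0.3),
        # loss names as accepted by the scikit-learn versions pinned above
        'loss': Categorical(['lad', 'ls', 'huber', 'quantile']),
        'max_depth': Integer(3, 6),
    },
    n_iter=32,
    random_state=0,
    verbose=1,
    cv=5,
    n_jobs=-1,
)
# X_train gained the 'SalePrice' label column in the AutoGluon cell; drop it to avoid target leakage
features = X_train.drop(columns='SalePrice', errors='ignore')
# GradientBoostingRegressor does not accept NaN in these scikit-learn versions;
# median imputation of the numerical columns is one simple choice (an assumption, not tuned)
regressor.fit(features.fillna(features[numerical].median()), y_train)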

### TPOT

In [None]:
from tpot import TPOTClassifier 
from sklearn.datasets import load_digits 
from sklearn.model_selection import train_test_split 

digits = load_digits() 
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25, random_state=42) 
tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42) 
tpot.fit(X_train, y_train) 
print(tpot.score(X_test, y_test)) 
tpot.export('tpot_digits_pipeline.py')