# Feature Engineering in AutoML

### Task 1

Based on the models fitted by Auto-sklearn (versions 1.0 and 2.0), compile a summary of the feature preprocessing techniques used in this package. To keep the summary systematic and reproducible, you can save the results to a list or dictionary and analyze it afterwards. Test Auto-sklearn on 5 different datasets.

If you don't know a particular preprocessing method, this is a great opportunity to look it up in the documentation 🔍; besides AutoML, we will broaden our horizons 🌄


Then answer the following questions:
1. Which techniques were used most often?
2. Are there differences between Auto-sklearn 1.0 and 2.0? What might explain any differences?
3. Were different preprocessing techniques selected depending on the dataset?
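
One possible way to collect the results systematically is sketched below. It walks over the ensemble members with `get_models_with_weights()` and `named_steps` (the same calls used later in this notebook) and counts preprocessor names per dataset and per Auto-sklearn version. The helper names `preprocessor_names`, `record`, and `total_counts`, as well as the version labels `"askl1"` and `"askl2"`, are made up for this illustration. The sketch also assumes that "no preprocessing" is stored as the plain string `passthrough`, as the printed output in this notebook suggests.

```python
from collections import Counter, defaultdict


def preprocessor_names(automl):
    """Return the feature preprocessor name of every model in the ensemble."""
    names = []
    for _weight, pipeline in automl.get_models_with_weights():
        for stage_name, component in pipeline.named_steps.items():
            if "feature_preprocessor" in stage_name:
                preprocessor = component.choice.preprocessor
                # Assumption: "passthrough" is a plain string, while every
                # other preprocessor is an estimator object with a class name.
                if isinstance(preprocessor, str):
                    names.append(preprocessor)
                else:
                    names.append(type(preprocessor).__name__)
    return names


def record(results, dataset_name, version, automl):
    """Store the counts as results[dataset_name][version] -> Counter."""
    results[dataset_name][version] = Counter(preprocessor_names(automl))


def total_counts(results, version=None):
    """Sum the counts over all datasets, optionally for a single version."""
    total = Counter()
    for per_version in results.values():
        for recorded_version, counts in per_version.items():
            if version is None or recorded_version == version:
                total.update(counts)
    return total


# Hypothetical usage once the models below have been fitted:
# results = defaultdict(dict)
# record(results, "credit-g", "askl1", askl1)
# record(results, "credit-g", "askl2", askl2)
# print(total_counts(results).most_common())
```

Keeping one `Counter` per dataset and version makes all three questions answerable from the same structure: `total_counts(results)` for overall frequency, `total_counts(results, version=...)` for the 1.0 vs 2.0 comparison, and `results[dataset_name]` for per-dataset differences.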


### Installing Auto-sklearn (remember to restart the runtime after installation)

In [None]:
# # 1. uninstall all affected packages
# !pip uninstall -y Cython scipy pyparsing scikit_learn imbalanced-learn mlxtend yellowbrick

# # 2. install packages to be downgraded
# !pip install Cython==0.29.36 scipy==1.9 pyparsing==2.4

# # 3. install older scikit-learn disregarding its dependencies
# !pip install scikit-learn==0.24.2 --no-build-isolation

# # 4. finally install auto-sklearn
# !pip install auto-sklearn

Found existing installation: Cython 3.0.5
Uninstalling Cython-3.0.5:
  Successfully uninstalled Cython-3.0.5
Found existing installation: scipy 1.11.3
Uninstalling scipy-1.11.3:
  Successfully uninstalled scipy-1.11.3
Found existing installation: pyparsing 3.1.1
Uninstalling pyparsing-3.1.1:
  Successfully uninstalled pyparsing-3.1.1
Found existing installation: scikit-learn 1.2.2
Uninstalling scikit-learn-1.2.2:
  Successfully uninstalled scikit-learn-1.2.2
Found existing installation: imbalanced-learn 0.10.1
Uninstalling imbalanced-learn-0.10.1:
  Successfully uninstalled imbalanced-learn-0.10.1
Found existing installation: mlxtend 0.22.0
Uninstalling mlxtend-0.22.0:
  Successfully uninstalled mlxtend-0.22.0
Found existing installation: yellowbrick 1.5
Uninstalling yellowbrick-1.5:
  Successfully uninstalled yellowbrick-1.5
Collecting Cython==0.29.36
  Downloading Cython-0.29.36-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (1.9 MB)

Collecting scikit-learn==0.24.2
  Downloading scikit-learn-0.24.2.tar.gz (7.5 MB)
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: scikit-learn
  Building wheel for scikit-learn (pyproject.toml) ... done
  Created wheel for scikit-learn: filename=scikit_learn-0.24.2-cp310-cp310-linux_x86_64.whl size=22231991 sha256=05e2f14c2076d989115c195f77783973bbf49feb3bfbd1f75f26df74dbbcf450
  Stored in directory: /root/.cache/pip/wheels/13/a4/68/4e78865652fa14db4a162b491e5138565f97646f9e1f2ab8cc
Successfully built scikit-learn
Installing collected packages: scikit-learn
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.13.0 requires scikit-learn>=1.2.2, but you have scikit-l

### Setup

In [None]:
from autosklearn.classification import AutoSklearnClassifier
from autosklearn.experimental.askl2 import AutoSklearn2Classifier
from autosklearn.metrics import roc_auc

import sklearn.datasets
import sklearn.metrics
import sklearn.model_selection

# From OpenML: https://www.openml.org/d/31
dataset_name = "credit-g"

def get_data_and_scoring_function(dataset_name):
    X, y = sklearn.datasets.fetch_openml(dataset_name, as_frame=True, return_X_y=True)
    X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(
        X, y, random_state=42, stratify=y,
    )

    def scoring_function(estimator):
        predictions = estimator.predict_proba(X_test)[:, 1]
        return sklearn.metrics.roc_auc_score(y_test, predictions)

    def train_scoring_function(estimator):
        predictions = estimator.predict_proba(X_train)[:, 1]
        return sklearn.metrics.roc_auc_score(y_train, predictions)

    def get_test_data():
        return X_test, y_test

    return X_train, y_train, get_test_data, scoring_function, train_scoring_function

X_train, y_train, get_test_data, scoring_function, train_scoring_function = get_data_and_scoring_function(dataset_name)

print(f"Done downloading {dataset_name}")

  warn("Multiple active versions of the dataset matching the name"


Done downloading credit-g


### Training


In [None]:
settings = {
  "time_left_for_this_task": 120,  # seconds
  "seed": 42,
  "metric": roc_auc,
  "n_jobs": 4,
}

# This will only be used by autosklearn 1 while autosklearn 2 will automatically
# select a strategy
resampling_strategy = "holdout"

#-------------------------


In [None]:
askl1 = AutoSklearnClassifier(
    **settings,
    resampling_strategy=resampling_strategy
)
askl1.fit(X_train, y_train, dataset_name="credit-g")



AutoSklearnClassifier(ensemble_class=<class 'autosklearn.ensembles.ensemble_selection.EnsembleSelection'>,
                      metric=roc_auc, n_jobs=4, per_run_time_limit=48, seed=42,
                      time_left_for_this_task=120)

In [None]:
from pprint import pprint

print(f"Auto-sklearn 1.0 | train = {train_scoring_function(askl1)} | test = {scoring_function(askl1)}")
print(f"Selected `resampling-strategy` = {askl1.resampling_strategy}")
print(f"Selected `resampling-strategy-arguments` = {askl1.resampling_strategy_arguments}")

# Some quick summary statistics
print(askl1.sprint_statistics())

# The leaderboard shows all the models during the optimization process,
# see this link for arguments if you want to see more!
# https://automl.github.io/auto-sklearn/master/api.html#autosklearn.classification.AutoSklearnClassifier.leaderboard
leaderboard = askl1.leaderboard(sort_by="model_id", ensemble_only=True)
print(leaderboard)

# Show all the models in the final produced ensemble
# pprint(askl1.show_models())

# For compatibility with scikit-learn we implement `cv_results_`, but the output is pretty lengthy, so we leave this commented
# print(askl1.cv_results_)

Auto-sklearn 1.0 | train = 0.9549375661375661 | test = 0.7929904761904762
Selected `resampling-strategy` = holdout
Selected `resampling-strategy-arguments` = None
auto-sklearn results:
  Dataset name: credit-g
  Metric: roc_auc
  Best validation score: 0.810500
  Number of target algorithm runs: 28
  Number of successful target algorithm runs: 27
  Number of crashed target algorithm runs: 0
  Number of target algorithms that exceeded the time limit: 1
  Number of target algorithms that exceeded the memory limit: 0

          rank  ensemble_weight                type      cost   duration
model_id                                                                
3           12             0.04       liblinear_svc  0.374961   5.282920
5            4             0.02   gradient_boosting  0.216294   5.357056
7            1             0.14                 sgd  0.189500   8.218706
8            9             0.02                 mlp  0.270348  26.110157
10           8             0.06       ran

In [None]:
askl2 = AutoSklearn2Classifier(
    **settings
)
askl2.fit(X_train, y_train, dataset_name="credit-g")

  for col, series in prediction.iteritems():


AutoSklearn2Classifier(metric=roc_auc, n_jobs=4, per_run_time_limit=48, seed=42,
                       time_left_for_this_task=120)

### How to extract individual elements?

In [None]:
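for i, (weight, pipeline) in enumerate(askl1.get_models_with_weights()):
    # Each ensemble member is a pipeline; inspect its named steps
    for stage_name, component in pipeline.named_steps.items():
        if "feature_preprocessor" in stage_name:
            print(i)
            # The selected feature preprocessing method and its hyperparameters
            print(component.choice.preprocessor)
        if "classifier" in stage_name:
            # The selected classification algorithm
            print(component.choice)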





0
passthrough
autosklearn.pipeline Quadratic Discriminant Analysis
1
KernelPCA(coef0=0.0, gamma=0.011140362342581723, kernel='rbf',
          n_components=1598, random_state=42, remove_zero_eig=True)
autosklearn.pipeline Linear Discriminant Analysis
2
FastICA(fun='exp', random_state=42, whiten=False)
autosklearn.pipeline Stochastic Gradient Descent Classifier
3
Nystroem(coef0=1.0, degree=3, gamma=1.0, kernel='cosine', n_components=1358,
         random_state=42)
autosklearn.pipeline Passive Aggressive Classifier
4
FeatureAgglomeration(n_clusters=22,
                     pooling_func=<function amax at 0x7f196fae79a0>)
autosklearn.pipeline Random Forest Classifier
5
SelectFromModel(estimator=ExtraTreesClassifier(class_weight='balanced',
                                               criterion='entropy',
                                               max_features=16,
                                               min_samples_leaf=16, n_jobs=1,
                                             

## Task 2

Based on the conclusions from the previous task, check whether applying the preprocessing methods selected by Auto-sklearn improves the quality of models built with AutoGluon.

- Simplest version: preprocess the whole dataset and then run AutoGluon on it.
- Intermediate version: use the `sklearn.pipeline` module.
- Pro version: follow the example linked below and define an appropriate class in AutoGluon.

https://auto.gluon.ai/stable/tutorials/tabular/tabular-feature-engineering.html
https://github.com/autogluon/autogluon/blob/master/examples/tabular/example_custom_feature_generator.py
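
A minimal sketch of the intermediate version follows, assuming all features are numeric (categorical columns, such as those in credit-g, would need encoding first). It chains `StandardScaler` with `FeatureAgglomeration`, one of the preprocessors that appears in the Auto-sklearn output above, fits the pipeline on the training split only, and returns DataFrames that AutoGluon can consume. The helper names `make_preprocessor`, `preprocess_split`, and `fit_autogluon`, the `feat_i` column names, and the `target` label are illustrative choices, not part of either library.

```python
import pandas as pd
from sklearn.cluster import FeatureAgglomeration
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def make_preprocessor(n_clusters=2):
    """Scale the features, then merge similar ones into n_clusters features."""
    return Pipeline([
        ("scale", StandardScaler()),
        ("agglomerate", FeatureAgglomeration(n_clusters=n_clusters)),
    ])


def _to_frame(array, index, y, label):
    """Wrap a transformed array in a DataFrame and attach the label column."""
    columns = [f"feat_{i}" for i in range(array.shape[1])]
    frame = pd.DataFrame(array, columns=columns, index=index)
    frame[label] = y.to_numpy()
    return frame


def preprocess_split(X_train, X_test, y_train, y_test, label="target", n_clusters=2):
    """Fit the preprocessing on the training split only and transform both splits."""
    preprocessor = make_preprocessor(n_clusters=n_clusters)
    train_array = preprocessor.fit_transform(X_train)
    test_array = preprocessor.transform(X_test)
    train_df = _to_frame(train_array, X_train.index, y_train, label)
    test_df = _to_frame(test_array, X_test.index, y_test, label)
    return train_df, test_df


def fit_autogluon(train_df, label="target", time_limit=120):
    """Train AutoGluon on the preprocessed frame (requires autogluon installed)."""
    from autogluon.tabular import TabularPredictor  # imported lazily on purpose

    predictor = TabularPredictor(label=label, eval_metric="roc_auc")
    return predictor.fit(train_df, time_limit=time_limit)
```

To judge whether the preprocessing helps, the same `fit_autogluon` call can be run on the raw training frame and on the preprocessed one, and the two predictors compared on the corresponding test frames. Fitting the pipeline on the training split only avoids leaking test-set statistics, which the simplest version (preprocessing the whole dataset) does not guarantee.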