## EvalML: AutoML

EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

**Key Functionality**

- Automation - Makes machine learning easier by removing the need to train and tune models by hand. Includes data quality checks, cross-validation, and more.
- Data Checks - Catches and warns about problems with your data and problem setup before modeling.
- End-to-end - Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
- Model Understanding - Provides tools to understand and introspect models, so you can learn how they'll behave in your problem domain.
- Domain-specific - Includes a repository of domain-specific objective functions and an interface for defining your own.

In [1]:
import warnings

import orchest
from evalml.automl import AutoMLSearch
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")






In [2]:
data = orchest.get_inputs()  # dict of upstream step outputs, keyed by output name
bcell, covid, sars, bcell_sars = data["data"]  # the four datasets passed in under the name "data"

In [3]:
# Drop the label plus the identifier and raw-sequence columns from the features.
X = bcell_sars.drop(
    ["target", "parent_protein_id", "protein_seq", "peptide_seq"], axis=1
)
y = bcell_sars["target"]

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
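The split above uses neither a fixed seed nor stratification, so the held-out rows (and the final score) change from run to run. The following is a minimal sketch of a reproducible, stratified split. It runs on synthetic stand-in data rather than the notebook's `bcell_sars` table, and the names `X_demo`, `y_demo`, and `split`, the seed value, and the 25% positive rate are all illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 1000 rows, 4 features, roughly 25% positives.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.25).astype(int)


def split(seed):
    """Return a stratified 80/20 split that is fully determined by `seed`."""
    return train_test_split(
        X_demo,
        y_demo,
        test_size=0.2,
        stratify=y_demo,    # keep the class ratio the same in both parts
        random_state=seed,  # fix the shuffle so the split is repeatable
    )


X_tr, X_te, y_tr, y_te = split(42)
train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

With a fixed `random_state`, rerunning the cell yields the same partition, which makes later scores comparable across runs. Stratifying matters most when one class is rare.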

In [5]:
%%time
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type='binary',
    # random_seed=2021,  # uncomment to make the search reproducible
    max_time=300,  # search time budget in seconds
)

Generating pipelines to search over...
8 pipelines ready for search.

CPU times: user 2.79 s, sys: 2.21 s, total: 5.01 s
Wall time: 9.69 s


In [6]:
automl.search()


*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary.
Lower score is better.

Using SequentialEngine to train and score pipelines.
Will stop searching for new pipelines after 300 seconds.

Allowed model families: linear_model, catboost, lightgbm, decision_tree, random_forest, xgboost, extra_trees




Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 9.187

*****************************
* Evaluating Batch Number 1 *
*****************************

Elastic Net Classifier w/ Imputer + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.552
Decision Tree Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.581
Random Forest Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.458
LightGBM Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.390

Search finished after 11:17
Best pipeline: LightGBM Classifier w/ Imputer
Best pipeline Log Loss Binary: 0.390168
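The baseline's log loss (9.187) is far larger than that of any trained model, which can look surprising. One plausible explanation, sketched below, is that a mode baseline assigns probability 1 to the majority class, so every minority-class row is penalized by the clipped value of `-log(0)`. This is an illustration under stated assumptions, not EvalML's actual implementation. The function `binary_log_loss`, the clipping constant `eps=1e-15`, and the 25% minority share are all hypothetical choices.

```python
import numpy as np


def binary_log_loss(y_true, p_pos, eps=1e-15):
    """Mean binary log loss, with probabilities clipped to [eps, 1 - eps].

    The clipping constant is an assumption; libraries differ on its value.
    """
    p = np.clip(p_pos, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))


# Hypothetical labels: 25% of rows belong to the minority (positive) class.
n = 1000
minority_fraction = 0.25
y_true = np.zeros(n)
y_true[: int(n * minority_fraction)] = 1

# A mode baseline always predicts the majority class, i.e. P(positive) = 0.
mode_probs = np.zeros(n)

baseline_loss = binary_log_loss(y_true, mode_probs)
per_error_penalty = -np.log(1e-15)  # about 34.54 for each misclassified row
expected_loss = minority_fraction * per_error_penalty

print(f"baseline log loss: {baseline_loss:.3f}")
print(f"minority share * penalty: {expected_loss:.3f}")
```

Under these assumptions the baseline score is simply the minority share times about 34.54. Read in reverse, a score of 9.187 would correspond to a minority share of roughly 9.187 / 34.54 ≈ 0.27. That is an inference from the formula, not something the search log reports.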


### Using best pipeline

In [7]:
%%time
pipeline = automl.best_pipeline
pipeline.fit(X_train, y_train)

CPU times: user 3min 53s, sys: 50.7 s, total: 4min 44s
Wall time: 3min 3s


pipeline = BinaryClassificationPipeline(
    component_graph={
        'Imputer': ['Imputer', 'X', 'y'],
        'LightGBM Classifier': ['LightGBM Classifier', 'Imputer.x', 'y'],
    },
    parameters={
        'Imputer': {
            'categorical_impute_strategy': 'most_frequent',
            'numeric_impute_strategy': 'mean',
            'categorical_fill_value': None,
            'numeric_fill_value': None,
        },
        'LightGBM Classifier': {
            'boosting_type': 'gbdt',
            'learning_rate': 0.1,
            'n_estimators': 100,
            'max_depth': 0,
            'num_leaves': 31,
            'min_child_samples': 20,
            'n_jobs': -1,
            'bagging_freq': 0,
            'bagging_fraction': 0.9,
        },
    },
    random_seed=0,
)

In [8]:
preds = pipeline.predict(X_test)

In [9]:
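# Note: preds holds hard class labels, so this AUC reflects a single operating point.
print("AUC score:", roc_auc_score(y_test, preds))
# Pass the search object on to downstream pipeline steps.
orchest.output(automl, name='automl')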

AUC score: 0.8207598615464995
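The AUC above is computed from the output of `predict`, which returns class labels rather than scores. ROC AUC is normally computed from continuous scores, and on hard labels it collapses to balanced accuracy at a single threshold. The sketch below shows the difference using scikit-learn on synthetic data. The logistic regression model and the names `X_demo`, `y_demo`, `hard_labels`, and `pos_scores` are stand-ins, not part of the notebook's pipeline. With the EvalML pipeline, the analogous step would presumably use its `predict_proba` method and select the positive-class column, whose labelling may depend on the installed version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data and model, used only to illustrate the metric.
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)             # 0/1 labels at the default threshold
pos_scores = model.predict_proba(X_te)[:, 1]  # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, pos_scores)
balanced_acc = balanced_accuracy_score(y_te, hard_labels)

print(f"AUC from hard labels:   {auc_from_labels:.4f}")
print(f"Balanced accuracy:      {balanced_acc:.4f}")
print(f"AUC from probabilities: {auc_from_scores:.4f}")
```

The first two numbers coincide because a ROC curve built from binary predictions has only one interior point. The third number evaluates the ranking quality across all thresholds, which is what AUC is usually meant to report.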
