## EvalML: AutoML

EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

**Key Functionality**

- Automation - Makes machine learning easier by removing the need to train and tune models by hand. Includes data quality checks, cross-validation, and more.
- Data Checks - Catches and warns about problems with your data and problem setup before modeling.
- End-to-end - Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
- Model Understanding - Provides tools to inspect and understand models, so you can learn how they will behave in your problem domain.
- Domain-specific - Includes a repository of domain-specific objective functions and an interface for defining your own.

In [7]:
import orchest
# EvalML
from evalml.automl import AutoMLSearch
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import warnings

warnings.filterwarnings("ignore")


In [8]:
data = orchest.get_inputs()  # dict of upstream step outputs, keyed by output name
bcell, covid, sars, bcell_sars = data["data"]  # the "data" entry unpacks into four DataFrames

In [9]:
X = bcell_sars.drop(
    ["target", "parent_protein_id", "protein_seq", "peptide_seq"], axis=1
)
y = bcell_sars["target"]

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
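
The split above is neither seeded nor stratified, so the test set (and any score computed on it) changes between runs. scikit-learn's `train_test_split` accepts `random_state` and `stratify` arguments for this. The sketch below shows the underlying idea with the standard library only; the 73/27 label counts and the `stratified_split` helper are hypothetical and exist purely for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=2021):
    """Return (train_idx, test_idx) with roughly the same class ratio in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within each class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical, imbalanced labels: 73 negatives and 27 positives.
labels = [0] * 73 + [1] * 27
train_idx, test_idx = stratified_split(labels)

test_positive_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), test_positive_rate)
```

Calling the helper twice with the same seed returns the same indices, which is what makes later comparisons between runs meaningful.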

In [11]:
%%time
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type='binary',
    # random_seed=2021,  # uncomment for a reproducible search
    max_time=300,  # stop searching for new pipelines after 300 seconds
)

Generating pipelines to search over...


[D 2021-08-19 10:14:51,307.307 /opt/conda/lib/python3.7/site-packages/evalml/utils/gen_utils.py] Could not import class ProphetRegressor in get_importable_subclasses


8 pipelines ready for search.
CPU times: user 673 ms, sys: 194 ms, total: 867 ms
Wall time: 969 ms


[D 2021-08-19 10:14:51,622.622 /opt/conda/lib/python3.7/site-packages/evalml/automl/automl_search.py] allowed_estimators set to ['Decision Tree Classifier', 'LightGBM Classifier', 'Extra Trees Classifier', 'Elastic Net Classifier', 'CatBoost Classifier', 'XGBoost Classifier', 'Random Forest Classifier', 'Logistic Regression Classifier']

In [12]:
automl.search()


*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary. 
Lower score is better.

Using SequentialEngine to train and score pipelines.
Will stop searching for new pipelines after 300 seconds.

Allowed model families: decision_tree, lightgbm, extra_trees, linear_model, xgboost, random_forest, catboost




[Interactive plot (FigureWidget): 'Best Score' trace, lines+markers]

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline




Mode Baseline Binary Classification Pipeline:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 9.303
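
A baseline log loss above 9 can look surprising next to the sub-1 scores of the real models. One plausible explanation, sketched below, is that a mode baseline predicts the majority class with probability 1, so every minority-class row contributes the maximum clipped penalty. The class balance (73/27), the clipping constant `1e-15`, and the assumption that the baseline emits hard 0/1 probabilities are all illustrative assumptions, not facts taken from the output above.

```python
import math


def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary log loss, with probabilities clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels: 73 negatives, 27 positives.
y_true = [0] * 73 + [1] * 27

# "Mode" baseline: always predicts the majority class (0) with full confidence.
mode_loss = binary_log_loss(y_true, [0.0] * len(y_true))

# A softer baseline that predicts the positive-class rate for every row.
prior_loss = binary_log_loss(y_true, [0.27] * len(y_true))

print(f"mode baseline:  {mode_loss:.3f}")
print(f"prior baseline: {prior_loss:.3f}")
```

Under these assumptions the confident baseline lands in the same range as the score reported above, while a baseline that hedges with the class prior scores far lower. Log loss punishes confident mistakes much more heavily than uncertain ones.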

*****************************
* Evaluating Batch Number 1 *
*****************************




Elastic Net Classifier w/ Imputer + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.556



Decision Tree Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.594



Random Forest Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.462



LightGBM Classifier w/ Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.386




Do you really want to exit search (y/n)?  y


Exiting AutoMLSearch.

Search finished after 00:24            




Best pipeline: LightGBM Classifier w/ Imputer
Best pipeline Log Loss Binary: 0.385990
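
The search was stopped manually after 24 seconds, so only the baseline and four candidate pipelines were scored. To compare them at a glance, the sketch below ranks the mean cross-validation scores copied from the output above and expresses each as a percentage improvement over the baseline. The improvement formula is a simple illustration for a lower-is-better objective. EvalML exposes a comparable table through `automl.rankings`.

```python
# Mean cross-validation log loss values copied from the search output above.
scores = {
    "Mode Baseline Binary Classification Pipeline": 9.303,
    "Elastic Net Classifier w/ Imputer + Standard Scaler": 0.556,
    "Decision Tree Classifier w/ Imputer": 0.594,
    "Random Forest Classifier w/ Imputer": 0.462,
    "LightGBM Classifier w/ Imputer": 0.386,
}

baseline = scores["Mode Baseline Binary Classification Pipeline"]

# Lower log loss is better, so sort ascending.
ranking = sorted(scores.items(), key=lambda item: item[1])

for name, score in ranking:
    improvement = 100 * (baseline - score) / baseline
    print(f"{score:6.3f}  {improvement:5.1f}%  {name}")

best_name = ranking[0][0]
```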




### Using the best pipeline

In [13]:
%%time
pipeline = automl.best_pipeline
pipeline.fit(X_train, y_train)

CPU times: user 1.62 s, sys: 38.8 ms, total: 1.66 s
Wall time: 598 ms




pipeline = BinaryClassificationPipeline(component_graph={'Imputer': ['Imputer', 'X', 'y'], 'LightGBM Classifier': ['LightGBM Classifier', 'Imputer.x', 'y']}, parameters={'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'LightGBM Classifier':{'boosting_type': 'gbdt', 'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 0, 'num_leaves': 31, 'min_child_samples': 20, 'n_jobs': -1, 'bagging_freq': 0, 'bagging_fraction': 0.9}}, random_seed=0)

In [14]:
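# ROC AUC ranks examples by score, so use positive-class probabilities rather than hard labels.
# Assumes predict_proba returns a DataFrame with one column per class, ordered by label.
preds = pipeline.predict_proba(X_test).iloc[:, 1]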

In [15]:
from sklearn.metrics import roc_auc_score  # re-imported here so this cell runs on its own

print("AUC score:", roc_auc_score(y_test, preds))
orchest.output(automl, name='automl')
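
ROC AUC measures how often a randomly chosen positive example is scored above a randomly chosen negative one, so it is normally computed from predicted probabilities. Hard class labels collapse that ranking into two values and usually understate the score. The sketch below illustrates the difference with a small pairwise implementation. The labels and probabilities are made up, and `roc_auc` is a hypothetical helper, not scikit-learn's implementation.

```python
def roc_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for the positive class.
y_true = [0, 0, 0, 1, 1, 1]
proba = [0.10, 0.40, 0.60, 0.45, 0.70, 0.90]

# Hard labels obtained by thresholding at 0.5.
hard = [1 if p >= 0.5 else 0 for p in proba]

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, hard)

print(f"AUC from probabilities: {auc_from_proba:.3f}")
print(f"AUC from hard labels:   {auc_from_labels:.3f}")
```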