## EvalML: AutoML

EvalML is an AutoML library that builds, optimizes, and evaluates machine learning pipelines using domain-specific objective functions.

**Key Functionality**

- Automation - Makes machine learning easier. Avoid training and tuning models by hand. Includes data quality checks, cross-validation and more.
- Data Checks - Catches and warns about problems with your data and problem setup before modeling.
- End-to-end - Constructs and optimizes pipelines that include state-of-the-art preprocessing, feature engineering, feature selection, and a variety of modeling techniques.
- Model Understanding - Provides tools to inspect and understand models, so you can learn how they will behave in your problem domain.
- Domain-specific - Includes a repository of domain-specific objective functions and an interface to define your own.
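
To make the idea of a domain-specific objective concrete, here is a minimal sketch in plain Python. It does not use EvalML's objective interface; the function name `cost_objective` and the cost values are hypothetical, chosen only to show how a business cost can replace a generic metric such as accuracy.

```python
def cost_objective(y_true, y_pred, fp_cost=10.0, fn_cost=100.0):
    """Total cost of misclassifications; lower is better.

    fp_cost and fn_cost are illustrative assumptions: here a missed
    positive is taken to be ten times as expensive as a false alarm.
    """
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 0:
            total += fp_cost  # false positive
        elif predicted == 0 and actual == 1:
            total += fn_cost  # false negative
    return total


y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1]
cost = cost_objective(y_true, y_pred)
print("Total cost:", cost)
```

Two models with the same accuracy can score very differently under such an objective, which is why optimizing for it directly can change which pipeline is ranked best.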

In [1]:
import warnings

import orchest
from evalml.automl import AutoMLSearch
from sklearn.metrics import f1_score

# Silence library warnings to keep the notebook output readable.
warnings.filterwarnings("ignore")


In [2]:
# Retrieve the data passed in from the upstream Orchest pipeline step.
data = orchest.get_inputs()
train, test = data["data"]

In [3]:
X = train.drop(['target'], axis=1)
y = train['target']

In [4]:
from evalml.utils import infer_feature_types

# Explicitly tag the `text` column as natural language so that it is
# treated as text even if Woodwork's automatic type inference would
# not have detected it.
X = infer_feature_types(X, {'text': 'NaturalLanguage'})

In [5]:
from evalml.preprocessing import split_data

X_train, X_holdout, y_train, y_holdout = split_data(X, y, problem_type='binary', test_size=0.2)
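
As a rough illustration of what a holdout split for a classification problem involves, the sketch below implements a stratified split in plain Python. It assumes that `split_data` stratifies on the target for binary problems, which is worth confirming against the installed EvalML version; the helper `stratified_split` is hypothetical and is not part of EvalML.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, holdout_idx) with class ratios preserved."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the holdout set.
        n_holdout = round(len(indices) * test_size)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


labels = [0] * 80 + [1] * 20  # imbalanced toy target
train_idx, holdout_idx = stratified_split(labels)
print(len(train_idx), len(holdout_idx))
```

Keeping the class ratio the same in both sets matters for imbalanced targets, where a purely random split could leave the holdout set with too few positive examples for a stable F1 estimate.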

In [6]:
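# Search for the best binary classification pipeline, also tracking F1.
# An integer max_time is interpreted as seconds, so the search is
# limited to about 5 minutes.
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type='binary',
    additional_objectives=['f1'],
    max_time=300,
)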

In [7]:
automl.search()

In [8]:
%%time
pipeline = automl.best_pipeline
pipeline.fit(X_train, y_train)

CPU times: user 7.44 s, sys: 437 ms, total: 7.88 s
Wall time: 7.28 s


pipeline = BinaryClassificationPipeline(component_graph={'Drop Columns Transformer': ['Drop Columns Transformer', 'X', 'y'], 'Text Featurization Component': ['Text Featurization Component', 'Drop Columns Transformer.x', 'y'], 'Imputer': ['Imputer', 'Text Featurization Component.x', 'y'], 'One Hot Encoder': ['One Hot Encoder', 'Imputer.x', 'y'], 'XGBoost Classifier': ['XGBoost Classifier', 'One Hot Encoder.x', 'y']}, parameters={'Drop Columns Transformer':{'columns': ['location']}, 'Imputer':{'categorical_impute_strategy': 'most_frequent', 'numeric_impute_strategy': 'mean', 'categorical_fill_value': None, 'numeric_fill_value': None}, 'One Hot Encoder':{'top_n': 10, 'features_to_encode': None, 'categories': None, 'drop': 'if_binary', 'handle_unknown': 'ignore', 'handle_missing': 'error'}, 'XGBoost Classifier':{'eta': 0.1, 'max_depth': 6, 'min_child_weight': 1, 'n_estimators': 100, 'n_jobs': -1, 'eval_metric': 'logloss'}}, random_seed=0)

In [9]:
preds = pipeline.predict(X_holdout)

In [10]:
# Evaluate on the holdout set, then pass the search object downstream.
print("F1 score:", f1_score(y_holdout, preds))
orchest.output(automl, name='automl')

F1 score: 0.6962457337883959
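
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it by hand for the positive class on a small made-up example; the helper name `f1_from_counts` and the toy labels are illustrative and are not taken from the notebook's data.

```python
def f1_from_counts(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
f1 = f1_from_counts(y_true, y_pred)
print("F1 score:", f1)
```

Because it combines precision and recall, F1 penalizes a model that does well on only one of the two, which makes it a common choice when the positive class is the one of interest.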
