https://evalml.alteryx.com/en/stable/demos/text_input.html

In [22]:
import evalml
from evalml import AutoMLSearch
from evalml.model_understanding.metrics import graph_confusion_matrix

### Dataset

In [3]:
from urllib.request import urlopen
import pandas as pd

input_data = urlopen(
    "https://featurelabs-static.s3.amazonaws.com/spam_text_messages_modified.csv"
)
data = pd.read_csv(input_data)[:750]

X = data.drop(["Category"], axis=1)
y = data["Category"]

print(X.shape)
display(X.head())

(750, 1)


,Message
0,Free entry in 2 a wkly comp to win FA Cup fina...
1,FreeMsg Hey there darling it's been 3 week's n...
2,WINNER!! As a valued network customer you have...
3,Had your mobile 11 months or more? U R entitle...
4,"SIX chances to win CASH! From 100 to 20,000 po..."


In [4]:
y.value_counts(normalize=True)

spam    0.593333
ham     0.406667
Name: Category, dtype: float64
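The class balance also fixes what a majority-class baseline can score. The sketch below is a back-of-the-envelope check, not EvalML's implementation. It assumes the baseline predicts `spam` with probability 1 for every row, that log loss clips probabilities at float64 machine epsilon (as recent scikit-learn releases do), and that the class counts are 445 and 305 (inferred from the proportions above and the 750 rows). Under those assumptions the result lands close to the baseline log loss of 14.658 reported in the search log further down.

```python
import math
import sys

# Class proportions matching y.value_counts(normalize=True) above.
# The counts 445 and 305 are inferred from 750 rows, not printed by the notebook.
spam_fraction = 445 / 750
ham_fraction = 305 / 750

# Assumption: probabilities are clipped to [eps, 1 - eps] before taking logs.
eps = sys.float_info.epsilon

# A constant "always spam" prediction pays -log(1 - eps) on spam rows
# and -log(eps) on ham rows.
baseline_log_loss = -(
    spam_fraction * math.log(1 - eps) + ham_fraction * math.log(eps)
)

# The same constant prediction is correct exactly on the spam rows.
baseline_accuracy = spam_fraction

print(f"baseline log loss ~ {baseline_log_loss:.3f}")
print(f"baseline accuracy ~ {baseline_accuracy:.3f}")
```

Any pipeline worth keeping should beat both numbers by a wide margin.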

In [5]:
# Woodwork only treats a column as 'Natural Language' if we request that logical type at initialization.
# Otherwise the column is typed as 'Unknown' and dropped during the search.
X.ww.init(logical_types={"Message": "NaturalLanguage"})

In [7]:
X.ww

Column,Physical Type,Logical Type,Semantic Tag(s)
Message,string,NaturalLanguage,[]


In [8]:
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(
    X, y, problem_type="binary", test_size=0.2, random_seed=0
)
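To make the effect of the split concrete, here is a minimal pure-Python sketch of a stratified 80/20 split. It is an illustration only, assuming `split_data` stratifies by class for binary problems; the helper `stratified_split` and the stand-in label list are hypothetical and are not part of EvalML.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size, seed):
    """Return (train_idx, holdout_idx), holding out test_size of each class."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, holdout_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_holdout = round(len(indices) * test_size)
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)


# Stand-in labels with the class counts implied by the proportions above
# (445 spam and 305 ham out of 750 rows; inferred, not printed by the notebook).
labels = ["spam"] * 445 + ["ham"] * 305
train_idx, holdout_idx = stratified_split(labels, test_size=0.2, seed=0)

print(len(train_idx), len(holdout_idx))
print(Counter(labels[i] for i in holdout_idx))
```

With these counts the holdout set has 150 rows and keeps the same spam/ham ratio as the full data, which is what makes holdout scores comparable to cross-validation scores.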

### AutoML training

In [12]:
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type="binary",
    max_batches=3,
    verbose=True,
    optimize_thresholds=True,
)

automl.search(interactive_plot=False)

AutoMLSearch will use mean CV score to rank pipelines.

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Log Loss Binary. 
Lower score is better.

Using SequentialEngine to train and score pipelines.
Searching up to 5 batches for a total of None pipelines. 
Allowed model families: 

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 14.658

*****************************
* Evaluating Batch Number 1 *
*****************************

Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.212

*****************************
* Evaluating Batch Number 2 *
*****************************

Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Sele

	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.235
Extra Trees Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.249
XGBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.299
Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.237
Extra Trees Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.226
XGBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished

Extra Trees Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.220
XGBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.279
Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.283
Extra Trees Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.272
XGBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.275
Random Forest Classifier w/ Label En

	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.226
XGBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.287
Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.235
Extra Trees Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.234
XGBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.308
Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finish

	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.576
	High coefficient of variation (cv >= 0.5) within cross validation scores.
	Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer may not perform as estimated on unseen data.
Extra Trees Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.216
XGBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.268
Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer:
	Starting cross validation
	Finished cross validation - mean Log Loss Binary: 0.576
	High coefficient of variation (cv >= 0.5) within cross validation scores.
	Random Forest Classifier w/ 

{1: {'Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer': 5.072690010070801,
  'Total time of batch': 5.206547260284424},
 2: {'Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + RF Classifier Select From Model': 5.553283452987671,
  'Total time of batch': 5.689086437225342},
 3: {'Decision Tree Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer': 4.57648491859436,
  'LightGBM Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer': 5.0739476680755615,
  'Extra Trees Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transformer': 5.574789524078369,
  'Elastic Net Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Standard Scaler + Select Columns Transformer': 4.728957891464233,
  'CatBoost Classifier w/ Label Encoder + Natural Language Featurizer + Imputer + Select Columns Transform

In [13]:
automl.rankings

,id,pipeline_name,search_order,ranking_score,mean_cv_score,standard_deviation_cv_score,percent_better_than_baseline,high_variance_cv,parameters
0,1,Random Forest Classifier w/ Label Encoder + Na...,1,0.21247,0.21247,0.043953,98.550459,False,"{'Label Encoder': {'positive_label': None}, 'I..."
1,167,Extra Trees Classifier w/ Label Encoder + Natu...,167,0.215811,0.215811,0.04034,98.527667,False,"{'Label Encoder': {'positive_label': None}, 'I..."
29,145,Random Forest Classifier w/ Label Encoder + Na...,145,0.229708,0.229708,0.063129,98.432854,False,"{'Label Encoder': {'positive_label': None}, 'I..."
74,2,Random Forest Classifier w/ Label Encoder + Na...,2,0.248763,0.248763,0.056686,98.302858,False,"{'Label Encoder': {'positive_label': None}, 'I..."
88,168,XGBoost Classifier w/ Label Encoder + Natural ...,168,0.26792,0.26792,0.034921,98.172164,False,"{'Label Encoder': {'positive_label': None}, 'I..."
152,4,LightGBM Classifier w/ Label Encoder + Natural...,4,0.322042,0.322042,0.153554,97.802923,False,"{'Label Encoder': {'positive_label': None}, 'I..."
169,6,Elastic Net Classifier w/ Label Encoder + Natu...,6,0.369001,0.369001,0.094933,97.482555,False,"{'Label Encoder': {'positive_label': None}, 'I..."
170,9,Logistic Regression Classifier w/ Label Encode...,9,0.369105,0.369105,0.095089,97.481842,False,"{'Label Encoder': {'positive_label': None}, 'I..."
187,7,CatBoost Classifier w/ Label Encoder + Natural...,7,0.577829,0.577829,0.004883,96.057859,False,"{'Label Encoder': {'positive_label': None}, 'I..."
188,3,Decision Tree Classifier w/ Label Encoder + Na...,3,3.187738,3.187738,1.18646,78.252202,True,"{'Label Encoder': {'positive_label': None}, 'I..."
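The `percent_better_than_baseline` column can be sanity-checked by hand. The sketch below assumes the column is the relative improvement over the baseline's mean CV score for a lower-is-better objective. That formula is an assumption, not taken from EvalML's source, but it reproduces the table values to within the rounding of the baseline score in the log.

```python
def percent_better_than_baseline(score, baseline_score):
    """Relative improvement over the baseline for a lower-is-better objective."""
    return 100 * (baseline_score - score) / baseline_score


# Baseline mean log loss from the search log (rounded to three decimals there).
baseline = 14.658

# mean_cv_score values copied from the rankings table above.
mean_cv_scores = {
    "Random Forest (id 1)": 0.21247,
    "XGBoost (id 168)": 0.26792,
    "Decision Tree (id 3)": 3.187738,
}

improvements = {
    name: percent_better_than_baseline(score, baseline)
    for name, score in mean_cv_scores.items()
}

for name, value in improvements.items():
    print(f"{name}: {value:.2f}%")
```

Because the baseline log loss is so large, even the weakest pipeline in the table shows a big percentage gain, so the raw `mean_cv_score` is the more informative column for comparing pipelines with each other.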


In [17]:
best_pipeline = automl.best_pipeline

In [18]:
automl.describe_pipeline(automl.rankings.iloc[0]["id"])


*************************************************************************************
* Random Forest Classifier w/ Label Encoder + Natural Language Featurizer + Imputer *
*************************************************************************************

Problem Type: binary
Model Family: Random Forest

Pipeline Steps
1. Label Encoder
	 * positive_label : None
2. Natural Language Featurizer
3. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : mean
	 * boolean_impute_strategy : most_frequent
	 * categorical_fill_value : None
	 * numeric_fill_value : None
	 * boolean_fill_value : None
4. Random Forest Classifier
	 * n_estimators : 100
	 * max_depth : 6
	 * n_jobs : -1

Training
Training for binary problems.
Total training time (including CV): 5.1 seconds

Cross Validation
----------------
             Log Loss Binary  MCC Binary  Gini   AUC  Precision    F1  Balanced Accuracy Binary  Accuracy Binary # Training # Validation
0                      0

#### Score on holdout data

In [19]:
scores = best_pipeline.score(
    X_holdout, y_holdout, objectives=evalml.objectives.get_ranking_objectives("binary")
)
print(f'Accuracy Binary: {scores["Accuracy Binary"]}')
print(f"All scores: {scores}")

Accuracy Binary: 0.9333333333333333
All scores: OrderedDict([('MCC Binary', 0.861853011604347), ('Log Loss Binary', 0.18123361648635386), ('Gini', 0.9734757782280345), ('AUC', 0.9867378891140173), ('Recall', 0.9438202247191011), ('Precision', 0.9438202247191011), ('F1', 0.9438202247191011), ('Balanced Accuracy Binary', 0.9309265058021735), ('Accuracy Binary', 0.9333333333333333)])
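The threshold-based scores above are all functions of one confusion matrix. As a rough check, the sketch below uses a set of counts that is consistent with the reported values on a 150-row holdout set. The counts are inferred from the scores, with `spam` assumed to be the positive class; the notebook does not print them. The Gini line assumes the usual relation Gini = 2 * AUC - 1, which matches the two reported numbers.

```python
import math

# Hypothetical confusion-matrix counts, inferred from the scores above
# ("spam" assumed to be the positive class); not printed by the notebook.
tp, fn = 84, 5   # spam messages: caught / missed
fp, tn = 5, 56   # ham messages: flagged as spam / passed through

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)
balanced_accuracy = (recall + specificity) / 2
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

# Gini as a rescaling of AUC (AUC value copied from the output above).
auc = 0.9867378891140173
gini = 2 * auc - 1

for name, value in [
    ("Accuracy Binary", accuracy),
    ("Precision", precision),
    ("Recall", recall),
    ("F1", f1),
    ("Balanced Accuracy Binary", balanced_accuracy),
    ("MCC Binary", mcc),
    ("Gini", gini),
]:
    print(f"{name}: {value:.6f}")
```

Precision and recall coincide here only because the inferred counts of false positives and false negatives happen to be equal.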


In [23]:
y_pred = best_pipeline.predict(X_holdout)
graph_confusion_matrix(y_holdout, y_pred)

In [16]:
best_pipeline.graph()

RuntimeError: To graph pipelines, a graphviz backend is required.
Install the backend using one of the following commands:
  Mac OS: brew install graphviz
  Linux (Ubuntu): sudo apt-get install graphviz
  Windows: conda install python-graphviz


In [None]:
best_pipeline.input_feature_names