## Hauptseminar AutoML

#### Used Framework: [EvalML](https://github.com/alteryx/evalml)

**Type:** Open Source  
**Contributors:** https://github.com/alteryx/evalml/graphs/contributors  
**Publications:**
  * No direct scientific publications
  * Documentation: [EvalML](https://evalml.alteryx.com/en/stable/index.html)
  * Blog Post: [Automate NLP Tasks using EvalML Library](https://www.analyticsvidhya.com/blog/2021/04/automate-nlp-tasks-using-evalml-library/)
  * Blog Post: [Why Alteryx’s EvalML is one of the best AutoML I’ve ever used](https://www.linkedin.com/pulse/why-alteryxs-evalml-one-best-automl-ive-ever-used-abhijit-singh)
  * Blog Post: [Automate your ML Pipelines with EvalML](https://machinelearning.piyasaa.com/automate-your-ml-pipelines-with-evalml/)
  * Blog Post: [Using Text Data in EvalML with Woodwork](https://www.vebuso.com/2021/05/using-text-data-in-evalml-with-woodwork/)  

**Software type:** Software library  
**Usage expertise:** Data/computer scientist  
**ML library:** Unknown  
**ML tasks:** (see [docs](https://evalml.alteryx.com/en/latest/autoapi/evalml/problem_types/index.html#evalml.problem_types.ProblemTypes))  
  * Binary classification
  * Multiclass classification
  * Regression
  * Time series binary classification
  * Time series multiclass classification
  * Time series regression

**ML approaches:** (see [docs](https://evalml.alteryx.com/en/stable/autoapi/evalml/pipelines/components/estimators/))  
  * CatBoost (Classifier / Regressor)
  * Decision Tree (Classifier / Regressor)
  * Elastic Net (Classifier / Regressor)
  * Extra Trees (Classifier / Regressor)
  * K-Nearest Neighbors Classifier
  * LightGBM (Classifier / Regressor)
  * Logistic Regression Classifier
  * Random Forest (Classifier / Regressor)
  * Support Vector Machine (Classifier / Regressor)
  * Vowpal Wabbit (Binary Classifier / Multiclass Classifier / Regressor)
  * XGBoost (Classifier / Regressor)
  * Linear Regressor
  * Prophet Regressor
  * ARIMA (Autoregressive Integrated Moving Average) Regressor

### Imports & Data Download

In [None]:
!pip install evalml

In [2]:
# Download the train and test datasets from GitHub (not needed if the CSV files are already present)
# !wget https://raw.githubusercontent.com/hochschule-darmstadt/AutoML_Hauptseminar/main/Hauptseminar-ATM/Scheppat/phishing_train.csv
# !wget https://raw.githubusercontent.com/hochschule-darmstadt/AutoML_Hauptseminar/main/Hauptseminar-ATM/Scheppat/phishing_test.csv

# !wget https://raw.githubusercontent.com/hochschule-darmstadt/AutoML_Hauptseminar/main/Hauptseminar-ATM/Scheppat/college_train.csv
# !wget https://raw.githubusercontent.com/hochschule-darmstadt/AutoML_Hauptseminar/main/Hauptseminar-ATM/Scheppat/college_test.csv

In [3]:
import pandas as pd
import numpy as np

### Classification (Phishing Dataset)

In [4]:
train = pd.read_csv('phishing_train.csv')
test = pd.read_csv('phishing_test.csv')

train.head()

Unnamed: 0,having_IP_Address,URL_Length,Shortining_Service,having_At_Symbol,double_slash_redirecting,Prefix_Suffix,having_Sub_Domain,SSLfinal_State,Domain_registeration_length,Favicon,port,HTTPS_token,Request_URL,URL_of_Anchor,Links_in_tags,SFH,Submitting_to_email,Abnormal_URL,Redirect,on_mouseover,RightClick,popUpWidnow,Iframe,age_of_domain,DNSRecord,web_traffic,Page_Rank,Google_Index,Links_pointing_to_page,Statistical_report,Result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,1,-1,1,-1,1,-1,-1,-1,0,1,1,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,1,-1,1,0,-1,-1,1,1,0,1,1,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,1,-1,1,0,-1,-1,-1,-1,0,1,1,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,1,-1,-1,0,0,-1,1,1,0,1,1,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,1,1,1,0,0,-1,1,1,0,-1,1,-1,1,-1,-1,0,-1,1,1,1,1


In [5]:
target_name = "Result"
X_train = train[train.columns.difference([target_name])]
y_train = train[target_name]
X_test = test[test.columns.difference([target_name])]
y_test = test[target_name]
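
A side effect of the split above is worth knowing: `Index.difference` returns the remaining column labels in sorted order, so the feature columns end up alphabetically ordered rather than in their CSV order. The sketch below illustrates this on a hypothetical toy frame (only the column name `Result` comes from the notebook; `a_feature` and `b_feature` are made up) and contrasts it with `DataFrame.drop`, which keeps the original order.

```python
import pandas as pd

# Hypothetical toy frame; only "Result" mirrors a column name from the notebook.
toy = pd.DataFrame({"b_feature": [1, -1], "Result": [1, -1], "a_feature": [0, 1]})
target_name = "Result"

# Index.difference returns the remaining labels sorted by default.
X_sorted = toy[toy.columns.difference([target_name])]

# DataFrame.drop keeps the columns in their original order.
X_original = toy.drop(columns=[target_name])

print(list(X_sorted.columns))    # ['a_feature', 'b_feature']
print(list(X_original.columns))  # ['b_feature', 'a_feature']
```

Either variant works here as long as the train and test frames are split the same way, since both then share the same column order.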

In [6]:
import evalml
from evalml.automl import AutoMLSearch
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type='binary',
    objective='F1',
    max_time=600,  # seconds
    verbose=True,
)

Generating pipelines to search over...
8 pipelines ready for search.


In [7]:
automl.search()


*****************************
* Beginning pipeline search *
*****************************

Optimizing for F1. 
Greater score is better.

Using SequentialEngine to train and score pipelines.
Will stop searching for new pipelines after 600 seconds.

Allowed model families: linear_model, linear_model, xgboost, lightgbm, catboost, random_forest, decision_tree, extra_trees



[Interactive plot (FigureWidget): 'Best Score' trace, lines+markers]

Evaluating Baseline Pipeline: Mode Baseline Binary Classification Pipeline
Mode Baseline Binary Classification Pipeline:
	Starting cross validation
	Finished cross validation - mean F1: 0.000

*****************************
* Evaluating Batch Number 1 *
*****************************

Elastic Net Classifier w/ Label Encoder + Imputer + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean F1: 0.916
Logistic Regression Classifier w/ Label Encoder + Imputer + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean F1: 0.916
XGBoost Classifier w/ Label Encoder + Imputer:
	Starting cross validation
	Finished cross validation - mean F1: 0.957
LightGBM Classifier w/ Label Encoder + Imputer:
	Starting cross validation
	Finished cross validation - mean F1: 0.961
CatBoost Classifier w/ Label Encoder + Imputer:
	Starting cross validation
	Finished cross validation - mean F1: 0.919
Random Forest Classifier w/ Label Encoder + Imputer:
	Starting cross val

In [8]:
# View rankings of all searched pipelines
automl.rankings

|   | id | pipeline_name | search_order | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | CatBoost Classifier w/ Label Encoder + Imputer | 33 | 0.965125 | 0.003571 | 0.965157 | 96.512544 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 1 | 51 | LightGBM Classifier w/ Label Encoder + Imputer | 51 | 0.963895 | 0.004117 | 0.96081 | 96.389546 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 2 | 16 | XGBoost Classifier w/ Label Encoder + Imputer | 16 | 0.963677 | 0.004322 | 0.962256 | 96.367736 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 22 | 61 | Random Forest Classifier w/ Label Encoder + Im... | 61 | 0.951402 | 0.003253 | 0.952715 | 95.140162 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 29 | 64 | Extra Trees Classifier w/ Label Encoder + Imputer | 64 | 0.946808 | 0.006813 | 0.952967 | 94.680761 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 64 | 1 | Elastic Net Classifier w/ Label Encoder + Impu... | 1 | 0.916161 | 0.00881 | 0.926169 | 91.616054 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 66 | 2 | Logistic Regression Classifier w/ Label Encode... | 2 | 0.915655 | 0.009011 | 0.925634 | 91.565496 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 71 | 88 | Decision Tree Classifier w/ Label Encoder + Im... | 88 | 0.913953 | 0.003902 | 0.914959 | 91.395284 | False | {'Imputer': {'categorical_impute_strategy': 'm... |
| 107 | 0 | Mode Baseline Binary Classification Pipeline | 0 | 0.0 | 0.0 | 0.0 | 0.0 | False | {'Baseline Classifier': {'strategy': 'mode'}} |


In [9]:
# Detailed description of the best pipeline
automl.best_pipeline.describe()


**************************************************
* CatBoost Classifier w/ Label Encoder + Imputer *
**************************************************

Problem Type: binary
Model Family: CatBoost
Number of features: 30

Pipeline Steps
1. Label Encoder
2. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : most_frequent
	 * categorical_fill_value : None
	 * numeric_fill_value : None
3. CatBoost Classifier
	 * n_estimators : 51
	 * eta : 0.8700873882711782
	 * max_depth : 7
	 * bootstrap_type : None
	 * silent : True
	 * allow_writing_files : False
	 * n_jobs : -1


In [10]:
# Output raw results
automl.results

{'pipeline_results': {0: {'cv_data': [{'all_objective_scores': OrderedDict([('F1',
                   0.0),
                  ('MCC Binary', 0.0),
                  ('Log Loss Binary', 15.334198903517928),
                  ('Gini', 0.0),
                  ('AUC', 0.5),
                  ('Precision', 0.0),
                  ('Balanced Accuracy Binary', 0.5),
                  ('Accuracy Binary', 0.5560294687863513),
                  ('# Training', 5158),
                  ('# Validation', 2579)]),
     'binary_classification_threshold': 9.16384630183206e-53,
     'mean_cv_score': 0.0},
    {'all_objective_scores': OrderedDict([('F1', 0.0),
                  ('MCC Binary', 0.0),
                  ('Log Loss Binary', 15.32080659006507),
                  ('Gini', 0.0),
                  ('AUC', 0.5),
                  ('Precision', 0.0),
                  ('Balanced Accuracy Binary', 0.5),
                  ('Accuracy Binary', 0.5564172159751842),
                  ('# Training', 5158)
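
The `# Training` and `# Validation` counts in the raw results (5158 and 2579) are consistent with splitting the 7737 training rows into three equally sized cross-validation folds. The number of folds is an inference from these counts, not something the output states. A minimal sketch of that arithmetic, using a hypothetical helper `kfold_sizes`:

```python
def kfold_sizes(n_samples, n_splits):
    """Return (training, validation) sizes per fold, larger folds first."""
    base, extra = divmod(n_samples, n_splits)
    sizes = []
    for fold in range(n_splits):
        # The first `extra` folds receive one additional validation row.
        n_val = base + (1 if fold < extra else 0)
        sizes.append((n_samples - n_val, n_val))
    return sizes

# 5158 training + 2579 validation rows, as reported in the raw results
n_rows = 5158 + 2579
print(kfold_sizes(n_rows, 3))  # [(5158, 2579), (5158, 2579), (5158, 2579)]
```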

In [None]:
# Test best pipeline predictions
pipeline = automl.best_pipeline
# Refitting is optional: the search already returns the best pipeline fitted (see "train_best_pipeline")
pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
pred

In [12]:
# Output scores of the best pipeline
from evalml.objectives import F1
objective = F1()
scores = automl.best_pipeline.score(X_test, y_test, objectives=[objective]) 
print(f'F1 score of best pipeline ({automl.best_pipeline.name}): {scores["F1"]}')
print(scores)

F1 score of best pipeline (CatBoost Classifier w/ Label Encoder + Imputer): 0.90625
OrderedDict([('F1', 0.90625)])
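
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts are hypothetical and are not taken from the phishing data. It also shows why a baseline that always predicts the negative class scores an F1 of 0, which matches the `Mode Baseline` result in the search log (precision 0.0, F1 0.0), assuming the majority class is the negative one.

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical classifier: 3 true positives, 1 false positive, 1 false negative
print(f1_score(tp=3, fp=1, fn=1))  # 0.75

# Hypothetical baseline that never predicts the positive class
print(f1_score(tp=0, fp=0, fn=4))  # 0.0
```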


In [13]:
# Save pickled version of best pipeline to a file
automl.best_pipeline.save("best-classification-model.pkl")

### Regression (College Dataset)

In [14]:
train = pd.read_csv('college_train.csv')
test = pd.read_csv('college_test.csv')

train.head()

Unnamed: 0,UNITID,school_name,city,state,zip,school_webpage,latitude,longitude,admission_rate,sat_verbal_midrange,sat_math_midrange,sat_writing_midrange,act_combined_midrange,act_english_midrange,act_math_midrange,act_writing_midrange,sat_total_average,undergrad_size,percent_white,percent_black,percent_hispanic,percent_asian,percent_part_time,average_cost_academic_year,average_cost_program_year,tuition_(instate),tuition_(out_of_state),spend_per_student,faculty_salary,percent_part_time_faculty,percent_pell_grant,completion_rate,predominant_degree,highest_degree,ownership,region,gender,carnegie_basic_classification,carnegie_undergraduate,carnegie_size,religious_affiliation,percent_female,agege24,faminc,mean_earnings_6_years,median_earnings_6_years,mean_earnings_10_years,median_earnings_10_years
0,100654,Alabama A & M University,Normal,AL,35762,www.aamu.edu/,34.7834,-86.5685,0.8989,410.0,400.0,?,17.0,17.0,17.0,?,823.0,4051.0,0.0279,0.9501,0.0089,0.0022,0.0622,18888.0,?,7182.0,12774.0,7459.0,7079.0,0.8856,0.7115,0.2914,Bachelors,Graduate,Public,"Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,...",COED,Master\s Colleges and Universities (larger pro...,"Full-time four-year, inclusive","Medium 4-year, highly residential (3,000 to 9,...",?,0.52999997138977,0.07999999821186,40211.22,26100.0,22800.0,35300.0,31400.0
1,100663,University of Alabama at Birmingham,Birmingham,AL,35294-0110,www.uab.edu,33.5022,-86.8092,0.8673,580.0,585.0,?,25.0,26.0,23.0,?,1146.0,11200.0,0.5987,0.259,0.0258,0.0518,0.2579,19990.0,?,7206.0,16398.0,17208.0,10170.0,0.9106,0.3505,0.5377,Bachelors,Graduate,Public,"Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,...",COED,Research Universities (very high research acti...,"Medium full-time four-year, selective, higher ...","Large 4-year, primarily nonresidential (over 9...",?,0.64999997615814,0.25999999046325,49894.65,37400.0,33200.0,46300.0,40300.0
2,100690,Amridge University,Montgomery,AL,36117-3553,www.amridgeuniversity.edu,32.3626,-86.17399999999999,?,?,?,?,?,?,?,?,?,322.0,0.2919,0.4224,0.0093,0.0031,0.3727,12300.0,?,6870.0,6870.0,5123.0,3849.0,0.6721,0.6839,0.6667,Bachelors,Graduate,Private nonprofit,"Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,...",COED,Baccalaureate Colleges--Arts & Sciences,"Medium full-time four-year, inclusivestudents ...","Very small 4-year, primarily nonresidential (l...",Churches of Christ,0.50999999046325,0.82999998331069,38712.18,38500.0,32800.0,42100.0,38100.0
3,100706,University of Alabama in Huntsville,Huntsville,AL,35899,www.uah.edu,34.7228,-86.6384,0.8062,575.0,580.0,?,26.0,26.0,25.0,?,1180.0,5525.0,0.7012,0.131,0.0338,0.0364,0.2395,20306.0,?,9192.0,21506.0,9352.0,9341.0,0.6555,0.3281,0.4835,Bachelors,Graduate,Public,"Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,...",COED,Research Universities (very high research acti...,"Medium full-time four-year, selective, higher ...","Medium 4-year, primarily nonresidential (3,000...",?,0.55000001192092,0.28999999165534,54155.4,39300.0,36700.0,52700.0,46600.0
4,100724,Alabama State University,Montgomery,AL,36104-0271,www.alasu.edu/email/index.aspx,32.3643,-86.2957,0.5125,430.0,425.0,?,17.0,17.0,17.0,?,830.0,5354.0,0.0161,0.9285,0.0114,0.0015,0.0902,17400.0,?,8720.0,15656.0,7393.0,6557.0,0.6641,0.8265,0.2517,Bachelors,Graduate,Public,"Southeast (AL, AR, FL, GA, KY, LA, MS, NC, SC,...",COED,Master\s Colleges and Universities (larger pro...,"Full-time four-year, inclusive","Medium 4-year, primarily residential (3,000 to...",?,0.56999999284744,0.10999999940395,31846.99,21200.0,19300.0,30700.0,27800.0


In [15]:
target_name = "percent_pell_grant"
X_train = train[train.columns.difference([target_name])]
y_train = train[target_name]
X_test = test[test.columns.difference([target_name])]
y_test = test[target_name]

In [16]:
import evalml
from evalml.automl import AutoMLSearch
automl = AutoMLSearch(
    X_train=X_train,
    y_train=y_train,
    problem_type='regression',
    objective='Root Mean Squared Error',
    max_time=600,  # seconds
    verbose=True,
)

Removing columns ['admission_rate', 'average_cost_academic_year', 'average_cost_program_year', 'city', 'completion_rate', 'faculty_salary', 'faminc', 'latitude', 'longitude', 'percent_asian', 'percent_black', 'percent_hispanic', 'percent_part_time', 'percent_part_time_faculty', 'percent_white', 'school_name', 'school_webpage', 'spend_per_student', 'tuition_(instate)', 'tuition_(out_of_state)', 'undergrad_size', 'zip'] because they are of 'Unknown' type
Generating pipelines to search over...
7 pipelines ready for search.


In [17]:
automl.search()


*****************************
* Beginning pipeline search *
*****************************

Optimizing for Root Mean Squared Error. 
Lower score is better.

Using SequentialEngine to train and score pipelines.
Will stop searching for new pipelines after 600 seconds.

Allowed model families: linear_model, xgboost, lightgbm, catboost, random_forest, decision_tree, extra_trees



[Interactive plot (FigureWidget): 'Best Score' trace, lines+markers]

Evaluating Baseline Pipeline: Mean Baseline Regression Pipeline
Mean Baseline Regression Pipeline:
	Starting cross validation
	Finished cross validation - mean Root Mean Squared Error: 0.215

*****************************
* Evaluating Batch Number 1 *
*****************************




Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.8716078111937264, tolerance: 0.015302576540210867


Objective did not converge. You might want to increase the number of iterations. Duality gap: 4.18688710763999, tolerance: 0.015322747089503642


Objective did not converge. You might want to increase the number of iterations. Duality gap: 0.6809749824494133, tolerance: 0.014959480837557946



Elastic Net Regressor w/ Drop Columns Transformer + Imputer + One Hot Encoder + Standard Scaler:
	Starting cross validation
	Finished cross validation - mean Root Mean Squared Error: 0.158
XGBoost Regressor w/ Drop Columns Transformer + Imputer + One Hot Encoder:
	Starting cross validation
	Finished cross validation - mean Root Mean Squared Error: 0.151
LightGBM Regressor w/ Drop Columns Transformer + Imputer + One Hot Encoder:
	Starting cross validation
	Finished cross validation - mean Root Mean Squared Error: 0.155
CatBoost Regressor w/ Drop Columns Transformer + Imputer:
	Starting cross validation
	Finished cross validation - mean Root Mean Squared Error: 0.195
Random Forest Regressor w/ Drop Columns Transformer + Imputer + One Hot Encoder:
	Starting cross validation
	Finished cross validation - mean Root Mean Squared Error: 0.161
Decision Tree Regressor w/ Drop Columns Transformer + Imputer + One Hot Encoder:
	Starting cross validation
	Finished cross validation - mean Root Mean S

In [18]:
# View rankings of all searched pipelines
automl.rankings

|   | id | pipeline_name | search_order | mean_cv_score | standard_deviation_cv_score | validation_score | percent_better_than_baseline | high_variance_cv | parameters |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | XGBoost Regressor w/ Drop Columns Transformer ... | 2 | 0.151084 | 0.000956 | 0.151869 | 29.638823 | False | {'Drop Columns Transformer': {'columns': ['adm... |
| 1 | 14 | LightGBM Regressor w/ Drop Columns Transformer... | 14 | 0.153322 | 0.001673 | 0.153324 | 28.596544 | False | {'Drop Columns Transformer': {'columns': ['adm... |
| 2 | 25 | Random Forest Regressor w/ Drop Columns Transf... | 25 | 0.154288 | 0.001239 | 0.152984 | 28.146741 | False | {'Drop Columns Transformer': {'columns': ['adm... |
| 4 | 1 | Elastic Net Regressor w/ Drop Columns Transfor... | 1 | 0.158474 | 0.001542 | 0.157008 | 26.197171 | False | {'Drop Columns Transformer': {'columns': ['adm... |
| 9 | 7 | Extra Trees Regressor w/ Drop Columns Transfor... | 7 | 0.165679 | 0.003291 | 0.165427 | 22.841712 | False | {'Drop Columns Transformer': {'columns': ['adm... |
| 10 | 6 | Decision Tree Regressor w/ Drop Columns Transf... | 6 | 0.165919 | 0.003599 | 0.167004 | 22.729825 | False | {'Drop Columns Transformer': {'columns': ['adm... |
| 20 | 4 | CatBoost Regressor w/ Drop Columns Transformer... | 4 | 0.195204 | 0.003289 | 0.193895 | 9.091765 | False | {'Drop Columns Transformer': {'columns': ['adm... |
| 25 | 0 | Mean Baseline Regression Pipeline | 0 | 0.214726 | 0.002863 | 0.213214 | 0.0 | False | {'Baseline Regressor': {'strategy': 'mean'}} |
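
The `percent_better_than_baseline` values in the regression rankings can be reproduced as the relative improvement of `mean_cv_score` over the baseline's `mean_cv_score`. The formula below is an assumption that happens to match the table; it is not taken from EvalML's source, and the helper name is hypothetical. The classification rankings, where the baseline score is 0, evidently follow a different convention.

```python
def percent_better_than_baseline(score, baseline, greater_is_better):
    """Relative improvement over the baseline score, in percent."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a baseline of 0")
    diff = score - baseline if greater_is_better else baseline - score
    return 100 * diff / abs(baseline)

baseline_rmse = 0.214726  # Mean Baseline Regression Pipeline

# RMSE is minimized, so a lower score counts as an improvement.
xgboost = percent_better_than_baseline(0.151084, baseline_rmse, greater_is_better=False)
catboost = percent_better_than_baseline(0.195204, baseline_rmse, greater_is_better=False)

print(round(xgboost, 2))   # 29.64
print(round(catboost, 2))  # 9.09
```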


In [19]:
# Detailed description of the best pipeline
automl.best_pipeline.describe()


*****************************************************************************
* XGBoost Regressor w/ Drop Columns Transformer + Imputer + One Hot Encoder *
*****************************************************************************

Problem Type: regression
Model Family: XGBoost
Number of features: 215

Pipeline Steps
1. Drop Columns Transformer
	 * columns : ['admission_rate', 'average_cost_academic_year', 'average_cost_program_year', 'city', 'completion_rate', 'faculty_salary', 'faminc', 'latitude', 'longitude', 'percent_asian', 'percent_black', 'percent_hispanic', 'percent_part_time', 'percent_part_time_faculty', 'percent_white', 'school_name', 'school_webpage', 'spend_per_student', 'tuition_(instate)', 'tuition_(out_of_state)', 'undergrad_size', 'zip']
2. Imputer
	 * categorical_impute_strategy : most_frequent
	 * numeric_impute_strategy : mean
	 * categorical_fill_value : None
	 * numeric_fill_value : None
3. One Hot Encoder
	 * top_n : 10
	 * features_to_encode : None
	 * cate

In [20]:
# Output raw results
automl.results

{'pipeline_results': {0: {'cv_data': [{'all_objective_scores': OrderedDict([('Root Mean Squared Error',
                   0.2132136220094272),
                  ('ExpVariance', 0.0),
                  ('MaxError', 0.5048702973300971),
                  ('MedianAE', 0.1555797026699029),
                  ('MSE', 0.0454600486103789),
                  ('MAE', 0.1756006692430955),
                  ('R2', -0.00040905320955508806),
                  ('# Training', 3296),
                  ('# Validation', 1648)]),
     'binary_classification_threshold': None,
     'mean_cv_score': 0.2132136220094272},
    {'all_objective_scores': OrderedDict([('Root Mean Squared Error',
                   0.21293724011328263),
                  ('ExpVariance', 0.0),
                  ('MaxError', 0.5082056432038845),
                  ('MedianAE', 0.1604056432038845),
                  ('MSE', 0.045342268227061784),
                  ('MAE', 0.17623045637842863),
                  ('R2', -0.00071571992730

In [None]:
# Test best pipeline predictions
pipeline = automl.best_pipeline
# Refitting is optional: the search already returns the best pipeline fitted (see "train_best_pipeline")
pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)
pred

In [22]:
# Output scores of the best pipeline
from evalml.objectives import RootMeanSquaredError
objective = RootMeanSquaredError()
scores = automl.best_pipeline.score(X_test, y_test, objectives=[objective]) 
print(f'RMSE score of best pipeline ({automl.best_pipeline.name}): {scores["Root Mean Squared Error"]}')
print(scores)

RMSE score of best pipeline (XGBoost Regressor w/ Drop Columns Transformer + Imputer + One Hot Encoder): 0.21902859610220876
OrderedDict([('Root Mean Squared Error', 0.21902859610220876)])
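
As a reminder of what the objective measures, RMSE is the square root of the mean squared difference between targets and predictions. The sketch below uses hypothetical target values to show that a predictor that always returns the mean has an RMSE equal to the population standard deviation of the target, which is why the mean baseline is a natural reference point. The last lines check that the `MSE` and `Root Mean Squared Error` values of the first baseline fold in the raw results are consistent with each other.

```python
import math
import statistics

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical target values, not taken from the college dataset
y_true = [0.2, 0.4, 0.6, 0.8]
mean_pred = [sum(y_true) / len(y_true)] * len(y_true)

# Always predicting the mean gives an RMSE equal to the population std. dev.
print(math.isclose(rmse(y_true, mean_pred), statistics.pstdev(y_true)))  # True

# MSE and RMSE of the first baseline fold, copied from the raw results
fold_mse = 0.0454600486103789
fold_rmse = 0.2132136220094272
print(abs(math.sqrt(fold_mse) - fold_rmse) < 1e-6)  # True
```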


In [23]:
# Save pickled version of best pipeline to a file
automl.best_pipeline.save("best-regression-model.pkl")