Licensed under the MIT License.

Copyright (c) 2021-2031. All rights reserved.

# Model Selection with MLJAR

* MLJAR AutoML parameters: https://supervised.mljar.com/api/
* Overview of AutoML modes: https://supervised.mljar.com/features/modes/
* Steps of the MLJAR AutoML pipeline: https://supervised.mljar.com/features/automl/

In [1]:
from supervised.automl import AutoML
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, balanced_accuracy_score

import warnings
warnings.filterwarnings('ignore')

### Regression

In [2]:
df = pd.read_pickle('../luigi_pipeline/output/preprocessed_data.pkl')
print(df.shape)

# drop categorical features, only keep numerical features
cat_cols = [col for col in df.select_dtypes(include='category').columns if col != 'Year']
df.drop(cat_cols, axis=1, inplace=True)

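# Time-based holdout: train on years before 2015, test on 2015.
# The string comparison is lexicographic, so it assumes every Year is a four-digit string.
train_df = df.loc[df['Year'].astype(str) < '2015']
test_df = df.loc[df['Year'].astype(str) == '2015']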

y_train, y_test = train_df['Sales'], test_df['Sales']
X_train, X_test = train_df.drop(['Sales', 'Date', 'Year'], axis=1), test_df.drop(['Sales', 'Date', 'Year'], axis=1)

X_train.reset_index(inplace=True, drop=True)
X_test.reset_index(inplace=True, drop=True)
y_train.reset_index(inplace=True, drop=True)
y_test.reset_index(inplace=True, drop=True)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
X_train.head()

(693861, 22)
(532529, 3) (161332, 3) (532529,) (161332,)


Unnamed: 0,Customers_larger_than_3000,CompetitionDistance,Customers
0,0.0,1270.0,327
1,0.0,1270.0,703
2,0.0,1270.0,700
3,0.0,1270.0,0
4,0.0,1270.0,684


In [4]:
# No total_time_limit is given, so the default budget of 3600 seconds (1 hour) applies.
automl = AutoML(mode="Compete", eval_metric='r2', explain_level=2, random_state=10,
                results_path='mljar_regression',
                validation_strategy={
                    "validation_type": "kfold",
                    "k_folds": 10,
                    "shuffle": True,
                    "stratify": True,
                    "random_seed": 10
                })
automl.fit(X_train, y_train)

Linear algorithm was disabled.
AutoML directory: mljar_regression
The task is regression with evaluation metric r2
AutoML will use algorithms: ['Decision Tree', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step simple_algorithms will try to check up to 3 models
1_DecisionTree r2 0.847607 trained in 46.42 seconds
2_DecisionTree r2 0.747035 trained in 23.12 seconds
3_DecisionTree r2 0.824 trained in 30.68 seconds
* Step default_algorithms will try to check up to 6 models
4_Default_LightGBM r2 0.942309 trained in 2730.99 seconds
Skip not_so_random because of the time limit.
Skip golden_features because of the ti

AutoML(eval_metric='r2', explain_level=2, mode='Compete', random_state=10,
       results_path='mljar_regression',
       validation_strategy={'k_folds': 10, 'random_seed': 10, 'shuffle': True,
                            'stratify': True, 'validation_type': 'kfold'})
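
The `validation_strategy` above requests shuffled 10-fold cross-validation. As a rough, library-independent sketch of what a shuffled k-fold split does, the snippet below builds the folds by hand. It is not MLJAR's implementation: `kfold_indices` is a hypothetical helper, the sample count is made up, and stratification is ignored.

```python
import random

def kfold_indices(n_samples, k_folds, seed):
    """Yield (train_idx, valid_idx) pairs for a shuffled k-fold split.

    Hypothetical helper for illustration; no stratification is applied.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute samples so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k_folds + (1 if i < n_samples % k_folds else 0)
                  for i in range(k_folds)]
    start = 0
    for size in fold_sizes:
        valid_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, valid_idx
        start += size

# Toy example: 25 samples split into 10 folds.
folds = list(kfold_indices(n_samples=25, k_folds=10, seed=10))
for train_idx, valid_idx in folds[:3]:
    print(len(train_idx), len(valid_idx))
```

Each sample lands in exactly one validation fold, so every model is scored on data it did not see during that fold's training.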

In [5]:
y_pred = automl.predict(X_test)
print("Test R2:", r2_score(y_test, y_pred))

Test R2: 0.9377363242318557
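
For reference, `r2_score` reports the coefficient of determination, 1 - SS_res / SS_tot. The sketch below recomputes it by hand on made-up numbers (not the Sales data); `r2_manual` is a hypothetical helper, not part of scikit-learn.

```python
def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true_toy = [3.0, 5.0, 7.0, 9.0]
perfect = r2_manual(y_true_toy, y_true_toy)   # predictions equal the targets
baseline = r2_manual(y_true_toy, [6.0] * 4)   # always predict the mean of y_true_toy
print(perfect, baseline)
```

A perfect prediction scores 1 and a constant prediction of the target mean scores 0, so the test score above sits close to the upper end of that range.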


In [6]:
# Reload the trained AutoML from results_path; it should reproduce the score above
loaded_automl = AutoML(results_path='mljar_regression')
y_pred = loaded_automl.predict(X_test)
print("Test R2:", r2_score(y_test, y_pred))

Test R2: 0.9377363242318557


### Classification
#### 1 hour default time limit

In [18]:
df30 = pd.read_csv('../../crystal_ball/data_collector/structured_data/leaf.csv')

y30 = df30['species']
X30 = df30.drop('species', axis=1)

X_train30, X_test30, y_train30, y_test30 = train_test_split(X30, y30, test_size=0.2,
                                               random_state=10, shuffle=True, stratify=y30)

X_train30.reset_index(inplace=True, drop=True)
X_test30.reset_index(inplace=True, drop=True)
y_train30.reset_index(inplace=True, drop=True)
y_test30.reset_index(inplace=True, drop=True)

print(X_train30.shape, X_test30.shape, y_train30.shape, y_test30.shape)
print(y_train30.nunique(), y_test30.nunique())
X_train30.head()

(272, 15) (68, 15) (272,) (68,)
30 30


Unnamed: 0,specimen_number,eccentricity,aspect_ratio,elongation,solidity,stochastic_convexity,isoperimetric_factor,maximal_indentation_depth,lobedness,average_intensity,average_contrast,smoothness,third_moment,uniformity,entropy
0,6,0.55977,1.3442,0.34301,0.9298,0.97544,0.57879,0.053564,0.52218,0.14905,0.25543,0.061249,0.02381,0.000597,2.413
1,2,0.87024,2.1094,0.52863,0.9836,0.99298,0.60784,0.003174,0.001833,0.026902,0.091391,0.008283,0.002439,0.000161,0.73904
2,6,0.63965,1.2323,0.60663,0.77037,0.62105,0.24135,0.12438,2.8155,0.025438,0.096215,0.009172,0.003421,5.2e-05,0.75194
3,10,0.39606,1.1647,0.29415,0.94064,0.99298,0.5486,0.025244,0.11598,0.051625,0.12014,0.014228,0.003721,0.000382,1.4943
4,6,0.37522,1.1417,0.81725,0.68511,0.58772,0.12523,0.09186,1.5358,0.11488,0.20861,0.041703,0.013344,0.00082,2.0281


In [26]:
# No total_time_limit is given, so the default budget of 3600 seconds (1 hour) applies.
automl_classification = AutoML(mode="Compete", eval_metric='logloss', ml_task='multiclass_classification',
                               explain_level=2, random_state=10,
                               results_path='mljar_classification',
                               validation_strategy={
                                   "validation_type": "kfold",
                                   "k_folds": 5,
                                   "shuffle": True,
                                   "stratify": True,
                                   "random_seed": 10
                               })
automl_classification.fit(X_train30, y_train30)

AutoML directory: mljar_classification
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree logloss 2.420592 trained in 186.64 seconds
2_DecisionTree logloss 2.869901 trained in 191.42 seconds
3_DecisionTree logloss 1.901518 trained in 233.82 seconds
4_Linear logloss 1.110228 trained in 280.24 seconds
Skip default_algorithms because of the time limit.
* Step not_so_random will try to check up to 63 models
14_LightGBM logloss

AutoML(eval_metric='logloss', explain_level=2,
       ml_task='multiclass_classification', mode='Compete', random_state=10,
       results_path='mljar_classification',
       validation_strategy={'k_folds': 5, 'random_seed': 10, 'shuffle': True,
                            'stratify': True, 'validation_type': 'kfold'})
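
To put the logloss values in the log above on a scale: multiclass logloss is the mean negative log-probability assigned to the true class, so a model that spreads its probability evenly over the 30 species scores ln(30). The sketch below checks this on synthetic labels; `multiclass_logloss` is a hypothetical helper, not the function MLJAR uses internally.

```python
import math

def multiclass_logloss(true_labels, predicted_probs):
    """Mean negative log-probability assigned to the true class."""
    eps = 1e-15  # clip probabilities to avoid log(0)
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)

n_classes = 30
labels = list(range(n_classes))  # one synthetic sample per class
uniform = [{c: 1.0 / n_classes for c in labels} for _ in labels]
uniform_loss = multiclass_logloss(labels, uniform)
print(round(uniform_loss, 4))  # ln(30), about 3.4012
```

Against that uniform baseline of roughly 3.40, the decision trees (about 1.9 to 2.9) are only a modest improvement, while the linear model (about 1.11) is clearly better.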

In [28]:
y_pred30 = automl_classification.predict(X_test30)
print("Test Balanced Accuracy:", balanced_accuracy_score(y_test30, y_pred30))

Test Balanced Accuracy: 0.8444444444444444
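
Balanced accuracy is the average of per-class recall, which keeps frequent classes from dominating the score. The sketch below recomputes it by hand on made-up labels (not the leaf data); `balanced_accuracy_manual` is a hypothetical helper, not part of scikit-learn.

```python
from collections import defaultdict

def balanced_accuracy_manual(y_true, y_pred):
    """Average of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

y_true_toy = ["a", "a", "a", "a", "b", "b"]
y_pred_toy = ["a", "a", "a", "b", "b", "a"]
score = balanced_accuracy_manual(y_true_toy, y_pred_toy)
print(score)  # recall("a") = 0.75, recall("b") = 0.5, mean = 0.625
```

Plain accuracy on the same toy labels is 4/6, which is higher because the majority class "a" is predicted better than "b".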


#### 6 minute time limit

In [30]:
# total_time_limit=360 caps the whole search at 6 minutes.
automl_classification_6m = AutoML(mode="Compete", eval_metric='logloss', ml_task='multiclass_classification',
                                  total_time_limit=360,
                                  explain_level=2, random_state=10,
                                  results_path='mljar_classification_6m',
                                  validation_strategy={
                                      "validation_type": "kfold",
                                      "k_folds": 5,
                                      "shuffle": True,
                                      "stratify": True,
                                      "random_seed": 10
                                  })
automl_classification_6m.fit(X_train30, y_train30)

AutoML directory: mljar_classification_6m
The task is multiclass_classification with evaluation metric logloss
AutoML will use algorithms: ['Decision Tree', 'Linear', 'Random Forest', 'Extra Trees', 'LightGBM', 'Xgboost', 'CatBoost', 'Neural Network', 'Nearest Neighbors']
AutoML will stack models
AutoML will ensemble available models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'kmeans_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'boost_on_errors', 'ensemble', 'stack', 'ensemble_stacked']
* Step simple_algorithms will try to check up to 4 models
1_DecisionTree logloss 2.420592 trained in 253.09 seconds
4_Linear logloss 1.110228 trained in 239.96 seconds
Skip default_algorithms because of the time limit.
Skip not_so_random because of the time limit.
Skip golden_features because no parameters were generated.
'score' Traceback (most recent call last):
  File "C:\Users\wuhan\anaconda3\lib\si

AutoML(eval_metric='logloss', explain_level=2,
       ml_task='multiclass_classification', mode='Compete', random_state=10,
       results_path='mljar_classification_6m', total_time_limit=360,
       validation_strategy={'k_folds': 5, 'random_seed': 10, 'shuffle': True,
                            'stratify': True, 'validation_type': 'kfold'})

In [31]:
y_pred30_6m = automl_classification_6m.predict(X_test30)
print("Test Balanced Accuracy:", balanced_accuracy_score(y_test30, y_pred30_6m))

Test Balanced Accuracy: 0.6555555555555554
