## AutoML
## Empirical Tests - Analysis of Results

This project aims to explore some of the main **AutoML tools** available, which involves the following tasks:
1. Reading of technical articles concerning the automated machine learning field.
2. Discussion about machine learning pipelines and the automation of some of their components.
3. Identification of the most interesting Python libraries for automatic ML pipeline construction.
4. Quick implementation of the selected tools with simulated data.
5. Careful exploration of the APIs of the selected tools.
6. Comparison among selected tools concerning: model performance, computation time, and usability.

All of these activities derive from the **objectives** of this project, which are: i) reflection on ML pipeline components; ii) discussion and analysis of AutoML tools; iii) identification of the key points of AutoML frameworks; iv) definition of the advantages and disadvantages of the main AutoML tools and, above all, of the relevance and adequacy of implementing AutoML.

---------------------

In this series of notebooks, we test out different AutoML Python libraries and compare them according to the following criteria: performance metrics of developed pipelines evaluated on test data; computation time (i.e., the performance relative to the available time budget of the search process); and usability of the tool.

* **Performance:** each tool receives the training data and applies its own validation approach during the search. Once the search for the best ML pipeline ends, the selected pipeline is evaluated on a hold-out dataset (25% of the complete dataset). Since the supervised learning task here is binary classification, the model assessment is based on the following metrics: [ROC-AUC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html), [average precision score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html), [Brier score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.brier_score_loss.html), [accuracy](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html), and [MCC](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.matthews_corrcoef.html).

* **Computation time:** all tested AutoML tools have some sort of time budget for the search process. Therefore, instead of minimizing the computation time across all tested tools, we explore three different time budgets: 20 minutes, 1 hour, and 6 hours. Consequently, one of the main aspects of the comparison is the performance each tool achieves under each time budget, in addition to its average performance across all time budgets.

* **Usability:** this aspect of the comparison refers to how easy it is to set up the search in each of the tested tools. The outputs of the search process also matter, mainly in terms of how the constructed and selected pipelines can be visualized and assessed. The diversity of the information produced about the search, and how easy it is to access and interpret, should be kept in mind as well. Finally, the more straightforward it is to use a selected pipeline, the better the tool.
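The five performance metrics listed above can be computed with scikit-learn as in the following minimal sketch. The helper name `assess_binary_classifier`, the 0.5 classification threshold, and the toy labels and scores are assumptions made for illustration; the dictionary keys mirror the column names of the outcomes table.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    average_precision_score,
    brier_score_loss,
    matthews_corrcoef,
    roc_auc_score,
)


def assess_binary_classifier(y_true, y_prob, threshold=0.5):
    """Return the five hold-out metrics given positive-class probabilities."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Accuracy and MCC need hard labels; the 0.5 cut-off is an assumption.
    y_pred = (y_prob >= threshold).astype(int)
    return {
        "test_roc_auc": roc_auc_score(y_true, y_prob),
        "test_avg_prec": average_precision_score(y_true, y_prob),
        "test_brier": brier_score_loss(y_true, y_prob),
        "test_acc": accuracy_score(y_true, y_pred),
        "test_mcc": matthews_corrcoef(y_true, y_pred),
    }


# Toy labels and predicted probabilities, made up for illustration only.
y_true = [0, 0, 1, 1, 1, 0, 1, 1]
y_prob = [0.1, 0.4, 0.8, 0.7, 0.35, 0.2, 0.9, 0.6]
metrics = assess_binary_classifier(y_true, y_prob)
```

ROC-AUC, average precision, and the Brier score are computed from the predicted probabilities, while accuracy and MCC depend on the chosen threshold.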

The empirical tests follow the reading and discussion of the APIs of all selected tools. Since the main initialization arguments, methods, and attributes of each tool were identified at that stage, they are used accordingly in these notebooks.

The data used for the empirical tests comes from the Kaggle dataset repository. It is a binary classification dataset whose objective is to build a classifier for the [identification of malware apps](https://www.kaggle.com/saurabhshahane/android-permission-dataset). It has 27310 unique instances (mobile phone applications) and 184 variables, one of which is the binary outcome variable and another the name of the app. Since the main objective of this project is to explore AutoML tools, only some basic feature engineering operations were implemented, in addition to a short description and exploration of the data.
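A 25% hold-out split of this kind could be produced as in the sketch below. It runs on a small synthetic stand-in rather than the Kaggle data; the column names `app_name` and `is_malware`, the decision to drop the app name, the stratification, and the random seed are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the permission data: binary features,
# an app-name column, and a binary outcome (all names hypothetical).
rng = np.random.default_rng(0)
n_apps = 200
df = pd.DataFrame(
    rng.integers(0, 2, size=(n_apps, 5)),
    columns=[f"permission_{i}" for i in range(5)],
)
df["app_name"] = [f"app_{i}" for i in range(n_apps)]
df["is_malware"] = rng.integers(0, 2, size=n_apps)

# The app name is treated as an identifier, not a predictor (assumption).
X = df.drop(columns=["app_name", "is_malware"])
y = df["is_malware"]

# 25% hold-out; stratifying on the outcome and fixing the seed are assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
```

Only `X_train` and `y_train` would be handed to an AutoML tool; the test part stays untouched until the final assessment.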

We now discuss each selected AutoML tool in terms of the above-mentioned criteria. The data for this analysis comes from the [table with main outcomes](#model_assess)<a href='#model_assess'></a> of the empirical tests. Each tool is briefly assessed and the most interesting one is pointed out, at least as far as the empirical tests conducted here are concerned.

1. **Performance:** MLJAR and Auto-sklearn achieved the best performance among all tested tools. Curiously enough, the two best results in terms of test ROC-AUC come from these two tools with a time budget of only 20 minutes. This seems to point to some sort of overfitting when large search spaces are explored. *Note, however, that all of these conclusions rely on the single dataset explored here!*

2. **Computation time:** Auto-sklearn, TPOT and MLJAR all respected the time budget for the search. PyCaret, however, showed an awkward behavior: a constant amount of time was allocated to the search, even though more time was spent on fine-tuning parameters and estimating the final model. All tested tools proved efficient, since good test set performance was reached within a short search time.

3. **Usability:** for this criterion, the [main table](#model_assess)<a href='#model_assess'></a> shows that TPOT has a great advantage: it delivers good performance with roughly half as many crucial initialization parameters as the other tools (6, against 10 to 12). MLJAR has the second lowest number of parameters. On more subjective grounds, PyCaret seemed very cumbersome to use in comparison with Auto-sklearn and TPOT. Although flexibility is an important point for new AutoML tools, PyCaret does not seem to strike the same balance between transparency and ease of use as MLJAR, for instance.
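The comparisons in the three points above can be reproduced from the outcomes table with a few pandas operations. The sketch below copies four columns from that table; reading `time_budget` and `running_time` as hours is an inference from the 20-minute budget appearing as 0.333333, and the derived column `budget_used` is a name introduced here for illustration.

```python
import pandas as pd

# Values copied from the outcomes table; times are presumably in hours.
outcomes = pd.DataFrame({
    "autoML": ["mljar", "auto_sklearn", "mljar", "mljar", "auto_sklearn",
               "auto_sklearn", "tpot", "tpot", "tpot", "pycaret"],
    "time_budget": [0.333333, 0.333333, 6.0, 1.0, 6.0,
                    1.0, 0.333333, 6.0, 1.0, 0.333333],
    "test_roc_auc": [0.908455, 0.907942, 0.906509, 0.905702, 0.905065,
                     0.902595, 0.902045, 0.898509, 0.896073, 0.8912],
    "running_time": [0.343471, 0.33199, 4.132753, 1.029569, 6.000003,
                     0.999048, 0.350083, 6.178011, 1.035469, 0.272821],
})

# Average test ROC-AUC per tool across all available time budgets.
mean_auc = (
    outcomes.groupby("autoML")["test_roc_auc"].mean().sort_values(ascending=False)
)

# Fraction of the time budget actually consumed by each run.
outcomes["budget_used"] = outcomes["running_time"] / outcomes["time_budget"]

# Row with the highest test ROC-AUC for each tool.
best_runs = outcomes.loc[outcomes.groupby("autoML")["test_roc_auc"].idxmax()]

print(mean_auc)
print(best_runs[["autoML", "time_budget", "test_roc_auc"]])
```

In these data every tool reaches its highest test ROC-AUC under the shortest budget, which is consistent with the remark in the performance item; PyCaret has a single run, so no comparison across budgets is possible for it.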

As for the best tool among those tested, MLJAR was a great surprise, especially because of its different modes, each designed for a distinct type of user. Even so, Auto-sklearn and TPOT (especially the latter) deserve attention given their robust search spaces, which might eventually outperform MLJAR on a wider range of datasets.

Finally, the use of AutoML tools should be guided by two main objectives: i) immediate processing of data to understand some empirical application; and ii) discovery of guidelines for the further development of machine learning pipelines. Consequently, AutoML does not appear to be a complete replacement for human development when it comes to building ML applications. However, this field of research is constantly improving, and better tools are being created on a regular basis.

------------

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Model assessment](#model_assess)<a href='#model_assess'></a>.

<a id='libraries'></a>

## Libraries

In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
cd "/content/gdrive/MyDrive/Studies/autoML/Codes"

/content/gdrive/MyDrive/Studies/autoML/Codes


In [3]:
# pip install -r requirements.txt

In [4]:
import pandas as pd
import numpy as np
import os
import json

<a id='functions_classes'></a>

## Functions and classes

In [5]:
from utils import get_outcomes
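The `utils` module is not shown in this notebook, so the behavior of `get_outcomes` can only be guessed from how it is used. A hypothetical stand-in, assuming each loaded JSON file is a flat dictionary whose keys match the columns of the outcomes table, might look like this (the function name `get_outcomes_sketch` and the records are made up):

```python
import pandas as pd


def get_outcomes_sketch(model_assess):
    """Hypothetical stand-in for utils.get_outcomes: one row per estimation."""
    # Assumes every element is a flat dict with the same keys (not confirmed
    # by the source); nested structures would need flattening first.
    return pd.DataFrame(model_assess).reset_index(drop=True)


# Made-up records shaped like the assumed JSON content.
records = [
    {"autoML": "tool_a", "estimation_id": 1, "time_budget": 1.0, "test_roc_auc": 0.90},
    {"autoML": "tool_b", "estimation_id": 2, "time_budget": 1.0, "test_roc_auc": 0.91},
]
table = get_outcomes_sketch(records)
```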

<a id='settings'></a>

## Settings

### Data management

In [6]:
# Declare whether to export results:
export = False

<a id='model_assess'></a>

## Model assessment

In [7]:
model_assess = []

# Loop over AutoML tools:
for t in ['auto_sklearn', 'tpot', 'mljar', 'pycaret']:
  # Loop over estimations:
  for f in [f for f in os.listdir(f'../Datasets/Outcomes/{t}') if 'model_assess' in f]:
    with open(f'../Datasets/Outcomes/{t}/{f}') as json_file:
      model_assess.append(json.load(json_file))

In [8]:
# Table with main outcomes of empirical tests:
outcomes = get_outcomes(model_assess)

if export:
  outcomes.to_csv('../Datasets/outcomes.csv', index=False)

outcomes.sort_values('test_roc_auc', ascending=False)

| | autoML | estimation_id | time_budget | num_params | test_roc_auc | test_avg_prec | test_brier | test_acc | test_mcc | running_time |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | mljar | 1631025897 | 0.333333 | 10 | 0.908455 | 0.959053 | 0.115545 | 0.821005 | 0.611579 | 0.343471 |
| 0 | auto_sklearn | 1630780422 | 0.333333 | 11 | 0.907942 | 0.957101 | 0.117798 | 0.821005 | 0.631783 | 0.33199 |
| 8 | mljar | 1631202843 | 6.0 | 10 | 0.906509 | 0.956235 | 0.117476 | 0.818515 | 0.614105 | 4.132753 |
| 7 | mljar | 1631030237 | 1.0 | 10 | 0.905702 | 0.956158 | 0.117606 | 0.821884 | 0.619878 | 1.029569 |
| 2 | auto_sklearn | 1630873231 | 6.0 | 11 | 0.905065 | 0.957475 | 0.155006 | 0.817929 | 0.585574 | 6.000003 |
| 1 | auto_sklearn | 1630784120 | 1.0 | 11 | 0.902595 | 0.956994 | 0.119417 | 0.815878 | 0.617101 | 0.999048 |
| 3 | tpot | 1630965285 | 0.333333 | 6 | 0.902045 | 0.954475 | 0.120289 | 0.816318 | 0.598208 | 0.350083 |
| 5 | tpot | 1631019728 | 6.0 | 6 | 0.898509 | 0.95541 | 0.120731 | 0.81163 | 0.583053 | 6.178011 |
| 4 | tpot | 1630970249 | 1.0 | 6 | 0.896073 | 0.952389 | 0.123157 | 0.811484 | 0.584642 | 1.035469 |
| 9 | pycaret | 1631124876 | 0.333333 | 12 | 0.8912 | 0.92 | | 0.8058 | 0.6055 | 0.272821 |
