## End-to-end machine learning application
## Data modeling - Final training

This project integrates the different parts of a machine learning system into an end-to-end ML application. The final product is an app (hypothetically called *AppSafe*) composed of a model that estimates the risk that a mobile app is malware and an API that could integrate with an app store and warn users when the app they are about to download is too risky.

The project follows the traditional [CRISP-DM](https://pt.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) methodology, so its core stages are data engineering, data preparation, data modeling, and deployment.

-----------

This notebook sequentially improves the current best pipeline by **fine-tuning** both the model hyper-parameters and the data preparation settings. The starting point is the best complete pipeline (data preparation + ensemble of models) found during experimentation. Different components of this pipeline are then changed to further optimize model performance. After each iteration the best complete pipeline is updated, and the process stops when no additional improvement occurs.
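The sequential procedure described above can be read as a greedy search. The sketch below is only an illustration under stated assumptions: `sequential_fine_tuning`, `toy_score`, and the idea of changing exactly one component per trial are hypothetical and are not taken from the project code, where each iteration is a separate run of this notebook.

```python
from typing import Any, Callable, Dict, List


def sequential_fine_tuning(
    baseline: Dict[str, Any],
    candidates: Dict[str, List[Any]],
    evaluate: Callable[[Dict[str, Any]], float],
) -> Dict[str, Any]:
    """Greedy search: try one change at a time and keep it only if the score improves."""
    best_config = dict(baseline)
    best_score = evaluate(best_config)
    improved = True
    while improved:
        improved = False
        for name, values in candidates.items():
            for value in values:
                trial = {**best_config, name: value}
                score = evaluate(trial)
                if score > best_score:
                    best_config, best_score = trial, score
                    improved = True
    return best_config


def toy_score(config: Dict[str, Any]) -> float:
    """Hypothetical stand-in for a full train-and-evaluate run."""
    score = 0.90
    if config['which_scale'] == 'standard_scale':
        score += 0.01
    if config['which_missings_treat'] == 'impute_stat':
        score += 0.005
    return score


baseline = {'which_scale': None, 'which_missings_treat': 'create_binary'}
candidates = {
    'which_scale': [None, 'standard_scale', 'min_max_scale'],
    'which_missings_treat': ['create_binary', 'impute_stat'],
}
best = sequential_fine_tuning(baseline, candidates, toy_score)
print(best)
```

The loop stops after a full pass over all candidates yields no improvement, which mirrors the stopping rule stated above.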

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing the data](#imports)<a href='#imports'></a>.
  * [Features and labels](#features_labels)<a href='#features_labels'></a>.
  * [Data understanding](#data_und)<a href='#data_und'></a>.
  * [Model assessment](#model_assess)<a href='#model_assess'></a>.

5. [Data preparation](#data_prep)<a href='#data_prep'></a>.
  * [Features classification and early selection](#classif_feat)<a href='#classif_feat'></a>.
  * [Pipeline of data transformations](#pipeline)<a href='#pipeline'></a>.
  * [Features selection](#features_selection)<a href='#features_selection'></a>.

6. [Data modeling](#data_modeling)<a href='#data_modeling'></a>.
  * [Train and test data](#train_test_data)<a href='#train_test_data'></a>.
  * [Grids of hyper-parameters](#hyper_parameters)<a href='#hyper_parameters'></a>.
  * [Model training and evaluation](#model_training_eval)<a href='#model_training_eval'></a>.
  * [Ensemble definition](#ensemble_definition)<a href='#ensemble_definition'></a>.
  * [Exporting artifacts](#artifacts)<a href='#artifacts'></a>.

<a id='libraries'></a>

## Libraries





In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
cd "/content/gdrive/MyDrive/Studies/end_to_end_ml/notebooks/"

/content/gdrive/MyDrive/Studies/end_to_end_ml/model_dev


In [None]:
# !pip install -r ../requirements.txt

In [None]:
import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import time
from copy import deepcopy
import pickle

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef, confusion_matrix
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss
from scipy.stats import uniform, randint

In [None]:
import sys

# `__doc__` is not a file path inside a notebook, so the source folder is resolved from the working directory:
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '../src')))

<a id='functions_classes'></a>

## Functions and classes

In [None]:
from utils import classify_variables, assessing_missings, missings_detection, data_consistency, running_time, frequency_list
from transformations import LogTransformation, ScaleNumericalVars, TreatMissings, OneHotEncoding, OutliersTreat, Pipeline
from kfolds import Kfolds_fit
from features_selection import FeaturesSelection
from data_vis import plot_histogram, plot_boxplot, plot_bar
from production import Ensemble

<a id='settings'></a>

## Settings

<a id='data_management_settings'></a>

### Data management

In [None]:
# Declare whether outcomes should be exported:
EXPORT = True

# Identification of experiment:
experiment_id = str(int(time.time()))

# Declares if it is a baseline estimation (following the best pipeline found during experimentation):
BASELINE = False

# Experiment description:
COMMENT = 'Fine tuning of the best complete pipeline: further exploration of main hyper-parameters of selected models together with \
slight changes in data transformations.'

<a id='data_prep_settings'></a>

### Data preparation

#### Parameters for further exploration (data transformations)

In [None]:
# Declare which type of scaling should be applied over numerical variables ('standard_scale', 'min_max_scale', None):
WHICH_SCALE = None

# Declare which type of missing values treatment should be used ('create_binary', 'impute_stat'):
WHICH_MISSINGS_TREAT = 'create_binary'

# Declare which statistic should be used for missing values treatment ('mean', 'median'):
MISSINGS_TREAT_STAT = 'mean' if WHICH_MISSINGS_TREAT=='impute_stat' else None
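To make the two options of `WHICH_MISSINGS_TREAT` concrete, here is a minimal pandas sketch. It is an assumption about what the strategies mean, not the implementation of the project's `TreatMissings` class: the helper `treat_missings`, the `NA#` prefix, and the zero fill are hypothetical. For brevity the statistic is computed on the same frame, whereas in practice it would be estimated on the training data only.

```python
import numpy as np
import pandas as pd


def treat_missings(df, column, method, statistic=None):
    """Illustrative stand-in for the two missing-value strategies (hypothetical helper)."""
    out = df.copy()
    if method == 'create_binary':
        # Flag missing rows in a new indicator column, then fill the gaps with zero.
        out[f'NA#{column}'] = out[column].isnull().astype(int)
        out[column] = out[column].fillna(0)
    elif method == 'impute_stat':
        # Replace missing values with a summary statistic of the observed values.
        fill_value = out[column].mean() if statistic == 'mean' else out[column].median()
        out[column] = out[column].fillna(fill_value)
    else:
        raise ValueError(f'Unknown method: {method}')
    return out


toy = pd.DataFrame({'share_known_malwares': [1.0, np.nan, 0.0, np.nan]})
flagged = treat_missings(toy, 'share_known_malwares', method='create_binary')
imputed = treat_missings(toy, 'share_known_malwares', method='impute_stat', statistic='mean')
print(flagged)
print(imputed)
```

The indicator approach keeps the information that a value was missing, which may matter for a feature such as `share_known_malwares`, where more than half of the values are missing.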

<a id='features_sel'></a>

### Set of features

In [None]:
# List of features to be removed:
REMOVE_FEATS = []
# REMOVE_FEATS = ['num_related_apps', 'num_words_desc', 'num_known_apps',
#                 'share_known', 'num_known_malwares', 'share_known_malwares']

<a id='fixed_settings'></a>

### Fixed settings

In [None]:
# Directory for storing model outcomes:
os.makedirs('../artifacts', exist_ok=True)

<a id='imports'></a>

## Importing the data

<a id='features_labels'></a>

### Features and labels

#### Training data

In [None]:
df_train = pd.read_csv('../data/training_data.csv', dtype={'app_id': int})

print(f'Shape of df_train: {df_train.shape}.')
print(f'Number of unique instances: {df_train.app_id.nunique()}.')

# Auxiliary variables:
drop_vars = ['app', 'package', 'class', 'app_id', 'related_apps', 'description']

df_train.head(3)

Shape of df_train: (18298, 191).
Number of unique instances: 18298.


Unnamed: 0,app,package,category,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,...,your_personal_information_write_contact_data,your_personal_information_write_to_user_defined_dictionary,class,app_id,num_related_apps,num_words_desc,num_known_apps,share_known,num_known_malwares,share_known_malwares
0,Ambient Soothing Sounds: Beach,com.zeddev.chillbeach1,Health & Fitness,The soothing sounds on a long and seamless loo...,3.6,122,0.0,"com.zeddev.chillmeadow1, com.droiddevz.ambient...",1.0,1,...,0,0,0,6565,4.0,42.0,0.0,0.0,0.0,
1,Aurora,jiang.joyworks.aurora,Brain & Puzzle,This is one great &quot;Escape Game&quot; <p>Y...,3.8,24,1.41,com.firemaplegames.games.the_secretofgrislyman...,1.0,0,...,0,0,1,4772,4.0,251.0,0.0,0.0,0.0,
2,Tank Ace 1944,com.resetgame.tankace1944,Arcade & Action,In Tank Ace 1944 you command a World War II ta...,3.7,20,4.99,"ru.sibteam.classictankfull, nl.ejsoft.mortalsk...",0.0,0,...,0,0,1,20856,4.0,341.0,0.0,0.0,0.0,


Missing data

In [None]:
# Number of missing values per feature (computed once and reused):
num_missings_train = df_train.isnull().sum()

missings_train = pd.DataFrame(data={
    'feature': num_missings_train.index,
    'num_missings': num_missings_train.values,
    'share_missings': num_missings_train.values/len(df_train)
}).sort_values('num_missings', ascending=False)

missings_train.head(10)

Unnamed: 0,feature,num_missings,share_missings
190,share_known_malwares,10047,0.549076
185,num_related_apps,484,0.026451
189,num_known_malwares,484,0.026451
188,share_known,484,0.026451
187,num_known_apps,484,0.026451
7,related_apps,484,0.026451
8,dangerous_permissions_count,129,0.00705
3,description,3,0.000164
186,num_words_desc,3,0.000164
0,app,1,5.5e-05


#### Test data

In [None]:
df_test = pd.read_csv('../data/test_data.csv', dtype={'app_id': int})

print(f'Shape of df_test: {df_test.shape}.')
print(f'Number of unique instances: {df_test.app_id.nunique()}.')

df_test.head(3)

Shape of df_test: (9012, 191).
Number of unique instances: 9012.


Unnamed: 0,app,package,category,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,...,your_personal_information_write_contact_data,your_personal_information_write_to_user_defined_dictionary,class,app_id,num_related_apps,num_words_desc,num_known_apps,share_known,num_known_malwares,share_known_malwares
0,Dirty Jokes,com.appspot.swisscodemonkeys.dirty,Entertainment,The best Dirty Jokes app for Android!<p>#1 Fre...,4.0,2470,0.0,"com.gonzotech.dirty_jokes, com.comic.lastlaugh...",1.0,1,...,0,0,0,5804,4.0,82,1.0,0.25,1.0,1.0
1,Animal Sounds with Photos,com.teachersparadise.animalsoundsphotos,Education,Let kids explore the animal kingdom by learnin...,3.8,168,0.0,"com.papainteractive, com.teachersparadise.days...",2.0,0,...,0,0,0,13224,4.0,37,2.0,0.5,0.0,0.0
2,Mini Catch,com.airylabs.games.minicatch,Brain & Puzzle,"From Airy Labs, acclaimed developer of the bes...",3.0,1,0.0,"com.oscarmikegames.Bloxus, com.concretesoftwar...",2.0,1,...,0,0,1,14752,4.0,244,0.0,0.0,0.0,


Missing data

In [None]:
# Number of missing values per feature (computed once and reused):
num_missings_test = df_test.isnull().sum()

missings_test = pd.DataFrame(data={
    'feature': num_missings_test.index,
    'num_missings': num_missings_test.values,
    'share_missings': num_missings_test.values/len(df_test)
}).sort_values('num_missings', ascending=False)

missings_test.head(10)

Unnamed: 0,feature,num_missings,share_missings
190,share_known_malwares,5072,0.562805
185,num_related_apps,236,0.026187
189,num_known_malwares,236,0.026187
188,share_known,236,0.026187
187,num_known_apps,236,0.026187
7,related_apps,236,0.026187
8,dangerous_permissions_count,72,0.007989
122,system_tools_retrieve_running_applications,0,0.0
131,system_tools_write_sync_settings,0,0.0
123,system_tools_send_package_removed_broadcast,0,0.0


<a id='data_und'></a>

### Data understanding

In [None]:
data_und = pd.read_csv('../data/features.csv')

print(f'Shape of data_und: {data_und.shape}.')
print(f'Number of unique instances: {data_und.feature.nunique()}.')

data_und.head(3)

Shape of data_und: (191, 8).
Number of unique instances: 191.


Unnamed: 0,feature,type,n_unique,sample_values,num_missings,share_missings,var_class,category
0,app,object,22823,['Alabama Crimson Tide News' 'Blood Demon Movi...,1,3.7e-05,categorical,app_attributes
1,package,object,23485,['com.estrongs.android.pop.app.shortcut' 'com....,0,0.0,categorical,app_attributes
2,category,object,30,['Shopping' 'Racing' 'Productivity' 'Sports Ga...,0,0.0,categorical,app_attributes


<a id='model_assess'></a>

### Model assessment

In [None]:
with open('../experiments/model_assess.json', 'r') as json_file:
    model_assess = json.load(json_file)

if 'fine_tuning.json' in os.listdir('../experiments'):
    with open('../experiments/fine_tuning.json', 'r') as json_file:
        fine_tuning = json.load(json_file)

else:
    fine_tuning = {}

fine_tuning[experiment_id] = {
    'comment': COMMENT,
    'solution': {
        'which_scale': WHICH_SCALE,
        'which_missings_treat': WHICH_MISSINGS_TREAT,
        'missings_treat_stat': MISSINGS_TREAT_STAT,
        'remove_feats': REMOVE_FEATS
    }
}

#### Complete pipeline selection

In [None]:
# Best models for each data pipeline:
pipeline_selection = pd.DataFrame(
    data={
        'experiment_id': [e for e in model_assess],
        'solution': [model_assess[e]['solution'] for e in model_assess],
        'best_model': [model_assess[e]['best_models'][0] for e in model_assess],
        'test_roc_auc': [model_assess[e]['models'][model_assess[e]['best_models'][0]]['performance_metrics']['test_roc_auc'] for e
                         in model_assess],
        'test_acc': [model_assess[e]['models'][model_assess[e]['best_models'][0]]['performance_metrics']['test_acc'] for e
                     in model_assess],
        'test_mcc': [model_assess[e]['models'][model_assess[e]['best_models'][0]]['performance_metrics']['test_mcc'] for e
                     in model_assess]
          },
    index=[e for e in model_assess]
).sort_values(['test_roc_auc', 'test_acc', 'test_mcc'], ascending=[False, False, False])
pipeline_selection

Unnamed: 0,experiment_id,solution,best_model,test_roc_auc,test_acc,test_mcc
1650581120,1650581120,"{'scale_all': False, 'treat_outliers': True, '...",ensemble_0,0.916012,0.833999,0.63397
1650554986,1650554986,"{'scale_all': True, 'treat_outliers': False, '...",ensemble_0,0.915581,0.833111,0.632162
1650488565,1650488565,"{'scale_all': False, 'treat_outliers': False, ...",light_gbm,0.915155,0.834998,0.637094
1650727310,1650727310,"{'scale_all': False, 'treat_outliers': True, '...",ensemble_0,0.914654,0.833333,0.632812
1650732761,1650732761,"{'scale_all': False, 'treat_outliers': False, ...",light_gbm,0.913785,0.834443,0.636553
1650737228,1650737228,"{'scale_all': False, 'treat_outliers': False, ...",light_gbm,0.913171,0.834887,0.634772
1650721375,1650721375,"{'scale_all': False, 'treat_outliers': True, '...",ensemble_0,0.908034,0.825788,0.609164
1650739831,1650739831,"{'scale_all': False, 'treat_outliers': False, ...",light_gbm,0.90799,0.824345,0.613697
1650567459,1650567459,"{'scale_all': False, 'treat_outliers': True, '...",ensemble_1,0.898973,0.828229,0.614573


In [None]:
# Collection of best models based on how many times it maximizes the selected metrics:
best_pipelines = []
for metric in ['test_roc_auc', 'test_acc', 'test_mcc']:
    best_pipelines.extend([m for m, v in zip(pipeline_selection.index, pipeline_selection[metric]) if v==max(pipeline_selection[metric])])
best_pipeline = [
                 (k, pipeline_selection.loc[k]['solution'], pipeline_selection.loc[k]['best_model']) for k, v in
                 frequency_list(best_pipelines).items() if v==max(frequency_list(best_pipelines).values())
][0]
display(best_pipeline)

# Define the best pipeline:
best_pipeline_id = best_pipeline[0]

('1650488565',
 {'first_treat_outliers': None,
  'method': None,
  'outliers_method': None,
  'scale_all': False,
  'treat_outliers': False},
 'light_gbm')
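The selection rule above is a vote across metrics: each metric votes for the pipeline that maximizes it, and the pipeline with the most votes wins. The sketch below reproduces that logic with `collections.Counter` as a stand-in for the project's `frequency_list` helper, whose implementation is not shown here. It uses three rows copied from the table above.

```python
from collections import Counter

import pandas as pd

# Three rows copied from the table above (test metrics of the best model per pipeline).
toy_selection = pd.DataFrame(
    {
        'test_roc_auc': [0.916012, 0.915581, 0.915155],
        'test_acc': [0.833999, 0.833111, 0.834998],
        'test_mcc': [0.633970, 0.632162, 0.637094],
    },
    index=['1650581120', '1650554986', '1650488565'],
)

# Each metric votes for every pipeline that attains its maximum.
votes = Counter()
for metric in toy_selection.columns:
    winners = toy_selection.index[toy_selection[metric] == toy_selection[metric].max()]
    votes.update(winners)

# The pipeline with the most votes wins; ties fall to the first one that received a vote.
top_votes = max(votes.values())
winner = [pipeline_id for pipeline_id, n in votes.items() if n == top_votes][0]
print(votes, winner)
```

This shows why pipeline `1650488565` is selected even though it does not have the highest `test_roc_auc`: it has the highest `test_acc` and `test_mcc`.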

Parameters defined during experimentation

In [None]:
# Scaling all variables:
scale_all = model_assess[best_pipeline_id]['pipeline']['data_transformation']['scale_all']

# Treating outliers:
treat_outliers = model_assess[best_pipeline_id]['pipeline']['data_transformation']['treat_outliers']

# Outliers treatment:
outliers_method = model_assess[best_pipeline_id]['pipeline']['data_transformation']['outliers_method']

# First or last treatment of outliers:
first_treat_outliers = model_assess[best_pipeline_id]['pipeline']['data_transformation']['first_treat_outliers']

# Method of features selection:
method = model_assess[best_pipeline_id]['pipeline']['features_selection']['method']

# Selected learning algorithms:
if 'ensemble' in model_assess[best_pipeline_id]['best_models'][0]:
    sel_models = model_assess[best_pipeline_id]['models'][model_assess[best_pipeline_id]['best_models'][0]].get('models')
else:
    sel_models = [model_assess[best_pipeline_id]['best_models'][0]]

Fixed parameters

In [None]:
# Early selection of features:
drop_excessive_miss = model_assess[best_pipeline_id]['pipeline']['early_selection']['drop_excessive_miss']
excessive_miss = model_assess[best_pipeline_id]['pipeline']['early_selection']['excessive_miss']
drop_no_var = model_assess[best_pipeline_id]['pipeline']['early_selection']['drop_no_var']
minimum_var = model_assess[best_pipeline_id]['pipeline']['early_selection']['minimum_var']
drop_bin_no_var = model_assess[best_pipeline_id]['pipeline']['early_selection']['drop_bin_no_var']
bin_minimum_var = model_assess[best_pipeline_id]['pipeline']['early_selection']['bin_minimum_var']

# Data transformations:
log_transform = model_assess[best_pipeline_id]['pipeline']['data_transformation']['log_transform']
cat_transf_var = model_assess[best_pipeline_id]['pipeline']['data_transformation']['cat_transf_var']
quantile = model_assess[best_pipeline_id]['pipeline']['data_transformation']['quantile']
k = model_assess[best_pipeline_id]['pipeline']['data_transformation']['k']

# Features selection:
threshold = model_assess[best_pipeline_id]['pipeline']['features_selection']['threshold']
num_folds = model_assess[best_pipeline_id]['pipeline']['features_selection']['num_folds']
metric = model_assess[best_pipeline_id]['pipeline']['features_selection']['metric']
min_num_feats = model_assess[best_pipeline_id]['pipeline']['features_selection']['min_num_feats']
max_num_feats = model_assess[best_pipeline_id]['pipeline']['features_selection']['max_num_feats']
step = model_assess[best_pipeline_id]['pipeline']['features_selection']['step']
direction = model_assess[best_pipeline_id]['pipeline']['features_selection']['direction']
regul_param = model_assess[best_pipeline_id]['pipeline']['features_selection']['regul_param']

<a id='data_prep'></a>

## Data preparation

<a id='classif_feat'></a>

### Features classification and early selection

In [None]:
class_variables = classify_variables(dataframe=df_train, vars_to_drop=drop_vars, test_data=df_test,
                                     drop_excessive_miss=drop_excessive_miss, excessive_miss=excessive_miss,
                                     drop_no_var=drop_no_var, minimum_var=minimum_var)

# Lists of variables:
cat_vars = class_variables['cat_vars']
binary_vars = class_variables['binary_vars']
cont_vars = class_variables['cont_vars']

Initial number of features: 185.
0 features were dropped for excessive number of missings!
29 features were dropped for having no variance!
156 remaining features.




#### Selecting binary variables based on their variances

In [None]:
if drop_bin_no_var:
    # Dropping binary features whose variance in the training data does not exceed the threshold:
    bin_no_variance = [c for c in binary_vars if np.nanvar(df_train[c])<=bin_minimum_var]
    print(f'{len(bin_no_variance)} binary variables were dropped for having variance inferior to {bin_minimum_var}.\n')

    print(f'Shape of df_train (before dropping binary variables): {df_train.shape}.')
    df_train = df_train.drop(bin_no_variance, axis=1)
    print(f'Shape of df_train (after dropping binary variables): {df_train.shape}.\n')

    print(f'Shape of df_test (before dropping binary variables): {df_test.shape}.')
    df_test = df_test.drop(bin_no_variance, axis=1)
    print(f'Shape of df_test (after dropping binary variables): {df_test.shape}.')

101 binary variables were dropped for having variance inferior to 0.01.

Shape of df_train (before dropping binary variables): (18298, 162).
Shape of df_train (after dropping binary variables): (18298, 61).

Shape of df_test (before dropping binary variables): (9012, 162).
Shape of df_test (after dropping binary variables): (9012, 61).
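A variance threshold on a binary variable is equivalent to a threshold on the share of ones, because a 0/1 variable with share `p` of ones has variance `p * (1 - p)`. The short sketch below works out what the threshold of 0.01 reported above implies. The toy `flag` array is made up for illustration.

```python
import numpy as np

bin_minimum_var = 0.01  # threshold reported in the output above

# Solving p * (1 - p) = threshold gives the two shares of ones at which the variance equals it.
p_low = (1 - np.sqrt(1 - 4 * bin_minimum_var)) / 2
p_high = 1 - p_low

# Empirical check with a made-up flag that equals 1 for 1% of 10,000 apps.
flag = np.zeros(10_000)
flag[:100] = 1
flag_variance = np.nanvar(flag)

print(p_low, p_high, flag_variance)
```

Under this threshold, a binary feature is dropped when roughly 1% or fewer of the apps have it set to 1 (or, symmetrically, set to 0), which is consistent with so many rarely requested permissions being removed.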


<a id='pipeline'></a>

### Pipeline of data transformations

In [None]:
if scale_all==False:
    if (treat_outliers==True) & (first_treat_outliers==True):
        outliers_treat = OutliersTreat(vars_to_treat=[c for c in cont_vars], method=outliers_method, quantile=quantile, k=k)
        outliers_treat.fit(training_data=df_train)
        df_train = outliers_treat.transform(data=df_train)

    to_log = [c for c in df_train.columns if c in cont_vars]
    # to_scale = [f'L#{c}' for c in df_train.columns if c in cont_vars]
    vars_to_treat = None

    pipeline = Pipeline(
        operations = [
                      LogTransformation(to_log=to_log),
                      # ScaleNumericalVars(to_scale=to_scale, which_scale=WHICH_SCALE),
                      TreatMissings(vars_to_treat=vars_to_treat, method=WHICH_MISSINGS_TREAT, drop_vars=drop_vars, cat_vars=cat_vars,
                                    statistic=MISSINGS_TREAT_STAT),
                      OneHotEncoding(categorical_features=cat_vars, variance_param=cat_transf_var)
        ]
    )

    df_train_scaled, df_test_scaled = pipeline.transform(data_list=[df_test], training_data=df_train)
    df_test_scaled = df_test_scaled[0]

    if (treat_outliers==True) & (first_treat_outliers==False):
        outliers_treat = OutliersTreat(vars_to_treat=[f'L#{c}' for c in cont_vars], method=outliers_method, quantile=quantile, k=k)
        outliers_treat.fit(training_data=df_train_scaled)
        df_train_scaled = outliers_treat.transform(data=df_train_scaled)

#### Datasets consistency

In [None]:
if scale_all==False:
    # Assessing missing values (training data):
    missings_detection(df_train_scaled.drop([v for v in drop_vars if v!='class'], axis=1), name=f'df_train_scaled')

    # Assessing missing values (test data):
    missings_detection(df_test_scaled.drop([v for v in drop_vars if v!='class'], axis=1), name=f'df_test_scaled')

    # Checking datasets structure:
    df_test_scaled = data_consistency(dataframe=df_train_scaled,
                                      test_data=df_test_scaled)['test_data']

Training and test data are consistent with each other.


#### Scaling all variables

In [None]:
if scale_all:
    if (treat_outliers==True) & (first_treat_outliers==True):
        outliers_treat = OutliersTreat(vars_to_treat=[c for c in cont_vars], method=outliers_method, quantile=quantile, k=k)
        outliers_treat.fit(training_data=df_train)
        df_train = outliers_treat.transform(data=df_train)

    # Logarithmic transformation:
    to_log = [c for c in df_train.columns if c in cont_vars]
    log_transf = LogTransformation(to_log=to_log)
    df_train = log_transf.fit_transform(data=df_train)
    df_test = log_transf.fit_transform(data=df_test)

    # One-hot encoding:
    categorical_transf = OneHotEncoding(categorical_features=cat_vars, variance_param=cat_transf_var)
    categorical_transf.fit(training_data=df_train)
    df_train_scaled = categorical_transf.transform(data=df_train)
    df_test_scaled = categorical_transf.transform(data=df_test)

    to_scale = [c for c in df_train_scaled.columns if (c not in drop_vars)]

    # Object for scaling numerical data:
    scale_transf = ScaleNumericalVars(to_scale=to_scale, which_scale=WHICH_SCALE)
    scale_transf.fit(training_data=df_train_scaled)

    # Training data:
    df_train_scaled = scale_transf.transform(data=df_train_scaled)
    new_vars_scale = list(df_train_scaled.drop(drop_vars, axis=1).columns)

    # Test data:
    df_test_scaled = scale_transf.transform(data=df_test_scaled)

    # Object for missing values treatment:
    vars_to_treat = [c for c in list(df_train_scaled.columns) if (c not in drop_vars) & (c not in cat_vars) &
                     (df_train_scaled[c].isnull().sum() > 0)]
    missings_treat = TreatMissings(vars_to_treat=vars_to_treat, method=WHICH_MISSINGS_TREAT, drop_vars=drop_vars, cat_vars=[],
                                   statistic=MISSINGS_TREAT_STAT, treat_remaining=True)
    
    # Training data:
    df_train_scaled = missings_treat.fit_transform(data=df_train_scaled, training_data=df_train_scaled)

    # Test data:
    df_test_scaled = missings_treat.fit_transform(data=df_test_scaled, training_data=df_train_scaled)

    # Checking datasets structure:
    missings_detection(df_train_scaled.drop([v for v in drop_vars if v!='class'], axis=1), name=f'df_train_scaled')
    missings_detection(df_test_scaled.drop([v for v in drop_vars if v!='class'], axis=1), name=f'df_test_scaled')
    df_test_scaled = data_consistency(dataframe=df_train_scaled,
                                      test_data=df_test_scaled)['test_data']

    # Object for scaling numerical data:
    new_vars_scale = [c for c in list(df_train_scaled.drop(drop_vars, axis=1).columns) if c not in new_vars_scale]
    scale_transf = ScaleNumericalVars(to_scale=new_vars_scale, which_scale=WHICH_SCALE)
    scale_transf.fit(training_data=df_train_scaled)

    # Training data:
    df_train_scaled = scale_transf.transform(data=df_train_scaled)

    # Test data:
    df_test_scaled = scale_transf.transform(data=df_test_scaled)

    if (treat_outliers==True) & (first_treat_outliers==False):
        outliers_treat = OutliersTreat(vars_to_treat=[f'L#{c}' for c in cont_vars], method=outliers_method, quantile=quantile, k=k)
        outliers_treat.fit(training_data=df_train_scaled)
        df_train_scaled = outliers_treat.transform(data=df_train_scaled)

<a id='features_selection'></a>

### Features selection

In [None]:
if method is not None:
    # Dataframe with only continuous variables:
    cont_train_df = df_train_scaled[[f'L#{c}' for c in cont_vars]]

    # Features selection:
    selection = FeaturesSelection(method=method, 
                                  threshold=threshold,
                                  num_folds=num_folds, metric=metric, min_num_feats=min_num_feats, max_num_feats=max_num_feats, step=step,
                                  direction=direction)
    selection.select_features(inputs=cont_train_df if method in ['variance', 'correlation'] else df_train_scaled.drop(drop_vars, axis=1),
                              output=df_train_scaled['class'],
                              estimator=LogisticRegression(penalty='l1', solver='liblinear', C=regul_param))
    selected_features = selection.selected_features

    if method in ['variance', 'correlation']:
        selected_variables = [c for c in list(df_train_scaled.columns) if (c not in drop_vars) & (c.replace('L#', '') not in cont_vars)]
        selected_variables.extend([f'L#{c}' for c in cont_vars if f'L#{c}' in selected_features])
    
    else:
        selected_variables = [v for v in selected_features]

    print(f'\033[1m{len(selected_variables)} variables were chosen based on {method}!\033[0m')

else:
    selected_variables = list(df_train_scaled.drop(drop_vars, axis=1).columns)
    print(f'\033[1m All {len(selected_variables)} variables were chosen!\033[0m')

 All 87 variables were chosen!


#### Ad-hoc features selection

In [None]:
if len(REMOVE_FEATS) > 0:
    remove_vars = list(set(
        [item for sublist in [[v for v in selected_variables if r in v] for r in REMOVE_FEATS] for item in sublist]
    ))

    selected_variables = [v for v in selected_variables if v not in remove_vars]
    print(f'\033[1mOnly {len(selected_variables)} variables were kept from the collection of original variables.\033[0m')

<a id='data_modeling'></a>

## Data modeling

<a id='train_test_data'></a>

### Train and test data

In [None]:
# Training and test data:
X_train, y_train = (df_train_scaled[selected_variables], df_train_scaled['class'])
X_test, y_test = (df_test_scaled[selected_variables], df_test_scaled['class'])

print(f'Shape of the training dataset: {X_train.shape}.')
print(f'Shape of the test dataset: {X_test.shape}.\n')

# Dictionary of model assessment:
fine_tuning[experiment_id]['info'] = {
    'experiment_id': experiment_id,
    'n_obs_train': len(X_train),
    'n_obs_test': len(X_test),
    'avg_y_train': np.nanmean(y_train),
    'avg_y_test': np.nanmean(y_test),
    'n_vars_train': X_train.shape[1],
    'n_vars_test': X_test.shape[1],
    'pipeline_id': best_pipeline_id
}
fine_tuning[experiment_id]['pipeline'] = {
    'early_selection': {
        'drop_excessive_miss': drop_excessive_miss,
        'excessive_miss': excessive_miss,
        'drop_no_var': drop_no_var,
        'minimum_var': minimum_var,
        'drop_bin_no_var': drop_bin_no_var,
        'bin_minimum_var': bin_minimum_var
    },
    'data_transformation': {
        'log_transform': log_transform,
        'which_scale': WHICH_SCALE,
        'which_missings_treat': WHICH_MISSINGS_TREAT,
        'missings_treat_stat': MISSINGS_TREAT_STAT,
        'cat_transf_var': cat_transf_var,
        'scale_all': scale_all,
        'treat_outliers': treat_outliers,
        'quantile': quantile,
        'outliers_method': outliers_method,
        'k': k,
        'first_treat_outliers': first_treat_outliers
    },
    'features_selection': {
        'method': method,
        'threshold': threshold,
        'num_folds': num_folds,
        'metric': metric,
        'min_num_feats': min_num_feats,
        'max_num_feats': max_num_feats,
        'step': step,
        'direction': direction,
        'regul_param': regul_param
    }
}
fine_tuning[experiment_id]['models'] = {}

Shape of the training dataset: (18298, 87).
Shape of the test dataset: (9012, 87).



<a id='hyper_parameters'></a>

### Grids of hyper-parameters

#### Parameters for optimization

In [None]:
# Grids of values of parameters under optimization:
grid_params = {
    "logistic_regression": {"C": [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03, 0.1, 0.25, 0.3, 0.5, 0.75, 1, 3, 10]},
    "random_forest": {
        'n_estimators': [i for i in range(25, 520, 25)],
        'max_features': [int(np.sqrt(X_train.shape[1]))],
        'min_samples_split': [2]
    },
    "light_gbm": {
        'bagging_fraction': uniform(0.5, 0.5),
        'learning_rate': uniform(0.0001, 0.1),
        'max_depth': randint(1, 10),
        'num_iterations': [i for i in range(100, 1050, 50)]
    },
    "xgboost": {
        'subsample': uniform(0.5, 0.5),
        'eta': uniform(0.0001, 0.1),
        'max_depth': randint(1, 10),
        'num_boost_round': [i for i in range(100, 1050, 50)]
    },
    "svm": {
        'C': [1],
        'kernel': ['poly'],
        'degree': [i for i in range(1, 11, 1)],
        'gamma': ['scale']
    }
}
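The `scipy.stats` objects in the grid follow the `loc`/`scale` convention, which is easy to misread: `uniform(loc, scale)` covers `[loc, loc + scale]`, and `randint(low, high)` excludes `high`. The sketch below samples from the same distributions to show the ranges that a random search would actually explore. The sample size and seed are arbitrary choices.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
bagging_fraction = uniform(0.5, 0.5)   # values between 0.5 and 1.0
learning_rate = uniform(0.0001, 0.1)   # values between 0.0001 and 0.1001

# randint(low, high) draws integers from low up to high - 1.
max_depth = randint(1, 10)             # integers from 1 to 9

bagging_draws = bagging_fraction.rvs(size=1000, random_state=0)
learning_rate_draws = learning_rate.rvs(size=1000, random_state=0)
max_depth_draws = max_depth.rvs(size=1000, random_state=0)

print(bagging_draws.min(), bagging_draws.max())
print(learning_rate_draws.min(), learning_rate_draws.max())
print(sorted(set(max_depth_draws.tolist())))
```

One consequence is that a `max_depth` of 10 is never sampled from `randint(1, 10)`.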

Hyper-parameters of best models (baseline estimation)

In [None]:
baseline_params = {}

# Loop over selected models:
for m in sel_models:
    baseline_params[m] = {}

    # Loop over hyper-parameters:
    for p in model_assess[best_pipeline_id]['models'][m]['best_param']:
        baseline_params[m][p] = [model_assess[best_pipeline_id]['models'][m]['best_param'][p]]

#### Default and fixed parameters

In [None]:
# Default values of parameters under optimization (used when grid/random search fails):
default_params = {
    "logistic_regression": {'C': 1.0},
    "random_forest": {'n_estimators': 100, 'max_features': int(np.sqrt(X_train.shape[1])), 'min_samples_split': 2},
    "light_gbm": {'bagging_fraction': 0.75, 'learning_rate': 0.01, 'max_depth': 10, 'num_iterations': 500},
    "xgboost": {'subsample': 0.75, 'eta': 0.01, 'max_depth': 10, 'num_boost_round': 100},
    "svm": {'C': 1.0, 'kernel': 'poly', 'degree': 1, 'gamma': 'scale'}
}

# Customized values of parameters:
fixed_params = {
    "logistic_regression": {'penalty':'l1', 'solver':'liblinear', 'warm_start':True},
    "random_forest": {'bootstrap': True, 'criterion': 'gini'},
    "light_gbm": None,
    "xgboost": None,
    "svm": {'probability': True}
}

<a id='model_training_eval'></a>

### Model training and evaluation

In [None]:
models, predictions = {}, {}

# Loop over learning methods:
for m in sel_models:
    print(f'Training the {m.replace("_", " ")} model...')
    start_time = datetime.now()

    # Creating the object for K-folds CV estimation (random search only for boosting models outside the baseline run):
    task = {'light_gbm': 'binary', 'xgboost': 'binary:logistic'}.get(m, 'classification')
    model = Kfolds_fit(task=task, method=m,
                       metric='roc_auc', num_folds=5, shuffle=False, pre_selecting=False,
                       random_search=(not BASELINE) and (m in ['light_gbm', 'xgboost']), n_samples=1000,
                       grid_param=baseline_params[m] if BASELINE else grid_params[m],
                       default_param=default_params[m],
                       fixed_params=fixed_params[m])
    
    # Training the model:
    model.fit(train_inputs=X_train, train_output=y_train, test_inputs=X_test, test_output=y_test)
    end_time = datetime.now()
    elapsed_time = running_time(start_time=start_time, end_time=end_time)

    # Predicted scores for test data:
    test_scores = model.test_scores

    # Predicted labels for test data (threshold of 0.5) and confusion-matrix indicators:
    test_scores['y_pred'] = (test_scores['test_score'] > 0.5).astype(int)
    test_scores['fn'] = ((test_scores['y_true']==1) & (test_scores['y_pred']==0)).astype(int)
    test_scores['fp'] = ((test_scores['y_true']==0) & (test_scores['y_pred']==1)).astype(int)
    test_scores['tn'] = ((test_scores['y_true']==0) & (test_scores['y_pred']==0)).astype(int)
    test_scores['tp'] = ((test_scores['y_true']==1) & (test_scores['y_pred']==1)).astype(int)

    # Model assessment:
    fine_tuning[experiment_id]['models'][m] = {'performance_metrics': deepcopy(model.performance_metrics)}
    fine_tuning[experiment_id]['models'][m]['best_param'] = model.best_param
    fine_tuning[experiment_id]['models'][m]['running_time'] = elapsed_time
    fine_tuning[experiment_id]['models'][m]['performance_metrics'].update({
        'test_mcc': matthews_corrcoef(test_scores['y_true'], test_scores['y_pred']),
        'test_acc': np.nanmean(test_scores['y_true']==test_scores['y_pred']),
        'test_prec': np.nansum(test_scores['tp'])/(np.nansum(test_scores['fp']) + np.nansum(test_scores['tp'])),
        'test_rec': np.nansum(test_scores['tp'])/(np.nansum(test_scores['fn']) + np.nansum(test_scores['tp'])),
        'fn_rate': np.nansum(test_scores['fn'])/(np.nansum(test_scores['fn']) + np.nansum(test_scores['tp'])),
        'fp_rate': np.nansum(test_scores['fp'])/(np.nansum(test_scores['fp']) + np.nansum(test_scores['tn'])),
        'conf_matrix': [[int(i) for i in a] for a in confusion_matrix(test_scores['y_true'], test_scores['y_pred'])]
    })

    # Saving the model object and predictions:
    models[m] = model
    predictions[m] = test_scores.copy()

Training the light gbm model...

Streaming output was truncated to the last 5000 lines.

Found `num_iterations` in params. Will use it instead of argument

---------------------------------------------------------------------
Train-test estimation outcomes:


Outcomes from K-folds CV estimation:
   Number of data folds: 5.
   Number of samples for random search: 1000.
   Estimation method: light gbm.
   Metric for choosing best hyper-parameter: roc_auc.
   Best hyper-parameters: {'bagging_fraction': 0.5465824739383021, 'learning_rate': 0.028899981152448805, 'max_depth': 3, 'num_iterations': 800}.
   CV performance metric associated with best hyper-parameters: 0.9189.


Performance metrics evaluated at test data:
   test_roc_auc = 0.9165
   test_prec_avg = 0.9623
   test_brier = 0.1101
---------------------------------------------------------------------


------------------------------------
Running time: 103.68 minutes.
Start time: 2022-05-01, 15:16:38
End time: 2022-05-01, 17:00:19
------------------------------------
------------------------------------
Running time: 103.69 minutes.
Start time: 2022-05-01, 15:16
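
The training cell above turns scores into class predictions with a 0.5 cutoff and derives precision, recall and the false negative/positive rates from per-row indicators. As a minimal sketch (not the notebook's own code), the same quantities can be computed directly from the four confusion counts. The function name `threshold_metrics` and the toy data are illustrative assumptions.

```python
from math import sqrt


def threshold_metrics(y_true, scores, threshold=0.5):
    """Confusion counts and derived rates for binary labels and predicted scores.

    Hypothetical helper: mirrors the metric names used in the notebook,
    but is not part of it.
    """
    y_pred = [1 if s > threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num, den):
        # Undefined ratios are reported as NaN instead of raising an error.
        return num / den if den > 0 else float('nan')

    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        'tp': tp, 'fp': fp, 'tn': tn, 'fn': fn,
        'test_acc': ratio(tp + tn, len(pairs)),
        'test_prec': ratio(tp, tp + fp),
        'test_rec': ratio(tp, tp + fn),
        'fn_rate': ratio(fn, fn + tp),
        'fp_rate': ratio(fp, fp + tn),
        'test_mcc': ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy data (made up for illustration only):
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1, 0.4, 0.7]
metrics = threshold_metrics(y_true, scores)
print(metrics)
```

Note that `fn_rate` equals `1 - test_rec` by construction, so the two metrics always carry the same information.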

#### Feature importances

In [None]:
if 'logistic_regression' in models:
    # Feature importances of logistic regression:
    feat_importances_lr = pd.DataFrame(data={
        'feature': list(X_train.columns),
        'feat_imp': [c for c in models['logistic_regression'].model.coef_[0]],
        'abs_feat_imp': [abs(c) for c in models['logistic_regression'].model.coef_[0]]
    }).sort_values('abs_feat_imp', ascending=False)
    feat_importances_lr.index.name = 'logistic_regression'
    display(feat_importances_lr.head(10))

if 'random_forest' in models:
    # Feature importances of random forest:
    feat_importances_rf = pd.DataFrame(data={
        'feature': list(X_train.columns),
        'feat_imp': [c for c in models['random_forest'].model.feature_importances_]
    }).sort_values('feat_imp', ascending=False)
    feat_importances_rf.index.name = 'random_forest'
    display(feat_importances_rf.head(10))

if 'light_gbm' in models:
    # Feature importances of LightGBM:
    feat_importances_lgb = pd.DataFrame(data={
        'feature': list(X_train.columns),
        'feat_imp': [c for c in models['light_gbm'].model.feature_importance()]
    }).sort_values('feat_imp', ascending=False)
    feat_importances_lgb.index.name = 'light_gbm'
    display(feat_importances_lgb.head(10))

| light_gbm | feature | feat_imp |
|---|---|---|
| 44 | L#number_of_ratings | 1185 |
| 49 | L#num_words_desc | 596 |
| 43 | L#rating | 287 |
| 53 | L#share_known_malwares | 234 |
| 67 | C#category#COMICS | 205 |
| 46 | L#dangerous_permissions_count | 165 |
| 45 | L#price | 164 |
| 86 | C#category#TRAVEL__LOCAL | 138 |
| 73 | C#category#LIBRARIES__DEMO | 120 |
| 52 | L#num_known_malwares | 109 |


#### Analysis of predictions

Distribution of predictions

In [None]:
# Loop over trained models:
for m in models:
    print('---------------------------------------------------------------------')
    print(f'\033[1m{m.replace("_", " ").capitalize()} model:\033[0m\n')
    display(pd.DataFrame(fine_tuning[experiment_id]['models'][m]['performance_metrics']['conf_matrix'],
                         index=['y=0', 'y=1'], columns=['y_hat=0', 'y_hat=1']))
    print('\n')
    display(predictions[m]['test_score'].describe())
    print('\n')
    display(predictions[m]['y_true'].describe())

    plot_histogram(data=predictions[m], x=['test_score'], pos=[(1,1)], by_var=None,
                   x_title=['y_hat'], y_title=['Frequency'],
                   titles=['Histogram of predictions'], width=600, height=450)
    print('---------------------------------------------------------------------\n')

Relationships of predictions with true labels

In [None]:
# Loop over trained models:
for m in models:
    print('-----------------------------------------------------------------------------------')
    print(f'\033[1m{m.replace("_", " ").capitalize()} model:\033[0m\n')
    display(predictions[m].groupby('y_true')[['test_score']].describe())
    print('-----------------------------------------------------------------------------------\n')

# Loop over trained models:
for m in models:
    plot_histogram(data=predictions[m], x=['test_score'], pos=[(1,1)], by_var=['y_true'],
                   barmode='overlay', opacity=0.75,
                   x_title=['y_hat'], y_title=['frequency'],
                   titles=[f'Distribution of predictions by true label - {m}'], width=600, height=450)

# Loop over trained models:
for m in models:
    plot_boxplot(data=predictions[m], x=['y_true'], y=['test_score'], pos=[(1,1)],
                titles=[f'Distribution of predictions by true label - {m}'], width=600, height=450)
    
# Loop over trained models:
for m in models:
    # Rate of y = 1 by decile of scores:
    predictions[m]['decile'] = pd.qcut(predictions[m]['test_score'], q=10)
    y_avg_dec = predictions[m].groupby('decile').mean()[['y_true']].reset_index()
    y_avg_dec['score'] = [str(d) for d in y_avg_dec['decile']]
    plot_bar(data=y_avg_dec, x=['score'], y=['y_true'], pos=[(1,1)],
             titles=[f'Rate of y = 1 by decile of scores - {m}'], width=600, height=450)
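
The last loop above bins the test scores into deciles and plots the share of positive labels in each bin. If the scores rank apps well, that share should rise from the lowest to the highest decile. A minimal sketch of the same computation on synthetic data follows. The helper name `positive_rate_by_decile` and the toy scores are assumptions, and the column is selected before aggregating so that only `y_true` is averaged.

```python
import pandas as pd


def positive_rate_by_decile(scores, y_true, q=10):
    """Share of y = 1 and number of observations per quantile bin of the scores.

    Hypothetical helper written for illustration; not part of the notebook.
    """
    df = pd.DataFrame({'test_score': scores, 'y_true': y_true})
    # duplicates='drop' merges bins whose edges coincide (e.g. many tied scores).
    df['decile'] = pd.qcut(df['test_score'], q=q, duplicates='drop')
    out = (
        df.groupby('decile', observed=True)['y_true']
          .agg(['mean', 'size'])
          .rename(columns={'mean': 'rate_y1', 'size': 'n'})
          .reset_index()
    )
    return out


# Synthetic, perfectly separated scores (illustration only):
scores = [i / 100 for i in range(100)]
y_true = [1 if s >= 0.5 else 0 for s in scores]
rates = positive_rate_by_decile(scores, y_true)
print(rates)
```

With real predictions the rates would not jump from 0 to 1, but a monotonically increasing pattern across bins is the behavior one would hope to see.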

Major errors

In [None]:
# Loop over trained models:
for m in models:
    print('-----------------------------------------------------------------------------------')
    print(f'\033[1m{m.replace("_", " ").capitalize()} model:\033[0m\n')
    display(predictions[m][predictions[m].y_true==0].sort_values('test_score', ascending=False).head(10))
    print('-----------------------------------------------------------------------------------\n')

In [None]:
# Loop over trained models:
for m in models:
    print('-----------------------------------------------------------------------------------')
    print(f'\033[1m{m.replace("_", " ").capitalize()} model:\033[0m\n')
    display(predictions[m][predictions[m].y_true==1].sort_values('test_score', ascending=True).head(10))
    print('-----------------------------------------------------------------------------------\n')

Distribution of features given predictions

In [None]:
# Loop over trained models:
for m in models:
    # Distribution of predictions given features:
    score_dist_feat = pd.concat([df_test[['category']], predictions[m]], axis=1)
    score_dist_feat = score_dist_feat.groupby('category')[['test_score']].mean().reset_index()

    plot_bar(data=score_dist_feat, x=['category'], y=['test_score'], pos=[(1,1)],
            titles=['Average of score by app category'], width=600, height=450)

    # Loop over variables:
    for v in ['L#price', 'L#rating', 'L#share_known', 'L#share_known_malwares']:
        # Distribution of predictions given features:
        score_dist_feat = pd.concat([df_test_scaled[[v]], predictions[m]], axis=1)
        score_dist_feat['decile'] = pd.qcut(score_dist_feat[v], q=10, duplicates='drop')
        score_dist_feat = score_dist_feat.groupby('decile').mean()[['test_score']].reset_index()
        score_dist_feat[v] = [str(d) for d in score_dist_feat['decile']]


        plot_bar(data=score_dist_feat, x=[v], y=['test_score'], pos=[(1,1)],
                titles=[f'Average of score by decile of {v}'], width=600, height=450)

<a id='ensemble_definition'></a>

### Ensemble definition

In [None]:
# Performance of individual models:
indiv_models = pd.DataFrame(data={
    'model': [m for m in models],
    'test_roc_auc': [fine_tuning[experiment_id]['models'][m]['performance_metrics']['test_roc_auc'] for m in models]
}).sort_values('test_roc_auc', ascending=False)

# Subsets of models:
models_subsets = [list(indiv_models['model'].iloc[0:i]) for i in range(2, 6)]

# Loop over selected models:
for sel_models in models_subsets:
    # List of selected models:
    models_ = [models[m].model for m in sel_models]
    weights = [1/(len(models_)) for i in range(len(models_))]

    # Predictions from an ensemble of models:
    ensemble = Ensemble(models=models_, statistic='weighted_mean', weights=weights, task='binary_class')
    test_scores = ensemble.predict(inputs=X_test, predict_class=False)
    test_scores = pd.DataFrame(data={
        'test_score': test_scores, 'y_true': y_test,
        'y_pred': [1 if s > 0.5 else 0 for s in test_scores]
    })

    # Evaluation of predicted classes:
    test_scores['fn'] = ((test_scores['y_true']==1) & (test_scores['y_pred']==0)).astype(int)
    test_scores['fp'] = ((test_scores['y_true']==0) & (test_scores['y_pred']==1)).astype(int)
    test_scores['tn'] = ((test_scores['y_true']==0) & (test_scores['y_pred']==0)).astype(int)
    test_scores['tp'] = ((test_scores['y_true']==1) & (test_scores['y_pred']==1)).astype(int)

    # Model assessment:
    fine_tuning[experiment_id]['models'][f'ensemble_{models_subsets.index(sel_models)}'] = {
        'models': [m for m in sel_models],
        'weights': weights,
        'performance_metrics': {
            'test_roc_auc': roc_auc_score(test_scores['y_true'], test_scores['test_score']),
            'test_prec_avg': average_precision_score(test_scores['y_true'], test_scores['test_score']),
            'test_brier': brier_score_loss(test_scores['y_true'], test_scores['test_score']),
            'test_mcc': matthews_corrcoef(test_scores['y_true'], test_scores['y_pred']),
            'test_acc': np.nanmean(test_scores['y_true']==test_scores['y_pred']),
            'test_prec': np.nansum(test_scores['tp'])/(np.nansum(test_scores['fp']) + np.nansum(test_scores['tp'])),
            'test_rec': np.nansum(test_scores['tp'])/(np.nansum(test_scores['fn']) + np.nansum(test_scores['tp'])),
            'fn_rate': np.nansum(test_scores['fn'])/(np.nansum(test_scores['fn']) + np.nansum(test_scores['tp'])),
            'fp_rate': np.nansum(test_scores['fp'])/(np.nansum(test_scores['fp']) + np.nansum(test_scores['tn'])),
            'conf_matrix': [[int(i) for i in a] for a in confusion_matrix(test_scores['y_true'], test_scores['y_pred'])]
        }
    }

    # Saving model predictions:
    predictions[f'ensemble_{models_subsets.index(sel_models)}'] = test_scores.copy()
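
The cell above combines the top-ranked models with equal weights through the project's `Ensemble` class, whose internals are not shown here. As a rough sketch of what a `'weighted_mean'` statistic presumably computes, the function below averages per-model scores observation by observation. The name `weighted_mean_scores` and the toy scores are assumptions, not the class's actual implementation.

```python
def weighted_mean_scores(score_lists, weights=None):
    """Weighted average of predicted scores across models.

    score_lists: one list of scores per model, all of the same length.
    weights: one weight per model; defaults to equal weights.
    Hypothetical helper for illustration only.
    """
    n_models = len(score_lists)
    if weights is None:
        weights = [1 / n_models] * n_models
    if len(weights) != n_models:
        raise ValueError('one weight per model is required')
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError('weights must sum to 1')
    # zip(*score_lists) iterates over observations, yielding one score per model.
    return [sum(w * s for w, s in zip(weights, obs)) for obs in zip(*score_lists)]


# Toy scores from two hypothetical models:
model_a = [0.9, 0.2, 0.6]
model_b = [0.7, 0.4, 0.2]
ensemble_scores = weighted_mean_scores([model_a, model_b])
ensemble_labels = [1 if s > 0.5 else 0 for s in ensemble_scores]
print(ensemble_scores, ensemble_labels)
```

Equal weights keep the ensemble free of extra tunable parameters. Unequal weights would be a further choice to validate on held-out data.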

#### Exporting model outcomes

In [None]:
# Fixing data types:
for m in [m for m in fine_tuning[experiment_id]['models'].keys() if m in ['light_gbm', 'xgboost']]:
    fine_tuning[experiment_id]['models'][m]['best_param']['max_depth'] = int(
        fine_tuning[experiment_id]['models'][m]['best_param']['max_depth']
    )

    if m=='light_gbm':
        fine_tuning[experiment_id]['models']['light_gbm']['best_param']['num_iterations'] = int(
            fine_tuning[experiment_id]['models']['light_gbm']['best_param']['num_iterations']
        )
    if m=='xgboost':
        fine_tuning[experiment_id]['models']['xgboost']['best_param']['num_boost_round'] = int(
            fine_tuning[experiment_id]['models']['xgboost']['best_param']['num_boost_round']
        )
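
The casts above presumably ensure that integer-valued hyper-parameters are stored as plain Python integers before the results are written to JSON. One known pitfall in this area is that `json.dump` raises a `TypeError` for NumPy integer scalars such as `np.int64`. A hedged alternative to casting each key by hand is a `default` fallback, sketched below. The helper name `to_builtin` and the example values are assumptions.

```python
import json

import numpy as np


def to_builtin(obj):
    """Fallback for json.dumps: convert NumPy scalars and arrays to built-in types.

    Hypothetical helper for illustration; not part of the notebook.
    """
    if isinstance(obj, np.generic):
        return obj.item()
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')


# Example values loosely based on the search output above (illustration only):
best_param = {
    'max_depth': np.int64(3),
    'num_iterations': np.int64(800),
    'learning_rate': np.float64(0.0289),
}
serialized = json.dumps(best_param, default=to_builtin)
print(serialized)
```

This fallback only changes how values are serialized. If a library requires an integer argument at training time, an explicit `int(...)` cast is still needed.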

In [None]:
if EXPORT:
    # Model evaluation:
    with open('../experiments/fine_tuning.json', 'w') as json_file:
        json.dump(fine_tuning, json_file, indent=2)

    # Feature importances:
    for feat_imps in [feat_imps for feat_imps in ['feat_importances_lr', 'feat_importances_rf', 'feat_importances_lgb'] if feat_imps in vars()]:
        eval(feat_imps).to_csv(f'../experiments/feature_importances/{feat_imps}_{experiment_id}.csv', index=False)

    # Predicted scores:
    for m in fine_tuning[experiment_id]['models'].keys():
        predictions[m].to_csv(f'../experiments/predictions/predictions_{m}_{experiment_id}.csv', index=False)

<a id='artifacts'></a>

### Exporting artifacts

#### Assessing the fine tuned pipeline

In [None]:
# Main metrics for each data pipeline:
pipeline_selection = pd.DataFrame(
    data={
        'experiment_id': [e for e in fine_tuning],
        'solution': [fine_tuning[e]['solution'] for e in fine_tuning],
        'test_roc_auc': [fine_tuning[e]['models']['ensemble_0']['performance_metrics']['test_roc_auc'] for e
                         in fine_tuning],
        'test_acc': [fine_tuning[e]['models']['ensemble_0']['performance_metrics']['test_acc'] for e
                     in fine_tuning],
        'test_mcc': [fine_tuning[e]['models']['ensemble_0']['performance_metrics']['test_mcc'] for e
                     in fine_tuning]
          },
    index=[e for e in fine_tuning]
).sort_values(['test_roc_auc', 'test_acc', 'test_mcc'], ascending=[False, False, False])
display(pipeline_selection)

# Collection of best models based on how many times it maximizes the selected metrics:
best_pipelines = []
for metric in ['test_roc_auc', 'test_acc', 'test_mcc']:
    best_pipelines.extend([m for m, v in zip(pipeline_selection.index, pipeline_selection[metric]) if v==max(pipeline_selection[metric])])
best_pipeline = [
                 (k, pipeline_selection.loc[k]['solution']) for k, v in
                 frequency_list(best_pipelines).items() if v==max(frequency_list(best_pipelines).values())
][0]

# Define the best pipeline:
best_pipeline_id = best_pipeline[0]

if best_pipeline_id==experiment_id:
    print(f'\nThere is a new best complete pipeline: {best_pipeline_id}!')

| | experiment_id | solution | test_roc_auc | test_acc | test_mcc |
|---|---|---|---|---|---|
| 1651412222 | 1651412222 | {'which_scale': None, 'which_missings_treat': ... | 0.916764 | 0.835775 | 0.637669 |
| 1651411278 | 1651411278 | {'which_scale': 'min_max_scale', 'which_missin... | 0.916567 | 0.831891 | 0.629339 |
| 1651418141 | 1651418141 | {'which_scale': None, 'which_missings_treat': ... | 0.916489 | 0.835331 | 0.636607 |
| 1651413630 | 1651413630 | {'which_scale': None, 'which_missings_treat': ... | 0.915729 | 0.834443 | 0.635907 |
| 1651343023 | 1651343023 | {'which_scale': 'standard_scale', 'which_missi... | 0.915575 | 0.832557 | 0.631212 |
| 1651344238 | 1651344238 | {'which_scale': 'standard_scale', 'which_missi... | 0.915254 | 0.832668 | 0.631178 |
| 1651340574 | 1651340574 | {'which_scale': 'standard_scale', 'which_missi... | 0.915155 | 0.834998 | 0.637094 |
| 1651352012 | 1651352012 | {'which_scale': 'standard_scale', 'which_missi... | 0.896371 | 0.81225 | 0.593647 |
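
The selection rule in the cell above picks the pipeline that achieves the maximum on the largest number of metrics (`test_roc_auc`, `test_acc`, `test_mcc`). The sketch below reproduces that voting logic with `collections.Counter`, assuming the project's `frequency_list` helper simply counts occurrences. The function name `select_best_pipeline` is hypothetical, and the three rows are copied from the table above, with experiment ids written as strings for illustration.

```python
from collections import Counter


def select_best_pipeline(results, metrics):
    """Return the experiment that maximizes the most metrics, plus the vote counts.

    Hypothetical helper; ties are broken by the order in which votes were cast.
    """
    votes = Counter()
    for metric in metrics:
        best_value = max(r[metric] for r in results.values())
        # Every experiment reaching the maximum of this metric receives one vote.
        votes.update(e for e, r in results.items() if r[metric] == best_value)
    top = max(votes.values())
    best_id = [e for e, v in votes.items() if v == top][0]
    return best_id, dict(votes)


# Three rows taken from the table above:
results = {
    '1651412222': {'test_roc_auc': 0.916764, 'test_acc': 0.835775, 'test_mcc': 0.637669},
    '1651411278': {'test_roc_auc': 0.916567, 'test_acc': 0.831891, 'test_mcc': 0.629339},
    '1651340574': {'test_roc_auc': 0.915155, 'test_acc': 0.834998, 'test_mcc': 0.637094},
}
best_id, votes = select_best_pipeline(results, ['test_roc_auc', 'test_acc', 'test_mcc'])
print(best_id, votes)
```

Because the rule compares exact maxima, differences in the fourth decimal place decide the winner. Whether such gaps are meaningful depends on the variance of the test estimates.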


#### Exports

In [None]:
if EXPORT and ((len(os.listdir('../artifacts')) == 0) or (best_pipeline_id==experiment_id)):
    # Training data for pipeline implementation:
    df_train.to_csv('../artifacts/df_train.csv', index=False)

    # Object of fitted pipeline:
    with open('../artifacts/pipeline.pickle', 'wb') as pickle_file:
        pickle.dump(pipeline, pickle_file)

    # Object of ensemble of trained models:
    with open('../artifacts/ensemble.pickle', 'wb') as pickle_file:
        pickle.dump(ensemble, pickle_file)

    # Variables expected by the model:
    variables = list(df_train_scaled.drop(drop_vars, axis=1).columns)
    with open('../artifacts/variables.json', 'w') as json_file:
        json.dump(variables, json_file, indent=2)
    
    # Model registry:
    fine_tuning[experiment_id]['comment'] = 'Best model (data pipeline and ensemble of trained models) after \
experiments and fine tuning.'
    model_registry = fine_tuning[experiment_id]
    with open('../artifacts/model_registry.json', 'w') as json_file:
        json.dump(model_registry, json_file, indent=2)

    print('Artifacts were updated!')
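
The exported files are what a serving component would later read back. As a hedged sketch of that round trip, the code below saves and reloads stand-in objects using the same file names as the export cell. The functions `save_artifacts` and `load_artifacts`, the temporary folder, and the plain dictionaries standing in for the fitted pipeline and ensemble are all assumptions made for illustration.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_artifacts(folder, pipeline, ensemble, variables):
    """Write pipeline and ensemble as pickles and the variable list as JSON."""
    folder = Path(folder)
    with open(folder / 'pipeline.pickle', 'wb') as f:
        pickle.dump(pipeline, f)
    with open(folder / 'ensemble.pickle', 'wb') as f:
        pickle.dump(ensemble, f)
    with open(folder / 'variables.json', 'w') as f:
        json.dump(variables, f, indent=2)


def load_artifacts(folder):
    """Read back the three artifacts written by save_artifacts."""
    folder = Path(folder)
    with open(folder / 'pipeline.pickle', 'rb') as f:
        pipeline = pickle.load(f)
    with open(folder / 'ensemble.pickle', 'rb') as f:
        ensemble = pickle.load(f)
    with open(folder / 'variables.json') as f:
        variables = json.load(f)
    return pipeline, ensemble, variables


# Stand-in objects; the real artifacts are fitted pipeline/ensemble instances.
with tempfile.TemporaryDirectory() as tmp:
    save_artifacts(tmp, pipeline={'which_scale': None},
                   ensemble={'weights': [0.5, 0.5]},
                   variables=['L#price', 'L#rating'])
    pipeline_loaded, ensemble_loaded, variables_loaded = load_artifacts(tmp)

print(pipeline_loaded, ensemble_loaded, variables_loaded)
```

Unpickling custom classes requires their definitions to be importable in the loading environment, and pickle files should only be loaded from trusted sources.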