## End-to-end machine learning application
## Data preparation

This project integrates the different parts of a machine learning system into an end-to-end ML project. The final product is an app (hypothetically called *AppSafe*) composed of a model that estimates the risk that a mobile app is malware, and an API that could integrate with an app store and warn users when the mobile app they are about to download is too risky.


The project follows the traditional [CRISP-DM](https://pt.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) methodology, so its core stages are data engineering, data preparation, data modeling, and deployment.

-----------

This notebook has two objectives: first, to test the data preparation functions and classes that now make up the Python module "*transformations.py*", available in the *src* folder; second, to write code that uses those functions and classes, so that it can ultimately be inserted into the data modeling notebooks.

Consequently, this notebook implements the following data preparation tasks, after which the data is ready for training and testing machine learning models:
* Early selection of variables (based on their variance).
* Logarithmic transformation of numerical variables.
* Scaling of numerical data.
* Missing values treatment (also numerical variables).
* Transformation of categorical features (missing values treatment and one-hot encoding).
* Outliers treatment.
* Features selection.
* Creation of pipelines that sequentially run all of the above operations.

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing the data](#imports)<a href='#imports'></a>.
  * [Features and labels](#features_labels)<a href='#features_labels'></a>.
  * [Data understanding](#data_und)<a href='#data_und'></a>.

5. [Data preparation](#data_prep)<a href='#data_prep'></a>.
  * [Features classification and early selection](#classif_feat)<a href='#classif_feat'></a>.
  * [Logarithmic transformation](#log_transf)<a href='#log_transf'></a>.
  * [Scaling the data](#data_scaling)<a href='#data_scaling'></a>.
  * [Missing values treatment](#missings_treat)<a href='#missings_treat'></a>.
  * [Transforming categorical variables](#categorical_transf)<a href='#categorical_transf'></a>.
  * [Datasets consistency](#datasets_consistency)<a href='#datasets_consistency'></a>.
  * [Outliers treatment](#outliers_treat)<a href='#outliers_treat'></a>.
  * [Features selection](#features_selection)<a href='#features_selection'></a>.
  * [Full pipeline](#pipeline)<a href='#pipeline'></a>.

<a id='libraries'></a>

## Libraries





In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
cd "/content/gdrive/MyDrive/Studies/end_to_end_ml/notebooks/"

/content/gdrive/MyDrive/Studies/end_to_end_ml/model_dev


In [None]:
# !pip install -r ../requirements.txt

In [None]:
import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import time

from sklearn.linear_model import LogisticRegression

In [None]:
import sys

sys.path.append(
    os.path.abspath(
        os.path.join(
            os.getcwd(), '../src'  # Path relative to the working directory set by the cd cell above.
        )
    )
)

<a id='functions_classes'></a>

## Functions and classes

In [None]:
from utils import classify_variables, assessing_missings, missings_detection, data_consistency
from transformations import LogTransformation, ScaleNumericalVars, TreatMissings, OneHotEncoding, OutliersTreat, Pipeline
from features_selection import FeaturesSelection

<a id='settings'></a>

## Settings

<a id='data_management_settings'></a>

### Data management

In [None]:
# Declare whether outcomes should be exported:
EXPORT = False

<a id='data_prep_settings'></a>

### Data preparation

#### Early selection of variables

In [1]:
DROP_EXCESSIVE_MISS = True # Declare whether variables should be dropped based on the share of missings on training data.
EXCESSIVE_MISS = 0.95 # Share of missings above which a variable is excluded from the dataframe.
DROP_NO_VAR = True # Declare whether variables with insufficient variation should be dropped.
MINIMUM_VAR = 0 # Value of variance below which a variable is excluded from the dataframe.
DROP_BIN_NO_VAR = True # Declare whether binary variables with not enough variability should be dropped from the dataframes.
BIN_MINIMUM_VAR = 0.01 # Minimum variance below which binary variables should be deleted.

#### Data transformations

In [None]:
LOG_TRANSFORM = True # Declare whether to log-transform numerical variables.
WHICH_SCALE = 'standard_scale' # Declare which type of scaling should be applied over numerical variables ('standard_scale', 'min_max_scale', 'no_scale').
SCALE_ALL = False # Declare whether all variables (not only the continuous) are subject to scaling.
WHICH_MISSINGS_TREAT = 'create_binary' # Declares which method of missing values treatment should be implemented ('create_binary', 'impute_stat').
MISSINGS_TREAT_STAT = 'mean' # Declares which statistic should be used for missing values treatment ('mean', 'median').
CAT_TRANSF_VAR = 0.01 # Variance of dummy variables below which the respective categories are dropped out during categorical data transformation.
TREAT_OUTLIERS = True # Indicates whether outliers should be treated.
OUTLIERS_METHOD = 'iqr' # Method for treating outliers.
QUANTILE = 0.025 # Quantile parameter for outliers treatment.
K = 3 # Parameter for IQR outliers treatment.

#### Features selection

In [None]:
NUM_FOLDS = 5 # Parameter of exhaustive methods (RFE, RFECV, sequential selection, random selection).
METRIC = 'roc_auc' # Parameter of exhaustive methods (RFE, RFECV, sequential selection, random selection).
MIN_NUM_FEATS = 10 # Parameter of exhaustive methods (RFECV).
MAX_NUM_FEATS = 80 # Parameter of exhaustive methods (RFE, sequential selection, random selection).
STEP = 5 # Parameter of exhaustive methods (RFE, RFECV, random selection).
DIRECTION = 'forward' # Parameter of exhaustive methods (sequential selection).
REGUL_PARAM = 1.0 # Parameter of exhaustive methods (RFE, RFECV, sequential selection, random selection).

In [None]:
# Grid of hyper-parameters for features selection
grid_fs = {
  'correlation': {
    'threshold': 0.8, 'num_folds': NUM_FOLDS, 'metric': METRIC, 'min_num_feats': MIN_NUM_FEATS, 'max_num_feats': MAX_NUM_FEATS, 'step': STEP,
    'direction': DIRECTION
  },
  'supervised': {
    'threshold': 0, 'num_folds': NUM_FOLDS, 'metric': METRIC, 'min_num_feats': MIN_NUM_FEATS, 'max_num_feats': MAX_NUM_FEATS, 'step': STEP,
    'direction': DIRECTION
  },
  'rfe': {
    'threshold': 0.8, 'num_folds': NUM_FOLDS, 'metric': METRIC, 'min_num_feats': MIN_NUM_FEATS, 'max_num_feats': MAX_NUM_FEATS, 'step': STEP,
    'direction': DIRECTION
  },
  'rfecv': {
    'threshold': 0.8, 'num_folds': NUM_FOLDS, 'metric': METRIC, 'min_num_feats': MIN_NUM_FEATS, 'max_num_feats': MAX_NUM_FEATS, 'step': STEP,
    'direction': DIRECTION
  },
  'sequential': {
    'threshold': 0.8, 'num_folds': NUM_FOLDS, 'metric': METRIC, 'min_num_feats': MIN_NUM_FEATS, 'max_num_feats': MAX_NUM_FEATS, 'step': STEP,
    'direction': DIRECTION
  },
  'random_selection': {
    'threshold': 0.8, 'num_folds': NUM_FOLDS, 'metric': METRIC, 'min_num_feats': MIN_NUM_FEATS, 'max_num_feats': MAX_NUM_FEATS, 'step': STEP,
    'direction': DIRECTION
  }
}
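
Every method in the grid shares the same settings except `threshold`, so the grid could also be built programmatically. The following is a minimal sketch: the names `shared_params`, `thresholds`, and `grid_fs_compact` are illustrative and not part of the project, and the constants repeat the values set above only so that the block runs on its own.

```python
# Values repeated from the settings cells above so the sketch is self-contained.
NUM_FOLDS, METRIC = 5, 'roc_auc'
MIN_NUM_FEATS, MAX_NUM_FEATS = 10, 80
STEP, DIRECTION = 5, 'forward'

# Settings shared by every features selection method.
shared_params = {
    'num_folds': NUM_FOLDS, 'metric': METRIC,
    'min_num_feats': MIN_NUM_FEATS, 'max_num_feats': MAX_NUM_FEATS,
    'step': STEP, 'direction': DIRECTION,
}

# Only the threshold differs across methods.
thresholds = {
    'correlation': 0.8, 'supervised': 0, 'rfe': 0.8,
    'rfecv': 0.8, 'sequential': 0.8, 'random_selection': 0.8,
}

grid_fs_compact = {
    method: {'threshold': threshold, **shared_params}
    for method, threshold in thresholds.items()
}
```

With this layout, changing a shared setting requires touching a single dictionary.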

<a id='imports'></a>

## Importing the data

<a id='features_labels'></a>

### Features and labels

#### Training data

In [None]:
df_train = pd.read_csv('../data/training_data.csv', dtype={'app_id': int})

print(f'Shape of df_train: {df_train.shape}.')
print(f'Number of unique instances: {df_train.app_id.nunique()}.')

# Auxiliary variables:
drop_vars = ['app', 'package', 'class', 'app_id', 'related_apps', 'description']

df_train.head(3)

Shape of df_train: (18298, 191).
Number of unique instances: 18298.


Unnamed: 0,app,package,category,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,...,your_personal_information_write_contact_data,your_personal_information_write_to_user_defined_dictionary,class,app_id,num_related_apps,num_words_desc,num_known_apps,share_known,num_known_malwares,share_known_malwares
0,Ambient Soothing Sounds: Beach,com.zeddev.chillbeach1,Health & Fitness,The soothing sounds on a long and seamless loo...,3.6,122,0.0,"com.zeddev.chillmeadow1, com.droiddevz.ambient...",1.0,1,...,0,0,0,6565,4.0,42.0,0.0,0.0,0.0,
1,Aurora,jiang.joyworks.aurora,Brain & Puzzle,This is one great &quot;Escape Game&quot; <p>Y...,3.8,24,1.41,com.firemaplegames.games.the_secretofgrislyman...,1.0,0,...,0,0,1,4772,4.0,251.0,0.0,0.0,0.0,
2,Tank Ace 1944,com.resetgame.tankace1944,Arcade & Action,In Tank Ace 1944 you command a World War II ta...,3.7,20,4.99,"ru.sibteam.classictankfull, nl.ejsoft.mortalsk...",0.0,0,...,0,0,1,20856,4.0,341.0,0.0,0.0,0.0,


Missing data

In [None]:
missings_train = pd.DataFrame(data={
    'feature': df_train.isnull().sum().index,
    'num_missings': df_train.isnull().sum().values,
    'share_missings': [v/len(df_train) for v in df_train.isnull().sum().values]
}).sort_values('num_missings', ascending=False)

missings_train.head(10)

Unnamed: 0,feature,num_missings,share_missings
190,share_known_malwares,10047,0.549076
185,num_related_apps,484,0.026451
189,num_known_malwares,484,0.026451
188,share_known,484,0.026451
187,num_known_apps,484,0.026451
7,related_apps,484,0.026451
8,dangerous_permissions_count,129,0.00705
3,description,3,0.000164
186,num_words_desc,3,0.000164
0,app,1,5.5e-05


#### Test data

In [None]:
df_test = pd.read_csv('../data/test_data.csv', dtype={'app_id': int})

print(f'Shape of df_test: {df_test.shape}.')
print(f'Number of unique instances: {df_test.app_id.nunique()}.')

df_test.head(3)

Shape of df_test: (9012, 191).
Number of unique instances: 9012.


Unnamed: 0,app,package,category,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,...,your_personal_information_write_contact_data,your_personal_information_write_to_user_defined_dictionary,class,app_id,num_related_apps,num_words_desc,num_known_apps,share_known,num_known_malwares,share_known_malwares
0,Dirty Jokes,com.appspot.swisscodemonkeys.dirty,Entertainment,The best Dirty Jokes app for Android!<p>#1 Fre...,4.0,2470,0.0,"com.gonzotech.dirty_jokes, com.comic.lastlaugh...",1.0,1,...,0,0,0,5804,4.0,82,1.0,0.25,1.0,1.0
1,Animal Sounds with Photos,com.teachersparadise.animalsoundsphotos,Education,Let kids explore the animal kingdom by learnin...,3.8,168,0.0,"com.papainteractive, com.teachersparadise.days...",2.0,0,...,0,0,0,13224,4.0,37,2.0,0.5,0.0,0.0
2,Mini Catch,com.airylabs.games.minicatch,Brain & Puzzle,"From Airy Labs, acclaimed developer of the bes...",3.0,1,0.0,"com.oscarmikegames.Bloxus, com.concretesoftwar...",2.0,1,...,0,0,1,14752,4.0,244,0.0,0.0,0.0,


Missing data

In [None]:
missings_test = pd.DataFrame(data={
    'feature': df_test.isnull().sum().index,
    'num_missings': df_test.isnull().sum().values,
    'share_missings': [v/len(df_test) for v in df_test.isnull().sum().values]
}).sort_values('num_missings', ascending=False)

missings_test.head(10)

Unnamed: 0,feature,num_missings,share_missings
190,share_known_malwares,5072,0.562805
185,num_related_apps,236,0.026187
189,num_known_malwares,236,0.026187
188,share_known,236,0.026187
187,num_known_apps,236,0.026187
7,related_apps,236,0.026187
8,dangerous_permissions_count,72,0.007989
122,system_tools_retrieve_running_applications,0,0.0
131,system_tools_write_sync_settings,0,0.0
123,system_tools_send_package_removed_broadcast,0,0.0


<a id='data_und'></a>

### Data understanding

In [None]:
data_und = pd.read_csv('../data/features.csv')

print(f'Shape of data_und: {data_und.shape}.')
print(f'Number of unique instances: {data_und.feature.nunique()}.')

data_und.head(3)

Shape of data_und: (191, 8).
Number of unique instances: 191.


Unnamed: 0,feature,type,n_unique,sample_values,num_missings,share_missings,var_class,category
0,app,object,22823,['Alabama Crimson Tide News' 'Blood Demon Movi...,1,3.7e-05,categorical,app_attributes
1,package,object,23485,['com.estrongs.android.pop.app.shortcut' 'com....,0,0.0,categorical,app_attributes
2,category,object,30,['Shopping' 'Racing' 'Productivity' 'Sports Ga...,0,0.0,categorical,app_attributes


<a id='data_prep'></a>

## Data preparation

<a id='classif_feat'></a>

### Features classification and early selection

In [None]:
class_variables = classify_variables(dataframe=df_train, vars_to_drop=drop_vars, test_data=df_test,
                                     drop_excessive_miss=DROP_EXCESSIVE_MISS, excessive_miss=EXCESSIVE_MISS,
                                     drop_no_var=DROP_NO_VAR, minimum_var=MINIMUM_VAR)

# Lists of variables:
cat_vars = class_variables['cat_vars']
binary_vars = class_variables['binary_vars']
cont_vars = class_variables['cont_vars']

Initial number of features: 185.
0 features were dropped for excessive number of missings!
29 features were dropped for having no variance!
156 remaining features.




#### Selecting binary variables based on their variances

In [None]:
if DROP_BIN_NO_VAR:
  # Dropping features with no variance in the training data:
  bin_no_variance = [c for c in binary_vars  if np.nanvar(df_train[c])<=BIN_MINIMUM_VAR]
  print(f'{len(bin_no_variance)} binary variables were dropped for having variance inferior to {BIN_MINIMUM_VAR}.\n')

  print(f'Shape of df_train (before dropping binary variables): {df_train.shape}.')
  df_train = df_train.drop(bin_no_variance, axis=1)
  print(f'Shape of df_train (after dropping binary variables): {df_train.shape}.\n')

  print(f'Shape of df_test (before dropping binary variables): {df_test.shape}.')
  df_test = df_test.drop(bin_no_variance, axis=1)
  print(f'Shape of df_test (after dropping binary variables): {df_test.shape}.')

101 binary variables were dropped for having variance inferior to 0.01.

Shape of df_train (before dropping binary variables): (18298, 162).
Shape of df_train (after dropping binary variables): (18298, 61).

Shape of df_test (before dropping binary variables): (9012, 162).
Shape of df_test (after dropping binary variables): (9012, 61).
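
For a binary variable with a share p of ones, the population variance (the quantity `np.nanvar` computes by default) equals p(1 - p). A threshold of `BIN_MINIMUM_VAR = 0.01` therefore drops flags that are set in roughly 1% of the apps or fewer (or, symmetrically, in about 99% or more). The sketch below checks this arithmetic in plain Python; `min_share_for_variance` is a hypothetical helper, not a project function.

```python
import math

def min_share_for_variance(min_var):
    """Return the smaller share p of ones that solves p * (1 - p) == min_var."""
    return (1 - math.sqrt(1 - 4 * min_var)) / 2

p_min = min_share_for_variance(0.01)

# A flag set in exactly 1 of 100 rows: population variance is mean(x^2) - mean(x)^2.
rare_flag = [1] + [0] * 99
mean_flag = sum(rare_flag) / len(rare_flag)
rare_var = sum(v ** 2 for v in rare_flag) / len(rare_flag) - mean_flag ** 2

# Under the condition used above (variance <= BIN_MINIMUM_VAR), this flag would be dropped.
would_be_dropped = rare_var <= 0.01
```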


<a id='log_transf'></a>

### Logarithmic transformation

In [None]:
if LOG_TRANSFORM:
    # Variables that should be log-transformed:
    to_log = [c for c in df_train.columns if c in cont_vars]

    # Object for log-transforming:
    log_transf = LogTransformation(to_log=to_log)

    # Training data:
    df_train = log_transf.fit_transform(data=df_train)

    # Test data:
    df_test = log_transf.fit_transform(data=df_test)

else:
    print('\033[1mNo transformation performed!\033[0m')
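
`LogTransformation` is defined in *transformations.py* and is not shown in this notebook. Judging by the `L#` prefix used in later cells, it appears to create prefixed log columns. The sketch below assumes a `log(x + 1)` transformation and assumes that the original columns are replaced; both assumptions may differ from the actual class, and `log_transform_sketch` is a hypothetical name.

```python
import numpy as np
import pandas as pd

def log_transform_sketch(data, to_log):
    """Replace each column in to_log by 'L#<name>' = log(1 + x); NaNs stay NaN."""
    out = data.copy()
    for col in to_log:
        out[f'L#{col}'] = np.log1p(out[col])
    # Assumption: the raw columns are dropped once their log version exists.
    return out.drop(columns=to_log)

toy = pd.DataFrame({'price': [0.0, 1.41, np.nan], 'class': [0, 1, 1]})
toy_log = log_transform_sketch(toy, to_log=['price'])
```

Under these assumptions the transformation is stateless: nothing is learned from the data, so applying it separately to training and test data does not leak information.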

<a id='data_scaling'></a>

### Scaling the data

In [None]:
if (WHICH_SCALE in ['standard_scale', 'min_max_scale']) and not SCALE_ALL:
    # Variables that should be scaled:
    to_scale = [c for c in df_train.columns if ('L#' in c)]

    # Object for scaling numerical data:
    scale_transf = ScaleNumericalVars(to_scale=to_scale, which_scale=WHICH_SCALE)
    scale_transf.fit(training_data=df_train)

    # Training data:
    df_train_scaled = scale_transf.transform(data=df_train)

    # Test data:
    df_test_scaled = scale_transf.transform(data=df_test)

else:
    df_train_scaled = df_train.copy()
    df_test_scaled = df_test.copy()

    print('\033[1mNo transformation performed!\033[0m')
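
The fit/transform split used above matters because the scaling statistics should come from the training data only and then be reused on the test data. The following sketch illustrates that pattern for standard scaling; `StandardScaleSketch` is a hypothetical stand-in, and the real `ScaleNumericalVars` class may be implemented differently.

```python
import pandas as pd

class StandardScaleSketch:
    """Standard scaling with statistics learned from the training data only."""

    def __init__(self, to_scale):
        self.to_scale = list(to_scale)

    def fit(self, training_data):
        # pandas skips NaNs by default when computing these statistics.
        self.means_ = training_data[self.to_scale].mean()
        self.stds_ = training_data[self.to_scale].std(ddof=0)
        return self

    def transform(self, data):
        out = data.copy()
        out[self.to_scale] = (out[self.to_scale] - self.means_) / self.stds_
        return out

toy_train = pd.DataFrame({'L#x': [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({'L#x': [2.0, 4.0]})

scaler = StandardScaleSketch(to_scale=['L#x']).fit(toy_train)
toy_train_scaled = scaler.transform(toy_train)
toy_test_scaled = scaler.transform(toy_test)  # Uses the training mean and std.
```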

<a id='missings_treat'></a>

### Missing values treatment

#### Variables with missings

In [None]:
data_und[data_und.num_missings>0].sort_values('num_missings', ascending=False)

Unnamed: 0,feature,type,n_unique,sample_values,num_missings,share_missings,var_class,category
190,share_known_malwares,float64,7,"[nan, 1.0, 0.0, 0.5, 0.3333333333333333, 0.666...",10047,0.549076,numerical,app_attributes
7,related_apps,object,23868,['{com.warting.blogg.wis_trevortransdgtl_feed_...,720,0.026364,categorical,app_attributes
185,num_related_apps,float64,4,"[4.0, 1.0, nan, 3.0, 2.0]",484,0.026451,numerical,app_attributes
187,num_known_apps,float64,5,"[0.0, 1.0, 2.0, 3.0, nan, 4.0]",484,0.026451,numerical,app_attributes
188,share_known,float64,7,"[0.0, 0.25, 0.5, 0.75, nan, 0.3333333333333333...",484,0.026451,numerical,app_attributes
189,num_known_malwares,float64,5,"[0.0, 1.0, 3.0, 2.0, nan, 4.0]",484,0.026451,numerical,app_attributes
8,dangerous_permissions_count,float64,28,[ 4. 13. 12. 0. 22. 15. 11. 10. 7. 19.],201,0.00736,numerical,actions_others
3,description,object,23552,"['Enjoy Navionics??? Anytime, Anywhere.<p>The ...",3,0.00011,categorical,app_attributes
186,num_words_desc,float64,700,[ 23. 490. 234. nan 183. 593. 660. 587. 626. ...,3,0.000164,numerical,app_attributes
0,app,object,22823,['Alabama Crimson Tide News' 'Blood Demon Movi...,1,3.7e-05,categorical,app_attributes


Missings by label

In [None]:
# Observations with y = 0:
missings_y0_df = pd.DataFrame(data={
    'feature': df_train_scaled[df_train_scaled['class']==0].isnull().sum().index,
    'num_missings_y0': df_train_scaled[df_train_scaled['class']==0].isnull().sum().values,
    'share_missings_y0': [v/len(df_train_scaled[df_train_scaled['class']==0]) for v in df_train_scaled[df_train_scaled['class']==0].isnull().sum().values]
}).sort_values('num_missings_y0', ascending=False)

# Observations with y = 1:
missings_y1_df = pd.DataFrame(data={
    'feature': df_train_scaled[df_train_scaled['class']==1].isnull().sum().index,
    'num_missings_y1': df_train_scaled[df_train_scaled['class']==1].isnull().sum().values,
    'share_missings_y1': [v/len(df_train_scaled[df_train_scaled['class']==1]) for v in df_train_scaled[df_train_scaled['class']==1].isnull().sum().values]
}).sort_values('num_missings_y1', ascending=False)

missings_by_label_df = missings_y0_df.merge(missings_y1_df, on='feature', how='left').sort_values('num_missings_y1', ascending=False)
missings_by_label_df.head(10)

Unnamed: 0,feature,num_missings_y0,share_missings_y0,num_missings_y1,share_missings_y1
0,L#share_known_malwares,2698,0.442949,7349,0.602032
4,L#num_known_apps,51,0.008373,433,0.035471
5,L#num_related_apps,51,0.008373,433,0.035471
1,L#num_known_malwares,51,0.008373,433,0.035471
3,L#share_known,51,0.008373,433,0.035471
2,related_apps,51,0.008373,433,0.035471
8,L#dangerous_permissions_count,1,0.000164,128,0.010486
31,app,0,0.0,1,8.2e-05
36,hardware_controls_control_flashlight,0,0.0,0,0.0
46,services_that_cost_you_money_directly_call_pho...,0,0.0,0,0.0
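
The same per-label shares of missing values can be obtained with a single `groupby`. The sketch below uses toy data with made-up column names `x` and `y`; only the `class` column mirrors the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'class': [0, 0, 1, 1],
    'x': [1.0, np.nan, np.nan, np.nan],
    'y': [1.0, 2.0, 3.0, np.nan],
})

# Mean of the boolean missing-value mask within each label = share of missings by label.
share_by_label = toy.drop(columns='class').isnull().groupby(toy['class']).mean().T
share_by_label.columns = [f'share_missings_y{label}' for label in share_by_label.columns]
```

Each row of `share_by_label` is a feature and each column is the share of missing values for one label, which makes differences between labels easy to spot.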


Missings by observation

In [None]:
missings_rows_df = pd.DataFrame(data={
    'idx_obs': df_train.T.isnull().sum().index,
    'num_missings': df_train.T.isnull().sum().values,
    'share_missings': [v/len(df_train) for v in df_train.T.isnull().sum().values]
}).sort_values('num_missings', ascending=False)
missings_rows_df.head(10)

Unnamed: 0,idx_obs,num_missings,share_missings
8766,8766,7,0.000383
5241,5241,7,0.000383
16559,16559,7,0.000383
4378,4378,7,0.000383
4853,4853,7,0.000383
7699,7699,7,0.000383
10178,10178,7,0.000383
12813,12813,7,0.000383
14052,14052,7,0.000383
13245,13245,7,0.000383


In [None]:
display(missings_rows_df.num_missings.describe())
print('\n')
display(missings_rows_df.num_missings.value_counts())

count    18298.000000
mean         0.688764
std          1.045568
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          7.000000
Name: num_missings, dtype: float64





1    9561
0    8249
6     355
7     129
2       3
3       1
Name: num_missings, dtype: int64

#### Treating missing values

In [None]:
if SCALE_ALL==False:
    # Object for missing values treatment:
    vars_to_treat = [c for c in list(df_train_scaled.columns) if (c not in drop_vars) & (c not in cat_vars) &
                    (df_train_scaled[c].isnull().sum() > 0)]
    missings_treat = TreatMissings(vars_to_treat=vars_to_treat, method=WHICH_MISSINGS_TREAT, drop_vars=drop_vars, cat_vars=cat_vars,
                                   statistic=MISSINGS_TREAT_STAT, treat_remaining=True)

    # Training data:
    df_train_scaled = missings_treat.fit_transform(data=df_train_scaled, training_data=df_train_scaled)
    
    # Test data:
    df_test_scaled = missings_treat.fit_transform(data=df_test_scaled, training_data=df_train_scaled)

else:
    print('\033[1mNo transformation performed!\033[0m')
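
`TreatMissings` is defined in *transformations.py* and is not shown here. One common reading of a `'create_binary'` treatment is to add an indicator column for each variable with missings and then fill the missing entries with a constant. The sketch below follows that reading; the `NA#` prefix, the fill value of 0, and the name `flag_and_fill_sketch` are all assumptions.

```python
import numpy as np
import pandas as pd

def flag_and_fill_sketch(data, vars_to_treat, fill_value=0.0):
    """Add a 0/1 missing-value indicator per variable, then fill its NaNs."""
    out = data.copy()
    for col in vars_to_treat:
        out[f'NA#{col}'] = out[col].isnull().astype(int)  # 1 where the value was missing.
        out[col] = out[col].fillna(fill_value)
    return out

toy = pd.DataFrame({'L#share_known': [0.3, np.nan, -1.2]})
toy_treated = flag_and_fill_sketch(toy, vars_to_treat=['L#share_known'])
```

If the variable was standard-scaled with training statistics beforehand, filling with 0 amounts to imputing the training mean, while the indicator lets a model learn a separate effect for missing values.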

<a id='categorical_transf'></a>

### Transforming categorical variables

In [None]:
# Object for applying one-hot encoding:
categorical_transf = OneHotEncoding(categorical_features=cat_vars, variance_param=CAT_TRANSF_VAR)
categorical_transf.fit(training_data=df_train_scaled)

# Training data:
df_train_scaled = categorical_transf.transform(data=df_train_scaled)

# Test data:
df_test_scaled = categorical_transf.transform(data=df_test_scaled)

print(f'\033[1mShape of df_train_scaled:\033[0m {df_train_scaled.shape}.')
print(f'\033[1mShape of df_test_scaled:\033[0m {df_test_scaled.shape}.')

Shape of df_train_scaled: (18298, 93).
Shape of df_test_scaled: (9012, 93).
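
Fitting the encoder on the training data fixes the set of dummy columns, so training and test data end up with the same structure even when the test data contains unseen categories. The sketch below illustrates one way this could work, including the variance filter suggested by `CAT_TRANSF_VAR`. `OneHotSketch` and the `C#` column prefix are hypothetical; the real `OneHotEncoding` class is not shown and may differ.

```python
import pandas as pd

class OneHotSketch:
    """One-hot encoding that keeps only categories with enough variance in training data."""

    def __init__(self, categorical_features, variance_param=0.01):
        self.categorical_features = list(categorical_features)
        self.variance_param = variance_param

    def fit(self, training_data):
        self.kept_ = {}
        for col in self.categorical_features:
            dummies = pd.get_dummies(training_data[col], dtype=float)
            variances = dummies.var(ddof=0)
            # Keep categories whose dummy variance exceeds the threshold.
            self.kept_[col] = list(variances[variances > self.variance_param].index)
        return self

    def transform(self, data):
        out = data.copy()
        for col, categories in self.kept_.items():
            for cat in categories:
                out[f'C#{col}#{cat}'] = (out[col] == cat).astype(int)
            out = out.drop(columns=col)
        return out

toy_train = pd.DataFrame({'category': ['a'] * 50 + ['b'] * 49 + ['c']})
toy_test = pd.DataFrame({'category': ['a', 'c', 'd']})  # 'd' never appears in training.

encoder = OneHotSketch(categorical_features=['category'], variance_param=0.01).fit(toy_train)
toy_train_enc = encoder.transform(toy_train)
toy_test_enc = encoder.transform(toy_test)
```

In this toy example, category `c` appears in 1% of the training rows, so its dummy variance (0.0099) falls below the threshold and it is dropped.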


<a id='datasets_consistency'></a>

### Datasets consistency

In [None]:
if SCALE_ALL==False:
    # Assessing missing values (training data):
    missings_detection(df_train_scaled.drop([v for v in drop_vars if v!='class'], axis=1), name=f'df_train_scaled')

    # Assessing missing values (test data):
    missings_detection(df_test_scaled.drop([v for v in drop_vars if v!='class'], axis=1), name=f'df_test_scaled')

    # Checking datasets structure:
    df_test_scaled = data_consistency(dataframe=df_train_scaled,
                                      test_data=df_test_scaled)['test_data']

Training and test data are consistent with each other.


#### Scaling all variables

In [None]:
if SCALE_ALL:
    to_scale = [c for c in df_train_scaled.columns if (c not in drop_vars)]

    # Object for scaling numerical data:
    scale_transf = ScaleNumericalVars(to_scale=to_scale, which_scale=WHICH_SCALE)
    scale_transf.fit(training_data=df_train_scaled)

    # Training data:
    df_train_scaled = scale_transf.transform(data=df_train_scaled)
    new_vars_scale = list(df_train_scaled.drop(drop_vars, axis=1).columns)

    # Test data:
    df_test_scaled = scale_transf.transform(data=df_test_scaled)

    # Object for missing values treatment:
    vars_to_treat = [c for c in list(df_train_scaled.columns) if (c not in drop_vars) & (c not in cat_vars) &
                    (df_train_scaled[c].isnull().sum() > 0)]
    missings_treat = TreatMissings(vars_to_treat=vars_to_treat, method=WHICH_MISSINGS_TREAT, drop_vars=drop_vars, cat_vars=[],
                                   statistic=MISSINGS_TREAT_STAT, treat_remaining=True)
    
    # Training data:
    df_train_scaled = missings_treat.fit_transform(data=df_train_scaled, training_data=df_train_scaled)

    # Test data:
    df_test_scaled = missings_treat.fit_transform(data=df_test_scaled, training_data=df_train_scaled)

    # Checking datasets structure:
    df_test_scaled = data_consistency(dataframe=df_train_scaled,
                                      test_data=df_test_scaled)['test_data']

    # Object for scaling numerical data:
    new_vars_scale = [c for c in list(df_train_scaled.drop(drop_vars, axis=1).columns) if c not in new_vars_scale]
    scale_transf = ScaleNumericalVars(to_scale=new_vars_scale, which_scale=WHICH_SCALE)
    scale_transf.fit(training_data=df_train_scaled)

    # Training data:
    df_train_scaled = scale_transf.transform(data=df_train_scaled)

    # Test data:
    df_test_scaled = scale_transf.transform(data=df_test_scaled)

<a id='outliers_treat'></a>

### Outliers treatment

In [None]:
if TREAT_OUTLIERS:
    outliers_treat = OutliersTreat(vars_to_treat=[f'L#{c}' for c in cont_vars], method=OUTLIERS_METHOD, quantile=QUANTILE, k=K)
    outliers_treat.fit(training_data=df_train_scaled)
    df_train_scaled = outliers_treat.transform(data=df_train_scaled)
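
With `OUTLIERS_METHOD = 'iqr'`, a typical rule treats as outliers the values outside the interval [Q1 - k·IQR, Q3 + k·IQR], where IQR = Q3 - Q1 and the quartiles come from the training data. The sketch below clips values to those bounds; whether the actual `OutliersTreat` class clips, removes, or otherwise handles such values is not shown here, so `IQRClipSketch` is only an illustration.

```python
import numpy as np
import pandas as pd

class IQRClipSketch:
    """Clip values to [Q1 - k * IQR, Q3 + k * IQR], with quartiles from training data."""

    def __init__(self, vars_to_treat, k=3):
        self.vars_to_treat = list(vars_to_treat)
        self.k = k

    def fit(self, training_data):
        q1 = training_data[self.vars_to_treat].quantile(0.25)
        q3 = training_data[self.vars_to_treat].quantile(0.75)
        iqr = q3 - q1
        self.lower_ = q1 - self.k * iqr
        self.upper_ = q3 + self.k * iqr
        return self

    def transform(self, data):
        out = data.copy()
        for col in self.vars_to_treat:
            out[col] = out[col].clip(lower=self.lower_[col], upper=self.upper_[col])
        return out

toy_train = pd.DataFrame({'L#x': np.arange(101, dtype=float)})  # Q1 = 25, Q3 = 75, IQR = 50.
toy_new = pd.DataFrame({'L#x': [1000.0, -500.0, 50.0]})

clipper = IQRClipSketch(vars_to_treat=['L#x'], k=3).fit(toy_train)
toy_clipped = clipper.transform(toy_new)
```

With `k = 3`, the bounds in this toy example are -125 and 225, so only extreme values are affected.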

<a id='features_selection'></a>

### Features selection

In [None]:
# Dataframe with only continuous variables:
cont_train_df = df_train_scaled[[f'L#{c}' for c in cont_vars]]

for m in grid_fs:
  try:
    # Features selection:
    selection = FeaturesSelection(method=m, 
                                  threshold=grid_fs[m]['threshold'],
                                  num_folds=grid_fs[m]['num_folds'], metric=grid_fs[m]['metric'], min_num_feats=grid_fs[m]['min_num_feats'],
                                  max_num_feats=grid_fs[m]['max_num_feats'], step=grid_fs[m]['step'],
                                  direction=grid_fs[m]['direction'])
    selection.select_features(inputs=cont_train_df if m in ['variance', 'correlation'] else df_train_scaled.drop(drop_vars, axis=1),
                              output=df_train_scaled['class'],
                              estimator=LogisticRegression(penalty='l1', solver='liblinear', C=REGUL_PARAM))
    selected_features = selection.selected_features
    print(f'\033[1m{len(selected_features)} variables were chosen based on {m}!\033[0m')
  
  except Exception as error:
    print(error)
    print(f'\033[1mError during features selection based on {m}!\033[0m')

From 11 features, 9 were selected!
9 variables were chosen based on correlation!
From 87 features, 76 were selected!
76 variables were chosen based on supervised!
From 87 features, 1 were selected!
From 87 features, 2 were selected!
From 87 features, 3 were selected!
From 87 features, 4 were selected!
From 87 features, 5 were selected!
From 87 features, 6 were selected!
From 87 features, 7 were selected!
From 87 features, 8 were selected!
From 87 features, 9 were selected!
From 87 features, 10 were selected!
From 87 features, 11 were selected!
From 87 features, 12 were selected!
From 87 features, 13 were selected!
From 87 features, 14 were selected!
From 87 features, 15 were selected!
From 87 features, 16 were selected!
From 87 features, 17 were selected!
From 87 features, 18 were selected!
From 87 features, 19 were selected!
From 87 features, 20 were selected!
From 87 features, 21 were selected!
From 87 features, 22 were selected!
From 87 features, 23 were selected!
From 87 features, 24 were selected!
From 87 features, 25 were selected!

Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 26 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 27 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 28 were selected!



Liblinear failed to converge, increase the number of iterations.



From 87 features, 29 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 30 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 31 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 32 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 33 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 34 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



From 87 features, 35 were selected!



Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.


Liblinear failed to converge, increase the number of iterations.



KeyboardInterrupt: ignored

<a id='pipeline'></a>

### Full pipeline

#### Importing the data

In [None]:
# Importing training data:
df_train2 = pd.read_csv('../data/training_data.csv', dtype={'app_id': int})
print(f'Shape of df_train2: {df_train2.shape}.')
print(f'Number of unique instances: {df_train2.app_id.nunique()}.')

# Importing test data:
df_test2 = pd.read_csv('../data/test_data.csv', dtype={'app_id': int})
print(f'Shape of df_test2: {df_test2.shape}.')
print(f'Number of unique instances: {df_test2.app_id.nunique()}.')

# Classifying variables:
class_variables = classify_variables(dataframe=df_train2, vars_to_drop=drop_vars, test_data=df_test2,
                                     drop_excessive_miss=DROP_EXCESSIVE_MISS, excessive_miss=EXCESSIVE_MISS,
                                     drop_no_var=DROP_NO_VAR, minimum_var=MINIMUM_VAR)

# Lists of variables:
cat_vars2, binary_vars2, cont_vars2 = class_variables['cat_vars'], class_variables['binary_vars'], class_variables['cont_vars']

if DROP_BIN_NO_VAR:
  # Dropping features with no variance in the training data:
  bin_no_variance = [c for c in binary_vars2 if np.nanvar(df_train2[c]) <= BIN_MINIMUM_VAR]
  print(f'{len(bin_no_variance)} binary variables were dropped for having variance inferior to {BIN_MINIMUM_VAR}.\n')

  print(f'Shape of df_train2 (before dropping binary variables): {df_train2.shape}.')
  df_train2 = df_train2.drop(bin_no_variance, axis=1)
  print(f'Shape of df_train2 (after dropping binary variables): {df_train2.shape}.\n')

  print(f'Shape of df_test2 (before dropping binary variables): {df_test2.shape}.')
  df_test2 = df_test2.drop(bin_no_variance, axis=1)
  print(f'Shape of df_test2 (after dropping binary variables): {df_test2.shape}.')

Shape of df_train2: (18298, 191).
Number of unique instances: 18298.
Shape of df_test2: (9012, 191).
Number of unique instances: 9012.
Initial number of features: 185.
0 features were dropped for excessive number of missings!
29 features were dropped for having no variance!
156 remaining features.


101 binary variables were dropped for having variance inferior to 0.01.

Shape of df_train2 (before dropping binary variables): (18298, 162).
Shape of df_train2 (after dropping binary variables): (18298, 61).

Shape of df_test2 (before dropping binary variables): (9012, 162).
Shape of df_test2 (after dropping binary variables): (9012, 61).
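
The output above reports 101 binary variables dropped at a threshold of 0.01. For a 0/1 variable whose share of ones is p, the variance is p(1 - p), so this threshold removes variables whose rarer value appears in roughly 1% of the rows or fewer. The sketch below illustrates this on a toy data frame; the column names and values are hypothetical and only the `np.nanvar` filter mirrors the cell above.

```python
import numpy as np
import pandas as pd

BIN_MINIMUM_VAR = 0.01  # same threshold value as reported in the output above

# Toy data (hypothetical): three binary columns with different shares of ones.
toy = pd.DataFrame({
    'rare_flag': [1] + [0] * 199,                   # p = 0.005
    'common_flag': [1, 0] * 100,                    # p = 0.5
    'flag_with_nan': [1.0, np.nan] + [0.0] * 198,   # missing value is ignored by nanvar
})

# Variance of each column, skipping missing values:
variances = {c: float(np.nanvar(toy[c].to_numpy(dtype=float))) for c in toy.columns}

# Columns at or below the threshold are candidates for dropping:
to_drop = [c for c, v in variances.items() if v <= BIN_MINIMUM_VAR]

# Smallest share of ones that survives the filter, from p * (1 - p) = threshold:
p_min = (1 - np.sqrt(1 - 4 * BIN_MINIMUM_VAR)) / 2

print(variances)
print(to_drop)
print(round(p_min, 4))
```

Because the filter is computed on the training data only and the same list is dropped from the test data, both data frames keep identical columns.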


#### Preparing the data

In [None]:
to_log = [c for c in df_train2.columns if c in cont_vars2]
to_scale = [f'L#{c}' for c in df_train2.columns if c in cont_vars2]
vars_to_treat = None
# vars_to_treat = [f'L#{c}' if c in cont_vars else c for c in list(df_train2.columns) if (c not in drop_vars) & (c not in cat_vars) &
#                  (df_train2[c].isnull().sum() > 0)]

pipeline = Pipeline(
    operations = [
                  LogTransformation(to_log=to_log),
                  ScaleNumericalVars(to_scale=to_scale, which_scale=WHICH_SCALE),
                  TreatMissings(vars_to_treat=vars_to_treat, method=WHICH_MISSINGS_TREAT, drop_vars=drop_vars, cat_vars=cat_vars2,
                                statistic=MISSINGS_TREAT_STAT),
                  OneHotEncoding(categorical_features=cat_vars2, variance_param=CAT_TRANSF_VAR)
    ]
)

df_train_scaled2, df_test_scaled2 = pipeline.transform(data_list=[df_test2], training_data=df_train2)

if TREAT_OUTLIERS:
    outliers_treat = OutliersTreat(vars_to_treat=[f'L#{c}' for c in cont_vars2], method=OUTLIERS_METHOD, quantile=QUANTILE, k=K)
    outliers_treat.fit(training_data=df_train_scaled2)
    df_train_scaled2 = outliers_treat.transform(data=df_train_scaled2)

df_test_scaled2 = df_test_scaled2[0]
df_test_scaled2.head(3)

Unnamed: 0,app,package,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,hardware_controls_change_your_audio_settings,...,C#category#NEWS__MAGAZINES,C#category#PERSONALIZATION,C#category#PRODUCTIVITY,C#category#SHOPPING,C#category#SOCIAL,C#category#SPORTS,C#category#SPORTS_GAMES,C#category#TOOLS,C#category#TRANSPORTATION,C#category#TRAVEL__LOCAL
0,Dirty Jokes,com.appspot.swisscodemonkeys.dirty,The best Dirty Jokes app for Android!<p>#1 Fre...,0.343966,-0.067487,-0.194993,"com.gonzotech.dirty_jokes, com.comic.lastlaugh...",-0.68037,-0.20759,0,...,0,0,0,0,0,0,0,0,0,0
1,Animal Sounds with Photos,com.teachersparadise.animalsoundsphotos,Let kids explore the animal kingdom by learnin...,0.206206,-0.117087,-0.194993,"com.papainteractive, com.teachersparadise.days...",-0.343797,-0.882636,0,...,0,0,0,0,0,0,0,0,0,0
2,Mini Catch,com.airylabs.games.minicatch,"From Airy Labs, acclaimed developer of the bes...",-0.344832,-0.120685,-0.194993,"com.oscarmikegames.Bloxus, com.concretesoftwar...",-0.343797,-0.20759,0,...,0,0,0,0,0,0,0,0,0,0
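
The pipeline call above receives the training data separately from the list of data frames to transform. A common reason for this design is that statistics such as means and standard deviations are learned from the training data only and then reused on the test data. The sketch below shows that pattern with plain pandas. It is not the implementation in *transformations.py*: the use of `np.log1p`, standard scaling, and the helper `log_transform` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical):
train = pd.DataFrame({'price': [0.0, 1.0, 3.0, 7.0]})
test = pd.DataFrame({'price': [0.0, 15.0]})


def log_transform(data, columns):
    """Hypothetical helper: replace each column by log(1 + x) under an 'L#' prefix."""
    out = data.copy()
    for c in columns:
        out[f'L#{c}'] = np.log1p(out[c])
    return out.drop(columns=columns)


train_log = log_transform(train, ['price'])
test_log = log_transform(test, ['price'])

# Scaling statistics come from the training data only:
means = train_log.mean()
stds = train_log.std(ddof=0)

# The same statistics are applied to both data frames:
train_scaled = (train_log - means) / stds
test_scaled = (test_log - means) / stds

print(train_scaled)
print(test_scaled)
```

With this pattern, a raw value that appears in both data frames is mapped to the same scaled value, and no information from the test data leaks into the fitted statistics.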


#### Sanity check

In [None]:
# Assessing missing values (training data):
missings_detection(df_train_scaled2.drop([v for v in drop_vars if v != 'class'], axis=1), name='df_train_scaled2')

# Assessing missing values (test data):
missings_detection(df_test_scaled2.drop([v for v in drop_vars if v != 'class'], axis=1), name='df_test_scaled2')

# Checking datasets structure:
df_test_scaled2 = data_consistency(dataframe=df_train_scaled2,
                                   test_data=df_test_scaled2)['test_data']

# Comparing against the results of the step-by-step preparation (element-wise equality):
check = df_train_scaled.drop(drop_vars, axis=1) == df_train_scaled2.drop(drop_vars, axis=1)
if not check.to_numpy().all():
  print('Inconsistent results for training data!')

check = df_test_scaled.drop(drop_vars, axis=1) == df_test_scaled2.drop(drop_vars, axis=1)
if not check.to_numpy().all():
  print('Inconsistent results for test data!')
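
One caveat of the element-wise `==` comparison above: NaN is never equal to NaN, so two identical data frames that still contain a missing value in the same cell would be reported as inconsistent. If that situation can occur, a NaN-aware comparison such as `DataFrame.equals` may be preferable. The sketch below uses hypothetical toy frames, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical) standing in for two runs of the same preparation:
first_run = pd.DataFrame({'L#price': [0.1, np.nan, -0.3], 'flag': [0, 1, 0]})
second_run = first_run.copy()

# Element-wise equality fails on the NaN cell even though the frames are identical:
elementwise_all_equal = bool((first_run == second_run).to_numpy().all())

# DataFrame.equals treats NaNs in the same location as equal:
nan_aware_equal = first_run.equals(second_run)

# A genuine difference is still detected:
changed = second_run.copy()
changed.loc[0, 'L#price'] = 0.2
differs = not first_run.equals(changed)

print(elementwise_all_equal, nan_aware_equal, differs)
```

Note that `DataFrame.equals` also requires matching dtypes and column order, which is stricter than comparing values alone.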

In [None]:
del df_train2, df_test2, df_train_scaled2, df_test_scaled2