## End-to-end machine learning application
## Data modeling - Model analysis

This project integrates the different parts of a machine learning system into a single end-to-end ML project. The final product is an app (hypothetically called *AppSafe*) composed of a model that estimates the risk that a mobile app is malware and an API that could integrate with an app store and warn users when the app they are about to download is too risky.

The project follows the traditional [CRISP-DM](https://pt.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) methodology, so its core consists of four main stages: data engineering, data preparation, data modeling, and deployment.

-----------

This notebook gathers different dimensions of model analysis and applies them to the best model obtained during experimentation and fine-tuning. The analysis starts by displaying performance metrics, then computes feature importances, and finally inspects predicted scores and labels. This last step is the main contribution of the notebook, because it evaluates the robustness and consistency of the model to be used in production.

**Summary:**
1. [Libraries](#libraries).
2. [Functions and classes](#functions_classes).
3. [Settings](#settings).
4. [Importing the data](#imports).
5. [Model performance](#model_performance).
6. [Features importances](#feat_imp).
7. [Predicted scores and classes](#predictions).

<a id='libraries'></a>

## Libraries





In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [2]:
cd "/content/gdrive/MyDrive/Studies/end_to_end_ml/notebooks/"

/content/gdrive/MyDrive/Studies/end_to_end_ml/notebooks


In [3]:
# !pip install -r ../requirements.txt

In [4]:
import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import time
from copy import deepcopy
import pickle

from scipy.stats import ks_2samp

In [5]:
import sys

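# Make the project's source folder importable. The path is resolved from the
# current working directory (the notebooks folder set above):
sys.path.append(
    os.path.abspath(
        os.path.join(os.getcwd(), '../src')
    )
)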

<a id='functions_classes'></a>

## Functions and classes

In [6]:
from data_vis import HistoPlot, BoxPlot, BarPlot

<a id='settings'></a>

## Settings

In [7]:
# Declare whether outcomes should be exported:
EXPORT = False

<a id='imports'></a>

## Importing the data

<a id='features_labels'></a>

### Features and labels

#### Training data

In [8]:
df_train = pd.read_csv('../data/training_data.csv', dtype={'app_id': int})

print(f'Shape of df_train: {df_train.shape}.')
print(f'Number of unique instances: {df_train.app_id.nunique()}.')

# Auxiliary variables:
drop_vars = ['app', 'package', 'class', 'app_id', 'related_apps', 'description']

df_train.head(3)

Shape of df_train: (18298, 191).
Number of unique instances: 18298.


Unnamed: 0,app,package,category,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,...,your_personal_information_write_contact_data,your_personal_information_write_to_user_defined_dictionary,class,app_id,num_related_apps,num_words_desc,num_known_apps,share_known,num_known_malwares,share_known_malwares
0,Ambient Soothing Sounds: Beach,com.zeddev.chillbeach1,Health & Fitness,The soothing sounds on a long and seamless loo...,3.6,122,0.0,"com.zeddev.chillmeadow1, com.droiddevz.ambient...",1.0,1,...,0,0,0,6565,4.0,42.0,0.0,0.0,0.0,
1,Aurora,jiang.joyworks.aurora,Brain & Puzzle,This is one great &quot;Escape Game&quot; <p>Y...,3.8,24,1.41,com.firemaplegames.games.the_secretofgrislyman...,1.0,0,...,0,0,1,4772,4.0,251.0,0.0,0.0,0.0,
2,Tank Ace 1944,com.resetgame.tankace1944,Arcade & Action,In Tank Ace 1944 you command a World War II ta...,3.7,20,4.99,"ru.sibteam.classictankfull, nl.ejsoft.mortalsk...",0.0,0,...,0,0,1,20856,4.0,341.0,0.0,0.0,0.0,


Missing data

In [9]:
# Number of missing values per feature (computed once and reused):
num_missings_train = df_train.isnull().sum()

missings_train = pd.DataFrame(data={
    'feature': num_missings_train.index,
    'num_missings': num_missings_train.values,
    'share_missings': num_missings_train.values / len(df_train)
}).sort_values('num_missings', ascending=False)

missings_train.head(10)

| | feature | num_missings | share_missings |
|---|---|---|---|
| 190 | share_known_malwares | 10047 | 0.549076 |
| 185 | num_related_apps | 484 | 0.026451 |
| 189 | num_known_malwares | 484 | 0.026451 |
| 188 | share_known | 484 | 0.026451 |
| 187 | num_known_apps | 484 | 0.026451 |
| 7 | related_apps | 484 | 0.026451 |
| 8 | dangerous_permissions_count | 129 | 0.00705 |
| 3 | description | 3 | 0.000164 |
| 186 | num_words_desc | 3 | 0.000164 |
| 0 | app | 1 | 5.5e-05 |


#### Test data

In [10]:
df_test = pd.read_csv('../data/test_data.csv', dtype={'app_id': int})

print(f'Shape of df_test: {df_test.shape}.')
print(f'Number of unique instances: {df_test.app_id.nunique()}.')

df_test.head(3)

Shape of df_test: (9012, 191).
Number of unique instances: 9012.


Unnamed: 0,app,package,category,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,...,your_personal_information_write_contact_data,your_personal_information_write_to_user_defined_dictionary,class,app_id,num_related_apps,num_words_desc,num_known_apps,share_known,num_known_malwares,share_known_malwares
0,Dirty Jokes,com.appspot.swisscodemonkeys.dirty,Entertainment,The best Dirty Jokes app for Android!<p>#1 Fre...,4.0,2470,0.0,"com.gonzotech.dirty_jokes, com.comic.lastlaugh...",1.0,1,...,0,0,0,5804,4.0,82,1.0,0.25,1.0,1.0
1,Animal Sounds with Photos,com.teachersparadise.animalsoundsphotos,Education,Let kids explore the animal kingdom by learnin...,3.8,168,0.0,"com.papainteractive, com.teachersparadise.days...",2.0,0,...,0,0,0,13224,4.0,37,2.0,0.5,0.0,0.0
2,Mini Catch,com.airylabs.games.minicatch,Brain & Puzzle,"From Airy Labs, acclaimed developer of the bes...",3.0,1,0.0,"com.oscarmikegames.Bloxus, com.concretesoftwar...",2.0,1,...,0,0,1,14752,4.0,244,0.0,0.0,0.0,


Missing data

In [11]:
# Number of missing values per feature (computed once and reused):
num_missings_test = df_test.isnull().sum()

missings_test = pd.DataFrame(data={
    'feature': num_missings_test.index,
    'num_missings': num_missings_test.values,
    'share_missings': num_missings_test.values / len(df_test)
}).sort_values('num_missings', ascending=False)

missings_test.head(10)

| | feature | num_missings | share_missings |
|---|---|---|---|
| 190 | share_known_malwares | 5072 | 0.562805 |
| 185 | num_related_apps | 236 | 0.026187 |
| 189 | num_known_malwares | 236 | 0.026187 |
| 188 | share_known | 236 | 0.026187 |
| 187 | num_known_apps | 236 | 0.026187 |
| 7 | related_apps | 236 | 0.026187 |
| 8 | dangerous_permissions_count | 72 | 0.007989 |
| 122 | system_tools_retrieve_running_applications | 0 | 0.0 |
| 131 | system_tools_write_sync_settings | 0 | 0.0 |
| 123 | system_tools_send_package_removed_broadcast | 0 | 0.0 |


<a id='data_und'></a>

### Data understanding

In [12]:
data_und = pd.read_csv('../data/features.csv')

print(f'Shape of data_und: {data_und.shape}.')
print(f'Number of unique instances: {data_und.feature.nunique()}.')

data_und.head(3)

Shape of data_und: (191, 8).
Number of unique instances: 191.


| | feature | type | n_unique | sample_values | num_missings | share_missings | var_class | category |
|---|---|---|---|---|---|---|---|---|
| 0 | app | object | 22823 | ['Alabama Crimson Tide News' 'Blood Demon Movi... | 1 | 3.7e-05 | categorical | app_attributes |
| 1 | package | object | 23485 | ['com.estrongs.android.pop.app.shortcut' 'com.... | 0 | 0.0 | categorical | app_attributes |
| 2 | category | object | 30 | ['Shopping' 'Racing' 'Productivity' 'Sports Ga... | 0 | 0.0 | categorical | app_attributes |


<a id='model_artifacts'></a>

### Model artifacts

In [13]:
# Model assessment:
with open('../experiments/model_assess.json', 'r') as json_file:
    model_assess = json.load(json_file)

with open('../experiments/fine_tuning.json', 'r') as json_file:
    fine_tuning = json.load(json_file)

with open('../artifacts/model_registry.json', 'r') as json_file:
    model_registry = json.load(json_file)

# Identification of experiment:
experiment_id = model_registry['info']['experiment_id']

# Object of fitted pipeline (the context manager closes the file after loading):
with open('../artifacts/model.pickle', 'rb') as pickle_file:
    model = pickle.load(pickle_file)

# Variables expected by the model:
with open('../artifacts/variables.json', 'r') as json_file:
    variables = json.load(json_file)

<a id='model_performance'></a>

## Model performance

<a id='pipeline_desc'></a>

### Pipeline description

#### Dataset information

In [14]:
pd.DataFrame(model_registry['info'], index=['']).T

| field | value |
|---|---|
| experiment_id | 1651412222.0 |
| n_obs_train | 18298.0 |
| n_obs_test | 9012.0 |
| avg_y_train | 0.667122 |
| avg_y_test | 0.668553 |
| n_vars_train | 87.0 |
| n_vars_test | 87.0 |
| pipeline_id | 1650488565.0 |


#### Pipeline description

Early selection of input variables

In [15]:
pd.DataFrame(model_registry['pipeline']['early_selection'], index=['']).T

| parameter | value |
|---|---|
| drop_excessive_miss | True |
| excessive_miss | 0.95 |
| drop_no_var | True |
| minimum_var | 0 |
| drop_bin_no_var | True |
| bin_minimum_var | 0.01 |


Data transformation

In [16]:
pd.DataFrame(model_registry['pipeline']['data_transformation'], index=['']).T

| parameter | value |
|---|---|
| log_transform | True |
| which_scale | |
| which_missings_treat | create_binary |
| missings_treat_stat | |
| cat_transf_var | 0.01 |
| scale_all | False |
| treat_outliers | False |
| quantile | |
| outliers_method | |
| k | |
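
The registry entry above only names the transformations. As a rough illustration, the sketch below shows one way a log transform and a binary missing-value flag could be applied to a single numerical value. It is an assumption about what `log_transform` and `create_binary` mean, not the pipeline's actual code: the `1e-4` offset and the `L#` prefix come from elsewhere in this notebook, while the helper `transform_numeric`, the `NA#` prefix, and filling gaps with 0 are hypothetical.

```python
import math

EPS = 1e-4  # same offset as the notebook's own np.log(x + 0.0001)


def transform_numeric(name, value):
    """Return the transformed columns for one numerical value.

    Hypothetical reading of the registry: 'log_transform' applies
    log(x + EPS), and 'create_binary' adds a 0/1 missing flag while
    filling the gap with 0 before the log is taken.
    """
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    filled = 0.0 if is_missing else float(value)
    return {
        f"L#{name}": math.log(filled + EPS),
        f"NA#{name}": int(is_missing),  # 'NA#' prefix is made up for this sketch
    }


print(transform_numeric("price", 0.0))
print(transform_numeric("share_known_malwares", float("nan")))
```

Keeping the missing flag as a separate column lets a tree-based model such as LightGBM split on "value unknown" independently of the filled value.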


Features selection

In [17]:
pd.DataFrame(model_registry['pipeline']['features_selection'], index=['']).T

| parameter | value |
|---|---|
| method | |
| threshold | |
| num_folds | |
| metric | |
| min_num_feats | |
| max_num_feats | |
| step | |
| direction | |
| regul_param | |


#### Ensemble definition

In [18]:
# Ensemble definition:
ensemble = [model_registry['models'][m] for m in model_registry['models'] if 'ensemble' in m][0]

Models and weights

In [19]:
pd.DataFrame(data={
    'models': ensemble['models'],
    'weights': ensemble['weights']
})

| | models | weights |
|---|---|---|
| 0 | light_gbm | 1.0 |


Models hyper-parameters

In [20]:
# Loop over ensemble models and display the best hyper-parameters of each one:
for m in ensemble['models']:
    print(f'\033[1m{m}\033[0m:')
    display(pd.DataFrame(model_registry['models'][m]['best_param'], index=['']).T)

light_gbm:

| hyper-parameter | value |
|---|---|
| bagging_fraction | 0.649756 |
| learning_rate | 0.078279 |
| max_depth | 3.0 |
| num_iterations | 350.0 |


#### Model performance (test data)

In [21]:
perf_metrics = deepcopy(ensemble['performance_metrics'])
conf_matrix = perf_metrics.pop('conf_matrix')

In [22]:
pd.DataFrame(perf_metrics, index=['']).T

| metric | value |
|---|---|
| test_roc_auc | 0.916764 |
| test_prec_avg | 0.962434 |
| test_brier | 0.109957 |
| test_mcc | 0.637669 |
| test_acc | 0.835775 |
| test_prec | 0.892419 |
| test_rec | 0.857759 |
| fn_rate | 0.142241 |
| fp_rate | 0.20857 |


In [23]:
pd.DataFrame(conf_matrix, columns=['predicted_0', 'predicted_1'], index=['true_0', 'true_1'])

| | predicted_0 | predicted_1 |
|---|---|---|
| true_0 | 2364 | 623 |
| true_1 | 857 | 5168 |
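
The threshold-dependent metrics reported above can be recomputed from the four cells of this confusion matrix. The snippet below is a small worked check using the standard textbook formulas; the variable names are chosen for this example only and are not taken from the project's code.

```python
import math

# Confusion matrix reported above (rows: true label, columns: predicted label).
tn, fp = 2364, 623
fn, tp = 857, 5168

precision = tp / (tp + fp)                   # test_prec
recall = tp / (tp + fn)                      # test_rec
accuracy = (tp + tn) / (tp + tn + fp + fn)   # test_acc
fp_rate = fp / (fp + tn)                     # share of true 0s predicted as 1
fn_rate = fn / (fn + tp)                     # share of true 1s predicted as 0

# Matthews correlation coefficient (test_mcc):
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("precision", precision), ("recall", recall), ("accuracy", accuracy),
    ("fp_rate", fp_rate), ("fn_rate", fn_rate), ("mcc", mcc),
]:
    print(f"{name}: {value:.6f}")
```

ROC AUC, average precision, and the Brier score cannot be recovered this way, because they depend on the full distribution of predicted scores rather than on the thresholded labels.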


<a id='feat_imp'></a>

## Features importances

In [24]:
# Loop over ensemble models:
for m in model.ensemble.models:
    if 'Logistic' in str(m):
        # Feature importances of logistic regression:
        feat_importances_lr = pd.DataFrame(data={
            'feature': variables,
            'feat_imp': [c for c in m.coef_[0]],
            'abs_feat_imp': [abs(c) for c in m.coef_[0]]
        }).sort_values('abs_feat_imp', ascending=False)
        feat_importances_lr.index.name = 'logistic_regression'
        display(feat_importances_lr.head(10))

    if 'Forest' in str(m):
        # Feature importances of random forest:
        feat_importances_rf = pd.DataFrame(data={
            'feature': variables,
            'feat_imp': [c for c in m.feature_importances_]
        }).sort_values('feat_imp', ascending=False)
        feat_importances_rf.index.name = 'random_forest'
        display(feat_importances_rf.head(10))

    if 'lightgbm' in str(m):
        # Feature importances of LightGBM:
        feat_importances_lgb = pd.DataFrame(data={
            'feature': variables,
            'feat_imp': [c for c in m.feature_importance()]
        }).sort_values('feat_imp', ascending=False)
        feat_importances_lgb.index.name = 'light_gbm'
        display(feat_importances_lgb.head(10))

| light_gbm | feature | feat_imp |
|---|---|---|
| 44 | L#number_of_ratings | 526 |
| 49 | L#num_words_desc | 268 |
| 43 | L#rating | 128 |
| 53 | L#share_known_malwares | 108 |
| 67 | C#category#COMICS | 88 |
| 46 | L#dangerous_permissions_count | 80 |
| 45 | L#price | 66 |
| 73 | C#category#LIBRARIES__DEMO | 49 |
| 52 | L#num_known_malwares | 48 |
| 86 | C#category#TRAVEL__LOCAL | 46 |


<a id='predictions'></a>

## Predicted scores and classes

In [25]:
# Predictions for test data points:
preds_file = [
    f for f in os.listdir('../experiments/predictions/')
    if (str(experiment_id) in f) and ('ensemble' in f)  # str() guards against a numeric id
][0]
predictions = pd.read_csv(f'../experiments/predictions/{preds_file}')

print(f'Shape of predictions: {predictions.shape}.')
predictions.head(3)

Shape of predictions: (9012, 7).


| | test_score | y_true | y_pred | fn | fp | tn | tp |
|---|---|---|---|---|---|---|---|
| 0 | 0.456581 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0.087759 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0.969398 | 1 | 1 | 0 | 0 | 0 | 1 |
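
The `y_pred`, `fn`, `fp`, `tn` and `tp` columns follow from comparing each score with a decision threshold. The sketch below reproduces the three rows shown above under the assumption of a 0.5 threshold; the actual cutoff used by the project is not stated in this notebook, and the helper `label_outcome` is hypothetical.

```python
def label_outcome(score, y_true, threshold=0.5):
    """Derive the predicted label and the outcome flags for one observation.

    The 0.5 threshold is an assumption made for this sketch.
    """
    y_pred = int(score >= threshold)
    return {
        "y_pred": y_pred,
        "fn": int(y_true == 1 and y_pred == 0),  # malware scored as safe
        "fp": int(y_true == 0 and y_pred == 1),  # safe app scored as malware
        "tn": int(y_true == 0 and y_pred == 0),
        "tp": int(y_true == 1 and y_pred == 1),
    }


# The three rows displayed above:
for score, y_true in [(0.456581, 0), (0.087759, 0), (0.969398, 1)]:
    print(score, y_true, label_outcome(score, y_true))
```

Exactly one of the four flags is 1 for every observation, so summing each flag over the test set rebuilds the confusion matrix.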


<a id='preds_dist'></a>

### Distribution of predictions

#### Unconditional distribution

In [26]:
display(predictions['test_score'].describe())
print('\n')
display(predictions['y_true'].describe())

# Declaring the grid of plots:
histplot = HistoPlot(grid=(1,1), width=700, height=400, titles=['Distribution of predicted scores'])

# Creating the plots:
histplot.add_plot(
    data=predictions, x='test_score', position=(1,1),
    x_axis_name='Predicted scores', y_axis_name='Distribution',
    opacity=0.5
)

# Plotting the grid:
histplot.render()

count    9012.000000
mean        0.665401
std         0.332404
min         0.011884
25%         0.351353
50%         0.774919
75%         0.991477
max         0.999970
Name: test_score, dtype: float64





count    9012.000000
mean        0.668553
std         0.470759
min         0.000000
25%         0.000000
50%         1.000000
75%         1.000000
max         1.000000
Name: y_true, dtype: float64

#### Distribution conditional on true labels

In [27]:
display(predictions.groupby('y_true')[['test_score']].describe())

Summary statistics of `test_score` by `y_true`:

| y_true | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 0 | 2987.0 | 0.330321 | 0.214199 | 0.011884 | 0.157939 | 0.295896 | 0.460723 | 0.977448 |
| 1 | 6025.0 | 0.831523 | 0.24344 | 0.025233 | 0.731845 | 0.968126 | 0.998211 | 0.99997 |


In [28]:
# Statistic of Kolmogorov-Smirnov (KS) test:
ks_stat = ks_2samp(
    predictions[predictions['y_true']==0]['test_score'],
    predictions[predictions['y_true']==1]['test_score']
).statistic

# Declaring the grid of plots:
histplot = HistoPlot(grid=(1,1), width=700, height=400,
                     titles=[f'Distribution of predicted scores by true label (KS stat = {ks_stat:.4f})'])

# Creating the plots:
histplot.add_plot(
    data=predictions[predictions['y_true']==0], x='test_score', position=(1,1),
    x_name='y_true=0', x_axis_name='Predicted scores', y_axis_name='Distribution',
    opacity=0.5
)
histplot.add_plot(
    data=predictions[predictions['y_true']==1], x='test_score', position=(1,1),
    x_name='y_true=1', x_axis_name='Predicted scores', y_axis_name='Distribution',
    opacity=0.5
)

# Plotting the grid:
histplot.render()
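
The KS statistic shown in the plot title is the largest vertical gap between the empirical cumulative distribution functions (ECDFs) of the two score samples, so values close to 1 indicate that the scores separate the two classes well. The sketch below computes that gap with the standard library only, as an illustration of what the two-sided `ks_2samp(...).statistic` measures; the helper name `ks_statistic` and the toy scores are made up for this example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the ECDFs of two samples."""
    a_sorted, b_sorted = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The ECDFs only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        ecdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        largest_gap = max(largest_gap, abs(ecdf_a - ecdf_b))
    return largest_gap


# Toy scores (not the project's data):
scores_y0 = [0.1, 0.2, 0.3, 0.4]
scores_y1 = [0.3, 0.6, 0.8, 0.9]
print(ks_statistic(scores_y0, scores_y1))
```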

In [29]:
# Declaring the grid of plots:
boxplot = BoxPlot(grid=(1,2), width=900, height=400, titles=['Distribution of predicted scores by true label'])

# Creating the plots:
boxplot.add_plot(
    data=predictions, x='y_true', y='test_score', position=(1,1),
    x_axis_name='y_true', y_axis_name='Predicted score'
)

# Plotting the grid:
boxplot.render()

In [30]:
# Decile of scores:
predictions['decile'] = pd.qcut(predictions['test_score'], q=10)

# Rate of y = 1 by decile of scores:
y_avg_dec = predictions.groupby('decile').mean()[['y_true']].reset_index()
y_avg_dec['score'] = [str(d) for d in y_avg_dec['decile']]

# Declaring the grid of plots:
barplot = BarPlot(grid=(1,1), width=700, height=400, titles=['Rate of y = 1 by decile of scores'])

# Creating the plots:
barplot.add_plot(
    data=y_avg_dec, x='score', y='y_true', position=(1,1),
    x_axis_name='', y_axis_name='y_true'
)

# Plotting the grid:
barplot.render()

<a id='major_errors'></a>

### Major prediction errors

In [31]:
df_test_sel = pd.concat([df_test[['category', 'price', 'rating', 'share_known', 'share_known_malwares']], predictions], axis=1)

#### False positives

In [32]:
display(df_test_sel[df_test_sel.y_true==0].sort_values('test_score', ascending=False).head(10))

| | category | price | rating | share_known | share_known_malwares | test_score | y_true | y_pred | fn | fp | tn | tp | decile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1587 | Books & Reference | 0.0 | 4.5 | 1.0 | 1.0 | 0.977448 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 8773 | Travel & Local | 0.0 | 4.5 | 0.75 | 1.0 | 0.974098 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 1689 | Sports | 0.0 | 3.6 | 0.0 | | 0.974006 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 7933 | Lifestyle | 0.0 | 0.0 | 0.0 | | 0.973845 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 3153 | Books & Reference | 0.0 | 3.0 | 0.0 | | 0.966804 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 7562 | Travel & Local | 0.0 | 4.2 | 0.0 | | 0.96156 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 2933 | Travel & Local | 0.0 | 4.1 | 0.5 | 1.0 | 0.96021 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 3073 | Finance | 0.0 | 3.0 | 0.25 | 1.0 | 0.95982 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 1102 | Casual | 0.0 | 5.0 | 0.0 | | 0.956073 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |
| 6562 | Arcade & Action | 0.0 | 1.8 | 0.0 | | 0.950157 | 0 | 1 | 0 | 1 | 0 | 0 | (0.9281, 0.9801] |


Features distribution

In [33]:
print('\033[1mDistribution of "category" for the largest false positives:\033[0m')
display(df_test_sel[df_test_sel.y_true==0].sort_values('test_score', ascending=False).head(10)['category'].value_counts(normalize=True))
print('\n\033[1mRelationship between "category" and true label for training data:\033[0m')
# Select the column before averaging so that non-numeric columns are never aggregated:
display(df_train.groupby('category')[['class']].mean().sort_values('class', ascending=False))

Distribution of "category" for the largest false positives:


Travel & Local       0.3
Books & Reference    0.2
Finance              0.1
Lifestyle            0.1
Casual               0.1
Arcade & Action      0.1
Sports               0.1
Name: category, dtype: float64


Relationship between "category" and true label for training data:

| category | class |
|---|---|
| Transportation | 0.98524 |
| Medical | 0.983051 |
| Travel & Local | 0.981859 |
| Sports | 0.956186 |
| News & Magazines | 0.928082 |
| Shopping | 0.926606 |
| Photography | 0.89071 |
| Tools | 0.884379 |
| Music & Audio | 0.803797 |
| Productivity | 0.797531 |


In [34]:
print('\033[1mDistribution of numerical relevant features for the largest false positives:\033[0m')
display(
    df_test_sel[df_test_sel.y_true==0].sort_values('test_score', ascending=False).head(10)[['price', 'rating', 'share_known',
                                                                                            'share_known_malwares']].describe()
)
print('\n\033[1mRelationship between numerical relevant features and true label for training data:\033[0m')
display(df_train.groupby('class')[['price', 'rating', 'share_known', 'share_known_malwares']].mean())

Distribution of numerical relevant features for the largest false positives:

| | price | rating | share_known | share_known_malwares |
|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 4.0 |
| mean | 0.0 | 3.37 | 0.25 | 1.0 |
| std | 0.0 | 1.51221 | 0.372678 | 0.0 |
| min | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 3.0 | 0.0 | 1.0 |
| 50% | 0.0 | 3.85 | 0.0 | 1.0 |
| 75% | 0.0 | 4.425 | 0.4375 | 1.0 |
| max | 0.0 | 5.0 | 1.0 | 1.0 |



Relationship between numerical relevant features and true label for training data:

| class | price | rating | share_known | share_known_malwares |
|---|---|---|---|---|
| 0 | 0.0 | 3.965441 | 0.237252 | 0.296616 |
| 1 | 0.995695 | 3.268698 | 0.161967 | 0.838359 |


#### False negatives

In [35]:
display(df_test_sel[df_test_sel.y_true==1].sort_values('test_score', ascending=True).head(10))

| | category | price | rating | share_known | share_known_malwares | test_score | y_true | y_pred | fn | fp | tn | tp | decile |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7511 | Business | 0.0 | 4.5 | 0.5 | 0.0 | 0.025233 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 2929 | Libraries & Demo | 0.0 | 4.2 | 0.25 | 0.0 | 0.032495 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 7202 | Comics | 0.0 | 4.2 | 0.25 | 0.0 | 0.033128 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 8711 | Libraries & Demo | 0.0 | 4.6 | 0.25 | 0.0 | 0.035994 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 7460 | Libraries & Demo | 0.0 | 4.6 | 0.25 | 0.0 | 0.035994 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 2568 | Comics | 0.0 | 3.9 | 1.0 | 0.0 | 0.036747 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 2149 | Libraries & Demo | 0.0 | 3.8 | 0.75 | 0.333333 | 0.038901 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 3492 | Comics | 0.0 | 3.7 | 0.25 | 0.0 | 0.039931 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 8880 | Libraries & Demo | 0.0 | 4.7 | 0.25 | 0.0 | 0.040316 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |
| 6496 | Libraries & Demo | 0.0 | 4.7 | 0.25 | 0.0 | 0.040316 | 1 | 0 | 1 | 0 | 0 | 0 | (0.01178, 0.168] |


Features distribution

In [36]:
print('\033[1mDistribution of "category" for the largest false negatives:\033[0m')
display(df_test_sel[df_test_sel.y_true==1].sort_values('test_score', ascending=True).head(10)['category'].value_counts(normalize=True))
print('\n\033[1mRelationship between "category" and true label for training data:\033[0m')
# Select the column before averaging so that non-numeric columns are never aggregated:
display(df_train.groupby('category')[['class']].mean().sort_values('class', ascending=False))

Distribution of "category" for the largest false negatives:


Libraries & Demo    0.6
Comics              0.3
Business            0.1
Name: category, dtype: float64


Relationship between "category" and true label for training data:

| category | class |
|---|---|
| Transportation | 0.98524 |
| Medical | 0.983051 |
| Travel & Local | 0.981859 |
| Sports | 0.956186 |
| News & Magazines | 0.928082 |
| Shopping | 0.926606 |
| Photography | 0.89071 |
| Tools | 0.884379 |
| Music & Audio | 0.803797 |
| Productivity | 0.797531 |


In [37]:
print('\033[1mDistribution of numerical relevant features for the largest false negatives:\033[0m')
display(
    df_test_sel[df_test_sel.y_true==1].sort_values('test_score', ascending=True).head(10)[['price', 'rating', 'share_known',
                                                                                           'share_known_malwares']].describe()
)
print('\n\033[1mRelationship between numerical relevant features and true label for training data:\033[0m')
display(df_train.groupby('class')[['price', 'rating', 'share_known', 'share_known_malwares']].mean())

Distribution of numerical relevant features for the largest false negatives:

| | price | rating | share_known | share_known_malwares |
|---|---|---|---|---|
| count | 10.0 | 10.0 | 10.0 | 10.0 |
| mean | 0.0 | 4.29 | 0.4 | 0.033333 |
| std | 0.0 | 0.384274 | 0.268742 | 0.105409 |
| min | 0.0 | 3.7 | 0.25 | 0.0 |
| 25% | 0.0 | 3.975 | 0.25 | 0.0 |
| 50% | 0.0 | 4.35 | 0.25 | 0.0 |
| 75% | 0.0 | 4.6 | 0.4375 | 0.0 |
| max | 0.0 | 4.7 | 1.0 | 0.333333 |



Relationship between numerical relevant features and true label for training data:

| class | price | rating | share_known | share_known_malwares |
|---|---|---|---|---|
| 0 | 0.0 | 3.965441 | 0.237252 | 0.296616 |
| 1 | 0.995695 | 3.268698 | 0.161967 | 0.838359 |


<a id='feats_dist'></a>

### Distribution of predictions given features

#### Category of apps

In [38]:
# Distribution of predictions given features:
score_dist_feat = pd.concat([df_test[['category']], predictions], axis=1)
score_dist_feat = score_dist_feat.groupby('category')[['test_score']].mean().reset_index()

In [39]:
# Declaring the grid of plots:
barplot = BarPlot(grid=(1,1), width=700, height=400, titles=['Average of score by app category'])

# Creating the plots:
barplot.add_plot(
    data=score_dist_feat, x='category', y='test_score', position=(1,1),
    x_axis_name='', y_axis_name='Average score'
)

# Plotting the grid:
barplot.render()

#### Numerical features

In [40]:
score_dist_feat = {}

# Loop over variables:
for v in ['price', 'rating', 'share_known', 'share_known_malwares']:
    # Distribution of predictions given features:
    df_test_sel[[f'L#{v}']] = df_test_sel[[v]].apply(lambda x: np.log(x + 0.0001))
    df_test_sel[f'decile_{v}'] = pd.qcut(df_test_sel[f'L#{v}'], q=10, duplicates='drop')
    tmp_score_dist = df_test_sel.groupby(f'decile_{v}')[['test_score']].mean().reset_index()
    tmp_score_dist[f'L#{v}'] = [str(d) for d in tmp_score_dist[f'decile_{v}']]
    score_dist_feat[f'L#{v}'] = tmp_score_dist

In [41]:
# Declaring the grid of plots:
barplot = BarPlot(grid=(4,1), width=700, height=1000, main_title='Average of score by decile of features',
                  titles=[], legend=False)

# Creating the plots:
barplot.add_plot(
    data=score_dist_feat['L#price'], x='L#price', y='test_score', position=(1,1),
    x_axis_name='L#price', y_axis_name='Average score'
)
barplot.add_plot(
    data=score_dist_feat['L#rating'], x='L#rating', y='test_score', position=(2,1),
    x_axis_name='L#rating', y_axis_name='Average score'
)
barplot.add_plot(
    data=score_dist_feat['L#share_known'], x='L#share_known', y='test_score', position=(3,1),
    x_axis_name='L#share_known', y_axis_name='Average score'
)
barplot.add_plot(
    data=score_dist_feat['L#share_known_malwares'], x='L#share_known_malwares', y='test_score', position=(4,1),
    x_axis_name='L#share_known_malwares', y_axis_name='Average score'
)

# Plotting the grid:
barplot.render()