## End-to-end machine learning application
## Deployment - Production model

This project integrates the different parts of a machine learning system into an end-to-end ML application. The final product is an app (hypothetically called *AppSafe*) composed of a model that estimates the risk that a mobile app is malware and an API that could integrate with an app store and warn users when the app they are about to download is too risky.

The project follows the traditional [CRISP-DM](https://pt.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) methodology, whose main stages form the core of the project: data engineering, data preparation, data modeling, and deployment.

-----------

This notebook imports all artifacts generated during the final training of the data modeling stage (notebook "Data Modeling - Final Training"). It then creates an object of the *Model* class, whose *predict* method produces a prediction for a single input data point in two steps: an object of the *Pipeline* class applies all required data preparation operations, and an object of the *Ensemble* class returns the final prediction.
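The two-step flow described above can be pictured with a minimal sketch. The real `Model`, `Pipeline`, and `Ensemble` classes live in the project's `src` directory and are not shown in this notebook, so every class, method name, and scoring rule below is a hypothetical stand-in. The sketch also omits the `training_data` argument that the real `predict` method receives.

```python
from typing import Any, Dict, List, Sequence


class StubPipeline:
    """Hypothetical stand-in for the project's Pipeline class."""

    def transform(self, record: Dict[str, Any]) -> List[float]:
        # Toy preparation step: keep two numeric fields in a fixed order.
        return [float(record["rating"]), float(record["price"])]


class StubEnsemble:
    """Hypothetical stand-in for the project's Ensemble class."""

    def predict(self, features: Sequence[float]) -> List[float]:
        # Toy score in [0, 1]; a real ensemble would combine fitted models.
        rating, _price = features
        score = 1.0 - rating / 5.0
        return [min(max(score, 0.0), 1.0)]


class ModelSketch:
    """Chains data preparation and scoring behind a single predict call."""

    def __init__(self, pipeline: StubPipeline, ensemble: StubEnsemble,
                 variables: List[str]) -> None:
        self.pipeline = pipeline
        self.ensemble = ensemble
        self.variables = variables

    def predict(self, input_data: Dict[str, Any]) -> List[float]:
        # Keep only the variables the model expects, then prepare and score.
        record = {name: input_data[name] for name in self.variables}
        features = self.pipeline.transform(record)
        return self.ensemble.predict(features)


model_sketch = ModelSketch(StubPipeline(), StubEnsemble(), ["rating", "price"])
prediction = model_sketch.predict({"rating": 4.0, "price": 0.0, "app": "demo"})
print(f"Predicted risk: {prediction[0]:.4f}")
```

Keeping preparation and scoring behind one method means the serving code only needs to know about `predict`, whatever happens inside the pipeline.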

After creating the *Model* object, this notebook saves it to a pickle file so it can be used within an application through an API hosted on a local or remote server. The notebook also tests the model object, which helps estimate prediction latency and guides the development of error handling.
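As a rough illustration of how an API layer might wrap the model, the sketch below turns a JSON request body into a status code and a JSON response. It is framework-agnostic and not the project's actual API: the function name `handle_request`, the 0.5 warning threshold, the response fields, and the `fake_predict` stand-in are all assumptions made for this example.

```python
import json
from typing import Any, Callable, Dict, List, Tuple


def handle_request(predict: Callable[[Dict[str, Any]], List[float]],
                   body: str,
                   threshold: float = 0.5) -> Tuple[int, str]:
    """Turn a JSON request body into an (HTTP status, JSON response) pair."""
    try:
        input_data = json.loads(body)
    except json.JSONDecodeError:
        return 400, json.dumps({"error": "body is not valid JSON"})
    if not isinstance(input_data, dict):
        return 400, json.dumps({"error": "body must be a JSON object"})
    try:
        risk = float(predict(input_data)[0])
    except (ValueError, TypeError) as exc:
        # Invalid inputs become a client error instead of a server crash.
        return 422, json.dumps({"error": str(exc)})
    # The threshold is a hypothetical cut-off for warning the user.
    return 200, json.dumps({"risk": risk, "warn_user": risk >= threshold})


def fake_predict(input_data: Dict[str, Any]) -> List[float]:
    # Hypothetical stand-in for the pickled model's predict method.
    if "price" not in input_data:
        raise ValueError("missing attribute: price")
    return [0.9]


print(handle_request(fake_predict, '{"price": 0.0}'))
print(handle_request(fake_predict, '{"rating": 4.0}'))
```

Mapping `ValueError` and `TypeError` to a client-side status keeps bad requests from taking the service down, which is one reason to probe error cases before deployment.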

**Summary:**
1. [Libraries](#libraries)<a href='#libraries'></a>.
2. [Functions and classes](#functions_classes)<a href='#functions_classes'></a>.
3. [Settings](#settings)<a href='#settings'></a>.
4. [Importing artifacts](#imports)<a href='#imports'></a>.
  * [Training data](#training_data)<a href='#training_data'></a>.
  * [Data understanding](#data_und)<a href='#data_und'></a>.
  * [Model registry](#model_registry)<a href='#model_registry'></a>.
  * [Artifacts of the model](#artifacts)<a href='#artifacts'></a>.

5. [Model in production](#production_model)<a href='#production_model'></a>.
  * [Sample data](#sample_data)<a href='#sample_data'></a>.
  * [Model object](#model)<a href='#model'></a>.
  * [Predictions from the trained model](#predictions)<a href='#predictions'></a>.

<a id='libraries'></a>

## Libraries





In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
cd "/content/gdrive/MyDrive/Studies/end_to_end_ml/notebooks/"

/content/gdrive/MyDrive/Studies/end_to_end_ml/model_dev


In [None]:
# !pip install -r ../requirements.txt

In [None]:
import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import time
import pickle
from copy import deepcopy

In [None]:
import sys

# Make the project's `src` directory importable, relative to the working directory:
sys.path.append(
    os.path.abspath(
        os.path.join(os.getcwd(), '../src')
    )
)

<a id='functions_classes'></a>

## Functions and classes

In [None]:
from production import Model
from data_vis import plot_histogram

<a id='settings'></a>

## Settings

<a id='data_management_settings'></a>

### Data management

In [None]:
# Declare whether outcomes should be exported:
EXPORT = False

# Experiment description:
COMMENT = ''

<a id='imports'></a>

## Importing artifacts

<a id='training_data'></a>

### Training data

In [None]:
df_train = pd.read_csv('../artifacts/df_train.csv', dtype={'app_id': int})

print(f'Shape of df_train: {df_train.shape}.')
print(f'Number of unique instances: {df_train.app_id.nunique()}.')

df_train.head(3)

Shape of df_train: (18298, 61).
Number of unique instances: 18298.


Unnamed: 0,app,package,category,description,rating,number_of_ratings,price,related_apps,dangerous_permissions_count,safe_permissions_count,...,your_personal_information_write_browsers_history_and_bookmarks,your_personal_information_write_contact_data,class,app_id,num_related_apps,num_words_desc,num_known_apps,share_known,num_known_malwares,share_known_malwares
0,Ambient Soothing Sounds: Beach,com.zeddev.chillbeach1,Health & Fitness,The soothing sounds on a long and seamless loo...,3.6,122,0.0,"com.zeddev.chillmeadow1, com.droiddevz.ambient...",1.0,1,...,0,0,0,6565,4.0,42.0,0.0,0.0,0.0,
1,Aurora,jiang.joyworks.aurora,Brain & Puzzle,This is one great &quot;Escape Game&quot; <p>Y...,3.8,24,1.41,com.firemaplegames.games.the_secretofgrislyman...,1.0,0,...,0,0,1,4772,4.0,251.0,0.0,0.0,0.0,
2,Tank Ace 1944,com.resetgame.tankace1944,Arcade & Action,In Tank Ace 1944 you command a World War II ta...,3.7,20,4.99,"ru.sibteam.classictankfull, nl.ejsoft.mortalsk...",0.0,0,...,0,0,1,20856,4.0,341.0,0.0,0.0,0.0,


#### List of input variables

In [None]:
with open('../artifacts/variables.json', 'r') as json_file:
    variables = json.load(json_file)

#### Data schema

In [None]:
with open('../artifacts/schema.json', 'r') as json_file:
    schema = json.load(json_file)

<a id='data_und'></a>

### Data understanding

In [None]:
data_und = pd.read_csv('../data/features.csv')

print(f'Shape of data_und: {data_und.shape}.')
print(f'Number of unique instances: {data_und.feature.nunique()}.')

data_und.head(3)

Shape of data_und: (191, 8).
Number of unique instances: 191.


Unnamed: 0,feature,type,n_unique,sample_values,num_missings,share_missings,var_class,category
0,app,object,22823,['Alabama Crimson Tide News' 'Blood Demon Movi...,1,3.7e-05,categorical,app_attributes
1,package,object,23485,['com.estrongs.android.pop.app.shortcut' 'com....,0,0.0,categorical,app_attributes
2,category,object,30,['Shopping' 'Racing' 'Productivity' 'Sports Ga...,0,0.0,categorical,app_attributes


<a id='model_registry'></a>

### Model registry

In [None]:
with open('../artifacts/model_registry.json', 'r') as json_file:
    model_registry = json.load(json_file)

<a id='artifacts'></a>

### Artifacts of the model

#### Pipeline object

In [None]:
with open('../artifacts/pipeline.pickle', 'rb') as pickle_file:
    pipeline = pickle.load(pickle_file)

#### Ensemble object

In [None]:
with open('../artifacts/ensemble.pickle', 'rb') as pickle_file:
    ensemble = pickle.load(pickle_file)

<a id='production_model'></a>

## Model in production

<a id='sample_data'></a>

### Sample data

In [None]:
with open('../artifacts/sample_inputs.json', 'r') as json_file:
    sample_inputs = json.load(json_file)

<a id='model'></a>

### Model object

In [None]:
# Model object from pipeline and ensemble objects:
model = Model(schema=schema, pipeline=pipeline, ensemble=ensemble, variables=variables)

In [None]:
if EXPORT:
    # Save the model object so the API can load it:
    with open('../artifacts/model.pickle', 'wb') as pickle_file:
        pickle.dump(model, pickle_file)

<a id='predictions'></a>

### Predictions from the trained model

In [None]:
# Reload the saved model object:
with open('../artifacts/model.pickle', 'rb') as pickle_file:
    model = pickle.load(pickle_file)

In [None]:
# Prediction for a given data point:
start = time.time()
prediction = model.predict(input_data=sample_inputs[0],
                           training_data=df_train)
end = time.time()
print(f'Predicted probability that the application is a malware: {prediction[0]:.4f}.')
print(f'Elapsed time: {round(end-start, 2)} seconds.')


Mean of empty slice.



Predicted probability that the application is a malware: 0.9973.
Elapsed time: 1.08 seconds.
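The `Mean of empty slice.` line printed with the prediction matches the text of a NumPy `RuntimeWarning`, most likely raised somewhere inside the data preparation step. Rather than letting such messages scroll past, a caller could capture them per prediction and log them. The sketch below uses only the standard `warnings` module; the helper name `predict_with_warnings` and the `noisy_predict` stand-in are hypothetical.

```python
import warnings
from typing import Any, Callable, List, Tuple


def predict_with_warnings(predict: Callable[..., Any], *args: Any,
                          **kwargs: Any) -> Tuple[Any, List[str]]:
    """Run predict and return its result plus any warning messages raised."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones a filter would normally show once.
        warnings.simplefilter("always")
        result = predict(*args, **kwargs)
    messages = [f"{w.category.__name__}: {w.message}" for w in caught]
    return result, messages


def noisy_predict(input_data: dict) -> List[float]:
    # Hypothetical stand-in that mimics the warning seen in the output.
    warnings.warn("Mean of empty slice.", RuntimeWarning)
    return [0.5]


result, messages = predict_with_warnings(noisy_predict, {"price": 0.0})
print(result, messages)
```

Collected this way, the messages can be attached to a log record for the request, which makes it easier to see which inputs trigger the warning.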


#### Evaluating time for prediction

In [None]:
time_elapsed = []

# Loop over data inputs:
for sample_input in sample_inputs:
    # Prediction for a given data point:
    start = time.time()
    prediction = model.predict(input_data=sample_input,
                               training_data=df_train)
    end = time.time()
    time_elapsed.append(end-start)

time_elapsed = pd.DataFrame(data={
    'time_elapsed': time_elapsed
})


Mean of empty slice.



In [None]:
display(time_elapsed.time_elapsed.describe())
plot_histogram(
    data=time_elapsed, x=['time_elapsed'], pos=[(1,1)],
    titles=['Distribution of total elapsed time over 1000 samples'], width=700, height=450
)

count    1000.000000
mean        0.956420
std         0.051233
min         0.908056
25%         0.941041
50%         0.948143
75%         0.956907
max         1.904061
Name: time_elapsed, dtype: float64
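The summary above shows a maximum (about 1.90 s) well above the 75th percentile (about 0.96 s), so tail percentiles may describe latency better than the mean. A minimal sketch of such a summary, using only the standard library, follows. The function name `latency_summary` and the synthetic sample values are assumptions for illustration, not measurements from this notebook.

```python
import statistics
from typing import Dict, Sequence


def latency_summary(samples: Sequence[float]) -> Dict[str, float]:
    """Summarize latency samples with the median and tail percentiles."""
    # quantiles(n=100) returns 99 cut points: index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "median": statistics.median(samples),
        "p95": cuts[94],
        "p99": cuts[98],
        "max": max(samples),
    }


# Synthetic values standing in for the recorded elapsed times.
synthetic_samples = [float(value) for value in range(1, 102)]
summary = latency_summary(synthetic_samples)
print(summary)
```

With the real measurements, the list of elapsed times collected in the timing loop could be passed in place of `synthetic_samples`.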

#### Handling errors

Missing attributes: dropping a required attribute (`price`) raises a `ValueError`.

In [None]:
missing_attr = deepcopy(sample_inputs[5])
missing_attr.pop('price')
prediction = model.predict(input_data=missing_attr, training_data=df_train)

ValueError: ignored

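Irrelevant attributes: an extra attribute (`price2`) alone does not prevent a prediction, but removing required attributes alongside it raises a `ValueError`.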

In [None]:
irrel_attr = deepcopy(sample_inputs[273])
irrel_attr['price2'] = irrel_attr['price']
prediction = model.predict(input_data=irrel_attr, training_data=df_train)


Mean of empty slice.



In [None]:
irrel_attr = deepcopy(sample_inputs[482])
irrel_attr['price2'] = irrel_attr['price']
irrel_attr.pop('price')
irrel_attr.pop('rating')
prediction = model.predict(input_data=irrel_attr, training_data=df_train)

ValueError: ignored

Wrong data type: passing `rating` as a string or `category` as a number raises a `TypeError`.

In [None]:
wrong_type = deepcopy(sample_inputs[61])
wrong_type['rating'] = str(wrong_type['rating'])
prediction = model.predict(input_data=wrong_type, training_data=df_train)

TypeError: ignored

In [None]:
wrong_type = deepcopy(sample_inputs[119])
wrong_type['category'] = 10
prediction = model.predict(input_data=wrong_type, training_data=df_train)

TypeError: ignored
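The cells above suggest that `predict` rejects missing attributes with a `ValueError`, tolerates extra attributes, and rejects wrong types with a `TypeError`. One way such checks could be written is sketched below. The layout of the project's `schema.json` is not shown in this notebook, so the schema format here (attribute name mapped to accepted Python types), the function name `validate_input`, and the example attributes are all assumptions.

```python
from typing import Any, Dict, Mapping, Tuple

# Hypothetical schema: attribute name mapped to the accepted Python types.
SCHEMA: Dict[str, Tuple[type, ...]] = {
    "price": (int, float),
    "rating": (int, float),
    "category": (str,),
}


def validate_input(input_data: Mapping[str, Any],
                   schema: Mapping[str, Tuple[type, ...]]) -> Dict[str, Any]:
    """Return the schema attributes of input_data, or raise on bad input."""
    missing = sorted(set(schema) - set(input_data))
    if missing:
        raise ValueError(f"missing attributes: {missing}")
    for name, expected in schema.items():
        if not isinstance(input_data[name], expected):
            found = type(input_data[name]).__name__
            raise TypeError(f"attribute '{name}' has unexpected type {found}")
    # Attributes outside the schema are dropped rather than rejected.
    return {name: input_data[name] for name in schema}


valid = {"price": 0.0, "rating": 3.6, "category": "Arcade & Action"}
print(validate_input({**valid, "price2": 0.0}, SCHEMA))
```

Running the validation before any data preparation lets the caller fail fast with a clear message instead of an error from deep inside the pipeline.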