## End-to-end machine learning application
## Deployment - Production model (support)

This project integrates the different parts of a machine learning system into a single end-to-end ML project. The final product is an app (hypothetically called *AppSafe*) composed of a model that estimates the risk that a mobile app is malware and an API that could integrate with an app store and warn users when the mobile app they are about to download is too risky.

The project follows the traditional [CRISP-DM](https://pt.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining) methodology, so its core stages are data engineering, data preparation, data modeling, and deployment.

-----------

This notebook produces additional artifacts needed to consume the trained model in a production environment. It imports the raw available data and saves two files: one with the schema of the raw variables that should be sent to the model, and another with a sample of raw inputs that can be used to test the model object.

**Summary:**
1. [Libraries](#libraries)
2. [Functions and classes](#functions_classes)
3. [Settings](#settings)
4. [Importing data](#imports)
   * [Complete available dataset](#data)
5. [Saving additional artifacts](#artifacts)

<a id='libraries'></a>

## Libraries





In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
cd "/content/gdrive/MyDrive/Studies/end_to_end_ml/notebooks/"

/content/gdrive/MyDrive/Studies/end_to_end_ml/model_dev


In [None]:
# !pip install -r ../requirements.txt

In [None]:
# !pip uninstall scikit-learn -y

In [None]:
# !pip install scikit-learn==0.24.2

In [None]:
import pandas as pd
import numpy as np
import os
import json
from datetime import datetime
import time

<a id='functions_classes'></a>

## Functions and classes

In [None]:
from utils import correct_col_name
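
`correct_col_name` lives in the project's `utils` module and is not shown in this notebook. Judging only from the column names displayed in the dataset preview (lower-case, underscore-separated), a helper of this kind might look like the sketch below. The name `normalize_col_name`, the cleaning rules, and the example header are assumptions for illustration, not the project's implementation.

```python
import re


def normalize_col_name(name: str) -> str:
    """Hypothetical stand-in for utils.correct_col_name (assumed behavior)."""
    # Lower-case, then collapse every run of non-alphanumeric characters into one underscore.
    cleaned = re.sub(r'[^0-9a-z]+', '_', name.strip().lower())
    # Drop underscores left at either end.
    return cleaned.strip('_')


# Made-up raw header, used only to illustrate the transformation.
print(normalize_col_name('Number of ratings'))  # number_of_ratings
```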

<a id='settings'></a>

## Settings

<a id='data_management_settings'></a>

### Data management

In [None]:
# Declare whether outcomes should be exported:
EXPORT = True

<a id='imports'></a>

## Importing data

<a id='data'></a>

### Complete available dataset

In [None]:
input_data = pd.read_csv('../data/Android_Permission.csv')

# Column names:
input_data.columns = [correct_col_name(c) for c in input_data.columns]

print(f'Shape of input_data: {input_data.shape}.')

# Removing duplicates:
input_data.drop_duplicates(inplace=True)
print(f'Number of instances after removing duplicates: {len(input_data)}.')

input_data.head(3)

Shape of input_data: (29999, 184).
Number of instances after removing duplicates: 27310.


| | app | package | category | description | rating | number_of_ratings | price | related_apps | dangerous_permissions_count | safe_permissions_count | ... | your_personal_information_read_calendar_events | your_personal_information_read_contact_data | your_personal_information_read_sensitive_log_data | your_personal_information_read_user_defined_dictionary | your_personal_information_retrieve_system_internal_state | your_personal_information_set_alarm_in_alarm_clock | your_personal_information_write_browsers_history_and_bookmarks | your_personal_information_write_contact_data | your_personal_information_write_to_user_defined_dictionary | class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Canada Post Corporation | com.canadapost.android | Business | Canada Post Mobile App gives you access to som... | 3.1 | 77 | 0.0 | {com.adaffix.pub.ca.android, com.kevinquan.gas... | 7.0 | 1 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | Word Farm | com.realcasualgames.words | Brain & Puzzle | Speed and strategy combine in this exciting wo... | 4.3 | 199 | 0.0 | {air.com.zubawing.FastWordLite, com.joybits.do... | 3.0 | 2 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | Fortunes of War FREE | fortunesofwar.free | Cards & Casino | Fortunes of War is a fast-paced, easy to learn... | 4.1 | 243 | 0.0 | {com.kevinquan.condado, hu.monsta.pazaak, net.... | 1.0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


<a id='artifacts'></a>

## Saving additional artifacts

In [None]:
# Model inputs: every raw column except the target.
features = input_data.drop(columns=['class'])

# Data schema: 'str' when the first non-null value of a column is a string,
# 'numeric' otherwise (deterministic, unlike drawing a random non-null row).
schema = {
    c: 'str' if isinstance(features[c].dropna().iloc[0], str) else 'numeric'
    for c in features.columns
}

# Sample of inputs: 1,000 rows drawn without replacement.
sample = np.random.choice(len(features), size=1000, replace=False)
sample_inputs = [
    # NumPy integers are cast to built-in int so that json.dump can serialize them.
    {k: int(v) if isinstance(v, np.integer) else v for k, v in features.iloc[i].items()}
    for i in sample
]
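
The `int(v)` cast in the cell above matters because the standard `json` module cannot serialize NumPy integers, whereas NumPy floats pass through because `np.float64` subclasses Python's `float`. A minimal illustration, using a hand-built record whose values are borrowed from the first row of the preview:

```python
import json
import numpy as np

# Hand-built record mixing NumPy scalars and a plain string (illustrative only).
record = {'rating': np.float64(3.1), 'number_of_ratings': np.int64(77), 'category': 'Business'}

try:
    json.dumps(record)
except TypeError as err:
    # np.int64 is not a subclass of int, so the encoder rejects it.
    print(f'Serialization failed: {err}')

# Casting NumPy integers to built-in int makes the record serializable.
clean = {k: int(v) if isinstance(v, np.integer) else v for k, v in record.items()}
print(json.dumps(clean))
```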

In [None]:
if EXPORT:
    with open('../artifacts/schema.json', 'w') as json_file:
        json.dump(schema, json_file, indent=2)

    with open('../artifacts/sample_inputs.json', 'w') as json_file:
        json.dump(sample_inputs, json_file, indent=2)
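
On the serving side, `schema.json` could be used to reject malformed requests before they reach the model. The sketch below shows one possible check and is not part of this project: `validate_payload` is a hypothetical helper, `example_schema` is a three-field stand-in for the exported file, and tolerating `None` values is an assumption.

```python
from numbers import Number


def validate_payload(payload: dict, schema: dict) -> list:
    """Return the problems found in payload; an empty list means it conforms."""
    problems = []
    for name, kind in schema.items():
        if name not in payload:
            problems.append(f'missing field: {name}')
            continue
        value = payload[name]
        if value is None:
            continue  # Missing values are tolerated here (assumption).
        if kind == 'str' and not isinstance(value, str):
            problems.append(f'{name} should be a string')
        # bool is registered as a Number, so it is excluded explicitly.
        if kind == 'numeric' and (isinstance(value, bool) or not isinstance(value, Number)):
            problems.append(f'{name} should be numeric')
    return problems


# Stand-in for the contents of ../artifacts/schema.json.
example_schema = {'app': 'str', 'rating': 'numeric', 'price': 'numeric'}

print(validate_payload({'app': 'Word Farm', 'rating': 4.3, 'price': 0.0}, example_schema))  # []
print(validate_payload({'app': 'Word Farm', 'rating': 'high'}, example_schema))
```

The same function can be run over every record in `sample_inputs.json` as a quick smoke test of the exported artifacts.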