# DQ0 SDK / CLI Demo
## Prerequisites
* Installed DQ0 SDK. Install with `pip install dq0sdk`
* Installed DQ0 CLI.
* Proxy running and registered from the DQ0 CLI with `dq0-cli proxy add ...`
* Valid session of DQ0. Log in with `dq0 auth login`
* Running instance of DQ0 CLI server: `dq0 server start`

## Concept
The two main structures for working with the DQ0 quarantine via the DQ0 SDK are:
* Project - the current model environment, a workspace and directory the user can define models in. Project also provides access to trained models.
* Experiment - the DQ0 runtime to execute training runs in the remote quarantine.

Start by importing the core classes:

In [None]:
# import dq0sdk cli
from dq0sdk.cli import Project, Experiment

## Create a project
Projects act as the working environment for model development.
Each project has a model directory with a .meta file containing the model uuid, attached data sources etc.
Creating a project with `Project(name='model_1')` is equivalent to calling the DQ0 CLI command `dq0-cli model create --name model_1`.

In [None]:
# create a project with name 'model_1'. Automatically creates the 'model_1' directory and changes to this directory.
project = Project(name='model_1')

## Load a project
Alternatively, you can load an existing project by first changing into its directory and then calling `Project.load()`.
This reads the .meta file in that directory.

In [None]:
%cd model_1

In [None]:
# Alternative: load a project from the current model directory
project = Project.load()

## Create Experiment
To execute DQ0 training commands inside the quarantine you define experiments for your projects.
You can create as many experiments as you like for one project.

In [None]:
# Create experiment for project
experiment = Experiment(project=project, name='experiment_1')

## Get and attach data source
For new projects you need to attach a data source. Existing (loaded) projects usually already have data sources attached.

In [None]:
# first get some info about available data sources
sources = project.get_available_data_sources()

# print info about the first source
info = project.get_data_info(sources[0]['uuid'])
info

Get the dataset description:

In [None]:
# print data description
info['description']

Also, inspect the data column types including allowed values for feature generation:

In [None]:
# print information about column types and values
info['types']
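
The column type information can drive feature handling directly. The following is a minimal sketch, assuming the entries have the `name` / `type` / `values` shape used in the `preprocess` function of this notebook; the helper names `split_features` and `find_invalid_values` and the short `types` excerpt are hypothetical and not part of the DQ0 SDK.

```python
# Hypothetical excerpt shaped like the column type entries used in this notebook.
types = [
    {'name': 'age', 'type': 'int'},
    {'name': 'sex', 'type': 'string', 'values': ['Female', 'Male']},
    {'name': 'race', 'type': 'string',
     'values': ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo',
                'Other', 'Black']},
]


def split_features(types):
    """Split column names into categorical and quantitative lists."""
    categorical = [c['name'] for c in types if c['type'] == 'string']
    quantitative = [c['name'] for c in types if c['type'] in ('int', 'float')]
    return categorical, quantitative


def find_invalid_values(record, types):
    """Return {column: value} for categorical values outside the allowed set."""
    invalid = {}
    for col in types:
        if col['type'] != 'string':
            continue
        value = record.get(col['name'])
        if value not in col['values']:
            invalid[col['name']] = value
    return invalid


categorical, quantitative = split_features(types)
problems = find_invalid_values(
    {'age': 45, 'sex': 'Female', 'race': 'Unknown'}, types)
print(categorical, quantitative, problems)
```

Checking records against the allowed values before encoding makes it easier to spot values that would otherwise silently produce no dummy column.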

And some sample data if available:

In [None]:
# get sample data
project.get_sample_data(sources[0]['uuid'])

Now, attach the dataset to our project:

In [None]:
# attach the first dataset
project.attach_data_source(sources[0]['uuid'])

## Define the model
Working with DQ0 is basically about defining two functions:
* setup_data() - called right before model training to prepare attached data sources
* setup_model() - actual model definition code

The easiest way to define those functions is to write them in the notebook (inline) and pass them to the project before calling deploy. Alternatively, the user can write the complete user_model.py to the project's directory.

### Define functions inline
The first variant passes the functions to the project instance. Note that imports must be placed inside the functions, because only these code blocks are replaced in the source files.
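
To see why the imports have to live inside the functions, here is a small illustration. It is not the SDK's actual mechanism: the `TEMPLATE` string, `render` and `run` are hypothetical stand-ins that only mimic the idea of pasting a function body into a separate source file, where names imported elsewhere in the notebook are not available.

```python
import textwrap

# Hypothetical stand-in for a generated model file; not the real DQ0 template.
TEMPLATE = '''
class UserModel:
    def setup_model(self):
{body}
'''


def render(body_source):
    """Indent a function body and place it into the template."""
    body = textwrap.dedent(body_source).strip('\n')
    return TEMPLATE.format(body=textwrap.indent(body, ' ' * 8))


def run(body_source):
    """Execute the rendered source in a fresh namespace and call setup_model."""
    namespace = {}
    exec(render(body_source), namespace)
    model = namespace['UserModel']()
    model.setup_model()
    return model.learning_rate


body_with_import = """
import math
self.learning_rate = math.sqrt(0.01)
"""

body_without_import = """
self.learning_rate = math.sqrt(0.01)
"""

learning_rate = run(body_with_import)

try:
    run(body_without_import)
    failed_without_import = False
except NameError:
    # 'math' was never imported in the fresh namespace
    failed_without_import = True

print(learning_rate, failed_without_import)
```

The body that carries its own import works in the fresh namespace, while the one relying on an outer import raises a `NameError`.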

In [None]:
# define functions

def setup_data():
    # load input data
    if self.data_source is None:
        logger.error('No data source found')
        return

    data = self.data_source.read()

    # read and preprocess the data
    dataset_df = self.preprocess()

    from sklearn.model_selection import train_test_split
    X_train_df, X_test_df, y_train_ts, y_test_ts =\
        train_test_split(dataset_df.iloc[:, :-1],
                         dataset_df.iloc[:, -1],
                         test_size=0.33,
                         random_state=42)
    self.input_dim = X_train_df.shape[1]

    # set data member variables
    self.X_train = X_train_df
    self.X_test = X_test_df
    self.y_train = y_train_ts
    self.y_test = y_test_ts
    
def setup_model():
    import tensorflow.compat.v1 as tf
    self.learning_rate = 0.1
    self.epochs = 10
    # self.optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
    self.optimizer = 'Adam'
    self.model = tf.keras.Sequential([
        tf.keras.layers.Input(self.input_dim),
        tf.keras.layers.Dense(10, activation='tanh'),
        tf.keras.layers.Dense(10, activation='tanh'),
        tf.keras.layers.Dense(2, activation='softmax')])
    
def preprocess():
    # columns
    column_names_list = [
        'lastname',
        'firstname',
        'age',
        'workclass',
        'fnlwgt',
        'education',
        'education-num',
        'marital-status',
        'occupation',
        'relationship',
        'race',
        'sex',
        'capital-gain',
        'capital-loss',
        'hours-per-week',
        'native-country',
        'income'
    ]

    # columns types list drawn from data source types information above.
    columns_types_list = [
        {
            'name': 'age',
            'type': 'int'
        },
        {
            'name': 'workclass',
            'type': 'string',
            'values': [
                'Private',
                'Self-emp-not-inc',
                'Self-emp-inc',
                'Federal-gov',
                'Local-gov',
                'State-gov',
                'Without-pay',
                'Never-worked',
                'Unknown'
            ]
        },
        {
            'name': 'fnlwgt',
            'type': 'int'
        },
        {
            'name': 'education',
            'type': 'string',
            'values': [
                'Bachelors',
                'Some-college',
                '11th',
                'HS-grad',
                'Prof-school',
                'Assoc-acdm',
                'Assoc-voc',
                '9th',
                '7th-8th',
                '12th',
                'Masters',
                '1st-4th',
                '10th',
                'Doctorate',
                '5th-6th',
                'Preschool'
            ]
        },
        {
            'name': 'education-num',
            'type': 'int'
        },
        {
            'name': 'marital-status',
            'type': 'string',
            'values': [
                'Married-civ-spouse',
                'Divorced',
                'Never-married',
                'Separated',
                'Widowed',
                'Married-spouse-absent',
                'Married-AF-spouse'
            ]
        },
        {
            'name': 'occupation',
            'type': 'string',
            'values': [
                'Tech-support',
                'Craft-repair',
                'Other-service',
                'Sales',
                'Exec-managerial',
                'Prof-specialty',
                'Handlers-cleaners',
                'Machine-op-inspct',
                'Adm-clerical',
                'Farming-fishing',
                'Transport-moving',
                'Priv-house-serv',
                'Protective-serv',
                'Armed-Forces',
                'Unknown'
            ]
        },
        {
            'name': 'relationship',
            'type': 'string',
            'values': [
                'Wife',
                'Own-child',
                'Husband',
                'Not-in-family',
                'Other-relative',
                'Unmarried'
            ]
        },
        {
            'name': 'race',
            'type': 'string',
            'values': [
                'White',
                'Asian-Pac-Islander',
                'Amer-Indian-Eskimo',
                'Other',
                'Black'
            ]
        },
        {
            'name': 'sex',
            'type': 'string',
            'values': [
                'Female',
                'Male'
            ]
        },
        {
            'name': 'capital-gain',
            'type': 'int'
        },
        {
            'name': 'capital-loss',
            'type': 'int'
        },
        {
            'name': 'hours-per-week',
            'type': 'int'
        },
        {
            'name': 'native-country',
            'type': 'string',
            'values': [
                'United-States',
                'Cambodia',
                'England',
                'Puerto-Rico',
                'Canada',
                'Germany',
                'Outlying-US(Guam-USVI-etc)',
                'India',
                'Japan',
                'Greece',
                'South',
                'China',
                'Cuba',
                'Iran',
                'Honduras',
                'Philippines',
                'Italy',
                'Poland',
                'Jamaica',
                'Vietnam',
                'Mexico',
                'Portugal',
                'Ireland',
                'France',
                'Dominican-Republic',
                'Laos',
                'Ecuador',
                'Taiwan',
                'Haiti',
                'Columbia',
                'Hungary',
                'Guatemala',
                'Nicaragua',
                'Scotland',
                'Thailand',
                'Yugoslavia',
                'El-Salvador',
                'Trinadad&Tobago',
                'Peru',
                'Hong',
                'Holand-Netherlands',
                'Unknown'
            ]
        }
    ]
    
    from dq0sdk.data.preprocessing import preprocessing
    import sklearn.preprocessing
    import pandas as pd

    if 'dataset' in globals():
        # local testing mode
        dataset = globals()['dataset']
    else:
        # get the input dataset
        if self.data_source is None:
            logger.error('No data source found')
            return

        # read the data via the attached input data source
        dataset = self.data_source.read(
            names=column_names_list,
            sep=',',
            skiprows=1,
            index_col=None,
            skipinitialspace=True,
            na_values={
                'capital-gain': 99999,
                'capital-loss': 99999,
                'hours-per-week': 99,
                'workclass': '?',
                'native-country': '?',
                'occupation': '?'}
        )

    # drop unused columns
    dataset.drop(['lastname', 'firstname'], axis=1, inplace=True)
    column_names_list.remove('lastname')
    column_names_list.remove('firstname')

    # define target feature
    target_feature = 'income'

    # get categorical features
    categorical_features_list = [
        col['name'] for col in columns_types_list
        if col['type'] == 'string']

    # get quantitative features
    quantitative_features_list = [
        col['name'] for col in columns_types_list
        if col['type'] == 'int' or col['type'] == 'float']

    # get arguments
    approach_for_missing_feature = 'imputation'
    imputation_method_for_cat_feats = 'unknown'
    imputation_method_for_quant_feats = 'median'
    features_to_drop_list = None

    # handle missing data
    dataset = preprocessing.handle_missing_data(
        dataset,
        mode=approach_for_missing_feature,
        imputation_method_for_cat_feats=imputation_method_for_cat_feats,
        imputation_method_for_quant_feats=imputation_method_for_quant_feats,  # noqa: E501
        categorical_features_list=categorical_features_list,
        quantitative_features_list=quantitative_features_list)

    if features_to_drop_list is not None:
        dataset.drop(features_to_drop_list, axis=1, inplace=True)

    # get dummy columns
    dataset = pd.get_dummies(dataset, columns=categorical_features_list, dummy_na=False)    

    # unzip categorical features with dummies
    categorical_features_list_with_dummies = []
    for col in columns_types_list:
        if col['type'] == 'string':
            for value in col['values']:
                categorical_features_list_with_dummies.append('{}_{}'.format(col['name'], value))

    # add missing columns
    missing_columns = set(categorical_features_list_with_dummies) - set(dataset.columns)
    for col in missing_columns:
        dataset[col] = 0
        
    # and sort the columns
    dataset = dataset.reindex(sorted(dataset.columns), axis=1)

    # Scale values to the range from 0 to 1 to be processed by the neural network
    dataset[quantitative_features_list] = sklearn.preprocessing.minmax_scale(dataset[quantitative_features_list])

    # label target
    y_ts = dataset[target_feature]
    le = sklearn.preprocessing.LabelEncoder()
    y_bin_nb = le.fit_transform(y_ts)
    y_bin = pd.Series(index=y_ts.index, data=y_bin_nb)
    dataset.drop([target_feature], axis=1, inplace=True)
    dataset[target_feature] = y_bin

    return dataset
    
# set model code in project
project.set_model_code(setup_data=setup_data, setup_model=setup_model, preprocess=preprocess, parent_class_name='NeuralNetworkClassification')
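
The "add missing columns" step in `preprocess` matters because one-hot encoding only creates columns for values that actually occur in the data, so a small prediction set would otherwise have fewer input columns than the training set. The following is a plain-Python sketch of that idea under simplified assumptions; `expected_dummy_columns`, `one_hot` and `align` are hypothetical helpers, not SDK or pandas functions.

```python
# Hypothetical, shortened column type list in the shape used by preprocess().
types = [
    {'name': 'age', 'type': 'int'},
    {'name': 'sex', 'type': 'string', 'values': ['Female', 'Male']},
]


def expected_dummy_columns(types):
    """List every '<column>_<value>' name the model was trained with."""
    return ['{}_{}'.format(c['name'], v)
            for c in types if c['type'] == 'string'
            for v in c['values']]


def one_hot(records, types):
    """Encode records; only values that are present produce a column."""
    rows = []
    for record in records:
        row = {}
        for col in types:
            if col['type'] == 'string':
                row['{}_{}'.format(col['name'], record[col['name']])] = 1
            else:
                row[col['name']] = record[col['name']]
        rows.append(row)
    return rows


def align(rows, types):
    """Add missing dummy columns as 0 and sort the columns by name."""
    expected = expected_dummy_columns(types)
    aligned = []
    for row in rows:
        full = dict(row)
        for name in expected:
            full.setdefault(name, 0)
        aligned.append({key: full[key] for key in sorted(full)})
    return aligned


rows = one_hot([{'age': 45, 'sex': 'Female'}], types)
aligned = align(rows, types)
print(rows[0])
print(aligned[0])
```

Sorting the columns after filling the gaps keeps the column order identical between training and prediction data.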

### Define functions as source code
The second variant writes the complete model file. The template, which is created together with the project, can be viewed with `!cat models/user_model.py`.

In [None]:
%%writefile models/user_model.py

import logging

from dq0sdk.models.tf.neural_network_classification import NeuralNetworkClassification

logger = logging.getLogger()


class UserModel(NeuralNetworkClassification):
    """Derived from dq0sdk.models.tf.NeuralNetwork class

    Model classes provide a setup method for data and model
    definitions.

    Args:
        model_path (:obj:`str`): Path to the model save destination.
    """
    def __init__(self, model_path):
        super().__init__(model_path)

    def setup_data(self):
        """Setup data function. See code above..."""
        pass

    def preprocess(self):
        """Preprocess the data. See code above..."""
        pass

    def setup_model(self):
        """Setup model function. See code above..."""
        pass


## Train the model
After testing the model locally in this notebook, it's time to train it inside the DQ0 quarantine. This is done by calling `experiment.train()`, which in turn calls the CLI commands `dq0-cli model deploy` and `dq0-cli model train`.

In [None]:
run = experiment.train()

`train` is executed asynchronously. You can wait for the run to complete or query its state with `get_state()`:
(TBD: in the future there could be a Jupyter extension that shows the run progress in a widget.)

In [None]:
# wait for completion
run.wait_for_completion(verbose=True)
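
Conceptually, waiting for an asynchronous run amounts to polling its state until it reaches a terminal value. The sketch below illustrates that pattern only; `FakeRun`, `wait_until_done` and the state strings `'running'`, `'finished'` and `'error'` are assumptions made for this example and do not describe the actual return values of the SDK.

```python
import time


class FakeRun:
    """Stand-in for a run handle: reports 'running' twice, then 'finished'."""

    def __init__(self):
        self._states = iter(['running', 'running', 'finished'])
        self._last = 'running'

    def get_state(self):
        # keep returning the last state once the sequence is exhausted
        self._last = next(self._states, self._last)
        return self._last


def wait_until_done(run, poll_seconds=0.01, timeout_seconds=1.0,
                    terminal=('finished', 'error')):
    """Poll run.get_state() until a terminal state or the timeout is reached."""
    deadline = time.monotonic() + timeout_seconds
    polls = 0
    while True:
        state = run.get_state()
        polls += 1
        if state in terminal:
            return state, polls
        if time.monotonic() >= deadline:
            raise TimeoutError('run did not finish in time')
        time.sleep(poll_seconds)


final_state, polls = wait_until_done(FakeRun())
print(final_state, polls)
```

A timeout keeps the notebook cell from blocking forever if the remote run never reaches a terminal state.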

When the run has completed you can retrieve the results:

In [None]:
# get training results
print(run.get_results())

After training, DQ0 runs the model checker to evaluate whether the trained model is safe and allowed for prediction. Get the state of the checker run, together with the other state information, with the `get_state()` function:

In [None]:
# get the state whenever you like
print(run.get_state())

## Predict
Finally, it's time to use the trained model to predict something:

In [None]:
import numpy as np
import pandas as pd

# get the latest model
model = project.get_latest_model()

# check DQ0 privacy clearing
if model.predict_allowed:

    # create predict set
    records = [
        {
            'lastname': 'some-lastname',
            'firstname': 'some-firstname',
            'age': 45,
            'workclass': 'Private',
            'fnlwgt': 544091,
            'education': 'HS-grad',
            'education-num': 9,
            'marital-status': 'Married-AF-spouse',
            'occupation': 'Exec-managerial',
            'relationship': 'Wife',
            'race': 'White',
            'sex': 'Female',
            'capital-gain': 0,
            'capital-loss': 0,
            'hours-per-week': 25,
            'native-country': 'United-States',
            'income': '<=50K'
        },
        {
            'lastname': 'some-lastname',
            'firstname': 'some-firstname',
            'age': 29,
            'workclass': 'Federal-gov',
            'fnlwgt': 162298,
            'education': 'Masters',
            'education-num': 14,
            'marital-status': 'Married-civ-spouse',
            'occupation': 'Exec-managerial',
            'relationship': 'Husband',
            'race': 'White',
            'sex': 'Male',
            'capital-gain': 34084,
            'capital-loss': 0,
            'hours-per-week': 70,
            'native-country': 'United-States',
            'income': '<=50K'
        }
    ]
    dataset = pd.DataFrame.from_records(records)
    
    # preprocess data
    dataset = preprocess()
    
    # drop target (included above only because of compatibility with preprocess function)
    dataset.drop(['income'], axis=1, inplace=True)

    # load or get numpy predict data
    # predict_data = np.load('X_demo_predict.npy')
    predict_data = dataset.to_numpy()

    # call predict
    run = model.predict(predict_data)

    # wait for completion
    run.wait_for_completion(verbose=True)

In [None]:
# get predict results
print(run.get_results()['predict'])
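
Since the model ends in a two-unit softmax layer, the prediction output can be read as one probability pair per record. The following sketch maps such pairs back to income labels. It assumes the results are available as a list of probability rows and that the classes are `'<=50K'` and `'>50K'`, ordered the way `LabelEncoder` sorts them; the numbers in `example_probabilities` are made up for illustration and are not real model output.

```python
def to_labels(probabilities, class_names):
    """Pick the class with the highest probability for each row."""
    labels = []
    for row in probabilities:
        best_index = max(range(len(row)), key=row.__getitem__)
        labels.append(class_names[best_index])
    return labels


# LabelEncoder assigns indices in sorted order of the class names.
class_names = sorted(['>50K', '<=50K'])

# Made-up probability rows, one per predicted record.
example_probabilities = [[0.8, 0.2], [0.3, 0.7]]

labels = to_labels(example_probabilities, class_names)
print(labels)
```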