
Material for the short course "Introduction to Developing Production-Ready Machine Learning Models"

#### Moacir A. Ponti - 2024
---
# Demo notebook

1. Zip the `src` folder together with the `model_params.json` file into a `src.zip` archive
2. Upload the archive to Colab



In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
!rm -rf "/content/src"
!rm -f "/content/model_params.json"
!unzip "/content/src.zip"
!rm "/content/src.zip"

Archive:  /content/src.zip
  inflating: model_params.json       
   creating: src/
   creating: src/data/
  inflating: src/data/etl.py         
  inflating: src/data/run_etl.py     
  inflating: src/headers_.py         
  inflating: src/inference.py        
   creating: src/model/
  inflating: src/model/model.py      
  inflating: src/model/run_training.py  


In [3]:
from src.data.etl import ETL
from src.model.model import Model
import os
import sys
import pickle
import json
import pandas as pd

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.



# Loading and defining functions for the Colab demo

- these functions live in the files `run_etl.py` and `run_training.py`
- in a real production scenario we would run those scripts directly, never a notebook
- for this demo, however, I copy some of the functions here to show and explain them
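
As a rough sketch of what running the training as a script could look like, the snippet below wraps the three inputs in a small command-line parser. The actual `run_training.py` is not shown in this notebook, so the function name `parse_training_args`, the argument names, and their order are assumptions made for illustration only.

```python
import argparse


def parse_training_args(argv=None):
    """Parse command-line arguments for a hypothetical training script."""
    parser = argparse.ArgumentParser(description="Train and evaluate a model")
    # Argument names are illustrative, not taken from run_training.py
    parser.add_argument("dataset_artifact_file", help="path to the dataset .pkl file")
    parser.add_argument("model_version", help="model version, e.g. 1.0.0")
    parser.add_argument("model_params_file", help="path to the parameters .json file")
    args = parser.parse_args(argv)

    # Reject unexpected file extensions before doing any work
    if not args.dataset_artifact_file.endswith(".pkl"):
        parser.error("dataset artifact must be a .pkl file")
    if not args.model_params_file.endswith(".json"):
        parser.error("model parameters must be a .json file")
    return args


# Passing the arguments explicitly keeps the example runnable inside a notebook
args = parse_training_args(
    ["dataset_california_housing_1.0.0.pkl", "1.0.0", "model_params.json"]
)
print(args.dataset_artifact_file, args.model_version, args.model_params_file)
```

Failing fast on bad input, with a non-zero exit code, is what makes a script usable from a scheduler or pipeline, which a notebook cell that only prints a message cannot offer.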

In [13]:
def train_and_evaluate(dataset_artifact_file, model_version, model_params_file):
    """ Trains and evaluates the model
    Args:
        dataset_artifact_file: str - the path to the dataset artifact file
        model_version: str - the model version
        model_params_file: str - the path to the model parameters file
    Returns:
        tuple: model, dictionary with evaluation
    """
    # create an instance of the Model class
    model = Model(model_version)

    # loads dataset artifact from the pickle file
    dataset_artifact = ETL.deserialize_dataset(dataset_artifact_file)
    if not dataset_artifact:
        print('Could not load dataset artifact')
        # return a pair so that callers unpacking (model, dict_eval) do not fail
        return None, None

    # validate dataset
    if not ETL.validate_dataset(dataset_artifact):
        print('Dataset is not valid')
        return None, None

    # loads JSON into a dictionary
    with open(model_params_file, 'r') as file:
        model_params = json.load(file)

    ignore_columns = [col for col in dataset_artifact['dataset'].columns if col not in dataset_artifact['features']]

    X, X_test, y, y_test = model.create_train_test_split(dataset_artifact, test_size=0.2, split_by='random',
                                                         time_column=None, target_column='target',
                                                         ignore_columns=ignore_columns)

    model.train(X, y, model_params)
    dict_eval = model.evaluate_model(X_test, y_test, dataset_version=dataset_artifact['version'], verbose=True)

    return model, dict_eval
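
The call above uses `split_by='random'` with `test_size=0.2`, which presumably shuffles the rows and holds out 20% of them. The implementation of `Model.create_train_test_split` is not shown here, so the following is only a minimal standard-library sketch of a random hold-out split. The function name `random_split` and the fixed seed are assumptions.

```python
import random


def random_split(rows, test_size=0.2, seed=42):
    """Shuffle rows and split them into (train, test) lists."""
    if not 0.0 < test_size < 1.0:
        raise ValueError("test_size must be between 0 and 1")
    shuffled = list(rows)                    # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)    # seeded for reproducibility
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# 17000 rows gives a 3400-row hold-out, the test-set size reported
# in the evaluation output of this notebook
train, test = random_split(range(17000), test_size=0.2)
print(len(train), len(test))  # 13600 3400
```

Fixing the seed makes the split repeatable, which matters when evaluation numbers for different model versions are meant to be compared on the same test rows.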

In [7]:
# content of inference.py
import numpy as np

def load_model(model_file):
    """
    Loads the model from a pickle file
    Args:
        model_file: str - the path to the pickle file
    Returns:
        model: the model
    """
    model = Model.load(model_file)
    return model


def predict(model, data):
    """
    Predicts the target
    Args:
        model: the model
        data: the instance to predict (in this demo, loaded from a JSON file)
    Returns:
        dict: the score and the number of missing features
    """

    # check if the JSON contains all the features defined in model.features;
    # if not, impute the missing ones with NULL
    # predict
    scores, missing_features = model.predict(data)

    response = {'score': round(float(scores[1]), 4),
                'no_missing_features': missing_features}

    return response
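
The comment in `predict` mentions checking the incoming JSON against `model.features` and imputing whatever is missing with NULL. Since `Model.predict` itself is not shown, here is a hedged sketch of how such a check might look. `align_features` is a hypothetical helper, and using `None` as the NULL placeholder is an assumption.

```python
def align_features(data, features):
    """Return feature values in the expected order plus the number missing.

    data: dict parsed from the request JSON
    features: list of feature names the model expects
    """
    row = []
    n_missing = 0
    for name in features:
        if name in data and data[name] is not None:
            row.append(data[name])
        else:
            row.append(None)  # placeholder for a missing value
            n_missing += 1
    return row, n_missing


features = ["longitude", "latitude", "median_income"]
row, n_missing = align_features({"longitude": -122.2, "median_income": 8.3}, features)
print(row, n_missing)  # [-122.2, None, 8.3] 1
```

Returning the count alongside the values lets the caller report it in the response, as the `no_missing_features` field does, so that a score computed from incomplete input can be flagged downstream.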

# Example: using a public dataset

California housing:
- The target is whether the median house value is above 90K USD

## ETL

In [9]:
csv_file = '/content/sample_data/california_housing_train.csv'

# this content is also in the run_etl.py file

# create an instance of the ETL class
etl = ETL()
# load the dataset from the csv file
dataset = etl.load_dataset_from_csv(csv_file)

# select which columns will be used as features of the model
etl.select_columns(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'])

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')

In [10]:
# as part of the ETL process we also define a "target" column

# here, median_house_value above 90K becomes target 1 and everything else becomes target 0
dataset.rename({'median_house_value':'target'}, axis=1, inplace=True)
dataset['target'] = (dataset['target'] > 90000).astype(int)
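
After binarizing the target it can be worth checking how many rows fall in each class, since a skewed split changes how metrics such as F1 should be read. The sketch below uses made-up house values (not rows from the dataset) and plain Python rather than pandas, purely to illustrate the thresholding rule and the check. The names `binarize` and `THRESHOLD` are hypothetical.

```python
from collections import Counter

THRESHOLD = 90000  # same cut-off as in the cell above


def binarize(values, threshold=THRESHOLD):
    """Map each value to 1 if strictly above the threshold, else 0."""
    return [int(v > threshold) for v in values]


# Hypothetical median house values, for illustration only
values = [66900, 90000, 90001, 250000, 500001]
targets = binarize(values)
counts = Counter(targets)
positive_rate = counts[1] / len(targets)

print(targets)        # [0, 0, 1, 1, 1]
print(positive_rate)  # 0.6
```

Note that a value of exactly 90000 maps to 0, because the comparison is strict.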

In [11]:
# build dictionary containing dataset information
dataset_artifact = {'name': 'california_housing',
                    'version': '1.0.0',
                    'features': etl.selected_features,
                    'dataset': dataset}

# serialize it (creates an artifact that can be stored, versioned, and later retrieved)
etl.serialize_dataset(dataset_artifact)
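
The file name used later in this notebook, `dataset_california_housing_1.0.0.pkl`, suggests that the artifact is pickled under a name built from its `name` and `version`. The code of `ETL.serialize_dataset` is not shown, so the sketch below is an assumption about that behaviour: the helper names and the `dataset_<name>_<version>.pkl` pattern are inferred, not confirmed.

```python
import os
import pickle
import tempfile


def serialize_artifact(artifact, directory="."):
    """Pickle the artifact as dataset_<name>_<version>.pkl and return the path."""
    filename = f"dataset_{artifact['name']}_{artifact['version']}.pkl"
    path = os.path.join(directory, filename)
    with open(path, "wb") as file:
        pickle.dump(artifact, file)
    return path


def deserialize_artifact(path):
    """Load an artifact back; return None if the file does not exist."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as file:
        return pickle.load(file)


# A plain list stands in for the DataFrame to keep the example dependency-free
artifact = {"name": "california_housing", "version": "1.0.0",
            "features": ["longitude", "latitude"], "dataset": [[-122.2, 37.8]]}

with tempfile.TemporaryDirectory() as tmp:
    path = serialize_artifact(artifact, tmp)
    restored = deserialize_artifact(path)

print(os.path.basename(path))  # dataset_california_housing_1.0.0.pkl
print(restored == artifact)    # True
```

Putting the version in the file name keeps older artifacts retrievable side by side. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.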

In [12]:
dataset_artifact['features']

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income'],
      dtype='object')

## Model Training

- requires a dataset artifact
- requires a `json` file with the model parameters
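
The contents of `model_params.json` are not shown in this notebook. As an illustration only, the sketch below writes and reads back a parameters file. The keys are common LightGBM parameter names, but both the keys and the values are assumptions, not the ones used in the course, and `load_model_params` is a hypothetical helper.

```python
import json
import os
import tempfile

# Hypothetical parameters; not the actual contents of model_params.json
example_params = {"n_estimators": 200, "learning_rate": 0.05, "num_leaves": 31}


def load_model_params(path):
    """Load model parameters from a JSON file and check they form a dictionary."""
    with open(path, "r") as file:
        params = json.load(file)
    if not isinstance(params, dict):
        raise ValueError("model parameters must be a JSON object")
    return params


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model_params.json")
    with open(path, "w") as file:
        json.dump(example_params, file, indent=2)
    model_params = load_model_params(path)

print(model_params)
```

Keeping hyperparameters in a file rather than in code means a training run can be reproduced from three inputs: the dataset artifact, the model version, and the parameters file.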

In [14]:
dataset_artifact_file = 'dataset_california_housing_1.0.0.pkl'
model_params_file = 'model_params.json'

if not dataset_artifact_file.endswith('.pkl') or not model_params_file.endswith('.json'):
    raise ValueError('Please provide a valid pickle file and json file')

# check that the params json file exists
if not os.path.exists(model_params_file):
    raise FileNotFoundError(f'Model Params file: {model_params_file} not found')

In [15]:
model_version = '1.0.0'

# run training and evaluation
model, dict_eval = train_and_evaluate(dataset_artifact_file, model_version, model_params_file)

Model (v=1.0.0) evaluation:
Test set version = 1.0.0, size = 3400 rows x 8 feats
	PRAUC: 0.9524
	ROCAUC: 0.8426
	F1: 0.9642
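
For reference, F1 is the harmonic mean of precision and recall on the positive class. The sketch below computes it from true and predicted labels in plain Python. The labels are made up and unrelated to the scores reported above, `f1_score_binary` is a hypothetical helper, and `Model.evaluate_model` may well compute its metrics differently (for instance through a library).

```python
def f1_score_binary(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0]
precision, recall, f1 = f1_score_binary(y_true, y_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```

F1 ignores true negatives, so it tends to look high when the positive class is the majority. Reading it alongside a threshold-free metric such as ROCAUC gives a fuller picture.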


In [16]:
# serialize the model to a file
model.serialize()

In [17]:
# creates an instance example to be used later in the inference test
model.instance_example_to_json()

## Inference

In [18]:
# define the model file and the data file with the instance to score
model_file = 'model_1.0.0.pkl'
data_file = 'instance.json'

# the code below is also in the inference.py script

# loads model from pickle
model = load_model(model_file)
# loads data from json
with open(data_file, 'r') as file:
    data = json.load(file)

print(f'Inference with model {model.model_name}, version {model.version}...')
# predict
response = predict(model, data)

print(response)

Inference with model lightgbm, version 1.0.0...
{'score': 1.0, 'no_missing_features': 0}


In [19]:
!python /content/src/inference.py model_1.0.0.pkl instance.json

Dask dataframe query planning is disabled because dask-expr is not installed.

You can install it with `pip install dask[dataframe]` or `conda install dask`.
This will raise in a future version.

