# Developing our LightGBM model locally

Before creating our LightGBM container for SageMaker, let's create a simple model and test it locally.

Let's install LightGBM in the SageMaker notebook:

In [None]:
!pip install lightgbm==2.3.1

The dataset used in this experiment is a toy dataset called Iris (http://archive.ics.uci.edu/ml/datasets/iris). The dataset contains 3 classes of 50 instances each, where each class refers to a type of iris plant and the goal is to classify the correct species based on features like sepal and petal width and length. 

<img src='./media/iris.jpg' alt='iris'  class="center">

The challenge itself is very basic, so you can focus on the mechanics and the features of this automated environment later.

As a requirement, suppose our **F1 score must be greater than 90%** in order for our model to go to production.

Feel free to develop your own **LightGBM model** (take a look at the [docs for the parameters](https://lightgbm.readthedocs.io/en/latest/Parameters.html) and [the Python APIs](https://lightgbm.readthedocs.io/en/latest/Python-API.html)):

In [None]:
import os

import lightgbm as lgb

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
import pandas as pd
import numpy as np
import joblib

In [None]:
iris = datasets.load_iris()

X=iris.data
y=iris.target

dataset = np.insert(iris.data, 0, iris.target,axis=1)

df = pd.DataFrame(data=dataset, columns=['iris_id'] + iris.feature_names)
## We'll also save the dataset, with header, since we'll need it to create a baseline for the monitoring
df['species'] = df['iris_id'].map(lambda x: 'setosa' if x == 0 else 'versicolor' if x == 1 else 'virginica')

df.head()

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)

For simplicity, let's use [LightGBM's scikit-learn API](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier):

In [None]:
# Create the multiclass classifier (the random seed is set with set_params in the next cell)
gbm = lgb.LGBMClassifier(objective='multiclass',
                        num_class=len(np.unique(y)),
                        )

In [None]:
gbm.set_params(num_leaves=40,
              max_depth=10,
              learning_rate=0.11,
              random_state=42)

In [None]:
gbm.fit(X_train, y_train,
        eval_set=[(X_test, y_test)],
        eval_names='[validation_softmax]',
        eval_metric='softmax',
        early_stopping_rounds=5,
        verbose=5)

In [None]:
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

In [None]:
y_pred

Let's check our F1 score:

In [None]:
score = f1_score(y_test,y_pred,labels=[0.0,1.0,2.0],average='micro')

In [None]:
score

Looks like our F1 score is good enough and our code is working!
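As a side note on `average='micro'`: micro-averaging pools the true positives, false positives and false negatives of all classes before computing F1 once. For a single-label multiclass problem like this one, that makes the micro F1 score equal to plain accuracy. The sketch below illustrates this in pure Python; the helper `micro_f1` and the demo labels are hypothetical and are not the notebook's predictions.

```python
def micro_f1(y_true, y_pred, labels):
    """Micro-averaged F1: pool TP, FP and FN over all labels, then compute F1 once."""
    tp = fp = fn = 0
    for label in labels:
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1  # correctly predicted this label
            elif pred == label and truth != label:
                fp += 1  # predicted this label, but the truth is another one
            elif pred != label and truth == label:
                fn += 1  # missed this label
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels, for illustration only
y_true_demo = [0, 0, 1, 1, 2, 2]
y_pred_demo = [0, 0, 1, 2, 2, 2]

f1_demo = micro_f1(y_true_demo, y_pred_demo, labels=[0, 1, 2])
accuracy_demo = sum(t == p for t, p in zip(y_true_demo, y_pred_demo)) / len(y_true_demo)
print(f1_demo, accuracy_demo)
```

If per-class behavior matters for the production requirement, `average='macro'` would weight each species equally instead.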

We create a directory called `models` to save the trained models:

In [None]:
os.makedirs('./models', exist_ok=True)

In [None]:
joblib.dump(gbm, os.path.join('models', 'nb_model.joblib'))

In [None]:
# Check the loaded model
loaded_gbm = joblib.load('models/nb_model.joblib')

y_loaded_pred = loaded_gbm.predict(X_test, num_iteration=loaded_gbm.best_iteration_)

f1_score(y_test,y_loaded_pred,labels=[0.0,1.0,2.0],average='micro')

Now, let's turn the code cells above into a Python script. This way we'll be able to automate the training and also use our custom training script in SageMaker.

The idea is that we can **pass the hyperparameters and also the data location into our script**. After training, the script **will save our model in the specified directory**. In addition, the script can load a specific model from a directory, with a function called `model_fn`. We could run the script multiple times with different configurations if we wanted to.
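To make the idea concrete, here is a minimal sketch of how command-line arguments can be split into hyperparameters and locations. It assumes that every hyperparameter defaults to `None`, so that only the values explicitly passed are forwarded to the model. The function name `parse_hyperparameters` and the default paths are placeholders chosen for this illustration.

```python
import argparse


def parse_hyperparameters(argv):
    """Parse CLI arguments and return (all args, only the hyperparameters that were set)."""
    parser = argparse.ArgumentParser()

    # Hyperparameters default to None so that unset ones can be skipped
    parser.add_argument('--num_leaves', type=int, default=None)
    parser.add_argument('--learning_rate', type=float, default=None)

    # Location arguments; these default paths are placeholders
    parser.add_argument('--model-dir', type=str, default='models')
    parser.add_argument('--train', type=str, default='data/train')

    args = parser.parse_args(argv)

    # argparse stores '--model-dir' under the attribute name 'model_dir'
    locations = {'model_dir', 'train'}
    hyperparams = {key: value for key, value in vars(args).items()
                   if value is not None and key not in locations}
    return args, hyperparams


demo_args, demo_hyperparams = parse_hyperparameters(['--num_leaves', '40', '--train', 'data/train'])
print(demo_hyperparams)
```

Only `num_leaves` ends up in the dictionary, so any hyperparameter left out on the command line keeps the library's own default.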

First, we save the train and test datasets to a local folder called `data`:

In [None]:
# Create directory and write csv
os.makedirs('./data', exist_ok=True)
os.makedirs('./data/raw', exist_ok=True)
os.makedirs('./data/train', exist_ok=True)
os.makedirs('./data/test', exist_ok=True)
os.makedirs('./data/test_no_label', exist_ok=True)

In [None]:
np_data_raw = np.concatenate((X, np.expand_dims(y, axis=1)), axis=1)
np.savetxt('./data/raw/iris.csv', np_data_raw, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f, %1.0f')

np_data_train = np.concatenate((X_train, np.expand_dims(y_train, axis=1)), axis=1)
np.savetxt('./data/train/iris_train.csv', np_data_train, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f, %1.0f')

np_data_test = np.concatenate((X_test, np.expand_dims(y_test, axis=1)), axis=1)
np.savetxt('./data/test/iris_test.csv', np_data_test, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f, %1.0f')
np.savetxt('./data/test_no_label/iris_test_no_label.csv', X_test, delimiter=',', fmt='%1.1f, %1.1f, %1.1f, %1.1f')
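A note on the file format written above: when `fmt` is a single string with several specifiers, `np.savetxt` applies it to the whole row and ignores `delimiter`, so each field is separated by a comma followed by a space. The sketch below reproduces one such line with plain Python string formatting; the sample values are hypothetical.

```python
# The same row format string passed to np.savetxt above
row_format = '%1.1f, %1.1f, %1.1f, %1.1f, %1.0f'

# Hypothetical sample: four measurements followed by the class id
sample_row = (5.1, 3.5, 1.4, 0.2, 0.0)

# This is what one line of the CSV file looks like
line = row_format % sample_row
print(line)

# Reading the line back: split on the comma and strip the extra space
parsed = [float(field.strip()) for field in line.split(',')]
print(parsed)
```

The features are written with one decimal place, which matches the precision of the Iris measurements, and the label is written as an integer.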

Then, we create the training script `train.py` and save it in a local directory called `source_dir`. We save the trained model in the `models` directory:

In [None]:
os.makedirs('./source_dir', exist_ok=True)

In [None]:
%%writefile source_dir/train.py
import argparse
import logging
import os
import sys

import joblib
import lightgbm as lgb
import numpy as np
import pandas as pd
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
logger.addHandler(logging.StreamHandler(sys.stdout))

def _get_data(path):
    input_files = [ os.path.join(path, file) for file in os.listdir(path) ]
    if len(input_files) == 0:
        # Report the path that was actually searched (do not rely on the global `args`)
        raise ValueError(('There are no files in {}.\n' +
                          'This usually indicates that the channel was incorrectly specified,\n' +
                          'the data specification in S3 was incorrectly specified or the role specified\n' +
                          'does not have permission to access the data.').format(path))

    raw_data = [ pd.read_csv(file, header=None, engine="python") for file in input_files ]
    data = pd.concat(raw_data)
    X=data.iloc[:,:4]
    y=data.iloc[:,4]
    
    return X, y
    

def train(args):
    '''
    Main function for initializing SageMaker training in the hosted infrastructure.
    
    Parameters
    ----------
    args: the parsed input arguments of the script. The objects assigned as attributes of the namespace. It's the populated namespace.
    
    See: https://docs.python.org/3/library/argparse.html#argparse.ArgumentParser.parse_args
    '''

    # Take the set of files and read them all into a single pandas dataframe
    logger.info('Loading the data...')
    
    X_train, y_train = _get_data(args.train)
    X_test, y_test = _get_data(args.test)
    
    logger.info(f'Train data with shape: X={X_train.shape} y={y_train.shape}')
    logger.info(f'Validation data with shape: X={X_test.shape} y={y_test.shape}')

    logger.info('Starting training...')
    gbm = lgb.LGBMClassifier(objective='multiclass',
                            num_class=len(np.unique(y_train)))
    
    # Keep only the hyperparameters that were explicitly set (skip locations and unset values)
    hyperparams = {a_key: a_value for a_key, a_value in vars(args).items() if (a_value is not None and a_key not in ['model_dir', 'train', 'test'])}
    print('hyperparameters:', hyperparams)
    
    gbm.set_params(**hyperparams)
    
    logger.info(f'Using configuration:\n{gbm}')
    gbm.fit(X_train, y_train,
            eval_set=[(X_test, y_test)],
            eval_names='[validation_softmax]',
            eval_metric='softmax',
            early_stopping_rounds=5,
            verbose=5)

    y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

    score = f1_score(y_test,y_pred,labels=[0.0,1.0,2.0],average='micro')

    # generate evaluation metrics
    logger.info(f'[F1 score] {score}')
                                                              
    save_model(gbm, args.model_dir)
                                                              
def save_model(model, model_dir):
    '''
    Function for saving the model in the expected directory for SageMaker.
    
    Parameters
    ----------
    model: a Scikit-Learn estimator
    model_dir: A string that represents the path where the training job writes the model artifacts to. After training, artifacts in this directory are uploaded to S3 for model hosting. (By default, this is the value of the SageMaker environment variable SM_MODEL_DIR.)
    '''
    logger.info(f"Saving the model in directory '{model_dir}'")
                                                              
    # Serialize the trained model to model_dir with joblib
    joblib.dump(model, os.path.join(model_dir, "model.joblib"))


def model_fn(model_dir):
    """Deserialize and return the fitted model.
    
    Note that the file name must match the one used in save_model.
    """
    estimator = joblib.load(os.path.join(model_dir, "model.joblib"))
    return estimator


# Main script entry for SageMaker to run when initializing training
                                                              
if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # Hyperparameters (if not specified, LightGBM's defaults are used)
    parser.add_argument('--num_leaves', type=int, default=None)
    parser.add_argument('--max_depth', type=int, default=None)
    parser.add_argument('--learning_rate', type=float, default=None)
    parser.add_argument('--random_state', type=int, default=None) 
    
    # SageMaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_VALIDATION'])

    args = parser.parse_args()
#     print(args)
                                                              
    train(args)

Now let's test the script:

In [None]:
# In the SageMaker environment, the default directories will be:
# (no quotes around the values: %env would store them as part of the value)
%env SM_MODEL_DIR=/opt/ml/model
%env SM_CHANNEL_TRAIN=/opt/ml/input/data/train
%env SM_CHANNEL_VALIDATION=/opt/ml/input/data/test

In [None]:
# To run our script locally we will overwrite the defaults (passing other local directories to load data and save models)
# We will run with LGBM's defaults (no hyperparams defined):
!python source_dir/train.py --model-dir models --train data/train --test data/test

In [None]:
# Now, we will run with other LGBM hyperparameters:
# (We expect final validation loss of 0.138846 and F1 score of 0.94)
!python source_dir/train.py --num_leaves 40 --max_depth 10 --learning_rate 0.11 --random_state 42 --model-dir models --train data/train --test data/test

Looks like our script is working as expected!
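The local runs work because a value passed on the command line takes precedence over the default read from the environment variable. Here is a minimal sketch of that mechanism, assuming a single `--train` argument; `setdefault` is used only so that the snippet also runs where the variable is not defined.

```python
import argparse
import os

# Make sure the variable exists for this illustration (an existing value is kept)
os.environ.setdefault('SM_CHANNEL_TRAIN', '/opt/ml/input/data/train')

parser = argparse.ArgumentParser()
parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])

# No argument given: the environment variable provides the location
default_args = parser.parse_args([])

# Argument given: the local directory overrides the environment default
local_args = parser.parse_args(['--train', 'data/train'])

print(default_args.train)
print(local_args.train)
```

Inside a SageMaker training job the first case applies, since the platform defines these variables for the container.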

We check whether the model was saved correctly in the `models` directory:

In [None]:
gbm_loaded = joblib.load('models/model.joblib')
gbm_loaded

## The end of the local development! 

## Now, we will develop our custom LightGBM container

## &rarr; [CLICK HERE TO MOVE ON](./1_training-container.ipynb)