Maggy enables you to train with TensorFlow distributed optimizers.
With Maggy, you need only minimal changes to train your model in a distributed fashion.

### 0. Spark Session

Make sure you have a running Spark Session/Context available.

On Hopsworks, just run your notebook to start the Spark application.

### 1. Model definition

Let's define the model we want to train. The layers of the model have to be defined in the \_\_init__ function.

Do not instantiate the class, otherwise you won't be able to use Maggy.
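
The reason for passing the class rather than an instance can be illustrated with a minimal, framework-free sketch. The training function later in this notebook calls `model(nl=...)`, which only works if `model` is still the class. `TinyModel` and `build_on_workers` below are hypothetical stand-ins, not part of Maggy or TensorFlow.

```python
# Hypothetical stand-ins for illustration only; not Maggy or TensorFlow APIs.
class TinyModel:
    def __init__(self, nl=4):
        # Layers would be created here, once per constructed object.
        self.layers = [f"dense_{i}" for i in range(nl)]


def build_on_workers(model_cls, num_workers, **kwargs):
    """Simulate each worker constructing its own model from the class."""
    return [model_cls(**kwargs) for _ in range(num_workers)]


# Pass the class itself (no parentheses), as with `model = NeuralNetwork`.
replicas = build_on_workers(TinyModel, num_workers=2, nl=3)

# Each simulated worker ends up with an independent object.
print(replicas[0] is replicas[1])   # False
print(len(replicas[0].layers))      # 3
```

Passing an already constructed instance would leave nothing to call with the chosen hyperparameters, which is why the class is kept uninstantiated.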

In [0]:
from pyspark.sql import SparkSession
import os

try:
  import tensorflow as tf
except ModuleNotFoundError:
  %pip install tensorflow-cpu==2.4.1


In [0]:
import tensorflow as tf
if tf.__version__ != '2.4.1':
  %pip install tensorflow-cpu==2.4.1

In [0]:
from tensorflow import keras 
from tensorflow.keras.layers import Dense
from tensorflow.keras import Sequential
from tensorflow.keras.optimizers import Adam

# You can use keras.Sequential(); you just need to subclass it
# in a custom class and define the layers in __init__()
class NeuralNetwork(Sequential):
        
    def __init__(self, nl=4):
        
        super().__init__()
        # Each sample is a vector of 4 features
        self.add(Dense(10, input_shape=(4,), activation='tanh'))
        if nl >= 4:
            for _ in range(nl - 2):
                self.add(Dense(8, activation='tanh'))
        self.add(Dense(3,activation='softmax'))

model = NeuralNetwork

### 2. Dataset creation

You can create the dataset here and pass it to the TfDistributedConfig, or create it inside the training function.

Make sure the dataset paths are correct for your environment.

In [0]:
display(dbutils.fs.ls("/FileStore/tables/iris_train-2.csv"))

path,name,size
dbfs:/FileStore/tables/iris_train-2.csv,iris_train-2.csv,3745


In [0]:
train_set_path = "dbfs:/FileStore/tables/iris_train-2.csv"
test_set_path = "dbfs:/FileStore/tables/iris_test-1.csv"

In [0]:
train_set = spark.read.format("csv").option("header","true")\
  .option("inferSchema", "true").load(train_set_path).drop('_c0')

test_set = spark.read.format("csv").option("header","true")\
  .option("inferSchema", "true").load(test_set_path).drop('_c0')

raw_train_set = train_set.toPandas().values
raw_test_set = test_set.toPandas().values

In [0]:
def process_data(train_set, test_set):
    
    # Columns 0-3 hold the features; the remaining columns hold the labels.
    
    X_train = train_set[:,0:4]
    y_train = train_set[:,4:]
    X_test = test_set[:,0:4]
    y_test = test_set[:,4:]

    return (X_train, y_train), (X_test, y_test)
  
train_set, test_set = process_data(raw_train_set, raw_test_set)
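
As a small illustration of the slicing in `process_data`, the sketch below applies the same column split to a made-up array. It assumes each row holds four feature columns followed by three one-hot label columns. That layout is an assumption, consistent with the model's three-unit softmax output and the `categorical_crossentropy` loss, and has not been checked against the CSV files.

```python
import numpy as np

# Two made-up rows: 4 features followed by a 3-column one-hot label (assumed layout).
rows = np.array([
    [5.1, 3.5, 1.4, 0.2, 1.0, 0.0, 0.0],
    [6.3, 3.3, 6.0, 2.5, 0.0, 0.0, 1.0],
])

features = rows[:, 0:4]   # first four columns
labels = rows[:, 4:]      # everything from column 4 onwards

print(features.shape)  # (2, 4)
print(labels.shape)    # (2, 3)
```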

### 3. Defining the training function

The programming model is to wrap the model training code in a function. Put all imports and every other part of your experiment inside that function.

The function should return the metric you want to optimize. It should coincide with the metric reported in the Keras callback (see next point).
You can also return the full list of metrics; in that case only the loss element is printed.
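
A minimal sketch of picking a single metric out of a Keras-style evaluation result. With `metrics=['accuracy']`, `model.evaluate` returns a list of the form `[loss, accuracy]`. The helper `select_metric` and the sample numbers are hypothetical and serve only as an illustration.

```python
def select_metric(scores, names, wanted):
    """Return the value in `scores` whose position matches `wanted` in `names`."""
    return scores[names.index(wanted)]


# Made-up values standing in for the output of model.evaluate(...)
scores = [0.35, 0.93]
names = ["loss", "accuracy"]   # same order Keras reports in model.metrics_names

print(select_metric(scores, names, "accuracy"))  # 0.93
```

Returning the selected value keeps the optimization direction unambiguous: accuracy is maximized, while loss would be minimized.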

In [0]:
def hpo_function(number_layers, reporter):
  
  model = NeuralNetwork(nl=number_layers)
  model.build()
  
  # Compile, fit and evaluate the model
  model.compile(Adam(learning_rate=0.04), 'categorical_crossentropy', metrics=['accuracy'])
  train_input, test_input = process_data(raw_train_set, raw_test_set)

  train_batch_size = 75
  test_batch_size = 15
  epochs = 10
  
  model.fit(x=train_input[0], y=train_input[1],
            batch_size=train_batch_size,
            epochs=epochs,
            verbose=1)

  score = model.evaluate(x=test_input[0], y=test_input[1], batch_size=test_batch_size, verbose=1)
                         
  print(f'Test loss: {score[0]}')
  print(f'Test accuracy: {score[1]}')

  return score[1]

In [0]:
def training_function(model, train_set, test_set, hparams):
    
    model = model(nl=hparams['number_layers'])
    model.build()
    # Compile, fit and evaluate the model
    model.compile(Adam(learning_rate=hparams['learning_rate']), 'categorical_crossentropy', metrics=['accuracy'])

    model.fit(train_set, epochs=hparams['epochs'])

    # evaluate() returns [loss, accuracy] because one metric is configured
    scores = model.evaluate(test_set)

    return scores

### 4. Configuring the experiment

To use Maggy's distributed training, we have to configure the model to train and pass it to TfDistributedConfig.
The model class has to be an implementation of __tf.keras.Model__.
We can also define __train_set__, __test_set__ and, optionally, the __model_parameters__. __model_parameters__ is a dictionary
containing the parameters to be used in the \_\_init__ function of your model.

In [0]:
from maggy.experiment_config import OptimizationConfig
from maggy import Searchspace

# The searchspace can be instantiated with parameters
sp = Searchspace(number_layers=('INTEGER', [2, 8]))

hpo_config = OptimizationConfig(num_trials=4, optimizer="randomsearch", searchspace=sp, direction="max", es_interval=1, es_min=5, name="hp_tuning_test")
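
To make the configuration concrete, here is a framework-free sketch of what a random search over an integer range with `direction="max"` amounts to. It is not Maggy's implementation: `random_search` and `fake_objective` are hypothetical, the objective is a made-up function rather than a real training run, and the bounds are assumed to be inclusive.

```python
import random


def random_search(objective, low, high, num_trials, seed=0):
    """Sample integers in [low, high] and keep the trial with the highest metric."""
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        number_layers = rng.randint(low, high)   # inclusive bounds
        metric = objective(number_layers)
        if best is None or metric > best["metric"]:
            best = {"number_layers": number_layers, "metric": metric}
    return best


def fake_objective(number_layers):
    # Made-up score that peaks at 5 layers; stands in for test accuracy.
    return 1.0 - abs(number_layers - 5) / 10


best = random_search(fake_objective, low=2, high=8, num_trials=4)
print(best)
```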

### 5. Run distributed training

Finally, we are ready to launch the Maggy experiment. You just need to pass two parameters: the training function and the configuration object defined in the previous steps.

In [0]:
from maggy import experiment

result = experiment.lagom(train_fn=hpo_function, config=hpo_config)

print(result)

In [0]:
from maggy.experiment_config.tf_distributed import TfDistributedConfig

#define the constructor parameters of your model
model_params = {
    #train dataset entries / num_workers
    'train_batch_size': 75,
    #test dataset entries / num_workers
    'test_batch_size': 15,
    'learning_rate': 0.04,
    'epochs': 20,
    'number_layers': result['best_config']['number_layers'],
}

training_config = TfDistributedConfig(name="tf_test", model=model, train_set=train_set, test_set=test_set, process_data=process_data, hparams = model_params)

In [0]:
experiment.lagom(training_function, training_config)
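
The comments on `train_batch_size` and `test_batch_size` describe each value as the number of dataset entries divided by the number of workers. A minimal sketch of that arithmetic follows. `per_worker_batch_size` is a hypothetical helper, and the entry and worker counts are made-up examples, not values read from the notebook's data.

```python
def per_worker_batch_size(num_entries, num_workers):
    """Split a dataset evenly across workers, rounding down, with at least 1 per worker."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return max(1, num_entries // num_workers)


# Example: 150 training rows and 30 test rows shared between 2 workers.
print(per_worker_batch_size(150, 2))  # 75
print(per_worker_batch_size(30, 2))   # 15
```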