---
# Technology Selection

What algorithms to use and what technologies are available.

## Nature of the problem

* Multi-label Binary Classification 

Each label is a binary classification task, for which many algorithms have been developed and applied in practice, e.g. spam filters.

* Naive Bayes - [Naive Bayes and Text Classification](https://arxiv.org/abs/1410.5329)
* CNN - [Convolutional Neural Networks for Sentence Classification](https://arxiv.org/pdf/1408.5882.pdf)
* DNN Language Model - [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

## Assessment

Skipped due to the time constraint.

## Decision

**Transformer Deep Neural Network Architecture**, applied through transfer learning (fine-tuning) of a pre-trained language model.

1. Transformers are state-of-the-art algorithms that are being actively researched.
2. Pre-trained models for text classification, e.g. text sentiment analysis, are available.
3. Other well-explored algorithms have already been well tested, as published on Kaggle.










---
# Implementation



## ML Model for Fine Tuning

### Framework
* Google TensorFlow 2.x 
* Keras for training the model
* Huggingface Transformer library

### Data allocation
Utilize [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) to:
1. shuffle the training data
2. allocate the ratio R of the data for validation. R=0.2
3. train the model on the remaining (1-R) ratio of the data

Apply the trained model on testing data for evaluation.
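
The three allocation steps can be pictured with a small standard-library sketch. This is only an illustration of the idea, not how `train_test_split` is implemented; the function name `shuffle_and_split`, the `seed` argument, and the toy comments are assumptions made for the example.

```python
import random


def shuffle_and_split(data, labels, validation_ratio=0.2, seed=0):
    """Shuffle (data, label) pairs together, then hold out a fraction for validation."""
    assert len(data) == len(labels)
    assert 0.0 < validation_ratio < 1.0

    # 1. Shuffle the row positions so data and labels stay aligned.
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)

    # 2. The first R of the shuffled rows go to validation.
    n_validation = int(round(len(indices) * validation_ratio))
    validation_idx = indices[:n_validation]

    # 3. The remaining (1-R) rows are used for training.
    train_idx = indices[n_validation:]

    def pick(sequence, positions):
        return [sequence[i] for i in positions]

    return (
        pick(data, train_idx), pick(data, validation_idx),
        pick(labels, train_idx), pick(labels, validation_idx),
    )


# Toy input (hypothetical): 10 comments with alternating labels.
comments = [f"comment {i}" for i in range(10)]
labels = [i % 2 for i in range(10)]
train_x, val_x, train_y, val_y = shuffle_and_split(comments, labels, validation_ratio=0.2)
print(len(train_x), len(val_x))  # 8 training rows, 2 validation rows
```

The return order mirrors the one `train_test_split` uses for two input arrays (train data, validation data, train labels, validation labels).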

### Hyper parameter search
* Learning rate (5e-5, 5e-4, 5e-3) as the start value
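
A search over these start values could be organised as a simple loop that keeps the candidate with the lowest validation loss. The sketch below is an assumption about how such a loop might look; `pick_learning_rate` and `toy_validation_loss` are hypothetical names, and the toy loss is a made-up curve so that the example runs without training anything. It does not represent measured results.

```python
import math


def pick_learning_rate(candidates, validation_loss_for):
    """Evaluate every candidate and return (best_lr, {lr: validation_loss})."""
    results = {lr: validation_loss_for(lr) for lr in candidates}
    best_lr = min(results, key=results.get)
    return best_lr, results


def toy_validation_loss(learning_rate):
    # Hypothetical stand-in, NOT a measurement: a bowl-shaped curve in
    # log10(learning_rate), used only so the sketch is runnable.
    return (math.log10(learning_rate) + 4.0) ** 2


best_lr, results = pick_learning_rate([5e-5, 5e-4, 5e-3], toy_validation_loss)
print(best_lr)
```

In a real run, `validation_loss_for` would train a model with the given learning rate and return, for example, the lowest `val_loss` recorded in the Keras training history.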

### Epoch
The number of passes N over the entire training data set. N=10 because of the time constraint.

### Early stopping
Utilize Keras [EarlyStopping](https://keras.io/api/callbacks/early_stopping/) to stop the training when the monitored metric has not improved for N consecutive epochs. N=5.
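
The patience rule can be illustrated with a simplified sketch in plain Python. It only mimics the counting logic (it ignores options such as `min_delta` and weight restoration), and the validation losses are made-up numbers chosen for the example.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 0-based epoch at which training would stop, or None if it never stops."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses: the best value is reached at epoch 2.
losses = [0.60, 0.45, 0.40, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47]
stop_epoch = early_stopping_epoch(losses, patience=5)
print(stop_epoch)  # 7: five epochs (3..7) passed without beating 0.40
```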

### Reduce learning rate at no improvement
Utilize Keras [ReduceLROnPlateau](https://keras.io/api/callbacks/reduce_lr_on_plateau/) to reduce the learning rate when the monitored metric has not improved for N consecutive epochs. N=3.
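
As a rough illustration of the plateau rule, the sketch below multiplies the learning rate by a factor each time the validation loss fails to improve for `patience` epochs. It is a simplification that assumes no `min_delta`, `cooldown`, or `min_lr`; the factor 0.2 is the one used by the `ReduceLRCallback` in this notebook, and the loss values are made up.

```python
def reduce_lr_schedule(val_losses, initial_lr=5e-5, factor=0.2, patience=3):
    """Return the learning rate in effect after each epoch."""
    lr = initial_lr
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    schedule = []
    for loss in val_losses:
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr *= factor  # plateau detected: shrink the learning rate
                wait = 0
        schedule.append(lr)
    return schedule


# Made-up validation losses: epochs 2, 3 and 4 do not beat the best value 0.40.
schedule = reduce_lr_schedule([0.50, 0.40, 0.41, 0.42, 0.43, 0.39])
print(schedule)
```

Because the reduction patience (3) is smaller than the early-stopping patience (5), the learning rate is lowered at least once before early stopping can trigger.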


### Keras Callbacks

Utilize [Keras Callbacks API](https://keras.io/api/callbacks/) to apply Early Stopping, Reduce Learning Rate, and TensorBoard during the model training.



In [43]:
class SavePretrainedCallback(tf.keras.callbacks.Callback):
    """
    This is only for directly working on the Huggingface models.
    
    Hugging Face models have a save_pretrained() method that saves both 
    the weights and the necessary metadata to allow them to be loaded as 
    a pretrained model in future. This is a simple Keras callback that 
    saves the model with this method whenever the validation loss improves.
    
    """
    def __init__(self, output_dir, **kwargs):
        super().__init__()
        self.output_dir = output_dir
        self.lowest_val_loss=np.inf
        self.best_epoch = -1
        self.verbose = kwargs.get('verbose', False)

    def on_epoch_end(self, epoch, logs=None):
        """
        Save only the best model
        - https://stackoverflow.com/a/68042600/4281353
        - https://www.tensorflow.org/guide/keras/custom_callback
        
        TODO: 
        save_pretrained() method is in the HuggingFace model only.
        Need to implement logic to save plain Keras models.
        """
        logs = logs or {}
        val_loss = logs.get('val_loss')
        if val_loss is None:
            # No validation loss was reported for this epoch; nothing to compare.
            return
        if (self.best_epoch < 0) or (val_loss < self.lowest_val_loss):
            if self.verbose:
                print(f"Model val_loss improved: [{val_loss} < {self.lowest_val_loss}]")
                print(f"Saving to {self.output_dir}")
            self.lowest_val_loss = val_loss
            self.best_epoch = epoch
            self.model.save_pretrained(self.output_dir)


class TensorBoardCallback(tf.keras.callbacks.TensorBoard):
    """TensorBoard visualization of the model training
    See https://keras.io/api/callbacks/tensorboard/
    """
    def __init__(self, output_directory):
        super().__init__(
            log_dir=output_directory,
            write_graph=True,
            write_images=True,
            histogram_freq=1,     # log histogram visualizations every 1 epoch
            embeddings_freq=1,    # log embedding visualizations every 1 epoch
            update_freq="epoch",  # every epoch
        )


class EarlyStoppingCallback(tf.keras.callbacks.EarlyStopping):
    """Stop training when no progress on the metric to monitor
    https://keras.io/api/callbacks/early_stopping/
    https://machinelearningmastery.com/how-to-stop-training-deep-neural-networks-at-the-right-time-using-early-stopping/

    Using val_loss to monitor. 
    https://datascience.stackexchange.com/a/49594/68313
    Prefer the loss to the accuracy. Why? The loss quantifies how certain 
    the model is about a prediction. The accuracy merely accounts for 
    the number of correct predictions. Similarly, any metric using hard 
    predictions rather than probabilities has the same problem.
    """
    def __init__(self, patience=3):
        assert patience > 0
        super().__init__(
            monitor='val_loss', 
            mode='min', 
            verbose=1, 
            patience=patience,
            restore_best_weights=True
        )


class ModelCheckpointCallback(tf.keras.callbacks.ModelCheckpoint):
    """Check point to save the model
    See https://keras.io/api/callbacks/model_checkpoint/

    NOTE: Did not work with HuggingFace models. It failed with the error:
        NotImplementedError: Saving the model to HDF5 format requires the model 
        to be a Functional model or a Sequential model. 
        It does not work for subclassed models, because such models are defined 
        via the body of a Python method, which isn't safely serializable. 
    """
    def __init__(self, path_to_file):
        """
        Args:
            path_to_file: path to the model file to save at check points
        """
        super().__init__(
            filepath=path_to_file, 
            monitor='val_loss', 
            mode='min', 
            save_best_only=True,
            save_freq="epoch",
            verbose=1
        )


class ReduceLRCallback(tf.keras.callbacks.ReduceLROnPlateau):
    """Reduce learning rate when a metric has stopped improving.
    See https://keras.io/api/callbacks/reduce_lr_on_plateau/
    """
    def __init__(self, patience=3):
        assert patience > 0
        super().__init__(
            monitor="val_loss",
            factor=0.2,
            patience=patience,
            verbose=1
        )


### Fine Tuning Runner

The Runner class implements the fine-tuning based on the [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased) pretrained model. Each classification category, e.g. ```toxic```, has a dedicated Runner class instance. The ***Distilled*** BERT model is used so that the training can run on limited resources.


In [44]:
from tensorflow.keras.models import (
    Sequential
)
from tensorflow.keras.layers import (
    Dense
)


tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


class Runner:
    """Fine tuning implementation class
    See:
    - https://www.tensorflow.org/guide/keras/train_and_evaluate
    - https://stackoverflow.com/questions/68172891/
    - https://stackoverflow.com/a/68172992/4281353

    The TF/Keras model has a base model layer, e.g. 'distilbert' for DistilBERT, which
    comes from the base model TFDistilBertModel.
    https://huggingface.co/transformers/model_doc/distilbert.html#tfdistilbertmodel

    TFDistilBertForSequenceClassification has classification layers added on top
    of TFDistilBertModel, hence not required to add fine-tuning layers by users.
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #   
    =================================================================
    distilbert (TFDistilBertMain multiple                  66362880  
    _________________________________________________________________
    pre_classifier (Dense)       multiple                  590592    
    _________________________________________________________________
    classifier (Dense)           multiple                  1538      
    _________________________________________________________________
    dropout_59 (Dropout)         multiple                  0         
    =================================================================
    """
    # ================================================================================
    # Class
    # ================================================================================
    USE_HF_TRAINER = False
    TOKENIZER_LOWER_CASE = True
    # _model_name = 'distilbert-base-cased'
    _model_name = 'distilbert-base-uncased'
    _model_base_name = 'distilbert'
    _tokenizer = DistilBertTokenizerFast.from_pretrained(
        _model_name, 
        do_lower_case=TOKENIZER_LOWER_CASE
    )

    # ================================================================================
    # Instance
    # ================================================================================
    # --------------------------------------------------------------------------------
    # Instance properties
    # --------------------------------------------------------------------------------
    @property
    def category(self):
        """Category of the text comment classification, e.g. toxic"""
        return self._category

    @property
    def num_labels(self):
        """Number of labels to classify"""
        assert self._num_labels > 0
        return self._num_labels

    @property
    def tokenizer(self):
        """BERT tokenizer. The tokenizer must match the pretrained model"""
        return self._tokenizer

    @property
    def max_sequence_length(self):
        """Maximum token length the BERT tokenizer can accept. Max 512
        """
        assert 128 <= self._max_sequence_length <= 512
        return self._max_sequence_length

    @property
    def X(self):
        """Training TensorFlow DataSet"""
        return self._X

    @property
    def V(self):
        """Validation TensorFlow DataSet"""
        return self._V

    @property
    def model_name(self):
        """HuggingFace pretrained model name"""
        return self._model_name

    @property
    def model_base_name(self):
        """HuggingFace pretrained base model name"""
        return self._model_base_name

    @property
    def model(self):
        """TensorFlow/Keras Model instance"""
        return self._model

    @property
    def freeze_pretrained_base_model(self):
        """Boolean to freeze the base model"""
        return self._freeze_pretrained_base_model

    @property
    def batch_size(self):
        """Mini batch size during the training"""
        assert self._batch_size > 0
        return self._batch_size

    @property
    def learning_rate(self):
        """Training learning rate"""
        return self._learning_rate

    @property
    def reduce_lr_patience(self):
        """Training patience for reducing learning rate"""
        return self._reduce_lr_patience

    @property
    def early_stop_patience(self):
        """Training patience for early stopping"""
        return self._early_stop_patience

    @property
    def num_epochs(self):
        """Number of maximum epochs to run for the training"""
        return self._num_epochs

    @property
    def output_directory(self):
        """Parent directory to manage training artefacts"""
        return self._output_directory

    @property
    def model_directory(self):
        """Directory to save the trained models"""
        return self._model_directory

    @property
    def log_directory(self):
        """Directory to save logs, e.g. TensorBoard logs"""
        return self._log_directory

    @property
    def model_metric_names(self):
        """Model metrics
        The attribute model.metrics_names gives labels for the scalar metrics
        to be returned from model.evaluate().
        """
        return self.model.metrics_names

    @property
    def history(self):
        """The history object returned from model.fit(). 
        The object holds a record of the loss and metric during training
        """
        assert self._history is not None
        return self._history

    @property
    def trainer(self):
        """HuggingFace trainer instance
        HuggingFace offers an optimized Trainer because PyTorch does not have
        the training loop as Keras/Model has. It is available for TensorFlow
        as well, hence to be able to hold the instance in case using it.
        """
        return self._trainer

    # --------------------------------------------------------------------------------
    # Instance initialization
    # --------------------------------------------------------------------------------
    def __init__(
            self,
            category,
            training_data,
            training_label,
            validation_data,
            validation_label,
            num_labels=2,
            max_sequence_length=256,
            freeze_pretrained_base_model=False,
            batch_size=16,
            learning_rate=5e-5,
            early_stop_patience=5,
            reduce_lr_patience=2,
            num_epochs=3,
            output_directory="./output"
    ):
        """
        Args:
            category: 
            training_data: 
            training_label:
            validation_data:
            validation_label:
            num_labels: Number of labels
            max_sequence_length=256: maximum tokens for tokenizer
            freeze_pretrained_base_model: flag to freeze pretrained model base layer
            batch_size:
            learning_rate:
            early_stop_patience:
            reduce_lr_patience:
            num_epochs:
            output_directory: Directory to save the outputs
        """
        self._category = category
        self._trainer = None

        # --------------------------------------------------------------------------------
        # Model training configurations
        # --------------------------------------------------------------------------------
        assert 128 <= max_sequence_length <= 512, "max_sequence_length must be within [128, 512]"
        self._max_sequence_length = max_sequence_length

        assert num_labels > 0
        self._num_labels = num_labels

        assert isinstance(freeze_pretrained_base_model, bool)
        self._freeze_pretrained_base_model = freeze_pretrained_base_model

        assert learning_rate > 0.0
        self._learning_rate = learning_rate
        self._model = None

        assert num_epochs > 0
        self._num_epochs = num_epochs

        assert batch_size > 0
        self._batch_size = batch_size

        assert early_stop_patience > 0
        self._early_stop_patience = early_stop_patience
        self._reduce_lr_patience = reduce_lr_patience

        # model.fit() result holder
        self._history = None  

        # --------------------------------------------------------------------------------
        # Output directories
        # --------------------------------------------------------------------------------
        # Parent directory
        self._output_directory = output_directory
        Path(self.output_directory).mkdir(parents=True, exist_ok=True)
        
        # Model directory
        self._model_directory = "{parent}/model_C{category}_B{size}_L{length}".format(
            parent=self.output_directory,
            category=self.category,
            size=self.batch_size,
            length=self.max_sequence_length
        )
        Path(self.model_directory).mkdir(parents=True, exist_ok=True)

        # Log directory
        self._log_directory = "{parent}/log_C{category}_B{size}_L{length}".format(
            parent=self.output_directory,
            category=self.category,
            size=self.batch_size,
            length=self.max_sequence_length
        )
        Path(self.log_directory).mkdir(parents=True, exist_ok=True)

        # --------------------------------------------------------------------------------
        # TensorFlow DataSet
        # --------------------------------------------------------------------------------
        assert np.all(np.isin(training_label, np.arange(self.num_labels)))
        assert np.all(np.isin(validation_label, np.arange(self.num_labels)))
        self._X = tf.data.Dataset.from_tensor_slices((
            dict(self.tokenize(training_data)),
            training_label
        ))
        self._V = tf.data.Dataset.from_tensor_slices((
            dict(self.tokenize(validation_data)),
            validation_label
        ))
        del training_data, validation_data
        
        # --------------------------------------------------------------------------------
        # Model
        # --------------------------------------------------------------------------------
        config_file = self.model_directory + os.path.sep + "config.json"
        if os.path.isfile(config_file) and os.access(config_file, os.R_OK):
            # Load the saved model
            print(f"loading the saved model from {self.model_directory}...")
            self._pretrained_model = TFDistilBertForSequenceClassification.from_pretrained(
                self.model_directory,
                num_labels=num_labels
            )
        else:
            # Download the model from Huggingface
            self._pretrained_model = TFDistilBertForSequenceClassification.from_pretrained(
                self.model_name,
                num_labels=num_labels,            
            )

        # Freeze base model if required
        if self.freeze_pretrained_base_model:
            for _layer in self._pretrained_model.layers:
                if _layer.name == self.model_base_name:
                    _layer.trainable = False

        self._model = self._pretrained_model

        # The number of classes in the output must match the num_labels
        _output = self._pretrained_model(self.tokenize(["i say hello"]))
        assert _output['logits'].shape[-1] == self.num_labels, "Number of labels mismatch"

        # --------------------------------------------------------------------------------
        # Build the model
        #     from_logits in SparseCategoricalCrossentropy(from_logits=[True|False])
        #     True  when the input is logits not  normalized by softmax.
        #     False when the input is probability normalized by softmax
        # --------------------------------------------------------------------------------
        optimizer = tf.keras.optimizers.Adam(learning_rate=self.learning_rate)
        self.model.compile(
            optimizer=optimizer, 
            # loss=self.model.compute_loss,
            loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
            # loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
            # ["accuracy", "AUC"] causes an error:
            # ValueError: Shapes (None, 1) and (None, 2) are incompatible
            metrics = ["accuracy"]  
        )
        self.model.summary()

    # --------------------------------------------------------------------------------
    # Instance methods
    # --------------------------------------------------------------------------------
    def tokenize(self, sentences, truncation=True, padding='longest'):
        """Tokenize using the Huggingface tokenizer
        Args: 
            sentences: String or list of string to tokenize
            padding: Padding method ['do_not_pad'|'longest'|'max_length']
        """
        return self.tokenizer(
            sentences,
            truncation=truncation,
            padding=padding,
            max_length=self.max_sequence_length,
            return_tensors="tf"
        )

    def decode(self, tokens):
        # Use the instance tokenizer rather than the module-level global
        return self.tokenizer.decode(tokens)

    def _hf_train(self):
        """Train the model using HuggingFace Trainer"""
        self._training_args = TFTrainingArguments(
            output_dir='./results',             # output directory
            num_train_epochs=self.num_epochs,   # total number of training epochs
            per_device_train_batch_size=self.batch_size,     # batch size per device during training
            per_device_eval_batch_size=self.batch_size,      # batch size for evaluation
            warmup_steps=500,                   # number of warmup steps for learning rate scheduler
            weight_decay=0.01,                  # strength of weight decay
            logging_dir='./logs',               # directory for storing logs
            logging_steps=10,
        )

        # with self._training_args.strategy.scope():
        #     self._model = TFDistilBertForSequenceClassification.from_pretrained(self.model_name)

        self._trainer = TFTrainer(
            model=self.model,
            args=self._training_args,   # training arguments
            train_dataset=self.X,       # training dataset
            eval_dataset=self.V         # evaluation dataset
        )
        self.trainer.train()

    def _keras_train(self):
        """Train the model using Keras
        """
        # --------------------------------------------------------------------------------
        # Train the model
        # --------------------------------------------------------------------------------
        self._history = self.model.fit(
            self.X.shuffle(1000).batch(self.batch_size).prefetch(1),
            epochs=self.num_epochs,
            # batch_size is not passed to fit() because the datasets are already batched
            validation_data=self.V.shuffle(1000).batch(self.batch_size).prefetch(1),
            callbacks=[
                EarlyStoppingCallback(patience=self.early_stop_patience),
                ReduceLRCallback(patience=self.reduce_lr_patience),
                TensorBoardCallback(self.log_directory),
                SavePretrainedCallback(output_dir=self.model_directory, verbose=True),
            ]
        )
        # del self._X, self._V

    def train(self):
        """Run the model training"""
        if self.USE_HF_TRAINER:
            self._hf_train()
        else:
            self._keras_train()

    def evaluate(self, data, label):
        """Evaluate the model on the given data and label.
        https://www.tensorflow.org/api_docs/python/tf/keras/Model#evaluate
        The attribute model.metrics_names gives labels for the scalar metrics
        to be returned from model.evaluate().

        Args:
            data: data to run the prediction
            label: label for the data
        Returns: 
            scalar loss if the model has a single output and no metrics, OR 
            list of scalars (if the model has multiple outputs and/or metrics). 
        """
        assert np.all(np.isin(label, np.arange(self.num_labels)))
        test_dataset = tf.data.Dataset.from_tensor_slices((
            dict(self.tokenize(data)),
            label
        ))
        evaluation = self.model.evaluate(
            test_dataset.shuffle(1000).batch(self.batch_size).prefetch(1)
        )
        return evaluation

    def predict(self, data):
        """Calculate the prediction for the data
        Args:
            data: text data to classify
        Returns: Probabilities for label value 0 and 1
        """
        tokens = dict(self.tokenizer(
            data,
            truncation=True,
            padding=True,
            max_length=self.max_sequence_length,
            return_tensors="tf"
        ))
        logits = self.model.predict(tokens)["logits"]
        return tf.nn.softmax(logits)
        # return logits

    def save(self, path_to_dir=None):
        """Save the model in the HuggingFace format. 
        - config.json 
        - tf_model.h5  

        Args:
            path_to_dir: directory path to save the HuggingFace model artefacts
        """

        if path_to_dir is None or len(path_to_dir) == 0:
            path_to_dir = self.model_directory
        Path(path_to_dir).mkdir(parents=True, exist_ok=True)
        if self.USE_HF_TRAINER:
            self.trainer.save_model(path_to_dir)  
        else:
            # TODO: 
            #   save_pretrained() method is in the HuggingFace model only.
            #   Need to update for custom model saving.
            self.model.save_pretrained(path_to_dir)

    def load(self, path_to_dir):
        """Load the model as the HuggingFace format.
        Args:
            path_to_dir: Directory path from where to load config.json and .h5.
        """
        if os.path.isdir(path_to_dir) and os.access(path_to_dir, os.R_OK):
            self._model = TFDistilBertForSequenceClassification.from_pretrained(path_to_dir)
        else:
            raise RuntimeError(f"{path_to_dir} does not exist or is not readable")
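
`Runner.predict` returns `tf.nn.softmax(logits)`, i.e. the two logits of the classification head converted into probabilities for label 0 and label 1. The standard-library sketch below shows the same arithmetic on a single pair of logits; the logit values are made up for illustration and `softmax` here is a hand-written helper, not the TensorFlow function.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one comment: [score for label 0, score for label 1]
probabilities = softmax([-1.2, 2.3])
predicted_label = probabilities.index(max(probabilities))
print(probabilities, predicted_label)
```

The predicted label is simply the position of the larger probability, which for two labels is the position of the larger logit.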

In [45]:
def balance(
    df, 
    data_col_name,
    label_col_name,
    retain_columns,
    max_replication_ratio=sys.maxsize
):
    """Balance the data volumes of positives and negatives
    The negatives (label==0) outnumber the positives (label==1), which
    skews the data representation. To keep the model from adapting to the
    majority (negative), naively balance the volumes so that they have the same size.

    For ratio = (negatives // positives), replicate the positives 'ratio' times 
    to match the volume of negatives if ratio <= max_replication_ratio.
    When ratio > max_replication_ratio, replicate max_replication_ratio times
    to the size = (positive_size * max_replication_ratio). Then take 'size'
    rows randomly from the negatives.

    A portion of the negatives will not be used because of this balancing.
    When ratio < 2, the data is returned without balancing.

    Args:
        df: Pandas dataframe 
        data_col_name: Column name for the data
        label_col_name: Column name for the label
        retain_columns: Columns to retain in the dataframe to return
        max_replication_ratio: Upper limit on how many times to replicate the positives
    Returns: 
        Pandas dataframe with the retain_columns.
    """
    positive_indices = df.index[df[label_col_name]==1].tolist()
    negative_indices = df.index[df[label_col_name]==0].tolist()
    assert not bool(set(positive_indices) & set(negative_indices))

    positive_size = len(positive_indices)
    negative_size = len(negative_indices)
    ratio = np.minimum(negative_size // positive_size, max_replication_ratio)

    if ratio >= 2:
        # Generate equal size of indices for positives and negatives. 
        target_positive_indices = ratio * positive_indices
        target_negative_indices = np.random.choice(
            a=negative_indices, 
            size=ratio * positive_size,
            replace=False
        ).tolist()
        indices = target_positive_indices + target_negative_indices

        # Extract [data, label] with equal size of positives and negatives
        # The indices are index labels taken from df.index, so select by label (loc)
        data = df.loc[indices][
            df.columns[df.columns.isin(retain_columns)]
        ]

    else: 
        data = df[
            df.columns[df.columns.isin(retain_columns)]
        ]
    return data


def generate_runner(
    train,
    category,
    max_sequence_length,
    freeze_pretrained_base_model,
    num_labels,
    batch_size,
    num_epochs,
    learning_rate,
    early_stop_patience,
    reduce_lr_patience,
    output_directory,
    max_replication_ratio = sys.maxsize
):
    """Wrapper to create the Runner instances for the respective category
    Args:
        train: Pandas dataframe containing entire training data
        category: unhealthy comment category, e.g. 'toxic'
        max_sequence_length:
        batch_size:
        num_epochs:
        learning_rate:
        early_stop_patience:
        reduce_lr_patience:
        output_directory:
        max_replication_ratio: ratio up to which to replicate the skewed volume
    """
    print("\n--------------------------------------------------------------------------------")
    print(f"Build runner for [{category}]")
    print("--------------------------------------------------------------------------------")

    balanced = balance(
        df=train, 
        data_col_name='comment_text', 
        label_col_name=category,
        retain_columns=['id', 'comment_text', category],
        max_replication_ratio=max_replication_ratio
    )
    data = balanced['comment_text'].tolist()
    label = balanced[category].tolist()
    del balanced

    # --------------------------------------------------------------------------------
    # Split data into training and validation
    # --------------------------------------------------------------------------------
    train_data, validation_data, train_label, validation_label = train_test_split(
        data,
        label,
        test_size=.2,
        shuffle=True
    )
    del data, label

    # --------------------------------------------------------------------------------
    # Instantiate the model trainer
    # --------------------------------------------------------------------------------
    runner = Runner(
        category=category,
        training_data=train_data,
        training_label=train_label,
        validation_data=validation_data,
        validation_label=validation_label,
        max_sequence_length=max_sequence_length,
        freeze_pretrained_base_model=freeze_pretrained_base_model,
        num_labels=num_labels,
        batch_size=batch_size,
        num_epochs=num_epochs,
        learning_rate=learning_rate,
        early_stop_patience=early_stop_patience,
        reduce_lr_patience=reduce_lr_patience,
        output_directory=output_directory
    )
    return runner
    
def generate_category_runner(category):
    def f(train):
        return generate_runner(
            category=category,
            train=train,
            max_replication_ratio = sys.maxsize,
            freeze_pretrained_base_model=FREEZE_BASE_MODEL,
            num_labels=NUM_LABELS,
            batch_size=BATCH_SIZE,
            max_sequence_length=MAX_SEQUENCE_LENGTH,
            num_epochs=NUM_EPOCHS,
            learning_rate=LEARNING_RATE,
            early_stop_patience=EARLY_STOP_PATIENCE,
            reduce_lr_patience=REDUCE_LR_PATIENCE,
            output_directory=RESULT_DIRECTORY
        )
    return f

def evaluate(runner, test):
    """
    Evaluate the model of the runner
    Args:
        runner: Runner instance
        test: Pandas dataframe holding entire data
    """
    print("\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
    print(f"Model evaluation on [{runner.category}]")
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")
    test_data = test['comment_text'].tolist()
    test_label = test[runner.category].tolist()
    evaluation = runner.evaluate(test_data, test_label)

    print(f"Evaluation: {runner.model_metric_names}:{evaluation}")
    del test_data, test_label
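
The balancing rule described in the `balance` docstring can be worked through on a tiny label list. The sketch below follows the same arithmetic (integer ratio, replicate positives, sample the same number of negatives) with plain Python lists instead of a dataframe; `balanced_indices`, the `seed` argument, and the toy labels are assumptions made for this example.

```python
import random


def balanced_indices(labels, max_replication_ratio=10**9, seed=0):
    """Return row positions with equal numbers of positives and negatives."""
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    assert positives, "at least one positive is required"

    ratio = min(len(negatives) // len(positives), max_replication_ratio)
    if ratio < 2:
        # Not skewed enough to balance: keep every row as-is.
        return list(range(len(labels)))

    # Replicate each positive 'ratio' times ...
    replicated_positives = positives * ratio
    # ... and draw the same number of negatives without replacement.
    sampled_negatives = random.Random(seed).sample(negatives, ratio * len(positives))
    return replicated_positives + sampled_negatives


# Toy labels (hypothetical): 2 positives and 8 negatives, so ratio = 8 // 2 = 4.
labels = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
indices = balanced_indices(labels)
n_positive = sum(labels[i] == 1 for i in indices)
n_negative = sum(labels[i] == 0 for i in indices)
print(n_positive, n_negative)  # 8 8
```

Capping the ratio, e.g. `max_replication_ratio=2`, yields 4 positives and 4 negatives here, which leaves half of the negatives unused.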

## Execution

In [None]:
if CLEANING_FOR_ANALYSIS and (not CLEANING_FOR_TRAINING):
    # Data has been cleaned but training needs non-cleaned data
    train, test = load_raw_data(TEST_MODE)
    print(f"Data records for training [{train['id'].count()}]")

# Drop the rows with -1. ['toxic'] >= 0 is sufficient
test = test[test['toxic'] >= 0]
gc.collect()

train.head(3)

In [1]:
# HuggingFace
MAX_SEQUENCE_LENGTH = 256   # Max token length to accept. 512 takes 1 hour/epoch on Google Colab

# Model training
NUM_LABELS = 2
FREEZE_BASE_MODEL = False
NUM_EPOCHS = 10
BATCH_SIZE = 32
LEARNING_RATE = 5e-5  # Must be small to avoid catastrophic forgetting
REDUCE_LR_PATIENCE = 3
EARLY_STOP_PATIENCE = 5

print("""
MAX_SEQUENCE_LENGTH = {}
FREEZE_BASE_MODEL = {}
NUM_EPOCHS = {}
BATCH_SIZE = {}
LEARNING_RATE = {}
REDUCE_LR_PATIENCE = {}
EARLY_STOP_PATIENCE = {}
""".format(
    MAX_SEQUENCE_LENGTH,
    FREEZE_BASE_MODEL,
    NUM_EPOCHS,
    BATCH_SIZE,
    LEARNING_RATE,
    REDUCE_LR_PATIENCE,
    EARLY_STOP_PATIENCE
))


MAX_SEQUENCE_LENGTH = 256
FREEZE_BASE_MODEL = False
NUM_EPOCHS = 10
BATCH_SIZE = 32
LEARNING_RATE = 5e-05
REDUCE_LR_PATIENCE = 3
EARLY_STOP_PATIENCE = 5



In [39]:
# runners = {}      # To save the Runner instance for each category.
# evaluations = {}  # Evaluation results for each category
#     for category in CATEGORIES:
#         runners[category], evaluations[category] = run(
#             category=category, 
#             train=train,
#             test=test,
#             batch_size=BATCH_SIZE,
#             max_sequence_length=MAX_SEQUENCE_LENGTH,
#             num_epochs=NUM_EPOCHS,
#             learning_rate=LEARNING_RATE,
#             early_stop_patience=EARLY_STOP_PATIENCE,
#             reduce_lr_patience=REDUCE_LR_PATIENCE,
#             output_directory=RESULT_DIRECTORY
#         )