# Training Transformers for Next-Activity Prediction

This repository contains three scripts. Fill out this notebook to explain the code in them.

(concept taken from https://arxiv.org/abs/2104.00721)

## Transformer Architecture

(explanation for the layers, type of positional encoding... suggestions of improvements maybe...)

- Optional task: incorporate the findings of this paper on positional encoding layers for time-series data: https://link.springer.com/article/10.1007/s10618-023-00948-2

### Import the required libraries

In [1]:
import json
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import utils
from tensorflow.keras import layers
import os
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
import warnings
warnings.filterwarnings("ignore")
from sklearn import metrics 




In [7]:
# Define a custom layer for a single Transformer block
class TransformerBlock(layers.Layer):
    def __init__(self, embed_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        
        # Multi-head attention layer
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embed_dim)
        
        # Feed-forward neural network with regularization
        self.ffn = tf.keras.Sequential([
            layers.Dense(ff_dim, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
            layers.Dense(embed_dim, kernel_regularizer=tf.keras.regularizers.l2(1e-4))
        ])
        
        # Layer normalization layers
        self.layernorm_a = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm_b = layers.LayerNormalization(epsilon=1e-6)
        
        # Dropout layers to prevent overfitting
        self.dropout_a = layers.Dropout(rate)
        self.dropout_b = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # Apply layer normalization to the inputs before attention
        attn_input = self.layernorm_a(inputs)

        # Apply multi-head self-attention to the normalized inputs
        attn_output = self.att(attn_input, attn_input)

        # Apply dropout to the attention output during training
        attn_output = self.dropout_a(attn_output, training=training)

        # Add the original inputs to the attention output (residual connection)
        out_a = inputs + attn_output

        # Apply layer normalization to the output of the attention layer
        ffn_input = self.layernorm_b(out_a)

        # Apply the feed-forward network to the normalized output
        ffn_output = self.ffn(ffn_input)

        # Apply dropout to the feed-forward output during training
        ffn_output = self.dropout_b(ffn_output, training=training)

        # Add the residual connection and return the final output
        return out_a + ffn_output

# Define a custom layer for token and position embedding
class TokenAndPositionEmbedding(layers.Layer):
    def __init__(self, maxlen, vocab_size, embed_dim):
        super(TokenAndPositionEmbedding, self).__init__()

        # Embedding layer for tokens
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)

        # Embedding layer for positions
        self.pos_emb = layers.Embedding(input_dim=maxlen, output_dim=embed_dim)

    def call(self, x):
        # Get the sequence length of the input
        maxlen = tf.shape(x)[-1]

        # Create a tensor of positions (0, 1, 2, ..., maxlen-1)
        positions = tf.range(start=0, limit=maxlen, delta=1)

        # Embed the positions
        positions = self.pos_emb(positions)

        # Embed the tokens
        x = self.token_emb(x)

        # Return the sum of token embeddings and position embeddings
        return x + positions

# Build a Transformer-based next-activity prediction model
def get_model(max_case_length, vocab_size, output_dim, embed_dim=128, num_heads=8, ff_dim=256):
    """
    Builds a Transformer-based classifier that predicts the next activity of a case.

    Args:
        max_case_length: Maximum length of the input sequence (activity prefix).
        vocab_size: Number of tokens in the input vocabulary.
        output_dim: Number of output classes (possible next activities).
        embed_dim: Dimensionality of the token embeddings (default: 128).
        num_heads: Number of heads in the multi-head attention layer (default: 8).
        ff_dim: Hidden dimensionality of the feed-forward network (default: 256).

    Returns:
        A tf.keras.Model that outputs one logit per class.
    """
    inputs = layers.Input(shape=(max_case_length,))
    x = TokenAndPositionEmbedding(max_case_length, vocab_size, embed_dim)(inputs)

    # Stack the transformer blocks. The training flag is not hard-coded here, so
    # Keras enables dropout during fit() and disables it during predict().
    for _ in range(2):  # Number of transformer blocks
        x = TransformerBlock(embed_dim, num_heads, ff_dim)(x)
        
    x = layers.GlobalAveragePooling1D()(x)
    x = layers.Dropout(0.3)(x)  # Slightly increased dropout rate
    x = layers.Dense(256, activation="relu", kernel_regularizer=tf.keras.regularizers.l2(1e-4))(x)  # Increased dense layer size
    x = layers.Dropout(0.3)(x)  # Slightly increased dropout rate
    outputs = layers.Dense(output_dim, activation="linear")(x)
    
    transformer = tf.keras.Model(inputs=inputs, outputs=outputs, name="improved_transformer")
    return transformer


**TransformerBlock Class**

- Initialization (__init__ method):

    - Multi-Head Attention Layer (self.att): Performs self-attention over num_heads heads so that each head can capture a different aspect of the input sequence. Because key_dim=embed_dim, every head projects its queries and keys to embed_dim dimensions (the original Transformer uses embed_dim / num_heads per head).
    - Feed-Forward Network (self.ffn): A small neural network of two dense layers with a ReLU activation in between. Both layers use L2 regularization to limit overfitting.
    - Layer Normalization (self.layernorm_a, self.layernorm_b): These layers normalize each position's feature vector to zero mean and unit variance, then apply a learned scale and shift, which helps stabilize and speed up training.
    - Dropout Layers (self.dropout_a, self.dropout_b): These layers randomly drop a fraction of the input units during training to prevent overfitting.

- Forward Pass (call method):

    - Layer Normalization (attn_input): Normalizes the inputs before applying attention.
    - Multi-Head Attention (attn_output): Computes the attention scores and outputs the attended values.
    - Dropout (self.dropout_a): Applies dropout to the attention output during training.
    - Residual Connection (out_a): Adds the original input to the attention output to form a residual connection.
    - Layer Normalization (ffn_input): Normalizes the output of the residual connection.
    - Feed-Forward Network (ffn_output): Applies the feed-forward neural network to the normalized output.
    - Dropout (self.dropout_b): Applies dropout to the feed-forward network output during training.
    - Residual Connection: Adds out_a (the un-normalized output of the first residual connection) to the feed-forward network output and returns the sum. Because normalization is applied before each sub-layer rather than after it, this is a pre-layer-normalization (pre-LN) block.
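
The forward pass above can be traced on a single feature vector in plain Python. This is a minimal sketch of the normalize-then-residual pattern only, not the Keras layer itself: `toy_sublayer` is a hypothetical stand-in for attention or the feed-forward network, and the learned scale and shift of `LayerNormalization` are left out.

```python
import math

def layer_norm(vec, epsilon=1e-6):
    """Normalize one feature vector to zero mean and (almost) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + epsilon) for v in vec]

def pre_ln_sublayer(vec, sublayer):
    """Residual connection around a sub-layer that sees the normalized input."""
    normed = layer_norm(vec)
    # The residual adds the ORIGINAL vector, not the normalized one.
    return [v + s for v, s in zip(vec, sublayer(normed))]

def toy_sublayer(vec):
    # Hypothetical stand-in for attention or the feed-forward network.
    return [0.5 * v for v in vec]

x = [1.0, 2.0, 3.0, 4.0]
normed = layer_norm(x)
out = pre_ln_sublayer(x, toy_sublayer)
print([round(v, 3) for v in normed])
print([round(v, 3) for v in out])
```

Calling `pre_ln_sublayer` twice in a row, once per sub-layer, mirrors the order of operations in `TransformerBlock.call`.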

**TokenAndPositionEmbedding Class**

- Initialization (__init__ method):

    - Token Embedding (self.token_emb): Embeds each token in the input sequence (here, an activity from the event log) into a dense vector of fixed size (embed_dim).
    - Position Embedding (self.pos_emb): Embeds each position in the input sequence into a dense vector of the same size (embed_dim), so the two can be added.

- Forward Pass (call method):

    - Sequence Length (maxlen): Determines the length of the input sequence dynamically.
    - Position Tensor (positions): Creates a tensor representing the positions in the input sequence.
    - Position Embedding (positions): Converts the position tensor into position embeddings.
    - Token Embedding (x): Converts the input tokens into token embeddings.
    - Sum of Embeddings: Adds the position embeddings to the token embeddings to form the final input representation.
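
A small plain-Python sketch of the sum described above may help. The two lookup tables are hypothetical 2-dimensional examples (in the model they are learned `Embedding` weights); the point is that the same token gets a different final vector at each position.

```python
# Hypothetical 2-dimensional embedding tables (learned weights in the real model).
token_table = {0: [0.0, 0.0], 1: [0.1, 0.3], 2: [0.5, -0.2]}
position_table = {0: [0.01, 0.02], 1: [0.03, 0.04], 2: [0.05, 0.06]}

def embed(token_ids):
    """Return one vector per position: token embedding + position embedding."""
    return [
        [t + p for t, p in zip(token_table[tok], position_table[pos])]
        for pos, tok in enumerate(token_ids)
    ]

# Token 1 appears at positions 0 and 1 and receives two different vectors.
embedded = embed([1, 1, 2])
print(embedded)
```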


- **TokenAndPositionEmbedding:**
    - Token Embeddings: Capture the meaning of an activity (learned during training from the event log). Activities that occur in similar contexts tend to get similar embeddings.
    - Positional Embeddings: Encode the order of activities (one learned vector per position in the sequence).
 
#### Function:
- **get_model**:
    1. Input Layer (layers.Input(shape=(max_case_length,))):
        - Defines the starting point for the model, accepting sequences of integers that represent the activities in a case prefix.
        - max_case_length is the fixed input length; shorter prefixes are padded to it.
    2. Token and Position Embedding (x = TokenAndPositionEmbedding(...)):
        - Transforms integer sequences into dense vectors by summing:
            - Token Embeddings: Capture the meaning of each activity (learned during training).
            - Positional Embeddings: Encode the order of activities in the prefix.
    3. Transformer Blocks (for _ in range(2): ...):
        - Stacks two TransformerBlock instances in the code above (the first experiment below used four). Each block contains:
            - Multi-head Attention: Relates every activity in the prefix to every other activity.
            - Feed-forward Network (FFN): Transforms each position independently with a non-linear layer.
            - Normalization and Dropout: Stabilize training and limit overfitting.
    4. Sequence Pooling (x = layers.GlobalAveragePooling1D()(x)):
        - Summarizes the whole prefix by averaging the block outputs over all positions.
    5. Classification Head (layers.Dense(...)...):
        - Predicts the next activity:
            - Dropout (0.3): Limits overfitting by randomly dropping units during training; applied before and after the hidden dense layer.
            - Dense layer (256 units, ReLU): With the default embed_dim of 128, projects the pooled features to a higher dimension.
            - Dense layer (output_dim units, linear activation): Produces one raw score (logit) per class; the softmax is applied inside the loss function.
    6. Model Creation (transformer = tf.keras.Model(...)):
        - Combines all the steps into a single tf.keras.Model instance named "improved_transformer".
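
Because the final `Dense` layer uses `activation="linear"`, the model returns raw scores (logits), and the training cell below compiles the loss with `from_logits=True`. The following plain-Python sketch shows, for one hypothetical logit vector, how such scores relate to probabilities, to the sparse cross-entropy loss, and to the `argmax` used at prediction time. It illustrates the arithmetic only and is not the Keras implementation.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_cross_entropy(logits, true_class):
    """Cross-entropy for one example with an integer label, starting from logits."""
    return -math.log(softmax(logits)[true_class])

logits = [2.0, 0.5, -1.0]  # hypothetical output of the final Dense layer
probs = softmax(logits)

# Softmax preserves the ordering of the scores, so argmax over logits
# and argmax over probabilities pick the same class.
pred_from_logits = max(range(len(logits)), key=lambda i: logits[i])
pred_from_probs = max(range(len(probs)), key=lambda i: probs[i])

loss_if_correct = sparse_cross_entropy(logits, true_class=0)
loss_if_wrong = sparse_cross_entropy(logits, true_class=2)
print(probs, pred_from_logits, loss_if_correct, loss_if_wrong)
```

This is why the evaluation loop can call `np.argmax` directly on the output of `predict` without applying a softmax first.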

## Dataset

(What is the dataset, what is the vocabulary size, why is it processed the way it is)

In [8]:
class LogsDataLoader:
    def __init__(self, name, dir_path = "./datasets"):
        """Provides support for reading and 
            pre-processing examples from processed logs.
        Args:
            name: str: name of the dataset as used during processing raw logs
            dir_path: str: Path to dataset directory
        """
        self._dir_path = f"{dir_path}/{name}/processed"

    def prepare_data_next_activity(self, df, 
        x_word_dict, y_word_dict, 
        max_case_length, shuffle=True):
        
        x = df["prefix"].values
        y = df["next_act"].values
        if shuffle:
            x, y = utils.shuffle(x, y)

        token_x = list()
        for _x in x:
            token_x.append([x_word_dict[s] for s in _x.split()])
        # token_x = np.array(token_x, dtype = np.float32)

        token_y = list()
        for _y in y:
            token_y.append(y_word_dict[_y])
        # token_y = np.array(token_y, dtype = np.float32)

        token_x = tf.keras.preprocessing.sequence.pad_sequences(
            token_x, maxlen=max_case_length)

        token_x = np.array(token_x, dtype=np.float32)
        token_y = np.array(token_y, dtype=np.float32)

        return token_x, token_y

    def get_max_case_length(self, train_x):
        train_token_x = list()
        for _x in train_x:
            train_token_x.append(len(_x.split()))
        return max(train_token_x)

    def load_data(self, task):
        train_df = pd.read_csv(f"{self._dir_path}/{task}_train.csv")
        test_df = pd.read_csv(f"{self._dir_path}/{task}_test.csv")

        with open(f"{self._dir_path}/metadata.json", "r") as json_file:
            metadata = json.load(json_file)

        x_word_dict = metadata["x_word_dict"]
        y_word_dict = metadata["y_word_dict"]
        max_case_length = self.get_max_case_length(train_df["prefix"].values)
        vocab_size = len(x_word_dict) 
        total_classes = len(y_word_dict)

        return (train_df, test_df, 
            x_word_dict, y_word_dict, 
            max_case_length, vocab_size, 
            total_classes)


- LogsDataLoader loads and pre-processes sequential data from processed log files.

    - It takes a dataset name and directory path during initialization.
    - The prepare_data_next_activity function prepares examples by mapping each activity in a prefix to its integer index (via x_word_dict), mapping the next activity to its class index (via y_word_dict), and padding every prefix with zeros on the left to max_case_length.
    - The get_max_case_length function returns the length of the longest prefix in the training data.
    - The load_data function loads the training and testing dataframes, reads the metadata (word dictionaries), and computes additional parameters such as the vocabulary size and the number of classes.
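
The tokenizing and padding steps can be reproduced in plain Python. This sketch mimics the default behaviour of `pad_sequences` (pad and truncate at the front); the dictionary contents and activity names are hypothetical, since the real ones come from `metadata.json`.

```python
def encode_prefix(prefix, word_dict, max_case_length, pad_id=0):
    """Map a space-separated prefix to integer ids and left-pad to a fixed length."""
    tokens = [word_dict[name] for name in prefix.split()]
    # Keep only the most recent activities if the prefix is too long.
    tokens = tokens[-max_case_length:]
    return [pad_id] * (max_case_length - len(tokens)) + tokens

# Hypothetical dictionary; the real one is read from metadata.json.
x_word_dict = {
    "[PAD]": 0,
    "[UNK]": 1,
    "assign-seriousness": 2,
    "take-in-charge-ticket": 3,
    "resolve-ticket": 4,
}

short = encode_prefix("assign-seriousness take-in-charge-ticket", x_word_dict, 4)
long = encode_prefix(
    "assign-seriousness take-in-charge-ticket resolve-ticket", x_word_dict, 2
)
print(short)  # [0, 0, 2, 3]
print(long)   # [3, 4]
```

Padding on the left keeps the most recent activity at the last position of every input, regardless of prefix length.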

## Training and Using The Model

(model compilation, loss function, accuracy metrics, data loading...)

### Experimenting with 4 transformer blocks

In [4]:
dataset="helpdesk"
model_dir="./models"
result_dir="./results"
task = "next_activity"

epochs=10
batch_size=12
learning_rate=0.001
gpu=0
os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)

if __name__ == "__main__":
    # Create directories to save the results and models
    model_path = f"{model_dir}/{dataset}"
    if not os.path.exists(model_path):
        os.makedirs(model_path)
    model_path = f"{model_path}/next_activity_ckpt.weights.h5"

    result_path = f"{result_dir}/{dataset}"
    if not os.path.exists(result_path):
        os.makedirs(result_path)
    result_path = f"{result_path}/results"

    # Load data
    data_loader = LogsDataLoader(name=dataset)

    (train_df, test_df, x_word_dict, y_word_dict, max_case_length, 
        vocab_size, num_output) = data_loader.load_data(task)
    
    # Prepare training examples for next activity prediction task
    train_token_x, train_token_y = data_loader.prepare_data_next_activity(train_df, 
        x_word_dict, y_word_dict, max_case_length)
    
    # Create and train a transformer model
    transformer_model = get_model(
        max_case_length=max_case_length, 
        vocab_size=vocab_size,
        output_dim=num_output)

    transformer_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

    model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=model_path,
        save_weights_only=True,
        monitor="sparse_categorical_accuracy",
        mode="max", save_best_only=True)


    transformer_model.fit(train_token_x, train_token_y, 
        epochs=epochs, batch_size=batch_size, 
        shuffle=True, verbose=2, callbacks=[model_checkpoint_callback])

    # Evaluate over all the prefixes (k) and save the results
    k, accuracies,fscores, precisions, recalls = [],[],[],[],[]
    for i in range(max_case_length):
        test_data_subset = test_df[test_df["k"]==i]
        if len(test_data_subset) > 0:
            test_token_x, test_token_y = data_loader.prepare_data_next_activity(test_data_subset, 
                x_word_dict, y_word_dict, max_case_length)   
            y_pred = np.argmax(transformer_model.predict(test_token_x), axis=1)
            accuracy = metrics.accuracy_score(test_token_y, y_pred)
            precision, recall, fscore, _ = metrics.precision_recall_fscore_support(
                test_token_y, y_pred, average="weighted")
            k.append(i)
            accuracies.append(accuracy)
            fscores.append(fscore)
            precisions.append(precision)
            recalls.append(recall)

    k.append(i + 1)
    accuracies.append(np.mean(accuracies))
    fscores.append(np.mean(fscores))
    precisions.append(np.mean(precisions))
    recalls.append(np.mean(recalls))
    print('Average accuracy across all prefixes:', np.mean(accuracies))
    print('Average f-score across all prefixes:', np.mean(fscores))
    print('Average precision across all prefixes:', np.mean(precisions))
    print('Average recall across all prefixes:', np.mean(recalls))    
    results_df = pd.DataFrame({"k":k, "accuracy":accuracies, "fscore": fscores, 
        "precision":precisions, "recall":recalls})
    results_df.to_csv(result_path+"_next_activity.csv", index=False)


Epoch 1/10

1107/1107 - 55s - loss: 0.9429 - sparse_categorical_accuracy: 0.7749 - 55s/epoch - 50ms/step
Epoch 2/10
1107/1107 - 51s - loss: 0.7630 - sparse_categorical_accuracy: 0.8131 - 51s/epoch - 46ms/step
Epoch 3/10
1107/1107 - 56s - loss: 0.7141 - sparse_categorical_accuracy: 0.8175 - 56s/epoch - 51ms/step
Epoch 4/10
1107/1107 - 53s - loss: 0.6962 - sparse_categorical_accuracy: 0.8140 - 53s/epoch - 48ms/step
Epoch 5/10
1107/1107 - 56s - loss: 0.6621 - sparse_categorical_accuracy: 0.8159 - 56s/epoch - 51ms/step
Epoch 6/10
1107/1107 - 55s - loss: 0.6460 - sparse_categorical_accuracy: 0.8162 - 55s/epoch - 50ms/step
Epoch 7/10
1107/1107 - 52s - loss: 0.6294 - sparse_categorical_accuracy: 0.8166 - 52s/epoch - 47ms/step
Epoch 8/10
1107/1107 - 53s - loss: 0.6357 - sparse_categorical_accuracy: 0.8152 - 53s/epoch - 48ms/step
Epoch 9/10
1107/1107 - 52s - loss: 0.6171 - sparse_categorical_accuracy: 0.8191 - 52s/epoch - 47ms/step
Epoch 10/10
1107/1107 - 53s - loss: 0.6243 - sparse_categorica

### Experimenting with 10 epochs and 2 transformer blocks

In [9]:
epochs = 10
# Create and train a transformer model
transformer_model = get_model(
    max_case_length=max_case_length, 
    vocab_size=vocab_size,
    output_dim=num_output)

transformer_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=model_path,
    save_weights_only=True,
    monitor="sparse_categorical_accuracy",
    mode="max", save_best_only=True)


transformer_model.fit(train_token_x, train_token_y, 
    epochs=epochs, batch_size=batch_size, 
    shuffle=True, verbose=2, callbacks=[model_checkpoint_callback])

# Evaluate over all the prefixes (k) and save the results
k, accuracies,fscores, precisions, recalls = [],[],[],[],[]
for i in range(max_case_length):
    test_data_subset = test_df[test_df["k"]==i]
    if len(test_data_subset) > 0:
        test_token_x, test_token_y = data_loader.prepare_data_next_activity(test_data_subset, 
            x_word_dict, y_word_dict, max_case_length)   
        y_pred = np.argmax(transformer_model.predict(test_token_x), axis=1)
        accuracy = metrics.accuracy_score(test_token_y, y_pred)
        precision, recall, fscore, _ = metrics.precision_recall_fscore_support(
            test_token_y, y_pred, average="weighted")
        k.append(i)
        accuracies.append(accuracy)
        fscores.append(fscore)
        precisions.append(precision)
        recalls.append(recall)
k.append(i + 1)
accuracies.append(np.mean(accuracies))
fscores.append(np.mean(fscores))
precisions.append(np.mean(precisions))
recalls.append(np.mean(recalls))
print('Average accuracy across all prefixes:', np.mean(accuracies))
print('Average f-score across all prefixes:', np.mean(fscores))
print('Average precision across all prefixes:', np.mean(precisions))
print('Average recall across all prefixes:', np.mean(recalls))  

Epoch 1/10
1107/1107 - 35s - loss: 0.8040 - sparse_categorical_accuracy: 0.7896 - 35s/epoch - 32ms/step
Epoch 2/10
1107/1107 - 31s - loss: 0.6947 - sparse_categorical_accuracy: 0.8147 - 31s/epoch - 28ms/step
Epoch 3/10
1107/1107 - 30s - loss: 0.6775 - sparse_categorical_accuracy: 0.8157 - 30s/epoch - 27ms/step
Epoch 4/10
1107/1107 - 32s - loss: 0.6525 - sparse_categorical_accuracy: 0.8141 - 32s/epoch - 29ms/step
Epoch 5/10
1107/1107 - 32s - loss: 0.6369 - sparse_categorical_accuracy: 0.8167 - 32s/epoch - 29ms/step
Epoch 6/10
1107/1107 - 29s - loss: 0.6219 - sparse_categorical_accuracy: 0.8172 - 29s/epoch - 26ms/step
Epoch 7/10
1107/1107 - 29s - loss: 0.6146 - sparse_categorical_accuracy: 0.8166 - 29s/epoch - 26ms/step
Epoch 8/10
1107/1107 - 29s - loss: 0.6122 - sparse_categorical_accuracy: 0.8145 - 29s/epoch - 26ms/step
Epoch 9/10
1107/1107 - 29s - loss: 0.6087 - sparse_categorical_accuracy: 0.8175 - 29s/epoch - 26ms/step
Epoch 10/10
1107/1107 - 30s - loss: 0.5995 - sparse_categorical_

### Experimenting with 20 epochs and 2 transformer blocks

In [10]:
epochs = 20
# Create and train a transformer model
transformer_model = get_model(
    max_case_length=max_case_length, 
    vocab_size=vocab_size,
    output_dim=num_output)

transformer_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=model_path,
    save_weights_only=True,
    monitor="sparse_categorical_accuracy",
    mode="max", save_best_only=True)


transformer_model.fit(train_token_x, train_token_y, 
    epochs=epochs, batch_size=batch_size, 
    shuffle=True, verbose=2, callbacks=[model_checkpoint_callback])

# Evaluate over all the prefixes (k) and save the results
k, accuracies,fscores, precisions, recalls = [],[],[],[],[]
for i in range(max_case_length):
    test_data_subset = test_df[test_df["k"]==i]
    if len(test_data_subset) > 0:
        test_token_x, test_token_y = data_loader.prepare_data_next_activity(test_data_subset, 
            x_word_dict, y_word_dict, max_case_length)   
        y_pred = np.argmax(transformer_model.predict(test_token_x), axis=1)
        accuracy = metrics.accuracy_score(test_token_y, y_pred)
        precision, recall, fscore, _ = metrics.precision_recall_fscore_support(
            test_token_y, y_pred, average="weighted")
        k.append(i)
        accuracies.append(accuracy)
        fscores.append(fscore)
        precisions.append(precision)
        recalls.append(recall)
k.append(i + 1)
accuracies.append(np.mean(accuracies))
fscores.append(np.mean(fscores))
precisions.append(np.mean(precisions))
recalls.append(np.mean(recalls))
print('Average accuracy across all prefixes:', np.mean(accuracies))
print('Average f-score across all prefixes:', np.mean(fscores))
print('Average precision across all prefixes:', np.mean(precisions))
print('Average recall across all prefixes:', np.mean(recalls))  

Epoch 1/20
1107/1107 - 30s - loss: 0.8050 - sparse_categorical_accuracy: 0.7916 - 30s/epoch - 27ms/step
Epoch 2/20
1107/1107 - 28s - loss: 0.6961 - sparse_categorical_accuracy: 0.8169 - 28s/epoch - 26ms/step
Epoch 3/20
1107/1107 - 26s - loss: 0.6686 - sparse_categorical_accuracy: 0.8164 - 26s/epoch - 23ms/step
Epoch 4/20
1107/1107 - 26s - loss: 0.6549 - sparse_categorical_accuracy: 0.8141 - 26s/epoch - 23ms/step
Epoch 5/20
1107/1107 - 29s - loss: 0.6325 - sparse_categorical_accuracy: 0.8170 - 29s/epoch - 27ms/step
Epoch 6/20
1107/1107 - 29s - loss: 0.6237 - sparse_categorical_accuracy: 0.8168 - 29s/epoch - 26ms/step
Epoch 7/20
1107/1107 - 27s - loss: 0.6101 - sparse_categorical_accuracy: 0.8177 - 27s/epoch - 24ms/step
Epoch 8/20
1107/1107 - 29s - loss: 0.6082 - sparse_categorical_accuracy: 0.8171 - 29s/epoch - 26ms/step
Epoch 9/20
1107/1107 - 29s - loss: 0.6011 - sparse_categorical_accuracy: 0.8180 - 29s/epoch - 26ms/step
Epoch 10/20
1107/1107 - 29s - loss: 0.6020 - sparse_categorical_

**Suggestions for Improvements**

**Positional Encoding:**

- Currently, the model uses learnable position embeddings. An alternative is to use sinusoidal position embeddings, as proposed in the original Transformer paper. Sinusoidal embeddings have the advantage of being fixed and potentially providing better inductive biases for the model.
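
As a rough illustration of the sinusoidal alternative, the table below is computed in plain Python from the formulas of the original Transformer paper: even feature indices use a sine, odd indices a cosine, with wavelengths growing geometrically. The function name is an assumption made for this sketch; wiring the table into `TokenAndPositionEmbedding` in place of `self.pos_emb` is not shown.

```python
import math

def sinusoidal_position_encoding(max_len, embed_dim):
    """Return a max_len x embed_dim table of fixed position encodings."""
    table = []
    for pos in range(max_len):
        row = []
        for j in range(embed_dim):
            # Feature pairs (2i, 2i+1) share the same frequency.
            angle = pos / (10000 ** (2 * (j // 2) / embed_dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

encoding = sinusoidal_position_encoding(max_len=5, embed_dim=4)
for row in encoding:
    print([round(v, 3) for v in row])
```

Unlike the learned embedding, this table has no trainable parameters and can be computed for any sequence length.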

**Pre-trained Embeddings:**
- Use pre-trained embeddings (e.g., GloVe, Word2Vec, or BERT embeddings) for the token embeddings. This can help the model leverage external knowledge and improve performance, especially when training data is limited.

**Transformer Block Hyperparameters:**
- Experiment with different numbers of heads (num_heads) and different feed-forward network dimensions (ff_dim) to find the best configuration for your specific dataset.

**Layer Scaling:**
- Apply layer scaling (e.g., LayerScale) to improve training stability and performance, especially for deeper models.

**Learning Rate Schedules:**
- Use a learning rate schedule such as linear warm-up followed by cosine decay to improve training stability and convergence.
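
The shape of such a schedule can be sketched as a plain-Python function of the training step. The function name, the one-epoch warm-up length, and the minimum rate are assumptions chosen for illustration; the peak rate of 0.001 and the 1107 steps per epoch are taken from the runs above.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr=1e-3, min_lr=0.0):
    """Linear warm-up from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # hold min_lr after the final step
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

steps_per_epoch = 1107               # from the training logs above
total_steps = steps_per_epoch * 10   # 10 epochs
warmup_steps = steps_per_epoch       # assumption: warm up for one epoch

for step in (0, warmup_steps // 2, warmup_steps, total_steps // 2, total_steps):
    print(step, warmup_cosine_lr(step, total_steps, warmup_steps))
```

In Keras, a schedule like this would typically be wrapped in a learning rate schedule object or a callback and passed to the optimizer instead of a constant rate.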

**Regularization:**
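- Incorporate further regularization such as decoupled weight decay to limit overfitting and improve generalization. Layer-wise adaptive rate scaling (LARS) is an optimizer technique rather than a regularizer, but it may help if training is moved to much larger batch sizes.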