# Introduction

This Jupyter Notebook document contains a series of experiments for machine learning and deep learning-based sequence analysis and sentiment analysis. The experiments include the following:

1. Machine-learning-based Sequence Analysis using an SVM (Support Vector Machine) model
2. Deep-learning-based Sequence Analysis using an LSTM (Long Short-Term Memory) model
3. Deep-learning-based Sequence Analysis using a DistilBERT (Transformer) model
4. Deep-learning-based Sentiment Analysis using a TCN (Temporal Convolutional Network) model

Each experiment focuses on analyzing sequences of events or text data using different models. The goal is to explore the performance and effectiveness of these models in various sequence analysis tasks.

The code and explanations for each experiment are provided in the subsequent cells of this Jupyter Notebook document.


# Setup


1. Environment
2. Imports
3. Globals
4. Utilities


## Environment and imports

This section sets up the environment for the experiment and includes the necessary imports. It consists of the following subsections:

1. Colab-specific setup: the installations and code checkout that are needed only on Colab.
2. Global variables and settings: the global variables and settings used throughout the experiment.
3. Utility functions: the helper functions used in the experiment.

### Colab Setup

When this notebook runs on Google Colab, some Colab-specific setup is required first: installing the required packages, updating the base environment, and cloning the codebase. These steps ensure that all dependencies are configured before the experiments run.

#### Colab Specific installations and Code Checkout

Here we check whether the notebook is running on Google Colab.
If it is, we install the condacolab package, update conda, and install the required libraries using conda.
We also clone the experiment's GitHub repository and copy the codebase and environment_setup directories if they don't already exist.
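
The Colab check can also be sketched as a small helper that does not call `get_ipython()`, which is only defined inside IPython. The helper name `running_in_colab` and its `modules` parameter are illustrative assumptions and not part of the project codebase. The sketch assumes that Colab kernels have the `google.colab` package imported.

```python
import sys


def running_in_colab(modules=None):
    """Return True when the `google.colab` package has been imported.

    `modules` defaults to `sys.modules`; it is a parameter only so that the
    check can be exercised outside Colab (an assumption of this sketch).
    """
    modules = sys.modules if modules is None else modules
    return "google.colab" in modules


print("Running in Colab:", running_in_colab())
```

Outside a notebook this returns `False` instead of raising a `NameError`.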

In [None]:
import os
import tensorflow as tf


is_colab = 'google.colab' in str(get_ipython())

if is_colab:
    !pip install condacolab
    import condacolab
    condacolab.install()
    !conda update -y -n base -c defaults conda
    !conda install -y python=3.11 cudatoolkit tensorflow cudnn
    !conda clean -ya

    if not os.path.exists('codebase'):
        !git clone https://github.com/jrgrant-uliv/capstone-project-csck700.git  
        !cp -r /content/capstone-project-csck700/codebase ./
        !cp -r /content/capstone-project-csck700/environment_setup ./

#### Resource Downloads

This block is responsible for setting up the necessary resources and dependencies for the CSCK_700 experiment. 

It creates directories for embeddings and application log datasets. 
If the CSCK_700_Resources folder does not exist, it downloads it from a Google Drive link and unzips the supporting resources into their local directories. Finally, it installs the Python requirements.
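
The directory creation done with `%mkdir -p` can be sketched in plain Python as well, which is useful when the code runs outside IPython. The helper name `ensure_directories` is an assumption made for this example; the demonstration uses a temporary directory, so no project files are touched.

```python
import os
import tempfile


def ensure_directories(root, relative_paths):
    """Create each directory under `root`, including missing parents."""
    created = []
    for relative_path in relative_paths:
        full_path = os.path.join(root, relative_path)
        os.makedirs(full_path, exist_ok=True)  # no error if it already exists
        created.append(full_path)
    return created


# Demonstrated in a temporary directory (hypothetical location for the sketch).
with tempfile.TemporaryDirectory() as tmp:
    paths = ensure_directories(
        tmp, ["artefacts/embeddings", "application_log_datasets"]
    )
    all_exist = all(os.path.isdir(p) for p in paths)
    # Calling it again is harmless because of exist_ok=True.
    ensure_directories(tmp, ["artefacts/embeddings"])
```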

In [None]:
import os

%mkdir -p artefacts/embeddings
%mkdir -p application_log_datasets

# if CSCK_700_Resources does not exist, download it
if not os.path.exists('./CSCK_700_Resources'):
    !pip install --upgrade gdown
    !gdown https://drive.google.com/drive/folders/1Nsiyt_DseGU1tMTdb08AD65Y6puLIO9B -O ./ --folder
    !unzip -oq ./CSCK_700_Resources/HDFS.zip -d ./application_log_datasets/HDFS
    !unzip -oq ./CSCK_700_Resources/glove.840B.300d.zip -d ./artefacts/embeddings

%cd environment_setup

!sh install_dependencies.sh
%cd ..

### Imports

In [None]:
# All the stock stuff
from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_predict, cross_val_score
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    roc_curve,
)

from codebase.pipeline.preprocessors.preprocessor import (
    BertEventTokenizer,
    SequenceVectorizer,
    SGTVectorizer,
)
from codebase.anomaly_detection.models import (
    SVMClassifier,
    LSTMAttentionClassifier,
    TransformerClassifier,
    TCNSentimentclassifier,
)
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import random
import warnings
import nltk
import pickle
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords


nltk.download("stopwords")
nltk.download("wordnet")
warnings.filterwarnings("ignore", category=UserWarning)

## Global variables and settings


In [None]:
# output directories
output_root = "./output"
benchmark_results_dir = f"{output_root}/benchmark_results"
benchmark_results_file = f"{benchmark_results_dir}/benchmark_results.csv"
plot_dir = f"{output_root}/plots"
plot_file = f"{plot_dir}/benchmark_results.png"
model_dir = f"{output_root}/models"

benchmark_results = []
lstm_data = {
    "HDFS": {"train": {"X": [], "y": []}, "test": {"X": [], "y": []}, "loaded": False},
    "Thunderbird": {
        "train": {"X": [], "y": []},
        "test": {"X": [], "y": []},
        "loaded": False,
    },
}
svm_data = {
    "HDFS": {"train": {"X": [], "y": []}, "test": {"X": [], "y": []}, "loaded": False},
    "Thunderbird": {
        "train": {"X": [], "y": []},
        "test": {"X": [], "y": []},
        "loaded": False,
    },
}

data_sets = {
    "HDFS": {
        # The benchmark dataset
        "struct_log": "./application_log_datasets/HDFS/HDFS.event_traces.csv",
        # The event template file
        "template_file": "./application_log_datasets/HDFS/HDFS.log_templates.csv",
    },
    "Thunderbird": {
        # The benchmark dataset
        "struct_log": "application_log_datasets/Thunderbird/Thunderbird_20M.log_structured.csv",
        # The event template file
        "template_file": "application_log_datasets/Thunderbird/Thunderbird_20M.log_templates.csv",
    },
}
print("Data sets loaded")
# print(data_sets)

### Utility functions


#### Data Loading


In [None]:
# ensure output directories exist
for directory in (output_root, benchmark_results_dir, plot_dir, model_dir):
    os.makedirs(directory, exist_ok=True)


def load_event_templates_hdfs(event_templates):
    """
    Load event templates from a CSV file and create a dictionary mapping event IDs to event texts.

    Parameters:
    event_templates (str): The path to the CSV file containing event templates.

    Returns:
    dict: A dictionary mapping event IDs to event texts.
    """
    df_event_templates = pd.read_csv(event_templates)
    # create an event_id to event_text dictionary
    event_id_to_event_text = {}
    for index, row in df_event_templates.iterrows():
        event_id_to_event_text[row["EventId"]] = row["EventTemplate"]
    return event_id_to_event_text


def load_data_hdfs(event_traces, event_templates):
    """
    Load data from HDFS and perform feature and label extraction.

    Parameters:
    event_traces (str): Path to the CSV file containing event traces.
    event_templates (str): Path to the CSV file containing event templates.

    Returns:
    tuple: A tuple containing the feature array (x) and the label array (y).
    """

    df_event_traces = pd.read_csv(event_traces)

    # Label: Success = 0, rest = 1
    df_event_traces["Label"] = df_event_traces["Label"].apply(
        lambda x: 0 if x == "Success" else 1
    )

    # feature and label extraction
    x = df_event_traces["Features"].values
    y = df_event_traces["Label"].values

    return x, y


def load_data(data_set, model, validation_data=False):
    """
    Load the benchmark dataset.

    Args:
        data_set (str): The name of the benchmark dataset (a key of `data_sets`).
        model (str): The name of the model the data is loaded for (currently unused).
        validation_data (bool): Currently unused.

    Returns:
        tuple: A tuple containing the feature array (x) and the label array (y).
            Nothing is returned if the dataset entry has no template file.
    """
    log_file = data_sets[data_set]["struct_log"]
    if "label_file" in data_sets[data_set]:
        label_file = data_sets[data_set]["label_file"]
    # load templates if in data_sets[data_set]
    template_file = None
    if "template_file" in data_sets[data_set]:
        template_file = data_sets[data_set]["template_file"]

        x, y = load_data_hdfs(log_file, template_file)
        return x, y
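
As a rough illustration of what the loaders above produce, the sketch below builds the event-ID lookup and the binary labels from two tiny in-memory CSV files. The rows and the `BlockId` column are made up for the example. The column names `EventId`, `EventTemplate`, `Label` and `Features` are the ones used by the functions above. The lookup is built with `zip` rather than `iterrows`, which is an alternative to the notebook's loop and not the code it uses.

```python
import io

import pandas as pd

# Tiny in-memory stand-ins for the template and trace CSV files (made-up rows).
templates_csv = io.StringIO(
    "EventId,EventTemplate\n"
    "E1,Receiving block <*>\n"
    "E2,Deleting block <*>\n"
)
traces_csv = io.StringIO(
    "BlockId,Label,Features\n"
    'blk_1,Success,"[E1,E2]"\n'
    'blk_2,Fail,"[E2,E2,E1]"\n'
)

# Event ID -> template text, built without iterating row by row.
df_templates = pd.read_csv(templates_csv)
event_id_to_event_text = dict(
    zip(df_templates["EventId"], df_templates["EventTemplate"])
)

# Label: Success = 0, anything else = 1.
df_traces = pd.read_csv(traces_csv)
labels = (df_traces["Label"] != "Success").astype(int).to_numpy()
features = df_traces["Features"].to_numpy()
```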

#### Data Preparation
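
The augmentation in this section derives synthetic anomalies from existing event sequences by reversing them, shuffling them, and inserting random event IDs. The sketch below shows the three techniques on a single made-up sequence. The function name `make_anomalies` and the example data are assumptions for illustration only; each variant is built from a copy so that the input sequence is not modified.

```python
import random


def make_anomalies(sequence, vocabulary, rng):
    """Return reversed, shuffled and insertion variants of one sequence.

    Each variant is built from a copy, so `sequence` is left unchanged.
    """
    reversed_seq = list(reversed(sequence))

    shuffled_seq = list(sequence)
    rng.shuffle(shuffled_seq)

    inserted_seq = list(sequence)
    position = rng.randint(0, len(inserted_seq) - 1)
    # Insert between 1 and 10 random event IDs at the chosen position.
    for _ in range(rng.randint(1, 10)):
        inserted_seq.insert(position, rng.choice(vocabulary))

    return reversed_seq, shuffled_seq, inserted_seq


rng = random.Random(42)
normal = ["E5", "E22", "E5", "E11", "E9", "E26"]  # made-up event sequence
vocabulary = sorted(set(normal))
reversed_seq, shuffled_seq, inserted_seq = make_anomalies(normal, vocabulary, rng)
```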


In [None]:
def prepare_hdfs_data(x, y):
    """
    Prepares the HDFS data for training by generating augmented data and combining it with the original data.

    Args:
        x (numpy.ndarray): The feature data.
        y (numpy.ndarray): The label data.

    Returns:
        pandas.DataFrame: The prepared data with augmented samples.

    """
    label_counts = np.bincount(y)
    pos_count = label_counts[1]
    augmentation_cap = int(pos_count * 0.75)
    print("Augmentation cap: ", augmentation_cap)
    print("Label counts: ", label_counts)

    new_abnormal = generate_augmented_data(x, augmentation_cap)

    original_data_df = pd.DataFrame({"feature": x, "label": y})
    new_abnormal_df = pd.DataFrame({"feature": new_abnormal, "label": 1})

    data_df = pd.concat([original_data_df, new_abnormal_df], ignore_index=True)
    # shuffle data_df
    data_df = data_df.sample(frac=1, random_state=42).reset_index(drop=True)
    return data_df


def generate_augmented_data(normal_data, augmentation_sample_size):
    """
    Generate augmented data by applying various anomaly generation techniques to the given normal data.

    Parameters:
    normal_data (list): A list of strings representing the normal data.
    augmentation_sample_size (int): The maximum number of sequences sampled for augmentation.

    Returns:
    list: A list of strings representing the augmented data.

    """

    random_state = 42
    random.seed(random_state)

    unique_sequence_ids = []
    augment_data = []
    for x in normal_data:
        x = x.replace("[", "").replace("]", "").replace(" ", "").split(",")
        # if the sequence is longer than 5 and has more than 3 unique sequence ids, add it to the augmentation sample
        if len(x) > 5 and len(set(x)) > 3:
            augment_data.append(x)
            if len(augment_data) >= augmentation_sample_size:
                break
        for y in x:
            if y not in unique_sequence_ids:
                unique_sequence_ids.append(y)
    print("Number of unique sequence ids: ", len(unique_sequence_ids))
    print("Size of sample for augmentation: ", len(augment_data))
    print("Sampled data: ", augment_data[0])
    # reverse a random 60% sample; copies keep the sampled sequences unchanged
    sampled_data = random.sample(augment_data, int(len(augment_data) * 0.6))
    reversed_sequences = [list(reversed(x)) for x in sampled_data]
    print("Sampled reversed sequences: ", reversed_sequences[:5])

    # shuffle a random 30% sample to generate shuffled sequences as anomalies
    sampled_data = random.sample(augment_data, int(len(augment_data) * 0.3))
    shuffled_sequences = []
    for x in sampled_data:
        x = list(x)
        random.shuffle(x)
        shuffled_sequences.append(x)
    print("Sampled shuffled sequences: ", shuffled_sequences[:5])

    # randomly insert sequence ids from unique_sequence_ids into a random 20% sample
    sampled_data = random.sample(augment_data, int(len(augment_data) * 0.2))
    inserted_sequences = []
    for x in sampled_data:
        x = list(x)
        # insert random sequence ids from unique_sequence_ids at a random point
        random_index = random.randint(0, len(x) - 1)
        # insert up to 10 sequence ids
        insert_count = random.randint(1, 10)
        for i in range(insert_count):
            random_sequence_id = random.choice(unique_sequence_ids)
            x.insert(random_index, random_sequence_id)
        inserted_sequences.append(x)
    print("Sampled inserted sequences: ", inserted_sequences[:5])
    # combine all the augmented data
    augmented_data = reversed_sequences + shuffled_sequences + inserted_sequences
    print("Number of augmented sequences: ", len(augmented_data))
    print("Sampled augmented data: ", augmented_data[:5])
    # reassemble the augmented sequences as string representations of lists
    new_abnormal = []
    for lst in augmented_data:
        lst = ",".join(lst)
        lst = f"[{lst}]"
        new_abnormal.append(lst)
    return new_abnormal


def process_text_corpus(text_corpus_df, word_split=None):
    """
    Process the text corpus dataframe by performing various transformations.

    Args:
        text_corpus_df (pandas.DataFrame): The input text corpus dataframe.
        word_split (dict, optional): A dictionary containing words to split and their replacements.

    Returns:
        pandas.DataFrame: The processed text corpus dataframe.
    """
    split_words = word_split is not None
    lm = WordNetLemmatizer()
    english_stops = set(stopwords.words("english"))

    # Train Corpus
    print("Process Training Corpus")
    text_corpus_df["EventTemplate"] = text_corpus_df["EventTemplate"].apply(
        lambda x: x.lower()
    )
    text_corpus_df["EventTemplate"] = text_corpus_df["EventTemplate"].replace(
        {"<.*?>": " "}, regex=True
    )
    text_corpus_df["EventTemplate"] = text_corpus_df["EventTemplate"].replace(
        {"[^a-zA-Z]": " "}, regex=True
    )
    text_corpus_df["EventTemplate"] = text_corpus_df["EventTemplate"].replace(
        {r"\s+": " "}, regex=True
    )
    if split_words:
        for word in word_split:
            replace = " ".join(word_split[word]).lower()
            print(replace)
            text_corpus_df["EventTemplate"] = text_corpus_df[
                "EventTemplate"
            ].str.replace(word, replace)

    text_corpus_df["EventTemplate"] = text_corpus_df["EventTemplate"].apply(
        lambda x: [
            lm.lemmatize(word)
            for word in x.split(" ")
            if word not in english_stops and word != ""
        ]
    )
    # remove '' from list in column X
    text_corpus_df["EventTemplate"] = text_corpus_df["EventTemplate"].apply(
        lambda x: [word for word in x if word != ""]
    )
    # convert list in column X to string
    text_corpus_df["EventTemplate"] = text_corpus_df["EventTemplate"].apply(
        lambda x: " ".join(x)
    )

    return text_corpus_df
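
The regular-expression steps of `process_text_corpus` can be tried on a single template with the standard library alone. This is a simplified sketch: the function name `clean_template` is an assumption, and stop-word removal and lemmatization are deliberately left out so that no NLTK download is needed.

```python
import re


def clean_template(template):
    """Lower-case a log template and keep only alphabetic tokens.

    Simplified sketch: no stop-word removal and no lemmatization.
    """
    text = template.lower()
    text = re.sub(r"<.*?>", " ", text)   # drop <*> style placeholders
    text = re.sub(r"[^a-z]", " ", text)  # replace digits and punctuation
    return " ".join(text.split())        # collapse repeated whitespace


cleaned = clean_template("Receiving block <*> src: <*> dest: <*>")
print(cleaned)  # receiving block src dest
```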

#### Encoding


In [None]:
def positional_encoding(max_len, d_model):
    """
    Generate positional encoding for transformer models.

    Parameters:
    - max_len (int): Maximum sequence length.
    - d_model (int): Dimensionality of the model.

    Returns:
    - pos_enc (np.ndarray): Positional encoding of shape (1, max_len, d_model).
    """
    position = np.arange(0, max_len)[:, np.newaxis]
    div_term = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pos_enc = np.zeros((max_len, d_model))
    pos_enc[:, 0::2] = np.sin(position * div_term)
    pos_enc[:, 1::2] = np.cos(position * div_term)
    pos_enc = pos_enc[np.newaxis, ...]
    return pos_enc.astype(np.float32)
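
The term `np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))` equals `1 / 10000 ** (2i / d_model)`. Even columns therefore hold `sin(pos / 10000 ** (2i / d_model))` and odd columns hold the matching cosine. The sketch below restates the encoding in that form and checks a few properties. The name `sinusoidal_encoding` is an assumption, and the sketch assumes an even `d_model`.

```python
import numpy as np


def sinusoidal_encoding(max_len, d_model):
    """Sinusoidal positional encoding; assumes an even `d_model`."""
    position = np.arange(max_len)[:, np.newaxis]        # (max_len, 1)
    frequency = 1.0 / (10000.0 ** (np.arange(0, d_model, 2) / d_model))
    encoding = np.zeros((max_len, d_model), dtype=np.float32)
    encoding[:, 0::2] = np.sin(position * frequency)    # even columns
    encoding[:, 1::2] = np.cos(position * frequency)    # odd columns
    return encoding[np.newaxis, ...]                    # (1, max_len, d_model)


pe = sinusoidal_encoding(max_len=4, d_model=6)
print(pe.shape)  # (1, 4, 6)
```

At position 0 every sine column is 0 and every cosine column is 1, which is a quick sanity check for the layout.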

#### Save Model


In [None]:
def save_model_deployment(classifier, vectorizer):
    """
    Saves the trained classifier model and associated artifacts for deployment.

    Args:
        classifier (Classifier): The trained classifier object.
        vectorizer (Vectorizer): The vectorizer object used for feature extraction.

    Returns:
        None
    """
    classifier.save_model_file()
    model_artefact_dir = classifier.model_artefact_dir
    if vectorizer is not None:
        tokenizer_file = os.path.join(model_artefact_dir, "tokenizer.pkl")
        vectorizer.save_tokenizer(tokenizer_file)

#### Model Evaluation


In [None]:
def evaluate_model(
    clf,
    _model,
    _data_set,
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
):
    """
    Evaluate the performance of a machine learning model.

    Args:
        clf: The classifier model.
        _model: The name of the model.
        _data_set: The name of the dataset.
        accuracies: List of accuracy scores.
        precisions: List of precision scores.
        recalls: List of recall scores.
        fscores: List of F1 scores.
        aucs: List of AUC scores.
        conf_matrices: List of confusion matrices.
        roc_curves: List of ROC curves.

    Returns:
        None
    """
    conf_matrix = np.mean(conf_matrices, axis=0).astype(np.int32)
    tn = conf_matrix[0][0]
    fp = conf_matrix[0][1]
    tp = conf_matrix[1][1]
    fn = conf_matrix[1][0]
    pos = tp + fn
    neg = fp + tn
    accuracy = np.mean(accuracies)
    precision = np.mean(precisions)
    recall = np.mean(recalls)
    f1 = np.mean(fscores)
    auc = np.mean(aucs)
    TPR = tp / pos
    FPR = fp / neg
    TNR = tn / neg
    FNR = fn / pos
    CSR = (tp + tn) / (pos + neg)
    CFR = (fp + fn) / (pos + neg)
    MTTD_Impact = (FPR * 2) + FNR

    save_conf_matrix(clf, conf_matrix)
    plot_confusion_matrix(clf, conf_matrix)
    plot_roc(clf, roc_curves, aucs)

    save_benchmark_results(
        _model,
        _data_set,
        accuracy,
        precision,
        recall,
        f1,
        auc,
        conf_matrix,
        tp,
        fp,
        tn,
        fn,
        pos,
        neg,
        TPR,
        FPR,
        TNR,
        FNR,
        CSR,
        CFR,
        MTTD_Impact
    )

    plot_metrics()


def evaluate_svm(clf, _model, _data_set, y_pred, y_true, accuracy):
    """
    Evaluate the performance of a Support Vector Machine (SVM) classifier.

    Parameters:
    - clf: The trained SVM classifier.
    - _model: The name or identifier of the model being evaluated.
    - _data_set: The name or identifier of the dataset being evaluated.
    - y_pred: The predicted labels.
    - y_true: The true labels.
    - accuracy: The accuracy scores for each cross-validation fold.

    Returns:
    None
    """

    accuracy = np.mean(accuracy)

    # Calculate precision, recall, and F-score
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    # Calculate AUC
    auc = roc_auc_score(y_true, y_pred)

    # Calculate confusion matrix
    conf_matrix = confusion_matrix(y_true, y_pred)

    # scikit-learn orders the binary confusion matrix as [[tn, fp], [fn, tp]]
    tn = conf_matrix[0][0]
    fp = conf_matrix[0][1]
    tp = conf_matrix[1][1]
    fn = conf_matrix[1][0]
    pos = tp + fn
    neg = fp + tn
    TPR = tp / pos
    FPR = fp / neg
    TNR = tn / neg
    FNR = fn / pos
    CSR = (tp + tn) / (pos + neg)
    CFR = (fp + fn) / (pos + neg)
    MTTD_Impact = (FPR * 2) + FNR


    save_conf_matrix(clf, conf_matrix)
    plot_confusion_matrix(clf, conf_matrix)
    roc = roc_curve(y_true, y_pred)
    plot_roc(clf, [roc], [auc])

    save_benchmark_results(
        _model,
        _data_set,
        accuracy,
        precision,
        recall,
        f1,
        auc,
        conf_matrix,
        tp,
        fp,
        tn,
        fn,
        pos,
        neg,
        TPR,
        FPR,
        TNR,
        FNR,
        CSR,
        CFR,
        MTTD_Impact,
    )

    plot_metrics()


def save_conf_matrix(clf, conf_matrix):
    """
    Save the confusion matrix to a CSV file.

    Args:
        clf (object): The classifier object.
        conf_matrix (array-like): The confusion matrix.

    Returns:
        None
    """
    model_conf_matrix_file = os.path.join(
        clf.model_artefact_dir, "confusion_matrix.csv"
    )
    conf_matrix_df = pd.DataFrame(
        conf_matrix,
        index=["Actual Negative", "Actual Positive"],
        columns=["Predicted Negative", "Predicted Positive"],
    )
    conf_matrix_df.to_csv(model_conf_matrix_file)


def save_benchmark_results(
    model,
    dataset,
    accuracy,
    precision,
    recall,
    f1,
    auc,
    conf_matrix,
    tp,
    fp,
    tn,
    fn,
    pos,
    neg,
    TPR,
    FPR,
    TNR,
    FNR,
    CSR,
    CFR,
    MTTD_Impact
):
    """
    Save benchmark results to a CSV file.

    Args:
        model (str): The name of the model.
        dataset (str): The name of the dataset.
        accuracy (float): The accuracy score.
        precision (float): The precision score.
        recall (float): The recall score.
        f1 (float): The F1 score.
        auc (float): The AUC score.
        conf_matrix (array-like): The confusion matrix.
        tp (int): The number of true positives.
        fp (int): The number of false positives.
        tn (int): The number of true negatives.
        fn (int): The number of false negatives.
        pos (int): The number of positive instances.
        neg (int): The number of negative instances.
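        TPR (float): The true positive rate.
        FPR (float): The false positive rate.
        TNR (float): The true negative rate.
        FNR (float): The false negative rate.
        CSR (float): (tp + tn) / (pos + neg).
        CFR (float): (fp + fn) / (pos + neg).
        MTTD_Impact (float): The MTTD impact score, computed as (2 * FPR) + FNR.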
    """
    print("Accuracy: ", accuracy)
    print("Precision: ", precision)
    print("Recall: ", recall)
    print("F1: ", f1)
    print("AUC: ", auc)
    print("Confusion Matrix: ", conf_matrix)

    # add metrics to benchmark_results
    benchmark_results.append(
        [
            model,
            dataset,
            accuracy,
            precision,
            recall,
            f1,
            auc,
            pos,
            neg,
            tp,
            fp,
            fn,
            tn,
            TPR,
            FPR,
            TNR,
            FNR,
            CSR,
            CFR,
            MTTD_Impact,
        ]
    )

    if not os.path.exists(benchmark_results_dir):
        os.makedirs(benchmark_results_dir)
    df = pd.DataFrame(
        benchmark_results,
        columns=[
            "model",
            "dataset",
            "accuracy",
            "precision",
            "recall",
            "f1",
            "auc",
            "pos",
            "neg",
            "tp",
            "fp",
            "fn",
            "tn",
            "tpr",
            "fpr",
            "tnr",
            "fnr",
            "csr",
            "cfr",
            "mttd_impact",
        ],
    )
    df.to_csv(benchmark_results_file, index=False)
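
The derived rates used by the evaluation functions all follow from the four cells of a binary confusion matrix. scikit-learn lays that matrix out as `[[tn, fp], [fn, tp]]`, with rows for actual labels and columns for predicted labels. The sketch below recomputes the rates with NumPy only. The function name `rates_from_confusion_matrix` and the example counts are made up for illustration.

```python
import numpy as np


def rates_from_confusion_matrix(conf_matrix):
    """Derive rate metrics from a 2x2 matrix laid out as [[tn, fp], [fn, tp]]."""
    (tn, fp), (fn, tp) = np.asarray(conf_matrix, dtype=float)
    pos = tp + fn  # actual positives (label 1)
    neg = tn + fp  # actual negatives (label 0)
    rates = {
        "TPR": tp / pos,
        "FPR": fp / neg,
        "TNR": tn / neg,
        "FNR": fn / pos,
        "CSR": (tp + tn) / (pos + neg),
        "CFR": (fp + fn) / (pos + neg),
    }
    # Same weighting as in the notebook: false positives count twice.
    rates["MTTD_Impact"] = 2 * rates["FPR"] + rates["FNR"]
    return rates


rates = rates_from_confusion_matrix([[90, 10], [5, 45]])  # made-up counts
```

With these counts, `TPR` is 0.9 and `FPR` is 0.1, so `MTTD_Impact` is approximately 0.3.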

#### Download and process GloVe Embeddings


In [None]:
"""
Downloads and processes the GloVe embeddings if necessary.

Args:
    None

Returns:
    None
"""

path = './artefacts/embeddings'
nb_file_path = os.path.join(path, 'glove.840B.300d.txt')
pkl_file = os.path.join(path, 'glove.840B.300d.pkl')

force_process_glove = False
if not os.path.exists(path):
    os.makedirs(path)
    force_process_glove = True

if not os.path.exists(pkl_file):
    force_process_glove = True

if force_process_glove:
    if not os.path.exists(nb_file_path):
        !wget http://nlp.stanford.edu/data/glove.840B.300d.zip -P {path}
        !unzip {path}/glove.840B.300d.zip -d {path}
        !rm {path}/glove.840B.300d.zip

    df = pd.read_csv(nb_file_path, sep=" ", quoting=3, header=None, index_col=0)
    embeddings_index = {key: val.values for key, val in df.T.items()}
    print('Found %s word vectors.' % len(embeddings_index))
    # cache the parsed embeddings so later runs can skip the text file
    with open(pkl_file, 'wb') as fp:
        pickle.dump(embeddings_index, fp)
else:
    reload_embedding_index = True
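
A GloVe text file stores one token per line followed by its vector components, separated by spaces. As an alternative to reading the whole file into a DataFrame, the file can be parsed line by line and the result cached with `pickle`. The sketch below uses a tiny made-up file. The function name `load_glove_text` is an assumption. The right-hand split is a precaution for tokens that themselves contain spaces, which some GloVe releases are reported to include.

```python
import os
import pickle
import tempfile

import numpy as np


def load_glove_text(path, dim):
    """Parse a GloVe-style text file into a {token: vector} dictionary."""
    embeddings = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            # Split from the right so a token containing spaces stays intact.
            parts = line.rstrip("\n").rsplit(" ", dim)
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings


# Toy file with 3-dimensional vectors (made-up values, not real GloVe data).
with tempfile.TemporaryDirectory() as tmp:
    text_file = os.path.join(tmp, "toy_vectors.txt")
    cache_file = os.path.join(tmp, "toy_vectors.pkl")
    with open(text_file, "w", encoding="utf-8") as handle:
        handle.write("block 0.1 0.2 0.3\n. . . 0.4 0.5 0.6\n")

    embeddings_index = load_glove_text(text_file, dim=3)
    with open(cache_file, "wb") as handle:  # cache for later runs
        pickle.dump(embeddings_index, handle)
    with open(cache_file, "rb") as handle:
        reloaded = pickle.load(handle)
```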

#### Plot Results


In [None]:
def plot_metrics():
    """
    Plots the metrics (accuracy, precision, recall, and F1 score) for different models and datasets.

    Returns:
        None
    """
    df = pd.DataFrame(
        benchmark_results,
        columns=[
            "model",
            "dataset",
            "accuracy",
            "precision",
            "recall",
            "f1",
            "auc",
            "pos",
            "neg",
            "tp",
            "fp",
            "fn",
            "tn",
            "tpr",
            "fpr",
            "tnr",
            "fnr",
            "csr",
            "cfr",
            "mttd_impact",
        ],
    )
    # Define a color palette with one color per dataset (the hue variable)
    model_palette = sns.color_palette("Set1", n_colors=df["dataset"].nunique())

    # Plotting
    plt.figure(figsize=(12, 8))

    # Accuracy Plot
    plt.subplot(2, 2, 1)
    sns.barplot(x="model", y="accuracy", hue="dataset", data=df, palette=model_palette)
    plt.title("Accuracy")

    # Precision Plot
    plt.subplot(2, 2, 2)
    sns.barplot(x="model", y="precision", hue="dataset", data=df, palette=model_palette)
    plt.title("Precision")

    # Recall Plot
    plt.subplot(2, 2, 3)
    sns.barplot(x="model", y="recall", hue="dataset", data=df, palette=model_palette)
    plt.title("Recall")

    # F1 Score Plot
    plt.subplot(2, 2, 4)
    sns.barplot(x="model", y="f1", hue="dataset", data=df, palette=model_palette)
    plt.title("F1 Score")

    plt.tight_layout()

    plt.savefig(plot_file)
    plt.show()

def plot_roc(clf, roc_curves, auc_scores):
    """
    Plots the ROC curves for a classifier.

    Args:
        clf (Classifier): The classifier object.
        roc_curves (list): List of ROC curves.
        auc_scores (list): List of AUC scores, one per ROC curve.

    Returns:
        None
    """
    model_name = clf.model_name

    plt.figure(figsize=(10, 10))
    split = 1
    for roc in roc_curves:
        fpr = roc[0]
        tpr = roc[1]
        auc = auc_scores[split - 1]
        plt.plot(
            fpr, tpr, label=f"Cross-validation split {split} - AUC = {auc:.2f}"
        )
        split += 1

    plt.title(f"ROC Curves {model_name}")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.legend()
    plt.savefig(os.path.join(clf.model_artefact_dir, "roc_curves.png"))
    plt.show()


def plot_confusion_matrix(clf, conf_matrix):
    """
    Plots the confusion matrix using a heatmap.

    Parameters:
    - clf: The classifier object.
    - conf_matrix: The confusion matrix to be plotted.

    Returns:
    None
    """
    plt.figure(figsize=(10, 10))
    sns.heatmap(
        conf_matrix,
        annot=True,
        fmt="g",
        cmap="Blues",
        xticklabels=["Negative", "Positive"],
        yticklabels=["Negative", "Positive"],
    )
    plt.title("Confusion Matrix - " + clf.model_name)
    plt.ylabel("Actual")
    plt.xlabel("Predicted")
    plt.savefig(os.path.join(clf.model_artefact_dir, "confusion_matrix.png"))
    plt.show()

# The Experiment


## Bootstrap


In [None]:
"""
This code loads HDFS event sequences, prepares the data, and optionally reduces the data.
"""

reduce_data = False
print("Loading HDFS Event Sequences")
x, y = load_data("HDFS", "SVM")
hdfs_data_df = prepare_hdfs_data(x, y)

if reduce_data:
    print("Reducing HDFS Event Sequences")
    hdfs_data_df = hdfs_data_df.drop(
        hdfs_data_df[hdfs_data_df.label == 0].sample(
            frac=0.5, random_state=42).index
    )
    hdfs_data_df = hdfs_data_df.drop(
        hdfs_data_df[hdfs_data_df.label == 1].sample(
            frac=0.5, random_state=42).index
    )
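
The optional reduction above keeps roughly half of each class. With pandas 1.1 or later, the same idea can be sketched with a grouped sample. The data frame below is made up and only mirrors the `feature` and `label` columns of the prepared data. This is an alternative formulation, not the code used by the experiment.

```python
import pandas as pd

# Made-up frame with the same columns as the prepared HDFS data.
toy_df = pd.DataFrame(
    {"feature": [f"[E{i}]" for i in range(14)], "label": [0] * 10 + [1] * 4}
)

# Keep half of each class; the remaining rows are dropped.
reduced_df = (
    toy_df.groupby("label")
    .sample(frac=0.5, random_state=42)
    .reset_index(drop=True)
)
class_counts = reduced_df["label"].value_counts().to_dict()
```

Sampling per class keeps the ratio of normal to anomalous sequences unchanged.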

In [None]:
import tensorflow as tf
from tensorflow.python.client import device_lib

# Enable memory growth before any call that initializes the GPUs
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

print(device_lib.list_local_devices())
print("Num GPUs Available: ", len(gpus))
print("Running in Colab: ", is_colab)


In [None]:
print("Run all the cells above this one")

## Model 1 - SVM

Statistical Analysis using an SVM (Support Vector Machine) model


#### Preparation


In [None]:
"""
This code performs an experiment using the HDFS dataset and SVM model.
It loads the data, prepares it, and splits it into train and test sets.
It then applies TF-IDF vectorization to the data and calculates class weights.
Finally, it prints the shapes of the train and test data, the normal to anomaly ratio,
the counts of normal and anomaly classes, and the class weights.
"""

_data_set = "HDFS"
_model = "SVM"
print(_data_set, _model)

if hdfs_data_df is None:
    x, y = load_data(_data_set, _model)
    hdfs_data_df = prepare_hdfs_data(x, y)

x = hdfs_data_df["feature"].tolist()
y = hdfs_data_df["label"].tolist()
y = np.array(y)
vectorizer = SequenceVectorizer(mode="tfidf")
# Note: the vectorizer is fitted on the full dataset, before the train/test split
vectorizer.fit(x)
train_data, test_data, train_labels, test_labels = train_test_split(
    x, y, test_size=0.3, random_state=42
)
train_data = vectorizer.transform(train_data)
test_data = vectorizer.transform(test_data)

class_counts = np.unique(train_labels, return_counts=True)[1]
ratio = class_counts[0] / class_counts[1]
class_weights = {0: 1, 1: ratio}

print("Train data shape: ", train_data.shape)
print("Test data shape: ", test_data.shape)
print("Normal to anomaly ratio: ", ratio)
print("Normal: ", class_counts[0])
print("Anomaly: ", class_counts[1])
print("Class weights:", class_weights)
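
The class weights above give the anomaly class a weight equal to the normal-to-anomaly count ratio, so that the rarer class contributes comparably to the loss. A minimal sketch with made-up labels follows. The function name `inverse_ratio_class_weights` is an assumption, and the sketch assumes binary labels where 0 is normal and 1 is anomalous.

```python
import numpy as np


def inverse_ratio_class_weights(labels):
    """Weight the anomaly class (1) by the normal-to-anomaly count ratio."""
    classes, counts = np.unique(labels, return_counts=True)
    count_by_class = dict(zip(classes.tolist(), counts.tolist()))
    ratio = count_by_class[0] / count_by_class[1]
    return {0: 1, 1: ratio}


# Eight normal and two anomalous labels (made-up data).
weights = inverse_ratio_class_weights(np.array([0] * 8 + [1] * 2))
print(weights)  # {0: 1, 1: 4.0}
```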

#### Execution


In [None]:
"""
This code performs cross-validation using an SVM classifier model.
It sets the number of cross-validation splits and the model name.
Then it initializes an SVMClassifier object with the specified model name and class weights.
Next, it builds the SVM model using the SVMClassifier object.
After that, it performs cross-validation using the built model and obtains predictions.
Finally, it calculates the accuracy of the model using cross-validation scores and evaluates the SVM classifier.
"""
num_splits = 5
model_name = "SVMClassifier_HDFS"

classifier = SVMClassifier(model_name=model_name, class_weight=class_weights)
svc_model = classifier.build_model()

# Perform cross-validation and get predictions
y_pred = cross_val_predict(svc_model, train_data, train_labels, cv=num_splits)
# Calculate accuracy
accuracy = cross_val_score(
    svc_model, train_data, train_labels, cv=num_splits, scoring="accuracy"
)

evaluate_svm(classifier, _model, _data_set, y_pred, train_labels, accuracy)

## Model 2 - LSTM


#### Version 1. Sequence Classification


##### Preparation


In [None]:
"""
This code prepares the data for training a model on the HDFS dataset using LSTM.
It loads the data, prepares it for training, and calculates class weights for imbalanced data.
"""

_data_set = "HDFS"
_model = "LSTM"
print(_data_set, _model)

sequence_length = 128
feature_dim = 1

if hdfs_data_df is None:
    x, y = load_data(_data_set, _model)
    hdfs_data_df = prepare_hdfs_data(x, y)


x = hdfs_data_df["feature"].tolist()
y = hdfs_data_df["label"].tolist()
y = np.array(y)

vectorizer = SequenceVectorizer(num_words=sequence_length)
vectorizer.fit(x)
train_data = vectorizer.transform(x)
train_labels = y

vocab_size = len(vectorizer.tokenizer.word_index) + 1
train_data = train_data.reshape(
    train_data.shape[0], sequence_length, feature_dim)

class_counts = np.unique(train_labels, return_counts=True)[1]
ratio = class_counts[0] / class_counts[1]
class_weights = {0: 1, 1: ratio}

print("Train data shape: ", train_data.shape)
print("Normal to anomaly ratio: ", ratio)
print("Normal: ", class_counts[0])
print("Anomaly: ", class_counts[1])
print("Class weights:", class_weights)

##### Execution


In [None]:
"""
This code performs cross-validation and evaluation of an LSTMAttentionClassifier model.
It sets the batch size, LSTM units, number of epochs, number of splits, prediction threshold, and model name.
Then it initializes an LSTMAttentionClassifier object with the specified parameters.
Next, it performs cross-validation using the cross_validate method of the classifier object.
Finally, it evaluates the model using the evaluate_model function, passing in the classifier object and various evaluation metrics.
"""
batch_size = 128
lstm_units = 8
num_epochs = 10
num_splits = 5
prediction_threshold = 0.9
model_name = "LSTMAttentionClassifier_HDFS_SequenceMatrix"

classifier = LSTMAttentionClassifier(
    model_name,
    feature_dim,
    sequence_length,
    vocab_size,
    lstm_units=lstm_units,
)

(
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
) = classifier.cross_validate(
    train_data, train_labels, num_splits, num_epochs, prediction_threshold
)

evaluate_model(
    classifier,
    _model,
    _data_set,
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
)
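
To illustrate what a `prediction_threshold` of 0.9 implies, the sketch below turns predicted probabilities into labels and computes precision, recall and F1 by hand. It is an assumption that the classifier compares with `>=`; the internals of `cross_validate` are not shown here, and `threshold_metrics` is a hypothetical helper.

```python
import numpy as np


def threshold_metrics(y_true, y_prob, threshold=0.9):
    """Binarise probabilities at `threshold` and return (precision, recall, f1)."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: three anomalies (1) and three normal sequences (0)
y_true = [0, 0, 1, 1, 1, 0]
y_prob = [0.10, 0.95, 0.92, 0.60, 0.99, 0.30]

strict = threshold_metrics(y_true, y_prob, threshold=0.9)
loose = threshold_metrics(y_true, y_prob, threshold=0.5)
print("threshold 0.9:", strict)
print("threshold 0.5:", loose)
```

On this toy data the stricter threshold misses the anomaly scored 0.60, so recall drops compared with a threshold of 0.5.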

#### Version 2. Sequence Embeddings using Sequence Graph Transformation


##### Preparation


In [None]:

"""
This code prepares and processes the HDFS dataset for training a model. It performs the following steps:

1. Loads and prepares the HDFS data if it is not already loaded.
2. Converts the feature and label columns into lists.
3. Prepares the train data by creating a DataFrame with sequence and label columns.
4. Applies sequence-to-graph transformation using SGTVectorizer.
5. Converts the train data to arrays.
6. Calculates class weights based on the ratio of normal to anomaly labels.
7. Prints information about the data, including train data shape, normal to anomaly ratio, class counts, and class weights.
"""

_data_set = "HDFS"
_model = "LSTM"
print(_data_set, _model)

sequence_length = 32
feature_dim = 1

# Load and prepare the HDFS data
if hdfs_data_df is None:
    x, y = load_data(_data_set, _model)
    hdfs_data_df = prepare_hdfs_data(x, y)

x = hdfs_data_df["feature"].tolist()
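# Parse each string-encoded sequence into a list of event tokens;
# split() without arguments also drops empty tokens left by ", " separators
x = [
    s.replace("[", "").replace("]", "").replace(",", " ").split()
    for s in x
]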

y = hdfs_data_df["label"].tolist()
y = np.array(y)

# Prepare train data
train_data_df = pd.DataFrame({"sequence": x, "label": y})
train_data_df = train_data_df.reset_index()
train_data_df = train_data_df.rename(columns={"index": "id"})
train_labels = train_data_df["label"].tolist()
train_data_df.drop(columns=["label"], inplace=True)

# Apply sequence-to-graph transformation using SGTVectorizer
vectorizer = SGTVectorizer(num_dims=sequence_length)
train_data = vectorizer.fit_transform(train_data_df)

# Keep a plain-list copy of the embeddings (not used by the cells below)
train_data_arr = [row.tolist() for row in train_data]

train_labels = np.array(train_labels)

# Calculate class weights
class_counts = np.unique(train_labels, return_counts=True)[1]
ratio = class_counts[0] / class_counts[1]
class_weights = {0: 1, 1: ratio}

# Print information about the data
print("Train data shape: ", train_data.shape)
print("Normal to anomaly ratio: ", ratio)
print("Normal: ", class_counts[0])
print("Anomaly: ", class_counts[1])
print("Class weights:", class_weights)
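
The string-to-token parsing step in this cell can be sketched on its own. The raw format `"[E5,E22,E11]"` and the event IDs are assumptions for illustration, and `parse_event_sequence` is a hypothetical helper.

```python
def parse_event_sequence(raw):
    """Turn a string such as "[E5,E22,E11]" into ["E5", "E22", "E11"]."""
    cleaned = raw.replace("[", "").replace("]", "").replace(",", " ")
    # str.split() with no argument collapses repeated whitespace,
    # so "E5, E22" and "E5,E22" yield the same tokens
    return cleaned.split()


examples = ["[E5,E22,E11]", "[E5, E5,  E9]", "[]"]
parsed = [parse_event_sequence(s) for s in examples]
print(parsed)  # [['E5', 'E22', 'E11'], ['E5', 'E5', 'E9'], []]
```

Splitting on a single space with `split(" ")` would instead produce empty-string tokens for the second and third examples.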

##### Execution


In [None]:
"""
This code performs cross-validation and evaluation of an LSTMAttentionClassifier model.
It sets the values for various parameters such as vocab_size, batch_size, lstm_units, num_epochs, num_splits, and prediction_threshold.
The model is initialized with the specified parameters and then cross-validated using the cross_validate() method.
The resulting accuracies, precisions, recalls, fscores, aucs, conf_matrices, and roc_curves are then used to evaluate the model using the evaluate_model() function.
"""

vocab_size = 500
batch_size = 128
lstm_units = 8
num_epochs = 10
num_splits = 5
prediction_threshold = 0.9
model_name = "LSTMAttentionClassifier_HDFS_SGT"

classifier = LSTMAttentionClassifier(
    model_name,
    feature_dim,
    sequence_length,
    vocab_size,
    lstm_units=lstm_units,
)

(
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
) = classifier.cross_validate(
    train_data, train_labels, num_splits, num_epochs, prediction_threshold
)

evaluate_model(
    classifier,
    _model,
    _data_set,
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
)
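
With a strongly imbalanced dataset, each of the `num_splits` folds should contain some anomalies. The sketch below shows one way to build stratified folds in plain NumPy. It is not taken from `cross_validate`, whose splitting strategy is not shown in this section, and `stratified_folds` is a hypothetical helper.

```python
import numpy as np


def stratified_folds(labels, num_splits, seed=42):
    """Yield (train_idx, val_idx) pairs that keep the class ratio in each fold."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across the folds
        fold_of[idx] = np.arange(len(idx)) % num_splits
    for k in range(num_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)


# Toy labels: 20 normal sequences and 5 anomalies
labels = np.array([0] * 20 + [1] * 5)
folds = list(stratified_folds(labels, num_splits=5))
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx), int(labels[val_idx].sum()))
```

In practice, scikit-learn's `StratifiedKFold` provides the same behaviour with more safeguards.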

## Model 3 - Transformer


#### Preparation


In [None]:


"""
This code performs data preprocessing and sets up the necessary variables for training a Transformer model on the HDFS dataset.
"""

_data_set = "HDFS"
_model = "Transformer"
print(_data_set, _model)
batch_size = 32

if hdfs_data_df is None:
    x, y = load_data(_data_set, _model)
    hdfs_data_df = prepare_hdfs_data(x, y)

reduce_data_df = hdfs_data_df.drop(
    hdfs_data_df[hdfs_data_df.label == 0].sample(
        frac=0.75, random_state=42).index
)

x = reduce_data_df["feature"].tolist()
y = reduce_data_df["label"].tolist()
y = np.array(y)

# Length of the longest raw feature entry (informational; not used below)
max_len = max(len(i) for i in x)

model_type = "distilbert-base-uncased"
preprocessor = BertEventTokenizer(
    model_type=model_type, batch_size=batch_size, max_length=128
)
train_tokens, train_masks = preprocessor.transform(x)
vocab_size = preprocessor.vocab_size
seq_len = preprocessor.max_length
train_data = [train_tokens, train_masks]
train_labels = y

class_counts = np.unique(train_labels, return_counts=True)[1]
ratio = class_counts[0] / class_counts[1]
class_weights = {0: 1, 1: ratio}

print("Train data shape: ", train_data[0].shape)
print("Normal to anomaly ratio: ", ratio)
print("Normal: ", class_counts[0])
print("Anomaly: ", class_counts[1])
print("Class weights:", class_weights)
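
The cell above drops 75% of the normal (label 0) rows before tokenization to reduce class imbalance. The sketch below shows the same idea on toy labels, written with NumPy index arrays rather than the DataFrame `sample`/`drop` calls used in the cell; `downsample_majority` is a hypothetical helper.

```python
import numpy as np


def downsample_majority(labels, drop_frac=0.75, seed=42):
    """Return sorted indices to keep after dropping `drop_frac` of the class-0 rows."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    normal_idx = np.flatnonzero(labels == 0)
    n_drop = int(round(drop_frac * len(normal_idx)))
    # Choose the rows to discard without replacement, from the normal class only
    dropped = rng.choice(normal_idx, size=n_drop, replace=False)
    return np.setdiff1d(np.arange(len(labels)), dropped)


# Toy labels: 96 normal sequences and 4 anomalies
labels = np.array([0] * 96 + [1] * 4)
keep = downsample_majority(labels)
kept_labels = labels[keep]

n_normal = int((kept_labels == 0).sum())
n_anomaly = int((kept_labels == 1).sum())
print("ratio before:", 96 / 4, "ratio after:", n_normal / n_anomaly)
```

All anomalies are retained, so the normal-to-anomaly ratio shrinks by the dropped fraction.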

#### Execution


In [None]:
"""
This code performs a cross-validation experiment using a TransformerClassifier model.
It sets the sequence length, batch size, number of epochs, number of splits, prediction threshold, and model name.
Then it initializes a TransformerClassifier object with the specified parameters.
Next, it calls the `cross_validate` method of the classifier object to perform cross-validation on the training data.
The results of the cross-validation are stored in the variables `accuracies`, `precisions`, `recalls`, `fscores`, `aucs`, `conf_matrices`, and `roc_curves`.
Finally, it calls the `evaluate_model` function to evaluate the model using the obtained results.
"""

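# Note: this replaces the seq_len = preprocessor.max_length value (128)
# set in the preparation cell
seq_len = 640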
batch_size = 16
if is_colab:
    batch_size = 64
num_epochs = 10
num_splits = 5
prediction_threshold = 0.9
model_name = "Transformer_classifier_HDFS"

classifier = TransformerClassifier(
    model_name, 1, seq_len, vocab_size, model_type=model_type
)

(
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
) = classifier.cross_validate(
    train_data, train_labels, num_splits, num_epochs, prediction_threshold
)

evaluate_model(
    classifier,
    _model,
    _data_set,
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
)
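
Cross-validation returns one value per fold for each metric. A common way to report them is mean and standard deviation, sketched below. What `evaluate_model` does internally is not shown in this section; `summarize_folds` is a hypothetical helper, and the scores are placeholder values, not results from this notebook.

```python
import numpy as np


def summarize_folds(**metrics):
    """Map each metric name to (mean, standard deviation) over the folds."""
    return {
        name: (float(np.mean(values)), float(np.std(values)))
        for name, values in metrics.items()
    }


# Placeholder per-fold scores, made up purely for illustration
summary = summarize_folds(
    accuracy=[0.90, 0.92, 0.94, 0.92, 0.92],
    recall=[0.80, 0.85, 0.90, 0.85, 0.85],
)
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```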

## Model 4 - TCN


#### Preparation


In [None]:

"""
This code performs data preprocessing and prepares the data for training a model on the HDFS dataset.
It loads event templates, processes text corpus, applies feature string mapping, and prepares embedding matrix.
It also calculates class weights for imbalanced data and prints relevant information about the data.
"""

_data_set = "HDFS"
_model = "TCN"
print(_data_set, _model)
batch_size = 32

hdfs_event_templates = data_sets["HDFS"]["template_file"]
m = load_event_templates_hdfs(hdfs_event_templates)

# load dictionary m into a dataframe, key as column 'EventId', value as column 'EventTemplate'
event_templates_df = pd.DataFrame(
    m.items(), columns=["EventId", "EventTemplate"])

word_split = {
    "namesystem": ["name", "system"],
    "allocateblock": ["allocate", "block"],
    "packetresponder": ["packet", "responder"],
    "addstoredblock": ["add", "stored", "block"],
    "invalidset": ["invalid", "set"],
    "ioexception": ["input", "exception"],
    "writeblock": ["write", "block"],
    "blockinfo": ["block", "info"],
    "volumemap": ["volume", "map"],
    "receiveblock": ["receive", "block"],
    "socketchannel": ["socket", "channel"],
    "interruptedioexception": ["interrupted", "input", "exception"],
    "interruptedinput": ["interrupted", "input"],
    "eofexception": ["file", "exception"],
    "sockettimeoutexception": ["socket", "timeout", "exception"],
    "pendingreplicationmonitor": ["pending", "replication", "monitor"],
    "neededreplications": ["needed", "replications"],
}

event_templates_df = process_text_corpus(event_templates_df, word_split)

m = dict(zip(event_templates_df["EventId"],
         event_templates_df["EventTemplate"]))

x, y = load_data(_data_set, _model)

print("Preparing data for training...")
hdfs_corpus_df = pd.DataFrame({"feature": x, "label": y})
hdfs_corpus_df = hdfs_corpus_df.drop(
    hdfs_corpus_df[hdfs_corpus_df.label == 0].sample(
        frac=0.75, random_state=42).index
)
hdfs_corpus_df["feature_str"] = (
    hdfs_corpus_df["feature"]
    .str.replace("[", "", regex=False)
    .str.replace("]", "", regex=False)
    .str.split(",")
)
print(hdfs_corpus_df.head())

print("Apply feature string mapping")
hdfs_corpus_df["feature_str"] = hdfs_corpus_df["feature_str"].apply(
    lambda x: [m[str(i)] for i in x]
)
print(hdfs_corpus_df.head())

hdfs_corpus_df["feature_str"] = hdfs_corpus_df["feature_str"].apply(
    lambda x: " ".join(x)
)
hdfs_corpus_df["feature_str"] = hdfs_corpus_df["feature_str"].apply(
    lambda x: x.split(" ")
)
train_data = hdfs_corpus_df["feature_str"]
train_labels = np.asarray(hdfs_corpus_df["label"].tolist())

sequence_length = 128
vectorizer = SequenceVectorizer(num_words=sequence_length)
vectorizer.fit(train_data)
train_data = vectorizer.transform(train_data)
# +1 leaves room for index 0, which word_index does not assign to any word
vocab_size = len(vectorizer.tokenizer.word_index) + 1

word_index = vectorizer.tokenizer.word_index
EMBEDDING_DIM = 64
MAX_NB_WORDS = 2000

if reload_embedding_index:
    path = "./artefacts/embeddings"
    pkl_file = f"{path}/glove.840B.300d.pkl"
    # Load the pickled embeddings_index (word -> vector) from file
    with open(pkl_file, "rb") as fp:
        embeddings_index = pickle.load(fp)

print("Preparing embedding matrix")
nb_words = min(MAX_NB_WORDS, len(word_index)) + 1
word_embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
dead_words = []
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Trim the pretrained vector to the first EMBEDDING_DIM components
        if len(embedding_vector) > EMBEDDING_DIM:
            embedding_vector = embedding_vector[:EMBEDDING_DIM]
        word_embedding_matrix[i] = embedding_vector
    else:
        # Words not found in the embedding index keep an all-zero row
        dead_words.append(word)
print(dead_words)

print("Null word embeddings: %d" %
      np.sum(np.sum(word_embedding_matrix, axis=1) == 0))

# Find words in the embedding matrix that have no embeddings
for word, i in word_index.items():
    if i >= MAX_NB_WORDS:
        continue
    if np.sum(word_embedding_matrix[i]) == 0:
        print(word)

positional_embedding_matrix = positional_encoding(
    len(word_index) + 1, EMBEDDING_DIM)
positional_embedding_matrix = positional_embedding_matrix.reshape(
    positional_embedding_matrix.shape[1], positional_embedding_matrix.shape[2]
)

class_counts = np.unique(train_labels, return_counts=True)[1]
ratio = class_counts[0] / class_counts[1]
class_weights = {0: 1, 1: ratio}

print("Train data shape: ", train_data.shape)
# print("Test data shape: ", test_data.shape)
print("Normal to anomaly ratio: ", ratio)
print("Normal: ", class_counts[0])
print("Anomaly: ", class_counts[1])
print("Class weights:", class_weights)
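
The embedding-matrix step above can be sketched on toy data. The helper `build_embedding_matrix` is hypothetical, the vectors are tiny placeholders rather than GloVe values, and the index handling assumes a 1-based `word_index` as produced by a Keras-style tokenizer.

```python
import numpy as np


def build_embedding_matrix(word_index, embeddings_index, dim, max_words):
    """Return (matrix, missing_words) for a 1-based word_index."""
    n_rows = min(max_words, len(word_index)) + 1  # row 0 stays empty (padding)
    matrix = np.zeros((n_rows, dim))
    missing = []
    for word, i in word_index.items():
        if i >= n_rows:
            continue  # index outside the matrix, skip
        vector = embeddings_index.get(word)
        if vector is None:
            missing.append(word)  # the row stays all-zero
            continue
        # Keep only the first `dim` components of the pretrained vector
        matrix[i] = np.asarray(vector)[:dim]
    return matrix, missing


# Toy vocabulary and 4-dimensional placeholder "pretrained" vectors
word_index = {"block": 1, "receive": 2, "namesystem": 3}
embeddings_index = {
    "block": [0.1, 0.2, 0.3, 0.4],
    "receive": [0.5, 0.6, 0.7, 0.8],
}
matrix, missing = build_embedding_matrix(
    word_index, embeddings_index, dim=2, max_words=2000
)
print(matrix.shape, missing)  # (4, 2) ['namesystem']
```

Truncating a pretrained vector to its first components discards the remaining dimensions, so the trimmed embeddings carry less information than the full 300-dimensional ones.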

#### Execution


In [None]:
"""
This code defines a sentiment classification experiment using a Temporal Convolutional Network (TCN) model.
The experiment involves training and evaluating the TCN model on a dataset.

Parameters:
- embedding_input_dim: The dimension of the input embedding.
- input_size: The length of the input sequence.
- input_dim: The dimension of the input data.
- tcn_filters: The number of filters in the TCN layers.
- tcn_kernel_size: The kernel size of the TCN layers.
- dropout_rate: The dropout rate for regularization.
- num_epochs: The number of training epochs.
- num_splits: The number of splits for cross-validation.
- prediction_threshold: The threshold for binary classification prediction.

Variables:
- model_name: The name of the TCN sentiment classifier model.
- classifier: The TCN sentiment classifier object.

Functions:
- cross_validate: Performs cross-validation on the TCN sentiment classifier.
- evaluate_model: Evaluates the TCN sentiment classifier model.

Note: The code assumes the existence of the TCNSentimentclassifier class and the necessary data and model objects.
"""
# embedding_input_dim = 300
input_size = sequence_length
input_dim = vocab_size
tcn_filters = 32
tcn_kernel_size = 3
dropout_rate = 0.25
num_epochs = 10
num_splits = 5
prediction_threshold = 0.9

model_name = "TCNSentimentClassifier_HDFS"

classifier = TCNSentimentclassifier(
    model_name,
    input_size,
    input_dim,
    positional_embedding_matrix,
    word_embedding_matrix,
    embedding_output_dim=EMBEDDING_DIM,
    tcn_units=128,
    prediction_threshold=prediction_threshold,
)


(
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
) = classifier.cross_validate(
    train_data, train_labels, num_splits, num_epochs, prediction_threshold
)

evaluate_model(
    classifier,
    _model,
    _data_set,
    accuracies,
    precisions,
    recalls,
    fscores,
    aucs,
    conf_matrices,
    roc_curves,
)

In [None]:
# Zip the artefacts/models folder
!zip -r artefacts.zip ./artefacts/models
# Zip the output/benchmark_results folder
!zip -r output.zip ./output/benchmark_results
