## Log sequence anomaly detection

### Content
- Introduction
- Dataset
- Training and Evaluation
- Conclusion
- Reference

### Introduction

Anomaly detection in sequential log data aims to identify sequences that deviate from expected behavior
or patterns. For example, software-intensive systems often record runtime information by
printing console logs. A large, complex system can produce a massive volume of logs, which can be used for troubleshooting.
Detecting anomalous states in a timely manner is critical to keeping the software system reliable and limiting losses.

Log data usually consists of unstructured text messages, which help engineers understand the system's internal
status and facilitate monitoring, administration, and troubleshooting. The log messages can be modeled as an event sequence, where abnormal events within a sequence can indicate that the sequence as a whole is anomalous.

This use case shows a workflow for identifying sequential anomalies, using a log message dataset as an example.




In [1]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import random
import utils, datatools, model
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import time
import math
import os
from sklearn import metrics


SEED = 91
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

#### Dataset processing

The dataset for this example comes from the BlueGene/L (BGL) supercomputer system. The BGL dataset contains 4,747,963 log messages collected
from a [BlueGene/L](https://zenodo.org/record/3227177/files/BGL.tar.gz?download=1) supercomputer system at Lawrence Livermore National Labs. The log messages are categorized into alert and non-alert messages. The raw log is parsed with the [`Drain`](https://github.com/logpai/logparser) parser into a structured format, and the structured log is then used to train the model. This work is based on the model developed in [[2](https://ieeexplore.ieee.org/document/9671642),[3](https://github.com/hanxiao0607/InterpretableSAD)]; for further detail, refer to the paper and the associated code at the reference links.

To run this workflow, we can also use a portion of the parsed BGL dataset taken from https://github.com/LogIntelligence/LogPPT.

#### Preprocessing log dataset

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/LogIntelligence/LogPPT/master/logs/BGL/BGL_2k.log_structured.csv')

In [3]:
# Small dataset for testing
DATASET_NAME = 'https://raw.githubusercontent.com/LogIntelligence/LogPPT/master/logs/BGL/BGL_2k.log_structured.csv' #'BGL_2k'
TRAIN_SIZE = 100 
WINDOW_SIZE = 10
STEP_SIZE = 10
RATIO = 0.1

# Full dataset parsed using DRAIN parser
# DATASET_NAME = 'dataset/bgl_1m.log_structured.csv'
# TRAIN_SIZE = 10000 #00
# WINDOW_SIZE = 100
# STEP_SIZE = 20
# RATIO = 0.1

#### Problem statement
Let $\mathcal{E}$ be the event set containing all possible events in the log. A log event sequence $S^i$ is defined as $S^i = (e_1^i, e_2^i, \ldots, e_{N^i}^i)$, where
$e_j^i \in \mathcal{E}$ and $N^i$ is the length of sequence $S^i$. We are given a set of sequences $S = \{S^1, S^2, \ldots, S^M\}$, where each sequence is either normal or abnormal. The dataset $S$ is used to train a sequence classifier for binary prediction.


For example, the structured log below shows parsed log lines along with their timestamp and event template. Each line corresponds to one event; a sliding window groups consecutive events into sequences, which are then converted to a vector representation.

In [4]:
df.head(3)

| Unnamed: 0 | LineId | Label | Timestamp | Date | Node | Time | NodeRepeat | Type | Component | Level | Content | EventId | EventTemplate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | - | 1117838570 | 2005.06.03 | R02-M1-N0-C:J12-U11 | 2005-06-03-15.42.50.675872 | R02-M1-N0-C:J12-U11 | RAS | KERNEL | INFO | instruction cache parity error corrected | E77 | instruction cache parity error corrected |
| 1 | 2 | - | 1117838573 | 2005.06.03 | R02-M1-N0-C:J12-U11 | 2005-06-03-15.42.53.276129 | R02-M1-N0-C:J12-U11 | RAS | KERNEL | INFO | instruction cache parity error corrected | E77 | instruction cache parity error corrected |
| 2 | 3 | - | 1117838976 | 2005.06.03 | R02-M1-N0-C:J12-U11 | 2005-06-03-15.49.36.156884 | R02-M1-N0-C:J12-U11 | RAS | KERNEL | INFO | instruction cache parity error corrected | E77 | instruction cache parity error corrected |


`WINDOW_SIZE` is the sliding window size, i.e. the sequence length. `STEP_SIZE` is the number of events the window advances between consecutive sequences, so consecutive sequences share `WINDOW_SIZE - STEP_SIZE` events. If `WINDOW_SIZE` equals `STEP_SIZE`, the sequences do not overlap. Both values are chosen based on the size of the log dataset and the sequence length we want to feed to the model. In this case, they follow the empirical results reported in the referenced paper.
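The effect of the two parameters can be seen on a toy event stream. The sketch below is illustrative only and is not the `datatools.sliding_window` implementation; the helper name `sliding_windows` is hypothetical, and dropping an incomplete trailing window is an assumption made to keep every sequence the same length.

```python
from typing import List, Sequence


def sliding_windows(events: Sequence[str], window_size: int, step_size: int) -> List[List[str]]:
    """Split an event stream into fixed-length windows.

    A new window starts every `step_size` events. Trailing events that do not
    fill a whole window are dropped (an assumption made for this sketch).
    """
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    last_start = len(events) - window_size
    return [list(events[i:i + window_size]) for i in range(0, last_start + 1, step_size)]


events = ["E1", "E2", "E3", "E4", "E5", "E6"]

# Equal window and step size: windows do not share events.
print(sliding_windows(events, window_size=2, step_size=2))
# → [['E1', 'E2'], ['E3', 'E4'], ['E5', 'E6']]

# Smaller step size: consecutive windows share window_size - step_size events.
print(sliding_windows(events, window_size=4, step_size=2))
# → [['E1', 'E2', 'E3', 'E4'], ['E3', 'E4', 'E5', 'E6']]
```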

The `sliding_window` method returns the Word2Vec-transformed vector representations of the log sequences for the training and test sets, the embedding weight matrix, and the bigrams found in the training set. The bigrams are
used later to generate negative samples. The `w2v_dict` is used to look up events in the test logs.
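The lookup step can be pictured with plain dictionaries. The sketch below is not the `datatools` code: `build_event_index` and `encode_sequence` are hypothetical helpers, and reserving index 0 for events never seen during training is an assumption made for illustration.

```python
from typing import Dict, Iterable, List

UNKNOWN_INDEX = 0  # assumption: index 0 is reserved for events unseen in training


def build_event_index(train_sequences: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Assign each distinct training event an integer index, starting at 1."""
    index: Dict[str, int] = {}
    for sequence in train_sequences:
        for event in sequence:
            if event not in index:
                index[event] = len(index) + 1
    return index


def encode_sequence(sequence: Iterable[str], index: Dict[str, int]) -> List[int]:
    """Map event IDs to indices; unseen events fall back to UNKNOWN_INDEX."""
    return [index.get(event, UNKNOWN_INDEX) for event in sequence]


train = [["E77", "E77", "E12"], ["E12", "E30"]]
event_index = build_event_index(train)
print(event_index)                                          # → {'E77': 1, 'E12': 2, 'E30': 3}
print(encode_sequence(["E77", "E99", "E30"], event_index))  # → [1, 0, 3]
```

Keeping a fallback index means a test log that contains a new event template can still be encoded instead of raising a `KeyError`.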

In [5]:
# Create train and test dataset by transforming log dataset into embedding vectors of train, test set and associated word2vector weights.
train_normal, test_normal, test_abnormal, bigram, unique, weights, train_dict, w2v_dict = datatools.sliding_window(DATASET_NAME, WINDOW_SIZE, STEP_SIZE, TRAIN_SIZE)

Reading: dataset/bgl_1m.log_structured.csv
Total logs in the dataset:  1000000
training size 10000
test normal size 26631
test abnormal size 13365
Number of training keys: 93
Word2Vec model: Word2Vec(vocab=94, size=8, alpha=0.025)


In [7]:
# Hyperparameters
vocab_dim = len(train_dict)+1
output_dim = 2
emb_dim = 8
hidden_dim = 128
n_layers = 1
dropout = 0.0
batch_size = 32
times = 20

Given a set of normal sequences, anomalous sequences are generated via negative sampling. Negative sampling creates an anomalous sample by randomly replacing $n$ events in a sequence $s_i$. For a randomly selected position, the event $e_{t+1}$ in the bigram $(e_t, e_{t+1})$ is replaced with an event $e'_{t+1}$ such that the bigram $(e_t, e'_{t+1})$ is rare in the training set. Because this introduces suspicious, low-frequency event transitions, the generated sequence is anomalous with high probability. An LSTM sequence classifier is then trained to separate the generated negative samples from the normal training sequences.
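A minimal sketch of this idea follows. It is not the `datatools.negative_sampling` implementation: the helper names, the `max_count` rarity threshold, and the choice to never replace the first event (it has no predecessor) are assumptions made for illustration.

```python
import random
from collections import Counter
from typing import List, Sequence


def count_bigrams(sequences: Sequence[Sequence[str]]) -> Counter:
    """Count adjacent event pairs over all training sequences."""
    counts: Counter = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts


def negative_sample(seq: Sequence[str], vocab: Sequence[str], bigrams: Counter,
                    n_replace: int, rng: random.Random, max_count: int = 0) -> List[str]:
    """Return a copy of `seq` with up to `n_replace` events swapped so that each
    new bigram (previous event, replacement) occurs at most `max_count` times
    in the training data."""
    out = list(seq)
    if len(out) < 2:
        return out
    # Process positions left to right so the predecessor is final when used.
    positions = sorted(rng.sample(range(1, len(out)), k=min(n_replace, len(out) - 1)))
    for pos in positions:
        prev = out[pos - 1]
        rare = [e for e in vocab if e != out[pos] and bigrams[(prev, e)] <= max_count]
        if rare:  # leave the event unchanged if no rare replacement exists
            out[pos] = rng.choice(rare)
    return out


train = [["A", "B", "C", "D"], ["A", "B", "C", "D"], ["B", "C", "D", "A"]]
bigrams = count_bigrams(train)
vocab = sorted({event for seq in train for event in seq})
fake = negative_sample(train[0], vocab, bigrams, n_replace=1, rng=random.Random(0))
print(fake)  # one event differs from train[0], forming a bigram unseen in training
```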


In [8]:
# Generate negative samples and split into training data and validation data. 
neg_samples = datatools.negative_sampling(train_normal, bigram, unique, times, vocab_dim)
df_neg = datatools.get_dataframe(neg_samples, 1, w2v_dict)
df_pos = datatools.get_dataframe(list(train_normal['EventId']), 0, w2v_dict)
df_pos.columns = df_pos.columns.astype(str)
df_train = pd.concat([df_pos, df_neg], ignore_index = True, axis=0)
df_train = df_train.reset_index(drop=True)  # reset_index returns a new frame; keep the result
y = list(df_train.loc[:,'class_label'])
X = list(df_train['W2V_EventId'])

# split train, validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = torch.tensor(X_train,requires_grad=False).long()
X_val = torch.tensor(X_val,requires_grad=False).long()
y_train = torch.tensor(y_train).reshape(-1, 1).long()
y_val = torch.tensor(y_val).reshape(-1, 1).long()
train_iter = utils.get_iter(X_train, y_train, batch_size)
val_iter = utils.get_iter(X_val, y_val, batch_size)

### Training and Evaluation

An LSTM model is trained for binary classification on the Word2Vec inputs generated from both the normal (positive) and the negative examples.

In [15]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_epoch = 10
kwargs = {"matrix_embeddings":weights, 
"vocab_dim": vocab_dim, "output_dim": output_dim, "emb_dim": emb_dim,
"hid_dim": hidden_dim, 
"n_layers": n_layers, 
"dropout": dropout}

In [9]:

LAD_model = model.LogLSTM(weights, vocab_dim, output_dim, emb_dim, hidden_dim, n_layers, dropout).to(device)
optimizer = optim.Adam(LAD_model.parameters())
criterion = nn.CrossEntropyLoss()

# Create the output directory if it does not exist yet
os.makedirs('model', exist_ok=True)

# Train LSTM model
clip = 1

best_test_loss = float('inf')

for epoch in tqdm(range(n_epoch)):
    
    start_time = time.time()
    train_loss= model.train(LAD_model, train_iter, optimizer, criterion,  device)        

    val_loss = model.evaluate(LAD_model, val_iter, criterion, device)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = model.epoch_time(start_time, end_time)
    
    if val_loss < best_test_loss:
        best_test_loss = val_loss
        torch.save({
            'model_state_dict':LAD_model.state_dict(),
            "model_hyperparam": kwargs,
            "W2V_conf": {
            'train_dict': train_dict, 
            'w2v_dict': w2v_dict,
            "WINDOW_SIZE": WINDOW_SIZE,
            "STEP_SIZE": STEP_SIZE
            }
        }, 'model/model_BGL.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {val_loss:.3f} |  Val. PPL: {math.exp(val_loss):7.3f}')



 10%|█         | 1/10 [00:37<05:35, 37.31s/it]

Epoch: 01 | Time: 0m 37s
	Train Loss: 0.056 | Train PPL:   1.058
	 Val. Loss: 0.036 |  Val. PPL:   1.037


 20%|██        | 2/10 [01:12<04:48, 36.05s/it]

Epoch: 02 | Time: 0m 35s
	Train Loss: 0.034 | Train PPL:   1.035
	 Val. Loss: 0.031 |  Val. PPL:   1.031


 30%|███       | 3/10 [01:49<04:14, 36.34s/it]

Epoch: 03 | Time: 0m 36s
	Train Loss: 0.032 | Train PPL:   1.032
	 Val. Loss: 0.032 |  Val. PPL:   1.033


 40%|████      | 4/10 [02:25<03:38, 36.49s/it]

Epoch: 04 | Time: 0m 36s
	Train Loss: 0.030 | Train PPL:   1.031
	 Val. Loss: 0.031 |  Val. PPL:   1.032


 50%|█████     | 5/10 [03:02<03:02, 36.52s/it]

Epoch: 05 | Time: 0m 36s
	Train Loss: 0.030 | Train PPL:   1.030
	 Val. Loss: 0.032 |  Val. PPL:   1.032


 60%|██████    | 6/10 [03:39<02:26, 36.65s/it]

Epoch: 06 | Time: 0m 36s
	Train Loss: 0.030 | Train PPL:   1.030
	 Val. Loss: 0.028 |  Val. PPL:   1.028


 70%|███████   | 7/10 [04:16<01:50, 36.73s/it]

Epoch: 07 | Time: 0m 36s
	Train Loss: 0.028 | Train PPL:   1.029
	 Val. Loss: 0.026 |  Val. PPL:   1.027


 80%|████████  | 8/10 [04:52<01:13, 36.73s/it]

Epoch: 08 | Time: 0m 36s
	Train Loss: 0.024 | Train PPL:   1.024
	 Val. Loss: 0.023 |  Val. PPL:   1.023


 90%|█████████ | 9/10 [05:29<00:36, 36.78s/it]

Epoch: 09 | Time: 0m 36s
	Train Loss: 0.025 | Train PPL:   1.025
	 Val. Loss: 0.022 |  Val. PPL:   1.022


100%|██████████| 10/10 [06:06<00:00, 36.67s/it]

Epoch: 10 | Time: 0m 36s
	Train Loss: 0.020 | Train PPL:   1.020
	 Val. Loss: 0.021 |  Val. PPL:   1.021





### Evaluation

Because the test set is imbalanced, we use precision, recall, and F1 score on the test samples to evaluate the model. The F1 score summarizes performance by combining precision and recall into a single number.

In [12]:
# Prepare test data proportion
test_abnormal_ratio = model.ratio_abnormal_sequence(test_abnormal, WINDOW_SIZE, RATIO)
test_ab_X, test_ab_X_key_label = test_abnormal_ratio['W2V_EventId'], test_abnormal_ratio['Key_label']
test_n_X, test_n_X_key_label = test_normal['W2V_EventId'], test_normal['Key_label']
test_ab_y = test_abnormal_ratio['Label']
test_n_y = test_normal['Label']


In [13]:
# Compute evaluation metrics
y, y_pre = model.model_precision(LAD_model, device, test_n_X.values.tolist()[:int(len(test_n_X.values.tolist())*(len(test_abnormal_ratio)/len(test_abnormal)))], \
                           test_ab_X.values.tolist())
f1_acc = metrics.classification_report(y, y_pre, digits=5)
print(f1_acc)

              precision    recall  f1-score   support

           0    0.98333   0.96686   0.97503      2867
           1    0.93611   0.96734   0.95147      1439

    accuracy                        0.96702      4306
   macro avg    0.95972   0.96710   0.96325      4306
weighted avg    0.96755   0.96702   0.96715      4306
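The class-1 (abnormal) row of this report can be reproduced by hand. The confusion counts below are not printed by the notebook; they are back-computed from the reported support and recall values (1392 of 1439 abnormal sequences and 2772 of 2867 normal sequences classified correctly), so treat them as a derived illustration. The helper `precision_recall_f1` is a hypothetical name.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Counts inferred from the report: support 1439 (abnormal) and 2867 (normal).
tp = 1392          # abnormal sequences correctly flagged
fn = 1439 - tp     # abnormal sequences missed
fp = 2867 - 2772   # normal sequences wrongly flagged
p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={p:.5f} recall={r:.5f} f1={f1:.5f}")
# → precision=0.93611 recall=0.96734 f1=0.95147
```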



To run inference, load the saved model and apply it as follows.

In [17]:

# Load trained model and parameters for inference

check_point = torch.load('model/model_BGL.pt', map_location=device)  # map_location lets a GPU-trained checkpoint load on CPU

window_df = datatools.preprocess(df, check_point['W2V_conf']['WINDOW_SIZE'], check_point['W2V_conf']['STEP_SIZE'])


# convert to input vector
test_vector = datatools.test_vector(window_df, check_point['W2V_conf']['train_dict'], check_point['W2V_conf']['w2v_dict'])

# load LogLSTM model
trained_model_ = model.LogLSTM(**check_point['model_hyperparam']).to(device)
trained_model_.load_state_dict(check_point['model_state_dict'])

# predict label
_, y_pred = model.model_inference(trained_model_, device, test_vector['W2V_EventId'].values.tolist())


### Conclusion

In this workflow, we showed a pipeline for training a binary sequence classifier to identify anomalous log sequences. Negative sampling was used to generate negative examples alongside the normal logs for training the model. The model was evaluated on the BGL dataset to separate alert from non-alert sequences. With an F1 score of about 0.95 on the abnormal class, the model identifies most true alerts in the test log samples.

### Reference

1. https://arxiv.org/pdf/2202.04301.pdf
2. https://ieeexplore.ieee.org/document/9671642
3. https://github.com/hanxiao0607/InterpretableSAD