## Log sequence anomaly detection

### Content
- Introduction
- Dataset
- Training and Evaluation
- Conclusion
- Reference

### Introduction
Anomaly detection in sequential log data aims to identify sequences that deviate from expected behavior
or patterns. For example, software-intensive systems often record runtime information by
printing console logs. A large and complex system can produce a massive volume of logs, which can be used for troubleshooting.
The log messages can be modeled as an event sequence. It is critical to detect anomalous states
in a timely manner to ensure the reliability of the software system and mitigate losses.

Log data usually consists of unstructured text messages, which help engineers understand the system’s internal
status and facilitate monitoring, administering, and troubleshooting the system. Log messages can be parsed into log events,
which are the templates (constant parts) of the messages.
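
To make the idea of a template concrete, here is a minimal sketch that masks variable-looking tokens with the wildcard `<*>`. It is not the Drain algorithm; the masking rules, the function name `to_template`, and the sample messages are illustrative assumptions.

```python
import re

# Hypothetical masking rules (an assumption, not Drain's logic):
# hexadecimal values first, then plain integers or dotted numbers.
_VARIABLE_PATTERNS = [
    re.compile(r"0x[0-9a-fA-F]+"),
    re.compile(r"\b\d+(?:\.\d+)*\b"),
]


def to_template(message: str) -> str:
    """Replace variable-looking tokens with the wildcard <*>."""
    for pattern in _VARIABLE_PATTERNS:
        message = pattern.sub("<*>", message)
    return message


# Illustrative messages written in the style of BGL console logs.
messages = [
    "generating core.2275",
    "generating core.862",
    "data TLB error interrupt at 0x0001f3a0",
]
templates = [to_template(m) for m in messages]

for message, template in zip(messages, templates):
    print(f"{message!r} -> {template!r}")
```

The first two messages collapse to the same template, `generating core.<*>`, so they count as the same log event in the sequence.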
 
This use case shows a workflow for identifying sequential anomalies in raw log sequence data.

In [36]:
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
import random
import utils, datatools, model
from sklearn.model_selection import train_test_split
import torch
import torch.nn as nn
import torch.optim as optim
from tqdm import tqdm
import time
import math
import os
from sklearn import metrics
from sklearn.metrics import precision_recall_fscore_support
import collections

SEED = 91
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True



#### Dataset processing

The dataset for this example comes from the BlueGene/L (BGL) supercomputer system. The BGL dataset contains 4,747,963 log messages collected
from a [BlueGene/L](https://zenodo.org/record/3227177/files/BGL.tar.gz?download=1) supercomputer system at Lawrence Livermore National Laboratory. The log messages can be categorized into alert and non-alert messages. The log messages are parsed into a structured log format using the [`Drain`](https://github.com/logpai/logparser) parser.

To run this workflow, we can use a portion of the parsed BGL dataset taken from https://github.com/LogIntelligence/LogPPT.

#### Preprocessing log dataset

In [4]:
df = pd.read_csv('https://raw.githubusercontent.com/LogIntelligence/LogPPT/master/logs/BGL/BGL_2k.log_structured.csv')

In [22]:
# Smaller dataset
DATASET_NAME = 'https://raw.githubusercontent.com/LogIntelligence/LogPPT/master/logs/BGL/BGL_2k.log_structured.csv' #'BGL_2k'
TRAIN_SIZE = 100 
WINDOW_SIZE = 10
STEP_SIZE = 20
RATIO = 0.1

# Full dataset 
# DATASET_NAME = 'dataset/bgl_1m.log_structured.csv'
# TRAIN_SIZE = 10000 #00
# WINDOW_SIZE = 100
# STEP_SIZE = 20
# RATIO = 0.1

Create the training and test datasets by transforming the log dataset into embedding vectors.

In [37]:
train_normal, test_normal, test_abnormal, bigram, unique, weights, train_dict, w2v_dic = datatools.sliding_window(DATASET_NAME, WINDOW_SIZE, STEP_SIZE, TRAIN_SIZE)

Reading: Dataset/bgl_1m.log_structured.csv
Total logs in the dataset:  1000000
training size 10000
test normal size 27904
test abnormal size 12096
Number of all keys: 141
Number of training keys: 106
Word2Vec model: Word2Vec(vocab=107, size=8, alpha=0.025)
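
The windowing step can be sketched as follows. This is a minimal illustration under stated assumptions: the internals of `datatools.sliding_window` (labelling, train/test split, Word2Vec training) are not shown in this notebook, and the helper name `sliding_windows` is hypothetical.

```python
from typing import List, Sequence


def sliding_windows(events: Sequence[str], window_size: int, step_size: int) -> List[List[str]]:
    """Return windows of `window_size` events, advancing `step_size` events each time."""
    if window_size <= 0 or step_size <= 0:
        raise ValueError("window_size and step_size must be positive")
    windows = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(events) - window_size + 1, step_size):
        windows.append(list(events[start:start + window_size]))
    return windows


events = [f"E{i}" for i in range(10)]  # toy event-id sequence
windows = sliding_windows(events, window_size=4, step_size=3)
for window in windows:
    print(window)
```

In this sketch, a step smaller than the window gives overlapping windows, while a step larger than the window skips the events between consecutive windows.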


In [24]:
train_normal.shape

(10000, 4)

In [25]:
# Hyperparameters
vocab_dim = len(train_dict)+1
output_dim = 2
emb_dim = 8
hidden_dim = 128
n_layers = 1
dropout = 0.0
batch_size = 32
times = 20

In [26]:
def get_dataframe(lst, label, dic):
    df = pd.DataFrame()
    df['EventId'] = lst
    df['class_label'] = label
    return datatools.str_key_to_w2v_index(df, dic)
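
The output above reports more keys in the whole dataset (141) than in the training portion (106), so some test events have no training-time index. A common way to handle this is to reserve one index for unseen events. The sketch below assumes that convention. Whether `datatools.str_key_to_w2v_index` works this way is not shown here, and the names `build_index`, `encode`, and `UNKNOWN_INDEX` are hypothetical.

```python
from typing import Dict, List

UNKNOWN_INDEX = 0  # assumption: index 0 is reserved for events unseen in training


def build_index(train_sequences: List[List[str]]) -> Dict[str, int]:
    """Assign indices 1..N to event ids in order of first appearance."""
    index: Dict[str, int] = {}
    for sequence in train_sequences:
        for event in sequence:
            if event not in index:
                index[event] = len(index) + 1
    return index


def encode(sequence: List[str], index: Dict[str, int]) -> List[int]:
    """Map event ids to integer indices, falling back to UNKNOWN_INDEX."""
    return [index.get(event, UNKNOWN_INDEX) for event in sequence]


train_sequences = [["E1", "E2", "E3"], ["E2", "E3", "E1"]]
event_index = build_index(train_sequences)
encoded = encode(["E3", "E9", "E1"], event_index)  # "E9" never appears in training
vocab_size = len(event_index) + 1  # one extra slot for the reserved index

print(encoded, vocab_size)
```

Under this assumption, the embedding table needs one row more than the number of training keys, which would be consistent with `vocab_dim = len(train_dict)+1`.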

Generate negative samples and split the data into training and validation sets. Given a set of normal sequences, anomalous sequences are generated via negative sampling. A binary sequence classifier is then trained to distinguish the generated negative samples (label 1) from the normal sequences (label 0).

In [27]:
neg_samples = datatools.negative_sampling(train_normal, bigram, unique, times, vocab_dim)
df_neg = get_dataframe(neg_samples, 1, w2v_dic)
df_pos = get_dataframe(list(train_normal['EventId']), 0, w2v_dic)
df_pos.columns = df_pos.columns.astype(str)
df_train = pd.concat([df_pos, df_neg], ignore_index = True, axis=0)
df_train = df_train.reset_index(drop=True)
y = list(df_train.loc[:,'class_label'])
X = list(df_train['W2V_EventId'])

# split train, validation set
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
X_train = torch.tensor(X_train,requires_grad=False).long()
X_val = torch.tensor(X_val,requires_grad=False).long()
y_train = torch.tensor(y_train).reshape(-1, 1).long()
y_val = torch.tensor(y_val).reshape(-1, 1).long()
train_iter = utils.get_iter(X_train, y_train, batch_size)
val_iter = utils.get_iter(X_val, y_val, batch_size)
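
The negative-sampling idea used in this step can be sketched as follows. This is an illustration under assumptions, not the code in `datatools.negative_sampling`, which is not shown here. The sketch corrupts a normal sequence by replacing one event so that the resulting (previous event, new event) pair never occurs in the normal data. The function names and the `max_tries` parameter are hypothetical.

```python
import random
from typing import List, Sequence, Set, Tuple


def observed_bigrams(sequences: Sequence[Sequence[int]]) -> Set[Tuple[int, int]]:
    """Collect every adjacent event pair seen in the normal sequences."""
    return {(a, b) for seq in sequences for a, b in zip(seq, seq[1:])}


def negative_samples(sequences, vocabulary, times, rng, max_tries=100) -> List[List[int]]:
    """Create up to `times` corrupted copies of each normal sequence."""
    seen = observed_bigrams(sequences)
    negatives = []
    for seq in sequences:
        if len(seq) < 2:
            continue  # no bigram to corrupt
        for _ in range(times):
            for _ in range(max_tries):
                pos = rng.randrange(1, len(seq))
                candidate = rng.choice(vocabulary)
                # Accept only replacements that create an unseen bigram.
                if (seq[pos - 1], candidate) not in seen:
                    corrupted = list(seq)
                    corrupted[pos] = candidate
                    negatives.append(corrupted)
                    break
    return negatives


normal = [[1, 2, 3, 4], [1, 2, 3, 5], [2, 3, 4, 5]]  # toy index sequences
vocabulary = [1, 2, 3, 4, 5]
negatives = negative_samples(normal, vocabulary, times=2, rng=random.Random(0))
print(len(negatives), "negative sequences generated")
```

Each generated sequence keeps the length of its source sequence and contains at least one event transition that the normal data never exhibits, which is what lets it stand in for an anomaly during training.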

### Training and Evaluation

An LSTM model is trained on Word2Vec embeddings generated from both positive and negative examples, with the task of binary classification.

In [29]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_epoch = 10
LAD_model = model.LogLSTM(weights, vocab_dim, output_dim, emb_dim, hidden_dim, n_layers, dropout, device, batch_size).to(device)
optimizer = optim.Adam(LAD_model.parameters())
criterion = nn.CrossEntropyLoss()

# Create the checkpoint directory if it does not exist yet
os.makedirs('model', exist_ok=True)

# Training LSTM model
clip = 1

best_test_loss = float('inf')

for epoch in tqdm(range(n_epoch)):
    
    start_time = time.time()
    train_loss= model.train(LAD_model, train_iter, optimizer, criterion, clip, epoch, device)        

    val_loss = model.evaluate(LAD_model, val_iter, criterion, device)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = model.epoch_time(start_time, end_time)
    
    if val_loss < best_test_loss:
        best_test_loss = val_loss
        torch.save(LAD_model.state_dict(), 'model/model_BGL.pt')
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {val_loss:.3f} |  Val. PPL: {math.exp(val_loss):7.3f}')

 10%|█         | 1/10 [00:47<07:09, 47.75s/it]

Epoch: 01 | Time: 0m 47s
	Train Loss: 0.059 | Train PPL:   1.061
	 Val. Loss: 0.029 |  Val. PPL:   1.029


 20%|██        | 2/10 [01:35<06:21, 47.73s/it]

Epoch: 02 | Time: 0m 47s
	Train Loss: 0.019 | Train PPL:   1.020
	 Val. Loss: 0.015 |  Val. PPL:   1.015


 30%|███       | 3/10 [02:22<05:32, 47.48s/it]

Epoch: 03 | Time: 0m 47s
	Train Loss: 0.016 | Train PPL:   1.016
	 Val. Loss: 0.014 |  Val. PPL:   1.014


 40%|████      | 4/10 [03:09<04:43, 47.32s/it]

Epoch: 04 | Time: 0m 47s
	Train Loss: 0.012 | Train PPL:   1.012
	 Val. Loss: 0.011 |  Val. PPL:   1.011


 50%|█████     | 5/10 [03:56<03:56, 47.29s/it]

Epoch: 05 | Time: 0m 47s
	Train Loss: 0.010 | Train PPL:   1.010
	 Val. Loss: 0.010 |  Val. PPL:   1.010


 60%|██████    | 6/10 [04:45<03:10, 47.73s/it]

Epoch: 06 | Time: 0m 48s
	Train Loss: 0.009 | Train PPL:   1.009
	 Val. Loss: 0.010 |  Val. PPL:   1.010


 70%|███████   | 7/10 [05:32<02:22, 47.60s/it]

Epoch: 07 | Time: 0m 47s
	Train Loss: 0.007 | Train PPL:   1.007
	 Val. Loss: 0.009 |  Val. PPL:   1.009


 80%|████████  | 8/10 [06:20<01:35, 47.58s/it]

Epoch: 08 | Time: 0m 47s
	Train Loss: 0.007 | Train PPL:   1.007
	 Val. Loss: 0.010 |  Val. PPL:   1.010


 90%|█████████ | 9/10 [07:07<00:47, 47.57s/it]

Epoch: 09 | Time: 0m 47s
	Train Loss: 0.006 | Train PPL:   1.006
	 Val. Loss: 0.009 |  Val. PPL:   1.009


100%|██████████| 10/10 [07:55<00:00, 47.55s/it]

Epoch: 10 | Time: 0m 47s
	Train Loss: 0.005 | Train PPL:   1.005
	 Val. Loss: 0.010 |  Val. PPL:   1.010





### Evaluation

The model is evaluated on held-out normal and abnormal test sequences, reporting precision, recall, and F1 score.

In [30]:
# For evaluation the 
test_abnormal_ratio = model.ratio_abnormal_sequence(test_abnormal, WINDOW_SIZE, RATIO)
test_ab_X, test_ab_X_key_label = test_abnormal_ratio['W2V_EventId'], test_abnormal_ratio['Key_label']
test_n_X, test_n_X_key_label = test_normal['W2V_EventId'], test_normal['Key_label']
test_ab_y = test_abnormal_ratio['Label']
test_n_y = test_normal['Label']
# Scale the number of normal test sequences by len(test_abnormal_ratio) / len(test_abnormal)
n_normal = int(len(test_n_X) * (len(test_abnormal_ratio) / len(test_abnormal)))
y, y_pre = model.model_precision(LAD_model, device, test_n_X.values.tolist()[:n_normal],
                                 test_ab_X.values.tolist())
f1_acc = metrics.classification_report(y, y_pre, digits=5)
print(f1_acc)

              precision    recall  f1-score   support

           0    1.00000   0.97895   0.98936       475
           1    0.95370   1.00000   0.97630       206

    accuracy                        0.98532       681
   macro avg    0.97685   0.98947   0.98283       681
weighted avg    0.98600   0.98532   0.98541       681
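
The per-class figures in the report follow directly from confusion counts. As a worked check, the sketch below recomputes the row for class 1 (anomalous). The counts are inferred from the report rather than printed by the notebook: a recall of 1.0 on 206 samples implies 206 true positives and no false negatives, and a recall of 0.97895 on 475 normal samples implies 10 false positives.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Counts for class 1, inferred from the classification report above.
tp, fp, fn = 206, 10, 0
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.5f} recall={recall:.5f} f1={f1:.5f}")
# prints: precision=0.95370 recall=1.00000 f1=0.97630
```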




### Conclusion

In this workflow, we showed a pipeline for training a binary sequence classifier to distinguish anomalous log sequences from normally generated ones. We used negative sampling to generate negative examples, so the model can be trained using only a dataset of normal log sequences. The model was evaluated on the BGL dataset to distinguish alerts from non-alert messages. In the run shown above, it reached an F1 score of 0.976 on the anomalous class and an overall accuracy of 0.985.

### Reference

- https://arxiv.org/pdf/2202.04301.pdf
- https://ieeexplore.ieee.org/document/9671642