# LAD Custom experiment

This notebook tests how prediction quality depends on whether spaces are preserved in the input to the word2vec model.

In the original implementation, preprocessing removes every non-alphabetic character (digits, spaces, punctuation) from a log message before it reaches the w2v model, producing a single word:

`2019-04-12 01:13:20 [DEBUG] Processed 181 out of 181 packages` → `["DEBUGProcessedoutofpackages"]`

Here we compare the original implementation with our approach, which keeps the words separate:

`2019-04-12 01:13:20 [DEBUG] Processed 181 out of 181 packages` → `["DEBUG", "Processed", "out", "of", "packages"]`
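
The two cleaning strategies above can be reproduced with a few lines of standard-library Python. This is a minimal sketch: the function names `clean_joined` and `clean_tokens` are illustrative and are not part of the notebook, although the regular expression is the one the notebook uses.

```python
import re

LINE = "2019-04-12 01:13:20 [DEBUG] Processed 181 out of 181 packages"


def clean_joined(line):
    """Original approach: keep alphabetic runs and glue them into one token."""
    return "".join(re.findall("[a-zA-Z]+", line))


def clean_tokens(line):
    """Custom approach: keep alphabetic runs as separate tokens."""
    return re.findall("[a-zA-Z]+", line)


print([clean_joined(LINE)])  # ['DEBUGProcessedoutofpackages']
print(clean_tokens(LINE))    # ['DEBUG', 'Processed', 'out', 'of', 'packages']
```

With the first strategy every distinct log line becomes its own vocabulary entry. With the second, log lines share vocabulary through the words they have in common.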

### Import packages

In [1]:
import os
import time
import numpy as np
import logging
import sompy
from multiprocessing import Pool
from itertools import product
import pandas as pd
import re
import gensim as gs
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine
from sklearn.preprocessing import normalize



In [2]:
import logging

logger = logging.getLogger()
logger.disabled = True
logging.disable()

# Original Approach

### Define Functions

#### 1. Log Preprocessing

One assumption shared by all these functions is that we immediately convert our data into a pandas DataFrame with a "message" column containing the relevant information.

**We then treat each individual log line as a "word", cleaning it by removing all non-alphabetic characters, including white space.**

In [3]:
def _preprocess(data):
    for col in data.columns:
        if col == "message":
            data[col] = data[col].apply(_clean_message)
        else:
            data[col] = data[col].apply(to_str)

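    # Fill missing values in place so the caller's DataFrame is updated;
    # rebinding the local name `data` would have no effect outside this function.
    data.fillna("EMPTY", inplace=True)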
    
def _clean_message(line):
    """Remove all non-alphabetic characters from message strings."""
    return "".join(
        re.findall("[a-zA-Z]+", line)
    )  # Leaving only a-z in there as numbers add to anomalousness quite a bit

def to_str(x):
    """Convert all non-str lists to string lists for Word2Vec."""
    ret = " ".join([str(y) for y in x]) if isinstance(x, list) else str(x)
    return ret

#### 2. Text Encoding  

Here we employ the gensim implementation of Word2Vec to encode our logs as fixed-length numerical vectors. Logs are notably not the natural use case for word2vec, but this approach attempts to leverage the fact that log lines, like words, have a context, so encoding a log based on its co-occurrence with other logs makes some intuitive sense.
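
To make the notion of "context" concrete, the sketch below lists which log-words fall inside each other's window in a stream of logs. It only illustrates the co-occurrence idea and does not train any embedding. The helper `context_pairs` and the sample stream are hypothetical and are not part of the notebook.

```python
from collections import Counter


def context_pairs(sequence, window):
    """Yield (center, neighbour) pairs for items at most `window` positions apart."""
    for i, center in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, sequence[j]


# Hypothetical stream of already-cleaned log "words", in arrival order.
stream = ["healthcheck", "goclient", "healthcheck", "wplogin", "healthcheck"]
counts = Counter(context_pairs(stream, window=1))
print(counts[("healthcheck", "goclient")])  # 2
```

Log-words that repeatedly appear near the same neighbours end up with similar contexts, which is the signal Word2Vec learns from.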

In [4]:
def create(words, vector_length, window_size):
    """Create new word2vec model."""
    # Written against gensim < 4.0; gensim 4 renames `size`/`iter` to `vector_size`/`epochs`.
    w2vmodel = {}
    for col in words.columns:
        if col in words:
            w2vmodel[col] = gs.models.Word2Vec([list(words[col])], min_count=1, size=vector_length, 
                                     window=window_size, seed=42, workers=1, iter=550,sg=0)
        else:
            #_LOGGER.warning("Skipping key %s as it does not exist in 'words'" % col)
            pass
        
    return w2vmodel

def one_vector(new_D, w2vmodel):
    """Create a single vector from model."""
    transforms = {}
    for col in w2vmodel.keys():
        if col in new_D:
            transforms[col] = w2vmodel[col].wv[new_D[col]]

    new_data = []

    for i in range(len(transforms["message"])):
        logc = np.array(0)
        for _, c in transforms.items():
            if c.item(i):
                logc = np.append(logc, c[i])
            else:
                logc = np.append(logc, [0, 0, 0, 0, 0])
        new_data.append(logc)

    return np.array(new_data, ndmin=2)

#### 3. Model Training

Here we employ the SOMPY implementation of the Self-Organizing Map to train our model. The `train` function below is a thin wrapper that makes SOMPY's training interface easier to call, and it returns the trained model.

The trained model object also has an attribute, `codebook.matrix`, which gives direct access to the trained self-organizing map itself. If the map converged successfully, it should consist of nodes in our N-dimensional log space that are well ordered and approximate the topology of the logs in our training set.
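
As an illustration of how map nodes come to approximate the data, here is a minimal sketch of the classic online SOM update rule on a 1-D chain of nodes. It is not SOMPY's algorithm (SOMPY trains in batch mode on a 2-D grid); the function `som_update` and its parameters are assumptions made for this example only.

```python
import math


def som_update(nodes, sample, learning_rate=0.5, sigma=1.0):
    """Apply one online SOM update to a 1-D chain of nodes, in place.

    Returns the index of the best-matching unit (BMU).
    """
    # BMU: the node closest to the sample (squared Euclidean distance).
    bmu = min(
        range(len(nodes)),
        key=lambda j: sum((w - x) ** 2 for w, x in zip(nodes[j], sample)),
    )
    for j, node in enumerate(nodes):
        # Gaussian neighbourhood: nodes near the BMU on the chain move more.
        influence = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
        for k, x in enumerate(sample):
            node[k] += learning_rate * influence * (x - node[k])
    return bmu


nodes = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]  # toy 3-node map in 2-D
bmu = som_update(nodes, sample=[0.9, 1.3])
print(bmu)         # 1
print(nodes[bmu])  # the BMU has moved halfway toward the sample
```

Repeating this update over many samples, while shrinking `learning_rate` and `sigma`, is what pulls neighbouring nodes toward neighbouring regions of the data.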

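We also compute the distances of the training data to the trained map as a baseline for building a threshold. In this notebook that computation is commented out inside `train` and performed in separate cells after training.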
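
The threshold rule used later in the notebook is the mean of the baseline scores plus three standard deviations. A minimal standard-library sketch follows. The helper name and the sample scores are hypothetical; `statistics.pstdev` is the population standard deviation, which matches the default behaviour of `np.std`.

```python
import statistics


def three_sigma_threshold(scores):
    """Return mean + 3 * population standard deviation of the scores."""
    return statistics.fmean(scores) + 3 * statistics.pstdev(scores)


baseline = [0.10, 0.12, 0.11, 0.13, 0.09]  # hypothetical training scores
threshold = three_sigma_threshold(baseline)
print(threshold)  # about 0.152
```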

In [5]:
def train(inp, map_size, iterations, parallelism):
    print(f'training dataset is of size {inp.shape[0]}')
    mapsize = [map_size, map_size]
    np.random.seed(42)
    som = sompy.SOMFactory.build(inp, mapsize , initialization='random')
    som.train(n_job=parallelism, train_rough_len=100,train_finetune_len=5)
    model = som.codebook.matrix.reshape([map_size, map_size, inp.shape[1]])
    
    #distances = get_anomaly_score(inp, 8, model)
    #threshold = 3*np.std(distances) + np.mean(distances)
    
    return som #,threshold

#### 4. Generating Anomaly Scores

One of the key elements of this approach is quantifying the distance between our logs and the nodes of our self-organizing map. The two functions below, taken together, form a parallel implementation of this calculation.
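
The score itself is the smallest cosine distance between a log vector and any node of the map. The sketch below shows that computation without multiprocessing or SciPy. The helper names are illustrative, and the code assumes non-zero vectors, since cosine distance is undefined for a zero vector.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity, the quantity scipy.spatial.distance.cosine returns."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_node_distance(nodes, log_vector):
    """Smallest cosine distance between one log vector and any map node."""
    return min(cosine_distance(node, log_vector) for node in nodes)


nodes = [[1.0, 0.0], [0.0, 1.0]]  # toy map with two nodes
aligned = nearest_node_distance(nodes, [2.0, 0.0])     # same direction as node 0
opposed = nearest_node_distance(nodes, [-1.0, -1.0])   # points away from both nodes
print(aligned)  # 0.0
print(opposed)  # about 1.707
```

Cosine distance ignores vector length, so a log is scored by the direction of its embedding only.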

In [6]:
def get_anomaly_score(logs, parallelism, model):

    parameters = [[x,model] for x in logs]
    pool = Pool(parallelism)
    dist = pool.map(calculate_anomaly_score, parameters) 
    pool.close()
    pool.join()
    return dist

def calculate_anomaly_score(parameters):
    """Compute the distance of a log entry to the nearest SOM node."""
    log = parameters[0]
    model = parameters[1]
    dist_smallest = np.inf
    for x in range(model.shape[0]):
        for y in range(model.shape[1]):
            dist = cosine(model[x][y], log) 
            #dist = np.linalg.norm(model[x][y] - log)
            if dist < dist_smallest:
                dist_smallest = dist
    return dist_smallest

#### 5. Model Inference / Prediction

Here we make an inference about a new log message by scoring the incoming log and checking whether the score exceeds a threshold value.

Ideally, our word2vec model has been monitoring our application long enough to have seen all the logs, so for a known log we can simply look up its vector representation.

One downside of word2vec is that it is quite brittle when it comes to incorporating words that haven't been seen before. In this example, we retrain the word2vec model if the new log has not been seen before.
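
The thresholding step described above can be sketched as a small function. The name `classify` is hypothetical. The sketch assumes the score arrives as a one-element list, as `get_anomaly_score` returns for a single vector, and unpacks it first because a plain Python list cannot be compared with a plain float.

```python
def classify(scores, threshold):
    """Return (label, score): label 1 marks an anomaly, 0 a normal log.

    `scores` is a one-element list holding the anomaly score of one log.
    """
    score = float(scores[0])  # unpack the scalar before comparing
    return (0 if score < threshold else 1), score


print(classify([0.14], threshold=0.84))  # (0, 0.14)
print(classify([0.86], threshold=0.84))  # (1, 0.86)
```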

In [7]:
def infer(w2v, som, log, data, threshold):
    
    log =  pd.DataFrame({"message":log},index=[1])
    _preprocess(log)
    
    if log.message.iloc[0] in list(w2v['message'].wv.vocab.keys()):
        vector = w2v["message"].wv[log.message.iloc[0]]
    else:
        w2v = gs.models.Word2Vec([[log.message.iloc[0]] + list(data["message"])], 
                                 min_count=1, size=25, window=3, seed=42, workers=1, iter=550, sg=0)
        vector = w2v.wv[log.message.iloc[0]]
    
    score = get_anomaly_score([vector], 1, som)
    
    if score < threshold:
        return 0, score
    else:
        return 1, score

## Implementation

### Get logs from file

In [8]:
data_path = r"file:///home/nadzya/Apps/log-anomaly-detector/validation_data/solidex.by.json"
data = pd.DataFrame(pd.read_json(data_path, orient=str).message)
data

| | message |
|---|---|
| 0 | <158>Nov 25 12:02:31 195-137-160-145 nginx-acc... |
| 1 | <158>Nov 25 12:04:11 195-137-160-145 nginx-acc... |
| 2 | <158>Nov 25 12:15:42 195-137-160-145 nginx-acc... |
| 3 | <158>Nov 25 12:25:32 195-137-160-145 nginx-acc... |
| 4 | <158>Nov 25 12:25:22 195-137-160-145 nginx-acc... |
| ... | ... |
| 9995 | <158>Nov 25 16:01:44 195-137-160-145 nginx-acc... |
| 9996 | <158>Nov 25 13:53:47 195-137-160-145 nginx-acc... |
| 9997 | <158>Nov 25 15:25:02 195-137-160-145 nginx-acc... |
| 9998 | <158>Nov 25 14:48:50 195-137-160-145 nginx-acc... |


### Preprocessing

In [9]:
preprocessed_data = data.copy()
_preprocess(preprocessed_data)

# First 5 preprocessed messages
pd.DataFrame(preprocessed_data.message).head()

| | message |
|---|---|
| 0 | NovnginxaccessNovGETHTTPGohttpclient |
| 1 | NovnginxaccessNovGETcoursesHTTPhttpwwwsolidexb... |
| 2 | NovnginxaccessNovGETHTTPGohttpclient |
| 3 | NovnginxaccessNovGETHTTPGohttpclient |
| 4 | NovnginxaccessNovGETHTTPGohttpclient |


Let's see how many logs map to each preprocessed word-log. We display the top 5.

In [10]:
x = preprocessed_data.message.value_counts()
for i in x.keys()[:5]:
    print(i, x[i])

NovnginxaccessNovGETokompaniiHTTPHealthCheck 3979
NovnginxaccessNovGETHTTPGohttpclient 3836
NovnginxaccessNovPOSTwpcronphpdoingwpcronHTTPWordPresshttpwwwsolidexby 734
NovnginxaccessNovPOSTwploginphpHTTPhttpwwwsolidexbywploginphpMozillaWindowsNTWinxrvGeckoFirefox 702
NovnginxaccessNovHEADHTTPhttpsolidexbyMozillacompatibleUptimeRobothttpwwwuptimerobotcom 376


### Word2Vec for such words

In [11]:
w2v = create(words=preprocessed_data, vector_length=25, window_size=5)

In [12]:
log_vectors = one_vector(preprocessed_data, w2v)

In [13]:
print(log_vectors.shape)
print(log_vectors[:, 1:].shape)
# Drop the leading 0 that one_vector prepends to every row via np.array(0).
log_vectors = log_vectors[:, 1:]

(10000, 26)
(10000, 25)


### Train SOM

In [15]:
map_size = 24
som = train(log_vectors, map_size=map_size, iterations=0, parallelism=2)

training dataset is of size 10000


In [16]:
model = som.codebook.matrix.reshape([map_size, map_size, log_vectors.shape[1]])

In [17]:
anomaly_scores = get_anomaly_score(log_vectors, parallelism=4, model=model)

In [18]:
[x for x in anomaly_scores if x > 0.6]

[]

In [19]:
threshold = 3*np.std(anomaly_scores) + np.mean(anomaly_scores)
threshold

0.8403589377654117

# With Custom Approach

### Define Functions

#### 1. Log Preprocessing

One assumption shared by all these functions is that we immediately convert our data into a pandas DataFrame with a "message" column containing the relevant information.

We then treat each individual log line as a sequence of words, cleaning it by removing all non-alphabetic characters.

White space is treated as a word boundary rather than discarded, and the function returns a list of words.

In [20]:
def _preprocess_custom(data):
    for col in data.columns:
        if col == "message":
            data[col] = data[col].apply(_clean_message_custom)
        else:
            data[col] = data[col].apply(to_str_custom)

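    # Fill missing values in place so the caller's DataFrame is updated;
    # rebinding the local name `data` would have no effect outside this function.
    data.fillna("EMPTY", inplace=True)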
    
def _clean_message_custom(line):
    """Split a message into its alphabetic words, dropping all other characters."""
    words = list(re.findall("[a-zA-Z]+", line))
    return words

def to_str_custom(x):
    """Convert all non-str lists to string lists for Word2Vec."""
    ret = " ".join([str(y) for y in x]) if isinstance(x, list) else str(x)
    return ret

#### Text encoding

In [21]:
def create_custom(logs, vector_length, window_size):
    """Create new word2vec model."""
    model = gs.models.Word2Vec(sentences=list(logs), size=vector_length, window=window_size)
    return model

def get_vectors(model, logs, vector_length):
    """Return logs as list of vectorized words"""
    vectors = []
    for x in logs:
        temp = []
        for word in x:
            if word in model.wv:
                temp.append(model.wv[word])
            else:
                temp.append(np.array([0]*vector_length))
        vectors.append(temp)
    return vectors

def _log_words_to_one_vector(log_words_vectors):
    """Average the word vectors of one log, coordinate by coordinate."""
    result = []
    log_array_transposed = np.array(log_words_vectors, dtype=object).transpose()
    for coord in log_array_transposed:
        result.append(np.mean(coord))
    return result

def vectorized_logs_to_single_vectors(vectors):
    """Represent log messages as vectors according to the vectors
    of the words in these logs

    :param vectors: list of log messages, each represented as a list of word vectors
            [[wordvec11, wordvec12], [wordvec21, wordvec22], ...]
    """
    result = []
    for log_words_vector in vectors:
        result.append(_log_words_to_one_vector(log_words_vector))
    return np.array(result)
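
The averaging performed by `_log_words_to_one_vector` is a coordinate-wise mean over the word vectors of one log, which with NumPy could also be written as `np.mean(..., axis=0)`. A dependency-free sketch of the same idea follows; the name `mean_pool` and the sample vectors are hypothetical.

```python
def mean_pool(word_vectors):
    """Average equal-length word vectors coordinate by coordinate."""
    if not word_vectors:
        return []  # a log with no tokens has nothing to average
    count = len(word_vectors)
    return [sum(column) / count for column in zip(*word_vectors)]


# Two hypothetical 3-dimensional word vectors for one log line.
pooled = mean_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]])
print(pooled)  # [2.0, 3.0, 4.0]
```

Because `get_vectors` represents out-of-vocabulary words as zero vectors, each unknown word pulls the averaged log vector toward the origin.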

#### Prediction

In [22]:
def infer_custom(w2v, som, log, logs_list, threshold):
    
    log = pd.DataFrame({"message": log}, index=[1])
    _preprocess_custom(log)
    
    vector = []
    w2v = gs.models.Word2Vec([log.message.iloc[0]] + logs_list,
                             min_count=1, size=25, window=5)
    for word in log.message.iloc[0]:
        if word in w2v.wv.vocab.keys():
            vector.append(w2v.wv[word])
        else:
            vector.append(np.array([0]*25))
    
    one_vector = _log_words_to_one_vector(vector)
    
    score = get_anomaly_score([one_vector], 1, som)
    
    if score < threshold:
        return 0, score
    else:
        return 1, score

## Implementation

### Preprocessing

In [23]:
custom_preproc_data = data.copy()
_preprocess_custom(custom_preproc_data)

# First 5 preprocessed messages
pd.DataFrame(custom_preproc_data.message).head()

| | message |
|---|---|
| 0 | [Nov, nginx, access, Nov, GET, HTTP, Go, http,... |
| 1 | [Nov, nginx, access, Nov, GET, courses, HTTP, ... |
| 2 | [Nov, nginx, access, Nov, GET, HTTP, Go, http,... |
| 3 | [Nov, nginx, access, Nov, GET, HTTP, Go, http,... |
| 4 | [Nov, nginx, access, Nov, GET, HTTP, Go, http,... |


In [24]:
logs_list = list(custom_preproc_data.message)
logs_list[0]

['Nov', 'nginx', 'access', 'Nov', 'GET', 'HTTP', 'Go', 'http', 'client']

In [25]:
w2v_custom = create_custom(logs_list, vector_length=25, window_size=5)

In [26]:
vectors_custom = get_vectors(model=w2v_custom, logs=logs_list, vector_length=25)

In [27]:
logs_as_vectors = vectorized_logs_to_single_vectors(vectors_custom)

### Train SOM

In [28]:
map_size = 24
som_custom = train(logs_as_vectors, map_size=map_size, iterations=0, parallelism=2)

training dataset is of size 10000


In [29]:
model_custom = som_custom.codebook.matrix.reshape([map_size, map_size, logs_as_vectors.shape[1]])

In [30]:
anomaly_scores_custom = get_anomaly_score(logs_as_vectors, parallelism=4, model=model_custom)

In [31]:
[x for x in anomaly_scores_custom if x > 0.6]

[0.6063197258670838,
 0.6063197258670838,
 0.7999504675137618,
 0.7999504675137618,
 0.7999504675137618,
 0.7999504675137618,
 0.8141560154503695,
 0.6685166572869321,
 0.8141560154503695,
 0.6685166572869321,
 0.8141560154503695,
 0.6685166572869321,
 0.8141560154503695,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7572923941418215,
 0.7999504675137618,
 0.7999504675137618]

In [32]:
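# NOTE: this is computed from `anomaly_scores` (original approach), not
# `anomaly_scores_custom`, which is why the value equals `threshold` above.
threshold_custom = 3*np.std(anomaly_scores) + np.mean(anomaly_scores)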
threshold_custom

0.8403589377654117

# Prediction

**This is a test message**

In [33]:
infer(w2v, model, "This is a test message", preprocessed_data, threshold)

(0, [0.14252834281546678])

In [34]:
infer_custom(w2v_custom, model_custom, "This is a test message", logs_list, threshold_custom)

(0, [0.6412396233086541])

**<182>Dec 07 14:16:16 dataform  172.17.17.100 - - [07/Dec/2021:14:16:16 +0300] "POST /cgi-bin/.%2e/.%2e/.%2e/.%2e/etc/passwd HTTP/1.1" 400 5604 "-" "curl/7.68.0"**

In [35]:
infer(w2v, model, "<182>Dec 07 14:16:16 dataform  172.17.17.100 - - [07/Dec/2021:14:16:16 +0300] \"POST /cgi-bin/.%2e/.%2e/.%2e/.%2e/etc/passwd HTTP/1.1\" 400 5604 \"-\" \"curl/7.68.0\"", preprocessed_data, threshold)

(0, [0.13615576457271295])

In [36]:
infer_custom(w2v_custom, model_custom, "<182>Dec 07 14:16:16 dataform  172.17.17.100 - - [07/Dec/2021:14:16:16 +0300] \"POST /cgi-bin/.%2e/.%2e/.%2e/.%2e/etc/passwd HTTP/1.1\" 400 5604 \"-\" \"curl/7.68.0\"", logs_list, threshold_custom)

(1, [0.8640913019388633])

**<179>Dec 07 14:51:17 dataform  [Wed Dec 07 14:51:17.120946 2021] [auth_basic:error] [pid 2179734:tid 139663791892224] [client 172.17.17.100:44992] AH01617: user webadmin: authentication failure for "/register": Password Mismatch**

In [37]:
infer(w2v, model, "<179>Dec 07 14:51:17 dataform  [Wed Dec 07 14:51:17.120946 2021] [auth_basic:error] [pid 2179734:tid 139663791892224] [client 172.17.17.100:44992] AH01617: user webadmin: authentication failure for \"/register\": Password Mismatch", preprocessed_data, threshold)

(0, [0.1326183988460914])

In [38]:
infer_custom(w2v_custom, model_custom, "<179>Dec 07 14:51:17 dataform  [Wed Dec 07 14:51:17.120946 2021] [auth_basic:error] [pid 2179734:tid 139663791892224] [client 172.17.17.100:44992] AH01617: user webadmin: authentication failure for \"/register\": Password Mismatch", logs_list, threshold_custom)

(0, [0.8363002343084958])

**<179>Dec 07 12:10:53 smtplib.SMTPRecipientsRefused: {'doesntexist@solidex.by': (550, b'5.1.1 <doesntexist@solidex.by>: Recipient address rejected: User unknown in virtual mailbox table')}**

In [39]:
infer(w2v, model, "<179>Dec 07 12:10:53 smtplib.SMTPRecipientsRefused: {'doesntexist@solidex.by': (550, b'5.1.1 <doesntexist@solidex.by>: Recipient address rejected: User unknown in virtual mailbox table')}", preprocessed_data, threshold)

(0, [0.14048444087710865])

In [40]:
infer_custom(w2v_custom, model_custom, "<179>Dec 07 12:10:53 smtplibb.SMTPRecipientsRefused: {'doesntexist@solidex.by': (550, b'5.1.1 <doesntexist@solidex.by>: Recipient address rejected: User unknown in virtual mailbox table')}", logs_list, threshold_custom)

(0, [0.45630292906224423])