# LAD Custom experiment

Here we test how prediction quality depends on whether spaces are preserved when preparing log messages for the word2vec model.

In the original implementation, preprocessing for the w2v model removes all numbers and spaces from a log message, producing one single word:

`2019-04-12 01:13:20 [DEBUG] Processed 181 out of 181 packages` → `["DEBUGProcessedoutofpackages"]`

Here we compare the original implementation with our approach, which keeps each word separate:

`2019-04-12 01:13:20 [DEBUG] Processed 181 out of 181 packages` → `["DEBUG", "Processed", "out", "of", "packages"]`
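The two tokenizations can be reproduced with a short regular-expression sketch. The helper names `as_single_word` and `as_word_list` are illustrative and not part of the notebook; the pattern `[a-zA-Z]+` is the one used by the cleaning functions defined later.

```python
import re

ALPHA_RUN = re.compile(r"[a-zA-Z]+")  # same pattern as the notebook's cleaning functions


def as_single_word(line):
    """Original approach: glue every alphabetic run into one token."""
    return "".join(ALPHA_RUN.findall(line))


def as_word_list(line):
    """Custom approach: keep each alphabetic run as a separate token."""
    return ALPHA_RUN.findall(line)


line = "2019-04-12 01:13:20 [DEBUG] Processed 181 out of 181 packages"
print([as_single_word(line)])  # ['DEBUGProcessedoutofpackages']
print(as_word_list(line))      # ['DEBUG', 'Processed', 'out', 'of', 'packages']
```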

### Import packages

In [1]:
import os
import time
import numpy as np
import logging
import sompy
from multiprocessing import Pool
from itertools import product
import pandas as pd
import re
import gensim as gs
import matplotlib.pyplot as plt
from scipy.spatial.distance import cosine
from sklearn.preprocessing import normalize





In [2]:
# import logging

# logger = logging.getLogger()
# logger.disabled = True
# logging.disable()

# Original Approach

### Define Functions

#### 1. Log Preprocessing

One assumption shared by all these functions is that we immediately convert our data into a pandas DataFrame with a "message" column containing the relevant information.

**We then treat each individual log line as a "word", cleaning it by removing all non-alphabetic characters, including white space.**

In [3]:
def _preprocess(data):
    for col in data.columns:
        if col == "message":
            data[col] = data[col].apply(_clean_message)
        else:
            data[col] = data[col].apply(to_str)

    data.fillna("EMPTY", inplace=True)  # fill in place so the caller's frame is updated
    
def _clean_message(line):
    """Remove all non-alphabetical characters from message strings."""
    return "".join(
        re.findall("[a-zA-Z]+", line)
    )  # Leaving only a-z in there as numbers add to anomalousness quite a bit

def to_str(x):
    """Convert all non-str lists to string lists for Word2Vec."""
    ret = " ".join([str(y) for y in x]) if isinstance(x, list) else str(x)
    return ret

#### 2. Text Encoding  

Here we employ the gensim implementation of Word2Vec to encode our logs as fixed-length numerical vectors. Logs are notably not the natural use case for word2vec, but this approach attempts to leverage the fact that log lines, like words, have a context, so encoding a log based on its co-occurrence with other logs makes some intuitive sense.
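To make "context" concrete, the sketch below lists the (center, context) pairs that a sliding window produces over a sequence of tokens. It is a simplified illustration, not gensim's implementation: real training adds details such as subsampling of frequent tokens (visible as `sample=0.001` in the training log). The function name `context_pairs` and the toy tokens are made up for this example.

```python
def context_pairs(tokens, window):
    """Yield (center, context) pairs as a window slides over one sentence."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


# In the original approach the whole log file is one "sentence"
# whose tokens are cleaned log lines (toy stand-ins here).
sentence = ["HealthCheck", "Goclient", "HealthCheck", "wplogin"]
pairs = list(context_pairs(sentence, window=1))
for center, context in pairs:
    print(center, "->", context)
```

Log lines that tend to follow one another end up sharing context pairs, which is what pulls their vectors together during training.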

In [4]:
def create(words, vector_length, window_size):
    """Create new word2vec model."""
    w2vmodel = {}
    for col in words.columns:
        if col in words:
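            # The whole column is passed as a single sentence, so each cleaned log line acts as one word.
            # `size` and `iter` are the gensim 3.x parameter names (`vector_size` and `epochs` in gensim 4).
            w2vmodel[col] = gs.models.Word2Vec([list(words[col])], min_count=1, size=vector_length,
                                     window=window_size, seed=42, workers=1, iter=550, sg=0)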
        else:
            #_LOGGER.warning("Skipping key %s as it does not exist in 'words'" % col)
            pass
        
    return w2vmodel

def one_vector(new_D, w2vmodel):
    """Create a single vector from model."""
    transforms = {}
    for col in w2vmodel.keys():
        if col in new_D:
            transforms[col] = w2vmodel[col].wv[new_D[col]]

    new_data = []

    for i in range(len(transforms["message"])):
        logc = np.array(0)
        for _, c in transforms.items():
            if c.item(i):
                logc = np.append(logc, c[i])
            else:
                logc = np.append(logc, [0, 0, 0, 0, 0])
        new_data.append(logc)

    return np.array(new_data, ndmin=2)

#### 3. Model Training

Here we employ the SOMPY implementation of the Self-Organizing Map to train our model. This function simply makes it a bit easier for the user to work with SOMPY's training requirements, and it returns a trained model.

The trained model object also exposes `codebook.matrix`, which gives direct access to the trained self-organizing map itself. If the map successfully converged, it should consist of nodes in our N-dimensional log space that are well ordered and approximate the topology of the logs in our training set.

During training we can also compute the distances of the training data to the trained map as a baseline for building a threshold. In this notebook that step is commented out inside `train`, and the threshold is computed later from the anomaly scores.

In [5]:
def train(inp, map_size, iterations, parallelism):
    print(f'training dataset is of size {inp.shape[0]}')
    mapsize = [map_size, map_size]
    np.random.seed(42)
    som = sompy.SOMFactory.build(inp, mapsize , initialization='random')
    som.train(n_job=parallelism, train_rough_len=100,train_finetune_len=5)
    model = som.codebook.matrix.reshape([map_size, map_size, inp.shape[1]])
    
    #distances = get_anomaly_score(inp, 8, model)
    #threshold = 3*np.std(distances) + np.mean(distances)
    
    return som #,threshold

#### 4. Generating Anomaly Scores

One of the key elements of this approach is quantifying the distance between our logs and the nodes of our self-organizing map. The two functions below, taken together, form a parallel implementation of this calculation.

In [6]:
def get_anomaly_score(logs, parallelism, model):

    parameters = [[x,model] for x in logs]
    pool = Pool(parallelism)
    dist = pool.map(calculate_anomaly_score, parameters) 
    pool.close()
    pool.join()
    return dist

def calculate_anomaly_score(parameters):
    """Compute a distance of a log entry to elements of SOM."""
    log = parameters[0]
    model = parameters[1]
    dist_smallest = np.inf
    for x in range(model.shape[0]):
        for y in range(model.shape[1]):
            dist = cosine(model[x][y], log) 
            #dist = np.linalg.norm(model[x][y] - log)
            if dist < dist_smallest:
                dist_smallest = dist
    return dist_smallest
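As a cross-check of the logic above, here is a dependency-free sketch of the same nearest-node search. It assumes non-zero vectors and uses a made-up 2 x 2 grid; `cosine_distance` and `anomaly_score` are illustrative names, not functions from the notebook.

```python
import math


def cosine_distance(u, v):
    """1 - cos(u, v), the quantity scipy.spatial.distance.cosine returns for non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def anomaly_score(log_vector, som_grid):
    """Smallest cosine distance from a log vector to any node of a rows x cols grid."""
    return min(cosine_distance(node, log_vector) for row in som_grid for node in row)


# Toy 2 x 2 map of 2-dimensional nodes (made-up values).
toy_grid = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [-1.0, 0.0]]]
print(anomaly_score([2.0, 2.0], toy_grid))   # parallel to node [1, 1], so close to 0
print(anomaly_score([0.0, -1.0], toy_grid))  # no node points this way, so the score is larger
```

Because cosine distance ignores vector length, a log vector that is a scaled copy of a node scores (almost) zero; the score lies between 0 and 2.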

#### 5. Model Inference / Prediction

Here we make an inference about a new log message by scoring the incoming log and checking whether the score passes a given threshold.

Ideally, our word2vec model has been monitoring the application long enough to have seen all of its logs, so for a known log we can simply look up its vector representation.

One downside of word2vec is that it is quite brittle when it comes to words it has not seen before. In this example, we retrain the word2vec model whenever the new log has not been seen before.
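The brittleness is easy to see with a small vocabulary check. The sketch below uses two hypothetical vocabularies (they stand in for trained models and are not taken from the notebook's output) to contrast line-level tokens with word-level tokens; `oov_words` is an illustrative helper.

```python
import re


def oov_words(tokens, vocabulary):
    """Return the tokens that are missing from the vocabulary."""
    return [t for t in tokens if t not in vocabulary]


# Hypothetical vocabularies standing in for the two trained models.
line_vocab = {"NovnginxaccessNovGETHTTPGohttpclient"}  # one token per log line
word_vocab = {"Nov", "nginx", "access", "GET", "HTTP", "Go", "http", "client"}

new_log = "Nov 25 nginx-access GET /courses HTTP"
words = re.findall(r"[a-zA-Z]+", new_log)

print(oov_words(["".join(words)], line_vocab))  # the whole line is unseen
print(oov_words(words, word_vocab))             # only 'courses' is unseen
```

With one token per line, any change in a message makes the entire token unknown, whereas word-level tokens usually leave most of the message inside the vocabulary.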

In [7]:
def infer(w2v, som, log, data, threshold):
    
    log =  pd.DataFrame({"message":log},index=[1])
    _preprocess(log)
    
    if log.message.iloc[0] in list(w2v['message'].wv.vocab.keys()):
        vector = w2v["message"].wv[log.message.iloc[0]]
    else:
        w2v = gs.models.Word2Vec([[log.message.iloc[0]] + list(data["message"])], 
                                 min_count=1, size=25, window=3, seed=42, workers=1, iter=550, sg=0)
        vector = w2v.wv[log.message.iloc[0]]
    
    score = get_anomaly_score([vector], 1, som)
    
    if score < threshold:
        return 0, score
    else:
        return 1, score

## Implementation

### Get logs from file

In [8]:
data_path = r"file:///home/nadzya/Apps/log-anomaly-detector/validation_data/solidex.by.json"
data = pd.DataFrame(pd.read_json(data_path, orient=str).message)
data

Unnamed: 0,message
0,<158>Nov 25 12:02:31 195-137-160-145 nginx-acc...
1,<158>Nov 25 12:04:11 195-137-160-145 nginx-acc...
2,<158>Nov 25 12:15:42 195-137-160-145 nginx-acc...
3,<158>Nov 25 12:25:32 195-137-160-145 nginx-acc...
4,<158>Nov 25 12:25:22 195-137-160-145 nginx-acc...
...,...
9995,<158>Nov 25 16:01:44 195-137-160-145 nginx-acc...
9996,<158>Nov 25 13:53:47 195-137-160-145 nginx-acc...
9997,<158>Nov 25 15:25:02 195-137-160-145 nginx-acc...
9998,<158>Nov 25 14:48:50 195-137-160-145 nginx-acc...


### Preprocessing

In [9]:
preprocessed_data = data.copy()
_preprocess(preprocessed_data)

# First 5 preprocessed messages
pd.DataFrame(preprocessed_data.message).head()

Unnamed: 0,message
0,NovnginxaccessNovGETHTTPGohttpclient
1,NovnginxaccessNovGETcoursesHTTPhttpwwwsolidexb...
2,NovnginxaccessNovGETHTTPGohttpclient
3,NovnginxaccessNovGETHTTPGohttpclient
4,NovnginxaccessNovGETHTTPGohttpclient


Let's see how many logs map to each preprocessed word-log. We display the top 5:

In [10]:
x = preprocessed_data.message.value_counts()
for i in x.keys()[:5]:
    print(i, x[i])

NovnginxaccessNovGETokompaniiHTTPHealthCheck 3979
NovnginxaccessNovGETHTTPGohttpclient 3836
NovnginxaccessNovPOSTwpcronphpdoingwpcronHTTPWordPresshttpwwwsolidexby 734
NovnginxaccessNovPOSTwploginphpHTTPhttpwwwsolidexbywploginphpMozillaWindowsNTWinxrvGeckoFirefox 702
NovnginxaccessNovHEADHTTPhttpsolidexbyMozillacompatibleUptimeRobothttpwwwuptimerobotcom 376


### Word2Vec for such words

In [11]:
w2v = create(words=preprocessed_data, vector_length=25, window_size=5)

consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
collected 29 word types from a corpus of 10000 raw words and 1 sentences
Loading a fresh vocabulary
effective_min_count=1 retains 29 unique words (100% of original 29, drops 0)
effective_min_count=1 leaves 10000 word corpus (100% of original 10000, drops 0)
deleting the raw counts dictionary of 29 items
sample=0.001 downsamples 11 most-common words
downsampling leaves estimated 948 word corpus (9.5% of prior 10000)
estimated required memory for 29 words and 25 dimensions: 20300 bytes
resetting layer weights
training model with 1 workers on 29 vocabulary and 25 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
...
EPOCH - 34 : training on 10000 raw words (975 effective words) took 0.0s, 163530 effective words/s
...
EPOCH - 519 : training on 10000 raw words (967 effective words) took 0.0s, 125814 effective words/s
...

In [12]:
log_vectors = one_vector(preprocessed_data, w2v)

In [13]:
print(log_vectors.shape)
print(log_vectors[:, 1:].shape)
log_vectors = log_vectors[:, 1:]

(10000, 26)
(10000, 25)


### Train SOM

In [14]:
map_size = 24
som = train(log_vectors, map_size=map_size, iterations=0, parallelism=2)

 Training...
 random_initialization took: 0.002000 seconds
 Rough training...
 radius_ini: 8.000000 , radius_final: 1.333333, trainlen: 100



training dataset is of size 10000


 epoch: 1 ---> elapsed time:  0.321000, quantization error: 13.200572

 epoch: 2 ---> elapsed time:  0.250000, quantization error: 2.579154

 epoch: 3 ---> elapsed time:  0.231000, quantization error: 2.273410

 epoch: 4 ---> elapsed time:  0.254000, quantization error: 1.716095

 epoch: 5 ---> elapsed time:  0.240000, quantization error: 1.535242

 epoch: 6 ---> elapsed time:  0.321000, quantization error: 1.446450

 epoch: 7 ---> elapsed time:  0.238000, quantization error: 1.381383

 epoch: 8 ---> elapsed time:  0.269000, quantization error: 1.352988

 epoch: 9 ---> elapsed time:  0.206000, quantization error: 1.328116

 epoch: 10 ---> elapsed time:  0.186000, quantization error: 1.309811

 epoch: 11 ---> elapsed time:  0.187000, quantization error: 1.298746

 epoch: 12 ---> elapsed time:  0.257000, quantization error: 1.287646

 epoch: 13 ---> elapsed time:  0.214000, quantization error: 1.276522

 epoch: 14 ---> elapsed time:  0.184000, quantization error: 1.265373

 epoch: 15 ---

In [15]:
model = som.codebook.matrix.reshape([map_size, map_size, log_vectors.shape[1]])

In [16]:
anomaly_scores = get_anomaly_score(log_vectors, parallelism=4, model=model)

In [17]:
[x for x in anomaly_scores if x > 0.6]

[]

In [18]:
threshold = 3*np.std(anomaly_scores) + np.mean(anomaly_scores)
threshold

1.005634134165455
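The threshold above is the mean of the training scores plus three standard deviations. A minimal sketch of that rule with the standard library follows; `three_sigma_threshold` is an illustrative name and the scores are made up. `statistics.pstdev` is used because `np.std` also computes the population standard deviation by default.

```python
import statistics


def three_sigma_threshold(scores):
    """Mean plus three population standard deviations of the scores."""
    return statistics.fmean(scores) + 3 * statistics.pstdev(scores)


# Made-up scores, only to show the arithmetic.
toy_scores = [0.1, 0.2, 0.3, 0.2, 0.2]
threshold_toy = three_sigma_threshold(toy_scores)
flagged = [s for s in toy_scores if s > threshold_toy]
print(threshold_toy, flagged)
```

In the run above no training score exceeds 0.6 while the threshold is about 1.006, so none of the training logs would be flagged under this rule.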

# With Custom Approach

### Define Functions

#### 1. Log Preprocessing

One assumption shared by all these functions is that we immediately convert our data into a pandas DataFrame with a "message" column containing the relevant information.

We then treat each individual log line as a set of words, cleaning it by removing all non-alphabetic characters.

We split on white space and everything else that is not a letter, and return a list of words.

In [19]:
def _preprocess_custom(data):
    for col in data.columns:
        if col == "message":
            data[col] = data[col].apply(_clean_message_custom)
        else:
            data[col] = data[col].apply(to_str_custom)

    data.fillna("EMPTY", inplace=True)  # fill in place so the caller's frame is updated
    
def _clean_message_custom(line):
    """Split a message into its alphabetic words, dropping every other character."""
    words = re.findall("[a-zA-Z]+", line)
    return words

def to_str_custom(x):
    """Convert all non-str lists to string lists for Word2Vec."""
    ret = " ".join([str(y) for y in x]) if isinstance(x, list) else str(x)
    return ret

#### Text encoding

In [20]:
def create_custom(logs, vector_length, window_size):
    """Create new word2vec model."""
    model = gs.models.Word2Vec(sentences=list(logs), size=vector_length, window=window_size)
    return model

def get_vectors(model, logs, vector_length):
    """Return logs as list of vectorized words"""
    vectors = []
    for x in logs:
        temp = []
        for word in x:
            if word in model.wv:
                temp.append(model.wv[word])
            else:
                temp.append(np.array([0]*vector_length))
        vectors.append(temp)
    return vectors

def _log_words_to_one_vector(log_words_vectors):
    """Average the word vectors of one log, coordinate by coordinate."""
    result = []
    log_array_transposed = np.array(log_words_vectors, dtype=object).transpose()
    for coord in log_array_transposed:
        result.append(np.mean(coord))
    return result

def vectorized_logs_to_single_vectors(vectors):
    """Represent log messages as vectors according to the vectors
    of the words in these logs

    :params vectors: list of log messages, represented as list of words vectors
            [[wordvec11, wordvec12], [wordvec21, wordvec22], ...]
    """
    result = []
    for log_words_vector in vectors:
        result.append(_log_words_to_one_vector(log_words_vector))
    return np.array(result)
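The averaging step can be illustrated without NumPy. The sketch below is a plain-Python equivalent of taking the per-coordinate mean of a log's word vectors (with NumPy, `np.mean(word_vectors, axis=0)` does the same for a non-empty list); `mean_pool` is an illustrative name and the vectors are made up.

```python
def mean_pool(word_vectors):
    """Average a list of equal-length word vectors coordinate by coordinate."""
    if not word_vectors:
        return []  # a message with no alphabetic words has nothing to average
    return [sum(coord) / len(word_vectors) for coord in zip(*word_vectors)]


# Two 3-dimensional "word vectors" (made-up values) collapse into one log vector.
print(mean_pool([[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]))  # [2.0, 3.0, 4.0]
```

One consequence of this design is that word order is lost: two logs containing the same words in a different order get the same vector.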

#### Prediction

In [21]:
def infer_custom(w2v, som, log, logs_list, threshold):
    
    log = pd.DataFrame({"message": log}, index=[1])
    _preprocess_custom(log)
    
    vector = []
    w2v = gs.models.Word2Vec([log.message.iloc[0]] + logs_list,
                             min_count=1, size=25, window=5)
    for word in log.message.iloc[0]:
        if word in w2v.wv.vocab.keys():
            vector.append(w2v.wv[word])
        else:
            vector.append(np.array([0]*25))
    
    one_vector = _log_words_to_one_vector(vector)
    
    score = get_anomaly_score([one_vector], 1, som)
    
    if score < threshold:
        return 0, score
    else:
        return 1, score

## Implementation

### Preprocessing

In [22]:
custom_preproc_data = data.copy()
_preprocess_custom(custom_preproc_data)

# First 5 preprocessed messages
pd.DataFrame(custom_preproc_data.message).head()

Unnamed: 0,message
0,"[Nov, nginx, access, Nov, GET, HTTP, Go, http,..."
1,"[Nov, nginx, access, Nov, GET, courses, HTTP, ..."
2,"[Nov, nginx, access, Nov, GET, HTTP, Go, http,..."
3,"[Nov, nginx, access, Nov, GET, HTTP, Go, http,..."
4,"[Nov, nginx, access, Nov, GET, HTTP, Go, http,..."


In [23]:
logs_list = list(custom_preproc_data.message)
logs_list[0]

['Nov', 'nginx', 'access', 'Nov', 'GET', 'HTTP', 'Go', 'http', 'client']

In [24]:
w2v_custom = create_custom(logs_list, vector_length=25, window_size=5)

consider setting layer size to a multiple of 4 for greater performance
collecting all words and their counts
PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
collected 141 word types from a corpus of 113834 raw words and 10000 sentences
Loading a fresh vocabulary
effective_min_count=5 retains 103 unique words (73% of original 141, drops 38)
effective_min_count=5 leaves 113753 word corpus (99% of original 113834, drops 81)
deleting the raw counts dictionary of 141 items
sample=0.001 downsamples 34 most-common words
downsampling leaves estimated 25758 word corpus (22.6% of prior 113753)
estimated required memory for 103 words and 25 dimensions: 72100 bytes
resetting layer weights
training model with 3 workers on 103 vocabulary and 25 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
worker thread finished; awaiting finish of 2 more threads
worker thread finished; awaiting finish of 1 more threads
worker thread finished; awaiting finish of 0 more threads
EPOCH - 

In [25]:
vectors_custom = get_vectors(model=w2v_custom, logs=logs_list, vector_length=25)

In [26]:
logs_as_vectors = vectorized_logs_to_single_vectors(vectors_custom)

### Train SOM

In [27]:
map_size = 24
som_custom = train(logs_as_vectors, map_size=map_size, iterations=0, parallelism=2)

 Training...
 random_initialization took: 0.002000 seconds
 Rough training...
 radius_ini: 8.000000 , radius_final: 1.333333, trainlen: 100
 epoch: 1 ---> elapsed time:  0.164000, quantization error: 6.353162
training dataset is of size 10000
 epoch: 2 ---> elapsed time:  0.188000, quantization error: 2.776452
 epoch: 3 ---> elapsed time:  0.187000, quantization error: 1.422516
 epoch: 4 ---> elapsed time:  0.184000, quantization error: 1.213082
 epoch: 5 ---> elapsed time:  0.186000, quantization error: 1.071425
 epoch: 6 ---> elapsed time:  0.213000, quantization error: 0.960143
 epoch: 7 ---> elapsed time:  0.164000, quantization error: 0.914878
 epoch: 8 ---> elapsed time:  0.194000, quantization error: 0.911211
 epoch: 9 ---> elapsed time:  0.176000, quantization error: 0.905501
 epoch: 10 ---> elapsed time:  0.183000, quantization error: 0.889863
 epoch: 11 ---> elapsed time:  0.192000, quantization error: 0.876456
 epoch: 12 ---> elapsed time:  0.195000, quantization error: 0.861839
 epoch: 13 ---> elapsed time:  0.188000, quantization error: 0.848856
 epoch: 14 ---> elapsed time:  0.179000, quantization error: 0.836003
 epoch: 15 ---> elapsed time:  0.182000, quantization error: 0.823030
 epoch: 16 ---

In [28]:
model_custom = som_custom.codebook.matrix.reshape([map_size, map_size, logs_as_vectors.shape[1]])
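
The reshape above turns the flat SOM codebook into a grid that can be indexed by node position. The sketch below illustrates this with a random stand-in array, assuming (as in sompy) that the flat codebook holds one row per node; `flat_codebook` and `grid` are illustrative names, not part of the notebook.

```python
import numpy as np

map_size = 24        # same grid size as in the cell above
vector_length = 25   # same embedding size as the word2vec model above

# Stand-in for som_custom.codebook.matrix: one row per SOM node.
# The values are random and serve only to illustrate the reshape.
rng = np.random.default_rng(0)
flat_codebook = rng.normal(size=(map_size * map_size, vector_length))

# Same operation as in the notebook cell: (576, 25) -> (24, 24, 25).
grid = flat_codebook.reshape([map_size, map_size, vector_length])

# NumPy reshapes in row-major order, so flat node k ends up at
# grid[k // map_size, k % map_size].
k = 100
row, col = divmod(k, map_size)
print(grid.shape)                                         # (24, 24, 25)
print(np.array_equal(grid[row, col], flat_codebook[k]))   # True
```

The reshape does not change any weights; it only changes how the nodes are addressed.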

In [29]:
anomaly_scores_custom = get_anomaly_score(logs_as_vectors, parallelism=4, model=model_custom)

In [30]:
[x for x in anomaly_scores_custom if x > 0.8]

[0.8456641901972225,
 0.8456641901972225,
 0.8456641901972225,
 0.8456641901972225,
 0.8027104025427731,
 0.8027104025427731,
 0.8027104025427731,
 0.8027104025427731,
 0.8456641901972225,
 0.8456641901972225]

In [43]:
threshold_custom = 3*np.std(anomaly_scores_custom) + np.mean(anomaly_scores_custom)
threshold_custom

0.8510280375408442
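
The threshold above is the mean of the training anomaly scores plus three standard deviations. Note that the largest score printed by the `x > 0.8` filter (0.8457) is still below this threshold (0.8510). The following is a minimal sketch of the same rule on synthetic scores; the helper name `three_sigma_threshold` and the synthetic data are assumptions made for illustration only.

```python
import numpy as np


def three_sigma_threshold(scores):
    """Return mean + 3 * std of the scores (same formula as the cell above)."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean() + 3 * scores.std()


# Synthetic scores: 1000 "normal" values around 0.3 plus one clear outlier.
rng = np.random.default_rng(42)
scores = np.concatenate([rng.normal(0.3, 0.05, size=1000), [0.9]])

threshold = three_sigma_threshold(scores)
flags = scores > threshold  # True where a score would count as anomalous

print(f"threshold = {threshold:.4f}")
print(f"outlier flagged: {bool(flags[-1])}")
```

Because the outlier itself inflates the standard deviation, a handful of extreme training scores can raise the threshold enough to hide moderately unusual messages.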

# Prediction

In [44]:
import logging

# Silence library log output so that only the inference results are shown.
logger = logging.getLogger()
logger.disabled = True
logging.disable()

**This is a test message**

In [64]:
infer(w2v, model, "Blah blah blah blah blah", preprocessed_data, threshold)

(0, [0.12197518870150093])

In [65]:
infer_custom(w2v_custom, model_custom, "Blah blah blah blah blah", logs_list, threshold_custom)

(0, [0.5231461429748809])

**<182>Dec 07 14:16:16 dataform  172.17.17.100 - - [07/Dec/2021:14:16:16 +0300] "POST /cgi-bin/.%2e/.%2e/.%2e/.%2e/etc/passwd HTTP/1.1" 400 5604 "-" "curl/7.68.0"**

In [62]:
infer(w2v, model, "<182>Dec 07 14:16:16 www.solidex.by  172.17.17.100 - - [07/Dec/2021:14:16:16 +0300] \"POST /cgi-bin/.%2e/.%2e/.%2e/.%2e/etc/passwd HTTP/1.1\" 400 5604 \"-\" \"curl/7.68.0\"", preprocessed_data, threshold)

(0, [0.11154981364453675])

In [63]:
infer_custom(w2v_custom, model_custom, "<182>Dec 07 14:16:16 www.solidex.by  172.17.17.100 - - [07/Dec/2021:14:16:16 +0300] \"POST /cgi-bin/.%2e/.%2e/.%2e/.%2e/etc/passwd HTTP/1.1\" 400 5604 \"-\" \"curl/7.68.0\"", logs_list, threshold_custom)

(0, [0.3961406046864743])

**<179>Dec 07 14:51:17 dataform  [Wed Dec 07 14:51:17.120946 2021] [auth_basic:error] [pid 2179734:tid 139663791892224] [client 172.17.17.100:44992] AH01617: user webadmin: authentication failure for "/register": Password Mismatch**

In [60]:
infer(w2v, model, "<179>Dec 07 14:51:17 www.solidex.by  [Wed Dec 07 14:51:17.120946 2021] [auth_basic:error] [pid 2179734:tid 139663791892224] [client 172.17.17.100:44992] AH01617: user webadmin: authentication failure for \"/register\": Password Mismatch", preprocessed_data, threshold)

(0, [0.10764399846779404])

In [61]:
infer_custom(w2v_custom, model_custom, "<179>Dec 07 14:51:17 www.solidex.by  [Wed Dec 07 14:51:17.120946 2021] [auth_basic:error] [pid 2179734:tid 139663791892224] [client 172.17.17.100:44992] AH01617: user webadmin: authentication failure for \"/register\": Password Mismatch", logs_list, threshold_custom)

(0, [0.7150129409415584])

**<179>Dec 07 12:10:53 smtplib.SMTPRecipientsRefused: {'doesntexist@solidex.by': (550, b'5.1.1 <doesntexist@solidex.by>: Recipient address rejected: User unknown in virtual mailbox table')}**

In [81]:
infer(w2v, model, "<179>Dec 07 12:10:53 smtplib.SMTPRecipientsRefused: {'doesntexist@solidex.by': (550, b'5.1.1 <doesntexist@solidex.by>: Recipient address rejected: User unknown in virtual mailbox table')}", preprocessed_data, threshold)

(0, [0.12365093363587643])

In [82]:
infer_custom(w2v_custom, model_custom, "<179>Dec 07 12:10:53 smtplibb.SMTPRecipientsRefused: {'doesntexist@solidex.by': (550, b'5.1.1 <doesntexist@solidex.by>: Recipient address rejected: User unknown in virtual mailbox table')}", logs_list, threshold_custom)

(0, [0.5059701485747206])

In [85]:
infer(w2v, model, "<158>Dec 07 14:24:59 195-137-160-145 nginx-access 2021/11/25 14:24:56 [error] 492#492: *15416147 open() \"/var/www/solidex.by/public_html/robots.txt\" failed (2: No such file or directory), client: 216.244.66.231, server: , request: \"GET /robots.txt HTTP/1.1\", host: \"solidex.by\"", preprocessed_data, threshold)

(0, [0.11994083210707229])

In [86]:
infer_custom(w2v_custom, model_custom, "<158>Dec 07 14:24:59 195-137-160-145 nginx-access 2021/11/25 14:24:56 [error] 492#492: *15416147 open() \"/var/www/solidex.by/public_html/robots.txt\" failed (2: No such file or directory), client: 216.244.66.231, server: , request: \"GET /robots.txt HTTP/1.1\", host: \"solidex.by\"", logs_list, threshold_custom)

(0, [0.35218559029002194])
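
To compare the two approaches side by side, the sketch below collects the scores printed in the cells above and relates the custom scores to `threshold_custom`. The short labels are shorthand introduced here, and the first element of each returned tuple is assumed to be the anomaly flag.

```python
# Value printed for threshold_custom earlier in the notebook.
threshold_custom = 0.8510280375408442

# (label, score from infer, score from infer_custom), copied from the outputs above.
results = [
    ("test message",           0.12197518870150093, 0.5231461429748809),
    ("path traversal POST",    0.11154981364453675, 0.3961406046864743),
    ("auth_basic failure",     0.10764399846779404, 0.7150129409415584),
    ("SMTPRecipientsRefused",  0.12365093363587643, 0.5059701485747206),
    ("nginx robots.txt error", 0.11994083210707229, 0.35218559029002194),
]

print(f"{'message':<24}{'original':>10}{'custom':>10}{'custom/threshold':>18}")
for label, original, custom in results:
    ratio = custom / threshold_custom
    print(f"{label:<24}{original:>10.4f}{custom:>10.4f}{ratio:>18.2f}")

# Messages whose custom score exceeds the custom threshold.
flagged_custom = [label for label, _, custom in results if custom > threshold_custom]
# Message with the highest custom score.
closest = max(results, key=lambda r: r[2])

print("flagged by custom approach:", flagged_custom)
print("highest custom score:", closest[0])
```

None of the five messages exceeds `threshold_custom`, which matches the `0` returned in every cell. The original approach scores all messages within a narrow band (about 0.108 to 0.124), while the custom approach spreads them from about 0.352 to 0.715, with the authentication failure scoring highest.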