# Single-tier language model example

This tutorial demonstrates how to use safekit's recurrent neural network language model to perform event-level anomaly detection. Unlike the aggregate autoencoder and its baselines, the language model can detect anomalous behavior at the level of individual events. It does so by learning the syntax of log lines and the semantic relationships between the individual fields in a log line. This allows the model to estimate not only the likelihood of a network event, but also the likelihood of individual features appearing at given positions in the log-line representation of that event.

In [1]:
import tensorflow as tf
import numpy as np
import json
import sys
import os

from safekit.batch import OnlineBatcher
from safekit.graph_training_utils import ModelRunner, EarlyStop
from safekit.tf_ops import lm_rnn
from safekit.util import get_mask, Parser

tf.set_random_seed(408)
np.random.seed(408)

First, we'll define some hyperparameters for our model—these will be explained in greater detail as we go.

In [2]:
layer_list = [10]
lr = 1e-3
embed_size = 20
mb_size = 64

maxbadcount = 10

Next, we load the JSON file describing the specifications for the data.

This JSON file contains a dictionary specifying the number of features in the input data; the categories corresponding to the features; whether each category is metadata, input, or output; and the indices that map these categories to specific features. This dictionary can later be used to ease interaction with the data when providing it as input to TensorFlow.

`sentence_length` specifies a fixed sequence length over which our model will perform backpropagation through time, and `token_set_size` specifies the size of the vocabulary from which all of the sequences are drawn. The former defines the shape of the placeholders used for the features and targets, while the latter defines the shape of the embedding matrix that maps our categorical features to embedded representations. Because the targets are the inputs shifted by one token, the placeholders are one token shorter than the `sentence_length` stored in the specifications.

In [3]:
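# Use a context manager so the specification file is closed after loading.
with open('../safekit/features/specs/lm/lanl_word_config.json', 'r') as spec_file:
    dataspecs = json.load(spec_file)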
sentence_length = dataspecs['sentence_length'] - 1
token_set_size = dataspecs['token_set_size']

x = tf.placeholder(tf.int32, [None, sentence_length])
t = tf.placeholder(tf.int32, [None, sentence_length])
ph_dict = {'x': x, 't': t}

token_embed = tf.Variable(tf.truncated_normal([token_set_size, embed_size]))
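
To make the role of these shapes concrete, the following sketch mimics an embedding lookup in plain Python. Nested lists stand in for tensors, and the sizes and the `embed_lookup` helper are hypothetical, chosen only for illustration; in the tutorial itself the lookup happens inside the TensorFlow graph built from `token_embed`.

```python
import random

random.seed(408)

# Hypothetical sizes, chosen only for illustration.
token_set_size = 6    # vocabulary size
embed_size = 3        # width of each embedding vector
sentence_length = 4   # tokens per input sequence

# One row of embed_size values per token id, analogous to token_embed above.
token_embed = [[random.gauss(0.0, 1.0) for _ in range(embed_size)]
               for _ in range(token_set_size)]


def embed_lookup(embedding, batch):
    """Replace every integer token id with its embedding row (hypothetical helper)."""
    return [[embedding[token] for token in sentence] for sentence in batch]


# A minibatch of two sequences of token ids, shaped [2, sentence_length].
x_batch = [[0, 2, 5, 1],
           [3, 3, 4, 0]]

embedded = embed_lookup(token_embed, x_batch)

# The result is shaped [batch, sentence_length, embed_size].
print(len(embedded), len(embedded[0]), len(embedded[0][0]))
```

Every token id must be smaller than `token_set_size`, which is why the vocabulary size fixes the first dimension of the embedding matrix.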

Now we define the recurrent neural network proper. A call to `lm_rnn` will instantiate all of the graph operations comprising our RNN and return a tuple of tensors: `token_losses`, the token-wise losses over each input sequence; `h_states`, a sentence-length tensor composed of the hidden states at each time step; and `final_h`, the hidden state at the last time step. For this call, we pass our input and output placeholders as well as our embedding matrix. We also provide a list of hidden layer sizes, which determines the dimensionality of the hidden states at each time step; specifying more than one layer size yields a stacked RNN architecture. The resulting model is a single-tier RNN using Long Short-Term Memory cells with a hidden dimensionality of 10.

Finally, we define our losses over individual lines and over all lines by first averaging the token-wise losses within each line, then averaging these line losses over an entire batch.

In [4]:
token_losses, h_states, final_h = lm_rnn(x, t, token_embed, layer_list)

line_losses = tf.reduce_mean(token_losses, axis=1)
avg_loss = tf.reduce_mean(line_losses)

Instructions for updating:
This class is deprecated, please use tf.nn.rnn_cell.LSTMCell, which supports all the feature this cell currently has. Please replace the existing code with tf.nn.rnn_cell.LSTMCell(name='basic_lstm_cell').
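
The two-stage averaging used for `line_losses` and `avg_loss` can be sketched without TensorFlow. The loss values below are hypothetical, and plain lists stand in for tensors; the sketch only illustrates which axis each reduction collapses.

```python
# Hypothetical token-wise losses for a minibatch of 2 lines with 3 tokens each.
token_losses = [[0.5, 1.5, 1.0],
                [4.0, 2.0, 6.0]]


def mean(values):
    return sum(values) / len(values)


# Average over the token axis (axis=1): one loss per log line.
line_losses = [mean(row) for row in token_losses]

# Average over the batch axis: the scalar passed to the optimizer.
avg_loss = mean(line_losses)

print(line_losses)  # [1.0, 4.0]
print(avg_loss)     # 2.5
```

Keeping `line_losses` separate from `avg_loss` is what lets a loss be attributed to each individual event rather than only to the batch as a whole.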


To map losses back to our input features easily, we'll next define a function that we can call during the training loop that will write metadata and losses for each data point in the current minibatch.

In [5]:
outfile = open('results', 'w')
outfile.write("batch line second day user red loss\n")

def write_results(data_dict, loss, outfile, batch):
    for n, s, d, u, r, l in zip(data_dict['line'].flatten().tolist(),
                                data_dict['second'].flatten().tolist(),
                                data_dict['day'].flatten().tolist(),
                                data_dict['user'].flatten().tolist(),
                                data_dict['red'].flatten().tolist(),
                                loss.flatten().tolist()):
        outfile.write('%s %s %s %s %s %s %r\n' % (batch, int(n), int(s), int(d), int(u), int(r), l))
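
Once the results file exists, its rows can be read back and ranked by loss. The sketch below assumes the space-delimited format written by `write_results`; the sample rows and the `load_results` helper are hypothetical, and ranking events by loss is one plausible way to use the output rather than a step taken in this tutorial.

```python
import io

# Hypothetical contents of a results file in the format written above.
sample = io.StringIO(
    "batch line second day user red loss\n"
    "0 17 1 0 101 0 2.25\n"
    "0 18 1 0 10 0 7.5\n"
    "1 90 4 0 1137 1 11.0\n"
)


def load_results(handle):
    """Parse space-delimited rows into dicts keyed by the header names."""
    header = handle.readline().split()
    rows = []
    for line in handle:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        for key in header[:-1]:
            row[key] = int(row[key])      # metadata columns are integers
        row['loss'] = float(row['loss'])  # the last column is the line loss
        rows.append(row)
    return rows


rows = load_results(sample)

# Highest loss first: the events the model found least predictable.
ranked = sorted(rows, key=lambda row: row['loss'], reverse=True)
for row in ranked:
    print(row['line'], row['user'], row['red'], row['loss'])
```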

Now we instantiate a `ModelRunner` object, which provides a simple interface for interacting with the TensorFlow session. Instantiating this object will define the optimizer TensorFlow will use for gradient descent and initialize all of the variables in the TensorFlow graph. We can then use the `train_step` method on this object to perform an optimization step or the `eval` method to retrieve the values of arbitrary tensors in the graph.

In order to record the losses for all of the features, we define a list `eval_tensors` that contains tensors whose values we want to retrieve during training. We'll provide this list to the `ModelRunner`'s `eval` method during the training loop to compute these tensors, then record their values with the `write_results` function defined previously.

In [6]:
model = ModelRunner(avg_loss, ph_dict, learnrate=lr)

eval_tensors = [avg_loss, line_losses]

For our experiments, we want to first train our model on a single day of user activity, evaluate the model's performance on the next day, then repeat this process for each day in the data. To ease this process, we'll define a function that will either train or evaluate our model over a single day of events.

We first instantiate a batcher to divide the data into smaller portions. Since each day may contain a large number of events, we want to provide it to the model in small batches to avoid filling memory. Adjusting the minibatch size may also improve the model's performance. Here, we'll use a batch size of 64 data points, defined above as `mb_size`.

We then define a stopping criterion for training using the `EarlyStop` object; if our model's performance doesn't improve after 10 training steps (defined above as `maxbadcount`), the `check_error` function we instantiate will return `False`, and training will be discontinued.
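
The idea behind such a stopping check can be sketched as a small callable class. This is a hypothetical stand-in, not safekit's `EarlyStop`: the name `PatienceStop`, the choice to count consecutive non-improving steps against the best loss seen so far, and the exact return values are all assumptions made for illustration.

```python
import math


class PatienceStop:
    """Hypothetical early-stopping check (not safekit's implementation)."""

    def __init__(self, badlimit):
        self.badlimit = badlimit        # tolerated consecutive non-improving steps
        self.badcount = 0
        self.best_loss = float('inf')

    def __call__(self, batch, loss):
        if batch is None:               # the batcher ran out of data
            return False
        if math.isnan(loss) or math.isinf(loss):
            return -1                   # the loss has diverged
        if loss < self.best_loss:
            self.best_loss = loss       # improvement: reset the counter
            self.badcount = 0
        else:
            self.badcount += 1
        return self.badcount < self.badlimit


check = PatienceStop(badlimit=3)
steps = 0
for loss in [5.0, 4.0, 4.5, 4.2, 4.1, 3.9]:
    steps += 1
    if not check('batch', loss):
        break

print(steps)  # 5: three steps in a row failed to beat the best loss of 4.0
```

Returning a negative number for divergence, rather than `False`, lets the caller tell a normal stop apart from a failed run.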

To prepare data for training or evaluation, we manipulate raw batches from our batcher to construct a dictionary for TensorFlow that maps features to the placeholders used to feed data into the computational graph. We map the metadata features to their respective dictionary fields, define the upper range of our inputs and outputs with the `endx` and `endt` variables, then use these to select the appropriate features in the raw batch as our input and output.
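
The slicing described here can be illustrated on a tiny batch. The row contents and sizes below are hypothetical, and nested lists stand in for the NumPy arrays returned by the batcher; only the column offsets (five metadata columns, then tokens) follow the layout used in this tutorial.

```python
# Hypothetical layout: 5 metadata columns (line, second, day, user, red)
# followed by 4 token ids per event.
num_meta = 5

raw_batch = [
    [0, 1, 0, 101, 0, 7, 3, 9, 2],
    [1, 1, 0, 10, 0, 7, 4, 8, 2],
]

width = len(raw_batch[0])
endx = width - 1   # inputs stop one token before the end of the row
endt = width       # targets run to the end of the row

# Inputs are tokens 0..n-2; targets are the same tokens shifted by one.
x = [row[num_meta:endx] for row in raw_batch]
t = [row[num_meta + 1:endt] for row in raw_batch]

print(x[0])  # [7, 3, 9]
print(t[0])  # [3, 9, 2]

# Both slices are one token shorter than the stored sequence, and this
# width must equal the second dimension of the x and t placeholders.
print(len(x[0]), len(t[0]))  # 3 3
```

If the number of token columns in the data files does not match the `sentence_length` in the specifications, the fed arrays and the placeholders will disagree in their second dimension.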

During training, we retrieve the losses for the current batch and perform one step of gradient descent over that batch. This process repeats until the batcher reaches the end of the input file, the stopping criterion is met, or the model's error diverges to infinity. During evaluation, we only retrieve the losses, then write them to our results file using `write_results`.

In [7]:
def trainday(is_training, f):
    batch_num = 0
    data = OnlineBatcher('/home/wxh/AnomalyDetectionModels/safekit-master/data_examples/lanl/lm_feats/word_day_split/' + f, mb_size, delimiter=' ')
    raw_batch = data.next_batch()
    cur_loss = sys.float_info.max
    check_error = EarlyStop(maxbadcount)
    # Inputs stop one token before the end of each row; targets run to the end.
    endx = raw_batch.shape[1] - 1
    endt = raw_batch.shape[1]
    training = check_error(raw_batch, cur_loss)
    while training:
        # Columns 0-4 hold metadata; the remaining columns hold tokens.
        # The targets are the inputs shifted by one position.
        data_dict = {'line': raw_batch[:, 0], 'second': raw_batch[:, 1],
                     'day': raw_batch[:, 2], 'user': raw_batch[:, 3],
                     'red': raw_batch[:, 4], 'x': raw_batch[:, 5:endx],
                     't': raw_batch[:, 6:endt]}

        # The update flag controls whether an optimization step is taken.
        _, cur_loss, pointloss = model.train_step(data_dict, eval_tensors, update=is_training)
        if not is_training:
            write_results(data_dict, pointloss, outfile, batch_num)
        batch_num += 1

        print('%s %s %s %s %s %s %r' % (raw_batch.shape[0], data_dict['line'][0],
                                        data_dict['second'][0], ('fixed', 'update')[is_training],
                                        f, data.index, cur_loss))

        raw_batch = data.next_batch()
        training = check_error(raw_batch, cur_loss)
        # A negative value from check_error aborts the whole run.
        if training < 0:
            sys.exit(0)

For brevity, we will train and evaluate our model on a small subset of our data. To train and evaluate over the entire data set, uncomment the lines following the current definition of `files`.

Notice that if we use the entire data set, we reference a field in our data specifications called `weekend_days`. Our configuration files specify a list of days in the data set that fall on weekends. We exclude these days from training because they represent patterns of user activity that may not match the distribution of user activity found on weekdays. To include these events in the analysis without affecting accuracy, a separate model can be trained on them.

In [8]:
files = dataspecs['test_files']

# weekend_days = dataspecs['weekend_days']
# files = [str(i) + '.txt' for i in range(dataspecs["num_days"]) if i not in weekend_days]
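
As a quick illustration of the commented-out filter, the sketch below applies the same list comprehension to hypothetical values standing in for `dataspecs["num_days"]` and `dataspecs["weekend_days"]`; the real values come from the JSON configuration.

```python
# Hypothetical values; the real ones are read from the JSON specification.
num_days = 10
weekend_days = [2, 3, 9]

# Keep one file per day, skipping every day listed as a weekend.
files = [str(i) + '.txt' for i in range(num_days) if i not in weekend_days]

print(files)
# ['0.txt', '1.txt', '4.txt', '5.txt', '6.txt', '7.txt', '8.txt']
```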

Finally, we enter the training loop, which simply consists of two successive calls to `trainday`. The first call trains the model on the current day, and the second call evaluates the model on the following day.

In [9]:
for idx, f in enumerate(files[:-1]):
    trainday(True, f)
    trainday(False, files[idx + 1])
outfile.close()

ValueError: Cannot feed value of shape (64, 11) for Tensor u'Placeholder:0', which has shape '(?, 10)'