# Detect CAN Anomalies with RNN + GRU

## How To

Use an RNN with GRU cells, so that the model keeps a state while still training reasonably well. Use logs as examples of non-anomalous sequences.
As input, use embeddings of CAN IDs, with the vocabulary extended to all 2048 possible values.
As output, use a probability distribution over CAN IDs, trained against one-hot encoded targets.

To detect anomalies, check the predicted probability `p` of the CAN ID of the next frame. Use `1 - p` as the probability that the frame is an anomaly.

Sum these anomaly probabilities over the latest n frames of the sequence; if the sum is bigger than a certain threshold, mark the sequence as anomalous.
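
The scoring rule above can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the function names, the window size `n=8` and the threshold `4.0` are all placeholder assumptions, and `p_next` is assumed to hold, for each frame, the probability the model assigned to the CAN ID that actually arrived.

```python
import numpy as np

def window_anomaly_scores(p_next, n=8):
    """For each frame, sum (1 - p) over the latest n frames (fewer at the start)."""
    frame_scores = 1.0 - np.asarray(p_next, dtype=float)
    # Rolling sum through a cumulative sum with a leading zero.
    csum = np.concatenate(([0.0], np.cumsum(frame_scores)))
    end = np.arange(1, len(frame_scores) + 1)
    start = np.maximum(end - n, 0)
    return csum[end] - csum[start]

def is_anomalous(p_next, n=8, threshold=4.0):
    """Mark the sequence as anomalous if any window score exceeds the threshold."""
    return bool(np.any(window_anomaly_scores(p_next, n) > threshold))

# Hypothetical probabilities: a run of unexpected frames in the middle.
normal = [0.9] * 20
attack = [0.9] * 10 + [0.05] * 6 + [0.9] * 4
print(is_anomalous(normal), is_anomalous(attack))
```

The threshold and `n` interact: a larger window accumulates more baseline error from normal traffic, so the threshold has to be chosen per window size on validation data.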

Possible tweaks and questions:
- Use only the CAN IDs found in logs. It will speed up learning. But how do we extend the model if we need to add new CAN IDs?
- Use squared probabilities `(1 - p)^2`. Just check it. Also check `1/p` and `1/p^2`.
- Size of sequences. It can vary, e.g. 32-128. Bigger should be more accurate. Smaller is quicker (certainly quicker to predict; not 100% sure the same holds for training). Should measure that.
- Epsilon parameter, which will indicate that certain places in logs are anomalous.
- Do we need examples of anomalous logs in the training set? Does it increase accuracy?
- Do we need embeddings for input, or can we just one-hot encode the IDs? Does it affect accuracy?
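
To compare the per-frame score variants from the list above, a small helper like the one below could be used. It is only a sketch: the name `frame_score`, the `kind` labels and the probability floor `P_FLOOR` are assumptions introduced here (the floor only guards the `1/p` variants against division by zero and is not the epsilon parameter mentioned in the list).

```python
import numpy as np

P_FLOOR = 1e-6  # assumed lower bound on p, only to keep 1/p finite

def frame_score(p, kind="linear"):
    """Per-frame anomaly score for a predicted probability p."""
    p = np.clip(np.asarray(p, dtype=float), P_FLOOR, 1.0)
    if kind == "linear":
        return 1.0 - p            # 1 - p
    if kind == "squared":
        return (1.0 - p) ** 2     # (1 - p)^2
    if kind == "inverse":
        return 1.0 / p            # 1/p
    if kind == "inverse_squared":
        return 1.0 / p ** 2       # 1/p^2
    raise ValueError("unknown score kind: " + kind)

for kind in ("linear", "squared", "inverse", "inverse_squared"):
    print(kind, frame_score([0.9, 0.5, 0.01], kind))
```

Note that `1 - p` and `(1 - p)^2` stay within `[0, 1]`, while `1/p` and `1/p^2` start at 1 and are unbounded, so a threshold tuned for one variant does not carry over to another.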

### Additional benefits
- We can compare distances between the embedding vectors of CAN IDs and find similar ones.
- We can find the most common and the least common groups of IDs (sequences, but without a fixed starting point).
- We can find the most common and the least common IDs (should be calculated for every ID if we use embeddings for inputs).
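
A rough sketch of the first and last points above, under stated assumptions: `embeddings` is a `(vocab_size, embed_size)` array (for a trained Keras model it could presumably be read with `model.layers[0].get_weights()[0]`), cosine similarity is used as the distance measure (a choice made here, not something the notes prescribe), and both function names are hypothetical.

```python
import numpy as np
from collections import Counter

def nearest_ids(embeddings, can_id, k=3):
    """Return the k CAN IDs whose embeddings are most similar (cosine) to can_id's."""
    emb = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.maximum(norms, 1e-12)   # avoid dividing by zero for all-zero rows
    sims = unit.dot(unit[can_id])
    sims[can_id] = -np.inf                  # exclude the ID itself
    return [int(i) for i in np.argsort(-sims)[:k]]

def id_frequencies(seqs):
    """Count how often each CAN ID occurs across all sequences."""
    return Counter(i for seq in seqs for i in seq)

# Toy data, purely illustrative.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
print(nearest_ids(toy_embeddings, 0, k=2))
print(id_frequencies([[1, 2, 2], [2, 3]]).most_common(1))
```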


### Prepare training, validation and test sets
- Training set
Take log examples, normal logs only. Take random parts, including the beginnings and ends of log files. Cut them into sequences and use them for training.
- Validation and test sets
Take log examples as described under **Training set**, but add examples of abnormal sequences too.
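
The cutting step could look roughly like this. It is a sketch under assumptions: `ids` is the list of CAN IDs of one log file, the helper name `cut_windows` and its parameters are made up for illustration, and windows may overlap.

```python
import random

def cut_windows(ids, seq_len, n_samples, rng):
    """Cut n_samples windows of seq_len IDs from one log.

    The first and the last window of the log are always included;
    the remaining ones start at random positions.
    """
    if len(ids) < seq_len:
        return []                      # log too short for a single window
    last = len(ids) - seq_len
    starts = [0, last] + [rng.randint(0, last) for _ in range(max(n_samples - 2, 0))]
    return [ids[s:s + seq_len] for s in starts]

rng = random.Random(0)                 # fixed seed so splits are reproducible
windows = cut_windows(list(range(100)), seq_len=10, n_samples=5, rng=rng)
print(len(windows), windows[0], windows[1])
```

For the validation and test sets, the same helper could be applied to abnormal logs as well, keeping a label per window so that detection accuracy can be measured.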

## Libraries and util methods

In [4]:
## Libs
import os
import zipfile
import re

In [2]:
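# TODO: change to read only random lines
def read_file(file):
    return file.read().splitlines()

def get_can_id(line):
    # Extract the integer that follows '"id":' in a log line.
    id_text = '"id":'
    before = line.find(id_text)
    if before >= 0:  # find() returns -1 when the marker is missing
        # skip the marker and the single space assumed to follow it
        before += len(id_text) + 1
        after = line[before:].find(',')
        if after > 0:
            found = line[before:before+after]
            return int(found)
    # an empty string marks a line without a CAN ID
    return ''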

## Prepare Datasets

In [3]:
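data_dir = os.path.join(os.getcwd(), 'data', 'normal')
# listdir lives in the os module, which is the only thing imported above
zips = [zipfile.ZipFile(os.path.join(data_dir, f)) for f in os.listdir(data_dir) if f.endswith('.zip')]
log_files = [z.open(l.filename, 'r') for z in zips for l in z.filelist if l.filename.endswith('.log')]
# z.open() yields bytes; str() gives their "b'...'" repr, which still contains the '"id":' marker
seqs = [[get_can_id(str(l)) for l in read_file(f)] for f in log_files]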

print('Total sequences: ' + str(len(seqs)))
print(seqs[0][:10])

NameError: name 'os' is not defined

## Model

In [3]:
# __future__ imports belong before any other statement
from __future__ import division, print_function
%matplotlib inline
from importlib import reload
import utils; reload(utils)
from utils import *

Using Theano backend.


#### Model parameters

In [6]:
# size of hidden state
n_hidden = 256

# sequence length
seq_len = 64

# length of CAN IDs vocabulary
vocab_size = 2048

# size of embeddings
embed_size = 50

In [8]:
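model=Sequential([
        # integer CAN IDs in, embed_size-dimensional vectors out;
        # the old batch_input_shape=(64,8) contradicted seq_len and the batch size used in fit()
        Embedding(vocab_size, embed_size, input_length=seq_len),
        BatchNormalization(),
        # the input shape is inferred from the previous layer
        GRU(n_hidden, return_sequences=True,
                  activation='relu', inner_init='identity'),
        # one softmax over all CAN IDs per time step
        TimeDistributed(Dense(vocab_size, activation='softmax')),
    ])
model.compile(loss='categorical_crossentropy', optimizer=Adam())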

#### Train and predict

In [9]:
model.fit(oh_x_rnn, oh_y_rnn, batch_size=32, nb_epoch=8)

NameError: name 'oh_x_rnn' is not defined
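
The `NameError` above arises because `oh_x_rnn` and `oh_y_rnn` are never built in this notebook. One possible way to build such arrays is sketched below; it is an assumption, not the original code. Since the model starts with an `Embedding` layer, the inputs are plain integer IDs, and only the targets are one-hot encoded. The helper name `make_xy` is hypothetical, and every window is assumed to contain `seq_len + 1` valid integer IDs.

```python
import numpy as np

def make_xy(windows, vocab_size):
    """Build next-ID prediction arrays from equal-length windows of CAN IDs.

    x: integer IDs, shape (n, seq_len), suitable for an Embedding layer.
    y: one-hot next IDs, shape (n, seq_len, vocab_size).
    """
    arr = np.asarray(windows, dtype=np.int64)
    x = arr[:, :-1]                       # all frames except the last
    targets = arr[:, 1:]                  # the same frames shifted by one
    y = np.zeros(targets.shape + (vocab_size,), dtype=np.float32)
    rows, cols = np.indices(targets.shape)
    y[rows, cols, targets] = 1.0          # set the target ID's slot to 1
    return x, y

# Toy example with a vocabulary of 4 IDs and seq_len = 2.
x, y = make_xy([[1, 2, 3], [3, 0, 1]], vocab_size=4)
print(x.shape, y.shape)
```

With `vocab_size = 2048` and `seq_len = 64`, each one-hot target takes 64 × 2048 float32 values (512 KiB per sample), so memory grows quickly; training against integer targets with a sparse cross-entropy loss may be worth checking as an alternative.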