# Detect CAN Anomalies with RNN + GRU

## How To

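Use a recurrent neural network (RNN) with GRU cells, so that the model keeps a state across frames and still trains reasonably well. Use logs as examples of non-anomalous sequences.
As input, use embeddings of all CAN IDs, extended to cover all 2048 possible values.
As output, use a probability distribution over the 2048 IDs (a softmax, trained against one-hot encoded targets).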
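
A minimal sketch of the one-hot side of this encoding, assuming each frame has already been reduced to its integer CAN ID. The helper `one_hot_ids` and the sample IDs are hypothetical. It shows the one-hot form that the model's `input_shape=(seq_len, vocab_size)` expects; an embedding layer would take the integer IDs directly instead.

```python
import numpy as np

VOCAB_SIZE = 2048  # one slot per possible CAN ID value


def one_hot_ids(ids, vocab_size=VOCAB_SIZE):
    """Turn a sequence of integer CAN IDs into a (len(ids), vocab_size) one-hot matrix."""
    ids = np.asarray(ids, dtype=np.int64)
    encoded = np.zeros((len(ids), vocab_size), dtype=np.float32)
    encoded[np.arange(len(ids)), ids] = 1.0
    return encoded


# Hypothetical sample: IDs of five consecutive frames from a log.
frame_ids = [0x101, 0x1A0, 0x101, 0x2F0, 0x1A0]

# Input is every frame except the last; the target is the frame that follows it.
x = one_hot_ids(frame_ids[:-1])
y = one_hot_ids(frame_ids[1:])
print(x.shape, y.shape)
```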

To detect anomalies, check the probability `p` that the model predicted for the CAN ID of the frame that actually came next. Use `1 - p` as the probability that the frame is an anomaly.

Sum these anomaly probabilities over the latest `n` frames of the sequence; if the sum exceeds a chosen threshold, mark the sequence as anomalous.
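
The scoring rule can be sketched as follows, assuming the model's softmax outputs are available as an array with one row per frame. The function names, the toy 4-ID vocabulary, and the threshold value are illustrative assumptions, not measured settings.

```python
import numpy as np


def frame_anomaly_scores(probs, next_ids):
    """Return 1 - p per frame, where p is the probability predicted for the ID that actually came next.

    probs: (n_frames, vocab_size) softmax outputs; next_ids: (n_frames,) observed IDs.
    """
    probs = np.asarray(probs, dtype=np.float64)
    next_ids = np.asarray(next_ids, dtype=np.int64)
    p = probs[np.arange(len(next_ids)), next_ids]
    return 1.0 - p


def is_anomalous(scores, n, threshold):
    """Flag the sequence when the summed score of its latest n frames exceeds the threshold."""
    return float(np.sum(scores[-n:])) > threshold


# Toy example with a 4-ID vocabulary (hypothetical numbers, not model output).
probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.10, 0.80, 0.05, 0.05],
    [0.90, 0.05, 0.03, 0.02],
]
next_ids = [0, 1, 3]  # the last observed ID was given only p = 0.02
scores = frame_anomaly_scores(probs, next_ids)
print(scores, is_anomalous(scores, n=2, threshold=1.0))
```

The threshold depends on `n`, since the sum of `n` scores ranges from 0 to `n`, so the two have to be tuned together.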

Possible tweaks and open questions:
- Use only the CAN IDs found in the logs. This will speed up learning, but how do we extend the model if new CAN IDs need to be added?
- Use squared probabilities, `(1 - p)^2`, and see how they behave. Also check `1/p` and `1/p^2`.
- Sequence length. It can vary, for example 32-128. Longer should be more accurate; shorter is faster (certainly for prediction, not necessarily for training). This should be measured.
- An epsilon parameter that indicates that certain places in the logs are anomalous.
- Do we need examples of anomalous logs in the training set? Does that increase accuracy?
- Do we need embeddings for the input, or is one-hot encoding enough? Does it affect accuracy?
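
To compare the candidate score functions from the list above side by side, a small sketch like this may help. The floor `P_FLOOR` is an added assumption that keeps `1/p` finite when `p` is close to zero; it is not the epsilon parameter mentioned in the list.

```python
P_FLOOR = 1e-6  # assumed lower bound on p, only to avoid division by zero


def score_variants(p, p_floor=P_FLOOR):
    """Return each candidate per-frame score for a predicted probability p."""
    p = max(p, p_floor)
    return {
        "1-p": 1.0 - p,
        "(1-p)^2": (1.0 - p) ** 2,
        "1/p": 1.0 / p,
        "1/p^2": 1.0 / p ** 2,
    }


for p in (0.9, 0.5, 0.01):
    print(p, score_variants(p))
```

`1 - p` and `(1 - p)^2` stay within `[0, 1]`, while `1/p` and `1/p^2` are unbounded and weight very unlikely frames much more heavily, so each variant needs its own threshold.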

### Additional benefits
- We can compare distances between the embedding vectors of CAN IDs and find similar ones.
- We can find the most common and the least common IDs.
- We can find the most common and the least common groups of IDs (sequences without a fixed starting point).
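
A rough sketch of these by-products, using only the standard library. The log excerpt and the 3-dimensional embeddings are made up for illustration, and cosine similarity is just one possible way to compare embedding vectors (Euclidean distance would be another).

```python
from collections import Counter
import math


def id_counts(ids):
    """Count how often each CAN ID occurs."""
    return Counter(ids)


def ngram_counts(ids, n):
    """Count every run of n consecutive IDs, regardless of where it starts."""
    return Counter(tuple(ids[i:i + n]) for i in range(len(ids) - n + 1))


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical log excerpt and made-up embeddings.
ids = [0x101, 0x1A0, 0x101, 0x1A0, 0x2F0, 0x101, 0x1A0]
embeddings = {
    0x101: [1.0, 0.0, 0.2],
    0x1A0: [0.9, 0.1, 0.3],
    0x2F0: [-0.5, 1.0, 0.0],
}

counts = id_counts(ids)
print(counts.most_common(1), counts.most_common()[-1])  # most and least common ID
print(ngram_counts(ids, 2).most_common(1))              # most common pair of IDs
print(cosine_similarity(embeddings[0x101], embeddings[0x1A0]))
```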

### Prepare training, validation and test sets
- Training set: take only normal logs. Pick random parts of them, including the starts and ends of log files, cut these into sequences, and train on them.
- Validation and test sets: build them the same way as the training set, but add examples of abnormal sequences too.
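
Cutting a log into training sequences could look roughly like this. It assumes a log is already available as a list of integer CAN IDs; `cut_windows` and the synthetic `normal_log` are hypothetical stand-ins. Each target is the input shifted by one frame, matching the next-frame prediction setup.

```python
import random


def cut_windows(ids, seq_len, n_windows, rng):
    """Sample n_windows random (input, target) pairs of length seq_len from one log.

    Start positions are drawn uniformly, so windows touching the start
    and the end of the log can be picked as well.
    """
    last_start = len(ids) - seq_len - 1
    if last_start < 0:
        return []  # log too short for even one window
    pairs = []
    for _ in range(n_windows):
        start = rng.randint(0, last_start)  # inclusive on both ends
        x = ids[start:start + seq_len]
        y = ids[start + 1:start + seq_len + 1]
        pairs.append((x, y))
    return pairs


rng = random.Random(0)  # fixed seed so the sampling is reproducible
normal_log = [i % 7 for i in range(200)]  # stand-in for the IDs of one normal log
pairs = cut_windows(normal_log, seq_len=64, n_windows=10, rng=rng)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

For validation and test sets, the same function could be applied to abnormal logs as well, keeping a label per window.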

## Model

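In [9]:
# __future__ imports have to come before any other statement
from __future__ import division, print_function
%matplotlib inline
from importlib import reload
import utils; reload(utils)
# the model cell below relies on this star import for the Keras names it uses
from utils import *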

#### Model parameters

In [18]:
# size of hidden state
n_hidden = 256

# sequence length
seq_len = 64

# size of the CAN ID vocabulary
vocab_size = 2048

In [19]:
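model = Sequential([
    # Keras 1.x argument names (Keras 2 renamed inner_init to recurrent_initializer)
    GRU(n_hidden, return_sequences=True, input_shape=(seq_len, vocab_size),
        activation='relu', inner_init='identity'),
    # softmax over all CAN IDs at every time step
    TimeDistributed(Dense(vocab_size, activation='softmax')),
])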
model.compile(loss='categorical_crossentropy', optimizer=Adam())

#### Train and predict

In [21]:
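# oh_x_rnn / oh_y_rnn are presumably the one-hot input and target sequences;
# they are not built in the cells shown here, hence the error below
model.fit(oh_x_rnn, oh_y_rnn, batch_size=64, nb_epoch=8)  # nb_epoch is the Keras 1.x name (epochs in Keras 2)

NameError: name 'oh_x_rnn' is not defined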
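
The `NameError` shows that `oh_x_rnn` and `oh_y_rnn` still have to be built. One way they might be prepared is sketched below, under the assumption that they are one-hot arrays of shape `(n_sequences, seq_len, vocab_size)` as the model's input shape and loss suggest. The random `windows` array is stand-in data, not real CAN traffic.

```python
import numpy as np

seq_len = 64
vocab_size = 2048

# Stand-in data: 3 windows of seq_len + 1 integer CAN IDs each (hypothetical values).
rng = np.random.default_rng(0)
windows = rng.integers(0, vocab_size, size=(3, seq_len + 1))

# Indexing an identity matrix with integer IDs yields their one-hot rows.
eye = np.eye(vocab_size, dtype=np.float32)
oh_x_rnn = eye[windows[:, :-1]]  # frames 0 .. seq_len - 1
oh_y_rnn = eye[windows[:, 1:]]   # the same frames shifted by one step

print(oh_x_rnn.shape, oh_y_rnn.shape)
```

One-hot arrays of this width grow quickly with the number of sequences, which is one practical argument for the embedding input discussed earlier.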