<img src="logo.png" align='center' width=80%>
# Overview
As data scientists working in a cyber-security company, we wanted to show that Natural Language Processing (NLP) algorithms can be applied to security-related events. For this task we used two algorithms developed at Google: **Word2vec** ([link](https://arxiv.org/abs/1301.3781)) and **Doc2vec** ([link](https://arxiv.org/abs/1405.4053)). These algorithms use the context in which words appear to learn a vector representation (also known as an embedding) for each word or document in a given vocabulary.  
If you want to learn about how **Word2vec** works, you can [start here](https://skymind.ai/wiki/word2vec).
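
To make the idea of "context" concrete, here is a minimal sketch of how skip-gram style (target, context) pairs could be generated from a sequence of events. The function name `context_pairs`, the window size and the rule names are illustrative assumptions; this is not the Mal2vec or Word2vec implementation.

```python
from typing import List, Tuple


def context_pairs(sentence: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (target, context) pairs for every word within `window` positions."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, sentence[j]))
    return pairs


# Hypothetical sequence of triggered rules, treated as a "sentence"
events = ["RULE_0001", "RULE_0007", "RULE_0003"]
pairs = context_pairs(events, window=1)
```

A model trained on such pairs places events that tend to share neighbours close together in the embedding space.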

Using these algorithms, we managed to model the behavior of common vulnerability scanners (and other client applications) based on their unique 'syntax' of malicious web requests. We named our implementation **Mal2vec**.

### About this notebook
This notebook contains easy-to-use widgets for running each step on your own data. We also include three datasets as examples of how to use this project.

### Table of contents
- [Load csv data file](#Load-CSV-data-file)
- [Map columns](#Map-columns)
- [Select additional grouping columns](#Select-additional-grouping-columns)
- [Create sentences](#Create-sentences)
- [Prepare dataset](#Prepare-dataset)
- [Train classification model](#Train-classification-model)
- [Evaluate the model](#Evaluate-the-model)
- [Map Events using t-SNE](#Map-Events-using-t-SNE)

# Imports

In [13]:
import random
from IPython.display import display, Markdown, clear_output, HTML
def hide_toggle():
    # @author: harshil
    # @Source: https://stackoverflow.com/a/28073228/6306692
    this_cell = """$('div.cell.code_cell.rendered.selected')"""
    next_cell = this_cell + '.next()'

    toggle_text = 'Show/hide code'  # text shown on toggle link
    target_cell = this_cell  # target cell to control with toggle
    js_hide_current = ''  # bit of JS to permanently hide code in current cell (only when toggling next cell)

    js_f_name = 'code_toggle_{}'.format(str(random.randint(1,2**64)))

    html = """
        <script>
            function {f_name}() {{
                {cell_selector}.find('div.input').toggle();
            }}

            {js_hide_current}
        </script>

        <a href="javascript:{f_name}()">{toggle_text}</a>
    """.format(
        f_name=js_f_name,
        cell_selector=target_cell,
        js_hide_current=js_hide_current, 
        toggle_text=toggle_text
    )

    return HTML(html)
display(hide_toggle())
display(HTML('''<style>.text_cell {background: #E0E5EE;}
.widget-inline-hbox .widget-label{width:120px;}</style>'''))

%load_ext autoreload
%autoreload 2

import os
import pandas as pd
import ipywidgets as widgets
from classify import prepare_dataset, train_classifier
from vizualize import draw_model, plot_model_results
from sentensize import create_sentences, dump_sentences 

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Load CSV data file
Provide the path to your data file. It can be a full path, a relative path, or a URL.  
Then click "Load CSV" to load the data into this notebook. 

**Note:** each row in the CSV file should represent a single event.

In [14]:
display(hide_toggle())

df = None
def load_csv(btn):
    global df
    clear_output()
    display(hide_toggle())
    display(widgets.VBox([filename_input, nrows_input]))
    display(HTML('<img src="loading.gif" alt="Drawing" style="width: 50px;"/>'))

    nrows = int(nrows_input.value)
    df = pd.read_csv(filename_input.value, nrows=nrows if nrows > 0 else None)

    clear_output()
    display(hide_toggle())
    display(widgets.VBox([filename_input, nrows_input, load_button]))
    print('Loaded {} rows'.format(df.shape[0]))
    display(df.sample(n=5))

filename_input = widgets.Text(description='CSV file:')
nrows_input = widgets.Text(description='Rows limit:', value='0')

load_button = widgets.Button(description='Load CSV')
load_button.on_click(load_csv)

widgets.VBox([filename_input, nrows_input, load_button])

VBox(children=(Text(value='examples/data/Imperva WAF.gz', description='CSV file:'), Text(value='100', descript…

Loaded 100 rows


| Unnamed: 0 | start_time | rule | client_application_type | site | ip | client_application | user_agent | rule_category |
|---|---|---|---|---|---|---|---|---|
| 43 | 1571095702606 | RULE_0000 | Browser | SITE_0005 | 221.73.226.116 | TOOL_0000 | Mozilla/5.0 (iPhone; CPU iPhone OS 13_1_2 like... | MISC |
| 9 | 1571095458917 | RULE_0000 | Browser | SITE_0001 | 32.61.241.203 | TOOL_0000 | Mozilla/5.0 (iPad; CPU OS 10_3_3 like Mac OS X... | MISC |
| 27 | 1571095605936 | RULE_0000 | Browser | SITE_0003 | 35.25.91.125 | TOOL_0000 | Mozilla/5.0 (iPhone; CPU iPhone OS 13_1_2 like... | MISC |
| 44 | 1571095737098 | RULE_0000 | Browser | SITE_0004 | 213.213.73.217 | TOOL_0000 | Mozilla/5.0 (iPad; CPU OS 9_3_5 like Mac OS X)... | MISC |
| 81 | 1571095742547 | RULE_0000 | Browser | SITE_0004 | 213.213.73.217 | TOOL_0000 | Mozilla/5.0 (iPad; CPU OS 9_3_5 like Mac OS X)... | MISC |


# Map columns
The data should have at least 3 columns:
- **Timestamp** (int) - if you don't have timestamps, it can also be a simple increasing index
- **Event** (string) - rule name, event description, etc. Must be a single word containing only alpha-numeric characters
- **Label** (string) - type of event. This will be later used to create the classification model
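
If your event values are free text, you may need to turn them into single tokens first. Below is a minimal sketch of one way to check and normalize them. The helper names are hypothetical, and we assume that underscores are acceptable in an event name, as the `RULE_0000` values in the bundled example suggest.

```python
import re

# Assumption: a valid event is one token of ASCII letters, digits, or underscores
# (the bundled example uses values such as RULE_0000).
EVENT_PATTERN = re.compile(r"\w+", re.ASCII)


def is_valid_event(value: str) -> bool:
    """True when the whole value is a single word-like token."""
    return EVENT_PATTERN.fullmatch(value) is not None


def normalize_event(value: str) -> str:
    """Collapse each run of other characters into one underscore."""
    return re.sub(r"\W+", "_", value.strip(), flags=re.ASCII).strip("_")


raw_events = ["RULE_0000", "SQL injection", "xss-attempt!"]
cleaned = [e if is_valid_event(e) else normalize_event(e) for e in raw_events]
```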

In [23]:
time_column_input, event_column_input, label_column_input = None, None, None
def show_dropdown(obj):
    global time_column_input, event_column_input, label_column_input
    time_column_input = widgets.Dropdown(options=df.columns, description='Time column:')
    event_column_input = widgets.Dropdown(options=df.columns, description='Event column:')
    label_column_input = widgets.Dropdown(options=df.columns, description='Label column:')

    clear_output()
    display(hide_toggle())
    display(widgets.VBox([show_dropdown_button, time_column_input, event_column_input, label_column_input]))
    
show_dropdown_button = widgets.Button(description='Refresh')
show_dropdown_button.on_click(show_dropdown)
show_dropdown(None)

VBox(children=(Button(description='Refresh', style=ButtonStyle()), Dropdown(description='Time column:', option…

# Select additional grouping columns
Select the columns whose values together identify a unique sequence of events.

In [24]:
checkboxes = None
def show_checkboxes(obj):
    global checkboxes
    checkboxes = {k:widgets.Checkbox(description=k) for k in df.columns if k not in [time_column_input.value, 
                                                                                 event_column_input.value, 
                                                                                 label_column_input.value
                                                                                ]}
    clear_output()
    display(hide_toggle())
    display(widgets.VBox([show_checkboxes_button] + [checkboxes[x] for x in checkboxes]))

show_checkboxes_button = widgets.Button(description='Refresh')
show_checkboxes_button.on_click(show_checkboxes)
show_checkboxes(None)

VBox(children=(Button(description='Refresh', style=ButtonStyle()), Checkbox(value=False, description='rule'), …

# Create sentences
This cell will group events into sentences (using the grouping columns selected).  
It will then split a sentence wherever two consecutive events are separated by more than the given timeout (default: 300 seconds).
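
The grouping and splitting described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the actual `create_sentences` code: the function name `split_into_sentences` is hypothetical, timestamps are assumed to be in seconds, and each event carries a single grouping key.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple

Event = Tuple[int, str, Hashable]  # (timestamp in seconds, event name, grouping key)


def split_into_sentences(events: List[Event], timeout: int = 300) -> List[List[str]]:
    """Group events by key, then start a new sentence after each gap > timeout."""
    by_key: Dict[Hashable, List[Tuple[int, str]]] = defaultdict(list)
    for ts, name, key in events:
        by_key[key].append((ts, name))

    sentences = []
    for rows in by_key.values():
        rows.sort()  # chronological order within the group
        current = [rows[0][1]]
        for (prev_ts, _), (ts, name) in zip(rows, rows[1:]):
            if ts - prev_ts > timeout:
                sentences.append(current)
                current = []
            current.append(name)
        sentences.append(current)
    return sentences


events = [
    (0, "RULE_A", "10.0.0.1"),
    (100, "RULE_B", "10.0.0.1"),
    (500, "RULE_C", "10.0.0.1"),  # 400 s after the previous event: new sentence
    (50, "RULE_A", "10.0.0.2"),
]
sentences = split_into_sentences(events)
```

If your timestamps use a different unit, such as milliseconds, the gaps would need to be converted before comparing them with the timeout.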

In [17]:
display(hide_toggle())

dataset_name = os.path.splitext(os.path.basename(filename_input.value))[0]
sentences_df, sentences_filepath = None, None
def sentences(obj):
    global sentences_df, sentences_filepath
    clear_output()
    display(hide_toggle())
    display(HTML('<img src="loading.gif" alt="Drawing" style="width: 50px;"/>'))

    grouping_columns = [x for x in checkboxes if checkboxes[x].value]
    sentences_df = create_sentences(df, 
                                    time_column_input.value, 
                                    event_column_input.value, 
                                    label_column_input.value, 
                                    grouping_columns,
                                    timeout=300
                                   )
    sentences_filepath = dump_sentences(sentences_df, dataset_name)

    clear_output()
    display(hide_toggle())
    display(sentence_button)
    print('Created {} sentences. Showing 5 examples:'.format(sentences_df.shape[0]))
    display(sentences_df.sample(n=5))

sentence_button = widgets.Button(description='Start')

display(sentence_button)
sentence_button.on_click(sentences)

Button(description='Start', style=ButtonStyle())

# Prepare dataset
1) Train a Doc2vec model to extract the embedding vector from each sentence.  
**Parameters**:  
*vector_size*: the size of the embedding vector. Increasing this parameter might improve accuracy, but training will take longer (int, default=30)  
*epochs*: the number of training epochs. Increasing this parameter might improve accuracy, but training will take longer (int, default=50)  
*min_sentence_count*: labels with fewer sentences than this are not classified (int, default=200)  

2) Prepare dataset
- Infer the embedding vector for each sample in the dataset
- Perform [stratified sampling](https://en.wikipedia.org/wiki/Stratified_sampling) for each label
- Split into train and test sets (80%/20%)
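
As a rough illustration of the filtering and splitting steps, the sketch below drops labels with too few samples and then splits each remaining label 80%/20%, so both sets keep the same label mix. It is one possible reading of the steps above, not the code inside `prepare_dataset`; the function name `stratified_split`, the `seed` parameter and the toy data are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    samples: Sequence[Tuple[list, str]],
    min_count: int = 200,
    test_ratio: float = 0.2,
    seed: int = 0,
):
    """Split (vector, label) pairs per label so both sets keep the label mix."""
    by_label: Dict[str, List[list]] = defaultdict(list)
    for vector, label in samples:
        by_label[label].append(vector)

    rng = random.Random(seed)
    train, test = [], []
    for label, vectors in by_label.items():
        if len(vectors) < min_count:
            continue  # too few sentences to classify this label
        rng.shuffle(vectors)
        n_test = int(len(vectors) * test_ratio)
        test += [(v, label) for v in vectors[:n_test]]
        train += [(v, label) for v in vectors[n_test:]]
    return train, test


# Toy data: 10 "scanner" vectors, 5 "browser" vectors, 1 rare label
data = [([i], "scanner") for i in range(10)]
data += [([i], "browser") for i in range(5)]
data += [([0], "rare")]
train, test = stratified_split(data, min_count=5)
```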

In [18]:
display(hide_toggle())

X_train, X_test, y_train, y_test, classes = None, None, None, None, None
def dataset(obj):
    global sentences_df, sentences_filepath, dataset_name, X_train, X_test, y_train, y_test, classes
    clear_output()
    display(hide_toggle())
    display(HTML('<img src="loading.gif" alt="Drawing" style="width: 50px;"/>'))

    X_train, X_test, y_train, y_test, classes = prepare_dataset(sentences_df, 
                                                                sentences_filepath, 
                                                                dataset_name,
                                                                vector_size=30,
                                                                epochs=50,
                                                                min_sentence_count=200
                                                               )

    dataset_button.description = 'Run Again'
    clear_output()
    display(hide_toggle())
    print('Dataset ready!')
    display(dataset_button)

dataset_button = widgets.Button(description='Start')

display(dataset_button)
dataset_button.on_click(dataset)

Button(description='Start', style=ButtonStyle())

# Train classification model
Train a deep neural network to classify each sentence with its correct label. Training runs for up to 500 epochs and stops automatically when the results no longer improve.

For the purpose of this demo, the network architecture and hyper-parameters are fixed. Feel free to modify the code and improve the model.
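
The automatic stopping rule can be pictured with the following framework-free sketch. It is not the logic inside `train_classifier`: the `patience` value, the function names and the stand-in training step are all assumptions made for illustration.

```python
from typing import Callable, List


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 500,
    patience: int = 10,
) -> List[float]:
    """Call run_epoch until validation loss stops improving for `patience` epochs."""
    history = []
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        val_loss = run_epoch(epoch)
        history.append(val_loss)
        if val_loss < best_loss:
            best_loss = val_loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return history


# Stand-in for a real training step: loss falls until epoch 19, then plateaus
def fake_epoch(epoch: int) -> float:
    return float(max(20 - epoch, 1))


history = train_with_early_stopping(fake_epoch, patience=5)
```

With this stand-in, training ends long before the 500-epoch limit because the loss stays flat for five consecutive epochs.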

In [19]:
display(hide_toggle())

history, report, df_cm = None, None, None
def train(obj):
    global dataset_name, X_train, X_test, y_train, y_test, classes, history, report, df_cm
    train_button.description = 'Train Again'

    clear_output()
    display(hide_toggle())
    display(train_button)

    history, report, df_cm = train_classifier(X_train, X_test, y_train, y_test, classes, dataset_name)
    

train_button = widgets.Button(description='Start')

display(train_button)
train_button.on_click(train)

Button(description='Start', style=ButtonStyle())

# Evaluate the model
Plot the results of the model:
- **Loss** - how did the model progress during training (lower values mean better performance)
- **Accuracy** - how did the model perform on the validation set (higher values are better)
- **Confusion Matrix** - mapping each of the model's predictions (x-axis) to its true label (y-axis). Correct predictions are placed on the main diagonal (brighter is better)
- **Detailed report** - for each label, show the following metrics: precision, recall, f1-score ([read more here](https://towardsdatascience.com/accuracy-precision-recall-or-f1-331fb37c5cb9)). The 'support' metric is the number of instances in that class
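
For readers who want to see how the per-label metrics are defined, here is a small sketch that computes them from lists of true and predicted labels. The function name `classification_metrics` and the toy labels are assumptions; the notebook itself relies on `plot_model_results` to produce the report.

```python
from collections import Counter
from typing import Dict, List


def classification_metrics(y_true: List[str], y_pred: List[str]) -> Dict[str, Dict[str, float]]:
    """Per-label precision, recall, F1-score and support."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this label, but it was wrong
            fn[truth] += 1  # missed an instance of the true label

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1-score": f1,
            "support": actual,
        }
    return report


y_true = ["scanner", "scanner", "browser", "browser"]
y_pred = ["scanner", "browser", "browser", "browser"]
report = classification_metrics(y_true, y_pred)
```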

In [20]:
display(hide_toggle())

def evaluate(btn):
    global history, report, df_cm
    
    clear_output()
    evaluate_button.description = 'Refresh'
    display(hide_toggle())
    display(evaluate_button)
    plot_model_results(history, report, df_cm, classes)
    
evaluate_button = widgets.Button(description='Evaluate Model')
display(evaluate_button)
evaluate_button.on_click(evaluate)

Button(description='Evaluate Model', style=ButtonStyle())

# Map Events using t-SNE
t-SNE is a dimensionality-reduction algorithm for visualizing high-dimensional vectors. We use it to project the embedding of each event onto a 2D map.

If your CSV contains a 'category' column, select it in the dropdown list below. Mal2vec will extract all necessary information and create a link to an HTML file.

In [25]:
display(hide_toggle())

def draw_tsne(obj):
    global category_column_input, event_column_input, draw_tsne_button
    clear_output()
    display(hide_toggle())
    display(HTML('<img src="loading.gif" alt="Drawing" style="width: 50px;"/>'))

    html_file = draw_model(df, sentences_filepath, event_column_input.value, category_column_input.value, dataset_name)

    clear_output()
    display(hide_toggle())
    display(widgets.VBox([category_column_input, draw_tsne_button]))
    display(HTML(f"<a href='{html_file}' target=_blank>Open output in new tab</a>"))
    
category_column_input = widgets.Dropdown(options=df.columns, description='Category column:')
draw_tsne_button = widgets.Button(description='Draw t-SNE')
draw_tsne_button.on_click(draw_tsne)
display(widgets.VBox([category_column_input, draw_tsne_button]))

VBox(children=(Dropdown(description='Category column:', options=('start_time', 'rule', 'client_application_typ…