To do meaningful detection on the dataset, we need to personalize the classifier.
This process requires proper knowledge of the available data.

We will start by importing the available bro logs into pandas to perform some statistical analysis and filter out some noise.

Import the required dependencies.

In [None]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt

Import the bro logs and verify that they contain data.
The imported logs are the output of bro-cut with the -d and -u flags, which print the timestamps in a human-readable UTC format.
If the format of the bro logs changes, this code has to be adapted.

In [None]:
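columns = ["ts", "uid", "id.orig_h", "id.orig_p", "id.resp_h", "id.resp_p", "fuid", "action", "path", "name", "size", "times.modified", "times.accessed", "times.created", "times.changed"]
# Assumption: the bro-cut output has no header row, so the column names are passed explicitly
bro_logs = pd.read_csv("data/smb_files_utc_formatted.csv", sep="\t", names=columns, na_values="-", parse_dates=[0, 11, 12, 13, 14])
# Replace missing values ("-" in the logs) with 0
bro_logs.fillna(0, inplace=True)
# Print (rows, columns) to verify that data was loaded
print(bro_logs.shape)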
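
Beyond the shape, it can be useful to check that the timestamps were really parsed and how many values were missing. The sketch below illustrates this on a made-up sample in the same tab-separated style; the sample rows, the reduced column set and the variable names are assumptions for illustration and not part of the real dataset.

```python
import io

import pandas as pd

# Hypothetical sample in bro-cut style: tab-separated, "-" marks a missing value
sample = (
    "2017-03-01T12:00:00+0000\treport.docx\t1024\n"
    "2017-03-01T12:05:00+0000\tbudget.xlsx\t-\n"
)

sample_logs = pd.read_csv(
    io.StringIO(sample),
    sep="\t",
    names=["ts", "name", "size"],  # the sample has no header row
    na_values="-",
    parse_dates=["ts"],
)

# True if "ts" was parsed into datetimes instead of staying plain strings
ts_is_datetime = pd.api.types.is_datetime64_any_dtype(sample_logs["ts"])
# Number of missing values per column, counted before any fillna()
missing_per_column = sample_logs.isna().sum()

print(ts_is_datetime)
print(missing_per_column)
```

Counting the missing values before calling fillna() keeps the information about how much of each column was actually empty.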

Compute some statistics on the data.
The top 10 list shows some outliers that will generate noise in the classification algorithm.

In [None]:
N = 10
files_and_folders_seen = bro_logs["name"].nunique()
files_and_folders_read = bro_logs["name"][bro_logs["action"] == "SMB::FILE_READ"].nunique()
top_n_files_read = bro_logs["name"][bro_logs["action"] == "SMB::FILE_READ"].value_counts(sort=True).head(N)

# .format() has to be called on the string, not on the return value of print()
print("Dataset contains {} unique files and folders".format(files_and_folders_seen))
print("Dataset contains {} unique files and folders which were read".format(files_and_folders_read))
print("{} most read files in the dataset:\n{}".format(N, top_n_files_read))
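
To judge how strongly an outlier dominates, the raw counts can be turned into shares of all reads. This is a minimal sketch on made-up file names; `read_names` is a hypothetical stand-in for the "name" column restricted to SMB::FILE_READ actions.

```python
import pandas as pd

# Hypothetical read events, one entry per read of a file
read_names = pd.Series(["a.ini", "a.ini", "a.ini", "b.docx", "c.xlsx", "a.ini"])

# Fraction of all reads that each file accounts for, largest first
read_share = read_names.value_counts(normalize=True)

print(read_share.head(3))
```

A file that accounts for a large fraction of all reads is a likely candidate for the ignore list used later on.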

Now plot some basic graphs that show the file activity on the network.

In [None]:
fig, ax = plt.subplots()
fig.suptitle("Activity over time")

# Count the read events per day and plot the counts as a bar chart
reads_per_day = bro_logs[bro_logs["action"] == "SMB::FILE_READ"].set_index("ts").resample("D").size()
reads_per_day.plot(ax=ax, color='#267f8c', kind='bar', edgecolor='#267f8c')

fig.savefig('output/activity_over_time_before_cleaning.png', dpi=1000)
plt.show()
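
Activity over time is easier to read when the events are aggregated per day. The following is a minimal sketch of such a daily count on made-up data; the `events` frame and its values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical events spread over three days
events = pd.DataFrame({
    "ts": pd.to_datetime([
        "2017-03-01 09:00", "2017-03-01 17:30", "2017-03-03 08:15",
    ]),
    "name": ["report.docx", "budget.xlsx", "report.docx"],
})

# Use the timestamp as index and count the rows per calendar day;
# days without events show up with a count of 0
events_per_day = events.set_index("ts").resample("D").size()

print(events_per_day)
```

Because empty days are kept with a count of 0, gaps in the activity stay visible in a bar chart.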

In [None]:
fig, ax = plt.subplots()
fig.suptitle("Files opened")
# Number of reads per file; the file names on the x axis are hidden below
bro_logs["name"][bro_logs["action"] == "SMB::FILE_READ"].value_counts(sort=False).plot(ax=ax, color='#267f8c', kind='bar', edgecolor='#267f8c')
ax.xaxis.set_visible(False)
fig.savefig('output/files_opened_before_cleaning.png', dpi=1000)
plt.show()

In [None]:
print(bro_logs["name"][bro_logs["action"] == "SMB::FILE_READ"].value_counts().head(25))

We can see that some of the most accessed files are not interesting to monitor.
Leaving these files in the dataset would distort the predictions later on.

To get a better view of the data, we filter out these files.

NOTE: The files in the ignore_list have to be changed according to the data provided in the dataset.

In [None]:
import re

ignore_list = ["Example", "Files", "That", "Have", "To", "Be", "Filtered"]
ignore_regex = '|'.join(ignore_list)

# na=False treats non-string names (missing values were filled with 0) as not matching
bro_logs_filtered = bro_logs[~bro_logs["name"].str.contains(ignore_regex, flags=re.IGNORECASE, na=False)]

print(bro_logs_filtered["name"][bro_logs_filtered["action"] == "SMB::FILE_READ"].value_counts().head(10))
print(bro_logs_filtered["name"][bro_logs_filtered["action"] == "SMB::FILE_READ"].shape)
print(bro_logs_filtered["name"][bro_logs_filtered["action"] == "SMB::FILE_READ"].nunique())
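
Real file names often contain characters such as "." or "(" that have a special meaning in a regular expression. If the entries of the ignore list are meant as literal names, escaping them first avoids unintended matches. The sketch below shows this with the standard library only; the file names are made up for illustration.

```python
import re

# Hypothetical file names to ignore; "." and "(" are regex metacharacters
ignore_names = ["desktop.ini", "Thumbs.db", "backup (old)"]

# Escape each name so it is matched literally, then OR the names together
escaped_regex = "|".join(re.escape(name) for name in ignore_names)
pattern = re.compile(escaped_regex, flags=re.IGNORECASE)

seen_names = ["DESKTOP.INI", "desktopxini", "report.docx", "Backup (old).zip"]
# Keep only the names that match none of the ignored entries
kept = [name for name in seen_names if not pattern.search(name)]

print(kept)
```

Without escaping, the "." in "desktop.ini" would match any character, so an unrelated name such as "desktopxini" would be filtered out as well. The escaped string can be used in the same place as the joined ignore list.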

Let's plot the graphs again to check that the data no longer contains significant outliers.

In [None]:
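fig, ax = plt.subplots()
fig.suptitle("Activity over time")

# Count the remaining read events per day and plot the counts as a bar chart
filtered_reads_per_day = bro_logs_filtered[bro_logs_filtered["action"] == "SMB::FILE_READ"].set_index("ts").resample("D").size()
filtered_reads_per_day.plot(ax=ax, color='#267f8c', kind='bar', edgecolor='#267f8c')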

fig.savefig('output/activity_over_time_after_cleaning.png', dpi=1000)
plt.show()

fig, ax = plt.subplots()
fig.suptitle("Files opened")
# Number of reads per remaining file; the file names on the x axis are hidden below
bro_logs_filtered["name"][bro_logs_filtered["action"] == "SMB::FILE_READ"].value_counts(sort=False).plot(ax=ax, color='#267f8c', kind='bar', edgecolor='#267f8c')
ax.xaxis.set_visible(False)
fig.savefig('output/files_opened_after_cleaning.png', dpi=1000)
plt.show()

Now that we have a cleaned dataset, we save it for further processing.

In [None]:
bro_logs_filtered.to_pickle("data/bro_logs_filtered.pkl")
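
A later processing step can load the pickle back into a DataFrame with the column types intact. This is a minimal round-trip sketch; the `frame` contents and the temporary path are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the filtered logs
frame = pd.DataFrame({"name": ["report.docx", "budget.xlsx"], "size": [1024, 2048]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bro_logs_filtered.pkl")
    frame.to_pickle(path)
    # Load the frame back; values and dtypes are restored as saved
    restored = pd.read_pickle(path)

print(restored.equals(frame))
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.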