# Information

**Author:**<br>Pascal Munaretto (<a href="mailto:pascal.munaretto@outlook.com">Mail</a>)

**Date:**<br>30.09.2022

**Type:**<br>Master's Thesis

**Topic:**<br>Design, Implementation and Performance Analysis of an AI-Based Insider Threat Detection Platform in Splunk To Counteract Data Exfiltration

**Study Program:**<br>Enterprise and IT Security

**Institution:**<br><a href="https://www.hs-offenburg.de">Offenburg University of Applied Sciences</a>

**GitHub:**<br>https://github.com/pmunaretto/Master-Thesis

# Check GPU

In [None]:
!nvidia-smi

Mon Jun 20 03:14:27 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    29W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Setup

## Requirements

First, Conda is installed using a custom installer. Prepackaging Conda with the channels and specs listed below reduces the installation time from ~25 minutes to ~6 minutes.

**Channels:**
- rapidsai
- nvidia
- conda-forge

**Specs:**
- python=3.7
- cudatoolkit=11.2
- llvmlite
- gcsfs
- openssl
- dask-sql
- pip
- conda
- mamba

In [None]:
!pip install -q condacolab
import condacolab
condacolab.install_from_url("file:/content/drive/MyDrive/condacolab-0.1-Linux-x86_64.sh")

In [None]:
!pip install line_profiler
!pip install memory_profiler
%load_ext line_profiler
%load_ext memory_profiler

## Imports

In [None]:
import dask_cudf
import glob
import json
import requests
import os
import re
import numpy as np
import pandas as pd

## Configuration

In [None]:
BASE_PATH = "/content/drive/MyDrive/CERT/r4.2"

# Preparation

## Get a List of Job Sites

In [None]:
import requests
import re

In [None]:
r = requests.get("https://raw.githubusercontent.com/emredurukn/awesome-job-boards/master/README.md")
job_portals = re.findall(r"https?://(?:www\.)?(?!github|cdn\.rawgit)([^:/\s)]+)", r.text)
print(f"Amount of Job Portals: {len(job_portals)}")
print(f"Job Portals: {job_portals}")

Amount of Job Portals: 342
Job Portals: ['linkedin.com', 'indeed.com', 'glassdoor.com', 'angel.co', 'datajobs.com', 'ai-jobs.net', 'jobhunt.ai', 'icrunchdata.com', 'kdnuggets.com', 'datayoshi.com', 'jobs.opendatascience.com', 'jobsnew.analyticsvidhya.com', 'jobsfornewdatascientists.com', 'bigdatajobs.com', 'statsjobs.com', 'bigcloud.global', 'data-stryde.com', 'blocktribe.com', 'cryptojobslist.com', 'crypto.jobs', 'blockew.com', 'cryptocurrencyjobs.co', 'useweb3.xyz', 'block-stryde.com', 'nextcryptojobs.com', 'designjobs.aiga.org', 'authenticjobs.com', 'behance.net', 'coroflot.com', 'ixda.org', 'dribbble.com', 'krop.com', 'opensourcedesign.net', 'uxjobsboard.com', 'designerjobs.co', 'designmodo.com', 'designjobsboard.com', 'jobs.designweek.co.uk', 'designjobs.aiga.org', 'ifyoucouldjobs.com', 'creativemornings.com', 'creativepool.com', 'theloop.com.au', 'ninjajobs.org', 'infosec-jobs.com', 'cybersecurityjobsite.com', 'careersincyber.com', 'careersinfosecurity.com', 'cybersecurityjobs.ne

In [None]:
job_portals = pd.read_csv(os.path.join(BASE_PATH, "job_portals.csv"))["url"].to_list()
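
The hostname extraction above can be tried on a small input. The following is a minimal sketch, assuming a raw-string form of the same pattern with the dot in `www.` escaped; the `sample` text is a hypothetical README excerpt, not taken from the real list.

```python
import re

# Skip the scheme and an optional "www.", reject GitHub/rawgit hosts,
# then capture everything up to the next ":", "/", whitespace or ")".
PATTERN = re.compile(r"https?://(?:www\.)?(?!github|cdn\.rawgit)([^:/\s)]+)")

# Hypothetical README excerpt used only for illustration
sample = (
    "- [LinkedIn](https://www.linkedin.com/jobs)\n"
    "- [Indeed](http://indeed.com)\n"
    "- [Source](https://github.com/example/repo)\n"
)

hosts = PATTERN.findall(sample)
print(hosts)
```

The GitHub link is dropped by the negative lookahead, so only the two job-board hostnames remain.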

## Get a List of Malicious Event IDs

In [None]:
# Get the paths of all answer files
files = glob.glob(os.path.join(BASE_PATH, "answers/r4.2/*.csv"))

# Create an empty list for malicious event ids
answers = []

# Iterate through each file and extract the id of the malicious event
for f in files:
    with open(f, "rb") as infile:
        for line in infile:
            line = line.decode().split(",")[1]
            line = line.replace("\"","")
            answers.append(line)

answers[:10]

['{Q4V8-Q7GY61AR-5156JFIN}',
 '{Y4R1-Z8TY05GU-7397GLKY}',
 '{A7Y6-T2ZM78RE-3085IFRU}',
 '{R4U0-L5EY71BN-9432WLPE}',
 '{L7V8-B3VF10NY-3528JHMD}',
 '{J0W5-D1UM54UE-1226CKXX}',
 '{T4O3-W9NH72UI-2778USIL}',
 '{P3P3-R5NS04UU-2644BJIC}',
 '{I2P1-S9MC06AP-9989GDDO}',
 '{I7C2-D6NT66AA-1853AICO}']
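
As an alternative to splitting each line by hand, the standard `csv` module can take care of the quoting. This is only a sketch under the same assumption the loop above makes, namely that the event ID is the second comma-separated field; the rows in `sample` are hypothetical and do not come from the dataset.

```python
import csv
import io

# Hypothetical rows: the event ID is the second field and may be quoted
sample = io.StringIO(
    'logon,"{AAAA-BBBB0000-0000CCCC}",01/01/2010 08:00:00,USR0001,PC-0001,Logon\n'
    'http,{DDDD-EEEE1111-1111FFFF},01/01/2010 09:00:00,USR0001,PC-0001,http://example.com\n'
)

# csv.reader removes surrounding quotes, so no manual replace() is needed;
# a set also removes duplicate IDs
event_ids = {row[1] for row in csv.reader(sample) if len(row) > 1}
print(sorted(event_ids))
```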

## Get Additional Information about Each User (Role, Team, ...)

In [None]:
from sklearn.preprocessing import LabelEncoder
import dask.dataframe as dd
import cudf

# Reading the first LDAP file is enough since it contains information about all 1000 employees
user_mapping = pd.read_csv(os.path.join(BASE_PATH, "ldap/2009-12.csv"), usecols=["user_id", "role", "functional_unit", "department", "team"])
user_mapping["functional_unit"] = user_mapping["functional_unit"].str[:1].fillna(0).astype("int")
user_mapping["department"] = user_mapping["department"].str[:1].fillna(0).astype("int")
user_mapping["team"] = user_mapping["team"].str[:1].fillna(0).astype("int")
user_mapping["role"] = LabelEncoder().fit_transform(user_mapping["role"])
user_mapping.rename(columns={"user_id": "user"}, inplace=True)
user_mapping.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 5 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   user             1000 non-null   object
 1   role             1000 non-null   int64 
 2   functional_unit  1000 non-null   int64 
 3   department       1000 non-null   int64 
 4   team             1000 non-null   int64 
dtypes: int64(4), object(1)
memory usage: 39.2+ KB


# Preprocessing

In [None]:
# Iterate through the datasets
for filename in glob.glob(os.path.join(BASE_PATH, "*.csv"))[::-1]:

    print(f"Processing {filename}...")

    # Read the CSV as a dask_cudf dataframe; parsing the dates at read time is faster
    df = dask_cudf.read_csv(filename, parse_dates=["date"], chunksize="2GB")

    # Add threat labels
    df["threat"] = 0
    df["threat"] = df.threat.where(~df.id.isin(answers), 1)

    # Extract the hours and weekdays from the datetime and map them to sin/cos
    df["hour"] = df.date.dt.hour
    df["weekday"] = df.date.dt.isocalendar().day
    df["hour_sin"] = np.sin(2 * np.pi * df["hour"]/24.0)
    df["hour_cos"] = np.cos(2 * np.pi * df["hour"]/24.0)
    df["weekday_sin"] = np.sin(2 * np.pi * df["weekday"]/7)
    df["weekday_cos"] = np.cos(2 * np.pi * df["weekday"]/7)

    # Drop the id column
    df = df.drop(columns=["id"])

    # HTTP specific preprocessing
    if os.path.basename(filename) == "http.csv":
        df["url"] = df["url"].str[7:].str.split("/", n=1, expand=True)[0]
        df["is_job_portal"] = 0
        df["is_job_portal"] = df.is_job_portal.where(~df.url.isin(job_portals), 1)
        df = df.drop(columns=["content"])

    # Convert the dask cudf dataframe to a dask dataframe
    df = df.map_partitions(lambda df: df.to_pandas())

    # Add the user role/functional_unit/department/team information
    df = df.merge(user_mapping, on=["user"])

    # Create the output subdirectory if it does not exist yet
    os.makedirs(os.path.join(BASE_PATH, "preprocessed"), exist_ok=True)

    # Save the modified dataframe as Parquet, which is more efficient to read/write
    df.to_parquet(
        os.path.join(BASE_PATH, "preprocessed", os.path.splitext(os.path.basename(filename))[0]),
        write_index=False,
        compression=None
    )

Processing /content/drive/MyDrive/CERT/r4.2/http.csv...
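
The sin/cos mapping in the preprocessing loop places a periodic value on the unit circle so that the end of a cycle lies next to its start. The sketch below illustrates the idea with the standard library only; `encode_cyclical` and `distance` are hypothetical helper names introduced for this example. It also shows why the divisor should be the full period: with 24, hour 23 and hour 0 are neighbours but distinct, whereas a divisor of 23 would map both to the same point.

```python
import math

def encode_cyclical(value, period):
    """Map a periodic value to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

h0 = encode_cyclical(0, 24)
h1 = encode_cyclical(1, 24)
h23 = encode_cyclical(23, 24)

# With period 24, 23:00 is as close to 00:00 as 01:00 is
print(distance(h23, h0), distance(h0, h1))

# With a divisor of 23, hour 23 and hour 0 collapse onto one point
print(distance(encode_cyclical(23, 23), encode_cyclical(0, 23)))
```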


# Data Augmentation

## Helper Function

In [None]:
def transform_features_to_sessions(df, open_activity, close_activity):
    """Pair each open event with a directly following close event.

    The rows are expected to be in chronological order. Returns one row per
    session with an additional session_duration column (in minutes).
    """

    # Accumulating the sessions in a list is cheaper than appending to a dataframe
    sessions = []

    # Iterate through each pair of consecutive rows
    for i in range(0, df.shape[0] - 1):

        row1 = df.iloc[i]
        row2 = df.iloc[i+1]

        if row1.activity == open_activity and row2.activity == close_activity:

            # Calculate the time delta and convert it to minutes
            session_duration = row2.date - row1.date
            session_duration = int(session_duration.total_seconds() / 60)

            # Append the session information to the list
            sessions.append(
                list(row1) + [session_duration]
            )

    # Transform the list to a dataframe and return it 
    return pd.DataFrame(sessions, columns=list(df) + ["session_duration"])
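
The pairing logic of the helper can be followed on a handful of events without pandas. This is a simplified sketch, not the helper itself: `pair_sessions` is a hypothetical function that works on plain `(timestamp, activity)` tuples, and the events are made up for illustration.

```python
from datetime import datetime

def pair_sessions(events, open_activity, close_activity):
    """Return (start, duration_in_minutes) for each open event that is
    directly followed by a close event. Events must be in chronological order."""
    sessions = []
    for (start, first), (end, second) in zip(events, events[1:]):
        if first == open_activity and second == close_activity:
            minutes = int((end - start).total_seconds() / 60)
            sessions.append((start, minutes))
    return sessions

# Hypothetical events of a single user
events = [
    (datetime(2010, 1, 4, 8, 0), "Logon"),
    (datetime(2010, 1, 4, 16, 30), "Logoff"),
    (datetime(2010, 1, 5, 8, 15), "Logon"),   # no directly following Logoff
    (datetime(2010, 1, 5, 9, 0), "Logon"),
    (datetime(2010, 1, 5, 17, 0), "Logoff"),
]

sessions = pair_sessions(events, "Logon", "Logoff")
print(sessions)
```

As in the helper, an open event that is not immediately followed by its close event (here the 08:15 logon) produces no session.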

## Logon Sessions

In [None]:
%%time

# Read the logon data
df = pd.read_parquet(os.path.join(BASE_PATH, "preprocessed", "logon"))

# Group the dataframe by users
session_df = df.groupby("user").apply(
    transform_features_to_sessions,
    open_activity="Logon",
    close_activity="Logoff"
)

# Save the session data to a new parquet
session_df.to_parquet(
    os.path.join(BASE_PATH, "preprocessed", "logon_sessions"),
    index=False,
    engine="pyarrow",
    compression=None
)

## Device Sessions

In [None]:
%%time
%%memit

# Read the device data
df = pd.read_parquet(os.path.join(BASE_PATH, "preprocessed", "device"))

# Group the dataframe by users
session_df = df.groupby("user").apply(
    transform_features_to_sessions,
    open_activity="Connect",
    close_activity="Disconnect"
)

# Save the session data to a new parquet
session_df.to_parquet(
    os.path.join(BASE_PATH, "preprocessed", "device_sessions"),
    index=False,
    engine="pyarrow",
    compression=None
)

peak memory: 3782.61 MiB, increment: 97.56 MiB
CPU times: user 10min 13s, sys: 5.13 s, total: 10min 18s
Wall time: 10min 13s
