We import the libraries and load the CSV files.

In [1]:
from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession
from sklearn.preprocessing import LabelEncoder
from LogParser import LogParser
import pandas as pd
import re

spark = SparkSession.builder.appName("Preprocess data").getOrCreate()
ddos_tf_df = spark.read.format("csv").option("header", "true").load("ddos-tcp-syn-flood.csv")
normal_tf_df = spark.read.format("csv").option("header", "true").load("normal-traffic.csv")
port_scan_tf_df = spark.read.format("csv").option("header", "true").load("port-scanning.csv")

data_frames = {
    "normal-traffic": normal_tf_df,
    "port-scanning": port_scan_tf_df,
    "ddos-tcp-syn-flood": ddos_tf_df,
}

24/05/21 17:34:02 WARN Utils: Your hostname, PC-Debian resolves to a loopback address: 127.0.1.1; using 192.168.0.122 instead (on interface enp4s0)
24/05/21 17:34:02 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/05/21 17:34:02 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


We select the columns that carry relevant information. More columns can be added, but then they also have to be normalized or encoded in the preprocessing cell that follows.

In [2]:
selected_columns = [
    "frame-time",
    "ip-src_host",
    "ip-dst_host",
    "tcp-connection-syn",
    "tcp-connection-synack",
    "tcp-flags_index",
    "tcp-len",
    "tcp-seq",
    "tcp-dstport",
    "Attack_type"
]

We iterate over the loaded DataFrames, replace the dots in column names with hyphens, and normalize/encode the non-numeric columns (except the timestamps; that column is modified later). The slimmed-down data is saved to the `preprocessed_data` directory.

In [3]:
for df_name, df in data_frames.items():

    # Spark interprets dots in column names as nested-field access, so replace them with hyphens.
    for col_name in df.columns:
        new_col_name = re.sub(r'\.', '-', col_name)
        df = df.withColumnRenamed(col_name, new_col_name)

    # Encode the TCP flags string as a numeric index.
    tcp_flags_indexer = StringIndexer(inputCol="tcp-flags", outputCol="tcp-flags_index")
    indexed_df = tcp_flags_indexer.fit(df).transform(df)

    # Filter on indexed_df.columns (not df.columns) so the new "tcp-flags_index" column is kept.
    indexed_df = indexed_df.select([c for c in indexed_df.columns if c in selected_columns])
    pandas_df = indexed_df.toPandas()

    # Encode IP addresses as integers; the encoder is refitted for each column and each file.
    label_encoder = LabelEncoder()
    pandas_df["ip-src_host"] = label_encoder.fit_transform(pandas_df["ip-src_host"])
    pandas_df["ip-dst_host"] = label_encoder.fit_transform(pandas_df["ip-dst_host"])

    pandas_df.to_csv(f'preprocessed_data/{df_name}.csv', index=False)

We load the saved CSV files and build the data samples. One X sample is a sequence of 32 consecutive logs in which the timestamp of the first log has been subtracted from every timestamp in the sequence. This keeps the timestamps small while preserving the spacing between consecutive logs. The matching Y sample is a single number that identifies the attack type (or normal traffic) for the aggregated logs.

In [4]:
encoded_attacks = {
    "normal-traffic": 0,
    "port-scanning": 1,
    "ddos-tcp-syn-flood": 2
}
x_data = []
y_data = []
for df_name in data_frames.keys():
    df = pd.read_csv(f'preprocessed_data/{df_name}.csv', parse_dates=['frame-time'])
    log_series = LogParser.logs_to_series(df, 32)
    x_data.extend(log_series)
    y_data.extend([encoded_attacks.get(df_name)] * len(log_series))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  bucket['frame-time'] = bucket['frame-time'] - bucket['frame-time'].iloc[0]
24/05/21 17:34:15 WARN GarbageCollectionMetrics: To enable non-built-in garbage collector(s) List(G1 Concurrent GC), users should configure it(them) to spark.eventLog.gcMetrics.youngGenerationGarbageCollectors or spark.eventLog.gcMetrics.oldGenerationGarbageCollectors
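
`LogParser` is a local module that is not shown in this notebook, so the following is only a sketch of what `logs_to_series` might do. It assumes non-overlapping windows and a datetime `frame-time` column; the function body and the `demo` frame are hypothetical. Taking an explicit `.copy()` of each slice is one way to avoid the `SettingWithCopyWarning` printed above.

```python
import pandas as pd


def logs_to_series(df, window=32, time_col="frame-time"):
    """Split df into consecutive, non-overlapping windows of `window` rows.

    Hypothetical stand-in for LogParser.logs_to_series; the real
    implementation is not shown in the notebook.
    """
    series = []
    for start in range(0, len(df), window):
        # .copy() yields an independent frame, so assigning to a column
        # below does not write into a slice of the original DataFrame.
        bucket = df.iloc[start:start + window].copy()
        # Replace absolute timestamps with offsets (in seconds) from the
        # first log of the window.
        elapsed = bucket[time_col] - bucket[time_col].iloc[0]
        bucket[time_col] = elapsed.dt.total_seconds()
        series.append(bucket.reset_index(drop=True))
    return series


# Tiny made-up example: three logs, window size 2.
demo = pd.DataFrame({
    "frame-time": pd.to_datetime([
        "2024-05-21 17:00:00.000",
        "2024-05-21 17:00:00.319",
        "2024-05-21 17:00:01.000",
    ]),
    "tcp-len": [0, 10, 20],
})
windows = logs_to_series(demo, window=2)
print(len(windows), [len(w) for w in windows])
```

Note that with this approach the last window can be shorter than `window` rows, which is why some form of padding is needed later.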

In [5]:
print(len(x_data), len(y_data))
print(x_data[0])
print(y_data[0])

32457 32457
    frame-time  ip-src_host  ip-dst_host  tcp-connection-syn  \
0        0.000            2            3                 0.0   
1        0.000            3            2                 0.0   
2        0.000            2            3                 0.0   
3        0.000            3            2                 0.0   
4        0.319            2            3                 1.0   
5        0.319            3            2                 0.0   
6        0.322            2            3                 0.0   
7        0.322            3            2                 0.0   
8        0.323            3            2                 0.0   
9        0.329            2            3                 1.0   
10       0.329            3            2                 0.0   
11       0.330            2            3                 0.0   
12       0.330            3            2                 0.0   
13       0.330            2            3                 0.0   
14       0.330            3 

Before training the models, the samples still need to be shuffled and split into training and test sets. It would also be worth adding padding for the cases where a sample ends up with fewer than 32 logs.
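
The remaining steps could look roughly like the sketch below. It is not part of the notebook: the helper names `pad_window` and `shuffle_and_split`, the zero padding, the 80/20 split and the toy data are all assumptions made for illustration. It also assumes that each window has already been converted to a numeric array (for example with `DataFrame.to_numpy()` after dropping non-numeric columns).

```python
import numpy as np


def pad_window(window, length=32):
    """Zero-pad a (rows, features) array at the end up to `length` rows."""
    missing = length - window.shape[0]
    if missing <= 0:
        return window[:length]
    # Pad only along the row axis; the feature axis is left untouched.
    return np.pad(window, ((0, missing), (0, 0)), mode="constant")


def shuffle_and_split(x, y, test_fraction=0.2, seed=42):
    """Shuffle samples and labels with one permutation, then split them."""
    x = np.asarray(x)
    y = np.asarray(y)
    order = np.random.default_rng(seed).permutation(len(x))
    n_test = int(round(len(x) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], x[test_idx], y[train_idx], y[test_idx]


# Toy data standing in for x_data / y_data: ten windows with three
# features each; the last window is short (5 rows instead of 32).
toy_x = [np.ones((32, 3)) * i for i in range(9)] + [np.ones((5, 3)) * 9]
toy_y = [i % 3 for i in range(10)]

padded = np.stack([pad_window(w) for w in toy_x])
x_train, x_test, y_train, y_test = shuffle_and_split(padded, toy_y)
print(padded.shape, x_train.shape, x_test.shape)
```

Using a single permutation for both arrays keeps every sample paired with its label. `sklearn.model_selection.train_test_split` with `stratify=y` would be an alternative if the class proportions should be preserved in both sets.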