## analysis

NB: this notebook will probably not run as-is: the dataset files are not in the same directory, and you will need to install the required Python libraries.
It is kept as a record of how the models were trained and of the accuracies they reached.

In summary:
* Two classification models were trained on this dataset; both are XGBoost (extreme gradient boosting) models.
* A binary classification model classifies an event log as normal or attack.
* A multi-class classification model predicts the attack category of an event log: nine attack categories plus `Normal`.
* For training we used two CSV files: `UNSW_NB15_testing-set.csv` and `UNSW_NB15_training-set.csv`.
* The raw data (four CSV files) has some columns that are not present in the training data files.
* Each model has its own set of label encoders for the categorical columns `proto`, `service` and `state`.

Link to the full dataset (the CSV files folder contains our data):
https://unsw-my.sharepoint.com/:f:/g/personal/z5025758_ad_unsw_edu_au/EnuQZZn3XuNBjgfcUu4DIVMBLCHyoLHqOswirpOQifr1ag?e=gKWkLS

Research project page: https://research.unsw.edu.au/projects/unsw-nb15-dataset

### info about the raw data (files 1, 2, 3, 4)

In [1]:
# read csv file in a pandas dataframe
import pandas as pd

# the first row holds data, not column names, so read with header=None
df = pd.read_csv('../../../CSV_Files/UNSW-NB15_1.csv', header=None)
df.head()


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,39,40,41,42,43,44,45,46,47,48
0,59.166.0.0,1390,149.171.126.6,53,udp,CON,0.001055,132,164,31,...,0,3,7,1,3,1,1,1,,0
1,59.166.0.0,33661,149.171.126.9,1024,udp,CON,0.036133,528,304,31,...,0,2,4,2,3,1,1,2,,0
2,59.166.0.6,1464,149.171.126.7,53,udp,CON,0.001119,146,178,31,...,0,12,8,1,2,2,1,1,,0
3,59.166.0.5,3593,149.171.126.5,53,udp,CON,0.001209,132,164,31,...,0,6,9,1,1,1,1,1,,0
4,59.166.0.3,49664,149.171.126.0,53,udp,CON,0.001169,146,178,31,...,0,7,9,1,1,1,1,1,,0


In [2]:
print(df.shape)

(700001, 49)


In [3]:
# count how many distinct values are in the last column, and how much those values are repeated
value_counts = df[df.columns[-1]].value_counts()
print("Counts of each value in the last column:")
print(value_counts)

# same for the second last column
value_counts = df[df.columns[-2]].value_counts()
print("Counts of each value in the second last column:")
print(value_counts)

Counts of each value in the last column:
48
0    677786
1     22215
Name: count, dtype: int64
Counts of each value in the second last column:
47
Generic           7522
Exploits          5409
 Fuzzers          5051
Reconnaissance    1759
DoS               1167
Backdoors          534
Analysis           526
Shellcode          223
Worms               24
Name: count, dtype: int64


In [4]:
# how many 1s and 0s in the 4 files combined and the total
total_ones = 0
total_zeros = 0
for i in range(1, 5):
    df = pd.read_csv(f'UNSW-NB15_{i}.csv', header=None)
    value_counts = df[df.columns[-1]].value_counts()
    ones = value_counts.get(1, 0)
    zeros = value_counts.get(0, 0)
    total_ones += ones
    total_zeros += zeros
    print(f"File UNSW-NB15_{i}.csv: 1s = {ones}, 0s = {zeros}")

print(f"Total: 1s = {total_ones}, 0s = {total_zeros}")

File UNSW-NB15_1.csv: 1s = 22215, 0s = 677786
File UNSW-NB15_2.csv: 1s = 52749, 0s = 647252
File UNSW-NB15_3.csv: 1s = 157425, 0s = 542576
File UNSW-NB15_4.csv: 1s = 88894, 0s = 351150
Total: 1s = 321283, 0s = 2218764


NOTE: the 4 files each have 49 columns (including the target values).
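
Reading each raw file in full just to count labels uses a lot of memory. Below is a minimal sketch, assuming the same header-less layout with the label in the last column, that counts values chunk by chunk through the `chunksize` argument of `pd.read_csv`. The helper name `count_last_column` and the tiny in-memory demo data are illustrative and not part of the notebook.

```python
import io
from collections import Counter

import pandas as pd


def count_last_column(sources, chunksize=100_000):
    """Count the values found in the last column of several header-less CSV sources."""
    totals = Counter()
    for src in sources:
        # With chunksize set, read_csv returns an iterator of DataFrames
        for chunk in pd.read_csv(src, header=None, chunksize=chunksize):
            totals.update(chunk.iloc[:, -1].value_counts().to_dict())
    return totals


# Hypothetical stand-ins for the real files; file paths work the same way
demo_a = io.StringIO("tcp,10,0\nudp,20,1\ntcp,30,0\n")
demo_b = io.StringIO("udp,40,1\nudp,50,1\n")

label_counts = count_last_column([demo_a, demo_b], chunksize=2)
print(dict(label_counts))
```

With the real data, the list would hold the four `UNSW-NB15_{i}.csv` paths instead of the `StringIO` objects.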

### analysis of GT file

In [6]:
# read the file 'NUSW-NB15_GT.csv' which contains the mapping of attack categories
gt_df = pd.read_csv('NUSW-NB15_GT.csv')
print("GT file shape : ", gt_df.shape)

GT file shape :  (174347, 12)


## START of real work
### analysis of training and testing sets and preparing for training

In [5]:
# read from the 'Training_and_Testing_Sets' folder
train_set = pd.read_csv('../../../CSV_Files/Training_and_Testing_Sets/UNSW_NB15_training-set.csv')
test_set =  pd.read_csv('../../../CSV_Files/Training_and_Testing_Sets/UNSW_NB15_testing-set.csv')

print("shape of training set: ", train_set.shape)
print("shape of testing set: ", test_set.shape)

shape of training set:  (175341, 45)
shape of testing set:  (82332, 45)


In [7]:
train_set.head(20)

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,0.121478,tcp,-,FIN,6,4,258,172,74.08749,...,1,1,0,0,0,1,1,0,Normal,0
1,2,0.649902,tcp,-,FIN,14,38,734,42014,78.473372,...,1,2,0,0,0,1,6,0,Normal,0
2,3,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,...,1,3,0,0,0,2,6,0,Normal,0
3,4,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,...,1,3,1,1,0,2,1,0,Normal,0
4,5,0.449454,tcp,-,FIN,10,6,534,268,33.373826,...,1,40,0,0,0,2,39,0,Normal,0
5,6,0.380537,tcp,-,FIN,10,6,534,268,39.41798,...,1,40,0,0,0,2,39,0,Normal,0
6,7,0.637109,tcp,-,FIN,10,8,534,354,26.683033,...,1,40,0,0,0,1,39,0,Normal,0
7,8,0.521584,tcp,-,FIN,10,8,534,354,32.593026,...,1,40,0,0,0,3,39,0,Normal,0
8,9,0.542905,tcp,-,FIN,10,8,534,354,31.313031,...,1,40,0,0,0,3,39,0,Normal,0
9,10,0.258687,tcp,-,FIN,10,6,534,268,57.985135,...,1,40,0,0,0,3,39,0,Normal,0


In [7]:
test_set.head()

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0


In [8]:
print(train_set.dtypes)
test_set.head()

id                     int64
dur                  float64
proto                 object
service               object
state                 object
spkts                  int64
dpkts                  int64
sbytes                 int64
dbytes                 int64
rate                 float64
sttl                   int64
dttl                   int64
sload                float64
dload                float64
sloss                  int64
dloss                  int64
sinpkt               float64
dinpkt               float64
sjit                 float64
djit                 float64
swin                   int64
stcpb                  int64
dtcpb                  int64
dwin                   int64
tcprtt               float64
synack               float64
ackdat               float64
smean                  int64
dmean                  int64
trans_depth            int64
response_body_len      int64
ct_srv_src             int64
ct_state_ttl           int64
ct_dst_ltm             int64
ct_src_dport_l

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,1.1e-05,udp,-,INT,2,0,496,0,90909.0902,...,1,2,0,0,0,1,2,0,Normal,0
1,2,8e-06,udp,-,INT,2,0,1762,0,125000.0003,...,1,2,0,0,0,1,2,0,Normal,0
2,3,5e-06,udp,-,INT,2,0,1068,0,200000.0051,...,1,3,0,0,0,1,3,0,Normal,0
3,4,6e-06,udp,-,INT,2,0,900,0,166666.6608,...,1,3,0,0,0,2,3,0,Normal,0
4,5,1e-05,udp,-,INT,2,0,2126,0,100000.0025,...,1,3,0,0,0,2,3,0,Normal,0


In [13]:
# summarize the categorical column 'attack_cat' in the training set
attack_summary = train_set['attack_cat'].value_counts()
print(attack_summary)

# Normal => 0
# all other categories => 1

attack_cat
Normal            56000
Generic           40000
Exploits          33393
Fuzzers           18184
DoS               12264
Reconnaissance    10491
Analysis           2000
Backdoor           1746
Shellcode          1133
Worms               130
Name: count, dtype: int64


In [14]:
proto_summary = train_set['proto'].value_counts()
print(proto_summary)

proto
tcp       79946
udp       63283
unas      12084
arp        2859
ospf       2595
          ...  
argus        98
netblt       98
igmp         18
icmp         15
rtp           1
Name: count, Length: 133, dtype: int64


In [15]:
service_summary = train_set['service'].value_counts()
print(service_summary)

service
-           94168
dns         47294
http        18724
smtp         5058
ftp-data     3995
ftp          3428
ssh          1302
pop3         1105
dhcp           94
snmp           80
ssl            56
irc            25
radius         12
Name: count, dtype: int64


In [16]:
state_summary = train_set['state'].value_counts()
print(state_summary)

state
INT    82275
FIN    77825
CON    13152
REQ     1991
RST       83
ECO       12
PAR        1
URN        1
no         1
Name: count, dtype: int64


In [21]:
#Check for nan
import numpy as np

print(train_set.isna().sum())

id                   0
dur                  0
proto                0
service              0
state                0
spkts                0
dpkts                0
sbytes               0
dbytes               0
rate                 0
sttl                 0
dttl                 0
sload                0
dload                0
sloss                0
dloss                0
sinpkt               0
dinpkt               0
sjit                 0
djit                 0
swin                 0
stcpb                0
dtcpb                0
dwin                 0
tcprtt               0
synack               0
ackdat               0
smean                0
dmean                0
trans_depth          0
response_body_len    0
ct_srv_src           0
ct_state_ttl         0
ct_dst_ltm           0
ct_src_dport_ltm     0
ct_dst_sport_ltm     0
ct_dst_src_ltm       0
is_ftp_login         0
ct_ftp_cmd           0
ct_flw_http_mthd     0
ct_src_ltm           0
ct_srv_dst           0
is_sm_ips_ports      0
attack_cat 

TypeError: ufunc 'isinf' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
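
The `TypeError` above is what NumPy raises when `np.isinf` is applied to non-numeric (object) columns such as `proto`; the code that triggered it is not shown in the cell. Below is a minimal sketch, assuming the goal was to count infinite values, that restricts the check to numeric columns. The `demo` frame is a made-up stand-in for `train_set`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_set: two numeric columns and one object column
demo = pd.DataFrame({
    "dur": [0.1, np.inf, 0.3],
    "spkts": [1, 2, 3],
    "proto": ["tcp", "udp", "tcp"],
})

# Keep only numeric columns, since np.isinf cannot handle object dtype
numeric = demo.select_dtypes(include="number")

# Number of infinite values per numeric column
inf_counts = np.isinf(numeric).sum()
print(inf_counts)
```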

In [22]:
print(train_set.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 175341 entries, 0 to 175340
Data columns (total 45 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   id                 175341 non-null  int64  
 1   dur                175341 non-null  float64
 2   proto              175341 non-null  object 
 3   service            175341 non-null  object 
 4   state              175341 non-null  object 
 5   spkts              175341 non-null  int64  
 6   dpkts              175341 non-null  int64  
 7   sbytes             175341 non-null  int64  
 8   dbytes             175341 non-null  int64  
 9   rate               175341 non-null  float64
 10  sttl               175341 non-null  int64  
 11  dttl               175341 non-null  int64  
 12  sload              175341 non-null  float64
 13  dload              175341 non-null  float64
 14  sloss              175341 non-null  int64  
 15  dloss              175341 non-null  int64  
 16  si

## Preprocessing data

In [25]:
# prepare data for training an XGBoost model to classify 0s and 1s
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

# the split provided by the dataset creators isn't good, so we concatenate the two sets and split again
ml_dataset = pd.concat([train_set, test_set], ignore_index=True)
X = ml_dataset.drop(columns=["label", 'attack_cat', 'id'])
y = ml_dataset["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# work on copies so the column assignments below don't trigger SettingWithCopyWarning
X_train, X_test = X_train.copy(), X_test.copy()


## Encode categorical columns
categorical_cols = ['proto', 'service', 'state']
label_encoders = {}

for col in categorical_cols:
    le = LabelEncoder()
    le.fit(X_train[col])
    # Add 'Unknown' to the encoder's classes so unseen values can be encoded later
    le.classes_ = np.append(le.classes_, 'Unknown')
    X_train[col] = le.transform(X_train[col])
    # Map values not seen during fit to 'Unknown', then reuse the same encoder
    X_test[col] = X_test[col].where(X_test[col].isin(le.classes_), 'Unknown')
    X_test[col] = le.transform(X_test[col])
    label_encoders[col] = le  # Save for later use


# inspect the dtypes of the encoded features
print(X_train.dtypes)



ModuleNotFoundError: No module named 'sklearn'

In [10]:
X.head()

Unnamed: 0,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,sttl,...,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports
0,0.121478,tcp,-,FIN,6,4,258,172,74.08749,252,...,1,1,1,1,0,0,0,1,1,0
1,0.649902,tcp,-,FIN,14,38,734,42014,78.473372,62,...,1,1,1,2,0,0,0,1,6,0
2,1.623129,tcp,-,FIN,8,16,364,13186,14.170161,62,...,2,1,1,3,0,0,0,2,6,0
3,1.681642,tcp,ftp,FIN,12,12,628,770,13.677108,62,...,2,1,1,3,1,1,0,2,1,0
4,0.449454,tcp,-,FIN,10,6,534,268,33.373826,254,...,2,2,1,40,0,0,0,2,39,0


In [11]:
X_train.head()

Unnamed: 0,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,sttl,...,ct_dst_ltm,ct_src_dport_ltm,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports
184800,0.239865,113,0,4,10,6,822,268,62.535175,254,...,18,1,1,1,0,0,0,2,1,0
147134,9e-06,119,2,5,2,0,114,0,111111.1072,254,...,16,16,16,27,0,0,0,17,27,0
33634,4.110055,113,11,4,622,682,48672,85554,317.02739,31,...,2,1,1,1,0,0,0,4,1,0
110206,0.0,6,0,5,1,0,46,0,0.0,0,...,2,2,2,2,0,0,0,2,2,1
181332,3e-06,119,0,5,2,0,90,0,333333.3215,254,...,12,12,1,12,0,0,0,12,12,0


## Training xgboost classification model 
### Binary classification
0 or 1 => normal or attack

In [12]:
# Train the model
model = XGBClassifier()
model.fit(X_train, y_train)

# Evaluate the model
y_test_pred = model.predict(X_test)
print("Test accuracy: ", accuracy_score(y_test, y_test_pred))

Test accuracy:  0.9493354031240904


### test accuracy on csv files 1,2,3,4
The data in these files has to be processed first, because its columns are not the same as those of the training data.

In [26]:
ml_dataset = pd.concat([train_set, test_set], ignore_index=True)

# test accuracy on the raw csv files, which share one format; the plan was to concatenate them first
csv_file_1_set = pd.read_csv('../../../CSV_Files/UNSW-NB15_1.csv', header=None)
#csv_file_2_set = pd.read_csv('UNSW-NB15_2.csv', header=None)               # loading all four files at once needs too much memory, so only file 1 is used
#csv_file_3_set = pd.read_csv('UNSW-NB15_3.csv', header=None)
#csv_file_4_set = pd.read_csv('UNSW-NB15_4.csv', header=None)
#csv_files_set = pd.concat([csv_file_1_set, csv_file_2_set, csv_file_3_set, csv_file_4_set], ignore_index=True)
csv_files_set = csv_file_1_set
print("shape of all csv files after concatenation: ", csv_files_set.shape)


## read the column names from the features csv file
features_pd = pd.read_csv('../../../CSV_Files/NUSW-NB15_features.csv', encoding='utf-8', encoding_errors='ignore')
## take the column 'Name'
feature_names = features_pd['Name'].tolist()
# adapt the names to the training-set names: first lower-case all of them
feature_names = [name.lower() for name in feature_names]
# then rename: sintpkt => sinpkt, dintpkt => dinpkt, smeansz => smean, dmeansz => dmean, res_bdy_len => response_body_len
feature_names = [name.replace('sintpkt', 'sinpkt').replace('dintpkt', 'dinpkt').replace('smeansz', 'smean').replace('dmeansz', 'dmean').replace('res_bdy_len', 'response_body_len') for name in feature_names]
csv_files_set.columns = feature_names

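# keep only the columns that also exist in the training set, and list the dropped ones separately
# NOTE: the features file spells one name 'ct_src_ ltm' (with a space), so that column is not matched and gets dropped too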
columns_to_drop = [col for col in csv_files_set.columns if col.lower() not in ml_dataset.columns.str.lower()]
columns_to_keep = [col for col in csv_files_set.columns if col.lower() in ml_dataset.columns.str.lower()]
csv_files_set = csv_files_set[columns_to_keep]
print("shape of csv file 1 set after keeping only relevant columns: ", csv_files_set.shape)
print("columns dropped: ", columns_to_drop)
print("columns kept: ", len(columns_to_keep))
print("train set columns: ", len(list(ml_dataset.columns)))
# list of columns in train set and not in csv file 1 set
missing_columns = [col for col in ml_dataset.columns if col.lower() not in csv_files_set.columns.str.lower()]
print("missing columns in csv file 1 set: ", missing_columns)

print("--- adding rate column with default value nan ---")
csv_files_set['rate'] = np.nan

# reorder the columns to match the training set; columns that are still missing ('id', 'ct_src_ltm') are filled with 0
csv_files_set = csv_files_set.reindex(columns=ml_dataset.columns, fill_value=0)
print("shape of csv file 1 set after reindexing: ", csv_files_set.shape)
csv_files_set.head()





shape of all csv files after concatenation:  (700001, 49)
shape of csv file 1 set after keeping only relevant columns:  (700001, 42)
columns dropped:  ['srcip', 'sport', 'dstip', 'dsport', 'stime', 'ltime', 'ct_src_ ltm']
columns kept:  42
train set columns:  45
missing columns in csv file 1 set:  ['id', 'rate', 'ct_src_ltm']
--- adding rate column with default value nan ---
shape of csv file 1 set after reindexing:  (700001, 45)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  csv_files_set['rate'] = np.nan


Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,0,0.001055,udp,dns,CON,2,2,132,164,,...,1,1,0,0,0,0,7,0,,0
1,0,0.036133,udp,-,CON,4,4,528,304,,...,1,2,0,0,0,0,4,0,,0
2,0,0.001119,udp,dns,CON,2,2,146,178,,...,1,1,0,0,0,0,8,0,,0
3,0,0.001209,udp,dns,CON,2,2,132,164,,...,1,1,0,0,0,0,9,0,,0
4,0,0.001169,udp,dns,CON,2,2,146,178,,...,1,1,0,0,0,0,9,0,,0
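
The output above lists `'ct_src_ ltm'` among the dropped columns and `'ct_src_ltm'` among the missing ones, so the real values of that feature are replaced by 0 after reindexing. Below is a minimal sketch of a name-normalization helper that also removes the stray space. It assumes no raw column name legitimately contains a space, and it uses exact-match renames instead of the substring `replace` calls in the cell; `normalize_feature_name` and `RENAMES` are illustrative names.

```python
# Renames taken from the preprocessing cell (raw name -> training-set name)
RENAMES = {
    "sintpkt": "sinpkt",
    "dintpkt": "dinpkt",
    "smeansz": "smean",
    "dmeansz": "dmean",
    "res_bdy_len": "response_body_len",
}


def normalize_feature_name(name):
    """Lower-case a raw feature name, drop spaces, and apply the known renames."""
    cleaned = name.strip().lower().replace(" ", "")
    return RENAMES.get(cleaned, cleaned)


# A few names as they could appear in the features file
raw_names = ["Sintpkt", "Dintpkt", "smeansz", "dmeansz", "res_bdy_len", "ct_src_ ltm", "Label"]
normalized = [normalize_feature_name(n) for n in raw_names]
print(normalized)
```

Note that the accuracies reported in this notebook were measured with `ct_src_ltm` set to 0 for the raw file, so they may change once the column is matched.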


In [27]:
# save csv_files_set to a csv file to see it later
csv_files_set.to_csv('processed_csv_file_1_set.csv', index=False)

In [14]:
ml_dataset.head(1)

Unnamed: 0,id,dur,proto,service,state,spkts,dpkts,sbytes,dbytes,rate,...,ct_dst_sport_ltm,ct_dst_src_ltm,is_ftp_login,ct_ftp_cmd,ct_flw_http_mthd,ct_src_ltm,ct_srv_dst,is_sm_ips_ports,attack_cat,label
0,1,0.121478,tcp,-,FIN,6,4,258,172,74.08749,...,1,1,0,0,0,1,1,0,Normal,0


In [15]:
X_csv_test = csv_files_set.drop(columns=["attack_cat", "label", "id"])
y_csv_test = csv_files_set["label"]

# Encode categorical columns in csv_test_set
for col in categorical_cols:
    le = label_encoders[col]
    # Map unseen test values to 'Unknown' before transforming
    X_csv_test[col] = X_csv_test[col].apply(
        lambda x: x if x in le.classes_ else 'Unknown'
    )
    X_csv_test[col] = le.transform(X_csv_test[col])  # Use the saved encoder

y_csv_test_pred = model.predict(X_csv_test)
print("CSV file test accuracy: ", accuracy_score(y_csv_test, y_csv_test_pred))

CSV file test accuracy:  0.9944014365693763
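
Accuracy on file 1 needs context: the file holds 677786 normal rows out of 700001, so always predicting 0 would already score about 96.8%. Below is a minimal sketch, using plain pandas and made-up labels, that reports this majority-class baseline next to precision and recall for the attack class. `binary_report` is an illustrative helper, not something the notebook defines.

```python
import pandas as pd


def binary_report(y_true, y_pred):
    """Accuracy, majority-class baseline, and precision/recall for class 1."""
    y_true = pd.Series(list(y_true))
    y_pred = pd.Series(list(y_pred))
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    tn = int(((y_true == 0) & (y_pred == 0)).sum())
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        # Score of a model that always predicts the most frequent class
        "majority_baseline": max(tp + fn, tn + fp) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 8 normal rows, 2 attack rows
demo_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
demo_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]
report = binary_report(demo_true, demo_pred)
print(report)
```

In the notebook this would be called as `binary_report(y_csv_test, y_csv_test_pred)`.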


### saving the model in a .pkl file

In [18]:
# save the model in a file to be imported later
import joblib
joblib.dump(model, './pretrained_models/xgboost_unsw_nb15_model_binary_class.pkl')

# now load the model from the file and test it again
loaded_model = joblib.load('./pretrained_models/xgboost_unsw_nb15_model_binary_class.pkl')
y_csv_test_pred_loaded = loaded_model.predict(X_csv_test)
print("CSV file test accuracy with loaded model: ", accuracy_score(y_csv_test, y_csv_test_pred_loaded))


CSV file test accuracy with loaded model:  0.9944014365693763


## multi class classification
### train model

In [31]:
# train a second XGBoost model for multi-class classification
ml_dataset_multi = pd.concat([train_set, test_set], ignore_index=True)
X_multi = ml_dataset_multi.drop(columns=["label", 'attack_cat', 'id'])
y_multi = ml_dataset_multi["attack_cat"]

# the y labels are encoded as integers with an explicit dict (below)
# rather than with a LabelEncoder (kept here, commented out)
#le_y_multi = LabelEncoder()
#y_multi = le_y_multi.fit_transform(y_multi)

label_mapping = {
    'Normal': 0,
    'Generic': 1,
    'Exploits': 2,
    'Fuzzers': 3,
    'DoS': 4,
    'Reconnaissance': 5,
    'Analysis': 6,
    'Backdoor': 7,
    'Shellcode': 8,
    'Worms': 9
}
y_multi = y_multi.map(label_mapping)

X_multi_train, X_multi_test, y_multi_train, y_multi_test = train_test_split(X_multi, y_multi, test_size=0.2, random_state=42)

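# work on copies so the column assignments below don't trigger SettingWithCopyWarning
X_multi_train, X_multi_test = X_multi_train.copy(), X_multi_test.copy()

## Encode categorical columns
categorical_cols_multi = ['proto', 'service', 'state']
label_encoders_multi = {}
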
for col in categorical_cols_multi:
    le_multi = LabelEncoder()
    le_multi.fit(X_multi_train[col])
    # Add 'Unknown' to the encoder's classes
    le_multi.classes_ = np.append(le_multi.classes_, 'Unknown')
    X_multi_train[col] = le_multi.transform(X_multi_train[col])
    X_multi_test[col] = X_multi_test[col].apply(
        lambda x: x if x in le_multi.classes_ else 'Unknown'
    )
    X_multi_test[col] = le_multi.transform(X_multi_test[col])  # Use the same encoder for test data
    label_encoders_multi[col] = le_multi  # Save for later use


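# XGBClassifier needs integer class labels, hence label_mapping above; fitting on the raw strings raised:
# ValueError: Invalid classes inferred from unique values of `y`.  Expected: [0 1 2 3 4 5 6 7 8 9], got ['Analysis' 'Backdoor' 'DoS' 'Exploits' 'Fuzzers' 'Generic' 'Normal' 'Reconnaissance' 'Shellcode' 'Worms']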

# Train the model
model_multi = XGBClassifier()
model_multi.fit(X_multi_train, y_multi_train)
# Evaluate the model
y_multi_test_pred = model_multi.predict(X_multi_test)
print("Multi class Test accuracy: ", accuracy_score(y_multi_test, y_multi_test_pred))


Multi class Test accuracy:  0.8362860192102455
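
The classes are very unbalanced (56000 `Normal` rows against 130 `Worms` rows in the training set), so overall accuracy says little about the rare categories. Below is a minimal sketch, using plain pandas and made-up predictions, that computes recall per class. `per_class_recall` is an illustrative helper and not part of the notebook.

```python
import pandas as pd


def per_class_recall(y_true, y_pred):
    """Fraction of rows of each true class that were predicted correctly."""
    frame = pd.DataFrame({"true": list(y_true), "pred": list(y_pred)})
    frame["hit"] = frame["true"] == frame["pred"]
    return frame.groupby("true")["hit"].mean()


# Hypothetical labels using the integer codes of label_mapping (0 = Normal, 9 = Worms)
demo_true = [0, 0, 0, 0, 9, 9]
demo_pred = [0, 0, 0, 0, 0, 9]

recall = per_class_recall(demo_true, demo_pred)
overall_accuracy = sum(t == p for t, p in zip(demo_true, demo_pred)) / len(demo_true)
print(recall)
print("accuracy:", overall_accuracy)
```

In this toy case the accuracy is 5/6 while recall for class 9 is only 0.5, which is the kind of gap a single accuracy figure hides. In the notebook the call would be `per_class_recall(y_multi_test, y_multi_test_pred)`.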


### test model on file 1 csv

In [None]:
X_csv_test = csv_files_set.drop(columns=["attack_cat", "label", "id"])
y_csv_test = csv_files_set["attack_cat"]

# NaN in attack_cat means the row is normal traffic, so fill with "Normal"
y_csv_test.fillna("Normal", inplace=True)
# convert the category names to integers using label_mapping
y_csv_test = y_csv_test.map(label_mapping)

# .map() yields NaN for every name that is not a key of label_mapping. In file 1 these are
# ' Fuzzers' (stored with extra whitespace, 5051 rows) and 'Backdoors' (534 rows): 5585 rows in total.
# NOTE: the fallback below labels those attack rows as Normal, which distorts the accuracy.
print("Number of nan values in y_csv_test: ", y_csv_test.isna().sum())
if y_csv_test.isna().sum() > 0:
    print("There are nan values in y_csv_test, please check the data.")
    # replace the remaining nan values with the integer code of "Normal"
    y_csv_test.fillna(label_mapping['Normal'], inplace=True)




# Encode categorical columns in csv_test_set
for col in categorical_cols_multi:
    le_multi = label_encoders_multi[col]
    # Map unseen test values to 'Unknown' before transforming
    X_csv_test[col] = X_csv_test[col].apply(
        lambda x: x if x in le_multi.classes_ else 'Unknown'
    )
    X_csv_test[col] = le_multi.transform(X_csv_test[col])  # Use the saved encoder

y_csv_test_pred = model_multi.predict(X_csv_test)
print("CSV file test accuracy: ", accuracy_score(y_csv_test, y_csv_test_pred))

Number of nan values in y_csv_test:  5585
There are nan values in y_csv_test, please check the data.
CSV file test accuracy:  0.9832500239285372
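
The 5585 NaN values reported above match the raw-file categories whose spelling differs from the training set: `' Fuzzers'` (5051 rows) and `'Backdoors'` (534 rows) in the value counts of file 1. Below is a minimal sketch, assuming whitespace and the `Backdoors`/`Backdoor` spelling are the only differences, that normalizes the names before mapping so those rows keep their attack label. `normalize_attack_cat` and `ALIASES` are illustrative names.

```python
import pandas as pd

label_mapping = {
    'Normal': 0, 'Generic': 1, 'Exploits': 2, 'Fuzzers': 3, 'DoS': 4,
    'Reconnaissance': 5, 'Analysis': 6, 'Backdoor': 7, 'Shellcode': 8, 'Worms': 9,
}

# Raw-file spelling -> training-set spelling (assumed to be the only alias needed)
ALIASES = {"Backdoors": "Backdoor"}


def normalize_attack_cat(series):
    """Fill missing categories with 'Normal', strip whitespace, apply aliases."""
    cleaned = series.fillna("Normal").astype(str).str.strip()
    return cleaned.replace(ALIASES)


# Hypothetical sample of raw attack_cat values
raw = pd.Series([None, " Fuzzers", "Backdoors", "Generic", " Fuzzers "])
y_demo = normalize_attack_cat(raw).map(label_mapping)
print(y_demo.tolist())  # [0, 3, 7, 1, 3]
print("NaN left:", int(y_demo.isna().sum()))  # NaN left: 0
```

Any name that is still unmapped after this step would show up as NaN again, so the `isna()` check remains useful as a guard.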


In [36]:
# export the multi class model too
import joblib
joblib.dump(model_multi, './pretrained_models/xgboost_unsw_nb15_model_multi_class.pkl')

['./pretrained_models/xgboost_unsw_nb15_model_multi_class.pkl']

#### exporting encoders too

In [38]:
# export encoders too
import joblib
joblib.dump(label_encoders, './pretrained_models/label_encoders_binary_class.pkl')
joblib.dump(label_encoders_multi, './pretrained_models/label_encoders_multi_class.pkl')

# import encoders
label_encoders_import = joblib.load('./pretrained_models/label_encoders_binary_class.pkl')
label_encoders_multi_import = joblib.load('./pretrained_models/label_encoders_multi_class.pkl')
# see them
print(label_encoders_import)

# use them to test again on csv files
X_csv_test = csv_files_set.drop(columns=["attack_cat", "label", "id"])
y_csv_test = csv_files_set["attack_cat"]

# NaN in attack_cat means the row is normal traffic, so fill with "Normal"
y_csv_test.fillna("Normal", inplace=True)
# convert the category names to integers using label_mapping
y_csv_test = y_csv_test.map(label_mapping)

# .map() yields NaN for every name that is not a key of label_mapping. In file 1 these are
# ' Fuzzers' (stored with extra whitespace, 5051 rows) and 'Backdoors' (534 rows): 5585 rows in total.
# NOTE: the fallback below labels those attack rows as Normal, which distorts the accuracy.
print("Number of nan values in y_csv_test: ", y_csv_test.isna().sum())
if y_csv_test.isna().sum() > 0:
    print("There are nan values in y_csv_test, please check the data.")
    # replace the remaining nan values with the integer code of "Normal"
    y_csv_test.fillna(label_mapping['Normal'], inplace=True)




# Encode categorical columns in csv_test_set
for col in categorical_cols_multi:
    le_multi = label_encoders_multi_import[col]
    # Map unseen test values to 'Unknown' before transforming
    X_csv_test[col] = X_csv_test[col].apply(
        lambda x: x if x in le_multi.classes_ else 'Unknown'
    )
    X_csv_test[col] = le_multi.transform(X_csv_test[col])  # Use the saved encoder

y_csv_test_pred = model_multi.predict(X_csv_test)
print("CSV file test accuracy: ", accuracy_score(y_csv_test, y_csv_test_pred))


{'proto': LabelEncoder(), 'service': LabelEncoder(), 'state': LabelEncoder()}
Number of nan values in y_csv_test:  5585
There are nan values in y_csv_test, please check the data.
CSV file test accuracy:  0.9832500239285372


### some Docs

Columns used by the encoders:
`categorical_cols_multi = ['proto', 'service', 'state']`

Import the encoders using:
```python
# import encoders
label_encoders_import = joblib.load('./pretrained_models/label_encoders_binary_class.pkl')
label_encoders_multi_import = joblib.load('./pretrained_models/label_encoders_multi_class.pkl')
```

and use them like this:
```python
# Encode categorical columns in csv_test_set
for col in categorical_cols_multi:
    le_multi = label_encoders_multi_import[col]
    # Map unseen test values to 'Unknown' before transforming
    X_csv_test[col] = X_csv_test[col].apply(
        lambda x: x if x in le_multi.classes_ else 'Unknown'
    )
    X_csv_test[col] = le_multi.transform(X_csv_test[col])  # Use the saved encoder
```
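
The `apply` with a lambda checks membership row by row, which gets slow on a file with 700001 rows. Below is a minimal sketch of a vectorized variant based on `Series.isin` and `Series.where`. It assumes encoders built as in this notebook (a fitted `LabelEncoder` with `'Unknown'` appended to `classes_`); `encode_column` and the demo encoder are illustrative and not part of the saved artifacts.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_column(values, encoder, unknown="Unknown"):
    """Encode a Series with a fitted encoder, sending unseen values to `unknown`."""
    known = values.isin(encoder.classes_)
    return encoder.transform(values.where(known, unknown))


# Hypothetical stand-in for one of the saved encoders
demo_encoder = LabelEncoder().fit(pd.Series(["tcp", "udp", "arp"], dtype=object))
demo_encoder.classes_ = np.append(demo_encoder.classes_, "Unknown")

demo_codes = encode_column(pd.Series(["udp", "tcp", "quic"], dtype=object), demo_encoder)
print(demo_codes)  # [2 1 3]
```

With the imported encoders, the loop body would become `X_csv_test[col] = encode_column(X_csv_test[col], label_encoders_multi_import[col])`.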


Columns the models were trained on:

In [41]:
print("Columns in X_multi_train (training data of multi-class model): ", X_multi_train.columns)
print("Columns in X_train  (training data of binary-class model): ", X_train.columns)
print("Columns in the data (not the exact columns that the model trained on): ", ml_dataset.columns)


Columns in X_multi_train (training data of multi-class model):  Index(['dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
       'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss',
       'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin',
       'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
       'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm',
       'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm',
       'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm',
       'ct_srv_dst', 'is_sm_ips_ports'],
      dtype='object')
Columns in X_train  (training data of binary-class model):  Index(['dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
       'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss',
       'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin',
       'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
       'respons

NB: the data in the four raw CSV files uses the column names listed in the features CSV file, and these need a few small changes to match the training-data columns. Apply the following preprocessing before sending the data to model inference:

```python
# lower-case all of the names first
feature_names = [name.lower() for name in feature_names]

# then rename: sintpkt => sinpkt, dintpkt => dinpkt, smeansz => smean, dmeansz => dmean, res_bdy_len => response_body_len
feature_names = [name.replace('sintpkt', 'sinpkt').replace('dintpkt', 'dinpkt').replace('smeansz', 'smean').replace('dmeansz', 'dmean').replace('res_bdy_len', 'response_body_len') for name in feature_names]

csv_files_set.columns = feature_names

# add the 'rate' column, which the raw files lack, with NaN as the default value
csv_files_set['rate'] = np.nan

# reorder the columns to match the training set (columns that are still missing are filled with 0)
csv_files_set = csv_files_set.reindex(columns=ml_dataset.columns, fill_value=0)
```