UNSW-NB15 dataset

Explanation of Key Features:



* **id**: A unique identifier for each record.
* **dur** (duration): Total duration of the record, in seconds.
* **proto** (protocol): The transaction protocol used for the connection (e.g., TCP, UDP, ICMP).
* **service**: Application-layer service involved (e.g., HTTP, FTP, SMTP); '-' marks records with no identified service.
* **state**: State of the connection (e.g., FIN, CON, INT).
* **spkts** (source packets): Number of packets sent from the source to the destination.
* **dpkts** (destination packets): Number of packets sent from the destination to the source.
* **sbytes** (source bytes): Number of bytes sent from the source to the destination.
* **dbytes** (destination bytes): Number of bytes sent from the destination to the source.
* **rate**: Flow rate, in packets per second.
* **sttl** (source time-to-live): The TTL value of packets sent from the source to the destination.
* **dttl** (destination time-to-live): The TTL value of packets sent from the destination to the source.
* **sload** (source load): Source throughput, in bits per second.
* **dload** (destination load): Destination throughput, in bits per second.
* **sloss** (source loss): Number of source packets retransmitted or dropped.
* **dloss** (destination loss): Number of destination packets retransmitted or dropped.
* **sinpkt** (source inter-packet arrival time): Time between packets sent by the source, in milliseconds.
* **dinpkt** (destination inter-packet arrival time): Time between packets sent by the destination, in milliseconds.
* **sjit** (source jitter): Jitter in the packet flow from the source, in milliseconds.
* **djit** (destination jitter): Jitter in the packet flow from the destination, in milliseconds.
* **swin** (source window size): The source's TCP window advertisement value.
* **stcpb** (source TCP base sequence number): TCP base sequence number of the source.
* **dtcpb** (destination TCP base sequence number): TCP base sequence number of the destination.
* **dwin** (destination window size): The destination's TCP window advertisement value.
* **tcprtt** (TCP round-trip time): TCP connection setup round-trip time, the sum of synack and ackdat.
* **synack**: TCP connection setup time between the SYN and SYN-ACK packets.
* **ackdat**: TCP connection setup time between the SYN-ACK and ACK packets.
* **smean** (source mean): Mean size of the packets transmitted by the source.
* **dmean** (destination mean): Mean size of the packets transmitted by the destination.
* **trans_depth**: Pipelined depth of the HTTP request/response transaction within the connection.
* **response_body_len**: Uncompressed size of the content transferred from the server's HTTP service.
* **ct_srv_src**: Number of connections with the same service and source address among the last 100 connections.
* **ct_state_ttl**: Count for each state according to specific ranges of source/destination TTL values.
* **ct_dst_ltm**: Number of connections with the same destination address among the last 100 connections.
* **ct_src_dport_ltm**: Number of connections with the same source address and destination port among the last 100 connections.
* **ct_dst_sport_ltm**: Number of connections with the same destination address and source port among the last 100 connections.
* **ct_dst_src_ltm**: Number of connections with the same source and destination addresses among the last 100 connections.
* **is_ftp_login**: Binary indicator: 1 if the FTP session was accessed with a user name and password, 0 otherwise.
* **ct_ftp_cmd**: Number of flows that contain a command in an FTP session.
* **ct_flw_http_mthd**: Number of flows that use methods such as GET and POST in the HTTP service.
* **ct_src_ltm**: Number of connections with the same source address among the last 100 connections.
* **ct_srv_dst**: Number of connections with the same service and destination address among the last 100 connections.
* **is_sm_ips_ports**: Binary indicator: 1 if the source and destination IP addresses are equal and the port numbers are equal, 0 otherwise.
* **attack_cat**: The category of the attack (e.g., DoS, Exploits, Reconnaissance), or Normal for benign traffic.
* **label**: Binary classification label (0 for normal traffic, 1 for malicious traffic).

In [1]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [16]:
# Paths to the CSV files in the 'datasets' folder
training_set_path = '/content/drive/My Drive/datasets/UNSW_NB15_training-set.csv'
testing_set_path = '/content/drive/My Drive/datasets/UNSW_NB15_testing-set.csv'



In [17]:
import pandas as pd

# Load the training and testing CSV files
training_df = pd.read_csv(training_set_path)
testing_df = pd.read_csv(testing_set_path)


# Display the first few rows of each dataframe to ensure they loaded correctly
print("Training Set:")
print(training_df.head())

print("\nTesting Set:")
print(testing_df.head())



Training Set:
   id       dur proto service state  spkts  dpkts  sbytes  dbytes       rate  \
0   1  0.121478   tcp       -   FIN      6      4     258     172  74.087490   
1   2  0.649902   tcp       -   FIN     14     38     734   42014  78.473372   
2   3  1.623129   tcp       -   FIN      8     16     364   13186  14.170161   
3   4  1.681642   tcp     ftp   FIN     12     12     628     770  13.677108   
4   5  0.449454   tcp       -   FIN     10      6     534     268  33.373826   

   ...  ct_dst_sport_ltm  ct_dst_src_ltm  is_ftp_login  ct_ftp_cmd  \
0  ...                 1               1             0           0   
1  ...                 1               2             0           0   
2  ...                 1               3             0           0   
3  ...                 1               3             1           1   
4  ...                 1              40             0           0   

   ct_flw_http_mthd  ct_src_ltm  ct_srv_dst  is_sm_ips_ports  attack_cat  \
0       
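
The feature list describes rate only loosely, so it helps to check what it measures against the rows printed above. The sketch below copies dur, spkts, dpkts and rate from the five displayed training rows and compares rate with (spkts + dpkts - 1) / dur. That formula is an observation drawn from these five rows, not a documented definition, and the names head_rows, rate_check and abs_diff are illustrative.

```python
import pandas as pd

# The five training rows displayed above (values copied from the printed output).
head_rows = pd.DataFrame({
    "dur": [0.121478, 0.649902, 1.623129, 1.681642, 0.449454],
    "spkts": [6, 14, 8, 12, 10],
    "dpkts": [4, 38, 16, 12, 6],
    "rate": [74.087490, 78.473372, 14.170161, 13.677108, 33.373826],
})

# Candidate formula (an assumption inferred from the data, not an official definition):
# packets exchanged, minus one, divided by the duration in seconds.
head_rows["rate_check"] = (head_rows["spkts"] + head_rows["dpkts"] - 1) / head_rows["dur"]

# Absolute gap between the stored rate and the recomputed value.
head_rows["abs_diff"] = (head_rows["rate"] - head_rows["rate_check"]).abs()

print(head_rows)
```

If the gaps stay near zero on the full training set as well, rate can be read as a packets-per-second measure derived from columns that are already present.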

The dataset combines numerical and categorical features that describe network connections.
Two target columns, attack_cat and label, separate normal traffic from anomalous or malicious traffic.
The label column drives binary classification: 0 marks normal traffic and 1 marks an attack.
The attack_cat column names the type of attack, so it can serve as the target for multi-class classification.
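
A quick way to see this split between feature types and targets is to list the non-numeric columns and count the classes in each target. The sketch below runs on a small hypothetical frame (toy_df) so that it is self-contained; the helper summarize_targets is illustrative, and in the notebook it would be called on training_df instead.

```python
import pandas as pd

# Hypothetical stand-in rows; in the notebook, pass training_df instead.
toy_df = pd.DataFrame({
    "proto": ["tcp", "udp", "tcp", "tcp"],
    "sbytes": [258, 734, 364, 628],
    "attack_cat": ["Normal", "DoS", "Normal", "Exploits"],
    "label": [0, 1, 0, 1],
})

def summarize_targets(df):
    """Return categorical columns, numeric columns, label shares and attack counts."""
    categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
    numeric_cols = df.select_dtypes(include="number").columns.tolist()
    # Share of each binary class (0 = normal, 1 = attack).
    label_share = df["label"].value_counts(normalize=True).sort_index()
    # Number of records per attack category.
    attack_counts = df["attack_cat"].value_counts()
    return categorical_cols, numeric_cols, label_share, attack_counts

categorical_cols, numeric_cols, label_share, attack_counts = summarize_targets(toy_df)
print("Categorical:", categorical_cols)
print("Numeric:", numeric_cols)
print(label_share)
print(attack_counts)
```

Checking the class shares early is useful because a skewed label distribution affects how accuracy, precision and recall should be read later on.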

In [13]:
# Here we compute the correlation matrix, excluding the non-numeric columns.

# Select only numeric columns and drop the 'id' column
numeric_df = training_df.select_dtypes(include=['number']).drop(columns=['id'])

# Calculate the correlation matrix
correlation_matrix = numeric_df.corr()

# Extract correlation scores for the target variable (assuming the target column is 'label')
target_correlations = correlation_matrix['label'].sort_values(ascending=False)

# Create a DataFrame to display the results as a table
correlation_table = pd.DataFrame({
    'Feature': target_correlations.index,
    'Correlation Score': target_correlations.values
})

# Exclude the target variable itself from the table
correlation_table = correlation_table[correlation_table['Feature'] != 'label']

# Add a ranking column
correlation_table['Rank'] = range(1, len(correlation_table) + 1)

# Display the correlation table
print(correlation_table)

# Optionally, display the top 10 features
print("\nTop 10 Features by Correlation Score:")
print(correlation_table.head(10))

              Feature  Correlation Score  Rank
1                sttl           0.692741     1
2        ct_state_ttl           0.577704     2
3    ct_dst_sport_ltm           0.357213     3
4                rate           0.337979     4
5    ct_src_dport_ltm           0.305579     5
6      ct_dst_src_ltm           0.303855     6
7          ct_src_ltm           0.238225     7
8          ct_dst_ltm           0.229887     8
9          ct_srv_src           0.229044     9
10         ct_srv_dst           0.228046    10
11              sload           0.182870    11
12             ackdat           0.097364    12
13               dttl           0.095049    13
14             tcprtt           0.081584    14
15             synack           0.058299    15
16                dur           0.036175    16
17             sbytes           0.018576    17
18   ct_flw_http_mthd           0.015800    18
19        trans_depth           0.010801    19
20              sloss          -0.000640    20
21           
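
The table above is sorted by signed correlation, so a feature with a strong negative correlation ends up at the bottom of the ranking even though it may be as informative as a strongly positive one. A possible alternative, sketched here on a hypothetical toy frame, ranks by absolute value instead. The helper rank_by_abs_correlation and the columns pos, neg and noise are illustrative names, not part of the notebook.

```python
import pandas as pd

def rank_by_abs_correlation(df, target="label"):
    """Rank numeric features by the absolute value of their correlation with target."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    table = pd.DataFrame({"Feature": corr.index, "Correlation Score": corr.values})
    table["Abs Score"] = table["Correlation Score"].abs()
    table = table.sort_values("Abs Score", ascending=False).reset_index(drop=True)
    table["Rank"] = range(1, len(table) + 1)
    return table

# Toy data: 'neg' is perfectly anti-correlated with the label,
# 'pos' is positively correlated, 'noise' is only weakly related.
toy = pd.DataFrame({
    "pos": [1, 2, 3, 4, 5, 6],
    "neg": [5, 5, 5, 1, 1, 1],
    "noise": [1, 0, 1, 0, 1, 0],
    "label": [0, 0, 0, 1, 1, 1],
})

abs_table = rank_by_abs_correlation(toy)
print(abs_table)
```

With signed sorting, neg would be ranked last; with absolute sorting it comes first, which is usually what feature selection needs.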

Step-by-Step Guide to Implement the Analysis:
Preprocess the Data:

* Select the top n features according to their correlation scores.
* Encode categorical features (if any) and scale numerical features for consistency.
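
The preprocessing steps above can be sketched with a scikit-learn ColumnTransformer. This is a minimal illustration on hypothetical toy frames (train, test), not the notebook's own implementation: it encodes the categorical column with OrdinalEncoder, maps categories unseen during fitting to -1, and standardizes only the numeric columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data using three of the feature names from the dataset.
train = pd.DataFrame({
    "sttl": [254, 31, 254, 62],
    "state": ["FIN", "INT", "FIN", "CON"],
    "rate": [74.1, 78.5, 14.2, 13.7],
})
test = pd.DataFrame({
    "sttl": [254, 31],
    "state": ["FIN", "RST"],  # 'RST' does not occur in the training rows
    "rate": [33.4, 50.0],
})

categorical = ["state"]
numeric = ["sttl", "rate"]

preprocess = ColumnTransformer([
    # Integer codes for categories; unseen test categories become -1.
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), categorical),
    # Zero mean and unit variance, with statistics taken from the training data.
    ("num", StandardScaler(), numeric),
])

X_train_demo = preprocess.fit_transform(train)  # fit on training data only
X_test_demo = preprocess.transform(test)        # reuse the fitted statistics

print(X_train_demo)
print(X_test_demo)
```

Fitting on the training frame and only transforming the test frame keeps test-set information out of the encoder and the scaler. Note that this sketch leaves the encoded category codes unscaled, which is a design choice rather than a requirement.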


In [21]:

# Repeat the same analysis, this time including the categorical values. The training set has three columns
# with categorical values: proto, state and attack_cat
import pandas as pd
from sklearn.preprocessing import LabelEncoder



# Print original columns to ensure non-numeric columns are included
print("Original Columns:", training_df.columns)

# Work on a copy so the raw training_df stays intact for the later cells
encoded_df = training_df.copy()

# Label-encode the three categorical columns (proto, state, attack_cat)
# so that they survive the numeric conversion below
for col in ['proto', 'state', 'attack_cat']:
    encoded_df[col] = LabelEncoder().fit_transform(encoded_df[col].astype(str))

# Convert every column to numeric; any remaining strings (the 'service' column,
# including its '-' placeholder) are coerced to NaN
encoded_df = encoded_df.apply(pd.to_numeric, errors='coerce')

# Count the NaN values left in each column after the conversion
print("Number of non-numeric values per column:")
print(encoded_df.isna().sum())

# Fill or drop NaNs as needed (example: fill with 0 or drop rows with NaNs)
# encoded_df.fillna(0, inplace=True)  # Option 1: Fill NaNs with 0
# encoded_df.dropna(inplace=True)     # Option 2: Drop rows with NaNs

# Proceed with correlation analysis

# Drop the 'id' column before calculating the correlation matrix
training_df_no_id = encoded_df.drop(columns=['id'])

# Calculate the correlation matrix
correlation_matrix = training_df_no_id.corr()

target_correlations = correlation_matrix['label'].sort_values(ascending=False)

# Create a DataFrame for correlation results
correlation_table = pd.DataFrame({
    'Feature': target_correlations.index,
    'Correlation Score': target_correlations.values
})

correlation_table = correlation_table[correlation_table['Feature'] != 'label']
correlation_table['Rank'] = range(1, len(correlation_table) + 1)

# Display the correlation table
print(correlation_table.head(10))



Original Columns: Index(['id', 'dur', 'proto', 'service', 'state', 'spkts', 'dpkts', 'sbytes',
       'dbytes', 'rate', 'sttl', 'dttl', 'sload', 'dload', 'sloss', 'dloss',
       'sinpkt', 'dinpkt', 'sjit', 'djit', 'swin', 'stcpb', 'dtcpb', 'dwin',
       'tcprtt', 'synack', 'ackdat', 'smean', 'dmean', 'trans_depth',
       'response_body_len', 'ct_srv_src', 'ct_state_ttl', 'ct_dst_ltm',
       'ct_src_dport_ltm', 'ct_dst_sport_ltm', 'ct_dst_src_ltm',
       'is_ftp_login', 'ct_ftp_cmd', 'ct_flw_http_mthd', 'ct_src_ltm',
       'ct_srv_dst', 'is_sm_ips_ports', 'attack_cat', 'label'],
      dtype='object')
Number of non-numeric values per column:
id                        0
dur                       0
proto                     0
service              175341
state                     0
spkts                     0
dpkts                     0
sbytes                    0
dbytes                    0
rate                      0
sttl                      0
dttl                      0
sload     

In [28]:
# Define the top features based on your correlation matrix
top_features = ['sttl', 'ct_state_ttl', 'state', 'ct_dst_sport_ltm', 'rate',
                'ct_src_dport_ltm', 'ct_dst_src_ltm', 'ct_src_ltm', 'ct_dst_ltm', 'ct_srv_src']

# Define the target variable
target = 'label'


In [29]:
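from sklearn.preprocessing import LabelEncoder

# Assumed setup (not shown in the original cell): feature matrices and targets come from
# the loaded frames; .copy() lets us assign columns without a SettingWithCopyWarning
X_train, X_test = training_df[top_features].copy(), testing_df[top_features].copy()
y_train, y_test = training_df[target], testing_df[target]
categorical_features = ['state']  # the only non-numeric column in top_features
label_encoder = LabelEncoder()

# Apply label encoding to each categorical feature
for feature in categorical_features:
    if feature in X_train.columns:
        # Fit the encoder only on the training set, then transform it
        X_train[feature] = label_encoder.fit_transform(X_train[feature])

        # Encode the test set with the training classes; unseen labels become -1
        class_codes = {cls: code for code, cls in enumerate(label_encoder.classes_)}
        X_test[feature] = X_test[feature].map(class_codes).fillna(-1).astype(int)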




In [30]:
from sklearn.preprocessing import StandardScaler

# Initialize and fit the scaler on the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same scaling to the testing data
X_test_scaled = scaler.transform(X_test)


In [31]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import SGDClassifier, LogisticRegression
import xgboost as xgb

# List of models to train and evaluate
models = {
    'Naive Bayes': GaussianNB(),
    'LDA': LDA(),
    'KNN': KNeighborsClassifier(),
    'XGBoost': xgb.XGBClassifier(use_label_encoder=False, eval_metric='logloss'),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(),
    'SVM': SVC(),
    'AdaBoost': AdaBoostClassifier(),
    'SGD': SGDClassifier(),
    'Logistic Regression': LogisticRegression()
}

# Function to train and evaluate models
def evaluate_models(X_train, X_test, y_train, y_test):
    results = {}

    for name, model in models.items():
        print(f"Training {name}...")
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)

        # Calculate performance metrics
        accuracy = accuracy_score(y_test, y_pred)
        precision = precision_score(y_test, y_pred, zero_division=1)
        recall = recall_score(y_test, y_pred, zero_division=1)
        f1 = f1_score(y_test, y_pred, zero_division=1)

        # Store results
        results[name] = {
            'Accuracy': round(accuracy * 100, 2),
            'Precision': round(precision * 100, 2),
            'Recall': round(recall * 100, 2),
            'F1 Score': round(f1 * 100, 2)
        }

    return pd.DataFrame.from_dict(results, orient='index')

# Evaluate the models using the prepared data
results = evaluate_models(X_train_scaled, X_test_scaled, y_train, y_test)
print("\nModel Performance with Top Features:")
print(results)


Training Naive Bayes...
Training LDA...
Training KNN...
Training XGBoost...


Parameters: { "use_label_encoder" } are not used.



Training Decision Tree...
Training Random Forest...
Training SVM...
Training AdaBoost...




Training SGD...
Training Logistic Regression...

Model Performance with Top Features:
                     Accuracy  Precision  Recall  F1 Score
Naive Bayes             58.39      90.33   27.36     41.99
LDA                     73.31      72.26   83.64     77.53
KNN                     68.77      71.20   72.66     71.93
XGBoost                 80.83      74.25   99.78     85.14
Decision Tree           84.02      79.61   95.41     86.80
Random Forest           84.35      78.47   98.65     87.41
SVM                     73.52      75.23   77.38     76.29
AdaBoost                80.15      74.45   97.35     84.37
SGD                     71.85      69.74   86.31     77.15
Logistic Regression     70.82      71.26   78.76     74.82
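
When reading the table, it helps to remember that F1 is the harmonic mean of precision and recall, so it is pulled toward the lower of the two. Naive Bayes shows this clearly: 90.33% precision with 27.36% recall gives an F1 of only about 42%. The sketch below recomputes F1 from the precision and recall values copied from the table; the names reported, f1_from and recomputed are illustrative.

```python
# (precision %, recall %, reported F1 %) copied from the results table above.
reported = {
    "Naive Bayes": (90.33, 27.36, 41.99),
    "LDA": (72.26, 83.64, 77.53),
    "KNN": (71.20, 72.66, 71.93),
    "XGBoost": (74.25, 99.78, 85.14),
    "Decision Tree": (79.61, 95.41, 86.80),
    "Random Forest": (78.47, 98.65, 87.41),
    "SVM": (75.23, 77.38, 76.29),
    "AdaBoost": (74.45, 97.35, 84.37),
    "SGD": (69.74, 86.31, 77.15),
    "Logistic Regression": (71.26, 78.76, 74.82),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

# Recompute F1 for every model and show it next to the reported value.
recomputed = {name: round(f1_from(p, r), 2) for name, (p, r, _) in reported.items()}
for name, (p, r, f1) in reported.items():
    print(f"{name:<20} reported={f1:6.2f}  recomputed={recomputed[name]:6.2f}")
```

Small differences in the last digit are expected, because the table's precision and recall are themselves rounded to two decimals.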
