<a href="https://colab.research.google.com/github/matteoturnu/NetSecProject/blob/main/NetSec_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Studying feature correlation and distribution

## Preparing the dataset

Importing the needed libraries

In [1]:
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report, roc_curve, roc_auc_score

Loading and preparing the dataset

In [2]:
url = 'https://raw.githubusercontent.com/matteoturnu/NetSecProject/refs/heads/main/BenignAndMaliciousDataset.csv'
traffic_df = pd.read_csv(url)
print(traffic_df.shape)

# We only select numeric and boolean features
numeric_dataset = traffic_df.select_dtypes(include=['number','boolean'])

# We remove Domain because it is an incremental ID
new_numeric_dataset = numeric_dataset.drop(columns='Domain')
# We remove Ip and ASN because they are de facto string identifiers, so studying their correlation makes no sense
excluded_features = ['Ip', 'ASN']
new_numeric_dataset = new_numeric_dataset.drop(columns=excluded_features)

(90000, 34)


## Evaluating feature weight: ANOVA test
UPDATE: the ANOVA test cannot be relied on here, because the samples are not normally distributed.

TODO: consider moving the Gaussian test on the sample distribution into this section.
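
Because ANOVA assumes approximately normal groups, a rank-based test is one possible alternative. The sketch below is not part of the original analysis: it applies `scipy.stats.mannwhitneyu` to synthetic, deliberately skewed samples. The group names, distributions, and the reuse of the 0.01 significance level are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Synthetic, skewed (non-Gaussian) samples standing in for one feature
# split by class; the distributions are illustrative assumptions.
rng = np.random.default_rng(0)
group_a = rng.exponential(scale=1.0, size=500)
group_b = rng.exponential(scale=3.0, size=500)

# Mann-Whitney U compares the two groups by ranks, so it does not
# assume normally distributed samples.
u_stat, u_p_value = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

alpha = 0.01  # same significance level used in the ANOVA cell
is_significant = u_p_value < alpha
print(f'U={u_stat:.1f}, p-value={u_p_value:.3e}, significant={is_significant}')
```

On the real data, the same call could be run per feature on the benign and malicious groups, in place of `stats.f_oneway`.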


In [3]:
# The "Class" column is the target.
# NOTE: the dataset uses 0 for malicious samples and 1 for benign ones
target = 'Class'
feature_columns = [col for col in new_numeric_dataset.columns if col != target]

# ANOVA test for each numeric feature
p_values = {}
for feature in feature_columns:
    # split data into benign and malicious groups
    group_benign = new_numeric_dataset[new_numeric_dataset[target] == 1][feature]
    group_malicious = new_numeric_dataset[new_numeric_dataset[target] == 0][feature]
    # applying test
    stat, p_value = stats.f_oneway(group_benign, group_malicious)
    # save p-value for each feature
    p_values[feature] = p_value

    # evaluate p-value
    if p_value < 0.01:  # significance level (alpha)
        print(f'Feature "{feature}" is significant (p-value: {p_value:.3e})')
    else:
        print(f'Feature "{feature}" is NOT significant (p-value: {p_value:.3e})')

Feature "MXDnsResponse" is significant (p-value: 0.000e+00)
Feature "TXTDnsResponse" is significant (p-value: 0.000e+00)
Feature "HasSPFInfo" is significant (p-value: 0.000e+00)
Feature "HasDkimInfo" is significant (p-value: 6.456e-04)
Feature "HasDmarcInfo" is significant (p-value: 1.391e-62)
Feature "DomainInAlexaDB" is significant (p-value: 3.066e-94)
Feature "CommonPorts" is significant (p-value: 0.000e+00)
Feature "CreationDate" is significant (p-value: 0.000e+00)
Feature "LastUpdateDate" is significant (p-value: 0.000e+00)
Feature "HttpResponseCode" is significant (p-value: 0.000e+00)
Feature "SubdomainNumber" is NOT significant (p-value: 1.102e-02)
Feature "Entropy" is significant (p-value: 0.000e+00)
Feature "EntropyOfSubDomains" is significant (p-value: 2.358e-17)
Feature "StrangeCharacters" is significant (p-value: 0.000e+00)
Feature "IpReputation" is significant (p-value: 3.377e-219)
Feature "DomainReputation" is NOT significant (p-value: 4.411e-01)
Feature "ConsoantRatio"

The test above shows that the features "DomainReputation" and "SubdomainNumber" are not significant for determining the class (p-value above 0.01), so they are dropped.

In [4]:
new_numeric_dataset = new_numeric_dataset.drop(columns=['DomainReputation', 'SubdomainNumber'])

## Benign and malicious traffic distribution based on remaining features

In [None]:
plt.figure(figsize=(20, 20))

for i, feature in enumerate(new_numeric_dataset, 1):
    plt.subplot(7, 5, i)
    plot = sns.histplot(data=traffic_df, x=feature, hue='Class', multiple='dodge', palette='Set1', bins=25, stat='percent')

    if traffic_df[feature].nunique() == 2:  # Check if the current feature is boolean
        plt.xticks([0, 1])
        plt.xlim(-0.5, 1.5)

    plt.title(f'{feature}', fontweight='bold')
    plt.ylabel('Total [%]')

    plt.grid(True, which='both', linestyle='--', linewidth=0.5)
    plt.minorticks_on()
    plt.grid(True, which='minor', linestyle=':', linewidth=0.5)


plt.tight_layout()
plt.show()

The graphs show that some features are distributed almost identically across malicious and benign samples.
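
The overlap is judged visually in the notebook. As a rough complement, the shared area of the two per-class histograms can be computed numerically. The sketch below is an assumption-laden illustration, not the notebook's method: `histogram_overlap` is a hypothetical helper, and it runs on synthetic data rather than on the real features.

```python
import numpy as np

def histogram_overlap(a, b, bins=25):
    """Return the shared area (0..1) of two normalized histograms on common bins."""
    lo = min(np.min(a), np.min(b))
    hi = max(np.max(a), np.max(b))
    edges = np.linspace(lo, hi, bins + 1)
    hist_a, _ = np.histogram(a, bins=edges)
    hist_b, _ = np.histogram(b, bins=edges)
    # Convert counts to fractions so classes of different size are comparable
    frac_a = hist_a / hist_a.sum()
    frac_b = hist_b / hist_b.sum()
    return float(np.minimum(frac_a, frac_b).sum())

rng = np.random.default_rng(0)
same = histogram_overlap(rng.normal(0, 1, 2000), rng.normal(0, 1, 2000))
shifted = histogram_overlap(rng.normal(0, 1, 2000), rng.normal(5, 1, 2000))
print(f'overlap (same distribution): {same:.2f}')
print(f'overlap (shifted distribution): {shifted:.2f}')
```

A value close to 1 means the two classes are nearly indistinguishable on that feature, while a value close to 0 means the feature separates them well.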

In [None]:
# remove overlapping features
new_numeric_dataset = new_numeric_dataset.drop(columns=['HasDkimInfo', 'HasDmarcInfo', 'DomainInAlexaDB', 'CommonPorts', 'EntropyOfSubDomains', 'IpReputation'])

## Studying feature-target correlation

In [None]:
corr_matrix = new_numeric_dataset.corr()
class_corr = corr_matrix[['Class']]
plt.figure(figsize=(1, 6))
sns.heatmap(class_corr, annot=True, linewidths=0.5, fmt='.2f')
plt.show()

Chosen features (based on high correlation with the class)
- TXTDnsResponse
- HasSPFInfo
- StrangeCharacters
- ConsoantRatio
- NumericRatio
- VowelRatio
- NumericSequence

## Studying pairwise feature correlation

In [None]:
features = ['TXTDnsResponse', 'HasSPFInfo', 'StrangeCharacters', 'ConsoantRatio', 'NumericRatio', 'VowelRatio', 'NumericSequence']
new_numeric_dataset = new_numeric_dataset[features]

corr_matrix = new_numeric_dataset.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, linewidths=0.5, fmt='.2f')
plt.show()

Looking at the correlation between the features themselves:
- keep ConsoantRatio over NumericRatio
- keep TXTDnsResponse over HasSPFInfo
- keep NumericSequence over ConsoantRatio (4 features left)

Remaining features: NumericSequence, TXTDnsResponse, StrangeCharacters, VowelRatio (next test: possibly remove NumericSequence or VowelRatio)
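
The pairwise choices above were made by reading the heatmap. The following sketch shows one way to list strongly correlated pairs programmatically. The 0.8 threshold and the synthetic columns `a`, `b`, `c` are assumptions for illustration, and deciding which member of a pair to keep remains a manual choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=500)

# Hypothetical features: 'a' and 'b' are nearly redundant, 'c' is independent
demo_df = pd.DataFrame({
    'a': base,
    'b': base + rng.normal(scale=0.1, size=500),
    'c': rng.normal(size=500),
})

threshold = 0.8  # assumed cut-off, to be tuned per dataset
corr = demo_df.corr().abs()

# Keep only the upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant_pairs = [
    (row, col, round(float(upper.loc[row, col]), 2))
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(redundant_pairs)
```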

In [None]:
final_features = ['TXTDnsResponse', 'StrangeCharacters', 'VowelRatio', 'NumericSequence']

# Choosing the ML classifier

Test to evaluate how closely the sample distribution follows a Gaussian one


In [None]:
study_features = ['TXTDnsResponse', 'StrangeCharacters', 'VowelRatio', 'NumericSequence', 'Class']
dataset_study = traffic_df[study_features]
# In this dataset Class == 1 is benign and Class == 0 is malicious
dataset_ben = dataset_study[dataset_study['Class'] == 1]
dataset_mal = dataset_study[dataset_study['Class'] == 0]

# Run the Shapiro-Wilk test on each feature separately, per class.
# Passing the whole DataFrame would flatten all columns (including 'Class') into one sample.
# NOTE: SciPy warns that the p-value may not be accurate for samples larger than 5000.
for name, group in [('benign', dataset_ben), ('malicious', dataset_mal)]:
    for feature in study_features[:-1]:
        stat, p_value = stats.shapiro(group[feature].astype(float))
        print(f'[{name}] {feature}: statistic={stat:.4f}, p-value={p_value:.3e}')

        if p_value > 0.05:
            print('  Distribution is probably normal (fail to reject H0)')
        else:
            print('  Distribution is not normal (reject H0)')

In [None]:
plt.figure(figsize=(10, 10))

i = 1
for feature in final_features[1:]:
  for j in range(2):
    plt.subplot(3, 2, i)
    if i % 2 == 1:
      plot = sns.histplot(data=dataset_ben, x=feature, multiple='dodge', bins=25, stat='percent', color='blue')
    else:
      plot = sns.histplot(data=dataset_mal, x=feature, multiple='dodge', bins=25, stat='percent', color='red')
    i += 1

plt.show()

These plots approximate the probability distribution of each feature. None of them is Gaussian, so the data cannot be assumed to follow a Gaussian distribution.
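
SciPy warns that the Shapiro-Wilk p-value may be inaccurate for more than 5000 samples, which matters for a dataset of this size. One hedged workaround is to test a random subsample. In the sketch below, `shapiro_on_subsample` is a hypothetical helper and the two columns are synthetic stand-ins, not the real features.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for two feature columns (assumptions for illustration)
gaussian_like = rng.normal(loc=0.0, scale=1.0, size=20000)
skewed = rng.exponential(scale=1.0, size=20000)

def shapiro_on_subsample(values, max_n=5000, seed=0):
    """Run Shapiro-Wilk on at most max_n randomly chosen values."""
    values = np.asarray(values, dtype=float)
    if values.size > max_n:
        values = np.random.default_rng(seed).choice(values, size=max_n, replace=False)
    return stats.shapiro(values)

for name, column in [('gaussian_like', gaussian_like), ('skewed', skewed)]:
    stat, p_value = shapiro_on_subsample(column)
    print(f'{name}: W={stat:.4f}, p-value={p_value:.3e}')
```

The W statistic is also worth reading on its own: values well below 1 indicate a clear departure from normality regardless of the p-value.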

# K-NN

In [None]:
y = traffic_df['Class']

# k values to try
k_values = [5, 10, 20, 50, 100, 200, 300]

## Test for chosen features

K-NN with all 4 features

In [None]:
X = traffic_df[final_features] # 4 features

# Split the dataset into 70% training and 30% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features for K-NN
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# Evaluate the model
y_pred = knn.predict(X_test)

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)

TP = cm[1, 1]  # True positives
TN = cm[0, 0]  # True negatives
FP = cm[0, 1]  # False positives
FN = cm[1, 0]  # False negatives

TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
FNR = FN / (TP + FN) if (TP + FN) > 0 else 0

P = TP/(TP+FP) if (TP+FP) > 0 else 0

print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
print(f'True Positive Ratio: {TPR:.2f}')
print(f'False Negative Ratio: {FNR:.2f}')
print(f'Precision: {P:.2f}')
print('Classification Report:')
print(classification_report(y_test, y_pred))

# Accuracy for different values of k (this overrides the k_values list defined earlier)
k_values = [1, 3, 5, 7, 9, 11]
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)

    print(f'k={k}, Accuracy: {accuracy_score(y_test, y_pred):.2f}')

# --- ROC CURVE AND AUC ---
# NOTE: knn is the last model fitted in the loop above (k = 11)
# Predict the probabilities of the positive class
y_prob = knn.predict_proba(X_test)[:, 1]

# Compute FPR, TPR and thresholds for the ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)

# Compute the AUC
roc_auc = roc_auc_score(y_test, y_prob)

# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
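
The loop above compares values of k using accuracy on the single test split. A possible alternative, sketched here and not used in the notebook, is k-fold cross-validation with the scaler wrapped in a pipeline. The data comes from `make_classification` as a synthetic stand-in for the four selected features, so the resulting k is not a recommendation for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 4-feature binary problem standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=600, n_features=4, n_informative=3, n_redundant=1, random_state=42
)

cv_scores = {}
for k in [1, 3, 5, 7, 9, 11]:
    # The pipeline refits the scaler inside every fold, so no statistics
    # from the held-out fold leak into training
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()

best_k = max(cv_scores, key=cv_scores.get)
print(f'best k = {best_k}, mean CV accuracy = {cv_scores[best_k]:.2f}')
```

Choosing k this way keeps the test split untouched until the final evaluation.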


K-NN on each of the 4 features individually

In [None]:
plt.figure(figsize=(10, 8))

for feature in X:
    print(f"\nAnalyzing feature: {feature}")

    feature_vector = traffic_df[feature].values.reshape(-1, 1)  # Reshape into a 2D array

    X_train, X_test, y_train, y_test = train_test_split(feature_vector, y, test_size=0.3, random_state=42)

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

    k = 5
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)

    # Evaluate the model
    y_pred = knn.predict(X_test)

    cm = confusion_matrix(y_test, y_pred)

    TP = cm[1, 1]  # True positives
    TN = cm[0, 0]  # True negatives
    FP = cm[0, 1]  # False positives
    FN = cm[1, 0]  # False negatives

    TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
    FNR = FN / (TP + FN) if (TP + FN) > 0 else 0

    P = TP / (TP + FP) if (TP + FP) > 0 else 0

    print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
    print(f'True Positive Ratio: {TPR:.2f}')
    print(f'False Negative Ratio: {FNR:.2f}')
    print(f'Precision: {P:.2f}')
    print('Classification Report:')
    print(classification_report(y_test, y_pred))

    # Predict the probabilities of the positive class
    y_prob = knn.predict_proba(X_test)[:, 1]

    # Compute FPR, TPR and thresholds for the ROC curve
    fpr, tpr, thresholds = roc_curve(y_test, y_prob)

    # Compute the AUC
    roc_auc = roc_auc_score(y_test, y_prob)

    # Plot the ROC curve for each feature
    plt.plot(fpr, tpr, lw=2, label=f'{feature} (AUC = {roc_auc:.2f})')

# Add details to the plot
plt.plot([0, 1], [0, 1], color='gray', lw=2, linestyle='--')  # Diagonal line for a random classifier
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (FPR)')
plt.ylabel('True Positive Rate (TPR)')
plt.title('ROC curves for the 4 features')
plt.legend(loc='lower right')
plt.grid(True)
plt.show()
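
The cells above compute TPR, FNR and precision from the confusion matrix by hand several times. A small helper could hold that logic in one place. The function name `binary_rates` is hypothetical. Following the notebook, label 1 is treated as the positive class, which in this dataset means benign.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def binary_rates(y_true, y_pred):
    """Return TPR, FNR and precision, treating label 1 as the positive class."""
    # labels=[0, 1] fixes the layout to [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    fnr = fn / (tp + fn) if (tp + fn) > 0 else 0.0
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return {'TPR': float(tpr), 'FNR': float(fnr), 'precision': float(precision)}

# Tiny hand-checkable example: TP=2, FN=1, FP=1, TN=2
y_true = np.array([1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0])
rates = binary_rates(y_true, y_pred)
print(rates)
```

Passing `labels=[0, 1]` explicitly keeps the matrix 2x2 even when one class is missing from a particular split.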

## Test for excluded features

In [None]:
X = traffic_df[excluded_features]

for feature in X:
  print(f"\nAnalyzing feature: {feature}")

  feature_vector = traffic_df[feature].values.reshape(-1, 1)  # Reshape into a 2D array

  X_train, X_test, y_train, y_test = train_test_split(feature_vector, y, test_size=0.3, random_state=42)

  scaler = StandardScaler()
  X_train = scaler.fit_transform(X_train)
  X_test = scaler.transform(X_test)

  k = 5
  knn = KNeighborsClassifier(n_neighbors=k)
  knn.fit(X_train, y_train)

  # Model evaluation
  y_pred = knn.predict(X_test)

  cm = confusion_matrix(y_test, y_pred)

  TP = cm[1, 1]  # True positives
  TN = cm[0, 0]  # True negatives
  FP = cm[0, 1]  # False positives
  FN = cm[1, 0]  # False negatives

  TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
  FNR = FN / (TP + FN) if (TP + FN) > 0 else 0

  # Guard against division by zero when no sample is predicted as positive
  P = TP / (TP + FP) if (TP + FP) > 0 else 0

  print(f'Accuracy: {accuracy_score(y_test, y_pred):.2f}')
  print(f'True Positive Ratio: {TPR:.2f}')
  print(f'False Negative Ratio: {FNR:.2f}')
  print(f'Precision: {P:.2f}')
  print('Classification Report:')
  print(classification_report(y_test, y_pred))

  for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    print(f'k={k}, Accuracy: {accuracy_score(y_test, y_pred):.2f}')