# INCS 615 Advanced Network and Internet Security
#### Spring 2025, INCS 615-VA1 (2854)
#### Instructor: Dr. Zhida Li
#### Email: zli74@nyit.edu

## Lab #3 (Group) - Detecting Intrusions using Machine Learning Models
### Please keep your output when you submit

# Step 1: Load the data, learn the five classes, and count the number of data points for each class
Write the Python code:
- Load the training and testing data
- Count the number of data points for Regular (0), DoS (1), R2L (2), U2R (3), Probe (4) in both training and testing datasets

You may use pandas OR numpy for extraction and removing the header:
- [_NumPy_](https://numpy.org): used to perform mathematical operations
- [_pandas_](https://pandas.pydata.org/): open source data analysis and manipulation tool

In [None]:
import numpy as np
import pandas as pd

# To do...
# Load the NSL-KDD training dataset (the first row of the CSV is kept as the header)
data_train = pd.read_csv('KDDTrain+_20Percent_615.csv')

# Load the NSL-KDD testing dataset
data_test = pd.read_csv('KDDTest+_615.csv')


# To do...
# Count the number of data points for Regular (0), DoS (1), R2L (2), U2R (3), Probe (4) in both training and testing datasets
# training dataset
# Count the number of data points for each class in the training dataset
num_regular = data_train[data_train.iloc[:, -1] == 0].shape[0]
num_dos = data_train[data_train.iloc[:, -1] == 1].shape[0]
num_r2l = data_train[data_train.iloc[:, -1] == 2].shape[0]
num_u2r = data_train[data_train.iloc[:, -1] == 3].shape[0]
num_probe = data_train[data_train.iloc[:, -1] == 4].shape[0]
print('\nTraining Dataset:')
print('Regular data points:', num_regular)
print('DoS data points:', num_dos)
print('R2L data points:', num_r2l)
print('U2R data points:', num_u2r)
print('Probe data points:', num_probe)

# Count the number of data points for each class in the testing dataset
num_regular_test = data_test[data_test.iloc[:, -1] == 0].shape[0]
num_dos_test = data_test[data_test.iloc[:, -1] == 1].shape[0]
num_r2l_test = data_test[data_test.iloc[:, -1] == 2].shape[0]
num_u2r_test = data_test[data_test.iloc[:, -1] == 3].shape[0]
num_probe_test = data_test[data_test.iloc[:, -1] == 4].shape[0]

print('\nTesting Dataset:')
print('Regular data points:', num_regular_test)
print('DoS data points:', num_dos_test)
print('R2L data points:', num_r2l_test)
print('U2R data points:', num_u2r_test)
print('Probe data points:', num_probe_test)

# Also testing dataset
...
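
The per-class counts above can also be obtained in a single pass with `value_counts`. The sketch below is only an illustration on a small made-up DataFrame (the `toy` data and the `class` column name are assumptions, not the lab files); it relies on the same convention as the code above, namely that the class label sits in the last column.

```python
import pandas as pd

# Toy stand-in for the lab data: the last column holds the class label (0-4).
toy = pd.DataFrame({
    "duration": [0, 2, 0, 5, 1, 0],
    "class": [0, 1, 1, 4, 0, 1],
})

class_names = {0: "Regular", 1: "DoS", 2: "R2L", 3: "U2R", 4: "Probe"}

# Count every label at once; reindex so classes absent from the data show up as 0.
counts = (
    toy.iloc[:, -1]
    .value_counts()
    .reindex(list(class_names), fill_value=0)
)

for label, name in class_names.items():
    print(f"{name} data points: {counts.loc[label]}")
```

Reindexing against the full list of labels keeps the report stable even when a rare class such as U2R is missing from a dataset.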

# Step 2: Prepare numerical features
Write the Python code:
- Select a method and convert the 3 categorical features (“protocol_type”, “service”, and “flag”) to numerical values in both the training and testing data.
  (Data should be numerical when feeding into machine learning models.)

You may use pandas OR numpy:
- [_NumPy_](https://numpy.org): used to perform mathematical operations
- [_pandas_](https://pandas.pydata.org/): open source data analysis and manipulation tool

In [None]:
categorical_features = ["protocol_type", "service", "flag"]
train_df = data_train.copy()
test_df = data_test.copy()

# The class label is the last column of the loaded data (see Step 1)
train_label = data_train.columns[-1]
test_label = data_test.columns[-1]

for feature in categorical_features:
    # One-hot encode the categorical column of the training data
    train_dummies = pd.get_dummies(train_df[feature], prefix=feature)

    # One-hot encode the same column of the test data
    test_dummies = pd.get_dummies(test_df[feature], prefix=feature)
    # Reindex test dummy columns to match training columns, filling missing with 0
    test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

    # Drop the original categorical column and append the dummy variables
    train_df = pd.concat([train_df.drop(columns=[feature]), train_dummies], axis=1)
    test_df = pd.concat([test_df.drop(columns=[feature]), test_dummies], axis=1)

# The dummy variables were appended after the label column; move the label back
# to the last position so that iloc[:, -1] still selects it in the later steps
train_df[train_label] = train_df.pop(train_label)
test_df[test_label] = test_df.pop(test_label)
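
The reason for reindexing the test dummies against the training columns is easiest to see on a tiny example. The sketch below uses made-up values (`train_toy` and `test_toy` are hypothetical, not the lab data): the test set contains a category never seen in training and lacks one that was.

```python
import pandas as pd

# Hypothetical toy data: "icmp" appears only in test, "udp" only in training.
train_toy = pd.DataFrame({"protocol_type": ["tcp", "udp", "tcp"]})
test_toy = pd.DataFrame({"protocol_type": ["icmp", "tcp"]})

train_dummies = pd.get_dummies(train_toy["protocol_type"], prefix="protocol_type", dtype=int)
test_dummies = pd.get_dummies(test_toy["protocol_type"], prefix="protocol_type", dtype=int)

# Keep exactly the training columns, in the training order:
# the unseen "icmp" column is dropped and the missing "udp" column is filled with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies.columns.tolist())
print(test_aligned)
```

Without this alignment the two feature matrices could end up with different widths or column orders, and a model fitted on the training data could not be applied to the test data.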


# Step 3: Normalize the data and create ML models
Write the Python code to:
- Normalize the two datasets (training and test data);  
- Run an ML model. Various ML algorithms are available in the ML library (https://scikit-learn.org/stable/index.html).

If you are running the exercise on your local platform, download and install machine learning (ML) library:  
	https://scikit-learn.org/stable/index.html

The Python libraries installed by [_pip_](https://pip.pypa.io/en/stable/) are:
- [_SciPy_](https://scipy.org): dependency of the _scikit-learn_ library.
- _SciPy_'s _zscore_: function used to perform normalization.
- [_scikit-learn_](https://scikit-learn.org/stable): employed for processing data and calculating performance metrics.

In [None]:
# Import the Python libraries
import time

import numpy as np
from scipy.stats import zscore
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# Get training and test data and labels for model training
# Convert pandas DataFrames to NumPy arrays
features_train = train_df.drop(columns=train_df.columns[-1]).values  # Exclude last column (target)
labels_train = train_df.iloc[:, -1].values  # Target variable in the last column
features_test = test_df.drop(columns=test_df.columns[-1]).values  # Exclude last column (target)
labels_test = test_df.iloc[:, -1].values  # Target variable in the last column

# Handle NaNs before applying zscore
features_train = np.nan_to_num(features_train).astype(np.float64)  # Replace NaNs, convert to float64
features_test = np.nan_to_num(features_test).astype(np.float64)  # Replace NaNs, convert to float64

# Normalize the training and test datasets using zscore
features_train = zscore(features_train, axis=0, ddof=1)
features_test = zscore(features_test, axis=0, ddof=1)

# Constant columns have zero standard deviation, so zscore returns NaN for them;
# replace those NaNs with 0 after normalizing
features_train = np.nan_to_num(features_train)
features_test = np.nan_to_num(features_test)

# Create and train your model
time_start = time.time()  # training time - start
#model = DecisionTreeClassifier()  # Initialize the DecisionTreeClassifier
# Initialize RandomForestClassifier with parameters
model = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)

# Generate the model using training data and labels
model.fit(features_train, labels_train)

time_end = time.time()  # training time - end
training_time = time_end - time_start
print('Training completed')
print('Training time:', training_time)
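
A common alternative to normalizing each dataset with its own statistics is to estimate the mean and standard deviation on the training data only and reuse them for the test data. The sketch below shows this with scikit-learn's `StandardScaler` on made-up arrays (`X_train_toy` and `X_test_toy` are hypothetical). It is not the lab's `zscore` approach: `StandardScaler` uses the population standard deviation (equivalent to `ddof=0`) and leaves constant columns at 0 instead of producing NaN.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features; the third column is constant on purpose.
X_train_toy = np.array([[1.0, 10.0, 0.0],
                        [2.0, 20.0, 0.0],
                        [3.0, 30.0, 0.0]])
X_test_toy = np.array([[2.0, 40.0, 0.0]])

scaler = StandardScaler()
# Learn mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train_toy)
# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test_toy)

print(X_train_scaled)
print(X_test_scaled)
```

Reusing the training statistics keeps both datasets on the same scale, which matters when the test distribution differs from the training distribution.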

# Step 4:
Write the Python code to:
- Test the developed model on the test dataset named "features_test"
- Calculate Accuracy and F1-Score based on test labels and predicted labels.   

In [None]:
# Import the Python libraries
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

# Testing, for sklearn library if applicable
predicted_labels = model.predict(features_test)

# To do...
# Performance metrics
# Accuracy: fraction of test samples whose predicted class matches the true class,
#   Accuracy = (number of correct predictions) / (total number of predictions)
# F1-score: harmonic mean of precision and recall,
#   F1 = 2 * (Precision * Recall) / (Precision + Recall)
#   Precision: of the samples predicted as a class, the fraction that truly belong to it
#   Recall: of the samples that truly belong to a class, the fraction the model identified
# Accuracy alone can be misleading on imbalanced data: with 95% normal connections,
# a model that always predicts "normal" reaches 95% accuracy yet detects no intrusions.
# The labels have five classes, so f1_score needs an averaging mode; 'weighted' averages
# the per-class F1-scores weighted by class support (the same setting as in Step 5).
accuracy = accuracy_score(labels_test, predicted_labels)
fscore = f1_score(labels_test, predicted_labels, average='weighted')

# To do...
# Show the results: accuracy, F1-score, and training time
print('Accuracy:', accuracy)
print('F1-Score:', fscore)
print('Training time:', training_time)
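
To see why accuracy and F1-score can disagree on imbalanced data, the following sketch scores a hypothetical classifier that always predicts class 0. The label lists are made up for illustration and are not lab results.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 8 regular connections, 1 DoS (1), 1 U2R (3).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 3]
# A degenerate model that predicts "regular" for every connection.
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that were never predicted as 0 without a warning.
f1_weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
f1_macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("Accuracy:", acc)
print("Weighted F1:", f1_weighted)
print("Macro F1:", f1_macro)
```

The macro average treats every class equally, so it drops sharply when rare attack classes are missed, while accuracy and the weighted average stay dominated by the majority class.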


# Go back to Step 3 to adjust the hyper-parameters and retrain the model, to achieve better results.
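
One way to adjust hyper-parameters systematically is a cross-validated grid search on the training data, which avoids tuning against the test set. The sketch below is an illustration on synthetic data from `make_classification`; the grid values, the toy dataset, and the choice of weighted F1 as the score are assumptions, not recommended settings for NSL-KDD.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the training features and labels.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=42
)

# Hypothetical small grid over the two parameters used in Step 3.
param_grid = {"n_estimators": [25, 50], "max_depth": [5, 10]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1_weighted",  # same averaging mode as the F1-score in Step 5
    cv=3,                   # 3-fold cross-validation on the training data
)
search.fit(X_toy, y_toy)

print("Best parameters:", search.best_params_)
print("Best cross-validated weighted F1:", search.best_score_)
```

After the search, `search.best_estimator_` holds a model refitted on all of the data passed to `fit` with the best parameter combination.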


# Step 5:
Write the Python code to:
- Select the five most important features with a feature selection algorithm or provide a reasonable explanation.
- Re-run the algorithm with the new datasets.
- Recalculate the Accuracy and F1-Score and compare these metrics to your previous results.

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import zscore


# 1. Feature Selection:
# Get feature importances from the trained model
feature_importances = model.feature_importances_

# Get the indices of the top 5 features, most important first
top_5_feature_indices = np.argsort(feature_importances)[-5:][::-1]

# Get the names of the top 5 features
# features_train was built from every column of train_df except the last one,
# so these indices map directly onto train_df.columns
top_5_feature_names = train_df.columns[top_5_feature_indices]

print("Top 5 important features:", list(top_5_feature_names))

# 2. Re-run with Selected Features:
# Create new training and testing datasets with only the top 5 features
features_train_selected = features_train[:, top_5_feature_indices]
features_test_selected = features_test[:, top_5_feature_indices]

# Re-train the model with selected features
#model_selected = DecisionTreeClassifier()
model_selected = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
model_selected.fit(features_train_selected, labels_train)

# 3. Re-calculate Metrics and Compare:
# Make predictions on the test set with the selected features
predicted_labels_selected = model_selected.predict(features_test_selected)

# Calculate Accuracy and F1-Score for the selected features model
accuracy_selected = accuracy_score(labels_test, predicted_labels_selected)
f1_selected = f1_score(labels_test, predicted_labels_selected, average='weighted')

# Print and compare the results
print("\nOriginal Model:")
print("Accuracy:", accuracy)
print("F1-Score:", fscore)

print("\nSelected Features Model:")
print("Accuracy:", accuracy_selected)
print("F1-Score:", f1_selected)

# Compare the metrics and analyze the impact of feature selection
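
Model-based importances are one option for Step 5; a univariate filter is another. The sketch below uses scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`) on synthetic data. The toy dataset and the feature names `f0` to `f11` are assumptions for illustration, and this is a different technique from the `feature_importances_` ranking used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic 3-class data with 12 features standing in for the lab features.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_classes=3, random_state=0
)
feature_names = np.array([f"f{i}" for i in range(X_toy.shape[1])])

# Keep the 5 features with the highest ANOVA F-score against the labels.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X_toy, y_toy)

# get_support(indices=True) returns the column indices that were kept.
selected_names = feature_names[selector.get_support(indices=True)]
print("Selected features:", list(selected_names))
print("Reduced shape:", X_selected.shape)
```

The fitted selector can be applied to the test features with `selector.transform`, which guarantees that the same five columns are used for training and testing.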