
In [1]:
import tensorflow as tf
print("TensorFlow version:", tf.__version__)

TensorFlow version: 2.19.0


In [3]:
import pandas as pd

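# Load the dataset from the specified file path.
# Note: the file has no header line, so with the default settings pandas
# uses the first record as column names (see the preview below).
file_path = "/KDDTest+.txt"
df = pd.read_csv(file_path)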

print("Dataset loaded successfully:")
display(df.head())

Dataset loaded successfully:


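| | 0 | tcp | private | REJ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.04.1 | 0.06.1 | 0.00.3 | 0.00.4 | 0.00.5 | 0.00.6 | 1.00.2 | 1.00.3 | neptune | 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | private | REJ | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | neptune | 21 |
| 1 | 2 | tcp | ftp_data | SF | 12983 | 0 | 0 | 0 | 0 | 0 | ... | 0.61 | 0.04 | 0.61 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | normal | 21 |
| 2 | 0 | icmp | eco_i | SF | 20 | 0 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 1.0 | 0.28 | 0.0 | 0.0 | 0.0 | 0.0 | saint | 15 |
| 3 | 1 | tcp | telnet | RSTO | 0 | 15 | 0 | 0 | 0 | 0 | ... | 0.31 | 0.17 | 0.03 | 0.02 | 0.0 | 0.0 | 0.83 | 0.71 | mscan | 11 |
| 4 | 0 | tcp | http | SF | 267 | 14515 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.01 | 0.03 | 0.01 | 0.0 | 0.0 | 0.0 | normal | 21 |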
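
The column labels in the preview above (`0`, `tcp`, `private`, ..., `neptune`, `21`) are values from the file's first record: `pd.read_csv` treats the first line as a header by default, so that record is dropped from the data. A minimal sketch of loading a header-less file, using a small in-memory sample with shortened, made-up rows and only five illustrative column names rather than the real 43-column file:

```python
import io
import pandas as pd

# Hypothetical three-record sample with no header line (shortened rows).
sample = io.StringIO(
    "0,tcp,private,REJ,neptune\n"
    "2,tcp,ftp_data,SF,normal\n"
    "0,icmp,eco_i,SF,saint\n"
)
names = ["duration", "protocol_type", "service", "flag", "outcome"]

# Default behaviour: the first record becomes the header and is lost.
with_default = pd.read_csv(sample)
sample.seek(0)

# header=None keeps every record; names= supplies the column labels.
with_names = pd.read_csv(sample, header=None, names=names)

print(len(with_default), len(with_names))
```

For this notebook that would presumably mean `pd.read_csv(file_path, header=None, names=column_names)` once `column_names` is defined; the row counts in the later outputs would then change by one record.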


# Task
Build an MLP model using the "/KDDTest+.txt" dataset.

## Data preprocessing

### Subtask:
Prepare the data for the MLP model. This typically involves handling categorical features, scaling numerical features, and splitting the data into training and testing sets.


**Reasoning**:
Define the column names and assign them to the dataframe, then identify categorical and numerical features.



In [4]:
column_names = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root", "num_file_creations",
    "num_shells", "num_access_files", "num_outbound_cmds", "is_host_login",
    "is_guest_login", "count", "srv_count", "serror_rate", "srv_serror_rate",
    "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
    "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
    "dst_host_same_srv_rate", "dst_host_diff_srv_rate", "dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate", "dst_host_serror_rate", "dst_host_srv_serror_rate",
    "dst_host_rerror_rate", "dst_host_srv_rerror_rate", "outcome", "difficulty"
]

df.columns = column_names

categorical_features = ['protocol_type', 'service', 'flag', 'land', 'logged_in', 'is_host_login', 'is_guest_login']
numerical_features = [col for col in df.columns if col not in categorical_features + ['outcome', 'difficulty']]

print("Categorical features:", categorical_features)
print("Numerical features:", numerical_features)
display(df.head())

Categorical features: ['protocol_type', 'service', 'flag', 'land', 'logged_in', 'is_host_login', 'is_guest_login']
Numerical features: ['duration', 'src_bytes', 'dst_bytes', 'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'num_compromised', 'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells', 'num_access_files', 'num_outbound_cmds', 'count', 'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate', 'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate', 'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate']


| | duration | protocol_type | service | flag | src_bytes | dst_bytes | land | wrong_fragment | urgent | hot | ... | dst_host_same_srv_rate | dst_host_diff_srv_rate | dst_host_same_src_port_rate | dst_host_srv_diff_host_rate | dst_host_serror_rate | dst_host_srv_serror_rate | dst_host_rerror_rate | dst_host_srv_rerror_rate | outcome | difficulty |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | tcp | private | REJ | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.06 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | neptune | 21 |
| 1 | 2 | tcp | ftp_data | SF | 12983 | 0 | 0 | 0 | 0 | 0 | ... | 0.61 | 0.04 | 0.61 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | normal | 21 |
| 2 | 0 | icmp | eco_i | SF | 20 | 0 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 1.0 | 0.28 | 0.0 | 0.0 | 0.0 | 0.0 | saint | 15 |
| 3 | 1 | tcp | telnet | RSTO | 0 | 15 | 0 | 0 | 0 | 0 | ... | 0.31 | 0.17 | 0.03 | 0.02 | 0.0 | 0.0 | 0.83 | 0.71 | mscan | 11 |
| 4 | 0 | tcp | http | SF | 267 | 14515 | 0 | 0 | 0 | 0 | ... | 1.0 | 0.0 | 0.01 | 0.03 | 0.01 | 0.0 | 0.0 | 0.0 | normal | 21 |


**Reasoning**:
Apply one-hot encoding to the categorical features and scale the numerical features.



In [6]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ])

processed_data = preprocessor.fit_transform(df)

# Convert the processed data back to a DataFrame to inspect (optional)
# Get the new column names after one-hot encoding
onehot_feature_names = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
all_feature_names = numerical_features + list(onehot_feature_names)

processed_df = pd.DataFrame(processed_data, columns=all_feature_names)
display(processed_df.head())

| | duration | src_bytes | dst_bytes | wrong_fragment | urgent | hot | num_failed_logins | num_compromised | root_shell | su_attempted | ... | flag_SF | flag_SH | land_0 | land_1 | logged_in_0 | logged_in_1 | is_host_login_0 | is_host_login_1 | is_guest_login_0 | is_guest_login_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.155538 | -0.021989 | -0.096898 | -0.059105 | -0.01946 | -0.113524 | -0.144002 | -0.016494 | -0.049455 | -0.012638 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | -0.154116 | 0.005472 | -0.096898 | -0.059105 | -0.01946 | -0.113524 | -0.144002 | -0.016494 | -0.049455 | -0.012638 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | -0.155538 | -0.021946 | -0.096898 | -0.059105 | -0.01946 | -0.113524 | -0.144002 | -0.016494 | -0.049455 | -0.012638 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 3 | -0.154827 | -0.021989 | -0.096191 | -0.059105 | -0.01946 | -0.113524 | -0.144002 | -0.016494 | -0.049455 | -0.012638 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 4 | -0.155538 | -0.021424 | 0.587149 | -0.059105 | -0.01946 | -0.113524 | -0.144002 | -0.016494 | -0.049455 | -0.012638 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 |


In [12]:
import pandas as pd

# Assuming your conn.log is tab-separated and has a header
# If your log file has comments or other lines before the header, you might need to skip those.
# Replace 'path/to/your/conn.log' with the actual path to your Zeek conn.log file.
try:
    conn_log_df = pd.read_csv('path/to/your/conn.log', sep='\t', comment='#', low_memory=False)
    print("conn.log loaded successfully:")
    display(conn_log_df.head())
except FileNotFoundError:
    print("Error: conn.log not found. Please check the file path.")
except Exception as e:
    print(f"An error occurred while reading the file: {e}")

Error: conn.log not found. Please check the file path.
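
One caveat for the cell above: Zeek's default TSV logs keep their column names in a `#fields` metadata line, so `comment='#'` discards the header together with the other metadata, and pandas then promotes the first data row to column names. A minimal sketch of reading the names from that line instead, assuming the default Zeek TSV layout and `-` as the unset-field marker; `read_zeek_log` is a hypothetical helper and the sample log is made up:

```python
import io
import pandas as pd

def read_zeek_log(handle):
    """Read a Zeek TSV log, taking column names from its '#fields' line."""
    text = handle.read()
    fields = None
    for line in text.splitlines():
        if line.startswith("#fields"):
            # '#fields<TAB>ts<TAB>uid...' -> ['ts', 'uid', ...]
            fields = line.split("\t")[1:]
            break
    if fields is None:
        raise ValueError("no '#fields' line found")
    # Metadata lines start with '#', so comment='#' skips them;
    # '-' is assumed to mark unset values.
    return pd.read_csv(io.StringIO(text), sep="\t", comment="#",
                       header=None, names=fields, na_values=["-"])

# Made-up two-record log in the assumed layout.
rows = [
    ["#path", "conn"],
    ["#fields", "ts", "uid", "proto", "duration"],
    ["#types", "time", "string", "enum", "interval"],
    ["1.0", "C1", "tcp", "0.5"],
    ["2.0", "C2", "udp", "-"],
    ["#close", "2020-01-01-00-00-00"],
]
sample_log = "\n".join("\t".join(r) for r in rows) + "\n"
conn_df = read_zeek_log(io.StringIO(sample_log))
print(conn_df)
```

With a real file, the helper could be called as `with open(path) as fh: conn_log_df = read_zeek_log(fh)`. Logs written in JSON format would need a different reader.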


**Reasoning**:
Separate features and target variable, then split the data into training and testing sets.



In [7]:
from sklearn.model_selection import train_test_split

X = processed_df
y = df['outcome']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train: (18034, 120)
Shape of X_test: (4509, 120)
Shape of y_train: (18034,)
Shape of y_test: (4509,)
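
The cells above fit the `ColumnTransformer` on the full DataFrame and split afterwards, so the scaler's mean and variance include the held-out rows. A minimal sketch of the alternative order (split first, then fit on the training part only), using a tiny made-up frame in place of `df`; the names `toy`, `prep`, and the `*_raw` variables are illustrative only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up stand-in for df: one numerical and one categorical feature.
toy = pd.DataFrame({
    "src_bytes": [0, 12983, 20, 0, 267, 45, 300, 8],
    "protocol_type": ["tcp", "tcp", "icmp", "tcp", "tcp", "udp", "tcp", "icmp"],
    "outcome": ["neptune", "normal", "saint", "mscan",
                "normal", "normal", "normal", "saint"],
})
X_raw = toy.drop(columns="outcome")
y_toy = toy["outcome"]

# 1. Split the raw rows first.
X_tr_raw, X_te_raw, y_tr, y_te = train_test_split(
    X_raw, y_toy, test_size=0.25, random_state=42)

# 2. Fit scaling and encoding on the training rows only.
#    sparse_threshold=0 forces a dense array for easy inspection.
prep = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["src_bytes"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["protocol_type"]),
    ],
    sparse_threshold=0,
)
X_tr = prep.fit_transform(X_tr_raw)

# 3. Reuse the fitted statistics on the test rows.
X_te = prep.transform(X_te_raw)

print(X_tr.shape, X_te.shape)
```

Because `handle_unknown="ignore"` is set, a category that appears only in the test rows is encoded as all zeros rather than raising an error.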


## Model definition

### Subtask:
Define the architecture of the MLP model using a library like TensorFlow/Keras. This includes specifying the number of layers, the number of neurons in each layer, and the activation functions.


**Reasoning**:
Define the MLP model architecture using Sequential and Dense layers.



In [8]:
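from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from sklearn.preprocessing import LabelEncoder

# Encode the target variable. Fit on the full label set (y) so that the
# class-to-index mapping also covers labels that occur only in the test split.
label_encoder = LabelEncoder()
label_encoder.fit(y)
y_train_encoded = label_encoder.transform(y_train)
num_classes = len(label_encoder.classes_)

# Define the model: two hidden ReLU layers and a softmax output layer.
# An explicit Input layer replaces the input_shape argument on the first Dense layer.
model = Sequential()
model.add(Input(shape=(X_train.shape[1],)))
model.add(Dense(128, activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))

model.summary()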

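
The `model.summary()` table is not preserved above. The parameter count it would report follows from the layer sizes, since a `Dense` layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. A small sketch of that arithmetic, using the 120 input features from the split shapes; `num_classes` depends on the labels present in the data, so the value 10 below is only a placeholder, and both helper functions are illustrative:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(n_features, hidden_units, num_classes):
    """Total trainable parameters of an MLP with the given layer widths."""
    widths = [n_features, *hidden_units, num_classes]
    return sum(dense_params(a, b) for a, b in zip(widths, widths[1:]))

# 120 inputs -> 128 -> 64 -> num_classes (placeholder value).
hidden_total = dense_params(120, 128) + dense_params(128, 64)
total = mlp_params(120, [128, 64], num_classes=10)
print(hidden_total, total)
```

The two hidden layers account for 23,744 parameters regardless of the number of classes; the output layer adds `65 * num_classes`.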


## Model compilation

### Subtask:
Compile the model by specifying the optimizer, loss function, and metrics.


**Reasoning**:
Compile the defined model using the specified optimizer, loss function, and metrics.



In [9]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

print("Model compiled successfully.")

Model compiled successfully.
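
`sparse_categorical_crossentropy` takes integer class indices (as produced by `LabelEncoder`) rather than one-hot vectors, and for each sample uses the negative log of the probability assigned to the true class. A minimal NumPy sketch of that computation on made-up softmax outputs; this illustrates the formula and is not Keras's actual implementation:

```python
import numpy as np

def sparse_categorical_crossentropy(y_true, probs):
    """Mean of -log(p[true class]) over the batch."""
    y_true = np.asarray(y_true)
    probs = np.asarray(probs, dtype=float)
    # Pick, for each row, the probability of its true class.
    picked = probs[np.arange(len(y_true)), y_true]
    return float(-np.log(picked).mean())

# Made-up softmax outputs for three samples and three classes.
probs = np.array([[0.70, 0.20, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.25, 0.25, 0.50]])
y_true = [0, 1, 2]
loss = sparse_categorical_crossentropy(y_true, probs)
print(round(loss, 4))
```

A model that spreads probability uniformly over `k` classes scores `log(k)`, which gives a rough reference point for reading the loss values in the training log.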


## Model training

### Subtask:
Train the model using the preprocessed training data.


**Reasoning**:
Train the compiled model using the `fit()` method with the encoded training data.



In [10]:
history = model.fit(X_train, y_train_encoded, epochs=10, validation_split=0.2)

Epoch 1/10
451/451 ━━━━━━━━━━━━━━━━━━━━ 2s 3ms/step - accuracy: 0.7439 - loss: 1.1127 - val_accuracy: 0.9240 - val_loss: 0.2368
Epoch 2/10
451/451 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9474 - loss: 0.1686 - val_accuracy: 0.9468 - val_loss: 0.1724
Epoch 3/10
451/451 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9519 - loss: 0.1374 - val_accuracy: 0.9518 - val_loss: 0.1601
Epoch 4/10
451/451 ━━━━━━━━━━━━━━━━━━━━ 1s 2ms/step - accuracy: 0.9551 - loss: 0.1191 - val_accuracy: 0.9512 - val_loss: 0.1541
Epoch 5/10
451/451 ━━━━━━━━━━━━━━━━━━━━ 2s 4ms/step - accuracy: 0.9592 - loss: 0.1111 - val_accuracy: 0.9537 - val_loss: 0.1490
Epoch 6/10
451/451 ━━━━━━━━━━━━━━━━━━━━ 2s 2ms/step - accuracy: 0.9584 - loss: 0.1100 - val_accuracy: 0.9504 - val_loss: 0.1609
Epoch 7/10
451/451
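
The `451` in each progress line is the number of batches per epoch. It follows from the 18,034 training rows, `validation_split=0.2`, and a batch size of 32, which is the Keras default when `batch_size` is not passed. A short sketch of the arithmetic, assuming Keras holds out the last 20% of the rows and rounds the training portion down:

```python
import math

n_rows = 18034          # rows in X_train
validation_split = 0.2  # fraction held out by model.fit
batch_size = 32         # Keras default when batch_size is not given

n_train = math.floor(n_rows * (1 - validation_split))
n_val = n_rows - n_train
steps_per_epoch = math.ceil(n_train / batch_size)
print(n_train, n_val, steps_per_epoch)
```

So each epoch trains on 14,427 rows and reports `val_accuracy` and `val_loss` on the remaining 3,607.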

## Model evaluation

### Subtask:
Evaluate the performance of the trained model on the test data.


**Reasoning**:
Import LabelEncoder and encode the test labels, then evaluate the model on the test data.



In [11]:
from sklearn.preprocessing import LabelEncoder

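# Fit a LabelEncoder on the full label set (y). Its integer codes match the
# codes used in training only if the training encoder saw the same set of labels.
label_encoder = LabelEncoder()
label_encoder.fit(y)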

# Transform y_test using the fitted LabelEncoder
y_test_encoded = label_encoder.transform(y_test)

# Evaluate the model on the test data
loss, accuracy = model.evaluate(X_test, y_test_encoded, verbose=0)

# Print the evaluation results
print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")

Test Loss: 0.1192
Test Accuracy: 0.9550
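
The integer codes passed to `model.evaluate` must come from the same label-to-index mapping that was used during training. A minimal sketch with made-up labels showing how two `LabelEncoder` instances fitted on different label sets assign different codes, and how reusing one fitted encoder avoids the problem:

```python
from sklearn.preprocessing import LabelEncoder

all_labels = ["neptune", "normal", "mscan", "saint", "normal"]
train_labels = ["neptune", "normal", "saint", "normal"]  # 'mscan' is missing

enc_train = LabelEncoder().fit(train_labels)
enc_all = LabelEncoder().fit(all_labels)

# Classes are stored in sorted order, so a missing class shifts later codes.
code_train = int(enc_train.transform(["normal"])[0])
code_all = int(enc_all.transform(["normal"])[0])
print(code_train, code_all)

# Fitting once on the full label set and reusing it keeps codes consistent.
y_train_codes = enc_all.transform(train_labels)
y_test_codes = enc_all.transform(["mscan", "normal"])
```

Here `normal` is class 1 for the encoder fitted on the training labels but class 2 for the one fitted on all labels, so a model trained with the first mapping would be scored against the wrong targets under the second.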


## Summary:

### Data Analysis Key Findings

*   The MLP model achieved a test accuracy of 0.9550 and a test loss of 0.1192 on a held-out 20% split (4,509 rows) of the KDDTest+ dataset.
*   The data preprocessing involved handling categorical features using one-hot encoding and scaling numerical features using `StandardScaler`.
*   The dataset was split into training and testing sets with an 80/20 ratio.
*   The MLP model architecture consisted of two hidden layers (128 and 64 units) with ReLU activation and an output layer with softmax activation.
*   The model was compiled using the Adam optimizer and sparse categorical crossentropy loss.

### Insights or Next Steps

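*   The 0.9550 test accuracy suggests the model separates the traffic classes well on a random split of this file, which indicates potential for intrusion detection. Because the scaler and encoder were fitted before the split, the figure may be slightly optimistic.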
*   Further steps could include exploring other metrics like precision, recall, and F1-score for a more comprehensive evaluation, especially considering the potential class imbalance in intrusion detection datasets.
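
The per-class metrics mentioned above can be computed from predicted class indices. A minimal sketch using scikit-learn's `classification_report` on made-up labels and softmax outputs; in the notebook, `probs` would presumably come from `model.predict(X_test)` and `y_true` from `y_test_encoded`:

```python
import numpy as np
from sklearn.metrics import classification_report

# Made-up, imbalanced example: class 0 dominates.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
probs = np.array(
    [[0.8, 0.1, 0.1]] * 6      # six class-0 rows, predicted as 0
    + [[0.6, 0.3, 0.1],        # class-1 row misclassified as 0
       [0.2, 0.7, 0.1],        # class-1 row predicted correctly
       [0.5, 0.1, 0.4],        # class-2 row misclassified as 0
       [0.1, 0.2, 0.7]]        # class-2 row predicted correctly
)

# Convert softmax outputs to predicted class indices.
y_pred = probs.argmax(axis=1)

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
accuracy = report["accuracy"]
macro_recall = report["macro avg"]["recall"]
print(f"accuracy={accuracy:.2f}, macro recall={macro_recall:.2f}")
```

In this toy case accuracy is 0.80 while macro-averaged recall is only about 0.67, because half of the rows in each minority class are missed; this is the kind of gap that accuracy alone does not show.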
