## Network Traffic Classification Project🚦
### Overview


This project builds a machine learning model that classifies network traffic as normal or malicious. The model uses features such as protocol, service type, and packet counts to separate the two classes, which supports network security by flagging potential cyber threats. The trained model will also be deployed as a Streamlit app for easy, real-time traffic monitoring and analysis. 🌐🔒

## Column Description
Each column in the dataset is described below:

1. **id** 🆔: Unique identifier for each record.
2. **dur** ⏳: Duration of the connection in seconds.
3. **proto** 📡: Protocol used for the connection (e.g., TCP, UDP).
4. **service** 🖥️: Network service on the destination (e.g., HTTP, FTP).
5. **state** 📶: State of the connection (e.g., FIN, SYN).
6. **spkts** 📦: Number of packets sent by the source.
7. **dpkts** 📦: Number of packets sent by the destination.
8. **sbytes** 💾: Number of bytes sent by the source.
9. **dbytes** 💾: Number of bytes sent by the destination.
10. **rate** 🚀: Rate of traffic in packets per second.
11. **sttl** ⏱️: Time-to-live value of packets sent by the source.
12. **dttl** ⏱️: Time-to-live value of packets sent by the destination.
13. **sload** 📊: Average load of the source in bits per second.
14. **dload** 📊: Average load of the destination in bits per second.
15. **sloss** ❌📦: Number of packets lost by the source.
16. **dloss** ❌📦: Number of packets lost by the destination.
17. **sinpkt** ⏲️: Average time between packets sent by the source.
18. **dinpkt** ⏲️: Average time between packets sent by the destination.
19. **sjit** ⚡: Jitter in the traffic from the source.
20. **djit** ⚡: Jitter in the traffic from the destination.
21. **swin** 🪟: TCP window size of the source.
22. **stcpb** 🔢: Base sequence number of the TCP connection from the source.
23. **dtcpb** 🔢: Base sequence number of the TCP connection from the destination.
24. **dwin** 🪟: TCP window size of the destination.
25. **tcprtt** 🔄: TCP connection setup round-trip time (the sum of synack and ackdat).
26. **synack** ⏲️: Time between the SYN and SYN-ACK packets in the TCP connection setup.
27. **ackdat** ⏲️: Time between the SYN-ACK and ACK packets in the TCP connection setup.
28. **smean** 📏: Mean packet size sent by the source.
29. **dmean** 📏: Mean packet size sent by the destination.
30. **trans_depth** 🌊: Depth into the protocol transaction.
31. **response_body_len** 📜: Length of the response body in bytes.
32. **ct_srv_src** 🔄📶: Number of connections with the same service and source address in the past 100 connections.
33. **ct_state_ttl** 🔄⏱️: Number of connections with the same state and time-to-live value.
34. **ct_dst_ltm** 🔄📍: Number of connections to the same destination in the past 100 connections.
35. **ct_src_dport_ltm** 🔄📍🛠️: Number of connections to the same destination port from the same source in the past 100 connections.
36. **ct_dst_sport_ltm** 🔄📍🔧: Number of connections with the same destination address and source port in the past 100 connections.
37. **ct_dst_src_ltm** 🔄📍🔄: Number of connections from the same source to the same destination in the past 100 connections.
38. **is_ftp_login** 🔑: Indicator of whether the FTP session was authenticated.
39. **ct_ftp_cmd** 🔄🔧: Number of FTP commands issued in the current connection.
40. **ct_flw_http_mthd** 🔄🌐: Number of HTTP methods used in the current connection.
41. **ct_src_ltm** 🔄📍: Number of connections from the same source in the past 100 connections.
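42. **ct_srv_dst** 🔄📶📍: Number of connections with the same service and destination address in the past 100 connections.
43. **is_sm_ips_ports** ⚖️: Indicator of whether the source and destination IP addresses are equal and the source and destination ports are equal.
44. **attack_cat** 🛡️: Category of the attack (e.g., DoS, Generic, Shellcode), or Normal for benign traffic.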
45. **label** 🔍: Binary label indicating whether the connection is benign or malicious.
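
As a quick sanity check on the descriptions above, the columns can be grouped by dtype once the data is in a DataFrame. The sketch below uses a tiny hand-made frame with only a few of the columns (the values are illustrative, not the full dataset), and the helper name `split_columns_by_type` is hypothetical.

```python
import pandas as pd

# Tiny illustrative frame with a handful of the columns described above.
sample = pd.DataFrame({
    "dur": [0.121478, 0.000009],
    "proto": ["tcp", "udp"],
    "service": ["-", "dns"],
    "state": ["FIN", "INT"],
    "spkts": [6, 2],
    "label": [0, 1],
})

def split_columns_by_type(frame):
    """Return (categorical, numeric) column name lists based on dtype."""
    numeric = frame.select_dtypes(include="number").columns.tolist()
    categorical = [c for c in frame.columns if c not in numeric]
    return categorical, numeric

categorical_cols, numeric_cols = split_columns_by_type(sample)
print("categorical:", categorical_cols)
print("numeric:", numeric_cols)
```

The text-valued columns found this way are the ones that need encoding before a scikit-learn model can use them.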

## Step 1: Import Libraries

In [24]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib

- **pandas**: Used for data manipulation and analysis, particularly for handling data in DataFrame format.
- **train_test_split**: A function from sklearn.model_selection to split the dataset into training and testing sets.
- **LabelEncoder**: A class from sklearn.preprocessing to encode categorical labels as integers.
- **StandardScaler**: A class from sklearn.preprocessing to standardize features by removing the mean and scaling to unit variance.
- **RandomForestClassifier**: A class from sklearn.ensemble to create a Random Forest model for classification.
- **accuracy_score**: A function from sklearn.metrics to calculate the accuracy of the model.
- **classification_report**: A function from sklearn.metrics that summarizes precision, recall, and F1-score for each class.
- **confusion_matrix**: A function from sklearn.metrics that counts correct and incorrect predictions for each class.
- **joblib**: A library for saving and loading models efficiently.


## Step 2: Load the Dataset

In [25]:

# Load dataset
df = pd.read_csv(r'C:\Users\nazish\Downloads\Network Traffic Classification streamlit App\UNSW_NB15_training-set.csv')

In [26]:
df.head()

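| | id | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | ... | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | attack_cat | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.121478 | tcp | - | FIN | 6 | 4 | 258 | 172 | 74.08749 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | Normal | 0 |
| 1 | 2 | 0.649902 | tcp | - | FIN | 14 | 38 | 734 | 42014 | 78.473372 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 6 | 0 | Normal | 0 |
| 2 | 3 | 1.623129 | tcp | - | FIN | 8 | 16 | 364 | 13186 | 14.170161 | ... | 1 | 3 | 0 | 0 | 0 | 2 | 6 | 0 | Normal | 0 |
| 3 | 4 | 1.681642 | tcp | ftp | FIN | 12 | 12 | 628 | 770 | 13.677108 | ... | 1 | 3 | 1 | 1 | 0 | 2 | 1 | 0 | Normal | 0 |
| 4 | 5 | 0.449454 | tcp | - | FIN | 10 | 6 | 534 | 268 | 33.373826 | ... | 1 | 40 | 0 | 0 | 0 | 2 | 39 | 0 | Normal | 0 |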


In [27]:
df.tail()

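| | id | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | ... | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | attack_cat | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 175336 | 175337 | 9e-06 | udp | dns | INT | 2 | 0 | 114 | 0 | 111111.1072 | ... | 13 | 24 | 0 | 0 | 0 | 24 | 24 | 0 | Generic | 1 |
| 175337 | 175338 | 0.505762 | tcp | - | FIN | 10 | 8 | 620 | 354 | 33.612649 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 0 | Shellcode | 1 |
| 175338 | 175339 | 9e-06 | udp | dns | INT | 2 | 0 | 114 | 0 | 111111.1072 | ... | 3 | 13 | 0 | 0 | 0 | 3 | 12 | 0 | Generic | 1 |
| 175339 | 175340 | 9e-06 | udp | dns | INT | 2 | 0 | 114 | 0 | 111111.1072 | ... | 14 | 30 | 0 | 0 | 0 | 30 | 30 | 0 | Generic | 1 |
| 175340 | 175341 | 9e-06 | udp | dns | INT | 2 | 0 | 114 | 0 | 111111.1072 | ... | 16 | 30 | 0 | 0 | 0 | 30 | 30 | 0 | Generic | 1 |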
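
The load cell above points at an absolute path on one machine. A minimal sketch of a more portable loader follows, assuming the CSV location is passed in by the caller; the helper name `load_traffic_csv` and the demo file contents are hypothetical and stand in for the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

def load_traffic_csv(csv_path):
    """Load a CSV into a DataFrame, failing early if the file is missing."""
    csv_path = Path(csv_path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"Dataset not found: {csv_path}")
    return pd.read_csv(csv_path)

# Demo on a tiny stand-in file (hypothetical contents, not the real dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo.csv"
    demo_path.write_text("id,proto,label\n1,tcp,0\n2,udp,1\n")
    demo_df = load_traffic_csv(demo_path)

print(demo_df.shape)
print(demo_df.head())
```

Checking for the file first turns a long pandas traceback into a single clear message when the path is wrong.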


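## Step 3: Data Preprocessing
This code converts the categorical text columns into integer codes with LabelEncoder so that the model can use them.

In [28]:
# Preprocessing: integer-encode the categorical columns.
# A separate LabelEncoder is kept per column so each mapping can be reused later.
# 'label' is already 0/1, so it does not need encoding.
categorical_columns = ['proto', 'service', 'state', 'attack_cat']
label_encoders = {}
for col in categorical_columns:
    label_encoders[col] = LabelEncoder()
    df[col] = label_encoders[col].fit_transform(df[col])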
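
To make the encoding concrete, the sketch below fits a LabelEncoder on a short list of illustrative protocol names (not taken from the dataset) and prints the resulting mapping. LabelEncoder sorts the unique values and uses each value's position in that sorted list as its code.

```python
from sklearn.preprocessing import LabelEncoder

protocols = ["tcp", "udp", "tcp", "arp"]  # illustrative values

encoder = LabelEncoder()
codes = encoder.fit_transform(protocols)

# classes_ holds the sorted unique values; a value's code is its index there.
mapping = {str(cls): int(idx) for idx, cls in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())

# inverse_transform recovers the original strings from the codes.
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())
```

Because the codes depend on which values were seen during fitting, the fitted encoder for each column has to be kept if new data is to be encoded the same way later.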


## Step 4: Splitting Data into Features and Target
The code separates the dataset into features (X) and target (y). X consists of all columns except 'id' and 'label', so it still includes the encoded 'attack_cat' column; y is the 'label' column used for prediction.


In [29]:
# Feature and target split
X = df.drop(['id', 'label'], axis=1)  # Drop 'id' if not needed for the model
y = df['label']
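
Because `attack_cat` names the attack type (with Normal for benign rows), keeping it in `X` effectively hands the model the answer. A minimal sketch of a split that also excludes it is shown below, assuming the goal is to predict `label` from traffic features alone; the toy frame and the name `non_feature_columns` are illustrative.

```python
import pandas as pd

# Toy frame with a few of the dataset's columns (illustrative values).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "spkts": [6, 14, 2, 2],
    "sbytes": [258, 734, 114, 114],
    "attack_cat": ["Normal", "Normal", "Generic", "Generic"],
    "label": [0, 0, 1, 1],
})

# Columns that identify the row or reveal the target are kept out of X.
non_feature_columns = ["id", "attack_cat", "label"]
X_toy = toy.drop(columns=non_feature_columns)
y_toy = toy["label"]

print(X_toy.columns.tolist())
```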

In [39]:
df.head()

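| | id | dur | proto | service | state | spkts | dpkts | sbytes | dbytes | rate | ... | ct_dst_sport_ltm | ct_dst_src_ltm | is_ftp_login | ct_ftp_cmd | ct_flw_http_mthd | ct_src_ltm | ct_srv_dst | is_sm_ips_ports | attack_cat | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.121478 | 113 | 0 | 2 | 6 | 4 | 258 | 172 | 74.08749 | ... | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 6 | 0 |
| 1 | 2 | 0.649902 | 113 | 0 | 2 | 14 | 38 | 734 | 42014 | 78.473372 | ... | 1 | 2 | 0 | 0 | 0 | 1 | 6 | 0 | 6 | 0 |
| 2 | 3 | 1.623129 | 113 | 0 | 2 | 8 | 16 | 364 | 13186 | 14.170161 | ... | 1 | 3 | 0 | 0 | 0 | 2 | 6 | 0 | 6 | 0 |
| 3 | 4 | 1.681642 | 113 | 3 | 2 | 12 | 12 | 628 | 770 | 13.677108 | ... | 1 | 3 | 1 | 1 | 0 | 2 | 1 | 0 | 6 | 0 |
| 4 | 5 | 0.449454 | 113 | 0 | 2 | 10 | 6 | 534 | 268 | 33.373826 | ... | 1 | 40 | 0 | 0 | 0 | 2 | 39 | 0 | 6 | 0 |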


## Step 5: Data Splitting for Training and Testing
The code divides the feature and target datasets into training and testing sets, with 20% of the data reserved for testing and a fixed random seed for reproducibility.

In [31]:

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
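
The call above splits at random without regard to class balance. One optional variation, not used in this notebook, is to pass `stratify=y` so that both sets keep the same class proportions. The sketch below shows the effect on small synthetic data (70 samples of one class, 30 of the other).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative imbalanced data: 70 samples of class 0, 30 of class 1.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 70 + [1] * 30)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification, the share of class 1 matches in both sets.
print("train class-1 share:", y_tr.mean())
print("test class-1 share:", y_te.mean())
```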

## Step 6: Feature Standardization
This code standardizes the features. The scaler learns each feature's mean and standard deviation from the training data only, so the training features end up with a mean of 0 and a standard deviation of 1, and the same parameters are then applied to the test features.


In [32]:
# Standardize features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
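
A small worked example may help show why `fit_transform` is used on the training set and `transform` on the test set. The numbers below are made up: the training column has mean 2 and (population) standard deviation sqrt(2/3), and the test value is scaled with those same parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])  # mean 2, population std sqrt(2/3)
test = np.array([[4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std here
test_scaled = demo_scaler.transform(test)        # reuses them unchanged

print(demo_scaler.mean_, demo_scaler.scale_)
print(train_scaled.ravel())
print(test_scaled.ravel())  # (4 - 2) / sqrt(2/3)
```

The scaled test value is not centered on 0, which is expected: the test set is only mapped onto the scale learned from the training data.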

## Step 7: Model Training
This code initializes a RandomForestClassifier with 100 trees and a fixed random seed, then trains the model using the standardized training features and target labels.

In [33]:

# Train a model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
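
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which columns drive the predictions. The sketch below is a synthetic illustration, not the notebook's data: one column deliberately copies the target and two columns are noise, so the copied column should dominate. The feature names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
y_leak = rng.integers(0, 2, size=200)
# Column 0 copies the target (a deliberate leak); columns 1-2 are pure noise.
X_leak = np.column_stack([y_leak, rng.normal(size=200), rng.normal(size=200)])

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_leak, y_leak)

# Importances are normalized to sum to 1 across features.
importances = forest.feature_importances_
for name, score in zip(["leaky", "noise_a", "noise_b"], importances):
    print(f"{name}: {score:.3f}")
```

A single feature with an importance close to 1 is often a sign that it encodes the target and deserves a closer look.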


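## Step 8: Model Evaluation
This code predicts labels for both the training and test sets, reports the accuracy of each, and then prints a classification report and a confusion matrix for the test set. Perfect scores on both sets are unusual; here they are most likely explained by the encoded 'attack_cat' column remaining in X, since that column marks benign rows as Normal and so reveals the label.

In [34]:
# Evaluate the model on the training set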
y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
print(f"Training Accuracy: {train_accuracy:.15f}")

Training Accuracy: 1.000000000000000


In [35]:
# Evaluate the model on the test set
y_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy:.15f}")

Test Accuracy: 1.000000000000000


In [36]:
# Display classification report for Test Set
print("\nClassification Report for Test Set:")
print(classification_report(y_test, y_test_pred))


Classification Report for Test Set:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     11169
           1       1.00      1.00      1.00     23900

    accuracy                           1.00     35069
   macro avg       1.00      1.00      1.00     35069
weighted avg       1.00      1.00      1.00     35069



In [37]:
# Display confusion matrix for Test Set
print("\nConfusion Matrix for Test Set:")
print(confusion_matrix(y_test, y_test_pred))


Confusion Matrix for Test Set:
[[11169     0]
 [    0 23900]]
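
To show how the numbers in the classification report relate to the confusion matrix, the sketch below computes precision and recall for class 1 by hand on a few made-up labels (not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(tn, fp, fn, tp)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Rows of the matrix are the true classes and columns are the predicted classes, so the off-diagonal cells are the false positives and false negatives.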


## Step 9: Model and Scaler Saving
This code saves the trained model and the feature scaler to disk as model.pkl and scaler.pkl files using joblib, allowing for future use or deployment.

In [38]:
# Save the model and the scaler
joblib.dump(model, 'model.pkl')
joblib.dump(scaler, 'scaler.pkl')

['scaler.pkl']
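
The saved files can later be reloaded with `joblib.load`. The sketch below is a round trip on a small stand-in model and scaler written to a temporary directory (illustrative data, not the real dataset); it assumes new rows have the same columns, in the same order, as the training features.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Small stand-in model and scaler (illustrative data, not the real dataset).
X_small = np.array([[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]])
y_small = np.array([0, 1, 0, 1])
small_scaler = StandardScaler().fit(X_small)
small_model = RandomForestClassifier(n_estimators=10, random_state=42)
small_model.fit(small_scaler.transform(X_small), y_small)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"
    scaler_path = Path(tmp_dir) / "scaler.pkl"
    joblib.dump(small_model, model_path)
    joblib.dump(small_scaler, scaler_path)

    # Reload both artifacts, as a serving script would.
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

new_rows = np.array([[0.1, 0.8], [0.95, 0.05]])
# Apply the saved scaler before predicting, mirroring the training steps.
predictions = loaded_model.predict(loaded_scaler.transform(new_rows))
print(predictions)
```

Saving the scaler alongside the model matters because predictions on unscaled input would not match what the model saw during training.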