## 📌 Project Overview / Summary
**Project Title**: Cyber Security – Network Intrusion Detection Case Study

----------------
## Objective:
The goal of this project is to build machine learning models that detect network intrusions, framed as two classification tasks:

**Binary classification**: Normal vs. Attack

**Multiclass classification**: Normal vs. various specific attacks (Back, BufferOverflow, FTPWrite, GuessPassword, Neptune, NMap, PortSweep, RootKit, Satan, Smurf).

---------------------
## Dataset Description:
The dataset consists of network traffic records spread across multiple CSV files, one file per traffic type (ten attack types plus normal traffic).

Each record contains 41 features including:

Basic features (duration, protocol_type, service, flag, src_bytes, dst_bytes, etc.)

Content-based features (num_failed_logins, logged_in, etc.)

Traffic features (count, serror_rate, etc.)

A new column ‘attack’ was created to label the type of activity.

-------------------------
## Data Preparation:
✅ All CSV files were combined into a single dataset.
✅ The majority class (Normal, 576,710 records) was downsampled to 240,841 records to match the combined number of attack records.
✅ The ‘attack’ column was used to define:

Binary classification target: 0 (Normal) vs. 1 (Attack)

Multiclass classification target: one class for Normal and one for each specific attack type.
✅ Categorical features (protocol_type, service, flag) were label encoded, and the mappings were stored in a dictionary for interpretability.
✅ Numerical features were standardized using StandardScaler.
✅ The train-test split (80/20) used stratified sampling so that both splits preserve the class proportions.

----------------------
## Modeling Approach:
We explored and evaluated multiple machine learning models:

Logistic Regression

Decision Tree Classifier

Random Forest Classifier

Each model was trained and evaluated on both binary and multiclass classification tasks.

------------------------
## Key Findings:
✅ All models achieved very high accuracy on both training and testing sets: about 99.7% for Logistic Regression and 99.98% or higher for Decision Tree and Random Forest.
✅ No significant overfitting observed (train and test scores were nearly identical).
✅ Feature importance plots confirmed that the dataset contains strong signals distinguishing between classes.

------------------------
## Conclusions:
The network traffic dataset is highly predictive of attacks, enabling effective detection.

Logistic Regression, Decision Tree, and Random Forest all reached test accuracy within about 0.3 percentage points of one another, so the result does not hinge on a particular model family.

Because it is simple and interpretable, Logistic Regression can be considered the preferred model, while Random Forest offers an additional perspective on feature importance.

#### Importing some important libraries

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import os

In [3]:
os.chdir('C:\\Users\\ASUS\\OneDrive\\Desktop\\python case studies\\Machine Learning Projects\\12. Capstone Case Study - Cyber Security Case Study')

In [6]:
back = pd.read_csv('Data_of_Attack_Back.csv')
back.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0.0,0,0,0.0,0.5454,0.08314,0,0,0,0.2,...,0.001,0.001,0.1,0,0.1,0,0.0,0.0,0.0,0.0
1,0.0,0,0,0.0,0.5454,0.08314,0,0,0,0.2,...,0.002,0.002,0.1,0,0.05,0,0.0,0.0,0.0,0.0
2,0.0,0,0,0.0,0.5454,0.08314,0,0,0,0.2,...,0.003,0.003,0.1,0,0.033,0,0.0,0.0,0.0,0.0
3,0.0,0,0,0.0,0.5454,0.08314,0,0,0,0.2,...,0.004,0.004,0.1,0,0.025,0,0.0,0.0,0.0,0.0
4,0.0,0,0,0.0,0.5454,0.08314,0,0,0,0.2,...,0.005,0.005,0.1,0,0.02,0,0.0,0.0,0.0,0.0


In [8]:
bufferoverflow = pd.read_csv('Data_of_Attack_Back_BufferOverflow.csv')
ftpwrite = pd.read_csv('Data_of_Attack_Back_FTPWrite.csv',header=None)
guesspassword = pd.read_csv('Data_of_Attack_Back_GuessPassword.csv')
neptune = pd.read_csv('Data_of_Attack_Back_Neptune.csv')
nmap = pd.read_csv('Data_of_Attack_Back_NMap.csv')
normal = pd.read_csv('Data_of_Attack_Back_Normal.csv')
portsweep = pd.read_csv('Data_of_Attack_Back_PortSweep.csv')
rootkit = pd.read_csv('Data_of_Attack_Back_RootKit.csv')
satan = pd.read_csv('Data_of_Attack_Back_Satan.csv')
smurf = pd.read_csv('Data_of_Attack_Back_Smurf.csv')
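The near-identical `read_csv` calls above could also be driven from a mapping of label to source. The following is only a sketch: `load_labeled_frames` is a hypothetical helper, not part of the notebook, and the in-memory CSV text in the demo stands in for the real files. It also handles the one file that has no header row by borrowing column names from a file that does.

```python
import io

import pandas as pd


def load_labeled_frames(sources, headerless=()):
    """Read each CSV source and tag its rows with the label it is stored under.

    `sources` maps a label (e.g. 'Back') to a path or file-like object.
    Labels listed in `headerless` are read with header=None and borrow their
    column names from the first source that does have a header row.
    """
    frames = {}
    for label, src in sources.items():
        header = None if label in headerless else "infer"
        frames[label] = pd.read_csv(src, header=header)

    # Column names come from the first source that had a header row
    # (raises StopIteration if every source is headerless).
    reference = next(
        list(frame.columns) for lbl, frame in frames.items() if lbl not in headerless
    )
    for label, frame in frames.items():
        if label in headerless:
            frame.columns = reference
        frame["attack"] = label
    return pd.concat(list(frames.values()), ignore_index=True)


# Tiny in-memory stand-ins for two of the files.
demo = load_labeled_frames(
    {
        "Back": io.StringIO("duration, src_bytes\n0.0,0.5\n0.1,0.6\n"),
        "FtpWrite": io.StringIO("0.2,0.7\n"),
    },
    headerless={"FtpWrite"},
)
print(demo)
```

Keeping the label next to its source in one dictionary makes it harder for a file and its `attack` value to drift apart when files are added or renamed.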

In [9]:
for dfs in [back,bufferoverflow,ftpwrite,guesspassword,neptune,nmap,normal,portsweep,rootkit,satan,smurf]:
    dfs.info()
    print('______________________________________________')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 968 entries, 0 to 967
Data columns (total 41 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   duration                      968 non-null    float64
 1    protocol_type                968 non-null    int64  
 2    service                      968 non-null    int64  
 3    flag                         968 non-null    float64
 4    src_bytes                    968 non-null    float64
 5    dst_bytes                    968 non-null    float64
 6    land                         968 non-null    int64  
 7    wrong_fragment               968 non-null    int64  
 8    urgent                       968 non-null    int64  
 9    hot                          968 non-null    float64
 10   num_failed_logins            968 non-null    int64  
 11   logged_in                    968 non-null    float64
 12   num_compromised              968 non-null    float64
 13   root

In [10]:
back.columns.values

array(['duration', ' protocol_type', ' service', ' flag', ' src_bytes',
       ' dst_bytes', ' land', ' wrong_fragment', ' urgent', ' hot',
       ' num_failed_logins', ' logged_in', ' num_compromised',
       ' root_shell', ' su_attempted', ' num_root', ' num_file_creations',
       ' num_shells', ' num_access_files', ' num_outbound_cmds',
       ' is_host_login', ' is_guest_login', ' count', ' srv_count',
       ' serror_rate', ' srv_error_rate', ' rerror_rate',
       ' srv_rerror_rate', ' same_srv_rate', ' diff_srv_rate',
       ' srv_diff_host_rate', ' dst_host_count', ' dst_host_srv_count',
       ' dst_host_same_srv_rate', ' dst_host_diff_srv_rate',
       ' dst_host_same_src_port_rate', ' dst_host_srv_diff_host_rate',
       ' dst_host_serror_rate', ' dst_host_srv_serror_rate',
       ' dst_host_rerror_rate', ' dst_host_srv_rerror_rate'], dtype=object)

#### Data Preparation

In [12]:
ftpwrite.columns = back.columns.values
ftpwrite.head()

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate
0,0.0026,0,0.07,0,0.00116,0.00451,0,0,0.0,0.2,...,0.001,0.001,0.1,0.0,0.1,0.0,0,0,0,0
1,0.0134,0,0.34,0,0.001,0.39445,0,0,0.2,0.0,...,0.002,0.001,0.05,0.1,0.05,0.0,0,0,0,0
2,0.0,0,0.14,0,0.00613,0.0,0,0,0.0,0.0,...,0.001,0.084,0.1,0.0,0.1,0.002,0,0,0,0
3,0.0,0,0.14,0,0.0,5e-05,0,0,0.0,0.0,...,0.002,0.085,0.1,0.0,0.1,0.002,0,0,0,0
4,0.0032,0,0.07,0,0.00104,0.00449,0,0,0.0,0.2,...,0.001,0.001,0.1,0.0,0.1,0.0,0,0,0,0


In [17]:
back['attack'] = 'Back'
bufferoverflow['attack'] = 'Bufferoverflow'
ftpwrite['attack'] = 'FtpWrite'
guesspassword['attack'] = 'GuessPassword'
neptune['attack'] = 'Neptune'
nmap['attack'] = 'NMap'
normal['attack'] = 'Normal'
portsweep['attack'] = 'PortSweep'
rootkit['attack'] = 'RootKit'
satan['attack'] = 'Satan'
smurf['attack'] = 'Smurf'

In [19]:
df = pd.concat([back,bufferoverflow,ftpwrite,guesspassword,neptune,nmap,normal,portsweep,rootkit,satan,smurf],ignore_index=True)

In [21]:
df

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack
0,0.0,0.00,0.00,0.0,0.54540,0.08314,0,0.0,0.0,0.2,...,0.001,0.100,0.000,0.100,0.0,0.0,0.0,0.0,0.0,Back
1,0.0,0.00,0.00,0.0,0.54540,0.08314,0,0.0,0.0,0.2,...,0.002,0.100,0.000,0.050,0.0,0.0,0.0,0.0,0.0,Back
2,0.0,0.00,0.00,0.0,0.54540,0.08314,0,0.0,0.0,0.2,...,0.003,0.100,0.000,0.033,0.0,0.0,0.0,0.0,0.0,Back
3,0.0,0.00,0.00,0.0,0.54540,0.08314,0,0.0,0.0,0.2,...,0.004,0.100,0.000,0.025,0.0,0.0,0.0,0.0,0.0,Back
4,0.0,0.00,0.00,0.0,0.54540,0.08314,0,0.0,0.0,0.2,...,0.005,0.100,0.000,0.020,0.0,0.0,0.0,0.0,0.0,Back
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
817546,0.0,0.02,0.09,0.0,0.01032,0.00000,0,0.0,0.0,0.0,...,0.251,0.098,0.001,0.098,0.0,0.0,0.0,0.0,0.0,Smurf
817547,0.0,0.02,0.09,0.0,0.01032,0.00000,0,0.0,0.0,0.0,...,0.252,0.099,0.001,0.099,0.0,0.0,0.0,0.0,0.0,Smurf
817548,0.0,0.02,0.09,0.0,0.01032,0.00000,0,0.0,0.0,0.0,...,0.253,0.099,0.001,0.099,0.0,0.0,0.0,0.0,0.0,Smurf
817549,0.0,0.02,0.09,0.0,0.01032,0.00000,0,0.0,0.0,0.0,...,0.254,0.100,0.001,0.100,0.0,0.0,0.0,0.0,0.0,Smurf


In [23]:
column = []
for i in df.columns:
    j = i.strip()
    column.append(j)

In [25]:
df.columns = column

In [27]:
df.columns

Index(['duration', 'protocol_type', 'service', 'flag', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_access_files', 'num_outbound_cmds', 'is_host_login',
       'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_error_rate',
       'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
       'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
       'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
       'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
       'dst_host_serror_rate', 'dst_host_srv_serror_rate',
       'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'attack'],
      dtype='object')
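The loop above can be written more compactly with the vectorized string accessor on the column index. A minimal sketch on a toy frame (the `demo` frame is illustrative only):

```python
import pandas as pd

# Column names with the same leading-space problem as the raw files.
demo = pd.DataFrame({"duration": [0.0], " protocol_type": [0], " service": [0]})

# Index.str.strip() removes leading and trailing whitespace from every name.
demo.columns = demo.columns.str.strip()
print(list(demo.columns))
```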

#### Handling Null values

In [30]:
for col in df.columns:
    print(col,':',df[col].isnull().sum())

duration : 0
protocol_type : 0
service : 0
flag : 0
src_bytes : 0
dst_bytes : 0
land : 0
wrong_fragment : 0
urgent : 0
hot : 0
num_failed_logins : 0
logged_in : 0
num_compromised : 0
root_shell : 0
su_attempted : 0
num_root : 0
num_file_creations : 0
num_shells : 0
num_access_files : 0
num_outbound_cmds : 0
is_host_login : 0
is_guest_login : 0
count : 0
srv_count : 0
serror_rate : 0
srv_error_rate : 0
rerror_rate : 0
srv_rerror_rate : 0
same_srv_rate : 0
diff_srv_rate : 0
srv_diff_host_rate : 0
dst_host_count : 0
dst_host_srv_count : 0
dst_host_same_srv_rate : 0
dst_host_diff_srv_rate : 0
dst_host_same_src_port_rate : 0
dst_host_srv_diff_host_rate : 0
dst_host_serror_rate : 0
dst_host_srv_serror_rate : 0
dst_host_rerror_rate : 0
dst_host_srv_rerror_rate : 0
attack : 0
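With 42 columns, printing every count makes it easy to overlook a nonzero value. A minimal sketch that reports only the columns containing nulls, using a toy frame with one missing value (illustrative data, not from the dataset):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"duration": [0.0, np.nan, 0.2], "flag": [0, 1, 0]})

# Count nulls per column in one call, then keep only the nonzero counts.
null_counts = demo.isnull().sum()
columns_with_nulls = null_counts[null_counts > 0]
print(columns_with_nulls)
```

An empty result means no column has missing values, which is the case for the combined dataset above.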


#### Handling Class Imbalance 

In [33]:
df['attack'].value_counts()

attack
Normal            576710
Neptune           227228
Satan               5019
Smurf               3007
PortSweep           2964
NMap                1554
Back                 968
GuessPassword         53
Bufferoverflow        30
RootKit               10
FtpWrite               8
Name: count, dtype: int64

In [35]:
print(df['attack'].value_counts()/df.shape[0])

attack
Normal            0.705412
Neptune           0.277937
Satan             0.006139
Smurf             0.003678
PortSweep         0.003625
NMap              0.001901
Back              0.001184
GuessPassword     0.000065
Bufferoverflow    0.000037
RootKit           0.000012
FtpWrite          0.000010
Name: count, dtype: float64


In [37]:
df_majority = df[df['attack']=='Normal']
df_minority = df[df['attack']!='Normal']

#### Downsampling the majority data

In [40]:
from sklearn.utils import resample

In [42]:
df_majority_downsampled = resample(df_majority,replace=False,n_samples=len(df_minority),random_state=42)

In [44]:
df_majority_downsampled.shape

(240841, 42)

In [46]:
df_majority.shape

(576710, 42)

In [48]:
df_balanced = pd.concat([df_majority_downsampled,df_minority])
df_balanced.shape

(481682, 42)
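Downsampling balances Normal against all attacks combined, but the attack classes themselves remain very uneven (227,228 Neptune rows versus 8 FtpWrite rows). One common alternative or complement, not used in this notebook, is to weight classes inversely to their frequency. The sketch below uses scikit-learn's `compute_class_weight` on toy labels; the label counts are made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy label vector with an imbalance similar in spirit to the real data.
labels = np.array(["Normal"] * 6 + ["Neptune"] * 3 + ["FtpWrite"] * 1)
classes = np.unique(labels)

# 'balanced' assigns n_samples / (n_classes * class_count) to each class.
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
weight_by_class = dict(zip(classes, weights))
print(weight_by_class)
```

Many scikit-learn classifiers, including the three used later, accept `class_weight='balanced'` directly, which applies the same formula during fitting.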

#### Creating Classification Labels

In [51]:
df_balanced['attack_binary'] = df_balanced['attack'].apply(lambda x: 0 if x=='Normal' else 1)

In [53]:
df_balanced['attack_multiclass'] = df_balanced['attack']

In [55]:
df_balanced.head(3)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack,attack_binary,attack_multiclass
800847,0.0,0.0,0.14,0.0,0.00641,0.0,0,0.0,0.0,0.0,...,0.002,0.016,0.003,0.0,0.002,0.0,0.0,Normal,0,Normal
752065,0.0,0.01,0.02,0.0,0.00042,0.00042,0,0.0,0.0,0.0,...,0.002,0.001,0.0,0.0,0.0,0.0,0.0,Normal,0,Normal
763867,0.0001,0.0,0.01,0.0,0.01324,0.00339,0,0.0,0.0,0.0,...,0.006,0.001,0.0,0.0,0.0,0.0,0.0,Normal,0,Normal
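The binary target can also be derived without a row-wise `apply`, which tends to be faster on a frame of this size. A minimal sketch on toy labels:

```python
import pandas as pd

demo = pd.DataFrame({"attack": ["Normal", "Neptune", "Normal", "Smurf"]})

# The comparison yields a boolean Series; casting to int gives 0 for Normal, 1 otherwise.
demo["attack_binary"] = (demo["attack"] != "Normal").astype(int)
print(demo)
```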


#### Encoding

In [58]:
cat_col = ['protocol_type','service','flag','attack','attack_binary','attack_multiclass']
cont_col = ['duration', 'src_bytes',
       'dst_bytes', 'land', 'wrong_fragment', 'urgent', 'hot',
       'num_failed_logins', 'logged_in', 'num_compromised', 'root_shell',
       'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
       'num_access_files', 'num_outbound_cmds', 'is_host_login',
       'is_guest_login', 'count', 'srv_count', 'serror_rate', 'srv_error_rate',
       'rerror_rate', 'srv_rerror_rate', 'same_srv_rate', 'diff_srv_rate',
       'srv_diff_host_rate', 'dst_host_count', 'dst_host_srv_count',
       'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
       'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate',
       'dst_host_serror_rate', 'dst_host_srv_serror_rate',
       'dst_host_rerror_rate', 'dst_host_srv_rerror_rate',]

In [60]:
from sklearn.preprocessing import LabelEncoder

In [62]:
le = LabelEncoder()
mapping_dict = {}
for col in cat_col:
    df_balanced[col] = le.fit_transform(df_balanced[col])
    temp_dict = {original: encoded for encoded, original in enumerate(le.classes_)}
    mapping_dict[col] = temp_dict
    
print(mapping_dict)

{'protocol_type': {0.0: 0, 0.01: 1, 0.02: 2}, 'service': {0.0: 0, 0.01: 1, 0.02: 2, 0.03: 3, 0.04: 4, 0.05: 5, 0.06: 6, 0.07: 7, 0.08: 8, 0.09: 9, 0.1: 10, 0.11: 11, 0.12: 12, 0.13: 13, 0.14: 14, 0.15: 15, 0.16: 16, 0.17: 17, 0.18: 18, 0.19: 19, 0.2: 20, 0.21: 21, 0.22: 22, 0.23: 23, 0.24: 24, 0.25: 25, 0.26: 26, 0.27: 27, 0.28: 28, 0.29: 29, 0.3: 30, 0.31: 31, 0.32: 32, 0.33: 33, 0.34: 34, 0.35: 35, 0.36: 36, 0.37: 37, 0.38: 38, 0.39: 39, 0.4: 40, 0.41: 41, 0.42: 42, 0.43: 43, 0.44: 44, 0.45: 45, 0.46: 46, 0.47: 47, 0.48: 48, 0.49: 49, 0.5: 50, 0.51: 51, 0.52: 52, 0.53: 53, 0.54: 54, 0.55: 55, 0.56: 56, 0.57: 57, 0.58: 58, 0.59: 59, 0.6: 60, 0.61: 61, 0.62: 62, 0.63: 63, 0.64: 64, 0.65: 65, 0.67: 66}, 'flag': {0.0: 0, 0.01: 1, 0.02: 2, 0.03: 3, 0.04: 4, 0.05: 5, 0.06: 6, 0.07: 7, 0.08: 8, 0.09: 9, 0.1: 10}, 'attack': {'Back': 0, 'Bufferoverflow': 1, 'FtpWrite': 2, 'GuessPassword': 3, 'NMap': 4, 'Neptune': 5, 'Normal': 6, 'PortSweep': 7, 'RootKit': 8, 'Satan': 9, 'Smurf': 10}, 'attack_
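Because the loop above reuses a single `LabelEncoder`, only the last column's fit is retained on the encoder, and earlier columns can be decoded only through `mapping_dict`. A possible variation, sketched here on toy data, keeps one fitted encoder per column so that `inverse_transform` remains available. The `encoders` dictionary is an assumption of this sketch, not something the notebook defines.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

demo = pd.DataFrame(
    {
        "protocol_type": [0.0, 0.01, 0.02, 0.01],
        "attack": ["Normal", "Neptune", "Normal", "Smurf"],
    }
)

# One fitted encoder per column, stored under the column name.
encoders = {}
for col in ["protocol_type", "attack"]:
    encoders[col] = LabelEncoder()
    demo[col] = encoders[col].fit_transform(demo[col])

# Integer codes can be mapped back to the original labels at any time.
decoded = encoders["attack"].inverse_transform(demo["attack"])
print(list(decoded))
```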

In [63]:
df_balanced.head(3)

Unnamed: 0,duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,...,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,attack,attack_binary,attack_multiclass
800847,0.0,0,14,0,0.00641,0.0,0,0.0,0.0,0.0,...,0.002,0.016,0.003,0.0,0.002,0.0,0.0,6,0,6
752065,0.0,1,2,0,0.00042,0.00042,0,0.0,0.0,0.0,...,0.002,0.001,0.0,0.0,0.0,0.0,0.0,6,0,6
763867,0.0001,0,1,0,0.01324,0.00339,0,0.0,0.0,0.0,...,0.006,0.001,0.0,0.0,0.0,0.0,0.0,6,0,6


#### Standardization

In [67]:
from sklearn.preprocessing import StandardScaler

In [69]:
ss = StandardScaler()
df_balanced[cont_col] = ss.fit_transform(df_balanced[cont_col])

In [70]:
x = df_balanced.drop(columns=['attack','attack_binary', 'attack_multiclass'])
y_binary = df_balanced['attack_binary']
y_multiclass = df_balanced['attack_multiclass']
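In the cells above the scaler is fitted on all of `df_balanced` before the train-test split, so the test rows contribute to the means and variances used for scaling. On a dataset this large the effect is probably small, but a common way to rule it out is to fit the scaler on the training rows only, for example by wrapping it in a pipeline. The sketch below uses synthetic data and is an alternative arrangement, not what the notebook does.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=64, stratify=y
)

# fit() scales using training statistics only; score() reuses them on the test set.
model = make_pipeline(StandardScaler(), LogisticRegression(C=0.01))
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

scaler = model.named_steps["standardscaler"]
print(scaler.n_samples_seen_, test_accuracy)
```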

#### Train Test Split

In [74]:
from sklearn.model_selection import train_test_split 

In [76]:
x_train_bin,x_test_bin,y_train_bin,y_test_bin = train_test_split(x,y_binary,test_size=0.2,random_state=64,stratify=y_binary)

In [77]:
x_train_multi,x_test_multi,y_train_multi,y_test_multi = train_test_split(x,y_multiclass,test_size=0.2,random_state=64,stratify=y_multiclass)

In [79]:
x_train_bin.shape,x_test_bin.shape

((385345, 41), (96337, 41))

In [80]:
x_train_multi.shape,x_test_multi.shape

((385345, 41), (96337, 41))
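Stratifying does not balance the classes; it keeps each class's share roughly the same in the training and test sets. A minimal sketch that checks this on toy labels (class counts chosen for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

y = pd.Series(["Normal"] * 80 + ["Neptune"] * 15 + ["Satan"] * 5)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=64, stratify=y
)

# Per-class share in each split; with stratify=y these should match closely.
train_share = y_train.value_counts(normalize=True)
test_share = y_test.value_counts(normalize=True)
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Running the same comparison on `y_train_multi` and `y_test_multi` would show whether the rarest classes, such as FtpWrite with 8 rows, are represented in both splits.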

#### Model Selection

Logistic Regression

In [86]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report
import warnings
warnings.filterwarnings('ignore')

In [103]:
lr = LogisticRegression(penalty='l2', C=0.01)
lr.fit(x_train_bin,y_train_bin)
bin_pred = lr.predict(x_test_bin)
accuracy_score(y_test_bin, bin_pred)  # signature is (y_true, y_pred)

0.9970208746379896

In [104]:
bin_train_pred = lr.predict(x_train_bin)
accuracy_score(y_train_bin, bin_train_pred)

0.9972777640815373

In [127]:
lr1 = LogisticRegression(penalty='l2', C=0.01, multi_class='ovr', max_iter=500)
lr1.fit(x_train_multi, y_train_multi)
multi_pred = lr1.predict(x_test_multi)
accuracy_score(y_test_multi,multi_pred)

0.9971142966876694

In [128]:
multi_train_pred = lr1.predict(x_train_multi)
accuracy_score(y_train_multi,multi_train_pred)

0.9967276077281397
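With classes as rare as FtpWrite (8 rows) and RootKit (10 rows), overall accuracy says little about whether those attacks are actually detected. `classification_report` is imported above but not called in the cells shown; per-class recall and a confusion matrix give the same kind of breakdown. The sketch below uses made-up labels to show how a 90% accuracy can coexist with 50% recall on a rare class.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels: 8 Normal rows, 2 RootKit rows, one RootKit row misclassified.
y_true = ["Normal"] * 8 + ["RootKit"] * 2
y_pred = ["Normal"] * 8 + ["Normal", "RootKit"]

labels = ["Normal", "RootKit"]
accuracy = accuracy_score(y_true, y_pred)
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)
matrix = confusion_matrix(y_true, y_pred, labels=labels)  # rows: true, cols: predicted

print(accuracy)
print(dict(zip(labels, per_class_recall)))
print(matrix)
```

The same calls applied to `y_test_multi` and `multi_pred` would show how each attack type fares individually.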

Decision Tree Classifier

In [94]:
from sklearn.tree import DecisionTreeClassifier

In [97]:
dtc = DecisionTreeClassifier()
dtc.fit(x_train_bin,y_train_bin)
bin_pred = dtc.predict(x_test_bin)
accuracy_score(y_test_bin, bin_pred)

0.9997820152174138

In [94]:
bin_train_pred = dtc.predict(x_train_bin)
accuracy_score(y_train_bin, bin_train_pred)

0.9999974049228614

In [95]:
dtc1 = DecisionTreeClassifier()
dtc1.fit(x_train_multi,y_train_multi)
multi_pred = dtc1.predict(x_test_multi)
accuracy_score(y_test_multi,multi_pred)

0.9997820152174138

In [97]:
multi_train_pred = dtc1.predict(x_train_multi)
accuracy_score(y_train_multi,multi_train_pred)

0.9999974049228614
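The fit, predict and score pattern is repeated for every model and task. It could be wrapped in a small helper so that train and test accuracy are always computed the same way. `train_and_score` is a hypothetical name introduced for this sketch, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_and_score(model, X_train, X_test, y_train, y_test):
    """Fit `model` and return (train_accuracy, test_accuracy)."""
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    return train_acc, test_acc


# Synthetic stand-in for the real feature matrix and target.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=64, stratify=y
)

train_acc, test_acc = train_and_score(
    DecisionTreeClassifier(random_state=42), X_train, X_test, y_train, y_test
)
print(train_acc, test_acc)
```

Passing `random_state` to the tree, as done here, also makes repeated runs reproducible.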

Random Forest Classifier

In [105]:
from sklearn.ensemble import RandomForestClassifier

In [106]:
rc1 = RandomForestClassifier(n_estimators=100,random_state=42)
rc1.fit(x_train_bin,y_train_bin)
bin_pred = rc1.predict(x_test_bin)
accuracy_score(y_test_bin, bin_pred)

0.999896197722578

In [107]:
bin_train_pred = rc1.predict(x_train_bin)
accuracy_score(y_train_bin, bin_train_pred)

0.9999974049228614

In [108]:
rc2 = RandomForestClassifier(n_estimators=100,random_state=42)
rc2.fit(x_train_multi,y_train_multi)
multi_pred = rc2.predict(x_test_multi)
accuracy_score(y_test_multi,multi_pred)

0.9998235361283827

In [109]:
multi_train_pred = rc2.predict(x_train_multi)
accuracy_score(y_train_multi,multi_train_pred)

0.9999974049228614
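A fitted random forest exposes impurity-based importances through `feature_importances_`, which is one way to obtain the feature importance view mentioned in the overview. The sketch below runs on synthetic data with placeholder feature names; on the real data the index would be `x.columns` and the model `rc1` or `rc2`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, 3 of them informative.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# Importances are normalized to sum to 1; sort so the strongest features come first.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances)
```

Calling `importances.plot(kind="barh")` would turn the ranking into a bar chart with matplotlib.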