# **Load data**

## Objectives

* This notebook aims to load the dataset into a pandas DataFrame, perform basic data cleaning and validation, and save the cleaned version as a new CSV file for further analysis.

## Inputs

* **Data source:** [Cybersecurity Intrusion Detection Dataset](https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset)  
* **Author:** [dnkumars](https://www.kaggle.com/dnkumars)  
* **Raw file:** `cybersecurity_intrusion_data.csv`  
* **Location:** `data/raw/cybersecurity_intrusion_data.csv`

## Outputs

* **Cleaned file:** `cybersecurity_intrusion_data_cleaned.csv`  
* **Location:** `data/processed/cybersecurity_intrusion_data_cleaned.csv`

## Additional Comments

* As stated by the dataset author, parts of the data are synthetic and intended for educational and research purposes.

---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* `os.getcwd()` returns the current directory

In [2]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Code projects\\final-hackathon\\cybersecurity-intrusion-detection\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [3]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [4]:
current_dir = os.getcwd()
current_dir

'c:\\Code projects\\final-hackathon\\cybersecurity-intrusion-detection'
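Note that re-running the `os.chdir` cell moves the working directory up one more level each time. A minimal sketch of a guarded alternative follows, assuming the notebooks live in a folder named `jupyter_notebooks` (as in the path printed above). The helper name `project_root_from` is hypothetical and not part of the original notebook.

```python
import os

def project_root_from(path, notebooks_dir="jupyter_notebooks"):
    """Return the parent of `path` only if `path` is the notebooks folder."""
    # Normalise first so a trailing separator does not hide the folder name
    normalized = os.path.normpath(path)
    if os.path.basename(normalized) == notebooks_dir:
        return os.path.dirname(normalized)
    # Not inside the notebooks folder: leave the path unchanged
    return normalized

# Hypothetical usage in the notebook:
# os.chdir(project_root_from(os.getcwd()))
```

With this guard, running the cell a second time leaves the working directory where it is.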

---

# Initial Setup and Extracting Data

Importing the libraries used for data loading and cleaning:
* pandas for data manipulation and analysis
* numpy for numerical operations

In [None]:
import numpy as np
import pandas as pd

In [6]:
df = pd.read_csv('data/raw/cybersecurity_intrusion_data.csv')
df

Unnamed: 0,session_id,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected
0,SID_00001,599,TCP,4,492.983263,DES,0.606818,1,Edge,0,1
1,SID_00002,472,TCP,3,1557.996461,DES,0.301569,0,Firefox,0,0
2,SID_00003,629,TCP,3,75.044262,DES,0.739164,2,Chrome,0,1
3,SID_00004,804,UDP,4,601.248835,DES,0.123267,0,Unknown,0,1
4,SID_00005,453,TCP,5,532.540888,AES,0.054874,1,Firefox,0,0
...,...,...,...,...,...,...,...,...,...,...,...
9532,SID_09533,194,ICMP,3,226.049889,AES,0.517737,3,Chrome,0,1
9533,SID_09534,380,TCP,3,182.848475,,0.408485,0,Chrome,0,0
9534,SID_09535,664,TCP,5,35.170248,AES,0.359200,1,Firefox,0,0
9535,SID_09536,406,TCP,4,86.664703,AES,0.537417,1,Chrome,1,0


## Feature Description
Descriptions of the features were provided with the dataset on [Kaggle](https://www.kaggle.com/datasets/dnkumars/cybersecurity-intrusion-detection-dataset). Below is a brief summary of each feature:

| **Feature Name**          | **Category**              | **Description** |
|-----------------------------|----------------------------|-----------------|
| `network_packet_size`       | Network-Based             | Represents the size of network packets (64–1500 bytes). Small packets (~64 bytes) may be control messages; large packets (~1500 bytes) carry bulk data. Abnormally small or large packets can indicate reconnaissance or exploitation. |
| `protocol_type`             | Network-Based             | Communication protocol used in the session: **TCP** (reliable, common for HTTP/HTTPS/SSH), **UDP** (fast, less reliable, used in VoIP/streaming), or **ICMP** (network diagnostics, often abused in DoS attacks). |
| `encryption_used`           | Network-Based             | Encryption protocol: **AES** (strong), **DES** (weak, outdated), or **None** (unencrypted). Attackers might avoid encryption or use weak ones to exploit vulnerabilities. |
| `login_attempts`            | User Behavior-Based       | Number of login attempts. Typical users: 1–3. High values may indicate brute-force attacks with hundreds or thousands of attempts. |
| `session_duration`          | User Behavior-Based       | Session length in seconds. Very long sessions may indicate unauthorized access or an attacker maintaining persistence. |
| `failed_logins`             | User Behavior-Based       | Number of failed login attempts. High values suggest credential stuffing or dictionary attacks. A pattern of many failed attempts followed by a success could indicate compromise. |
| `unusual_time_access`       | User Behavior-Based       | Binary flag (0 or 1) indicating login at an unusual time. Attackers often access systems outside normal business hours to evade detection. |
| `ip_reputation_score`       | User Behavior-Based       | A score (0–1) representing IP trustworthiness. Higher scores indicate more suspicious activity (e.g., botnets, spam, or prior attacks). |
| `browser_type`              | User Behavior-Based       | User’s browser (Chrome, Firefox, Edge, Safari, etc.). Unknown browsers could indicate bots or automated scripts. |
| `attack_detected`           | **Target Variable**       | Binary classification target: **1** = attack detected, **0** = normal activity. |
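
The ranges stated in the table can be turned into quick sanity checks. The sketch below counts rows that fall outside them; the helper name `range_violations` and the toy rows are made up for illustration, and the bounds are taken from the descriptions above rather than from any official schema.

```python
import pandas as pd

def range_violations(df):
    """Count rows that fall outside the ranges stated in the feature table."""
    checks = {
        # between() is inclusive on both ends by default
        "network_packet_size": ~df["network_packet_size"].between(64, 1500),
        "ip_reputation_score": ~df["ip_reputation_score"].between(0, 1),
        "unusual_time_access": ~df["unusual_time_access"].isin([0, 1]),
        "attack_detected": ~df["attack_detected"].isin([0, 1]),
    }
    return {name: int(mask.sum()) for name, mask in checks.items()}

# Toy rows (illustrative only, not taken from the dataset)
sample = pd.DataFrame({
    "network_packet_size": [599, 40],
    "ip_reputation_score": [0.61, 1.2],
    "unusual_time_access": [0, 1],
    "attack_detected": [1, 0],
})
violations = range_violations(sample)
print(violations)
```

Calling `range_violations(df)` on the loaded DataFrame would give one count per checked column; a non-zero count points to rows worth inspecting.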



---

# Data Cleaning
The dataset will be cleaned to handle missing values, duplicate entries and other inconsistencies.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9537 entries, 0 to 9536
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   session_id           9537 non-null   object 
 1   network_packet_size  9537 non-null   int64  
 2   protocol_type        9537 non-null   object 
 3   login_attempts       9537 non-null   int64  
 4   session_duration     9537 non-null   float64
 5   encryption_used      7571 non-null   object 
 6   ip_reputation_score  9537 non-null   float64
 7   failed_logins        9537 non-null   int64  
 8   browser_type         9537 non-null   object 
 9   unusual_time_access  9537 non-null   int64  
 10  attack_detected      9537 non-null   int64  
dtypes: float64(2), int64(5), object(4)
memory usage: 819.7+ KB


## Handling Missing Values and Duplicates

In [8]:
df.isna().sum()

session_id                0
network_packet_size       0
protocol_type             0
login_attempts            0
session_duration          0
encryption_used        1966
ip_reputation_score       0
failed_logins             0
browser_type              0
unusual_time_access       0
attack_detected           0
dtype: int64

In [9]:
e_used_unique_counts = df['encryption_used'].value_counts(dropna=False).reset_index()
e_used_unique_counts

Unnamed: 0,encryption_used,count
0,AES,4706
1,DES,2865
2,,1966


In [10]:
df['encryption_used'] = df['encryption_used'].fillna('No encryption')

In [11]:
df.isna().sum()

session_id             0
network_packet_size    0
protocol_type          0
login_attempts         0
session_duration       0
encryption_used        0
ip_reputation_score    0
failed_logins          0
browser_type           0
unusual_time_access    0
attack_detected        0
dtype: int64
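
The fill step can be checked in isolation on a toy column. This is a small sketch with made-up values, assuming (as the notebook does) that a missing `encryption_used` entry means the session was unencrypted.

```python
import pandas as pd

# Toy column (illustrative only); None stands in for a missing entry
toy = pd.DataFrame({"encryption_used": ["AES", "DES", None, "AES", None]})

missing_before = int(toy["encryption_used"].isna().sum())

# Replace missing entries with an explicit category label
toy["encryption_used"] = toy["encryption_used"].fillna("No encryption")

missing_after = int(toy["encryption_used"].isna().sum())
counts = toy["encryption_used"].value_counts()
print(missing_before, missing_after)
print(counts)
```

Using an explicit label keeps these rows as their own category instead of dropping them or leaving `NaN` in a categorical column.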

In [12]:
df.duplicated().sum()

0

## Checking for Data Inconsistencies

In [13]:
df.describe()

Unnamed: 0,network_packet_size,login_attempts,session_duration,ip_reputation_score,failed_logins,unusual_time_access,attack_detected
count,9537.0,9537.0,9537.0,9537.0,9537.0,9537.0,9537.0
mean,500.430639,4.032086,792.745312,0.331338,1.517773,0.149942,0.447101
std,198.379364,1.963012,786.560144,0.177175,1.033988,0.357034,0.49722
min,64.0,1.0,0.5,0.002497,0.0,0.0,0.0
25%,365.0,3.0,231.953006,0.191946,1.0,0.0,0.0
50%,499.0,4.0,556.277457,0.314778,1.0,0.0,0.0
75%,635.0,5.0,1105.380602,0.453388,2.0,0.0,1.0
max,1285.0,13.0,7190.392213,0.924299,5.0,1.0,1.0


In [14]:
df['session_id'].nunique()

9537

In [15]:
for col in ['protocol_type','encryption_used','browser_type']:
    print(df[col].value_counts())
    print("\n")

protocol_type
TCP     6624
UDP     2406
ICMP     507
Name: count, dtype: int64


encryption_used
AES              4706
DES              2865
No encryption    1966
Name: count, dtype: int64


browser_type
Chrome     5137
Firefox    1944
Edge       1469
Unknown     502
Safari      485
Name: count, dtype: int64




In [16]:
for col in ['unusual_time_access', 'attack_detected']:
    print(df[col].value_counts())
    print("\n")

unusual_time_access
0    8107
1    1430
Name: count, dtype: int64


attack_detected
0    5273
1    4264
Name: count, dtype: int64




In [17]:
df[df['login_attempts'] < df['failed_logins']]

Unnamed: 0,session_id,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected
17,SID_00018,562,UDP,1,87.641002,No encryption,0.136729,2,Firefox,0,0
33,SID_00034,288,ICMP,2,1039.101186,AES,0.110269,3,Chrome,0,1
59,SID_00060,695,UDP,1,989.889796,AES,0.259419,2,Edge,0,1
69,SID_00070,370,TCP,1,1105.380602,AES,0.408654,2,Chrome,0,1
101,SID_00102,415,TCP,2,294.584967,AES,0.204681,3,Chrome,0,1
...,...,...,...,...,...,...,...,...,...,...,...
9440,SID_09441,148,TCP,1,443.215461,AES,0.313589,2,Chrome,0,0
9450,SID_09451,465,UDP,1,453.611248,DES,0.323267,4,Chrome,0,1
9472,SID_09473,128,UDP,1,1186.658300,AES,0.360473,2,Chrome,1,0
9487,SID_09488,739,UDP,2,392.207834,No encryption,0.420159,4,Chrome,0,1


In [18]:
df.drop(df[df['login_attempts'] < df['failed_logins']].index, inplace=True)
df[df['login_attempts'] < df['failed_logins']].shape

(0, 11)
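
The same filter can be written with a boolean mask, which avoids `inplace=True` and makes it easy to record how many rows are removed. A minimal sketch on made-up rows; the variable names are illustrative and not from the original notebook.

```python
import pandas as pd

# Toy rows (illustrative only): row "B" has more failed logins than attempts
toy = pd.DataFrame({
    "session_id": ["A", "B", "C"],
    "login_attempts": [4, 1, 2],
    "failed_logins": [1, 2, 2],
})

# A session cannot fail more logins than it attempted
inconsistent = toy["failed_logins"] > toy["login_attempts"]
n_dropped = int(inconsistent.sum())

# Keep only consistent rows and rebuild a clean 0..n-1 index
cleaned = toy.loc[~inconsistent].reset_index(drop=True)
print(n_dropped, cleaned["session_id"].tolist())
```

Rows where the two counts are equal are kept, matching the strict `<` comparison used in the notebook.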

---

# Saving Cleaned Data

In [19]:
df.to_csv('data/processed/cybersecurity_intrusion_data_cleaned.csv', index=False)
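
`to_csv` raises an error if the target folder does not exist. The sketch below creates the folder first and then reads the file back as a quick check. The helper name `save_csv`, the toy rows and the temporary path are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

def save_csv(df, path):
    """Create the parent folder if needed, then write `df` without the index."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    df.to_csv(path, index=False)

# Toy rows (illustrative only), written to a throwaway directory
toy = pd.DataFrame({"session_id": ["SID_A", "SID_B"], "attack_detected": [1, 0]})
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "data", "processed", "toy_cleaned.csv")
    save_csv(toy, out_path)
    # Read the file back to confirm the shape and columns survived the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded.shape)
```

In the notebook itself, the equivalent call would be `save_csv(df, 'data/processed/cybersecurity_intrusion_data_cleaned.csv')`.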

---

# Conclusion

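* The raw dataset (9,537 rows, 11 columns) was loaded into a pandas DataFrame.
* Missing values in `encryption_used` (1,966 rows) were replaced with `No encryption`; no duplicate rows were found.
* Rows where `failed_logins` exceeded `login_attempts` were dropped as inconsistent.
* The cleaned data was saved to `data/processed/cybersecurity_intrusion_data_cleaned.csv`.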