In [1]:
### Import libraries
import pandas as pd # Data manipulation and analysis.
import numpy as np # Numerical operations and array handling.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats # Statistical functions and tests.

pd.set_option('display.max_columns', None) # Display all columns in DataFrame output.
pd.set_option('display.max_rows', None) # Display all rows in DataFrame output.


In [2]:
# Load data from CSV
df = pd.read_csv('data/intrusion_data.csv')

In [3]:
# Display the first 5 rows of the DataFrame
df.head()

Unnamed: 0,session_id,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected
0,SID_00001,599,TCP,4,492.983263,DES,0.606818,1,Edge,0,1
1,SID_00002,472,TCP,3,1557.996461,DES,0.301569,0,Firefox,0,0
2,SID_00003,629,TCP,3,75.044262,DES,0.739164,2,Chrome,0,1
3,SID_00004,804,UDP,4,601.248835,DES,,0,Unknown,0,1
4,SID_00005,453,TCP,5,532.540888,AES,0.054874,1,Firefox,0,0


In [4]:
# Return number of rows and columns in the DataFrame
df.shape

(9537, 11)

In [5]:
# Return information about the DataFrame, including data types and non-null counts.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9537 entries, 0 to 9536
Data columns (total 11 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   session_id           9537 non-null   object 
 1   network_packet_size  9537 non-null   int64  
 2   protocol_type        9537 non-null   object 
 3   login_attempts       9537 non-null   int64  
 4   session_duration     9060 non-null   float64
 5   encryption_used      7571 non-null   object 
 6   ip_reputation_score  8106 non-null   float64
 7   failed_logins        9537 non-null   int64  
 8   browser_type         8583 non-null   object 
 9   unusual_time_access  9537 non-null   int64  
 10  attack_detected      9537 non-null   int64  
dtypes: float64(2), int64(5), object(4)
memory usage: 819.7+ KB


The dataset has 9,537 rows, and four columns contain missing values that will need to be addressed.

The columns mix numerical (int64, float64) and categorical (object) data types. The object columns will require encoding before they can be used for predictive training.

Missing values: session_duration, encryption_used, ip_reputation_score and browser_type have non-null counts below the total number of entries (9,537).
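
As a rough illustration of the encoding step mentioned above, here is a minimal sketch using one-hot encoding via `pd.get_dummies`. The toy frame and its values are assumptions for illustration only, and one-hot encoding is just one possible choice, not necessarily the method used later in this notebook.

```python
import pandas as pd

# Toy frame mimicking one object column and one numeric column (illustrative values only).
toy = pd.DataFrame({
    "protocol_type": ["TCP", "UDP", "TCP", "ICMP"],
    "login_attempts": [4, 3, 3, 5],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
encoded = pd.get_dummies(toy, columns=["protocol_type"], dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Each distinct category becomes its own 0/1 column, so a column with few distinct values adds only a few features.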

In [6]:
# Return summary statistics for numerical columns in the DataFrame
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
network_packet_size,9537.0,500.430639,198.379364,64.0,365.0,499.0,635.0,1285.0
login_attempts,9537.0,4.032086,1.963012,1.0,3.0,4.0,5.0,13.0
session_duration,9060.0,789.259572,785.282753,0.5,229.883982,553.389511,1102.056853,7190.392213
ip_reputation_score,8106.0,0.33097,0.176582,0.002497,0.191938,0.313947,0.452014,0.891286
failed_logins,9537.0,1.517773,1.033988,0.0,1.0,1.0,2.0,5.0
unusual_time_access,9537.0,0.149942,0.357034,0.0,0.0,0.0,0.0,1.0
attack_detected,9537.0,0.447101,0.49722,0.0,0.0,0.0,1.0,1.0


The summary statistics cover 7 of the 11 columns; the remaining 4 are the categorical (object) columns seen earlier.

session_duration varies widely, with a maximum above 7,000. Its mean (789.259572) is well above the median (50th percentile, 553.389511), which indicates a right-skewed distribution.
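
The mean-versus-median gap can be quantified with a skewness statistic. The sketch below uses synthetic exponential data as a stand-in for session_duration (the real values are not used here), and the `log1p` transform is shown only as one common way to compress a long right tail, not as a step this notebook necessarily takes.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed durations; an assumption standing in for session_duration.
rng = np.random.default_rng(0)
durations = pd.Series(rng.exponential(scale=800.0, size=5000))

mean, median = durations.mean(), durations.median()
skewness = durations.skew()  # positive when the right tail is longer

# log1p compresses large values, which reduces the magnitude of the skew.
log_skewness = np.log1p(durations).skew()

print(f"mean={mean:.1f}, median={median:.1f}")
print(f"skewness raw={skewness:.2f}, after log1p={log_skewness:.2f}")
```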


# Handle Duplicates

Each session is uniquely identified by its primary key, so my focus is not on repeated values within individual features but on ensuring that no entire row is duplicated.

Feature columns such as 'protocol_type', 'encryption_used' and 'browser_type' contain many repeated values because they are categorical features with a limited set of possible values. Repeated values are therefore expected in these columns.

Session uniqueness - I will verify that each row represents a unique session. The 'session_id' column is the designated primary key of this dataset, so each session ID should appear exactly once.

Duplicate rows - I will check for any fully identical rows across all columns. This is a crucial step to eliminate records that may have been duplicated during data collection.


In [7]:
# Show columns of dataset
df.columns

Index(['session_id', 'network_packet_size', 'protocol_type', 'login_attempts',
       'session_duration', 'encryption_used', 'ip_reputation_score',
       'failed_logins', 'browser_type', 'unusual_time_access',
       'attack_detected'],
      dtype='object')

In [8]:
# The "session_id" column should be a unique identifier for each row.
# Check that no session_id value is repeated.
duplicated_ids = df['session_id'].duplicated().sum()
print(f"Number of duplicate session_ids: {duplicated_ids}")

# Each record in the dataset should be distinct.
# Check for rows that are complete duplicates across all columns.
full_duplicates_row = df.duplicated().sum()
print(f"Number of fully identical rows: {full_duplicates_row}")

Number of duplicate session_ids: 0
Number of fully identical rows: 0


# Handle Irrelevant Data

In [9]:
# The 'session_id' column is dropped because it only identifies a unique session.
# It offers no general predictive value, and keeping it would add noise to any predictive training.
df.drop('session_id', axis=1, inplace=True)

# Verify the column has been dropped by displaying the columns
df.columns

Index(['network_packet_size', 'protocol_type', 'login_attempts',
       'session_duration', 'encryption_used', 'ip_reputation_score',
       'failed_logins', 'browser_type', 'unusual_time_access',
       'attack_detected'],
      dtype='object')

In [10]:
# Check for any columns where all values are the same (constant features)
constant_features = [col for col in df.columns if df[col].nunique() == 1]
print("Constant features:", constant_features)

Constant features: []


In [11]:
# Remove constant features from the DataFrame.
df_no_consntant_features = df.drop(columns=constant_features)

# Handle Missing Values

In [12]:
# Select the rows that contain at least one missing value.
df_missing_data = df[df.isnull().any(axis=1)]
df_missing_data.shape

(4046, 10)

In [13]:
df_missing_data.tail()

Unnamed: 0,network_packet_size,protocol_type,login_attempts,session_duration,encryption_used,ip_reputation_score,failed_logins,browser_type,unusual_time_access,attack_detected
9529,469,TCP,1,2487.078455,,0.497153,0,,0,0
9530,661,UDP,5,,,0.613622,3,Chrome,0,1
9533,380,TCP,3,182.848475,,,0,Chrome,0,0
9535,406,TCP,4,86.664703,AES,,1,Chrome,1,0
9536,340,TCP,6,86.876744,,0.277069,4,Chrome,1,1


In [14]:
# Show columns where more than threshold% of the values are missing
threshold = 5
print(f"Total records {df.shape[0]}")
print("*"* 50)
for col in df.columns:
    missing_count = df[col].isnull().sum()
    missing_ratio = (missing_count / df.shape[0]) * 100
    if missing_ratio > threshold:
        print(f"Column: {col} has {missing_count} missing values ({missing_ratio: 2f}%)")
        print("*"* 50)

Total records 9537
**************************************************
Column: session_duration has 477 missing values ( 5.001573%)
**************************************************
Column: encryption_used has 1966 missing values ( 20.614449%)
**************************************************
Column: ip_reputation_score has 1431 missing values ( 15.004718%)
**************************************************
Column: browser_type has 954 missing values ( 10.003146%)
**************************************************


There are four columns which contain missing values: session_duration, encryption_used, ip_reputation_score and browser_type.

The percentage of missing data in these columns ranges from approximately 5% to 21%. Since this is not high enough to justify dropping these potentially valuable features and losing information, I will impute the missing values.
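
The per-column percentages reported above can also be computed without an explicit loop. This is a minimal sketch on a toy frame (the values are made up for illustration): `isnull().mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame with some missing entries (illustrative values only).
toy = pd.DataFrame({
    "session_duration": [10.0, np.nan, 30.0, 40.0],
    "browser_type": ["Chrome", None, None, "Edge"],
    "failed_logins": [0, 1, 2, 3],
})

# Fraction of nulls per column, expressed as a percentage.
missing_pct = toy.isnull().mean() * 100

# Keep only the columns above the threshold, largest first.
threshold = 5
above = missing_pct[missing_pct > threshold].sort_values(ascending=False)
print(above.round(2))
```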

In [None]:
# Impute 'encryption_used' with 'none'
df['encryption_used'] = df['encryption_used'].fillna('none')
print("Imputed 'encryption_used' with the constant value 'none'.")

# Impute 'browser_type' with the existing 'Unknown' category (the label is case-sensitive)
df['browser_type'] = df['browser_type'].fillna('Unknown')
print("Imputed 'browser_type' with the constant value 'Unknown'.")


# Check null value count
print("\nMissing values count after categorical imputation:")
print(df[['encryption_used', 'browser_type', 'session_duration']].isnull().sum())

Imputed 'encryption_used' with the constant value 'none'.
Imputed 'browser_type' with the constant value 'Unknown'.

Missing values count after categorical imputation:
encryption_used       0
browser_type          0
session_duration    477
dtype: int64


I am approaching the missing values from a cybersecurity perspective.

The missing 'encryption_used' values are imputed with the string 'none'. This is a conservative security assumption: if no encryption protocol is recorded, the session is treated as an unencrypted session attempt.

The missing 'browser_type' values are imputed into the existing category for unidentified browsers, which appears as 'Unknown' in the data. This consolidates all sessions where the browser was not identifiable, which could be a useful feature for attack detection.
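
Consolidating into an existing category only works if the fill value matches the existing label exactly, because string labels in pandas are case-sensitive. A small sketch on toy data (the values are assumptions for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy browser column that already contains an 'Unknown' label plus two missing entries.
browsers = pd.Series(["Chrome", "Unknown", None, "Firefox", None])

mismatched = browsers.fillna("unknown")  # different case: creates a separate label
matched = browsers.fillna("Unknown")     # same case: merges into the existing label

print("distinct labels (mismatched case):", mismatched.nunique())
print("distinct labels (matched case):", matched.nunique())
print(matched.value_counts())
```

Checking `value_counts()` after imputation is a quick way to confirm that no near-duplicate label was introduced.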