📊 Cybersecurity Data Analysis with Pandas

This notebook explores a cybersecurity dataset using Pandas for data analysis.


In [32]:
# Importing the pandas library for data manipulation and analysis,
# using the conventional alias 'pd'.

import pandas as pd

In [33]:
# Reading a CSV file hosted on GitHub with pandas' read_csv and storing it in a DataFrame called 'df'.
# The CSV file contains the network traffic data to be analysed.

df = pd.read_csv('https://raw.githubusercontent.com/ritaafrica/data/main/network_traffic_data.csv')
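The Timestamp column is read as plain text by default. As a minimal sketch, assuming the same column layout as the dataset, `read_csv` can parse it into datetimes at load time with `parse_dates`. The three rows below are copied from the outputs in this notebook and stand in for the remote file; `sample_csv` and `sample_df` are illustrative names, not part of the original notebook.

```python
import io
import pandas as pd

# Three rows copied from this notebook's output, used as a stand-in for the remote CSV.
sample_csv = io.StringIO(
    "Timestamp,Source_IP,Destination_IP,Protocol,Port,Bytes_Sent,Bytes_Received,Status,Threat_Level\n"
    "2025-03-19 13:04:10,10.0.0.15,192.168.1.20,TCP,,5411,8989,Blocked,Low\n"
    "2025-03-19 13:03:40,192.168.1.13,172.217.169.46,ICMP,443.0,4999,11808,Allowed,Medium\n"
    "2025-03-19 13:03:10,10.0.0.5,203.0.113.99,HTTP,443.0,6360,10852,Allowed,Medium\n"
)

# parse_dates converts the listed columns to datetime64 while reading.
sample_df = pd.read_csv(sample_csv, parse_dates=["Timestamp"])

print(sample_df.dtypes)

# Datetime columns expose the .dt accessor, e.g. to extract the hour of each event.
print(sample_df["Timestamp"].dt.hour)
```

With a datetime column, later steps could filter by time window or group by hour instead of treating the timestamp as an opaque string.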

In [6]:
# Checking the number of rows and columns

df.shape

(1000, 9)

In [7]:
# Getting column names

df.columns

Index(['Timestamp', 'Source_IP', 'Destination_IP', 'Protocol', 'Port',
       'Bytes_Sent', 'Bytes_Received', 'Status', 'Threat_Level'],
      dtype='object')

In [8]:
# Getting basic info about the dataset (column dtypes and non-null counts)

df.info()
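`info` is a method, so it only produces its summary when called with parentheses; the bare attribute just evaluates to the bound method, whose repr echoes the whole DataFrame. The sketch below illustrates the difference on a three-row stand-in DataFrame built from values shown in this notebook (`sample_df` is an illustrative name). The `buf` argument is used here only to capture the printed summary as a string.

```python
import io
import pandas as pd

sample_df = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP"],
    "Port": [float("nan"), 443.0, 443.0],
    "Bytes_Sent": [5411, 4999, 6360],
})

# Without parentheses, the attribute is only a reference to the method.
print(callable(sample_df.info))

# With parentheses, info() writes the summary; buf= redirects it into a string buffer.
buffer = io.StringIO()
sample_df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```

The captured summary lists each column with its non-null count and dtype, which is a quick way to spot columns such as Port that contain missing values.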

In [9]:
# Getting summary statistics for the numeric columns

df.describe()
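Like `info`, `describe` has to be called with parentheses to compute anything. As a rough sketch on a five-row stand-in built from values shown in this notebook (`sample_df` is an illustrative name), calling it on the DataFrame summarises the numeric columns, while calling it on a text column reports count, unique values, the most frequent value, and its frequency.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Protocol": ["TCP", "ICMP", "HTTP", "TCP", "FTP"],
    "Bytes_Sent": [5411, 4999, 6360, 4011, 5254],
    "Bytes_Received": [8989, 11808, 10852, 14314, 8718],
})

# On a DataFrame with numeric columns, describe() reports count, mean, std, min, quartiles, max.
numeric_summary = sample_df.describe()
print(numeric_summary)

# On a text column, describe() reports count, unique, top (most frequent value), and freq.
text_summary = sample_df["Protocol"].describe()
print(text_summary)
```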

In [34]:
# Selecting important columns and storing them in a dataframe

focus_columns = df[["Timestamp", "Source_IP", "Destination_IP", "Status", "Threat_Level"]]

In [35]:
# Display first 10 rows

focus_columns.head(10)

             Timestamp     Source_IP  Destination_IP   Status Threat_Level
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20  Blocked          Low
1  2025-03-19 13:03:40  192.168.1.13  172.217.169.46  Allowed       Medium
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99  Allowed       Medium
3  2025-03-19 13:02:40      10.0.0.9    192.168.1.20  Blocked          Low
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46  Blocked       Medium
5  2025-03-19 13:01:40     10.0.0.43  172.217.169.46  Allowed          Low
6  2025-03-19 13:01:10     10.0.0.26        10.0.0.5  Allowed         High
7  2025-03-19 13:00:40  192.168.1.36    192.168.1.20  Allowed       Medium
8  2025-03-19 13:00:10  192.168.1.26    192.168.1.20  Allowed       Medium
9  2025-03-19 12:59:40     10.0.0.43        10.0.0.5  Blocked          Low


In [36]:
# Selecting a single column, which returns a pandas Series

ports_list = df["Port"]

In [37]:
# Display first 10 rows

ports_list.head(10)

0       NaN
1     443.0
2     443.0
3       NaN
4       NaN
5      53.0
6      53.0
7      21.0
8       NaN
9    3389.0
Name: Port, dtype: float64
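The output above shows two things worth checking: some ports are missing (NaN), and the column is stored as float64 because NaN cannot live in a plain integer column. A minimal sketch, using the ten values printed above as a stand-in Series (`ports` is an illustrative name), counts the gaps and converts to pandas' nullable `Int64` dtype so port numbers display as integers.

```python
import pandas as pd

nan = float("nan")

# The ten Port values shown in the output above.
ports = pd.Series([nan, 443.0, 443.0, nan, nan, 53.0, 53.0, 21.0, nan, 3389.0], name="Port")

# isna() marks missing entries; summing the boolean mask counts them.
missing_count = int(ports.isna().sum())
print(f"Missing ports: {missing_count}")

# The nullable Int64 dtype keeps missing entries as <NA> while storing the rest as integers.
ports_int = ports.astype("Int64")
print(ports_int)
```

Whether a missing port should be dropped, filled, or kept as its own category depends on the analysis; this sketch only makes the gaps visible.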

In [38]:
# Filter only blocked traffic

blocked_traffic = df[df["Status"] == "Blocked"]

In [39]:
# Display last 10 rows

blocked_traffic.tail(10)

               Timestamp     Source_IP  Destination_IP Protocol    Port  Bytes_Sent  Bytes_Received   Status Threat_Level
979  2025-03-19 04:54:40  192.168.1.21  172.217.169.46      TCP   443.0        5713           13035  Blocked       Medium
982  2025-03-19 04:53:10      10.0.0.3    192.168.1.20      TCP  3389.0        1932            1021  Blocked          Low
983  2025-03-19 04:52:40  192.168.1.50    192.168.1.20     ICMP    22.0        3993           10494  Blocked          Low
986  2025-03-19 04:51:10   192.168.1.8    203.0.113.99      FTP  3389.0        7565            1259  Blocked       Medium
987  2025-03-19 04:50:40     10.0.0.18    192.168.1.20     ICMP  8080.0         560           12910  Blocked       Medium
992  2025-03-19 04:48:10     10.0.0.11    203.0.113.99     HTTP     NaN        2839            2939  Blocked       Medium
993  2025-03-19 04:47:40  192.168.1.39    192.168.1.20     ICMP    22.0        4178            8307  Blocked          Low
995  2025-03-19 04:46:40     10.0.0.46  172.217.169.46      DNS    53.0        2290            6246  Blocked          Low
997  2025-03-19 04:45:40      10.0.0.3    192.168.1.20      UDP    21.0        6655           13170  Blocked          Low
998  2025-03-19 04:45:10  192.168.1.30  172.217.169.46      DNS     NaN        7308           13117  Blocked          Low


In [40]:
# Select key details for analysis

blocked_traffic_summary = blocked_traffic[["Timestamp", "Source_IP", "Destination_IP", "Status", "Threat_Level"]]

In [41]:
# Display first few rows

blocked_traffic_summary.head()

              Timestamp    Source_IP  Destination_IP   Status Threat_Level
0   2025-03-19 13:04:10    10.0.0.15    192.168.1.20  Blocked          Low
3   2025-03-19 13:02:40     10.0.0.9    192.168.1.20  Blocked          Low
4   2025-03-19 13:02:10  192.168.1.4  172.217.169.46  Blocked       Medium
9   2025-03-19 12:59:40    10.0.0.43        10.0.0.5  Blocked          Low
10  2025-03-19 12:59:10    10.0.0.33    203.0.113.99  Blocked       Medium
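Once blocked traffic is isolated, a natural next step is to count how the blocked events are distributed. The sketch below is one possible way to do that, using the five blocked rows shown above as a stand-in (`blocked` is an illustrative name): `value_counts` ranks destinations by number of blocked events, and `groupby(...).size()` counts events per threat level.

```python
import pandas as pd

# The five blocked rows shown above, reduced to the columns needed here.
blocked = pd.DataFrame({
    "Source_IP": ["10.0.0.15", "10.0.0.9", "192.168.1.4", "10.0.0.43", "10.0.0.33"],
    "Destination_IP": ["192.168.1.20", "192.168.1.20", "172.217.169.46", "10.0.0.5", "203.0.113.99"],
    "Threat_Level": ["Low", "Low", "Medium", "Low", "Medium"],
})

# value_counts() sorts by frequency, so the most frequently blocked destination comes first.
per_destination = blocked["Destination_IP"].value_counts()
print(per_destination)

# groupby(...).size() counts rows in each group.
per_threat_level = blocked.groupby("Threat_Level").size()
print(per_threat_level)
```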


In [44]:
# Filter the highest-risk traffic: rows whose Threat_Level is "Critical"

high_risk_traffic = df[df["Threat_Level"] == "Critical"]

In [45]:
# Display first few rows

high_risk_traffic.head()

               Timestamp     Source_IP  Destination_IP Protocol    Port  Bytes_Sent  Bytes_Received   Status Threat_Level
59   2025-03-19 12:34:40     10.0.0.47    192.168.1.20     ICMP     NaN        5885             463  Allowed     Critical
96   2025-03-19 12:16:10  192.168.1.35    203.0.113.99      FTP  8080.0        9371            7189  Allowed     Critical
134  2025-03-19 11:57:10  192.168.1.17  172.217.169.46      DNS    22.0        6714           13124  Blocked     Critical
150  2025-03-19 11:49:10  192.168.1.42        10.0.0.5     HTTP    53.0        2702             634  Allowed     Critical
209  2025-03-19 11:19:40     10.0.0.17    203.0.113.99      TCP  3389.0        5085           10014  Blocked     Critical
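Several of the Critical rows above were Allowed rather than Blocked, which is the kind of case an analyst would usually want to isolate. As a sketch, conditions can be combined with `&` (each wrapped in parentheses), and `isin` matches against several values at once. The stand-in DataFrame uses the five Critical rows shown above plus one Low row from earlier in the notebook; `sample_df` is an illustrative name.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Source_IP": ["10.0.0.47", "192.168.1.35", "192.168.1.17", "192.168.1.42", "10.0.0.17", "10.0.0.15"],
    "Status": ["Allowed", "Allowed", "Blocked", "Allowed", "Blocked", "Blocked"],
    "Threat_Level": ["Critical", "Critical", "Critical", "Critical", "Critical", "Low"],
})

# Each condition needs its own parentheses because & binds tighter than ==.
critical_allowed = sample_df[
    (sample_df["Threat_Level"] == "Critical") & (sample_df["Status"] == "Allowed")
]
print(critical_allowed)

# isin() keeps rows whose value matches any entry in the list.
high_or_critical = sample_df[sample_df["Threat_Level"].isin(["High", "Critical"])]
print(len(high_or_critical))
```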


In [46]:
# Filter traffic with bytes sent over 5000

high_data_transfer = df[df["Bytes_Sent"] > 5000]

In [47]:
# Display first few rows

high_data_transfer.head()

             Timestamp     Source_IP  Destination_IP Protocol   Port  Bytes_Sent  Bytes_Received   Status Threat_Level
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20      TCP    NaN        5411            8989  Blocked          Low
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99     HTTP  443.0        6360           10852  Allowed       Medium
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46      FTP    NaN        5254            8718  Blocked       Medium
5  2025-03-19 13:01:40     10.0.0.43  172.217.169.46      DNS   53.0        6915           12981  Allowed          Low
7  2025-03-19 13:00:40  192.168.1.36    192.168.1.20      TCP   21.0        5655             119  Allowed       Medium


In [55]:
# Show the number of high-data-transfer events

print(f"Number of high-data-transfer events: {len(high_data_transfer)}")

Number of high-data-transfer events: 518
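The same count can be obtained without building a filtered DataFrame, because a boolean mask sums to the number of True entries and its mean gives the share of matching rows. A minimal sketch on the first five Bytes_Sent values shown in this notebook (`bytes_sent` and `mask` are illustrative names):

```python
import pandas as pd

# Bytes_Sent for the first five rows of the dataset, as shown in this notebook.
bytes_sent = pd.Series([5411, 4999, 6360, 4011, 5254], name="Bytes_Sent")

# The comparison is strict, so a value of exactly 5000 would not count.
mask = bytes_sent > 5000

# True counts as 1, so sum() is the number of matches and mean() is their share.
high_count = int(mask.sum())
high_share = float(mask.mean())

print(f"High-data-transfer events: {high_count} ({high_share:.0%} of rows)")
```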


In [58]:
# Split data set into features (x) and target variable (y)
# Select features (x) - exclude target variable

x = df.drop(columns = ["Threat_Level"])

In [59]:
# Select target variable (y)

y = df["Threat_Level"]

In [63]:
# Display first few rows of x and y
# (labels are printed on their own lines so the column headers stay aligned)

print("Features (x):")
print(x.head())
print("\nTarget Variable (y):")
print(y.head())

Features (x):
             Timestamp     Source_IP  Destination_IP Protocol   Port  \
0  2025-03-19 13:04:10     10.0.0.15    192.168.1.20      TCP    NaN
1  2025-03-19 13:03:40  192.168.1.13  172.217.169.46     ICMP  443.0
2  2025-03-19 13:03:10      10.0.0.5    203.0.113.99     HTTP  443.0
3  2025-03-19 13:02:40      10.0.0.9    192.168.1.20      TCP    NaN
4  2025-03-19 13:02:10   192.168.1.4  172.217.169.46      FTP    NaN

   Bytes_Sent  Bytes_Received   Status
0        5411            8989  Blocked
1        4999           11808  Allowed
2        6360           10852  Allowed
3        4011           14314  Blocked
4        5254            8718  Blocked

Target Variable (y):
0       Low
1    Medium
2    Medium
3       Low
4    Medium
Name: Threat_Level, dtype: object
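Before using a target variable, it is usually worth checking how its classes are distributed. As a sketch, `value_counts` gives absolute counts and `normalize=True` gives proportions. The Series below holds only the five target values printed above, so the numbers illustrate the call rather than the full dataset.

```python
import pandas as pd

# The five target values printed above.
y = pd.Series(["Low", "Medium", "Medium", "Low", "Medium"], name="Threat_Level")

# Absolute number of rows per class, most frequent first.
class_counts = y.value_counts()
print(class_counts)

# normalize=True returns each class's share of the total instead of a count.
class_shares = y.value_counts(normalize=True)
print(class_shares)
```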


In [None]:
# Remove "Timestamp" column

df = df.drop(columns=["Timestamp"])

In [64]:
# Display first few rows (Timestamp is no longer present)

df.head()

      Source_IP  Destination_IP Protocol   Port  Bytes_Sent  Bytes_Received   Status Threat_Level
0     10.0.0.15    192.168.1.20      TCP    NaN        5411            8989  Blocked          Low
1  192.168.1.13  172.217.169.46     ICMP  443.0        4999           11808  Allowed       Medium
2      10.0.0.5    203.0.113.99     HTTP  443.0        6360           10852  Allowed       Medium
3      10.0.0.9    192.168.1.20      TCP    NaN        4011           14314  Blocked          Low
4   192.168.1.4  172.217.169.46      FTP    NaN        5254            8718  Blocked       Medium
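Because the drop cell reassigns `df`, running it a second time raises a `KeyError`: the column is already gone. One way to make such a cell safe to re-run, sketched here on a two-row stand-in (`sample_df` is an illustrative name), is to pass `errors="ignore"` to `drop`.

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Timestamp": ["2025-03-19 13:04:10", "2025-03-19 13:03:40"],
    "Protocol": ["TCP", "ICMP"],
    "Bytes_Sent": [5411, 4999],
})

# First drop succeeds and removes the column.
sample_df = sample_df.drop(columns=["Timestamp"])

# A second plain drop fails because the column no longer exists.
try:
    sample_df.drop(columns=["Timestamp"])
    second_drop_failed = False
except KeyError:
    second_drop_failed = True
print(second_drop_failed)

# errors="ignore" skips labels that are not present, so the cell can be re-run safely.
sample_df = sample_df.drop(columns=["Timestamp"], errors="ignore")
print(list(sample_df.columns))
```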
