Incident Dataset:
Contains two categorical variables, Time of Day (e.g., Morning, Afternoon, Evening, Night)
 and Incident Type (e.g., Phishing, Malware, Brute Force, Insider Threat), used to examine whether the two are related.

Login Dataset:
This dataset contains a numeric Login Attempts value for each user, labeled as a regular or potentially malicious user,
allowing us to test for a significant difference in behavior between the two groups.

Chi-Square Test
Purpose: To determine if there’s a statistically significant association between the time of day and type of cybersecurity incident.
Hypotheses:

Null Hypothesis (H₀): There is no association between the time of day and the type of cybersecurity incident (i.e., they are independent).
Alternative Hypothesis (H₁): There is an association between the time of day and the type of cybersecurity incident (i.e., they are dependent).

T-Test
Purpose: To test if there’s a statistically significant difference in the average number of login attempts between regular users and malicious users.

Hypotheses:

Null Hypothesis (H₀): The mean number of login attempts for regular and malicious users is the same.
Alternative Hypothesis (H₁): The mean number of login attempts for regular and malicious users is different.

In [34]:
# Import libraries
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind
from sklearn.preprocessing import StandardScaler

In [35]:
# Load the datasets
chi_square_data = pd.read_csv('Incident.csv')
t_test_data = pd.read_csv('Login.csv')

In [36]:
# Data Cleaning for Chi-Square Test Dataset
# Check for missing values
print("Chi-Square Data Missing Values:\n", chi_square_data.isnull().sum())
# Check for duplicates
print("Chi-Square Data Duplicate Records:", chi_square_data.duplicated().sum())

Chi-Square Data Missing Values:
 Time_of_Day      0
Incident_Type    0
dtype: int64
Chi-Square Data Duplicate Records: 184


In [37]:
# Data Cleaning for T-Test Dataset
# Check for missing values
print("\nT-Test Data Missing Values:\n", t_test_data.isnull().sum())
# Check for duplicates
print("T-Test Data Duplicate Records:", t_test_data.duplicated().sum())


T-Test Data Missing Values:
 User_Type         0
Login_Attempts    0
dtype: int64
T-Test Data Duplicate Records: 0


In [38]:
# Handle missing values (if any) - here we simply drop them for demonstration
chi_square_data.dropna(inplace=True)
t_test_data.dropna(inplace=True)

In [39]:
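# Handle duplicates
# Caution: the incident data has only two categorical columns, so every repeated
# (Time_of_Day, Incident_Type) pair is flagged as a duplicate. Dropping them keeps
# one row per combination and discards the frequencies the Chi-Square test needs.
chi_square_data.drop_duplicates(inplace=True)
t_test_data.drop_duplicates(inplace=True)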

In [40]:
# Data Type Verification
print("\nChi-Square Data Types:\n", chi_square_data.dtypes)
print("\nT-Test Data Types:\n", t_test_data.dtypes)


Chi-Square Data Types:
 Time_of_Day      object
Incident_Type    object
dtype: object

T-Test Data Types:
 User_Type          object
Login_Attempts    float64
dtype: object


In [41]:
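# Normalization (z-score standardization) for Login Attempts in T-Test Dataset
# Note: this is a linear rescaling, so it does not change the t-statistic or p-value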
scaler = StandardScaler()
t_test_data['Normalized_Login_Attempts'] = scaler.fit_transform(t_test_data[['Login_Attempts']])
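
A note on the normalization step: a z-score transform shifts and rescales every value by the same constants, so it should leave a two-sample t-test unchanged. The sketch below illustrates this on synthetic data; the `demo` frame, the group sizes, and the distribution parameters are all made up for illustration and are not taken from Login.csv. The z-score is computed by hand with the population standard deviation (ddof=0), which is the same formula StandardScaler applies.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind

# Synthetic stand-in for the login data (hypothetical values, not Login.csv)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "User_Type": ["Regular"] * 50 + ["Malicious"] * 50,
    "Login_Attempts": np.concatenate([
        rng.normal(3.0, 1.0, 50),   # assumed distribution for regular users
        rng.normal(6.0, 2.0, 50),   # assumed distribution for malicious users
    ]),
})

# Manual z-score using the population standard deviation (ddof=0)
col = demo["Login_Attempts"]
demo["Normalized_Login_Attempts"] = (col - col.mean()) / col.std(ddof=0)


def t_test_for(column):
    """Run the two-sample t-test on the given column, Regular vs. Malicious."""
    regular = demo.loc[demo["User_Type"] == "Regular", column]
    malicious = demo.loc[demo["User_Type"] == "Malicious", column]
    return ttest_ind(regular, malicious)


t_raw, p_raw = t_test_for("Login_Attempts")
t_scaled, p_scaled = t_test_for("Normalized_Login_Attempts")

print("Raw:       ", t_raw, p_raw)
print("Normalized:", t_scaled, p_scaled)
```

The two printed lines should agree up to floating-point rounding, so the hypotheses about the mean number of login attempts can be tested on either column.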

In [42]:
# Chi-Square Test: Association between Time of Day and Incident Type
# Create a contingency table
contingency_table = pd.crosstab(chi_square_data['Time_of_Day'], chi_square_data['Incident_Type'])

In [43]:
# Perform Chi-Square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

In [44]:
# Display Chi-Square test results
print("\nChi-Square Test Results")
print("-----------------------")
print(f"Chi2 Statistic: {chi2}")
print(f"P-value: {p_value}")
print(f"Degrees of Freedom: {dof}")
print(f"Expected Frequencies Table:\n{expected}\n")


Chi-Square Test Results
-----------------------
Chi2 Statistic: 0.0
P-value: 1.0
Degrees of Freedom: 9
Expected Frequencies Table:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]



1. Chi2 Statistic: 0.0
A Chi2 statistic of 0.0 indicates that there is no difference between the observed and expected frequencies. In other words, the observed values exactly match the expected values for each category in the contingency table.

2. P-value: 1.0
A p-value of 1.0 means the test found no evidence of an association between Time of Day and Incident Type: the observed table is exactly what the null hypothesis of independence predicts.
In practice, this means we fail to reject the null hypothesis. Failing to reject it is not the same as proving that the variables are independent.

3. Degrees of Freedom (DOF): 9
The degrees of freedom in a Chi-Square test of independence are calculated as (rows − 1) × (columns − 1). With four categories for both Time of Day and Incident Type, we get (4−1)×(4−1)=9. The DOF doesn’t indicate significance; it determines which Chi-Square distribution the statistic is compared against when computing the p-value.

4. Expected Frequencies Table:
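The table shows the expected frequency of each combination of Time of Day and Incident Type if there were no association between them. All values are 1, meaning each combination was expected to occur exactly once. The expected counts sum to 16, so the table was built from only 16 rows: the duplicate-removal step dropped the 184 repeated rows and left one row per combination. Expected counts this small also fall below the common rule of thumb of at least 5 per cell for the Chi-Square approximation.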
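
The degrees of freedom and expected frequencies described above can be reproduced by hand. The following is a minimal sketch on a made-up 4×4 table of counts (the numbers in `observed` are hypothetical, not the incident data), compared against what chi2_contingency returns.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical observed counts (rows: time of day, columns: incident type)
observed = np.array([
    [12,  8,  5, 10],
    [ 9, 14,  7,  6],
    [ 6,  9, 15,  8],
    [11,  5,  9, 13],
])

row_totals = observed.sum(axis=1, keepdims=True)   # shape (4, 1)
col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, 4)
n = observed.sum()

# Expected count under independence: row total * column total / grand total
expected_manual = row_totals @ col_totals / n

# Degrees of freedom: (rows - 1) * (columns - 1)
dof_manual = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Chi-Square statistic: sum of (observed - expected)^2 / expected over all cells
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p_scipy, dof_scipy, expected_scipy = chi2_contingency(observed)

print("Manual chi2:", chi2_manual, "dof:", dof_manual)
print("SciPy chi2: ", chi2_scipy, "dof:", dof_scipy, "p-value:", p_scipy)
print("Smallest expected count:", expected_manual.min())
```

Checking the smallest expected count is a quick way to see whether the table is large enough for the Chi-Square approximation to be reasonable.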

Summary Interpretation
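Since the p-value is 1.0 and the Chi2 statistic is 0, we fail to reject the null hypothesis. However, this result should not be read as evidence that incidents occur independently of the time of day. The 184 rows removed as duplicates were repeated observations of the same (Time of Day, Incident Type) pairs, which are exactly the frequencies the test needs. After removal, every combination appears once, so the observed table matches the expected table by construction. To test the hypothesis properly, the contingency table should be built from the incident data before duplicate removal.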

Conclusion: we fail to reject the "Null Hypothesis (H₀): There is no association between the time of day and the type of cybersecurity incident (i.e., they are independent)."
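
One caveat about the Chi-Square result above is that the contingency table was built after drop_duplicates. The sketch below uses a small synthetic dataset (the counts in `counts` are invented to contain a strong association, and are not the incident data) to show how removing duplicate rows from a two-column categorical table can turn a clearly significant result into Chi2 = 0 and p = 1.

```python
import pandas as pd
from scipy.stats import chi2_contingency

times = ["Morning", "Afternoon", "Evening", "Night"]
types = ["Phishing", "Malware", "Brute Force", "Insider Threat"]

# Hypothetical counts with a deliberate association (heavy diagonal)
counts = [
    [30,  5,  5,  5],
    [ 5, 30,  5,  5],
    [ 5,  5, 30,  5],
    [ 5,  5,  5, 30],
]

# Expand the counts into one row per incident, like a raw incident log
rows = [
    (time, kind)
    for i, time in enumerate(times)
    for j, kind in enumerate(types)
    for _ in range(counts[i][j])
]
demo = pd.DataFrame(rows, columns=["Time_of_Day", "Incident_Type"])

# Test on the full data: repeated rows carry the frequency information
full_table = pd.crosstab(demo["Time_of_Day"], demo["Incident_Type"])
chi2_full, p_full, dof_full, _ = chi2_contingency(full_table)

# Test after dropping duplicates: one row per combination remains
dedup = demo.drop_duplicates()
dedup_table = pd.crosstab(dedup["Time_of_Day"], dedup["Incident_Type"])
chi2_dedup, p_dedup, dof_dedup, _ = chi2_contingency(dedup_table)

print("Rows before/after dedup:", len(demo), len(dedup))
print("Full data:  chi2 =", chi2_full, "p =", p_full)
print("Dedup data: chi2 =", chi2_dedup, "p =", p_dedup)
```

In this sketch the full data rejects independence decisively, while the deduplicated data cannot, because every cell of its table equals 1. For a frequency-based test like this one, repeated rows are data rather than noise.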

In [45]:
# T-Test: Difference in Login Attempts between Regular and Malicious users
# Split data based on User Type
regular_login_attempts = t_test_data[t_test_data['User_Type'] == 'Regular']['Normalized_Login_Attempts']
malicious_login_attempts = t_test_data[t_test_data['User_Type'] == 'Malicious']['Normalized_Login_Attempts']

In [46]:
# Perform T-Test
t_stat, p_val = ttest_ind(regular_login_attempts, malicious_login_attempts)

In [47]:
# Display T-Test results
print("T-Test Results")
print("--------------")
print(f"T-Statistic: {t_stat}")
print(f"P-value: {p_val}")

T-Test Results
--------------
T-Statistic: -8.22397927565179
P-value: 2.5690634182473794e-14


T-Statistic: -8.22
A T-statistic of -8.22 means the difference between the two group means (Regular and Malicious users) is about 8.2 standard errors, which is large relative to the variability in the data.
The negative sign shows that the mean of the Regular users' login attempts is lower than that of Malicious users, because the Regular group was passed to ttest_ind first. This aligns with expectations in cybersecurity (malicious users might make more attempts).

P-value: 2.57e-14
A p-value of 2.57e-14 (or 0.0000000000000257) is exceedingly low and far below the typical significance threshold of 0.05.
This means we can reject the null hypothesis with strong confidence, indicating a statistically significant difference in the average number of login attempts between the two groups.

Summary Interpretation
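These results provide strong evidence that malicious users make a different number of login attempts on average than regular users, and the negative T-statistic shows that their average is higher. This distinction could be useful as one signal for identifying potentially malicious activity, although a difference in group means does not by itself show how reliably an individual user can be classified from login behavior alone.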

Conclusion: we reject the null hypothesis in favor of the "Alternative Hypothesis (H₁): The mean number of login attempts for regular and malicious users is different."
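
As a possible follow-up to the T-Test above, two checks are often reported alongside it: Welch's t-test, which does not assume equal variances in the two groups (ttest_ind assumes equal variances by default), and an effect size such as Cohen's d. The sketch below runs both on synthetic samples. The sample sizes and distribution parameters are assumptions chosen for illustration, and `cohens_d` is a hypothetical helper, not part of SciPy.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic samples (hypothetical values, not Login.csv)
rng = np.random.default_rng(7)
regular = rng.normal(loc=3.0, scale=1.0, size=60)
malicious = rng.normal(loc=6.0, scale=2.5, size=40)

# Student's t-test (equal variances assumed, the ttest_ind default)
t_student, p_student = ttest_ind(regular, malicious)

# Welch's t-test (no equal-variance assumption)
t_welch, p_welch = ttest_ind(regular, malicious, equal_var=False)


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


d = cohens_d(regular, malicious)

print("Student: t =", t_student, "p =", p_student)
print("Welch:   t =", t_welch, "p =", p_welch)
print("Cohen's d:", d)
```

The p-value says whether a difference is detectable; Cohen's d says how large it is in units of the pooled standard deviation, which is the more relevant quantity when judging whether login attempts could help separate the two groups.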