## Assessing Statistical Significance and P-values in Project Objectives

### Analyzing Associations and Predicting Impact in Fire Incidents
Our objective is to assess the strength of associations between the circumstances of fire incidents and their impact, and ultimately to predict that impact from those circumstances.
Impact is measured with dependent variables such as the number of people displaced, fatalities, and property damage.
The circumstances surrounding each incident are treated as independent variables, such as the extent of the fire, ignition source, and property use.

We will examine whether specific nominal features (categorized circumstances) are associated with differences in the continuous features (measured impact) of fire incidents.

### Methodologies Used
- Kruskal-Wallis Test.

The Kruskal-Wallis test is a non-parametric test used to compare the distribution of a continuous variable across groups defined by a nominal variable. When the group distributions have similar shapes, it can be read as a comparison of medians. It is suitable when the assumptions of normality and homogeneity of variances required by ANOVA are not met, and because it operates on ranks it is less sensitive to outliers.

(https://www.statology.org/kruskal-wallis-test/)
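As a rough illustration of how the test behaves, the sketch below runs `scipy.stats.kruskal` on synthetic, skewed data. The group names, distribution parameters, and sample sizes are assumptions made for this example only and are not taken from the fire incidents dataset.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Hypothetical groups of a skewed, non-normal outcome (synthetic data, not the real dataset).
group_a = rng.lognormal(mean=8.0, sigma=1.0, size=200)
group_b = rng.lognormal(mean=8.0, sigma=1.0, size=200)
group_c = rng.lognormal(mean=9.5, sigma=1.0, size=200)

# Groups a and b are drawn from the same distribution; group c is shifted upward.
h_same, p_same = kruskal(group_a, group_b)
h_diff, p_diff = kruskal(group_a, group_b, group_c)

print(f"a vs b:      H = {h_same:.2f}, p = {p_same:.3g}")
print(f"a vs b vs c: H = {h_diff:.2f}, p = {p_diff:.3g}")
```

Adding the shifted group should produce a much larger H-statistic and a much smaller p-value than comparing the two identically distributed groups.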

- Spearman coefficient.

The Spearman coefficient measures the strength and direction of a monotonic relationship between two continuous (or ordinal) variables. "A monotonic relationship between two continuous variables refers to a scenario where a change in one variable is generally associated with a change in a specific direction in another variable."

(https://www.statology.org/monotonic-relationship/)
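To make the idea of a monotonic relationship concrete, the sketch below compares Spearman's and Pearson's coefficients on a relationship that is strictly increasing but not linear. The data is synthetic and chosen only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Synthetic example: y always increases with x, but not along a straight line.
x = np.linspace(0.0, 10.0, 50)
y = np.exp(x)

rho, _ = spearmanr(x, y)  # rank-based: only the ordering of the values matters
r, _ = pearsonr(x, y)     # measures linear association on the raw values

print(f"Spearman rho = {rho:.3f}")
print(f"Pearson r    = {r:.3f}")
```

Because the ranks of `x` and `y` agree perfectly, Spearman's coefficient is 1, while Pearson's coefficient is lower because the relationship is not linear.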

#### Import Dependencies and Dataframe


In [10]:
import pandas as pd

from sklearn.preprocessing import StandardScaler
from scipy.stats import kruskal
from scipy.stats import chi2 # Used to find the critical chi-square value for rejecting H0
from scipy.stats import spearmanr

In [11]:
df = pd.read_csv('../../data/processed/numerical_encoded_Fire_Incidents_Data.csv')

#### Test 1: Assessing correlation between Number_of_responding_personnel and Estimated_Dollar_Loss

- **Null Hypothesis (H0):** There is no significant association between the number of responding personnel and the estimated dollar loss.

- **Alternative Hypothesis (H1):** There is a significant association between the number of responding personnel and the estimated dollar loss.

In [12]:
# Initialize StandardScaler
scaler = StandardScaler()

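# Standardize both variables (zero mean, unit variance).
# Spearman's coefficient is rank-based, so this step does not change the result.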
df[['Estimated_Dollar_Loss', 'Number_of_responding_personnel']] = scaler.fit_transform(df[['Estimated_Dollar_Loss', 'Number_of_responding_personnel']])

spearman_corr, p_value = spearmanr(df['Estimated_Dollar_Loss'], df['Number_of_responding_personnel'])

print("Spearman's Rank Correlation Coefficient:", spearman_corr)
print("P-value:", p_value)

Spearman's Rank Correlation Coefficient: 0.5638037628815248
P-value: 0.0


##### Test 1 Results.
The Spearman's Rank Correlation Coefficient of approximately 0.564 indicates a moderate positive monotonic correlation between the number of responding personnel and the estimated dollar loss per fire incident.

The p-value is reported as 0.0, which means it is too small to be represented in floating point rather than exactly zero. At a significance level of 0.01, this result is statistically significant.

Therefore, the null hypothesis (H0) of Test 1 can be rejected.
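On the standardization step applied before this test: Spearman's coefficient depends only on ranks, so z-scoring the inputs should leave it unchanged. The sketch below checks this on synthetic data. The variable names and the data-generating process are assumptions for illustration, and `zscore` is a hypothetical helper that mirrors what `StandardScaler` computes.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)

# Hypothetical stand-ins for personnel counts and dollar losses (synthetic data).
personnel = rng.integers(1, 60, size=500).astype(float)
loss = personnel * 1000.0 + rng.lognormal(mean=8.0, sigma=1.5, size=500)


def zscore(values):
    """Standardize to zero mean and unit variance (population std, as StandardScaler uses)."""
    return (values - values.mean()) / values.std()


rho_raw, _ = spearmanr(loss, personnel)
rho_scaled, _ = spearmanr(zscore(loss), zscore(personnel))

print(f"raw:          rho = {rho_raw:.6f}")
print(f"standardized: rho = {rho_scaled:.6f}")
```

The two coefficients should agree up to floating-point error, which is why the scaling step is harmless but not required for this test.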

#### Test 2: Assessing the difference in Estimated_Dollar_Loss across Material_First_Ignited
- **Null Hypothesis (H0):** There is no significant difference in the estimated dollar loss per fire incident across different materials first ignited.

- **Alternative Hypothesis (H1):** There is a significant difference in the estimated dollar loss per fire incident across different materials first ignited.

In [13]:
nominal_feature = 'Material_First_Ignited'
continuous_feature = 'Estimated_Dollar_Loss'

# Perform Kruskal-Wallis test. Estimated_Dollar_Loss has already been standardized, though it does not need to be: the test is rank-based.
h_statistic, p_value = kruskal(*[group[continuous_feature] for _, group in df.groupby(nominal_feature)])

print("H-statistic:", h_statistic)
print("P-value:", p_value)


alpha = 0.01  # Significance level

# Degrees of freedom: number of groups minus 1
degrees_of_freedom = len(df['Material_First_Ignited'].unique()) - 1

# Find the critical chi-square value. 
critical_chi2_value = chi2.ppf(1 - alpha, degrees_of_freedom)

print("Critical Chi-square Value:", critical_chi2_value)

H-statistic: 2847.2881536150944
P-value: 0.0
Critical Chi-square Value: 98.02840328331405


##### Test 2 Results.

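The H-statistic of approximately 2847.29 far exceeds the critical chi-square value of approximately 98.03 at a significance level of 0.01, which indicates a significant difference in estimated dollar loss across the materials first ignited.
The p-value is reported as 0.0, meaning it is too small to be represented in floating point, so the result is statistically significant.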

Therefore, the null hypothesis (H0) of test 2 can be rejected, since there is a significant difference in the estimated dollar loss per fire incident across different materials first ignited.
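Comparing the H-statistic with a critical chi-square value and comparing the p-value with the significance level are two views of the same decision, because the p-value of the Kruskal-Wallis test is the chi-square upper-tail probability of H. The sketch below checks this on synthetic groups. The group count, sizes, and scales are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import chi2, kruskal

rng = np.random.default_rng(2)

# Four hypothetical groups (synthetic data); the last one has a larger scale.
groups = [rng.exponential(scale=s, size=100) for s in (1.0, 1.0, 1.0, 2.0)]

alpha = 0.01
h_statistic, p_value = kruskal(*groups)
dof = len(groups) - 1  # number of groups minus 1

critical_value = chi2.ppf(1 - alpha, dof)  # threshold that H must exceed
p_from_chi2 = chi2.sf(h_statistic, dof)    # upper-tail probability of H

reject_by_critical_value = h_statistic > critical_value
reject_by_p_value = p_value < alpha

print("H:", h_statistic, "critical value:", critical_value)
print("p-value from kruskal:", p_value, "from chi2.sf:", p_from_chi2)
print("Same decision:", reject_by_critical_value == reject_by_p_value)
```

Either check alone is enough to decide whether to reject the null hypothesis.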

### Discussion.

These findings suggest that attributes such as the number of responding personnel and the material first ignited are significantly associated with the financial losses from fire incidents. Both tests measure association only, so they do not by themselves show that these attributes determine the losses.

#### Excerpt of Test #1 Results:
- For the correlation between Number_of_responding_personnel and Estimated_Dollar_Loss:
  - Spearman correlation coefficient: 0.5638037628815248

#### Excerpt of Test #2 Results:
- For the difference in Estimated_Dollar_Loss across Material_First_Ignited:
  - H-statistic: 2847.2881536150944
  - Critical Chi-square Value: 98.02840328331405

At a significance level of 0.01, the results of Test #1 and Test #2 **indicate a rejection of the null hypothesis** in both tests, suggesting statistically significant relationships between the circumstances of a fire incident and its impact. As a result, further data mining will be pursued to gain additional insight into these relationships, with the aim of developing predictive models that enhance emergency response, minimize financial losses, and reduce human casualties.