
# **10 Essential Data Normality Tests for Machine Learning**
**Introduction to Data Normality Testing**

Many statistical procedures and some machine learning models assume normally distributed data, so checking that assumption is a routine step. This article covers 10 topics: three visual checks, six quantitative methods, and a worked example on simulated height data. Each one comes with a code example.

## **1. Visual Methods - Histogram**
A histogram is a simple way to visualize the distribution of your data. Here’s how you can generate a histogram for a normal distribution:

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Generate a normal distribution for demonstration
np.random.seed(42)
normal_data = np.random.normal(loc=0, scale=1, size=1000)

# Plot histogram
plt.figure(figsize=(10, 6))
plt.hist(normal_data, bins=30, density=True, alpha=0.7)
plt.title("Histogram of Normal Distribution")
plt.xlabel("Value")
plt.ylabel("Density")  # density=True normalizes bar heights, so the axis is a density, not a count
plt.show()
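
A histogram is judged by eye, but the same comparison can be put into numbers. The sketch below is an illustration rather than part of the original notebook: it rebuilds the density histogram with `np.histogram` and measures how far the bars sit from a normal curve fitted to the sample. The variable names, the seed, and the choice of "largest gap" as a summary are assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=1000)

# Same binning as a density histogram: bar heights integrate to 1
density, edges = np.histogram(sample, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])

# Normal curve using the sample's own mean and standard deviation
fitted_pdf = norm.pdf(centers, loc=sample.mean(), scale=sample.std(ddof=1))

# Largest vertical distance between a bar and the fitted curve
max_gap = np.max(np.abs(density - fitted_pdf))
print(f"Largest gap between bar height and fitted curve: {max_gap:.3f}")
```

A small gap relative to the peak height of the curve (about 0.4 for a standard normal) matches what a bell-shaped histogram looks like; the number depends on the bin count, so it is a descriptive aid, not a test.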


## **2. Visual Methods - Q-Q Plot**
A Q-Q (Quantile-Quantile) plot is a graphical tool to assess if a dataset follows a normal distribution. It compares the quantiles of the data against the quantiles of a theoretical normal distribution.

In [None]:
import statsmodels.api as sm

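fig, ax = plt.subplots(figsize=(10, 6))
# Pass the axes explicitly; otherwise qqplot opens its own figure and leaves an empty one behind
sm.qqplot(normal_data, line='s', ax=ax)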
plt.title("Q-Q Plot")
plt.show()
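
To make the construction concrete, the following sketch computes the two sets of quantiles that a Q-Q plot compares, without drawing anything. It assumes the plotting positions `(i - 0.5) / n`, which is one common convention; plotting libraries may use slightly different positions, so the numbers will not match any particular library exactly.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=1000)

# Empirical quantiles: the sorted sample values
ordered = np.sort(sample)
n = ordered.size

# Theoretical quantiles of the standard normal at matching probabilities
probs = (np.arange(1, n + 1) - 0.5) / n  # assumed plotting positions
theoretical = norm.ppf(probs)

# If the data is normal, the (theoretical, ordered) pairs fall close to a
# straight line, so their correlation is close to 1
r = np.corrcoef(theoretical, ordered)[0, 1]
print(f"Correlation between theoretical and sample quantiles: {r:.4f}")
```

Plotting `ordered` against `theoretical` reproduces the point cloud of a Q-Q plot; the correlation summarizes how straight that cloud is.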


## **3. Visual Methods - Probability Plot**
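A probability plot is closely related to a Q-Q plot: `scipy.stats.probplot` plots the ordered sample values against theoretical normal quantiles and adds a least-squares line through the points. Points that bend away from the line, especially at the ends, indicate deviations from normality in the tails of the distribution.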

In [None]:
import scipy.stats as stats

fig, ax = plt.subplots(figsize=(10, 6))
res = stats.probplot(normal_data, plot=ax)
ax.set_title("Probability Plot")
plt.show()
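
`probplot` also returns the numbers behind the plot, which can be useful when no figure is needed. The sketch below calls it without a `plot` argument on a simulated sample (the sample, the seed, and the variable names are assumptions for illustration). For normal data, the slope of the fitted line approximates the standard deviation and the intercept approximates the mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights_cm = rng.normal(loc=170, scale=10, size=1000)

# Without plot=..., probplot only computes the quantile pairs and the fit
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    heights_cm, dist="norm"
)

print(f"Slope (approximates the standard deviation): {slope:.2f}")
print(f"Intercept (approximates the mean): {intercept:.2f}")
print(f"Correlation of the fit: {r:.4f}")
```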



## **4. Shapiro-Wilk Test**
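The Shapiro-Wilk test is a statistical method to test the null hypothesis that a sample comes from a normally distributed population. It has good power for small to moderate sample sizes; for samples larger than about 5,000 observations, SciPy notes that the reported p-value may not be accurate.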

In [None]:
from scipy.stats import shapiro

stat, p_value = shapiro(normal_data)
print(f"Shapiro-Wilk test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value > alpha:
    print("The data is likely normally distributed (fail to reject H0)")
else:
    print("The data is likely not normally distributed (reject H0)")
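
It helps to see the test reject as well as fail to reject. The sketch below runs `shapiro` on a normal sample and on a clearly skewed one; the exponential distribution, sample size, and seed are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(1)
symmetric = rng.normal(size=200)
skewed = rng.exponential(scale=1.0, size=200)  # strongly right-skewed

for name, sample in [("normal", symmetric), ("exponential", skewed)]:
    w, p = shapiro(sample)
    print(f"{name:>12}: W = {w:.4f}, p-value = {p:.4g}")

w_skewed, p_skewed = shapiro(skewed)
```

The W statistic lies between 0 and 1, with values near 1 indicating a good match to normality; the skewed sample yields a noticeably lower W and a very small p-value.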


## **5. Kolmogorov-Smirnov Test**
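The Kolmogorov-Smirnov (K-S) test compares the empirical cumulative distribution function of the data with that of a reference distribution. It is useful for larger sample sizes. Note that `kstest(data, 'norm')` tests against the standard normal (mean 0, standard deviation 1), so data on any other scale must be standardized first or have its parameters passed via `args`.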

In [None]:
from scipy.stats import kstest

stat, p_value = kstest(normal_data, 'norm')
print(f"K-S test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value > alpha:
    print("The data is likely normally distributed (fail to reject H0)")
else:
    print("The data is likely not normally distributed (reject H0)")
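
One pitfall deserves a demonstration: `'norm'` without extra arguments means the standard normal. The sketch below applies `kstest` to simulated data centered at 170 (an assumed example) three ways: naively, with the sample mean and standard deviation passed through `args`, and after standardizing. Estimating the parameters from the same data makes the p-value conservative; the Lilliefors variant of the test is designed to correct for that.

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(42)
heights_cm = rng.normal(loc=170, scale=10, size=1000)

# Naive call: compares against N(0, 1), so normal data on another scale is rejected
naive = kstest(heights_cm, "norm")

# Compare against a normal with the sample's own mean and standard deviation
mu, sigma = heights_cm.mean(), heights_cm.std(ddof=1)
fitted = kstest(heights_cm, "norm", args=(mu, sigma))

# Equivalent: standardize first, then compare against N(0, 1)
standardized = kstest((heights_cm - mu) / sigma, "norm")

print(f"Naive p-value:        {naive.pvalue:.4g}")
print(f"Fitted p-value:       {fitted.pvalue:.4f}")
print(f"Standardized p-value: {standardized.pvalue:.4f}")
```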


## **6. Anderson-Darling Test**
The Anderson-Darling test is another statistical method for testing normality. It’s more sensitive to deviations in the tails of the distribution compared to the K-S test.

In [None]:
from scipy.stats import anderson

result = anderson(normal_data)
print(f"Anderson-Darling test statistic: {result.statistic:.4f}")
print("Critical values:", result.critical_values)
print("Significance levels:", result.significance_level)

# Interpret the result
for i in range(len(result.critical_values)):
    sl, cv = result.significance_level[i], result.critical_values[i]
    if result.statistic < cv:
        print(f"At {sl}% significance level, the data is normally distributed (fail to reject H0)")
    else:
        print(f"At {sl}% significance level, the data is not normally distributed (reject H0)")
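
The extra sensitivity in the tails comes from how the statistic is built: each ordered observation contributes the log of its fitted CDF value, which grows quickly in magnitude near 0 and 1. The sketch below computes the statistic by hand from the textbook formula, with the mean and standard deviation estimated from the sample, and compares it with SciPy's value. Variable names and the seed are assumptions for illustration.

```python
import numpy as np
from scipy.stats import anderson, norm

rng = np.random.default_rng(42)
sample = rng.normal(size=500)

n = sample.size
# Standardize with the sample mean and standard deviation, then sort
z = np.sort((sample - sample.mean()) / sample.std(ddof=1))
i = np.arange(1, n + 1)

# A^2 = -n - (1/n) * sum_i (2i - 1) * [ln F(z_i) + ln(1 - F(z_{n+1-i}))]
a2_manual = -n - np.sum((2 * i - 1) / n * (norm.logcdf(z) + norm.logsf(z[::-1])))

a2_scipy = anderson(sample, dist="norm").statistic
print(f"Manual A^2: {a2_manual:.6f}")
print(f"SciPy A^2:  {a2_scipy:.6f}")
```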


## **7. Skewness and Kurtosis**
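Skewness measures the asymmetry of the distribution, while kurtosis measures the “tailedness” of the distribution. Normal distributions have a skewness of 0 and a kurtosis of 3, which corresponds to an excess kurtosis of 0. SciPy's `kurtosis` returns excess kurtosis by default (`fisher=True`), so values near 0 are expected for normal data.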

In [None]:
from scipy.stats import skew, kurtosis

skewness = skew(normal_data)
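kurt = kurtosis(normal_data)  # excess kurtosis (Fisher definition): 0 for a normal distribution

print(f"Skewness: {skewness:.4f}")
print(f"Excess kurtosis: {kurt:.4f}")

# Interpret the results (0.5 is a rule-of-thumb cutoff, not a formal test)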
if abs(skewness) < 0.5 and abs(kurt) < 0.5:
    print("The data is approximately normally distributed")
else:
    print("The data may not be normally distributed")
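
The two kurtosis conventions are easy to confuse, so the sketch below shows both and recomputes skewness and kurtosis from central moments. It uses the biased (population) moment estimators, which are SciPy's defaults; the variable names and seed are assumptions for illustration.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(42)
sample = rng.normal(size=1000)

excess = kurtosis(sample)                 # Fisher definition: normal -> 0
pearson = kurtosis(sample, fisher=False)  # Pearson definition: normal -> 3

# Manual computation from central moments (biased estimators)
centered = sample - sample.mean()
m2 = np.mean(centered**2)
m3 = np.mean(centered**3)
m4 = np.mean(centered**4)
skew_manual = m3 / m2**1.5
kurt_manual = m4 / m2**2  # Pearson kurtosis

print(f"Excess kurtosis:  {excess:.4f}")
print(f"Pearson kurtosis: {pearson:.4f}")
print(f"Manual skewness:  {skew_manual:.4f} (SciPy: {skew(sample):.4f})")
```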


## **8. Jarque-Bera Test**
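The Jarque-Bera test is based on the sample skewness and kurtosis. It tests whether the sample skewness and kurtosis match those of a normal distribution. Because the p-value relies on an asymptotic chi-squared approximation, the test is most reliable for large samples.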

In [None]:

from scipy.stats import jarque_bera

stat, p_value = jarque_bera(normal_data)
print(f"Jarque-Bera test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value > alpha:
    print("The data is likely normally distributed (fail to reject H0)")
else:
    print("The data is likely not normally distributed (reject H0)")
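
The statistic itself is simple: JB = n/6 · (S² + K²/4), where S is the sample skewness and K the excess kurtosis, compared against a chi-squared distribution with 2 degrees of freedom. The sketch below recomputes it by hand and checks it against SciPy; variable names and the seed are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2, jarque_bera, kurtosis, skew

rng = np.random.default_rng(42)
sample = rng.normal(size=1000)

n = sample.size
s = skew(sample)             # sample skewness (biased estimator)
k_excess = kurtosis(sample)  # sample excess kurtosis (biased estimator)

# JB = n/6 * (S^2 + K^2 / 4), asymptotically chi-squared with 2 degrees of freedom
jb_manual = n / 6 * (s**2 + k_excess**2 / 4)
p_manual = chi2.sf(jb_manual, df=2)

jb_scipy, p_scipy = jarque_bera(sample)
print(f"Manual: JB = {jb_manual:.4f}, p-value = {p_manual:.4f}")
print(f"SciPy:  JB = {jb_scipy:.4f}, p-value = {p_scipy:.4f}")
```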


## **9. D'Agostino's K^2 Test**

D'Agostino's K^2 test combines skewness and kurtosis to produce an omnibus test of normality. It's effective at detecting deviations from normality due to either skewness or kurtosis.

In [None]:
from scipy.stats import normaltest

stat, p_value = normaltest(normal_data)
print(f"D'Agostino's K^2 test statistic: {stat:.4f}")
print(f"p-value: {p_value:.4f}")

# Interpret the result
alpha = 0.05
if p_value > alpha:
    print("The data is likely normally distributed (fail to reject H0)")
else:
    print("The data is likely not normally distributed (reject H0)")
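
The "omnibus" part can be made explicit. In SciPy, the K^2 statistic is the sum of the squared z-scores from `skewtest` and `kurtosistest`, referred to a chi-squared distribution with 2 degrees of freedom. The sketch below reassembles it from those pieces; variable names and the seed are assumptions for illustration.

```python
import numpy as np
from scipy.stats import chi2, kurtosistest, normaltest, skewtest

rng = np.random.default_rng(42)
sample = rng.normal(size=1000)

z_skew = skewtest(sample).statistic      # z-score for skewness
z_kurt = kurtosistest(sample).statistic  # z-score for kurtosis

# K^2 combines both deviations into a single statistic
k2_manual = z_skew**2 + z_kurt**2
p_manual = chi2.sf(k2_manual, df=2)

k2_scipy, p_scipy = normaltest(sample)
print(f"z_skew = {z_skew:.4f}, z_kurt = {z_kurt:.4f}")
print(f"Manual: K^2 = {k2_manual:.4f}, p-value = {p_manual:.4f}")
print(f"SciPy:  K^2 = {k2_scipy:.4f}, p-value = {p_scipy:.4f}")
```

Looking at the two z-scores separately shows whether a rejection is driven by asymmetry, by tail weight, or by both.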

## **10. Example: Simulated Height Distribution**

Let's apply some of these tests to a realistic example: the distribution of heights in a population, simulated here as a normal distribution with a mean of 170 cm and a standard deviation of 10 cm.

In [None]:
# Simulate height data (in cm) for a population
np.random.seed(42)
heights = np.random.normal(loc=170, scale=10, size=1000)

# Visual check
plt.figure(figsize=(10, 6))
plt.hist(heights, bins=30, density=True, alpha=0.7)
plt.title("Distribution of Heights")
plt.xlabel("Height (cm)")
plt.ylabel("Density")  # density=True normalizes bar heights, so the axis is a density, not a count
plt.show()

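# Perform normality tests
_, p_shapiro = shapiro(heights)
# kstest's 'norm' is the standard normal, so pass the sample mean and standard deviation.
# Estimating them from the same data makes the p-value conservative (see the Lilliefors test).
_, p_kstest = kstest(heights, 'norm', args=(heights.mean(), heights.std(ddof=1)))
_, p_normaltest = normaltest(heights)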

print(f"Shapiro-Wilk p-value: {p_shapiro:.4f}")
print(f"Kolmogorov-Smirnov p-value: {p_kstest:.4f}")
print(f"D'Agostino's K^2 p-value: {p_normaltest:.4f}")
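
When several tests are run on the same data, a small helper keeps the results together. The function below, `normality_report`, is a hypothetical convenience wrapper written for this example, not part of SciPy. It fits the normal parameters from the data for the K-S test, so that p-value should be read as conservative. The skewed lognormal sample is an assumed contrast case.

```python
import numpy as np
from scipy.stats import jarque_bera, kstest, normaltest, shapiro


def normality_report(x, alpha=0.05):
    """Hypothetical helper: map each test name to (p_value, rejected)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std(ddof=1)
    p_values = {
        "Shapiro-Wilk": shapiro(x).pvalue,
        "Kolmogorov-Smirnov": kstest(x, "norm", args=(mu, sigma)).pvalue,
        "D'Agostino K^2": normaltest(x).pvalue,
        "Jarque-Bera": jarque_bera(x).pvalue,
    }
    return {name: (p, bool(p < alpha)) for name, p in p_values.items()}


rng = np.random.default_rng(42)
heights_cm = rng.normal(loc=170, scale=10, size=1000)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # clearly non-normal

for label, data in [("heights", heights_cm), ("skewed", skewed)]:
    print(label)
    for name, (p, rejected) in normality_report(data).items():
        verdict = "reject H0" if rejected else "fail to reject H0"
        print(f"  {name:<20} p = {p:.4g}  ({verdict})")

report_skewed = normality_report(skewed)
```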



## Summary of Normality Tests and Conclusion

 Based on the analysis of various normality tests performed on simulated data, the following is a summary of the findings:

### 1. Visual Tests (Histogram, Q-Q Plot, Probability Plot):
 - The histogram shows a roughly bell-shaped distribution, suggesting a possible normal distribution.
 - The Q-Q plot shows a linear trend, indicating that the data follows a normal distribution quite closely.
 - The probability plot also shows a good fit with the theoretical normal distribution.


### 2. Statistical Tests:
 - Shapiro-Wilk Test: The p-value is above 0.05, so we cannot reject the null hypothesis that the data is normally distributed.
 - Kolmogorov-Smirnov Test: The p-value is also above 0.05, in agreement with the Shapiro-Wilk test.
 - Anderson-Darling Test: The statistic stays below the critical value at every significance level checked, so normality is not rejected at any of them.
 - Skewness and Kurtosis: The skewness and the excess kurtosis reported by SciPy are both close to 0 (equivalently, a kurtosis close to 3), suggesting a near-normal distribution.
 - Jarque-Bera Test: The p-value is above 0.05, so we fail to reject the null hypothesis of normality.
 - D'Agostino's K^2 Test: The p-value is above 0.05, so we cannot reject the null hypothesis that the data is normally distributed.

### Conclusion:
 The visual checks and the statistical tests (Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, Jarque-Bera, D'Agostino's K^2) agree: none of them finds evidence against normality in the simulated data.
 Failing to reject the null hypothesis does not prove normality, but taken together these results indicate that the data is consistent with the normality assumption required by subsequent analyses or modeling tasks.

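#### *Note: The Shapiro-Wilk test is generally considered one of the most powerful tests for normality, but like any such test it is sensitive to sample size: small samples give it little power, while very large samples can flag even trivial deviations.*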
*It is best to consider a range of methods like visual checks and multiple statistical tests together to get a comprehensive conclusion about normality.*
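
The effect of sample size can be illustrated with a distribution that is almost, but not exactly, normal. The sketch below draws from a Student's t distribution with 20 degrees of freedom (an assumed example with slightly heavier tails than the normal) at two sample sizes and applies D'Agostino's K^2 test, which has no upper limit on sample size. The result for the small sample depends on the random draw.

```python
import numpy as np
from scipy.stats import normaltest

rng = np.random.default_rng(42)

# Student's t with 20 degrees of freedom: close to normal, slightly heavier tails
small = rng.standard_t(df=20, size=100)
large = rng.standard_t(df=20, size=200_000)

p_small = normaltest(small).pvalue
p_large = normaltest(large).pvalue

print(f"n = {small.size:>7}: p-value = {p_small:.4f}")
print(f"n = {large.size:>7}: p-value = {p_large:.4g}")
```

With the large sample the test rejects normality decisively, even though the deviation is small enough that it may not matter for the downstream model. This is one reason to read p-values alongside the visual checks and effect-size measures such as skewness and kurtosis.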


For further exploration of normality testing and its applications in machine learning, the following resources provide in-depth discussions of the various normality tests, their power, and their applications in different fields of study:

 - "Testing for Normality" by Ralph B. D'Agostino (1986) ArXiv: https://arxiv.org/abs/1011.2375
 - "A Study of the Power of Some Tests for Normality" by Nornadiah Mohd Razali and Yap Bee Wah (2011) ArXiv: https://arxiv.org/abs/1012.2754
 - "Normality Tests for Statistical Analysis: A Guide for Non-Statisticians" by Ghasemi and Zahediasl (2012) DOI: 10.5812/ijem.3505