This repository contains an end-to-end notebook for exploring data distributions, checking for normality, and applying the right statistical tests before using the data in machine learning models.
The notebook demonstrates:
- How to inspect and visualize data distributions
- How to measure skewness and kurtosis
- How to test for normality using statistical tests
- How to apply data transformations to improve normality
- When to use parametric vs non-parametric tests
- Why all this matters in machine learning preprocessing
Libraries used:
- NumPy, Pandas for data handling
- Matplotlib, Seaborn for visualization
- SciPy (`scipy.stats`) for statistical tests
- Scikit-learn for transformations and scaling
The dataset is loaded (e.g., Mall_Customers.csv) and inspected for:
- Columns, data types, and shape
- Missing values and unique counts
- Basic descriptive statistics
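A minimal sketch of this inspection step, assuming the file sits at `data/Mall_Customers.csv` (adjust the path to your copy):

```python
import pandas as pd

# Load the dataset; the path is an assumption, adjust as needed
df = pd.read_csv("data/Mall_Customers.csv")

print(df.shape)           # rows and columns
print(df.dtypes)          # data type of each column
print(df.isnull().sum())  # missing values per column
print(df.nunique())       # unique value counts per column
print(df.describe())      # basic descriptive statistics
```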
Histograms and distribution plots are used to understand how features such as Age, Annual Income, and Spending Score are distributed, including comparisons between genders.
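A sketch of the plotting step, continuing from the loaded `df` and assuming the Mall Customers column names (`Age`, `Annual Income (k$)`, `Spending Score (1-100)`, `Gender`):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram with a KDE overlay for each numeric feature
for col in ["Age", "Annual Income (k$)", "Spending Score (1-100)"]:
    sns.histplot(data=df, x=col, kde=True)
    plt.title(f"Distribution of {col}")
    plt.show()

# Compare a feature's distribution between genders
sns.kdeplot(data=df, x="Age", hue="Gender", common_norm=False)
plt.show()
```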
Skewness indicates whether the data is:
- Right-skewed (long right tail)
- Left-skewed (long left tail)
- Symmetric
| Skew Type | Meaning | Fix |
|---|---|---|
| Right-skewed | Many small values, few large ones | log(x+1), sqrt(x), Box–Cox |
| Left-skewed | Many large values, few small ones | x², exp(x) |
| No skew | Balanced | No action needed |
Kurtosis indicates tail heaviness and peak sharpness; the values below are excess kurtosis (normal distribution = 0).
| Excess Kurtosis | Shape | Implication |
|---|---|---|
| < 0 | Platykurtic (flat) | Evenly spread, fewer outliers |
| ≈ 0 | Mesokurtic (normal) | Typical distribution |
| > 0 | Leptokurtic (sharp) | More outliers, heavy tails |
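Both measures are available in `scipy.stats`; a quick sketch (the `Age` column is an assumption):

```python
from scipy import stats

age = df["Age"]

# A perfect normal distribution has skewness 0 and excess kurtosis 0
print("Skewness:", stats.skew(age))
print("Excess kurtosis:", stats.kurtosis(age))  # Fisher definition by default
```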
Several transformations were applied and visually compared:
| Method | Works For | Example |
|---|---|---|
| Log Transform | Right-skewed data | np.log1p(x) |
| Square Root | Mild skew | np.sqrt(x) |
| Box–Cox | Positive data only | scipy.stats.boxcox(x) |
| Yeo–Johnson | Any data (pos/neg) | PowerTransformer(method='yeo-johnson') |
Distribution plots before and after transformation show improved symmetry and bell-shape.
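A sketch of the four transformations, assuming a strictly positive column such as `Annual Income (k$)`:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

x = df["Annual Income (k$)"]

x_log = np.log1p(x)              # log(x + 1), for right skew
x_sqrt = np.sqrt(x)              # for mild skew
x_bc, lam = stats.boxcox(x)      # requires strictly positive values
x_yj = PowerTransformer(method="yeo-johnson").fit_transform(
    x.to_numpy().reshape(-1, 1)  # works for positive and negative values
)

print("Skew before:", stats.skew(x), "| after Box-Cox:", stats.skew(x_bc))
```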
Four normality tests were conducted to statistically assess whether data follows a normal distribution.
| Test | Purpose | Key Idea |
|---|---|---|
| Shapiro–Wilk | Tests for normality | Works best for small samples |
| Kolmogorov–Smirnov (KS) | Compares data vs. reference | Measures goodness-of-fit |
| Anderson–Darling | Tail-sensitive test | Compares against critical values |
| D’Agostino–Pearson K² | Uses skewness & kurtosis | Detects subtle deviations |
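A sketch of the p-value-based tests with `scipy.stats` (the `Age` column is an assumption; fitting the KS reference normal from the sample is a common shortcut, though it makes that test approximate):

```python
from scipy import stats

age = df["Age"].dropna()

stat, p = stats.shapiro(age)            # Shapiro-Wilk
print(f"Shapiro-Wilk: p = {p:.4f}")

stat, p = stats.kstest(age, "norm", args=(age.mean(), age.std()))
print(f"Kolmogorov-Smirnov: p = {p:.4f}")

stat, p = stats.normaltest(age)         # D'Agostino-Pearson K²
print(f"D'Agostino-Pearson: p = {p:.4f}")
```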
- Shapiro–Wilk p < 0.05 → Not normal
- KS p < 0.05 → Not normal
- Anderson–Darling statistic > critical values → Not normal
- D’Agostino p < 0.05 → Not normal
🧠 Conclusion: The Age variable deviates from a normal distribution despite mild skew/kurtosis; large samples make these tests more sensitive.
All tests check the same null hypothesis:
H₀: Data is normally distributed
H₁: Data is not normally distributed
| Result | Interpretation |
|---|---|
| p > 0.05 | ✅ Data likely normal |
| p ≤ 0.05 | ❌ Data not normal |
For Anderson–Darling:
If statistic < critical value → normal; else → not normal.
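A sketch of that decision rule with `scipy.stats.anderson` (the `Age` column is an assumption):

```python
from scipy import stats

result = stats.anderson(df["Age"].dropna(), dist="norm")
print("A-D statistic:", result.statistic)

# Compare the statistic against the critical value at each significance level
for crit, sig in zip(result.critical_values, result.significance_level):
    verdict = "looks normal" if result.statistic < crit else "not normal"
    print(f"{sig}% level: critical value = {crit:.3f} -> {verdict}")
```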
Normality matters more for some models than others:
| Model Type | Normality Needed? | Notes |
|---|---|---|
| Linear Regression | ✅ Yes | Assumes normally distributed errors |
| Logistic Regression | ⚙️ Partially | No normality assumption on errors; unskewed, scaled features still help |
| PCA, LDA | ✅ Yes | Normality improves projection |
| Tree-based models (RF, XGBoost) | ❌ No | Split by thresholds |
| Neural Networks | ⚙️ Partially | Scaling helps convergence |
| KNN, SVM | ⚙️ Sometimes | Sensitive to scale/distribution |
Failing a normality test doesn’t “break” ML — it only means you may need transformations or robust models.
Parametric tests (appropriate when data is approximately normal):
- t-test: compares two means
- ANOVA: compares three or more means
- Pearson correlation: measures linear association
Non-parametric tests are used when normality tests fail.
Examples:
- Mann–Whitney U
- Kruskal–Wallis
- Spearman correlation
They work on ranks or medians, not means.
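A sketch of two rank-based tests, assuming the Mall Customers column names:

```python
from scipy import stats

male = df.loc[df["Gender"] == "Male", "Spending Score (1-100)"]
female = df.loc[df["Gender"] == "Female", "Spending Score (1-100)"]

# Mann-Whitney U: rank-based alternative to the two-sample t-test
stat, p = stats.mannwhitneyu(male, female)
print(f"Mann-Whitney U: p = {p:.4f}")

# Spearman correlation: rank-based alternative to Pearson
rho, p = stats.spearmanr(df["Age"], df["Spending Score (1-100)"])
print(f"Spearman rho = {rho:.3f}, p = {p:.4f}")
```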
Key takeaways:
- Most real-world data is not perfectly normal, and that's fine.
- Use visual + statistical checks together (histograms, Q–Q plots).
- For large datasets, rely more on visual judgment than strict p-values.
- Scaling ≠ Normalizing:
  - Scaling changes range/variance.
  - Normalizing changes distribution shape.

Common scalers compared:
| Scaler | Output Range | Outlier Handling | Notes |
|---|---|---|---|
| MinMaxScaler | [0,1] | No | Sensitive to outliers |
| StandardScaler | Any | No | Mean = 0, Std = 1 |
| RobustScaler | Any | Yes | Uses median & IQR |
| QuantileTransformer | [0,1] or normal | Yes | Maps to uniform/normal |
| PowerTransformer | Any | Some | Reduces skew |
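A sketch comparing the scalers side by side on one column (the column name is an assumption; `n_quantiles` is lowered to suit a small dataset):

```python
from sklearn.preprocessing import (MinMaxScaler, PowerTransformer,
                                   QuantileTransformer, RobustScaler,
                                   StandardScaler)

X = df[["Annual Income (k$)"]]  # scalers expect a 2D array

for scaler in [MinMaxScaler(), StandardScaler(), RobustScaler(),
               QuantileTransformer(n_quantiles=100, output_distribution="normal"),
               PowerTransformer()]:
    Xt = scaler.fit_transform(X)
    print(f"{type(scaler).__name__:>20}: "
          f"min={Xt.min():.2f} max={Xt.max():.2f} "
          f"mean={Xt.mean():.2f} std={Xt.std():.2f}")
```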
This notebook provides a complete workflow for:
- Exploring and visualizing data
- Measuring skewness and kurtosis
- Testing for normality
- Applying appropriate transformations
- Choosing correct parametric or non-parametric tests
- Understanding how data shape affects machine learning
Follow these steps to set up and run the project:

1. Clone the repository

   ```bash
   git clone <repository-url>
   cd <repository-folder>
   ```

2. Install dependencies

   Make sure Python and pip are installed on your system, then run:

   ```bash
   pip install -r requirements.txt
   ```

3. Open the notebook

   You can open `main.ipynb` using either:
   - VS Code Jupyter extension: open VS Code and open `main.ipynb`.
   - Jupyter Notebook: run `jupyter notebook` from the project folder; this will open a browser window where you can select `main.ipynb`.

4. Run the notebook cells

   Click the Run ▶️ button on each cell or press `Shift + Enter`.
   The notebook will automatically use the CSV file located in the `data/` folder.

5. Verify the CSV file

   Ensure that your data file exists at `data/your_data.csv`.
   The notebook reads data from this file, so it must be present to run correctly.
Author: Mohammad Hammad Ahmad
Email: mdhammadahmadgithub@gmail.com
LinkedIn: www.linkedin.com/in/mohammad-hammad-ahmad-188628227
Feel free to reach out via email or on LinkedIn.