Professional EDA workflow: profiling → cleaning → visualisation → insight
1. Data Generation → 500-row synthetic e-commerce dataset with realistic quality issues
2. Inspection → Shape, dtypes, memory, sample rows
3. Missing Values → Per-column null counts and percentages
4. Distributions → Revenue histogram (linear + log scale), outlier detection
5. Correlations → Heatmap, high-correlation pairs
6. Categorical → Revenue by category, region, salesperson
7. Time Series → Monthly trend with 3-month rolling average
8. Visualizations → 6 charts saved to reports/figures/
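As a rough illustration of step 3, the per-column null profile can be sketched in pandas as below. This is a minimal sketch, not the project's actual code: the helper name `missing_summary`, the toy `orders` frame, and its column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null count and null percentage per column, worst column first."""
    null_count = df.isna().sum()
    null_pct = (df.isna().mean() * 100).round(1)
    summary = pd.DataFrame({"null_count": null_count, "null_pct": null_pct})
    return summary.sort_values("null_pct", ascending=False)


# Tiny illustrative frame; the column names are assumptions, not the real schema.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "email": ["a@example.com", None, "c@example.com", None],
        "revenue": [120.0, 80.0, np.nan, 45.5],
    }
)

print(missing_summary(orders))
```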
| File | Description |
|---|---|
| 01_revenue_distribution.png | Histogram (linear + log): shows right-skewed revenue |
| 02_revenue_by_category.png | Horizontal bar: category contribution to total revenue |
| 03_monthly_trend.png | Line + bars: revenue trend with Q4 seasonal spike |
| 04_correlation_heatmap.png | Correlation matrix for numeric features |
| 05_salesperson_performance.png | Scatter plot: volume vs revenue per salesperson |
| 06_missing_values.png | Bar chart: null rates per column |
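The series behind a chart like 03_monthly_trend.png could be computed roughly as follows. This is a sketch under assumptions: the column names `order_date` and `revenue` and the sample values are hypothetical, and the project's own script may aggregate differently. Note that grouping by period skips months with no orders, so a real pipeline may want to reindex onto a full monthly range before smoothing.

```python
import pandas as pd

# Hypothetical order-level data; "order_date" and "revenue" are assumed names.
orders = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-11", "2024-03-02", "2024-04-15"]
        ),
        "revenue": [100.0, 50.0, 90.0, 120.0, 300.0],
    }
)

# Sum revenue per calendar month, then smooth with a 3-month rolling mean.
# The first two months have no full 3-month window, so their rolling value is NaN.
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["revenue"].sum()
rolling_3m = monthly.rolling(window=3).mean()

trend = pd.DataFrame({"revenue": monthly, "rolling_3m": rolling_3m})
print(trend)
```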
git clone https://github.com/YOUR_USERNAME/python-eda-showcase.git
cd python-eda-showcase
python -m venv venv
# Activate the environment (Windows shown; on macOS/Linux use: source venv/bin/activate)
venv\Scripts\activate
pip install -r requirements.txt
# Run full EDA and generate charts
python src/run_eda.py
# Run tests (no display needed)
pytest tests/ -v
# Charts saved to reports/figures/

| Insight | Finding | Action |
|---|---|---|
| Right-skewed revenue | Most orders < $200, but max = $2,500+ | Report the median, not the mean |
| 8% null emails | CRM integration drops email on sync | Add NOT NULL constraint at ingestion |
| Q4 seasonal spike | October–December volume is 40% higher | Pre-scale warehouse compute |
| Furniture highest AOV | Average order value is 3× that of Electronics | Target enterprise for upsell |
| 3% negative revenue | Refunds not filtered before analysis | Filter to revenue > 0 before analysis |
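Two of the actions above (filtering out refunds and reporting the median) can be illustrated with a small sketch. The values below are invented for the example and are not taken from the dataset; they only mimic the shape described in the table, with mostly small orders, one very large order, and one refund.

```python
import pandas as pd

# Hypothetical revenue values: mostly small orders, one large order, one refund.
revenue = pd.Series([40.0, 55.0, 60.0, 75.0, 120.0, 2500.0, -60.0], name="revenue")

# Keep only positive revenue so refunds do not distort the summary.
sales = revenue[revenue > 0]

# On right-skewed data the mean is pulled up by the largest orders,
# while the median stays close to a typical order.
print(f"mean:   {sales.mean():.2f}")
print(f"median: {sales.median():.2f}")
```

On this toy series the mean is 475.00 while the median is 67.50, which is why the median is the safer headline number for skewed revenue.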
- Log transformation for skewed distributions: why you need both linear and log histograms to understand revenue data
- IQR outlier detection vs Z-score: IQR is more robust to extreme values; Z-score assumes normality
- Correlation ≠ causation: high correlation between quantity and revenue is definitional, not analytical
- Seaborn themes: `set_theme()` globally styles all matplotlib charts for professional presentation
- Missing value patterns: random (MCAR) vs systematic (MAR/MNAR); 8% null emails is likely systematic
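The IQR vs Z-score point above can be made concrete with a short comparison. This is a minimal sketch, assuming the conventional 1.5 × IQR fences and a |z| > 3 cutoff; the function names and sample values are hypothetical and not taken from the project. Extreme values inflate the mean and standard deviation that the Z-score relies on, so in a small sample they can hide themselves, while quartile-based fences barely move.

```python
import numpy as np


def iqr_outliers(values: np.ndarray, k: float = 1.5) -> np.ndarray:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


def zscore_outliers(values: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag values whose absolute z-score exceeds the threshold."""
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold


# Hypothetical right-skewed revenue sample with two very large orders.
revenue = np.array([40, 55, 60, 75, 80, 95, 120, 150, 1800, 2500], dtype=float)

print("IQR flags:    ", revenue[iqr_outliers(revenue)])
print("Z-score flags:", revenue[zscore_outliers(revenue)])
```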
MIT
Part of a 10-project Data Analyst portfolio