monroesolisdata/python-eda-showcase

Python EDA Showcase

Exploratory Data Analysis with pandas, matplotlib & seaborn

Professional EDA workflow: profiling → cleaning → visualisation → insight




Analysis Workflow

1. Data Generation   → 500-row synthetic e-commerce dataset with realistic quality issues
2. Inspection        → Shape, dtypes, memory, sample rows
3. Missing Values    → Per-column null counts and percentages
4. Distributions     → Revenue histogram (linear + log scale), outlier detection
5. Correlations      → Heatmap, high-correlation pairs
6. Categorical       → Revenue by category, region, salesperson
7. Time Series       → Monthly trend with 3-month rolling average
8. Visualizations    → 6 charts saved to reports/figures/
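Step 3 of the workflow (per-column null counts and percentages) can be sketched roughly as follows. This is an illustration only, not the code in src/run_eda.py: the helper name `null_profile`, the toy `orders` frame, and its column names are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def null_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column null counts and percentages, worst columns first."""
    counts = df.isna().sum()
    profile = pd.DataFrame(
        {
            "null_count": counts,
            "null_pct": (counts / len(df) * 100).round(2),
        }
    )
    return profile.sort_values("null_count", ascending=False)


# Hypothetical toy data standing in for the synthetic e-commerce dataset.
orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3, 4],
        "email": ["a@example.com", None, "c@example.com", None],
        "revenue": [120.0, 35.5, np.nan, 80.0],
    }
)

profile = null_profile(orders)
print(profile)
```

Sorting by null count puts the columns that need attention at the top, which is also a convenient order for a null-rate bar chart.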

Charts Generated

File                              Description
01_revenue_distribution.png       Histogram (linear + log): shows right-skewed revenue
02_revenue_by_category.png        Horizontal bar: category contribution to total revenue
03_monthly_trend.png              Line + bars: revenue trend with Q4 seasonal spike
04_correlation_heatmap.png        Correlation matrix for numeric features
05_salesperson_performance.png    Scatter plot: volume vs revenue per salesperson
06_missing_values.png             Bar chart: null rates per column
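The numbers behind a chart like 03_monthly_trend.png could be computed along these lines. This is a minimal sketch that only covers the aggregation, not the plotting; the column names `order_date` and `revenue` and the sample rows are assumptions for illustration, not taken from the repository.

```python
import pandas as pd

# Hypothetical order-level data; real column names may differ.
orders = pd.DataFrame(
    {
        "order_date": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-02-10", "2024-03-15", "2024-04-02"]
        ),
        "revenue": [100.0, 50.0, 90.0, 120.0, 300.0],
    }
)

# Sum revenue per calendar month ("MS" labels each bucket by month start).
monthly = orders.set_index("order_date")["revenue"].resample("MS").sum()

# 3-month rolling average; min_periods=1 keeps the first two months
# instead of leaving them as NaN.
rolling = monthly.rolling(window=3, min_periods=1).mean()

trend = pd.DataFrame({"revenue": monthly, "rolling_3m": rolling})
print(trend)
```

Whether to use `min_periods=1` is a presentation choice: it avoids a gap at the start of the line, at the cost of averaging over fewer than three months there.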

Quick Start

git clone https://github.com/YOUR_USERNAME/python-eda-showcase.git
cd python-eda-showcase
# Windows activation shown; on macOS/Linux use: source venv/bin/activate
python -m venv venv && venv\Scripts\activate
pip install -r requirements.txt

# Run full EDA and generate charts
python src/run_eda.py

# Run tests (no display needed)
pytest tests/ -v

# Charts saved to reports/figures/

Key EDA Findings (Simulated Dataset)

Insight                  Finding                                   Action
Right-skewed revenue     Most orders < $200, but max = $2,500+     Use median not mean for reports
8% null emails           CRM integration dropping email on sync    Add NOT NULL at ingestion
Q4 seasonal spike        October–December 40% higher volume        Pre-scale warehouse compute
Furniture highest AOV    Average order value 3× Electronics        Target enterprise for upsell
3% negative revenue      Refunds not filtered before analysis      Add revenue > 0 filter
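Two of the actions above (filtering out refunds and reporting the median rather than the mean) can be illustrated with a small sketch. The revenue values are made up for the example and are not drawn from the repository's dataset.

```python
import pandas as pd

# Hypothetical revenue column: mostly small orders, one large order,
# and one refund recorded as negative revenue.
revenue = pd.Series(
    [40.0, 55.0, 60.0, 75.0, 90.0, 120.0, 2500.0, -60.0], name="revenue"
)

# Share of rows that look like refunds, measured before filtering.
refund_share = (revenue < 0).mean()

# Apply the "revenue > 0" filter before computing summary statistics.
sales = revenue[revenue > 0]

summary = {
    "mean": sales.mean(),
    "median": sales.median(),
    "refund_share": refund_share,
}
print(summary)
```

In this toy sample the single large order pulls the mean far above what a typical order looks like, while the median stays close to the bulk of the data, which is the reasoning behind the "use median" action.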

What I Learned

  • Log transformation for skewed distributions: the linear histogram shows the long right tail, while the log histogram reveals the shape of the bulk of orders, so revenue data needs both
  • IQR outlier detection vs Z-score: IQR is more robust to extreme values; Z-scores depend on the mean and standard deviation, which outliers distort, and are easiest to interpret when the data is roughly normal
  • Correlation ≠ causation: high correlation between quantity and revenue is definitional (revenue is derived from quantity), not an analytical finding
  • Seaborn themes: set_theme() globally styles all matplotlib charts for professional presentation
  • Missing value patterns: random (MCAR) vs systematic (MAR/MNAR); the 8% null emails are likely systematic
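The IQR vs Z-score point can be made concrete with a small comparison. This is a sketch under assumptions: the function names, the 1.5 × IQR fences, the |z| > 3 threshold, and the sample values are conventional choices picked for illustration, not necessarily what the repository uses.

```python
import pandas as pd


def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values whose absolute z-score exceeds the threshold."""
    # ddof=0 uses the population standard deviation.
    z = (s - s.mean()) / s.std(ddof=0)
    return z.abs() > threshold


# Hypothetical right-skewed revenue sample with one extreme order.
revenue = pd.Series([40.0, 55.0, 60.0, 75.0, 90.0, 120.0, 150.0, 180.0, 2500.0])

iqr_flags = iqr_outliers(revenue)
z_flags = zscore_outliers(revenue)

print("IQR flags:    ", revenue[iqr_flags].tolist())
print("Z-score flags:", revenue[z_flags].tolist())
```

On this small sample the IQR rule flags the 2,500 order, while the z-score rule flags nothing: the extreme value inflates both the mean and the standard deviation, so its own z-score stays below 3. Quartiles barely move when one value is extreme, which is what "more robust" means here.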

License

MIT

Part of a 10-project Data Analyst portfolio

About

End-to-end EDA toolkit in pandas: null profiling, outlier detection, correlation analysis, group statistics, and six publication-quality matplotlib and seaborn charts.
