A comprehensive data analytics pipeline analyzing nearly 100K orders (99,441) from a Brazilian e-commerce marketplace, combining SQL analytics, predictive machine learning models, and an interactive Power BI dashboard.
| Metric | Value |
|---|---|
| Status | ✅ Complete |
| Total Orders | 99,441 |
| Total Revenue | R$13,494,400.74 |
| Customers | 99,441 |
| Sellers | 3,095 |
| Products | 32,951 |
| States Covered | 27 |
| Cities Covered | 4,119 |
| Date Range | Sep 2016 - Oct 2018 |
- Processed 668,293 total records across 8 datasets
- Cleaned and validated customer, order, product, and payment data
- Created derived features for RFM analysis and delivery metrics
- 99.4% data quality after cleaning process
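The derived delivery metrics mentioned above could be computed along these lines. This is a minimal sketch, not the project's actual cleaning code: the dictionary keys (`purchased`, `delivered`, `estimated`) and the sample order are hypothetical stand-ins for the dataset's timestamp columns.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
SECONDS_PER_DAY = 86400


def delivery_features(order):
    """Return derived delivery metrics for one order (hypothetical schema)."""
    purchased = datetime.strptime(order["purchased"], FMT)
    delivered = datetime.strptime(order["delivered"], FMT)
    estimated = datetime.strptime(order["estimated"], FMT)
    actual_days = (delivered - purchased).total_seconds() / SECONDS_PER_DAY
    estimated_days = (estimated - purchased).total_seconds() / SECONDS_PER_DAY
    return {
        "actual_delivery_days": round(actual_days, 2),
        "estimated_delivery_days": round(estimated_days, 2),
        # An order counts as late when it arrives after the estimated date
        "is_late": delivered > estimated,
    }


# Made-up sample order for illustration only
sample = {
    "purchased": "2018-01-01 10:00:00",
    "delivered": "2018-01-11 22:00:00",
    "estimated": "2018-01-15 00:00:00",
}
features = delivery_features(sample)
print(features)
```

Fractional days are kept rather than whole days so that averages such as a mean delivery time remain meaningful.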
- 9 comprehensive business queries extracting key insights
- Geographic analysis (27 states, 4,119 cities)
- Payment method distribution (74.7% credit card dominance)
- Product category performance analysis
- Delivery performance metrics (92.1% on-time delivery rate)
- Customer segmentation and seller analytics
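As an illustration of the kind of query behind the payment method distribution, here is a small self-contained sketch using an in-memory SQLite database. The table name `payments`, its columns, and the four sample rows are assumptions made for this example and may not match the project's schema in `sql/01_schema.sql`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified payments table
conn.execute(
    "CREATE TABLE payments (order_id TEXT, payment_type TEXT, payment_value REAL)"
)
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("o1", "credit_card", 120.0),
        ("o2", "credit_card", 80.0),
        ("o3", "boleto", 150.0),
        ("o4", "credit_card", 100.0),
    ],
)

# Share of payments and average value per payment method
QUERY = """
SELECT payment_type,
       ROUND(100.0 * COUNT(*) / (SELECT COUNT(*) FROM payments), 1) AS pct_of_payments,
       ROUND(AVG(payment_value), 2) AS avg_value
FROM payments
GROUP BY payment_type
ORDER BY pct_of_payments DESC
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Multiplying by `100.0` rather than `100` forces floating-point division in SQLite, which would otherwise truncate the percentage to an integer.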
- 14+ interactive charts using Plotly
- 6 static visualizations with Matplotlib/Seaborn
- Geographic heatmaps and trend analysis
- Revenue distribution by state and city
- Model: Random Forest Classifier
- Accuracy: 92.16%
- ROC-AUC Score: 0.729
- Top Feature: Days until delivery
- Use Case: Identify orders at risk of delay
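When reading the accuracy figure, it helps to compare it with a majority-class baseline. With an on-time rate of 92.1%, a model that always predicts "on time" would score close to that on a test set with the same class balance, so ROC-AUC is likely the more informative metric here. The sketch below shows both calculations in plain Python; the labels and scores are made-up toy data, not project output.

```python
def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common class."""
    positives = sum(labels)
    return max(positives, len(labels) - positives) / len(labels)


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Ties between a positive and a negative count as half a win
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 1 = late delivery, 0 = on time
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.1, 0.3, 0.6, 0.2, 0.1, 0.4, 0.7, 0.5]

baseline = majority_baseline_accuracy(labels)
auc = roc_auc(labels, scores)
print(f"baseline accuracy={baseline:.2f}, ROC-AUC={auc:.4f}")
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for a demonstration but too slow for the full dataset; a library implementation would be the practical choice there.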
- Model: Random Forest Regressor
- R² Score: 0.2054
- RMSE: 1.1480 stars
- Top Feature: Delivery efficiency
- Use Case: Predict customer satisfaction
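For context on the two regression metrics: an R² of 0.2054 means the model explains roughly 20% of the variance in review scores, and the RMSE is the typical prediction error in stars. The following sketch shows how both are defined, using invented review scores rather than project data.

```python
import math


def r2_and_rmse(y_true, y_pred):
    """Compute R^2 = 1 - SS_res / SS_tot and RMSE = sqrt(SS_res / n)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(y_true))
    return r2, rmse


# Invented review scores (1 to 5 stars) and predictions
y_true = [5, 4, 1, 5, 3]
y_pred = [4, 4, 2, 5, 4]
r2, rmse = r2_and_rmse(y_true, y_pred)
print(f"R2={r2:.4f}, RMSE={rmse:.4f}")
```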
- Method: K-Means Clustering (RFM Analysis)
- Segments: 4 (Champions, At Risk, Potential, Need Attention)
- Customers Analyzed: 96,478
- Use Case: Targeted marketing campaigns
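The RFM inputs to the clustering step can be sketched as below. This is an illustrative, dependency-free version: the `(customer_id, order_date, amount)` tuple layout, the snapshot date, and the sample orders are assumptions, and the clustering itself (K-Means on the scaled values) is not shown.

```python
from collections import defaultdict
from datetime import date


def rfm_table(orders, snapshot):
    """Aggregate (customer_id, order_date, amount) rows into per-customer RFM values."""
    last_order = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer_id, order_date, amount in orders:
        if customer_id not in last_order or order_date > last_order[customer_id]:
            last_order[customer_id] = order_date
        frequency[customer_id] += 1
        monetary[customer_id] += amount
    return {
        cid: {
            "recency_days": (snapshot - last_order[cid]).days,  # days since last order
            "frequency": frequency[cid],                        # number of orders
            "monetary": round(monetary[cid], 2),                # total spend
        }
        for cid in last_order
    }


# Made-up orders for illustration
orders = [
    ("c1", date(2018, 8, 1), 120.0),
    ("c1", date(2018, 9, 15), 80.5),
    ("c2", date(2017, 3, 10), 45.0),
]
rfm = rfm_table(orders, snapshot=date(2018, 10, 1))
print(rfm)
```

Because the three values sit on very different scales, they would normally be standardized before being passed to a distance-based method such as K-Means.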
Interactive dashboard featuring:
- Executive KPI summary
- Geographic revenue analysis
- State and city-level breakdowns
- Payment method distribution
- Product category performance
- Order status tracking
- Customer segment analysis
- Seller performance metrics
olist-ecommerce analysis/
├── data/
│   ├── raw-data/  # Original CSV files
│   └── cleaned/  # Processed CSV files
├── python/
│   ├── 00_explore_data.py  # Data exploration
│   ├── 01_data_cleaning.py  # Data cleaning pipeline
│   ├── 02_load_to_sql.py  # Database loading
│   ├── 03_run_sql_analytics.py  # Query execution
│   ├── 04_visualizations.py  # Chart generation
│   └── 05_ml_models_fixed.py  # ML models
├── sql/
│   ├── 01_schema.sql  # Database schema
│   ├── 02_core_metrics.sql  # Business metrics
│   ├── 03_product_analysis.sql
│   ├── 04_payment_delivery_analysis.sql
│   ├── 05_customer_seller_analysis.sql
│   └── 06_export_for_powerbi.sql
├── power bi/
│   └── olist_dashboard.pbix  # Interactive dashboard
├── visualizations/
│   └── interactive/  # HTML charts
├── ml_models/
│   └── MODEL_SUMMARY_REPORT.txt
├── docs/
│   ├── README.md
│   ├── PROJECT_SUMMARY.md
│   ├── POWERBI_GUIDE.md
│   └── POWERBI_CHECKLIST.md
└── requirements.txt
- Python 3.8+
- Git
- Power BI Desktop (for dashboard)
- Clone the repository
git clone https://github.com/noturbob/olist-ecommerce-analysis.git
cd olist-ecommerce-analysis
- Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies
pip install -r requirements.txt
- Run the pipeline
# Data exploration
python python/00_explore_data.py
# Data cleaning
python python/01_data_cleaning.py
# Load to SQLite
python python/02_load_to_sql.py
# Run SQL analytics
python python/03_run_sql_analytics.py
# Generate visualizations
python python/04_visualizations.py
# Train ML models
python python/05_ml_models_fixed.py
- Total Revenue: R$13,494,400.74
- Top State: São Paulo (37.5% of revenue, R$5.07M)
- Top City: São Paulo (R$1.86M, 15,045 orders)
- Top Category: Health & Beauty (R$1.23M, 8,647 orders)
| Segment | Customer Count | Avg Lifetime Value |
|---|---|---|
| Champions | 35,000+ | High |
| At Risk | 60,000+ | Medium |
- Delivered: 97.02% (96,478 orders)
- Shipped: 1.11% (1,104 orders)
- Canceled: 0.78% (776 orders)
- Other: 0.09% (83 orders)
| Method | % of Orders | Avg Order Value |
|---|---|---|
| Credit Card | 74.7% | R$162.24 |
| Boleto | 19.3% | R$144.33 |
| Voucher | 3.7% | R$62.49 |
| Debit Card | 1.5% | R$140.26 |
- Average Delivery Time: 12.56 days
- On-Time Rate: 92.1%
- Late Deliveries: 7.9%
- Max Delay Observed: 209.63 days
Accuracy: 92.16%
ROC-AUC: 0.729
Precision: 85%+
Recall: 78%+
Top Features: [days_until_delivery, freight_value, product_weight_g]
R² Score: 0.2054
RMSE: 1.1480 stars
MAE: 0.89 stars
Top Features: [delivery_efficiency, actual_delivery_days]
Method: K-Means (k=4)
Silhouette Score: 0.45
Segments: Champions, At Risk, Potential, Need Attention
Best for: Targeted marketing campaigns
- Business Summary - Overall metrics and KPIs
- Revenue by State - Geographic performance analysis
- Revenue by City - City-level revenue breakdown
- Payment Methods - Payment preference analysis
- Top Categories - Product category performance
- Delivery Analysis - Delivery time and performance metrics
- Customer Segmentation - RFM-based customer grouping
- Order Status - Order fulfillment tracking
- Top Sellers - Seller performance ranking
- PROJECT_SUMMARY.md - Detailed findings and analysis
- POWERBI_GUIDE.md - Dashboard usage guide
- POWERBI_CHECKLIST.md - Dashboard setup checklist
The Power BI dashboard includes:
- KPI Cards: Total Orders, Products Sold, Active Sellers, Revenue, Freight Revenue
- Revenue Analytics: By state, by city, trends over time
- Payment Analysis: Distribution and average values by payment method
- Product Analysis: Revenue and pricing by category with seller details
- Order Tracking: Status distribution and percentages
- Customer Insights: Segmentation and lifetime value metrics
See POWERBI_GUIDE.md for a detailed dashboard walkthrough.
Get a high-level view of key business metrics and performance indicators at a glance.
Comprehensive analysis of order patterns, payment methods, and transaction trends.
Revenue distribution across states and cities with interactive mapping capabilities.
Detailed product category analysis and seller performance metrics.
Advanced segmentation analysis showing customer value distribution and lifetime metrics.
Detailed analytical charts generated from the data pipeline.
Distribution of revenue across different states showing geographic performance.
Breakdown of order statuses showing fulfillment rates and order outcomes.
Price distribution across products with statistical insights.
Analysis of delivery times showing performance metrics and delivery patterns.
Performance ranking of product categories by revenue and volume.
Correlation analysis between product pricing and customer review scores.
| Category | Tools |
|---|---|
| Data Processing | Python, Pandas, NumPy |
| Database | SQLite, SQL |
| Visualization | Power BI, Plotly, Matplotlib, Seaborn |
| ML Frameworks | Scikit-learn |
| Data Analysis | Jupyter Notebook |
- Extraction - Load raw CSV files
- Validation - Check data types and constraints
- Cleaning - Handle missing values, outliers
- Transformation - Create derived features
- Loading - Store in SQLite database
- Analysis - SQL queries for insights
- Modeling - Train ML models
- Visualization - Create dashboards and charts
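The first stages of this pipeline (extraction, cleaning, loading) can be sketched end to end with the standard library. This is a simplified illustration under stated assumptions: the inline CSV, the `order_items` table, and the choice to drop rows with missing or malformed numbers are all hypothetical, and the project's scripts may handle these cases differently.

```python
import csv
import io
import sqlite3

# Hypothetical raw data with one missing and one malformed value
RAW_CSV = """order_id,price,freight_value
o1,100.0,10.5
o2,,8.0
o3,59.9,abc
o4,20.0,5.0
"""


def extract(text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Keep rows whose numeric fields parse, converting them to floats."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append(
                (row["order_id"], float(row["price"]), float(row["freight_value"]))
            )
        except ValueError:
            continue  # drop rows with missing or malformed numbers
    return cleaned


def load(rows, conn):
    """Create the target table and insert the cleaned rows."""
    conn.execute(
        "CREATE TABLE order_items (order_id TEXT, price REAL, freight_value REAL)"
    )
    conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(clean(extract(RAW_CSV)), conn)
count, revenue = conn.execute(
    "SELECT COUNT(*), SUM(price) FROM order_items"
).fetchone()
print(count, revenue)
```

Keeping each stage as a separate function mirrors the numbered scripts in the repository and makes each step testable on its own.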
- Train-Test Split: 80-20
- Cross-Validation: 5-fold
- Feature Scaling: StandardScaler for regression
- Model Selection: Random Forest (best performance)
- Hyperparameter Tuning: Grid search applied
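To make the 80-20 split and 5-fold cross-validation concrete, the sketch below shows only the index bookkeeping in plain Python. Scikit-learn, which the project lists in its stack, provides equivalent utilities; the function names, the seed, and the sample size of 100 are assumptions for this illustration.

```python
import random


def train_test_split_indices(n, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out a fraction of them as the test set."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n * test_fraction))
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=5):
    """Yield (train, validation) index lists for each of k folds."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


train_idx, test_idx = train_test_split_indices(100)
folds = list(k_fold_indices(train_idx, k=5))
print(len(train_idx), len(test_idx), [len(v) for _, v in folds])
```

Cross-validation runs only on the training indices, so the held-out 20% stays untouched until the final evaluation. Any hyperparameter search should likewise be scored on the validation folds, not on the test set.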
| Deliverable | Status | Quality |
|---|---|---|
| Data Cleaning | ✅ | 99.4% |
| SQL Analytics | ✅ | 9 queries |
| Visualizations | ✅ | 14+ charts |
| ML Models | ✅ | 92% accuracy |
| Power BI Dashboard | ✅ | Production-ready |
| Documentation | ✅ | Complete |
This project is complete. For questions or improvements:
- Review the documentation in /docs
- Check the detailed analysis in PROJECT_SUMMARY.md
- Examine model details in ml_models/MODEL_SUMMARY_REPORT.txt
This project is open source and available for educational and professional use.
Data Analyst: Bobby Anthene
Email: bobbyanthenrao@gmail.com
GitHub: @noturbob
Project Start Date: December 2024
Completion Date: December 14, 2025
Total Duration: ~4 weeks
All 5 Phases Delivered:
- ✅ Data Exploration & Cleaning
- ✅ SQL Analytics
- ✅ Advanced Visualizations
- ✅ Machine Learning Models
- ✅ Power BI Dashboard
Last Updated: December 14, 2025










