# ML Canvas - Absenteeism Prediction System

**Author:** Alexis Alduncin (Data Scientist)
**Team:** MLOps 62
**Date:** October 2025

This notebook presents the Machine Learning Canvas for our absenteeism prediction project, documenting the business understanding and value proposition.

## 1. VALUE PROPOSITION

### Problem Statement
- Companies face productivity losses due to unexpected employee absenteeism
- HR departments struggle to predict and plan for workforce availability
- Current manual tracking methods are reactive, not proactive
- Lack of insights into absenteeism patterns and drivers

### Proposed ML Solution
- **Predict hours of absenteeism** based on employee characteristics and historical patterns
- **Enable proactive workforce planning** and resource allocation
- **Identify high-risk factors** for absenteeism to inform HR policies
- **Provide actionable insights** for employee wellbeing programs

### Business Impact
- Reduce unplanned workforce shortages by 15-20%
- Improve resource allocation efficiency
- Enable targeted interventions for high-risk employees
- Support data-driven HR decision making

## 2. DATA SOURCES

### Input Data
- **Source:** Employee records from Brazilian courier company
- **Period:** 2007-2010
- **Size:** 740 records
- **Storage:** AWS S3 (`s3://mlopsequipo62/mlops/`)
- **Versioning:** DVC for data tracking

### Features (21 total)
- **Personal:** Age, BMI, number of children, education
- **Work-related:** Service time, workload, distance from work, transportation cost
- **Behavioral:** Social drinker, social smoker, disciplinary issues
- **Temporal:** Month, day of week, season
- **Health:** Reason for absence (ICD codes)

### Target Variable
- **Absenteeism time in hours:** Continuous variable (0-120 hours)

## 3. PREDICTION TASK

### Primary Task: Regression
- **Type:** Supervised learning - Regression
- **Objective:** Predict exact hours of absenteeism
- **Rationale:** Enables precise workforce planning and cost estimation

### Alternative Task: Classification
- **Type:** Multi-class classification
- **Classes:** Short (<4h), Half-day (4h to <8h), Full-day (8-24h), Extended (>24h)
- **Rationale:** Simpler for operational planning, potentially higher accuracy
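
As a minimal sketch of how the four classes could be derived from the regression target, the function below maps hours to a category. The function name `absence_category` is hypothetical, and the boundary handling is an assumption: exactly 8 hours and exactly 24 hours are both treated as Full-day.

```python
def absence_category(hours: float) -> str:
    """Map absenteeism hours to one of the four planning categories.

    Assumed boundaries: Short < 4, Half-day 4 to < 8,
    Full-day 8 to 24 inclusive, Extended > 24.
    """
    if hours < 0:
        raise ValueError("hours must be non-negative")
    if hours < 4:
        return "Short"
    if hours < 8:
        return "Half-day"
    if hours <= 24:
        return "Full-day"
    return "Extended"


print([absence_category(h) for h in (0, 3, 4, 8, 24, 40)])
# ['Short', 'Short', 'Half-day', 'Full-day', 'Full-day', 'Extended']
```

Keeping the mapping in one function means the same boundaries can be reused when comparing the classification approach against binned regression outputs.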

## 4. BUSINESS METRICS

### Operational Metrics
- **Workforce shortage incidents:** Count of unplanned shortages
- **Cost savings:** Reduced overtime and temporary staffing costs
- **Planning accuracy:** % of days with adequate staffing

### HR Metrics
- **Employee satisfaction:** Survey scores
- **Retention rates:** % of employees staying
- **Wellbeing program participation:** Enrollment in support programs

### Financial Metrics
- **ROI:** Cost savings vs. implementation cost
- **Productivity gains:** Output per employee
- **Absenteeism cost reduction:** Direct cost savings
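
Two of the metrics above reduce to simple arithmetic. The sketch below uses one common ROI convention (net gain divided by implementation cost) and expresses planning accuracy as a percentage of days. Both function names and all input figures are hypothetical illustrations, not project results.

```python
def roi(cost_savings: float, implementation_cost: float) -> float:
    """Return ROI as a fraction: net gain relative to implementation cost."""
    if implementation_cost <= 0:
        raise ValueError("implementation_cost must be positive")
    return (cost_savings - implementation_cost) / implementation_cost


def planning_accuracy(adequately_staffed_days: int, total_days: int) -> float:
    """Return the percentage of days with adequate staffing."""
    if total_days <= 0:
        raise ValueError("total_days must be positive")
    return 100.0 * adequately_staffed_days / total_days


# Illustrative inputs only (not measured values).
print(roi(cost_savings=150_000, implementation_cost=100_000))        # 0.5
print(planning_accuracy(adequately_staffed_days=18, total_days=20))  # 90.0
```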

## 5. ML METRICS

### Regression Metrics
- **MAE (Mean Absolute Error):** Average absolute difference, in hours, between predicted and actual absenteeism
  - Target: <4 hours MAE
- **RMSE (Root Mean Square Error):** Penalizes large errors
  - Target: <8 hours RMSE
- **R² Score:** Proportion of variance in absenteeism hours explained by the model
  - Target: >0.3 (challenging given data variability)
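
The three regression metrics and their targets can be checked with a few lines of plain Python. This is a sketch under stated assumptions: the helper names `regression_metrics` and `meets_targets` are hypothetical, inputs are plain lists of hours, and the sample values are made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equal-length lists of hours."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(ss_res / n)
    # R² is undefined when the target has zero variance.
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mae": mae, "rmse": rmse, "r2": r2}


def meets_targets(metrics, max_mae=4.0, max_rmse=8.0, min_r2=0.3):
    """Check metrics against the targets listed in this canvas."""
    return (
        metrics["mae"] < max_mae
        and metrics["rmse"] < max_rmse
        and metrics["r2"] > min_r2
    )


# Illustrative values only (not model output).
metrics = regression_metrics([2, 8, 4, 16], [3, 6, 4, 12])
print(metrics, meets_targets(metrics))
```

Because RMSE squares each error before averaging, a single badly missed extended absence raises RMSE far more than MAE, which is why the two targets differ.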

### Classification Metrics (if applicable)
- **Precision:** Share of predictions for a category that actually belong to that category
- **Recall:** Share of actual absences in a category that the model correctly identifies
- **F1-Score:** Balance of precision and recall
- **Confusion Matrix:** Category-specific performance
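
For the multi-class case, precision, recall and F1 are computed per category from the confusion counts. The following sketch uses only the standard library; the function name `per_class_report` and the sample labels are hypothetical, and a category with no predictions or no true instances is assigned 0.0 by assumption.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return precision, recall and F1 for every label seen in the data."""
    labels = sorted(set(y_true) | set(y_pred))
    # pairs[(actual, predicted)] holds one confusion-matrix cell.
    pairs = Counter(zip(y_true, y_pred))
    report = {}
    for label in labels:
        tp = pairs[(label, label)]
        fp = sum(pairs[(other, label)] for other in labels if other != label)
        fn = sum(pairs[(label, other)] for other in labels if other != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Illustrative labels only (not model output).
actual = ["Short", "Short", "Half-day", "Full-day", "Extended", "Short"]
predicted = ["Short", "Half-day", "Half-day", "Full-day", "Full-day", "Short"]
for label, scores in per_class_report(actual, predicted).items():
    print(label, scores)
```

Reporting per category matters here because extended absences are rare: a model can score well overall while never predicting the Extended class.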

## 6. STAKEHOLDERS

### Primary Users
- **HR Department:** Workforce planning, policy decisions
- **Operations Management:** Daily staffing and scheduling

### Secondary Users
- **Employees:** Receive support through targeted wellbeing programs
- **Finance:** Cost impact analysis and budget planning
- **Executive Leadership:** Strategic workforce planning

### Technical Team
- **Data Scientists:** Model development and insights
- **ML Engineers:** Model deployment and monitoring
- **Data Engineers:** Data pipeline maintenance

## 7. DEPLOYMENT

### Batch Predictions
- **Weekly forecasts:** Predict absenteeism for upcoming week
- **Monthly reports:** Trend analysis and pattern identification
- **Delivery:** Dashboards emailed to HR and operations

### Real-time API
- **On-demand predictions:** Immediate risk assessment for specific employees
- **Integration:** HR management system (HRIS)
- **Response time:** <500ms for single prediction

### Monitoring
- **Model performance:** Weekly accuracy tracking
- **Data drift:** Monthly feature distribution checks
- **Retraining:** Quarterly model updates with new data
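
The canvas does not name a specific drift statistic. One option for the monthly feature distribution check is the Population Stability Index (PSI), sketched below for a single numeric feature. The function name, the bin count, and the `eps` floor for empty bins are assumptions made for illustration.

```python
import math


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """Compare two samples of one numeric feature using PSI.

    Bins are equal-width over the reference range; current values outside
    that range are clamped into the first or last bin.
    """
    if len(reference) == 0 or len(current) == 0:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference feature has zero range")
    width = (hi - lo) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    ref = bin_shares(reference)
    cur = bin_shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Illustrative data only: a feature shifted upward by 50 units.
training_values = list(range(100))
recent_values = [v + 50 for v in training_values]
print(population_stability_index(training_values, training_values))  # 0.0
print(population_stability_index(training_values, recent_values) > 0.25)  # True
```

A commonly cited rule of thumb treats PSI above roughly 0.25 as a significant shift. Any alerting threshold used in this project would need to be agreed on separately.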

## 8. RISKS & ASSUMPTIONS

### Risks

#### Privacy & Ethics
- **Data privacy:** Sensitive employee information handling
- **Mitigation:** Anonymization, GDPR compliance, access controls

#### Bias & Fairness
- **Potential bias:** Against certain demographic groups or health conditions
- **Mitigation:** Fairness audits, protected attribute monitoring, bias testing
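
One simple form a fairness audit could take is comparing prediction error across groups of a protected attribute. The sketch below computes MAE per group and the largest gap between groups. The function names, group labels and values are hypothetical, and a full audit would cover more than this single check.

```python
from collections import defaultdict


def mae_by_group(y_true, y_pred, groups):
    """Return the mean absolute error for each group label."""
    abs_errors = defaultdict(list)
    for actual, predicted, group in zip(y_true, y_pred, groups):
        abs_errors[group].append(abs(actual - predicted))
    return {g: sum(errs) / len(errs) for g, errs in abs_errors.items()}


def max_mae_gap(group_mae):
    """Return the difference between the worst and best group MAE."""
    return max(group_mae.values()) - min(group_mae.values())


# Illustrative values only (not model output); "A" and "B" are placeholder groups.
per_group = mae_by_group(
    y_true=[2, 8, 4, 16],
    y_pred=[3, 6, 4, 12],
    groups=["A", "A", "B", "B"],
)
print(per_group)               # {'A': 1.5, 'B': 2.0}
print(max_mae_gap(per_group))  # 0.5
```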

#### Generalization
- **Limited scope:** Model trained on Brazilian courier company
- **Mitigation:** Domain adaptation, retraining on local data

#### Model Performance
- **Unpredictability:** Absenteeism driven by external factors (pandemics, weather)
- **Mitigation:** External data integration, regular model updates

### Assumptions

#### Data Quality
- Historical absenteeism data is accurate and complete
- Reason codes (ICD) are correctly assigned
- Employee attributes are up-to-date

#### Business Context
- Historical absenteeism patterns will continue to hold
- Company policies remain relatively stable (no major policy changes)
- Work environment factors stay constant

#### Technical
- AWS infrastructure remains available and cost-effective
- DVC versioning is sufficient to keep datasets traceable across pipeline runs
- MLflow tracking captures enough detail (parameters, metrics, artifacts) to reproduce experiments

## 9. TEAM ROLES & RESPONSIBILITIES

### Data Engineer
- Data pipeline development (DVC, S3)
- Data quality monitoring
- Infrastructure management

### Data Scientist (Alexis Alduncin)
- ML Canvas and business understanding
- Feature engineering and transformation
- Model development and evaluation
- Visualization and insights generation

### ML Engineer
- Model deployment and serving
- MLflow integration and tracking
- CI/CD pipeline development
- Performance monitoring

### Software Engineer
- API development
- Frontend integration
- Code quality and testing

### Site Reliability Engineer
- System reliability and uptime
- Scaling and performance optimization
- Incident response

## 10. NEXT STEPS

### Phase 1 (Current)
- ✅ ML Canvas definition
- ✅ Exploratory Data Analysis
- ✅ Feature engineering pipeline
- ✅ Baseline model development
- ✅ MLflow experiment tracking

### Phase 2 (Upcoming)
- [ ] Advanced feature engineering (temporal features, aggregations)
- [ ] Hyperparameter tuning (Optuna/GridSearch)
- [ ] Model ensembling
- [ ] Classification approach comparison
- [ ] External data integration (weather, holidays)

### Phase 3 (Future)
- [ ] Model deployment (FastAPI)
- [ ] Real-time prediction API
- [ ] Dashboard development
- [ ] Monitoring and alerting
- [ ] A/B testing framework

---

## Summary

This ML Canvas establishes the foundation for our absenteeism prediction system. Key takeaways:

1. **Clear Value:** Proactive workforce planning, targeting a 15-20% reduction in unplanned workforce shortages
2. **Well-defined Problem:** Regression task with concrete metrics (MAE, RMSE, R²)
3. **Comprehensive Data:** 740 records with 21 features from real-world courier company
4. **Multiple Stakeholders:** HR, Operations, Finance, Employees all benefit
5. **Deployment Plan:** Batch and real-time prediction paths defined for Phase 3
6. **Risk Awareness:** Privacy, bias, and generalization concerns identified
7. **Team Coordination:** Clear roles for Data Engineer, Data Scientist, ML Engineer, Software Engineer, and SRE

Proceed to `02-aa-eda-transformations.ipynb` for data exploration and feature engineering.