# WellOps  
## Data Sources and Data Schema


This project follows a mixed data strategy to balance realism, privacy, and scalability.

Due to the sensitive nature of employee well-being data, real organizational datasets are not publicly available. Therefore, this project combines publicly available datasets with synthetically generated data to simulate realistic workplace scenarios while maintaining ethical and privacy constraints.


### Public Data Sources

The following publicly available datasets are used to ground the project in real-world patterns:

1. **Employee Burnout Dataset (Kaggle)**  
   - Contains features related to workload, mental fatigue, and resource allocation.
   - Used to understand burnout-related feature distributions.

2. **Workplace Productivity and Stress Surveys**  
   - Aggregated survey-based datasets that capture stress indicators and job satisfaction.

These datasets are used for exploratory analysis and feature inspiration rather than direct employee-level prediction.


### Motivation for Synthetic Data

Because real employee workload and behavioral data cannot be shared publicly, synthetic data is generated to simulate realistic employee workloads, task patterns, and temporal behavior.

Synthetic data enables:
- Controlled experimentation
- Time-series modeling
- Privacy-preserving analysis
- Scalability testing


### Synthetic Data Design Principles

The synthetic dataset is designed to reflect realistic workplace dynamics while avoiding patterns that could identify real individuals.

Key principles include:
- No direct mapping to real individuals
- Time-based variation in workload
- Role-based differences in task distribution
- Controlled noise to simulate human variability
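
As a rough illustration of how these principles could translate into a generator, the sketch below uses only the Python standard library. The role profiles, the 40-hour standard week, the 13-week workload cycle, and the noise levels are all assumptions made for this example, and `generate_employee_weeks` is a hypothetical helper, not the project's actual generator.

```python
import math
import random

# Hypothetical role profiles: (mean weekly hours, mean tasks per week).
# Illustrative assumptions only, not values taken from this project.
ROLE_PROFILES = {
    "engineer": (42.0, 8.0),
    "manager": (46.0, 12.0),
    "analyst": (40.0, 6.0),
}
STANDARD_HOURS = 40.0   # assumed length of a standard work week
CYCLE_WEEKS = 13        # assumed quarterly workload cycle


def generate_employee_weeks(employee_id, role, n_weeks, rng):
    """Return one dict per week for a single synthetic employee."""
    base_hours, base_tasks = ROLE_PROFILES[role]  # role-based differences
    rows = []
    for week_id in range(n_weeks):
        # Time-based variation: workload rises and falls over the cycle.
        cycle = math.sin(2 * math.pi * week_id / CYCLE_WEEKS)
        # Controlled noise: Gaussian jitter with a fixed, small spread.
        weekly_hours = round(
            max(0.0, base_hours + 3.0 * cycle + rng.gauss(0.0, 2.0)), 1
        )
        tasks = max(0, round(base_tasks + 2.0 * cycle + rng.gauss(0.0, 1.5)))
        rows.append({
            "employee_id": employee_id,  # generated ID, not tied to a real person
            "week_id": week_id,
            "role": role,
            "weekly_hours": weekly_hours,
            "tasks_assigned": tasks,
            "overtime_hours": round(max(0.0, weekly_hours - STANDARD_HOURS), 1),
        })
    return rows


rng = random.Random(42)  # fixed seed keeps the experiment reproducible
rows = generate_employee_weeks("E0001", "engineer", n_weeks=12, rng=rng)
print(rows[0])
```

Passing the random generator in explicitly keeps runs reproducible, which supports the controlled experimentation that motivates the synthetic data.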


## Feature Categories

Features are grouped into the following categories:

### Static Features
- Role
- Department
- Experience level
- Employment type

### Dynamic (Temporal) Features
- Weekly working hours
- Number of tasks assigned
- Task switching frequency
- Overtime streaks
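
One possible way to derive an overtime streak from a sequence of weekly overtime hours is sketched below. The definition used here (the number of consecutive weeks, ending at the current week, with overtime above a threshold) is an assumption, and `overtime_streaks` is a hypothetical helper name.

```python
def overtime_streaks(overtime_hours, threshold=0.0):
    """For each week, count the consecutive weeks up to and including it
    in which overtime exceeded `threshold`. The count resets to 0 on any
    week at or below the threshold."""
    streaks = []
    current = 0
    for hours in overtime_hours:
        current = current + 1 if hours > threshold else 0
        streaks.append(current)
    return streaks


print(overtime_streaks([0, 2, 5, 0, 1, 3, 4]))  # [0, 1, 2, 0, 1, 2, 3]
```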

### Behavioral Indicators
- Leave frequency
- Deadline pressure
- Self-reported stress indicators (synthetic)

### Target Variable
- Burnout Risk Score (continuous)


## Data Schema

Each row in the dataset represents one employee over a one-week time window.

| Feature Name | Description |
|-------------|-------------|
| employee_id | Unique employee identifier |
| week_id | Index of the one-week time window |
| role | Employee role |
| team_id | Team identifier |
| weekly_hours | Total hours worked in the week |
| tasks_assigned | Number of tasks assigned |
| overtime_hours | Hours beyond standard work time |
| task_switches | Number of task context switches |
| stress_indicator | Synthetic stress proxy |
| burnout_score | Target burnout risk score |
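
The schema above could be mirrored in code as a typed record with a few basic consistency checks. This is a minimal sketch: the field types are assumptions (the table does not specify them), and `WeeklyRecord` and its `problems` method are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeeklyRecord:
    """One employee over one week. Field types are assumed, not specified."""
    employee_id: str
    week_id: int
    role: str
    team_id: str
    weekly_hours: float
    tasks_assigned: int
    overtime_hours: float
    task_switches: int
    stress_indicator: float
    burnout_score: float

    def problems(self):
        """Return a list of consistency problems (empty if none are found)."""
        found = []
        for name in ("week_id", "weekly_hours", "tasks_assigned",
                     "overtime_hours", "task_switches"):
            if getattr(self, name) < 0:
                found.append(f"{name} must be non-negative")
        # Overtime is part of the weekly total, so it cannot exceed it.
        if self.overtime_hours > self.weekly_hours:
            found.append("overtime_hours exceeds weekly_hours")
        return found


good = WeeklyRecord("E0001", 0, "engineer", "T01", 45.0, 8, 5.0, 12, 0.4, 0.3)
bad = WeeklyRecord("E0002", 1, "analyst", "T02", 38.0, 6, 41.0, -1, 0.2, 0.1)
print(good.problems())  # []
print(bad.problems())   # ['task_switches must be non-negative', 'overtime_hours exceeds weekly_hours']
```

No range is enforced for `stress_indicator` or `burnout_score` here, since their scales are not defined in the schema.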


## Temporal Structure

Burnout is modeled as a temporal phenomenon rather than a static condition.

Each employee has a sequence of weekly observations, enabling:
- Time-series analysis
- Deep learning models (LSTM/GRU)
- Trend-based early warning signals
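
To feed weekly observations into a sequence model, each employee's rows can be grouped, ordered by `week_id`, and cut into fixed-length windows. The sketch below assumes a window of four weeks and pairs each window with the following week's `burnout_score`; both choices, the feature list, and the `build_sequences` helper are illustrative assumptions, not the project's documented setup.

```python
from collections import defaultdict

# Assumed per-week input features, taken from the schema's numeric columns.
FEATURES = ["weekly_hours", "tasks_assigned", "overtime_hours",
            "task_switches", "stress_indicator"]


def build_sequences(rows, window=4):
    """Return (X, y): X holds `window` consecutive weeks of features per
    sample, y holds the burnout_score of the week right after each window."""
    by_employee = defaultdict(list)
    for row in rows:
        by_employee[row["employee_id"]].append(row)

    X, y = [], []
    for emp_rows in by_employee.values():
        ordered = sorted(emp_rows, key=lambda r: r["week_id"])
        # Windows never cross employee boundaries.
        for start in range(len(ordered) - window):
            past = ordered[start:start + window]
            target = ordered[start + window]
            X.append([[r[f] for f in FEATURES] for r in past])
            y.append(target["burnout_score"])
    return X, y


# Tiny made-up dataset: two employees, six weeks each.
demo_rows = [
    {"employee_id": emp, "week_id": w, "weekly_hours": 40.0 + w,
     "tasks_assigned": 5, "overtime_hours": float(w), "task_switches": 3,
     "stress_indicator": 0.1 * w, "burnout_score": 0.05 * w}
    for emp in ("E1", "E2") for w in range(6)
]
X, y = build_sequences(demo_rows, window=4)
print(len(X), len(X[0]), len(X[0][0]))  # 4 4 5
```

The resulting nested lists have the shape (samples, time steps, features), the layout that LSTM and GRU layers in common deep learning libraries typically expect once converted to arrays.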


## Data Limitations

- Synthetic data may not capture all real-world nuances.
- Survey-based public datasets may introduce self-reporting bias.
- The burnout risk score is a proxy, not a clinical measure.

Model outputs should therefore be interpreted with these limitations in mind, as indicative risk signals rather than clinical assessments.
