# 01_generate_synthetic_data.ipynb

This notebook generates synthetic datasets for occupational health analytics in a Latin American (LATAM) context. The goal is realistic-looking data that contains no real personal information, so it can be used for analysis, reporting, and modeling without exposing anything sensitive.

**Summary of steps:**

- We generated four synthetic tables:
  1. **Employees:** 1,000 unique employees with attributes such as age, gender, department, seniority, and location (20 LATAM cities).
  2. **Health Events:** 2,000 records of absenteeism or accidents, including cause, duration, date, and estimated cost.
  3. **Medical Evaluations:** 1,200 medical checkups with BMI, cholesterol, smoking status, and fitness for work.
  4. **Monthly KPIs:** One row per department and month (2024–2025), with absenteeism rate, number of accidents, and active employees. These values are drawn at random and are not aggregated from the other tables.
- All data was generated with pandas and NumPy using random sampling.
- We included code to preview the data, check dimensions and columns, and export all tables as CSV files to a shared directory for further use.

This notebook provides a reproducible foundation for testing occupational health analytics pipelines and dashboards.

## Download Instructions
After running the export cell, you will find the generated CSV files in the export directory specified in the notebook. The default path is specific to the machine where the notebook was first run.

Each user or organization should adapt the export path to their own environment and operating system. For example:

- On Windows: `E:\Desktop\Data Science 2025\3-occupational health`
- On Ubuntu (with Windows shared drive): `/mnt/e/Desktop/Data Science 2025/3-occupational health/`

Adjust the path as needed to ensure the files are saved in an accessible location for your workflow.
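
One way to avoid editing a hard-coded path on every machine is to resolve the export directory at run time. The sketch below is not part of the original notebook: the environment variable `OCC_HEALTH_EXPORT_DIR`, the helper `resolve_export_dir`, and the `synthetic_data` fallback folder are all hypothetical names chosen for illustration.

```python
import os
from pathlib import Path


def resolve_export_dir(default="synthetic_data"):
    """Return the export directory, creating it if it does not exist.

    OCC_HEALTH_EXPORT_DIR is a hypothetical environment variable; when it
    is unset, a folder relative to the current working directory is used.
    """
    export_dir = Path(os.environ.get("OCC_HEALTH_EXPORT_DIR", default)).expanduser()
    export_dir.mkdir(parents=True, exist_ok=True)
    return export_dir


save_dir = resolve_export_dir()
# pathlib joins components with the correct separator on each OS
employees_csv = save_dir / "employees.csv"
print(employees_csv)
```

Building paths with `pathlib` instead of string concatenation also removes the need for a trailing slash in the directory name.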

## Requirements
This notebook requires Python 3.x, pandas, and NumPy. Make sure these packages are installed in your environment. The `display` calls assume the code runs in Jupyter or IPython.

## Reproducibility
A random seed (`np.random.seed(42)`) is set to ensure that the synthetic data generated is reproducible. You can change the seed for different random samples.
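
As an alternative to the global seed, NumPy also offers a local generator through `np.random.default_rng`. The sketch below is only an illustration of seeded reproducibility, not the notebook's method: the helper `sample_ages` is hypothetical, and switching the notebook to this API would change the exact values it generates.

```python
import numpy as np


def sample_ages(seed, n=5):
    """Draw n ages in [18, 65) from a generator seeded with `seed`."""
    rng = np.random.default_rng(seed)
    return rng.integers(18, 65, size=n)


first = sample_ages(42)
second = sample_ages(42)
other = sample_ages(7)

# Two generators built from the same seed produce identical draws
print(np.array_equal(first, second))
print(first, other)
```

A local generator keeps the random state out of global scope, so rerunning one cell does not shift the values produced by another.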

## Data Export
All generated tables are saved as CSV files in the specified directory. You can use these files for further analysis, reporting, or as input for other tools.
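
When loading the exported files in another tool or notebook, keep in mind that CSV stores dates as text. The following sketch round-trips a small stand-in for the health events table through a temporary directory; the two rows are illustrative and the temporary path is an assumption made so the example is self-contained.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for the health_events table (illustrative rows)
events = pd.DataFrame({
    "event_id": [1, 2],
    "employee_id": [247, 72],
    "date": pd.to_datetime(["2025-02-19", "2025-03-01"]),
    "estimated_cost": [3766.35, 655.96],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "health_events.csv"
    events.to_csv(csv_path, index=False)
    # Ask pandas to parse the date column, otherwise it loads as strings
    reloaded = pd.read_csv(csv_path, parse_dates=["date"])

print(reloaded.dtypes)
```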

## Adaptation
You can adapt this notebook to other domains by modifying the attributes, categories, or data generation logic to fit your specific needs.
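
One possible way to make that adaptation easier is to describe each table as a small column specification and generate it with a single helper. This is a sketch under assumptions, not the notebook's approach: `make_table`, its parameters, and the retail columns in the usage example are all hypothetical.

```python
import numpy as np
import pandas as pd


def make_table(n_rows, categorical, integer_ranges, seed=42):
    """Build a synthetic table from a simple column specification.

    categorical: dict of column name -> list of allowed values
    integer_ranges: dict of column name -> (low, high), high exclusive
    """
    rng = np.random.default_rng(seed)
    data = {"id": np.arange(1, n_rows + 1)}
    for name, values in categorical.items():
        data[name] = rng.choice(values, size=n_rows)
    for name, (low, high) in integer_ranges.items():
        data[name] = rng.integers(low, high, size=n_rows)
    return pd.DataFrame(data)


# Hypothetical retail example: column names and categories are made up
stores = make_table(
    n_rows=100,
    categorical={"region": ["North", "South"], "format": ["Kiosk", "Outlet", "Flagship"]},
    integer_ranges={"staff": (3, 40)},
)
print(stores.head())
```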

In [3]:
import pandas as pd
import numpy as np

# 1. Employees
np.random.seed(42)
employees = pd.DataFrame({
    "employee_id": range(1, 1001),
    "age": np.random.randint(18, 65, 1000),
    "gender": np.random.choice(["M", "F"], 1000),
    "department": np.random.choice(["Administration", "Operations", "Sales", "IT", "HR"], 1000),
    "seniority_years": np.random.randint(0, 41, 1000),
    "location": np.random.choice([
        "Buenos Aires", "Santiago", "Lima", "Bogotá", "Mexico City", "Monterrey", "Guadalajara",
        "Medellín", "Quito", "Caracas", "São Paulo", "Rio de Janeiro", "Montevideo", "Asunción",
        "La Paz", "San José", "Panama City", "San Salvador", "Tegucigalpa", "Managua"
    ], 1000)
})
display(employees.head())

# 2. Health Events
health_events = pd.DataFrame({
    "event_id": range(1, 2001),
    "employee_id": np.random.choice(employees["employee_id"], 2000),
    "cause": np.random.choice(["Illness", "Accident", "Medical leave", "Other"], 2000),
    "duration_days": np.random.randint(1, 31, 2000),
    "date": pd.to_datetime(np.random.choice(pd.date_range("2024-01-01", "2025-12-31"), 2000)),
    "estimated_cost": np.round(np.random.uniform(100, 5000, 2000), 2)
})
display(health_events.head())

# 3. Medical Evaluations
medical_evaluations = pd.DataFrame({
    "evaluation_id": range(1, 1201),
    "employee_id": np.random.choice(employees["employee_id"], 1200),
    "date": pd.to_datetime(np.random.choice(pd.date_range("2024-01-01", "2025-12-31"), 1200)),
    "BMI": np.round(np.random.uniform(18, 35, 1200), 1),
    "cholesterol": np.random.randint(150, 300, 1200),
    "smoker": np.random.choice([True, False], 1200),
    "fit_for_work": np.random.choice([True, False], 1200, p=[0.9, 0.1])
})
display(medical_evaluations.head())

# 4. Monthly KPIs
departments = employees["department"].unique()
months = pd.date_range("2024-01-01", "2025-12-01", freq="MS")
kpis = []
for dept in departments:
    for month in months:
        kpis.append({
            "department": dept,
            "month": month.strftime("%Y-%m"),
            "absenteeism_%": np.round(np.random.uniform(0, 10), 2),
            "accidents": np.random.randint(0, 10),
            "active_employees": np.random.randint(50, 300)
        })
monthly_kpis = pd.DataFrame(kpis)
display(monthly_kpis.head())

print('Finished!')

,employee_id,age,gender,department,seniority_years,location
0,1,56,M,Operations,29,Monterrey
1,2,46,F,Operations,21,Medellín
2,3,32,F,Operations,27,Rio de Janeiro
3,4,60,F,Operations,6,Montevideo
4,5,25,M,HR,34,São Paulo


,event_id,employee_id,cause,duration_days,date,estimated_cost
0,1,247,Illness,13,2025-02-19,3766.35
1,2,72,Accident,22,2025-03-01,655.96
2,3,830,Medical leave,5,2025-02-19,1313.07
3,4,185,Accident,22,2024-01-26,4055.99
4,5,6,Other,7,2025-01-05,3838.67


,evaluation_id,employee_id,date,BMI,cholesterol,smoker,fit_for_work
0,1,497,2025-05-03,31.8,268,False,True
1,2,561,2024-08-17,29.5,218,False,True
2,3,484,2025-01-04,28.3,282,False,True
3,4,849,2024-06-26,27.6,294,True,True
4,5,722,2025-02-13,32.6,201,False,True


,department,month,absenteeism_%,accidents,active_employees
0,Operations,2024-01,0.22,7,235
1,Operations,2024-02,4.91,0,196
2,Operations,2024-03,9.39,9,114
3,Operations,2024-04,7.74,0,117
4,Operations,2024-05,7.39,6,93


Finished!
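
Because every column is sampled independently, some rows are internally inconsistent. In the employees preview above, for example, employee 5 is 25 years old with 34 years of seniority. If that matters for your analysis, one option is to cap seniority by age. The sketch below uses two rows from the preview; the minimum working age of 18 and the name `sample_employees` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

MIN_WORKING_AGE = 18  # assumption: matches the lowest age the notebook samples

# Two rows from the employees preview, reduced to the columns needed here
sample_employees = pd.DataFrame({
    "employee_id": [1, 5],
    "age": [56, 25],
    "seniority_years": [29, 34],
})

# Seniority cannot exceed the years elapsed since the minimum working age
max_seniority = sample_employees["age"] - MIN_WORKING_AGE
sample_employees["seniority_years"] = np.minimum(
    sample_employees["seniority_years"], max_seniority
)

print(sample_employees)
```

Capping is the simplest fix; sampling seniority conditionally on age would be another option, at the cost of changing the values the seeded notebook currently produces.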


In [4]:
# Show dimensions and columns of each table
print('Employees shape:', employees.shape)
print('Employees columns:', employees.columns.tolist(), '\n')

print('Health Events shape:', health_events.shape)
print('Health Events columns:', health_events.columns.tolist(), '\n')

print('Medical Evaluations shape:', medical_evaluations.shape)
print('Medical Evaluations columns:', medical_evaluations.columns.tolist(), '\n')

print('Monthly KPIs shape:', monthly_kpis.shape)
print('Monthly KPIs columns:', monthly_kpis.columns.tolist(), '\n')

Employees shape: (1000, 6)
Employees columns: ['employee_id', 'age', 'gender', 'department', 'seniority_years', 'location'] 

Health Events shape: (2000, 6)
Health Events columns: ['event_id', 'employee_id', 'cause', 'duration_days', 'date', 'estimated_cost'] 

Medical Evaluations shape: (1200, 7)
Medical Evaluations columns: ['evaluation_id', 'employee_id', 'date', 'BMI', 'cholesterol', 'smoker', 'fit_for_work'] 

Monthly KPIs shape: (120, 5)
Monthly KPIs columns: ['department', 'month', 'absenteeism_%', 'accidents', 'active_employees'] 
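
Note that `monthly_kpis` is sampled on its own, so its accident counts will not match what is recorded in `health_events`. If the two need to agree, the KPIs could instead be derived from the event table. The following is a minimal sketch with small stand-in frames; the names `sample_employees`, `sample_events`, `derived_kpis`, and `absence_days` are hypothetical.

```python
import pandas as pd

# Small stand-ins for the employees and health_events tables
sample_employees = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "department": ["Operations", "Operations", "HR"],
})
sample_events = pd.DataFrame({
    "employee_id": [1, 2, 2, 3],
    "cause": ["Accident", "Illness", "Accident", "Accident"],
    "duration_days": [3, 5, 2, 10],
    "date": pd.to_datetime(["2024-01-10", "2024-01-15", "2024-01-20", "2024-02-05"]),
})

# Attach each event to the department of the employee it belongs to
events = sample_events.merge(sample_employees, on="employee_id", how="left")
events["month"] = events["date"].dt.strftime("%Y-%m")
events["is_accident"] = events["cause"].eq("Accident")

# One row per department and month, aggregated from the event records
derived_kpis = (
    events.groupby(["department", "month"], as_index=False)
    .agg(accidents=("is_accident", "sum"), absence_days=("duration_days", "sum"))
)
print(derived_kpis)
```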



In [6]:
# Save all tables to CSV files for download in the shared Windows directory
save_path = '/mnt/e/Desktop/Data Science 2025/3-occupational health/'
employees.to_csv(save_path + 'employees.csv', index=False)
health_events.to_csv(save_path + 'health_events.csv', index=False)
medical_evaluations.to_csv(save_path + 'medical_evaluations.csv', index=False)
monthly_kpis.to_csv(save_path + 'monthly_kpis.csv', index=False)
print('All tables saved as CSV files in:', save_path)

All tables saved as CSV files in: /mnt/e/Desktop/Data Science 2025/3-occupational health/
