# **Synthetic Data Generation**
---

## **Basis for Synthetic Dataset Rules**

This synthetic dataset generator simulates student well-being classifications based on heuristic rules, class imbalance, and outliers. These design choices mirror real-world conditions where mental health data is often noisy, imbalanced, and influenced by overlapping emotional states.

**1. Heuristic Labeling Rules**
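The rules map probability distributions (from NLP model outputs) into broader well-being categories. They are evaluated in the order listed and the first matching rule wins, so a row that satisfies both the Struggling and Surviving conditions is labeled Struggling.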
  - InCrisis
    - Condition:
      - `p_depression > 0.5 AND p_suicidal > 0.3`
      - or `p_anxiety > 0.4 AND p_stress > 0.4`
    - Rationale: Crisis states often combine multiple extreme symptoms (e.g., severe depression + suicidal thoughts, or high anxiety + high stress).
  - Struggling
    - Condition:
      - `p_depression > 0.3 OR p_anxiety > 0.3`
    - Rationale: Moderate depression or anxiety indicates difficulty coping, but not yet at crisis level.
  - Surviving
    - Condition:
      - `0.2 < p_anxiety < 0.4 OR 0.2 < p_stress < 0.4`
    - Rationale: Lower but noticeable emotional strain. Individuals are functioning but showing warning signs.
  - Excelling
    - Condition:
      - `p_normal > 0.7 AND all negative emotions < 0.1`
    - Rationale: High positive baseline with minimal distress, indicating peak well-being rather than mere stability.
  - Thriving
    - Condition:
      - `p_normal > 0.5 AND all negatives < 0.2`
    - Rationale: Balanced and healthy state, showing resilience but not peak performance.
  - Fallback → Surviving
    - Used when no clear condition is met. Prevents undefined cases.
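
The ordering of the rules matters because the conditions overlap. As a minimal sketch, the rules above can be written as a standalone function; the name `classify_wellbeing` is hypothetical and not part of the notebook, but the thresholds and their order are copied from the list above.

```python
def classify_wellbeing(p_anxiety, p_normal, p_depression, p_suicidal, p_stress):
    """Return the well-being label for one probability vector (first match wins)."""
    negatives = (p_anxiety, p_depression, p_suicidal, p_stress)
    if (p_depression > 0.5 and p_suicidal > 0.3) or (p_anxiety > 0.4 and p_stress > 0.4):
        return "InCrisis"
    if p_depression > 0.3 or p_anxiety > 0.3:
        return "Struggling"
    if 0.2 < p_anxiety < 0.4 or 0.2 < p_stress < 0.4:
        return "Surviving"
    if p_normal > 0.7 and max(negatives) < 0.1:
        return "Excelling"
    if p_normal > 0.5 and max(negatives) < 0.2:
        return "Thriving"
    return "Surviving"  # fallback when no rule matches


# p_anxiety = 0.35 lies inside the Surviving band, but Struggling is checked first
print(classify_wellbeing(0.35, 0.30, 0.10, 0.05, 0.20))  # Struggling
print(classify_wellbeing(0.05, 0.80, 0.05, 0.05, 0.05))  # Excelling
print(classify_wellbeing(0.10, 0.60, 0.10, 0.05, 0.15))  # Thriving
```

Keeping the rules in one function like this makes it easy to check individual probability vectors against the intended category before generating a full dataset.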

**2. Class Imbalance**
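  - Implementation: Only a portion (`imbalance_ratio`, e.g., 20%) of "InCrisis" rows is retained. Note that `generate_sample` below has no `imbalance_ratio` parameter; it instead resamples every class to a target share drawn from `distribution`.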
  - Rationale:
    - In real student populations, severe crisis cases are rarer than normal or moderate distress cases.
    - Mimics natural skewed distribution found in mental health datasets.
    - Prepares ML models to handle minority vs majority class trade-offs.
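
The retention idea can be illustrated with a short sketch. The helper `downsample_class` and its parameters are assumptions made for illustration; they do not appear in the notebook's generator.

```python
import pandas as pd


def downsample_class(df, label, ratio, label_col="WellbeingClass", seed=0):
    """Keep only `ratio` of the rows labeled `label`; keep every other row."""
    is_target = df[label_col] == label
    kept = df[is_target].sample(frac=ratio, random_state=seed)
    # Recombine and restore the original row order
    return pd.concat([df[~is_target], kept]).sort_index()


# Toy frame: 10 crisis rows and 10 non-crisis rows
toy = pd.DataFrame({"WellbeingClass": ["InCrisis"] * 10 + ["Surviving"] * 10})
reduced = downsample_class(toy, "InCrisis", ratio=0.2)
print(reduced["WellbeingClass"].value_counts())
```

With a ratio of 0.2, two of the ten "InCrisis" rows remain while all "Surviving" rows are kept, which produces the skew described above.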

**3. Outliers**
  - Implementation: A fraction (`outlier_ratio`, e.g., 2%) of rows are assigned extreme probability values (e.g., negative or >1).
  - Rationale:
    - Simulates data entry errors, corrupted sensor logs, or NLP misclassifications.
    - Helps ensure models are robust against noise and not overly sensitive to perfect inputs.
    - Encourages the need for data preprocessing (clipping, normalization) in real deployment.
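
One possible preprocessing step for such rows is sketched below: clip each probability to [0, 1], then rescale each row so it sums to 1 again. The function name `clean_probabilities` is hypothetical, and whether clipping is the right repair (as opposed to dropping the row) is a design choice this notebook does not settle.

```python
import numpy as np
import pandas as pd

PROB_COLS = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]


def clean_probabilities(df, cols=PROB_COLS):
    """Clip probabilities to [0, 1] and renormalize each row to sum to 1."""
    out = df.copy()
    clipped = out[cols].clip(lower=0.0, upper=1.0)
    row_sums = clipped.sum(axis=1)
    # Rows that clip to all zeros cannot be renormalized; they become NaN
    out[cols] = clipped.div(row_sums.where(row_sums > 0), axis=0)
    return out


toy = pd.DataFrame(
    [
        [0.2, 0.5, 0.1, 0.1, 0.1],   # valid row
        [0.2, 0.5, -1.0, 3.0, 0.1],  # row with injected outliers
    ],
    columns=PROB_COLS,
)
cleaned = clean_probabilities(toy)
print(cleaned)
```

Clipping changes the relative weights within a corrupted row, so the repaired values are only an approximation of what the original probabilities might have been.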

---

## **Basis for Data Generation**

### **Model 1: Proposed Synthetic Dataset Distribution**

Based on research into the prevalence of mental health conditions among Filipino university students and young adults, this section proposes a more realistic distribution for the synthetic dataset. This model, designated "Model 1," moves away from the assumption that "In Crisis" cases are extremely rare and gives the generator a data-backed foundation. The percentages are inferred from academic studies and reports; they do not map one-to-one onto the Delphis continuum, but they provide a citable reference for the project.

The distribution is as follows:

* **Excelling:** **3-5%**
* **Thriving:** **7-10%**
* **Surviving:** **35-45%**
* **Struggling:** **25-30%**
* **In Crisis:** **15-20%**
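
The lower bounds of these ranges sum to 85% and the upper bounds to 110%, so shares drawn independently from each range will rarely sum to exactly 100%. The sketch below shows one way to turn the ranges into per-class row counts that add up to a requested total. The function `draw_class_counts` is hypothetical, and normalizing the drawn shares is an assumption of this sketch: after normalization a share can land slightly outside its stated range.

```python
import numpy as np

DISTRIBUTION = {
    "Excelling": (0.03, 0.05),
    "Thriving": (0.07, 0.10),
    "Surviving": (0.35, 0.45),
    "Struggling": (0.25, 0.30),
    "InCrisis": (0.15, 0.20),
}


def draw_class_counts(n, distribution=DISTRIBUTION, seed=0):
    """Draw one share per class, normalize to sum to 1, and convert to counts."""
    rng = np.random.default_rng(seed)
    labels = list(distribution)
    fracs = np.array([rng.uniform(low, high) for low, high in distribution.values()])
    fracs = fracs / fracs.sum()  # force the shares to sum to 1
    counts = np.floor(fracs * n).astype(int)
    # Hand the rows lost to rounding down to the largest classes
    remainder = n - counts.sum()
    for i in np.argsort(-fracs)[:remainder]:
        counts[i] += 1
    return dict(zip(labels, counts.tolist()))


counts = draw_class_counts(5000)
print(counts, sum(counts.values()))
```

Passing a fixed `seed` makes the drawn shares reproducible across runs.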

#### **Rationale and Research-Backed Justification:**

This model maintains the strong, evidence-based percentages from the previous version while logically dividing the top tier into "Excelling" and "Thriving."

* **Excelling (3-5%) & Thriving (7-10%):** These two categories are derived from the "Flourishing" category of the AXA Mind Health Report, which found that only **15%** of young people in Asia are "flourishing." This group is split to reflect different levels of well-being. "Excelling" represents the highest tier, the true high performers, and therefore receives the smaller share (3-5%). "Thriving" takes the remaining share (7-10%) and covers individuals who are doing very well but are not necessarily at their absolute peak. Together the two ranges span 10-15%, which stays within the reported 15%.

* **Surviving (35-45%):** This category aligns directly with the "Getting By" classification in the research. The AXA study found that the Philippines has the largest proportion globally of people "getting by" at **39%**. This represents a significant portion of the student population and is a crucial category for the model to be trained on.

* **Struggling (25-30%):** This range is supported by studies showing a high prevalence of mental health symptoms in the student population. Research from Mendoza et al. (2021) found that **over 23%** of a large sample of Filipino university students reported severe anxiety symptoms. The range is intended to capture the large portion of students who are experiencing significant distress.

* **In Crisis (15-20%):** Although high, this range is intended to reflect the current mental health landscape more closely than a sub-1% figure. It is justified by the high rates of related issues, such as the **24% prevalence of suicide ideation** among university students in Manila (Tria, 2015). A higher percentage allows the model to be more sensitive to a broader definition of "in crisis," which includes not only immediate suicidal risk but also severe clinical distress.

This model provides a research-backed framework for generating a more representative synthetic dataset, which should help the machine learning models trained on it be more effective and ethically responsible.

---
### **References:**

* **Mendoza, N. B., et al. (2021).** Mental Health Status and Help-Seeking Behavior of Filipino University Students During the COVID-19 Pandemic. *Transactions of the National Academy of Science and Technology, 43*(2).
* **Tria, A. (2015).** A Multivariate Analysis of Suicide Ideation Among University Students in the Philippines. *Asia Pacific Social Science Review, 15*(1), 1-13.
* **AXA. (2022).** *[AXA Mind Health Report](https://www.axa.com.ph/multimedia/newsroom/gen-z-pinoys-have-more-mind-health-conditions)*
* **AXA. (2024).** *[AXA Mind Health Report: Higher work stress seen among Millennials and Gen Zs](https://pop.inquirer.net/369115/axa-mind-health-report-higher-work-stress-seen-among-millennials-and-gen-zs)*

---

## **Setup & Imports**

In [2]:
import numpy as np
import pandas as pd
import os

---
## **Configurations**

In [3]:
DATA_PATH = "../../data/classification/synthetic_dataset.csv"

---
## **Data Generation**

In [4]:
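def generate_sample(
    n=1000,
    save_path="synthetic_dataset.csv",
    outlier_ratio=0.02,
    distribution=None,
):
    # Default (low, high) target share per class, taken from Model 1 above.
    # Built inside the function to avoid a mutable default argument.
    if distribution is None:
        distribution = {
            "Excelling": (0.03, 0.05),
            "Thriving": (0.07, 0.10),
            "Surviving": (0.35, 0.45),
            "Struggling": (0.25, 0.30),
            "InCrisis": (0.15, 0.20),
        }

    rows = []
    for _ in range(n * 2):  # oversample first, then resample to the target sizes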
        scores = np.random.rand(5)
        probs = scores / scores.sum()
        p_anx, p_norm, p_dep, p_sui, p_str = probs

        # Apply heuristic rules
        if (p_dep > 0.5 and p_sui > 0.3) or (p_anx > 0.4 and p_str > 0.4):
            label = "InCrisis"
        elif p_dep > 0.3 or p_anx > 0.3:
            label = "Struggling"
        elif 0.2 < p_anx < 0.4 or 0.2 < p_str < 0.4:
            label = "Surviving"
        elif p_norm > 0.7 and max(p_anx, p_dep, p_sui, p_str) < 0.1:
            label = "Excelling"
        elif p_norm > 0.5 and max(p_anx, p_dep, p_sui, p_str) < 0.2:
            label = "Thriving"
        else:
            label = "Surviving"  # fallback

        rows.append({
            "p_anxiety": p_anx,
            "p_normal": p_norm,
            "p_depression": p_dep,
            "p_suicidal": p_sui,
            "p_stress": p_str,
            "WellbeingClass": label
        })
    
    df = pd.DataFrame(rows)

    # --- Enforce target distribution ---
    # Each class share is drawn independently from its (low, high) range, so the
    # shares need not sum to 1 and the final row count can differ from n.
    final_rows = []
    for label, (low, high) in distribution.items():
        target_frac = np.random.uniform(low, high)
        target_n = int(n * target_frac)

        subset = df[df["WellbeingClass"] == label]
        if subset.empty:
            # The heuristic rules produced no rows for this class; nothing to resample
            print(f"⚠️ No rows labeled {label}; skipping this class.")
            continue

        # Oversample with replacement if there are too few rows, otherwise downsample
        sampled = subset.sample(
            n=target_n, replace=len(subset) < target_n, random_state=42
        )
        final_rows.append(sampled)

    # Combine all classes and shuffle the rows
    df = pd.concat(final_rows).sample(frac=1, random_state=42).reset_index(drop=True)

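    # --- Inject outliers ---
    # Pick a random subset of rows; their labels are left unchanged.
    n_outliers = int(len(df) * outlier_ratio)
    outlier_indices = np.random.choice(df.index, size=n_outliers, replace=False)

    # Multiply each selected value independently by -5 or 10, so it becomes
    # negative or ten times larger and the row no longer sums to 1.
    for col in ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]:
        df.loc[outlier_indices, col] = df.loc[outlier_indices, col].apply(
            lambda x: x * np.random.choice([-5, 10])
        )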

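    # Ensure the target directory exists; skip when save_path has no directory
    # part (e.g. the default "synthetic_dataset.csv"), since os.makedirs("") fails
    out_dir = os.path.dirname(save_path)
    if out_dir:
        os.makedirs(out_dir, exist_ok=True)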

    # Save only if file doesn't exist
    if not os.path.exists(save_path):
        df.to_csv(save_path, index=False)
        print(f"✅ Dataset generated and saved to {save_path}")
    else:
        print(f"⚠️ File already exists at {save_path}. Skipping save.")

    return df

In [5]:
df = generate_sample(5000, save_path=DATA_PATH)
print(df.head())

✅ Dataset generated and saved to ../../data/classification/synthetic_dataset.csv
   p_anxiety  p_normal  p_depression  p_suicidal  p_stress WellbeingClass
0   0.334127  0.293177      0.134766    0.008255  0.229675     Struggling
1   0.070999  0.183029      0.288366    0.358976  0.098631      Surviving
2   0.053555  0.689150      0.137887    0.119279  0.000129       Thriving
3   0.082681  0.544932      0.053853    0.134588  0.183945       Thriving
4   0.152601  0.099171      0.361759    0.131337  0.255133     Struggling
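
A quick sanity check on the generated frame is to compare the class shares with the target ranges and to count rows containing out-of-range probabilities. The sketch below uses a hypothetical helper, `summarize`, and a small hand-made frame so that it runs on its own; in the notebook it could be applied to `df` instead. Because the five probabilities in a row sum to 1 before injection, multiplying each by -5 or 10 should leave at least one value outside [0, 1], so this count is expected to match the number of injected outlier rows.

```python
import pandas as pd

PROB_COLS = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]


def summarize(df, cols=PROB_COLS, label_col="WellbeingClass"):
    """Return the share of each class and the number of rows with invalid values."""
    shares = df[label_col].value_counts(normalize=True)
    out_of_range = ((df[cols] < 0) | (df[cols] > 1)).any(axis=1)
    return shares, int(out_of_range.sum())


# Hand-made example; the third row mimics an injected outlier
toy = pd.DataFrame({
    "p_anxiety":    [0.3, 0.1, -0.5, 0.2],
    "p_normal":     [0.3, 0.6, 6.0, 0.2],
    "p_depression": [0.2, 0.1, 1.0, 0.2],
    "p_suicidal":   [0.1, 0.1, -0.5, 0.2],
    "p_stress":     [0.1, 0.1, 2.0, 0.2],
    "WellbeingClass": ["Struggling", "Thriving", "Thriving", "Surviving"],
})
shares, n_bad = summarize(toy)
print(shares)
print(f"rows with out-of-range probabilities: {n_bad}")
```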
