# **Synthetic Data Generation**
---

# Basis for Synthetic Dataset Rules

This synthetic dataset generator simulates student well-being classifications based on heuristic rules, enforced class distributions, and controlled outliers. These design choices mirror real-world conditions where mental health data is often noisy, imbalanced, and influenced by overlapping emotional states.

---

## 1. Heuristic Labeling Rules
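The rules map the five class probabilities produced by a simulated journal classifier (`p_anxiety`, `p_normal`, `p_depression`, `p_suicidal`, `p_stress`) into broader well-being categories. They are evaluated in the order listed and the first match wins, so a row that satisfies both the InCrisis and Struggling conditions is labeled InCrisis. "All negatives" means the four distress probabilities, that is, every probability except `p_normal`: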

- **InCrisis**
  - **Condition:**
    - `p_depression > 0.5 AND p_suicidal > 0.3`
    - OR `p_anxiety > 0.4 AND p_stress > 0.4`
  - **Rationale:** Crisis states often involve multiple extreme symptoms (e.g., severe depression + suicidal thoughts, or high anxiety + high stress).

- **Struggling**
  - **Condition:**
    - `p_depression > 0.3 OR p_anxiety > 0.3`
  - **Rationale:** Moderate depression or anxiety indicates difficulty coping, but not at crisis level.

- **Excelling**
  - **Condition:**
    - `p_normal > 0.7 AND all negatives < 0.1`
  - **Rationale:** Strong positive baseline with minimal distress → peak well-being.

- **Thriving**
  - **Condition:**
    - `p_normal > 0.5 AND all negatives < 0.2`
  - **Rationale:** Balanced, resilient state showing good mental health, though not at the very top.

- **Fallback → Struggling**
  - Applied when no other condition is met, to avoid undefined cases.
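
The ordered rules above can be sketched as a single function. This is a minimal illustration, not the generator itself: the name `label_from_probs` is hypothetical (the notebook inlines the same checks inside its generation loop), and the mood-based adjustment applied later in the generator is deliberately left out.

```python
def label_from_probs(p_anxiety, p_normal, p_depression, p_suicidal, p_stress):
    """Return the well-being class for one journal row; the first matching rule wins."""
    # Largest of the four distress probabilities ("all negatives" in the rules).
    worst_negative = max(p_anxiety, p_depression, p_suicidal, p_stress)

    if (p_depression > 0.5 and p_suicidal > 0.3) or (p_anxiety > 0.4 and p_stress > 0.4):
        return "InCrisis"
    if p_depression > 0.3 or p_anxiety > 0.3:
        return "Struggling"
    if p_normal > 0.7 and worst_negative < 0.1:
        return "Excelling"
    if p_normal > 0.5 and worst_negative < 0.2:
        return "Thriving"
    return "Struggling"  # fallback when no rule matches


# Example rows (values are illustrative and each sums to 1).
print(label_from_probs(0.05, 0.05, 0.55, 0.32, 0.03))
print(label_from_probs(0.05, 0.80, 0.05, 0.05, 0.05))
```

Because the InCrisis check runs first, the first example is not downgraded to Struggling even though `p_depression > 0.3` also holds.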

---

## 2. Class Distribution Control
- **Implementation:**  
  Instead of simple random imbalance, the generator **enforces a target range** for each class (e.g., InCrisis 15–20%, Struggling 25–30%, etc.).

- **Rationale:**
  - Better mirrors real-world skewed distributions.
  - Prevents under- or over-representation of classes.
  - Provides flexibility by sampling within a range instead of fixed ratios.
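
As a rough sketch of range-based sizing, the snippet below draws one fraction per class from its range and converts it to a row count. The helper name `sample_target_counts` and the use of a seeded `numpy.random.Generator` are assumptions made for illustration; the notebook performs the same draw inline with the global NumPy random state.

```python
import numpy as np

# Target ranges per class, as fractions of the requested sample size.
DISTRIBUTION = {
    "Excelling": (0.03, 0.05),
    "Thriving": (0.07, 0.10),
    "Struggling": (0.25, 0.30),
    "InCrisis": (0.15, 0.20),
}


def sample_target_counts(n, distribution, rng):
    """Draw one fraction per class from its range and convert it to a row count."""
    counts = {}
    for label, (low, high) in distribution.items():
        frac = rng.uniform(low, high)   # a different fraction on every call
        counts[label] = int(n * frac)   # truncate to a whole number of rows
    return counts


rng = np.random.default_rng(0)  # seed chosen only for repeatability
counts = sample_target_counts(1000, DISTRIBUTION, rng)
print(counts)
print("total rows:", sum(counts.values()))
```

Each class is sized independently, so the total is whatever the four draws add up to rather than exactly `n`.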

---

## 3. Controlled Outliers
- **Implementation:**
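  - A fraction (`outlier_ratio`, e.g., 2%) of rows is perturbed after labels are assigned. Labels are not recomputed, so perturbed rows can disagree with the heuristic rules.
  - For **journal rows** → Gaussian noise (standard deviation 0.5) is added to the probability values, negatives are clipped to zero, and the vector is renormalized to sum to 1.
  - For **mood flags** → a separate draw of `outlier_ratio` of all rows has one randomly chosen one-hot mood flag flipped.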

- **Rationale:**
  - Mimics noisy inputs from NLP classifiers or inconsistent user check-ins.
  - Keeps every perturbed probability vector valid (non-negative and summing to 1), so models are exposed to noise without receiving invalid inputs.
  - Encourages preprocessing pipelines to handle anomalies gracefully.
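
The two perturbations can be sketched as small helpers. The names `perturb_probs` and `flip_one_flag` are hypothetical, as is the seeded `numpy.random.Generator`; the notebook applies the same steps inline on DataFrame rows.

```python
import numpy as np


def perturb_probs(probs, rng, sigma=0.5):
    """Add Gaussian noise, clip negatives to zero, and renormalize to sum to 1."""
    noisy = np.asarray(probs, dtype=float) + rng.normal(0.0, sigma, size=len(probs))
    noisy = np.clip(noisy, 0.0, None)
    if noisy.sum() == 0:
        # Every entry was clipped to zero; fall back to a fresh random vector.
        noisy = rng.random(len(probs))
    return noisy / noisy.sum()


def flip_one_flag(flags, rng):
    """Return a copy of a 0/1 mood vector with one randomly chosen entry flipped."""
    flipped = np.array(flags, dtype=int)
    i = rng.integers(len(flipped))  # index in [0, len(flipped))
    flipped[i] = 1 - flipped[i]
    return flipped


rng = np.random.default_rng(0)  # seed chosen only for repeatability
print(perturb_probs([0.2, 0.4, 0.2, 0.1, 0.1], rng))
print(flip_one_flag([0, 1, 0, 0], rng))
```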

---

# Basis for Data Generation

## Model 1: Proposed Synthetic Dataset Distribution

Based on research into the prevalence of mental health conditions among Filipino university students and young adults, this model proposes a realistic and balanced distribution for the synthetic dataset. Unlike earlier drafts, **this version removes the "Surviving" category** to better align with the finalized heuristic rules in the generator.  

Instead of five classes, we now use four: **Excelling, Thriving, Struggling, and In Crisis** (written `InCrisis` in the code). This structure matches the logic applied in the data generator and avoids artificial overlap between "Struggling" and "Surviving."

### Final Distribution
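* **Excelling:** **3–5%**
* **Thriving:** **7–10%**
* **Struggling:** **25–30%**
* **In Crisis:** **15–20%**

These ranges are fractions of the requested sample size `n` and together cover only 50–65% of it. The generator sizes each class independently and does not fill the remainder, so the returned dataset has fewer than `n` rows (roughly 2,500–3,250 for `n = 5000`).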

---

## Rationale and Research-Backed Justification

This model maintains evidence-based prevalence rates while making the class taxonomy cleaner and more practical for ML training.

- **Excelling (3–5%) & Thriving (7–10%)**  
  These categories correspond to the "Flourishing" group in the AXA Mind Health Report, which found that only **15%** of young people in Asia are flourishing. We split this into two tiers:
  - *Excelling* → the true high-performers, a smaller share (3–5%).  
  - *Thriving* → doing very well but not at the extreme top (7–10%).

- **Struggling (25–30%)**  
  Supported by studies such as Mendoza et al. (2021), which reported that **over 23%** of Filipino university students experienced severe anxiety symptoms. This category captures students with significant distress that falls short of crisis level.

- **In Crisis (15–20%)**  
  This range is higher than some global estimates but is justified by local data: Tria (2015) found a **24% prevalence of suicide ideation** among Manila university students. This category represents both acute suicidal risk and severe clinical distress.

---

## Why "Surviving" Was Removed

- **Overlap with Struggling:** In practice, the "Surviving" category captured milder distress, but the boundary between "Surviving" and "Struggling" was blurry. This weakened class separation.  
- **Heuristic Consistency:** The current labeling rules classify such cases under **Struggling** (moderate anxiety or depression), making a separate category redundant.  
- **Simpler, Cleaner Taxonomy:** By reducing to four classes, the dataset avoids artificial middle-ground states and ensures each label has clear, actionable meaning for training and downstream interpretation.

---

## References
- **Mendoza, N. B., et al. (2021).** Mental Health Status and Help-Seeking Behavior of Filipino University Students During the COVID-19 Pandemic. *Transactions of the National Academy of Science and Technology, 43*(2).  
- **Tria, A. (2015).** A Multivariate Analysis of Suicide Ideation Among University Students in the Philippines. *Asia Pacific Social Science Review, 15*(1), 1–13.  
- **AXA. (2022).** *[AXA Mind Health Report](https://www.axa.com.ph/multimedia/newsroom/gen-z-pinoys-have-more-mind-health-conditions)*  
- **AXA. (2024).** *[AXA Mind Health Report: Higher work stress seen among Millennials and Gen Zs](https://pop.inquirer.net/369115/axa-mind-health-report-higher-work-stress-seen-among-millennials-and-gen-zs)*  


## **Setup & Imports**

In [66]:
import numpy as np
import pandas as pd
import os

---
## **Configurations**

In [67]:
DATA_PATH = "../../data/classification/synthetic_dataset.csv"

In [68]:
MOOD_POOLS = {
    "InCrisis": ["Depressed", "Sad", "Exhausted", "Hopeless"],
    "Struggling": ["Anxious", "Angry", "Stressed", "Restless"],
    "Thriving": ["Calm", "Relaxed", "Peaceful", "Content"],
    "Excelling": ["Happy", "Energized", "Excited", "Motivated"],
}

In [69]:
# Build reverse map automatically
MOOD_CLASS_MAP = {mood: cls for cls, moods in MOOD_POOLS.items() for mood in moods}

In [70]:
ALL_MOODS = [mood for moods in MOOD_POOLS.values() for mood in moods]

---
## **Data Generation**

In [71]:
def generate_sample_independent_mood(
    n=1000,
    save_path="synthetic_dataset.csv",
    outlier_ratio=0.02,
    gratitude_prob=0.6,   # probability of a gratitude entry
    mood_only_ratio=0.1,  # fraction of rows that are mood-only (no journal text)
    distribution={
        "Excelling": (0.03, 0.05),
        "Thriving": (0.07, 0.10),
        "Struggling": (0.25, 0.30),
        "InCrisis": (0.15, 0.20),
    }
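):
    """Generate a synthetic well-being dataset and return it as a DataFrame.

    Rows are oversampled, labeled from journal probabilities and moods,
    resampled to the per-class target ranges in `distribution`, and then
    perturbed with outliers. The CSV is written to `save_path` only if no
    file exists there yet.
    """
    rows = []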

    # --- Step 1: Generate oversampled rows ---
    # Oversample to ensure enough data after enforcing class distribution
    for _ in range(n * 2):
        # Decide if this row is mood-only
        is_mood_only = np.random.rand() < mood_only_ratio

        # --- Step 2: Generate journal probabilities (only for journal entries) ---
        if is_mood_only:
            # Mood-only row → journal probabilities set to zero
            p_anx = p_norm = p_dep = p_sui = p_str = 0.0
        else:
            # Generate random probabilities for journal features
            scores = np.random.rand(5)
            probs = scores / scores.sum()  # normalize to sum=1
            p_anx, p_norm, p_dep, p_sui, p_str = probs

        # --- Step 3: Generate moods independently ---
        # Randomly pick 1-3 moods from all moods
        num_moods = np.random.randint(1, 4)
        chosen_moods = np.random.choice(ALL_MOODS, size=num_moods, replace=False)

        # --- Step 4: Assign preliminary label based on journal probabilities ---
        if not is_mood_only:
            if (p_dep > 0.5 and p_sui > 0.3) or (p_anx > 0.4 and p_str > 0.4):
                label = "InCrisis"
            elif p_dep > 0.3 or p_anx > 0.3:
                label = "Struggling"
            elif p_norm > 0.7 and max(p_anx, p_dep, p_sui, p_str) < 0.1:
                label = "Excelling"
            elif p_norm > 0.5 and max(p_anx, p_dep, p_sui, p_str) < 0.2:
                label = "Thriving"
            else:
                label = "Struggling"
        else:
            # Mood-only row → initial label will be determined by moods
            label = None

        # --- Step 5: Adjust label based on moods ---
        # Moods can only escalate a journal label toward a more severe class;
        # for mood-only rows (label is None) the most severe mood sets the label.
        mood_labels = [MOOD_CLASS_MAP[m] for m in chosen_moods]
        if "InCrisis" in mood_labels:
            label = "InCrisis"
        elif "Struggling" in mood_labels and label != "InCrisis":
            label = "Struggling"
        elif "Thriving" in mood_labels and label is None:
            label = "Thriving"
        elif label is None:
            label = "Excelling"

        # --- Step 6: One-hot encode moods ---
        mood_encoding = {m: (1 if m in chosen_moods else 0) for m in ALL_MOODS}

        # --- Step 7: Add gratitude flag ---
        gratitude_flag = np.random.choice([0, 1], p=[1 - gratitude_prob, gratitude_prob])

        # --- Step 8: Combine all features into a row ---
        row = {
            "p_anxiety": p_anx,
            "p_normal": p_norm,
            "p_depression": p_dep,
            "p_suicidal": p_sui,
            "p_stress": p_str,
            "gratitude_flag": gratitude_flag,
            "WellbeingClass": label
        }
        row.update(mood_encoding)
        rows.append(row)

    # --- Step 9: Convert to DataFrame ---
    df = pd.DataFrame(rows)

    # --- Step 10: Enforce target distribution ---
    final_rows = []
    for label, (low, high) in distribution.items():
        target_frac = np.random.uniform(low, high)
        target_n = int(n * target_frac)

        subset = df[df["WellbeingClass"] == label]
        if len(subset) < target_n:
            # Oversample if not enough rows
            sampled = subset.sample(n=target_n, replace=True, random_state=42)
        else:
            sampled = subset.sample(n=target_n, replace=False, random_state=42)

        final_rows.append(sampled)

    df = pd.concat(final_rows).sample(frac=1, random_state=42).reset_index(drop=True)

    # --- Step 11: Inject outliers into journal probabilities ---
    journal_rows = df[(df["p_anxiety"] > 0) | (df["p_normal"] > 0)].index
    n_outliers_probs = int(len(journal_rows) * outlier_ratio)
    outlier_indices_probs = np.random.choice(journal_rows, size=n_outliers_probs, replace=False)

    prob_cols = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]
    for idx in outlier_indices_probs:
        # Cast to float: a row sliced from a mixed-dtype frame can come back as object dtype
        probs = df.loc[idx, prob_cols].to_numpy(dtype=float)
        noisy = probs + np.random.normal(0, 0.5, size=5)
        noisy = np.clip(noisy, 0, None)  # prevent negatives
        if noisy.sum() == 0:
            noisy = np.random.rand(5)  # fallback if all clipped
        noisy = noisy / noisy.sum()
        df.loc[idx, prob_cols] = noisy

    # --- Step 12: Inject outliers into moods ---
    n_outliers_moods = int(len(df) * outlier_ratio)
    outlier_indices_moods = np.random.choice(df.index, size=n_outliers_moods, replace=False)
    for idx in outlier_indices_moods:
        mood_to_flip = np.random.choice(ALL_MOODS)
        df.at[idx, mood_to_flip] = 1 - df.at[idx, mood_to_flip]  # flip 0→1 or 1→0

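    # --- Step 13: Save dataset ---
    # os.path.dirname returns "" for a bare filename (the default save_path),
    # and os.makedirs("") raises FileNotFoundError, so only create a directory when one is given.
    save_dir = os.path.dirname(save_path)
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
    if not os.path.exists(save_path):
        df.to_csv(save_path, index=False)
        print(f"✅ Dataset generated and saved to {save_path}")
    else:
        print(f"⚠️ File already exists at {save_path}. Skipping save.")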

    return df
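
A quick sanity check on the returned frame might look like the sketch below. The helpers `class_shares` and `out_of_range` and the tolerance value are hypothetical; the only thing taken from the generator is the `WellbeingClass` column name and the target ranges. Shares are computed against the requested `n` because the generator sizes each class as `int(n * fraction)`, not as a fraction of the final row count.

```python
import pandas as pd

TARGET_RANGES = {
    "Excelling": (0.03, 0.05),
    "Thriving": (0.07, 0.10),
    "Struggling": (0.25, 0.30),
    "InCrisis": (0.15, 0.20),
}


def class_shares(df, n_requested, label_col="WellbeingClass"):
    """Return each class's row count divided by the requested sample size."""
    return df[label_col].value_counts() / n_requested


def out_of_range(shares, target_ranges, tol=0.001):
    """List classes whose share falls outside its target range (within a small tolerance)."""
    return [
        label
        for label, (low, high) in target_ranges.items()
        if not (low - tol <= shares.get(label, 0.0) <= high + tol)
    ]


# Toy frame standing in for generator output with 100 requested rows.
toy = pd.DataFrame(
    {"WellbeingClass": ["Excelling"] * 4 + ["Thriving"] * 8
                       + ["Struggling"] * 27 + ["InCrisis"] * 17}
)
shares = class_shares(toy, n_requested=100)
print(out_of_range(shares, TARGET_RANGES))  # → []
```

The tolerance absorbs the truncation from `int(...)`, which can leave a class slightly below its lower bound for small `n`.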


In [72]:
df = generate_sample_independent_mood(5000, save_path=DATA_PATH)
print(df.head())

✅ Dataset generated and saved to ../../data/classification/synthetic_dataset.csv
   p_anxiety  p_normal  p_depression  p_suicidal  p_stress  gratitude_flag  \
0   0.000000  0.000000      0.000000    0.000000  0.000000               1   
1   0.307574  0.123478      0.142580    0.147782  0.278586               0   
2   0.000000  0.000000      0.000000    0.000000  0.000000               0   
3   0.246233  0.174566      0.277604    0.151734  0.149863               0   
4   0.187346  0.243826      0.211637    0.120585  0.236605               1   

  WellbeingClass  Depressed  Sad  Exhausted  ...  Stressed  Restless  Calm  \
0       Thriving          0    0          0  ...         0         0     1   
1     Struggling          0    0          0  ...         0         0     1   
2       InCrisis          0    1          0  ...         0         0     0   
3     Struggling          0    0          0  ...         0         1     0   
4     Struggling          0    0          0  ...         0  