# **Synthetic Data Generation**
---

# Basis for Synthetic Dataset Rules

This synthetic dataset generator simulates student well-being classifications based on heuristic rules, enforced class distributions, and controlled outliers. These design choices mirror real-world conditions where mental health data is often noisy, imbalanced, and influenced by overlapping emotional states.

---
## 1. High-level assumptions
- Each row = a single daily log for a user (one user-day).
- Every user must choose 1–3 moods/day (so at least one mood bit = 1).
- Journaling (free-text) is optional per day; if no journal → all `p_* = 0.0`.
- Gratitude entry is optional and recorded as a binary flag.
- `WellbeingClass` is the label of record, derived primarily from moods, secondarily modulated by journaling probabilities and gratitude.
- There are 4 WellbeingClasses: `InCrisis`, `Struggling`, `Excelling`, `Thriving`.

---
## 2. Target class-level population distribution
These percentages reflect a plausible population-level mix for an app that serves both users who need support and users who are doing well (adjust based on CGCS expectations).

Based on research into the prevalence of mental health conditions among Filipino university students and young adults, this model proposes a realistic and balanced distribution for the synthetic dataset. Unlike earlier drafts, **this version removes the "Surviving" category** to better align with the finalized heuristic rules in the generator.  

Instead of five classes, we now use four: **Excelling, Thriving, Struggling, and In Crisis.** This structure matches the logic applied in the data generator and avoids artificial overlap between "Struggling" and "Surviving."

### Final Distribution
* **Excelling:** **3–5%**
* **Thriving:** **7–10%**
* **Struggling:** **25–30%**
* **In Crisis:** **15–20%**
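
As a rough illustration of how these ranges turn into row counts, the sketch below draws one fraction per class uniformly from its range. The name `draw_class_counts` and the `renormalize` option are assumptions made for this example and are not part of the generator. Note that the listed ranges add up to 50–65%, so without renormalization the counts total less than `n`.

```python
import numpy as np

# Ranges copied from the "Final Distribution" list above.
TARGET_RANGES = {
    "Excelling": (0.03, 0.05),
    "Thriving": (0.07, 0.10),
    "Struggling": (0.25, 0.30),
    "InCrisis": (0.15, 0.20),
}

def draw_class_counts(n, ranges=TARGET_RANGES, renormalize=False, rng=None):
    """Draw one fraction per class from its range and convert it to a row count."""
    if rng is None:
        rng = np.random.default_rng()
    fracs = {cls: rng.uniform(low, high) for cls, (low, high) in ranges.items()}
    if renormalize:
        # Assumption (not stated in the source): rescale so the fractions sum to 1.
        total = sum(fracs.values())
        fracs = {cls: f / total for cls, f in fracs.items()}
    return {cls: int(n * f) for cls, f in fracs.items()}

counts = draw_class_counts(5000, rng=np.random.default_rng(0))
counts_full = draw_class_counts(5000, renormalize=True, rng=np.random.default_rng(0))
```

Whether to renormalize is a design choice: leaving the ranges as they are keeps the research-backed percentages literal, while renormalizing keeps the dataset at the requested size.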

### Rationale and Research-Backed Justification
This model maintains evidence-based prevalence rates while making the class taxonomy cleaner and more practical for ML training.

- **Excelling (3–5%) & Thriving (7–10%)**  
  These categories correspond to the "Flourishing" group in the AXA Mind Health Report, which found that only **15%** of young people in Asia are flourishing. We split this into two tiers:
  - *Excelling* → the true high-performers, a smaller share (3–5%).  
  - *Thriving* → doing very well but not at the extreme top (7–10%).

- **Struggling (25–30%)**  
  Supported by studies like Mendoza et al. (2021), which showed **over 23%** of Filipino university students experienced severe anxiety symptoms. This category captures those with significant distress but not at crisis level.

- **In Crisis (15–20%)**  
  Higher than some global estimates, but justified by local data. Tria (2015) found **24% prevalence of suicide ideation** among Manila university students. This category represents both acute suicidal risk and severe clinical distress.

### Why "Surviving" Was Removed

- **Overlap with Struggling:** In practice, the "Surviving" category captured milder distress, but the boundary between "Surviving" and "Struggling" was blurry. This weakened class separation.  
- **Heuristic Consistency:** The current labeling rules classify such cases under **Struggling** (moderate anxiety or depression), making a separate category redundant.  
- **Simpler, Cleaner Taxonomy:** By reducing to four classes, the dataset avoids artificial middle-ground states and ensures each label has clear, actionable meaning for training and downstream interpretation.

### References
- **Mendoza, N. B., et al. (2021).** Mental Health Status and Help-Seeking Behavior of Filipino University Students During the COVID-19 Pandemic. *Transactions of the National Academy of Science and Technology, 43*(2).  
- **Tria, A. (2015).** A Multivariate Analysis of Suicide Ideation Among University Students in the Philippines. *Asia Pacific Social Science Review, 15*(1), 1–13.  
- **AXA. (2022).** *[AXA Mind Health Report](https://www.axa.com.ph/multimedia/newsroom/gen-z-pinoys-have-more-mind-health-conditions)*  
- **AXA. (2024).** *[AXA Mind Health Report: Higher work stress seen among Millennials and Gen Zs](https://pop.inquirer.net/369115/axa-mind-health-report-higher-work-stress-seen-among-millennials-and-gen-zs)*  

---

## 3. Mood sets & mapping (16 moods → 4 classes)
Each class has exactly 4 mapped moods.
- **InCrisis:** `Depressed`, `Sad`, `Exhausted`, `Hopeless`
- **Struggling:** `Anxious`, `Angry`, `Stressed`, `Restless`
- **Thriving:** `Calm`, `Relaxed`, `Peaceful`, `Content`
- **Excelling:** `Happy`, `Energized`, `Excited`, `Motivated`

---

## 4. Mood selection rules (per row)
- First sample `WellbeingClass` according to distribution in (2).
- Then choose a number of moods `k ∈ {1,2,3}`:
  - Probability of `k`: P(1)=0.6, P(2)=0.3, P(3)=0.1 (adjust based on CGCS expectations).
- Choose `k` moods with a preference for the sampled class:
  - At least **70%** of selections must be from the chosen class's 4 mapped moods.
  - Remaining selections (if k>1) may be:
    - A same-class co-occurring mood (strong preference), or
    - With small probability (10–15%), a mood from an adjacent severity class (e.g., Struggling ↔ Thriving), modeling mixed-state days.
- Ensure no row has zero moods.

Example (InCrisis day, k=2): pick `Sad` and `Depressed` (both in InCrisis) with high probability; or `Sad` + `Hopeless`.
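
The rules above could be sketched roughly as follows. This is an illustration under stated assumptions, not the generator's actual logic: `sample_moods`, `SEVERITY_ORDER`, and the 12% cross-class probability (a value picked from the 10–15% range) are hypothetical, and the 70% same-class floor is treated as a dataset-level property rather than a per-row one.

```python
import numpy as np

MOOD_POOLS = {
    "InCrisis": ["Depressed", "Sad", "Exhausted", "Hopeless"],
    "Struggling": ["Anxious", "Angry", "Stressed", "Restless"],
    "Thriving": ["Calm", "Relaxed", "Peaceful", "Content"],
    "Excelling": ["Happy", "Energized", "Excited", "Motivated"],
}
SEVERITY_ORDER = ["InCrisis", "Struggling", "Thriving", "Excelling"]  # worst first
K_PROBS = [0.6, 0.3, 0.1]   # P(k=1), P(k=2), P(k=3)
CROSS_CLASS_PROB = 0.12     # assumption: a value inside the 10-15% range

def sample_moods(cls, rng):
    """Return 1-3 distinct moods biased toward the moods mapped to `cls`."""
    k = int(rng.choice([1, 2, 3], p=K_PROBS))
    own = MOOD_POOLS[cls]
    i = SEVERITY_ORDER.index(cls)
    neighbors = [SEVERITY_ORDER[j] for j in (i - 1, i + 1)
                 if 0 <= j < len(SEVERITY_ORDER)]
    # The first mood always comes from the sampled class, so no row is empty.
    chosen = [str(rng.choice(own))]
    while len(chosen) < k:
        if rng.random() < CROSS_CLASS_PROB:
            pool = MOOD_POOLS[str(rng.choice(neighbors))]  # mixed-state day
        else:
            pool = own
        candidates = [m for m in pool if m not in chosen]
        chosen.append(str(rng.choice(candidates)))
    return chosen

example = sample_moods("InCrisis", np.random.default_rng(0))
```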

---

## 5. Journaling probability behaviour (per row)
Journaling probability is conditional on wellbeing class. People in worse states may journal more or less often, depending on the assumption you adopt; choose one and document it. A suggested realistic pattern:

- InCrisis: 60% chance to journal
- Struggling: 50%
- Thriving: 40%
- Excelling: 35%

If journaling = False → set p_anxiety = p_normal = p_depression = p_suicidal = p_stress = 0.0.

If journaling = True → generate a probability vector p = (p_anxiety, p_normal, p_depression, p_suicidal, p_stress) sampled from a Dirichlet-like mechanism tuned per class (details below).
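
A minimal sketch of the journaling rule, assuming the suggested per-class rates above. The helper names `sample_journaling` and `empty_p_vector` are hypothetical; the Dirichlet sampling for journaled rows is described in the next section and is not repeated here.

```python
import numpy as np

JOURNAL_PROB = {"InCrisis": 0.60, "Struggling": 0.50, "Thriving": 0.40, "Excelling": 0.35}
P_COLS = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]

def sample_journaling(cls, rng):
    """Bernoulli draw: True if the user journaled on this day."""
    return bool(rng.random() < JOURNAL_PROB[cls])

def empty_p_vector():
    """All-zero p_* values used for rows without a journal entry."""
    return dict.fromkeys(P_COLS, 0.0)

rng = np.random.default_rng(0)
journaled = sample_journaling("InCrisis", rng)
p_values = None if journaled else empty_p_vector()  # journaled rows get a Dirichlet sample instead
```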

---

## 6. Generating NLP probability vectors (conditional on journaling & class)
Use class-specific Dirichlet / Beta mixtures to produce realistic, soft probabilities that generally peak at the expected class but allow uncertainty.

### **a. Dirichlet centers (example - mean preference per class)**
Normalized center vectors `μ_class` (sums to 1):
- InCrisis μ = [p_anxiety=0.30, p_normal=0.05, p_depression=0.45, p_suicidal=0.15, p_stress=0.05]
- Struggling μ = [0.25, 0.10, 0.40, 0.02, 0.23]
- Thriving μ = [0.05, 0.75, 0.05, 0.00, 0.15]
- Excelling μ = [0.02, 0.85, 0.01, 0.00, 0.12]

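>(We can change weights; ensure suicidal probabilities are low except for InCrisis. Entries of exactly 0.00 need a small positive floor before sampling, because Dirichlet parameters must be strictly positive.)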

### **b. Concentration parameter (controls spread)**
- Use Dirichlet concentration `α = μ * s`. Pick s (scalar) to control certainty:
  - For more peaked distributions (less noise), use `s = 50`.
  - For more variance, `s = 10`.
- Suggested: `s = 25` for balanced realism (not too deterministic).

### **c. Sampling process**
- Sample `p_vec ~ Dirichlet(α = μ * s)` → yields 5 probabilities summing to 1.
- Optional confidence scaling: to keep `p_*` as normalized class probabilities, use the sampled vector directly (sum = 1). To model absolute confidences that can be low overall, multiply the sampled vector by a scalar `c` drawn from Beta(α=2, β=8) (e.g., `p_vec_scaled = c * p_vec`); the scaled values then no longer sum to 1. Choose one semantics and apply it consistently.
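
The sampling step could look like the sketch below, which uses NumPy's `Generator.dirichlet` and `Generator.beta`. The names `sample_p_vec`, `EPS`, and `scale_confidence` are assumptions for this example. The floor `EPS` is added because the example centers contain 0.00 entries, which would otherwise produce a zero Dirichlet parameter.

```python
import numpy as np

P_COLS = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]
MU = {
    "InCrisis":   [0.30, 0.05, 0.45, 0.15, 0.05],
    "Struggling": [0.25, 0.10, 0.40, 0.02, 0.23],
    "Thriving":   [0.05, 0.75, 0.05, 0.00, 0.15],
    "Excelling":  [0.02, 0.85, 0.01, 0.00, 0.12],
}
EPS = 1e-3  # assumption: small floor so every Dirichlet parameter is positive

def sample_p_vec(cls, s=25.0, scale_confidence=False, rng=None):
    """Sample p_* for a journaled row from Dirichlet(alpha = mu * s)."""
    if rng is None:
        rng = np.random.default_rng()
    mu = np.clip(np.asarray(MU[cls], dtype=float), EPS, None)
    mu = mu / mu.sum()               # renormalize after applying the floor
    p_vec = rng.dirichlet(mu * s)    # five values summing to 1
    if scale_confidence:
        # Optional: overall confidence c ~ Beta(2, 8); the result no longer sums to 1.
        p_vec = rng.beta(2, 8) * p_vec
    return p_vec

p = sample_p_vec("Thriving", rng=np.random.default_rng(0))
row = dict(zip(P_COLS, p))
```

A larger `s` concentrates samples around the class center, and a smaller `s` spreads them out, which matches the description of the concentration parameter above.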

### **d. When to set zeros or small values**
- If journaling detected but the journal is neutral: the Dirichlet center for `Thriving` or `Excelling` handles that with high p_normal.

---

## 7. Gratitude flag probabilities (conditional on class)
Make gratitude more likely in positive states but still possible in all.
- P(gratitude=1 | Excelling) = 0.80
- P(gratitude=1 | Thriving) = 0.60
- P(gratitude=1 | Struggling) = 0.35
- P(gratitude=1 | InCrisis) = 0.20
> (If a user wrote a gratitude entry, the flag is 1 regardless of length.)
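
For a whole column of sampled classes, the flag can be drawn in one vectorized step. The sketch below assumes the conditional probabilities listed above; `sample_gratitude` is a hypothetical helper name.

```python
import numpy as np

GRATITUDE_PROB = {"Excelling": 0.80, "Thriving": 0.60, "Struggling": 0.35, "InCrisis": 0.20}

def sample_gratitude(classes, rng):
    """Return a 0/1 flag per row, with P(flag=1) looked up from each row's class."""
    probs = np.array([GRATITUDE_PROB[c] for c in classes])
    return (rng.random(len(probs)) < probs).astype(int)

classes = ["Excelling", "InCrisis", "Struggling", "Thriving"]
flags = sample_gratitude(classes, np.random.default_rng(0))
```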

---

## 8. Label consistency & conflict resolution
Primary driver of `WellbeingClass` should be the mood selection (required input). But journaling and gratitude can nudge or validate labels. Implement rule-based checks:

- If moods clearly map to class X (≥60% of chosen moods from class X), set `WellbeingClass` = X.
- Edge cases:
  - If moods are mixed across classes (e.g., 1 InCrisis + 1 Excelling) then:
    - Tie-break by severity bias (prefer worse class) OR
    - Use a small scoring function:
      - Score[class] = sum(weights of chosen moods mapped to class) + `logit_journal_score` + `gratitude_bonus`
    - `logit_journal_score` = +2 if `p_{class}` is top probability and > 0.4; else 0.
    - `gratitude_bonus` = +1 to positive classes if gratitude=1.
- **For safety:** if `p_suicidal` > 0.5, or if `Hopeless` or `Depressed` is selected, force `WellbeingClass = InCrisis` (for synthetic realism and triage logic).

> Document the exact deterministic tie-breaking rule to make the dataset audit-friendly.
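
One deterministic version of these rules is sketched below. It implements the safety override, the ≥60% majority rule, and the severity-bias tie-break (prefer the worse class). The scoring-function alternative is left out because the text does not define which `p_*` column counts as evidence for which class. `resolve_label` and `SEVERITY_ORDER` are hypothetical names.

```python
MOOD_POOLS = {
    "InCrisis": ["Depressed", "Sad", "Exhausted", "Hopeless"],
    "Struggling": ["Anxious", "Angry", "Stressed", "Restless"],
    "Thriving": ["Calm", "Relaxed", "Peaceful", "Content"],
    "Excelling": ["Happy", "Energized", "Excited", "Motivated"],
}
MOOD_CLASS_MAP = {m: cls for cls, moods in MOOD_POOLS.items() for m in moods}
SEVERITY_ORDER = ["InCrisis", "Struggling", "Thriving", "Excelling"]  # worst first

def resolve_label(moods, p_suicidal=0.0):
    """Map 1-3 chosen moods (plus p_suicidal) to a single WellbeingClass."""
    # 1. Safety override.
    if p_suicidal > 0.5 or {"Hopeless", "Depressed"} & set(moods):
        return "InCrisis"
    # 2. Count moods per class.
    counts = {}
    for mood in moods:
        cls = MOOD_CLASS_MAP[mood]
        counts[cls] = counts.get(cls, 0) + 1
    # 3. Majority rule: a class holding at least 60% of the moods wins.
    for cls, c in counts.items():
        if c / len(moods) >= 0.6:
            return cls
    # 4. Tie-break by severity bias: pick the worst class that is present.
    return min(counts, key=SEVERITY_ORDER.index)

label = resolve_label(["Calm", "Relaxed", "Anxious"])
```

With at most three moods per row, at most one class can reach the 60% threshold, so step 3 never has to choose between two winners.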

--- 

## 9. Generation algorithm (pseudocode)
1. Sample WellbeingClass per distribution (step 2).
2. Sample k moods (1–3) per (4); pick moods with class bias and small cross-class probability.
3. Determine journaling flag from step 5.
   - If False → set all p_* = 0.0.
   - If True → sample p_vec from Dirichlet centered on class mu (step 6).
4. Determine gratitude flag conditional on class (step 7).
5. Apply conflict resolution to set final WellbeingClass if you prefer mood-first then journal-second; otherwise keep the sampled class (but document which you used).
6. Output the row.

---

## 10. Validation checks & metrics
These are the checks to include in the approval report to show that the synthetic data is realistic and internally consistent.
- No zero-mood rows: assert sum(mood_cols) >= 1 for all rows.
- Journaling nulls match `p_*` zeros: rows with journaling=False must have all p_* == 0.
- Distribution check: empirical % of each WellbeingClass matches target distribution ± tolerance (e.g., ±2%).
- Gratitude by class: table of P(gratitude=1 | class) vs target.
- Confusion matrix (mood→label): measure how often moods imply the label (should be >90% if mood-driven).
- `p_*` summary by class: report means/medians for each p_* grouped by class; these should show the expected peaks.
- Co-occurrence rates: e.g., P(Depressed & Hopeless | InCrisis) should be high (report actual).
- Edge-case rules applied: count rows where suicidal probability > 0.5 and verify class = InCrisis.
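
Several of these checks can be collected into one report, as in the sketch below. It assumes the column names the generator writes (`p_*`, `gratitude_flag`, `WellbeingClass`, and one column per mood). Because the output has no explicit journaling column, it also assumes a row is either all-zero (no journal) or a normalized vector summing to 1. `validation_report` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

MOOD_COLS = [
    "Depressed", "Sad", "Exhausted", "Hopeless",
    "Anxious", "Angry", "Stressed", "Restless",
    "Calm", "Relaxed", "Peaceful", "Content",
    "Happy", "Energized", "Excited", "Motivated",
]
P_COLS = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]

def validation_report(df):
    """Summarize basic consistency checks for a generated dataset."""
    p_sum = df[P_COLS].sum(axis=1)
    p_ok = np.isclose(p_sum, 0.0) | np.isclose(p_sum, 1.0)
    in_crisis = df["WellbeingClass"] == "InCrisis"
    return {
        # Rows with no mood selected (should be 0).
        "zero_mood_rows": int((df[MOOD_COLS].sum(axis=1) < 1).sum()),
        # Rows whose p_* values sum to neither 0 nor 1 (should be 0).
        "bad_p_rows": int((~p_ok).sum()),
        # Empirical class shares, to compare against the target distribution.
        "class_share": df["WellbeingClass"].value_counts(normalize=True).to_dict(),
        # Empirical P(gratitude=1 | class).
        "gratitude_by_class": df.groupby("WellbeingClass")["gratitude_flag"].mean().to_dict(),
        # Rows with p_suicidal > 0.5 that are not labeled InCrisis (should be 0).
        "suicidal_not_in_crisis": int(((df["p_suicidal"] > 0.5) & ~in_crisis).sum()),
    }
```

A typical use would be `validation_report(pd.read_csv(DATA_PATH))`, with the nonzero counters reviewed before the dataset is approved.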

---

## **Setup & Imports**

In [1]:
import numpy as np
import pandas as pd
import os

---
## **Configurations**

In [2]:
DATA_PATH = "../../data/classification/synthetic_dataset.csv"

In [3]:
MOOD_POOLS = {
    "InCrisis": ["Depressed", "Sad", "Exhausted", "Hopeless"],
    "Struggling": ["Anxious", "Angry", "Stressed", "Restless"],
    "Thriving": ["Calm", "Relaxed", "Peaceful", "Content"],
    "Excelling": ["Happy", "Energized", "Excited", "Motivated"],
}

In [4]:
# Build reverse map automatically
MOOD_CLASS_MAP = {mood: cls for cls, moods in MOOD_POOLS.items() for mood in moods}

In [5]:
ALL_MOODS = sum(MOOD_POOLS.values(), [])

---
## **Data Generation**

In [6]:
def generate_sample_independent_mood(
    n=1000,
    save_path="synthetic_dataset.csv",
    outlier_ratio=0.02,
    gratitude_prob=0.6,   # probability of a gratitude entry
    mood_only_ratio=0.1,  # fraction of rows that are mood-only (no journal text)
    distribution={
        "Excelling": (0.03, 0.05),
        "Thriving": (0.07, 0.10),
        "Struggling": (0.25, 0.30),
        "InCrisis": (0.15, 0.20),
    },
    journal_min_fraction=0.6,  # minimum fraction of rows per class that must have journal probabilities (not mood-only)
):
    rows = []

    # --- Step 1: Generate oversampled rows ---
    # Oversample to ensure enough data after enforcing class distribution
    for _ in range(n * 2):
        # Decide if this row is mood-only
        is_mood_only = np.random.rand() < mood_only_ratio

        # --- Step 2: Generate journal probabilities (only for journal entries) ---
        if is_mood_only:
            # Mood-only row -> journal probabilities set to zero
            p_anx = p_norm = p_dep = p_sui = p_str = 0.0
        else:
            # Generate random probabilities for journal features
            scores = np.random.rand(5)
            probs = scores / scores.sum()  # normalize to sum=1
            p_anx, p_norm, p_dep, p_sui, p_str = probs

        # --- Step 3: Generate moods independently ---
        # Randomly pick 1-3 moods from all moods
        num_moods = np.random.randint(1, 4)
        chosen_moods = np.random.choice(ALL_MOODS, size=num_moods, replace=False)

        # --- Step 4: Assign preliminary label based on journal probabilities ---
        if not is_mood_only:
            if (p_dep > 0.5 and p_sui > 0.3) or (p_anx > 0.4 and p_str > 0.4):
                label = "InCrisis"
            elif p_dep > 0.3 or p_anx > 0.3:
                label = "Struggling"
            elif p_norm > 0.7 and max(p_anx, p_dep, p_sui, p_str) < 0.1:
                label = "Excelling"
            elif p_norm > 0.5 and max(p_anx, p_dep, p_sui, p_str) < 0.2:
                label = "Thriving"
            else:
                label = "Struggling"
        else:
            # Mood-only row -> initial label will be determined by moods
            label = None

        # --- Step 5: Adjust label based on moods ---
        # If moods indicate a worse class than the journal, upgrade accordingly
        mood_labels = [MOOD_CLASS_MAP[m] for m in chosen_moods]
        if "InCrisis" in mood_labels:
            label = "InCrisis"
        elif "Struggling" in mood_labels and (label is None or label != "InCrisis"):
            label = "Struggling"
        elif "Thriving" in mood_labels and (label is None):
            label = "Thriving"
        elif label is None:
            label = "Excelling"

        # --- Step 6: One-hot encode moods ---
        mood_encoding = {m: (1 if m in chosen_moods else 0) for m in ALL_MOODS}

        # --- Step 7: Add gratitude flag ---
        gratitude_flag = np.random.choice([0, 1], p=[1 - gratitude_prob, gratitude_prob])

        # --- Step 8: Combine all features into a row ---
        row = {
            "p_anxiety": p_anx,
            "p_normal": p_norm,
            "p_depression": p_dep,
            "p_suicidal": p_sui,
            "p_stress": p_str,
            "gratitude_flag": gratitude_flag,
            "WellbeingClass": label
        }
        row.update(mood_encoding)
        rows.append(row)

    # --- Step 9: Convert to DataFrame ---
    df = pd.DataFrame(rows)

    # --- Step 10: Enforce target distribution ---
    final_rows = []
    for label, (low, high) in distribution.items():
        target_frac = np.random.uniform(low, high)
        target_n = int(n * target_frac)

        subset = df[df["WellbeingClass"] == label]
        # Ensure a minimum fraction of sampled rows contain journal probabilities (i.e., not mood-only)
        if len(subset) == 0:
            sampled = subset
        else:
            # split subset into journal rows and mood-only rows
            journal_mask = (subset["p_anxiety"] > 0) | (subset["p_normal"] > 0)
            journal_rows = subset[journal_mask]
            mood_only_rows = subset[~journal_mask]

            min_journal_needed = int(np.ceil(target_n * journal_min_fraction))
            # clamp to available
            n_journal_to_take = min(len(journal_rows), min_journal_needed)
            n_remaining = target_n - n_journal_to_take

            chosen_journal = pd.DataFrame()
            chosen_mood_only = pd.DataFrame()

            if n_journal_to_take > 0:
                replace_journal = len(journal_rows) < n_journal_to_take
                chosen_journal = journal_rows.sample(n=n_journal_to_take, replace=replace_journal, random_state=42)

            if n_remaining > 0:
                # fill remaining from mood-only first, then journal if needed
                if len(mood_only_rows) >= n_remaining:
                    chosen_mood_only = mood_only_rows.sample(n=n_remaining, replace=False, random_state=42)
                else:
                    chosen_mood_only = mood_only_rows
                    still_needed = n_remaining - len(chosen_mood_only)
                    if len(journal_rows) > n_journal_to_take:
                        extra_from_journal = journal_rows.drop(chosen_journal.index, errors="ignore")
                        replace_extra = len(extra_from_journal) < still_needed
                        extra = extra_from_journal.sample(n=still_needed, replace=replace_extra, random_state=42)
                        chosen_journal = pd.concat([chosen_journal, extra])

            sampled = pd.concat([chosen_journal, chosen_mood_only])
            # if still short (very small subsets), allow oversampling from subset
            if len(sampled) < target_n:
                needed = target_n - len(sampled)
                extra = subset.sample(n=needed, replace=True, random_state=42)
                sampled = pd.concat([sampled, extra])

        final_rows.append(sampled)

    df = pd.concat(final_rows).sample(frac=1, random_state=42).reset_index(drop=True)

    # --- Step 11: Inject outliers into journal probabilities ---
    journal_rows = df[(df["p_anxiety"] > 0) | (df["p_normal"] > 0)].index
    n_outliers_probs = int(len(journal_rows) * outlier_ratio)
    outlier_indices_probs = np.random.choice(journal_rows, size=n_outliers_probs, replace=False)

    for idx in outlier_indices_probs:
        probs = df.loc[idx, ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]].values
        noisy = probs + np.random.normal(0, 0.5, size=5)
        noisy = np.clip(noisy, 0, None)  # prevent negatives
        if noisy.sum() == 0:
            noisy = np.random.rand(5)  # fallback if all clipped
        noisy = noisy / noisy.sum()
        df.loc[idx, ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]] = noisy

    # --- Step 12: Inject outliers into moods ---
    n_outliers_moods = int(len(df) * outlier_ratio)
    outlier_indices_moods = np.random.choice(df.index, size=n_outliers_moods, replace=False)
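    for idx in outlier_indices_moods:
        mood_to_flip = np.random.choice(ALL_MOODS)
        # Skip a flip that would clear the row's only mood, so every row keeps at least one mood
        if df.at[idx, mood_to_flip] == 1 and df.loc[idx, ALL_MOODS].sum() == 1:
            continue
        df.at[idx, mood_to_flip] = 1 - df.at[idx, mood_to_flip]  # flip 0→1 or 1→0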

    # --- Step 13: Save dataset ---
    out_dir = os.path.dirname(save_path)
    if out_dir:  # dirname is "" for a bare filename, and os.makedirs("") would raise
        os.makedirs(out_dir, exist_ok=True)
    df.to_csv(save_path, index=False)
    print(f"✅ Dataset generated and saved to {save_path}")

    return df

In [None]:
if __name__ == "__main__":
    # Example/demo usage when running this notebook as a script
    df = generate_sample_independent_mood(5000, save_path=DATA_PATH)
    print(df.head())


✅ Dataset generated and saved to ../../data/classification/synthetic_dataset.csv
   p_anxiety  p_normal  p_depression  p_suicidal  p_stress  gratitude_flag  \
0   0.000000  0.000000      0.000000    0.000000  0.000000               1   
1   0.000000  0.000000      0.000000    0.000000  0.000000               0   
2   0.171911  0.012415      0.095732    0.359229  0.360712               1   
3   0.000000  0.000000      0.000000    0.000000  0.000000               0   
4   0.000000  0.000000      0.000000    0.000000  0.000000               1   

  WellbeingClass  Depressed  Sad  Exhausted  ...  Stressed  Restless  Calm  \
0       InCrisis          0    0          1  ...         0         0     0   
1       InCrisis          0    0          1  ...         0         0     0   
2       InCrisis          1    0          0  ...         0         0     0   
3      Excelling          0    0          0  ...         0         0     0   
4       InCrisis          0    0          1  ...         0  