# **Synthetic Data Generation**
---

## **Basis for Synthetic Dataset Rules**

This generator produces a synthetic dataset of student well-being classifications from three ingredients: heuristic labeling rules, class imbalance, and injected outliers. These design choices mirror real-world conditions, where mental health data is often noisy, imbalanced, and shaped by overlapping emotional states.

**1. Heuristic Labeling Rules**
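The rules map probability distributions (from NLP model outputs) into broader well-being categories. They are evaluated in the order listed and the first match wins, which matters because some conditions overlap. In the generator below, randomly drawn probabilities stand in for real model outputs.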
  - InCrisis
    - Condition:
      - `p_depression > 0.5 AND p_suicidal > 0.3`
      - or `p_anxiety > 0.4 AND p_stress > 0.4`
    - Rationale: Crisis states often combine multiple extreme symptoms (e.g., severe depression + suicidal thoughts, or high anxiety + high stress).
  - Struggling
    - Condition:
      - `p_depression > 0.3 OR p_anxiety > 0.3`
    - Rationale: Moderate depression or anxiety indicates difficulty coping, but not yet at crisis level.
  - Surviving
    - Condition:
      - `0.2 < p_anxiety < 0.4 OR 0.2 < p_stress < 0.4`
    - Rationale: Lower but noticeable emotional strain. Individuals are functioning but showing warning signs.
  - Excelling
    - Condition:
      - `p_normal > 0.7 AND all negative emotions < 0.1`
    - Rationale: High positive baseline with minimal distress → thriving beyond stability.
  - Thriving
    - Condition:
      - `p_normal > 0.5 AND all negatives < 0.2`
    - Rationale: Balanced and healthy state, showing resilience but not peak performance.
  - Fallback → Surviving
    - Used when no clear condition is met. Prevents undefined cases.
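
Several conditions overlap: `p_anxiety = 0.35`, for example, satisfies both the Struggling and the Surviving condition. The sketch below shows how first-match evaluation in the listed order resolves such cases. `label_wellbeing` is a hypothetical helper name used only for illustration, and the input values are made up.

```python
def label_wellbeing(p_anx, p_norm, p_dep, p_sui, p_str):
    """Return the first matching well-being label (hypothetical helper)."""
    worst = max(p_anx, p_dep, p_sui, p_str)  # largest negative-emotion score
    if (p_dep > 0.5 and p_sui > 0.3) or (p_anx > 0.4 and p_str > 0.4):
        return "InCrisis"
    if p_dep > 0.3 or p_anx > 0.3:
        return "Struggling"
    if 0.2 < p_anx < 0.4 or 0.2 < p_str < 0.4:
        return "Surviving"
    if p_norm > 0.7 and worst < 0.1:
        return "Excelling"
    if p_norm > 0.5 and worst < 0.2:
        return "Thriving"
    return "Surviving"  # fallback when no rule matches

# p_anxiety = 0.35 matches both Struggling and Surviving;
# Struggling is checked first, so it wins.
print(label_wellbeing(0.35, 0.40, 0.10, 0.05, 0.10))  # Struggling
print(label_wellbeing(0.05, 0.80, 0.05, 0.05, 0.05))  # Excelling
print(label_wellbeing(0.10, 0.60, 0.15, 0.05, 0.10))  # Thriving
```

Because Surviving is checked before Excelling and Thriving, a row with a high `p_normal` but `p_stress` between 0.2 and 0.4 is still labeled Surviving.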

**2. Class Imbalance**
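  - Implementation: Only a portion (`imbalance_ratio`, e.g., 20%) of "InCrisis" rows is meant to be retained. In the generator below this downsampling step is commented out, so all "InCrisis" rows are kept and `imbalance_ratio` has no effect unless the step is re-enabled.
  - Rationale:
    - In real student populations, severe crisis cases are rarer than normal or moderate distress cases.
    - Mimics the naturally skewed distribution found in mental health datasets.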
    - Prepares ML models to handle minority vs majority class trade-offs.
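
As a minimal sketch of the downsampling idea described above, the helper below keeps only a fraction of one class and leaves the others untouched. `downsample_class` and the toy data are assumptions for illustration; they are not part of the notebook.

```python
import pandas as pd

def downsample_class(df, label_col, target, frac, seed=42):
    """Keep only `frac` of the rows whose label equals `target`."""
    is_target = df[label_col] == target
    kept = df[is_target].sample(frac=frac, random_state=seed)
    return pd.concat([df[~is_target], kept]).reset_index(drop=True)

# Toy frame: 10 "InCrisis" rows and 10 "Surviving" rows.
toy = pd.DataFrame({"WellbeingClass": ["InCrisis"] * 10 + ["Surviving"] * 10})
out = downsample_class(toy, "WellbeingClass", "InCrisis", frac=0.2)
print(out["WellbeingClass"].value_counts().to_dict())
```

With `frac=0.2`, two of the ten "InCrisis" rows survive while all ten "Surviving" rows are kept.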

**3. Outliers**
  - Implementation: A fraction (`outlier_ratio`, e.g., 2%) of rows is assigned extreme probability values (e.g., negative or >1). Labels are assigned before the outliers are injected and are not recomputed afterwards.
  - Rationale:
    - Simulates data entry errors, corrupted sensor logs, or NLP misclassifications.
    - Helps ensure models are robust against noise and not overly sensitive to perfect inputs.
    - Motivates data preprocessing (clipping, normalization) in real deployment.
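
The preprocessing mentioned above (clipping, normalization) could look roughly like the sketch below. It clips each probability column to [0, 1] and rescales every row to sum to 1. `clip_and_renormalize`, the uniform fallback for all-zero rows, and the sample values are assumptions made for illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

PROB_COLS = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]

def clip_and_renormalize(df, cols=PROB_COLS):
    """Clip each column to [0, 1], then rescale rows to sum to 1."""
    out = df.copy()
    clipped = out[cols].clip(lower=0.0, upper=1.0)
    row_sums = clipped.sum(axis=1)
    # A row that clips to all zeros cannot be rescaled; mark it as NaN
    # and fall back to a uniform distribution (an assumption of this sketch).
    safe_sums = row_sums.replace(0.0, np.nan)
    out[cols] = clipped.div(safe_sums, axis=0).fillna(1.0 / len(cols))
    return out

noisy = pd.DataFrame(
    [[-1.0, 2.0, 0.1, 0.1, 0.1],        # mixed negative and >1 values
     [-0.5, -0.5, -0.5, -0.5, -0.5]],   # every value negative
    columns=PROB_COLS,
)
clean = clip_and_renormalize(noisy)
print(clean.round(3))
```

Whether corrupted rows should be repaired like this or simply dropped depends on the downstream model; this sketch only shows the repair option.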

---

## **Setup & Imports**

In [5]:
import numpy as np
import pandas as pd
import os

---

## **Configurations**

In [6]:
DATA_PATH = "../../data/classification/synthetic_dataset.csv"

In [7]:
def generate_sample(n=1000, save_path="synthetic_dataset.csv", imbalance_ratio=0.2, outlier_ratio=0.02):
    rows = []
    for _ in range(n):
        # Random raw scores
        scores = np.random.rand(5)
        probs = scores / scores.sum()  # normalize to probs
        p_anx, p_norm, p_dep, p_sui, p_str = probs
        
        # Apply heuristic rules 
        if (p_dep > 0.5 and p_sui > 0.3) or (p_anx > 0.4 and p_str > 0.4):
            label = "InCrisis"
        elif p_dep > 0.3 or p_anx > 0.3:
            label = "Struggling"
        elif 0.2 < p_anx < 0.4 or 0.2 < p_str < 0.4:
            label = "Surviving"
        elif p_norm > 0.7 and max(p_anx, p_dep, p_sui, p_str) < 0.1:
            label = "Excelling"
        elif p_norm > 0.5 and max(p_anx, p_dep, p_sui, p_str) < 0.2:
            label = "Thriving"
        else:
            label = "Surviving"  # fallback

        rows.append({
            "p_anxiety": p_anx,
            "p_normal": p_norm,
            "p_depression": p_dep,
            "p_suicidal": p_sui,
            "p_stress": p_str,
            "WellbeingClass": label
        })
    
    df = pd.DataFrame(rows)

    # --- 1. Class imbalance (currently disabled) ---
    # Split off the "InCrisis" rows so they can be downsampled separately.
    crisis_df = df[df["WellbeingClass"] == "InCrisis"]
    other_df = df[df["WellbeingClass"] != "InCrisis"]

    # Downsampling is commented out, so all "InCrisis" rows are kept and
    # imbalance_ratio is currently unused. Uncomment to keep only that fraction:
    # crisis_df = crisis_df.sample(frac=imbalance_ratio, random_state=42) if not crisis_df.empty else crisis_df

    # Recombine and shuffle the rows
    df = pd.concat([other_df, crisis_df]).sample(frac=1, random_state=42).reset_index(drop=True)

    # --- 2. Inject outliers ---
    n_outliers = int(len(df) * outlier_ratio)
    outlier_indices = np.random.choice(df.index, size=n_outliers, replace=False)

    for col in ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]:
        # Scale each selected value by -5 or 10 (drawn independently per cell):
        # -5 makes it negative; 10 pushes it above 1 only when the value exceeds 0.1.
        # Labels were assigned earlier and are not recomputed.
        df.loc[outlier_indices, col] = df.loc[outlier_indices, col].apply(
            lambda x: x * np.random.choice([-5, 10])
        )

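    # Ensure the target directory exists before saving.
    # os.path.dirname() returns "" for a bare filename (such as the default
    # save_path), and os.makedirs("") raises an error, so skip that case.
    save_dir = os.path.dirname(save_path)
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)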

    # Save only if file doesn't already exist
    if not os.path.exists(save_path):
        df.to_csv(save_path, index=False)
        print(f"✅ Dataset generated and saved to {save_path}")
    else:
        print(f"⚠️ File already exists at {save_path}. Skipping save to avoid overwrite.")

    return df


In [8]:
df = generate_sample(5000, save_path=DATA_PATH)
print(df.head())

✅ Dataset generated and saved to ../../data/classification/synthetic_dataset.csv
   p_anxiety  p_normal  p_depression  p_suicidal  p_stress WellbeingClass
0   0.223137  0.063689      0.039086    0.344513  0.329575      Surviving
1   0.630235  0.097685      0.099072    0.076411  0.096598     Struggling
2   0.115375  0.245810      0.203670    0.372431  0.062714      Surviving
3   0.140653  0.355253      0.123620    0.327314  0.053161      Surviving
4   0.186365  0.179734      0.172448    0.232823  0.228629      Surviving
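
To check whether a generated dataset actually shows the intended class skew and outlier share, a quick summary along these lines may help. `summarize_dataset` is a hypothetical helper and the toy frame is made-up data; in the notebook one would pass the `df` returned by `generate_sample` instead.

```python
import pandas as pd

PROB_COLS = ["p_anxiety", "p_normal", "p_depression", "p_suicidal", "p_stress"]

def summarize_dataset(df, cols=PROB_COLS, label_col="WellbeingClass"):
    """Return class shares and the number of rows with out-of-range values."""
    class_share = df[label_col].value_counts(normalize=True)
    # A row counts as an outlier if any probability is below 0 or above 1.
    out_of_range = ((df[cols] < 0) | (df[cols] > 1)).any(axis=1)
    return class_share, int(out_of_range.sum())

toy = pd.DataFrame({
    "p_anxiety":    [0.2, -1.0, 0.1, 0.3],
    "p_normal":     [0.5,  0.4, 3.0, 0.3],
    "p_depression": [0.1,  0.2, 0.1, 0.2],
    "p_suicidal":   [0.1,  0.2, 0.1, 0.1],
    "p_stress":     [0.1,  0.2, 0.1, 0.1],
    "WellbeingClass": ["Surviving", "Surviving", "Thriving", "Surviving"],
})
shares, n_bad = summarize_dataset(toy)
print(shares)
print("rows with out-of-range values:", n_bad)
```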
