<a href="https://colab.research.google.com/github/laraAkg/Data-Science-Project/blob/main/Generate_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dataset Generation and Analysis for Heavy-Tailed Distributions

## 0. Imports, Environment Setup & Colab Integration

This cell imports the required libraries for data generation and preprocessing, configures the plotting backend, and prepares the execution environment.  
It also detects whether the notebook is running in Google Colab and mounts Google Drive if needed.

- **Standard Library Imports:** Loads core Python utilities for file handling, serialization, timing, garbage collection, and date/time management.
- **`numpy`**: Provides numerical operations used throughout the data generation pipeline.
- **`matplotlib`**: Configured to use a non-GUI backend (`Agg`) to enable plot generation in headless environments.
- **`scipy.stats`**: Imported as `st` when SciPy is installed; otherwise `st` is set to `None` so that the notebook still runs without it.
- **Colab Detection (`IN_COLAB`)**: Checks whether the notebook is executed in a Google Colab environment.
- **Google Drive Integration:** Automatically mounts Google Drive when running in Colab to enable persistent file access and storage.
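
The optional SciPy import and the Colab check both come down to asking whether a module can be imported. A minimal sketch of that check with the standard library only is shown here; `has_module`, `HAS_SCIPY`, and `IN_COLAB_GUESS` are hypothetical names used for illustration and are not part of the notebook.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` looks importable in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised for dotted names such as "google.colab" when the parent
        # package ("google") is itself missing or fails to import.
        return False


# Hypothetical equivalents of the notebook's flags.
HAS_SCIPY = has_module("scipy.stats")
IN_COLAB_GUESS = has_module("google.colab")
print("scipy available:", HAS_SCIPY, "| Colab detected:", IN_COLAB_GUESS)
```

Unlike the `try`/`import` pattern in the cell, this variant does not import the target module itself, only its parent package.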

In [None]:
from pathlib import Path
import os
import shutil
import json
import csv
import time
import math
import gc
from dataclasses import dataclass
from datetime import datetime, timezone

import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

try:
    import scipy.stats as st
except ImportError:
    st = None

try:
    import google.colab
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


## 1. Project Directory Structure & Global Constants

This cell defines the directory structure for the project and initializes paths for storing datasets, plots, metadata, models, reports, and real-world data.

- **`DEFAULT_PROJECT_DIR`**: Specifies the default project directory name within Google Drive.
- **`BASE_DIR`**: Determines the base directory for all project outputs, using Google Drive when running in Colab or a local `./project_outputs` directory otherwise.
- **`DATA_DIR`, `PLOTS_DIR`, `META_DIR`, `MODELS_DIR`, `REPORTS_DIR`, `REAL_DIR`**: Define dedicated subdirectories under `BASE_DIR` for different types of project artifacts.
- **Directory Creation Logic**: Iterates over all defined directories and creates them if they do not already exist using `mkdir(parents=True, exist_ok=True)`.
- **`IMG_SIZE`, `DPI`, `FIGSIZE`**: Define global constants for image resolution, dots-per-inch for saved figures, and default figure size for Matplotlib plots.
- **Runtime Feedback**: Prints the resolved `BASE_DIR` to confirm the active project root directory.
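
The directory creation logic can be tried in isolation. The sketch below assumes a throwaway temporary directory in place of `BASE_DIR` and runs the loop twice to show that `exist_ok=True` makes the step safe to repeat.

```python
import tempfile
from pathlib import Path

# A throwaway root stands in for BASE_DIR so the sketch has no side effects.
with tempfile.TemporaryDirectory() as tmp:
    base_dir = Path(tmp) / "project_outputs"
    subdirs = ["datasets", "plots", "metadata", "models_tf", "reports", "real"]

    # Running the loop twice shows that exist_ok=True makes creation idempotent.
    for _ in range(2):
        for name in subdirs:
            (base_dir / name).mkdir(parents=True, exist_ok=True)

    created = sorted(p.name for p in base_dir.iterdir())
    print(created)
```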

In [None]:
DEFAULT_PROJECT_DIR = "MyDrive/Generated Data for Data science project"
BASE_DIR = Path("/content/drive") / DEFAULT_PROJECT_DIR if IN_COLAB else Path("./project_outputs")

DATA_DIR   = BASE_DIR / "datasets"
PLOTS_DIR  = BASE_DIR / "plots"
META_DIR   = BASE_DIR / "metadata"
MODELS_DIR = BASE_DIR / "models_tf"
REPORTS_DIR= BASE_DIR / "reports"
REAL_DIR   = BASE_DIR / "real"

for p in [DATA_DIR, PLOTS_DIR, META_DIR, MODELS_DIR, REPORTS_DIR, REAL_DIR]:
    p.mkdir(parents=True, exist_ok=True)

IMG_SIZE = (128, 128)  # (H, W)
DPI      = 150
FIGSIZE  = (4.0, 4.0)

print("BASE_DIR:", BASE_DIR)

BASE_DIR: /content/drive/MyDrive/Generated Data for Data science project


## 2. Utility Functions & Matplotlib Configuration

This cell defines utility functions for file handling and configures Matplotlib for consistent plot creation and saving in a headless environment.

- **Matplotlib Configuration**: Sets global Matplotlib parameters (`figure.dpi`, `savefig.dpi`, `figure.figsize`) and enforces the non-GUI `Agg` backend for reliable plot generation.
- **`ensure_parent`**: Ensures that the parent directory of a given file path exists before writing files to disk.
- **`_patched_savefig`**: Wraps Matplotlib’s `savefig` function so that relative paths are redirected to `PLOTS_DIR`, missing directories are created, and a timestamped file name under `PLOTS_DIR` is used when no path is given.
- **`plt.savefig` Override**: Replaces the default `savefig` with the patched version to enforce consistent plot saving behavior across the notebook.
- **`save_dataset`**: Saves a NumPy array as a `.npy` file after casting it to `float32`, which keeps the stored dtype consistent and halves the size relative to `float64`.
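
The redirect rule applied by the patched `savefig` can be expressed as a small pure function. This is a sketch of the rule only, without Matplotlib; `resolve_plot_path` is a hypothetical helper and the directories are stand-ins for `PLOTS_DIR`.

```python
from pathlib import Path


def resolve_plot_path(fname, plots_dir: Path) -> Path:
    """Relative names go under plots_dir; absolute paths are kept unchanged."""
    fname = Path(fname)
    return fname if fname.is_absolute() else plots_dir / fname


plots_dir = Path.cwd() / "plots"                    # stand-in for PLOTS_DIR
elsewhere = Path.cwd() / "elsewhere" / "zipf.png"   # an absolute target

relative = resolve_plot_path("run1/zipf.png", plots_dir)
absolute = resolve_plot_path(elsewhere, plots_dir)
print(relative)
print(absolute)
```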

In [None]:
matplotlib.use("Agg")

plt.rcParams["figure.dpi"] = DPI
plt.rcParams["savefig.dpi"] = DPI
plt.rcParams["figure.figsize"] = FIGSIZE

def ensure_parent(path: Path):
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

_original_savefig = plt.savefig
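def _patched_savefig(*args, **kwargs):
    if args and isinstance(args[0], (str, Path)):
        # Path-like target: resolve relative paths against PLOTS_DIR.
        fname = Path(args[0])
        if not fname.is_absolute():
            fname = PLOTS_DIR / fname
        ensure_parent(fname)
        args = (str(fname),) + tuple(args[1:])
    elif not args and "fname" not in kwargs:
        # No target given: fall back to a timestamped file in PLOTS_DIR.
        auto = PLOTS_DIR / f"plot_{int(time.time()*1000)}.png"
        ensure_parent(auto)
        args = (str(auto),)
    # Any other target (file-like object, fname= keyword) is passed through unchanged.
    kwargs.setdefault("dpi", DPI)
    return _original_savefig(*args, **kwargs)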
plt.savefig = _patched_savefig

def save_dataset(arr: np.ndarray, path: Path):
    path = ensure_parent(path)
    np.save(path, arr.astype(np.float32))

## 3. Distributions, RNG Utilities & Dataset Labeling

This cell defines probability distributions, utilities for reproducible random sampling, and a helper function to determine whether a generated dataset should be labeled as “heavy-tailed” based on distribution parameters.

- **`SEED`**: Run seed, drawn from fresh entropy on every execution and recorded with the outputs; assign a fixed integer instead to make a run reproducible.
- **`RNG`**: Global NumPy random generator initialized with `SEED`.
- **`rng_for`**: Creates deterministic, independent RNG streams derived from (`seed`, `labels...`) to ensure reproducible sub-tasks (e.g., params vs. sampling, augmentations).
- **`DistSpec`**: Dataclass holding the distribution `name`, a parameter generator (`param_fn`), and a sampling function (`sample_fn`).
- **`p_*`**: Parameter generator functions that sample valid distribution parameters using an explicit RNG.
- **`sample_*`**: Sampling functions that generate data from specific distributions using an explicit RNG (no hidden global randomness).
- **`DISTRIBUTIONS`**: Registry of supported distributions as a list of `DistSpec` objects.
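- **`make_dataset_id`**: Generates a unique identifier for each dataset instance (distribution name + index + UTC timestamp). Because of the timestamp, IDs and file names differ between runs even when `SEED` is fixed.
- **`is_heavy_tailed`**: Assigns the heavy-tailed label with a heuristic based on distribution type and parameter thresholds: Pareto and the normal-Pareto mixture are always labeled heavy-tailed, normal and exponential never, Student-t when `df <= 5`, and lognormal when `sigma >= 1.0`.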
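
The behavior of `rng_for` can be checked in isolation. The sketch below repeats the function from the cell that follows and uses an arbitrary fixed seed, chosen only for this illustration: identical `(seed, labels)` combinations yield identical draws, while changing a single label yields a different stream.

```python
import hashlib
import json

import numpy as np


def rng_for(seed, *labels) -> np.random.Generator:
    """Derive a generator from a hash of (seed, labels)."""
    h = hashlib.blake2b(digest_size=16)
    h.update(f"seed={seed}".encode())
    h.update(json.dumps(labels, separators=(",", ":"), default=str).encode())
    return np.random.default_rng(int.from_bytes(h.digest(), "big"))


seed = 12345  # arbitrary fixed seed, for illustration only
a = rng_for(seed, "pareto", 0, "data").uniform(size=3)
b = rng_for(seed, "pareto", 0, "data").uniform(size=3)
c = rng_for(seed, "pareto", 0, "params").uniform(size=3)

print("same labels match:     ", np.array_equal(a, b))
print("different labels match:", np.array_equal(a, c))
```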

In [None]:
# === Distributions, seeding, RNG utilities, and labeling helpers (fresh randomness each run) ===
from dataclasses import dataclass
import json, hashlib
from datetime import datetime, timezone
import numpy as np
from pathlib import Path

# --- 1) Fresh randomness each run (record the seed) ---
# If you want reproducibility later, you can set SEED to a fixed int instead.
SEED = int(np.random.SeedSequence().entropy)   # fresh each run
RNG  = np.random.default_rng(SEED)

# --- 2) Deterministic per-task RNGs derived from (seed, labels) ---
def rng_for(seed, *labels) -> np.random.Generator:
    """
    Create an independent RNG stream for any (seed, labels...) combination.
    Labels can include: distribution name, replicate index, 'params'/'data', augmentation name, etc.
    """
    h = hashlib.blake2b(digest_size=16)
    h.update(f"seed={seed}".encode())
    h.update(json.dumps(labels, separators=(',', ':'), default=str).encode())
    entropy = int.from_bytes(h.digest(), "big")
    return np.random.default_rng(entropy)

# --- 3) Spec for a distribution: parameter generator + sampler (both receive an RNG) ---
@dataclass
class DistSpec:
    name: str
    param_fn: callable   # signature: param_fn(i: int, rng: np.random.Generator) -> dict
    sample_fn: callable  # signature: sample_fn(n: int, rng: np.random.Generator, **params) -> np.ndarray

# --- 4) Parameter generators (each draws from an explicit RNG; ranges noted per function) ---
def p_normal(i, rng):       # μ in [-1, 1], σ in [0.5, 2.0]
    mu    = rng.uniform(-1.0, 1.0)
    sigma = rng.uniform(0.5, 2.0)
    return {"mu": float(mu), "sigma": float(sigma)}

def p_exponential(i, rng):  # λ in [0.5, 2.0]
    lam = rng.uniform(0.5, 2.0)
    return {"lam": float(lam)}

def p_pareto(i, rng):       # α in [1.15, 1.9], xm in [0.7, 2.5]
    alpha = rng.uniform(1.15, 1.9)
    xm    = rng.uniform(0.7, 2.5)
    return {"alpha": float(alpha), "xm": float(xm)}

def p_student_t(i, rng):    # df in [2, 12], scale in [0.5, 2.0], loc in [-1,1]
    df    = rng.integers(2, 13)
    loc   = rng.uniform(-1.0, 1.0)
    scale = rng.uniform(0.5, 2.0)
    return {"df": int(df), "loc": float(loc), "scale": float(scale)}

def p_lognormal(i, rng):    # μ in [-0.5, 1.0], σ in [0.4, 1.2]
    mu    = rng.uniform(-0.5, 1.0)
    sigma = rng.uniform(0.4, 1.2)
    return {"mu": float(mu), "sigma": float(sigma)}

def p_mix_norm_pareto(i, rng):  # mixture π in [0.02, 0.25]
    pi    = rng.uniform(0.02, 0.25)  # heavy-tail weight
    mu    = rng.uniform(-0.5, 0.5)
    sigma = rng.uniform(0.6, 1.5)
    alpha = rng.uniform(1.2, 1.8)
    xm    = rng.uniform(0.8, 1.6)
    return {"pi": float(pi), "mu": float(mu), "sigma": float(sigma), "alpha": float(alpha), "xm": float(xm)}

# --- 5) Samplers (explicit RNG; no global sampling) ---
def sample_normal(n, rng, mu, sigma):
    return rng.normal(mu, sigma, size=n)

def sample_exponential(n, rng, lam):
    return rng.exponential(1.0/lam, size=n)

def sample_pareto(n, rng, alpha, xm=1.0):
    return (rng.pareto(alpha, size=n) + 1.0) * xm

def sample_student_t(n, rng, df, loc=0.0, scale=1.0):
    return loc + scale * rng.standard_t(df=df, size=n)

def sample_lognormal(n, rng, mu, sigma):
    return rng.lognormal(mean=mu, sigma=sigma, size=n)

def sample_mix_norm_pareto(n, rng, pi, mu, sigma, alpha, xm=1.0):
    k = rng.binomial(1, pi, size=n).astype(bool)
    x = np.empty(n, dtype=np.float64)
    x[k]  = (rng.pareto(alpha, size=k.sum()) + 1.0) * xm
    x[~k] = rng.normal(mu, sigma, size=(~k).sum())
    return x

# --- 6) Distribution registry (unchanged names) ---
DISTRIBUTIONS = [
    DistSpec("normal",          p_normal,          sample_normal),
    DistSpec("exponential",     p_exponential,     sample_exponential),
    DistSpec("pareto",          p_pareto,          sample_pareto),
    DistSpec("student_t",       p_student_t,       sample_student_t),
    DistSpec("lognormal",       p_lognormal,       sample_lognormal),
    DistSpec("mix_norm_pareto", p_mix_norm_pareto, sample_mix_norm_pareto),
]

# --- 7) Utility: dataset id, tail label heuristic (unchanged logic) ---
def make_dataset_id(dist_name: str, idx: int) -> str:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    return f"{dist_name}_{idx:03d}_{timestamp}"

def is_heavy_tailed(dist_name: str, params: dict) -> bool:
    name = dist_name.lower()
    if name in ["pareto", "cauchy", "mix_norm_pareto"]:
        return True
    if name in ["normal", "gaussian", "exponential"]:
        return False
    if name in ["student_t", "studentt", "t"]:
        return params.get("df", 10) <= 5
    if name in ["lognormal", "lognorm"]:
        return params.get("sigma", 1.0) >= 1.0
    return False
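
NumPy's `Generator.pareto` draws from the Lomax (Pareto II) distribution, which is why `sample_pareto` adds 1 and multiplies by `xm` to obtain a classical Pareto with minimum `xm`. A small sanity check under that assumption, with illustrative parameter values: the sample minimum should not fall below `xm`, and the empirical survival probability at `2 * xm` should be close to `2 ** -alpha`.

```python
import numpy as np

alpha, xm, n = 1.5, 2.0, 200_000  # illustrative values, not taken from a real run
rng = np.random.default_rng(0)

# Same transform as sample_pareto: Lomax draw + 1, scaled by xm.
x = (rng.pareto(alpha, size=n) + 1.0) * xm

empirical = float(np.mean(x > 2.0 * xm))
theoretical = 2.0 ** (-alpha)  # P(X > t) = (xm / t) ** alpha for t >= xm
print(f"min={x.min():.4f}  empirical={empirical:.4f}  theoretical={theoretical:.4f}")
```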


### 3.1 Helper: Safe File Move Utility

This cell defines a small helper function for safely moving files across directories or file systems.

- **`safe_move`**: Moves a file from `src` to `dst`, automatically creating the destination parent directory if it does not exist.
- **Cross-Device Safe**: Uses `shutil.move`, which transparently handles cross-device moves by falling back to copy-and-delete when required.
- **Path Safety**: Creates the parent directory of the destination before the move; no other validation is performed (for example, it does not check whether `dst` already exists).
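
A short usage sketch, assuming a temporary directory in place of the real staging and Drive locations. It repeats the helper so that it runs on its own and moves a placeholder file into a destination whose parent directories do not exist yet.

```python
import shutil
import tempfile
from pathlib import Path


def safe_move(src: Path, dst: Path):
    """Same logic as the helper defined in this section."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "staging" / "sample.npy"
    dst = Path(tmp) / "final" / "pareto" / "sample.npy"  # parents do not exist yet
    src.parent.mkdir(parents=True)
    src.write_bytes(b"placeholder")

    safe_move(src, dst)
    moved_ok = dst.exists() and not src.exists()
    print("moved:", moved_ok)
```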

In [None]:
import shutil
from pathlib import Path

def safe_move(src: Path, dst: Path):
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))  # copy+delete when needed

### 3.2 Dataset Construction with Memory-Mapped Arrays

This cell defines the main dataset construction routine that generates numerical samples for all configured distributions and stores them as valid `.npy` files.

- **`build_corpus`**: Main dataset builder that iterates over all distributions and generates multiple datasets per distribution.
- **`open_memmap`**: Writes data incrementally into valid `.npy` files (including headers), enabling efficient handling of large arrays without loading everything into memory.
- **Per-Task RNG Streams**: Uses `rng_for` to create independent random generators for parameter sampling (`"params"`) and data generation (`"data"`).
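- **Local Staging (`LOCAL_TMP`)**: Temporarily writes datasets to a local directory before moving them to the final destination. `LOCAL_TMP` is defined in Section 3.5, so that cell must be executed before `build_corpus` is called.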
- **Chunked Writing (`chunk_size`)**: Generates and writes samples in chunks to control memory usage.
- **`safe_move`**: Moves generated `.npy` files to `DATA_DIR` in a cross-device-safe manner (e.g. local disk → Google Drive).
- **`render_plots`**: Generates diagnostic plots from the stored `.npy` files using standard `np.load` (no pickling required). The function is defined in Section 3.5.
- **Metadata Records**: Collects dataset metadata including distribution name, parameters, label (`heavy_tailed`), file paths, and generated plots.
- **Return Value**: Returns a list of metadata records, one per generated dataset, to be used for indexing and downstream processing.
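
The chunked write through `open_memmap` can be seen on a tiny array. The sketch below uses illustrative sizes (10 samples, chunks of 4) and a temporary file; it is a reduced version of the write loop, not the full builder.

```python
import tempfile
from pathlib import Path

import numpy as np
from numpy.lib.format import open_memmap

n_samples, chunk_size = 10, 4  # tiny illustrative sizes
rng = np.random.default_rng(0)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.npy"

    # open_memmap writes the .npy header up front; data is then filled in place.
    mm = open_memmap(str(path), mode="w+", dtype=np.float32, shape=(n_samples,))
    chunks = 0
    for pos in range(0, n_samples, chunk_size):
        m = min(chunk_size, n_samples - pos)
        mm[pos:pos + m] = rng.normal(size=m).astype(np.float32)
        chunks += 1
    mm.flush()
    del mm  # release the file handle before reading the file back

    loaded = np.load(path, allow_pickle=False)  # plain np.load reads the result

print(chunks, loaded.shape, loaded.dtype)
```

Only one chunk of samples is held in memory at a time, so peak memory depends on `chunk_size` rather than on the total array length.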

In [None]:
# === build_corpus (open_memmap -> writes valid .npy-files; cross-device-safe move) ===
from numpy.lib.format import open_memmap  # important: writes .npy-Header

def build_corpus(n_per_dist=5, n_samples=2000, chunk_size=200_000, dtype=np.float32, seed: int = SEED):
    """
    Builds datasets for all distributions defined in DISTRIBUTIONS.
    - Local staging with open_memmap (creates real .npy files, header-integrated)
    - Cross-device-safe moving via safe_move
    - Per-task RNGs (params vs data)
    """
    records = []

    for dist in DISTRIBUTIONS:
        print("Generating:", dist.name)
        for i in range(n_per_dist):
            # independent RNG-Streams
            rng_params = rng_for(seed, dist.name, i, "params")
            rng_data   = rng_for(seed, dist.name, i, "data")

            ds_id  = make_dataset_id(dist.name, i)
            params = dist.param_fn(i, rng_params)
            label  = is_heavy_tailed(dist.name, params)

            # --- Write locally (.npy)
            local_out_dir = LOCAL_TMP / dist.name
            local_out_dir.mkdir(parents=True, exist_ok=True)
            local_data_path = local_out_dir / f"{ds_id}.npy"

            remaining = int(n_samples)
            pos = 0
            # open_memmap creates an .npy file with a header (compatible with np.load)
            mm = open_memmap(filename=str(local_data_path), mode='w+', dtype=dtype, shape=(n_samples,))
            try:
                while remaining > 0:
                    m = min(remaining, chunk_size)
                    x = dist.sample_fn(m, rng_data, **params)
                    if x.dtype != dtype:
                        x = x.astype(dtype, copy=False)
                    mm[pos:pos+m] = x
                    pos += m
                    remaining -= m
                mm.flush()
            finally:
                del mm  # Release handle

            # --- cross-device-safe move to Drive
            final_data_path = DATA_DIR / dist.name / f"{ds_id}.npy"
            safe_move(local_data_path, final_data_path)

            # --- Render plots (np.load(..., allow_pickle=False) now works)
            plots = render_plots(ds_id, final_data_path)

            records.append({
                "dataset_id": ds_id,
                "distribution": dist.name,
                "n": int(n_samples),
                "heavy_tailed": bool(label),
                "data_path": str(final_data_path),
                "params": params,
                "plots": plots,
            })

    return records

### 3.3 Plotting Helpers for Distribution Diagnostics

This cell defines helper functions for preprocessing numerical data and generating diagnostic plots used to analyze distributional properties and tail behavior.

- **`_sanitize`**: Removes NaN and infinite values from an input array and flattens it to a one-dimensional representation.
- **`_downsample`**: Randomly reduces the number of data points to a specified maximum using an explicit RNG, improving performance for large datasets.
- **`_clamp_by_percentile`**: Clips values to the range between the `100 - pct` and `pct` percentiles (0.5% and 99.5% by default) to limit the influence of extreme outliers, and returns the clipped array together with both bounds.
- **`plot_zipf`**: Generates a Zipf plot on a log–log scale, visualizing the relationship between data ranks and sorted absolute values.
- **`plot_me`**: Creates a Mean Excess (ME) plot, showing the expected excess above varying thresholds to assess heavy-tailed behavior.
- **`plot_qq_exp`**: Generates an exponential Q–Q plot, comparing empirical quantiles of the data against theoretical quantiles of an Exp(1) distribution.
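
The Mean Excess plot rests on the function e(u) = E[X - u | X > u]. For an exponential distribution with rate λ it is constant at 1/λ, while for a Pareto distribution with α > 1 it grows linearly, e(u) = u / (α - 1) for u ≥ xm. The sketch below compares both numerically; `mean_excess` is a hypothetical helper, and α = 3 lies outside the range used by `p_pareto` and was chosen only because its finite variance gives stable estimates.

```python
import numpy as np


def mean_excess(x: np.ndarray, u: float) -> float:
    """Empirical mean excess: average of (x - u) over the points with x > u."""
    exceed = x[x > u]
    return float(np.mean(exceed - u)) if exceed.size else float("nan")


rng = np.random.default_rng(0)
n = 200_000
x_exp = rng.exponential(1.0, size=n)           # light tail: e(u) = 1 for every u
x_par = (rng.pareto(3.0, size=n) + 1.0) * 1.0  # Pareto(alpha=3, xm=1): e(u) = u / 2

for u in (1.0, 2.0):
    print(f"u={u}: exponential e(u)={mean_excess(x_exp, u):.3f}, "
          f"pareto e(u)={mean_excess(x_par, u):.3f}")
```

An upward-sloping curve in the ME plot is therefore a typical sign of a heavy tail, whereas a roughly flat curve points to exponential-like decay.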

In [None]:
import numpy as np
import matplotlib.pyplot as plt

def _sanitize(x):
    x = np.asarray(x).ravel()
    return x[np.isfinite(x)]

def _downsample(x, max_points=50_000, rng=None):
    n = x.shape[0]
    if n <= max_points:
        return x
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(n, size=max_points, replace=False)
    return x[idx]

def _clamp_by_percentile(x, pct=99.5):
    lo = np.nanpercentile(x, 100 - pct)
    hi = np.nanpercentile(x, pct)
    if not np.isfinite(lo): lo = np.nanmin(x)
    if not np.isfinite(hi): hi = np.nanmax(x)
    x = np.clip(x, lo, hi)
    return x, (lo, hi)

# --- Plot functions; each downsamples via _downsample(x, rng=...) before plotting ---
def plot_zipf(x, save_path, rng=None):
    x = _sanitize(np.abs(x))
    x = _downsample(x, rng=rng)
    x = np.sort(x)[::-1]
    ranks = np.arange(1, x.size + 1)
    fig, ax = plt.subplots()
    ax.loglog(ranks, x, marker=".", linewidth=0)
    ax.set_xlabel("rank"); ax.set_ylabel("|x| (sorted)")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(save_path)
    plt.close(fig)

def plot_me(x, save_path, n_bins=200, rng=None):
    x = _sanitize(np.abs(x))
    x = _downsample(x, rng=rng)
    x, (lo, hi) = _clamp_by_percentile(x, 99.5)
    xs = np.sort(x)
    us = np.linspace(lo, hi, n_bins)
    e_vals = []
    for u in us:
        exceed = xs[xs > u]
        e_vals.append(np.mean(exceed - u) if exceed.size else np.nan)
    fig, ax = plt.subplots()
    ax.plot(us, e_vals, marker=".", linewidth=1)
    ax.set_xlabel("u"); ax.set_ylabel("e(u)")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(save_path)
    plt.close(fig)

def plot_qq_exp(x, save_path, rng=None):
    """
    Exponential QQ-plot: compare sample quantiles of |x| to theoretical exponential(1) quantiles.
    """
    x = _sanitize(np.abs(x))
    x = _downsample(x, rng=rng)
    x = np.sort(x)
    n = x.size
    if n == 0:
        # create an empty plot if needed
        fig, ax = plt.subplots()
        ax.set_title("Empty sample")
        plt.tight_layout()
        plt.savefig(save_path)
        plt.close(fig)
        return
    # theoretical quantiles for Exp(1): F^{-1}(p) = -ln(1-p)
    p = (np.arange(1, n + 1) - 0.5) / n
    q_theory = -np.log1p(-p)
    fig, ax = plt.subplots()
    ax.plot(q_theory, x, marker=".", linewidth=0)
    ax.set_xlabel("Exp(1) quantiles"); ax.set_ylabel("Sample |x| quantiles")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(save_path)
    plt.close(fig)


### 3.4 Data Augmentation Functions (Explicit RNG)

This cell defines simple augmentation functions that are applied to the generated datasets to increase robustness and variability during model training.  
All functions support an explicit random number generator to ensure reproducibility and consistent interfaces.

- **`jitter`**: Adds small Gaussian noise to the data, scaled proportionally to the sample’s standard deviation.
- **`bootstrap_resample`**: Creates a new sample by drawing values with replacement from the original data.
- **`slight_scale`**: Applies a deterministic scaling factor to the data; the RNG argument is included for a uniform function signature.
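
Sampling with replacement means that a bootstrap resample repeats some points and omits others; for large samples, about 1 - 1/e ≈ 63.2% of the original points appear at least once. The sketch below checks this on an array of distinct values, using the same index draw as `bootstrap_resample`; the sample size is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(10_000, dtype=np.float32)  # distinct values, so duplicates are easy to count

# Same index draw as bootstrap_resample: n indices sampled with replacement.
idx = rng.integers(0, len(x), size=len(x))
resampled = x[idx]

unique_fraction = np.unique(resampled).size / len(x)
print(f"unique fraction: {unique_fraction:.3f} (theory: {1 - np.exp(-1):.3f})")
```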

In [None]:
# === Augmentations (explicit RNG) ===
import numpy as np

def jitter(x, scale=0.02, rng=None):
    """
    Add small Gaussian noise proportional to the sample std.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x).ravel().astype(np.float32)
    s = float(np.std(x)) if np.std(x) > 0 else 1.0
    return x + rng.normal(0.0, scale * s, size=len(x)).astype(np.float32)

def bootstrap_resample(x, rng=None):
    """
    Sample with replacement from x.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x).ravel()
    idx = rng.integers(0, len(x), size=len(x))
    return x[idx].astype(np.float32, copy=False)

def slight_scale(x, factor=1.05, rng=None):
    """
    Deterministic scaling; rng included for a uniform signature.
    """
    x = np.asarray(x).ravel()
    return (x * factor).astype(np.float32, copy=False)


### 3.5 Plot Rendering & Metadata Export

This cell defines functions for rendering diagnostic plots for each dataset (including augmentations) and for exporting the collected metadata in structured formats.

- **`LOCAL_TMP`**: Local staging directory used for temporary files to avoid Google Drive latency during generation.
- **`render_plots`**: Creates and saves Zipf, Mean Excess (ME), and exponential Q–Q plots for a dataset and its augmentations, returning a structured dictionary of plot file paths.
- **`rng_for` Usage**: Derives deterministic RNG streams for downsampling, augmentation sampling, and augmentation plotting to keep results controlled and reproducible.
- **`aug_streams` / `aug_data`**: Builds separate augmentation variants (`jitter`, `bootstrap`, `scale105`) using independent RNG streams and then generates plots for each variant.
- **`write_metadata`**: Writes the collected dataset metadata to both JSON (nested, full record) and CSV (flat view) formats.
- **`SEED`**: Stores the run seed inside the metadata outputs to keep runs traceable even when using fresh randomness per execution.
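
The difference between the nested JSON record and the flat CSV view can be illustrated with a generic flattening step. This is a sketch under assumptions: the record is a shortened, hypothetical example, `flatten` is not part of the notebook, and the resulting column names differ from the fixed column list that `write_metadata` uses.

```python
import csv
import io
import json

# Hypothetical record with the same nesting as a build_corpus record (paths shortened).
record = {
    "dataset_id": "pareto_000_demo",
    "distribution": "pareto",
    "heavy_tailed": True,
    "params": {"alpha": 1.5, "xm": 1.0},
    "plots": {
        "original": {"zipf": "zipf.png", "me": "me.png"},
        "aug": {"jitter": {"zipf": "zipf_jitter.png", "me": "me_jitter.png"}},
    },
}


def flatten(d: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into one level, joining keys with underscores."""
    out = {}
    for key, value in d.items():
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, name))
        else:
            out[name] = value
    return out


row = {
    "dataset_id": record["dataset_id"],
    "distribution": record["distribution"],
    "heavy_tailed": int(record["heavy_tailed"]),
    "params_json": json.dumps(record["params"]),  # parameters stay a JSON string
    **flatten(record["plots"], "plot"),
}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buf.getvalue())
```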

In [None]:
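# === Local staging dir (local disk, avoids Drive latency) ===
from pathlib import Path
import json, csv, tempfile
# /content exists only on Colab; elsewhere fall back to the system temp directory.
LOCAL_TMP = Path("/content/ds_tmp") if IN_COLAB else Path(tempfile.gettempdir()) / "ds_tmp"
LOCAL_TMP.mkdir(parents=True, exist_ok=True)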

def render_plots(ds_id: str, npy_path: Path):
    """
    Render and save plots for a dataset.
    Uses derived RNG streams for downsampling and augmentations to keep runs controlled.
    """
    ds_dir = PLOTS_DIR / ds_id
    ds_dir.mkdir(parents=True, exist_ok=True)

    x = np.load(npy_path, mmap_mode="r", allow_pickle=False)

    paths = {}

    # ORIGINAL
    zipf_p = ds_dir / "zipf.png"
    me_p   = ds_dir / "me.png"
    qq_p   = ds_dir / "qq_exp.png"

    # derive RNG for original downsampling (plotting)
    rng_down = rng_for(SEED, "render_plots", ds_id, "downsample")
    plot_zipf(x, zipf_p, rng=rng_down)
    plot_me(x,   me_p,   rng=rng_down)
    plot_qq_exp(x, qq_p, rng=rng_down)

    paths["original"] = {"zipf": str(zipf_p), "me": str(me_p), "qq_exp": str(qq_p)}

    # AUGMENTATIONS: take a manageable subsample first (explicit RNG, without replacement,
    # so that duplicated points come only from the "bootstrap" augmentation)
    sample_n = min(50_000, x.shape[0])
    rng_pick = rng_for(SEED, "render_plots", ds_id, "pick_for_aug")
    idx = np.sort(rng_pick.choice(x.shape[0], size=sample_n, replace=False))
    x_small = np.asarray(x[idx])

    paths["aug"] = {}
    # independent RNG streams per augmentation
    aug_streams = {
        "jitter":    rng_for(SEED, "render_plots", ds_id, "aug", "jitter"),
        "bootstrap": rng_for(SEED, "render_plots", ds_id, "aug", "bootstrap"),
        "scale105":  rng_for(SEED, "render_plots", ds_id, "aug", "scale105"),
    }
    aug_data = {
        "jitter":    jitter(x_small, rng=aug_streams["jitter"]),
        "bootstrap": bootstrap_resample(x_small, rng=aug_streams["bootstrap"]),
        "scale105":  slight_scale(x_small, 1.05, rng=aug_streams["scale105"]),
    }

    # plotting RNGs per augmentation
    for name, arr in aug_data.items():
        z_a = ds_dir / f"zipf_{name}.png"
        m_a = ds_dir / f"me_{name}.png"
        q_a = ds_dir / f"qq_exp_{name}.png"

        rng_plot = rng_for(SEED, "render_plots", ds_id, "plot_aug", name)
        plot_zipf(arr, z_a, rng=rng_plot)
        plot_me(arr,   m_a, rng=rng_plot)
        plot_qq_exp(arr, q_a, rng=rng_plot)

        paths.setdefault("aug", {})[name] = {"zipf": str(z_a), "me": str(m_a), "qq_exp": str(q_a)}

    return paths

def write_metadata(records):
    """
    Write metadata to JSON and CSV; include the run SEED so results are traceable even with fresh randomness.
    """
    META_DIR.mkdir(parents=True, exist_ok=True)
    meta_json = META_DIR / "datasets_metadata.json"
    meta_csv  = META_DIR / "datasets_metadata.csv"

    # JSON
    payload = {
        "seed": SEED,
        "count": len(records),
        "records": records,
    }
    with open(meta_json, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

    # CSV (flat view)
    cols = ["dataset_id","distribution","n","heavy_tailed","data_path","params_json",
            "plot_zipf","plot_me","plot_qq_exp",
            "plot_zipf_jitter","plot_me_jitter","plot_qq_jitter",
            "plot_zipf_bootstrap","plot_me_bootstrap","plot_qq_bootstrap",
            "plot_zipf_scale105","plot_me_scale105","plot_qq_scale105",
            "seed"]
    with open(meta_csv, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f); w.writerow(cols)
        for r in records:
            p = r["plots"]
            w.writerow([
                r["dataset_id"], r["distribution"], r["n"], int(r["heavy_tailed"]), r["data_path"],
                json.dumps(r["params"], ensure_ascii=False),
                p["original"]["zipf"], p["original"]["me"], p["original"]["qq_exp"],
                p["aug"]["jitter"]["zipf"], p["aug"]["jitter"]["me"], p["aug"]["jitter"]["qq_exp"],
                p["aug"]["bootstrap"]["zipf"], p["aug"]["bootstrap"]["me"], p["aug"]["bootstrap"]["qq_exp"],
                p["aug"]["scale105"]["zipf"], p["aug"]["scale105"]["me"], p["aug"]["scale105"]["qq_exp"],
                SEED,
            ])
    print("Wrote metadata:", meta_json, "and", meta_csv)


## 4. Run Data Generation Pipeline

This cell serves as the main execution entry point for data generation. It orchestrates the creation of datasets, rendering of diagnostic plots, and persistence of metadata.

- **`build_corpus`**: Generates datasets for all configured distributions, including sampling, labeling (heavy-tailed vs. not), plot rendering, and artifact storage.
  - **`n_per_dist`**: Number of datasets generated per distribution.
  - **`n_samples`**: Number of samples per dataset.
  - **`chunk_size`**: Chunk size used when writing large `.npy` files to control memory usage.
- **`write_metadata`**: Persists dataset metadata to JSON and CSV files, including distribution parameters, labels, plot paths, and the run seed.
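- **`SEED`**: Printed at the end of execution so that the run can be traced; assigning the printed value to `SEED` in Section 3 regenerates the same samples, although dataset IDs still differ because they contain a timestamp.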
- **Artifacts Location**: All generated datasets, plots, and metadata are stored under **`BASE_DIR`**.

This cell should be executed once all helper functions and configurations are defined.
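
As a rough size estimate for the configuration used in the run cell, the following arithmetic sketch counts datasets, plot files, and raw sample bytes. The counts follow from the six entries in `DISTRIBUTIONS` and the twelve plots per dataset produced by `render_plots`; the byte figure ignores `.npy` headers and PNG sizes.

```python
import math

# Values from the run cell; the distribution count matches DISTRIBUTIONS.
n_distributions = 6
n_per_dist = 200
n_samples = 2000
chunk_size = 250_000
bytes_per_value = 4          # float32
plots_per_dataset = 3 * 4    # 3 plot types x (original + 3 augmentations)

n_datasets = n_distributions * n_per_dist
n_plots = n_datasets * plots_per_dataset
payload_bytes = n_datasets * n_samples * bytes_per_value  # excludes .npy headers
chunks_per_dataset = math.ceil(n_samples / chunk_size)

print(n_datasets, n_plots, payload_bytes, chunks_per_dataset)
```

With these settings each dataset fits in a single chunk, so `chunk_size` only starts to matter once `n_samples` exceeds it.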

In [None]:
# === Run data generation ===
records = build_corpus(
    n_per_dist=200,
    n_samples=2000,
    chunk_size=250_000,
)
write_metadata(records)
print("SEED used for this run:", SEED)
print("Data generation complete. Artifacts in:", BASE_DIR)


Generating: normal
Generating: exponential
Generating: pareto
Generating: student_t
Generating: lognormal
Generating: mix_norm_pareto
Wrote metadata: /content/drive/MyDrive/Generated Data for Data science project/metadata/datasets_metadata.json and /content/drive/MyDrive/Generated Data for Data science project/metadata/datasets_metadata.csv
SEED used for this run: 37280314975076901032992437105419038345
Data generation complete. Artifacts in: /content/drive/MyDrive/Generated Data for Data science project
