<a href="https://colab.research.google.com/github/laraAkg/Data-Science-Project/blob/main/Generate_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This cell imports the libraries used throughout the notebook. SciPy is optional: if it is not installed, `st` is set to `None`.

It also configures Matplotlib to use the non-GUI `Agg` backend and checks whether the code is running in Google Colab. In Colab, it mounts Google Drive at `/content/drive` to access files.

In [None]:
from pathlib import Path
import os
import shutil
import json
import csv
import time
import math
import gc
from dataclasses import dataclass
from datetime import datetime, timezone

import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

try:
    import scipy.stats as st
except ImportError:
    st = None

try:
    import google.colab
    IN_COLAB = True
except Exception:
    IN_COLAB = False

if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)


Mounted at /content/drive


This cell defines the directory structure for the project, creating paths for storing datasets, plots, metadata, models, reports, and real-world data.

- `DEFAULT_PROJECT_DIR`: Sets the default directory name within Google Drive.
- `BASE_DIR`: Determines the base directory for project outputs, either in Google Drive if running in Colab or a local `./project_outputs` directory otherwise.
- `DATA_DIR`, `PLOTS_DIR`, `META_DIR`, `MODELS_DIR`, `REPORTS_DIR`, `REAL_DIR`: Define specific subdirectories within the `BASE_DIR` for different types of project artifacts.
- The code then iterates through the defined directories and creates them if they don't already exist using `mkdir(parents=True, exist_ok=True)`.
- `IMG_SIZE`, `DPI`, `FIGSIZE`: Define constants for image size, dots per inch for saving figures, and the figure size for matplotlib plots.

In [None]:
DEFAULT_PROJECT_DIR = "MyDrive/Generated Data for Data science project"
BASE_DIR = Path("/content/drive") / DEFAULT_PROJECT_DIR if IN_COLAB else Path("./project_outputs")

DATA_DIR   = BASE_DIR / "datasets"
PLOTS_DIR  = BASE_DIR / "plots"
META_DIR   = BASE_DIR / "metadata"
MODELS_DIR = BASE_DIR / "models_tf"
REPORTS_DIR= BASE_DIR / "reports"
REAL_DIR   = BASE_DIR / "real"

for p in [DATA_DIR, PLOTS_DIR, META_DIR, MODELS_DIR, REPORTS_DIR, REAL_DIR]:
    p.mkdir(parents=True, exist_ok=True)

IMG_SIZE = (128, 128)  # (H, W)
DPI      = 150
FIGSIZE  = (4.0, 4.0)

print("BASE_DIR:", BASE_DIR)

BASE_DIR: /content/drive/MyDrive/Generated Data for Data science project


This cell contains utility functions and configures Matplotlib for plot creation and saving.

- `ensure_parent`: Creates the parent directory of a file path if it does not exist and returns the path.
- `_patched_savefig`: A wrapper around Matplotlib's `savefig` that redirects relative paths into `PLOTS_DIR`, creates missing parent directories, and falls back to a timestamped file name when the first argument is not a path.
- `save_dataset`: Saves a NumPy array as a float32 `.npy` file.

In [None]:
matplotlib.use("Agg")

plt.rcParams["figure.dpi"] = DPI
plt.rcParams["savefig.dpi"] = DPI
plt.rcParams["figure.figsize"] = FIGSIZE

def ensure_parent(path: Path):
    """Create the parent directory of `path` if needed and return `path` as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path

_original_savefig = plt.savefig
def _patched_savefig(*args, **kwargs):
    if len(args) > 0 and isinstance(args[0], (str, Path)):
        fname = Path(args[0])
        if not fname.is_absolute():
            fname = PLOTS_DIR / fname
        ensure_parent(fname)
        args = (str(fname),) + tuple(args[1:])
    else:
        auto = PLOTS_DIR / f"plot_{int(time.time()*1000)}.png"
        ensure_parent(auto)
        args = (str(auto),) + args
    kwargs.setdefault("dpi", DPI)
    return _original_savefig(*args, **kwargs)
plt.savefig = _patched_savefig

def save_dataset(arr: np.ndarray, path: Path):
    path = ensure_parent(path)
    np.save(path, arr.astype(np.float32))

This cell defines different probability distributions, functions to sample from them, and a function to determine if a distribution is considered "heavy-tailed" based on its parameters.

- `DistSpec`: A data class to hold the name, parameter generation function, and sampling function for each distribution.
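- `rng_for`: Derives an independent random number generator from the run seed and any sequence of labels (for example the distribution name, the replicate index, and `"params"` or `"data"`).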
- `sample_*`: Functions to generate samples from specific distributions (normal, exponential, Pareto, Student's t, lognormal, and a mix of normal and Pareto).
- `p_*`: Functions to generate random parameters for each distribution.
- `DISTRIBUTIONS`: A list of `DistSpec` objects, one for each supported distribution.
- `make_dataset_id`: Creates a unique ID for each generated dataset from the distribution name, the index, and a UTC timestamp.
- `is_heavy_tailed`: Determines if a distribution is heavy-tailed based on its name and parameters.

In [None]:
# === Distributions, seeding, RNG utilities, and dataset builder (fresh randomness each run) ===
from dataclasses import dataclass
import json, hashlib
from datetime import datetime, timezone
import numpy as np
from pathlib import Path

# --- 1) Fresh randomness each run (record the seed) ---
# If you want reproducibility later, you can set SEED to a fixed int instead.
SEED = int(np.random.SeedSequence().entropy)   # fresh each run
RNG  = np.random.default_rng(SEED)

# --- 2) Deterministic per-task RNGs derived from (seed, labels) ---
def rng_for(seed, *labels) -> np.random.Generator:
    """
    Create an independent RNG stream for any (seed, labels...) combination.
    Labels can include: distribution name, replicate index, 'params'/'data', augmentation name, etc.
    """
    h = hashlib.blake2b(digest_size=16)
    h.update(f"seed={seed}".encode())
    h.update(json.dumps(labels, separators=(',', ':'), default=str).encode())
    entropy = int.from_bytes(h.digest(), "big")
    return np.random.default_rng(entropy)

# --- 3) Spec for a distribution: parameter generator + sampler (both receive an RNG) ---
@dataclass
class DistSpec:
    name: str
    param_fn: callable   # signature: param_fn(i: int, rng: np.random.Generator) -> dict
    sample_fn: callable  # signature: sample_fn(n: int, rng: np.random.Generator, **params) -> np.ndarray

# --- 4) Parameter generators (ranges unchanged; only RNG is now explicit) ---
def p_normal(i, rng):       # μ in [-1, 1], σ in [0.5, 2.0]
    mu    = rng.uniform(-1.0, 1.0)
    sigma = rng.uniform(0.5, 2.0)
    return {"mu": float(mu), "sigma": float(sigma)}

def p_exponential(i, rng):  # λ in [0.5, 2.0]
    lam = rng.uniform(0.5, 2.0)
    return {"lam": float(lam)}

def p_pareto(i, rng):       # α in [1.15, 1.9], xm in [0.7, 2.5]
    alpha = rng.uniform(1.15, 1.9)
    xm    = rng.uniform(0.7, 2.5)
    return {"alpha": float(alpha), "xm": float(xm)}

def p_student_t(i, rng):    # df in [2, 12], scale in [0.5, 2.0], loc in [-1,1]
    df    = rng.integers(2, 13)
    loc   = rng.uniform(-1.0, 1.0)
    scale = rng.uniform(0.5, 2.0)
    return {"df": int(df), "loc": float(loc), "scale": float(scale)}

def p_lognormal(i, rng):    # μ in [-0.5, 1.0], σ in [0.4, 1.2]
    mu    = rng.uniform(-0.5, 1.0)
    sigma = rng.uniform(0.4, 1.2)
    return {"mu": float(mu), "sigma": float(sigma)}

def p_mix_norm_pareto(i, rng):  # mixture π in [0.02, 0.25]
    pi    = rng.uniform(0.02, 0.25)  # heavy-tail weight
    mu    = rng.uniform(-0.5, 0.5)
    sigma = rng.uniform(0.6, 1.5)
    alpha = rng.uniform(1.2, 1.8)
    xm    = rng.uniform(0.8, 1.6)
    return {"pi": float(pi), "mu": float(mu), "sigma": float(sigma), "alpha": float(alpha), "xm": float(xm)}

# --- 5) Samplers (explicit RNG; no global sampling) ---
def sample_normal(n, rng, mu, sigma):
    return rng.normal(mu, sigma, size=n)

def sample_exponential(n, rng, lam):
    return rng.exponential(1.0/lam, size=n)

def sample_pareto(n, rng, alpha, xm=1.0):
    return (rng.pareto(alpha, size=n) + 1.0) * xm

def sample_student_t(n, rng, df, loc=0.0, scale=1.0):
    return loc + scale * rng.standard_t(df=df, size=n)

def sample_lognormal(n, rng, mu, sigma):
    return rng.lognormal(mean=mu, sigma=sigma, size=n)

def sample_mix_norm_pareto(n, rng, pi, mu, sigma, alpha, xm=1.0):
    k = rng.binomial(1, pi, size=n).astype(bool)
    x = np.empty(n, dtype=np.float64)
    x[k]  = (rng.pareto(alpha, size=k.sum()) + 1.0) * xm
    x[~k] = rng.normal(mu, sigma, size=(~k).sum())
    return x

# --- 6) Distribution registry (unchanged names) ---
DISTRIBUTIONS = [
    DistSpec("normal",          p_normal,          sample_normal),
    DistSpec("exponential",     p_exponential,     sample_exponential),
    DistSpec("pareto",          p_pareto,          sample_pareto),
    DistSpec("student_t",       p_student_t,       sample_student_t),
    DistSpec("lognormal",       p_lognormal,       sample_lognormal),
    DistSpec("mix_norm_pareto", p_mix_norm_pareto, sample_mix_norm_pareto),
]

# --- 7) Utility: dataset id, tail label heuristic (unchanged logic) ---
def make_dataset_id(dist_name: str, idx: int) -> str:
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S%f")
    return f"{dist_name}_{idx:03d}_{timestamp}"

def is_heavy_tailed(dist_name: str, params: dict) -> bool:
    name = dist_name.lower()
    if name in ["pareto", "cauchy", "mix_norm_pareto"]:
        return True
    if name in ["normal", "gaussian", "exponential"]:
        return False
    if name in ["student_t", "studentt", "t"]:
        return params.get("df", 10) <= 5
    if name in ["lognormal", "lognorm"]:
        return params.get("sigma", 1.0) >= 1.0
    return False

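The per-task streams from `rng_for` are reproducible for a given seed and label tuple, and change as soon as any label changes. The sketch below repeats the same hashing construction so that it runs on its own; the fixed `seed` value is an assumption made only for this illustration, since the notebook draws a fresh `SEED` on every run.

```python
import hashlib
import json
import numpy as np

def rng_for(seed, *labels) -> np.random.Generator:
    # Same construction as in the cell above: hash (seed, labels) to 128 bits of entropy.
    h = hashlib.blake2b(digest_size=16)
    h.update(f"seed={seed}".encode())
    h.update(json.dumps(labels, separators=(",", ":"), default=str).encode())
    return np.random.default_rng(int.from_bytes(h.digest(), "big"))

seed = 12345  # fixed only for this illustration

a = rng_for(seed, "pareto", 0, "data").uniform(size=5)
b = rng_for(seed, "pareto", 0, "data").uniform(size=5)    # identical labels
c = rng_for(seed, "pareto", 0, "params").uniform(size=5)  # last label differs

print(np.array_equal(a, b))  # same labels give the same stream
print(np.array_equal(a, c))  # a different label gives a different stream
```

This is why `build_corpus` can draw parameters and data from separate generators without one draw shifting the other.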

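NumPy's `Generator.pareto` draws from the Lomax (Pareto II) distribution, which starts at 0. Adding 1 and multiplying by `xm`, as `sample_pareto` does, yields the classical Pareto distribution with minimum `xm`. The following sketch checks this numerically; the parameter values and the fixed seed are assumptions chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for this sketch only
alpha, xm, n = 1.5, 2.0, 200_000

x = (rng.pareto(alpha, size=n) + 1.0) * xm  # same transform as sample_pareto

# Classical Pareto survival function: P(X > t) = (xm / t) ** alpha for t >= xm.
t = 2.0 * xm
empirical = float(np.mean(x > t))
theoretical = (xm / t) ** alpha

print(bool(x.min() >= xm))     # no sample falls below xm
print(empirical, theoretical)  # the two tail probabilities should be close
```
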
In [None]:
# === Helper: safe_move for cross-device moves ===
import shutil
from pathlib import Path

def safe_move(src: Path, dst: Path):
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(dst))  # copy+delete when needed


In [None]:
# === build_corpus (open_memmap -> writes valid .npy files; cross-device-safe move) ===
from numpy.lib.format import open_memmap  # important: writes the .npy header

def build_corpus(n_per_dist=5, n_samples=2000, chunk_size=200_000, dtype=np.float32, seed: int = SEED):
    """
    Builds datasets for all distributions defined in DISTRIBUTIONS.
    - Local staging with open_memmap (produces real .npy files with an embedded header)
    - Cross-device-safe move via safe_move
    - Per-task RNGs (params vs data)
    """
    records = []

    for dist in DISTRIBUTIONS:
        print("Generating:", dist.name)
        for i in range(n_per_dist):
            # independent RNG streams
            rng_params = rng_for(seed, dist.name, i, "params")
            rng_data   = rng_for(seed, dist.name, i, "data")

            ds_id  = make_dataset_id(dist.name, i)
            params = dist.param_fn(i, rng_params)
            label  = is_heavy_tailed(dist.name, params)

            # --- write locally (.npy)
            local_out_dir = LOCAL_TMP / dist.name
            local_out_dir.mkdir(parents=True, exist_ok=True)
            local_data_path = local_out_dir / f"{ds_id}.npy"

            remaining = int(n_samples)
            pos = 0
            # open_memmap creates a .npy file with a header (compatible with np.load)
            mm = open_memmap(filename=str(local_data_path), mode='w+', dtype=dtype, shape=(n_samples,))
            try:
                while remaining > 0:
                    m = min(remaining, chunk_size)
                    x = dist.sample_fn(m, rng_data, **params)
                    if x.dtype != dtype:
                        x = x.astype(dtype, copy=False)
                    mm[pos:pos+m] = x
                    pos += m
                    remaining -= m
                mm.flush()
            finally:
                del mm  # release the handle

            # --- move to Drive in a cross-device-safe way
            final_data_path = DATA_DIR / dist.name / f"{ds_id}.npy"
            safe_move(local_data_path, final_data_path)

            # --- render plots (np.load(..., allow_pickle=False) works now)
            plots = render_plots(ds_id, final_data_path)

            records.append({
                "dataset_id": ds_id,
                "distribution": dist.name,
                "n": int(n_samples),
                "heavy_tailed": bool(label),
                "data_path": str(final_data_path),
                "params": params,
                "plots": plots,
            })

    return records

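The staging pattern used by `build_corpus` (preallocate a `.npy` file with `open_memmap`, fill it chunk by chunk, then read it back with `np.load`) can be tried in isolation. The sketch below is a minimal illustration under assumed toy sizes (`n_samples=10`, `chunk_size=4`) and a temporary directory; it is not the notebook's pipeline.

```python
import tempfile
from pathlib import Path

import numpy as np
from numpy.lib.format import open_memmap

n_samples, chunk_size = 10, 4   # tiny sizes so the chunk boundaries are visible
rng = np.random.default_rng(0)  # fixed seed for this sketch only

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo.npy"
    # Preallocate a real .npy file (header included) and map it into memory.
    mm = open_memmap(str(path), mode="w+", dtype=np.float32, shape=(n_samples,))
    pos = 0
    chunks = []
    while pos < n_samples:
        m = min(chunk_size, n_samples - pos)
        mm[pos:pos + m] = rng.normal(size=m).astype(np.float32)
        chunks.append(m)
        pos += m
    mm.flush()
    del mm  # release the memory map before reading the file back

    # Plain np.load works because the file carries a standard .npy header.
    loaded = np.load(path, allow_pickle=False)

print(chunks)                      # [4, 4, 2]
print(loaded.shape, loaded.dtype)  # (10,) float32
```
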
This code cell contains several helper functions for processing data and generating plots:

- `_sanitize`: Removes NaN and infinite values from an array.
- `_downsample`: Reduces the number of data points in an array to a specified maximum, useful for performance with large datasets.
- `_clamp_by_percentile`: Limits the data range by values at specified percentiles to reduce the impact of outliers.
- `plot_zipf`: Generates a Zipf plot, which shows the absolute values sorted in descending order against their rank on a log-log scale.
- `plot_me`: Generates a Mean Excess plot, which shows the empirical mean excess e(u), the average amount by which observations above a threshold u exceed it, as a function of u.
- `plot_qq_exp`: Generates an Exponential Q-Q plot, comparing the sorted sample values of |x| to the theoretical quantiles of a unit-rate exponential distribution.

In [None]:
# === Plotting helpers (sanitize, downsample uses explicit RNG) ===
import numpy as np
import matplotlib.pyplot as plt

def _sanitize(x):
    x = np.asarray(x).ravel()
    return x[np.isfinite(x)]  # drops NaN as well as +/- infinity

def _downsample(x, max_points=50_000, rng=None):
    n = x.shape[0]
    if n <= max_points:
        return x
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(n, size=max_points, replace=False)
    return x[idx]

def _clamp_by_percentile(x, pct=99.5):
    lo = np.nanpercentile(x, 100 - pct)
    hi = np.nanpercentile(x, pct)
    if not np.isfinite(lo): lo = np.nanmin(x)
    if not np.isfinite(hi): hi = np.nanmax(x)
    x = np.clip(x, lo, hi)
    return x, (lo, hi)

# --- existing plot functions; only change: use _downsample(x, rng=...) ---
def plot_zipf(x, save_path, rng=None):
    x = _sanitize(np.abs(x))
    x = _downsample(x, rng=rng)
    x = np.sort(x)[::-1]
    ranks = np.arange(1, x.size + 1)
    fig, ax = plt.subplots()
    ax.loglog(ranks, x, marker=".", linewidth=0)
    ax.set_xlabel("rank"); ax.set_ylabel("|x| (sorted)")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(save_path)
    plt.close(fig)

def plot_me(x, save_path, n_bins=200, rng=None):
    x = _sanitize(np.abs(x))
    x = _downsample(x, rng=rng)
    x, (lo, hi) = _clamp_by_percentile(x, 99.5)
    xs = np.sort(x)
    us = np.linspace(lo, hi, n_bins)
    e_vals = []
    for u in us:
        exceed = xs[xs > u]
        e_vals.append(np.mean(exceed - u) if exceed.size else np.nan)
    fig, ax = plt.subplots()
    ax.plot(us, e_vals, marker=".", linewidth=1)
    ax.set_xlabel("u"); ax.set_ylabel("e(u)")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(save_path)
    plt.close(fig)

def plot_qq_exp(x, save_path, rng=None):
    """
    Exponential QQ-plot: compare sample quantiles of |x| to theoretical exponential(1) quantiles.
    """
    x = _sanitize(np.abs(x))
    x = _downsample(x, rng=rng)
    x = np.sort(x)
    n = x.size
    if n == 0:
        # create an empty plot if needed
        fig, ax = plt.subplots()
        ax.set_title("Empty sample")
        plt.tight_layout()
        plt.savefig(save_path)
        plt.close(fig)
        return
    # theoretical quantiles for Exp(1): F^{-1}(p) = -ln(1-p)
    p = (np.arange(1, n + 1) - 0.5) / n
    q_theory = -np.log1p(-p)
    fig, ax = plt.subplots()
    ax.plot(q_theory, x, marker=".", linewidth=0)
    ax.set_xlabel("Exp(1) quantiles"); ax.set_ylabel("Sample |x| quantiles")
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig(save_path)
    plt.close(fig)

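The mean excess plot separates light and heavy tails because of how e(u) behaves: for an exponential distribution it is constant in u, while for a Pareto distribution with tail index α > 1 it grows linearly, e(u) = u / (α − 1). The sketch below computes the same empirical quantity as the loop in `plot_me` for two synthetic samples. The helper name `mean_excess`, the parameter values, and the fixed seed are assumptions made for this illustration.

```python
import numpy as np

def mean_excess(x, u):
    """Empirical mean excess e(u): average of (x - u) over observations with x > u."""
    exceed = x[x > u]
    return float(np.mean(exceed - u)) if exceed.size else float("nan")

rng = np.random.default_rng(0)  # fixed seed for this sketch only
n = 200_000
exp_sample = rng.exponential(1.0, size=n)    # Exp(1): e(u) = 1 for every u
par_sample = rng.pareto(3.0, size=n) + 1.0   # Pareto(alpha=3, xm=1): e(u) = u / 2

for u in (1.0, 2.0):
    # Exponential column stays near 1; Pareto column roughly doubles from u=1 to u=2.
    print(u, mean_excess(exp_sample, u), mean_excess(par_sample, u))
```

An upward-sloping mean excess curve is therefore a visual cue for a heavy tail, whereas a flat curve points to an exponential-type tail.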

This cell defines simple augmentation functions that are applied to the generated data to increase the robustness of the model.

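- `jitter`: Adds Gaussian noise whose standard deviation is `scale` (0.02 by default) times the sample standard deviation.
- `bootstrap_resample`: Creates a new sample of the same size by drawing with replacement from the original data.
- `slight_scale`: Multiplies the data by a constant factor (1.05 by default).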

In [None]:
# === Augmentations (explicit RNG) ===
import numpy as np

def jitter(x, scale=0.02, rng=None):
    """
    Add small Gaussian noise proportional to the sample std.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x).ravel().astype(np.float32)
    s = float(np.std(x)) if np.std(x) > 0 else 1.0
    return x + rng.normal(0.0, scale * s, size=len(x)).astype(np.float32)

def bootstrap_resample(x, rng=None):
    """
    Sample with replacement from x.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.asarray(x).ravel()
    idx = rng.integers(0, len(x), size=len(x))
    return x[idx].astype(np.float32, copy=False)

def slight_scale(x, factor=1.05, rng=None):
    """
    Deterministic scaling; rng included for a uniform signature.
    """
    x = np.asarray(x).ravel()
    return (x * factor).astype(np.float32, copy=False)

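To get a feel for how mild these augmentations are, the following sketch applies inline equivalents of the three functions to a synthetic lognormal sample and reports one summary number for each. The sample, its size, and the fixed seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for this sketch only
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000).astype(np.float32)

# Inline equivalents of jitter, bootstrap_resample, and slight_scale.
jittered = x + rng.normal(0.0, 0.02 * float(np.std(x)), size=x.size).astype(np.float32)
idx = rng.integers(0, x.size, size=x.size)
resampled = x[idx]
scaled = (x * 1.05).astype(np.float32)

# Independent noise with 2% of the std inflates the std by about sqrt(1 + 0.02**2).
std_ratio = float(np.std(jittered) / np.std(x))
# A bootstrap sample of size n reuses roughly 1 - 1/e (about 63%) of the original points.
unique_fraction = np.unique(idx).size / x.size
# Scaling shifts every value, including the maximum, by the same factor.
max_ratio = float(scaled.max() / x.max())

print(std_ratio, unique_fraction, max_ratio)
```

Note that jitter can push small positive values below zero; the plotting helpers take absolute values before plotting, so this does not break the log-scale plots.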

This cell sets up a local staging directory and contains the functions for rendering the plots for each dataset and its augmentations and for storing metadata. `build_corpus`, defined in an earlier cell, relies on both `LOCAL_TMP` and `render_plots`, so this cell must run before `build_corpus` is called.

- `LOCAL_TMP`: A local staging directory outside Google Drive where datasets are written before being moved to `DATA_DIR`.
- `render_plots`: Creates and saves the Zipf, ME, and Q-Q plots for a given dataset and its three augmentations (12 images per dataset).
- `write_metadata`: Saves the collected metadata, together with the run seed, to JSON and CSV files.

In [None]:
# === Local temp dir (RAM-friendly, no Drive lag) ===
from pathlib import Path
import json, csv
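# Outside Colab, /content is usually not writable, so fall back to a local folder.
LOCAL_TMP = Path("/content/ds_tmp") if IN_COLAB else Path("./ds_tmp")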
LOCAL_TMP.mkdir(parents=True, exist_ok=True)

def render_plots(ds_id: str, npy_path: Path):
    """
    Render and save plots for a dataset.
    Uses derived RNG streams for downsampling and augmentations to keep runs controlled.
    """
    ds_dir = PLOTS_DIR / ds_id
    ds_dir.mkdir(parents=True, exist_ok=True)

    x = np.load(npy_path, mmap_mode="r", allow_pickle=False)

    paths = {}

    # ORIGINAL
    zipf_p = ds_dir / "zipf.png"
    me_p   = ds_dir / "me.png"
    qq_p   = ds_dir / "qq_exp.png"

    # derive RNG for original downsampling (plotting)
    rng_down = rng_for(SEED, "render_plots", ds_id, "downsample")
    plot_zipf(x, zipf_p, rng=rng_down)
    plot_me(x,   me_p,   rng=rng_down)
    plot_qq_exp(x, qq_p, rng=rng_down)

    paths["original"] = {"zipf": str(zipf_p), "me": str(me_p), "qq_exp": str(qq_p)}

    # AUGMENTATIONS: take a manageable subsample first (with explicit RNG)
    sample_n = min(50_000, x.shape[0])
    rng_pick = rng_for(SEED, "render_plots", ds_id, "pick_for_aug")
    idx = rng_pick.integers(0, x.shape[0], size=sample_n)
    x_small = np.asarray(x[idx])

    paths["aug"] = {}
    # independent RNG streams per augmentation
    aug_streams = {
        "jitter":    rng_for(SEED, "render_plots", ds_id, "aug", "jitter"),
        "bootstrap": rng_for(SEED, "render_plots", ds_id, "aug", "bootstrap"),
        "scale105":  rng_for(SEED, "render_plots", ds_id, "aug", "scale105"),
    }
    aug_data = {
        "jitter":    jitter(x_small, rng=aug_streams["jitter"]),
        "bootstrap": bootstrap_resample(x_small, rng=aug_streams["bootstrap"]),
        "scale105":  slight_scale(x_small, 1.05, rng=aug_streams["scale105"]),
    }

    # plotting RNGs per augmentation
    for name, arr in aug_data.items():
        z_a = ds_dir / f"zipf_{name}.png"
        m_a = ds_dir / f"me_{name}.png"
        q_a = ds_dir / f"qq_exp_{name}.png"

        rng_plot = rng_for(SEED, "render_plots", ds_id, "plot_aug", name)
        plot_zipf(arr, z_a, rng=rng_plot)
        plot_me(arr,   m_a, rng=rng_plot)
        plot_qq_exp(arr, q_a, rng=rng_plot)

        paths.setdefault("aug", {})[name] = {"zipf": str(z_a), "me": str(m_a), "qq_exp": str(q_a)}

    return paths

def write_metadata(records):
    """
    Write metadata to JSON and CSV; include the run SEED so results are traceable even with fresh randomness.
    """
    META_DIR.mkdir(parents=True, exist_ok=True)
    meta_json = META_DIR / "datasets_metadata.json"
    meta_csv  = META_DIR / "datasets_metadata.csv"

    # JSON
    payload = {
        "seed": SEED,
        "count": len(records),
        "records": records,
    }
    with open(meta_json, "w", encoding="utf-8") as f:
        json.dump(payload, f, ensure_ascii=False, indent=2)

    # CSV (flat view)
    cols = ["dataset_id","distribution","n","heavy_tailed","data_path","params_json",
            "plot_zipf","plot_me","plot_qq_exp",
            "plot_zipf_jitter","plot_me_jitter","plot_qq_jitter",
            "plot_zipf_bootstrap","plot_me_bootstrap","plot_qq_bootstrap",
            "plot_zipf_scale105","plot_me_scale105","plot_qq_scale105",
            "seed"]
    with open(meta_csv, "w", newline="", encoding="utf-8") as f:
        w = csv.writer(f); w.writerow(cols)
        for r in records:
            p = r["plots"]
            w.writerow([
                r["dataset_id"], r["distribution"], r["n"], int(r["heavy_tailed"]), r["data_path"],
                json.dumps(r["params"], ensure_ascii=False),
                p["original"]["zipf"], p["original"]["me"], p["original"]["qq_exp"],
                p["aug"]["jitter"]["zipf"], p["aug"]["jitter"]["me"], p["aug"]["jitter"]["qq_exp"],
                p["aug"]["bootstrap"]["zipf"], p["aug"]["bootstrap"]["me"], p["aug"]["bootstrap"]["qq_exp"],
                p["aug"]["scale105"]["zipf"], p["aug"]["scale105"]["me"], p["aug"]["scale105"]["qq_exp"],
                SEED,
            ])
    print("Wrote metadata:", meta_json, "and", meta_csv)

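The JSON file written by `write_metadata` can be loaded back and filtered with the standard library alone. The sketch below writes a small payload with the same top-level keys (`seed`, `count`, `records`) to a temporary directory and reads it again. The two records, including their ids, paths, and parameter values, are invented for this illustration.

```python
import json
import tempfile
from pathlib import Path

# Two hand-written records shaped like those returned by build_corpus (values invented).
payload = {
    "seed": 123,
    "count": 2,
    "records": [
        {"dataset_id": "normal_000_demo", "distribution": "normal", "n": 2000,
         "heavy_tailed": False, "data_path": "datasets/normal/normal_000_demo.npy",
         "params": {"mu": 0.1, "sigma": 1.2}, "plots": {}},
        {"dataset_id": "pareto_000_demo", "distribution": "pareto", "n": 2000,
         "heavy_tailed": True, "data_path": "datasets/pareto/pareto_000_demo.npy",
         "params": {"alpha": 1.5, "xm": 1.0}, "plots": {}},
    ],
}

with tempfile.TemporaryDirectory() as tmp:
    meta_json = Path(tmp) / "datasets_metadata.json"
    meta_json.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    loaded = json.loads(meta_json.read_text(encoding="utf-8"))

# Select the ids of all datasets labeled heavy-tailed.
heavy_ids = [r["dataset_id"] for r in loaded["records"] if r["heavy_tailed"]]
print(loaded["seed"], loaded["count"], heavy_ids)
```
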

This cell is the main trigger for data generation. It calls the `build_corpus` and `write_metadata` functions to create the datasets, render the plots, and save the metadata.

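- `build_corpus(n_per_dist=200, n_samples=2000, chunk_size=250_000)`: Generates 200 datasets per distribution (1,200 in total across the six distributions) with 2,000 samples each. Because `n_samples` is smaller than `chunk_size`, each dataset is written in a single chunk.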
- `write_metadata(records)`: Saves the metadata of the generated datasets.

In [None]:
# === Run data generation ===
records = build_corpus(
    n_per_dist=200,
    n_samples=2000,
    chunk_size=250_000,
)
write_metadata(records)
print("SEED used for this run:", SEED)
print("Data generation complete. Artifacts in:", BASE_DIR)


Generating: normal
Generating: exponential
Generating: pareto
Generating: student_t
Generating: lognormal
Generating: mix_norm_pareto
Wrote metadata: /content/drive/MyDrive/Generated Data for Data science project/metadata/datasets_metadata.json and /content/drive/MyDrive/Generated Data for Data science project/metadata/datasets_metadata.csv
SEED used for this run: 37280314975076901032992437105419038345
Data generation complete. Artifacts in: /content/drive/MyDrive/Generated Data for Data science project
