## Notebook 01: Environment, Reproducibility, and Quick Data Sanity

### 1. Purpose and Project Introduction

This notebook establishes the computational and conceptual baseline for éclat, a research-grade prototype that combines LLM-based intent/emotion detection with structured verification and underwriting modules to study pause-aware, empathetic loan conversations. The goal is to provide a reproducible pipeline that: (1) generates synthetic datasets representative of small to mid-sized retail loan interactions (amounts in INR; dataset size configurable between 10k and 100k rows for experiments), (2) demonstrates intent/emotion detection using lightweight LLM wrappers and embeddings, (3) provides deterministic baselines for verification and underwriter reasoning, and (4) produces explainable artifacts and audit trails for evaluation.

Scope and datasets: the provided `prototype/generate_data.py` creates synthetic `data/raw/` CSVs intended for reproducible experiments. For quick reviews this notebook uses small samples when `DRY_RUN=True`; switch to `DRY_RUN=False` to run full dataset reads.

Rationale: establishing a rigorous environment and manifest up front reduces replication friction for reviewers, ensures deterministic runs for experiments, and captures the minimal metadata required for audit and reproducibility. The notebook is intentionally guarded: computationally expensive or GPU-bound cells in later notebooks are gated by the `DRY_RUN` flag so reviewers can inspect narrative and small visualizations without triggering heavy compute.

### 2. Goals

This notebook prepares the environment and artifacts required for the research suite. High-level goals: 

- Confirm environment and core libraries (so reviewers know which packages and versions were used).
- Produce `artifacts/env.json`, `artifacts/versions.json`, and `artifacts/manifest.json` (these artifacts are the minimal reproducibility bundle).
- Run quick deterministic reproducibility checks (numpy/pandas/torch seeds) to validate that RNG seeding produces repeatable results on this machine/configuration.
- Provide clear instructions to reproduce the environment on Windows (venv) or Conda and to get plotting working for visual checks.

### 3. Setup Constants & Flags

This code cell sets the reproducibility `SEED`, the `DRY_RUN` boolean (default `True`), and root paths for artifacts and data. `DRY_RUN=True` prevents heavy compute and limits dataset reads to samples. To run full experiments, set `DRY_RUN=False` and ensure you have the required compute resources (see notebook headers for compute notes).

**Why these settings matter:** deterministic experiments require a documented seed and consistent data paths. Using `DRY_RUN` lets reviewers verify logic and outputs without triggering long training or large-file I/O; when running full experiments, the same constants ensure comparable results across runs. The cell also demonstrates how to set environment variables for repeatability.

In [1]:
# Setup constants and flags
DRY_RUN = True  # switch to False to run full dataset reads / heavy compute
SEED = 42

from pathlib import Path
import os, sys, random, json, hashlib, shutil, subprocess, time
import numpy as np
import pandas as pd

ROOT = Path('.')  # current working directory, treated as the project root
ARTIFACTS = ROOT.joinpath('artifacts')
DATA_DIR = ROOT.joinpath('data','raw')
ARTIFACTS.mkdir(parents=True, exist_ok=True)

# Document DRY_RUN behaviour in the notebook output
print(f'DRY_RUN={DRY_RUN}; SEED={SEED}; ROOT={ROOT.resolve()}')

# Example: set env vars for reproducibility (setdefault keeps any value already set in the shell)
os.environ.setdefault('ECLAT_SEED', str(SEED))
os.environ.setdefault('ECLAT_DRY_RUN', '1' if DRY_RUN else '0')

# Small helper to write JSON artifacts
def write_artifact(name, obj):
    p = ARTIFACTS.joinpath(name)
    with p.open('w', encoding='utf8') as f:
        json.dump(obj, f, indent=2, ensure_ascii=False)
    print('Wrote', p)

DRY_RUN=True; SEED=42; ROOT=C:\Users\ADITI\Desktop\rhyl-projects\eclat\notebooks
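
The goals list a quick deterministic check of RNG seeding, and the setup cell defines `SEED` without yet applying it to any generator. The sketch below is one way such a check might look. `seed_everything` and `draw_sample` are hypothetical helper names, not part of the éclat codebase, and NumPy and PyTorch are seeded only if they import successfully.

```python
import random

SEED = 42  # mirrors the notebook's SEED constant


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are installed."""
    seeded = ['random']
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append('numpy')
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
        seeded.append('torch')
    except ImportError:
        pass
    return seeded


def draw_sample(seed, n=5):
    """Re-seed, then draw a few numbers from each seeded library."""
    seeded = seed_everything(seed)
    sample = {'random': [random.random() for _ in range(n)]}
    if 'numpy' in seeded:
        import numpy as np
        sample['numpy'] = np.random.rand(n).tolist()
    if 'torch' in seeded:
        import torch
        sample['torch'] = torch.rand(n).tolist()
    return sample


# Two draws with the same seed should match exactly on the same machine.
first = draw_sample(SEED)
second = draw_sample(SEED)
print('repeatable:', first == second)
```

This only verifies repeatability within one machine and configuration; it does not establish bit-identical results across platforms or library versions.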


### 4. Environment discovery & disk / GPU checks

This section prints environment details (Python executable, pip version, OS, CPU count), examines disk free space for common drives, and checks for available GPUs via `torch`. Results are saved to `artifacts/env.json`. The plotting cell below visualizes disk free space if `matplotlib` is available.

Why: disk space and GPU availability are often overlooked causes of nondeterministic failures (e.g., interrupted installs, fallback CPU-only runs). Recording drive free space and GPU metadata helps diagnose environment-dependent failures later. We save these results so they can be archived with experiment outputs.

In [2]:
# Environment discovery and checks
import platform, multiprocessing, shutil
from datetime import datetime, timezone

env = {}
env['python_executable'] = sys.executable
# pip version
try:
    pip_v = subprocess.check_output([sys.executable, '-m', 'pip', '--version'], text=True).strip()
except Exception as e:
    pip_v = str(e)
env['pip_version'] = pip_v
env['platform'] = platform.platform()
env['cpu_count'] = multiprocessing.cpu_count()
# Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
env['timestamp'] = datetime.now(timezone.utc).isoformat().replace('+00:00', 'Z')

# Disk usage for common drives (Windows-friendly)
drives = []
for d in ['C:', 'D:', 'E:', 'F:']:
    if os.path.exists(d + os.sep):
        try:
            total, used, free = shutil.disk_usage(d + os.sep)
            drives.append({'drive': d, 'total': total, 'used': used, 'free': free})
        except Exception as e:
            drives.append({'drive': d, 'error': str(e)})
env['drives'] = drives

# GPU check via torch (if installed)
gpu_info = {'available': False, 'devices': []}
try:
    import importlib.util  # find_spec lives in the importlib.util submodule
    torch_spec = importlib.util.find_spec('torch')
    if torch_spec is not None:
        import torch
        gpu_info['available'] = torch.cuda.is_available()
        if gpu_info['available']:
            for i in range(torch.cuda.device_count()):
                gpu_info['devices'].append({'id': i, 'name': torch.cuda.get_device_name(i)})
except Exception as e:
    gpu_info['error'] = str(e)
env['gpu'] = gpu_info

# Save env artifact
write_artifact('env.json', env)
env


Wrote artifacts\env.json


{'python_executable': 'D:\\rhyl-envs\\eclat_venv\\Scripts\\python.exe',
 'pip_version': 'pip 25.3 from D:\\rhyl-envs\\eclat_venv\\Lib\\site-packages\\pip (python 3.13)',
 'platform': 'Windows-11-10.0.26100-SP0',
 'cpu_count': 16,
 'timestamp': '2025-12-17T13:57:46.166614Z',
 'drives': [{'drive': 'C:',
   'total': 196172836864,
   'used': 195559796736,
   'free': 613040128},
  {'drive': 'D:',
   'total': 314571747328,
   'used': 309730840576,
   'free': 4840906752}],
 'gpu': {'available': False, 'devices': []}}
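
The raw byte counts in the `drives` entries are hard to read at a glance. As a minimal sketch, the helper below converts them to GiB and flags drives under a threshold. `summarize_drives` and the 5 GiB cutoff are illustrative assumptions, not part of the project; the drive values are copied from the output above.

```python
GIB = 1024 ** 3
MIN_FREE_GIB = 5.0  # hypothetical threshold; tune to dataset and model sizes


def summarize_drives(drives, min_free_gib=MIN_FREE_GIB):
    """Convert byte counts to GiB and flag drives below the threshold."""
    rows = []
    for d in drives:
        if 'error' in d:
            # Keep failed probes visible instead of silently dropping them.
            rows.append({'drive': d['drive'], 'free_gib': None, 'low': None})
            continue
        free_gib = d['free'] / GIB
        rows.append({
            'drive': d['drive'],
            'free_gib': round(free_gib, 2),
            'low': free_gib < min_free_gib,
        })
    return rows


# Drive entries copied from the recorded env output.
drives = [
    {'drive': 'C:', 'total': 196172836864, 'used': 195559796736, 'free': 613040128},
    {'drive': 'D:', 'total': 314571747328, 'used': 309730840576, 'free': 4840906752},
]
for row in summarize_drives(drives):
    print(row)
```

On the recorded machine this works out to roughly 0.57 GiB free on `C:` and 4.51 GiB on `D:`, so both drives would fall under the illustrative threshold.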

### 5. Critical-library import/version check

This cell probes for presence and versions of critical libraries used in the project. Results are saved to `artifacts/versions.json`. Missing libraries are marked clearly so reviewers know which packages to install.

Why: Machine learning and LLM stacks are sensitive to package versions (ABI changes, tokenizer behavior, randomness). Capturing exact versions helps reproduce numeric results and model behavior. If a package is missing, the artifact makes it explicit which installation step failed.

In [3]:
# Critical library presence and versions
libraries = ['openai','sentence_transformers','transformers','sklearn','xgboost','torch','networkx','pandas','jinja2','matplotlib','faiss','hnswlib']
import importlib.util  # provides find_spec; import_module is available via the parent package
present = {}
for lib in libraries:
    info = {'found': False, 'version': None}
    try:
        spec = importlib.util.find_spec(lib)
        if spec is not None:
            mod = importlib.import_module(lib)
            info['found'] = True
            info['version'] = getattr(mod, '__version__', None) or getattr(mod, 'VERSION', None)
    except Exception as e:
        info['error'] = str(e)
    present[lib] = info

write_artifact('versions.json', present)
present


Wrote artifacts\versions.json


{'openai': {'found': True, 'version': '2.13.0'},
 'sentence_transformers': {'found': True, 'version': '5.2.0'},
 'transformers': {'found': True, 'version': '4.57.3'},
 'sklearn': {'found': True, 'version': '1.8.0'},
 'xgboost': {'found': True, 'version': '3.1.2'},
 'torch': {'found': True, 'version': '2.9.1+cpu'},
 'networkx': {'found': True, 'version': '3.6.1'},
 'pandas': {'found': True, 'version': '2.3.3'},
 'jinja2': {'found': True, 'version': '3.1.6'},
 'matplotlib': {'found': True, 'version': '3.10.8'},
 'faiss': {'found': False, 'version': None},
 'hnswlib': {'found': False, 'version': None}}
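
The cell above reads `__version__` (or `VERSION`) after importing each module, which runs the package's import-time code and returns `None` for packages that define neither attribute. A possible alternative, sketched here, is to query installed distribution metadata with the standard-library `importlib.metadata`. The `IMPORT_TO_DIST` mapping and the `dist_version` helper are assumptions for illustration; in particular, `faiss` may be installed under a different distribution name on your machine.

```python
from importlib import metadata

# Import name -> distribution name (assumed mapping; adjust to your install).
IMPORT_TO_DIST = {
    'sklearn': 'scikit-learn',
    'sentence_transformers': 'sentence-transformers',
    'faiss': 'faiss-cpu',
}


def dist_version(import_name):
    """Return the installed distribution version, or None if not installed."""
    dist_name = IMPORT_TO_DIST.get(import_name, import_name)
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


for name in ['pandas', 'sklearn', 'faiss']:
    print(name, dist_version(name))
```

Because this reads metadata rather than importing the package, it avoids import-time warnings and is cheap to run, at the cost of maintaining the name mapping.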

### 6. Reproducibility manifest

Compute and write a minimal manifest capturing `SEED`, current git commit (if available), SHA256 of `requirements.txt` (if present), platform, python path, and timestamp. Store the manifest in `artifacts/manifest.json`. This manifest should be included with any submission or experiment archive.

Why: a manifest links the code that produced the results to the runtime snapshot. It helps answer questions like 'which commit produced these metrics?' and 'which requirements produced this behavior?'. For reviewers, the manifest is the minimal metadata bundle needed for re-running experiments or auditing results.

In [4]:
# Reproducibility manifest creation
manifest = {
    'seed': SEED,
    'timestamp': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
    'platform': sys.platform,
    'python_executable': sys.executable,
}
# git commit (if available)
try:
    git_hash = subprocess.check_output(['git','rev-parse','HEAD'], text=True).strip()
    manifest['git_commit'] = git_hash
except Exception:
    manifest['git_commit'] = None
# requirements.txt sha256 if present
req_path = ROOT.joinpath('requirements.txt')
if req_path.exists():
    h = hashlib.sha256(req_path.read_bytes()).hexdigest()
    manifest['requirements_sha256'] = h
else:
    manifest['requirements_sha256'] = None

write_artifact('manifest.json', manifest)
manifest

Wrote artifacts\manifest.json


{'seed': 42,
 'timestamp': '2025-12-17T13:57:54Z',
 'platform': 'win32',
 'python_executable': 'D:\\rhyl-envs\\eclat_venv\\Scripts\\python.exe',
 'git_commit': None,
 'requirements_sha256': None}
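
In the recorded run `requirements_sha256` is `None`, and the setup cell's output shows `ROOT` resolving to a `notebooks` folder. If `requirements.txt` actually sits in a parent directory (an assumption; the source does not say), walking up from the working directory is one way to locate it. `find_project_root` and its marker list are hypothetical, shown only as a sketch.

```python
from pathlib import Path


def find_project_root(start='.', markers=('requirements.txt', '.git')):
    """Walk upward from `start` until a directory contains one of `markers`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if any((candidate / marker).exists() for marker in markers):
            return candidate
    return None  # no marker found anywhere up the tree


root = find_project_root('.')
print('project root:', root)
```

If a root is found, it could replace `Path('.')` as `ROOT` so that the manifest hashes the intended `requirements.txt` regardless of where the notebook is launched.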

### 7. Environment reproducibility guidance

Follow these steps to reproduce the environment on Windows using the provided virtual environment or using Conda. Copy the commands into your terminal.

**Windows venv (PowerShell)**

```powershell
python -m venv D:\rhyl-envs\eclat_venv
D:\rhyl-envs\eclat_venv\Scripts\Activate.ps1
python -m pip install -U pip wheel
pip install -r requirements.txt
# optional: install plotting libs
pip install matplotlib seaborn
```

**Conda (optional)**

```bash
# match the Python version recorded in the env output above (3.13)
conda create -n eclat python=3.13 -y
conda activate eclat
pip install -r requirements.txt
```

**Notes and rationale:** Prefer creating the venv on a drive with sufficient free space (see the env check above). Installing plotting libraries is optional for headless servers, but useful for visual inspection during reproducibility checks. If you are re-running experiments for numeric comparability, ensure that `requirements.txt` is the same file used to produce the original `requirements_sha256` recorded in the manifest.
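
The hash comparison described in the note can be automated. The following is a minimal sketch, assuming the manifest layout written by the manifest cell (`requirements_sha256` key); `requirements_match` is a hypothetical helper name.

```python
import hashlib
import json
from pathlib import Path


def requirements_match(manifest_path, requirements_path):
    """Compare requirements.txt against the hash stored in manifest.json.

    Returns True/False when a comparison is possible, or None when the
    manifest has no recorded hash or the requirements file is missing.
    """
    manifest = json.loads(Path(manifest_path).read_text(encoding='utf8'))
    recorded = manifest.get('requirements_sha256')
    req = Path(requirements_path)
    if recorded is None or not req.exists():
        return None
    return hashlib.sha256(req.read_bytes()).hexdigest() == recorded


# Only runs the check when a manifest is present in the working directory.
if Path('artifacts/manifest.json').exists():
    print(requirements_match('artifacts/manifest.json', 'requirements.txt'))
```

A `None` result, as the recorded manifest would produce, means the comparison could not be made and should not be read as a match.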

### 8. Notebook map

One-line purposes for the remaining notebooks in the suite:

- `notebooks/02_Data_Preparation.ipynb`: deterministic pipelines and feature engineering.
- `notebooks/03_Baseline_Models.ipynb`: baseline classifiers and evaluation metrics.
- `notebooks/04_Explainability.ipynb`: SHAP and surrogate model analyses.
- `notebooks/05_LLM_Intent_Emotion.ipynb`: LLM wrapper experiments for intent and emotion detection.
- `notebooks/06_Agents_Orchestration.ipynb`: MasterAgent and worker agent designs and small demos.
- `notebooks/07_Verification_Workflows.ipynb`: KYC and verification pipelines.
- `notebooks/08_Underwriting_Prefect.ipynb`: Prefect flows and policy simulations.
- `notebooks/09_Sanction_Generator.ipynb`: jinja2 templates, audit trails, and signed artifacts.
- `notebooks/10_Experiments_and_Results.ipynb`: experiment records, metrics tables, and plots for reproducibility.

### 9. Mathematical appendix (KaTeX) and References

This appendix collects the core mathematical objects used across notebooks with brief pointers to classical references so reviewers can follow the theoretical basis used in modeling and evaluation.

**a. Expected Loss (credit risk)**: with probability of default $PD$, loss given default $LGD$, and exposure at default $EAD$,

$$EL = PD \times LGD \times EAD$$

(See: Basel Committee on Banking Supervision conceptual guidance on credit risk.)
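
As a worked illustration of the expected-loss product, the sketch below evaluates it for a single loan. The function name and the input values (2% PD, 45% LGD, INR 500,000 exposure) are made up for the example and do not come from the project's data.

```python
def expected_loss(pd_, lgd, ead):
    """Expected loss EL = PD * LGD * EAD for a single exposure."""
    if not 0.0 <= pd_ <= 1.0:
        raise ValueError('PD must be a probability in [0, 1]')
    if not 0.0 <= lgd <= 1.0:
        raise ValueError('LGD must be a fraction in [0, 1]')
    if ead < 0:
        raise ValueError('EAD must be non-negative')
    return pd_ * lgd * ead


# Illustrative inputs only: 2% default probability, 45% loss severity.
el = expected_loss(0.02, 0.45, 500_000)
print(round(el, 2))
```

With these inputs the expected loss comes to about INR 4,500.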

**b. Logistic regression (likelihood)**: for labels $y \in \{0,1\}$, features $x$, and parameter vector $\beta$,

$$p(x) = \sigma(x^\top \beta) = \frac{1}{1 + e^{-x^\top \beta}}$$

Negative log-likelihood with L2 regularization (strength $\lambda$):

$$L(\beta) = -\sum_i \left[ y_i \log p(x_i) + (1-y_i) \log(1-p(x_i)) \right] + \frac{\lambda}{2} \lVert \beta \rVert_2^2$$

(See: Hosmer, Lemeshow & Sturdivant, Applied Logistic Regression.)
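
To make the penalized objective concrete, here is a small pure-Python sketch that evaluates $L(\beta)$ term by term. The helper names and the toy data are assumptions for illustration; the baseline notebooks may well use a library implementation instead.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def penalized_nll(beta, X, y, lam):
    """Negative log-likelihood plus (lam / 2) * ||beta||^2."""
    eps = 1e-12  # clip probabilities to keep log() finite
    nll = 0.0
    for x_i, y_i in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, x_i)))
        p = min(max(p, eps), 1.0 - eps)
        nll -= y_i * math.log(p) + (1 - y_i) * math.log(1.0 - p)
    return nll + 0.5 * lam * sum(b * b for b in beta)


# Toy data: first column acts as an intercept term.
X = [[1.0, 0.5], [1.0, -1.0]]
y = [1, 0]
print(penalized_nll([0.0, 0.0], X, y, lam=1.0))
```

With $\beta = 0$ every prediction is 0.5 and the penalty vanishes, so the value is $2 \log 2 \approx 1.386$, a convenient sanity check.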

**c. Brier score**: $BS = \frac{1}{N} \sum_{i=1}^N (p_i - y_i)^2$

(Commonly used for probabilistic forecasts; see Brier (1950).)
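
A direct transcription of the Brier score, offered as a sketch with made-up forecasts and labels:

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 labels."""
    if len(probs) != len(labels) or not probs:
        raise ValueError('probs and labels must be non-empty and equal length')
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Illustrative forecasts only.
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 1]
print(round(brier_score(probs, labels), 4))
```

Here the squared errors are 0.01, 0.04, and 0.16, giving a score of about 0.07; lower is better, and a constant forecast of 0.5 always scores 0.25.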

**d. POMDP formalism (MasterAgent)**: state $s$, action $a$, transition model $T(s' \mid s,a)$, observation $o$ with observation model $O(o \mid s',a)$, and belief $b(s)$ with Bayesian update:

$$b'(s') \propto O(o \mid s',a) \sum_s T(s' \mid s,a)\, b(s)$$

(See: Puterman, Markov Decision Processes; Kaelbling et al. for POMDP overview.)
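
The belief update can be sketched for a tiny discrete model. Everything in this example (the two states, the single action, the observation labels, and all probabilities) is invented for illustration and is not taken from the MasterAgent design.

```python
def belief_update(belief, action, observation, T, O):
    """One Bayesian belief update for a discrete POMDP.

    T[(s, a)][s2] holds T(s2 | s, a); O[(s2, a)][o] holds O(o | s2, a).
    """
    states = list(belief)
    unnormalized = {}
    for s2 in states:
        # Prediction step: sum_s T(s2 | s, a) * b(s)
        predicted = sum(T[(s, action)].get(s2, 0.0) * belief[s] for s in states)
        # Correction step: weight by the observation likelihood
        unnormalized[s2] = O[(s2, action)].get(observation, 0.0) * predicted
    z = sum(unnormalized.values())
    if z == 0.0:
        raise ValueError('observation has zero probability under the model')
    return {s: v / z for s, v in unnormalized.items()}


# Hypothetical two-state customer model.
T = {
    ('calm', 'ask'): {'calm': 0.8, 'hesitant': 0.2},
    ('hesitant', 'ask'): {'calm': 0.3, 'hesitant': 0.7},
}
O = {
    ('calm', 'ask'): {'pause': 0.1, 'reply': 0.9},
    ('hesitant', 'ask'): {'pause': 0.6, 'reply': 0.4},
}
prior = {'calm': 0.5, 'hesitant': 0.5}
posterior = belief_update(prior, 'ask', 'pause', T, O)
print(posterior)
```

Observing a pause shifts most of the probability mass to the `hesitant` state, since that state makes a pause six times more likely in this toy model.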

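**e. Pause detection (exponential smoothing)**: given inter-utterance intervals $t_k$ and a smoothing factor $\alpha \in (0, 1]$,

$$\hat{t}_k = \alpha t_k + (1-\alpha) \hat{t}_{k-1}$$

Pause rule: trigger if $t_k > \mu + c \cdot \sigma$ for a chosen constant $c$, where $\mu$ and $\sigma$ denote the mean and standard deviation of the observed intervals.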
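
A minimal sketch of the smoothing recursion and the pause rule follows. Two choices here are assumptions not stated in the text: the recursion is initialized with $\hat{t}_0 = t_0$, and $\mu$ and $\sigma$ are estimated as the mean and sample standard deviation of the intervals seen before $t_k$. The interval values are made up.

```python
import statistics


def smooth_intervals(intervals, alpha):
    """Exponentially smooth intervals; the first estimate equals the first value."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError('alpha must be in (0, 1]')
    smoothed = []
    prev = None
    for t in intervals:
        prev = t if prev is None else alpha * t + (1.0 - alpha) * prev
        smoothed.append(prev)
    return smoothed


def is_pause(t_k, history, c=2.0):
    """Apply t_k > mu + c * sigma using statistics of earlier intervals."""
    if len(history) < 2:
        return False  # not enough data to estimate sigma
    mu = statistics.fmean(history)
    sigma = statistics.stdev(history)
    return t_k > mu + c * sigma


# Illustrative inter-utterance gaps in seconds.
intervals = [1.0, 1.2, 0.9, 1.1, 4.0]
print(smooth_intervals(intervals, alpha=0.5))
print(is_pause(intervals[-1], intervals[:-1]))
```

In this toy sequence the final 4.0 s gap exceeds the threshold built from the first four intervals, so the rule fires; the smoothed series reacts more gradually, which is why the two quantities are useful side by side.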

**References (select):**

- Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review.
- Hosmer, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Wiley.
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming.
- Kaelbling, L. P., Littman, M. L., & Cassandra, A. R. (1998). Planning and acting in partially observable stochastic domains. Artificial Intelligence.
- Basel Committee on Banking Supervision. (consultative papers on credit risk fundamentals).