# InvarLock: Bring Your Own Data (BYOD)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/invarlock/invarlock/blob/main/notebooks/invarlock_custom_datasets.ipynb)

**Purpose:** Run InvarLock on custom/proprietary datasets using offline-capable providers.
**Runtime:** ~5–10 minutes
**Requires:** `invarlock[hf]` (or `invarlock[eval]` + your adapter extra)

This notebook focuses on the `local_jsonl` provider, which reads the dataset from a local file and so needs no network access for data.


In [None]:
!pip -q install "invarlock[hf]"
import sys
import invarlock

print("Python:", sys.version.split()[0])
print("InvarLock:", invarlock.__version__)


In [None]:
!invarlock doctor --json || true


## Create a local JSONL dataset


In [None]:
%%bash
cat > byod.jsonl <<'JSONL'
{"text": "InvarLock BYOD demo: hello world."}
{"text": "Add your own domain text here."}
{"text": "Longer texts improve evaluation stability."}
JSONL
wc -l byod.jsonl
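
Before pointing a provider at your own file, it can help to confirm that every line parses as JSON and carries a non-empty text field. The following is a minimal sketch using only the standard library; `check_jsonl` is a hypothetical helper written for this notebook, not part of InvarLock, and it assumes the one-object-per-line layout with a `text` field shown above.

```python
import json
from pathlib import Path


def check_jsonl(path, text_field="text"):
    """Return (n_records, problems) for a JSONL file with one object per line."""
    problems = []
    n_records = 0
    with open(path, "r", encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
                continue
            # Only JSON objects can carry the text field.
            value = record.get(text_field) if isinstance(record, dict) else None
            if not isinstance(value, str) or not value.strip():
                problems.append(f"line {lineno}: missing or empty '{text_field}'")
                continue
            n_records += 1
    return n_records, problems


# Run against the demo file if it exists in the working directory.
if Path("byod.jsonl").exists():
    count, issues = check_jsonl("byod.jsonl")
    print(f"{count} usable records, {len(issues)} problems")
    for issue in issues:
        print(" ", issue)
```

If `issues` is non-empty, fixing those lines before running certification avoids surprises later in the pipeline.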


## Write a preset that uses `local_jsonl`

`invarlock certify` uses a preset to configure the dataset provider.
This preset is created locally so the notebook works without cloning the repo.


In [None]:
%%bash
cat > byod_preset.yaml <<'YAML'
dataset:
  provider: local_jsonl
  path: byod.jsonl
  text_field: text
  seq_len: 128
  stride: 128
  preview_n: 16
  final_n: 16
eval:
  metric:
    kind: ppl_causal
YAML
python -c "import yaml; yaml.safe_load(open('byod_preset.yaml'))"
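
To get a rough feel for how much text the preset asks for, the window arithmetic can be sketched as below. This is an estimate under stated assumptions, not a description of InvarLock's internals: it assumes windows of `seq_len` tokens are cut by sliding `stride` tokens at a time over one token stream, and that the `preview_n` and `final_n` windows do not overlap. InvarLock's actual windowing, padding, and tokenizer behavior may differ.

```python
def count_windows(n_tokens, seq_len=128, stride=128):
    """Number of full seq_len-token windows when sliding by stride tokens."""
    if n_tokens < seq_len:
        return 0
    return (n_tokens - seq_len) // stride + 1


# Values copied from byod_preset.yaml.
seq_len, stride, preview_n, final_n = 128, 128, 16, 16

# Assumption: preview and final windows are disjoint.
needed = preview_n + final_n

# Smallest token count that yields `needed` full windows.
min_tokens = seq_len + (needed - 1) * stride
print(min_tokens)                                  # 4096
print(count_windows(min_tokens, seq_len, stride))  # 32
```

Under these assumptions, the three-line demo file is far smaller than the preset's window counts imply, which is consistent with the advice that longer texts improve evaluation stability.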


## Run certification on BYOD


In [None]:
%%bash
export INVARLOCK_ALLOW_NETWORK=1 \
       INVARLOCK_DEDUP_TEXTS=1 \
       INVARLOCK_TINY_RELAX=1 \
       TRANSFORMERS_NO_TORCHVISION=1 \
       TOKENIZERS_PARALLELISM=false
invarlock certify \
  --baseline gpt2 \
  --subject  gpt2 \
  --adapter  auto \
  --profile  dev \
  --tier     balanced \
  --preset   byod_preset.yaml \
  --out      runs/byod \
  --cert-out reports/byod


## Verify and inspect provider digest


In [None]:
%%bash
invarlock verify --json reports/byod/evaluation.cert.json


In [None]:
import json

with open('reports/byod/evaluation.cert.json', 'r', encoding='utf-8') as f:
    cert = json.load(f)
print('dataset.provider:', cert.get('dataset', {}).get('provider'))
print('provider_digest:', cert.get('provenance', {}).get('provider_digest'))
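
For your own records, you may also want a content hash of the exact file that was certified. The sketch below streams `byod.jsonl` through SHA-256 with the standard library. It is an independent record only: this notebook does not specify how `provider_digest` is computed, so the two values should not be expected to match. `sha256_file` is a hypothetical helper, not an InvarLock function.

```python
import hashlib
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large datasets do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if Path("byod.jsonl").exists():
    print("byod.jsonl sha256:", sha256_file("byod.jsonl"))
```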


## Offline / air-gapped notes

- `local_jsonl` reads the dataset from disk, so the data side needs no network access.
- For offline model loading, point `--baseline` and `--subject` at local model directories.
- For Hugging Face `datasets` providers, populate the cache once while online, then set `HF_DATASETS_OFFLINE=1`.
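
The notes above can be sketched as a small helper that prepares an offline environment for a subprocess. `HF_DATASETS_OFFLINE` comes from the notes; `HF_HUB_OFFLINE` and `TRANSFORMERS_OFFLINE` are the corresponding Hugging Face switches for the Hub client and `transformers`. Treating an unset `INVARLOCK_ALLOW_NETWORK` as "network disallowed" is an assumption, as is the `config.json` heuristic for recognizing a local model directory. Both helpers are hypothetical and not part of InvarLock.

```python
import os


def offline_env(base=None):
    """Return a copy of the environment with Hugging Face offline switches set."""
    env = dict(os.environ if base is None else base)
    env["HF_DATASETS_OFFLINE"] = "1"   # from the notes above
    env["HF_HUB_OFFLINE"] = "1"        # Hugging Face Hub client: no network calls
    env["TRANSFORMERS_OFFLINE"] = "1"  # transformers: use only cached/local files
    # Assumption: leaving INVARLOCK_ALLOW_NETWORK unset keeps InvarLock offline.
    env.pop("INVARLOCK_ALLOW_NETWORK", None)
    return env


def is_local_model_dir(path):
    """Heuristic: a local Hugging Face model directory contains config.json."""
    return os.path.isfile(os.path.join(path, "config.json"))
```

The returned dictionary can be passed as `env=offline_env()` to `subprocess.run` when invoking `invarlock certify` with local model directories.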

Related docs:

- `docs/user-guide/bring-your-own-data.md`
- `docs/reference/datasets.md`
