Hello everyone!

## Winning Imports

Figuring out which packages to install and which imports to use was extremely difficult, mainly because we needed a data collator that does a very niche thing that is not built in. That meant using TRL's `DataCollatorForCompletionOnlyLM`. I will explain this in more detail later; for now, here is the install code that worked well.

```
%%capture
import os
os.environ["UNSLOTH_VLLM_STANDBY"] = "1" # [NEW] Extra 30% context lengths!
!pip install --upgrade -qqq uv
try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
except: get_numpy = "numpy"
try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
except: is_t4 = False
get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
!uv pip install -qqq --upgrade     unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers
!uv pip install -qqq {get_triton}
!uv pip install "huggingface_hub>=0.34.0" "datasets>=3.4.1,<4.0"
!uv pip install transformers==4.55.4
!uv pip install "trl==0.9.6"
```

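Now you might be wondering what all of this does. In short, the cell checks whether the GPU is a Tesla T4 and, if so, pins `vllm` and `triton` to specific versions. It reuses the already-installed `numpy` version so the install does not swap it out, and it pins `transformers` and `trl`. The `trl==0.9.6` pin is the important one, since that release includes `DataCollatorForCompletionOnlyLM`.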

## Data Creation

Data creation is an interesting case because we need to do several things. First, we need to process each task into a form that we intuitively think LLMs will have an easier time understanding.

The original grid-world problems are stored as JSON, keyed by unique task IDs.

For each task, the task data looks like this:
```
{test: [{input: gridArray}, ...],
 train: [
    {input: gridArray,
     output: gridArray},
    {input: gridArray,
     output: gridArray}, ...
 ]}
```
and then the solution data looks like this:
```
[gridArray]
```
Note: some tasks have multiple test inputs, and then the solution list has one output grid per test input.
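
To make the pairing concrete, here is a minimal sketch of joining one task with its solutions. The task ID `task_a`, the tiny grids, and the helper name `pair_task` are made up for illustration, and it assumes `test` is a list so that multiple test inputs line up with the solution grids by position.

```python
import json

# Hypothetical miniature of the challenges and solutions files (real grids are larger).
challenges = json.loads("""
{"task_a": {"train": [{"input": [[0, 1]], "output": [[1, 0]]}],
            "test":  [{"input": [[1, 1]]}, {"input": [[0, 0]]}]}}
""")
solutions = json.loads('{"task_a": [[[1, 1]], [[0, 0]]]}')


def pair_task(task, task_solutions):
    """Attach the i-th solution grid to the i-th test input."""
    tests = task["test"]
    if len(tests) != len(task_solutions):
        raise ValueError("number of test inputs and solutions differ")
    return {
        "train": task["train"],
        "test": [
            {"input": t["input"], "output": sol}
            for t, sol in zip(tests, task_solutions)
        ],
    }


paired = {tid: pair_task(task, solutions[tid]) for tid, task in challenges.items()}
print(paired["task_a"]["test"][1])
```

Pairing by position keeps the two files in sync without needing any extra keys, and the length check fails loudly if a task and its solutions ever get out of step.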

Now we need to transform this into something that is less messy and easier for the model to read, and that has the proper format for our training data loaders to use.

Note: we already know what we want the text to look like, so what remains is figuring out what our data loader expects.
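
As a rough illustration of that transformation, the sketch below turns grids into compact text and builds a prompt/completion pair. The `I`/`O` markers, the one-digit-per-cell layout, and the function names are assumptions made for this example, not necessarily the format used in the final pipeline.

```python
def grid_to_text(grid):
    """Render a grid as one line of digits per row."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)


def task_to_example(train_pairs, test_input, test_output=None):
    """Build a prompt/completion pair; the I/O markers are hypothetical."""
    parts = []
    for pair in train_pairs:
        parts.append(
            "I\n" + grid_to_text(pair["input"]) + "\n"
            "O\n" + grid_to_text(pair["output"]) + "\n"
        )
    # The prompt ends right after the output marker, so the model only has to
    # continue with the answer grid.
    prompt = "".join(parts) + "I\n" + grid_to_text(test_input) + "\nO\n"
    completion = None if test_output is None else grid_to_text(test_output) + "\n"
    return {"prompt": prompt, "completion": completion}


example = task_to_example(
    train_pairs=[{"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]}],
    test_input=[[1, 1], [0, 0]],
    test_output=[[0, 0], [1, 1]],
)
print(example["prompt"] + example["completion"])
```

Ending the prompt on a fixed marker matters later: a completion-only collator needs a stable string to search for when deciding where the answer begins.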

## Data Collator

This was a very difficult part of the project.
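
The niche thing a completion-only collator does is compute the loss on the answer tokens only. The sketch below shows that idea in plain Python with made-up token IDs; it is a simplified illustration of the masking step, not TRL's actual implementation, and the function names are hypothetical.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def find_sublist(haystack, needle):
    """Return the start index of the first occurrence of needle, or -1."""
    for start in range(len(haystack) - len(needle) + 1):
        if haystack[start:start + len(needle)] == needle:
            return start
    return -1


def mask_prompt_tokens(input_ids, response_template_ids):
    """Copy input_ids to labels, ignoring everything up to and including the template."""
    labels = list(input_ids)
    start = find_sublist(input_ids, response_template_ids)
    if start == -1:
        # Template not found: ignore the whole example rather than train on the prompt.
        return [IGNORE_INDEX] * len(labels)
    end = start + len(response_template_ids)
    labels[:end] = [IGNORE_INDEX] * end
    return labels


# Made-up token IDs: [prompt..., template (7, 8), completion...]
input_ids = [11, 12, 13, 7, 8, 21, 22, 2]
print(mask_prompt_tokens(input_ids, response_template_ids=[7, 8]))
# -> [-100, -100, -100, -100, -100, 21, 22, 2]
```

One practical consequence: the response template has to tokenize to the same IDs in context as it does on its own, otherwise the search fails and the example contributes no loss.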

Okay, what am I trying to do?

* I want to get the fine-tuning pipeline built using multiple GPUs.
* I want to use fairly basic configs so they are simple for me to understand.
* I want a clear file system that is easy to navigate.
* The end goal is to have a fine-tuned model uploaded to my Hugging Face account.
* I think I will have to merge the weights together before uploading them to Hugging Face.
* I need the data to be created and then sent through the pipeline. I want to do DDP.
* I should figure out whether the data is created first and then handed to the data collator, or something else.
* So I need to set up data infra, training infra, and merging-and-uploading infra.



```
/shared/arc/
├── data/                       # read-only datasets (everyone can read)
│   └── ARC-Data/
│       └── input/
│           ├── arc-prize-2024/
│           │   ├── arc-agi_evaluation_challenges.json
│           │   └── arc-agi_evaluation_solutions.json
│           ├── re_arc/
│           │   ├── metadata.json
│           │   └── tasks/
│           │       ├── <task-id-1>.json
│           │       └── ...
│           └── arc-dataset-collection/
│               └── dataset/
│                   └── ConceptARC/
│                       └── data/
│                           ├── group-1/*.json
│                           └── ...
├── outputs/                    # shared, writable (checkpoints, logs, merged models)
│   ├── runs/                   # per-run folders
│   ├── checkpoints/
│   ├── logs/
│   └── models/
│       └── Llama-3/            # where your script saves LoRA + merged
└── cache/
    ├── hf/                     # shared HF cache (optional)
    └── ds/                     # optional DeepSpeed NVMe/AIO cache (if used)

main/
├── code/
│   └── arc-trainer/            # your git repo (your Python package + scripts)
│       ├── train_v1.py
│       ├── ds_configs/
│       │   ├── ds_ze[email protected]          # deepspeed jsons
│       │   └── ds_zero2.json
│       ├── configs/
│       │   └── training.yaml   # hyperparams you may want in YAML
│       ├── scripts/
│       │   ├── run_slurm.sh    # SLURM launcher
│       │   └── run_local.sh    # single-node launcher
│       └── README.md
└── .cache/
    └── huggingface/            # per-user HF cache (if not using shared)

/scratch/$USER/                 # fast temp (per-job)
└── arc-run-<jobid>/            # created by run script at runtime
    ├── offload/                # DS CPU/NVMe offload (if enabled)
    └── tmp/                    # temp files
```



# SLURM / cluster issues

* **Invalid partition (`general`)**

  * **Symptom:** `sbatch: error: invalid partition specified: general`
  * **Fix:** Use the cluster’s real GPU queue: `#SBATCH -p GPU-shared` and include `#SBATCH -A cis250063p`.
  * **Why:** You requested a queue that doesn’t exist on this system; SLURM rejects the job.

# Conda / environment

* **Conda activation path wrong**

  * **Symptom:** `No such file: .../miniconda3/etc/profile.d/conda.sh`, `CondaError: Run 'conda init' before 'conda activate'`
  * **Fix:** Use the right init: `eval "$($HOME/miniconda3/bin/conda shell.bash hook)"` → `conda activate arc-env`.
  * **Why:** You sourced a non-existent path; using the shell hook lets `conda activate` work non-interactively.

* **`torchrun: command not found`**

  * **Fix:** Activate the env that has PyTorch (`arc-env`) before launching.
  * **Why:** `torchrun` is provided by the PyTorch package in your env.

# CUDA / DeepSpeed

* **CUDA module not found**

  * **Symptom:** `The following module(s) are unknown: "cuda/12.1"`
  * **Fix:** Load an available one (e.g., `module load cuda/12.4.0`).
  * **Why:** Only specific CUDA versions are installed as modules.

* **DeepSpeed tries to compile ops (CUDA\_HOME missing)**

  * **Symptoms:**

    * `MissingCUDAException: CUDA_HOME does not exist, unable to compile CUDA op(s)`
    * `.../cpu_adam.so: cannot open shared object file`
  * **Fix (any of these):**

    * **Simplest:** uninstall DeepSpeed: `pip uninstall deepspeed`.
    * Or keep DS inert: `export DS_BUILD_OPS=0 DS_SKIP_CUDA_CHECK=1 DS_BUILD_AIO=0`.
    * Or fully support DS: `module load cuda/12.4.0`, set `CUDA_HOME`, and (if needed) `TORCH_EXTENSIONS_DIR` to a writable per-job scratch; avoid CPU offload or use Torch’s AdamW.
  * **Why:** DS imports trigger a JIT build of CUDA/C++ extensions; without a matching CUDA toolkit (and build env), import fails. Disabling or uninstalling DS prevents the JIT path.

# Multi-GPU / DDP specifics

* **Invalid device ordinal**

  * **Symptom:** `RuntimeError: CUDA error: invalid device ordinal`
  * **Fix:** Match ranks to GPUs (`torchrun --nproc_per_node=2` when you actually have 2), and don’t constrain `CUDA_VISIBLE_DEVICES` incorrectly.
  * **Why:** A rank tried to set `cuda:N` where `N` didn’t exist/was masked.

* **4/8-bit model can’t move devices**

  * **Symptom:** `You can't train a model loaded in 8-bit or 4-bit precision on a different device...`
  * **Fix:** Load *per-rank* on its GPU and never move it:

    ```python
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    model = AutoModelForCausalLM.from_pretrained(..., quantization_config=..., device_map={"": local_rank})
    ```
  * **Why:** bitsandbytes quantized weights aren’t transferrable between devices after load; each rank must load onto its own GPU.

* **DDP + gradient checkpointing + LoRA crash**

  * **Symptom:** `RuntimeError: Expected to mark a variable ready only once... lora_B...weight marked ready twice`
  * **Fixes:**

    * Use **non-reentrant** checkpointing:
      `TrainingArguments(..., gradient_checkpointing=True, gradient_checkpointing_kwargs={"use_reentrant": False})`
      or `model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": False})`
    * Enable GC **once** (don’t also pass `use_gradient_checkpointing=True` to `prepare_model_for_kbit_training` if you call `.gradient_checkpointing_enable()` later).
    * `ddp_find_unused_parameters=False`.
    * Ensure LoRA wrap (`get_peft_model`) happens **once**.
  * **Why:** Reentrant checkpointing can re-enter the same module during backward; DDP then sees the same param twice in one step.
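
For the invalid device ordinal case above, a small pre-flight check can fail with a readable message before CUDA does. This is a sketch under simplifying assumptions: the function names are hypothetical, and `CUDA_VISIBLE_DEVICES` is assumed to be a plain comma-separated list of devices.

```python
def visible_gpu_count(env, physical_gpus):
    """Count GPUs a process can see, honoring CUDA_VISIBLE_DEVICES if it is set."""
    masked = env.get("CUDA_VISIBLE_DEVICES")
    if masked is None:
        return physical_gpus
    return len([dev for dev in masked.split(",") if dev.strip()])


def check_local_rank(env, physical_gpus):
    """Return LOCAL_RANK as an int, or raise if no visible GPU matches it."""
    local_rank = int(env.get("LOCAL_RANK", "0"))
    visible = visible_gpu_count(env, physical_gpus)
    if local_rank >= visible:
        raise RuntimeError(
            f"LOCAL_RANK={local_rank} but only {visible} GPU(s) are visible; "
            "lower --nproc_per_node or fix CUDA_VISIBLE_DEVICES"
        )
    return local_rank


# Stand-in for os.environ on a 2-GPU node launched with --nproc_per_node=2.
fake_env = {"LOCAL_RANK": "1", "CUDA_VISIBLE_DEVICES": "0,1"}
print(check_local_rank(fake_env, physical_gpus=2))
```

In a real training script the same comparison can be made against `torch.cuda.device_count()`, which already accounts for `CUDA_VISIBLE_DEVICES`.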

# Transformers / TRL warnings (harmless, but you cleaned them up)

* **`use_cache=True` incompatible with gradient checkpointing**

  * **Fix:** `model.config.use_cache = False` (or let the auto-toggle happen).
  * **Why:** Checkpointing re-computes forward; cached KV states conflict.

* **Pad/EOS token warnings**

  * **Fix:**

    ```python
    if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = tokenizer.pad_token_id
    model.config.eos_token_id = tokenizer.eos_token_id
    ```
  * **Why:** Ensures consistent special tokens for collation/loss masking.

# Paths / outputs / caches

* **Mixed paths (relative vs absolute; wrong base path)**

  * **Fix:** Use absolute paths under your project (e.g., `/ocean/.../shared/arc/...`) for outputs, logs, and caches; create directories up front.
  * **Why:** SLURM jobs start in arbitrary working dirs; absolute paths avoid surprises and ensure visibility across nodes.
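
A minimal sketch of creating the directories up front, using absolute paths only. The subfolder names follow the tree shown earlier; the per-run subfolders and the helper name `make_run_dirs` are assumptions for illustration, and the demo runs against a temporary directory rather than the real shared root.

```python
import tempfile
from pathlib import Path


def make_run_dirs(base, run_name):
    """Create output, log, and cache directories under an absolute base path."""
    base = Path(base).expanduser().resolve()
    dirs = {
        "run": base / "outputs" / "runs" / run_name,
        "checkpoints": base / "outputs" / "checkpoints" / run_name,
        "logs": base / "outputs" / "logs" / run_name,
        "hf_cache": base / "cache" / "hf",
    }
    for path in dirs.values():
        # parents=True builds missing parents; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
    return dirs


# Demo against a temporary directory; on the cluster, base would be the shared project root.
with tempfile.TemporaryDirectory() as tmp:
    dirs = make_run_dirs(tmp, "demo-run")
    all_absolute = all(p.is_absolute() and p.is_dir() for p in dirs.values())
    print(all_absolute)
```

Resolving the base path once and deriving everything from it means the job behaves the same no matter which working directory SLURM starts it in.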

# Monitoring / tmux


* **Pane “frozen” / just a viewer**

  * **Symptoms:** Keys don’t appear; or stuck in copy/`watch`.
  * **Fix:** `Ctrl-C` (stop `watch`), `q`/`Esc` (exit copy mode or `less`), or open a new tmux window (`Ctrl-b c`). If terminal flow control froze the pane (for example after an accidental `Ctrl-S`), press `Ctrl-Q`.
  * **Why:** You were in a program/copy mode that captures input, not a regular shell.

# Performance / utilization sanity

* **Zero GPU utilization**

  * **Symptom:** `nvidia-smi` shows \~0% util, \~1 MiB used.
  * **Fix:** The job has crashed or ended. Relaunch and monitor logs.
  * **Why:** No process bound to the GPUs.

* **Healthy training**

  * **Symptom:** `nvidia-smi` shows >80% util, tens of GB memory used on both GPUs.
  * **Why:** Both ranks are running forward/backward as expected.
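
The idle-versus-healthy check can be automated by parsing the output of `nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv,noheader,nounits`. The sketch below works on sample strings so it runs without a GPU; the thresholds and sample numbers are made up, and the function names are hypothetical.

```python
def parse_gpu_stats(csv_text):
    """Parse 'util, mem' lines (percent, MiB) into a list of dicts, one per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, mem = (field.strip() for field in line.split(","))
        stats.append({"util_pct": int(util), "mem_mib": int(mem)})
    return stats


def looks_idle(stats, util_threshold=5, mem_threshold_mib=100):
    """True if every GPU is below both thresholds, i.e. nothing is running."""
    return all(
        s["util_pct"] < util_threshold and s["mem_mib"] < mem_threshold_mib
        for s in stats
    )


# Made-up samples shaped like the csv,noheader,nounits output for two GPUs.
sample_idle = "0, 1\n0, 1\n"
sample_busy = "97, 41250\n95, 40980\n"

print(looks_idle(parse_gpu_stats(sample_idle)))
print(looks_idle(parse_gpu_stats(sample_busy)))
```

To use it on a live node, the sample strings would be replaced by the captured stdout of the `nvidia-smi` command above.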




**Stage 0** (format/IO warm-up): very small set of clean, short synthetic RE-ARC samples (small grids, e.g., sizes=[3]), heavy augmentation on rotations/transpositions to force the IO format to “click”. Keep sequences short and consistent with your fmt_opts and collator templates.

**Stage 1** (easy→medium structure): larger slice of RE-ARC with a mix of small/medium sizes; start to mix in ConceptARC but cap its ratio so the model still gets lots of easy pattern wins. Keep augmentation but reduce aggressiveness a bit.

**Stage 2** (harder distributional match): increase ConceptARC share and introduce harder RE-ARC generators or bigger sizes=[3,4,5,6] (whatever “hard” is in your pipeline). Consider a small rehearsal buffer (5–20%) of Stage-1 data to avoid forgetting.

**Stage 3** (final polish): use a dev-like mix (closest to your intended eval distribution) with minimal augmentation (preserve signal). Tight LR, short epochs, early-stop on dev solve-rate.
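
One way to keep the stages explicit is a small config list plus a helper that turns a mixture into sample counts. Every percentage and size list below is a placeholder chosen for illustration (only `sizes=[3]` and `sizes=[3,4,5,6]` come from the notes above), and the source names such as `stage1_rehearsal` and `dev_like` are hypothetical.

```python
# All percentages and most size lists are placeholders, not tuned values.
STAGES = [
    {"name": "stage0_format", "sizes": [3], "mix_pct": {"re_arc": 100}},
    {"name": "stage1_easy_medium", "sizes": [3, 4],
     "mix_pct": {"re_arc": 80, "concept_arc": 20}},
    {"name": "stage2_harder", "sizes": [3, 4, 5, 6],
     "mix_pct": {"re_arc": 40, "concept_arc": 50, "stage1_rehearsal": 10}},
    {"name": "stage3_polish", "sizes": [3, 4, 5, 6], "mix_pct": {"dev_like": 100}},
]


def samples_per_source(mix_pct, total):
    """Split total samples by integer percentages; leftovers go to the largest source."""
    if sum(mix_pct.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {name: total * pct // 100 for name, pct in mix_pct.items()}
    largest = max(mix_pct, key=mix_pct.get)
    counts[largest] += total - sum(counts.values())
    return counts


for stage in STAGES:
    print(stage["name"], samples_per_source(stage["mix_pct"], total=1000))
```

Integer percentages avoid floating-point rounding surprises, and the rehearsal buffer in Stage 2 is just another named source in the mix.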

The last step uploads the LoRA adapter and the merged model to the Hugging Face Hub:

````bash
python - <<'PY'
import os, textwrap
from huggingface_hub import create_repo, upload_folder, whoami

# The token comes from the environment; whoami() resolves the account name for the repo IDs.
token = os.environ.get("HUGGINGFACE_HUB_TOKEN", None)
user = whoami(token=token)["name"]

base_model = "chuanli11/Llama-3.2-3B-Instruct-uncensored"
LORA_DIR   = os.environ["LORA_DIR"]
MERGED_DIR = os.environ["MERGED_DIR"]

# NOTE: the repo names say "mistral-8b" although base_model is a Llama 3.2 3B checkpoint.
lora_repo_id   = f"{user}/arc-lora-mistral-8b-full-rearc-concept"
merged_repo_id = f"{user}/arc-merged-mistral-8b-full-rearc-concept"

# Create private repos (flip to public later in the UI if you want)
create_repo(lora_repo_id,   repo_type="model", private=True, exist_ok=True, token=token)
create_repo(merged_repo_id, repo_type="model", private=True, exist_ok=True, token=token)

# Minimal model cards: YAML front matter first, then the Markdown body
lora_readme = f"""\
---
base_model: {base_model}
library_name: peft
pipeline_tag: text-generation
tags:
- lora
- qlora
- causal-lm
- arc
---
# ARC LoRA Adapter for Llama 3.2 3B

This repo contains the **LoRA adapter only**.

## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
tok = AutoTokenizer.from_pretrained("{base_model}")
base = AutoModelForCausalLM.from_pretrained("{base_model}", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "{lora_repo_id}")
```
"""

merged_readme = f"""\
---
base_model: {base_model}
pipeline_tag: text-generation
tags:
- merged-weights
- causal-lm
- arc
---
# ARC Fine-tuned Llama 3.2 3B (Merged)

This repo contains the **merged weights** (base + LoRA fused).

## Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
tok = AutoTokenizer.from_pretrained("{merged_repo_id}")
model = AutoModelForCausalLM.from_pretrained("{merged_repo_id}", torch_dtype="auto", device_map="auto")
```
"""

# Write READMEs locally so they get uploaded
with open(os.path.join(LORA_DIR, "README.md"), "w") as f:
    f.write(textwrap.dedent(lora_readme))
with open(os.path.join(MERGED_DIR, "README.md"), "w") as f:
    f.write(textwrap.dedent(merged_readme))

print("Uploading LoRA adapter...")
upload_folder(
    repo_id=lora_repo_id,
    folder_path=LORA_DIR,
    repo_type="model",
    token=token,
    commit_message="Upload LoRA adapter",
    # Skip run folders, logs, raw .pt checkpoints, and temp files
    ignore_patterns=["**/runs/**", "**/logs/**", "**/*.pt", "**/tmp_*"],
)

print("Uploading merged model...")
upload_folder(
    repo_id=merged_repo_id,
    folder_path=MERGED_DIR,
    repo_type="model",
    token=token,
    commit_message="Upload merged model",
    ignore_patterns=["**/runs/**", "**/logs/**", "**/tmp_*"],
)

print("\nDone.")
print("LoRA repo:", lora_repo_id)
print("Merged repo:", merged_repo_id)
PY
````