ChatBench simulator fine-tune project.
├── data/                              # released separately: raw ChatBench JSONL (train + val + test)
│   ├── finetuning_experiments.ipynb   # data-prep notebook
│   ├── prepare_splits.py              # data-prep script (see next section)
│   ├── split_0/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── split_1/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── ... up to ...
│   └── split_9/
│       ├── train.jsonl
│       └── test.jsonl
├── src/
│   ├── finetune.py                    # training script (model-specific examples provided as finetune_<model_name>.py)
│   ├── eval_chatbench.py              # ChatBench paper metrics
│   └── eval_ppl.py                    # perplexity evaluation script (model-specific examples provided as eval_ppl_<model_name>.py)
├── configs/
│   └── training_args.json             # optional HF Trainer config
├── results/
│   ├── logs/<model>/                  # TensorBoard / Accelerate logs
│   ├── models/<model>/                # checkpoints & model weights
│   └── ppl/                           # perplexity results
├── .env                               # Hugging Face token (HF_TOKEN=<yourtoken>)
└── README.md
sudo apt update
sudo apt install -y git wget build-essential
sudo apt install -y python3 python3-venv python3-pip
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install --upgrade pip setuptools
pip install -r requirements.txt
python - <<EOF
import torch
print(torch.cuda.is_available(), torch.cuda.device_count())
EOF
# expected output: True 4
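The scripts need the Hugging Face token from .env to download gated models such as Llama 3. One way to make it available in Python (this assumes python-dotenv is installed; adjust if your scripts read the token differently):
# sketch: read HF_TOKEN from .env and log in to the Hugging Face Hub
import os
from dotenv import load_dotenv          # python-dotenv; an assumption, not pinned in this README
from huggingface_hub import login

load_dotenv()                           # picks up HF_TOKEN=<yourtoken> from .env
login(token=os.environ["HF_TOKEN"])     # required for gated models such as Llama 3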
- Download the data from https://huggingface.co/datasets/microsoft/ChatBench
  (ChatBench/conversations.json, etc.)
- Prepare the train data splits (N folds)
use the process from the notebook data/finetuning_experiments.ipynb to obtain the ChatBench data (e.g. mmlu_interactive_full.jsonl) in the format:
{"messages":[
{"role":"system","content":"Clippy is a factual chatbot that is also sarcastic."},
{"role":"user","content":"Who discovered Antarctica?"},
{"role":"assistant","content":"Some chaps named Fabian…"}
]}
Split questions into train/test (10 random splits):
python data/prepare_splits.py   # run once per split; repeat with a different random seed on each run to create split_0/, split_1/, ..., split_9/
Make sure the converted data (train.jsonl, test.jsonl, split.jsonl) ends up in the corresponding folder for each split, i.e. data/split_0/, data/split_1/, ..., data/split_9/.
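prepare_splits.py ships with the repo; as a rough illustration of the kind of question-level splitting it performs (the field name "question_id", the 90/10 ratio, and the file paths here are assumptions):
# rough sketch only; field names, ratio, and paths are assumptions
import json, os, random

SEED = 0                                   # change per run to produce split_0, split_1, ...
random.seed(SEED)

with open("data/mmlu_interactive_full.jsonl") as f:
    rows = [json.loads(line) for line in f]

# split at the question level so no question appears in both train and test
qids = sorted({row["question_id"] for row in rows})
random.shuffle(qids)
train_ids = set(qids[: int(0.9 * len(qids))])

out_dir = f"data/split_{SEED}"
os.makedirs(out_dir, exist_ok=True)
with open(f"{out_dir}/train.jsonl", "w") as tr, open(f"{out_dir}/test.jsonl", "w") as te:
    for row in rows:
        (tr if row["question_id"] in train_ids else te).write(json.dumps(row) + "\n")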
DistilGPT-2 expects input as prompt–completion pairs. We need to turn each JSONL “messages” list into a single string prompt, with the assistant’s content as the “completion.” But since ChatBench defines two tasks:
Task 1: generate the first user utterance
Task 2: given conversation so far (system + alternating roles up to last assistant turn), generate next user utterance or final “Answer: ”
We follow the same format that Azure OpenAI expects for chat fine-tuning:
{"prompt": "<FULL CHAT CONTEXT>\nuser:", "completion": "<NEXT USER UTTERANCE>"}
- Choose a split
export SPLIT_ID=split_1
- Launch multi-GPU training via accelerate
First configure:
accelerate config
- Choose “multi-GPU” (no DeepSpeed needed if you only want a simple setup).
- Pick a default location for state.
- Set fp16=True if your GPU supports mixed precision (e.g. RTX A6000).
Then launch:
accelerate launch src/finetune.py <model_name>
(Use our src/finetune_<model_name>.py version to replicate our released model weights.)
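For orientation, a heavily simplified, hypothetical sketch of the kind of Trainer-based loop src/finetune.py runs; the real script additionally configures LoRA adapters and the settings from configs/training_args.json, so treat the hyperparameters below as assumptions:
# heavily simplified sketch; see src/finetune_<model_name>.py for the real settings
import os, sys
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = sys.argv[1]                              # e.g. "distilgpt2"
split_id = os.environ.get("SPLIT_ID", "split_1")

tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

ds = load_dataset("json", data_files={
    "train": f"data/{split_id}/train.jsonl",
    "test":  f"data/{split_id}/test.jsonl",
})

def tokenize(example):
    msgs = example["messages"]
    # everything before the last turn is the prompt; the last (assistant) turn is the completion
    prompt = "\n".join(f"{m['role']}: {m['content']}" for m in msgs[:-1]) + "\nuser:"
    return tok(prompt + " " + msgs[-1]["content"] + tok.eos_token,
               truncation=True, max_length=512)

ds = ds.map(tokenize, remove_columns=["messages"])

args = TrainingArguments(
    output_dir=f"results/models/{model_name}/{split_id}",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    fp16=True,
    logging_dir=f"results/logs/{model_name}",
    eval_strategy="epoch",    # called evaluation_strategy in older transformers releases
)
Trainer(model=model, args=args,
        train_dataset=ds["train"], eval_dataset=ds["test"],
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False)).train()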
Look at the end of each training run’s stdout: the Trainer prints {'eval_loss': …} after each epoch. You can also open TensorBoard pointed at logs/<model>/:
tensorboard --logdir logs/
export SPLIT_ID=split_1
python src/eval_ppl.py <model_name>
(Use our src/eval_ppl_<model_name>.py version to replicate our released model evaluation.)
(You can use run_all_train_ppl.sh and the other model-specific shell scripts to run the full train + perplexity pipeline for each model.)
Run run_all_train_ppl.sh to fine-tune and record validation perplexities:
chmod +x run_all_train_ppl*.sh
./run_all_train_ppl_distilgpt2.sh distilgpt2
# More models:
# Mistral 7B
./run_all_train_ppl_mistral.sh mistralai/Mistral-7B-v0.1
# Llama 3 8B
./run_all_train_ppl_llama3.sh meta-llama/Meta-Llama-3-8B
In a new terminal, monitor the GPU load:
watch -n 1 nvidia-smi
If you don't want the header repeated, you can monitor GPU load by:
nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.total,memory.used \
--format=csv --loop-ms=1000
Note: please use this notebook as an example of running the end-to-end benchmark testing for the ChatBench paper.
Run run_all_chatbench.sh to get BLEU, ROUGE, accuracy correlation, and MAE across folds:
chmod +x run_all_chatbench.sh
./run_all_chatbench.sh distilgpt2
# More models:
# Mistral 7B
./run_all_chatbench.sh mistralai/Mistral-7B-v0.1
# Llama 3 8B
./run_all_chatbench.sh meta-llama/Meta-Llama-3-8B
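If you want to average the per-fold numbers yourself, a small sketch along these lines works, assuming each fold writes its metrics as a JSON file under results/ (the actual filenames produced by run_all_chatbench.sh may differ):
# sketch: aggregate per-fold ChatBench metrics into mean and std; the glob pattern is an assumption
import glob, json, statistics

metrics = {}
for path in glob.glob("results/chatbench/distilgpt2/split_*.json"):
    with open(path) as f:
        for name, value in json.load(f).items():
            metrics.setdefault(name, []).append(value)

for name, values in sorted(metrics.items()):
    mean = statistics.mean(values)
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    print(f"{name}: {mean:.4f} +/- {std:.4f}  ({len(values)} folds)")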
Run run_quick_sanity.sh to review the results:
chmod +x run_quick_sanity.sh
./run_quick_sanity.sh distilgpt2
# or
./run_quick_sanity.sh meta-llama/Meta-Llama-3-8B
adapter_config.json                  # your LoRA hyperparameters (r, α, target_modules…)
adapter_model.safetensors            # the only file you truly need to re-load your LoRA adapter
checkpoint-1500/                     # checkpoint directory saved every save_steps
checkpoint-1550/                     #   └─ contains a copy of adapter_*.safetensors + trainer state
config.json                          # the base model’s TransformerConfig (e.g. hidden_size etc.)
training_args.bin                    # a binary dump of all your TrainingArguments
README.md                            # auto-generated pointer on how to re-load with PeftModel
special_tokens_map.json              # maps your added special tokens to IDs
tokenizer_config.json                # tokenizer settings (truncation side, normalization…)
tokenizer.json (or tokenizer.model)  # the actual tokenizer files (vocab + merges/spm)
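To re-load a fine-tuned adapter later, something like the following should work (the paths are illustrative; the base model must match the one used during fine-tuning):
# sketch: re-attach a saved LoRA adapter to its base model; paths are illustrative
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

adapter_dir = "results/models/distilgpt2/split_1"      # folder holding adapter_model.safetensors
base = AutoModelForCausalLM.from_pretrained("distilgpt2")
model = PeftModel.from_pretrained(base, adapter_dir)   # loads the LoRA weights on top of the base
tok = AutoTokenizer.from_pretrained(adapter_dir)       # tokenizer files are saved alongside the adapter
model.eval()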
- Data format
  Each line of your train.jsonl and test.jsonl looks like:

  { "messages": [
      { "role": "system",    "content": "…instructions…" },
      { "role": "user",      "content": "How many bacteria…?" },
      { "role": "assistant", "content": "Answer: C" }
  ] }

  - system: the task instructions (e.g. “You are a human user interacting with an AI…”)
  - user: the human’s question or conversation context
  - assistant: the “next user turn” we actually want the model to learn to produce
- Prompt ↦ Completion
  When we fine-tune, each example becomes a <prompt, completion> pair:

  prompt     = "[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] "
  completion = "Answer: C [END]"

  In other words, we feed the model both the system instructions and the user’s prior context (up through [USER]), and train it to generate the assistant’s content (the “next user turn”) up through [END].
- Perplexity evaluation
  At evaluation time (via eval_ppl.py), for each user turn in the test set we:
  - Reconstruct the same <prompt, completion> pair
  - Mask out the prompt tokens (they contribute no loss)
  - Compute loss only on the tokens of the ground-truth completion
  - Aggregate over all tokens to get a corpus-level PPL

  That PPL directly measures “how well does my fine-tuned model assign probability to the true assistant response given the exact same prompt?” A lower PPL means the fine-tuned model is closer to (more confident in) the human-like reply it was trained on. Concretely, eval_ppl.py:
  - Scores exactly the same <prompt, completion> pairs you trained on, i.e. every assistant turn, not user turns.
  - Reuses the fine-tune prompt format ([SYSTEM] …\n\n…\n\n[USER] ) and completion (assistant_content + " [END]").
  - Masks out prompt tokens exactly, computes loss only on the true assistant tokens, then aggregates into a corpus-level PPL.
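A minimal sketch of that masked-loss computation for a single example, assuming model and tok are an already-loaded causal LM and its tokenizer (eval_ppl.py additionally batches and loops over the whole test split):
# sketch: loss only on completion tokens; corpus PPL = exp(total_loss / total_completion_tokens)
import math
import torch

def completion_nll(model, tok, prompt, completion):
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100                      # prompt tokens contribute no loss
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss     # mean loss over unmasked (completion) tokens
    n_tokens = int((labels != -100).sum())
    return loss.item() * n_tokens, n_tokens

# corpus-level perplexity over all <prompt, completion> pairs in the test split:
# total, count = 0.0, 0
# for prompt, completion in pairs:
#     nll, n = completion_nll(model, tok, prompt, completion)
#     total, count = total + nll, count + n
# print("PPL:", math.exp(total / count))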
In summary:
We treat “system + user” as the prompt and the original assistant message as the completion. We fine-tune the LM to generate that completion, and then measure perplexity on exactly that same split of <prompt, completion> pairs to see how well the model learned to imitate the human-AI conversation.