
ChatBench

ChatBench simulator fine-tuning project.

ChatBench/
├── data/                                # released separately
│   ├── finetuning_experiments.ipynb     # data-prep notebook (builds splits from the raw ChatBench JSONL)
│   ├── prepare_splits.py                # data-prep script (see next section)
│   ├── split_0/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── split_1/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── ... up to ...
│   └── split_9/
│       ├── train.jsonl
│       └── test.jsonl
├── src/
│   ├── finetune.py                      # training script (model-specific examples: finetune_<model_name>.py)
│   ├── eval_chatbench.py                # ChatBench paper metrics
│   └── eval_ppl.py                      # perplexity evaluation script (model-specific examples: eval_ppl_<model_name>.py)
├── configs/
│   └── training_args.json               # optional HF Trainer config
├── results/
│   ├── logs/<model>/                    # training logs (TensorBoard / Accelerate)
│   ├── models/<model>/                  # checkpoints & model weights
│   └── ppl/                             # perplexity results
├── .env                                 # HF token (HF_TOKEN=<yourtoken>)
└── README.md

Getting Started:

System packages and installation:

sudo apt update
sudo apt install -y git wget build-essential
sudo apt install -y python3 python3-venv python3-pip
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install --upgrade pip setuptools
pip install -r requirements.txt

Verify GPU availability:

python - <<EOF
import torch
print(torch.cuda.is_available(), torch.cuda.device_count())
EOF
# Expected output (on a 4-GPU machine): True 4

Data Preparation (Note: the data is released separately):

  1. Download the data from https://huggingface.co/datasets/microsoft/ChatBench

     ChatBench/conversations.json ...etc.

  2. Prepare the train data splits (N folds):
     use the process from the notebook data/finetuning_experiments.ipynb

e.g. mmlu_interactive_full.jsonl (ChatBench data) in the format:

{"messages":[
    {"role":"system","content":"Clippy is a factual chatbot that is also sarcastic."},
    {"role":"user","content":"Who discovered Antarctica?"},
    {"role":"assistant","content":"Some chaps named Fabian…"}
]}

Split questions into train/test (10 random splits):

python data/prepare_splits.py   # run once (creates “Split 1”); repeat with a different random seed to create split_2/, split_3/, …

Make sure the converted data ends up as train.jsonl and test.jsonl (plus split.jsonl, if produced) for each split, placed into the respective folders, i.e. data/split_0/, data/split_1/, etc.
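
For reference, a minimal sketch of the kind of per-split shuffling prepare_splits.py performs (the input filename, 90/10 ratio, and output layout are illustrative assumptions, not the script's actual parameters):

import json
import os
import random

SEED = 0  # change on each run to produce split_0, split_1, ...

# Load all conversations; each line is one {"messages": [...]} record.
with open("mmlu_interactive_full.jsonl") as f:
    records = [json.loads(line) for line in f]

# Shuffle and split at the conversation level so no conversation
# leaks across train and test.
random.Random(SEED).shuffle(records)
cut = int(0.9 * len(records))  # illustrative 90/10 split
os.makedirs(f"data/split_{SEED}", exist_ok=True)
for name, rows in (("train", records[:cut]), ("test", records[cut:])):
    with open(f"data/split_{SEED}/{name}.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")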

Converting Conversations → Fine‐Tuning Format

DistilGPT-2 expects input as prompt–completion pairs, so we need to turn each JSONL “messages” list into a single string prompt, with the assistant’s content as the “completion.” ChatBench defines two tasks:

Task 1: generate the first user utterance

Task 2: given conversation so far (system + alternating roles up to last assistant turn), generate next user utterance or final “Answer: ”

We follow the same format that Azure OpenAI expects for chat fine-tuning:

{"prompt": "<FULL CHAT CONTEXT>\nuser:", "completion": "<NEXT USER UTTERANCE>"}

Run fine-tuning (e.g. DistilGPT-2)

  1. Choose a split:

export SPLIT_ID=split_1

  2. Launch multi-GPU training via accelerate.

First configure:

accelerate config
  • Choose “multi‐GPU” (no DeepSpeed if you want to keep things simple).
  • Pick a default location for state.
  • Set fp16=True if your GPU supports mixed precision (e.g. RTX A6000).

Then launch:

accelerate launch src/finetune.py <model_name>  

(Use our src/finetune_<model_name>.py version to replicate our released model weights.)

Evaluation

Validation losses:

Look at the end of each training run’s stdout: the Trainer prints {'eval_loss': …} after each epoch. Then open TensorBoard pointed at logs/<model>/:

tensorboard --logdir logs/

Perplexity on Held-Out Test Set:

export SPLIT_ID=split_1
python src/eval_ppl.py <model_name>

(Use our src/eval_ppl_<model_name>.py version to replicate our released model evaluation.)

Automate N-Fold Runs & Evaluation:

(You can use run_all_train_ppl.sh and the other model-specific shell scripts for the full train + perplexity pipeline.)

Train & PPL:

Run run_all_train_ppl.sh to fine-tune and record validation perplexities (model-specific variants shown below).

chmod +x run_all_train_ppl*.sh
./run_all_train_ppl_distilgpt2.sh distilgpt2

# More models:

# Mistral 7B
./run_all_train_ppl_mistral.sh mistralai/Mistral-7B-v0.1

# Llama 3 8B
./run_all_train_ppl_llama3.sh meta-llama/Meta-Llama-3-8B

Monitor the training:

In a new terminal, monitor the GPU load:

watch -n 1 nvidia-smi

If you don't want the header repeated, you can monitor the GPU load with:

nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.total,memory.used \
  --format=csv --loop-ms=1000

ChatBench metrics:

Note: Please use the example notebook to run end-to-end benchmark testing for the ChatBench paper.

Run run_all_chatbench.sh to get BLEU, ROUGE, accuracy correlation, and MAE across folds:

chmod +x run_all_chatbench.sh
./run_all_chatbench.sh distilgpt2
# More models:
# Mistral 7B
./run_all_chatbench.sh mistralai/Mistral-7B-v0.1
# Llama 3 8B
./run_all_chatbench.sh meta-llama/Meta-Llama-3-8B
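
For reference, a rough sketch of the kinds of metrics reported (using the Hugging Face evaluate package with illustrative placeholder values; this is not the repo's eval_chatbench.py):

import evaluate
from scipy.stats import pearsonr

# Text-overlap metrics on generated vs. reference user utterances.
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
preds = ["Answer: C"]                # simulated turns (illustrative)
refs = ["Answer: C"]                 # ground-truth turns (illustrative)
print(bleu.compute(predictions=preds, references=[[r] for r in refs]))
print(rouge.compute(predictions=preds, references=refs))

# Accuracy correlation and MAE: compare per-fold accuracy implied by the
# simulated conversations against human-AI accuracy.
sim_acc = [0.60, 0.70, 0.65]         # illustrative per-fold values
human_acc = [0.65, 0.72, 0.61]
print("correlation:", pearsonr(sim_acc, human_acc)[0])
print("MAE:", sum(abs(a - b) for a, b in zip(sim_acc, human_acc)) / len(sim_acc))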

Quick Sanity Check:

Run run_quick_sanity.sh to review the results.

chmod +x run_quick_sanity.sh
./run_quick_sanity.sh distilgpt2
# or
./run_quick_sanity.sh meta-llama/Meta-Llama-3-8B

Output:

adapter_config.json                  # your LoRA hyperparameters (r, α, target_modules…)
adapter_model.safetensors            # the only file you truly need to re-load your LoRA adapter
checkpoint-1500/                     # checkpoint directory saved every save_steps
checkpoint-1550/                     #   └─ each contains a copy of adapter_*.safetensors + trainer state
config.json                          # the base model’s TransformerConfig (e.g. hidden_size etc.)
training_args.bin                    # a binary dump of all your TrainingArguments
README.md                            # auto-generated pointer on how to re-load with PEFTModel
special_tokens_map.json              # maps your added special tokens to IDs
tokenizer_config.json                # tokenizer settings (truncation side, normalization…)
tokenizer.json (or tokenizer.model)  # the actual tokenizer files (vocab + merges/spm)
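
For example, a minimal sketch of re-loading such an adapter with PEFT (the adapter path here is an assumed placeholder; point it at your actual results/models/<model>/ output):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "distilgpt2"
ADAPTER_DIR = "results/models/distilgpt2/split_1"  # assumed path; adjust to your run

# Load the frozen base model, then attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(BASE)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

prompt = "[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] "
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))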

Training Pipeline Overview

What’s happening in the pipeline:

  1. Data format. Each line of your train.jsonl and test.jsonl looks like:

    {
      "messages": [
        { "role": "system",    "content": "…instructions…"              },
        { "role": "user",      "content": "How many bacteria…?"         },
        { "role": "assistant", "content": "Answer: C"                  }
      ]
    }
    • system: the task instructions (e.g. “You are a human user interacting with an AI…”)
    • user: the human’s question or conversation context
    • assistant: the “next user turn” we actually want the model to learn to produce
  2. Prompt ↦ Completion. When we fine-tune, each example becomes a <prompt, completion> pair:

    prompt     = "[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] "
    completion = "Answer: C [END]"
    

    In other words, we feed the model both the system instructions and the user’s prior context (up through [USER] ), and train it to generate the assistant’s content (the “next user turn”) up through [END].

  3. Perplexity evaluation. At evaluation time (via eval_ppl.py), for each assistant turn in the test set we:

    • Reconstruct the same <prompt, completion>
    • Mask out the prompt tokens (they contribute no loss)
    • Compute loss only on the tokens of the ground-truth completion
    • Aggregate over all tokens to get a corpus-level PPL

    That PPL directly measures how well the fine-tuned model assigns probability to the true assistant response given the exact same prompt; a lower PPL means the fine-tuned model is closer (more confident) to the human-like replies it was trained on. Concretely, eval_ppl.py:

    • Scores exactly the same <prompt, completion> pairs you trained on, i.e. every assistant turn, not user turns.
    • Reuses the fine-tune prompt format ([SYSTEM] …\n\n…\n\n[USER] ) and completion (assistant_content + " [END]").
    • Masks out prompt tokens exactly, computes loss only on the true assistant tokens, then aggregates into a corpus-level PPL (a minimal sketch of this masking follows below).
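
A minimal sketch of that prompt-masking PPL computation (illustrative only, not the exact eval_ppl.py implementation; the model name and example pair are placeholders):

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

def completion_nll(prompt, completion):
    """Total negative log-likelihood and token count for the completion
    tokens only; prompt tokens are masked out with label -100."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100               # prompt contributes no loss
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss  # mean NLL over unmasked tokens
    n_tokens = int((labels != -100).sum())
    return loss.item() * n_tokens, n_tokens

# Corpus-level PPL: aggregate NLL over all completion tokens, then exponentiate.
pairs = [("[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] ", "Answer: C [END]")]
total_nll = total_tokens = 0
for prompt, completion in pairs:
    nll, n = completion_nll(prompt, completion)
    total_nll, total_tokens = total_nll + nll, total_tokens + n
print("corpus PPL:", math.exp(total_nll / total_tokens))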

In summary:

We treat “system + user” as the prompt, and the original assistant message as the completion. We fine-tune the LM to generate that completion, and then we measure perplexity on exactly that same split of <prompt, completion> to see how well the model learned to imitate the human-AI conversation.
