
ChatBench

ChatBench simulator fine-tuning project.

ChatBench/
├── data/                                # released separately
│   ├── finetuning_experiments.ipynb     # data-prep notebook (builds splits from the raw ChatBench JSONL)
│   ├── prepare_splits.py                # data-prep script (see next section)
│   ├── split_0/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── split_1/
│   │   ├── train.jsonl
│   │   └── test.jsonl
│   ├── ... up to ...
│   └── split_9/
│       ├── train.jsonl
│       └── test.jsonl
├── src/
│   ├── finetune.py                      # training script (model-specific examples: finetune_<model_name>.py)
│   ├── eval_chatbench.py                # ChatBench paper metrics
│   └── eval_ppl.py                      # perplexity evaluation script (model-specific examples: eval_ppl_<model_name>.py)
├── configs/
│   └── training_args.json               # optional HF Trainer config
├── results/
│   ├── logs/<model>/                    # training logs (TensorBoard / Accelerate)
│   ├── models/<model>/                  # checkpoints & model weights
│   └── ppl/                             # perplexity results
├── .env                                 # HF token (HF_TOKEN=<yourtoken>)
└── README.md

Getting Started:

System packages and installation:

sudo apt update
sudo apt install -y git wget build-essential
sudo apt install -y python3 python3-venv python3-pip
python3 -m venv ~/venv
source ~/venv/bin/activate
pip install --upgrade pip setuptools
pip install -r requirements.txt

Verify GPU availability:

python - <<EOF
import torch
print(torch.cuda.is_available(), torch.cuda.device_count())
EOF
# Expected output (on a 4-GPU machine): True 4

Data Preparation (Note: the data is released separately):

  1. Download the data from https://huggingface.co/datasets/microsoft/ChatBench

     ChatBench/conversations.json ...etc.

  2. Prepare the train data splits (N folds):
     use the process from the notebook data/finetuning_experiments.ipynb

e.g. mmlu_interactive_full.jsonl (ChatBench data) in the format:

{"messages":[
    {"role":"system","content":"Clippy is a factual chatbot that is also sarcastic."},
    {"role":"user","content":"Who discovered Antarctica?"},
    {"role":"assistant","content":"Some chaps named Fabian…"}
]}

Split questions into train/test (10 random splits):

python data/prepare_splits.py   # run once (creates “Split 1”); repeat with a different random seed to create split_2/, split_3/, …

Make sure the converted data ends up as train.jsonl and test.jsonl (plus split.jsonl, if produced) for each split, placed into the respective folders, i.e. data/split_0/, data/split_1/, etc.
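
For reference, a minimal sketch of the kind of per-split shuffling prepare_splits.py performs (the input filename, 90/10 ratio, and output layout are illustrative assumptions, not the script's actual parameters):

import json
import os
import random

SEED = 0  # change on each run to produce split_0, split_1, ...

# Load all conversations; each line is one {"messages": [...]} record.
with open("mmlu_interactive_full.jsonl") as f:
    records = [json.loads(line) for line in f]

# Shuffle and split at the conversation level so no conversation
# leaks across train and test.
random.Random(SEED).shuffle(records)
cut = int(0.9 * len(records))  # illustrative 90/10 split
os.makedirs(f"data/split_{SEED}", exist_ok=True)
for name, rows in (("train", records[:cut]), ("test", records[cut:])):
    with open(f"data/split_{SEED}/{name}.jsonl", "w") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")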

Converting Conversations → Fine‐Tuning Format

DistilGPT-2 expects input as prompt–completion pairs, so we need to turn each JSONL “messages” list into a single string prompt, with the assistant’s content as the “completion.” ChatBench defines two tasks:

Task 1: generate the first user utterance

Task 2: given conversation so far (system + alternating roles up to last assistant turn), generate next user utterance or final “Answer: ”

We follow the same format that Azure OpenAI expects for chat fine-tuning:

{"prompt": "<FULL CHAT CONTEXT>\nuser:", "completion": "<NEXT USER UTTERANCE>"}

Run fine-tuning (e.g. DistilGPT-2)

  1. Choose a split:

export SPLIT_ID=split_1

  2. Launch multi-GPU training via accelerate.

First configure:

accelerate config
  • Choose “multi‐GPU” (no DeepSpeed if you want to keep things simple).
  • Pick a default location for state.
  • Set fp16=True if your GPU supports mixed precision (e.g. RTX A6000).

Then launch:

accelerate launch src/finetune.py <model_name>  

(Use our src/finetune_<model_name>.py version to replicate our released model weights.)

Evaluation

Validation losses:

Look at the end of each training run’s stdout: the Trainer prints {'eval_loss': …} after each epoch. Then open TensorBoard pointed at logs/<model>/:

tensorboard --logdir logs/

Perplexity on Held-Out Test Set:

export SPLIT_ID=split_1
python src/eval_ppl.py <model_name>

(Use our src/eval_ppl_<model_name>.py version to replicate our released model evaluation.)

Automate N-Fold Runs & Evaluation:

(You can use run_all_train_ppl.sh and the other model-specific shell scripts for the full train + perplexity pipeline.)

Train & PPL:

Run run_all_train_ppl.sh to fine-tune and record validation perplexities (model-specific variants shown below).

chmod +x run_all_train_ppl*.sh
./run_all_train_ppl_distilgpt2.sh distilgpt2

# More models:

# Mistral 7B
./run_all_train_ppl_mistral.sh mistralai/Mistral-7B-v0.1

# Llama 3 8B
./run_all_train_ppl_llama3.sh meta-llama/Meta-Llama-3-8B

Monitor the training:

In a new terminal, monitor the GPU load:

watch -n 1 nvidia-smi

If you don't want the header repeated, you can monitor the GPU load with:

nvidia-smi --query-gpu=index,utilization.gpu,utilization.memory,memory.total,memory.used \
  --format=csv --loop-ms=1000

ChatBench metrics:

Note: Please use the example notebook to run end-to-end benchmark testing for the ChatBench paper.

Run run_all_chatbench.sh to get BLEU, ROUGE, accuracy correlation, and MAE across folds:

chmod +x run_all_chatbench.sh
./run_all_chatbench.sh distilgpt2
# More models:
# Mistral 7B
./run_all_chatbench.sh mistralai/Mistral-7B-v0.1
# Llama 3 8B
./run_all_chatbench.sh meta-llama/Meta-Llama-3-8B
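
For reference, a rough sketch of the kinds of metrics reported (using the Hugging Face evaluate package with illustrative placeholder values; this is not the repo's eval_chatbench.py):

import evaluate
from scipy.stats import pearsonr

# Text-overlap metrics on generated vs. reference user utterances.
bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
preds = ["Answer: C"]                # simulated turns (illustrative)
refs = ["Answer: C"]                 # ground-truth turns (illustrative)
print(bleu.compute(predictions=preds, references=[[r] for r in refs]))
print(rouge.compute(predictions=preds, references=refs))

# Accuracy correlation and MAE: compare per-fold accuracy implied by the
# simulated conversations against human-AI accuracy.
sim_acc = [0.60, 0.70, 0.65]         # illustrative per-fold values
human_acc = [0.65, 0.72, 0.61]
print("correlation:", pearsonr(sim_acc, human_acc)[0])
print("MAE:", sum(abs(a - b) for a, b in zip(sim_acc, human_acc)) / len(sim_acc))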

Quick Sanity Check:

Run run_quick_sanity.sh to review the results.

chmod +x run_quick_sanity.sh
./run_quick_sanity.sh distilgpt2
# or
./run_quick_sanity.sh meta-llama/Meta-Llama-3-8B

Output:

adapter_config.json                  # your LoRA hyperparameters (r, α, target_modules…)
adapter_model.safetensors            # the only file you truly need to re-load your LoRA adapter
checkpoint-1500/                     # checkpoint directory saved every save_steps
checkpoint-1550/                     #   └─ each contains a copy of adapter_*.safetensors + trainer state
config.json                          # the base model’s TransformerConfig (e.g. hidden_size etc.)
training_args.bin                    # a binary dump of all your TrainingArguments
README.md                            # auto-generated pointer on how to re-load with PEFTModel
special_tokens_map.json              # maps your added special tokens to IDs
tokenizer_config.json                # tokenizer settings (truncation side, normalization…)
tokenizer.json (or tokenizer.model)  # the actual tokenizer files (vocab + merges/spm)
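
For example, a minimal sketch of re-loading such an adapter with PEFT (the adapter path here is an assumed placeholder; point it at your actual results/models/<model>/ output):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "distilgpt2"
ADAPTER_DIR = "results/models/distilgpt2/split_1"  # assumed path; adjust to your run

# Load the frozen base model, then attach the trained LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(BASE)
model = PeftModel.from_pretrained(base, ADAPTER_DIR)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_DIR)

prompt = "[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] "
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))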

Training Pipeline Overview

What’s happening in the pipeline:

  1. Data format. Each line of your train.jsonl and test.jsonl looks like:

    {
      "messages": [
        { "role": "system",    "content": "…instructions…"              },
        { "role": "user",      "content": "How many bacteria…?"         },
        { "role": "assistant", "content": "Answer: C"                  }
      ]
    }
    • system: the task instructions (e.g. “You are a human user interacting with an AI…”)
    • user: the human’s question or conversation context
    • assistant: the “next user turn” we actually want the model to learn to produce
  2. Prompt ↦ Completion. When we fine-tune, each example becomes a <prompt, completion> pair:

    prompt     = "[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] "
    completion = "Answer: C [END]"
    

    In other words, we feed the model both the system instructions and the user’s prior context (up through [USER] ), and train it to generate the assistant’s content (the “next user turn”) up through [END].

  3. Perplexity evaluation. At evaluation time (via eval_ppl.py), for each assistant turn in the test set we:

    • Reconstruct the same <prompt, completion>
    • Mask out the prompt tokens (they contribute no loss)
    • Compute loss only on the tokens of the ground-truth completion
    • Aggregate over all tokens to get a corpus-level PPL

    That PPL directly measures how well the fine-tuned model assigns probability to the true assistant response given the exact same prompt; a lower PPL means the fine-tuned model is closer (more confident) to the human-like replies it was trained on. Concretely, eval_ppl.py:

    • Scores exactly the same <prompt, completion> pairs you trained on, i.e. every assistant turn, not user turns.
    • Reuses the fine-tune prompt format ([SYSTEM] …\n\n…\n\n[USER] ) and completion (assistant_content + " [END]").
    • Masks out prompt tokens exactly, computes loss only on the true assistant tokens, then aggregates into a corpus-level PPL (a minimal sketch of this masking follows below).
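
A minimal sketch of that prompt-masking PPL computation (illustrative only, not the exact eval_ppl.py implementation; the model name and example pair are placeholders):

import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

def completion_nll(prompt, completion):
    """Total negative log-likelihood and token count for the completion
    tokens only; prompt tokens are masked out with label -100."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, :prompt_len] = -100               # prompt contributes no loss
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss  # mean NLL over unmasked tokens
    n_tokens = int((labels != -100).sum())
    return loss.item() * n_tokens, n_tokens

# Corpus-level PPL: aggregate NLL over all completion tokens, then exponentiate.
pairs = [("[SYSTEM] …instructions…\n\nHow many bacteria…?\n\n[USER] ", "Answer: C [END]")]
total_nll = total_tokens = 0
for prompt, completion in pairs:
    nll, n = completion_nll(prompt, completion)
    total_nll, total_tokens = total_nll + nll, total_tokens + n
print("corpus PPL:", math.exp(total_nll / total_tokens))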

In summary:

We treat “system + user” as the prompt, and the original assistant message as the completion. We fine-tune the LM to generate that completion, and then we measure perplexity on exactly that same split of <prompt, completion> to see how well the model learned to imitate the human-AI conversation.
