Welcome to the afternoon lab accompanying the presentations Neural Conversation AI and LLMs.
Part 2 of the lab is self-contained in a Jupyter notebook; it deals with clustering and visualization of MultiWOZ data using pretrained language models and simple unsupervised techniques. See the standalone notebook.
Part 1 of the lab is described here in the README.
The lab will get you familiar with response generation for task-oriented dialogues (TOD) using end-to-end neural models.
We will use the MultiWOZ 2.2 [1, 2] dataset and causal language models implemented in the huggingface/transformers library for conditional generation.
The QLoRa implementation from the huggingface/peft library will allow us to finetune large pretrained Large Language Models (LLMs), e.g., LLaMA and Falcon, on relatively small GPUs in a Google Colab notebook or on your cluster.
What will you learn?
- How to finetune a large language model (LLM) using QLoRa.
- How to tweak the parameters of the decoding/generation process with HuggingFace LLMs.
- How to get familiar with MultiWOZ [1, 2], a typical textual TOD dataset.
- How to evaluate task-oriented dialogues (TOD) using standardized scripts.
We prepared a series of tasks for you. A ready-to-use solution accompanies each task. The solutions are intentionally hidden, so you have the chance to work on each task on your own. Share your answers to the questions, preferably via a pull request or on Discord.
Share your findings. Improve the code. Pick your rewards!
We prepared the qlora.py main Python script and several bash launch scripts which showcase the qlora.py functionality.
The same functionality is demonstrated in a Google Colab.
The Google Colab is arguably more straightforward to set up but harder to work with.
If you have a machine with a recent GPU with 16GB of memory, we recommend creating a conda environment and installing the complete list of dependencies specified in environment.yml.
# Have a look at environment.yml.
# The QLoRa finetuning requires cutting-edge library versions.
# Note: please run conda deactivate if you have another environment activated;
# an already active environment sometimes causes problems.
conda env create --prefix ./env -f environment.yml # grab a coffee
# activating the locally stored environment is easy
# if you want to delete the environment simply delete the ./env folder
conda activate ./env
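A quick import check verifies that the cutting-edge libraries were picked up correctly. The package names below are an assumption about what environment.yml pins; adjust them to your file:
# sanity check of the environment -- assumes torch, transformers, peft, and bitsandbytes are installed
python -c "import torch; print('torch', torch.__version__, 'CUDA build', torch.version.cuda)"
python -c "import transformers, peft, bitsandbytes; print('transformers', transformers.__version__, 'peft', peft.__version__)"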
# Run the next-turn prediction with the "debug" model argument.
# It should trigger downloading a small pretrained model and the MultiWoz dataset from HuggingFace.
# The finetuning will run for 4 iterations.
./scripts/finetune_multiwoz22_conditional_mlm.sh debug
- How to run this script on the JSALT cluster? Contributions are welcome!
- What is your iteration speed for the training with the debug setup?
- What machine and CUDA version do you have?
Feel free to fill in partial information, e.g., if you do not know your CUDA version, just write '-'.
GPU model | CUDA | train speed | inference speed |
---|---|---|---|
GC-Tesla T4 | 12.0 | - | - |
TODO | 12.0 | 2.43 s/it | 2.41 s/it |
NVIDIA GTX 1080 | 11.5 | 0:03:39.51 | 0:00:00.04 |
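If you are unsure what to report, the commands below print the GPU model and CUDA version (note that nvidia-smi shows the driver's CUDA version, which may differ from the one PyTorch was built against):
nvidia-smi --query-gpu=name,memory.total --format=csv
python -c "import torch; print(torch.cuda.get_device_name(0), torch.version.cuda)"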
Open the Google Colab. Run the whole notebook and write down which GPU you were assigned and how much memory you have available. The first dummy training should take around 20 minutes. The script downloads a small pretrained model and the MultiWoz dataset from HuggingFace.
- What is your iteration speed for the training with the default values?
- What is your iteration speed for inference with the default values?
- What machine and CUDA version do you have?
- Can you get a free machine with more than 16GB of GPU RAM, e.g., on Kaggle?
Please fill in the Task 1: Results table in the section for running on the cluster. In the GPU model column, prefix the GPU type with GC.
Let us start by comparing an untuned LLM (LLaMA) and the minimally fine-tuned oplatek/pythia-70m-multi_woz_v22, which I fine-tuned for you in 4 steps.
You will finetune your own adapter/LoRa weights in the next task. In the lab you will also learn how to upload your model to the HuggingFace Hub.
- Let's use next-turn generation, conditioned on the previous dialogue context, using the ./scripts/generate_prompted.sh script.
- However, the script is prepared to load not only the base model in 4-bit but also the additional trained weights from a LoRa checkpoint.
- We do not have a LoRa checkpoint trained yet, so we need to modify the script.
- Copy the script: cp ./scripts/generate_prompted.sh ./scripts/pp.sh # prompted_pretrained
- Open the pp.sh script and remove the --checkpoint_dir "$checkpoint_dir" line.
- Also adjust the output_dir to be named output/$model_name_or_path/REST_IS_THE_SAME
- The result should look like this:
python qlora.py \
--dataloader_num_workers 0 \
--max_eval_samples 1000 \
--model_name_or_path huggyllama/llama-7b \
--output_dir "output/huggyllama/llama-7b/pred_multi_woz_v22_turns_1000_$$" \
--do_train False \
--do_eval False \
--do_predict True \
--predict_with_generate \
--per_device_eval_batch_size 4 \
--dataset $dataset \
--dataset_format $dataset_format \
--source_max_len 256 \
--target_max_len 288 \
--max_new_tokens 32 \
--do_sample \
--top_p 0.9 \
--num_beams 1
- Note that setting dataloader_num_workers to 0 is good for debugging: the dataloader runs in the main Python process. However, use more CPU workers per GPU when you are not debugging.
- Explore the options of qlora.py, especially the Generation arguments. You can easily add them to the command line.
Play with parameters like top_k, temperature, max_new_tokens, penalty_alpha, etc., and investigate different decoding strategies.
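For instance, the sampling flags in pp.sh can be swapped for contrastive search. The flag names below follow the Generation arguments of qlora.py; double-check that your version exposes them before running:
# contrastive search: a positive penalty_alpha together with a small top_k
# (remove --do_sample and --top_p from pp.sh and add these instead)
    --penalty_alpha 0.6 \
    --top_k 4 \
    --max_new_tokens 64 \
    --num_beams 1
# or keep nucleus sampling but make it sharper:
#   --do_sample --temperature 0.7 --top_p 0.8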
- What is the highest batch_size you can use for decoding with otherwise default values?
- What is the longest reply you can force the model to generate with default values?
- How can you force the code to behave deterministically for the same dialogue history and an already fixed random seed?
- What are the best BLEU, Success, Inform, and Richness scores you can get without fine-tuning?
LLM model | Decoding params | BLEU | Success | Inform | Richness |
---|---|---|---|---|---|
waiting | for | your | numbers | again | - |
Finally! Let us train the LoRa weights!
./scripts/finetune_multiwoz22_conditional_mlm.sh huggyllama/llama-7b
- However, you may want to start small: explore small models like EleutherAI/pythia-70m, set the number of training steps to a much lower value, etc.; see the sketch after this list.
- Warning: see how checkpointing works. Adjust save_steps so you will have at least one checkpoint after training.
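A minimal small-scale run could look like this; it assumes the launch script forwards the model name, as in the debug example above, while max_steps and save_steps may need to be edited inside the script rather than passed on the command line:
# short dry run with a tiny model -- finishes quickly even on a small GPU
./scripts/finetune_multiwoz22_conditional_mlm.sh EleutherAI/pythia-70m
# inside the script, values along these lines keep the run short yet still save a checkpoint:
#   --max_steps 100 \
#   --save_steps 50 \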
- What LoRa modules work best? attention, ffn, regexp_keys|values, ...?
- For the default parameters, what is the best number of training steps?
- What is the best learning rate and number of training steps?
- Can you implement prompting to generate a conversation of a certain length?
  - Hint: I would start with the multi_woz_v22_dialogs format as used in finetune_multiwoz22_standard_mlm.sh.
  - The multi_woz_v22_turns format always "prompts" the model with the dialogue history ending with ...\nbot>, telling the model to reply as the bot.
  - The multi_woz_v22_turns format is used in scripts/finetune_multiwoz22_conditional_mlm.sh.
LLM model | Training params | BLEU | Success | Inform | Richness |
---|---|---|---|---|---|
waiting | for | your | numbers | again | - |
Open the Open LLM Leaderboard and try to run different models.
The LLaMA models and their derivations, such as Alpaca and Vicuna, should be compatible with the script.
We tested the code with EleutherAI/pythia-70m.
Try to scale the models' size, e.g., EleutherAI/pythia-12b instead of EleutherAI/pythia-70m.
Note that the pythia-70m model is excellent for debugging.
Try models trained on different datasets, e.g., OpenAssistant/oasst-sft-4-pythia-12b-epoch-3.5. A possible sweep over model sizes is sketched below.
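A sweep over several Pythia sizes could reuse the prediction flags from pp.sh, e.g. as in this sketch ($dataset and $dataset_format must still be set as in the script, and the largest checkpoints may not fit on a 16GB GPU even in 4-bit):
# hypothetical model-size sweep reusing the prediction flags from pp.sh
for model in EleutherAI/pythia-70m EleutherAI/pythia-1.4b EleutherAI/pythia-6.9b; do
    python qlora.py \
        --model_name_or_path "$model" \
        --output_dir "output/$model/pred_multi_woz_v22_turns_1000_$$" \
        --do_train False --do_eval False --do_predict True \
        --predict_with_generate \
        --max_eval_samples 1000 \
        --per_device_eval_batch_size 4 \
        --dataset "$dataset" --dataset_format "$dataset_format" \
        --source_max_len 256 --target_max_len 288 \
        --max_new_tokens 32 --do_sample --top_p 0.9 --num_beams 1
done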
- Do zero-shot models perform better as the number of parameters grows? For which metrics?
  - Report results with huggyllama/llama* or EleutherAI/pythia* checkpoints.
  - For other models, try at least three different sizes of the same model.
- What is the largest model you were able to finetune?
Please insert the answers into the Task 2: Results table.
Please open a Pull Request.
- Add the possibility to add an "instruction" prompt before the dialogue history.
- Implement an evaluation callback to evaluate regularly during training.
- Train from scratch using full_finetune and reinitializing the weights with reasonable hyperparameters.
- Add span_info to the dataloader and tag named entities.
- Add dialogue state information to the dataloader and predict the dialogue state instead of the words of the next response.
- Clean the code.
- Check the documentation and set up an account on Hugging Face if you do not have one already.
- Create a user token and authenticate yourself on the command line; see the quickstart for details and the login sketch after this list.
- Create a repository on the Hugging Face Hub.
- See the ./merge_peft.py script, which merges your adapter weights into the base model so it can be used as a regular transformers model again. Finally, use it with the --push_to_hub option.
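For the authentication step, the standard CLI shipped with the huggingface_hub package should be enough:
# prompts for the access token created on the Hugging Face website and stores it locally
huggingface-cli login
# recent huggingface_hub versions also accept a non-interactive form, e.g. on a cluster node:
# huggingface-cli login --token "$HF_TOKEN"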
# tested on GPU with this command
python merge_peft.py \
--base_model_name_or_path EleutherAI/pythia-70m \
--peft_model_path output/EleutherAI/pythia-70m_1687207221_1159787/checkpoint-4/ \
--device cuda \
--push_to_hub oplatek/pythia-70m-multi_woz_v22 \
--output_dir some_local_outdir
If you have implemented a new feature, found a bug, or want to fix a typo, please submit a pull request.
Use the black formatter to avoid merge conflicts in large PRs; a minimal invocation is sketched below.
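A minimal formatting pass, assuming black is not already installed in the environment:
pip install black
black qlora.py merge_peft.py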
In other cases, feel free to reach out to us too:
Ondřej Plátek (UFAL, Charles University, Prague)
Santosh Kesiraju (FIT, VUT, Brno)
Petr Schwarz (FIT, VUT, Brno)
If you use the code or results from this tutorial, please cite the tutorial in the following manner:
@article{oplatek2023qlora-multiwoz,
  title={Investigating Masked Language Model and Instruction finetuning of LLMs using QLoRa for Task-Oriented Dialogue Models},
  author={Plátek, Ondřej and Kesiraju, Santosh and Schwarz, Petr and Dušek, Ondřej},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={\url{https://github.com/keya-dialog/jsalt-dialogue-lab}},
  commit={todo},
  year={2023}
}
Please also cite the artidoro/qlora project, on which our work is built.
@article{dettmers2023qlora,
title={QLoRA: Efficient Finetuning of Quantized LLMs},
author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
journal={arXiv preprint arXiv:2305.14314},
year={2023}
}