# Finetune Your Chatbot on Multi-node SPR

NeuralChat is a customizable chat framework that lets users create their own chatbot within minutes on multiple architectures. This notebook shows how to finetune your chatbot with customized data on multi-node SPR (Intel® Xeon® Sapphire Rapids) servers.

## Prepare Environment
We support Distributed Data Parallel (DDP) finetuning in both single-node and multi-node settings. Before using DDP to speed up training, we need to configure the environment.

Python 3.9 or higher is recommended.

```bash
pip install -r requirements.txt
# Using ccl as the distributed backend for distributed training on CPU requires the package below.
python -m pip install oneccl_bind_pt==2.1.0 -f https://developer.intel.com/ipex-whl-stable-cpu
```

Then, follow the [Hugging Face guide](https://huggingface.co/docs/transformers/perf_train_cpu_many) to install Intel® oneCCL Bindings for PyTorch and Intel® Extension for PyTorch (IPEX).

`oneccl_bindings_for_pytorch` is installed along with the MPI tool set. You need to source its environment script before using it.

For Intel® oneCCL >= 1.12.0:
``` bash
oneccl_bindings_for_pytorch_path=$(python -c "from oneccl_bindings_for_pytorch import cwd; print(cwd)")
source $oneccl_bindings_for_pytorch_path/env/setvars.sh
```

For Intel® oneCCL < 1.12.0:
``` bash
torch_ccl_path=$(python -c "import torch; import torch_ccl; import os;  print(os.path.abspath(os.path.dirname(torch_ccl.__file__)))")
source $torch_ccl_path/env/setvars.sh
```

## Prepare the Dataset
We select three kinds of datasets to run the finetuning process for different tasks.

1. Text Generation (General domain instruction): We use the [Alpaca dataset](https://github.com/tatsu-lab/stanford_alpaca) from Stanford University as the general domain dataset to fine-tune the model. This dataset is provided in the form of a JSON file, [alpaca_data.json](https://github.com/tatsu-lab/stanford_alpaca/blob/main/alpaca_data.json). In Alpaca, researchers manually crafted 175 seed tasks to guide `text-davinci-003` in generating 52K instruction-following examples for diverse tasks.

2. Summarization: We use [cnn_dailymail](https://huggingface.co/datasets/cnn_dailymail), an English-language dataset containing just over 300k unique news articles written by journalists at CNN and the Daily Mail.

3. Code Generation: To improve the code generation performance of LLMs (Large Language Models), we use the [theblackcat102/evol-codealpaca-v1](https://huggingface.co/datasets/theblackcat102/evol-codealpaca-v1) dataset.

## Finetune Your Chatbot
Before starting the finetuning, create a node configuration file (for example, `hostfile`) that contains the IP address of each node, and pass the path of that file to `mpirun` as an argument. Here, we take a training run with a total of 16 processes on 4 Xeon SPR nodes as an example. We use nodes 0/1/2/3 for the finetuning, where node 0 serves as the master node and each node has two sockets. `ppn` (processes per node) is set to 4, which means each socket runs two processes. The variables `OMP_NUM_THREADS` and `CCL_WORKER_COUNT` can be tuned for optimal performance.
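
The process counts in this example follow from simple arithmetic, sketched below. The variable names are illustrative only and are not NeuralChat or MPI settings; the values mirror the 4-node setup described above.

```bash
# Illustrative variables (not NeuralChat settings); values mirror the example above.
nodes=4              # node 0/1/2/3
ppn=4                # processes per node, passed to mpirun as -ppn
sockets_per_node=2   # each node in this example has two sockets

total_procs=$(( nodes * ppn ))                  # value passed to mpirun as -n
procs_per_socket=$(( ppn / sockets_per_node ))  # processes sharing one socket

echo "mpirun -n ${total_procs} -ppn ${ppn} (${procs_per_socket} processes per socket)"
```

With these values the script prints `mpirun -n 16 -ppn 4 (2 processes per socket)`. If you change the node count or `ppn`, the `-n` value in the launch command needs to change accordingly.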

On node 0, the hostfile that holds the node configuration looks like this:
``` bash
 cat hostfile
 xxx.xxx.xxx.xxx #node 0 ip
 xxx.xxx.xxx.xxx #node 1 ip
 xxx.xxx.xxx.xxx #node 2 ip
 xxx.xxx.xxx.xxx #node 3 ip
```
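
Launching from node 0 relies on passwordless SSH to every node, so it can help to verify each host in the hostfile before starting a long run. The following is a minimal sketch: `list_hosts` and `check_ssh` are hypothetical helper names, and the parsing assumes the layout shown above (one address per line, with an optional `#` comment).

```bash
# list_hosts: print the address on each hostfile line, skipping blank
# lines and stripping anything after a '#'.
list_hosts() {
    awk '{ sub(/#.*/, "") } NF { print $1 }' "$1"
}

# check_ssh: attempt a non-interactive login to every host. BatchMode=yes
# makes ssh fail instead of prompting for a password.
check_ssh() {
    local host
    for host in $(list_hosts "$1"); do
        if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok   $host"
        else
            echo "FAIL $host"
        fi
    done
}

# Usage (run on node 0): check_ssh hostfile
```

Any host reported as `FAIL` likely needs its SSH keys set up before `mpirun` can start processes on it.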

If you have enabled passwordless SSH across the CPU cluster, you can use `mpirun` on the master node to start the DDP finetuning. Run the following command on node 0, and DDP will be enabled on nodes 0/1/2/3 (16 processes in total) with BF16 auto mixed precision:
``` bash
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
## DDP p-tuning for Llama
mpirun -f hostfile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --train_file ./alpaca_data.json \
    --bf16 True \
    --output_dir ./llama_peft_finetuned_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --peft ptun \
    --group_by_length True \
    --dataset_concatenation \
    --use_fast_tokenizer false \
    --do_train \
    --no_cuda \
    --ddp_backend ccl
```
Note: If your SPR nodes' network interface controllers use the irdma driver (you can check this with the `ibv_devices` command), some extra environment variables are needed. Export them as shown below.
```bash
export PSM3_ALLOW_ROUTERS=1
export PSM3_MULTI_EP=1
export FI_PROVIDER=psm3
```
You can also set `--peft` to switch the PEFT tuning method among `ptun` (P-tuning), `prefix` (Prefix tuning), `prompt` (Prompt tuning), `llama_adapter` (LLaMA-Adapter), and `lora` (LoRA). See https://github.com/huggingface/peft for more details.
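
When comparing tuning methods, one option is to give each `--peft` value its own output directory so that checkpoints do not overwrite each other. The sketch below only prints the flag pairs and does not launch training. `peft_flags` is a hypothetical helper, and the directory naming pattern is an assumption rather than a NeuralChat convention.

```bash
# peft_flags: print one "--peft ... --output_dir ..." pair per tuning method.
# Method names are the --peft values listed above; the output directory
# pattern is a hypothetical naming scheme.
peft_flags() {
    local method
    for method in ptun prefix prompt llama_adapter lora; do
        echo "--peft ${method} --output_dir ./llama_${method}_finetuned_model"
    done
}

peft_flags
```

Each printed line can replace the `--peft` and `--output_dir` arguments in the launch command above.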

Similarly, you can train your chatbot on the summarization task:
``` bash
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
## DDP p-tuning for Llama
mpirun -f hostfile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --dataset_name "cnn_dailymail" \
    --dataset_config_name "3.0.0" \
    --bf16 True \
    --output_dir ./llama_peft_finetuned_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --peft ptun \
    --group_by_length True \
    --dataset_concatenation \
    --use_fast_tokenizer false \
    --do_train \
    --no_cuda \
    --ddp_backend ccl
```

Train your chatbot on the code generation task:
``` bash
export CCL_WORKER_COUNT=1
export MASTER_ADDR=xxx.xxx.xxx.xxx #node0 ip
## DDP p-tuning for Llama
mpirun -f hostfile -n 16 -ppn 4 -genv OMP_NUM_THREADS=56 python3 finetune_clm.py \
    --model_name_or_path decapoda-research/llama-7b-hf \
    --dataset_name "theblackcat102/evol-codealpaca-v1" \
    --bf16 True \
    --output_dir ./llama_peft_finetuned_model \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 1 \
    --learning_rate 1e-4 \
    --logging_steps 1 \
    --peft ptun \
    --group_by_length True \
    --dataset_concatenation \
    --use_fast_tokenizer false \
    --do_train \
    --no_cuda \
    --ddp_backend ccl
```
