update tokenizer process. #2138

shibing624 · 2023-08-02T06:09:58Z

Support data tokenization for Llama 2, Baichuan, BLOOM, and ChatGLM models.
Make the code more readable.

Why are these changes needed?

Related issue number (if applicable)

Checks

I've run format.sh to lint the changes in this PR.
I've included any doc changes needed.
I've made sure the relevant tests are passing (if applicable).

merrymercy · 2023-08-02T06:14:20Z

Could you double-check that the output of these lines (FastChat/fastchat/train/train.py, lines 140 to 143 in 820de33):

    if False:  # Inspect and check the correctness of masking
        z = target.clone()
        z = torch.where(z == IGNORE_TOKEN_ID, tokenizer.unk_token_id, z)
        rank0_print(tokenizer.decode(z))

is exactly the same before and after your change?
You can see more discussions and test cases at #1894 (comment)
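
For readers following along, the inspection above can be sketched without PyTorch or a real tokenizer. This is only an illustration: `masked_view`, the toy `vocab`, and the value `-100` for `IGNORE_TOKEN_ID` are assumptions made for the example, not code from train.py.

```python
IGNORE_TOKEN_ID = -100  # assumed value for this sketch


def masked_view(target, unk_token_id, ignore_id=IGNORE_TOKEN_ID):
    """Replace masked label positions with the unk id so the sequence can be decoded."""
    return [unk_token_id if t == ignore_id else t for t in target]


# Hypothetical toy vocabulary standing in for a real tokenizer's decode step.
vocab = {0: "<unk>", 1: "<s>", 2: "</s>", 10: "Hi", 20: "Hello"}

# Labels for a sequence "<s> Hi Hello </s>" where BOS and the prompt are masked.
target = [IGNORE_TOKEN_ID, IGNORE_TOKEN_ID, 20, 2]
decoded = " ".join(vocab[t] for t in masked_view(target, unk_token_id=0))
print(decoded)
```

Masked positions show up as `<unk>` in the decoded text, so only the tokens that contribute to the loss remain readable. Comparing this string before and after a change is a quick way to spot a shifted mask.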

shibing624 · 2023-08-02T06:55:32Z

I double-checked; it is correct with Llama 2.

merrymercy · 2023-08-08T10:37:45Z

@shibing624 Thanks! It is merged.

merrymercy · 2023-08-08T14:53:48Z

fastchat/train/train.py

@@ -117,19 +118,17 @@ def preprocess(
        total_len = int(target.ne(tokenizer.pad_token_id).sum())

        turns = conversation.split(conv.sep2)
-        cur_len = 1
-        target[:cur_len] = IGNORE_TOKEN_ID


@shibing624 The first token is for BOS. I found BOS is dropped after your PR. Could you fix this?
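
A simplified sketch of why the `cur_len = 1` offset matters when the sequence starts with a BOS token. The function `mask_targets`, its `turns` argument, and the token ids are hypothetical and only mimic the shape of the masking loop; they are not the code from train.py.

```python
IGNORE_TOKEN_ID = -100  # assumed value for this sketch


def mask_targets(input_ids, turns, skip_bos=True):
    """Mask BOS and instruction tokens so that only response tokens are trained on.

    turns: list of (instruction_len, response_len) token counts, excluding BOS.
    skip_bos: whether the mask starts after the leading BOS token.
    """
    target = list(input_ids)
    cur_len = 1 if skip_bos else 0
    target[:cur_len] = [IGNORE_TOKEN_ID] * cur_len
    for instruction_len, response_len in turns:
        # Mask the instruction part of this turn, keep the response part.
        target[cur_len:cur_len + instruction_len] = [IGNORE_TOKEN_ID] * instruction_len
        cur_len += instruction_len + response_len
    # Anything past the counted turns is masked as well.
    target[cur_len:] = [IGNORE_TOKEN_ID] * (len(target) - cur_len)
    return target


# BOS=1, instruction tokens 10 11 12, response tokens 20 21, EOS=2.
input_ids = [1, 10, 11, 12, 20, 21, 2]
turns = [(3, 3)]

with_offset = mask_targets(input_ids, turns, skip_bos=True)
without_offset = mask_targets(input_ids, turns, skip_bos=False)
print(with_offset)
print(without_offset)
```

Without the offset, every mask boundary shifts left by one token: the last instruction token leaks into the loss and the final response token is masked.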

merrymercy · 2023-08-08T15:09:25Z

I reverted this PR because of the BOS problem (#2138 (comment)). Please fix it and submit a new one.
This part is very error-prone. Please make sure input_ids and targets remain exactly the same for Llama models after your PR.
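
One way to check this requirement is a small comparison harness, sketched below. `compare_preprocess`, the toy `old_fn` and `new_fn`, and the dictionary keys `input_ids` and `labels` are assumptions for illustration; substitute the real preprocessing functions and whatever keys they return.

```python
def compare_preprocess(old_fn, new_fn, samples, keys=("input_ids", "labels")):
    """Return (sample_index, key, position) for the first mismatch per sample and key."""
    mismatches = []
    for idx, sample in enumerate(samples):
        old_out, new_out = old_fn(sample), new_fn(sample)
        for key in keys:
            a, b = list(old_out[key]), list(new_out[key])
            if a != b:
                # First differing position, or the shorter length if one is a prefix.
                pos = next(
                    (i for i, (x, y) in enumerate(zip(a, b)) if x != y),
                    min(len(a), len(b)),
                )
                mismatches.append((idx, key, pos))
    return mismatches


# Hypothetical stand-ins: the "old" version prepends BOS (id 1), the "new" one drops it.
def old_fn(text):
    ids = [1] + [ord(c) for c in text]
    return {"input_ids": ids, "labels": [-100] + ids[1:]}


def new_fn(text):
    ids = [ord(c) for c in text]
    return {"input_ids": ids, "labels": ids}


report = compare_preprocess(old_fn, new_fn, ["ab"])
print(report)
```

An empty report means both versions produce identical sequences for every sample; any entry points at the first position worth inspecting.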

shibing624 · 2023-08-09T04:09:32Z

Yes, the tokenization process that splits the conversation and searches for the eos_token across multiple rounds is very error-prone. So I rewrote the logic: the dialogs are extracted in preprocess, and then only the prompt and the answer need to be tokenized. The changed code is in my repo: https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py#L653
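
A minimal sketch of the idea described above, tokenizing each prompt and answer separately so that no separator search is needed. It is not the code at the linked URL: `build_example`, the `encode` callable, the single leading BOS, and the EOS after each answer are all assumptions made for illustration.

```python
def build_example(dialog, encode, bos_id, eos_id, ignore_id=-100):
    """Build input_ids and labels from (prompt, answer) pairs.

    dialog: list of (prompt, answer) strings.
    encode: callable mapping a string to token ids without special tokens.
    """
    input_ids, labels = [bos_id], [ignore_id]  # BOS is present but never trained on
    for prompt, answer in dialog:
        prompt_ids = encode(prompt)
        answer_ids = encode(answer) + [eos_id]
        input_ids += prompt_ids + answer_ids
        # Prompt tokens are masked; answer tokens (including EOS) are kept as labels.
        labels += [ignore_id] * len(prompt_ids) + answer_ids
    return input_ids, labels


# Toy character-level encoder standing in for a real tokenizer.
def toy_encode(text):
    return [ord(c) for c in text]


input_ids, labels = build_example([("hi", "yo")], toy_encode, bos_id=1, eos_id=2)
print(input_ids)
print(labels)
```

Because the mask is built from the lengths of the separately tokenized pieces, the boundaries cannot drift. The trade-off is that tokenizing pieces separately may not match tokenizing the concatenated string for some tokenizers, which is worth verifying against the original outputs.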

update tokenizer process.

bbce1e2

1. support llama2, baichuan, bloom, chatglm model data tokenize 2. make the code more readable

merrymercy added the high-priority label Aug 8, 2023

merrymercy merged commit bad72ef into lm-sys:main Aug 8, 2023
1 check failed

merrymercy reviewed Aug 8, 2023

merrymercy mentioned this pull request Aug 8, 2023

Revert tokenizer changes in train.py #2186

Merged
