
LEX inference support and checkpoint #45

Closed
RulinShao opened this issue Jul 30, 2023 · 5 comments

@RulinShao

Hello, thanks for your great work!

I was trying to benchmark your work on LEX, which I found in this fork.
However, that fork doesn't have the issue feature enabled, so I'm posting my questions here.

I tried to test your BCA technique with the LLaMA models, so I implemented BCA according to the commit I pasted above. However, my model failed to extrapolate when going beyond the block size. I am wondering if you could provide a checkpoint of your LEX model so that I can test it, compare against my code, and see where the bug is.
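For reference, here is the core of my blockwise masking logic as a minimal PyTorch sketch. The "own block plus the immediately preceding block" rule is my reading of BCA, not your reference code, and `block_size` stands in for the training length:

```python
import torch

def blockwise_causal_mask(seq_len: int, block_size: int) -> torch.Tensor:
    # True = allowed to attend. Each query attends causally to keys in its
    # own block and in the immediately preceding block (my reading of BCA).
    q = torch.arange(seq_len).unsqueeze(1)  # (seq_len, 1) query positions
    k = torch.arange(seq_len).unsqueeze(0)  # (1, seq_len) key positions
    causal = k <= q
    same_or_prev_block = (q // block_size - k // block_size) <= 1
    return causal & same_or_prev_block
```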

Thanks!

@sunyt32 (Contributor) commented Aug 3, 2023

What's your extrapolation setting? Is it identical to our paper's? Maybe you can first try window attention, which is much easier to implement, to see the performance.
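Roughly, window attention here just means a causal mask restricted to the most recent tokens. An untested sketch in PyTorch, where `window` would be your training length:

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    # True = allowed to attend: causal, restricted to the last `window` keys.
    q = torch.arange(seq_len).unsqueeze(1)
    k = torch.arange(seq_len).unsqueeze(0)
    return (k <= q) & (q - k < window)
```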

@sunyt32 (Contributor) commented Aug 3, 2023

If you use it in the LongEval setting, I don't think it works for retrieving topics across very long contexts. Local techniques preserve the local modeling, where ppl is more stable.

@RulinShao (Author)

Thanks for the reply @sunyt32! I was actually using the rotary embedding as implemented in the LLaMA HF code. I only implemented BCA to help it extrapolate to longer contexts. I did very simple tests for debugging:

For example, I set the window size to w and padded my prompt on the left to 2w (e.g., w = 16, 32, 128). (Do you think this is a reasonable case for debugging?) The LLaMA model worked well when I turned BCA off. With BCA, I expected it to generate reasonable answers following the prompt, but I got gibberish like 6.666666 after generating three or five new tokens. This dummy case suggests there might be a bug in my code, so I would appreciate any additional information that helps me check the expected outputs and intermediate tensors (like the k/v cache and the rotary positional embedding calculation with BCA) during generation.
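For concreteness, this is the kind of sanity check I'm running, where `model.set_bca(...)` is a hypothetical stand-in for however my patch toggles BCA on the HF LLaMA model: when the whole prompt fits in one block, BCA should reduce to plain causal attention, so the logits should match.

```python
import torch

@torch.no_grad()
def check_bca_matches_causal(model, input_ids, block_size):
    # When the whole prompt fits inside one block, BCA should reduce to
    # plain causal attention, so the logits must agree. `model.set_bca(...)`
    # is a stand-in for however the patch toggles BCA.
    assert input_ids.shape[1] <= block_size
    model.set_bca(enabled=False)
    ref = model(input_ids).logits
    model.set_bca(enabled=True, block_size=block_size)
    out = model(input_ids).logits
    print("max logit diff:", (ref - out).abs().max().item())  # expect ~0
```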

Thanks a lot for your time!

@sunyt32 (Contributor) commented Aug 3, 2023

I see. The reason here is similar: window attention doesn't actually give the model longer-context ability. However, using BCA or window attention should not cause gibberish; a reasonable generation should at least be coherent.

I have to admit that long-context evaluation is much more reasonable nowadays... It's a mistake to concentrate on ppl alone. Let's set the window-attention-style methods aside...

NTK extrapolation is a good technique for these tasks, but xPos still has its value. Our experiments show that xPos + NTK gives more stable performance than RoPE, on both ppl and retrieval.
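For reference, the NTK trick just rescales the rotary base; a rough sketch using the widely circulated community formula (not code from our repo):

```python
import torch

def ntk_scaled_inv_freq(dim: int, scale: float, base: float = 10000.0) -> torch.Tensor:
    # NTK-aware RoPE: enlarge the rotary base so the low frequencies are
    # stretched by roughly `scale` while the high frequencies barely change.
    base = base * scale ** (dim / (dim - 2))
    return 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
```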

@RulinShao (Author)

Gotcha! Thanks for the nice advice! I'll try the approaches you suggested!
