
Inquiry Regarding Evaluation Metrics in Your Paper #39

Open
teslacool opened this issue Nov 13, 2023 · 2 comments
Comments

teslacool commented Nov 13, 2023

@bloc97 @jquesnelle Dear Authors,

Firstly, I would like to extend my sincere appreciation for your remarkable work. It is truly commendable and has served as a valuable resource for the community.

Upon reading your paper, I encountered some confusion regarding the evaluation metrics employed. Specifically, in Section 4.3.1, you state: "...selected 10 random samples from Proof-pile with at least 128k tokens each and evaluated the perplexity of each of these samples when truncated at 2k steps from a sequence length of 2k tokens through 128k tokens." Could you kindly clarify what is meant by "2k steps" in this context?

Additionally, the term "Sliding window perplexity (S = 256) of ten 128k Proof-pile documents truncated to evaluation context window size" is used multiple times. However, I am uncertain how sliding window perplexity is applied if the documents are truncated to the evaluation context window size. Does it mean the documents are truncated to the maximum evaluation context window size (128k)?

Your insights and clarifications on these points would be greatly appreciated, as they might resolve some misunderstandings I have regarding the paper.

Thank you for your time and consideration.

bloc97 (Collaborator) commented Nov 13, 2023

Let's say you have one document 160k tokens in length. We start by truncating that document to 128k tokens, set the model's context size to 2k, and slide that window across the 128k document with a stride of S = 256, evaluating perplexity as we go.

Then we do the same again with a context size of 4k (increasing by 2k each time), again with a stride of S = 256.
We continue with 6k, 8k, 10k, 12k, and so on until we reach 128k. Of course, at 128k the stride no longer matters, because the entire document fits in a single pass.
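
A minimal sketch of this procedure, assuming a Hugging Face causal LM; the function and variable names below are illustrative placeholders, not the actual evaluation script:

```python
import torch

@torch.no_grad()
def sliding_window_ppl(model, doc_ids, context_size, stride=256, device="cuda"):
    """Perplexity of one (already truncated) document, scored with a window of
    `context_size` tokens that advances `stride` tokens at a time."""
    seq_len = doc_ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context_size, seq_len)
        trg_len = end - prev_end              # tokens not scored by an earlier window
        input_ids = doc_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100       # only score the new tokens
        loss = model(input_ids, labels=target_ids).loss
        nll_sum += loss.item() * trg_len      # loss is the mean NLL over scored tokens (approx.)
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.tensor(nll_sum / n_scored)).item()

# Truncate each document to 128k tokens once, then sweep the window size
# from 2k up to 128k in steps of 2k:
# doc_ids = tokenizer(text, return_tensors="pt").input_ids[:, :128 * 1024]
# for context_size in range(2 * 1024, 128 * 1024 + 1, 2 * 1024):
#     print(context_size, sliding_window_ppl(model, doc_ids, context_size))
```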

Edit: By "evaluation context window size" we mean the model's actual maximum window size, not the artificially limited context sizes we sweep over to obtain less-than-ideal perplexities (so that the comparison against models with smaller effective context sizes is fair).

teslacool (Author) commented Nov 14, 2023

@bloc97 Thank you for your detailed explanation. I now understand that the '2k steps' refer to the incremental context sizes used during evaluation, and that the document is consistently truncated to 128k tokens, irrespective of the evaluation context window size. This clarification helps, but I'd like to note that the phrase 'truncated to evaluation context window size,' which appears several times in the paper, might be a bit misleading or unclear. Initially, it suggested to me that the truncation length varied in accordance with the evaluation context window size, which, as you've explained, is not the case.

Additionally, I have studied your code, and I noticed in your implementation that the document is truncated to the current context window size. Could you please clarify this second point of confusion?
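
To make that second point concrete, here is roughly the difference I mean (illustrative Python only; `input_ids` and `context_size` are placeholder names, not the repo's actual variables):

```python
# What the paper's wording suggests: always truncate to 128k, whatever the window size.
doc_ids = input_ids[:, :128 * 1024]
# What the evaluation code appears to do: truncate to the current context window size.
doc_ids = input_ids[:, :context_size]
```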
