
Unexpected larger perplexity on PG19 #48

Open
Yiyi-philosophy opened this issue Dec 10, 2023 · 1 comment

@Yiyi-philosophy

Hi YaRN team,

I hope this message finds you well. I've been using your code (jquesnelle/yarn) to evaluate perplexity on the PG19 dataset. While reviewing the eval.sh script, I found some definitions related to PG19, but the command that actually computes perplexity on it appears to be missing.

Settings:

  • Base Model: llama2-7b
  • Base Context Size: 4096
  • Sliding Window: 256, 4096
  • Scale to: 8192

In eval.sh, I found the following definition for the PG19 dataset:

# python eval/perplexity.py -m meta-llama/Llama-2-7b-hf --dataset pg19 --split test --feature text --save-tokenized output/pg19-test-tokenized
PG19="--tokenized emozilla/pg19-test-tokenized"

However, I did not find the actual command for computing perplexity. I therefore attempted to test with a command of my own:

python eval/perplexity.py --dataset pg19 --feature "text" --samples 5 -m meta-llama/Llama-2-7b-hf --max-tokens $max_tokens --min-tokens $max_tokens --tokens-step 4000 --tokenized emozilla/pg19-test-tokenized --yarn $((max_tokens / 4096)) --max-position-embeddings 4096 --original-max-position-embeddings 4096 --dataset-min-tokens $max_tokens --sliding-window 4096 --custom-model --aggressive-memory --flash-attention
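For reference, here is my understanding of what --sliding-window does during scoring: a minimal sketch of strided perplexity in the style of the standard Hugging Face recipe. This illustrates the technique only and is not the repo's actual eval/perplexity.py:

import torch

def strided_perplexity(model, input_ids, context_len=4096, stride=256):
    # Slide a window of `context_len` tokens across the sequence, advancing
    # by `stride`. Only tokens not already scored in a previous window
    # contribute to the loss; earlier positions serve as context only.
    seq_len = input_ids.size(1)
    nlls, prev_end = [], 0
    for begin in range(0, seq_len, stride):
        end = min(begin + context_len, seq_len)
        target_len = end - prev_end  # tokens newly scored in this window
        ids = input_ids[:, begin:end].to(model.device)
        labels = ids.clone()
        labels[:, :-target_len] = -100  # mask context-only positions
        with torch.no_grad():
            loss = model(ids, labels=labels).loss
        nlls.append(loss * target_len)
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).sum() / prev_end)

Note that with stride equal to the context length the windows are disjoint, while with stride 256 each window is mostly previously seen context and only the last 256 tokens are scored, so the two settings stress the model quite differently.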

I observed that the results differ sharply depending on the sliding window: with YaRN scaling, perplexity is reasonable at a sliding window of 4096 but degrades badly at 256, whereas the PI and dynamic-NTK (dy-ntk) baselines are stable under both settings.

Results (YaRN, meta-llama/Llama-2-7b-hf, perplexity at 8192 tokens):

  • --sliding-window 4096: 9.89344
  • --sliding-window 256: 32.76145

In contrast, PI and dy-ntk maintain relatively stable perplexity across both sliding-window settings:

  • Sliding window 4096 / 256:
    • PI: 10.79598 / 10.65644
    • dy-ntk: 10.19125 / 10.214816

I would appreciate your insights on this phenomenon. Is this behavior expected, or does it point to a configuration issue on my side? If possible, could you provide more details on the PG19 testing script so I can better understand and adjust my evaluation configuration?

Thank you very much for your time and assistance. I look forward to your response.

Best regards,
Yiran

@ClarkChin08

Hi! I'm experiencing this issue as well. How many samples did you use, and have you solved it?
