bert: fix layer norm epsilon value #1946
Merged
ref https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2/blob/7dbbc90392e2f80f3d3c277d6e90027e55de9125/config.json#L13
This is a quick-and-dirty fix since this code is going to be replaced anyway. It would be more correct to read layer_norm_eps when we convert to GGUF, and load that hyperparameter from the GGUF at inference time.
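A minimal sketch of the conversion-side half of that idea, assuming the standard Hugging Face `config.json` layout (the helper name and the `1e-12` fallback, which is the `transformers` BERT default, are illustrative, not the actual conversion script):

```python
import json

def read_layer_norm_eps(config_path, default=1e-12):
    # BERT-family HF checkpoints store the value under "layer_norm_eps";
    # fall back to the transformers BERT default (1e-12) when absent.
    with open(config_path) as f:
        config = json.load(f)
    return float(config.get("layer_norm_eps", default))
```

The converter would then write this value into the GGUF as a hyperparameter, and inference would read it back instead of hard-coding an epsilon.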
The difference between an epsilon of 1e-6 in LLaMA 1 and 1e-5 in LLaMA 2 created a significant difference in perplexity, so an eps parameter was added to ggml_norm and ggml_rms_norm soon after LLaMA 2 came out. Until the switch to GGUF, the default was 5e-6, a suitable middle ground, and the user could override the parameter at inference time via a command-line option.
The difference between 1e-5 and 1e-12 is certainly more significant... if only we had benchmarks for this code.