The current benchmarks repo handles performance benchmarking. However, knowing only which implementation is fastest tells us nothing about whether that speed comes at the cost of the LLM's generation quality. There is a very direct relationship between quality degradation and decreased precision, and sometimes even a change of implementation or backend can affect quality.
So here is the idea:
1. Curate a very good evaluation dataset with good prompts (the types or subjects of the prompts still need to be discussed).
2. Once that is done, implement a simple evaluation pipeline, i.e. a script/function that can do a one-shot evaluation of this dataset (a rough sketch follows the sample format below).
3. Then show the results of those prompts per engine. Inside each engine/implementation folder we will have a `results.md` file, in which the results are shown in the following sample format:
### AutoGPTQ

#### Float 32 precision

| Id | Prompt | Result | Score |
|----|--------|--------|-------|
| 1 | This is a sample prompt | This is a sample result | 5.5 |

#### Float 16 precision

| Id | Prompt | Result | Score |
|----|--------|--------|-------|
| 1 | This is a sample prompt | This is a sample result | 5.5 |
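As a starting point for the pipeline in step 2 above, here is a minimal sketch of the one-shot evaluation plus the `results.md` rendering. Everything in it is an assumption for discussion: `EvalResult`, the `engine.generate(...)` call, and `render_results_md` are hypothetical names, not the actual interfaces of the repo's engine wrappers.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class EvalResult:
    id: int
    prompt: str
    result: str
    score: Optional[float]  # scoring is still an open question


def evaluate(engine, prompts: List[str], precision: str = "float16") -> List[EvalResult]:
    """Run every prompt once (one-shot) through the given engine."""
    results = []
    for i, prompt in enumerate(prompts, start=1):
        # `engine.generate` is a hypothetical API; each engine wrapper
        # in the repo would need to expose something equivalent.
        output = engine.generate(prompt, precision=precision)
        results.append(EvalResult(id=i, prompt=prompt, result=output, score=None))
    return results


def render_results_md(engine_name: str, by_precision: Dict[str, List[EvalResult]]) -> str:
    """Render results into the sample results.md format shown above."""
    lines = [f"### {engine_name}", ""]
    for precision, results in by_precision.items():
        lines += [
            f"#### {precision} precision",
            "",
            "| Id | Prompt | Result | Score |",
            "|----|--------|--------|-------|",
        ]
        for r in results:
            score = "" if r.score is None else str(r.score)
            lines.append(f"| {r.id} | {r.prompt} | {r.result} | {score} |")
        lines.append("")
    return "\n".join(lines)
```

Usage would then be something like `render_results_md("AutoGPTQ", {"Float 32": evaluate(engine, prompts, "float32"), "Float 16": evaluate(engine, prompts, "float16")})`, written out to that engine's `results.md`.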
Whether to implement a scoring mechanism is still an open question for discussion. However, this can be the format.
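If we do decide to score, one cheap candidate (purely a suggestion, nothing here commits to it) is token-level F1 overlap against a reference answer, which would require the dataset to carry reference answers:

```python
# One possible scoring mechanism (an assumption, open for discussion):
# token-level F1 between the generated result and a reference answer.
def token_f1(result: str, reference: str) -> float:
    pred = result.lower().split()
    ref = reference.lower().split()
    if not pred or not ref:
        return 0.0
    # Count predicted tokens that also appear in the reference,
    # consuming each reference token at most once.
    ref_counts = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    common = 0
    for tok in pred:
        if ref_counts.get(tok, 0) > 0:
            common += 1
            ref_counts[tok] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Note this yields scores in [0, 1]; the 5.5 in the sample tables implies a different scale (e.g. an LLM-as-judge rating out of 10), so treat this as only one option.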
So here are the subtasks:

- [ ] Get 5 good prompts
- [ ] Make a simple evaluation pipeline supporting all the engines
- [ ] State the results in a README, or better, generate them directly from the evaluation function
- [ ] (Optional) Make a Hugging Face Space out of it (a minimal sketch follows below)
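For the optional Hugging Face Space, a minimal Gradio app could expose the evaluation interactively. This is only a sketch under assumptions: `ENGINES` and the placeholder `run_eval` below are hypothetical and would need to be wired to the actual pipeline and engine wrappers.

```python
# Minimal sketch of a Hugging Face Space (Gradio) front end.
# ENGINES and run_eval() are hypothetical placeholders for the
# real engine list and the evaluation pipeline sketched above.
import gradio as gr

ENGINES = ["AutoGPTQ"]  # would list every supported engine
PRECISIONS = ["float32", "float16"]


def run_eval(engine_name: str, precision: str, prompt: str) -> str:
    # In the real Space this would call the evaluation pipeline;
    # here it only echoes the inputs as a placeholder.
    return f"[{engine_name} / {precision}] result for: {prompt}"


demo = gr.Interface(
    fn=run_eval,
    inputs=[
        gr.Dropdown(choices=ENGINES, label="Engine"),
        gr.Dropdown(choices=PRECISIONS, label="Precision"),
        gr.Textbox(label="Prompt"),
    ],
    outputs=gr.Textbox(label="Result"),
    title="LLM quality evaluation",
)

if __name__ == "__main__":
    demo.launch()
```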