
An Evaluation Dataset for quality benchmarking of different inference engine implementations. #116

Closed

Anindyadeep opened this issue Jan 16, 2024 · 1 comment
Anindyadeep commented Jan 16, 2024

The current benchmarks repo only does performance benchmarking. However, knowing which implementation is fastest tells us nothing about the quality of the LLM's generations, which may be compromised in the process. There is a direct relationship between quality degradation and reduced precision, and sometimes even a change of implementation or backend can affect quality.

So here is the idea:

  • We need to curate a good evaluation dataset with well-chosen prompts (the type or subject of the prompts still needs to be discussed)
  • Once that is done, we need to implement a simple evaluation pipeline, i.e. a script/function that can do a one-shot evaluation over this dataset
  • Then we can show the results of those prompts per engine: inside each engine/implementation folder, we will have a results.md file showing the results in the following sample format:

**AutoGPTQ**

**Float 32 precision**

| Id | Prompt | Result | Score |
| --- | --- | --- | --- |
| 1 | This is a sample prompt | This is a sample result | 5.5 |

**Float 16 precision**

| Id | Prompt | Result | Score |
| --- | --- | --- | --- |
| 1 | This is a sample prompt | This is a sample result | 5.5 |

Whether to implement a scoring mechanism is still an open question for discussion, but this can be the format.

So here are the subtasks:

  • Get 5 good prompts
  • Make a simple evaluation pipeline supporting all the engines (a rough sketch is included after this list)
  • State the results in a README, or better, generate them directly from the evaluation function
  • (Optional) Make a Hugging Face Space out of it
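
As a starting point for the pipeline subtask, here is a minimal sketch (in Python, like the rest of the repo) of what the one-shot evaluation could look like. The `generate` callable, the placeholder `score` function, the prompt list, and the `bench_autogptq` output directory are illustrative assumptions, not the final design; the real script would hook into the existing per-engine runners and whatever scoring mechanism we decide on.

```python
# Rough sketch only -- the engine interface, scoring, and paths are placeholders.
from pathlib import Path

PROMPTS = [
    "This is a sample prompt",
    # ... the 5 curated prompts would go here
]


def score(prompt: str, result: str) -> float:
    """Placeholder: the actual scoring mechanism is still open for discussion."""
    return 0.0


def evaluate_engine(generate, precision: str) -> str:
    """Run every prompt once through `generate` and return a markdown results table."""
    rows = [
        f"**{precision} precision**",
        "",
        "| Id | Prompt | Result | Score |",
        "| --- | --- | --- | --- |",
    ]
    for idx, prompt in enumerate(PROMPTS, start=1):
        result = generate(prompt)
        rows.append(f"| {idx} | {prompt} | {result} | {score(prompt, result):.1f} |")
    return "\n".join(rows)


def dummy_generate(prompt: str) -> str:
    # Stand-in for a real inference-engine call (e.g. AutoGPTQ at a given precision).
    return "This is a sample result"


if __name__ == "__main__":
    report = "**AutoGPTQ**\n\n" + evaluate_engine(dummy_generate, "Float 32")
    out_dir = Path("bench_autogptq")  # hypothetical per-engine folder
    out_dir.mkdir(exist_ok=True)
    (out_dir / "results.md").write_text(report + "\n")
```

Generating one table per precision level this way would keep the results.md format identical across engines, so the per-engine folders stay easy to compare.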
Anindyadeep (Member, Author) commented:

We are closing this, since we are going with the approach mentioned in issue #162.

cc: @nsosio
