
An Evaluation Dataset for quality benchmarking of different inference engine implementations. #116

Closed

Anindyadeep opened this issue Jan 16, 2024 · 1 comment
Anindyadeep commented Jan 16, 2024

The current benchmarks repo only does performance benchmarking. However, knowing which implementation is fastest tells us nothing about the quality of the LLM's generations, which may be compromised in the process. There is a direct relationship between quality degradation and reduced precision, and sometimes even a change of implementation or backend can affect quality.

So here is the idea:

  • We need to curate a good evaluation dataset with well-chosen prompts (the type or subject of the prompts still needs to be discussed)
  • Once that is done, we need to implement a simple evaluation pipeline, i.e. a script/function that can do a one-shot evaluation over this dataset
  • Then we can show the results of those prompts per engine: inside each engine/implementation folder, we will have a results.md file showing the results in the following sample format:

**AutoGPTQ**

**Float 32 precision**

| Id | Prompt | Result | Score |
| --- | --- | --- | --- |
| 1 | This is a sample prompt | This is a sample result | 5.5 |

**Float 16 precision**

| Id | Prompt | Result | Score |
| --- | --- | --- | --- |
| 1 | This is a sample prompt | This is a sample result | 5.5 |

Whether to implement a scoring mechanism is still an open question for discussion, but this can be the format.

So here are the subtasks:

  • Get 5 good prompts
  • Make a simple evaluation pipeline supporting all the engines (a rough sketch is included after this list)
  • State the results in a README, or better, generate them directly from the evaluation function
  • (Optional) Make a Hugging Face Space out of it
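
As a starting point for the pipeline subtask, here is a minimal sketch (in Python, like the rest of the repo) of what the one-shot evaluation could look like. The `generate` callable, the placeholder `score` function, the prompt list, and the `bench_autogptq` output directory are illustrative assumptions, not the final design; the real script would hook into the existing per-engine runners and whatever scoring mechanism we decide on.

```python
# Rough sketch only -- the engine interface, scoring, and paths are placeholders.
from pathlib import Path

PROMPTS = [
    "This is a sample prompt",
    # ... the 5 curated prompts would go here
]


def score(prompt: str, result: str) -> float:
    """Placeholder: the actual scoring mechanism is still open for discussion."""
    return 0.0


def evaluate_engine(generate, precision: str) -> str:
    """Run every prompt once through `generate` and return a markdown results table."""
    rows = [
        f"**{precision} precision**",
        "",
        "| Id | Prompt | Result | Score |",
        "| --- | --- | --- | --- |",
    ]
    for idx, prompt in enumerate(PROMPTS, start=1):
        result = generate(prompt)
        rows.append(f"| {idx} | {prompt} | {result} | {score(prompt, result):.1f} |")
    return "\n".join(rows)


def dummy_generate(prompt: str) -> str:
    # Stand-in for a real inference-engine call (e.g. AutoGPTQ at a given precision).
    return "This is a sample result"


if __name__ == "__main__":
    report = "**AutoGPTQ**\n\n" + evaluate_engine(dummy_generate, "Float 32")
    out_dir = Path("bench_autogptq")  # hypothetical per-engine folder
    out_dir.mkdir(exist_ok=True)
    (out_dir / "results.md").write_text(report + "\n")
```

Generating one table per precision level this way would keep the results.md format identical across engines, so the per-engine folders stay easy to compare.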
Anindyadeep (Member, Author) commented:

We are closing this, since we are going with the approach mentioned in issue #162.

cc: @nsosio
