Baseline benchmark for 17 coding models : r/LocalLLaMA #456

irthomasthomas opened this issue Jan 26, 2024 · 0 comments

Baseline Benchmark for 17 Coding Models

Discussion

I am currently working on implementing some ideas for coding-model inference strategies (prompting, control, context exploration, CoT, ToT, etc.), and I needed a baseline benchmark on a bunch of models. Since I work on a 3060 12GB, I was limited in what I could test, so I went for every model that is 7B or 13B and has an AWQ quant available, since that is what the inference library I use supports. I thought I'd share some numbers.
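
For reference, here is a minimal sketch of loading and sampling one of these AWQ models on a single 12GB card. It uses vLLM purely as an example of an AWQ-capable inference engine; the post does not name the library actually used, and the model is just the first entry from the results table.

```python
# Minimal sketch (assumption: vLLM as the AWQ-capable engine; the post
# does not say which inference library was actually used).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheBloke/Mistral-7B-Instruct-v0.2-AWQ",  # one of the benchmarked models
    quantization="awq",
    dtype="half",
)

params = SamplingParams(temperature=0.0, max_tokens=512)
outputs = llm.generate(["def fibonacci(n):"], params)
print(outputs[0].outputs[0].text)
```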

Notes:

  • This is a benchmark for getting a local baseline. I'm interested in improvement from here, so don't take the absolute values too seriously (well, except maybe deepseek-coder-1.3b, which is a bit suspect).
  • I used the HumanEval dataset. It has since been superseded by HumanEval+ and other more recent benchmarks; I chose it because it was the first one I tried. Again, I'm looking for improvements over the baseline, so this is mostly fine.
  • AWQ is not the best quant out there, but all my tests will be done with the same quant, so for my purposes it is OK.
  • Temperature tests were done with only one generation each. In general you'd want to average the score over many generations at a given temperature.
  • Each model was prompted according to its model card template. Here's an example for the codellama series:
f"""<s>You are a helpful and respectful assistant. Answer the following question: {question}"""

Results

I've plotted the results (with horrendous contrasting colors, but alas) to look for any interesting patterns in problem solving. You can find the plots here.

| Model | Temp | Correct / 164 | Percentage |
| --- | --- | --- | --- |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.0 | 67 | 40.85% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.1 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.2 | 68 | 41.46% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.3 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.4 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.5 | 63 | 38.41% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.6 | 54 | 32.93% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.7 | 61 | 37.20% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.8 | 60 | 36.59% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 0.9 | 59 | 35.98% |
| TheBloke/Mistral-7B-Instruct-v0.2-AWQ | 1.0 | 65 | 39.63% |
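
The original plots are not reproduced here; as a stand-in, a quick way to chart the rows above (correct count vs. temperature for the Mistral model) with matplotlib:

```python
# Hedged sketch (not the original plots): score vs. temperature for the
# Mistral-7B rows in the table above.
import matplotlib.pyplot as plt

temps   = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
correct = [67, 63, 68, 61, 61, 63, 54, 61, 60, 59, 65]

plt.plot(temps, correct, marker="o")
plt.xlabel("Temperature")
plt.ylabel("Correct (out of 164)")
plt.title("TheBloke/Mistral-7B-Instruct-v0.2-AWQ on HumanEval")
plt.savefig("mistral_humaneval_temps.png", dpi=150)
```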

Suggested labels

{ "label-name": "coding-models", "description": "Discussion and benchmark of coding models implementation strategies.", "confidence": 96.82 }
