CodeLlama quantisation weirdness

When messing around with quantisation to see what kind of models I could fit on an RTX 3090 I came across some strange behaviour. I was comparing three sizes of CodeLlama model, with different quantisations:

codellama/CodeLlama-7b-Instruct-hf in full-fat, 8-bit and 4-bit
codellama/CodeLlama-13b-Instruct-hf in 8-bit and 4-bit
codellama/CodeLlama-34b-Instruct-hf in 4-bit

The quality of the response to my test question was not too bad with any of these, apart from codellama/CodeLlama-34b-Instruct-hf in 4-bit, which was often (but not always) heavily glitched with missing tokens -- that is, it was worse than codellama/CodeLlama-7b-Instruct-hf in 4-bit. That surprised me! In the case captured in this notebook, it generates Java instead of the requested Python, and I've also seen it output glitched JavaScript.

I was expecting quantisation to worsen the results, but not to make a larger model worse than a smaller one at the same level of quantisation. I've put this repo up to see if anyone can repro these results, and to find out if anyone has any idea why it's happening.

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
.gitignore		.gitignore
README.md		README.md
codellama-quantisation-weirdness.ipynb		codellama-quantisation-weirdness.ipynb
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

codellama-quantisation-weirdness.ipynb

codellama-quantisation-weirdness.ipynb

requirements.txt

requirements.txt

Repository files navigation

CodeLlama quantisation weirdness

About

Releases

Packages

Languages

gpjt/codellama-quantisation-weirdness

Folders and files

Latest commit

History

Repository files navigation

CodeLlama quantisation weirdness

About

Resources

Stars

Watchers

Forks

Languages