diff --git a/.spellcheck-en-custom.txt b/.spellcheck-en-custom.txt
index 4ebd275a..0154a3b5 100644
--- a/.spellcheck-en-custom.txt
+++ b/.spellcheck-en-custom.txt
@@ -34,6 +34,7 @@ Dropdown
 env
 EP
 Eval
+eval
 Excalidraw
 exfiltrate
 exfiltrating
@@ -52,6 +53,7 @@ Inferencing
 instructlab
 ISA
 JIT
+JSON
 Jupyter
 KAGGLE
 Kaggle
@@ -63,8 +65,8 @@ LLM
 llms
 LLVM
 lora
-md
 Markdownlint
+md
 Mergify
 Merlinite
 mimimum
@@ -72,10 +74,11 @@ Miniforge
 Mixtral
 MLX
 mlx
+MMLU
 NVidia
 Nvidia
-ollama
 Ollama
+ollama
 orchestrator
 ots
 Pareja
@@ -104,12 +107,12 @@ RX
 safetensors
 Salawu
 SDG
-Sigstore
 sdg
 sexualized
 SHA
 Shivchander
 Signoff
+Sigstore
 Srivastava
 subdirectory
 Sudalairaj
diff --git a/docs/evaluation/eval-repo.md b/docs/evaluation/eval-repo.md
new file mode 100644
index 00000000..1529567c
--- /dev/null
+++ b/docs/evaluation/eval-repo.md
@@ -0,0 +1,109 @@
+# New Repository Proposal: eval
+
+## Summary
+
+This document proposes a new repository under the `instructlab` GitHub organization:
+
+- `instructlab/eval`
+
+## Background
+
+The `instructlab/instructlab` repository currently includes no real implementation
+of Evaluation as described by the [LAB paper](https://arxiv.org/abs/2403.01081). The
+closest implementation currently in `instructlab/instructlab` is the `ilab test` command.
+
+As of this writing, `ilab test` is implemented only for macOS with M-series chips. It uses
+a JSON Lines file and a LoRA adapter to compare the output of a given model before and
+after LoRA training with MLX, hence the macOS M-series dependency.
+
+We want to build out a library of methods that satisfy the evaluation described in the
+paper, using higher-level evaluation schemes such as the
+[Multi-turn Benchmark](https://arxiv.org/abs/2306.05685) (MT-Bench) for skills and
+[Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (MMLU) for
+knowledge. We propose a new repository to house this code, publishing a new Python
+library called `instructlab-eval`. Our reasoning for a new repository and library includes:
+
+- We expect multiple consumers of this code. The `ilab` CLI is one, but we also envision
+building a REST API around it to help support scaling out this functionality on a cluster.
+- We expect there is broader community interest in an open-source library and service for
+evaluation. We envision this library could support other evaluation techniques over time.
+- We also realize that much of model evaluation is generally useful outside the context of
+InstructLab. Other libraries may emerge in the broader ecosystem that handle parts of what
+we need, while this library will always remain to handle the InstructLab-specific details
+of how evaluation works in our workflow.
+
+A hypothetical sketch of what such a library could look like is included as an
+appendix at the end of this document.
+
+## Maintainers
+
+The initial team of maintainers for this repository will be a copy of the
+`Backend Maintainers` GitHub team.
+
+## Alternatives Considered
+
+### Add to `instructlab/instructlab`
+
+We could add this code to the existing `instructlab/instructlab` repository.
+
+The primary argument against this approach is that we expect the scope of an
+`instructlab-eval` library to expand beyond the scope of what would be run by the
+`ilab` CLI. We instead envision a different community of contributors organizing
+around Evaluation specifically.
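+
+## Appendix: Hypothetical Library Sketch
+
+To make the proposal above more concrete, below is a minimal sketch of the kind of
+interface `instructlab-eval` could expose. None of this is settled API: the names
+(`Evaluator`, `EvalResult`, `MTBenchEvaluator`, `MMLUEvaluator`) and signatures are
+illustrative assumptions, not existing code.
+
+```python
+# Hypothetical sketch only: these classes do not exist yet anywhere in
+# InstructLab; all names and signatures here are illustrative.
+from abc import ABC, abstractmethod
+from dataclasses import dataclass, field
+
+
+@dataclass
+class EvalResult:
+    """A single benchmark outcome for one model."""
+
+    benchmark: str  # e.g. "mt-bench" or "mmlu"
+    model: str  # identifier of the evaluated model
+    score: float  # aggregate score on the benchmark
+    details: dict = field(default_factory=dict)  # per-task breakdown
+
+
+class Evaluator(ABC):
+    """Common interface each evaluation scheme would implement."""
+
+    @abstractmethod
+    def run(self, model: str) -> EvalResult:
+        ...
+
+
+class MTBenchEvaluator(Evaluator):
+    """Skills evaluation via the multi-turn benchmark (MT-Bench)."""
+
+    def __init__(self, judge_model: str):
+        self.judge_model = judge_model
+
+    def run(self, model: str) -> EvalResult:
+        # A real implementation would generate multi-turn answers with
+        # `model` and score them using `judge_model`; stubbed out here.
+        return EvalResult(benchmark="mt-bench", model=model, score=0.0)
+
+
+class MMLUEvaluator(Evaluator):
+    """Knowledge evaluation via MMLU multiple-choice tasks."""
+
+    def run(self, model: str) -> EvalResult:
+        # A real implementation would run the MMLU test set against
+        # `model` and compute accuracy; stubbed out here.
+        return EvalResult(benchmark="mmlu", model=model, score=0.0)
+```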
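+
+The point of a common `Evaluator` interface is that the `ilab` CLI and a future
+REST API could drive the same objects, which is the core of the argument above for
+a standalone library rather than code embedded in `instructlab/instructlab`.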