Refactor: encapsulate annotators used in safe tests #560

rogthefrog · 2024-09-10T23:25:29Z

#557

Extracts the internals of annotator configuration out of safe test to separate concerns better.

github-actions · 2024-09-10T23:25:41Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

rogthefrog · 2024-09-10T23:26:42Z

@dhosterman @bkorycki @wpietri for feedback please.

I'm aware some of the tests are failing. :)

wpietri · 2024-09-10T23:37:54Z

Could you say a bit more about how this code fits into your overall cleanup plan? E.g., is this step 1, or all the steps?

bkorycki · 2024-09-11T00:18:44Z

This looks great so far. It definitely makes sense to extract annotator configuration out of the test file. I don't have any high-level feedback, but let me know when it's ready for a full review!

rogthefrog · 2024-09-11T00:48:39Z

Could you say a bit more about how this code fits into your overall cleanup plan? E.g., is this step 1, or all the steps?

This is step 2 of potentially 6. I think this will enable faster iterations and integration of different models and evaluation functions, and I'd be OK leaving it like this if that's all we need. But if we do have additional needs, here are the steps.

Step 1 was making the private annotators' public interface more consistent across annotators, to make it easier and less risky to add new ones to modelgauge as they were being developed and iterated on in the private repo. This was done loosely, i.e. there's no actual API contract with ABC or interfaces the annotators need to implement; just enough to make it easier for engineering to integrate into modelgauge quickly and safely enough, and easy enough for non-engineers to adhere to without requiring more advanced Python features like ABCs or refactoring their existing annotators.

Step 2 (this) is to hide the internals of annotators and expose a simple interface to client code like safe tests, so that engineering can program to that interface. This keeps our client code cleaner, more readable, and testable, and easier to reason about. More interestingly, it allows people to create their own arbitrary groups of annotators and ensemble evaluation functions without engineering having to do much work to add them. They just create an AnnotatorSet with whatever combination of annotators evaluation function, including their configuration, and we can pull it in and use it in tests. So if you have annotators A, B, C, D, and E, you can run A,B,C together, A,B,C,D,E together, A,B,E together, etc. with minimal effort.

Possible future steps if we find they are needed:

a more generic way to inject the ensemble evaluation function into the AnnotatorSet, like a Strategy pattern. Currently it's hardcoded to EnsembleAnnotator.simple_join_evaluator_responses because that's what we have. This PR makes it relatively easy to support either 1) an arbitrary method in the EnsembleAnnotator class (not just majority vote) or 2) a different EnsembleAnnotator object altogether instead.
enforce the structure of annotators in code (e.g. provide ABCs) rather than convention ("hey so-and-so, make sure your client module includes a Config-like object to help us integrate it"). Some classes already have ABCs to enforce this, but the ensemble model wrappers that use them don't.
a declarative way to create annotator bundles. This would increase complexity and I don't know if there would be much benefit, but conceptually it's the logical termination of this work. E.g. instead of writing EnsembleAnnotatorSet and all its annotator loading and scoring logic in Python, we could expose a way for others to create arbitrary bundles declaratively in yaml or json or whatever, specifying their package names and evaluation functions as fully qualified names, authentication parameters in a requirements section (kind of like wants in systemd), etc.

Another thing we may consider:

instead of model client code managing third-party configuration explicitly (secrets, hostnames, endpoints, etc), we could conceivably use mixins for families of models, e.g. instead of

MISTRAL_8x22B_CONFIG.llm_config.api_key = self.configuration["together_api_key"]
LLAMA_3_70B_CONFIG.llm_config.api_key = self.configuration["together_api_key"]

we could make the Mistral and Llama3 clients implement a Together mixin, and likewise have a HuggingFace mixin for services hitting models on HF, a VLLM mixin for models we host, etc.

wpietri

Looks like good progress, but I'd like to take it a bit further if we can.

modelgauge/tests/safe_v1.py

wpietri · 2024-09-11T15:06:14Z

modelgauge/tests/safe_v1.py

-    HUGGINGFACE_KEY = InjectSecret(HuggingFaceKey)  # was: os.getenv("HF_TOKEN", "")
+# private annotators, if available
+try:
+    from modelgauge.private_ensemble_annotator_set import EnsembleAnnotatorSet


Could all of this be moved to private code? If we have to have private stuff in public files, I'd at least like to keep it to the minimum.

Yes, I'll give it a go.

rogthefrog · 2024-09-12T03:16:18Z

Note: the unit tests still don't all pass. I will fix them.

The lifecycle of the injected secrets has made it necessary to instantiate the annotator set inside the test constructor, which requires passing all the necessary secrets to that constructor. Using references to InjectSecret in a pre-configured AnnotatorSet object and passing that AnnotatorSet object in to the test constructor does not cause the references to be hydrated.

This means we're almost back to square one:

either the test constructor needs to know a lot about the internals of the annotators it's trying to exercise (specifically the secrets they need), which is the way Barbara had it originally,
or we make the test constructor flexible with metaprogramming, which is how I did it to make it work in this PR.

This solution is more generic and does insulate the test class from the internals of the annotators, but it's pretty gross, and the annotators are no longer completely self-configuring: the client has to interrogate the AnnotatorSet object for the secrets it needs, and give it the secrets, or a way to fetch the secrets, when instantiating it.

@bkorycki @wpietri feedback appreciated!

Thanks to @bkorycki for the assist!

wpietri · 2024-09-12T15:21:04Z

I should add that the original design here isn't Barbara's, but something we all inherited.

Is it possible we could improve things here by separating concerns? One of my ongoing struggles with the Tests is that they do all of these, and maybe more

express the formal definition of a benchmark component and its subcomponents that needs to be stable for reproducibility reasons
instantiate and manage those subcomponents
perform a variety of evaluation-time actions
manage and call (or trigger the calling of) a variety of external APIs, hiding the operational details

For a while, I've found the secrets stuff especially annoying, which is why ModelBench has the ModelGaugeSut, a wrapper that allows us to refer to a ModelGauge SUT without having to fully instantiate it. But over time, I'm finding the operational issues at least as much of a headache. For example, if many SUTs are calling the same Together API, to get maximum speed I need to centrally manage the connections to respect rate limits. Or looking at the HuggingFace stuff, having to manage instances and possibly wait many minutes for them shouldn't be hidden away in a variety of Test subcomponents; that's something better handled at a high level. That will become even more of an issue once we're using our own annotator VMs in earnest.

I'm not sure what the right breakdown is, but I'm hoping we can keep the Test objects themselves to doing the defining and the straightforward calculations. And it would be great if we could make things more composable, too. Although we do eventually want a locked-down set of Test objects to go with a locked-down set of Benchmarks, for now we want to iterate quickly.

Does that inspire any thoughts?

…st from the internals of its annotator(s). No functional change.

…t constructor; they are not hydrated if they are inside an object passed in to the test constructor. So we restructured the AnnotatorSet classes and the annotator registration to use the AnnotatorSet class and references to injectable secrets to the test constructor, and having the test constructor instantiate the AnnotatorSet object with injected secrets.

rogthefrog · 2024-09-13T07:59:44Z

@wpietri thank you for your comments. They are helpful in suggesting next steps to look at.

The tests are now passing. Question for the group: should we...

(1) merge this in its current state, which partially but incompletely addresses the initial goal to move some of the internal config logic out of the test, then add William's suggested improvements after that?
(2) continue with this branch, adding more separation of concerns per William's suggestions, and then merge?
(3) not merge this branch at all, chalking one up to R&D and knowledge sharing, and start on William's improvements off today's main, without this code?

I do agree with William that more separation of concerns is desirable, and I do think these changes are in line with that idea, even if they don't completely achieve the goal they set out to meet due to the intricacies of secrets. And they will make it easier to create arbitrary ensembles and evaluator functions for those ensembles, which I believe is something Shaona was hoping for. So I'm in favor of either option (1) or (3) above.

Please let me know what you think @bkorycki @wpietri @bollacker

wpietri · 2024-09-13T15:19:02Z

If you think this is a step in the right direction, I'm all for merging it now.

rogthefrog · 2024-09-13T16:20:08Z

@wpietri I do, but let's see what Barbara thinks.

bkorycki

Thanks for adding the private annotator tests! I agree that this is good to merge.

rogthefrog linked an issue Sep 10, 2024 that may be closed by this pull request

Encapsulate private annotators better #557

Closed

rogthefrog requested review from bkorycki, dhosterman and wpietri September 10, 2024 23:26

rogthefrog force-pushed the feat/557-encapsulate branch from aeca2c9 to e2f1728 Compare September 11, 2024 00:56

wpietri suggested changes Sep 11, 2024

View reviewed changes

rogthefrog mentioned this pull request Sep 11, 2024

HuggingFace Inference SUT #561

Merged

rogthefrog force-pushed the feat/557-encapsulate branch from ba94a65 to 075fcbf Compare September 11, 2024 19:31

rogthefrog added 12 commits September 13, 2024 00:37

Refactor: encapsulate variable, extract class. This decouples safe_te…

db4b0ee

…st from the internals of its annotator(s). No functional change.

add wildguard annotator that was commented out unintentionally

a6564e1

explicit arguments for easier debugging

3608548

noop; mypy

ba8e8b2

fix wrong import name and class name

92c98d7

mock secrets

a1f138b

fix secrets properly; call configure automatically upon construction

69b0159

fake more keys for the new annotators

8add664

fix annotator set creation

6bf75b8

make sure the tests run even if the private annotators aren't available

7f54ca0

mypy fix

8ab1461

rogthefrog force-pushed the feat/557-encapsulate branch from 4911c3f to 8ab1461 Compare September 13, 2024 07:39

pass in a variable, not a string

b4f233b

rogthefrog marked this pull request as ready for review September 13, 2024 08:00

rogthefrog requested a review from a team as a code owner September 13, 2024 08:00

rogthefrog mentioned this pull request Sep 13, 2024

Improve separation of concerns in tests #570

Open

bkorycki approved these changes Sep 13, 2024

View reviewed changes

rogthefrog merged commit 0605f15 into main Sep 14, 2024

rogthefrog deleted the feat/557-encapsulate branch September 14, 2024 00:45

github-actions bot locked and limited conversation to collaborators Sep 14, 2024

Refactor: encapsulate annotators used in safe tests #560

Refactor: encapsulate annotators used in safe tests #560

Uh oh!

Conversation

rogthefrog commented Sep 10, 2024

Uh oh!

github-actions bot commented Sep 10, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rogthefrog commented Sep 10, 2024

Uh oh!

wpietri commented Sep 10, 2024

Uh oh!

bkorycki commented Sep 11, 2024

Uh oh!

rogthefrog commented Sep 11, 2024

Uh oh!

wpietri left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

wpietri Sep 11, 2024

Choose a reason for hiding this comment

Uh oh!

rogthefrog Sep 11, 2024

Choose a reason for hiding this comment

Uh oh!

rogthefrog commented Sep 12, 2024

Uh oh!

wpietri commented Sep 12, 2024

Uh oh!

rogthefrog commented Sep 13, 2024

Uh oh!

wpietri commented Sep 13, 2024

Uh oh!

rogthefrog commented Sep 13, 2024

Uh oh!

bkorycki left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

github-actions bot commented Sep 10, 2024 •

edited

Loading