
Refactor to introduce Trainer & TrainingArguments, add SetFit ABSA #265

Merged · 97 commits · Nov 10, 2023

Conversation

tomaarsen
Member

@tomaarsen tomaarsen commented Jan 11, 2023

Closes #179, closes #238, closes #320

Hello!

Pull Request overview

  • Additions:
    • Trainer, a modified version of the old SetFitTrainer.
    • DistillationTrainer, a modified version of the old DistillationSetFitTrainer.
    • TrainingArguments, a dataclass used throughout the updated Trainer classes.
  • Changes & Deprecations:
    • SetFitTrainer and DistillationSetFitTrainer: a soft deprecation, i.e. old code should give a warning, but still works.
    • Trainer.unfreeze and Trainer.freeze now simply point to SetFitModel.unfreeze and SetFitModel.freeze. This involves a soft deprecation for the keep_body_frozen parameter, i.e. old code should give a warning, but still works. Furthermore, Trainer.freeze() now freezes the full model, not just the head.
    • Trainer.train no longer accepts training arguments like num_epochs: a fairly soft deprecation, i.e. old code no longer works like before as the keyword arguments are ignored, but the code won't crash. A warning is thrown instead. This will likely affect all users of the differentiable head.
    • SupConLoss was moved to src/setfit/losses.py.
  • Hard removals:
    • SetFitBaseModel and SKLearnWrapper, which were both unused.
  • Tests:
    • Introduced test_deprecated_trainer_distillation.py and test_deprecated_trainer.py. These files contain old, unmodified tests, and they show that despite drastic changes, old code still generally works (with the exception of the aforementioned Trainer.train training argument deprecation).
    • Updated test_trainer_distillation.py and test_trainer.py to use the new training procedure.
    • Introduced test_training_args.py.

Goals

Based on my own experiments and useful feedback from @kgourgou in the refactoring discussion in #238, I have reduced my goal for this refactoring to the following:

  1. Ensure a familiar, intuitive training scheme involving Trainer and TrainingArguments classes.
  2. Additionally, ensure that the model can also be trained without a TrainingArguments class. In short, a fit method of a component (body or head) must not accept a TrainingArguments instance as a parameter, but merely the hyperparameters like learning_rate, etc.
  3. Ensure that the entire SetFit model can be trained with a single trainer.train(), regardless of the head chosen.
  4. Ensure a smooth and unintrusive way for users to update their code after this refactor.
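Goal 2 can be illustrated with a small sketch (the names here are hypothetical, not the actual SetFit API): a component-level fit function accepts plain hyperparameters, and only the Trainer layer knows about TrainingArguments.

```python
from dataclasses import dataclass


@dataclass
class TrainingArguments:
    # Hypothetical subset of arguments, for illustration only.
    head_learning_rate: float = 1e-2
    classifier_num_epochs: int = 16


def fit_head(head, learning_rate=1e-2, num_epochs=16):
    # Plain hyperparameters only -- no TrainingArguments in sight (goal 2).
    return {"lr": learning_rate, "epochs": num_epochs}


def train(args: TrainingArguments):
    # The Trainer layer is the only place that unpacks TrainingArguments.
    return fit_head(
        None,
        learning_rate=args.head_learning_rate,
        num_epochs=args.classifier_num_epochs,
    )


print(train(TrainingArguments()))  # {'lr': 0.01, 'epochs': 16}
```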

Notably, I'm not introducing an interface for classifier heads like I had originally planned, and I'm not deprecating SetFitModel.from_pretrained(..., use_differentiable_head=True) in favour of SetFitModel.from_pretrained(..., head=...). Perhaps some other time we can have a new discussion about those.

Upgrading guide

To update your code to work with this PR, the following changes must be made:

  1. Replace all uses of SetFitTrainer with Trainer, and all uses of DistillationSetFitTrainer with DistillationTrainer.
  2. Remove num_iterations, num_epochs, learning_rate, batch_size, seed, use_amp, warmup_proportion, distance_metric, margin, samples_per_label from a Trainer initialisation, and move them to a TrainingArguments initialisation instead. This instance should then be passed to the trainer via the args argument.
  3. Refactor the multiple trainer.train(), trainer.freeze() and trainer.unfreeze() calls that were previously necessary to train the differentiable head into a single trainer.train() call by setting batch_size and num_epochs on the TrainingArguments dataclass with tuples. The first value is for training the embeddings, and the second is for training the classifier.

I applied this guide to all of the tests to upgrade them to work with the new classes.

An example using a differentiable head:

Before:

# Load a SetFit model from Hub
model: SetFitModel = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    head_params={"out_features": 2},
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    learning_rate=2e-5,
    batch_size=16,
    num_iterations=20,
    num_epochs=1,
)

trainer.freeze() # Freeze the head
trainer.train() # Train only the body

# Unfreeze the head and unfreeze the body -> end-to-end training
trainer.unfreeze(keep_body_frozen=False)

trainer.train(
    num_epochs=16,
    batch_size=2,
    body_learning_rate=1e-5,
    learning_rate=1e-2,
)
metrics = trainer.evaluate()

After:

# Load a SetFit model from Hub
model: SetFitModel = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    use_differentiable_head=True,
    head_params={"out_features": 2},
)

# Create Training Arguments
args = TrainingArguments(
    # When an argument is a tuple, the first value is for training the embeddings,
    # and the latter is for training the differentiable classification head:
    batch_size=(16, 2),
    num_iterations=20,
    num_epochs=(1, 16),
    body_learning_rate=(2e-5, 1e-5),
    head_learning_rate=1e-2,
    end_to_end=True,
    loss=CosineSimilarityLoss,
)

# Create Trainer
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()

Deprecations

SetFitTrainer & DistillationSetFitTrainer

This is a soft deprecation, i.e. these can still be used like before. For reference, the following snippet still works.

First snippet from the README, using SetFitTrainer
from datasets import load_dataset
from sentence_transformers.losses import CosineSimilarityLoss

from setfit import SetFitModel, SetFitTrainer, sample_dataset


# Load a dataset from the Hugging Face Hub
dataset = load_dataset("sst2")

# Simulate the few-shot regime by sampling 8 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=8)
eval_dataset = dataset["validation"]

# Load a SetFit model from Hub
model: SetFitModel = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
)

# Create trainer
trainer = SetFitTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=4,
    num_iterations=20,  # The number of text pairs to generate for contrastive learning
    num_epochs=1,  # The number of epochs to use for contrastive learning
    column_mapping={"sentence": "text", "label": "label"},  # Map dataset columns to text/label expected by trainer
)

# Train and evaluate
trainer.train()
metrics = trainer.evaluate()
The corresponding output
Found cached dataset sst2 ([sic])
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 299.96it/s] 
fb86.arrow
Loading cached processed dataset at [sic]
Loading cached processed dataset at [sic]
Loading cached shuffled indices for dataset at [sic]
model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.  
[sic]\demo_readme.py:20: DeprecationWarning: `SetFitTrainer` has been deprecated. Please use `Trainer` instead.
  trainer = SetFitTrainer(
Applying column mapping to training dataset
***** Running training *****
  Num examples = 640
  Num epochs = 1
  Total optimization steps = 160
  Total train batch size = 4
Iteration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 160/160 [02:13<00:00,  1.20it/s] 
Epoch: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [02:13<00:00, 133.68s/it] 
Applying column mapping to evaluation dataset
***** Running evaluation *****
{'accuracy': 0.8818807339449541}
Note that the output contains a DeprecationWarning.

keep_body_frozen on trainer.unfreeze:

This is also a soft deprecation, so we can use this like before. Any use of keep_body_frozen results in a warning, and trainer.unfreeze(keep_body_frozen=True) is made equivalent to model.unfreeze("head"), while trainer.unfreeze(keep_body_frozen=False) is like model.unfreeze(). However, there should be no need for this method anymore, so we may be better off deprecating it fully.
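The mapping described above can be sketched roughly like this (an illustrative shim with stand-in classes, not the actual implementation):

```python
import warnings


class _Model:
    # Minimal stand-in that records unfreeze calls, for illustration only.
    def __init__(self):
        self.calls = []

    def unfreeze(self, component=None):
        self.calls.append(component)


class Trainer:
    def __init__(self, model):
        self.model = model

    def unfreeze(self, keep_body_frozen=None):
        # Hypothetical shim: warn on the deprecated parameter, then
        # forward to the equivalent SetFitModel.unfreeze call.
        if keep_body_frozen is not None:
            warnings.warn(
                "`keep_body_frozen` has been deprecated.", DeprecationWarning
            )
        if keep_body_frozen:
            self.model.unfreeze("head")
        else:
            self.model.unfreeze()


trainer = Trainer(_Model())
trainer.unfreeze(keep_body_frozen=True)   # -> model.unfreeze("head")
trainer.unfreeze(keep_body_frozen=False)  # -> model.unfreeze()
print(trainer.model.calls)  # ['head', None]
```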

Please note the following snippet:

trainer.freeze()  # Freeze the head
trainer.train()  # Train only the body

# unfreeze and train the body

This snippet was common when training with a differentiable head, but stops working after this PR, as trainer.freeze() will now fully freeze the model, after which trainer.train() fails because it can't perform a backward pass through the frozen sentence transformer body. This causes an error:

RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
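The error can be reproduced with a minimal PyTorch snippet: once every parameter is frozen, the output tensor has no grad_fn, so backward() has nothing to propagate through.

```python
import torch

body = torch.nn.Linear(4, 4)
for param in body.parameters():
    param.requires_grad = False  # trainer.freeze() now freezes the full model

loss = body(torch.ones(1, 4)).sum()
try:
    loss.backward()
except RuntimeError as err:
    # element 0 of tensors does not require grad and does not have a grad_fn
    print(err)
```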

Trainer.train arguments

This is a fairly soft deprecation, so code with e.g. trainer.train(num_epochs=..., ...) no longer works like before. The arguments passed via keywords are simply ignored now, while a warning is thrown:

[sic]\demo_readme_diff.py:44: DeprecationWarning: `SetFitTrainer.train` does not accept keyword arguments anymore. Please provide training arguments via a `TrainingArguments` instance to the `SetFitTrainer`initialisation or the `SetFitTrainer.train` method.
  trainer.train(
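The behaviour can be sketched as follows (a hypothetical stand-in, not the actual SetFitTrainer): keyword arguments are discarded and a DeprecationWarning is emitted, while training proceeds with the stored arguments.

```python
import warnings


class SetFitTrainerSketch:
    # Illustrative only: shows the "ignore kwargs, warn instead" behaviour.
    def train(self, **kwargs):
        if kwargs:
            warnings.warn(
                "`SetFitTrainer.train` does not accept keyword arguments anymore.",
                DeprecationWarning,
            )
        # Training would proceed here using the stored TrainingArguments.
        return "trained"


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = SetFitTrainerSketch().train(num_epochs=5)

print(result, len(caught))  # trained 1
```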

Training Arguments

Note that the TrainingArguments class must be able to hold separate values for training the embeddings vs. the classifier. The most difficult case is the learning rates, for which 3 separate values must exist:

  1. The learning rate of the sentence transformer body while training the embeddings.
  2. The learning rate of the sentence transformer body while training the classifier.
  3. The learning rate of the head while training the classifier.

For several other arguments (e.g. num_epochs, batch_size), two values must exist:

  1. For training the embeddings.
  2. For training the classifier.

For these last two cases, TrainingArguments(num_epochs=12) can be used, which will set both embedding_num_epochs and classifier_num_epochs to 12. Alternatively, TrainingArguments(num_epochs=(1, 12)) will set embedding_num_epochs to 1 and classifier_num_epochs to 12. Beyond that, TrainingArguments(embedding_num_epochs=1, classifier_num_epochs=12) is also allowed, i.e. they can be set directly. In practice, only embedding_num_epochs and classifier_num_epochs are used, while num_epochs exists only to make setting the arguments more natural.

The learning rate case is similar. For these, body_learning_rate and head_learning_rate must be set explicitly. Like before, body_learning_rate can be either set as a float or as a tuple of floats. If a tuple is given, then the first value is once again for training the embeddings, while the second value is for training the classifier. If instead only one value is given to body_learning_rate, then it is internally expanded to a tuple. Afterwards, body_embedding_learning_rate and body_classifier_learning_rate are either explicitly provided or implicitly set via body_learning_rate. Like before, only body_embedding_learning_rate and body_classifier_learning_rate are used, while body_learning_rate exists only to make setting the arguments more natural.
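The expansion logic for both cases might look roughly like this (a hypothetical sketch of the normalization described above, not the actual SetFit implementation):

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


def _as_pair(value):
    # A single value applies to both phases; a tuple is interpreted as
    # (embedding phase, classifier phase).
    return value if isinstance(value, tuple) else (value, value)


@dataclass
class TrainingArgumentsSketch:
    num_epochs: Union[int, Tuple[int, int]] = 1
    body_learning_rate: Union[float, Tuple[float, float]] = 2e-5
    embedding_num_epochs: Optional[int] = None
    classifier_num_epochs: Optional[int] = None

    def __post_init__(self):
        # num_epochs is only a convenience; the phase-specific values are
        # what is actually used, unless they were set explicitly.
        epochs = _as_pair(self.num_epochs)
        if self.embedding_num_epochs is None:
            self.embedding_num_epochs = epochs[0]
        if self.classifier_num_epochs is None:
            self.classifier_num_epochs = epochs[1]
        # body_learning_rate expands the same way:
        (self.body_embedding_learning_rate,
         self.body_classifier_learning_rate) = _as_pair(self.body_learning_rate)


args = TrainingArgumentsSketch(num_epochs=(1, 12), body_learning_rate=2e-5)
print(args.embedding_num_epochs, args.classifier_num_epochs)  # 1 12
print(args.body_classifier_learning_rate)  # 2e-05
```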

Beyond that, there is a new training argument called end_to_end. If the model is initialized with a differentiable head, then this argument specifies whether the sentence transformer body is frozen (if end_to_end=False) or not (if end_to_end is True) during training of the classifier. This replaces the previous need of trainer.freeze() before and during trainer.train() calls.
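In effect (sketched here with hypothetical names), end_to_end controls whether the body is frozen during the classifier phase:

```python
class ModelSketch:
    # Minimal stand-in tracking whether the body is frozen.
    def __init__(self):
        self.body_frozen = False

    def freeze(self, component):
        if component == "body":
            self.body_frozen = True


def train_classifier(model, end_to_end: bool):
    # Hypothetical: the body stays trainable only when end_to_end=True,
    # replacing the old trainer.freeze() / trainer.unfreeze() dance.
    if not end_to_end:
        model.freeze("body")
    return model.body_frozen


print(train_classifier(ModelSketch(), end_to_end=False))  # True
```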

See the Upgrading Guide for an example of how the TrainingArguments may be initialized in a relatively complex case.

Tests

The new tests\test_deprecated_trainer.py and tests\test_deprecated_trainer_distillation.py files are equivalent to the tests\test_trainer.py and tests\test_trainer_distillation.py files at ac958ae, with four exceptions. In tests\test_deprecated_trainer.py, three tests are skipped:

  • Two are skipped because they contain trainer.train(max_length=max_length, ...) and then expect a warning based on that. However, these arguments are now ignored, so no warning is ever thrown.
  • Another test is skipped because it contains the following deprecated snippet:
    trainer.freeze()  # Freeze the head
    trainer.train()  # Train only the body
    
    # unfreeze and train the body
    See the Deprecation section for more details on why this crashes.

And as a fourth change, from setfit.modeling import SupConLoss was changed to from setfit.losses import SupConLoss.

Using these tests, it is clear that users will not immediately face dozens of errors when updating. Instead, they may face several warnings and perhaps an error if they are using differentiable heads.

Discussion points

  1. How should users be notified on how to update their code? A section in the README? Via descriptive deprecation warnings? A link to this PR in the release notes?
  2. What should the deprecation timeline be? i.e. when do we remove SetFitTrainer altogether? We should include this in the deprecation warning messages.
  3. Should we simply deprecate Trainer.freeze() and Trainer.unfreeze()? They should not be needed anymore, and they are simply equivalent to calling trainer.model.freeze() and trainer.model.unfreeze(). I'm in favour of the deprecation.
  4. If we do not deprecate Trainer.freeze(), should we then introduce a special error message when people use trainer.freeze() followed by trainer.train(), which now breaks whereas it worked previously? (See the Deprecation section)
  5. Are we a fan of the name end_to_end for training a differentiable model in full, i.e. not freezing the body when training the classifier?
  6. Should other arguments to Trainer be absorbed into TrainingArguments? In particular, loss_class?
  7. Which release should this fall under? 0.6.0? 0.7.0? 1.0.0?

And perhaps most importantly, I'm interested in all of your feedback on how you feel about these changes.

To do:

There are several things that still need to be done before this can be merged:

  • Docstrings for all arguments in TrainingArguments.
  • Update all method docstrings.
  • Pick deprecation timeline for deprecated classes and methods.
  • Create additional tests to ensure that e.g. embedding_batch_size and classifier_batch_size are set correctly whenever batch_size is set.

I'll make a separate PR for updating the README/Notebook/scripts.

Feel free to ask me anything if you have any questions.

cc: @lewtun

  • Tom Aarsen

Note: This commit involves three deprecations: the SetFitTrainer class (and DistillationSetFitTrainer), additional arguments to Trainer.train, and the keep_body_frozen argument to Trainer.unfreeze. The first and last of these are 'graceful', i.e. old code will still work, but the Trainer.train changes are breaking in some situations: for example, num_epochs can no longer be passed to Trainer.train. The new 'deprecated' test files are identical to the old test files. The goal here is to test whether old behaviour is still possible. For the most part it is, with the exception of using Trainer.train with extra arguments. As a result, I skipped two tests in test_deprecated_trainer.py. Also note that docstrings have yet to be updated!
@tomaarsen
Member Author

Merged 4cad62c...174eb00 into this PR via 94106cc.
Merged 29c0348 into this PR via a39e772.

…rate

The reasoning is that with body_learning_rate, the tuple is for (training embedding phase, training classifier phase), which matches the tuples that you should give to num_epochs and batch_size.
@tomaarsen
Member Author

Merged 0cb8ffd into this PR via 7d4ad00.

Member

@lewtun lewtun left a comment


Wow, this is a real tour de force of a PR @tomaarsen - great job refactoring all this complexity into clean APIs 🚀 !

Overall the code looks great and most of my comments are just nits + some suggestions for the deprecation warnings. I'll respond to your discussion points in a separate comment

Inline review comments were left on the following files:

  • src/setfit/modeling.py
  • src/setfit/trainer.py
  • src/setfit/trainer_distillation.py
  • src/setfit/training_args.py
  • tests/test_deprecated_trainer.py
  • tests/test_training_args.py
@lewtun
Member

lewtun commented Feb 6, 2023

Regarding the discussion points:

How should users be notified on how to update their code? A section in the README? Via descriptive deprecation warnings? A link to this PR in the release notes?

In transformers we usually do this in two ways: add descriptive deprecation warnings that feature X will be removed in the next major version of the lib & add a "Breaking changes" section to the release notes with a link to this PR (example). I think we can adopt the same strategy here :)

What should the deprecation timeline be? i.e. when do we remove SetFitTrainer altogether? We should include this in the deprecation warning messages.

I would like to remove it in v1 if possible, so one approach for that could be to make a v0.6 or v0.7 release that has all the deprecation warnings included. Alternatively, we could merge this in its current form and tell users it will be removed in v2

Should we simply deprecate Trainer.freeze() and Trainer.unfreeze? They should not be needed anymore, and they are simply equivalent to calling trainer.model.freeze() and trainer.model.unfreeze(). I'm in favour of the deprecation.

Yes, please deprecate!

Are we a fan of the name end_to_end for training a differentiable model in full, i.e. not freezing the body when training the classifier?

I think it's OK, but happy to change it if you have a different name :)

Should other arguments to Trainer be absorbed into TrainingArguments? In particular, loss_class?

Yes, I've been wondering this too - I think it would be good to put all non-dataset-dependent objects into TrainingArguments - WDYT? I think technically we could also put metric and column_mapping there too right?

Which release should this fall under? 0.6.0? 0.7.0? 1.0.0?

I think we could aim v1 to be a big release where we are allowed to have a few breaking changes - WDYT?

@tomaarsen
Member Author

tomaarsen commented Feb 6, 2023

Merged 9b7f74e into this PR via aab2377.

I'm down to move metric, loss and column_mapping to TrainingArguments.

I'll try to get to work on your comments soon.

tomaarsen and others added 25 commits October 25, 2023 14:08
The old conditional was True with the default -1, not ideal
Sampling strategy (new sampler)
* Initial version for SetFit ABSA

* Create complete test suite for ABSA (100%,90%,96%)

Only push_to_hub is not under test

* Run formatting

* Allow initializing models with different span_context

* Remove resolved TODO

* Raise error if args is the wrong type

* Update column mapping, allow partial maps

* Remove model_init from ABSA Trainer, not used

* Split train into train_aspect and train_polarity

And reformat

* Prefix logs with aspect/polarity when training

* Add ABSA-specific model cards

* If spaCy doesn't agree with the start/end, just ignore those cases

* If there are no aspects, just return

* Elaborate on the required columns in the datasets

* Add Absa to the WIP docs
As it requires accelerate to be updated
solved an issue that was causing failures
@tomaarsen tomaarsen changed the base branch from main to v1.0.0-pre November 10, 2023 11:10
@tomaarsen tomaarsen changed the title Refactor to introduce Trainer & TrainingArguments Refactor to introduce Trainer & TrainingArguments, add SetFit ABSA Nov 10, 2023
@tomaarsen tomaarsen merged commit b636cd7 into huggingface:v1.0.0-pre Nov 10, 2023
18 checks passed