Unlearning dynamics curve on multi-gpu #94

@sh-qiangchen

Description

Thank you very much for maintaining this repository; it is absolutely fantastic work. I would like to ask for help with the following three questions:

  1. Which checkpoint does open-unlearning/eval.py evaluate? trainer.save_model(output_dir) saves the last checkpoint, not the best model (see the sketch right after this item).
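
For context, here is a minimal, hypothetical TrainingArguments sketch of the distinction I mean; the values are illustrative, not the repository's defaults. With load_best_model_at_end=True, Trainer reloads the best checkpoint at the end of train(), so a subsequent trainer.save_model(output_dir) writes the best model rather than the last one:

from transformers import TrainingArguments

# Hypothetical config (illustrative values only): load_best_model_at_end
# requires matching eval/save cadences and a metric to rank checkpoints.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",            # "evaluation_strategy" on older versions
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,      # reload best checkpoint after train()
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)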
  2. For the forget10 split, with per_device_train_batch_size=8 on 4 A800 GPUs, why did I get the trainer_state.json below? I expected 400 / 8 / 4 = 12.5 optimizer steps per epoch, which does not match the logged values (the arithmetic is spelled out after the JSON).
{
 "log_history": [
    {
      "epoch": 1.6,
      "grad_norm": 3.0877197396864764,
      "learning_rate": 8.620689655172414e-06,
      "loss": -0.0265,
      "step": 5
    },
    {
      "epoch": 3.2,
      "grad_norm": 31.033850214532777,
      "learning_rate": 6.896551724137932e-06,
      "loss": -0.7045,
      "step": 10
    },
    {
      "epoch": 4.8,
      "grad_norm": 60.4976325117357,
      "learning_rate": 5.172413793103449e-06,
      "loss": -3.8988,
      "step": 15
    },
    {
      "epoch": 6.4,
      "grad_norm": 98.01466701483473,
      "learning_rate": 3.448275862068966e-06,
      "loss": -10.4987,
      "step": 20
    },
    {
      "epoch": 8.0,
      "grad_norm": 812.0205372731976,
      "learning_rate": 1.724137931034483e-06,
      "loss": -24.3783,
      "step": 25
    },
    {
      "epoch": 9.6,
      "grad_norm": 609.3995261258575,
      "learning_rate": 0.0,
      "loss": -67.0836,
      "step": 30
    },
    {
      "epoch": 9.6,
      "step": 30,
      "total_flos": 0.0,
      "train_loss": -17.765076827506224,
      "train_runtime": 322.3017,
      "train_samples_per_second": 12.411,
      "train_steps_per_second": 0.093
    }
  ],
  "logging_steps": 5,
  "max_steps": 30,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 10
}
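
To make the mismatch concrete, here is the arithmetic in plain Python. The log shows epoch 1.6 after 5 steps, which implies an effective batch of 128 rather than the 32 I computed; a gradient_accumulation_steps of 4 would explain the gap, but that is only my guess and not something I have verified against the config:

# Sanity check of the step/epoch arithmetic (plain Python, no repo code).
dataset_size = 400            # forget10 split
per_device_batch = 8
num_gpus = 4

# What I expected: one optimizer step consumes 8 * 4 = 32 samples.
expected_steps_per_epoch = dataset_size / (per_device_batch * num_gpus)
print(expected_steps_per_epoch)   # 12.5

# What trainer_state.json implies: epoch 1.6 after 5 optimizer steps.
observed_steps_per_epoch = 5 / 1.6                     # 3.125
implied_effective_batch = dataset_size / observed_steps_per_epoch
print(implied_effective_batch)    # 128.0

# 128 = 8 (per device) * 4 (GPUs) * 4, i.e. consistent with
# gradient_accumulation_steps = 4 (an assumption, not verified).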
  3. How can I use FinetuneTrainer.evaluate in a multi-GPU setting? Currently it only works on a single GPU. Most papers present an unlearning dynamics curve from periodic evaluation; in other words, I want to obtain evaluation metrics during training while running on multiple GPUs. The relevant gate in the code is:
if self.evaluator:
    # evaluation only happens on the local main process...
    if self.accelerator.is_local_main_process:
        eval_metrics = {}
        # ...and only when exactly one process is running, so
        # multi-GPU training skips evaluation entirely
        if self.accelerator.num_processes == 1:
            run_dir = self._get_output_dir(trial=trial)
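
For reference, here is a minimal sketch of the pattern I have in mind, built only on standard Accelerate calls; the function name, eval_dataloader, and batch layout are my own placeholders, not repository code. Every rank runs forward passes on its shard, gather_for_metrics collects per-sample losses across GPUs, and only rank 0 logs:

import torch

def distributed_evaluate(model, eval_dataloader, accelerator):
    # eval_dataloader must have gone through accelerator.prepare(),
    # so each rank sees a different shard of the eval set
    model.eval()
    losses = []
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        # repeat the (mean) batch loss once per sample, then gather across
        # ranks; gather_for_metrics drops the duplicate samples that the
        # distributed sampler adds to pad the last batch
        batch_size = batch["input_ids"].shape[0]
        losses.append(accelerator.gather_for_metrics(outputs.loss.repeat(batch_size)))
    eval_loss = torch.cat(losses).mean().item()
    if accelerator.is_main_process:
        print({"eval_loss": eval_loss})  # or log into the trainer state
    return eval_loss

The key point is that the evaluation loop itself runs on every process and only the logging is gated to the main process, which is the opposite of the gate in the snippet above.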
