Unlearning dynamics curve on multi-gpu #94

@sh-qiangchen

Description

Thank you very much for maintaining this repository; it is absolutely fantastic work. I would like to ask for help with the following three questions:

  1. Which checkpoint does open-unlearning/eval.py evaluate? trainer.save_model(output_dir) saves the last checkpoint, not the best model (see the sketch right after this item).
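
For context, here is a minimal, hypothetical TrainingArguments sketch of the distinction I mean; the values are illustrative, not the repository's defaults. With load_best_model_at_end=True, Trainer reloads the best checkpoint at the end of train(), so a subsequent trainer.save_model(output_dir) writes the best model rather than the last one:

from transformers import TrainingArguments

# Hypothetical config (illustrative values only): load_best_model_at_end
# requires matching eval/save cadences and a metric to rank checkpoints.
args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",            # "evaluation_strategy" on older versions
    eval_steps=50,
    save_strategy="steps",
    save_steps=50,
    load_best_model_at_end=True,      # reload best checkpoint after train()
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)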
  2. For the forget10 split, with per_device_train_batch_size=8 on 4 A800 GPUs, why did I get the trainer_state.json below? I expected 400 / 8 / 4 = 12.5 optimizer steps per epoch, which does not match the logged values (the arithmetic is spelled out after the JSON).
{
 "log_history": [
    {
      "epoch": 1.6,
      "grad_norm": 3.0877197396864764,
      "learning_rate": 8.620689655172414e-06,
      "loss": -0.0265,
      "step": 5
    },
    {
      "epoch": 3.2,
      "grad_norm": 31.033850214532777,
      "learning_rate": 6.896551724137932e-06,
      "loss": -0.7045,
      "step": 10
    },
    {
      "epoch": 4.8,
      "grad_norm": 60.4976325117357,
      "learning_rate": 5.172413793103449e-06,
      "loss": -3.8988,
      "step": 15
    },
    {
      "epoch": 6.4,
      "grad_norm": 98.01466701483473,
      "learning_rate": 3.448275862068966e-06,
      "loss": -10.4987,
      "step": 20
    },
    {
      "epoch": 8.0,
      "grad_norm": 812.0205372731976,
      "learning_rate": 1.724137931034483e-06,
      "loss": -24.3783,
      "step": 25
    },
    {
      "epoch": 9.6,
      "grad_norm": 609.3995261258575,
      "learning_rate": 0.0,
      "loss": -67.0836,
      "step": 30
    },
    {
      "epoch": 9.6,
      "step": 30,
      "total_flos": 0.0,
      "train_loss": -17.765076827506224,
      "train_runtime": 322.3017,
      "train_samples_per_second": 12.411,
      "train_steps_per_second": 0.093
    }
  ],
  "logging_steps": 5,
  "max_steps": 30,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 10
}
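
To make the mismatch concrete, here is the arithmetic in plain Python. The log shows epoch 1.6 after 5 steps, which implies an effective batch of 128 rather than the 32 I computed; a gradient_accumulation_steps of 4 would explain the gap, but that is only my guess and not something I have verified against the config:

# Sanity check of the step/epoch arithmetic (plain Python, no repo code).
dataset_size = 400            # forget10 split
per_device_batch = 8
num_gpus = 4

# What I expected: one optimizer step consumes 8 * 4 = 32 samples.
expected_steps_per_epoch = dataset_size / (per_device_batch * num_gpus)
print(expected_steps_per_epoch)   # 12.5

# What trainer_state.json implies: epoch 1.6 after 5 optimizer steps.
observed_steps_per_epoch = 5 / 1.6                     # 3.125
implied_effective_batch = dataset_size / observed_steps_per_epoch
print(implied_effective_batch)    # 128.0

# 128 = 8 (per device) * 4 (GPUs) * 4, i.e. consistent with
# gradient_accumulation_steps = 4 (an assumption, not verified).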
  3. How can I use FinetuneTrainer.evaluate in a multi-GPU setting? Currently it only works on a single GPU. Most papers present an unlearning dynamics curve from periodic evaluation; in other words, I want to obtain evaluation metrics during training while running on multiple GPUs. The relevant gate in the code is:
if self.evaluator:
    # evaluation only happens on the local main process...
    if self.accelerator.is_local_main_process:
        eval_metrics = {}
        # ...and only when exactly one process is running, so
        # multi-GPU training skips evaluation entirely
        if self.accelerator.num_processes == 1:
            run_dir = self._get_output_dir(trial=trial)
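
For reference, here is a minimal sketch of the pattern I have in mind, built only on standard Accelerate calls; the function name, eval_dataloader, and batch layout are my own placeholders, not repository code. Every rank runs forward passes on its shard, gather_for_metrics collects per-sample losses across GPUs, and only rank 0 logs:

import torch

def distributed_evaluate(model, eval_dataloader, accelerator):
    # eval_dataloader must have gone through accelerator.prepare(),
    # so each rank sees a different shard of the eval set
    model.eval()
    losses = []
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)
        # repeat the (mean) batch loss once per sample, then gather across
        # ranks; gather_for_metrics drops the duplicate samples that the
        # distributed sampler adds to pad the last batch
        batch_size = batch["input_ids"].shape[0]
        losses.append(accelerator.gather_for_metrics(outputs.loss.repeat(batch_size)))
    eval_loss = torch.cat(losses).mean().item()
    if accelerator.is_main_process:
        print({"eval_loss": eval_loss})  # or log into the trainer state
    return eval_loss

The key point is that the evaluation loop itself runs on every process and only the logging is gated to the main process, which is the opposite of the gate in the snippet above.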
