Description
Thank you very much for maintaining this repository; it is absolutely fantastic work.
I would like to ask for help with the following three questions:
- Which checkpoint is evaluated in open-unlearning/eval.py? trainer.save_model(output_dir) would save the last checkpoint rather than the best model.
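
To make this question concrete, here is a minimal sketch of my understanding of the standard Hugging Face Trainer behaviour; the output_dir path and the choice of eval_loss are placeholders of mine, not the repo's actual configuration:

from transformers import TrainingArguments

# My understanding: trainer.save_model(output_dir) writes whatever model is
# currently in memory, i.e. the last checkpoint, unless the run was configured
# to reload the best checkpoint at the end of training, e.g.:
training_args = TrainingArguments(
    output_dir="saves/unlearn/forget10",  # placeholder path, not from the repo
    evaluation_strategy="steps",          # needed so a "best" checkpoint is defined
    save_strategy="steps",
    load_best_model_at_end=True,          # reload the best checkpoint before save_model()
    metric_for_best_model="eval_loss",    # placeholder metric
    greater_is_better=False,
)

Is the intended behaviour here to evaluate the last checkpoint, or should something like the above be configured?
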
- For the forget10 split with per_device_train_batch_size=8 on 4 A800 GPUs, why did I get the trainer_state.json shown further below? By my arithmetic, 400 / 8 / 4 = 12.5 steps per epoch, which does not match the logged steps and epochs.
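
For reference, here is the arithmetic I did versus what the log below seems to imply; the gradient_accumulation_steps mentioned in the last comment is only a guess on my part, not something I knowingly set:

import math

num_forget_examples = 400              # size of the forget10 split, as I understand it
per_device_train_batch_size = 8
num_gpus = 4

# What I expected: 400 / (8 * 4) = 12.5, i.e. 13 optimizer steps per epoch.
expected_steps_per_epoch = math.ceil(
    num_forget_examples / (per_device_train_batch_size * num_gpus)
)
print(expected_steps_per_epoch)        # 13

# What the log implies: step 5 is logged at epoch 1.6, i.e. 5 / 1.6 = 3.125
# steps per epoch, and 30 steps cover ~9.6 epochs.
implied_steps_per_epoch = 5 / 1.6      # 3.125
print(num_forget_examples / implied_steps_per_epoch)  # 128.0 examples per step,
# which would match e.g. gradient_accumulation_steps = 4 on top of 8 * 4 = 32,
# but I have not set that knowingly, hence the question.

The trainer_state.json in question:
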
{
  "log_history": [
    {
      "epoch": 1.6,
      "grad_norm": 3.0877197396864764,
      "learning_rate": 8.620689655172414e-06,
      "loss": -0.0265,
      "step": 5
    },
    {
      "epoch": 3.2,
      "grad_norm": 31.033850214532777,
      "learning_rate": 6.896551724137932e-06,
      "loss": -0.7045,
      "step": 10
    },
    {
      "epoch": 4.8,
      "grad_norm": 60.4976325117357,
      "learning_rate": 5.172413793103449e-06,
      "loss": -3.8988,
      "step": 15
    },
    {
      "epoch": 6.4,
      "grad_norm": 98.01466701483473,
      "learning_rate": 3.448275862068966e-06,
      "loss": -10.4987,
      "step": 20
    },
    {
      "epoch": 8.0,
      "grad_norm": 812.0205372731976,
      "learning_rate": 1.724137931034483e-06,
      "loss": -24.3783,
      "step": 25
    },
    {
      "epoch": 9.6,
      "grad_norm": 609.3995261258575,
      "learning_rate": 0.0,
      "loss": -67.0836,
      "step": 30
    },
    {
      "epoch": 9.6,
      "step": 30,
      "total_flos": 0.0,
      "train_loss": -17.765076827506224,
      "train_runtime": 322.3017,
      "train_samples_per_second": 12.411,
      "train_steps_per_second": 0.093
    }
  ],
  "logging_steps": 5,
  "max_steps": 30,
  "num_input_tokens_seen": 0,
  "num_train_epochs": 10
}

- How can I use FinetuneTrainer.evaluate in a multi-GPU setting? It currently only works on a single GPU. I see that most papers present the unlearning-dynamics curve from evaluation; in other words, I would like to obtain evaluation metrics during the training process while using multiple GPUs. The relevant part of the trainer seems to be:

if self.evaluator:
    if self.accelerator.is_local_main_process:
        eval_metrics = {}
        if self.accelerator.num_processes == 1:
            run_dir = self._get_output_dir(trial=trial)
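
What I would like to achieve is roughly the following; this is only a sketch of the idea, not the repository's actual API, and MainProcessEvalCallback, eval_every_n_steps, and the evaluator.evaluate(model=...) call are placeholders of mine:

from transformers import TrainerCallback

class MainProcessEvalCallback(TrainerCallback):
    """Hypothetical callback: evaluate on rank 0 every N steps while the other ranks wait."""

    def __init__(self, evaluator, accelerator, eval_every_n_steps=5):
        self.evaluator = evaluator          # placeholder for the repo's evaluator object
        self.accelerator = accelerator
        self.eval_every_n_steps = eval_every_n_steps

    def on_step_end(self, args, state, control, model=None, **kwargs):
        if state.global_step % self.eval_every_n_steps != 0:
            return
        if self.accelerator.is_local_main_process:
            # placeholder call; I do not know the evaluator's real signature
            metrics = self.evaluator.evaluate(model=model)
            print(f"step {state.global_step}: {metrics}")
        # keep the other ranks in sync so training resumes together on all GPUs
        self.accelerator.wait_for_everyone()

Is something along these lines feasible with the current FinetuneTrainer, or is there a recommended way to collect evaluation metrics during multi-GPU training?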