
Some problems while fine-tuning on Paraphrase Dataset #6

Closed
Ricardokevins opened this issue Dec 3, 2021 · 9 comments

Comments

@Ricardokevins

Sorry to trouble you, but I am interested in the paraphrase fine-tuning part.
I used the concise code you provide in train/bart.py.
However, when I run the command "python bart.py", I encounter an OOM error (on a single V100-32GB).
I reread your paper and noticed that you even use batch size 20 and fine-tune for one epoch within an hour (impressive speed~).

I also checked train/args.py in this repo: the batch size is 8 and the max epoch is 3, which is not consistent with the setting in the paper (batch size = 20, epochs = 1).

Could you release the settings/code you used for fine-tuning? I want to find out what causes the OOM (maybe the larger max_length, or the larger batch size?). Currently I have cut TrainBatchSize to 4 (and EvalBatchSize to 2), and the model fine-tunes very, very slowly...

Or could you give me some advice on reproducing? I am not sure whether a lower batch size or a shorter sequence length can reach the same precision as the paper's results.

Thank you very much; any suggestions are greatly appreciated.
Sorry to trouble you QAQ

@yyy-Apple
Collaborator

Hi~
Thanks for your interest.

Actually, we didn't use the training script we provide here, since we only have 4 small GPUs (11G each). So we used model parallelism to shard the model across 4 GPUs during training. I think one V100-32G is enough to train BART with a small micro batch size (maybe 1 or 2); you can set the gradient accumulation steps to adjust the effective batch size (= gradient accumulation steps * micro batch size).

For reproducing, you should mostly follow the settings in our paper. Also, we only fine-tuned on a subset of ParaBank, which contains 30,000 examples.
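
For reference, a minimal sketch of the gradient-accumulation idea described above, assuming the Hugging Face BartForConditionalGeneration API; the checkpoint name, learning rate, and train_loader are placeholders, not the repo's actual settings:

import torch
from transformers import BartForConditionalGeneration, BartTokenizer

model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')  # placeholder base checkpoint
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

accum_steps = 10  # 10 accumulation steps * micro batch of 2 = effective batch size 20
model.train()
for step, (src_texts, tgt_texts) in enumerate(train_loader):  # train_loader: assumed iterator over paraphrase pairs
    batch = tokenizer(src_texts, padding=True, truncation=True, return_tensors='pt')
    labels = tokenizer(tgt_texts, padding=True, truncation=True, return_tensors='pt').input_ids
    labels[labels == tokenizer.pad_token_id] = -100   # ignore padding positions in the loss
    loss = model(**batch, labels=labels).loss
    (loss / accum_steps).backward()                   # scale so gradients average over the effective batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()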

@Ricardokevins
Author

Thank you for your reply~
I will try more settings to get the result :D

@ZacharyChenpk

Hello!
I'm interested in the fine-tuning part of your work too. Where can I get the data files (they seem to be data/parabank2.json and data/eval.json) to reproduce the results in the paper?

@Ricardokevins
Author

Hi~ after such a long time, did you successfully reproduce the results in the paper?

@ZacharyChenpk

I have reproduced the evaluation results of the released trained model, as analysis.ipynb does. But I cannot get access to the training data files, let alone reproduce the training process :-(

@Ricardokevins
Author

Ricardokevins commented Feb 22, 2022

Same situation as you.
Did you try fine-tuning the model on your own data, and did you observe an improved result?

@yyy-Apple
Collaborator

Sorry for not paying attention to this closed issue. We have added our training script inside the train folder, as well as instructions for preparing the data.

@Ricardokevins
Author

Ricardokevins commented Mar 3, 2022

Thanks a lot!
I tried the code and training scripts.
I noticed that, because of the limited GPUs, you define a ShardedBART and place the model on different GPUs.
That leads to an error in load_state_dict in bart_score.py
(the normal loading path loads a BartForConditionalGeneration, while the fine-tuning script saves the whole ShardedBART).

I tried to fix this error with the following code,
but I encounter another error: "decode_attention_mask is None".

# In bart_score.py: load the ShardedBART wrapper instead of the plain BartForConditionalGeneration.
import torch
from bart_utils import ShardedBART

self.bart = ShardedBART(self.checkpoint)
self.bart.load_state_dict(torch.load('xxxxxxxxx/bart_3000.pth', map_location=self.device))
self.model = self.bart

Is there a convenient and feasible way to use the fine-tuned model?
(The fine-tuning script runs smoothly, and thanks for open-sourcing it!)

@Ricardokevins Ricardokevins reopened this Mar 3, 2022
@Ricardokevins
Author

Ricardokevins commented Mar 3, 2022

Well, I think you should probably modify save_model in bart.py.

I modified the code and solved the problem.

Previous

def save_model(self, path):
    torch.save(self.bart.state_dict(), path)
    print(f'Model saved in {path}.')

def load_model(self, path):
    self.bart.load_state_dict(torch.load(path, map_location=self.device))
    print(f'Model {path} loaded.')

After

# Save and load only the inner BART model, so the checkpoint matches what
# the standard loading path in bart_score.py expects.
def save_model(self, path):
    torch.save(self.bart.model.state_dict(), path)
    print(f'Model saved in {path}.')

def load_model(self, path):
    self.bart.model.load_state_dict(torch.load(path, map_location=self.device))
    print(f'Model {path} loaded.')
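
A minimal sketch of loading a checkpoint re-saved this way back into a plain BartForConditionalGeneration, assuming the inner model attribute is the standard Hugging Face model; the base checkpoint name and file path are placeholders:

import torch
from transformers import BartForConditionalGeneration

base_checkpoint = 'facebook/bart-large-cnn'  # placeholder: the base checkpoint fine-tuning started from
model = BartForConditionalGeneration.from_pretrained(base_checkpoint)
state_dict = torch.load('path/to/bart_3000.pth', map_location='cpu')  # placeholder path to the re-saved weights
model.load_state_dict(state_dict)
model.eval()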
