lm_fine-tuning on small dataset of 3 documents #1907

Closed
vr25 opened this issue Nov 21, 2019 · 3 comments


vr25 commented Nov 21, 2019

❓ Questions & Help

Hi,

I am trying to use run_lm_finetuning.py on a sample dataset here.

I am running the script with the following arguments, but the pytorch_model.bin [440.5 MB] saved in output_dir=op appears to be identical to the original one:
python run_lm_finetuning.py --train_data_file=sample_text.txt --output_dir=op --mlm --do_train --overwrite_output_dir --do_lower_case --save_steps=50

I was wondering if this dataset of 3 documents is too small to fine-tune on or if I can modify some arguments to get a domain-fine-tuned model.

Thanks!

Contributor

iedmrc commented Nov 22, 2019

How do you know the pytorch_model.bin files are exactly identical? Did you just compare file sizes? If so, that is not a reliable check: the weights are floating-point numbers, each stored in a fixed number of bytes, so the file takes (almost) the same space on disk whatever values they hold. Compare the hashes of the files to make sure.
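A minimal sketch of why equal sizes prove nothing, assuming weights are stored as 32-bit floats. The values and the `pack_weights` helper are hypothetical and only illustrate the point; they are not how the script actually serializes a model.

```python
import hashlib
import struct


def pack_weights(values):
    """Serialize floats as 32-bit little-endian values (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)


# Hypothetical weights before and after a training update.
before = pack_weights([0.10, -0.25, 0.75])
after = pack_weights([0.11, -0.24, 0.74])

# Same number of floats, so the byte length is the same...
same_size = len(before) == len(after)
# ...but the contents differ, so the hashes differ.
same_hash = hashlib.sha256(before).digest() == hashlib.sha256(after).digest()

print(same_size, same_hash)  # True False
```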

Author

vr25 commented Nov 22, 2019

Yes, I naively compared the files by their sizes only.

I see. Comparing hashes sounds like a much better way to compare files, thanks. I'll post here if that works.

Also, do you have any beginner suggestions on generating the hashes quickly and efficiently?

Author

vr25 commented Nov 22, 2019

I used md5sum pytorch_model.bin to generate the hash of each file, and the two hashes are different. Anyway, thanks again!
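For anyone who prefers to do the same check from Python, here is a rough equivalent of md5sum, offered as a sketch rather than a definitive implementation. It reads the file in chunks so a 440 MB checkpoint is never loaded into memory at once. The function names and the 1 MiB chunk size are my own choices.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_identical(path_a, path_b):
    """True if both files have the same MD5 digest."""
    return file_md5(path_a) == file_md5(path_b)
```

Calling `files_identical("pretrained/pytorch_model.bin", "op/pytorch_model.bin")` would then return False if fine-tuning changed the weights. The first path is a hypothetical location for the original checkpoint; only `op` comes from the command above.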

vr25 closed this as completed Nov 22, 2019

2 participants