Fix rag finetuning + add finetuning test #8585
Conversation
Force-pushed from f9ffc27 to e480871
Hi, I tried to execute finetune.py on two GPUs. It mainly fails with the following error, but when I run with a single GPU it works. I have also attached a screenshot. RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [130.216.209.142]:55728
What command did you run exactly?
Does changing the port with
It says,
I tried with
What's your pytorch-lightning version?
Version: 1.0.4
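For the gloo "Connection closed by peer" error above, one common culprit is a stale or conflicting rendezvous port when several distributed jobs share a machine. A minimal sketch of grabbing a free port for `MASTER_PORT` before launching (stdlib only; `MASTER_PORT` is torch.distributed's convention, the helper name here is illustrative):

```python
import os
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port to use as MASTER_PORT."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))  # port 0 lets the OS pick any free port
        return s.getsockname()[1]

port = find_free_port()
os.environ["MASTER_PORT"] = str(port)
print(port)
```

Setting the variable before the script spawns its worker processes means every rank inherits the same, known-free port.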
@lhoestq
Hi, just wanted to know: did you manage to run the finetune.sh script without any errors?
Ok I fixed the tensor issue and updated the readme. I also had to rename some of the example files of RAG to avoid collisions with the files of the seq2seq examples. The name collision broke the CI tests with failed imports. I did:

examples/rag/utils.py -> examples/rag/utils_rag.py
examples/rag/callbacks.py -> examples/rag/callbacks_rag.py
examples/rag/finetune.py -> examples/rag/finetune_rag.py
examples/rag/finetune.sh -> examples/rag/finetune_rag.sh

All tests are green now :)

Thanks a lot for your quick response.
@@ -38,7 +38,7 @@ def get_checkpoint_callback(output_dir, metric):
     monitor=f"val_{metric}",
     mode="max",
     save_top_k=3,
-    period=0,  # maybe save a checkpoint every time val is run, not just end of epoch.
+    period=1,  # maybe save a checkpoint every time val is run, not just end of epoch.
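For context: in the pytorch-lightning versions of that era, `ModelCheckpoint`'s `period` argument was the minimum number of epochs between checkpoint saves, so `period=1` saves at most once per epoch. A toy model of that gate (illustrative only, not Lightning's actual implementation):

```python
def save_epochs(period: int, total_epochs: int) -> list:
    """Return the epochs at which a checkpoint would be saved, given a
    minimum gap of `period` epochs between saves (toy model of the gate)."""
    saved = []
    last_saved = None
    for epoch in range(total_epochs):
        if last_saved is None or epoch - last_saved >= period:
            saved.append(epoch)
            last_saved = epoch
    return saved

print(save_epochs(1, 4))  # saves every epoch: [0, 1, 2, 3]
print(save_epochs(2, 6))  # saves every other epoch: [0, 2, 4]
```

With `period=0` the gap check never blocks a save, which is why the comment in the diff talks about saving on every validation run rather than only at the end of an epoch.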
why go from 0 to 1?
Oh, I changed that to speed up the test and forgot to remove it. I can just modify the validation frequency.
Awesome test! Just two things I'd rather avoid if we could:
- Adding dummy data files to master
- Forcing return_dict=True
I took your comment into account @patrickvonplaten
@lhoestq hello, thank you for this amazing feature. When I try to create my custom dataset I'm receiving this error:
I'm using Google Colab to test this - https://colab.research.google.com/drive/1Cjj18rYmeS0Bueis_KPB5Wbybl-JNDLL?usp=sharing
Well, I didn't install the specific dependencies you defined, excuse me. Solved by running: !pip install -r /transformers/examples/rag/requirements.txt
At least it is recorded here if someone has the same problem. haha
Following #7715 we need more test coverage of the RAG example scripts.
In this PR I'm adding a test for the finetuning script.
The test includes a single gpu test and a multi gpu test. Both are passing.
As mentioned in #7816 and #8345 there were some errors in the script that I had to fix.
Moreover, since @amogkam has been working on the finetuning script as well to integrate Ray, I made sure to reduce possible conflicts with his PR #8583. More precisely, I'm reusing the CustomAccel class, which allows initializing either the pytorch distributed retrieval or the ray distributed retrieval.
Also fixed a bug in RAG forward pass (see #8665 )
Fix #7816
Fix #8345
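The backend-selection idea described above can be sketched roughly like this. Only the CustomAccel name and the two retrieval backends come from the discussion; all class bodies and method names here are hypothetical stand-ins, not the actual PR code:

```python
class PytorchDistributedRetriever:
    """Stand-in for a retriever that routes index queries over torch.distributed."""
    def init_retrieval(self):
        return "pytorch"

class RayDistributedRetriever:
    """Stand-in for a Ray-based distributed retriever."""
    def init_retrieval(self):
        return "ray"

class CustomAccel:
    """Accelerator hook: both backends expose the same init_retrieval
    entry point, so the finetuning script stays backend-agnostic."""
    def __init__(self, retriever):
        self.retriever = retriever

    def setup(self):
        # Backend-specific setup is hidden behind a common interface.
        return self.retriever.init_retrieval()

backend = CustomAccel(RayDistributedRetriever()).setup()
```

The point of the shared entry point is that the Ray PR (#8583) can swap in its retriever without touching the rest of the training loop.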