
Fix rag finetuning + add finetuning test #8585

Merged
merged 30 commits into from
Nov 20, 2020

Conversation

lhoestq
Member

@lhoestq commented Nov 17, 2020

Following #7715 we need more test coverage of the RAG example scripts.
In this PR I'm adding a test for the finetuning script.
The test includes a single gpu test and a multi gpu test. Both are passing.

As mentioned in #7816 and #8345 there were some errors in the script that I had to fix.

Moreover, since @amogkam has been working on the finetuning script as well to integrate Ray, I made sure to reduce possible conflicts with his PR #8583. More precisely, I'm reusing the CustomAccel class, which allows initializing either the PyTorch distributed retrieval or the Ray distributed retrieval.

Also fixed a bug in the RAG forward pass (see #8665).

Fix #7816
Fix #8345

@shamanez
Copy link
Contributor

@lhoestq

Hi, I tried to execute finetune.py on two GPUs. It fails with the following error, but it works when I run it on a single GPU. I have also attached a screenshot.

```
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [130.216.209.142]:55728
```

[screenshots attached: Selection_006, Selection_007]

@lhoestq
Member Author

lhoestq commented Nov 19, 2020

What command did you run, exactly?

@shamanez
Contributor

What command did you run, exactly?

```shell
python examples/rag/finetune.py \
  --data_dir ./examples/rag/test_data/dummy_seq2seq \
  --output_dir ./examples/rag/outputs \
  --model_name_or_path facebook/rag-token-base \
  --model_type rag_sequence \
  --do_train --do_predict \
  --n_val -1 \
  --val_check_interval 0.25 \
  --train_batch_size 1 --eval_batch_size 1 \
  --max_source_length 128 --max_target_length 25 \
  --val_max_target_length 25 --test_max_target_length 25 \
  --label_smoothing 0.1 --dropout 0.1 --attention_dropout 0.1 \
  --weight_decay 0.001 --adam_epsilon 1e-08 --max_grad_norm 0.1 \
  --lr_scheduler polynomial --learning_rate 3e-05 \
  --num_train_epochs 100 --warmup_steps 500 \
  --gradient_accumulation_steps 1 \
  --index_name custom \
  --passages_path ./examples/rag/data/my_knowledge_dataset \
  --index_path ./examples/rag/data/my_knowledge_dataset_hnsw_index.faiss \
  --gpus 2
```

@lhoestq
Member Author

lhoestq commented Nov 19, 2020

Does changing the port with --distributed_port 8888 help in your case?
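For anyone hitting the same gloo "Connection closed by peer" error: the rendezvous port can also be pinned through the standard torch.distributed environment variables before launching. A minimal sketch (pure stdlib; the address and port values are illustrative, and a launcher may already have set them, in which case `setdefault` keeps the existing values):

```python
import os

# torch.distributed workers rendezvous via MASTER_ADDR / MASTER_PORT.
# Pinning them to a known-free port can avoid connection errors caused
# by port clashes. The values below are illustrative only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "8888")

print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
```

These variables must be set before the distributed process group is initialized, i.e. before the training script spawns its workers.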

@shamanez
Contributor

shamanez commented Nov 19, 2020

It says:

```
finetune.py: error: unrecognized arguments: --distributed-port 8888
```

@shamanez
Contributor

I tried with --distributed-port 8888; it still gives the same error.
BTW, my torch version is 1.7.0+cu110.

@lhoestq
Member Author

lhoestq commented Nov 19, 2020

What's your pytorch-lightning version?
(Also, sorry, I misspelled distributed-port.)

@shamanez
Contributor

pytorch-lightning version: 1.0.4

@shamanez
Contributor

shamanez commented Nov 19, 2020 via email

@lhoestq
Member Author

lhoestq commented Nov 20, 2020

OK, I fixed the tensor issue and updated the README.

I also had to rename some of the RAG example files to avoid collisions with the files of the seq2seq examples. The name collision broke the CI tests with failed imports.

I did:

examples/rag/utils.py -> examples/rag/utils_rag.py
examples/rag/callbacks.py -> examples/rag/callbacks_rag.py
examples/rag/finetune.py -> examples/rag/finetune_rag.py
examples/rag/finetune.sh -> examples/rag/finetune_rag.sh

All tests are green now :)

@shamanez
Contributor

shamanez commented Nov 20, 2020 via email

```diff
@@ -38,7 +38,7 @@ def get_checkpoint_callback(output_dir, metric):
     monitor=f"val_{metric}",
     mode="max",
     save_top_k=3,
-    period=0,  # maybe save a checkpoint every time val is run, not just end of epoch.
+    period=1,  # maybe save a checkpoint every time val is run, not just end of epoch.
```
Contributor

why go from 0 to 1?

Member Author

Oh, I changed that to speed up the test and forgot to remove it. I can just modify the validation frequency instead.
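For context, the `period` argument roughly gates how often a checkpoint may be written. A hypothetical simplification of that gate (not pytorch-lightning's actual code): `period=0` allows a save on every validation run, while `period=N` allows at most one save every N epochs.

```python
def should_save_checkpoint(epochs_since_last_save: int, period: int) -> bool:
    """Hypothetical sketch of a `period`-style checkpoint gate.

    period=0: every validation run may save (the condition is always met).
    period=N: save only once at least N epochs have passed since the last save.
    """
    return epochs_since_last_save >= period

# period=0: saving is allowed even when no full epoch has elapsed
print(should_save_checkpoint(0, 0))  # True
# period=1: saving waits until a full epoch has passed
print(should_save_checkpoint(0, 1))  # False
print(should_save_checkpoint(1, 1))  # True
```

This is why `period=0` sped up the test (checkpoints on every validation pass) and why `period=1` is the expected default behavior for normal training.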

@patrickvonplaten
Contributor

Awesome test! Just two things I'd rather avoid if we could:

  1. Adding dummy data files to master
  2. Forcing return_dict=True

@lhoestq
Member Author

lhoestq commented Nov 20, 2020

I took your comments into account, @patrickvonplaten.
The only thing I didn't change is return_dict=True: I kept it to avoid playing with tuple indices.

@marcoaleixo

@lhoestq Hello, thank you for this amazing feature.

When I try to create my custom dataset, I receive this error:

```
2020-12-16 00:48:44.645715: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
INFO:__main__:Step 1 - Create the dataset
Using custom data configuration default
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-d44cf86c96b535d8/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-d44cf86c96b535d8/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-ad363af188e673b0.arrow
100% 1/1 [00:00<00:00, 10.92ba/s]
INFO:__main__:Step 2 - Index the dataset
Traceback (most recent call last):
  File "examples/rag/use_own_knowledge_dataset.py", line 200, in <module>
    main(rag_example_args, processing_args, index_hnsw_args)
  File "examples/rag/use_own_knowledge_dataset.py", line 102, in main
    index = faiss.IndexHNSWFlat(index_hnsw_args.d, index_hnsw_args.m, faiss.METRIC_INNER_PRODUCT)
  File "/usr/local/lib/python3.6/dist-packages/faiss/swigfaiss.py", line 3746, in __init__
    this = _swigfaiss.new_IndexHNSWFlat(*args)
NotImplementedError: Wrong number or type of arguments for overloaded function 'new_IndexHNSWFlat'.
  Possible C/C++ prototypes are:
    faiss::IndexHNSWFlat::IndexHNSWFlat()
    faiss::IndexHNSWFlat::IndexHNSWFlat(int,int)
```

I'm using Google Colab to test this - https://colab.research.google.com/drive/1Cjj18rYmeS0Bueis_KPB5Wbybl-JNDLL?usp=sharing

@marcoaleixo

Well, I hadn't installed the specific dependencies you defined. Excuse me.

Solved by running: !pip install -r /transformers/examples/rag/requirements.txt

At least it is recorded here in case someone has the same problem. haha

Successfully merging this pull request may close these issues.

Error in RAG finetuning script
RAG - MissingIndex: Index with index_name 'embeddings' not initialized yet