
Fix rag finetuning + add finetuning test #8585

Merged
merged 30 commits into from
Nov 20, 2020

Conversation

lhoestq
Member

@lhoestq commented Nov 17, 2020

Following #7715 we need more test coverage of the RAG example scripts.
In this PR I'm adding a test for the finetuning script.
The test includes a single gpu test and a multi gpu test. Both are passing.

As mentioned in #7816 and #8345 there were some errors in the script that I had to fix.

Moreover, since @amogkam has been working on the finetuning script as well to integrate Ray, I made sure to reduce possible conflicts with his PR #8583. More precisely, I'm reusing the CustomAccel class, which allows initializing either the PyTorch distributed retrieval or the Ray distributed retrieval.

Also fixed a bug in the RAG forward pass (see #8665).

Fix #7816
Fix #8345

@shamanez
Copy link
Contributor

@lhoestq

Hi, I tried to execute finetune.py on two GPUs. It fails with the following error, but it works when I run it on a single GPU. I have also attached a screenshot.

```
RuntimeError: [/pytorch/third_party/gloo/gloo/transport/tcp/pair.cc:575] Connection closed by peer [130.216.209.142]:55728
```

[screenshots attached: Selection_006, Selection_007]

@lhoestq
Member Author

lhoestq commented Nov 19, 2020

What command did you run, exactly?

@shamanez
Contributor

What command did you run, exactly?

```shell
python examples/rag/finetune.py \
  --data_dir ./examples/rag/test_data/dummy_seq2seq \
  --output_dir ./examples/rag/outputs \
  --model_name_or_path facebook/rag-token-base \
  --model_type rag_sequence \
  --do_train --do_predict \
  --n_val -1 \
  --val_check_interval 0.25 \
  --train_batch_size 1 --eval_batch_size 1 \
  --max_source_length 128 --max_target_length 25 \
  --val_max_target_length 25 --test_max_target_length 25 \
  --label_smoothing 0.1 --dropout 0.1 --attention_dropout 0.1 \
  --weight_decay 0.001 --adam_epsilon 1e-08 --max_grad_norm 0.1 \
  --lr_scheduler polynomial --learning_rate 3e-05 \
  --num_train_epochs 100 --warmup_steps 500 \
  --gradient_accumulation_steps 1 \
  --index_name custom \
  --passages_path ./examples/rag/data/my_knowledge_dataset \
  --index_path ./examples/rag/data/my_knowledge_dataset_hnsw_index.faiss \
  --gpus 2
```

@lhoestq
Member Author

lhoestq commented Nov 19, 2020

Does changing the port with --distributed_port 8888 help in your case?
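For anyone hitting the same gloo "Connection closed by peer" error: the rendezvous port can also be pinned through the standard torch.distributed environment variables before launching. A minimal sketch (pure stdlib; the address and port values are illustrative, and a launcher may already have set them, in which case `setdefault` keeps the existing values):

```python
import os

# torch.distributed workers rendezvous via MASTER_ADDR / MASTER_PORT.
# Pinning them to a known-free port can avoid connection errors caused
# by port clashes. The values below are illustrative only.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "8888")

print(os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"])
```

These variables must be set before the distributed process group is initialized, i.e. before the training script spawns its workers.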

@shamanez
Contributor

shamanez commented Nov 19, 2020

It says:

```
finetune.py: error: unrecognized arguments: --distributed-port 8888
```

@shamanez
Contributor

I tried with --distributed-port 8888; it still gives the same error.
BTW, my torch version is 1.7.0+cu110.

@lhoestq
Member Author

lhoestq commented Nov 19, 2020

What's your pytorch-lightning version?
(Also, sorry, I misspelled distributed-port.)

@shamanez
Contributor

pytorch-lightning version: 1.0.4

@shamanez
Contributor

shamanez commented Nov 19, 2020 via email

@lhoestq
Member Author

lhoestq commented Nov 20, 2020

OK, I fixed the tensor issue and updated the README.

I also had to rename some of the RAG example files to avoid collisions with the files of the seq2seq examples. The name collision broke the CI tests with failed imports.

I did:

examples/rag/utils.py -> examples/rag/utils_rag.py
examples/rag/callbacks.py -> examples/rag/callbacks_rag.py
examples/rag/finetune.py -> examples/rag/finetune_rag.py
examples/rag/finetune.sh -> examples/rag/finetune_rag.sh

All tests are green now :)

@shamanez
Contributor

shamanez commented Nov 20, 2020 via email

```diff
@@ -38,7 +38,7 @@ def get_checkpoint_callback(output_dir, metric):
     monitor=f"val_{metric}",
     mode="max",
     save_top_k=3,
-    period=0,  # maybe save a checkpoint every time val is run, not just end of epoch.
+    period=1,  # maybe save a checkpoint every time val is run, not just end of epoch.
```
Contributor

why go from 0 to 1?

Member Author

Oh, I changed that to speed up the test and forgot to remove it. I can just modify the validation frequency instead.
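For context, the `period` argument roughly gates how often a checkpoint may be written. A hypothetical simplification of that gate (not pytorch-lightning's actual code): `period=0` allows a save on every validation run, while `period=N` allows at most one save every N epochs.

```python
def should_save_checkpoint(epochs_since_last_save: int, period: int) -> bool:
    """Hypothetical sketch of a `period`-style checkpoint gate.

    period=0: every validation run may save (the condition is always met).
    period=N: save only once at least N epochs have passed since the last save.
    """
    return epochs_since_last_save >= period

# period=0: saving is allowed even when no full epoch has elapsed
print(should_save_checkpoint(0, 0))  # True
# period=1: saving waits until a full epoch has passed
print(should_save_checkpoint(0, 1))  # False
print(should_save_checkpoint(1, 1))  # True
```

This is why `period=0` sped up the test (checkpoints on every validation pass) and why `period=1` is the expected default behavior for normal training.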

@patrickvonplaten
Contributor

Awesome test! Just two things I'd rather avoid if we could:

  1. Adding dummy data files to master
  2. Forcing return_dict=True

@lhoestq
Member Author

lhoestq commented Nov 20, 2020

I took your comments into account, @patrickvonplaten.
The only thing I didn't change is return_dict=True: I kept it to avoid playing with tuple indices.

@marcoaleixo

@lhoestq Hello, thank you for this amazing feature.

When I try to create my custom dataset, I receive this error:

```
2020-12-16 00:48:44.645715: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
INFO:__main__:Step 1 - Create the dataset
Using custom data configuration default
Reusing dataset csv (/root/.cache/huggingface/datasets/csv/default-d44cf86c96b535d8/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2)
Loading cached processed dataset at /root/.cache/huggingface/datasets/csv/default-d44cf86c96b535d8/0.0.0/2960f95a26e85d40ca41a230ac88787f715ee3003edaacb8b1f0891e9f04dda2/cache-ad363af188e673b0.arrow
100% 1/1 [00:00<00:00, 10.92ba/s]
INFO:__main__:Step 2 - Index the dataset
Traceback (most recent call last):
  File "examples/rag/use_own_knowledge_dataset.py", line 200, in <module>
    main(rag_example_args, processing_args, index_hnsw_args)
  File "examples/rag/use_own_knowledge_dataset.py", line 102, in main
    index = faiss.IndexHNSWFlat(index_hnsw_args.d, index_hnsw_args.m, faiss.METRIC_INNER_PRODUCT)
  File "/usr/local/lib/python3.6/dist-packages/faiss/swigfaiss.py", line 3746, in __init__
    this = _swigfaiss.new_IndexHNSWFlat(*args)
NotImplementedError: Wrong number or type of arguments for overloaded function 'new_IndexHNSWFlat'.
  Possible C/C++ prototypes are:
    faiss::IndexHNSWFlat::IndexHNSWFlat()
    faiss::IndexHNSWFlat::IndexHNSWFlat(int,int)
```

I'm using Google Colab to test this - https://colab.research.google.com/drive/1Cjj18rYmeS0Bueis_KPB5Wbybl-JNDLL?usp=sharing

@marcoaleixo

Well, I hadn't installed the specific dependencies you defined. Excuse me.

Solved by running: !pip install -r /transformers/examples/rag/requirements.txt

At least it is recorded here in case someone has the same problem. haha

Successfully merging this pull request may close these issues.

Error in RAG finetuning script
RAG - MissingIndex: Index with index_name 'embeddings' not initialized yet