[RAG] Propagating n_docs as parameter to all RagModel's related functions #7891

lalitpagaria · 2020-10-18T22:14:57Z

What does this PR do?

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline,
Pull Request section?
Was this discussed/approved via a Github issue or the forum? Please add a link
to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors which may be interested in your PR.

@patrickvonplaten


src/transformers/modeling_rag.py

patrickvonplaten

Besides a small suggestion for the docstrings this PR looks great! Thanks a lot @lalitpagaria !

patrickvonplaten · 2020-10-19T07:04:54Z

@lhoestq would be great if you can review as well


lalitpagaria · 2020-10-19T07:48:32Z

@patrickvonplaten Thanks for the review.

While working on this PR, I found that RagTokenForGeneration computes batch_size as follows:

batch_size = context_input_ids.shape[0] // n_docs

So an issue can still occur when context_input_ids.shape[0] % n_docs != 0, but I can't think of a way to address it.

lhoestq

Thanks, looks good to me :)

src/transformers/modeling_rag.py

lhoestq · 2020-10-19T07:57:39Z


The first dimension of context_input_ids is always supposed to equal n_docs times the number of input questions.
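To make the shape contract concrete, here is a minimal sketch of what the floor division does when the contract holds and when it does not. The helper name `split_rows` is hypothetical and plain integers stand in for tensor shapes; it is not code from modeling_rag.py.

```python
from typing import Tuple


def split_rows(num_context_rows: int, n_docs: int) -> Tuple[int, int]:
    """Hypothetical helper: return (batch_size, leftover_rows)."""
    # Same floor division as `context_input_ids.shape[0] // n_docs`.
    batch_size = num_context_rows // n_docs
    # Rows that do not fit into a complete group of n_docs documents.
    leftover = num_context_rows % n_docs
    return batch_size, leftover


# One question, retriever configured with n_docs=3 -> 3 context rows.
rows = 1 * 3

print(split_rows(rows, n_docs=3))  # generator uses the same n_docs
print(split_rows(rows, n_docs=2))  # generator assumes n_docs=2
```

With matching values there are no leftover rows. With the mismatched value, the division still yields a batch size of 1, and the one leftover row is not reported anywhere.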

lalitpagaria · 2020-10-19T08:01:58Z


It would be better to state this explicitly with an assert. WDYT?
In one of my test cases I used n_docs=3 for the retriever and n_docs=2 for the generator, and it failed.

lhoestq · 2020-10-19T08:07:43Z


Yes indeed. Also, if context_input_ids.shape[0] % n_docs != 0, we should raise an error; otherwise some retrieved documents will be ignored for generation.
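A check along these lines could look roughly like the sketch below. The function name `checked_batch_size` and the error message are assumptions made for illustration, and the sketch raises ValueError where the PR itself talks about assert statements.

```python
def checked_batch_size(num_context_rows: int, n_docs: int) -> int:
    """Hypothetical helper: derive the batch size or fail loudly."""
    if n_docs <= 0:
        raise ValueError(f"n_docs must be positive, got {n_docs}")
    if num_context_rows % n_docs != 0:
        # Without this check, the remainder rows would be dropped silently.
        raise ValueError(
            f"{num_context_rows} context rows cannot be split into groups of "
            f"n_docs={n_docs}; retriever and generator must use the same n_docs."
        )
    return num_context_rows // n_docs


# Two questions with three documents each: accepted.
print(checked_batch_size(6, n_docs=3))

# Retriever used n_docs=3 for one question, generator assumes n_docs=2: rejected.
try:
    checked_batch_size(3, n_docs=2)
except ValueError as err:
    print(f"rejected: {err}")
```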

patrickvonplaten · 2020-10-19T08:24:00Z

Yes @lalitpagaria - it would be nice if you could add an assert statement verifying that n_docs is correctly set. n_docs should be the same for both retriever and generator.


lalitpagaria · 2020-10-19T08:44:59Z

@patrickvonplaten @lhoestq I added asserts in two places, along with a supporting unit test; please verify. Pardon my naming of the test function, and please suggest a proper name :)

Regarding "n_docs should be the same for both retriever and generator": this can't be checked directly if the generator does not know about the retriever, hence the check on context_input_ids.shape[0] % n_docs != 0.

tests/test_modeling_rag.py


lalitpagaria · 2020-10-19T11:43:21Z

@patrickvonplaten and @lhoestq Thanks for the review. I liked the test coverage of this project. Initially I struggled, but later everything worked nicely. You can merge when you want.

src/transformers/modeling_rag.py

patrickvonplaten · 2020-10-19T13:02:26Z

Slow tests pass => ready to merge

patrickvonplaten · 2020-10-19T13:15:59Z

Good job @lalitpagaria !

…nctions (huggingface#7891)
* Propagating n_docs as parameter to all RagModel's related functions that defaults to self.config.n_docs
* Making n_docs parameter's default value to None in marginalize function
* Fixing code quality issues
* Handle the special case when generator is of T5PreTrainedModel instance type. T5PreTrainedModel do not have n_docs as parameter
* T5PreTrainedModel do not have n_docs as parameter
* Addressing review comment
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>
* Correcting comment by addressing review comment
* Adding assert statement verifying that n_docs is correctly set. n_docs should be the same for both retriever and generator.
* Fixing flake8 reported issue
* Correcting test datasets for rag
* Using doc_scores instead of context_input_ids to check assert as in RagSequenceForGeneration context_input_ids can be null
* doc_scores second dimension have number of retrieved docs
* Changing assert comment
* Apply suggestions from code review
  Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

…lated functions (huggingface#7891)" This reverts commit 64a50cb.

lalitpagaria added 5 commits October 19, 2020 00:09

Propagating n_docs as parameter to all RagModel's related functions that defaults to self.config.n_docs

78ff152

Making n_docs parameter's default value to None in marginalize function

f35a1c8

Fixing code quality issues

a64c861

Handle the special case when generator is of T5PreTrainedModel instance type. T5PreTrainedModel do not have n_docs as parameter

f7c7ae9

T5PreTrainedModel do not have n_docs as parameter

94892ca

patrickvonplaten reviewed Oct 19, 2020

src/transformers/modeling_rag.py

patrickvonplaten approved these changes Oct 19, 2020

patrickvonplaten requested a review from lhoestq October 19, 2020 07:04

Addressing review comment

a5acc86

Co-authored-by: Patrick von Platen <patrick.v.platen@gmail.com>

lhoestq approved these changes Oct 19, 2020

src/transformers/modeling_rag.py

Correcting comment by addressing review comment

630ff8b

Adding assert statement verifying that n_docs is correctly set. n_docs should be the same for both retriever and generator.

de8de89

Fixing flake8 reported issue

d80b233

lhoestq reviewed Oct 19, 2020

tests/test_modeling_rag.py

lalitpagaria added 4 commits October 19, 2020 11:04

Correcting test datasets for rag

93a661e

Using doc_scores instead of context_input_ids to check assert as in RagSequenceForGeneration context_input_ids can be null

41a0706

doc_scores second dimension have number of retrieved docs

08e6e1c

Changing assert comment

e6c9ff8
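
The commit messages above mention two ideas: n_docs is an optional parameter that falls back to the config value, and the final check looks at the second dimension of doc_scores, which holds the number of retrieved docs. The sketch below illustrates both under loud assumptions: `resolve_n_docs` and `check_doc_scores` are hypothetical names, nested Python lists stand in for the real tensors, and none of this is the actual modeling_rag.py code.

```python
from typing import List, Optional


def resolve_n_docs(n_docs: Optional[int], config_n_docs: int) -> int:
    """Hypothetical helper: use the explicit n_docs, else the config default."""
    return n_docs if n_docs is not None else config_n_docs


def check_doc_scores(doc_scores: List[List[float]], n_docs: int) -> None:
    """Hypothetical helper: each question must have exactly n_docs scores."""
    for i, row in enumerate(doc_scores):
        if len(row) != n_docs:
            raise ValueError(
                f"question {i} has {len(row)} document scores, expected n_docs={n_docs}"
            )


# Two questions, three retrieved documents each (stand-in for a 2 x 3 tensor).
doc_scores = [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1]]

n_docs = resolve_n_docs(None, config_n_docs=3)  # falls back to the config value
check_doc_scores(doc_scores, n_docs)            # passes silently

try:
    check_doc_scores(doc_scores, resolve_n_docs(2, config_n_docs=3))
except ValueError as err:
    print(f"rejected: {err}")
```

One property of this kind of check: with two questions and three documents each there are 6 context rows, and 6 % 2 == 0, so a remainder test alone would accept n_docs=2 here, whereas comparing against the per-question score count rejects it.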