Request: Release Bloom Filters for WebText (or provide other method to check a given text is in WebText) #63

fabiencro · 2019-02-20T10:39:49Z

Hi,

Thank you for releasing your pretrained model and saving us training time. I am currently exploring possible applications, but I ran into a problem that might also annoy many researchers trying to use your model.

AFAIK, you have not released the WebText corpus (although I know this is currently discussed in issue #24). This is fine by me, except for one aspect: it makes it impossible for me to know if my test data is somehow included in WebText. This, in turn, makes it impossible for me to tell whether any improvement I am getting is due to the quality of GPT or to the fact that the pretrained model has already seen my test data.

If you do not plan to release WebText in the very near future, I was thinking you could release the bloom filters you describe in your technical paper (code + filled filters). This would allow us to evaluate the proportion of 8-grams in our test data that is also in WebText.
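As a rough illustration of the check described above, here is a minimal sketch of how an 8-gram overlap could be measured against a Bloom filter. Everything in it is an assumption made for illustration: the `BloomFilter` class, the `ngrams` and `overlap_fraction` helpers, the filter size, the hashing scheme, and the word-level normalization are hypothetical and do not reflect the format or library of the actual WebText filters.

```python
import hashlib
import re


class BloomFilter:
    """Tiny in-memory Bloom filter (illustrative only, not the WebText format)."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ngrams(text, n=8):
    # Assumed normalization: lowercase alphanumeric words joined by single spaces.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def overlap_fraction(test_text, bloom, n=8):
    """Fraction of the n-grams in test_text that the filter reports as present."""
    grams = ngrams(test_text, n)
    if not grams:
        return 0.0
    return sum(g in bloom for g in grams) / len(grams)
```

With filled filters, one would call `overlap_fraction(test_text, bloom)` on each test document. Because a Bloom filter can return false positives but not false negatives, the reported fraction is an upper bound on the true overlap.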

Would this be possible?
Thank you.

WuTheFWasThat · 2019-02-20T17:52:42Z

this is very reasonable - I'll discuss this with others and get back to you. as a forewarning, though:

- we have lots of other higher priority items to address right now
- the bloom filter library i used loads them into memory, and it's about 45G, so you need a pretty beefy machine

fabiencro · 2019-02-21T05:16:25Z

OK. Thank you very much for considering it anyway.

Indeed, 45GB is larger than I expected. But I do have access to machines with enough memory, so it would still be usable, at least in my case.

WuTheFWasThat closed this as completed Feb 20, 2019

WuTheFWasThat reopened this Feb 20, 2019

WuTheFWasThat added the question label Feb 20, 2019
