Request: Release Bloom Filters for WebText (or provide other method to check a given text is in WebText) #63

Open
fabiencro opened this issue Feb 20, 2019 · 2 comments
Labels
question Further information is requested

Comments

@fabiencro

Hi,

Thank you for releasing your pretrained model and saving us training time. I am currently exploring possible applications, but I ran into a problem that will probably also affect many researchers trying to use your model.

AFAIK, you have not released the WebText corpus (although I know this is currently being discussed in issue #24). This is fine by me, except for one aspect: it makes it impossible for me to know whether my test data is somehow included in WebText. That, in turn, makes it impossible for me to tell whether any improvement I am getting is due to the quality of GPT or to the fact that the pretrained model has already seen my test data.

If you do not plan to release WebText in the very near future, I was thinking you could release the bloom filters you describe in your technical paper (code + filled filters). This would allow us to evaluate the proportion of 8-grams in our test data that is also in WebText.

Would this be possible?
Thank you.
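
For reference, the check requested above could look roughly like the sketch below. It is only an illustration: the `BloomFilter` class is a toy stand-in (not the library or filters used for WebText), and the lower-casing and whitespace tokenization in `ngrams` are assumptions that would have to match whatever normalization was used when the real filters were filled.

```python
import hashlib


class BloomFilter:
    """Toy in-memory Bloom filter (hypothetical stand-in, not the WebText one)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def ngrams(text, n=8):
    """Word n-grams after a simple (assumed) normalization."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def overlap_fraction(test_text, bloom, n=8):
    """Fraction of distinct n-grams in test_text that the filter reports as seen."""
    grams = set(ngrams(test_text, n))
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in bloom) / len(grams)
```

Because a Bloom filter can return false positives but never false negatives, the fraction computed this way is an upper bound on the true overlap.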

@WuTheFWasThat
Collaborator

this is very reasonable - I'll discuss this with others and get back to you. as a forewarning, though:

  • we have lots of other higher priority items to address right now
  • the bloom filter library I used loads the filters into memory, and they total about 45G, so you need a pretty beefy machine
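
As a rough guide to why such filters get large, the standard Bloom filter sizing formulas can be sketched as follows. The item counts and false-positive rate below are hypothetical; the thread does not state the parameters used for the WebText filters.

```python
import math


def bloom_size_bits(num_items, false_positive_rate):
    """Bits needed for num_items at the target rate: m = -n ln(p) / (ln 2)^2."""
    return math.ceil(-num_items * math.log(false_positive_rate) / math.log(2) ** 2)


def optimal_num_hashes(num_bits, num_items):
    """Number of hash functions that minimizes false positives: k = (m / n) ln 2."""
    return max(1, round(num_bits / num_items * math.log(2)))


def max_items_for_budget(num_bytes, false_positive_rate):
    """Largest n that fits a memory budget at the target rate (inverse of the above)."""
    return int(num_bytes * 8 * math.log(2) ** 2 / -math.log(false_positive_rate))


if __name__ == "__main__":
    # Hypothetical example: one billion n-grams at a 1-in-a-million error rate.
    bits = bloom_size_bits(1_000_000_000, 1e-6)
    print(f"{bits / 8 / 1e9:.1f} GB, {optimal_num_hashes(bits, 1_000_000_000)} hashes")
```

Memory grows linearly with the number of stored n-grams and only logarithmically with the inverse of the error rate, so a filter covering a large corpus at a strict error rate can reach tens of gigabytes.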

@WuTheFWasThat reopened this Feb 20, 2019
@WuTheFWasThat added the "question" label (Further information is requested) Feb 20, 2019
@fabiencro
Author

OK. Thank you very much for considering it anyway.

Indeed, 45GB is larger than I expected. But I do have access to machines with enough memory, so it would still be usable, at least in my case.
