
Release raw lambada dataset #131

Closed
yaroslavvb opened this issue May 11, 2019 · 9 comments
@yaroslavvb
Is it possible to release the Lambada dataset used to generate accuracy numbers in Table 3 of the paper? This would make it easier to do comparisons with other models :)

@WuTheFWasThat
Contributor

we just use the plain text files, which can be downloaded here: https://zenodo.org/record/2630551#.XNxg89NKjUI

@yaroslavvb
Author

yaroslavvb commented May 15, 2019

That's a post-processed version, i.e., "don't" is split into "do n't", etc. GPT-2-small gets around 31% on that set. My understanding from @Newmu was that the 45.99 figure in Table 3 of the paper was on the raw/non-processed version.

@WuTheFWasThat
Contributor

we apply "de-tokenizers" to remove some of the artifacts. Alec can verify, but I think in this case it's simply:

def preprocess(text):
    # Normalize curly quotes to plain ASCII double quotes.
    text = text.replace("“", '"')
    text = text.replace("”", '"')
    # Prepend a newline and drop surrounding whitespace.
    return '\n' + text.strip()

in fact the detokenizer should be invertible, although I don't think that's important for the accuracy numbers

yaroslavvb added a commit to cybertronai/bflm that referenced this issue May 22, 2019
As recommended in openai/gpt-2#131

Original suggestion makes no difference because the official release doesn't have smart quotes. Adding `` -> " and '' -> " rules improves the result by 0.3%.
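The extra rules from that commit can be folded into the detokenizer along these lines (a minimal sketch; `detokenize` is a hypothetical name, assuming the commit maps the LaTeX-style `` and '' pairs to plain double quotes as well):

```python
def detokenize(text):
    # Normalize curly and LaTeX-style quote pairs to plain double quotes.
    for q in ("“", "”", "``", "''"):
        text = text.replace(q, '"')
    return "\n" + text.strip()
```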
@yaroslavvb
Author

This detokenizer doesn't do anything on the official Lambada dataset, since there are no smart quotes in it. My understanding is that OpenAI used its own version of the Lambada dataset, generated from BookCorpus/Lambada. This dataset is interesting because of the accuracy gap in GPT-2-small numbers: 34% on official Lambada vs. 46% on OpenAI's version.

@WuTheFWasThat
Contributor

my bad, you're right, whoops! try this: gs://gpt-2/data/lambada_test.jsonl

yaroslavvb added a commit to cybertronai/bflm that referenced this issue May 29, 2019
@yaroslavvb
Author

Thanks, that dataset makes a difference.

I'm now getting 41.98 with GPT-2-small on this version of the dataset, using length-5 beam-search decoding of the last word for stop-word removal.

Simplifying the procedure to test accuracy by comparing the last BPE token for equality, instead of the last word, brings accuracy up to 46.89.
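The two scoring criteria can be sketched as follows; `encode` stands in for the GPT-2 BPE encoder (not included here), and in the real evaluation `pred` would come from the model's decoded continuation:

```python
def last_word_match(pred: str, target: str) -> bool:
    # Stricter criterion: the full final word must match exactly.
    return pred.strip().split()[-1] == target.strip().split()[-1]

def last_token_match(encode, pred: str, target: str) -> bool:
    # Looser criterion: only the final BPE token must match, so a target
    # word spanning several tokens can count as correct on its last piece.
    return encode(pred)[-1] == encode(target)[-1]
```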

I'm wondering if this version should be called "lambada-openai" or similar in tables, to avoid confusion. I looked at the errors between the two datasets, and this version seems easier because the formatting provides extra information.

Official Lambada

she and zach were covered in dust and sweat when helen found them. "wow, lexi! you rock." lexi groaned at the bad pun. helen surveyed the work, which was nearly complete. "how did you do this?" lexi shrugged. "don't know." "it's her gift," said zach

This version

She and Zach were covered in dust and sweat when Helen found them. "Wow, Lexi! You rock."

Lexi groaned at the bad pun.

Helen surveyed the work, which was nearly complete. "How did you do this?"

Lexi shrugged. "Don't know."

"It's her gift," said Zach

@WuTheFWasThat
Contributor

yeah i agree keeping the extra information is potentially useful (even for non-zero-shot) and it's probably good to distinguish it from the original dataset

@lukesalamone

Hi, I'm also looking to run the same test. Can you fix the gs://gpt-2/data/lambada_test.jsonl link? I'm getting

BucketNotFoundException: 404 gs://gpt-2 bucket does not exist.

@WuTheFWasThat
Contributor

should now be at https://openaipublic.blob.core.windows.net/gpt-2/data/lambada_test.jsonl
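For reference, a minimal loader for that file, assuming each line is a JSON object with a "text" field holding one passage (an assumption based on the .jsonl extension; verify against the actual file):

```python
import json

def load_lambada(path):
    # Each line is assumed to be a JSON object with a "text" field
    # holding one passage.
    with open(path, encoding="utf-8") as f:
        return [json.loads(line)["text"] for line in f if line.strip()]

def split_example(text):
    # LAMBADA scores prediction of the final word given the rest as context.
    context, _, target = text.strip().rpartition(" ")
    return context, target
```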
