This dataset contains:
- 250K documents from the WebText test set
- For each GPT-2 model (trained on the WebText training set), 250K random samples (temperature 1, no truncation) and 250K samples generated with Top-K 40 truncation
We look forward to the research produced using this data!
For each model, we have a training split of 250K generated examples, as well as validation and test splits of 5K examples.
All data is located in Google Cloud Storage, under the directory
There, you will find files:
where `split` is one of `train`, `valid`, or `test`.
We've provided a script to download all of them, in
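As an illustrative sketch (not part of the release), one of these splits can be read in Python, assuming the files are newline-delimited JSON records carrying a `text` field; the file name below is only an example:

```python
import json

def load_split(path):
    """Read a newline-delimited JSON split file and return its document texts."""
    texts = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            texts.append(record["text"])  # assumes each record has a "text" field
    return texts

# Example file name; substitute whichever split you downloaded.
webtext_train = load_split("webtext.train.jsonl")
print(len(webtext_train), "documents loaded")
```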
We're interested in seeing research on the detectability of GPT-2 model family generations.
We've provided a starter baseline in `baseline.py`, which trains a logistic regression detector on TF-IDF unigram and bigram features. The baseline achieves the following accuracies:
| Model | Temperature 1 | Top-K 40 |
| ----- | ------------- | -------- |
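For illustration, here is a minimal sketch of a detector along these lines, built with scikit-learn's `TfidfVectorizer` and `LogisticRegression` (a sketch under those assumptions, not the provided baseline script):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train_detector(real_texts, generated_texts):
    """Fit a TF-IDF (unigram + bigram) logistic regression detector.

    Label 0 = real WebText document, label 1 = model-generated sample.
    """
    texts = real_texts + generated_texts
    labels = [0] * len(real_texts) + [1] * len(generated_texts)
    detector = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams and bigrams
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    detector.fit(texts, labels)
    return detector

# Usage (with lists of document strings loaded from the splits):
# detector = train_detector(webtext_train, generated_train)
# valid_texts = webtext_valid + generated_valid
# valid_labels = [0] * len(webtext_valid) + [1] * len(generated_valid)
# accuracy = detector.score(valid_texts, valid_labels)
```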
Unsurprisingly, shorter documents are harder to detect, and performance improves gradually with length. Detection accuracy on short documents of about 500 characters (a long paragraph) is roughly 15% lower.
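One way to probe the length effect (a sketch reusing the `detector` and labeled validation texts from the example above) is to re-score the detector on documents cut to their first 500 characters:

```python
def accuracy_at_length(detector, texts, labels, max_chars=500):
    """Evaluate a trained detector on documents truncated to max_chars characters."""
    truncated = [text[:max_chars] for text in texts]
    return detector.score(truncated, labels)

# full_acc = detector.score(valid_texts, valid_labels)
# short_acc = accuracy_at_length(detector, valid_texts, valid_labels, max_chars=500)
```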
Truncated sampling, which is commonly used for high-quality generations from the GPT-2 model family, results in a shift in the part-of-speech distribution of the generated text compared to real text. A clear example is the underuse of proper nouns and the overuse of pronouns, which are more generic. This shift contributes to the 8% to 18% higher detection rate of Top-K 40 samples compared to random samples across models.
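To inspect this shift yourself, a small sketch using NLTK's part-of-speech tagger (an illustrative choice; any tagger would do) compares tag frequencies between real and generated documents:

```python
from collections import Counter

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

def pos_distribution(texts):
    """Return the relative frequency of each part-of-speech tag across a list of texts."""
    counts = Counter()
    for text in texts:
        tokens = nltk.word_tokenize(text)
        counts.update(tag for _, tag in nltk.pos_tag(tokens))
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

# Compare e.g. proper nouns ("NNP") and personal pronouns ("PRP"):
# real_pos = pos_distribution(webtext_valid)
# generated_pos = pos_distribution(generated_valid)
# print(real_pos.get("NNP", 0.0), generated_pos.get("NNP", 0.0))
# print(real_pos.get("PRP", 0.0), generated_pos.get("PRP", 0.0))
```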
Data removal requests
If you believe your work is included in WebText and would like us to remove it, please let us know at firstname.lastname@example.org.