Paraphrase Datasets: research notes and links to datasets that can be used to train sentence paraphrase models

otanadzetsotne/paraphrase_datasets

Paraphrasing datasets

  • GLUE (General Language Understanding Evaluation benchmark)

    Home page ->

    tensorflow ->

    github ->

  • MRPC (Microsoft Research Paraphrase Corpus)

    The Microsoft Research Paraphrase Corpus (Dolan & Brockett, 2005) is a corpus of sentence pairs automatically retrieved from online news sources, with human annotations indicating whether the sentences in the pair are semantically equivalent.

    Home page ->

    Download ->

  • CoLA (The Corpus of Linguistic Acceptability)

    The Corpus of Linguistic Acceptability consists of English acceptability judgments drawn from books and journal articles on linguistic theory. Each example is a sequence of words annotated with whether it is a grammatical English sentence.

    Home page ->

    Download ->

  • QQP (Quora Question Pairs)

    The Quora Question Pairs dataset is a collection of question pairs from the community question-answering website Quora. The task is to determine whether the two questions in a pair are semantically equivalent.

    Kaggle ->

  • STS (The Semantic Textual Similarity Benchmark)

    The Semantic Textual Similarity Benchmark (Cer et al., 2017) is a collection of sentence pairs drawn from news headlines, video and image captions, and natural language inference data. Each pair is human-annotated with a similarity score from 0 to 5.

    Home page ->

    Download ->

  • PAWS (Paraphrase Adversaries from Word Scrambling)

    This dataset contains 108,463 human-labeled and 656k noisily labeled pairs that feature the importance of modeling structure, context, and word order information for the problem of paraphrase identification. The dataset has two subsets, one based on Wikipedia and the other one based on the Quora Question Pairs (QQP) dataset.

    paper ->

    github ->

    Download (Wiki) (labeled) ->

    Download (Wiki) (labeled, swap-only) ->

  • PAWS-X

    This dataset contains 23,659 human-translated PAWS evaluation pairs and 296,406 machine-translated training pairs in six typologically distinct languages: French, Spanish, German, Chinese, Japanese, and Korean. All translated pairs are sourced from examples in PAWS-Wiki.

    github ->

    Download ->

  • PIT (Paraphrase and Semantic Similarity in Twitter)

    Paraphrase and Semantic Similarity in Twitter (PIT) presents a constructed Twitter Paraphrase Corpus that contains 18,762 sentence pairs.

    github ->

  • SciTail

    The SciTail dataset is an entailment dataset created from multiple-choice science exams and web sentences. Each question and the correct answer choice are converted into an assertive statement to form the hypothesis.

    Home page ->

    Paper ->

    Download ->

  • TURL (Twitter News URL Corpus)

    Requires Access

    The Twitter News URL Corpus is described by its authors as the largest human-labeled paraphrase corpus to date, with 51,524 sentence pairs, and as the first cross-domain benchmark for automatic paraphrase identification.

    github ->

  • CQADupStack

    CQADupStack is a benchmark dataset for community question-answering research. It contains threads from twelve StackExchange subforums, annotated with duplicate question information. Pre-defined training and test splits are provided, both for retrieval and classification experiments, to ensure maximum comparability between different studies using the set. Furthermore, it comes with a script to manipulate the data in various ways.

    Home page ->

    github ->

    Download ->

  • Paralex

    Paralex learns from a collection of 18 million question-paraphrase pairs scraped from WikiAnswers.

    Home page ->

    Download ->

  • Benchmark for Neural Paraphrase Detection

    This is a benchmark for neural paraphrase detection, to differentiate between original and machine-generated content.

    Home page ->

    Download ->
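
The datasets above label pairs in two different ways: most (MRPC, QQP, PAWS) use a binary paraphrase flag, while STS uses a similarity score from 0 to 5. A minimal Python sketch of how both kinds could be brought into one common shape for paraphrase training is shown below. The column names (`sentence1`, `sentence2`, `label`, `score`), the tab-separated layout, the sample rows, and the 4.0 score threshold are all assumptions made for illustration; the real files differ per dataset, so check each download's header before reusing this.

```python
import csv
import io
from typing import Iterable, Iterator, List, NamedTuple, TextIO, Tuple


class Pair(NamedTuple):
    sentence1: str
    sentence2: str
    is_paraphrase: bool


def read_binary_pairs(handle: TextIO, s1: str = "sentence1",
                      s2: str = "sentence2", label: str = "label") -> Iterator[Pair]:
    """Read a TSV whose label column is "1" (paraphrase) or "0" (not).

    Column names are hypothetical defaults, not any dataset's official schema.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Pair(row[s1], row[s2], row[label] == "1")


def read_scored_pairs(handle: TextIO, threshold: float = 4.0, s1: str = "sentence1",
                      s2: str = "sentence2", score: str = "score") -> Iterator[Pair]:
    """Read a TSV with a 0-5 similarity score and binarize it.

    The threshold is an assumed cut-off, not a value prescribed by STS.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Pair(row[s1], row[s2], float(row[score]) >= threshold)


def positives(pairs: Iterable[Pair]) -> List[Tuple[str, str]]:
    """Keep only the pairs marked as paraphrases, as (source, target) tuples."""
    return [(p.sentence1, p.sentence2) for p in pairs if p.is_paraphrase]


# Made-up sample rows standing in for real dataset files.
BINARY_SAMPLE = (
    "sentence1\tsentence2\tlabel\n"
    "A cat sat.\tA cat was sitting.\t1\n"
    "A cat sat.\tDogs bark.\t0\n"
)
SCORED_SAMPLE = (
    "sentence1\tsentence2\tscore\n"
    "He runs fast.\tHe is a fast runner.\t4.6\n"
    "He runs fast.\tShe reads slowly.\t0.8\n"
)

if __name__ == "__main__":
    training_pairs = (
        positives(read_binary_pairs(io.StringIO(BINARY_SAMPLE)))
        + positives(read_scored_pairs(io.StringIO(SCORED_SAMPLE)))
    )
    for source, target in training_pairs:
        print(f"{source} -> {target}")
```

Keeping the reader functions separate from `positives` means a generation model can train on the positive pairs only, while a paraphrase classifier can consume the full `Pair` stream, negatives included.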
