# Sentence Pair Classification: Quora Question Pairs

This is a simple notebook that downloads and stores the Quora Question Pairs dataset. We take it from the Hugging Face datasets library (<3) and convert it into one of the two formats that classy can parse (i.e. jsonl or tsv).

After creating this dataset, you can train the model by executing the following bash command:
```bash
classy train sentence-pair data/sentence-pair/quora_question_pairs -n my_first_qqp_run
```

In [1]:
! pip install datasets



In [2]:
from datasets import load_dataset
import os
from tqdm.notebook import tqdm

In [3]:
# here we load the QQP dataset from "datasets"
dataset = load_dataset("glue", "qqp")

Reusing dataset glue (/home/edobobo/.cache/huggingface/datasets/glue/qqp/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


In [4]:
# here we build a simple mapping from the labels in the int format stored in the hf-datasets' version of qqp
# to a more readable string format.
mapping = {
    -1: "NO-LABEL",
    0: "NOT-EQUIVALENT",
    1: "EQUIVALENT",
}

mapping

{-1: 'NO-LABEL', 0: 'NOT-EQUIVALENT', 1: 'EQUIVALENT'}

In [5]:
# let's create a directory that will contain the dataset splits
dir_path = "quora_question_pairs"
os.makedirs(dir_path, exist_ok=True)  # does not fail if the directory already exists
! ls

quora_question_pairs  quora_question_pairs.ipynb


In [6]:
# if you want the output format to be tab-separated, uncomment the second line
output_format = "jsonl"
# output_format = "tsv"

if output_format == "jsonl":
    import json
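
One caveat with the tsv option: if a question itself contained a tab or a line break, the three-column layout would no longer hold. The following is only a minimal sketch of how such fields could be normalized before writing; `to_tsv_line` is a hypothetical helper name, not part of classy or datasets, and whether QQP actually contains such characters is not checked here.

```python
def to_tsv_line(question1: str, question2: str, label: str) -> str:
    """Build one tab-separated line, collapsing whitespace runs inside each field."""
    # str.split() with no arguments splits on any whitespace (tabs, newlines, spaces),
    # so re-joining with a single space removes embedded tabs and line breaks.
    fields = [" ".join(field.split()) for field in (question1, question2, label)]
    return "\t".join(fields)


# Small illustration with made-up questions
line = to_tsv_line("What is\ta tab?", "Is this\none line?", "NOT-EQUIVALENT")
print(line.split("\t"))
```

The jsonl option does not need this step, since `json.dumps` escapes tabs and newlines inside strings.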

In [7]:
for k in ['train', 'validation', 'test']:

    with open(f'{dir_path}/{k}.{output_format}', 'w') as f:

        for instance in tqdm(dataset[k]):
            
            question1 = instance["question1"]
            question2 = instance["question2"]
            label = instance["label"]
            label = mapping[label]

            if output_format == "jsonl":
                json_dict = dict(sentence1=question1, sentence2=question2, label=label)
                dump_line = json.dumps(json_dict)
            else:
                dump_line = f"{question1}\t{question2}\t{label}"

            f.write(dump_line)
            f.write("\n")

  0%|          | 0/363846 [00:00<?, ?it/s]

  0%|          | 0/40430 [00:00<?, ?it/s]

  0%|          | 0/390965 [00:00<?, ?it/s]
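
Once the files are written, a quick sanity check is to read a split back and count its labels. The snippet below is a sketch that assumes the jsonl layout produced above (one JSON object per line with a `label` key); `count_labels` is a hypothetical helper, and the demo runs on a tiny temporary file rather than on the real splits.

```python
import json
import os
import tempfile
from collections import Counter


def count_labels(path: str) -> Counter:
    """Count the label values in a jsonl file holding one JSON object per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            counts[json.loads(line)["label"]] += 1
    return counts


# Demo on a small temporary file with made-up rows in the same format
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        f.write(json.dumps(dict(sentence1="a", sentence2="b", label="EQUIVALENT")) + "\n")
        f.write(json.dumps(dict(sentence1="c", sentence2="d", label="NOT-EQUIVALENT")) + "\n")
        f.write(json.dumps(dict(sentence1="e", sentence2="f", label="EQUIVALENT")) + "\n")
    sample_counts = count_labels(sample_path)

print(sample_counts)
```

On the real data this could be called as `count_labels(f"{dir_path}/train.jsonl")`. Since the test split has no gold labels, it would be expected to contain only `NO-LABEL`.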

In [8]:
! head -5 $dir_path/train.$output_format

{"sentence1": "How is the life of a math student? Could you describe your own experiences?", "sentence2": "Which level of prepration is enough for the exam jlpt5?", "label": "NOT-EQUIVALENT"}
{"sentence1": "How do I control my horny emotions?", "sentence2": "How do you control your horniness?", "label": "EQUIVALENT"}
{"sentence1": "What causes stool color to change to yellow?", "sentence2": "What can cause stool to come out as little balls?", "label": "NOT-EQUIVALENT"}
{"sentence1": "What can one do after MBBS?", "sentence2": "What do i do after my MBBS ?", "label": "EQUIVALENT"}
{"sentence1": "Where can I find a power outlet for my laptop at Melbourne Airport?", "sentence2": "Would a second airport in Sydney, Australia be needed if a high-speed rail link was created between Melbourne and Sydney?", "label": "NOT-EQUIVALENT"}


In [9]:
! head -5 $dir_path/validation.$output_format

{"sentence1": "Why are African-Americans so beautiful?", "sentence2": "Why are hispanics so beautiful?", "label": "NOT-EQUIVALENT"}
{"sentence1": "I want to pursue PhD in Computer Science about social network,what is the open problem in social networks?", "sentence2": "I handle social media for a non-profit. Should I start going to social media networking events? Are there any good ones in the bay area?", "label": "NOT-EQUIVALENT"}
{"sentence1": "Is there a reason why we should travel alone?", "sentence2": "What are some reasons to travel alone?", "label": "EQUIVALENT"}
{"sentence1": "Why are people so obsessed with having a girlfriend/boyfriend?", "sentence2": "How can a single male have a child?", "label": "NOT-EQUIVALENT"}
{"sentence1": "What are some good baby girl names starting with D?", "sentence2": "What are some good baby girl names starting with D or H?", "label": "NOT-EQUIVALENT"}


In [10]:
! head -5 $dir_path/test.$output_format

{"sentence1": "Would the idea of Trump and Putin in bed together scare you, given the geopolitical implications?", "sentence2": "Do you think that if Donald Trump were elected President, he would be able to restore relations with Putin and Russia as he said he could, based on the rocky relationship Putin had with Obama and Bush?", "label": "NO-LABEL"}
{"sentence1": "What are the top ten Consumer-to-Consumer E-commerce online?", "sentence2": "What are the top ten Consumer-to-Business E-commerce online?", "label": "NO-LABEL"}
{"sentence1": "Why don't people simply 'Google' instead of asking questions on Quora?", "sentence2": "Why do people ask Quora questions instead of just searching google?", "label": "NO-LABEL"}
{"sentence1": "Is it safe to invest in social trade biz?", "sentence2": "Is social trade geniune?", "label": "NO-LABEL"}
{"sentence1": "If the universe is expanding then does matter also expand?", "sentence2": "If universe and space is expanding? Does that mean anything th