# Sentence Pair Classification: Quora Question Pairs

This is a simple notebook to download and store the Quora Question Pairs dataset. We will take it from the Hugging Face datasets library (<3) and turn it into one of the two formats that classy is able to parse (i.e. jsonl or tsv).

After creating this dataset, you can train the model by executing the following bash command:
```bash
classy train sentence-pair data/sentence-pair/qqp -n my_first_qqp_run
```

In [None]:
! pip install datasets

In [None]:
from datasets import load_dataset
import os
from tqdm.notebook import tqdm

In [None]:
# here we load the dataset from "datasets"
dataset = load_dataset("glue", "qqp")

In [None]:
# here we build a simple mapping from the labels in the int format stored in the hf-datasets' version of qqp
# to a more readable string format.
mapping = {
    -1: "NO-LABEL",
    0: "NOT-EQUIVALENT",
    1: "EQUIVALENT",
}

mapping

In [None]:
# let's create a directory that will contain the dataset splits
dir_path = "qqp"
os.makedirs(dir_path, exist_ok=True)  # does not fail if the directory already exists
! ls

In [None]:
# if you want the output format to be tab-separated, uncomment the second line
output_format = "jsonl"
# output_format = "tsv"

if output_format == "jsonl":
    import json
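
To make the two output formats concrete, here is a minimal sketch of what a single line looks like in each one. The question pair is a made-up example, not taken from the dataset, and the field names `sentence1`, `sentence2` and `label` simply mirror the ones used in the writing loop of this notebook.

```python
import json

# Hypothetical question pair, invented for illustration only.
example = {
    "question1": "How do I learn Python?",
    "question2": "What is the best way to learn Python?",
    "label": "EQUIVALENT",
}

# jsonl: one JSON object per line
jsonl_line = json.dumps(
    dict(
        sentence1=example["question1"],
        sentence2=example["question2"],
        label=example["label"],
    )
)

# tsv: the same three fields, separated by tabs
tsv_line = "\t".join([example["question1"], example["question2"], example["label"]])

print(jsonl_line)
print(tsv_line)
```

The jsonl form is the safer default: `json.dumps` escapes any tab or newline inside a question, whereas the plain tsv line would be split into extra columns by them.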

In [None]:
for split in ["train", "validation", "test"]:

    # write one file per split, e.g. qqp/train.jsonl
    with open(f"{dir_path}/{split}.{output_format}", "w", encoding="utf-8") as f:

        for instance in tqdm(dataset[split], desc=split):

            question1 = instance["question1"]
            question2 = instance["question2"]
            # convert the integer label into its readable string form
            label = mapping[instance["label"]]

            if output_format == "jsonl":
                json_dict = dict(sentence1=question1, sentence2=question2, label=label)
                dump_line = json.dumps(json_dict)
            else:
                dump_line = f"{question1}\t{question2}\t{label}"

            f.write(dump_line)
            f.write("\n")
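
Once the files are written, a quick sanity check can catch malformed lines before training. The sketch below is one possible way to do it. `check_split` is a hypothetical helper, not part of classy or datasets, and it assumes the three-field layout produced by the loop above.

```python
import json


def check_split(path, output_format):
    """Count the lines in `path` and check that each one has the three expected fields."""
    n_lines = 0
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if output_format == "jsonl":
                fields = json.loads(line)
                ok = set(fields) == {"sentence1", "sentence2", "label"}
            else:
                # a tab inside a question would show up here as an extra column
                ok = len(line.split("\t")) == 3
            if not ok:
                raise ValueError(f"{path}:{line_number} is malformed: {line!r}")
            n_lines += 1
    return n_lines
```

For example, `check_split(f"{dir_path}/train.{output_format}", output_format)` should return the number of pairs in the training split if every line is well formed.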

In [None]:
! head -5 $dir_path/train.$output_format

In [None]:
! head -5 $dir_path/validation.$output_format

In [None]:
! head -5 $dir_path/test.$output_format