# Sequence Classification: Stanford Sentiment Treebank (SST2, GLUE)

This is a simple notebook to download and store the [Stanford Sentiment Treebank](https://nlp.stanford.edu/sentiment/index.html) (SST2) dataset. We will take it from the Hugging Face datasets library (<3) and turn it into one of the two formats that classy is able to parse (i.e. jsonl or tsv).

After creating this dataset, you can train the model by executing the following bash command:
```bash
classy train sequence data/sequence/sst2 -n my_first_sst2_run
```

In [None]:
! pip install datasets

In [None]:
from datasets import load_dataset
from tqdm.notebook import tqdm

In [None]:
# here we load the dataset from "datasets"
dataset = load_dataset("glue", "sst2")

In [None]:
dataset

In [None]:
# here we build a simple mapping from the labels in the int format stored in the hf-datasets' version of sst2
# to a more readable string format.
mapping = {
    -1: "Unlabelled",
    0: "Negative",
    1: "Positive",
}

mapping

In [None]:
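# let's create a directory that will contain the dataset splits
import os
dir_path = "sst2"
# exist_ok=True lets this cell be re-run without failing if the directory already exists
os.makedirs(dir_path, exist_ok=True)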
! ls

In [None]:
# if you want the output format to be tab-separated, comment out the first line and uncomment the second
output_format = "jsonl"
# output_format = "tsv"

if output_format == "jsonl":
    import json

In [None]:
for split in ["train", "validation", "test"]:

    with open(f"{dir_path}/{split}.{output_format}", "w") as f:

        for instance in tqdm(dataset[split], desc=split):
            
            sentence = instance["sentence"].replace("\t", "    ").strip()
            sentiment = instance["label"]
            sentiment = mapping[sentiment]

            if output_format == "jsonl":
                json_dict = dict(sequence=sentence, label=sentiment)
                dump_line = json.dumps(json_dict)
            else:
                # tsv line: sentence and label separated by a single tab
                dump_line = f"{sentence}\t{sentiment}"

            f.write(dump_line)
            f.write("\n")
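
To make the two output formats concrete, the per-instance formatting done in the loop above can be isolated in a small function. This is only a sketch: `format_instance` and `LABELS` are hypothetical names introduced here for illustration, not part of classy or of the datasets library.

```python
import json

# Hypothetical copy of the label mapping defined earlier in the notebook.
LABELS = {-1: "Unlabelled", 0: "Negative", 1: "Positive"}


def format_instance(sentence, label, output_format="jsonl"):
    """Return one output line (without the trailing newline)."""
    # Tabs inside the sentence would break the tsv layout, so replace them.
    sentence = sentence.replace("\t", "    ").strip()
    sentiment = LABELS[label]

    if output_format == "jsonl":
        # One JSON object per line, with "sequence" and "label" keys.
        return json.dumps(dict(sequence=sentence, label=sentiment))
    if output_format == "tsv":
        # Sentence and label separated by a single tab.
        return f"{sentence}\t{sentiment}"
    raise ValueError(f"unsupported output format: {output_format}")


print(format_instance("a charming film ", 1, "jsonl"))
print(format_instance("a charming film ", 1, "tsv"))
```

Keeping the formatting in one place makes it easy to check both branches on a single sentence before writing the full splits.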

In [None]:
! head -5 $dir_path/train.$output_format

In [None]:
! head -5 $dir_path/validation.$output_format

In [None]:
! head -5 $dir_path/test.$output_format
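
Besides looking at the first lines, it can help to read a written file back and count the labels it contains. The following is a minimal sketch under the assumption that the files follow the layout produced above; `count_labels` is a hypothetical helper name, not something provided by classy.

```python
import json
from collections import Counter


def count_labels(path, output_format="jsonl"):
    """Count how many lines of a written split carry each label."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue  # ignore empty lines
            if output_format == "jsonl":
                label = json.loads(line)["label"]
            else:
                # tsv: the label is the last tab-separated field
                label = line.rsplit("\t", 1)[1]
            counts[label] += 1
    return counts
```

For example, `count_labels(f"{dir_path}/train.{output_format}", output_format)` returns the label distribution of the training split. Since the mapping reserves `-1` for unlabelled instances, a split whose labels are hidden should show only `Unlabelled`.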