# Sequence Level Classification: Sentiment140

This is a simple notebook that downloads and stores the Sentiment140 sentiment analysis dataset. We take it from the Hugging Face `datasets` library (<3) and convert it into one of the two formats that classy can parse (i.e. jsonl or tsv).

After creating this dataset, you can train the model by executing the following bash command:
```bash
classy train sequence data/sequence/sentiment_140 -n my_first_sentiment140_run
```

In [1]:
! pip install datasets



In [2]:
from datasets import load_dataset
from tqdm.notebook import tqdm

In [3]:
# here we load the dataset from "datasets"
dataset = load_dataset('sentiment140')

Reusing dataset sentiment140 (/home/edobobo/.cache/huggingface/datasets/sentiment140/sentiment140/1.0.0/7fab67e78ff0003f3f5cdd1532c460b739c9342bf53421ccde27bc640ed45cd7)


In [4]:
# here we build a simple mapping from the labels in the int format stored in the hf-datasets' version of sentiment140
# to a more readable string format.
mapping = {
    0: "Negative",
    2: "Neutral",
    4: "Positive",
}

mapping

{0: 'Negative', 2: 'Neutral', 4: 'Positive'}
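
Since the conversion loop looks every label up in `mapping`, a label missing from it would raise a `KeyError` partway through writing a file. A minimal sketch of a pre-check is shown below; the helper name `unmapped_labels` and the sample label values are hypothetical and only serve as an illustration.

```python
# Same mapping as in the cell above.
mapping = {
    0: "Negative",
    2: "Neutral",
    4: "Positive",
}


def unmapped_labels(labels, mapping):
    """Return the sorted label values that have no entry in the mapping."""
    return sorted(set(labels) - set(mapping))


# Hypothetical label values, standing in for a real column of the dataset.
sample_labels = [0, 4, 4, 2, 0]
missing = unmapped_labels(sample_labels, mapping)
print(missing)
```

An empty list means that every label can be converted safely.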

In [5]:
# let's create a directory that will contain the dataset splits
import os
dir_path = "sentiment_140"
os.makedirs(dir_path, exist_ok=True)  # does not fail if the cell is re-run
! ls

sentiment_140  sentiment_140.ipynb


In [6]:
# if you want the output format to be tab-separated, comment out the first line and uncomment the second one
output_format = "jsonl"
# output_format = "tsv"

if output_format == "jsonl":
    import json

In [7]:
for k in ['train', 'test']:

    with open(f'{dir_path}/{k}.{output_format}', 'w', encoding='utf-8') as f:

        for instance in tqdm(dataset[k]):
            
            text = instance['text'].replace('\t', '    ')
            sentiment = instance['sentiment']
            sentiment = mapping[sentiment]

            if output_format == "jsonl":
                json_dict = dict(sequence=text, label=sentiment)
                dump_line = json.dumps(json_dict)
            else:
                dump_line = f"{text}\t{sentiment}"

            f.write(dump_line)
            f.write("\n")

  0%|          | 0/1600000 [00:00<?, ?it/s]

  0%|          | 0/498 [00:00<?, ?it/s]
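
The per-instance logic of the loop above can also be factored into a small function, which makes it easy to try on a single tweet without loading the full dataset. This is only a sketch: the function name `format_instance` is hypothetical, and the example text is made up.

```python
import json

# Same label mapping as defined earlier in the notebook.
MAPPING = {0: "Negative", 2: "Neutral", 4: "Positive"}


def format_instance(text, sentiment, output_format="jsonl"):
    """Return one output line (without the trailing newline) for a single tweet."""
    # Tabs would break the tsv layout, so they are replaced with four spaces,
    # exactly as in the loop above.
    text = text.replace("\t", "    ")
    label = MAPPING[sentiment]
    if output_format == "jsonl":
        return json.dumps(dict(sequence=text, label=label))
    return f"{text}\t{label}"


# Made-up example text containing a tab character.
jsonl_line = format_instance("good\tday", 4)
tsv_line = format_instance("good\tday", 4, output_format="tsv")
print(jsonl_line)
print(tsv_line)
```

In the jsonl case each line is a JSON object with the keys `sequence` and `label`; in the tsv case the text and the label are separated by a single tab.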

In [8]:
! head -5 $dir_path/train.$output_format

{"sequence": "@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. ;D", "label": "Negative"}
{"sequence": "is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!", "label": "Negative"}
{"sequence": "@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds", "label": "Negative"}
{"sequence": "my whole body feels itchy and like its on fire ", "label": "Negative"}
{"sequence": "@nationwideclass no, it's not behaving at all. i'm mad. why am i here? because I can't see you all over there. ", "label": "Negative"}


In [9]:
! head -5 $dir_path/test.$output_format

{"sequence": "@stellargirl I loooooooovvvvvveee my Kindle2. Not that the DX is cool, but the 2 is fantastic in its own right.", "label": "Positive"}
{"sequence": "Reading my kindle2...  Love it... Lee childs is good read.", "label": "Positive"}
{"sequence": "Ok, first assesment of the #kindle2 ...it fucking rocks!!!", "label": "Positive"}
{"sequence": "@kenburbary You'll love your Kindle2. I've had mine for a few months and never looked back. The new big one is huge! No need for remorse! :)", "label": "Positive"}
{"sequence": "@mikefish  Fair enough. But i have the Kindle2 and I think it's perfect  :)", "label": "Positive"}
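
As a complement to `head`, the written files can be read back to check the label distribution. The sketch below assumes the jsonl format produced above (one object per line with the keys `sequence` and `label`); the function name `count_labels` is hypothetical, and the temporary file with made-up rows only stands in for the real `train.jsonl` or `test.jsonl`.

```python
import json
import os
import tempfile
from collections import Counter


def count_labels(path):
    """Count the labels in a jsonl file with one {"sequence", "label"} object per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip empty lines
                counts[json.loads(line)["label"]] += 1
    return counts


# Made-up rows written to a temporary file, standing in for a real split.
sample = [
    {"sequence": "love it", "label": "Positive"},
    {"sequence": "not great", "label": "Negative"},
    {"sequence": "really good", "label": "Positive"},
]
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as f:
        for row in sample:
            f.write(json.dumps(row) + "\n")
    label_counts = count_labels(sample_path)

print(label_counts)
```

To run it on the real data, call `count_labels` with the path of one of the files written by this notebook, for example `sentiment_140/test.jsonl`.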
