# TODO
## Unified Emotion
- Remove samples with no-emotion class
- Convert multiple labels to multiple examples with different labels
- Come up with better assignment scheme for train/valid/test splits
- Drop "#SemST" from ssec sentences
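
A minimal sketch of two of the items above (splitting multiple labels into separate examples, and dropping "#SemST"). The `text`/`labels` field names and the label strings are hypothetical and only illustrate the idea; the actual schema of the unified dataset may differ.

```python
import re


def explode_labels(example):
    """Turn one multi-label example into one single-label example per label."""
    return [{"text": example["text"], "label": label} for label in example["labels"]]


def drop_semst(text):
    """Remove the '#SemST' hashtag that ssec sentences carry."""
    return re.sub(r"\s*#SemST\b", "", text).strip()


# Hypothetical example record; the labels are made up for illustration
example = {
    "text": "It's hotter in Oregon then it is in California right now. #SemST",
    "labels": ["surprise", "sadness"],
}
cleaned = {**example, "text": drop_semst(example["text"])}
singles = explode_labels(cleaned)
```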

## GoEmotions
- Get rid of print when loading (low priority)
- Include cases for manual tokenizer
- Convert multiple labels to multiple examples with different labels (check with Luuk)

## Manual Tokenizer
- Check that it works for GoEmotions
- Incorporate special tokens into the Hugging Face tokenizer

## Dataloaders
- Loop `StratifiedLoader` for infinite sampling
- ~~Rewrite train script to use correct dataloaders~~
- ~~Allow `StratifiedLoader` to keep all classes (for supervised training)~~
- ~~Allow `StratifiedLoader` to map a subset of classes to an internal mapping (for meta training)~~

In [1]:
import torch

from data.utils.data_loader import StratifiedLoader, AdaptiveNKShotLoader

# Datasets
## Unified Emotion

Data source download: https://drive.google.com/file/d/1y7yjshepNRPhnh-Qz5MTRbnopGn7KzUm/view?usp=sharing
Originally from: https://github.com/sarnthil/unify-emotion-datasets


Bostan, L. A. M., & Klinger, R. (2018, August). An analysis of annotated corpora for emotion classification in text. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 2104-2119).

In [2]:
import pandas as pd
from data.unified_emotion import unified_emotion, unified_emotion_info

pd.DataFrame(unified_emotion_info())

|   | source | size | domain | classes | special |
|---|---|---|---|---|---|
| 0 | affectivetext | 250 | headlines | 6 | non-discrete, multiple labels |
| 1 | crowdflower | 40000 | tweets | 14 | includes no-emotions class |
| 2 | dailydialog | 13000 | conversations | 6 | includes no-emotions class |
| 3 | electoraltweets | 4058 | tweets | 8 | includes no-emotions class |
| 4 | emobank | 10000 | headlines | 3 | VAD regression |
| 5 | emoint | 7097 | tweets | 6 | annotated by experts |
| 6 | emotion-cause | 2414 | artificial | 6 | |
| 7 | fb-valence-arousal-anon | 2800 | facebook | 3 | VA regression |
| 8 | grounded_emotions | 2500 | tweets | 2 | |
| 9 | ssec | 4868 | tweets | 8 | multiple labels per sentence |


In [3]:
unified = unified_emotion(
    "./data/datasets/unified-dataset.jsonl",
    include=['crowdflower', 'dailydialog', 'electoraltweets', 'emoint', 'emotion-cause', 'grounded_emotions', 'ssec', 'tec'],
)

unified.prep()

In [4]:
unified.lens

{'grounded_emotions': 2585,
 'ssec': 4868,
 'crowdflower': 40000,
 'dailydialog': 102979,
 'emotion-cause': 2414,
 'tec': 21051,
 'emoint': 7102,
 'electoraltweets': 4056}

In [5]:
for k in unified.lens.keys():
    dataset = unified.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)
    labels, text, _, _ = next(trainloader)
    print(k)
    for lab, sent in zip(labels, text):
        print(lab, sent)
    print()

grounded_emotions
1 @TRobinsonNewEra @eaaknighterrant @geertwilderspvv in Los Angeles we riot when the Lakers win! Big time! Bunch o' amateurs in Holland.
0 Forever Goalsâ¡#i2i20proofâ¡Love Each Other â¡#BlackLove #blacklovers #loversandfriendsâ¦ https://t.co/XrlkpAVBpE

ssec
0 @madisonfletch_ but it's ok. at least it's only yo your hand that smells like dick #SemST
2 It's hotter in Oregon then it is in California right now. #SemST
3 @Anomaly100 I'm confident that will piss off a few white Anglo-Saxon Protestants!  Keep talking Jeb. #SemST
1 @SCOTUSblog My @TheGoodGodAbove, is it 1973 already? #SemST
6 John Watterson is attending IPPC meeting to help shape the future of international greenhouse gas emission estimation methods #SemST
4 Why is there no #CaptainPlanet movie??? Rt if u want one. #environment #planet #earth #SemST
5 The most revealing part about Hillary's released emails is that they're pretty much as boring as mine. #nofriends #SemST

crowdflower
5 @djhazzard whoa... i

## GoEmotions

In [6]:
from data.go_emotions import go_emotions

go_emotion = go_emotions(first_label_only=True)
go_emotion.prep()

go_emotion.lens

No config specified, defaulting to: go_emotions/simplified
Reusing dataset go_emotions (C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e)
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-07a134cc41feca48.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-d4dc7f7f3530d91a.arrow
Loading cached processed dataset at C:\Users\ivoon\.cache\huggingface\datasets\go_emotions\simplified\0.0.0\ef1c18ea192c771555f1e0d638889dd5f1896255782c57c6a0b934d5f94f779e\cache-91437c43c19e5bec.arrow


{'go_emotions': 36491}

In [7]:
go_emotion = go_emotions(first_label_only=False)
go_emotion.prep()

go_emotion.lens



{'go_emotions': 44208}

In [8]:
for k in go_emotion.lens.keys():
    dataset = go_emotion.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)
    labels, text, _, _ = next(trainloader)
    print(k)
    for lab, sent in zip(labels, text):
        print(lab, sent)
    print()

go_emotions
2 Bitch yes!
14 Must be a terrible swim team. The only thing that your neck beard does is accentuates your double chin.
3 So far she seems pretty decent but I'm side-eyeing the hell out of him. His conduct at the restaurant was so bizarre and creepy.
26 This is not a comedy subreddit. It is a subreddit about the unexpected.
15 >Relax ms bi weeb who leans left haha... thanks i love it
8 I wish I could get the entire interview but it's from a SiriusXM Radio Show
20 Problem is we don't have much else of value. [NAME], [NAME], [NAME]. We have to hope [NAME] likes THJ.
0 Amazing!! Congrats!
6 I can't tell if the is comfortable in that outfit
1 Lol you sweet summer child. I wish I weren't so cynical and had hope like this.
4 I’d like to add that it’s British slang. We also use ‘kiddy fiddler’.
5 Oh sorry dude, I was just looking for the WiFi password. [NAME], but I hope you stay safe
12 I kept hoping [NAME] would walk out and shame DG.
22 Can't play defence when everything is cal

# Custom Tokenizer
Here we define some rules for manually cleaning the imported data.
Since all of it is sourced from the internet, defining at least some cleaning rules is strongly recommended.
The current manual tokenizer will:
- Correct the text encodings
- Align contractions with BERT tokenizers
- Handle emojis (using the `emoji` package) and Twitter handles
- Deal with some edge cases where spaCy's tokenizer fails
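
To make the first and third rules concrete, here is a rough standard-library sketch. It is not the code of `manual_tokenizer`; the regexes and the helper names `fix_mojibake` and `normalize` are assumptions for illustration, and contraction and emoji handling are left out.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
HANDLE_RE = re.compile(r"@\w+")


def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Latin-1, if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # leave the text untouched if it does not round-trip


def normalize(text):
    """Repair encoding, unescape HTML entities, lowercase, and insert placeholders."""
    text = fix_mojibake(text)
    text = html.unescape(text).lower()   # '&amp;' -> '&'
    text = URL_RE.sub("HTTPURL", text)   # collapse links into one placeholder
    text = HANDLE_RE.sub("@USER", text)  # anonymise Twitter handles
    return " ".join(text.split())        # squeeze repeated whitespace
```

Lowercasing happens before the substitutions so that the placeholders keep their upper-case form.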

In [9]:
from transformers import AutoTokenizer, AutoModel

from data.utils.tokenizer import manual_tokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
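# NOTE: the decoded output further down still shows these as "httpurl" and "@ user",
# so this assignment alone does not appear to register them with the tokenizer (see TODO).
tokenizer.additional_special_tokens = ["HTTPURL", "@USER"]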

The raw data

In [10]:
dataset = unified.datasets['grounded_emotions']['train']
trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=8)

labels, text, _, _ = next(trainloader)
text

["She disagreed with his club choice probably:Henrik Stenson's 2-year-old daughter runs on-course to Dad mid-round https://t.co/U6h8B4Sikr",
 'This is my view on Sunday! WooHoo. https://t.co/0bLiGHNWkz',
 'The Crash caused by GOP Policies killed housing &amp; put banking in jeopardy! #Trumpism tests our Judicial System &amp; ouâ\x80¦ https://t.co/74lrq6Ko4k',
 "This moment. #morning #tagthecat laying on my chest and throat. He's #happy I'm happy. #cat @â\x80¦ https://t.co/leje0GLjGN",
 "@mikkwallace Communications is the ONLY way to work 2gether. Communication makes everyone become US.We are US  (the world) &amp; I'm not stppng",
 'This is all over my Facebook feed. Seems fair to me. @TheDemocrats @SenateGOP @HouseGOP #trumpcare https://t.co/dVPo1igjfb',
 '@dcb97 yep m totally jealous!',
 'RT @shakeology: Sunday plans: Shakeology and a stretch with Beachbody Yoga Super Yogi @EliseJoanFitnes  #Shakeology #BeachbodyYoga https://â\x80¦',
 'RT @puppymnkey: I had a premonition that the Dow 

The same sample, now manually tokenized

In [11]:
list(map(manual_tokenizer, text))

['she disagreed with his club choice probably : henrik stenson s 2 - year - old daughter runs on - course to dad mid - round HTTPURL',
 'this is my view on sunday ! woohoo . HTTPURL',
 'the crash caused by gop policies killed housing put banking in jeopardy ! # trumpism tests our judicial system ou ... HTTPURL',
 'this moment . # morning # tagthecat laying on my chest and throat . he s # happy i am happy . # cat @ ... HTTPURL',
 '@USER communications is the only way to work 2gether . communication makes everyone become us . we are us ( the world ) i am not stppng',
 'this is all over my facebook feed . seems fair to me . @USER @USER @USER # trumpcare HTTPURL',
 '@USER yep m totally jealous !',
 'rt @USER : sunday plans : shakeology and a stretch with beachbody yoga super yogi @USER # shakeology # beachbodyyoga HTTPURL ...',
 'rt @USER : i had a premonition that the dow dropped below 11 , 000 . very vivid .',
 'a # fish filled # birthday # lunch the # eating tour continues at garcias . 

The manual tokenizer slots easily into the data loading process.
Preparation does take considerably longer, though.

In [12]:
# Use below if you additionally want to limit sentences to those that overlap well with BERT
# Not recommended for initial training 
#unified.prep(text_tokenizer=manual_tokenizer, text_tokenizer_kwargs={'bert_vocab': tokenizer.vocab.keys(), 'OOV_cutoff' :0.5, 'verbose':True})

unified.prep(text_tokenizer=manual_tokenizer)

Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
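
The commented-out `OOV_cutoff` option limits sentences to those that overlap well with the BERT vocabulary. How `manual_tokenizer` applies the cutoff is not shown here, so the sketch below is only one plausible reading: drop a sentence when the share of its tokens missing from the vocabulary exceeds the cutoff. The helper names and the toy vocabulary are hypothetical.

```python
def oov_fraction(tokens, vocab):
    """Share of tokens that are missing from the vocabulary."""
    if not tokens:
        return 1.0  # treat empty sentences as fully out-of-vocabulary
    return sum(tok not in vocab for tok in tokens) / len(tokens)


def keep_sentence(sentence, vocab, oov_cutoff=0.5):
    """Keep a whitespace-tokenized sentence if its OOV share is at most the cutoff."""
    return oov_fraction(sentence.split(), vocab) <= oov_cutoff


# Toy vocabulary standing in for tokenizer.vocab.keys()
toy_vocab = {"this", "is", "my", "view", "on", "sunday", "!"}
kept = keep_sentence("this is my view", toy_vocab)
dropped = not keep_sentence("woohoo stppng 2gether is", toy_vocab)
```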


In [13]:
print('\nExample data')
for k in unified.lens.keys():
    dataset = unified.datasets[k]['train']
    trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=1)

    labels, text, _, _ = next(trainloader)
    print(k)

    label_map = {v: k for k, v in unified.label_map[k].items()}
    tokenized_texts = list(map(tokenizer.decode, tokenizer(text)['input_ids']))
    for txt, label in zip(tokenized_texts, labels):
        print(label_map[label], txt)
    print()


Example data
grounded_emotions
sadness [CLS] rt @ user : it s rafa timeeee!!!! vamos rafa, you got thisss!!!! : red _ heart : : red _ heart : : red _ heart : : flexed _ biceps : : flexed _ biceps : httpurl [SEP]
joy [CLS] rt @ user : glenn, you know i love you. but the joke was how easy trump can get away with being a lying conspiracy theorist. maybe not... [SEP]

ssec
anger [CLS] @ user i don't trust you performing your " science - based medicine " on children. i don't think it s science, or why reiterate? # semst [SEP]
fear [CLS] the only esteem that will not abandon us is the esteem given to us by jesus. ~ scott sauls @ user # esteem # semst [SEP]
joy [CLS] if i ever meet hillary clinton i will have died and then come back to life b / c i will be so happy # idol # semst [SEP]
disgust [CLS] she has a brain, a heart, and her own unique dna, not her mother s. she is alive and human. please don't kill her. # semst [SEP]
trust [CLS] lk 6 : 37 kjv judge not, and ye shall not be judged : 

In [14]:
#go_emotion.prep(text_tokenizer=manual_tokenizer)

# Sampling
## Dataset sampling

In [15]:
from data.utils.sampling import dataset_sampler

source_name = dataset_sampler(unified, sampling_method='sqrt')
source_name

'dailydialog'
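
A sketch of what `sampling_method='sqrt'` could mean, assuming each dataset is drawn with probability proportional to the square root of its size. That reading, and the `sample_dataset` helper with its other methods, are assumptions and not taken from `dataset_sampler`.

```python
import math
import random


def sample_dataset(lens, method="sqrt", rng=random):
    """Pick one dataset name, weighted by a function of its size."""
    names = list(lens)
    if method == "sqrt":
        weights = [math.sqrt(lens[name]) for name in names]
    elif method == "proportional":
        weights = [float(lens[name]) for name in names]
    elif method == "uniform":
        weights = [1.0] * len(names)
    else:
        raise ValueError(f"unknown sampling method: {method}")
    return rng.choices(names, weights=weights, k=1)[0]


lens = {"grounded_emotions": 2585, "ssec": 4868, "crowdflower": 40000, "dailydialog": 102979}
source_name = sample_dataset(lens, method="sqrt", rng=random.Random(0))
```

Under this weighting, `dailydialog` is roughly 40 times larger than `grounded_emotions` but only about 6 times as likely to be drawn, which keeps the small datasets in play.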

## Dataloaders
The dataloaders have changed somewhat since the previous version.

Dataloaders must now be created manually from a specific dataset (a dict with labels as keys and lists of examples as values).

Each call to `next` samples from the data and returns **both** the support and the query set.

Thus,

IN: dataset

OUT: support labels, support text, query labels, query text

If a Hugging Face tokenizer is passed, the text is returned as full model input (input ids, attention masks, token type ids, etc.)

It can then be fed into the model as,

```
model(**text)
```

### Stratified Sampling
Traditional N-way k-shot, balanced across classes.

Requires manually specifying k, the number of examples per class, so a support batch holds N × k examples.
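
A minimal sketch of balanced N-way k-shot sampling over a dataset in the dict format described above. It is not the `StratifiedLoader` implementation; the `stratified_batch` helper and its handling of small classes are assumptions.

```python
import random
from collections import Counter


def stratified_batch(dataset, k, rng=random):
    """dataset: dict label -> list of examples. Returns support/query labels and text."""
    support_labels, support_text, query_labels, query_text = [], [], [], []
    for label, examples in dataset.items():
        # Draw up to 2k examples without replacement, then split them in two
        picked = rng.sample(examples, min(2 * k, len(examples)))
        support, query = picked[:k], picked[k:]
        support_labels += [label] * len(support)
        support_text += support
        query_labels += [label] * len(query)
        query_text += query
    return support_labels, support_text, query_labels, query_text


toy = {0: [f"a{i}" for i in range(10)], 1: [f"b{i}" for i in range(3)]}
s_labels, s_text, q_labels, q_text = stratified_batch(toy, k=2, rng=random.Random(0))
support_counts, query_counts = Counter(s_labels), Counter(q_labels)
```

In this sketch a class with fewer than 2k examples yields a shorter query set for that class, while the support set stays balanced.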

In [16]:
from collections import Counter

from data.utils.data_loader import StratifiedLoader, AdaptiveNKShotLoader

In [17]:
dataset = unified.datasets['ssec']['train']
trainloader = StratifiedLoader(dataset=dataset, device=torch.device('cpu'), k=16)
support_labels, support_text, query_labels, query_text = next(trainloader)

In [18]:
Counter(support_labels)

Counter({0: 16, 2: 16, 3: 16, 1: 16, 6: 16, 4: 16, 5: 16})

In [19]:
Counter(query_labels)

Counter({0: 16, 2: 16, 3: 16, 1: 16, 6: 16, 4: 16, 5: 15})

In [20]:
#while True:
#    next(trainloader)

### Adaptive N-way k-shot
Dataloader with adaptive/stochastic N-way, k-shot batches.

The support set has a random number of examples per class, roughly proportional to class size.

The query set is always balanced.

If the dataset has more than 5 classes, a batch may contain only a subset of them.

Algorithm taken from:

    Triantafillou et al. (2019). Meta-dataset: A dataset of datasets for learning to learn from few examples. arXiv preprint arXiv:1903.03096.

Steps are as follows:

1. Sample a subset of classes (min 5, max all classes)
2. Define the query set size (max 10 per class)
3. Define the support set size (max 128 in total)
4. Fill the support set with samples, stochastically proportional to class size
5. Fill the query set with remaining samples
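
The steps above can be sketched roughly as follows. This follows the general recipe of the cited paper as I understand it and is not the `AdaptiveNKShotLoader` code: the function name, the noise range in step 4, the per-class cap in step 3, and the way step 5 splits the drawn examples are all assumptions.

```python
import math
import random


def adaptive_episode(dataset, min_ways=5, max_query=10, max_support=128, rng=random):
    """dataset: dict label -> list of examples. Returns (support, query) dicts."""
    labels = list(dataset)
    # 1. Sample a subset of classes (at least min_ways, at most all of them)
    n_ways = rng.randint(min(min_ways, len(labels)), len(labels))
    classes = rng.sample(labels, n_ways)
    sizes = {c: len(dataset[c]) for c in classes}

    # 2. Query size per class: at most max_query and at most half the smallest class
    n_query = min(max_query, min(sizes.values()) // 2)

    # 3. Total support size: a random fraction of what is left, capped at max_support
    beta = rng.uniform(0.0, 1.0)
    total = sum(math.ceil(beta * min(max_support, sizes[c] - n_query)) for c in classes)
    support_size = min(max_support, total)

    # 4. Shots per class: class size times random noise, normalized over classes
    noisy = {c: sizes[c] * math.exp(rng.uniform(math.log(0.5), math.log(2.0))) for c in classes}
    norm = sum(noisy.values())
    support, query = {}, {}
    for c in classes:
        shots = math.floor(noisy[c] / norm * max(support_size - n_ways, 0)) + 1
        shots = min(shots, sizes[c] - n_query)
        picked = rng.sample(dataset[c], shots + n_query)
        # 5. Split the drawn examples into disjoint support and query sets
        support[c], query[c] = picked[:shots], picked[shots:]
    return support, query


toy_sizes = {0: 200, 1: 40, 2: 60, 3: 120, 4: 30, 5: 25, 6: 35}
toy = {c: [f"{c}-{i}" for i in range(n)] for c, n in toy_sizes.items()}
support, query = adaptive_episode(toy, max_support=64, rng=random.Random(0))
```

Every sampled class receives at least one support example, and the query set has the same number of examples for each sampled class.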


In [21]:
dataset = unified.datasets['ssec']['train']
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64)
support_labels, support_text, query_labels, query_text = next(trainloader)

In [22]:
Counter(support_labels)

Counter({3: 1, 0: 34, 1: 2, 4: 1, 2: 6})

In [23]:
Counter(query_labels)

Counter({3: 10, 0: 10, 1: 10, 4: 10, 2: 10})

Set `subset_classes=False` to retain all classes

In [24]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, subset_classes=False)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(Counter(support_labels), len(set(support_labels)))

Counter({0: 34, 3: 15, 2: 4, 1: 3, 4: 2, 5: 1, 6: 1}) 7
Counter({0: 28, 3: 17, 2: 6, 1: 5, 6: 2, 4: 1, 5: 1}) 7
Counter({0: 37, 3: 14, 2: 3, 1: 3, 6: 2, 4: 1, 5: 1}) 7
Counter({0: 46, 3: 6, 2: 3, 4: 2, 1: 2, 5: 1, 6: 1}) 7
Counter({0: 32, 3: 11, 2: 6, 1: 4, 4: 3, 6: 3, 5: 1}) 7
Counter({0: 33, 3: 14, 1: 5, 2: 4, 4: 2, 6: 1, 5: 1}) 7
Counter({0: 25, 3: 16, 2: 6, 1: 5, 4: 3, 6: 2, 5: 1}) 7
Counter({0: 36, 3: 11, 2: 5, 6: 3, 4: 3, 1: 3, 5: 1}) 7
Counter({0: 27, 3: 14, 2: 10, 1: 4, 5: 2, 4: 2, 6: 1}) 7
Counter({0: 23, 3: 23, 2: 4, 1: 3, 6: 3, 4: 2, 5: 2}) 7


In [25]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, subset_classes=True)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(Counter(support_labels), len(set(support_labels)))

Counter({0: 41, 3: 11, 2: 5, 1: 2, 5: 1, 4: 1}) 6
Counter({3: 26, 0: 19, 2: 7, 1: 6, 5: 2, 4: 2}) 6
Counter({0: 11, 2: 6, 3: 1, 1: 1}) 4
Counter({0: 12, 3: 9, 2: 3, 1: 3, 4: 1}) 5
Counter({2: 23, 0: 17, 1: 7, 3: 1, 4: 1}) 5
Counter({0: 36, 3: 11, 2: 8, 1: 4, 5: 1, 4: 1}) 6
Counter({0: 37, 1: 19, 2: 7}) 3
Counter({2: 19, 0: 18, 1: 3, 3: 1}) 4
Counter({0: 38, 3: 9, 2: 7, 1: 5, 5: 1, 4: 1}) 6
Counter({0: 47, 2: 8, 1: 4, 4: 1, 3: 1}) 5


Set `temp_map=False` to keep the dataset's original label ids.

With `temp_map=True`, the sampled classes are re-mapped to contiguous ids, which is needed to generate one-hot vectors for loss computation.
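
A small illustration of why the re-mapping matters. The helper names are hypothetical, and sorting the labels before assigning temporary ids is an assumption about how the loader orders them.

```python
def build_temp_map(sampled_labels):
    """Map the dataset's label ids onto contiguous ids 0..N-1 for one batch."""
    return {label: idx for idx, label in enumerate(sorted(set(sampled_labels)))}


def one_hot(index, n_classes):
    """One-hot vector of length n_classes with a 1 at position index."""
    return [1 if i == index else 0 for i in range(n_classes)]


batch_labels = [1, 5, 6, 5, 1]                         # original ids, as with temp_map=False
temp_map = build_temp_map(batch_labels)                # {1: 0, 5: 1, 6: 2}
mapped = [temp_map[label] for label in batch_labels]   # [0, 1, 2, 1, 0]
targets = [one_hot(m, len(temp_map)) for m in mapped]  # 3-dimensional one-hot rows
```

Without the mapping, a batch containing only classes 1, 5 and 6 would need 7-dimensional targets even though the episode is 3-way.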

In [26]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, temp_map=True)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(sorted(Counter(support_labels).keys()))

[0, 1, 2]
[0, 1, 2]
[0, 1, 2, 3, 4, 5]
[0, 1, 2, 3, 4]
[0, 1, 2, 3]
[0, 1, 2, 3]
[0, 1, 2, 3, 4]
[0, 1, 2]
[0, 1, 2, 3, 4]
[0, 1, 2, 3]


In [27]:
trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cpu'), max_support_size=64, temp_map=False)

for i in range(10):
    support_labels, support_text, query_labels, query_text = next(trainloader)
    print(sorted(Counter(support_labels).keys()))

[1, 2, 4, 5, 6]
[0, 1, 2]
[0, 1, 2, 3, 5]
[1, 2, 3, 4, 6]
[1, 3, 5]
[1, 2, 3, 4, 5, 6]
[0, 1, 3, 4, 5, 6]
[0, 1, 2, 3, 4, 6]
[0, 1, 3, 4, 5, 6]
[1, 5, 6]


In [46]:
dataset = unified.datasets['dailydialog']['train']

trainloader = AdaptiveNKShotLoader(dataset=dataset, device=torch.device('cuda'), tokenizer=tokenizer, max_support_size=64, temp_map=True)
for i in range(1000):
    batch = next(trainloader)
    support_labels, support_text, query_labels, query_text = batch

{'input_ids': tensor([[ 101, 2054,  999,  ...,    0,    0,    0],
        [ 101, 2428, 1029,  ...,    0,    0,    0],
        [ 101, 2025, 2172,  ...,    0,    0,    0],
        ...,
        [ 101, 2339, 2064,  ...,    0,    0,    0],
        [ 101, 2092, 1010,  ...,    0,    0,    0],
        [ 101, 3524, 1010,  ...,    0,    0,    0]], device='cuda:0'), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0'), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]], device='cuda:0')}