# Data Pre-processing and Dataloaders

In [1]:
# Data source download: https://drive.google.com/file/d/1y7yjshepNRPhnh-Qz5MTRbnopGn7KzUm/view?usp=sharing
# Originally from: https://github.com/sarnthil/unify-emotion-datasets

In [2]:
from data.unified_emotion import unified_emotion

unified = unified_emotion("./data/datasets/unified-dataset.jsonl")

unified.prep()
#unified.prep(text_tokenizer=manual_tokenizer)

In [3]:
unified.lens

{'grounded_emotions': 2585,
 'crowdflower': 40000,
 'dailydialog': 102979,
 'tales-emotion': 14771,
 'tec': 21051,
 'emoint': 7102}

In [4]:
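# Draw one batch from every source dataset and print its raw text.
# The loop variable is called `name` so it does not clash with the `k` keyword argument.
for name in unified.lens:
    trainloader, testloader = unified.get_dataloader(name, k=1)
    _, text = next(trainloader)
    print(name)
    print(text)
    print()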

grounded_emotions
['@NBCNewYork WTF?? SERIOUSLY?!?!', '@assforDLS na my niggas @BlizzardStorm27  and @Smoke_EvryDay got me beat']

crowdflower
["@Puddynface2 Don't know yet  Lemme know if you come up with something though.", "Wee laddie's been SO upset for about 2 hours. Tried soothing him in bed, nursing, etc. Nope. Up at 3:30am for real food. Blue Clues now.", '@themanwhofell compliment taken. Thanks. Key is to be yourself', "gosh it's anoher cloudy day  wish they would go away.. or rain..", '@turhangross Wow! Some person u are!', "TIRED! goodnight twitter  its mother's day  happy mother's day  lov my moomy &lt;3 yayy! God Bless.", "@Zaraa_x ah that's annoying", '#todo Cleaning the Apartment - again - who keeps making this mess? oh yeah .. me. $10 + hug for the person to help come clean']

dailydialog
["Everything seems to be getting worse . I don't know what to do with it .", 'I hope so . And I will definitely tell you if I can not .', "sounds good , and I don't have to queue up at 

In [5]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

trainloader, testloader = unified.get_dataloader('grounded_emotions', tokenizer=tokenizer, shuffle=True)
next(trainloader)

(tensor([1, 1, 1, 1, 0, 0, 0, 0]),
 tensor([[  101, 19387,  1030, 23647,  6873, 26677, 10288,  2319,  1024,  1030,
          10819, 18098,  6610, 26876,  1030,  8962,  2271,  1030, 11271, 22083,
          16257, 10695,  1030,  4702, 16558, 19253, 10262,  2017,  1005,  2128,
           2006,  8398,  2015,  2572, 11283,  2078,  2933,  1029,  2129,  2079,
           2017,  4682,  1004, 23713,  1025,  3637,  2012,  2305,  1029,  1029,
           1029,   102],
         [  101, 19387,  1030,  8902,  5302, 18752, 16150, 18891,  2015,  1024,
           1045,  2106,  7773,  2197,  5353,  1004, 23713,  1025,  2057,  1005,
           2128,  7079,  2676,  1003,  2164,  4171,  2006,  2026,  3394,  2510,
           3477,  1012,  1030,  2613,  5280, 19058, 24456,  2361,  2081,  1002,
           5018,  2213,  1004, 23713,  1025,  2006,  2721, 29649,   102,     0,
              0,     0],
         [  101,  2317,  2160, 26536, 17125,  2740,  2729,  1010,  7318,  2696,
           2361,  4366,  1011, 1322

Sample a source dataset, using the square root of each dataset's size as its sampling weight.

In [6]:
from data.utils.sampling import dataset_sampler

source_name = dataset_sampler(unified, sampling_method='sqrt')
source_name

'crowdflower'
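
The internals of `dataset_sampler` are not shown in this notebook. As a rough sketch of what square-root weighting means, the snippet below turns the sizes from `unified.lens` into sampling probabilities proportional to the square root of each size and draws one name. The helpers `sqrt_weights` and `sample_source` are hypothetical names introduced for illustration, not part of the repository.

```python
import math
import random

# Dataset sizes copied from the `unified.lens` output above.
lens = {
    'grounded_emotions': 2585,
    'crowdflower': 40000,
    'dailydialog': 102979,
    'tales-emotion': 14771,
    'tec': 21051,
    'emoint': 7102,
}


def sqrt_weights(sizes):
    """Return one probability per dataset, proportional to sqrt(size). (Hypothetical helper.)"""
    roots = {name: math.sqrt(n) for name, n in sizes.items()}
    total = sum(roots.values())
    return {name: root / total for name, root in roots.items()}


def sample_source(sizes, rng=random):
    """Draw one dataset name according to the square-root weights. (Hypothetical helper.)"""
    weights = sqrt_weights(sizes)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


probs = sqrt_weights(lens)
source_name = sample_source(lens)
```

Compared with sampling in proportion to raw size, the square root flattens the distribution: the largest dataset is still the most likely draw, but it is picked less often than its share of the total data would suggest.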

The loader raises StopIteration when there is not enough data left to generate a full N x K-shot batch.
This is done to avoid overfitting on small datasets.

In [8]:
while True:
    next(testloader)

StopIteration: Some classes ran out of data.
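
To make this stopping behaviour concrete, here is a minimal sketch of an iterator that serves `k` unseen examples per class and stops as soon as any class cannot fill its quota. `KShotLoader` is a hypothetical stand-in written for this illustration; the real loader returned by `get_dataloader` may be implemented differently.

```python
import random
from collections import defaultdict


class KShotLoader:
    """Hypothetical iterator yielding k unseen examples per class in each batch."""

    def __init__(self, examples, k, seed=0):
        # `examples` is a list of (label, text) pairs.
        self.k = k
        self.pools = defaultdict(list)
        for label, text in examples:
            self.pools[label].append(text)
        rng = random.Random(seed)
        for pool in self.pools.values():
            rng.shuffle(pool)

    def __iter__(self):
        return self

    def __next__(self):
        # Stop as soon as any class cannot supply k fresh examples,
        # rather than recycling examples that were already served.
        if any(len(pool) < self.k for pool in self.pools.values()):
            raise StopIteration("Some classes ran out of data.")
        labels, texts = [], []
        for label, pool in self.pools.items():
            for _ in range(self.k):
                labels.append(label)
                texts.append(pool.pop())
        return labels, texts


# Toy data: class 0 has five examples, class 1 has only three.
toy = [(0, f"neg {i}") for i in range(5)] + [(1, f"pos {i}") for i in range(3)]
loader = KShotLoader(toy, k=2)
batches = list(loader)  # list() consumes the iterator until StopIteration
```

With `k=2`, the toy loader yields a single batch: after it, class 1 has one example left, so a second balanced batch is impossible even though class 0 still has data.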

# Custom Tokenizer
Here we define some rules for manually cleaning the imported data.
Since all of this data is sourced from the internet, defining at least some cleaning rules is strongly recommended.
The current manual tokenizer will:
- Correct the text encodings
- Align contractions with BERT tokenizers
- Handle emojis (using the emoji package) and Twitter handles
- Deal with some edge cases where spaCy's tokenizer fails
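
As an illustration of the kind of rules involved, the sketch below covers only a small part of the list: it masks Twitter handles and URLs, splits punctuation into separate tokens, and lowercases the rest. It is not the repository's `manual_tokenizer`; `simple_clean` and its regular expressions are assumptions made for this example, and encoding repair, contractions and emojis are left out.

```python
import re

USER_RE = re.compile(r"@\w+")          # Twitter handles such as @someone
URL_RE = re.compile(r"https?://\S+")   # http and https links
# Keep the placeholders and "..." whole; otherwise split into words and single punctuation marks.
TOKEN_RE = re.compile(r"@USER|HTTPURL|\.\.\.|\w+|[^\w\s]")

PLACEHOLDERS = ("@USER", "HTTPURL")


def simple_clean(text):
    """Mask handles and URLs, separate punctuation, and lowercase. (Hypothetical helper.)"""
    text = URL_RE.sub("HTTPURL", text)  # URLs first, so an '@' inside a link is not masked
    text = USER_RE.sub("@USER", text)
    tokens = TOKEN_RE.findall(text)
    return " ".join(tok if tok in PLACEHOLDERS else tok.lower() for tok in tokens)


cleaned = simple_clean("Damn I get in my car n I got a full tank!!!")
```

Placeholders stay uppercase so that they remain easy to tell apart from ordinary lowercased words.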

In [1]:
from transformers import AutoTokenizer

from data.unified_emotion import unified_emotion
from data.utils.tokenizer import manual_tokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

In [4]:
unified = unified_emotion("./data/datasets/unified-dataset.jsonl")

unified.prep()

The raw data

In [5]:
trainloader, testloader = unified.get_dataloader('tec', shuffle=True)
labels, text = next(trainloader)
text

['RT, follow @unitednude and WIN one of the 5 special collector items that we are giving away to our Twitter friends! #RT #WIN',
 'I only started listening to prince this afternoon, but as a new fan its been a pretty great day.',
 'Damn I get in my car n I got a full tank!!!',
 'WEEKEND! En het beloofd een goeie te worden',
 '@DWKM The mechanic was going to let me take it away... but then he returned, broken brake in hand.',
 'Also, someone took down the Swanson Pyramid of Greatness that was in the lab.',
 "Due giorni fa c'era paul mccartney qui a milano e io non sono potuta andarlo a vedere",
 'work,work,work,work, fucking job',
 '@MisterJayEllBee I bet! Yeah not bad thanks, just about to do the daily commute',
 'Family time tomorrow...',
 'Creo q despues de 8 hras corridas leyendo/memorizando shit lo unico q entiendo es blahblahblah... On to la sentencia sumaria then',
 'One of the nicest things about not working at the library anymore',
 'The 9',
 'That feeling when you know your sl

The same sample, now manually tokenized

In [6]:
list(map(manual_tokenizer, text))

['rt , follow @USER and win one of the 5 special collector items that we are giving away to our twitter friends ! # rt # win',
 'i only started listening to prince this afternoon , but as a new fan its been a pretty great day .',
 'damn i get in my car n i got a full tank ! ! !',
 'weekend ! en het beloofd een goeie te worden',
 '@USER the mechanic was going to let me take it away ... but then he returned , broken brake in hand .',
 'also , someone took down the swanson pyramid of greatness that was in the lab .',
 "due giorni fa c ' era paul mccartney qui a milano e io non sono potuta andarlo a vedere",
 'work , work , work , work , fucking job',
 '@USER i bet ! yeah not bad thanks , just about to do the daily commute',
 'family time tomorrow ...',
 'creo q despues de 8 hras corridas leyendo / memorizando shit lo unico q entiendo es blahblahblah ... on to la sentencia sumaria then',
 'one of the nicest things about not working at the library anymore',
 'the 9',
 'that feeling when you

The manual tokenizer can easily be slotted into the data loading process.
It does take quite a lot longer, though...

In [8]:
# Use the call below if you additionally want to limit sentences to those that overlap well with BERT's vocabulary.
# Not recommended for initial training.
#unified.prep(text_tokenizer=manual_tokenizer, text_tokenizer_kwargs={'bert_vocab': tokenizer.vocab.keys(), 'OOV_cutoff': 0.5, 'verbose': True})

unified.prep(text_tokenizer=manual_tokenizer)

Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
Removed sentence for bad encoding.
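
The commented-out `bert_vocab` / `OOV_cutoff` option in the cell above limits sentences to those that overlap well with BERT. How `manual_tokenizer` applies the cutoff is not shown in this notebook. One plausible reading, sketched here purely as an assumption, is that a sentence is dropped when the fraction of its tokens missing from the vocabulary exceeds the cutoff. The helpers `oov_ratio` and `keep_sentence` and the tiny `toy_vocab` are hypothetical.

```python
def oov_ratio(sentence, vocab):
    """Fraction of whitespace-separated tokens not found in `vocab`. (Hypothetical helper.)"""
    tokens = sentence.split()
    if not tokens:
        return 0.0
    missing = sum(1 for tok in tokens if tok not in vocab)
    return missing / len(tokens)


def keep_sentence(sentence, vocab, oov_cutoff=0.5):
    """Keep a sentence only if its out-of-vocabulary ratio is at most `oov_cutoff`. (Assumed rule.)"""
    return oov_ratio(sentence, vocab) <= oov_cutoff


# Toy vocabulary standing in for `tokenizer.vocab.keys()`.
toy_vocab = {"family", "time", "tomorrow", "...", "the", "9"}

kept = keep_sentence("family time tomorrow ...", toy_vocab)
dropped = not keep_sentence("weekend ! en het beloofd een goeie te worden", toy_vocab)
```

A filter of this kind would tend to discard the non-English tweets visible in the raw `tec` sample, since few of their tokens appear in an English vocabulary.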


In [9]:
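# Draw one batch from every source dataset and print its tokenized text.
# The loop variable is called `name` so it does not clash with the `k` keyword argument.
for name in unified.lens:
    trainloader, testloader = unified.get_dataloader(name, k=3)
    _, text = next(trainloader)
    print(name)
    print(text)
    print()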

grounded_emotions
['rt @USER : .@USER that s true , but luckily @USER stayed locked in like a laser on it ! other journalists are dedicated to # trum ...', 'rt @USER : wow . last hour news :- rus agents charged in yahoo hack - comey to testify on rus - sessions refutes wiretap claim - ryan shif ...', 'rt @USER : and who could forget " no pipelines without us steel , " " insurance for everybody " " i will be too busy for vacations gol ...', '@USER so sweet', 'rt @USER : thank you @USER ! i believe we can do this . HTTPURL HTTPURL', '# share your # happyhour # hotspot !']

crowdflower
['@USER mayyyybe', 'HTTPURL : domain umzug und neues design HTTPURL ...', '@USER should not you know your national holidays ?', 'sitting watching britain s got f all talent , but have watched a small girl cry and it s sad', 'is tired of summer already', 'last day working for tend', 'wondering why i am awake at 7am , writing a new song , plotting my evil secret plots muahahaha ... oh damn it , not secret any