## This notebook demonstrates how to generate pseudo out-of-distribution (OOD) samples (&sect; 3.3).

In [26]:
import nlpaug.augmenter.word as naw
from nltk.corpus import stopwords
from datasets import load_dataset
import pandas as pd
import warnings
import random
import torch

warnings.filterwarnings('ignore')
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
random.seed(24)

<b>VERSION</b>: <br><br>
<li>nltk 3.5</li>
<li>pandas 1.3.4</li>
<li>datasets 1.18.4</li>
<li>torch 1.9.0+cu111</li>

## Helper Functions

In [41]:
def load_sst2():
    datasets = load_dataset('glue', 'sst2', cache_dir="./cache/sst2/")
    training_set = datasets['train']
    dev_set = datasets['validation']
    test_set = datasets['test']
    return training_set, dev_set, test_set

def highlight(x):
    # Highlight illustrative pseudo OOD examples alongside their corresponding ID samples.
    # Returns one CSS string per column, in order: Row, ID Examples, Pseudo OODs.
    if x.Row in [0, 3, 9, 10, 11, 15, 19]:
        return ['', 'background-color: lightsteelblue', 'background-color: bisque']
    else:
        return ['']*3

## Load SST2 Dataset

In [28]:
training_set, dev_set, test_set = load_sst2()

Reusing dataset glue (./cache/sst2/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)



## Generate Pseudo OOD Samples

In [29]:
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Please refer to https://github.com/makcedward/nlpaug for the detailed documentation. 
# stopwords.words('english') needs the NLTK stopwords corpus (nltk.download('stopwords')).
generator = naw.ContextualWordEmbsAug(
    model_path='distilbert-base-uncased',  # We use DistilBERT as the generator.
    action='substitute',
    aug_p=0.7,  # We set the replacement ratio to 0.7.
    top_k=100,  # The candidate size is set to 100.
    stopwords=stopwords.words('english'),  # We do not substitute stopwords.
    device=str(device),
)

# We randomly sample 20 examples from the development set for demonstration purposes.
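indices = list(range(dev_set.num_rows))
random.shuffle(indices)
sentences = dev_set['sentence']  # Read the column once rather than once per index.
ID_samples = [sentences[i] for i in indices[0:20]]

pseudo_oods = []
for id_sample in ID_samples:
    augmented = generator.augment(id_sample)
    # Newer nlpaug releases may return a list even for a single input string.
    pseudo_oods.append(augmented[0] if isinstance(augmented, list) else augmented)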
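
To make the roles of the replacement ratio and the stopword filter concrete, here is a minimal, library-free sketch of how positions might be chosen for substitution. It is not nlpaug's implementation: `pick_positions`, the tiny stopword set, and the rule for turning the ratio into a count are all assumptions made for illustration, and nlpaug applies its own rules and limits.

```python
import random

def pick_positions(tokens, stopword_set, aug_p=0.7, seed=24):
    """Pick token indices to replace (hypothetical helper, not nlpaug's code)."""
    rng = random.Random(seed)
    # Eligible tokens: not a stopword and containing at least one letter or digit.
    eligible = [
        i for i, tok in enumerate(tokens)
        if tok.lower() not in stopword_set and any(ch.isalnum() for ch in tok)
    ]
    if not eligible:
        return []
    # Assumed rule: replace a fixed fraction of the eligible tokens, at least one.
    n_replace = max(1, int(aug_p * len(eligible)))
    return sorted(rng.sample(eligible, n_replace))

tokens = "teen movies have really hit the skids .".split()
positions = pick_positions(tokens, stopword_set={"have", "the"})
print([tokens[i] for i in positions])  # three of the five eligible content words
```

In the real pipeline, each selected position would be masked and filled with one of the masked language model's top-k candidates; this sketch only covers the selection step.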

## ID Samples vs. Pseudo OOD Examples

In [42]:
id_ood_df = pd.DataFrame()
id_ood_df['ID Examples'] = ID_samples
id_ood_df['Pseudo OODs'] = pseudo_oods
id_ood_df.insert(loc=0, column='Row', value=list(range(len(ID_samples))))
id_ood_df.style.apply(highlight, axis=1).set_properties(**{'text-align': 'left'})

| Row | ID Examples | Pseudo OODs |
|:--|:--|:--|
| 0 | the talented and clever robert rodriguez perhaps put a little too much heart into his first film and did n't reserve enough for his second . | the talented and courageous robert blake thus incorporated a little too big effort into his final match and did don't reserve enough for his popularity. |
| 1 | his comedy premises are often hackneyed or just plain crude , calculated to provoke shocked laughter , without following up on a deeper level . | his broadway performances are invariably witty or just deliberately crude, calculated to inspire shocked audiences, without catching up on a satirical tone. |
| 2 | in exactly 89 minutes , most of which passed as slowly as if i 'd been sitting naked on an igloo , formula 51 sank from quirky to jerky to utter turkey . | in precisely fifteen minutes, most of which moved as slowly as if i'd been lying naked on an igloo, the 51 transitioned from neutral to distant to complete disbelief. |
| 3 | teen movies have really hit the skids . | twitter profiles have strongly intrigued the fan. |
| 4 | ... a boring parade of talking heads and technical gibberish that will do little to advance the linux cause . | ... a scientific search of african skeletons and ancient artefacts that will do something to modify the global environment. |
| 5 | atom egoyan has conjured up a multilayered work that tackles any number of fascinating issues | the magazine has brought up a fictitious world that reveals any kinds of terrible things |
| 6 | not an objectionable or dull film ; it merely lacks everything except good intentions . | not an exaggerated or offensive notion ; it rarely proved adequate save basic humour. |
| 7 | they should have called it gutterball . | they should have gone it hot. |
| 8 | its well of thorn and vinegar ( and simple humanity ) has long been plundered by similar works featuring the insight and punch this picture so conspicuously lacks . | its image of thorn and berries ( and its outline ) has long been replaced by comic artworks featuring the snake and punch this beast so conspicuously elusive. |
| 9 | a compelling spanish film about the withering effects of jealousy in the life of a young monarch whose sexual passion for her husband becomes an obsession . | a biographical drama film about the withering tide of repression in the childhood of a young prince whose sexual pursuit for her throne proves an obstacle. |


<br><b style="font-size:24px;">Background shifts:</b><br><br>
<li>Row 0: film &rarr; match</li>
<li>Row 3: teen movies &rarr; twitter profiles</li>
<li>Row 9: spanish film &rarr; biographical drama</li>
<li>Row 10: science-fiction &rarr; thriller</li>
<li>Row 11: film &rarr; musical</li>
<li>Row 15: movies &rarr; aquarium</li>
<li>Row 19: french school life &rarr; english school education</li>
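
The background shifts listed above were picked out by reading the pairs. As a rough aid for spotting such shifts, the sketch below diffs an ID sentence against its pseudo OOD counterpart at the whitespace-token level using Python's standard `difflib`. The helper name `changed_spans` is an assumption introduced here, and the approach is only a heuristic: it reports every substituted span, not just the ones that change the topic.

```python
import difflib

def changed_spans(id_text, ood_text):
    """Return (original, replacement) token spans that differ between two sentences."""
    a = id_text.split()
    b = ood_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    spans = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Join the differing tokens back into readable phrases.
            spans.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return spans

# Row 3 from the table above.
spans = changed_spans(
    "teen movies have really hit the skids .",
    "twitter profiles have strongly intrigued the fan.",
)
print(spans[0])  # ('teen movies', 'twitter profiles')
```

Because the generated text has punctuation attached to the preceding word (`fan.`) while the SST-2 input keeps it as a separate token (`skids .`), trailing punctuation shows up inside the last differing span. Normalizing punctuation before diffing would give cleaner spans.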