### Perturbation Algorithm
First, we determine the share of labels that we would like to perturb. Then, we generate a random number between 0 and 1 for each entity label. Finally, if the random number is less than the specified perturbed share, we replace the original label with a random label from the set of all possible labels. Otherwise, we keep the original label.


Modification: Replace original label with a random label from the set of all occurring labels.

Question: Are we okay with replacing "B-..." with "I-..." and "O" with an actual label (and the other way around)? That is, how much noise do we actually want to introduce --> do we want to simulate wrong labels (i.e., stick to the correct way of generating labels but just misclassify OR do we want to just introduce straight up randomness, ignoring labeling rules etc)

Also, do we take into account the label distribution? Otherwise, when resampling, labels will be perturbed with "O" much less frequently 

In [19]:
import pandas as pd
import numpy as np
import random

In [301]:
# do exemplarily for covidnews-ner dataset

# !!! need to read in with converters to get correct data type since list is stored as string in csv
data_train = pd.read_csv("../data/COVIDNEWS/data_train.csv", index_col=0, converters={'sequence_tok': pd.eval, 'ner_BIO_full': pd.eval})
bio_labels = open('../data/COVIDNEWS/COVIDNEWS_CONTROSTER/types.txt', 'r').readlines()

In [302]:
data_train.head()

Unnamed: 0,sequences,labels,sequence_tok,ner_BIO_full
0,Jakarta ( ANTARA ) - as many as 419 confirmed ...,B-location O B-organization O O O O O B-person...,"[Jakarta, (, ANTARA, ), -, as, many, as, 419, ...","[B-location, O, B-organization, O, O, O, O, O,..."
1,Australian Associated Press Qld lashes NSW gov...,B-organization I-organization I-organization O...,"[Australian, Associated, Press, Qld, lashes, N...","[B-organization, I-organization, I-organizatio..."
2,The city also closed its sights and asked all ...,O O O O O O O O O B-person O O O O O O O O O O...,"[The, city, also, closed, its, sights, and, as...","[O, O, O, O, O, O, O, O, O, B-person, O, O, O,..."
3,We suspect that the sudden rise in cases could...,O O O O O O O O O O O O O O O O O O O O O O B-...,"[We, suspect, that, the, sudden, rise, in, cas...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ..."
4,Noel Lyn Smith covers the Navajo Nation for Th...,B-person I-person I-person O O B-organization ...,"[Noel, Lyn, Smith, covers, the, Navajo, Nation...","[B-person, I-person, I-person, O, O, B-organiz..."


In [303]:
#ALL_TYPES = [t.strip("\n") for t in bio_labels]
#ALL_TYPES.append("O")

We want to implement the perturbation procedure for the uniformly formatted BIO labels so that the same function can be applied to all three datasets.

In [304]:
# clean "labels", join all strings together, then get individual words (labels) and create set from it
ALL_TYPES = set(" ".join(data_train["labels"].apply(lambda x: x.strip("\n"))).split(" "))

In [312]:
PERTURB_SHARE = 0.1

def perturb_label_uniform(label, perturb_share, all_types):

    if np.random.sample(1) < perturb_share:
        return random.sample(all_types, 1)[0]
    else:
        return label

In [313]:
# apply to all labels with wrapper function
def perturb_label_list(label_list, perturb_function=perturb_label_uniform, perturb_share=PERTURB_SHARE, all_types=ALL_TYPES):
    
    return [perturb_function(label, perturb_share, all_types) for label in label_list]

In [314]:
random.seed(123) # add seed for reproducibility

data_train["ner_BIO_full" + "_" + str(PERTURB_SHARE)] = data_train["ner_BIO_full"].apply(perturb_label_list)

In [315]:
data_train

Unnamed: 0,sequences,labels,sequence_tok,ner_BIO_full,ner_BIO_full_0.1
0,Jakarta ( ANTARA ) - as many as 419 confirmed ...,B-location O B-organization O O O O O B-person...,"[Jakarta, (, ANTARA, ), -, as, many, as, 419, ...","[B-location, O, B-organization, O, O, O, O, O,...","[B-location, B-location, B-organization, O, O,..."
1,Australian Associated Press Qld lashes NSW gov...,B-organization I-organization I-organization O...,"[Australian, Associated, Press, Qld, lashes, N...","[B-organization, I-organization, I-organizatio...","[B-organization, I-organization, I-organizatio..."
2,The city also closed its sights and asked all ...,O O O O O O O O O B-person O O O O O O O O O O...,"[The, city, also, closed, its, sights, and, as...","[O, O, O, O, O, O, O, O, O, B-person, O, O, O,...","[O, O, O, O, O, O, O, O, O, B-person, O, O, O,..."
3,We suspect that the sudden rise in cases could...,O O O O O O O O O O O O O O O O O O O O O O B-...,"[We, suspect, that, the, sudden, rise, in, cas...","[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...","[O, O, O, O, B-disease, O, I-bacterium, O, O, ..."
4,Noel Lyn Smith covers the Navajo Nation for Th...,B-person I-person I-person O O B-organization ...,"[Noel, Lyn, Smith, covers, the, Navajo, Nation...","[B-person, I-person, I-person, O, O, B-organiz...","[B-person, I-person, I-person, O, O, I-virus, ..."
...,...,...,...,...,...
2095,August 1 Xinguan virus nucleic acid test was p...,B-time I-time B-virus I-virus O O O O O O O O ...,"[August, 1, Xinguan, virus, nucleic, acid, tes...","[B-time, I-time, B-virus, I-virus, O, O, O, O,...","[B-disease, I-time, B-virus, B-product, O, O, ..."
2096,""" Original Title : An infected case has been r...",O O O O O O O O O O O O B-location I-location ...,"["", Original, Title, :, An, infected, case, ha...","[O, O, O, O, O, O, O, O, O, O, O, O, B-locatio...","[O, O, O, O, O, O, O, O, B-time, O, O, O, B-lo..."
2097,"According to the World Health Organization , t...",O O O B-organization I-organization I-organiza...,"[According, to, the, World, Health, Organizati...","[O, O, O, B-organization, I-organization, I-or...","[I-virus, O, O, B-organization, I-organization..."
2098,"Chehalis , WA – Lewis County Public Health & S...",B-location I-location I-location O B-organizat...,"[Chehalis, ,, WA, –, Lewis, County, Public, He...","[B-location, I-location, I-location, O, B-orga...","[B-location, I-location, I-location, O, B-orga..."


In [388]:
c1 = Counter(" ".join(data_train.ner_BIO_full.sum()).split(" "))
c2 = Counter(" ".join(data_train.ner_BIO_full.sum()).split(" "))

In [390]:
df1 = pd.DataFrame(c1.items(), columns=['Label', 'Count_raw'])

In [391]:
df2 = pd.DataFrame(c2.items(), columns=['Label', f'Count_strat{str(0.1)}'])

In [397]:
pd.merge(df1, df2)

Unnamed: 0,Label,Count_raw,Count_strat0.1
0,B-location,1185,1185
1,O,45861,45861
2,B-organization,763,763
3,B-person,1859,1859
4,I-person,1829,1829
5,B-disease,448,448
6,I-location,715,715
7,I-organization,1151,1151
8,B-virus,283,283
9,B-animal,117,117


In [393]:
df1.merge(df2)

Unnamed: 0,Label,Count_raw,Count_strat0.1
0,B-location,1185,1185
1,O,45861,45861
2,B-organization,763,763
3,B-person,1859,1859
4,I-person,1829,1829
5,B-disease,448,448
6,I-location,715,715
7,I-organization,1151,1151
8,B-virus,283,283
9,B-animal,117,117


In [317]:
all_labs = " ".join(data_train["labels"].apply(lambda x: x.strip("\n"))).split(" ")

In [318]:
import nltk
fd = nltk.FreqDist(all_labs)

In [330]:
from collections import Counter
c = Counter(all_labs)

In [362]:
def perturb_label_stratified(label, perturb_share, all_types):
    all_types_list = list(all_types) 
    if np.random.sample(1) < perturb_share:
        label_counts = {lbl: label.count(lbl) for lbl in all_types_list}
        sampling_weights = [label_counts[lbl] for lbl in all_types_list]
        return random.choices(all_types_list, weights=sampling_weights, k=1)[0]
    else:
        return label


In [375]:
labels = list(c.keys())
weights = list(c.values())

sample = random.choices(labels, weights=weights, k=1)[0]
print(sample)


B-person


In [358]:
def perturb_label_stratified(label, perturb_share, data):
    all_labs = " ".join(data["labels"].apply(lambda x: x.strip("\n"))).split(" ")
    all_types_list = all_labs#list(all_types)
    if np.random.sample(1) < perturb_share:
        label_counts = {lbl: label.count(lbl) for lbl in all_types_list}
        sampling_weights = [label_counts[lbl] for lbl in all_types_list]
        return random.choices(all_types_list, weights=sampling_weights, k=1)[0]
    else:
        return label

In [376]:
c

Counter({'B-location': 1185,
         'O': 45861,
         'B-organization': 763,
         'B-person': 1859,
         'I-person': 1829,
         'B-disease': 448,
         'I-location': 715,
         'I-organization': 1151,
         'B-virus': 283,
         'B-animal': 117,
         'B-product': 165,
         'I-product': 33,
         'B-time': 570,
         'I-time': 653,
         'I-disease': 129,
         'I-virus': 187,
         'B-symptom': 82,
         'I-symptom': 72,
         'B-bacterium': 18,
         'I-bacterium': 9,
         'I-animal': 18})

In [360]:
data_train['ner_BIO_full'][0][0]

'B-location'

In [361]:
perturb_label_stratified(data_train['ner_BIO_full'][0][0], 0.6, data_train)

['B-location', 'O', 'B-organization', 'O', 'O', 'O', 'O', 'O', 'B-person', 'I-person', 'I-person', 'O', 'B-disease', 'O', 'B-disease', 'B-location', 'O', 'B-location', 'I-location', 'I-location', 'I-location', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-organization', 'I-organization', 'I-organization', 'O', 'O', 'B-organization', 'I-organization', 'O', 'O', 'O', 'O', 'B-virus', 'O', 'B-organization', 'I-organization', 'O', 'O', 'O', 'O', 'O', 'B-location', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-person', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-animal', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-person', 'I-person', 'I-person', 'O', 'O', 'B-organization', 'I-organization', 'O', 'B-organization', 'I-organization', 'I-organization',

'B-location'

### TO-DO:
* wrap everything into python script:
    * set data input directory, automatically generate output directory
    * allow for different perturbation functions
    * generate statistics (such as label distribution) and save them somewhere / print them

In [None]:
import pandas as pd
import numpy as np
import random

data_train = pd.read_csv("../data/COVIDNEWS/data_train.csv", index_col=0, converters={'sequence_tok': pd.eval, 'ner_BIO_full': pd.eval})
bio_labels = open('../data/COVIDNEWS/COVIDNEWS_CONTROSTER/types.txt', 'r').readlines()

ALL_TYPES = set(" ".join(data_train["labels"].apply(lambda x: x.strip("\n"))).split(" "))

PERTURB_SHARE = 0.1

def perturb_label_uniform(label, perturb_share, all_types):

    if np.random.sample(1) < perturb_share:
        return random.sample(all_types, 1)[0]
    else:
        return label
    
def perturb_label_list(label_list, perturb_function=perturb_label_uniform, perturb_share=PERTURB_SHARE, all_types=ALL_TYPES):
    
    return [perturb_function(label, perturb_share, all_types) for label in label_list]

random.seed(123) # add seed for reproducibility

data_train["ner_BIO_full" + "_" + str(PERTURB_SHARE)] = data_train["ner_BIO_full"].apply(perturb_label_list)

# save perturbed dataset