## Generating a synthetic dataset based on Naïve Bayes Assumptions
This notebook demonstrates how to **create a synthetic dataset of injury reports** based on the assumptions of the **Naïve Bayes model**. More specifically, the synthetic dataset is generated from the **probability of each document label** and the **probability of every token** ***given the assigned label***.
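
In symbols (the notation $y$ for the label and $w_1, \dots, w_n$ for the tokens of a report is introduced here only for illustration), the Naïve Bayes assumption treats tokens as conditionally independent given the label, so the joint probability factorizes as:

$$
P(y, w_1, \dots, w_n) = P(y) \prod_{i=1}^{n} P(w_i \mid y)
$$

Generating a report then amounts to fixing a label $y$ and drawing each token independently from $P(w \mid y)$.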

### Real Dataset
The dataset comes from a competition organized by NASA Tournament Lab and the National Institute for Occupational Safety & Health (NIOSH). The goal is to automate the processing of data in occupational safety and health (OSH) surveillance systems. Specifically, given a free-text injury report, such as "*worker fell from the ladder after reaching out for a box.*", the task is to **assign an injury code** from the Occupational Injuries and Illnesses Classification System (OIICS). Details of the task and competition can be found in a [blogpost](https://blogs.cdc.gov/niosh-science-blog/2020/02/26/ai-crowdsourcing/) by the CDC (Centers for Disease Control and Prevention). The dataset is downloaded from [Hugging Face](https://huggingface.co/datasets/mayerantoine/injury-narrative-coding). The winning solutions can be found on [NASA Tournament Lab's GitHub page](https://github.com/NASA-Tournament-Lab/CDC-NLP-Occ-Injury-Coding).

In [1]:
# import libraries
#!pip install -r requirements.txt
import numpy as np
import pandas as pd
from transformers import BertTokenizer

#### Load the dataset
The dataset is stored on GitHub in *CSV* format. We first read it into a *pandas DataFrame*.

In [2]:
# load dataset
url = "https://raw.githubusercontent.com/halfmoonliu/InjuryNoteLabel/main/Data/full_dataset.csv"
Dataset = pd.read_csv(url)

### Tokenization
The next step is to tokenize the documents, which in this project are the individual injury reports. After tokenization, narratives become lists of tokens (words or subword pieces). We use the **BERT pretrained tokenizer** provided by **Hugging Face**. Below is an illustrative example.

In [3]:
#load pretrained tokenizer
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)


In [4]:
# an illustrative example of the tokenizer's input and output
sample_txt = 'When was I last outside? I am stuck at home for 2 weeks.'
tokens_sample = tokenizer.tokenize(sample_txt)
token_ids_sample = tokenizer.convert_tokens_to_ids(tokens_sample)

print(f' Sentence: {sample_txt}')
print(f'   Tokens: {tokens_sample}')
print(f'Token IDs: {token_ids_sample}')

 Sentence: When was I last outside? I am stuck at home for 2 weeks.
   Tokens: ['When', 'was', 'I', 'last', 'outside', '?', 'I', 'am', 'stuck', 'at', 'home', 'for', '2', 'weeks', '.']
Token IDs: [1332, 1108, 146, 1314, 1796, 136, 146, 1821, 5342, 1120, 1313, 1111, 123, 2277, 119]


### Create Corpus
To generate synthetic injury reports under the Naïve Bayes assumption, we need the **probability distribution of event labels** and the **token probabilities given a label**. We then use those probabilities to generate the synthetic dataset.
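
As a minimal sketch of the first quantity, the label distribution can be estimated as the relative frequency of each event code. The labels below are made up for illustration and are not taken from the real dataset; `toy_events`, `event_counts`, and `event_prob` are hypothetical names.

```python
from collections import Counter

# Hypothetical event labels for six reports (not taken from the real dataset)
toy_events = [42, 62, 42, 71, 42, 62]

event_counts = Counter(toy_events)
n_reports = len(toy_events)

# Relative frequency of each label, an estimate of P(event)
event_prob = {e: c / n_reports for e, c in event_counts.items()}
print(event_prob)
```

The notebook itself does not sample labels from this distribution. It generates exactly as many synthetic reports per event as the real dataset contains, which preserves the label distribution.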


#### Create A Corpus of Tokenized Reports
First, we create a corpus of reports: a list in which each element is a tokenized document. Below is the content of the first element of the corpus.

In [5]:
# Create injury report corpus, i.e. a list of tokenized reports (each a list of tokens)
Report_l = Dataset['text'].tolist()
corpus = [tokenizer.tokenize(report.lower()) for report in Report_l]
print(len(corpus))
print(corpus[0])

229820
['34', '##yo', '##m', 'lifting', 'boxes', 'at', 'work', 'over', 'use', 'of', 'hand', 'd', '##x', 'car', '##pal', 'tunnel', 'syndrome']


To calculate token probabilities given each label, three dictionaries are created:

1. _tok_freq_d_: the count of each individual token within every event.
2. _event_tok_count_d_: the total number of tokens across all reports of every event.
3. _event_count_d_: the number of reports for every event.

In [6]:
# calculate token frequency
unique_event_l = Dataset['event'].unique().tolist()
tok_freq_d = dict()
event_tok_count_d = dict()
event_count_d = dict()
for e in unique_event_l:
  tok_freq_d[e] = dict()
  event_tok_count_d[e] = 0
  event_count_d[e] = 0

print(tok_freq_d)
print(event_tok_count_d)
print(event_count_d)

{71: {}, 53: {}, 55: {}, 62: {}, 24: {}, 64: {}, 70: {}, 13: {}, 63: {}, 60: {}, 99: {}, 73: {}, 42: {}, 11: {}, 40: {}, 12: {}, 43: {}, 27: {}, 66: {}, 44: {}, 69: {}, 26: {}, 41: {}, 29: {}, 72: {}, 51: {}, 32: {}, 49: {}, 31: {}, 25: {}, 78: {}, 20: {}, 52: {}, 21: {}, 23: {}, 22: {}, 67: {}, 50: {}, 65: {}, 61: {}, 56: {}, 30: {}, 59: {}, 79: {}, 74: {}, 54: {}, 45: {}, 10: {}}
{71: 0, 53: 0, 55: 0, 62: 0, 24: 0, 64: 0, 70: 0, 13: 0, 63: 0, 60: 0, 99: 0, 73: 0, 42: 0, 11: 0, 40: 0, 12: 0, 43: 0, 27: 0, 66: 0, 44: 0, 69: 0, 26: 0, 41: 0, 29: 0, 72: 0, 51: 0, 32: 0, 49: 0, 31: 0, 25: 0, 78: 0, 20: 0, 52: 0, 21: 0, 23: 0, 22: 0, 67: 0, 50: 0, 65: 0, 61: 0, 56: 0, 30: 0, 59: 0, 79: 0, 74: 0, 54: 0, 45: 0, 10: 0}
{71: 0, 53: 0, 55: 0, 62: 0, 24: 0, 64: 0, 70: 0, 13: 0, 63: 0, 60: 0, 99: 0, 73: 0, 42: 0, 11: 0, 40: 0, 12: 0, 43: 0, 27: 0, 66: 0, 44: 0, 69: 0, 26: 0, 41: 0, 29: 0, 72: 0, 51: 0, 32: 0, 49: 0, 31: 0, 25: 0, 78: 0, 20: 0, 52: 0, 21: 0, 23: 0, 22: 0, 67: 0, 50: 0, 65: 0, 61: 

#### Calculate token probability
To generate documents from the event probabilities and the token probabilities given an event, we first count, for every event, how often each token occurs, how many tokens occur in total, and how many reports there are.

In [7]:
# loop through the corpus for word count
event_l = Dataset['event'].tolist()

# tracking max/min document length for later generation
max_doc_len = 0
min_doc_len = len(corpus[0])

# create vocab dictionary
tok_ind = 0 # numbering for dictionary
tok_id_d = dict() # map token to token_ind
id_tok_d = dict() # map token_ind to token

#looping over the corpus
for si in range(len(corpus)):
  # update max/min document length
  if len(corpus[si]) > max_doc_len:
    max_doc_len = len(corpus[si])
  if len(corpus[si]) < min_doc_len:
    min_doc_len = len(corpus[si])

  # event count
  event_count_d[event_l[si]] += 1

  # loop over tokens in the sentence
  for tok in corpus[si]:
    # update vocab dict
    if tok not in tok_id_d:
      tok_id_d[tok] = tok_ind
      id_tok_d[tok_ind] = tok

      # update token index
      tok_ind += 1
    # event total token count
    event_tok_count_d[event_l[si]] +=1

    # event token count
    if tok not in tok_freq_d[event_l[si]]:
      tok_freq_d[event_l[si]][tok] = 1
    else:
      tok_freq_d[event_l[si]][tok] += 1

In [34]:
# resulting dictionaries
print("count of each event")
print(event_count_d)
print("total token count for each event")
print(event_tok_count_d)
print("individual token count for each event")
print(tok_freq_d[71])

count of each event
{71: 38673, 53: 5817, 55: 17421, 62: 36421, 24: 1517, 64: 6540, 70: 7933, 13: 4861, 63: 13520, 60: 13406, 99: 2050, 73: 12409, 42: 23320, 11: 13337, 40: 76, 12: 3338, 43: 9776, 27: 1260, 66: 4289, 44: 556, 69: 144, 26: 4017, 41: 2257, 29: 2, 72: 1141, 51: 741, 32: 494, 49: 40, 31: 1347, 25: 146, 78: 1326, 20: 19, 52: 708, 21: 55, 23: 423, 22: 76, 67: 90, 50: 39, 65: 73, 61: 78, 56: 9, 30: 2, 59: 7, 79: 18, 74: 7, 54: 15, 45: 22, 10: 4}
total token count for each event
{71: 793854, 53: 130616, 55: 412396, 62: 821311, 24: 36537, 64: 146450, 70: 155704, 13: 112379, 63: 289969, 60: 257415, 99: 38173, 73: 267094, 42: 491004, 11: 324878, 40: 1730, 12: 79509, 43: 224581, 27: 32266, 66: 96771, 44: 12882, 69: 3434, 26: 106403, 41: 46821, 29: 41, 72: 26223, 51: 18150, 32: 12745, 49: 1089, 31: 36500, 25: 3483, 78: 31356, 20: 433, 52: 16667, 21: 1419, 23: 9164, 22: 1733, 67: 2263, 50: 891, 65: 1817, 61: 1858, 56: 217, 30: 50, 59: 185, 79: 414, 74: 191, 54: 351, 45: 639, 10: 98}
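
The same three tallies can be sketched more compactly with `collections.Counter` and `defaultdict`. This is an illustrative alternative, not the notebook's own code: the toy corpus and labels are made up, and only the dictionary names are reused from the notebook.

```python
from collections import Counter, defaultdict

# Toy tokenized reports and event labels (hypothetical, for illustration only)
toy_corpus = [["fell", "from", "ladder"], ["cut", "finger"], ["fell", "on", "floor"]]
toy_events = [42, 62, 42]

tok_freq_d = defaultdict(Counter)   # event -> Counter of individual token counts
event_tok_count_d = Counter()       # event -> total number of tokens
event_count_d = Counter()           # event -> number of reports

for tokens, event in zip(toy_corpus, toy_events):
    event_count_d[event] += 1
    event_tok_count_d[event] += len(tokens)
    tok_freq_d[event].update(tokens)

print(dict(event_count_d))       # {42: 2, 62: 1}
print(dict(event_tok_count_d))   # {42: 6, 62: 2}
print(tok_freq_d[42]["fell"])    # 2
```

Because a `Counter` returns 0 for missing keys, the explicit initialization loop and the `if tok not in ...` branch are not needed in this variant.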

In [13]:
# get vocab length
all_toks = id_tok_d.keys()
print("length of the vocabulary:")
print(len(all_toks))

length of the vocabulary:
10780


#### Create dictionary of token probability for each event
With the **total token count for each event** and the individual token counts within each event, we can construct a **dictionary of token probabilities given each event**.

In [24]:
#create mapped token frequency
mapped_tok_freq_d = dict()

for e in unique_event_l:
  # create an array of zeros for mapped token count
  mapped_tok_freq_d[e] = np.zeros(len(all_toks))

  for tok in tok_freq_d[e]:
    # update mapped token count
    mapped_tok_freq_d[e][tok_id_d[tok]] = tok_freq_d[e][tok]
  # normalize counts to get token probability given event
  mapped_tok_freq_d[e] /= event_tok_count_d[e]
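
Since `np.random.choice` expects its `p` argument to sum to 1, a quick sanity check on each probability vector can be useful. The sketch below uses a hypothetical four-token vocabulary and made-up counts for a single event; `toy_vocab`, `toy_counts`, and `toy_probs` are illustrative names.

```python
import numpy as np

# Hypothetical token counts for one event over a 4-token vocabulary
toy_vocab = ["fell", "from", "ladder", "floor"]
toy_counts = np.array([2.0, 1.0, 1.0, 0.0])

# Estimate of P(token | event): token count / total tokens for the event
toy_probs = toy_counts / toy_counts.sum()

# The vector should be a valid probability distribution
assert np.isclose(toy_probs.sum(), 1.0)
assert (toy_probs >= 0).all()

print(dict(zip(toy_vocab, toy_probs.tolist())))
# {'fell': 0.5, 'from': 0.25, 'ladder': 0.25, 'floor': 0.0}
```

Note that a token never seen with an event keeps probability 0 (no smoothing is applied), so it can never be generated for that event.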

#### Generate synthetic dataset
With the information needed, we **generate the same number of reports for every event as in the real dataset, based on the individual token probabilities**. The report length is a **random number between the minimum and maximum document length**.

In [26]:
dataset_l = list()
# create dataset
for e in unique_event_l:
  for count in range(event_count_d[e]):

    # np.random.randint excludes `high`, so add 1 to make max_doc_len reachable
    sent_len = np.random.randint(low = min_doc_len, high = max_doc_len + 1)
    gen_sent_ind_l = np.random.choice(len(all_toks),
                                          size=sent_len,
                                          p=mapped_tok_freq_d[e]).tolist()
    gen_sent = [id_tok_d[i] for i in gen_sent_ind_l]
    sent = " ".join(gen_sent)
    dataset_l.append([sent, e])
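
For reproducible runs, the sampling step can be sketched with a seeded NumPy `Generator`. This is a variant under stated assumptions, not the notebook's own code: the vocabulary, probabilities, length bounds, and the helper `generate_report` are hypothetical. `endpoint=True` makes the upper length bound inclusive.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is reproducible

# Hypothetical vocabulary and P(token | event) for a single event
toy_vocab = np.array(["fell", "from", "ladder", "floor"])
toy_probs = np.array([0.5, 0.25, 0.25, 0.0])

def generate_report(vocab, probs, min_len, max_len, rng):
    """Draw a report length uniformly, then sample tokens independently from probs."""
    # endpoint=True makes max_len an inclusive upper bound
    length = rng.integers(min_len, max_len, endpoint=True)
    tokens = rng.choice(vocab, size=length, p=probs)
    return " ".join(tokens)

report = generate_report(toy_vocab, toy_probs, min_len=3, max_len=6, rng=rng)
print(report)
```

Each token is drawn independently of its neighbours, which is exactly the Naïve Bayes assumption and the reason the generated text has no word order or grammar.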

#### Turn the list of generated sentences into a pandas DataFrame

In [32]:
GenData = pd.DataFrame(dataset_l, columns = ['text', 'id'])

# randomly shuffle the dataframe
GenData = GenData.sample(frac = 1)
GenData.head()

|  | text | id |
|---|---|---|
| 116247 | horse swelling pu l r ##uli work finger bite l... | 13 |
| 173938 | on ##m 51 46 fell tripped a 29 opening with wi... | 42 |
| 104340 | a ##r 50 yo a ##c wrist r finger finger ##x cu... | 64 |
| 166097 | ##ified ##lles d 58 running work ##in complain... | 42 |
| 180273 | work in duty floor f left looked and fell d co... | 42 |


In [33]:
# export synthetic dataset
GenData.to_csv('BayesianSyntheticInjuryReport.csv', header = False)
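
Because the file is written with `header = False` and the index is kept, the first column of the CSV holds the shuffled row index and there is no header row. A sketch of reading such a file back, using an in-memory buffer and a two-row stand-in for `GenData`; the column name `row_id` is a hypothetical choice.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for GenData, with a shuffled index
toy = pd.DataFrame({"text": ["fell from ladder", "cut finger"], "id": [42, 62]},
                   index=[7, 3])

buffer = io.StringIO()
toy.to_csv(buffer, header=False)  # same call pattern as the export above
buffer.seek(0)

# The file has no header row, so column names are supplied explicitly
loaded = pd.read_csv(buffer, header=None, names=["row_id", "text", "id"])
print(loaded)
```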