## Purpose: Resample the training set
- oversample minority classes (B-, I-)
- undersample majority classes (O)


### Note: Before running this notebook, please configure the following paths

In [7]:
data_folder = "dataset"

In [None]:
#!dir $data_folder

## 1. Resample CoNLL File (train set)
- this is to address class imbalance
- to oversample minority classes (B-, I-)
- to undersample majority classes (O)
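To see how imbalanced a file is before resampling, the tags can be counted by prefix. The following is a minimal sketch, assuming each non-blank line holds a token followed by its tag as the last whitespace-separated column; the sample text and the helper name count_tag_prefixes are made up for illustration.

```python
from collections import Counter

# Hypothetical two-sentence CoNLL snippet (assumed layout: token, then tag)
sample_text = "John B-PER\nlives O\nin O\nParis B-LOC\n\nIt O\nrains O"

def count_tag_prefixes(conll_text):
    """Count tags grouped as "O", "B-" or "I-" (illustrative helper)."""
    counts = Counter()
    for line in conll_text.splitlines():
        if not line.strip():
            continue  # blank line = sentence boundary
        tag = line.split()[-1]
        counts["O" if tag == "O" else tag[:2]] += 1
    return counts

tag_counts = count_tag_prefixes(sample_text)
print(tag_counts)
```

A large gap between the "O" count and the "B-"/"I-" counts is the imbalance that the resampling below tries to reduce.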

In [2]:
import re
import csv
import numpy as np
import pandas as pd

### Custom Function to re-sample train set

In [3]:
import os

def resample_conll_fortrain(folder, input_filename, undersampling_ratio, oversampling_ratio, output_filename):
    """Write a resampled copy of a CoNLL train file.

    undersampling_ratio: fraction (0 to 1) of all-"O" sentences to keep.
    oversampling_ratio: integer number of copies of each sentence with "B-"/"I-" labels.
    """
    # Read text file
    with open(os.path.join(folder, input_filename)) as f:
        text = f.read()

    # Split into sentences (separated by a blank line) and drop empty chunks
    out = [s.strip('\n') for s in re.split('\n\n', text) if s.strip()]
    df = pd.DataFrame(out, columns=['Sentences'])

    # Flag sentences that contain at least one "B-" or "I-" label
    df['Ner Labels'] = df['Sentences'].str.contains(pat='B-|I-').astype(int)
    df0 = df[df['Ner Labels'] == 0]
    df1 = df[df['Ner Labels'] == 1]

    # Undersample sentences with all "O" labels
    df0_resample = df0.sample(frac=undersampling_ratio, random_state=42)

    # Oversample sentences with "B-" and "I-" labels
    df1_oversample = pd.concat([df1] * oversampling_ratio, ignore_index=True)

    # Combine undersampled and oversampled sentences
    final_df = pd.concat([df0_resample, df1_oversample])

    # Shuffle rows within dataframe
    final_df = final_df.sample(frac=1, random_state=42)

    # Write each sentence followed by a blank line, as the CoNLL format expects
    with open(os.path.join(folder, output_filename), 'w') as f:
        for sentence in final_df['Sentences']:
            f.write(sentence + '\n\n')

In [4]:
# Set parameters for resample_conll_fortrain
train_folder = data_folder + "\\02conll\\conll_train"

#dataset name
dataset_name = "train4522"   

# change the ratios to suit your dataset
undersampling_ratio = [0.1,0.2,0.3,0.4,0.5]
oversampling_ratio = [1,2,3]

### Generate Re-sample Train sets

In [5]:
# Generate output files for every combination of undersampling and oversampling ratio
# (5 undersampling ratios x 3 oversampling ratios = 15 training sets in the conll_train folder)
for u in undersampling_ratio:
    for o in oversampling_ratio:
        output_filename = dataset_name + "_u" + str(u) + "o" + str(o) + ".txt"
        resample_conll_fortrain(train_folder, dataset_name + ".txt", u, o, output_filename)
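As a quick sanity check before running the loop, the output filenames it would produce can be listed without touching any files. This is a minimal sketch that repeats the parameter values from the cell above; it only builds the names and does not call the resampling function.

```python
from itertools import product

# Same values as in the parameter cell above
dataset_name = "train4522"
undersampling_ratio = [0.1, 0.2, 0.3, 0.4, 0.5]
oversampling_ratio = [1, 2, 3]

# One filename per (undersampling, oversampling) combination
expected_files = [
    f"{dataset_name}_u{u}o{o}.txt"
    for u, o in product(undersampling_ratio, oversampling_ratio)
]

print(len(expected_files))   # 15
print(expected_files[:3])    # ['train4522_u0.1o1.txt', 'train4522_u0.1o2.txt', 'train4522_u0.1o3.txt']
```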

Note: A filename ending in _u0.1o2 (for example, train4522_u0.1o2.txt) denotes an undersampling ratio (u) of 0.1 combined with an oversampling ratio (o) of 2.

1. For under-sampling, only a fraction of the sentences that contain no “B-” or “I-” labels is kept; an under-sampling ratio of 0.1 means that only 10% of those sentences are kept.

2. For over-sampling, sentences that contain “B-” or “I-” labels are repeated to create more training examples; an over-sampling ratio of 2 means that each such sentence appears twice in the output (the original plus one duplicate), while a ratio of 1 leaves them unchanged.
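The effect of the two ratios on the size of the training set can be checked on toy data. The sketch below uses made-up sentences (10 with only "O" tags, 2 with entity tags) and the same pandas calls as the function above; the ratios u = 0.5 and o = 3 are chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical toy data: 10 sentences with only "O" tags, 2 with entity tags
df0 = pd.DataFrame({"Sentences": [f"word{i} O" for i in range(10)]})
df1 = pd.DataFrame({"Sentences": ["John B-PER", "New B-LOC\nYork I-LOC"]})

u, o = 0.5, 3
df0_kept = df0.sample(frac=u, random_state=42)           # keeps 50% of the "O"-only sentences
df1_repeated = pd.concat([df1] * o, ignore_index=True)   # each entity sentence appears 3 times

n_kept, n_entity = len(df0_kept), len(df1_repeated)
print(n_kept, n_entity, n_kept + n_entity)  # 5 6 11
```

In this toy case the share of sentences containing an entity rises from 2 of 12 to 6 of 11.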