## Textual Augmenter Usage Examples with nlpaug
- API documentation: https://nlpaug.readthedocs.io/en/latest/

In [1]:
import os
import config  # local module; must define data_folder
model_dir = os.path.join(config.data_folder,'Data_Augumentation_Models')
os.environ["MODEL_DIR"] = model_dir

- Download the pretrained model weights used below.
- word2vec (w2v): https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM
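The download step can be sketched as follows. This is a minimal sketch that assumes the word2vec file arrives as a gzip archive named `GoogleNews-vectors-negative300.bin.gz`; that is an assumption about the download, so check what you actually receive. The helper names `gunzip_to` and `require_file` are hypothetical and not part of nlpaug. The target location mirrors the path that the W2V cell later in this notebook loads.

```python
import gzip
import os
import shutil


def gunzip_to(src_gz, dest_path):
    """Decompress a .gz archive to dest_path, creating parent folders as needed."""
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    with gzip.open(src_gz, "rb") as src, open(dest_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams in chunks, so large files are fine
    return dest_path


def require_file(path):
    """Fail early with a clear message if a model file is missing."""
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Model file not found: {path}")
    return path


# Example usage (archive name and location are assumptions; adjust to your setup):
# w2v_path = gunzip_to(
#     "GoogleNews-vectors-negative300.bin.gz",
#     os.path.join(os.environ["MODEL_DIR"], "w2v", "GoogleNews-vectors-negative300.bin"),
# )
# require_file(w2v_path)
```

Checking the path up front gives a clearer error than letting the augmenter fail while loading the model.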

In [2]:
# Import the nlpaug augmenter modules
# import nlpaug.augmenter.char as nac  # character-level augmenters (not used here)
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

- For character-level augmentation, see https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb

In [17]:
# Sample text for data augmentation
text1 = "Low-income countries face fewer debt challenges today than they did 25 years ago, thanks in particular to the Heavily Indebted Poor Countries initiative, which slashed unmanageable debt burdens across sub-Saharan Africa and other regions. But although debt ratios are lower than in the mid-1990s, debt has been creeping up for the past decade and the changing composition of creditors will make restructurings more complex."
text2 = "Restructuring Debt of Poorer Nations Requires More Efficient Coordination"
print(text1,'\n\n',text2)

Low-income countries face fewer debt challenges today than they did 25 years ago, thanks in particular to the Heavily Indebted Poor Countries initiative, which slashed unmanageable debt burdens across sub-Saharan Africa and other regions. But although debt ratios are lower than in the mid-1990s, debt has been creeping up for the past decade and the changing composition of creditors will make restructurings more complex. 

 Restructuring Debt of Poorer Nations Requires More Efficient Coordination


### Word-Level Augmentation

- #### Synonym Augmenter
- The results are acceptable but not very good; in the output below, for example, "Nations" becomes "Land".

In [36]:
aug = naw.SynonymAug(aug_src='wordnet',aug_p=0.6)
print("Original:")
print(text2)
print("Augmented Text substitute:")
print(aug.augment(text2,n=1,num_thread=1)[0])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text substitute:
Reconstitute Debt of Poorer Land Requires More Efficient Coordination


- #### EDA operations (substitute, swap, delete, or crop a contiguous run of words)

In [40]:
print("Original:")
print(text2)
for a in ['substitute', 'swap', 'delete','crop']:
    aug = naw.RandomWordAug(action=a)
    augmented_text = aug.augment(text2)
    print("Augmented Text {}:".format(a))
    print(augmented_text[0])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text substitute:
Restructuring Debt of _ _ Requires More _ Coordination
Augmented Text swap:
Debt restructuring of Nations Poorer Requires More Coordination Efficient
Augmented Text delete:
Restructuring Debt of More Efficient Coordination
Augmented Text crop:
Restructuring Nations Requires More Efficient Coordination
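To make the swap, delete, and crop actions above concrete, here is a plain-Python sketch of each operation on a token list. It is an illustration only and not nlpaug's implementation: the function names, the whitespace tokenization, and the default probability `p=0.3` are assumptions made for this example.

```python
import random


def random_swap(tokens, rng):
    """Swap one randomly chosen token with its right-hand neighbour."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def random_delete(tokens, rng, p=0.3):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def random_crop(tokens, rng, p=0.3):
    """Remove one contiguous run of about p * len(tokens) tokens."""
    if len(tokens) < 2:
        return list(tokens)
    n = min(max(1, round(len(tokens) * p)), len(tokens) - 1)
    start = rng.randrange(len(tokens) - n + 1)
    return tokens[:start] + tokens[start + n:]


rng = random.Random(0)  # fixed seed so reruns give the same draws
tokens = "Restructuring Debt of Poorer Nations Requires More Efficient Coordination".split()
for name, fn in [("swap", random_swap), ("delete", random_delete), ("crop", random_crop)]:
    print(name, "->", " ".join(fn(tokens, rng)))
```

The difference between delete and crop is that delete removes scattered words, while crop removes a single contiguous span.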


- #### W2V Augmentation

In [20]:
# model_type: word2vec, glove or fasttext
w2v_dir = os.path.join(model_dir,'w2v','GoogleNews-vectors-negative300.bin')
aug = naw.WordEmbsAug(model_type='word2vec', model_path=w2v_dir,
                      action="substitute", top_k=5, aug_p=0.3)
# top_k (int): size of the candidate pool. Only the k highest-scoring tokens
#   are eligible for augmentation; a larger k admits more candidates.
#   Default is 100. None means all possible tokens are used.
#   Ignored when action is "insert".
# aug_p (float): fraction of words to augment.

- In general, the quality of w2v augmentation does not seem very good, based on human evaluation.
- The embeddings likely need to be retrained on text from your own domain first.
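The roles of `aug_p` and `top_k` described in the cell above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not nlpaug's code: rounding the word count up, drawing uniformly from the pool, and the helper names are all assumptions, and the neighbour list with its similarity scores is made up for the example.

```python
import math
import random


def n_words_to_augment(n_tokens, aug_p, min_count=1):
    """Number of words to change: aug_p of the tokens, rounded up (assumed),
    and never fewer than min_count."""
    return max(min_count, math.ceil(n_tokens * aug_p))


def draw_from_top_k(candidates, top_k, rng):
    """Pick a replacement uniformly (assumed) from the top_k highest-scoring
    candidates. top_k=None means the whole candidate list is the pool."""
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    pool = ranked if top_k is None else ranked[:top_k]
    return rng.choice(pool)[0]


# Made-up neighbours and similarity scores for "Nations", for illustration only.
toy_neighbours = [
    ("countries", 0.9),
    ("states", 0.8),
    ("Six_Nations", 0.7),
    ("peoples", 0.6),
    ("tribes", 0.5),
]

rng = random.Random(0)  # fixed seed so reruns give the same draws
print("words to augment in a 9-word title:", n_words_to_augment(9, 0.3))
for k in (1, 3, None):
    print("top_k =", k, "->", draw_from_top_k(toy_neighbours, k, rng))
```

A small `top_k` keeps replacements close to the original word, while a large one lets weaker neighbours through, which is one plausible source of substitutions such as "Six_Nations".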

In [24]:
augmented_text = aug.augment([text1,text2])
print("Original:")
print(text2)
print("Augmented Text:")
print(augmented_text[1])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text:
Restructuring Debt of Poorer Six_Nations require Roughly Efficient Coordination


- #### Contextual Word Embeddings Augmenter
- Inserts or substitutes words using contextual word embeddings (BERT, DistilBERT, RoBERTa, or XLNet).
- Overall, the results look better than those of the augmenters above.


In [31]:
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased',
                                action="insert", top_k=10, aug_p=0.2)
aug2 = naw.ContextualWordEmbsAug(model_path='bert-base-uncased',
                                 action="substitute", top_k=10, aug_p=0.2)
print("Original:")
print(text2)
print("Augmented Text insert:")
print(aug.augment(text2)[0])
print("Augmented Text substitute:")
print(aug2.augment(text2)[0])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text insert:
restructuring of debt of much poorer nations requires more efficient coordination
Augmented Text substitute:
restructuring debt for developing nations requires more efficient coordination
