## Example of Textual Augmenter Usage using NLPAUG
- API documentation https://nlpaug.readthedocs.io/en/latest/
- https://amitness.com/2020/05/data-augmentation-for-nlp/

In [1]:
import os
import config  # project-local module; it must define data_folder
model_dir = os.path.join(config.data_folder,'Data_Augumentation_Models')
os.environ["MODEL_DIR"] = model_dir

- Download the pretrained model weights before running the cells below
- w2v : https://drive.google.com/uc?export=download&id=0B7XkCwpI5KDYNlNUTTlSS21pQmM 

In [2]:
## import some functions
#import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

- For character-level augmentation, please see https://github.com/makcedward/nlpaug/blob/master/example/textual_augmenter.ipynb

In [39]:
## sample text for data augmentation
text1 = "Low-income countries face fewer debt challenges today than they did 25 years ago, thanks in particular to the Heavily Indebted Poor Countries initiative, which slashed unmanageable debt burdens across sub-Saharan Africa and other regions. But although debt ratios are lower than in the mid-1990s, debt has been creeping up for the past decade and the changing composition of creditors will make restructurings more complex."
text2 = "Restructuring Debt of Poorer Nations Requires More Efficient Coordination"
print(text1,'\n\n',text2)

Low-income countries face fewer debt challenges today than they did 25 years ago, thanks in particular to the Heavily Indebted Poor Countries initiative, which slashed unmanageable debt burdens across sub-Saharan Africa and other regions. But although debt ratios are lower than in the mid-1990s, debt has been creeping up for the past decade and the changing composition of creditors will make restructurings more complex. 

 Restructuring Debt of Poorer Nations Requires More Efficient Coordination


### Word Level Augmentation

- #### Synonym Augmenter
- Results are acceptable but not great: in the output below, "Nations" becomes "Land", which changes the meaning.

In [36]:
aug = naw.SynonymAug(aug_src='wordnet',aug_p=0.6)
print("Original:")
print(text2)
print("Augmented Text substitute:")
print(aug.augment(text2,n=1,num_thread=1)[0])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text substitute:
Reconstitute Debt of Poorer Land Requires More Efficient Coordination


- #### EDA operations (substitute word / swap word / delete word / crop a set of contiguous words)

In [40]:
print("Original:")
print(text2)
for a in ['substitute', 'swap', 'delete','crop']:
    aug = naw.RandomWordAug(action=a)
    augmented_text = aug.augment(text2)
    print("Augmented Text {}:".format(a))
    print(augmented_text[0])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text substitute:
Restructuring Debt of _ _ Requires More _ Coordination
Augmented Text swap:
Debt restructuring of Nations Poorer Requires More Coordination Efficient
Augmented Text delete:
Restructuring Debt of More Efficient Coordination
Augmented Text crop:
Restructuring Nations Requires More Efficient Coordination
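
To make the swap, delete and crop actions above more concrete, here is a minimal pure-Python sketch of what each one does to a token list. It is an illustration only, not nlpaug's implementation: the helper names (`swap_adjacent`, `delete_random`, `crop_span`) and the defaults `p=0.3` and `span=3` are assumptions chosen for this example.

```python
import random


def swap_adjacent(tokens, rng):
    """Swap one randomly chosen token with its right-hand neighbour."""
    if len(tokens) < 2:
        return list(tokens)
    i = rng.randrange(len(tokens) - 1)
    out = list(tokens)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def delete_random(tokens, rng, p=0.3):
    """Drop each token independently with probability p, keeping at least one."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept or [rng.choice(tokens)]


def crop_span(tokens, rng, span=3):
    """Remove one contiguous run of `span` tokens."""
    if len(tokens) <= span:
        return list(tokens)
    start = rng.randrange(len(tokens) - span + 1)
    return tokens[:start] + tokens[start + span:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
tokens = "Restructuring Debt of Poorer Nations Requires More Efficient Coordination".split()

swapped = swap_adjacent(tokens, rng)
deleted = delete_random(tokens, rng)
cropped = crop_span(tokens, rng)

print("swap:  ", " ".join(swapped))
print("delete:", " ".join(deleted))
print("crop:  ", " ".join(cropped))
```

The difference between delete and crop is that delete removes scattered tokens while crop removes a single contiguous run, which matches the "crop" output above where "Debt of Poorer" disappears as one block.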


- #### Word2Vec Augmentation

In [20]:
# model_type: word2vec, glove or fasttext
w2v_dir = os.path.join(model_dir,'w2v','GoogleNews-vectors-negative300.bin')
aug = naw.WordEmbsAug(model_type='word2vec', model_path=w2v_dir,
                      action="substitute",top_k=5,aug_p = 0.3)
# top_k (int): size of the candidate pool. Only the top-k scoring tokens are used for
#   augmentation; a larger k allows more tokens. Default is 100. None means all possible
#   tokens are used. Ignored for the "insert" action.
# aug_p (float): fraction of words that will be augmented.

- In general, word2vec augmentation quality does not seem very good based on human evaluation.
- The embeddings likely need to be retrained on your own domain first.
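
The roles of `top_k` and `aug_p` can be sketched without loading the word2vec model. The snippet below uses a small hand-written neighbour table; the words and similarity scores in `TOY_NEIGHBOURS` are invented for illustration and do not come from the GoogleNews vectors. The `substitute` helper and its rounding rule are assumptions too, and nlpaug's internal logic may differ.

```python
import math
import random

# Hypothetical nearest-neighbour table: word -> [(candidate, similarity)].
# Scores are made up for this sketch.
TOY_NEIGHBOURS = {
    "debt": [("loans", 0.71), ("borrowing", 0.68), ("deficit", 0.60), ("credit", 0.55)],
    "nations": [("countries", 0.80), ("states", 0.66), ("governments", 0.58)],
    "efficient": [("effective", 0.77), ("streamlined", 0.62), ("productive", 0.59)],
}


def substitute(tokens, neighbours, top_k, aug_p, rng):
    """Replace a share of tokens with a neighbour drawn from the top_k candidates."""
    # Only tokens that appear in the table can be replaced.
    replaceable = [i for i, t in enumerate(tokens) if t.lower() in neighbours]
    # aug_p sets how many tokens change, as a share of sentence length
    # (rounded up here; this rounding is an assumption of the sketch).
    n_aug = min(len(replaceable), max(1, math.ceil(aug_p * len(tokens))))
    out = list(tokens)
    for i in rng.sample(replaceable, n_aug):
        ranked = sorted(neighbours[tokens[i].lower()], key=lambda pair: pair[1], reverse=True)
        # top_k restricts the pool to the k best-scoring candidates; None keeps all.
        pool = ranked if top_k is None else ranked[:top_k]
        out[i] = rng.choice(pool)[0]
    return out


tokens = "Restructuring Debt of Poorer Nations Requires More Efficient Coordination".split()

# top_k=1 always picks the best-scoring neighbour, so the result is deterministic.
greedy = substitute(tokens, TOY_NEIGHBOURS, top_k=1, aug_p=0.3, rng=random.Random(0))
print(" ".join(greedy))
# Restructuring loans of Poorer countries Requires More effective Coordination

# A smaller aug_p changes fewer words; top_k=None draws from the whole candidate list.
light = substitute(tokens, TOY_NEIGHBOURS, top_k=None, aug_p=0.1, rng=random.Random(0))
print(" ".join(light))
```

A small `top_k` keeps substitutions close to the original word at the cost of diversity, while a large `top_k` admits lower-scoring and often noisier candidates.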

In [24]:
augmented_text = aug.augment([text1,text2])
print("Original:")
print(text2)
print("Augmented Text:")
print(augmented_text[1])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text:
Restructuring Debt of Poorer Six_Nations require Roughly Efficient Coordination


- #### Contextual Word Embeddings Augmenter
- Inserts or substitutes words using contextual word embeddings (BERT, DistilBERT, RoBERTa or XLNet)
- Overall, the results look better than those of the word-level augmenters above.


In [44]:
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', 
                                action="insert",top_k=10,aug_p=0.2)
aug2 = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', 
                                action="substitute",top_k=10,aug_p=0.2)
print("Original:")
print(text2)
print("Augmented Text insert:")
print(aug.augment(text2)[0])
print("Augmented Text substitute:")
print(aug2.augment(text2)[0])

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Text insert:
restructuring of debt of poorer nations therefore requires more efficient coordination
Augmented Text substitute:
restructuring all of the nations requires more efficient coordination


- #### Sentence Augmenter
- Does not seem very useful most of the time: the model only appends generated text to the input, and the continuation often drifts off topic.

In [10]:
# model_path: xlnet-base-cased or gpt2
aug = nas.ContextualWordEmbsForSentenceAug(model_path='xlnet-base-cased') ## next token, pick from top 5
augmented_texts = aug.augment(text2, n=3)
print("Original:")
print(text2)
print("Augmented Texts:")
print(augmented_texts)

Original:
Restructuring Debt of Poorer Nations Requires More Efficient Coordination
Augmented Texts:
['Restructuring Debt of Poorer Nations Requires More Efficient Coordination for UN Member State, the Finance Minister said on Saturday.', 'Restructuring Debt of Poorer Nations Requires More Efficient Coordination at a More Affordable Time?', 'Restructuring Debt of Poorer Nations Requires More Efficient Coordination for Responsible Governments In South Africa to Disposage Global Household Debt of']


- ### Back translation 
- Quality looks pretty good, but it is relatively expensive to run. Note that the round trip below drops the second sentence of the input.

In [22]:
from transformers import pipeline

In [31]:
en_fr_translator = pipeline("translation_en_to_fr")
fr_en_translator = pipeline(task="translation", model="Helsinki-NLP/opus-mt-fr-en")

In [43]:
fr_text = en_fr_translator(text1)[0]['translation_text']
print(text1)
print(fr_en_translator(fr_text)[0]['translation_text'])


Low-income countries face fewer debt challenges today than they did 25 years ago, thanks in particular to the Heavily Indebted Poor Countries initiative, which slashed unmanageable debt burdens across sub-Saharan Africa and other regions. But although debt ratios are lower than in the mid-1990s, debt has been creeping up for the past decade and the changing composition of creditors will make restructurings more complex.
Low-income countries are now facing fewer debt problems than they were 25 years ago, including through the Heavily Indebted Poor Countries (HIPC) initiative, which has reduced the burden of unsustainable debt in sub-Saharan Africa and other regions.
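
The back-translated output above is a good paraphrase of the first sentence, but the second sentence of the input is missing. The cause is not verified here; a generation-length limit in one of the translation models is one possibility. A cheap sanity check can flag such cases before the augmented text is used. The sketch below relies only on the standard library; the helper names and the `MIN_LENGTH_RATIO` threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher

# The two strings are copied from the cell output above.
original = (
    "Low-income countries face fewer debt challenges today than they did 25 years ago, "
    "thanks in particular to the Heavily Indebted Poor Countries initiative, which slashed "
    "unmanageable debt burdens across sub-Saharan Africa and other regions. But although "
    "debt ratios are lower than in the mid-1990s, debt has been creeping up for the past "
    "decade and the changing composition of creditors will make restructurings more complex."
)
round_trip = (
    "Low-income countries are now facing fewer debt problems than they were 25 years ago, "
    "including through the Heavily Indebted Poor Countries (HIPC) initiative, which has "
    "reduced the burden of unsustainable debt in sub-Saharan Africa and other regions."
)


def word_similarity(a, b):
    """Word-level similarity in [0, 1]; 1.0 means the word sequences are identical."""
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


def length_ratio(a, b):
    """Word count of b relative to a; values well below 1 suggest lost content."""
    return len(b.split()) / len(a.split())


sim = word_similarity(original, round_trip)
ratio = length_ratio(original, round_trip)
print(f"word similarity: {sim:.2f}, length ratio: {ratio:.2f}")

# Hypothetical threshold: reject round trips that lose more than 20% of the words,
# and reject exact copies because they add no variety.
MIN_LENGTH_RATIO = 0.8
keep = ratio >= MIN_LENGTH_RATIO and sim < 1.0
print("keep augmented sample:", keep)
```

For this example the length ratio falls well below the threshold, so the sample is rejected. Translating sentence by sentence is one way to reduce this kind of loss, at the price of more model calls.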


- ### Abstractive Summarization Augmenter
- This looks fairly OK.

In [20]:
article = text1
aug = nas.AbstSummAug(model_path='t5-base')
augmented_text = aug.augment(article)
print("Original:")
print(article)
print("Augmented Text:")
print(augmented_text)

Original:
Low-income countries face fewer debt challenges today than they did 25 years ago, thanks in particular to the Heavily Indebted Poor Countries initiative, which slashed unmanageable debt burdens across sub-Saharan Africa and other regions. But although debt ratios are lower than in the mid-1990s, debt has been creeping up for the past decade and the changing composition of creditors will make restructurings more complex.
Augmented Text:
['low-income countries face fewer debt challenges today than they did 25 years ago. but debt has been creeping up for the past decade. changing composition of creditors will make restructurings more complex.']
