## 04 – Data Augmentation using NLPaug

In [11]:
# Base text used throughout this notebook
text="The quick brown fox jumps over the lazy dog ."

In [None]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw

import os
!git clone https://github.com/makcedward/nlpaug.git
os.environ["MODEL_DIR"] = 'nlpaug/model/'

fatal: destination path 'nlpaug' already exists and is not an empty directory.


### Augmentation at the Character Level


In [13]:
# OCR augmenter: simulates character misreads typical of OCR output (e.g. o -> 0, l -> 1)
aug = nac.OcrAug()
augmented_texts = aug.augment(text, n=3)  # n=3 returns three augmented versions of the sentence

print("Original:")
print(text)

print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quicr brown fox jumps over the la2y do9 .', 'The quick bkown fox jumps over the la2y dug .', 'The quick brown fox jumps 0ver the 1azy dog .']
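
To make the idea behind the output above concrete, the sketch below imitates OCR-style noise with a small hand-written confusion table. The table `OCR_CONFUSIONS` and the helper `ocr_noise` are hypothetical names introduced for illustration, and the table is much smaller than any mapping a real augmenter would use. Treat it as a rough illustration of the technique, not as nlpaug's implementation.

```python
import random

# Hypothetical confusion table: characters an OCR engine might misread.
OCR_CONFUSIONS = {
    "o": ["0"],
    "l": ["1"],
    "z": ["2"],
    "g": ["9"],
    "s": ["5"],
}

def ocr_noise(sentence, p=0.3, seed=None):
    """Replace each confusable character with probability p."""
    rng = random.Random(seed)
    out = []
    for ch in sentence:
        options = OCR_CONFUSIONS.get(ch)
        if options and rng.random() < p:
            out.append(rng.choice(options))  # swap in a look-alike character
        else:
            out.append(ch)                   # keep the character unchanged
    return "".join(out)

text = "The quick brown fox jumps over the lazy dog ."
for i in range(3):
    print(ocr_noise(text, seed=i))
```

Because every substitution is one character for one character, the augmented sentence always has the same length as the original.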


In [14]:
# Keyboard augmenter: simulates typos caused by hitting a neighbouring key
aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=3)  # n=3 returns three augmented versions of the sentence

print("Original:")
print(text)

print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick brown fox jum0s over the lazy dog .', 'The quick brown fox jum9s over the laSy dog .', 'The quick brown fox jujps over the lazy dog .']
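
Keyboard noise can be sketched in the same spirit: pick a character and replace it with a key that sits next to it. The adjacency map `KEY_NEIGHBORS` and the helper `keyboard_typo` below are hypothetical and cover only a handful of lowercase keys. Unlike a real augmenter, this sketch changes every word that contains a mapped key rather than a sampled fraction of the words.

```python
import random

# Hypothetical, partial QWERTY adjacency map (lowercase letters only).
KEY_NEIGHBORS = {
    "q": "wa",
    "u": "yij",
    "m": "nj",
    "p": "ol",
    "z": "xa",
    "o": "ipl",
}

def keyboard_typo(word, rng):
    """Swap one character of word for an adjacent key, if any character is mapped."""
    positions = [i for i, ch in enumerate(word) if ch in KEY_NEIGHBORS]
    if not positions:
        return word  # nothing in this word can be mistyped with our small map
    i = rng.choice(positions)
    return word[:i] + rng.choice(KEY_NEIGHBORS[word[i]]) + word[i + 1:]

rng = random.Random(0)
text = "The quick brown fox jumps over the lazy dog ."
print(" ".join(keyboard_typo(w, rng) for w in text.split()))
```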


### Augmentation at the Word Level

Augmentation is just as useful at the word level. Here we substitute words in two ways: with a dictionary of common spelling mistakes, and with contextual word embeddings (BERT).

**Spelling augmenter**


In [15]:
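import wget  # third-party package: pip install wget

# Download the spelling-mistake dictionary only if it is not already present
if not os.path.exists("spelling_en.txt"):
    wget.download("https://raw.githubusercontent.com/makcedward/nlpaug/5238e0be734841b69651d2043df535d78a8cc594/nlpaug/res/word/spelling/spelling_en.txt")
else:
    print("File already exists")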

File already exists


In [16]:
# Substitute words using a dictionary of common spelling mistakes
aug = naw.SpellingAug('spelling_en.txt')
augmented_texts = aug.augment(text)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
The qchick brown fox jumps over the lazing doog .
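
The dictionary-lookup idea can be illustrated without the downloaded file. In the sketch below, `MISSPELLINGS` is a tiny hypothetical in-memory dictionary and `misspell` is an illustrative helper. Neither reflects the format of `spelling_en.txt` or the internals of `SpellingAug`.

```python
import random

# Hypothetical misspelling dictionary: correct word -> common misspellings.
MISSPELLINGS = {
    "quick": ["quik", "qick"],
    "lazy": ["lazey", "lasy"],
    "dog": ["dogg"],
}

def misspell(sentence, p=0.5, seed=None):
    """Replace each word found in the dictionary with a misspelling, with probability p."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        candidates = MISSPELLINGS.get(tok.lower())
        if candidates and rng.random() < p:
            out.append(rng.choice(candidates))  # substitute a known misspelling
        else:
            out.append(tok)                     # word not in dictionary or not selected
    return " ".join(out)

text = "The quick brown fox jumps over the lazy dog ."
print(misspell(text, seed=0))
```

Words that are absent from the dictionary are never touched, so the coverage of the dictionary bounds how much variety this kind of augmenter can produce.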


**Word embeddings augmenter**

Substitute words with alternatives predicted from context by a BERT model (contextual word embeddings).

In [17]:
import nlpaug.augmenter.word as naw

text = "The quick brown fox jumps over the lazy dog."

# Context-aware word substitutions (BERT)
aug = naw.ContextualWordEmbsAug(
    model_path="bert-base-uncased", 
    action="substitute",
    aug_p=0.25,          # fraction of tokens to modify
    aug_min=1, aug_max=3,
    top_k=50,            # sample from top-k predictions (adds variety)
    stopwords=["the","over","a","an"]  # don't touch function words
)

for i in range(5):
    print(i+1, "→", aug.augment(text))

1 → the quick flying fox stepped over the lazy dog .
2 → the space - fox jumps over the lazy dog .
3 → the large brown fox took over the lazy dog .
4 → the little brown fox jumps over the small dog .
5 → the quick - fox ran over the lazy dog .
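
The parameters `aug_p`, `aug_min`, and `aug_max` together bound how many tokens change per call. The sketch below shows one plausible way such parameters could combine: take a fraction of the token count, then clamp it to the allowed range. The helper `augment_count` is hypothetical, and the exact rounding and clamping inside nlpaug is an assumption here, not something confirmed by this notebook.

```python
import math

def augment_count(n_tokens, aug_p=0.25, aug_min=1, aug_max=3):
    """Assumed rule: round aug_p * n_tokens up, then clamp to [aug_min, aug_max]."""
    raw = math.ceil(aug_p * n_tokens)
    return max(aug_min, min(aug_max, raw))

for n in (2, 10, 40):
    print(n, "tokens ->", augment_count(n), "modified")
```

Under this assumption, a ten-token sentence would have at most three tokens replaced, which is consistent with the one or two substitutions visible in each output line above.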
