## Data Augmentation using NLPaug

This notebook demonstrates the usage of character and word augmenters. nlpaug also provides other types, such as augmenters for sentences, audio, and spectrogram inputs. All of these, and many more, can be found in the [GitHub repo](https://github.com/makcedward/nlpaug) and [docs](https://nlpaug.readthedocs.io/en/latest/) of nlpaug.

In [118]:
# To install only the requirements of this notebook, uncomment the lines below and run this cell

# ===========================

# !pip install numpy==1.19.5
# !pip install nlpaug==0.0.14
# !pip install wget==3.2
# !pip install matplotlib==3.2.2
# !pip install requests==2.23.0

# ===========================

In [119]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch2/ch2-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch2-requirements.txt"

# ===========================

In [120]:
# This will be the base text which we will be using throughout this notebook
text="The quick brown fox jumps over the lazy dog ."

In [121]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc
from nlpaug.util import Action
import os
# !git clone https://github.com/makcedward/nlpaug.git
os.environ["MODEL_DIR"] = 'nlpaug/model/'

### Augmentation at the Character Level


1.   OCR Augmenter: To read textual data from an image, we need an OCR (optical character recognition) model. Once the text is extracted from the image, it may contain errors such as '0' instead of 'o' or '2' instead of 'z'. This augmenter simulates such errors.
2.   Keyboard Augmenter: Typos are fairly common while typing or texting. This augmenter simulates them by substituting characters in words with ones located nearby on a keyboard.
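Both character augmenters described above boil down to table-driven character substitution. The following is a minimal sketch of that idea in plain Python, not nlpaug's implementation: the two mapping tables, the `substitute_chars` helper, and its `prob` and `seed` parameters are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical, hand-picked tables for illustration only;
# nlpaug ships its own, much larger mappings.
OCR_CONFUSIONS = {"o": ["0"], "z": ["2"], "l": ["1"], "s": ["5"], "g": ["9"]}
KEYBOARD_NEIGHBORS = {
    "q": ["w", "a"],
    "o": ["i", "p", "l"],
    "e": ["w", "r", "d"],
    "u": ["y", "i", "j"],
}


def substitute_chars(text, mapping, prob=0.3, seed=None):
    """Replace each character that has an entry in `mapping` with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = mapping.get(ch.lower())
        if candidates and rng.random() < prob:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


sample = "The quick brown fox jumps over the lazy dog ."
ocr_like = substitute_chars(sample, OCR_CONFUSIONS, prob=0.5, seed=0)
typo_like = substitute_chars(sample, KEYBOARD_NEIGHBORS, prob=0.5, seed=0)
print(ocr_like)
print(typo_like)
```

Swapping the mapping table is the only difference between the OCR-like and the keyboard-like variant in this sketch; the library augmenters expose more controls than this.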



In [122]:
# OCR augmenter
# import nlpaug.augmenter.char as nac

aug = nac.OcrAug()  
augmented_texts = aug.augment(text, n=3) # specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick bk0wn fox jumps over the 1a2y dog.', 'The quick brown fox jomp8 over the lazy du9.', 'The 9oick brown fox jomp8 0vek the lazy dog.']


In [123]:
# Keyboard Augmenter
# import nlpaug.augmenter.char as nac


aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=3) # specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The q tiFk grodn fox jumps oGe% the lazy dog.', 'The quick broQG fox jumOq Kv@r the lazy dog.', 'The quick bro@H fox juJ9s )ger the lazy dog.']


There are other types of character augmenters too. Their details are available in the links mentioned at the beginning of this notebook.

### Augmentation at the Word Level

Augmentation is important at the word level as well. Here we use a dictionary of spelling mistakes to substitute misspelled words, and word2vec embeddings to insert or substitute a similar word.
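The idea behind embedding-based insertion and substitution can be sketched without downloading a large model. The example below is only an illustration under stated assumptions: the 2-dimensional `TOY_VECTORS` are invented, and `cosine`, `nearest_words`, and `augment_words` are hypothetical helpers, not part of nlpaug or gensim. It picks one token that has a vector, looks up its nearest neighbours by cosine similarity, and either replaces the token or inserts the neighbour before it.

```python
import math
import random

# Invented 2-dimensional "embeddings" for illustration; a real word2vec
# model has hundreds of dimensions and a far larger vocabulary.
TOY_VECTORS = {
    "quick": (0.9, 0.1),
    "fast": (0.88, 0.15),
    "rapid": (0.85, 0.2),
    "lazy": (0.1, 0.9),
    "idle": (0.15, 0.88),
    "sluggish": (0.2, 0.85),
    "dog": (0.5, 0.5),
    "puppy": (0.52, 0.55),
    "hound": (0.55, 0.5),
}


def cosine(u, v):
    """Cosine similarity between two 2-d vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def nearest_words(word, top_k=2):
    """Return up to `top_k` vocabulary words most similar to `word`."""
    if word not in TOY_VECTORS:
        return []
    scored = [
        (cosine(TOY_VECTORS[word], vec), other)
        for other, vec in TOY_VECTORS.items()
        if other != word
    ]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]


def augment_words(text, action="substitute", seed=None):
    """Substitute one token with a similar word, or insert a similar word before it."""
    rng = random.Random(seed)
    tokens = text.split()
    positions = [i for i, tok in enumerate(tokens) if nearest_words(tok)]
    if not positions:
        return text
    i = rng.choice(positions)
    neighbor = rng.choice(nearest_words(tokens[i]))
    if action == "substitute":
        tokens[i] = neighbor
    elif action == "insert":
        tokens.insert(i, neighbor)
    else:
        raise ValueError(f"unknown action: {action}")
    return " ".join(tokens)


sample = "The quick brown fox jumps over the lazy dog ."
substituted = augment_words(sample, action="substitute", seed=0)
inserted = augment_words(sample, action="insert", seed=0)
print(substituted)
print(inserted)
```

With a real embedding model the neighbours are often less tidy than in this toy vocabulary, which is why the inserted words in the recorded outputs of this notebook can look unrelated to the sentence.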

**Spelling augmenter**


In [124]:
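# Download the spelling dictionary (uncomment the lines below to run).
# The next cell expects the file at data/spelling_en.txt.
import os
import wget

# os.makedirs("data", exist_ok=True)
# if not os.path.exists("data/spelling_en.txt"):
#     wget.download("https://raw.githubusercontent.com/makcedward/nlpaug/5238e0be734841b69651d2043df535d78a8cc594/nlpaug/res/word/spelling/spelling_en.txt", out="data/spelling_en.txt")
# else:
#     print("File already exists")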

In [125]:
# Substitute word by spelling mistake words dictionary
aug = naw.SpellingAug('data/spelling_en.txt')
augmented_texts = aug.augment(text)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The qchick brown fox jumps overt the grazy dog.']
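Conceptually, a spelling augmenter looks up each word in a table of known misspellings and swaps in one of them. The sketch below illustrates that lookup in plain Python. It is not nlpaug's code, it does not read `spelling_en.txt`, and the `MISSPELLINGS` entries, the `misspell` helper, and its `max_words` and `seed` parameters are hypothetical.

```python
import random

# Hypothetical misspelling table; the entries are invented for illustration
# and are not taken from nlpaug's spelling_en.txt.
MISSPELLINGS = {
    "quick": ["quik", "qiuck"],
    "over": ["ovr", "ovre"],
    "lazy": ["lasy", "lazzy"],
}


def misspell(text, max_words=2, seed=None):
    """Replace up to `max_words` tokens with a listed misspelling."""
    rng = random.Random(seed)
    tokens = text.split()
    # Only tokens that have an entry in the table can be altered.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in MISSPELLINGS]
    for i in rng.sample(candidates, min(max_words, len(candidates))):
        tokens[i] = rng.choice(MISSPELLINGS[tokens[i].lower()])
    return " ".join(tokens)


sample = "The quick brown fox jumps over the lazy dog ."
noisy = misspell(sample, max_words=2, seed=1)
print(noisy)
```

Capping the number of altered words keeps the augmented sentence readable, which matters when the label of the original sentence is reused for the augmented one.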


**Word embeddings augmenter**

In [126]:
import gzip
import shutil

gn_vec_zip_path = "data/bigdata/goog_vec/GoogleNews-vectors-negative300.bin.gz"
gn_vec_path = "data/bigdata/goog_vec/GoogleNews-vectors-negative300.bin"
# Extract the model (uncomment if only the .gz archive is present)
# with gzip.open(gn_vec_zip_path, 'rb') as f_in:
#     with open(gn_vec_path, 'wb') as f_out:
#         shutil.copyfileobj(f_in, f_out)


# gn_vec_path = "GoogleNews-vectors-negative300.bin"
# if not os.path.exists("GoogleNews-vectors-negative300.bin"):
#     if not os.path.exists("../Ch3/GoogleNews-vectors-negative300.bin"):
#         # Downloading the required model
#         if not os.path.exists("../Ch3/GoogleNews-vectors-negative300.bin.gz"):
#             if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
#                 wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")
#             gn_vec_zip_path = "GoogleNews-vectors-negative300.bin.gz"
#         else:
#             gn_vec_zip_path = "../Ch3/GoogleNews-vectors-negative300.bin.gz"
#         # Extracting the required model
#         with gzip.open(gn_vec_zip_path, 'rb') as f_in:
#             with open(gn_vec_path, 'wb') as f_out:
#                 shutil.copyfileobj(f_in, f_out)
#     else:
#         gn_vec_path = "../Ch3/" + gn_vec_path
print(f"Model at {gn_vec_path}")

Model at data/bigdata/goog_vec/GoogleNews-vectors-negative300.bin


Insert word randomly by word embeddings similarity

In [129]:
# model_type: word2vec, glove or fasttext
# Initialize the WordEmbsAug class
import gensim
# aug = naw.WordEmbsAug(
#     model_type='word2vec', model_path=gn_vec_path,
#     action="insert")

model = gensim.models.KeyedVectors.load_word2vec_format(gn_vec_path, binary=True)
aug = naw.WordEmbsAug(model_type='word2vec', model=model, action="insert")

# augmented_text = aug.augment(text)
# print("Original:")
# print(text)
# print("Augmented Text:")
# print(augmented_text)

In [146]:
# get_vocab() is deprecated and missing on the loaded model, so attach a
# small shim that returns the vocabulary list
def temp():
    return model.index_to_key
model.get_vocab = temp

augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['please The CSG quick brown fox jumps over the ethanol/## lazy dog.']


Substitute word by word2vec similarity


In [147]:
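# aug = naw.WordEmbsAug(
#     model_type='word2vec', model_path=gn_vec_path,
#     action="substitute")

model = gensim.models.KeyedVectors.load_word2vec_format(gn_vec_path, binary=True)
# Substitution needs action="substitute". The output recorded below keeps all
# original words and adds new ones, so it appears to come from a run with action="insert".
aug = naw.WordEmbsAug(model_type='word2vec', model=model, action="substitute")

# get_vocab() is deprecated and missing on the loaded model, so attach a
# small shim that returns the vocabulary list
def temp():
    return model.index_to_key
model.get_vocab = temp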

augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['Ravensbruck The seliciclib quick brown fox jumps over the lazy Velupillai dog.']


nlpaug offers many more features. Visit the GitHub repo and documentation for further details.