## Data Augmentation using NLPaug

This notebook demonstrates the usage of a character augmenter and a word augmenter. There are other types as well, such as augmenters for sentences, audio, and spectrogram inputs. All of these types and many more can be found at the [github repo](https://github.com/makcedward/nlpaug) and [docs](https://nlpaug.readthedocs.io/en/latest/) of nlpaug.

In [1]:
# To install only the requirements of this notebook, uncomment the lines below and run this cell

# ===========================

# !pip install numpy==1.19.5
# !pip install nlpaug==0.0.14
# !pip install wget==3.2
# !pip install matplotlib==3.2.2
# !pip install requests==2.23.0

# ===========================


In [2]:
# To install the requirements for the entire chapter, uncomment the lines below and run this cell

# ===========================

# try :
#     import google.colab
#     !curl https://raw.githubusercontent.com/practical-nlp/practical-nlp/master/Ch2/ch2-requirements.txt | xargs -n 1 -L 1 pip install
# except ModuleNotFoundError :
#     !pip install -r "ch2-requirements.txt"

# ===========================

In [4]:
# This will be the base text which we will be using throughout this notebook
text="The quick brown fox jumps over the lazy dog ."

In [2]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action
import os
# !git clone https://github.com/makcedward/nlpaug.git
os.environ["MODEL_DIR"] = 'nlpaug/model/'

### Augmentation at the Character Level


1.   OCR Augmenter: To read textual data from an image, we need an OCR (optical character recognition) model. Once the text is extracted from the image, there may be errors such as '0' instead of 'o', '2' instead of 'z', and other similar mistakes. This augmenter simulates those errors.
2.   Keyboard Augmenter: Typos are fairly common while typing or texting. This augmenter simulates them by substituting characters in words with characters from nearby keys on a keyboard.
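
To make the idea behind both character augmenters concrete, here is a minimal pure-Python sketch of table-driven character substitution. The `OCR_CONFUSIONS` table, the `substitute_chars` helper, and its `char_p` parameter are hypothetical names chosen for illustration; this is not how nlpaug is implemented internally, and its real mapping tables are larger.

```python
import random

# Hypothetical confusion table: characters an OCR engine might mix up.
# Illustrative subset only, not nlpaug's actual mapping.
OCR_CONFUSIONS = {
    "o": ["0"],
    "l": ["1"],
    "z": ["2"],
    "s": ["5"],
    "g": ["9"],
}


def substitute_chars(text, table, char_p=0.3, seed=None):
    """Replace each character that appears in `table` with probability `char_p`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        candidates = table.get(ch.lower())
        if candidates and rng.random() < char_p:
            out.append(rng.choice(candidates))
        else:
            out.append(ch)
    return "".join(out)


sample = "The quick brown fox jumps over the lazy dog ."
print(substitute_chars(sample, OCR_CONFUSIONS, char_p=0.5, seed=0))
```

A keyboard-style augmenter could reuse the same function with a table that maps each key to its physical neighbours (for example `"q": ["w", "a"]`), which is why the two augmenters are presented together.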



In [5]:
# OCR augmenter
# import nlpaug.augmenter.char as nac

aug = nac.OcrAug()  
augmented_texts = aug.augment(text, n=3) # specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The 9oick brown fox jumps over the 1a2y d09.', 'The quick bk0wn fox jumps over the lazy dog.', 'The qoicr brown fox jumps uvek the lazy du9.']


In [6]:
# Keyboard Augmenter
# import nlpaug.augmenter.char as nac


aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=3) # specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The 1kick brown fox j*mLs over the OaXy dog.', 'The quick brLEn fox jumps iFer the lqay dog.', 'The quick br(en fox jImpC over the Kasy dog.']


There are other types of character augmenters too. Their details are available in the links mentioned at the beginning of this notebook.

### Augmentation at the Word Level

Augmentation is important at the word level as well. Here we use a spelling dictionary, WordNet, and word embeddings such as word2vec to insert or substitute similar words.

**Spelling augmenter**


In [7]:
# Downloading the required txt file
import wget

if not os.path.exists("spelling_en.txt"):
    wget.download("https://raw.githubusercontent.com/makcedward/nlpaug/5238e0be734841b69651d2043df535d78a8cc594/nlpaug/res/word/spelling/spelling_en.txt")
else:
    print("File already exists")

In [7]:
# Substitute words using a dictionary of common spelling mistakes
aug = naw.SpellingAug('spelling_en.txt')
augmented_texts = aug.augment(text)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick brown fox jumps over ther lszy doga.']
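
The dictionary-based substitution above can be sketched in a few lines of plain Python. The `MISSPELLINGS` entries, the `misspell` helper, and its `word_p` parameter are made up for illustration; they are not taken from `spelling_en.txt` and do not reproduce the internals of `SpellingAug`.

```python
import random

# Hypothetical misspelling dictionary: correct word -> common misspellings.
# Illustrative entries only; not read from spelling_en.txt.
MISSPELLINGS = {
    "the": ["teh", "ther"],
    "quick": ["quik", "qick"],
    "lazy": ["lasy", "lazey"],
}


def misspell(text, table, word_p=0.3, seed=None):
    """Swap each word that has an entry in `table` with probability `word_p`."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        options = table.get(token.lower())
        if options and rng.random() < word_p:
            out.append(rng.choice(options))
        else:
            out.append(token)
    return " ".join(out)


sample = "The quick brown fox jumps over the lazy dog ."
print(misspell(sample, MISSPELLINGS, word_p=0.5, seed=0))
```

Words without a dictionary entry are left untouched, so the amount of noise depends on both `word_p` and the coverage of the dictionary.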


**Word embeddings augmenter**

In [14]:
!unzip ../glove.6B.zip

Archive:  ../glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


In [18]:
!ls ../glove

ls: ../glove: No such file or directory


In [15]:
# import gzip
# import shutil
# import wget


# import gensim.downloader as api
# wv = api.load('word2vec-google-news-300')

# gn_vec_path = "GoogleNews-vectors-negative300.bin"
# print(f"Model at {gn_vec_path}")

# if not os.path.exists("GoogleNews-vectors-negative300.bin"):
#     if not os.path.exists("../Ch3/GoogleNews-vectors-negative300.bin"):
#         # Downloading the required model
#         if not os.path.exists("../Ch3/GoogleNews-vectors-negative300.bin.gz"):
#             if not os.path.exists("GoogleNews-vectors-negative300.bin.gz"):
#                 wget.download("https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz")
#             gn_vec_zip_path = "GoogleNews-vectors-negative300.bin.gz"
#         else:
#             gn_vec_zip_path = "../Ch3/GoogleNews-vectors-negative300.bin.gz"
#         # Extracting the required model
#         with gzip.open(gn_vec_zip_path, 'rb') as f_in:
#             with open(gn_vec_path, 'wb') as f_out:
#                 shutil.copyfileobj(f_in, f_out)
#     else:
#         gn_vec_path = "../Ch3/" + gn_vec_path



Insert word randomly by word embeddings similarity

In [23]:
!ls glove/glove.6B.50d.txt

glove/glove.6B.50d.txt
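
As a rough illustration of what inserting a word by embedding similarity can mean, the sketch below picks a known word, finds its nearest neighbour by cosine similarity, and inserts that neighbour right after it. The three-dimensional `TOY_VECTORS`, and the helpers `cosine`, `nearest`, and `insert_similar`, are invented for this example. This is one simple interpretation of the idea, not nlpaug's implementation, and real GloVe vectors would be loaded from a file such as the one listed above.

```python
import math
import random

# Toy 3-dimensional vectors invented for illustration; real GloVe vectors
# have 50 to 300 dimensions and are loaded from a file.
TOY_VECTORS = {
    "quick": (0.9, 0.1, 0.0),
    "fast": (0.85, 0.15, 0.05),
    "lazy": (0.1, 0.9, 0.0),
    "idle": (0.15, 0.85, 0.1),
    "dog": (0.0, 0.1, 0.9),
    "puppy": (0.05, 0.15, 0.85),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors):
    """Return the most similar other word, or None if `word` is unknown."""
    if word not in vectors:
        return None
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine(vectors[word], vectors[w]))


def insert_similar(text, vectors, seed=None):
    """Insert the nearest neighbour of one randomly chosen known word."""
    rng = random.Random(seed)
    tokens = text.split()
    known = [i for i, tok in enumerate(tokens) if tok.lower() in vectors]
    if not known:
        return text  # nothing to anchor an insertion on
    i = rng.choice(known)
    tokens.insert(i + 1, nearest(tokens[i].lower(), vectors))
    return " ".join(tokens)


sample = "The quick brown fox jumps over the lazy dog ."
print(insert_similar(sample, TOY_VECTORS, seed=0))
```

With a real embedding file the structure stays the same; only the vector table and the cost of the nearest-neighbour search change.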


In [28]:
import nlpaug.augmenter.char as nac

test_sentence = 'I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace'

aug = nac.KeyboardAug(name='Keyboard_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3, aug_word_p=0.3, 
                      aug_word_min=1, aug_word_max=10, stopwords=None, tokenizer=None, reverse_tokenizer=None, 
                      include_special_char=True, include_numeric=True, include_upper_case=True, lang='en', verbose=0, 
                      stopwords_regex=None, model_path=None, min_char=4)
 
test_sentence_aug = aug.augment(test_sentence)
print(test_sentence)
print(test_sentence_aug)

I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace
['I went Shopping 5kday, and my tr(l/y was billrd with BXnsnaC. I xls8 had food at H8rgur palace']
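
The `aug_word_p`, `aug_word_min`, and `aug_word_max` arguments together control how many words get altered. The sketch below shows one plausible way such a count could be derived: take a fraction of the tokens, round up, and clamp to the given bounds. The `estimate_aug_count` helper is hypothetical and the exact rule is an assumption; nlpaug's behaviour may differ between versions, so treat this as an aid to intuition only.

```python
import math


def estimate_aug_count(size, aug_min, aug_max, aug_p):
    """Assumed rule: ceil(aug_p * size), clamped to [aug_min, aug_max]."""
    count = math.ceil(aug_p * size)
    if aug_min is not None and count < aug_min:
        return aug_min
    if aug_max is not None and count > aug_max:
        return aug_max
    return count


test_sentence = ("I went Shopping Today, and my trolly was filled with Bananas. "
                 "I also had food at burgur palace")
n_tokens = len(test_sentence.split())  # whitespace tokens only
print(n_tokens, estimate_aug_count(n_tokens, aug_min=1, aug_max=10, aug_p=0.3))
```

Under this assumption the 18-token test sentence yields a count of six, which happens to match the six altered words in the recorded output above.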


In [29]:
aug = nac.OcrAug(name='OCR_Aug', aug_char_min=1, aug_char_max=10, aug_char_p=0.3, aug_word_p=0.3, aug_word_min=1, 
                 aug_word_max=10, stopwords=None, tokenizer=None, reverse_tokenizer=None, verbose=0, stopwords_regex=None, 
                 min_char=1)
 
test_sentence_aug = aug.augment(test_sentence)
print(test_sentence)
print(test_sentence_aug)

I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace
['1 went 8huppin9 Tuday, and my trolly was filled with Eanana8. I also had food at burgur palace']


In [30]:
aug = naw.SynonymAug(aug_src='wordnet', model_path=None, name='Synonym_Aug', aug_min=1, aug_max=10, aug_p=0.3, lang='eng', 
                     stopwords=None, tokenizer=None, reverse_tokenizer=None, stopwords_regex=None, force_reload=False, 
                     verbose=0)
 
test_sentence_aug = aug.augment(test_sentence)
print(test_sentence)
print(test_sentence_aug)

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/linghuang/nltk_data...
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /Users/linghuang/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/linghuang/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace
['One went Shopping Today, and my trolly was occupy with Bananas. I besides get nutrient at burgur palace']


In [31]:
aug = naw.AntonymAug(name='Antonym_Aug', aug_min=1, aug_max=10, aug_p=0.3, lang='eng', stopwords=None, tokenizer=None, 
                     reverse_tokenizer=None, stopwords_regex=None, verbose=0)
 
test_sentence_aug = aug.augment("very beautiful")
print("very beautiful")
print(test_sentence_aug)

very beautiful
['very ugly']


In [32]:
aug = naw.SpellingAug(dict_path=None, name='Spelling_Aug', aug_min=1, aug_max=10, aug_p=0.3, stopwords=None, 
                      tokenizer=None, reverse_tokenizer=None, include_reverse=True, stopwords_regex=None, verbose=0)
 
test_sentence_aug = aug.augment(test_sentence)
print(test_sentence)
print(test_sentence_aug)

I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace
['0I went Shoping Today, Ande my trolly has filled with Bananas. I also had Feed at burgur palce']


In [33]:

aug = naw.SplitAug(name='Split_Aug', aug_min=1, aug_max=10, aug_p=0.3, min_char=4, stopwords=None, tokenizer=None, 
                   reverse_tokenizer=None, stopwords_regex=None, verbose=0)
 
test_sentence_aug = aug.augment(test_sentence)
print(test_sentence)
print(test_sentence_aug)

I went Shopping Today, and my trolly was filled with Bananas. I also had food at burgur palace
['I went Sh opping Tod ay, and my trol ly was fill ed wi th Bananas. I also had food at burgur pal ace']
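
The splitting behaviour shown above can be approximated with a short sketch: each sufficiently long token is, with some probability, cut into two pieces at a random position. The `split_words` helper and its `word_p` parameter are hypothetical names; `min_char` mirrors the argument of the same name passed to `SplitAug`, but the logic here is an assumption and not the library's code.

```python
import random


def split_words(text, word_p=0.3, min_char=4, seed=None):
    """Split tokens of at least `min_char` characters with probability `word_p`."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        if len(token) >= min_char and rng.random() < word_p:
            # Cut between 1 and len-1 so both halves are non-empty.
            cut = rng.randint(1, len(token) - 1)
            out.append(token[:cut] + " " + token[cut:])
        else:
            out.append(token)
    return " ".join(out)


sample = "I went Shopping Today, and my trolly was filled with Bananas."
print(split_words(sample, word_p=0.5, seed=0))
```

Because only spaces are inserted, removing all spaces from the augmented text gives the same character sequence as removing them from the original.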


Substitute word by word2vec similarity


In [11]:
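# gn_vec_path is only defined in the commented-out download cell above;
# set it here to the local word2vec binary before running this cell.
gn_vec_path = "GoogleNews-vectors-negative300.bin"
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=gn_vec_path,
    action="substitute")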
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
His quick brown fox jumps morethan the whiny dog .


nlpaug offers many more features. You can visit the github repo and documentation for further details.