## Data Augmentation

A larger dataset usually helps when training neural networks, but collecting more data isn't always feasible. Data augmentation offers a way around this. In natural language processing, it makes a text dataset bigger by creating modified copies of the text that is already there.

Several methods are commonly used for text augmentation:

- Back translation
- Synonym replacement
- Random swap
- Random deletion
- Sentence shuffling
- Word embeddings
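
Two of the simpler methods in this list, random swap and random deletion, can be sketched in plain Python. This is only an illustration under our own assumptions: the function names, the `n_swaps` and `p` parameters, and the abridged sample sentence are not part of the notebook.

```python
import random

def random_swap(words, n_swaps=1, rng=None):
    """Return a copy of `words` with `n_swaps` random pairs of positions exchanged."""
    if rng is None:
        rng = random.Random()
    out = list(words)
    if len(out) < 2:
        return out  # nothing to swap
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out

def random_deletion(words, p=0.1, rng=None):
    """Drop each word with probability `p`, always keeping at least one word."""
    if rng is None:
        rng = random.Random()
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

# abridged sample sentence (an assumption for this sketch)
words = "Gretel ran away and the witch was burnt".split()
rng = random.Random(0)  # fixed seed so the run is repeatable
swapped = random_swap(words, n_swaps=2, rng=rng)
shortened = random_deletion(words, p=0.3, rng=rng)
print(" ".join(swapped))
print(" ".join(shortened))
```

Both functions return new lists and leave the input untouched, so the original sentence can be reused to generate several variants.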

## Synonym Replacement

Synonym replacement creates a very different-looking corpus by replacing words with their synonyms. The process has several steps.

First, load a corpus or create some sample text. Next, tokenize the text with NLTK's tokenizer to get a list of words. Then, tag each token in that list with its part of speech using NLTK's part-of-speech (POS) tagger. We don't want to replace every single word, so we'll use the POS tags to leave proper nouns (NNP), determiners (DT), and personal pronouns (PRP) unchanged. Finally, using a loop, we'll look up the WordNet synsets of each remaining word, collect their lemmas in a synonyms list, and pick one of them at random. The chosen words make up the augmented text.

In [1]:
import nltk
#download necessary tools
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')

#read the whole Hansel and Gretel story; this is the text we'll augment
#(to augment only the short sample below, pass "original" to the tokenizer instead of "text")
text = open("Hansel_Gretel.txt").read()

#load the libraries we'll need for the project
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
from random import randint
import nltk.data

[nltk_data] Downloading package punkt to /Users/Peggy/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/Peggy/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to /Users/Peggy/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
#a shorter sample: a couple of sentences from Hansel and Gretel
original = "Then Gretel gave her a push that drove her far into it, and shut the iron door, and fastened the bolt. Then she began to howl quite horribly, but Gretel ran away and the godless witch was miserably burnt to death."
#start with an empty string. This will be used to build the augmented text
augmented = ""

## NLTK Tokenizer

NLTK's tokenizer breaks text into smaller chunks, for example sentences into words (Guru99, 2021). These chunks are called tokens. POS tagging then pairs each token with its part of speech, creating a new list of (word, tag) tuples (NLTK, 2021). The following example shows the first 20 tokens of the text with their part of speech.

In [3]:
#load the Punkt sentence tokenizer (not used directly below; word_tokenize does the tokenizing)
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

#the tokenized text is now a list of words
tokens = word_tokenize(text)

#use part of speech tagging, creating a new list of tokenized, tagged words
tagged = nltk.pos_tag(tokens)
tagged[:20]

[('Hard', 'NNP'),
 ('by', 'IN'),
 ('a', 'DT'),
 ('great', 'JJ'),
 ('forest', 'NN'),
 ('dwelt', 'VBD'),
 ('a', 'DT'),
 ('poor', 'JJ'),
 ('wood-cutter', 'NN'),
 ('with', 'IN'),
 ('his', 'PRP$'),
 ('wife', 'NN'),
 ('and', 'CC'),
 ('his', 'PRP$'),
 ('two', 'CD'),
 ('children', 'NNS'),
 ('.', '.'),
 ('The', 'DT'),
 ('boy', 'NN'),
 ('was', 'VBD')]
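
The tags above decide which words are candidates for replacement. As a minimal sketch, the check can be run on the first few pairs copied from that output. The names `SKIP_TAGS` and `is_replaceable` are our own for illustration and do not appear in the notebook.

```python
# first few (word, tag) pairs copied from the tagger output above
tagged_sample = [
    ("Hard", "NNP"), ("by", "IN"), ("a", "DT"), ("great", "JJ"),
    ("forest", "NN"), ("dwelt", "VBD"), ("a", "DT"), ("poor", "JJ"),
    ("wood-cutter", "NN"), ("with", "IN"), ("his", "PRP$"), ("wife", "NN"),
]

# tags that are left untouched: proper nouns, determiners, personal pronouns
SKIP_TAGS = {"NNP", "DT", "PRP"}

def is_replaceable(tag):
    """True when a token with this tag may be swapped for a synonym."""
    return tag not in SKIP_TAGS

candidates = [word for word, tag in tagged_sample if is_replaceable(tag)]
skipped = [word for word, tag in tagged_sample if not is_replaceable(tag)]
print(candidates)
print(skipped)
```

Because the comparison is an exact match, "his" (tagged PRP$, a possessive pronoun) still counts as a candidate, while "Hard" is skipped because the tagger labelled it NNP.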

## WordNet Synset

WordNet is a dictionary designed for NLP projects (Wordnet, 2021). A Synset is a group of synonyms that share one meaning, and wordnet.synsets() looks up the Synsets for a word. Each Synset in the returned list is named with a word, a part of speech code, and a sense number. The codes are .n for noun, .v for verb, .a for adjective, and .r for adverb. In the following example, the word "dwelt" (tokens[5]) returns five Synsets, all of them verbs.

In [4]:
syns = wordnet.synsets(tokens[5])
syns

[Synset('brood.v.01'),
 Synset('dwell.v.02'),
 Synset('populate.v.01'),
 Synset('dwell.v.04'),
 Synset('harp.v.01')]
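
The names in this output follow a word.pos.number pattern, which can be taken apart with plain string handling. The sketch below works on the name strings copied from the output rather than on live Synset objects. The helper `parse_synset_name` and the `POS_LABELS` table are our own illustrative names, and the extra "s" code (adjective satellite) is included because WordNet also uses it.

```python
# part of speech codes used in synset names
POS_LABELS = {
    "n": "noun",
    "v": "verb",
    "a": "adjective",
    "s": "adjective satellite",
    "r": "adverb",
}

def parse_synset_name(name):
    """Split a name such as 'dwell.v.02' into (word, part of speech, sense number)."""
    # split from the right so a word containing dots stays in one piece
    word, pos, sense = name.rsplit(".", 2)
    return word, POS_LABELS.get(pos, pos), int(sense)

# names copied from the output above
names = ["brood.v.01", "dwell.v.02", "populate.v.01", "dwell.v.04", "harp.v.01"]
parsed = [parse_synset_name(name) for name in names]
for item in parsed:
    print(item)
```

The sense number shows that "dwell" appears twice because two different meanings of the verb match the token.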

## Creating the Augmented Text

We'll use a for loop to build the augmented text as a string. We'll loop through each word in our tokens list. If that word is tagged with NNP, DT, or PRP, we won't collect any synonyms for it, so the original word is kept. For words without those tags, we will add the lemmas of each Synset to the empty synonyms list. The lemmas are the words in the Synset (Educative, 2021). Sometimes words do not have synonyms and sometimes they have more than one. If there is at least one synonym, we'll choose one at random and add it to the "augmented" string. If there are no synonyms, the original word will be added to the augmented string.

In [5]:
#loop through each tokenized word in the tokens list
for i in range(0,len(tokens)):
    #make an empty list for this word's synonyms
    synonyms = []
    
    #using the synonyms in synsets...
    for syn in wordnet.synsets(tokens[i]):

        #break for proper nouns, determiners, and personal pronouns
        if tagged[i][1] == 'NNP' or tagged[i][1] == 'DT' or tagged[i][1] == 'PRP':
            break
        
        #appending synonyms to the list
        for j in syn.lemmas():
            synonyms.append(j.name())
    
    #if the word has at least one synonym, we'll pull just one from the list
    if len(synonyms) > 0:
        #pick a random index into the list of choices
        synonym = synonyms[randint(0,len(synonyms)-1)]
        #take the augmented text and add the synonym to it
        augmented = augmented + " " + synonym
    else:
        #if there is no synonym, then take the augmented text and add the original word
        augmented = augmented + " " + tokens[i]

print(augmented)

 Hard away a neat timberland harp a poor wood-cutter with his wife and his II child . The boy be telephone Hansel and the girlfriend Gretel . He birth piddling to morsel and to pause , and once when groovy famine light on the put_down , he could nobelium long pander even_out daily simoleons . directly when he guess over this by night inward his eff , and toss about in his anxiety , he groan and say to his wife : ‘ What cost to become of us ? How ar we to feed our poor child , when we no yearner suffer anything even for ourselves ? ’ ‘ I ’ ll say you what , husband , ’ answer the cleaning_woman , ‘ early tomorrow sunup we volition lead the tyke out into the afforest to where it constitute the deep ; in_that_location we leave wakeful a fire for them , and sacrifice each of them single more piece of kale , and so we will go to our figure_out and leave them lonely . They will not feel the way home once_again , and we shall be rid of them. ’ ‘ No , wife , ’ suppose the Isle_of_Man , ‘ I wil
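
The output contains multiword lemmas joined with underscores, such as put_down and Isle_of_Man. As a hedged sketch, the same loop can be written as a function that turns those underscores back into spaces. To keep it runnable without the WordNet download, it takes a lookup function. `TOY_SYNONYMS` is a hypothetical stand-in for WordNet, and `augment` and `SKIP_TAGS` are our own names, not part of the notebook.

```python
import random

# tags that are left untouched, as in the loop above
SKIP_TAGS = {"NNP", "DT", "PRP"}

# hypothetical stand-in for WordNet: maps a word to candidate lemma names
TOY_SYNONYMS = {
    "forest": ["wood", "woods", "timberland"],
    "land": ["put_down", "set_down"],
}

def augment(tagged, lookup, rng):
    """Replace each eligible word with a random synonym returned by `lookup`."""
    out = []
    for word, tag in tagged:
        # skipped tags get no options, so the original word is kept
        options = [] if tag in SKIP_TAGS else lookup(word)
        if options:
            # multiword lemmas use underscores; turn them back into spaces
            out.append(rng.choice(options).replace("_", " "))
        else:
            out.append(word)
    return " ".join(out)

tagged_sample = [("a", "DT"), ("great", "JJ"), ("forest", "NN")]
result = augment(tagged_sample, lambda w: TOY_SYNONYMS.get(w, []), random.Random(0))
print(result)
```

Collecting the words in a list and joining them once also avoids the leading space that repeated string concatenation leaves at the start of the output.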

## References

Educative. (2021). How to use WordNet in Python. https://www.educative.io/edpresso/how-to-use-wordnet-in-python

Guru99. (2021). NLTK Tokenize: Words and Sentences Tokenizer with Example. https://www.guru99.com/tokenize-words-sentences-nltk.html

NLTK. (2021). 5. Categorizing and Tagging Words. https://www.nltk.org/book/ch05.html

Wordnet. (2021). What is WordNet? https://wordnet.princeton.edu