### Augmentation helps a model generalize better when training data is scarce.

In computer vision, augmentation generates additional images through transformations such as rotation, changes in lighting conditions, and cropping.

In text augmentation, it generates additional text data from existing examples, which is a cost-effective way to enlarge a dataset.

In [1]:
import re
import spacy
import random


In [2]:
nlp = spacy.load("en_core_web_md")

In [3]:
sample_text = "I visited this place on Valentine's day in the evening. We had to wait for around 15 minutes to get a table here. We utilised that time in getting pictures ;) This place had a very chill vibe with loud music in the background. Live telecast of a cricket match was also going on. It was a little difficult to hold a conversation with someone."

In [4]:
doc = nlp(sample_text)

In [5]:
doc

I visited this place on Valentine's day in the evening. We had to wait for around 15 minutes to get a table here. We utilised that time in getting pictures ;) This place had a very chill vibe with loud music in the background. Live telecast of a cricket match was also going on. It was a little difficult to hold a conversation with someone.

### 1. Remove Words Randomly

Steps:

1. Tokenize the sample text
2. Find the number of tokens
3. Set the percentage of words to keep
4. Randomly omit the words to generate new text data

In [6]:
# tokenize the sample text

tokens = []
for token in doc:
    tokens.append(token.text)
    
# find the number of tokens
l = len(tokens)

# number of tokens to keep
random_tokens_count = round(0.8*l)

In [7]:
random_tokens_count

57

In [12]:
# list to store augmented text
augmented_texts = []

# number of versions of augmented text
n = 5

for i in range(n):
    # randomly generate indices of tokens to keep
    new_tokens_index = random.sample(range(l), random_tokens_count)
    new_tokens_index.sort()
    
    # generate augmented list of tokens
    new_tokens = [tokens[t] for t in new_tokens_index]
    
    # create an augmented version of sample_text
    augmented_texts.append(' '.join(new_tokens))

In [13]:
# print augmented texts
for i in augmented_texts:
    print(i + '\n')

this place on Valentine day in the evening . had to for around 15 get a table here . We utilised time getting pictures ;) This place had a very chill with loud music in background . Live telecast of cricket match was also going on . It was a little difficult hold a conversation with someone

I visited this place on Valentine 's day in the evening . We to wait for 15 get table here We utilised that time in getting pictures ;) This place had very vibe with loud music the background . Live of a cricket match going . It was a little difficult to hold a conversation with someone

I visited this on day in the evening . had to wait for around 15 minutes to get a table We utilised time in getting pictures ;) This place had a very chill vibe loud music in the background . telecast of a cricket match was also going on . was a difficult hold a conversation with

I visited place on in the evening . had to wait for around 15 to get a table . We utilised that time in getting pictures ;) This place a
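
The cells above can be folded into one reusable function. The following is a minimal sketch, not part of the original notebook: the name `drop_random_tokens` and its parameters (`keep_ratio`, `n_versions`, `seed`) are assumptions introduced for illustration, and the example tokens come from a plain whitespace split rather than spaCy so that the snippet runs on its own.

```python
import random


def drop_random_tokens(tokens, keep_ratio=0.8, n_versions=5, seed=None):
    """Return n_versions token lists, each keeping a random subset in original order.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)  # a local generator makes results reproducible
    keep_count = round(keep_ratio * len(tokens))
    versions = []
    for _ in range(n_versions):
        # sample indices without replacement, then sort to preserve word order
        kept = sorted(rng.sample(range(len(tokens)), keep_count))
        versions.append([tokens[i] for i in kept])
    return versions


# whitespace tokenization is an assumption made to keep the example self-contained
example_tokens = "We had to wait for around 15 minutes to get a table here .".split()
for version in drop_random_tokens(example_tokens, keep_ratio=0.8, n_versions=3, seed=42):
    print(" ".join(version))
```

With spaCy, the same function could be called as `drop_random_tokens([token.text for token in doc])`. Passing a fixed `seed` makes the augmented versions repeatable across runs.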

### 2. Concatenate POS Tags to Text

In [14]:
# tokenize sample_text
tokens_pos = []
for token in doc:
    tokens_pos.append(token.text+"_"+token.pos_)

In [16]:
print(tokens_pos)

['I_PRON', 'visited_VERB', 'this_DET', 'place_NOUN', 'on_ADP', 'Valentine_PROPN', "'s_PART", 'day_NOUN', 'in_ADP', 'the_DET', 'evening_NOUN', '._PUNCT', 'We_PRON', 'had_VERB', 'to_PART', 'wait_VERB', 'for_ADP', 'around_ADP', '15_NUM', 'minutes_NOUN', 'to_PART', 'get_VERB', 'a_DET', 'table_NOUN', 'here_ADV', '._PUNCT', 'We_PRON', 'utilised_VERB', 'that_DET', 'time_NOUN', 'in_ADP', 'getting_VERB', 'pictures_NOUN', ';)_PUNCT', 'This_DET', 'place_NOUN', 'had_VERB', 'a_DET', 'very_ADV', 'chill_ADJ', 'vibe_NOUN', 'with_ADP', 'loud_ADJ', 'music_NOUN', 'in_ADP', 'the_DET', 'background_NOUN', '._PUNCT', 'Live_ADJ', 'telecast_NOUN', 'of_ADP', 'a_DET', 'cricket_NOUN', 'match_NOUN', 'was_VERB', 'also_ADV', 'going_VERB', 'on_PART', '._PUNCT', 'It_PRON', 'was_VERB', 'a_DET', 'little_ADJ', 'difficult_ADJ', 'to_PART', 'hold_VERB', 'a_DET', 'conversation_NOUN', 'with_ADP', 'someone_NOUN', '._PUNCT']


In [17]:
print(" ".join(tokens_pos))

I_PRON visited_VERB this_DET place_NOUN on_ADP Valentine_PROPN 's_PART day_NOUN in_ADP the_DET evening_NOUN ._PUNCT We_PRON had_VERB to_PART wait_VERB for_ADP around_ADP 15_NUM minutes_NOUN to_PART get_VERB a_DET table_NOUN here_ADV ._PUNCT We_PRON utilised_VERB that_DET time_NOUN in_ADP getting_VERB pictures_NOUN ;)_PUNCT This_DET place_NOUN had_VERB a_DET very_ADV chill_ADJ vibe_NOUN with_ADP loud_ADJ music_NOUN in_ADP the_DET background_NOUN ._PUNCT Live_ADJ telecast_NOUN of_ADP a_DET cricket_NOUN match_NOUN was_VERB also_ADV going_VERB on_PART ._PUNCT It_PRON was_VERB a_DET little_ADJ difficult_ADJ to_PART hold_VERB a_DET conversation_NOUN with_ADP someone_NOUN ._PUNCT
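
The tagging step can also be written as a small helper that accepts any sequence of objects exposing `.text` and `.pos_`, as spaCy tokens do. This is a rough sketch under that assumption: `attach_pos`, its `sep` parameter, and the `FakeToken` stand-in are hypothetical names that do not appear in the original notebook.

```python
from collections import namedtuple


def attach_pos(tokens, sep="_"):
    """Join each token's text and coarse POS tag, e.g. 'place_NOUN'."""
    return " ".join(f"{tok.text}{sep}{tok.pos_}" for tok in tokens)


# Hypothetical stand-in for spaCy tokens so the sketch runs without a model
FakeToken = namedtuple("FakeToken", ["text", "pos_"])
demo = [FakeToken("loud", "ADJ"), FakeToken("music", "NOUN")]
print(attach_pos(demo))  # loud_ADJ music_NOUN
```

Since a spaCy `Doc` yields tokens with these two attributes, `attach_pos(doc)` should produce the same string as the cell above.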


### 3. Replace Named Entities with their Categories

In [18]:
aug_text = sample_text
for ent in doc.ents:
    # escape the entity text so regex metacharacters are matched literally
    aug_text = re.sub(re.escape(ent.text), ent.label_, aug_text)
    
print(aug_text)

I visited this place on PERSON's day in the evening. We had to wait for TIME to get a table here. We utilised that time in getting pictures ;) This place had a very chill vibe with loud music in the background. Live telecast of a cricket match was also going on. It was a little difficult to hold a conversation with someone.
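
The loop above substitutes by matching entity text, so every occurrence of the same string in the document is replaced, not only the recognized mention. One alternative is to replace by character offsets instead. The sketch below assumes non-overlapping spans given as `(start, end, label)` tuples; `replace_spans` is a hypothetical helper, and the span in the example is written by hand rather than produced by a model.

```python
def replace_spans(text, spans):
    """Replace each (start, end, label) character span in text with its label.

    Hypothetical helper; assumes the spans do not overlap.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(label)               # the category replaces the span
        cursor = end
    pieces.append(text[cursor:])           # remainder after the last span
    return "".join(pieces)


text = "We had to wait for around 15 minutes to get a table."
# hand-written span standing in for (ent.start_char, ent.end_char, ent.label_)
start = text.index("around 15 minutes")
spans = [(start, start + len("around 15 minutes"), "TIME")]
print(replace_spans(text, spans))  # We had to wait for TIME to get a table.
```

With spaCy, the spans could be built as `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`.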


As a challenge, download the data from www.yelp.com/dataset and try these techniques on it.