<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center> 

# The NLPAug Library

We have studied several data augmentation (DA) techniques for NLP and have even implemented some of them. However, implementing them effectively from scratch takes a lot of work.

We have already studied one library, **textaugment**. In this notebook we look at another, **NLPAug**, which provides efficient, ready-made implementations of these DA techniques.

In particular, NLPAug offers three types of augmentation:
- character level.
- word level.
- sentence level.

Across these levels, NLPAug provides the methods discussed in the previous notebooks, such as:

- random deletion,
- random insertion,
- order shuffling,
- synonym substitution.


We install the library:

In [None]:
!pip install nlpaug


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting nlpaug
  Downloading nlpaug-1.1.11-py3-none-any.whl (410 kB)
Installing collected packages: nlpaug
Successfully installed nlpaug-1.1.11


## Character Augmenter
Character-level data augmentation is often useful for tasks such as image-to-text conversion or chatbots.

To recognize text in an image we rely on an optical character recognition (OCR) model, but OCR introduces errors, such as confusing "o" with "0". `OcrAug` simulates these errors to augment the data. Chatbot input still contains typos, even though most applications come with spelling correction. `KeyboardAug` is therefore provided to simulate this kind of error.



###  Optical character recognition (OCR) 

In [None]:
import nlpaug.augmenter.char as nac
text = 'The quick brown fox jumps over the lazy dog .'

aug = nac.OcrAug()
augmented_texts = aug.augment(text, n=3)

print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick 6kown fox jumps over the 1a2y dog.', 'The quick brown fux jomp8 over the lazy d09.', 'The quick brown fox jomp8 over the lazy du9.']
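To make the idea behind OCR-style noise concrete, here is a minimal pure-Python sketch. It is not NLPAug's implementation: the `OCR_CONFUSIONS` table, the `ocr_noise` function and its `char_p` parameter are hypothetical names chosen for illustration, and the table is only a small hand-picked subset of look-alike characters.

```python
import random

# Hypothetical confusion table: characters an OCR engine might mix up.
# Illustrative subset only, not the mapping shipped with NLPAug.
OCR_CONFUSIONS = {
    "o": ["0"], "0": ["o"],
    "l": ["1"], "1": ["l"],
    "s": ["5"], "5": ["s"],
    "b": ["6"], "g": ["9"], "z": ["2"],
}

def ocr_noise(text, char_p=0.3, seed=None):
    """Replace each confusable character by a look-alike with probability char_p."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        options = OCR_CONFUSIONS.get(ch)
        if options and rng.random() < char_p:
            out.append(rng.choice(options))
        else:
            out.append(ch)
    return "".join(out)

print(ocr_noise("The quick brown fox jumps over the lazy dog .", seed=0))
```

Only characters present in the table can change, so the output always has the same length as the input.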


### Keyboard Augmenter
Replaces a character with one that is close to it on the keyboard.

In [None]:
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick b$pwn fox ju,Os over the lWzj dog.']
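The keyboard-distance idea can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's code: `NEIGHBOURS` is a hypothetical, partial QWERTY adjacency map covering a handful of lowercase letters, and `keyboard_typo` is a made-up helper that edits a single character per word.

```python
import random

# Hypothetical QWERTY adjacency map (small subset, lowercase letters only).
# NLPAug uses its own keyboard model; this table is an assumption.
NEIGHBOURS = {
    "q": "wa", "w": "qes", "e": "wrd", "r": "etf", "t": "ryg",
    "a": "qsz", "s": "awdx", "d": "sefc", "o": "ipl", "u": "yij",
    "n": "bm", "m": "n",
}

def keyboard_typo(word, rng):
    """Replace one character of `word` with an adjacent key, when possible."""
    positions = [i for i, ch in enumerate(word) if ch.lower() in NEIGHBOURS]
    if not positions:
        return word  # no character with known neighbours
    i = rng.choice(positions)
    replacement = rng.choice(NEIGHBOURS[word[i].lower()])
    return word[:i] + replacement + word[i + 1:]

rng = random.Random(0)
print(" ".join(keyboard_typo(w, rng) for w in "The quick brown fox".split()))
```

Note that this sketch always inserts a lowercase neighbour, whereas the NLPAug output above also contains shifted keys and symbols.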


### Random augmenter

A character is inserted at a random position:

In [None]:
aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The qujicOk b^r9own fox jumps over the 6laMzy dog.']


A character is replaced by another at random:

In [None]:
aug = nac.RandomCharAug(action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The 8uock brown fox jumps VPer the la)z dog.']


Characters are swapped at random:

In [None]:
aug = nac.RandomCharAug(action="swap")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The qukic brown fox jumps vore the alyz dog.']


Characters are deleted at random:

In [None]:
aug = nac.RandomCharAug(action="delete")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick brown fox ums or the zy dog.']
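The four character-level actions can be summarized in one small function. The following is a simplified sketch, assuming a single edit per word and replacement characters drawn from ASCII letters only; `random_char_edit` is a hypothetical helper, and NLPAug's own `RandomCharAug` may edit several characters per word and draw from a wider character set.

```python
import random
import string

def random_char_edit(word, action, rng):
    """Apply one random character-level edit to `word`."""
    if len(word) < 2:
        return word  # too short to edit safely
    i = rng.randrange(len(word))
    if action == "insert":
        return word[:i] + rng.choice(string.ascii_letters) + word[i:]
    if action == "substitute":
        return word[:i] + rng.choice(string.ascii_letters) + word[i + 1:]
    if action == "delete":
        return word[:i] + word[i + 1:]
    if action == "swap":
        j = min(i, len(word) - 2)  # keep j + 1 inside the word
        return word[:j] + word[j + 1] + word[j] + word[j + 2:]
    raise ValueError(f"unknown action: {action}")

rng = random.Random(0)
for action in ("insert", "substitute", "swap", "delete"):
    print(action, "->", random_char_edit("jumps", action, rng))
```

Insertion lengthens the word by one character, deletion shortens it by one, and substitution and swap keep the length unchanged.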


## Word Augmenter




In [None]:
import nlpaug.augmenter.word as naw


In [None]:
# swap words
aug = naw.RandomWordAug(action="swap")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick brown fox jumps over the lazy. dog']


In [None]:
# delete words (the default action of RandomWordAug)
aug = naw.RandomWordAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick fox jumps the dog.']


In [None]:
# remove a contiguous run (n-gram) of words
aug = naw.RandomWordAug(action='crop')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick over the lazy dog.']
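The difference between the three word-level actions is easier to see on a token list. The sketch below is a simplified illustration, not NLPAug's code: `random_word_edit` and its `crop_len` parameter are hypothetical, tokens are obtained by whitespace splitting, and each call applies exactly one edit.

```python
import random

def random_word_edit(tokens, action, rng, crop_len=2):
    """Apply one random word-level edit and return a new token list."""
    tokens = list(tokens)  # work on a copy
    if len(tokens) < 2:
        return tokens
    if action == "swap":
        # exchange two adjacent tokens
        i = rng.randrange(len(tokens) - 1)
        tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    elif action == "delete":
        # drop a single token
        del tokens[rng.randrange(len(tokens))]
    elif action == "crop":
        # drop a contiguous run of tokens, always keeping at least one
        n = min(crop_len, len(tokens) - 1)
        start = rng.randrange(len(tokens) - n + 1)
        del tokens[start:start + n]
    else:
        raise ValueError(f"unknown action: {action}")
    return tokens

rng = random.Random(0)
tokens = "The quick brown fox jumps over the lazy dog .".split()
for action in ("swap", "delete", "crop"):
    print(action, "->", " ".join(random_word_edit(tokens, action, rng)))
```

Deleting removes isolated words, while cropping removes neighbouring words together, which is why the cropped NLPAug output above loses the whole span "brown fox jumps".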


### Synonym Substitution


#### WordNet

First, we use WordNet to obtain the synonyms:



In [None]:

aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The immediate brown charles james fox jumps over the indolent dog.']
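Dictionary-based substitution boils down to a lookup followed by a random choice. The sketch below assumes a tiny hand-written thesaurus, `SYNONYMS`, standing in for WordNet; the table, the `synonym_replace` function and its `aug_p` parameter are hypothetical and only illustrate the mechanism.

```python
import random

# Hypothetical mini-thesaurus standing in for WordNet.
SYNONYMS = {
    "quick": ["fast", "speedy", "rapid"],
    "lazy": ["idle", "indolent"],
    "jumps": ["leaps", "hops"],
}

def synonym_replace(text, aug_p=0.5, seed=None):
    """Replace each word that has synonyms with probability aug_p."""
    rng = random.Random(seed)
    out = []
    for token in text.split():
        candidates = SYNONYMS.get(token.lower())
        if candidates and rng.random() < aug_p:
            out.append(rng.choice(candidates))
        else:
            out.append(token)
    return " ".join(out)

print(synonym_replace("The quick brown fox jumps over the lazy dog .", seed=0))
```

Because the lookup ignores context, a dictionary can return a synonym for the wrong sense of a word, as in the NLPAug output above, where "fox" became "charles james fox".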


#### Word Embeddings Augmenter


A widely used and effective technique is synonym substitution based on word embeddings, which produces sentences with a similar meaning but different words. Instead of a dictionary such as WordNet (as in EDA), it relies on a pre-trained word embedding model. We can use non-contextual word embedding models (such as GloVe or word2vec) or contextual ones (such as BERT or RoBERTa).
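With non-contextual embeddings, candidate replacements are usually the nearest neighbours of a word in vector space. The sketch below illustrates that lookup with cosine similarity. The three-dimensional vectors in `VECTORS` are invented for the example (real models such as word2vec use hundreds of dimensions), and `cosine` and `most_similar` are hypothetical helpers, not NLPAug functions. Contextual models such as BERT work differently and are not covered by this sketch.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
VECTORS = {
    "quick": [0.9, 0.1, 0.0],
    "fast":  [0.85, 0.15, 0.05],
    "lazy":  [0.1, 0.9, 0.0],
    "idle":  [0.15, 0.85, 0.1],
    "dog":   [0.0, 0.1, 0.9],
    "fox":   [0.1, 0.0, 0.95],
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word, top_n=1):
    """Return the top_n vocabulary words closest to `word`, excluding itself."""
    query = VECTORS[word]
    scored = [(cosine(query, vec), other)
              for other, vec in VECTORS.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]

print(most_similar("quick"))
print(most_similar("dog"))
```

An embedding-based augmenter can then substitute a word with one of its neighbours, or insert a neighbour next to it.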

In [None]:

# model_type: word2vec, glove or fasttext
# Assumption: the vectors file has been downloaded beforehand into model_dir
# (hypothetical path; adjust it to your environment).
model_dir = './'
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin',
    action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)


#### Contextual Word Embeddings Augmenter


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.2-py3-none-any.whl (6.8 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.3-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.3 tokenizers-0.13.2 transformers-4.27.2


In [None]:
import nlpaug.augmenter.word as naw

aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['only the quick thinking brown fox jumps over the giant lazy dog.']


### Antonym Replacement


In [None]:
aug = naw.AntonymAug()
_text = 'Good boy'
augmented_text = aug.augment(_text)
print("Original:")
print(_text)
print("Augmented Text:")
print(augmented_text)

Original:
Good boy
Augmented Text:
['Bad boy']


## Sentence Augmentation




### Contextual Word Embeddings 

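Inserts new text generated with a contextual language model (GPT2 or XLNet). As the outputs in this section show, the generated text is appended after the original sentence.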

In [None]:
import nlpaug.augmenter.sentence as nas


In [None]:

# model_path: xlnet-base-cased or gpt2
aug = nas.ContextualWordEmbsForSentenceAug(model_path='xlnet-base-cased')
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

ValueError: ignored

In [None]:
aug = nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick brown fox jumps over the lazy dog . number only most next most only in last it of A more A a " .']


In [None]:
aug = nas.ContextualWordEmbsForSentenceAug(model_path='distilgpt2')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick brown fox jumps over the lazy dog . P E K W W The R This G R of C The W Image The F S This This The S .']


### Abstractive Summarization Augmenter
 
Automatic (abstractive) summarization can also be used as a data augmentation technique.

In [None]:
article = """
The history of natural language processing (NLP) generally started in the 1950s, although work can be 
found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and 
Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. 
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian 
sentences into English. The authors claimed that within three or five years, machine translation would
be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, 
which found that ten-year-long research had failed to fulfill the expectations, funding for machine 
translation was dramatically reduced. Little further research in machine translation was conducted 
until the late 1980s when the first statistical machine translation systems were developed.
"""

aug = nas.AbstSummAug(model_path='t5-base')
augmented_text = aug.augment(article)
print("Original:")
print(article)
print("Augmented Text:")
print(augmented_text)

Original:

The history of natural language processing (NLP) generally started in the 1950s, although work can be 
found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and 
Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. 
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian 
sentences into English. The authors claimed that within three or five years, machine translation would
be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, 
which found that ten-year-long research had failed to fulfill the expectations, funding for machine 
translation was dramatically reduced. Little further research in machine translation was conducted 
until the late 1980s when the first statistical machine translation systems were developed.

Augmented Text:
['the history of natural language processing (NLP) generally started in the 