<center>
<img src="https://upload.wikimedia.org/wikipedia/commons/4/47/Acronimo_y_nombre_uc3m.png"/>

<img src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" width=15%/>
</center> 

# NLP albumentations

In this notebook, we will study how to apply data augmentation techniques used in Computer Vision to NLP. To do this, we will use the [Albumentations library](https://github.com/albumentations-team/albumentations).


Albumentations (https://github.com/albumentations-team/albumentations)
is a Python library for creating synthetic images from existing ones. Below is an example of how it can apply pixel-level augmentations to create new images from an original:
<center>
<img src="https://production-media.paperswithcode.com/thumbnails/task/task-0000001560-029cbc00.jpg">
</center>




## Sentence shuffle transformation
Next, we will see how these techniques can be used for NLP.

This transformation takes a text made up of several sentences and returns a new text in which the order of the sentences has been changed. For example:

- `text = '<Sentence1>. <Sentence2>. <Sentence3>. <Sentence4>. <Sentence5>. <Sentence5>.'`

is transformed into:


- `text = '<Sentence2>. <Sentence3>. <Sentence1>. <Sentence5>. <Sentence5>. <Sentence4>.'`




## Exclude duplicate sentences

In this case, if the input text contains duplicate sentences, they are removed to create a new text. For example:

- `text = '<Sentence1>. <Sentence2>. <Sentence4>. <Sentence4>. <Sentence5>. <Sentence5>.'`

is transformed into:

- `text = '<Sentence1>. <Sentence2>. <Sentence4>. <Sentence5>.'`

## Implementation
Let's implement the two techniques above. First, we need a library that can split a text into sentences. NLTK already provides a function for this, `sent_tokenize`:

In [1]:
import nltk
nltk.download('punkt')  # resource for parsing sentences
# Note: newer NLTK releases may ask for nltk.download('punkt_tab') instead.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
from nltk import sent_tokenize

text = "Flying with only carry-on is as desirable as ever. But for one traveler, even that isn't minimalist enough. She's hitting the road with only a small, 12-liter (3-gallon) shoulder bag. Brooke Schoenman is an American woman living in Australia. Schoenman's path to burden-easing enlightenment began when she studied in Italy before embarking on a post-graduate round-the-world trip. Along the way, she explored Guatemala and later worked teaching English in Ukraine before moving down under 13 years ago."

sentences = sent_tokenize(text)
for s in sentences:
    print(s)
    

Flying with only carry-on is as desirable as ever.
But for one traveler, even that isn't minimalist enough.
She's hitting the road with only a small, 12-liter (3-gallon) shoulder bag.
Brooke Schoenman is an American woman living in Australia.
Schoenman's path to burden-easing enlightenment began when she studied in Italy before embarking on a post-graduate round-the-world trip.
Along the way, she explored Guatemala and later worked teaching English in Ukraine before moving down under 13 years ago.


Now let's change the order of the sentences. The **random** module provides a function, `shuffle`, that reorders the elements of a list in place.

In [None]:
import random
random.shuffle(sentences)
for s in sentences:
    print(s)

Along the way, she explored Guatemala and later worked teaching English in Ukraine before moving down under 13 years ago.
Flying with only carry-on is as desirable as ever.
But for one traveler, even that isn't minimalist enough.
Brooke Schoenman is an American woman living in Australia.
Schoenman's path to burden-easing enlightenment began when she studied in Italy before embarking on a post-graduate round-the-world trip.
She's hitting the road with only a small, 12-liter (3-gallon) shoulder bag.
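
The shuffle above reorders `sentences` in place, so the original order is lost. As a minimal sketch, the same step can be wrapped in a small function that returns a new text and accepts an optional seed for reproducibility. The name `shuffle_sentences` and its `seed` parameter are illustrative assumptions, not part of NLTK or Albumentations; the function expects an already-split list of sentences, for example the output of `sent_tokenize`.

```python
import random


def shuffle_sentences(sentences, seed=None):
    """Return a new text with the given sentences in random order.

    Hypothetical helper: `sentences` is a list of strings, such as the
    output of nltk's sent_tokenize. The input list is not modified.
    """
    rng = random.Random(seed)  # local generator, global random state is untouched
    shuffled = rng.sample(sentences, k=len(sentences))  # shuffled copy of the list
    return " ".join(shuffled)


example = ["First sentence.", "Second sentence.", "Third sentence."]
print(shuffle_sentences(example, seed=0))
```

Passing the same `seed` gives the same ordering on every call, which helps when an augmented dataset has to be regenerated.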


Finally, we implement the exclude duplicate transformation, which simply removes the duplicate sentences in a text:

In [None]:
text = "Flying with only carry-on is as desirable as ever. But for one traveler, even that isn't minimalist enough. Flying with only carry-on is as desirable as ever. She's hitting the road with only a small, 12-liter (3-gallon) shoulder bag. Brooke Schoenman is an American woman living in Australia. Flying with only carry-on is as desirable as ever.  Schoenman's path to burden-easing enlightenment began when she studied in Italy before embarking on a post-graduate round-the-world trip. Along the way, she explored Guatemala and later worked teaching English in Ukraine before moving down under 13 years ago."

sentences = sent_tokenize(text)
new_sentences = []  # unique sentences, in order of first appearance
print('\nOriginal sentences:')
for s in sentences:
    print(s)
    # keep a sentence only the first time it appears
    if s not in new_sentences:
        new_sentences.append(s)


print('\nExclude duplicate transformation:')
for s in new_sentences:
    print(s)
    


Original sentences:
Flying with only carry-on is as desirable as ever.
But for one traveler, even that isn't minimalist enough.
Flying with only carry-on is as desirable as ever.
She's hitting the road with only a small, 12-liter (3-gallon) shoulder bag.
Brooke Schoenman is an American woman living in Australia.
Flying with only carry-on is as desirable as ever.
Schoenman's path to burden-easing enlightenment began when she studied in Italy before embarking on a post-graduate round-the-world trip.
Along the way, she explored Guatemala and later worked teaching English in Ukraine before moving down under 13 years ago.

Exclude duplicate transformation:
Flying with only carry-on is as desirable as ever.
But for one traveler, even that isn't minimalist enough.
She's hitting the road with only a small, 12-liter (3-gallon) shoulder bag.
Brooke Schoenman is an American woman living in Australia.
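Schoenman's path to burden-easing enlightenment began when she studied in Italy before embarking on a post-graduate round-the-world trip.
Along the way, she explored Guatemala and later worked teaching English in Ukraine before moving down under 13 years ago.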
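
The loop above tests membership in a list, so each new sentence is compared against every sentence kept so far. As a minimal sketch, the same transformation can be written as a reusable function with `dict.fromkeys`, which keeps the first occurrence of each sentence and preserves order. The name `exclude_duplicate_sentences` is a hypothetical helper, and it only detects exact string matches: sentences that differ in case or whitespace count as different.

```python
def exclude_duplicate_sentences(sentences):
    """Drop repeated sentences, keeping the first occurrence of each.

    Hypothetical helper: `sentences` is a list of strings, such as the
    output of nltk's sent_tokenize.
    """
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so duplicates are removed without changing the sentence order.
    unique = list(dict.fromkeys(sentences))
    return " ".join(unique)


example = ["A first.", "B second.", "A first.", "C third.", "B second."]
print(exclude_duplicate_sentences(example))  # A first. B second. C third.
```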

You can find more of these techniques at https://www.kaggle.com/code/shonenkov/nlp-albumentations/notebook
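
In Albumentations, transforms are chained and each one is applied with some probability `p`. The sketch below imitates that idea in plain Python for the two transformations in this notebook. It does not use the Albumentations API: `shuffle_transform`, `dedup_transform`, and `apply_pipeline` are hypothetical names, and the `(transform, p)` step format is an assumption made for illustration.

```python
import random


def shuffle_transform(sentences, rng):
    """Return the sentences in random order (as a new list)."""
    return rng.sample(sentences, k=len(sentences))


def dedup_transform(sentences, rng):
    """Return the sentences without repeats, keeping first occurrences."""
    return list(dict.fromkeys(sentences))


def apply_pipeline(sentences, steps, seed=None):
    """Apply each (transform, p) step in order, each with probability p."""
    rng = random.Random(seed)
    for transform, p in steps:
        # rng.random() is in [0, 1), so p=1.0 always applies and p=0.0 never does
        if rng.random() < p:
            sentences = transform(sentences, rng)
    return sentences


steps = [(dedup_transform, 1.0), (shuffle_transform, 0.5)]
example = ["A.", "B.", "A.", "C."]
print(apply_pipeline(example, steps, seed=42))
```

Applying a transform only some of the time means the model still sees unmodified texts alongside the augmented ones.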