<a href="https://colab.research.google.com/github/gmauricio-toledo/NLP-LCC/blob/main/01-RegEx.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1>Preprocessing: Regular Expressions</h1>

In this notebook we use regular expressions for text preprocessing tasks, specifically text cleaning.

The [re](https://docs.python.org/3/library/re.html) module lets us use regular expressions for *matching* operations: searching for a pattern, extracting every occurrence of it, or replacing it.
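
As a quick orientation, here is a minimal sketch of the three `re` functions that cover most cleaning tasks. The sample sentence is made up for illustration and is not part of the notebook's data.

```python
import re

sample = "Order 66 was issued in 2005."

# re.search returns a match object for the first match, or None if there is none.
first = re.search(r"\d+", sample)
print(first.group())

# re.findall returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", sample)
print(numbers)

# re.sub returns a new string in which each match has been replaced.
masked = re.sub(r"\d+", "#", sample)
print(masked)
```

None of these calls modifies `sample`; Python strings are immutable, so `re.sub` always returns a new string.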

Regular expressions are also used in other modules, for example as hyperparameters in various tasks.
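
One illustration of a regular expression acting as a hyperparameter is the `token_pattern` argument of scikit-learn's `CountVectorizer`, whose documented default is `r"(?u)\b\w\w+\b"`. The sketch below applies that same pattern with `re.findall` to show which tokens it would select. It does not call scikit-learn, and the sentence is a made-up example.

```python
import re

# Default token_pattern of sklearn.feature_extraction.text.CountVectorizer:
# a token is a run of two or more word characters between word boundaries.
token_pattern = r"(?u)\b\w\w+\b"

sentence = "A regex can act as a tokenizer in 2 lines."

# Single-character tokens such as "A", "a" and "2" are dropped by this pattern.
tokens = re.findall(token_pattern, sentence)
print(tokens)
```

Changing the pattern changes the vocabulary the vectorizer would build, which is why it is treated as a tunable setting. Note that `CountVectorizer` also lowercases text by default, which this sketch does not do.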

In [None]:
import re

## Example 1

Consider the following text:

> The Nebraska Cornhuskers men's tennis team represents the University of Nebraska–Lincoln in the Big Ten Conference. The program was established in 1928 and has made the NCAA Championship twice, most recently in 2011. Five Cornhuskers have won conference championships, and seventeen have been named all-conference selections. In 1989, Steven Jung was the NCAA Singles runner-up and was named NU's first All-American.[2] Jung is the only men's tennis player in the Nebraska Athletic Hall of Fame.[3] <br><br> Assistant Peter Kobelt was named interim head coach of the program following the departure of Sean Maymi in 2023.[4][5] In 2024, Nebraska Athletic Director Troy Dannen named Kobelt the 12th permanent head coach of the team.[6]

In [None]:
texto = '''
The Nebraska Cornhuskers men's tennis team represents the University of Nebraska–Lincoln in the Big Ten Conference. The program was established in 1928 and has made the NCAA Championship twice, most recently in 2011. Five Cornhuskers have won conference championships, and seventeen have been named all-conference selections. In 1989, Steven Jung was the NCAA Singles runner-up and was named NU's first All-American.[2] Jung is the only men's tennis player in the Nebraska Athletic Hall of Fame.[3]

Assistant Peter Kobelt was named interim head coach of the program following the departure of Sean Maymi in 2023.[4][5] In 2024, Nebraska Athletic Director Troy Dannen named Kobelt the 12th permanent head coach of the team.[6]
'''

Let's carry out the following tasks:

1. Remove the citations [$n$], where $n$ is a number.

For this we use [re.sub](https://docs.python.org/3/library/re.html#re.sub), which returns a new string with every match of the pattern replaced.

In [None]:
expresion_1 = r"\[\d{1,2}\]"

texto_limpio = re.sub(expresion_1, "", texto)
print(texto_limpio)

In [None]:
texto_limpio
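
To make the citation pattern easier to read, the following sketch breaks it down on a shorter, made-up snippet. It also compares `\d{1,2}` with the more permissive `\d+`; which one is preferable depends on how many references a text can have.

```python
import re

snippet = "named All-American.[2] Maymi left in 2023.[4][5] See [note] and [123]."

# \[ and \] match literal square brackets (unescaped, they would start a
# character class); \d{1,2} allows one or two digits between them.
one_or_two = re.sub(r"\[\d{1,2}\]", "", snippet)
print(one_or_two)

# \d+ accepts any number of digits, so [123] is removed as well.
# Neither pattern touches [note], because it contains no digits.
any_digits = re.sub(r"\[\d+\]", "", snippet)
print(any_digits)
```

Adjacent citations such as `[4][5]` need no special handling: `re.sub` replaces every non-overlapping match in turn.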

2. Remove the line breaks.

In [None]:
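# Replace each run of line breaks with a single space so that the text on
# either side is not glued together, then trim the leading/trailing space.
expresion_2 = r"\n+"

texto_limpio = re.sub(expresion_2, " ", texto_limpio).strip()
texto_limpio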

3. Find (and store) the years.

Note that in this case the string being searched is not modified.

In [None]:
expresion_3 = r"\d{4}"

years = re.findall(expresion_3, texto_limpio)

print(years)

print(texto_limpio)
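
The pattern `\d{4}` matches any four consecutive digits, including the first four digits of a longer number. The text above contains no such numbers, but the sketch below (on a made-up string) shows how word boundaries `\b` make the search stricter, and how `re.finditer` also reports where each match was found.

```python
import re

sample = "In 2024 the ID was 123456"

# Without boundaries, the first four digits of 123456 also count as a match.
loose = re.findall(r"\d{4}", sample)
print(loose)

# \b requires a word boundary on both sides, so only standalone
# four-digit numbers are kept.
strict = re.findall(r"\b\d{4}\b", sample)
print(strict)

# re.finditer yields match objects, which carry the position of each match.
positions = [(m.group(), m.start()) for m in re.finditer(r"\b\d{4}\b", sample)]
print(positions)
```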

4. Remove the years and all other numbers.

In [None]:
# \d+ matches any run of digits, so it removes the years and every other number.
expresion_4 = r"\d+"

texto_limpio = re.sub(expresion_4, "", texto_limpio)

print(texto_limpio)
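
Deleting digits tends to leave residue behind: orphaned ordinal suffixes such as "th", spaces before punctuation, and doubled spaces. The sketch below shows one possible way to tidy this up on a single sentence adapted from the text. The extra patterns are illustrative assumptions, not part of the original exercise.

```python
import re

sample = "In 2024, Kobelt was named the 12th coach."

# Removing only the digits leaves the ordinal suffix behind.
digits_only = re.sub(r"\d+", "", sample)
print(digits_only)

# Optionally consume an English ordinal suffix together with the number.
with_suffix = re.sub(r"\b\d+(?:st|nd|rd|th)?\b", "", sample)

# Drop whitespace that now sits before punctuation, then collapse repeated spaces.
tidy = re.sub(r"\s+([,.;:])", r"\1", with_suffix)
tidy = re.sub(r"\s{2,}", " ", tidy)
print(tidy)
```

The result is still not a grammatical sentence, which is a reminder that removing numbers is a lossy step and may not suit every downstream task.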

## Example 2

Now let's deal with more complex texts. We will use the `20newsgroups` corpus.

In [None]:
from sklearn.datasets import fetch_20newsgroups

We load it with this instruction:

In [None]:
data = fetch_20newsgroups()

Let's count how many documents we have:

In [None]:
len(data.data)

Let's explore one:

In [None]:
print(data.data[0])

In [None]:
data.data[0]

Let's work with a random document:

In [None]:
from random import randrange

# randrange(n) already excludes n, so subtracting 1 would skip the last document.
idx = randrange(len(data.data))
random_doc = data.data[idx]

random_doc

⭕ Using regular expressions, carry out the following tasks:

1. Remove the line breaks.
2. Remove the numbers.
3. Remove the double quotes.
4. Store in a list the email address(es) present in the text.
5. Remove the email addresses.
6. Print the *clean* text and print the list obtained in (4).
7. How *clean* do you think the text is after the previous steps? What additional steps would you consider?
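
As a starting point, here is a sketch of one possible approach to steps 1 to 6, applied to a short made-up message rather than to `random_doc`. The `EMAIL` pattern is a deliberately simple assumption that covers common addresses but not every valid one. Emails are extracted before the digits are removed, because addresses may themselves contain digits.

```python
import re

# Hypothetical message that imitates the shape of a newsgroup post.
sample_doc = (
    'From: jane.doe@example.edu (Jane Doe)\n'
    'Subject: "Test" post 42\n'
    '\n'
    'Reply to admin@lists.example.org by 1993.\n'
)

# Simplified email pattern: local part, "@", then a dotted domain.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"

emails = re.findall(EMAIL, sample_doc)        # step 4: store the addresses
clean = re.sub(EMAIL, "", sample_doc)         # step 5: remove the addresses
clean = re.sub(r"\n+", " ", clean)            # step 1: line breaks become spaces
clean = re.sub(r"\d+", "", clean)             # step 2: remove numbers
clean = re.sub(r'"', "", clean)               # step 3: remove double quotes
clean = re.sub(r" {2,}", " ", clean).strip()  # extra: collapse repeated spaces

# Step 6: print the clean text and the list of addresses.
print(clean)
print(emails)
```

The output still contains header labels, parentheses, and stray punctuation, which is the kind of observation step 7 asks for.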