### Regular expressions

A regular expression is a tool for finding complex patterns in text.

Once you have found the occurrences of a pattern, you can manipulate them however you like (extract them, replace them, and so on).

Regular expressions are powerful tools, used almost anywhere text is processed!

Regular expressions find sequences of characters, words, and numbers by matching patterns. For example, to find every date written in the DD.MM.YYYY format, we would use this pattern: two digits, a period, two digits, a period, four digits.

A simplified pattern for an email address would be: an alphanumeric string, @, an alphanumeric string, a period, an alphanumeric string.

Python has a built-in module for working with regular expressions, re:

import re

Take a look at the re.sub() function. It finds every part of the text that matches the given pattern and replaces it with the text you choose. It takes the following arguments:

- pattern: the regular expression to search for
- substitution: what each match of the pattern is replaced with
- text: the text the function scans for matches of the pattern

re.sub(pattern, substitution, text)
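
As a minimal sketch of this call, the example below uses a plain literal word as the pattern. The input string is made up for illustration and does not come from the dataset used later.

```python
import re

# Hypothetical input string, made up for this illustration.
text = "The colour red and the colour blue."

# Arguments in order: pattern, substitution, text.
result = re.sub("colour", "color", text)

print(result)  # The color red and the color blue.
print(text)    # The colour red and the colour blue.
```

Note that re.sub() returns a new string and leaves the original untouched, which is why later examples assign the result back to a variable.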

Regular expressions have their own syntax, which can describe many combinations of strings. The simple regular expression "a.b" matches any three-character string that starts with "a" and ends with "b". The period means that any character (except, by default, a newline) can appear in the second position.
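
The behavior of "a.b" can be checked with a short sketch. The candidate strings are chosen purely for illustration; re.fullmatch() is used here because it succeeds only when the whole string fits the pattern.

```python
import re

pattern = "a.b"

# Hypothetical candidate strings, chosen for this illustration.
candidates = ["acb", "a-b", "a b", "ab", "acdb", "a\nb"]

# re.fullmatch succeeds only if the entire string fits the pattern.
results = {s: bool(re.fullmatch(pattern, s)) for s in candidates}
print(results)
# {'acb': True, 'a-b': True, 'a b': True, 'ab': False, 'acdb': False, 'a\nb': False}

# Inside a longer text, re.findall returns every non-overlapping match.
found = re.findall(pattern, "acb a1b ab")
print(found)  # ['acb', 'a1b']
```

The last candidate fails because, by default, the period does not match a newline character.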

Regular expressions often use the backslash character ('\') as part of their syntax. Because Python also uses the backslash for escape sequences in ordinary strings, which can lead to misinterpretation, regular expressions are usually written as raw strings.

Here is a quick example of the difference between a normal string and a raw string. Note that we define a raw string by writing r before the opening quote:

>>> print('Hello!\n')
Hello!

>>> print(r'Hello!\n')
Hello!\n
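
Raw strings are what keep patterns with backslashes readable. As a rough sketch, the date and email patterns described at the start of this section could be written as shown below. The sample text is made up, and the email pattern is deliberately simplified to follow that description rather than the full rules for real addresses.

```python
import re

# Hypothetical sample text, made up for this illustration.
sample = "Invoices dated 03.11.2021 and 17.01.2022 were sent to anna@example.com."

# Two digits, a literal period, two digits, a literal period, four digits.
# \d matches a digit; \. matches a period rather than "any character".
date_pattern = r"\d{2}\.\d{2}\.\d{4}"

# Alphanumeric run, @, alphanumeric run, period, alphanumeric run.
# Simplified on purpose: real addresses allow many more characters.
email_pattern = r"[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z0-9]+"

dates = re.findall(date_pattern, sample)
emails = re.findall(email_pattern, sample)

print(dates)   # ['03.11.2021', '17.01.2022']
print(emails)  # ['anna@example.com']
```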

Now let's look at the following text from a review:

text = """
I liked this show from the first episode I saw, which was the "Rhapsody in Blue" episode (for those that don't know what that is, the Zan going insane and becoming pau lvl 10 ep). Best visuals and special effects I've seen on a television series, nothing like it anywhere.
"""

As part of preprocessing, we need to remove every character except letters, apostrophes, and spaces. Let's build the regular expression step by step, starting with one that finds the characters we want to keep.

The characters a pattern should match are listed inside square brackets, without spaces, and in any order. A range of letters is indicated with a hyphen:

- a-z = abcdefghijklmnopqrstuvwxyz

Let's find the letters from a to z. Since they can be lowercase or uppercase, the pattern is written as follows:

pattern = r"[a-zA-Z]"

If we also want to find apostrophes, we can add one inside the brackets:

pattern = r"[a-zA-Z']"

If we call re.sub(pattern, ' ', text), all letters and apostrophes will be replaced, but those are exactly the characters we need to keep. To match the characters that are not listed in the brackets, place a caret ^ right after the opening bracket. Here is how it looks:

pattern = r"[^a-zA-Z']"
text = re.sub(pattern, " ", text)
print(text)

" I liked this show from the first episode I saw  which was the  Rhapsody in Blue  episode  for those that don't know what that is  the Zan going insane and becoming pau lvl    ep   Best visuals and special effects I've seen on a television series  nothing like it anywhere  "
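
To see the effect of the caret in isolation, the sketch below runs both versions of the pattern on a short made-up string. An underscore is used as the replacement (instead of a space) only so that the replaced positions are easy to count.

```python
import re

# Hypothetical short string, made up for this illustration.
short = "Don't stop, 24/7!"

# Without the caret: letters and apostrophes are the matches.
keep_others = re.sub(r"[a-zA-Z']", "_", short)

# With the caret: everything except letters and apostrophes is a match.
keep_letters = re.sub(r"[^a-zA-Z']", "_", short)

print(keep_others)   # _____ ____, 24/7!
print(keep_letters)  # Don't_stop_______
```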

Now only letters, apostrophes, and spaces remain, but we have more spaces than we need. In the next step we will remove the extra spaces, since they can get in the way of our analysis. We can remove them with a combination of the join() and split() methods.

We can use the split() method to turn our string into a list. Called without arguments, split() splits the text on whitespace, treating a run of consecutive spaces as a single separator:

text = text.split()

print(text)

The result is a list of words with no spaces:

['I', 'liked', 'this', 'show', 'from', 'the', 'first', 'episode', 'I', 'saw', 'which', 'was', 'the', 'Rhapsody', 'in', 'Blue', 'episode', 'for', 'those', 'that', "don't", 'know', 'what', 'that', 'is', 'the', 'Zan', 'going', 'insane', 'and', 'becoming', 'pau', 'lvl', 'ep', 'Best', 'visuals', 'and', 'special', 'effects', "I've", 'seen', 'on', 'a', 'television', 'series', 'nothing', 'like', 'it', 'anywhere']

Then we recombine these elements into a single string, separated by single spaces, using the join() method:

text = " ".join(text)
print(text)

We get a line with no extra spaces:

"I liked this show from the first episode I saw which was the Rhapsody in Blue episode for those that don't know what that is the Zan going insane and becoming pau lvl ep Best visuals and special effects I've seen on a television series nothing like it anywhere"
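
The split() and join() combination is the approach used in this lesson. As an aside, a similar result can be reached with re.sub() alone; the sketch below is an alternative, not the lesson's method, and it assumes the text contains only ordinary spaces, tabs, and newlines. The input string is made up for illustration.

```python
import re

# Hypothetical string with irregular spacing, made up for this illustration.
messy = "  I liked   this show \n from the  first episode  "

# Approach from the lesson: split on whitespace, then join with single spaces.
via_split = " ".join(messy.split())

# Alternative: collapse each run of whitespace (\s+) into one space,
# then strip the single space that may remain at either end.
via_regex = re.sub(r"\s+", " ", messy).strip()

print(via_split)               # I liked this show from the first episode
print(via_split == via_regex)  # True
```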


### Exercise

Write the function clear_text(text) so that it keeps only Latin letters, spaces, and apostrophes in the text and removes any extra spaces. The function takes the original text and returns the cleaned text.

Print the original text and the text after cleaning and lemmatization (already in the precode).

import random  # to pick a random review
import pandas as pd
import re
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

data = pd.read_csv('/datasets/imdb_reviews_small.tsv', sep='\t')
corpus = data['review']

def clear_text(text):
    # Replace everything except Latin letters and apostrophes with a space
    pattern = r"[^a-zA-Z']"
    text = re.sub(pattern, " ", text)
    # Collapse runs of spaces and trim both ends
    return " ".join(text.split())

def lemmatize(text):

    doc = nlp(text.lower())
    
    lemmas = []
    for token in doc:
        lemmas.append(token.lemma_)
        
    return ' '.join(lemmas)

# store the review index in the variable review_idx
# either as a random number or as a fixed value, for example 2557
review_idx = random.randint(0, len(corpus)-1)
# review_idx = 2557

review = corpus[review_idx]

print('Original text:', review)
print()
print('Lemmatized text:', lemmatize(clear_text(review)))

Result:

Original text: I heard about this series in 2001 which a friend of mine was recording off the television each week. I never bothered to watch though I became acquainted with the series through the magazine which I looked at every now and then in bookstores. I recently purchased series one on DVD and have become addicted to this fascinating and original series. The characters at first seem unlikeable but it is amazing how fast they grow and develop into a united force. As they begin to care for one another it becomes easy to care what happens to them (bearing in mind that this is only a TV series and they are fictional). However it isn't the PC world of Star Trek and so whilst every character shows a good trait they each have their own flaws and demons that they must deal with. Indivual story lines mixed in with an overall multiple story-arc make this one of the most complex and rewarding television experiences I have ever had the pleasure of viewing. I absolutely enjoy watching each of the characters interact with one another. This strange new world we are introduced to is brilliantly portrayed through the eyes of astronaut John Crichton and as he learns and adapts to being on the other side of the galaxy, strange alien creatures, different cultures, being hunted by a character that wants him dead and being treated as inferior by his comrades we can easily relate to what he must be feeling. As he becomes used to his surroundings so do the viewers and his compassionate, strong-willed and brave character is a joy to watch. I have watched only seven episodes of Farscape season one and look forward to continuing through seasons two-four and the mini-series. Maybe one day we can all enjoy a season five. Highly recommended viewing and well worth setting time aside to watch. Buy and enjoy! 10/10.

Lemmatized text: I hear about this series in which a friend of mine be record off the television each week I never bother to watch though I become acquainted with the series through the magazine which I look at every now and then in bookstore I recently purchase series one on dvd and have become addicted to this fascinating and original series the character at first seem unlikeable but it be amazing how fast they grow and develop into a united force as they begin to care for one another it become easy to care what happen to they bear in mind that this be only a tv series and they be fictional however it be not the pc world of star trek and so whilst every character show a good trait they each have their own flaw and demon that they must deal with indivual story line mix in with an overall multiple story arc make this one of the most complex and rewarding television experience I have ever have the pleasure of view I absolutely enjoy watch each of the character interact with one another this strange new world we be introduce to be brilliantly portray through the eye of astronaut john crichton and as he learn and adapt to be on the other side of the galaxy strange alien creature different culture be hunt by a character that want he dead and be treat as inferior by his comrade we can easily relate to what he must be feel as he become used to his surrounding so do the viewer and his compassionate strong willed and brave character be a joy to watch I have watch only seven episode of farscape season one and look forward to continue through season two four and the mini series maybe one day we can all enjoy a season five highly recommend viewing and well worth set time aside to watch buy and enjoy
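
Before running the full pipeline, it can help to check the cleaning step on its own. The sketch below repeats a clear_text() definition equivalent to the one in the exercise so that it runs without the dataset or spaCy; the test sentence is made up for illustration.

```python
import re


def clear_text(text):
    # Same logic as in the exercise: keep Latin letters and apostrophes,
    # turn everything else into spaces, then collapse extra spaces.
    text = re.sub(r"[^a-zA-Z']", " ", text)
    return " ".join(text.split())


# Hypothetical input, made up for this illustration.
raw = 'the "Rhapsody in Blue" episode (lvl 10 ep). Don\'t miss it!'
cleaned = clear_text(raw)

print(cleaned)  # the Rhapsody in Blue episode lvl ep Don't miss it
```

Digits, quotes, brackets, and punctuation disappear, while the apostrophe in "Don't" survives, which matches what the exercise asks for.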