# Regular Expressions

Python provides string methods such as **split** and **find**, and uses lists to extract portions of lines.
This task of searching and extracting is so common that Python has a specialized library called *regular expressions* that handles many of these tasks.
*Regular expressions* are almost a programming language of their own for searching and parsing strings. In fact, entire books have been written on the subject of regular expressions. Here we review only the basic concepts.
For more detail, see:

https://en.wikipedia.org/wiki/Regular_expression

https://docs.python.org/library/re.html

## Understanding regular expressions

- They are very powerful and somewhat cryptic.
- They can be considered a language in their own right.
- It is a language of "marker characters": programming with characters.


## Quick guide to regular expressions:

    ^ Matches the beginning of the line.

    $ Matches the end of the line.

    . Matches any character (wildcard).

    \s Matches any whitespace character.

    \S Matches any non-whitespace character (the opposite of \s).

    ? Matches the preceding element zero or one time.

    * Matches the preceding element zero or more times.

    *? Matches the preceding element zero or more times, but as few times as possible.

    + Matches the preceding element one or more times.

    +? Matches the preceding element one or more times, but as few times as possible.

    ?? Matches the preceding element zero or one time, but as few times as possible.

    [aeiou] Matches a single character, as long as that character is in the specified set.

    [a-z0-9] Character ranges can be specified with the minus sign. This example matches a single character that must be a lowercase letter or a digit.
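
The entries in the guide can be tried out on a short string. The following is a minimal sketch: the sample string and the variable names are invented for illustration and do not come from mbox-short.txt.

```python
import re

# Hypothetical sample line, invented for this illustration.
sample = "From: ana@example.com  sent 3 files"

# ^ anchors the pattern at the start of the line.
starts_with_from = re.search(r'^From', sample) is not None

# $ anchors the pattern at the end of the line.
ends_with_files = re.search(r'files$', sample) is not None

# \S+ matches a run of one or more non-whitespace characters.
tokens = re.findall(r'\S+', sample)

# \s+ matches a run of one or more whitespace characters.
gaps = re.findall(r'\s+', sample)

# [0-9]+ matches one or more digits; [aeiou] matches a single vowel.
digits = re.findall(r'[0-9]+', sample)
first_vowel = re.search(r'[aeiou]', sample).group()

print(starts_with_from, ends_with_files)
print(tokens)
print(gaps)
print(digits, first_vowel)
```

The patterns are written as raw strings (the r prefix) so that Python passes backslashes such as \S to the **re** module unchanged.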



In [1]:
# Find lines that contain 'From:' using string methods:
texto = open("mbox-short.txt")
for linea in texto:
    linea = linea.rstrip()
    if linea.find('From:') >= 0:
        print(linea)


From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [3]:
# Find lines that contain 'From:' using re:
import re
texto = open("mbox-short.txt")
for linea in texto:
    linea = linea.rstrip()
    if re.search('From:', linea):
        print(linea)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [4]:
# Find lines that start with 'From:' using string methods:
texto = open("mbox-short.txt")
for linea in texto:
    linea = linea.rstrip()
    if linea.startswith('From:'):
        print(linea)       


From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [26]:
# Find lines that start with 'From:' using re (^ anchors the match at the start of the line):
import re
texto = open("mbox-short.txt")
for linea in texto:
    linea = linea.rstrip()
    if re.search('^From:', linea):
        print(linea)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


## Wildcard characters

The character . matches any single character. If it is followed by the
character * , the pair matches any character repeated any number of times (zero or more).

For example: ^X.*
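
The wildcard can be checked on a few strings before running it over a whole file. This is a small sketch under stated assumptions: the header lines and the word list are invented for illustration.

```python
import re

# Hypothetical header lines, invented for this illustration.
lines = [
    "X-Sieve: CMU Sieve 2.3",
    "X-DSPAM-Result: Innocent",
    "From: someone@example.com",
    "Received: by X relay",
]

# ^X  the line must start with 'X'
# .*  any character, repeated zero or more times
# :   a literal colon must appear somewhere after that
matched = [line for line in lines if re.search(r'^X.*:', line)]
print(matched)

# Each . stands for exactly one character, so ^F..m$ only accepts
# four-character words that start with 'F' and end with 'm'.
words = ["From", "Farm", "Fm", "Fooom"]
four_letters = [w for w in words if re.search(r'^F..m$', w)]
print(four_letters)
```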

In [28]:
import re
texto = open("mbox-short.txt")
for linea in texto:
    linea = linea.rstrip()
    if re.search('^X.*:', linea):
        print(linea)

X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan  5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 16:10:39 2008
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 15:46:24 2008
X-DSPAM-Confidence:

In [31]:
# Find lines that start with 'F', followed by any two characters, and then 'm:'
import re
texto = open('mbox-short.txt')
for linea in texto:
    linea = linea.rstrip()
    if re.search('^F..m:', linea):
        print(linea)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [37]:
# Find lines that start with 'From:' and contain an at sign (@)
import re
texto = open('mbox-short.txt')
for linea in texto:
    linea = linea.rstrip()
    if re.search('^From:.+@', linea):
        print(linea)

From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: zqian@umich.edu
From: rjlowe@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: gsilver@umich.edu
From: gsilver@umich.edu
From: zqian@umich.edu
From: gsilver@umich.edu
From: wagnermr@iupui.edu
From: zqian@umich.edu
From: antranig@caret.cam.ac.uk
From: gopal.ramasammycook@gmail.com
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: david.horwitz@uct.ac.za
From: stephen.marquard@uct.ac.za
From: louis@media.berkeley.edu
From: louis@media.berkeley.edu
From: ray@media.berkeley.edu
From: cwen@iupui.edu
From: cwen@iupui.edu
From: cwen@iupui.edu


In [29]:
import re
texto = open("mbox-short.txt")
for linea in texto:
    linea = linea.rstrip()
    if re.search(r'^X-\S+:', linea):
        print(linea)

X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Sat Jan  5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 18:10:48 2008
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 16:10:39 2008
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-Sieve: CMU Sieve 2.3
X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
X-Content-Type-Message-Body: text/plain; charset=UTF-8
X-DSPAM-Result: Innocent
X-DSPAM-Processed: Fri Jan  4 15:46:24 2008
X-DSPAM-Confidence:

## Matching and extracting data

**re.search()** returns a match object if the string matches the regular expression and None otherwise, so its result can be used as a true/false test.

To extract the matching text itself, **re.findall()** is normally used; it returns a list with all the matches.
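
The difference between the two functions can be seen on one short string. A minimal sketch, assuming an invented sample text; the variable names are illustrative only.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = "Invoices 12 and 345 are overdue"

# re.search() returns a Match object for the first match, or None.
m = re.search(r'[0-9]+', text)
if m:
    first = m.group()     # the text of the first match
    position = m.span()   # (start, end) indexes of that match
    print(first, position)

# With no match, re.search() returns None, which is falsy in an if test.
nothing = re.search(r'[0-9]+', "no digits here")
print(nothing)

# re.findall() returns a list with every non-overlapping match.
all_numbers = re.findall(r'[0-9]+', text)
print(all_numbers)
```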

In [45]:
import re

texto = "mis números favoritos son: 7, 21 y 42"
result1 = re.findall('[0-9]+', texto)
result2 = re.findall('[aeiou]+', texto)
result3 = re.findall('[AEIOU]+', texto)
print(result1)
print(result2)
print(result3)

['7', '21', '42']
['i', 'e', 'o', 'a', 'o', 'i', 'o', 'o']
[]


In [46]:
import re
s = 'A message from csev@umich.edu to cwen@iupui.edu about meeting @2PM'
lst = re.findall(r'\S+@\S+', s)
print(lst)

['csev@umich.edu', 'cwen@iupui.edu']


In [48]:
# Find and extract substrings that have an @ sign between non-whitespace characters
import re
texto = open('mbox-short.txt')
for linea in texto:
    linea = linea.rstrip()
    x = re.findall(r'\S+@\S+', linea)
    if len(x) > 0:
        print(x)

['stephen.marquard@uct.ac.za']
['<postmaster@collab.sakaiproject.org>']
['<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042308.m04N8v6O008125@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['<source@collab.sakaiproject.org>;']
['apache@localhost)']
['source@collab.sakaiproject.org;']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['<postmaster@collab.sakaiproject.org>']
['<200801042109.m04L92hb007923@nakamura.uits.iupui.edu>']
['<source@collab.sakaiproject

In [49]:
# Same search, but the match must start with a letter or digit and end with a letter,
# which leaves out surrounding characters such as < > ; and )
import re
texto = open('mbox-short.txt')
for linea in texto:
    linea = linea.rstrip()
    x = re.findall(r'[a-zA-Z0-9]\S+@\S+[a-zA-Z]', linea)
    if len(x) > 0:
        print(x)

['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['source@collab.sakaiproject.org']
['stephen.marquard@uct.ac.za']
['stephen.marquard@uct.ac.za']
['louis@media.berkeley.edu']
['postmaster@collab.sakaiproject.org']
['200801042308.m04N8v6O008125@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject.org']
['apache@localhost']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['source@collab.sakaiproject.org']
['louis@media.berkeley.edu']
['louis@media.berkeley.edu']
['zqian@umich.edu']
['postmaster@collab.sakaiproject.org']
['200801042109.m04L92hb007923@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']
['source@collab.sakaiproject

In [50]:
import re
texto = "From: usando el caracter : más de una vez:"
y = re.findall('^F.+:',texto)
print(y)

['From: usando el caracter : más de una vez:']


In [51]:
import re
texto = "From: usando el caracter : más de una vez:"
y = re.findall('^F.+?:',texto)
print(y)

['From:']


In [71]:
# Extract the host without regular expressions, using find and string slicing:
texto = "From: anavargas@lamolina.edu.pe asunto: todo bien Fri 15 june 14:28:10 2019"
x1 = texto.find('@')
print(x1)
x2 = texto.find(' ',x1)
print(x2)
x3 = texto[x1+1:x2]
print(x3)

15
31
lamolina.edu.pe


In [75]:
# Or, also without regular expressions, using split:
texto = "From: anavargas@lamolina.edu.pe asunto: todo bien Fri 15 june 14:28:10 2019"
words = texto.split()
print(words)
email = words[1]
print(email)
piezas = email.split('@')
print(piezas)
print(piezas[1])

['From:', 'anavargas@lamolina.edu.pe', 'asunto:', 'todo', 'bien', 'Fri', '15', 'june', '14:28:10', '2019']
anavargas@lamolina.edu.pe
['anavargas', 'lamolina.edu.pe']
lamolina.edu.pe


In [90]:
# Using regular expressions:
import re
texto = "From: anavargas@lamolina.edu.pe asunto: todo bien Fri 15 june 14:28:10 2019"
x = re.findall('@([^ ]*)',texto)
print(x)

['lamolina.edu.pe']
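
The parentheses in the pattern mark the part of the match that should be returned. The sketch below, which reuses the sample line from the cell above, compares the result with and without parentheses and shows a rough equivalent with **re.search()**; the variable names are illustrative only.

```python
import re

# Same sample line as in the cell above.
texto = "From: anavargas@lamolina.edu.pe asunto: todo bien Fri 15 june 14:28:10 2019"

# Without parentheses, findall returns the whole match, including the '@'.
whole = re.findall(r'@[^ ]*', texto)

# With parentheses, findall returns only the captured group.
host_only = re.findall(r'@([^ ]*)', texto)

# re.search exposes the same group through group(1) of the match object.
m = re.search(r'@([^ ]*)', texto)
host = m.group(1) if m else None

print(whole)
print(host_only)
print(host)
```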


In [94]:
# Suppose we want to extract the numbers from lines that start with "X-", such as:
# X-DSPAM-Confidence: 0.8475
# X-DSPAM-Probability: 0.0000

import re
texto = open('mbox-short.txt')
for linea in texto:
    linea = linea.rstrip()
    if re.search(r'^X\S*: [0-9.]+', linea):
        print(linea)

X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6178
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6961
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7565
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7626
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7556
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7002
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7615
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7601
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6959
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7606
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7559
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7605
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6932
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.7558
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6526
X-DSPAM-Probability: 0.0000
X-DSPAM-Confidence: 0.6948
X-DSPAM-Probability: 0.0000
X-DSPAM-Co

In [120]:
import re
texto = open('mbox-short.txt')
for linea in texto:
    linea = linea.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', linea)
    if len(x) > 0:
        print(x)

['0.8475']
['0.0000']
['0.6178']
['0.0000']
['0.6961']
['0.0000']
['0.7565']
['0.0000']
['0.7626']
['0.0000']
['0.7556']
['0.0000']
['0.7002']
['0.0000']
['0.7615']
['0.0000']
['0.7601']
['0.0000']
['0.7605']
['0.0000']
['0.6959']
['0.0000']
['0.7606']
['0.0000']
['0.7559']
['0.0000']
['0.7605']
['0.0000']
['0.6932']
['0.0000']
['0.7558']
['0.0000']
['0.6526']
['0.0000']
['0.6948']
['0.0000']
['0.6528']
['0.0000']
['0.7002']
['0.0000']
['0.7554']
['0.0000']
['0.6956']
['0.0000']
['0.6959']
['0.0000']
['0.7556']
['0.0000']
['0.9846']
['0.0000']
['0.8509']
['0.0000']
['0.9907']
['0.0000']


In [121]:
import re
texto = open('mbox-short.txt')
numeros = []
for linea in texto:
    linea = linea.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', linea)
    if len(x) > 0:
        numeros.append(x)
        #print(x)
print(numeros)

[['0.8475'], ['0.0000'], ['0.6178'], ['0.0000'], ['0.6961'], ['0.0000'], ['0.7565'], ['0.0000'], ['0.7626'], ['0.0000'], ['0.7556'], ['0.0000'], ['0.7002'], ['0.0000'], ['0.7615'], ['0.0000'], ['0.7601'], ['0.0000'], ['0.7605'], ['0.0000'], ['0.6959'], ['0.0000'], ['0.7606'], ['0.0000'], ['0.7559'], ['0.0000'], ['0.7605'], ['0.0000'], ['0.6932'], ['0.0000'], ['0.7558'], ['0.0000'], ['0.6526'], ['0.0000'], ['0.6948'], ['0.0000'], ['0.6528'], ['0.0000'], ['0.7002'], ['0.0000'], ['0.7554'], ['0.0000'], ['0.6956'], ['0.0000'], ['0.6959'], ['0.0000'], ['0.7556'], ['0.0000'], ['0.9846'], ['0.0000'], ['0.8509'], ['0.0000'], ['0.9907'], ['0.0000']]


In [131]:
import re
texto = open('mbox-short.txt')
numeros = []
for linea in texto:
    linea = linea.rstrip()
    x = re.findall(r'^X\S*: ([0-9.]+)', linea)
    #print(x)
    if len(x) != 1: continue
    num = float(x[0])
    numeros.append(num)
    #print(x)
print(numeros)


[0.8475, 0.0, 0.6178, 0.0, 0.6961, 0.0, 0.7565, 0.0, 0.7626, 0.0, 0.7556, 0.0, 0.7002, 0.0, 0.7615, 0.0, 0.7601, 0.0, 0.7605, 0.0, 0.6959, 0.0, 0.7606, 0.0, 0.7559, 0.0, 0.7605, 0.0, 0.6932, 0.0, 0.7558, 0.0, 0.6526, 0.0, 0.6948, 0.0, 0.6528, 0.0, 0.7002, 0.0, 0.7554, 0.0, 0.6956, 0.0, 0.6959, 0.0, 0.7556, 0.0, 0.9846, 0.0, 0.8509, 0.0, 0.9907, 0.0]
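
Once the values are floats they can be summarized with ordinary list operations. The following sketch runs without the file: it assumes a short list of lines in the same format as mbox-short.txt, and the list name lineas is hypothetical.

```python
import re

# A few lines in the same format as mbox-short.txt, so the sketch
# runs without the file. The list itself is an assumption.
lineas = [
    "X-DSPAM-Confidence: 0.8475",
    "X-DSPAM-Probability: 0.0000",
    "X-DSPAM-Result: Innocent",
    "X-DSPAM-Confidence: 0.6178",
]

numeros = []
for linea in lineas:
    x = re.findall(r'^X\S*: ([0-9.]+)', linea)
    if len(x) != 1:
        continue  # skip lines without a number after the colon
    numeros.append(float(x[0]))

maximo = max(numeros)
promedio = sum(numeros) / len(numeros)

print(numeros)
print(maximo)
print(promedio)
```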


## Escape character

In [133]:
import re
x = 'Solo se recibió $10.00 por los chocolates.'
y = re.findall(r'\$[0-9.]+',x)
print(y)

['$10.00']
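
The backslash is needed because $ normally means "end of the line". The sketch below contrasts the escaped and unescaped patterns on the same sentence and also uses **re.escape()**, which appears in the **dir(re)** listing, to build a literal pattern; the variable names are illustrative only.

```python
import re

x = 'Solo se recibió $10.00 por los chocolates.'

# Unescaped, $ is the end-of-line anchor, so no digits can follow it.
sin_escape = re.findall(r'$[0-9.]+', x)

# Escaped with a backslash, \$ matches a literal dollar sign.
con_escape = re.findall(r'\$[0-9.]+', x)

# re.escape() adds the backslashes needed to match a string literally.
patron = re.escape('$10.00')
literal = re.findall(patron, x)

print(sin_escape)
print(con_escape)
print(patron)
print(literal)
```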


In [134]:
dir(re)

['A',
 'ASCII',
 'DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'Match',
 'Pattern',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 '_cache',
 '_compile',
 '_compile_repl',
 '_expand',
 '_locale',
 '_pickle',
 '_special_chars_map',
 '_subx',
 'compile',
 'copyreg',
 'enum',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'functools',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'template']