| **Home** | **Back 22** |
|----------- |-------------- |
| [🏠](../../README.md) | [⏪](./22.Exception_Handling.ipynb)|

# **23. Regular Expressions**

A regular expression, or RegEx, is a special text string that helps find patterns in data. A RegEx can be used to check whether a given pattern exists in a string. To use RegEx in Python, we first import the built-in module named re.

## **The re Module**

After importing the module, we can use it to detect or find patterns.

In [1]:
import re

## **Methods in the re Module**

To find a pattern, we combine RegEx character sets with the following re functions, which search a string for a match:

* re.match(): searches only at the beginning of the string and returns a match object if found; otherwise it returns None.
* re.search(): returns a match object for the first match anywhere in the string, including multi-line strings; otherwise it returns None.
* re.findall(): returns a list containing all matches.
* re.split(): takes a string, splits it at the match points, and returns a list.
* re.sub(): replaces one or more matches within a string.
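
As a quick orientation, the five functions listed above can be tried side by side on one string. This is only a minimal sketch; the sample text and the variable names are made up for illustration.

```python
import re

text = 'Python is fun. I teach python daily.'  # hypothetical sample text

# match: succeeds only if the pattern is found at the very start of the string
at_start = re.match('python', text, re.I)
# search: first occurrence anywhere in the string
first = re.search('teach', text)
# findall: every non-overlapping occurrence, as a list of strings
every = re.findall('python', text, re.I)
# split: cut the string wherever the pattern matches
parts = re.split(r'\. ', text)
# sub: replace each match with the replacement string
replaced = re.sub('python', 'Rust', text, flags=re.I)

print(at_start.group())  # Python
print(first.span())      # (17, 22)
print(every)             # ['Python', 'python']
print(parts)             # ['Python is fun', 'I teach python daily.']
print(replaced)          # Rust is fun. I teach Rust daily.
```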

### **Match**

```
# syntax
re.match(substring, string, re.I)
# substring is a string or a pattern, string is the text we search for the pattern, re.I is the ignore-case flag
```

In [4]:
import re

txt = 'I love to teach python and javaScript'
# Returns a match object with the span and the matched text
match = re.match('I love to teach', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to teach'>
# We can get the start and end positions of the match as a tuple using span
span = match.span()
print(span)     # (0, 15)
# Let's unpack the start and end positions from the span
start, end = span
print(start, end)  # 0 15
substring = txt[start:end]
print(substring)       # I love to teach

<re.Match object; span=(0, 15), match='I love to teach'>
(0, 15)
0 15
I love to teach


As you can see from the example above, the pattern we are looking for (the substring) is I love to teach. The match function returns a match object only if the text starts with the pattern.

In [5]:
import re

txt = 'I love to teach python and javaScript'
match = re.match('I like to teach', txt, re.I)
print(match)  # None

None


The string does not start with I like to teach, so there was no match and the match function returned None.
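
Because re.match returns None when nothing matches, calling span() directly on the result would raise an AttributeError. Below is a minimal sketch of guarding against that; the helper name leading_span is hypothetical and not part of the re module.

```python
import re

def leading_span(pattern, text):
    """Return the (start, end) span of a match at the start of text, or None."""
    match = re.match(pattern, text, re.I)
    if match is None:  # no match at the beginning of the string
        return None
    return match.span()

txt = 'I love to teach python and javaScript'
print(leading_span('I love to teach', txt))  # (0, 15)
print(leading_span('I like to teach', txt))  # None
```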

### **Search**

```
# syntax
re.search(substring, string, re.I)
# substring is a pattern, string is the text we search for the pattern, re.I is the ignore-case flag
```

In [6]:
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# Returns a match object with the span and the matched text
match = re.search('first', txt, re.I)
print(match)  # <re.Match object; span=(100, 105), match='first'>
# We can get the start and end positions of the match as a tuple using span
span = match.span()
print(span)     # (100, 105)
# Let's unpack the start and end positions from the span
start, end = span
print(start, end)  # 100 105
substring = txt[start:end]
print(substring)       # first

<re.Match object; span=(100, 105), match='first'>
(100, 105)
100 105
first


As you can see, search is more flexible than match because it looks for the pattern throughout the text. search returns a match object for the first match found; otherwise it returns None. When we need every occurrence, findall is the better fit: it searches the whole string and returns all matches as a list.

### **Searching for all matches using findall**

findall() returns all the matches as a list.

In [7]:
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('language', txt, re.I)
print(matches)  # ['language', 'language']

['language', 'language']


As you can see, the word language was found twice in the string. Let's practice some more. Now we will look for both Python and python in the string:

In [8]:
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns a list
matches = re.findall('python', txt, re.I)
print(matches)  # ['Python', 'python']

['Python', 'python']


Since we are using re.I, both lowercase and uppercase letters are included. Without the re.I flag, we have to write our pattern differently. Let's check it out:

In [9]:
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

matches = re.findall('Python|python', txt)
print(matches)  # ['Python', 'python']

# Or use a character set for the first letter
matches = re.findall('[Pp]ython', txt)
print(matches)  # ['Python', 'python']

['Python', 'python']
['Python', 'python']


### **Replacing a substring**

In [10]:
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# Pass the flag by keyword: the fourth positional parameter of re.sub is count, not flags
match_replaced = re.sub('Python|python', 'JavaScript', txt, flags=re.I)
print(match_replaced)  # JavaScript is the most beautiful language that a human being has ever created. ...
# OR
match_replaced = re.sub('[Pp]ython', 'JavaScript', txt, flags=re.I)
print(match_replaced)  # JavaScript is the most beautiful language that a human being has ever created. ...

JavaScript is the most beautiful language that a human being has ever created.
I recommend JavaScript for a first programming language
JavaScript is the most beautiful language that a human being has ever created.
I recommend JavaScript for a first programming language
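
One detail about re.sub is worth keeping in mind: its fourth positional parameter is count (the maximum number of replacements), not flags. Since re.I has the integer value 2, passing it positionally would be read as count=2. The sketch below, which uses a made-up sample string, passes both by keyword to keep them apart.

```python
import re

sample = 'python Python PYTHON'  # hypothetical sample text

# flags=re.I: case-insensitive, replace every match
all_replaced = re.sub('python', 'JavaScript', sample, flags=re.I)
# count=2 without the flag: case-sensitive, at most two replacements
count_only = re.sub('python', 'JavaScript', sample, count=2)
# both: case-insensitive, but stop after the first replacement
first_only = re.sub('python', 'JavaScript', sample, count=1, flags=re.I)

print(int(re.I))     # 2
print(all_replaced)  # JavaScript JavaScript JavaScript
print(count_only)    # JavaScript Python PYTHON
print(first_only)    # JavaScript Python PYTHON
```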


Let's add one more example. The following string is very hard to read unless we remove the % symbol. Replacing % with an empty string cleans up the text.

In [11]:

txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''

matches = re.sub('%', '', txt)
print(matches)

I am teacher and  I love teaching. 
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs. 
Does this motivate you to be a teacher?


## **Splitting text using RegEx split**

In [12]:
txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''
print(re.split('\n', txt)) # splitting using \n - end of line symbol

['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
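
re.split becomes more useful than the plain string split when the separator is itself a pattern. Here is a small sketch, with a made-up input string, that splits on a comma or semicolon followed by optional whitespace.

```python
import re

raw = 'apple, banana;cherry,   mango'  # hypothetical input with mixed separators

# [,;] matches either separator, \s* swallows any whitespace after it
items = re.split(r'[,;]\s*', raw)
print(items)  # ['apple', 'banana', 'cherry', 'mango']
```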


### **Writing RegEx patterns**

To declare a string variable we use single or double quotes. To declare a RegEx pattern we use a raw string, r''. The following pattern matches only apple in lowercase; to make it case-insensitive, we either rewrite the pattern or add a flag.

In [13]:
import re

regex_pattern = r'apple'
txt = 'Apple and banana are fruits. An old cliche says an apple a day keeps the doctor away has been replaced by a banana a day keeps the doctor far far away. '
matches = re.findall(regex_pattern, txt)
print(matches)  # ['apple']

# To make the search case-insensitive, add the re.I flag
matches = re.findall(regex_pattern, txt, re.I)
print(matches)  # ['Apple', 'apple']
# Or we can use a character set
regex_pattern = r'[Aa]pple'  # the first letter can be either A or a
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']

['apple']
['Apple', 'apple']
['Apple', 'apple']
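
The r prefix matters because, in a normal string literal, Python itself interprets some backslash sequences before re ever sees them. As a small illustration (the sample text is made up), '\b' is a backspace character in a normal string, whereas r'\b' reaches the regex engine as the word-boundary token.

```python
import re

txt = 'cat concat catalog'  # hypothetical sample text

# Raw string: \b is passed through to re as a word boundary
raw_matches = re.findall(r'\bcat\b', txt)
print(raw_matches)    # ['cat']

# Normal string: '\b' is already a single backspace character (ASCII 8)
plain_matches = re.findall('\bcat\b', txt)
print(plain_matches)  # []

print(len(r'\b'), len('\b'))  # 2 1
```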


![regex](../imagenes%20Python/regex.png "regex")

### **Square bracket**

Let's use square brackets to include both uppercase and lowercase:

In [14]:
regex_pattern = r'[Aa]pple' # the square bracket means either A or a
txt = 'Apple and banana are fruits. An old cliche says an apple a day keeps the doctor away has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']

['Apple', 'apple']


If we also want to find banana, we write the pattern as follows:

In [15]:
regex_pattern = r'[Aa]pple|[Bb]anana' # square brackets give either case, | means or
txt = 'Apple and banana are fruits. An old cliche says an apple a day keeps the doctor away has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'banana', 'apple', 'banana']

['Apple', 'banana', 'apple', 'banana']


Using square brackets and the or operator (|), we extract Apple, apple, and both occurrences of banana. The same pattern would also match Banana if it appeared in the text.

### **Escape character ( \ ) in RegEx**

In [16]:
regex_pattern = r'\d'  # \d is a special sequence that matches a single digit
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'], this is not what we want

['6', '2', '0', '1', '9', '8', '2', '0', '2', '1']
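
The backslash also works in the other direction: placed before a metacharacter such as the period, it makes that character literal. A short sketch with a made-up sentence:

```python
import re

txt = 'Version 3.10 was released after 3x10 tests'  # hypothetical sample text

# Unescaped: . matches any character, so 3x10 is matched as well
loose = re.findall(r'3.10', txt)
print(loose)   # ['3.10', '3x10']

# Escaped: \. matches only a literal period
strict = re.findall(r'3\.10', txt)
print(strict)  # ['3.10']
```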


### **One or more times (+)**

In [17]:
regex_pattern = r'\d+'  # \d matches a digit, + means one or more times
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021'] - now, this is better!

['6', '2019', '8', '2021']


### **Period (.)**

In [18]:
regex_pattern = r'[a].'  # [a] matches the letter a, and . matches any character except a newline
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

regex_pattern = r'[a].+'  # . matches any character, + repeats it one or more times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['an', 'an', 'an', 'a ', 'ar']
['and banana are fruits']
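
Since the period does not match a newline by default, a pattern such as a.+ runs to the end of the current line and stops there. A brief sketch on a made-up two-line string:

```python
import re

txt = 'and banana\nare fruits'  # hypothetical two-line sample

# .+ is greedy but cannot cross the newline, so each line yields its own match
per_line = re.findall(r'a.+', txt)
print(per_line)  # ['and banana', 'are fruits']
```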


### **Zero or more times (*)**

Zero or more times. The pattern may not occur at all, or it may occur many times.

In [19]:
regex_pattern = r'[a].*'  # . matches any character, * repeats it zero or more times
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['and banana are fruits']


### **Zero or one time (?)**

Zero or one time. The pattern may not occur at all, or it may occur once.

In [20]:
txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means here that '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']

['e-mail', 'email', 'Email', 'E-mail']


### **Quantifier in RegEx**

We can specify the length of the substring we are looking for in a text using curly braces. Suppose we are interested in substrings of exactly 4 digits:

In [21]:
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['2019', '2021']

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{1,4}'   # 1 to 4 digits; no space is allowed inside the braces
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021']

['2019', '2021']
['6', '2019', '8', '2021']
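
Two details of the brace quantifier are easy to miss: \d{4} also matches the first four digits of a longer number, and a space inside the braces stops them from being read as a quantifier, so the pattern finds nothing in ordinary text. A small sketch on a made-up string:

```python
import re

txt = 'Room 7, floor 12, ext 4521, badge 987654'  # hypothetical sample text

exactly_four = re.findall(r'\d{4}', txt)
print(exactly_four)  # ['4521', '9876']

one_to_four = re.findall(r'\d{1,4}', txt)
print(one_to_four)   # ['7', '12', '4521', '9876', '54']

two_or_more = re.findall(r'\d{2,}', txt)
print(two_or_more)   # ['12', '4521', '987654']

# With a space, {1, 4} is treated as literal text rather than a quantifier
with_space = re.findall(r'\d{1, 4}', txt)
print(with_space)    # []
```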


### **Caret (^)**

* Starts with

In [22]:
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'^This'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['This']

['This']


* Negation

In [23]:
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'[^A-Za-z ]+'  # ^ inside a character set means negation: not A to Z, not a to z, not a space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6,', '2019', '8,', '2021']

['6,', '2019', '8,', '2021']
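
The two meanings of the caret can appear in the same pattern. In the sketch below (the sample text is made up), the first ^ anchors the match to the start of the string and the second, inside the brackets, negates the digit range.

```python
import re

txt = 'Order 66 shipped'  # hypothetical sample text

# Anchored: only the run of non-digits at the very start of the string
leading = re.findall(r'^[^0-9]+', txt)
print(leading)     # ['Order ']

# Not anchored: every run of non-digits
all_runs = re.findall(r'[^0-9]+', txt)
print(all_runs)    # ['Order ', ' shipped']
```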

