# Advanced Regex

## Regular Expression review

- A powerful way to match text

In [1]:
import pandas as pd
import numpy as np
import re

https://regexr.com/

In [2]:
texto = '''
        Quando certa manhã Gregor Samsa acordou de sonhos intranquilos,
        encontrou-se em sua cama metamorfoseado num inseto monstruoso.
        '''

### Literal Strings vs Sets

#### literal strings: find the pattern 'Samsa'

In [3]:
print(re.findall('Samsa', texto))

['Samsa']


In [4]:
print(re.findall('so', texto))

['so', 'so']


In [5]:
print(re.findall('SO', texto))

[]


In [6]:
print(re.findall('samsa', texto))

[]
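The two empty results above show that literal matching is case-sensitive by default. As a minimal sketch (using a shortened, single-line copy of the excerpt, which is an assumption made for brevity), the standard `re.IGNORECASE` flag relaxes that:

```python
import re

# Shortened copy of the excerpt used above (assumption for this sketch).
texto = 'Quando certa manhã Gregor Samsa acordou de sonhos intranquilos'

# Without flags, the lowercase pattern does not match the capitalised name.
case_sensitive = re.findall('samsa', texto)

# re.IGNORECASE makes letters match regardless of case.
case_insensitive = re.findall('samsa', texto, flags=re.IGNORECASE)

print(case_sensitive)    # []
print(case_insensitive)  # ['Samsa']
```

Note that the match is returned as it appears in the text (`'Samsa'`), not as it was written in the pattern.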


#### sets: Finding the pattern `m` or `e` or `t` or ...

##### Example 1: m | e | t | a | m | o | r | f | o | s | e

In [7]:
print(re.findall('m|e|t|a|m|o|r|f|o|s|e', texto))

['a', 'o', 'e', 'r', 't', 'a', 'm', 'a', 'r', 'e', 'o', 'r', 'a', 'm', 's', 'a', 'a', 'o', 'r', 'o', 'e', 's', 'o', 'o', 's', 't', 'r', 'a', 'o', 's', 'e', 'o', 't', 'r', 'o', 's', 'e', 'e', 'm', 's', 'a', 'a', 'm', 'a', 'm', 'e', 't', 'a', 'm', 'o', 'r', 'f', 'o', 's', 'e', 'a', 'o', 'm', 's', 'e', 't', 'o', 'm', 'o', 's', 't', 'r', 'o', 's', 'o']


In [8]:
print(re.findall('[metamorfose]', texto))

['a', 'o', 'e', 'r', 't', 'a', 'm', 'a', 'r', 'e', 'o', 'r', 'a', 'm', 's', 'a', 'a', 'o', 'r', 'o', 'e', 's', 'o', 'o', 's', 't', 'r', 'a', 'o', 's', 'e', 'o', 't', 'r', 'o', 's', 'e', 'e', 'm', 's', 'a', 'a', 'm', 'a', 'm', 'e', 't', 'a', 'm', 'o', 'r', 'f', 'o', 's', 'e', 'a', 'o', 'm', 's', 'e', 't', 'o', 'm', 'o', 's', 't', 'r', 'o', 's', 'o']


In [9]:
re.findall('[metamorfose]', texto) == re.findall('m|e|t|a|m|o|r|f|o|s|e', texto)

True

In [14]:
re.findall('[m]|[e]|[t]|[a]', texto)


['a',
 'e',
 't',
 'a',
 'm',
 'a',
 'e',
 'a',
 'm',
 'a',
 'a',
 'e',
 't',
 'a',
 'e',
 't',
 'e',
 'e',
 'm',
 'a',
 'a',
 'm',
 'a',
 'm',
 'e',
 't',
 'a',
 'm',
 'e',
 'a',
 'm',
 'e',
 't',
 'm',
 't']

##### Example 2: City of São Paulo

In [15]:
text = 'São Paulo Sao Paulo Sáo Paulo Sun Paulo seu paulo san paolo sao paulo são paolo sAo Paolo sao_paulo'

pattern = r'[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uo]lo'
re.findall(pattern, text)

['São Paulo',
 'Sao Paulo',
 'Sáo Paulo',
 'Sun Paulo',
 'seu paulo',
 'san paolo',
 'sao paulo',
 'são paolo',
 'sAo Paolo',
 'sao_paulo']

In [16]:
nome = 'seu paulo'
pattern = r'[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uo]lo'
print(re.sub(pattern, 'São Paulo', nome))

São Paulo


In [17]:
pattern = r'[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uo]lo'
print(re.sub(pattern, 'São Paulo', text))

São Paulo São Paulo São Paulo São Paulo São Paulo São Paulo São Paulo São Paulo São Paulo São Paulo


In [22]:
nomes_sp = ['São Paulo', 'Sao Paulo', 'Sáo Paulo', 
            'Sun Paulo', 'seu paulo', 'san paolo', 
            'sao paulo', 'são paolo', 'sAo Paolo', 
            'sao_paulo', 'Rio de Janeiro']
print(re.sub(pattern, 'São Paulo', nomes_sp[0]))

São Paulo


In [19]:
print(re.sub(pattern, 'São Paulo', nomes_sp[1]))

São Paulo


In [20]:
print(re.sub(pattern, 'São Paulo', nomes_sp))

TypeError: expected string or bytes-like object

In [23]:
nome_sp_limpo = [re.sub(pattern, 'São Paulo', nome) for nome in nomes_sp]
print(nome_sp_limpo)

['São Paulo', 'São Paulo', 'São Paulo', 'São Paulo', 'São Paulo', 'São Paulo', 'São Paulo', 'São Paulo', 'São Paulo', 'São Paulo', 'Rio de Janeiro']
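When the same pattern is applied to many strings, one possible variant is to compile it once and wrap the substitution in a small helper. This is only a sketch: the names `SP_PATTERN` and `normalize_city` are hypothetical, and the sample list is a shortened version of the one above.

```python
import re

# Hypothetical names for this sketch: SP_PATTERN and normalize_city.
# The pattern is the same set-based pattern used above, compiled once.
SP_PATTERN = re.compile(r'[Ss][ãaáàâAÃÁÀeu][oun][ _][Pp]a[uo]lo')

def normalize_city(name):
    """Replace any recognised spelling variant with the canonical 'São Paulo'."""
    return SP_PATTERN.sub('São Paulo', name)

nomes = ['Sao Paulo', 'seu paulo', 'sao_paulo', 'Rio de Janeiro']
limpos = [normalize_city(nome) for nome in nomes]

print(limpos)  # ['São Paulo', 'São Paulo', 'São Paulo', 'Rio de Janeiro']
```

Strings that do not match the pattern, such as `'Rio de Janeiro'`, pass through unchanged.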


So anything within brackets `[ ]` is considered a `set` in RegEx: a set of characters, any one of which may match at that position.

## Since it is a set, you can match whole ranges

In [24]:
lista_tarefas = '''
    A) cortar grama
    B) arrumar porta
    C) instalar calha
    D) ligar para Pedro as 9
    '''

For example: The set of upper-case letters from A to D.

In [25]:
re.findall(r'A|B|C|D', lista_tarefas)

['A', 'B', 'C', 'D']

In [26]:
re.findall(r'[ABCD]', lista_tarefas)

['A', 'B', 'C', 'D']

In [29]:
re.findall(r'[A-D]', lista_tarefas)

['A', 'B', 'C', 'D']

In [30]:
re.findall(r'[A-Z]', lista_tarefas)

['A', 'B', 'C', 'D', 'P']

In [34]:
lista_tarefas = '''
    1) cortar grama
    2) arrumar porta
        2a trocar fechadura
    3) instalar calha
    4) ligar para Pedro as 9 9983
    '''
re.findall('1|2|3|4', lista_tarefas)

['1', '2', '2', '3', '4', '3']

In [32]:
re.findall(r'[1234]', lista_tarefas)

['1', '2', '2', '3', '4', '3']

In [33]:
re.findall(r'[1-4]', lista_tarefas)

['1', '2', '2', '3', '4', '3']

In [35]:
re.findall(r'[0-9]', lista_tarefas)

['1', '2', '2', '3', '4', '9', '9', '9', '8', '3']

In [36]:
re.findall(r'[0-9A-Z]', lista_tarefas)

['1', '2', '2', '3', '4', 'P', '9', '9', '9', '8', '3']

In [39]:
re.findall(r'[0-9][a-z]', lista_tarefas)

['2a']

Some useful sets: 

* `[a-z]`: Any lowercase letter between a and z.
* `[A-Z]`: Any uppercase letter between A and Z.
* `[0-9]`: Any numeric character between 0 and 9.
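These ranges cover only unaccented ASCII letters, which matters for Portuguese text. The sketch below illustrates that and shows ranges combined in a single set. The `à-ÿ` range is an assumption chosen for illustration, not a complete solution for accented text.

```python
import re

nome = 'são paulo'

# [a-z] covers only the 26 unaccented ASCII letters, so 'ã' splits the word.
ascii_only = re.findall(r'[a-z]+', nome)

# Several ranges can share one set. The à-ÿ range (U+00E0..U+00FF) is an
# assumption for this sketch: it covers common accented lowercase letters,
# but it also includes the division sign '÷'.
with_accents = re.findall(r'[a-zà-ÿ]+', nome)

print(ascii_only)    # ['s', 'o', 'paulo']
print(with_accents)  # ['são', 'paulo']
```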

# Meta characters - They mean something different from the literal character they are written with.

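* `.` : Matches **any character** except a newline (`\n`)
* `^` : As the first character inside a `set` (`[^...]`), negates the set
> Careful, this character has a second meaning: if used <u>outside a set</u>, it matches at the beginning of the string, or at the beginning of each line when `re.MULTILINE` is set
* `$` : Matches at the end of the string (or just before a trailing newline), and at the end of each line when `re.MULTILINE` is set
* `|` : "OR" operator (alternation)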
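Because these characters have special meanings, matching them literally requires escaping. The following sketch (with a made-up sample string) contrasts an unescaped and an escaped dot, and shows the standard `re.escape` helper:

```python
import re

# Made-up sample string for this sketch.
text = 'versions 3.1 and 3x1'

# Unescaped, '.' is a metacharacter and matches any character.
unescaped = re.findall(r'3.1', text)

# A backslash turns it back into a literal dot.
escaped = re.findall(r'3\.1', text)

# re.escape builds the escaped form of an arbitrary literal string.
auto = re.escape('3.1')

print(unescaped)  # ['3.1', '3x1']
print(escaped)    # ['3.1']
print(auto)       # 3\.1
```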

## OR

In [41]:
text = 'pedro PEDRO adriano ADRIANO'

In [42]:
pattern = '[PA][ED]|[pa][ed]'
re.findall(pattern, text)

['pe', 'PE', 'ad', 'AD']

In [43]:
text = '''
I like penguins
I like lions
I like penguins and lions
'''
pattern = 'penguins|lions'
re.findall(pattern, text)

['penguins', 'lions', 'penguins', 'lions']

In [45]:
print(re.sub(pattern, 'animals', text))


I like animals
I like animals
I like animals and animals
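One detail worth keeping in mind is that `|` has the lowest precedence of all regex operators, so it splits the whole pattern unless it is wrapped in a group. A small sketch with a made-up string, using the non-capturing group syntax `(?:...)`:

```python
import re

# Made-up sample string for this sketch.
text = 'gray grey'

# '|' has the lowest precedence: this reads as "gra" OR "ey".
loose = re.findall(r'gra|ey', text)

# A non-capturing group (?:...) limits the alternation to the middle letter.
grouped = re.findall(r'gr(?:a|e)y', text)

print(loose)    # ['gra', 'ey']
print(grouped)  # ['gray', 'grey']
```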



## Match any character

In [54]:
text = '''My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not.'''
pattern = r'.'
print(re.findall(pattern, text))

['M', 'y', ' ', 'b', 'o', 's', 's', ' ', 'a', 's', 'k', 'e', 'd', ' ', 'm', 'e', ' ', 't', 'o', ' ', 't', 'u', 'r', 'n', ' ', 'i', 'n', ' ', 'm', 'y', ' ', 'T', 'P', 'S', ' ', 'r', 'e', 'p', 'o', 'r', 't', 's', '.', ' ', 'I', ' ', 't', 'o', 'l', 'd', ' ', 'h', 'i', 'm', ' ', 't', 'h', 'e', 'y', ' ', 'w', 'e', 'r', 'e', ' ', 'd', 'o', 'n', 'e', ',', ' ', 'b', 'u', 't', ' ', 't', 'h', 'e', 'y', ' ', 'a', 'r', 'e', ' ', 'n', 'o', 't', '.']


In [55]:
print(text)

My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not.


In [56]:
print(''.join(re.findall(pattern, text)))

My boss asked me to turn in my TPS reports. I told him they were done, but they are not.


In [57]:
pattern = r'.|\n'
print(''.join(re.findall(pattern, text)))

My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not.
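An alternative to writing `.|\n` is the `re.DOTALL` flag, which makes `.` match newlines as well. A minimal sketch with a short made-up string:

```python
import re

# Short made-up two-line string for this sketch.
text = 'first line\nsecond line'

# By default '.' skips the newline, so it is missing from the joined result.
default = ''.join(re.findall(r'.', text))

# With re.DOTALL, '.' also matches '\n'.
dotall = ''.join(re.findall(r'.', text, flags=re.DOTALL))

print(default == text)  # False
print(dotall == text)   # True
```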


## Match everything not in specific set

In [64]:
text = """My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not."""
pattern = r'[^a-mA-M]'
print(re.findall(pattern, text))

['y', ' ', 'o', 's', 's', ' ', 's', ' ', ' ', 't', 'o', ' ', 't', 'u', 'r', 'n', ' ', 'n', ' ', 'y', ' ', 'T', 'P', 'S', ' ', 'r', 'p', 'o', 'r', 't', 's', '.', ' ', '\n', ' ', 't', 'o', ' ', ' ', 't', 'y', ' ', 'w', 'r', ' ', 'o', 'n', ',', ' ', 'u', 't', ' ', 't', 'y', ' ', 'r', ' ', 'n', 'o', 't', '.']


In [65]:
print(''.join(re.findall(pattern, text)))

y oss s  to turn n y TPS rports. 
 to  ty wr on, ut ty r not.
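A common practical use of a negated set is stripping unwanted characters with `re.sub`. As a sketch, assuming the goal is to keep only the digits of a formatted number:

```python
import re

raw = '339.211.273-23'

# [^0-9] matches every character that is NOT a digit; replacing those
# matches with an empty string leaves only the digits.
digits_only = re.sub(r'[^0-9]', '', raw)

print(digits_only)  # 33921127323
```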


## Match at the beginning (`^`) or at the end (`$`) of a line

In [72]:
text = '''My boss asked me to turn in my TPS reports.
The boss told him they were done, but they are not.'''

In [69]:
pattern = r'^My boss'
re.findall(pattern, text)

['My boss']

In [70]:
pattern = r'^The boss'
re.findall(pattern, text)

[]

In [71]:
pattern = r'^The boss'
re.findall(pattern, text, re.MULTILINE)

['The boss']

In [73]:
pattern = r'are not\.$'
re.findall(pattern, text)

['are not.']

In [75]:
pattern = r'reports\.$'
re.findall(pattern, text, re.MULTILINE)

['reports.']
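The flag matters for `$` just as it does for `^`. The sketch below runs the same end-of-line pattern with and without `re.MULTILINE` on the two-sentence text used in this section:

```python
import re

text = ('My boss asked me to turn in my TPS reports.\n'
        'The boss told him they were done, but they are not.')

# Without the flag, '$' only matches at the very end of the string,
# so the first line's ending is not found.
without_flag = re.findall(r'reports\.$', text)

# With re.MULTILINE, '$' also matches right before each newline.
with_flag = re.findall(r'reports\.$', text, flags=re.MULTILINE)

print(without_flag)  # []
print(with_flag)     # ['reports.']
```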

## Character classes

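* `\d`: digits
* `\w`: "word" characters: letters, digits, and the underscore (`_`)
* `\s`: whitespace (spaces, tabs, newlines)
* `\D`: anything that is not a digit (likewise, `\W` and `\S` negate `\w` and `\s`)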

In [76]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'\d'
print(re.findall(pattern, text))

['3', '2', '3', '2', '1', '3', '3', '1']


In [82]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'[^\w\s]'
print(re.findall(pattern, text))

['(', '$', '-', '=', '™', '¡', '¡', '™', '£', '¡', '.']
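The output above shows that accented letters count as `\w` in Python 3, where `str` patterns are Unicode-aware by default. As a sketch with a made-up string, the `re.ASCII` flag restricts `\w` to `[a-zA-Z0-9_]`:

```python
import re

# Made-up sample string for this sketch.
text = 'áéóãà abc_1 £'

# Default (Unicode) behaviour: accented letters are word characters.
unicode_words = re.findall(r'\w+', text)

# re.ASCII limits \w to unaccented letters, digits, and underscore.
ascii_words = re.findall(r'\w+', text, flags=re.ASCII)

# Anything that is neither a word character nor whitespace.
non_word = re.findall(r'[^\w\s]', text)

print(unicode_words)  # ['áéóãà', 'abc_1']
print(ascii_words)    # ['abc_1']
print(non_word)       # ['£']
```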


# Quantifiers 

* `*`: Matches the previous element 0 or more times
* `+`: Matches the previous element 1 or more times
* `?`: Matches the previous element 0 or 1 times (optional)
* `{}`: Matches the previous element the number of times specified inside the braces:
    * `{n}`: Exactly n times
    * `{n,}`: At least n times
    * `{n,m}`: Between n and m times

## `\d*` matches a run of 0 or more digits (so it also matches the empty string)

In [86]:
text = 'aaaaaoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'\d*'
print(re.findall(pattern, text))

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '3232', '', '13', '', '', '', '', '', '', '', '', '', '', '3', '', '1', '', '', '', '', '', '', '']


In [87]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'\d+'
print(re.findall(pattern, text))

['3232', '13', '3', '1']


In [96]:
text = '339.211.273-23 33921127323 339.211.27323'
pattern = r'[0-9]+\.?[0-9]+\.?[0-9]+-?[0-9]+'
print(re.findall(pattern, text))

['339.211.273-23', '33921127323', '339.211.27323']
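The `+` quantifiers make the pattern above quite permissive: it accepts any run of four or more digits. A tighter sketch uses the `{n}` quantifier to require the 3-3-3-2 digit layout. The trailing `12345` in the sample text and the name `cpf_pattern` are additions made for this illustration, and the pattern checks only the layout, not whether the number is a valid CPF.

```python
import re

# Same three numbers as above, plus a short number added for this sketch.
text = '339.211.273-23 33921127323 339.211.27323 12345'

# Permissive version from above: '+' accepts any number of digits per block.
loose = re.findall(r'[0-9]+\.?[0-9]+\.?[0-9]+-?[0-9]+', text)

# Stricter version: exactly 3, 3, 3 and 2 digits, separators optional.
cpf_pattern = r'\d{3}\.?\d{3}\.?\d{3}-?\d{2}'
strict = re.findall(cpf_pattern, text)

print(loose)   # ['339.211.273-23', '33921127323', '339.211.27323', '12345']
print(strict)  # ['339.211.273-23', '33921127323', '339.211.27323']
```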


In [107]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 33.1 áéóãà'
pattern = r'\d?'
print(re.findall(pattern, text))

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '3', '2', '3', '2', '', '1', '3', '', '', '', '', '', '', '', '', '', '', '3', '3', '', '1', '', '', '', '', '', '', '']


In [112]:
pattern = r'\d+\.?\d+'
print(re.findall(pattern, text))

['3232', '13', '33.1']


In [None]:
pattern = r'[0-9]\.?[0-9]+'
print(re.findall(pattern, text))

## Revisiting the `$` example with one of the most useful quantifiers, `*`

In [118]:
text = '''My boss asked me to turn. in my TPS reports.
My boss told him they were done, but they are not.'''

In [114]:
# Build a pattern that checks whether the text ends with "are not."
re.findall(r'are not\.$', text)

['are not.']

In [115]:
re.findall(r'.are not\.$', text)

[' are not.']

In [116]:
re.findall(r'.*are not\.$', text)

['My boss told him they were done, but they are not.']

In [120]:
re.findall(r'.*\.$', text, re.MULTILINE)

['My boss asked me to turn. in my TPS reports.',
 'My boss told him they were done, but they are not.']

In [None]:
text = '''My boss asked me to turn in my TPS reports. 
My boss told him they were done, but they are not.'''

In [None]:
re.findall(r'.*[. ]$', text, re.MULTILINE)

# Important Regex Concept: Greediness
https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

In [122]:
text = 'You are yelling! So I will yell too! Let me yell!.'

In [123]:
pattern = r'.*!'
print(re.findall(pattern, text))

['You are yelling! So I will yell too! Let me yell!']


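When repeating a regular expression, as in `a*`, **the resulting action is to consume as much of the pattern as possible.** Adding `?` after a quantifier (`*?`, `+?`, `??`) makes it non-greedy, so it matches as little as possible.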

In [124]:
pattern = r'.*?!'
print(re.findall(pattern, text))

['You are yelling!', ' So I will yell too!', ' Let me yell!']
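The same contrast shows up when matching delimited chunks such as tags, which is the example used in the Python regex HOWTO linked above. A minimal sketch:

```python
import re

html = '<html><head><title>Title</title>'

# Greedy: '.*' runs to the LAST '>' in the string.
greedy = re.findall(r'<.*>', html)

# Non-greedy: '.*?' stops at the FIRST '>' it can.
lazy = re.findall(r'<.*?>', html)

print(greedy)  # ['<html><head><title>Title</title>']
print(lazy)    # ['<html>', '<head>', '<title>', '</title>']
```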


# Capturing group
https://docs.python.org/3/howto/regex.html#grouping

In [135]:
text = '''
From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com
'''

In [136]:
pattern = r'(.*):(.*)'
re.findall(pattern, text)

[('From', ' author@example.com'),
 ('User-Agent', ' Thunderbird 1.5.0.9 (X11/20061227)'),
 ('MIME-Version', ' 1.0'),
 ('To', ' editor@example.com')]

In [138]:
dict_re = {campo.lower() : valor.strip() for campo, valor in re.findall(pattern, text)}
print(dict_re)

{'from': 'author@example.com', 'user-agent': 'Thunderbird 1.5.0.9 (X11/20061227)', 'mime-version': '1.0', 'to': 'editor@example.com'}
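Groups interact with greediness: `(.*):` runs to the last colon on a line, which breaks as soon as a value itself contains a colon. The sketch below shows one possible way around it using named groups, `(?P<name>...)`. The `Date` header is a hypothetical line added for this sketch, and the names `header`, `field`, and `value` are illustrative.

```python
import re

# 'Date: 10:30' is a hypothetical header added to show a value with a colon.
text = 'From: author@example.com\nDate: 10:30\nTo: editor@example.com\n'

# Greedy groups split each line at its LAST colon.
greedy_pairs = re.findall(r'(.*):(.*)', text)

# Named groups; [^:\n]+ stops the field name at the FIRST colon.
header = re.compile(r'^(?P<field>[^:\n]+):[ \t]*(?P<value>.*)$', re.MULTILINE)
headers = {m.group('field').lower(): m.group('value')
           for m in header.finditer(text)}

print(greedy_pairs[1])  # ('Date: 10', '30')
print(headers)
# {'from': 'author@example.com', 'date': '10:30', 'to': 'editor@example.com'}
```

Named groups also make the dictionary comprehension self-documenting, since each part of the match is retrieved by name rather than by position.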


In [None]:
dict_re['user-agent']
