# Advanced Regex

## Regular Expression review

- A powerful way to match text

In [None]:
import pandas as pd
import numpy as np
import re

An interactive regex tester for experimenting with patterns: https://regexr.com/

In [None]:
texto = '''
        Quando certa manhã Gregor Samsa acordou de sonhos intranquilos,
        encontrou-se em sua cama metamorfoseado num inseto monstruoso.
        '''

### Literal Strings vs Sets

#### literal strings: find the pattern 'Samsa'

In [None]:
print(re.findall('Samsa', texto))

In [None]:
print(re.findall('so', texto))

In [None]:
print(re.findall('SO', texto))

#### sets: Finding the pattern `m` or `e` or `t` or ...

##### Example 1: m | e | t | a | m | o | r | f | o | s | e

In [None]:
print(re.findall('m|e|t|a|m|o|r|f|o|s|e', texto))

In [None]:
print(re.findall('[metamorfose]', texto))

In [None]:
re.findall('[metamorfose]', texto) == re.findall('m|e|t|a|m|o|r|f|o|s|e', texto)

##### Example 2: the city of São Paulo

In [None]:
text = 'São Paulo Sao Paulo Sáo Paulo Sun Paulo seu paulo san paolo sao paulo são paolo sAo Paolo sao_paulo'

pattern = r'[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uo]lo'
re.findall(pattern, text)

In [None]:
nome = 'seu paulo'
pattern = r'[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uo]lo'
print(re.sub(pattern, 'São Paulo', nome))

In [None]:
pattern = r'[Ss][ãaáàâAÃÁÀâeu][oun][ _][Pp]a[uo]lo'
print(re.sub(pattern, 'São Paulo', text))

In [None]:
nomes_sp = ['São Paulo', 'Sao Paulo', 'Sáo Paulo', 
            'Sun Paulo', 'seu paulo', 'san paolo', 
            'sao paulo', 'são paolo', 'sAo Paolo', 
            'sao_paulo']
print(re.sub(pattern, 'São Paulo', nomes_sp[0]))

In [None]:
print(re.sub(pattern, 'São Paulo', nomes_sp[1]))

In [None]:
# Raises TypeError: re.sub expects a string, not a list
print(re.sub(pattern, 'São Paulo', nomes_sp))

In [None]:
nome_sp_limpo = [re.sub(pattern, 'São Paulo', nome) for nome in nomes_sp]
print(nome_sp_limpo)
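When the same pattern is applied to many strings, it can be compiled once with `re.compile` and reused. The sketch below assumes a simplified, ASCII-only variant of the pattern above and a hypothetical list `raw_names`; it is an illustration, not a replacement for the full pattern.

```python
import re

# Simplified, ASCII-only variant of the pattern above (an assumption for this sketch).
city_pattern = re.compile(r'[Ss][ae][oun][ _][Pp]a[uo]lo')

# Hypothetical input list; the last entry should be left untouched.
raw_names = ['Sao Paulo', 'seu paulo', 'san paolo', 'sao_paulo', 'Rio de Janeiro']

# The compiled object offers the same methods as the module (sub, findall, ...).
clean_names = [city_pattern.sub('São Paulo', name) for name in raw_names]
print(clean_names)
```

Strings that do not match the pattern pass through `sub` unchanged, so the list comprehension is safe to run over mixed data.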

Anything inside brackets `[ ]` is a `set` in regex: a set of characters, any one of which can match at that position.

## Since it is a set, you can match whole ranges of characters

In [None]:
lista_tarefas = '''
    A) cortar grama
    B) arrumar porta
    C) instalar calha
    D) ligar para Pedro as 9
    '''

For example: The set of upper-case letters from A to D.

In [None]:
re.findall(r'A|B|C|D', lista_tarefas)

In [None]:
re.findall(r'[ABCD]', lista_tarefas)

In [None]:
re.findall(r'[A-D]', lista_tarefas)

In [None]:
re.findall(r'[A-Z]', lista_tarefas)

In [None]:
lista_tarefas = '''
    1) cortar grama
    2) arrumar porta
        2a trocar fechadura
    3) instalar calha
    4) ligar para Pedro as 9
    '''
re.findall('1|2|3|4', lista_tarefas)

In [None]:
re.findall(r'[1234]', lista_tarefas)

In [None]:
re.findall(r'[1-4]', lista_tarefas)

In [None]:
re.findall(r'[0-9]', lista_tarefas)

In [None]:
re.findall(r'[0-9A-Z]', lista_tarefas)

In [None]:
re.findall(r'[0-9][a-z]', lista_tarefas)

Some useful sets:

* `[a-z]`: Any lowercase letter from a to z.
* `[A-Z]`: Any uppercase letter from A to Z.
* `[0-9]`: Any digit from 0 to 9.
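Ranges follow character code order, so `[a-z]` covers only the 26 unaccented lowercase letters. The sketch below uses a hypothetical sample string to show that accented letters such as `ã` have to be listed explicitly, and that several ranges can share one set.

```python
import re

sample = 'são paulo 2024'  # hypothetical sample string

# [a-z] skips the accented letter.
plain = re.findall(r'[a-z]', sample)
print(plain)

# Listing the accented letter explicitly extends the set.
with_accent = re.findall(r'[a-zã]', sample)
print(with_accent)

# Several ranges in one set: any ASCII letter or digit.
alnum = re.findall(r'[a-zA-Z0-9]', sample)
print(alnum)
```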

# Metacharacters: characters that mean something other than their literal value

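* `.` : Matches **any character** except a newline (`\n`)
* `^` : Inside a `set` (as its first character), negates the set
> Careful, this character has a second meaning: <u>outside a set</u>, it anchors the match to the beginning of the string, or to the beginning of each line when `re.MULTILINE` is set
* `$` : Anchors the match to the end of the string (or just before a trailing newline), or to the end of each line when `re.MULTILINE` is set
* `|` : "OR" operator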
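Because these characters have a special meaning, matching them literally requires a backslash escape, or `re.escape` for a whole string. A minimal sketch with a hypothetical sample string:

```python
import re

sample = '3.1 3x1 3-1'  # hypothetical sample string

# Unescaped '.' is a metacharacter: it matches any character.
any_char = re.findall(r'3.1', sample)
print(any_char)   # ['3.1', '3x1', '3-1']

# Escaped '\.' matches only a literal period.
literal = re.findall(r'3\.1', sample)
print(literal)    # ['3.1']

# re.escape builds the escaped pattern for an arbitrary literal string.
escaped = re.findall(re.escape('3.1'), sample)
print(escaped)    # ['3.1']
```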

## OR

In [None]:
text = 'pedro PEDRO adriano ADRIANO'

In [None]:
pattern = '[PA][ED]|[pa][ed]'
re.findall(pattern, text)

In [None]:
text = '''
I like penguins
I like lions
I like penguins and lions
'''
pattern = 'penguins|lions'
re.findall(pattern, text)
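`|` has very low precedence, so it splits the entire pattern into alternatives unless it is wrapped in a group. The sketch below uses a non-capturing group `(?:...)`, which is not covered in this notebook and is introduced here only to limit the scope of the alternation.

```python
import re

text = '''
I like penguins
I like lions
I like penguins and lions
'''

# The alternatives are 'I like penguins' OR 'lions'.
ungrouped = re.findall(r'I like penguins|lions', text)
print(ungrouped)

# The group limits the alternation to the animal name.
grouped = re.findall(r'I like (?:penguins|lions)', text)
print(grouped)
```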

## Match any character

In [None]:
text = '''My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not.'''
pattern = r'.'
print(re.findall(pattern, text))

In [None]:
print(''.join(re.findall(pattern, text)))

In [None]:
pattern = r'.|\n'
print(''.join(re.findall(pattern, text)))

## Match everything not in specific set

In [None]:
text = """My boss asked me to turn in my TPS reports. 
I told him they were done, but they are not."""
pattern = r'[^a-m]'
print(re.findall(pattern, text))

In [None]:
print(''.join(re.findall(pattern, text)))

## Match lines beginning or ending with a pattern

In [None]:
text = '''My boss asked me to turn in my TPS reports.
The boss told him they were done, but they are not.'''

In [None]:
pattern = r'^My boss'
re.findall(pattern, text)

In [None]:
pattern = r'^The boss'
re.findall(pattern, text)

In [None]:
pattern = r'^The boss'
re.findall(pattern, text, re.MULTILINE)

In [None]:
pattern = r'are not\.$'  # escape the period so it matches only a literal '.'
re.findall(pattern, text)

In [None]:
pattern = r'reports\.$'  # escape the period so it matches only a literal '.'
re.findall(pattern, text, re.MULTILINE)

## Character classes

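* `\d`: digits (equivalent to `[0-9]` for ASCII text)
* `\w`: word characters: letters, digits, and the underscore `_`
* `\s`: whitespace characters (space, tab, newline, ...)
* `\D`: any character that is not a digit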
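A quick way to see what each class covers is to run them all against one string. The sample below is hypothetical and chosen so that an underscore, a tab, and punctuation all appear.

```python
import re

sample = 'Room 42, floor_3\t(ok)'  # hypothetical sample string

digits = re.findall(r'\d', sample)       # single digits
words = re.findall(r'\w+', sample)       # runs of letters, digits, underscore
spaces = re.findall(r'\s', sample)       # spaces and the tab
non_digits = re.findall(r'\D+', sample)  # runs of anything but digits

print(digits)      # ['4', '2', '3']
print(words)       # ['Room', '42', 'floor_3', 'ok']
print(spaces)      # [' ', ' ', '\t']
print(non_digits)  # ['Room ', ', floor_', '\t(ok)']
```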

In [None]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'\d'
print(re.findall(pattern, text))

In [None]:
# pattern that returns everything except numbers

# Quantifiers 

* `*`: Matches the preceding element (character, set, or group) 0 or more times
* `+`: Matches the preceding element 1 or more times
* `?`: Matches the preceding element 0 or 1 times (optional)
* `{}`: Matches the preceding element the number of times specified inside the braces:
    * `{n}`: Exactly n times
    * `{n,}`: At least n times
    * `{n,m}`: Between n and m times
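The brace quantifiers can be sketched on a hypothetical string of numbers. The pattern uses `\b` (a word boundary, not covered in this notebook) as an assumption, so that a count applies to a whole number rather than to a slice of a longer one.

```python
import re

sample = '7 42 123 2024 98765'  # hypothetical sample string

exactly_four = re.findall(r'\b\d{4}\b', sample)     # exactly 4 digits
at_least_three = re.findall(r'\b\d{3,}\b', sample)  # 3 or more digits
two_or_three = re.findall(r'\b\d{2,3}\b', sample)   # between 2 and 3 digits

print(exactly_four)    # ['2024']
print(at_least_three)  # ['123', '2024', '98765']
print(two_or_three)    # ['42', '123']
```

Without the `\b` anchors, `\d{4}` would also match the first four digits of `98765`.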

## `\d*`: matches a run of digits of length 0 or more

In [None]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'\d*'
print(re.findall(pattern, text))

In [None]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'\d+'
print(re.findall(pattern, text))

In [None]:
text = 'aoijo (  $ p io x -o = 3232 13 ™¡¡™£¡Ωå 3.1 áéóãà'
pattern = r'\d?'
print(re.findall(pattern, text))
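The three outputs above differ mainly in their empty matches: `*` and `?` can match zero characters, so `re.findall` reports an empty string at positions where no digit is found, while `+` requires at least one digit. A small sketch on a hypothetical short string makes this easier to see:

```python
import re

sample = 'a1b22'  # hypothetical short string

star = re.findall(r'\d*', sample)      # runs of digits, plus empty matches
plus = re.findall(r'\d+', sample)      # only non-empty runs of digits
optional = re.findall(r'\d?', sample)  # at most one digit per match

print(star)
print(plus)      # ['1', '22']
print(optional)

# Dropping the empty strings from the `*` result leaves the `+` result.
print([m for m in star if m] == plus)  # True
```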

In [None]:
pattern = r'\d.?\d+'
print(re.findall(pattern, text))

In [None]:
pattern = r'[0-9].?[0-9]+'
print(re.findall(pattern, text))

## Revisiting the `$` example with one of the most useful quantifiers, `*`

In [None]:
text = '''My boss asked me to turn in my TPS reports.
My boss told him they were done, but they are not.'''

In [None]:
# Create a pattern that checks whether the sentence ends with "are not."
re.findall(r'are not\.$', text)

In [None]:
re.findall(r'.are not\.$', text)

In [None]:
re.findall(r'.*are not\.$', text)

In [None]:
re.findall(r'.*\.$', text, re.MULTILINE)

In [None]:
text = '''My boss asked me to turn in my TPS reports. 
My boss told him they were done, but they are not.'''

In [None]:
re.findall(r'.*[. ]$', text, re.MULTILINE)

# Important Regex Concept: Greediness
https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy

In [None]:
text = 'You are yelling! So I will yell too! Let me yell!.'

In [None]:
pattern = r'.*!'
print(re.findall(pattern, text))

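When repeating a regular expression, as in `a*`, **the resulting action is to consume as much of the pattern as possible.** This is why `.*!` above returns everything up to the last `!` as a single match. Adding `?` after the quantifier (`*?`, `+?`, `??`) makes it non-greedy, so it consumes as little as possible: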

In [None]:
pattern = r'.*?!'
print(re.findall(pattern, text))
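The same distinction matters whenever a delimiter repeats within one line. The sketch below follows the tag example from the Python regex HOWTO linked above, comparing a greedy and a non-greedy pattern on the same string.

```python
import re

html = '<html><head><title>Title</title>'

# Greedy: '.*' runs to the end of the string, then backs up to the last '>'.
greedy = re.findall(r'<.*>', html)
print(greedy)  # ['<html><head><title>Title</title>']

# Non-greedy: '.*?' stops at the first '>' it can reach.
lazy = re.findall(r'<.*?>', html)
print(lazy)    # ['<html>', '<head>', '<title>', '</title>']
```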

# Capturing group
https://docs.python.org/3/howto/regex.html#grouping

In [None]:
text = '''
From: author@example.com
User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
MIME-Version: 1.0
To: editor@example.com
'''

In [None]:
pattern = r'(.*):(.*)'
re.findall(pattern, text)

In [None]:
dict_re = {campo.lower() : valor.strip() for campo, valor in re.findall(pattern, text)}

In [None]:
dict_re['user-agent']
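One caveat with `(.*):(.*)`: because the first `.*` is greedy, a value that itself contains a colon is split at the last colon of the line. A possible variant, sketched here with hypothetical header lines, uses a negated set so that the field name stops at the first colon.

```python
import re

headers = '''
Date: 10:30
To: editor@example.com
'''  # hypothetical header lines

# Greedy first group: the split happens at the LAST colon of each line.
greedy_pairs = re.findall(r'(.*):(.*)', headers)
print(greedy_pairs)
# [('Date: 10', '30'), ('To', ' editor@example.com')]

# Negated set: the field name cannot contain ':' or a newline,
# so the split happens at the FIRST colon.
pairs = re.findall(r'([^:\n]+):(.*)', headers)
fields = {name.lower(): value.strip() for name, value in pairs}
print(fields)
# {'date': '10:30', 'to': 'editor@example.com'}
```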