# String Super Skills: **Regex**

`Regex` stands for Regular Expression: a special sequence of characters that describes a pattern used to search and manipulate words, digits, or other characters in text strings.

Let’s import the `re` module, which provides Python's regular expression support, and define a few texts that we will work on in the examples throughout this notebook.

In [1]:
import pandas as pd
import re

In [2]:
# Assigning a text

text = "If it is written in PYTHON, it's probably machine learning"

#### `re.findall()`

The `re` module has a set of functions that allow us to search for, return and replace a string or any part of a string. We start with the `findall()` function, which returns a list containing all non-overlapping matches.

In [3]:
regex = re.findall(r'.*PYTHON',text)
print(regex)

['If it is written in PYTHON']


Understanding: first, the `findall()` function returns a list of matches in the order it finds them. Second, the `r` prefix marks the pattern as a “raw string”, so Python passes backslashes to the regex engine unchanged instead of interpreting them as escape sequences.
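To make the raw-string point concrete, here is a minimal sketch. The variable names `normal` and `raw` are illustrative only and do not appear elsewhere in this notebook.

```python
# In a normal string literal, Python turns "\n" into one newline character.
normal = '\n'

# With the r prefix, the backslash and the letter n stay as two characters,
# which is what the regex engine expects to receive.
raw = r'\n'

print(len(normal))  # 1
print(len(raw))     # 2
```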

The `'.*PYTHON'` part: we want to return everything up to and including the word PYTHON. Here `.` matches any character (letters, numbers, symbols or spaces) except a newline, and `*` repeats the preceding element zero or more times. Together, `.*` acts as a kind of wildcard that consumes as much text as it can before the literal `PYTHON`.
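The "as much as it can" behaviour is easier to see when the target word occurs more than once. The sketch below uses a hypothetical string `sample` that is not part of this notebook's texts.

```python
import re

# Hypothetical sample string, chosen so that "PYTHON" appears twice.
sample = "PYTHON here, PYTHON there"

# `.` matches any character except a newline; `*` repeats it zero or more times.
# The match therefore extends to the LAST "PYTHON" that can be reached.
greedy = re.findall(r'.*PYTHON', sample)
print(greedy)  # ['PYTHON here, PYTHON']
```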

If we put `.*` after the word instead, we receive the other half of the sentence. For example:

rx = re.findall(r'python.*', text, flags=re.IGNORECASE)
print(rx)

Passing `flags=re.IGNORECASE` makes the match case-insensitive, so the lowercase pattern `python` still finds `PYTHON` in the text.
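As a small sketch of what the flag changes, the same lowercase pattern can be run with and without `re.IGNORECASE` on the notebook's `text`. The names `with_flag` and `without_flag` are illustrative.

```python
import re

text = "If it is written in PYTHON, it's probably machine learning"

# With the flag, the lowercase pattern matches the uppercase word.
with_flag = re.findall(r'python.*', text, flags=re.IGNORECASE)

# Without the flag, matching is case-sensitive and nothing is found.
without_flag = re.findall(r'python.*', text)

print(with_flag)     # ["PYTHON, it's probably machine learning"]
print(without_flag)  # []
```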

In [4]:
rx = re.findall('PYTHON.*',text)
print(rx)

["PYTHON, it's probably machine learning"]


From this point on, we can build a series of possibilities.

In [5]:
regex = re.findall('written.*machine', text)
print(regex)

["written in PYTHON, it's probably machine"]


In [6]:
regex = re.findall('tt.*bl', text)
print(regex)

["tten in PYTHON, it's probabl"]


We can check whether a string starts with (symbol `^`) or ends with (symbol `$`) a specific pattern.

- `^` matches at the start of a string (without the `re.MULTILINE` flag it behaves the same as `\A`)
- `\w+` matches one or more consecutive word characters (letters, digits, or the underscore)

If we remove the symbol `+`, we receive only the first character.
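A short sketch of that difference, run on the same `text`. The variable names are illustrative, and the third pattern is an added case showing that `^` only matches at the very start.

```python
import re

text = "If it is written in PYTHON, it's probably machine learning"

first_char = re.findall(r'^\w', text)         # no +: a single word character
first_word = re.findall(r'^\w+', text)        # with +: the whole first word
not_at_start = re.findall(r'^written', text)  # 'written' is not at the start

print(first_char)    # ['I']
print(first_word)    # ['If']
print(not_at_start)  # []
```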

In [7]:
rx = re.findall(r'^\w+', text)
print(rx)

['If']



In [8]:
rx = re.findall('learning$', text)
print(rx)

['learning']


If nothing matches, we receive an empty list.

A quantifier that matches as much text as it can is said to be greedy; `*` and `+` are greedy by default. On its own, the symbol `?` makes the preceding element optional (zero or one occurrence). Placed directly after `*` or `+`, it gives the non-greedy version of that quantifier, which matches as little text as possible.
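The contrast between greedy and non-greedy matching can be sketched on a small tag-like string. The string `sample` is a hypothetical example, not one of the notebook's texts.

```python
import re

# Hypothetical sample with several '<' ... '>' pairs.
sample = "<b>bold</b> and <i>italic</i>"

greedy = re.findall(r'<.*>', sample)  # runs on to the last '>'
lazy = re.findall(r'<.*?>', sample)   # stops at the first '>' each time

print(greedy)  # ['<b>bold</b> and <i>italic</i>']
print(lazy)    # ['<b>', '</b>', '<i>', '</i>']
```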

In [9]:
rx = re.findall(r' .*? ', text)
print(rx)

[' it ', ' written ', ' PYTHON, ', ' probably ']


`Braces`

**Braces `{b,n}`** are used when we want the preceding element to repeat at least **`b`** times and at most **`n`** times. Leaving out `n`, as in `{b,}`, removes the upper limit.
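Here is a minimal sketch of the common brace forms side by side, using a hypothetical string `sample` made of runs of the letter `a`. The exact-count form `{2}` is included for comparison.

```python
import re

# Hypothetical sample: runs of 1, 2, 3 and 4 letters.
sample = "a aa aaa aaaa"

exactly_two = re.findall(r'a{2}', sample)     # exactly 2 per match
two_or_more = re.findall(r'a{2,}', sample)    # at least 2, no upper limit
two_to_three = re.findall(r'a{2,3}', sample)  # at least 2, at most 3

print(exactly_two)   # ['aa', 'aa', 'aa', 'aa']
print(two_or_more)   # ['aa', 'aaa', 'aaaa']
print(two_to_three)  # ['aa', 'aaa', 'aaa']
```

Leftover characters that are shorter than the minimum, such as the single `a` left after `aaa` is taken from `aaaa`, are not returned.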

In [10]:
rx = re.findall(r'(t{1,4}|i{1,})', text)
print(rx)

['i', 't', 'i', 'i', 'tt', 'i', 'i', 't', 'i', 'i']


In the next example, we ask for at least 1 `t` and at most 4 `t`, and `tttt` is returned as a single match.

We also ask for at least 1 `e` and at most 3 `e`. The single `e` in `compre` is a match on its own, but the 4 `e` in a row are too long for one match, so they are split into a group of 3 and a remaining single `e`.

In [11]:
rx = re.findall(r'(t{1,4}|e{1,3})', 'listttt compreheeeension')
print(rx)

['tttt', 'e', 'eee', 'e']


`Square brackets`

The use of **square brackets []** specifies a set of characters that we want to match. For example, `[abc]` has 1 match with `a`, has 3 matches with `cab`, and no match with `hello`.

We can also specify a range of values using a hyphen (`-`) inside square brackets. Thus, `[a-d]` is the same as `[abcd]`, the range `[1-4]` is the same as `[1234]`, and so on.

Following the same reasoning, the range `[a-z]` matches any lowercase letter and `[A-Z]` any uppercase letter. Combining them as `[a-zA-Z]` checks for both at the same time. Let’s try some examples.

In [12]:
# assigning new text

alpha_num = "Hello 1234"

In [13]:
rx = re.findall(r'[a-z]', alpha_num)
print(rx)

['e', 'l', 'l', 'o']


In [14]:
rx = re.findall(r'[a-zA-Z]', 'Hello 1234')
print(rx)

['H', 'e', 'l', 'l', 'o']


In [15]:
rx = re.findall(r'[a-zA-Z0-9]', 'Hello 1234')
print(rx)

['H', 'e', 'l', 'l', 'o', '1', '2', '3', '4']


What if we add the symbol +, what would happen?

In [16]:
rx = re.findall(r'[a-zA-Z0-9]+', 'Hello 1234')
print(rx)

['Hello', '1234']


*Tip:* if the first character inside the set is `^`, the set is negated and every character that is not in the set will be matched.

In [17]:
rx = re.findall(r'[^a-zA-Z ]+', 'Hello 1234')
print(rx)

['1234']


`Special Sequences` 

These are written with a backslash `\` followed by a character that determines the meaning.

- `\w` - As seen earlier, matches any word character: letters, digits, and the underscore
- `\W` - Matches any character that is not a word character
- `\d` - Matches any digit from zero to nine (0–9)
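The three sequences can be compared on one short string. This is a minimal sketch, and `sample` is a hypothetical string chosen to contain an underscore, digits, spaces and punctuation.

```python
import re

# Hypothetical sample mixing letters, an underscore, digits and symbols.
sample = "user_42 paid $7!"

word_chars = re.findall(r'\w', sample)      # letters, digits, underscore
non_word_chars = re.findall(r'\W', sample)  # everything else
digits = re.findall(r'\d', sample)          # digits only

print(word_chars)      # ['u', 's', 'e', 'r', '_', '4', '2', 'p', 'a', 'i', 'd', '7']
print(non_word_chars)  # [' ', ' ', '$', '!']
print(digits)          # ['4', '2', '7']
```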

The star `*` repeats the preceding element zero or more times, while the plus sign `+` repeats it one or more times, so `+` requires at least one occurrence. Let us create another string to take a closer look at what `+` does when combined with `\w`.
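As a quick side-by-side of the two quantifiers, here is a minimal sketch on a short hypothetical string `sample`. The patterns `ab*` and `ab+` are illustrative and are not used elsewhere in this notebook.

```python
import re

# Hypothetical sample: 'a' followed by zero, one and two 'b' characters.
sample = "a ab abb"

star = re.findall(r'ab*', sample)  # 'a' followed by ZERO or more 'b'
plus = re.findall(r'ab+', sample)  # 'a' followed by ONE or more 'b'

print(star)  # ['a', 'ab', 'abb']
print(plus)  # ['ab', 'abb']
```

The lone `a` matches only with `*`, because `+` needs at least one `b` after it.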

In [18]:
# assigning new text

letters_numbers = "The letter A, the character * and the numbers 11, 222 and 3456."

In [19]:
rx = re.findall(r'\w', letters_numbers)
print(rx)

['T', 'h', 'e', 'l', 'e', 't', 't', 'e', 'r', 'A', 't', 'h', 'e', 'c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 'a', 'n', 'd', 't', 'h', 'e', 'n', 'u', 'm', 'b', 'e', 'r', 's', '1', '1', '2', '2', '2', 'a', 'n', 'd', '3', '4', '5', '6']


Instead, if we add the symbol +, what would be the difference?

In [20]:
rx = re.findall(r'\w+', letters_numbers)
print(rx)

['The', 'letter', 'A', 'the', 'character', 'and', 'the', 'numbers', '11', '222', 'and', '3456']


In [21]:
rx = re.findall(r'\W', letters_numbers)
print(rx)

[' ', ' ', ',', ' ', ' ', ' ', '*', ' ', ' ', ' ', ' ', ',', ' ', ' ', ' ', '.']


Only extracting digits:

In [22]:
rx = re.findall(r'\d+', letters_numbers)
print(rx)

['11', '222', '3456']


In [23]:
rx = re.findall(r'\d{3,}', letters_numbers)
print(rx)

['222', '3456']


Now imagine that we want to extract only the uppercase words from the string, meaning runs of two or more consecutive uppercase letters.

In [24]:
upper_extract = "Regex is very NICE for finding and processing text in PYTHON"

In [25]:
rx = re.findall('([A-Z]{2,})', upper_extract)
print(rx)

['NICE', 'PYTHON']


#### `re.split()`

The split method can be handy: it splits the string wherever the pattern matches and returns a list of the pieces between the matches.

In [26]:
# Assigning a new text

numbers = 'The air we breath is made up of 78% nitrogen, 21% oxygen and 1% of other stuff.'

In [27]:
rx = re.split(r'\d+', numbers)
print(rx)

['The air we breath is made up of ', '% nitrogen, ', '% oxygen and ', '% of other stuff.']


If the pattern doesn’t match, the result is a list containing only the original string.
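A minimal sketch of the no-match case, using a hypothetical string `no_digits` that contains nothing for `\d+` to match:

```python
import re

# Hypothetical sample without any digits.
no_digits = "nothing to split here"

parts = re.split(r'\d+', no_digits)

# The result is still a list, with the untouched string as its only element.
print(parts)       # ['nothing to split here']
print(len(parts))  # 1
```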

It is also useful to limit the number of splits that can occur. We can do this by passing the `maxsplit` argument to the `re.split()` method.

In [28]:
rx = re.split(r'\d+', numbers, maxsplit=1)
print(rx)

['The air we breath is made up of ', '% nitrogen, 21% oxygen and 1% of other stuff.']


In the next example, we split at each white-space character, but only at the first five occurrences.

In [29]:
rx = re.split(r'\s', numbers, maxsplit=5)
print(rx)

['The', 'air', 'we', 'breath', 'is', 'made up of 78% nitrogen, 21% oxygen and 1% of other stuff.']


#### `re.sub()`

Sub stands for substitute: this method replaces every match of the pattern with the text you choose.

The syntax is simple: `re.sub(pattern, replacement, string)`.

Other parameters can be added, such as the maximum number of replacements (`count`) and `flags` for options like case-insensitivity.

In [30]:
text = "If it is written in PYTHON, it's probably machine learning"

In [31]:
rx = re.sub(r'written', 'coded', text)
print(rx)

If it is coded in PYTHON, it's probably machine learning


In [32]:
rx = re.sub(r'it', 'THAT', text)
print(rx)

If THAT is wrTHATten in PYTHON, THAT's probably machine learning


In the next example, we want to replace ‘it’ with ‘THAT’, but only at the first occurrence.

In [33]:
rx = re.sub(r'it', 'THAT', text, count=1)
print(rx)

If THAT is written in PYTHON, it's probably machine learning


In the next example, the pattern matches the word ‘PYTHON,’ together with the white-space before and after it, and replaces the whole match with ‘ code, ’. The ignore-case flag is set, so it does not matter that we typed ‘PYthon’ this way.

In [34]:
rx = re.sub(r'\sPYthon,\s', ' code, ', text, flags=re.IGNORECASE)
print(rx)

If it is written in code, it's probably machine learning


#### `re.subn()`

`re.subn()` performs the same replacement as `re.sub()`, but it returns a tuple containing the new string and the number of replacements made.
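Because the result is a two-element tuple, it can be unpacked directly. In this minimal sketch, the names `new_text` and `n_replaced` are illustrative.

```python
import re

text = "If it is written in PYTHON, it's probably machine learning"

# Unpack the (new_string, number_of_replacements) tuple in one step.
new_text, n_replaced = re.subn(r'it', 'THAT', text)

print(new_text)    # If THAT is wrTHATten in PYTHON, THAT's probably machine learning
print(n_replaced)  # 3
```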

In [35]:
rx = re.subn(r'it', 'THAT', text)
print(rx)

("If THAT is wrTHATten in PYTHON, THAT's probably machine learning", 3)


---

Author: Gonçalo Guimarães Gomes