# Regex


Regex, or regular expressions, are used to define search patterns for text. This makes them extremely useful for cleaning data, as we can concisely express which part of the text we want. We'll also use them when scraping websites to specify which links to follow. They have many other use cases, such as password or email validation, and are fairly programming-language agnostic: nearly every language has an implementation of them. Some useful command-line tools, like grep and sed, support them too.

In [1]:
import re

The re module has a lot of functions; the most useful ones are briefly outlined below.

* .search - searches the whole string and returns the first match
* .match - matches only at the start of the string
* .findall - finds all matches and returns a list of strings
* .finditer - finds all matches but returns an iterator of match objects
* .sub - substitutes text that matches a pattern
* .split - splits a string using a regex
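The differences between these functions are easiest to see side by side. The snippet below is a small sketch rather than part of the original notebook; the sample string `text` and the variable names are made up for illustration.

```python
import re

text = "one, two, three"  # hypothetical sample string

# .search scans the whole string, .match only tries at position 0
found_anywhere = re.search(r"two", text)  # a match object
found_at_start = re.match(r"two", text)   # None, "two" is not at the start

# .sub replaces every match, .split cuts the string wherever the pattern matches
replaced = re.sub(r"two", "2", text)
parts = re.split(r", ", text)

print(found_anywhere.group())  # two
print(found_at_start)          # None
print(replaced)                # one, 2, three
print(parts)                   # ['one', 'two', 'three']
```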

In [2]:
pattern = r"cat" 
s = "the cat sat on the mat"

In [3]:
m = re.search(pattern,s)
m

<_sre.SRE_Match object; span=(4, 7), match='cat'>

`.search` returns a match object (or `None` if nothing matches), which we can use to obtain our match or its position.

In [127]:
m.group()

'cat'

In [128]:
m.span()

(4, 7)
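One thing to watch out for: when the pattern is not found, `re.search` returns `None`, so calling `.group()` on the result raises an `AttributeError`. A minimal sketch of checking the result first, reusing the sentence from above (the names `no_match` and `found` are just for illustration):

```python
import re

s = "the cat sat on the mat"

no_match = re.search(r"dog", s)  # "dog" is not in the string
print(no_match)                  # None

# Check the result before using it
found = re.search(r"mat", s)
if found:
    print(found.group(), found.start(), found.end())  # mat 19 22
```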

# Character set

Let's say I wanted to match all of the words that end with 'at' in the sentence 'the cat sat on the mat'. I could use a character set `[]` containing the letters 'c', 's', and 'm'.

In [4]:
re.findall(r"[csm]at",s)

['cat', 'sat', 'mat']

But maybe we want the position of the matches instead.

In [130]:
[ m.span() for m in re.finditer(r"[csm]at",s) ]

[(4, 7), (8, 11), (19, 22)]

# Exclude sets 

When we use the caret `^` at the start of a character set `[]`, we'll match everything excluding those characters. Note that the caret only has this meaning inside a character set `[]`; its meaning changes when it is used outside one.

In [132]:
re.findall(r"[^c]at",s) 

['sat', 'mat']
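Two details are easy to miss here, so here is a small sketch (the extra sample string "cat at bat" is made up for illustration). An exclude set still has to consume one character, so it can match a space, and outside of `[]` the caret anchors the pattern to the start of the string instead.

```python
import re

s = "the cat sat on the mat"

# [^c] matches any single character that is not "c", including a space
inside = re.findall(r"[^c]at", "cat at bat")
print(inside)      # [' at', 'bat']

# Outside a character set, ^ means "at the start of the string"
outside = re.findall(r"^the", s)
unanchored = re.findall(r"the", s)
print(outside)     # ['the']
print(unanchored)  # ['the', 'the']
```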

# Ranges

Similar to the caret, the `-` has a special meaning when used inside a character set: it allows us to specify a range of letters or numbers.

We can use ranges to specify all of the digits from 5 to 9 or all letters from a to d.

In [138]:
import string

In [139]:
string.ascii_letters

'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [140]:
alphabet = string.ascii_letters

In [141]:
re.search("[a-z]+",alphabet) #only matches lower case a-z

<_sre.SRE_Match object; span=(0, 26), match='abcdefghijklmnopqrstuvwxyz'>

In [22]:
re.search("[e-z]+",alphabet)

<_sre.SRE_Match object; span=(4, 26), match='efghijklmnopqrstuvwxyz'>

In [24]:
re.search("[f-zA-C]+",alphabet) #match lower case f to z and upper case A to C

<_sre.SRE_Match object; span=(5, 29), match='fghijklmnopqrstuvwxyzABC'>
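The digit and letter ranges mentioned at the start of this section work the same way. A quick sketch, using a made-up sample string:

```python
import re

text = "abcdef 0123456789"  # hypothetical sample string

digits_5_to_9 = re.findall(r"[5-9]", text)
letters_a_to_d = re.findall(r"[a-d]", text)

# Several ranges can share one character set
combined = re.findall(r"[a-d5-9]+", text)

print(digits_5_to_9)   # ['5', '6', '7', '8', '9']
print(letters_a_to_d)  # ['a', 'b', 'c', 'd']
print(combined)        # ['abcd', '56789']
```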

# Either
 
The pipe `|` matches either the pattern on its left or the pattern on its right (whole words or single characters).

In [133]:
s1 = "The rainbow has many colors."
s2 = "The rainbow has many colours."

In [134]:
re.findall('color|colour',s1)

['color']

In [135]:
re.findall('color|colour',s2)

['colour']
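The order of the alternatives can matter, because Python's `re` tries them from left to right and stops at the first one that works. A short sketch with made-up strings:

```python
import re

both = re.findall(r"color|colour", "colors or colours")
print(both)          # ['color', 'colour']

# The leftmost alternative that matches wins, even if a longer one would fit
first_wins = re.findall(r"cat|cats", "cats")
longer_first = re.findall(r"cats|cat", "cats")
print(first_wins)    # ['cat']
print(longer_first)  # ['cats']
```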

# Quantifiers


Quantifiers are used to specify how many times we want to match something.

* `*` - zero or more times
* `+` - one or more times
* `?` - zero or one time
* `{n}` - match exactly n times
* `{n,}` - match at least n times
* `{n,m}` - match between n and m times

In [136]:
s = "The rainbow has many colors but not the colour silver"
re.findall('colou?r',s)

['color', 'colour']
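The other quantifiers can be tried out the same way. The sketch below uses a made-up string of digit runs; note that `{3}` happily matches the first three digits of a longer run, since a quantifier does not check what comes after the match.

```python
import re

text = "1 22 333 4444"  # hypothetical sample string

one_or_more = re.findall(r"[0-9]+", text)
exactly_3 = re.findall(r"[0-9]{3}", text)
at_least_2 = re.findall(r"[0-9]{2,}", text)
two_to_three = re.findall(r"[0-9]{2,3}", text)

# * allows zero repetitions, so a lone "a" still matches
zero_or_more = re.findall(r"ab*", "a ab abbb")

print(one_or_more)   # ['1', '22', '333', '4444']
print(exactly_3)     # ['333', '444']
print(at_least_2)    # ['22', '333', '4444']
print(two_to_three)  # ['22', '333', '444']
print(zero_or_more)  # ['a', 'ab', 'abbb']
```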

# Meta Chars

Here are some more symbols with special meanings...

* `\d` - matches any digit, same as [0-9]
* `\w` - matches any word char (a-z, A-Z, 0-9 and `_`; for Python 3 strings, also other Unicode letters and digits)
* `\s` - matches white space (spaces, tabs...)
* `\t` - matches a tab only
* `\D` - matches anything but a digit. The same is true of the other expressions, i.e. `\W` is anything but a word char and `\S` is anything but white space.
* `.`  - matches any character except a newline


In [31]:
s = "Number of bookmarks: 99 "
re.search(r'\d+', s)

<_sre.SRE_Match object; span=(21, 23), match='99'>

In [32]:
re.search(r'\w+\s\w+',s)

<_sre.SRE_Match object; span=(0, 9), match='Number of'>

In [33]:
re.search(r'\D+',s)

<_sre.SRE_Match object; span=(0, 21), match='Number of bookmarks: '>
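The negated versions and the dot can be checked on the same string. This is just a sketch; the second sample string "a b\n" is made up to show that `.` skips the newline by default.

```python
import re

s = "Number of bookmarks: 99 "

non_word = re.findall(r"\W", s)  # everything that is not a word char
chunks = re.findall(r"\S+", s)   # runs of non-whitespace
dots = re.findall(r".", "a b\n") # . does not match the newline

print(non_word)  # [' ', ' ', ':', ' ', ' ']
print(chunks)    # ['Number', 'of', 'bookmarks:', '99']
print(dots)      # ['a', ' ', 'b']
```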

# Anchors

Anchors allow us to specify where in the text we want the match to be.

* `^` - starts with
* `$` - ends with
* `\b` - word boundary

In [121]:
s = "4252345"
re.search(r'^\d+$',s) #only match strings that are all digits

<_sre.SRE_Match object; span=(0, 7), match='4252345'>

In [111]:
s = "Hello?"
re.search(r'^\w+\?$',s)

<_sre.SRE_Match object; span=(0, 6), match='Hello?'>

In [115]:
s = "This island is beautiful."
re.search('is',s) #wrong is :/ (this is the 'is' inside 'This')

<_sre.SRE_Match object; span=(2, 4), match='is'>

In [119]:
re.search(r'\bis\b',s) #the right is :)

<_sre.SRE_Match object; span=(12, 14), match='is'>
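Anchors are what make a pattern usable for validation. The sketch below filters a made-up list of candidate strings; it also shows one subtlety, which is that `$` also matches just before a trailing newline. If that matters, `re.fullmatch` (which needs no anchors) is one stricter alternative.

```python
import re

candidates = ["4252345", "42a2345", "123\n"]  # hypothetical inputs

# $ also matches just before a newline at the very end of the string
with_anchors = [c for c in candidates if re.search(r"^\d+$", c)]

# fullmatch requires the whole string to match the pattern
with_fullmatch = [c for c in candidates if re.fullmatch(r"\d+", c)]

print(with_anchors)    # ['4252345', '123\n']
print(with_fullmatch)  # ['4252345']

# \b keeps "is" from matching inside "This" and "island"
s = "This island is beautiful."
anywhere = re.findall(r"is", s)
whole_word = re.findall(r"\bis\b", s)
print(len(anywhere), whole_word)  # 3 ['is']
```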

# Groups

Anything contained within `()` is a group. Groups allow us to easily break up our pattern into separate parts.

In [84]:
s = "a great string"
m = re.search(r'(\w+)\s(\w+)',s)
m

<_sre.SRE_Match object; span=(0, 7), match='a great'>

In [85]:
# same as m.group(0), the whole match
m.group()

'a great'

In [86]:
m.group(1)

'a'

In [87]:
m.group(2)

'great'
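Groups become really handy when pulling structured pieces out of text. A small sketch, with a made-up date string and made-up key=value pairs: `.groups()` returns all groups at once, and `re.findall` returns a tuple per match when the pattern has more than one group.

```python
import re

# Hypothetical example: split a date into its parts
m = re.search(r"(\d+)-(\d+)-(\d+)", "released on 2019-03-25")
year, month, day = m.groups()
print(year, month, day)  # 2019 03 25

# With several groups, findall returns a list of tuples
pairs = re.findall(r"(\w+)=(\d+)", "width=10 height=20")
print(pairs)             # [('width', '10'), ('height', '20')]
```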

There's still more to regexes, but these concepts should get you pretty far. A good way to practice is through games; one of my favourites is this [crossword game](https://regexcrossword.com/).

# References

* [Net Ninja Regex Video Tutorials](https://www.youtube.com/watch?v=r6I-Ahc0HB4&list=PL4cUxeGkcC9g6m_6Sld9Q4jzqdqHd2HiD)
* [Coding Train Regex Video Tutorials](https://www.youtube.com/watch?v=7DG3kCDx53c&list=PLRqwX-V7Uu6YEypLuls7iidwHMdCM6o2w)
* [Regular Expression Info](https://www.regular-expressions.info/)
* [Regex cheatsheet](https://www.debuggex.com/cheatsheet/regex/python)
* [Regex Crossword Game](https://regexcrossword.com/)