## Regular Expressions

| <p style="font-size: 16px">character</p>      | <p style="font-size: 16px">meaning</p> |
| ----------- | ----------- |
| <p style="font-size: 16px">.</p>      | <p style="font-size: 16px">any character (except newline)</p>   
| <p style="font-size: 16px">\w </p>      | <p style="font-size: 16px">matches any word character (letter, digit, or underscore); for ASCII text == character range [A-Za-z0-9_]</p>       |
| <p style="font-size: 16px">\W </p>      | <p style="font-size: 16px">any non-word character (anything not matched by \w)</p>       |
| <p style="font-size: 16px">\d</p>      | <p style="font-size: 16px">matches any digit from 0 to 9</p>       |
| <p style="font-size: 16px">\D</p>      | <p style="font-size: 16px">any non-digit character</p>       |
| <p style="font-size: 16px">+</p>      | <p style="font-size: 16px">matches at least once, but as many as possible **(1 or more)**</p>       
| <p style="font-size: 16px">**[abc+/]+**</p>      | <p style="font-size: 16px">one or more of any a, b, c, +, / character</p>    
| <p style="font-size: 16px">*</p>      | <p style="font-size: 16px">matches between 0 and as many as possible **(0 or more)**</p>
| <p style="font-size: 16px">a{1,3}</p>      | <p style="font-size: 16px">matches 'a' at least once and at most 3 times (no space inside the braces)</p> |
| <p style="font-size: 16px">[wxy]{5}</p>      | <p style="font-size: 16px">five characters, each of which can be a w, x, or y</p> |
| <p style="font-size: 16px">.{2,6}</p>      | <p style="font-size: 16px">between two and six of **any** character</p> |
| <p style="font-size: 16px">{n}, {n,m}</p>      | <p style="font-size: 16px">matches exactly n times; between n (min) and m (max) times</p> |
| <p style="font-size: 16px">.</p>      | <p style="font-size: 16px">**WILDCARD**: anything except newline</p> |
| <p style="font-size: 16px">[ ]</p>      | <p style="font-size: 16px">**specific characters**: only in range or given choice of characters</p>
| <p style="font-size: 16px">[a-z], [0-6]</p>      | <p style="font-size: 16px">matches any lowercase letter; digits from 0 to 6</p> |
| <p style="font-size: 16px">[^a]</p>      | <p style="font-size: 16px">**exclude**: any character except the ones listed after ^</p> |
| <p style="font-size: 16px">?</p>      | <p style="font-size: 16px">optional: either zero or one of the preceding character or group</p> |
| <p style="font-size: 16px">ab?c</p>      | <p style="font-size: 16px">matches either 'abc' or 'ac'</p>
| <p style="font-size: 16px">\s</p>      | <p style="font-size: 16px">matches **any** whitespace character: space (␣), tab (\t), newline (\n), carriage return (\r)</p> |
| <p style="font-size: 16px">\b</p>      | <p style="font-size: 16px">matches the boundary between a word and a non-word character (e.g. \w+\b)</p> |
| <p style="font-size: 16px">\S</p>      | <p style="font-size: 16px">any non-whitespace character</p> |
| <p style="font-size: 16px">parentheses ( )</p>      | <p style="font-size: 16px">group of characters</p> |
| <p style="font-size: 16px">^(hat), $(dollar)</p>      | <p style="font-size: 16px">matches pattern at the start or the end of the string (of each line with *re.MULTILINE*)</p> |
| <p style="font-size: 16px">\|</p>      | <p style="font-size: 16px">logical "or"</p>
| <p style="font-size: 16px">\\</p>      | <p style="font-size: 16px">escapes metacharacters</p>

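A few rows of the table can be checked directly in Python. This is a minimal sketch; the sample strings are made up for illustration and are not part of the notes above.

```python
import re

sample = "aaa b42 wxyxw"  # made-up sample text

# \d+ : one or more digits
digits = re.findall(r"\d+", sample)
# a{1,3} : greedy, so it takes three 'a' characters first, then the remaining two
up_to_three = re.findall(r"a{1,3}", "aaaaa")
# [wxy]{5} : exactly five characters, each a w, x, or y
five_wxy = re.findall(r"[wxy]{5}", sample)
# ab?c : the 'b' may appear zero or one time
optional_b = re.findall(r"ab?c", "abc ac abbc")

print(digits)       # ['42']
print(up_to_three)  # ['aaa', 'aa']
print(five_wxy)     # ['wxyxw']
print(optional_b)   # ['abc', 'ac']
```

Raw strings (`r"..."`) keep the backslashes intact, so `\d` and `\w` reach the regex engine unchanged.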



#### Key commands

| <p style="font-size: 16px">command</p>      | <p style="font-size: 16px">description</p> |
| ----------- | ----------- |
| <p style="font-size: 16px">re.findall()</p>      | <p style="font-size: 16px">returns a list of matching strings</p>   
| <p style="font-size: 16px">re.search()</p>      | <p style="font-size: 16px">returns a match object for the first match, or None if there is none</p>       |
| <p style="font-size: 16px">re.sub()</p>      | <p style="font-size: 16px">replaces every match of the pattern with a string and returns the new string</p>       |
| <p style="font-size: 16px">re.match()</p>      | <p style="font-size: 16px">matches only at the beginning of the string (use re.fullmatch() to match the entire string)</p> |
| <p style="font-size: 16px">re.compile()</p>      | <p style="font-size: 16px">pre-compiles a pattern into a reusable object, which saves time when the pattern is used repeatedly</p> |
| <p style="font-size: 16px">*re.DOTALL*</p>      | <p style="font-size: 16px">flag that makes . match newlines as well</p> |
| <p style="font-size: 16px">*re.IGNORECASE*</p>      | <p style="font-size: 16px">flag for case-insensitive matching</p> |

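The difference between `re.match()`, `re.search()`, and `re.fullmatch()` is easy to mix up, so here is a small sketch using a compiled pattern. The sample text is an assumption chosen for illustration.

```python
import re

word = re.compile(r"c\w+")  # compile once, reuse the pattern object

text = "thyme coriander cumin"  # made-up sample text

at_start = word.match(text)          # None: the string does not start with 'c'
anywhere = word.search(text)         # first match anywhere in the string
whole = word.fullmatch("coriander")  # succeeds only if the entire string matches
partial = word.fullmatch(text)       # None: there is text around the match

print(at_start)          # None
print(anywhere.group())  # coriander
print(whole.group())     # coriander
print(partial)           # None
```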



In [1]:
import re

### Find all matches

In [6]:
text = '''thyme <a href="coriander99"> <a href="rosemary"> cinnamon pepper tarragon basil salvia cumin'''
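# match a "c" followed by any word characters (here: all words starting with "c");
# the raw string (r"...") keeps the backslash in \w intact
pattern = r"c\w*"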
re.findall(pattern, text)

['coriander99', 'cinnamon', 'cumin']

In [10]:
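# words containing an "r" that is followed by at least one more word character
# (so "pepper", which ends in "r", is not matched)
pattern = r"\w*r\w+"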
re.findall(pattern, text)

['href', 'coriander99', 'href', 'rosemary', 'tarragon']

### Replace patterns

In [13]:
# re.sub returns a new string and leaves the original unchanged,
# so the result has to be assigned to a variable
text2 = re.sub(pattern, "SPICE", text)

In [14]:
text2

'thyme <a SPICE="SPICE"> <a SPICE="SPICE"> cinnamon pepper SPICE basil salvia cumin'
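As a small addition, `re.sub()` also accepts a `count` argument that limits how many matches are replaced. The sketch below uses a shortened sample string (an assumption, not the text above) and also shows that the input string is left untouched.

```python
import re

original = "thyme coriander cumin"  # shortened sample text

# Replace only the first match; the remaining matches stay as they are
first_only = re.sub(r"c\w+", "SPICE", original, count=1)
# Replace every match (the default)
all_matches = re.sub(r"c\w+", "SPICE", original)

print(first_only)   # thyme SPICE cumin
print(all_matches)  # thyme SPICE SPICE
print(original)     # thyme coriander cumin
```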

In [15]:
text = '''thyme coriander rosemary cinnamon
pepper tarragon basil salvia cumin'''

pattern = r"c\w+"

# find all occurrences
re.findall(pattern, text)

['coriander', 'cinnamon', 'cumin']

In [16]:
# re.DOTALL makes "." match newlines too; this pattern contains no ".", so the result is unchanged
re.findall(pattern, text, re.DOTALL)

['coriander', 'cinnamon', 'cumin']
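To see `re.DOTALL` actually change a result, the pattern needs a `.` that has to cross the line break. A minimal sketch, reusing the two-line spice text:

```python
import re

text = "thyme coriander rosemary cinnamon\npepper tarragon basil salvia cumin"

# Without the flag, '.' does not match the newline between the two words
default = re.findall(r"cinnamon.pepper", text)
# With re.DOTALL, '.' also matches '\n'
dotall = re.findall(r"cinnamon.pepper", text, re.DOTALL)

print(default)  # []
print(dotall)   # ['cinnamon\npepper']
```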

In [17]:
# find the first occurrence (re.search returns None if there is no match)
s = re.search(pattern, text)
s.span()

(6, 15)
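Because `re.search()` returns `None` when nothing matches, calling `.span()` on the result can raise an `AttributeError`. One way to guard against that is sketched below; the pattern `z\w+` is only there to force a miss.

```python
import re

text = "thyme coriander rosemary"

found = re.search(r"c\w+", text)
missing = re.search(r"z\w+", text)  # no word in the text contains a 'z'

# Check for None before using the match object
found_span = found.span() if found is not None else None
missing_span = missing.span() if missing is not None else None

print(found.group(), found_span)  # coriander (6, 15)
print(missing_span)               # None
```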

In [18]:
# replace
re.sub(pattern, "SPICE", text)

'thyme SPICE rosemary SPICE\npepper tarragon basil salvia SPICE'

In [19]:
# ignore upper/lower case
re.findall(pattern, text, re.IGNORECASE)

['coriander', 'cinnamon', 'cumin']
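The spice text above is all lowercase, so the flag makes no visible difference there. With mixed-case input (made up for this sketch) the effect shows:

```python
import re

mixed = "Coriander cinnamon CUMIN"  # made-up mixed-case sample
pattern = r"c\w+"

case_sensitive = re.findall(pattern, mixed)
case_insensitive = re.findall(pattern, mixed, re.IGNORECASE)

print(case_sensitive)    # ['cinnamon']
print(case_insensitive)  # ['Coriander', 'cinnamon', 'CUMIN']
```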

### Nested groups


When you are working with complex data, you can easily find yourself having to extract multiple layers of information, which can result in nested groups. Generally, the results of the captured groups are returned in the order in which the groups are defined (ordered by their opening parentheses).

Take the example from the previous lesson of capturing the filenames of all the image files in a list. If each of these image files had a sequential picture number in its filename, you could extract both the filename and the picture number with a single pattern such as `^(IMG(\d+))\.png$`, which uses nested parentheses to capture the digits.

The nested groups are read from left to right in the pattern, with the first capture group being the contents of the first parentheses group, etc.
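In Python, the image-file pattern could be tried out roughly like this. The filenames are hypothetical examples.

```python
import re

pattern = re.compile(r"^(IMG(\d+))\.png$")

m = pattern.search("IMG0042.png")      # hypothetical filename
no_match = pattern.search("IMG0042.jpg")  # wrong extension, so no match

filename = m.group(1)  # outer group: opened first
number = m.group(2)    # inner group: opened second

print(filename, number)  # IMG0042 0042
print(m.groups())        # ('IMG0042', '0042')
print(no_match)          # None
```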

For the following string, write an expression that matches and captures both the full date and the year of the date.

Jan 1987
    
`^(\w{1,3}\s(\d{1,4}))`
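When a pattern has more than one group, `re.findall()` returns one tuple per match, with the groups in order of their opening parentheses. A short sketch; apart from `Jan 1987`, the dates are made-up samples, and `re.MULTILINE` is used so that `^` matches at the start of every line.

```python
import re

dates = "Jan 1987\nMay 1969\nAug 2011"  # only the first date comes from the exercise

pairs = re.findall(r"^(\w{1,3}\s(\d{1,4}))", dates, re.MULTILINE)

for full_date, year in pairs:
    print(full_date, "->", year)
# Jan 1987 -> 1987
# May 1969 -> 1969
# Aug 2011 -> 2011
```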

### Conditionals


As we mentioned before, it's always good to be precise, and that applies to coding, talking, and even regular expressions. For example, you wouldn't write a grocery list telling someone to `Buy more .*`, because you would have no idea what you would get back. Instead you would write `Buy more milk` or `Buy more bread`, and in regular expressions we can define these conditionals explicitly.

Specifically, when using groups, you can use the `|` (logical OR, a.k.a. the pipe) to denote different possible sets of characters. In the above example, I can write the pattern `Buy more (milk|bread|juice)` to match only the strings `Buy more milk`, `Buy more bread`, or `Buy more juice`.

Like normal groups, a condition can contain any sequence of characters or metacharacters. For example, `([cb]ats*|[dh]ogs?)` would match cats or bats, or dogs or hogs (the `s*` and `s?` also allow the singular forms). Writing patterns with many conditions can be hard to read, so you should consider making them separate patterns if they get too complex.
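Both patterns from this section can be tried in Python as follows. This is only a sketch: the list entries and the sentence are invented test inputs, and `fullmatch` is used so that nothing before or after the phrase is accepted.

```python
import re

buy = re.compile(r"Buy more (milk|bread|juice)")
requests = ["Buy more milk", "Buy more bread", "Buy more juice", "Buy more gravel"]

# Keep only the requests that match one of the three alternatives
accepted = [s for s in requests if buy.fullmatch(s)]

# No capturing group here, so findall returns the full matches
animals = re.findall(r"[cb]ats*|[dh]ogs?", "cat bats dog hogs logs")

print(accepted)  # ['Buy more milk', 'Buy more bread', 'Buy more juice']
print(animals)   # ['cat', 'bats', 'dog', 'hogs']
```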

Go ahead and try writing a conditional pattern that matches only the lines with small fuzzy creatures below.

| task | text |
| ----------- | ----------- |
| match | I love cats |
| match | I love dogs |
| skip | I love logs |
| skip | I love cogs |


`\w\s\w{4}\s([c]ats|[d]ogs)`
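The solution can be checked against the four lines of the exercise. A minimal sketch:

```python
import re

lines = ["I love cats", "I love dogs", "I love logs", "I love cogs"]

pattern = re.compile(r"\w\s\w{4}\s([c]ats|[d]ogs)")
# An equivalent, more literal way to write the same condition
simpler = re.compile(r"I love (cats|dogs)")

kept = [line for line in lines if pattern.search(line)]
kept_simpler = [line for line in lines if simpler.search(line)]

print(kept)          # ['I love cats', 'I love dogs']
print(kept_simpler)  # ['I love cats', 'I love dogs']
```

The single-character classes `[c]` and `[d]` behave the same as a plain `c` and `d`, which is why the simpler pattern gives the same result.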

### Back referencing
Many systems allow you to reference your captured groups by using \0 (usually the full matched text), \1 (group 1), \2 (group 2), etc. This is useful, for example, when you are doing a search and replace with regular expressions in a text editor. To swap two numbers, you can search for `(\d+)-(\d+)` and replace it with `\2-\1`, which puts the second captured number first and the first captured number second.
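In Python's `re` module, the same swap can be written with `re.sub()`. Note that Python uses `\g<0>` for the full match in a replacement string rather than `\0`. The input strings below are made up for illustration.

```python
import re

# Swap the two numbers around the dash
swapped = re.sub(r"(\d+)-(\d+)", r"\2-\1", "10-42")

# \g<0> refers to the whole match inside the replacement string
wrapped = re.sub(r"\d+-\d+", r"[\g<0>]", "range 10-42")

# Backreferences also work inside the pattern itself, e.g. to find doubled words
doubled = re.findall(r"\b(\w+) \1\b", "the the quick fox fox")

print(swapped)  # 42-10
print(wrapped)  # range [10-42]
print(doubled)  # ['the', 'fox']
```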