# Regex and re in Python

### The fundamentals of regex and an overview of the re module in Python.


Regular expressions are a widely used toolset for accomplishing many tasks. At their core, they find patterns in character strings (text) so that we can manipulate the text or act on the patterns found. This page goes over the special characters used within regular expressions, as well as the functions in Python's re module and how they may be used.


## Regular Expressions

Regular expressions are special sequences of characters that match a particular pattern in another string of characters. They are a bit like a language in and of themselves, with their own syntax and behavior. That language is not Turing complete, of course: a regular expression only describes patterns, so it cannot, for example, increment or decrement a value on its own.
A regular expression in Python will take the form of:

    r'(pattern to match)'


### Raw String Literals
The 'r' prefix avoids ambiguity in how backslashes ('\') are interpreted. A pattern is processed twice: first by Python's own parser, which turns the string literal into a string, and then by the regular expression parser. R or r as a prefix means 'raw string literal', which makes Python's parser leave the backslashes in the literal untouched, so they reach the regular expression parser exactly as written. For example:

    r'\n' == '\\n'
Both sides are a two-character string, a '\' followed by an 'n', rather than the one-character string '\n' (newline). The regular expression parser still reads that two-character sequence as 'match a newline', so the patterns r'\n' and '\n' match the same text.

The raw prefix does not help with quotation marks, however. A literal that contains its own delimiter still has to escape it or use a different delimiter:

    '[\"\'](.*)[\"\']' == r'''["'](.*)["']'''

Without the escapes or the triple quotes, the string would be cut off at this part:

    r'["'
And the rest of it would cause an error.

So backslashes could be doubled instead of using 'r', but it is standard practice, and easier to read, to just use 'r' for patterns.
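
As a quick check of the points above, here is a minimal sketch. The variable names are illustrative only, and the triple-quoted raw string is just one convenient way to write a pattern that contains both kinds of quotation mark.

```python
import re

# The raw prefix only changes how backslashes in the literal are read.
newline = '\n'        # one character: a newline
raw_newline = r'\n'   # two characters: a backslash and an 'n'
print(len(newline), len(raw_newline))   # 1 2
print(raw_newline == '\\n')             # True

# A triple-quoted raw string can hold both quote characters
# without escaping either one.
quoted = re.compile(r'''["'](.*)["']''')
single = quoted.match("'hi'").group(1)
double = quoted.match('"hi"').group(1)
print(single, double)                   # hi hi
```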



In [1]:
import re

r = re.match(r'\n', '\n').group()
print('1: ',r, '(newline)', '\n')

r = re.match('\n', '\n').group()
print('2: ', r, '(newline)', '\n')

r = re.match(r'\\n', '\\n').group()
print('3: ',r, '\n')

r = re.match('\\n', '\\n')
print('4: ',r, '\n')

r = re.match('\\n', '\n').group()
print('5: ',r, '(newline)', '\n')

r = re.match('[\"\'](.*)[\"\']', "\'hi\'" ).group()
print('6: ',r)

1:  
 (newline) 

2:  
 (newline) 

3:  \n 

4:  None 

5:  
 (newline) 

6:  'hi'


### Special Characters

Here are a bunch of special characters used by the engine to represent certain matching behaviors or patterns. They are essentially the syntax of this 'language'.

* ```.``` - Period. This will match any single character, except newline.
* ```\w``` - Word character. This will match any single letter, digit, or underscore.
* ```\W``` - Complement set of word character. Matches anything \w won't.
* ```\s``` - Whitespace character. Matches single newline, tab, space, carriage return (\n, \t, ' ', \r).
  * You can also choose to match a specific one with just \n or \t, or whatever you want.
* ```\S``` - Any non-whitespace character.
* ```\d``` - Any single number (digit).
* ```\D``` - Any non-number.
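* ```^``` - Caret. Matches only at the start of the string (the same as \A). With the parameter flags=re.MULTILINE, accepted by most re functions, it also matches right after every \n in the string.
* ```$``` - Dollar sign. Matches at the end of the string, or just before a newline that ends the string. With re.MULTILINE it also matches just before every \n.
* ```\Z``` - End of string only. Unlike ```$```, it does not match before a trailing newline in Python.
* ```\z``` - In Perl-style engines this means 'end of string only', while their \Z also allows a trailing newline. Python's \Z already means 'end of string only', and \z is accepted only by recent Python releases (3.14 and later) as an alias for it.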

In [2]:
r = re.findall('^[\"\'](.*)[\"\']', "\'hi\'\nWon't match\'hello\'\n\'como?\'", flags=re.MULTILINE)
print(r)
r = re.findall('^[\"\'](.*)[\"\']', "\'hi\'\nWon't match\'hello\'\n\'como?\'")
print(r)
r = re.findall('^[\"\'](.*)[\"\']$', "\'hi\'\nWon't match\'hello\'\n\'como?\'", flags=re.MULTILINE)
print(r)

['hi', 'como?']
['hi']
['hi', 'como?']
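
The shorthand classes from the list above can be tried the same way. This is a small sketch on a made-up sample string, not an exhaustive test.

```python
import re

text = "Room 42\tis_open\n"

digits = re.findall(r'\d', text)        # each digit on its own
words = re.findall(r'\w+', text)        # runs of letters, digits, underscores
spaces = re.findall(r'\s', text)        # the space, the tab and the newline
non_spaces = re.findall(r'\S+', text)   # runs of anything but whitespace

print(digits)      # ['4', '2']
print(words)       # ['Room', '42', 'is_open']
print(spaces)      # [' ', '\t', '\n']
print(non_spaces)  # ['Room', '42', 'is_open']
```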


* ```[abc]``` - Character class (square brackets). Match a or b or c.

In [3]:
r = re.findall(r'[abc]', 'abc\nb')
print(r)

['a', 'b', 'c', 'b']


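* ```\b``` - Word boundary. Matches the empty position between a word character and a non-word character (or the start or end of the string). Write it in a raw string, because '\b' in a normal string literal is a backspace.
* ```\B``` - Complement. Matches a position that is not a word boundary, so the match must be within a word.
* ```[a-zA-Z0-9]``` - Match any ASCII letter or digit.
* ```[^a-zA-Z0-9]``` - Match anything else. (Complement set)
* ```()``` - Group. Works pretty much like parentheses in math equations: an operator placed after a group applies to the whole group (shown right below). A group also captures the text it matched.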
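
Word boundaries and groups are easier to see on a concrete string. The sketch below uses made-up sample text. The second half shows how the text captured by each group can be read back from the match object.

```python
import re

text = "cat concat cats catalog"

# \b on both sides: only the standalone word qualifies.
whole_word = re.findall(r'\bcat\b', text)
# \B in front: 'cat' must be preceded by another word character.
inside_word = re.findall(r'\Bcat', text)

print(whole_word)    # ['cat']  (the first word)
print(inside_word)   # ['cat']  (the tail of 'concat')

# Each pair of parentheses captures its part of the match.
m = re.search(r'(\d+)-(\d+)', 'pages 10-25')
print(m.group(0), m.group(1), m.group(2))   # 10-25 10 25
```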

#### Repetition Operators
* ```?``` - 0 or 1 of the preceding value.
  * r'or?' will match an o followed by 0 or 1 r's.
  * r'(or)?' will match 0 or 1 instances of 'or'.
* ```+``` - 1 or more instances of the preceding pattern.
* ```*``` - 0 or more instances of the preceding pattern.
* ```{n,}``` - Allows you to specify how many instances of the preceding pattern to match: n or more in this case. You may set an upper limit with {n,m}, where m >= n, or require exactly n matches with {n}.
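
A short sketch of the operators above, run against one made-up string so the results can be compared side by side:

```python
import re

text = "o or orr orrr"

optional = re.findall(r'or?', text)          # 'o' then zero or one 'r'
one_or_more = re.findall(r'or+', text)       # 'o' then at least one 'r'
zero_or_more = re.findall(r'or*', text)      # 'o' then any number of 'r'
at_least_two = re.findall(r'or{2,}', text)   # 'o' then two or more 'r'
exactly_two = re.findall(r'or{2}', text)     # 'o' then exactly two 'r'

print(optional)       # ['o', 'or', 'or', 'or']
print(one_or_more)    # ['or', 'orr', 'orrr']
print(zero_or_more)   # ['o', 'or', 'orr', 'orrr']
print(at_least_two)   # ['orr', 'orrr']
print(exactly_two)    # ['orr', 'orr']

# With a group, '?' applies to the whole 'or', not just the 'r'.
grouped = [bool(re.fullmatch(r'(or)?x', s)) for s in ('x', 'orx', 'ororx')]
print(grouped)        # [True, True, False]
```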
##### Greedy vs Non-greedy Matching
* ```<.*>``` - Greedy repetition: matches all of ```< python > perl >```.
* ```<.*?>``` - Non-greedy: matches only ```< python >``` in ```< python > perl >```.
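
The difference can be checked directly with re.search on the same sample string:

```python
import re

text = '< python > perl >'

greedy = re.search(r'<.*>', text).group()    # runs to the last '>'
lazy = re.search(r'<.*?>', text).group()     # stops at the first '>'

print(greedy)   # < python > perl >
print(lazy)     # < python >
```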

#### Backreferences
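* ```r'''(['"])(.*?)\1'''``` - Here is a more reliable way to handle grabbing quotes (instead of the previous example). \1 matches whatever was matched by the 1st matching group, in this case ensuring that the closing quotation mark is the same as the opening one. Use whatever number corresponds to the desired group. Note that a backreference cannot be used inside a character class, so something like [^\1] does not mean 'anything but group 1'.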
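
Here is a small sketch of a backreference keeping opening and closing quotes in step. The pattern is written as a triple-quoted raw string so that it can contain both quote characters. The sample sentence is made up.

```python
import re

# Group 1 is the opening quote, \1 demands the same character to close.
quoted = re.compile(r'''(['"])(.*?)\1''')

text = '''He said "it's fine" and 'ok'.'''
contents = [m.group(2) for m in quoted.finditer(text)]
print(contents)     # ["it's fine", 'ok']

# Mismatched quotes do not pair up.
mismatch = quoted.search('"oops\'')
print(mismatch)     # None
```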

#### Alternatives
* ```r'Python(!+|\?)'``` - Matches 'Python' followed by one or more '!', or one '?'. The '|' is OR.

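This is not the same as ```r'Python[(!+)(\?)]'```. Inside a character class, parentheses and '+' lose their special meaning, so that pattern matches 'Python' followed by exactly one of the characters '(', ')', '!', '+' or '?'.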
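
To see the alternation at work, the sketch below runs the pattern against a few made-up inputs. fullmatch is used so that trailing characters count as a failure.

```python
import re

pattern = re.compile(r'Python(!+|\?)')

samples = ['Python!', 'Python!!!', 'Python?', 'Python??', 'Python']
results = {s: bool(pattern.fullmatch(s)) for s in samples}
for sample, ok in results.items():
    print(sample, '->', ok)
# Python! -> True
# Python!!! -> True
# Python? -> True
# Python?? -> False
# Python -> False

# The group holds whichever alternative matched.
tail = pattern.search('I love Python!!!').group(1)
print(tail)   # !!!
```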

#### Lookarounds
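Positive and negative lookaheads and lookbehinds let you require that another pattern does (or does not) appear right after or right before the current position, without that pattern being included in the returned match. They act as conditions, like the start or end of string anchors.
For example, ```q(?!u)``` will match a q only if it is not followed by a u.

* ```(?=pattern)``` - Positive lookahead
* ```(?!pattern)``` - Negative lookahead
* ```(?<=pattern)``` - Positive lookbehind
* ```(?<!pattern)``` - Negative lookbehind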
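
A minimal sketch of lookaheads and lookbehinds on made-up strings follows. The lookbehind forms used here are ```(?<=...)``` and ```(?<!...)``` from the re syntax. In each case the text checked by the lookaround is left out of the result.

```python
import re

# Negative lookahead: a 'q' that is not followed by 'u'.
q_positions = [m.start() for m in re.finditer(r'q(?!u)', 'Iraq quit qat')]
print(q_positions)   # [3, 10]

# Positive lookahead: words followed by a colon (the colon is not returned).
keys = re.findall(r'\w+(?=:)', 'name: Ann, age: 30')
print(keys)          # ['name', 'age']

# Positive lookbehind: digits preceded by a dollar sign.
prices = re.findall(r'(?<=\$)\d+', 'cost $40, qty 3')
print(prices)        # ['40']

# Negative lookbehind: whole numbers not preceded by a dollar sign.
others = re.findall(r'(?<!\$)\b\d+', 'cost $40, qty 3')
print(others)        # ['3']
```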


A more comprehensive list with examples can be found at https://www.tutorialspoint.com/python/python_reg_expressions.htm

## Python re Library Functions

* ```re.compile(pattern)``` - Compiles an expression to be used with search, match, or whatever else.

```
prog = re.compile(pattern)
result = prog.match(string)
```

is equivalent to

```
result = re.match(pattern, string)
```


* ```re.search() vs re.match()```
These functions look through a given string for a specified pattern and return a match object if found, or None otherwise. re.match only matches at the beginning of the string, even in multiline mode, while re.search scans the string for the first location where the pattern matches.

In [4]:
r = re.match(r'[abc]', 'ddd\nba')
print(r)
r = re.search(r'[abc]', 'ddd\nba').group()
print(r)

None
b


* ```re.fullmatch()``` - The entire string must match the pattern for a match object to be returned.

* ```re.split()``` - Splits the string at occurrences of the specified pattern and returns a list. You can opt to keep the matched separators in the returned list by putting capturing parentheses around the expression. Fuller explanation at https://docs.python.org/3/library/re.html

* ```re.findall()``` - Like search, but returns all non-overlapping matches as a list of strings (not match objects, so there is no .group() to call). If the pattern contains one group, it returns a list of that group's matches; with several groups, it returns a list of tuples.

* ```re.finditer()``` - Like findall, but returns an iterator that yields match objects one at a time. This way, the matches aren't all kept in memory when they aren't needed.

* ```re.sub(pattern, repl, string, count=0, flags=0)``` - Locate and replace instances of pattern in string with replacement string. Check the previous link for a fuller explanation.

* ```re.subn()``` - Same as sub(), but also returns the number of subs made along with the new string in a tuple.

* ```re.escape(pattern)``` - Puts backslashes before instances of special characters in pattern.

* ```re.purge()``` - Clear regex cache.
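
The functions listed above can be tried on a few made-up strings. This is a sketch for orientation rather than a full tour of every parameter.

```python
import re

# fullmatch: the whole string has to fit the pattern.
full_ok = re.fullmatch(r'\d+', '123') is not None
full_bad = re.fullmatch(r'\d+', '123a') is not None
print(full_ok, full_bad)    # True False

# split: without a group the separators are dropped, with one they are kept.
parts = re.split(r'[,;\s]+', 'one, two;three  four')
parts_kept = re.split(r'([,;])', 'a,b;c')
print(parts)        # ['one', 'two', 'three', 'four']
print(parts_kept)   # ['a', ',', 'b', ';', 'c']

# findall: strings with no group, tuples with several groups.
numbers = re.findall(r'\d+', 'a1 b22 c333')
pairs = re.findall(r'([a-z])(\d+)', 'a1 b22 c333')
print(numbers)      # ['1', '22', '333']
print(pairs)        # [('a', '1'), ('b', '22'), ('c', '333')]

# finditer: match objects, produced one at a time.
spans = [m.span() for m in re.finditer(r'\d+', 'a1 b22 c333')]
print(spans)        # [(1, 2), (4, 6), (8, 11)]

# sub and subn: replace matches; subn also reports how many were replaced.
cleaned = re.sub(r'\s+', ' ', 'too   many    spaces')
cleaned_n = re.subn(r'\s+', ' ', 'too   many    spaces')
print(cleaned)      # too many spaces
print(cleaned_n)    # ('too many spaces', 2)

# escape: makes a literal string safe to use as a pattern.
literal = '1+1=2'
unescaped_ok = re.fullmatch(literal, literal) is not None
escaped_ok = re.fullmatch(re.escape(literal), literal) is not None
print(unescaped_ok, escaped_ok)   # False True
```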

If you use ```prog = re.compile(pattern)```, you can call the same functions as methods of the compiled object, for example ```prog.findall(string)```. For certain methods, like findall or match, this allows the optional arguments ```(string[, pos[, endpos]])```, which limit the part of the string that is searched. You can also read attributes of the compiled object, like flags, groups, pattern, or groupindex.
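
A short sketch of a compiled pattern in use. The pattern, the group name ```word``` and the sample text are made up for illustration.

```python
import re

prog = re.compile(r'(?P<word>[a-z]+)', flags=re.IGNORECASE)
text = 'alpha beta gamma'

# pos and endpos restrict the region of the string that is examined.
from_six = prog.findall(text, 6)          # start looking at index 6
six_to_ten = prog.findall(text, 6, 10)    # stop before index 10
anchored = prog.match(text, 6).group()    # match() anchors at pos

print(from_six)     # ['beta', 'gamma']
print(six_to_ten)   # ['beta']
print(anchored)     # beta

# Attributes of the compiled pattern object.
print(prog.pattern)                          # (?P<word>[a-z]+)
print(prog.groups)                           # 1
print(dict(prog.groupindex))                 # {'word': 1}
print(bool(prog.flags & re.IGNORECASE))      # True
```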
