# An intermediate user's guide to RegEx

In [2]:
import re

### Prerequisites
- Understand basic Python syntax (strings, lists, loops)
- Understand what an object is, what a function is
- Understand the very basics of regular expressions, in that they are used for string processing tasks and some simple metacharacters like `*`, `?`, `\d`

### Let's begin
My main motivation for further researching the `re` module (Python's regex module), and writing about it, was that I have often relied on RegEx in my code, but outside of very simple expressions, I was never comfortable using it in enterprise-level code (code that is shared by my team and deployed for our customers). I have relied on Stack Overflow for very simple regular expressions, but nothing beyond that, because I didn't fully understand the more complex ones.

And despite how well written some of the Stack Overflow posts are, there were many doubts in my mind. For example, I vividly remember seeing `(?:...)` in some answers, but not appreciating why the author put `?:` inside the parentheses. I tried doing an A/B test (with and without the `?:`), and I couldn't decipher the effect this syntax has. Now that I have read much more about RegEx and tried it out, I know this construct is called a 'non-capturing group'. I will unpack this and other features of `re` in this article. I think that if you understand them, you will be an intermediate user of RegEx.

To give you another spoiler (which will create more context for this article), the core building block on which intermediate usage of regex is built, is Grouping. And that's the core learning I'm trying to share in this article. Many powerful regex features, mentioned here and otherwise, build on top of Grouping, so it will serve you well to understand Grouping well.

##### Side note:
The reason I think these are intermediate features is that you can get by without knowing them and still implement simple regexes in real code.

At the same time, I don't think these are Advanced either. Those would include Sub-routines and really exploring look-arounds, as well as recursion (Yeah, I know. It's crazy that you can do recursion in Regex)

### Housekeeping items
These may be insightful for you too, and important building blocks for an intermediate Python RegEx user, so read them carefully.

#### 1. Raw Strings
A Python raw string is created by prefixing a string literal with 'r' or 'R', like so: `r"Hello world"`

As explained in https://blog.devgenius.io/beauty-of-raw-strings-in-python-fa627d674cbf:
Python raw string treats backslash(`\`) as a literal character.
It is useful when we want to have a string that contains backslash(`\`) and don’t want it to be treated as an escape character.

__Why do we need to use raw strings in regex?__<br>
Because the RegEx engine interprets backslashes in a special way, like when you put `\d` in a regex pattern. But Python also interprets backslashes in ordinary string literals in its own, non-literal, way, which creates a conflict.

So this is why people use raw strings to create regular expressions in Python (read some of the Stack Overflow answers about RegEx if you don't believe me). It's not strictly necessary, but makes your life much easier. To read more about this, check out the Appendix section in this article, where I explain the benefits with some examples.
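To make the conflict concrete, here is a minimal sketch using `\b`, which Python reads as a backspace character but the regex engine reads as a word boundary. The sample text is made up for illustration.

```python
import re

text = "a cat sat"

# Normal string: Python turns "\b" into a backspace character before
# the regex engine ever sees the pattern, so nothing matches
without_raw = re.findall("\bcat\b", text)

# Raw string: the regex engine receives \b and reads it as a word boundary
with_raw = re.findall(r"\bcat\b", text)

print(without_raw)  # []
print(with_raw)     # ['cat']
```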

#### 2. Module level functions vs. Object level functions

Firstly, there are 4 main functions to match a regex in a string:
- `match()` - Returns the first match as a `re.Match` object, but only if it is found at the start of the string
- `search()` - Returns the first match as a `re.Match` object, found anywhere in the string
- `findall()` - Returns all matching substrings as a list of strings
- `finditer()` - Returns an iterator of `re.Match` objects, one for each match in the string

The first way you can use these functions is via a __Pattern object__. A Pattern object is what you get when you compile a regex string (`re.compile`). The above functions are then available as methods of the Pattern object, and can be invoked in the following way:

In [31]:
# Compile the regex once, then call the matching methods on the Pattern object
regex_string = r'(?:1st|2nd|3rd|\dth) article'
p = re.compile(regex_string)

print(f"p is a {type(p)} object\n")

test_string = "My 3rd article and 2nd article"

print(f"p.match(test_string) = {p.match(test_string)}\n")

print(f"p.search(test_string) = {p.search(test_string)}\n")

print(f"p.findall(test_string) = {p.findall(test_string)}\n")

matches_iterable = p.finditer(test_string)
print(f"p.finditer(test_string) = {matches_iterable}")
for i, match in enumerate(matches_iterable):
    print(f"Match # {i} = {match}")
    
    

p is a <class 're.Pattern'> object

p.match(test_string) = None

p.search(test_string) = <re.Match object; span=(3, 14), match='3rd article'>

p.findall(test_string) = ['3rd article', '2nd article']

p.finditer(test_string) = <callable_iterator object at 0x7fbbbe1efbe0>
Match # 0 = <re.Match object; span=(3, 14), match='3rd article'>
Match # 1 = <re.Match object; span=(19, 30), match='2nd article'>


Similarly, you can do all these matches without creating a Pattern object too, by calling functions available on the 'Module level', like so:

In [68]:
# Same matches as before, this time via the module-level functions (no Pattern object)
regex_string = r'(?:1st|2nd|3rd|\dth) article'
print(f"regex_string = {regex_string}\nBut there's no need to compile it\n")
test_string = "My 3rd article and 2nd article"

print(f"re.match(regex_string, test_string) = {re.match(regex_string, test_string)}\n")

print(f"re.search(regex_string, test_string) = {re.search(regex_string, test_string)}\n")

print(f"re.findall(regex_string, test_string) = {re.findall(regex_string, test_string)}\n")

matches_iterable = re.finditer(regex_string, test_string)
print(f"re.finditer(regex_string, test_string) = {matches_iterable}")
for i, match in enumerate(matches_iterable):
    print(f"Match # {i} = {match}")

regex_string = (?:1st|2nd|3rd|\dth) article
But there's no need to compile it

re.match(regex_string, test_string) = None

re.search(regex_string, test_string) = <re.Match object; span=(3, 14), match='3rd article'>

re.findall(regex_string, test_string) = ['3rd article', '2nd article']

re.finditer(regex_string, test_string) = <callable_iterator object at 0x0000029BBBC92550>
Match # 0 = <re.Match object; span=(3, 14), match='3rd article'>
Match # 1 = <re.Match object; span=(19, 30), match='2nd article'>


<b><u>Takeaway</u></b>: There are two convenient ways to use the regex matching functions, and both produce the same results. The main reason to favor the __Pattern level__ methods is computational efficiency: you compile the pattern once and reuse it, whereas the __Module level__ functions have to look up (or compile) the pattern on every call. <br><br>_Side Note_: The Module level functions also keep recently compiled patterns in a cache, so realistically, there isn't a big difference unless you have a lot of regex calls using many different regex patterns.
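
As a rough sketch of that reuse pattern: compile once outside a loop, then apply the same Pattern object to many strings. The `lines` list is a made-up example.

```python
import re

regex_string = r'(?:1st|2nd|3rd|\dth) article'
test_string = "My 3rd article and 2nd article"

# Compile once, outside any loop, and reuse the Pattern object
p = re.compile(regex_string)

# Both call styles return the same matches
same_result = p.findall(test_string) == re.findall(regex_string, test_string)
print(same_result)  # True

# Hypothetical input lines, filtered with the precompiled pattern
lines = ["My 1st article", "no ordinal here", "her 4th article"]
matched = [line for line in lines if p.search(line)]
print(matched)  # ['My 1st article', 'her 4th article']
```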

### Grouping

Groups are marked by the `(` and `)` metacharacters. Conceptually, they are interpreted in the same way parentheses are treated in mathematical expressions.
<br><br>For example, if you see `(2+3)*5` in algebra, you know that the `2+3` is its own entity, or a group.
<br><br>Similarly, `r"(ab)c"` in regex means `ab` is a group. On its own, it matches the same strings as `r"abc"`, but you can combine grouping with other metacharacters to make powerful regexes. For example, `r"(ab)*c"` will match 0 or more instances of `ab` followed by a `c`.

In [32]:
regex_with_group = r'(ab)*c'
print(re.search(regex_with_group, 'c'))
print(re.search(regex_with_group, 'abc'))
print(re.search(regex_with_group, 'ababc'))

<re.Match object; span=(0, 1), match='c'>
<re.Match object; span=(0, 3), match='abc'>
<re.Match object; span=(0, 5), match='ababc'>


#### Next
Groups can also be retrieved individually using Match object methods like `group()`.

In [21]:
regex_with_group = r'(ab)*c'
m = re.search(regex_with_group, 'ababc')
m.group(1)

'ab'

Here we retrieved the first group (group indexes start at 1). Note that because `(ab)` is repeated by `*`, the group holds the text of its last repetition.
<br><br>`m.group(0)`, or simply `m.group()` since 0 is the default argument, gives you the whole matching string. That is why the groups you define in your regex are numbered from 1 onward.
<br><br>In fact, most Stack Overflow regex answers return the matching string by calling `.group()` on the match object `m`.
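
To see the numbering with more than one group, here is a small sketch. The pattern and sample string are made up for illustration.

```python
import re

# Two capturing groups: (\d+) is group 1, ([a-z]+) is group 2
m = re.search(r'(\d+)([a-z]+) article', 'My 3rd article')

print(m.group())   # '3rd article'  (same as m.group(0))
print(m.group(1))  # '3'
print(m.group(2))  # 'rd'
```

Groups are numbered by the position of their opening parenthesis, counting from the left.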

#### Next
You can even reference a group later in the same regex using a backslash and the index of the group

In [27]:
regex_with_group = r'(ab)*c\1'
m = re.search(regex_with_group, 'ababcabab')
print(m.group(0))

ababcab


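Here, `\1` matches the same text that group 1 (`ab`) captured, since `(ab)` is the first group in the regex. That is why the match is `ababcab`: `(ab)*c` matches `ababc`, and `\1` then matches one more `ab`.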
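
A common use of backreferences is spotting accidentally doubled words. The sketch below assumes words are separated by a single space, and the sample sentences are made up.

```python
import re

# (\w+) captures a word; \1 then requires that same text to appear again
doubled = re.compile(r'\b(\w+) \1\b')

m = doubled.search("this is is a test")
print(m.group(0))  # 'is is'
print(m.group(1))  # 'is'

# \1 matches the captured text, not the pattern: 'ab' and 'cd' both fit \w+,
# but they are different words, so there is no match
no_match = doubled.search("ab cd")
print(no_match)  # None
```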

#### Non-capturing groups

When the contents of a group start with a question mark and a colon (`?:`), it's a non-capturing group. For example, `r"(?:ab)c"` matches the same strings as `r"(ab)c"`, but `ab` is 'not captured', meaning it can't be retrieved using functions like `m.group(1)` or referenced later in the regex using backslashes like `\1`. Let's see an example:

In [33]:
regex_with_group = r'(?:ab)*c\1'
m = re.search(regex_with_group, 'ababcabab')
print(m.group(0))

error: invalid group reference 1 at position 9

Couldn't reference it

In [34]:
regex_with_group = r'(?:ab)*c'
m = re.search(regex_with_group, 'ababc')
m.group(1)

IndexError: no such group

Couldn't retrieve it
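
Where non-capturing groups pay off is when you need parentheses for alternation or repetition but only care about capturing something else. A minimal sketch, with a made-up sentence and pattern:

```python
import re

text = "Dr. Smith met Mr. Jones"

# Capturing version: the title is group 1, the surname is group 2
capturing = re.search(r'(Mr|Ms|Dr)\. (\w+)', text)
print(capturing.group(1))      # 'Dr'
print(capturing.group(2))      # 'Smith'

# Non-capturing version: the title must still match,
# but the surname is now the only numbered group
non_capturing = re.search(r'(?:Mr|Ms|Dr)\. (\w+)', text)
print(non_capturing.group(0))  # 'Dr. Smith'
print(non_capturing.group(1))  # 'Smith'
```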

That's it. If you are comfortable with the concepts explained so far, that is, what a group is and how to define one, congratulations!
<br>I think you will find the following part to be a breeze.

### Intermediate RegEx feature #1
`m.group()` vs. `m.groups()`

One point of confusion for you may be seeing `m.group()` vs. `m.groups()` in code involving regular expressions. Now that you understand what Groups are in regular expressions, the difference between these two functions will be pretty straightforward to understand:
1. `m.groups()` will give you a tuple of all captured groups in the regular expression. As you can imagine, non-capturing groups will not be included.
2. `m.group(group_index)` will give you the group specified by the index. You can even specify multiple groups, in which case it returns a tuple of those groups. And as explained before, if you say `m.group()` or `m.group(0)`, it will return the whole matched string.

That's it. Let's see it in action

In [35]:
#TODO: Write code to distinguish m.group(...) and m.groups()
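
One possible way to show the difference, as a sketch only: the date pattern and sample string below are assumptions chosen for illustration.

```python
import re

# Year and month are captured; the day sits in a non-capturing group
m = re.search(r'(\d{4})-(\d{2})-(?:\d{2})', 'Published on 2023-05-17')

print(m.group())      # '2023-05-17'     whole match
print(m.group(1))     # '2023'           a single group
print(m.group(1, 2))  # ('2023', '05')   several groups at once, as a tuple
print(m.groups())     # ('2023', '05')   all captured groups; the day is excluded
```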

### Intermediate RegEx feature #2
`findall()` vs. `finditer()`

There is actually a nuance to `findall()`, which is applicable when using groups. I'll explain it in my words below, but for reference, it's summarized in the Python docs here - https://docs.python.org/3/library/re.html#re.findall.

Basically, if your regex contains capturing groups (1 or more), the list returned by `findall()` may not be what you expect. If there's one capturing group, you get a list of the strings matched by that group only, not the whole match. If there are two or more, you get a list of tuples, one tuple per match. Non-capturing groups don't trigger this behavior.

As such, I don't use `findall()` in my code, except for a quick way of debugging my regex pattern. I only use `finditer()`, which will output an iterator over all matches (where a match matches the entire regex pattern, not just a given group in the pattern)
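
Here is a short sketch of how the output of `findall()` changes with the number of capturing groups, next to the `finditer()` equivalent. The patterns are simplified variations made up for this comparison.

```python
import re

text = "My 3rd article and 2nd article"

# No capturing groups: a list of whole matches
whole = re.findall(r'\d\w\w article', text)
print(whole)         # ['3rd article', '2nd article']

# One capturing group: a list of that group's text only
one_group = re.findall(r'(\d)\w\w article', text)
print(one_group)     # ['3', '2']

# Two capturing groups: a list of tuples
two_groups = re.findall(r'(\d)(\w\w) article', text)
print(two_groups)    # [('3', 'rd'), ('2', 'nd')]

# finditer() always yields Match objects, so the whole match stays available
via_finditer = [m.group() for m in re.finditer(r'(\d)(\w\w) article', text)]
print(via_finditer)  # ['3rd article', '2nd article']
```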

## Bonus Features

### Intermediate RegEx feature #3
Creating regex patterns using formatted strings

In [36]:
# TODO
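
A minimal sketch of what this could look like, assuming the goal is to build a pattern from values only known at runtime. The `keywords` list is hypothetical, and passing it through `re.escape()` is my assumption about a safe way to handle metacharacters.

```python
import re

# Hypothetical list of keywords to match
keywords = ["C++", "Python", "Go"]

# re.escape() neutralises metacharacters such as '+' before they enter the pattern
alternatives = "|".join(re.escape(k) for k in keywords)

# rf'' combines a raw string with an f-string; literal regex braces
# such as \d{4} would need doubling ({{4}}) inside an f-string
pattern = rf'(?:{alternatives}) developer'

found = re.findall(pattern, "a C++ developer and a Go developer")
print(found)  # ['C++ developer', 'Go developer']
```

The group is non-capturing so that `findall()` keeps returning whole matches.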

## Appendix

### Why raw strings, explained with examples

In [1]:
raw_string = r"\n"
regular_string = "\n"

In [2]:
raw_string == regular_string

False

In [12]:
raw_string == "\\n" # Note no 'r' in front of the string literal

True

- Say your document or actual text contains `\d`, and you want to match that.
- You could create a regex pattern with just that - `\d`, but it won't match with the `\d` in your text because `\d` has a special meaning in regex, namely that it matches with a numerical digit (0,1,2...9).
    - So (side note) if you wanted to match with any digit in your text like `4` or `7`, a regex pattern of `\d` will catch any of those digits
- (Back to original use case) So the regex pattern we create is to catch the literal character `\` followed by literal character `d` but without any special meaning in this juxtaposition.
- In the regex world, you do this by 'escaping' your backslash, so the regex pattern needs to be `\\d`

In [79]:
match = re.match('\\d', r'\d')  # the text being searched is a backslash followed by d
print(match)

None


##### Why didn't this work?

The reason is that Python processes backslashes in ordinary string literals before the regex engine ever sees the pattern (I don't know all the details tbh, but my point is that we don't need to). In the literal `'\\d'`, Python reads `\\` as one escaped backslash, so the pattern that reaches the regex engine is just `\d`, which matches a digit, not the literal text we intended.

So in order to use a backslash in a Python string without Python attributing any special meaning to it, we have to escape it by putting another backslash (`\`) in front of it.

So as you can imagine, since we have two `\` in our actual regex pattern that we want to pass to the regex engine, `\\d`, we need to escape both the backslashes when representing this pattern as a Python string, and the final string that will work in this use case is `\\\\d` <-- four backslashes

In [81]:
match = re.match('\\\\d', r'\d')  # four backslashes in a normal string literal
print(match)

<re.Match object; span=(0, 2), match='\\d'>


This is where <b>Raw Strings</b> come in to save the day. As mentioned in the beginning, raw strings treat backslashes as a literal character. So we don't need to escape the backslashes in the regex pattern. We can simply pass `r'\\d'` to the regex engine

In [84]:
print(re.match(r'\\d', r'\d'))

<re.Match object; span=(0, 2), match='\\d'>


<b><u>Takeaway</u></b>: Raw strings are not a strict requirement, but make your life easier when working with regular expressions, as you don't need to worry about Python messing up your regex pattern containing backslashes. And trust me, backslashes are a pretty common feature of regexes.

### repr() magic function
Btw 'magic function' (more commonly 'magic method') is NOT a term I came up with. It's an actual thing - https://www.tutorialsteacher.com/python/magic-methods-in-python.
<br><br>Basically, the goal of the built-in `repr()` function, which calls the object's `__repr__` magic method, is to unambiguously show the underlying representation of the object - https://stackoverflow.com/questions/1436703/what-is-the-difference-between-str-and-repr.
<br><br>So we can use it to understand raw strings better

In [4]:
print(repr(raw_string))

'\\n'


In [5]:
print(repr(regular_string))

'\n'


In [6]:
print(f"length of raw string: {len(raw_string)}")
print(f"length of regular string: {len(regular_string)}")

length of raw string: 2
length of regular string: 1


In [7]:
print(repr(raw_string[0]))

'\\'


In [8]:
print(repr(regular_string[0]))

'\n'


In [9]:
print(f"type of raw string: {type(raw_string)}")
print(f"type of regular string: {type(regular_string)}")

type of raw string: <class 'str'>
type of regular string: <class 'str'>
