[Regular Expression HOWTO — Python 3.11.2 documentation](https://docs.python.org/3/howto/regex.html)

# Python Regex

Regex metacharacters include:

`. ^ $ * + ? { } [ ] \ | ( )`

## Matching Characters

`[ ]` defines a "character class", a set of characters to match. It can be a range, as in `[a-c]`, or a list of specific characters, as in `[abc]`.

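Most metacharacters specified inside a class are interpreted as literals, e.g. `[akm$]` matches `a`, `k`, `m`, or `$`. The exceptions are `\`, a leading `^`, a `-` between two characters, and `]`.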

To match any character NOT in a class, put a `^` as the first character of the class, e.g. `[^5]` matches any character except 5. NOTE: if the caret appears anywhere else in the class, it is interpreted as a literal, e.g. `[5^]` WILL match both 5 and ^.
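A minimal sketch of the character-class rules above. The sample text and variable names are made up for illustration.

```python
import re

text = "a1b5c^"  # illustrative sample, not from the notes

# A range and an explicit list describe the same class.
range_hits = re.findall(r"[a-c]", text)     # ['a', 'b', 'c']
list_hits = re.findall(r"[abc]", text)      # ['a', 'b', 'c']

# A leading caret negates the class: every character except "5".
not_five = re.findall(r"[^5]", text)        # ['a', '1', 'b', 'c', '^']

# A caret that is not first is a literal: matches "5" and "^".
five_or_caret = re.findall(r"[5^]", text)   # ['5', '^']

print(range_hits, list_hits, not_five, five_or_caret)
```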

`\` is used to escape metacharacters for use as literals, and also introduces special sequences. For example, `\w` matches any alphanumeric character or underscore, equivalent to `[a-zA-Z0-9_]` for ASCII text. `\d` matches any decimal digit, equivalent to `[0-9]`. `\D` matches any non-digit character, i.e. `[^0-9]`. `\s` matches any whitespace character. For `str` patterns these sequences also match the corresponding Unicode characters unless the `ASCII` flag is set. NOTE: sequences CAN be used in character classes.
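The special sequences can be tried out roughly as follows. This is a small sketch; the sample string and variable names are assumptions chosen for illustration.

```python
import re

sample = "id_42: 7 apples\t$3.50"  # illustrative sample, contains a tab

digits = re.findall(r"\d", sample)          # single decimal digits
words = re.findall(r"\w+", sample)          # runs of letters, digits, underscore
whitespace = re.findall(r"\s", sample)      # spaces and the tab
non_digits = re.findall(r"\D+", "a1b22c")   # runs of non-digit characters

# A sequence used inside a character class: digits or a literal dot.
numbers = re.findall(r"[\d.]+", sample)

# Escaping a metacharacter so it is matched literally.
dollars = re.findall(r"\$", sample)

print(digits, words, whitespace, non_digits, numbers, dollars)
```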

`.` matches anything except a newline character.

## Repeating Things with Quantifiers

`*` causes the preceding character to be matched zero or more times, e.g. `ca*t` will match `ct`, `cat`, `caat`, and `caaaaaaaaat`. This kind of repetition is called "greedy".

A greedy quantifier first matches as many repetitions as possible, then backtracks one repetition at a time until the rest of the pattern matches. The longest possible match is therefore tried before any shorter one.

`+` behaves similarly to `*` but requires at least one occurrence, e.g. `ca+t` matches `cat` but not `ct`.
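A short sketch of `*`, `+`, and greedy backtracking. `re.fullmatch()` is used here only as a convenient way to test whole strings; it is not covered in these notes. The backtracking pattern is the one used in the linked HOWTO.

```python
import re

candidates = ["ct", "cat", "caaat"]

# "*" allows zero or more "a" characters, so "ct" matches too.
star_hits = [s for s in candidates if re.fullmatch(r"ca*t", s)]

# "+" needs at least one "a", so "ct" is rejected.
plus_hits = [s for s in candidates if re.fullmatch(r"ca+t", s)]

# Greedy backtracking: [bcd]* first grabs "bcbd", then gives characters
# back until the final "b" in the pattern can match.
greedy = re.match(r"a[bcd]*b", "abcbd").group()

print(star_hits, plus_hits, greedy)
```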

`?` matches zero or one occurrence of the preceding pattern, e.g. `home-?brew` matches both `homebrew` and `home-brew`. It can be thought of as marking a pattern as optional.

`{m,n}`, where `m` and `n` are decimal integers, sets the minimum and maximum number of repetitions of the preceding pattern, e.g. `\d{1,4}` matches 1 to 4 digits. Either `m` or `n` can be omitted: if `m` is missing, the lower bound is zero; if `n` is missing, there is no upper bound.
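The optional and bounded quantifiers might be exercised like this. It is a sketch with made-up sample strings; `re.fullmatch()` is again used only for whole-string checks.

```python
import re

# "?" makes the hyphen optional, but allows at most one.
names = ["homebrew", "home-brew", "home--brew"]
optional_hits = [s for s in names if re.fullmatch(r"home-?brew", s)]

# {1,4}: between one and four digits per match (greedy, so four if available).
bounded = re.findall(r"\d{1,4}", "7 2022 123456")

# {2,}: upper bound omitted, so two or more.
at_least_two = re.findall(r"a{2,}", "a aa aaaa")

# {,2}: lower bound omitted, so zero to two.
at_most_two = [s for s in ["", "a", "aa", "aaa"] if re.fullmatch(r"a{,2}", s)]

print(optional_hits, bounded, at_least_two, at_most_two)
```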

## Using RE in Python

Use `r"<string>"` when defining the RE strings in order to avoid having to escape backslashes.
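To see why raw strings help, the sketch below compares a raw pattern with its escaped equivalent. The sample text and variable names are illustrative only.

```python
import re

# Without a raw string every backslash has to be doubled.
escaped = "\\d+\\.\\d+"
raw = r"\d+\.\d+"
same_pattern = escaped == raw

# In a normal string "\n" is one newline character;
# in a raw string it stays as a backslash followed by "n".
normal_len = len("\n")
raw_len = len(r"\n")

decimals = re.findall(raw, "pi is 3.14, e is 2.718")

print(same_pattern, normal_len, raw_len, decimals)
```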

`re.compile()` compiles a pattern string into a reusable RE (pattern) object.

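The following `re` functions apply an RE to a string. They are also available as methods on a compiled pattern object:

- `re.match()` = returns a match object if the RE matches at the beginning of the string, otherwise `None`.
- `re.search()` = returns a match object for the first match anywhere in the string, otherwise `None`.
- `re.findall()` = returns a list of all matched substrings (strings, not match objects).
- `re.finditer()` = like `findall()`, but returns an iterator of match objects rather than a list of strings.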
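The difference between the four calls can be sketched on one compiled pattern. The text and variable names are assumptions for illustration.

```python
import re

pattern = re.compile(r"\d+")
text = "abc 12 de 345"  # illustrative sample

# match() only looks at the start of the string, which is "abc" here.
at_start = pattern.match(text)              # None

# search() scans forward and stops at the first match.
first = pattern.search(text)                # match object for "12"

# findall() returns plain strings.
all_strings = pattern.findall(text)         # ['12', '345']

# finditer() yields match objects, so positions are available.
all_spans = [m.span() for m in pattern.finditer(text)]  # [(4, 6), (10, 13)]

print(at_start, first.group(), all_strings, all_spans)
```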

The returned match objects can be *queried* using the following methods:

- `group()` = returns the matched string.
- `start()` = returns the start position of the matched string.
- `end()` = returns the end position of the matched string.
- `span()` = returns a tuple of start and end position index integers.
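A small sketch of querying a match object. Because `match()` and `search()` return `None` when nothing matches, it is usually worth checking the result first. The sample string is made up.

```python
import re

m = re.search(r"\d+", "order 66 shipped")  # illustrative sample

if m is not None:
    matched = m.group()   # '66'
    start = m.start()     # 6
    end = m.end()         # 8, the index just past the match
    span = m.span()       # (6, 8)
    print(matched, start, end, span)

# No digits here, so there is no match object to query.
no_match = re.search(r"\d+", "no digits")
print(no_match)
```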

## Compilation Flags

RE behavior can be modified through Perl style flags:

- `ASCII, A` = makes special sequences such as `\w`, `\s`, and `\d` match only ASCII characters.
- `DOTALL, S` = makes `.` match any character, including `\n`.
- `IGNORECASE, I` = makes all matches ignore case.
- `LOCALE, L` = performs locale-aware matching.
- `MULTILINE, M` = multi-line matching: `^` and `$` also match at the start and end of each line.
- `VERBOSE, X` = allows verbose REs, which can contain whitespace and comments for readability.

Either name can be used to specify a flag; the short one-letter names use the same letters as Perl's pattern modifiers. Multiple flags can be combined with `|`.

Compilation flag syntax is as follows: `re.method(pattern, string, re.FLAG)`

e.g. `re.search(pattern, string, re.A)`
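The effect of several flags can be sketched as below. The sample strings and variable names are assumptions; `"\u0664"` is the Arabic-Indic digit four, used as an example of a non-ASCII digit.

```python
import re

# IGNORECASE: case-insensitive matching.
any_case = re.findall(r"python", "Python PYTHON python", re.IGNORECASE)

# DOTALL: "." also matches a newline.
dot_default = re.search(r"a.b", "a\nb")
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# MULTILINE makes "^" match at every line start; flags combine with "|".
line_starts = re.findall(r"^o\w+", "One two\nother", re.M | re.I)

# ASCII: \d no longer matches non-ASCII digits.
unicode_digits = re.findall(r"\d", "4\u0664")
ascii_digits = re.findall(r"\d", "4\u0664", re.ASCII)

# VERBOSE: whitespace and comments inside the pattern are ignored.
date_re = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
year_month = date_re.match("2022-07").groups()

print(any_case, dot_default, dot_all, line_starts, ascii_digits, year_month)
```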

## Groups

`groups()` returns a tuple of all the groups in the match object, where each group is defined by parentheses around the part of the pattern to be captured.



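```python
import re

string = "2022-07-09-test"

# Capture year, month, day, and the trailing name as four groups.
r = re.search(r"(\d{4})-(\d{2})-(\d{2})-(\w*)", string)

r.groups()  # ('2022', '07', '09', 'test')

year, month, day, filename = r.groups()

# Rebuild the string with an underscore before the name.
new_date_format = "{}-{}-{}_{}".format(year, month, day, filename)

new_date_format  # '2022-07-09_test'
```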