# A brief summary of *RegEx*
<br>
<div style="opacity: 0.8; font-family: Consolas, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New; font-size: 12px; font-style: italic;">
    ────────
    for more from the author, visit
    <a href="https://github.com/hazemanwer2000">github.com/hazemanwer2000</a>.
    ────────
</div>

## Table of Contents
  * [Special Characters](#special-characters)
  * [Character Classes](#character-classes)
  * [Number of Occurrences](#number-of-occurences)
  * [Groupings](#groupings)
  * [*Starts with*, *Ends with*](#starts-with-ends-with)
  * [Case sensitivity](#case-sensitivity)
  * [Capture Groups](#capture-groups)
  * [*Look ahead* and *behind*](#look-ahead-and-behind)


A *regular expression (RegEx)* is a sequence of characters that specifies a search pattern.

A *RegEx engine* is the software component that parses a *RegEx* and searches text for matches.

Most programming languages ship their own flavor of *RegEx* syntax and engine, and the flavors differ in the details.

This document uses *Python*'s *RegEx* engine (the standard `re` module) for all examples.

In [3]:
import re

### Special Characters <a class="anchor" id="special-characters"></a>

A *special character* is a character that is associated with a special meaning in *RegEx* syntax.

| Special Character | Description |
| :---: | :--- |
| `.` | Matches any character except newline.
| `[]` | User-defined character class.
| `\` | Escape character.
| ㅤ |
| `^` | Asserts *starts with*.
| `$` | Asserts *ends with*.
| ㅤ |
| `*` | Zero or more occurrences of the preceding character or group.
| `+` | One or more occurrences of the preceding character or group.
| `?` | Zero or one occurrence of the preceding character or group.
| `{}` | Specified number of occurrences of the preceding character or group.
| ㅤ |
| `\|` | Either, or.
| `()` | Capture, and group.

*Note:* To match a special character literally, precede it with the escape character, `\`.

*Note:* In most programming languages, `\n` denotes a newline, `\t` a tab, and `\r` a carriage-return.
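As a small illustration of escaping, the sketch below contrasts an unescaped `.` with an escaped one. The sample string is made up for this example, and `re.escape` is the standard-library helper that escapes a literal string automatically.

```python
import re

sample = 'v1.2 v1x2'                    # made-up sample text

# Unescaped '.' matches any character, so both tokens match.
unescaped = re.findall(r'v1.2', sample)

# Escaped '\.' matches only a literal dot.
escaped = re.findall(r'v1\.2', sample)

# re.escape() builds the escaped pattern from a literal string.
auto_escaped = re.findall(re.escape('v1.2'), sample)

print(unescaped)       # ['v1.2', 'v1x2']
print(escaped)         # ['v1.2']
print(auto_escaped)    # ['v1.2']
```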

### Character Classes <a class="anchor" id="character-classes"></a>

A *character class* matches exactly one character, chosen from a set or range of acceptable characters.

| Character Class | Description |
| :---: | :--- |
| `[0-9]` | Any digit.
| `[a-z]` | Any lowercase alphabetical character.
| `[A-Z]` | Any uppercase alphabetical character.
| ㅤ |
| `[a-zA-Z0-9]` | Any alphanumeric character.
| `[aAzZ1-3]` | Any `a` or `z`, in either case, or any digit in the `1-3` range.
| `[_\-]` | Any dash or underscore.

In [107]:
txt = r'a1 b0 c8 d5 ?4 \5 ^1 6 (5'

In [108]:
regex = r'[a-c\?\\\^\(][1-8]'

re.findall(regex, txt)

['a1', 'c8', '?4', '\\5', '^1', '(5']

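*Note:* Within a character class, most special characters lose their special meaning. Only `\`, `]`, `^` (when it is the first character) and `-` (when it sits between two characters) need escaping. Escaping other characters, such as `?` and `(` above, is harmless but unnecessary.

The complement of a character class, which matches any character *not* listed, is defined using `[^` and `]`.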
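As a quick check of escaping inside a class in *Python*'s engine, the sketch below compares the fully escaped class used above with a variant that escapes only the backslash. The variable names are chosen for illustration only.

```python
import re

txt = r'a1 b0 c8 d5 ?4 \5 ^1 6 (5'

escaped = r'[a-c\?\\\^\(][1-8]'     # every special character escaped
minimal = r'[a-c?\\^(][1-8]'        # only the backslash escaped

# '?' and '(' are literal inside a class; '^' is literal unless it comes first.
same = re.findall(escaped, txt) == re.findall(minimal, txt)
print(same)                         # True

# A '-' placed last in the class is literal as well.
dashes = re.findall(r'[_-]', 'a-b_c')
print(dashes)                       # ['-', '_']
```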

In [110]:
regex = r'[^ ][1-9]'

re.findall(regex, txt)

['a1', 'c8', 'd5', '?4', '\\5', '^1', '(5']

### Number of Occurrences <a class="anchor" id="number-of-occurences"></a>

A quantifier in braces matches the preceding character (or group) a specified number of times in a row.

| Range | Description |
| :---: | :--- |
| `{5}` | Exactly five occurrences.
| `{5,}` | At least five occurrences.
| `{,5}` | At most five occurrences.
| `{1,5}` | One to five occurrences.

In [136]:
txt = r'aaaaabbbbcccd'

In [137]:
regex = r'a{1,2}'

re.findall(regex, txt)

['aa', 'aa', 'a']

By default, quantifiers are *greedy*: the engine attempts to match as many occurrences as possible.

To make a quantifier *lazy*, so that it matches as few occurrences as possible, follow it with `?`.

In [138]:
regex = r'a{1,2}?'

re.findall(regex, txt)

['a', 'a', 'a', 'a', 'a']
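The difference between greedy and lazy matching is easiest to see with `.+` between two delimiters. The following is a minimal sketch on a made-up string, not a recommendation to parse markup with *RegEx*.

```python
import re

html = '<b>bold</b>'                  # made-up sample text

# Greedy: '.+' runs to the last '>' it can reach.
greedy = re.findall(r'<.+>', html)

# Lazy: '.+?' stops at the first '>' it can reach.
lazy = re.findall(r'<.+?>', html)

print(greedy)    # ['<b>bold</b>']
print(lazy)      # ['<b>', '</b>']
```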

### Groupings <a class="anchor" id="groupings"></a>

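Several characters may be grouped into a single unit using `(?:` and `)`, a *non-capturing group*. A quantifier placed after the group then applies to the whole group.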

In [139]:
txt = r'hellohellohello'

In [144]:
regex = r'(?:hello){1,2}'

re.findall(regex, txt)

['hellohello', 'hello']

The `|` operator may be used to implement *either, or* logic.

In [4]:
txt = r'Sami, Tamer, Samir, Sameh, Samuel'

In [5]:
regex = r'Sam(?:(?:ir)|(?:i)|(?:uel))'

re.findall(regex, txt)

['Sami', 'Samir', 'Samuel']
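*Python*'s engine tries alternatives left to right and keeps the first one that succeeds, so their order matters when one alternative is a prefix of another. The sketch below reuses the names above; the inner groups are dropped because `|` already separates whole alternatives inside the outer group.

```python
import re

txt = 'Sami, Tamer, Samir, Sameh, Samuel'

# Longer alternative first: 'ir' is tried before 'i'.
longest_first = re.findall(r'Sam(?:ir|i|uel)', txt)

# Shorter alternative first: 'i' wins, so 'Samir' is cut short to 'Sami'.
shortest_first = re.findall(r'Sam(?:i|ir|uel)', txt)

print(longest_first)     # ['Sami', 'Samir', 'Samuel']
print(shortest_first)    # ['Sami', 'Sami', 'Samuel']
```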

### *Starts with*, *Ends with* <a class="anchor" id="starts-with-ends-with"></a>

To specify that a pattern must exist at the beginning of a text, begin the pattern with `^`.

In [165]:
regex = r'^a boy'

In [166]:
txt = r'a boy, he was'

re.findall(regex, txt)

['a boy']

In [167]:
txt = r'he was a boy'

re.findall(regex, txt)

[]

Similarly, end the pattern with `$` to assert that the text ends with the pattern specified.

In [168]:
regex = r'a boy$'

In [169]:
txt = r'a boy, he was'

re.findall(regex, txt)

[]

In [170]:
txt = r'he was a boy'

re.findall(regex, txt)

['a boy']
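In *Python*, `^` and `$` anchor to the whole text by default. Passing the `re.MULTILINE` flag makes them anchor to each line instead. A minimal sketch on a made-up two-line string:

```python
import re

txt = 'he was\na boy'                 # made-up two-line sample text

# Default: '^' only matches at the very beginning of the text.
start_default = re.findall(r'^a boy', txt)

# MULTILINE: '^' also matches right after each newline.
start_multiline = re.findall(r'^a boy', txt, flags=re.MULTILINE)

# MULTILINE: '$' also matches right before each newline.
end_multiline = re.findall(r'was$', txt, flags=re.MULTILINE)

print(start_default)      # []
print(start_multiline)    # ['a boy']
print(end_multiline)      # ['was']
```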

### Case sensitivity <a class="anchor" id="case-sensitivity"></a>

By default, most *RegEx* engines match case-sensitively.

To switch to case-insensitive mode, use `(?i)`.

In [180]:
txt = r'aa aA Aa AA'

In [189]:
regex = r'aa'

re.findall(regex, txt)

['aa']

In [190]:
regex = r'(?i)aa'

re.findall(regex, txt)

['aa', 'aA', 'Aa', 'AA']

*Note:* *RegEx* engines read a pattern *left-to-right*. In engines that allow it, `(?i)` turns on case-insensitive matching for the rest of the pattern, and `(?-i)` turns it off again.

*Note:* The *RegEx* engine in *Python* only accepts a global mode specifier such as `(?i)` at the very beginning of a pattern. To limit a mode to part of a pattern, use a scoped group such as `(?i:...)` instead.
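Two alternatives to a leading `(?i)` in *Python* are sketched below: passing the `re.IGNORECASE` flag to the function, and using a scoped group `(?i:...)` (available since Python 3.6) to apply the mode to part of the pattern only.

```python
import re

txt = 'aa aA Aa AA'

# The flag applies case-insensitivity to the whole pattern.
by_flag = re.findall(r'aa', txt, flags=re.IGNORECASE)

# Scoped group: only the first 'a' is case-insensitive.
scoped = re.findall(r'(?i:a)a', txt)

print(by_flag)    # ['aa', 'aA', 'Aa', 'AA']
print(scoped)     # ['aa', 'Aa']
```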

### Capture Groups <a class="anchor" id="capture-groups"></a>

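A *capture group* is a group that also records the part of the text it matched, so that it can be retrieved separately.

It uses plain `(` and `)`, instead of `(?:` and `)`. When a pattern contains capture groups, `re.findall` returns the captured parts rather than the whole match.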

In [203]:
txt = r'hi@cc.com hello@rt.net bye@us.org'

In [204]:
r_pre_at = r'[a-z0-9_\-.]+'
r_post_at = r'[a-z]+\.[a-z]+'

regex = r'(?i)(' + r_pre_at + r')@(' + r_post_at + r')'

re.findall(regex, txt)

[('hi', 'cc.com'), ('hello', 'rt.net'), ('bye', 'us.org')]
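Capture groups may also be named with *Python*'s `(?P<name>...)` syntax, which makes the results easier to read than positional tuples. In the sketch below, the group names `user` and `domain` are labels chosen for this example.

```python
import re

txt = 'hi@cc.com hello@rt.net bye@us.org'

# Same pattern as above, with each capture group given a name.
regex = r'(?i)(?P<user>[a-z0-9_\-.]+)@(?P<domain>[a-z]+\.[a-z]+)'

# finditer() yields match objects; groupdict() maps names to captured text.
parts = [m.groupdict() for m in re.finditer(regex, txt)]
print(parts[0])              # {'user': 'hi', 'domain': 'cc.com'}

# A match object also exposes the whole match and each group individually.
first = re.search(regex, txt)
print(first.group(0))        # hi@cc.com
print(first.group('user'))   # hi
print(first.group(2))        # cc.com
```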

### *Look ahead* and *behind* <a class="anchor" id="look-ahead-and-behind"></a>

A *look ahead* asserts the existence (or, lack thereof) of a specific pattern, after the actual pattern to match.

Similarly, a *look behind* makes a similar assertion, before the actual pattern to match.

*Note:* The patterns asserted by *look ahead* and *look behind* are not captured, and consume zero length of the search string.

In [44]:
txt = 'xAAxAAxAAx'

In [48]:
regex = r'x(AA)x'

re.findall(regex, txt)

['AA', 'AA']

In [47]:
regex = r'(?<=x)AA(?=x)'        # positive look-ahead, look-behind

re.findall(regex, txt)

['AA', 'AA', 'AA']

In [49]:
txt = 'wAAxAAyAAz'

In [51]:
regex = r'(?<!x)AA(?!x)'        # negative look-ahead, look-behind

re.findall(regex, txt)          # only the 'AA' in 'yAAz' matches

['AA']
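A common use of look-arounds is to pick out a value based on its surroundings without including them in the match. The following is a minimal sketch on a made-up string; note that *Python* requires the pattern inside a look-behind to have a fixed width.

```python
import re

txt = 'cost $40, weight 12kg, fee $7'     # made-up sample text

# Positive look-behind: digits directly preceded by '$'.
prices = re.findall(r'(?<=\$)\d+', txt)

# Positive look-ahead: digits directly followed by 'kg'.
weights = re.findall(r'\d+(?=kg)', txt)

# Negative look-behind: digits not preceded by '$' or another digit.
# Excluding digits stops the engine from matching the '0' of '$40'.
not_prices = re.findall(r'(?<![$\d])\d+', txt)

print(prices)        # ['40', '7']
print(weights)       # ['12']
print(not_prices)    # ['12']
```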