# Intro to regex (regular expression)

A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is `a sequence of characters that specifies a search pattern in text`. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory. ... A regular expression, often called a **pattern**, specifies a set of strings required for a particular purpose.

\- source: Wikipedia

#### Metacharacters

* Examples of metacharacters:

![image.png](attachment:image.png)

* Perhaps the most important metacharacter is the backslash, \\. As in Python string literals, the backslash can be followed by various characters to signal various special sequences. It’s also used to escape all the metacharacters so you can still match them in patterns; for example, if you need to match a \[ or \\, you can precede them with a backslash to remove their special meaning: \\\[ or \\\\.
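
As a small illustration of escaping, here is a minimal sketch. The sample text and variable names are made up for this example; `re.escape()` is a standard helper in the `re` module that adds the backslashes for you.

```python
import re

# Raw strings (r'...') pass backslashes through to the regex engine unchanged.
text = 'cost [USD]: 3.50'

# '\[' and '\]' match literal square brackets instead of opening a character class.
bracketed = re.findall(r'\[\w+\]', text)
print(bracketed)

# An unescaped '.' matches any character, so '3.50' also matches '3x50'.
unescaped_matches = re.findall('3.50', '3.50 3x50')
print(unescaped_matches)

# re.escape() escapes the metacharacters in a literal string.
escaped_matches = re.findall(re.escape('3.50'), '3.50 3x50')
print(escaped_matches)
```

Writing patterns as raw strings avoids having to double every backslash in the Python source.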



#### Special Sequences / Characters
. : any single character (letter, digit, punctuation, etc.) except a newline, unless the `re.DOTALL` flag is set   
\[ \] : a character class, i.e. a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters separated by a '-' (for example, \[a-z\]).   
^ as the first character within \[ \] : negation (match any character not in the set)   
^ : the start of the string   
$ : the end of the string (or just before a trailing newline)   
| : OR (alternation)   


\d : a digit   
\D : not a digit (equivalent to \[^\d\])   
\s : a whitespace character   
\S : not a whitespace character (equivalent to \[^\s\])   
\w : a word character (letter, digit, or underscore)   
\W : not a word character (equivalent to \[^\w\])   
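
The sequences listed above can be tried out directly. This is only a sketch with made-up sample strings; the variable names are illustrative.

```python
import re

# '.' matches any character except a newline (by default).
dot_matches = re.findall(r'.', 'a\nb')

# A character class and its negation.
vowels = re.findall(r'[aeiou]', 'regex')
non_vowels = re.findall(r'[^aeiou]', 'regex')

# Anchors: '^' for the start of the string, '$' for the end.
starts_with_room = bool(re.search(r'^Room', 'Room 42'))
ends_with_42 = bool(re.search(r'42$', 'Room 42'))

# Alternation with '|'.
pets = re.findall(r'cat|dog', 'cat, dog, bird')

# Shorthand classes: \d for digits, \w for word characters.
digits = re.findall(r'\d', 'Room 42')
words = re.findall(r'\w+', 'floor_3 exit!')

print(dot_matches, vowels, non_vowels, pets, digits, words)
```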

#### Repeating Characters
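{n1,} : repeating at least n1 times   
{n1,n2} : repeating between n1 and n2 times (inclusive)   
{,n2} : repeating at most n2 times (including 0)   
Note: do not put spaces inside the braces; Python's `re` then treats the braces as literal text.   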
\* : appearing 0 or more times (equivalent to {0,})   
\+ : appearing at least once (equivalent to {1,})   
\? : appearing either 0 or 1 times (equivalent to {0,1})
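
A short sketch of the quantifiers in action. The sample strings are invented for illustration, and `\b` (a word boundary) is used here only to keep each run of letters separate.

```python
import re

text = 'a aa aaa aaaa'

# {2,} : two or more repetitions
at_least_two = re.findall(r'\ba{2,}\b', text)

# {2,3} : between two and three repetitions
two_to_three = re.findall(r'\ba{2,3}\b', text)

# ? : the preceding character is optional
optional_u = re.findall(r'colou?r', 'color colour')

# * allows zero repetitions, + requires at least one
star = re.findall(r'ab*', 'a ab abbb')
plus = re.findall(r'ab+', 'a ab abbb')

# With a space inside the braces, CPython's re reads the braces literally,
# so this pattern does not match a run of a's.
spaced = re.search(r'a{2, 3}', 'aaa')

print(at_least_two, two_to_three, optional_u, star, plus, spaced)
```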

#### Using parentheses

( ) : Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group. The contents of a group can be retrieved after a match has been performed, and can be matched again later in the string with the \number special sequence (for example, \1 refers to the first group). To match the literals '(' or ')', use \\( or \\), or enclose them inside a character class: \[(\], \[)\].
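
As a rough sketch of how groups behave in Python, the example below retrieves group contents, uses a `\1` backreference, and matches literal parentheses. The sample strings and variable names are assumptions made for this illustration.

```python
import re

# Two groups: a four-digit year and a two-digit month.
m = re.search(r'(\d{4})-(\d{2})', 'released 2021-07')
whole = m.group(0)    # the entire match
year = m.group(1)     # contents of the first group
month = m.group(2)    # contents of the second group
parts = m.groups()    # all groups as a tuple

# \1 matches the same text that the first group matched (a repeated word).
doubled = re.findall(r'\b(\w+) \1\b', 'it is is a a test')

# Escaped parentheses match the literal characters '(' and ')'.
literal_parens = re.findall(r'\(\w+\)', 'f(x) and g(y)')

print(whole, year, month, parts, doubled, literal_parens)
```

Note that when a pattern contains exactly one group, `re.findall()` returns the group contents rather than the whole match.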



![image.png](attachment:image.png)

### The main library for regex in Python is `re`

In [1]:
import re

#### Step 1. Compile a regular expression pattern into a regular expression object: ***`re.compile(pattern, flags=0)`***

Example: finding any sequence of digits

In [51]:
# compiling a regular expression into a regular expression object using re.compile
pat = re.compile(r'\d+')
string = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

#### Step 2. Use `re.findall()` to find all the sequences of digits: 12, 11, and 10

In [52]:
re.findall(pat, string)

['12', '11', '10']

In [53]:
if len(re.findall(pat, string)):
    print('There is at least one match')
else:
    print('There is no match')

There is at least one match


In [54]:
string = 'no number'

In [55]:
if len(re.findall(pat, string)):
    print('There is at least one match')
else:
    print('There is no match')

There is no match
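
When only the existence of a match matters, a compiled pattern's own `search()` method is a possible alternative to counting the results of `findall()`. The sketch below assumes a hypothetical helper named `has_number`; it is not part of the `re` module.

```python
import re

pat = re.compile(r'\d+')

def has_number(text):
    """Hypothetical helper: True if text contains at least one digit sequence."""
    # search() returns a match object for the first match, or None.
    return pat.search(text) is not None

# Compiled patterns expose findall() directly as a method.
numbers = pat.findall('12 drummers drumming, 11 pipers piping')

print(has_number('12 drummers drumming'))
print(has_number('no number'))
print(numbers)
```

`search()` can stop at the first match, whereas `findall()` scans the whole string and builds a list.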


### Match Object

This time we would like to extract all the sequences of word characters (`\w` matches letters, digits, and underscores)

In [22]:
pat = re.compile(r'\w+')
string = 'word, Word, WORD'

In [28]:
re.findall(pat, string)

['word', 'Word', 'WORD']

In [29]:
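# re.match() returns a match object only if the pattern matches
# at the start of the string; otherwise it returns None
m = re.match(pat, string)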

In [40]:
m.group(0)

'word'

In [42]:
m.start()

0

In [43]:
m.end()

4

In [44]:
m.span()

(0, 4)
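
Because `re.match()` only looks at the start of the string, it returns a match object for the first word alone. A sketch of two related calls follows: `re.search()`, which finds the first match anywhere, and `finditer()`, which yields one match object per match. The second sample string is made up for illustration.

```python
import re

pat = re.compile(r'\w+')
string = 'word, Word, WORD'

# One match object per match, each with its own text and span.
spans = [(m.group(0), m.span()) for m in pat.finditer(string)]
print(spans)

# match() fails when the pattern does not match at position 0 ...
at_start = re.match(r'\d+', 'abc 123')
# ... while search() scans forward until it finds a match.
anywhere = re.search(r'\d+', 'abc 123')
print(at_start, anywhere.group(0))
```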

### Substitution : ***`re.sub(pattern, repl, string)`***

Example:   
replace 'w' with 'W' to turn word into Word

In [45]:
string = 'word'

In [46]:
re.sub('w', 'W', string)

'Word'
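
`re.sub()` also accepts patterns, group references, a `count` limit, and a function as the replacement. The following is a minimal sketch with invented sample strings, not an exhaustive tour of the function.

```python
import re

# Replace every digit sequence with '#'.
masked = re.sub(r'\d+', '#', 'a1 b22 c333')

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', 'Lovelace, Ada')

# count limits the number of replacements.
first_only = re.sub('o', '0', 'foo boo', count=1)

# A function receives each match object and returns the replacement text.
capitalized = re.sub(r'\b\w', lambda m: m.group(0).upper(), 'word by word')

print(masked, swapped, first_only, capitalized)
```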