# 1 Regular expression

## 1.1 A (Very Brief) History of Regular Expressions

In 1951, mathematician **Stephen Cole Kleene** described the concept of a *regular language*: a language that can be recognized by a finite automaton and expressed formally using regular expressions. In the mid-1960s, computer science pioneer **Ken Thompson**, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation.

Since then, regexes have appeared in many programming languages, editors, and other tools as a means of **determining whether a string matches a specified pattern**. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.



## 1.2 The re Module

**Regex functionality in Python resides in a module named re**.


### 1.2.1 The first re Module example

The first `re` module function we will look at is **re.search()**. **`re.search(<regex>, <string>)` scans `<string>` looking for the first location where the pattern `<regex>` matches.** If a match is found, `re.search()` returns a match object. Otherwise, it returns `None`.


In [1]:
import re

In the example below, we pass the search pattern `123` as `<regex>` and the variable `sample1` as the target `<string>` to `re.search()`. Because `sample1` contains **123**, the call returns a match object.

In [2]:
sample1 = "foo123bar"

In [3]:
result = re.search("123", sample1)

print(result)

<re.Match object; span=(3, 6), match='123'>


The match object is [truthy](https://realpython.com/python-data-types/#boolean-type-boolean-context-and-truthiness), so you can use it in a Boolean context like a conditional statement:

In [4]:
if result:
    print("Found a match")
else:
    print("No match")

Found a match


#### Understand the output

`re.search()`, like the other **re** functions that return a match object (for example `re.match()` and `re.fullmatch()`), prints its result in a standard format when the given pattern matches:

```text
# General format:
# - the first part is the class of the object, which is re.Match
# - span is the (start, end) index pair of the substring that matches the pattern; end is exclusive
# - match is the matching substring itself
<re.Match object; span=(start_index, end_index), match='<matched_sub_string>'>

# For example, the search above returned the result below.
# span=(3, 6) indicates the portion of <string> in which the match was found.
# match='123' indicates which characters from <string> matched.
# Here the <regex> pattern is just the plain string '123', so the matched text is identical to the regex.
# With a more complex regex, the matched text shows what the pattern actually found.
<re.Match object; span=(3, 6), match='123'>

```

Slicing the original string with the returned span gives the same substring:

In [7]:
print(sample1[3:6])

123
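
The values shown in the printed representation can also be read directly from the match object. The following is a minimal sketch using the standard `group()` and `span()` methods; the variable names `matched_text`, `start`, and `end` are illustrative and do not appear in the original notebook.

```python
import re

sample1 = "foo123bar"
result = re.search("123", sample1)

# re.search() returns None when nothing matches, so check before using the result.
if result is not None:
    matched_text = result.group()  # the substring that matched
    start, end = result.span()     # same values as result.start() and result.end()
    print(matched_text, start, end)
    # Slicing with the span reproduces the matched substring.
    print(sample1[start:end] == matched_text)
```

Checking for `None` first matters because calling `group()` on a failed search raises an `AttributeError`.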


## 1.3 Python Regex Metacharacters

The real power of regex matching in Python emerges when <regex> contains special characters called metacharacters. These have a unique meaning to the regex matching engine and vastly enhance the capability of the search.

The following table briefly summarizes all the metacharacters supported by the re module. Some characters serve more than one purpose:

| Character(s) | Meaning |
|--------------|---------|
| `.` | Matches any single character except newline (i.e. `\n`) |
| `^` | Anchors a match at the start of a string; complements a character class |
| `$` | Anchors a match at the end of a string |
| `*` | Matches zero or more repetitions |
| `+` | Matches one or more repetitions |
| `?` | Matches zero or one repetition; specifies the non-greedy versions of `*`, `+`, and `?`; introduces a lookahead or lookbehind assertion; creates a named group |
| `{}` | Matches an explicitly specified number of repetitions |
| `\` | Removes the special meaning of a metacharacter; introduces a special character class; introduces a grouping backreference |
| `[]` | Specifies a character class |
| `\|` | Designates alternation |
| `()` | Creates a group |
| `:` `#` `=` `!` | Designate a specialized group |
| `<>` | Creates a named group |
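
To make a few rows of the table concrete, here is a small sketch that tries some of the metacharacters with `re.search()`. The patterns and sample strings are made up for illustration and are not taken from the original notebook.

```python
import re

# (pattern, text) pairs; raw strings keep the backslash literal.
examples = [
    (r"fo*", "f"),           # * : zero or more 'o', so 'f' alone matches
    (r"fo+", "f"),           # + : needs at least one 'o', so no match
    (r"colou?r", "color"),   # ? : the 'u' is optional
    (r"a{3}", "caaat"),      # {} : exactly three 'a' characters
    (r"1\.5", "1.5"),        # \ : escaped dot matches a literal '.'
    (r"1\.5", "125"),        # the escaped dot is no longer a wildcard
    (r"cat|dog", "hotdog"),  # | : alternation between 'cat' and 'dog'
]

results = [re.search(pattern, text) for pattern, text in examples]

for (pattern, text), match in zip(examples, results):
    found = match.group() if match else None
    print(f"{pattern!r} on {text!r}: {found!r}")
```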



### 1.3.1 First example, Digit expression

Consider the problem of determining whether a string contains three consecutive decimal digit characters.

In a regex, a set of characters specified in square brackets ([]) makes up a character class. For example, [0-9] matches any single digit character from 0 to 9.

So we can use the regex below to match any three consecutive digits.

In [10]:
regex1 = "[0-9][0-9][0-9]"

In [11]:
re.search(regex1, sample1)

<re.Match object; span=(3, 6), match='123'>

The same regex also works on the strings below, wherever the digits appear:

In [12]:
re.search(regex1, "888toto")

<re.Match object; span=(0, 3), match='888'>

In [13]:
re.search(regex1, "toto666")

<re.Match object; span=(4, 7), match='666'>
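
Repeating the character class three times can also be written with the `{}` metacharacter from the table above. The sketch below compares the two spellings; the helper `find_span` and the extra test string `"12ab34"` are hypothetical additions for illustration.

```python
import re

regex_repeated = "[0-9][0-9][0-9]"  # the pattern used above
regex_braces = "[0-9]{3}"           # same idea, written with a repetition count


def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = re.search(pattern, text)
    return match.span() if match else None


for text in ["foo123bar", "888toto", "toto666", "12ab34"]:
    # Both patterns should report the same span for every string.
    print(text, find_span(regex_repeated, text), find_span(regex_braces, text))
```

The string `"12ab34"` contains four digits, but never three in a row, so neither pattern matches it.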

### 1.3.2 The wildcard

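The **dot (.)** metacharacter matches any single character except a newline (\n), so it functions like a wildcard.
For example, if the regex pattern is `..`, it matches any two consecutive characters:
- a : no match, because the string has only one character
- ac : 1 match, which is ac
- acd : 1 match, which is ac (the remaining d is too short to match on its own)
- acde : 2 non-overlapping matches, ac and de

The code below runs `re.search()` on each of these strings: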

In [8]:
regex2 = ".."

test_list=["a","ac","acd","acde"]

for i,test in enumerate(test_list):
    print(f"test {i} result: {re.search(regex2,test)}")

test 0 result: None
test 1 result: <re.Match object; span=(0, 2), match='ac'>
test 2 result: <re.Match object; span=(0, 2), match='ac'>
test 3 result: <re.Match object; span=(0, 2), match='ac'>


> Note that even though `..` can match twice in test 3 (`ac` and `de`), `re.search()` only returns the first match.
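
To see every non-overlapping match rather than only the first, the `re` module also provides `re.findall()` and `re.finditer()`. These functions are not covered elsewhere in this section, so treat the following as a brief sketch.

```python
import re

regex2 = ".."

# findall returns the matched substrings, scanning left to right without overlap.
all_matches = re.findall(regex2, "acde")
print(all_matches)

# finditer yields match objects, which gives access to each span.
all_spans = [m.span() for m in re.finditer(regex2, "acde")]
print(all_spans)

# With an odd number of characters, the last character is left unmatched.
print(re.findall(regex2, "acd"))
```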

Returning to the first example, we can write a new regex that matches 123:

In [15]:
regex3 = "1.3"

In [16]:
re.search(regex3, sample1)

<re.Match object; span=(3, 6), match='123'>

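Because the dot (.) is a wildcard, the 2 can be replaced by any single character other than a newline. The dot must still consume exactly one character, so a string where 1 is immediately followed by 3 does not match.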

In [17]:
re.search(regex3, "1a3toto")

<re.Match object; span=(0, 3), match='1a3'>

In [20]:
re.search(regex3, "1 3toto")

<re.Match object; span=(0, 3), match='1 3'>

In [19]:
res = re.search(regex3, "13toto")
print(res)

None
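
The newline exception can be checked in the same way. The sketch below also uses the standard `re.DOTALL` flag, which makes the dot match a newline as well; the flag is not discussed in the original notebook and is shown here only for contrast.

```python
import re

regex3 = "1.3"
text_with_newline = "1\n3"

# By default the dot does not match a newline, so this search fails.
default_result = re.search(regex3, text_with_newline)
print(default_result)

# With re.DOTALL the dot matches any character, including a newline.
dotall_result = re.search(regex3, text_with_newline, flags=re.DOTALL)
print(dotall_result)
```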


### 1.3.3 ^ - Caret

The caret symbol ^ anchors the match at the start of the string, so it is used to check whether a string starts with the given characters.

For example, if we have `^ab` as the regex pattern:
- a : no match
- ac : no match
- abc : 1 match
- acd : no match


In [9]:
regex4 = "^ab"

test_list=["a","ac","abc","acd"]

for i,test in enumerate(test_list):
    print(f"test {i} result: {re.search(regex4,test)}")

test 0 result: None
test 1 result: None
test 2 result: <re.Match object; span=(0, 2), match='ab'>
test 3 result: None
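
As the metacharacter table notes, the caret has a second role: placed first inside square brackets, it complements the character class instead of anchoring the match. A minimal sketch contrasting the two uses, with made-up sample strings:

```python
import re

# Caret outside brackets: anchor at the start of the string.
starts_with_digit = re.search("^[0-9]", "123abc")
print(starts_with_digit)

# Caret as the first character inside brackets: any character that is NOT a digit.
first_non_digit = re.search("[^0-9]", "123abc")
print(first_non_digit)

# A string made only of digits contains no non-digit character.
only_digits = re.search("[^0-9]", "123")
print(only_digits)
```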


### 1.3.4 $ - Dollar

The dollar symbol $ anchors the match at the end of the string, so it is used to check whether a string ends with the given characters.

For example, if we have `ab$` as the regex pattern:
- a : no match
- ac : no match
- cab : 1 match
- acd : no match


In [10]:
regex5 = "ab$"

test_list=["a","ac","cab","acd"]

for i,test in enumerate(test_list):
    print(f"test {i} result: {re.search(regex5,test)}")

test 0 result: None
test 1 result: None
test 2 result: <re.Match object; span=(1, 3), match='ab'>
test 3 result: None
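
Combining `^` and `$` forces the pattern to cover the whole string. The sketch below applies this to the three-digit regex from section 1.3.1 and compares it with `re.fullmatch()`, a standard `re` function that is not otherwise covered here; the name `exactly_three_digits` is illustrative.

```python
import re

# Both anchors together: the string must consist of three digits and nothing else.
exactly_three_digits = "^[0-9][0-9][0-9]$"

for text in ["123", "foo123bar", "1234", "123\n"]:
    anchored = re.search(exactly_three_digits, text)
    strict = re.fullmatch("[0-9][0-9][0-9]", text)
    print(f"{text!r}: search={bool(anchored)}, fullmatch={bool(strict)}")
```

One detail worth knowing: `$` also matches just before a newline at the very end of the string, so the anchored search accepts `"123\n"` while `re.fullmatch()` rejects it.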



Reference: https://realpython.com/regex-python/#a-very-brief-history-of-regular-expressions