**NOTE: This notebook is written for the Google Colab platform. However, it can also be run (possibly with minor modifications) as a standard Jupyter notebook.**



In [None]:
#@title -- Installation of Packages -- { display-mode: "form" }
import sys
# !{sys.executable} -m pip install git+https://github.com/michalgregor/class_utils.git

In [None]:
#@title -- Import of Necessary Packages -- { display-mode: "form" }
import re
import string
from IPython.display import HTML, display

In [None]:
#@title -- Downloading Data -- { display-mode: "form" }
# also create a directory for storing any outputs
import os
os.makedirs("output", exist_ok=True)

phone_number_samples = [
    ("0903445772", (None, 903445772)),
    ("(541) 754-3010", (None, 5417543010)),
    ("554$117$22A", None),
    ("die Kartoffel", None),
    ("+1-541-754-3010", (1, 5417543010)),
    ("001-541-754-3010", (1, 5417543010)),
    ("+49-89-636-48018", (49, 8963648018)),
    ("+421 903 445 231", (421, 903445231)),
    ("4422-5588", (None, 44225588)),
    ("41 510 4405", (None, 415104405)),
    ("33 2187945", (None, 332187945)),
    ("+31 33 2187945", (31, 332187945)),
    ("(33) 445-88-76", (None, 334458876)),
    ("+65-2234-1487", (65, 22341487)),
    ("+65-XXXX-YYYY", None)
]

## Regular Expressions

When processing text, we often need to search for matches to some keyword or pattern, or to perform search and replace. Searching for one specific keyword is easy, but more flexible searches that involve patterns require some way to express what we are looking for. One way to do this is using regular expressions, which we are going to illustrate in this notebook. In the interest of brevity, we will skip a more formal introduction and jump straight to practical examples.

Also, we will not cover all the regular expression syntax in full – you can find more information e.g. in: [Regular Expression HOWTO](https://docs.python.org/3/howto/regex.html) or in [re — Regular expression operations](https://docs.python.org/3/library/re.html).



In [4]:
#@title [A YouTube Video](https://youtu.be/rhzKDrUiJVk) { display-mode: "form" }
display(HTML("""
<iframe width="560" height="315" src="https://www.youtube.com/embed/rhzKDrUiJVk" frameborder="0"
allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>
"""))

### Simple Matches

#### Explicit Match: `keyword`

As our first step, let us show how to create our very first regular expression in Python. To keep things simple, we are going to perform an explicit match, which is the same as doing a very simple keyword search. We are going to be looking for the word "the" in a text (which is also specified below). Our regular expression will simply be the word we are looking for: `the`, and we are going to compile it using Python's `re.compile`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

print(text)

In [None]:
expr = re.compile("the")
list(expr.finditer(text))

As we can see, our search using `expr.finditer` yielded an iterator, which we transformed into a list. It contains a total of two matches: the "the" inside "theorem" and the standalone article "the". Each match indicates the matched text and its span in the original text.
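
The matched text and the span can also be read out programmatically. A minimal sketch, using the standard `group()` and `span()` methods of match objects (the variable name `found` is just for illustration):

```python
import re

text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

expr = re.compile("the")

# collect (matched text, (start, end)) for every match
found = [(m.group(), m.span()) for m in expr.finditer(text)]
print(found)

# a span can be used to slice the original string
start, end = found[0][1]
print(text[start:end])
```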

#### Alternation: `alternative1|alternative2`

One thing that we can note in our first example is that the first "the" in our text did not get matched because it was actually a "The" and regular expressions are case-sensitive (although this can be changed using optional parameters). If we want to match both "the" and "The", we could therefore use the alternation operator `|` and write an expression that allows two alternative matches: one for "the" and the other for "The", i.e. `the|The`. We will now be able to match all three occurrences in the text: both definite articles and the "the" inside "theorem".



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile("the|The")
list(expr.finditer(text))
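
The optional parameter for case-insensitive matching mentioned above is a flag passed to `re.compile`. As a small sketch, `re.IGNORECASE` should give the same three matches without spelling out both alternatives:

```python
import re

text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

# re.IGNORECASE makes "the" match regardless of letter case
expr = re.compile("the", re.IGNORECASE)
matches = [m.group() for m in expr.finditer(text)]
print(matches)  # ['The', 'the', 'the']
```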

We can also combine a standard match with an alternative match, so we could say that there is an alternative between "t" and "T" and the rest of the match needs to equal "he" exactly. The alternative between "t" and "T" will need to be enclosed in parentheses: `(t|T)he`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile("(t|T)he")
list(expr.finditer(text))

#### Character Classes: `[cbaA]`, `[0-9a-zA-Z]`

Yet another way to achieve the same would be to specify a character class for the first letter. By enclosing a number of characters in square brackets, we specify that any one of them is allowed as a match, i.e. in our case: `[tT]he`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile("[tT]he")
list(expr.finditer(text))

While this might seem like a futile exercise in providing many syntactic ways to express the same concept, character classes are actually more flexible than that. For one thing, we can use them to specify ranges of characters. E.g. to cover the entire English alphabet we would write `[a-zA-Z]`, which matches all lowercase and uppercase letters. And we could do the same for digits: `[0-9]`.

If, for instance, we wanted to match all 2-letter combinations starting with "n", we would write `n[a-zA-Z]`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile("n[a-zA-Z]")
list(expr.finditer(text))

#### Negative Character Classes: `[^cbaA]`

If we want a character class that matches anything except a given set of characters, we start the class with the `^` operator. E.g. `[^e]` would match anything except "e". So if we write `n[^e]`, this will exclude the "ne" matches we got before, but it will include an "n " match, because we are now matching whitespace as well.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile("n[^e]")
list(expr.finditer(text))

#### Matching Whitespace: `\s`

Actually, since we are on the subject of whitespace, to match any whitespace character, we can use `\s`. So to further exclude the "n " entry and any entries with whitespace, we might write `n[^e\s]`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile(r"n[^e\s]")  # raw string (explained in the section on escaping below) keeps \s intact
list(expr.finditer(text))

#### Matching Any Character: `.`

There is also notation for matching any character at all: `.`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile("ne.")
list(expr.finditer(text))

#### Escaping Meta Characters: `\.\^\+` and Using Raw Strings

With all these special characters, it should be obvious by now that if we need to match any of them literally, we will need to escape them somehow. Escaping is done using the backslash `\`. So if we want to match e.g. a literal ".", the regular expression for that would be `\.`. Writing this in an ordinary Python string is problematic, though, because backslashes are already used there to express special characters such as the newline character `\n`, and recent Python versions issue a warning for unrecognized escape sequences such as `"\."`.

So in order to get an actual backslash into a Python string, we would need to write two backslashes. For the purposes of the regular expression, these two would then act as a single backslash. If we need to chain more backslashes, this quickly gets confusing. Luckily, if we prefix a string with `r` in Python, e.g. `r"\."`, this marks it as a raw string. In a raw string, there is no need for double backslashes: we can write our regular expression directly.
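
To make the difference concrete, the following short sketch compares the two spellings of the same pattern (the variable names are only illustrative):

```python
import re

escaped = "\\."  # ordinary string: two backslashes produce one
raw = r"\."      # raw string: the backslash is kept as written

print(escaped == raw)  # True
print(len(raw))        # 2 (a backslash and a dot)

# both spellings match a literal dot
print(re.findall(raw, "networks. The end."))  # ['.', '.']
```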

The list of metacharacters that need to be escaped when used in the literal sense is: `. ^ $ * + ? { } [ ] \ | ( )`.

So if we want to search for any two characters followed by a dot, we can write: `r"..\."`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile(r"..\.")
list(expr.finditer(text))

### Repetition: `+*{n}?`

To make things more interesting still, we can specify whether patterns are allowed to repeat and even how many times. Let us consider a few examples of that.

#### An Exact Number of Repetitions: `expr{n}`

By appending `{n}` to an expression (or a subexpression, enclosed in parentheses as necessary) we specify that we expect it to be repeated exactly `n` times. So to specify that we are interested in a sequence of any 4 characters except "e" and whitespace, we could write `[^e\s]{4}`.
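
As a quick sketch, we can try this expression on the text used in the earlier examples. Every match should be exactly 4 characters long and contain neither "e" nor whitespace:

```python
import re

text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

# exactly four consecutive characters that are neither "e" nor whitespace
expr = re.compile(r"[^e\s]{4}")
chunks = [m.group() for m in expr.finditer(text)]
print(chunks)
```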

#### Any Number of Repetitions: `expr*`

If we want to allow any number of repetitions, including none at all, we use the star operator `*`.
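
A small illustration on a made-up string: `ab*` matches an "a" followed by any number of "b" characters, including zero.

```python
import re

# "b*" allows zero or more "b" characters after the "a"
star_matches = re.findall(r"ab*", "a ab abbb")
print(star_matches)  # ['a', 'ab', 'abbb']
```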

#### One Time or More: `expr+`

To express that the expression has to occur at least once, but may occur more than once, we use the `+` operator.
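
On the same made-up string as one might use for `*`, the pattern `ab+` no longer matches a lone "a", because at least one "b" is required:

```python
import re

# "b+" requires at least one "b" after the "a"
plus_matches = re.findall(r"ab+", "a ab abbb")
print(plus_matches)  # ['ab', 'abbb']
```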

#### An Optional Expression: `expr?`

To express that an expression is optional (it may or may not occur), we can use the `?` operator.
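
For instance, a sketch with an illustrative pattern `colou?r`, where the "u" is optional, so both spellings are matched but a doubled "u" is not:

```python
import re

# "u?" means the "u" may occur once or not at all
optional_matches = re.findall(r"colou?r", "color colour colouur")
print(optional_matches)  # ['color', 'colour']
```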

#### Example: Matching All Whole Words

Let's suppose that we want to match all words, i.e. any contiguous collection of letters separated from its context by whitespace.



In [None]:
sentence = "The classic universal approximation theorem."

In [None]:
expr = re.compile(r"\s[a-zA-Z]+\s")
list(expr.finditer(sentence))

Well, that did not quite work, did it? This is what often happens when writing regular expressions: we forget about some cases that we want the expression to cover. Let's fix this.

#### Exclude the Spaces; Lookbehind, Lookahead: `(?<=...)`, `(?=...)`

Let's first exclude the spaces from the matches. We need to do this, because otherwise the matches for neighbouring words would have to share a space, and since matches cannot overlap, the search will not pick up all of them.

* **Lookbehind:**  To match a pattern at the beginning of our expression but not include it in the match, we can use lookbehind: `(?<=...)`, where we replace `...` with our pattern.


* **Lookahead:**  Similarly, if we want to match a pattern at the end of our expression but not include it in the match, we can use lookahead: `(?=...)`.


So, for our example, we can write: `(?<=\s)[a-zA-Z]+(?=\s)` to exclude the spaces from the matches. This way we should also already be able to pick up one more word.



In [None]:
sentence = "The classic universal approximation theorem."

In [None]:
expr = re.compile(r"(?<=\s)[a-zA-Z]+(?=\s)")
list(expr.finditer(sentence))

#### Use the Boundary Metacharacter: `\b`

That is a bit better, but we are still missing the first and the last word, because these are not surrounded by whitespace on both sides. We could handle that using the special metacharacters for the start (`^`) and the end (`$`) of the string and also explicitly add all punctuation. However, the expression would get rather complex. Thankfully, we can do the same using the **boundary metacharacter** `\b`, which matches word boundaries:



In [None]:
sentence = "The classic universal approximation theorem."

In [None]:
expr = re.compile(r"\b[a-zA-Z]+\b")
list(expr.finditer(sentence))

#### Start and End of String: `^`, `$`

We have mentioned that there are metacharacters for the start (`^`) and the end (`$`) of the string. So let us try to use the former to match the very first word in the string. We will simply use `^` followed by the pattern, i.e.: `^[a-zA-Z]+`.



In [None]:
sentence = "The classic universal approximation theorem."

In [None]:
expr = re.compile(r"^[a-zA-Z]+")
list(expr.finditer(sentence))
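
The end-of-string metacharacter `$` can be used analogously. As a sketch, since our sentence ends with a full stop, we include an escaped dot before `$` to match the last word together with the final punctuation mark:

```python
import re

sentence = "The classic universal approximation theorem."

# letters, then a literal dot, then the end of the string
expr = re.compile(r"[a-zA-Z]+\.$")
last = [m.group() for m in expr.finditer(sentence)]
print(last)  # ['theorem.']
```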

#### Matching Punctuation

When matching punctuation, we can use the `string.punctuation` string, which contains all basic ASCII punctuation marks. Naturally, some of those punctuation marks happen to be regular expression metacharacters, so we need to escape them. Fortunately, this can be done automatically using `re.escape`. So let us form a character class that matches any punctuation.



In [None]:
sentence = "The classic? Universal; approximation. Theorem!"

In [None]:
punct = string.punctuation
punct

In [None]:
punct_class = "[" + re.escape(punct) + "]"
punct_class

In [None]:
expr = re.compile(punct_class)
list(expr.finditer(sentence))

### Capturing Groups

The matches that we get using regular expressions can also be structured: instead of getting just the full text of the match, we can also get its component parts, provided we enclose them in capturing groups. We create these using parentheses. E.g. if we want to match any two words following a "the" and extract each of them separately, we can write: `[Tt]he ([a-zA-Z]+) ([a-zA-Z]+)`.



In [None]:
sentence = "The classic universal approximation theorem."

In [None]:
expr = re.compile(r"[Tt]he ([a-zA-Z]+) ([a-zA-Z]+)")
match = expr.search(sentence)
match

Having retrieved our match, we can now use `match.group(n)` to refer to its various capturing groups. Group 0 will always refer to the entire match.



In [None]:
match.group(0)

Groups 1 and 2 will correspond to the first and the second word in our case.



In [None]:
print(match.group(1))
print(match.group(2))
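
If we need all captured groups at once, the standard `match.groups()` method returns them as a tuple, which can be unpacked directly. A brief sketch (the names `first` and `second` are only illustrative):

```python
import re

sentence = "The classic universal approximation theorem."

expr = re.compile(r"[Tt]he ([a-zA-Z]+) ([a-zA-Z]+)")
match = expr.search(sentence)

# groups() returns all capturing groups (without group 0) as a tuple
print(match.groups())  # ('classic', 'universal')
first, second = match.groups()
```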

#### Non-Capturing Groups

The fact that parentheses serve two purposes, enclosing subexpressions and denoting capturing groups, can sometimes have annoying consequences. If, for instance, we wrote our regular expression as `(T|t)he ([a-zA-Z]+) ([a-zA-Z]+)`, group 1 would now correspond to the first set of parentheses, which encloses the alternative between `T` and `t`.



In [None]:
sentence = "The classic universal approximation theorem."

In [None]:
expr = re.compile(r"(T|t)he ([a-zA-Z]+) ([a-zA-Z]+)")
match = expr.search(sentence)
match.group(1)

This is often not the behaviour we want. In such cases, we can explicitly make a group non-capturing using `(?:...)`. So our regular expression would, in this case, be `(?:T|t)he ([a-zA-Z]+) ([a-zA-Z]+)`. Now group 1 will correspond to the word "classic" because "T" is no longer being captured.



In [None]:
expr = re.compile(r"(?:T|t)he ([a-zA-Z]+) ([a-zA-Z]+)")
match = expr.search(sentence)
match.group(1)

### Search and Replace

Besides matching and searching, regular expressions are also often used to perform search and replace. Let us suppose that we want to replace every definite article in our text with the string `"XX"`. In Python, we can use `expr.subn(repl, string)` to replace all occurrences of the expression in `string` with `repl`. The `subn` method returns a tuple containing the resulting string and the number of replacements made. We are just going to display the string.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile(r"\b[tT]he\b")
print(expr.subn("XX", text)[0])
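
As a small sketch of the return value, the tuple from `subn` can be unpacked into the new string and the replacement count. When the count is not needed, the related `sub` method returns only the string:

```python
import re

text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

expr = re.compile(r"\b[tT]he\b")

# subn returns (new_string, number_of_replacements)
result, count = expr.subn("XX", text)
print(count)  # 2

# sub returns just the new string
print(expr.sub("XX", text) == result)  # True
```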

#### Using Captured Groups in the Replacement String

Let us suppose that we want to do something a bit more complex: e.g. swap the first and the last word beginning with a "c". We can definitely already match words beginning with "c", but in order to swap them, we need to capture the first "c" word, the last "c" word and the text in between, and then insert them back in reverse order. Fortunately, in the replacement string, we can refer back to any captured group `n` using `\n`. So all we need to write is: `\b(c[a-zA-Z]+)\b` for the first "c" word, `(.*)` for the text in between and `\b(c[a-zA-Z]+)` for the last "c" word. Because `.*` is greedy, it consumes as much text as possible, so the third group ends up at the last "c" word rather than the second one. The replacement string will simply be `\3\2\1`.



In [None]:
text = ("The classic universal approximation theorem concerns " +
        "the capacity of feedforward neural networks.")

In [None]:
expr = re.compile(r"\b(c[a-zA-Z]+)\b(.*)\b(c[a-zA-Z]+)")
print(expr.subn(r"\3\2\1", text)[0])

### Non-recursive vs. Recursive Patterns

#### Example: Remove HTML Tags

As a further example, we will try to remove any HTML tags from a text. This should be relatively straightforward: we simply need to match anything in between `<` and `>`. However, since quantifiers such as `*` are greedy, they will try to consume as many characters as possible. This is why we need to be very careful when specifying the expression. Let's see what would happen if we specified our regular expression as `<.*>`.



In [None]:
text = """
text above
<div>
div 1 content
<span>inner span</span>
</div><div>
div 2 content
<span>inner span</span>
</div>
text below
"""

In [None]:
expr = re.compile(r"<.*>")
print(expr.subn("", text)[0])

As you can see, this regular expression did not remove just the tags: it also removed the contents of the inner `<span>` tags, something we did not intend. (By default, `.` does not match a newline, which is why each match stays within a single line.) A better expression is `<[^>]*>`, which does not allow a match to extend beyond the next closing angle bracket `>`.



In [None]:
expr = re.compile(r"<[^>]*>")
print(expr.subn("", text)[0])

#### Regular Expressions Cannot Express Recursive Patterns

Another idea we might have would be to remove HTML tags including their contents. However, this cannot be done using regular expressions alone: they are not expressive enough to properly track the opening and closing of tags, because they cannot handle recursive patterns.

To properly handle patterns of this kind, we need more expressive languages and parsers, often based on **context-free grammars**.
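
To give a rough idea of what tracking nesting involves, here is a minimal sketch that removes `<span>` elements together with their contents. It assumes Python's standard `html.parser.HTMLParser` and keeps a counter for the current nesting depth. The class name `SpanRemover` and the input string are hypothetical, and this is a tokenizer with a depth counter rather than a full grammar-based parser:

```python
from html.parser import HTMLParser


class SpanRemover(HTMLParser):
    """Hypothetical helper: keeps only text outside of any <span>."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of <span> tags currently open
        self.pieces = []  # text collected outside of <span> elements

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside any <span>
        if self.depth == 0:
            self.pieces.append(data)


parser = SpanRemover()
parser.feed("a <span>b <span>c</span> d</span> e")
parser.close()

kept = "".join(parser.pieces)
print(repr(kept))
```

The counter is what a plain regular expression lacks: it lets the parser tell the inner closing `</span>` apart from the outer one.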

---
### Task: Matching Phone Numbers

**Given the samples below, write a function `match_number(sample)` that will match phone numbers using regular expressions. If the `sample` string is not a valid phone number, return `None`. If it is, return a pair of integers representing the country code (if not present, `None` instead) and the phone number itself.**

Samples of numbers with formats from [[apache.org](https://stdcxx.apache.org/doc/stdlibug/26-1.html),[wikipedia.org](https://en.wikipedia.org/wiki/National_conventions_for_writing_telephone_numbers)]:

* 754-3010: US, Local
* (541) 754-3010: US, Domestic
* +1-541-754-3010: US, International
* 001-541-754-3010: US, International
* +49-89-636-48018: German, International
* +421 903 445 231: Slovak, International
* 0903 445 231: Slovak, Domestic mobile
* 41 510 4405: Slovak, Domestic landline
* 4422-5588: Iceland, Domestic
* 33 2187945: Netherlands, Domestic
* +31 33 2187945: Netherlands, International
* (33) 445-88-76: Poland, Domestic
* +65-XXXX-YYYY: Singapore, International
---
Notes:

* Use `fullmatch` instead of `match` or `search` to ensure that the full string and not just part of it matches the expression.
* Once you match the number, you will probably need to remove characters such as `'(', ')', '-'` before you can convert the string to an integer using `int`. To replace characters you can use e.g. `.replace` or `str.maketrans` and `.translate`.
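
The two notes above can be illustrated with a short sketch. It uses a deliberately simple digits-only pattern, not a solution to the task, and the variable names are only illustrative:

```python
import re

digits = re.compile(r"[0-9]+")

# match succeeds if a prefix of the string matches the expression
print(digits.match("123abc") is not None)  # True
# fullmatch requires the whole string to match
print(digits.fullmatch("123abc") is None)  # True

# delete the characters "(", ")", "-" and " " from a string
table = str.maketrans("", "", "()- ")
cleaned = "(541) 754-3010".translate(table)
print(int(cleaned))  # 5417543010
```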


In [None]:
expr = re.compile( # ---

def match_number(sample):
    
    
    # ---
    
    

#### Testing

Now we apply the function to some samples and check the results.



In [None]:
num_correct = 0

for sample, ret in phone_number_samples:
    try:
        retm = match_number(sample)
        
        if ret != retm:
            print("Incorrect response for sample '{}'.".format(sample))
            print("  - Expected: '{}'".format(ret))
            print("  - Got: '{}'".format(retm))
        else:
            num_correct += 1
    except Exception:
        print("Exception raised for sample '{}'.".format(sample))
        raise

print("{} correct out of {} samples".format(num_correct, len(phone_number_samples)))