# Regular Expressions - regex

This notebook explores the use of **_regular expressions_**, which are a powerful syntax for describing and working with character strings in _Python_.

Perhaps the simplest use for regular expressions is to check whether a string of characters contains a particular sub-string. For example, `'Hello World'` contains the sub-string `'llo'` but not `'Earth'`. 

However, regular expressions are much more powerful than this simple example suggests. They allow strings to be checked for one or more sub-strings matching a specific pattern. For example, you could define a single regular expression that would match a string containing anywhere between one and three capital letters followed by exactly 5 digits. Even this example only scratches the surface of the patterns that can be described using regular expressions. At first glance the regular expression syntax can seem impenetrably complex, but don't worry: we'll go through things step-by-step and introduce some of its useful features.

Regular expressions are a common feature of many different programming languages, of which _Python_ is only one. Learning the basics will almost certainly pay dividends even if you end up leaving _Python_ behind.

_Python_ implements regular expressions via the `re` module. There are 3 main functions that you need to know about: **`re.search()`**, **`re.match()`**, and **`re.compile()`**. Actually, **`re.match()`** and **`re.search()`** are only subtly different. The **`re.match()`** function is more conservative: it will only succeed if the regular expression pattern matches at the very beginning of the string, so it never finds a match on a later line that follows a `'\n'` character. For this reason we'll concentrate on **`re.search()`**, which will succeed if the regular expression pattern matches **anywhere** in a string, even if that string contains multiple lines.

Both **`re.search()`** and **`re.match()`** require a "_template_" search pattern to be specified using a string containing regular expression syntax. In our `'Hello World'` example above this template would simply be the string of characters we are searching for **`'llo'`**.

Let's get started by importing the `re` module.

In [None]:
import re  # Because 're' is so short we don't bother with an alias!

## 1. Using basic `re.search()`

To use **`re.search()`**, we pass two arguments:

1. a search pattern.
2. a string to search. 

If the function finds a match it will return a `re.Match` _object_. If no match is found, **`re.search()`** returns the special _Python_ `None` value. You can check whether a match object has been returned using a simple **`if`** statement.


In [None]:
search_string = r"llo"
full_string = "Hello World"

if re.search(search_string, full_string):
    print("Hooray! - got a llo")
else:
    print("Boo. No llo")

In [None]:
search_string = r"Earth"
full_string = "Hello World"

if re.search(search_string, full_string):
    print("Hooray! - got an Earth")
else:
    print("Boo. No Earth")

Did you notice how we prefixed both of the search pattern strings with an `r`? Doing this defines a "_raw_" string in _Python_. We'll explain why this is necessary later in the notebook.
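
As an aside, the difference between **`re.match()`** and **`re.search()`** described in the introduction can be seen in a short sketch. The sample text is the same `'Hello World'` string used above; nothing else is assumed.

```python
import re

text = "Hello World"

# re.search() scans the whole string for the pattern.
found_anywhere = re.search(r"World", text)
print(found_anywhere)      # a re.Match object

# re.match() only tries to match at the very start of the string.
found_at_start = re.match(r"World", text)
print(found_at_start)      # None

# The same function succeeds when the pattern does start the string.
print(re.match(r"Hello", text))
```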

### EXERCISE 1

Write a small program where you input a simple string and then search to see if the sentence "The quick brown fox jumps over the lazy dogs" contains it. Put it in a loop if you want to keep trying.


In [None]:
## Write your code here...

In [None]:
ss = input("Please enter a search term ")
long_string = "The quick brown fox jumps over the lazy dogs"
if re.search(ss, long_string):
    print("It is there!")
else:
    print("Sorry, no cigar")
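
One thing to be aware of with this solution: whatever the user types is treated as a regular expression pattern, not as plain text, so characters such as `.` (introduced in the next section) keep their special meaning. A minimal sketch of one way around this, assuming a literal search is what you want, is to pass the input through `re.escape()` first. The `user_text` value below is a made-up stand-in for typed input.

```python
import re

long_string = "The quick brown fox jumps over the lazy dogs"

# Hypothetical user input containing a character with special meaning.
user_text = "lazy."

# Used directly, the '.' matches any character, so "lazy " is found.
as_pattern = re.search(user_text, long_string)

# re.escape() makes every character literal, so no match is found.
as_literal = re.search(re.escape(user_text), long_string)

print(as_pattern)
print(as_literal)
```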

## 2. Matching generic patterns

Hopefully, everything so far has seemed very straightforward. Now let's use some of the features of regular expressions to search for character sequences that match a more generic _pattern_ rather than a fixed string of characters. This is where regular expressions become _slightly_ more complicated, but **much** more powerful!

Firstly, let's look at how to search for a **single** character that's within a specified "_range_". This is done using hyphen separated character pairs enclosed within square brackets. This is probably best illustrated using some concrete examples.

* For practical purposes, you can assume that ranges of letters are interpreted assuming alphabetic order, so **`r"[a-z]"`** will match all the lower case letters from `a` to `z`. 
* Similarly, **`r"[A-Z]"`** will match all the upper case letters from `A` to `Z`.
* Multiple ranges can be specified within the same square brackets, so **`r"[A-Za-z]"`** will match upper and lower case letters in the Roman alphabet. 
* **`r"[abdf]"`** will match any of the lower case letters `a`, `b`, `d`, or `f`. 
* Ranges of digits can also be represented. For example **`r"[3-8]"`** matches any single digit between 3 and 8, inclusive.

The next two code cells demonstrate the range syntax in action. The first cell demonstrates what happens when the search fails to find a match and **`re.search()`** returns `None`.

In [None]:
ss = r"[ahukl]"
result = re.search(ss, "qwerty")
print(result)

Now, let's try an example that matches successfully. It returns a `re.Match` object.

In [None]:
ss = r"[ahukl]"
result = re.search(ss, "asdfg")
print(result)
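
The hyphenated ranges from the list above behave in the same way. The following sketch uses made-up test strings to show a digit range and a combined letter range.

```python
import re

# [3-8] matches one digit in the range 3 to 8; "129" has no such digit.
print(re.search(r"[3-8]", "129"))      # None

# "1295" does contain one: the final 5.
print(re.search(r"[3-8]", "1295"))

# Several ranges can share one pair of brackets.
print(re.search(r"[A-Za-z]", "42 is the answer"))
```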

### 2.1 Special characters

The range syntax can be used to match any set of characters, but there are shortcuts for some common sets. For plain ASCII text, **`r"\d"`** is equivalent to `r"[0-9]"` and matches any digit, while **`\w`** matches any alphanumeric character, i.e. **`r"[0-9a-zA-Z]"`**, as well as underscores `_`. The full stop **`.`** matches any single character except the newline character, and **`\s`** will match any **single** whitespace character including space (` `), newline (`\n`), tab (`\t`) and carriage return (`\r`).
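
A quick sketch of these shortcuts in action, using a made-up sample string:

```python
import re

sample = "ID_7 ok"

print(re.search(r"\d", sample))   # first digit: '7'
print(re.search(r"\w", sample))   # first word character: 'I'
print(re.search(r"\s", sample))   # first whitespace character: ' '
print(re.search(r".", "\nx"))     # '.' skips the newline and matches 'x'
```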

#### 2.1.1 Raw strings revisited

Earlier in this notebook, we noted that pattern strings supplied to **`re.search()`** and **`re.match()`** should be written as _raw_ strings by prefixing them with `r`. Raw _Python_ strings prevent the _Python_ interpreter from interpreting, replacing or removing any `\` characters or other special characters in the pattern string that are actually part of the regular expression syntax. Depending on the context, it can be difficult to predict exactly how _Python_ will handle special characters that it finds within ordinary strings. For this reason, when defining regular expressions always use _raw_ strings. As a concrete example you should not use **`'\d\d'`** to define a pattern that would match two digits in succession. Instead, always use **`r'\d\d'`**.
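
To make the problem concrete, here is a small sketch using `\b`. In an ordinary _Python_ string `"\b"` is a single backspace character, whereas in a regular expression `\b` marks a word boundary (one of the positional characters covered later in this notebook). The test string is made up for illustration.

```python
import re

# In an ordinary string Python turns "\b" into one backspace character.
print(len("\b"), len(r"\b"))   # 1 2

text = "a cat sat"

# Here the pattern contains backspace characters, so nothing matches.
ordinary = re.search("\bcat\b", text)
print(ordinary)

# The raw string passes \b through to re, where it means a word boundary.
raw = re.search(r"\bcat\b", text)
print(raw)
```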


### 2.2 Matching multiple characters

Flexibly matching individual letters or numbers isn't particularly useful. In this section, we'll examine how to write regular expressions that match _sequences_ containing multiple characters. For example, you might want to match a string that contains a _specific number_ of digits in a cluster - strings like `'1224'`, `'892301'` or even `'1'`. To write a regular expression that matches a specific number of characters in a row, we start with an expression that would match a single character (like `'\d'`) and add a suffix to specify the character count. The suffix consists of the required number enclosed in brace (`{}`) characters. 

As a concrete example, to match 3 digits we would use **`'[0-9]{3}'`** (or **`'\d{3}'`**). These expressions will not match `3` or `95`, but they _will_ match `735`. 

> Be warned though! Both of these expressions will (when used with **`re.search()`**) match any string that includes a sequence of three digits as a substring. This means that `7351` will also match, because any 4 digit string also contains a 3 digit substring! Only the first 3 digits (`735`) form the match, and we'll see how to inspect the matched text later. 
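
The warning can be checked with a short sketch. It also uses the `^` and `$` positional characters, which are introduced in a later section, to insist on exactly three digits.

```python
import re

# re.search() finds the first three digits inside the longer run.
result = re.search(r"\d{3}", "7351")
print(result)   # the matched text is '735'

# Anchoring the pattern at both ends demands exactly three digits.
print(re.search(r"^\d{3}$", "7351"))   # None
print(re.search(r"^\d{3}$", "735"))
```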

The next two code cells show examples of the multiple character match syntax. The first example does not yield a match, while the second does. Note that we still use _raw_ strings to define our regular expressions. 

In [None]:
ss = r"\d{3}"
result = re.search(ss, "qw13hdf")
print(result)

Doesn't match but the following will:

In [None]:
ss = r"\d{3}"
result = re.search(ss, "qw129hdf")
print(result)

### 2.2 Matching at specific positions

The regular expression syntax allows you to specify the position in the target string to search for a match. To do this you can use _positional_ characters within your pattern definition. Some commonly used positional characters include:
* **`'^'`**, which matches the start of the string.
* **`'$'`**, which matches the end of the string.
* **`'\b'`**, which matches the beginning or end of a word.

The next two code cells show example uses of the `^` character to match a sequence of three digits at the beginning of the target string. The first example does not yield a match, while the second does.


In [None]:
ss = r"^\d{3}"
re.search(ss, "qw129hdf")


Doesn't match but the following will:


In [None]:
ss = r"^\d{3}"
re.search(ss, "129qwhdf")
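
The other two positional characters work in a similar way. The sketch below uses made-up test strings to show `$` and `\b`.

```python
import re

# '$' anchors the match to the end of the string.
print(re.search(r"\d{3}$", "qwhdf129"))   # matches '129'
print(re.search(r"\d{3}$", "129qwhdf"))   # None

# '\b' marks a word boundary, so 'cat' must stand alone as a word.
print(re.search(r"\bcat\b", "the cat sat"))   # matches 'cat'
print(re.search(r"\bcat\b", "concatenate"))   # None
```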

### 2.3 Indefinite repetition

The regular expression syntax also includes _wildcard_-like repetition characters. 

* **`+`** will match with **_one_ or more** characters that match the pattern immediately to its left.
* **`*`** checks for **_zero_ or more** characters that match the pattern immediately to its left.
* **`?`** checks for **exactly _zero_ or _one_** characters that match the pattern immediately to its left.
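
A brief sketch of the three repetition characters, using made-up patterns and test strings:

```python
import re

# '+' needs at least one digit.
print(re.search(r"ab\d+", "ab"))      # None
print(re.search(r"ab\d+", "ab123"))   # matches 'ab123'

# '*' is satisfied by zero digits.
print(re.search(r"ab\d*", "ab"))      # matches 'ab'

# '?' makes the preceding 'u' optional.
print(re.search(r"colou?r", "color"))    # matches 'color'
print(re.search(r"colou?r", "colour"))   # matches 'colour'
```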

We have now covered enough material to be able to start building up complex search patterns. What if we wanted to verify that someone has entered their National Insurance number in a valid format? Mine is of the general form 'AA nn nn nn A' (2 capitals, space, 2 digits, space, 2 digits, space, 2 digits, space, capital letter). Let's write a regular expression that matches this template.

Using the syntax we've learned so far, we could write the following expression.
```python
r"^[A-Z]{2} \d\d \d\d \d\d [A-Z]$"
```

With more advanced syntax, we could write a more compact expression, but this does the job!

> Note that we have used `^` and `$` to make sure that there are no characters before or after the pattern we're looking for.


In [None]:
NI = "AB 12 34 56 Z"  # Hope that isn't anyone's number :-)
ss = r"^[A-Z]{2} \d\d \d\d \d\d [A-Z]$"
result = re.search(ss, NI)
print(result)


That works! Note that printing the `re.Match` object informs us that the matching string was `'AB 12 34 56 Z'`.

How could we write the expression in a more compact form? Notice that the string `" \d\d"` is repeated 3 times. We can avoid this repetition by applying the suffix `{3}` to a subsection of the overall pattern. We isolate the required subsection using _parentheses_ i.e. `"( \d\d)"`.

Run the next code cell to see our new expression in action. 

In [None]:
NI = "AB 12 34 56 Z"
ss = r"^[A-Z]{2}( \d\d){3} [A-Z]$"
result = re.search(ss, NI)
print(result)
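
If the same pattern needs to be applied to many strings, the **`re.compile()`** function mentioned in the introduction can be used to build a reusable pattern object. The following is a minimal sketch of that idea; the name `ni_pattern` and the list of candidate strings are made up for illustration.

```python
import re

# Compile the pattern once, then reuse it for every candidate string.
ni_pattern = re.compile(r"^[A-Z]{2}( \d\d){3} [A-Z]$")

# Hypothetical inputs, made up for illustration.
candidates = ["AB 12 34 56 Z", "AB 12 34 56", "ab 12 34 56 z"]

results = []
for text in candidates:
    is_valid = ni_pattern.search(text) is not None
    results.append(is_valid)
    print(f"{text!r}: {is_valid}")
```

The compiled object has its own `search()` method, which takes only the string to search.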

### EXERCISE 2

See whether you can do something similar for a regular UK city telephone number which looks something like **`+44 nnn nnn nnnn`**. 

In [None]:
## Write your code here...

In [None]:
stg = "+44 121 123 4567"
ss = r"^\+44( \d{3}){2} \d{4}$"
result = re.search(ss, stg)
print(result)


Now, try to write a regular expression that matches a string that contains the following, in order:

1. A single capital letter e.g. `A`.
2. One or more lower case alphabetic characters e.g. `aaaaa`.
3. One or more digits e.g. `12345`.
4. Zero or more alphanumeric characters e.g. `abZ3s0d`. 

Matching examples might include **`Pdbajdg2316hpuns`** or **`Zabc123`**.


In [None]:
stg = "Aqwerty123z"
ss = r"^[A-Z][a-z]+\d+\w*$"
result = re.search(ss, stg)
print(result)

## 3. Groups

Before the exercise, we wrote regular expressions to verify that input strings were formatted correctly. Now, we'll show how to extract specific components from the matched part of the string. For example, what if you only needed the numeric parts of the matched string?

By now, it probably won't surprise you to find out that this can also be accomplished using regular expressions. In _Python_ we do this by adding some extra syntax to our regular expression and using the **`group()`** method of the `re.Match` object returned by the **`re.search()`** function.

We _identify_ the part of the search pattern that we want to _isolate_ using parentheses. We actually used this syntax earlier to repeat a sub-pattern of a regular expression. Let's write a similar expression to demonstrate how groups work in regular expressions. Can you unravel what the following pattern matches?

```python
stg = "Aqwerty123z"
ss = r"^[A-Z][a-z]+(\d+)(\w*)$"
result = re.search(ss, stg)
```

> Note we have 2 parenthesised search groupings (`(\d+)` and `(\w*)`) in the search pattern.

Now, if the target string matches the search pattern, we can use **`.group()`** to return the parts of the match that correspond with the parenthesised sub-expressions. 

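* **`result.group(0)`** will return the whole of the matched text. 
* **`result.group(1)`** and **`result.group(2)`** will return the text captured by the first and second parenthesised groups.
* **`result.groups()`** will return a tuple containing the text captured by each group, which you can iterate over.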

Try running the next code cell to see grouping in action.

In [None]:
stg = "Aqwerty123z"
ss = r"^[A-Z][a-z]+(\d+)(\w*)$"
m = re.search(ss, stg)

print("Iterating through the matches:")
for val in m.groups():
    print(val)
print(m.groups()[0])

print("\nLooking at what group() returns")
print(m.group(0))
print(m.group(1))
print(m.group(2))
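
Remember that **`re.search()`** returns `None` when there is no match, and `None` has no **`groups()`** method. A minimal sketch of a safer pattern is to check the result before reading any groups. The second test string is made up to show the failing case.

```python
import re

ss = r"^[A-Z][a-z]+(\d+)(\w*)$"

captured = []
for stg in ["Aqwerty123z", "no digits here"]:
    m = re.search(ss, stg)
    if m is None:
        captured.append(None)
        print(f"{stg!r}: no match, so there are no groups to read")
    else:
        # Unpack the two parenthesised groups into named variables.
        digits, rest = m.groups()
        captured.append((digits, rest))
        print(f"{stg!r}: digits={digits!r}, rest={rest!r}")
```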

### 3.1 An example from astronomy!

The ability to isolate numerical substrings in textual data can be really useful in the context of astronomy. Many data files produced by telescopes will have a header section containing embedded information about the observation - and we may need that information. 

For example, the OU radio telescope, ARROW, produces data files with a _header_ that contains the following line: 

    #startTimeUTC=2019-10-24T13:05:00.566Z
    
How can we extract the date and time automatically using regular expressions?

The code in the next cell demonstrates one approach. Can you improve on it?

In [None]:
stg = "#startTimeUTC=2019-10-24T13:05:00.566Z"
ss = r"^.*UTC=(.*)T(.*)Z$"
m = re.search(ss, stg)

for val in m.groups():
    print(val)
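
One possible improvement, offered as a sketch and not as the definitive answer, is to replace the permissive `.*` groups with patterns that spell out the expected shape of the date and time. This assumes every header line follows exactly the format of the example above. It also checks for `None` before reading the groups.

```python
import re

stg = "#startTimeUTC=2019-10-24T13:05:00.566Z"

# Spell out the expected shape: yyyy-mm-dd, 'T', hh:mm:ss.fraction, 'Z'.
ss = r"^#startTimeUTC=(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2}\.\d+)Z$"
m = re.search(ss, stg)

if m is not None:
    date, time = m.groups()
    print(date)   # 2019-10-24
    print(time)   # 13:05:00.566
```

A stricter pattern like this will reject malformed lines that the original `.*` version would silently accept.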