# Regular expressions

Regular expressions (RegEx, plural RegEx) are a powerful means of searching text. You are probably used to normal search dialogs that find literal text matches. RegEx start from there and allow you to search far more flexibly.

In this Notebook, we'll explore the power of regexes.

The first step to using RegEx in Python is to import the "re" library.

## Exercise: `import`ant
Execute the snippet below to make _everything_ below it work.

In [None]:
import re

## Our first example

RegEx allow you to search for text and for _patterns_ of text.
Let's try an example:

In [None]:
match = re.search("hello", "hello, this text will be searched")

This uses the general `search` function of the `re` library. `search` looks for matches _anywhere_ in the text. (Its sibling `match` only looks for matches _directly_ at the start of the text; see the [appendix](#match-vs-search) for more details.)

The first argument of these functions is the RegEx: the pattern that is searched _for_. It is sometimes affectionately called the "needle".

The second argument is the text that is searched for the pattern, affectionately called the "haystack".
This can be any "normal" text, be it literal strings, or variables containing text.

Next, let's see what the search got us:

In [None]:
print(match)

As we see, we got a `Match object`. This signals that our `search` found _at least_ one match inside our haystack. (If it hadn't, we would have gotten `None`.)
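
To make the `None` case concrete, here is a minimal sketch with one needle that occurs in the haystack and one that doesn't. The names `sample_text`, `found` and `missing` are made up for this illustration.

```python
import re

sample_text = "hello, this text will be searched"

found = re.search("hello", sample_text)      # this needle occurs in the text
missing = re.search("goodbye", sample_text)  # this needle does not occur

print(found is not None)  # True
print(missing)            # None
```

Because `None` counts as "false" in Python, a plain `if found:` is enough to check whether a search succeeded.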

Let's see what we can do with this `Match object`: what methods does it have?

In [None]:
dir(match)

It is usually safe to ignore all methods whose names look like "`__name__`". These are Python's "special" methods (informally, its "internal" methods), and they are only relevant when you are tinkering _deeply_ under the hood of an object.
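
If the long `dir()` listing is hard to read, one possible way to hide the double-underscore names is a small list comprehension. This is only a sketch; `example_match` and `public_names` are names invented for this illustration.

```python
import re

example_match = re.search("hello", "hello, this text will be searched")

# keep only the names that do NOT start with a double underscore
public_names = [name for name in dir(example_match) if not name.startswith("__")]
print(public_names)
```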

The first group of methods, `start()`, `end()`, and `span()` tell us _where_ we found the result.
  * `start()` returns the position of the first character that matched
  * `end()` returns the position _just after_ the last character that matched
  * `span()` is simply the tuple `(start(), end())`

In [None]:
print(match.start())
print(match.end())
print(match.span())

The `string` attribute gives us the entire text that was searched:

In [None]:
print(match.string)

The next useful methods are `group()` and `groups()`, which show us the actual matching part(s):

In [None]:
print(match.group())
print(match.group(0))
print(match.groups())

Currently, only `group()` is useful to us. We will learn the uses of `groups()` and `group(<number>)` later, when we capture multiple different groups within one RegEx search.
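
These methods fit together neatly: slicing `string` with the positions from `span()` should give back exactly what `group()` returns. A minimal sketch, with variable names chosen just for this illustration:

```python
import re

example_match = re.search("text", "hello, this text will be searched")

begin, stop = example_match.span()         # (12, 16)
sliced = example_match.string[begin:stop]  # cut the match out of the haystack

print(sliced)                           # text
print(sliced == example_match.group())  # True
```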

More about Match objects in the [Python reference](https://docs.python.org/3/library/re.html#match-objects)

# Literal matches

From now on, we'll usually pre-compile our regexes, using `re.compile`. This allows us to re-use an expression multiple times, without having to repeat the (rather expensive) parsing of the pattern and construction of the searching logic every time we use the same regex. Also, it allows us to give meaningful names to our pattern such as "email", "purine-base" or "CRISPR_pattern".

(For throw-away, single-use regexes, the `re.search` form can still be useful for brevity.)

In [None]:
the_regex = re.compile("our Regex")

Now, let's see if we can find something, by inspecting the resulting `Match object`:

In [None]:
match = the_regex.search("this is a text to be searched")
if (match):
    print("the regex found results the first time!")
    print("the match was:")
    print(match)
else:
    print("the regex didn't find results the first time") 

In [None]:
match2 = the_regex.search("this will be searched by our Regex")
if (match2):
    print("the regex found results the second time!")
    print("the match was:")
    print(match2)
else:
    print("the regex didn't match the second time")

## Exercise: pretty printing

Time to try something yourself!

Use the methods you've learned above, and string formatting from yesterday, to pretty-print the results contained in `match2`. Try to get _exactly_ this output:

```
Matched "our Regex"
in "this will be searched by our Regex"
starts at: 25, ends at: 34
```

In [None]:
# Now you try!
print("YOUR CODE HERE")

## Exercise: Willkommen

_Slightly_ more challenging: find the start and end positions of "_in_" in "_Willkommen in Heidelberg_":

In [None]:
welcome = "Willkommen in Heidelberg"

# Your Code Here
# Good Luck!

# Multiple multiple multiple matches

While it is very useful to know if a text contains a match _at all_, sometimes you want to process _every_ match.
For these cases, the methods `findall()` and `finditer()` are used.

* `findall` returns a list of 'just' the matching texts. This can be an empty list (no match), or a list with one or more elements
* `finditer` returns an "iterable" of `Match objects` that can be used as we learned above.

In [None]:
haystack = "Hay is usually made by farmers. Hay is, confusingly, tolerated by many people with \"Hayfever\""
needle = "Hay"

hay_regex = re.compile(needle)
print("findall:", hay_regex.findall(haystack))

In [None]:
print("finditer:", hay_regex.finditer(haystack))

In [None]:
print("finditer, usefully:")
for match in hay_regex.finditer(haystack):
    print(match.group(), match.span())

## Exercise: and a pony!

Find all the occurrences of "and" in the text below.
There are more than you probably think!
Use `span()` to identify where in the text they occur.

In [None]:
clandestine_ands = """andreas and andy applied
random regex until
the dandy trainer and
the android assistant
aptly and accurately vandalised their
abandonded solution in tandem
to land the outstanding participants
in candyland.
"""

# YOUR CODE BELOW

# Alternation

You can express "this literal, or this other literal" using the "pipe", "|". Example:

In [None]:
for m in re.finditer("cats|dogs", "it's raining cats and dogs today"):
    print(m.group(), m.span())

## Exercise: easy as do-re-mi

Use alternation to find the three-character "easy" things from a well-known Jackson 5 song:
"ABC" and "123"

In [None]:
jackson5 = "ABC, easy as 123"

needle = "YOUR EASY REGEX HERE"

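# collect the matches first: a for/else would print the message even when there ARE matches
matches = list(re.finditer(needle, jackson5))
for m in matches:
    print(m.group(), m.span())
if not matches:
    print("no matches (yet?)")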

# Character classes

So far, we've only (tamely) searched for literal text matches. However, RegEx have far more power than that: you can define "patterns" to search for.

The simplest patterns are character classes such as "a digit" or "whitespace".

The most generic character class is `.`, which matches literally _any_ single character, except newlines.

Many useful character classes are represented by short escape sequences:
 * `\d` for any _single_ digit, 0-9
 * `\s` for any single whitespace character (space, tab (`\t`), newline (`\n`))
 * `\w` for any alphanumeric, or "word"-character (a-z, A-Z, 0-9, and underscore ("_"); in Python 3, accented and other non-ASCII letters count as well)

Character classes can be _inverted_ by CAPITALISING them:
 * `\S` anything _except_ whitespace
 * `\D` anything _except_ a digit
 * `\W` anything _except_ a word character
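
A small sketch of these escapes in action. The sample text and variable names are made up for this illustration, and the `r"..."` prefix is the raw string notation explained in the section on word boundaries.

```python
import re

sample = "Room 42, floor 3"

digits = re.findall(r"\d", sample)       # every single digit
non_word = re.findall(r"\W", sample)     # everything that is NOT a word character
dot_matches = re.findall(r".", "a\nb")   # the dot skips the newline

print(digits)       # ['4', '2', '3']
print(non_word)     # [' ', ',', ' ', ' ']
print(dot_matches)  # ['a', 'b']
```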

You can also make custom classes, by defining them in square brackets: `[]`
 * `[aeiou]` matches only the vowels
 * `[DKFZ]` matches ONE character, D or F or K or Z.
 
Inside these brackets, you can use ranges of characters for brevity:
 * `[a-f]` matches one of a,b,c,d,e,f (but NOT A,B,C,D,E,F)
 * `[A-F]` matches one of A,B,C,D,E,F (but not a,b,c,d,e,f)
 * `[a-fA-F]` matches one of a,b,c,d,e,f,A,B,C,D,E,F
 * `[0-6]`: 0,1,2,3,4,5,6 (note that 0 sorts before 1, not after 9! this surprises some people!)
 
If you want to include a literal "-" in your custom class, either escape it as `\-`, or put it directly after the opening bracket: `[-....]` (where it cannot indicate a 'from-to' range).
 
To invert a custom class, use a caret directly after the opening bracket: `[^....]` to create a "negative class"
 * `[^a-f]` matches _anything_ except a,b,c,d,e,f (including "funny" characters like newlines, tabs, exclamation marks, etc!)
 * etc.

This allows you to look for multiple closely related matches with only a single regex:

In [None]:
haystack = "hello!(1) we'll be saying hello2 a LOT in this hello-sentence(2): hello hello, count the hello6 with us! hello7"
needle = "hello\d"

hello_regex = re.compile(needle)
for match in hello_regex.finditer(haystack):
    print(match.group(), match.span())

Note how we _didn't_ match "hello!", nor the "hello-" in "hello-sentence"
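
Custom classes, ranges and negative classes work the same way as the `\d` above. Here is a minimal sketch on some made-up texts (all names are invented for this illustration; the `r"..."` prefix is the raw string notation explained in the section on word boundaries):

```python
import re

vowels = re.findall(r"[aeiou]", "regular expressions")
hex_letters = re.findall(r"[a-fA-F]", "DeadBeef 42")
not_a_to_f = re.findall(r"[^a-f]", "bad-cafe!")
signs = re.findall(r"[-+]", "3+4-5")  # the leading "-" is a literal dash

print(vowels)       # ['e', 'u', 'a', 'e', 'e', 'i', 'o']
print(hex_letters)  # ['D', 'e', 'a', 'd', 'B', 'e', 'e', 'f']
print(not_a_to_f)   # ['-', '!']
print(signs)        # ['+', '-']
```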

## Exercise: hello, hello
Adapt `needle` above to match:
 1. all "hello" NOT followed by a number (4 matches)
 1. all "hello" followed by whitespace (1 match)
 1. all "hello" followed by a number _less than 6_ (1 match!)
 1. all "hello" followed by punctuation (3 matches)
 1. all "hello" followed by ANY character (7 matches)

## Exercise: and.. context!

Use your newfound ability to match arbitrary characters to make our and-exercise from above easier.
Expand the regex to also match context around the "and". Having the characters before and after the "and" should help you identify their positions in the text.

* Also try to expand the context to _two_ characters surrounding the "and".
Did you lose any matches?

* Go back to the single-character case. Did you lose any matches there?

In [None]:
print(clandestine_ands)

# TASK: IMPROVE THIS
needle = "and"

regex = re.compile(needle)
for match in regex.finditer(clandestine_ands):
    print(match.group(), match.span())

## Word boundaries, a class of their own

A special character class is `\b`, meaning "word **b**oundary".

This class is special in that it doesn't match any characters itself, but rather a zero-length _boundary_ between two characters. To be precise, it matches any position where we switch from a `\w` character to a non-`\w` (`\W`) character, or from `\W` back to `\w`. The start and end of the text count as boundaries too when they touch a `\w` character.

This is very useful if you only want to look for complete words.

In [None]:
print(re.findall(r"and",     "random and miscellaneous are used interchangeably"))
print(re.findall(r"\band\b", "random and miscellaneous are used interchangeably"))

Astute readers will have noticed we used a new syntax for our regex, "raw strings" notation: `r"..."`.
This is to avoid conflicting interpretations between _Python's_ interpretation of `\b` (a non-printable backspace), and the _regex_ interpretation (word boundary).

In general, to avoid hard-to-debug surprises, use the raw string notation whenever your regex contains a backslash, be it an escape sequence such as `\b` or a literal `\\`.
Alternatively, you can use the normal string notation and double every backslash (`"\\b"`), to tell Python to give a `\` and a `b` to the regex engine. This gets unreadable and confusing quickly, so raw string notation is usually preferred.
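
To see what Python itself makes of the three spellings, before the regex engine gets involved at all, we can simply look at the lengths of the resulting strings. A small sketch, with names made up for this illustration:

```python
plain = "\b"      # Python turns this into ONE character: a backspace
raw = r"\b"       # raw string: a backslash and a 'b', two characters
doubled = "\\b"   # escaped backslash followed by 'b', also two characters

print(len(plain), len(raw), len(doubled))  # 1 2 2
print(raw == doubled)                      # True
```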

## Exercise: raw strings

Try the `\band\b` regex above with (`r"..."`) and without (`"..."`) the raw string notation.

# Repeating ourselves

Up until now, we've learned to match single characters, or several single characters one after another.
Often, we need to match a character or pattern _multiple_ times. Fortunately, regex can also express repetition.

## Kleene
The two simplest repetition operators are known as the "Kleene star" and the "Kleene plus", after a mathematician who researched computability, [Stephen Cole Kleene](https://en.wikipedia.org/wiki/Stephen_Cole_Kleene) (1909-1994). A third operator, the question mark, completes the set.

 * the plus, `+`, means "repeat the preceding pattern _one or more_ times".
 * the star, `*`, means "repeat the preceding pattern _zero or more_ times".
 * the question mark, `?`, means "repeat the preceding pattern _zero times or exactly once_". It's usually known as the "optional" operator.
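
Side by side, the three operators behave like this on a small made-up text (a sketch; the variable names are invented for this illustration):

```python
import re

sample = "a ab abb abbb"

one_or_more = re.findall(r"ab+", sample)   # at least one "b" required
zero_or_more = re.findall(r"ab*", sample)  # a lone "a" is fine too
optional = re.findall(r"ab?", sample)      # at most one "b" is taken

print(one_or_more)   # ['ab', 'abb', 'abbb']
print(zero_or_more)  # ['a', 'ab', 'abb', 'abbb']
print(optional)      # ['a', 'ab', 'ab', 'ab']
```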

## Kleene +

The plus `+` works especially well with character classes, as it allows us to match, for example, a _series of digits_, otherwise known as a "number".

In [None]:
re.search(r"\d+", "it's over 9000!").group()

In [None]:
re.findall(r"\d+", "be careful to format large numbers, such as 1.000.000, for easy readability")

## Exercise: ONE. MILLION. EUROS!

Replace the `\d+` with a regex that captures the entire million as ONE match.
(`['1.000.000']`)

We can also match sequences of characters, also known as "words"

In [None]:
# from "Words", by  William Charles Wentworth (1790 - 1872 / Australia)
#   https://www.poemhunter.com/poem/words-6/
poem = "Words are deeds. The words we hear / May revolutionize or rear / A mighty state. ..."
re.findall(r"\w+", poem)

## Exercise: Capital Idea

Find only the Capitalised words in `poem` above.

In [None]:
re.findall("YOUR REGEX HERE", poem)

Repetition also works perfectly fine for custom character classes:

In [None]:
# GATTACA: http://www.imdb.com/title/tt0119177/
re.findall("[ACGT]+", "I wonder if there's DNA in the movie GATTACA")

## Exercise: boundary cases

Note how we also matched the "A" in "DNA", even though that's only _part_ of a word.
Fix the regex above to NOT match the "A" of DNA.


## Kleene *

Be careful when matching _zero_ or more with the star:
There are a _lot_ of places where you can match exactly zero characters in a text:

In [None]:
re.findall("[ACGT]*", "I wonder if there's DNA in the movie GATTACA")

For this reason, the star is mostly useful when the repeated part follows a non-optional part of the pattern.
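
A minimal sketch of that idea, on a made-up text: the two literal `l`s are required, and only the `o` between them may repeat zero or more times. The names are invented for this illustration.

```python
import re

laughs = "lol looool ll lel"

# "l", then zero or more "o", then "l" again
found_laughs = re.findall(r"lo*l", laughs)
print(found_laughs)  # ['lol', 'looool', 'll']
```

Because the literal `l`s anchor the pattern, there are no empty matches, and "lel" is correctly left out.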

## Optional `?`

The question mark `?` comes in especially handy with plurals:

In [None]:
re.findall("files?", "1 file, many files")

Note how the `?` only applied to the single character "s", not the rest of the string "file".

# {Repeating ourselves}

"Zero or more" repeats are quite flexible, and sometimes too much so. When we have more information on how often we should have a repeat, we can define limits using curly brackets:
 * `{n,m}` to indicate "_between n and m times_ (inclusive)"
 * `{,m}` to indicate "_at most m_ times"
 * `{n,}` to indicate "_at least n_ times"
 * `{m}` to indicate "_exactly m times_".

In [None]:
jackson5 = "ABC, easy as 123"

print(re.findall("[CBA]{3}", jackson5))
print(re.findall(r"\d{3}", jackson5))

In [None]:
sentence = "shorter and longer words are variously occurring inside prosaic formulaic text in a pattern I cannot discern at a glance"

mini_words   = r"\b\w{1,2}\b"
medium_words = r"\b\w{3,6}\b"
long_words   = r"\b\w{7,}\b"
print(re.findall(mini_words, sentence))
print(re.findall(medium_words, sentence))
print(re.findall(long_words, sentence))

## Exercise: Cas9

Write a regex that matches the [Cas9](https://en.wikipedia.org/wiki/Cas9) genome editing search pattern of:
 * start: ACC
 * twenty or twenty-one random bases
 * end: a G

There are (at least) two straightforward solutions that are both wholly correct.

In [None]:
haystack = "TATAGACTACCTACGATCGATGTCAGTCAGTAGGATTT"

needle = "YOUR CODE HERE"

re.search(needle, haystack)

# Greed and repetition

Repetition is "greedy" by default: it matches as many repetitions as it can get away with.
In some rarer situations, you want a "non-greedy" match, with as few repetitions as possible.
The non-greedy operators simply have an extra question mark `?` appended to them:

In [None]:
re.search(r"\d+?", "it's over 9000!").group()

In [None]:
re.findall("[ACGT]+?", "I wonder if there's DNA in the movie GATTACA")

In [None]:
re.findall("files??", "1 file, many files.")

Non-greedy matches are useful if you want to match the shortest possible segment.

An example would be directory names in a path: `/your/very/long/path/with/a/file.txt`.
Matching `/.+/` with greedy repetition matches far more than you intend.
Using a non-greedy operator fixes this problem, although it is often also possible to rewrite your character class so that it cannot run past the delimiter.
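
The same effect shows up with quoted text. The sketch below uses a made-up sentence (all names are invented for this illustration) and compares the greedy match, the non-greedy match, and the character-class rewrite:

```python
import re

quoted = 'say "hello" and "goodbye"'

greedy = re.findall(r'".+"', quoted)        # runs on to the LAST quote
lazy = re.findall(r'".+?"', quoted)         # stops at the NEXT quote
no_quotes = re.findall(r'"[^"]+"', quoted)  # the class cannot cross a quote

print(greedy)     # ['"hello" and "goodbye"']
print(lazy)       # ['"hello"', '"goodbye"']
print(no_quotes)  # ['"hello"', '"goodbye"']
```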

## Exercise: folders

Match the individual folder names, but NOT the "file.txt".
The folder names should not include any slashes.

* try it once with non-greedy repetition
* try it once without using any `?`, but with custom character classes instead, `(^_^)` (kawaii-smiley!)

In [None]:
path = '/your/very/long/path/with/a/file.txt'

# YOUR CODE BELOW

# ^Anchors$

Matches so far have occurred anywhere in the string, although we could limit them somewhat with our `\b` word boundaries.
Two special boundaries that are often useful are the beginning and the end of the text. These have their own symbols, known as anchors:
 * `^` matches the beginning of the haystack, the zero-length boundary just before the first character
 * `$` matches the end of the haystack, the zero-length boundary just after the last character.
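
A minimal sketch on a made-up text in which the same word appears three times. Without anchors we find all three, while each anchor pins the match to one end. The names are invented for this illustration.

```python
import re

chant = "ab ab ab"

everywhere = re.findall(r"ab", chant)
at_start = re.search(r"^ab", chant).span()  # only the very first "ab"
at_end = re.search(r"ab$", chant).span()    # only the very last "ab"

print(everywhere)  # ['ab', 'ab', 'ab']
print(at_start)    # (0, 2)
print(at_end)      # (6, 8)
```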

## Exercise: Truth

Using anchors, match both "the whole Truth"s below, in separate regexes. (Funny how there's always more than one truth!)
 * the first should match from (0,15)
 * the second from (33, 48)


In [None]:
haystack = "the whole Truth, and nothing BUT the whole Truth"

print(re.search("YOUR CODE HERE", haystack).span()) # first
print(re.search("YOUR CODE HERE", haystack).span()) # last

# Grouping and Capturing

We've gotten quite far with literals, character classes, and literals followed or preceded by character classes.
Now we'll learn about grouping parts of a regex together. For this, we use round brackets: `()`.

Grouping something allows us, for example, to apply a repeat operator to a block of characters, instead of just a single character.

In [None]:
re.search("ba(na){2}", "banana!").group()

In [None]:
re.search("(na )+", "na na na na na na na na na na na, hey Jude!").group()

A big advantage of brackets is that they create a "capture group". The match stores an extra reference to whatever is matched by the brackets. We can use this to find out what our flexible character classes ended up matching.

Groups are numbered by their opening brackets, and this number can be used with `group(<number>)`, as we learned way back in the beginning. Group number zero is always the entire match.

In [None]:
# https://www.songsforteaching.com/folk/ilovethemountains.php
boom_de_yada = "I love the mountains. I love the rolling hills. I love the flowers. I love the daffodils. I love the fireside. When all the lights are low."

needle = r"I love the (\w+)"

for match in re.finditer(needle, boom_de_yada):
    print("entire match: %s" % match.group())
    print("group 0:      %s" % match.group(0))
    print("group 1:      %s" % match.group(1))
    print("---")
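
Numbering by opening bracket also holds for several groups and for groups nested inside other groups. A small sketch with a made-up date (the names are invented for this illustration), which also shows what `groups()` is good for:

```python
import re

# opening brackets, left to right: 1 = year-month, 2 = year, 3 = month, 4 = day
date_match = re.search(r"((\d{4})-(\d{2}))-(\d{2})", "released on 2017-03-14")

print(date_match.group(0))  # 2017-03-14
print(date_match.group(1))  # 2017-03
print(date_match.group(2))  # 2017
print(date_match.group(4))  # 14
print(date_match.groups())  # ('2017-03', '2017', '03', '14')
```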

## Exercise: rolling hills

Notice that we missed the "hills" part of "rolling hills". Adapt the regex so that the capture contents include the full "rolling hills".

Do not lose the other matches!

## Exercise: monitor resolutions

Using capture groups, extract the X and Y pixel counts of the monitor resolutions from the advertising snippet below.

In [None]:
marketing = "Supported resolutions: 800x600, 1024x768, 1280x1024, 1920x1080"

resolution_regex = r"YOUR CODE HERE"

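# collect the matches first: a for/else would print the message even when there ARE matches
matches = list(re.finditer(resolution_regex, marketing))
for m in matches:
    print("X:")
    # YOUR CODE HERE
    print("Y:")
    # YOUR CODE HERE
if not matches:
    print("no matches (yet?)")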

## Exercise: Cas9 again

In our Cas9 example [from earlier](#Exercise:-Cas9), extract the variable DNA part in a capture group and print it.

In [None]:
haystack = "TATAGACTACCTACGATCGATGTCAGTCAGTAGGATTT"

# YOUR CODE HERE

# GRAND FINALE: Code Golf exercise

"Code golf" is the programmer-"sport" of trying to solve a given problem using the least amount of characters.

In this golfing exercise, using everything that you have learned so far, write a SINGLE regex that matches the names of all the female Nobel prize winners for Physics, Chemistry and Medicine listed below, but NOT Alfred Nobel himself.

In [None]:
dont_match = "Alfred Bernhard Nobel"
do_match = [
    "Ada Yonath",
    "Barbara McClintock",
    "Carol Widney Greider",
    "Christiane Nüsslein-Volhard",
    "Dorothy Crowfoot Hodgkin",
    "Elizabeth Helen Blackburn",
    "Françoise Barré-Sinoussi",
    "Gertrude Belle Elion",
    "Gerty Theresa Cori",
    "Irène Joliot-Curie",
    "Linda Brown Buck",
    "Maria Goeppert Mayer",
    "Marie Curie",
    "Marie Curie", # Yup, she won TWICE!
    "May-Britt Moser",
    "Rita Levi-Montalcini",
    "Rosalyn Yalow",
    "Youyou Tu"
]

# GOOD LUCK!
needle = r"YOUR AWESOME REGEX HERE"


#################################################
## Validation and printing logic, no need to edit

print("your regex: %s\n\
length: %s" % (needle, len(needle)))

print("\n== shouldn't match ==")

regex = re.compile(needle)

# verify dont_match
if (regex.search(dont_match)):
    print("WRONG: ", dont_match)
else:
    print("RIGHT: ", dont_match)
    
print("\n== should match ==")

for awesome_lady in do_match:
    m = regex.search(awesome_lady)
    if (m):
        print("RIGHT: ", awesome_lady)
    else:
        print("WRONG: ", awesome_lady)
    

# Appendix

## `match` vs `search`

In [None]:
help(the_regex.match)
help(the_regex.search)
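
The difference is easiest to see side by side. In this minimal sketch (the regex and texts are made up for this illustration), `search` finds the needle in the middle of the text, while `match` only succeeds when the text _starts_ with it:

```python
import re

greeting = re.compile("hello")
text = "well, hello there"

searched = greeting.search(text)             # found, further along in the text
matched = greeting.match(text)               # None: the text starts with "well"
matched_at_start = greeting.match("hello there")

print(searched.span())          # (6, 11)
print(matched)                  # None
print(matched_at_start.span())  # (0, 5)
```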

## Further Reading

* Python `re` reference page: https://docs.python.org/3/library/re.html
* "Mastering Regular Expressions", by Jeffrey Friedl, http://shop.oreilly.com/product/9781565922570.do