Welcome to lesson 7 of the Noisebridge Python class! ([Noisebridge Wiki](https://www.noisebridge.net/wiki/PyClass) | [Github](https://github.com/audiodude/PythonClass))

In this lesson, we will talk about a few unrelated but useful topics for a programmer: Regular expressions (regex), date and time handling, and recursion.

You will learn:

1. What a regex is
1. Regex operators: `?`, `+` and `*`

Let's get started!

## Regular Expressions (regex)

Somewhere on the internet or in academia there is a formal definition of regex. It has something to do with "regular" languages, which are defined by certain "grammars", which is a Computer Science-y way of talking about text generation, text processing and computation.

We'll spare you the boring details.

In practice, the regexes implemented by programming languages like Perl, Python and JavaScript -- as well as command line tools like sed and grep -- are not always fully compliant with the academic definition. They don't have to be: they're perfectly useful anyway.

For our purposes a regular expression is a **pattern that allows for searching or matching a piece of text (a string)**.

Let's look at one of the simplest possible regexes (that's the plural) and see what it matches. First, we will import `re`, the Python [regular expression library](https://docs.python.org/3/library/re.html).

In [None]:
import re

simple_regex = 'a'

assert re.match(simple_regex, 'a')
assert re.match(simple_regex, 'abc')
assert not re.match(simple_regex, 'b')
# In these two, the 'a' doesn't appear at the beginning of the string
assert not re.match(simple_regex, 'cba')
assert not re.match(simple_regex, 'christmas')

In this example, the **pattern** is simply `a`. Any string that starts with that character is considered a **match**, no matter what follows it. Note that `re.match()` only looks at the beginning of the string.

This is useful, but not very interesting. We could do the same with the built-in string method `startswith()`.

In [None]:
pattern = 'a'

assert 'a'.startswith(pattern)
assert 'abc'.startswith(pattern)
assert not 'b'.startswith(pattern)
assert not 'cba'.startswith(pattern)
assert not 'christmas'.startswith(pattern)

One important property of regular expressions is that if *R* is a regular expression and *S* is a regular expression, then concatenating them also gives a regular expression, *RS*. Remember that concatenation roughly means "combining" for strings and is done in Python with the `+` operator.

In [None]:
happy = 'happy'
birthday = 'birthday'
greeting = happy + ' ' + birthday

print(greeting)
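
Since a Python regex is written as a string, the same `+` operator can join two patterns into a longer one. The following is a minimal sketch of that idea; the names `happy_re`, `birthday_re` and `greeting_re` are made up for illustration.

```python
import re

# Two small regexes, each valid on its own.
happy_re = 'happy'
birthday_re = ' birthday'

# Concatenating the pattern strings gives a new regex that matches
# the first pattern followed immediately by the second.
greeting_re = happy_re + birthday_re

assert re.match(happy_re, 'happy birthday')
assert re.match(greeting_re, 'happy birthday')
assert not re.match(greeting_re, 'happy holidays')
```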

### Operators: ?

Things get much more interesting, however, when we start using **operators**. A simple operator is `?` (question mark). It lets us specify that a character (or really, an entire regular expression -- we'll get to that) is optional: it has to occur either 0 times or 1 time. This is easier to demonstrate than to explain.

In [None]:
import re

exists = 'ab?'

assert re.match(exists, 'a')
assert re.match(exists, 'abc')
assert re.match(exists, 'acd')
assert re.match(exists, 'abbacus')
assert not re.match(exists, 'b')
assert not re.match(exists, 'cba')
assert not re.match(exists, 'fourth of july')
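
To see which part of each string the pattern actually consumed, we can peek at the object that `re.match()` returns. This is only a small sketch: match objects are not properly introduced in this lesson, and `group(0)` is used here simply to get the matched text.

```python
import re

exists = 'ab?'

# group(0) is the text that the whole pattern matched.
# The optional 'b' is included whenever it is present right after the 'a'.
assert re.match(exists, 'a').group(0) == 'a'
assert re.match(exists, 'abc').group(0) == 'ab'
assert re.match(exists, 'acd').group(0) == 'a'

# Only one 'b' is taken, because ? means "0 or 1 time".
assert re.match(exists, 'abbacus').group(0) == 'ab'
```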

We can use the question mark as many times as we like, and we can also use parentheses to **group** characters into what are known as **sub-expressions**.

In [None]:
import re

exists_2 = '(black)?cat'

assert re.match(exists_2, 'cat')
assert re.match(exists_2, 'catch')  # Remember, we only have to match the beginning of the string.
assert re.match(exists_2, 'blackcat')
assert not re.match(exists_2, 'orangecat')
assert not re.match(exists_2, 'orangecat and blackcat') # This is kind of annoying

The question mark applies to the entire sub-expression `black` (which is, itself, a regular expression, remember). The last line gives us pause though. It would probably be useful to be able to find `blackcat` or `cat` anywhere in the string. For this we can use a complementary function called `search()`.

In [None]:
assert re.search(exists_2, 'orangecat and blackcat')
assert re.search(exists_2, 'yellowmonkey and a random cat-like creature')
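
One detail worth knowing: `search()` stops at the first place in the string where the pattern fits. The sketch below uses `group(0)` and `start()` on the returned match object (not covered in this lesson) to show where that is.

```python
import re

pattern = '(black)?cat'
text = 'orangecat and blackcat'

# match() only tries the very beginning of the string, so it finds nothing.
assert re.match(pattern, text) is None

# search() scans from left to right and returns the first place the pattern fits.
found = re.search(pattern, text)

# That first place is the 'cat' inside 'orangecat', not the later 'blackcat'.
assert found.group(0) == 'cat'
assert found.start() == 6
```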

Before we continue introducing new concepts, let's try something just a little crazy. Can you provide 3 different examples of strings that match each of the following regexes?

In [None]:
import re

not_crazy = 'hall?owe?en' # (possible teacher's note: allow for 2nd grade spelling mistakes)

assert re.match(not_crazy, 'your')
assert re.match(not_crazy, 'strings')
assert re.match(not_crazy, 'go here')

# Bonus: are there any other ways to write the 'not_crazy' regex?

crazy = '(x(ab)?)(y(ab?)(z(a?b)))'

assert re.match(crazy, 'your strings')
assert re.match(crazy, 'go')
assert re.match(crazy, 'here')

### Operators: + and *

Two more similar operators are `+` and `*`.

The `+` operator matches 1 or more occurrences (so there has to be at least one, but there can be as many as you like).

In [None]:
import re

plus = 'al+ my love'

assert re.match(plus, 'all my love')
assert re.match(plus, 'allllllll my love')
assert re.match(plus, 'al my love')

# Note that if we introduce a character before the character that
# is operated on by the '+', that character is still required.
# The two `l` characters in this example are not "combined" in any way.

plus_2 = 'all+ my love'

# This no longer matches
assert not re.match(plus_2, 'al my love')

# Here's a more obvious example

plus_3 = 'axe+l my love'

assert re.match(plus_3, 'axel my love')
assert re.match(plus_3, 'axeeeeeel my love')
assert not re.match(plus_3, 'axl my love')

The `*` is kind of an "any amount" operator because it matches 0 or more occurrences (so it can be there any number of times, including none at all!).

In [None]:
import re

star = 'A*B*C* XXX A*B*C*'

assert re.match(star, 'ABC XXX ABC')
assert re.match(star, 'B XXX BBBBBBBBBC')
assert re.match(star, 'AC XXX AAAABBBBCCCC')
assert re.match(star, ' XXX ')

# Important point: This is any number of A's, followed
# by any number of B's followed by any number of C's.
# It does NOT mean any number of any combination of ABC:

assert not re.match(star, 'ABCABC XXX ABC')
assert not re.match(star, 'ABACBCAB XXX ABC')
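
If we want a whole sequence to repeat, we can combine `*` with the grouping parentheses from earlier. The sketch below uses a made-up pattern, `grouped_star`, that repeats the exact sequence `ABC`. Note that this still isn't "any combination" of the three letters, only whole `ABC` runs.

```python
import re

star = 'A*B*C* XXX A*B*C*'
grouped_star = '(ABC)* XXX (ABC)*'

# The un-grouped pattern cannot match a repeated 'ABC' sequence...
assert not re.match(star, 'ABCABC XXX ABC')

# ...but applying * to the group (ABC) can.
assert re.match(grouped_star, 'ABCABC XXX ABC')

# Zero repetitions are still allowed.
assert re.match(grouped_star, ' XXX ')

# A partial 'AB' is not a full repetition of the group.
assert not re.match(grouped_star, 'AB XXX ABC')
```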

Let's try writing some more regex matches. Can you write 5 strings that match this regex? Note that merely repeating one character more and more times (in the spirit of ax, axx, axxx, axxxx, axxxxx) doesn't really count.

In [None]:
import re

exercise_2 = 'te+n spi(ri)*t'

Now let's try writing our own patterns. Can you write the most succinct regex that matches against all of these strings?

In [None]:
import re

first_regex = 'your regex goes here'

assert re.match(first_regex, 'ac unit')
assert re.match(first_regex, 'ace unlit')
assert re.match(first_regex, 'race unlittttttttt')

# But doesn't match

assert not re.match(first_regex, 'race unllliiilllll')

Oh no!

Unfortunately **this lesson isn't finished**!

Other things that were covered in the live lesson:

* `|` operator
* `.` for matching anything
* Character groupings/ranges/classes: `[]`
    * `a-zA-Z0-9`, `\w`, `\W`, `\s`, `\S`, `\d`, `\D`
* `Match` objects and capturing sub-expressions
* Substitutions with regular expressions
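
Until the rest of the lesson is written up, here is a rough sketch of what those topics look like in Python. It is not the material from the live lesson; the patterns and example strings are made up for illustration.

```python
import re

# `|` means "either the pattern on the left or the one on the right".
assert re.match('cat|dog', 'dog food')

# `.` matches any single character (except a newline).
assert re.match('c.t', 'cut')

# Square brackets match one character from a set or range.
assert re.match('[a-c][0-9]', 'b7')

# \d is a digit, \w is a "word" character and \s is whitespace.
# A raw string (r'...') keeps the backslashes intact.
date_match = re.match(r'(\d+)\s(\w+)', '31 October')

# Parentheses capture sub-expressions; group(n) retrieves the nth one.
assert date_match.group(1) == '31'
assert date_match.group(2) == 'October'

# re.sub() replaces every match of the pattern with the given text.
fixed = re.sub('hall?owe?en', 'Halloween', 'happy haloween')
assert fixed == 'happy Halloween'
```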