In [3]:
import re


In [4]:
x = re.search("cat", "A cat and a rat can't be friends.")
print x

<_sre.SRE_Match object at 0x0000000003EEAD30>


In [5]:
x = re.search("cow", "A cat and a rat can't be friends.")
print x

None


In [6]:
re.search("cat", "A cat and a rat can't be friends.")

<_sre.SRE_Match at 0x3eeae68>
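
The cells above show that `re.search` returns a match object when the pattern is found and `None` otherwise, so the result can be tested directly. The following is a minimal sketch of how that result might be used; the helper name `describe` is made up for illustration, and `print()` is called with a single string so that the snippet should behave the same under Python 2.7 and Python 3.

```python
import re

def describe(pattern, text):
    """Return a short report on whether `pattern` occurs in `text`."""
    match = re.search(pattern, text)
    if match is None:
        return "no match"
    # group() is the matched text, span() its (start, end) position in `text`
    return "found %r at %r" % (match.group(), match.span())

sentence = "A cat and a rat can't be friends."
print(describe("cat", sentence))   # found 'cat' at (2, 5)
print(describe("cow", sentence))   # no match
```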

# Any Character

In [13]:
pattern = r".at"

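The syntax of regular expressions supplies the metacharacter ".", which acts as a placeholder for "any character" (by default, any character except a newline). With blanks on both sides, the regular expression of our example can be written like this: 
r" .at " 

This RE matches three-letter words that end in "at" and are isolated by blanks. Now we get words like "rat", "cat", "bat", "eat", "sat" and many others. The cell below uses the pattern without the surrounding blanks, because it is applied to single words.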
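
To see what the surrounding blanks change, the two variants can be compared on a whole sentence. This is only a sketch: the sentence is invented for the demonstration, and `re.findall` is used here to collect all non-overlapping matches.

```python
import re

# Invented example sentence, not taken from the text above.
sentence = "A cat and a rat chatted at the flat."

# The blanks are part of the pattern, so they are part of each match too.
with_blanks = re.findall(r" .at ", sentence)
# Without the blanks, any character followed by "at" qualifies,
# including pieces of longer words and a blank followed by "at".
without_blanks = re.findall(r".at", sentence)

print(with_blanks)      # [' cat ', ' rat ']
print(without_blanks)   # ['cat', 'rat', 'hat', ' at', 'lat']
```

Because each match consumes its blanks, a word at the very start or end of the text, or one directly following another match, would not be found by the variant with blanks.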

In [29]:
pattern = r".at"
for word in ["rat", "cat", "bat", "eat", "sat"]:
    if re.search(pattern, word):
        print pattern, "matches", word
    else:
        print pattern, "doesn't match", word

.at matches rat
.at matches cat
.at matches bat
.at matches eat
.at matches sat


But what if the text contains "words" like "@at" or "3at"? These match as well, which means that we have created overmatching again.

In [27]:
for word in ["@at", "3at"]:
    if re.search(pattern, word):
        print pattern, "matches", word
    else:
        print pattern, "doesn't match", word

.at matches @at
.at matches 3at


# Character Classes

Square brackets, "[" and "]", are used to define a character class. [xyz] means, for example, either an "x", a "y" or a "z". 
Let's look at a more practical example:

In [17]:
pattern = r"M[ae][iy]er"

This regular expression matches a surname that is quite common in German: a name with one pronunciation and four different spellings, namely Maier, Mayer, Meier and Meyer.

In [30]:
pattern = r"M[ae][iy]er"
for word in ["Maier", "Mayer", "Meier", "Meyer"]:
    if re.search(pattern, word):
        print pattern, "matches", word
    else:
        print pattern, "doesn't match", word

M[ae][iy]er matches Maier
M[ae][iy]er matches Mayer
M[ae][iy]er matches Meier
M[ae][iy]er matches Meyer
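
The four spellings are exactly the combinations of one character from each class. As a small sketch of that idea, the candidates can be generated with `itertools.product` and checked against the pattern; the name "Mueller" is just an arbitrary counterexample chosen for this illustration.

```python
import re
from itertools import product

pattern = r"M[ae][iy]er"

# Build every string the two character classes can produce: 2 * 2 = 4 spellings.
spellings = ["M" + second + third + "er" for second, third in product("ae", "iy")]
print(spellings)

# Every generated spelling matches; a spelling outside the classes does not.
for name in spellings + ["Mueller"]:
    print("%s: %s" % (name, bool(re.search(pattern, name))))
```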


A finite state automaton to recognize this expression can be built like this: 

![](http://www.python-course.eu/images/finite_state_machine_mayer.png)

The graph of the finite state machine (FSM) is simplified to keep the design easy. Strictly speaking, the start node should have an arrow pointing back to itself: if a character other than an uppercase "M" is processed, the machine stays in the start state. Furthermore, every node except the final nodes (the green ones) should have an arrow pointing back to the start node, which is taken whenever the expected letter is not read. For example, if the machine is in state Ma, after having processed an "M" and an "a", it has to go back to state "Start" if any character other than "i" or "y" is read. If you have trouble with this FSM, don't worry: understanding it is not necessary for the things to come. 
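
For readers who prefer code to diagrams, the machine can be sketched by hand. This is an illustration under assumptions, not how the `re` module works internally: the function name `contains_meyer` and the numbered states are invented here, and the sketch adds one refinement to the simplified description, namely that an unexpected "M" moves to the state "M has been read" rather than all the way back to the start.

```python
import re

def contains_meyer(text):
    """Hand-written state machine for M[ae][iy]er (state numbers are illustrative)."""
    # 0: start, 1: "M" read, 2: "M[ae]", 3: "M[ae][iy]", 4: "M[ae][iy]e", 5: accepted
    expected = {1: "ae", 2: "iy", 3: "e", 4: "r"}
    state = 0
    for ch in text:
        if state in expected and ch in expected[state]:
            state += 1          # the expected letter was read: advance
            if state == 5:
                return True     # final state reached
        elif ch == "M":
            state = 1           # an "M" can always begin a new attempt
        else:
            state = 0           # anything else: back to the start state
    return False

for word in ["Meyer", "Herr Maier", "MMayer", "Mayor", "meyer"]:
    print("%s: %s" % (word, contains_meyer(word)))
```

For these inputs the function agrees with `bool(re.search(r"M[ae][iy]er", word))`.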

Instead of a choice between two characters, we often need a choice between larger character classes. We might need, for example, a class of letters between "a" and "e" or of digits between "0" and "5".
To manage such character classes, the syntax of regular expressions supplies the metacharacter "-". [a-e] is a shorthand for [abcde], and [0-5] denotes [012345]. 

The advantage is obvious, and even more impressive if we have to put expressions like "any uppercase letter" into regular expressions. So instead of [ABCDEFGHIJKLMNOPQRSTUVWXYZ] we can write [A-Z]. If this is not convincing, write out an expression for the character class "any lowercase or uppercase letter" by hand and compare it with [A-Za-z].
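
The range notation can be checked by listing which characters a class actually matches. The helper `chars_matched` below is a made-up name for this sketch; the character sets come from the standard `string` module.

```python
import re
import string

def chars_matched(char_class, candidates):
    """Return the characters from `candidates` that `char_class` matches."""
    return "".join(ch for ch in candidates if re.search(char_class, ch))

print(chars_matched("[a-e]", string.ascii_lowercase))   # abcde
print(chars_matched("[0-5]", string.digits))            # 012345

# [A-Za-z] picks exactly the ASCII letters out of all printable characters.
letters = chars_matched("[A-Za-z]", string.printable)
print(sorted(letters) == sorted(string.ascii_letters))  # True
```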

There is something more to say about the dash that we used to mark the beginning and the end of a range. The dash only has a special meaning if it is used within square brackets, and even then only if it isn't positioned directly after the opening bracket or immediately in front of the closing bracket. 
So the expression [-az] is only the choice between the three characters "-", "a" and "z", but no other characters. The same is true for [az-]. 
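
A quick way to see the effect of the dash position is to run the three variants against the same handful of characters. The candidate list and the dictionary name `matched_by` are chosen only for this sketch.

```python
import re

candidates = ["-", "a", "m", "z"]
matched_by = {}

for char_class in ["[-az]", "[az-]", "[a-z]"]:
    # Collect the candidates that the class matches.
    matched_by[char_class] = [ch for ch in candidates if re.search(char_class, ch)]
    print("%s matches %s" % (char_class, matched_by[char_class]))
```

With the dash at either edge of the class, "m" is not matched but "-" is; with the dash between "a" and "z", it is the other way round.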

**Exercise:**
What character class is described by [-a-z]? 

**Answer:**
The character "-" and all the characters "a", "b", "c" all the way up to "z". 

In [24]:
pattern = "[-a-z]"
for word in ["-", "a", "b", "c", "d", "w", "x", "y", "z", "1"]:
    if re.search(pattern, word):
        print pattern, "matches", word
    else:
        print pattern, "doesn't match", word

[-a-z] matches -
[-a-z] matches a
[-a-z] matches b
[-a-z] matches c
[-a-z] matches d
[-a-z] matches w
[-a-z] matches x
[-a-z] matches y
[-a-z] matches z
[-a-z] doesn't match 1


Besides the dash, the caret "^" also has a special meaning inside square brackets (character class choice). If it is used directly after the opening square bracket, it negates the choice. [^0-9] denotes the choice "any character but a digit". The position of the caret within the square brackets is crucial. If it is not the first character following the opening square bracket, it has no special meaning. 
[^abc] means anything but an "a", "b" or "c" 
[a^bc] means an "a", "b", "c" or a "^"

In [36]:
# negates the choice
pattern = "[^0-9]"
for word in ["a", "b", "c", "@", "!", "^", "0", "1", "9"]:
    if re.search(pattern, word):
        print pattern, "matches", word
    else:
        print pattern, "doesn't match", word

[^0-9] matches a
[^0-9] matches b
[^0-9] matches c
[^0-9] matches @
[^0-9] matches !
[^0-9] matches ^
[^0-9] doesn't match 0
[^0-9] doesn't match 1
[^0-9] doesn't match 9


In [38]:
# ^ directly after the opening square bracket negates the choice
pattern = "[^abc]"
for word in ["a", "b", "c", "^"]:
    if re.search(pattern, word):
        print pattern, "matches", word
    else:
        print pattern, "doesn't match", word
# ^ is just a normal character       
pattern = "[a^bc]"
for word in ["a", "b", "c", "^"]:
    if re.search(pattern, word):
        print pattern, "matches", word
    else:
        print pattern, "doesn't match", word

[^abc] doesn't match a
[^abc] doesn't match b
[^abc] doesn't match c
[^abc] matches ^
[a^bc] matches a
[a^bc] matches b
[a^bc] matches c
[a^bc] matches ^
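
The same two behaviours of the caret can be seen on a longer string. In this sketch the sample text is invented, and `re.findall` is used to collect every single-character match before joining them back together.

```python
import re

# Invented sample text containing digits and a literal caret.
text = "Room 101, floor 3^"

# [^0-9] keeps everything except digits; the "^" in `text` is an ordinary character.
non_digits = "".join(re.findall(r"[^0-9]", text))
# In [0-9^] the caret is not in first position, so it stands for a literal "^".
digits_or_caret = "".join(re.findall(r"[0-9^]", text))

print(non_digits)        # Room , floor ^
print(digits_or_caret)   # 1013^
```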
