# ISJ – regular expressions

## Motivation
> Scripting without regular expressions is like sex without partner.  
  You can do it, but you are definitely missing something important.

## Motivation
> Some people, when confronted with a problem, think  
  “I know, I'll use regular expressions.”  
  Now they have two problems.  
  
  Author: Jamie Zawinski (an early Netscape engineer, a high priest of the church of XEmacs) 



## Motivation
>Some people, when confronted with a problem, think  
 “I know, I won't use regular expressions.”  
 Now they have two problems. 


  
  Author: Rex, 21 October 2015


## Motivation
> Obviously, something there is missing…  
What two problems are we talking about?

[Regex Humor](http://www.rexegg.com/regex-humor.html)

Some attempts to complete the quote follow...   

## Motivation
>...Now you have two problems:  
1. figuring out what to do with the many hours of tedious coding you just saved, and
2. having to deal with the trolls who give you an earache parroting some lame quote about having two problems. 


  
  Author: Rex, 7 May 2014


# Regex: what is it?

* A **regular expression** (abbreviated <span style="color:red">regex</span>) is a sequence of characters that forms a search pattern,
* mainly for use in pattern matching with strings, 
* i.e. "<span style="color:blue">find and replace</span>"-like operations.

## Widely used, e.g. on the command line
* search and replace: sed
* match and find: grep
* split/tokenize: no specific command


## Main tasks
* check if a string/file matches a **pattern**
* find a/all matches in a string/file of a **pattern**
* search and replace
* split strings on a pattern: _tokenization_

## RegEx syntax
* Except for special characters (metacharacters), every character matches itself

* Regexes are case-sensitive by default, but it is usually easy to switch to case-insensitive matching

* Old implementations deal with one-byte characters only; newer ones cope with Unicode
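
The three points above can be tried out with Python's standard `re` module. This is only a small sketch; the sample strings are made up for illustration.

```python
import re

# Ordinary characters match themselves.
literal = re.search('cat', 'concatenate')

# Matching is case-sensitive by default ...
strict = re.search('cat', 'CAT scan')
# ... and case-insensitive once the re.IGNORECASE flag is set.
relaxed = re.search('cat', 'CAT scan', re.IGNORECASE)

# In Python 3, str patterns work on Unicode text, not on single bytes.
unicode_hit = re.search('к.ш', 'кошка')

print(literal.group(), strict, relaxed.group(), unicode_hit.group())
```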

## Metacharacters

Char | Meaning 
:---: |---
\  | Quote the next metacharacter
^  | Match the beginning of the line
.  | Match any character (except newline)
$  | Match the end of the string (or before newline at the end of the string)
\|  | Alternation
() | Grouping
[] | Bracketed Character class

## Examples
    \.\.\.
    
    [0-9]

    [^aeiou]

    \[[a-z]\]

## Examples
    
    [.^az-]

    [][]

    [[]]
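
The example patterns from the two slides above can be checked in Python as a quick sketch. The subject strings are invented for illustration. The last slide example, `[[]]`, is left out of the code on purpose: it reads as the class `[[]` (a literal `[`) followed by a literal `]`, and recent Python versions may emit a warning about a possible nested set.

```python
import re

examples = [
    (r'\.\.\.',    'wait... what'),  # three literal dots
    (r'[0-9]',     'room 42'),       # any single digit
    (r'[^aeiou]',  'audio'),         # any character that is not a lowercase vowel
    (r'\[[a-z]\]', 'arr[i] = 0'),    # a lowercase letter inside literal brackets
    (r'[.^az-]',   'b^c'),           # one of: . ^ a z - (all literal inside a class here)
    (r'[][]',      'x[1]'),          # one of: ] [ (a leading ] is literal)
]

# First match of each pattern in its subject string.
first_matches = [re.search(p, s).group() for p, s in examples]
print(first_matches)   # ['...', '4', 'd', '[i]', '^', '[']
```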

## Quantifiers

Q | Meaning 
:---: |---
        *          | Match 0 or more times
        +          | Match 1 or more times
        ?          | Match 1 or 0 times
        {n}        | Match exactly n times
        {n,}       | Match at least n times
        {n,m}      | Match at least n but not more than m times
        

## Be aware of

Which lines match the regular expression?

Which part of the string corresponds to the regular expression?

<span style="color:red">Watch Out for The Greediness!</span>
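
A minimal Python sketch of what greediness means in practice. The sample text is made up; the point is only that `.*` runs as far as it can.

```python
import re

text = 'say "yes" or "no"'

# Greedy: .* grabs as much as possible, so the match ends at the LAST quote.
greedy = re.search(r'".*"', text).group()

# A negated class cannot run across a quote, so the match ends at the next one.
bounded = re.search(r'"[^"]*"', text).group()

print(greedy)    # "yes" or "no"
print(bounded)   # "yes"
```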

## RegEx matching process
1. start at the beginning of the given string/text
2. for each subpart of the given RE, try to find the longest sequence of characters in the string
3. continue with other subparts of the RE
4. if the match cannot be completed at some later point, shorten the current sequence (the potential subpart match) by 1 and try again
5. if all parts have been shortened to their minima and still no match has been found, move one character forward in the string and repeat.
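
The steps above can be sketched as a toy matcher. This is an illustration under strong assumptions, not how `re` is implemented: the hypothetical helpers `match_here` and `search` understand only literal characters, `.` and `*`.

```python
def match_here(pattern, text):
    """Return the length of a match of pattern at the start of text, or None."""
    if not pattern:
        return 0
    if len(pattern) >= 2 and pattern[1] == '*':
        # Step 2: take the longest possible run of the starred item.
        n = 0
        while n < len(text) and pattern[0] in ('.', text[n]):
            n += 1
        # Steps 3-4: try the rest of the pattern; on failure give back one character.
        while n >= 0:
            rest = match_here(pattern[2:], text[n:])
            if rest is not None:
                return n + rest
            n -= 1
        return None
    if text and pattern[0] in ('.', text[0]):
        rest = match_here(pattern[1:], text[1:])
        return None if rest is None else 1 + rest
    return None


def search(pattern, text):
    """Steps 1 and 5: try every start position from left to right."""
    for start in range(len(text) + 1):
        length = match_here(pattern, text[start:])
        if length is not None:
            return (start, start + length)
    return None


print(search('r.*y', 'brambory'))   # (1, 8)
```

For `r.*y` on `brambory`, `.*` first swallows `ambory`, then gives back one character so that the final `y` can match.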

## RegEx matching process

![r.*](brambory.jpg)

## RegEx flavours

* PCRE – Perl Compatible Regular Expressions
* [Comparison of regular expression engines](https://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines)


## Tools
`grep`:

A utility for pattern matching, arguably one of the most useful Unix utilities. It is typically called like this:

`grep [options] pattern [files]`

With no options specified, grep looks for the pattern in the given files and prints to the console only those lines that contain a match.

## Grep options

Short | Long | Meaning
--- | --- | ---
  -E | --extended-regexp |    PATTERN is an extended regular expression (ERE)
  -F | --fixed-strings   |    PATTERN is a set of newline-separated strings
  -G | --basic-regexp    |    PATTERN is a basic regular expression (BRE)
  -P | --perl-regexp     |    PATTERN is a Perl regular expression
  -i | --ignore-case      |   ignore case distinctions
  -v | --invert-match    |    select non-matching lines
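
To keep all examples in one language, here is a rough Python imitation of `grep` with the `-i` and `-v` options. The function name `grep` and the sample lines are hypothetical, and Python's regex syntax is not identical to any of the grep flavours in the table.

```python
import re

def grep(pattern, lines, ignore_case=False, invert_match=False):
    """Return the lines containing a match; mimic -i and -v on request."""
    flags = re.IGNORECASE if ignore_case else 0
    regex = re.compile(pattern, flags)
    # A line is kept when "it matches" differs from "we want non-matching lines".
    return [line for line in lines
            if bool(regex.search(line)) != invert_match]

log = ['ERROR disk full', 'info: all good', 'Error: retrying', 'warning: slow']

print(grep('error', log))                     # []
print(grep('error', log, ignore_case=True))   # ['ERROR disk full', 'Error: retrying']
print(grep('error', log, ignore_case=True, invert_match=True))
```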
 

## Tools
`sed`:

* A non-interactive stream editor
* RE used for:
  - find lines to which a command should be applied
  - substitution: sed 's/old/new/g' < inp > out
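
The two uses of REs in `sed` can be approximated in Python. This is a sketch only: `sed_substitute` and its `address` parameter are hypothetical names, and the replacement syntax of `sed` and of `re.sub` differs in details.

```python
import re

def sed_substitute(pattern, replacement, lines, address=None):
    """Mimic sed 's/pattern/replacement/g', optionally only on lines
    that match `address` (like sed '/address/s/pattern/replacement/g')."""
    out = []
    for line in lines:
        if address is None or re.search(address, line):
            line = re.sub(pattern, replacement, line)
        out.append(line)
    return out

lines = ['old hat, old shoes', 'keep: old coat', 'nothing here']

print(sed_substitute('old', 'new', lines))
print(sed_substitute('old', 'new', lines, address='^keep'))
```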
  
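Beware of nested quantifiers: in a backtracking engine, a test such as JavaScript's `'123456789012345678901234567890z'.match(/^(\d+)*$/)` can take a very long time, because the match fails only after the engine has tried a huge number of ways to split the digits between `\d+` and `*`.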


## Tools
`python`:

Before you start using REs, try doing without them:
https://docs.python.org/3/library/string.html#string-functions

In [None]:
uk = "colour, flavour, behaviour, harbour, honour, humour, labour, neighbour, rumour, splendour"
%timeit -n 100000 uk.replace('our', 'or')
import re
%timeit -n 100000 re.sub('our', 'or', uk)
pat = re.compile('our')
%timeit -n 100000 pat.sub('or', uk)

In [None]:
import re
# import regex # if you like good times
# intended to replace `re`, the regex module has many advanced
# features for regex lovers. http://pypi.python.org/pypi/regex
pattern = r'(\w+):(\w+):(\d+)'
subject = 'apple:green:3 banana:yellow:5'
regex = re.compile(pattern)

In [None]:
######## The six main tasks we're likely to have ########

# Task 1: Is there a match?
print("*** Is there a Match? ***")
if regex.search(subject):
	print ("Yes")
else:
	print ("No")

In [None]:
# Task 2: How many matches are there?
print("\n" + "*** Number of Matches ***")
matches = regex.findall(subject)
print(len(matches))

In [None]:
# Task 3: What is the first match?
print("\n" + "*** First Match ***")
match = regex.search(subject)
if match:
	print("Overall match: ", match.group(0))
	print("Group 1 : ", match.group(1))
	print("Group 2 : ", match.group(2))
	print("Group 3 : ", match.group(3))
	

In [None]:
# Task 4: What are all the matches?
print("\n" + "*** All Matches ***\n")
print("------ Method 1: finditer ------\n")
for match in regex.finditer(subject):
	print ("--- Start of Match ---")
	print("Overall match: ", match.group(0))
	print("Group 1 : ", match.group(1))
	print("Group 2 : ", match.group(2))
	print("Group 3 : ", match.group(3))
	print ("--- End of Match---\n")		

print("\n------ Method 2: findall ------\n")
# if there are capture groups, findall doesn't return the overall match
# therefore, in that case, wrap the pattern in capturing parentheses
# the overall match becomes group 1, so other group numbers are bumped up!
wrappedpattern = "(" + pattern + ")"
wrappedregex = re.compile(wrappedpattern)
matches = wrappedregex.findall(subject)
if len(matches)>0:
	for match in matches:
	    print ("--- Start of Match ---")
	    print ("Overall Match: ",match[0])
	    print ("Group 1: ",match[1])
	    print ("Group 2: ",match[2])
	    print ("Group 3: ",match[3])
	    print ("--- End of Match---\n")		

In [None]:
# Task 5: Replace the matches
# simple replacement: reverse group
print("\n" + "*** Replacements ***")
print("Let's reverse the groups")
def reversegroups(m):
	return m.group(3) + ":" + m.group(2) + ":" + m.group(1)
replaced = regex.sub(reversegroups, subject)
print(replaced)

In [None]:
# Task 6: Split
print("\n" + "*** Splits ***")
# Let's split at colons or spaces
splits = re.split(r":|\s",subject)
for split in splits:
	    print (split)

In [None]:
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
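
A short usage sketch for the verbose pattern above. It is recompiled here under the hypothetical name `number` so that the block is self-contained, and compared with its compact equivalent.

```python
import re

number = re.compile(r"""\d +  # the integral part
                        \.    # the decimal point
                        \d *  # some fractional digits""", re.X)
compact = re.compile(r"\d+\.\d*")   # same pattern without whitespace and comments

samples = ['3.14', '42.', '.5', 'abc']
results = [(s, bool(number.fullmatch(s)), bool(compact.fullmatch(s)))
           for s in samples]
print(results)
# [('3.14', True, True), ('42.', True, True), ('.5', False, False), ('abc', False, False)]
```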

In [209]:
%%html
<h2>Characters</h2>
<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Character</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="brown"><td><span class="mono">\d</span></td><td>Most engines: one digit<br />from 0 to 9</td><td>file_\d\d</td><td>file_25</td></tr><tr class="beige"><td><span class="mono">\d</span></td><td>.NET, Python 3: one Unicode digit in any script</td><td>file_\d\d</td><td>file_9੩</td></tr><tr class="brown"><td><span class="mono">\w</span></td><td>Most engines: "word character": ASCII letter, digit or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr><tr class="beige"><td><span class="mono">\w</span></td><td>Python 3: "word character": Unicode letter, ideogram, digit, or underscore</td><td>\w-\w\w\w</td><td>字-ま_۳</td></tr><tr class="brown"><td><span class="mono">\w</span></td><td>.NET: "word character": Unicode letter, ideogram, digit, or connector</td><td>\w-\w\w\w</td><td>字-ま‿۳</td></tr><tr class="beige"><td><span class="mono">\s</span></td><td>Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab</td><td>a\sb\sc</td><td>a b<br />c</td></tr><tr class="brown"><td><span class="mono">\s</span></td><td>.NET, Python 3, JavaScript: "whitespace character": any Unicode separator</td><td>a\sb\sc</td><td>a b<br />c</td></tr><tr class="beige"><td><span class="mono">\D</span></td><td>One character that is not a <i>digit</i> as defined by your engine's <i>\d</i></td><td>\D\D\D</td><td>ABC</td></tr><tr class="brown"><td><span class="mono">\W</span></td><td>One character that is not a <i>word character</i> as defined by your engine's <i>\w</i></td><td>\W\W\W\W\W</td><td>*-+=)</td></tr><tr class="beige"><td><span class="mono">\S</span></td><td>One character that is not a <i>whitespace character</i> as defined by your engine's <i>\s</i></td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

Character,Legend,Example,Sample Match
\d,Most engines: one digit from 0 to 9,file_\d\d,file_25
\d,".NET, Python 3: one Unicode digit in any script",file_\d\d,file_9੩
\w,"Most engines: ""word character"": ASCII letter, digit or underscore",\w-\w\w\w,A-b_1
\w,"Python 3: ""word character"": Unicode letter, ideogram, digit, or underscore",\w-\w\w\w,字-ま_۳
\w,".NET: ""word character"": Unicode letter, ideogram, digit, or connector",\w-\w\w\w,字-ま‿۳
\s,"Most engines: ""whitespace character"": space, tab, newline, carriage return, vertical tab",a\sb\sc,a b c
\s,".NET, Python 3, JavaScript: ""whitespace character"": any Unicode separator",a\sb\sc,a b c
\D,One character that is not a digit as defined by your engine's \d,\D\D\D,ABC
\W,One character that is not a word character as defined by your engine's \w,\W\W\W\W\W,*-+=)
\S,One character that is not a whitespace character as defined by your engine's \s,\S\S\S\S,Yoyo
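
The engine-dependent rows of the table can be observed directly in Python 3. A small sketch: for `str` patterns the shorthand classes are Unicode-aware by default, and the `re.ASCII` flag restricts them to ASCII.

```python
import re

sample = 'file_9੩'   # ASCII digit 9 followed by a Gurmukhi digit three

unicode_match = re.fullmatch(r'file_\d\d', sample)           # default for str patterns
ascii_match = re.fullmatch(r'file_\d\d', sample, re.ASCII)   # \d limited to 0-9

# \w also covers ideograms, kana and digits from other scripts.
word_match = re.fullmatch(r'\w-\w\w\w', '字-ま_۳')

print(bool(unicode_match), bool(ascii_match), bool(word_match))   # True False True
```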


In [210]:
%%html
<h2>Quantifiers</h2>


<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Quantifier</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="wasabi"><td><span class="mono">+</span></td><td>One or more</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr><tr class="greentea"><td><span class="mono">{3}</span></td><td>Exactly three times</td><td>\D{3}</td><td>ABC</td></tr><tr class="wasabi"><td><span class="mono">{2,4}</span></td><td>Two to four times</td><td>\d{2,4}</td><td>156</td></tr><tr class="greentea"><td><span class="mono">{3,}</span></td><td>Three or more times</td><td>\w{3,}</td><td>regex_tutorial</td></tr><tr class="wasabi"><td><span class="mono">*</span></td><td>Zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr><tr class="greentea"><td><span class="mono">?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

Quantifier,Legend,Example,Sample Match
+,One or more,Version \w-\w+,Version A-b1_1
{3},Exactly three times,\D{3},ABC
"{2,4}",Two to four times,"\d{2,4}",156
"{3,}",Three or more times,"\w{3,}",regex_tutorial
*,Zero or more times,A*B*C*,AAACC
?,Once or none,plurals?,plural
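
As a quick check, each example/sample pair from the quantifier table can be run through `re.fullmatch`. The variable names are illustrative only.

```python
import re

rows = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}',          'ABC'),
    (r'\d{2,4}',        '156'),
    (r'\w{3,}',         'regex_tutorial'),
    (r'A*B*C*',         'AAACC'),
    (r'plurals?',       'plural'),
]
all_match = all(re.fullmatch(p, s) for p, s in rows)
print(all_match)   # True

# A count outside the allowed range does not match the string as a whole.
too_short = re.fullmatch(r'\d{2,4}', '7')
too_long = re.fullmatch(r'\d{2,4}', '12345')
print(too_short, too_long)   # None None
```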


In [211]:
%%html
<h2>More Characters</h2>

<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Character</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="brown"><td><span class="mono"><b>.</b></span></td><td>Any character except line break</td><td>a.c</td><td>abc</td></tr><tr class="beige"><td><span class="mono"><b>.</b></span></td><td>Any character except line break</td><td>.*</td><td>whatever, man.</td></tr><tr class="brown"><td><span class="mono">\<b>.</b></span></td><td>A period (special character: needs to be escaped by a \)</td><td>a\.c</td><td>a.c</td></tr><tr class="beige"><td><span class="mono">\</span></td><td>Escapes a special character</td><td>\.\*\+\?&nbsp;&nbsp;&nbsp;&nbsp;\$\^\/\&#92;</td><td>.*+?&nbsp;&nbsp;&nbsp;&nbsp;$^/&#92;</td></tr><tr class="brown"><td><span class="mono">\</span></td><td>Escapes a special character</td><td>\[\{\(\)\}\]</td><td>[{()}]</td></tr></table>

Character,Legend,Example,Sample Match
.,Any character except line break,a.c,abc
.,Any character except line break,.*,"whatever, man."
\.,A period (special character: needs to be escaped by a \),a\.c,a.c
\,Escapes a special character,\.\*\+\? \$\^\/\\,.*+? $^/\
\,Escapes a special character,\[\{\(\)\}\],[{()}]
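
A short sketch of escaping in Python. When the text to be matched literally comes from elsewhere, `re.escape` adds the backslashes; the `price` string is a made-up example.

```python
import re

dot_any = bool(re.fullmatch(r'a.c', 'abc'))       # True: the dot matches any character
dot_literal = bool(re.fullmatch(r'a\.c', 'abc'))  # False: the escaped dot is a literal period
dot_exact = bool(re.fullmatch(r'a\.c', 'a.c'))    # True

price = '$9.99 (approx.)'
haystack = 'only $9.99 (approx.) today'

escaped_hit = re.search(re.escape(price), haystack)   # every metacharacter is escaped
raw_hit = re.search(price, haystack)                  # $ ( ) . keep their special meaning

print(dot_any, dot_literal, dot_exact, bool(escaped_hit), bool(raw_hit))
# True False True True False
```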


In [None]:
import re
s = '100 NORTH MAIN ROAD'
re.sub('ROAD$', 'RD.', s) 
# $ means “end of the string.”
# caret ^ means “beginning of the string.”

In [None]:
s = '100 BROAD'
re.sub('ROAD$', 'RD.', s)

In [None]:
re.sub('\\bROAD$', 'RD.', s)
# \b means “a word boundary must occur right here.”
# In an ordinary string literal the backslash itself has to be escaped, hence '\\b'.

In [None]:
re.sub(r'\bROAD$', 'RD.', s) 
# a raw string (r'...') leaves backslashes untouched, so \b reaches the regex engine as written

In [None]:
s = '100 BROAD ROAD APT. 3'
re.sub(r'\bROAD$', 'RD.', s)

In [None]:
re.sub(r'\bROAD\b', 'RD.', s) 
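
The results of the street-address cells above, collected in one runnable sketch. The helper name `abbreviate_road` is hypothetical.

```python
import re

def abbreviate_road(address):
    """Replace the whole word ROAD, wherever it occurs, with 'RD.'."""
    return re.sub(r'\bROAD\b', 'RD.', address)

too_eager = re.sub('ROAD$', 'RD.', '100 BROAD')                  # '100 BRD.'
guarded = re.sub(r'\bROAD$', 'RD.', '100 BROAD')                 # '100 BROAD'
not_at_end = re.sub(r'\bROAD$', 'RD.', '100 BROAD ROAD APT. 3')  # unchanged
whole_word = abbreviate_road('100 BROAD ROAD APT. 3')            # '100 BROAD RD. APT. 3'

print(too_eager, guarded, not_at_end, whole_word, sep='\n')
```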

Roman Numerals
==============
    I = 1
    V = 5
    X = 10
    L = 50
    C = 100
    D = 500
    M = 1000 

In [None]:
pattern = '^M?M?M?$'
# M? optionally matches a single M character. 
re.search(pattern, 'M') 

In [None]:
re.search(pattern, '')
# all the M characters are optional. 

In [None]:
pattern = '^M?M?M?(CM|CD|D?C?C?C?)$' 
# set of three mutually exclusive patterns, separated by vertical bars
# regex parser checks for each of these patterns in order, takes the first one that matches, and ignores the rest. 
re.search(pattern, 'MMMCCC')     

"""possible patterns:
    CM
    CD
    an optional D, followed by zero to three C characters """

In [None]:
pattern = '^M{0,3}$'
# Match the start of the string, then anywhere from zero to three M characters, then the end of the string.
re.search(pattern, 'MMM') 

In [None]:
pattern = '^M?M?M?(CM|CD|D?C?C?C?)(XC|XL|L?X?X?X?)$'
re.search(pattern, 'MCMXL') 

In [None]:
pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
re.search(pattern, 'MDLV')      

In [None]:
re.search(pattern, 'MMMDCCCLXXXVIII') 

In [None]:
pattern = '''
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 Ms
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 Cs),
                        #            or 500-800 (D, followed by 0 to 3 Cs)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 Xs),
                        #        or 50-80 (L, followed by 0 to 3 Xs)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 Is),
                        #        or 5-8 (V, followed by 0 to 3 Is)
    $                   # end of string
    '''
re.search(pattern, 'M', re.VERBOSE) 
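
The final pattern can be wrapped in a small validator. This is a sketch: `ROMAN` and `is_roman` are hypothetical names, and the empty string is rejected explicitly because every part of the pattern is optional.

```python
import re

ROMAN = re.compile(r'''
    ^
    M{0,3}              # thousands
    (CM|CD|D?C{0,3})    # hundreds
    (XC|XL|L?X{0,3})    # tens
    (IX|IV|V?I{0,3})    # ones
    $
    ''', re.VERBOSE)

def is_roman(s):
    """True for well-formed Roman numerals from I to MMMCMXCIX."""
    return bool(s) and ROMAN.search(s) is not None

for candidate in ['MCMXL', 'MMMDCCCLXXXVIII', 'MMMM', 'IIII', '']:
    print(repr(candidate), is_roman(candidate))

# The groups give the hundreds, tens and ones parts.
print(ROMAN.search('MCMXL').groups())   # ('CM', 'XL', '')
```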


    ^ matches the beginning of a string.
    $ matches the end of a string.
    \b matches a word boundary.
    \d matches any numeric digit.
    \D matches any non-numeric character.
    x? matches an optional x character (in other words, it matches an x zero or one times).
    x* matches x zero or more times.
    x+ matches x one or more times.
    x{n,m} matches an x character at least n times, but not more than m times.
    (a|b|c) matches exactly one of a, b or c.
    (x) in general is a remembered group. You can get the value of what matched by using the groups() method of the object returned by re.search. 

In [212]:
%%html
<h2>Logic</h2>


<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Logic</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="wasabi"><td><span class="mono">|</span></td><td> Alternation / OR operand</td><td>22|33</td><td>33</td></tr><tr class="greentea"><td><span class="mono">( … )</span></td><td>Capturing group</td><td>A(nt|pple)</td><td>Apple (captures "pple")</td></tr><tr class="wasabi"><td><span class="mono">\1</span></td><td>Contents of Group 1</td><td>r(\w)g\1x</td><td>regex</td></tr><tr class="greentea"><td><span class="mono">\2</span></td><td>Contents of Group 2</td><td>(\d\d)\+(\d\d)=\2\+\1</td><td>12+65=65+12</td></tr><tr class="wasabi"><td><span class="mono">(?: … )</span></td><td>Non-capturing group</td><td>A(?:nt|pple)</td><td>Apple</td></tr></table>

Logic,Legend,Example,Sample Match
|,Alternation / OR operand,22|33,33
( … ),Capturing group,A(nt|pple),"Apple (captures ""pple"")"
\1,Contents of Group 1,r(\w)g\1x,regex
\2,Contents of Group 2,(\d\d)\+(\d\d)=\2\+\1,12+65=65+12
(?: … ),Non-capturing group,A(?:nt|pple),Apple


Parsing Phone Numbers
=====================

In [None]:
phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$') 
# \d means “any numeric digit” (0 through 9). The {3} means “match exactly three numeric digits”
# Putting it in parentheses means “match exactly three numeric digits, and then remember them as a group that I can ask for later”. 

In [None]:
phonePattern.search('800-555-1212').groups() 

In [None]:
phonePattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})-(\d+)$')
# \d+ remembered group of one or more digits
phonePattern.search('800-555-1212-1234').groups() 

In [None]:
phonePattern = re.compile(r'^(\d{3})\D+(\d{3})\D+(\d{4})\D+(\d+)$')  
# \D+ matches one or more characters that are not digits
phonePattern.search('800 555 1212 1234').groups()

In [None]:
phonePattern = re.compile(r'^(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
# * means “zero or more”.
phonePattern.search('80055512121234').groups() 

In [None]:
phonePattern = re.compile(r'^\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')  
phonePattern.search('(800)5551212 ext. 1234').groups()

In [None]:
phonePattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$') 
phonePattern.search('work 1-(800) 555.1212 #1234').groups()  

In [None]:
phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)
phonePattern.search('work 1-(800) 555.1212 #1234').groups()
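
The final phone pattern can be wrapped in a helper that returns named parts. A sketch only: `PHONE`, `parse_phone` and the dictionary keys are hypothetical names.

```python
import re

PHONE = re.compile(r'''
    (\d{3})     # area code
    \D*         # optional separator
    (\d{3})     # trunk
    \D*         # optional separator
    (\d{4})     # rest of number
    \D*         # optional separator
    (\d*)       # optional extension
    $           # end of string
    ''', re.VERBOSE)

def parse_phone(text):
    """Return the parts of a phone number found in text, or None."""
    m = PHONE.search(text)
    if m is None:
        return None
    area, trunk, rest, ext = m.groups()
    return {'area': area, 'trunk': trunk, 'rest': rest, 'extension': ext or None}

print(parse_phone('work 1-(800) 555.1212 #1234'))
print(parse_phone('800-555-1212'))
print(parse_phone('no number here'))
```

One limitation to keep in mind: a leading `1` that is glued directly to the area code (no separator) would be read as part of the area code.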

In [227]:
import re
from nltk.corpus import words
wds = words.words()
len(wds)  # 235786 in the original run
cands = [w for w in wds if re.search('^..j..t..$',w)]
cands
palindrom = [w for w in wds if re.search(r'^(.)(.).\2\1$',w)]
palindrom
ma_rymu_umiram_foneticky = "marimuumiram"
re.search(r'^(.)(.)(.)(.)(.)(.)\6\5\4\3\2\1$', ma_rymu_umiram_foneticky)


<_sre.SRE_Match object; span=(0, 12), match='marimuumiram'>

In [213]:
%%html
<h2>More White-Space</h2>


<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Character</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="brown"><td><span class="mono">\t</span></td><td>Tab</td><td>T\t\w{2}</td><td>T&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;ab</td></tr><tr class="beige"><td><span class="mono">\r</span></td><td>Carriage return character</td><td>see below</td><td></td></tr><tr class="brown"><td><span class="mono">\n</span></td><td>Line feed character</td><td>see below</td><td></td></tr><tr class="beige"><td><span class="mono">\r\n</span></td><td>Line separator on Windows</td><td>AB\r\nCD</td><td>AB<br />CD</td></tr><tr class="brown"><td><span class="mono">\N</span></td><td>Perl, PCRE (C, PHP, R…): one character that is not a line break</td><td>\N+</td><td>ABC</td></tr><tr class="beige"><td><span class="mono">\h</span></td><td>Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator</td><td></td><td></td></tr><tr class="brown"><td><span class="mono">\H</span></td><td>One character that is not a horizontal whitespace</td><td></td><td></td></tr><tr class="beige"><td><span class="mono">\v</span></td><td>.NET, JavaScript, Python, Ruby: vertical tab</td><td></td><td></td></tr><tr class="brown"><td><span class="mono">\v</span></td><td>Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator</td><td></td><td></td></tr><tr class="beige"><td><span class="mono">\V</span></td><td>Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace</td><td></td><td></td></tr><tr class="brown"><td><span class="mono">\R</span></td><td>Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v)</td><td></td><td></td></tr></table>

Character,Legend,Example,Sample Match
\t,Tab,T\t\w{2},T ab
\r,Carriage return character,see below,
\n,Line feed character,see below,
\r\n,Line separator on Windows,AB\r\nCD,AB CD
\N,"Perl, PCRE (C, PHP, R…): one character that is not a line break",\N+,ABC
\h,"Perl, PCRE (C, PHP, R…), Java: one horizontal whitespace character: tab or Unicode space separator",,
\H,One character that is not a horizontal whitespace,,
\v,".NET, JavaScript, Python, Ruby: vertical tab",,
\v,"Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator",,
\V,"Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace",,


In [214]:
%%html
<h2>More Quantifiers</h2>


<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Quantifier</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="wasabi"><td><span class="mono">+</span></td><td>The + (one or more) is "greedy"</td><td>\d+</td><td>12345</td></tr><tr class="greentea"><td><span class="mono">?</span></td><td>Makes quantifiers "lazy"</td><td>\d+?</td><td>1 in <b>1</b>2345</td></tr><tr class="wasabi"><td><span class="mono">*</span></td><td>The * (zero or more) is "greedy"</td><td>A*</td><td>AAA</td></tr><tr class="greentea"><td><span class="mono">?</span></td><td>Makes quantifiers "lazy"</td><td>A*?</td><td>empty in AAA</td></tr><tr class="wasabi"><td><span class="mono">{2,4}</span></td><td>Two to four times, "greedy"</td><td>\w{2,4}</td><td>abcd</td></tr><tr class="greentea"><td><span class="mono">?</span></td><td>Makes quantifiers "lazy"</td><td>\w{2,4}?</td><td>ab in <b>ab</b>cd</td></tr></table>

Quantifier,Legend,Example,Sample Match
+,"The + (one or more) is ""greedy""",\d+,12345
?,"Makes quantifiers ""lazy""",\d+?,1 in 12345
*,"The * (zero or more) is ""greedy""",A*,AAA
?,"Makes quantifiers ""lazy""",A*?,empty in AAA
"{2,4}","Two to four times, ""greedy""","\w{2,4}",abcd
?,"Makes quantifiers ""lazy""","\w{2,4}?",ab in abcd
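
The greedy and lazy rows of the table, checked in Python as a short sketch. The last line shows that a lazy quantifier still expands when the rest of the pattern needs it.

```python
import re

greedy_digits = re.search(r'\d+', '12345').group()     # '12345'
lazy_digits = re.search(r'\d+?', '12345').group()      # '1'
lazy_star = re.search(r'A*?', 'AAA').group()           # '' (empty match)
lazy_range = re.search(r'\w{2,4}?', 'abcd').group()    # 'ab'

# Lazy means "as few as possible", not "always the minimum".
forced = re.search(r'\d+?5', '12345').group()          # '12345'

print(greedy_digits, lazy_digits, repr(lazy_star), lazy_range, forced)
```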


In [None]:
# minitask 1.1
# expected output ['b', '/b', 'i', '/i']
import re
text = '<b>foo</b> and <i>so on</i>'
print(re.findall(r'<.+>',text))


## re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result.
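
How the return value of `re.findall` depends on the number of groups, as a small sketch that reuses the fruit string from the earlier cells:

```python
import re

text = 'apple:green:3 banana:yellow:5'

no_groups = re.findall(r'\w+:\w+:\d+', text)          # whole matches
one_group = re.findall(r'\w+:\w+:(\d+)', text)        # contents of the single group
many_groups = re.findall(r'(\w+):(\w+):(\d+)', text)  # one tuple per match
empties = re.findall(r'x*', 'ab')                     # empty matches are included

print(no_groups)     # ['apple:green:3', 'banana:yellow:5']
print(one_group)     # ['3', '5']
print(many_groups)   # [('apple', 'green', '3'), ('banana', 'yellow', '5')]
print(empties)       # ['', '', '']
```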

In [None]:
# what is the result in the first case: b or ab
# what is the result in the second case?
import re
print(re.search('a??b','aaab').group())
print(re.search('a*?b','aaab').group())
# regex engine re parses the string from left to right and
# returns the first possible match at the leftmost position

In [215]:
%%html
<h2>Character Classes</h2>


<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Character</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="brown"><td><span class="mono">[ … ]</span></td><td>One of the characters in the brackets</td><td>[AEIOU]</td><td>One uppercase vowel</td></tr><tr class="beige"><td><span class="mono">[ … ]</span></td><td>One of the characters in the brackets</td><td>T[ao]p</td><td><i>Tap</i> or <i>Top</i></td></tr><tr class="brown"><td><span class="mono">-</span></td><td>Range indicator</td><td>[a-z]</td><td>One lowercase letter</td></tr><tr class="beige"><td><span class="mono">[x-y]</span></td><td>One of the characters in the range from x to y</td><td>[A-Z]+</td><td>GREAT</td></tr><tr class="brown"><td><span class="mono">[ … ]</span></td><td>One of the characters in the brackets</td><td>[AB1-5w-z]</td><td>One of either: A,B,1,2,3,4,5,w,x,y,z</td></tr><tr class="beige"><td><span class="mono">[x-y]</span></td><td>One of the characters in the range from x to y</td><td>[&ensp;-~]+</td><td>Characters in the printable section of the <a href="http://www.asciitable.com/" target="_blank">ASCII table</a>.</td></tr><tr class="brown"><td><span class="mono">[^x]</span></td><td>One character that is not x</td><td>[^a-z]{3}</td><td>A1!</td></tr><tr class="beige"><td><span class="mono">[^x-y]</span></td><td>One of the characters <b>not</b> in the range from x to y</td><td>[^&ensp;-~]+</td><td>Characters that are <b>not</b> in the printable section of the <a href="http://www.asciitable.com/" target="_blank">ASCII table</a>.</td></tr><tr class="brown"><td><span class="mono">[\d\D]</span></td><td>One character that is a digit or a non-digit</td><td>[\d\D]+</td><td>Any characters, inc-<br />luding new lines, which the regular dot doesn't match</td></tr><tr class="beige"><td><span class="mono">[\x41]</span></td><td>Matches the character at hexadecimal position 
41 in the ASCII table, i.e. A</td><td>[\x41-\x45]{3}</td><td>ABE</td></tr></table>

Character,Legend,Example,Sample Match
[ … ],One of the characters in the brackets,[AEIOU],One uppercase vowel
[ … ],One of the characters in the brackets,T[ao]p,Tap or Top
-,Range indicator,[a-z],One lowercase letter
[x-y],One of the characters in the range from x to y,[A-Z]+,GREAT
[ … ],One of the characters in the brackets,[AB1-5w-z],"One of either: A,B,1,2,3,4,5,w,x,y,z"
[x-y],One of the characters in the range from x to y,[ -~]+,Characters in the printable section of the ASCII table.
[^x],One character that is not x,[^a-z]{3},A1!
[^x-y],One of the characters not in the range from x to y,[^ -~]+,Characters that are not in the printable section of the ASCII table.
[\d\D],One character that is a digit or a non-digit,[\d\D]+,"Any characters, including new lines, which the regular dot doesn't match"
[\x41],"Matches the character at hexadecimal position 41 in the ASCII table, i.e. A",[\x41-\x45]{3},ABE
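
A few rows of the character-class table, tried in Python as a sketch. The subject strings other than the table's own samples are invented.

```python
import re

taps = re.findall(r'T[ao]p', 'Tap Tip Top')                    # ['Tap', 'Top']
caps = re.findall(r'[A-Z]+', 'GREAT job, NASA')                # ['GREAT', 'NASA']
negated = re.fullmatch(r'[^a-z]{3}', 'A1!') is not None        # True
hex_range = re.fullmatch(r'[\x41-\x45]{3}', 'ABE') is not None  # True

# The dot stops at a line break; [\d\D] does not.
dot_run = re.match(r'.+', 'one\ntwo').group()                  # 'one'
any_run = re.match(r'[\d\D]+', 'one\ntwo').group()             # 'one\ntwo'

print(taps, caps, negated, hex_range, repr(dot_run), repr(any_run))
```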


In [216]:
%%html
<h2><a href="regex-anchors.html">Anchors</a> and <a href="regex-boundaries.html">Boundaries</a></h2>


<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Anchor</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="wasabi"><td><span class="mono">^</span></td><td><a href="regex-anchors.html#caret">Start of string</a> or <a href="regex-anchors.html#carmulti">start of line</a> depending on multiline mode. (But when [^inside brackets], it means "not")</td><td>^abc .*</td><td>abc (line start)</td></tr><tr class="greentea"><td><span class="mono">$</span></td><td><a href="regex-anchors.html#dollar">End of string</a> or <a href="regex-anchors.html#eol">end of line</a> depending on multiline mode. Many engine-dependent subtleties.</td><td>.*? the end$</td><td>this is the end</td></tr><tr class="wasabi"><td><span class="mono">\A</span></td><td><a href="regex-anchors.html#A">Beginning of string</a><br />(all major engines except JS)</td><td>\Aabc[\d\D]*</td><td>abc (string...<br />...start)</td></tr><tr class="greentea"><td><span class="mono">\z</span></td><td><a href="regex-anchors.html#z">Very end of the string</a><br />Not available in Python and JS</td><td>the end\z</td><td>this is...\n...<b>the end</b></td></tr><tr class="wasabi"><td><span class="mono">\Z</span></td><td><a href="regex-anchors.html#Z">End of string</a> or (except Python) before final line break<br />Not available in JS</td><td>the end\Z</td><td>this is...\n...<b>the end</b>\n</td></tr><tr class="greentea"><td><span class="mono">\G</span></td><td><a href="regex-anchors.html#G">Beginning of String or End of Previous Match</a><br />
	  .NET, Java, PCRE (C, PHP, R…), Perl, Ruby</td><td></td><td></td></tr><tr class="wasabi"><td><span class="mono">\b</span></td><td><a href="regex-boundaries.html#wordboundary">Word boundary</a><br /> Most engines: position where one side only is an ASCII letter, digit or underscore</td><td>Bob.*\bcat\b</td><td>Bob ate the cat</td></tr><tr class="greentea"><td><span class="mono">\b</span></td><td><a href="regex-boundaries.html#wordboundary">Word boundary</a><br />.NET, Java, Python 3, Ruby: position where one side only is a Unicode letter, digit or underscore</td><td>Bob.*\b\кошка\b</td><td>Bob ate the кошка</td></tr><tr class="wasabi"><td><span class="mono">\B</span></td><td><a href="regex-boundaries.html#notb">Not a word boundary</a></td><td>c.*\Bcat\B.*</td><td>copycats</td></tr></table>

Anchor,Legend,Example,Sample Match
^,"Start of string or start of line depending on multiline mode. (But when [^inside brackets], it means ""not"")",^abc .*,abc (line start)
$,End of string or end of line depending on multiline mode. Many engine-dependent subtleties.,.*? the end$,this is the end
\A,Beginning of string (all major engines except JS),\Aabc[\d\D]*,abc (string... ...start)
\z,Very end of the string Not available in Python and JS,the end\z,this is...\n...the end
\Z,End of string or (except Python) before final line break Not available in JS,the end\Z,this is...\n...the end\n
\G,"Beginning of String or End of Previous Match  .NET, Java, PCRE (C, PHP, R…), Perl, Ruby",,
\b,"Word boundary  Most engines: position where one side only is an ASCII letter, digit or underscore",Bob.*\bcat\b,Bob ate the cat
\b,"Word boundary .NET, Java, Python 3, Ruby: position where one side only is a Unicode letter, digit or underscore",Bob.*\b\кошка\b,Bob ate the кошка
\B,Not a word boundary,c.*\Bcat\B.*,copycats
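
Anchors and boundaries in Python's `re`, as a brief sketch with a made-up three-line string:

```python
import re

text = 'abc one\nabc two\nthe end\n'

string_start = re.findall(r'^abc', text)               # ['abc']: ^ is the start of the string
line_starts = re.findall(r'^abc', text, re.MULTILINE)  # ['abc', 'abc']: ^ is the start of each line

dollar = bool(re.search(r'the end$', text))    # True: $ also matches before a final newline
big_z = bool(re.search(r'the end\Z', text))    # False: \Z in Python is the very end of the string

words = re.findall(r'\bcat\b', 'cat copycats cat!')   # ['cat', 'cat']: whole words only
inner = re.findall(r'\Bcat\B', 'cat copycats cat!')   # ['cat']: only the one inside copycats

print(string_start, line_starts, dollar, big_z, words, inner)
```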


In [217]:
%%html
<h2>POSIX Classes</h2>

<table width="600" border="0" style="table-layout:fixed;"><tr><th width="100" scope="col">Character</th><th width="200" scope="col">Legend</th><th width="150" scope="col">Example</th><th width="150" scope="col">Sample Match</th></tr><tr class="brown"><td><span class="mono">[:alpha:]</span></td><td>PCRE (C, PHP, R…): ASCII letters A-Z and a-z</td><td>[8[:alpha:]]+</td><td>WellDone88</td></tr><tr class="beige"><td><span class="mono">[:alpha:]</span></td><td>Ruby 2: Unicode letter or ideogram</td><td>[[:alpha:]\d]+</td><td>кошка99</td></tr><tr class="brown"><td><span class="mono">[:alnum:]</span></td><td>PCRE (C, PHP, R…): ASCII digits and letters A-Z and a-z</td><td>[[:alnum:]]{10}</td><td>ABCDE12345</td></tr><tr class="beige"><td><span class="mono">[:alnum:]</span></td><td>Ruby 2: Unicode digit, letter or ideogram</td><td>[[:alnum:]]{10}</td><td>кошка90210</td></tr><tr class="brown"><td><span class="mono">[:punct:]</span></td><td>PCRE (C, PHP, R…): ASCII punctuation mark</td><td>[[:punct:]]+</td><td>?!.,:;</td></tr><tr class="beige"><td><span class="mono">[:punct:]</span></td><td>Ruby: Unicode punctuation mark</td><td>[[:punct:]]+</td><td>‽,:〽⁆</td></tr></table>

Character,Legend,Example,Sample Match
[:alpha:],"PCRE (C, PHP, R…): ASCII letters A-Z and a-z",[8[:alpha:]]+,WellDone88
[:alpha:],Ruby 2: Unicode letter or ideogram,[[:alpha:]\d]+,кошка99
[:alnum:],"PCRE (C, PHP, R…): ASCII digits and letters A-Z and a-z",[[:alnum:]]{10},ABCDE12345
[:alnum:],"Ruby 2: Unicode digit, letter or ideogram",[[:alnum:]]{10},кошка90210
[:punct:],"PCRE (C, PHP, R…): ASCII punctuation mark",[[:punct:]]+,"?!.,:;"
[:punct:],Ruby: Unicode punctuation mark,[[:punct:]]+,"‽,:〽⁆"


## Inline Modifiers

None of these are supported in JavaScript. In Ruby, beware of `(?s)` and `(?m)`.

Modifier,Legend,Example,Sample Match
(?i),Case-insensitive mode (except JavaScript),(?i)Monday,monDAY
(?s),"DOTALL mode (except JS and Ruby). The dot (.) matches new line characters (\r\n). Also known as ""single-line mode"" because the dot treats the entire input as a single line",(?s)From A.*to Z,From A to Z
(?m),Multiline mode (except Ruby and JS) ^ and $ match at the beginning and end of every line,(?m)1\r\n^2$\r\n^3$,1 2 3
(?m),"In Ruby: the same as (?s) in other engines, i.e. DOTALL mode, i.e. dot matches line breaks",(?m)From A.*to Z,From A to Z
(?x),Free-spacing mode (except JavaScript). Also known as comment mode or whitespace mode,(?x) # this is a # comment abc # write on multiple # lines [ ]d # spaces must be # in brackets,abc d
(?n),.NET: named capture only,"Turns all (parentheses) into non-capture groups. To capture, use named groups.",
(?d),Java: Unix linebreaks only,The dot and the ^ and $ anchors are only affected by \n,
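
The table rows for `(?i)`, `(?s)`, `(?m)` and `(?x)` can be tried in Python, which accepts these inline modifiers at the start of a pattern. This is a small sketch with made-up variable names; `(?n)` and `(?d)` are left out because Python's `re` does not provide them.

```python
import re

# (?i): case-insensitive matching
insensitive = re.search(r"(?i)Monday", "monDAY")

# (?s): DOTALL mode, the dot also matches line breaks
with_dotall = re.search(r"(?s)From A.*to Z", "From A\nto Z")
without_dotall = re.search(r"From A.*to Z", "From A\nto Z")

# (?m): ^ and $ match at the beginning and end of every line
single_digit_lines = re.findall(r"(?m)^\d$", "1\n2\n3")

# (?x): free-spacing mode, whitespace and comments in the pattern are ignored
spaced = re.search(r"""(?x)
    abc    # write on multiple lines
    [ ]d   # a literal space must be in brackets
""", "abc d")

print(insensitive.group(), bool(with_dotall), without_dotall,
      single_digit_lines, spaced.group())
```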


## Lookarounds: Zero-Length Assertions



Lookaround,Legend,Example,Sample Match
(?=…),Positive lookahead,(?=\d{10})\d{5},01234 in 0123456789
(?<=…),Positive lookbehind,(?<=\d)cat,cat in 1cat
(?!…),Negative lookahead,(?!theatre)the\w+,theme
(?<!…),Negative lookbehind,\w{3}(?<!mon)ster,Munster
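
The four rows of the lookaround table can be run as they stand in Python. This is only an illustrative sketch; the longer sample strings (`theatre theme`, `monster Munster`) are assumptions added to show which candidate gets rejected.

```python
import re

# positive lookahead: ten digits must follow, but only five are consumed
ahead = re.search(r"(?=\d{10})\d{5}", "0123456789").group()

# positive lookbehind: cat must be preceded by a digit
behind = re.search(r"(?<=\d)cat", "1cat").group()

# negative lookahead: a word starting with "the" that is not "theatre..."
not_ahead = re.search(r"(?!theatre)the\w+", "theatre theme").group()

# negative lookbehind: three word characters other than "mon" before "ster"
not_behind = re.search(r"\w{3}(?<!mon)ster", "monster Munster").group()

print(ahead, behind, not_ahead, not_behind)
```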


## Look ahead
* https://docs.python.org/3/howto/regex.html#lookahead-assertions

`(?=...)`: Positive lookahead assertion. This succeeds if the contained regular expression, represented here by `...`, successfully matches at the current location, and fails otherwise. But, once the contained expression has been tried, the matching engine doesn't advance at all; the rest of the pattern is tried right where the assertion started.

`(?!...)`: Negative lookahead assertion. This is the opposite of the positive assertion; it succeeds if the contained expression *doesn't* match at the current position in the string.

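* `.*[.].*$` matches file names with an extension, such as `auto.bat` or `slide.ipynb`
* Suppose you do not want to match `.bat` files
* `.*[.](?!bat$)[^.]*$` (the tail `[^.]*` keeps the match tied to the last dot, so a name such as `a.b.bat` is rejected as well)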
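
The effect of the negative lookahead can be checked on a few file names. The sketch below compares a tail of `.*` with a tail of `[^.]*`; the names `loose`, `strict` and the sample file names are made up for illustration.

```python
import re

loose = re.compile(r".*[.](?!bat$).*$")       # tail may contain further dots
strict = re.compile(r".*[.](?!bat$)[^.]*$")   # tail must be the last extension

names = ["auto.bat", "slide.ipynb", "backup.tar.bat", "sendmail.cf"]
results = {name: (bool(loose.match(name)), bool(strict.match(name)))
           for name in names}

for name, (loose_ok, strict_ok) in results.items():
    print(f"{name:15} loose={loose_ok} strict={strict_ok}")
```

With the `.*` tail the engine can backtrack to the first dot of `backup.tar.bat`, where the lookahead sees `tar.bat` and succeeds, so the file is accepted although it ends in `.bat`.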

## Examples

1. The password must have between six and ten word characters \w
2. It must include at least one lowercase character [a-z]
3. It must include at least three uppercase characters [A-Z]
4. It must include at least one digit \d

In [None]:
import re
passwd = 'apBDpleA5'
regex = re.compile(r'''\A # beginning of the string
                   (?=[^a-z]*[a-z]) # at least one lowercase character
                   (?=(?:[^A-Z]*[A-Z]){3}) # at least three uppercase characters
                   (?=\D*\d) # at least one digit \d
                   \w{6,10} # between six and ten word characters
                   \Z # end of the string, so nothing else may follow''', re.X)
if regex.search(passwd):
    print("Yes")
else:
    print("No")


In [228]:
# minitask 1.2
# expected output ['resource load failed', 'flow failed']
import re
text = '''
INFO 2019-02-17 12:13:44 resource load failed
INFO 2019-02-18 22:09:17 authentication failed
INFO 2019-02-18 10:55:48 data received
INFO 2019-02-19 19:53:31 flow failed
'''
pattern = r'''
    \w+                 # failure type
    \s                  # a whitespace character
    [\d-]+              # date such as 2019-02-21
    \s                  # a whitespace character
    [\d:]+              # time such as 21:15:06
    \s                  # a whitespace character
    (.*\sfailed)        # failure description
'''
print(re.findall(pattern,text, re.VERBOSE))


['resource load failed', 'authentication failed', 'flow failed']
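
One possible way to reach the expected output of minitask 1.2 is a negative lookahead in front of the description. This is only a sketch: it assumes that the intended criterion is "skip authentication failures", which the task does not state explicitly.

```python
import re

text = '''
INFO 2019-02-17 12:13:44 resource load failed
INFO 2019-02-18 22:09:17 authentication failed
INFO 2019-02-18 10:55:48 data received
INFO 2019-02-19 19:53:31 flow failed
'''
pattern = r'''
    \w+ \s [\d-]+ \s [\d:]+ \s   # level, date and time as before
    (?!authentication\b)         # assumed criterion: skip authentication lines
    (.*\sfailed)                 # failure description
'''
failures = re.findall(pattern, text, re.VERBOSE)
print(failures)
```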


In [None]:
import re
pattern = re.compile(r'\b(?!958)\d{3}\b')
text = "123\n235\n456\n1000\n957 958 959 960\n  909\n915 916"
print(re.findall(pattern, text))


In [233]:
# minitask 1.3
# change the last du to DU
import re
pattern = re.compile(r'du')
text = ['du du du', 'du po ledu', 'dopředu du', 'i dozadu du', 'dudu dupl']
for row in text:
    print(re.sub(pattern, 'DU', row))

DU DU DU
DU po leDU
dopřeDU DU
i dozaDU DU
DUDU DUpl
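
A possible approach to minitask 1.3 is to replace only a `du` that is not followed by another `du` later in the string. It is a sketch of one solution using a negative lookahead, not necessarily the intended one; the names `rows` and `results` are made up.

```python
import re

# du that has no further du anywhere to its right, i.e. the last one
last_du = re.compile(r'du(?!.*du)')
rows = ['du du du', 'du po ledu', 'dopředu du', 'i dozadu du', 'dudu dupl']
results = [last_du.sub('DU', row) for row in rows]
for row in results:
    print(row)
```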


In [None]:
import re
text = 'BM25'
pattern = r'B(?=2)'  # B directly followed by 2; here M follows, so "No"
if re.search(pattern, text):
    print("Yes")
else:
    print("No")

In [None]:
import re
text = 'BM25'
pattern = r'B(?=[^2]*2)'  # B with a 2 somewhere after it: "Yes"
if re.search(pattern, text):
    print("Yes")
else:
    print("No")

In [None]:
import re
text = 'B52'
pattern = r'B(?=5)(?=2)'  # both lookaheads test the same character, which cannot be 5 and 2: "No"
if re.search(pattern, text):
    print("Yes")
else:
    print("No")

In [None]:
import re
text = 'B52'
pattern = r'B(?=5)(?=[^2]*2)'  # next character is 5 and a 2 follows later: "Yes"
if re.search(pattern, text):
    print("Yes")
else:
    print("No")

In [235]:
# minitask 1.4
# strings/lines that contain the _words_ David and Pavel and
# contain neither the _word_ Petr nor the _word_ Jan
# expected output:
# Iva Pavel David Ada
# Pavel David Jansen
import re
texts = ['David Petr','Iva Pavel David Ada',
         'Davidson Pavelek','Pavel David Jansen']
for text in texts:
    if re.search(r'^.*$',text):
        print(text)

David Petr
Iva Pavel David Ada
Davidson Pavelek
Pavel David Jansen


## Positive lookbehind assertion
`(?<=...)`: Matches if the current position in the string is preceded by a match for `...` that ends at the current position. This is called a positive lookbehind assertion. `(?<=abc)def` will find a match in `'abcdef'`, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that `abc` or `a|b` are allowed, but `a*` and `a{3,4}` are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the `search()` function rather than the `match()` function.

In [None]:
import re
m = re.search(r'(?<=abc)def', 'abcdef')
print(m.group(0))  # def
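
The two restrictions described above (fixed width, and no match at the start of the string with `match()`) can be observed directly. This is a minimal sketch assuming the standard `re` module; the variable names are made up.

```python
import re

# search() scans the string, so it finds def after abc
found = re.search(r'(?<=abc)def', 'abcdef')

# match() only tries position 0, where nothing precedes the pattern
at_start = re.match(r'(?<=abc)def', 'abcdef')

# a variable-width lookbehind such as a* is rejected when the pattern is compiled
try:
    re.compile(r'(?<=a*)b')
    error = None
except re.error as exc:
    error = str(exc)

print(found.group(0), at_start, error)
```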

## Negative lookbehind assertion
`(?<!...)`: Matches if the current position in the string is not preceded by a match for `...`. This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

In [None]:
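import re
# insert a dot at every position inside a number (not at a word boundary)
# that is followed by a multiple of three digits up to the end of the number
print(re.sub(r'(?<!\b)(?=(\d{3})+\b)', '.', "it is 12345673456456456 and 1000 and 500"))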

## RegEx Golf
[regex.alf.nu](http://regex.alf.nu/)

Prime number of characters?

In [None]:
import re
prime = 'xxxxx'      # 5 characters
nonprime = 'xxxxxx'  # 6 characters
# (..+)\1+$ matches a string built from two or more copies of a block of at
# least two characters, i.e. a composite length; the negative lookahead
# rejects exactly those strings (lengths 0 and 1 also pass this check)
regex = re.compile(r'^(?!(..+)\1+$)')
print(regex.search(nonprime))  # None
print(regex.search(prime))     # an empty match at position 0

In [None]:
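import re
# n optional a's followed by n mandatory a's, matched against exactly n a's:
# a backtracking engine tries very many ways to split the a's before it succeeds,
# so the run time grows quickly with factor
factor = 25
regex = r'a?'*factor + 'a'*factor
print(regex)
many_a = 'a'*factor
print(many_a)
regex = re.compile(regex)
print(regex.search(many_a))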

In [None]:
import re
text = 'item 6 is OK; item 1) not; but what about 3 a (i)'
regex = re.compile(r'\d\s*(?![()])')
for match in regex.finditer(text):
    print(match)

In [None]:
import re
text = '12345'
regex = re.compile(r'1+|[0-9]+')
# alternatives are tried left to right: 1+ succeeds first, so the match is '1', not '12345'
print(regex.search(text))

## The Best Regex Trick
[rexegg.com/regex-best-trick.html](http://www.rexegg.com/regex-best-trick.html)

Task: match *Tarzan* except when this exact word is in double quotes. In other words, we want to exclude *"Tarzan"*.

* `(?<!")Tarzan|Tarzan(?!")`
* `"Tarzan"|(Tarzan)`

Sample text: Now Tarzan says to Jane: "Tarzan".

In [None]:
import re
text = '"Tarzan and Jane": Now Tarzan says to Jane: "Tarzan" and "John and Tarzan"'
regex = re.compile(r'(?<!"(?=Tarzan"))Tarzan')
for match in regex.finditer(text):
    print(match)

In [None]:
import re
text = '"Tarzan and Jane": Now Tarzan says to Jane: "Tarzan" and "John and Tarzan"'
regex = re.compile(r'Tarzan(?!"(?<="Tarzan"))')
for match in regex.finditer(text):
    print(match)

In [None]:
import re
text = '"Tarzan and Jane": Now Tarzan says to Jane: "Tarzan" and "John and Tarzan"'
regex = re.compile(r'(?<!")Tarzan|Tarzan(?!")')
for match in regex.finditer(text):
    print(match)

In [None]:
import re
text = '"Tarzan and Jane": Now Tarzan says to Jane: "Tarzan" and "John and Tarzan"'
regex = re.compile(r'"Tarzan"|(Tarzan)')
# group(1) is None whenever the quoted alternative "Tarzan" matched
for match in regex.finditer(text):
    print(match.group(1))
print('just the "correct" matches:')
for match in regex.finditer(text):
    correct = match.group(1)
    if correct:
        print(correct)

## Think about backtracking

In [None]:
import re
pattern = r'^(.*?,){11}x'
ok = '1,2,3,4,5,6,7,8,9,10,11,x12'
bad = '1,2,3,4,5,6,7,8,9,10,11,12'
%timeit re.search(pattern, ok)
%timeit re.search(pattern, bad)  # no x after the 11th comma: fails only after backtracking

# Runaway Regular Expressions: Catastrophic Backtracking
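
Outside IPython the same comparison can be made with the standard `timeit` module. The sketch below is an illustration under assumptions: the 18-field input and the tighter pattern `[^,\r\n]*,` are additions that are not part of the cell above, and the measured times depend on the machine.

```python
import re
import timeit

lazy = re.compile(r'^(.*?,){11}x')          # each field may run over commas
tight = re.compile(r'^([^,\r\n]*,){11}x')   # each field stops at its comma

ok = '1,2,3,4,5,6,7,8,9,10,11,x12'
# 18 fields and no x: the lazy pattern tries many ways to place 11 commas
bad = ','.join(str(i) for i in range(1, 19))

for name, rx in (('lazy', lazy), ('tight', tight)):
    seconds = timeit.timeit(lambda: rx.search(bad), number=10)
    print(f'{name:5} ok={bool(rx.search(ok))} bad={rx.search(bad)} {seconds:.4f}s')
```

Both patterns accept `ok` and reject `bad`; the difference is only in how much work the failing case costs.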

In [None]:
import re
print(re.sub('', 'a', '12345'))
inbetween = re.compile(r'''(?<=\d  # after a digit,
    (?!        # but not the position where 
      (?<=3)   # '3' precedes, AND
      (?=4)    # '4' follows
    )
  )''', re.X)
print(inbetween.sub('a', '12345'))

In [None]:
import re
objects = 'Mars Jupiter Uran Neptun Pluto'
tests = ['Mars Jupiter Uran Neptun Pluto', 'Jupiter Uran Neptun', 'Jupiter', 'Mars', 'Neptun', 'Uran']
for test in tests:
    # the greedy .* leaves the group only the last listed word in objects ('Neptun')
    if re.search(r'^.*\b(Jupiter|Mars|Neptun|Uran)\b.*$', objects).group(1) == test:
        print('works')
    else:
        print('does not work')

In [None]:
import re
tests = ['999==666','987==987','765==765','666==666']
for test in tests:
    if re.search(r'([987]{3}|[654]{3})==\1', test):
        print('works')
    else:
        print('does not work')

![regex golf](regex_golf.png)
[Peter Norvig's solution](http://nbviewer.jupyter.org/url/norvig.com/ipython/xkcd1313.ipynb)

## Regex Engine Types

There are two fundamentally different types of regex engines: DFA (deterministic finite automaton) and NFA (nondeterministic finite automaton).

Both engine types have been around for a long time, but the NFA type seems to be used more often. Tools that use an NFA engine include the .NET languages, PHP, Ruby, Perl, Python, GNU Emacs, ed, sed, vi, most versions of grep, and even a few versions of egrep and awk. On the other hand, a DFA engine is found in almost all versions of egrep and awk, as well as lex and flex. Some systems have a multi-engine hybrid system, using the most appropriate engine for the job (or even one that swaps between engines for different parts of the same regex, as needed to get the best combination of features and speed).
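
One visible difference between the engine types is how alternation is resolved. Python's `re` is a backtracking (NFA-style) engine and takes the first alternative that leads to a match, whereas POSIX-style DFA tools are generally described as reporting the leftmost-longest match. The sketch below only emulates the second behaviour by trying each alternative separately; it is an illustration, not how a DFA engine is implemented.

```python
import re

text = 'topology'
alternatives = ['to', 'top', 'topology']

# backtracking engine: the first alternative that matches wins
first = re.match('|'.join(alternatives), text).group()

# emulated leftmost-longest: try every alternative and keep the longest match
matches = [re.match(alt, text) for alt in alternatives]
longest = max((m.group() for m in matches if m), key=len)

print(first, longest)
```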

## Next lecture - Python data structures and basics
- https://www.youtube.com/watch?v=fYlnfvKVDoM
- https://www.youtube.com/watch?v=_AEJHKGk9ns

In [234]:
# minitask 1.4
# strings/lines that contain the _words_ David and Pavel and contain neither the _word_ Petr nor the _word_ Jan
# expected output:
# Iva Pavel David Ada
# Pavel David Jansen
import re
texts = ['David Petr','Iva Pavel David Ada','Davidson Pavelek','Pavel David Jansen']
for text in texts:
    if re.search(r'^(?=.*?\bDavid\b)(?=.*?\bPavel\b)(?!.*\b(Petr|Jan)\b).*$',text):
        print(text)

Iva Pavel David Ada
Pavel David Jansen
