## Regex 101

Basic examples of regular expressions (regex) in Python: how to define a pattern and search for it in text.

**Outline**

* [Define pattern and search](#search)
    + re.search, re.match, re.findall
* [Repetition in Regex patterns](#rep)
* [Matching digits, words, and set boundaries](<#more pat>)
* [References](#ref)

In [3]:
import re

## <a id="search">Define pattern and search</a>

* **re.search(pattern, string, flags=0)** takes a regular expression pattern and a string and searches for the pattern anywhere within the string. If the search succeeds, search() returns a match object; otherwise it returns None. The optional flags argument controls various aspects of matching.
    + Use the match object's group(num) or groups() method to get the matched text.
    + group() or group(0) returns the entire match; group(n) returns the n-th parenthesized group

In [4]:
input_string = 'The cookie is very good!!'
match = re.search(r'cookie', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found cookie'
else:
    print ('Did not find')

Found cookie


In [3]:
input_string = 'The cookie is very good!!'
match = re.search(r'(coo)k(ie)', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found cookie'
    print ('Found', match.group(1)) ## 'found coo'
    print ('Found', match.group(2)) ## 'found ie'
else:
    print ('Did not find')

Found cookie
Found coo
Found ie
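
As a small supplementary sketch (not part of the original cells), the match object from the example above can also report all groups at once and the position of the match. The variable names `all_groups` and `match_span` are illustrative.

```python
import re

input_string = 'The cookie is very good!!'
match = re.search(r'(coo)k(ie)', input_string)

# groups() returns every parenthesized group as a tuple
all_groups = match.groups()
# span() returns the (start, end) indices of the entire match
match_span = match.span()

print(all_groups)                                  # ('coo', 'ie')
print(match_span)                                  # (4, 10)
print(input_string[match_span[0]:match_span[1]])   # cookie
```

Slicing the input with the span gives the same text as group(), which can be handy when the surrounding context is needed as well.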


* **re.match(pattern, string)** checks for a match only at the beginning of the string, while re.search checks for a match anywhere in the string.
    + a|b -- matches either a or b (used in the re.findall example below).

In [9]:
input_string = 'The cookie is very good'
match = re.match(r'cookie', input_string)
# If-statement after match() tests if it succeeded
if match:
    print ('Found', match.group()) ## not reached: the string starts with 'The', not 'cookie'
else:
    print ('Did not find')

Did not find


In [6]:
input_string = 'cookie is good'
match = re.match(r'cookie', input_string)
# If-statement after match() tests if it succeeded
if match:
    print ('Found', match.group())
else:
    print ('Did not find')

Found cookie
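
The difference between the two functions can be sketched side by side. This is an illustrative comparison under the assumption of a single-line input; re.fullmatch is a standard `re` function that is not used elsewhere in this notebook and is included only for contrast.

```python
import re

input_string = 'The cookie is very good'

# match() is anchored at position 0, so 'cookie' is not found
at_start = re.match(r'cookie', input_string)
# search() with a leading ^ behaves the same way on a single-line string
anchored = re.search(r'^cookie', input_string)
# search() without the anchor scans the whole string
anywhere = re.search(r'cookie', input_string)
# fullmatch() requires the pattern to cover the entire string
entire = re.fullmatch(r'cookie', 'cookie')

print(at_start)           # None
print(anchored)           # None
print(anywhere.group())   # cookie
print(entire.group())     # cookie
```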


* **re.findall(pattern, string)** returns a list of all non-overlapping matches, while search returns only the first match

In [6]:
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!!'
matches = re.findall(r'cookie|Cookies', input_string)
# If-statement after findall() tests if it succeeded (an empty list is falsy)
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookies and cookie'
else:
    print ('Did not find')

Found Cookies
Found cookie


## <a id="rep">Repetition in Regex patterns</a>

In [7]:
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Cookie'
# 's?' matches zero or one 's', so 'Cookies?' matches both 'Cookie' and 'Cookies'
matches = re.findall(r'cookie|Cookies?', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookies and cookie and Cookie and Cookie'
else:
    print ('Did not find')

Found Cookies
Found cookie
Found Cookie
Found Cookie


In [8]:
input_string = 'The color is red. The other colour is green.'
# 'u?' makes the 'u' optional
matches = re.findall(r'colou?r', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found color and colour'
else:
    print ('Did not find')

Found color
Found colour


In [9]:
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Cooookie'
# 'o+' matches one or more 'o' characters
matches = re.findall(r'co+kie|Co+kies?', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookies and cookie and Cookie and Cooookie'
else:
    print ('Did not find')

Found Cookies
Found cookie
Found Cookie
Found Cooookie
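
The quantifiers used in this section can be compared on the same set of inputs. The following is a minimal sketch; the sample words and the `patterns` dictionary are made up for illustration, and re.fullmatch is used so that each word must fit the pattern completely.

```python
import re

samples = ['ckie', 'cokie', 'cookie', 'coookie']
patterns = {
    'co?kie': 'zero or one o',
    'co*kie': 'zero or more o',
    'co+kie': 'one or more o',
    'co{2}kie': 'exactly two o',
    'co{1,2}kie': 'one or two o',
}

results = {}
for pattern, meaning in patterns.items():
    # Keep the samples that the pattern matches from start to end
    results[pattern] = [s for s in samples if re.fullmatch(pattern, s)]
    print(pattern, '(' + meaning + '):', results[pattern])
```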


Square brackets [] indicate a set of characters, so [abc] matches 'a' or 'b' or 'c'. A range such as [a-z] matches any single character in that range.

In [10]:
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'[cC]o+kies?', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookies and cookie and Cookie and Coookie'
else:
    print ('Did not find')

Found Cookies
Found cookie
Found Cookie
Found Coookie


In [11]:
input_string = 'aaa'
# 'a*' matches zero or more 'a' characters
matches = re.findall(r'a*', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found aaa', then an empty match at the end of the string
else:
    print ('Did not find')

Found aaa
Found 
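
The second, empty result appears because `a*` is allowed to match zero characters, and it does so at the end of the string. A short sketch of how the empty matches might be filtered out when only the non-empty runs are wanted (the names `raw_matches` and `non_empty` are illustrative):

```python
import re

# a* can match zero characters, so findall() also reports an empty match
raw_matches = re.findall(r'a*', 'aaa')
# Empty strings are falsy, so a simple filter removes them
non_empty = [m for m in raw_matches if m]

print(raw_matches)   # ['aaa', '']
print(non_empty)     # ['aaa']
```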


In [12]:
input_string = 'aaa'
# 'a+' matches one or more 'a' characters, so there is no empty match
matches = re.findall(r'a+', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found aaa'
else:
    print ('Did not find')

Found aaa


In [13]:
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'[a-zA-Z]+', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found all words'
else:
    print ('Did not find')

Found Sam
Found Cookies
Found are
Found here
Found Julie
Found The
Found cookie
Found is
Found very
Found good
Found Cookie
Found Coookie


In [14]:
input_string = '12348976'
# [0-9] without a quantifier matches one digit at a time
matches = re.findall(r'[0-9]', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found each digit separately'
else:
    print ('Did not find')

Found 1
Found 2
Found 3
Found 4
Found 8
Found 9
Found 7
Found 6


In [15]:
input_string = 'The dog is cute.'
# A ^ as the first character inside [] negates the set
matches = re.findall(r'[^A-Z]', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found every character except uppercase letters'
else:
    print ('Did not find')

Found h
Found e
Found  
Found d
Found o
Found g
Found  
Found i
Found s
Found  
Found c
Found u
Found t
Found e
Found .


In [16]:
input_string = 'Sam: 100 cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'[0-9a-zA-Z]+', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found all words and digits'
else:
    print ('Did not find')

Found Sam
Found 100
Found cookies
Found are
Found here
Found Julie
Found The
Found cookie
Found is
Found very
Found good
Found Cookie
Found Coookie


In [17]:
input_string = 'begign'
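# '.' matches exactly one character (any character except a newline)
matches = re.findall(r'beg.n', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## not reached: 'begign' has two characters between 'beg' and 'n'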
else:
    print ('Did not find')

Did not find
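
The search above fails because the dot stands for exactly one character. A small sketch with a made-up word list shows which spellings fit `beg.n`; re.fullmatch is used so the whole word must match.

```python
import re

words = ['begin', 'began', 'begun', 'begign', 'begn']
# Keep only the words with exactly one character between 'beg' and 'n'
matched = [w for w in words if re.fullmatch(r'beg.n', w)]

print(matched)   # ['begin', 'began', 'begun']
```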


In [18]:
input_string = 'Sam: 100 cookies are here! Julie: The cookie is very good!! Cookie Coookie'
# '.+' matches one or more of any character except a newline
matches = re.findall(r'.+', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found the whole line'
else:
    print ('Did not find')

Found Sam: 100 cookies are here! Julie: The cookie is very good!! Cookie Coookie
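
The whole input comes back as one match because it contains no newline. As an illustrative sketch, the two-line string below (made up for this example) shows that `.` stops at a newline by default and crosses it only when the standard re.DOTALL flag is passed.

```python
import re

two_lines = 'Sam: 100 cookies are here!\nJulie: The cookie is very good!!'

# By default '.' does not match '\n', so each line is a separate match
per_line = re.findall(r'.+', two_lines)
# With re.DOTALL, '.' also matches '\n' and the match spans both lines
whole_text = re.findall(r'.+', two_lines, re.DOTALL)

print(per_line)
print(whole_text)
```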


A dot (.) inside square brackets just means a literal dot.

In [19]:
input_string = 'The dog is cute.'
matches = re.findall(r'[^.]', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found every character except the period'
else:
    print ('Did not find')

Found T
Found h
Found e
Found  
Found d
Found o
Found g
Found  
Found i
Found s
Found  
Found c
Found u
Found t
Found e


In [20]:
input_string = 'Sam: 100 cookies are here! Julie: The cookie is very good. Cookie Coookie'
matches = re.findall(r'[0-9a-zA-Z.]+', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found all words and periods'
else:
    print ('Did not find')

Found Sam
Found 100
Found cookies
Found are
Found here
Found Julie
Found The
Found cookie
Found is
Found very
Found good.
Found Cookie
Found Coookie


## <a id="more pat">Matching digits, words, and set boundaries</a>

* \d -- matches a decimal digit: [0-9] (some older regex utilities do not support \d)
* \D -- matches a non-digit: [^0-9]
* \w -- (lowercase w) matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_]. Although "word" is the mnemonic, \w matches only a single word character, not a whole word.
* \W -- (upper case W) matches any non-word character.
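
A brief sketch of the four classes listed above, applied to one short made-up string. Each class is combined with `+` so that runs of characters are returned together.

```python
import re

input_string = 'Sam: 100 cookies'

digits = re.findall(r'\d+', input_string)          # runs of digits
non_digits = re.findall(r'\D+', input_string)      # runs of anything else
word_runs = re.findall(r'\w+', input_string)       # runs of word characters
non_word_runs = re.findall(r'\W+', input_string)   # runs of non-word characters

print(digits)          # ['100']
print(non_digits)      # ['Sam: ', ' cookies']
print(word_runs)       # ['Sam', '100', 'cookies']
print(non_word_runs)   # [': ', ' ']
```

Note that for Python 3 string patterns, \d and \w also match digits and letters outside ASCII unless the re.ASCII flag is given, so the bracket equivalents in the list above describe the ASCII case.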

In [21]:
input_string = 'Sam: 100 cookies are here! Julie: The cookie is very good. Cookie Coookie'
# \w already includes digits, so [\d\w]+ matches the same text as \w+
matches = re.findall(r'[\d\w]+', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found all words and digits'
else:
    print ('Did not find')

Found Sam
Found 100
Found cookies
Found are
Found here
Found Julie
Found The
Found cookie
Found is
Found very
Found good
Found Cookie
Found Coookie


\t, \n, \r -- tab, newline, return

\s -- (lowercase s) matches a single whitespace character, such as a space, tab, or newline

In [22]:
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'\w+\s\w+', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found pairs of words separated by one whitespace character'
else:
    print ('Did not find')

Found Cookies are
Found The cookie
Found is very
Found Cookie Coookie


In [23]:
input_string = 'The other dog is happy.'
matches = re.findall(r'[tT]he', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found The, and the inside "other"'
else:
    print ('Did not find')

Found The
Found the


In [24]:
input_string = 'The other dog is happy.'
# \b matches at a word boundary, so only the whole word is found
matches = re.findall(r'\b[tT]he\b', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found The'
else:
    print ('Did not find')

Found The


In [25]:
pattern = re.compile(r'\b[tT]he\b')
for m in pattern.finditer(input_string):
    start_index = m.start()  # index where the whole-word match 'The' begins
    print(input_string[start_index:start_index+8])  # show 8 characters of context

The othe


In [26]:
# \B matches only where there is no word boundary, i.e. inside a word
matches = re.findall(r'\B[tT]he\B', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found the in "other"'
else:
    print ('Did not find')

Found the


In [27]:
pattern = re.compile(r'\B[tT]he\B')
for m in pattern.finditer(input_string):
    start_index = m.start()  # index where 'the' begins inside 'other'
    print(input_string[start_index:start_index+8])  # show 8 characters of context

ther dog
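
As a further sketch of word boundaries, the example below collects each whole-word match together with its start index. The longer input sentence is made up for illustration; it adds 'Then' to show that a word merely starting with 'The' is not matched.

```python
import re

input_string = 'The other dog is happy. Then the cat left.'
whole_word = re.compile(r'\b[tT]he\b')

# finditer() yields match objects, which carry positions as well as text
hits = [(m.group(), m.start()) for m in whole_word.finditer(input_string)]

print(hits)   # [('The', 0), ('the', 29)]
```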


The metacharacters, which have special meanings and therefore do not match themselves, are: . ^ $ * + ? { } [ ] \ | ( ). To match one of them literally, escape it with a backslash.

In [28]:
input_string = 'Cookies are here! Cookie is here!'
# ^ anchors the match at the start of the string
matches = re.findall(r'^Cookies?', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookies'
else:
    print ('Did not find')

Found Cookies


In [29]:
input_string = 'Cookies are here! Cookie'
# $ anchors the match at the end of the string
matches = re.findall(r'Cookies?$', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookie'
else:
    print ('Did not find')

Found Cookie


In [30]:
input_string = 'The dog.'
# \. matches a literal period; ^ and $ require the pattern to span the whole string
matches = re.findall(r'^The dog\.$', input_string)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found The dog.'
else:
    print ('Did not find')

Found The dog.


In [31]:
input_string = r'Cookies are here! Cookie \t' # raw string: \t stays a backslash followed by 't', not a tab
match = re.search(r'\\t', input_string) # \\ matches one literal backslash
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group())  ## 'found \t'
else:
    print ('Did not find')

Found \t
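
When the text to search for comes from data rather than being typed by hand, escaping each metacharacter manually is error-prone. One possible approach, sketched here with made-up inputs, is the standard re.escape function, which adds the backslashes automatically.

```python
import re

literal_text = 'The dog.'
escaped_pattern = re.escape(literal_text)

# The escaped pattern matches only the literal text, period included
exact = re.search(escaped_pattern, 'I said: The dog.')
# Unescaped, the dot is a wildcard and also matches the 's' in 'The dogs'
loose = re.search(literal_text, 'The dogs')
strict = re.search(escaped_pattern, 'The dogs')

print(exact.group())   # The dog.
print(loose.group())   # The dogs
print(strict)          # None
```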


In [32]:
input_string = 'My email is xiaofengzhu2013@u.northwestern.edu'
match = re.search(r'[\w.-]+@[\w.-]+', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found xiaofengzhu2013@u.northwestern.edu'
else:
    print ('Did not find')

Found xiaofengzhu2013@u.northwestern.edu


In [33]:
input_string = 'My email is xiaofengzhu2013@u.northwestern.edu. His email is xyxyxy@u.northwestern.edu'
matches = re.findall(r'(((xiaofengzhu2013)|(xy){3})@[\w.-]+)', input_string)
# The pattern contains groups, so findall() returns one tuple of groups per match
if matches:
    for match in matches:
        print ('Found', match[0]) ## match[0] is the outermost group; [\w.-]+ also captures the trailing period of the first email
else:
    print ('Did not find')

Found xiaofengzhu2013@u.northwestern.edu.
Found xyxyxy@u.northwestern.edu
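
The first result above ends with the sentence's period, because `[\w.-]+` accepts any run of dots. A hedged alternative, offered as a sketch rather than a general email validator, writes the domain as dot-separated labels and uses non-capturing groups `(?:...)` so that findall() returns plain strings instead of tuples.

```python
import re

input_string = ('My email is xiaofengzhu2013@u.northwestern.edu. '
                'His email is xyxyxy@u.northwestern.edu')

# (?:...) groups without capturing, so findall() returns whole matches.
# Each domain label must be followed by '.label', so a trailing '.' is left out.
pattern = r'(?:xiaofengzhu2013|(?:xy){3})@[\w-]+(?:\.[\w-]+)+'
emails = re.findall(pattern, input_string)

print(emails)
```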


In [34]:
input_string = 'The 1st lab was scheduled on 10/2/2018.'
match = re.search(r'\d+/\d+/\d+', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found 10/2/2018'
else:
    print ('Did not find')

Found 10/2/2018


In [35]:
input_string = 'The 1st lab was scheduled on 10/2/2018.'
# [1-9]\d{3} requires a four-digit year that does not start with 0
match = re.search(r'\d+/\d+/[1-9]\d{3}', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found 10/2/2018'
else:
    print ('Did not find')

Found 10/2/2018
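
To pull the individual parts out of the date, named groups are one option. The sketch below assumes the date is written month/day/year, which the original text does not state; the group names `month`, `day`, and `year` are illustrative.

```python
import re

input_string = 'The 1st lab was scheduled on 10/2/2018.'

# Assumption: the date format is month/day/year
date_pattern = re.compile(
    r'(?P<month>\d{1,2})/(?P<day>\d{1,2})/(?P<year>[1-9]\d{3})'
)
match = date_pattern.search(input_string)
# groupdict() maps each group name to the text it matched
parts = match.groupdict() if match else {}

print(parts)   # {'month': '10', 'day': '2', 'year': '2018'}
```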


In [3]:
input_string = 'Cookie COOKIe'
# re.I (short for re.IGNORECASE) makes the match case-insensitive
matches = re.findall(r'cookie', input_string, re.I)
# If-statement after findall() tests if it succeeded
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookie and COOKIe'
else:
    print ('Did not find')

Found Cookie
Found COOKIe
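
Flags can be combined with the `|` operator. As an illustrative sketch with a made-up three-line string, re.I is combined with re.M (re.MULTILINE), which lets `^` match at the start of every line rather than only at the start of the string.

```python
import re

input_string = 'Cookie one\ncookie two\nnot a COOKIE'

# re.I ignores case; re.M lets ^ match at the start of every line
line_starts = re.findall(r'^cookie', input_string, re.I | re.M)
# Without re.M, ^ matches only at the very start of the string
string_start = re.findall(r'^cookie', input_string, re.I)

print(line_starts)    # ['Cookie', 'cookie']
print(string_start)   # ['Cookie']
```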


## <a id="ref">References</a>

* [More Python regex](https://www.tutorialspoint.com/python/python_reg_expressions.htm)
* [Test your regex pattern on regex101](https://regex101.com/)