In [1]:
import re

# Regular Expression Note

* [Summary](#summary)
* [Example](#example)
* [Note](#note)
* [Reference](#refer)

## <a id='summary'>Summary</a>

* **a|b** -- Matches either a or b.

* **+** -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
* **\*** -- 0 or more occurrences of the pattern to its left
* **?** -- match 0 or 1 occurrences of the pattern to its left
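* **{n}** -- repeat exactly n times
* **{n,}** -- repeat at least n times
* **{n,m}** -- repeat between n and m times, inclusive (write it without a space after the comma; Python's re does not treat {n, m} as a quantifier)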
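
As a quick check of the counted quantifiers above, here is a minimal sketch. The sample text and variable names are made up for illustration, and `\b` (word boundary, described further down) is used only to keep each run of a's separate.

```python
import re

# Illustrative sample text: runs of 1, 2, 3 and 4 a's
text = 'a aa aaa aaaa'

# \b on both sides forces the quantifier to cover the whole run of a's
exactly_two = re.findall(r'\ba{2}\b', text)     # runs of exactly 2
at_least_two = re.findall(r'\ba{2,}\b', text)   # runs of 2 or more
two_to_three = re.findall(r'\ba{2,3}\b', text)  # runs of 2 to 3

print(exactly_two)   # ['aa']
print(at_least_two)  # ['aa', 'aaa', 'aaaa']
print(two_to_three)  # ['aa', 'aaa']
```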

* **[]**: Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'.

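* **()** -- Capture group: the text matched by the pattern inside the parentheses is captured for later retrieval. The contents are an ordinary pattern, not a character set. For example, (a-z0-9) matches and captures the literal text "a-z0-9", whereas ([a-z0-9]) captures a single lowercase letter or digit.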
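
The difference between parentheses and square brackets can be sketched as follows. The input strings and variable names are invented for illustration only.

```python
import re

# Parentheses group an ordinary pattern: here the literal text 'a-z0-9'
literal_group = re.search(r'(a-z0-9)', 'code a-z0-9 end')
print(literal_group.group(1))  # a-z0-9

# Square brackets inside the group define a character set instead
set_group = re.search(r'([a-z0-9]+)@', 'mail bob42@example.com')
print(set_group.group(0))  # bob42@  (the whole match)
print(set_group.group(1))  # bob42   (only the captured part)
```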

* a, X, 9, < -- ordinary characters just match themselves exactly.  
* **.** (a period) -- matches any single character except newline '\n'

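* **\d** -- decimal digit [0-9] (some older regex utilities do not support \d). In Python 3, \d in a str pattern also matches other Unicode decimal digits unless the re.ASCII flag is set.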
* **\D** -- Match a nondigit: [^0-9]

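* **\w** -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. In Python 3, \w in a str pattern also matches non-ASCII letters and digits unless the re.ASCII flag is set.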
* **\W** -- (upper case W) matches any non-word character.

* \t, \n, \r -- tab, newline, return

* The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? {} [ ] \ | ( )

* **^** (Caret) -- match the start of the string
* **$**  -- match the end of the string

* **\s** -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed, vertical tab [ \n\r\t\f\v]
* **\S** -- (upper case S) matches any non-whitespace character.

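* **\b** -- word boundary: a zero-width position between a word character (\w) and a non-word character (\W), or at the start or end of the string next to a word character
* **\B** -- non-word boundary: any position that is not a word boundary
-> Unlike \s+, \b consumes no characters, so no whitespace ends up in the match. It also matches at the start and end of the input string and next to punctuation.
-> If we just use \s+ on both sides, a word at the very start or end of the string is not matched, because there is no whitespace there to consume.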
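
To see how `\b` and `\s+` behave next to punctuation and at the ends of the string, here is a small sketch. The sample sentence and variable names are made up for illustration.

```python
import re

# Illustrative text: 'the' appears at the start, after a comma,
# between spaces, and at the very end
text = 'The end,the start of the day: the'

# \b is zero-width, so only the word itself is returned
with_boundary = re.findall(r'\b[tT]he\b', text)

# \s+ must consume real whitespace on both sides of the word
with_spaces = re.findall(r'\s+[tT]he\s+', text)

print(with_boundary)  # ['The', 'the', 'the', 'the']
print(with_spaces)    # [' the ']
```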

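* **\** -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure whether a punctuation character has special meaning, you can put a backslash in front of it to make sure it is treated just as a character. A backslash followed by a letter usually has its own meaning (\t is a tab), so to match the two literal characters backslash and t, escape the backslash itself: \\t.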
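
When the text to match comes from a variable, `re.escape` from the standard library adds the backslashes for you. The sketch below uses invented sample strings and variable names.

```python
import re

# An unescaped '.' matches any character, so '3.50' also matches '3x50'
unescaped = re.search(r'3.50', 'Total: 3x50')
# An escaped '\.' matches only a literal period
escaped = re.search(r'3\.50', 'Total: 3x50')

print(unescaped is not None)  # True
print(escaped is None)        # True

# re.escape() escapes the special characters ('$' and '.') in a literal string
literal = '$3.50'
found = re.search(re.escape(literal), 'Total: $3.50 (approx.)')
print(found.group())  # $3.50
```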

## <a id='example'>Examples</a>

In [2]:
# example 1
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'[cC]o+kies?', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match)
else:
    print ('Did not find')

Found Cookies
Found cookie
Found Cookie
Found Coookie


In [3]:
# example 2
input_string = 'aaa'
matches = re.findall(r'a*', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'aaa', then an empty match at the end of the string (a* can match zero characters)
else:
    print ('Did not find')

Found aaa
Found 


In [4]:
# example 3
input_string = 'aaa'
matches = re.findall(r'a+', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'aaa' only: a+ needs at least one a, so there is no empty match
else:
    print ('Did not find')

Found aaa


In [5]:
# example 4
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'[a-zA-Z]+', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'found all words'
else:
    print ('Did not find')

Found Sam
Found Cookies
Found are
Found here
Found Julie
Found The
Found cookie
Found is
Found very
Found good
Found Cookie
Found Coookie


In [6]:
# example 5
input_string = '12348976'
matches = re.findall(r'[0-9]', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## each digit is a separate match because there is no quantifier
else:
    print ('Did not find')

Found 1
Found 2
Found 3
Found 4
Found 8
Found 9
Found 7
Found 6


In [7]:
# example 6
input_string = 'The dog is cute.'
matches = re.findall(r'[^A-Z]', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## every character except the uppercase letters A-Z
else:
    print ('Did not find')

Found h
Found e
Found  
Found d
Found o
Found g
Found  
Found i
Found s
Found  
Found c
Found u
Found t
Found e
Found .


In [8]:
# example 7
input_string = 'Sam: 100 cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'[0-9a-zA-Z]+', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'found all words and digits'
else:
    print ('Did not find')

Found Sam
Found 100
Found cookies
Found are
Found here
Found Julie
Found The
Found cookie
Found is
Found very
Found good
Found Cookie
Found Coookie


In [9]:
# example 8
input_string = 'beggin'
matches = re.findall(r'beg.n', input_string)
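# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## not reached: '.' matches exactly one character, but 'beggin' has two ('gi') between 'beg' and 'n'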
else:
    print ('Did not find')

Did not find


In [10]:
# example 9
input_string = 'Sam: 100 cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'.+', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## the whole line: '.' matches every character except a newline
else:
    print ('Did not find')

Found Sam: 100 cookies are here! Julie: The cookie is very good!! Cookie Coookie


In [12]:
# example 10
input_string = 'Sam: 100 cookies are here! Julie: The cookie is very good. Cookie Coookie'
matches = re.findall(r'[\d\w]+', input_string)  # \d is redundant here: \w already includes digits
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'found all words and digits'
else:
    print ('Did not find')

Found Sam
Found 100
Found cookies
Found are
Found here
Found Julie
Found The
Found cookie
Found is
Found very
Found good
Found Cookie
Found Coookie


In [13]:
# example 11
input_string = 'Sam: Cookies are here! Julie: The cookie is very good!! Cookie Coookie'
matches = re.findall(r'\w+\s\w+', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## pairs of words separated by exactly one whitespace character
else:
    print ('Did not find')

Found Cookies are
Found The cookie
Found is very
Found Cookie Coookie


In [14]:
# example 12
input_string = 'The other dog is happy.'
matches = re.findall(r'[tT]he', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'The', and also the 'the' inside 'other'
else:
    print ('Did not find')

Found The
Found the


In [15]:
# example 13
input_string = 'The other dog is the happiest dog. The'
matches = re.findall(r'\s+[tT]he\s+', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## only the middle ' the ': the first 'The' has no whitespace before it, the last has none after it
else:
    print ('Did not find')

Found  the 


In [16]:
# example 14
input_string = 'The other dog is the happiest dog. The'
matches = re.findall(r'\b[tT]he\b', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## all three whole words 'The', 'the', 'The', but not the 'the' inside 'other'
else:
    print ('Did not find')

Found The
Found the
Found The


In [17]:
# example 15
input_string = 'The other dog is happy.'
matches = re.findall(r'\B[tT]he\B', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'found the in "other"'
else:
    print ('Did not find')

Found the


In [18]:
# example 16
input_string = 'Cookies are here! Cookie is here!'
matches = re.findall(r'^Cookies?', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookies'
else:
    print ('Did not find')

Found Cookies


In [19]:
# example 17
input_string = 'Cookies are here! Cookie'
matches = re.findall(r'Cookies?$', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookie'
else:
    print ('Did not find')

Found Cookie


In [20]:
# example 18
input_string = r'Cookies are here! Cookie \t' # notice r'' is used in here
match = re.search(r'\\t', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group())  ## 'found \t'
else:
    print ('Did not find')

Found \t


In [21]:
# example 19
input_string = 'My email is xiaofengzhu2013@u.northwestern.edu@afaef'
match = re.search(r'[\d\w]+@.+', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## .+ is greedy, so the match runs to the end of the string, including '@afaef'
else:
    print ('Did not find')

Found xiaofengzhu2013@u.northwestern.edu@afaef


In [22]:
# example 20
input_string = 'My email is xiaofengzhu2013@u.northwestern.edu'
match = re.search(r'[\w.-]+@[\w.-]+', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found xiaofengzhu2013@u.northwestern.edu'
else:
    print ('Did not find')

Found xiaofengzhu2013@u.northwestern.edu


In [23]:
# example 21
input_string = 'My email is xiaofengzhu2013@u.northwestern.edu. His email is xyxyxy@u.northwestern.edu.'
matches = re.findall(r'(((xiaofengzhu2013)|(xy){3})@[\w.-]+)', input_string)
# findall() returns one tuple of groups per match; match[0] is the outermost group
if matches:
    for match in matches:
        print ('Found', match[0]) ## both emails, each with the trailing '.' because [\w.-]+ also matches periods
else:
    print ('Did not find')

Found xiaofengzhu2013@u.northwestern.edu.
Found xyxyxy@u.northwestern.edu.


In [24]:
# example 22
input_string = 'The 1st lab was scheduled on 10/2/2018.'
match = re.search(r'\d+/\d+/\d+', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found 10/2/2018'
else:
    print ('Did not find')

Found 10/2/2018


In [25]:
# example 23
input_string = 'The 1st lab was scheduled on 10/2/2018.'
match = re.search(r'\d+/\d+/[12]\d{3}', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'found 10/2/2018'
else:
    print ('Did not find')

Found 10/2/2018


In [26]:
# example 24
input_string = 'Cookie COOKIe'
matches = re.findall(r'cookie', input_string, re.I)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## 'found Cookie and COOKIe'
else:
    print ('Did not find')

Found Cookie
Found COOKIe


## <a id='note'>Notes</a>

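> **match** vs **search** vs **findall** vs **finditer**

* **match** checks for a match only at the beginning of the string
* **search** checks for a match anywhere in the string and returns the first one
* **findall** returns all non-overlapping matches as a list of strings (or, if the pattern contains capture groups, a list of the captured groups)
* **finditer** returns an iterator of match objects, one per non-overlapping match, which also gives the match positions via start() and end()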

In [5]:
input_string = 'The cookie is very good'
match = re.match(r'cookie', input_string)
# match() only looks at the beginning of the string
if match:
    print ('Found', match.group()) ## not reached: the string starts with 'The', not 'cookie'
else:
    print ('Did not find')

Did not find


In [16]:
input_string = 'The cookie is very good. cookie'
match = re.search(r'cookie', input_string)
# If-statement after search() tests if it succeeded
if match:
    print ('Found', match.group()) ## 'cookie'
else:
    print ('Did not find')

Found cookie


In [17]:
input_string = 'The cookie is very good. cookie'
matches = re.findall(r'(coo)kie', input_string)
# With one capture group, findall() returns the group's text rather than the whole match
if matches:
    print ('Found', matches) ## ['coo', 'coo'], not 'cookie'
else:
    print ('Did not find')

Found ['coo', 'coo']


In [23]:
identified_pattern = []
input_string = 'The cookie is very good. cookie'
pattern_list = [r'cookie']
for p in pattern_list:
    # finditer() yields match objects; record (start, end, matched text) for each
    found = [(m.start(), m.end(), m.group()) for m in re.finditer(p, input_string)]
    identified_pattern += found

identified_pattern

[(4, 10, 'cookie'), (25, 31, 'cookie')]

> **?:**

A group is, by default, capturing -- meaning you can fetch groups (sub-matches inside parens) with the group(int) method. 

A non-capturing group is just that -- don't capture the submatch. The group(int) method doesn't return submatches from non-capturing groups. Generally, if you don't need the value of a submatch, the group should be non-capturing, as there is no reason for the regex engine to collect the data for you, if you don't intend to use it. 

For example,
* (?:abc){3} matches abcabcabc. No groups.
* (abc){3} matches abcabcabc. First group matches abc.

In short, the regex engine always reports the whole match as group 0. With (abc){3}, it also stores the text matched by the parenthesized sub-pattern as group 1. Because the group is repeated, group 1 holds the last repetition, abc. That is why we get two outputs: abcabcabc and abc. With (?:abc){3}, only the whole match is available.
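
The two cases above can be checked directly. This is a minimal sketch with illustrative variable names. The last pattern, `(a|b|c){3}`, is an extra example added here to show that a repeated group keeps only its final repetition.

```python
import re

text = 'abcabcabc'

capturing = re.search(r'(abc){3}', text)
non_capturing = re.search(r'(?:abc){3}', text)

print(capturing.group(0))      # abcabcabc  (whole match)
print(capturing.group(1))      # abc        (the captured group)
print(non_capturing.group(0))  # abcabcabc
print(non_capturing.groups())  # ()         (no groups were captured)

# A repeated capturing group keeps only the text of its last repetition
last_repeat = re.search(r'(a|b|c){3}', 'abc')
print(last_repeat.group(1))    # c
```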

Reference
* [refer](https://coderanch.com/t/466558/java/regular-expression)
* [notation-in-regular-expression](https://stackoverflow.com/questions/36524507/notation-in-regular-expression)
* [refcapture](https://www.regular-expressions.info/refcapture.html)

In [29]:
text = """COBOL is a compiled English-like computer programming language designed for business use. 122. On 30 OCT 2015 is a big date unlike 1 NOV 2010 """

# Alternative: accept any word starting with a month's first letter
# dates = re.findall(r"[\d]{1,2} [ADFJMNOS]\w* [\d]{4}", text)
# (?:...) groups the month alternatives without capturing them
dates = re.findall(r"([\d]{1,2}\s(?:JAN|NOV|OCT|DEC)\s[\d]{4})", text)
for s in dates:
    print(s)

30 OCT 2015
1 NOV 2010


In [4]:
text = """COBOL is a compiled English-like computer programming language designed for business use. 122. On 30 OCT 2015 is a big date unlike 1 NOV 2010 """

# The month group is now capturing, so findall() returns a tuple per match
dates = re.findall(r"([\d]{1,2}\s(JAN|NOV|OCT|DEC)\s[\d]{4})", text)
for s in dates:
    print(s)

('30 OCT 2015', 'OCT')
('1 NOV 2010', 'NOV')


> **?=**

Positive lookahead: For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
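
Because the lookahead checks the next character without consuming it, the match covers only the q. A small sketch, using an invented sample word and variable names:

```python
import re

# The match covers only the q: positions 0 to 1
only_q = re.search(r'q(?=u)', 'quit')
print(only_q.span())  # (0, 1)

# The u was not consumed, so the next pattern element is tested against 'u';
# 'i' does not match it, and the overall search fails
not_consumed = re.search(r'q(?=u)i', 'quit')
print(not_consumed)  # None

# Matching the u explicitly after the lookahead works
consumed = re.search(r'q(?=u)ui', 'quit')
print(consumed.group())  # qui
```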


In [34]:
input_string = 'queen q qu'
matches = re.findall(r'q(?=u)', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## the 'q' of 'queen' and the 'q' of 'qu'
else:
    print ('Did not find')
    
# the q in the first and third words is matched; the u is not part of the match

Found q
Found q


* [Lookahead and lookbehind (good explanation)](https://www.regular-expressions.info/lookaround.html)
* https://stackoverflow.com/questions/1570896/what-does-mean-in-a-regular-expression/1570916#1570916

> **?!**

Negative lookahead: q(?!u) matches a q that is not followed by a u, without making the u part of the match.

In [33]:
input_string = 'queen q qu'
matches = re.findall(r'q(?!u)', input_string)
# An empty list from findall() is falsy, so this tests whether anything matched
if matches:
    for match in matches:
        print ('Found', match) ## only the standalone 'q', which is not followed by a u
else:
    print ('Did not find')
    
# only the q in the second word is matched

Found q


## <a id='refer'>Reference</a>

* [Tutorialspoint - Python Regular Expressions](https://www.tutorialspoint.com/python/python_reg_expressions.htm)