# Regular expressions

Regular expressions (regexes) let us search and manipulate text by pattern.

They are helpful to:

- standardize / clean batches of data (replace)
- extract pieces of information (parsing)

Python module name: re



In [2]:
import re

To test a regular expression you can also try

https://regex101.com/

where you can enter the regular expression and a test string.

BASICS

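- . any char (except newline, by default)
- ^ start of the string (or of each line with re.MULTILINE)
- $ end of the string (or of each line with re.MULTILINE)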




GROUPS and SPECIAL CHARS
- \b to match a word boundary
- \n or \r\n newline
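- \s whitespace    (vs \S : non-whitespace)
- \d digits [0-9]  (vs \D : non-digits)
- \w word chars [0-9A-Za-z_]  (vs \W : non-word chars)

(in Python 3, \d, \w and \s also match Unicode digits, letters and spaces unless the re.ASCII flag is set)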
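
The special sequences above can be tried out with a short sketch like the following. The sample sentence is made up for illustration.

```python
import re

text = "Order 66 shipped to cat_lover on 2020-01-15"

digits = re.findall(r"\d+", text)         # runs of digits
words = re.findall(r"\w+", text)          # runs of word characters
pieces = re.split(r"\s+", text)           # split on whitespace
whole_cat = re.search(r"\bcat\b", text)   # 'cat' only as a whole word

print(digits)     # ['66', '2020', '01', '15']
print(words[4])   # cat_lover
print(pieces[-1]) # 2020-01-15
print(whole_cat)  # None: '_' is a word char, so there is no boundary after 'cat'
```

Note how \w+ keeps 'cat_lover' in one piece (the underscore is a word character) but splits the date at the hyphens.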



REPETITIONS:

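- () groups a sequence of chars, so that the following quantifiers apply to the whole group:
- match one or more repetitions:  +
- match 0 or more repetitions:    *
- match 0 or 1 repetitions: ?  (placed after another quantifier, as in +? or *?, it makes that quantifier lazy instead of greedy)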
- a{4} matches exactly 4 repetitions
- 0{3,7} matches at least 3 but no more than 7 repetitions
- | = or

example (ab|cd)+
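
A small sketch of the quantifiers above, including the (ab|cd)+ example. The test strings are assumptions chosen for illustration; fullmatch() is used so that the whole string has to fit the pattern.

```python
import re

# (ab|cd)+ : one or more repetitions of either 'ab' or 'cd'
rx = re.compile(r"(ab|cd)+")
full = rx.fullmatch("abcdab")     # matches: ab, cd, ab
partial = rx.fullmatch("abc")     # None: the trailing 'c' is neither 'ab' nor 'cd'

# {m} and {m,n}
exact = re.fullmatch(r"a{4}", "aaaa")
ranged = re.findall(r"0{3,7}", "00 0000 000000000")

# ? makes the preceding item optional
optional = re.findall(r"colou?r", "color colour")

print(full is not None, partial)  # True None
print(exact is not None)          # True
print(ranged)                     # ['0000', '0000000']
print(optional)                   # ['color', 'colour']
```

In the {3,7} case the run of two zeros is too short to match, and the run of nine zeros yields a single match of seven, since the two zeros left over are again too short.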

SETS

- [abc]: a b or c
- [a-z]: any letter a to z (ascii order) 
- [^abc]: negated set: any char except a, b or c

example: [0-9] = \d (for ASCII digits)
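
The sets above can be checked with a short sketch; the sample string is an illustrative assumption.

```python
import re

text = "abc-123 xyz"

any_of_abc = re.findall(r"[abc]", text)      # single chars a, b or c
lowercase_runs = re.findall(r"[a-z]+", text) # runs of lowercase ASCII letters
not_lowercase = re.findall(r"[^a-z]+", text) # runs of anything else
digit_set = re.findall(r"[0-9]+", text)
digit_class = re.findall(r"\d+", text)

print(any_of_abc)                # ['a', 'b', 'c']
print(lowercase_runs)            # ['abc', 'xyz']
print(not_lowercase)             # ['-123 ']
print(digit_set == digit_class)  # True
```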


SYMBOLIC ACCESS
- (?P<name>...) named group: the substring matched by the group is accessible via the symbolic group name "name"
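
A minimal sketch of symbolic access. The group names year, month and day and the sample text are assumptions made for this example.

```python
import re

rx = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
match = rx.search("released on 2012-12-24")

year = match.group("year")     # access by name
first = match.group(1)         # named groups are still numbered
fields = match.groupdict()     # all named groups as a dict

print(year)    # 2012
print(first)   # 2012
print(fields)  # {'year': '2012', 'month': '12', 'day': '24'}
```

groupdict() is convenient when the parsed fields go straight into a dictionary or a DataFrame row.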





## Find

To find an expression within a text, use the method .search(text).

We first need to compile() the pattern, then apply search() to the compiled pattern.


In [9]:
patterns= ['IBM', 'APPLE', 'INTEL']
    
lines = ['GOOGLE CORP', 'AMAZON LTD', 'APPLE CORP']    
    
for line in lines:    
    for pattern in patterns:
        rx = re.compile(pattern)    #  these 3 lines might be  
        match = rx.search(line)     #  compressed in one:
        if match:                   #  if re.compile(pattern).search(line):
            print(line)
    

APPLE CORP
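
The loop above recompiles every pattern for every line. A possible variation, sketched here under the assumption that the patterns are plain literal names, compiles them once up front or joins them into a single alternation. The names compiled, combined and matching_lines are introduced only for this example.

```python
import re

patterns = ['IBM', 'APPLE', 'INTEL']
lines = ['GOOGLE CORP', 'AMAZON LTD', 'APPLE CORP']

# Option 1: compile each pattern once, outside the loop over lines
compiled = [re.compile(p) for p in patterns]
matching_lines = [line for line in lines
                  if any(rx.search(line) for rx in compiled)]

# Option 2: one regex with all names joined by '|'
# (re.escape assumes the names are literals, not regex syntax)
combined = re.compile("|".join(re.escape(p) for p in patterns))
matching_lines_alt = [line for line in lines if combined.search(line)]

print(matching_lines)      # ['APPLE CORP']
print(matching_lines_alt)  # ['APPLE CORP']
```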


### Regex parsing

Regexes are very powerful when used together with dictionaries
(hardcoded or, even better, loaded from a file / external DB).

Combined with named groups (?P<variable_name>regexp_pattern), this allows data parsing.


In [15]:
rx_dict = {'City':     re.compile(r' (?P<City>(\w+)), \w+$'),
           'Country':  re.compile(r', (?P<Country>(\w+))$'),
           'Zip':      re.compile(r'(?P<Zip>(\d{3,5}))') }

line= "00014 UNIV HELSINKI, DEPT FOREST ECOL, HELSINKI, FINLAND"

for key, rx in rx_dict.items():
    match = rx.search(line)
    if match:
        print(key, match.group(key)) 

City HELSINKI
Country FINLAND
Zip 00014



Some patterns, however, are harder to read than others.
Can anybody guess what type of address data this pattern can spot?

(?:[A-Za-z]\d ?\d[A-Za-z]{2})|(?:[A-Za-z][A-Za-z\d]\d ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d{2} ?\d[A-Za-z]{2})|(?:[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]{2})|(?:[A-Za-z]{2}\d[A-Za-z] ?\d[A-Za-z]{2})


## Replace

re.sub(pattern, repl, string, count=0, flags=0)

Allows multiple find-and-replace operations: if the pattern contains several groups in parentheses,
repl can refer to them as \1 .. \N, mixed with other text.


#### Find (and collapse) consecutive duplicate lines in a text:

PATTERN : ^(.*)(\r?\n\1)+$ 
REPLACE : \1
    
The caret will match only at the start of a line (in Python this requires the re.MULTILINE flag).
So the regex engine will only attempt to match the remainder of the regex there. 
The dot and star combination simply matches an entire line, whatever its contents, if any. 
The parentheses store the matched line into the first backreference.

Next we will match the line separator. 
The question mark in \r?\n makes this regex work with both Windows (\r\n) and UNIX (\n) text files.
So up to this point we matched a line and the following line break.

Now we need to check if this combination is followed by a duplicate of that same line. 
We do this simply with \1. This is the first backreference which holds the line we matched. 
The backreference will match that very same text.    
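
The pattern and replacement described above can be tried in Python roughly as follows. The sample text is an assumption, and re.MULTILINE is needed so that ^ and $ match at every line rather than only at the start and end of the whole string.

```python
import re

text = "alpha\nbeta\nbeta\nbeta\ngamma\ngamma\ndelta"

# ^(.*)          one whole line, stored as group 1
# (\r?\n\1)+     one or more following lines identical to group 1
# $              end of the last duplicate line
rx = re.compile(r"^(.*)(\r?\n\1)+$", flags=re.MULTILINE)

deduped = rx.sub(r"\1", text)

print(deduped.split("\n"))  # ['alpha', 'beta', 'gamma', 'delta']
```

Only duplicates on adjacent lines are collapsed; a line that repeats later in the text, after a different line, is left alone.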
    

In [15]:
# reorders "surname, name initial" to "name initial surname"

name = "Goode, Johnny B."

pattern= re.compile(r'([\w-]+), ([\w-]+) ([A-Z]\.)')
replace = r'\2 \3 \1'
           
re.sub(pattern, replace, name, count=0, flags=0)           



'Johnny B. Goode'

In [3]:
# from MM/DD/YYYY to DD/MM/YYYY

name = "12/24/2012, $50.00"

pattern= re.compile(r'(\d+)/(\d+)/(\d+)')
replace = r'\2/\1/\3'
           
re.sub(pattern, replace, name, count=0, flags=0)       




'24/12/2012, $50.00'

### Remove


(?:...) is a non-capturing version of regular parentheses. It matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after the match; this is helpful to save memory and to keep the number of groups down.
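
A quick sketch of the difference between capturing and non-capturing parentheses; the sample string is an assumption.

```python
import re

text = "10020-9020"

capturing = re.search(r"(\d{5})(-\d{4})", text)
non_capturing = re.search(r"(\d{5})(?:-\d{4})", text)

print(capturing.groups())      # ('10020', '-9020')
print(non_capturing.groups())  # ('10020',)
print(non_capturing.group(0))  # 10020-9020
```

Both patterns match the same text (group 0 is identical); only the list of retrievable groups changes.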

Example: remove the optional part (the -NNNN extension) of the ZIP code in US addresses


In [16]:

name = "New York 10020-9020 United States"

pattern= re.compile(r'( \d{5})(?:-\d{4})')
replace = r',\1,'
           
re.sub(pattern, replace, name, count=0, flags=0)     

'New York, 10020, United States'

### Greediness and laziness

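By default quantifiers are greedy: .* matches as much text as possible.
Adding a question mark (.*?) makes the quantifier lazy, so it matches as little as possible.

Example: remove HTML tags
(don't do this with regexes in real code: the BeautifulSoup module is a better tool)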





In [30]:

name = 'Today, I built a <a href="http://example.com" style="color:green;">website</a> which is now listed <a class="header remoteLink" href="http://www.google.com">on various search engines</a>.'

pattern= re.compile(r'(<.*?>)')
replace = r''
           
re.sub(pattern, replace, name, count=0, flags=0)   


'Today, I built a website which is now listed on various search engines.'
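
For contrast, here is a small sketch comparing the greedy and the lazy version of the same pattern. The HTML snippet is made up for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: '.*' runs from the first '<' to the LAST '>' in the string
greedy = re.sub(r"<.*>", "", html)

# Lazy: '.*?' stops at the first '>' it can, so each tag is matched separately
lazy = re.sub(r"<.*?>", "", html)

print(repr(greedy))  # ''
print(repr(lazy))    # 'bold and italic'
```

The greedy pattern swallows the text between the tags as well, which is why the lazy form is used in the cell above.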