## Regular Expression

* A powerful tool for matching patterns, especially in text datasets

In [11]:
import re
string = "an example word:cat!!"
match = re.search(r'word:\w\w\w', string)

# If-statement after search() tests if it succeeded
if match:
    print('found', match.group())
else:
    print('did not find')

found word:cat
<_sre.SRE_Match object; span=(11, 19), match='word:cat'>


* match.group() returns the matched text
* The 'r' prefix designates a Python 'raw' string, which passes backslashes through unchanged
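
A minimal sketch of why the raw-string prefix matters. The sample text and variable names are made up for illustration; `\b` is used because it means "backspace" in an ordinary Python string but "word boundary" to the regex engine.

```python
import re

# In an ordinary string literal, "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string literal, r"\b" stays as backslash + "b", which the
# regex engine reads as a word boundary.
raw = r"\bcat\b"

plain_len = len(plain)  # 5 characters: backspace, c, a, t, backspace
raw_len = len(raw)      # 7 characters: \, b, c, a, t, \, b

text = "a cat sat"
plain_match = re.search(plain, text)  # None: text contains no backspace
raw_match = re.search(raw, text)      # matches the word 'cat'

print(plain_len, raw_len)
print(plain_match, raw_match.group())
```

Writing every pattern as a raw string avoids having to remember which escapes Python itself would consume.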

### Basic Patterns

* a, X, 9 <- ordinary characters just match themselves
* . :matches any single character except newline \n
* \w :a "word" character, a letter or digit or underscore [a-zA-Z0-9_]
* \W :any non-word character
* \b :boundary between word and non-word
* \s :single whitespace
* \S :non-whitespace character
* \t, \n, \r - tab, newline, return
* \d :decimal digit [0-9] 
* ^:start, $:end
* \ :removes the special meaning of the next character, e.g. \. matches a literal period and \\ matches a literal backslash
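
The patterns in the list above that the later cells do not exercise (`\b`, `^`, `$` and backslash escaping) can be sketched as follows. The sample strings and variable names are invented for illustration.

```python
import re

# \b: match 'cat' only as a whole word, not inside 'concatenate'
whole_word = re.search(r'\bcat\b', 'concatenate the cat')

# ^ and $: anchor the pattern to the start and end of the string
all_digits = re.search(r'^\d+$', '12345')      # matches the whole string
not_all_digits = re.search(r'^\d+$', '123a5')  # None because of the 'a'

# \. matches only a literal period; an unescaped . matches any character
escaped_dot = re.search(r'3\.14', '3x14')    # None
unescaped_dot = re.search(r'3.14', '3x14')   # matches '3x14'

# \\ in the pattern matches one literal backslash in the text
backslash = re.search(r'\\', 'C:\\temp')     # the text is C:\temp

print(whole_word.span(), all_digits.group(), not_all_digits)
print(escaped_dot, unescaped_dot.group(), backslash.span())
```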

### Basic Example

In [18]:
# Search for pattern 'iii' in string 'piiig'
match = re.search(r'iii', 'piiig')

## . = any char but \n
match = re.search(r'..g', 'piiig')

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')
match = re.search(r'\w\w\w', '@@abcd!!')
print('found', match.group())

found abc


### Repetition

* +: 1 or more times
* *: 0 or more times
* ?: 0 or 1 times

In [26]:
# i+ = one or more i's, as many as possible
match = re.search(r'pi+', 'piiig')

# Look for 3 digits, possibly separated by whitespace
match = re.search(r'\d\s*\d\s*\d', "xx1 2    3xx")

# ^ anchors the match to the start of the string
match = re.search(r'^\w+', 'foobar')
match.group()

'foobar'
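
The cell above covers `+` and `*` but not `?`. As a rough sketch with made-up sample strings, the difference between the three quantifiers looks like this:

```python
import re

# ? makes the preceding item optional: the 'u' may appear 0 or 1 times
us = re.search(r'colou?r', 'the color red').group()
uk = re.search(r'colou?r', 'the colour red').group()

# * accepts zero repetitions, so it can succeed with an empty match
# at the very first position (before the 'p')
empty = re.search(r'i*', 'piiig').group()

# + needs at least one 'i', so the search moves on to the run of i's
run = re.search(r'i+', 'piiig').group()

print(us, uk, repr(empty), run)
```

The empty match from `i*` is a common surprise: search() returns the leftmost match, and an empty one at position 0 qualifies.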

### Email example

In [42]:
string = 'purple alice-b@google.com monkey dishwasher'
exp = r'\w+@\w+'
match = re.search(exp, string)
match.group()

'b@google'

### Square Brackets

In [41]:
exp = r'[\w.-]+@[\w.-]+'
match = re.search(exp, string)
match.group()
# [^ab] matches any single character except "a" or "b"

'alice-b@google.com'
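
A short sketch of the negated form `[^...]` mentioned in the comment above, which the cell itself does not run. The pattern `[^\s@]+` is an illustrative alternative, not the pattern used elsewhere in these notes.

```python
import re

# [^ab] matches one character that is neither 'a' nor 'b'
first_other = re.search(r'[^ab]', 'abacus').group()

# [^\s@]+ matches a run of characters that are not whitespace or '@'
text = 'purple alice-b@google.com monkey dishwasher'
email = re.search(r'[^\s@]+@[^\s@]+', text).group()

# Inside brackets a dot is literal, and a dash placed last is literal too
dotted = re.search(r'[.-]+', 'a.-.b').group()

print(first_other, email, dotted)
```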

### Group extraction

* The group feature lets you pick out parts of the matching text. Add parentheses () around the username and host in the pattern. The parentheses do not change what the pattern will match; instead they establish logical "groups" inside the matched text.

In [40]:
string = 'purple alice-b@google.com monkey dishwasher'
exp = r'([\w.-]+)@([\w.-]+)'
match = re.search(exp, string)
if match:
    print(match.group())
    print(match.group(1))
    print(match.group(2))

alice-b@google.com
alice-b
google.com
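
As a small supplement to the cell above, the match object also offers `group(0)` and `groups()`. This sketch reuses the same sample string; the variable names are chosen here for illustration.

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

# group(0) is the same as group(): the whole matched text
whole = match.group(0)

# groups() returns all captured groups as one tuple, handy for unpacking
user, host = match.groups()

print(whole, user, host)
```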


### Find all

* re.search(): finds the first match only
* re.findall(): returns all matches as a list of strings

In [39]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwater'
exp = r'[\w\.-]+@[\w\.-]+' # simple pattern for email-like text (not a full email validator)
emails = re.findall(exp, string) 
for email in emails:
    print(email)

alice@google.com
bob@abc.com


### Findall with files

In [43]:
# Open the file, read its whole contents, and find every match in one call
with open("test.txt", "r") as f:
    strings = re.findall(r'some pattern', f.read())

FileNotFoundError: [Errno 2] No such file or directory: 'test.txt'
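
The cell above fails only because test.txt does not exist. A self-contained sketch of the same idea, assuming a throwaway temporary file with invented contents, could look like this:

```python
import os
import re
import tempfile

# Create a small temporary file so the example does not depend on test.txt
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alice@google.com\nnot an email\nbob@abc.com\n")
    path = tmp.name

try:
    # Read the whole file into one string and hand it to findall()
    with open(path, "r") as f:
        found = re.findall(r'[\w.-]+@[\w.-]+', f.read())
finally:
    os.remove(path)  # clean up the temporary file

print(found)
```

Reading the entire file with f.read() is fine for small files; for very large files, looping over lines and calling findall() per line may be a better fit.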

### Findall and groups

In [60]:
strings = 'purple alice@google.com, blah monkey bob@abc.com blah dishwater'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', strings)
print(tuples)
for pair in tuples:  # avoid naming the loop variable 'tuple', which shadows the built-in
    print(pair[0]) # username
    print(pair[1]) # host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com


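* Options: optional flags passed as an extra argument, e.g. re.search(exp, string, re.IGNORECASE)
    * IGNORECASE - ignore upper/lower case differences when matching
    * DOTALL - allow . to match newline \n (normally it matches anything except \n)
    * MULTILINE - allow ^ and $ to match the start and end of each line (normally they match only the start and end of the whole string)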
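
A minimal sketch of the three options in use. The three-line sample text is made up for illustration; the flags themselves are the standard `re.IGNORECASE`, `re.MULTILINE` and `re.DOTALL` constants.

```python
import re

text = "First line\nsecond LINE\nthird line"

# IGNORECASE: 'line' also matches 'LINE'
any_case = re.findall(r'line', text, re.IGNORECASE)

# MULTILINE: ^ matches at the start of every line, not only the string
line_starts = re.findall(r'^\w+', text, re.MULTILINE)
string_start = re.findall(r'^\w+', text)

# DOTALL: . also matches '\n', so .* can run across line breaks
one_line = re.search(r'First.*', text).group()
all_lines = re.search(r'First.*', text, re.DOTALL).group()

# Flags can be combined with |
combined = re.findall(r'^s\w+ line$', text, re.IGNORECASE | re.MULTILINE)

print(any_case, line_starts, string_start)
print(repr(one_line), repr(all_lines), combined)
```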

### Substitution

* re.sub(exp, replacement, str) - returns a new string with every match of exp replaced by replacement
* In the replacement, \1 and \2 refer to the text matched by group 1 and group 2

In [59]:
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', r'\1@yo-yo-dyne.com', strings))

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwater
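
The cell above uses only `\1`. As a hedged sketch, the following shows `\2` as well, plus the optional `count` argument of re.sub() to limit the number of replacements. The replacement formats are invented for illustration.

```python
import re

strings = 'purple alice@google.com, blah monkey bob@abc.com blah dishwater'
exp = r'([\w.-]+)@([\w.-]+)'

# Use both groups in the replacement: host first (\2), then username (\1)
swapped = re.sub(exp, r'\2 (user: \1)', strings)

# count=1 replaces only the first match and leaves the rest untouched
first_only = re.sub(exp, r'\1@yo-yo-dyne.com', strings, count=1)

print(swapped)
print(first_only)
```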
