In [48]:
%%html
<style>
table { text-align: left; display: block}
</style>

# String Manipulation and Regex Practice 
___

**References**
* https://docs.python.org/3/howto/regex.html
* https://developers.google.com/edu/python/regular-expressions
* https://www.regular-expressions.info/quickstart.html
* https://www.w3schools.com/python/python_regex.asp
* https://stackoverflow.com/questions/267399/how-do-you-match-only-valid-roman-numerals-with-a-regular-expression

**Best practice: write RegEx patterns as raw strings of the form `r"..."`**



```
Write every regex pattern as a raw string: pattern = r"..."

^ // matches the position just before the first character of the string (anchors to beginning of line)
$ // matches the position just after the last character of the string (anchors to end of line)
. // matches any single character except newline
* // matches the preceding element zero or more times

.* // zero or more of any character

^.*$ // matches, from beginning to end, any characters appearing zero or more times (this pattern is not very useful)

```
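
A small sketch of why the `r"..."` prefix matters. The sample text is made up for illustration; the point is that `"\b"` in an ordinary string literal is a backspace character, while `r"\b"` reaches the regex engine as a word boundary.

```python
import re

text = "rain ain't"  # illustrative sample string

# Without the r prefix, "\b" is a backspace character (ASCII 8), not a word boundary
plain = re.search("\bain", text)

# With the r prefix, the regex engine receives \b and treats it as a word boundary
raw = re.search(r"\bain", text)

print(plain)       # None
print(raw.span())  # (5, 8)
```

The raw pattern skips the "ain" inside "rain" because there is no word boundary between "r" and "a".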

**RegEx Functions**

* `findall`	Returns a list containing all matches
* `search`	Returns a Match object if there is a match anywhere in the string
* `split`	Returns a list where the string has been split at each match
* `sub`	Replaces one or many matches with a string
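
A quick sketch of the four functions side by side, using an arbitrary sample sentence chosen for illustration:

```python
import re

text = "The rain in Spain"  # illustrative sample string

all_ai = re.findall(r"ai", text)       # every non-overlapping match, as a list
first_space = re.search(r"\s", text)   # Match object for the first whitespace
words = re.split(r"\s", text)          # split the string at each whitespace
replaced = re.sub(r"\s", "9", text)    # replace each whitespace with "9"

print(all_ai)               # ['ai', 'ai']
print(first_space.start())  # 3
print(words)                # ['The', 'rain', 'in', 'Spain']
print(replaced)             # The9rain9in9Spain
```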

**RegEx Metacharacters**

| <div style="width:50px">Character</div>   | <div style="width:50px">Description</div>   | <div style="width:50px">Example</div>   |
| ------- | ------- | ------- |
|`[]`| A set of characters	| "[a-m]"	|
|`\`	| Signals a special sequence (can also be used to escape special characters)	| "\d"	|
|`.`	| Any character (except newline character)	| "he..o"	|
|`^`	| Starts with	| "^hello"	|
|`$`	| Ends with	| "planet$"	|
|`*`	| Zero or more occurrences	| "he.*o"	|
|`+`	| One or more occurrences	| "he.+o"	|
|`?`	| Zero or one occurrences	| "he.?o"	|
|`{}`	| Exactly the specified number of occurrences	| "he.{2}o"	|
| `\|`| Either or	| "falls\|stays"	|
|`()`	| Capture and group| |	 
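
A minimal sketch of three of the metacharacters above (`|`, `{}`, and `()`). The input strings are made up for illustration.

```python
import re

# | : either alternative may match
either = re.search(r"falls|stays", "The rain in Spain stays mainly in the plain")

# {2} : exactly two of the preceding element (here, any character)
exact = re.fullmatch(r"he.{2}o", "hello")
too_short = re.fullmatch(r"he.{2}o", "helo")

# () : capture parts of the match as groups
groups = re.search(r"(\d+)-(\d+)", "pages 10-20").groups()

print(either.group())  # stays
print(exact.group())   # hello
print(too_short)       # None
print(groups)          # ('10', '20')
```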


**RegEx Special Sequences**

| <div style="width:50px">Character</div>   | <div style="width:50px">Description</div>   | <div style="width:50px">Example</div>   |
| ------- | ------- | ------- |
|`\A`|	Returns a match if the specified characters are at the beginning of the string	|"\AThe"	|
|`\b`|	Returns a match where the specified characters are at the beginning or at the end of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string")	|r"\bain" r"ain\b"	|
|`\B`|	Returns a match where the specified characters are present, but NOT at the beginning (or at the end) of a word (the "r" in the beginning is making sure that the string is being treated as a "raw string")	|r"\Bain" r"ain\B"	 |
|`\d`|	Returns a match where the string contains digits (numbers from 0-9) |	"\d"|	
|`\D`|	Returns a match where the string DOES NOT contain digits	|"\D"	|
|`\s`|	Returns a match where the string contains a white space character |	"\s"	|
|`\S`|	Returns a match where the string DOES NOT contain a white space character	|"\S"|	
|`\w`|	Returns a match where the string contains any word characters (characters from a to Z, digits from 0-9, and the underscore _ character)	| "\w"	|
|`\W`|	Returns a match where the string DOES NOT contain any word characters	|"\W"	|
|`\Z`|	Returns a match if the specified characters are at the end of the string	|"Spain\Z"|
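
`\A` and `\Z` look like `^` and `$`, but they differ once `re.MULTILINE` is set: `^` and `$` then also match at line breaks, while `\A` and `\Z` still anchor only to the whole string. A short sketch with a made-up two-line string:

```python
import re

text = "one\ntwo"  # illustrative two-line string

line_starts = re.findall(r"^\w+", text, re.MULTILINE)     # start of every line
string_start = re.findall(r"\A\w+", text, re.MULTILINE)   # start of the string only
line_ends = re.findall(r"\w+$", text, re.MULTILINE)       # end of every line
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)     # end of the string only

print(line_starts)   # ['one', 'two']
print(string_start)  # ['one']
print(line_ends)     # ['one', 'two']
print(string_end)    # ['two']
```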


**RegEx Sets**

| <div style="width:50px">Character</div>   | <div style="width:50px">Description</div>   |
| ------- | ------- | 
|`[arn]`	|Returns a match where one of the specified characters (a, r, or n) are present	|
|`[a-n]`	|Returns a match for any lower case character, alphabetically between a and n	|
|`[^arn]`	|Returns a match for any character EXCEPT a, r, and n	|
|`[0123]`	|Returns a match where any of the specified digits (0, 1, 2, or 3) are present	|
|`[0-9]`	|Returns a match for any digit between 0 and 9	|
|`[0-5][0-9]`	|Returns a match for any two-digit numbers from 00 and 59	|
|`[a-zA-Z]`	|Returns a match for any character alphabetically between a and z, lower case OR upper case	|
|`[+]`	|In sets, `+`, `*`, `.`, `\|`, `()`, `$`, `{}` have no special meaning, so `[+]` means: return a match for any + character in the string|
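
A few of the sets above in action. The input strings are invented for illustration.

```python
import re

# [0-5][0-9] : two-digit numbers from 00 to 59
minutes = re.findall(r"[0-5][0-9]", "08:45 and 99")

# [+] : inside a set, + is a literal plus sign
plus = re.findall(r"[+]", "1+1=2")

# [^arn] : any character except a, r, and n (removed here, so only a/r/n remain)
kept = re.sub(r"[^arn]", "", "rain in spain")

print(minutes)  # ['08', '45']
print(plus)     # ['+']
print(kept)     # rannan
```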



In [2]:
import re

## Extract *only* alphanumerics, letters, digits, or upper/lower-case letters with RegEx


`re.sub(pattern, repl, string, count=0, flags=0)`
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl.
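
A short sketch of what "leftmost non-overlapping" and the `count` argument mean in practice, using throwaway strings:

```python
import re

# count=0 (the default) replaces every match
all_replaced = re.sub(r"\d", "#", "a1b2c3")

# count=1 replaces only the leftmost match
first_replaced = re.sub(r"\d", "#", "a1b2c3", count=1)

# Matches do not overlap: "aaaaa" contains two non-overlapping "aa", plus one leftover "a"
non_overlapping = re.sub(r"aa", "x", "aaaaa")

print(all_replaced)     # a#b#c#
print(first_replaced)   # a#b2c3
print(non_overlapping)  # xxa
```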

In [7]:
s = "Qa24q"

s_alphanum_1 = re.sub(r"\W+", "", s)
s_alphanum_2 = re.sub(r"[^0-9a-zA-Z]", "", s) # Same result here; unlike \W, this also strips "_" and non-ASCII letters

s_num = re.sub(r"[^0-9]", "", s)
s_alpha = re.sub(r"[^a-zA-Z]+", "", s)
s_alphalower = re.sub(r"[^a-z]", "", s)
s_alphaupper = re.sub(r"[^A-Z]", "", s)

print(s_alphanum_1, s_alphanum_2, s_num, s_alpha, s_alphalower, s_alphaupper)

Qa24q Qa24q 24 Qaq aq Q


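## Extract *only* alphanumerics, letters, digits, or upper/lower-case letters with Python Built-in Methods

Python built-in string methods such as `.isalnum`, `.isalpha`, `.isdigit`, `.islower`, and `.isupper` return `True` only if the string is non-empty and all of its characters satisfy the condition (`.islower` and `.isupper` look only at cased characters). Applying them one character at a time lets us filter a string.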
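
A quick check of the "all characters" rule on whole strings, before using the methods as per-character filters:

```python
s = "Qa24q"

whole_alnum = s.isalnum()        # True: every character is a letter or a digit
whole_alpha = s.isalpha()        # False: "2" and "4" are not letters
empty_alnum = "".isalnum()       # False: an empty string never satisfies these checks
lower_with_digits = "qa24q".islower()  # True: digits are uncased and are ignored

print(whole_alnum, whole_alpha, empty_alnum, lower_with_digits)
# True False False True
```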

In [9]:
# Example using filter

s = "Qa24q"

s_alphanum = ''.join(filter(str.isalnum, s))

s_num = ''.join(filter(str.isdigit, s))
s_alpha = ''.join(filter(str.isalpha, s))
s_alphalower = ''.join(filter(str.islower, s))
s_alphaupper = ''.join(filter(str.isupper, s))

print(s_alphanum, s_num, s_alpha, s_alphalower, s_alphaupper)

Qa24q 24 Qaq aq Q


In [10]:
# Example using list comprehension

s = "Qa24q"

s_alphanum = ''.join([c for c in s if c.isalnum()])

s_num = ''.join([c for c in s if c.isdigit()])
s_alpha = ''.join([c for c in s if c.isalpha()])
s_alphalower = ''.join([c for c in s if c.islower()])
s_alphaupper = ''.join([c for c in s if c.isupper()])

print(s_alphanum, s_num, s_alpha, s_alphalower, s_alphaupper)

Qa24q 24 Qaq aq Q


## Regex - First match in a string

`re.search(pattern, string, flags=0)` 
Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
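
The difference between "no match" and "a zero-length match" is easy to miss, so here is a small sketch with a throwaway string:

```python
import re

# x* can match zero characters, so it succeeds immediately at position 0
zero_length = re.search(r"x*", "abc")

# x+ needs at least one "x", so nothing in the string matches
no_match = re.search(r"x+", "abc")

print(zero_length.span())   # (0, 0)
print(bool(zero_length))    # True: a Match object is truthy even when it is empty
print(no_match)             # None
```

This is why `if match:` is a safe test for success: it is false only when `re.search` returned `None`.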

In [41]:
matchObj = re.search(r'\d+%[A-Za-z]+',"5%Hello,%^&aworldz")
print(0, matchObj)

# Single character
print(1, re.search(r'z',"Hello, worldz!"))

# Any character
print(2, re.search(r'.',"123@Hello, world!"))
print(3, re.search(r'.',"!!!Hello, world!"))
print(3, re.search(r'.',"Hello, world!"))

# Any word character
print(4, re.search(r'\w',"!$@ Hello, world!"))
print(5, re.search(r'[a-zA-Z0-9]',"!$@ Hello, world!"))

# One or more word characters (first run found)
print(6, re.search(r'\w+',"!@#$%^&*Hello, world!"))

# Any non-word character
print(7, re.search(r'\W',"Hello@world!"))

# boundary between word and non-word
print(8, re.search(r'\b',"Hello, world!"))

# match any whitespace character
print(9, re.search(r'\s',"Hello, world!"))

# match any non-whitespace character
print(10, re.search(r'\S',"Hello, world!"))

# match a decimal digit
print(11, re.search(r'\d',"Hello, world 777!"))

# Match the end of the string
print(12, re.search(r'$',"Hello, world!"))

# Match the start of the string
print(13, re.search(r'^',"Hello, world!"))

# Use \ to escape a character so it is matched literally (@ is not special, so the backslash is optional here)
print(14, re.search(r'\@',"Hello@world!"))

0 <re.Match object; span=(0, 7), match='5%Hello'>
1 <re.Match object; span=(12, 13), match='z'>
2 <re.Match object; span=(0, 1), match='1'>
3 <re.Match object; span=(0, 1), match='!'>
3 <re.Match object; span=(0, 1), match='H'>
4 <re.Match object; span=(4, 5), match='H'>
5 <re.Match object; span=(4, 5), match='H'>
6 <re.Match object; span=(8, 13), match='Hello'>
7 <re.Match object; span=(5, 6), match='@'>
8 <re.Match object; span=(0, 0), match=''>
9 <re.Match object; span=(6, 7), match=' '>
10 <re.Match object; span=(0, 1), match='H'>
11 <re.Match object; span=(13, 14), match='7'>
12 <re.Match object; span=(13, 13), match=''>
13 <re.Match object; span=(0, 0), match=''>
14 <re.Match object; span=(5, 6), match='@'>


## RegEx - Find all matches in a string
for match in re.compile("l").finditer("Hello world!"):
    print(match)
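
A runnable variant of the snippet above, as a sketch: it collects the start position of every match that `finditer` yields.

```python
import re

pattern = re.compile("l")

# finditer yields one Match object per non-overlapping match, left to right
positions = [match.start() for match in pattern.finditer("Hello world!")]

print(positions)  # [2, 3, 9]
```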

In [103]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.
match = re.search(r'iii', 'piiig') # found, match.group() == "iii"
print(match)
match = re.search(r'igs', 'piiig') # not found, match == None
print(match)

## . = any char but \n
match = re.search(r'..g', 'piiig') # found, match.group() == "iig"
print(match)

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g') # found, match.group() == "123"
print(match)
match = re.search(r'\w\w\w', '@@abcd!!') # found, match.group() == "abc"
print(match)

<re.Match object; span=(1, 4), match='iii'>
None
<re.Match object; span=(2, 5), match='iig'>
<re.Match object; span=(1, 4), match='123'>
<re.Match object; span=(2, 5), match='abc'>


In [124]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig') # found, match.group() == "piii"
print(match)

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii') # found, match.group() == "ii"
print(match)

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx') # found, match.group() == "1 2   3"
print(match)
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx') # found, match.group() == "12  3"
print(match)
match = re.search(r'\d\s*\d\s*\d', 'xx123xx') # found, match.group() == "123"
print(match)

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar') # not found, match == None
print(match)
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar') # found, match.group() == "bar"
print(match)
## $ = matches the end of string, so this finds the final 'bar':
match = re.search(r'bar$', 'barfoobar') # found, match.group() == "bar" at span (6, 9)
print(match)

<re.Match object; span=(0, 4), match='piii'>
<re.Match object; span=(1, 3), match='ii'>
<re.Match object; span=(2, 9), match='1 2   3'>
<re.Match object; span=(2, 7), match='12  3'>
<re.Match object; span=(2, 5), match='123'>
None
<re.Match object; span=(3, 6), match='bar'>
<re.Match object; span=(6, 9), match='bar'>


In [127]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())  ## 'b@google'

b@google


In [138]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\s\w+\s', str)
if match:
    print(match.group())  ## ' monkey ' (the first word with whitespace on both sides)

 monkey 


In [151]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\s[a-z]+[-]*[a-z]+@[a-z]+\.[a-z]+\s', str)  # \. matches a literal dot
if match:
    print(match.group())  ## ' alice-b@google.com ' (including the surrounding spaces)

 alice-b@google.com 


In [157]:
# Cleaner: one or more of '.', '-' or word characters on each side of the @
str = 'purple alice.cheng-b@google.com monkey dishwasher'
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group())  ## 'alice.cheng-b@google.com'

alice.cheng-b@google.com


In [161]:
# Add parentheses for logical groups, e.g. to extract the username and the host separately
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com
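
Besides `group(n)`, a match object can return all groups at once and report where each group sits in the string. A small sketch reusing the same sample text (the variable name `text` is chosen here to avoid shadowing the built-in `str`):

```python
import re

text = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', text)

all_groups = match.groups()   # every captured group, as a tuple
user_span = match.span(1)     # (start, end) of group 1 in text
host_span = match.span(2)     # (start, end) of group 2 in text

print(all_groups)  # ('alice-b', 'google.com')
print(user_span)   # (7, 14)
print(host_span)   # (15, 25)
```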


In [163]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com


In [164]:
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## With groups in the pattern, re.findall() returns a list of (username, host) tuples
emails = re.findall(r'([\w\.-]+)@([\w\.-]+)', str) ## [('alice', 'google.com'), ('bob', 'abc.com')]
for email in emails:
    # do something with each (username, host) tuple
    print(email)

('alice', 'google.com')
('bob', 'abc.com')


In [165]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for tuple in tuples:
    print(tuple[0])  ## username
    print(tuple[1])  ## host

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com


In [171]:
## Suppose we have a text with many email addresses
str = 'purple ALICE@google.com, blah monkey BOBBY@abc.com blah dishwasher'

## re.IGNORECASE lets [a-z] match upper-case letters too
emails = re.findall(r'[a-z\.-]+@[a-z\.-]+', str, re.IGNORECASE) ## ['ALICE@google.com', 'BOBBY@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

ALICE@google.com
BOBBY@abc.com


In [186]:
str = '<b>foo</b> and <i>so on</i>'
match = re.search(r'<.*>', str)
if match:
    print(match.group())  

<b>foo</b> and <i>so on</i>
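
The whole string is returned because `.*` is greedy: it grabs as much as it can while still letting the final `>` match. Adding `?` after the quantifier makes it non-greedy. A minimal sketch with the same sample text (the variable name `html` is chosen here for illustration):

```python
import re

html = '<b>foo</b> and <i>so on</i>'

greedy = re.search(r'<.*>', html).group()    # runs to the last ">" in the string
lazy = re.search(r'<.*?>', html).group()     # stops at the first ">"
all_tags = re.findall(r'<.*?>', html)        # each tag on its own

print(greedy)    # <b>foo</b> and <i>so on</i>
print(lazy)      # <b>
print(all_tags)  # ['<b>', '</b>', '<i>', '</i>']
```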


In [187]:
str = '<b>foo</b> and <i>so on</i>'
match = re.search(r'<\w>', str)
if match:
    print(match.group())  

<b>


In [190]:
str = '<b>foo</b> and <i>so on</i>'
match = re.findall(r'</?\w>', str)  # /? = an optional slash, so opening and closing tags both match
if match:
    print(match)

['<b>', '</b>', '<i>', '</i>']


In [177]:
# 0 or more of any characters
str = './*@#@!()'
match = re.search(r'.*', str)
if match:
    print(match.group())  

./*@#@!()


In [208]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
replacement = r'\1@yo-yo-dyne.com' # use \1 to refer to group(1)

## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', replacement, str))
## purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher


In [209]:
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
replacement = r'name@\2' # use \2 to refer to group(2)

## re.sub(pat, replacement, str) -- returns new string with all replacements,
## \1 is group(1), \2 group(2) in the replacement
print(re.sub(r'([\w\.-]+)@([\w\.-]+)', replacement, str))
## purple name@google.com, blah monkey name@abc.com blah dishwasher

purple name@google.com, blah monkey name@abc.com blah dishwasher
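
Numbered references such as `\1` and `\2` get hard to read as patterns grow. As a sketch, the same kind of replacement can be written with named groups; the group names `user` and `host` and the output format are chosen here purely for illustration.

```python
import re

text = 'alice@google.com, bob@abc.com'

# (?P<name>...) names a group; \g<name> refers to it in the replacement
pattern = r'(?P<user>[\w.-]+)@(?P<host>[\w.-]+)'
swapped = re.sub(pattern, r'\g<host>:\g<user>', text)

print(swapped)  # google.com:alice, abc.com:bob
```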


In [215]:
p = re.compile('[a-z]+')
print(p.match(""))
print(p.match('tempo'))

m = p.match('tempo')
print(m.group())
print(m.start())
print(m.end())

None
<re.Match object; span=(0, 5), match='tempo'>
tempo
0
5


In [218]:
# Two pattern methods for returning all matches
p = re.compile(r'\d+')
print(p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping'))


iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')  # lazily yields match objects

for match in iterator:
    print(match.span())

['12', '11', '10']
(0, 2)
(22, 24)
(29, 31)


In [223]:
# match vs search
# match() only checks whether the RE matches at the beginning of the string,
# while search() scans forward through the string for a match
print(re.match('super', 'superstition').span())

print(re.match('super', 'insuperable'))



#  search() will scan forward through the string, reporting the first match it finds.
print(re.search('super', 'superstition').span())

print(re.search('super', 'insuperable').span())

(0, 5)
None
(0, 5)
(2, 7)
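
For completeness, `re.fullmatch()` is the strictest of the three: the pattern has to cover the entire string. A short sketch using the same words as above:

```python
import re

at_start = re.match('super', 'superstition')        # pattern matches at the beginning
whole_fail = re.fullmatch('super', 'superstition')  # pattern does not cover the whole string
whole_ok = re.fullmatch('super', 'super')           # pattern covers the whole string

print(at_start.span())  # (0, 5)
print(whole_fail)       # None
print(whole_ok.span())  # (0, 5)
```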
