<a href="https://colab.research.google.com/github/paiml/python_for_datascience/blob/master/Lesson12_Python_For_Data_Science_Pattern_Matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 12: Pattern Matching 


## 12.1 Perform simple pattern matching                                                                                                   


In [0]:
"sol" in "absolute"

True

In [0]:
'absolute'.startswith('ab')

True

In [0]:
'absolute'.endswith('lute')

True

In [0]:
'absolute'.find('sol')

2
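The four checks above behave differently when the substring is absent. The following is a minimal sketch of those differences; the variable names are illustrative and not part of the lesson.

```python
word = "absolute"

# `in` answers a yes/no question.
has_sol = "sol" in word            # True
has_xyz = "xyz" in word            # False

# find() returns the index of the first occurrence, or -1 when absent.
pos_sol = word.find("sol")         # 2
pos_xyz = word.find("xyz")         # -1

# index() behaves like find() but raises ValueError when absent.
try:
    word.index("xyz")
    raised = False
except ValueError:
    raised = True

# startswith() and endswith() also accept a tuple of candidates.
starts = word.startswith(("ab", "un"))
ends = word.endswith(("lute", "ly"))

print(has_sol, has_xyz, pos_sol, pos_xyz, raised, starts, ends)
```

Because `find()` returns `0` for a match at the very start of the string, testing its result for truthiness is unreliable; compare against `-1` or use `in` instead.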

## 12.2 Use regular expressions 


In [0]:
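# Sample data: each line has the form "Name: name@SLD.TLD"
# (SLD = second-level domain, TLD = top-level domain).
# The backslash after the opening quotes suppresses the leading newline,
# so the string starts with "Ahab:" and re.match() can find it.
text = '''\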
Ahab: ahab@pequod.com
Peleg: peleg@pequod.com
Ishmael: ishmael@pequod.com
Herman: herman@acushnet.io
Pollard: pollard@essex.me
'''

### Simple matching

In [3]:
import re

re.match("Ahab:", text)


<_sre.SRE_Match object; span=(0, 5), match='Ahab:'>

In [0]:
if re.match("Ahab:", text):
  print("We found Ahab")

We found Ahab


In [0]:
if re.match("Peleg", text):
  print("We found Peleg")
else:
  print("No Peleg found!")

No Peleg found!


### Search

In [0]:
if re.search("Peleg", text):
  print("We found Peleg")
else:
  print("No Peleg found!")

We found Peleg
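The two cells above give different answers because `re.match` only tries the pattern at the start of the string, while `re.search` scans the whole string. A small sketch of the difference, using a shortened two-line copy of the sample text (an assumption made to keep the block self-contained):

```python
import re

# Shortened copy of the sample data, with no leading newline.
sample = "Ahab: ahab@pequod.com\nPeleg: peleg@pequod.com\n"

# match() anchors at position 0, so "Peleg" on the second line is not seen.
at_start = re.match("Peleg", sample)

# search() scans forward and reports the first occurrence anywhere.
anywhere = re.search("Peleg", sample)

# With re.MULTILINE, ^ also matches right after each newline.
line_start = re.search(r"^Peleg", sample, re.MULTILINE)

# fullmatch() requires the pattern to consume the entire string.
whole = re.fullmatch("Peleg", "Peleg: peleg@pequod.com")

print(at_start, anywhere.span(), line_start.span(), whole)
```

Note that `re.MULTILINE` changes what `^` means for `search`, but `re.match` still only tries position 0.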


### Character sets

In [0]:
re.search("[A-Z][a-z]", text)

<_sre.SRE_Match object; span=(0, 2), match='Ah'>

In [0]:
re.search("[A-Za-z]+", text)

<_sre.SRE_Match object; span=(0, 4), match='Ahab'>

In [0]:
re.search("[A-Za-z]{7}", text)

<_sre.SRE_Match object; span=(46, 53), match='Ishmael'>

In [0]:
re.search(r"[a-z]+\@[a-z]+\.[a-z]+", text)

<_sre.SRE_Match object; span=(6, 21), match='ahab@pequod.com'>

### Character classes

In [0]:
re.search(r"\w\d\w", "His panic over Y2K was overwhelming.")

<_sre.SRE_Match object; span=(15, 18), match='Y2K'>

In [0]:
re.search(r"\w+\@\w+\.\w+", text)

<_sre.SRE_Match object; span=(6, 21), match='ahab@pequod.com'>
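The shorthand classes used above are broader than the explicit sets in the previous section. The sketch below compares them on a made-up string (the address `user_42@example.org` is a placeholder, not part of the lesson data).

```python
import re

sample = "user_42@example.org, tab\there"

# \w covers letters, digits, and the underscore, so it is broader than [A-Za-z].
word_run = re.search(r"\w+", sample).group()
letters_only = re.search(r"[A-Za-z]+", sample).group()

# \d matches a digit; \s matches whitespace such as spaces and tabs.
digits = re.findall(r"\d", sample)
whitespace = re.findall(r"\s", sample)

# Uppercase forms negate the class: \W matches whatever \w does not.
non_word = re.findall(r"\W", "a-b c")

print(word_run, letters_only, digits, whitespace, non_word)
```

The patterns are written as raw strings (`r"..."`) so that Python passes the backslashes through to the regex engine unchanged.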

### Groups

In [13]:
m = re.search(r"(\w+)\@(\w+)\.(\w+)", text)
print(f'''
Group 0 is {m.group(0)}
Group 1 is {m.group(1)}
Group 2 is {m.group(2)}
Group 3 is {m.group(3)}
''')


Group 0 is ahab@pequod.com
Group 1 is ahab
Group 2 is pequod
Group 3 is com
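Besides `group(n)`, a match object can return all groups at once and report where each group sits in the input. A minimal sketch on a single line of the sample data, including the case where nothing matches:

```python
import re

pattern = r"(\w+)@(\w+)\.(\w+)"
m = re.search(pattern, "Ahab: ahab@pequod.com")

# groups() returns every captured group as a tuple, ready for unpacking.
all_groups = m.groups()
user, sld, tld = all_groups

# span(n) gives the (start, end) offsets of group n; span() is the whole match.
whole_span = m.span()
sld_span = m.span(2)

# search() returns None when nothing matches, so guard before calling group().
missing = re.search(pattern, "no address here")
result = missing.groups() if missing else None

print(all_groups, whole_span, sld_span, result)
```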



### Named groups

In [17]:
m = re.search(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", text)

print(f'''
Email address: {m.group()}
Name:  {m.group("name")}
Second-level domain: {m.group("SLD")}
Top-level domain: {m.group("TLD")}
''')


Email address: ahab@pequod.com
Name:  ahab
Second-level domain: pequod
Top-level domain: com



### Find all

In [19]:
m = re.findall(r"\w+\@\w+\.\w+", text)
m

['ahab@pequod.com',
 'peleg@pequod.com',
 'ishmael@pequod.com',
 'herman@acushnet.io',
 'pollard@essex.me']

In [23]:
re.findall(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", text)


[('ahab', 'pequod', 'com'),
 ('peleg', 'pequod', 'com'),
 ('ishmael', 'pequod', 'com'),
 ('herman', 'acushnet', 'io'),
 ('pollard', 'essex', 'me')]
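The two results above differ in shape because `findall` returns whole matches when the pattern has no groups and captured groups when it has some. A short sketch of each case, on a two-address string chosen for brevity:

```python
import re

addresses = "ahab@pequod.com herman@acushnet.io"

# No groups: a list of full matches.
no_groups = re.findall(r"\w+@\w+\.\w+", addresses)

# Exactly one group: a list of that group's text only.
one_group = re.findall(r"\w+@\w+\.(\w+)", addresses)

# Several groups: a list of tuples, one tuple per match.
many_groups = re.findall(r"(\w+)@(\w+)\.\w+", addresses)

# Non-capturing groups (?:...) group without changing the result shape.
non_capturing = re.findall(r"(?:\w+)@(?:\w+)\.\w+", addresses)

print(no_groups, one_group, many_groups, non_capturing, sep="\n")
```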

### Find iterator

In [30]:
iterator = re.finditer(r"\w+\@\w+\.\w+", text)

print(f"An {type(iterator)} object is returned by finditer" )

An <class 'callable_iterator'> object is returned by finditer
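The iterator returned by `finditer` yields match objects one at a time as it is consumed. A minimal sketch of stepping through it by hand, on a two-address string used here only for illustration:

```python
import re

addresses = "ahab@pequod.com peleg@pequod.com"
matches = re.finditer(r"\w+@\w+\.\w+", addresses)

first = next(matches)            # first address only
second = next(matches)           # scanning resumes after the first match
third = next(matches, None)      # a default avoids StopIteration when exhausted

print(first.group(), first.span())
print(second.group(), second.span())
print(third)
```

Each item is a full match object, so `group()`, `span()`, and `groupdict()` are all available, unlike the plain strings or tuples returned by `findall`.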


In [32]:
m = next(iterator)
f"The first match, {m.group()}, is processed without processing the rest of the text"

'The first match, ahab@pequod.com, is processed without processing the rest of the text'

### Iterators with named groups

In [33]:
iterator = re.finditer(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", text)
for m in iterator:
  print(m.groupdict())

{'name': 'ahab', 'SLD': 'pequod', 'TLD': 'com'}
{'name': 'peleg', 'SLD': 'pequod', 'TLD': 'com'}
{'name': 'ishmael', 'SLD': 'pequod', 'TLD': 'com'}
{'name': 'herman', 'SLD': 'acushnet', 'TLD': 'io'}
{'name': 'pollard', 'SLD': 'essex', 'TLD': 'me'}


In [37]:
iterator = re.finditer(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", text)
for m in iterator:
  data = m.groupdict()
  print(f"{data['name'].title()} sailed on the {data['SLD'].title()}")

Ahab sailed on the Pequod
Peleg sailed on the Pequod
Ishmael sailed on the Pequod
Herman sailed on the Acushnet
Pollard sailed on the Essex


### Substitution

In [46]:
re.sub(r"\d", "#", "Your secret pin is 12345")

'Your secret pin is #####'
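`re.sub` also accepts a `count` argument and a function as the replacement, and `re.subn` reports how many replacements were made. The sketch below shows all three; the helper `keep_last_digit` is a hypothetical name introduced for this example.

```python
import re

message = "Your secret pin is 12345"

# count limits how many matches are replaced, working left to right.
first_two = re.sub(r"\d", "#", message, count=2)

# A function replacement receives each match object and returns the new text.
def keep_last_digit(m):
    digits = m.group()
    return "#" * (len(digits) - 1) + digits[-1]

masked = re.sub(r"\d+", keep_last_digit, message)

# subn() returns the new string together with the number of substitutions.
all_masked, n = re.subn(r"\d", "#", message)

print(first_two, masked, all_masked, n, sep="\n")
```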

### Substitution using named groups

In [45]:
new_text = re.sub(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)", r"\g<TLD>.\g<SLD>.\g<name>", text)

print(new_text)


Ahab: com.pequod.ahab
Peleg: com.pequod.peleg
Ishmael: com.pequod.ishmael
Herman: io.acushnet.herman
Pollard: me.essex.pollard
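The same reordering can be written with numbered references when the groups are unnamed. This is a small sketch rather than part of the lesson; it also shows why the `\g<n>` form is useful when a digit follows the reference.

```python
import re

pattern = r"(\w+)@(\w+)\.(\w+)"
line = "Ahab: ahab@pequod.com"

# \1, \2, \3 refer to groups by position in the pattern.
numbered = re.sub(pattern, r"\3.\2.\1", line)

# \g<1>0 means "group 1, then a literal 0"; a bare \10 would mean group 10.
explicit = re.sub(pattern, r"\g<1>0", "ahab@pequod.com")

print(numbered)
print(explicit)
```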



### Compiling regexes

In [49]:
regex = re.compile(r"(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)")
regex

re.compile(r'(?P<name>\w+)\@(?P<SLD>\w+)\.(?P<TLD>\w+)', re.UNICODE)

In [50]:
regex.search(text)

<_sre.SRE_Match object; span=(6, 21), match='ahab@pequod.com'>
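A compiled pattern exposes the same operations as the module-level functions, so it can be defined once and reused. The sketch below also uses `re.VERBOSE`, which is not used elsewhere in this lesson, to spread the pattern over several commented lines; the two-line sample string is a shortened copy of the data above.

```python
import re

email = re.compile(
    r"""
    (?P<name>\w+)   # part before the @
    @
    (?P<SLD>\w+)    # second-level domain
    \.
    (?P<TLD>\w+)    # top-level domain
    """,
    re.VERBOSE,
)

sample = "Ahab: ahab@pequod.com\nHerman: herman@acushnet.io\n"

# The compiled object has search(), finditer(), sub(), and the other methods.
first = email.search(sample).groupdict()
names = [m.group("name") for m in email.finditer(sample)]
swapped = email.sub(r"\g<TLD>.\g<SLD>.\g<name>", sample)

print(first)
print(names)
print(swapped)
```

In verbose mode, whitespace in the pattern is ignored and `#` starts a comment, so a literal space or `#` would need to be escaped or placed inside a character set.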

## 12.3 Learn text processing techniques: Beautiful Soup 


# Notes

[Python regular expression syntax](https://docs.python.org/3/library/re.html#re-syntax)

[Python regex HOWTO](https://docs.python.org/3/howto/regex.html)

[Regular expression (Wikipedia)](https://en.wikipedia.org/wiki/Regular_expression)