# Regular Expressions

## Data Science: Machine Learning Techniques

We start by importing Python's regular expression module, 're'.

In [1]:
import re

In [15]:
string_ex_list = ['i like beer', "i don't like bears", 'double match beerbeer', 'not a match', 'be']
# Note the second phrase mixes double and single quotes: double quotes enclose the string so it can contain an apostrophe.
pattern = r'be'

In [13]:
match_result = [re.match(pattern, x) for x in string_ex_list]

print(match_result)

print([x.group(0) for x in match_result if x])

[None, None, None, None, <_sre.SRE_Match object; span=(0, 2), match='be'>]
['be']


Notice how we didn't match 'bear' or 'beer' at all.  This is because the re.match() function only matches at the start of the string!
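To see the start-of-string behaviour in isolation, here is a minimal sketch using two made-up strings (they are not part of the list above; the variable names are only illustrative):

```python
import re

pattern = r'be'

# re.match() tries the pattern only at position 0 of the string.
at_start = re.match(pattern, 'beer is good')     # string starts with 'be'
not_at_start = re.match(pattern, 'i like beer')  # 'be' occurs, but not at index 0

print(at_start.span())  # (0, 2)
print(not_at_start)     # None
```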

To look for a match anywhere in each string, use re.search():

In [19]:
all_result = [re.search(pattern, x) for x in string_ex_list]

print(all_result)

print([x.group() for x in all_result if x])

[<_sre.SRE_Match object; span=(7, 9), match='be'>, <_sre.SRE_Match object; span=(13, 15), match='be'>, <_sre.SRE_Match object; span=(13, 15), match='be'>, None, <_sre.SRE_Match object; span=(0, 2), match='be'>]
['be', 'be', 'be', 'be']


But re.search() returns only the first match in each string!! (See the third result: 'beerbeer' contains 'be' twice, but only one match is reported.)
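A short sketch of that limitation on the third example string alone (the variable name first_hit is only illustrative):

```python
import re

# 'be' occurs at indices 13-15 and 17-19, but re.search() stops at the first hit.
first_hit = re.search(r'be', 'double match beerbeer')

print(first_hit.span())  # (13, 15)
```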

To get every match in each string, use re.findall():

In [21]:
all_result2 = [re.findall(pattern, x) for x in string_ex_list]

# findall() returns the matched strings directly, one list per input string.
print(all_result2)

# This will fail: the results are lists of strings, not Match objects, so there is no .group().
print([x.group() for x in all_result2 if x])

[['be'], ['be'], ['be', 'be'], [], ['be']]


AttributeError: 'list' object has no attribute 'group'
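Because re.findall() already hands back plain strings, the results can be used directly with no call to .group(). As one possible illustration, the sketch below counts the matches per string (match_counts is an illustrative name, not something defined earlier):

```python
import re

string_ex_list = ['i like beer', "i don't like bears", 'double match beerbeer', 'not a match', 'be']
pattern = r'be'

# Each findall() result is a list of matched strings, so len() gives the match count.
match_counts = [len(re.findall(pattern, x)) for x in string_ex_list]

print(match_counts)  # [1, 1, 2, 0, 1]
```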

What if we want the indices of ALL the matches?  Use re.finditer():

In [43]:
all_result3 = [re.finditer(pattern, x) for x in string_ex_list]

# finditer() returns one iterator of Match objects per string:
print('Iterable Match Objects:')
print(all_result3)

# Iterate through match objects:
print('\nAll iterated match objects:')
print([[x for x in match_iterable] for match_iterable in all_result3])

# This will return all the strings!
print('\nHere are all the strings:')
print([[x.group() for x in string_matches if x] for string_matches in all_result3])

# Return indices
print('\nHere are the start and stop indices:')
print([[x.span(0) for x in string_matches] for string_matches in all_result3])

# WHAT WENT WRONG???

Iterable Match Objects:
[<callable_iterator object at 0x7fe1b49115f8>, <callable_iterator object at 0x7fe1b49115c0>, <callable_iterator object at 0x7fe1b4911a58>, <callable_iterator object at 0x7fe1b4911e80>, <callable_iterator object at 0x7fe1b49117f0>]

All iterated match objects:
[[<_sre.SRE_Match object; span=(7, 9), match='be'>], [<_sre.SRE_Match object; span=(13, 15), match='be'>], [<_sre.SRE_Match object; span=(13, 15), match='be'>, <_sre.SRE_Match object; span=(17, 19), match='be'>], [], [<_sre.SRE_Match object; span=(0, 2), match='be'>]]

Here are all the strings:
[[], [], [], [], []]

Here are the start and stop indices:
[[], [], [], [], []]


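The last two results are all empty lists!!!  WHY???

It is because of how Python handles iterators!!!  An iterator can be consumed only once: the 'All iterated match objects' step already walked through every match, so by the time we ask for the strings and the indices there is nothing left to yield.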
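The same effect can be reproduced on a single string. This is a minimal sketch; first_pass and second_pass are illustrative names:

```python
import re

matches = re.finditer(r'be', 'double match beerbeer')

first_pass = [m.span() for m in matches]   # consumes the iterator
second_pass = [m.span() for m in matches]  # the iterator is already exhausted

print(first_pass)   # [(13, 15), (17, 19)]
print(second_pass)  # []
```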

So we should save the matches in a list before using them more than once!

In [45]:
all_result3 = [re.finditer(pattern, x) for x in string_ex_list]

# Iterable Match objects:
print('Iterable Match Objects:')
print(all_result3)

# Consume each iterator once and save its Match objects in a list:
print('\nAll iterated match objects:')
match_object_list = [list(match_iterable) for match_iterable in all_result3]
print(match_object_list)

# This will return all the strings!
print('\nHere are all the strings:')
print([[x.group() for x in match_object if x] for match_object in match_object_list])

# Return indices
print('\nHere are the start and stop indices:')
print([[x.span(0) for x in match_object] for match_object in match_object_list])


Iterable Match Objects:
[<callable_iterator object at 0x7fe1b4911358>, <callable_iterator object at 0x7fe1b49115f8>, <callable_iterator object at 0x7fe1b4911278>, <callable_iterator object at 0x7fe1b4911438>, <callable_iterator object at 0x7fe1b4911a58>]

All iterated match objects:
[[<_sre.SRE_Match object; span=(7, 9), match='be'>], [<_sre.SRE_Match object; span=(13, 15), match='be'>], [<_sre.SRE_Match object; span=(13, 15), match='be'>, <_sre.SRE_Match object; span=(17, 19), match='be'>], [], [<_sre.SRE_Match object; span=(0, 2), match='be'>]]

Here are all the strings:
[['be'], ['be'], ['be', 'be'], [], ['be']]

Here are the start and stop indices:
[[(7, 9)], [(13, 15)], [(13, 15), (17, 19)], [], [(0, 2)]]
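If both the matched text and the indices are needed, another option is to pull everything out of each Match object in a single pass, so the iterator never has to be reused. The helper below is a hypothetical sketch (find_all_spans is not part of 're'):

```python
import re

def find_all_spans(pattern, text):
    """Return (matched_text, start, end) for every non-overlapping match in text."""
    # The iterator is consumed exactly once, inside this comprehension.
    return [(m.group(), m.start(), m.end()) for m in re.finditer(pattern, text)]

string_ex_list = ['i like beer', "i don't like bears", 'double match beerbeer', 'not a match', 'be']
span_results = [find_all_spans(r'be', s) for s in string_ex_list]

print(span_results)
# [[('be', 7, 9)], [('be', 13, 15)], [('be', 13, 15), ('be', 17, 19)], [], [('be', 0, 2)]]
```

Returning plain tuples rather than Match objects means the result can be printed, stored, or looped over as many times as needed.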
