## Regular expressions ## 


### 1.1 Finding one match in a string ###
1. Import the 're' module.
2. Create a Regex Object by passing the pattern to the compile() function.
3. Call search() on the Regex Object and pass it the string you want to search.
4. Store the returned Match Object (None if there is no match) in the variable 'mo'.
5. Call group() on 'mo' to get the whole match.

In [1]:
import re 

In [58]:
# Search a text for a phone number (e.g. 675 678 900, 675-678-900, 675678900)
def find_number(text):
    # \D on both sides requires a non-digit neighbour, so longer digit runs are rejected
    pattern = re.compile(r'\D\d{3}\s\d{3}\s\d{3}\D|\D\d{3}-\d{3}-\d{3}\D|\D\d{9}\D')
    mo = pattern.search(text)
    if mo is not None:
        # strip the two surrounding non-digit characters
        return mo.group()[1:-1]
    else:
        return 'No match'

In [63]:
text = 'Dzis jest 24.10. Szukamy chetnych do pracy za 12 zł/h. Platacja ma wysokosc 34567 '\
'metrow. Prosze  kontakt na nr 123456789 lub do biura na numer 3456 23 23. 3345-890-798 ' \
'Kontakt z obsluga spa pod nr :345-890-798. lub 234 567 890.'

find_number(text)

'123456789'

<p> Notice that search() returns only the first match. </p>
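
One limitation of the `\D...\D` pattern used above is that it needs a non-digit character on both sides of the number, so a number at the very start or end of the text is missed. A possible alternative, sketched here with a hypothetical helper name `find_number_lookaround`, uses the lookaround assertions `(?<!\d)` and `(?!\d)`, which check the neighbouring characters without consuming them. This is only one way to handle the edge case, not part of the original function.

```python
import re

# Lookarounds assert "no digit before / after" without consuming a character,
# so numbers at the very start or end of the text can still match.
PHONE = re.compile(
    r'(?<!\d)(?:\d{3}\s\d{3}\s\d{3}|\d{3}-\d{3}-\d{3}|\d{9})(?!\d)'
)

def find_number_lookaround(text):
    """Return the first phone number found in text, or 'No match'."""
    mo = PHONE.search(text)
    return mo.group() if mo is not None else 'No match'

print(find_number_lookaround('675-678-900'))       # the number is the whole string
print(find_number_lookaround('call 675 678 900'))  # the number ends the string
print(find_number_lookaround('id 1234567890'))     # ten digits: rejected
```

Because nothing extra is consumed, there is no need to slice the result with `[1:-1]`.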

### 1.2 Finding all matches in string ###
 
* Call findall() on the Regex Object (instead of search()) and pass the string to it.

In [49]:
# Search a text for all phone numbers (e.g. 675 678 900, 675-678-900, 675678900)
def find_numbers(text):
    pattern = re.compile(r'\D\d{3}\s\d{3}\s\d{3}\D|\D\d{3}-\d{3}-\d{3}\D|\D\d{9}\D')
    matches = pattern.findall(text)
    if matches:
        # strip the surrounding non-digit characters from each match
        return [m[1:-1] for m in matches]
    else:
        return 'No match'

In [50]:
text = 'Dzis jest 24.10. Szukamy chetnych do pracy za 12 zł/h. Platacja ma wysokosc 34567 '\
'metrow. Prosze  kontakt na nr 123456789 lub do biura na numer 456 23 23.345-890-798 ' \
'Kontakt z obsluga spa pod nr 1345-890-798. lub 234567 890.'

find_numbers(text)

['123456789', '345-890-798']

### Notes ###
* findall() returns a list (empty if nothing matches); search() returns a Match Object or None.
* Remember to create the Regex Object from a raw string (r'...'), so that backslashes reach the regex engine unchanged!
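
Both notes can be checked with a short sketch. The sample text and variable names below are made up for illustration.

```python
import re

digits = re.compile(r'\d+')
text = 'Room 12, floor 3'

first = digits.search(text)         # Match object, or None when nothing matches
everything = digits.findall(text)   # list of strings, empty when nothing matches

print(first.group())                # 12
print(everything)                   # ['12', '3']
print(digits.findall('no digits'))  # []

# Without the r prefix, '\b' is a single backspace character,
# not the two-character regex token for a word boundary.
print(len('\b'), len(r'\b'))              # 1 2
print(re.findall('\bcat\b', 'a cat'))     # []
print(re.findall(r'\bcat\b', 'a cat'))    # ['cat']
```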

### 1.3 Grouping with parentheses ###
* Each set of parentheses in a pattern is one group: the first set is group 1, the second set is group 2, etc.
* To grab the first group from a Match Object, call group(1) on it.
* To grab the whole match, call group() without an argument.

In [82]:
def find_number(text, g):
    pattern = re.compile(r'\D(\d{3})\s(\d{3}\s\d{3})\D')
    mo = pattern.search(text)
    if mo is not None:
        return mo.group(g)
    else:
        return 'No match'

In [86]:
text = "123456789, it's not my number! My number is 415 678 903."

# In this case, the first group will be '415' and the second one is '678 903'. Let's check it!

find_number(text, 2), find_number(text, 1)

('678 903', '415')
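
When several groups are needed at once, the Match Object can return them together. A small sketch, reusing the sample number from the example above without the surrounding `\D`:

```python
import re

pattern = re.compile(r'(\d{3})\s(\d{3}\s\d{3})')
mo = pattern.search('My number is 415 678 903.')

print(mo.group(0))     # 415 678 903  (group(0) is the same as group())
print(mo.group(1, 2))  # ('415', '678 903')
print(mo.groups())     # ('415', '678 903')

# groups() returns a tuple, so it can be unpacked directly
prefix, rest = mo.groups()
print(prefix, rest)
```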

### 1.4 Matching multiple groups with pipe | ###

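* The pipe | separates alternative patterns; when several alternatives occur in the text, search() returns the first occurrence as the Match Object.
* When the alternatives are wrapped in parentheses, search() still gives the whole match via group(), while findall() returns only the text captured by the group.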

In [4]:
ro = re.compile(r'Bat(man|mobile)')
mo = ro.search('I like Batman very much and Batmobile as well.')
print(mo.group(), mo.group(1))

Batman man


In [5]:
ro = re.compile(r'Bat(man|mobile)')
mo = ro.findall('I like Batman very much and Batmobile as well.')
print(mo)

['man', 'mobile']
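
If the whole matches are wanted from findall(), two options are a non-capturing group `(?:...)`, which groups without capturing, or `finditer()`, which yields Match Objects. Neither appears in the cells above; this is just a sketch of both.

```python
import re

text = 'I like Batman very much and Batmobile as well.'

# Capturing group: findall() returns only what the group captured
captured = re.findall(r'Bat(man|mobile)', text)

# Non-capturing group: findall() returns the whole match
whole = re.findall(r'Bat(?:man|mobile)', text)

# finditer(): one Match object per match, so group() is available
via_iter = [mo.group() for mo in re.finditer(r'Bat(man|mobile)', text)]

print(captured)  # ['man', 'mobile']
print(whole)     # ['Batman', 'Batmobile']
print(via_iter)  # ['Batman', 'Batmobile']
```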


### 1.5 Optional matching with question mark ? ###
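* The part of the pattern in parentheses before the '?' is optional: it matches zero times or one time.
* findall() returns only the captured group ('' when the optional part is absent), so use search() with group() to get the whole match.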

In [11]:
ro = re.compile(r'Bat(wo)?man')
mo = ro.search('I like Batman very much and Batwoman as well.')
print(mo.group())

Batman


In [9]:
ro = re.compile(r'Bat(wo)?man')
mo = ro.findall('I like Batman very much and Batwoman as well.')
print(mo)

['', 'wo']
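
The '?' also works on a single character without parentheses, and a non-capturing group `(?:...)` lets findall() return whole matches. A brief sketch with made-up sample strings:

```python
import re

# '?' applied to one character: the 'u' may be present or absent
spellings = re.findall(r'colou?r', 'color or colour')
print(spellings)  # ['color', 'colour']

# Optional non-capturing group: findall() returns the full matches
heroes = re.findall(r'Bat(?:wo)?man', 'I like Batman and Batwoman.')
print(heroes)     # ['Batman', 'Batwoman']
```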


### 1.6 Matching zero or more with the star * ###

In [15]:
ro = re.compile(r'bat(wo)*man')
mo = ro.search('me and batman')
mo1 = ro.search('she is a batwowoman')
mo2 = ro.search('batwowowowowoman')
print(mo.group(), mo1.group(), mo2.group())

batman batwowoman batwowowowowoman


### 1.7 Matching one or more with the plus + ###
* Similar to matching with a star, but here the group has to appear at least once.

In [18]:
ro = re.compile(r'bat(wo)+man')
mo = ro.search('me and batman')
if mo is not None:
    print(mo.group())
else:
    print('No match')

No match


In [27]:
ro = re.compile(r'bat(wo)+man')
mo = ro.search('I am a batwowoman')
mo.group()

'batwowoman'

### 1.8 Matching specific repetitions with curly brackets {} ###

* Use curly brackets when you want to repeat a group a specific number of times.
* (String){3} - the 'String' group appears exactly 3 times -> 'StringStringString'
* (string){2,5} - appears at least 2 and at most 5 times
* (string){2,} - appears at least 2 times
* (string){,5} - appears at most 5 times (including zero)

In [26]:
haRegex = re.compile(r'(Ha){3}')
mo1 = haRegex.search('HaHaHa')
mo1.group()

'HaHaHa'

In [28]:
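# Without parentheses, {3} applies only to the preceding character 'a',
# so this pattern matches 'Haaa', not 'HaHaHa'.
reg = re.compile(r'Ha{3}')
mo2 = reg.search('Ha')
mo2 is None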

True
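
The range forms can be compared side by side. This sketch assumes two things not covered above: `\b` (a word boundary) keeps each match aligned to a whole word, and `(?:...)` is a non-capturing group so that findall() returns full matches.

```python
import re

text = 'Ha HaHa HaHaHa HaHaHaHa'

exactly_3 = re.findall(r'\b(?:Ha){3}\b', text)
two_to_three = re.findall(r'\b(?:Ha){2,3}\b', text)
at_least_3 = re.findall(r'\b(?:Ha){3,}\b', text)

print(exactly_3)     # ['HaHaHa']
print(two_to_three)  # ['HaHa', 'HaHaHa']
print(at_least_3)    # ['HaHaHa', 'HaHaHaHa']

# {,2} allows zero repetitions, so even an empty string matches
print(re.fullmatch(r'(?:Ha){,2}', '') is not None)  # True

# Without parentheses the quantifier binds to 'a' only
print(re.findall(r'\bHa{3}\b', 'Ha Haaa'))          # ['Haaa']
```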

### 1.9 Greedy and nongreedy matching ###
* Quantifiers are greedy by default: {3,5} matches as many repetitions as possible.
* Adding '?' after the quantifier makes it nongreedy: {3,5}? matches as few repetitions as possible.

In [30]:
# Greedy matching
reg = re.compile(r'(ha){3,5}')
mo = reg.search('hahahahaha')
mo.group()

'hahahahaha'

In [31]:
# Nongreedy matching
reg = re.compile(r'(ha){3,5}?')
mo = reg.search('hahahahaha')
mo.group()

'hahaha'
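
The same '?' suffix applies to the other quantifiers (`+?`, `*?`). The sketch below also shows that nongreedy means "as few repetitions as needed for the match to succeed", not "the shortest match anywhere in the text": the leftmost starting position still wins.

```python
import re

text = 'hahahahaha'

greedy_plus = re.search(r'(ha)+', text).group()
lazy_plus = re.search(r'(ha)+?', text).group()
lazy_star = re.search(r'(ha)*?', text).group()

print(greedy_plus)      # hahahahaha
print(lazy_plus)        # ha
print(repr(lazy_star))  # ''  (zero repetitions is enough)

# The trailing $ forces the lazy quantifier to expand until the end is reached
forced = re.search(r'(ha){3,5}?$', text).group()
print(forced)           # hahahahaha
```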

### 1.10 Character classes and making character classes ###
* A character class is written in square brackets: [abc] matches any one of the characters 'a', 'b', 'c' and is a shortcut for the regex r'a|b|c'.
* A hyphen defines a range of characters, e.g. [0-5] stands for the regex r'0|1|2|3|4|5'.
* Inside square brackets most special characters (such as '.', '*', '?') lose their special meaning and need no backslash, e.g. [1-5.]. The exceptions are '\', ']', a leading '^' and a '-' between two characters.
* A '^' at the beginning of a character class creates a negative class, which matches every character that is NOT in the class.

### Making character class ###

In [61]:
# Creating character class with vowels 'a','e','i','y','o','u' (capital as well).
vowel = re.compile(r'[aoueiyAOUEIY]')
mo = vowel.findall('I was there the day before yesterday.')
mo

['I', 'a', 'e', 'e', 'e', 'a', 'y', 'e', 'o', 'e', 'y', 'e', 'e', 'a', 'y']

### Making negative character class with ^ ###

In [66]:
not_vowel = re.compile(r'[^aoueiyAOUEIY]')
mo = not_vowel.findall('I was there the day before yesterday.')
print(mo)

[' ', 'w', 's', ' ', 't', 'h', 'r', ' ', 't', 'h', ' ', 'd', ' ', 'b', 'f', 'r', ' ', 's', 't', 'r', 'd', '.']


In [60]:
# Creating character class with all lowercase and uppercase letters and numerical digits

allwordsigns = re.compile(r'[a-zA-Z0-9]')
mo = allwordsigns.findall('Rebecca know hot to sing since she was 18.')
print(mo)

['R', 'e', 'b', 'e', 'c', 'c', 'a', 'k', 'n', 'o', 'w', 'h', 'o', 't', 't', 'o', 's', 'i', 'n', 'g', 's', 'i', 'n', 'c', 'e', 's', 'h', 'e', 'w', 'a', 's', '1', '8']
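
Ranges follow character code order, so the exact endpoints matter. As a small illustration with a made-up test string, the range `A-z` spans the punctuation that sits between 'Z' and 'a' in ASCII, while `A-Z` plus `a-z` does not:

```python
import re

sample = 'a_b[c]^'

too_wide = re.findall(r'[A-z]', sample)
letters = re.findall(r'[a-zA-Z]', sample)

print(too_wide)  # ['a', '_', 'b', '[', 'c', ']', '^']
print(letters)   # ['a', 'b', 'c']

# Special characters are literal inside a class; '-' is literal when placed last
operators = re.findall(r'[.+*-]', '1.5+2*3-4')
print(operators)  # ['.', '+', '*', '-']
```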


### 1.11 The dollar and caret characters ###
* A caret ^ at the beginning of a regex indicates that the match must occur at the beginning of the searched text.
* A dollar sign $ at the end of a regex indicates that the match must occur at the end of the searched text.

In [69]:
# Caret character ^
beginswith = re.compile(r'^Hello')
mo1 = beginswith.search('Hello world')
mo2 = beginswith.search('Oh, Hello my dear!')
mo1, mo2

(<re.Match object; span=(0, 5), match='Hello'>, None)

In [70]:
# Dollar character $
endswith = re.compile(r'wow$')
mo3 = endswith.search('wow it looks nice')
mo4 = endswith.search('it looks nice, wow')
mo3, mo4

(None, <re.Match object; span=(15, 18), match='wow'>)

In [71]:
# Caret and dollar together in one regex
startsends = re.compile(r'^house$')
mo5 = startsends.search('house')
mo6 = startsends.search('house is beautiful')
mo5, mo6

(<re.Match object; span=(0, 5), match='house'>, None)
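
Two details are worth a quick check. In Python, `$` also matches just before a newline that ends the string, and with the `re.MULTILINE` flag `^` and `$` match at every line. When an exact whole-string match is required, `re.fullmatch()` is a stricter option. The sample strings are made up.

```python
import re

exact = re.compile(r'^house$')

# $ matches at the very end OR just before a final newline
with_newline = exact.search('house\n') is not None
print(with_newline)   # True

# fullmatch() requires the pattern to cover the entire string
strict = re.fullmatch(r'house', 'house\n') is not None
print(strict)         # False

# re.MULTILINE: ^ matches at the start of every line
line_starts = re.findall(r'^\w+', 'one two\nthree four', re.MULTILINE)
print(line_starts)    # ['one', 'three']
```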

### 1.12 The wildcard character  '.' ###
* The . (dot) matches any character except a newline.

In [74]:
reg = re.compile(r'.\d{3}$')
mo = reg.search("It must be a misunderstanding, please call a p012")
print(mo.group())

p012


In [76]:
reg = re.compile(r'.at')
mo = reg.findall('cat in the hat sat at the couch.')
print(mo)

['cat', 'hat', 'sat', ' at']


### 1.13 Matching everything with dot-star .* ###

In [82]:
namereg = re.compile(r'First name:(.*) Last name:(.*)')
mo = namereg.search('First name: Edward Last name: Tompson')
mo.group(1), mo.group(2)

(' Edward', ' Tompson')

In [83]:
# The same example but in non-greedy manner
namereg = re.compile(r'First name:(.*?) Last name:(.*?)')
mo = namereg.search('First name: Edward Last name: Tompson')
mo.group(1), mo.group(2)

(' Edward', '')

In [85]:
# Greedy again
greedy = re.compile(r'<.*>')
mo1 = greedy.search('<i talk to you all> and you as well>')
mo1.group()

'<i talk to you all> and you as well>'

In [86]:
# Non-greedy
nongreedy = re.compile(r'<.*?>')
mo2 = nongreedy.search('<i talk to you all> and you as well>')
mo2.group()

'<i talk to you all>'

### 1.13 Matching newlines with dot character ###
* The dot character stands for every character except a newline character (\n).
* To make it match \n as well, pass re.DOTALL as the second argument to re.compile().

In [100]:
# Without re.DOTALL
reg = re.compile(r'.*')
reg.search('I was very curious about it.\n But she was not.').group()

'I was very curious about it.'

In [101]:
# With re.DOTALL
reg = re.compile(r'.*', re.DOTALL)
reg.search('I was very curious about it.\nBut she was not.').group()

'I was very curious about it.\nBut she was not.'

### 1.14 Case-insensitive matching ###
* pass re.IGNORECASE or re.I to re.compile()

In [103]:
reg = re.compile(r'roBocOP', re.I)
reg.search('robOcop is something').group()

'robOcop'
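
Flags can be combined with the bitwise OR operator `|`, or written inline at the start of the pattern, e.g. `(?i)`. A short sketch with a made-up sentence:

```python
import re

# Case-insensitive AND dot-matches-newline at the same time
reg = re.compile(r'robocop.*end', re.IGNORECASE | re.DOTALL)
mo = reg.search('RoboCop starts here\nand this is the END.')
print(repr(mo.group()))  # 'RoboCop starts here\nand this is the END'

# Inline flag: (?i) at the start of the pattern is equivalent to re.I
inline = re.search(r'(?i)robocop', 'ROBOCOP returns').group()
print(inline)            # ROBOCOP
```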

### 1.15 Substituting strings with a sub() method ###

In [117]:
reg = re.compile(r'A\w{3}.*?\s\w{4}')
reg.sub('CENSORED', 'I heard that Alicja Keys has confidential information.')

'I heard that CENSORED has confidential information.'

In [120]:
reg = re.compile(r'Agent (\w)\w*')
reg.sub(r'\1****', 'Agent Thomas told Agent Agatha that she must meet Agent Joanne in Belgium')

'T**** told A**** that she must meet J**** in Belgium'
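
sub() has a few more options that may be useful: a `count` argument, a function as the replacement, and the sibling method `subn()`. The sketch below uses a slightly modified pattern with a second group, and the helper `mask` is a hypothetical name.

```python
import re

agent = re.compile(r'Agent (\w)(\w*)')
text = 'Agent Thomas told Agent Agatha that she must meet Agent Joanne'

# count=1 limits the substitution to the first match
first_only = agent.sub(r'\1****', text, count=1)
print(first_only)

# A function receives each Match object and returns the replacement text;
# here every hidden letter becomes exactly one asterisk
def mask(mo):
    return mo.group(1) + '*' * len(mo.group(2))

masked = agent.sub(mask, text)
print(masked)

# subn() returns the new string together with the number of replacements
result, n = agent.subn(r'\1****', text)
print(result, n)
```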

### 1.16 Using re.VERBOSE to comment regex ###

In [122]:
phoneRegex = re.compile(r'''(
 (\d{3}|\(\d{3}\))?           # area code
 (\s|-|\.)?                   # separator
 \d{3}                        # first 3 digits
 (\s|-|\.)                    # separator
 \d{4}                        # last 4 digits
 (\s*(ext\.?|x)\s*\d{2,5})?   # extension
 )''', re.VERBOSE)

In [125]:
phoneRegex.search('(415) 675 4563  x  23456. I was sure this is 3456767222222').group()

'(415) 675 4563  x  23456'
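
Verbose patterns combine well with named groups `(?P<name>...)`, which make the parts of a match readable by name. This is a simplified sketch, not the phoneRegex above: it drops the extension, and the group names and sample text are assumptions chosen for illustration.

```python
import re

phone = re.compile(r'''
    (?P<area>\d{3}|\(\d{3}\))   # area code, bare or in parentheses
    [\s.-]?                     # optional separator
    (?P<first>\d{3})            # first 3 digits
    [\s.-]                      # separator
    (?P<last>\d{4})             # last 4 digits
''', re.VERBOSE)

text = 'Call (415) 675 4563 or 212-555-0199.'

# finditer() yields one Match object per phone number
found = [mo.group() for mo in phone.finditer(text)]
print(found)  # ['(415) 675 4563', '212-555-0199']

# groupdict() maps each group name to the text it captured
parts = phone.search(text).groupdict()
print(parts)  # {'area': '(415)', 'first': '675', 'last': '4563'}
```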