# Objective: <br>
<ol><li>Introduction to regular expressions.</li>
    <li>Examples and use cases.</li></ol>

In [1]:
import re

**Regular expressions are very useful, especially in text pre-processing.**<br>
**Before using them we need to know the basic steps:**<br>
<ol>
    <li> Come up with the regular expression. </li>
    <li> Compile the regular expression. </li>
    <li> Use the compiled RE object to match, search, replace, etc.</li>
    
</ol>

In [11]:
# Compile the regular expression: 'xyz*' means 'xy' followed by zero or more 'z'
p = re.compile(r'xyz*', re.IGNORECASE)
m = p.match("abcdefgh")
print(m)  # No match: match() only looks at the start of the string


None


In [12]:

m = p.match("abcdxyzxy")
print(m)  # No match again: the string contains xyz but does not start with it
m = p.match("XYZXYZABC")
print(m)  # Matches because of re.IGNORECASE
m = p.match("XYZXYZABC".lower())
print(m)


None
<_sre.SRE_Match object; span=(0, 3), match='XYZ'>
<_sre.SRE_Match object; span=(0, 3), match='xyz'>


In [22]:
# Let's try search, which scans the whole string for the first match
m=p.search("abcdxyzxy")
print(m)

<_sre.SRE_Match object; span=(4, 7), match='xyz'>
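To put the matching methods side by side, here is a minimal sketch. The pattern and input strings are made up for illustration; `fullmatch` is not used elsewhere in this notebook and is included only for contrast.

```python
import re

# Hypothetical pattern and input, chosen only to contrast the three methods.
pat = re.compile(r'xyz', re.IGNORECASE)
text = "abcdXYZxy"

at_start = pat.match(text)        # only tries to match at position 0
anywhere = pat.search(text)       # scans forward for the first match
whole = pat.fullmatch("XYZ")      # the entire string must match
not_whole = pat.fullmatch(text)   # fails: extra characters around 'XYZ'

print(at_start)          # None
print(anywhere.span())   # (4, 7)
print(whole.group())     # XYZ
print(not_whole)         # None
```

All three return `None` on failure, so check the result before calling `.group()` or `.span()` on it.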


In [25]:
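# findall returns every non-overlapping match as a list
m = p.findall("abcdxyzxy123456789xyz")
print(m)  # 'xyz*' also matches the bare 'xy' in front of the digits

['xyz', 'xy', 'xyz']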


In [38]:
# Extracting all numbers from a string
string = "Bob was planning to buy a model-y for 1200 USD but he was having seconds thoughts, it was manufactured in 2014. On the other hand it had 5 years extended warranty plan active that runs through 2019."
# Use findall when you just want all the numbers in a list
find_digits = re.compile(r'\d').findall(string)  # \d matches a single digit
print(find_digits)  # extracts each digit individually

find_digits = re.compile(r'\d+').findall(string)  # \d+ matches one or more consecutive digits
print([int(x) for x in find_digits])

# If you also need the span (start, end) of each number, use finditer
iterator = re.compile(r'\d+').finditer(string)
for match in iterator:
    print(match.group(), match.span())

['1', '2', '0', '0', '2', '0', '1', '4', '5', '2', '0', '1', '9']
[1200, 2014, 5, 2019]
1200 (38, 42)
2014 (106, 110)
5 (137, 138)
2019 (193, 197)


**Replace function.**<br>
**The sub function finds every substring that matches the RE, replaces each one with the string provided, and returns the resulting string. If the pattern is not found, it returns the original string unchanged.**
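A minimal sketch of this behaviour on a made-up input string. It also shows two standard features of `re.sub` that this notebook does not otherwise use: the `count` argument and a function as the replacement.

```python
import re

text = "room 12, floor 3"  # hypothetical input for illustration

all_replaced = re.sub(r'\d+', '#', text)             # replace every match
first_only = re.sub(r'\d+', '#', text, count=1)      # replace only the first match
unchanged = re.sub(r'\d+', '#', "no digits here")    # no match: string comes back as-is

# The replacement can be a function that receives each match object.
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), text)

print(all_replaced)  # room #, floor #
print(first_only)    # room #, floor 3
print(unchanged)     # no digits here
print(doubled)       # room 24, floor 6
```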

In [42]:
# 1. Delete numbers from the string above
num_removed = re.sub(r'\d+', '', string)  # arguments: pattern, replacement string, input string
print(num_removed)

Bob was planning to buy a model-y for  USD but he was having seconds thoughts, it was manufactured in . On the other hand it had  years extended warranty plan active that runs through .


In [40]:
# 2. Replace all numbers with a placeholder string
num_replaced = re.sub(r'\d+', 'NUM', string)  # arguments: pattern, replacement string, input string
print(num_replaced)

Bob was planning to buy a model-y for NUM USD but he was having seconds thoughts, it was manufactured in NUM. On the other hand it had NUM years extended warranty plan active that runs through NUM.


In [44]:
# 3. Collapse runs of duplicate whitespace into a single space
string_with_duplicate_whites = "aaa   bbbbb cc ccc ddgtc   gh"
print(string_with_duplicate_whites)
whitespaces = re.sub(r'\s+', ' ', string_with_duplicate_whites)
print(whitespaces)


aaa   bbbbb cc ccc ddgtc   gh
aaa bbbbb cc ccc ddgtc gh


In [45]:
# 4. To also get the number of replacements, use subn, which returns (new_string, count)
num_removed = re.subn(r'\d+', '', string)  # arguments: pattern, replacement string, input string
print(num_removed[0],num_removed[1])

Bob was planning to buy a model-y for  USD but he was having seconds thoughts, it was manufactured in . On the other hand it had  years extended warranty plan active that runs through . 4


# Using RE

**1. Finding all adverbs, assuming they end with -ly.**

In [46]:
adverb = "The module was designed very carefully and deployed very efficiently, blehly."
iter_adverb = re.finditer(r'\w+ly', adverb)  # \w+ matches one or more word characters (letters, digits, underscore); the literal 'ly' must follow.
#                                            Note how important our assumptions are: 'blehly' is matched although it is not a word.
#                                            Once we have all the words, we could use WordNet to check the spellings, as shown in the previous notebook.
for adv in iter_adverb:
    print(adv.group(),adv.span())

carefully (29, 38)
efficiently (57, 68)
blehly (70, 76)
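One limitation of `\w+ly` is that it can also match the start of a longer word. The sketch below, on a made-up sentence, compares it with a variant that adds word boundaries (`\b`). The stricter pattern is only one possible refinement and still relies on the "ends with -ly" assumption.

```python
import re

sentence = "She spoke softly about his slyness."  # hypothetical example

loose = re.findall(r'\w+ly', sentence)       # 'ly' may sit in the middle of a word
strict = re.findall(r'\b\w+ly\b', sentence)  # the whole word must end in 'ly'

print(loose)   # ['softly', 'sly']
print(strict)  # ['softly']
```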


**2. Finding all nouns, assuming they start with a capital letter.**

In [54]:
nouns  = "asdhagsd Ajkjahjkah Bkjakajksd sefafa1233 Thskdfh231"
iter_noun = re.finditer(r'[A-Z]{1}\w*', nouns)  # exactly one capital letter ([A-Z]{1}) followed by zero or more word characters (\w*)

for noun in iter_noun:
    print(noun.group(),noun.span())

Ajkjahjkah (9, 19)
Bkjakajksd (20, 30)
Thskdfh231 (42, 52)
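Because the pattern is not tied to the start of a word, a capital letter in the middle of a word also starts a match. A small sketch on a made-up input, with `\b` added as one possible fix:

```python
import re

text = "my iPhone and Ajkjahjkah"  # hypothetical input with a mid-word capital

loose = re.findall(r'[A-Z]{1}\w*', text)  # a capital anywhere starts a match
strict = re.findall(r'\b[A-Z]\w*', text)  # the capital must begin the word

print(loose)   # ['Phone', 'Ajkjahjkah']
print(strict)  # ['Ajkjahjkah']
```

`[A-Z]` and `[A-Z]{1}` are equivalent, so the `{1}` can be dropped.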


**3. Pluralize nouns with regular expressions. We did this with TextBlob in the previous notebook.**<br>
**Before we start compiling regular expressions for the task we need to have some rules that define the suffixes.**<br>
<ol>
    <li>If a word ends in s, x, or z, add es</li>
    <li>If a word ends in a noisy h, add es; if it ends in a silent h, just add s. What's a noisy h? One that combines with other letters to make a sound we can hear. So coach becomes coaches and rash becomes rashes, because we can hear the ch and sh sounds when we say them. But cheetah becomes cheetahs, because the h is silent.</li>
    <li>If a word ends in y that sounds like i, change the y to ies; if the y is combined with a vowel to sound like something else, just add s.</li>
    <li>If all else fails add s.</li>
    </ol>

In [55]:
def pluralize(noun):
    if re.search('[sxz]$', noun):  # $ matches the end of the string
        return re.sub('$', 'es', noun)
    elif re.search('[^aeioudgkprt]h$', noun):  # inside [], a leading ^ negates the set: any character except these, then h
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):  # y preceded by anything other than a vowel
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'

In [59]:
N = "knife" # change the nouns here
print(pluralize(N))

knifes


**Exercise: download a list of nouns and their plurals from the internet and use it to evaluate this function. When the function gives a wrong answer, come up with a new rule and add it to the function above.**
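One way to run such an evaluation is sketched below. The word pairs are a small hand-written sample rather than a downloaded list, and the `evaluate` helper is a hypothetical name introduced here. `pluralize` is repeated so the block runs on its own.

```python
import re

def pluralize(noun):
    # Same rules as the function defined earlier in this notebook.
    if re.search('[sxz]$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeioudgkprt]h$', noun):
        return re.sub('$', 'es', noun)
    elif re.search('[^aeiou]y$', noun):
        return re.sub('y$', 'ies', noun)
    else:
        return noun + 's'

def evaluate(pairs):
    """Return (noun, expected, got) for every pair the rules get wrong."""
    return [(noun, expected, pluralize(noun))
            for noun, expected in pairs
            if pluralize(noun) != expected]

# Hand-written sample; replace it with a real list of noun/plural pairs.
sample = [("bus", "buses"), ("coach", "coaches"), ("cheetah", "cheetahs"),
          ("vacancy", "vacancies"), ("day", "days"), ("knife", "knives")]

failures = evaluate(sample)
print(failures)  # [('knife', 'knives', 'knifes')]
```

Each failure points at a word ending that the current rules do not cover and is a candidate for a new rule.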

**4. Parse telephone numbers.**
There is no single correct way to do this.

In [62]:
# Writing a regular expression this way makes it easier to understand and modify. Pass the re.VERBOSE flag when the RE is defined this way.
pattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '415')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '867')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '5309')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)

print(pattern.search('emergency 1-(415) 867.5309 #9999').groups())
print(pattern.search('415-867-5309').groups())
print(pattern.search('(415)86753099999').groups())

# Because the pattern is not anchored at the beginning of the string, search() finds where the number starts:
# the first position at which 3 digits begin a complete match.
# Note that \D matches any non-digit and \d matches any digit.


('415', '867', '5309', '9999')
('415', '867', '5309', '')
('415', '867', '5309', '9999')
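`search` returns `None` when no phone number is found, so calling `.groups()` directly would raise an `AttributeError` on such input. The sketch below wraps the same pattern in a helper that handles this case. The name `normalize_phone` and the output format are assumptions made for illustration.

```python
import re

# Same pattern as above, written compactly; whitespace is ignored under re.VERBOSE.
PHONE_RE = re.compile(r'''
    (\d{3}) \D*   # area code, optional separator
    (\d{3}) \D*   # trunk, optional separator
    (\d{4}) \D*   # rest of number, optional separator
    (\d*)   $     # optional extension, end of string
    ''', re.VERBOSE)

def normalize_phone(text):
    """Return 'AAA-TTT-NNNN' (plus ' xEXT' if present), or None if nothing matches."""
    m = PHONE_RE.search(text)
    if m is None:
        return None
    area, trunk, rest, ext = m.groups()
    number = f"{area}-{trunk}-{rest}"
    return f"{number} x{ext}" if ext else number

print(normalize_phone('emergency 1-(415) 867.5309 #9999'))  # 415-867-5309 x9999
print(normalize_phone('415-867-5309'))                      # 415-867-5309
print(normalize_phone('no number here'))                    # None
```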


**5. Gather hashtags from tweets.**

In [63]:
# This is an exercise. Explain the RE defined below. Try it with a few tweets. If it fails, come up with a better RE.
hashtag_re = re.compile("(^|[^A-Za-z0-9/_])[##]{1}([A-Za-z0-9_]+)",re.UNICODE)
hashtags = hashtag_re.findall("$your #Tweet23@ Her#ee")
print(hashtags)

[(' ', 'Tweet23')]
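A hint for reading the output: when a pattern has more than one capturing group, `findall` returns one tuple per match with an entry for each group. That is why the result holds `(' ', 'Tweet23')` rather than just the tag. The sketch below uses a simplified form of the pattern (a plain `#`, no flags) and makes the first group non-capturing with `(?:...)`. This is one possible variation, not a complete answer to the exercise.

```python
import re

tweet = "$your #Tweet23@ Her#ee"

# Two capturing groups: findall yields (prefix, tag) tuples.
with_groups = re.compile(r"(^|[^A-Za-z0-9/_])#([A-Za-z0-9_]+)")
pairs = with_groups.findall(tweet)

# Non-capturing prefix group: findall yields only the tag text.
tags_only = re.compile(r"(?:^|[^A-Za-z0-9/_])#([A-Za-z0-9_]+)")
tags = tags_only.findall(tweet)

print(pairs)  # [(' ', 'Tweet23')]
print(tags)   # ['Tweet23']
```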
