# Regular expressions - re
The ‘re’ package provides multiple methods to perform queries on an input string. Here are the most commonly used methods:
- re.match()
- re.search()
- re.findall()
- re.split()
- re.sub()
- re.compile()

<b>re.match(pattern, string): </b>
This method finds a match only if it occurs at the start of the string. For example, calling match() on the string ‘SM Something inside SM’ with the pattern ‘SM’ will match. However, if we look for ‘Something’ instead, the pattern will not match, because ‘Something’ is not at the start of the string. Let’s try it in Python now.


In [2]:
import re
result = re.match(r'SM', 'SM Something inside SM   SM')
print (result)

<_sre.SRE_Match object; span=(0, 2), match='SM'>


In [3]:
print(result.group(0))

SM


In [4]:
result = re.match(r'Something', 'SM Something inside SM   SM')
print (result)

None


<b>re.search(pattern, string)</b> - scans the whole string and returns the first occurrence of the pattern, wherever it is

In [8]:
result = re.search(r'inside', 'SM Something inside SM   SM')
print(result.group(0))

inside
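
To make the difference between the two methods concrete, here is a small sketch that runs both on the same string. It also guards against `None`, which both methods return when nothing matches; the variable names are my own choice.

```python
import re

text = 'SM Something inside SM   SM'

# re.match only tries the pattern at position 0, so 'inside' is not found.
match_result = re.match(r'inside', text)

# re.search scans forward and stops at the first occurrence.
search_result = re.search(r'inside', text)

print(match_result)  # None
if search_result is not None:
    # span() gives the (start, end) offsets of the match within text.
    print(search_result.group(0), search_result.span())
```

Checking for `None` before calling `group()` avoids an `AttributeError` when the pattern is absent.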


<b>re.findall(pattern, string)</b> - returns a list of all non-overlapping matches

In [9]:
result = re.findall(r'SM', 'SM Something inside SM   SM')
print(result)

['SM', 'SM', 'SM']


<b>re.split(pattern, string, maxsplit=0):</b> - splits the string at each occurrence of the given pattern; maxsplit limits the number of splits (0 means no limit)

In [10]:
result = re.split(r't', 'Something')
result

['Some', 'hing']

In [11]:
result = re.split(r'i', 'Something inside')
result

['Someth', 'ng ', 'ns', 'de']

In [12]:
result = re.split(r'i', 'Something inside', maxsplit=1)
result

['Someth', 'ng inside']

<b>re.sub(pattern, repl, string):</b> It searches for a pattern and replaces every match with a new substring. If the pattern is not found, the string is returned unchanged.

In [13]:
result=re.sub(r'Australia','the World','AU is largest Analytics community in Australia')
result

'AU is largest Analytics community in the World'
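
As a quick check of the "returned unchanged" behaviour, the sketch below substitutes a word that does not occur in the sentence. It also uses the optional `count` argument of `re.sub` to replace only the first match. The word 'Canada' is just an illustrative value.

```python
import re

sentence = 'AU is largest Analytics community in Australia'

# 'Canada' does not occur, so the input comes back unchanged.
unchanged = re.sub(r'Canada', 'the World', sentence)
print(unchanged == sentence)

# count=1 limits the replacement to the first match only.
first_only = re.sub(r'A', 'a', sentence, count=1)
print(first_only)
```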

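<b>re.compile(pattern, flags=0):</b> It compiles a pattern into a regular expression object. The object has its own findall(), search(), sub() and other methods, so the same pattern can be reused without rewriting it for each call.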

In [14]:
import re
pattern=re.compile('AU')
result=pattern.findall('AU Analytics Something AU')
print(result)
result2=pattern.findall('AU is largest analytics community in Australia')
print(result2)

['AU', 'AU']
['AU']
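
A compiled pattern can also carry flags and be reused with the other methods. The following is only a sketch: the lowercase pattern `au` and the `re.IGNORECASE` flag are my own choices for illustration.

```python
import re

# Compile once, reuse many times. re.IGNORECASE makes 'au' match 'AU' and 'Au'.
pattern = re.compile(r'au', re.IGNORECASE)
sentence = 'AU is largest analytics community in Australia'

found = pattern.findall(sentence)       # all matches, any letter case
first = pattern.search(sentence)        # first match object
replaced = pattern.sub('--', sentence)  # replace every match

print(found)
print(first.span())
print(replaced)
```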


## Problem 1: Return the first word of a given string
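Solution-1  Extract each character (using “.” and then “\w”). The “.” matches every character, including spaces; “\w” matches only word characters, so the spaces drop out.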

In [15]:
import re
result=re.findall(r'.','AU is largest analytics community in Australia')
print(result)

['A', 'U', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'a', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'i', 'n', ' ', 'A', 'u', 's', 't', 'r', 'a', 'l', 'i', 'a']


In [17]:
result=re.findall(r'\w','AU is largest analytics community in Australia')
print(result)

['A', 'U', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'a', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'i', 'n', 'A', 'u', 's', 't', 'r', 'a', 'l', 'i', 'a']


Solution-2  Extract each word (using “*” or “+”)

In [19]:
result=re.findall(r'\w*','AU is largest analytics community in Australia')
print(result)

['AU', '', 'is', '', 'largest', '', 'analytics', '', 'community', '', 'in', '', 'Australia', '']


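The output contains empty strings because “*” matches zero or more repetitions of the pattern to its left, so it also produces an empty match wherever no word character is found (at each space and at the end of the string). To drop these empty matches we use “+”, which requires at least one character.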

In [20]:
result=re.findall(r'\w+','AU is largest analytics community in Australia')
print(result)

['AU', 'is', 'largest', 'analytics', 'community', 'in', 'Australia']


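Solution-3  Extract the first word (using “^”). Anchoring the pattern with “$” at the end instead returns the last word, as the second cell shows.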

In [21]:
result=re.findall(r'^\w+','AU is largest analytics community in Australia')
print(result)

['AU']


In [23]:
result=re.findall(r'\w+$','AU is largest analytics community in Australia')
print(result)

['Australia']
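
Since the problem asks for a single word rather than a list, one possible variant uses `re.search` and returns the matched string directly. The helper name `first_word` is hypothetical and not part of the `re` module.

```python
import re

def first_word(text):
    """Return the first run of word characters in text, or None if there is none."""
    found = re.search(r'\w+', text)
    return found.group(0) if found else None

print(first_word('AU is largest analytics community in Australia'))
print(first_word('  hello world'))  # leading spaces are skipped
print(first_word('!!!'))            # no word characters at all
```

Unlike the “^” version, this also works when the string starts with spaces or punctuation.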


## Problem 2: Return the first two characters of each word
Solution-1  Extract two consecutive characters at a time, excluding spaces (using “\w”). Note that this also returns pairs from the middle of words.

In [24]:
result=re.findall(r'\w\w','AU is largest analytics community in Australia')
print(result)

['AU', 'is', 'la', 'rg', 'es', 'an', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'in', 'Au', 'st', 'ra', 'li']


Solution-2  Extract the first two characters of each word by anchoring the match at a word boundary (using “\b”)

In [25]:
result=re.findall(r'\b\w\w','AU is largest analytics community in Australia')
print(result)

['AU', 'is', 'la', 'an', 'co', 'in', 'Au']


## Problem 3: Return the domain type of the given email IDs
To keep the explanation simple, I will again use a stepwise approach:

Solution-1  Extract all characters after “@”

In [27]:
result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, myemait@email.com, first.test@rest.biz') 
print (result) 

['@gmail', '@test', '@email', '@rest']


Above, you can see that the “.com” and “.in” parts are not extracted. To include them, we extend the pattern as below.

In [28]:
# Escape the dot so it matches a literal '.' rather than any character
result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, myemait@email.com, first.test@rest.biz')
print (result) 

['@gmail.com', '@test.in', '@email.com', '@rest.biz']


Solution-2  Extract only the domain type (the part after the dot) using a capture group “( )”

In [29]:
# Only the text inside the parentheses is returned; the dot is escaped to match a literal '.'
result=re.findall(r'@\w+\.(\w+)','abc.test@gmail.com, xyz@test.in, myemait@email.com, first.test@rest.biz')
print (result) 

['com', 'in', 'com', 'biz']
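
A note on the dot in these patterns: an unescaped “.” matches any character, not just a literal dot. The sketch below shows the difference on a made-up input where the “@” part is followed by a hyphen. The address `a@mail-server` is purely hypothetical.

```python
import re

text = 'a@mail-server, b@test.in'

# Unescaped dot: '.' happily matches the '-' in 'mail-server'.
loose = re.findall(r'@\w+.\w+', text)

# Escaped dot: only a literal '.' is accepted after the host name.
strict = re.findall(r'@\w+\.\w+', text)

print(loose)
print(strict)
```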


## Problem 4: Return date from given string
Here we will use “\d” to match digits: “\d{2}” matches exactly two digits and “\d{4}” exactly four.

In [30]:
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['12-05-2007', '11-11-2011', '12-01-2009']


To extract only the year, wrap the year part of the pattern in parentheses “( )”.

In [31]:
result=re.findall(r'\d{2}-\d{2}-(\d{4})','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['2007', '2011', '2009']
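
When the pattern has more than one group, `re.findall` returns a list of tuples, one tuple per match. As a sketch, wrapping day, month and year in separate groups gives all three parts at once. Reading the first field as the day is my assumption about the date format.

```python
import re

text = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# Three groups -> each match becomes a (day, month, year) tuple of strings.
parts = re.findall(r'(\d{2})-(\d{2})-(\d{4})', text)
print(parts)

# Unpack the tuples to pick out the years only.
years = [year for _day, _month, year in parts]
print(years)
```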


## Problem 5: Return all words of a string that start with a vowel
Solution-1  Return each word

In [35]:
result=re.findall(r'\w+','AU is largest analytics community in Australia')
print(result)

['AU', 'is', 'largest', 'analytics', 'community', 'in', 'Australia']


Solution-2  Return words that start with a vowel (using “[]”)

In [36]:
result=re.findall(r'[aeiouAEIOU]\w+','AU is largest analytics community in Australia')
print(result)

['AU', 'is', 'argest', 'analytics', 'ommunity', 'in', 'Australia']


Above you can see that it has returned “argest” and “ommunity” from the middle of words. To drop these two, we need to use “\b” for the word boundary.

Solution-3  Match a vowel only at a word boundary (using “\b”)

In [37]:
result=re.findall(r'\b[aeiouAEIOU]\w+','AU is largest analytics community in Australia')
print(result)

['AU', 'is', 'analytics', 'in', 'Australia']


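In a similar way, we can try to extract words that start with a consonant by using “^” inside the square brackets. Note that the negated class also matches a space, so each result below carries a leading space and includes words that actually start with a vowel.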

In [38]:
result=re.findall(r'\b[^aeiouAEIOU]\w+','AU is largest analytics community in Australia')
print(result)

[' is', ' largest', ' analytics', ' community', ' in', ' Australia']
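
The leading spaces appear because `[^aeiouAEIOU]` accepts any character that is not a vowel, including a space. One way to sketch a fix is to exclude the space as well, or more generally to exclude every non-word character with `\W` inside the class.

```python
import re

sentence = 'AU is largest analytics community in Australia'

# Add a space to the negated class so a space can no longer start a match.
no_space = re.findall(r'\b[^aeiouAEIOU ]\w+', sentence)

# Alternative: exclude all non-word characters (spaces, punctuation) via \W.
no_nonword = re.findall(r'\b[^aeiouAEIOU\W]\w+', sentence)

print(no_space)
print(no_nonword)
```

Both variants return only the words that begin with a consonant.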


## Problem 6: Validate a phone number (the number must be 10 digits long and start with 8 or 9)
We have a list of phone numbers in the list “li”, and here we will validate them using a regular expression.

In [39]:
import re
li=['9999999999','999999-999','99999x9999']
for val in li:
    if re.match(r'[8-9]{1}[0-9]{9}',val) and len(val) == 10:
        print ('yes')
    else:
        print ('no')

yes
no
no
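
The `len(val) == 10` check is needed above because `re.match` only anchors the start of the string. As an alternative sketch, `re.fullmatch` requires the whole string to match, so the length check becomes part of the pattern. The helper name `is_valid_phone` and the extra test numbers are my own.

```python
import re

# First digit 8 or 9, followed by exactly nine more digits.
PHONE_RE = re.compile(r'[89][0-9]{9}')

def is_valid_phone(number):
    """Return True only if the entire string is a valid 10-digit number."""
    return PHONE_RE.fullmatch(number) is not None

for val in ['9999999999', '999999-999', '99999x9999', '99999999999', '7999999999']:
    print(val, 'yes' if is_valid_phone(val) else 'no')
```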


## Problem 7: Split a string with multiple delimiters

In [40]:
import re
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
print(result)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']


Alternatively, use re.sub() to replace every delimiter with a single space instead of splitting the string:


In [41]:
import re
line = 'asdf fjdk;afed,fjek,asdf,foo'
result= re.sub(r'[;,\s]',' ', line)
print(result)

asdf fjdk afed fjek asdf foo
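
One caveat: when two delimiters sit next to each other (for example “; ” or “, ”), splitting on a single-character class leaves empty strings in the result. A small sketch, using a made-up variant of the line with extra spaces, shows how adding “+” treats a run of delimiters as one separator.

```python
import re

# Hypothetical input with delimiters followed by extra spaces.
line = 'asdf fjdk; afed, fjek,asdf,  foo'

# One split per delimiter character: adjacent delimiters yield '' entries.
single = re.split(r'[;,\s]', line)

# '+' collapses each run of delimiters into a single split point.
collapsed = re.split(r'[;,\s]+', line)

print(single)
print(collapsed)
```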


## Problem 8: Retrieve Information from HTML file
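I want to extract information from an HTML file (see the sample data below). Here we need to extract the information available between <td> and </td>, except for the first numerical index. I have assumed here that the HTML code below is stored in a string named str. Note that row 7 uses <td HTML> rather than a plain <td>, so the pattern used here does not match that row and it is missing from the output.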

Sample HTML file (str)

<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>

In [65]:
str = """<tr align="center"><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr align="center"><td>2</td> <td>Liam</td> <td>Olivia</td></tr>
<tr align="center"><td>3</td> <td>Mason</td> <td>Sophia</td></tr>
<tr align="center"><td>4</td> <td>Jacob</td> <td>Isabella</td></tr>
<tr align="center"><td>5</td> <td>William</td> <td>Ava</td></tr>
<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>
<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>
"""
#import urllib.request
#response = urllib.request.urlopen('https://trainings.analyticsvidhya.com/courses/course-v1:AnalyticsVidhya+BPDS001+2018_T2/courseware/4e12b8833d204ad5a10e7a39d2171d54/06a2d731c0ba40798687d6e234e5fbfc/?child=first')
#str=response.read().decode('utf-8')
#print(str)
result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
print(result)

[('Noah', 'Emma'), ('Liam', 'Olivia'), ('Mason', 'Sophia'), ('Jacob', 'Isabella'), ('William', 'Ava'), ('Ethan', 'Mia')]
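
Row 7 is missing from the result because its second cell opens with `<td HTML>` rather than `<td>`. As a sketch of one way to tolerate attributes inside the tag, `[^>]*` can be added after `td`. The variable name `rows` and the two-row sample are my own shortening of the data above, and for real pages a dedicated HTML parser is usually a more robust choice than a regular expression.

```python
import re

rows = (
    '<tr align="center"><td>6</td> <td>Ethan</td> <td>Mia</td></tr>\n'
    '<tr align="center"><td>7</td> <td HTML>Michael</td> <td>Emily</td></tr>\n'
)

# Original pattern: requires an exact '<td>' and so skips row 7.
strict = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', rows)

# '[^>]*' allows any attributes between 'td' and the closing '>'.
tolerant = re.findall(
    r'<td[^>]*>\w+</td>\s<td[^>]*>(\w+)</td>\s<td[^>]*>(\w+)</td>', rows
)

print(strict)
print(tolerant)
```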
