# Regular Expressions

Regular expressions are a common default for data cleaning and wrangling across many tools, whether you are extracting specific parts of text from web pages, making sense of Twitter data, or preparing your data for text mining.

Simply put, a regular expression is a sequence of characters used mainly to find and replace patterns in a string or file.

Regular expressions use two types of characters:

a) Metacharacters: As the name suggests, these characters have a special meaning, similar to * in a wildcard pattern.

b) Literals (like a,b,1,2…)
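
To make the difference between the two character types concrete, here is a minimal sketch. The sample string is made up for illustration; it contrasts the metacharacter "." with an escaped, literal dot.

```python
import re

text = "a.b axb"  # illustrative sample string

# Unescaped "." is a metacharacter: it matches any character except a newline.
any_char = re.findall(r"a.b", text)

# Escaped "\." is a literal dot, so only "a.b" matches.
literal_dot = re.findall(r"a\.b", text)

# re.escape() adds the backslashes for you when a string should be taken literally.
escaped = re.escape("a.b")

print(any_char)     # ['a.b', 'axb']
print(literal_dot)  # ['a.b']
```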

In Python, the "re" module provides support for regular expressions, so you need to import re before you can use them.

The most common uses of regular expressions are:

-Search a string (search and match)
-Find all matches in a string (findall)
-Break a string into substrings (split)
-Replace part of a string (sub)

In [1]:
import re

In [3]:
#re.match(pattern, string):
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result)

<re.Match object; span=(0, 2), match='AV'>


In [5]:
result = re.match(r'AV', 'AV Analytics Vidhya AV')
print(result.group(0))

AV


In [8]:
#No match is found because the string does not start with 'Analytics';
#re.match only looks for the pattern at the start of the string
result = re.match(r'Analytics', 'AV Analytics Vidhya AV')
print(result) 

None
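
Because `re.match` returns None when nothing matches, calling `.group(0)` on the result directly would raise an AttributeError. A minimal sketch of a guard follows; the helper name `first_match` is hypothetical and not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    # A match object is truthy; None is falsy, so this avoids calling .group() on None.
    return m.group(0) if m else None

found = first_match(r'AV', 'AV Analytics Vidhya AV')
missing = first_match(r'Analytics', 'AV Analytics Vidhya AV')
print(found)    # AV
print(missing)  # None
```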


In [14]:
#re.search(pattern, string):
#It is similar to match(), but it does not restrict matches to the beginning of the string.
#Unlike the previous method, searching for the pattern 'Analytics' here returns a match
result=re.search(r"Analytics","AV Analytics AV")
print(result.group(0))
#Note that search() returns only the first occurrence of the pattern

Analytics


In [18]:
#re.findall(pattern, string):
#It returns a list of all non-overlapping matches, anywhere in the string
result=re.findall(r"AV","AV Analytics AV")
print(result)

['AV', 'AV']
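
`findall` returns only the matched strings. If the positions of the matches are also needed, `re.finditer` yields match objects instead. A short sketch on the same sample string:

```python
import re

text = "AV Analytics AV"

# Each item from finditer is a match object, so span() gives (start, end) offsets.
spans = [m.span() for m in re.finditer(r"AV", text)]
print(spans)  # [(0, 2), (13, 15)]
```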


In [19]:
#re.split(pattern, string, maxsplit=0):
#It splits a string at each occurrence of the given pattern
result=re.split(r"y","AV Analytics AV")
print(result)

['AV Anal', 'tics AV']


In [21]:
result=re.split(r'i','Analytics Vidhya')
print(result)

['Analyt', 'cs V', 'dhya']


In [25]:
result=re.split(r'i','Analytics Vidhya',maxsplit=1)
print(result)

['Analyt', 'cs Vidhya']
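
One related behaviour worth knowing: if the pattern contains a capturing group, `re.split` also returns the delimiters it split on. A minimal sketch using the same string:

```python
import re

# Without a group the delimiter "i" is dropped; with (i) it is kept in the result.
without_group = re.split(r'i', 'Analytics Vidhya')
with_group = re.split(r'(i)', 'Analytics Vidhya')

print(without_group)  # ['Analyt', 'cs V', 'dhya']
print(with_group)     # ['Analyt', 'i', 'cs V', 'i', 'dhya']
```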


In [32]:
#re.sub(pattern, repl, string):
#It searches for a pattern and replaces each match with a new substring.
#If the pattern is not found, the string is returned unchanged
result=re.sub(r"india","the world","AV is the largest analytics hub in india")
print(result)

AV is the largest analytics hub in the world
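
`re.sub` also accepts a `count` argument and backreferences in the replacement, and `re.subn` reports how many replacements were made. The sample strings below are illustrative only.

```python
import re

text = "india india india"  # illustrative sample string

# count limits the number of replacements (passed by keyword here).
once = re.sub(r"india", "India", text, count=1)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\w+) (\w+)", r"\2 \1", "hub analytics")

# subn returns a tuple: (new string, number of replacements made).
new_text, n = re.subn(r"india", "India", text)

print(once)         # India india india
print(swapped)      # analytics hub
print(new_text, n)  # India India India 3
```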


In [29]:
#re.compile(pattern, flags=0):
#We can compile a regular expression pattern into a pattern object, which can then be used for pattern matching.
#It also lets us reuse a pattern without rewriting it.
pattern=re.compile('AV')
result=pattern.findall('AV Analytics Vidhya AV')
print(result)
result2=pattern.findall('AV is largest analytics community of India')
print(result2)

['AV', 'AV']
['AV']
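
A compiled pattern can also carry flags. As a small sketch, `re.IGNORECASE` makes the pattern match regardless of case; the sample string is made up for illustration.

```python
import re

# Compile once with a flag, then reuse the pattern object.
pattern_ci = re.compile(r'av', re.IGNORECASE)

hits = pattern_ci.findall('AV Analytics Vidhya av')
print(hits)  # ['AV', 'av']
```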


## Some Examples of Regular Expressions

In [33]:
# Return the first word of a given string
result=re.findall(r'.','AV is largest Analytics community of India') # "." matches any character except \n
print(result)

['A', 'V', ' ', 'i', 's', ' ', 'l', 'a', 'r', 'g', 'e', 's', 't', ' ', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', ' ', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', ' ', 'o', 'f', ' ', 'I', 'n', 'd', 'i', 'a']


In [40]:
result=re.findall(r'\w','AV is largest Analytics community of India') # \w matches an alphanumeric character or underscore; \W matches anything else
print(result)  

['A', 'V', 'i', 's', 'l', 'a', 'r', 'g', 'e', 's', 't', 'A', 'n', 'a', 'l', 'y', 't', 'i', 'c', 's', 'c', 'o', 'm', 'm', 'u', 'n', 'i', 't', 'y', 'o', 'f', 'I', 'n', 'd', 'i', 'a']


In [42]:
result=re.findall(r'\w*','AV is largest Analytics community of India') # * means 0 or more occurrences of the pattern to its left
print (result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']


In [44]:
result=re.findall(r'\w+','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


In [46]:
result=re.findall(r'^\w+','AV is largest Analytics community of India')
print (result)

['AV']


In [47]:
#Return the first two characters of each word
result=re.findall(r'\w\w','AV is largest Analytics community of India')
print(result)

['AV', 'is', 'la', 'rg', 'es', 'An', 'al', 'yt', 'ic', 'co', 'mm', 'un', 'it', 'of', 'In', 'di']


In [49]:
result=re.findall(r'\b\w.','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'la', 'An', 'co', 'of', 'In']


In [51]:
#Return the domain type of given email-ids
result=re.findall(r'@\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz')
print(result)

['@gmail', '@test', '@analyticsvidhya', '@rest']


In [52]:
result=re.findall(r'@\w+\.\w+','abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz') # \. matches a literal dot
print(result)

['@gmail.com', '@test.in', '@analyticsvidhya.com', '@rest.biz']
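
To return only the domain type (com, in, biz), one option is to wrap that part of the pattern in a capturing group; `findall` then returns just the captured text. A minimal sketch on the same input:

```python
import re

emails = 'abc.test@gmail.com, xyz@test.in, test.first@analyticsvidhya.com, first.test@rest.biz'

# Only the text inside the parentheses is returned by findall.
domain_types = re.findall(r'@\w+\.(\w+)', emails)
print(domain_types)  # ['com', 'in', 'com', 'biz']
```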


In [55]:
#Return date from given string
result=re.findall(r'\d{2}-\d{2}-\d{4}','Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009')
print(result)

['12-05-2007', '11-11-2011', '12-01-2009']
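
The same capturing-group idea can pull out a single part of each date, for example the year. A short sketch using the string from the cell above:

```python
import re

records = 'Amit 34-3456 12-05-2007, XYZ 56-4532 11-11-2011, ABC 67-8945 12-01-2009'

# The whole date must match, but only the parenthesised year is returned.
years = re.findall(r'\d{2}-\d{2}-(\d{4})', records)
print(years)  # ['2007', '2011', '2009']
```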


In [57]:
#Return all words of a string that start with a vowel
result=re.findall(r'\b[aeiouAEIOU]\w*','AV is largest Analytics community of India')
print (result)


['AV', 'is', 'Analytics', 'of', 'India']


In [61]:
# Validate a phone number (it must have exactly 10 digits and start with 8 or 9)
li=['9999999999','999999-999','99999x9999']
for val in li:
    # re.match anchors only at the start, so the length check rejects longer strings
    if re.match(r'[89][0-9]{9}',val) and len(val) == 10:
        print('yes')
    else:
        print('no')

yes
no
no
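
An alternative sketch of the same check uses `fullmatch`, which requires the whole string to match and so makes the separate length test unnecessary. The names `PHONE` and `is_valid_phone` are hypothetical.

```python
import re

# One leading 8 or 9, followed by exactly nine more digits.
PHONE = re.compile(r'[89][0-9]{9}')

def is_valid_phone(val):
    """Return True only if the entire string is a 10-digit number starting with 8 or 9."""
    return PHONE.fullmatch(val) is not None

checks = [is_valid_phone(v) for v in ['9999999999', '999999-999', '99999x9999']]
print(checks)  # [True, False, False]
```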


In [64]:
#Split a string with multiple delimiters
line = 'asdf fjdk;afed,fjek,asdf,foo' # String has multiple delimiters (";",","," ").
result= re.split(r'[;,\s]', line)
print (result)

['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
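
If delimiters can appear back to back (for example a comma followed by a space), the single-character class above produces empty strings in the result. Adding + to the class treats a run of delimiters as one. The sample line below is made up for illustration.

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'  # illustrative: some delimiters are followed by a space

# Each delimiter splits on its own, so "; " leaves an empty string behind.
naive = re.split(r'[;,\s]', line)

# "+" collapses consecutive delimiters into a single split point.
clean = re.split(r'[;,\s]+', line)

print(clean)  # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
```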


In [66]:
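#Retrieve information from an HTML file
#Note: no HTML string is assigned in this cell, so str refers to Python's built-in type
#and the call raises the TypeError shown below; pass the HTML text as a string instead
result=re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>',str)
print (result)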

TypeError: expected string or bytes-like object
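
The TypeError arises because `str` is Python's built-in type rather than a string of HTML. The original HTML is not shown here, so the sketch below uses a hypothetical two-row table, invented purely for illustration, to show how the same pattern behaves on real text.

```python
import re

# Hypothetical sample markup, invented for illustration only.
html = """<tr><td>1</td> <td>Noah</td> <td>Emma</td></tr>
<tr><td>2</td> <td>Liam</td> <td>Olivia</td></tr>"""

# With two capturing groups, findall returns one tuple per matching row.
rows = re.findall(r'<td>\w+</td>\s<td>(\w+)</td>\s<td>(\w+)</td>', html)
print(rows)  # [('Noah', 'Emma'), ('Liam', 'Olivia')]
```

Using a name such as `html` instead of `str` also avoids shadowing the built-in type.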