# Regular Expressions
If we want to represent a group of strings that follow a particular format or pattern, we should use regular expressions.<br>
i.e. a regular expression is a declarative mechanism for representing a group of strings according to a particular format or pattern.<br>
Eg 1: We can write a regular expression to represent all mobile numbers.<br>
Eg 2: We can write a regular expression to represent all mail ids.<br>
The main application areas of regular expressions are:
 1. To develop validation frameworks/validation logic
 2. To develop pattern matching applications (Ctrl+F in Windows, grep in UNIX, etc.)
 3. To develop translators like compilers, interpreters, etc.
 4. To develop digital circuits
 5. To develop communication protocols like TCP/IP, UDP, etc.

We can develop regular-expression-based applications by using the Python module `re`.<br>
This module contains several built-in functions that make it easy to use regular expressions in our applications.
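
As a minimal sketch of the validation use case mentioned above, the snippet below checks strings against a hypothetical mobile-number format (exactly 10 digits, the first one between 6 and 9). The format and the names `MOBILE_PATTERN` and `is_mobile_number` are assumptions made for illustration only.

```python
import re

# Assumed format: 10 digits in total, the first one between 6 and 9.
MOBILE_PATTERN = re.compile(r'[6-9]\d{9}')

def is_mobile_number(text):
    """Return True if the whole string matches the assumed mobile format."""
    return MOBILE_PATTERN.fullmatch(text) is not None

print(is_mobile_number('9876543210'))
print(is_mobile_number('12345'))
```

`fullmatch` is used here so that extra characters before or after the number make the check fail.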

Sites for learning regex effectively:

*   [regex101.com](https://regex101.com)
*   [regexr.com](https://regexr.com)
  
For practicing regex questions:
* [regexOne.com](https://regexOne.com)


In [1]:
print('\t')

	


In [2]:
print(r'\t') # r marks a raw string: escape sequences such as \t are not interpreted

\t


In [3]:
print([A-Z]{2}\d{1,5})  # a pattern must be written as a string; unquoted, it is a syntax error

SyntaxError: unexpected character after line continuation character (3042984604.py, line 1)

In [None]:
import re
dir(re)

['A',
 'ASCII',
 'DEBUG',
 'DOTALL',
 'I',
 'IGNORECASE',
 'L',
 'LOCALE',
 'M',
 'MULTILINE',
 'Match',
 'Pattern',
 'RegexFlag',
 'S',
 'Scanner',
 'T',
 'TEMPLATE',
 'U',
 'UNICODE',
 'VERBOSE',
 'X',
 '_MAXCACHE',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '__version__',
 '_cache',
 '_compile',
 '_compile_repl',
 '_expand',
 '_locale',
 '_pickle',
 '_special_chars_map',
 '_subx',
 'compile',
 'copyreg',
 'enum',
 'error',
 'escape',
 'findall',
 'finditer',
 'fullmatch',
 'functools',
 'match',
 'purge',
 'search',
 'split',
 'sre_compile',
 'sre_parse',
 'sub',
 'subn',
 'template']

In [5]:
import re

## RegEx Metacharacters

In [9]:
# a basic regex to find email addresses in text

email = re.compile(r'[\w.]+@\w+\.[a-z]{3}')

text = "You can mail us on hrutk.nawade@gmail.com"
email.findall(text)

['hrutk.nawade@gmail.com']

In [10]:
regex = re.compile(r'\$\d+')
regex.findall("the cost is $20")

['$20']

In [11]:
regex = re.compile('at')
regex.findall('Great Expectations')

['at', 'at']

In [15]:
# comparing a raw string with a normal string

print(r'a\nb\nc')
print('a\nb\nc')

a\nb\nc
a
b
c
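
The difference matters for regex patterns because several regex sequences clash with Python's own escape sequences. The sketch below (with an illustrative `text` value) shows that `'\b'` in a normal string is a backspace character, while `r'\b'` reaches `re` as a word boundary.

```python
import re

text = 'the lazy dog'

# In a normal string, '\b' is the backspace character (one character long).
plain = '\bdog\b'
# In a raw string, r'\b' stays backslash + 'b', which re reads as a word boundary.
raw = r'\bdog\b'

plain_result = re.findall(plain, text)
raw_result = re.findall(raw, text)

print(len('\b'), len(r'\b'))
print(plain_result)
print(raw_result)
```

Writing patterns as raw strings avoids this kind of silent mismatch.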


In [None]:
regex = re.compile(r'\w{2}\s\w')
regex.findall('the fox is n9 years old')

['he f', 'ox i', 'n9 y', 'rs o']

In [16]:
regex = re.compile('[aeiou]+')
regex.split('consequential')

['c', 'ns', 'q', 'nt', 'l']

In [17]:
regex.findall('consequential')

['o', 'e', 'ue', 'ia']

## Matching characters

#### \A
`\A` matches only at the very start of the string.

In [20]:
s = 'the quick brown fox jumps he over the lazy dog'
pattern = r'\Ahe'

print(re.findall(pattern, s))

[]
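
For comparison, here is a small sketch (the two-line `text` is made up for illustration) showing how `\A` differs from `^` once `re.MULTILINE` is used.

```python
import re

text = 'he said\nhe left'

# \A only matches at the very start of the string.
start_only = re.findall(r'\Ahe', text)
# ^ with re.MULTILINE also matches right after each newline.
each_line = re.findall(r'^he', text, flags=re.MULTILINE)

print(start_only)
print(each_line)
```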


#### \b
`\b` represents a word boundary. It matches the position where a word starts or ends.

In [30]:
# this finds the match for the word dog exactly
print(re.findall(r'\bdog\b','the quick brown fox jumps over the lazy dog.'))

# as dog is written as doggy here it won't find any match.
print(re.findall(r'\bdog\b','the quick brown fox jumps over the lazy doggy.'))

['dog']
[]


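#### \B
`\B` is the opposite of `\b`: it matches only at positions that are not a word boundary.

In [36]:
# 'dog' stands alone as a word here, so \Bdog\B finds no match
print(re.search(r'\Bdog\B','the quick brown fox jumps over the lazy dog.'))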

None
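
A short sketch of a case where `\B` does match, using an illustrative string in which `cat` appears both as a word and inside a longer word. The start positions make the difference visible.

```python
import re

text = 'cat concatenate'

# \Bcat\B needs word characters on both sides, so only the 'cat' inside 'concatenate' qualifies.
inside_positions = [m.start() for m in re.finditer(r'\Bcat\B', text)]
# \bcat\b needs boundaries on both sides, so only the standalone word qualifies.
whole_positions = [m.start() for m in re.finditer(r'\bcat\b', text)]

print(inside_positions)
print(whole_positions)
```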


#### \d
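Matches any digit. With ASCII-only matching (`re.ASCII`) this is equivalent to `[0-9]`; for ordinary `str` patterns it also matches other Unicode decimal digits.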

In [37]:
# finds all digits in the string
print(re.findall(r'\d', 'This 245 is @345 regex-text'))

['2', '4', '5', '3', '4', '5']


In [40]:
# substitutes all digits with #

regex_code = re.compile(r'\d')
print(regex_code.sub('#' , 'This 245 is @345 regex-text'))

This ### is @### regex-text
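
The sketch below illustrates how `\d` relates to `[0-9]` for `str` patterns. The sample text is an assumption chosen to include one non-ASCII digit.

```python
import re

# '\u0663' is ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit.
text = 'a4\u0663'

unicode_digits = re.findall(r'\d', text)                 # Unicode-aware by default
ascii_digits = re.findall(r'\d', text, flags=re.ASCII)   # restricted to 0-9
bracket_digits = re.findall(r'[0-9]', text)              # always 0-9 only

print(len(unicode_digits), ascii_digits, bracket_digits)
```

If only the ten ASCII digits should be accepted, `[0-9]` or the `re.ASCII` flag is the safer choice.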


#### \D
Matches any character that is not a digit.

In [42]:
print(re.findall(r'\D', 'This 245 is @345 regex-text'))

['T', 'h', 'i', 's', ' ', ' ', 'i', 's', ' ', '@', ' ', 'r', 'e', 'g', 'e', 'x', '-', 't', 'e', 'x', 't']


In [45]:
regex_code = re.compile(r'\D')
text = 'This 245 is @345 regex-text'
print(regex_code.sub('#' , text))

#####245#####345###########


#### \s
Matches any whitespace character (spaces, tabs, newlines).

In [46]:
# finds every whitespace character (here, the spaces) in the string
print(re.findall(r'\s', 'This is @ regex-text'))

[' ', ' ', ' ']


In [48]:
# replaces spaces with #

regex_code = re.compile(r'\s')
print(regex_code.sub('#' , 'this is regex text'))

this#is#regex#text


#### \S
 Matches any non-whitespace character.

In [49]:
# finds all characters except spaces 
print(re.findall(r'\S', 'This is @ regex-text'))

['T', 'h', 'i', 's', 'i', 's', '@', 'r', 'e', 'g', 'e', 'x', '-', 't', 'e', 'x', 't']


In [50]:
regex_code = re.compile(r'\S')
print(regex_code.sub('#' , 'this is regex text'))

#### ## ##### ####


#### \w
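Matches any word character (letters, digits, and underscore). With ASCII-only matching (`re.ASCII`) this is equivalent to `[a-zA-Z0-9_]`; for ordinary `str` patterns it also matches Unicode letters and digits.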

In [51]:
# matches alphanumeric characters and underscore
print(re.findall(r'\w', 'O.T.P = 48N@one_7'))

['O', 'T', 'P', '4', '8', 'N', 'o', 'n', 'e', '_', '7']


In [58]:
regex_code = re.compile(r'\w')
print(regex_code.findall("O.T.P = 43n@one_3"))

['O', 'T', 'P', '4', '3', 'n', 'o', 'n', 'e', '_', '3']


In [53]:
print(re.sub(r'\w', ' ', 'O.T.P = nin@e_7'))

 . .  =    @   


In [57]:
# substitutes every word character with *

regex_code = re.compile(r'\w')
print(regex_code.sub('*', "O.T.P = 43n@one_3"))

*.*.* = ***@*****


#### \W
Matches any non-word character.

In [63]:
# finds everything except word characters
print(re.findall(r'\W',  "O.T.P = 43n@one_3"))

['.', '.', ' ', '=', ' ', '@']


In [61]:
regex_code = re.compile(r'\W')
print(regex_code.findall("O.T.P = 43n@one_3"))

['.', '.', ' ', '=', ' ', '@']


In [64]:
# substitutes every non-word character with a space
print(re.sub(r'\W',' ',   "O.T.P = 43n@one_3"))

O T P   43n one_3


#### \Z
`\Z` is an anchor that matches only at the very end of the string. Unlike `$`, it does not match before a trailing newline.

In [72]:
# checks whether the string ends with 'done'
print(re.search(r'done\Z', 'Job is done'))

<re.Match object; span=(7, 11), match='done'>


In [74]:
# here the string ends with an exclamation mark, so there is no match
print(re.search(r'done\Z', 'Job is done!'))

None
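
To see how `\Z` differs from `$`, here is a small sketch with an illustrative string that ends in a newline.

```python
import re

text = 'Job is done\n'   # note the trailing newline

# $ also matches just before a newline at the end of the string.
dollar_match = re.search(r'done$', text)
# \Z matches only at the very end, after the newline.
z_match = re.search(r'done\Z', text)

print(dollar_match)
print(z_match)
```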


## matching character groups

In [76]:
regex = re.compile('[aeiou]')
regex.split('consequential')

['c', 'ns', 'q', '', 'nt', '', 'l']

In [77]:
regex = re.compile('[A-Z][0-9]')
regex.findall('1043879, G2, H6, jajK4')

['G2', 'H6', 'K4']

### matching repeated characters

In [78]:
regex = re.compile(r'\w{3}')
regex.findall('i love data science')

['lov', 'dat', 'sci', 'enc']

In [None]:
regex = re.compile(r'\w{1,5}')
regex.findall('i love data science')

['i', 'love', 'data', 'scien', 'ce']

In [None]:
regex = re.compile(r'\w+')
regex.findall('i love data science')

['i', 'love', 'data', 'science']

#### `\` backslash
 Escapes special characters or signals special sequences.

In [82]:
text = 'this are regex metacharacters.'

# here . is a metacharacter, so it matches the first character of the string
s = re.search(r'.', text) 
print(s)

# escaped as \. it loses its special meaning and matches a literal dot only
s = re.search(r'\.', text) 
print(s)

<re.Match object; span=(0, 1), match='t'>
<re.Match object; span=(29, 30), match='.'>


#### `[]`  square brackets
Character classes allow you to specify sets of characters to match. These are enclosed in square brackets `[ ]`.

- `[abc]`: Matches any one of a, b, or c.
- `[a-z]`: Matches any lowercase letter from a to z.
- `[A-Z]`: Matches any uppercase letter from A to Z.
- `[0-9]`: Matches any digit from 0 to 9.
- `[^abc]`: Matches any character except a, b, or c.

In [93]:
# matches any single character listed inside the brackets,
# here any of a, b, e, r, i, n

pattern = "[aberin]" 
print(re.findall(pattern, "The quick brown fox jumps over the lazy dog"))

['e', 'i', 'b', 'r', 'n', 'e', 'r', 'e', 'a']


In [90]:
# we can specify the range of characters by using - inside square brackets

string = "The quick brown fox jumps over the lazy dog11234567890"
pattern = "[0-9]" 
result = re.findall(pattern, string)
 
print(result)

['1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0']


In [96]:
#we can specify multiple ranges

string = "The quick brown fox jumps over the lazy dog11234567890"
pattern = "[A-Z0-9]" 
result = re.findall(pattern, string)
 
print(result)

['T', '1', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
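
One more detail worth a sketch: inside square brackets most metacharacters are treated as literals. The sample `text` below is made up for illustration.

```python
import re

text = 'a.b+c*d-e'

# Inside [], '.', '+' and '*' are literal characters;
# '-' is literal when it is placed first or last in the class.
symbols = re.findall(r'[.+*-]', text)

print(symbols)
```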


#### `^`  caret
Matches the start of a string.

In [None]:
# Match strings starting with "The"
regex = r'^The'
strings = ['The quick brown fox', 'The lazy dog', 'A quick brown fox']
for string in strings:
    if re.match(regex, string):
        print(f'Matched: {string}')
    else:
        print(f'Not matched: {string}')

In [97]:
# using caret symbol inside square brackets we can invert the character class
# means any character except a, b, e, r, i, n 

string = "The quick brown fox jumps over the lazy dog"
pattern = "[^aberin]" 
result = re.findall(pattern, string)
 
print(result)

['T', 'h', ' ', 'q', 'u', 'c', 'k', ' ', 'o', 'w', ' ', 'f', 'o', 'x', ' ', 'j', 'u', 'm', 'p', 's', ' ', 'o', 'v', ' ', 't', 'h', ' ', 'l', 'z', 'y', ' ', 'd', 'o', 'g']


#### `$` dollar
 Matches the end of a string.

In [100]:
import re
 
string = "Hello World!"
pattern = r"World!$"  # checks whether the string ends with "World!"
  
match = re.search(pattern, string)
if match:
    print("Match found!")
else:
    print("Match not found.")

Match found!


In [101]:
regex = re.compile(r'\$')
regex.findall("the cost is $20")

['$']

#### `.`  dot
 Matches any character except a newline.

In [None]:
import re
 
string = "The quick brown fox jumps over the lazy dog."
pattern = r"brown.fox"    # . matches exactly one character (here the space), but not a newline
 
mat = re.search(pattern, string)
if mat:
    print("Match found!")

else:
    print("Match not found.")

Match found!
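
The newline exception can be lifted with the `re.DOTALL` flag. A minimal sketch, using an illustrative string where the two words are separated by a newline:

```python
import re

text = 'brown\nfox'

# By default '.' does not match the newline between the words.
without_flag = re.search(r'brown.fox', text)
# With re.DOTALL, '.' matches any character, including a newline.
with_flag = re.search(r'brown.fox', text, flags=re.DOTALL)

print(without_flag)
print(with_flag is not None)
```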


#### | - or

In [None]:
import re

string = 'the quick brown fox jumps over the lazy dog'
pattern = 'a|b'
res = re.search(pattern, string)
print(res)

<re.Match object; span=(10, 11), match='b'>
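
Alternation also works between longer alternatives. The sketch below (the second sample string is an assumption for illustration) shows whole-word alternatives and how a group limits the scope of `|`.

```python
import re

text = 'the quick brown fox jumps over the lazy dog'

# Alternation between whole words.
animals = re.findall(r'fox|dog', text)

# '|' has the lowest precedence, so a group is needed to limit its scope.
# (?:...) is a non-capturing group.
colours = re.findall(r'gr(?:a|e)y', 'gray grey groy')

print(animals)
print(colours)
```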


#### ? - question mark

In [None]:
regex = re.compile(r'o?')   # ? matches zero or one occurrence of the preceding item
regex.search('ove data science')

<re.Match object; span=(0, 1), match='o'>
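
A typical use of `?` is to make a single character optional. A small sketch with made-up sample words:

```python
import re

text = 'color colour colouur'

# 'u?' allows zero or one 'u', so 'colouur' (two u's) is not matched.
matches = re.findall(r'colou?r', text)

print(matches)
```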

#### * - star/asterisk

In [None]:
regex = re.compile(r'\w*')  # * matches zero or more occurrences of the preceding item
regex.findall('i love data science')

['i', '', 'love', '', 'data', '', 'science', '']

#### + - plus

In [None]:
regex = re.compile(r'\w+')  # + matches one or more occurrences of the preceding item
regex.findall('i love data science')

['i', 'love', 'data', 'science']

#### {m,n} braces

In [None]:
regex = re.compile(r'\w{2}')  # {m} matches exactly m repetitions of the preceding item; {m,n} matches from m to n
regex.findall('i love data science')

['lo', 've', 'da', 'ta', 'sc', 'ie', 'nc']

In [None]:
regex = re.compile(r'\w{2,4}')
regex.findall('i love data science')

['love', 'data', 'scie', 'nce']
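
The results above split long words into pieces because nothing anchors the match to whole words. As a sketch of one way to count by whole words instead, the brace forms can be combined with `\b` (the variable names are illustrative):

```python
import re

text = 'i love data science'

exactly_4 = re.findall(r'\b\w{4}\b', text)     # words of exactly 4 characters
at_least_5 = re.findall(r'\b\w{5,}\b', text)   # {5,} means 5 or more
up_to_3 = re.findall(r'\b\w{1,3}\b', text)     # between 1 and 3 characters

print(exactly_4)
print(at_least_5)
print(up_to_3)
```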

In [None]:
email = re.compile(r'\w+\.\w+@\w+\.[a-z]{3}')
email.findall('aiadventures.pune@gmail.com')

['aiadventures.pune@gmail.com']

### extracting groups

#### ( ) - groups

In [None]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [None]:
text = "To contact us, you can mail us on aiadventures.pune@gmail.com or contact hrutik.nawade@gmail.com Ankur on ankurs.aiadventures@gmail.com"
email3.findall(text)

[('aiadventures.pune', 'gmail', 'com'),
 ('hrutik.nawade', 'gmail', 'com'),
 ('ankurs.aiadventures', 'gmail', 'com')]

## regex methods

#### compile method

In [None]:
text = '''Many people still conflate Google with the internet.
They don't know that Google is actually a search engine like Bing, Baidu, Yahoo. However aforementioned fact surely reflects the prevalent use of Google. '''
import re
regex_object = re.compile(r'Google')
print(regex_object.findall(text))


['Google', 'Google', 'Google']


In [None]:
text = " Is nature a creation of God or God itself ? "
import re
regex_code = re.compile(r'god')  # matching is case-sensitive, so 'God' is not found
print(regex_code.findall(text))

[]
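
The empty result comes from case-sensitive matching. A minimal sketch of the same search with the `re.IGNORECASE` flag:

```python
import re

text = " Is nature a creation of God or God itself ? "

case_sensitive = re.findall(r'god', text)
# re.IGNORECASE makes 'god' match 'God' as well.
case_insensitive = re.findall(r'god', text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```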


In [None]:
text = " P2P stands for peer-to-peer network. In a peer-to-peer network, peers are computer systems connected to each other via internet connection. "
import re
regex_code = re.compile(r'P2P')
print(regex_code.findall(text))

['P2P']


#### findall method



In [None]:
import re
text = 'Employees with ids AC23455 and HB4596857 are to be promoted. AH23'
print(re.findall(r'[A-Z]{2}\d{1,5}',text))


['AC23455', 'HB45968', 'AH23']


In [None]:
import re
 
string = """Hello my Number is 123456789 and
            my friend's number is 987654321"""
 
regex = r'\d+'
 
match = re.findall(regex, string)
print(match)
 

['123456789', '987654321']


#### sub method

In [None]:
line = "this is aiadventures"
# str.replace returns a new string; line itself is not modified
line.replace( 'aiadventures', 'data science')
line

'this is aiadventures'

In [None]:
re.sub('is', '##',line)

'th## ## aiadventures'

In [None]:
import re

regex_code = re.compile(r'\W')
text = "O.T.P = 43n@one_3"
print(regex_code.sub(' ',text))
print(regex_code)

O T P   43n one_3
re.compile('\\W')


In [None]:
line = 'i  love data science'

In [None]:
import re
 
# The pattern 'ub' occurs in "Subject" and, ignoring
# case, in "Uber". With flags=re.IGNORECASE both
# occurrences are replaced by '~*'.
print(re.sub('ub', '~*', 'Subject has Uber booked already',
             flags=re.IGNORECASE))
 
# Without the flag the match is case-sensitive, so
# 'Ub' in "Uber" is not replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already'))
 
# count=1 limits the number of replacements to one.
print(re.sub('ub', '~*', 'Subject has Uber booked already',
             count=1, flags=re.IGNORECASE))
 
# r'...' is a raw string; \s matches a whitespace
# character on either side of "AND".
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam',
             flags=re.IGNORECASE))

S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
Baked Beans & Spam
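
The replacement passed to `re.sub` can also refer to captured groups or be a function. The sketch below uses made-up sample strings to show both forms.

```python
import re

text = 'born on 26/12/1995'

# \1, \2 and \3 in the replacement refer to the captured groups,
# so the date is rewritten as year-month-day.
iso_date = re.sub(r'(\d{2})/(\d{2})/(\d{4})', r'\3-\2-\1', text)

# The replacement can also be a function that receives the match object.
shouted = re.sub(r'[a-z]+', lambda m: m.group(0).upper(), 'born on')

print(iso_date)
print(shouted)
```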


#### subn method

In [None]:
import re

print(re.subn('ub', '@@', 'this ub is replaceable ub'))


('this @@ is replaceable @@', 2)


In [None]:
import re

print(re.subn('ub', '@@', 'this ub is ub replaceable ub', count=1)) # this will substitute only 1 occurrence

('this @@ is ub replaceable ub', 1)


#### split method

In [None]:
import re
regex = re.compile(r'\s+')
r = regex.split(line)

In [None]:
regex.findall(line)

['  ', ' ', ' ']

In [None]:
print(re.split(r'\d+', 'this is my number 7457297 238759 3257205 87345 but this will split for only two numbers', maxsplit=2))

['this is my number ', ' ', ' 3257205 87345 but this will split for only two numbers']
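
If the separators themselves are needed, wrapping the pattern in a capturing group keeps them in the result. A minimal sketch with an illustrative string:

```python
import re

text = 'a1b22c'

# Without a group, the matched separators are dropped.
without_group = re.split(r'\d+', text)
# With a capturing group, the separators are included in the result list.
with_group = re.split(r'(\d+)', text)

print(without_group)
print(with_group)
```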


#### match method

In [None]:
for s in ["     ", "abc  ", "  abc"]:
    if regex.match(s):
        print(repr(s), "matches")
    else:
        print(repr(s), "does not match")

'     ' matches
'abc  ' does not match
'  abc' matches
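
`match` only tries the pattern at the start of the string, which is why `'abc  '` fails above. The sketch below, with an illustrative pattern and text, contrasts it with `search` and `fullmatch`.

```python
import re

pattern = re.compile(r'\d+')
text = 'abc 123'

at_start = pattern.match(text)         # tries only at position 0
anywhere = pattern.search(text)        # scans the string for the first match
whole_ok = pattern.fullmatch('123')    # the entire string must match
whole_bad = pattern.fullmatch(text)

print(at_start)
print(anywhere.group())
print(whole_ok.group())
print(whole_bad)
```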


In [None]:
line.index('data')

8

#### search method


In [2]:
line = 'i  love data 4398 science & data analytics'

In [4]:
import re

In [5]:
regex = re.compile(r'\d+')
mat = regex.search(line)
print(mat.end())
print(mat)

17
<re.Match object; span=(13, 17), match='4398'>


In [None]:
import re

regex = r'([a-zA-Z]+) (\d+)'

mat = re.search(regex, 'this is regex function learned on April 5')

if mat is not None:
    print('Match at index {} {}'.format(mat.start(), mat.end()))
    print('Full match: {}'.format(mat.group(0)))
    print('Month: {}'.format(mat.group(1)))
    print('Day: {}'.format(mat.group(2)))
else:
    print('No match found.')


Match at index 34 41
Full match: April 5
Month: April
Day: 5


#### escape method

In [None]:
import re
 
# escape() returns the string with a backslash before every
# character that could have a special meaning in a regex (Python 3.7+).
# In the 1st case only the spaces are escaped.
# In the 2nd case the spaces, '[', '-', ']', the tab produced
# by '\t' and the caret '^' are escaped; ',' is left as-is.
print(re.escape("This is Awesome even 1 AM"))
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

This\ is\ Awesome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW
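
A common reason to call `re.escape` is to search for text that should be taken literally, such as user input. A minimal sketch, where `user_input` and `text` are made-up values:

```python
import re

user_input = '$20 (approx.)'
text = 'the cost is $20 (approx.) today'

# Used directly, '$', '(', ')' and '.' are read as metacharacters.
unescaped = re.findall(user_input, text)
# After escaping, every character is matched literally.
escaped = re.findall(re.escape(user_input), text)

print(unescaped)
print(escaped)
```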


#### group method

In [None]:
import re
 
s = "Welcome to GeeksForGeeks this sentence is grouped there"
 
# here res is the match object
res = re.search(r"\D{2} t", s)
 
print(res.group())

me t


### extracting groups

In [None]:
email3 = re.compile(r'([\w.]+)@(\w+)\.([a-z]{3})')

In [None]:
text = "To contact us, you can mail us on aiadventures.pune@gmail.com or contact Ankur on ankurs.aiadventures@gmail.com"
email3.findall(text)

[('aiadventures.pune', 'gmail', 'com'),
 ('ankurs.aiadventures', 'gmail', 'com')]

In [None]:
email = re.compile(r'(?P<name>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
email.search(text)  # search scans the whole text; match would only try the start, where there is no address
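
Named groups can be read back by name from the match object. The sketch below reuses the pattern with a shortened, illustrative `text`.

```python
import re

email = re.compile(r'(?P<name>[\w.]+)@(?P<domain>\w+)\.(?P<suffix>[a-z]{3})')
text = "mail us on aiadventures.pune@gmail.com"

mat = email.search(text)
if mat is not None:
    print(mat.group('name'))   # one group, looked up by name
    print(mat.groupdict())     # all named groups as a dictionary
```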

# Exercise

## Theory question

In [None]:
# 2

import re

s = 'this string used for replacing question using regex'
re.subn('ing', '***', s)



('this str*** used for replac*** question us*** regex', 3)

In [None]:
# 7

import re

regex = re.compile(r'[\d/-]{3,}')
print(regex.findall('If Ram was born on 26/12/1995 then he will be 25 years old, on 28-10-2021.'))

['26/12/1995', '28-10-2021']


## coding questions

In [None]:
# 2

def len_words(string):
    ### your code here
    return {i:len(i) for i in re.split(' ', string)}

string1 = "Practice Problems to Drill List Comprehension in Your Head."
len_words(string1)

{'Practice': 8,
 'Problems': 8,
 'to': 2,
 'Drill': 5,
 'List': 4,
 'Comprehension': 13,
 'in': 2,
 'Your': 4,
 'Head.': 5}

In [None]:
# 4

import re

def create_abbreviations(fullname):
    ### your code 
    regex = re.compile(r'\s')
    temp = regex.split(fullname)
    new_s = ''
    for i in range(len(temp)):
      if i == len(temp)-1:
        new_s += temp[i].capitalize()
      else:
        new_s += temp[i][0].upper()+'. '
    return new_s

name = input('Enter your full name : ')
create_abbreviations(name)


Enter your full name : hrutik djfb sdf


'H. D. sdf'

In [None]:
import re
# 5 

def pattern_finding(string):
    ### your code here
    regex = re.compile(r'[AEIOU][^AEIOU]+')
    res = regex.findall(string)
    d = {}
    # print(res)
    for i in res:
      add = len(i)
      if add<=3:
        new_s =  i + str(add)*add
      else:
        new_s = str(add)*add + i
      d[new_s] = add
    return d



string1 = 'ABCDEFGHAWQETEAINDVOPLZABMNPUI'
pattern_finding(string1)

{'4444ABCD': 4,
 '4444EFGH': 4,
 'AWQ333': 3,
 'ET22': 2,
 '4444INDV': 4,
 '4444OPLZ': 4,
 '55555ABMNP': 5}

In [20]:
# 6

import re 

def string_modification(string):

  r1 = re.compile(r'\s+')
  split_string = r1.split(string)
  if split_string[0][0].islower() or string.endswith('!'):
    for word in split_string:
      if word.islower():
        r2 = re.compile(word)
        string = r2.sub(word.upper(), string)
      if word.isupper():
        r2 = re.compile(word)
        string = r2.sub(word.lower(), string)
  print(string.center(100))

string1 = "learning data science is too much fun! Always in a process of upgradation."
string_modification(string1)



             LEARNING DATA SCIENCE IS TOO MUCH FUN! AlwAys IN A PROCESS OF upgrAdAtion.             


In [21]:
string2 = "Artificial Intelligence, BlockChain, Cybersecurity and Networking are going to mould the future!"
string_modification(string2)

  Artificial Intelligence, BlockChain, Cybersecurity AND Networking ARE GOING TO MOULD THE FUTURE!  


In [14]:
# 8
import re
def extract_info(links):

  date_l = []
  news_article_l = []
  ext_l = []
  category_l = []
  headline_l = []


  for i in links:
    date = re.findall(r'\d{4}/\d{2}/\d{2}', i)[0]
    date_l.append(date)
    # escape the dots so they match a literal '.' and run the search once
    news_article, ext = re.findall(r'www\.(\w+)\.(\w+)', i)[0]
    news_article_l.append(news_article)
    ext_l.append(ext)
    category = re.findall(r'[A-Z]+', i)[0]
    category_l.append(category)
    headline = re.findall(r'(\w+(-\w+)+)+', i)[0][0]
    headline_l.append(headline)

  max_date = max(len(i) for i in date_l)
  max_news_article = max(len(i) for i in news_article_l)
  max_ext = max(len(i) for i in ext_l)
  max_category = max(len(i) for i in category_l)
  max_headline = max(len(i) for i in headline_l)

  print('Date'.center(max_date), 'News Article'.center(max_news_article), 'Ext'.center(max_ext), 'Category'.center(max_category), 'Headline'.center(max_headline))
  for i in range(len(links)):
    print(date_l[i].center(max_date), news_article_l[i].center(max_news_article), ext_l[i].center(max_ext), category_l[i].center(max_category), headline_l[i].center(max_headline))


links = ['https://www.washingtonpost.com/TECHNOLOGY/2021/08/31/tips-phone-disasters/',
'https://www.nytimes.com/2019/12/30/TELEVISION/indian-tv-amazon-netflix/',
'https://www.thestar.net/TERRORISM/2019/06/15/maoist-rebels-kill-5-policemen-in-eastern-india/',
'https://www.weforum.org/HEALTH/2022/01/10/covid19-top-stories-omicron-coronavirus/',
'https://www.livemint.in/2022/04/22/SPORTS/neeraj-chopra-wins-gold-in-javelin-throw-at-tokyo-olympics/']

extract_info(links)

   Date     News Article  Ext  Category                           Headline                         
2021/08/31 washingtonpost com TECHNOLOGY                    tips-phone-disasters                   
2019/12/30    nytimes     com TELEVISION                  indian-tv-amazon-netflix                 
2019/06/15    thestar     net TERRORISM       maoist-rebels-kill-5-policemen-in-eastern-india      
2022/01/10    weforum     org   HEALTH            covid19-top-stories-omicron-coronavirus          
2022/04/22    livemint     in   SPORTS   neeraj-chopra-wins-gold-in-javelin-throw-at-tokyo-olympics
