<a href="https://colab.research.google.com/github/op-mahato/NLP/blob/main/regular_expression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Regular Expressions**


---
Regular expressions, often abbreviated as regex or regexp, are powerful tools for pattern matching and text manipulation. A regular expression is a sequence of characters that defines a search pattern. Regular expressions are widely used in programming languages, text editors, and command-line utilities to search, match, and replace strings based on specific patterns.


**The re Module**

After importing the module we can use it to detect or find patterns.

In [None]:
import re

**Character:** All characters, except those having special meaning in regex, match themselves. E.g., the regex x matches the substring "x"; regex 9 matches "9"; regex = matches "="; and regex @ matches "@".

**Special Regex Characters (Metacharacters):** These characters have special meaning in regex (to be discussed below): ., +, *, ?, ^, $, (, ), [, ], {, }, |, \.

**Metacharacters**

Metacharacters are the characters with special meaning.

Metacharacters are the building blocks of regex patterns, and they are used throughout the functions of the re module. Below is the list of metacharacters.



*   \    : Used to drop the special meaning of character following it
*  []    : Represent a character class
*   ^    : Matches the beginning
*   $    : Matches the end
*   .    : Matches any character except newline
*   |    : Means OR (matches any one of the alternatives separated by it)
*   ?    : Matches zero or one occurrence
*   '*'  : Any number of occurrences (including 0 occurrences)
*   '+'  : One or more occurrences
*   {}   : Indicate the number of occurrences of a preceding regex to match.
*   ()   : Enclose a group of Regex



**Metacharacters & Description in Detail**

\d

Matches any decimal digit; this is equivalent to the class [0-9].

\D

Matches any non-digit character; this is equivalent to the class [^0-9].

\s

Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

\S

Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

\w

Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].

\W

Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

**.**

Matches with any single character except newline '\n'.

?

match 0 or 1 occurrence of the pattern to its left

+

1 or more occurrences of the pattern to its left

*

0 or more occurrences of the pattern to its left

\b

Matches the boundary between a word character and a non-word character; \B is the opposite of \b

'[..]'

Matches any single character in a square bracket and [^..] matches any single character not in square bracket.


{m,n}

Matches at least m and at most n occurrences of the preceding pattern

a| b

Matches either a or b
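
As a quick check of the shorthand classes and alternation described above, the following sketch runs a few of them over a made-up sentence. The `sample` string and the variable names are illustrative assumptions, not part of the original notebook.

```python
import re

# Hypothetical sample text, chosen only to exercise the classes above.
sample = "Order 66 shipped to Room_7B, on 2021-07-08."

digits = re.findall(r"\d+", sample)          # runs of decimal digits
words = re.findall(r"\w+", sample)           # runs of letters, digits, or underscore
spaces = re.findall(r"\s", sample)           # each whitespace character
either = re.findall(r"Order|Room", sample)   # alternation: either word

print(digits)       # ['66', '7', '2021', '07', '08']
print(words)        # ['Order', '66', 'shipped', 'to', 'Room_7B', 'on', '2021', '07', '08']
print(len(spaces))  # 6
print(either)       # ['Order', 'Room']
```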

**Occurrence Indicators (or Repetition Operators or Quantifiers):**


*   +: one or more (1+), e.g., [0-9]+ matches one or more digits such as '123', '000'.
*   \*: zero or more (0+), e.g., [0-9]* matches zero or more digits. It accepts all those in [0-9]+ plus the empty string.

*  ?: zero or one (optional), e.g., [+-]? matches an optional "+", "-", or an empty string.
*   {m,n}: m to n (both inclusive)

*   {m}: exactly m times
*   {m,}: m or more (m+)
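
The quantifiers above can be tried out with a small sketch like the one below. The helper `fully_matches` and the `numbers` string are hypothetical names introduced only for this illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

print(fully_matches(r"[0-9]+", "123"))       # True
print(fully_matches(r"[0-9]+", ""))          # False: + needs at least one digit
print(fully_matches(r"[0-9]*", ""))          # True: * also accepts the empty string
print(fully_matches(r"[+-]?[0-9]+", "-42"))  # True: the sign is optional

# Hypothetical sample: numbers of one to four digits.
numbers = "7 42 123 4567"
two_to_three = re.findall(r"\b\d{2,3}\b", numbers)   # {m,n}: 2 to 3 digits
exactly_four = re.findall(r"\b\d{4}\b", numbers)     # {m}: exactly 4 digits
three_or_more = re.findall(r"\b\d{3,}\b", numbers)   # {m,}: 3 or more digits

print(two_to_three)   # ['42', '123']
print(exactly_four)   # ['4567']
print(three_or_more)  # ['123', '4567']
```

The \b anchors keep a shorter quantifier from matching part of a longer number.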


**Character class (or Bracket List):**

[...]: Accept ANY ONE of the characters within the square brackets, e.g., [aeiou] matches "a", "e", "i", "o" or "u".

[.-.] (Range Expression): Accept ANY ONE of the characters in the range, e.g., [0-9] matches any digit; [A-Za-z] matches any uppercase or lowercase letter.

[^...]: NOT ONE of the characters, e.g., [^0-9] matches any non-digit.

Only these four characters require escape sequence inside the bracket list: ^, -, ], \.
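
A minimal sketch of the bracket-list rules above, using a made-up `text` string (an assumption for illustration only):

```python
import re

# Hypothetical sample containing letters, a digit, and several symbols.
text = "Beta-7 x^2 [ok]"

vowels = re.findall(r"[aeiou]", text)          # any one listed character
letter_runs = re.findall(r"[A-Za-z]+", text)   # range expression
symbols = re.findall(r"[^0-9A-Za-z\s]", text)  # negated class: not digit, letter, or whitespace
escaped = re.findall(r"[\^\-\]\\]", text)      # ^, -, ] and \ escaped inside the brackets

print(vowels)       # ['e', 'a', 'o']
print(letter_runs)  # ['Beta', 'x', 'ok']
print(symbols)      # ['-', '^', '[', ']']
print(escaped)      # ['-', '^', ']']
```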


**Escape Sequences**

(\char):
To match a character that has special meaning in regex, prefix it with a backslash (\\). E.g., \\. matches "."; regex \\+ matches "+"; and regex \\( matches "(".
You also need to use regex \\\ to match "\\" (backslash).
Regex recognizes common escape sequences such as \n for newline, \t for tab, \r for carriage return, \nnn for an octal number of up to 3 digits, \xhh for a two-digit hex code, \uhhhh for a 4-digit Unicode code point, and \Uhhhhhhhh for an 8-digit Unicode code point.
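
The effect of escaping can be seen in a short sketch such as the following. The sample strings are assumptions chosen for illustration; `re.escape()` is the standard-library helper that adds the backslashes for you.

```python
import re

# An unescaped dot matches any character; an escaped dot matches only ".".
unescaped = re.findall(r"3.50", "3.50 3x50")
escaped = re.findall(r"3\.50", "3.50 3x50")
print(unescaped)  # ['3.50', '3x50']
print(escaped)    # ['3.50']

# The regex \\ matches one literal backslash.
backslashes = re.findall(r"\\", r"C:\temp\new")
print(len(backslashes))  # 2

# re.escape() escapes special characters so the text is matched literally.
literal = re.escape("a+b")
plus_matches = re.findall(literal, "a+b aab")
print(plus_matches)  # ['a+b']
```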

**Position Anchors:** These do not match characters but positions, such as start-of-line, end-of-line, start-of-word and end-of-word.


*   ^,\$: start-of-line and end-of-line respectively. E.g., ^[0-9]+$ matches a string made up only of digits.
*   \b: boundary of word, i.e., start-of-word or end-of-word. E.g., \bcat\b matches the word "cat" in the input string.

*  \B: Inverse of \b, i.e., non-start-of-word or non-end-of-word.
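*  \<, \>: start-of-word and end-of-word respectively in some regex flavors (e.g., GNU grep), similar to \b. E.g., \<cat\> matches the word "cat" in the input string. Python's re module does not support these two; use \b instead.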
*  \A, \Z: start-of-input and end-of-input respectively.
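
The anchors can be compared side by side in a sketch like this one. The multi-line `text` is a made-up sample, and `re.MULTILINE` is the standard flag that lets ^ and $ match at every line rather than only at the ends of the whole string.

```python
import re

# Hypothetical three-line sample.
text = "cat\nconcat cat\ncats"

start_only = re.findall(r"^cat", text)                # ^ matches only at the start of the string
each_line = re.findall(r"^cat", text, re.MULTILINE)   # ^ matches at the start of every line
whole_word = re.findall(r"\bcat\b", text)             # "cat" as a complete word
inside_word = re.findall(r"\Bcat", text)              # "cat" NOT at the start of a word
starts_with = re.search(r"\Acat", text) is not None   # \A: start of input
ends_with = re.search(r"cats\Z", text) is not None    # \Z: end of input
numeric = re.search(r"^[0-9]+$", "2021") is not None  # digits only
mixed = re.search(r"^[0-9]+$", "20a21") is not None

print(start_only)   # ['cat']
print(each_line)    # ['cat', 'cat']
print(whole_word)   # ['cat', 'cat']
print(inside_word)  # ['cat']
print(starts_with, ends_with)  # True True
print(numeric, mixed)          # True False
```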


**Methods in re Module**

To find a pattern, the re module provides several functions that search for a match in a string.


*   re.match(): Matches only at the beginning of the string and returns a match object if found, else returns None.
*   re.fullmatch() : It is used to match the whole string with a regex pattern.
*   re.search(): Returns a match object if there is one anywhere in the string, including multiline strings.
*   re.findall(): Returns a list containing all matches
*   re.finditer(): It returns an iterator that yields match objects.
*   re.split(): Takes a string, splits it at the match points, returns a list
*   re.sub(): Replaces one or many matches within a string

*   re.subn(): It works the same as 'sub'. It returns a tuple (new_string, num_of_substitution).
*   re.compile() : It compiles a pattern into a regular expression object that can be reused to match the pattern against many strings.

*   re.escape() :It is used to escape special characters in a pattern.
*  re.purge() : It is used to clear the regex expression cache.
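
Several of the functions listed above are not demonstrated later in this notebook, so here is a minimal sketch of `re.compile()`, `finditer()`, `subn()` and `re.fullmatch()`. The variable names are illustrative assumptions.

```python
import re

txt = "The rain in Spain"

# Compile once, then reuse the pattern object.
pattern = re.compile(r"ai")

# finditer() yields one match object per occurrence.
spans = [m.span() for m in pattern.finditer(txt)]
print(spans)  # [(5, 7), (14, 16)]

# subn() returns the new string and the number of substitutions made.
new_txt, count = pattern.subn("AI", txt)
print(new_txt, count)  # The rAIn in SpAIn 2

# fullmatch() succeeds only when the whole string matches.
whole = re.fullmatch(r"\w+", "Spain")
partial = re.fullmatch(r"\w+", txt)  # the spaces are not word characters
print(whole is not None, partial is not None)  # True False
```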





**Match**



In [None]:
# syntax
re.match(substring, string, re.I)
# substring is a string or a pattern, string is the text we search for the pattern, re.I is the ignore-case flag

In [None]:
import re

txt = 'I love to teach Natural Language Processing and Machine Translation'
# It returns an object with span, and match
match = re.match('I love to teach', txt, re.I)
print(match)  # <re.Match object; span=(0, 15), match='I love to teach'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (0, 15)
# Lets find the start and stop position from the span
start, end = span
print(start, end)  # 0, 15
substring = txt[start:end]
print(substring)       # I love to teach

As you can see from the example above, the pattern we are looking for (or the substring we are looking for) is I love to teach. The match function returns an object only if the text starts with the pattern.

In [None]:
import re

txt = 'I love to teach Natural Language Processing and Machine Translation'
match = re.match('I like to teach', txt, re.I)
print(match)  # None

None


The string does not start with I like to teach, therefore there was no match and the match method returned None.
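
Because re.match() returns None when nothing matches, calling .span() or .group() on the result directly would raise an AttributeError. One way to guard against that is sketched below; the helper name `leading_phrase` is hypothetical.

```python
import re

def leading_phrase(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    match = re.match(pattern, text, re.I)
    return match.group() if match else None

txt = 'I love to teach Natural Language Processing and Machine Translation'
print(leading_phrase('i love to teach', txt))  # I love to teach (re.I ignores case)
print(leading_phrase('I like to teach', txt))  # None
print(leading_phrase('teach', txt))            # None: "teach" is not at the start
```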

**Search**

In [None]:
# syntax
re.search(substring, string, re.I)
# substring is a pattern, string is the text we search for the pattern, re.I is the ignore-case flag

The **Match** object has properties and methods used to retrieve information about the search, and the result:

*   .span() returns a tuple containing the start and end positions of the match.
*   .string returns the string passed into the function
*   .group() returns the part of the string where there was a match


**Example**
Print the position (start- and end-position) of the first match occurrence.

The regular expression looks for any word that starts with an upper case "S":

In [None]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.span())

(12, 17)


**Example**

Print the string passed into the function:

In [None]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.string)


The rain in Spain


**Example**
Print the part of the string where there was a match.

The regular expression looks for any word that starts with an upper case "S":

In [None]:
import re

txt = "The rain in Spain"
x = re.search(r"\bS\w+", txt)
print(x.group())

Spain


In [None]:
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# It returns an object with span and match
match = re.search('first', txt, re.I)
print(match)  # <re.Match object; span=(100, 105), match='first'>
# We can get the starting and ending position of the match as tuple using span
span = match.span()
print(span)     # (100, 105)
# Lets find the start and stop position from the span
start, end = span
print(start, end)  # 100 105
substring = txt[start:end]
print(substring)       # first

**Group Extraction**


In [None]:
text = 'purple alice-b@google.com monkey dishwasher'  # named text to avoid shadowing the built-in str
match = re.search(r'([\w.-]+)@([\w.-]+)', text)  # raw string keeps the backslashes intact
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2))  ## 'google.com' (the host, group 2)

alice-b@google.com
alice-b
google.com


**The search() Function**

The search() function searches the string for a match, and returns a Match object if there is a match.

If there is more than one match, only the first occurrence of the match will be returned:

**Example**
Search for the first white-space character in the string:

In [None]:
import re

txt = "The rain in Spain"
x = re.search(r"\s", txt)
print(x)
print("The first white-space character is located in position:", x.start())

<re.Match object; span=(3, 4), match=' '>
The first white-space character is located in position: 3


**The findall() Function**

The findall() function returns a list containing all matches.

**Example**
Print a list of all matches:

In [None]:
import re

txt = "The rain in Spain"
x = re.findall("ai", txt)
print(x)

['ai', 'ai']


The list contains the matches in the order they are found.

If no matches are found, an empty list is returned:

**Example**
Return an empty list if no match was found:

In [None]:
import re

txt = "The rain in Spain"
x = re.findall("Portugal", txt)
print(x)

[]


**Escape character(\) in RegEx**

In [None]:
regex_pattern = r'\d'  # \d is a special sequence that matches any digit
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'], this is not what we want

['6', '2', '0', '1', '9', '8', '2', '0', '2', '1']


**One or more times(+)**

In [None]:
regex_pattern = r'\d+'  # \d matches a digit, + means one or more times
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021'] - now, this is better!

['6', '2019', '8', '2021']


**Period(.) Example**


In [None]:
regex_pattern = r'[a].'  # [a] matches the letter a, and . matches any character except newline
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

regex_pattern = r'[a].+'  # .+ matches any character one or more times, so the match runs to the end of the line
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['an', 'an', 'an', 'a ', 'ar']
['and banana are fruits']


**Quantifier in RegEx**



In [None]:
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['2019', '2021']

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{1,4}'   # 1 to 4 digits; no space is allowed inside the braces
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021']

['2019', '2021']
['6', '2019', '8', '2021']


 **Zero or more times(\*)**

Zero or more times. The pattern may not occur at all, or it may occur many times.

In [None]:
regex_pattern = r'[a].*'  # . any character, * any character zero or more times
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['and banana are fruits']


**Caret ^** **Example**

*   Starts with


In [None]:
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'^This'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['This']

['This']


*   Negation

In [None]:
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'[^A-Za-z ]+'  # ^ in set character means negation, not A to Z, not a to z, no space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6,', '2019', '8,', '2021']

['6,', '2019', '8,', '2021']


**\b Example**: \b returns a match where the specified characters are at the beginning or at the end of a word
(the "r" prefix makes sure that the string is treated as a "raw string")




*   Check if "ain" is present at the beginning of a WORD


In [None]:
import re
txt = "The rain in Spain"
x = re.findall(r"\bain", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

[]
No match


*   Check if "ain" is present at the end of a WORD:


In [None]:
import re
txt = "The rain in Spain"
x = re.findall(r"ain\b", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['ain', 'ain']
Yes, there is at least one match!


**Example**
Check if "ain" is present, but NOT at the beginning of a word:

In [None]:
import re
txt = "The rain in Spain"
x = re.findall(r"\Bain", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['ain', 'ain']
Yes, there is at least one match!


**Exercise**
Check if "ain" is present, but NOT at the end of a word

**\A Example**

Check if the string starts with "The"


In [None]:
import re
txt = "The rain in Spain"
x = re.findall(r"\AThe", txt)
print(x)
if x:
  print("Yes, there is a match!")
else:
  print("No match")

['The']
Yes, there is a match!


**\D example**

Return a match at every non-digit character

In [None]:
import re
txt = "The rain in Spain"
x = re.findall(r"\D", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")

['T', 'h', 'e', ' ', 'r', 'a', 'i', 'n', ' ', 'i', 'n', ' ', 'S', 'p', 'a', 'i', 'n']
Yes, there is at least one match!


**Exercise**

Check if the string contains any digits (numbers from 0-9)



**Example**
Check if the string has any two-digit numbers, from 00 to 59

In [None]:
import re
txt = "8 times before 11:45 AM"
x = re.findall("[0-5][0-9]", txt)
print(x)
if x:
  print("Yes, there is at least one match!")
else:
  print("No match")


['11', '45']
Yes, there is at least one match!


**Example:**

Suppose we have a text with many email addresses. List all the email addresses found in the string.

In [None]:
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'  # named text to avoid shadowing the built-in str
emails = re.findall(r'[\w\.-]+@[\w\.-]+', text) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

alice@google.com
bob@abc.com


**findall() With Files**


In [None]:
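# "a" appends, so each run of this cell adds another copy of the sentence to demofile.txt
with open("demofile.txt", "a") as f:
  f.write("Now the file has more content!. The content can written here..")

# Open file
with open('demofile.txt', 'r') as f1:
  text = f1.read()

strings = re.findall(r'the', text)  # case-sensitive: matches 'the' but not 'The'
print(strings)  # one 'the' per appended copy of the sentence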

['the', 'the']


**findall() and Groups**

The parentheses ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesized groups, then instead of returning a list of strings, findall() returns a list of *tuples*. Each tuple represents one match of the pattern, and inside the tuple are the group(1), group(2), ... values.

In [None]:
import re
text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# Names chosen to avoid shadowing the built-ins str and tuple
pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
print(pairs)

for user, host in pairs:
    print(user)
    print(host)

[('alice', 'google.com'), ('bob', 'abc.com')]
alice
google.com
bob
abc.com


**Writing RegEx Patterns**

To declare a string variable we use single or double quotes. To declare a RegEx pattern we use a raw string, r'', so that backslashes reach the regex engine unchanged. The following pattern only matches apple in lowercase; to make it case-insensitive, we either rewrite the pattern or add a flag.

In [None]:
import re

regex_pattern = r'apple'
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away. '
matches = re.findall(regex_pattern, txt)
print(matches)  # ['apple']

# To make it case-insensitive, add the re.I flag
matches = re.findall(regex_pattern, txt, re.I)
print(matches)  # ['Apple', 'apple']
# or we can use a set of characters method
regex_pattern = r'[Aa]pple'  # this means the first letter can be A or a
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']


['apple']
['Apple', 'apple']
['Apple', 'apple']


**The split() Function**

The split() function returns a list where the string has been split at each match:

**Example**
Split at each white-space character:

In [None]:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt)
print(x)

['The', 'rain', 'in', 'Spain']


You can control the number of splits by specifying the maxsplit parameter:

**Example**
Split the string only at the first two occurrences:

In [None]:
import re

txt = "The rain in Spain"
x = re.split(r"\s", txt, maxsplit=2)
print(x)

['The', 'rain', 'in Spain']
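
One detail worth noticing is how the pattern handles runs of whitespace. The sketch below uses a made-up variant of the sentence with a double space (an assumption for illustration) to compare `\s` with `\s+`.

```python
import re

# Hypothetical variant of the sentence with two spaces after "The".
txt = "The  rain in Spain"

single = re.split(r"\s", txt)                # splits at every whitespace character
runs = re.split(r"\s+", txt)                 # treats a run of whitespace as one separator
limited = re.split(r"\s+", txt, maxsplit=1)  # at most one split

print(single)   # ['The', '', 'rain', 'in', 'Spain']
print(runs)     # ['The', 'rain', 'in', 'Spain']
print(limited)  # ['The', 'rain in Spain']
```

Splitting on a single whitespace character leaves an empty string between the two adjacent spaces, which `\s+` avoids.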


**The sub() Function**

The sub() function replaces the matches with the text of your choice:

**Example**
Replace every white-space character with the number 9:

In [None]:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt)
print(x)

The9rain9in9Spain


You can control the number of replacements by specifying the count parameter:

**Example**
Replace the first 2 occurrences:


In [None]:
import re

txt = "The rain in Spain"
x = re.sub(r"\s", "9", txt, count=2)
print(x)

The9rain9in Spain


**Exercise:**

Suppose you are creating software for customer support and you need to extract the order number from a text. Write code using a regular expression.

In [None]:
import re

chat1='codebasics: Hello, I am having an issue with my order # 412889912'
chat2='codebasics: I have a problem with my order number 412889912'
chat3='codebasics: My order 412889912 is having an issue, I was charged 300$ when online it says 280$'
pattern = r'order[^\d]*(\d*)'
matches1 = re.findall(pattern, chat1)
matches2 = re.findall(pattern, chat2)
matches3 = re.findall(pattern, chat3)
print(matches1, matches2, matches3)

['412889912'] ['412889912'] ['412889912']


Write a regex to retrieve the email id and phone number

In [None]:
chat1 = 'codebasics: you ask lot of questions 😠  1235678912, abc@xyz.com'
chat2 = 'codebasics: here it is: (123)-567-8912, abc@xyz.com'
chat3 = 'codebasics: yes, phone: 1235678912 email: abc@xyz.com'
pattern1 = r'[a-zA-Z0-9_]*@[a-z]*\.[a-zA-Z0-9]*'
matches1 = re.findall(pattern1, chat1)
matches2 = re.findall(pattern1, chat2)
matches3 = re.findall(pattern1, chat3)
print(matches1, matches2, matches3)

pattern2 = r'(\d{10})|(\(\d{3}\)-\d{3}-\d{4})'

matches1 = re.findall(pattern2, chat1)
matches2 = re.findall(pattern2, chat2)
matches3 = re.findall(pattern2, chat3)
print(matches1, matches2, matches3)

['abc@xyz.com'] ['abc@xyz.com'] ['abc@xyz.com']
[('1235678912', '')] [('', '(123)-567-8912')] [('1235678912', '')]
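
The phone results above come back as tuples with empty strings because the pattern contains two capturing groups, and findall() returns one slot per group. If a flat list of numbers is preferred, one option is to drop the capturing parentheses or use non-capturing groups `(?:...)`. This is a sketch under that assumption; the `chats` list is a simplified, hypothetical version of the messages above.

```python
import re

# Simplified, hypothetical versions of the chat messages above.
chats = [
    'codebasics: my number is 1235678912, abc@xyz.com',
    'codebasics: here it is: (123)-567-8912, abc@xyz.com',
]

# No capturing groups: findall() returns the whole match as a string.
plain_pattern = r'\d{10}|\(\d{3}\)-\d{3}-\d{4}'
# Non-capturing groups behave the same way while keeping the grouping visible.
noncapturing_pattern = r'(?:\d{10})|(?:\(\d{3}\)-\d{3}-\d{4})'

phones = [re.findall(plain_pattern, chat) for chat in chats]
phones_nc = [re.findall(noncapturing_pattern, chat) for chat in chats]

print(phones)     # [['1235678912'], ['(123)-567-8912']]
print(phones_nc)  # [['1235678912'], ['(123)-567-8912']]
```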


**Regex for Information Extraction**  **Example**

In [None]:
text='''
Born	Elon Reeve Musk
June 28, 1971 (age 50)
Pretoria, Transvaal, South Africa
Citizenship
South Africa (1971–present)
Canada (1971–present)
United States (2002–present)
Education	University of Pennsylvania (BS, BA)
Title
Founder, CEO and Chief Engineer of SpaceX
CEO and product architect of Tesla, Inc.
Founder of The Boring Company and X.com (now part of PayPal)
Co-founder of Neuralink, OpenAI, and Zip2
Spouse(s)
Justine Wilson
​
​(m. 2000; div. 2008)​
Talulah Riley
​
​(m. 2010; div. 2012)​
​
​(m. 2013; div. 2016)
'''
print(re.findall(r'age (\d+)', text))
print(re.findall(r'Born(.*)\n', text))
print(re.findall(r'Born.*\n(.*)\(age', text))
print(re.findall(r'\(age.*\n(.*)', text))

['50']
['\tElon Reeve Musk']
['June 28, 1971 ']
['Pretoria, Transvaal, South Africa']
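
When only the first occurrence of each field is needed, the same patterns can be wrapped in a small helper that also strips stray whitespace such as the leading tab. This is a sketch, not part of the original notebook: the helper name `first_group` and the shortened `bio` string are assumptions for illustration.

```python
import re

def first_group(pattern, text):
    """Return group 1 of the first match, stripped of whitespace, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None

# Shortened, hypothetical excerpt of the text used above.
bio = "Born\tElon Reeve Musk\nJune 28, 1971 (age 50)\nPretoria, Transvaal, South Africa\n"

age = first_group(r'age (\d+)', bio)
name = first_group(r'Born(.*)\n', bio)
birth_date = first_group(r'Born.*\n(.*)\(age', bio)
birth_place = first_group(r'\(age.*\n(.*)', bio)
missing = first_group(r'Died(.*)', bio)

print(age)          # 50
print(name)         # Elon Reeve Musk
print(birth_date)   # June 28, 1971
print(birth_place)  # Pretoria, Transvaal, South Africa
print(missing)      # None
```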
