## A regular expression (RegEx) is a special text string that describes a pattern to look for in data.
## A RegEx can be used to check whether a pattern exists in a piece of text.
## To use RegEx in Python, we first import the built-in RegEx module, which is called re.
### re.I is the ignore-case flag: with it, a pattern matches letters regardless of whether they are uppercase or lowercase.

In [1]:
# Matching at the Start of a String Using match


import re

In [2]:
string = "I love my Mother and Father Very Very Much"

In [8]:
match = re.match("I love my",string,re.I)         # returns a Match object with span and match
print(match)                                      # prints None if the string does not start with the pattern
span = match.span()
print(span)
start,end = span
print(start,end)
print(string[start:end])

<re.Match object; span=(0, 9), match='I love my'>
(0, 9)
0 9
I love my
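Because `re.match` returns None when the string does not start with the pattern, calling `.span()` on the result can raise an AttributeError. The sketch below shows one way to guard against that. The helper name `describe_match` is hypothetical and not part of the original notebook.

```python
import re

string = "I love my Mother and Father Very Very Much"

def describe_match(pattern, text):
    """Return the matched prefix, or None when text does not start with pattern."""
    m = re.match(pattern, text, re.I)
    if m is None:              # re.match found nothing at position 0
        return None
    start, end = m.span()
    return text[start:end]

print(describe_match("i LOVE my", string))   # I love my
print(describe_match("my Mother", string))   # None
```

The second call returns None because "my Mother" occurs in the string but not at its beginning.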


In [9]:
# Searching for the First Match Using search

import re
                                                                                        
txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

In [10]:
search = re.search("first",txt,re.I)
print(search)

<re.Match object; span=(100, 105), match='first'>


In [11]:
span = search.span()
print(span)

(100, 105)


In [12]:
start,end = span
print(start,end)

100 105


In [14]:
print(txt[start:end])

first


### As you can see, search is more flexible than match because it looks for the pattern throughout the text, not only at the beginning.
### search returns a match object for the first match it finds; otherwise it returns None.
### To get every match, use the re function findall.
### This function checks for the pattern through the whole string and returns all the matches as a list.
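To make the difference concrete, here is a small sketch that runs match, search and findall with the same pattern on the same text. The variable names `at_start`, `first` and `every` are only illustrative.

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

at_start = re.match("language", txt, re.I)    # only looks at the beginning of txt
first = re.search("language", txt, re.I)      # first occurrence anywhere in txt
every = re.findall("language", txt, re.I)     # all occurrences, as a list of strings

print(at_start)        # None
print(first.span())    # (29, 37)
print(every)           # ['language', 'language']
```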

In [15]:
# Searching for All Matches Using findall
# findall() returns all the matches as a list

import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

match = re.findall("language",txt,re.I)
print(match)

['language', 'language']


In [18]:
match2 = re.findall("python",txt,re.I)
print(match2)

['Python', 'python']


### Since we are using re.I, both lowercase and uppercase matches are included. Without the re.I flag,
### we have to write our pattern differently. Let us check it out:

In [19]:
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

match = re.findall("Python|python",txt)
print(match)

['Python', 'python']


In [21]:
match = re.findall("[Pp]ython",txt)
print(match)

['Python', 'python']


In [23]:
# Replacing a Substring

import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

match_replace = re.sub("python","SQL",txt,flags=re.I)          # replacing python with SQL, ignoring case (flags must be passed by keyword)
print(match_replace)

SQL is the most beautiful language that a human being has ever created.
I recommend SQL for a first programming language


In [26]:
match_repl = re.sub("[Pp]ython","Javascript",txt)              # replacing python with Javascript; [Pp] already covers both cases
print(match_repl)

Javascript is the most beautiful language that a human being has ever created.
I recommend Javascript for a first programming language
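One detail worth keeping in mind: the fourth positional parameter of `re.sub` is `count`, not `flags`, so it is safer to pass both by keyword. A minimal sketch, with illustrative variable names:

```python
import re

txt = '''Python is the most beautiful language that a human being has ever created.
I recommend python for a first programming language'''

# flags passed by keyword: every occurrence is replaced, ignoring case
all_replaced = re.sub("python", "SQL", txt, flags=re.I)

# count=1 limits the replacement to the first occurrence only
first_only = re.sub("python", "SQL", txt, count=1, flags=re.I)

print(all_replaced)
print(first_only)
```

In `first_only` the lowercase "python" on the second line is left untouched.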


In [27]:
# Let us add one more example. The following string is really hard to read unless we remove the % symbol. 
# Replacing the % with an empty string will clean the text.

import re

txt = '''%I a%m te%%a%%che%r% a%n%d %% I l%o%ve te%ach%ing. 
T%he%re i%s n%o%th%ing as r%ewarding a%s e%duc%at%i%ng a%n%d e%m%p%ow%er%ing p%e%o%ple.
I fo%und te%a%ching m%ore i%n%t%er%%es%ting t%h%an any other %jobs. 
D%o%es thi%s m%ot%iv%a%te %y%o%u to b%e a t%e%a%cher?'''

In [33]:
match_replace = re.sub("%","",txt)               # replaces every % symbol with an empty string, leaving clean text
print(match_replace)

I am teacher and  I love teaching. 
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs. 
Does this motivate you to be a teacher?


In [36]:
# Splitting Text Using RegEx Split

import re                                                       #  splitting on \n - the end-of-line symbol
                                                                #  it splits the string wherever a new line starts
txt = '''I am teacher and  I love teaching.
There is nothing as rewarding as educating and empowering people.
I found teaching more interesting than any other jobs.
Does this motivate you to be a teacher?'''

spil = re.split("\n",txt)
print(spil)

['I am teacher and  I love teaching.', 'There is nothing as rewarding as educating and empowering people.', 'I found teaching more interesting than any other jobs.', 'Does this motivate you to be a teacher?']
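Splitting on a fixed character such as \n could also be done with plain string methods. `re.split` becomes more useful when the separator is itself a pattern. The sketch below uses a made-up input string (`raw`) to split on a comma or semicolon followed by optional spaces.

```python
import re

raw = 'apple, banana;cherry,  mango'      # hypothetical input with mixed separators

# [,;] matches a comma or a semicolon; \s* swallows any spaces that follow
items = re.split(r'[,;]\s*', raw)
print(items)   # ['apple', 'banana', 'cherry', 'mango']
```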


## Writing RegEx Patterns
## To declare a string variable we use single or double quotes. To declare a RegEx pattern we use a raw string, r'', so that backslashes reach the RegEx engine unchanged.
## The following pattern matches only apple in lowercase; to make it case insensitive, we should either rewrite our pattern or add a flag.
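As a quick illustration of why the r prefix matters, the sketch below compares a normal string with a raw string. In a raw string, Python leaves the backslash alone, so the RegEx engine receives it as written.

```python
import re

normal = '\n'      # Python turns this into one newline character
raw = r'\n'        # raw string: a backslash followed by the letter n

print(len(normal))       # 1
print(len(raw))          # 2
print(r'\d' == '\\d')    # True: both spell backslash + d

digits = re.findall(r'\d', 'a1b2')
print(digits)            # ['1', '2']
```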

In [41]:
import re 

pattern = r'apple'        # raw-string pattern; on its own it matches only lowercase 'apple'
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(pattern,txt)
print(matches)

# To make it case insensitive, add the re.I flag

match = re.findall(pattern,txt,re.I)
print(match)


# or we can use a set of characters method

match = re.findall('[Aa]pple',txt)
print(match)

['apple']
['Apple', 'apple']
['Apple', 'apple']


In [50]:
r"""
[]: A set of characters
[a-c] means a or b or c
[a-z] means any lowercase letter from a to z
[A-Z] means any uppercase letter from A to Z
[0-3] means 0 or 1 or 2 or 3
[0-9] means any digit from 0 to 9
[A-Za-z0-9] means any single character that is a to z, A to Z or 0 to 9

\: used to escape special characters
\d means: match any single digit (0-9)
\D means: match any single character that is not a digit

. : any character except the newline character (\n)
^: starts with
r'^substring' eg r'^love', a sentence that starts with the word love
r'[^abc]' means not a, not b, not c.
$: ends with
r'substring$' eg r'love$', a sentence that ends with the word love
*: zero or more times
r'[a]*' means a is optional or it can occur many times.
+: one or more times
r'[a]+' means at least once (or more)
?: zero or one time
r'[a]?' means zero times or once
{3}: Exactly 3 repetitions of the preceding element
{3,}: At least 3 repetitions
{3,8}: 3 to 8 repetitions
|: Either or
r'apple|banana' means either apple or banana
(): Capture and group

"""

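The list above mentions () for capturing and grouping. Here is a short sketch of what that looks like with findall: when a pattern contains several groups, findall returns one tuple per match. The pattern is only one possible way to pick out the dates in this sentence.

```python
import re

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'

# group 1: month name, group 2: day (1 or 2 digits), group 3: four-digit year
date_pattern = r'([A-Z][a-z]+) (\d{1,2}),\s+(\d{4})'
dates = re.findall(date_pattern, txt)
print(dates)   # [('December', '6', '2019'), ('July', '8', '2021')]
```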
In [47]:
# Square Bracket

regex_pattern = r'[Aa]pple' # the square bracket means either A or a
txt = 'Apple and banana are fruits. An old cliche says an apple a day a doctor way has been replaced by a banana a day keeps the doctor far far away.'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['Apple', 'apple']


['Apple', 'apple']


In [48]:
pattern = r'[Aa]pple|[Bb]anana' # either A or a in apple, or either B or b in banana; no spaces around |, because a space is part of the pattern
match = re.findall(pattern,txt)
print(match)

# Using the square bracket and the or operator, we manage to extract Apple, apple, Banana and banana.

['Apple', 'banana', 'apple', 'banana']


In [53]:
# Escape character(\) in RegEx

regex_pattern = r'\d'  # \d is a special sequence that matches a single digit
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2', '0', '1', '9', '8', '2', '0', '2', '1'], this is not what we want

['6', '2', '0', '1', '9', '8', '2', '0', '2', '1']


In [52]:
# One or more times(+)

regex_pattern = r'\d+'  # \d matches a digit, + means one or more times
txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021'] - now, this is better!

['6', '2019', '8', '2021']


In [55]:
# Period(.)

regex_pattern = r'[a].'  # [a] matches the letter a, and . matches any single character except a newline
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['an', 'an', 'an', 'a ', 'ar']

regex_pattern = r'[a].+'  # . is any character, + repeats it one or more times, as far as possible
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['an', 'an', 'an', 'a ', 'ar']
['and banana are fruits']
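The period matches any character except a newline, so a pattern like `a.+` stops at the end of a line. The sketch below puts a line break into the same sentence to show this. The two-line text is a made-up variation of the example above.

```python
import re

two_lines = 'and banana\nare fruits'     # hypothetical text containing a newline

# .+ cannot cross the newline, so each line produces its own match
per_line = re.findall(r'a.+', two_lines)
print(per_line)   # ['and banana', 'are fruits']
```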


In [56]:
# Zero or more times(*)
# Zero or many times. The pattern may not occur at all or it may occur many times.

regex_pattern = r'[a].*'  # . is any character, * repeats it zero or more times
txt = '''Apple and banana are fruits'''
matches = re.findall(regex_pattern, txt)
print(matches)  # ['and banana are fruits']

['and banana are fruits']


In [57]:
# Zero or one time(?)
# Zero or one time. The pattern may not occur or it may occur once.

txt = '''I am not sure if there is a convention how to write the word e-mail.
Some people write it as email others may write it as Email or E-mail.'''
regex_pattern = r'[Ee]-?mail'  # ? means here that '-' is optional
matches = re.findall(regex_pattern, txt)
print(matches)  # ['e-mail', 'email', 'Email', 'E-mail']

['e-mail', 'email', 'Email', 'E-mail']


In [61]:
# Quantifier in RegEx
# We can specify how many times the preceding element must repeat using curly brackets.
# Let us imagine we are interested in numbers that are exactly 4 digits long:

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{4}'  # exactly four times
matches = re.findall(regex_pattern, txt)
print(matches)  # ['2019', '2021']

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'\d{1,4}'   # 1 to 4
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6', '2019', '8', '2021']

['2019', '2021']
['6', '2019', '8', '2021']
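A short sketch of the open-ended form {n,}, plus a caveat: a quantifier only counts repetitions, it does not check what comes before or after the match. The input string `numbers` is made up for illustration.

```python
import re

numbers = '7 42 365 2021 86400'      # hypothetical input

at_least_three = re.findall(r'\d{3,}', numbers)   # three or more digits in a row
print(at_least_three)   # ['365', '2021', '86400']

# \d{4} also matches the first four digits of the longer number 86400
exactly_four = re.findall(r'\d{4}', numbers)
print(exactly_four)     # ['2021', '8640']
```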


In [62]:
# Caret ^ Starts with

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'^This'  # ^ means starts with
matches = re.findall(regex_pattern, txt)
print(matches)  # ['This']

['This']
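The counterpart of ^ is $, which anchors the pattern to the end of the string. A minimal sketch on the same text, with illustrative variable names:

```python
import re

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'

ends_with_2021 = re.findall(r'2021$', txt)       # 2021 is at the very end
ends_with_2019 = re.findall(r'2019$', txt)       # 2019 occurs, but not at the end
starts_with_regular = re.findall(r'^regular', txt)   # 'regular' is not the first word

print(ends_with_2021)        # ['2021']
print(ends_with_2019)        # []
print(starts_with_regular)   # []
```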


In [63]:
# Negation

txt = 'This regular expression example was made on December 6,  2019 and revised on July 8, 2021'
regex_pattern = r'[^A-Za-z ]+'  # ^ inside a character set means negation: not A to Z, not a to z, not a space
matches = re.findall(regex_pattern, txt)
print(matches)  # ['6,', '2019', '8,', '2021']

['6,', '2019', '8,', '2021']


## Exercise

In [73]:
# What is the most frequent word in the following paragraph?

import re
paragraph = '''I love teaching. If you do not love teaching what else can you love. 
    I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.'''

match = re.findall("love",paragraph,re.I)
print(match)
print("Total occurrences of the matched word:",set(match),len(match))

['love', 'love', 'love', 'love', 'love', 'love']
Total occurrences of the matched word: {'love'} 6
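The cell above counts a word we already suspected. As one possible way to answer the exercise without guessing the word first, the sketch below extracts every word with findall and counts them with `collections.Counter`. Lowercasing the text first is an assumption about how words should be compared.

```python
import re
from collections import Counter

paragraph = '''I love teaching. If you do not love teaching what else can you love.
    I love Python if you do not love something which can give you all the capabilities to develop an application what else can you love.'''

# [a-z]+ picks out runs of letters once the text has been lowercased
words = re.findall(r'[a-z]+', paragraph.lower())
counts = Counter(words)

most_common_word, frequency = counts.most_common(1)[0]
print(most_common_word, frequency)   # love 6
```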
