# Regular Expressions

In [None]:
import re

A regular expression is a sequence of characters that defines a search pattern.

Where are they used?
Regular expressions have a wide range of use cases:
* Text preprocessing, more specifically string preprocessing.
* Extracting text based on a pattern, or finding and replacing words that fulfill a pattern match. For example, locating all words that start with 'a' and replacing them with the word cat.
* Password pattern matching
* Data validation

These are some that I have applied myself, and you can as well.

This might become boring very quickly, so we will keep working through examples as we learn about regular expressions.

Regular expression pattern matching has two parts to it:
1. Right pattern : It's about figuring out what we really want to match.
2. Right ```re``` function : It's related to position, i.e. whether we want to match the pattern at the beginning or anywhere in the string. If we want to split or substitute, we have to change our ```re``` function as well.

**Problem : You are given two texts and you have to find whether they start with the word corona or not.**

In [2]:
text1 = "corona epidemic has taken world by storm."
text2 = "offices shut down due to covid-19."

So now we have to figure out two things: what pattern to use and what function to use.
1. We have to match the word corona only, so corona is our pattern string.
2. For the function we will select ```re.match``` because we need to find whether the string **starts** with the pattern corona or not.
3. ```re.match``` takes 3 inputs -> the pattern to match, the text to be searched, and flags. We will talk about flags later.

In [3]:
pattern = r'corona'
result_text1 = re.match(pattern,text1)
print(result_text1)

<re.Match object; span=(0, 6), match='corona'>


In [4]:
result_text2 = re.match(pattern,text2)
print(result_text2)

None


* If the pattern matches the string, a match object is returned; otherwise ```None``` is returned.
* In case of a match, as in ```result_text1```, the returned match object exposes three useful members:
    * ```.span()``` : returns a tuple containing the start and end positions of the match.
    * ```.string``` : returns the string passed into the function.
    * ```.group()``` : returns the part of the string where there was a match.

In [5]:
print("Span of result_text1: ", result_text1.span())
print("String passed to result_text1: ", result_text1.string)
print("Groups in result_text1: ", result_text1.group())

Span of result_text1:  (0, 6)
String passed to result_text1:  corona epidemic has taken world by storm.
Groups in result_text1:  corona
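Since ```re.match``` returns ```None``` when nothing matches, calling ```.group()``` or ```.span()``` directly on the result of a failed match raises an ```AttributeError```. A minimal sketch of guarding against that, assuming a hypothetical helper named ```first_word_match``` that is not part of the original notebook:

```python
import re

def first_word_match(pattern, text):
    """Return the matched text if `text` starts with `pattern`, else None.

    Hypothetical helper, for illustration only.
    """
    result = re.match(pattern, text)
    if result is None:       # no match at the start of the string
        return None
    return result.group()    # safe: result is a Match object here

print(first_word_match(r'corona', "corona epidemic has taken world by storm."))
print(first_word_match(r'corona', "offices shut down due to covid-19."))
```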


## Issue with ```re.match```:
* It's an excellent function, but it matches only at the beginning of the string.

In case we want a match anywhere in the string we use ```re.search```. It works the same as ```re.match``` but allows us to match anywhere in the string. It returns a **Match object**, or **None** if no match is found.

In [6]:
text1 = "corona epidemic has taken world by storm."
text2 = "offices shut down due to covid-19."
text3 = "covid-19 is another name for coronavirus"
pattern = r'corona'

In [7]:
result_text1 = re.search(pattern,text1)
print(result_text1)

<re.Match object; span=(0, 6), match='corona'>


In [8]:
result_text2 = re.search(pattern,text2)
print(result_text2)

None


In [9]:
result_text3 = re.search(pattern,text3)
print(result_text3)

<re.Match object; span=(29, 35), match='corona'>


## Limitations with ```re.match``` and ```re.search```
```re.match``` can match only at the beginning, and ```re.search``` can match anywhere but returns only the first match. The reason is that ```re.search``` was designed to find whether a pattern exists in the string or not. In case we want all of the matches to be returned we use ```re.findall```. In fact, ```re.findall``` is a reasonable default for pattern matching, since with the right anchors and conditions in the pattern it can cover much of what ```re.match``` & ```re.search``` do (although it returns a list of strings rather than a Match object).
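As a rough sketch of that last point: anchoring the pattern with ```^``` (covered in Level - II) makes ```re.findall``` look only at the start of the string, and ```re.finditer``` (a standard ```re``` function not otherwise used in this notebook) yields Match objects when the positions of all matches are needed. The variable names below are illustrative.

```python
import re

text3 = "covid-19 is another name for coronavirus"
text4 = "coronavirus patients can spread coronavirus to others as well."

# With ^ the pattern can only match at the very start, similar to re.match
at_start_text3 = re.findall(r'^corona', text3)
at_start_text4 = re.findall(r'^corona', text4)
print(at_start_text3)
print(at_start_text4)

# re.finditer yields one Match object per occurrence, so spans are available
spans = [m.span() for m in re.finditer(r'corona', text4)]
print(spans)
```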

In [10]:
text4 = "coronavirus patients can spread coronavirus to others as well."
result_text4 = re.search(pattern,text4)
print(result_text4)
#Only first instance is matched

<re.Match object; span=(0, 6), match='corona'>


```re.findall``` matches all occurrences.

In [11]:
result_text4_findall = re.findall(pattern,text4)
print(result_text4_findall)
#Finds all occurrences

['corona', 'corona']


## What if we want to replace the places where the pattern matches with a string of our choice?

Let's say we want to replace all occurrences of **coronavirus** with COVID-19. For this purpose we use ```re.sub```.
```re.sub(pattern,replacement,text,count)``` : pattern is what we are trying to match, replacement is what to replace each match with, and text is what we are searching in. Set **count** to limit the number of substitutions. By default it substitutes all occurrences.

In [12]:
text = "coronavirus is causing international shutdowns. Neil Ferguson's report stated that coronavirus matches SARS."
pattern = r'coronavirus'
new_text = re.sub(pattern,'COVID-19',text)
print(text)
print(new_text)

coronavirus is causing international shutdowns. Neil Ferguson's report stated that coronavirus matches SARS.
COVID-19 is causing international shutdowns. Neil Ferguson's report stated that COVID-19 matches SARS.


In [13]:
text = "coronavirus is causing international shutdowns. Neil Ferguson's report stated that coronavirus matches SARS."
pattern = r'coronavirus'
new_text = re.sub(pattern,'COVID-19',text,count=1)
print(text)
print(new_text)
"""Replaces only 1 match"""

coronavirus is causing international shutdowns. Neil Ferguson's report stated that coronavirus matches SARS.
COVID-19 is causing international shutdowns. Neil Ferguson's report stated that coronavirus matches SARS.


'Replaces only 1 match'
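The example from the introduction (replace every word that starts with 'a' with the word cat) can be sketched with ```re.sub``` too. This assumes two pieces of syntax not covered so far: ```\b``` for a word boundary and ```\w*``` for the remaining word characters. The sample sentence is made up for illustration.

```python
import re

sentence = "an apple a day keeps all doctors away"
# \b  : word boundary, so 'a' must be the first letter of a word
# \w* : zero or more word characters following that 'a'
pattern = r'\ba\w*'
new_sentence = re.sub(pattern, 'cat', sentence)
print(new_sentence)
```

The match is case-sensitive, so a word starting with a capital 'A' would be left untouched by this pattern.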

## Splitting Text where pattern matches

```re.split(pattern,string,maxsplit,flags)```

pattern and string are the usual parameters. maxsplit is the maximum number of splits we want to make; by default all possible splits are made. flags will be discussed later.

In [14]:
text = 'COVID-19 is coronavirus, as of 24 March coronavirus has causes more than 400,000 cases.'
pattern = r'corona'
new_strings = re.split(pattern,text)
print(new_strings)

['COVID-19 is ', 'virus, as of 24 March ', 'virus has causes more than 400,000 cases.']


**The string is split everywhere the pattern matches. Since the above string has two occurrences of corona, it is split at two places, resulting in three parts. A list of the parts is returned.**
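To illustrate ```maxsplit```, here is a small sketch with a made-up string; with ```maxsplit=1``` only the first match is used as a split point and the remainder is returned unsplit.

```python
import re

text = 'wash hands, wear masks, stay home'
pattern = r', '

all_parts = re.split(pattern, text)               # split at every match
first_only = re.split(pattern, text, maxsplit=1)  # split at the first match only
print(all_parts)
print(first_only)
```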

## Level - II 

### Ordinary & Meta Characters in Regular Expression

* Ordinary Characters : ```a```, ```b```, ```corona```, ```123``` are made of ordinary characters with no special meaning.
* Special Characters or Meta Characters : ```. ^ $ * + ? | { } \ [ ] ( ) ``` : They carry special meanings.

### ```.``` : Match Any Character Except New Line. 

In [15]:
#1.1
text = '2020 is in lockdown.'
pattern = r'.'
matches = re.findall(pattern,text)
print(matches)

['2', '0', '2', '0', ' ', 'i', 's', ' ', 'i', 'n', ' ', 'l', 'o', 'c', 'k', 'd', 'o', 'w', 'n', '.']


You can see that ```.``` doesn't match only the full stop at the end of the text, because it's a special character that matches anything except the newline character.

In [16]:
#1.2
text = "2020 is in lockdown\nDon't hoard"
pattern = r'.'
matches = re.findall(pattern,text)
print(matches) #See it doesn't match new line character

['2', '0', '2', '0', ' ', 'i', 's', ' ', 'i', 'n', ' ', 'l', 'o', 'c', 'k', 'd', 'o', 'w', 'n', 'D', 'o', 'n', "'", 't', ' ', 'h', 'o', 'a', 'r', 'd']
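If the newline should be matched as well, the ```re.DOTALL``` flag changes the behaviour of ```.```. This is one of the flags that the ```re``` functions accept; the sketch below reuses the text from the cell above.

```python
import re

text = "2020 is in lockdown\nDon't hoard"
without_flag = re.findall(r'.', text)
with_dotall = re.findall(r'.', text, flags=re.DOTALL)  # . now matches '\n' too

print('\n' in without_flag)
print('\n' in with_dotall)
print(len(with_dotall) - len(without_flag))
```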


### ```*``` : Causes RE to match 0 or more repetitions of preceding RE.
### ```+``` : Causes RE to match 1 or more repetitions of preceding RE.
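
A minimal sketch of the difference between the two, using a made-up string:

```python
import re

text = 'a ab abb'
star_matches = re.findall(r'ab*', text)  # 'a' followed by zero or more 'b'
plus_matches = re.findall(r'ab+', text)  # 'a' followed by one or more 'b'
print(star_matches)
print(plus_matches)
```

The lone ```a``` is matched by ```ab*``` but not by ```ab+```, because ```+``` requires at least one ```b```.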

**Problem : Let's say we published a report and, due to some intern's mistakes (let's blame it on the intern for now), the spelling of corona was published incorrectly at multiple places.**

We have to find them now. His mistakes were very specific: he wrote wrong spellings but maintained the same characters and their relative positions.

Meaning ```Coronna,Cooroonna,Corrona``` are the mistakes he made, since these mistakes and ```Corona``` have the characters ```C,o,r,o,n,a``` in the same relative order (it doesn't matter how many occurrences they have).

So, ```Coryna,Carano,Karano,Coroana``` cannot be his mistakes, since they either don't have the same characters as ```Corona``` or don't keep the same relative positions (Coroana has a before n).

Find out how many mistakes he made.


In [17]:
text = 'Corona cases grew rapidly in Wuhan. Coronna spread to other countries as well. Cooroonna has wrecked havoc in Italy. \
Corrona made India lockdown their whole country for 21 days. Coryna is spreading in USA as well. Carano is also in France now \
Korona is spreading in Korean and Japan as well, Coroana has no vaccine right now.'

In [18]:
#Let's form the pattern we are looking for.
pattern = r'Corona' #is what we are looking for in the end as per the relative positions.
#But we allow multiple occurrences of each character as long as the relative order is kept, so we need to tweak the pattern.

pattern = r'C+o+r+o+n+a+' #Multiple occurrences of the letters C o r o n a are allowed, but not if they break
#the relative positions. This translates into at least 1 occurrence of C, then at least 1 occurrence of o, ... and so on up to a.
matches = re.findall(pattern,text)
print(matches)

['Corona', 'Coronna', 'Cooroonna', 'Corrona']
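Note that the correct spelling ```Corona``` also satisfies this pattern, so it shows up in the matches. A small sketch of answering the original question (how many mistakes were made), assuming we simply filter out the correct spelling; the shortened text below stands in for the report:

```python
import re

text = 'Corona Coronna Cooroonna Corrona Coryna Carano Korona Coroana'
pattern = r'C+o+r+o+n+a+'
matches = re.findall(pattern, text)

# Drop the correctly spelled word so only the misspellings remain
mistakes = [word for word in matches if word != 'Corona']
print(mistakes)
print(len(mistakes))
```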


In [19]:
### Find everything from the first occurrence of the word hard to the end of the text.

In [20]:
text = "Work hard play hard. This is a phrase that's used a lot by people who love instagram"
pattern = r'hard.*' #hard needs to be found, so we add it to the pattern; .* means everything after it (except a newline character)
matches = re.findall(pattern,text)
print(matches)

["hard play hard. This is a phrase that's used a lot by people who love instagram"]


### ```^``` : Matches the start of string
### ```$``` : Matches the end of the string

Check whether the text down below starts with the word ```Even```.

In [21]:
text = "Tom Hanks got corona. Times are rough"
pattern = r'^Even'
matches = re.findall(pattern,text)
print(matches) #No match empty list

[]


Check whether the given text ends with rough or not.

In [22]:
text = "Even Tom Hanks got corona. Times are rough"
pattern = r'rough$'
matches = re.findall(pattern,text)
print(matches) #returns rough hence it ends with rough

['rough']


We want to return the last sentence if the text ends with the word rough. So, for that sentence, we will pick out everything from the last full stop up to the word ```rough```.

In [23]:
text = "Even Tom Hanks got corona. Times are rough"
pattern = r'.*rough$'
matches = re.findall(pattern,text)
print(matches)

['Even Tom Hanks got corona. Times are rough']


This has an issue: it returns everything before rough, because ```.``` is a special character and does not behave as a full stop. If we want it to behave like a full stop, we have to put a backslash before it. A backslash nullifies the behaviour of special characters.

In [24]:
text = "Even Tom Hanks got corona. Times are rough"
pattern = r'\..*rough$' #\. means a full stop (.) without special behaviour, and the .* after it means all characters
#rough$ means the match must end with rough at the end of the string.
#Translation : Start at a . and continue through all characters until the end of the string, which must end with rough.
matches = re.findall(pattern,text)
print(matches)

['. Times are rough']
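One caveat: ```re.findall``` reports the leftmost match, so with more than one full stop in the text this pattern starts at the first full stop rather than the last. A possible sketch for keeping only the final sentence, assuming a negated character class ```[^.]``` (any character except a full stop), which has not been introduced yet; the three-sentence text is made up:

```python
import re

text = "Even Tom Hanks got corona. Stay home. Times are rough"

from_first_stop = re.findall(r'\..*rough$', text)
# [^.]* cannot cross a full stop, so the match is confined to the last sentence
last_sentence = re.findall(r'[^.]*rough$', text)
print(from_first_stop)
print(last_sentence)
```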


The following string doesn't match the pattern because it doesn't end with ```rough```; it ends with ```rougher```.

In [25]:
text = "Even Tom Hanks got corona. Times are rougher"
pattern = r'\..*rough$'
matches = re.findall(pattern,text)
print(matches)

[]


Let's say we work for a company and we have to check whether passwords meet the criteria or not.

Criteria:
* Password must start with "comp".
* Password must have #$ somewhere in the string after comp, and the password must not end with it.
* Password must end with at least 1 occurrence of b.

In [26]:
password1 = 'comp#$bbb'
password2 = 'comp#$aaabb'
password3 = 'compaaa#$aaabbb'

In [27]:
pattern = r'^comp' #begins with comp
pattern = r'^comp#\$' #has #$ in it; \ is needed since $ is a special character
pattern = r'^comp#\$b+' #has at least one b, that's why we are using +. If zero b's were acceptable we would have used *

In [28]:
matches_password1 = re.match(pattern,password1)
print(matches_password1) #This one matches

<re.Match object; span=(0, 9), match='comp#$bbb'>


In [29]:
matches_password2 = re.match(pattern,password2)
print(matches_password2)

None


In [30]:
matches_password3 = re.match(pattern,password3)
print(matches_password3)

None


```password2``` and ```password3``` don't match the pattern because we didn't account for the fact that anything can come before and after #$ in the string.

In [31]:
pattern = r'^comp#\$.*b+'
matches_password1 = re.match(pattern,password1)
print(matches_password1)

<re.Match object; span=(0, 9), match='comp#$bbb'>


In [32]:
pattern = r'^comp#\$.*b+'
matches_password2 = re.match(pattern,password2)
print(matches_password2)

<re.Match object; span=(0, 11), match='comp#$aaabb'>


In [33]:
pattern = r'^comp.*#\$.*b+'
matches_password3 = re.match(pattern,password3)
print(matches_password3)

<re.Match object; span=(0, 15), match='compaaa#$aaabbb'>
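Keep in mind that ```re.match``` only anchors the start of the string, so the patterns above accept any password that merely begins with a valid prefix. If the "must end with b" rule should be enforced strictly, one option is to add ```$``` at the end. A sketch, where ```comp#$bbx``` is a made-up invalid password:

```python
import re

loose = r'^comp.*#\$.*b+'
strict = r'^comp.*#\$.*b+$'   # $ requires the run of b to reach the end of the string

bad_password = 'comp#$bbx'    # ends with x, so it should be rejected
print(re.match(loose, bad_password))
print(re.match(strict, bad_password))

passwords = ['comp#$bbb', 'comp#$aaabb', 'compaaa#$aaabbb']
strict_results = [re.match(strict, p) is not None for p in passwords]
print(strict_results)
```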


### ```{ }``` : Matches exactly the specified number of occurrences

For the above problem let's say we have new criteria:
* Password must start with "comp".
* Password must have #$ somewhere in the string after comp, and the password must not end with it.
* Password must end with 3 to 5 occurrences of b, both inclusive.

In [34]:
password1 = 'comp#$bbb'
password2 = 'comp#$aaabb'
password3 = 'compaaa#$aaabbb'
password4 = 'comp12#$aaabbbbb'

In [35]:
pattern = r'^comp.*#\$.*b{3,5}' #previously r'^comp.*#\$.*b+'; the + is replaced with {3,5}

In [36]:
matches_password1 = re.match(pattern,password1)
print(matches_password1)

<re.Match object; span=(0, 9), match='comp#$bbb'>


In [37]:
matches_password2 = re.match(pattern,password2)
print(matches_password2) # doesn't match because it has only 2 b in the end

None


In [38]:
matches_password3 = re.match(pattern,password3)
print(matches_password3) 

<re.Match object; span=(0, 15), match='compaaa#$aaabbb'>


In [39]:
matches_password4 = re.match(pattern,password4)
print(matches_password4) 

<re.Match object; span=(0, 16), match='comp12#$aaabbbbb'>


In case we wanted exactly 5 repeated occurrences of b: ```pattern = r'^comp.*#\$.*b{5}'```
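
A subtlety worth knowing: because the ```.*``` before the ```{ }``` quantifier can itself absorb ```b``` characters, and because nothing anchors the end of the string, these patterns also accept passwords with more trailing b's than the upper limit. One way to enforce the limit, sketched under the assumption that a negative lookbehind ```(?<!b)``` (not covered in this notebook) is acceptable, is shown below; ```comp#$aaabbbbbbb``` is a made-up password with 7 b's.

```python
import re

loose = r'^comp.*#\$.*b{3,5}'
# (?<!b) : the run of b must not be preceded by another b
# $      : the run of b must reach the end of the string
strict = r'^comp.*#\$.*(?<!b)b{3,5}$'

too_many = 'comp#$aaabbbbbbb'  # 7 trailing b's
print(re.match(loose, too_many) is not None)
print(re.match(strict, too_many) is not None)

passwords = ['comp#$bbb', 'comp#$aaabb', 'compaaa#$aaabbb', 'comp12#$aaabbbbb']
strict_results = [re.match(strict, p) is not None for p in passwords]
print(strict_results)
```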

### ```|``` : A|B either expression A or expression B
Let's say we have to select the whole text if it starts with ```corona``` or ```covid-19```; otherwise we don't select it.

In [40]:
text1 = 'coronavirus came from the bats.'
text2 = 'covid-19 and sars both came from bats.'
text3 = 'ebola also came from bats.'

In [41]:
pattern = r'corona.*|covid-19.*'

In [42]:
match_text1 = re.match(pattern,text1)
print(match_text1)

<re.Match object; span=(0, 31), match='coronavirus came from the bats.'>


In [43]:
match_text2 = re.match(pattern,text2)
print(match_text2)

<re.Match object; span=(0, 38), match='covid-19 and sars both came from bats.'>


In [44]:
match_text3 = re.match(pattern,text3)
print(match_text3)

None
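
The two alternatives share the same ```.*``` tail, so the pattern can also be written with the alternation inside a group. A sketch, assuming the non-capturing group syntax ```(?: )``` (plain parentheses ```( )``` would work here as well with ```re.match```):

```python
import re

pattern = r'(?:corona|covid-19).*'  # either prefix, then the rest of the line

texts = [
    'coronavirus came from the bats.',
    'covid-19 and sars both came from bats.',
    'ebola also came from bats.',
]
# Keep only the texts that start with one of the two prefixes
selected = [t for t in texts if re.match(pattern, t)]
print(selected)
```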
