<center><h1 style=font-size:40px>Regular Expressions in Python &#128013;</h1></center>

<center><img src="images/regex.jpg" style="width:1000px;height:200px;"/></center>

<center style="color:#99BB00;font-size:30px;">Give a thought !!!</center>

<p style="color:#55AA56;font-size:15px;">Can you check whether the words ```cat, fat, eat, oat, rat, bat``` occur in the string ```A fat cat doesn't eat oat but a rat eats bats```, and extract the ones that do?</p><br>
    

In [1]:
sentence = "A fat cat doesn't eat oat but a rat eats bats" 
list_of_matched_words = []

# Each check tests for its own word (a substring test, so "bat" is found inside "bats").
if "cat" in sentence:
    list_of_matched_words.append("cat")
if "fat" in sentence:
    list_of_matched_words.append("fat")
if "eat" in sentence:
    list_of_matched_words.append("eat")
if "oat" in sentence:
    list_of_matched_words.append("oat")
if "bat" in sentence:
    list_of_matched_words.append("bat")
if "rat" in sentence:
    list_of_matched_words.append("rat")

list_of_matched_words  

['cat', 'fat', 'eat', 'oat', 'bat', 'rat']
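
The six separate checks can also be folded into a loop. This is only a minimal sketch of the same idea; the name `words_to_find` is introduced here for illustration and is not part of the original cell. Note that `in` is a plain substring test, so `bat` is reported only because it occurs inside `bats`.

```python
sentence = "A fat cat doesn't eat oat but a rat eats bats"
words_to_find = ["cat", "fat", "eat", "oat", "rat", "bat"]  # illustrative name

# Keep every word that occurs anywhere in the sentence (substring test).
list_of_matched_words = [word for word in words_to_find if word in sentence]
print(list_of_matched_words)
```

Either way, each new word has to be listed by hand, which is the limitation that pattern matching addresses.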

<br>
<p style="color:#AA0000;font-size:25px;">Domain Specific Language (DSL) </p>

* A domain-specific language is created specifically to solve problems in a particular domain and is not intended to be able to solve problems outside it (although that may be technically possible)

* These languages have their own declared syntaxes and grammars, with very specific goals in design and implementation

* A DSL may have specialized features for a particular domain yet be applicable more broadly; conversely, it may in principle be capable of broad application but in practice be used primarily for a specific domain
 
Examples:

* ```HTML``` for web pages
* ```SQL``` for Querying databases
* ```Regular Expressions``` for matching text patterns
    

<p>Regular expressions are a powerful language for matching text patterns.
They are extremely useful for extracting information from text such as code, files, logs, spreadsheets, or even documents.</p>

<br>The Python standard-library module that supports ```regex``` is ```re```.


## Raw Strings in Python

The Python parser interprets ```\``` (backslash) as an escape character in string literals;
for example, ```\n``` becomes a newline character.

<p style="color:#AA0000;font-size:25px;">Activity: Can we suppress the behavior of ```\``` in a string?<br><br></p>

### Way 1: Use another backslash

In [2]:
s_escape = 'I work \\n with \\t Gemini \\t Solutions'
print(s_escape)

I work \n with \t Gemini \t Solutions


### Way 2: Just put an ```r``` before a string and make it a ```raw string```
<p style="color:#111111;font-size:15px;">When a string is raw, the Python parser does not process backslash escapes within it. Essentially, we are telling the parser to leave our string alone.</p>

In [3]:
s_raw = r'I am an \n active learner of \n python'
print(s_raw)

I am an \n active learner of \n python
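
Raw strings matter for regular expressions because the regex engine has its own backslash sequences. The sketch below is an illustration, not part of the original notebook; it uses `\b`, which is a backspace character in a normal string literal but a word boundary once it reaches `re` unchanged. The variable names are made up for this example.

```python
import re

# '\\n' (escaped) and r'\n' (raw) are the same two-character string.
same = ('\\n' == r'\n')
print(same)  # True

text = "a cat sat"

# Without the r prefix, '\b' is a backspace character, so nothing matches.
plain_result = re.search('\bcat\b', text)
print(plain_result)  # None

# With the r prefix, the regex engine sees \b and treats it as a word boundary.
raw_result = re.search(r'\bcat\b', text)
print(raw_result.group())  # cat
```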


## Performing Queries with Regex in Python

<p style="color:#111111;font-size:15px;">
1. re.match() <br> 
2. re.search() <br>
3. re.findall()  <br>

<br>
**N.B.:** Each of these functions accepts a regular expression and a string to scan for matches.</p>

## match(pattern, str)


<p>Returns a match object if zero or more characters at the beginning of the string match the pattern; <br> otherwise returns ```None```.</p>
 
```re.match(pattern, string[, flags])```


In [4]:
import re
pattern ="C"
sequence = "IceCream"
match = re.match(pattern, sequence)
if match:
    print('found', match.group())
else:
    print("No Match", match)

No Match None


In [5]:
sequence = "Cake"
match = re.match(pattern, sequence)
if match:
    print('found', match.group(0))

found C


## search(pattern, str)

<p>Scans through the string looking for the first location where the regular expression produces a match. Returns a match object, or ```None``` if no position in the string matches.</p>

```re.search(pattern, string[, flags])```

In [6]:
match = re.search(r'dog', 'dog cat dog')
match.group()

'dog'
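
Because `re.search` gives back `None` when nothing matches, calling `.group()` directly on the result can fail. A minimal sketch of guarding against that, reusing the string from the cell above (the word `bird` and the variable names are chosen here only for illustration):

```python
import re

# No 'bird' in the text, so search returns None instead of a match object.
missing = re.search(r'bird', 'dog cat dog')
if missing:
    print('found', missing.group())
else:
    print('no match')

# Only the first occurrence is reported, even though 'dog' appears twice.
first_dog = re.search(r'dog', 'dog cat dog')
print(first_dog.span())  # (0, 3)
```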

<br>

<p style="color:#AA0000;font-size:23px;">Activity: Check if the word ```Time``` or ```time``` exists in the paragraph below</p>
<br>

In [7]:
quote = """Time means a lot to me because, you see,
I, too, am also a learner and am often lost in the
joy of forever developing and simplifying. 
If you love life, don’t waste time,
for time is what life is made up of.
"""

for line in quote.split('\n'):
    match = re.search(r'[Tt]ime', line)
    if match:
        print(match.group())
    else:
        print(line)

Time
I, too, am also a learner and am often lost in the
joy of forever developing and simplifying. 
time
time
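
As an alternative to the character class `[Tt]`, the optional `flags` argument can make the whole pattern case-insensitive. The following is a sketch on a shortened version of the quote (with a plain apostrophe); it would also accept spellings such as `TIME`, which may or may not be what you want.

```python
import re

quote = """Time means a lot to me because, you see,
If you love life, don't waste time,
for time is what life is made up of.
"""

found = []
for line in quote.splitlines():
    # re.IGNORECASE lets 'time' match regardless of letter case.
    match = re.search(r'time', line, flags=re.IGNORECASE)
    if match:
        found.append(match.group())

print(found)  # ['Time', 'time', 'time']
```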



<center style="color:#AA0000;font-size:25px;">match() v/s search() </center>


Python offers two different primitive operations based on regular expressions: ```match``` checks for a match only at the **beginning of the string**, while ```search``` checks for a match **anywhere in the string** (this is what Perl does by default).

**N.B.:** ```match``` may differ from ```search``` even when using a regular expression beginning with ```^```.

```^``` matches only at the start of the string or, in ```MULTILINE``` mode, also immediately following a newline. The ```match``` operation succeeds only if the pattern matches at the start of the string regardless of mode, or at the starting position given by the optional ```pos``` argument, regardless of whether a newline precedes it.

In [8]:
# example code:
string_with_newlines = """something
someotherthing"""

import re

print("Line 1:",re.match('some', string_with_newlines))# matches

print("Line 2:",re.match('someother', string_with_newlines)) # won't match
      
print("Line 3:", re.match('^someother', string_with_newlines, re.MULTILINE)) # also won't match
      
print("Line 4:", re.search('someother', string_with_newlines)) # finds something

print("Line 5:",  re.search('^someother', string_with_newlines, re.MULTILINE)) # also finds something
      
m = re.compile('thing$', re.MULTILINE)

print(m.match(string_with_newlines)) # no match

print(m.match(string_with_newlines, pos=4)) # matches

# The second positional argument of a compiled pattern's search() is pos, not flags;
# MULTILINE is already part of the compiled pattern, so no extra argument is needed.
print(m.search(string_with_newlines)) # also matches

Line 1: <_sre.SRE_Match object; span=(0, 4), match='some'>
Line 2: None
Line 3: None
Line 4: <_sre.SRE_Match object; span=(10, 19), match='someother'>
Line 5: <_sre.SRE_Match object; span=(10, 19), match='someother'>
None
<_sre.SRE_Match object; span=(4, 9), match='thing'>
<_sre.SRE_Match object; span=(4, 9), match='thing'>


## findall(pattern, str)

<p>```findall``` returns all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order in which they are found. </p>

```re.findall(pattern, string[, flags])```


In [9]:
import re
# Named `sentence` rather than `str` to avoid shadowing the built-in type.
sentence = "A fat cat doesn't eat oat but a rat eats bats."

match = re.findall(r'[force]at', sentence)
match

['fat', 'cat', 'eat', 'oat', 'rat', 'eat']
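
The class `[force]` has no `b`, so `bat` is not found, and the second `eat` comes from inside `eats`. A small sketch of two variations follows; the pattern `[bcefor]at` and the use of `\b` are illustrations added here, not the original author's solution.

```python
import re

sentence = "A fat cat doesn't eat oat but a rat eats bats."

# Adding 'b' to the character class also picks up the 'bat' inside 'bats'.
with_bat = re.findall(r'[bcefor]at', sentence)
print(with_bat)  # ['fat', 'cat', 'eat', 'oat', 'rat', 'eat', 'bat']

# \b on both sides keeps only standalone words, so 'eats' and 'bats' drop out.
whole_words = re.findall(r'\b[bcefor]at\b', sentence)
print(whole_words)  # ['fat', 'cat', 'eat', 'oat', 'rat']
```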

<br>
<br>
<center style="color:#BB0000;font-size:30px;">Mind Boggling Question</center>
<br>
<br>
<center style="color:#6666BB;font-size:20px;">Activity: Can we extract the ```email IDs``` from the string below?</center>

In [10]:
email_address = "Please support us via PayPal at: letspython3.x@gmail.com"

addresses = re.findall(r'[a-zA-Z0-9_.]+@[a-zA-Z0-9_.]+', email_address)
for address in addresses:
    print(address)

letspython3.x@gmail.com


## compile(pattern, [FLAGS])


<p>If you want to use the same regexp more than once in a script, it might be a good idea to use a regular expression object, i.e. the ```compiled regex```.</p>

```re.compile(pattern[, flags])```

**N.B.:** It's certainly possible to store pattern strings and pass them to ```re.match``` each time; however, that's less readable:


In [17]:
pat = "..."

mystr = input("Enter something: ")
m = re.match(pat, mystr)
m.group()

Enter something: gifyifyi\


'gif'

### Versus compiling

In [15]:
pat = re.compile("...")

str_1=input("Enter something: ")

m = pat.match(str_1)
m.group()


Enter something: hty


'hty'

**N.B.:** They are fairly close, but the second feels more natural and simpler when the pattern is used repeatedly.
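
To make the "used repeatedly" case concrete, here is a minimal sketch that compiles the three-character pattern once and applies it to several strings. The list `samples` and the name `three_chars` are made up for this example; no user input is needed.

```python
import re

# Compile once, then reuse the pattern object for every string.
three_chars = re.compile(r'...')

samples = ["gifyifyi", "hty", "ab"]  # made-up inputs
results = []
for text in samples:
    m = three_chars.match(text)
    # match() returns None when fewer than three characters are available.
    results.append(m.group() if m else None)

print(results)  # ['gif', 'hty', None]
```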

### Basic Patterns

<p>The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns, which match single characters:</p>

<table>
<tr>
    <th>Pattern</th>
    <th>Description</th>
</tr>
<tr>
    <td>a, X, 9 </td>
    <td>Ordinary characters just match themselves exactly</td>
</tr>
<tr>
    <td>```.``` (a period)</td>
    <td> matches any single character except newline ```\n```</td>
</tr>
    
<tr>
    <td>\w (lowercase w)  </td>
    <td> matches a "word" character: a letter, digit, or underscore [a-zA-Z0-9_]
        <br>It only matches a single word character, not a whole word.</td> 
</tr>
<tr>
    <td>\W (upper case W) </td>
    <td> matches any non-word character.</td>
</tr>
<tr>
    <td>\b</td>
    <td> boundary between word and non-word</td>
</tr>
<tr>
    <td>\s (lowercase s)</td>
    <td> matches a single whitespace character <br>space, newline, return, tab, form feed [ \n\r\t\f]. </td>
</tr>
<tr>
<td>\S (upper case S) </td><td>matches any non-whitespace character.</td>
</tr>
<tr>
    <td>\t, \n, \r </td>
    <td> tab, newline, return</td>
</tr>
<tr>
    <td>\d </td>
    <td>decimal digit [0-9] </td>
</tr>
<tr>
    <td>^ = start, $ = end </td>
    <td> match the start or end of the string</td>
</tr>
<tr>
    <td>```\``` </td>
    <td> inhibit the "specialness" of a character</td>
</tr>
</table>
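
A few of the patterns from the table can be tried on a short string. This is a sketch with a made-up sentence; all patterns are written as raw strings so the backslashes reach the regex engine unchanged.

```python
import re

text = "Room 42 is free."

digits = re.findall(r'\d', text)              # single decimal digits
words = re.findall(r'\w+', text)              # runs of word characters
chunks = re.findall(r'\S+', text)             # runs of non-whitespace characters
start = re.search(r'^Room', text).group()     # anchored at the start
end = re.search(r'free\.$', text).group()     # escaped period, anchored at the end
any_char = re.findall(r'.', 'a.b')            # unescaped period matches any character
literal_dot = re.findall(r'\.', 'a.b')        # escaped period matches only '.'

print(digits)       # ['4', '2']
print(words)        # ['Room', '42', 'is', 'free']
print(chunks)       # ['Room', '42', 'is', 'free.']
print(start, end)   # Room free.
print(any_char)     # ['a', '.', 'b']
print(literal_dot)  # ['.']
```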
             

## Search and Replace with ```sub```

<p>Every match of the regular expression ```regex``` in the string ```subject``` 
is replaced by the string ```replacement```. </p>

<br>
<p style="color:#AA0000;font-size:25px;">Activity: replace ```yes``` or ```Yes``` with ```no```</p>

```re.sub(regex, replacement, subject)```

In [None]:
# Named `text` rather than `str` to avoid shadowing the built-in type.
text = "yes I said yes I will Yes."
res = re.sub("[yY]es", "no", text)
res
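
For reference, a sketch of what this replacement produces, together with the optional `count` argument of `re.sub`, which limits how many matches are replaced. `count` comes from the `re` module's signature and is not used in the original cell.

```python
import re

text = "yes I said yes I will Yes."

# Replace every 'yes' or 'Yes'.
all_replaced = re.sub(r'[yY]es', 'no', text)
print(all_replaced)   # no I said no I will no.

# count=1 replaces only the first match and leaves the rest untouched.
first_only = re.sub(r'[yY]es', 'no', text, count=1)
print(first_only)     # no I said yes I will Yes.
```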

## Repetitions

<center style="color:#AA0000;font-size:20px;">Find long patterns in a sequence.</center>

|Character |Description|
|----------|:----------|
| ```+``` | Matches one or more repetitions of the character (or group) to its left.|
| ```*``` | Matches zero or more repetitions of the character (or group) to its left.|
| ```?``` | Matches zero or one occurrence of the character (or group) to its left.|


In [None]:
re.search(r'Co+kie', 'Cooookie').group()

In [None]:
re.search(r'Ca*o*kie', 'Caaaokie').group()

In [None]:
re.search(r'Colou?r', 'Color').group()

<center style="color:#AA0000;font-size:20px;">But what if you want to check for an exact number of repetitions?</center>
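
One way to answer this is with counted repetition in braces: `{m}` for exactly `m` repetitions and `{m,n}` for between `m` and `n`. The sketch below reuses the `Cooookie` string from the cells above; the date string is a made-up illustration.

```python
import re

# {m}: exactly m repetitions of the preceding character.
exact_four = re.search(r'Co{4}kie', 'Cooookie')
print(exact_four.group())   # Cooookie

# The string has four o's, so asking for exactly two fails.
exact_two = re.search(r'Co{2}kie', 'Cooookie')
print(exact_two)            # None

# {m,n}: between m and n repetitions, inclusive.
two_to_five = re.search(r'Co{2,5}kie', 'Cooookie')
print(two_to_five.group())  # Cooookie

# Exactly four digits in a row: picks the year, not the two-digit day.
years = re.findall(r'\d{4}', 'Date: February 12, 2011')
print(years)                # ['2011']
```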


<br>
<br>
<center style="color:#BB0000;font-size:30px;">Mind Boggling Question</center>
<br>
<br>
<center style="color:#6666BB;font-size:20px;">Can we write a regex pattern to extract ```<h1>``` from  ```<h1>TITLE</h1>```?</center>

In [None]:
import re
heading  = r'<h1>TITLE</h1>regex'
re.match(r'<.*>', heading).group()

### Description

The RE matches the ```'<'``` in ```'<h1>'```, and the ```.*``` consumes the rest of the string. There is still more left in the RE, though, and the ```>``` can't match at the end of the string,
so the regular expression engine has to backtrack character by character until it finds a match for the ```>```. 
The final match extends from the ```'<'``` in ```'<h1>'``` to the ```>``` in ```'</h1>'```, which of course isn't what we want.

This behavior of ```*```, ```+```, etc. is known as ```Greedy behavior```.

In [None]:
re.match(r'<.*?>', heading).group()

### Description
Adding ```?``` after the quantifier makes it perform the match in a non-greedy or minimal fashion; that is, as few characters as possible will be matched. 
Now, ```>``` is tried immediately after the first ```<``` matches, and when it fails, the engine advances a character at a time, retrying the ```>``` at every step.

The non-greedy quantifiers are ```*?```, ```+?```, ```??```, and ```{m,n}?```; each matches as little text as possible. 

This behavior is known as ```Non-Greedy behavior```.
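
The difference between the two behaviors shows up clearly when the greedy and non-greedy forms run side by side. A minimal sketch, using `re.findall` so every match is listed; the variable names are illustrative.

```python
import re

html = '<h1>TITLE</h1>'

# Greedy: .* runs to the last '>' it can reach, giving one long match.
greedy_tags = re.findall(r'<.*>', html)
print(greedy_tags)   # ['<h1>TITLE</h1>']

# Non-greedy: .*? stops at the first '>' each time, giving both tags.
lazy_tags = re.findall(r'<.*?>', html)
print(lazy_tags)     # ['<h1>', '</h1>']

# The same contrast with + and +? on a run of o's.
greedy_o = re.search(r'o+', 'Cooookie').group()
lazy_o = re.search(r'o+?', 'Cooookie').group()
print(greedy_o, lazy_o)  # oooo o
```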

### Groups and Grouping using Regular Expressions
The grouping feature lets you pick out parts of the matching text.

Parts of a regular expression pattern enclosed in parentheses ```()``` are called ```groups```. The parentheses do not change what the expression matches; they form groups within the matched sequence. The plain ```match.group()``` without any argument still returns the whole matched text, as usual.

In [None]:
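my_str = "Customer number: 232454, Date: February 12, 2011"
mo = re.search(r"[0-9]+", my_str)
print(mo.group())  # the matched text
print(mo.span())   # (start, end) indices of the match
print(mo.start())  # index where the match begins
print(mo.end())    # index just past the end of the match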

In [None]:
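my_str = "Customer number: 232454, Date: February 12, 2011"
mo = re.search(r"([0-9]+).*: (.*)", my_str)
print(mo.group())     # the whole match
print(mo.group(1))    # text captured by the first pair of parentheses
print(mo.group(2))    # text captured by the second pair of parentheses
print(mo.group(1,2))  # both groups as a tuple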
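
As a sketch of what the groups capture for this string: the greedy ```.*``` between the groups runs up to the last ```: ```, so the second group holds the date. `groups()` and `span(1)` are standard match-object methods that the original cell does not use.

```python
import re

my_str = "Customer number: 232454, Date: February 12, 2011"
mo = re.search(r"([0-9]+).*: (.*)", my_str)

number = mo.group(1)      # first group: the run of digits
date = mo.group(2)        # second group: everything after the last ': '
both = mo.groups()        # all groups as a tuple
number_span = mo.span(1)  # (start, end) indices of the first group

print(number)       # 232454
print(date)         # February 12, 2011
print(both)         # ('232454', 'February 12, 2011')
print(number_span)  # (17, 23)
```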

<p><br><br></p>
<center style="color:#3B3B30;font-size:35px;">Thank you for being an active listener.</center>
<p><br></p>

### Please show us your support by ```liking```, ```sharing``` and ```subscribing``` to our channel: <a href="https://www.youtube.com/channel/UCWHI-Ebf7_2ogLINkBIqaYA">```letspython3.x```</a>