## Regular Expressions

Regular expressions (regexes) describe text patterns. They are used to search, replace, and parse text 
that follows complex patterns of characters.

Regexes are used for four main purposes:
- Validate whether a text meets some criteria, e.g. a zip code with 6 numeric digits 
- Search for substrings, e.g. find text that ends with abc and does not contain any digits 
- Search and replace every match within a string, e.g. search for "fixed deposit" and replace it with "term deposit" 
- Split a string at each place the regex matches, e.g. split wherever an @ is encountered
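The four purposes above can be sketched with Python's `re` module roughly as follows. The sample strings and variable names are illustrative assumptions, not part of the original examples.

```python
import re

# 1. Validate: the whole string must be exactly six numeric digits
is_valid = re.fullmatch(r"\d{6}", "560001") is not None

# 2. Search: first word made only of lowercase letters that ends with "abc"
found = re.search(r"\b[a-z]*abc\b", "see xyzabc and 12abc")

# 3. Search and replace: every occurrence is replaced
replaced = re.sub(r"fixed deposit", "term deposit",
                  "A fixed deposit beats no fixed deposit")

# 4. Split: cut the string wherever an @ is encountered
pieces = re.split(r"@", "user@example@com")

print(is_valid, found.group(), replaced, pieces, sep="\n")
```

`re.fullmatch()` is used for validation here because it requires the entire string to match, which is usually what a format check needs.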

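#### Raw Python string

Write regex patterns as raw strings rather than regular Python strings. A raw string begins with the prefix r placed before the opening quote, and Python leaves its backslashes untouched instead of treating them as escape sequences such as \n.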

In [1]:
print("ABC \n PQR")

ABC 
 PQR


In [2]:
print(r"ABC \n PQR")

ABC \n PQR


In [None]:
open(r"C:\users\newfolder\file.txt")
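Why this matters for regexes can be seen with `\b`, which is a backspace character in a regular string but a word boundary in a pattern. A minimal sketch, with an illustrative sample string:

```python
import re

text = "cat concat"

# Regular string: "\b" becomes the backspace character (ASCII 8) before
# the regex engine ever sees it, so no word boundary is tested.
plain_matches = re.findall("\bcat\b", text)

# Raw string: the backslash reaches the regex engine intact,
# so \b is interpreted as a word boundary.
raw_matches = re.findall(r"\bcat\b", text)

print(plain_matches)  # []
print(raw_matches)    # ['cat']

# The two spellings really are different strings
print(len("\n"), len(r"\n"))  # 1 2
```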

#### Importing re module

In [3]:
import re

#### Functions in re Module
The `re` module provides functions to match, search, split, and replace text:

- `re.match()` - Returns a Match object only if the pattern matches at the beginning of the string, otherwise `None` 
- `re.search()` - Returns a Match object for the first occurrence of the pattern anywhere in the string, otherwise `None` 
- `re.findall()` - Returns a list containing all matches in the string 
- `re.split()` - Returns a list where the string has been split at each match 
- `re.sub()` - Replaces one or many matches with a string (or with the result of a function) 
- `re.finditer()` - Returns an iterator yielding a Match object for each non-overlapping match 
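As a quick sketch of the functions that return lists, strings, or iterators rather than a single Match object, the snippet below runs four of them on one sentence. The patterns and variable names are illustrative choices.

```python
import re

text = "India celebrates its independence day on 15th August and Republic day on 26th January"

# findall: every run of digits, as plain strings
numbers = re.findall(r"\d+", text)

# split: cut the sentence at the word "and" (with surrounding whitespace)
parts = re.split(r"\s+and\s+", text)

# sub: replace each ordinal day such as "15th" with a placeholder
masked = re.sub(r"\d+(?:st|nd|rd|th)", "<day>", text)

# finditer: like findall, but each item is a Match object with positions
spans = [(m.group(), m.span()) for m in re.finditer(r"\d+", text)]

print(numbers)
print(parts)
print(masked)
print(spans)
```

`re.finditer()` is the better choice over `re.findall()` when the positions of the matches are needed, or when the text is large and the matches should be processed one at a time.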

In [23]:
text = "India celebrates its independence day on 15th August"
re.match(r"[A-Za-z]+\b", text)  # [A-z] would also match [ \ ] ^ _ and `, so spell out both ranges

<re.Match object; span=(0, 5), match='India'>

In [18]:
text = "India celebrates its independence day on 15th August"
re.search(r"[0-9]+", text)  # returns a match object

<re.Match object; span=(41, 43), match='15'>

The **Match object** in Python's `re` module provides several useful methods to extract information about the match. It contains details about the match, such as the matched string, its position, and captured groups.

<table style="width: 60%; border-collapse: collapse; border: 1px solid #ccc; text-align: left; margin-left: 0;">
  <thead>
    <tr style="background-color: #050A30; color: white;">
    <th>Method</th>
    <th>Description</th>
  </tr>
  </thead>
  <tr>
    <td><b>.group([group])</b></td>
    <td>Returns the matched string or a specific group if specified.</td>
  </tr>
  <tr>
    <td><b>.groups()</b></td>
    <td>Returns a tuple of all captured groups.</td>
  </tr>
  <tr>
    <td><b>.start([group])</b></td>
    <td>Returns the start index of the match or a group.</td>
  </tr>
  <tr>
    <td><b>.end([group])</b></td>
    <td>Returns the end index of the match or a group.</td>
  </tr>
  <tr>
    <td><b>.span([group])</b></td>
    <td>Returns a tuple <code>(start, end)</code> of the match or a group.</td>
  </tr>
  <tr>
    <td><b>.re</b></td>
    <td>Returns the regular expression pattern object.</td>
  </tr>
  <tr>
    <td><b>.string</b></td>
    <td>Returns the original string searched.</td>
  </tr>
  <tr>
    <td><b>.lastindex</b></td>
    <td>Returns the last captured group’s index.</td>
  </tr>
  <tr>
    <td><b>.lastgroup</b></td>
    <td>Returns the last captured group’s name if named groups are used.</td>
  </tr>
</table>

In [20]:
# Example - 

import re

text = "India celebrates its independence day on 15th August and Republic day on 26th January"
match = re.search(r"[0-9]+", text)  # returns a match object

if match:
    print("Matched Text:", match.group())        
    print("Start Position:", match.start())     
    print("End Position:", match.end())         
    print("Span:", match.span())                
else:
    print("No match found.")

Matched Text: 15
Start Position: 41
End Position: 43
Span: (41, 43)
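The group-related entries in the table only become useful once the pattern contains capturing groups. The following sketch uses two named groups, `day` and `month`, which are hypothetical names chosen for illustration.

```python
import re

text = "India celebrates its independence day on 15th August"

# Two named capturing groups; (?:...) is a non-capturing group for the suffix
m = re.search(r"(?P<day>\d+)(?:st|nd|rd|th)\s+(?P<month>[A-Z][a-z]+)", text)

if m:
    print("Whole match:", m.group())          # same as m.group(0)
    print("Group 1:", m.group(1))             # by position
    print("Month:", m.group("month"))         # by name
    print("All groups:", m.groups())
    print("As dict:", m.groupdict())
    print("Span of day:", m.span("day"))
    print("Last group index:", m.lastindex)
    print("Last group name:", m.lastgroup)
    print("Pattern used:", m.re.pattern)
    print("Searched string is text:", m.string is text)
```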


**Basic Characters**
- `^` - Matches the expression to its right at the start of a string. With the `re.MULTILINE` flag, it also matches 
immediately after each line break in the string 
- `$` - Matches the expression to its left at the end of a string (or just before a trailing newline). With the `re.MULTILINE` flag, it also matches 
before each line break in the string 
- `p|q` - Matches expression p or q 
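A small sketch of how the anchors behave with and without the `re.MULTILINE` flag, using an illustrative three-line string:

```python
import re

text = "buy AAPL\nsell GOOGL\nbuy BMW"

# Default mode: ^ and $ anchor to the start and end of the whole string
start_default = re.findall(r"^buy", text)
end_default = re.findall(r"\w+$", text)

# MULTILINE mode: ^ and $ also anchor at every line break
start_multi = re.findall(r"^buy", text, flags=re.MULTILINE)
end_multi = re.findall(r"\w+$", text, flags=re.MULTILINE)

# Alternation: either word matches
either = re.findall(r"buy|sell", text)

print(start_default, end_default)  # ['buy'] ['BMW']
print(start_multi, end_multi)      # ['buy', 'buy'] ['AAPL', 'GOOGL', 'BMW']
print(either)                      # ['buy', 'sell', 'buy']
```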

**Character Classes**

- `\w` - Matches word characters: a-z, A-Z, 0-9 and _
- `\W` - Matches non-word characters, that is, anything other than a-z, A-Z, 0-9 and _
- `\d` - Matches digits: 0-9
- `\D` - Matches any non-digit character 
- `\s` - Matches whitespace characters, including space, \t, \n, \r, \f and \v 
- `\S` - Matches non-whitespace characters 
- `\A` - Matches the expression to its right only at the absolute start of the string, even in multi-line mode 
- `\t` - Matches a tab character
- `\Z` - Matches the expression to its left only at the absolute end of the string, even in multi-line mode 
- `\n` - Matches a newline character 
- `\b` - Matches a word boundary, that is, the empty position at the start or end of a word 
- `\B` - Matches where \b does not, that is, non-word boundary
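The classes above can be tried out on a short string. This is a sketch with made-up sample text, and the variable names are illustrative.

```python
import re

text = "Order_42 shipped\ton 2024-01-15"

words = re.findall(r"\w+", text)      # runs of letters, digits and _
digits = re.findall(r"\d+", text)     # runs of digits
spaces = re.findall(r"\s", text)      # each whitespace character
non_words = re.findall(r"\W", text)   # everything \w does not match

# \b needs a boundary on that side, \B forbids one
sample = "cat concat cats"
whole_word = re.findall(r"\bcat\b", sample)   # only the standalone "cat"
word_suffix = re.findall(r"\Bcat\b", sample)  # only the "cat" ending "concat"

# \A ignores MULTILINE, while ^ does not
lines = "line1\nline2"
abs_start = re.findall(r"\Aline", lines, flags=re.MULTILINE)
each_start = re.findall(r"^line", lines, flags=re.MULTILINE)

print(words, digits, spaces, non_words, sep="\n")
print(whole_word, word_suffix, abs_start, each_start, sep="\n")
```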

**Groups and Sets**

- `[abc]` - Matches a single character that is a, b, or c. It does not match the sequence abc
- `[a\-z]` - Matches a, -, or z. It matches - because \ escapes it 
- `[^abc]` - A leading ^ negates the set. Here, it matches any character that is NOT a, b or c 
- `()` - Matches the expression inside the parentheses and groups it
- `[a-z]` - Matches any lowercase letter from a to z 
- `[a-z0-9]` - Matches characters from a to z and 0 to 9 
- `[(+*)]` - Special characters become literal inside a set, so this matches ( + * and ) 
- `(?P=name)` - Matches the expression matched by an earlier group named "name"
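The set and group syntax above might be exercised like this. The sample strings and the group name `word` are illustrative assumptions.

```python
import re

in_set = re.findall(r"[abc]", "cabbage")        # single characters a, b or c
escaped = re.findall(r"[a\-z]", "a-z b")        # a, literal hyphen, or z
negated = re.findall(r"[^abc]", "cabbage")      # anything except a, b, c
literals = re.findall(r"[(+*)]", "(a+b)*c")     # specials are literal in a set

# With one capturing group, findall returns the group, not the whole match
percents = re.findall(r"(\d+)%", "17% and 5%")

# (?P<word>...) names a group; (?P=word) must repeat the same text
repeat = re.search(r"\b(?P<word>\w+)\s+(?P=word)\b", "this is is a test")

print(in_set, escaped, negated, literals, percents, sep="\n")
print(repeat.group(), "->", repeat.group("word"))
```

The last pattern is a common way to spot an accidentally doubled word, since the backreference only matches the exact text captured earlier.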

**Quantifiers**

- `.` - Matches any character except newline 
- `?` - Matches the expression to its left 0 or 1 times 
- `{n}` - Matches the expression to its left exactly n times 
- `{,m}` - Matches the expression to its left up to m times
- `*` - Matches the expression to its left 0 or more times 
- `+` - Matches the expression to its left 1 or more times 
- `{n,m}` - Matches the expression to its left n to m times 
- `{n,}` - Matches the expression to its left n or more times (written without a space after the comma) 
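A short sketch of the quantifiers in action. The sample strings are made up for illustration, and `\b` is added in places so that each number is matched as a whole.

```python
import re

optional = re.findall(r"colou?r", "color colour")        # u is optional
exactly4 = re.findall(r"\d{4}", "12 1234 123456")         # exactly 4 digits at a time
two_to_three = re.findall(r"\b\d{2,3}\b", "1 22 333 4444")
three_plus = re.findall(r"\b\d{3,}\b", "1 22 333 4444")
zero_or_more = re.findall(r"ab*", "a ab abbb")            # a followed by any number of b
one_or_more = re.findall(r"ab+", "a ab abbb")             # a followed by at least one b
any_char = re.findall(r"a.c", "abc a-c a\nc")             # . does not match newline

# {,m}: at most m repetitions, checked here against whole strings
at_most_two_ok = re.fullmatch(r"\d{,2}", "12") is not None
at_most_two_bad = re.fullmatch(r"\d{,2}", "123") is not None

print(optional, exactly4, two_to_three, three_plus, sep="\n")
print(zero_or_more, one_or_more, any_char, sep="\n")
print(at_most_two_ok, at_most_two_bad)
```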

#### Examples - 

###### Ex. Extract all digits from the text

In [24]:
text = "The stock price was 456 yesterday. Today, it rose to 564"
re.findall(r"\d", text)

['4', '5', '6', '5', '6', '4']

###### Ex. Extract all numbers from the text

In [25]:
text = "The stock price was 456 yesterday. Today, it rose to 564"
re.findall(r"\d+", text)

['456', '564']

###### Ex. Retrieve the dividend from the text

In [27]:
text = "On 25th March, the company declared 17% dividend."
re.findall(r"\d+%", text)

['17%']

###### Ex. Retrieve all uppercase characters

In [28]:
text = "Stocks like AAPL GOOGL BMW are the preferred ones"
re.findall(r"[A-Z]", text)

['S', 'A', 'A', 'P', 'L', 'G', 'O', 'O', 'G', 'L', 'B', 'M', 'W']

###### Ex. Retrieve all stock names

In [33]:
text = "Stocks like AAPL GOOGL BMW X are the preferred ones"
re.findall(r"[A-Z]+\b", text)

['AAPL', 'GOOGL', 'BMW', 'X']

In [34]:
text = "Stocks like AAPL GOOGL BMW X are the preferred ones"
re.findall(r"[A-Z]{2,}", text)

['AAPL', 'GOOGL', 'BMW']

###### Ex. Retrieve the phone numbers with country code only 

In [35]:
text = "My number is 65-11223344 and 65-91919191. My other number is 44332211"
re.findall(r"\d+-\d+", text)

['65-11223344', '65-91919191']

###### Ex. Retrieve the phone numbers with or without country code

In [38]:
text = "My number is 65-11223344 and 65-91919191. My other number is 44332211"
re.findall(r"\d+-\d+|\d+", text)

['65-11223344', '65-91919191', '44332211']

###### Ex. Retrieve the phone numbers without country code

In [52]:
text = "My number is 65-11223344 and 65-91919191. My other number is 44332211"
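# (?<![-\d]) is a negative lookbehind: the number must not be preceded by a hyphen or another digit
re.findall(r"(?<![-\d])\d{3,}", text)

['44332211']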

###### Ex. Retrieve the zip codes with two letters at the beginning 

In [54]:
text = "The zipcodes are AB4567, TX23A3, 310120, NY1210, 734001 "
re.findall(r"[A-Z]{2}\w+", text)

['AB4567', 'TX23A3', 'NY1210']

###### Ex. Replace values as given in the dict

In [55]:
text = "Stocks like AAPL GOOGL BMW are the preferred ones"
repl_dict = {"AAPL": "APPLE", "GOOGL": "GOOGLE"}
re.sub(r"[A-Z]+\b", "*", text)

'Stocks like * * * are the preferred ones'

In [63]:
fn = lambda m_obj : repl_dict.get(m_obj.group(), m_obj.group())
re.sub(r"[A-Z]+\b", fn, text)

'Stocks like APPLE GOOGLE BMW are the preferred ones'

In [59]:
m_obj = re.search(r"[A-Z]+\b", text)
repl_dict[m_obj.group()]

'APPLE'

In [56]:
help(re.sub)

Help on function sub in module re:

sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used.
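Two details from the help text can be sketched briefly: the `count` parameter, and backslash escapes such as `\1` in a string replacement, which refer back to captured groups. `re.subn()` is a related function in the same module that also reports how many replacements were made. The patterns below are illustrative.

```python
import re

text = "Stocks like AAPL GOOGL BMW are the preferred ones"

# count=1 limits the substitution to the leftmost match only
first_only = re.sub(r"\b[A-Z]{2,}\b", "*", text, count=1)

# \1 in the replacement string inserts the text captured by group 1
wrapped = re.sub(r"\b([A-Z]{2,})\b", r"<\1>", text)

# subn returns a (new_string, number_of_replacements) tuple
masked, n_replaced = re.subn(r"\b[A-Z]{2,}\b", "*", text)

print(first_only)
print(wrapped)
print(masked, n_replaced)
```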



<hr><hr>