### Regular Expressions in Python

Regular expressions are handled using Python's built-in **re** library. See [the docs](https://docs.python.org/3/library/re.html) for more information.

In [17]:
import re

In [44]:
text = "Hello this my phone number of my phone"
pattern = "phone"
invalid_pattern = "text"

## Matching pattern

- `search(pattern, text)` - returns a single match object for the first match, or `None` if there is no match
- `findall(pattern, text)` - returns a list of all matching strings
- `finditer(pattern, text)` - returns an iterator of match objects, one per match

In [19]:
match_obj = re.search(pattern, text)
print(match_obj)

<re.Match object; span=(14, 19), match='phone'>


In [20]:
print(match_obj.span()) # gives start and end as tuple
print(match_obj.start()) # gives start of matching pattern
print(match_obj.end()) # gives end of matching pattern (exclusive index)

(14, 19)
14
19


In [21]:
invalid_match = re.search(invalid_pattern, text)
print(invalid_match) # search returns None when the pattern is not found

None


In [22]:
all_match_text = re.findall(pattern, text) # it gives list of matching strings
print(all_match_text)

['phone', 'phone']


In [26]:
all_match_obj = re.finditer(pattern, text) # it gives match_obj iterator for all matching patterns
print(all_match_obj)

for match_obj in all_match_obj:
    match_span = match_obj.span()

    # match_start =  match_obj.start()
    # match_end =  match_obj.end()
    # match_text = text[match_start:match_end]
    
    match_text = match_obj.group() 
    
    print("The match text is:{}, span is :{}".format(match_text, match_span))

<callable_iterator object at 0x108598df0>
The match text is:phone, span is :(14, 19)
The match text is:phone, span is :(33, 38)


In [47]:
def printAllMatchingPattern(pattern, text):
    all_match_obj = list(re.finditer(pattern, text)) # finditer gives match_obj iterator for all matching patterns

    if not all_match_obj: # an iterator is always truthy, so we convert it to a list above and check the list for emptiness
        print("No matching patterns")

    for match_obj in all_match_obj: # iterate on all match object
        match_span = match_obj.span()

        match_text = match_obj.group() # extracting match text
        
        print("The match text is:{}, span is :{}".format(match_text, match_span))

printAllMatchingPattern(pattern, text)

The match text is:phone, span is :(14, 19)
The match text is:phone, span is :(33, 38)


In [48]:
printAllMatchingPattern(invalid_pattern, text)

No matching patterns
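
The helper converts the `finditer` result to a list because an iterator object is truthy even when it yields nothing. A minimal sketch of that behavior, reusing the earlier sample text (the variable names `empty_iter` and `matches` are only illustrative):

```python
import re

text = "Hello this my phone number of my phone"

# finditer always returns an iterator object, even when nothing matches
empty_iter = re.finditer("text", text)
iterator_is_truthy = bool(empty_iter)

# Only after materializing the iterator can we test for "no matches"
matches = list(empty_iter)

print(iterator_is_truthy)  # True
print(bool(matches))       # False
```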


## Characters and Quantifiers

### Identifiers for Characters in Patterns

We can use these special characters in a regular expression to generalize the pattern instead of matching exact text.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

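Note:
`\w` matches alphanumeric characters: letters `[a-zA-Z]`, digits `[0-9]`, and the underscore (`_`). For `str` patterns in Python 3 it also matches other Unicode word characters unless the `re.ASCII` flag is set.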
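
As a quick check of the character classes in the table, the sketch below runs `\d`, `\w`, and `\W` against a made-up sample string (the string and variable names are assumptions for illustration). Note that the underscore in `user_1` stays inside the `\w+` match.

```python
import re

sample = "user_1 paid $20"

digits = re.findall(r"\d", sample)      # each digit on its own
words = re.findall(r"\w+", sample)      # runs of letters, digits, and underscores
non_words = re.findall(r"\W", sample)   # whatever \w rejects: spaces and punctuation

print(digits)     # ['1', '2', '0']
print(words)      # ['user_1', 'paid', '20']
print(non_words)  # [' ', ' ', '$']
```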

### Quantifiers

Quantifiers specify how many times the preceding character or group must occur. Without a quantifier, each character in the pattern is matched exactly once.

<table >
<tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>
<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>


</table>
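
The quantifiers in the table can be tried out directly. This is a small sketch on made-up strings (all sample strings and variable names are assumptions). It also shows that `{2,4}` is greedy and takes as many digits as it is allowed.

```python
import re

# ? makes the preceding character optional
optional_s = re.findall(r"plurals?", "plural and plurals")

# + needs at least one occurrence and takes as many as it can
one_or_more = re.findall(r"a+", "caaandy a")

# {2,4} takes between 2 and 4 digits per match; a lone digit does not qualify
bounded = re.findall(r"\d{2,4}", "1 12 12345")

# * allows zero occurrences, so the missing B is not a problem
zero_or_more = re.search(r"A*B*C*", "AAACC").group()

print(optional_s)    # ['plural', 'plurals']
print(one_or_more)   # ['aaa', 'a']
print(bounded)       # ['12', '1234']
print(zero_or_more)  # AAACC
```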

In [54]:
text = "My phone number is 000-233-4565 which is similar to old phone number 000-233-4565"
pattern = r'\d{3}-\d{3}-\d{4}'

printAllMatchingPattern(pattern, text)

The match text is:000-233-4565, span is :(19, 31)
The match text is:000-233-4565, span is :(69, 81)


| Feature       | Symbol      | Description                                      | Example            | Matches         | Does Not Match  |
|--------------|------------|--------------------------------------------------|--------------------|----------------|----------------|
| **Grouping** | `( ... )`  | Groups part of the regex for capturing or logic | `(ab)+`           | "abab", "ab"   | "a", "b"       |
| **Non-Capturing Group** | `(?: ... )`  | Groups without capturing for efficiency | `(?:ab)+`         | "abab", "ab"   | "a", "b"       |
| **Start of String** | `^`        | Matches the **start** of a string            | `^Hello`          | "Hello world"  | "world Hello"  |
| **End of String**   | `$`        | Matches the **end** of a string              | `world$`          | "Hello world"  | "world Hello"  |
| **Character Range** | `[a-z]`    | Matches **any** character from `a` to `z`   | `[a-c]`           | "a", "b", "c"  | "d", "x"       |
| **Digit Range** | `[0-9]`    | Matches **any** digit from `0` to `9`         | `[3-6]`           | "3", "4", "5"  | "2", "7"       |
| **Exclusion** | `[^ ... ]` | Negates a set, matching anything **except** the characters inside | `[^aeiou]` | "b", "c", "x"  | "a", "e", "i"  |
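
The difference between capturing and non-capturing groups shows up clearly with `re.findall`, which returns the group contents instead of the whole match when the pattern contains a capturing group. A minimal sketch on a made-up string (the sample strings and variable names are assumptions):

```python
import re

sample = "abab ab"

# With a capturing group, findall returns what the group captured
# (for a repeated group, that is the last repetition)
captured = re.findall(r"(ab)+", sample)

# With a non-capturing group, findall returns the whole match
uncaptured = re.findall(r"(?:ab)+", sample)

# A negated set matches any single character not listed inside it
consonants = re.findall(r"[^aeiou]", "audio")

print(captured)    # ['ab', 'ab']
print(uncaptured)  # ['abab', 'ab']
print(consonants)  # ['d']
```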


In [None]:
text = "Hello my name is 12dsgsd"
pattern = r"[^\d]+" # one or more non-digit characters (^ inside [] negates the set)

printAllMatchingPattern(pattern, text)

The match text is:Hello my name is , span is :(0, 17)
The match text is:dsgsd, span is :(19, 24)


In [57]:
text = "Hello my employee id is emp0032"
pattern = r"emp\d+" # literal "emp" followed by one or more digits

printAllMatchingPattern(pattern, text)

The match text is:emp0032, span is :(24, 31)


In [None]:
text = "number is +1 234-345-3433"
pattern = r"(\+\d{1,3}) (\d{3}-\d{3}-\d{4})" # splitting the match into multiple groups using ()

printAllMatchingPattern(pattern, text)

match_obj = re.search(pattern, text)

extension = match_obj.group(1)
phone_number = match_obj.group(2)
combined_result = match_obj.group(0) # this gives the entire match

print("Phone extension is {}, and the number is {}, So the entire match is {}".format(extension, phone_number, combined_result))

The match text is:+1 234-345-3433, span is :(10, 25)
Phone extension is +1, and the number is 234-345-3433, So the entire match is +1 234-345-3433
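
Because `re.search` returns `None` when nothing matches, calling `.group()` directly on its result raises an `AttributeError` for text without a phone number. One way to guard against that is sketched below; `split_phone` is a hypothetical helper name, not part of the `re` module, and it uses `groups()` to return all captured groups as a tuple.

```python
import re

pattern = r"(\+\d{1,3}) (\d{3}-\d{3}-\d{4})"

def split_phone(text):
    """Hypothetical helper: return (prefix, number), or None if no match."""
    match_obj = re.search(pattern, text)
    if match_obj is None:  # no match, so there are no groups to read
        return None
    return match_obj.groups()  # tuple holding group(1) and group(2)

found = split_phone("number is +1 234-345-3433")
missing = split_phone("no number here")

print(found)    # ('+1', '234-345-3433')
print(missing)  # None
```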


### Regex Anchors Table

| Anchor  | Description                                                                 | Example                     | Matches                     | Does Not Match              |
|---------|-----------------------------------------------------------------------------|-----------------------------|-----------------------------|-----------------------------|
| `\b`    | Word boundary: Matches the position between a word character and a non-word character. | `\bcat\b`                   | "cat" in "The cat"           | "cat" in "category"         |
| `\B`    | Non-word boundary: Matches a position that is **not** a word boundary.      | `\Bcat\B`                   | "cat" in "concatenate"       | "cat" in "The cat"          |
| `^`     | Start of a string (or line, in multiline mode).                            | `^The`                      | "The" in "The cat"          | "The" in "This is The cat"  |
| `$`     | End of a string (or line, in multiline mode).                              | `cat$`                      | "cat" in "This is a cat"    | "cat" in "catapult"         |
| `\A`    | Start of a string (ignores multiline mode).                                | `\AThe`                     | "The" in "The cat"          | "The" in "This is The cat"  |
| `\Z`    | End of a string (ignores multiline mode).                                  | `cat\Z`                     | "cat" in "This is a cat"    | "cat" in "catapult"         |
| `\G`    | Start of the match (used for contiguous matches). Not supported by Python's built-in `re` module; listed for comparison with other regex engines. | `\G\d`                      | "1", "2", "3" in "123"      | "2" in "1 2 3"              |
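
The effect of multiline mode on `^` and `\A`, and the behavior of `\b` and `\B`, can be checked with a short sketch. The sample strings and variable names below are assumptions chosen for illustration.

```python
import re

lines = "The cat\nThe dog"

# By default ^ matches only at the very start of the string
default_mode = re.findall(r"^The", lines)

# With re.MULTILINE, ^ also matches right after each newline
multiline_mode = re.findall(r"^The", lines, re.MULTILINE)

# \A always means start of string, even in multiline mode
string_start = re.findall(r"\AThe", lines, re.MULTILINE)

words = "cat category concatenate"
whole_word = [m.start() for m in re.finditer(r"\bcat\b", words)]
inside_word = [m.start() for m in re.finditer(r"\Bcat\B", words)]

print(default_mode)    # ['The']
print(multiline_mode)  # ['The', 'The']
print(string_start)    # ['The']
print(whole_word)      # [0]  -> the standalone "cat"
print(inside_word)     # [16] -> the "cat" inside "concatenate"
```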

---

In [66]:
text = "Employee id is emp0032 and serial number is 23n232k12 but payroll number is 43234234324"

pattern=r"\d+" # matches every run of digits, including digits embedded in other words
printAllMatchingPattern(pattern, text)

print("-"*40)

pattern= r'\b\d+\b' # matches only standalone numbers, bounded by word boundaries on both sides
printAllMatchingPattern(pattern, text)

The match text is:0032, span is :(18, 22)
The match text is:23, span is :(44, 46)
The match text is:232, span is :(47, 50)
The match text is:12, span is :(51, 53)
The match text is:43234234324, span is :(76, 87)
----------------------------------------
The match text is:43234234324, span is :(76, 87)


In [76]:
text = "cat is not caterpillar, on this"

print("-"*35+"  cat "+"-"*10)
pattern=r"^cat" # matches "cat" only at the start of the string, not the "cat" in "caterpillar"
printAllMatchingPattern(pattern, text)
print("-"*10)
pattern=r"cat" # matches every occurrence of "cat"
printAllMatchingPattern(pattern, text)


print("-"*35+"  is "+"-"*10)
pattern=r"is$" # matches only the "is" at the end of the string (in "this"), not the standalone "is"
printAllMatchingPattern(pattern, text)

print("-"*10)
pattern=r"is" # matches every occurrence of "is"
printAllMatchingPattern(pattern, text)




-----------------------------------  cat ----------
The match text is:cat, span is :(0, 3)
----------
The match text is:cat, span is :(0, 3)
The match text is:cat, span is :(11, 14)
-----------------------------------  is ----------
The match text is:is, span is :(29, 31)
----------
The match text is:is, span is :(4, 6)
The match text is:is, span is :(29, 31)
