# Regex

## Identifiers / Special Sequences



<p style="border-style:solid;border-spacing=length;border-width:1px;padding:9px">
    <b>\b</b> Matches the empty string at a word boundary (the start or end of a word)<br>
    <b>\B</b> Matches the empty string anywhere except at a word boundary<br>
    <b>\d</b> Matches any digit<br>
    <b>\D</b> Matches any character that is not a digit<br>
    <b>\s</b> Matches any whitespace character [ \t\n\r\f\v]<br>
    <b>\S</b> Matches any character that is not whitespace<br>
    <b>\w</b> Matches any word character: alphanumeric or underscore ([a-zA-Z0-9_] for ASCII text)
</p>
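
A minimal sketch of how a few of these sequences behave. The sample text and variable names are made up for illustration.

```python
import re

text = "cat concat 42 cats_9"

# \b anchors the match at a word boundary, so "concat" and "cats_9" are skipped
whole_word = re.findall(r"\bcat\b", text)

# \B requires a position that is NOT a word boundary,
# so only the "cat" inside "concat" matches
inside_word = re.findall(r"\Bcat", text)

# \d+ picks up runs of digits; \w+ picks up runs of letters, digits and underscores
digits = re.findall(r"\d+", text)
words = re.findall(r"\w+", text)

print(whole_word, inside_word, digits, words)
```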

<p style="border-style:solid;border-spacing=length;border-width:1px;padding:9px">
    <b>^</b> Matches start of the string<br>
    <b>$</b> Matches end of the string<br>
    <b>*</b> 0 or more occurrences<br>
    <b>+</b> 1 or more occurrences<br>
    <b>?</b> 0 or 1 occurrence<br>
    <b>|</b> Alternation (either the left or the right pattern)<br>
    <b>()</b> Capture group<br>
    <b>{n}</b> Exactly n occurrences<br>
    <b>{m,n}</b> Between m and n occurrences (inclusive)<br>
</p>
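
The quantifiers can be compared side by side with a small sketch. The `samples` list and the `matching` helper are hypothetical, and `re.fullmatch` (not covered in this notebook) is used so that the whole sample has to match.

```python
import re

samples = ["ac", "abc", "abbc", "abbbbc"]

def matching(pattern):
    # Keep only the samples that the pattern matches in full
    return [s for s in samples if re.fullmatch(pattern, s)]

zero_or_more = matching(r"ab*c")      # "b" may be absent or repeated
one_or_more = matching(r"ab+c")       # at least one "b"
optional = matching(r"ab?c")          # no "b" or exactly one "b"
exactly_two = matching(r"ab{2}c")     # exactly two "b"
two_to_four = matching(r"ab{2,4}c")   # two, three or four "b"

print(zero_or_more, one_or_more, optional, exactly_two, two_to_four)
```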

<p style="border-style:solid;border-spacing=length;border-width:1px;padding:9px">
    <b>(?=..)</b> Positive lookahead <br>
    <b>(?&lt;=..)</b> Positive lookbehind <br>
    <b>(?!..)</b> Negative lookahead <br>
    <b>(?&lt;!..)</b> Negative lookbehind <br>
</p>
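
Lookarounds check the text next to a match without consuming it. A rough sketch of all four forms follows; the `prices` string is invented for illustration.

```python
import re

prices = "apple $30, pear 45 EUR, plum $7"

# Positive lookbehind: digits preceded by "$" (the "$" is not part of the match)
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Positive lookahead: digits followed by " EUR"
before_eur = re.findall(r"\d+(?= EUR)", prices)

# Negative lookbehind: whole numbers NOT preceded by "$"
not_dollar = re.findall(r"(?<!\$)\b\d+\b", prices)

# Negative lookahead: whole numbers NOT followed by " EUR"
not_eur = re.findall(r"\b\d+\b(?! EUR)", prices)

print(after_dollar, before_eur, not_dollar, not_eur)
```
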
<p style="border-style:solid;border-spacing=length;border-width:1px;padding:9px"> Metacharacters (escape with a backslash to match them literally): * + ? | [] {} () \ . ^ $ </p>
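
Because these characters have a special meaning, text that contains them has to be escaped before it is used as a pattern. A small sketch, assuming a made-up `literal` string and using `re.escape` from the standard library:

```python
import re

literal = "1+1=2 (really?)"

# Unescaped, "+" and "?" act as quantifiers and "()" as a group,
# so the text does not even match itself
unescaped = re.search(literal, literal)

# re.escape backslash-escapes the special characters so the text is matched literally
escaped = re.search(re.escape(literal), literal)

# A single metacharacter can also be escaped by hand
dots = re.findall(r"\.", "a.b.c")

print(unescaped, escaped.group(), dots)
```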

In [151]:
import re

## re.search
__re.search__: Scans the whole string and returns a Match object for the first location where the pattern matches, or None if there is no match.<br>

In [152]:
print(re.search.__doc__)

Scan through string looking for a match to the pattern, returning
    a Match object, or None if no match was found.


In [153]:
rs_obj = re.search("W", "HELLOWORLSW")
print(rs_obj)  # Prints the Match object for the first match ('W' at index 5)

<re.Match object; span=(5, 6), match='W'>


## re.match
__re.match__: Tries to match the pattern only at the start of the string; otherwise it behaves like re.search.

In [154]:
print(re.match.__doc__)

Try to apply the pattern at the start of the string, returning
    a Match object, or None if no match was found.


In [155]:
rs_obj = re.match("W", "HELLOWORLSW")
print(rs_obj)  # Prints None because the string does not start with "W"

None
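
One way to picture the difference: `re.match` behaves roughly like `re.search` with the pattern anchored at the start of the string. The sketch below uses `\A` (start of string) to show this; the variable names are illustrative.

```python
import re

text = "HELLOWORLSW"

# Anchoring the search at the start gives the same result as re.match: no match
anchored_search = re.search(r"\AW", text)

# re.match succeeds when the pattern fits at position 0
match_at_start = re.match(r"HELLO", text)

# re.search finds the pattern wherever it first occurs
search_anywhere = re.search(r"WORL", text)

print(anchored_search, match_at_start.span(), search_anywhere.span())
```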


In [156]:
re.search(r"\d* years", "himanshu is 23 years old")  # Raw string, so \d reaches the regex engine unchanged

<re.Match object; span=(12, 20), match='23 years'>

## re.findall

In [157]:
print(re.findall.__doc__)

Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.


In [158]:
# Suppose we want to extract the marks, subjects, and student names from the string below
s = "Jassie got 24 marks in English, John scored 24 marks in Maths and Simran got 54 marks in Psychology."

student_marks = re.findall(r"\d{1,2}", s)  # Matches runs of 1 or 2 digits
student_subjects = re.findall(r"in ([A-Z][a-z]*)", s)  # Matches a capitalized word after "in "; only the captured group is returned
student_names = re.findall(r"[A-Z][a-z]*", s)  # Matches every word that starts with a capital letter
print("student_marks : " + str(student_marks))
print("student_subjects : ", student_subjects)  # Only the subject is listed, without the leading "in "
student_names = set(student_names).difference(student_subjects)  # Remove the subject names from the student names
print("student_names : ", student_names)

student_marks : ['24', '24', '54']
student_subjects :  ['English', 'Maths', 'Psychology']
student_names :  {'Jassie', 'John', 'Simran'}
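
As the docstring notes, a pattern with more than one capturing group makes `re.findall` return a list of tuples. A possible sketch on a sentence modeled on the one above; the `records` name and the non-capturing group `(?:got|scored)` are additions for illustration.

```python
import re

s = "Jassie got 24 marks in English, John scored 24 marks in Maths and Simran got 54 marks in Psychology."

# Three capturing groups: name, marks, subject.
# (?:got|scored) groups the alternatives without capturing them.
records = re.findall(
    r"([A-Z][a-z]*) (?:got|scored) (\d{1,2}) marks in ([A-Z][a-z]*)", s
)

for name, marks, subject in records:
    print(name, marks, subject)
```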


## re.split

In [159]:
re.split?
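
`re.split` splits a string wherever the pattern matches. A short sketch of typical usage; the sample strings and variable names are made up.

```python
import re

# Split on a comma or semicolon followed by optional whitespace
fields = re.split(r"[,;]\s*", "red, green;blue,  yellow")

# maxsplit limits the number of splits; the remainder stays in the last item
first_split = re.split(r"[,;]\s*", "red, green;blue,  yellow", maxsplit=1)

# A capturing group in the pattern keeps the separators in the result
with_separators = re.split(r"([,;])\s*", "red, green;blue")

print(fields)
print(first_split)
print(with_separators)
```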

## re.compile

In [160]:
print(re.compile.__doc__)

Compile a regular expression pattern, returning a Pattern object.


In [161]:
pattern = re.compile(r'\d')
print(list(pattern.finditer("12345")))  # finditer returns an iterator of Match objects, one per match
print(re.findall(pattern, "12345"))  # findall returns a list of all the matched strings

[<re.Match object; span=(0, 1), match='1'>, <re.Match object; span=(1, 2), match='2'>, <re.Match object; span=(2, 3), match='3'>, <re.Match object; span=(3, 4), match='4'>, <re.Match object; span=(4, 5), match='5'>]
['1', '2', '3', '4', '5']
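
A compiled Pattern object has its own `search`, `findall` and `finditer` methods, so it can be built once and reused. A minimal sketch, assuming a hypothetical `word` pattern and the `re.IGNORECASE` flag (not used elsewhere in this notebook):

```python
import re

# Compile once; the flag makes [a-z] match uppercase letters too
word = re.compile(r"[a-z]+", re.IGNORECASE)

text = "123 Hello World"

first = word.search(text)                        # first Match object
all_words = word.findall(text)                   # list of matched strings
spans = [m.span() for m in word.finditer(text)]  # (start, end) of each match

print(first.group(), all_words, spans)
```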


## Match.group
__A method of the Match object (not of the re module) that returns one or more subgroups of the match.__

In [172]:
print(re.match(r"(\w+)\t(\w+)", "HELLO\tWORLSWH").group.__doc__)

group([group1, ...]) -> str or tuple.
    Return subgroup(s) of the match by indices or names.
    For 0 returns the entire match.


In [173]:
print(re.match(r"(\w+)\t(\w+)", "HELLO\tWORLSWH").group())  # Returns the entire match
print(re.match(r"(\w+)\t(\w+)", "HELLO\tWORLSWH").group(0))  # Also returns the entire match
print(re.match(r"(\w+)\t(\w+)", "HELLO\tWORLSWH").group(1))  # Returns the first captured group
print(re.match(r"(\w+)\t(\w+)", "HELLO\tWORLSWH").group(2))  # Returns the second captured group

HELLO	WORLSWH
HELLO	WORLSWH
HELLO
WORLSWH


## Match.groups

In [174]:
print(re.match(r"(\w+)\t(\w+)", "HELLO\tWORLSWH").groups.__doc__)

Return a tuple containing all the subgroups of the match, from 1.

  default
    Is used for groups that did not participate in the match.


In [177]:
print(re.match(r"(\w+)\t(\w+)", "HELLO\tWORLSWH").groups())

('HELLO', 'WORLSWH')
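
The `default` argument mentioned in the docstring matters when a group is optional and takes no part in the match. A sketch under that assumption; the named groups `first` and `second` are illustrative and use the `(?P<name>...)` syntax.

```python
import re

# The second group is optional, and the input has nothing after the tab
m = re.match(r"(?P<first>\w+)\t(?P<second>\w+)?", "HELLO\t")

by_name = m.group("first")          # named groups can be read by name
all_groups = m.groups()             # the unmatched group shows up as None
with_default = m.groups(default="") # default replaces None for unmatched groups
as_dict = m.groupdict()             # mapping of group names to values

print(by_name, all_groups, with_default, as_dict)
```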


## Match.start, Match.end & Match.span
__Methods of the Match object that return the start and end indices of the matched part of the string.__

In [189]:
print(re.search(r"\w+", "....ABCD").start())  # Index where the first run of word characters starts
print(re.search(r"\w+", "....ABCD").end())  # Index just past the end of that run (exclusive)
print(re.search(r"\w+", "....ABCD"))
print(re.search(r"\w+", "....ABCD").span())  # Start and end together as a tuple

4
8
<re.Match object; span=(4, 8), match='ABCD'>
(4, 8)
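
Since the end index is exclusive, the span can be used directly to slice the original string. A small sketch; the second example string and the variable names are made up, and it also shows that `span` accepts a group number.

```python
import re

text = "....ABCD"
m = re.search(r"\w+", text)

# Slicing with the span gives back exactly the matched text
start, end = m.span()
sliced = text[start:end]

# span(n) returns the indices of the n-th captured group
m2 = re.search(r"(\d+)-(\d+)", "range 10-25")
first_group_span = m2.span(1)
second_group_span = m2.span(2)

print(sliced, first_group_span, second_group_span)
```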


## re.sub

In [190]:
print(re.sub.__doc__)

Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.  repl can be either a string or a callable;
    if a string, backslash escapes in it are processed.  If it is
    a callable, it's passed the Match object and must return
    a replacement string to be used.


In [205]:
make_it_twice = lambda m: str(int(m.group(0)) * 2 if int(m.group(0)) < 50 else int(m.group(0)))  # Doubles numbers below 50, leaves the others unchanged

In [206]:
re.sub(r"\t", "-", "AA\tAA")  # Replaces the tab with "-" (only the last expression's value is displayed below)
re.sub(r"\d+", make_it_twice, "John is 22 years old and Senorita is 20 years old whereas, Evan is the oldest with the age of 56 years")

'John is 44 years old and Senorita is 40 years old whereas, Evan is the oldest with the age of 56 years'
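
When the replacement is a string, backslash escapes such as `\1` refer back to captured groups. A brief sketch of that, plus the `count` argument and `re.subn`; the sample strings are invented for illustration.

```python
import re

# \1 and \2 in the replacement refer to the captured groups, here swapped
swapped = re.sub(r"(\w+)\t(\w+)", r"\2 \1", "HELLO\tWORLD")

# count limits how many matches are replaced
first_only = re.sub(r"\d+", "#", "a1 b22 c333", count=1)

# re.subn also reports how many replacements were made
result, n = re.subn(r"\d+", "#", "a1 b22 c333")

print(swapped)
print(first_only)
print(result, n)
```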

## Reference 
<a href="http://www.pyregex.com/">http://www.pyregex.com/</a><br>
<a href="https://www.rexegg.com/regex-quickstart.html">https://www.rexegg.com/regex-quickstart.html</a><br>
<a href="https://docs.python.org/3/library/re.html">https://docs.python.org/3/library/re.html</a>