# Python Regular Expressions

A regular expression (regex) is a sequence of characters that defines a pattern to search for in a string.
Python's built-in re module provides regex support. The re module raises an re.error exception if a pattern is invalid
and cannot be compiled.

The re module must be imported before any of its functions can be used.
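
As a small illustration of the import and the exception mentioned above, the sketch below compiles a deliberately broken pattern. The pattern and the `compiled_ok` flag are made up for this example.

```python
import re

# An unbalanced "[" is not a valid pattern, so compiling it raises re.error.
try:
    re.compile("[a-z")
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print("Invalid pattern:", exc)

print(compiled_ok)  # False
```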

## Regex Functions
The re module provides several functions for querying an input string. The most commonly used are:
    re.match()
    re.search()
    re.findall()
    re.split()
    re.sub()
    re.compile()
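
A minimal sketch of how three of these functions differ, using a made-up sample sentence: match() only looks at the start of the string, search() finds the first match anywhere, and findall() collects every match.

```python
import re

text = "the cat sat on the mat"

# match() only succeeds if the pattern matches at the very start of the string.
at_start = re.match(r"cat", text)

# search() scans the string and returns the first match found anywhere in it.
anywhere = re.search(r"cat", text)

# findall() returns every non-overlapping match as a list of strings.
all_at_words = re.findall(r"[a-z]at", text)

print(at_start)         # None
print(anywhere.span())  # (4, 7)
print(all_at_words)     # ['cat', 'sat', 'mat']
```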



### Matching Characters / Metacharacters

Most characters in a pattern simply match themselves. The exceptions are the 14 metacharacters, listed here in full:

. ^ $ * + ? { } [ ] \ | ( )

Each of them is described below.

### []  
Represents a character class, which is a set of characters that you wish to match. Characters can be listed individually, or a range of characters can be indicated by giving two characters and separating them with a '-'.

For example, [abc] will match any of the characters a, b, or c; this is the same as [a-c]. Note that this will match only lowercase characters.
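
The equivalence of the listed and the ranged form can be checked with a short sketch like the one below. The sample text is invented for the example.

```python
import re

text = "A cab, a Bag"

# [abc] lists the characters one by one; [a-c] gives the same set as a range.
listed = re.findall(r"[abc]", text)
ranged = re.findall(r"[a-c]", text)
print(listed)            # ['c', 'a', 'b', 'a', 'a']
print(listed == ranged)  # True

# Ranges can be combined to cover upper case as well.
both_cases = re.findall(r"[a-cA-C]", text)
print(both_cases)        # ['A', 'c', 'a', 'b', 'a', 'B', 'a']
```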

### ^
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

Inside a character class, the caret has a different role: you can match the characters not listed within the class by "complementing the set". This is indicated by including a '^' as the first character of the class.

For example, [^5] will match any character except '5'. If the caret appears elsewhere in a character class, it does not have special meaning. For example, [5^] will match either a '5' or a '^'.
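
The two roles of the caret can be seen side by side in a sketch like this one, which uses invented sample strings: as an anchor (with and without MULTILINE) and inside a character class.

```python
import re

text = "first line\nsecond line"

# As an anchor, ^ matches only at the start of the string by default...
default_mode = re.findall(r"^\w+", text)
# ...and also right after each newline in MULTILINE mode.
multiline_mode = re.findall(r"^\w+", text, re.MULTILINE)
print(default_mode)    # ['first']
print(multiline_mode)  # ['first', 'second']

# As the first character of a class, ^ complements the set.
not_five = re.findall(r"[^5]", "1552")
# Anywhere else in the class, ^ is an ordinary character.
five_or_caret = re.findall(r"[5^]", "5^2")
print(not_five)       # ['1', '2']
print(five_or_caret)  # ['5', '^']
```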

### \   Drops the special meaning of the character following it
The backslash can be followed by various characters to signal special sequences. It is also used to escape the metacharacters so that you can match them literally in patterns; for example, to match a [ or a \, precede it with a backslash to remove its special meaning: \[ or \\.

Some of the most common special sequences are:

    \d
    Matches any decimal digit; this is equivalent to the class [0-9].

    \D
    Matches any non-digit character; this is equivalent to the class [^0-9].

    \s
    Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

    \S
    Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

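    \w
    Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].

    \W
    Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_].

    For str patterns in Python 3, these equivalences are exact only when the re.ASCII flag is set; otherwise \d, \s and \w also match the corresponding Unicode characters.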
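
A short sketch of escaping and of the shorthand classes above. The sample path and sentences are invented, and raw strings (the r prefix) are used so that Python itself leaves the backslashes alone.

```python
import re

# A raw string keeps the backslashes in this sample path as literal characters.
path = r"C:\temp\[draft].txt"

# \[ and \] match literal brackets; \\ matches one literal backslash.
brackets = re.findall(r"\[\w+\]", path)
backslashes = re.findall(r"\\", path)
print(brackets)          # ['[draft]']
print(len(backslashes))  # 2

# \d+ picks out runs of digits, \S+ picks out runs of non-whitespace.
digits = re.findall(r"\d+", "room 101, floor 7")
words = re.findall(r"\S+", "a b\tc\n")
print(digits)  # ['101', '7']
print(words)   # ['a', 'b', 'c']
```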

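### $   Matches the end of the string.
More precisely, matches at the end of the string or just before a newline at the end of the string, and in MULTILINE mode also matches before each newline.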
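
A minimal sketch of the end anchor with made-up text: by default $ anchors at the end of the string (or just before a final newline), and with MULTILINE it also anchors at the end of every line.

```python
import re

text = "line one\nline two"

last_word = re.findall(r"\w+$", text)
last_word_per_line = re.findall(r"\w+$", text, re.MULTILINE)
print(last_word)           # ['two']
print(last_word_per_line)  # ['one', 'two']

# A single trailing newline does not prevent $ from matching.
trailing = re.search(r"two$", "line two\n")
print(trailing is not None)  # True
```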


### .
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.
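
The effect of the DOTALL flag can be sketched as follows, using a made-up two-line string.

```python
import re

text = "ab\ncd"

# By default the dot refuses to match the newline between 'b' and 'c'.
without_flag = re.findall(r"b.c", text)
# With DOTALL the dot matches the newline too.
with_flag = re.findall(r"b.c", text, re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['b\nc']
```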

### ?   Matches zero or one occurrence.
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

#### *?, +?, ??
The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.
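
The greedy and non-greedy behaviour described above can be reproduced with a sketch like this, using the same sample string as in the text.

```python
import re

text = "<a> b <c>"

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r"<.*>", text).group()
# Non-greedy: .*? stops at the first '>' it can.
lazy = re.match(r"<.*?>", text).group()

print(greedy)  # <a> b <c>
print(lazy)    # <a>

# With findall(), the non-greedy form picks out each tag separately.
tags = re.findall(r"<.*?>", text)
print(tags)    # ['<a>', '<c>']
```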
    

### |   Means OR (matches either the expression before it or the expression after it)
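
A small sketch of alternation with invented sample strings. Because | applies to the whole expression on each side, a group is used in the second pattern to limit its reach.

```python
import re

# Either whole word may match.
pets = re.findall(r"cat|dog", "cat, dog, cow")
print(pets)  # ['cat', 'dog']

# (?:...) is a non-capturing group; it restricts the alternation to one letter.
spellings = re.findall(r"gr(?:a|e)y", "gray grey groy")
print(spellings)  # ['gray', 'grey']
```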

### *   Any number of occurrences (including 0 occurrences)
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

### +   One or more occurrences
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.
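
The three basic repetition operators can be compared on one made-up string, as in this sketch.

```python
import re

text = "a ab abb"

zero_or_more = re.findall(r"ab*", text)  # 'a' followed by any number of 'b'
one_or_more = re.findall(r"ab+", text)   # 'a' followed by at least one 'b'
zero_or_one = re.findall(r"ab?", text)   # 'a' followed by at most one 'b'

print(zero_or_more)  # ['a', 'ab', 'abb']
print(one_or_more)   # ['ab', 'abb']
print(zero_or_one)   # ['a', 'ab', 'ab']
```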

### {}  Indicates the number of occurrences of the preceding RE to match.

#### {m}
Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

#### {m,n}
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match 'aaaab' or a thousand 'a' characters followed by a 'b', but not 'aaab'. The comma may not be omitted or the modifier would be confused with the previously described form.

#### {m,n}?
Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.
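
The counted-repetition examples from the text can be checked with a sketch like this one.

```python
import re

# Greedy {3,5} takes as many 'a' characters as it is allowed to.
greedy = re.match(r"a{3,5}", "aaaaaa").group()
# Non-greedy {3,5}? takes as few as it is allowed to.
lazy = re.match(r"a{3,5}?", "aaaaaa").group()
print(greedy)  # aaaaa
print(lazy)    # aaa

# {4,} means "four or more", so three 'a' characters are not enough.
too_short = re.fullmatch(r"a{4,}b", "aaab")
long_enough = re.fullmatch(r"a{4,}b", "aaaab")
print(too_short)                # None
print(long_enough is not None)  # True
```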


### ()  Encloses a group of REs
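
A minimal sketch of what grouping gives you, with an invented sample string: each pair of parentheses captures the text it matched, and a quantifier placed after a group repeats the whole group.

```python
import re

m = re.search(r"(\d{4})-(\d{2})", "released 2016-01")

print(m.group(0))  # 2016-01  (the whole match)
print(m.group(1))  # 2016     (first group)
print(m.groups())  # ('2016', '01')

# The + applies to the group (ab) as a unit, not just to 'b'.
repeated = re.fullmatch(r"(ab)+", "ababab")
print(repeated is not None)  # True
```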
    
    
References:
    1. https://www.geeksforgeeks.org/regular-expression-python-examples-set-1/
    2. https://docs.python.org/3/howto/regex.html#modifying-strings
    3. https://docs.python.org/3/library/re.html#re-syntax


In [1]:
# Compiling Regular Expressions 

import re
p = re.compile('ab*')
p


re.compile(r'ab*', re.UNICODE)

In [2]:
p = re.compile('ab*', re.IGNORECASE)
p

re.compile(r'ab*', re.IGNORECASE|re.UNICODE)


In [8]:
import re

p = re.compile('[a-z]+')
a = p.match("KashyAp")
print (a)


None


In [10]:
m = p.match('tempo')
print (m)
# Return the string matched by the RE
print(m.group()) 
# Return the starting position of the match
print (m.start())
# Return the ending position of the match
print (m.end())
# Return a tuple containing the (start, end) positions of the match
print (m.span())

<re.Match object; span=(0, 5), match='tempo'>
tempo
0
5
(0, 5)


In [11]:
# The re module is imported with an ordinary import statement.
import re

# compile() builds a pattern from the character class [a-e],
# which is equivalent to [abcde].
# The class matches any single character 'a', 'b', 'c', 'd' or 'e'.
p = re.compile('[a-e]')

# findall() searches the string for the pattern and returns all matches as a list
print(p.findall("Aye, said Mr. Gibenson Stark"))

['e', 'a', 'd', 'b', 'e', 'a']


In [14]:
'''
The findall() function
This function returns a list of all non-overlapping matches of a pattern within the string.

It returns the matches in the order they are found. If there are no matches, an empty list is returned.
'''

import re

# Named text rather than str, so the built-in str type is not shadowed
text = "How are you. How is everything"
matches = re.findall("How", text)
print(matches)

['How', 'How']


In [37]:
import re
a = re.compile('[a-k]')
print(a.findall("Kkashyap"))

# \d is equivalent to [0-9]. The r prefix (raw string) keeps the backslash literal.
p = re.compile(r'\d')
print(p.findall("I went to him at 11 A.M. on 4th July 1886"))

# \d+ matches a run of one or more digits
p = re.compile(r'\d+')
print(p.findall("I went to him at 11 A.M. on 4th July 1886 123456"))


['k', 'a', 'h', 'a']
['1', '1', '4', '1', '8', '8', '6']
['11', '4', '1886', '123456']


In [36]:
# \w is equivalent to [a-zA-Z0-9_].
p = re.compile(r'\w')
print(p.findall("He said * in some_lang."))

# \w+ matches a run of one or more alphanumeric characters.
p = re.compile(r'\w+')
print(p.findall("I went to him at 11 A.M., he said *** in some_language."))

# \W matches a single non-alphanumeric character.
p = re.compile(r'\W')
print(p.findall("he said *** in some_language."))

# \W+ matches a run of one or more non-alphanumeric characters.
p = re.compile(r'\W+')
print(p.findall("he said *** in some_language."))

# '*' matches zero or more occurrences of the preceding character.
p = re.compile('ab*')
print()
print(p.findall("ababbaabbb"))

['H', 'e', 's', 'a', 'i', 'd', 'i', 'n', 's', 'o', 'm', 'e', '_', 'l', 'a', 'n', 'g']
['I', 'went', 'to', 'him', 'at', '11', 'A', 'M', 'he', 'said', 'in', 'some_language']
[' ', ' ', '*', '*', '*', ' ', ' ', '.']
[' ', ' *** ', ' ', '.']

['ab', 'abb', 'a', 'abbb']


In [29]:
# Function split()
'''
Splits a string at each occurrence of a pattern and returns the pieces as a list.
If maxsplit is non-zero, at most maxsplit splits are made and the remainder of the
string is returned as the final element of the list.

re.split(pattern, string, maxsplit=0, flags=0)
'''

import re
from re import split

# r'\W+' matches one or more non-alphanumeric characters
# Upon finding ',' or whitespace ' ', split() splits the string at that point
print(split(r'\W+', 'Words, words , Words Word'))
print(split(r'\W+', "Word's words Words"))

# Here ':', ' ' and ',' are not alphanumeric, so splitting occurs at each of them
print(split(r'\W+', 'On 12th Jan 2016, at 11:02 AM'))

# r'\d+' matches one or more digits
# Splitting occurs at '12', '2016', '11', '02' only
print(split(r'\d+', 'On 12th Jan 2016, at 11:02 AM'))

# Splitting occurs only once, at '12', so the returned list has length 2
print(re.split(r'\d+', 'On 12th Jan 2016, at 11:02 AM', maxsplit=1))

# With flags=re.IGNORECASE, [a-f] also matches 'A' to 'F', so 'Boy' and 'boy' are split the same way
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here', flags=re.IGNORECASE))
print(re.split('[a-f]+', 'Aey, Boy oh boy, come here'))


['Words', 'words', 'Words', 'Word']
['Word', 's', 'words', 'Words']
['On', '12th', 'Jan', '2016', 'at', '11', '02', 'AM']
['On ', 'th Jan ', ', at ', ':', ' AM']
['On ', 'th Jan 2016, at 11:02 AM']
['', 'y, ', 'oy oh ', 'oy, ', 'om', ' h', 'r', '']
['A', 'y, Boy oh ', 'oy, ', 'om', ' h', 'r', '']


In [30]:
'''
Function sub()
Syntax:
 re.sub(pattern, repl, string, count=0, flags=0)


The 'sub' in the function name stands for substitute. The pattern is searched for in the given string (3rd parameter), and each match is replaced by repl (2nd parameter). count limits the number of replacements; the default of 0 replaces every match.
'''

import re

# The pattern 'ub' matches in "Subject" and, when case is ignored, in "Uber".
# Because the IGNORECASE flag is set, 'ub' matches twice in the string.
# Each match is replaced by '~*': 'ub' in "Subject" and 'Ub' in "Uber".
print(re.sub('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE))

# With case sensitivity, 'Ub' in "Uber" is not replaced.
print(re.sub('ub', '~*', 'Subject has Uber booked already'))

# As count is 1, at most one replacement is made.
print(re.sub('ub', '~*', 'Subject has Uber booked already', count=1, flags=re.IGNORECASE))

# The 'r' prefix makes the pattern a raw string; \s matches the whitespace character on each side of 'AND'.
print(re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE))


S~*ject has ~*er booked already
S~*ject has Uber booked already
S~*ject has Uber booked already
Baked Beans & Spam


In [32]:
'''
Function subn()
Syntax:
 re.subn(pattern, repl, string, count=0, flags=0)


subn() is similar to sub() in all ways except its return value.
It returns a tuple (new_string, number_of_replacements) rather than just the string.
'''

import re
print(re.subn('ub', '~*', 'Subject has Uber booked already'))
t = re.subn('ub', '~*', 'Subject has Uber booked already', flags=re.IGNORECASE)
print(t)
print(len(t))

# The first element is the same string that sub() would have returned
print(t[0])


('S~*ject has Uber booked already', 1)
('S~*ject has ~*er booked already', 2)
2
S~*ject has ~*er booked already


In [31]:
'''
Function escape()
Syntax:
re.escape(string)


Returns string with its special characters backslash-escaped. This is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it. Since Python 3.7, only characters that can have special meaning in a regular expression are escaped, so a character such as ',' is left alone.
'''

import re

# escape() returns the string with a backslash '\' before every special character
# In the 1st case only the space ' ' is escaped
# In the 2nd case the space ' ', '[', '-', ']', the tab and the caret '^' are escaped; the comma ',' is not
print(re.escape("This is Awseome even 1 AM")) 
print(re.escape("I Asked what is this [a-9], he said \t ^WoW"))

This\ is\ Awseome\ even\ 1\ AM
I\ Asked\ what\ is\ this\ \[a\-9\],\ he\ said\ \	\ \^WoW


In [None]:
'''
The match object

The match object contains information about the search and its result. If no match is found,
None is returned instead.
'''

import re

# Named text rather than str, so the built-in str type is not shadowed
text = "How are you. How is everything"
matches = re.search("How", text)

print(type(matches))
print(matches)  # matches is the Match object returned by search()
'''
Match object methods and attributes
The following are commonly used with the Match object.

span(): Returns a tuple containing the start and end positions of the match.
string: An attribute (not a method) holding the string passed into the function.
group(): Returns the part of the string where the match was found.
'''

print(matches.span())
print(matches.group())
print(matches.string)

In [None]:
import re
pattern = re.compile(r"<(\d{4,5})>")

# Report every <NNNN> or <NNNNN> tag in test.txt along with its line number
with open('test.txt') as f:
    for i, line in enumerate(f):
        for match in pattern.finditer(line):
            print('Found on line %s: %s' % (i+1, match.group()))

In [None]:
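import re

filename = 'test.txt'  # assumed: the same file as in the previous cell

# Read the whole file, then collect every <NNNN> or <NNNNN> tag in one pass
with open(filename, 'r') as textfile:
    filetext = textfile.read()

# No trailing '?': an optional group would add an empty ('', '') result at every non-matching position.
# Each result is a tuple such as ('<1234>', '1234').
matches = re.findall(r"(<(\d{4,5})>)", filetext)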

In [None]:
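import re

filename = 'test.txt'  # assumed: the same file as in the cells above

# Same search, reading the file one line at a time
matches = []
reg = re.compile(r"(<(\d{4,5})>)")  # no trailing '?', which would add empty matches
with open(filename, 'r') as textfile:
    for line in textfile:
        matches += reg.findall(line)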