**Regular expressions**

A regular expression (RegEx) is a sequence of characters that defines a search pattern, used to find a string or set of strings.

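Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves.
Some characters, like '|' or '(', are special: they either stand for classes of ordinary characters or affect how the regular expressions around them are interpreted.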

**The special characters are:**

* **.**
(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

* **^** 
(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

* **$**
(Dollar.) Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline.

* **\***  
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

* **+**  
Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

* **?**  
Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

* **{m}**     
Specifies that exactly m copies of the previous RE should be matched. For example, a{6} will match exactly six 'a' characters, but not five.

* **[]**     
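Used to indicate a set of characters.
e.g. [amk] will match 'a', 'm', or 'k'.
Ranges of characters can be indicated by giving two characters separated by '-', e.g. [0-5][0-9] will match all two-digit numbers from 00 to 59.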

* **\\**  
(Backslash.) Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence.

* **|**  
A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.
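
The special characters above can be tried out side by side. The following is a minimal sketch, not a complete test of each operator: the sample patterns and strings are made up for illustration, and `re.fullmatch()` is used so that a result is `True` only when the whole string matches.

```python
import re

# Each entry is (pattern, text); fullmatch succeeds only if the whole text matches.
checks = [
    (r"ab*", "abbb"),     # * : zero or more 'b'
    (r"ab+", "a"),        # + : needs at least one 'b', so this fails
    (r"ab?", "ab"),       # ? : zero or one 'b'
    (r"a{3}", "aaa"),     # {m} : exactly three 'a'
    (r"[amk]", "m"),      # [] : any one character from the set
    (r"cat|dog", "dog"),  # | : either alternative
    (r"a\*", "a*"),       # \ : escapes '*', so it is matched literally
    (r"a.c", "a\nc"),     # . : does not match a newline by default, so this fails
]

results = [bool(re.fullmatch(p, t)) for p, t in checks]
for (p, t), ok in zip(checks, results):
    print(f"{p!r} against {t!r}: {ok}")
```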



In [None]:
import re

In [None]:
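pattern='[abc]def'
text='abcdef'
m=re.search(pattern,text)
print(m)   # 'a' and 'b' are skipped; the match starts at 'c'

<re.Match object; span=(2, 6), match='cdef'>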


In [None]:
string='hello nitesh here'
print(re.search('\n',string))

None


In [None]:
re.A

re.ASCII

In [None]:
#Perform case-insensitive matching; expressions like [A-Z] will also match lowercase letters.
re.IGNORECASE

re.IGNORECASE
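
Flags change how a pattern is interpreted. The sketch below is illustrative only (the sample text is an assumption) and shows the effect of `re.MULTILINE`, `re.IGNORECASE`, and `re.DOTALL`, the flags mentioned in the descriptions of `^`, `.`, and case-insensitive matching above.

```python
import re

text = "first line\nSecond line"

# Without flags, ^ anchors only at the very start of the string.
plain = re.findall(r"^\w+", text)
# re.MULTILINE lets ^ also match immediately after each newline.
multi = re.findall(r"^\w+", text, flags=re.MULTILINE)
# re.IGNORECASE makes [a-z] match uppercase letters as well.
nocase = re.findall(r"[a-z]+ line", text, flags=re.IGNORECASE)
# re.DOTALL lets . match the newline, so the match can span both lines.
dotall = re.search(r"first.*Second", text, flags=re.DOTALL)

print(plain)
print(multi)
print(nocase)
print(repr(dotall.group()))
```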

#### **Function**
**re.compile(pattern, flags=0)**  

* Compile a regular expression pattern into a regular expression object, which can be used for matching using its match(), search() and other methods.

* Using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.



In [None]:
prog=re.compile('[abc]+')
result=prog.match('abc nitesh pandey')
print(result)

<re.Match object; span=(0, 3), match='abc'>


In [None]:
#Equivalent to above code
#result = re.match(pattern, string)
result = re.match('[abc]+', 'abc nitesh pandey')
print(result)


<re.Match object; span=(0, 3), match='abc'>


**re.search(pattern, string, flags=0)**

* Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding match object. 

* Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
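
The difference between a zero-length match and no match at all is easy to miss. As a small sketch (the patterns and the sample string are chosen only for illustration): a pattern that can match nothing still succeeds, while a pattern that needs at least one character returns `None`.

```python
import re

# 'x*' can match zero characters, so search succeeds with an empty match at index 0.
empty = re.search(r"x*", "abc")
# 'x+' needs at least one 'x', so no position matches and the result is None.
missing = re.search(r"x+", "abc")

print(empty.span(), repr(empty.group()))
print(missing)
```

Because a zero-length match is still a match object, a test such as `if re.search(...)` treats it as a success.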

**re.match(pattern, string, flags=0)**
* If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern.

**Python offers different primitive operations based on regular expressions:**

* **re.match()** checks for a match only at the beginning of the string and returns None if the string does not match there.
* **re.search()** checks for a match anywhere in the string (this is what Perl does by default).
* **re.fullmatch()** checks whether the entire string matches.

In [25]:
pattern=re.compile('d')
print(pattern.search('delete dog donkey')) # Match at index 0
print(pattern.search('Man gd donkey')) # Match at index 5

<re.Match object; span=(0, 1), match='d'>
<re.Match object; span=(5, 6), match='d'>


In [27]:
pattern=re.compile('d')
print(pattern.match('And Was Or do')) # No match as "d" is not at the start of given string hence return None
print(pattern.match('do abcd'))       # Match at index 0

None
<re.Match object; span=(0, 1), match='d'>


The Match object has properties and methods used to retrieve information about the search, and the result:

* **span()** returns a tuple containing the start and end positions of the match.
* **string** returns the string passed into the function.
* **group()** returns the part of the string where there was a match.


In [37]:
pattern=re.compile('d')
res=pattern.match('do abcd')   # compile and match once, then inspect the result
print(res.span())
print(res.string)
print(res.group())

(0, 1)
do abcd
d


In [40]:
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Nitesh Pandey")
print(m.group('first_name'))
print(m.group('last_name'))

Nitesh
Pandey


In [43]:
#Named groups can also be referred to by their index:
m.group(2)

'Pandey'

In [56]:
#Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern.
#Match.groups(default=None)
res=re.match(r'(\d+)\.(\d+)',"241.34")
res.groups()


('241', '34')

In [33]:
pattern=re.compile('d[on]')
print(pattern.fullmatch('don'))              # No match: 'd[on]' covers only two characters, not the whole string.
print(pattern.fullmatch('do nothing',0,2))   # Matches within given limits.

None
<re.Match object; span=(0, 2), match='do'>


**re.split(pattern, string, maxsplit=0, flags=0)**

* Split string by the occurrences of pattern.
* If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

In [None]:
print(re.split(r'\W+', 'Nit Sam Tom'))

['Nit', 'Sam', 'Tom']


In [61]:
text='''Good Morning 
have a nice day
have a great future ahead '''

We convert the string into a list, with each nonempty line as its own entry, using the **re.split(pattern,string)** function:

In [62]:
re.split('\n+',text)

['Good Morning ', 'have a nice day', 'have a great future ahead ']
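
The `maxsplit` parameter in the signature of `re.split()` limits how many splits are made. A minimal sketch, using a made-up sentence, of how the unsplit remainder ends up as the last list element:

```python
import re

sentence = "Nit, Sam; Tom and Ann"

# Split on every run of non-word characters.
all_parts = re.split(r"\W+", sentence)
# With maxsplit=1 only the first split happens; the rest stays in one piece.
limited = re.split(r"\W+", sentence, maxsplit=1)

print(all_parts)
print(limited)
```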

#### **Special sequences**
A special sequence is a backslash followed by a character, with a special meaning:

* **\A**  
Matches only at the start of the string.

* **\b**  
Matches the empty string, but only at the beginning or end of a word. Write the pattern as a raw string (r'...'), because in an ordinary string literal '\b' is the backspace character.

* **\B**  
Matches the empty string, but only when it is NOT at the beginning or end of a word.

* **\d**  
Matches any decimal digit (0-9 for ASCII text).

* **\D**  
Matches any character that is NOT a decimal digit.

* **\s**  
Matches any whitespace character (space, tab, newline, and so on).

* **\S**  
Matches any character that is NOT a whitespace character.

* **\w**  
Matches any word character (letters a-z and A-Z, digits 0-9, and the underscore _ character).

* **\W**  
Matches any character that is NOT a word character.

* **\Z**  
Matches only at the end of the string.
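
To see several of these sequences at work, here is a short sketch. The sample text is made up for illustration, and every pattern is written as a raw string so that the backslashes reach the regex engine unchanged.

```python
import re

text = "Order 66 shipped to_unit 7"

digits = re.findall(r"\d+", text)       # runs of digits
words = re.findall(r"\w+", text)        # runs of word characters (underscore included)
starts = re.findall(r"\bs\w*", text)    # words that begin with 's'
inner = re.findall(r"\Bi\w*", text)     # 'i' that is not at the start of a word
at_start = bool(re.search(r"\AOrder", text))   # 'Order' at the very start
at_end = bool(re.search(r"7\Z", text))         # '7' at the very end

print(digits, words, starts, inner, at_start, at_end)
```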

#### **Set**
A **set** is a set of characters inside a pair of square brackets [] with a special meaning:

* **[arn]**   
Returns a match where one of the specified characters (a, r, or n) is present	

* **[a-n]**	
Returns a match for any lower case character, alphabetically between a and n

* **[^arn]**	
Returns a match for any character EXCEPT a, r, and n

* **[0123]**	
Returns a match where any of the specified digits (0, 1, 2, or 3) are present	

* **[0-9]**	
Returns a match for any digit between 0 and 9	

* **[0-5][0-9]**	Returns a match for any two-digit number from 00 to 59

* **[a-zA-Z]**	Returns a match for any character alphabetically between a and z, lower case OR upper case.

* **[+]**  
In sets, +, *, ., |, (), $, and {} have no special meaning, so [+] means: return a match for any + character in the string.
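
The set forms listed above can be checked with `re.findall()`. This is a small sketch with made-up sample strings, not an exhaustive demonstration:

```python
import re

text = "Room 42+7, seat n9"

a_to_n = re.findall(r"[a-n]", text)                    # lowercase letters from a to n
plus_signs = re.findall(r"[+]", text)                  # '+' is literal inside a set
not_arn = re.findall(r"[^arn]", "rain")                # anything except a, r, n
two_digit = re.findall(r"[0-5][0-9]", "07 59 60 99")   # two-digit numbers 00 to 59

print(a_to_n, plus_signs, not_arn, two_digit)
```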

In [5]:
import re
print(re.findall(r'\bf[a-z]+','You are from which city??'))
print(re.findall(r'\bf[a-z]+','Which are your favourite Film',re.IGNORECASE))

['from']
['favourite', 'Film']


In [19]:
# Replace every occurrence of the pattern in the string
# re.sub(pattern, repl, string)
re.sub('we','I','we am happy')


'I am happy'

In [18]:
# Same as re.sub, but also return the number of substitutions made
# re.subn(pattern, repl, string)
re.subn('we','I','we am happy')

('I am happy', 1)
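
Both functions also accept an optional `count` argument that limits the number of replacements. The sketch below uses a made-up sentence and adds `\b` word boundaries (an illustrative choice) so that 'we' is not replaced inside longer words.

```python
import re

text = "we are here and we are happy"

# count=1 replaces only the first occurrence.
first_only = re.sub(r"\bwe\b", "they", text, count=1)
# subn returns the new string together with the number of replacements made.
replaced, n = re.subn(r"\bwe\b", "they", text)

print(first_only)
print(replaced, n)
```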

In [17]:
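#Escape special characters in pattern.
#Use a raw string so that '\a' stays a backslash followed by 'a' instead of becoming the bell character.
re.escape(r'www.cracklogic.com\about')

'www\\.cracklogic\\.com\\\\about'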
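
A typical use of `re.escape()` is searching for a literal string that happens to contain special characters. A minimal sketch, with a made-up literal and sentence:

```python
import re

literal = "1+1=2 (really?)"
sentence = "They say 1+1=2 (really?) every day"

# Escaping makes '+', '(', ')' and '?' lose their special meaning.
found = re.search(re.escape(literal), sentence)
# Without escaping, the same text is read as regex syntax and does not match here.
unescaped = re.search(literal, sentence)

print(found.group())
print(unescaped)
```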

In [20]:
#Clear the regular expression cache.
re.purge()


In [63]:
# \d matches a single digit ([0-9] for ASCII text).
p = re.compile(r'\d')
print(p.findall("26th January 1950"))

# \d+ matches a run of one or more consecutive digits.
p = re.compile(r'\d+')
print(p.findall("26th January 1950"))

['2', '6', '1', '9', '5', '0']
['26', '1950']
