# Regular Expression Tutorial

### What is a Regular Expression?
   * A regular expression is a **special sequence of characters** that helps you match or find **other strings or sets of strings**.
   * The concept arose in the 1950s, when the American mathematician Stephen Cole Kleene formalized the description of a regular language.
   * It came into common use with Unix text-processing utilities.

### Why Regular Expressions?
   * Find.
   * Find and replace (substitution).
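
The two uses above can be sketched with Python's built-in `re` module. This is a minimal illustration, not the only way to do it; the sample string and the variable names (`found_at`, `replaced`) are made up for the example.

```python
import re

text = "reg-ex is interesting topic to study"

# Find: re.search returns a match object for the first occurrence, or None.
match = re.search(r"topic", text)
found_at = match.start() if match else -1

# Find and replace: re.sub returns a new string with every match replaced.
replaced = re.sub(r"reg-ex", "regex", text)

print(found_at)
print(replaced)
```

`re.sub` leaves the original string untouched, so the result has to be assigned to a name if it is needed later.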

### Where is it used?
   * Text processing tasks (string processing).
   * Data scraping (web scraping).
   * Parsing.
   * Syntax highlighting.
   * Internet search engines.

## Let's dive into Regex!

#### How to import regex in Python

In [17]:
import re 

text="reg-ex is interesting reg2reg-6 topic to study!!!"

print(re.findall("reg",text))

['reg', 'reg', 'reg']


# Regex special characters

2.1. Pattern:
-----------------
	 .    ->  matches any single character except a newline.

In [30]:
import re 

text="reg-ex is interesting topic to study!."

print(re.findall(".",text))

['r', 'e', 'g', '-', 'e', 'x', ' ', 'i', 's', ' ', 'i', 'n', 't', 'e', 'r', 'e', 's', 't', 'i', 'n', 'g', ' ', 't', 'o', 'p', 'i', 'c', ' ', 't', 'o', ' ', 's', 't', 'u', 'd', 'y', '!', '.']


In [20]:
print(re.findall("....",text))

['reg-', 'ex i', 's in', 'tere', 'stin', 'g to', 'pic ', 'to s', 'tudy']


In [31]:
print(re.findall(".{4}",text))


['reg-', 'ex i', 's in', 'tere', 'stin', 'g to', 'pic ', 'to s', 'tudy']


2.2. Metacharacters:
----------------------------

	^			-> starts with: matches at the start of the string (and after each '\n' only when re.MULTILINE is set)
	$			-> ends with: matches at the end of the string or just before a final '\n' (and before each '\n' only when re.MULTILINE is set)
	^....$		-> starts and ends with (the number of characters in between must be known; '.' does not match '\n')
	[]			-> a set of characters (matches any one character from the set)
	{}			-> an exact number, or a range, of occurrences
	|			-> either or
	()			-> capture and group (the enclosed pattern is matched as a unit)
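
As a rough sketch of how `^` and `$` behave with and without `re.MULTILINE`, the snippet below runs the same anchors over a three-line string. The string `lines` and the variable names are illustrative assumptions, not part of the tutorial's own cells.

```python
import re

lines = "SAP=101501\nSAP=101502\nSAP=101508"

# Default mode: ^ anchors only at the very start of the string.
first_only = re.findall(r"^SAP=[0-9]+", lines)

# Default mode: $ anchors only at the very end of the string
# (or just before a single trailing newline).
last_only = re.findall(r"SAP=[0-9]+$", lines)

# re.MULTILINE: ^ also matches right after each '\n',
# and $ also matches right before each '\n'.
every_line = re.findall(r"^SAP=[0-9]+$", lines, flags=re.MULTILINE)

print(first_only)
print(last_only)
print(every_line)
```

Without the flag only the first or last line is picked up; with it, each line is treated as having its own start and end.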

In [2]:
import re 
print(re.findall("^reg.{10}",text))

['reg-ex is int']


In [27]:
print(re.findall("$.{10}",text))

[]


In [35]:
print(re.findall(".{10}.$",text))

[' to study!.']


In [5]:
text ="""SAP=101501

SAP=101502

SAP=101508

SAP=101509
"""
print(re.findall("(S...{3,})",text))

['SAP=101501', 'SAP=101502', 'SAP=101508', 'SAP=101509']


In [153]:
text="bat ate cat  but cat won't eat bat at any cost!!!"
print(re.findall(".at",text))
print(re.findall("[a-z]+at[a-z]?",text))

['bat', ' at', 'cat', 'cat', 'eat', 'bat', ' at']
['bat', 'cat', 'cat', 'eat', 'bat']


In [149]:

print(re.findall("(.?at.?)",text))
print(re.findall("([a-z]+at[a-z]?)",text))

['bat ', 'ate', 'cat ', 'cat ', 'eat ', 'bat ', 'at ']
['bat', 'cat', 'cat', 'eat', 'bat']


In [24]:
text=" somebody@hcl.com, someone@wipro.com,somename@cts.com, name10@hcl.com"
# The dot is escaped so that it matches a literal '.' rather than any character.
print(re.findall(r"([a-z0-9]{2,}@hcl\.com)",text))

['somebody@hcl.com', 'name10@hcl.com']


In [26]:
text=" somebody@hcl.com, someone@wipro.com,somename@cts.com"
print(re.findall(r"(@[a-z0-9]{3,}\.com)",text))

['@hcl.com', '@wipro.com', '@cts.com']


In [29]:
text =" cat mat bat sat vat yat wat uat rat pat hat aat dat eat "
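# Inside [], ',' and '|' are literal members of the set, not separators,
# so these sets would also match ',at' and '|at'. [abcde] or [a-e] is enough.
print(re.findall("[a,b,c,d,e]at",text))
print(re.findall("[a|b|c|d|e]at",text))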

['cat', 'bat', 'aat', 'dat', 'eat']
['cat', 'bat', 'aat', 'dat', 'eat']
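
The two patterns above give the same result only because the text contains no `,at` or `|at`. A small sketch with a made-up string (`sample` is an assumption for illustration) shows that commas and pipes inside `[]` are matched literally, and that a range such as `[a-c]` avoids the problem:

```python
import re

sample = "cat ,at |at bat fat"

# ',' is a literal member of the set, so ',at' is matched too.
with_commas = re.findall(r"[a,b,c]at", sample)

# '|' is a literal member of the set, so '|at' is matched too.
with_pipes = re.findall(r"[a|b|c]at", sample)

# A plain range matches only the letters a, b and c.
with_range = re.findall(r"[a-c]at", sample)

print(with_commas)
print(with_pipes)
print(with_range)
```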



2.3. Quantifiers:
-----------------

	abc*		-> match string 'ab' followed by zero or more 'c'
	abc+		-> match string 'ab' followed by one or more 'c'
	abc?		-> match string 'ab' followed by zero or one 'c'
	abc{2}		-> match string 'ab' followed by two 'c'
	abc{2,}		-> match string 'ab' followed by at least two 'c'
	abc{2,5}	-> match string 'ab' followed by two to five 'c'
	a(bc)*		-> match string 'a' followed by zero or more copies of sequence 'bc'


In [66]:
text="abcabcabcccccabccabcabcabcbcbcabcabcabcabcababcbcacb"

In [37]:
print(re.findall("abc*",text))

['abc', 'abc', 'abccccc', 'abcc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'ab', 'abc']


In [38]:
print(re.findall("abc+",text))

['abc', 'abc', 'abccccc', 'abcc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc']


In [39]:
print(re.findall("abc?",text))

['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'ab', 'abc']


In [40]:
print(re.findall("abc{2}",text))

['abcc', 'abcc']


In [41]:
print(re.findall("abc{2,}",text))

['abccccc', 'abcc']


In [42]:
print(re.findall("abc{2,5}",text))

['abccccc', 'abcc']


In [43]:
print(re.findall("a(bc)*",text))

['bc', 'bc', 'bc', 'bc', 'bc', 'bc', 'bc', 'bc', 'bc', 'bc', 'bc', '', 'bc', '']
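
The output of `a(bc)*` lists only `'bc'` or `''` because `re.findall` returns the contents of the capturing group, not the whole match, when the pattern contains exactly one group. The sketch below, on a short made-up string, shows two common ways to get the full match instead: a non-capturing group `(?:...)`, or `re.finditer` with `group(0)`.

```python
import re

text = "abcbc ab a"

# One capturing group: findall returns what the group last captured
# ('' when the group matched zero times).
captured = re.findall(r"a(bc)*", text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"a(?:bc)*", text)

# finditer yields match objects; group(0) is always the whole match.
via_iter = [m.group(0) for m in re.finditer(r"a(bc)*", text)]

print(captured)
print(whole)
print(via_iter)
```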


2.4. Operator:
--------------

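	a(b|c)      -> 'a' followed by 'b' or 'c' (findall returns only the captured 'b' or 'c')
	a[b|c]		-> 'a' followed by 'b', '|' or 'c' ('|' is a literal inside a set; a[bc] is the usual form)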

In [68]:
print(re.findall("a(b|c)",text))

['b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'b', 'c']


In [69]:
print(re.findall("(a(b|c))",text))

[('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ab', 'b'), ('ac', 'c')]


In [71]:
print(re.findall("a[b|c]",text))

['ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ac']


In [74]:
print(re.findall("a[bc]",text))

['ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ab', 'ac']


In [77]:
print(re.findall("a[bc][bc]",text))

['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'acb']
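
A character set always picks exactly one character, while `|` chooses between whole alternatives. The following sketch, on a made-up string, contrasts the two: `a[bc][bc]` accepts any combination of `b` and `c` in each slot, whereas `abc|acb` accepts only those two exact sequences.

```python
import re

sample = "abc acb abb acc"

# Each set independently matches 'b' or 'c', so all four words match.
any_combo = re.findall(r"a[bc][bc]", sample)

# Alternation chooses between the complete sequences 'abc' and 'acb'.
exact_seqs = re.findall(r"abc|acb", sample)

print(any_combo)
print(exact_seqs)
```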


2.5. Special characters:
---------------------------------
	In general, a backslash '\' before a metacharacter removes its special meaning, and a backslash before certain letters forms one of the special sequences below.
	\A          -> matches only at the start of the string.
	\n          -> matches a newline.
	\t          -> matches a tab.
	\b          -> matches a word boundary (the zero-width position between a word character and a non-word character).
	\B          -> matches any position that is not a word boundary.
	\w          -> matches any alphanumeric/word character (i.e. [a-zA-Z0-9_]).
	\W          -> matches any non-word character (i.e. [^a-zA-Z0-9_]).
	\d          -> matches any digit (i.e. [0-9]).
	\D          -> matches any non-digit (i.e. [^0-9]).
	\s          -> matches whitespace characters (space, tab, newline, etc.).
	\S          -> matches non-whitespace characters.
	\.          -> matches a literal '.'.
	\Z          -> matches only at the end of the string.
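
A few of the sequences above can be tried out as follows. This is only a sketch: the sample strings and variable names are invented for the example, and raw strings (`r"..."`) are used so that Python does not interpret the backslashes before `re` sees them.

```python
import re

text = "Order 66 shipped\tto bay_7 at 10.30"

digits = re.findall(r"\d+", text)          # runs of digits
words = re.findall(r"\w+", text)           # runs of word characters (note the '_')
fields = re.split(r"\s+", text)            # split on spaces and the tab
decimal = re.findall(r"\d+\.\d+", text)    # '\.' is a literal dot

# \b keeps 'at' from matching inside 'cat' or 'bat'.
whole_word = re.findall(r"\bat\b", "cat at bat")

# \A and \Z anchor to the start and end of the whole string.
at_start = re.findall(r"\AOrder", text)
at_end = re.findall(r"\d+\Z", text)

print(digits, words, fields, decimal, whole_word, at_start, at_end, sep="\n")
```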

2.6. Lookarounds:
--------------------------
   * Lookarounds are zero-width assertions.
   * They check for a regex to the right (look-ahead) or to the left (look-behind) of the current position.
   * They don't consume any characters: matching of the regex that follows them (if any) starts at the same cursor position.
   * There are four types:
       - Positive look-ahead.
       - Negative look-ahead.
       - Positive look-behind.
       - Negative look-behind.
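
The four types can be sketched in Python as below, using the `re` syntax `(?=...)`, `(?!...)`, `(?<=...)` and `(?<!...)`. The sample string and variable names are assumptions made up for illustration. In each case only the digits are returned, because the lookaround itself consumes nothing.

```python
import re

text = "price: $40, cost: $25, discount: 15%, count: 30"

# Positive look-ahead (?=...): digits that ARE followed by '%'.
followed_by_pct = re.findall(r"\d+(?=%)", text)

# Negative look-ahead (?!...): whole numbers that are NOT followed by '%'.
not_followed_by_pct = re.findall(r"\b\d+\b(?!%)", text)

# Positive look-behind (?<=...): digits that ARE preceded by '$'.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative look-behind (?<!...): digits NOT preceded by '$' or another digit.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", text)

print(followed_by_pct)
print(not_followed_by_pct)
print(after_dollar)
print(not_after_dollar)
```

Python's `re` requires look-behind patterns to have a fixed width, which is why the look-behinds here check a single character.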
