## Regular Expressions in Python

A regular expression is a sequence of characters that defines a search pattern. In simple terms, it is a tool for matching, finding, and replacing patterns in text.

In [1]:
import re

__Main uses of regular expression__

* Finding a pattern in a string
* Replacing part of a string
* Searching a string for every occurrence of a pattern
* Splitting a string into substrings


__Special sequences and metacharacters__
```
\w --> Matches any word character: letters, digits, and underscore ([a-zA-Z0-9_] for ASCII text)
\W --> Matches any non-word character
\d --> Matches any digit [0-9]
\D --> Matches any non-digit character
\s --> Matches any single whitespace character (space, tab, newline, etc.)
\S --> Matches any single non-whitespace character
\t --> Matches a tab
\n --> Matches a newline
\r --> Matches a carriage return
. --> Matches any character except \n
() --> Groups part of a pattern and captures the matched text
a|b --> Matches either a or b
^ --> Matches at the start of the string
$ --> Matches at the end of the string
{m} --> Matches exactly m repetitions of the preceding pattern
{m,} --> Matches m or more repetitions of the preceding pattern
{m,n} --> Matches between m and n repetitions of the preceding pattern
? --> Matches zero or one occurrence of the preceding pattern
+ --> Matches one or more occurrences of the preceding pattern
```
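
As a rough illustration of the character classes and quantifiers listed above, the sketch below applies a few of them to a made-up sample string. The string and the variable names are assumptions chosen only for this example.

```python
import re

# Hypothetical sample text, used only for illustration
sample = "Order 42 shipped to Bob_Smith on 2021-03-15\n"

digits = re.findall(r'\d+', sample)        # runs of one or more digits
words = re.findall(r'\w+', sample)         # runs of letters, digits, and underscores
four_digit = re.findall(r'\d{4}', sample)  # exactly four digits in a row
starts_with_order = bool(re.search(r'^Order', sample))  # ^ anchors at the start
either = re.findall(r'Bob|Alice', sample)  # a|b alternation

print(digits)
print(words)
print(four_digit)
print(starts_with_order)
print(either)
```

Note that `\w+` keeps `Bob_Smith` together because the underscore counts as a word character.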

```
sub ----> Replaces every non-overlapping match of the pattern with a replacement string
subn ----> Same as sub(), but returns a tuple of the new string and the number of replacements
start ----> Returns the starting index of a match
end ----> Returns the ending index of a match (one past the last matched character)
span ----> Returns a (start, end) tuple for a match
search ----> Scans the whole string and returns the first match
match ----> Matches only at the beginning of the string
findall ----> Returns a list of all non-overlapping matches in the string
compile ----> Compiles a pattern into a reusable pattern object
```

### 1. re.search()
* `re.search()` scans the entire string and returns a match object, or None if nothing matches.
* If there is more than one match, it returns only the first occurrence of the pattern.

In [2]:
s=("Hi i am python and my no is 7867465789")
z=re.search(r'\d{10}',s)  # raw string so that \d is not treated as a string escape
print(z)
print(z.group(0))
print(z.start())
print(z.end())

<re.Match object; span=(28, 38), match='7867465789'>
7867465789
28
38
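
Because `re.search()` returns None when nothing matches, calling `.group()` on the result directly can raise an AttributeError. A minimal sketch of guarding against that, using a hypothetical helper named `first_phone_number` (not part of the `re` module):

```python
import re

def first_phone_number(text):
    """Return the first run of 10 digits in text, or None if there is none.

    Hypothetical helper for illustration only.
    """
    m = re.search(r'\d{10}', text)
    return m.group(0) if m else None

found = first_phone_number("Hi i am python and my no is 7867465789")
missing = first_phone_number("no digits here")
print(found)
print(missing)
```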


In [3]:
re.search(r'\d{13}', '9876543210999999999999')

<re.Match object; span=(0, 13), match='9876543210999'>

In [4]:
s = ("hi Welcome to python course")
g = re.search('welcome',s,re.I|re.M)
print(g)

<re.Match object; span=(3, 10), match='Welcome'>


### 2. re.match()
* `re.match()` tries to match the pattern only at the beginning of the string. If the string starts with the pattern, it returns a match object; otherwise it returns None, even if the pattern occurs later in the string.

In [5]:
s = ("hi.hello Welcome, my name is python")
y = re.match('hi',s)
print(y)
print(y.group(0))

<re.Match object; span=(0, 2), match='hi'>
hi


In [6]:
d = re.match('hello',s)
print(d)

None
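
To make the difference concrete, the sketch below runs `re.match()`, `re.search()`, and `re.fullmatch()` against the same string. The variable names are chosen only for this example.

```python
import re

s = "hi.hello Welcome, my name is python"

at_start = re.match(r'hello', s)   # None: 'hello' is not at position 0
anywhere = re.search(r'hello', s)  # found later in the string
whole = re.fullmatch(r'hi', s)     # None: the pattern must cover the entire string

print(at_start)
print(anywhere.span())
print(whole)
```

`re.fullmatch()` is worth reaching for when validating that an entire input has a given shape, rather than just a prefix of it.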


### 3. re.findall()
* `re.findall()` returns a list of all non-overlapping matches of the pattern in a string.



In [7]:
s=("hey hi hello how are you?")
i=re.findall('h',s)
print(i)

['h', 'h', 'h', 'h']


In [8]:
s = ("hi i am robot and my email id is robot-zs9112@gmail.com, my another email id is cleaning.robot@gmail.com")
# here re.findall() returns a list of all the found email strings
z = re.findall(r'[\w\.-]+@[\w\.-]+',s)
print(z)

['robot-zs9112@gmail.com', 'cleaning.robot@gmail.com']


In [9]:
print(re.findall(r'\w','i love python'))


['i', 'l', 'o', 'v', 'e', 'p', 'y', 't', 'h', 'o', 'n']


In [10]:
result=re.sub(r'New York','the World','Shake Shack is a great burger restaurant in New York!')
print(result)


Shake Shack is a great burger restaurant in the World!


### 4. re.compile()
* `re.compile()` compiles a regular expression pattern into a pattern object, which can then be used for matching. This lets you reuse the same pattern without rewriting it.

In [11]:
pattern=re.compile('good')
a=pattern.findall('When you think positive good things happen')
print(a)
b=pattern.findall('Life is all about having a good time')
print(b)

['good']
['good']
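
A compiled pattern can also carry flags and exposes the same operations as the module-level functions. The sketch below assumes a small made-up list of sentences and reuses one pattern object for both counting and substitution.

```python
import re

# Compile once with a flag, then reuse the same object for several operations.
pattern = re.compile(r'good', re.IGNORECASE)

# Hypothetical sample sentences
sentences = [
    'When you think positive good things happen',
    'Good habits take time',
    'Nothing to see here',
]

counts = [len(pattern.findall(s)) for s in sentences]
replaced = pattern.sub('great', sentences[1])
print(counts)
print(replaced)
```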


###  5. More Examples

* __a. Extract all characters from a given string__

In [12]:
a=re.findall(r'.','Kate loves Python')
print(a)


['K', 'a', 't', 'e', ' ', 'l', 'o', 'v', 'e', 's', ' ', 'P', 'y', 't', 'h', 'o', 'n']


* __b. Extract each word from a given string__



In [13]:
b=re.findall(r'\w+','Hello, world!')  # + requires at least one word character
print(b)

['Hello', 'world']
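
The choice of quantifier matters here. The `*` quantifier accepts zero repetitions, so `\w*` can succeed with an empty match at positions where no word character is present, whereas `\w+` cannot. A short sketch comparing the two (variable names are illustrative):

```python
import re

text = 'Hello, world!'

with_star = re.findall(r'\w*', text)  # * allows zero characters, so empty matches appear
with_plus = re.findall(r'\w+', text)  # + requires at least one word character

# Filtering out the empty strings gives the same words as \w+
non_empty = [w for w in with_star if w]
print(with_plus)
print(non_empty)
```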


* __c. Extract numbers from a given string__



In [14]:
c=re.findall(r'\d+',"Hi i am python and my no is 7867465789")

print(c)

['7867465789']


### 6. Splitting strings on any of multiple delimiters
* __re.split()__ takes a pattern as the separator, so a string can be split on any of several delimiters at once.

In [15]:
line = 'asdf fjkl; adf,efsf,     foo'
re.split(r'[;,\s]\s*', line)

['asdf', 'fjkl', 'adf', 'efsf', 'foo']

In [16]:
fields = re.split(r'(;|,|\s)\s*', line)
fields

['asdf', ' ', 'fjkl', ';', 'adf', ',', 'efsf', ',', 'foo']

In [17]:
values = fields[::2]
delimiters = fields[1::2]+['']
print(values)
print(delimiters)

['asdf', 'fjkl', 'adf', 'efsf', 'foo']
[' ', ';', ',', ',', '']


In [18]:
# rebuild the line from the captured delimiters
# (whitespace that followed each delimiter was not captured, so it is not restored)
''.join(v+d for v,d in zip(values, delimiters))

'asdf fjkl;adf,efsf,foo'
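
When alternatives need grouping but the delimiters should not appear in the result, a non-capturing group `(?:...)` can be used. The sketch below also shows the `maxsplit` argument of `re.split()`. The names `head` and `rest` are chosen for this example.

```python
import re

line = 'asdf fjkl; adf,efsf,     foo'

# (?:...) groups the alternatives without capturing, so delimiters are not returned.
fields = re.split(r'(?:,|;|\s)\s*', line)
print(fields)

# maxsplit limits how many splits are performed; the remainder is left intact.
head, rest = re.split(r'[;,\s]\s*', line, maxsplit=1)
print(head)
print(rest)
```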

### 7. Matching and searching for text patterns


In [19]:
text1='11/27/2012'
text2='Nov 27, 2012'

In [20]:
# simple matching : \d+ means match one or more digits
if re.match(r'\d+/\d+/\d+', text1):
    print('yes')
else:
    print('no')

yes


In [21]:
if re.match(r'\d+/\d+/\d+', text2):
    print('yes')
else:
    print('no')

no


In [22]:
# to perform a lot of matches using the same pattern, 
# usually precompile the regular expression pattern into a pattern object first
datepat = re.compile(r'\d+/\d+/\d+')

In [23]:
if datepat.match(text1):
    print('yes')
else:
    print('no')

yes


In [24]:
if datepat.match(text2):
    print('yes')
else:
    print('no')

no


In [25]:
# to search text for all occurrences of a pattern, use the findall() method
text = 'Today is 11/27/2012. PyCon stars 3/13/2013'
datepat.findall(text)

['11/27/2012', '3/13/2013']

In [26]:
# capture groups by enclosing parts of the patterns in parentheses
datepat = re.compile(r'(\d+)/(\d+)/(\d+)')

In [27]:
m = datepat.match('11/27/2012')
m

<re.Match object; span=(0, 10), match='11/27/2012'>

In [28]:
m.groups()

('11', '27', '2012')

In [29]:
m.group(0)

'11/27/2012'

In [30]:
print(m.group(1))
print(m.group(2))
print(m.group(3))

11
27
2012


In [31]:
month, day, year = m.groups()

In [32]:
# find all matches (splitting into tuples)
# with capture groups, findall() returns a list of tuples, one tuple of groups per match
datepat.findall(text)

[('11', '27', '2012'), ('3', '13', '2013')]

In [33]:
for month, day, year in datepat.findall(text):
    print('{}-{}-{}'.format(year, month, day))

2012-11-27
2013-3-13


In [34]:
# to find matches iteratively, use the finditer() method
for m in datepat.finditer(text):
    print(m.groups())

('11', '27', '2012')
('3', '13', '2013')
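
Numbered groups can become hard to read as patterns grow. One option is named groups, written `(?P<name>...)`. The sketch below rewrites the date pattern that way. The group names `month`, `day`, and `year` are choices made for this example.

```python
import re

text = 'Today is 11/27/2012. PyCon stars 3/13/2013'

# Same date pattern as above, but each group is given a name
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# groupdict() maps each group name to the text it matched
dates = [m.groupdict() for m in datepat.finditer(text)]
print(dates)

first = datepat.search(text)
print(first.group('year'), first.span())
```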


### 8. Searching and replacing text
* For simple literal patterns, use the __str.replace()__ method
* For more complicated patterns, use __re.sub()__ function

In [35]:
text

'Today is 11/27/2012. PyCon stars 3/13/2013'

In [36]:
# backslashed digits such as \3 refer to capture group numbers in the pattern
re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)

'Today is 2012-11-27. PyCon stars 2013-3-13'

In [37]:
# to perform repeated substitutions of the same pattern, 
# consider compiling it first for better performance

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
datepat.sub(r'\3-\1-\2', text)

'Today is 2012-11-27. PyCon stars 2013-3-13'

In [38]:
# for more complicated substitutions, it's possible to specify a substitution callback function
from calendar import month_abbr
def change_date(m):
    month_name = month_abbr[int(m.group(1))]
    return '{} {} {}'.format(m.group(2), month_name, m.group(3))


In [39]:
datepat.sub(change_date, text)

'Today is 27 Nov 2012. PyCon stars 13 Mar 2013'

In [40]:
# to know how many substitutions were made in addition to getting the replacement text
# use re.subn() instead

newtext, n = datepat.subn(r'\3-\1-\2', text)
n

2
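
Two further options of `sub()` may be useful here: the `count` argument limits the number of replacements, and `\g<name>` refers to a named group in the replacement template. A minimal sketch, assuming the date pattern is rewritten with named groups (the group names are choices made for this example):

```python
import re

text = 'Today is 11/27/2012. PyCon stars 3/13/2013'
datepat = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

# \g<name> refers to a named group in the replacement template.
all_dates = datepat.sub(r'\g<year>-\g<month>-\g<day>', text)

# count=1 replaces only the first occurrence.
first_only = datepat.sub(r'\g<year>-\g<month>-\g<day>', text, count=1)

print(all_dates)
print(first_only)
```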

### 9. Searching and replacing case-insensitive text
* Pass the __re.IGNORECASE__ flag to make matching case-insensitive

In [41]:
text = 'UPPER PYTHON, lower python, Mixed Python'
re.findall('python', text, flags=re.IGNORECASE)

['PYTHON', 'python', 'Python']

In [42]:
re.sub('python', 'snake', text, flags=re.IGNORECASE)

'UPPER snake, lower snake, Mixed snake'

In [43]:
# replace text so that the replacement matches the case of the matched text
def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
        
    return replace

In [44]:
re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)

'UPPER SNAKE, lower snake, Mixed Snake'
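
The flag can also be written inside the pattern itself as the inline modifier `(?i)`, which must appear at the start of the expression. A brief sketch showing that both spellings give the same result (variable names are illustrative):

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

with_flag = re.findall(r'python', text, flags=re.IGNORECASE)
with_inline = re.findall(r'(?i)python', text)  # (?i) placed at the start of the pattern

print(with_flag)
print(with_inline)
```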

### 10. Specifying a regular expression for the shortest match

In [45]:
str_pat = re.compile(r'\"(.*)\"')
text1 = 'computer says "no."'
str_pat.findall(text1)

['no.']

In [46]:
# the * operator is greedy: it matches the longest possible string
text2= 'computer says "no." phone says "yes."'
str_pat.findall(text2)

['no." phone says "yes.']

In [47]:
# to fix this, add the ? modifier after the * operator in the pattern
str_pat = re.compile(r'\"(.*?)\"')

str_pat.findall(text2)

['no.', 'yes.']
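
Another way to avoid the over-long match is a negated character class, which simply refuses to cross a quote. The sketch below compares it with the non-greedy form (variable names are illustrative).

```python
import re

text2 = 'computer says "no." phone says "yes."'

lazy = re.findall(r'"(.*?)"', text2)        # non-greedy: stop at the first closing quote
negated = re.findall(r'"([^"]*)"', text2)   # match any run of characters that are not quotes

print(lazy)
print(negated)
```

Both give the same result here. Unlike `.`, the class `[^"]` also matches newlines, so the two forms can differ on multi-line input.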

### 11. Regular expression for multiline patterns

In [54]:
comment = re.compile(r'/\*(.*?)\*/')

text1 =  '/* this is a comment */'
text2 = '''/* this is a
    multiline comment */
'''

In [50]:
comment.findall(text1)

[' this is a comment ']

In [51]:
comment.findall(text2)

[]

In [55]:
# to fix the problem, add support for newlines
comment = re.compile(r'/\*((?:.|\n)*?)\*/')

In [56]:
comment.findall(text2)

[' this is a\n    multiline comment ']
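
An alternative to spelling out `(?:.|\n)` is the `re.DOTALL` flag, which makes `.` match newlines as well. A minimal sketch using the same multiline comment text:

```python
import re

text2 = '''/* this is a
    multiline comment */
'''

# With re.DOTALL, '.' also matches '\n', so the original pattern works unchanged.
comment = re.compile(r'/\*(.*?)\*/', re.DOTALL)
found = comment.findall(text2)
print(found)
```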