# Regular Expression

## What is a Regular Expression?

> ### A regular expression in a programming language is a special text string used for describing a search pattern. It is extremely useful for extracting information from text such as code, files, logs, spreadsheets or even documents.

### To use regular expressions in Python, we need to import the "re" module.
### >> import re
* #### The "re" module is included with Python and is primarily used for string searching and manipulation.
* #### It is also frequently used for web page "scraping" (extracting large amounts of data from websites).

### Main methods of the re module
* #### compile()
* #### match()
* #### search()
* #### finditer()
* #### findall()
* #### sub()
* #### split()


### Syntax:
### >> import re
### >> re.match(pattern, string, flags=0) - flags is optional

## re functions

* ## compile(r"raw_string_pattern", flags=0)
> ### Compile a regular expression pattern, returning a pattern object.
* ## finditer(r"raw_string_pattern", find_str, flags=0)
> ### Returns an iterator yielding MatchObject instances over all non-overlapping matches for the re pattern in the string.
* ## findall(r"raw_string_pattern", find_str, flags=0)
> ### re.findall() returns all the non-overlapping matches of the pattern in a string as a list of strings (or a list of tuples when the pattern contains more than one capture group).
* ## search(r"raw_string_pattern", find_str, flags=0)
> ### Searches the string for a match, and returns a MatchObject if there is a match.
* ## split(r"raw_string_pattern", find_str, maxsplit=0, flags=0)
> ### Returns a list where the string has been split at each match (for example, r"\s" splits at each white-space character).
* ## sub(r"raw_string_pattern", replace_with, find_str, count=0, flags=0)
> ### Replaces the matches with the text of your choice.
> * ### We can control the number of replacements by specifying the count parameter, e.g. to replace only the first 2 occurrences:
> > re.sub(r'\s', "9", txt, count=2)
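The functions listed above can be sketched together on one short string. This is only an illustration: `txt` and the patterns are made-up values, not part of the original notes.

```python
import re

txt = "The rain in Spain stays mainly in the plain"  # hypothetical sample text

# compile() builds a reusable pattern object
pattern = re.compile(r"ai")

# findall() returns every non-overlapping match as a list of strings
all_matches = pattern.findall(txt)

# search() returns a match object for the first match, or None
first = pattern.search(txt)

# finditer() yields one match object per match
spans = [m.span() for m in pattern.finditer(txt)]

# split() cuts the string at every match of the pattern
words = re.split(r"\s", txt)

# sub() replaces matches; count limits how many are replaced
replaced = re.sub(r"\s", "9", txt, count=2)

print(all_matches)   # ['ai', 'ai', 'ai', 'ai']
print(first.span())  # (5, 7)
print(spans)         # [(5, 7), (14, 16), (25, 27), (40, 42)]
print(words)
print(replaced)      # The9rain9in Spain stays mainly in the plain
```

Passing `count` by keyword keeps the call readable and avoids confusing it with `flags`.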

## re Flags

* ## re.IGNORECASE
> ### It makes the pattern case insensitive, so that it matches strings regardless of capitalization.
* ## re.MULTILINE
> ### It is useful when your input string has newline characters (\n): this flag allows the start and end meta-characters (^ and $ respectively) to match at the beginning and end of each line instead of only at the beginning and end of the whole input string.
* ## re.DOTALL
> ### It allows the dot (.) meta-character to match all characters, including the newline character (\n).
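A minimal sketch of the three flags, using a made-up three-line string (`text` is an assumption for illustration only):

```python
import re

text = "first line\nSecond line\nTHIRD line"  # hypothetical multi-line input

# IGNORECASE: the lower-case pattern also matches 'Second'
case_insensitive = re.findall(r"second", text, flags=re.IGNORECASE)

# Without MULTILINE, ^ matches only at the very start of the string
starts_default = re.findall(r"^\w+", text)
# With MULTILINE, ^ also matches right after each newline
starts_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Without DOTALL, '.' stops at a newline
dot_default = re.search(r"first.*line", text).group()
# With DOTALL, '.' also matches '\n', so the greedy match runs to the last 'line'
dot_all = re.search(r"first.*line", text, flags=re.DOTALL).group()

print(case_insensitive)   # ['Second']
print(starts_default)     # ['first']
print(starts_multiline)   # ['first', 'Second', 'THIRD']
print(repr(dot_default))  # 'first line'
print(dot_all == text)    # True
```

Flags can be combined with `|`, for example `re.IGNORECASE | re.MULTILINE`.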

## The MatchObject (shown as re.Match in Python 3) has properties and methods used to retrieve information about the search and its result.
* ### .span()
> #### Returns a tuple containing the start and end positions of the match.
* ### .string
> #### Returns the string passed into the function.
* ### .group()
> #### Returns the part of the string where there was a match.
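As a small sketch of these members, the example below searches a made-up string (`txt` is an assumption) and reads the match object back. `search()` returns `None` when nothing matches, so the result is checked before use.

```python
import re

txt = "The rain in Spain"  # hypothetical sample text

# Look for a word that starts with an upper-case "S"
m = re.search(r"\bS\w+", txt)

if m is not None:
    print(m.span())    # (12, 17): start and end positions of the match
    print(m.string)    # the string that was passed to search()
    print(m.group())   # 'Spain': the text that matched
    # The span can be used to slice the original string
    print(txt[m.start():m.end()])
```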

## Metacharacters and Quantifiers
* ### <pre> [] - A set of characters - "[a-z]"</pre>
* ### <pre> \ - Signals a special sequence - "\d"</pre>
* ### <pre> . - Any character - "ab..t"</pre>
* ### <pre> ^ - Starts with - "^hello"</pre>
* ### <pre> $ - Ends with - "world$"</pre>
* ### <pre> * - 0 or more occurrences - "abo*" </pre>
* ### <pre> + - 1 or more occurrences - "abo+"</pre>
* ### <pre> ? - 0 or 1 occurrence - "abo?"</pre>
* ### <pre> {2} - exactly the specified number of occurrences - "ab{2}"</pre>
* ### <pre> {1,3} - range of occurrences (min,max), written without a space - "ab{1,3}"</pre>
* ### <pre> | - Either or - "falls|stays"</pre>
* ### <pre> () - capture and group - "(ab)+"</pre>
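The quantifiers above can be compared side by side. The input strings here are made up for illustration.

```python
import re

sample = "ab abo aboo"  # hypothetical input

zero_or_more = re.findall(r"abo*", sample)  # 'o' may be absent or repeated
one_or_more = re.findall(r"abo+", sample)   # at least one 'o' is required
optional = re.findall(r"abo?", sample)      # at most one 'o' is consumed

exactly_two = re.findall(r"ab{2}", "ab abb abbb")
one_to_three = re.findall(r"ab{1,3}", "ab abb abbb abbbb")

either = re.findall(r"falls|stays", "It stays or falls")

# Parentheses capture the parts of the match they enclose
groups = re.search(r"(\d+)-(\d+)", "range 10-20").groups()

print(zero_or_more)  # ['ab', 'abo', 'aboo']
print(one_or_more)   # ['abo', 'aboo']
print(optional)      # ['ab', 'abo', 'abo']
print(exactly_two)   # ['abb', 'abb']
print(one_to_three)  # ['ab', 'abb', 'abbb', 'abbb']
print(either)        # ['stays', 'falls']
print(groups)        # ('10', '20')
```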

## Special Sequences
* ### \A - Matches only if the specified characters are at the beginning of the string - r"\AThe"
* ### \b - Matches where the specified characters are at the beginning or at the end of a word - r"\bain", r"ain\b"
* ### \B - Matches where the specified characters are present, but NOT at the beginning (or at the end) of a word - r"\Bain", r"ain\B"
* ### \d - Matches any digit (0-9) - r"\d"
* ### \D - Matches any character that is NOT a digit - r"\D"
* ### \s - Matches any white-space character - r"\s"
* ### \S - Matches any character that is NOT a white-space character - r"\S"
* ### \w - Matches any word character (a-zA-Z0-9_) - r"\w"
* ### \W - Matches any character that is NOT a word character - r"\W"
* ### \Z - Matches only if the specified characters are at the end of the string - r"Spain\Z"
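A brief sketch of the special sequences on a made-up string (`txt` is an assumption). Raw strings (`r"..."`) are used so that Python passes the backslashes to `re` unchanged.

```python
import re

txt = "The rain in Spain 2024"  # hypothetical sample text

at_start = re.findall(r"\AThe", txt)        # 'The' at the very beginning
word_start = re.findall(r"\bain", txt)      # no word starts with 'ain'
word_end = re.findall(r"ain\b", txt)        # 'rain' and 'Spain' end with 'ain'
inside_word = re.findall(r"\Bain", txt)     # 'ain' preceded by a word character
digits = re.findall(r"\d", txt)
spaces = re.findall(r"\s", txt)
non_word = re.findall(r"\W", txt)           # here only the spaces qualify
at_end = re.search(r"2024\Z", txt)          # '2024' at the very end

print(at_start)     # ['The']
print(word_start)   # []
print(word_end)     # ['ain', 'ain']
print(inside_word)  # ['ain', 'ain']
print(digits)       # ['2', '0', '2', '4']
print(len(spaces))  # 4
print(at_end is not None)  # True
```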

## Sets
* ### [arn] - Returns a match where one of the specified characters (a, r, or n) is present.
* ### [a-n] - Returns a match for any lower case character, alphabetically between 'a' and 'n'.
* ### [^arn] - Returns a match for any character except a, r and n.
* ### [78945] - Returns a match where any of the specified digits (7, 8, 9, 4, 5) is present.
* ### [0-9] - Returns a match for any digit between 0 and 9.
* ### [0-4][0-9] - Returns a match for any two-digit number from 00 to 49.
* ### [a-zA-Z] - Returns a match for any character alphabetically between a and z, lower case or upper case.
* ### [+] - In sets, +, *, ., |, (), $, {} have no special meaning, so [+] means: return a match for any + character in the string.
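The sets above can be tried on a few made-up inputs. Note that a set matches one character at a time, and that `[0-4][0-9]` is not anchored, so it can also match inside a longer number.

```python
import re

any_of = re.findall(r"[arn]", "rain")      # each of a, r, n individually
in_range = re.findall(r"[a-n]", "Spain")   # lower-case letters a through n
negated = re.findall(r"[^arn]", "rain")    # everything except a, r, n

# Two-digit numbers 00-49; '12' is found inside '123' because nothing anchors the pattern
two_digit = re.findall(r"[0-4][0-9]", "07 49 50 123")

# Inside a set, + is a literal plus sign
plus_signs = re.findall(r"[+]", "1 + 2 + 3")

print(any_of)      # ['r', 'a', 'n']
print(in_range)    # ['a', 'i', 'n']
print(negated)     # ['i']
print(two_digit)   # ['07', '49', '12']
print(plus_signs)  # ['+', '+']
```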

## Sample regex pattern for an email
> ### emailPattern = r"[a-zA-Z0-9_.+]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.]+"

## Sample regex pattern for a URL
> ### urlPattern = r'https?://(www\.)?(\w+)(\.\w+)'
> ### The pattern contains the following groups:
> * #### .group(0) - returns the entire match
> * #### .group(1) - returns "www." from (www\.), or None if it is absent
> * #### .group(2) - returns the domain name from (\w+)
> * #### .group(3) - returns the domain extension (.com, .net, .org) from (\.\w+)
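A minimal sketch of how these groups come back from a match, using a correctly quoted and escaped form of the URL pattern. The URLs are sample values, and the pattern is a teaching example rather than a full URL validator.

```python
import re

url_pattern = re.compile(r"https?://(www\.)?(\w+)(\.\w+)")

with_www = url_pattern.search("Visit https://www.google.com/ today")
without_www = url_pattern.search("http://google.com")
two_part_tld = url_pattern.search("https://google.co.in")

print(with_www.group(0))      # https://www.google.com
print(with_www.group(1))      # www.
print(with_www.group(2))      # google
print(with_www.group(3))      # .com

# The optional group is None when 'www.' is absent
print(without_www.group(1))   # None

# Only one dotted part is captured, so '.in' is left out here
print(two_part_tld.group(0))  # https://google.co
```

The last case shows a limitation of the pattern as written: multi-part extensions such as `.co.in` would need an extra repeated group.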


In [1]:
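# Sample string used by the examples below.
# It is a raw string so that the lone backslash in the last line stays literal.

sample_string = r"""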
Hello! Welcome to regex python.
6546a4dsfa654fd6a46df7fdert654

numbers
1234-987-61564-01
123.456.8788888.61654

emails
hello@mail.com
hello-hi@mail.com
hello.hi@mail.com
hello123hi@mail.com
hello_hi12@mail.com
hello@mail.co.in

links
http://google.com
https://www.google.com/
https://google.co.in
https://codweb.in/
https://typhit.codweb.in/site

special characters
+,*,/,\,&,(,),{,},[,],<,>,?,~`@#$%^&*
"""

In [2]:
import re

In [10]:
c = re.compile(r'Hello')

In [11]:
r = re.match(c, sample_string)

In [12]:
r

In [13]:
type(r)

NoneType
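The `None` result is expected: `sample_string` begins with a newline, and `re.match` only tries to match at position 0. A small sketch of the difference, where `text` is a made-up string that also starts with a newline:

```python
import re

text = "\nHello! Welcome"  # hypothetical string with a leading newline
pattern = re.compile(r"Hello")

# match() anchors at index 0, which is '\n' here, so it fails
at_start = pattern.match(text)

# search() scans forward and finds the first occurrence anywhere
anywhere = pattern.search(text)

# Stripping the leading whitespace lets match() succeed
after_strip = pattern.match(text.lstrip())

print(at_start)            # None
print(anywhere.span())     # (1, 6)
print(after_strip.span())  # (0, 5)
```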

In [14]:
re.search(c, sample_string)

<re.Match object; span=(1, 6), match='Hello'>

In [15]:
r = re.search(c, sample_string)

In [16]:
r.span()

(1, 6)

In [17]:
sample_string[1:6]

'Hello'

In [18]:
re.search(r'\d', sample_string)

<re.Match object; span=(33, 34), match='6'>

In [19]:
sample_string[33:34]

'6'

In [20]:
re.findall(r'\d', sample_string)

['6',
 '5',
 '4',
 '6',
 '4',
 '6',
 '5',
 '4',
 '6',
 '4',
 '6',
 '7',
 '6',
 '5',
 '4',
 '1',
 '2',
 '3',
 '4',
 '9',
 '8',
 '7',
 '6',
 '1',
 '5',
 '6',
 '4',
 '0',
 '1',
 '1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '8',
 '7',
 '8',
 '8',
 '8',
 '8',
 '8',
 '6',
 '1',
 '6',
 '5',
 '4',
 '1',
 '2',
 '3',
 '1',
 '2']

In [22]:
iterator = re.finditer(r'\d', sample_string)

In [23]:
for i in iterator:
  print(i, end=" ")

<re.Match object; span=(33, 34), match='6'> <re.Match object; span=(34, 35), match='5'> <re.Match object; span=(35, 36), match='4'> <re.Match object; span=(36, 37), match='6'> <re.Match object; span=(38, 39), match='4'> <re.Match object; span=(43, 44), match='6'> <re.Match object; span=(44, 45), match='5'> <re.Match object; span=(45, 46), match='4'> <re.Match object; span=(48, 49), match='6'> <re.Match object; span=(50, 51), match='4'> <re.Match object; span=(51, 52), match='6'> <re.Match object; span=(54, 55), match='7'> <re.Match object; span=(60, 61), match='6'> <re.Match object; span=(61, 62), match='5'> <re.Match object; span=(62, 63), match='4'> <re.Match object; span=(73, 74), match='1'> <re.Match object; span=(74, 75), match='2'> <re.Match object; span=(75, 76), match='3'> <re.Match object; span=(76, 77), match='4'> <re.Match object; span=(78, 79), match='9'> <re.Match object; span=(79, 80), match='8'> <re.Match object; span=(80, 81), match='7'> <re.Match object; span=(82, 83),

In [24]:
# The iterator was consumed by the previous loop, so this loop prints nothing
for i in iterator:
  print(i, end=" ")

In [26]:
iterator = re.finditer(r'\d', sample_string)

In [28]:
# step through the iterator manually with next()

next(iterator)

<re.Match object; span=(33, 34), match='6'>

In [29]:
next(iterator)

<re.Match object; span=(34, 35), match='5'>

In [30]:
next(iterator)

<re.Match object; span=(35, 36), match='4'>

In [31]:
next(iterator)

<re.Match object; span=(36, 37), match='6'>
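The iterator returned by `finditer` can only be walked once, which is why the second `for` loop earlier printed nothing. A short sketch with a made-up string (`text` is an assumption):

```python
import re

text = "a1b22c333"  # hypothetical input
iterator = re.finditer(r"\d+", text)

# next() pulls one match at a time
first = next(iterator).group()

# A loop or comprehension continues from where next() stopped
rest = [m.group() for m in iterator]

# Once exhausted, the iterator yields nothing more
leftover = list(iterator)

# next() with a default avoids StopIteration on an exhausted iterator
sentinel = next(iterator, None)

print(first)     # 1
print(rest)      # ['22', '333']
print(leftover)  # []
print(sentinel)  # None
```

To reuse the matches several times, one option is to store them with `list(re.finditer(...))` or call `finditer` again.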

In [32]:
r = re.findall(r'\D', sample_string)

In [33]:
print(*r)


 H e l l o !   W e l c o m e   t o   r e g e x   p y t h o n . 
 a d s f a f d a d f f d e r t 
 
 n u m b e r s 
 - - - 
 . . . 
 
 e m a i l s 
 h e l l o @ m a i l . c o m 
 h e l l o - h i @ m a i l . c o m 
 h e l l o . h i @ m a i l . c o m 
 h e l l o h i @ m a i l . c o m 
 h e l l o _ h i @ m a i l . c o m 
 h e l l o @ m a i l . c o . i n 
 
 l i n k s 
 h t t p : / / g o o g l e . c o m 
 h t t p s : / / w w w . g o o g l e . c o m / 
 h t t p s : / / g o o g l e . c o . i n 
 h t t p s : / / c o d w e b . i n / 
 h t t p s : / / t y p h i t . c o d w e b . i n / s i t e 
 
 s p e c i a l   c h a r a c t e r s 
 + , * , / , \ , & , ( , ) , { , } , [ , ] , < , > , ? , ~ ` @ # $ % ^ & * 



In [34]:
r = re.findall(r'\W', sample_string)

In [35]:
print(*r)


 !         . 
 
 
 
 - - - 
 . . . 
 
 
 @ . 
 - @ . 
 . @ . 
 @ . 
 @ . 
 @ . . 
 
 
 : / / . 
 : / / . . / 
 : / / . . 
 : / / . / 
 : / / . . / 
 
   
 + , * , / , \ , & , ( , ) , { , } , [ , ] , < , > , ? , ~ ` @ # $ % ^ & * 



In [36]:
re.findall('[a-z]', sample_string)

['e',
 'l',
 'l',
 'o',
 'e',
 'l',
 'c',
 'o',
 'm',
 'e',
 't',
 'o',
 'r',
 'e',
 'g',
 'e',
 'x',
 'p',
 'y',
 't',
 'h',
 'o',
 'n',
 'a',
 'd',
 's',
 'f',
 'a',
 'f',
 'd',
 'a',
 'd',
 'f',
 'f',
 'd',
 'e',
 'r',
 't',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 'e',
 'm',
 'a',
 'i',
 'l',
 's',
 'h',
 'e',
 'l',
 'l',
 'o',
 'm',
 'a',
 'i',
 'l',
 'c',
 'o',
 'm',
 'h',
 'e',
 'l',
 'l',
 'o',
 'h',
 'i',
 'm',
 'a',
 'i',
 'l',
 'c',
 'o',
 'm',
 'h',
 'e',
 'l',
 'l',
 'o',
 'h',
 'i',
 'm',
 'a',
 'i',
 'l',
 'c',
 'o',
 'm',
 'h',
 'e',
 'l',
 'l',
 'o',
 'h',
 'i',
 'm',
 'a',
 'i',
 'l',
 'c',
 'o',
 'm',
 'h',
 'e',
 'l',
 'l',
 'o',
 'h',
 'i',
 'm',
 'a',
 'i',
 'l',
 'c',
 'o',
 'm',
 'h',
 'e',
 'l',
 'l',
 'o',
 'm',
 'a',
 'i',
 'l',
 'c',
 'o',
 'i',
 'n',
 'l',
 'i',
 'n',
 'k',
 's',
 'h',
 't',
 't',
 'p',
 'g',
 'o',
 'o',
 'g',
 'l',
 'e',
 'c',
 'o',
 'm',
 'h',
 't',
 't',
 'p',
 's',
 'w',
 'w',
 'w',
 'g',
 'o',
 'o',
 'g',
 'l',
 'e',
 'c',
 'o',
 'm'

In [37]:
re.findall('[0-4]', sample_string)

['4',
 '4',
 '4',
 '4',
 '4',
 '1',
 '2',
 '3',
 '4',
 '1',
 '4',
 '0',
 '1',
 '1',
 '2',
 '3',
 '4',
 '1',
 '4',
 '1',
 '2',
 '3',
 '1',
 '2']

In [38]:
re.findall('[a-d0-4]', sample_string)

['c',
 '4',
 'a',
 '4',
 'd',
 'a',
 '4',
 'd',
 'a',
 '4',
 'd',
 'd',
 '4',
 'b',
 '1',
 '2',
 '3',
 '4',
 '1',
 '4',
 '0',
 '1',
 '1',
 '2',
 '3',
 '4',
 '1',
 '4',
 'a',
 'a',
 'c',
 'a',
 'c',
 'a',
 'c',
 '1',
 '2',
 '3',
 'a',
 'c',
 '1',
 '2',
 'a',
 'c',
 'a',
 'c',
 'c',
 'c',
 'c',
 'c',
 'd',
 'b',
 'c',
 'd',
 'b',
 'c',
 'a',
 'c',
 'a',
 'a',
 'c']

In [39]:
re.findall(r'\w[a-z0-9]', sample_string)

['He',
 'll',
 'We',
 'lc',
 'om',
 'to',
 're',
 'ge',
 'py',
 'th',
 'on',
 '65',
 '46',
 'a4',
 'ds',
 'fa',
 '65',
 '4f',
 'd6',
 'a4',
 '6d',
 'f7',
 'fd',
 'er',
 't6',
 '54',
 'nu',
 'mb',
 'er',
 '12',
 '34',
 '98',
 '61',
 '56',
 '01',
 '12',
 '45',
 '87',
 '88',
 '88',
 '61',
 '65',
 'em',
 'ai',
 'ls',
 'he',
 'll',
 'ma',
 'il',
 'co',
 'he',
 'll',
 'hi',
 'ma',
 'il',
 'co',
 'he',
 'll',
 'hi',
 'ma',
 'il',
 'co',
 'he',
 'll',
 'o1',
 '23',
 'hi',
 'ma',
 'il',
 'co',
 'he',
 'll',
 '_h',
 'i1',
 'ma',
 'il',
 'co',
 'he',
 'll',
 'ma',
 'il',
 'co',
 'in',
 'li',
 'nk',
 'ht',
 'tp',
 'go',
 'og',
 'le',
 'co',
 'ht',
 'tp',
 'ww',
 'go',
 'og',
 'le',
 'co',
 'ht',
 'tp',
 'go',
 'og',
 'le',
 'co',
 'in',
 'ht',
 'tp',
 'co',
 'dw',
 'eb',
 'in',
 'ht',
 'tp',
 'ty',
 'ph',
 'it',
 'co',
 'dw',
 'eb',
 'in',
 'si',
 'te',
 'sp',
 'ec',
 'ia',
 'ch',
 'ar',
 'ac',
 'te',
 'rs']

In [40]:
re.findall('[a-z][A-Z]', sample_string)

[]

In [41]:
re.findall('ab..t', sample_string)

[]

In [42]:
re.findall('http', sample_string)

['http', 'http', 'http', 'http', 'http']

In [43]:
re.findall('https?', sample_string)

['http', 'https', 'https', 'https', 'https']

In [44]:
emailPattern = r"[a-zA-Z0-9_.+]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.]+"

In [45]:
re.search('https?', sample_string)

<re.Match object; span=(236, 240), match='http'>

In [47]:
re.findall(emailPattern, sample_string)

['hello@mail.com',
 'hi@mail.com',
 'hello.hi@mail.com',
 'hello123hi@mail.com',
 'hello_hi12@mail.com',
 'hello@mail.co.in']
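The second result is `'hi@mail.com'` rather than `'hello-hi@mail.com'` because the character set before the `@` does not include a hyphen. The sketch below compares a loose variant with one possible adjustment. Both pattern names and the test strings are hypothetical, and neither pattern is a complete email validator.

```python
import re

emails_text = "hello-hi@mail.com\nhello@mail.co.in\nnot-an-email"  # hypothetical input

# Loose variant: no '-' before the '@', and an unescaped '.' that matches any character
loose = r"[a-zA-Z0-9_.+]+@[a-zA-Z0-9-]+.[a-zA-Z0-9.]+"

# Adjusted variant: '-' added at the end of the first set, and the dot escaped
adjusted = r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.]+"

loose_hits = re.findall(loose, emails_text)
adjusted_hits = re.findall(adjusted, emails_text)

print(loose_hits)     # ['hi@mail.com', 'hello@mail.co.in']
print(adjusted_hits)  # ['hello-hi@mail.com', 'hello@mail.co.in']

# The unescaped dot accepts an address with no '.' in the domain at all
print(re.fullmatch(loose, "a@bXcom") is not None)  # True
print(re.fullmatch(adjusted, "a@bXcom") is None)   # True
```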