# Regular Expression

* A regular expression (regex) is a sequence of characters that describes a search pattern.
* Regexes are text-matching patterns.
* They provide a flexible way to search for or match string patterns in text.

In [1]:
import re

* `.` matches any single character (except a newline, by default).
    - "a.c" matches "aac", "abc", "acc", and the "asc" inside "ascsssss"
    - "..t" matches "bat", "oat", and the "bit" inside "habit"
* `\.` matches a literal `.` character in the string. `\` is the escape character.
* `^` matches the beginning of the string
    - "^b.tter" matches "better", "butter", "batter"
* `$` matches the end of the string
* `[bcr]at` matches 'bat', 'cat', 'rat'
* "\[Serious\]" looks for the entire '[Serious]' in the string.
* "[\[\(][Ss]erious[\]\)]" matches '(Serious)', '(serious)', '[Serious]', '[serious)'
* To combine regular expressions, use `|`. Ex. "cat|dog" matches inside 'catfish' and 'hotdog'
* `'[0-9]'` matches any one character between 0 and 9
* `'[a-z]'` matches any one character between a and z
* `{}` indicates how many times the preceding pattern repeats
    - `'[0-9]{4}'` matches four digits in a row.
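
The syntax listed above can be tried out with a short sketch like the following. The sample strings are made up for illustration, and `re.search` is used so that a pattern may match anywhere inside the text.

```python
import re

# Each pair is (pattern, sample text); the samples are illustrative only.
examples = [
    (r"a.c", "abc"),             # . stands for any single character
    (r"\.", "a.c"),              # \. matches a literal dot
    (r"^b.tter", "butter"),      # ^ anchors the match at the start
    (r"ter$", "butter"),         # $ anchors the match at the end
    (r"[bcr]at", "rat"),         # one character out of b, c, r, then "at"
    (r"cat|dog", "hotdog"),      # either alternative may match
    (r"[0-9]{4}", "year 2024"),  # four digits in a row
]

# Collect the text each pattern actually matched.
matched = [re.search(pattern, text).group() for pattern, text in examples]
print(matched)
```

Note that `cat|dog` reports only `dog` for 'hotdog': the match is the part of the string that fits the pattern, not the whole word.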

### `re.search(regex, string)`
* Checks whether any part of the string matches the regex. If so, it returns a match object; otherwise it returns None.

In [2]:
text = "This is a string with text term1, but not the other term"

* re.search() returns a match object, which also holds the start and end positions of the match

In [3]:
match = re.search("term1", text)

In [4]:
type(match)

re.Match

In [5]:
match.start()

27

In [6]:
match.end()

32

In [7]:
match = re.search("term1", "This is string with TERM1",re.IGNORECASE)
match.start()

20

### `groups()`
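* Use the match object's group(num) or groups() method to get the matched text: group() returns the whole match, group(num) a single captured group, and groups() all captured groups as a tuple.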

In [8]:
line = "Cats are smarter than dogs"

In [9]:
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I) # the ? after .* makes it non-greedy; re.I ignores case

In [10]:
matchObj.group()

'Cats are smarter than dogs'

In [11]:
matchObj.group(1)

'Cats'

In [12]:
matchObj.group(2)

'smarter'
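
The cells above call `group()` only. As a small complement, this sketch shows `groups()` and `span()` on the same sentence. The variable names are illustrative.

```python
import re

line = "Cats are smarter than dogs"
match_obj = re.match(r"(.*) are (.*?) .*", line, re.I)

all_groups = match_obj.groups()  # every captured group, as a tuple
first_span = match_obj.span(1)   # (start, end) indices of group 1
print(all_groups, first_span)
```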

* `re.DEBUG` : Displays debug information about the compiled expression
* `re.I` : re.IGNORECASE, performs case-insensitive matching
* `re.L` : re.LOCALE, makes \w, \W, \b, \B and case-insensitive matching depend on the current locale. In Python 3 it can be used only with bytes patterns.
* `re.M` : re.MULTILINE, when specified, `^` matches at the beginning of the string and at the beginning of each line, and `$` matches at the end of the string and at the end of each line.
* `re.S` : re.DOTALL, makes the `.` special character match any character at all, including a newline. Without this flag `.` matches anything except a newline.
* `re.U` : re.UNICODE, makes \w, \W, \b, \B, \s and \S depend on the Unicode character properties database. In Python 3 this is already the default for str patterns, so the flag is redundant.
* `re.X` : re.VERBOSE, permits more readable regular expression syntax. It ignores whitespace (except inside [] or when escaped by \) and treats an unescaped # as a comment marker.
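
A minimal sketch of three of these flags in action. The sample text and the phone-like pattern are assumptions chosen only to make the effect visible.

```python
import re

text = "first line\nsecond line"

# Without re.M, ^ matches only at the very start of the string.
starts_default = re.findall(r"^\w+", text)
# With re.M, ^ also matches right after each newline.
starts_multiline = re.findall(r"^\w+", text, re.M)

# Without re.S, . stops at the newline; with it, . crosses lines.
dot_default = re.search(r"first.*second", text)
dot_all = re.search(r"first.*second", text, re.S)

# re.X lets the pattern carry whitespace and comments.
phone_pattern = re.compile(r"""
    \d{3}   # first block of digits
    -       # separator
    \d{4}   # second block of digits
""", re.X)
verbose_hit = phone_pattern.search("call 555-1234 now").group()

print(starts_default, starts_multiline, dot_default, verbose_hit)
```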

In [13]:
text = "foo     bar\
    baz\
        qux"

In [14]:
text

'foo     bar    baz        qux'

In [15]:
re.split(r'\s+', text)

['foo', 'bar', 'baz', 'qux']

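* When `re.split(r'\s+', text)` is called, the regex is first compiled and then its split method is called on the passed text. If the same pattern will be reused many times, compile it once with `re.compile()` to save CPU time (the re module also caches recently compiled patterns, so the gain is usually modest).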

In [16]:
regex = re.compile(r'\s+')

In [17]:
regex.split(text)

['foo', 'bar', 'baz', 'qux']

### `split()`
* Splits the string wherever the pattern matches and returns the pieces as a list. The text that matched the split term is removed.

In [18]:
split_term = '@'

In [19]:
phrase = "What is the domain name of someone with the email hello@gmail.com"

In [20]:
re.split(split_term, phrase)

['What is the domain name of someone with the email hello', 'gmail.com']

### `re.findall('regex', 'string')`
* Returns a list of all non-overlapping substrings that match the regex.

In [21]:
re.findall('match', 'test phrase match is matching middle')

['match', 'match']

In [22]:
def multi_re_find(patterns, phrase):
    for pattern in patterns:
        print('Searching for pattern {0} from phrase {1}'.format(pattern, phrase))
        print(re.findall(pattern, phrase))
        print('\n')

* `*` matches 0 or more repetitions of the preceding pattern
* `+` matches 1 or more repetitions of the preceding pattern
* `?` matches 0 or 1 repetition of the preceding pattern
* For a specific number of occurrences use {m} after the pattern, where m is the number of times the pattern should repeat.
* Use {m,n} where m is the minimum and n is the maximum number of repetitions. {m,} means at least m times with no maximum.

In [23]:
test_patterns = ['sd*', # 0 or more d 
                 'sd+', # 1 or more d
                 'sd?', # 0 or 1 d 
                 'sd{3}',  # 3 d
                 'sd{2,3}'] # 2 to 3 d

In [24]:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

In [25]:
multi_re_find(test_patterns, test_phrase)

Searching for pattern sd* from phrase sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd
['sd', 'sd', 's', 's', 'sddd', 'sddd', 'sddd', 'sd', 's', 's', 's', 's', 's', 's', 'sdddd']


Searching for pattern sd+ from phrase sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd
['sd', 'sd', 'sddd', 'sddd', 'sddd', 'sd', 'sdddd']


Searching for pattern sd? from phrase sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd
['sd', 'sd', 's', 's', 'sd', 'sd', 'sd', 'sd', 's', 's', 's', 's', 's', 's', 'sd']


Searching for pattern sd{3} from phrase sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd
['sddd', 'sddd', 'sddd', 'sddd']


Searching for pattern sd{2,3} from phrase sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd
['sddd', 'sddd', 'sddd', 'sddd']




### Character set matching

* Character sets are used when we wish to match any one of a group of characters at a point in the input. Brackets are used to create character sets. `[ab]` searches for an occurrence of a or b.

In [26]:
test_patterns = ['[sd]', # s or d
                's[sd]+'] #s followed by 1 or more s or d

In [27]:
multi_re_find(test_patterns, test_phrase)

Searching for pattern [sd] from phrase sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd
['s', 'd', 's', 'd', 's', 's', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 's', 's', 's', 's', 's', 's', 'd', 'd', 'd', 'd']


Searching for pattern s[sd]+ from phrase sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd
['sdsd', 'sssddd', 'sdddsddd', 'sds', 'sssss', 'sdddd']




* We can use `^` to exclude characters by placing it at the start of the bracket syntax.
* Use `[^!.? ]` to match any character that is not !, ., ? or a space. Add + to match one or more such characters in a row.

In [28]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [29]:
print(re.findall('[^!.? ]+',test_phrase), end=" ")

['This', 'is', 'a', 'string', 'But', 'it', 'has', 'punctuation', 'How', 'can', 'we', 'remove', 'it'] 

### Character range matching
* The syntax is `[start-end]`.
* `[a-f]` matches any single letter between a and f.

In [30]:
test_phrase = "This is an example sentence. Lets see if we find some letters."

In [31]:
test_patterns = [ '[a-z]+', # sequence of lower case letters
                '[A-Z]+', # sequence of upper case letters
                '[a-zA-Z]+', # sequence of lower or upper case letters
                '[A-Z][a-z]+'] # uppercase letter followed by lowercase letters

In [32]:
multi_re_find(test_patterns, test_phrase)

Searching for pattern [a-z]+ from phrase This is an example sentence. Lets see if we find some letters.
['his', 'is', 'an', 'example', 'sentence', 'ets', 'see', 'if', 'we', 'find', 'some', 'letters']


Searching for pattern [A-Z]+ from phrase This is an example sentence. Lets see if we find some letters.
['T', 'L']


Searching for pattern [a-zA-Z]+ from phrase This is an example sentence. Lets see if we find some letters.
['This', 'is', 'an', 'example', 'sentence', 'Lets', 'see', 'if', 'we', 'find', 'some', 'letters']


Searching for pattern [A-Z][a-z]+ from phrase This is an example sentence. Lets see if we find some letters.
['This', 'Lets']




### Escape codes
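* Indicated by prefixing the character with `\`. Write these patterns as raw strings (`r'...'`) so that Python does not interpret the backslash itself.
* `\d` : a digit
* `\D` : a non-digit
* `\s` : whitespace (tab, space, newline, etc.)
* `\S` : non-whitespace
* `\w` : alphanumeric or underscore
* `\W` : neither alphanumeric nor underscore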

In [33]:
test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'
test_patterns = [r'\d+',
                r'\D+',
                r'\s+',
                r'\S+',
                r'\w+',
                r'\W+',]

In [34]:
multi_re_find(test_patterns, test_phrase)

Searching for pattern \d+ from phrase This is a string with some numbers 1233 and a symbol #hashtag
['1233']


Searching for pattern \D+ from phrase This is a string with some numbers 1233 and a symbol #hashtag
['This is a string with some numbers ', ' and a symbol #hashtag']


Searching for pattern \s+ from phrase This is a string with some numbers 1233 and a symbol #hashtag
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ']


Searching for pattern \S+ from phrase This is a string with some numbers 1233 and a symbol #hashtag
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', '#hashtag']


Searching for pattern \w+ from phrase This is a string with some numbers 1233 and a symbol #hashtag
['This', 'is', 'a', 'string', 'with', 'some', 'numbers', '1233', 'and', 'a', 'symbol', 'hashtag']


Searching for pattern \W+ from phrase This is a string with some numbers 1233 and a symbol #hashtag
[' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ', ' #']




### `sub` Substitution
* `re.sub(pattern, repl, string, count=0)` replaces occurrences of the RE pattern in string with repl. It substitutes all occurrences unless count is provided. This method returns the modified string.

In [35]:
phone = "2004-959-559 # this is phone number"
num = re.sub(r"#.*$", "", phone) # delete python style comment
num

'2004-959-559 '

In [36]:
re.sub(r'\D', "", phone) # remove anything other than digits

'2004959559'
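
Both calls above replace every occurrence. To limit the number of substitutions, pass `count`. The sketch below uses a shortened phone string as an assumed input and also shows `re.subn`, which reports how many replacements were made.

```python
import re

phone = "2004-959-559"

# count=1 replaces only the first dash; the default (count=0) replaces all.
first_only = re.sub(r"-", " ", phone, count=1)
every_dash = re.sub(r"-", " ", phone)

# re.subn returns the new string together with the number of replacements.
result, replacements = re.subn(r"-", "", phone)
print(first_only, every_dash, result, replacements)
```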

### `finditer`
* Returns an iterator that yields a match object for each non-overlapping match of the pattern in the string.

In [37]:
re.finditer('match', 'test phrase match is matching middle')

<callable_iterator at 0x157c0ca0080>
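
The iterator itself is not very informative when displayed. A minimal sketch of consuming it, collecting each match together with its position:

```python
import re

phrase = "test phrase match is matching middle"

# Each item produced by finditer is a match object, so positions are available.
spans = [(m.group(), m.start(), m.end()) for m in re.finditer("match", phrase)]
print(spans)
```

Unlike `findall`, which returns only the matched strings, this gives access to `start()` and `end()` for every hit.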

### Greedy vs non-greedy matching
* `<.*>` is a greedy repetition: in `<python>perl>` it matches the whole `<python>perl>`.
* `<.*?>` is non-greedy: it matches only `<python>` from `<python>perl>`.
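
The difference can be checked directly on the same string:

```python
import re

tag_text = "<python>perl>"

# Greedy: .* grabs as much as possible, so the match runs to the last '>'.
greedy = re.search(r"<.*>", tag_text).group()
# Non-greedy: .*? grabs as little as possible, so the match stops at the first '>'.
lazy = re.search(r"<.*?>", tag_text).group()
print(greedy, lazy)
```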

### Backreferences
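* A backreference matches again whatever text a previous group matched.
* `([Pp])ython&\1ails` matches `python&pails` or `Python&Pails`
    - `\1` means match the same text as the first group,
* `(["'])[^"']*\1` matches a single- or double-quoted string
    - `\1` matches whatever the first group matched
    - `\2` would match whatever a second group matched, and so on
    - A backreference is not recognized inside a character class, so `[^\1]` does not mean "anything except group 1"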
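
A minimal sketch of backreferences in Python. It assumes a form in which only the first letter is captured, `([Pp])ython&\1ails`, and a quoted-string pattern that lists the quote characters explicitly. The sample strings are illustrative.

```python
import re

# Group 1 captures the first letter; \1 requires the same letter again.
same_case = re.compile(r"([Pp])ython&\1ails")
upper_ok = same_case.search("Python&Pails") is not None
lower_ok = same_case.search("python&pails") is not None
mixed_ok = same_case.search("Python&pails") is not None  # 'P' then 'p' differ

# Group 1 captures the opening quote; \1 requires the same closing quote.
quoted = re.compile(r"""(["'])[^"']*\1""")
quoted_hit = quoted.search("say 'hello' now").group()
mismatched = quoted.search("mixed 'quote\" here")
print(upper_ok, lower_ok, mixed_ok, quoted_hit, mismatched)
```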

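* `R(?i)uby` Intended to be case-insensitive while matching uby. Python accepts the global `(?i)` flag only at the start of a pattern (anywhere else is an error from 3.11 on), where it applies to the whole pattern.
* `R(?i:uby)` Case-insensitive for uby only. This scoped form works in Python 3.6+.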
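
A short sketch of inline flags as Python handles them, assuming Python 3.6 or later for the scoped form. `fullmatch` is used so that the whole string has to fit the pattern.

```python
import re

# Scoped flag: only "uby" is case-insensitive, the leading R must be uppercase.
scoped = re.compile(r"R(?i:uby)")
scoped_upper = scoped.fullmatch("RUBY") is not None
scoped_lower = scoped.fullmatch("ruby") is not None

# Global flag at the start: the whole pattern is case-insensitive.
whole = re.compile(r"(?i)ruby")
whole_mixed = whole.fullmatch("RuBy") is not None
print(scoped_upper, scoped_lower, whole_mixed)
```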

### Anchors
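* `Python(?=!)` matches "Python" only if it is followed by `!` (positive lookahead; the `!` is not part of the match)
* `Python(?!!)` matches "Python" only if it is not followed by `!` (negative lookahead)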
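
A minimal sketch of both lookaheads on an assumed sample string, recording where each pattern matches:

```python
import re

text = "Python! Python?"

# Positive lookahead: only the "Python" that is followed by '!' qualifies.
followed = [m.start() for m in re.finditer(r"Python(?=!)", text)]
# Negative lookahead: only the "Python" that is not followed by '!' qualifies.
not_followed = [m.start() for m in re.finditer(r"Python(?!!)", text)]

# The lookahead checks the next character without consuming it.
matched_text = re.search(r"Python(?=!)", text).group()
print(followed, not_followed, matched_text)
```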

![regex](images/regex.jpg)
![regex](images/regex1.jpg)