# Regular Expression

Source: https://learnbyexample.github.io/py_regular_expressions/re-introduction.html

## re.search

Normally you'd use the `in` operator to test whether a string is part of another string or not. For regular expressions, use the `re.search` function whose argument list is shown below.

```python
re.search(pattern, string, flags=0)
```

In [90]:
sentence = 'This is a sample string'

# check if 'sentence' contains the given search string
print('is' in sentence, 'xyz' in sentence)
# need to load the re module before use
import re
# check if 'sentence' contains the pattern described by RE argument
bool(re.search(r'is', sentence)), bool(re.search(r'xyz', sentence))

True False


(True, False)

In [92]:
sentence = 'This is a sample string'
if re.search(r'ring', sentence):
    print('mission success')

if not re.search(r'xyz', sentence):
    print('mission failed')

# examples using a list comprehension and generator expressions
words = ['cat', 'attempt', 'tattle']

print([w for w in words if re.search(r'tt', w)])
print(all(re.search(r'at', w) for w in words))
print(any(re.search(r'stat', w) for w in words))

mission success
mission failed
['attempt', 'tattle']
True
False
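The `bool(...)` calls and `if` tests above work because `re.search` returns a `re.Match` object on success and `None` on failure. The following is a minimal sketch of that behaviour; the variable names `found` and `missing` are only illustrative.

```python
import re

sentence = 'This is a sample string'

# a successful search returns a re.Match object, which is truthy
found = re.search(r'is', sentence)
print(found)

# a failed search returns None, which is falsy
missing = re.search(r'xyz', sentence)
print(missing)

# check for None before asking the match for details
if found is not None:
    print(found.group(), found.span())
```

The first match of `is` is inside `This`, so the reported span is `(2, 4)`.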


## re.sub

For normal search and replace, you'd use the `str.replace` method. For regular expressions, use the `re.sub` function, whose argument list is shown below.

```python
re.sub(pattern, repl, string, count=0, flags=0)
```

In [97]:
greeting = 'Have a nice weekend'

# replace all occurrences of 'e' with 'E'
# same as: greeting.replace('e', 'E')
print(re.sub(r'e', 'E', greeting))

# replace first two occurrences of 'e' with 'E'
# same as: greeting.replace('e', 'E', 2)
print(re.sub(r'e', 'E', greeting, count=2))
# the original string is unchanged, since strings are immutable
print('strings are immutable:', greeting)

HavE a nicE wEEkEnd
HavE a nicE weekend
strings are immutable: Have a nice weekend


## Compiling regular expressions

Regular expressions can be compiled using the `re.compile` function, which returns a `re.Pattern` object.

```python
re.compile(pattern, flags=0)
```

In [99]:
pet = re.compile(r'dog')
print(type(pet))

# note that 'search' is called upon 'pet' which is a 're.Pattern' object
# since 'pet' has the RE information, you only need to pass input string
print(bool(pet.search('They bought a dog')))
print(bool(pet.search('A cat crossed their path')))

# replace all occurrences of 'dog' with 'cat'
print(pet.sub('cat', 'They bought a dog'))

<class 're.Pattern'>
True
False
They bought a cat


The `search` method on a compiled pattern accepts two optional arguments, `pos` and `endpos`, which specify the start and end index positions of the region to search.

In [100]:
sentence = 'This is a sample string'
word = re.compile(r'is')

# search for 'is' starting from 5th character of 'sentence' variable
print(bool(word.search(sentence, 4)))

# search for 'is' starting from 7th character of 'sentence' variable
print(bool(word.search(sentence, 6)))

# search for 'is' between 3rd and 4th characters
print(bool(word.search(sentence, 2, 4)))

True
False
True
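One detail worth checking is how the reported span behaves. The sketch below, using illustrative variable names, suggests that positions passed to `search` keep the span relative to the original string, whereas slicing the string first shifts the indices.

```python
import re

sentence = 'This is a sample string'
word = re.compile(r'is')

# searching from index 4 reports indices of the full string
from_pos = word.search(sentence, 4)
print(from_pos.span())    # (5, 7)

# slicing first reports indices relative to the slice
from_slice = word.search(sentence[4:])
print(from_slice.span())  # (1, 3)

# the end position is exclusive: only indices 2 and 3 are searched
bounded = word.search(sentence, 2, 4)
print(bounded.span())     # (2, 4)
```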


## Anchors

Instead of matching anywhere in the given input string, restrictions can be specified. These restrictions are made possible by assigning special meaning to certain characters and escape sequences. The characters with special meaning are known as **metacharacters** in regular expression parlance. If you need to match those characters literally, escape them with a `\` character.
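As a small illustration of escaping, the sketch below uses `^`, one of the anchors listed in the table that follows. The sample string `expr` is made up for this example, and `re.escape` is shown as one possible way to escape a literal string automatically.

```python
import re

expr = 'a^2 + b^2 = c^2'

# unescaped '^' acts as an anchor, so it cannot match the literal character
unescaped = bool(re.search(r'b^2', expr))
# escaped '\^' matches the literal '^'
escaped = bool(re.search(r'b\^2', expr))
print(unescaped, escaped)  # False True

# re.escape adds the backslashes for you
auto = re.escape('b^2')
print(auto)                         # b\^2
print(bool(re.search(auto, expr)))  # True
```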


| Note | Description |
| ---: | :--- |
| `\A` | restricts the match to the start of string |
| `\Z` | restricts the match to the end of string |
| `^` | restricts the match to the start of line |
| `$` | restricts the match to the end of line |
| `\b` | restricts the match to the start/end of words |
| `\B` | matches wherever `\b` doesn't match |
| `re.fullmatch` | ensures pattern matches the entire input string; signature: `re.fullmatch(pattern, string, flags=0)` |
| `\n` | line separator, dos-style files need special attention |
| metacharacter | characters with special meaning in RE |
| `re.MULTILINE` or `re.M` | flag to treat input as multiline string |
| word | characters: alphabets, digits, underscore |
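The table can be exercised with a few short calls. This is a hedged sketch rather than an exhaustive tour: the sample strings and variable names are chosen only for illustration.

```python
import re

# \A and \Z anchor to the start and end of the whole string
starts_with_cat = bool(re.search(r'\Acat', 'cater'))
starts_in_middle = bool(re.search(r'\Acat', 'concat'))
ends_with_cat = bool(re.search(r'cat\Z', 'concat'))
print(starts_with_cat, starts_in_middle, ends_with_cat)  # True False True

# \b matches at word boundaries, \B matches everywhere else
line = 'par spar apparent spare part'
whole_word = re.sub(r'\bpar\b', 'X', line)
inside_word = re.sub(r'\Bpar\B', 'X', line)
print(whole_word)   # X spar apparent spare part
print(inside_word)  # par spar apXent sXe part

# re.fullmatch succeeds only if the pattern covers the entire string
exact = bool(re.fullmatch(r'cat', 'cat'))
partial = bool(re.fullmatch(r'cat', 'cat food'))
print(exact, partial)  # True False

# ^ matches after a newline only when re.M (re.MULTILINE) is set
text = 'hi there\nhave a nice day'
without_flag = bool(re.search(r'^have', text))
with_flag = bool(re.search(r'^have', text, flags=re.M))
print(without_flag, with_flag)  # False True
```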

## Alternation and Grouping

Similar to logical OR, alternation in regular expressions allows you to combine multiple patterns.


In [103]:

# match either 'cat' or 'dog'
print(bool(re.search(r'cat|dog', 'I like cats')))
print(bool(re.search(r'cat|dog', 'I like dogs')))
print(bool(re.search(r'cat|dog', 'I like parrots')))

# replace either 'cat' at start of string or 'cat' at end of word
print(re.sub(r'\Acat|cat\b', 'X', 'catapults concatenate cat scat'))
# replace either 'cat' or 'dog' or 'fox' with 'mammal'
print(re.sub(r'cat|dog|fox', 'mammal', 'cat dog bee parrot fox'))

# a helpful trick when many alternatives are required: build the pattern with join
words = ['cat', 'dog', 'fox']
"|".join(words)


True
True
False
Xapults concatenate X sX
mammal mammal bee parrot mammal


'cat|dog|fox'

## Grouping

Often, there are some common things among the alternatives. It could be common characters or qualifiers like the anchors. In such cases, you can group them using a pair of parentheses metacharacters.  
Similar to $$a(b+c)d = abd+acd$$ in maths, you get $$a(b|c)d = abd|acd$$ in regular expressions.

In [104]:
# without grouping
print(re.sub(r'reform|rest', 'X', 'red reform read arrest'))
# with grouping
print(re.sub(r're(form|st)', 'X', 'red reform read arrest'))

# without grouping
print(re.sub(r'\bpar\b|\bpart\b', 'X', 'par spare part party'))

# taking out common anchors
print(re.sub(r'\b(par|part)\b', 'X', 'par spare part party'))
# taking out common characters as well
# you'll later learn a better technique instead of using empty alternate
print(re.sub(r'\bpar(|t)\b', 'X', 'par spare part party'))


red X read arX
red X read arX
X spare X party
X spare X party
X spare X party


## Precedence rules

Say you want to replace either `are` or `spared`. Which one should get precedence: the bigger word `spared`, the substring `are` inside it, or something else?

*In Python, the alternative that matches earliest in the input string gets precedence.*  
*If two alternatives match at the same index, precedence goes left to right, in the order the alternatives are declared.*

In [105]:
words = 'lion elephant are rope not'

# span shows the start and end+1 index of matched portion
# match shows the text that satisfied the search criteria

print(re.search(r'on', words))
print(re.search(r'ant', words))

# starting index of 'on' < index of 'ant' for given string input
# so 'on' will be replaced irrespective of order
# count optional argument here restricts no. of replacements to 1
print(re.sub(r'on|ant', 'X', words, count=1))
print(re.sub(r'ant|on', 'X', words, count=1))


<re.Match object; span=(2, 4), match='on'>
<re.Match object; span=(10, 13), match='ant'>
liX elephant are rope not
liX elephant are rope not
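Both rules can be checked with the `are`/`spared` case from the opening question plus a same-index case. The sketch below uses made-up input strings (`text` and `mood`) purely for illustration.

```python
import re

# 'spared' starts at index 0 and 'are' at index 2,
# so 'spared' wins regardless of the order of alternatives
text = 'spared no one'
earliest_a = re.sub(r'are|spared', 'X', text)
earliest_b = re.sub(r'spared|are', 'X', text)
print(earliest_a)  # X no one
print(earliest_b)  # X no one

# 'year' and 'years' both match at index 5,
# so the alternative declared first wins
mood = 'best years'
short_first = re.sub(r'year|years', 'X', mood)
long_first = re.sub(r'years|year', 'X', mood)
print(short_first)  # best Xs
print(long_first)   # best X
```

When a shorter alternative is a prefix of a longer one, listing the shorter one first means the longer one can never match at that position.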


In [107]:
# robust workaround: sort the alternatives by length, longest first
words = ['hand', 'handy', 'handful']

alt = re.compile('|'.join(sorted(words, key=len, reverse=True)))
alt.pattern

'handful|handy|hand'
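To see what the sorting buys, the sketch below compares the unsorted and length-sorted patterns on a made-up input string. The names `unsorted` and `by_length` are illustrative only.

```python
import re

words = ['hand', 'handy', 'handful']
text = 'hands handful handed handy'

# alternatives in list order: 'hand' always wins at a shared start index
unsorted = re.compile('|'.join(words))
# longest alternatives first, as in the workaround above
by_length = re.compile('|'.join(sorted(words, key=len, reverse=True)))

unsorted_result = unsorted.sub('X', text)
sorted_result = by_length.sub('X', text)
print(unsorted_result)  # Xs Xful Xed Xy
print(sorted_result)    # Xs X Xed X
```

If the words could contain metacharacters, passing each one through `re.escape` before joining would be a sensible extra precaution.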