# Regular Expression Exercises

* Debugger: When debugging regular expressions, the best tool is [Regex101](https://regex101.com/). This is an interactive tool that lets you visualize a regular expression in action.
* Tutorial: I tend to like RealPython's tutorials; here is theirs on [Regular Expressions](https://realpython.com/regex-python/).
* Tutorial: The [Official Python tutorial on Regular Expressions](https://docs.python.org/3/howto/regex.html) is not a bad introduction.
* Cheat Sheet: People often make use of [Cheat Sheets](https://www.debuggex.com/cheatsheet/regex/python) when they have to do a lot of Regular Expressions.
* Documentation: If you need it, the official [Python documentation on the `re` module](https://docs.python.org/3/library/re.html) can also be a resource.

PLEASE FILL IN THE FOLLOWING:
    
* Your name: Matthew Ueckermann and Amani Arman
* Link to the Github repo with this file: https://github.com/matthewueckermann/regex

In [1]:
import re

## 1. Introduction

**a)** Check whether the given strings contain `0xB0`. Display a boolean result as shown below.

In [2]:
line1 = 'start address: 0xA0, func1 address: 0xC0'
line2 = 'end address: 0xFF, func2 address: 0xB0'

assert bool(re.search(r'0xB0', line1)) == False
assert bool(re.search(r'0xB0', line2)) == True

**b)** Replace all occurrences of `5` with `five` for the given string.

In [3]:
ip = 'They ate 5 apples and 5 oranges'

assert re.sub("5", "five", ip) == 'They ate five apples and five oranges'

**c)** Replace the first occurrence of `5` with `five` for the given string.

In [4]:
ip = 'They ate 5 apples and 5 oranges'

assert re.sub("5", "five", ip, count=1) == 'They ate five apples and 5 oranges'

**d)** For the given list, filter all elements that do *not* contain `e`.

In [5]:
items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

assert [w for w in items if not re.search('e', w)] == ['goal', 'sit']

**e)** Replace all occurrences of `note` irrespective of case with `X`.

In [6]:
ip = 'This note should not be NoTeD'

assert re.sub(r'(?i)note',"X",ip) == 'This X should not be XD'

**f)** Check if `at` is present in the given byte input data.

In [7]:
ip = b'tiger imp goat'

# bytes input needs a bytes pattern
assert bool(re.search(rb'at', ip)) == True
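
Because the exercise mentions byte data, a short sketch of how `re` treats `bytes` input may help. The names `data` and `text` are made up for illustration; the point is only that the pattern and the input need to be the same type.

```python
import re

data = b'tiger imp goat'   # bytes input (hypothetical sample)
text = 'tiger imp goat'    # str input with the same content

# A bytes pattern is needed to search bytes data.
found_in_bytes = bool(re.search(rb'at', data))

# A str pattern works on str input.
found_in_text = bool(re.search(r'at', text))

# Mixing a str pattern with bytes input raises TypeError.
try:
    re.search(r'at', data)
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(found_in_bytes, found_in_text, mixed_ok)
```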

**g)** For the given input string, display all lines not containing `start` irrespective of case.

In [8]:
para = '''good start
Start working on that
project you always wanted
stars are shining brightly
hi there
start and try to
finish the book
bye'''

pat = re.compile(r'start', flags=re.I)   # re.I makes the match case-insensitive
for line in para.split('\n'):
    if not pat.search(line):
        print(line)

project you always wanted
stars are shining brightly
hi there
finish the book
bye

**h)** For the given list, filter all elements that contain either `a` or `w`.

In [9]:
items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

##### add your solution here
assert [w for w in items if re.search('a', w) or re.search('w', w)] == ['goal', 'new', 'eat']

**i)** For the given list, filter all elements that contain both `e` and `n`.

In [10]:
items = ['goal', 'new', 'user', 'sit', 'eat', 'dinner']

##### add your solution here
assert [w for w in items if re.search('e', w) and re.search('n', w)] == ['new', 'dinner']

**j)** For the given string, replace `0xA0` with `0x7F` and `0xC0` with `0x1F`.

In [None]:
ip = 'start address: 0xA0, func1 address: 0xC0'

##### add your solution here
assert ___ == 'start address: 0x7F, func1 address: 0x1F'

<br>

## 2. Anchors

**a)** Check if the given strings start with `be`.

In [None]:
line1 = 'be nice'
line2 = '"best!"'
line3 = 'better?'
line4 = 'oh no\nbear spotted'

pat = re.compile(r'^be')       ##### add your solution here
assert bool(pat.search(line1)) == True
assert bool(pat.search(line2)) == False
assert bool(pat.search(line3)) == True
assert bool(pat.search(line4)) == False
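
Several exercises in this section depend on the difference between the start of the string and the start of a line. The following is a minimal sketch of that difference, using a made-up `sample` string; it compares `^` with and without `re.M`, and `\A`.

```python
import re

sample = 'oh no\nbear spotted'   # hypothetical two-line input

# Without re.M, ^ matches only at the start of the whole string.
start_of_string = bool(re.search(r'^be', sample))

# With re.M, ^ also matches right after each newline.
start_of_line = bool(re.search(r'^be', sample, flags=re.M))

# \A always means start of string, even when re.M is set.
strict_start = bool(re.search(r'\Abe', sample, flags=re.M))

print(start_of_string, start_of_line, strict_start)
```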

**b)** For the given input string, change only the whole word `red` to `brown`.

In [None]:
words = 'bred red spread credible'

assert re.sub() == 'bred brown spread credible'

**c)** For the given input list, filter all elements that contain `42` surrounded by word characters.

In [24]:
words = ['hi42bye', 'nice1423', 'bad42', 'cool_42a', 'fake4b']

assert [w for w in words if re.search(r'\w+42\w+',w)] == ['hi42bye', 'nice1423', 'cool_42a']

**d)** For the given input list, filter all elements that start with `den` or end with `ly`.

In [None]:
items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']

assert [e for e in items if ] == ['lovely', '2 lonely', 'dent']

**e)** For the given input string, change the whole word `mall` to `1234` only if it is at the start of a line.

In [39]:
para = '''
ball fall wall tall
mall call ball pall
wall mall ball fall
mallet wallet malls'''

assert re.sub(r'\nmall\s',"\n1234 ",para) == """
ball fall wall tall
1234 call ball pall
wall mall ball fall
mallet wallet malls"""
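
The substitution above relies on a literal newline before `mall`, so it would miss a match on the very first line of a string. A minimal alternative sketch, assuming the multiline flag `re.M` and a word boundary are acceptable here (`text` is a made-up sample):

```python
import re

text = 'mall call\nwall mall\nmallet malls'   # hypothetical input

# ^ with re.M anchors at the start of every line; \b keeps 'mallet' intact.
result = re.sub(r'^mall\b', '1234', text, flags=re.M)
print(result)
```

With this form the replacement string does not need to restore the newline or the trailing space, because neither is part of the match.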


**f)** For the given list, filter all elements having a line starting with `den` or ending with `ly`.

In [None]:
items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\nfar', 'dent']

##### add your solution here
assert ___ == ['lovely', '1\ndentist', '2 lonely', 'fly\nfar', 'dent']

**g)** For the given input list, filter all whole elements `12\nthree` irrespective of case.

In [None]:
items = ['12\nthree\n', '12\nThree', '12\nthree\n4', '12\nthree']
##### add your solution here
assert ___ == ['12\nThree', '12\nthree']

**h)** For the given input list, replace `hand` with `X` for all elements that start with `hand` followed by at least one word character.

In [None]:
items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']

##### add your solution here
assert ___ == ['Xed', 'hand', 'Xy', 'unhanded', 'Xle', 'hand-2']

**i)** For the given input list, filter all elements starting with `h`. Additionally, replace `e` with `X` for these filtered elements.

In [None]:
items = ['handed', 'hand', 'handy', 'unhanded', 'handle', 'hand-2']

##### add your solution here
assert ___ == ['handXd', 'hand', 'handy', 'handlX', 'hand-2']

<br>

## 3. Alternation and Grouping

**a)** For the given input list, filter all elements that start with `den` or end with `ly`.

In [None]:
items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\n', 'dent']

##### add your solution here
assert ___ == ['lovely', '2 lonely', 'dent']

**b)** For the given list, filter all elements having a line starting with `den` or ending with `ly`.

In [None]:
items = ['lovely', '1\ndentist', '2 lonely', 'eden', 'fly\nfar', 'dent']

##### add your solution here
assert ___ == ['lovely', '1\ndentist', '2 lonely', 'fly\nfar', 'dent']

**c)** For the given input strings, replace all occurrences of `removed` or `reed` or `received` or `refused` with `X`.

In [None]:
s1 = 'creed refuse removed read'
s2 = 'refused reed redo received'

pat = re.compile()        ##### add your solution here
assert pat.sub('X', s1) == 'cX refuse X read'
assert pat.sub('X', s2) == 'X X redo X'

**d)** For the given input strings, replace all matches from the list `words` with `A`.

In [None]:
s1 = 'plate full of slate'
s2 = "slated for later, don't be late"
words = ['late', 'later', 'slated']

pat = re.compile()        ##### add your solution here
assert pat.sub('A', s1) == 'pA full of sA'
assert pat.sub('A', s2) == "A for A, don't be A"
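
When alternatives are built from a list, their order can matter, because Python tries them from left to right and takes the first one that matches. The sketch below illustrates this with a made-up list `terms`; sorting longest-first is one common approach, not the only one.

```python
import re

terms = ['cat', 'cats']   # hypothetical list; 'cat' is a prefix of 'cats'
text = 'cats and a cat'

# Alternatives are tried left to right, so 'cat' wins and leaves the 's' behind.
naive = re.compile('|'.join(terms))
naive_result = naive.sub('X', text)

# Sorting longest-first lets the longer alternative match when it can.
ordered = re.compile('|'.join(sorted(terms, key=len, reverse=True)))
ordered_result = ordered.sub('X', text)

print(naive_result)
print(ordered_result)
```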

**e)** Filter all whole elements from the input list `items` based on elements listed in `words`.

In [None]:
items = ['slate', 'later', 'plate', 'late', 'slates', 'slated ']
words = ['late', 'later', 'slated']

pat = re.compile()       ##### add your solution here

##### add your solution here
assert ___ == ['later', 'late']

<br>

## 4. Escaping metacharacters

**a)** Transform the given input strings to the expected output using same logic on both strings.

In [None]:
str1 = '(9-2)*5+qty/3'
str2 = '(qty+4)/2-(9-2)*5+pq/4'

##### add your solution here for str1
assert ___ == '35+qty/3'
##### add your solution here for str2
assert ___ == '(qty+4)/2-35+pq/4'

**b)** Replace `(4)\|` with `2` only at the start or end of given input strings.

In [None]:
s1 = r'2.3/(4)\|6 foo 5.3-(4)\|'
s2 = r'(4)\|42 - (4)\|3'
s3 = 'two - (4)\\|\n'

pat = re.compile()        ##### add your solution here
assert pat.sub('2', s1) == '2.3/(4)\\|6 foo 5.3-2'
assert pat.sub('2', s2) == '242 - (4)\\|3'
assert pat.sub('2', s3) == 'two - (4)\\|\n'

**c)** Replace any matching element from the list `items` with `X` for the given input strings. Match the elements from `items` literally. Assume no two elements of `items` will result in any matching conflict.

In [None]:
items = ['a.b', '3+n', r'x\y\z', 'qty||price', '{n}']
pat = re.compile()      ##### add your solution here
assert pat.sub('X', '0a.bcd') == '0Xcd'
assert pat.sub('X', 'E{n}AMPLE') == 'EXAMPLE'
assert pat.sub('X', r'43+n2 ax\y\ze') == '4X2 aXe'
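
Matching list elements literally usually involves `re.escape`. The following is a small sketch on a made-up list `terms`, deliberately different from `items` above, showing how escaped strings can be joined into one alternation.

```python
import re

terms = ['1+1', 'a.b']   # hypothetical items containing metacharacters

# re.escape() backslash-escapes metacharacters so each term matches literally.
literal = re.compile('|'.join(re.escape(t) for t in terms))
result = literal.sub('X', '1+1=2 a.b axb')
print(result)
```

Note that `axb` is left alone: once escaped, the `.` in `a.b` no longer matches an arbitrary character.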

**d)** Replace backspace character `\b` with a single space character for the given input string.

In [None]:
ip = '123\b456'
ip

In [None]:
assert re.sub() == '123 456'

**e)** Replace all occurrences of `\e` with `e`.

In [None]:
ip = r'th\er\e ar\e common asp\ects among th\e alt\ernations'

assert re.sub() == 'there are common aspects among the alternations'

**f)** Replace any matching item from the list `eqns` with `X` for the given string `ip`. Match the items from `eqns` literally.

In [None]:
ip = '3-(a^b)+2*(a^b)-(a/b)+3'
eqns = ['(a^b)', '(a/b)', '(a^b)+2']

##### add your solution here

assert pat.sub('X', ip) == '3-X*X-X+3'

<br>

## 5. Dot metacharacter and Quantifiers

Since the `.` metacharacter doesn't match the newline character by default, assume that the input strings in the following exercises will not contain newline characters.
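
As a quick illustration of that default, the sketch below compares `.` with and without the `re.S` (`re.DOTALL`) flag on a made-up string `text`.

```python
import re

text = 'hi\nthere'   # hypothetical input containing a newline

# By default, . stops at the newline.
default_match = re.search(r'hi.*', text)[0]

# With re.S (re.DOTALL), . matches the newline as well.
dotall_match = re.search(r'hi.*', text, flags=re.S)[0]

print(repr(default_match), repr(dotall_match))
```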

**a)** Replace `42//5` or `42/5` with `8` for the given input.

In [None]:
ip = 'a+42//5-c pressure*3+42/5-14256'

assert re.sub() == 'a+8-c pressure*3+8-14256'

**b)** For the list `items`, filter all elements starting with `hand` and ending with at most one more character or `le`.

In [None]:
items = ['handed', 'hand', 'handled', 'handy', 'unhand', 'hands', 'handle']

##### add your solution here
assert ___ == ['hand', 'handy', 'hands', 'handle']

**c)** Use `re.split` to get the output as shown for the given input strings.

In [None]:
eqn1 = 'a+42//5-c'
eqn2 = 'pressure*3+42/5-14256'
eqn3 = 'r*42-5/3+42///5-42/53+a'
##### add your solution here for eqn1
assert ___ == ['a+', '-c']
##### add your solution here for eqn2
assert ___ == ['pressure*3+', '-14256']
##### add your solution here for eqn3
assert ___ == ['r*42-5/3+42///5-', '3+a']

**d)** For the given input strings, remove everything from the first occurrence of `i` till the end of the string.

In [None]:
s1 = 'remove the special meaning of such constructs'
s2 = 'characters while constructing'

pat = re.compile()        ##### add your solution here

assert pat.sub('', s1) == 'remove the spec'

assert pat.sub('', s2) == 'characters wh'

**e)** For the given strings, construct a RE to get output as shown.

In [None]:
str1 = 'a+b(addition)'
str2 = 'a/b(division) + c%d(#modulo)'
str3 = 'Hi there(greeting). Nice day(a(b)'

remove_parentheses = re.compile()     ##### add your solution here

assert remove_parentheses.sub('', str1) == 'a+b'
assert remove_parentheses.sub('', str2) == 'a/b + c%d'
assert remove_parentheses.sub('', str3) == 'Hi there. Nice day'

**f)** Correct the given RE to get the expected output.

In [None]:
words = 'plink incoming tint winter in caution sentient'
change = re.compile(r'int|in|ion|ing|inco|inter|ink')

# wrong output
assert change.sub('X', words) == 'plXk XcomXg tX wXer X cautX sentient'

# expected output
change = re.compile()       ##### add your solution here
assert change.sub('X', words) == 'plX XmX tX wX X cautX sentient'

**g)** For the given greedy quantifiers, what would be the equivalent form using `{m,n}` representation?

* `?` is same as
* `*` is same as
* `+` is same as

**h)** `(a*|b*)` is the same as `(a|b)*`. True or False?
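
Questions like g) and h) can be explored empirically. The helper below is a hypothetical aid, not part of the `re` module: it reports whether two patterns give the same `re.fullmatch` results over a handful of sample strings. Agreement on samples is only evidence of equivalence, not proof. The patterns shown are deliberately different from the ones in the questions.

```python
import re

def same_fullmatches(pat1, pat2, samples):
    """Return True if both patterns fully match exactly the same samples."""
    return all(
        bool(re.fullmatch(pat1, s)) == bool(re.fullmatch(pat2, s))
        for s in samples
    )

samples = ['', 'a', 'aa', 'aaa', 'ab', 'b']

# aa+ and a{2,} both mean "two or more a".
print(same_fullmatches(r'aa+', r'a{2,}', samples))    # True

# a{2} and a{2,} differ: only the second accepts 'aaa'.
print(same_fullmatches(r'a{2}', r'a{2,}', samples))   # False
```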


**i)** For the given input strings, remove everything from the first occurrence of `test` (irrespective of case) till the end of the string, provided `test` isn't at the end of the string.

In [None]:
s1 = 'this is a Test'
s2 = 'always test your RE for corner cases'
s3 = 'a TEST of skill tests?'

pat = re.compile()      ##### add your solution here

assert pat.sub('', s1) == 'this is a Test'
assert pat.sub('', s2) == 'always '
assert pat.sub('', s3) == 'a '

**j)** For the input list `words`, filter all elements starting with `s` and containing `e` and `t` in any order.

In [None]:
words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 'tests', 'site']

##### add your solution here
assert ___ == ['subtle', 'sets', 'site']

**k)** For the input list `words`, remove all elements having fewer than `6` characters.

In [None]:
words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 'tests', 'site']

##### add your solution here
assert ___ == ['sequoia', 'subtle', 'exhibit']

**l)** For the input list `words`, filter all elements starting with `s` or `t` and having a maximum of `6` characters.

In [None]:
words = ['sequoia', 'subtle', 'exhibit', 'asset', 'sets', 'tests', 'site']

##### add your solution here
assert ___ == ['subtle', 'sets', 'tests', 'site']

**m)** Can you reason out why this code results in the output shown? The aim was to remove all `<characters>` patterns but not the `<>` ones. The expected result was `'a 1<> b 2<> c'`.

In [None]:
ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

assert re.sub(r'<.+?>', '', ip) == 'a 1 2'
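
One way to investigate the question is to look at what each match actually covers. The sketch below only lists the matches with `re.findall`; the reasoning is left to the comment and to the reader.

```python
import re

ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

# .+? must consume at least one character, so after the '<' of '<>'
# it takes the '>' itself and keeps extending until the next '>'.
matches = re.findall(r'<.+?>', ip)
print(matches)
```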

**n)** Use `re.split` to get the output as shown below for given input strings.

In [None]:
s1 = 'go there  //   "this // that"'
s2 = 'a//b // c//d e//f // 4//5'
s3 = '42// hi//bye//see // carefully'

pat = re.compile()     ##### add your solution here

assert pat.split() == ['go there', '"this // that"']
assert pat.split() == ['a//b', 'c//d e//f // 4//5']
assert pat.split() == ['42// hi//bye//see', 'carefully']

<br>

## 6. Working with matched portions

**a)** For the given strings, extract the matching portion from first `is` to last `t`.

In [None]:
str1 = 'This the biggest fruit you have seen?'
str2 = 'Your mission is to read and practice consistently'

pat = re.compile()     ##### add your solution here

##### add your solution here for str1
assert ___ == 'is the biggest fruit'
##### add your solution here for str2
assert ___ == 'ission is to read and practice consistent'

**b)** Find the starting index of first occurrence of `is` or `the` or `was` or `to` for the given input strings.

In [None]:
s1 = 'match after the last newline character'
s2 = 'and then you want to test'
s3 = 'this is good bye then'
s4 = 'who was there to see?'

pat = re.compile()      ##### add your solution here

##### add your solution here for s1
assert ___ == 12
##### add your solution here for s2
assert ___ == 4
##### add your solution here for s3
assert ___ == 2
##### add your solution here for s4
assert ___ == 4

**c)** Find the starting index of last occurrence of `is` or `the` or `was` or `to` for the given input strings.

In [None]:
s1 = 'match after the last newline character'
s2 = 'and then you want to test'
s3 = 'this is good bye then'
s4 = 'who was there to see?'

pat = re.compile()      ##### add your solution here

##### add your solution here for s1
assert ___ == 12
##### add your solution here for s2
assert ___ == 18
##### add your solution here for s3
assert ___ == 17
##### add your solution here for s4
assert ___ == 14

**d)** The given input string contains `:` exactly once. Extract all characters after the `:` as output.

In [None]:
ip = 'fruits:apple, mango, guava, blueberry'

##### add your solution here
assert ___ == 'apple, mango, guava, blueberry'

**e)** The given input strings contain some text followed by `-` followed by a number. Replace that number with its `log` value using `math.log()`.

In [None]:
s1 = 'first-3.14'
s2 = 'next-123'

pat = re.compile()      ##### add your solution here

import math
assert pat.sub() == 'first-1.144222799920162'
assert pat.sub() == 'next-4.812184355372417'
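
Computing a replacement from the matched text usually means passing a function to `re.sub` instead of a string. A minimal sketch, using a made-up input and a hypothetical helper named `double` rather than the logarithm asked for above:

```python
import re

text = 'a1 b22'   # hypothetical input

def double(m):
    """Replacement function: receives a Match object, returns the new text."""
    return str(int(m[0]) * 2)

# re.sub calls double() once per match and inserts its return value.
result = re.sub(r'\d+', double, text)
print(result)
```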

**f)** Replace all occurrences of `par` with `spar`, `spare` with `extra` and `park` with `garden` for the given input strings.

In [None]:
str1 = 'apartment has a park'
str2 = 'do you have a spare cable'
str3 = 'write a parser'

##### add your solution here

assert pat.sub() == 'aspartment has a garden'
assert pat.sub() == 'do you have a extra cable'
assert pat.sub() == 'write a sparser'

**g)** Extract all words between `(` and `)` from the given input string as a list. Assume that the input will not contain any broken parentheses.

In [None]:
ip = 'another (way) to reuse (portion) matched (by) capture groups'

assert re.findall() == ['way', 'portion', 'by']

**h)** Extract all occurrences of `<` up to next occurrence of `>`, provided there is at least one character in between `<` and `>`.

In [None]:
ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

assert re.findall() == ['<apple>', '<> b<bye>', '<> c<cat>']

**i)** Use `re.findall` to get the output as shown below for the given input strings. Note the characters used in the input strings carefully.

In [None]:
row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '

pat = re.compile()       ##### add your solution here

assert pat.findall(row1) == [('-2', '5'), ('4', '+3'), ('+42', '-53'), ('4356246', '-357532354')]
assert pat.findall(row2) == [('1.32', '-3.14'), ('634', '5.63'), ('63.3e3', '9907809345343.235')]

**j)** This is an extension to the previous question.

* For `row1`, find the sum of integers of each tuple element. For example, sum of `-2` and `5` is `3`.
* For `row2`, find the sum of floating-point numbers of each tuple element. For example, sum of `1.32` and `-3.14` is `-1.82`.

In [None]:
row1 = '-2,5 4,+3 +42,-53 4356246,-357532354 '
row2 = '1.32,-3.14 634,5.63 63.3e3,9907809345343.235 '

# should be same as previous question
pat = re.compile()       ##### add your solution here

##### add your solution here for row1
assert ___ == [3, 7, -11, -353176108]

##### add your solution here for row2
assert ___ == [-1.82, 639.63, 9907809408643.234]

**k)** Use `re.split` to get the output as shown below.

In [None]:
ip = '42:no-output;1000:car-truck;SQEX49801'

assert re.split() == ['42', 'output', '1000', 'truck', 'SQEX49801']

**l)** For the given list of strings, change the elements into a tuple of original element and number of times `t` occurs in that element.

In [None]:
words = ['sequoia', 'attest', 'tattletale', 'asset']

##### add your solution here
assert ___ == [('sequoia', 0), ('attest', 3), ('tattletale', 4), ('asset', 1)]

**m)** The given input string has fields separated by `:`. Each field contains four uppercase letters optionally followed by two digits. Ignore the last field, which is empty. See [docs.python: Match.groups](https://docs.python.org/3/library/re.html#re.Match.groups) and use `re.finditer` to get the output as shown below. If the optional digits aren't present, show `'NA'` instead of `None`.

In [None]:
ip = 'TWXA42:JWPA:NTED01:'

##### add your solution here
assert ___ == [('TWXA', '42'), ('JWPA', 'NA'), ('NTED', '01')]

> **Note:** `Match.groups` behaves differently from `re.findall`, which gives an empty string instead of `None` when a capture group doesn't participate in the match.
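
The difference described in the note can be seen in a small sketch. The pattern and the `text` sample below are made up for illustration and are not the ones from the exercise.

```python
import re

pat = re.compile(r'([a-z]+)(\d+)?')
text = 'ab12 cd'   # hypothetical input; 'cd' has no digits

# findall reports a non-participating group as an empty string.
via_findall = pat.findall(text)

# Match.groups() reports it as None unless a default is supplied.
via_groups = [m.groups() for m in pat.finditer(text)]
via_default = [m.groups(default='NA') for m in pat.finditer(text)]

print(via_findall)
print(via_groups)
print(via_default)
```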

**n)** Convert the comma separated strings to corresponding `dict` objects as shown below.

In [None]:
row1 = 'name:rohan,maths:75,phy:89,'
row2 = 'name:rose,maths:88,phy:92,'

pat = re.compile()      ##### add your solution here

##### add your solution here for row1
assert ___ == {'name': 'rohan', 'maths': '75', 'phy': '89'}
##### add your solution here for row2
assert ___ == {'name': 'rose', 'maths': '88', 'phy': '92'}

<br>

## 7. Character class

**a)** For the list `items`, filter all elements starting with `hand` and ending with `s` or `y` or `le`.

In [None]:
items = ['-handy', 'hand', 'handy', 'unhand', 'hands', 'handle']

##### add your solution here
assert ___ == ['handy', 'hands', 'handle']

**b)** Replace all whole words `reed` or `read` or `red` with `X`.

In [None]:
ip = 'redo red credible :read: rod reed'

##### add your solution here
assert ___ == 'redo X credible :X: rod X'

**c)** For the list `words`, filter all elements containing `e` or `i` followed by `l` or `n`. Note that the order mentioned should be followed.

In [None]:
words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

##### add your solution here
assert ___ == ['surrender', 'unicorn', 'eel']

**d)** For the list `words`, filter all elements containing `e` or `i` and `l` or `n` in any order.

In [None]:
words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

##### add your solution here
assert ___ == ['surrender', 'unicorn', 'newer', 'eel']

**e)** Extract all hex character sequences, with `0x` optional prefix. Match the characters case insensitively, and the sequences shouldn't be surrounded by other word characters.

In [None]:
str1 = '128A foo 0xfe32 34 0xbar'
str2 = '0XDEADBEEF place 0x0ff1ce bad'

hex_seq = re.compile()        ##### add your solution here

##### add your solution here for str1
assert ___ == ['128A', '0xfe32', '34']

##### add your solution here for str2
assert ___ == ['0XDEADBEEF', '0x0ff1ce', 'bad']

**f)** Delete from `(` to the next occurrence of `)` unless they contain parentheses characters in between.

In [None]:
str1 = 'def factorial()'
str2 = 'a/b(division) + c%d(#modulo) - (e+(j/k-3)*4)'
str3 = 'Hi there(greeting). Nice day(a(b)'

remove_parentheses = re.compile()      ##### add your solution here

assert remove_parentheses.sub('', str1) == 'def factorial'
assert remove_parentheses.sub('', str2) == 'a/b + c%d - (e+*4)'
assert remove_parentheses.sub('', str3) == 'Hi there. Nice day(a'

**g)** For the list `words`, filter all elements not starting with `e` or `p` or `u`.

In [None]:
words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']

##### add your solution here
assert ___ == ['surrender', 'newer', 'door']

**h)** For the list `words`, filter all elements not containing `u` or `w` or `ee` or `-`.

In [None]:
words = ['p-t', 'you', 'tea', 'heel', 'owe', 'new', 'reed', 'ear']

##### add your solution here
assert ___ == ['tea', 'ear']

**i)** The given input strings contain fields separated by `,` and fields can be empty too. Replace last three fields with `WHTSZ323`.

In [None]:
row1 = '(2),kite,12,,D,C,,'
row2 = 'hi,bye,sun,moon'

pat = re.compile()      ##### add your solution here

assert pat.sub() == '(2),kite,12,,D,WHTSZ323'
assert pat.sub() == 'hi,WHTSZ323'

**j)** Split the given strings based on consecutive sequence of digit or whitespace characters.

In [None]:
str1 = 'lion \t Ink32onion Nice'
str2 = '**1\f2\n3star\t7 77\r**'

pat = re.compile()       ##### add your solution here

assert pat.split(str1) == ['lion', 'Ink', 'onion', 'Nice']
assert pat.split(str2) == ['**', 'star', '**']

**k)** Delete all occurrences of the sequence `<characters>` where `characters` is one or more non `>` characters and cannot be empty.

In [None]:
ip = 'a<apple> 1<> b<bye> 2<> c<cat>'

##### add your solution here
assert ___ == 'a 1<> b 2<> c'

**l)** `\b[a-z](on|no)[a-z]\b` is same as `\b[a-z][on]{2}[a-z]\b`. True or False? Sample input lines shown below might help to understand the differences, if any.

In [None]:
print('known\nmood\nknow\npony\ninns')
known
mood
know
pony
inns

**m)** For the given list, filter all elements containing any number sequence greater than `624`.

In [None]:
items = ['hi0000432abcd', 'car00625', '42_624 0512', '3.14 96 2 foo1234baz']

##### add your solution here
assert ___ == ['car00625', '3.14 96 2 foo1234baz']

**n)** Count the maximum depth of nested braces for the given strings. Unbalanced or wrongly ordered braces should return `-1`. Note that this will require a mix of regular expressions and Python code.

In [None]:
def max_nested_braces(ip):
##### add your solution here

assert max_nested_braces('a*b') == 0
assert max_nested_braces('}a+b{') == -1
assert max_nested_braces('a*b+{}') == 1
assert max_nested_braces('{{a+2}*{b+c}+e}') == 2
assert max_nested_braces('{{a+2}*{b+{c*d}}+e}') == 3
assert max_nested_braces('{{a+2}*{\n{b+{c*d}}+e*d}}') == 4
assert max_nested_braces('a*{b+c*{e*3.14}}}') == -1

**o)** By default, `str.split` method will split on whitespace and remove empty strings from the result. Which `re` module function would you use to replicate this functionality?

In [None]:
ip = ' \t\r  so  pole\t\t\t\n\nlit in to \r\n\v\f  '

assert ip.split() == ['so', 'pole', 'lit', 'in', 'to']
##### add your solution here
assert ___ == ['so', 'pole', 'lit', 'in', 'to']

**p)** Convert the given input string to two different lists as shown below.

In [None]:
ip = 'price_42 roast^\t\n^-ice==cat\neast'

##### add your solution here
assert ___ == ['price_42', 'roast', 'ice', 'cat', 'east']

##### add your solution here
assert ___ == ['price_42', ' ', 'roast', '^\t\n^-', 'ice', '==', 'cat', '\n', 'east']

**q)** Filter all elements whose first non-whitespace character is not a `#` character. Any element made up of only whitespace characters should be ignored as well.

In [None]:
items = ['    #comment', '\t\napple #42', '#oops', 'sure', 'no#1', '\t\r\f']

##### add your solution here
assert ___ == ['\t\napple #42', 'sure', 'no#1']

<br>

## 8. Groupings and backreferences

**a)** Replace the space character that occurs after a word ending with `a` or `r` with a newline character.

In [None]:
ip = 'area not a _a2_ roar took 22'

assert re.sub() == """area
not a
_a2_ roar
took 22"""

**b)** Add `[]` around words starting with `s` and containing `e` and `t` in any order.

In [None]:
ip = 'sequoia subtle exhibit asset sets tests site'

##### add your solution here
assert ___ == 'sequoia [subtle] exhibit asset [sets] tests [site]'

**c)** Replace with `X` all whole words that start and end with the same word character. A single-character word should be replaced with `X` too, as it satisfies the stated condition.

In [None]:
ip = 'oreo not a _a2_ roar took 22'

##### add your solution here
assert ___ == 'X not X X X took X'

**d)** Convert the given **markdown** headers to corresponding **anchor** tag. Consider the input to start with one or more `#` characters followed by space and word characters. The `name` attribute is constructed by converting the header to lowercase and replacing spaces with hyphens. Can you do it without using a capture group?

In [None]:
header1 = '# Regular Expressions'
header2 = '## Compiling regular expressions'

##### add your solution here for header1
assert ___ == '# <a name="regular-expressions"></a>Regular Expressions'
##### add your solution here for header2
assert ___ == '## <a name="compiling-regular-expressions"></a>Compiling regular expressions'

**e)** Convert the given **markdown** anchors to corresponding **hyperlinks**.

In [None]:
anchor1 = '# <a name="regular-expressions"></a>Regular Expressions'
anchor2 = '## <a name="subexpression-calls"></a>Subexpression calls'

##### add your solution here for anchor1
assert ___ == '[Regular Expressions](#regular-expressions)'
##### add your solution here for anchor2
assert ___ == '[Subexpression calls](#subexpression-calls)'

**f)** Count the number of whole words that have at least two occurrences of consecutive repeated letters. For example, words like `stillness` and `Committee` should be counted but not words like `root` or `readable` or `rotational`.

In [None]:
ip = '''oppressed abandon accommodation bloodless
carelessness committed apparition innkeeper
occasionally afforded embarrassment foolishness
depended successfully succeeded
possession cleanliness suppress'''

##### add your solution here
assert ___ == 13

**g)** For the given input string, replace all occurrences of digit sequences with only the unique non-repeating sequence. For example, `232323` should be changed to `23` and `897897` should be changed to `897`. If there are no repeats (for example `1234`) or if the repeats end prematurely (for example `12121`), it should not be changed.

In [None]:
ip = '1234 2323 453545354535 9339 11 60260260'

##### add your solution here
assert ___ == '1234 23 4535 9339 1 60260260'

**h)** Replace sequences made up of words separated by `:` or `.` by the first word of the sequence. Such sequences will end when `:` or `.` is not followed by a word character.

In [None]:
ip = 'wow:Good:2_two:five: hi-2 bye kite.777.water.'

##### add your solution here
assert ___ == 'wow hi-2 bye kite'

**i)** Replace sequences made up of words separated by `:` or `.` by the last word of the sequence. Such sequences will end when `:` or `.` is not followed by a word character.

In [None]:
ip = 'wow:Good:2_two:five: hi-2 bye kite.777.water.'

##### add your solution here
assert ___ == 'five hi-2 bye water'

**j)** Split the given input string on one or more repeated sequences of `cat`.

In [None]:
ip = 'firecatlioncatcatcatbearcatcatparrot'

##### add your solution here
assert ___ == ['fire', 'lion', 'bear', 'parrot']

**k)** For the given input string, find all occurrences of digit sequences with at least one repeating sequence. For example, `232323` and `897897`. If the repeats end prematurely, for example `12121`, it should not be matched.

In [None]:
ip = '1234 2323 453545354535 9339 11 60260260'

pat = re.compile()      ##### add your solution here

# entire sequences in the output
##### add your solution here
assert ___ == ['2323', '453545354535', '11']

# only the unique sequence in the output
##### add your solution here
assert ___ == ['23', '4535', '1']

**l)** Convert the comma separated strings to corresponding `dict` objects as shown below. The keys are `name`, `maths` and `phy` for the three fields in the input strings.

In [None]:
row1 = 'rohan,75,89'
row2 = 'rose,88,92'

pat = re.compile()      ##### add your solution here

##### add your solution here for row1
assert ___ == {'name': 'rohan', 'maths': '75', 'phy': '89'}
##### add your solution here for row2
assert ___ == {'name': 'rose', 'maths': '88', 'phy': '92'}

**m)** Surround all whole words with `()`. Additionally, if the whole word is `imp` or `ant`, delete them. Can you do it with single substitution?

In [None]:
ip = 'tiger imp goat eagle ant important'

##### add your solution here
assert ___ == '(tiger) () (goat) (eagle) () (important)'

**n)** Filter all elements that contain a sequence of lowercase letters followed by `-` followed by digits. They can be optionally surrounded by `{{` and `}}`. Any partial match shouldn't be part of the output.

In [None]:
ip = ['{{apple-150}}', '{{mango2-100}}', '{{cherry-200', 'grape-87']

##### add your solution here
assert ___ == ['{{apple-150}}', 'grape-87']

**o)** The given input string has sequences made up of words separated by `:` or `.` and such sequences will end when `:` or `.` is not followed by a word character. For all such sequences, display only the last word followed by `-` followed by first word.

In [None]:
ip = 'wow:Good:2_two:five: hi-2 bye kite.777.water.'

##### add your solution here
assert ___ == ['five-wow', 'water-kite']

<br>

## 9. Lookarounds

Starting from here, all following problems are optional!

Please use lookarounds for solving the following exercises even if you can do it without them. The exception is where lookarounds cannot be used, such as cases that would need variable-length lookbehinds.
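
As a brief reminder of the syntax, here is a minimal sketch of a lookahead, a lookbehind, and the variable-length lookbehind restriction in Python's built-in `re` module. The `text` sample and variable names are made up for illustration.

```python
import re

text = 'a1 b2 c3'   # hypothetical input

# Lookahead: the digit is checked but not consumed.
letters_before_digit = re.findall(r'[a-z](?=\d)', text)

# Lookbehind: match a digit only if 'b' comes right before it.
digit_after_b = re.findall(r'(?<=b)\d', text)

# The re module rejects lookbehind patterns of variable length.
try:
    re.compile(r'(?<=a+)\d')
    variable_ok = True
except re.error:
    variable_ok = False

print(letters_before_digit, digit_after_b, variable_ok)
```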

**a)** Replace all whole words with `X` unless they are preceded by a `(` character.

In [None]:
ip = '(apple) guava berry) apple (mango) (grape'

##### add your solution here
assert ___ == '(apple) X X) X (mango) (grape'

**b)** Replace all whole words with `X` unless they are followed by a `)` character.

In [None]:
ip = '(apple) guava berry) apple (mango) (grape'

##### add your solution here
assert ___ == '(apple) X berry) X (mango) (X'

**c)** Replace all whole words with `X` unless they are preceded by a `(` character or followed by a `)` character.

In [None]:
ip = '(apple) guava berry) apple (mango) (grape'

##### add your solution here
assert ___ == '(apple) X berry) X (mango) (grape'

**d)** Extract all whole words that do not end with `e` or `n`.

In [None]:
ip = 'at row on urn e note dust n'

##### add your solution here
assert ___ == ['at', 'row', 'dust']

**e)** Extract all whole words that do not start with `a` or `d` or `n`.

In [None]:
ip = 'at row on urn e note dust n'

##### add your solution here
assert ___ == ['row', 'on', 'urn', 'e']

**f)** Extract all whole words only if they are followed by `:` or `,` or `-`.

In [None]:
ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
assert ___ == ['poke', 'so', 'ever']

**g)** Extract all whole words only if they are preceded by `=` or `/` or `-`.

In [None]:
ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
assert ___ == ['so', 'is', 'sit']

**h)** Extract all whole words only if they are preceded by `=` or `:` and followed by `:` or `.`.

In [None]:
ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
assert ___ == ['so', 'ink']

**i)** Extract all whole words only if they are preceded by `=` or `:` or `.` or `(` or `-` and not followed by `.` or `/`.

In [None]:
ip = 'poke,on=-=so:ink.to/is(vast)ever-sit'

##### add your solution here
assert ___ == ['so', 'vast', 'sit']

**j)** Remove leading and trailing whitespace from all the individual fields, where `,` is the field separator.

In [None]:
csv1 = ' comma  ,separated ,values \t\r '
csv2 = 'good bad,nice  ice  , 42 , ,   stall   small'

remove_whitespace = re.compile()     ##### add your solution here

assert remove_whitespace.sub('', csv1) == 'comma,separated,values'
assert remove_whitespace.sub('', csv2) == 'good bad,nice  ice,42,,stall   small'

**k)** Filter all elements that satisfy all of these rules:

* should have at least two letters
* should have at least 3 digits
* should have at least one special character among `%` or `*` or `#` or `$`
* should not end with a whitespace character

In [None]:
pwds = ['hunter2', 'F2H3u%9', '*X3Yz3.14\t', 'r2_d2_42', 'A $B C1234']

##### add your solution here
assert ___ == ['F2H3u%9', 'A $B C1234']

**l)** For the given string, surround all whole words with `{}` except for whole words `par` and `cat` and `apple`.

In [None]:
ip = 'part; cat {super} rest_42 par scatter apple spar'

##### add your solution here
assert ___ == '{part}; cat {{super}} {rest_42} par {scatter} apple {spar}'

**m)** Extract integer portion of floating-point numbers for the given string. A number ending with `.` and no further digits should not be considered.

In [None]:
ip = '12 ab32.4 go 5 2. 46.42 5'

##### add your solution here
assert ___ == ['32', '46']

**n)** For the given input strings, extract all overlapping two character sequences.

In [None]:
s1 = 'apple'
s2 = '1.2-3:4'

pat = re.compile()       ##### add your solution here

##### add your solution here for s1
assert ___ == ['ap', 'pp', 'pl', 'le']
##### add your solution here for s2
assert ___ == ['1.', '.2', '2-', '-3', '3:', ':4']

**o)** The given input strings contain fields separated by `:` character. Delete `:` and the last field if there is a digit character anywhere before the last field.

In [None]:
s1 = '42:cat'
s2 = 'twelve:a2b'
s3 = 'we:be:he:0:a:b:bother'

pat = re.compile()      ##### add your solution here

assert pat.sub() == '42'
assert pat.sub() == 'twelve:a2b'
assert pat.sub() == 'we:be:he:0:a:b'

**p)** Extract all whole words unless they are preceded by `:` or `<=>` or `----` or `#`.

In [None]:
ip = '::very--at<=>row|in.a_b#b2c=>lion----east'

##### add your solution here
assert ___ == ['at', 'in', 'a_b', 'lion']

**q)** Match strings if they contain `qty` followed by `price`, but not if there is **whitespace** or the string `error` between them.

In [None]:
str1 = '23,qty,price,42'
str2 = 'qty price,oh'
str3 = '3.14,qty,6,errors,9,price,3'
str4 = '42\nqty-6,apple-56,price-234,error'
str5 = '4,price,3.14,qty,4'

neg = re.compile()       ##### add your solution here

assert bool(neg.search(str1)) == True
assert bool(neg.search(str2)) == False
assert bool(neg.search(str3)) == False
assert bool(neg.search(str4)) == True
assert bool(neg.search(str5)) == False

**r)** Can you reason out why the output shown is different for these two regular expressions?

In [None]:
ip = 'I have 12, he has 2!'

assert re.sub(r'\b..\b', r'{\g<0>}', ip) == '{I }have {12}{, }{he} has{ 2}!'

assert re.sub(r'(?<!\w)..(?!\w)', r'{\g<0>}', ip) == 'I have {12}, {he} has {2!}'

<br>

# 10. Flags

**a)** Remove everything from the first occurrence of `hat` to the last occurrence of `it` for the given input strings. Match these markers case insensitively.

In [None]:
s1 = 'But Cool THAT\nsee What okay\nwow quite'
s2 = 'it this hat is sliced HIT.'

pat = re.compile()       ##### add your solution here

assert pat.sub('', s1) == 'But Cool Te'
assert pat.sub('', s2) == 'it this .'

**b)** Delete from `start`, if it is at the beginning of a line, up to the next occurrence of `end` at the end of a line. Match these markers case insensitively.

In [None]:
para = '''
good start
start working on that
project you always wanted
to, do not let it end
hi there
start and end the end
42
Start and try to
finish the End
bye'''

pat = re.compile()        ##### add your solution here

assert pat.sub('', para) == """
good start

hi there

42

bye"""

**c)** For the given input strings, match all of these three patterns:

* `This` case sensitively
* `nice` and `cool` case insensitively

In [None]:
s1 = 'This is nice and Cool'
s2 = 'Nice and cool this is'
s3 = 'What is so nice and cool about This?'

pat = re.compile()       ##### add your solution here

assert bool(pat.search(s1)) == True
assert bool(pat.search(s2)) == False
assert bool(pat.search(s3)) == True
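
When only part of a pattern should ignore case, Python 3.6+ allows a flag to be scoped to a group with the `(?i:...)` syntax. The sketch below uses a made-up log message and hypothetical variable names; it illustrates the syntax and is not a solution to the exercise above.

```python
import re

# 'Error' must match exactly, while 'disk' may appear in any case
scoped = re.compile(r'Error.*(?i:disk)')

hit = bool(scoped.search('Error: DISK full'))    # 'DISK' matches inside the scoped group
miss = bool(scoped.search('error: disk full'))   # lowercase 'error' is outside the group

print(hit, miss)
```

Passing `flags=re.I` to `re.compile` would instead apply case insensitivity to the whole pattern.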

**d)** For the given input strings, match if the string begins with `Th` and also contains a line that starts with `There`.

In [None]:
s1 = 'There there\nHave a cookie'
s2 = 'This is a mess\nYeah?\nThereeeee'
s3 = 'Oh\nThere goes the fun'

pat = re.compile()     ##### add your solution here

assert bool(pat.search(s1)) == True
assert bool(pat.search(s2)) == True
assert bool(pat.search(s3)) == False

**e)** Explore what the `re.DEBUG` flag does. Here are some example patterns to check out.

* `re.compile(r'\Aden|ly\Z', flags=re.DEBUG)`
* `re.compile(r'\b(0x)?[\da-f]+\b', flags=re.DEBUG)`
* `re.compile(r'\b(?:0x)?[\da-f]+\b', flags=re.I|re.DEBUG)`
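
`re.DEBUG` prints its report to standard output at compile time rather than returning it. If you want to keep that text for comparison, one option is to capture it as sketched below. The pattern `a+b?` and the variable names are assumptions chosen for illustration, and the exact report format can differ between Python versions.

```python
import contextlib
import io
import re

buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    # Hypothetical pattern, deliberately not one from the list above
    re.compile(r'a+b?', flags=re.DEBUG)

debug_text = buf.getvalue()   # the parse tree as printed by the re module
print(debug_text)
```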

<br>

# 11. Unicode

**a)** Output `True` if the input string is made up entirely of ASCII characters and `False` otherwise. Assume the inputs are non-empty strings. Any character that isn't part of the 7-bit ASCII set should give `False`. Do you need regular expressions for this?

In [None]:
str1 = '123—456'
str2 = 'good fοοd'
str3 = 'happy learning!'
str4 = 'İıſK'

##### add your solution here for str1
assert ___ == False
##### add your solution here for str2
assert ___ == False
##### add your solution here for str3
assert ___ == True
##### add your solution here for str4
assert ___ == False

**b)** Does the `.` metacharacter match non-ASCII characters when the `re.ASCII` flag is enabled?
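
For context, here is a minimal sketch of what `re.ASCII` changes for the escape sequences `\w` and `\d`. The sample string is an assumption, written with escapes so the non-ASCII characters are explicit. It deliberately does not test `.`, which is left for you to try.

```python
import re

# 'caf\u00e9' is "café"; '\u0664\u0662' are the Arabic-Indic digits four and two
sample = 'caf\u00e9 \u0664\u0662'

words_unicode = re.findall(r'\w+', sample)                 # default Unicode matching
words_ascii = re.findall(r'\w+', sample, flags=re.ASCII)   # \w limited to [a-zA-Z0-9_]

digits_unicode = re.findall(r'\d+', sample)
digits_ascii = re.findall(r'\d+', sample, flags=re.ASCII)  # \d limited to [0-9]

print(words_unicode, words_ascii, digits_unicode, digits_ascii)
```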

**c)** Explore the following Q&A threads.

* [stackoverflow: remove powered number from string](https://stackoverflow.com/questions/57553721/remove-powered-number-from-string-in-python)
* [stackoverflow: regular expression for French characters](https://stackoverflow.com/questions/1922097/regular-expression-for-french-characters)

<br>

# 12. regex module

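This part is super optional. It has you use the non-builtin `regex` module (https://pypi.org/project/regex/), which you need to install (for example with `pip install regex`) and import with `import regex` before running the cells below. I've never actually tried it. I skimmed through its features, and it doesn't strike me as adding *that* much more functionality.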

**a)** Filter all elements whose first non-whitespace character is not a `#` character. Any element made up of only whitespace characters should be ignored as well.

In [None]:
items = ['    #comment', '\t\napple #42', '#oops', 'sure', 'no#1', '\t\r\f']

##### add your solution here
assert ___ == ['\t\napple #42', 'sure', 'no#1']

**b)** Replace sequences made up of words separated by `:` or `.` by the first word of the sequence and the separator. Such sequences will end when `:` or `.` is not followed by a word character.

In [None]:
ip = 'wow:Good:2_two:five: hi bye kite.777.water.'

##### add your solution here
assert ___ == 'wow: hi bye kite.'

**c)** The given list of strings has fields separated by `:` character. Delete `:` and the last field if there is a digit character anywhere before the last field.

In [None]:
items = ['42:cat', 'twelve:a2b', 'we:be:he:0:a:b:bother']

##### add your solution here
assert ___ == ['42', 'twelve:a2b', 'we:be:he:0:a:b']

**d)** Extract all whole words unless they are preceded by `:` or `<=>` or `----` or `#`.

In [None]:
ip = '::very--at<=>row|in.a_b#b2c=>lion----east'

##### add your solution here
assert ___ == ['at', 'in', 'a_b', 'lion']

**e)** The given input string has fields separated by `:` character. Extract all fields if the previous field contains a digit character.

In [None]:
ip = 'vast:a2b2:ride:in:awe:b2b:3list:end'

##### add your solution here
assert ___ == ['ride', '3list', 'end']

**f)** The given input string has fields separated by `:` character. Delete all fields, including the separator, unless the field contains a digit character. Stop deleting once a field with digit character is found.

In [None]:
row1 = 'vast:a2b2:ride:in:awe:b2b:3list:end'
row2 = 'um:no:low:3e:s4w:seer'

pat = regex.compile()      ##### add your solution here

assert pat.sub('', row1) == 'a2b2:ride:in:awe:b2b:3list:end'
assert pat.sub('', row2) == '3e:s4w:seer'

**g)** For the given input strings, extract `if` followed by any number of nested parentheses. Assume that there will be only one such pattern per input string.

In [None]:
ip1 = 'for (((i*3)+2)/6) if(3-(k*3+4)/12-(r+2/3)) while()'
ip2 = 'if+while if(a(b)c(d(e(f)1)2)3) for(i=1)'

pat = regex.compile()       ##### add your solution here

assert pat.search(ip1)[0] == 'if(3-(k*3+4)/12-(r+2/3))'
assert pat.search(ip2)[0] == 'if(a(b)c(d(e(f)1)2)3)'

**h)** Read about the `POSIX` flag at https://pypi.org/project/regex/. Does the following code snippet show the correct output?

In [None]:
words = 'plink incoming tint winter in caution sentient'

change = regex.compile(r'int|in|ion|ing|inco|inter|ink', flags=regex.POSIX)

assert change.sub('X', words) == 'plX XmX tX wX X cautX sentient'

**i)** Extract all whole words for the given input strings. However, based on user input `ignore`, do not match words if they contain any character present in the `ignore` variable.

In [None]:
s1 = 'match after the last newline character'
s2 = 'and then you want to test'

ignore = 'aty'
assert regex.findall() == ['newline']
assert regex.findall() == []

ignore = 'esw'
assert regex.findall() == ['match']
assert regex.findall() == ['and', 'you', 'to']

**j)** Retain only punctuation characters for the given strings (generated from codepoints). Use Unicode character set definition for punctuation for solving this exercise.

In [None]:
s1 = ''.join(chr(c) for c in range(0, 0x80))
s2 = ''.join(chr(c) for c in range(0x80, 0x100))
s3 = ''.join(chr(c) for c in range(0x2600, 0x27ec))

pat = regex.compile()       ##### add your solution here

assert pat.sub('', s1) == '!"#%&\'()*,-./:;?@[\\]_{}'
assert pat.sub('', s2) == '¡§«¶·»¿'
assert pat.sub('', s3) == '❨❩❪❫❬❭❮❯❰❱❲❳❴❵⟅⟆⟦⟧⟨⟩⟪⟫'

**k)** For the given **markdown** file, replace all occurrences of the string `python` (irrespective of case) with the string `Python`. However, any match within code blocks that start with whole line ` ```python ` and end with whole line ` ``` ` shouldn't be replaced. Consider the input file to be small enough to fit memory requirements.

Refer to [github: exercises folder](https://github.com/learnbyexample/py_regular_expressions/tree/master/exercises) for files `sample.md` and `expected.md` required to solve this exercise.

In [None]:
ip_str = open('sample.md', 'r').read()
pat = regex.compile()      ##### add your solution here
with open('sample_mod.md', 'w') as op_file:
    ##### add your solution here

assert open('sample_mod.md').read() == open('expected.md').read()

**l)** For the given input strings, construct a word that is made up of last characters of all the words in the input. Use last character of last word as first character, last character of last but one word as second character and so on.

In [None]:
s1 = 'knack tic pi roar what'
s2 = '42;rod;t2t2;car'

pat = regex.compile()       ##### add your solution here

##### add your solution here for s1
assert ___ == 'trick'
##### add your solution here for s2
assert ___ == 'r2d2'

**m)** Replicate `str.rpartition` functionality with regular expressions. Split into three parts based on last match of sequences of digits, which is `777` and `12` for the given input strings.

In [None]:
s1 = 'Sample123string42with777numbers'
s2 = '12apples'

##### add your solution here for s1
assert ___ == ['Sample123string42with', '777', 'numbers']
##### add your solution here for s2
assert ___ == ['', '12', 'apples']
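
For reference, this is how the built-in `str.rpartition` behaves with a fixed separator. The sample strings and variable names below are made up for illustration. Note that `rpartition` returns a tuple, while the expected output in this exercise is a list.

```python
path = 'usr/local/bin/python'   # hypothetical sample input

# Split around the LAST occurrence of the separator
head, sep, tail = path.rpartition('/')

# When the separator is absent, the first two parts are empty strings
missing = 'nodelimiter'.rpartition('/')

print((head, sep, tail), missing)
```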

**n)** Read about fuzzy matching at https://pypi.org/project/regex/. For the given input strings, return `True` if they are exactly the same as `cat` or differ from it by exactly one character. Ignore case when comparing. For example, `Ca2` should give `True`. `act` should give `False` even though the characters are the same, because each character's position must be maintained.

In [None]:
pat = regex.compile()       ##### add your solution here

assert bool(pat.fullmatch('CaT')) == True
assert bool(pat.fullmatch('scat')) == False
assert bool(pat.fullmatch('ca.')) == True
assert bool(pat.fullmatch('ca#')) == True
assert bool(pat.fullmatch('c#t')) == True
assert bool(pat.fullmatch('at')) == False
assert bool(pat.fullmatch('act')) == False
assert bool(pat.fullmatch('2a1')) == False
