## Regular expressions (RegEx)

- A more robust way of getting what we want out of a string
- Reference: https://docs.python.org/3/library/re.html
- ^ matches the start of the line
- . matches any single character (except a newline)
- $ matches the end of the line
- Always use raw strings for patterns: re.search(r'fall', 'onfall')
- re.search(r'[Pp]ython', 'Python') - a character class matches one character from the set, here P or p
- re.search(r'[a-z]ython', 'Python') - a range matches any one character from a to z; this call returns None because the uppercase P is outside the range
- re.search(r'cat|dog', 'dog') - matches either cat or dog; search returns only the first match it finds
- re.findall(r'cat|dog', 'cats and dogs') - returns all of the matches: ['cat', 'dog']
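The bullets above can be tried out in one short sketch. The sample strings and variable names are made up for illustration; only re.search and re.findall from the standard library are used.

```python
import re

# ^ anchors the match at the start of the string, $ at the end.
starts_with_on = re.search(r"^on", "onfall")     # matches 'on'
ends_with_on = re.search(r"on$", "onfall")       # None: the string ends with 'll'

# . stands in for any single character except a newline.
any_char = re.search(r"f.ll", "onfall")          # matches 'fall'

# A character class matches exactly one character from the set.
either_case = re.search(r"[Pp]ython", "python")  # matches 'python'
lower_only = re.search(r"[a-z]ython", "Python")  # None: 'P' is not in a-z

# | is alternation; search stops at the first match, findall collects all.
first_pet = re.search(r"cat|dog", "cats and dogs")
all_pets = re.findall(r"cat|dog", "cats and dogs")

print(starts_with_on.group(), ends_with_on, any_char.group())
print(either_case.group(), lower_only)
print(first_pet.group(), all_pets)
```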


In [1]:
import re
def check_punctuation(text):
  # True if the text ends with a period, question mark, or exclamation point
  result = re.search(r"[.?!]$", text)
  return result is not None

print(check_punctuation("This is a sentence that ends with a period.")) # True
print(check_punctuation("This is a sentence fragment without a period")) # False
print(check_punctuation("Aren't regular expressions awesome?")) # True
print(check_punctuation("Wow! We're really picking up some steam now!")) # True
print(check_punctuation("End of the line")) # False

True
False
True
True
False


### Repeated matches

Find Py followed by any number of other characters, followed by n. The * means any number of repetitions of the preceding character, including zero.

In [3]:
print(re.search(r'Py.*n', 'Pygmalion'))

<re.Match object; span=(0, 9), match='Pygmalion'>


The * is greedy: the match extends as far as possible, up to the last n in the string.

In [6]:
print(re.search(r'Py.*n', 'Python Programming'))

<re.Match object; span=(0, 17), match='Python Programmin'>
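As a small sketch of the alternative, appending ? to the quantifier makes it non-greedy, so the match stops at the first n that completes the pattern. The variable names here are illustrative only.

```python
import re

text = "Python Programming"

# .* is greedy: it grabs as much as it can and backs off to the last 'n'.
greedy = re.search(r"Py.*n", text)
# .*? is non-greedy: it expands one character at a time until an 'n' fits.
lazy = re.search(r"Py.*?n", text)

print(greedy.group())  # Python Programmin
print(lazy.group())    # Python
```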


- to match only lowercase letters between Py and n, use a character class instead of the dot

In [8]:
print(re.search(r'Py[a-z]*n', 'Python Programming'))

<re.Match object; span=(0, 6), match='Python'>


- zero repetitions are also allowed

In [9]:
print(re.search(r'Py[a-z]*n', 'Pyn'))

<re.Match object; span=(0, 3), match='Pyn'>


- the **+** character matches one or more repetitions of the preceding character

In [10]:
print(re.search(r'o+l', 'oil'))

None


In [11]:
print(re.search(r'o+l', 'olllllivie'))

<re.Match object; span=(0, 2), match='ol'>


In [12]:
print(re.search(r'o+l', 'oooollloo'))

<re.Match object; span=(0, 5), match='ooool'>


In [13]:
print(re.search(r'o+l+', 'oooollloo'))

<re.Match object; span=(0, 7), match='oooolll'>
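To see the difference between * and + side by side, here is a minimal sketch that runs both against the same made-up sample strings and records whether each one matches.

```python
import re

samples = ["l", "ol", "oool", "oil"]

# o*l : zero or more 'o' followed by 'l' (a bare 'l' is enough)
star = [bool(re.search(r"o*l", s)) for s in samples]
# o+l : at least one 'o' directly followed by 'l'
plus = [bool(re.search(r"o+l", s)) for s in samples]

for sample, s_hit, p_hit in zip(samples, star, plus):
    print(f"{sample!r}: o*l -> {s_hit}, o+l -> {p_hit}")
```

Note that o*l matches 'oil' because zero repetitions of o are allowed, so the lone l at the end is a match by itself.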


The repeating_letter_a function checks if the text passed includes the letter "a" (lowercase or uppercase) at least twice. For example, repeating_letter_a("banana") is True, while repeating_letter_a("pineapple") is False. Fill in the code to make this work.

In [22]:
import re
def repeating_letter_a(text):
    # [aA] matches either case; a comma inside the brackets would also match a literal comma
    result = re.search(r"[aA].*[aA]", text)
    print(result)
    return result is not None

print(repeating_letter_a("banana")) # True
print(repeating_letter_a("pineapple")) # False
print(repeating_letter_a("Animal Kingdom")) # True
print(repeating_letter_a("A is for apple")) # True

<re.Match object; span=(1, 6), match='anana'>
True
None
False
<re.Match object; span=(0, 5), match='Anima'>
True
<re.Match object; span=(0, 10), match='A is for a'>
True
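An alternative to listing both cases in a character class is the re.IGNORECASE flag. The sketch below is one possible variant, not the exercise's reference answer; the function name repeating_letter_a_ignorecase is made up for this example.

```python
import re

def repeating_letter_a_ignorecase(text):
    # With re.IGNORECASE, a plain 'a' in the pattern matches both 'a' and 'A'.
    return re.search(r"a.*a", text, re.IGNORECASE) is not None

print(repeating_letter_a_ignorecase("banana"))          # True
print(repeating_letter_a_ignorecase("pineapple"))       # False
print(repeating_letter_a_ignorecase("Animal Kingdom"))  # True
print(repeating_letter_a_ignorecase("A is for apple"))  # True
```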


#### ? is another multiplier
It makes the preceding character optional (zero or one occurrence).

In [24]:
print(re.search(r'ep?each', 'each'))

None


In [25]:
print(re.search(r'p?each', 'each'))

<re.Match object; span=(0, 4), match='each'>
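A short sketch of the same pattern with re.findall shows the optional p being used in one match and skipped in the others. The sample sentence is made up for illustration.

```python
import re

# p? means "zero or one p", so both 'each' and 'peach' satisfy the pattern.
optional_p = re.findall(r"p?each", "each peach teach")

print(optional_p)  # ['each', 'peach', 'each']
```

The last 'each' comes from inside 'teach': the t is not part of the pattern, so the match starts one character later.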


What if the text we need to find contains one of the special characters? We can put the escape character (backslash \) in front of it.

In [26]:
print(re.search(r'.com', 'welcome'))

<re.Match object; span=(2, 6), match='lcom'>


In [28]:
print(re.search(r'\.com', 'welcome to www.python.com'))

<re.Match object; span=(21, 25), match='.com'>
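When the literal text comes from a variable, escaping by hand gets tedious. The standard library's re.escape does it for us; the sketch below assumes we want to find the exact text www.python.com, and the test strings are invented.

```python
import re

literal = "www.python.com"
# re.escape puts a backslash in front of characters that are special in a pattern.
escaped = re.escape(literal)

# Unescaped, each dot matches any character, so this wrong string matches too.
unescaped_hit = re.search(literal, "wwwXpythonYcom")
# Escaped, the dots must be real dots.
escaped_hit = re.search(escaped, "wwwXpythonYcom")
exact_hit = re.search(escaped, "welcome to www.python.com")

print(escaped)
print(unescaped_hit is not None, escaped_hit is not None)
print(exact_hit.group())
```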


- using raw strings helps avoid confusion with \t and \n (tab and newline); the backslashes are passed through unchanged and are only interpreted when the regular expression is parsed
- a backslash followed by a letter, such as \w, can stand for a predefined set of characters
- \w matches any alphanumeric character: letters, digits, and the underscore

In [45]:
print(re.findall(r'\w', ' welcome 1 3 4 . /'))

['w', 'e', 'l', 'c', 'o', 'm', 'e', '1', '3', '4']


In [35]:
print(re.findall(r'.', ' welcome 1 3 4'))

[' ', 'w', 'e', 'l', 'c', 'o', 'm', 'e', ' ', '1', ' ', '3', ' ', '4']
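Besides \w, the re module defines other shorthand classes such as \d (digits) and \s (whitespace). A minimal sketch on a made-up sample string:

```python
import re

# The sample is a normal string, so \t here is a real tab character.
sample = "user_01 has 3 items\tleft"

words = re.findall(r"\w+", sample)   # runs of letters, digits, underscores
digits = re.findall(r"\d+", sample)  # runs of digits
spaces = re.findall(r"\s", sample)   # single whitespace characters

print(words)
print(digits)
print(spaces)
```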


Fill in the code to check if the text passed has at least 2 groups of alphanumeric characters (including letters, numbers, and underscores) separated by one or more whitespace characters.

In [38]:
import re
def check_character_groups(text):
  # \w+ is a group of word characters; \s+ is one or more whitespace characters of any kind
  result = re.search(r"\w+\s+\w+", text)
  return result is not None

print(check_character_groups("One")) # False
print(check_character_groups("123  Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False

False
True
True
False


Cheatsheet:
- https://regex101.com/
- https://docs.python.org/3/howto/regex.html
- https://docs.python.org/3/library/re.html
- https://docs.python.org/3/howto/regex.html#greedy-versus-non-greedy
    

In [42]:
# begins with A and ends with the letter a
print(re.search(r'^A.*a$', 'Argentina'))
print(re.search(r'^A.*a$', 'Argentinian'))

<re.Match object; span=(0, 9), match='Argentina'>
None
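When the whole string has to match, re.fullmatch is a possible alternative to writing ^ and $ by hand. The sketch below compares the two; the trailing-newline case is an extra illustration, not something from the examples above.

```python
import re

# fullmatch succeeds only if the pattern covers the entire string.
full_ok = re.fullmatch(r"A.*a", "Argentina") is not None
full_bad = re.fullmatch(r"A.*a", "Argentinian") is not None

# $ also matches just before a trailing newline; fullmatch does not allow that.
anchored_newline = re.search(r"^A.*a$", "Argentina\n") is not None
full_newline = re.fullmatch(r"A.*a", "Argentina\n") is not None

print(full_ok, full_bad)
print(anchored_newline, full_newline)
```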


Uppercase letters, lowercase letters, and digits: [A-Za-z0-9]. Unlike \w, this class does not include the underscore.
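A quick comparison of this class with \w on a string that contains an underscore (the sample string is made up):

```python
import re

sample = "user_01"

# The explicit class stops at the underscore, splitting the name in two.
explicit = re.findall(r"[A-Za-z0-9]+", sample)
# \w treats the underscore as a word character, so the name stays whole.
shorthand = re.findall(r"\w+", sample)

print(explicit)   # ['user', '01']
print(shorthand)  # ['user_01']
```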

Fill in the code to check if the text passed looks like a standard sentence, meaning that it starts with an uppercase letter, followed by at least some lowercase letters or a space, and ends with a period, question mark, or exclamation point.

In [44]:
import re
def check_sentence(text):
  # uppercase first letter, then lowercase letters or spaces, then closing punctuation
  result = re.search(r"^[A-Z][a-z ]+[.?!]$", text)
  return result is not None

print(check_sentence("Is this is a sentence?")) # True
print(check_sentence("is this is a sentence?")) # False
print(check_sentence("Hello")) # False
print(check_sentence("1-2-3-GO!")) # False
print(check_sentence("A star is born.")) # True

True
False
False
False
True

