# Simple Matching in Python
* The r at the beginning of the pattern indicates that it is a raw string.
* The match field of the printed result always holds the actual substring that matched the search pattern
* The span field gives the range of indices where that substring can be found in the string

In [1]:
import re

result = re.search(r"aza", "plaza")  # span is the (start, end) index range of the match
print(result)

result = re.search(r"aza", "bazaar")
print(result)

result = re.search(r"aza", "maze")   # No match, so search returns None
print(result)

<re.Match object; span=(2, 5), match='aza'>
<re.Match object; span=(1, 4), match='aza'>
None
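
The printed result shows `span=` and `match=`, but in code those values are read through methods on the match object. A minimal sketch, reusing the "plaza" example from above (the variable names are only for illustration):

```python
import re

result = re.search(r"aza", "plaza")

matched_text = result.group()   # the substring that matched
start, end = result.span()      # start index (inclusive), end index (exclusive)

print(matched_text)             # aza
print(start, end)               # 2 5
print("plaza"[start:end])       # aza

# re.search returns None when nothing matches, so check before calling methods.
missing = re.search(r"aza", "maze")
if missing is None:
    print("no match")
```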


**A circumflex (^) at the start of a pattern anchors the match to the beginning of the string, so whatever follows the ^ has to appear right there**

In [None]:
print(re.search(r"^x", "xerox"))
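
The cell above has no recorded output, so here is a small sketch of what the anchor does. The second word, "fox", is an extra example that is not part of the original notebook:

```python
import re

# "xerox" starts with x, so the anchored pattern matches at index 0.
at_start = re.search(r"^x", "xerox")
print(at_start.span())   # (0, 1)

# "fox" contains an x, but not at the beginning, so there is no match.
not_at_start = re.search(r"^x", "fox")
print(not_at_start)      # None
```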

**A dot (.) in a regex is a metacharacter that matches any single character (except a newline, by default). To match a literal dot we need to escape it with a backslash, so in a raw Python string (r"" or r'') the pattern is `r"\."`**

In [14]:
print(re.search(r"p.ng", "penguin"))
print(re.search(r"p.ng","clapping"))
print(re.search(r"p.ng", "sponge"))

<re.Match object; span=(0, 4), match='peng'>
<re.Match object; span=(4, 8), match='ping'>
<re.Match object; span=(1, 5), match='pong'>


**Check if the text passed contains the vowels a, e and i, with exactly one occurrence of any other character in between.**

In [15]:
def check_aei(text):
    result = re.search(r"a.e.i", text)
    return result is not None

print(check_aei("academia"))
print(check_aei("aerial"))
print(check_aei("paramedic"))

True
False
True


**Additional options can be passed to the search function as a third parameter. <br>
The re.IGNORECASE option makes the match case insensitive**

In [16]:
re.search(r"p.ng", "Pangaea", re.IGNORECASE)

<re.Match object; span=(0, 4), match='Pang'>
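
As a rough sketch of how the flag changes the result, the snippet below runs the same pattern with and without it. `re.I` is the short alias for `re.IGNORECASE`; the word list is made up for illustration:

```python
import re

# Without the flag the lowercase p does not match the capital P.
case_sensitive = re.search(r"p.ng", "Pangaea")
print(case_sensitive)            # None

# re.I is a short alias for re.IGNORECASE.
case_insensitive = re.search(r"p.ng", "Pangaea", re.I)
print(case_insensitive.group())  # Pang

# Flags can also be attached once to a compiled pattern and reused.
pattern = re.compile(r"p.ng", re.IGNORECASE)
words = ["Pangaea", "penguin", "SPONGE", "pig"]
matching_words = [w for w in words if pattern.search(w)]
print(matching_words)            # ['Pangaea', 'penguin', 'SPONGE']
```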

# Wildcards and Character Classes

**Character classes are written inside square brackets and let us list the characters we want to match inside of those brackets**

In [28]:
print(re.search(r"[Pp]ython", "python"))
print(re.search(r"[Pp]ython", "Python"))

<re.Match object; span=(0, 6), match='python'>
<re.Match object; span=(0, 6), match='Python'>


**A range of characters can be defined inside a character class using a dash, for example: a-z, A-Z, 0-9**

In [27]:
print(re.search(r"[a-z]way", "The end of the highway"))
print(re.search(r"[a-z]way", "What a way to go"))  # Returns None because "way" is preceded by a space, not a lowercase letter

<re.Match object; span=(18, 22), match='hway'>
None


In [24]:
print(re.search(r"cloud[a-zA-Z0-9]", "cloudy"))
print(re.search(r"cloud[a-zA-Z0-9]", "cloud9"))

<re.Match object; span=(0, 6), match='cloudy'>
<re.Match object; span=(0, 6), match='cloud9'>


**Check if the text passed contains punctuation symbols: commas, periods, colons, semicolons, question marks, and exclamation points.**

In [26]:
def check_punctuation(text):
    result = re.search(r"[,.:;?!]", text)
    return result is not None

print(check_punctuation("This is a sentence that ends with a period."))
print(check_punctuation("This is a sentence fragment without a period"))
print(check_punctuation("Aren't regular expressions awesome?"))
print(check_punctuation("Wow! We're really picking up some steam now!"))
print(check_punctuation("End of the line"))

True
False
True
True
False


**Sometimes we may want to match any character that isn't in a group. <br> To do that, we use a circumflex (^) as the first character inside the square brackets. For example, let's create a search pattern that looks for any character that's not a letter.**

In [36]:
print(re.search(r"[^a-zA-Z]", "This is a sentence.")) # Finds the space between "This" and "is", at index 4

<re.Match object; span=(4, 5), match=' '>


In [37]:
# This gives us the period because we added a space inside the character class.
print(re.search(r"[^a-zA-Z ]", "This is a sentence."))

<re.Match object; span=(18, 19), match='.'>
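
A short sketch of two details of negated classes: the negation applies to everything listed in the brackets, and the circumflex only negates when it comes first. The sample strings are invented for illustration:

```python
import re

# Every character that is neither a letter nor a space.
non_letters = re.findall(r"[^a-zA-Z ]", "Wait, what? No way!")
print(non_letters)             # [',', '?', '!']

# When ^ is not the first character in the brackets it is just a literal ^.
literal_caret = re.search(r"[a^]", "2^3")
print(literal_caret.group())   # ^
```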


**If we want to match either one expression or another, we can use the pipe symbol (|) to do that. <br>
This lets us list alternative options that can get matched. <br>
For example, we could have an expression that matches either the word cat or the word dog**

In [40]:
print(re.search(r"cat|dog", "He likes dogs."))
print(re.search(r"cat|dog", "She likes cats."))

<re.Match object; span=(9, 12), match='dog'>
<re.Match object; span=(10, 13), match='cat'>


In [41]:
# There are two possible matches here, but search only returns the first one.
print(re.search(r"cat|dog", "I like both dogs and cats."))

<re.Match object; span=(12, 15), match='dog'>


**To get all possible matches, use the findall() function**

In [42]:
print(re.findall(r"cat|dog", "I like both dogs and cats."))

['dog', 'cat']
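
`findall()` returns plain strings, so the positions are lost. If the spans are needed too, `re.finditer()` yields a match object for each hit. A minimal sketch on the same sentence:

```python
import re

text = "I like both dogs and cats."

# finditer yields one match object per non-overlapping match, left to right.
found = [(m.group(), m.span()) for m in re.finditer(r"cat|dog", text)]
print(found)   # [('dog', (12, 15)), ('cat', (21, 24))]
```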


# Repetition Qualifiers

**Suppose we want to find the longest word in a string, or the host names in a log file by looking for a run of alphanumeric characters between brackets. We can do this with another RegEx concept: repeated matches.**<br>
**It's quite common to see expressions that include a dot followed by a star (.*). This matches any character repeated as many times as possible, including zero times.**

In [43]:
print(re.search(r"Py.*n", "Pygmalion"))

<re.Match object; span=(0, 9), match='Pygmalion'>


**The star takes as many characters as possible. In programming terms, we say that this behavior is greedy. <br>
It's possible to modify the repetition qualifiers to make them less greedy.**

In [44]:
# This is greedy behavior: the pattern could have matched just the word Python,
# but it expanded all the way to the last n in the string.
print(re.search(r"Py.*n", "Python Programming"))

<re.Match object; span=(0, 17), match='Python Programmin'>


In [45]:
# If we only want the pattern to match letters, we should use a character class instead.
print(re.search(r"Py[a-z]*n", "Python Programming"))

<re.Match object; span=(0, 6), match='Python'>


In [None]:
print(re.search(r"Py.*n", "Pyn"))  # Zero repetitions are also allowed, so the string Pyn matches our pattern too
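
One way to make a qualifier less greedy is to put a question mark after it, which asks for as few repetitions as possible. The sketch below compares the two forms on the strings used above:

```python
import re

text = "Python Programming"

# Greedy: .* grabs as much as it can and stops at the last n.
greedy = re.search(r"Py.*n", text).group()
print(greedy)        # Python Programmin

# Non-greedy: .*? grabs as little as it can and stops at the first n.
lazy = re.search(r"Py.*?n", text).group()
print(lazy)          # Python

# Zero repetitions are fine for both forms.
zero_repeats = re.search(r"Py.*n", "Pyn").group()
print(zero_repeats)  # Pyn
```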

**Use a plus (+) to match one or more occurrences of the character that comes before it**

In [46]:
print(re.search(r"o+l+", "goldfish"))
print(re.search(r"o+l+", "woolly"))
print(re.search(r"o+l+", "boil"))  # There's an i between the o and the l, so it doesn't match

<re.Match object; span=(1, 3), match='ol'>
<re.Match object; span=(1, 5), match='ooll'>
None
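
To see how the plus differs from the star, the sketch below runs both against "boil". The `o*l+` pattern is an extra variant added here for comparison:

```python
import re

word = "boil"

# + needs at least one "o" directly followed by at least one "l"; the "i" gets in the way.
plus_result = re.search(r"o+l+", word)
print(plus_result)           # None

# * also accepts zero "o" characters, so a lone "l" is enough.
star_result = re.search(r"o*l+", word)
print(star_result.group())   # l
print(star_result.span())    # (3, 4)
```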


**Use a question mark (?) to match either zero or one occurrence of the character before it. It is used to specify optional characters**

In [48]:
# The p isn't present, but the question mark marks it as optional, so we still get a match.
print(re.search(r"p?each", "To each their own"))

<re.Match object; span=(3, 7), match='each'>


In [50]:
print(re.search(r"p?each", "I like peaches"))  # Here the p is present, so the match includes it

<re.Match object; span=(7, 12), match='peach'>


**The repeating_letter_a function checks if the text passed includes the letter "a" (lowercase or uppercase) at least twice.<br> For example, repeating_letter_a("banana") is True, while repeating_letter_a("pineapple") is False.**
* Note: 
 * \* repeats a character zero or more times
 * \+ repeats a character one or more times

In [52]:
def repeating_letter_a(text):
    result = re.search(r"[Aa].*[aA]", text)  # Equivalent: re.search(r"a.*a", text, re.IGNORECASE)
    return result is not None

print(repeating_letter_a("banana"))
print(repeating_letter_a("pineapple"))
print(repeating_letter_a("Animal Kingdom"))
print(repeating_letter_a("A is for apple"))

True
False
True
True


# Escaping Characters
* A backslash (\\) in a pattern may be escaping a special regex character or a special string character. Put a backslash in front of a special regex character to match it literally

In [8]:
print(re.search(r".com", "welcome"))  # The dot matches any character, so this finds "lcom"
print(re.search(r"\.com", "welcome")) # The escaped dot only matches a literal ".", so nothing is found
print(re.search(r"\.com", "gg.com"))  # Matches the literal ".com"

<re.Match object; span=(2, 6), match='lcom'>
None
<re.Match object; span=(2, 6), match='.com'>
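
When the text to search for comes from a variable, `re.escape()` can add the backslashes for us. A small sketch, with "ggXcom" as a made-up string that the unescaped dot would wrongly accept:

```python
import re

domain = "gg.com"

# re.escape puts a backslash in front of characters that are special in a regex.
escaped = re.escape(domain)
print(escaped)   # gg\.com

# Unescaped, the dot is a wildcard and also matches the X.
unescaped_hit = re.search(domain, "ggXcom")
print(unescaped_hit.group())   # ggXcom

# Escaped, only a literal dot is accepted.
escaped_miss = re.search(escaped, "ggXcom")
print(escaped_miss)            # None
escaped_hit = re.search(escaped, "visit gg.com today")
print(escaped_hit.group())     # gg.com
```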


**Use \w to match any alphanumeric character including letters, numbers, and underscores<br>
Use \d to match digits<br>
Use \D to match non-digits<br>
Use \s for matching whitespace characters like space, tab or new line<br>
Use \b for word boundaries**

In [5]:
# \w* matches the first four letters and stops at the space, because spaces aren't word characters.
print(re.search(r"\w*", "This is an example"))
print(re.search(r"\w*", "And_this_is_another")) # Matches the whole string
print(re.search(r"\w*", "_this_is_the_third"))  # The underscore counts as a word character
print(re.search(r"\d*", "Numbers_are_1_and_2")) # Empty match at index 0: * allows zero digits and the string starts with a letter
print(re.search(r"\s", "__Dunder Init__"))

<re.Match object; span=(0, 4), match='This'>
<re.Match object; span=(0, 19), match='And_this_is_another'>
<re.Match object; span=(0, 18), match='_this_is_the_third'>
<re.Match object; span=(0, 0), match=''>
<re.Match object; span=(8, 9), match=' '>
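
The cell above does not exercise `\b` or `\D`, and `\d*` only produced an empty match. The sketch below fills those gaps with a few invented strings; using `\d+` instead of `\d*` is one way to require at least one digit:

```python
import re

text = "cat concatenate cats"

# Without boundaries, "cat" is also found inside longer words.
anywhere = re.findall(r"cat", text)
print(anywhere)      # ['cat', 'cat', 'cat']

# \b only allows the match where a word starts and ends.
whole_word = re.findall(r"\bcat\b", text)
print(whole_word)    # ['cat']

# \d+ needs at least one digit, so it skips ahead to the first number.
first_number = re.search(r"\d+", "Numbers_are_1_and_2")
print(first_number.span())   # (12, 13)

# \D is the opposite of \d: runs of non-digit characters.
non_digits = re.findall(r"\D+", "a1b22c")
print(non_digits)    # ['a', 'b', 'c']
```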


**Check if the text passed has at least 2 groups of alphanumeric characters (including letters, numbers, and underscores) separated by one or more whitespace characters.**

In [18]:
def check_character_groups(text):
    # Two runs of word characters (\w+) separated by one or more whitespace characters (\s+)
    result = re.search(r"\w+\s+\w+", text)
    return result is not None

print(check_character_groups("One")) # False
print(check_character_groups("123  Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False

False
True
True
False


# Regular Expressions in Action

**For example, say you had a list of all the countries in the world and you wanted to check which of those names start and end with the letter a. <br>On its own, "A.\*a" matches from an A up to the last a it can reach, even if the name continues past that a**

In [13]:
print(re.search(r"A.*a", "Argentina"))
print(re.search(r"A.*a", "Azerbaijan"))  # Matches "Azerbaija" because the pattern does not say the string has to end there

<re.Match object; span=(0, 9), match='Argentina'>
<re.Match object; span=(0, 9), match='Azerbaija'>


**By adding a circumflex at the start and a dollar sign at the end of the pattern, we make it clear that we only want to match lines that begin with A and end with a**

In [8]:
print(re.search(r"^A.*a$", "Azerbaijan"))
print(re.search(r"^A.*a$", "Australia"))

None
<re.Match object; span=(0, 9), match='Australia'>
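
Applied to a list, the anchored pattern can be used as a filter. The short country list below is only a sample for illustration, and `re.fullmatch()` is shown as an alternative that requires the whole string to match without writing the anchors:

```python
import re

# A small sample list, not the full list of countries.
countries = ["Argentina", "Azerbaijan", "Australia", "Brazil", "Algeria", "Andorra"]

pattern = re.compile(r"^A.*a$")
starts_and_ends_with_a = [name for name in countries if pattern.search(name)]
print(starts_and_ends_with_a)   # ['Argentina', 'Australia', 'Algeria', 'Andorra']

# fullmatch only succeeds if the pattern covers the entire string.
with_fullmatch = [name for name in countries if re.fullmatch(r"A.*a", name)]
print(with_fullmatch == starts_and_ends_with_a)   # True
```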


**Using regular expressions, we can also construct a pattern that checks whether a string is a valid variable name in Python. It can contain any number of letters, numbers, or underscores, but it can't start with a number.**

In [20]:
pattern = r"^[a-zA-Z_][a-zA-Z0-9_]*$"  # With only r"^[a-zA-Z0-9_]*$", names starting with a number would also match
print(re.search(pattern, "_this_is_a_valid_variable_name"))
print(re.search(pattern, "this isn't a valid variable name"))
print(re.search(pattern, "my_variable1"))
print(re.search(pattern, "2my_variable1"))

<re.Match object; span=(0, 30), match='_this_is_a_valid_variable_name'>
None
<re.Match object; span=(0, 12), match='my_variable1'>
None
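
The pattern checks the shape of a name, but it cannot know about reserved words such as `class`. As a rough sketch, the hypothetical helper `looks_like_variable_name` below combines the regex with `keyword.iskeyword()` and prints the built-in `str.isidentifier()` result next to it for comparison:

```python
import keyword
import re

NAME_PATTERN = re.compile(r"^[a-zA-Z_][a-zA-Z0-9_]*$")

def looks_like_variable_name(name):
    """Hypothetical helper: regex shape check plus a reserved-word check."""
    return NAME_PATTERN.search(name) is not None and not keyword.iskeyword(name)

for candidate in ["my_variable1", "2my_variable1", "class"]:
    print(candidate, looks_like_variable_name(candidate), candidate.isidentifier())

# my_variable1 True True
# 2my_variable1 False False
# class False True
```

Note that `str.isidentifier()` on its own also accepts keywords, which is why "class" gets a different answer in the two columns.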


**Check if the text passed looks like a standard sentence, meaning that it starts with an uppercase letter, followed by at least some lowercase letters or a space, and ends with a period, question mark, or exclamation point.**

The stricter pattern mentioned in the code comment, `^[A-Z][a-z]*(\s[a-z]+)*[\.!\?]$`, breaks down as follows:
* `^` the match must start at the beginning of the input string
* `[A-Z][a-z]*` the first word must start with a capital letter, followed by any number of lowercase letters
* `(\s[a-z]+)*` after the first word, there can be any number of additional words (including zero). Each word must be preceded by exactly one whitespace character and consist of at least one lowercase letter. Using a literal space instead of `\s` would allow only spaces and not, for instance, tabs
* `[\.!\?]$` the punctuation mark must be at the end of the input string.

In [37]:
def check_sentence(text):
    # The stricter r"^[A-Z][a-z]*(\s[a-z]+)*[\.!\?]$" would also forbid consecutive spaces
    # and a space directly before the punctuation mark.
    # Note: inside square brackets a | is a literal pipe, so it is not needed here.
    result = re.search(r"^[A-Z][a-z ]*[.?!]$", text)
    return result is not None

print(check_sentence("Is this is a sentence?")) # True
print(check_sentence("is this is a sentence?")) # False
print(check_sentence("Hello")) # False
print(check_sentence("1-2-3-GO!")) # False
print(check_sentence("A star is born.")) # True

True
False
False
False
True
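
To see where a simple character-class pattern and the stricter pattern disagree, the sketch below runs both on sentences with a doubled space and with a space before the period. The names `LOOSE` and `STRICT` and the two extra sentences are made up for this comparison:

```python
import re

LOOSE = re.compile(r"^[A-Z][a-z ]*[.?!]$")
STRICT = re.compile(r"^[A-Z][a-z]*(\s[a-z]+)*[.!?]$")

samples = ["A star is born.", "A  star is born.", "A star is born ."]

# Map each sample to (loose result, strict result).
results = {}
for text in samples:
    results[text] = (LOOSE.search(text) is not None, STRICT.search(text) is not None)
    print(repr(text), results[text])

# 'A star is born.' (True, True)
# 'A  star is born.' (True, False)
# 'A star is born .' (True, False)
```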
