# Regular Expressions
## What is a *regular expression*?
A regular expression, or regex, is a search query for text, expressed as a string pattern.

On a console, we can use the `grep` command to query a file.
The `/usr/share/dict/words` file is generally used by spell-checking programs to check whether a word exists.  

```Bash
grep thon /usr/share/dict/words
```
This command returns every word that contains `thon`. The match is case sensitive.
To make the query case insensitive, add the `-i` flag.
<br>

```Bash
grep -i python /usr/share/dict/words
```
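
For readers following along in Python, the two queries above can be approximated with plain substring tests. This is only a sketch: the `words` list is a made-up stand-in for the dictionary file, and `grep_like` is a hypothetical helper, not part of any library.

```python
# Hypothetical stand-in for the lines of /usr/share/dict/words.
words = ["python", "Python", "marathon", "Thonburi", "apple"]

def grep_like(pattern, lines, ignore_case=False):
    """Return the lines that contain a fixed string, roughly like grep [-i]."""
    if ignore_case:
        # Compare lowercase copies so that case differences are ignored.
        return [line for line in lines if pattern.lower() in line.lower()]
    return [line for line in lines if pattern in line]

print(grep_like("thon", words))                    # ['python', 'Python', 'marathon']
print(grep_like("thon", words, ignore_case=True))  # adds 'Thonburi'
```

Unlike `grep`, this helper treats the pattern as literal text, so it only mirrors the simple cases shown above.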
<br>



# Special Characters
Special characters let us do more advanced matching.
<br>
<br>

## The  `dot(.)`  character
Matches any single character.
<br>

```Bash
grep l.rts /usr/share/dict/words
```

The `dot(.)` will match exactly one character, whatever it is, between `l` and `rts`.
<br>
<br>

## The  `circumflex(^)`  character
Matches the start of a line, so the pattern that follows it must appear at the beginning of the line.
<br>

```Bash
grep ^fruit /usr/share/dict/words
```

The `circumflex(^)` will match any line that starts with "fruit".
<br>
<br>

## The  `dollar sign($)`  character
Matches the end of a line, so the pattern that precedes it must appear at the end of the line.
<br>

```Bash
grep cat$ /usr/share/dict/words
```

The `dollar sign($)` will match any line that ends with "cat".
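
The same two anchors also work in Python's `re` module, which the next section introduces. The sketch below assumes a small made-up word list in place of the dictionary file and treats each word as one line.

```python
import re

# Made-up sample words standing in for lines of the dictionary file.
words = ["fruit", "fruitcake", "grapefruit", "cat", "bobcat", "catalog"]

# ^ pins the pattern to the start of the string, $ pins it to the end.
starts_with_fruit = [w for w in words if re.search(r"^fruit", w)]
ends_with_cat = [w for w in words if re.search(r"cat$", w)]

print(starts_with_fruit)  # ['fruit', 'fruitcake']
print(ends_with_cat)      # ['cat', 'bobcat']
```

Note that "grapefruit" and "catalog" contain the text but fail the anchor, which is the whole point of `^` and `$`.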


# Basic Regular Expressions
In Python, the module for regular expressions is the `re` module.  
The `re.search()` function returns a match object that includes the span (start and end positions) where the match was found.  
The `r` prefix in the following examples marks a *raw string*, which tells Python not to process backslash escape sequences, so the pattern reaches the regex engine unchanged. 
<br>
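
A quick way to see what the `r` prefix changes is to compare string lengths. This is a small illustrative sketch; the variable names are arbitrary.

```python
# In a normal string, "\n" is one character: a newline.
normal_len = len("\n")
# In a raw string, r"\n" is two characters: a backslash and the letter n.
raw_len = len(r"\n")
print(normal_len, raw_len)  # 1 2

# Without the r prefix, every backslash meant for the regex has to be doubled.
same_pattern = r"\.com" == "\\.com"
print(same_pattern)  # True
```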

If no match is found, `re.search()` returns `None`.

In [35]:
import re
result = re.search(r"aza", "plaza")
print(result)

<re.Match object; span=(2, 5), match='aza'>


In [36]:
result = re.search(r"aza","this")
print(result)

None
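
Because `re.search()` can return `None`, it is usually safer to test the result before calling methods on it. The sketch below shows one way to do that; `describe_match` is a hypothetical helper name, while `group()` and `span()` are standard match-object methods.

```python
import re

def describe_match(pattern, text):
    """Return the matched text and its span, or a message when nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        # Calling match.group() here would raise AttributeError.
        return "no match"
    return f"{match.group()} at {match.span()}"

print(describe_match(r"aza", "plaza"))  # aza at (2, 5)
print(describe_match(r"aza", "this"))   # no match
```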


In [37]:
result = re.search(r"p.ng","penguin")
print(result)
result = re.search(r"p.ng","sponge")
print(result)
# To match regardless of case, pass the re.IGNORECASE flag to the search function
result = re.search(r"p.ng","Pangaea", re.IGNORECASE)
print(result)

<re.Match object; span=(0, 4), match='peng'>
<re.Match object; span=(1, 5), match='pong'>
<re.Match object; span=(0, 4), match='Pang'>


The following example checks whether the text passed contains the vowels a, e, and i, in that order, with exactly one character of any kind between each pair.

In [38]:
import re
def check_aei(text):
  result = re.search(r"a.e.i", text)
  return result is not None

print(check_aei("academia")) # True
print(check_aei("aerial")) # False
print(check_aei("paramedic")) # True

True
False
True


# Wildcards and Character Classes

The dot is known as a wildcard because it can stand in for many different characters.  
Using a dot is the broadest possible wildcard because it matches absolutely any character. But what if we wanted something stricter, like checking if an answer given by a user contains a valid character, or finding all the usernames in a CSV file that start with a vowel?  
We have to restrict our wildcards to a range of characters to do this.  
For this task we use another feature of regexes called character classes.  

## Character Classes
Character classes are written inside square brackets and list the characters we want to match.

In [39]:
# In this example, the [Pp] matches either Python or python
result = re.search(r"[Pp]ython","Python")
print(result)

<re.Match object; span=(0, 6), match='Python'>


Inside the square brackets, we can also define a range of characters using a dash.
For example, we could write `[a-z]`, lowercase a to lowercase z, to mean any lowercase letter. So if we wanted to look for the string `way` preceded by any lowercase letter, we could write the expression like this:

In [40]:
print(re.search(r"[a-z]way", "The end of the highway"))

<re.Match object; span=(18, 22), match='hway'>
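
Going back to the earlier question about usernames that start with a vowel, a character class combined with the `^` anchor is one way to express it. This is a sketch under stated assumptions: the `usernames` list is invented for illustration and stands in for one column of a CSV file.

```python
import re

# Hypothetical usernames, as they might appear in one column of a CSV file.
usernames = ["alice", "bob", "Eve", "oscar", "_ian", "uma99"]

# ^ anchors the class to the first character; the class lists both cases of each vowel.
vowel_first = [name for name in usernames if re.search(r"^[aeiouAEIOU]", name)]
print(vowel_first)  # ['alice', 'Eve', 'oscar', 'uma99']
```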


The following code checks whether the text passed contains any of these punctuation symbols: commas, periods, colons, semicolons, question marks, and exclamation points.

In [41]:
import re
def check_punctuation(text):
  result = re.search(r"[.,;:?!]", text)
  return result is not None

print(check_punctuation("This is a sentence that ends with a period.")) # True
print(check_punctuation("This is a sentence fragment without a period")) # False
print(check_punctuation("Aren't regular expressions awesome?")) # True
print(check_punctuation("Wow! We're really picking up some steam now!")) # True
print(check_punctuation("End of the line")) # False

True
False
True
True
False


Sometimes we may want to match any character that isn't in a group. To do that, we put a circumflex right after the opening square bracket, as in `[^...]`. 
For example, let's create a search pattern that looks for any character that's not a letter.

In [42]:
# This matches the first character that is not a letter
print(re.search(r"[^a-zA-Z]", "The end of the highway"))
# To exclude spaces as well, add a space inside the brackets
print(re.search(r"[^a-zA-Z ]", "The end of the highway"))
# The pipe | matches either alternative; to find all matches in a string, use findall
print(re.findall(r"cat|dog", "I like cat and dogs"))

<re.Match object; span=(3, 4), match=' '>
None
['cat', 'dog']
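
The difference between `search` and `findall` is easier to see side by side. The sketch below uses a made-up sentence: `search` stops at the first match, while `findall` returns every non-overlapping match as a list of strings.

```python
import re

text = "I like cats, dogs and 2 birds!"

# search reports only the first alternative it finds, scanning left to right.
first = re.search(r"cat|dog", text).group()
# findall collects every non-overlapping match.
animals = re.findall(r"cat|dog", text)
# The negated class lists each character that is neither a letter nor a space.
others = re.findall(r"[^a-zA-Z ]", text)

print(first)    # cat
print(animals)  # ['cat', 'dog']
print(others)   # [',', '2', '!']
```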


# Repetition Qualifiers
Repetition qualifiers let a pattern match repeated characters. The most common one is the star, `*`.  
It's quite common to see expressions that include a dot followed by a star.  
This combination matches any character repeated as many times as possible, including zero times. 

In [43]:
# This matches text that has Py followed by any number of characters and then an n
print(re.search(r"Py.*n", "Pygmalion"))
# The star is greedy: it grabs as much as it can, so the match runs to the last n in the string
print(re.search(r"Py.*n", "Python Programming"))
# Restricting the wildcard to lowercase letters stops the match at the end of the word
print(re.search(r"Py[a-z]*n", "Python Programming"))

<re.Match object; span=(0, 9), match='Pygmalion'>
<re.Match object; span=(0, 17), match='Python Programmin'>
<re.Match object; span=(0, 6), match='Python'>
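
Another way to rein in a greedy star, not covered in the notes above, is the non-greedy form `*?`, which Python's `re` module supports. The sketch below contrasts the two on the same text.

```python
import re

text = "Python Programming"

# Greedy: .* consumes as much as possible, then backs up to the last n.
greedy = re.search(r"Py.*n", text)
# Non-greedy: .*? consumes as little as possible, stopping at the first n.
lazy = re.search(r"Py.*?n", text)

print(greedy.group())  # Python Programmin
print(lazy.group())    # Python
```

The character-class version is often clearer when you know which characters are allowed. The non-greedy form helps when you only know where the match should stop.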


There are two more repetition qualifiers: `+` and `?`.
The plus `+` character matches one or more occurrences of the character that comes before it.
The question mark `?` means either zero or one occurrence of the character before it.

In [44]:
# + requires one or more occurrences of the preceding character
print(re.search(r"o+l+","goldfish"))
print(re.search(r"o+l+","boily"))
# ? allows zero or one occurrence of the preceding character
print(re.search(r"p?each","To each their own"))


<re.Match object; span=(1, 3), match='ol'>
None
<re.Match object; span=(3, 7), match='each'>
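
To compare the three qualifiers directly, the sketch below applies `?`, `+`, and `*` to the same letter in three spellings. The word list and the `patterns` and `results` names are made up for illustration.

```python
import re

# Each pattern is anchored with ^ and $ so the whole word has to match.
patterns = {"?": r"^colou?r$", "+": r"^colou+r$", "*": r"^colou*r$"}

results = {}
for word in ["color", "colour", "colouur"]:
    # Record which qualifiers accept this spelling.
    results[word] = [q for q, p in patterns.items() if re.search(p, word)]
    print(word, results[word])
```

Here "color" is accepted by `?` and `*` (zero occurrences of "u"), "colour" by all three, and "colouur" only by `+` and `*`.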


The repeating_letter_a function checks if the text passed includes the letter "a" (lowercase or uppercase) at least twice.   
For example, repeating_letter_a("banana") is True, while repeating_letter_a("pineapple") is False. 

In [45]:
import re
def repeating_letter_a(text):
  result = re.search(r"a.*a", text, re.IGNORECASE)
  return result is not None

print(repeating_letter_a("banana")) # True
print(repeating_letter_a("pineapple")) # False
print(repeating_letter_a("Animal Kingdom")) # True
print(repeating_letter_a("A is for apple")) # True

True
False
True
True
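
As a sanity check on the regex version, the same question can be answered without regular expressions by counting letters. This is a sketch of an alternative; `repeating_letter_a_count` is a hypothetical name chosen here.

```python
def repeating_letter_a_count(text):
    """Return True when the letter a appears at least twice, ignoring case."""
    return text.lower().count("a") >= 2

for sample in ["banana", "pineapple", "Animal Kingdom", "A is for apple"]:
    print(sample, repeating_letter_a_count(sample))
```

Both approaches agree on these inputs. The regex form becomes the better fit once the condition involves position or surrounding characters.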


# Escaping Characters
So far we have seen several special characters that we can use in our regular expressions to make them match different kinds of strings: dot, star, plus, question mark, circumflex, dollar sign, and square brackets.
But what if we need to match one of these characters literally?  
In these cases we put the escape character `\` (a backslash) before the special character.

In [46]:
# Using backslash to escape a character
print(re.search(r"\.com","carlos123@gmail.com"))

<re.Match object; span=(15, 19), match='.com'>
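
When the literal text comes from a variable, the standard `re.escape()` function can add the backslashes for us. The sketch below also shows what goes wrong when the dot is left unescaped; the sample strings are made up.

```python
import re

domain = ".com"
# re.escape puts a backslash in front of characters that are special in a regex.
pattern = re.escape(domain)
print(pattern)  # \.com

escaped = re.search(pattern, "carlos123@gmail.com")
print(escaped.span())  # (15, 19)

# Unescaped, the dot is a wildcard and happily matches the letter l in "welcome".
unescaped = re.search(r".com", "welcome")
print(unescaped.group())  # lcom
print(re.search(pattern, "welcome"))  # None
```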


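The following code checks whether the text passed has at least 2 groups of alphanumeric characters (including letters, numbers, and underscores) separated by one or more whitespace characters. In the pattern, `\w` matches one such character and `\s` matches one whitespace character.

In [47]:
import re
def check_character_groups(text):
  # \w+ is a group of word characters, \s+ is the whitespace that separates two groups
  result = re.search(r"\w+\s+\w+", text)
  return result is not None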

print(check_character_groups("One")) # False
print(check_character_groups("123  Ready Set GO")) # True
print(check_character_groups("username user_01")) # True
print(check_character_groups("shopping_list: milk, bread, eggs.")) # False

False
True
True
False
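
To see what the backslash sequences pick out on their own, `findall` can list each piece. In Python's `re` module, `\w` matches a letter, digit, or underscore and `\s` matches a whitespace character. The sketch below reuses two of the test strings from above.

```python
import re

# \w+ collects each run of letters, digits, and underscores.
groups = re.findall(r"\w+", "shopping_list: milk, bread, eggs.")
print(groups)  # ['shopping_list', 'milk', 'bread', 'eggs']

# \s+ collects each run of whitespace, here one double space and two single spaces.
gaps = re.findall(r"\s+", "123  Ready Set GO")
print(gaps)  # ['  ', ' ', ' ']
```

In the first string every group is followed by punctuation before the next space, which is why no two groups count as separated by whitespace alone.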
