# REGULAR EXPRESSIONS

![faces](imgs/express.jpg)

The kinds of faces you will make when working with regular expressions.

## Objectives

- Review foundational `re` syntax 
- Learn about tools to help
- Practice
___


## Overview

![terrifying](imgs/example.png)


**This.  
Is.  
Terrifying.**

![gif_of_you](imgs/fear_ignorance.gif)

## Basic Parts

1) Groups - the foundations of regular expressions. A group is the portion of the string you are searching for. <br>
- It can be a word: "bottle", "AROUND", "Link"  
- It can be a number: "25", "19910312", "1.333"  
- It can be anything: "1337 5P3Ak", "<( '.' )>", "something@email.com"  

2) Quantifiers - these instruct how to gather information around the group. <br>
- Do you want to get everything in front of a group?  
- Do you want only the 3 characters that follow the group?  
- Do you want to only partially match the group?  
    
3) Anchors, Ranges, and More - these are regex-specific terms that concisely pick out groups. <br>
- Want a group that contains all lowercase letters?  
- Want a group that has only digits?  
- Want a group that only appears at the end of a string?  
    
It is when all these things are thrown together that regex becomes a powerful tool.
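
As a rough illustration of the three parts working together, here is a minimal sketch. The sample string `text` is made up for this example and is not part of the lesson data.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "order 66 shipped"

# Group: the parentheses capture whatever matches inside them.
# Range: [0-9] matches a single digit.
# Quantifier: + repeats the range one or more times.
digits = re.search(r"([0-9]+)", text)
print(digits.group(1))  # 66

# Anchor: $ ties the group to the end of the string.
last_word = re.search(r"([a-z]+)$", text)
print(last_word.group(1))  # shipped
```

Each piece is simple on its own; the pattern only becomes selective once they are combined.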

In [2]:
import re

First, we need a string. Any old string will do. The next step is to create a pattern (a combination of groups, quantifiers, metacharacters, and then some) to pull out the specific information we seek.

In [3]:
string_1 = ''

When specifying a pattern, **best practice** is to prefix the string with `r`. This makes it a raw string, so Python leaves backslashes alone and passes them straight to the regex engine. [Find out more on that here](https://stackoverflow.com/questions/12871066/what-exactly-is-a-raw-string-regex-and-how-can-you-use-it) if interested.

So a standard pattern in Python looks like `r"( -CAPTURING GROUP- )"`
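
To see why the `r` prefix matters, the sketch below compares a raw and a non-raw version of the same pattern. The string `sentence` is a made-up example.

```python
import re

# In a normal string, "\b" is the backspace character (one character).
# In a raw string, r"\b" is a backslash followed by "b" (two characters),
# which the regex engine reads as a word boundary.
plain_len = len("\b")
raw_len = len(r"\b")
print(plain_len, raw_len)  # 1 2

sentence = "a cat sat"  # hypothetical sample text
with_raw = re.findall(r"\bcat\b", sentence)
without_raw = re.findall("\bcat\b", sentence)
print(with_raw)     # ['cat']
print(without_raw)  # []
```

Without the prefix, the pattern silently looks for backspace characters and finds nothing.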

In [4]:
pattern = r"()"

We will run our pattern against the string with `re.findall`, `re.search`, and `re.match` to see if there is a group that matches.

### re.findall

In [7]:
result = re.findall(pattern, string_1)
result

['']

### re.search

In [6]:
result = re.search(pattern, string_1)
result

<re.Match object; span=(0, 0), match=''>

`re.search` returns a `Match` object. `re.search` finds the first instance of the pattern in the string and then stops. The object gives the specific position of the match within the string.
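
A short sketch of how to read the position back out of a `Match` object. The string `sample` is invented for this example.

```python
import re

sample = "find the needle in the haystack"  # hypothetical sample text
found = re.search(r"(needle)", sample)

print(found.group())  # needle
print(found.start())  # 9
print(found.end())    # 15
print(found.span())   # (9, 15)

# The span can be used to slice the original string.
print(sample[found.start():found.end()])  # needle
```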

### re.match

In [66]:
pattern = r"(it)"  # assumed pattern: the cell that redefined `pattern` is not shown
result = re.match(pattern, string_1)
print(result)

None


`re.match` returns a `Match` object when the pattern is found and `None` otherwise. `re.match` only looks for the pattern at the **start** of a string.

In [67]:
string_2 = "it is crazy, isn't it!"

In [68]:
result = re.match(pattern, string_2)
print(result)

<re.Match object; span=(0, 2), match='it'>
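
Because a failed match comes back as `None`, it is worth checking the result before calling methods on it. A minimal sketch, using a made-up string `greeting`:

```python
import re

greeting = "hello world"  # hypothetical sample text

at_start = re.match(r"(world)", greeting)   # 'world' is not at index 0
anywhere = re.search(r"(world)", greeting)  # search scans the whole string

print(at_start)          # None
print(anywhere.group())  # world

# Guard against None before calling Match methods.
if at_start is None:
    print("no match at the start")
```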


___
## Anchors, Ranges, and More

Characters | Anchors | Groups 
-|-|-
![characters](imgs/cclass.png) | ![anchors](imgs/anchors.png) | ![groups](imgs/groups.png)

In [87]:
string_1 = 'This is test string. This took me all of 7 \n seconds to write it. I wrote it fast.'
string_2 = 'itiscrazyhowlongthisru790nonsente234nceis!'
string_3 = 'How in the world would it work without being crazy?'

loop_through = [string_1, string_2, string_3]

In [88]:
pattern = r"(^it)" #adding the start of string anchor

In [89]:
for string in loop_through:
    print(re.search(pattern, string))

None
<re.Match object; span=(0, 2), match='it'>
None
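
The anchors `^` and `$` can be sketched on a small two-line string. The string `lines` is an assumption made for this example. `re.MULTILINE` is a standard flag of the `re` module.

```python
import re

lines = "it starts here\nit continues"  # hypothetical two-line sample

# Without flags, ^ only matches at the very start of the string.
start_only = re.findall(r"^it", lines)
print(start_only)  # ['it']

# With re.MULTILINE, ^ also matches right after each newline.
every_line = re.findall(r"^it", lines, re.MULTILINE)
print(every_line)  # ['it', 'it']

# $ anchors the pattern to the end of the string.
ending = re.search(r"continues$", lines)
print(ending.group())  # continues
```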


In [59]:
pattern = r"\d" #any digit.

In [60]:
for string in loop_through:
    print(re.search(pattern, string))

<re.Match object; span=(41, 42), match='7'>
<re.Match object; span=(22, 23), match='7'>
None


In [90]:
pattern = r"[a-z].*" #one lowercase letter (a through z), then '.*': any character except newline, zero or more times (greedy).

In [91]:
for string in loop_through:
    print(re.search(pattern, string))

<re.Match object; span=(1, 43), match='his is test string. This took me all of 7 '>
<re.Match object; span=(0, 42), match='itiscrazyhowlongthisru790nonsente234nceis!'>
<re.Match object; span=(1, 51), match='ow in the world would it work without being crazy>


In [84]:
pattern = r"[\w]+" #one or more consecutive `\w` characters (letters, digits, or underscore).


In [67]:
for string in loop_through:
    print(re.search(pattern, string))

<re.Match object; span=(0, 4), match='This'>
<re.Match object; span=(0, 41), match='itiscrazyhowlongthisru790nonsente234nceis'>
<re.Match object; span=(0, 3), match='How'>


___
## Quantifiers

`regex` quantifiers can be broken into roughly three categories: specific, greedy, and lazy.

Specific: you set an exact quantity, or a range of quantities, of the _same_ character to gather.  
Greedy: match as many characters as possible.  
Lazy: match as few characters as possible.  

In [None]:
repeating_patterns = "ah aaaaaaaah aahhhh aaaahhhhhhh ahah "

pattern_1 = r"(a{1})"
result = re.search(pattern_1, repeating_patterns)
result

In [None]:
pattern_2 = r"(a{2})"
result = re.search(pattern_2, repeating_patterns)
result

In [None]:
pattern_3 = r"(a{3,})"
result = re.search(pattern_3, repeating_patterns)
result
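
For reference, here is a sketch of what the three specific quantifiers pick up on the same string. The variable names `one`, `two`, and `three_plus` are introduced only for this example.

```python
import re

repeating_patterns = "ah aaaaaaaah aahhhh aaaahhhhhhh ahah "

one = re.search(r"(a{1})", repeating_patterns)
two = re.search(r"(a{2})", repeating_patterns)
three_plus = re.search(r"(a{3,})", repeating_patterns)

print(one.span())          # (0, 1): the 'a' in "ah"
print(two.span())          # (3, 5): the first two letters of the second word
print(three_plus.group())  # the whole run of a's in the second word
```

`{3,}` has no upper bound, so it keeps going until the run of `a` characters ends.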

Now let's take a look at greedy and lazy quantifiers.

First, the simplest metacharacter in regex is just a period, `.`. It matches any single character besides a newline.
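
The newline exception can be seen in a small sketch. The string `two_lines` is made up for this example. `re.DOTALL` is a standard `re` flag that lifts the exception.

```python
import re

two_lines = "first\nsecond"  # hypothetical sample text

# By default, . stops at the newline.
default_dot = re.search(r".*", two_lines).group()
print(default_dot)  # first

# re.DOTALL lets . match the newline as well.
dotall = re.search(r".*", two_lines, re.DOTALL).group()
print(repr(dotall))  # 'first\nsecond'
```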

In [None]:
state = "Mississippi"
anything_pattern = r"(.)"
result = re.match(anything_pattern, state)
result

In [None]:
state = "Mississippi"
specific_pattern = r"(s{2})"
result = re.search(specific_pattern, state)
result

To make a quantifier lazy, follow it with a question mark `?` (for example `*?` or `+?`). The quantifier then grabs the fewest characters it can while still obeying all the group constraints. On its own, `?` just means "zero or one" of the preceding item.

In [None]:
state = "Mississippi"
lazy_pattern = r"(s{2}.*?)"
result = re.search(lazy_pattern, state)
result

Quantifiers are greedy by default. An asterisk `*` (zero or more) grabs as many characters as it can.

In [None]:
state = "Mississippi"
greedy_pattern = r"(s{2}.*)"
result = re.search(greedy_pattern, state)
result
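
The difference between greedy and lazy is easiest to see side by side. The sketch below uses two made-up strings, `markup` and `year`.

```python
import re

markup = "<b>bold</b> and <i>italic</i>"  # hypothetical sample text

# Greedy: .* runs all the way to the last '>' in the string.
greedy = re.search(r"<.*>", markup).group()
print(greedy)  # <b>bold</b> and <i>italic</i>

# Lazy: .*? stops at the first '>' it can.
lazy = re.search(r"<.*?>", markup).group()
print(lazy)  # <b>

year = "2024"  # hypothetical sample text
greedy_digits = re.search(r"\d+", year).group()
lazy_digits = re.search(r"\d+?", year).group()
print(greedy_digits)  # 2024
print(lazy_digits)    # 2
```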

A `Match` object has a method called `.groups()`, which returns a tuple of everything the capturing groups caught. This is helpful when you start adding additional groups and quantifiers.
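
A minimal sketch of `.groups()` with more than one capturing group. The three-group pattern is an illustration, not the lesson's own pattern.

```python
import re

phone = "530-657-9090"  # same format as the numbers in the practice section
parts = re.search(r"(\d{3})-(\d{3})-(\d{4})", phone)

print(parts.groups())  # ('530', '657', '9090')
print(parts.group(0))  # 530-657-9090 (the whole match)
print(parts.group(2))  # 657 (the second group only)
```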


## Tools

From least helpful to most helpful
___
Cheat Sheets
- [Data Quest](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)
- [Rexegg](https://www.rexegg.com/regex-quickstart.html)
- [Debuggex](https://www.debuggex.com/cheatsheet/regex/python)
___
Tutorials
- [Regular-Expressions.info](https://www.regular-expressions.info/tutorial.html)
- [RegexOne](https://regexone.com)
- [RegexTutorials](http://regextutorials.com/)
___
Online live editors
- https://regex101.com
- https://regexr.com (No Python)
- https://www.regextester.com
___
Stackoverflow
- https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075

Even with all these resources, regex still takes a while to learn, let alone master. That just means...
___
___
___
## PRACTICE

- `re.match` only looks at the beginning of a string. Returns a Match obj, or `None` if there is no match.
- `re.search` looks at the entire string and finds the first instance. Returns a Match obj, or `None` if there is no match.
- `re.findall` finds all instances of the pattern in the string. DOES NOT return a Match obj; it returns a list.
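
What `re.findall` puts in its list depends on how many capturing groups the pattern has. A short sketch, with a made-up string `pairs`:

```python
import re

pairs = "a=1, b=2"  # hypothetical sample text

# No groups: a list of the full matches.
no_groups = re.findall(r"\w=\d", pairs)
print(no_groups)  # ['a=1', 'b=2']

# One group: a list of what that group captured.
one_group = re.findall(r"(\w)=\d", pairs)
print(one_group)  # ['a', 'b']

# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r"(\w)=(\d)", pairs)
print(two_groups)  # [('a', '1'), ('b', '2')]
```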

### 1) Take the first full word of each string.

_Hint: You will likely use the `+` quantifier_

In [115]:
list_of_sentences = ["How can you even do that?", "Who thinks this is a waste of their time?", "Nevermind, don't tell me if you think it is."]
pattern = r"(\w+)"
for sentence in list_of_sentences:
    result = re.match(pattern, sentence)
    print(result)

<re.Match object; span=(0, 3), match='How'>
<re.Match object; span=(0, 3), match='Who'>
<re.Match object; span=(0, 9), match='Nevermind'>


### 2) Find the second full word of the strings but now using `re.search`

_Hint: Word boundaries paired with something else. Look for patterns._

In [127]:
list_of_statements = ["I do not think you are having fun yet.", "See, this is why people don't like regex.", "Try now to your resources!"]
pattern = r"\b[a-z]+"
for statement in list_of_statements:
    result = re.search(pattern, statement)
    print(result)

<re.Match object; span=(2, 4), match='do'>
<re.Match object; span=(5, 9), match='this'>
<re.Match object; span=(4, 7), match='now'>
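
The pattern above works because every first word starts with a capital letter. One possible alternative, sketched below, skips the first word explicitly and captures the second. The pattern `second_word` and the string `lowercase_start` are assumptions added for illustration.

```python
import re

list_of_statements = ["I do not think you are having fun yet.", "See, this is why people don't like regex.", "Try now to your resources!"]

# \w+ consumes the first word, \W+ the separators, (\w+) captures the second word.
second_word = r"\w+\W+(\w+)"
seconds = [re.search(second_word, s).group(1) for s in list_of_statements]
print(seconds)  # ['do', 'this', 'now']

# A sentence that starts in lowercase shows the difference.
lowercase_start = "we do not agree"  # hypothetical sample text
by_boundary = re.search(r"\b[a-z]+", lowercase_start).group()
by_capture = re.search(second_word, lowercase_start).group(1)
print(by_boundary)  # we
print(by_capture)   # do
```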


### 3) Find all the lowercase letters in the following strings

_Hint: What would you do if you had to take all the uppercase letters?_

In [111]:
list_of_gibberish = ["g3N3r471N' words 7O be 48l3 to 3X7r4c7 V14 R3g3x", "k4N anyone r34Lly read 7h1z?", "1 K4'nt but TH@ d032'nt 5T0P me pHR0m using 1t"]
pattern = r"[a-z]+"
for gibber in list_of_gibberish:
    result = re.findall(pattern, gibber)
    print(result)

['g', 'r', 'words', 'be', 'l', 'to', 'r', 'c', 'g', 'x']
['k', 'anyone', 'r', 'ly', 'read', 'h', 'z']
['nt', 'but', 'd', 'nt', 'me', 'p', 'm', 'using', 't']


### 4) Extract **ALL** the phone numbers from the following strings.

_Hint: You will need to use a quantifier you JUST used, if you want the pattern as short as possible. Otherwise, make the pattern as long as you need it._

In [95]:
list_of_numbers = ["Call me at 530-657-9090", "Wasn't your number 606-849-9038", "No, it was 703-952-6949"]
pattern = r"[\d-]+"
for number in list_of_numbers:
    result = re.search(pattern, number)
    print(result)

<re.Match object; span=(11, 23), match='530-657-9090'>
<re.Match object; span=(19, 31), match='606-849-9038'>
<re.Match object; span=(11, 23), match='703-952-6949'>
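
The short pattern accepts any run of digits and hyphens, which is fine for these strings but loose in general. A sketch of a longer, stricter pattern follows. The string `message` is invented for this example.

```python
import re

message = "Call 530-657-9090 or 606-849-9038 - not 12-34."  # hypothetical sample text

# Short pattern: also picks up a stray hyphen and a partial number.
loose = re.findall(r"[\d-]+", message)
print(loose)  # ['530-657-9090', '606-849-9038', '-', '12-34']

# Longer pattern: exactly 3 digits, 3 digits, 4 digits.
strict = re.findall(r"\d{3}-\d{3}-\d{4}", message)
print(strict)  # ['530-657-9090', '606-849-9038']
```

`re.findall` is used here so that every number in a string is returned, not just the first.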


### 5) Make a pattern that will identify all the following email addresses.

_Hint: You have seen this before. Go look through everything again._

In [97]:
list_of_emails = ["please_stop@gmail.com", "STORMLORD668@doom.com", "jonnel_roxs@aol.com"]
pattern = r"[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}"
for email in list_of_emails:
    result = re.findall(pattern, email)
    print(result)

['please_stop@gmail.com']
['STORMLORD668@doom.com']
['jonnel_roxs@aol.com']
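
To check whether a whole string is an email, as opposed to finding one inside a longer string, `re.fullmatch` can be used with the same pattern. This is a sketch only. The list `candidates` is made up, and the pattern is not a complete email validator.

```python
import re

email_pattern = r"[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}"

# Hypothetical inputs: one valid address and two that should be rejected.
candidates = ["please_stop@gmail.com", "not an email", "trailing@junk.com!!!"]

# fullmatch succeeds only if the pattern covers the entire string.
is_email = [re.fullmatch(email_pattern, c) is not None for c in candidates]
print(is_email)  # [True, False, False]

# findall still finds the address hidden inside the last string.
hidden = re.findall(email_pattern, candidates[2])
print(hidden)  # ['trailing@junk.com']
```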


___
## Further Practice

https://regexone.com/ - Lessons

https://regexcrossword.com - Practice recognizing and reasoning out regex