# Regular Expressions

Regular expressions (often shortened to regex) let you search strings using almost any rule you can describe, such as finding all capital letters in a string or finding a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax, which is a byproduct of their flexibility. They must be able to describe any string pattern you can imagine, which is why the pattern format itself is complex.

In Python, regular expressions are handled by the built-in **re** module. See [the docs](https://docs.python.org/3/library/re.html) for more information.

In [1]:
text = 'The phone of the agent is 750-643-8666. Call soon!'

In [2]:
"750-643-8666" in text

True

In [3]:
import re

In [4]:
pattern = 'phone'

In [5]:
re.search(pattern,text)

<re.Match object; span=(4, 9), match='phone'>

In [6]:
my_match = re.search(pattern,text)

In [7]:
my_match.span()

(4, 9)

In [8]:
my_match.start()

4

In [9]:
my_match.end()

9

In [10]:
text = 'my phone is a new phone'

In [11]:
# re.search returns only the first match, even though 'phone' appears twice
match = re.search(pattern,text)

In [12]:
match.span()

(3, 8)

In [13]:
# re.findall returns every non-overlapping match as a list of strings
all_matches = re.findall(pattern,text)

In [14]:
len(all_matches)

2

In [15]:
for match in re.finditer(pattern,text):
    print(match.span())

(3, 8)
(18, 23)
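The three functions used so far return different things, which is easy to mix up. The short sketch below (the variable names are illustrative only) shows that `re.search` returns `None` when nothing matches, `re.findall` returns plain strings, and `re.finditer` yields match objects that also carry positions.

```python
import re

text = 'my phone is a new phone'

# re.search returns None when the pattern is absent, so test before using it.
missing = re.search('tablet', text)
if missing is None:
    print('no match')

# re.findall returns the matched strings themselves.
found = re.findall('phone', text)
print(found)

# re.finditer yields match objects, so positions are available too.
spans = [m.span() for m in re.finditer('phone', text)]
print(spans)
```

Checking for `None` before calling `.span()` or `.group()` avoids an `AttributeError` when the pattern is not found.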


# Patterns

So far we've learned how to search for a basic string. What about more complex cases, such as finding a telephone number or an email address in a large string of text?

We could just use the search method if we knew the exact phone number or email address, but what if we don't? We may know only the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.

Let's begin!

## Identifiers for Characters in Patterns

Character types such as digits, word characters, and whitespace each have a code that represents them. You can use these codes to build up a pattern string. Notice how they make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

    r'mypattern'
    
Placing the r in front of the string tells Python to treat it as a raw string, so the \ characters in the pattern are not interpreted as escape characters.
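To see what the r prefix changes, the sketch below compares a normal string with a raw string. It uses `\b` (a word boundary), an identifier that is not covered in this section; it was chosen for illustration only because Python itself gives `\b` a meaning (backspace) in normal strings.

```python
import re

# In a normal string, '\b' is converted to a single backspace character.
normal = '\bcat\b'
# In a raw string, the backslash and the 'b' stay as two characters,
# which the re module then reads as a word boundary.
raw = r'\bcat\b'

print(len(normal), len(raw))

sentence = 'the cat sat'
print(re.search(normal, sentence))  # looks for literal backspace characters
print(re.search(raw, sentence))     # looks for the whole word 'cat'
```

Only the raw version finds the word, because the normal string never delivers a backslash to the regular expression engine.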

The table below lists the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric or underscore</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric (and not an underscore)</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
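As a quick check of the table, the sketch below tests each example pattern against its example match. It uses `re.fullmatch`, which is not introduced in this section; it succeeds only when the entire string fits the pattern. The `examples` and `results` names are illustrative only.

```python
import re

# (pattern, example match) pairs copied from the table above.
examples = [
    (r'file_\d\d', 'file_25'),
    (r'\w-\w\w\w', 'A-b_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\D\D\D', 'ABC'),
    (r'\W\W\W\W\W', '*-+=)'),
    (r'\S\S\S\S', 'Yoyo'),
]

results = {}
for pattern, candidate in examples:
    # True when the whole candidate string fits the pattern.
    results[pattern] = re.fullmatch(pattern, candidate) is not None
    print(pattern, candidate, results[pattern])
```

Note that `A-b_1` matches `\w-\w\w\w` because `\w` accepts the underscore as well as letters and digits.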

In [16]:
text = 'my phone is 888-643-8666.'

In [17]:
pattern = r'\d\d\d-\d\d\d-\d\d\d\d'

In [18]:
phone_number = re.search(pattern,text)

In [19]:
phone_number.group()

'888-643-8666'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## Quantifiers

Now that we know the special character designations, we can use them along with quantifiers to define how many of each we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
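The quantifier examples in the table can be checked the same way. This is a minimal sketch that again relies on `re.fullmatch` (not introduced in this section) to require that the whole string fits the pattern; the `examples` and `results` names are illustrative only.

```python
import re

# (pattern, example match) pairs copied from the quantifier table above.
examples = [
    (r'Version \w-\w+', 'Version A-b1_1'),
    (r'\D{3}', 'abc'),
    (r'\d{2,4}', '123'),
    (r'\w{3,}', 'anycharacters'),
    (r'A*B*C*', 'AAACC'),
    (r'plurals?', 'plural'),
]

results = [re.fullmatch(p, s) is not None for p, s in examples]
print(results)

# A counterexample: five digits is outside the {2,4} range.
too_long = re.fullmatch(r'\d{2,4}', '12345')
print(too_long)
```

In `A*B*C*`, the `B*` part matches zero characters, which is why `AAACC` is accepted even though it contains no `B`.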

In [20]:
pattern = r'\d{3}-\d{3}-\d{4}'

In [21]:
phone = re.search(pattern,text)

In [22]:
phone.group()

'888-643-8666'

In [23]:
# Parentheses create groups whose contents can be retrieved separately
pattern = r'(\d{3})-(\d{3})-(\d{4})'

In [24]:
mymatch = re.search(pattern,text)

In [28]:
mymatch.group(2)

'643'
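Group numbering starts at 1, with `group()` (or `group(0)`) returning the whole match. The sketch below repeats the phone number search and also shows `groups()`, which returns every group at once as a tuple; the `match` variable name is illustrative only.

```python
import re

text = 'my phone is 888-643-8666.'
match = re.search(r'(\d{3})-(\d{3})-(\d{4})', text)

print(match.group())    # the whole match, same as group(0)
print(match.group(1))   # the first parenthesised group
print(match.groups())   # all groups together as a tuple
```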

In [32]:
# The | operator matches either alternative; search returns the first match found
re.search(r"man|woman","this man was here with woman")

<re.Match object; span=(5, 8), match='man'>

In [35]:
# The . wildcard matches any single character except a newline
re.findall(r'.at',"The cat in hat sat")

['cat', 'hat', 'sat']

In [40]:
# Outside brackets, ^ anchors the pattern to the start of the string
re.findall(r'^\d','2 THis end with no')

['2']

In [41]:
phrase = "There are 3 number 34 in inside 5 this sentence"

In [43]:
# Inside square brackets, ^ negates the set: match runs of non-digit characters
re.findall(r"[^\d]+",phrase)

['There are ', ' number ', ' in inside ', ' this sentence']

In [44]:
test_phrase = "This is string! but it has punctuation. How to remove it?"

In [46]:
# Match runs of characters that are not !, . or ?
mylist = re.findall(r'[^!.?]+',test_phrase)

In [47]:
mylist

['This is string', ' but it has punctuation', ' How to remove it']

In [49]:
removed_punctuation = ' '.join(mylist)

In [50]:
removed_punctuation

'This is string  but it has punctuation  How to remove it'
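Joining the pieces with a space leaves doubled spaces where the punctuation used to be, because each piece after the first already starts with a space. One possible alternative, sketched below, is `re.sub`, which is not used elsewhere in this section; it replaces every match of a pattern with a given string. The `cleaned` name is illustrative only.

```python
import re

test_phrase = "This is string! but it has punctuation. How to remove it?"

# Replace each !, . or ? with an empty string, leaving everything else intact.
cleaned = re.sub(r'[!.?]', '', test_phrase)
print(cleaned)
```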

In [51]:
text = 'Only find the hyphen-words. Where are the long-ish dash words?'

In [52]:
# One or more word characters, a hyphen, then one or more word characters
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']