# Regular Expressions
A regular expression (regex) is a sequence of characters that defines a search pattern. It is used to find text that matches the pattern within a larger body of text.
 
 
 “Knowing Regular Expression can mean the difference between solving a problem in 3 steps and solving it in 3,000 steps. When you’re a nerd, you forget that the problems you solve with a couple keystrokes can take other people days of tedious, error-prone work to slog through.” - Cory Doctorow

In [None]:
import re

To create a Regex object, call the re.compile() function, which takes a string as its argument. This string represents the pattern you want to match in a text.

```python
re.compile(pattern)  # pattern is a string; returns a compiled re.Pattern object
```

In [None]:
phoneNum = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

In [None]:
type(phoneNum)

re.Pattern

In [None]:
foundNum = phoneNum.search('bla bla bla 123-123-1234 133-133-1234 bla bla bla ')  # search() stops at the first match

The search() method returns a Match object for the first match in the string, or None if the regex pattern is not found.
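
Because search() can return None, calling group() directly on the result raises an AttributeError when nothing matches. A minimal sketch of the usual guard follows; the helper name `first_phone_number` is hypothetical and not part of the original notebook.

```python
import re

phone_num = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

def first_phone_number(text):
    """Return the first phone number found in text, or None if there is none."""
    match = phone_num.search(text)
    if match is None:
        # No match: return None instead of calling .group() on None
        return None
    return match.group()

print(first_phone_number('call 123-123-1234 today'))  # 123-123-1234
print(first_phone_number('no number here'))           # None
```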

In [None]:
print(foundNum)
print(foundNum.group())

<re.Match object; span=(12, 24), match='123-123-1234'>
123-123-1234


Match objects have a group() method that returns the actual matched text from the searched string.

- Import the regex module
- Create a Regex object with the re.compile() function. 
    - (Remember to use a raw string.)
- Pass the string you want to search into the Regex object’s search() method. 
    - This returns a Match object.
- Call the Match object’s group() method.
    - This returns a string of the actual matched text.
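
The raw-string reminder matters because Python normally treats the backslash as an escape character inside string literals. The sketch below illustrates the difference; the variable names are made up for this example.

```python
import re

# In a raw string the backslash is kept literally, so r'\d' is the two
# characters backslash and d, the same as the escaped normal string '\\d'.
print(r'\d' == '\\d')          # True

# r'\n' is backslash + n (2 characters); '\n' is a single newline character.
print(len(r'\n'), len('\n'))   # 2 1

# Both spellings compile to the same pattern; the raw string is easier to read.
raw_pattern = re.compile(r'\d\d\d')
escaped_pattern = re.compile('\\d\\d\\d')
sample = 'room 101'
print(raw_pattern.search(sample).group())      # 101
print(escaped_pattern.search(sample).group())  # 101
```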

# Grouping

In [None]:
phoneNum = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
foundNum = phoneNum.search('bla bla bla 123-456-7890 098-765-5434 bla bla bla ')  # only the first number is matched

In [None]:
for i in range(len(foundNum.groups()) + 1):
    print('group' + str(i) + ' ' + foundNum.group(i))  # group(0) is the whole match; group(1), group(2), ... are the parenthesized groups
print(foundNum.group())  # with no argument, returns the entire matched text
foundNum.groups()  # returns a tuple of the text matched by every group

group0 123-456-7890
group1 123
group2 456
group3 7890
123-456-7890


('123', '456', '7890')

In [None]:
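# \( and \) match literal parentheses; the unescaped ( ) create the two groups
phoneNum = re.compile(r'(\(\d\d\d\))-(\d\d\d-\d\d\d\d)')
foundNum = phoneNum.search('bla bla bla (123)-456-7890 bla bla bla ')

print('foundNum.groups() is a: ' + str(type(foundNum.groups())))

# groups() returns a tuple, so it can be unpacked into separate variables
areaCode, mainNumber = foundNum.groups()
print(areaCode)
print(type(areaCode))
print(mainNumber)

foundNum.groups() is a: <class 'tuple'>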
(123)
<class 'str'>
456-7890


## Pipe
The | character is called a pipe (Shift + \ on most US keyboards).

NOTE: The | in a regular expression works differently from the | in a Linux shell. In the shell, | feeds the output of one command into another; in a regular expression it means 'OR'.

happy or sad => happy|sad

When both happy and sad occur in the searched string, the first occurrence of matching text will be returned as the Match object.
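
A self-contained sketch of that "first occurrence wins" behavior, which needs no external file. The sample sentences and the `Bat(man|woman|mobile)` pattern are illustrative assumptions, not taken from the notebook's data.

```python
import re

mood = re.compile(r'happy|sad')

# Whichever alternative appears first in the text is the one returned.
first = mood.search('I was sad, then happy')
print(first.group())   # sad

second = mood.search('I was happy, then sad')
print(second.group())  # happy

# Parentheses limit the alternation to one part of the pattern.
hero = re.compile(r'Bat(man|woman|mobile)')
found = hero.search('The Batmobile lost a wheel')
print(found.group())   # Batmobile
print(found.group(1))  # mobile
```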

In [None]:
import json
with open('superheros.json') as f:
    superheros = json.load(f)
findWord = re.compile(r'DC Comics|Batman')
foundWord = findWord.search(json.dumps(superheros))  # returns whichever alternative appears first in the JSON text
print(foundWord.group())

Batman


In [None]:
bats = re.compile(r'bat(wo)?man')
foundWord = bats.search('the adventures of batwoman')
print(foundWord.group())
foundWord = bats.search('the adventures of batman')
print(foundWord.group())

batwoman
batman
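
In the pattern above, the ? marks the group (wo) as optional, so it matches zero or one time. A short sketch of what that means for the group itself, under the assumption that the same `bats` pattern is reused:

```python
import re

bats = re.compile(r'bat(wo)?man')

with_group = bats.search('the adventures of batwoman')
without_group = bats.search('the adventures of batman')

# When the optional group takes part in the match, group(1) holds its text.
print(with_group.group(1))     # wo

# When the optional group is skipped, group(1) is None, not an empty string.
print(without_group.group(1))  # None

# "Zero or one" means two copies of the group do not match.
print(bats.search('batwowoman'))  # None
```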


# Other options

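    The ?  means "match zero or one" (the preceding item is optional)
    The *  means "match zero or more"
    The +  means "match one or more"
    The {} means "repeat a specific number of times"
    (la){3} = (la)(la)(la)
    (la){3,5} = (la)(la)(la)|(la)(la)(la)(la)|(la)(la)(la)(la)(la)
    Note: without the parentheses, la{3} repeats only the a, so it matches laaa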
    
    \d Any numeric digit from 0 to 9.
    \D Any character that is not a numeric digit from 0 to 9.
    \w Any letter, numeric digit, or the underscore character.
    \W Any character that is not a letter, numeric digit, or the underscore character.
    \s Any space, tab, or newline character.
    \S Any character that is not a space, tab, or newline.
    [abc]  Match any one of the characters in the brackets
    [^abc] Match any one character that is not in the brackets
    
    $ Match must occur at the end of the searched text
    ^ Match must occur at the beginning of the searched text
    . Wildcard: matches any single character except newline
    .* Match any run of zero or more characters except newline (greedy)
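
A few of the options above, sketched on made-up sample strings. The module-level re.search() and re.findall() functions are used here as shortcuts; they are standard functions in the re module but do not appear elsewhere in this notebook.

```python
import re

# Without parentheses, {3} applies only to the preceding character 'a'.
print(re.search(r'la{3}', 'laaa lalala').group())    # laaa
print(re.search(r'(la){3}', 'laaa lalala').group())  # lalala

# {3,5} is greedy: it takes as many repetitions as it can, up to 5.
print(re.search(r'(la){3,5}', 'lalalalalalala').group())  # lalalalala

# * allows zero repetitions, + requires at least one.
print(re.findall(r'ab*', 'a ab abbb'))  # ['a', 'ab', 'abbb']
print(re.findall(r'ab+', 'a ab abbb'))  # ['ab', 'abbb']

# Character classes: [] matches one listed character, [^] one unlisted character.
print(re.findall(r'[aeiou]', 'regex'))        # ['e', 'e']
print(re.findall(r'[^aeiou]', 'regex'))       # ['r', 'g', 'x']
print(re.findall(r'\d+', 'call 415 or 212'))  # ['415', '212']
```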
    

In [None]:
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d') # has no groups
phoneNumRegex.findall('Cell: 415-555-9999 Work: 212-555-0000')
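
Unlike search(), findall() returns every non-overlapping match, and the shape of the result depends on whether the pattern contains groups. A minimal sketch on the same sample text; the names `no_groups` and `with_groups` are illustrative.

```python
import re

text = 'Cell: 415-555-9999 Work: 212-555-0000'

# No groups in the pattern: findall() returns a list of matched strings.
no_groups = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
print(no_groups.findall(text))
# ['415-555-9999', '212-555-0000']

# Groups in the pattern: findall() returns a list of tuples, one tuple per
# match, holding the text of each group.
with_groups = re.compile(r'(\d\d\d)-(\d\d\d)-(\d\d\d\d)')
print(with_groups.findall(text))
# [('415', '555', '9999'), ('212', '555', '0000')]
```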



In [None]:
findWord = re.compile(r'Batman')
foundWord = findWord.findall(json.dumps(superheros))  # list of every non-overlapping match
print(foundWord)
# Alternative to try: find every lowercase vowel instead
#findWord = re.compile(r'[aeiou]')
#foundWord = findWord.findall(json.dumps(superheros))
#print(foundWord)

['Batman']


In [None]:
findWord = re.compile(r'man^123')  # ^ only matches at the start of the text, so a ^ placed after 'man' can never match
foundWord = findWord.findall('man123 batman is an amazing man')
print(foundWord)

[]
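
The empty result above comes from putting ^ in the middle of the pattern. A sketch of the anchors in their usual positions, on the same sample text; the pattern names are made up for illustration.

```python
import re

text = 'man123 batman is an amazing man'

starts_with_man = re.compile(r'^man')  # 'man' only at the very beginning
ends_with_man = re.compile(r'man$')    # 'man' only at the very end
whole_digits = re.compile(r'^\d+$')    # the entire text must be digits

print(starts_with_man.findall(text))  # ['man']
print(ends_with_man.findall(text))    # ['man']

# Using both anchors forces the pattern to cover the whole string.
print(whole_digits.search('12345') is not None)  # True
print(whole_digits.search('123a5') is not None)  # False
```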


In [None]:
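findWord = re.compile(r'[\w.*?]+[mM]+an')  # inside [], the characters . * ? are literal, not wildcards or quantifiers
foundWord = findWord.findall("Ironman and Batman fought Superman. Superman beat the bat. Man that was a good fight")  # 'Man' on its own is skipped: at least one character must come before the m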
print(foundWord)

['Ironman', 'Batman', 'Superman', 'Superman']
