# Regular Expressions

#### A regular expression is a pattern that the regular expression engine attempts to match in input text.

Regular expressions let you search text using almost any rule you can describe, for example finding every capital letter in a string or locating a phone number in a document.

Because a pattern describes a whole class of strings rather than one fixed string, a single expression can match many different inputs.

### Searching for Basic Patterns

In [1]:
text = "The agent's phone phone number is 408-555-1234. Call soon!"

In [2]:
'phone' in text

True

In [3]:
'phone' not in text

False

In [4]:
'Phone' in text

False
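The in operator compares characters exactly, which is why 'Phone' is not found above. A minimal sketch of two ways around this, using a sample sentence made up for illustration: lowercasing the text first, or using the re module with the re.IGNORECASE flag.

```python
import re

# Sample sentence invented for this illustration.
text = "The agent's Phone number is 408-555-1234."

exact = 'phone' in text            # case-sensitive substring test
lowered = 'phone' in text.lower()  # normalise the case before testing
flagged = re.search('phone', text, flags=re.IGNORECASE)

print(exact)            # False
print(lowered)          # True
print(flagged.group())  # Phone
```

The regex version also reports the text as it actually appears in the input, which the lowercase test cannot do.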

In [6]:
import re

re.search() takes the pattern, scans the text, and returns a Match object for the first match it finds.

If the pattern is not found, None is returned.
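Because re.search() can return None, calling a method such as .group() on the result without checking raises an AttributeError. A small sketch of a guard, where find_first is a hypothetical helper name chosen for this example:

```python
import re

def find_first(pattern, text):
    """Return the first matched text, or None when the pattern is absent."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group()

sample = "The agent's phone number is 408-555-1234. Call soon!"
found = find_first("phone", sample)
missing = find_first("fax", sample)

print(found)    # phone
print(missing)  # None
```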

In [14]:
text = "phone The agent's phone phone number is 408-555-1234. Call soon!"
pattern = 'phone'

In [15]:
match = re.search(pattern,text)

In [16]:
print(match)

<re.Match object; span=(0, 5), match='phone'>


In [17]:
match.span()

(0, 5)

In [18]:
match.start()

0

In [19]:
match.end()

5

In [20]:
text = "my phone is a new phone phone"

In [21]:
match = re.search("phone",text)

In [22]:
match.span()

(3, 8)

In [23]:
matches = re.findall("phone",text)

In [24]:
matches

['phone', 'phone', 'phone']

In [25]:
len(matches)

3

re.findall() returns only the matched strings. To get a Match object for every occurrence, iterate with re.finditer():

In [26]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)
(24, 29)


To get the actual text that matched, use the .group() method.

In [27]:
match.group()

'phone'
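Note that match above is simply the last Match object left over from the loop. As a sketch, the text and position of every occurrence can be collected in one pass:

```python
import re

text = "my phone is a new phone phone"

# One (matched text, start, end) tuple per occurrence.
hits = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", text)]
print(hits)
# [('phone', 3, 8), ('phone', 18, 23), ('phone', 24, 29)]
```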

In [28]:
import string
string.punctuation

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

# Patterns

## Identifiers for Characters in Patterns

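Character types such as digits or whitespace have special codes (identifiers) that represent them in a pattern. Because these codes start with a backslash, write patterns as raw strings by prefixing them with r, so that Python does not treat the backslash as the start of a string escape: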

    r'mypattern'
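A short sketch of why the r prefix matters, using \b as the example: in an ordinary string Python turns \b into a backspace character before the regex engine ever sees it. The sample text is made up for illustration.

```python
import re

plain = '\b'   # one character: a backspace
raw = r'\b'    # two characters: a backslash followed by 'b'
print(len(plain), len(raw))  # 1 2

text = "cat concat cat"
with_raw = re.findall(r'\bcat\b', text)     # regex word boundaries
without_raw = re.findall('\bcat\b', text)   # looks for literal backspaces

print(with_raw)     # ['cat', 'cat']
print(without_raw)  # []
```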

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>Whitespace</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-space</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

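\b matches a word boundary: the zero-width position between a word character (\w) and a non-word character, or the start or end of the string. It does not consume a space or any other character.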
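As an illustration, \b can restrict a search to whole words. The sentence below is invented for this sketch:

```python
import re

sentence = "phone, telephone and phones"

loose = re.findall(r'phone', sentence)       # matches inside longer words too
whole = re.findall(r'\bphone\b', sentence)   # only the standalone word

print(loose)  # ['phone', 'phone', 'phone']
print(whole)  # ['phone']
```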

In [29]:
text = "My telephone number is 353-545-1234"

In [30]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [31]:
phone.group()

'353-545-1234'

In [32]:
string="Hello I live on street 9 which is near street 23"

print(re.findall(r"\d",string))


['9', '2', '3']


In [33]:
string="Hello I live on street 9 which is near street 23"

print(re.findall(r"\d\d",string))


['23']


In [34]:
text1 = "abcd1234"

In [35]:
phone1 = re.search(r'\D\D\D\D',text1)

In [36]:
phone1.group()

'abcd'

In [None]:
text2 = "!@#$%^&*1234abcdABCD"

In [None]:
phone2 = re.search(r'\W\W\W\W',text2)

In [None]:
phone2.group()

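Note the repetition of \d in the patterns above. Quantifiers, covered in the next section, express the same thing more compactly.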

## Quantifiers


<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>
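The rows of this table can be checked with re.fullmatch(), which succeeds only if the whole string fits the pattern. A sketch, where matches_fully is a hypothetical helper name and two of the candidates are deliberately wrong to show a quantifier rejecting input:

```python
import re

def matches_fully(pattern, candidate):
    """True when the entire candidate string fits the pattern."""
    return re.fullmatch(pattern, candidate) is not None

results = {
    "plus": matches_fully(r'\w-\w+', "A-b1_1"),        # one or more after the dash
    "exact": matches_fully(r'\D{3}', "abcd"),           # four characters, not three
    "range": matches_fully(r'\d{2,4}', "123"),          # between 2 and 4 digits
    "at_least": matches_fully(r'\w{3,}', "ab"),         # too short
    "star": matches_fully(r'A*B*C*', "AAACC"),          # zero B's is allowed
    "optional": matches_fully(r'plurals?', "plural"),   # trailing s may be absent
}
print(results)
# {'plus': True, 'exact': False, 'range': True, 'at_least': False, 'star': True, 'optional': True}
```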

In [None]:
import re

In [None]:
text = "My telephone number is 12453-555-1234"
r = re.search(r'\d{5}-\d{3}-\d{4}',text)
r.group()

In [None]:
r = re.search(r'\d{3,5}-\d{2,3}-\d{3,}',text)
r.group()

In [None]:
string="Hello I live on street 9 which is near street 23"
print(re.findall(r"\d+",string))

In [None]:
string="get out Of my house !!!"
print(re.findall(r"\w+",string))

In [None]:
string="get out house !!!"
print(re.findall(r"\w{2}",string))

In [None]:
string="get out Of my house !!!"
print(re.findall(r"\w{2,}",string))

In [None]:
string="name and names are 23 blah blah"

print(re.findall(r"\w+es?",string))

## Groups 

In [None]:
import re

In [None]:
text

In [None]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [None]:
results = re.search(phone_pattern,text)

In [None]:
results.group()

In [None]:
results.group(1)

In [None]:
results.group(2)

In [None]:
results.group(3)

In [None]:
# The pattern defines only three groups, so this call would raise IndexError:
#results.group(4)
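All captured groups can also be retrieved at once with .groups(). The sketch below reuses the phone pattern on a sample number and shows what happens when a group that does not exist is requested; the variable names are chosen for illustration.

```python
import re

phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
results = phone_pattern.search("My telephone number is 353-545-1234")

# groups() returns every captured group as a tuple, in order.
area, prefix, line = results.groups()
print(area, prefix, line)  # 353 545 1234

# Asking for a group the pattern does not define raises IndexError.
try:
    results.group(4)
    missing_group_error = False
except IndexError:
    missing_group_error = True
print(missing_group_error)  # True
```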

## Additional Regex Syntax

### Or operator |

Use the pipe operator to express an **or** between alternatives:

In [None]:
res = re.search(r"man|woman","This man woman was here.")
print(res)
print(res.group())

In [None]:
res2 = re.findall(r"man|woman","This man woman was here.")
print(res2)

In [None]:
res1= re.search(r"man|woman","This woman was here.")
print(res1)
print(res1.group())
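One detail worth knowing: at a given position the alternatives are tried from left to right, and the first one that succeeds wins, even if a later alternative would match more text. A small sketch with made-up alternatives:

```python
import re

short_first = re.search(r'cat|catfish', 'catfish').group()
long_first = re.search(r'catfish|cat', 'catfish').group()

print(short_first)  # cat
print(long_first)   # catfish
```

When one alternative is a prefix of another, listing the longer one first is usually what is wanted.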

### The Wildcard Character

Use a "wildcard" as a placeholder that matches any single character in that position (by default, any character except a newline). The wildcard is a simple period **.**

In [None]:
re.findall(r".at","The zat in the hat sat here.")

In [None]:
re.findall(r"...at","The bat went splat")

In [None]:
# One or more non-whitespace that ends with 'at'
re.findall(r'\S+at',"The bat went splat")

In [None]:
string='''I am Hussain Mujtaba and M12  !a
'''
print(re.findall(r"M.....a",string))
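Two properties of the wildcard are easy to miss: it does not match a newline unless the re.DOTALL flag is given, and a literal period has to be escaped as \. in the pattern. A sketch with sample strings invented for illustration:

```python
import re

two_lines = "cat\nat"

default = re.findall(r'.at', two_lines)                   # '.' skips the newline
dotall = re.findall(r'.at', two_lines, flags=re.DOTALL)   # '.' matches it too

# An escaped period matches only an actual '.' character.
literal = re.findall(r'here\.', "The zat in the hat sat here.")

print(default)  # ['cat']
print(dotall)   # ['cat', '\nat']
print(literal)  # ['here.']
```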

### Starts With and Ends With

Use the **^** to signal starts with, and the **$** to signal ends with:

In [None]:
# Ends with a number
re.findall(r'\d$','This ends 2 with a number 2')

In [None]:
# Starts with a number
re.findall(r'^\d','1 is the loneliest number.')
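Combining both anchors gives a simple way to test a whole string, and the re.MULTILINE flag makes ^ apply at the start of every line rather than only at the start of the text. A sketch with sample inputs made up for this example:

```python
import re

# Both anchors together: the string must consist of digits only.
only_digits = re.search(r'^\d+$', "12345") is not None
mixed = re.search(r'^\d+$', "123abc") is not None

lines = "1 apple\n2 pears\nno number"
first_only = re.findall(r'^\d', lines)                      # start of text only
per_line = re.findall(r'^\d', lines, flags=re.MULTILINE)    # start of each line

print(only_digits, mixed)  # True False
print(first_only)          # ['1']
print(per_line)            # ['1', '2']
```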

### Exclusion

To exclude characters, use the **^** symbol as the first character inside a set of brackets [].

The set then matches any single character that is not listed inside the brackets.

In [None]:
phrase = "there are 3 numbers 34 inside 345 this sentence."

In [None]:
re.findall(r'[^\d]',phrase)

To keep runs of consecutive non-digit characters together, instead of getting one character per match, add a + sign:

In [None]:
re.findall(r'[^\d]+',phrase)

We can use this to remove punctuation from a sentence.

In [None]:
test_phrase = 'This is a string #@  $ ! But it has punctuation. How can we remove it?'

In [None]:
re.findall(r'[^!.?@$# ]+',test_phrase)

In [None]:
clean = ' '.join(re.findall(r'[^!.$@#? ]+',test_phrase))

In [None]:
clean
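Listing punctuation marks by hand is easy to get wrong. As an alternative sketch, the set can be built from string.punctuation, with re.escape() protecting characters such as ] and \ that are special inside brackets, and re.sub() deleting the matches. The variable names here are chosen for illustration.

```python
import re
import string

test_phrase = 'This is a string #@  $ ! But it has punctuation. How can we remove it?'

# Build a character set containing every ASCII punctuation mark.
punct_class = '[' + re.escape(string.punctuation) + ']'

no_punct = re.sub(punct_class, '', test_phrase)
# Collapse the leftover runs of spaces.
clean = ' '.join(no_punct.split())

print(clean)
# This is a string But it has punctuation How can we remove it
```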

In [None]:
# Inclusion
string='''123345678'''
print(re.findall(r"[123]",string))

In [None]:
#exclusion
string='''123345678'''

print(re.findall(r"[^123]",string))

In [None]:
string='''
hello I am a student from India
'''
print(re.findall(r"[A-Z][a-z]+",string))

In [None]:
string='''
hello I am a student from India
'''
print(re.findall(r"[A-Z][a-z]*",string))

## Brackets for Grouping

In [None]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are 123-123'

In [None]:
re.findall(r'[\w]+-[\w]+',text)

## Parentheses for Multiple Options

In [None]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"

In [None]:
res = re.search(r'cat(fish|nap|claw)',text)
res.group()

In [None]:
res1= re.search(r'cat(fish|nap|claw)',texttwo)
print(res1.group())
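Parentheses also capture, which changes what re.findall() returns: when the pattern contains a group, the list holds the group contents rather than the whole match. A sketch of the difference, using the non-capturing form (?:...) and a sample sentence invented for illustration:

```python
import re

text = "catfish, catnap and catclaw"

# Capturing group: findall returns only what the group matched.
suffixes = re.findall(r'cat(fish|nap|claw)', text)

# Non-capturing group: findall returns the full matches.
words = re.findall(r'cat(?:fish|nap|claw)', text)

print(suffixes)  # ['fish', 'nap', 'claw']
print(words)     # ['catfish', 'catnap', 'catclaw']
```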

### re.split
The re.split function returns a list of the pieces of the string that lie between matches of the pattern.

In [None]:
import re

In [None]:
s_nums = 'one1two22three333four'
print(re.split(r'\d+', s_nums))
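Two variations are sketched here on the same input: maxsplit limits the number of splits, and wrapping the pattern in a capturing group keeps the separators in the result.

```python
import re

s_nums = 'one1two22three333four'

parts = re.split(r'\d+', s_nums)
first_only = re.split(r'\d+', s_nums, maxsplit=1)   # split at the first match only
with_seps = re.split(r'(\d+)', s_nums)              # capturing group keeps separators

print(parts)       # ['one', 'two', 'three', 'four']
print(first_only)  # ['one', 'two22three333four']
print(with_seps)   # ['one', '1', 'two', '22', 'three', '333', 'four']
```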

### re.sub
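The re.sub function replaces each match with the text of your choice. The related re.subn function does the same but returns a tuple of the new string and the number of replacements made.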

In [None]:
s = 'aaa@aaa.com bbb@bbb.com ccc@ccc.com @ddd.com'
print(re.sub(r'[a-z]+@', 'xxx@', s))

In [None]:
s = 'aaa@aaa.com bbb@bbb.com ccc@ccc.com'
# count=1 limits the substitution to the first match; subn also reports how many were made
print(re.subn(r'[a-z]*@', 'xxx@', s, count=1))
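The replacement does not have to be fixed text. As a sketch, it can refer back to captured groups with \1, \2 and so on, or it can be a function that receives each Match object. The addresses below are made up for illustration.

```python
import re

s = 'alice@example.com bob@test.com'

# Backreference: keep the first letter of the local part, mask the rest.
masked = re.sub(r'([a-z])[a-z]*@', r'\1***@', s)

# Function replacement: upper-case whatever follows the @ sign.
shouted = re.sub(r'@([a-z.]+)', lambda m: '@' + m.group(1).upper(), s)

# subn with count=1: only the first address is changed.
first_only = re.subn(r'[a-z]+@', 'xxx@', s, count=1)

print(masked)      # a***@example.com b***@test.com
print(shouted)     # alice@EXAMPLE.COM bob@TEST.COM
print(first_only)  # ('xxx@example.com bob@test.com', 1)
```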