# Regular Expressions

Regular expressions (regex for short) let you search strings with flexible pattern rules instead of fixed text.

Python handles regular expressions through its built-in **re** module.

## Search for the Basic Patterns

In [8]:
text = "The agent's phone number is 408-555-1234. Call soon!"

In [9]:
# Check whether the substring 'phone' appears in the text
'phone' in text

True

In [10]:
#Importing the regex library
import re

In [11]:
pattern = 'phone'

In [12]:
# If the pattern is found, re.search returns a Match object
match = re.search(pattern,text)
match

<re.Match object; span=(12, 17), match='phone'>

In [13]:
pattern1 = 'NOT IN TEXT'

In [14]:
# If the pattern is not found, re.search returns None (so the notebook shows no output)
re.search(pattern1,text)
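
Because `re.search` returns `None` when nothing matches, calling `.span()` or `.group()` directly on the result would raise an `AttributeError`. A minimal sketch of guarding against that, assuming a hypothetical helper named `find_span` (not part of `re`):

```python
import re

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

text = "The agent's phone number is 408-555-1234. Call soon!"
print(find_span('phone', text))        # (12, 17)
print(find_span('NOT IN TEXT', text))  # None
```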

In [15]:
# The Match object also stores the start and end indices of the match
match.span()

(12, 17)

In [16]:
match.start()

12

In [17]:
match.end()

17

If the pattern occurs more than once:

In [18]:
text = "my phone is a new phone"

In [19]:
match = re.search("phone",text)

In [20]:
match.span()

(3, 8)

Notice that re.search() only matches the first instance. To get a list of all matches, use re.findall():

In [21]:
matches = re.findall("phone",text)

In [22]:
matches

['phone', 'phone']

In [23]:
len(matches)

2

To get the actual Match objects, iterate with re.finditer():

In [24]:
for match in re.finditer("phone",text):
    print(match.span())

(3, 8)
(18, 23)


To get the actual text that matched, use the .group() method. Here `match` is the last Match object from the loop above.

In [25]:
match.group()

'phone'
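
Since every item yielded by `re.finditer` is a Match object, the matched text and its position can be collected together. A small sketch, where the variable name `found` is only illustrative:

```python
import re

text = "my phone is a new phone"

# Collect the matched text plus its start and end index for every occurrence
found = [(m.group(), m.start(), m.end()) for m in re.finditer("phone", text)]
print(found)  # [('phone', 3, 8), ('phone', 18, 23)]
```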

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric (letter, digit, or underscore)</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >*</span></td><td>Occurs zero or more times</td><td>A*B*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

In [26]:
text = "My telephone number is 408-555-1234"

In [27]:
phone = re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [28]:
phone.group()

'408-555-1234'

In [29]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>
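
Quantifiers are greedy and do not respect word boundaries on their own, which matters once numbers of different lengths appear in the same text. The sketch below uses a made-up `sample` string to show `{2,4}` with and without `\b` word boundaries, plus the optional `?`:

```python
import re

sample = "Order 7, order 42, order 2024 and order 123456"

# {2,4} grabs as many digits as it can, up to four, then continues scanning
greedy = re.findall(r'\d{2,4}', sample)
print(greedy)   # ['42', '2024', '1234', '56']

# \b on both sides only accepts numbers that are 2 to 4 digits long in total
bounded = re.findall(r'\b\d{2,4}\b', sample)
print(bounded)  # ['42', '2024']

# ? makes the preceding character optional
plurals = re.findall(r'plurals?', 'plural and plurals')
print(plurals)  # ['plural', 'plurals']
```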

### Groups

In [30]:
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

In [31]:
results = re.search(phone_pattern,text)

In [32]:
# The entire match
# Note that group numbering starts at 1; group(0), or group() with no argument, returns the whole match
results.group()

'408-555-1234'

In [33]:
results.group(1)

'408'

In [34]:
results.group(2)

'555'

In [35]:
results.group(3)

'1234'
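
Groups can also be retrieved all at once, or given names. The following is a sketch rather than the only way to do it; the group names `area`, `prefix`, and `line` are chosen here purely for illustration:

```python
import re

text = "My telephone number is 408-555-1234"

# (?P<name>...) gives a group a name in addition to its number
named_pattern = re.compile(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})')
results = named_pattern.search(text)

print(results.groups())       # ('408', '555', '1234')
print(results.group('area'))  # '408'
print(results.groupdict())    # {'area': '408', 'prefix': '555', 'line': '1234'}
```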

### Additional Regex Syntax

* Use the | (pipe) operator to match one pattern or another

In [36]:
re.search(r"man|woman","This man was here.")

<re.Match object; span=(5, 8), match='man'>

In [37]:
re.search(r"man|woman","This woman was here.")

<re.Match object; span=(5, 10), match='woman'>

### Wildcard Character

In [38]:
re.findall(r".at" , "The cat in the hat sat there")

['cat', 'hat', 'sat']

In [39]:
# One or more non-whitespace characters followed by 'at'
re.findall(r'\S+at',"The bat went splat")

['bat', 'splat']

### Starts With and Ends With

We can use the **^** to signal starts with, and the **$** to signal ends with:

In [40]:
# Ends with a number
re.findall(r'\d$','This ends with a number 2')

['2']
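
The same idea works at the other end of the string with **^**. A short sketch with made-up example sentences:

```python
import re

# ^ anchors the pattern to the start of the string
starts = re.findall(r'^\d', '1 is the loneliest number')
print(starts)      # ['1']

# The digit is not at the start here, so nothing matches
not_first = re.findall(r'^\d', 'The number is 1')
print(not_first)   # []
```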

### Exclusion

To exclude characters, we can use the **^** symbol in conjunction with a set of brackets **[]**. Any character listed after the **^** inside the brackets is excluded from the match. For example:

In [41]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [42]:
re.findall(r'[^\d]',phrase)

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

To keep runs of characters together, add a + sign after the brackets:

In [43]:
re.findall(r'[^\d]+',phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

**We can use this to remove punctuation from a sentence.**

In [44]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [45]:
# Match runs of characters that are not '!', '.', '?' or a space
re.findall('[^!.? ]+',test_phrase)

['This',
 'is',
 'a',
 'string',
 'But',
 'it',
 'has',
 'punctuation',
 'How',
 'can',
 'we',
 'remove',
 'it']

In [46]:
clean = ' '.join(re.findall('[^!.? ]+',test_phrase))

In [47]:
clean

'This is a string But it has punctuation How can we remove it'
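
An alternative to splitting and re-joining is to delete the unwanted characters directly with `re.sub`. This is a sketch of that different technique, not the approach used in the cells above; the name `clean_sub` is illustrative:

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each '!', '.' or '?' with an empty string; spaces are left untouched
clean_sub = re.sub(r'[!.?]', '', test_phrase)
print(clean_sub)
```

For this sentence the result is the same string as the join-based version, because every punctuation mark is followed by a single space or ends the string.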

In [48]:
text = 'Only find the hyphen-words in this sentence. But you do not know how long-ish they are'

In [49]:
re.findall(r'[\w]+-[\w]+',text)

['hyphen-words', 'long-ish']

In [50]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [51]:
re.search(r'cat(fish|nap|claw)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [52]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [53]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
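
One caveat when combining parentheses with `re.findall`: if the pattern contains a capturing group, `findall` returns the group contents rather than the whole match. A sketch of the difference, using a made-up sentence and a non-capturing group `(?:...)`:

```python
import re

text = 'catfish, catnap and caterpillar'

# Capturing group: findall returns only what the group matched
suffixes = re.findall(r'cat(fish|nap|claw)', text)
print(suffixes)     # ['fish', 'nap']

# Non-capturing group: findall returns the full matches
whole_words = re.findall(r'cat(?:fish|nap|claw)', text)
print(whole_words)  # ['catfish', 'catnap']
```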

## f-Strings
Print an f-string that displays `NLP stands for Natural Language Processing` using the variables provided.

In [54]:
abbr = 'NLP'
full_text = 'Natural Language Processing'

# Enter your code here:
print(f'{abbr} stands for {full_text}')

NLP stands for Natural Language Processing


## Files
**Create a file in the current working directory called `contacts.txt` by running the cell below:**

In [55]:
%%writefile contacts.txt
First_Name Last_Name, Title, Extension, Email

Overwriting contacts.txt


**Open the file and use .read() to save the contents of the file to a string called `fields`.  Make sure the file is closed at the end.**

In [56]:
# Write your code here:
with open('contacts.txt') as c:
    fields = c.read()

    
# Run fields to see the contents of contacts.txt:
fields

'First_Name Last_Name, Title, Extension, Email\n'

## Working with PDF Files
**Use PyPDF2 to open the file `Business_Proposal.pdf`. Extract the text of page 2.**

In [57]:
# Perform import
import PyPDF2

# Open the file as a binary object
f = open('Business_Proposal.pdf','rb')

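# Use PyPDF2 to read the text of the file
# (PdfFileReader, getPage and extractText are the PyPDF2 1.x/2.x names;
# PyPDF2 3.x uses PdfReader, .pages[1] and .extract_text() instead)
pdf_reader = PyPDF2.PdfFileReader(f)


# Get the text from page 2, which is index 1 (CHALLENGE: Do this in one step!)
page_two_text = pdf_reader.getPage(1).extractText()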



# Close the file
f.close()

# Print the contents of page_two_text
print(page_two_text)

AUTHORS:
 
Amy Baker, Finance Chair, x345, abaker@ourcompany.com
 
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com
 
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com
 



**Open the file `contacts.txt` in append mode. Add the text of page 2 from above to `contacts.txt`.**

**Also remove the word "AUTHORS:"**

In [58]:
# Simple Solution:
with open('contacts.txt','a+') as c:
    c.write(page_two_text)
    c.seek(0)
    print(c.read())

First_Name Last_Name, Title, Extension, Email
AUTHORS:
 
Amy Baker, Finance Chair, x345, abaker@ourcompany.com
 
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com
 
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com
 



In [59]:
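# Remove 'AUTHORS:' by skipping its 8 characters. Re-run the %%writefile cell above first to start from
# an unmodified contacts.txt; otherwise the text is appended after the previous write, as in the output here: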
with open('contacts.txt','a+') as c:
    c.write(page_two_text[8:])
    c.seek(0)
    print(c.read())

First_Name Last_Name, Title, Extension, Email
AUTHORS:
 
Amy Baker, Finance Chair, x345, abaker@ourcompany.com
 
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com
 
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com
 

 
Amy Baker, Finance Chair, x345, abaker@ourcompany.com
 
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com
 
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com
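
The repeated block in the output comes from how append mode works: every write in `'a+'` mode lands after whatever the file already holds. A self-contained sketch of that behaviour, using a throwaway file named `contacts_demo.txt` (a hypothetical stand-in for `contacts.txt`) in a temporary directory:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'contacts_demo.txt')

    # 'w' truncates the file, similar to re-running the %%writefile cell
    with open(path, 'w') as f:
        f.write('header\n')

    # Each 'a+' write is added at the end of the existing contents
    for chunk in ('first append\n', 'second append\n'):
        with open(path, 'a+') as f:
            f.write(chunk)

    with open(path) as f:
        contents = f.read()

print(contents)
```

Both appended chunks are present after the header, which is why resetting the file before the second exercise cell gives a clean result.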
 



## Regular Expressions
**Using the `page_two_text` variable created above, extract any email addresses that were contained in the file `Business_Proposal.pdf`.**

In [60]:
import re

# Enter your regex pattern here. This may take several tries!
pattern = r'\w+@\w+\.\w{3}'

re.findall(pattern, page_two_text)

['abaker@ourcompany.com',
 'cdonaldson@ourcompany.com',
 'efreeman@ourcompany.com']
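
In an email pattern the dot before the domain suffix needs escaping, because an unescaped `.` matches any character. The sketch below compares the two forms on a made-up `sample` string; the names `loose` and `strict` are illustrative:

```python
import re

loose = r'\w+@\w+.\w{3}'    # unescaped . matches any character
strict = r'\w+@\w+\.\w{3}'  # \. matches only a literal dot

sample = 'good@ourcompany.com bad@ourcompany_com'

loose_matches = re.findall(loose, sample)
strict_matches = re.findall(strict, sample)

print(loose_matches)   # ['good@ourcompany.com', 'bad@ourcompany_com']
print(strict_matches)  # ['good@ourcompany.com']
```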