## **Topic 1. Working with Text Files**

In this section we'll cover:

- Working with f-strings (formatted string literals) to format printed text
- Working with files: opening, reading, writing and appending text files

**Formatted String Literals (f-strings)**
Introduced in Python 3.6, f-strings offer several benefits over the older .format() string method.
For one, you can bring outside variables directly into the string rather than passing them through as keyword arguments:

In [1]:
name='gururaja'
print("his name is {var}".format(var=name))

his name is gururaja


In [2]:
print(f'his name is {name}')

his name is gururaja


Pass !r to get the repr() form of the value, which keeps the quotes around a string:

In [4]:
print(f'his name is {name!r}')

his name is 'gururaja'


In [5]:
var=123

In [6]:
print(f'his name is {var!r}')

his name is 123
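To see the difference between the default conversion and `!r` side by side, here is a minimal sketch (the variable names are only illustrative):

```python
name = 'gururaja'
number = 123

plain = f'{name}'          # default conversion uses str(): no quotes
quoted = f'{name!r}'       # !r uses repr(): the quotes are part of the result
num_repr = f'{number!r}'   # for an int, repr() and str() look the same

print(plain)
print(quoted)
print(num_repr)
```

This is why `!r` changes the output for the string but not for the integer.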


In [11]:
d = {'a':123,'b':456}

In [12]:
print(f'address : d['a'] Main Street')

SyntaxError: ignored

In [13]:
#mind the quotes: double quotes outside, single quotes inside the braces
print(f"address : {d['a']} Main Street")

address : 123 Main Street


## **Minimum Widths, Alignment and Padding**
You can pass arguments inside a nested set of curly braces to set a minimum width for the field, the alignment and even padding characters.

In [14]:
library = [('Author', 'Topic', 'Pages'), ('Twain', 'Rafting', 601), ('Feynman', 'Physics', 95), ('Hamilton', 'Mythology', 144)]

In [17]:
for author in library:
  print(author)


('Author', 'Topic', 'Pages')
('Twain', 'Rafting', 601)
('Feynman', 'Physics', 95)
('Hamilton', 'Mythology', 144)


In [18]:
for book in library:
  print (f'{book[0]:{10}} {book[1]:{10}} {book[2]:{10}}')

Author     Topic      Pages     
Twain      Rafting           601
Feynman    Physics            95
Hamilton   Mythology         144


In [21]:
for book in library:
  print (f'{book[0]:{10}} {book[1]:{10}} {book[2]:.>{10}}')

Author     Topic      .....Pages
Twain      Rafting    .......601
Feynman    Physics    ........95
Hamilton   Mythology  .......144
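The alignment characters can be tried on a single row. The sketch below assumes a width of 10 and uses `<`, `^` and `>` for left, center and right alignment, with an optional fill character placed in front of the alignment character:

```python
row = ('Twain', 'Rafting', 601)

left = f'{row[0]:<10}|'     # left-align in a field of width 10
center = f'{row[1]:^10}|'   # center in a field of width 10
right = f'{row[2]:->10}|'   # right-align and pad with '-' instead of spaces

print(left)
print(center)
print(right)
```

The trailing `|` is only there to make the field width visible when printed.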


## Date Formatting

In [22]:
from datetime import datetime

In [23]:
today=datetime(year=2023, month=12, day=28)
print(today)

2023-12-28 00:00:00


In [26]:
print(f'{today:%B %d, %Y}')

December 28, 2023
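The part after the colon is a strftime-style format string, so any of the usual date codes can be combined. A small sketch with two common layouts (the variable names are illustrative):

```python
from datetime import datetime

today = datetime(year=2023, month=12, day=28)

long_form = f'{today:%B %d, %Y}'         # full month name, day, four-digit year
iso_form = f'{today:%Y-%m-%d}'           # year-month-day
via_method = today.strftime('%B %d, %Y') # same codes through strftime()

print(long_form)
print(iso_form)
```

The f-string and `strftime()` give the same text for the same codes.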


# Dealing with Text Files

Python uses file objects to interact with external files on your computer. These file objects can be any sort of file you have on your computer, whether it be an audio file, a text file, emails, Excel documents, etc. Note: You will probably need to install certain libraries or modules to interact with those various file types, but they are easily available. (We will cover downloading modules later on in the course).

Python has a built-in open function that allows us to open and play with basic file types. First we will need a file though. We're going to use some IPython magic to create a text file!


## **Creating a File with IPython**
This function is specific to jupyter notebooks! Alternatively, quickly create a simple .txt file with Sublime text editor.

In [27]:
%%writefile test.txt
Hello, this is a quick test file.
This is the second line of the file.

Writing test.txt


## Python Opening a File
Know Your File's Location
It's easy to get an error on this step:

In [28]:
myfile=open("whoops.txt")

FileNotFoundError: ignored

In [29]:
pwd

'/content'

In [30]:
#open the test.txt file we created
my_file=open('test.txt',mode='r')

In [31]:
my_file

<_io.TextIOWrapper name='test.txt' mode='r' encoding='UTF-8'>

my_file is now an open file object held in memory. We'll perform some reading and writing exercises, and then we have to close the file to release the resources it holds.

.read() and .seek()

In [33]:
note=my_file.read()

In [34]:
note

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

In [37]:
my_file.read()

''

In [38]:
my_file.seek(0)

0

In [39]:
my_file.read()

'Hello, this is a quick test file.\nThis is the second line of the file.\n'

In [41]:
my_file.seek(0)

0
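The cursor behaviour can be reproduced outside the notebook. This is a minimal sketch that assumes a throwaway file in a temporary directory (`demo.txt` is a made-up name), so no real files are touched:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name
    with open(path, 'w') as f:
        f.write('first line\nsecond line\n')

    f = open(path)
    first_read = f.read()     # reads everything; the cursor is now at the end
    second_read = f.read()    # nothing left to read, so this is ''
    f.seek(0)                 # move the cursor back to the start
    third_read = f.read()     # the full text again
    f.close()

print(repr(second_read))
print(first_read == third_read)
```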

.readlines()

The readlines method returns all the lines of a file as a list of strings. Use caution with large files, since everything will be held in memory. We will learn how to iterate over large files later in the course.

In [42]:
my_file.readlines()

['Hello, this is a quick test file.\n',
 'This is the second line of the file.\n']

## Writing to a **File**

By default, the open() function will only allow us to **read** the file. We need to pass the argument 'w' to write over the file. Be careful: opening an existing file with 'w' or 'w+' empties it immediately. For example:

In [43]:
my_file=open('test.txt','w+')

In [44]:
my_file.write('This is a new line')

18

In [45]:
my_file.close()
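To make the overwrite behaviour concrete, here is a small sketch that assumes a scratch file in a temporary directory (`demo.txt` is a made-up name):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name
    with open(path, 'w') as f:
        f.write('original content\n')

    # Reopening with 'w' empties the file before anything is written
    with open(path, 'w') as f:
        chars_written = f.write('This is a new line')

    with open(path) as f:
        content = f.read()

print(chars_written)
print(content)
```

The number returned by `write()` is the count of characters written, which is what the notebook displays under the cell.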

## **Appending to a File**
Passing the argument 'a' opens the file and puts the pointer at the end, so anything written is appended. Like 'w+', 'a+' lets us read and write to a file. If the file does not exist, one will be created.

In [46]:
my_file=open("test.txt",'a+')

In [47]:
my_file.write("\n This is a new line appended to tst.txt")

40

In [48]:
my_file.write ("\n This is another line added" )

28

In [51]:
my_file.read()

''

In [52]:
my_file.seek(0)

0
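The empty string returned by `read()` right after appending is expected, because the cursor sits at the end of the file. A minimal sketch, again assuming a scratch file with a made-up name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name
    with open(path, 'w') as f:
        f.write('line one')

    with open(path, 'a+') as f:
        f.write('\nline two')   # appended after the existing text
        at_end = f.read()       # cursor is at the end, so nothing is returned
        f.seek(0)               # go back to the start
        whole = f.read()        # now the full contents come back

print(repr(at_end))
print(whole)
```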

***Appending with %%writefile***
Jupyter notebook users can do the same thing using IPython cell magic:

In [54]:
%%writefile -a test.txt
this is dditional text being added to the initial test.txt file
this may be probaby 4th line in the test

Appending to test.txt


Add a blank space if you want the first line to begin on its own line, as Jupyter won't recognize escape sequences like \n

In [55]:
note=open('test.txt','r')

In [58]:
note.read()

'This is a new line\n This is a new line appended to tst.txt\n This is another line addedthis is dditional text being added to the initial test.txt file\nthis may be probaby 4th line in the test\n'

Aliases and Context Managers

You can assign temporary variable names as aliases, and manage the opening and closing of files automatically using a context manager:

In [68]:
with open('test.txt','r') as txt:
  first_line=txt.readlines()
print(first_line)

['This is a new line\n', ' This is a new line appended to tst.txt\n', ' This is another line addedthis is dditional text being added to the initial test.txt file\n', 'this may be probaby 4th line in the test\n']


In [69]:
with open('test.txt','r') as txt:
  first_line=txt.readlines()[0]
print(first_line)

This is a new line



In [71]:
with note as txt: #raises ValueError if note has already been closed
  first_line=txt.readlines()[0]
print(first_line)

ValueError: ignored

Note that the with ... as ...: context manager automatically closed test.txt after assigning the first line of text to first_line:

In [74]:
txt.read() #throws error

ValueError: ignored
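The `closed` attribute of a file object shows what the context manager did. A short sketch, assuming a scratch file with a made-up name:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name
    with open(path, 'w') as f:
        f.write('first line\nsecond line\n')

    with open(path) as txt:
        first_line = txt.readline()
        open_inside = txt.closed    # False: still open inside the block

    closed_after = txt.closed       # True: closed as soon as the block ends

    try:
        txt.read()                  # reading a closed file raises ValueError
    except ValueError as err:
        error_message = str(err)

print(open_inside, closed_after)
print(error_message)
```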

# **Iterating through a File**

In [75]:
with open('test.txt','r') as txt:
  for line in txt:
    print(line,end="") #end="" stops print from adding a second line break after each line

This is a new line
 This is a new line appended to tst.txt
 This is another line addedthis is dditional text being added to the initial test.txt file
this may be probaby 4th line in the test
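Looping over the file object reads one line at a time, so the whole file never has to sit in memory. As a sketch of how that can be used, the example below counts the lines and keeps the longest one (the file name and contents are made up):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo.txt')   # hypothetical file name
    with open(path, 'w') as f:
        f.write('short\na much longer line\nmid line\n')

    line_count = 0
    longest = ''
    with open(path) as txt:
        for line in txt:                    # one line per iteration
            line_count += 1
            stripped = line.rstrip('\n')    # drop the trailing line break
            if len(stripped) > len(longest):
                longest = stripped

print(line_count)
print(longest)
```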


## **Working with PDF Files**

Often you will have to deal with PDF files. There are many libraries in Python for working with PDFs, each with its pros and cons; a commonly used one is PyPDF2. You can install it with the command below (the name is case-sensitive, so make sure your capitalization matches):

pip install PyPDF2

Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, use a special encoding, are encrypted, or were created with a program that doesn't work well with PyPDF2 may not be readable. If you find yourself in this situation, try another PDF library, but keep in mind that it may not work either. The reason is that a PDF has many parameters and its settings can be quite non-standard; for example, text may be stored as an image rather than as encoded characters.

In this section we only use PyPDF2 to read the text from a PDF document; we won't grab images or other media files from a PDF.

## **Working with PyPDF2**
Let's begin by showing the basics of the PyPDF2 library.

In [76]:
#NOTE CAPITALISATION

In [77]:
pip install PyPDF2

Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1


In [78]:
import PyPDF2

## **Reading PDFs**
First we open a PDF, then create a reader object for it. Notice how we use the binary read mode, 'rb', instead of just 'r'.

In [79]:
#notice we read it as binary with rb

In [109]:
f=open("/content/US_Declaration.pdf", 'rb')

In [110]:
pdf_reader = PyPDF2.PdfReader(f)

In [111]:
print(len(pdf_reader.pages))

5


In [112]:
page_one=pdf_reader.pages[0] #get the first page of the pdf

In [113]:
page_one_text=page_one.extract_text() #then extract the text of page 1 into a variable

In [114]:
#show the extracted text of page one

In [115]:
page_one_text

"Declaration of Independence\nIN CONGRESS, July 4, 1776.  \nThe unanimous Declaration of the thirteen united States of America,  \nWhen in the Course of human events, it becomes necessary for one people to dissolve thepolitical bands which have connected them with another, and to assume among the powers of theearth, the separate and equal station to which the Laws of Nature and of Nature's God entitlethem, a decent respect to the opinions of mankind requires that they should declare the causeswhich impel them to the separation. We hold these truths to be self-evident, that all men are created equal, that they are endowed bytheir Creator with certain unalienable Rights, that among these are Life, Liberty and the pursuit\nof Happiness.— \x14That to secure these rights, Governments are instituted among Men, derivingtheir just powers from the consent of the governed,—  \x14That whenever any Form of Government\nbecomes destructive of these ends, it is the Right of the People to alter or to 

In [None]:
f.close()

## **Adding to PDFs**

We cannot write to PDFs the way we write to text files, because of the differences between Python's single string type and the variety of fonts, placements, and other parameters that a PDF can have.

What we can do is copy pages and append pages to the end.

In [117]:
f=open('/content/US_Declaration.pdf','rb')
pdf_reader=PyPDF2.PdfReader(f) #create a reader on the newly opened file

In [120]:
page_one=pdf_reader.pages[0] #get the first page of the pdf


In [122]:
pdf_writer=PyPDF2.PdfWriter()

In [None]:
pdf_writer.add_page(page_one)

In [126]:
pdf_output=open("Some_new_doc.pdf","wb")

In [127]:
pdf_writer.write(pdf_output)

(False, <_io.BufferedWriter name='Some_new_doc.pdf'>)

In [128]:
pdf_output.close()

In [None]:
f.close()

## **Simple Example**
Let's try to grab all the text from this PDF file

In [136]:
s=open("/content/Some_new_doc.pdf",'rb')

In [137]:
pdf_text=[0] #placeholder at index 0, so that page 1 sits at index 1

In [138]:
pdf_reader=PyPDF2.PdfReader(s)

In [147]:
for page in pdf_reader.pages: #getPage() was removed in PyPDF2 3.0, so iterate over .pages
  pdf_text.append(page.extract_text())
s.close() #close the file once, after every page has been read

In [148]:
print(pdf_text[1]) #text of the first page

## **Next up: Regular Expressions**

Regular Expressions (sometimes called regex for short) allow a user to search for strings using almost any sort of rule they can come up with. For example, finding all capital letters in a string, or finding a phone number in a document.

Regular expressions are notorious for their seemingly strange syntax. This strange syntax is a byproduct of their flexibility. Regular expressions have to be able to filter out any string pattern you can imagine, which is why they have a complex string pattern format.

Regular expressions are handled using Python's built-in re library. See the docs for more information.

Let's begin by explaining how to search for basic patterns in a string!

Searching for Basic Patterns


Let's imagine that we have the following string:

In [150]:
text = "The agent's phone number is 408-555-1234. Call soon!"

We'll start off by trying to find out if the string "phone" is inside the text string. We could quickly do this with the in operator, as the first two cells below show. But let's also show the format for regular expressions, because later on we will be searching for patterns that won't have such a simple solution.



In [151]:
'phone' in text

True

In [152]:
"Phone" in text

False

In [153]:
import re

In [154]:
pattern = 'phone'

In [155]:
re.search(pattern,text)

<re.Match object; span=(12, 17), match='phone'>

In [156]:
pattern="NOT IN TEXT"

In [157]:
re.search(pattern,text)

Now we've seen that re.search() takes the pattern, scans the text, and returns a Match object for the first match. If the pattern is not found, None is returned (in Jupyter Notebook this just means that nothing is output below the cell).
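Because a failed search gives back None, code that uses the result usually checks for it first. A minimal sketch, where `find_span` is a helper name invented for this example:

```python
import re

text = "The agent's phone number is 408-555-1234. Call soon!"

def find_span(pattern, text):
    """Return the (start, end) span of the first match, or None if there is none."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span()

found = find_span('phone', text)
missing = find_span('NOT IN TEXT', text)

print(found)
print(missing)
```

Calling `.span()` directly on a None result would raise an AttributeError, which is what the check avoids.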

Let's take a closer look at this Match object.

In [158]:
pattern = 'phone'

In [159]:
match=re.search(pattern,text)

In [160]:
match

<re.Match object; span=(12, 17), match='phone'>

In [161]:
match.span()

(12, 17)

In [162]:
match.start()

12

In [163]:
match.end()

17

In [164]:
#but what if the pattern occurs more than once?

In [165]:
text = "my phone is a new phone"

In [166]:
match=re.search("phone",text)

In [167]:
match.span()

(3, 8)

In [169]:
matches=re.findall("phone",text)

In [170]:
matches

['phone', 'phone']

In [171]:
len(matches)

2

In [174]:
for match in re.finditer ("phone",text):
  print(match.span())

(3, 8)
(18, 23)


In [175]:
match.group()

'phone'
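Each item produced by re.finditer() is a full Match object, so the position and the matched text can be collected together. A small sketch (the list names are illustrative):

```python
import re

text = "my phone is a new phone"

# Collect start, end and matched text for every occurrence
hits = [(m.start(), m.end(), m.group()) for m in re.finditer("phone", text)]

# The span can be used to slice the original string
slices = [text[start:end] for start, end, _ in hits]

print(hits)
print(slices)
```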

## **Patterns**
So far we've learned how to search for a basic string. What about more complex examples? Such as trying to find a telephone number in a large string of text? Or an email address?

We could just use the search method if we know the exact phone number or email, but what if we don't know it? We may know the general format, and we can use that along with regular expressions to search the document for strings that match a particular pattern.

This is where the syntax may appear strange at first, but take your time with this; often it's just a matter of looking up the pattern code.

Let's begin!

# Identifiers for Characters in Patterns
Character types such as digits or whitespace have different codes that represent them. You can use these to build up a pattern string. Notice how these make heavy use of the backslash \ . Because of this, when defining a pattern string for a regular expression we use the format:

r'mypattern'

Placing the r in front of the string tells Python that each \ in the pattern string is a literal backslash rather than the start of an escape sequence.

Below you can find a table of the most common identifiers:

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >\d</span></td><td>A digit</td><td>file_\d\d</td><td>file_25</td></tr>

<tr ><td><span >\w</span></td><td>Alphanumeric</td><td>\w-\w\w\w</td><td>A-b_1</td></tr>



<tr ><td><span >\s</span></td><td>White space</td><td>a\sb\sc</td><td>a b c</td></tr>



<tr ><td><span >\D</span></td><td>A non digit</td><td>\D\D\D</td><td>ABC</td></tr>

<tr ><td><span >\W</span></td><td>Non-alphanumeric</td><td>\W\W\W\W\W</td><td>*-+=)</td></tr>

<tr ><td><span >\S</span></td><td>Non-whitespace</td><td>\S\S\S\S</td><td>Yoyo</td></tr></table>
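The effect of the r prefix is easiest to see with a code that Python itself also understands. The sketch below uses `\b`, which is not in the table: in a normal string it is a backspace character, while in a regular expression it marks a word boundary. The sample strings are made up for the example.

```python
import re

plain = '\bcat\b'    # without r: each \b becomes a single backspace character
raw = r'\bcat\b'     # with r: backslash and b reach the regex engine as "word boundary"

print(len(plain), len(raw))

sample = 'cat concat cat.'
raw_hits = re.findall(raw, sample)      # whole-word matches only
plain_hits = re.findall(plain, sample)  # looks for backspace characters, finds nothing

# One of the identifiers from the table, written as a raw string
file_hits = re.findall(r'file_\d\d', 'file_25 file_ab file_07')

print(raw_hits)
print(plain_hits)
print(file_hits)
```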

In [176]:
text = "My telephone number is 408-555-1234"

In [178]:
phone=re.search(r'\d\d\d-\d\d\d-\d\d\d\d',text)

In [179]:
phone.group()

'408-555-1234'

Notice the repetition of \d. That is a bit of an annoyance, especially if we are looking for very long strings of numbers. Let's explore the possible quantifiers.

## **Quantifiers**
Now that we know the special character designations, we can use them along with quantifiers to define how many we expect.

<table ><tr><th>Character</th><th>Description</th><th>Example Pattern Code</th><th >Example Match</th></tr>

<tr ><td><span >+</span></td><td>Occurs one or more times</td><td>	Version \w-\w+</td><td>Version A-b1_1</td></tr>

<tr ><td><span >{3}</span></td><td>Occurs exactly 3 times</td><td>\D{3}</td><td>abc</td></tr>



<tr ><td><span >{2,4}</span></td><td>Occurs 2 to 4 times</td><td>\d{2,4}</td><td>123</td></tr>



<tr ><td><span >{3,}</span></td><td>Occurs 3 or more</td><td>\w{3,}</td><td>anycharacters</td></tr>

<tr ><td><span >\*</span></td><td>Occurs zero or more times</td><td>A\*B\*C*</td><td>AAACC</td></tr>

<tr ><td><span >?</span></td><td>Once or none</td><td>plurals?</td><td>plural</td></tr></table>

Let's rewrite our pattern using these quantifiers:

In [180]:
re.search(r'\d{3}-\d{3}-\d{4}',text)

<re.Match object; span=(23, 35), match='408-555-1234'>
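A few of the quantifiers from the table can be tried with re.findall(). The sample strings below are made up for this sketch:

```python
import re

# {2,4}: between two and four digits; a longer run is cut after four
two_to_four = re.findall(r'\d{2,4}', 'a 1 b 22 c 333 d 55555')

# ?: the final "s" is optional
optional_s = re.findall(r'plurals?', 'plural plurals')

# {3,}: three or more word characters
three_plus = re.findall(r'\w{3,}', 'an ant antler')

print(two_to_four)
print(optional_s)
print(three_plus)
```

Quantifiers are greedy by default, which is why the five-digit run yields a four-digit match and the leftover single digit is dropped.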

# Groups
What if we wanted to do two tasks: find phone numbers, and also quickly extract their area code (the first three digits)? We can use groups for any general task that involves grouping together regular expressions (so that we can later break them down).

Using the phone number example, we can separate groups of regular expressions using parentheses:

In [181]:
phone_pattern=re.compile(r"(\d{3})-(\d{3})-(\d{4})")

In [182]:
results=re.search(phone_pattern,text)

In [183]:
results.group()

'408-555-1234'

In [185]:
results.group(1)

'408'

In [186]:
results.group(2)

'555'

In [187]:
results.group(3)

'1234'
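Besides numbered access, a Match object can return all groups at once, and groups can be given names. In this sketch the names `area`, `exchange` and `line` are chosen only for illustration:

```python
import re

text = "My telephone number is 408-555-1234"

# (?P<name>...) gives a group a name in addition to its number
named = re.compile(r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<line>\d{4})")
result = named.search(text)

all_groups = result.groups()    # every group as a tuple
area = result.group('area')     # access by name
as_dict = result.groupdict()    # names mapped to matched text

print(all_groups)
print(area)
print(as_dict)
```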

## Additional Regex **Syntax**

Or operator |
Use the pipe operator to write an **or** statement. For example:

In [189]:
re.search(r"man|woman", "this woman was here.")

<re.Match object; span=(5, 10), match='woman'>
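With alternation, the options are tried from left to right at each position, and the first one that fits is used. A small sketch with made-up sample strings:

```python
import re

# Both options can match at different positions in the same string
both = re.findall(r"man|woman", "a man and a woman")

# When one option is a prefix of another, the order matters
short_first = re.search(r"cat|catfish", "catfish").group()
long_first = re.search(r"catfish|cat", "catfish").group()

print(both)
print(short_first)
print(long_first)
```

Putting the longer option first is a simple way to avoid stopping at the shorter prefix.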

## **The Wildcard Character**
Use a "wildcard" as a placement that will match any character placed there. You can use a simple period . for this. For example:

In [190]:
re.findall(r".at","the cat in the hat sat here")

['cat', 'hat', 'sat']

In [194]:
re.findall(r".at","the bat went splat")

['bat', 'lat']

Notice how only the last three letters of "splat" were matched. That is because each . stands for exactly one character, so we would need one . per wildcard letter, or one of the quantifiers described above.

Really we want whole words that end with "at", so we can match one or more non-whitespace characters before the "at":

In [195]:
re.findall(r'\S+at',"the bat went splat")

['bat', 'splat']
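One caveat: `\S+at` also matches inside longer words that merely contain "at". A possible refinement, shown here as a sketch, is the word-boundary code `\b` (not covered in the table above), which only matches at the edge of a word:

```python
import re

sample = "concatenate the bat"

# \S+at stops at the last "at" inside "concatenate"
loose = re.findall(r'\S+at', sample)

# \b on both sides requires the match to be a whole word ending in "at"
strict = re.findall(r'\b\w*at\b', sample)

print(loose)
print(strict)
```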

## **Starts With and Ends With**
We can use the ^ to signal starts with, and the $ to signal ends with:

In [196]:
#ends with a number

In [197]:
re.findall(r"\d$","This ends with a number 2")

['2']

In [198]:
re.findall(r"^\d","1 is a non lonliest number ")

['1']

Note that this is for the entire string, not individual words!
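If the text has several lines and each line should be checked, the re.MULTILINE flag makes ^ and $ apply to every line. A minimal sketch with a made-up three-line string:

```python
import re

lines = "1 apple\n2 pears\nno 3"

# Default: ^ only matches at the very start of the string
whole_string = re.findall(r'^\d', lines)

# re.MULTILINE: ^ matches at the start of every line
per_line = re.findall(r'^\d', lines, flags=re.MULTILINE)

print(whole_string)
print(per_line)
```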

### Exclusion

To exclude characters, we can use the **^** symbol in conjunction with a set of brackets **[]**. Anything inside the brackets is excluded. For example:

In [199]:
phrase = "there are 3 numbers 34 inside 5 this sentence."

In [208]:
re.findall(r'[^\d]',phrase) #finds all non-digits in the text using [^\d]

['t',
 'h',
 'e',
 'r',
 'e',
 ' ',
 'a',
 'r',
 'e',
 ' ',
 ' ',
 'n',
 'u',
 'm',
 'b',
 'e',
 'r',
 's',
 ' ',
 ' ',
 'i',
 'n',
 's',
 'i',
 'd',
 'e',
 ' ',
 ' ',
 't',
 'h',
 'i',
 's',
 ' ',
 's',
 'e',
 'n',
 't',
 'e',
 'n',
 'c',
 'e',
 '.']

In [203]:
re.findall(r"[\d]+",phrase)

['3', '34', '5']

In [204]:
re.findall(r"[^\d]+",phrase)

['there are ', ' numbers ', ' inside ', ' this sentence.']

In [205]:
test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

In [206]:
re.findall('[^!.?]+',test_phrase)

['This is a string', ' But it has punctuation', ' How can we remove it']

In [209]:
clean=' '.join(re.findall('[^?.!]+',test_phrase))

In [210]:
clean

'This is a string  But it has punctuation  How can we remove it'
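Joining the pieces leaves double spaces where the punctuation used to be. An alternative sketch uses re.sub(), which replaces every match with a given string (here the empty string):

```python
import re

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'

# Replace each of the three punctuation marks with nothing
cleaned = re.sub(r'[!.?]', '', test_phrase)

print(cleaned)
```

Here the brackets are used without ^, so they list the characters to match rather than the ones to exclude.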

## Brackets for Grouping

As we showed above we can use brackets to group together options, for example if we wanted to find hyphenated words:

In [211]:
text = 'Only find the hypen-words in this sentence. But you do not know how long-ish they are'

In [212]:
re.findall(r'[\w]+-[\w]+',text)

['hypen-words', 'long-ish']

## **Parentheses for Multiple Options**
If we have multiple options for matching, we can use parentheses to list out these options. For example:

In [213]:
# Find words that start with cat and end with one of these options: 'fish','nap', or 'claw'
text = 'Hello, would you like some catfish?'
texttwo = "Hello, would you like to take a catnap?"
textthree = "Hello, have you seen this caterpillar?"

In [214]:
re.search(r'cat(fish|claw|nap)',text)

<re.Match object; span=(27, 34), match='catfish'>

In [215]:
re.search(r'cat(fish|nap|claw)',texttwo)

<re.Match object; span=(32, 38), match='catnap'>

In [216]:
# None returned
re.search(r'cat(fish|nap|claw)',textthree)
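One thing to watch for: when a pattern contains parentheses, re.findall() returns only the text captured by the group, not the whole match. A non-capturing group, written `(?:...)`, keeps the grouping without that effect. A small sketch with a made-up sentence:

```python
import re

sentence = 'a catfish took a catnap'

# Capturing group: findall returns only what the parentheses matched
captured = re.findall(r'cat(fish|nap|claw)', sentence)

# Non-capturing group: findall returns the whole match
whole = re.findall(r'cat(?:fish|nap|claw)', sentence)

print(captured)
print(whole)
```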

Conclusion
Excellent work! For full information on all possible patterns, check out: https://docs.python.org/3/howto/regex.html

# Python Text Basics Assessment

Welcome to your assessment! Complete the tasks described in bold below by typing the relevant code in the cells.<br>
You can compare your answers to the Solutions notebook provided in this folder.

## f-Strings
#### 1. Print an f-string that displays `NLP stands for Natural Language Processing` using the variables provided.

In [218]:
abbr = 'NLP'
full_text = 'Natural Language Processing'

In [219]:
print(f'{abbr} stands for {full_text}')

NLP stands for Natural Language Processing


## Files
#### 2. Create a file in the current working directory called `contacts.txt` by running the cell below:

In [220]:
%%writefile contacts.txt
First_Name Last_Name, Title, Extension, Email
gururaja talepalli, Sir , +91, gurutsgrs@gmail.com

Writing contacts.txt


#### 3. Open the file and use .read() to save the contents of the file to a string called `fields`.  Make sure the file is closed at the end.

In [233]:
with open('contacts.txt') as f:
  fields=f.read()

In [234]:
# file=open('contacts.txt',mode='r')
# fields=file.read()
# file.close()

## Working with PDF Files
#### 4. Use PyPDF2 to open the file `Business_Proposal.pdf`. Extract the text of page 2.

In [240]:
# Perform import
import PyPDF2

# Open the file as a binary object
pdf=open("Business_Proposal.pdf",'rb')

# Use PyPDF2 to read the text of the file
readings=PyPDF2.PdfReader(pdf)


# Get the text from page 2 (CHALLENGE: Do this in one step!)
page_two_text = readings.pages[1].extract_text()



# Close the file
pdf.close()

# Print the contents of page_two_text
print(page_two_text)

AUTHORS:  
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com  


#### 5. Open the file `contacts.txt` in append mode. Add the text of page 2 from above to `contacts.txt`.

#### CHALLENGE: See if you can remove the word "AUTHORS:"

In [243]:
with open('contacts.txt',"a+") as f:
  f.write(page_two_text)
  f.seek(0)
  print(f.read())

First_Name Last_Name, Title, Extension, Email
gururaja talepalli, Sir , +91, gurutsgrs@gmail.com
AUTHORS:  
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com    
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com  AUTHORS:  
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com  


In [244]:
# CHALLENGE Solution (re-run the %%writefile cell above to obtain an unmodified contacts.txt file):

with open('contacts.txt','a+') as f:
  f.write(page_two_text[8:])
  f.seek(0)
  print(f.read())



First_Name Last_Name, Title, Extension, Email
gururaja talepalli, Sir , +91, gurutsgrs@gmail.com
AUTHORS:  
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com    
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com  AUTHORS:  
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com    
Amy Baker, Finance Chair, x345, abaker@ourcompany.com  
Chris Donaldson, Accounting Dir., x621, cdonaldson@ourcompany.com  
Erin Freeman, Sr. VP, x879, efreeman@ourcompany.com  


## Regular Expressions
#### 6. Using the `page_two_text` variable created above, extract any email addresses that were contained in the file `Business_Proposal.pdf`.

In [245]:
import re


In [250]:
pattern= r'\w+@\w+\.\w{3}' #escape the dot so it matches a literal "." and not any character

re.findall(pattern, page_two_text)

['abaker@ourcompany.com',
 'cdonaldson@ourcompany.com',
 'efreeman@ourcompany.com']
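Whether the dot in an email pattern is escaped makes a difference on text that only looks like an address. The sketch below compares the two forms on made-up strings; the pattern is a simple one for this exercise, not a full email validator:

```python
import re

unescaped = r'\w+@\w+.\w{3}'    # . matches any character
escaped = r'\w+@\w+\.\w{3}'     # \. matches only a literal dot

real = 'abaker@ourcompany.com'
fake = 'user@hostXcom'          # no dot at all

print(re.findall(escaped, real))
print(re.findall(unescaped, fake))   # matches, because . accepts the "X"
print(re.findall(escaped, fake))     # no match
```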

Great Job for the day!