# All material ©2019, Alex Siegman

---

## Regular Expressions

### Regular expressions are like pattern matchers – they allow us to match certain texts within a larger block of text. But, unlike a Command+F search, regular expressions allow you to be flexible. 

### To best understand, let's delve right into an example.  I have copied the text from http://obamaspeeches.com/ into a .txt file for ease of reading. This file represents the full text of Obama's 2009 Inauguration Speech. 

In [1]:
# first, we are going to open the .txt file and read it in line by line

with open('/Users/siegmanA/Desktop/NYU-Projects-in-Programming-Fall-2019/(Class 4) Regular Expressions /Obama_2009_InauguralAddress.txt') as f:
    sentences = f.readlines()  # one list entry per line of the file
    
print("The object 'sentences' is of type", type(sentences)) # let's see how our sentences are being stored in Python
print("There are", len(sentences), "sentences in the list") # how many sentences are there?
print("\n") # print a new line just for ease of reading

print(sentences) # let's see all of our sentences

The object 'sentences' is of type <class 'list'>
There are 69 sentences in the list


['My fellow citizens:\n', '\n', 'I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.\n', '\n', 'Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.\n', '\n', 'So it has been. So it must be with this generation of Americans.\n', '\n', 'That we are in the midst of crisis is now well

In [2]:
print(sentences[:10]) # what about the first ten sentences?

['My fellow citizens:\n', '\n', 'I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.\n', '\n', 'Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.\n', '\n', 'So it has been. So it must be with this generation of Americans.\n', '\n', 'That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatr

### You'll see that there's a lot of ugly formatting involved. Luckily, we can strip all of that extraneous content.

In [3]:
new_sentences = []

for i in sentences: 
    new_sentences.append(i.strip())
    
# you could also write the above loop as a list comprehension: new_sentences = [i.strip() for i in sentences]
    
print(new_sentences)

['My fellow citizens:', '', 'I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.', '', 'Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.', '', 'So it has been. So it must be with this generation of Americans.', '', 'That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy 
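The stripped list still contains empty strings where the file had blank lines. As a minimal sketch (the three lines below are a small sample in the same shape as the file, and the names `raw_lines`, `cleaned` and `non_empty` are made up for illustration), you could strip and then filter those out:

```python
# a tiny sample in the same shape as the lines read from the file
raw_lines = ["My fellow citizens:\n", "\n", "So it has been.\n"]

# strip the trailing newline (and any other surrounding whitespace) from each line
cleaned = [line.strip() for line in raw_lines]

# an empty string is "falsy", so `if line` keeps only the non-empty lines
non_empty = [line for line in cleaned if line]

print(cleaned)
print(non_empty)
```

Whether you want to drop the blank lines depends on what you do next; the searches in this notebook work either way, since an empty string simply fails to match.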

### Now, let's delve right in!

In [4]:
import re
from re import search # import the search function from the regular expression (re) library

### One of the first things you may want to do is search for a literal – simply match the exact text in the document in question. For instance, if we want to find any mention of the word, "America"...

In [5]:
for i in new_sentences:
    result = re.search("America",i)
    print(result)

None
None
None
None
<re.Match object; span=(11, 18), match='America'>
None
<re.Match object; span=(54, 61), match='America'>
None
None
None
<re.Match object; span=(170, 177), match='America'>
None
<re.Match object; span=(164, 171), match='America'>
None
None
None
None
None
None
None
None
None
None
None
None
None
None
None
<re.Match object; span=(143, 150), match='America'>
None
<re.Match object; span=(548, 555), match='America'>
None
None
None
None
None
None
None
None
<re.Match object; span=(512, 519), match='America'>
None
None
None
None
None
<re.Match object; span=(563, 570), match='America'>
None
None
None
None
None
<re.Match object; span=(94, 101), match='America'>
None
<re.Match object; span=(98, 105), match='America'>
None
<re.Match object; span=(484, 491), match='America'>
None
None
None
None
None
None
None
<re.Match object; span=(101, 108), match='America'>
None
None
<re.Match object; span=(0, 7), match='America'>
None
<re.Match object; span=(60, 67), match='America'>
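Each result that isn't None is a match object. A minimal sketch of how you might read one, using a made-up sample sentence rather than the speech file:

```python
import re

# a made-up sample sentence, used only for illustration
sample = "But know this, America - they will be met."

result = re.search("America", sample)

if result:  # re.search returns None when there is no match
    start, end = result.span()   # (start, end) character positions of the match
    matched = result.group()     # the text that actually matched
    print(start, end, matched)
    print(sample[start:end])     # slicing with the span gives back the same text
```

The `if result:` check matters: calling `.span()` or `.group()` on None raises an error, and most lines in the output above are None.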


In [6]:
[i for i in new_sentences if search("America",i)]

['Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.',
 'So it has been. So it must be with this generation of Americans.',
 "These are the indicators of crisis, subject to data and statistics. Less measurable but no less profound is a sapping of confidence across our land - a nagging fear that America's decline is inevitable, and that the next generation must lower its sights.",
 'Today I say to you that the challenges we face are real. They are serious and they are many. They will not be met easily or in a short span of time. But know this, America - they will be met.',
 'Time and aga

### What if we aren't looking for a proper noun, but rather for the word 'homes' – it could be spelled 'homes' or 'Homes' depending on where in the sentence it is. 

### Rather than write two different literal searches, we can do this: 

In [7]:
[i for i in new_sentences if search("[Hh]omes",i)] # looking for upper or lower-case 'h' followed by 'omes'

['That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.']
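A character class like [Hh] covers one letter at a time. If you want to ignore case across the whole pattern, one option is the `re.IGNORECASE` flag. A small sketch with made-up example strings:

```python
import re

# made-up examples, for illustration only
examples = ["Homes have been lost", "our homes are safe", "HOMES for sale", "no match here"]

# the character class only relaxes the first letter
with_class = [s for s in examples if re.search("[Hh]omes", s)]

# the re.IGNORECASE flag ignores case for every letter in the pattern
with_flag = [s for s in examples if re.search("homes", s, re.IGNORECASE)]

print(with_class)
print(with_flag)
```

The flag version also matches "HOMES", which the character class does not; which behavior you want depends on the text you're searching.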

### It's important to note that putting a backslash \ before a special metacharacter lets you include that metacharacter as a literal.

For instance, if I'm looking for a dollar value in a text, and wanted to search for the '$' symbol, I would have to type: 

> re.search("\$", sentence)

in order to look for it. 

### To illustrate, imagine we have a new text: 


In [8]:
text = ["In 2019 profits rose by $4,000,000","This is the first time that profits have rose by more than %40 in a decade."]

In [9]:
[i for i in text if search("$",i)]

['In 2019 profits rose by $4,000,000',
 'This is the first time that profits have rose by more than %40 in a decade.']

### You'll see this returns both of our sentences. That's because the dollar sign is a special character: it matches the end of a string, and every string has an end (more on this in a moment). To search for the literal dollar symbol, we use the backslash to "escape" that special character.

In [10]:
[i for i in text if search(r"\$",i)] # the r prefix makes this a raw string, so Python hands the backslash to re unchanged

['In 2019 profits rose by $4,000,000']
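If you'd rather not escape characters by hand, the standard library's `re.escape()` can do it for you. A minimal sketch, using a made-up list called `lines`:

```python
import re

# made-up examples, for illustration only
lines = ["Profits rose by $4,000,000", "Costs were flat"]

# escaping by hand, written as a raw string so the backslash reaches re untouched
escaped_by_hand = [s for s in lines if re.search(r"\$", s)]

# re.escape() adds backslashes wherever the text contains special characters
pattern = re.escape("$4,000,000")
escaped_by_function = [s for s in lines if re.search(pattern, s)]

print(escaped_by_hand)
print(escaped_by_function)
```

`re.escape()` is most useful when the text you want to match literally comes from somewhere else (user input, a file) and you can't know in advance which special characters it contains.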

## Dealing with 'metacharacters' 

Metacharacters include \ ^ $ . | ? * + ( ) [ ] { and }

These metacharacters help us match various, non-literal components of a sentence. For instance, the search: 

re.search("^I think",sentence) means that you are searching for the words "I think" at the start of a line (that's what the '^' represents).

In [11]:
[i for i in new_sentences if search("^My",i)]

['My fellow citizens:']

### To gain insight into what your regular expression is doing at any time, I highly recommend using regexper.com (https://regexper.com/) which will allow you to see exactly what a given search is doing. 

For instance, check out https://regexper.com/#%5EMy%0A to see what we just did with '^My'

Here is a good cheat sheet for all the special characters, too, from Emma Wedekind: https://dev.to/emmawedekind/regex-cheat-sheet-2j2a

Finally, I'd also recommend RegEx101, a handy debugger for regular expressions: https://regex101.com/

### Here are some more regex special character examples: 

In [12]:
[i for i in new_sentences if search("sights.$",i)]

# the '.' represents a wildcard (it can refer to any single character)
# the '$' represents the end of the string (here, the end of each line of the speech)
# thus, we are looking for a line that ends in "sights" followed by any one character, such as "sights."

["These are the indicators of crisis, subject to data and statistics. Less measurable but no less profound is a sapping of confidence across our land - a nagging fear that America's decline is inevitable, and that the next generation must lower its sights."]
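Because '.' is a wildcard, "sights.$" accepts any final character, not only a period. A short sketch with made-up endings shows the difference between the wildcard and an escaped, literal period:

```python
import re

# made-up line endings, for illustration only
endings = ["lower its sights.", "lower its sights!", "lower its sights"]

# '.' is a wildcard: "sights" plus any one character, then the end of the string
wildcard = [s for s in endings if re.search(r"sights.$", s)]

# '\.' is a literal period: "sights." and then the end of the string
literal = [s for s in endings if re.search(r"sights\.$", s)]

print(wildcard)
print(literal)
```

The third string matches neither pattern, because both require one more character after "sights".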

In [13]:
[i for i in new_sentences if search("[0-9]",i)]

# the '[0-9]' will match any single digit, 0-9. As you'll see, it matches the '92' in "we\'92ll", which isn't really part of the text,
# but is part of the formatting. Still, it works!

["We are the keepers of this legacy. Guided by these principles once more, we can meet those new threats that demand even greater effort - even greater cooperation and understanding between nations. We will begin to responsibly leave Iraq to its people, and forge a hard-earned peace in Afghanistan. With old friends and former foes, we\\'92ll work tirelessly to lessen the nuclear threat, and roll back the specter of a warming planet. We will not apologize for our way of life, nor will we waver in its defense, and for those who seek to advance their aims by inducing terror and slaughtering innocents, we say to you now that our spirit is stronger and cannot be broken; you cannot outlast us, and we will defeat you."]
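So far we've only asked whether a line contains a digit. If you want the digits themselves, one option is `re.findall()`, which returns every match in a string. A minimal sketch on a made-up sentence:

```python
import re

# a made-up sentence, for illustration only
sample = "In 2019 profits rose by $4,000,000"

# [0-9] matches one digit at a time
single_digits = re.findall(r"[0-9]", sample)

# [0-9]+ matches runs of one or more consecutive digits
digit_runs = re.findall(r"[0-9]+", sample)

print(single_digits)
print(digit_runs)
```

Note that the commas split "4,000,000" into separate runs; a pattern such as [0-9,]+ would be one way to keep them together.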

In [14]:
[i for i in new_sentences if search("[^?.]$",i)]

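# the pattern "[^?.]$" will match sentences that don't end in a period or a question mark
# (a '^' at the start of a character class negates it: "any character except these")
# it's important to note that you don't have to "escape" (backslash) most metacharacters, like '?' and '.',
# inside a character class (that is, between [ and ])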

['My fellow citizens:',
 "What the cynics fail to understand is that the ground has shifted beneath them - that the stale political arguments that have consumed us for so long no longer apply. The question we ask today is not whether our government is too big or too small, but whether it works - whether it helps families find jobs at a decent wage, care they can afford, a retirement that is dignified. Where the answer is yes, we intend to move forward. Where the answer is no, programs will end. And those of us who manage the public's dollars will be held to account - to spend wisely, reform bad habits, and do our business in the light of day - because only then can we restore the vital trust between a people and their government.\\",
 "So let us mark this day with remembrance, of who we are and how far we have traveled. In the year of America's birth, in the coldest of months, a small band of patriots huddled by dying campfires on the shores of an icy river. The capital was abandoned. 

In [15]:
[i for i in new_sentences if search("remember|forget",i)]

# the 'remember|forget' means we're searching for either 'remember' or 'forget'

['As we consider the road that unfolds before us, we remember with humble gratitude those brave Americans who, at this very hour, patrol far-off deserts and distant mountains. They have something to tell us, just as the fallen heroes who lie in Arlington whisper through the ages. We honor them not only because they are guardians of our liberty, but because they embody the spirit of service; a willingness to find meaning in something greater than themselves. And yet, at this moment - a moment that will define a generation - it is precisely this spirit that must inhabit us all.',
 "America. In the face of our common dangers, in this winter of our hardship, let us remember these timeless words. With hope and virtue, let us brave once more the icy currents, and endure what storms may come. Let it be said by our children's children that when we were tested we refused to let this journey end, that we did not turn back nor did we falter; and with eyes fixed on the horizon and God's grace upon

In [16]:
[i for i in new_sentences if search("day|month|year",i)]

# you can also search with multiple "or" statements
# here, we are looking for either "day", "month", or "year"
# note that these match anywhere, even inside longer words: "today" counts as a match for "day"

['I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.',
 'That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.',
 'Today I say to you that the challenges we face are real. They are serious and they are many. They will not be met easily or in a short span of time. But know 
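The results above include lines that only contain "today", because "day" matches inside a longer word. If you want whole words only, one approach is the word-boundary marker \b. A minimal sketch with made-up sentences:

```python
import re

# made-up sentences, for illustration only
samples = ["I stand here today", "each day brings further evidence", "yesterday is gone"]

# plain alternation matches "day" anywhere, including inside "today" and "yesterday"
substring = [s for s in samples if re.search(r"day|month|year", s)]

# \b marks a word boundary, so only the standalone words match
# (?: ... ) groups the alternatives so that both \b markers apply to all three
whole_word = [s for s in samples if re.search(r"\b(?:day|month|year)\b", s)]

print(substring)
print(whole_word)
```

The pattern is written as a raw string (r"...") because "\b" in an ordinary Python string means a backspace character rather than a word boundary.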

In [17]:
[i for i in new_sentences if search(r"^[Tt]oday|OK\?$",i)]

# this is complex, but we're looking for a sentence that either starts with "Today" or "today" or that ends with "OK?"
# check https://regexper.com/#%5E%5BWw%5Datch%7COK%5C?%24 for a graphical representation of a similar pattern, "^[Ww]atch|OK\?$"

['Today I say to you that the challenges we face are real. They are serious and they are many. They will not be met easily or in a short span of time. But know this, America - they will be met.']
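One subtlety in the pattern above: the | has the lowest precedence of all the metacharacters, so the ^ applies only to the first alternative and the $ only to the second. A small sketch with made-up strings, comparing it to a grouped version:

```python
import re

# made-up strings, for illustration only
candidates = ["Today we begin", "Is that OK?", "OK? Not today", "We begin today", "today"]

# ungrouped: (starts with Today/today) OR (ends with OK?)
either = [s for s in candidates if re.search(r"^[Tt]oday|OK\?$", s)]

# grouped: the whole string must be exactly "Today", "today" or "OK?"
grouped = [s for s in candidates if re.search(r"^(?:[Tt]oday|OK\?)$", s)]

print(either)
print(grouped)
```

The bare string "today" matches both patterns, since it starts with "today" and is also exactly "today". Neither reading is wrong; the parentheses just make it explicit which one you mean.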

In [18]:
[i for i in new_sentences if search("^We.*",i)]

# the * and + signs are metacharacters used to indicate repetition
# * means “any number, including zero, of the item”
# + means “at least one of the item”

# above, we're looking for the word "We" that starts a sentence, followed by any character, any number of times

['We remain a young nation, but in the words of Scripture, the time has come to set aside childish things. The time has come to reaffirm our enduring spirit; to choose our better history; to carry forward that precious gift, that noble idea, passed on from generation to generation: the God-given promise that all are equal, all are free, and all deserve a chance to pursue their full measure of happiness.',
 "We are the keepers of this legacy. Guided by these principles once more, we can meet those new threats that demand even greater effort - even greater cooperation and understanding between nations. We will begin to responsibly leave Iraq to its people, and forge a hard-earned peace in Afghanistan. With old friends and former foes, we\\'92ll work tirelessly to lessen the nuclear threat, and roll back the specter of a warming planet. We will not apologize for our way of life, nor will we waver in its defense, and for those who seek to advance their aims by inducing terror and slaughter

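To see the repetition metacharacters side by side, here is a minimal sketch on a made-up word list. It also includes ?, which appears in the metacharacter list and means "zero or one of the item":

```python
import re

# made-up words, for illustration only
words = ["ct", "cat", "caat", "caaat"]

# ^ and $ pin the pattern to the whole word, so only the number of a's varies
star = [w for w in words if re.search(r"^ca*t$", w)]      # zero or more 'a'
plus = [w for w in words if re.search(r"^ca+t$", w)]      # one or more 'a'
question = [w for w in words if re.search(r"^ca?t$", w)]  # zero or one 'a'

print(star)
print(plus)
print(question)
```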
In [19]:
# {} are known as "interval quantifiers" that let us specify the number of matches we want 

[i for i in new_sentences if search("We (\\w+ ){1,7}nation",i)]

# this one is tricky...
# first, we are looking for the word "We"...
# then, (\\w+ ) stands for "one or more letters, digits, or underscores, followed by a space" (roughly, one word), and the {1,7} means we are looking for that group 1 to 7 times in a row
# then we are looking for the word "nation"

# thus, we are looking for "We", then between 1 and 7 (inclusive) words, then the word "nation"

['We remain a young nation, but in the words of Scripture, the time has come to set aside childish things. The time has come to reaffirm our enduring spirit; to choose our better history; to carry forward that precious gift, that noble idea, passed on from generation to generation: the God-given promise that all are equal, all are free, and all deserve a chance to pursue their full measure of happiness.',
 'For we know that our patchwork heritage is a strength, not a weakness. We are a nation of Christians and Muslims, Jews and Hindus - and non-believers. We are shaped by every language and culture, drawn from every end of this Earth; and because we have tasted the bitter swill of civil war and segregation, and emerged from that dark chapter stronger and more united, we cannot help but believe that the old hatreds shall someday pass; that the lines of tribe shall soon dissolve; that as the world grows smaller, our common humanity shall reveal itself; and that America must play its role
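To check what the interval quantifier accepts and rejects, here is a small sketch that runs the same pattern against made-up sentences with zero, two and nine words between "We" and "nation":

```python
import re

# the same pattern as above, written as a raw string
pattern = r"We (\w+ ){1,7}nation"

# made-up sentences, for illustration only
samples = [
    "We nation",                                               # zero words in between
    "We are a nation",                                         # two words in between
    "We have one two three four five six seven eight nation",  # nine words in between
]

matches = [s for s in samples if re.search(pattern, s)]
print(matches)
```

Only the middle sentence fits: the first has too few words between "We" and "nation", and the last has too many.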

## What about a more complex document, now? A PDF, perhaps? 

### _For more on PyPDF2, check out https://pythonhosted.org/PyPDF2/_

In [20]:
!pip install "PyPDF2<3"  # PdfFileReader, used below, was removed in PyPDF2 3.0



In [21]:
from PyPDF2 import PdfFileReader

# open the file for reading in binary mode ('rb'), since a PDF is not plain text
# let's read Alphabet's Q1 2019 earnings release for the heck of it

file = open('/Users/siegmanA/Desktop/NYU-Projects-in-Programming-Fall-2019/(Class 4) Regular Expressions /2019Q1_alphabet_earnings_release.pdf', 'rb')

# use the file to create a PDF reader object to extract the text
pdf = PdfFileReader(file)
type(pdf)

PyPDF2.pdf.PdfFileReader

In [22]:
# let's see what sorts of things we can do: 

help(pdf)

Help on PdfFileReader in module PyPDF2.pdf object:

class PdfFileReader(builtins.object)
 |  
 |  Initializes a PdfFileReader object.  This operation can take some time, as
 |  the PDF stream's cross-reference tables are read into memory.
 |  
 |  :param stream: A File object or an object that supports the standard read
 |      and seek methods similar to a File object. Could also be a
 |      string representing a path to a PDF file.
 |  :param bool strict: Determines whether user should be warned of all
 |      problems and also causes some correctable problems to be fatal.
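 |      Defaults to ``True``.
 |  :param warndest: Destination for logging warnings (defaults to
 |      ``sys.stderr``).
 |  :param bool overwriteWarnings: Determines whether to override Python's
 |      ``warnings.py`` module with a custom implementation (defaults to
 |      ``True``).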
 |  
 |  Methods defined here:
 |  
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  cacheGetIndirectObject(self, generation, idnum)
 |  
 |  cacheIndirectObject(self, generation, idnum, obj)
 |  
 |  decrypt(self, password)
 |      When using an encrypted / secured PDF file with the PDF Standard
 |      encryp

In [23]:
pdf.numPages # how many pages in our pdf? 

11

In [24]:
page = pdf.getPage(4) # let's take a look at the fifth page of our pdf (page numbers start at 0)...

page = page.extractText()

print(page) # and print just the text

Alphabet Inc.CONSOLIDATED STATEMENTS OF INCOME(In millions, except per share amounts; unaudited)Three Months EndedMarch 31,20182019Revenues$31,146$36,339Costs and expenses:Cost of revenues13,46716,012Research and development5,0396,029Sales and marketing3,6043,905General and administrative1,4032,088European Commission fine01,697Total costs and expenses23,51329,731Income from operations7,6336,608Other income (expense), net2,9101,538Income before income taxes 10,5438,146Provision for income taxes1,1421,489Net income$9,401$6,657Basic earnings per share of Class A and B common stock and Class C capital stock$13.53$9.58Diluted earnings per share of Class A and B common stock and Class C capital stock$13.33$9.50


In [25]:
# and, last but not least, some regular expressions to prove we've got everything we need: 

import re
from re import search # import the search function from the regular expression (re) library

result = re.search("Research",page)
print(result)

<re.Match object; span=(200, 208), match='Research'>
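A literal search only proves the text is there. As a rough sketch of something more useful, the pattern below tries to pull dollar amounts out of the extracted text. The two excerpts are copied from the page output printed earlier; the name `amount_pattern` and the pattern itself are assumptions for illustration, not a general-purpose currency parser:

```python
import re

# two short excerpts copied from the extracted page text printed above
excerpts = [
    "Revenues$31,146$36,339Costs and expenses:",
    "capital stock$13.53$9.58Diluted earnings per share",
]

# a literal '$', then digits and commas, then an optional decimal part
amount_pattern = r"\$[0-9,]+(?:\.[0-9]+)?"

for excerpt in excerpts:
    print(re.findall(amount_pattern, excerpt))
```

This works here because each dollar figure is marked with its own '$'. The figures without a dollar sign run together in the extracted text (for example "13,46716,012"), so separating those would take more than a simple pattern.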


---

## Next week we'll take our knowledge of regular expressions and marry it with our soon-to-be web scraping skills. 