# All material ©2019, Alex Siegman

---

## Regular Expressions

### Regular expressions are like pattern matchers – they allow us to match certain texts within a larger block of text. But, unlike a Command+F search, regular expressions allow you to be flexible. 

### To best understand, let's delve right into an example.  I have copied the text from http://obamaspeeches.com/ into a .txt file for ease of reading. This file represents the full text of Obama's 2009 Inauguration Speech. 

In [None]:
# first, we are going to open the .txt file 

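# readlines() returns a list with one string per line of the file; 'with' closes the file for us when we're done
with open('/Users/siegmanA/Desktop/NYU-Projects-in-Programming-Fall-2019/(Class 4) Regular Expressions /Obama_2009_InauguralAddress.txt') as f:
    sentences = f.readlines()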
    
print("The object 'sentences' is of type", type(sentences)) # let's see how our sentences are being stored in Python
print("There are", len(sentences), "sentences in the list") # how many sentences are there?
print("\n") # print a new line just for ease of reading

print(sentences) # let's see all of our sentences

In [None]:
print(sentences[:10]) # what about the first ten sentences?

### You'll see that there's a lot of ugly formatting involved. Luckily, we can strip all of that extraneous content. 

In [None]:
new_sentences = []

for i in sentences: 
    new_sentences.append(i.strip())
    
# you could also write the above loop as a list comprehension: new_sentences = [i.strip() for i in sentences]
    
print(new_sentences)
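To see exactly what `.strip()` removes, here is a minimal sketch on a few made-up lines (the sample strings are hypothetical stand-ins for what `readlines()` returns, not text from the speech):

```python
# hypothetical raw lines, standing in for the output of readlines()
raw_lines = ["  Hello, world.\n", "\n", "\tThank you.  \n"]

# strip() removes leading and trailing whitespace, including '\n' and '\t'
clean_lines = [line.strip() for line in raw_lines]

print(clean_lines)
```

Note that a line containing only a newline becomes an empty string rather than disappearing from the list.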

### Now, let's delve right in!

In [None]:
import re
from re import search # import the search function from the regular expression (re) library

### One of the first things you may want to do is search for a literal – simply match the exact text in the document in question. For instance, if we want to find any mention of the word, "America"...

In [None]:
for i in new_sentences:
    result = re.search("America",i)
    print(result)

In [None]:
[i for i in new_sentences if search("America",i)]
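The list comprehension above works because `re.search` returns a match object when it finds the pattern and `None` when it doesn't, and `None` counts as false in an `if`. A small sketch, using a made-up sentence purely for illustration:

```python
import re

# a made-up sentence, used only for illustration
sentence = "We love America."

hit = re.search("America", sentence)   # a match object
miss = re.search("Canada", sentence)   # None, because nothing matched

print(hit.group())   # the text that matched
print(hit.span())    # (start, end) positions of the match in the string
print(miss)
```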

### What if we aren't looking for a proper noun, but rather for the word 'homes' – it could be spelled 'homes' or 'Homes' depending on where in the sentence it is. 

### Rather than write two different literal searches, we can do this: 

In [None]:
[i for i in new_sentences if search("[Hh]omes",i)] # looking for upper or lower-case 'h' followed by 'omes'
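Here is a small sketch of how the character class `[Hh]` behaves, using hypothetical sample sentences. It also shows `re.IGNORECASE`, a flag in Python's `re` module that ignores case for the whole pattern; that is a broader tool than the one used above, so treat it as an aside.

```python
import re

# hypothetical sample sentences, not taken from the speech
samples = ["Homes were lost.", "Our homes are safe.", "HOMES in capitals.", "No match here."]

# [Hh] matches exactly one character: either 'H' or 'h'
either_case_h = [s for s in samples if re.search("[Hh]omes", s)]

# re.IGNORECASE ignores case for every letter in the pattern
any_case = [s for s in samples if re.search("homes", s, re.IGNORECASE)]

print(either_case_h)
print(any_case)
```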

### It's important to note that putting a backslash \ before a special metacharacter lets you include that metacharacter as a literal. 

For instance, if I'm looking for a dollar value in a text and want to search for the '$' symbol, I have to type: 

> re.search(r"\$", text) 

in order to look for it. 

### To illustrate, imagine we have a new text: 


In [None]:
text = ["In 2019 profits rose by $4,000,000","This is the first time that profits have rose by more than %40 in a decade."]

In [None]:
[i for i in text if search("$",i)]

### You'll see this returns both of our sentences. That's because the dollar sign is a special character (which we'll discuss in a moment): it matches the end of a string, and every string has an end. To search for the literal dollar symbol, we use the backslash to "escape" that special character.

In [None]:
[i for i in text if search(r"\$",i)] # the r prefix makes this a raw string, so the backslash reaches the regex engine unchanged
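The difference between the escaped and unescaped dollar sign can be sketched side by side. The list `lines` below is a hypothetical stand-in for `text`, and `re.escape` is a helper from Python's `re` module that adds the backslashes for you.

```python
import re

# hypothetical sample lines, standing in for the 'text' list above
lines = ["In 2019 profits rose by $4,000,000", "Profits rose by more than 40%."]

# unescaped: '$' means "end of string", so every line matches
unescaped = [t for t in lines if re.search("$", t)]

# escaped: r"\$" means "a literal dollar sign"
escaped = [t for t in lines if re.search(r"\$", t)]

# re.escape builds the escaped pattern from a plain string
auto_escaped = [t for t in lines if re.search(re.escape("$4"), t)]

print(len(unescaped), escaped, auto_escaped)
```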

## Dealing with 'metacharacters' 

Metacharacters include: \ ^ $ . | ? * + ( ) [ ] { }

These metacharacters help us match various, non-literal components of a sentence. For instance, the search: 

re.search("^I think",sentence) means that you are searching for the words "I think" at the start of a line (that's what the '^' represents).

In [None]:
[i for i in new_sentences if search("^My",i)]
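To make the effect of `^` concrete, here is a minimal sketch comparing the anchored and unanchored search on two made-up sentences (hypothetical examples, not from the speech):

```python
import re

# hypothetical sample sentences
samples = ["My friends, thank you.", "Thank you, My friends."]

# '^My' only matches when "My" is at the very start of the string
at_start = [s for s in samples if re.search("^My", s)]

# 'My' with no anchor matches anywhere in the string
anywhere = [s for s in samples if re.search("My", s)]

print(at_start)
print(anywhere)
```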

### To gain insight into what your regular expression is doing at any time, I highly recommend using regexper.com (https://regexper.com/) which will allow you to see exactly what a given search is doing. 

For instance, check out https://regexper.com/#%5EMy%0A to see what we just did with '^My'

Here is a good cheat sheet for all the special characters, too, from Emma Wedekind: https://dev.to/emmawedekind/regex-cheat-sheet-2j2a

Finally, I'd also recommend RegEx101, a handy debugger for regular expressions: https://regex101.com/

### Here are some more regex special character examples: 

In [None]:
[i for i in new_sentences if search("sights.$",i)]

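# the '.' represents a wildcard (it matches any single character except a newline)
# the '$' represents the end of the string (here, the end of each sentence)
# thus, we are looking for a sentence that ends in "sights" followed by exactly one more character, such as "sights."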
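Because `.` is a wildcard, `"sights.$"` accepts any final character, not just a period. A short sketch on hypothetical sentences shows the difference between the wildcard and an escaped, literal period:

```python
import re

# hypothetical sample sentences
samples = ["We saw the sights.", "We saw the sights!", "The sights were lovely.", "We saw the sights"]

# '.' matches any one character, so both '.' and '!' are accepted before the end
wildcard = [s for s in samples if re.search("sights.$", s)]

# r"\." matches only a literal period
literal = [s for s in samples if re.search(r"sights\.$", s)]

print(wildcard)
print(literal)
```

The last sample matches neither pattern, because both require one more character after "sights".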

In [None]:
[i for i in new_sentences if search("[0-9]",i)]

# the '[0-9]' will match any single digit, 0-9. As you'll see, it matches '9211', which isn't really part of the text, 
# but is part of the formatting. Still, it works!

In [None]:
[i for i in new_sentences if search("[^?.]$",i)]

# the pattern "[^?.]$" will match sentences that don't end in a period or a question mark 
# (inside square brackets, a leading '^' means "not")
# it's important to note that you don't have to "escape" (backslash) most metacharacters, such as '?' and '.', 
# in a character class (that is, between [ and ]) 
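As a quick check of the negated character class, here is a sketch on a few made-up sentences (hypothetical examples):

```python
import re

# hypothetical sample sentences
samples = ["Is this a question?", "This is a statement.", "What a day!", "A title without punctuation"]

# [^?.] matches any one character that is NOT '?' or '.'; '$' pins it to the end
no_period_or_question = [s for s in samples if re.search("[^?.]$", s)]

print(no_period_or_question)
```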

In [None]:
[i for i in new_sentences if search("remember|forget",i)]

# the 'remember|forget' means we're searching for either 'remember' or 'forget'

In [None]:
[i for i in new_sentences if search("day|month|year",i)]

# you can also search with multiple "or" statements
# here, we are looking for either the word "day", "month", or "year"
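One thing to watch for: a literal like `day` also matches inside longer words such as "Today". The sketch below uses hypothetical sentences and, as an aside beyond the characters covered so far, the `\b` word-boundary token from Python's `re` module to restrict the match to whole words.

```python
import re

# hypothetical sample sentences
samples = ["Today is sunny.", "One day at a time.", "It has been a year.", "Nothing relevant."]

# plain alternation: "day" is found inside "Today" as well
substring_hits = [s for s in samples if re.search("day|month|year", s)]

# \b marks a word boundary, so only the standalone words match
whole_word_hits = [s for s in samples if re.search(r"\b(day|month|year)\b", s)]

print(substring_hits)
print(whole_word_hits)
```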

In [None]:
[i for i in new_sentences if search(r"^[Tt]oday|OK\?$",i)]

# this is complex, but we're looking for a sentence that either starts with "Today" or "today", or that ends with "OK?"
# check https://regexper.com/#%5E%5BWw%5Datch%7COK%5C?%24 for a graphical representation of the same pattern with "[Ww]atch" in place of "[Tt]oday"
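The `|` splits the whole pattern in two, so `^` applies only to the left-hand side and `$` only to the right-hand side. A minimal sketch on made-up sentences (hypothetical examples):

```python
import re

# hypothetical sample sentences
samples = ["Today we begin.", "Is that OK?", "OK? Not today."]

# read the pattern as: (^[Tt]oday)  OR  (OK\?$)
pattern = r"^[Tt]oday|OK\?$"
hits = [s for s in samples if re.search(pattern, s)]

print(hits)
```

The third sample is skipped: it contains both "OK?" and "today", but "OK?" is not at the end and "today" is not at the start.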

In [None]:
[i for i in new_sentences if search("^We.*",i)]

# the * and + signs are metacharacters used to indicate repetition
# * means "any number, including zero, of the item" 
# + means "at least one of the item"

# above, we're looking for the word "We" at the start of a sentence, followed by any character, any number of times
# (because search() only needs to find the pattern somewhere, "^We" alone would select the same sentences)
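The difference between `*` and `+` is easiest to see on tiny strings. A minimal sketch, with made-up inputs:

```python
import re

# made-up strings with zero, one, and three 'b' characters between 'a' and 'c'
samples = ["ac", "abc", "abbbc"]

# b* allows zero or more 'b' characters
star = [s for s in samples if re.search("ab*c", s)]

# b+ requires at least one 'b'
plus = [s for s in samples if re.search("ab+c", s)]

print(star)
print(plus)
```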

In [None]:
# {} are known as "interval quantifiers" that let us specify the number of matches we want 

[i for i in new_sentences if search("We (\\w+ ){1,7}nation",i)]

# this one is tricky...
# first, we are looking for the word "We" followed by a space...
# then, (\\w+ ) stands for "one word": one or more letters, digits, or underscores (\\w+) followed by a space
# the {1,7} means we want that group repeated between 1 and 7 times
# then we are looking for the word "nation"

# thus, we are looking for "We", then between 1 and 7 (inclusive) words, then the word "nation"
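To check the `{1,7}` interval, here is a sketch on three hypothetical sentences with two, zero, and nine words between "We" and "nation":

```python
import re

# hypothetical sample sentences
samples = [
    "We are one nation.",                                  # 2 words in between
    "We nation.",                                          # 0 words in between
    "We are a very big and old but still young nation.",   # 9 words in between
]

# (\w+ ) is one word plus a trailing space; {1,7} repeats it 1 to 7 times
pattern = r"We (\w+ ){1,7}nation"
hits = [s for s in samples if re.search(pattern, s)]

print(hits)
print(re.search(pattern, samples[0]).group())  # the full text that matched
```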

## What about a more complex document, now? A PDF, perhaps? 

### _For more on PyPDF2, check out https://pythonhosted.org/PyPDF2/_

In [None]:
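# the names used below (PdfFileReader, numPages, getPage, extractText) were removed in PyPDF2 3.0, so install an earlier release
!pip install "PyPDF2<3"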

In [None]:
from PyPDF2 import PdfFileReader

# open the file for 'reading' ('r') and signal that the data inside is 'binary' ('b')
# let's read Alphabet's Q1 2019 earnings release for the heck of it

file = open('/Users/siegmanA/Desktop/NYU-Projects-in-Programming-Fall-2019/(Class 4) Regular Expressions /2019Q1_alphabet_earnings_release.pdf', 'rb')

# use the file to create a PDF reader object to extract the text
pdf = PdfFileReader(file)
type(pdf)

In [None]:
# let's see what sorts of things we can do: 

help(pdf)

In [None]:
pdf.numPages # how many pages in our pdf? 

In [None]:
page = pdf.getPage(4) # pages are zero-indexed, so this is the fifth page of our pdf...

page = page.extractText()

print(page) # and print just the text

In [None]:
# and, last but not least, some regular expressions to prove we've got everything we need: 

import re
from re import search # import the search function from the regular expression (re) library

result = re.search("Research",page)
print(result)
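Extracted page text is one long string with embedded newlines, so `re.search` reports only the first hit. The sketch below uses a short made-up string as a stand-in for `page` (it is not text from the PDF) to show `re.findall` and a per-line filter as two ways of seeing every occurrence.

```python
import re

# a made-up multi-line string, standing in for the text extracted from a PDF page
page_text = "Research costs rose.\nSales were flat.\nMore Research is planned."

first_hit = re.search("Research", page_text)     # only the first occurrence
all_hits = re.findall("Research", page_text)     # every occurrence, as a list of strings

# split into lines and keep the ones that mention the word
lines_with_hit = [ln for ln in page_text.splitlines() if re.search("Research", ln)]

print(first_hit.start(), len(all_hits))
print(lines_with_hit)
```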

---

## Next week we'll take our knowledge of regular expressions and marry it with our soon-to-be web scraping skills. 