### Regex

In [1]:
# import regex library
import re

In [2]:
# Let's create some text for an example
text = "This is a good day."

if re.search("good", text):
    print("Wonderful")
else:
    print("Alas :(")

Wonderful


In [3]:
# The findall() and split() functions will parse the string for us and return chunks. Let's try an example
text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

re.split("Amy", text)

['',
 ' works diligently. ',
 ' gets good grades. Our student ',
 ' is successful.']

In [5]:
# findall() returns a list of every non-overlapping match, so its length is the number of times "Amy" appears
re.findall("Amy", text)

['Amy', 'Amy', 'Amy']
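
A minimal sketch of how the two functions relate, reusing the sample sentence from the cells above. The variable names `matches` and `chunks` are illustrative only. For a pattern with no groups that never matches an empty string, split() returns one more chunk than findall() returns matches.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# findall() returns the matched substrings; len() turns that into a count
matches = re.findall("Amy", text)
print(matches)
print(len(matches))

# split() returns the pieces of text between the matches
chunks = re.split("Amy", text)
print(len(chunks) == len(matches) + 1)
```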

### Anchors (specify the start and end of the string you're trying to match)
#### ^ means the start of the string
#### $ means the end of the string

In [6]:
re.search("^Amy", text)

<re.Match object; span=(0, 3), match='Amy'>
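
The cell above only exercises the start anchor. As a small sketch under the same sample sentence, the end anchor `$` can be tried the same way. The variable names are illustrative, and the period is escaped because an unescaped `.` matches any character.

```python
import re

text = "Amy works diligently. Amy gets good grades. Our student Amy is successful."

# ^ anchors the pattern to the start of the string
starts_with_amy = re.search("^Amy", text)

# $ anchors the pattern to the end of the string; the sentence does not end with "Amy"
ends_with_amy = re.search("Amy$", text)

# The sentence does end with "successful." (the period is escaped to match literally)
ends_with_period = re.search(r"successful\.$", text)

print(starts_with_amy is not None)
print(ends_with_amy is None)
print(ends_with_period.group())
```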

### Patterns and Character Classes

In [7]:
# Let's talk more about patterns and start with character classes. Let's create a string of a single learner's
# grades over a semester in one course across all of their assignments
grades="ACAAAABCBCBAA"

# If we want to answer the question "How many B's were in the grade list?" we would just use B
re.findall("B", grades)

['B', 'B', 'B']

In [8]:
# If we wanted to count the number of A's or B's in the list, we can't use "AB" since this is used to match
# all A's followed immediately by a B. Instead, we put the characters A and B inside square brackets

re.findall("[AB]", grades)

['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'A', 'A']

In [9]:
# This is called the set operator. You can also include a range of characters, which are ordered
# alphanumerically. For instance, if we want to refer to all lower case letters we could use [a-z]. Let's build
# a simple regex to parse out all instances where this student received an A followed by a B or a C
re.findall("[A][B-C]", grades)

['AC', 'AB']

In [13]:
# Notice how the [AB] pattern describes a set of possible characters which could be either (A OR B), while the
# [A][B-C] pattern denotes two sets of characters which must be matched back to back. You can also write
# this pattern by using the pipe operator, which means OR
re.findall("AB|AC", grades)

['AC', 'AB']
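
As a quick sanity check, the set form and the pipe form can be compared directly on the same grades string. This is only a sketch; the variable names are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# Two sets back to back: an A, then a B or a C
with_sets = re.findall("[A][B-C]", grades)

# A single-character set is the same as the bare character
shorter_sets = re.findall("A[BC]", grades)

# The pipe operator spells out each alternative in full
with_pipe = re.findall("AB|AC", grades)

print(with_sets == shorter_sets == with_pipe)
print(with_pipe)  # ['AC', 'AB']
```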

In [10]:
# We can use the caret with the set operator to negate our results. For instance, if we want to parse out only
# the grades which were not A's
re.findall("[^A]", grades)

['C', 'B', 'C', 'B', 'C', 'B']

In [11]:
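# The caret has two meanings: inside the set operator it negates the set, while outside it anchors the match to
# the start of the string. This pattern asks for a non-A grade at the very start, and the list begins with an A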
re.findall("^[^A]", grades)

[]
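
To make the two roles of the caret easier to compare, here is a small sketch that runs the anchor, the negated set, and their combination on the same grades string. The variable names are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

# Outside a set, ^ anchors to the start of the string
anchored = re.findall("^A", grades)

# Inside a set, ^ negates the set: every grade that is not an A
negated = re.findall("[^A]", grades)

# Combined: a first grade that is not an A (none here, the string starts with A)
first_not_a = re.findall("^[^A]", grades)

# Combined: a first grade that is not a B
first_not_b = re.findall("^[^B]", grades)

print(anchored)     # ['A']
print(negated)      # ['C', 'B', 'C', 'B', 'C', 'B']
print(first_not_a)  # []
print(first_not_b)  # ['A']
```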

### Quantifiers

In [15]:
# Quantifiers specify how many times a pattern must be repeated in order to match. The most basic
# quantifier is expressed as e{m,n}, where e is the expression or character we are matching, m is the minimum
# number of times you want it to be matched, and n is the maximum number of times the item could be matched.

# Let's use these grades as an example. How many times has this student been on a back-to-back A's streak?
re.findall("A{2,10}", grades)

['AAAA', 'AA']

In [16]:
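# Using the same value for the minimum and the maximum matches exactly that many times, so this pattern is
# equivalent to "AA"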
re.findall("A{1,1}A{1,1}", grades)

['AA', 'AA', 'AA']
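
The pattern `A{1,1}A{1,1}` is a long way of writing an exact count. As a minimal sketch, it can be compared with the single-value form `A{2}` and the literal `AA`. Because findall() scans left to right and never reuses a character, the run of four A's yields two pairs and the final run of two yields one. The variable names are illustrative.

```python
import re

grades = "ACAAAABCBCBAA"

explicit = re.findall("A{1,1}A{1,1}", grades)  # exactly one A, then exactly one A
exact = re.findall("A{2}", grades)             # exactly two A's
literal = re.findall("AA", grades)             # the same thing written out

print(explicit == exact == literal)
print(exact)  # ['AA', 'AA', 'AA']
```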

In [18]:
# Now, that's a bit of a hack, because we included a maximum that was just arbitrarily large. There are three
# other quantifiers that are used as shorthand: an asterisk * to match 0 or more times, a question mark ? to
# match zero or one times, and a plus sign + to match one or more times. Let's look at a more complex example,
# and load some data scraped from Wikipedia
with open("data/ferpa.txt", "r") as file:
    wiki = file.read()
wiki

'Overview[edit]\nFERPA gives parents access to their child\'s education records, an opportunity to seek to have the records amended, and some control over the disclosure of information from the records. With several exceptions, schools must have a student\'s consent prior to the disclosure of education records after that student is 18 years old. The law applies only to educational agencies and institutions that receive funds under a program administered by the U.S. Department of Education.\n\nOther regulations under this act, effective starting January 3, 2012, allow for greater disclosures of personal and directory student identifying information and regulate student IDs and e-mail addresses.[2] For example, schools may provide external companies with a student\'s personally identifiable information without the student\'s consent.[2]\n\nExamples of situations affected by FERPA include school employees divulging information to anyone other than the student about the student\'s grades o
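
Before applying the shorthand quantifiers to the scraped text, it may help to see them on the short grades string from earlier. This is only a sketch with illustrative variable names. The open-ended form `A{2,}` is not mentioned above; it is standard syntax in Python's re module and avoids picking an arbitrary maximum.

```python
import re

grades = "ACAAAABCBCBAA"

# + : one or more A's in a row
one_or_more = re.findall("A+", grades)

# ? : a B followed by zero or one C
zero_or_one = re.findall("BC?", grades)

# * : a B followed by zero or more A's
zero_or_more = re.findall("BA*", grades)

# {2,} : two or more A's, with no upper bound
streaks = re.findall("A{2,}", grades)

print(one_or_more)   # ['A', 'AAAA', 'AA']
print(zero_or_one)   # ['BC', 'BC', 'B']
print(zero_or_more)  # ['B', 'B', 'BAA']
print(streaks)       # ['AAAA', 'AA']
```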

In [19]:
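# All headers end with the marker [edit], so to get a list of all headers we match a run of letters
# followed by [edit]. A raw string (r"...") keeps the backslashes intact for the regex engine
re.findall(r"[a-zA-Z]{1,100}\[edit\]", wiki)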

['Overview[edit]', 'records[edit]', 'records[edit]']

In [20]:
re.findall(r"[a-zA-Z]*\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [21]:
# Ok, that didn't quite work. It got all of the headers, but only the last word of the header, and it really
# was quite clunky. Let's iteratively improve this. First, we can use \w to match any word character: a
# letter, a digit, or an underscore.

re.findall(r"[\w]{1,100}\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [22]:
# We can use * as the quantifier instead of an explicit range
re.findall(r"[\w]*\[edit\]", wiki)

['Overview[edit]', 'records[edit]', 'records[edit]']

In [23]:
# Now that we have shortened the regex, let's improve it a little bit. We can add in spaces using the space
# character
re.findall(r"[\w ]*\[edit\]", wiki)

['Overview[edit]',
 'Access to public records[edit]',
 'Student medical records[edit]']

In [24]:
for title in re.findall(r"[\w ]*\[edit\]", wiki):
    print(re.split(r"[\[]", title)[0])

Overview
Access to public records
Student medical records


### Groups

In [25]:
# Ok, this works, but it's a bit of a pain. To this point we have been talking about a regex as a single
# pattern which is matched. But, you can actually match different patterns, called groups, at the same time,
# and then refer to the groups you want. To group patterns together you use parentheses, which is actually
# pretty natural. Let's rewrite our findall using groups

re.findall(r"([\w ]*)(\[edit\])", wiki)

[('Overview', '[edit]'),
 ('Access to public records', '[edit]'),
 ('Student medical records', '[edit]')]

In [26]:
# Nice - we see that the Python re module breaks out the result by group. We can also refer to groups by
# number with the Match objects that are returned. Thus far we've seen that findall() returns strings, and
# search() and match() return individual Match objects. To get a Match object for every match, we use the
# function finditer(), which returns an iterator over them

for item in re.finditer(r"([\w ]*)(\[edit\])", wiki):
    print(item.groups())

('Overview', '[edit]')
('Access to public records', '[edit]')
('Student medical records', '[edit]')
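
Referring to groups by number can be sketched as follows. Since the scraped file is not reproduced here, the `sample` string is a hypothetical stand-in that only imitates its header format. On a Match object, group(0) is the whole match and group(1) is the first parenthesized group.

```python
import re

# Hypothetical stand-in for the scraped text (not the real file contents)
sample = "Overview[edit]\nSome text.\nAccess to public records[edit]\nMore text."

titles = []
for item in re.finditer(r"([\w ]*)(\[edit\])", sample):
    whole = item.group(0)  # the entire match, e.g. the title plus the [edit] marker
    title = item.group(1)  # only the first group, i.e. the title
    titles.append(title)
    print(whole, "->", title)

print(titles)  # ['Overview', 'Access to public records']
```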
