<a href="https://colab.research.google.com/github/lorenzo-crippa/3M_NLP_ESS_2022/blob/main/Tutorial_One_(Python)_Intro_to_String_Manipulation_and_Regular_Expressions.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Intro to String Manipulation and Regular Expressions in Python
### Douglas Rice

*This tutorial was originally created by Burt Monroe for his prior work with the Essex Summer School. I've updated and modified it.*

In this notebook, we'll learn how to do standard natural language processing (NLP) tasks in Python, and we'll be introduced to regular expressions. After completing this notebook, you should be able to:


1.   Manipulate strings in Python.
2.   Extract elements from lists.
3.   Work with regular expressions in Python.



# Working with Strings

Our focus is on "strings", or ordered sequences of characters. A string can be short ("Hello!"), long ("supercalifragilisticexpialidocious!"), much longer (like a book), or anything in between. In a word, strings are text. Python is a language of choice for many people doing data science because of its facility in working with text and complex data, and because of the many add-on modules that extend it. Much of the work we'll do here can be done in the base Python distribution, but you'll need to import the "re" module when we get to regular expressions.

Let's look at a string. These can be specified using double quotes (") or single quotes ('):

In [None]:
a_string = 'Example STRING, with numbers (12, 15 and also 10.2)?!'
a_string

'Example STRING, with numbers (12, 15 and also 10.2)?!'

Whether you use single or double quotes is really up to you, but you might choose one if your string actually contains the other:

In [None]:
my_double_quoted_string = "He asked, 'Why would you use double quotes?'"
my_double_quoted_string

"He asked, 'Why would you use double quotes?'"

What happens if you use one of the quotes that *is* inside the string? 

In [None]:
my_whoops_string = "He asked, "Why would you use double quotes""

SyntaxError: ignored

Ah, a dreaded syntax error! This is because quotes are *special characters*; they signify something in the programming language. 

To get around this conundrum, you can use a \ (backslash) to tell Python to *escape* the next character. In the example below, the \" is saying, " is part of the string, not the end of the string.

In [None]:
my_string_with_double_quotes = "She answered, \"Convenience, but you never really have to.\""
my_string_with_double_quotes

'She answered, "Convenience, but you never really have to."'

If you ever want to see how your string with escape characters displays when printed or (typically) in an editor, use print.

In [None]:
print(my_double_quoted_string)

He asked, 'Why would you use double quotes?'


In [None]:
print(my_string_with_double_quotes)

She answered, "Convenience, but you never really have to."


This can get a little bit confusing. For example, since the backslash character tells Python to escape, to indicate an actual backslash character you have to backslash your backslashes:

In [None]:
a_string_with_backslashes = "To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\."
a_string_with_backslashes

'To indicate a backslash, \\, you have to type two: \\\\. Just there, to indicate two backslashes, I had to type four: \\\\\\\\.'

In [None]:
print(a_string_with_backslashes)

To indicate a backslash, \, you have to type two: \\. Just there, to indicate two backslashes, I had to type four: \\\\.
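
One way to cut down on the doubling is a raw string literal, written with an `r` prefix, in which backslashes are kept exactly as typed. This is the same `r'...'` prefix that appears on the regular expression patterns later in this notebook. A minimal sketch (the file path and variable names are made up for illustration):

```python
# An ordinary literal needs every backslash doubled.
escaped = "C:\\temp\\notes.txt"

# A raw literal (r"...") keeps backslashes as typed, so \t and \n
# are NOT turned into a tab or a newline here.
raw = r"C:\temp\notes.txt"

print(raw)
print(escaped == raw)   # both literals spell the same string
print(len(r"\n"))       # two characters: a backslash and the letter n
```

One caveat: a raw string literal cannot end in a single backslash, so a trailing backslash still has to be written the ordinary way.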


## Other Escape Characters

As mentioned above, escaped quotes (single or double) are just one of many escape sequences that perform special functions. The most common are two that you're already used to tapping a keyboard key for without expecting a visible character to appear on your screen: \t (tab) and \n (newline, or "Enter").

In [None]:
test_string = "Hark, a Lark!\t*Tweet Tweet*\n  \nWhere'd it go?"
test_string

"Hark, a Lark!\t*Tweet Tweet*\n  \nWhere'd it go?"

In [None]:
print(test_string)

Hark, a Lark!	*Tweet Tweet*
  
Where'd it go?


If you want to define a multiline string without using escaped newline characters, use triple quotation marks:

In [None]:
test_string2 = """Hark, a Lark!\t*Tweet Tweet*

Where'd it go?"""
test_string2

"Hark, a Lark!\t*Tweet Tweet*\n\nWhere'd it go?"

## Lists of Strings 

To this point, we have just looked at a single string. Rarely is that what we are actually interested in manipulating as social scientists, though. More frequently, we have sets of lots and lots of strings. To that end, we can create lists of strings.

In [None]:
a_list_of_strings = ["Manchester City", "Liverpool", "Chelsea", "Tottenham", "Arsenal"]
a_list_of_strings

['Manchester City', 'Liverpool', 'Chelsea', 'Tottenham', 'Arsenal']



Let's load in some other strings to work with. Here are letters from the "string" module. 

In [None]:
import string
letters_string = string.ascii_lowercase
letters_string

'abcdefghijklmnopqrstuvwxyz'

We can create a list from those letters and call it `letters_list`:



In [None]:
letters_list = list(letters_string)
letters_list

['a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z']

We can also use some built-in functionality in the "string" module to create upper-case versions of both the string and the list; we'll call them `LETTERS_string` and `LETTERS_list`.

In [None]:
LETTERS_string = string.ascii_uppercase
LETTERS_string

'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

In [None]:
LETTERS_list = list(LETTERS_string)
LETTERS_list

['A',
 'B',
 'C',
 'D',
 'E',
 'F',
 'G',
 'H',
 'I',
 'J',
 'K',
 'L',
 'M',
 'N',
 'O',
 'P',
 'Q',
 'R',
 'S',
 'T',
 'U',
 'V',
 'W',
 'X',
 'Y',
 'Z']

We'll make two different month lists, one with the abbreviation (`month_abb`) and one with the full name (`month_name`).

In [None]:
month_abb = ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
month_abb

['Jan',
 'Feb',
 'Mar',
 'Apr',
 'May',
 'Jun',
 'Jul',
 'Aug',
 'Sep',
 'Oct',
 'Nov',
 'Dec']

In [None]:
month_name = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
month_name

['January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August',
 'September',
 'October',
 'November',
 'December']

In the R tutorial, we use two lists that are provided directly in R. Python does not ship with them, so Burt has downloaded and posted both lists to his GitHub. We'll load them up. The first (`fruit`) is a list of fruits, and the second (`words`) is a list of words.

In [None]:
import pandas as pd
fruit = pd.read_csv("https://raw.githubusercontent.com/burtmonroe/TextAsDataCourse/master/Tutorials/fruit.txt", header=None)[0].to_list()
print(fruit)

['apple', 'apricot', 'avocado', 'banana', 'bell pepper', 'bilberry', 'blackberry', 'blackcurrant', 'blood orange', 'blueberry', 'boysenberry', 'breadfruit', 'canary melon', 'cantaloupe', 'cherimoya', 'cherry', 'chili pepper', 'clementine', 'cloudberry', 'coconut', 'cranberry', 'cucumber', 'currant', 'damson', 'date', 'dragonfruit', 'durian', 'eggplant', 'elderberry', 'feijoa', 'fig', 'goji berry', 'gooseberry', 'grape', 'grapefruit', 'guava', 'honeydew', 'huckleberry', 'jackfruit', 'jambul', 'jujube', 'kiwi fruit', 'kumquat', 'lemon', 'lime', 'loquat', 'lychee', 'mandarine', 'mango', 'mulberry', 'nectarine', 'nut', 'olive', 'orange', 'pamelo', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'pineapple', 'plum', 'pomegranate', 'pomelo', 'purple mangosteen', 'quince', 'raisin', 'rambutan', 'raspberry', 'redcurrant', 'rock melon', 'salal berry', 'satsuma', 'star fruit', 'strawberry', 'tamarillo', 'tangerine', 'ugli fruit', 'watermelon']


In [None]:

words = pd.read_csv("https://raw.githubusercontent.com/burtmonroe/TextAsDataCourse/master/Tutorials/words.txt", header=None)[0].to_list()
len(words)

980

As you can see above, the word list is 980 words long, so let's spare ourselves printing all of them now and just look at a few.


In [None]:
words[0:5]

['a', 'able', 'about', 'absolute', 'accept']

In [None]:
words[1]

'able'

In [None]:
words[5]

'account'

The first line above takes the first five elements of the list. Folks without much experience indexing in Python might be a bit confused by the `[0:5]` chunk. Python starts indexing at 0, whereas some other programming languages, like R, start at 1. So `words[0]` is "a". When you want to take multiple elements, the slice indices refer to positions "between" elements: the slice starts at the first index and stops just before the second. You can think of the slice as capturing everything "slightly to the left of" the stop index, so `[0:5]` captures the elements at indices 0 through 4, and `words[5]` ("account") is left out.
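
To make the slicing rule concrete, here is a small sketch on a stand-in list (`sample` is a hypothetical six-word copy, not the full `words` list loaded above):

```python
# A short stand-in for the word list (hypothetical, for illustration only).
sample = ["a", "able", "about", "absolute", "accept", "account"]

print(sample[0])         # the first element sits at index 0
print(sample[0:5])       # indices 0, 1, 2, 3, 4; the stop index 5 is excluded
print(sample[:5])        # an omitted start means "from the beginning"
print(sample[5:])        # an omitted stop means "through the end"
print(len(sample[0:5]))  # a slice holds stop - start elements
```

A handy consequence of excluding the stop index is that `sample[:5]` and `sample[5:]` split the list into two pieces with nothing repeated and nothing lost.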

Finally, let's read in the sentences list, which is also long.

In [None]:

sentences = pd.read_csv("https://raw.githubusercontent.com/burtmonroe/TextAsDataCourse/master/Tutorials/sentences.txt", header=None, sep="@")[0].to_list()
len(sentences)

720

In [None]:
sentences[0:5]

['The birch canoe slid on the smooth planks.',
 'Glue the sheet to the dark blue background.',
 "It's easy to tell the depth of a well.",
 'These days a chicken leg is a rare dish.',
 'Rice is often served in round bowls.']

# Manipulating strings

We have lots of strings to play with now. You can combine, or “concatenate”, strings very naturally using the "+" sign. Note that in the below we also add in a space between the two strings. You could potentially add anything you want in this particular spot (and can keep going with more strings).

In [None]:
a_string = "A first sentence!"
second_string = "Wow, two sentences."
combined_string = a_string + " " + second_string
combined_string

'A first sentence! Wow, two sentences.'

You can also combine a list of strings with a separator using the "join" method. To join the two strings above again with a space between them, place the strings in a *list* using square brackets, put the separator in a string, and use the syntax *sep*`.join(`*list*`)`:

In [None]:
" ".join([a_string,second_string]) 

'A first sentence! Wow, two sentences.'

Note that "join" takes a list of strings of *any* length and concatenates *all* the strings together with the separator.

In [None]:
" then ".join(month_name)

'January then February then March then April then May then June then July then August then September then October then November then December'

Let's work with those months to match abbreviations to month names. There are multiple ways to do this. We could do a `for` loop that iterates through the elements:

In [None]:
month_explanations = []
for i in range(12):
    new_string = " stands for ".join([month_abb[i], month_name[i]])
    month_explanations.append(new_string)
month_explanations

['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']

Python programming prides itself on simplicity and clarity though, so the loop might be frowned upon. What are some alternatives? The "zip" function iterates over and "zips" together two lists. Let's look inside the zip function first by making a list of the zipped elements:

In [None]:
list(zip(month_abb, month_name))

[('Jan', 'January'),
 ('Feb', 'February'),
 ('Mar', 'March'),
 ('Apr', 'April'),
 ('May', 'May'),
 ('Jun', 'June'),
 ('Jul', 'July'),
 ('Aug', 'August'),
 ('Sep', 'September'),
 ('Oct', 'October'),
 ('Nov', 'November'),
 ('Dec', 'December')]

Not quite what we want, but we can expand on this as we iterate over the zip object using a "list comprehension":

In [None]:
[" stands for ".join([abbrev,name]) for abbrev,name in zip(month_abb, month_name)]

['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']

The "list comprehension" is defined by those square brackets on the outside (making it a list) and the "for loop"-like instruction inside. 

There are many ways to do the same thing. For example, we can change the string manipulation operation that gets repeated from the join method to the format method:

In [None]:
["{} stands for {}".format(abbrev,name) for abbrev,name in zip(month_abb, month_name)]

['Jan stands for January',
 'Feb stands for February',
 'Mar stands for March',
 'Apr stands for April',
 'May stands for May',
 'Jun stands for June',
 'Jul stands for July',
 'Aug stands for August',
 'Sep stands for September',
 'Oct stands for October',
 'Nov stands for November',
 'Dec stands for December']
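
A third spelling of the same operation, available in Python 3.6 and later, is the formatted string literal ("f-string"), which places the variable names directly inside the braces. A minimal sketch, using shortened copies of the month lists so that it runs on its own:

```python
# Shortened copies of the month lists, just for this sketch.
month_abb = ["Jan", "Feb", "Mar"]
month_name = ["January", "February", "March"]

# The f prefix lets {abbrev} and {name} be filled in from the loop variables.
explanations = [f"{abbrev} stands for {name}"
                for abbrev, name in zip(month_abb, month_name)]
print(explanations)
```

Which of the three forms to use is mostly a matter of taste; f-strings tend to read most naturally when the pieces are already held in named variables.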

The join/zip idiom works for the letters example in the other notebook as well:

In [None]:
letterpairs = ["".join([lower,upper]) for lower, upper in zip(letters_list, LETTERS_list)]
print(letterpairs)

['aA', 'bB', 'cC', 'dD', 'eE', 'fF', 'gG', 'hH', 'iI', 'jJ', 'kK', 'lL', 'mM', 'nN', 'oO', 'pP', 'qQ', 'rR', 'sS', 'tT', 'uU', 'vV', 'wW', 'xX', 'yY', 'zZ']


You can zip two lists together, concatenate those element by element, and then join them by a separator.

In [None]:
" then ".join(["{} ({})".format(name,abbrev) for name,abbrev in zip(month_name,month_abb)])

'January (Jan) then February (Feb) then March (Mar) then April (Apr) then May (May) then June (Jun) then July (Jul) then August (Aug) then September (Sep) then October (Oct) then November (Nov) then December (Dec)'

You can split a string into pieces at every occurrence of a separator string with the "split" method. The separator itself is dropped from the pieces.

In [None]:
combined_string.split("! ")

['A first sentence', 'Wow, two sentences.']
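
Two variations on "split" come up often enough to be worth a quick sketch: calling it with no separator, and capping the number of splits. The variable names `tokens`, `first`, and `rest` are just illustrative.

```python
combined_string = "A first sentence! Wow, two sentences."

# With no argument, split() breaks on any run of whitespace.
tokens = combined_string.split()
print(tokens)

# A second argument caps the number of splits; here, split only once.
first, rest = combined_string.split(" ", 1)
print(first)
print(rest)
```

Note that whitespace splitting leaves punctuation attached to the neighboring word ("sentence!", "Wow,"), which is one reason more flexible pattern matching is useful for text work.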

## Substrings (Slices)

Substrings are just slices in Python. We are just taking a little slice from a longer string and exporting it. For instance, to get a list of the second through fourth character in each fruit name:

In [None]:
substringfromfruit = [eachfruit[1:4] for eachfruit in fruit]
print(substringfromfruit)

['ppl', 'pri', 'voc', 'ana', 'ell', 'ilb', 'lac', 'lac', 'loo', 'lue', 'oys', 'rea', 'ana', 'ant', 'her', 'her', 'hil', 'lem', 'lou', 'oco', 'ran', 'ucu', 'urr', 'ams', 'ate', 'rag', 'uri', 'ggp', 'lde', 'eij', 'ig', 'oji', 'oos', 'rap', 'rap', 'uav', 'one', 'uck', 'ack', 'amb', 'uju', 'iwi', 'umq', 'emo', 'ime', 'oqu', 'ych', 'and', 'ang', 'ulb', 'ect', 'ut', 'liv', 'ran', 'ame', 'apa', 'ass', 'eac', 'ear', 'ers', 'hys', 'ine', 'lum', 'ome', 'ome', 'urp', 'uin', 'ais', 'amb', 'asp', 'edc', 'ock', 'ala', 'ats', 'tar', 'tra', 'ama', 'ang', 'gli', 'ate']


You can also count backwards from the end of the string in order to extract a slice from that end. To do so, use negative numbers and count backwards. 

In [None]:
subfromend = [eachfruit[-3:-1] for eachfruit in fruit]
print(subfromend)

['pl', 'co', 'ad', 'an', 'pe', 'rr', 'rr', 'an', 'ng', 'rr', 'rr', 'ui', 'lo', 'up', 'oy', 'rr', 'pe', 'in', 'rr', 'nu', 'rr', 'be', 'an', 'so', 'at', 'ui', 'ia', 'an', 'rr', 'jo', 'fi', 'rr', 'rr', 'ap', 'ui', 'av', 'de', 'rr', 'ui', 'bu', 'ub', 'ui', 'ua', 'mo', 'im', 'ua', 'he', 'in', 'ng', 'rr', 'in', 'nu', 'iv', 'ng', 'el', 'ay', 'ui', 'ac', 'ea', 'mo', 'li', 'pl', 'lu', 'at', 'el', 'ee', 'nc', 'si', 'ta', 'rr', 'an', 'lo', 'rr', 'um', 'ui', 'rr', 'll', 'in', 'ui', 'lo']


You might notice that this misses the last character. Remember how indexing works in Python, where a slice stops "slightly to the left of" the stop index. Therefore, if you want to catch the last character, you need to leave the stop index blank:

In [None]:
subfromend = [eachfruit[-3:] for eachfruit in fruit]
print(subfromend)

['ple', 'cot', 'ado', 'ana', 'per', 'rry', 'rry', 'ant', 'nge', 'rry', 'rry', 'uit', 'lon', 'upe', 'oya', 'rry', 'per', 'ine', 'rry', 'nut', 'rry', 'ber', 'ant', 'son', 'ate', 'uit', 'ian', 'ant', 'rry', 'joa', 'fig', 'rry', 'rry', 'ape', 'uit', 'ava', 'dew', 'rry', 'uit', 'bul', 'ube', 'uit', 'uat', 'mon', 'ime', 'uat', 'hee', 'ine', 'ngo', 'rry', 'ine', 'nut', 'ive', 'nge', 'elo', 'aya', 'uit', 'ach', 'ear', 'mon', 'lis', 'ple', 'lum', 'ate', 'elo', 'een', 'nce', 'sin', 'tan', 'rry', 'ant', 'lon', 'rry', 'uma', 'uit', 'rry', 'llo', 'ine', 'uit', 'lon']


You can use slicing to extract data from strings:

In [None]:
some_dates = ["1999/01/01","1998/12/15","2001/09/03"]
years = [date[0:4] for date in some_dates]
print(years)

['1999', '1998', '2001']


In [None]:
months = [date[5:7] for date in some_dates]
print(months)

['01', '12', '09']
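
Fixed-position slicing works here because every date has exactly the same layout. When the fields are delimited but might vary in width, splitting on the delimiter is a common alternative. A small sketch under that assumption (`parts` is an illustrative name):

```python
some_dates = ["1999/01/01", "1998/12/15", "2001/09/03"]

# Split each date on "/" into [year, month, day] strings.
parts = [date.split("/") for date in some_dates]

# Unpack the three fields and convert the ones we want to integers.
years = [int(year) for year, month, day in parts]
months = [int(month) for year, month, day in parts]

print(years)
print(months)
```

Converting with `int` also drops the leading zeros, which matters if the values are going to be compared or sorted as numbers rather than as text.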


Getting a copy of a string with specific positions replaced is also a matter of slicing:

In [None]:
apple = "apple"
zebra = "--!ZEBRA!--"
zebraapple = apple[0:1] + zebra + apple[3:]
zebraapple

'a--!ZEBRA!--le'

## Capitalization

Strings have simple case-conversion methods that can be applied:

In [None]:
combined_string.lower()

'a first sentence! wow, two sentences.'

In [None]:
combined_string.upper()

'A FIRST SENTENCE! WOW, TWO SENTENCES.'

## White Space

Also, Python has several methods to trim excess white space *off the ends* of strings:

In [None]:
lotsofspace = '   Why   so much  space?   '
lotsofspace.strip()

'Why   so much  space?'

In [None]:
lotsofspace.lstrip()

'Why   so much  space?   '

In [None]:
lotsofspace.rstrip()

'   Why   so much  space?'
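
None of these methods touch the extra spaces *inside* the string. One common idiom for that, sketched here, is to split on whitespace and join the pieces back together with single spaces:

```python
lotsofspace = '   Why   so much  space?   '

# split() with no argument breaks on runs of whitespace and drops the empty
# ends; join() then puts exactly one space between the pieces.
collapsed = " ".join(lotsofspace.split())
print(repr(collapsed))
```

This treats tabs and newlines as whitespace too, so it also flattens a multiline string onto one line.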

## Matching substrings

If we're looking for specific substrings, there are string methods to do that.

In [None]:
"strawberry".find("berry")

5

That returns the position of the first match. If there is no match, find returns a value of -1.

In [None]:
"apple".find("berry")

-1

If there are multiple matches, find returns the position of the first match.

In [None]:
"berryberryboberrybananafanafoferrymemymomerry berry".find("berry")

0

We can use this in a list comprehension, with the addition of an "if" condition, to extract a list of all matching fruits.

In [None]:
[fr for fr in fruit if fr.find("berry")> -1]

['bilberry',
 'blackberry',
 'blueberry',
 'boysenberry',
 'cloudberry',
 'cranberry',
 'elderberry',
 'goji berry',
 'gooseberry',
 'huckleberry',
 'mulberry',
 'raspberry',
 'salal berry',
 'strawberry']
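
When only a yes/no answer is needed rather than a position, the `in` operator and the "endswith" method say the same thing more directly than comparing `find` to -1. A short sketch on a hypothetical five-item sample of the fruit list:

```python
# A hypothetical handful of fruit names, so the sketch runs on its own.
fruit_sample = ["apple", "blueberry", "goji berry", "cherry", "strawberry"]

# "berry" in fr is True whenever fr.find("berry") would be > -1.
berries = [fr for fr in fruit_sample if "berry" in fr]
print(berries)

# endswith() restricts the test to the end of the string.
berry_endings = [fr for fr in fruit_sample if fr.endswith("berry")]
print(berry_endings)
```

Note that "cherry" is excluded in both cases: it ends in "erry", not "berry".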

We can get a copy of the string with the substring replaced with something else:

In [None]:
"strawberry".replace("berry","fish")

'strawfish'

In [None]:
fishfruit = [fr.replace("berry","fish") for fr in fruit]
print(fishfruit)

['apple', 'apricot', 'avocado', 'banana', 'bell pepper', 'bilfish', 'blackfish', 'blackcurrant', 'blood orange', 'bluefish', 'boysenfish', 'breadfruit', 'canary melon', 'cantaloupe', 'cherimoya', 'cherry', 'chili pepper', 'clementine', 'cloudfish', 'coconut', 'cranfish', 'cucumber', 'currant', 'damson', 'date', 'dragonfruit', 'durian', 'eggplant', 'elderfish', 'feijoa', 'fig', 'goji fish', 'goosefish', 'grape', 'grapefruit', 'guava', 'honeydew', 'hucklefish', 'jackfruit', 'jambul', 'jujube', 'kiwi fruit', 'kumquat', 'lemon', 'lime', 'loquat', 'lychee', 'mandarine', 'mango', 'mulfish', 'nectarine', 'nut', 'olive', 'orange', 'pamelo', 'papaya', 'passionfruit', 'peach', 'pear', 'persimmon', 'physalis', 'pineapple', 'plum', 'pomegranate', 'pomelo', 'purple mangosteen', 'quince', 'raisin', 'rambutan', 'raspfish', 'redcurrant', 'rock melon', 'salal fish', 'satsuma', 'star fruit', 'strawfish', 'tamarillo', 'tangerine', 'ugli fruit', 'watermelon']


# Regular Expressions

Our searches above rely on relatively inflexible patterns ("berry") that could miss lots of variations (e.g., "-berries"). We can instead leverage regular expressions as flexible patterns for finding matches. For this we need to import the **re** module.

Just for comparison's sake, let's start with a search for the same pattern as above: "berry".

In [None]:
import re
mo = re.search(r'berry', 'strawberry')
mo

<re.Match object; span=(5, 10), match='berry'>

We created a `Match` object above called `mo`. That object has methods that we can call to pull out information. For instance, the start and end positions of the match are returned by the "span" method:

In [None]:
mo.span()

(5, 10)

The matched text itself is returned by the "group" method, which I'll explain below.

In [None]:
mo.group()

'berry'

If there is no match, the match object is null-valued ("None"). You can, more or less, use match objects in conditional statements, with null equalling "False" and any match resulting in "True".

In [None]:
mo_miss = re.search(r'berry','apple')
mo_miss

In [None]:
print(mo_miss)

None


In [None]:
if mo:
    print("Strawberry is a berry!")
else:
    print("Strawberry is not a berry.")

Strawberry is a berry!


In [None]:
if mo_miss:
    print("Apple is a berry")
else:
    print("Apple is not a berry.")

Apple is not a berry.


This, again, can be put in a list comprehension to get a list of all berries:

In [None]:
berries = [itsaberry for itsaberry in fruit if re.search(r'berry',itsaberry)]
print(berries)

['bilberry', 'blackberry', 'blueberry', 'boysenberry', 'cloudberry', 'cranberry', 'elderberry', 'goji berry', 'gooseberry', 'huckleberry', 'mulberry', 'raspberry', 'salal berry', 'strawberry']


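As a sidebar, calling `re.search` with a pattern string makes Python look up (and, the first time, compile) the regular expression every time through the loop. The re module caches recently compiled patterns, so the cost is small, but it is still cleaner and somewhat more efficient to compile the pattern once before the loop using a slightly different syntax: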

In [None]:
reo = re.compile(r'berry') # compile the pattern into a regular expression object
berries = [itsaberry for itsaberry in fruit if reo.search(itsaberry)]
print(berries)

['bilberry', 'blackberry', 'blueberry', 'boysenberry', 'cloudberry', 'cranberry', 'elderberry', 'goji berry', 'gooseberry', 'huckleberry', 'mulberry', 'raspberry', 'salal berry', 'strawberry']


The "search" method will return a single object describing only the first match in the string.

In [None]:
mo_many = re.search(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
mo_many

<re.Match object; span=(0, 5), match='berry'>

The findall method returns a list of all matching strings.

In [None]:
mo_many2 = re.findall(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
mo_many2

['berry', 'berry', 'berry', 'berry']

The "finditer" method returns an "iterator" (thing, like a list, over which you can, um, iterate) containing match objects for every match.

In [None]:
mo_iter = re.finditer(r'berry',"berryberryboberrybananafanafoferrymemymomerry berry")
for moi in mo_iter:
    print(moi)

<re.Match object; span=(0, 5), match='berry'>
<re.Match object; span=(5, 10), match='berry'>
<re.Match object; span=(12, 17), match='berry'>
<re.Match object; span=(46, 51), match='berry'>


Now let's use regex to look for more complex patterns than just substrings.

#### Square brackets for “or” (disjunction) of characters.

Match “any one of” the characters in the square brackets.

In [None]:
reodemo = re.compile(r' [bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'Feel the heat of the weak dying flame.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'It takes heat to bring out the odor.']

#### Square brackets with ^ for negation.

Match “anything but one of” the characters in the square brackets.

(Be careful: the caret, ^, means something else in other contexts.)

In [None]:
reodemo = re.compile(r' [^bhp]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['Pack the records in a neat thin case.', 'A clean neck means a neat collar.']

#### Square brackets for “or” over a range of characters

In [None]:
reodemo = re.compile(r' [b-p]eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'Feel the heat of the weak dying flame.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'Pack the records in a neat thin case.',
 'It takes heat to bring out the odor.',
 'A clean neck means a neat collar.']

#### Pipe operator for "or" over multi-character patterns

When we need an “or” over multi-character patterns, we can use the “pipe” operator, using parentheses as necessary to identify what’s with what.

In [None]:
reodemo = re.compile(r'(black|blue|red)(currant|berry)')
matches = [itsamatch for itsamatch in fruit if reodemo.search(itsamatch)]
matches

['blackberry', 'blackcurrant', 'blueberry', 'redcurrant']

#### Special characters and the backslash

In addition to the backslash itself, there are several characters that have special meaning in Python regexes, and (may) have to be escaped in order to match the literal character. Here are the big ones: ^ $ . * + | ! ? ( ) [ ] { } < >.

For example, the period – “.” – means “any character but a newline.” It’s a wildcard. We get different results when we escape or don’t escape it.

In [None]:
allchars = re.findall(r'.',combined_string)
print(allchars)

['A', ' ', 'f', 'i', 'r', 's', 't', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']


In [None]:
allperiods = re.findall(r'\.',combined_string)
print(allperiods)

['.']


In [None]:
matches = re.findall(r'e.',combined_string)
print(matches)

['en', 'en', 'e!', 'en', 'en', 'es']


In [None]:
matches = re.findall(r'e\.',combined_string)
print(matches)

[]


Some of these are only special characters in certain contexts and don’t have to be escaped to be recognized when not in those contexts. But they can be escaped in all circumstances and I recommend that rather than trying to figure out the exact rules.

The exclamation point is such a character.

In [None]:
matches = re.findall(r'\!',combined_string)
print(matches)

['!']


In [None]:
matches = re.findall(r'!',combined_string) # Not special char in this context, so still finds it
print(matches)

['!']
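
If the text you want to match literally comes from somewhere else (a variable, a file, user input), escaping it by hand is error-prone. The re module's `re.escape` function does the escaping for you. A minimal sketch, where `literal` is a made-up search string packed with special characters:

```python
import re

text = 'Example STRING, with numbers (12, 15 and also 10.2)?!'
literal = "10.2)?!"   # contains ., ) and ?, all special in a regex

# Used as-is, the literal is not even a valid pattern (unbalanced parenthesis).
try:
    re.compile(literal)
    compiled_ok = True
except re.error:
    compiled_ok = False
print(compiled_ok)

# re.escape() backslash-escapes whatever needs it, so the text matches literally.
pattern = re.escape(literal)
found = re.findall(pattern, text)
print(found)
```

The exact set of characters that `re.escape` chooses to escape has changed across Python versions, so it is safer to rely on what the escaped pattern matches than on how it looks when printed.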


#### Class shorthands

Conversely, there are a number of characters that have special meaning only when escaped. The main ones for now are “\w” (any "word" character: a letter, a digit, or the underscore), “\s” (any whitespace character), and “\d” (any numeric digit). The capitalized versions of these are used to mean “anything but” that class.

In [None]:
matches = re.findall(r'\w',combined_string) # any alphanumeric character
print(matches)

['A', 'f', 'i', 'r', 's', 't', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 'W', 'o', 'w', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's']


In [None]:
matches = re.findall(r'\W',combined_string) # any non-alphanumeric character
print(matches)

[' ', ' ', '!', ' ', ',', ' ', ' ', '.']


In [None]:
matches = re.findall(r'\s',combined_string) # any whitespace character
print(matches)

[' ', ' ', ' ', ' ', ' ']


In [None]:
matches = re.findall(r'\S',combined_string) # any non-whitespace character
print(matches)

['A', 'f', 'i', 'r', 's', 't', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', '!', 'W', 'o', 'w', ',', 't', 'w', 'o', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']


In [None]:
matches = re.findall(r'\d',combined_string) # any digit character
print(matches)

[]


In [None]:
matches = re.findall(r'\D',combined_string) # any non-digit character
print(matches)

['A', ' ', 'f', 'i', 'r', 's', 't', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', '!', ' ', 'W', 'o', 'w', ',', ' ', 't', 'w', 'o', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', 's', '.']


The Python re module does not directly support "POSIX" classes.

#### Quantifiers: * (zero or more of the previous)

This is also known as the “Kleene star”, after Stephen Kleene, who introduced the notation in formal logic. (Kleene himself pronounced his name "KLAY-nee", though "KLEE-nee" is widely heard.)

In [None]:
matches = re.findall(r'\d*',combined_string) # any string of zero or more digits
print(matches)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


Note that the "zero or more" led it to report a match at every position of the string. Because the string contains no digits, every one of those matches is empty (containing no characters).

#### Quantifiers: + (one or more of the previous)

This is also known as the “Kleene plus.”

In [None]:
matches = re.findall(r'\d+',combined_string) # any string of one or more digits
print(matches)

[]
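
Because `combined_string` contains no digits, the result above is empty. To see the "+" quantifier actually find something, here is a short sketch that runs the same pattern on the digit-bearing example string from the start of the notebook (`whole_numbers` and `decimals` are illustrative names):

```python
import re

a_string = 'Example STRING, with numbers (12, 15 and also 10.2)?!'

# One or more digits in a row. The period in 10.2 is not a digit,
# so that number comes back as two separate pieces.
whole_numbers = re.findall(r'\d+', a_string)
print(whole_numbers)

# Digits, then a literal (escaped) period, then more digits.
decimals = re.findall(r'\d+\.\d+', a_string)
print(decimals)
```

The first result shows why it pays to check what a pattern returns on realistic text: "10.2" is silently split unless the pattern accounts for the decimal point.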


#### Quantifiers {n} {n,m} and {n,}

{n} = “exactly n” of the previous
{n,m} = “between n and m” of the previous
{n,} = “n or more” of the previous

In [None]:
matches = re.findall(r'x{3}','x xx xxx xxxx xxxxx') # 3 x's
print(matches)

['xxx', 'xxx', 'xxx']


In [None]:
matches = re.findall(r'x{3,4}','x xx xxx xxxx xxxxx') # 3 or 4 x's
print(matches)

['xxx', 'xxxx', 'xxxx']


In [None]:
matches = re.findall(r'x{3,}','x xx xxx xxxx xxxxx') # 3 or more x's
print(matches)

['xxx', 'xxxx', 'xxxxx']


Were any of those unexpected? (Probably ... how many strings of 3 x's are in that string?) Use your regex viewer to see what's going on.
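
If you don't have a regex viewer handy, printing the span of each match tells the same story. The sketch below assumes nothing beyond the re module; `spans` is an illustrative name.

```python
import re

xs = 'x xx xxx xxxx xxxxx'

# finditer() yields one match object per match; span() gives (start, end).
spans = [m.span() for m in re.finditer(r'x{3}', xs)]
print(spans)

# Slicing the string with each span shows which characters were consumed.
print([xs[start:end] for start, end in spans])
```

The engine scans left to right and resumes *after* the end of each match, so matches never overlap. That is why the run of four x's and the run of five x's each contribute only one "xxx": the leftover one or two x's are too short to match again.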

#### Quantifier ? (zero or one of the previous)

In [None]:
matches = re.findall(r'\d?', combined_string) # any string of zero or one digits
print(matches)

['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']


In [None]:
reodemo = re.compile(r' [bp]?eat ')
matches = [itsamatch for itsamatch in sentences if reodemo.search(itsamatch)]
matches

['The heart beat strongly and with firm strokes.',
 'Burn peat after the logs give out.',
 'A speedy man can beat this track mark.',
 'Even the worst will beat his low score.',
 'Quench your thirst, then eat the crackers.']

#### Question Mark as Nongreedy Modifier to Quantifier (smallest match of previous possible)


In [None]:
# greedy - roughly, longest match
matches = re.findall(r'\(.+\)','(First bracketed statement) Other text (Second bracketed statement)')
print(matches)

['(First bracketed statement) Other text (Second bracketed statement)']


In [None]:
# nongreedy - roughly, smallest matches
matches = re.findall(r'\(.+?\)','(First bracketed statement) Other text (Second bracketed statement)')
print(matches)

['(First bracketed statement)', '(Second bracketed statement)']


In [None]:
# greedy - matches whole string
matches = re.findall(r'x.+x','x xx xxx xxxx xxxxx')
print(matches)

['x xx xxx xxxx xxxxx']


In [None]:
# nongreedy - minimal match as placeholder moves across string
matches = re.findall(r'x.+?x','x xx xxx xxxx xxxxx')
print(matches)

['x x', 'x x', 'xx x', 'xxx', 'xxx']


#### Anchors at beginning and end of string

In [None]:
matches = re.findall(r'^\w+',combined_string) # ^ is beginning of string
print(matches)

['A']


In [None]:
matches = re.findall(r'\w+$',combined_string) # $ is end of string
print(matches)

[]


In [None]:
matches = re.findall(r'\W+$',combined_string) # $ is end of string
print(matches)

['.']
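
By default, ^ and $ anchor to the start and end of the *whole* string, even when it contains newlines. Passing the `re.MULTILINE` flag makes them anchor at the start and end of every line instead. A small sketch on a made-up two-line string:

```python
import re

two_lines = "First line here\nSecond line there"

# Default behavior: ^ matches only at the very start of the string.
start_default = re.findall(r'^\w+', two_lines)
print(start_default)

# With re.MULTILINE, ^ also matches right after each newline...
start_each_line = re.findall(r'^\w+', two_lines, flags=re.MULTILINE)
print(start_each_line)

# ...and $ also matches right before each newline.
end_each_line = re.findall(r'\w+$', two_lines, flags=re.MULTILINE)
print(end_each_line)
```

This matters for text like a bill or a transcript that arrives as one long string with embedded newlines.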


#### Anchors at word boundaries

Similarly, we can identify "word boundaries" with \b. This solves the greedy/nongreedy problem we had with the "x" sequences above.

In [None]:
matches = re.findall(r'\bx.*?\b','x xx xxx xxxx xxxxx')
print(matches)

['x', 'xx', 'xxx', 'xxxx', 'xxxxx']


In [None]:
matches = re.findall(r'\b\w+?\b',combined_string) 
print(matches)

['A', 'first', 'sentence', 'Wow', 'two', 'sentences']


#### Capture groups

When we use parentheses, it tells the regex engine to capture the part of the match enclosed in parentheses. Each set of parentheses defines its own "capture group", and these are retrieved with the group() method of the match object. Whether there are parentheses or not, the entire match is held in group(0). Smaller parts are in group(1), group(2), etc.

In [None]:
matches = [re.search(r'^(.+?)(berry|fruit)$',fr) for fr in fruit]
for match in matches:
    if match:
        print(match.group(0), match.group(1), match.group(2))

bilberry bil berry
blackberry black berry
blueberry blue berry
boysenberry boysen berry
breadfruit bread fruit
cloudberry cloud berry
cranberry cran berry
dragonfruit dragon fruit
elderberry elder berry
goji berry goji  berry
gooseberry goose berry
grapefruit grape fruit
huckleberry huckle berry
jackfruit jack fruit
kiwi fruit kiwi  fruit
mulberry mul berry
passionfruit passion fruit
raspberry rasp berry
salal berry salal  berry
star fruit star  fruit
strawberry straw berry
ugli fruit ugli  fruit
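
Note that for two-word names such as "goji berry", group(1) keeps the trailing space, which is why those lines print with a double space. One possible way to drop it is sketched below, on a small made-up subset of fruit names rather than the full list. It also uses `groups()`, which returns all capture groups as a tuple.

```python
import re

# a small hypothetical subset of fruit names, for illustration only
some_fruit = ['blueberry', 'goji berry', 'banana', 'star fruit']

pattern = r'^(.+?)\s*(berry|fruit)$'  # \s* absorbs any space before the suffix
parts = []
for fr in some_fruit:
    match = re.search(pattern, fr)
    if match:  # 'banana' has neither suffix, so search() returns None
        parts.append(match.groups())
print(parts)
```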


## An example

Consider the following text from a Congressional Bill. We're going to try to use regular expressions to make data out of the appropriations dollars and purposes in bullets 1-9.

In [None]:
text = """SEC. 101. FISCAL YEAR 2017.
(a) In General.--There are authorized to be appropriated to NASA
for fiscal year 2017 $19,508,000,000, as follows:
(1) For Exploration, $4,330,000,000.
(2) For Space Operations, $5,023,000,000.
(3) For Science, $5,500,000,000.
(4) For Aeronautics, $640,000,000.
(5) For Space Technology, $686,000,000.
(6) For Education, $115,000,000.
(7) For Safety, Security, and Mission Services,
$2,788,600,000.
(8) For Construction and Environmental Compliance and
Restoration, $388,000,000.
(9) For Inspector General, $37,400,000.
(b) Exception.--In addition to the amounts authorized to be
appropriated for each account under subsection (a), there are
authorized to be appropriated additional funds for each such account,
but only if the authorized amounts for all such accounts are fully
provided for in annual appropriation Acts, consistent with the
discretionary spending limits in section 251(c) of the Balanced Budget
and Emergency Deficit Control Act of 1985."""

Note that's *one* string with a bunch of newline characters.

In [None]:
text

'SEC. 101. FISCAL YEAR 2017.\n(a) In General.--There are authorized to be appropriated to NASA\nfor fiscal year 2017 $19,508,000,000, as follows:\n(1) For Exploration, $4,330,000,000.\n(2) For Space Operations, $5,023,000,000.\n(3) For Science, $5,500,000,000.\n(4) For Aeronautics, $640,000,000.\n(5) For Space Technology, $686,000,000.\n(6) For Education, $115,000,000.\n(7) For Safety, Security, and Mission Services,\n$2,788,600,000.\n(8) For Construction and Environmental Compliance and\nRestoration, $388,000,000.\n(9) For Inspector General, $37,400,000.\n(b) Exception.--In addition to the amounts authorized to be\nappropriated for each account under subsection (a), there are\nauthorized to be appropriated additional funds for each such account,\nbut only if the authorized amounts for all such accounts are fully\nprovided for in annual appropriation Acts, consistent with the\ndiscretionary spending limits in section 251(c) of the Balanced Budget\nand Emergency Deficit Control Act of 1

Let's play around with a few things. First, extract all contiguous sequences of one or more digits.

In [None]:
digitmatches = re.findall(r'[0-9]+',text) # one or more consecutive digits
print(digitmatches)

['101', '2017', '2017', '19', '508', '000', '000', '1', '4', '330', '000', '000', '2', '5', '023', '000', '000', '3', '5', '500', '000', '000', '4', '640', '000', '000', '5', '686', '000', '000', '6', '115', '000', '000', '7', '2', '788', '600', '000', '8', '388', '000', '000', '9', '37', '400', '000', '251', '1985']


That does two things we don't like: it splits numbers at the thousands-separator commas, and it picks up numbers ("101", "2017", etc.) that aren't dollar amounts. So, let's try getting everything that:
* Starts with a "$" (which needs to be escaped)
* Is followed by one or more commas or digits.

In [None]:
dollarmatches = re.findall(r'\$[,0-9]+',text) # $ followed by one or more digits or commas
print(dollarmatches)

['$19,508,000,000,', '$4,330,000,000', '$5,023,000,000', '$5,500,000,000', '$640,000,000', '$686,000,000', '$115,000,000', '$2,788,600,000', '$388,000,000', '$37,400,000']


Almost ... we don't want that extra comma on the first number. Let's require the match to end with a digit.

In [None]:
dollarmatches2 = re.findall(r'\$[,0-9]+[0-9]',text) # $ followed by one or more digits or commas AND ENDS IN A NUMBER
print(dollarmatches2)

['$19,508,000,000', '$4,330,000,000', '$5,023,000,000', '$5,500,000,000', '$640,000,000', '$686,000,000', '$115,000,000', '$2,788,600,000', '$388,000,000', '$37,400,000']
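
A stricter alternative, offered only as a sketch, is to spell out the thousands grouping: one to three digits, followed by any number of ",ddd" groups. The `(?:...)` syntax groups without capturing, which matters here because `re.findall` returns only the captured parts when a pattern contains capture groups. The `sample` string is a shortened excerpt used for illustration.

```python
import re

sample = 'for fiscal year 2017 $19,508,000,000, as follows: (1) For Exploration, $4,330,000,000.'

# 1-3 digits, then zero or more groups of a comma plus exactly three digits
amounts = re.findall(r'\$\d{1,3}(?:,\d{3})*', sample)
print(amounts)
```

This version cannot pick up a trailing comma, since every comma it accepts must be followed by three digits.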


The things we want are demarcated by numbered items in parentheses. Let's see if we can extract those:

In [None]:
bulletmatches = re.findall(r'\([0-9]\)',text) # ( followed by a digit followed by )
print(bulletmatches)

['(1)', '(2)', '(3)', '(4)', '(5)', '(6)', '(7)', '(8)', '(9)']


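Let's go back to the original and get rid of the newlines. Note that the str.replace() method only replaces literal text and doesn't accept regular expressions, whereas re.sub() handles patterns as well. For a plain '\n' either would work; we use re.sub() here.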

In [None]:
one_line = re.sub('\n',' ',text)
one_line

Then find all the matches from "(number)" to a period, lazily rather than greedily:

In [None]:
item_strings = re.findall(r'\(\d\).+?\.', one_line) # raw string, so \( and \d reach the regex engine without escape warnings
print(item_strings)
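
To make the behaviour of this step easier to see, here is a minimal sketch on a shortened, partly made-up excerpt (`short_text` is not the real bill text). It flattens the newlines and then extracts the numbered items lazily.

```python
import re

# shortened, partly invented excerpt, for illustration only
short_text = """(6) For Education, $115,000,000.
(7) For Safety, Security, and Mission Services,
$2,788,600,000.
(b) Exception.--See section 251(c)."""

flat = short_text.replace('\n', ' ')  # a literal newline needs no regex
items = re.findall(r'\(\d\).+?\.', flat)  # "(digit)" up to the next period
print(items)
```

Because the pattern requires a digit inside the parentheses, "(b)" and "(c)" do not start a match, and the lazy `.+?` stops at the first period after each bullet.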

We can use a capture group to gather just the "for what" data ...

In [None]:
for_matches = [re.search(r'For (.+), \$', item_string) for item_string in item_strings]
for_strings = [for_match.group(1) for for_match in for_matches if for_match]
for_strings

['Exploration',
 'Space Operations',
 'Science',
 'Aeronautics',
 'Space Technology',
 'Education',
 'Safety, Security, and Mission Services',
 'Construction and Environmental Compliance and Restoration',
 'Inspector General']

We can also use a capture group just for the money data:

In [None]:
money_matches = [re.search(r'\$([,\d]+)', item_string) for item_string in item_strings]
money_strings = [money_match.group(1) for money_match in money_matches if money_match]
money_strings

['4,330,000,000',
 '5,023,000,000',
 '5,500,000,000',
 '640,000,000',
 '686,000,000',
 '115,000,000',
 '2,788,600,000',
 '388,000,000',
 '37,400,000']

We'll probably want those just to be numbers. The capture group already left out the $ sign, so only the commas remain, but the character class below strips either one:

In [None]:
money_strings_clean = [re.sub(r'[\$,]','',moneystring) for moneystring in money_strings]
money_strings_clean

['4330000000',
 '5023000000',
 '5500000000',
 '640000000',
 '686000000',
 '115000000',
 '2788600000',
 '388000000',
 '37400000']

We'll turn that into numbers and reformat them as millions.

In [None]:
millions = [int(x) / 1e6 for x in money_strings_clean] # 1e6 is one million (10e6 would be ten million)

Finally, we can format the data in a pandas dataframe for later processing.

In [None]:
# we already imported pandas above
appropriations_df = pd.DataFrame({'item': for_strings,'mdollars': millions})
print(appropriations_df)

                                                item  mdollars
0                                        Exploration    4330.0
1                                   Space Operations    5023.0
2                                            Science    5500.0
3                                        Aeronautics     640.0
4                                   Space Technology     686.0
5                                          Education     115.0
6             Safety, Security, and Mission Services    2788.6
7  Construction and Environmental Compliance and ...     388.0
8                                  Inspector General      37.4

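As a closing sketch, the separate steps above could also be folded into a single pattern with two capture groups, one for the purpose and one for the amount. This is just one possible approach, shown on a shortened excerpt (`short_text`, `pairs`, and `rows` are illustrative names), and it divides by 1e6 to express amounts in millions of dollars.

```python
import re

# shortened excerpt of the bill text, for illustration only
short_text = """(1) For Exploration, $4,330,000,000.
(7) For Safety, Security, and Mission Services,
$2,788,600,000.
(9) For Inspector General, $37,400,000."""

flat = short_text.replace('\n', ' ')

# with two capture groups, findall returns one (purpose, amount) tuple per match
pairs = re.findall(r'\(\d\) For (.+?), \$([\d,]*\d)', flat)

# strip the commas and convert each amount to millions of dollars
rows = [(purpose, int(amount.replace(',', '')) / 1e6) for purpose, amount in pairs]
print(rows)
```

The resulting list of tuples could be passed to `pd.DataFrame(rows, columns=['item', 'mdollars'])` to get the same kind of table as above.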