## Introduction to Regular Expressions in Python and Formatting Texts

**Adapted from a lesson written by Adam Anderson**

We have been using Python's built-in string functions and NLTK to tokenize, count, and analyze strings. Today, we will learn another technique for text analysis that uses a special type of code called "regular expressions". To implement regular expressions we will use functions from Python's `re` library.

### Learning Goals:
The goal of this lesson is to gain an introductory understanding of how regular expressions can be used with large portions of text to cleanly pull data from the text. Regular expressions can seem overwhelming at first, but with practice, they become easier to use. The goal is to add the usage of regular expressions to your text analysis 'toolbox'! We'll use regular expressions in practice to transform text files into a pandas dataframe, and in doing so we'll learn about the os library and we'll practice using for loops.


### Lesson Outline:
- Regular Expressions Overview
- findall, sub, and more
- Examples
- Using split and re to format texts
    - os library
    - for loops

### Key Terms

* regular expressions
    * a sequence of characters that defines a search pattern. Usually this pattern is then used by string searching algorithms for "find" or "find and replace" operations on strings.
* for loop
    * a control flow statement for specifying iteration, which allows code to be executed repeatedly. 
* os library
    * The os and sys modules provide numerous tools to deal with filenames, paths, and directories directly in Python.
* directory
    * A directory is an organizational unit, or container, used to organize folders and files into a hierarchical structure.

### Testing Regular Expressions

There are a number of online tools that can be used to build and test regular expressions. This is my favorite: https://regex101.com/#python

## Regular Expressions

We have worked with Python's NLTK to tokenize words and sort/structure/manipulate them using the built-in functions. Regular expressions are an alternative way to 'search' for information within strings. At their most basic level, regular expressions are sequences of characters that define a pattern with which the computer searches. Regular expressions give us immense power by allowing us to search within extremely large portions of text for very specific types of text/information.

Imagine that we are given a farm of various animals. Think of regular expressions as defining features that help us find exactly what we are looking for. On my farm, I want to search for animals that are brown, have 4 legs, and weigh more than 10 pounds. Each of those animal characteristics is analogous to a "sequence of characters" in the context of regular expressions. I may be looking for words that contain capital letters, or words that start with a certain sequence of characters, much as I am looking for brown animals, or animals with 4 legs.

In the cells below, we will introduce some basic regular expression code, and the findall function within the 're' Python library. This function will allow us to transform our confusing regex code into tangible results.



In [1]:
#importing the package for regular expressions
import re
#we'll also use pandas
import pandas

### re.findall

You can use `re.findall` to find all instances of some string/regex/pattern within a larger string.

It is used with the syntax `re.findall(pattern, string)`, where `pattern` is the pattern that you want to look for in `string`. It returns all instances of that pattern in a list.

In [2]:
# this is our example string
example = 'The dog and cat and muskrat and snake and cow and mouse and moose and mare and deer and macaw and bear all went to the store.'

In [3]:
# you can put the string that you want to look for
re.findall('and', example)

['and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and', 'and']

In [4]:
# if you call len on the list, it will tell you how many items there are
len(re.findall('and', example))

10

In [5]:
#how do you interpret the following example?
len(re.findall(' ', example))

26

### re Special Characters

We can search for strings, but the power of regular expressions comes in its special characters. Special characters indicate a type of character to match. For example:

In [6]:
# '.' matches any single character (except a newline), so this returns every character one at a time
print(re.findall('.', example))

['T', 'h', 'e', ' ', 'd', 'o', 'g', ' ', 'a', 'n', 'd', ' ', 'c', 'a', 't', ' ', 'a', 'n', 'd', ' ', 'm', 'u', 's', 'k', 'r', 'a', 't', ' ', 'a', 'n', 'd', ' ', 's', 'n', 'a', 'k', 'e', ' ', 'a', 'n', 'd', ' ', 'c', 'o', 'w', ' ', 'a', 'n', 'd', ' ', 'm', 'o', 'u', 's', 'e', ' ', 'a', 'n', 'd', ' ', 'm', 'o', 'o', 's', 'e', ' ', 'a', 'n', 'd', ' ', 'm', 'a', 'r', 'e', ' ', 'a', 'n', 'd', ' ', 'd', 'e', 'e', 'r', ' ', 'a', 'n', 'd', ' ', 'm', 'a', 'c', 'a', 'w', ' ', 'a', 'n', 'd', ' ', 'b', 'e', 'a', 'r', ' ', 'a', 'l', 'l', ' ', 'w', 'e', 'n', 't', ' ', 't', 'o', ' ', 't', 'h', 'e', ' ', 's', 't', 'o', 'r', 'e', '.']


In [7]:
# you can combine special characters like '.' with plain letters
re.findall('m.', example)

['mu', 'mo', 'mo', 'ma', 'ma']

In [8]:
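# '\w' is the special character for any "word" character: a letter, a digit, or an underscore
# '+' indicates that we want instances where there are one or more in a row
# the r before the quote makes this a raw string, so Python passes the backslash to re unchanged
print(re.findall(r'\w+', example))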

['The', 'dog', 'and', 'cat', 'and', 'muskrat', 'and', 'snake', 'and', 'cow', 'and', 'mouse', 'and', 'moose', 'and', 'mare', 'and', 'deer', 'and', 'macaw', 'and', 'bear', 'all', 'went', 'to', 'the', 'store']


In [9]:
# you can also specify how many repeats of a character you want using {}; here, between 1 and 3
print(re.findall(r'\w{1,3}', example))

['The', 'dog', 'and', 'cat', 'and', 'mus', 'kra', 't', 'and', 'sna', 'ke', 'and', 'cow', 'and', 'mou', 'se', 'and', 'moo', 'se', 'and', 'mar', 'e', 'and', 'dee', 'r', 'and', 'mac', 'aw', 'and', 'bea', 'r', 'all', 'wen', 't', 'to', 'the', 'sto', 're']


In [10]:
# '\s' is the character for whitespace
#Question: what will this regular expression match?
re.findall(r'm\w+\s', example)

['muskrat ', 'mouse ', 'moose ', 'mare ', 'macaw ']

In [15]:
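# you can use [] to indicate that the next character can come from any of the options within the brackets
# list the options without separators: a comma inside the brackets would itself be matched literally
# '\S' (capital S) is the character for anything that is not whitespace
re.findall(r'm[ua]\w+\S', example)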

['muskrat', 'mare', 'macaw']
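
One detail about square brackets is worth seeing in isolation: every character inside them is a candidate, including punctuation such as a comma. The short sketch below uses a made-up string (`tricky` is hypothetical and not part of the lesson data) to compare the two spellings.

```python
import re

# hypothetical string in which one 'm' is directly followed by a comma
tricky = 'um, ma, mu'

# only 'm' followed by 'u' or 'a'
without_comma = re.findall(r'm[ua]', tricky)

# here the comma is a third allowed character, so 'm,' matches too
with_comma = re.findall(r'm[u,a]', tricky)

print(without_comma)
print(with_comma)
```

On the animal sentence the two patterns happen to return the same words, because no 'm' there is followed by a comma.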

In [None]:
# '?' means that the preceding character is optional
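
As a small illustration of `?`, the sketch below searches a made-up string (`sample` is hypothetical) for the word "Author" with or without a trailing "s".

```python
import re

# hypothetical string containing both the singular and the plural label
sample = 'Author: Jane Austen. Authors: Karl Marx and Friedrich Engels.'

# the 's' is optional, so both 'Author:' and 'Authors:' are found
labels = re.findall(r'Authors?:', sample)
print(labels)
```

Only the single character directly before the `?` becomes optional; the rest of the pattern must still match.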

In [18]:
##EX: Find and print all the words in the example sentence that begin with the letter c or d.
##You should print the words dog, cat, cow, and deer

re.findall(r' [cd]\w+ ', example)

[' dog ', ' cat ', ' cow ', ' deer ']

### Testing regular expressions

[Online tools!](https://regex101.com/#python)

### re.sub

If you wanted to substitute something in for all of the patterns that you found with `re.findall`, you could use `re.sub`. 

It is used with the syntax `re.sub(pattern, repl, string)`, where `pattern` is the pattern that you are looking for within `string`. The string that you want to replace `pattern` with is `repl`.

In [19]:
re.sub(r'and m[ua]\w+\S ', '', example)

'The dog and cat and snake and cow and mouse and moose and deer and bear all went to the store.'

In [20]:
re.sub(' and', ',', example)

'The dog, cat, muskrat, snake, cow, mouse, moose, mare, deer, macaw, bear all went to the store.'

You can also add a `count` option to `re.sub` to indicate how many replacements you want to make.

In the above example, what if we want to keep the last 'and'? We can do this in a few steps.

In [21]:
#find the total number of 'and's in the sentence:
num = len(re.findall('and', example))
num

10

In [22]:
#EX: what do we want to put after the count option to ensure we do not replace the last 'and'
re.sub(' and', ',',example, count = num-1)

'The dog, cat, muskrat, snake, cow, mouse, moose, mare, deer, macaw and bear all went to the store.'

EX: Gold star challenge: write code using a regular expression that prints the word 'and' and the surrounding 2 words, that is, the two words that come before and the two words that come after each instance of 'and'. In essence, reproduce what NLTK's `Text.concordance()` method does.

In [23]:
#EX code here
re.findall(r'\w+ \w+ and \w+ \w+ ', example)

['The dog and cat and ',
 'and snake and cow and ',
 'and moose and mare and ',
 'and macaw and bear all ']
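
The result above contains only four windows even though the sentence has ten instances of 'and', because `re.findall` returns non-overlapping matches: once a stretch of text is used in one match it cannot be reused in the next. One possible workaround, shown here as a sketch rather than the intended solution, wraps the pattern in a lookahead `(?=...)`, which checks what follows without consuming it. The `\b` (word boundary) keeps matches from starting in the middle of a word.

```python
import re

example = 'The dog and cat and muskrat and snake and cow and mouse and moose and mare and deer and macaw and bear all went to the store.'

# the lookahead consumes no text, so windows are allowed to overlap;
# the parentheses inside it capture the window we want to keep
windows = re.findall(r'(?=\b(\w+ \w+ and \w+ \w+))', example)

for window in windows:
    print(window)
```

With this pattern there is one window per 'and'. Because the sentence alternates animal names with 'and', many of the context words are themselves 'and'.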

### Reminder: split texts

In [24]:
#if you don't put anything in the parentheses after .split, it will default to splitting on whitespace
split_by_spaces = example.split()

print(split_by_spaces)


#sometimes it will be more helpful to split by a specific string
split_by_and = example.split('and')

print(split_by_and)

['The', 'dog', 'and', 'cat', 'and', 'muskrat', 'and', 'snake', 'and', 'cow', 'and', 'mouse', 'and', 'moose', 'and', 'mare', 'and', 'deer', 'and', 'macaw', 'and', 'bear', 'all', 'went', 'to', 'the', 'store.']
['The dog ', ' cat ', ' muskrat ', ' snake ', ' cow ', ' mouse ', ' moose ', ' mare ', ' deer ', ' macaw ', ' bear all went to the store.']
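
Splitting on the plain string 'and' leaves stray spaces around each piece. The `re` library has its own `re.split(pattern, string)`, which splits on a regular expression instead of a fixed string. The sketch below splits on 'and' together with the whitespace on either side of it.

```python
import re

example = 'The dog and cat and muskrat and snake and cow and mouse and moose and mare and deer and macaw and bear all went to the store.'

# split wherever 'and' appears surrounded by whitespace,
# so the surrounding spaces are removed along with the word itself
pieces = re.split(r'\s+and\s+', example)
print(pieces)
```

Requiring whitespace on both sides also means that a word merely containing the letters 'and' would not be split apart.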


### Formatting Text Files: Split

As you're collecting your own corpus you will likely run into many problems getting the text into a format that is useful for analysis. We'll cover one potential scenario today.

If you have a bunch of text files that have some helpful patterns, you might want to format them into a Pandas dataframe. For example, I created a folder called 'texts' and put the four texts we've been working with into it. (Careful, I changed the text a bit so make sure you download the specific files from bCourses).

The goal: create a pandas dataframe with each text as a cell in a row, and columns containing the author and title.

We can do this two ways. As the author and title are in the filename, we can simply use the split function.

First, assign the filename to a variable. To do this we'll need a few extra tricks.
The os library allows us to list the contents of a folder (or what we call a directory).

In [25]:
import os
folder_path = "../data_dump/raw_texts/"
print(os.listdir(folder_path))

['Machiavelli_ThePrince.txt', 'Austen_PrideAndPrejudice.txt', 'Marx_CommunistManifesto.txt', 'Alcott_GarlandForGirls.txt']


Let's first work with the first filename. We want to create a dataframe with the author, Machiavelli, as one column, the title, The Prince, as another column, and the text as the third column. We'll do this by creating a list that we'll convert to a pandas dataframe.

In [26]:
#assign the filename to a variable
filename = os.listdir(folder_path)[0]
filename

'Machiavelli_ThePrince.txt'

In [27]:
##EX: Use the split function to create a list (call the list text_list) with two elements: 
##the first element should be the author
##the second element should be the title (careful here, print it out to make sure it's correct)

text_list = []
#use the split function to create the list
author = filename.split('_')[0]
title = filename.split('_')[1][:-4]
text_list.extend([author,title])
text_list

['Machiavelli', 'ThePrince']
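
Slicing with `[:-4]` works because '.txt' is exactly four characters long. As an alternative sketch, `os.path.splitext` from the os library separates the extension whatever its length, and a single split on '_' can then be unpacked into both variables.

```python
import os

filename = 'Machiavelli_ThePrince.txt'

# splitext returns the name without its extension, and the extension itself
stem, extension = os.path.splitext(filename)

# split once on the underscore and unpack the two parts
author, title = stem.split('_', 1)

print(author, title, extension)
```

The second argument to `split` limits it to one split, so a title that itself contained an underscore would stay in one piece.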

In [28]:
#Our final column is the text
text_list.append(open(folder_path+filename, encoding='utf-8').read())
text_list

['Machiavelli',
 'ThePrince',

Now let's loop through the filenames to create a list of lists that we can turn into a dataframe.

In [29]:
#first initialize an empty list; it will hold one [author, title, text] list per file
full_text = []

#loop through all of the filenames, and do what we did above
for filename in os.listdir(folder_path):
    print(filename)
    text_list = []
    author = filename.split('_')[0]
    title = filename.split('_')[1][:-4]
    text = open(folder_path+filename, encoding='utf-8').read()
    text_list.extend([author, title, text])
    full_text.append(text_list)
full_text

Machiavelli_ThePrince.txt
Austen_PrideAndPrejudice.txt
Marx_CommunistManifesto.txt
Alcott_GarlandForGirls.txt


[['Machiavelli',
  'ThePrince',
 ['Austen',
  'PrideAndPrejudice',
  'Title: Pride and Prejudice\nAuthor: Jane Austen\n\nPRIDE AND PREJUDICE:\n\nA NOVEL.\n\nIN THREE VOLUMES.\n\nBY THE AUTHOR OF "SENSE AND SENSIBILITY."\n\nVOL. I.\n\n\nPRIDE & PREJUDICE.\n\nCHAPTER I.\n\nIt is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\n\nHowever little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.\n\n"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"\n\nMr. Bennet replied that he had not.\n\n"But it is," returned she; "for Mrs. Long has just been here, and she told me all about it."\n\nMr. Bennet made no answer.\n\n"Do not you want to know who has taken it?" cried his wife impatient

In [30]:
import pandas
#create column names
column_names = ['author', 'title', 'text']

#create dataframe
df = pandas.DataFrame(full_text, columns = column_names)
df

| | author | title | text |
|---|---|---|---|
| 0 | Machiavelli | ThePrince | Title: THE PRINCE\nAuthor: Nicolo Machiavelli\... |
| 1 | Austen | PrideAndPrejudice | Title: Pride and Prejudice\nAuthor: Jane Auste... |
| 2 | Marx | CommunistManifesto | Title: The Communist Manifesto\nAuthors: Karl ... |
| 3 | Alcott | GarlandForGirls | Title: A GARLAND FOR GIRLS\nAuthor: Louisa May... |


### Formatting Text Files: Regular Expressions

What if the title of the file is not helpful? Sometimes the text of the file has a consistent formatting that can help us pull out metadata. This is the case for downloads from LexisNexis, for example. I have slightly edited these files to have a consistent formatting at the beginning of the text that we can use.

In [31]:
print(df.loc[0,'text'][:50])
print(df.loc[1,'text'][:50])
print(df.loc[2,'text'][:50])

Title: THE PRINCE
Author: Nicolo Machiavelli


Tra
Title: Pride and Prejudice
Author: Jane Austen

PR
Title: The Communist Manifesto
Authors: Karl Marx 


We can use regular expressions to pull out the metadata. To do so I'll introduce the concept of groups, denoted by parentheses.

In [32]:
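import re
#note: filename still holds the last value from the loop above (the Alcott file),
#so despite its name, the variable prince contains the text of A Garland for Girls
prince = open(folder_path+filename, encoding='utf-8').read()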
title = re.findall("Title: .*", prince)
title

['Title: A GARLAND FOR GIRLS']

In [33]:
title = re.findall("Title: (.*)", prince) #use parentheses to indicate what part of the text you want to keep
title

['A GARLAND FOR GIRLS']

In [34]:
##EX: Do the same for author. Careful!
author = re.findall("Authors?: (.*)", prince) #use parentheses to indicate what part of the text you want to keep
author

['Louisa May Alcott']

We can replicate what we did with Authors, but also find the text. We'll use a second set of parentheses here.

In [35]:
my_string = re.findall(r"Authors?: (.*)\n([\s\S]*)", prince)
my_string

[('Louisa May Alcott',

Notice here the list/tuple structure. We can access the elements in the tuple and in the list:

In [36]:
print(my_string[0][0])

Louisa May Alcott
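
To make the list/tuple structure easier to see, here is a sketch on a very short made-up text (`sample` is hypothetical, written in the same header format as the lesson files). With two groups, `re.findall` returns a list with one tuple per match, and each tuple holds one string per group. The sketch also shows `re.search`, which returns a single match object (or `None`) whose groups are read with `.group()`.

```python
import re

# hypothetical text with the same header layout as the lesson files
sample = 'Title: A Sample Title\nAuthor: Some Author\n\nFirst line of the text.\nSecond line.'

# two groups: one tuple per match, one string per group
pairs = re.findall(r'Authors?: (.*)\n([\s\S]*)', sample)
author, body = pairs[0]   # unpack the first (and only) tuple

# re.search stops at the first match; group(1) is the first set of parentheses
match = re.search(r'Title: (.*)', sample)
title = match.group(1) if match else None

print(title)
print(author)
print(repr(body))
```

Because `.` does not match a newline, `(.*)` stops at the end of the header line, while `[\s\S]*` matches everything that remains, newlines included.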


In [37]:
##EX: Write a for loop to structure the texts into a dataframe, with columns for author, title, and text

full_text = []
for filename in os.listdir(folder_path):
    text_list = []
    my_text = open(folder_path+filename, encoding='utf-8').read()
    title = re.findall(r"Title: (.*)\n", my_text)[0]
    my_string = re.findall(r"Authors?: (.*)\n([\s\S]*)", my_text)
    author = my_string[0][0]
    text = my_string[0][1]
    text_list.extend([title,author,text])
    full_text.append(text_list)
full_text

[['THE PRINCE',
  'Nicolo Machiavelli',
 ['Pride and Prejudice',
  'Jane Austen',
  '\nPRIDE AND PREJUDICE:\n\nA NOVEL.\n\nIN THREE VOLUMES.\n\nBY THE AUTHOR OF "SENSE AND SENSIBILITY."\n\nVOL. I.\n\n\nPRIDE & PREJUDICE.\n\nCHAPTER I.\n\nIt is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.\n\nHowever little known the feelings or views of such a man may be on his first entering a neighbourhood, this truth is so well fixed in the minds of the surrounding families, that he is considered as the rightful property of some one or other of their daughters.\n\n"My dear Mr. Bennet," said his lady to him one day, "have you heard that Netherfield Park is let at last?"\n\nMr. Bennet replied that he had not.\n\n"But it is," returned she; "for Mrs. Long has just been here, and she told me all about it."\n\nMr. Bennet made no answer.\n\n"Do not you want to know who has taken it?" cried his wife impatiently.\n\n"You want to tell me, and I

In [38]:
column_names = ['title', 'author', 'text']
df2 = pandas.DataFrame(full_text, columns = column_names)
df2

| | title | author | text |
|---|---|---|---|
| 0 | THE PRINCE | Nicolo Machiavelli | \n\nTranslated by W. K. Marriott\n\n\n\nINTROD... |
| 1 | Pride and Prejudice | Jane Austen | \nPRIDE AND PREJUDICE:\n\nA NOVEL.\n\nIN THREE... |
| 2 | The Communist Manifesto | Karl Marx and Friedrich Engels | \nTranscribed by Allen Lutins with assistance ... |
| 3 | A GARLAND FOR GIRLS | Louisa May Alcott | \n\n\nTO R.A. LAWRENCE\n\nTHIS LITTLE BOOK IS ... |
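
The loop above assumes every file contains both header lines; indexing with `[0]` raises an error if a pattern is not found. The sketch below is one possible, more defensive variant, not the lesson's own solution. The helper `parse_text` and the two tiny demo files are hypothetical, created in a temporary folder so the example runs without the course data. It also uses `with open(...)` so that each file is closed after reading, and `os.path.join` to build the path.

```python
import os
import re
import tempfile

def parse_text(raw):
    """Return [title, author, text]; missing header fields become None."""
    title = re.search(r'Title: (.*)', raw)
    rest = re.search(r'Authors?: (.*)\n([\s\S]*)', raw)
    return [
        title.group(1) if title else None,
        rest.group(1) if rest else None,
        rest.group(2) if rest else raw,   # keep the whole text if no author line
    ]

# build a throwaway folder with two hypothetical files
with tempfile.TemporaryDirectory() as demo_folder:
    samples = {
        'a.txt': 'Title: First Book\nAuthor: Ann Author\n\nOnce upon a time.',
        'b.txt': 'Title: Second Book\nAuthors: Bo Writer and Cy Writer\nThe end.',
    }
    for name, content in samples.items():
        with open(os.path.join(demo_folder, name), 'w', encoding='utf-8') as f:
            f.write(content)

    # same idea as the lesson loop: one [title, author, text] list per file
    rows = []
    for name in sorted(os.listdir(demo_folder)):
        with open(os.path.join(demo_folder, name), encoding='utf-8') as f:
            rows.append(parse_text(f.read()))

print(rows)
```

A list built this way can be passed to `pandas.DataFrame` with the column names `['title', 'author', 'text']`, exactly as in the cell above.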
