## Regex and Julia

In this section, we introduce regex usage in Julia. Since we only cover a few of the most commonly used methods, you will find it useful to consult [the official documentation](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions-1) as well.

## `occursin`

`occursin(pattern, string)` simply returns true or false, indicating whether a match for the given `pattern` occurs in the `string`.

In [2]:
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Call me at 382-384-3840."
if occursin(phone_re, text)
    println("Found a match!")
end

Found a match!


In [3]:
if occursin(phone_re, "Hello world")
    println("No match; this won't print")
end
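
Because `occursin` returns a `Bool`, it also works as a predicate for filtering a collection. The following is a minimal sketch; the `lines` vector is hypothetical sample data, not part of the examples above.

```julia
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

# Hypothetical sample data for illustration
lines = ["Call me at 382-384-3840.", "Hello world", "Fax: 123-456-7890"]

# Keep only the lines in which the pattern occurs
with_phone = filter(line -> occursin(phone_re, line), lines)
println(with_phone)
```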

## `match`

Another commonly used method is `match(pattern, string)`, which finds the first match of `pattern` in `string` and captures information about how the pattern matched. If the regular expression matches, the returned value is a `RegexMatch` object. Otherwise, `match` returns `nothing`.

In [4]:
match(r"[0-9]{3}-[0-9]{3}-[0-9]{4}", "Call me at 382-384-3840.")

RegexMatch("382-384-3840")
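
Since `match` can return `nothing`, it is usually worth checking the result before using it. The sketch below reuses the phone number regex and reads the `match` and `offset` fields of the returned `RegexMatch`; the variable names `m` and `no_match` are only illustrative.

```julia
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"

m = match(phone_re, "Call me at 382-384-3840.")
if m !== nothing
    println(m.match)   # the matched substring
    println(m.offset)  # index in the string where the match starts
end

# There is no phone number here, so `match` returns `nothing`
no_match = match(phone_re, "Hello world")
println(no_match === nothing)
```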

## `eachmatch`

We use `eachmatch(pattern, string)` to extract every substring that matches a regex. This method returns a `RegexMatchIterator`, which we can `collect` into an array containing all matches of `pattern` in `string`.

In [5]:
gmail_re = r"[a-zA-Z0-9]+@gmail\.com"
text = """
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
"""
collect(eachmatch(gmail_re, text))

2-element Array{RegexMatch,1}:
 RegexMatch("email1@gmail.com")
 RegexMatch("email3@gmail.com")
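
When only the matched text is needed, the `match` field of each `RegexMatch` can be read while iterating, without calling `collect` first. A minimal sketch, where `addresses` is an illustrative name:

```julia
gmail_re = r"[a-zA-Z0-9]+@gmail\.com"
text = """
From: email1@gmail.com
To: email2@yahoo.com and email3@gmail.com
"""

# Iterate over the matches and keep only the matched substrings
addresses = [m.match for m in eachmatch(gmail_re, text)]
println(addresses)
```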

## Regex Groups



Using **regex groups**, we specify subpatterns to extract from a regex by wrapping each subpattern in parentheses `( )`. When a regex contains groups, each `RegexMatch` returned by `eachmatch` also holds the substring captured by each group.

For example, the following familiar regex extracts phone numbers from a string:

In [10]:
phone_re = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
collect(eachmatch(phone_re, text))

2-element Array{RegexMatch,1}:
 RegexMatch("382-384-3840")
 RegexMatch("123-456-7890")

In [11]:
# Same regex with parentheses around the digit groups
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text  = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."
collect(eachmatch(phone_re, text))

2-element Array{RegexMatch,1}:
 RegexMatch("382-384-3840", 1="382", 2="384", 3="3840")
 RegexMatch("123-456-7890", 1="123", 2="456", 3="7890")

As promised, `eachmatch` returns the substring matched along with each capture.
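
The captured groups can be read from each `RegexMatch`, either through its `captures` field or by indexing the match with the group number. The sketch below assumes the grouped phone regex from above; the names `area`, `prefix`, `line`, and `area_codes` are chosen only for illustration.

```julia
phone_re = r"([0-9]{3})-([0-9]{3})-([0-9]{4})"
text = "Sam's number is 382-384-3840 and Mary's is 123-456-7890."

for m in eachmatch(phone_re, text)
    # `captures` holds one entry per group, in order
    area, prefix, line = m.captures
    println(area, " / ", prefix, " / ", line)
end

# Indexing a match by position returns that group: m[1] is the first group
area_codes = [m[1] for m in eachmatch(phone_re, text)]
println(area_codes)
```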

## `replace`

`replace(string, pattern => replacement)` replaces all occurrences of `pattern` with `replacement` in the provided `string`. The `pattern` can also be a regular string instead of a regex, and the optional `count` keyword argument sets the maximum number of occurrences to replace.
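
As a small sketch of those two options, the example below uses a plain string pattern and then limits the number of replacements with `count`. The date string is a hypothetical input.

```julia
date = "2018/03/12"  # hypothetical input

# A plain string pattern replaces every literal occurrence
all_replaced = replace(date, "/" => "-")

# `count=1` stops after the first replacement, scanning from the left
first_only = replace(date, "/" => "-", count=1)

println(all_replaced)
println(first_only)
```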

In the code below, we alter the dates to have a common format by substituting the date separators with a dash.

In [12]:
messy_dates = "03/12/2018, 03.13.18, 03/14/2018, 03:15:2018"
regex = r"[/.:]"
replace(messy_dates, regex => "-")

"03-12-2018, 03-13-18, 03-14-2018, 03-15-2018"

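## `split`

`split(string, pattern)` splits `string` at every match of `pattern`, which can be a character, a string, or a regex. In the code below, we use `split` to separate the chapter titles from their page numbers in a book's table of contents.

In [14]: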
toc = strip("""
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
THE LAURENCE BOY============31
BURDENS=====================55
BEING NEIGHBORLY============76
""")

# First, split into individual lines
lines = split(toc, '\n')

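5-element Array{SubString{String},1}:
 "PLAYING PILGRIMS============3"
 "A MERRY CHRISTMAS===========13"
 "THE LAURENCE BOY============31"
 "BURDENS=====================55"
 "BEING NEIGHBORLY============76"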

In [15]:
# Then, split into chapter title and page number
split_re = r"=+" # Matches any sequence of = characters
[split(line, split_re) for line in lines]

5-element Array{Array{SubString{String},1},1}:
 ["PLAYING PILGRIMS", "3"]  
 ["A MERRY CHRISTMAS", "13"]
 ["THE LAURENCE BOY", "31"] 
 ["BURDENS", "55"]          
 ["BEING NEIGHBORLY", "76"] 
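
The split pieces are still strings, so the page numbers need to be parsed before they can be used as integers. The following sketch works on a shortened copy of the table of contents and builds one named tuple per line; the field names `title` and `page` and the variable `entries` are assumptions made for this example.

```julia
# Shortened copy of the table of contents above
toc = strip("""
PLAYING PILGRIMS============3
A MERRY CHRISTMAS===========13
""")

entries = map(split(toc, '\n')) do line
    # Split on any run of = characters, then convert the page to an Int
    title, page = split(line, r"=+")
    (title = String(title), page = parse(Int, page))
end

println(entries)
```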

## Regex and DataFrames

We can combine the methods discussed above with Julia's dot syntax (`.`), which broadcasts a function over every element of a collection, to work with the columns of a DataFrame.

We've stored the text of the first five sentences of the novel *Little Women* in the DataFrame below. We can use the regex methods that Julia provides to extract the spoken dialog in each sentence.

In [19]:
# Display setting only: limit printed output to 5 rows and 80 columns
Base.displaysize() = (5, 80)
using DataFrames
text = strip("""
"Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug.
"It's so dreadful to be poor!" sighed Meg, looking down at her old dress.
"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff.
"We've got Father and Mother, and each other," said Beth contentedly from her corner.
The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time."
""")

little = DataFrame(sentences=split(text, '\n'));

In [20]:
little

| | sentences |
| --- | --- |
| | SubStrin… |
| 1 | "Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug. |
| 2 | "It's so dreadful to be poor!" sighed Meg, looking down at her old dress. |
| 3 | "I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff. |
| 4 | "We've got Father and Mother, and each other," said Beth contentedly from her corner. |
| 5 | The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, "We haven't got Father, and shall not have him for a long time." |


Since spoken dialog lies within double quotation marks, we create a regex that matches an opening double quotation mark, a sequence of characters, and the closing quotation mark. Because the sentences are stored in a DataFrame column, we can use `.` to broadcast an operation over all of its elements. Broadcasting `eachmatch` and `collect` gives us an array of `RegexMatch` objects for each row in the DataFrame:

In [37]:
quote_re = r"\".+\""
matches = collect.(eachmatch.(quote_re, little.sentences))
println(matches)

Array{RegexMatch,1}[[RegexMatch("\"Christmas won't be Christmas without any presents,\"")], [RegexMatch("\"It's so dreadful to be poor!\"")], [RegexMatch("\"I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all,\"")], [RegexMatch("\"We've got Father and Mother, and each other,\"")], [RegexMatch("\"We haven't got Father, and shall not have him for a long time.\"")]]
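
The same broadcasting works on any vector of strings, which makes it easy to try without a DataFrame. The sketch below assumes a plain `sentences` vector holding two of the sentences above and broadcasts `match`, which returns only the first match per element.

```julia
sentences = [
    "\"It's so dreadful to be poor!\" sighed Meg, looking down at her old dress.",
    "\"We've got Father and Mother, and each other,\" said Beth contentedly from her corner.",
]
quote_re = r"\".+\""

# The dot applies `match` to each sentence; the regex is treated as a scalar
first_matches = match.(quote_re, sentences)

# Read the matched text and strip the surrounding quotation marks
quotes = [strip(m.match, '"') for m in first_matches]
println(quotes)
```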


We can now extract the matches using an array comprehension and remove the quotation marks before adding the results as a new column in our DataFrame:

In [71]:
# Each row produced exactly one match, so we access it directly with x[1]:
spoken = [strip(x[1].match, '\"') for x in matches]
little.dialog = spoken

first(little, 3)

| | sentences | dialog |
| --- | --- | --- |
| | SubStrin… | SubStrin… |
| 1 | "Christmas won't be Christmas without any presents," grumbled Jo, lying on the rug. | Christmas won't be Christmas without any presents, |
| 2 | "It's so dreadful to be poor!" sighed Meg, looking down at her old dress. | It's so dreadful to be poor! |
| 3 | "I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all," added little Amy, with an injured sniff. | I don't think it's fair for some girls to have plenty of pretty things, and other girls nothing at all, |
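
The pattern `.+` is greedy, so this approach relies on each sentence containing a single quoted passage. As a hedged illustration, the made-up sentence below (not from the novel) contains two passages and shows how the lazy quantifier `.+?` separates them.

```julia
# Hypothetical sentence with two separate quoted passages
sentence = "\"Yes,\" said Jo, \"we will.\""

# Greedy: matches from the first quotation mark to the last one
greedy = [m.match for m in eachmatch(r"\".+\"", sentence)]

# Lazy: each match stops at the nearest closing quotation mark
lazy = [m.match for m in eachmatch(r"\".+?\"", sentence)]

println(greedy)
println(lazy)
```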


We can confirm that our string manipulation behaves as expected for the last sentence in our DataFrame by printing the original and extracted text:

In [72]:
little.sentences[5]

"The four young faces on which the firelight shone brightened at the cheerful words, but darkened again as Jo said sadly, \"We haven't got Father, and shall not have him for a long time.\""

In [73]:
little.dialog[5]

"We haven't got Father, and shall not have him for a long time."

## Summary

Julia provides a useful group of methods for manipulating text using regular expressions. When working with DataFrames, we can use the `.` operator to broadcast those methods over the rows of a column.

For more information on regular expressions, see the [documentation](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions-1).