## Chapter 25: Regular Expressions

A common task when handling data is to take a string (perhaps read from a file) and parse it into pieces, for example into the fields of an object. Julia, like other languages, has a number of features for this. One important tool is the regular expression, which describes a pattern that a string may match and can also be used to extract substrings.

### 25.1: Simple Pattern Matching

If we are searching through a string for a specific substring, we have some built-in methods to do this including `occursin`, `startswith`, `endswith`. 

Let's search for the string "cat" inside of some larger strings. 

In [1]:
str1 = "catalog"
str2 = "complication"
str3 = "housecat"

"housecat"

In [2]:
occursin("cat", str1), occursin("cat", str2), occursin("cat", str3)

(true, true, true)

We can test these also in the following way:

In [3]:
map(s -> occursin("cat", s), ("catalog", "complication", "housecat"))

(true, true, true)

The `startswith` function matches the beginning of the string. Note that the order of arguments differs from `occursin`: the string being searched comes first.

In [4]:
map(s -> startswith(s, "cat"), ("catalog", "complication", "housecat"))

(true, false, false)

And the `endswith` method matches the end. 

In [5]:
map(s -> endswith(s, "cat"), ("catalog", "complication", "housecat"))

(false, false, true)

### Regular Expressions

We can test all three of these with a regular expression and the `occursin` method. To make a regular expression, we prefix the string with an `r`. The regular expression `r"cat"` will match anywhere in the string.

In [6]:
map(s -> occursin(r"cat",s), ("catalog", "complication", "housecat"))

(true, true, true)

If we put a `^` at the front of a regular expression, then this will match "cat" only at the start of the string. 

In [7]:
map(s -> occursin(r"^cat",s), ("catalog", "complication", "housecat"))

(true, false, false)

If we put a `$` at the end of a regular expression, then this will match "cat" only at the end of the string. 

In [8]:
map(s -> occursin(r"cat$",s), ("catalog", "complication", "housecat"))

(false, false, true)

The characters in a regular expression are either **regular characters** or **special characters**. So far, the alphabetic characters are regular and `^` and `$` are special: they have a meaning beyond the literal character. Much of understanding regular expressions lies in knowing how to handle the special characters.

We often use both `^` and `$` in a regular expression to match an entire string.  Without these, a regular expression can match a substring. 

In [9]:
map(s->occursin(r"^cat$",s), ("cat", "catalog", "complication", "housecat"))

(true, false, false, false)

Note: to match a special character, you may have to escape it with a `\`.  

In [10]:
map(s->occursin(r"\^",s), ("cat", "^^^", "&*()"))

(false, true, false)

#### Character Class and Ranges

What if, instead of matching only "cat", we'd also like to match "cot" and "cut"? Instead of building three different regular expressions, `c[aou]t` will match a "c", then one of "a", "o" or "u", then a "t".

In [11]:
map(s -> occursin(r"c[aou]t",s), ("catalog", "cotangent", "scuttle", "facet"))

(true, true, true, false)

If a `^` occurs at the start of a `[]`, then it negates the characters inside. So `r"c[^aou]t"` will match a "c", then any single character other than "a", "o" or "u", then a "t".

In [12]:
map(s -> occursin(r"c[^aou]t",s), ("catalog", "cotangent", "scuttle", "facet"))

(false, false, false, true)

We can also match ranges of characters with the `[]` notation. The pattern `c[a-f]t` will match "cat", "cbt", "cct", "cdt", "cet" and "cft":

In [13]:
map(s -> occursin(r"c[a-f]t",s), ("catalog", "cotangent", "scuttle", "facet"))

(true, false, false, true)

Another important special character is `.`, which matches any single character (by default, any character except a newline).

In [14]:
map(s -> occursin(r"c.t",s), ("catalog", "cotangent", "scuttle", "facet", "tact"))

(true, true, true, true, false)
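Because `.` is a special character, matching a literal period requires the `\` escape described earlier. The following is a small sketch of the difference; the test strings are made up for illustration.

```julia
# Unescaped: `.` matches any single character, so "3x14" also matches.
unescaped = map(s -> occursin(r"3.14", s), ("3.14", "3x14", "314"))

# Escaped: `\.` matches only a literal period.
escaped = map(s -> occursin(r"3\.14", s), ("3.14", "3x14", "314"))

(unescaped, escaped)
```

The string "314" fails in both cases because the pattern needs a character between the "3" and the "14".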

#### Exercise

Develop a regular expression that matches a string with d, any character, then g at the front of the string.  Test your results.

In [16]:
map(s -> occursin(r"^d.g", s), ("dogma", "catdog", "dig dug", "fred"))

(true, false, true, false)

### Optional substrings

We often want to match either this or that, and can do that with the special character `|`. Consider

In [17]:
map(s -> occursin(r"(mega|giga)byte",s), ("kilobyte", "megabyte", "gigabyte", "terabyte"))

(false, true, true, false)

Note that we surround the alternatives with a set of `()`. This is needed to separate the alternatives from the string `byte` that follows.
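To see why the grouping matters, here is a short sketch comparing the pattern with and without parentheses. The string "megaphone" is an extra test case chosen for illustration.

```julia
# Without parentheses, the two alternatives are "mega" and "gigabyte",
# so any string containing "mega" matches.
no_group = map(s -> occursin(r"mega|gigabyte", s), ("megaphone", "gigabyte", "kilobyte"))

# With parentheses, "byte" must follow either "mega" or "giga".
with_group = map(s -> occursin(r"(mega|giga)byte", s), ("megaphone", "gigabyte", "kilobyte"))

(no_group, with_group)
```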

### Alphabetic characters

It is very common to want to match any alphabetic character.  In this section, we'll show how to use them. 

First, if we want the alphabetic characters for the Latin alphabet (the standard one in the U.S.), we typically use a character range.  

In [18]:
map(s -> occursin(r"^[a-z]",s), ("catalog", "Catalog", "1234"))

(true, false, false)

If we want both upper and lower case, we can change the `[a-z]` to `[A-Za-z]`:

In [19]:
map(s -> occursin(r"^[A-Za-z]",s), ("catalog", "Catalog", "1234"))

(true, true, false)

If we want to go beyond the Latin alphabet, we can use the `[:alpha:]` class. This will also match other characters, like letters with accents and Greek letters:

In [21]:
map(s -> occursin(r"^[[:alpha:]]",s), ("catalog", "Catalog", "αβγ", "é", "1234"))

(true, true, true, true, false)

### Numeric Characters

The `\d` special character matches a decimal digit. It is essentially shorthand for `[0-9]`.

In [22]:
map(s -> occursin(r"^[0-9]",s), ("catalog", "Catalog", "1234"))

(false, false, true)

In [23]:
map(s -> occursin(r"^\d",s), ("catalog", "Catalog", "1234"))

(false, false, true)

One other common related special character is the word character, `\w`. This matches any alphanumeric character or `_`. The name comes from programming: these are the characters typically allowed in variable and function names.

In [24]:
map(s -> occursin(r"^\w",s), ("catalog", "1234", "_varname", "!@#"))

(true, true, true, false)

#### Exercise

1. Write a regular expression that matches strings of 3-digit numbers.  Test with 2-, 3- and 4-digit numbers as well as strings with alphabetic characters. 
2. Write a regular expression that detects a phone number in the form "xxx-xxx-xxxx", where x is a digit. 

In [26]:
map(s -> occursin(r"^\d\d\d$",s), ("12", "123", "1234", "hello"))

(false, true, false, false)
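For the second part of the exercise, one possible sketch repeats `\d` for each digit and anchors the pattern at both ends. The name `phone_re` and the test strings are assumptions made for this example.

```julia
# One `\d` per digit, with literal hyphens between the groups.
phone_re = r"^\d\d\d-\d\d\d-\d\d\d\d$"

phone_results = map(s -> occursin(phone_re, s),
                    ("978-555-1234", "9785551234", "978-555-123", "abc-def-ghij"))
```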

### Matching spaces

Often we want to match whitespace.  The special character `\s` matches a space, a tab and other less common characters like line feeds and form feeds.  

A common place to use this is in the `split` method.  Recall that `split` takes a string and splits it on a character, but it can also split on a regular expression.

In [27]:
split("Julia is a rockin' language.", r"\s")

5-element Vector{SubString{String}}:
 "Julia"
 "is"
 "a"
 "rockin'"
 "language."

And although it is difficult to see, there are both spaces and tabs in the string above. 

#### Exercise

Split the string `"1,2,3;4,5|6;7|8,9"` by either a `,`, `;` or `|`. 
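One way to approach this exercise is with a character class containing the three separators. This is only a sketch; inside `[]` the `|` is an ordinary character, so it needs no escaping. The variable names are assumptions.

```julia
# Split on any one of ",", ";" or "|".
parts = split("1,2,3;4,5|6;7|8,9", r"[,;|]")

# Optionally convert each piece to an integer.
nums = parse.(Int, parts)
```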

### Quantifiers

Above, we wrote regular expressions to detect a phone number.  We had to repeat `\d` a number of times.  There is a way to do this with quantifiers. 

In [28]:
occursin(r"^\d{3}-\d{3}-\d{4}$", "978-555-1234")

true

If we want a range of repetition counts, `{m,n}` will match between m and n times. 

In [29]:
map(s->occursin(r"o{2,4}",s), ("honey", "moon", "sooon", "ooooo"))

(false, true, true, true)

If you want at most some number, say 3, use `{0,3}` (the shorter `{,3}` is not accepted by every regular expression engine), and if you want at least some number, say 2, use `{2,}`.
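A short sketch of these open-ended ranges, written as `{0,3}` for "at most 3" and `{2,}` for "at least 2". The patterns are anchored with `^` and `$` so that the count applies to the whole string; the test strings are made up.

```julia
# At most three o's (including none at all).
at_most_three = map(s -> occursin(r"^o{0,3}$", s), ("", "oo", "ooo", "oooo"))

# At least two o's.
at_least_two = map(s -> occursin(r"^o{2,}$", s), ("o", "oo", "ooooo"))

(at_most_three, at_least_two)
```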

There are a few shorthand quantifiers for common cases: `+` matches one or more times, `*` matches 0 or more times and `?` matches 0 or 1 times. 

In [30]:
map(s->occursin(r"o+",s), ("cat", "honey", "moon", "sooon", "ooooo"))

(false, true, true, true, true)

In [31]:
map(s->occursin(r"o*",s), ("cat", "honey", "moon", "sooon", "ooooo"))

(true, true, true, true, true)

In [32]:
map(s->occursin(r"ca?t",s), ("catalog", "lactose", "caat"))

(true, true, false)
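It may look odd that `o*` matches "cat", which has no "o" at all. The sketch below uses the `match` function (discussed later in this chapter) to show what was actually matched: zero repetitions, that is, the empty string.

```julia
# `o*` succeeds with an empty match at the start of "cat".
m_star = match(r"o*", "cat")

# `o+` needs at least one "o", so it finds "oo" in "moon" and nothing in "cat".
m_plus = match(r"o+", "moon")
m_none = match(r"o+", "cat")

(m_star.match, m_plus.match, isnothing(m_none))
```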

### Parsing strings and extracting substrings

A bigger use of regular expressions is for extracting substrings and parsing those substrings.  We first examine the `match` method to extract information. 

If we want to match a string with 3 words separated by spaces, we can use the regular expression `r"\w+\s+\w+\s+\w+"` for example:

In [33]:
occursin(r"\w+\s+\w+\s+\w+", "Three big pigs")

true

Now this just tells us whether the string matches.  However, let's say that we want to extract the three words. We can do that by first surrounding each `\w+` with `()`, which makes a capture group.  

In [34]:
occursin(r"(\w+)\s+(\w+)\s+(\w+)", "Three big pigs")

true

And now we will use `match` instead of `occursin`:

In [35]:
m = match(r"(\w+)\s+(\w+)\s+(\w+)", "Three big pigs")

RegexMatch("Three big pigs", 1="Three", 2="big", 3="pigs")

This returns a `RegexMatch` object, which contains the matched string (the whole thing) and the three groups.  We can then get the groups with `m[1]`, `m[2]` and `m[3]`:

In [36]:
m[1], m[2], m[3]

("Three", "big", "pigs")
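When the pattern does not match, `match` returns `nothing` instead of a `RegexMatch`, so it is worth checking before indexing. The sketch below also shows the `match` and `captures` fields of the result; the variable names are assumptions for this example.

```julia
three_words = r"(\w+)\s+(\w+)\s+(\w+)"

m_ok = match(three_words, "Three big pigs")
m_bad = match(three_words, "Three")      # only one word, so no match

whole = m_ok.match          # the entire matched substring
groups = m_ok.captures      # all captured groups as a vector
no_match = isnothing(m_bad) # true when the pattern did not match
```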

#### Exercise

Let's say there are sports scores like `78-75` or `5-3` where the first number is the home team, the second is the visitor team.  Extract the scores.  Test with a few options. 

In [None]:
score_re = r"^(\d+)-(\d+)$"

r"^(\d+)-(\d+)$"

In [46]:
match(score_re, "78-75")

RegexMatch("78-75", 1="78", 2="75")

In [47]:
match(score_re, "5-3")

RegexMatch("5-3", 1="5", 2="3")

In [48]:
match(score_re, "126-94")

RegexMatch("126-94", 1="126", 2="94")
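The captured groups are still strings. As a sketch of the next step, a small helper could convert them to integers; the function name `parse_score` is hypothetical and not part of the exercise.

```julia
score_re = r"^(\d+)-(\d+)$"

# Hypothetical helper: returns (home, visitor) as integers,
# or `nothing` if the string is not in the expected form.
function parse_score(s::AbstractString)
    m = match(score_re, s)
    isnothing(m) && return nothing
    return (parse(Int, m[1]), parse(Int, m[2]))
end

(parse_score("78-75"), parse_score("5-3"), parse_score("seven-3"))
```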

### Matching Integers and Decimal Numbers

A good use of regular expressions in scientific computation is that of parsing a string into a decimal or integer.  There are methods to do this, but often we may either need to first detect one or extract a number from a larger string. This section will discuss this. 

An integer is a sequence of digits, optionally with a minus sign `-` prepended.  

In [49]:
map(s->occursin(r"^-?\d+$",s), ("1234","-1234", "12.34"))

(true, true, false)

And note that we have put a `^` in front and `$` in back to make sure that it matches the entire string. 

For example, one way to use this is to write a for loop that goes through candidate strings and parses only the integers. 

In [50]:
ints = Int[]
for s in ("1234","-1234", "12.34", "housecat")
  if occursin(r"^-?\d+$", s)
    push!(ints, parse(Int, s))
  end
end
ints

2-element Vector{Int64}:
  1234
 -1234
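The same filtering can be sketched more compactly with an array comprehension. This is an alternative style rather than a change in behavior; the names `int_re` and `strs` are assumptions.

```julia
int_re = r"^-?\d+$"
strs = ("1234", "-1234", "12.34", "housecat")

# Keep only the strings that look like integers, then parse them.
ints = [parse(Int, s) for s in strs if occursin(int_re, s)]
```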

Before going on, we're going to use a test suite for this in "test-numbers.jl".  

In [51]:
include("test-numbers.jl");

[0m[1mTest Summary:     | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Matching Integers | [32m   6  [39m[36m    6  [39m[0m0.3s
[0m[1mTest Summary: | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Non-integers  | [32m   4  [39m[36m    4  [39m[0m0.0s
[0m[1mTest Summary:     | [22m[33m[1mBroken  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Matching Decimals | [33m    10  [39m[36m   10  [39m[0m0.0s
[0m[1mTest Summary: | [22m[33m[1mBroken  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Non decimals  | [33m     2  [39m[36m    2  [39m[0m0.0s
[0m[1mTest Summary:  | [22m[33m[1mBroken  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Parse integers | [33m     6  [39m[36m    6  [39m[0m0.0s
[0m[1mTest Summary:  | [22m[33m[1mBroken  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Parse Decimals | [33m    10  [39m[36m   10  [39m[0m0.0s


Note: we have included some decimal tests as well here and have a flag whether or not to run these.  At this point, just look at the top two test suites. 

Decimals are a bit more difficult, so we'll build this up in steps. We'll follow test-driven development and make the test suite first.  To run the decimal tests, change the variable `run_dec_test` in the file to `true`.  We also need to define `dec_re`, which must be a regular expression. 
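The contents of "test-numbers.jl" are not shown in this chapter. As a rough, hypothetical sketch of how such a file might be organized with Julia's `Test` standard library: the test-set names below follow the output above, but the name `int_re`, the individual test strings other than ".1234", and the way the flag is used are all assumptions.

```julia
using Test

# Assumed name for the integer pattern used earlier in the chapter.
int_re = r"^-?\d+$"

# Flag described in the text; the decimal tests are skipped while it is false,
# so `dec_re` does not need to exist yet.
run_dec_test = false

@testset "Matching Integers" begin
    @test occursin(int_re, "1234")
    @test occursin(int_re, "-1234")
end

@testset "Non-integers" begin
    @test !occursin(int_re, "12.34")
    @test !occursin(int_re, "housecat")
end

if run_dec_test
    @testset "Matching Decimals" begin
        @test occursin(dec_re, ".1234")
        @test occursin(dec_re, "12.34")
    end
end
```

Writing the tests first means that each new version of the regular expression can be checked by rerunning the file.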

Let's start with a regular expression that matches one or more digits before a decimal point and one or more after. 

In [2]:
dec_re = r"\d+\.(\d+)"

r"\d+\.(\d+)"

And then rerun the test suite. 

In [3]:
include("test-numbers.jl");

[0m[1mTest Summary:     | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Matching Integers | [32m   6  [39m[36m    6  [39m[0m0.0s
[0m[1mTest Summary: | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Non-integers  | [32m   4  [39m[36m    4  [39m[0m0.0s
Matching Decimals: [91m[1mTest Failed[22m[39m at [39m[1m/Users/pstaab/code/sci-comp-notebooks/notebooks/test-numbers.jl:28[22m
  Expression: occursin(dec_re, ".1234")
   Evaluated: occursin(r"\d+\.(\d+)", ".1234")

Stacktrace:
 [1] [0m[1mmacro expansion[22m
[90m   @[39m [90m~/.julia/juliaup/julia-1.11.1+0.aarch64.apple.darwin14/share/julia/stdlib/v1.11/Test/src/[39m[90m[4mTest.jl:679[24m[39m[90m [inlined][39m
 [2] [0m[1mmacro expansion[22m
[90m   @[39m [90m~/code/sci-comp-notebooks/notebooks/[39m[90m[4mtest-numbers.jl:28[24m[39m[90m [inlined][39m
 [3] [0m[1mmacro expansion[22m
[90m   @[39m [90m~/.julia/juliaup/julia-1.11.1+0.aarch64.

LoadError: LoadError: Some tests did not pass: 6 passed, 4 failed, 0 errored, 0 broken.
in expression starting at /Users/pstaab/code/sci-comp-notebooks/notebooks/test-numbers.jl:23

First, recall that since `.` is a special character, we need to escape it (write `\.` to match the character `.`). 

The test run above shows that this pattern fails on ".1234". The reason is that `\d+` requires one or more digits before the decimal point. We could change this to `\d*`, but that alone would still not match a number like "123." with no digits after the decimal point. Instead, we use two alternatives separated by `|`: the first, `\d*\.\d+`, allows the digits before the decimal point to be missing, and the second, `\d+\.\d*`, allows the digits after it to be missing. Each alternative also starts with an optional sign, `[+-]?`:

In [4]:
dec_re = r"[+-]?\d*\.\d+|[+-]?\d+\.\d*"

r"[+-]?\d*\.\d+|[+-]?\d+\.\d*"
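Note that this pattern is not anchored, so with `occursin` it can also find a decimal inside a longer string. As a sketch, an anchored variant (an assumption, not the pattern used by the test file) wraps the alternatives in a non-capturing group `(?:...)` so that `^` and `$` apply to both of them. The name `dec_full_re` and the test strings are made up for this example.

```julia
dec_re = r"[+-]?\d*\.\d+|[+-]?\d+\.\d*"

# Unanchored: a decimal anywhere in the string counts as a match.
inside = occursin(dec_re, "pi is about 3.14")

# Anchored variant: the whole string must be a decimal number.
dec_full_re = r"^(?:[+-]?\d*\.\d+|[+-]?\d+\.\d*)$"
full = map(s -> occursin(dec_full_re, s),
           ("12.34", ".1234", "123.", "-0.5", "1234", ".", "pi is about 3.14"))
```

A lone "." is rejected because each alternative requires at least one digit on one side of the decimal point.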

In [5]:
include("test-numbers.jl");

[0m[1mTest Summary:     | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Matching Integers | [32m   6  [39m[36m    6  [39m[0m0.0s
[0m[1mTest Summary: | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Non-integers  | [32m   4  [39m[36m    4  [39m[0m0.0s
[0m[1mTest Summary:     | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Matching Decimals | [32m  10  [39m[36m   10  [39m[0m0.0s
[0m[1mTest Summary: | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Non decimals  | [32m   3  [39m[36m    3  [39m[0m0.0s
Parse integers: [91m[1mError During Test[22m[39m at [39m[1m/Users/pstaab/code/sci-comp-notebooks/notebooks/test-numbers.jl:45[22m
  Test threw exception
  Expression: parseIntOrDec("1234") == 1234
  UndefVarError: `parseIntOrDec` not defined in `Main`
  Suggestion: check for spelling errors or missing imports.
  Stacktrace:
   [1] [0m[1mmacro expans

LoadError: LoadError: Some tests did not pass: 0 passed, 0 failed, 6 errored, 0 broken.
in expression starting at /Users/pstaab/code/sci-comp-notebooks/notebooks/test-numbers.jl:44

### Matching any number type

We have a way to detect either integers or decimals above. However, to parse with these, we would have to test each candidate string against both patterns and then call `parse` based on which one matched. We can combine the two by capturing the integer part and the decimal part in separate groups and making each group optional with `?`:

In [6]:
num_re = r"([+-]?\d*)?(\.\d*)?"

r"([+-]?\d*)?(\.\d*)?"

In [7]:
map(s-> match(num_re, s), ("1234", "-1234", "12.34", ".1234","123."))

(RegexMatch("1234", 1="1234", 2=nothing), RegexMatch("-1234", 1="-1234", 2=nothing), RegexMatch("12.34", 1="12", 2=".34"), RegexMatch(".1234", 1="", 2=".1234"), RegexMatch("123.", 1="123", 2="."))

In [8]:
match(num_re, "12.34")

RegexMatch("12.34", 1="12", 2=".34")

In [10]:
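function parseIntOrDec(str::String)
  # Group 1 is the optional sign and integer digits; group 2 is the optional decimal part.
  m = match(r"^([+-]?\d*)?(\.\d*)?$", str)
  isnothing(m) && throw(ArgumentError("The string: $str is not an integer or decimal number"))
  # A decimal part means the value is a Float64; otherwise parse as an Int.
  !isnothing(m[2]) ? parse(Float64, "$(m[1])$(m[2])") : parse(Int, m[1])
end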

parseIntOrDec (generic function with 1 method)

In [11]:
parseIntOrDec("1234.")

1234.0

In [12]:
parseIntOrDec(".1234")

0.1234

In [13]:
parseIntOrDec("1234")

1234

In [14]:
include("test-numbers.jl");

[0m[1mTest Summary:     | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Matching Integers | [32m   6  [39m[36m    6  [39m[0m0.0s
[0m[1mTest Summary: | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Non-integers  | [32m   4  [39m[36m    4  [39m[0m0.0s
[0m[1mTest Summary:     | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Matching Decimals | [32m  10  [39m[36m   10  [39m[0m0.0s
[0m[1mTest Summary: | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Non decimals  | [32m   3  [39m[36m    3  [39m[0m0.0s
[0m[1mTest Summary:  | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Parse integers | [32m   6  [39m[36m    6  [39m[0m0.0s
[0m[1mTest Summary:  | [22m[32m[1mPass  [22m[39m[36m[1mTotal  [22m[39m[0m[1mTime[22m
Parse Decimals | [32m  10  [39m[36m   10  [39m[0m0.0s
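One thing to be aware of: because every part of `r"^([+-]?\d*)?(\.\d*)?$"` is optional, it also matches strings such as "" or "." that contain no digits, and any error for those inputs then comes from `parse` rather than from the function's own check. As a sketch of a stricter variant, the pattern below requires at least one digit. The names `strict_num_re` and `parse_number` are hypothetical and not part of the test suite.

```julia
# Hypothetical stricter pattern: digits with an optional decimal part,
# or a decimal point followed by at least one digit.
strict_num_re = r"^[+-]?(?:\d+\.?\d*|\.\d+)$"

function parse_number(str::AbstractString)
    occursin(strict_num_re, str) ||
        throw(ArgumentError("The string: $str is not an integer or decimal number"))
    # A decimal point anywhere means the value is parsed as a Float64.
    return occursin(".", str) ? parse(Float64, str) : parse(Int, str)
end

(parse_number("1234"), parse_number("1234."), parse_number(".1234"), parse_number("-12.5"))
```

With this version, `parse_number("")` and `parse_number(".")` are rejected by the regular expression itself.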
