## Regular Expressions

In this section we introduce regular expressions, an important tool for specifying patterns in strings.

## Motivation

In a larger piece of text, many useful substrings come in a specific format. For instance, the sentence below contains a U.S. phone number.

`"give me a call, my number is 123-456-7890."`

The phone number contains the following pattern:

1. Three digits
1. Followed by a dash
1. Followed by three digits
1. Followed by a dash
1. Followed by four digits

Given a free-form segment of text, we might naturally wish to detect and extract the phone numbers. We may also wish to extract specific pieces of the phone numbers—for example, by extracting the area code we may deduce the locations of individuals mentioned in the text.

To detect whether a string contains a phone number, we may attempt to write a method like the following:

In [12]:
function is_phone_number(string)
    
    digits = "0123456789"
    
    function is_not_digit(token)
        return (!(token in digits))
    end
    
    # Three numbers
    for i in 1:3
        if is_not_digit(string[i])
            return false
        end
    end
    
    # Followed by a dash
    if string[4] != '-'
        return false
    end
    
    # Followed by three numbers
    for i in 5:7
        if is_not_digit(string[i])
            return false
        end
    end
        
    # Followed by a dash    
    if string[8] != '-'
        return false
    end
    
    # Followed by four numbers
    for i in 9:12
        if is_not_digit(string[i])
            return false
        end
    end
    
    return true
end

is_phone_number (generic function with 1 method)

In [13]:
is_phone_number("382-384-3840")

true

In [14]:
is_phone_number("phone number")

false

The code above is unpleasant and verbose. Rather than manually loop through the characters of the string, we would prefer to specify a pattern and command Julia to match the pattern.

**Regular expressions** (often abbreviated **regex**) conveniently solve this exact problem by allowing us to create general patterns for strings. Using a regular expression, we may re-implement the `is_phone_number` method in two short lines of Julia:

In [18]:
function is_phone_number(string)
    regex = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
    return match(regex, string) !== nothing
end

is_phone_number("382-384-3840")

true

In the code above, we use the regex `[0-9]{3}-[0-9]{3}-[0-9]{4}` to match phone numbers. Although cryptic at first glance, the syntax of regular expressions is fortunately much simpler to learn than the Julia language itself; we introduce nearly all of the syntax in this section alone. Check out the [docs](https://docs.julialang.org/en/v1/manual/strings/#Regular-Expressions-1) for more information on manipulating strings with regular expressions.
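As a preview of the extraction task from the motivation, the sketch below applies the same pattern to the example sentence and pulls out the area code. It assumes one addition not used so far: parentheses around the first group of digits, which ask the regex engine to capture that part of the match separately. The variable names `phone_regex` and `text` are illustrative only.

```julia
# Same pattern as above, with parentheses around the first three digits so
# that the area code is captured as its own group (an illustrative addition).
phone_regex = r"([0-9]{3})-[0-9]{3}-[0-9]{4}"

text = "give me a call, my number is 123-456-7890."

# `match` returns a RegexMatch object, or `nothing` if the pattern is absent.
m = match(phone_regex, text)

if m !== nothing
    phone_number = m.match      # the full substring that matched the pattern
    area_code = m.captures[1]   # the text matched by the first parenthesized group
    println("phone number: ", phone_number)
    println("area code: ", area_code)
end
```

Note that `match` searches anywhere in the string, so the phone number does not need to appear at the start of the text.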

## Regex Syntax

We start with the syntax of regular expressions. In Julia, regular expressions are most commonly written as string literals prefixed with `r`, such as `r"[0-9]{3}"`. Like Julia's raw strings (`raw"..."`), these literals behave like normal Julia strings without special handling for backslashes.

For example, to store the string `hello \ world` in a normal Julia string, we must write: