# **Using Regular Expressions to Search Text**
This lesson covers regular expressions, or **regexes**, a powerful tool for searching and parsing text. Regexes are more difficult than most of the concepts we've covered so far, but they can be invaluable for language processing.

### Why are regexes useful?
Suppose you are trying to check if a string matches a possible age. We know the string should be only numeric digits and no more than 3 characters long, so we could write the following `if` statement:

In [None]:
age_string = "64"

if len(age_string) <= 3 and age_string.isnumeric():
    print("The string is an age")

However, what if we are trying to match a string that contains a name *and* an age, in the format `"Hannah, 42"`? With the tools we have covered so far for parsing text, there's no obvious way to do this.

With regexes, this is easy. We can write the following **regex pattern string** that describes the pattern we are looking to match.

In [None]:
import re

name_age_pattern = r"^[A-Za-z]+, [0-9]{1,3}$"

if re.search(name_age_pattern, "Hannah, 42"):
    print("'Hannah, 42' matches")
    
if re.search(name_age_pattern, "Hannah"):
    print("'Hannah' matches")
    
if re.search(name_age_pattern, "42"):
    print("'42' matches")

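Calling `re.search(pattern, string)` returns a match object for the first match of the pattern in the string, or `None` if there is no match. Because `None` is treated as false in an `if` statement, only the first check above prints.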
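
As a small sketch of what `re.search` gives back, the match object can be inspected with its `group()` and `span()` methods. The variable names `match` and `no_match` are only for illustration.

```python
import re

# Same pattern as above: letters, a comma and space, then 1-3 digits.
name_age_pattern = r"^[A-Za-z]+, [0-9]{1,3}$"

match = re.search(name_age_pattern, "Hannah, 42")
no_match = re.search(name_age_pattern, "Hannah")

print(match.group())  # The text that matched the pattern
print(match.span())   # (start, end) indices of the match in the string
print(no_match)       # None, because "Hannah" has no comma or age
```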

### Components of an example regex
Let's break down the pattern string `^[A-Za-z]+, [0-9]{1,3}$`. 

First, `^` matches the start of the string.

Next, the bracketed set `[A-Za-z]` matches any letter in the English alphabet, upper or lowercase. Adding `+` allows the pattern to match one or more letters.

In [None]:
letter_pattern = r"^[A-Za-z]+$"

print(re.search(letter_pattern, "Hello")) # Matches
print(re.search(letter_pattern, "12344")) # Doesn't match
print(re.search(letter_pattern, "Hello1234")) # Doesn't match

Next, the pattern `, ` matches a literal comma and space. In a regex, any character that has no special meaning matches itself exactly.

In [None]:
exact_pattern = r"^Hello$"

print(re.search(exact_pattern, "Hello")) # Matches
print(re.search(exact_pattern, "Hi")) # Doesn't match
print(re.search(exact_pattern, "Hello1234")) # Doesn't match

The set `[0-9]` matches any digit character, and `{1,3}` requires between 1 and 3 digits in a row.  

In [None]:
digit_pattern = r"^[0-9]{1,3}$"

print(re.search(digit_pattern, "72")) # Matches
print(re.search(digit_pattern, "1")) # Matches
print(re.search(digit_pattern, "882")) # Matches
print(re.search(digit_pattern, "1002")) # Doesn't match, too many digits
print(re.search(digit_pattern, "a12")) # Doesn't match, includes non-digit

Finally, the `$` character matches the end of the string.

## **Using Regex to Search**
If we want to match an entire string with a pattern, we put `^` and `$` at the start and end of the pattern. However, if we want to match a pattern *anywhere in the string*, we can omit these characters. For example, we can find the first chunk of letters in the string using the following regex.

In [None]:
letters_pattern = r"[A-Za-z]+"

print(re.search(letters_pattern, "The quick brown fox jumped over the lazy dog"))
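
To make the effect of the anchors concrete, here is a minimal sketch that runs the same digit pattern with and without `^` and `$`. The string `order_text` is a made-up example, not part of the lesson data.

```python
import re

# Hypothetical example string, used only for illustration.
order_text = "Order 66 shipped"

# Without anchors, the digits may appear anywhere in the string.
anywhere = re.search(r"[0-9]+", order_text)

# With anchors, the entire string must consist of digits.
whole_string = re.search(r"^[0-9]+$", order_text)

print(anywhere.group())
print(whole_string)
```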

If we want to find *all* of the matches for a regex in a string, we can use `re.findall`.

In [None]:
print(re.findall(letters_pattern, "The quick brown fox jumped over the lazy dog"))

We can see how this would be useful to separate the words in a raw text file. 
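
As a rough sketch of that idea, the snippet below splits a short multi-line string into words and counts them. The string `raw_text` stands in for the contents of a file, and the use of `collections.Counter` is an illustrative choice rather than something this lesson requires.

```python
import re
from collections import Counter

# Hypothetical stand-in for text read from a raw text file.
raw_text = "The cat sat.\nThe dog sat too!"

# Each run of letters is treated as one word; punctuation and newlines are skipped.
words = re.findall(r"[A-Za-z]+", raw_text)
word_counts = Counter(words)

print(words)
print(word_counts.most_common(2))
```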

## **Regex Special Characters**
Regex has a number of useful special characters to make pattern matching easy. You can find a complete list [here](https://www.w3schools.com/python/python_regex.asp).

You've seen the `+` symbol, which matches one or more instances of the preceding character or set. We can also use `*`, which matches zero or more instances, and `?`, which matches zero or one instance.

In [None]:
plus_pattern     = r"^A+$"
star_pattern     = r"^A*$"
optional_pattern = r"^A?$"

for string in ["", "A", "AA"]:
    if re.search(plus_pattern, string) is not None:
        print(f"'{string}' matches {plus_pattern}")
    if re.search(star_pattern, string) is not None:
        print(f"'{string}' matches {star_pattern}")
    if re.search(optional_pattern, string) is not None:
        print(f"'{string}' matches {optional_pattern}")
    print('\n')
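
A common use of `?` is to allow for an optional letter in a word. The sketch below assumes we want to accept both the American and British spellings of "color"; the pattern and word list are illustrative only.

```python
import re

# "u?" makes the letter u optional, so both spellings match.
color_pattern = r"^colou?r$"

results = {
    word: re.search(color_pattern, word) is not None
    for word in ["color", "colour", "colouur"]
}

for word, matched in results.items():
    print(f"'{word}' matches: {matched}")
```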

#### **Exercise 1**
Create a regex that matches one or more `A` characters followed by one or zero `B` characters.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">ab_pattern = r"^A+B?$"</code></pre>
</details>

In [None]:
# TODO: Replace the following with your own regex
ab_pattern = r""

print(re.search(ab_pattern, "AAB") is not None, "Should be True")
print(re.search(ab_pattern, "AA") is not None, "Should be True")
print(re.search(ab_pattern, "AABBB") is not None, "Should be False")
print(re.search(ab_pattern, "BB") is not None, "Should be False")

### Special sequences
We can also use the following special sequences as shortcuts for common patterns.
- `\d` matches a digit character (number)
- `\s` matches a whitespace character (space, newline, tab, etc.)
- `\w` matches a word character (letter, digit, or underscore)

In [None]:
word_pattern = r"\w+"
number_pattern = r"\d+"
whitespace_pattern = r"\s+"

print(re.findall(word_pattern, "The 22 quick brown foxes jumped over the 304 lazy dogs"))
print(re.findall(number_pattern, "The 22 quick brown foxes jumped over the 304 lazy dogs"))
print(re.findall(whitespace_pattern, "The 22 quick brown foxes jumped over the 304 lazy dogs"))
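
Because `re.findall` returns strings, numbers found with `\d+` usually need converting before doing arithmetic. The following sketch shows one way to do that, and also uses `re.split` with `\s+` to break the sentence into tokens. The variable names are illustrative.

```python
import re

sentence = "The 22 quick brown foxes jumped over the 304 lazy dogs"

# findall returns strings such as "22", so convert each one to an int.
numbers = [int(token) for token in re.findall(r"\d+", sentence)]

# re.split breaks the sentence apart wherever the whitespace pattern matches.
tokens = re.split(r"\s+", sentence)

print(numbers)
print(sum(numbers))
print(len(tokens))
```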

If we want to match a character that is otherwise a special character, we can use the backslash `\` to **escape** the special character.

In [None]:
title_pattern = r"\w+\." # Matches some word characters followed by a period

print(re.findall(title_pattern, "Mr. Smith and Mrs. Smith went to Dr. Hartman for help"))
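
To see why escaping matters, note that an unescaped `.` is itself a special character that matches any character except a newline. The sketch below compares an unescaped and an escaped period, and also shows `re.escape`, a standard library helper that escapes special characters for us. The sample `text` is made up for illustration.

```python
import re

# Hypothetical example string, used only for illustration.
text = "Pi is 3.14, not 3514"

unescaped = re.findall(r"3.14", text)               # "." matches any character
escaped = re.findall(r"3\.14", text)                # "\." matches only a period
auto_escaped = re.findall(re.escape("3.14"), text)  # let re.escape add the backslash

print(unescaped)
print(escaped)
print(auto_escaped)
```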

#### **Exercise 2**
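Create a regex that matches whole words 1-3 letters long. Words should be surrounded by spaces on either side. Note that if your pattern includes the surrounding spaces, each match consumes them, so `re.findall` skips a short word that directly follows another match (for example, `am` after `I`).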

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">short_word_pattern = r"\s\w{1,3}\s" </code></pre>
</details>

In [None]:
# TODO: Replace the following with your own regex
short_word_pattern = r"" 

print(re.findall(short_word_pattern, " I am going to the mall with my family "))

### Sets
We've seen a few regex sets. A set matches a single character chosen from the characters listed inside the brackets.
- `[A-Z]` matches any uppercase letter
- `[a-z]` matches any lowercase letter
- `[0-9]` matches any digit (you can also use `\d`)
- `[xyz]` matches `x`, `y`, or `z`
- `[^xyz]` matches any character *except* `x`, `y`, or `z`
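
The sets listed above can be tried out with `re.findall`. This is a minimal sketch on a made-up string, `sample`; adding `+` after a set groups consecutive matching characters together.

```python
import re

# Hypothetical example string, used only for illustration.
sample = "Room 101, Zone xyz"

uppercase = re.findall(r"[A-Z]", sample)   # single uppercase letters
lowercase = re.findall(r"[a-z]+", sample)  # runs of lowercase letters
digits = re.findall(r"[0-9]+", sample)     # runs of digits
only_xyz = re.findall(r"[xyz]", sample)    # any single x, y, or z
not_xyz = re.findall(r"[^xyz]+", sample)   # runs of anything except x, y, or z

print(uppercase)
print(lowercase)
print(digits)
print(only_xyz)
print(not_xyz)
```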

#### **Exercise 3**
Create a regex that matches any string of characters, surrounded by spaces, that **does not** contain any digits.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">no_numbers_pattern = r"\s[^0-9\s]+\s"</code></pre>
</details>

In [None]:
# TODO: Replace the following with your own regex
no_numbers_pattern = r"" 

print(re.findall(no_numbers_pattern, " hello hell0 123 coo1 easy 3ASY "))

#### **Exercise 4**
Use a regex to extract all of the names in the following text, where names start with an uppercase letter and only contain letters.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">name_regex = r"[A-Z][a-z]+"
print(re.findall(name_regex, text))</code></pre>
</details>

In [None]:
text = "Sarah told Ally and Michael that Henry was coming to her party with Chris"

# TODO: Extract all of the names using a regex and findall


## **Conclusion**
In this lesson, we learned about using regular expressions to parse text, including:
- Using metacharacters like `+` and `*` to match a particular number of characters
- Using special sequences like `\d` and `\w` to match common types of symbols
- Using sets like `[A-Z]` to match one of several possible characters

Regular expressions are an incredibly useful tool for language processing, although their syntax can be intimidating. The best way to get comfortable using regexes is by creating and testing them yourself. You can use tools such as [regex101](https://regex101.com) to try out regexes with live updates.

Next, we'll take a look at using machine learning tools in Python.

[Next Lesson](<./10. Machine Learning.ipynb>)