# Regular Expressions

## External Resources

- C++ Reference for Regular Expressions: [https://en.cppreference.com/w/cpp/regex.html](https://en.cppreference.com/w/cpp/regex.html)
- YouTube Video - Regular Expressions - [https://youtu.be/xxOq9YE466w](https://youtu.be/xxOq9YE466w)
- YouTube Podcast - Regular Expressions - [https://youtu.be/xDOTiIwtHXI](https://youtu.be/xDOTiIwtHXI)
- NotebookLM learning materials - [https://notebooklm.google.com/notebook/bfb466f1-6060-4b1a-8406-b50508f3c035](https://notebooklm.google.com/notebook/bfb466f1-6060-4b1a-8406-b50508f3c035)

## Overview
- A regular expression (regex or regexp) is a sequence of characters that forms a search pattern. It can be used for string matching and manipulation.
- Regular expressions are commonly used in programming languages for tasks such as input validation, searching, and replacing text.

## Basic Syntax
- `.` : Matches any single character except newline.
- `^` : Matches the start of a string.
- `$` : Matches the end of a string.
- `*` : Matches 0 or more repetitions of the preceding element.
- `+` : Matches 1 or more repetitions of the preceding element.
- `?` : Matches 0 or 1 repetition of the preceding element.
- `[]` : Matches any one of the characters inside the brackets.
- `|` : Acts as a logical OR between expressions.
- `()` : Groups expressions and captures the matched text.
- `{n}` : Matches exactly n repetitions of the preceding element.
- `{n,}` : Matches n or more repetitions of the preceding element.
- `{n,m}` : Matches between n and m repetitions of the preceding element.
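
The operators above can be tried out with two small helpers. This is a minimal sketch, not a library API: `matches_whole` and `matches_anywhere` are hypothetical names introduced here for illustration, and both assume the default ECMAScript grammar of `std::regex`.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: true when the whole of `text` matches `pattern`.
// Assumes the default ECMAScript grammar used by std::regex.
bool matches_whole(const std::string& text, const std::string& pattern)
{
    return std::regex_match(text, std::regex(pattern));
}

// Hypothetical helper: true when `pattern` matches somewhere inside `text`.
bool matches_anywhere(const std::string& text, const std::string& pattern)
{
    return std::regex_search(text, std::regex(pattern));
}
```

For example, `matches_whole("colour", "colou?r")` and `matches_whole("ct", "ca*t")` both succeed, while `matches_whole("ct", "ca+t")` fails because `+` needs at least one `a`. The anchors only show their effect with a search: `matches_anywhere("concat", "cat$")` succeeds, but `matches_anywhere("concat", "^cat")` does not.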

## Character Classes
- `\d` : Matches any digit (equivalent to `[0-9]`).
- `\D` : Matches any non-digit (equivalent to `[^0-9]`).
- `\w` : Matches any word character (alphanumeric plus underscore, equivalent to `[a-zA-Z0-9_]`).
- `\W` : Matches any non-word character (equivalent to `[^a-zA-Z0-9_]`).
- `\s` : Matches any whitespace character (spaces, tabs, line breaks).
- `\S` : Matches any non-whitespace character.  
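
A short sketch of the character classes in use is given below. The helper names (`collapse_whitespace`, `digits_only`, `is_identifier_like`) are hypothetical and exist only for this illustration. The patterns are written as raw string literals (`R"(...)"`) so the backslashes do not have to be doubled.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces every run of whitespace (\s+) with one space.
std::string collapse_whitespace(const std::string& text)
{
    static const std::regex spaces(R"(\s+)");
    return std::regex_replace(text, spaces, " ");
}

// Hypothetical helper: removes every non-digit character (\D).
std::string digits_only(const std::string& text)
{
    static const std::regex non_digit(R"(\D)");
    return std::regex_replace(text, non_digit, "");
}

// Hypothetical helper: true when `text` consists only of word characters (\w).
bool is_identifier_like(const std::string& text)
{
    static const std::regex word(R"(\w+)");
    return std::regex_match(text, word);
}
```

With these definitions, `digits_only("(970) 123-4567")` should keep only the ten digits, and `is_identifier_like("user name")` should be false because the space is not a word character.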

## Examples
- To match an email address (e.g., example@example.com or example@domain.co):
  ```regex
  [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}
  ```
- To match a phone number (e.g., 123-456-7890):

  ```regex
  \d{3}-\d{3}-\d{4}
  ```

### Regex in C++
- Must include the `<regex>` header
- Use the `std::regex` class to define a regular expression pattern
- Use the `std::smatch` class to store the results of a regex search
- Use the `std::regex_search` function to search for a pattern in a string
- Use the `std::regex_replace` function to replace occurrences of a pattern in a string
- Use the `std::regex_match` function to check if a string matches a pattern
- Use `std::sregex_iterator` to iterate over all matches in a string
- Use `std::regex_constants` for various regex options (e.g., case-insensitive matching, multiline mode)
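
Two of the items above, the `std::regex_constants` options and `std::regex_replace`, can be sketched as follows. The helper names are hypothetical, and the replacement string assumes the default ECMAScript format rules, in which `$1` stands for the first capture group.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: case-insensitive search for `word` inside `text`.
// `word` is used as a regex pattern, so it should not contain metacharacters.
bool contains_ignore_case(const std::string& text, const std::string& word)
{
    const std::regex re(word, std::regex_constants::icase);
    return std::regex_search(text, re);
}

// Hypothetical helper: masks every phone number written as ddd-ddd-dddd,
// keeping only the area code (capture group 1).
std::string mask_phone_numbers(const std::string& text)
{
    static const std::regex phone(R"((\d{3})-\d{3}-\d{4})");
    return std::regex_replace(text, phone, "$1-***-****");
}
```

Here `mask_phone_numbers("Call 970-123-4567 now")` would keep `970` and replace the remaining digits with asterisks. Text that does not match the pattern is copied through unchanged.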

#### Matching and Searching patterns

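- These are the three main tools for regex matching in C++: two functions (`std::regex_match`, `std::regex_search`) and one iterator type (`std::sregex_iterator`). They differ in what they try to match: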

| Function / type        | What it tries to match                       |
| ---------------------- | -------------------------------------------- |
| `std::regex_match`     | The **entire string** must match the pattern (as if it were written `^pattern$`) |
| `std::regex_search`    | **Any substring** may match; finds the first occurrence of the pattern |
| `std::sregex_iterator` | **Finds all occurrences** of the pattern in the string (using string iterators) |

- Syntax for the regex matching functions:

```cpp
// Simplified forms; the real functions are templates with more overloads.
// s is a std::string, m is a std::smatch, re is a std::regex.

// checks if the entire string s matches the regex pattern re
bool regex_match(s, re);
// checks if the entire string s matches the regex pattern re and stores the results in m
bool regex_match(s, m, re);
// searches for the first occurrence of the regex pattern re in the string s
bool regex_search(s, re);
// searches for the first occurrence of the regex pattern re in s and stores the results in m
bool regex_search(s, m, re);
// iterates over all occurrences of re in s; a default-constructed sregex_iterator marks the end
sregex_iterator it(s.begin(), s.end(), re);
```

In [1]:
#include <iostream>
#include <regex>

using namespace std;

string s;

In [2]:
s = "cat";

In [3]:
// This will return true because the entire string "cat" matches the regex pattern "cat"
regex_match(s, regex("cat"));   // true

true

In [4]:
// This will return false because the entire string "cat" does not match the regex pattern "c"
regex_match(s, regex("c"));     // false

false

In [5]:
// This will return true because the regex_search function looks for the pattern "c" anywhere in the string "cat"
regex_search(s, regex("c"));    // true

true

In [6]:
// search all the numbers in the string
string text = "ID: 42, code: 99, value: 1234";
regex number_pattern("\\d+");

auto begin = sregex_iterator(text.begin(), text.end(), number_pattern);
auto end   = sregex_iterator(); // default constructor creates end iterator

for (auto it = begin; it != end; ++it)
    cout << it->str() << endl;

42
99
1234


In [7]:
s = "970-123-4567";

In [8]:
regex phone_regex(R"(\d{3}-\d{3}-\d{4})");
// if you want to store the matched phone number, you can use a smatch object
smatch phone_match;

In [9]:
if (regex_match(s, phone_match, phone_regex)) {
    cout << phone_match[0] << " is a valid phone number format." << endl;
} else {
    cout << s << " is not a valid phone number format." << endl;
}

970-123-4567 is a valid phone number format.


In [10]:
// regex_search returns true when a match is found and stores it in phone_match
if (regex_search(s, phone_match, phone_regex)) {
    cout << "Found phone number: " << phone_match[0] << endl;
} else {
    cout << "No phone number found in the string." << endl;
}

Found phone number: 970-123-4567
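
Since `()` captures the text it matches, an `smatch` also gives access to the individual parts of a match. The following is a minimal sketch under that assumption: `PhoneParts` and `split_phone` are hypothetical names introduced for illustration, and the pattern is the same phone format with each part wrapped in a capture group.

```cpp
#include <regex>
#include <string>

// Hypothetical struct holding the three parts of a ddd-ddd-dddd phone number.
struct PhoneParts {
    std::string area;
    std::string prefix;
    std::string line;
};

// Hypothetical helper: returns true and fills `out` when the whole of `text`
// is a phone number in the form ddd-ddd-dddd; leaves `out` untouched otherwise.
bool split_phone(const std::string& text, PhoneParts& out)
{
    static const std::regex re(R"((\d{3})-(\d{3})-(\d{4}))");
    std::smatch m;
    if (!std::regex_match(text, m, re))
        return false;
    // m[0] is the whole match; m[1] to m[3] are the capture groups in order.
    out.area   = m[1].str();
    out.prefix = m[2].str();
    out.line   = m[3].str();
    return true;
}
```

Calling `split_phone("970-123-4567", parts)` should leave `parts.area` holding `970`. Groups are numbered by the position of their opening parenthesis, starting at 1.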


In [11]:
s = "Some text. So call 970-123-4567 or 970-222-3333 for more information.";

In [12]:
// search all phone numbers in the string
auto phone_begin = sregex_iterator(s.begin(), s.end(), phone_regex);
auto phone_end   = sregex_iterator(); // default constructor creates end iterator
for (auto it = phone_begin; it != phone_end; ++it)
    cout << "Found phone number: " << it->str() << endl;

Found phone number: 970-123-4567
Found phone number: 970-222-3333


In [13]:
regex email_pattern("[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}");

In [14]:
s = "Please contact us at email@domain.com, or email1@domain.edu.eu for more information.";

auto email_begin = sregex_iterator(s.begin(), s.end(), email_pattern);
auto email_end   = sregex_iterator();
for (auto it = email_begin; it != email_end; ++it)
    cout << "Found email: " << it->str() << endl;

Found email: email@domain.com
Found email: email1@domain.edu.eu


In [17]:
void example()
{
    std::string s = "Some people, when confronted with a problem, think "
        "\"I know, I'll use regular expressions.\" "
        "Now they have two problems.";
 
    std::regex self_regex("REGULAR EXPRESSIONS",
        std::regex_constants::ECMAScript | std::regex_constants::icase);
    if (std::regex_search(s, self_regex))
        std::cout << "Text contains the phrase 'regular expressions'\n";
 
    std::regex word_regex("(\\w+)");
    auto words_begin = 
        std::sregex_iterator(s.begin(), s.end(), word_regex);
    auto words_end = std::sregex_iterator();
 
    std::cout << "Found "
              << std::distance(words_begin, words_end)
              << " words\n";
 
    const int N = 6;
    std::cout << "Words longer than " << N << " characters:\n";
    for (std::sregex_iterator i = words_begin; i != words_end; ++i)
    {
        std::smatch match = *i;
        std::string match_str = match.str();
        if (match_str.size() > N)
            std::cout << "  " << match_str << '\n';
    }
 
    std::regex long_word_regex("(\\w{7,})");
    std::string new_s = std::regex_replace(s, long_word_regex, "[$&]");
    std::cout << new_s << '\n';
}


In [18]:
example();

Text contains the phrase 'regular expressions'
Found 20 words
Words longer than 6 characters:
  confronted
  problem
  regular
  expressions
  problems
Some people, when [confronted] with a [problem], think "I know, I'll use [regular] [expressions]." Now they have two [problems].


## Labs

- Lab 1: Validate email address, social security, and credit card numbers using regex patterns.
    - Regex - [../labs/regex/regex](../labs/regex/regex/README.md)

## Kattis Problems

- I have not come across any regex problems on Kattis. If you find any, please share them with me and I'll add them here.