_____

<table align="left" width=100%>
    <td>
        <div style="text-align: center;">
          <img src="./images/bar.png" alt="entidades financiadoras"/>
        </div>
    </td>
    <td>
        <p style="text-align: center; font-size:24px;"><b>Introduction to Data Science</b></p>
        <p style="text-align: center; font-size:18px;"><b>Master in Electrical and Computer Engineering</b></p>
        <p style="text-align: center; font-size:14px;"><b>Pedro Cardoso (pcardoso@ualg.pt)</b></p>
    </td>
</table>

_____


__Short Lesson Title:__ Regular expressions

*__Summary:__ This lesson delves into regular expressions, a powerful tool for string manipulation. It covers the basics of regex syntax, including character classes, metacharacters, anchors, flags, alternation, quantifiers, and grouping. The lesson also provides practical examples and exercises to reinforce understanding of regex concepts.*

# Regular expressions in a nutshell

Regular expressions are a powerful tool for string manipulation: they are used to find, replace, and split strings, and to extract information from them. For example, we can extract the title of a passenger from the name feature (as we'll see later).

So, a regular expression (regex) is a pattern used to match character combinations in strings. Here's a nutshell tutorial on regular expressions.
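As a first taste of the four uses mentioned above (find, replace, split, extract), here is a minimal sketch. The sample text is a made-up string in the style of passenger names, and the patterns are just one possible way to write each operation.

```python
import re

# Hypothetical sample text in the style of a passenger "name" feature
text = "Braund, Mr. Owen Harris; Cumings, Mrs. John Bradley"

# Find: every capitalised word immediately followed by a period (the titles)
titles = re.findall(r"[A-Z][a-z]+\.", text)

# Replace: substitute each title with a placeholder
masked = re.sub(r"[A-Z][a-z]+\.", "<title>", text)

# Split: one entry per passenger, splitting on ";" and any following whitespace
names = re.split(r";\s*", text)

# Extract: capture the letters between the first ", " and the following "."
first_title = re.search(r",\s*([A-Za-z]+)\.", text).group(1)

print(titles)       # ['Mr.', 'Mrs.']
print(masked)       # Braund, <title> Owen Harris; Cumings, <title> John Bradley
print(names)        # ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley']
print(first_title)  # Mr
```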

## Matching characters

You can match specific characters by simply including them in the pattern. For example, the pattern "cat" matches the characters "c", "a", and "t", i.e., the string "cat".

In [None]:
import re

string = """In a lab where wires snake and spin,
A cat-5 naps lightly on a Z390 motherboard bin.
Nearby a dog, with 2X zoom goggles tight,
Barks at code through the midnight night.

The screen glows RGB-blue,
Neural nets hum with feline v1.3 pride.
The data scientist sips her 250ml tea,
Plotting 4D graphs with a purring decree.

Ohm’s Law scribbled on the chalkboard — V=IR,
While 8086 transistors buzz from afar.
The engineer nods, adjusts gear 3,
As binary barks echo in bitstream 101010 harmony."""

re.findall(r"cat", string)  # Find all occurrences of "cat"

We can also use the `re.finditer()` function, which returns an iterator yielding a match object for each non-overlapping match of the pattern in the string. Each match object carries information about the match, such as its start and end positions in the string.

In [None]:
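def match_positions(pattern, string, flags=0):
    """Print each match of pattern in string with its start and (inclusive) end index."""
    for match in re.finditer(pattern, string, flags):
        # match.end() is exclusive, so subtract 1 to report the index of the last matched character
        print(f"Match: {match.group()}  Start: {match.start()}  End: {match.end()-1}")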

match_positions(r"cat", string)
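The start and end positions stored in a match object follow Python's slicing convention: `end()` points one past the last matched character. The following sketch, on a short made-up sample string, illustrates this with `span()`, which returns both positions at once.

```python
import re

sample = "concatenate the cat"  # hypothetical sample text

for m in re.finditer(r"cat", sample):
    start, end = m.span()  # same as (m.start(), m.end()); end is exclusive
    # Slicing with the span recovers exactly the matched text
    assert sample[start:end] == m.group()
    print(m.group(), start, end)

spans = [m.span() for m in re.finditer(r"cat", sample)]
print(spans)  # [(3, 6), (16, 19)]
```

Note that the first match is found inside the word "concatenate": a plain pattern matches anywhere in the string, not only whole words.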


## Character classes

You can match a set of characters by using character classes. For example, the pattern `[cat]` matches any of the characters "c", "a", or "t". This means that the pattern will match any occurrence of any of these characters in the string.

In [None]:
match_positions(r"[cat]", string)
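To make the difference between the literal pattern `cat` and the character class `[cat]` concrete, here is a small sketch on a made-up sample string.

```python
import re

sample = "cat act tac"  # hypothetical sample text

# The literal pattern needs the three characters in this exact order
literal = re.findall(r"cat", sample)

# The character class matches ONE character at a time, in any order
single_chars = re.findall(r"[cat]", sample)

print(literal)            # ['cat']
print(single_chars)       # ['c', 'a', 't', 'a', 'c', 't', 't', 'a', 'c']
print(len(single_chars))  # 9
```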


## Metacharacters

Certain characters have a special meaning in regex patterns. For example, the dot `.` matches any character except a newline, and the asterisk `*` matches zero or more occurrences of the preceding element. For example, the pattern `c.t` matches "cat", "cot", "cut", etc., and the pattern `c.*t` matches "ct", "cat", "cot", "cut", "catt", "cxt", etc. By default `*` is greedy (it matches as much as possible); appending `?` makes it lazy (it matches as little as possible).




In [None]:
# greedy version -- matches as much as possible
match_positions(r"c.*t", string)

In [None]:
# lazy (non-greedy) version -- matches as little as possible
match_positions(r"c.*?t", string)
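The outputs on the long poem can be hard to read, so here is the same greedy versus lazy comparison on a short made-up string. It also shows that `.` stops at a newline by default.

```python
import re

sample = "cat and cot\ncut"  # hypothetical two-line sample text

# Greedy: from the first "c" to the LAST "t" on the same line
greedy = re.findall(r"c.*t", sample)

# Lazy: from each "c" to the NEAREST following "t"
lazy = re.findall(r"c.*?t", sample)

# ".*" may also match nothing at all
empty_middle = re.findall(r"c.*t", "ct")

print(greedy)        # ['cat and cot', 'cut']
print(lazy)          # ['cat', 'cot', 'cut']
print(empty_middle)  # ['ct']
```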

## Anchors

Anchors match a position in a string rather than a character. The caret `^` matches the start of a string, and the dollar sign `$` matches the end of a string. For example, the pattern `^cat` matches "cat" at the start of a string, and the pattern `cat$` matches "cat" at the end of a string. With the `re.MULTILINE` flag, `^` and `$` also match at the start and end of each line.

In [None]:
# Using the re.MULTILINE flag to match "The" at the start of each line
match_positions(r"^The", string, re.MULTILINE)


In [None]:
# (?m) inline flag to match "The" at the start of each line (equivalent to re.MULTILINE)
for match in re.finditer(r"(?m)^The", string):
    # report the inclusive end index, as match_positions() does
    print(f"Match: {match.group()}  Start: {match.start()}  End: {match.end()-1}")
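A small sketch on a made-up three-line string may help to see what `re.MULTILINE` changes for both anchors.

```python
import re

lines = "cat one\ntwo cat\ncat"  # hypothetical three-line sample text

# Without flags, ^ and $ refer to the whole string
start_default = re.findall(r"^cat", lines)
end_default = re.findall(r"cat$", lines)

# With re.MULTILINE, ^ and $ also match at the start and end of every line
start_multiline = re.findall(r"^cat", lines, re.MULTILINE)
end_multiline = re.findall(r"cat$", lines, re.MULTILINE)

print(len(start_default), len(end_default))      # 1 1
print(len(start_multiline), len(end_multiline))  # 2 2
```

In multiline mode, `^cat` matches on lines 1 and 3, while `cat$` matches on lines 2 and 3.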

## Flags

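Flags change how the regex engine interprets a pattern. They can be passed as an argument (e.g., `re.I`) or written inline at the start of the pattern (e.g., `(?i)`). The table below summarizes the most common flags and also contrasts greedy and non-greedy patterns.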

| Flag / Pattern     | Type / Name          | Description                                                       | Example Text                       | `re.findall()` Example                                        | Output                      |
|--------------------|----------------------|-------------------------------------------------------------------|------------------------------------|----------------------------------------------------------------|-----------------------------|
| `re.I`             | Ignore Case          | Match letters regardless of case                                  | "Cat cAt"                          | `re.findall(r"cat", "Cat cAt", re.I)`                          | `['Cat', 'cAt']`           |
| `re.M`             | Multiline            | `^` and `$` match start/end of **lines**, not whole string        | "cat\nbat"                         | `re.findall(r"^cat", "cat\nbat", re.M)`                        | `['cat']`                  |
| `re.S`             | Dot All              | `.` matches newline (`\n`) too                                     | "cat\ndog"                         | `re.findall(r"cat.*dog", "cat\ndog", re.S)`                    | `['cat\ndog']`             |
| `re.X`             | Verbose (Extended)   | Allow whitespace and comments inside pattern                       | "cat   dog"                        | `re.findall(r"cat \s+ dog", "cat   dog", re.X)`                | `['cat   dog']`            |
| `re.U` or `re.UNICODE` | Unicode         | Unicode character support (default in Python 3)                   | "café"                             | `re.findall(r"\w+", "café", re.U)`                             | `['café']`                 |
| `(?i)`             | Inline Ignore Case   | Enables case-insensitive matching within the pattern               | "Cat cAt"                          | `re.findall(r"(?i)cat", "Cat cAt")`                            | `['Cat', 'cAt']`           |
| `(?m)`             | Inline Multiline     | Enables multiline mode inside the pattern                          | "cat\nbat"                         | `re.findall(r"(?m)^cat", "cat\nbat")`                          | `['cat']`                  |
| `(?s)`             | Inline Dot All       | Enables dot-all mode inside the pattern                            | "cat\ndog"                         | `re.findall(r"(?s)cat.*dog", "cat\ndog")`                      | `['cat\ndog']`             |
| `(?x)`             | Inline Verbose       | Enables verbose mode inside the pattern                            | "cat   dog"                        | `re.findall(r"(?x)cat \s+ dog", "cat   dog")`                  | `['cat   dog']`            |
| `a.*b`             | Greedy Pattern       | Matches from first `a` to **last** `b`                             | "a123b456b"                        | `re.findall(r"a.*b", "a123b456b")`                             | `['a123b456b']`            |
| `a.*?b`            | Non-Greedy Pattern   | Matches from first `a` to **first** `b`                            | "a123b456b"                        | `re.findall(r"a.*?b", "a123b456b")`                            | `['a123b']`                |
| `".*"`             | Greedy Quoted Text   | Matches everything between first `"` and last `"`                 | 'He said "yes" and "no"'           | `re.findall(r'".*"', 'He said "yes" and "no"')`                | `['"yes" and "no"']`       |
| `".*?"`            | Lazy Quoted Text     | Matches everything between **each** pair of quotes                | 'He said "yes" and "no"'           | `re.findall(r'".*?"', 'He said "yes" and "no"')`               | `['"yes"', '"no"']`        |


For example, the following two cells are equivalent:

In [None]:
# Using the re.I flag or the (?i) inline flag to match "cat" in different cases
match_positions(r"(?i)CAT", string)

In [None]:
match_positions(r"CAT", string, re.I)
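Flags can be combined with the `|` operator. As a sketch, the following made-up example combines verbose mode, which allows the pattern to be spread over several commented lines, with case-insensitive matching.

```python
import re

# Verbose pattern: whitespace and "#" comments inside the pattern are ignored
pattern = re.compile(
    r"""
    cat     # the literal word cat
    \s+     # one or more whitespace characters
    dog     # the literal word dog
    """,
    re.X | re.I,  # verbose AND case-insensitive
)

found = pattern.findall("Cat   DOG, cat dog")  # hypothetical sample text
print(found)  # ['Cat   DOG', 'cat dog']
```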


## Alternation

Alternation is used to match one of several possible patterns. For example, the pattern `cat|dog` matches either "cat" or "dog".

In [None]:
match_positions(r"cat|dog", string)
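One detail worth keeping in mind is that `|` applies to everything on each side of it, so parentheses are needed to restrict the alternatives. A minimal sketch on a made-up sample string:

```python
import re

sample = "cats dogs cat dog"  # hypothetical sample text

# Without parentheses the alternatives are "cat" and "dogs"
unrestricted = re.findall(r"cat|dogs", sample)

# With parentheses the alternatives are "cat" and "dog", each followed by "s"
restricted = [m.group() for m in re.finditer(r"(cat|dog)s", sample)]

print(unrestricted)  # ['cat', 'dogs', 'cat']
print(restricted)    # ['cats', 'dogs']
```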

## Quantifiers

Quantifiers are used to specify how many times a character or group of characters should appear. For example, the pattern `a{3}` matches exactly three occurrences of the letter "a", and the pattern `a{3,5}` matches three to five occurrences of the letter "a".

In [None]:
another_string = """a b aa bb aaa bbb aaaa bbbb aaaaa bbbbb aaaaaa bbbbbb"""

print("-"*60)
print("0         1         2         3         4         5")
print("012345678901234567890123456789012345678901234567890123456789")
print(another_string)
print("-"*60)

match_positions(r"a{3}", another_string)

In [None]:
print("-"*60)
print("0         1         2         3         4         5")
print("012345678901234567890123456789012345678901234567890123456789")
print(another_string)
print("-"*60)

match_positions(r"a{3,5}", another_string)
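Quantified patterns also match inside longer runs, which explains some of the results above. The following sketch uses short made-up strings to show this, and that `+` is shorthand for `{1,}`.

```python
import re

# A run of six "a" contains two non-overlapping matches of a{3}
exactly_three = re.findall(r"a{3}", "aaaaaa")

# a{3,5} is greedy: it takes five "a", and the single "a" left over is too short
three_to_five = re.findall(r"a{3,5}", "aaaaaa")

# Fewer than three "a" gives no match at all
too_short = re.findall(r"a{3,5}", "aa")

# "+" means "one or more", i.e., the same as {1,}
plus = re.findall(r"a+", "a aa")
one_or_more = re.findall(r"a{1,}", "a aa")

print(exactly_three)      # ['aaa', 'aaa']
print(three_to_five)      # ['aaaaa']
print(too_short)          # []
print(plus, one_or_more)  # ['a', 'aa'] ['a', 'aa']
```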



## Grouping

Grouping is used to group parts of a pattern together. This is useful for applying quantifiers or alternation to a group of characters. For example, the pattern `(ab)+` matches one or more occurrences of the string "ab".


In [None]:
another_string = "ab abab ababab abababab ababababab"

print("-"*60)
print("0         1         2         3         ")
print("0123456789012345678901234567890123456789")
print(another_string)
print("-"*60)

match_positions(r"(ab)+", another_string)
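Parentheses also capture what they match, which changes what `re.findall()` returns: when the pattern contains one group, it returns the content of the group instead of the whole match. The sketch below, on a made-up string, shows this, together with the non-capturing form `(?:...)`, which groups without capturing.

```python
import re

sample = "ab abab"  # hypothetical sample text

# findall returns the captured group (its last repetition), not the whole match
captured = re.findall(r"(ab)+", sample)

# match.group() (or group(0)) always gives the whole match
whole = [m.group() for m in re.finditer(r"(ab)+", sample)]

# A non-capturing group (?:...) makes findall return the whole match again
non_capturing = re.findall(r"(?:ab)+", sample)

print(captured)       # ['ab', 'ab']
print(whole)          # ['ab', 'abab']
print(non_capturing)  # ['ab', 'abab']
```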

## Digit and Word Characters

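`\d` is a metacharacter that matches any single digit. It is commonly used to match numbers in text. For ASCII text it is equivalent to the character class `[0-9]`; in Python 3, `\d` in a `str` pattern also matches other Unicode decimal digits unless the `re.ASCII` flag is used.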

In [None]:
match_positions(r"\d", string)

In [None]:
match_positions(r"\d{3,6}", string)

In [None]:
match_positions(r"[0-9]{3,6}", string)
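On the text above, `\d` and `[0-9]` give the same result. They can differ on non-ASCII text, because in Python 3 `\d` in a `str` pattern matches any Unicode decimal digit. A minimal sketch, using a made-up string that contains two Arabic-Indic digits:

```python
import re

# Hypothetical sample: "250ml" followed by the Arabic-Indic digits three and four
sample = "250ml \u0663\u0664"

unicode_digits = re.findall(r"\d+", sample)           # Unicode-aware (default)
ascii_digits = re.findall(r"\d+", sample, re.ASCII)   # restrict \d to 0-9
class_digits = re.findall(r"[0-9]+", sample)          # explicit ASCII range

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['250']
print(class_digits)         # ['250']
```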

`[a-z]` is a character class that matches any single lowercase letter from "a" to "z", inclusive. For example, the pattern `c[a-z]t` matches any three-character sequence that starts with "c" and ends with "t", where the middle character is a lowercase letter, such as "cat", "cet", or "cxt". Note that the match does not have to be a whole word: it can also occur inside a longer word.


In [None]:
match_positions(r"c[a-z]t", string)

In [None]:
# The pattern c[a-z]+t matches a "c", followed by one or more lowercase letters, followed by a "t",
# e.g., "cat", "coat", or "cxt" (possibly inside a longer word).
match_positions(r"c[a-z]+t", string)

So:

- The pattern `[a-z]+` matches one or more consecutive lowercase letters
- The pattern `[a-z]{3}` matches exactly three consecutive lowercase letters
- The pattern `^[a-z]+$` matches a string that consists entirely of lowercase letters
- You can also use other character classes in regular expressions to match other types of characters, such as uppercase letters (`[A-Z]`), digits (`\d`), whitespace (`\s`), or non-word characters (`\W`).
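Each item in the list above can be checked with a one-line experiment. The strings below are made-up samples chosen to keep the outputs short.

```python
import re

# One or more consecutive lowercase letters (the uppercase "C" and the "5" are skipped)
lower_runs = re.findall(r"[a-z]+", "Cat5 naps")

# Exactly three consecutive lowercase letters, taken from left to right
triples = re.findall(r"[a-z]{3}", "abcdefg hi")

# Whole string made of lowercase letters only
all_lower = bool(re.search(r"^[a-z]+$", "purring"))
not_all_lower = bool(re.search(r"^[a-z]+$", "Purring"))

# Other character classes
uppercase = re.findall(r"[A-Z]", "RGB-blue")
whitespace = re.findall(r"\s", "a b\tc")
non_word = re.findall(r"\W", "V=IR!")

print(lower_runs)                # ['at', 'naps']
print(triples)                   # ['abc', 'def']
print(all_lower, not_all_lower)  # True False
print(uppercase)                 # ['R', 'G', 'B']
print(whitespace)                # [' ', '\t']
print(non_word)                  # ['=', '!']
```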

See https://docs.python.org/3/library/re.html for more information on the Python regular expression module. See also (Campesato, 2018).

Test your regular expressions using https://regex101.com/

# References

- Campesato, O. (2018). Regular expressions: Pocket primer. Mercury Learning and Information.