In [2]:
import re

# Nutshell

## Regular Expressions and the re Module

### Pattern String Syntax

#### Raw strings `r''`

Why do we use *raw* strings for `regex` patterns? The full answer takes a few steps:
- First of all, Python has string escape sequences such as `\n` or `\t`. Few of them collide with `regex` syntax, but some do: 1) `\b` is a backspace in a string literal but a word boundary (an empty match at the edge of a word) in a pattern; 2) `\1` is an octal escape in a string literal but `\number` is a backreference to a group in a pattern. In a raw string, escape sequences are not interpreted.
- More importantly, we may want to *match* these escaped characters: `r'\t'` matches a tab character.
> Pattern elements (such as `r'\t'`, equivalent to the string literal `'\\t'`) do match the corresponding special characters (in this case, the tab character `\t`), so you can use a raw string literal even when you need a literal match for such special characters.
- For these reasons we cannot write a pattern as is in a normal string; we have to escape each backslash with another one: `'\\s+'`, not `'\s+'`. With a raw string this is not needed: `r'\s+'`.
> In raw string literals, escape sequences are not interpreted as in Table 3-3, but are literally copied into the string, including backslashes and newline characters. Raw string literal syntax is handy for strings that include many backslashes, especially regular expression patterns.
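
A minimal sketch of the points above, comparing the same characters written with and without the `r` prefix. The sample text is reused from later in these notes; the variable names `raw_hit` and `plain_hit` are only illustrative.

```python
import re

# '\b' in a normal literal is one character (backspace); in a raw literal it is two.
assert len('\b') == 1
assert len(r'\b') == 2

# A raw literal and a doubled backslash produce the same pattern string.
assert r'\s+' == '\\s+'

# r'\t' is a backslash followed by 't'; the regex engine itself reads it as a tab.
assert re.search(r'\t', 'a\tb') is not None

text = 'preposterous and post facto'
# Raw pattern: \b is a word boundary, so the standalone word is found.
raw_hit = re.search(r'\bpost\b', text)
# Normal literal: \b becomes a backspace character, which the text does not contain.
plain_hit = re.search('\bpost\b', text)

print(raw_hit.group(), plain_hit)  # post None
```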

#### Greedy vs Non-Greedy 

What is the difference between greedy and non-greedy `regex`? 
> All of these examples are greedy, meaning that they match the substring beginning with the first occurrence of 'pre' all the way to the last occurrence of 'post'. When you care about what part of the string you match, you may often want to specify nongreedy matching, which in our example would match the substring beginning with the first occurrence of 'pre' but only up to the first following occurrence of 'post'.

In [1]:
s = 'preposterous and post facto'

In [8]:
# greedy match
re.match(r'pre.*post', s).group()

'preposterous and post'

In [7]:
# non-greedy match: `*?` takes as few characters as possible
re.match(r'pre.*?post', s).group()

'prepost'

#### Word boundaries

In [88]:
s = 'preposterous and post facto'

In [98]:
re.search(r'\w*post', s).group()

'prepost'

In [99]:
re.search(r'\bpost', s).group()

'post'

In [100]:
re.search(r'\bpost\b', s).group()

'post'

#### Alternative word matching

In [101]:
s = "Finnegan-O'Hara"

In [105]:
re.search(r'\b\w+\b', s).group()

'Finnegan'

In [107]:
re.search(r'[a-zA-Z\'\-]+', s).group()

"Finnegan-O'Hara"

#### Named Groups

In [None]:
# Define a regex pattern with named groups
# log_pattern = re.compile(r'''
#     \[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]  # Timestamp
#     \s+
#     (?P<type>USER|SYSTEM|ERROR)  # Log type
#     (?:\s+\((?P<error_code>\d+)\))?:  # Optional error code
#     \s+
#     (?P<message>.+)  # Message content
# ''', re.VERBOSE)

In [85]:
# Define a regex pattern with named groups
log_pattern = re.compile(r"""
    \[(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\]  # Timestamp
    \s+
    (?P<type>USER|SYSTEM|ERROR)                               # Log type
    (?:
        \s+\((?P<error_code>\d+)\)                            # Optional error code 
    )?:                                                
""", re.VERBOSE)

In [86]:
# Example log entries
log_entries = [
    "[2024-08-05 10:15:30] USER: john_doe: Logged in successfully",
    "[2024-08-05 10:16:45] SYSTEM: BACKUP - Daily backup completed",
    "[2024-08-05 10:17:20] ERROR (404): Page not found"
]

In [87]:
# Parse log entries 
for entry in log_entries:
    parsed_entry = {}
    match = log_pattern.match(entry)
    if match:
        parsed_entry['timestamp'] = match.group('timestamp')
        parsed_entry['type'] = match.group('type')
        if match.group('error_code'):
            parsed_entry['error_code'] = match.group('error_code')
        # print(f"  Message: {match.group('message')}")
        print(parsed_entry)

{'timestamp': '2024-08-05 10:15:30', 'type': 'USER'}
{'timestamp': '2024-08-05 10:16:45', 'type': 'SYSTEM'}
{'timestamp': '2024-08-05 10:17:20', 'type': 'ERROR', 'error_code': '404'}
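
The pattern above stops at the colon, so the message text is not captured. As a sketch, assuming the rest of the line after the colon should be kept as one `message` group, the pattern could be extended as follows; `full_log_pattern` is a hypothetical name, and `groupdict()` is used to collect all named groups at once.

```python
import re

# Same structure as log_pattern above, plus a `message` group for the rest of the line.
full_log_pattern = re.compile(r"""
    \[(?P<timestamp>\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\]  # Timestamp
    \s+
    (?P<type>USER|SYSTEM|ERROR)                               # Log type
    (?:\s+\((?P<error_code>\d+)\))?                           # Optional error code
    :\s+
    (?P<message>.+)                                           # Message content
""", re.VERBOSE)

entry = "[2024-08-05 10:17:20] ERROR (404): Page not found"
m = full_log_pattern.match(entry)

# groupdict() maps each group name to its text; unmatched optional groups are None.
parsed = {name: value for name, value in m.groupdict().items() if value is not None}
print(parsed)
```

For an entry without an error code, `error_code` is `None` in `groupdict()` and is dropped by the dictionary comprehension.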


## Match Versus Search

What's the difference between `match` and `search`? It turns out that `match` is implicitly anchored at the beginning of the string, while `search` looks for a match anywhere in it.
> So far, we’ve been using regular expressions to match strings. For example, the RE with pattern `r'box'` matches strings such as `'box'` and `'boxes'`, but not `'inbox'`. In other words, an RE match is implicitly anchored at the start of the target string, as if the RE’s pattern started with `\A`.

In [20]:
s = 'inbox'

In [13]:
re.match(r'box', s) is None

True

In [15]:
re.search(r'box', s) is None

False

To test if the full string matches we may use `fullmatch()`.

In [17]:
re.fullmatch(r'box', s) is None

True

In [24]:
re.fullmatch(r'.*box', s) is None

False

## Anchoring at String Start and End

### Read entire file

> `$` Matches end of string (if MULTILINE, also matches right before `\n`)

In [40]:
digatend = re.compile(r'\d$', re.MULTILINE)
with open('afile.txt', 'r') as f:
    fstring = f.read()
    print(f'fstring: {repr(fstring)}')
    match = digatend.search(fstring)
    if match:
        print(f'match:{match.group()}')

fstring: 'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene 1\n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.\n\n'
match:1


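Without this flag, `$` matches only at the very end of the string (or just before a single trailing `\n`). Here the file ends with `\n\n`, so the digit in `Scene 1` is not matched and nothing is printed.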

In [45]:
digatend = re.compile(r'\d$')
with open('afile.txt', 'r') as f:
    fstring = f.read()
    print(f'fstring: {repr(fstring)}')
    match = digatend.search(fstring)
    if match:
        print(f'match:{match.group()}')

fstring: 'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene 1\n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.\n\n'


In [44]:
digatend = re.compile(r'\d$')
with open('afile2.txt', 'r') as f:
    fstring = f.read()
    print(f'fstring: {repr(fstring)}')
    match = digatend.search(fstring)
    if match:
        print(f'match:{match.group()}')

fstring: 'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene 1\n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2\n'
match:2


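Finally, let's show that `r'\d$'` and `r'\d\n'` are not always equivalent: `r'\d\n'` needs an actual newline after the digit, so it stops matching once the trailing `\n` is stripped, whereas `r'\d$'` would still match.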

In [72]:
s = """The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene \n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2\n"""

In [73]:
print(repr(s))

'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene \n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2\n'


In [74]:
print(repr(s.strip()))

'The Life and Death of King Richard III\nby William Shakespeare\n\nACT I\n\nScene \n\nNow is the winter of our discontent\nMade glorious summer by this son of York,\nAnd all the clouds that loured upon our house\nIn the deep bosom of the ocean buried.2'


In [81]:
re.search(r'\d\n', s, re.MULTILINE).group(), re.search(r'\d\n', s.strip()) is None

('2\n', True)

### Read line by line

For a single line, `r'\d$'` matches whether or not the line ends with `\n`. If we need to distinguish these two cases, we can use the pattern `r'\d\n$'`, which requires the newline.

In [55]:
re.search(r'\d$', 'Scene 1') is None, re.search(r'\d\n$', 'Scene 1') is None

(False, True)

In [57]:
re.search(r'\d$', 'Scene 1\n') is None, re.search(r'\d\n$', 'Scene 1\n') is None

(False, False)
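
As a sketch of actual line-by-line reading, the example below iterates over a file-like object. It uses `io.StringIO` with made-up contents as a stand-in for a real file, so it runs without `afile.txt`; each line yielded by the iteration keeps its trailing `\n`, and `$` still matches just before it.

```python
import io
import re

digit_at_end = re.compile(r'\d$')

# Stand-in for open('afile.txt'); the contents are invented for illustration.
fake_file = io.StringIO('ACT I\nScene 1\nNow is the winter of our discontent\n')

# Iterating a file yields lines that still end with '\n'.
hits = [line for line in fake_file if digit_at_end.search(line)]
print(hits)  # ['Scene 1\n']
```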

### Alternative anchors

It turns out that there are alternative anchors for the beginning and the end of the whole string: `\A` and `\Z`:
> For RE objects that are not flagged as `MULTILINE`, `^` is the same as `\A`, and `$` is the same as `\Z`. For a multiline RE, however, `^` can anchor at the start of the string or the start of any line (where “lines” are determined based on `\n` separator characters). Similarly, with a multiline RE, `$` can anchor at the end of the string or the end of any line. `\A` and `\Z` always anchor exclusively at the start and end of the string, whether the RE object is multiline or not.
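
A small sketch comparing the four anchors on a made-up two-line string. One nuance it makes visible: without `MULTILINE`, `$` also matches just before a single newline at the very end of the string, whereas `\Z` matches only at the true end.

```python
import re

text = 'Scene 1\nScene 2\n'

# Without MULTILINE, ^ anchors only at the start of the whole string.
starts_default = re.findall(r'^Scene', text)
# With MULTILINE, ^ also anchors right after each newline.
starts_multiline = re.findall(r'^Scene', text, re.MULTILINE)
# \A ignores the flag and anchors only at the start of the string.
starts_a = re.findall(r'\AScene', text, re.MULTILINE)

# $ tolerates one trailing newline at the end of the string...
dollar = re.search(r'\d$', text)
# ...while \Z does not, so this search fails on a string ending in '\n'.
z_anchor = re.search(r'\d\Z', text)

print(starts_default, starts_multiline, starts_a)
print(dollar.group(), z_anchor)
```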

## Regular Expression Object

### `findall()`

> When `regex` has no groups, `findall` returns a list of strings, each a substring of `s` that is a nonoverlapping match with `regex`.

In [123]:
# Find list of words in the file
reword = re.compile(r'\w+') 
words = []
with open('afile.txt') as f:
    for word in reword.findall(f.read()): 
        words.append(word.lower())

In [124]:
words[:5]

['the', 'life', 'and', 'death', 'of']

> When `regex` has exactly one group, `findall` also returns a list of strings, but each is the substring of `s` that matches `regex` group.

In [121]:
# Find words that are followed by punctuation
reword = re.compile(r'(\w+)[,.!?]') 
words = []
with open('afile.txt') as f:
    for word in reword.findall(f.read()): 
        words.append(word.lower())

In [122]:
words

['york', 'buried']

> When `regex` has `n` groups (with `n > 1`), `findall` returns a list of tuples, one per nonoverlapping match with `regex`.

In [129]:
# Find the first and the last words in a line
reword = re.compile(r'^\W*(\w+)\b.*\b(\w+)\W*$', re.MULTILINE) 
words = []
with open('afile.txt') as f:
    for word in reword.findall(f.read()): 
        words.append(word)

In [130]:
words

[('The', 'III'),
 ('by', 'Shakespeare'),
 ('ACT', 'I'),
 ('Scene', '1'),
 ('Now', 'discontent'),
 ('Made', 'York'),
 ('And', 'house'),
 ('In', 'buried')]

## Functions of the re Module

> It is usually better to compile pattern strings into RE objects explicitly and call the RE object’s methods, but sometimes, for a one-off use of an RE pattern, calling functions of the module re can be handier.
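
A short sketch of the two styles side by side, on a made-up sentence. Both calls use the same pattern and return the same result; the module-level function simply takes the pattern string as its first argument.

```python
import re

text = 'Now is the winter of our discontent'

# One-off use: call the module-level function with the pattern string.
one_off = re.findall(r'\b\w{3}\b', text)

# Repeated use: compile once, then call methods on the RE object.
three_letters = re.compile(r'\b\w{3}\b')
reused = three_letters.findall(text)

print(one_off)  # ['Now', 'the', 'our']
```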

## REs and the `:=` Operator

> In Perl, the `if ($var =~ /regExpr/)` statement both evaluates the regular expression and saves the successful match in the variable `var`. The introduction of the `:=` operator in Python 3.8 established support for a successive-match idiom in Python similar to the one that’s common in Perl. 

In [111]:
# statement = 'I love Python'
statement = 'Ich liebe Python'

In [112]:
if m := re.match(r'I love (\w+)', statement):
    print(f'He loves {m.group(1)}')
elif m := re.match(r'Ich liebe (\w+)', statement):
    print(f'Er liebt {m.group(1)}')
elif m := re.match(r"J'aime (\w+)", statement):
    print(f'Il aime {m.group(1)}')

Er liebt Python


# Exercises 

## Exercise 1: Extracting Dates

**Objective:** Write a Python function that extracts all dates from a given text. The dates can be in the formats `DD/MM/YYYY`, `DD-MM-YYYY`, or `DD.MM.YYYY`.


**Description:**
- Your function should take a string as input.
- It should return a list of all dates found in the input string.
- The dates should be in the format `DD/MM/YYYY`, `DD-MM-YYYY`, or `DD.MM.YYYY`.

In [30]:
text = "Today's date is 05/08/2024. Another important date is 12-11-2023. Also, don't forget 23.09.2022!"

In [31]:
date_pattern = re.compile(r"\d{2}[/.-]\d{2}[/.-]\d{4}")

In [35]:
for token in text.split():
    match = date_pattern.match(token)
    if match:
        print(match.group())

05/08/2024
12-11-2023
23.09.2022
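
The cells above print the dates token by token. Since the exercise asks for a function that returns a list, one possible sketch wraps the same pattern in `findall`; the name `extract_dates` and the added `\b` word boundaries are assumptions, not part of the original solution.

```python
import re

# Same pattern as date_pattern, with word boundaries so that longer digit runs are skipped.
DATE_PATTERN = re.compile(r'\b\d{2}[/.-]\d{2}[/.-]\d{4}\b')

def extract_dates(text):
    """Return all DD/MM/YYYY, DD-MM-YYYY or DD.MM.YYYY substrings found in text."""
    return DATE_PATTERN.findall(text)

text = ("Today's date is 05/08/2024. Another important date is 12-11-2023. "
        "Also, don't forget 23.09.2022!")
dates = extract_dates(text)
print(dates)  # ['05/08/2024', '12-11-2023', '23.09.2022']
```

Unlike splitting on whitespace and calling `match` on each token, `findall` also finds a date that is glued to preceding punctuation. Like the original pattern, this one accepts mixed separators such as `05/08-2024` and does not check that the day and month are in range.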


## Exercise 2: Validating Email Addresses

**Objective:** Write a Python function that validates a list of email addresses. The function should return a list of valid email addresses.

**Description:**
- Your function should take a list of email addresses as an input.
- It should return a list of email addresses that match the pattern of a valid email address.
- A valid email address should follow the pattern: `[username]@[domain].[extension]`, where:
    - `[username]` is non-empty and can contain letters, digits, underscores, periods, and hyphens.
    - `[domain]` is non-empty and can contain letters, digits, and hyphens.
    - `[extension]` contains only letters and is 2 to 6 characters long.

In [26]:
emails = ["test.email@example.com", 
          "test09_eMail-@example.com",
          "@example.com",
          "test.email@example09-.com",
          "invalid-email@.com", 
          "user.name@domain.co", 
          "user.name@domain.cocococo",
          "user.name@domain.c",
          "user@domaincom"]

In [27]:
email_pattern = re.compile(r"[\w.-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,6}")

In [29]:
for email in emails:
    match = email_pattern.fullmatch(email)
    if match: 
        print(f"{email} is valid, match:{match.group()} ")
    else:
        print(f"{email} is NOT valid")

test.email@example.com is valid, match:test.email@example.com 
test09_eMail-@example.com is valid, match:test09_eMail-@example.com 
@example.com is NOT valid
test.email@example09-.com is valid, match:test.email@example09-.com 
invalid-email@.com is NOT valid
user.name@domain.co is valid, match:user.name@domain.co 
user.name@domain.cocococo is NOT valid
user.name@domain.c is NOT valid
user@domaincom is NOT valid
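
The loop above prints a verdict per address. As a sketch of the function the exercise asks for, the same pattern can be used in a list comprehension; `valid_emails` is a hypothetical name and the sample list is a subset of the addresses above.

```python
import re

# Same pattern as email_pattern above.
EMAIL_PATTERN = re.compile(r'[\w.-]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,6}')

def valid_emails(emails):
    """Return only the addresses that match the whole pattern."""
    return [email for email in emails if EMAIL_PATTERN.fullmatch(email)]

sample = ["test.email@example.com", "@example.com",
          "user.name@domain.co", "user.name@domain.c", "user@domaincom"]
accepted = valid_emails(sample)
print(accepted)  # ['test.email@example.com', 'user.name@domain.co']
```

Note that `\w` in a `str` pattern also matches non-ASCII letters and digits; if only ASCII characters should be allowed in the username, passing `re.ASCII` to `re.compile` is one way to restrict it.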


## Exercise 3: Log File Parser

**Objective:** Write a Python function that parses a log file and extracts specific information from each log entry. The log file contains entries with varying formats, and you need to extract and categorize the information.

**Description:**
- Your function should take a string (the contents of the log file) as input.
- It should return a *list of dictionaries*, where each dictionary represents a parsed log entry.
- The log entries can be in one of the following formats:
    - `[TIMESTAMP] USER: MESSAGE`
    - `[TIMESTAMP] SYSTEM: ACTION - DETAILS`
    - `[TIMESTAMP] ERROR (CODE): ERROR_MESSAGE`
- The timestamp format is always `YYYY-MM-DD HH:MM:SS`
- Extract the following information for each log entry:
    - Timestamp
    - Type (`USER`, `SYSTEM`, or `ERROR`)
    - User name (for `USER` type)
    - Message (for `USER` type)
    - Action and Details (for `SYSTEM` type)
    - Error Code and Error Message (for `ERROR` type)

In [37]:
with open('log_file.txt', 'r') as file:
    for line in file:
        print(line.strip())

[2024-08-05 10:15:30] USER: john_doe: Logged in successfully
[2024-08-05 10:16:45] SYSTEM: BACKUP - Daily backup completed
[2024-08-05 10:17:20] ERROR (404): Page not found
[2024-08-05 10:18:00] USER: jane_smith: Uploaded file 'document.pdf'
[2024-08-05 10:19:15] SYSTEM: UPDATE - System updated to version 2.1
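
One possible sketch of a parser for these three formats, not a definitive solution. It matches the common header first and then applies a type-specific pattern to the rest of the line. All names (`HEADER`, `BODY`, `parse_log`) and the dictionary keys are assumptions chosen for illustration, and the sample string stands in for the contents of `log_file.txt`.

```python
import re

# Common prefix: timestamp and type; everything after the type goes into `rest`.
HEADER = re.compile(
    r'\[(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+'
    r'(?P<type>USER|SYSTEM|ERROR)\s*(?P<rest>.*)'
)

# One pattern per entry type, applied to the `rest` part of the line.
BODY = {
    'USER': re.compile(r':\s+(?P<user>\w+):\s+(?P<message>.+)'),
    'SYSTEM': re.compile(r':\s+(?P<action>\w+)\s+-\s+(?P<details>.+)'),
    'ERROR': re.compile(r'\((?P<error_code>\d+)\):\s+(?P<error_message>.+)'),
}

def parse_log(contents):
    """Return a list of dictionaries, one per recognized log entry."""
    entries = []
    for line in contents.splitlines():
        header = HEADER.match(line.strip())
        if not header:
            continue  # skip lines that do not look like log entries
        body = BODY[header['type']].fullmatch(header['rest'])
        if not body:
            continue  # header is fine but the body has an unexpected shape
        entries.append({
            'timestamp': header['timestamp'],
            'type': header['type'],
            **body.groupdict(),
        })
    return entries

sample = """[2024-08-05 10:15:30] USER: john_doe: Logged in successfully
[2024-08-05 10:16:45] SYSTEM: BACKUP - Daily backup completed
[2024-08-05 10:17:20] ERROR (404): Page not found"""

entries = parse_log(sample)
for entry in entries:
    print(entry)
```

Splitting the work into a header pattern and per-type body patterns keeps each regular expression short; a single pattern with alternation would also work but is harder to read.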
