# Python Regular Expressions

This notebook covers regular expressions (regex) in Python, based on [Google's Python Class](https://developers.google.com/edu/python/), specially modified for Marina PANDOLFINO by Daniel Patrick MORGAN.

## What are Regular Expressions?

**Regular expressions** (often called "regex" or "regexp") are patterns that describe sets of strings. They're like a super-powered "find and replace" tool that can match complex patterns in text.

**Why use regular expressions?**
- Find specific patterns in text (e.g., email addresses, phone numbers, dates)
- Extract information from structured data
- Validate that text matches a certain format
- Replace patterns with other text

**Example:** Instead of searching for exact text like "M-00001", you could search for any pattern like "M-" followed by digits, which would match "M-00001", "M-12345", etc.

In Python, we use the `re` module for regular expressions.


In [2]:
# First, we need to import the re module
import re
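
As a first taste, the "M-" example above can be sketched as follows. The sample text and its IDs are invented for illustration.

```python
import re

# Hypothetical sample text; the IDs are invented for illustration
text = "Entries M-00001 and M-12345 are done; N-00002 is not."

# "M-" followed by one or more digits
ids = re.findall(r"M-\d+", text)
print(ids)  # ['M-00001', 'M-12345']
```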


## Escape Sequences

Before we dive into regex patterns, let's understand **escape sequences** - special characters that represent things you can't type directly.

**Common escape sequences:**
- `\n` - Newline (line break)
- `\t` - Tab
- `\"` - Double quote (when inside a string that uses double quotes)
- `\'` - Single quote (when inside a string that uses single quotes)
- `\\` - Backslash (the escape character itself)

**Why escape?** Some characters have special meaning in Python strings. The backslash `\` tells Python "the next character is special, treat it differently."


In [None]:
# Examples of escape sequences
print("Line 1\nLine 2")  # \n creates a new line
print("Column1\tColumn2\tColumn3")  # \t creates a tab
print("He said \"Hello\"")  # \" allows quotes inside a string
print("Path: C:\\Users\\Documents")  # \\ creates a single backslash


## Raw Strings

When working with regular expressions, you'll often use **raw strings** (strings prefixed with `r`). Raw strings treat backslashes as literal characters, not escape sequences.

**Why use raw strings for regex?**
- Regex patterns often contain backslashes (like `\d` for digits)
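- In a normal string, Python processes backslash escapes first: `"\b"` becomes a backspace character rather than the regex word boundary, and unrecognized escapes such as `"\d"` trigger a warning in recent Python versions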
- In a raw string, `\d` stays as `\d`, which is what regex needs

**Syntax:** `r"pattern"` instead of `"pattern"`


In [None]:
# Example: Normal string vs raw string
normal_string = "Hello\nWorld"  # \n is interpreted as newline
raw_string = r"Hello\nWorld"     # \n is treated as literal characters

print("Normal string:", repr(normal_string))  # repr() shows the actual characters
print("Raw string:", repr(raw_string))

# For regex, we almost always use raw strings
pattern = r"\d+"  # This means "one or more digits" in regex
print(f"Pattern: {pattern}")
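
A small sketch of where the difference actually bites: `\b` means "word boundary" to the regex engine, but in a normal Python string it is the backspace character. The sample text is made up.

```python
import re

text = "cat concat cat"

# Normal string: "\b" is a backspace character, so nothing matches
normal_result = re.findall("\bcat\b", text)
print(normal_result)  # []

# Raw string: \b reaches the regex engine as a word boundary
raw_result = re.findall(r"\bcat\b", text)
print(raw_result)  # ['cat', 'cat']
```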


## Basic Regex Patterns

Here are the most common regex patterns you'll use:

**Special Characters:**
- `.` - Matches any single character (except newline)
- `*` - Matches zero or more of the preceding element (a character, class, or group)
- `+` - Matches one or more of the preceding element
- `?` - Matches zero or one of the preceding element

**Character Classes:**
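- `\d` - Matches any digit (0-9; in Python 3 text patterns, also other Unicode decimal digits such as full-width ０-９)
- `\w` - Matches any word character (letter, digit, or underscore; this includes non-ASCII letters such as kanji and kana)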
- `\s` - Matches any whitespace character (space, tab, newline)
- `\D` - Matches any non-digit
- `\W` - Matches any non-word character
- `\S` - Matches any non-whitespace character


In [3]:
# Example: Basic patterns
text = "The year is 2024, and Python 3.12 is great!"

# Find all digits
digits = re.findall(r'\d', text)
print(f"All digits: {digits}")

# Find sequences of digits (years, versions)
numbers = re.findall(r'\d+', text)
print(f"Number sequences: {numbers}")

# Find word characters
words = re.findall(r'\w+', text)
print(f"Words: {words[:5]}...")  # Show first 5


All digits: ['2', '0', '2', '4', '3', '1', '2']
Number sequences: ['2024', '3', '12']
Words: ['The', 'year', 'is', '2024', 'and']...
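
The cell above exercises `\d` and `\w`. Here is a short sketch of the special characters `.`, `*`, `+`, and `?`, using made-up strings.

```python
import re

# ? makes the preceding "u" optional
optional = re.findall(r"colou?r", "color colour")
print(optional)  # ['color', 'colour']

# * allows zero or more "b"; + requires at least one
star = re.findall(r"ab*", "a ab abbb")
plus = re.findall(r"ab+", "a ab abbb")
print(star)  # ['a', 'ab', 'abbb']
print(plus)  # ['ab', 'abbb']

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cot c-t ct")
print(dot)  # ['cat', 'cot', 'c-t']
```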


## Character Classes

You can define your own character classes using square brackets `[]`:

- `[abc]` - Matches any one of: a, b, or c
- `[0-9]` - Matches any ASCII digit (like `\d`, except that `\d` also matches other Unicode digits)
- `[a-z]` - Matches any lowercase letter
- `[A-Z]` - Matches any uppercase letter
- `[^abc]` - Matches any character EXCEPT a, b, or c (the `^` means "not")


In [None]:
# Example: Character classes
text = "Python 3.12, Java 17, C++ 20"

# Find all vowels
vowels = re.findall(r'[aeiouAEIOU]', text)
print(f"Vowels: {vowels}")

# Find all digits
digits = re.findall(r'[0-9]', text)
print(f"Digits: {digits}")

# Find sequences of digits
numbers = re.findall(r'[0-9]+', text)
print(f"Numbers: {numbers}")
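
Two points from the list above that the cell does not exercise: negation with `[^...]`, and the difference between `[0-9]` and `\d` on non-ASCII digits. The full-width sample is invented, but it is the kind of thing that turns up in Japanese text.

```python
import re

text = "Python 3.12, Java 17, C++ 20"

# [^...] negates the class: everything that is not a letter, digit, or space
symbols = re.findall(r"[^A-Za-z0-9 ]", text)
print(symbols)  # ['.', ',', ',', '+', '+']

# \d also matches full-width digits; [0-9] matches ASCII digits only
mixed = "2024 ２０２４"
unicode_digits = re.findall(r"\d+", mixed)
ascii_digits = re.findall(r"[0-9]+", mixed)
print(unicode_digits)  # ['2024', '２０２４']
print(ascii_digits)    # ['2024']
```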


## Anchors

**Anchors** don't match characters - they match positions in the string:

- `^` - Matches the start of the string
- `$` - Matches the end of the string (or just before a trailing newline)

**Example:**
- `^Hello` - Matches "Hello" only if it's at the start of the string
- `world$` - Matches "world" only if it's at the end of the string
- `^Hello world$` - Matches only if the entire string is exactly "Hello world"


In [None]:
# Example: Anchors
text1 = "Hello world"
text2 = "Say Hello world"
text3 = "Hello world goodbye"

# Match "Hello" at the start
match1 = re.search(r'^Hello', text1)
match2 = re.search(r'^Hello', text2)
print(f"'^Hello' in '{text1}': {match1 is not None}")
print(f"'^Hello' in '{text2}': {match2 is not None}")

# Match "world" at the end
match3 = re.search(r'world$', text1)
match4 = re.search(r'world$', text3)
print(f"'world$' in '{text1}': {match3 is not None}")
print(f"'world$' in '{text3}': {match4 is not None}")
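
Combining both anchors is a common way to validate a whole string. A minimal sketch, assuming a hypothetical ID format of "M-" followed by exactly five digits (the five-digit rule is an assumption for illustration):

```python
import re

# Hypothetical ID format: "M-" followed by exactly five digits
id_pattern = r"^M-\d{5}$"

candidates = ["M-00001", "M-123", "ID M-00001", "M-00001x"]
results = {c: re.search(id_pattern, c) is not None for c in candidates}

for candidate, is_valid in results.items():
    print(f"{candidate!r}: {is_valid}")
```

Only the first candidate passes. `re.fullmatch(pattern, string)` does a similar job without writing the anchors yourself.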


## re.search() - Finding the First Match

`re.search(pattern, string)` searches for the **first occurrence** of the pattern in the string. It returns a **Match object** if found, or `None` if not found.

**Match object methods:**
- `.group()` - Returns the matched string
- `.start()` - Returns the start position of the match
- `.end()` - Returns the end position of the match (the index just after the last matched character)


In [None]:
# Example: re.search()
text = "The year is 2024, and Python 3.12 is great!"

# Search for first sequence of digits
match = re.search(r'\d+', text)
if match:
    print(f"Found: {match.group()}")
    print(f"Position: {match.start()} to {match.end()}")
else:
    print("No match found")


## re.findall() - Finding All Matches

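`re.findall(pattern, string)` finds **all non-overlapping occurrences** of the pattern and returns them as a list of strings. If the pattern contains groups, it returns the group contents instead (a list of tuples when there is more than one group).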

**When to use:**
- `re.search()` - When you only need the first match
- `re.findall()` - When you need all matches


In [None]:
# Example: re.findall()
text = "The years are 2020, 2021, 2022, 2023, and 2024."

# Find all four-digit sequences
all_years = re.findall(r'\d{4}', text)  # \d{4} means exactly 4 digits
print(f"All years found: {all_years}")

# Find all words starting with capital letters
capital_words = re.findall(r'[A-Z][a-z]+', text)
print(f"Capitalized words: {capital_words}")


## re.sub() - Replacing Matches

`re.sub(pattern, replacement, string)` replaces all matches of the pattern with the replacement string.

**Syntax:** `re.sub(pattern, replacement, string)`


In [None]:
# Example: re.sub()
text = "The year is 2024, and Python 3.12 is great!"

# Replace all digits with "X"
masked = re.sub(r'\d', 'X', text)
print(f"Masked: {masked}")

# Replace all sequences of digits with "[NUMBER]"
numbered = re.sub(r'\d+', '[NUMBER]', text)
print(f"Numbered: {numbered}")
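
The replacement does not have to be a fixed string: `re.sub()` also accepts a function that receives each Match object and returns the replacement text. A minimal sketch with a made-up `swaps` dictionary, swapping two words in a single pass:

```python
import re

# Hypothetical word swaps, invented for illustration
swaps = {"cats": "dogs", "dogs": "cats"}

# One pattern that matches any of the dictionary keys
pattern = "|".join(re.escape(word) for word in swaps)

sample = "I like cats and dogs."

# The function looks up each matched word in the dictionary
swapped = re.sub(pattern, lambda m: swaps[m.group()], sample)
print(swapped)  # I like dogs and cats.
```

Two chained `str.replace()` calls would not work here, because the second call would undo the first.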


## re.split() - Splitting by Pattern

`re.split(pattern, string)` splits the string wherever the pattern matches, similar to `str.split()` but using regex patterns.

**Example:** Split on any whitespace (spaces, tabs, newlines)


In [None]:
# Example: re.split()
text = "Apple, Banana; Cherry: Date"

# Split on punctuation (comma, semicolon, or colon) plus any whitespace after it,
# so the results don't keep a leading space
fruits = re.split(r'[,;:]\s*', text)
print(f"Fruits: {fruits}")

# Split on any whitespace
text2 = "Hello   World\tPython\nRegex"
words = re.split(r'\s+', text2)  # \s+ means one or more whitespace characters
print(f"Words: {words}")


## Groups - Capturing Parts of Matches

**Groups** use parentheses `()` to capture parts of a match. This is very useful for extracting specific information.

**How it works:**
- `(pattern)` - Creates a group
- `match.group(1)` - Returns the first group
- `match.group(2)` - Returns the second group
- `match.group(0)` or `match.group()` - Returns the entire match


In [None]:
# Example: Groups
text = "Contact: email@example.com or phone: 555-1234"

# Extract email (word characters, @, word characters, ., word characters)
email_match = re.search(r'(\w+)@(\w+\.\w+)', text)
if email_match:
    print(f"Full match: {email_match.group(0)}")
    print(f"Username: {email_match.group(1)}")
    print(f"Domain: {email_match.group(2)}")

# Extract phone number (three digits, a hyphen, four digits)
phone_match = re.search(r'(\d{3})-(\d{4})', text)
if phone_match:
    print(f"Full match: {phone_match.group(0)}")
    print(f"Prefix: {phone_match.group(1)}")
    print(f"Number: {phone_match.group(2)}")
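
Groups also change what `re.findall()` returns: with groups in the pattern, you get the group contents rather than the whole match. A short sketch with invented phone numbers:

```python
import re

text = "Call 555-1234 or 555-9876"

# No groups: a list of whole matches
whole = re.findall(r"\d{3}-\d{4}", text)
print(whole)  # ['555-1234', '555-9876']

# Two groups: a list of (group 1, group 2) tuples
parts = re.findall(r"(\d{3})-(\d{4})", text)
print(parts)  # [('555', '1234'), ('555', '9876')]
```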


## Domain-Specific Examples: Working with Japanese Text and Dictionary Data

Now let's apply regex to real problems from your project!


### Example 1: Finding Japanese Characters in Historical Text

For some reason, Daniel has the first two volumes of the 太平記 on his computer. Obviously, Marina's secret plan to turn him into a Japanologist and dress him up in a special Japanologist bonnet that she has knitted for him is coming to fruition. However, to push Daniel over the brink, Marina must extract data from these files.


In [None]:
# Load first and second volumes of the Taiheiki
with open('../data/太平記·卷一.txt', 'r', encoding='utf-8') as f:
    taiheiki_text = f.read()
with open('../data/太平記·卷二.txt', 'r', encoding='utf-8') as f:
    taiheiki_text += f.read()

# Let's find all mentions of an emperor
emperor_matches = re.findall(r'天皇', taiheiki_text)
print(f"Found {len(emperor_matches)} mentions of '天皇'")
print(f"First 10: {emperor_matches[:10]}")

# That is pretty useless except as a count; what we want is context!
# As such, let's capture the words before and after every mention of an emperor using \w


# OK, that gives us a CLAUSE, but we want a full sentence.
# First, delete all spaces, line breaks, and tabs.

# Second, 'split' the text into sentences at 。

# Third, iterate through each sentence, and if it contains the emperor, print it.


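
One possible way to fill in the steps listed in the cell above, sketched on a short invented sample rather than the real files. The `sample` text is made up and is NOT a quotation from the Taiheiki; swap in `taiheiki_text` once the files are loaded.

```python
import re

# Invented sample text, NOT a quotation from the Taiheiki
sample = "天皇は都におはす。\n武士は 東国にあり。\n後の天皇の御代、世は乱れたり。"

# Context as a clause: word characters on either side of 天皇
# (\w matches kanji and kana, but not punctuation such as 。 or 、)
clauses = re.findall(r"\w*天皇\w*", sample)
print(clauses)  # ['天皇は都におはす', '後の天皇の御代']

# First, delete all spaces, line breaks, and tabs
cleaned = re.sub(r"\s+", "", sample)

# Second, split into sentences at 。 (dropping the empty string after the last 。)
sentences = [s for s in re.split(r"。", cleaned) if s]

# Third, keep and print each sentence that mentions the emperor
with_emperor = [s for s in sentences if re.search(r"天皇", s)]
for sentence in with_emperor:
    print(sentence)
```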


In [None]:
# Oooh, look, there are odoriji (々) in this text! 
# Let's make a list of the characters that they go with!
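
A possible approach, sketched on an invented sample (NOT a quotation from the Taiheiki): capture the single character in front of each 々 with a group, then count the results. Using `Counter` is an addition here, not something the notebook requires.

```python
import re
from collections import Counter

# Invented sample text, NOT a quotation from the Taiheiki
sample = "人々は日々の営みを続け、代々の帝に仕ふ。人々は喜ぶ。"

# Capture the single word character immediately before each 々
repeated = re.findall(r"(\w)々", sample)
print(repeated)  # ['人', '日', '代', '人']

# Count how often each character is doubled
counts = Counter(repeated)
print(counts.most_common())
```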

### Example 2: Extracting Dates

Chinese dates have predictable patterns, and since the Japanese just copied the Chinese for everything, without any original ideas of their own, their dates work the same way. Let's extract all the dates from the first two volumes of the Taiheiki and show them to Daniel. Once he sees that there is history of astronomy stuff to do in Japan, and that it's all basically the same, then he will become a Japanologist and finally lead a life of meaning.

In [None]:
# Use a character class [] to find all examples of numbers followed by 年

# Now, expand that regex pattern to include the Chinese characters preceding the number(s)
# (to capture the political era) (CJK characters in the range \u4e00-\u9fff)
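
A rough sketch of both steps on an invented sample (NOT a quotation from the Taiheiki). It makes two assumptions: years are written with kanji numerals (plus 元 for a first year), and era names are exactly two characters long. The second pattern is only a heuristic, since a multi-character numeral such as 三十二年 would have its first two characters mistaken for an era name.

```python
import re

# Invented sample text, NOT a quotation from the Taiheiki
sample = "元亨二年の春、都に事あり。正中元年九月、又乱起る。"

# Character class of kanji numerals (plus 元), one or more times, followed by 年
years = re.findall(r"[元一二三四五六七八九十]+年", sample)
print(years)  # ['二年', '元年']

# Add a group for the two CJK characters in front of the numeral (the era name)
dated = re.findall(r"([\u4e00-\u9fff]{2})([元一二三四五六七八九十]+)年", sample)
print(dated)  # [('元亨', '二'), ('正中', '元')]
```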

### Example 3: Find and replace

After Marina discovered the Taiheiki on Daniel's computer and valiantly extracted calendrical data therefrom, Daniel nevertheless put up resistance. Please help Daniel express his forbidden desires using `re.sub()`:


In [None]:
daniel_text = """
Marina, that's great, but in my heart I know 
    that I am a Sinologist, 
    that I have always been a Sinologist, 
    and that I will always be a Sinologist, until the day I die.
It's not just that China is superior, and that Sinologists are less visibly crippled by autism,
it's that I am far too handsome and sociable to be a Japanologist.
Imagine the pressure that that would put on your poor colleagues! 
Also, I like my new beret and want to wear it everywhere I go.
"""