# Chapter 23: Regular Expressions

Master a powerful tool for text pattern matching and handle complex string operations.

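## What Is a Regular Expression?

Imagine you need to pull every phone number, email address, and URL out of a large body of text. With ordinary string methods the code quickly becomes long and complicated. This is where **regular expressions** (regex for short) come in.

**A regular expression describes a string pattern in a special syntax**, so that text can be matched, searched, and replaced quickly.

### Why Do We Need Regular Expressions?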

In [None]:
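# Without regex: extracting all the numbers is tedious
text = "Order no: 12345, amount: 99.8 yuan, quantity: 3 items"
numbers = []
current = ""
for char in text:
    if char.isdigit() or char == ".":
        current += char
    elif current:
        numbers.append(current)
        current = ""
if current:  # flush a number that ends the string
    numbers.append(current)
# Long, and easy to get wrong!

# With regex: one line
import re
numbers = re.findall(r"\d+\.?\d*", text)
print(numbers)  # ['12345', '99.8', '3']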

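**Advantages of regular expressions**:
- **Powerful** - a complex matching rule can be expressed in a single line
- **Concise** - no long chains of if-else checks
- **Flexible** - matching rules can be controlled precisely
- **Universal** - supported by almost every programming language
- **Efficient** - the matching engine is optimized at a low level and is usually fast

**Common use cases**:
- Validating input (email addresses, phone numbers, ID numbers, and so on)
- Extracting information (pulling specific content out of text)
- Replacing text (bulk edits, masking sensitive data)
- Parsing logs (extracting key information from log files)
- Cleaning scraped data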

## re Module Basics

Python provides regular expression support through the `re` module.

In [None]:
import re

# Basic usage
text = "My phone number is 13812345678"
result = re.search(r"\d+", text)  # find the first run of digits
if result:
    print(result.group())  # 13812345678

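**Note**: Prefix regex strings with `r` to make them raw strings, so that Python does not process the backslashes before `re` sees them.

```python
# ❌ Not recommended
pattern = "\d+"  # works only because \d is not a Python string escape; Python 3.12+ emits a SyntaxWarning

# ✅ Recommended
pattern = r"\d+"  # raw string: the backslash reaches re unchanged
```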
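
The note above can be made concrete with an escape that Python does recognize. This is a minimal sketch with made-up sample text: `\d` happens to survive in a normal string, but `\b` is turned into a backspace character before `re` ever sees it.

```python
import re

text = "cat scatter category"

# In a normal string literal, "\b" is the backspace character (U+0008),
# so this pattern looks for backspace + "cat" + backspace.
plain = "\bcat\b"
# In a raw string the backslash reaches the regex engine untouched,
# where \b means "word boundary".
raw = r"\bcat\b"

print(len(plain), len(raw))     # 5 7
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['cat']
```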

## Basic Metacharacters

### Literal Characters

In [None]:
import re

# Literal characters match themselves
text = "hello world"
print(re.search(r"hello", text))  # match
print(re.search(r"python", text))  # no match (returns None)

### . (Dot) - Match Any Character

In [None]:
import re

text = "cat, bat, hat, mat"

# . matches any character except a newline
print(re.findall(r".at", text))  # ['cat', 'bat', 'hat', 'mat']
print(re.findall(r"c.t", "cat cut cot"))  # ['cat', 'cut', 'cot']

### \d - Match a Digit

In [None]:
import re

text = "I am 25 years old and 175cm tall"

# \d matches one decimal digit (0-9; for str patterns also other Unicode digits)
print(re.findall(r"\d", text))  # ['2', '5', '1', '7', '5']
print(re.findall(r"\d+", text))  # ['25', '175'] (runs of digits)
print(re.findall(r"\d{3}", text))  # ['175'] (exactly 3 digits)
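
For `str` patterns, `\d` is Unicode-aware by default, which matters when input may contain full-width digits. The sketch below uses a made-up sample string to show how `re.ASCII` or an explicit `[0-9]` restricts matching to ASCII digits.

```python
import re

# The second number is written with full-width digits (U+FF14, U+FF12).
text = "room 42, floor \uff14\uff12"

print(re.findall(r"\d+", text))            # ['42', '４２']
print(re.findall(r"\d+", text, re.ASCII))  # ['42']
print(re.findall(r"[0-9]+", text))         # ['42']
```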

### \D - Match a Non-Digit

In [None]:
import re

text = "abc123def456"
print(re.findall(r"\D+", text))  # ['abc', 'def']

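### \w - Match Letters, Digits, and Underscore

In [None]:
import re

text = "hello_world123 你好"

# \w matches [a-zA-Z0-9_]; for str patterns it also matches Unicode letters and digits,
# which is why the Chinese word below matches (pass re.ASCII to restrict it)
print(re.findall(r"\w+", text))  # ['hello_world123', '你好']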

### \W - Match a Non-Word Character

In [None]:
import re

text = "hello, world!"
print(re.findall(r"\W+", text))  # [', ', '!']

### \s - Match Whitespace

In [None]:
import re

text = "hello\tworld\n"

# \s matches spaces, tabs, newlines, and other whitespace
print(re.findall(r"\s+", text))  # ['\t', '\n']

### \S - Match a Non-Whitespace Character

In [None]:
import re

text = "hello world"
print(re.findall(r"\S+", text))  # ['hello', 'world']

## Quantifiers

### * - Zero or More Times

In [None]:
import re

# * means the preceding element may appear zero or more times
print(re.findall(r"ab*", "a ab abb abbb"))
# ['a', 'ab', 'abb', 'abbb']

print(re.findall(r"\d*", "abc123"))
# ['', '', '', '123', ''] (note that the empty string also matches)

### + - One or More Times

In [None]:
import re

# + means the preceding element must appear at least once
print(re.findall(r"ab+", "a ab abb abbb"))
# ['ab', 'abb', 'abbb'] (the lone 'a' is not included)

print(re.findall(r"\d+", "abc123def456"))
# ['123', '456']

### ? - Zero or One Time

In [None]:
import re

# ? means the preceding element appears zero times or once
print(re.findall(r"ab?", "a ab abb"))
# ['a', 'ab', 'ab'] (in "abb" only "ab" matches)

# Commonly used for optional parts
print(re.findall(r"colou?r", "color colour"))
# ['color', 'colour']

### {n} - Exactly n Times

In [None]:
import re

# {n} means exactly n times
print(re.findall(r"\d{4}", "January 15, 2024"))
# ['2024']

print(re.findall(r"a{3}", "aa aaa aaaa"))
# ['aaa', 'aaa'] ("aaaa" yields one "aaa")

### {n,m} - Between n and m Times

In [None]:
import re

# {n,m} means between n and m times
print(re.findall(r"\d{2,4}", "1 12 123 1234 12345"))
# ['12', '123', '1234', '1234'] ("12345" yields "1234"; the leftover "5" is too short)

# {n,} means at least n times
print(re.findall(r"\d{3,}", "12 123 1234"))
# ['123', '1234']

### Greedy and Non-Greedy Matching

In [None]:
import re

text = "<html><body>content</body></html>"

# Greedy (the default): match as much as possible
print(re.findall(r"<.*>", text))
# ['<html><body>content</body></html>'] (the whole string)

# Non-greedy: match as little as possible
print(re.findall(r"<.*?>", text))
# ['<html>', '<body>', '</body>', '</html>']

# Non-greedy quantifiers:
# *? +? ?? {n,m}?

## Character Classes

### [] - Character Sets

In [None]:
import re

# [abc] matches a, b, or c
print(re.findall(r"[abc]", "abcd"))  # ['a', 'b', 'c']

# [a-z] matches lowercase letters
print(re.findall(r"[a-z]+", "Hello World 123"))  # ['ello', 'orld']

# [A-Z] matches uppercase letters
print(re.findall(r"[A-Z]+", "Hello World"))  # ['H', 'W']

# [0-9] matches ASCII digits (like \d, minus the non-ASCII Unicode digits)
print(re.findall(r"[0-9]+", "abc123def"))  # ['123']

# [a-zA-Z] matches any ASCII letter
print(re.findall(r"[a-zA-Z]+", "Hello123World"))  # ['Hello', 'World']

# [^...] negates the set: [^0-9] matches any character that is not a digit
print(re.findall(r"[^0-9]+", "abc123def"))  # ['abc', 'def']

### Common Character Classes

In [None]:
import re

text = "Email: test@example.com, Phone: 138-1234-5678"

# Match the local part of an email address
print(re.findall(r"[a-zA-Z0-9._%-]+@", text))  # ['test@']

# Match a phone number
print(re.findall(r"[\d-]+", text))  # ['138-1234-5678']

## Anchors

### ^ - Start of String or Line

In [None]:
import re

# ^ matches at the start of the string
text = "hello world"
print(re.search(r"^hello", text))  # match
print(re.search(r"^world", text))  # no match

# Multiline mode: ^ also matches at the start of each line
text = "line1\nline2\nline3"
print(re.findall(r"^line", text))  # ['line'] (only the first one)
print(re.findall(r"^line", text, re.MULTILINE))  # ['line', 'line', 'line']

### $ - End of String or Line

In [None]:
import re

# $ matches at the end of the string (or just before a trailing newline)
text = "hello world"
print(re.search(r"world$", text))  # match
print(re.search(r"hello$", text))  # no match

### \b - Word Boundary

In [None]:
import re

text = "cat scatter category"

# \b matches at a word boundary
print(re.findall(r"\bcat\b", text))  # ['cat'] (whole word only)
print(re.findall(r"cat", text))  # ['cat', 'cat', 'cat'] (every occurrence of "cat")

### \B - Non-Word Boundary

In [None]:
import re

text = "cat scatter"
print(re.findall(r"\Bcat", text))  # ['cat'] (the "cat" inside "scatter")

## Groups and Capturing

### () - Grouping

In [None]:
import re

# () creates a group
text = "2024-01-15"
match = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

if match:
    print(match.group(0))  # 2024-01-15 (the whole match)
    print(match.group(1))  # 2024 (group 1)
    print(match.group(2))  # 01 (group 2)
    print(match.group(3))  # 15 (group 3)
    print(match.groups())  # ('2024', '01', '15')

### Named Groups

In [None]:
import re

text = "Name: Alice, Age: 25"
pattern = r"Name: (?P<name>\w+), Age: (?P<age>\d+)"
match = re.search(pattern, text)

if match:
    print(match.group("name"))  # Alice
    print(match.group("age"))   # 25
    print(match.groupdict())    # {'name': 'Alice', 'age': '25'}

### Backreferences

In [None]:
import re

# Find repeated words
text = "hello hello world world"
print(re.findall(r"(\w+) \1", text))  # ['hello', 'world']
# \1 refers to whatever the first group matched

# Find doubled characters
text = "aabbcc"
print(re.findall(r"(\w)\1", text))  # ['a', 'b', 'c']

### Non-Capturing Groups

In [None]:
import re

# (?:...) is a non-capturing group: it groups without creating a numbered group,
# so its text is not returned by group(n) or groups()
text = "http://example.com https://test.com"

# Capturing group
matches = re.findall(r"(https?)://([\w.]+)", text)
print(matches)  # [('http', 'example.com'), ('https', 'test.com')]

# Non-capturing group
matches = re.findall(r"(?:https?)://([\w.]+)", text)
print(matches)  # ['example.com', 'test.com']

## re Module Functions

### re.match() - Match at the Start

In [None]:
import re

# match() matches only at the start of the string
text = "hello world"

print(re.match(r"hello", text))  # match
print(re.match(r"world", text))  # no match (not at the start)

# Often used for validation
def validate_username(username):
    """Username: starts with a letter; letters, digits, underscore; 3-16 characters."""
    pattern = r"^[a-zA-Z][a-zA-Z0-9_]{2,15}$"
    return re.match(pattern, username) is not None

print(validate_username("user123"))  # True
print(validate_username("123user"))  # False
print(validate_username("ab"))       # False (too short)
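
One detail worth knowing when validating with `match()` and `$`: `$` also matches just before a trailing newline. The sketch below shows `re.fullmatch()` as a stricter alternative; `USERNAME` and `is_valid_username` are hypothetical names used only for illustration.

```python
import re

USERNAME = re.compile(r"[a-zA-Z][a-zA-Z0-9_]{2,15}")

def is_valid_username(username):
    # fullmatch() succeeds only if the whole string matches the pattern.
    return USERNAME.fullmatch(username) is not None

# With match() and "$", a trailing newline slips through.
loose = re.match(r"^[a-zA-Z][a-zA-Z0-9_]{2,15}$", "user123\n") is not None
print(loose)                           # True
print(is_valid_username("user123\n"))  # False
print(is_valid_username("user123"))    # True
```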

### re.search() - Find the First Match

In [None]:
import re

# search() finds the first match anywhere in the string
text = "hello world python"

match = re.search(r"world", text)
if match:
    print(f"Found: {match.group()}")
    print(f"Span: {match.start()}-{match.end()}")
    # Found: world
    # Span: 6-11

### re.findall() - Find All Matches

In [None]:
import re

# findall() returns a list of all matches
text = "Prices: 99 yuan, 199 yuan, 299 yuan"
prices = re.findall(r"\d+", text)
print(prices)  # ['99', '199', '299']

# With groups, findall returns tuples of the group contents
text = "Alice:25, Bob:30"
result = re.findall(r"(\w+):(\d+)", text)
print(result)  # [('Alice', '25'), ('Bob', '30')]

### re.finditer() - Iterate Over Matches

In [None]:
import re

# finditer() returns an iterator of match objects
text = "apple 10, banana 20, cherry 30"

for match in re.finditer(r"(\w+) (\d+)", text):
    print(f"{match.group(1)}: {match.group(2)}")
# apple: 10
# banana: 20
# cherry: 30

### re.sub() - Replace

In [None]:
import re

# sub() replaces every match
text = "Today is 2024-01-15"
new_text = re.sub(r"\d{4}", "YYYY", text)
print(new_text)  # Today is YYYY-01-15

# Use a function as the replacement
def double(match):
    num = int(match.group())
    return str(num * 2)

text = "10, 20, 30"
result = re.sub(r"\d+", double, text)
print(result)  # 20, 40, 60

# Use backreferences in the replacement
text = "138-1234-5678"
masked = re.sub(r"(\d{3})-\d{4}-(\d{4})", r"\1-****-\2", text)
print(masked)  # 138-****-5678
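
Backreferences in the replacement string can also use the `\g<...>` form, which works with named groups and avoids ambiguity when a digit follows the reference. A short sketch with made-up sample strings:

```python
import re

date = "2024-01-15"
pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"

# \g<name> refers to a named group in the replacement string.
reordered = re.sub(pattern, r"\g<day>/\g<month>/\g<year>", date)
print(reordered)  # 15/01/2024

# \g<1> is the unambiguous form of \1: r"\10" would mean group 10,
# while r"\g<1>0" means group 1 followed by a literal "0".
times_ten = re.sub(r"(\d+)", r"\g<1>0", "7 apples")
print(times_ten)  # 70 apples
```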

### re.subn() - Replace and Return the Count

In [None]:
import re

text = "apple apple banana apple"
result, count = re.subn(r"apple", "orange", text)
print(result)  # orange orange banana orange
print(count)   # 3

### re.split() - Split

In [None]:
import re

# split() splits on a regular expression
text = "apple,banana;orange|grape"
fruits = re.split(r"[,;|]", text)
print(fruits)  # ['apple', 'banana', 'orange', 'grape']

# Keep the separators (by capturing them in a group)
text = "a1b2c3"
result = re.split(r"(\d)", text)
print(result)  # ['a', '1', 'b', '2', 'c', '3', '']

# Limit the number of splits
text = "a,b,c,d,e"
result = re.split(r",", text, maxsplit=2)
print(result)  # ['a', 'b', 'c,d,e']

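### re.compile() - Compile a Pattern

In [None]:
import re

# compile() builds a reusable pattern object; handy when a pattern is used many times
# (re also caches recently used patterns, so the speed gain is usually modest)
pattern = re.compile(r"\d+")

text1 = "abc123def"
text2 = "xyz456uvw"

print(pattern.findall(text1))  # ['123']
print(pattern.findall(text2))  # ['456']

# Flags can be given at compile time
pattern = re.compile(r"hello", re.IGNORECASE)
print(pattern.search("Hello World"))  # match (case-insensitive)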

## Regular Expression Flags

In [None]:
import re

# re.IGNORECASE (re.I) - case-insensitive matching
print(re.findall(r"hello", "Hello HELLO", re.I))  # ['Hello', 'HELLO']

# re.MULTILINE (re.M) - multiline mode
text = "line1\nline2\nline3"
print(re.findall(r"^line\d", text, re.M))  # ['line1', 'line2', 'line3']

# re.DOTALL (re.S) - . also matches a newline
text = "hello\nworld"
print(re.findall(r"hello.world", text))     # [] (. does not match \n)
print(re.findall(r"hello.world", text, re.S))  # ['hello\nworld']

# re.VERBOSE (re.X) - allow comments and whitespace inside the pattern
pattern = re.compile(r"""
    \d{3}  # first three digits
    -      # separator
    \d{4}  # next four digits
    -      # separator
    \d{4}  # last four digits
""", re.VERBOSE)

print(pattern.search("138-1234-5678"))

# Combine several flags
pattern = re.compile(r"hello", re.I | re.M)

## Practical Examples

### Example 1: Email Validation

In [None]:
import re

def validate_email(email):
    """
    Validate the format of an email address (simplified).
    Rules:
    - Local part: letters, digits, and . _ % + -
    - An @ sign
    - Domain: letters, digits, . and -
    - Top-level domain: at least 2 letters
    """
    pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    return re.match(pattern, email) is not None

# Tests
test_emails = [
    "user@example.com",      # ✅
    "test.user@domain.co.uk",  # ✅
    "invalid@",              # ❌
    "@invalid.com",          # ❌
    "no-at-sign.com",        # ❌
    "user@domain",           # ❌ (no top-level domain)
]

for email in test_emails:
    result = "✅" if validate_email(email) else "❌"
    print(f"{result} {email}")

### Example 2: Phone Number Validation and Masking

In [None]:
import re

def validate_phone(phone):
    """
    Validate a mainland China mobile number.
    Rule: starts with 1, second digit 3-9, 11 digits in total.
    """
    pattern = r"^1[3-9]\d{9}$"
    return re.match(pattern, phone) is not None

def mask_phone(text):
    """
    Mask phone numbers: 138****5678
    """
    # Lookarounds keep the pattern from matching inside a longer run of digits
    pattern = r"(?<!\d)1[3-9]\d{9}(?!\d)"

    def mask(match):
        phone = match.group()
        return f"{phone[:3]}****{phone[7:]}"

    return re.sub(pattern, mask, text)

# Validation
phones = ["13812345678", "12345678901", "18912345678"]
for phone in phones:
    print(f"{phone}: {'✅' if validate_phone(phone) else '❌'}")

# Masking
text = "Primary: 13812345678, backup: 13987654321"
print(mask_phone(text))
# Primary: 138****5678, backup: 139****4321

### Example 3: URL Extraction and Parsing

In [None]:
import re

def extract_urls(text):
    """Extract URLs."""
    pattern = r"https?://[^\s<>\"']+"
    return re.findall(pattern, text)

def parse_url(url):
    """Parse a URL into its parts (simplified)."""
    pattern = r"(https?)://([^/:]+)(?::(\d+))?(/[^\s?]*)?\??([^\s]*)?"
    match = re.match(pattern, url)

    if match:
        protocol = match.group(1)
        # Fall back to the default port of the protocol
        default_port = "443" if protocol == "https" else "80"
        return {
            "protocol": protocol,
            "domain": match.group(2),
            "port": match.group(3) or default_port,
            "path": match.group(4) or "/",
            "query": match.group(5) or ""
        }
    return None

# Extract URLs
text = """
Website https://www.example.com
Docs http://docs.python.org/3.14/
API https://api.example.com:8080/v1/users?id=123
"""

urls = extract_urls(text)
print("URLs found:")
for url in urls:
    print(f"  {url}")

# Parse a URL
print("\nParsing the first URL:")
info = parse_url(urls[0])
for key, value in info.items():
    print(f"  {key}: {value}")
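
Hand-written URL regexes miss many cases (fragments, user info, IPv6 hosts). As a point of comparison, here is a minimal sketch that parses one of the sample URLs above with the standard library's `urllib.parse` instead of a regex.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://api.example.com:8080/v1/users?id=123"
parts = urlsplit(url)

print(parts.scheme)           # https
print(parts.hostname)         # api.example.com
print(parts.port)             # 8080
print(parts.path)             # /v1/users
print(parse_qs(parts.query))  # {'id': ['123']}
```

A common split of work is to find candidate URLs with a regex and then hand each one to `urlsplit()` for the actual parsing.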

### Example 4: Log Parser

In [None]:
import re
from collections import defaultdict

class LogParser:
    """Log parser."""

    def __init__(self):
        # Log format: [timestamp] [level] message
        self.pattern = r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] \[(\w+)\] (.+)"
        self.stats = defaultdict(int)

    def parse_line(self, line):
        """Parse a single log line."""
        match = re.search(self.pattern, line)
        if match:
            return {
                "timestamp": match.group(1),
                "level": match.group(2),
                "message": match.group(3)
            }
        return None

    def parse_file(self, filename):
        """Parse a log file."""
        logs = []

        try:
            with open(filename, "r", encoding="utf-8") as f:
                for line in f:
                    log = self.parse_line(line)
                    if log:
                        logs.append(log)
                        self.stats[log["level"]] += 1
        except FileNotFoundError:
            print(f"File not found: {filename}")

        return logs

    def get_errors(self, logs):
        """Return the ERROR entries."""
        return [log for log in logs if log["level"] == "ERROR"]

    def show_stats(self):
        """Print the statistics."""
        print("\n=== Log statistics ===")
        for level, count in sorted(self.stats.items()):
            print(f"{level}: {count} entries")

# Usage example (with in-memory sample data instead of a log file)
parser = LogParser()

# Simulated log
sample_logs = """
[2024-01-15 10:30:00] [INFO] Application started
[2024-01-15 10:30:05] [ERROR] Database connection failed
[2024-01-15 10:30:10] [WARNING] Retrying connection
[2024-01-15 10:30:15] [INFO] Connection established
[2024-01-15 10:30:20] [ERROR] Query timeout
"""

for line in sample_logs.strip().split("\n"):
    log = parser.parse_line(line)
    if log is None:
        continue
    # parse_line() does not update the counters, so count here
    parser.stats[log["level"]] += 1
    if log["level"] == "ERROR":
        print(f"Error: {log['message']} ({log['timestamp']})")

parser.show_stats()

### Example 5: ID Card Number Validation

In [None]:
import re
from datetime import date

def validate_id_card(id_card):
    """
    Validate a Chinese resident ID number.
    18 characters: 17 digits + 1 check character (a digit or X).
    """
    # Format check
    if not re.match(r"^\d{17}[\dXx]$", id_card):
        return False, "invalid format"

    # Extract the birth date
    birth_year = id_card[6:10]
    birth_month = id_card[10:12]
    birth_day = id_card[12:14]

    # Validate the date (simplified: does not reject dates such as February 31)
    if not (1900 <= int(birth_year) <= date.today().year):
        return False, "invalid year"
    if not (1 <= int(birth_month) <= 12):
        return False, "invalid month"
    if not (1 <= int(birth_day) <= 31):
        return False, "invalid day"

    # Simplified: the check character is not verified here (that needs a checksum algorithm)
    return True, "valid"

def extract_id_info(id_card):
    """Extract information from an ID number."""
    if not re.match(r"^\d{17}[\dXx]$", id_card):
        return None

    return {
        "area": id_card[:6],
        "birth_year": id_card[6:10],
        "birth_month": id_card[10:12],
        "birth_day": id_card[12:14],
        # The 17th digit is odd for male, even for female
        "sex": "male" if int(id_card[16]) % 2 == 1 else "female"
    }

# Tests
id_cards = [
    "110101199001011234",  # sample (not a real number)
    "12345678901234567X",
]

for id_card in id_cards:
    valid, msg = validate_id_card(id_card)
    print(f"{id_card}: {msg}")

    if valid:
        info = extract_id_info(id_card)
        print(f"  Birth date: {info['birth_year']}-{info['birth_month']}-{info['birth_day']}")
        print(f"  Sex: {info['sex']}")
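
The validator above leaves out the check character. The sketch below fills that gap under the assumption that the number follows the MOD 11-2 scheme used for 18-digit resident ID numbers. The names `WEIGHTS`, `CHECK_CHARS`, and `has_valid_checksum` are hypothetical, and the first test value is a sample number, not a real person's ID.

```python
import re

# Position weights and check characters of the MOD 11-2 scheme.
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"

def has_valid_checksum(id_card):
    """Return True if the 18th character matches the computed check character."""
    if not re.fullmatch(r"[0-9]{17}[0-9Xx]", id_card):
        return False
    # Weighted sum of the first 17 digits, reduced modulo 11.
    total = sum(int(d) * w for d, w in zip(id_card[:17], WEIGHTS))
    return CHECK_CHARS[total % 11] == id_card[17].upper()

print(has_valid_checksum("11010519491231002X"))  # True
print(has_valid_checksum("110101199001011234"))  # False
```

The second value is the sample from the example above: it passes the format and date checks, but its check character does not match.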

### Example 6: Cleaning HTML Tags

In [None]:
import re

def remove_html_tags(html):
    """Remove HTML tags."""
    # Remove all tags
    text = re.sub(r"<[^>]+>", "", html)
    # Collapse extra whitespace
    text = re.sub(r"\s+", " ", text).strip()
    return text

def extract_links(html):
    """Extract all links as (url, text) pairs."""
    pattern = r'<a\s+(?:[^>]*?\s+)?href="([^"]*)"[^>]*>(.*?)</a>'
    return re.findall(pattern, html, re.DOTALL)

def extract_images(html):
    """Extract all image sources."""
    pattern = r'<img\s+[^>]*?src="([^"]*)"[^>]*>'
    return re.findall(pattern, html)

# Tests
html = """
<html>
<body>
    <h1>Title</h1>
    <p>This is some <b>bold</b> text</p>
    <a href="https://example.com">Link</a>
    <img src="image.jpg" alt="Image">
</body>
</html>
"""

print("Plain text:")
print(remove_html_tags(html))

print("\nLinks:")
for url, text in extract_links(html):
    print(f"  {text} -> {url}")

print("\nImages:")
for img in extract_images(html):
    print(f"  {img}")
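
The regexes above assume double-quoted attributes and well-formed tags. For anything less tidy, a real parser is usually more robust. The following is a minimal sketch using the standard library's `html.parser`; `LinkCollector` is a hypothetical helper written for this illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; the quoting style does not matter here.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

collector = LinkCollector()
collector.feed("<p><a class='x' href='https://example.com'>link</a> <a>no href</a></p>")
print(collector.links)  # ['https://example.com']
```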

### Example 7: Price Extraction and Formatting

In [None]:
import re

def extract_prices(text):
    """Extract prices."""
    # Match several price formats
    patterns = [
        r"¥\s*(\d+(?:\.\d{2})?)",  # ¥99.99
        r"(\d+(?:\.\d{2})?)\s*元",  # 99.99元
        r"\$\s*(\d+(?:\.\d{2})?)",  # $99.99
    ]

    prices = []
    for pattern in patterns:
        matches = re.findall(pattern, text)
        prices.extend(matches)

    # Results are grouped by pattern, not by position in the text
    return [float(p) for p in prices]

def format_price(price, currency="¥"):
    """Format a price."""
    return f"{currency}{price:,.2f}"

# Tests (no pre-computed total line, so that the sum below does not count it twice)
text = """
Item A: ¥99.99
Item B: 199.50元
Item C: $29.99
"""

prices = extract_prices(text)
print("Extracted prices:")
for price in prices:
    print(f"  {format_price(price)}")

# Simplification: the sum ignores the currency and just adds the numbers
print(f"\nTotal: {format_price(sum(prices))}")

### Example 8: Password Strength Check

In [None]:
import re

def check_password_strength(password):
    """
    Check password strength.
    Rules:
    - At least 8 characters
    - Contains an uppercase letter
    - Contains a lowercase letter
    - Contains a digit
    - Contains a special character
    """
    checks = {
        "length >= 8": len(password) >= 8,
        "has uppercase letter": bool(re.search(r"[A-Z]", password)),
        "has lowercase letter": bool(re.search(r"[a-z]", password)),
        "has digit": bool(re.search(r"\d", password)),
        "has special character": bool(re.search(r"[!@#$%^&*(),.?\":{}|<>]", password))
    }

    score = sum(checks.values())

    if score == 5:
        strength = "strong"
    elif score >= 3:
        strength = "medium"
    else:
        strength = "weak"

    return {
        "strength": strength,
        "score": score,
        "checks": checks
    }

# Tests
passwords = [
    "password",
    "Password123",
    "P@ssw0rd!",
    "Aa1!",
]

for pwd in passwords:
    result = check_password_strength(pwd)
    print(f"\nPassword: {pwd}")
    print(f"Strength: {result['strength']} ({result['score']}/5)")
    for check, passed in result['checks'].items():
        status = "✅" if passed else "❌"
        print(f"  {status} {check}")
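
When only a pass/fail answer is needed, the same rules can be folded into one pattern with lookaheads. This is a sketch, not a drop-in replacement: `STRONG_PASSWORD` and `is_strong` are hypothetical names, and "special character" is assumed here to mean anything that is neither a word character nor whitespace, which differs slightly from the explicit list above.

```python
import re

STRONG_PASSWORD = re.compile(r"""
    (?=.*[A-Z])      # at least one uppercase letter
    (?=.*[a-z])      # at least one lowercase letter
    (?=.*\d)         # at least one digit
    (?=.*[^\w\s])    # at least one character that is not a word char or whitespace
    .{8,}            # at least 8 characters in total
""", re.VERBOSE)

def is_strong(password):
    return STRONG_PASSWORD.fullmatch(password) is not None

print(is_strong("P@ssw0rd!"))    # True
print(is_strong("Password123"))  # False (no special character)
print(is_strong("Aa1!"))         # False (too short)
```

The per-rule dictionary in the example above is still the better choice when the user should be told which rule failed.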

### Example 9: Working with Chinese Text

In [None]:
import re

def extract_chinese(text):
    """Extract Chinese characters (the basic CJK block, U+4E00 to U+9FFF)."""
    return "".join(re.findall(r"[\u4e00-\u9fff]+", text))

def remove_chinese(text):
    """Remove Chinese characters."""
    return re.sub(r"[\u4e00-\u9fff]+", "", text)

def split_chinese_english(text):
    """Separate Chinese, English, and numbers."""
    chinese = re.findall(r"[\u4e00-\u9fff]+", text)
    english = re.findall(r"[a-zA-Z]+", text)
    numbers = re.findall(r"\d+", text)

    return {
        "chinese": chinese,
        "english": english,
        "numbers": numbers
    }

# Tests
text = "Hello你好World世界123"

print(f"Original: {text}")
print(f"Chinese: {extract_chinese(text)}")
print(f"Without Chinese: {remove_chinese(text)}")

result = split_chinese_english(text)
print(f"Split result: {result}")

### Example 10: Data Cleaning

In [None]:
import re

class DataCleaner:
    """Data cleaning utilities."""

    @staticmethod
    def remove_extra_spaces(text):
        """Collapse runs of whitespace into single spaces."""
        return re.sub(r"\s+", " ", text).strip()

    @staticmethod
    def remove_special_chars(text):
        """Remove special characters; keep word characters
        (letters, digits, underscore, Chinese) and whitespace."""
        return re.sub(r"[^\w\u4e00-\u9fff\s]", "", text)

    @staticmethod
    def normalize_phone(phone):
        """Normalize a phone number."""
        # Remove every non-digit character
        digits = re.sub(r"\D", "", phone)
        # Format as 138-1234-5678
        if len(digits) == 11:
            return f"{digits[:3]}-{digits[3:7]}-{digits[7:]}"
        return phone

    @staticmethod
    def extract_numbers(text):
        """Extract all numbers (including decimals)."""
        return [float(n) for n in re.findall(r"-?\d+\.?\d*", text)]

    @staticmethod
    def clean_email_list(text):
        """Extract a de-duplicated list of email addresses from text."""
        pattern = r"\b[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}\b"
        # dict.fromkeys() removes duplicates while keeping first-seen order
        return list(dict.fromkeys(re.findall(pattern, text)))

# Tests
cleaner = DataCleaner()

# Clean up spaces
text = "Hello    World   Python"
print(cleaner.remove_extra_spaces(text))
# Hello World Python

# Normalize phone numbers
phones = ["138 1234 5678", "(138)1234-5678", "13812345678"]
for phone in phones:
    print(f"{phone} -> {cleaner.normalize_phone(phone)}")

# Extract email addresses
text = """
Contact: admin@example.com, test@test.com
Support: support@example.com
"""
emails = cleaner.clean_email_list(text)
print(f"Email list: {emails}")

## Common Pitfalls

### Pitfall 1: Forgetting to Escape Special Characters

In [None]:
import re

# ❌ Wrong: . matches any character
pattern = "3.14"
print(re.search(pattern, "3a14"))  # matches!

# ✅ Right: escape the .
pattern = r"3\.14"
print(re.search(pattern, "3.14"))  # match
print(re.search(pattern, "3a14"))  # no match
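
When the text to match comes from a variable (user input, a file name, a search keyword), escaping by hand is error-prone. A minimal sketch using `re.escape()`, with a made-up keyword:

```python
import re

keyword = "3.14 (approx.)"

# re.escape() backslash-escapes every character that has a special meaning.
pattern = re.escape(keyword)

found = re.search(pattern, "pi is 3.14 (approx.)") is not None
not_found = re.search(pattern, "pi is 3x14 (approx.)") is not None
print(found)      # True
print(not_found)  # False
```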

### Pitfall 2: The Greedy Matching Trap

In [None]:
import re

html = "<div>content1</div><div>content2</div>"

# ❌ Wrong: greedy matching
print(re.findall(r"<div>.*</div>", html))
# ['<div>content1</div><div>content2</div>'] (matched the whole string)

# ✅ Right: non-greedy matching
print(re.findall(r"<div>.*?</div>", html))
# ['<div>content1</div>', '<div>content2</div>']

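### Pitfall 3: Forgetting to Use Raw Strings

In [None]:
import re

# ❌ Risky: relies on \d not being a Python string escape
pattern = "\d+"  # same value as "\\d+", but Python 3.12+ emits a SyntaxWarning
# Escapes that Python does recognize change the pattern: "\b" is a backspace, not a word boundary

# ✅ Right: use a raw string
pattern = r"\d+"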

### Pitfall 4: The Grouping Pitfall

In [None]:
import re

text = "2024-01-15"

# ❌ With capturing groups, findall returns the group contents
result = re.findall(r"(\d{4})-(\d{2})-(\d{2})", text)
print(result)  # [('2024', '01', '15')], not the full match

# ✅ Drop the groups (or make them non-capturing with (?:...))
result = re.findall(r"\d{4}-\d{2}-\d{2}", text)
print(result)  # ['2024-01-15']

# Or use finditer
for match in re.finditer(r"(\d{4})-(\d{2})-(\d{2})", text):
    print(match.group(0))  # the full match

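## Best Practices

### 1. Compile Patterns You Reuse

```python
import re

texts = ["abc123def", "xyz456uvw"]  # sample input

# ❌ Less efficient: every call looks the pattern up in re's internal cache
for text in texts:
    re.search(r"\d+", text)

# ✅ Better: compile once, reuse the pattern object
pattern = re.compile(r"\d+")
for text in texts:
    pattern.search(text)
```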

### 2. Use Raw Strings

```python
# ✅ Recommended
pattern = r"\d+\.\d+"

# ❌ Not recommended
pattern = "\\d+\\.\\d+"
```

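### 3. Prefer Non-Greedy Matching

```python
import re

html = "<p>text</p>"  # sample input

# ✅ Recommended
re.findall(r"<.*?>", html)

# ❌ May match too much: greedy .* runs to the last >
re.findall(r"<.*>", html)
```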
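
Non-greedy matching is not the only option for delimited text. A negated character class such as `[^>]*` states the stopping condition directly and, unlike `.`, also crosses newlines. A small sketch with a made-up two-line HTML fragment:

```python
import re

html = '<a href="x">one</a>\n<b\nclass="y">two</b>'

# "." does not match a newline by default, so the lazy pattern misses
# the tag that spans two lines.
lazy = re.findall(r"<.*?>", html)
print(lazy)     # ['<a href="x">', '</a>', '</b>']

# A negated character class stops at the first ">" and may cross newlines.
negated = re.findall(r"<[^>]*>", html)
print(negated)  # ['<a href="x">', '</a>', '<b\nclass="y">', '</b>']
```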

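### 4. Use Groups Deliberately

```python
import re

date = "2024-01-15"          # sample input
url = "https://example.com"  # sample input

# ✅ Use capturing groups when you need to extract parts
re.search(r"(\d{4})-(\d{2})-(\d{2})", date)

# ✅ Use non-capturing groups when you only need grouping
re.search(r"(?:https?://)example\.com", url)
```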

### 5. Add Comments (for Complex Patterns)

In [None]:
import re

pattern = re.compile(r"""
    ^                   # start
    [a-zA-Z0-9._%+-]+   # local part (user name)
    @                   # @ sign
    [a-zA-Z0-9.-]+      # domain
    \.                  # dot
    [a-zA-Z]{2,}        # top-level domain
    $                   # end
""", re.VERBOSE)

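## Exercises

### Exercise 1: Data Validation

Write validation functions that:
- Validate an email address (full rules)
- Validate a phone number (supporting several formats)
- Validate an ID card number (18 characters, including the check character)
- Validate a password (with your own strength rules)

### Exercise 2: Text Extraction

Extract from a text:
- All URLs
- All email addresses
- All IP addresses
- All dates (in several formats)

### Exercise 3: Data Cleaning

Implement a data cleaning tool that can:
- Remove HTML tags
- Unify date formats
- Format phone numbers
- Extract and unify prices

### Exercise 4: Log Analysis

Write a log analyzer that can:
- Parse several log formats
- Count the entries per log level
- Extract error messages
- Analyze access frequency

### Exercise 5: Processing Scraped Data

Process web page data:
- Extract the page title
- Extract all links
- Extract image addresses
- Extract the article body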

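## Next Steps

Now that you have a handle on regular expressions, the next chapter covers JSON and CSV processing for data exchange!

---

**Key points of this chapter**
- ✅ Understand what regular expressions are
- ✅ Master the basic metacharacters and quantifiers
- ✅ Use groups and capturing
- ✅ Know the functions of the re module
- ✅ Learn common data validation patterns
- ✅ Master text extraction and cleaning
- ✅ Avoid common pitfalls

**Remember**
- Always use raw strings r"..."
- Watch out for greedy vs. non-greedy matching
- Compile patterns you reuse
- Use groups deliberately
- Comment complex patterns
- Test before you use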