# Regular Expressions
A regular expression is a sequence of characters that defines a pattern used to search strings.

Note: a language that can be described by a regular expression is called a regular language.

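search: checks whether a string described by the re (the pattern) occurs anywhere in the input, i.e. whether some substring matches.

match: checks whether the pattern matches at the beginning of the string. The match does not have to reach the end of the string; use `re.fullmatch()` to require that the whole string matches.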

`match()` and `search()` return
- `Match` object if matched
- `None`, otherwise

In [14]:
import re
string = 'an example word:cat!!'
matched = re.search(r'word:\w\w\w', string)
matched

<re.Match object; span=(11, 19), match='word:cat'>

In [None]:
if matched:
    print(matched.group())
else:
    print('not found')

In [2]:
matched = re.match(r'.*word:\w\w\w.*', string)
print(matched)
if matched:
    print(matched.group())
else:
    print('not found')

<re.Match object; span=(0, 21), match='an example word:cat!!'>
an example word:cat!!
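The difference between `search`, `match`, and `fullmatch` can be sketched on the same string. This is a minimal illustration; the name `word_pat` is introduced here only for the example.

```python
import re

string = 'an example word:cat!!'
word_pat = r'word:\w\w\w'   # name chosen for this sketch only

# search scans the string and returns the first location that matches
found = re.search(word_pat, string)
print(found.span())                 # (11, 19)

# match only tries at position 0, so the same pattern fails here
at_start = re.match(word_pat, string)
print(at_start)                     # None

# match accepts a matching prefix; text after the match is ignored
prefix = re.match(r'an \w+', string)
print(prefix.group())               # an example

# fullmatch requires the pattern to cover the entire string
whole_fail = re.fullmatch(r'an \w+', string)
whole_ok = re.fullmatch(r'.*word:\w\w\w.*', string)
print(whole_fail)                   # None
print(whole_ok.group())             # an example word:cat!!
```

This is why the `re.match` cell above needs `.*` on both sides of `word:\w\w\w` to cover the text around the word.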


> See Google for Education > Python Regular Expressions
> 
> https://developers.google.com/edu/python/regular-expressions

## Grouping
### Group Extraction

In [18]:
pat = r'\b(대한민국)[은는].*?(\w+?)이다\.'
matched = re.search(pat, '헌법 제1조. 대한민국은 민주공화국이다.')
if matched:
    print(matched.group(1))
    print(matched.group(2))

대한민국
민주공화국


In [15]:
string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pat = r'([\w.-]+)@([\w.-]+)'
match = re.search(pat, string)
print(match.group())    # entire match
print(match.group(1))   # subgroup 1
print(match.group(2))   # subgroup 2
print(match.groups())   # all subgroups

alice@google.com
alice
google.com
('alice', 'google.com')


### findall, finditer
Find all matches.
- findall: returns a list of strings, or a list of tuples when the pattern contains more than one group
- finditer: returns an iterator of `Match` objects
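What `findall` returns depends on the number of groups in the pattern. A small sketch, using a separate variable `text` (an assumption of this example, so the notebook's own `string` and `pat` are left alone):

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No group: list of whole matches
no_group = re.findall(r'[\w.-]+@[\w.-]+', text)
print(no_group)      # ['alice@google.com', 'bob@abc.com']

# One group: list of that group's strings
one_group = re.findall(r'([\w.-]+)@[\w.-]+', text)
print(one_group)     # ['alice', 'bob']

# Two or more groups: list of tuples, one tuple per match
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(two_groups)    # [('alice', 'google.com'), ('bob', 'abc.com')]

# finditer yields Match objects, so positions are available as well
spans = [m.span() for m in re.finditer(r'[\w.-]+@[\w.-]+', text)]
sliced = [text[start:end] for start, end in spans]
print(sliced == no_group)   # True
```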

In [16]:
match = re.findall(pat, string) # returns list of tuples
print(match)

[('alice', 'google.com'), ('bob', 'abc.com')]


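Compiling the pattern once with `re.compile()` and reusing the resulting object is more efficient when the same pattern is used several times.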

In [8]:
pat = re.compile(r'([\w.-]+)@([\w.-]+)')
match = pat.findall(string)
print(match)

[('alice', 'google.com'), ('bob', 'abc.com')]


In [9]:
for match in pat.finditer(string):
    print(match.groups())

('alice', 'google.com')
('bob', 'abc.com')


### Back-referencing Groups
Refer to previous groups `group(1)`, `group(2)` by `\1`, `\2`, ...

`re.sub(pat, replacement, str)` returns a new string in which every match of `pat` is replaced by `replacement`.

In [10]:
new_str = pat.sub(r'\1@yo-yo-dyne.com', string)
print(new_str)

purple alice@yo-yo-dyne.com, blah monkey bob@yo-yo-dyne.com blah dishwasher
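Back-references work both in the replacement string and inside the pattern itself. A minimal sketch; the sentence `'this is is a typo'` is a made-up input for illustration.

```python
import re

string = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
pat = re.compile(r'([\w.-]+)@([\w.-]+)')

# In the replacement, \1 is the user part and \2 is the host part
swapped = pat.sub(r'\2@\1', string)
print(swapped)
# purple google.com@alice, blah monkey abc.com@bob blah dishwasher

# Inside a pattern, \1 must match the same text that group 1 captured
doubled = re.search(r'\b(\w+) \1\b', 'this is is a typo')
print(doubled.group())    # is is
print(doubled.group(1))   # is
```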


## Non-greedy matching
`.*` and `.+` try to match as much as possible (greedy).

For a non-greedy match (as little as possible), use `.*?` and `.+?`.

In [26]:
html = """<html>
    <head>
        <title>한림대학교</title>
    </head>
    <body>
        <p>LIFS lab 학생들은 열심히 공부한다.</p>
    </body>
</html>"""
re.search(r'<.*>', html).group()

'<html>'

In [27]:
re.search(r'<.*?>', html).group()

'<html>'
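Both cells above return `'<html>'` because `.` does not match a newline by default, and the first line of `html` contains a single tag. The difference shows up when several tags share one line. A short sketch with made-up inputs `line` and `multi`:

```python
import re

line = '<b>bold</b> and <i>italic</i>'

# Greedy: runs from the first '<' to the last '>' on the line
greedy = re.search(r'<.*>', line).group()
print(greedy)    # <b>bold</b> and <i>italic</i>

# Non-greedy: stops at the first '>' it can reach
lazy = re.search(r'<.*?>', line).group()
print(lazy)      # <b>

# With re.DOTALL, '.' also matches newlines, so greedy spans lines
multi = '<html>\n<head></head>\n</html>'
one_line = re.search(r'<.*>', multi).group()
all_lines = re.search(r'<.*>', multi, re.DOTALL).group()
print(one_line)              # <html>
print(all_lines == multi)    # True
```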

In [29]:
re.findall(r'<(.*?)>(.*)</\1>', html)

[('title', '한림대학교'), ('p', 'LIFS lab 학생들은 열심히 공부한다.')]

In [36]:
import requests

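r = requests.get("https://www.naver.com")
# Capture the tag name and the value of a src attribute that follows it
regexp = r'<(img|script|a|style|link)\b.*?src="(.*?)".*?>'
pat = re.compile(regexp)
files_loaded = pat.findall(r.text)  # list of (tag, src) tuples
files_loaded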

[('link', 'https://ssl.pstatic.net/tveta/libs/ndpsdk/prod/ndp-loader.js'),
 ('script', 'https://ssl.pstatic.net/tveta/libs/glad/prod/gfp-core.js'),
 ('script',
  'https://ssl.pstatic.net/tveta/libs/assets/js/pc/main/min/pc.veta.core.min.js'),
 ('script', 'https://pm.pstatic.net/resources/js/polyfill.f47ccc9a.js?o=www'),
 ('script', 'https://pm.pstatic.net/resources/js/preload.2efda94c.js?o=www'),
 ('script', 'https://pm.pstatic.net/resources/js/search.90d1988d.js?o=www'),
 ('script', 'https://pm.pstatic.net/resources/js/main.2e40335b.js?o=www')]
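The first result above pairs `link` with a `.js` file. A likely cause (an assumption, since the page source is not shown here) is that `.*?` can run past the closing `>` of one tag and pick up the `src` of a later tag on the same line. The sketch below reproduces this offline on a hypothetical one-line `page` and compares it with a stricter pattern that uses `[^>]` to stay inside one tag. The names `page`, `loose`, and `strict` are introduced only for this example.

```python
import re

# Hypothetical single-line HTML standing in for a minified page
page = ('<link rel="stylesheet" href="a.css">'
        '<script src="b.js"></script>'
        '<img src="c.png">')

# Pattern from the cell above: '.*?' may cross tag boundaries
loose = re.compile(r'<(img|script|a|style|link)\b.*?src="(.*?)".*?>')

# Stricter variant: '[^>]' cannot step over the end of the tag
strict = re.compile(r'<(img|script|a|style|link)\b[^>]*?\bsrc="([^"]*)"[^>]*>')

loose_result = loose.findall(page)
strict_result = strict.findall(page)
print(loose_result)    # [('link', 'b.js'), ('img', 'c.png')]
print(strict_result)   # [('script', 'b.js'), ('img', 'c.png')]
```

For anything beyond quick inspection, an HTML parser is generally a more reliable choice than a regular expression.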