# Regular expressions by example

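## When to use regular expressions
In everyday work we deal with large amounts of text and often need to search it.<br>
For a simple lookup, such as locating the word "Close", Ctrl+F is enough.<br>
Other needs are harder: extracting every date or number from a document, or finding a passage that contains one or more keywords.<br>
A plain text search cannot reliably locate all of the relevant text in these cases.<br>
This is where regular expressions come in.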

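### Finding dates in SEC documents
Dates in SEC filings usually follow this format: English month name, a 1- or 2-digit day, a comma, then a 4-digit year (for example, May 1, 2018).
A regular expression for it can be written like this: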

In [5]:
import re
def getlistiterbyregex(keyword, text):
    pattern = re.compile(keyword, re.I)
    return re.finditer(pattern, text)

text = r'The effectivedate is May 1, 2018 the disclosed date is April 25, 2018 '
keyword = r'effectivedate is ((?P<month>january|february|march|april|may|june|july|august|september|october|november|december)[\s]*([0-9]{1,2})[\s]*,[\s]*([0-9]{4}))'
find_iter = getlistiterbyregex(keyword, text)
for find in find_iter:
    print('find: {0}'.format(find.group()))
    print(find.span()[0])
    print(find.groupdict().keys())
    print(find.group('month'))
    print(find.group(1))
    print(find.group(4))

find: effectivedate is May 1, 2018
4
dict_keys(['month'])
May
May 1, 2018
2018
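
The pattern above is tied to the literal prefix `effectivedate is`, so it finds only the first date. As a minimal sketch, assuming we want every date in the text regardless of what precedes it, each part of the date can be given its own name. The helper `find_dates` and the group names `day` and `year` are illustrative and not part of the original notebook.

```python
import re

MONTHS = ('january|february|march|april|may|june|july|'
          'august|september|october|november|december')
# \b keeps 'may' from matching inside a longer word such as 'dismay'
DATE_PATTERN = re.compile(
    r'\b(?P<month>' + MONTHS + r')\s+(?P<day>[0-9]{1,2})\s*,\s*(?P<year>[0-9]{4})\b',
    re.I)

def find_dates(text):
    """Return a (month, day, year) tuple for every date found in text."""
    return [(m.group('month'), m.group('day'), m.group('year'))
            for m in DATE_PATTERN.finditer(text)]

sample = 'The effectivedate is May 1, 2018 the disclosed date is April 25, 2018 '
dates = find_dates(sample)
print(dates)
```

Both dates in the sample sentence are returned, each split into its three parts.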


### Finding numbers followed by a percent sign

In [7]:
text = r'''Additional Taxes Payable On Withdrawals, Surrenders, Or Annuity Payouts
The Code may impose a 10% additional tax on any distribution from your contract which you must include in your gross income. The 10% additional tax does not apply if one of several exceptions exists. These exceptions include withdrawals, surrenders, or Annuity Payouts that:
you receive on or after you reach 59 , you receive because you became disabled (as defined in the Code), you receive from an immediate annuity, a Beneficiary receives on or after your death, or you receive as a series of substantially equal periodic payments based on your life or life expectancy (non-natural owners holding as agent for an individual do not qualify).
Unearned Income Medicare Contribution. .35%
Congress enacted the “Unearned Income Medicare Contribution” as a part of the Health Care and Education Reconciliation Act of 2010. This tax, which affects individuals whose modified adjusted gross income exceeds certain thresholds, is a 3.8% tax on the lesser of (i) the individual's “unearned income”, or (ii) the dollar amount by which the individual's modified adjusted gross income exceeds the applicable threshold. Unearned income includes the taxable portion of distributions that you take from your annuity contract. If you take a distribution from your contract that may be subject to the tax, we will include a Distribution Code “D” in Box 7 of the Form 1099-R issued to report the distribution. Please consult your tax advisor to determine whether your annuity distributions are subject to this tax.
Special Rules If You Own More Than One Annuity Contract'''
keyword = r'[\d]*(\.[\d]*)?\%'
matchlist = getlistiterbyregex(keyword, text)
for match in matchlist:
    print(match.group(0))

10%
10%
.35%
3.8%
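
One caveat with `[\d]*(\.[\d]*)?\%`: every part before the percent sign is optional, so a lone `%` also counts as a match. A slightly stricter sketch, assuming at least one digit must appear before the sign, is shown below. The name `PERCENT` and the sample sentence are made up for illustration.

```python
import re

# Optional integer part, optional decimal point, then at least one digit
PERCENT = re.compile(r'\d*\.?\d+%')

sample = 'a 10% tax, a 3.8% tax, a fee of .35%, and a stray % sign'
loose = [m.group(0) for m in re.finditer(r'[\d]*(\.[\d]*)?\%', sample)]
strict = PERCENT.findall(sample)
print(loose)
print(strict)
```

The loose pattern reports the stray `%` as a fourth match, while the stricter one returns only the three numeric values.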


### Extracting specific content with groups
Groups are written with parentheses, ```()```. Many regex flavors write a named group as ```(?<groupname>regex)```.<br>
In Python the syntax is ```(?P<groupname>regex)```.

In [20]:
text = r'''The JPMorgan Insurance Trust Intrepid Mid Cap Portfolio intends
to liquidate. intends'''
keyword = r'The(?P<sharename>.*?) intends\s*to\s*liquidate'

matchlist = getlistiterbyregex(keyword, text)
groupname = 'sharename'
for match in matchlist:
    # groupdict() maps each named group to the text it captured
    for key, value in match.groupdict().items():
        if key == groupname:
            print('groupname:{0}, value: {1}'.format(key, value))

groupname:sharename, value:  JPMorgan Insurance Trust Intrepid Mid Cap Portfolio
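
When the group name is known ahead of time, the loop over `groupdict()` is not needed. A minimal sketch follows, with illustrative variable names. The `\s+` after `The` is an added assumption so that the captured name carries no leading space.

```python
import re

text = '''The JPMorgan Insurance Trust Intrepid Mid Cap Portfolio intends
to liquidate. intends'''
pattern = re.compile(r'The\s+(?P<sharename>.*?)\s+intends\s+to\s+liquidate', re.I)

match = pattern.search(text)
if match:
    # A named group can be read by name, by number, or through groupdict()
    sharename = match.group('sharename')
    print(sharename)
    print(match.group(1) == sharename)
    print(match.groupdict())
```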


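### Matching several variants with alternation
The keyword in a passage may appear in several forms, for example different tenses (past, present progressive, and so on), so a pattern written for one form will miss the others.<br>
Alternation (the ```|``` operator) solves this.<br>
The example below extracts the share names before and after a merger.<br>
It also shows how several named groups can each capture a different piece of information.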

In [21]:
text = r'''On March 15, 2018, the Board of Trustees of Voya Investors Trust approved a proposal to reorganize the Voya Multi-Manager Large Cap Core Portfolio (the “Reorganization”). 
Subject to shareholder approval, effective after the close of business on or about August 24, 2018 (the “Reorganization Date”), 
Class I shares of the Voya Multi-Manager Large Cap Core Portfolio (the “Merging Fund”) will be merged into 
Class I shares of the Voya Index Plus LargeCap Portfolio (the “Surviving Fund”).'''
keyword = r'on or about[\s]*?(?P<pendingdate>(january|february|march|april|may|june|july|august|september|october|november|december)[\s]*[0-9]{1,2}[\s]*,[\s]*[0-9]{4})[\s]*?\((.*)?\),[\s]*?(?P<previous>(.*)?)\(the(.*)?fund(”)?\)[\s]*?(.*)?merg(e|ed|ing) into[\s]*?(?P<after>(.*)?)\(the(.*)?fund(”)?\)\.'
matchlist = getlistiterbyregex(keyword, text)

groupnames = ['pendingdate', 'previous', 'after']
for match in matchlist:
    captured = match.groupdict()
    for groupname in groupnames:
        # Fall back to '' if the pattern does not define this group
        print('{0} is {1}'.format(groupname, captured.get(groupname, '')))

pendingdate is August 24, 2018
previous is Class I shares of the Voya Multi-Manager Large Cap Core Portfolio 
after is Class I shares of the Voya Index Plus LargeCap Portfolio 
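
To isolate the alternation idea from the long pattern above, here is a small sketch. `(?:...)` is a non-capturing group, so the alternation does not shift the numbering of the other groups. The sample sentences are made up for illustration.

```python
import re

# merg(?:e|ed|ing) accepts 'merge', 'merged' and 'merging'
pattern = re.compile(r'\bmerg(?:e|ed|ing)\s+into\s+(?P<after>.+?)\s*\(', re.I)

samples = [
    'Fund A will merge into Fund B (the "Surviving Fund").',
    'Fund A was merged into Fund B (the "Surviving Fund").',
    'Fund A is merging into Fund B (the "Surviving Fund").',
]
results = [pattern.search(s).group('after') for s in samples]
print(results)
```

All three tenses yield the same captured name, which is the point of writing the variants as one alternation.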


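### Extracting a section first, then the details
Some documents are more complex.<br>
For instance, the section that lists the share names has a clear marker, UNDERLYING FUNDS:, but the names themselves form a list.<br>
The usual approach is to isolate the section first,<br>
and then extract the share names from it.<br>
Example: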

In [27]:
text = r'''
UNDERLYING FUNDS:                                    MANAGED BY:                                  TRUST
--------------------------------------------------   ------------------------------------------   -------
                                                                                            
 Franklin Founding Funds Allocation VIP Fund         Franklin Templeton Services, LLC             FTVIPT
 Franklin Income VIP Fund                            Franklin Advisers, Inc.                      FTVIPT
 SA Allocation Balanced Portfolio                    SunAmerica Asset Management, LLC             SST
 SA Allocation Growth Portfolio                      SunAmerica Asset Management, LLC             SST
 SA Allocation Moderate Portfolio                    SunAmerica Asset Management, LLC             SST
 SA Allocation Moderate Growth Portfolio             SunAmerica Asset Management, LLC             SST
 SA American Funds(R) Asset Allocation Portfolio     Capital Research and Management Company      SAST
 SA BlackRock Multi-Asset Income Portfolio           BlackRock Investment Management, LLC         AST
 SA Edge Asset Allocation Portfolio                  Principal Global Investors, LLC              AST
 SA JPMorgan Diversified Balanced Portfolio          J.P. Morgan Investment Management Inc.       SAST
 SA MFS Total Return Portfolio*                      Massachusetts Financial Services Company     SAST



'''
keyword = r'[\n]UNDERLYING FUNDS(.*)[\n]-{5,}(.*)?[\n](?P<multiname>[\s\S]*?)[\n][\n]'
# keyword = r'[\n]UNDERLYING FUNDS.*[\n]-{5,}'

matchlist = getlistiterbyregex(keyword, text)
segment = ''
for index, match in enumerate(matchlist):
    print(index)
    segment = match.group('multiname')
#     segment = match.group()
    print(segment)

0
                                                                                            
 Franklin Founding Funds Allocation VIP Fund         Franklin Templeton Services, LLC             FTVIPT
 Franklin Income VIP Fund                            Franklin Advisers, Inc.                      FTVIPT
 SA Allocation Balanced Portfolio                    SunAmerica Asset Management, LLC             SST
 SA Allocation Growth Portfolio                      SunAmerica Asset Management, LLC             SST
 SA Allocation Moderate Portfolio                    SunAmerica Asset Management, LLC             SST
 SA Allocation Moderate Growth Portfolio             SunAmerica Asset Management, LLC             SST
 SA American Funds(R) Asset Allocation Portfolio     Capital Research and Management Company      SAST
 SA BlackRock Multi-Asset Income Portfolio           BlackRock Investment Management, LLC         AST
 SA Edge Asset Allocation Portfolio                  Principal Global Investors, L

<font color='red' size='5'><b>The next example uses lazy (non-greedy) matching to pull out the share names directly. Very useful!</b></font>

In [28]:
if len(segment) > 0:
#      keyword = r'[\n](?P<sharename>(.*)?)[ ]{2,}'
#     keyword = r'[\n](?P<sharename>(.*)?(fund|portfolio)(\*)?)'
    # (.*)? is a greedy .* inside an optional group, not a lazy match, so it overshoots.
    # [\s\S]*? is lazy: it stops at the first run of 2 or more spaces.
    keyword = r'[\n](?P<sharename>[\s\S]*?)[ ]{2,}'
    matchlist = getlistiterbyregex(keyword, segment)
    for index, match in enumerate(matchlist):
        print('share {0}, name is: {1}'.format(index + 1, match.group('sharename')))

share 1, name is:  Franklin Founding Funds Allocation VIP Fund
share 2, name is:  Franklin Income VIP Fund
share 3, name is:  SA Allocation Balanced Portfolio
share 4, name is:  SA Allocation Growth Portfolio
share 5, name is:  SA Allocation Moderate Portfolio
share 6, name is:  SA Allocation Moderate Growth Portfolio
share 7, name is:  SA American Funds(R) Asset Allocation Portfolio
share 8, name is:  SA BlackRock Multi-Asset Income Portfolio
share 9, name is:  SA Edge Asset Allocation Portfolio
share 10, name is:  SA JPMorgan Diversified Balanced Portfolio
share 11, name is:  SA MFS Total Return Portfolio*
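
To see why the lazy quantifier matters here, the sketch below runs three captures against a single made-up row. `(.*)?` is a greedy `.*` wrapped in an optional group, so it gives back only as much as it must and ends up at the last run of spaces. `.*?` and `[\s\S]*?` both stop at the first one.

```python
import re

row = '\n SA Edge Asset Allocation Portfolio      Principal Global Investors, LLC      AST'

greedy = re.search(r'\n(?P<name>(.*)?) {2,}', row).group('name')
lazy_dot = re.search(r'\n(?P<name>.*?) {2,}', row).group('name')
lazy_any = re.search(r'\n(?P<name>[\s\S]*?) {2,}', row).group('name')

print(repr(greedy))    # runs through the manager column
print(repr(lazy_dot))  # stops after the share name
print(repr(lazy_any))  # same result as lazy_dot on a single line
```

On a single line `.*?` and `[\s\S]*?` behave the same. They differ only when the captured text may span a newline, which `.` does not match by default.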


Another example:

In [30]:
text = r'''
Variable Account Options
--------------------------------------------------------------------------------



                                     
    VALIC Company I Funds               VALIC Company II Funds
    Asset Allocation Fund               Aggressive Growth Lifestyle Fund
    Blue Chip Growth Fund               Capital Appreciation Fund
    Broad Cap Value Income Fund         Core Bond Fund
    Capital Conservation Fund           Conservative Growth Lifestyle Fund
    Core Equity Fund                    Government Money Market II Fund
    Dividend Value Fund                 High Yield Bond Fund
    Emerging Economies Fund             International Opportunities Fund
    Foreign Value Fund                  Large Cap Value Fund
    Global Real Estate Fund             Mid Cap Growth Fund
    Global Social Awareness Fund        Mid Cap Value Fund
    Global Strategy Fund                Moderate Growth Lifestyle Fund
    Government Money Market I Fund      Small Cap Growth Fund
    Government Securities Fund          Small Cap Value Fund
    Growth Fund                         Socially Responsible Fund
    Growth & Income Fund                Strategic Bond Fund
    Health Sciences Fund
    Inflation Protected Fund
    International Equities Index Fund
    International Government Bond Fund
    International Growth Fund
    Large Cap Core Fund
    Large Capital Growth Fund
    Mid Cap Index Fund
    Mid Cap Strategic Growth Fund
    Nasdaq-100(R) Index Fund
    Science & Technology Fund
    Small Cap Aggressive Growth Fund
    Small Cap Fund
    Small Cap Index Fund
    Small Cap Special Values Fund
    Small-Mid Growth Fund
    Stock Index Fund
    Value Fund





Table of Contents
--------------------------------------------------------------------------------

'''

In [31]:
keyword = r'[\n]Variable Account Options[\n]-{5,}(.*)?[\n](?P<multiname>[\s\S]*?)[\n][\n]Table of Contents[\n]'
matchlist = getlistiterbyregex(keyword, text)
segmentforname = ''
for match in matchlist:
    segmentforname = match.group('multiname')
    print(segmentforname)




                                     
    VALIC Company I Funds               VALIC Company II Funds
    Asset Allocation Fund               Aggressive Growth Lifestyle Fund
    Blue Chip Growth Fund               Capital Appreciation Fund
    Broad Cap Value Income Fund         Core Bond Fund
    Capital Conservation Fund           Conservative Growth Lifestyle Fund
    Core Equity Fund                    Government Money Market II Fund
    Dividend Value Fund                 High Yield Bond Fund
    Emerging Economies Fund             International Opportunities Fund
    Foreign Value Fund                  Large Cap Value Fund
    Global Real Estate Fund             Mid Cap Growth Fund
    Global Social Awareness Fund        Mid Cap Value Fund
    Global Strategy Fund                Moderate Growth Lifestyle Fund
    Government Money Market I Fund      Small Cap Growth Fund
    Government Securities Fund          Small Cap Value Fund
    Growth Fund                         Sociall

This example not only extracts the share names but also treats the first row as the share header and prepends it to each name.

In [32]:
if len(segmentforname) > 0:
    keyword = '[\n][ ]{2,}(?P<sharename>(.*)?)'
    matchlist = getlistiterbyregex(keyword, segmentforname)
    shareheaders = []
    isfirstline = True
    for match in matchlist:
        shareinfo = match.group('sharename')
        sharelist = [share for share in shareinfo.split('  ') if len(share.strip()) > 0]
        for index, share in enumerate(sharelist):
            if len(share.strip()) > 0:
                if isfirstline:
                    shareheaders.append(share)
                else:
                    sharename = '{0} {1}'.format(shareheaders[index], share)
                    print('sharename is {0}'.format(sharename))
        if len(shareheaders) > 0:
            isfirstline = False

sharename is VALIC Company I Funds Asset Allocation Fund
sharename is  VALIC Company II Funds  Aggressive Growth Lifestyle Fund
sharename is VALIC Company I Funds Blue Chip Growth Fund
sharename is  VALIC Company II Funds  Capital Appreciation Fund
sharename is VALIC Company I Funds Broad Cap Value Income Fund
sharename is  VALIC Company II Funds  Core Bond Fund
sharename is VALIC Company I Funds Capital Conservation Fund
sharename is  VALIC Company II Funds  Conservative Growth Lifestyle Fund
sharename is VALIC Company I Funds Core Equity Fund
sharename is  VALIC Company II Funds Government Money Market II Fund
sharename is VALIC Company I Funds Dividend Value Fund
sharename is  VALIC Company II Funds  High Yield Bond Fund
sharename is VALIC Company I Funds Emerging Economies Fund
sharename is  VALIC Company II Funds  International Opportunities Fund
sharename is VALIC Company I Funds Foreign Value Fund
sharename is  VALIC Company II Funds Large Cap Value Fund
sharename is VALIC Compa
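
The stray leading spaces in some of the lines above come from splitting on exactly two spaces. As a sketch of one alternative, assuming columns are always separated by two or more spaces, `re.split` can consume the whole run. The helper name `combine_with_headers` and the shortened sample are illustrative.

```python
import re

segment = '''
    VALIC Company I Funds               VALIC Company II Funds
    Asset Allocation Fund               Aggressive Growth Lifestyle Fund
    Core Equity Fund                    Government Money Market II Fund
    Health Sciences Fund
'''

def combine_with_headers(segment):
    """Prefix every cell with the header found in the same column of the first row."""
    rows = [re.split(r' {2,}', line.strip())
            for line in segment.splitlines() if line.strip()]
    headers, body = rows[0], rows[1:]
    return ['{0} {1}'.format(headers[i], cell)
            for row in body for i, cell in enumerate(row)]

names = combine_with_headers(segment)
for name in names:
    print(name)
```

Like the original loop, this assigns cells to headers by position, so a row with only the right-hand column filled in would be attributed to the first header.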

In [34]:
line = r'56    67   89  43534     4535'
line = re.sub('( ){2,}', ' ', line)
print(line)

56 67 89 43534 4535


In [35]:
phone = r'0755-8-8$3911#'
phone = re.sub(r'\W', '', phone).strip()
print(phone)

0755883911


In [36]:
text = r'aa  bb %^#$#$#$  ww&&&&&&&&&& my phone'
print([split.strip() for split in text.split('  ') if len(split) > 0])
my_list = [split for split in re.split(r'\W{2,}', text) if len(split.strip()) > 0]
print(my_list)

['aa', 'bb %^#$#$#$', 'ww&&&&&&&&&& my phone']
['aa', 'bb', 'ww', 'my phone']


### Matching directly against a text file

In [40]:
import os
filename = './docs/txt/168104986/1.txt'
if os.path.exists(filename):
    with open(filename, 'r', encoding='utf-8', errors='ignore') as txt:
        text = txt.read().lower().replace('\x00', '').strip()
        keyword = r'[\n](.*)?(liquidat|transfer[\s]*date)(.*)?'
        matchlist = getlistiterbyregex(keyword, text)
        for match in matchlist:
            print(match.group(0))


this fund will be liquidating on or about june 28, 2018.

if, pursuant to sec rules, an underlying money market fund suspends payment of redemption proceeds in connection with a liquidation of the fund, we will delay payment of any transfer, partial withdrawal, surrender, loan, or death benefit from the money market sub-account until the fund is liquidated. payment of contract proceeds from the fixed account may be delayed for up to six months.

* the numbers of accumulation units less than 1000 were rounded up to one. 1 the jpmorgan insurance trust intrepid mid cap portfolio was liquidated on may 19, 2017. 2 the lvip aqr enhanced global strategies fund was liquidated on january 10, 2017. 3 on december 9, 2016, this subaccount was closed and the values were transferred to the lvip ssga international managed volatility fund subaccount. 4 the transparent value directional allocation vi portfolio was liquidated on june 6, 2016.
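
A variation on the same idea, sketched under the assumption that each paragraph sits on its own line: with `re.MULTILINE`, `^` and `$` anchor at line boundaries, so the match no longer includes the leading newline. The temporary file and its contents are made up so that the block runs on its own.

```python
import os
import re
import tempfile

pattern = re.compile(r'^.*(?:liquidat|transfer\s*date).*$', re.I | re.M)

sample = ('This fund will be liquidating on or about June 28, 2018.\n'
          'Nothing relevant on this line.\n'
          'The Transfer Date is July 1, 2018.\n')

# Write the made-up sample to a temporary file, then read it back
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write(sample)
    filename = tmp.name

try:
    with open(filename, 'r', encoding='utf-8', errors='ignore') as txt:
        text = txt.read().replace('\x00', '')
    hits = [m.group(0) for m in pattern.finditer(text)]
finally:
    os.remove(filename)

for hit in hits:
    print(hit)
```

Because `re.I` is set, there is no need to lowercase the text first, and the matched lines keep their original casing.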


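### Replacing text with regular expressions
Scraped text often contains special symbols or other unwanted parts; the regex substitution functions can replace them in bulk.

#### Replacing unwanted special symbols with a space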

In [41]:
text = r'This    is    test  @  - text,    contain   part of  · special     Characters (formerly,   it''s ok).   (the test text is     for    regex   training)'
keyword = '(@|-|·)'
newtext = re.sub(keyword, ' ', text)
print(newtext)

This    is    test       text,    contain   part of    special     Characters (formerly,   its ok).   (the test text is     for    regex   training)


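#### Removing an unwanted parenthetical
<font color='red'><b>Note: ```[\s\S]*?``` is again used for lazy matching. ```(.*)?``` is not lazy (it is a greedy ```.*``` inside an optional group), so it would run to the last ```)``` and remove too much!</b></font>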

In [80]:
keyword = r'\(formerly[\s\S]*?\)'
newtext = re.sub(keyword, '', newtext)
print(newtext)

This    is    test       text,    contain   part of    special     Characters .   (the test text is     for    regex   training)
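
The difference is easy to check on a short string, made up here for illustration. The greedy form removes everything up to the last `)`, while the lazy form stops at the first one. A negated character class, `[^)]*`, is a third option that cannot overshoot by construction.

```python
import re

text = 'Characters (formerly, its ok). (the test text is for regex training)'

greedy = re.sub(r'\(formerly(.*)?\)', '', text)
lazy = re.sub(r'\(formerly[\s\S]*?\)', '', text)
negated = re.sub(r'\(formerly[^)]*\)', '', text)

print(repr(greedy))
print(repr(lazy))
print(repr(negated))
```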


#### Collapsing runs of two or more spaces into one

In [42]:
# Replace every run of two or more spaces with a single space
keyword = '( ){2,}'
newtext = re.sub(keyword, ' ', newtext)
print(newtext)

This is test text, contain part of special Characters . (the test text is for regex training)


The same cleanup can be done more generically. Note that collapsing multiple spaces into one should be the last step.

In [43]:
text = r'This    is    test  @  - text,    contain   part of  · special     Characters (formerly,   it''s ok).   (the test text is     for    regex   training)'
# Apply the substitutions in order; collapsing runs of spaces comes last
keywords = [r'(@|-|·)', r'\(formerly[\s\S]*?\)', r'( ){2,}']
for keyword in keywords:
    text = re.sub(keyword, ' ', text)
print(text)

This is test text, contain part of special Characters . (the test text is for regex training)


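### Performance considerations
* To match any character (including newlines) lazily, ```[\s\S]*?``` works well.
* If the text is known to sit on a single line and lazy matching is not needed, ```(.*)?``` is a very common way to write it (it behaves like a plain greedy ```.*```).
* If a keyword is known to start a paragraph, anchoring the pattern with a leading newline tends to perform better: ```[\n]Effective date of(.*)?```
* When a symbol repeats many times, there is no need to spell out the whole run in the pattern. For example: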

In [44]:
text = '''
Variable Account Options
--------------------------------------------------------------------------------
    VALIC Company I Funds               VALIC Company II Funds
    Asset Allocation Fund               Aggressive Growth Lifestyle Fund
'''

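To anchor on the Variable Account Options heading and capture the share names that follow, the pattern below is much more concise. Note <b>-{10,}</b>, which means the ```-``` character repeated at least 10 times.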

In [45]:
keyword = r'[\n]Variable Account Options[\n]-{10,}[\n](?P<multiname>[\s\S]*)'
matchlist = getlistiterbyregex(keyword, text)
for match in matchlist:
    print(match.group('multiname'))

    VALIC Company I Funds               VALIC Company II Funds
    Asset Allocation Fund               Aggressive Growth Lifestyle Fund
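
Performance claims like the ones in this section are best checked by measurement on your own data. Below is a rough sketch using `timeit` from the standard library. The sample text, the two candidate patterns, and the repetition count are arbitrary assumptions, and the absolute numbers will vary by machine and Python version.

```python
import re
import timeit

text = '\nVariable Account Options\n' + '-' * 80 + '\n' + '    Some Fund Name\n' * 200

candidates = {
    'lazy any-char': re.compile(
        r'\nVariable Account Options\n-{10,}\n(?P<multiname>[\s\S]*?)\Z'),
    'greedy any-char': re.compile(
        r'\nVariable Account Options\n-{10,}\n(?P<multiname>[\s\S]*)'),
}

timings = {}
for label, pattern in candidates.items():
    # Time 1000 searches of the same text with each compiled pattern
    timings[label] = timeit.timeit(lambda: pattern.search(text), number=1000)
    print('{0}: {1:.4f}s'.format(label, timings[label]))
```

Both patterns capture the same text here, so any difference in the timings comes from how the engine reaches the end of the string.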

