In [4]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [5]:
import nltk

#### The NLP pipeline
1. Download web page, strip HTML if necessary, trim to desired content.
```
import urllib.request
from bs4 import BeautifulSoup

html = urllib.request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html, 'html.parser').get_text()
```
2. Tokenize the text, select tokens of interest, create an NLTK text
```
tokens = nltk.word_tokenize(raw)
text = nltk.Text(tokens)
```
3. Normalize the words, build the vocabulary
```
words = [w.lower() for w in text]
vocab = sorted(set(words))
```
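
The three steps above can be tried offline with a rough, standard-library-only sketch. The names `TextExtractor`, `pipeline`, and `sample_html` are hypothetical; `html.parser` stands in for BeautifulSoup and a simple regular expression stands in for `nltk.word_tokenize`, so the tokens will not always match what NLTK produces.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document (stand-in for get_text())."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def pipeline(html):
    # Step 1: strip the HTML markup.
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    raw = ''.join(extractor.chunks)
    # Step 2: tokenize (crude stand-in for nltk.word_tokenize).
    tokens = re.findall(r"\w+|[^\w\s]", raw)
    # Step 3: normalize to lowercase and build the vocabulary.
    words = [w.lower() for w in tokens]
    return sorted(set(words))


sample_html = '<html><body><p>The cat saw the <b>Cat</b>.</p></body></html>'
vocab = pipeline(sample_html)
print(vocab)
```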

## 3.1 Accessing text from the web and from disk

### Electronic books

In [6]:
# Electronic book
from urllib import request
url = 'http://www.gutenberg.org/files/2554/2554.txt'
raw = request.urlopen(url).read().decode('utf8')
type(raw) # this is a string
len(raw) # raw holds a string of 1,176,896 characters
raw[:75]  # the first 75 characters

str

1176896

'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'

In [7]:
# Break the string into words and punctuation; this step is called tokenization
tokens = nltk.word_tokenize(raw)
type(tokens) # a list
len(tokens)  # 254,352 tokens in total
tokens[:10] # the first 10 tokens

list

254352

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

In [8]:
text = nltk.Text(tokens)
type(text)

nltk.text.Text

In [9]:
text[:10]

['The',
 'Project',
 'Gutenberg',
 'EBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by']

In [10]:
text.collocations() 

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; Nikodim Fomitch; young man; Ilya Petrovitch; n't know;
Project Gutenberg; Dmitri Prokofitch; Andrey Semyonovitch; Hay Market


In [11]:
# find() and rfind() ("reverse find") return the indices needed to slice the string

raw.find('PART I')
raw.find('START')
raw.find('End')

5338

533

1157746
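
The indices returned by `find()` and `rfind()` can then be used to slice away the header and footer. A minimal sketch on a made-up string (the sample text and the marker strings are assumptions for illustration, not the actual Gutenberg file):

```python
sample = "HEADER junk\nPART I\nOn an exceptionally hot evening...\nEnd of Project\nlicense text"

start = sample.find("PART I")          # index of the first occurrence
end = sample.rfind("End of Project")   # index of the last occurrence

# find() and rfind() return -1 when the substring is missing,
# so check before slicing.
if start != -1 and end != -1:
    body = sample[start:end]
else:
    body = sample

print(repr(body))
```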

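### Dealing with HTML

#### Much of the text on the web is in the form of HTML documents. You can use a web browser to save a page as text to a local file, then access it as described in the section on files below. However, if you are going to do this often, it is easiest to get Python to do the work directly.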

In [12]:
url = 'http://news.bbc.co.uk/2/hi/health/2284783.stm'
html = request.urlopen(url).read().decode('utf8')
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [13]:
len('<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN')

60

In [14]:
type(html)

str

In [15]:
from bs4 import BeautifulSoup
bshtml = BeautifulSoup(html, 'html.parser').get_text()
type(bshtml)

str

In [16]:
print(BeautifulSoup(html, 'html.parser').prettify())

<!DOCTYPE doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
 <head>
  <title>
   BBC NEWS | Health | Blondes 'to die out in 200 years'
  </title>
  <meta content="BBC, News, BBC News, news online, world, uk, international, foreign, british, online, service" name="keywords">
   <meta content="2002/09/27 11:51:55" name="OriginalPublicationDate">
    <meta content="/1/hi/health/2284783.stm" name="UKFS_URL">
     <meta content="/2/hi/health/2284783.stm" name="IFS_URL">
      <meta content="text/html;charset=iso-8859-1" name="HTTP-EQUIV">
       <meta content="Blondes 'to die out in 200 years'" name="Headline">
        <meta content="Health" name="Section">
         <meta content="Natural blondes are an endangered species and will die out by 2202, a study suggests." name="Description">
          <!-- GENMaps-->
          <map name="banner">
           <area alt="BBC NEWS" coords="7,9,167,32" href="http://news.bbc.co.uk/1/hi.html"

In [17]:
bshtml

'\n\n\nBBC NEWS | Health | Blondes \'to die out in 200 years\'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nNEWS\n\xa0\xa0SPORT\n\xa0\xa0WEATHER\n\xa0\xa0WORLD SERVICE\n\n\xa0\xa0A-Z INDEX\xa0\n\n\xa0\xa0SEARCH\xa0\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\r\n    \xa0You are in:\xa0Health \xa0\r\n    \r\n    \r\n\n\n\n\n\n\n\n\n\n\n\nNews Front Page\n\n\n\n\n\nAfrica\n\n\nAmericas\n\n\nAsia-Pacific\n\n\nEurope\n\n\nMiddle East\n\n\nSouth Asia\n\n\nUK\n\n\nBusiness\n\n\nEntertainment\n\n\nScience/Nature\n\n\nTechnology\n\n\nHealth\n\n\nMedical notes\n\n\n-------------\n\n\nTalking Point\n\n\n-------------\n\n\nCountry Profiles\n\n\nIn Depth\n\n\n-------------\n\n\nProgrammes\n\n\n-------------\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSERVICES\r\n\n\n\n\n\n\n\nDaily E-mail\r\n\n\n\n\n\n\n\nNews Ticker\r\n\n\n\n\n\n\n\nMobile/PDAs\r\n\n\n\n\n\n\n-------------\r\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nText Onl

In [18]:
tokens = nltk.word_tokenize(bshtml)
tokens[:10]

['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in']

In [19]:
tokens[110:120]
text = nltk.Text(tokens)
type(text)
text

['UK',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 "years'",
 'Scientists',
 'believe']

nltk.text.Text

<Text: BBC NEWS | Health | Blondes 'to die...>

In [22]:
text.concordance('gene') # find "gene" together with its context

Displaying 7 of 7 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 
er's Polio campaign launched in Iraq Gene defect explains high blood pressure 


### Processing RSS feeds

#### The blogosphere is an important source of text, in both formal and informal registers.

In [23]:
import feedparser

In [24]:
llog = feedparser.parse('http://languagelog.ldc.upenn.edu/nll/?feed=atom')
llog['feed']['title']

'Language Log'

In [25]:
len(llog.entries)

13

In [26]:
post = llog.entries[2]
post.title

'Uncle Martian knocks off Under Armour'

In [27]:
content = post.content[0].value
content[0:70] 
type(content)

'<p>From William Lou, "<a href="https://www.thescore.com/nba/news/13442'

str

In [28]:
raw = BeautifulSoup(content,'html.parser').get_text()
nltk.word_tokenize(raw)[:20]

['From',
 'William',
 'Lou',
 ',',
 '``',
 'Obvious',
 'Chinese',
 'knockoff',
 'ruled',
 'trademark',
 'infringement',
 'of',
 'Under',
 'Armour',
 "''",
 ',',
 'theScore',
 '(',
 '8/4/17',
 ')']

### Reading local files

#### To read a local file, we use Python's built-in open() function, followed by the read() method

In [29]:
f = open('document.txt')
raw = f.read()
raw

'A few words about Dostoevsky himself may help the English reader to\nunderstand his work.\n\nDostoevsky was the son of a doctor. His parents were very hard-working\nand deeply religious people, but so poor that they lived with their five\nchildren in only two rooms. The father and mother spent their evenings\nin reading aloud to their children, generally from books of a serious\ncharacter.\n\nThough always sickly and delicate Dostoevsky came out third in the\nfinal examination of the Petersburg school of Engineering. There he had\nalready begun his first work, "Poor Folk."\n'

In [30]:
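# Older code used open('document.txt', 'rU'):
# 'r' means to open the file for reading (the default),
# and 'U' stands for "Universal", which ignored the different conventions used for marking newlines.
# Python 3 text mode already handles newlines this way, and the 'U' flag
# was removed in Python 3.11, so plain 'r' is used here.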

f=open('document.txt', 'r')
for line in f:
    print(line.strip())

A few words about Dostoevsky himself may help the English reader to
understand his work.

Dostoevsky was the son of a doctor. His parents were very hard-working
and deeply religious people, but so poor that they lived with their five
children in only two rooms. The father and mother spent their evenings
in reading aloud to their children, generally from books of a serious
character.

Though always sickly and delicate Dostoevsky came out third in the
final examination of the Petersburg school of Engineering. There he had
already begun his first work, "Poor Folk."
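
The cells above leave the file object open. A common alternative is a `with` block, which closes the file automatically. The sketch below writes its own small temporary file so that it is self-contained; the file name `sample.txt` and its contents are made up for illustration.

```python
import os
import tempfile

# Create a small file so the example does not depend on document.txt.
path = os.path.join(tempfile.mkdtemp(), 'sample.txt')
with open(path, 'w', encoding='utf8') as f:
    f.write('first line\nsecond line\n')

# The with-statement closes the file when the block ends, even on errors.
with open(path, encoding='utf8') as f:
    lines = [line.strip() for line in f]

print(lines)
print(f.closed)
```

Passing `encoding` explicitly avoids depending on the platform's default encoding.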


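## 3.2 Strings: text processing at the lowest level

#### If a string contains a single quote, put a backslash before the quote, or enclose the string in double quotes instead
A string that spans several source lines can be written with a trailing backslash \ or wrapped in parentheses ( ); the pieces are joined without a line break.
To keep the line breaks between the lines, use triple quotes.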

In [31]:
couplet1 = "Shall I compare thee to a Summer's day?"\
"Thou art more lovely and more temperate:" 
couplet2 = ("Shall I compare thee to a Summer's day?"
"Thou art more lovely and more temperate:" )

print(couplet1)
print(couplet2)

Shall I compare thee to a Summer's day?Thou art more lovely and more temperate:
Shall I compare thee to a Summer's day?Thou art more lovely and more temperate:


In [32]:
couplet3 = '''Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate.'''
print(couplet3)

Shall I compare thee to a Summer's day?
Thou art more lovely and more temperate.


In [33]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)

            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very


In [34]:
monty = 'Monty Python'
grail = 'Holy Grail'
print(monty, grail)
print(monty + grail)
print(monty,'and the','grail')

Monty Python Holy Grail
Monty PythonHoly Grail
Monty Python and the grail


In [35]:
# Count individual characters
from nltk.corpus import gutenberg
raw = gutenberg.raw('melville-moby_dick.txt')
fdist = nltk.FreqDist(ch.lower() for ch in raw if ch.isalpha())
fdist.most_common(5)

[('e', 117092), ('t', 87996), ('a', 77916), ('o', 69326), ('n', 65617)]

In [36]:
[{char:count} for (char, count) in fdist.most_common()]

[{'e': 117092},
 {'t': 87996},
 {'a': 77916},
 {'o': 69326},
 {'n': 65617},
 {'i': 65434},
 {'s': 64231},
 {'h': 62896},
 {'r': 52134},
 {'l': 42793},
 {'d': 38219},
 {'u': 26697},
 {'m': 23277},
 {'c': 22507},
 {'w': 22222},
 {'f': 20833},
 {'g': 20820},
 {'p': 17255},
 {'b': 16877},
 {'y': 16872},
 {'v': 8598},
 {'k': 8059},
 {'q': 1556},
 {'j': 1082},
 {'x': 1030},
 {'z': 632}]
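
`nltk.FreqDist` is built on `collections.Counter` from the standard library, so the same kind of character count can be sketched without NLTK or its corpora. The sample sentence below is made up for illustration.

```python
from collections import Counter

sample = "Call me Ishmael. Some years ago, never mind how long precisely."

# Count letters only, ignoring case, spaces, and punctuation.
counts = Counter(ch.lower() for ch in sample if ch.isalpha())

# most_common(n) returns the n highest counts as (item, count) pairs.
print(counts.most_common(3))
print(counts['m'])
```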

### String Methods

Strings are immutable. Lists are mutable, and tuples are immutable.

s.find(t)
s.rfind(t)
s.index(t)
s.rindex(t)
s.join(text)
s.split(t)
s.splitlines()
s.lower()
s.upper()
s.title()
s.strip()
s.replace(t, u)

In [37]:
s='who knows?'
s.find('o')

2

In [38]:
s.rfind('o')

6

In [39]:
s.index('s')

8

In [40]:
s.join('NM') 

'Nwho knows?M'

In [41]:
s.split('n')

['who k', 'ows?']

In [42]:
'What the HECK'.lower()
'What the HECK'.upper()
'What the HECK'.title() # title-case: capitalize the first letter of each word
'   What the HECK    '.strip() # return a copy of s without leading or trailing whitespace
'What the HECK'.replace('t', 'S')

'what the heck'

'WHAT THE HECK'

'What The Heck'

'What the HECK'

'WhaS She HECK'
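
Note that `s.join(text)` uses `s` as the separator between the items of `text`, which is why `s.join('NM')` above places `'who knows?'` between `'N'` and `'M'`. The more common use is joining a list of words with a short separator, as in this small sketch (the sentence is made up):

```python
sentence = 'who knows what tomorrow brings'

words = sentence.split()          # split on runs of whitespace
rejoined = ' '.join(words)        # the separator string is the one that calls join()
hyphenated = '-'.join(words[:2])

print(words)
print(rejoined == sentence)
print(hyphenated)
```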

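## 3.3 Text processing with Unicode

#### Unicode supports over a million characters. Each character is assigned a number, called a code point. In Python,
code points are written in the form \uXXXX, where XXXX is the number in four-digit hexadecimal form.  
Text in files always has a particular encoding, so we need some mechanism for translating it into Unicode.
Translation into Unicode is called decoding. Conversely, to write Unicode to a file or a terminal, we first need to
translate it into a suitable encoding. This translation out of Unicode is called encoding.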

#### GB2312, Latin-2, UTF-8 -> decode -> Unicode -> encode -> GB2312, Latin-2, UTF-8
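
The round trip can be sketched with `str.encode()` and `bytes.decode()`. This is only a small illustration; the sample word is chosen from the Polish text used later in this section.

```python
word = 'Państwowa'

latin2_bytes = word.encode('latin2')      # str -> bytes (encoding)
decoded = latin2_bytes.decode('latin2')   # bytes -> str (decoding)
utf8_bytes = decoded.encode('utf8')       # re-encode using a different encoding

print(latin2_bytes)
print(utf8_bytes)
print(decoded == word)

# Decoding with the wrong codec raises no error here, but yields the wrong text.
wrong = latin2_bytes.decode('latin1')
print(wrong)
```

The UTF-8 version is one byte longer because the non-ASCII letter takes two bytes in UTF-8 but only one in Latin-2.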

In [43]:
# Locate a file containing encoded text
path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')


In [44]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line)

Pruska Biblioteka Państwowa. Jej dawne zbiory znane pod nazwą
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały
odnalezione po 1945 r. na terytorium Polski. Trafiły do Biblioteki
Jagiellońskiej w Krakowie, obejmują ponad 500 tys. zabytkowych
archiwaliów, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.


In [45]:
f = open(path, encoding='latin2')
for line in f:
    line = line.strip()
    print(line.encode('unicode_escape'))

b'Pruska Biblioteka Pa\\u0144stwowa. Jej dawne zbiory znane pod nazw\\u0105'
b'"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez'
b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y'
b'odnalezione po 1945 r. na terytorium Polski. Trafi\\u0142y do Biblioteki'
b'Jagiello\\u0144skiej w Krakowie, obejmuj\\u0105 ponad 500 tys. zabytkowych'
b'archiwali\\xf3w, m.in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.'


In [46]:
import unicodedata
lines = open(path, encoding='latin2').readlines()
line=lines[2]
line

'Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały\n'

In [47]:
print(line.encode('unicode_escape'))

b'Niemc\\xf3w pod koniec II wojny \\u015bwiatowej na Dolny \\u015al\\u0105sk, zosta\\u0142y\\n'


In [48]:
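# The unicodedata module lets us inspect the properties of Unicode characters.
# Select every character outside the ASCII range in the third line of the Polish text
# and print its UTF-8 byte sequence,
# followed by its code point in the standard Unicode convention (hexadecimal digits prefixed with U+),
# and then its Unicode name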

for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c.encode('utf8'), ord(c), unicodedata.name(c)))

b'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
b'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
b'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
b'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
b'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE


In [49]:
ord('ó') # look up the integer ordinal of a character

243

In [50]:
for c in line:
    if ord(c) > 127:
        print('{} U+{:04x} {}'.format(c, ord(c), unicodedata.name(c)))

ó U+00f3 LATIN SMALL LETTER O WITH ACUTE
ś U+015b LATIN SMALL LETTER S WITH ACUTE
Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE
ą U+0105 LATIN SMALL LETTER A WITH OGONEK
ł U+0142 LATIN SMALL LETTER L WITH STROKE


### How Python string methods and the re module handle Unicode strings

In [51]:
line
line.find(u'zosta\u0142y')

'Niemców pod koniec II wojny światowej na Dolny Śląsk, zostały\n'

54

In [52]:
line = line.lower()
line

'niemców pod koniec ii wojny światowej na dolny śląsk, zostały\n'

In [53]:
print(line.encode('unicode_escape'))

b'niemc\\xf3w pod koniec ii wojny \\u015bwiatowej na dolny \\u015bl\\u0105sk, zosta\\u0142y\\n'


In [54]:
import re
m = re.search(r'\u015b\w*', line)  # raw string: the re module itself interprets \u015b and \w
m
m.group()

<_sre.SRE_Match object; span=(28, 37), match='światowej'>

'światowej'

In [55]:
# NLTK tokenizers accept Unicode strings as input and return Unicode strings as output

nltk.word_tokenize(line)

['niemców',
 'pod',
 'koniec',
 'ii',
 'wojny',
 'światowej',
 'na',
 'dolny',
 'śląsk',
 ',',
 'zostały']

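### Using your local encoding in Python

If you are used to working with characters in a particular local encoding, you probably want to be able to use your standard methods for inputting and editing strings in a Python file. To do this, include the string '# -*- coding: <coding> -*-' as the first or second line of your file. Note that <coding> has to be a string like 'latin-1', 'big5' or 'utf-8'.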

In [56]:
# -*- coding: GB2312 -*-
import re
sent = '''我爱中华人民共和国，那是我的家乡'''
sent.encode('utf8')
replaced = re.sub('我','他', sent)
print(replaced)

b'\xe6\x88\x91\xe7\x88\xb1\xe4\xb8\xad\xe5\x8d\x8e\xe4\xba\xba\xe6\xb0\x91\xe5\x85\xb1\xe5\x92\x8c\xe5\x9b\xbd\xef\xbc\x8c\xe9\x82\xa3\xe6\x98\xaf\xe6\x88\x91\xe7\x9a\x84\xe5\xae\xb6\xe4\xb9\xa1'

他爱中华人民共和国，那是他的家乡


### Using regular expressions to detect word patterns

In [57]:
import re
word = nltk.corpus.words.words('en')
wordlist = [w for w in word if w.islower()]
wordlist[:10]

['a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aardvark',
 'aardwolf',
 'aba',
 'abac',
 'abaca']

In [58]:
[w for w in wordlist if re.search('ed$', w)][:10]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded']

In [59]:
# ^ matches the start of the string; $ matches the end

[w for w in wordlist if re.search('^..j..t..$', w)][:5]

['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector']

In [60]:
sum(1 for w in wordlist if re.search('^e-?mail$', w))

0

In [61]:
[w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)][:5]

['gold', 'golf', 'hold', 'hole']

In [62]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search('^m+i+n+e+$',w)]

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [63]:
[w for w in chat_words if re.search('^[ha]+$',w)] # words made up entirely of the letters h and a

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

In [64]:
wsj=sorted(set(nltk.corpus.treebank.words()))

In [65]:
# . matches any character; to match a literal dot, escape it with a backslash
[w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$',w)][:10]

['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5']

In [66]:
[w for w in wsj if re.search(r'^[A-Z]+\$$',w)][:15]

['C$', 'US$']

In [67]:
[w for w in wsj if re.search('^[0-9]{4}$',w)][:8]

['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929']

In [68]:
[w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$',w)][:5]

['10-day', '10-lap', '10-year', '100-share', '12-point']

In [69]:
[w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$',w)][:5]

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']

In [70]:
[w for w in wsj if re.search('(ed|ing)$',w)][:9]

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized']
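
The operators used in the cells above can be checked on a small hand-made word list without loading any corpus. The words and the descriptions below are made up for illustration; the patterns are the ones from this section.

```python
import re

words = ['email', 'e-mail', 'mine', 'miiine', 'haha', 'hat',
         '3.14', '314', '10-year', 'walked']

patterns = {
    r'^e-?mail$': 'optional hyphen',
    r'^m+i+n+e+$': 'one or more of each letter',
    r'^[ha]+$': 'only the letters h and a',
    r'^[0-9]+\.[0-9]+$': 'decimal number (escaped dot)',
    r'^[0-9]+-[a-z]{3,5}$': 'number, hyphen, 3 to 5 letters',
    r'(ed|ing)$': 'ends in ed or ing',
}

# For each pattern, collect the words it matches.
matches = {p: [w for w in words if re.search(p, w)] for p in patterns}

for p, desc in patterns.items():
    print(f'{p:24} {desc:32} {matches[p]}')
```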

## 3.5 Useful applications of regular expressions

### Extracting word pieces

In [71]:
word='supercalifragilisticexpialidocious'
re.findall(r'[ae]', word)
re.search('[ae]', word)

['e', 'a', 'a', 'e', 'a']

<_sre.SRE_Match object; span=(3, 4), match='e'>

In [72]:
# Find all sequences of two or more vowels in some text and determine their relative frequency

wsj = sorted(set(nltk.corpus.treebank.words()))
type(wsj)
len(wsj)
print(wsj[:50])

list

12408

['!', '#', '$', '%', '&', "'", "''", "'30s", "'40s", "'50s", "'80s", "'82", "'86", "'S", "'d", "'ll", "'m", "'re", "'s", "'ve", '*', '*-1', '*-10', '*-100', '*-101', '*-102', '*-103', '*-104', '*-105', '*-106', '*-107', '*-108', '*-109', '*-11', '*-110', '*-111', '*-112', '*-113', '*-114', '*-115', '*-116', '*-117', '*-118', '*-119', '*-12', '*-120', '*-121', '*-122', '*-123', '*-124']


In [73]:
fd = nltk.FreqDist(vs for word in wsj 
                  for vs in re.findall(r'[aeiou]{2,}', word))
fd.items()

dict_items([('ea', 476), ('oi', 65), ('ou', 329), ('io', 549), ('ee', 217), ('ie', 331), ('ui', 95), ('ua', 109), ('ai', 261), ('ue', 105), ('ia', 253), ('ei', 86), ('iai', 1), ('oo', 174), ('au', 106), ('eau', 10), ('oa', 59), ('oei', 1), ('oe', 15), ('eo', 39), ('uu', 1), ('eu', 18), ('iu', 14), ('aii', 1), ('aiia', 1), ('ae', 11), ('aa', 3), ('oui', 6), ('ieu', 3), ('ao', 6), ('iou', 27), ('uee', 4), ('eou', 5), ('aia', 1), ('uie', 3), ('iao', 1), ('eei', 2), ('uo', 8), ('uou', 5), ('eea', 1), ('ueui', 1), ('ioa', 1), ('ooi', 1)])

In [74]:
len(fd)

43

In [75]:
[int(n) for n in re.findall(r'[0-9]{2,4}', '2009-12-16')]

[2009, 12, 16]

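### Doing more with word pieces

#### Once we can use re.findall() to extract pieces from words, we can do interesting things with those pieces, such as gluing them back together or plotting them.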

In [76]:
ch = '这碗面很好吃，你还会再做给我吃吗？'
re.findall('吃',ch)

['吃', '吃']

In [85]:
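# The regular expression matches a word-initial vowel sequence (^[AEIOUaeiou]+),
# a word-final vowel sequence, and all consonants; everything else is ignored.
# The three alternatives are tried from left to right: once one of them matches
# at a position, the alternatives after it are skipped.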

word
regexp = r'^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]'

def compress(word):
    pieces = re.findall(regexp, word)
    return ''.join(pieces)

compress(word)

'supercalifragilisticexpialidocious'

'sprclfrglstcxpldcs'

In [89]:
''.join(re.findall(r'^[AEIOUaeiou]+|[AEIOUaeiou]+|[^AEIOUaeiou]', word))

'supercalifragilisticexpialidocious'

In [82]:
english_udhr = nltk.corpus.udhr.words('English-Latin1')
english_udhr

['Universal', 'Declaration', 'of', 'Human', 'Rights', ...]

In [79]:
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frdm of spch and


In [90]:
# Match consonant-vowel sequences
rotokas_words = nltk.corpus.toolbox.words('rotokas.dic')
rotokas_words

['kaa',
 'kaa',
 'kaa',
 'kaakaaro',
 'kaakaaviko',
 'kaakaavo',
 'kaakaoko',
 'kaakasi',
 'kaakau',
 'kaakauko',
 'kaakito',
 'kaakuupato',
 'kaaova',
 'kaapa',
 'kaapea',
 'kaapie',
 'kaapie',
 'kaapiepato',
 'kaapisi',
 'kaapisivira',
 'kaapo',
 'kaapopato',
 'kaara',
 'kaare',
 'kaareko',
 'kaarekopie',
 'kaareto',
 'Kaareva',
 'kaava',
 'kaavaaua',
 'kaaveaka',
 'kaaveakapie',
 'kaaveakapievira',
 'kaaveakavira',
 'kae',
 'kae',
 'kaekae',
 'kaekae',
 'kaekaearo',
 'kaekaeo',
 'kaekaesoto',
 'kaekaevira',
 'kaekeru',
 'kaepaa',
 'kaepie',
 'kaepie',
 'kaepievira',
 'kaereasi',
 'kaereasivira',
 'kaetu',
 'kaetupie',
 'kaetuvira',
 'kaeviro',
 'kagave',
 'kaie',
 'kaiea',
 'kaikaio',
 'Kaio',
 'kaipori',
 'kaiporipie',
 'kaiporivira',
 'kairi',
 'kairiro',
 'kairo',
 'kaita',
 'kaitutu',
 'kaitutupie',
 'kaitutuvira',
 'kakae',
 'kakae',
 'kakae',
 'kakaevira',
 'kakapikoa',
 'kakapikoto',
 'kakapu',
 'kakapua',
 'kakara',
 'Kakarapaia',
 'kakarau',
 'Kakarera',
 'kakata',
 'kakate

In [92]:
cvs = [cv for w in rotokas_words for cv in re.findall(r'[ptksvr][aeiou]',w)]
cfd = nltk.ConditionalFreqDist(cvs)
cfd

ConditionalFreqDist(nltk.probability.FreqDist,
                    {'k': FreqDist({'a': 418,
                               'e': 148,
                               'i': 94,
                               'o': 420,
                               'u': 173}),
                     'p': FreqDist({'a': 83,
                               'e': 31,
                               'i': 105,
                               'o': 34,
                               'u': 51}),
                     'r': FreqDist({'a': 187,
                               'e': 63,
                               'i': 84,
                               'o': 89,
                               'u': 79}),
                     's': FreqDist({'i': 100, 'o': 2, 'u': 1}),
                     't': FreqDist({'a': 47, 'e': 8, 'o': 148, 'u': 37}),
                     'v': FreqDist({'a': 93,
                               'e': 27,
                               'i': 105,
                               'o': 48,
                      

In [93]:
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 


In [94]:
cv_word_pairs = [(cv, w) for w in rotokas_words
                         for cv in re.findall(r'[ptksvr][aeiou]',w)]
cv_index = nltk.Index(cv_word_pairs)
cv_index['su']

['kasuari']

In [95]:
cv_index['so']

['kaekaesoto', 'kekesopa']
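
What `ConditionalFreqDist` and `Index` compute here can be approximated with `collections` from the standard library. This is only a sketch of the idea on four words taken from the outputs above, not a replacement for the NLTK classes; the variable names `counts` and `index` are hypothetical.

```python
import re
from collections import Counter, defaultdict

words = ['kasuari', 'kaekaesoto', 'kekesopa', 'kaapo']

counts = defaultdict(Counter)   # consonant -> Counter of the vowels that follow it
index = defaultdict(list)       # consonant-vowel pair -> words containing it

for w in words:
    for cv in re.findall(r'[ptksvr][aeiou]', w):
        consonant, vowel = cv   # a two-character string unpacks into two parts
        counts[consonant][vowel] += 1
        index[cv].append(w)

print(dict(counts['k']))
print(index['so'])
print(index['su'])
```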

### Finding word stems

Method 1: find stems with regular expressions   
Method 2: use NLTK's built-in stemmers

In [100]:
def stem(word):
    # Strip the first listed suffix that the word ends with
    for suffix in ['ing','ly','ed','ious', 'ies','ive','es','s','ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

In [101]:
re.findall(r'^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$','processing')

['ing']

In [102]:
# The parentheses are only meant to delimit the scope of the alternation; add ?: so that they do not select what findall() returns
re.findall(r'^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$','processing')

['processing']

In [111]:
# Split a word into stem and suffix
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$','processing')
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$','processes')
# The regular expression incorrectly finds the suffix -s instead of -es.
# This shows another subtlety: the * operator is "greedy",
# so the .* part of the expression tries to consume as much of the input as possible.
# With the "non-greedy" version of the operator, written *?, we get what we want:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$','processes')
# A word with no suffix produces no match, so make the second group optional
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$','language')
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$','language')

[('process', 'ing')]

[('processe', 's')]

[('process', 'es')]

[]

[('language', '')]

In [112]:
def stem(word):
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem,suffix = re.findall(regexp, word)[0]
    return stem

In [113]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""

In [116]:
tokens = nltk.word_tokenize(raw)
tokens[:6]

['DENNIS', ':', 'Listen', ',', 'strange', 'women']

In [118]:
[stem(t) for t in tokens][:10]

['DENNIS',
 ':',
 'Listen',
 ',',
 'strange',
 'women',
 'ly',
 'in',
 'pond',
 'distribut']
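
The regular-expression stemmer is easy to test on individual words, which also makes its limits visible. The sketch below repeats the `stem()` definition from above so that it runs on its own; the test words are chosen for illustration.

```python
import re

def stem(word):
    # Non-greedy stem followed by an optional suffix.
    regexp = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem

results = {w: stem(w) for w in ['processes', 'language', 'lying', 'is', 'basis']}
for w, s in results.items():
    print(w, '->', s)
```

Because the rule only looks at the spelling of the ending, short words such as `is` and words such as `basis` lose a final `s` that is not a suffix at all, and `lying` is cut down to `ly`.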

### Searching tokenized text

In [119]:
from nltk.corpus import gutenberg, nps_chat
gutenberg.words('melville-moby_dick.txt')

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', ...]

In [121]:
moby = nltk.Text(gutenberg.words('melville-moby_dick.txt'))
moby

<Text: Moby Dick by Herman Melville 1851>

In [123]:
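# Search for several words at once. Angle brackets mark token boundaries, and any whitespace between the angle brackets is ignored (this behavior is specific to the findall() method of NLTK Text objects)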
moby.findall(r'<a> (<.*>) <man>')

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave


In [126]:
moby.findall(r'<a> <.*> <man>')

a monied man; a nervous man; a dangerous man; a white man; a white
man; a white man; a pious man; a queer man; a good man; a mature man;
a white man; a Cape man; a great man; a wise man; a wise man; a
butterless man; a white man; a fiendish man; a pale man; a furious
man; a better man; a certain man; a complete man; a dismasted man; a
younger man; a brave man; a brave man; a brave man; a brave man


In [127]:
chat = nltk.Text(nps_chat.words())
chat

<Text: now im left with this gay name :P...>

In [128]:
# Find three-word phrases ending with the word bro
chat.findall(r'<.*><.*><bro>')

you rule bro; telling you bro; u twizted bro


In [129]:
# Find sequences of three or more words starting with the letter l
chat.findall(r'<l.*>{3,}')

lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
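
One way to picture how a token-level search like this could work is to write the tokens into a single string with angle brackets around each one, then run an ordinary regular expression over it. The helper `find_token_patterns` below is a hypothetical sketch of that idea using only the standard library. It is not NLTK's implementation, and it assumes that no token contains `<` or `>`.

```python
import re

def find_token_patterns(tokens, pattern):
    """Search a token list with a pattern written in <token> notation."""
    # Encode the tokens as one string, e.g. <a><brave><man>
    encoded = ''.join('<' + t + '>' for t in tokens)
    pattern = re.sub(r'\s', '', pattern)     # ignore whitespace in the pattern
    # Wrap each <...> in a non-capturing group so quantifiers apply to whole tokens.
    pattern = pattern.replace('<', '(?:<').replace('>', '>)')
    pattern = pattern.replace('.', '[^<>]')  # '.' must not cross a token boundary
    hits = []
    for m in re.finditer(pattern, encoded):
        # Return the first capturing group if there is one, else the whole match.
        text = m.group(1) if m.groups() else m.group()
        hits.append(text[1:-1].replace('><', ' '))
    return hits

tokens = 'he was a brave man and a wise man'.split()
inner = find_token_patterns(tokens, r'<a> (<.*>) <man>')
whole = find_token_patterns(tokens, r'<a> <.*> <man>')
repeated = find_token_patterns('oh la la la lol'.split(), r'<l.*>{3,}')
print(inner)
print(whole)
print(repeated)
```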


## 3.6 Normalizing text

In [133]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
is no basis for a system of government.  Supreme executive power derives from
a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = nltk.word_tokenize(raw)

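### Stemmers

NLTK includes several off-the-shelf stemmers. If you need a stemmer, you should use one of these
in preference to crafting your own with regular expressions, since the NLTK stemmers
handle a wide range of irregular cases. The Porter and Lancaster stemmers follow their own rules for stripping
affixes. Observe that the Porter stemmer correctly handles the word lying (mapping it to lie), while the Lancaster stemmer
does not.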

In [134]:
porter = nltk.PorterStemmer()
porter

<PorterStemmer>

In [135]:
lancaster = nltk.LancasterStemmer()
lancaster

<LancasterStemmer>

In [140]:
print(tokens)

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'lying', 'in', 'ponds', 'distributing', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


In [138]:
print([porter.stem(t) for t in tokens])

['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']


In [139]:
print([lancaster.stem(t) for t in tokens])

['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']
