Original text of this chapter: https://usyiyi.github.io/nlp-py-2e-zh/3.html

# 3.1 Processing Raw Text

NLTK's corpus collection includes a small sample of texts from Project Gutenberg.

In [2]:
import nltk, re, pprint
from nltk import word_tokenize

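## 1. Accessing Text from the Web and from Disk

Downloading text from the web

In [2]:
# Text number 2554 is an English translation of Crime and Punishment; it can be accessed as follows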

from urllib import request
url = "https://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode("utf8")
print(type(raw),"\n", len(raw),"\n")
raw[:80]

<class 'str'> 
 1176812 



'\ufeffThe Project Gutenberg eBook of Crime and Punishment, by Fyodor Dostoevsky\r\n\r\nTh'

In [3]:
tokens = word_tokenize(raw)
print (type(tokens),"\n",len(tokens))
tokens[:15]

<class 'list'> 
 257059


['\ufeffThe',
 'Project',
 'Gutenberg',
 'eBook',
 'of',
 'Crime',
 'and',
 'Punishment',
 ',',
 'by',
 'Fyodor',
 'Dostoevsky',
 'This',
 'eBook',
 'is']

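The collocations() method finds the collocations in a text, that is, word pairs or phrases that frequently occur together.

Collocations are usually identified by statistical significance: the words occur together more often than would be expected by chance.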

In [9]:
text = nltk.Text(tokens)
print(type(text),"\n")
print(text[1024:1062],"\n")
text.collocations()                # frequent collocations

<class 'nltk.text.Text'> 

['insight', 'impresses', 'us', 'as', 'wisdom', '...', 'that', 'wisdom', 'of', 'the', 'heart', 'which', 'we', 'seek', 'that', 'we', 'may', 'learn', 'from', 'it', 'how', 'to', 'live', '.', 'All', 'his', 'other', 'gifts', 'came', 'to', 'him', 'from', 'nature', ',', 'this', 'he', 'won', 'for'] 

Katerina Ivanovna; Pyotr Petrovitch; Pulcheria Alexandrovna; Avdotya
Romanovna; Rodion Romanovitch; Marfa Petrovna; Sofya Semyonovna; old
woman; Project Gutenberg-tm; Porfiry Petrovitch; Amalia Ivanovna;
great deal; young man; Nikodim Fomitch; Project Gutenberg; Ilya
Petrovitch; Andrey Semyonovitch; Hay Market; Dmitri Prokofitch; Good
heavens
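
To make "more often than chance" concrete, here is a minimal sketch that ranks adjacent word pairs by pointwise mutual information (PMI). The function name `top_collocations`, the `min_count` threshold and the sample sentence are illustrative assumptions; PMI is a stand-in measure, and NLTK's own collocations() may score and filter pairs differently.

```python
import math
from collections import Counter

def top_collocations(tokens, min_count=2, top_n=5):
    """Rank adjacent word pairs by PMI (an illustrative scoring choice)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (w1, w2), n in bigrams.items():
        if n < min_count:          # ignore pairs seen too rarely to trust
            continue
        # PMI compares the observed pair frequency with what independence predicts
        pmi = math.log2(n * total / (unigrams[w1] * unigrams[w2]))
        scored.append((pmi, w1, w2))
    scored.sort(reverse=True)
    return [(w1, w2) for _, w1, w2 in scored[:top_n]]

sample = "the old woman saw the old man and the young man saw the old woman".split()
result = top_collocations(sample)
print(result)
```

On real text a higher `min_count` is usually needed, because PMI tends to overrate pairs that occur only a handful of times.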


Inspect the file by hand to find unique strings that mark where the content begins and ends.

In [19]:
print(raw.find("PART I"),"\n")
print(raw.rfind("of Project Gutenberg")) # Note: the ’ here is the curly quote typed in Chinese-punctuation input mode; rfind() searches backwards, i.e. it is a reverse find

5575 

1172306


In [20]:
raw1 = raw[5336:1172306]
raw1.find("PART I")

239

In [21]:
raw1[:50]

's as wisdom... that wisdom of the heart\r\nwhich we '
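
The find()/rfind() trimming step can be wrapped in a small helper. This is a sketch under assumptions: the name `trim_between` and the sample string are made up for illustration, and real Project Gutenberg files need markers chosen by inspecting the file.

```python
def trim_between(text, start_marker, end_marker):
    """Return text from the first start_marker up to (not including) the last end_marker."""
    start = text.find(start_marker)    # index of the first occurrence, or -1
    end = text.rfind(end_marker)       # index of the last occurrence, or -1
    if start == -1 or end == -1 or end < start:
        raise ValueError("markers not found in the expected order")
    return text[start:end]

sample = "HEADER\nPART I\nOnce upon a time.\nEnd of Project Gutenberg\nLICENSE"
body = trim_between(sample, "PART I", "End of Project Gutenberg")
print(repr(body))
```

Raising an error when a marker is missing avoids silently slicing with -1, which would otherwise drop or keep the wrong part of the text.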

## 2. Processing HTML

In [None]:
import urllib.request

proxies = {'http': 'proxy.lxxx:3128'}
url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
opener = urllib.request.build_opener()
opener.add_handler(urllib.request.ProxyHandler(proxies))
response = opener.open(url)
# read and decode the content
html = response.read().decode("utf8")
# html = request.urlopen(url,proxies=proxies).read().decode("utf8")
html[:60]

'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

In [30]:
from bs4 import BeautifulSoup
raw = BeautifulSoup(html).get_text()
tokens = word_tokenize(raw)
tokens

['BBC',
 'NEWS',
 '|',
 'Health',
 '|',
 'Blondes',
 "'to",
 'die',
 'out',
 'in',
 '200',
 'years',
 "'",
 'NEWS',
 'SPORT',
 'WEATHER',
 'WORLD',
 'SERVICE',
 'A-Z',
 'INDEX',
 'SEARCH',
 'You',
 'are',
 'in',
 ':',
 'Health',
 'News',
 'Front',
 'Page',
 'Africa',
 'Americas',
 'Asia-Pacific',
 'Europe',
 'Middle',
 'East',
 'South',
 'Asia',
 'UK',
 'Business',
 'Entertainment',
 'Science/Nature',
 'Technology',
 'Health',
 'Medical',
 'notes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Talking',
 'Point',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Country',
 'Profiles',
 'In',
 'Depth',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Programmes',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'SERVICES',
 'Daily',
 'E-mail',
 'News',
 'Ticker',
 'Mobile/PDAs',
 '--',
 '--',
 '--',
 '--',
 '--',
 '--',
 '-',
 'Text',
 'Only',
 'Feedback',
 'Help',
 'EDITIONS',
 'Change',
 'to',
 'UK',
 'Friday',
 ',',
 '27',
 'September',
 ',',
 '2002',
 ',',
 '11:51',
 'GMT',
 '1

In [31]:
tokens = tokens[110:390]
text = nltk.Text(tokens)
text.concordance('gene')

Displaying 5 of 5 matches:
hey say too few people now carry the gene for blondes to last beyond the next 
blonde hair is caused by a recessive gene . In order for a child to have blond
 have blonde hair , it must have the gene on both sides of the family in the g
ere is a disadvantage of having that gene or by chance . They do n't disappear
des would disappear is if having the gene was a disadvantage and I do not thin
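
For readers without BeautifulSoup installed, the idea behind get_text() can be sketched with the standard library's html.parser. The class name `TextExtractor`, the helper `html_to_text` and the sample HTML are assumptions for illustration; this is not how BeautifulSoup works internally, and it handles far fewer edge cases.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

sample_html = ("<html><head><title>Demo</title><style>p {color: red}</style></head>"
               "<body><p>Blondes <b>to die out</b> in 200 years</p></body></html>")
print(html_to_text(sample_html))
```

The resulting string can then be passed to a tokenizer in the same way as the output of get_text().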


## 3. Processing RSS Feeds

In [33]:
%%cmd

pip install feedparser

Microsoft Windows [Version 10.0.19045.5198]
(c) Microsoft Corporation. All rights reserved.

(base) d:\Documents\Python_nlp_notes>
(base) d:\Documents\Python_nlp_notes>pip install feedparser
Defaulting to user installation because normal site-packages is not writeable
Collecting feedparser
  Downloading feedparser-6.0.11-py3-none-any.whl.metadata (2.4 kB)
Collecting sgmllib3k (from feedparser)
  Downloading sgmllib3k-1.0.0.tar.gz (5.8 kB)
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'done'
Downloading feedparser-6.0.11-py3-none-any.whl (81 kB)
   ---------------------------------------- 81.3/81.3 kB 507.9 kB/s eta 0:00:00
Building wheels for collected packages: sgmllib3k
  Building wheel for sgmllib3k (setup.py): started
  Building wheel for sgmllib3k (setup.py): finished with status 'done'
  Created wheel for sgmllib3k: filename=sgmllib3k-1.0.0-py3-none-any.whl size=6060 sha256=b74d64df3f7a87cf611972cd3ce8a821abd59aa59b182a2a4def926ae650c97a
  Sto

In [59]:
import feedparser
llog = feedparser.parse("http://feed.cnblogs.com/blog/sitehome/rss")
llog

{'bozo': False,
 'entries': [{'id': 'https://www.cnblogs.com/Sol9/p/18586964',
   'guidislink': True,
   'link': 'https://www.cnblogs.com/Sol9/p/18586964',
   'title': 'Prime1_解法一：cms渗透 & 内核漏洞提权 - Sol_9',
   'title_detail': {'type': 'text/plain',
    'language': None,
    'base': 'https://feed.cnblogs.com/blog/sitehome/rss',
    'value': 'Prime1_解法一：cms渗透 & 内核漏洞提权 - Sol_9'},
   'summary': 'Prime1_解法一：cms渗透 &amp; 内核漏洞提权 目录Prime1_解法一：cms渗透 &amp; 内核漏洞提权信息收集主机发现nmap扫描tcp扫描tcp详细扫描22，80端口udp扫描漏洞脚本扫描目录爆破dirsearchWeb渗透wfuzz常见的 wfuzz 过滤器：获得wordpress后台权限w',
   'summary_detail': {'type': 'text/plain',
    'language': None,
    'base': 'https://feed.cnblogs.com/blog/sitehome/rss',
    'value': 'Prime1_解法一：cms渗透 &amp; 内核漏洞提权 目录Prime1_解法一：cms渗透 &amp; 内核漏洞提权信息收集主机发现nmap扫描tcp扫描tcp详细扫描22，80端口udp扫描漏洞脚本扫描目录爆破dirsearchWeb渗透wfuzz常见的 wfuzz 过滤器：获得wordpress后台权限w'},
   'published': '2024-12-04T10:52:00Z',
   'published_parsed': time.struct_time(tm_year=2024, tm_mon=12, tm_mday=4, tm_hour=10, tm_min=52, tm_sec=

In [60]:
llog["feed"]["title"]

'博客园_首页'

In [61]:
len(llog.entries)

20

In [62]:
post = llog.entries[1]
post.title

'写简历应该怎么准备项目 - 程序员回家养猪'

In [64]:
content = post.content[0].value
content

'【摘要】找实习应该怎么准备项目?\n造轮子应该怎么造?\n面试应该怎么聊?\n一篇文章为大家排忧解难, 帮大家写好简历, 做好项目, 提升就业竞争力 <a href="https://www.cnblogs.com/huijiayangzhu/p/18586889" target="_blank">阅读全文</a>'

In [65]:
raw = BeautifulSoup(content).get_text()
word_tokenize(raw)

['【摘要】找实习应该怎么准备项目',
 '?',
 '造轮子应该怎么造',
 '?',
 '面试应该怎么聊',
 '?',
 '一篇文章为大家排忧解难',
 ',',
 '帮大家写好简历',
 ',',
 '做好项目',
 ',',
 '提升就业竞争力',
 '阅读全文']
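
To show what a feed parser extracts, here is a sketch that reads a tiny RSS 2.0 document with the standard library's xml.etree.ElementTree. The feed content in `SAMPLE_RSS` is invented for illustration, and real feeds (including Atom feeds) have more fields and namespaces than this handles, which is why feedparser is the more practical choice.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document; real feeds carry many more elements
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Demo Feed</title>
    <item><title>First post</title><link>http://example.com/1</link></item>
    <item><title>Second post</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_RSS)
channel = root.find("channel")
feed_title = channel.findtext("title")                  # the feed-level title
entries = [(item.findtext("title"), item.findtext("link"))
           for item in channel.findall("item")]         # one tuple per entry
print(feed_title, len(entries), entries[1][0])
```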

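## 4. Reading Local Files

In [66]:
f = open("3.document.txt",'r') # 'r' opens the file read-only (the default); 'U' stood for "universal" newlines, which let us ignore differing newline conventions (that mode was removed in Python 3.11, where universal newlines are already the default in text mode).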
raw = f.read()
raw

' 沁园春·雪\n作者：毛泽东\n北国风光，千里冰封，万里雪飘。\n望长城内外，惟余莽莽；大河上下，顿失滔滔。\n山舞银蛇，原驰蜡象，欲与天公试比高。\n须晴日，看红装素裹，分外妖娆。\n江山如此多娇，引无数英雄竞折腰。\n惜秦皇汉武，略输文采；唐宗宋祖，稍逊风骚。 '

In [67]:
import os
os.listdir(".")

['.git',
 '3.document.txt',
 '3.output.txt',
 '4.test.html',
 '5.t2.pkl',
 'picture',
 'Python自然语言处理.pdf',
 'Python自然语言处理实战.pdf',
 'README.md',
 '【Python自然语言处理】读书笔记：第七章：从文本提取信息.ipynb',
 '【Python自然语言处理】读书笔记：第五章：分类和标注词汇.ipynb',
 '【Python自然语言处理】读书笔记：第六章：学习分类文本.ipynb',
 '【Python自然语言处理】读书笔记：第四章：编写结构化程序.ipynb',
 '【Python自然语言处理】读书笔记：第四章：编写结构化程序.md',
 '第一章：语言处理与Python.ipynb',
 '第三章：处理原始文本.ipynb',
 '第二章：获得文本语料和词汇资源.ipynb']

In [68]:
f = open("3.document.txt","r")
for line in f:
    print(line.strip()) # strip() removes the newline at the end of each input line (along with any other leading or trailing whitespace).

沁园春·雪
作者：毛泽东
北国风光，千里冰封，万里雪飘。
望长城内外，惟余莽莽；大河上下，顿失滔滔。
山舞银蛇，原驰蜡象，欲与天公试比高。
须晴日，看红装素裹，分外妖娆。
江山如此多娇，引无数英雄竞折腰。
惜秦皇汉武，略输文采；唐宗宋祖，稍逊风骚。
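
The cells above leave the file object open. A common alternative, sketched here with a temporary file so that it runs anywhere, is to use a with block and to name the encoding explicitly. The file name `demo.txt` and its contents are placeholders, and utf-8 is an assumption about how the file was saved.

```python
import os
import tempfile

lines_to_write = ["first line\n", "second line\n"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.txt")
    # Write a small file so the example does not depend on local data
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(lines_to_write)
    # The with block closes the file even if an error occurs while reading
    with open(path, encoding="utf-8") as f:
        stripped = [line.strip() for line in f]

print(stripped)
```

Naming the encoding matters for text such as the poem above, because the platform default encoding is not always UTF-8.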


In [None]:
path = nltk.data.find(r'C:\Users\XXXX\AppData\Roaming\nltk_data\tokenizers\punkt_tab\dutch\abbrev_types.txt')
raw = open(path, 'r').read()
print(raw)

m.j
t
ph
j.h
p.a.m
j.m
dr
st
j.b.m
p
nr
h.s
e.d
t.e
a.v
esb
s.z
drs
b.b
m.o
inc
n
pensioenfonds
s.v.p
bod
fr
pk
r.p
c.p.j
v.l.n.r
chr
m.v.d
int
o.m
j.v.d
u.o.m
f.c
k
bijgebracht
ontwaakte
m
j.w
a.l
a.v.d
s.v
s
j.d
binnengekomen
ds
schouwburg
b.v
h
a
j.a
aanvielen
h.g
p.f
j.l
mgr
c.j
blz
l.e.h
w.k
g
m.g
r.v.d
ing
v.d
c.q
l
h.p
mr
gesch
e.l
p.j
mm
j.g
j.f
c
f.m
jl
r
o.a
a.s
ir
v
j
jr
e
m.i.v
l.a
f.v.d
aansluit
c.c
a.m
f.o.j
m.b
y
th


# 3.2 The NLP Pipeline

## 1. Strings: Text Processing at the Lowest Level

In [75]:
a = [1, 2, 3, 4, 5, 6, 7, 6, 5, 4, 3, 2, 1]
b = [' ' * 2 * (7 - i) + 'very' * i for i in a]
for line in b:
    print(line)

            very
          veryvery
        veryveryvery
      veryveryveryvery
    veryveryveryveryvery
  veryveryveryveryveryvery
veryveryveryveryveryveryvery
  veryveryveryveryveryvery
    veryveryveryveryvery
      veryveryveryvery
        veryveryvery
          veryvery
            very


In [93]:
help(str)

Help on class str in module builtins:

class str(object)
 |  str(object='') -> str
 |  str(bytes_or_buffer[, encoding[, errors]]) -> str
 |  
 |  Create a new string object from the given object. If encoding or
 |  errors is specified, then the object must expose a data buffer
 |  that will be decoded using the given encoding and error handler.
 |  Otherwise, returns the result of object.__str__() (if defined)
 |  or repr(object).
 |  encoding defaults to sys.getdefaultencoding().
 |  errors defaults to 'strict'.
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __format__(...)
 |      S.__format__(format_spec) -> str
 |      
 |      Return a formatted version of S as described by format_spec.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getatt

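The elements of a **list** can be as large or as small as we like: for example, they might be paragraphs, sentences, phrases, words, or characters.

So one of the first things we are likely to do in a piece of NLP code is tokenize a string into a **list of strings**.

Conversely, when we want to write our results to a file or to a terminal, we usually format them as a **string**.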
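
The round trip between strings and lists can be sketched with split(), join() and format(). The sample sentence is arbitrary, and whitespace splitting stands in for a real tokenizer.

```python
sentence = "Monty Python and the Holy Grail"

words = sentence.split()            # string -> list of strings
rejoined = " ".join(words)          # list of strings -> string
report = "{} words, longest: {}".format(len(words), max(words, key=len))

print(words)
print(report)
```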

## 2. Using Regular Expressions to Detect Word Patterns

In [76]:
import re
wordlist = [w for w in nltk.corpus.words.words("en") if w.islower()]
wordlist

['a',
 'aa',
 'aal',
 'aalii',
 'aam',
 'aardvark',
 'aardwolf',
 'aba',
 'abac',
 'abaca',
 'abacate',
 'abacay',
 'abacinate',
 'abacination',
 'abaciscus',
 'abacist',
 'aback',
 'abactinal',
 'abactinally',
 'abaction',
 'abactor',
 'abaculus',
 'abacus',
 'abaff',
 'abaft',
 'abaisance',
 'abaiser',
 'abaissed',
 'abalienate',
 'abalienation',
 'abalone',
 'abampere',
 'abandon',
 'abandonable',
 'abandoned',
 'abandonedly',
 'abandonee',
 'abandoner',
 'abandonment',
 'abaptiston',
 'abarthrosis',
 'abarticular',
 'abarticulation',
 'abas',
 'abase',
 'abased',
 'abasedly',
 'abasedness',
 'abasement',
 'abaser',
 'abash',
 'abashed',
 'abashedly',
 'abashedness',
 'abashless',
 'abashlessly',
 'abashment',
 'abasia',
 'abasic',
 'abask',
 'abastardize',
 'abatable',
 'abate',
 'abatement',
 'abater',
 'abatis',
 'abatised',
 'abaton',
 'abator',
 'abattoir',
 'abature',
 'abave',
 'abaxial',
 'abaxile',
 'abaze',
 'abb',
 'abbacomes',
 'abbacy',
 'abbas',
 'abbasi',
 'abbassi',


Using basic metacharacters

In [77]:
[w for w in wordlist if re.search("ed$", w)]

['abaissed',
 'abandoned',
 'abased',
 'abashed',
 'abatised',
 'abed',
 'aborted',
 'abridged',
 'abscessed',
 'absconded',
 'absorbed',
 'abstracted',
 'abstricted',
 'accelerated',
 'accepted',
 'accidented',
 'accoladed',
 'accolated',
 'accomplished',
 'accosted',
 'accredited',
 'accursed',
 'accused',
 'accustomed',
 'acetated',
 'acheweed',
 'aciculated',
 'aciliated',
 'acknowledged',
 'acorned',
 'acquainted',
 'acquired',
 'acquisited',
 'acred',
 'aculeated',
 'addebted',
 'added',
 'addicted',
 'addlebrained',
 'addleheaded',
 'addlepated',
 'addorsed',
 'adempted',
 'adfected',
 'adjoined',
 'admired',
 'admitted',
 'adnexed',
 'adopted',
 'adossed',
 'adreamed',
 'adscripted',
 'aduncated',
 'advanced',
 'advised',
 'aeried',
 'aethered',
 'afeared',
 'affected',
 'affectioned',
 'affined',
 'afflicted',
 'affricated',
 'affrighted',
 'affronted',
 'aforenamed',
 'afterfeed',
 'aftershafted',
 'afterthoughted',
 'afterwitted',
 'agazed',
 'aged',
 'agglomerated',
 'aggri

In [78]:
[w for w in wordlist if re.search("^..j..t..$", w)] # match 8-letter words whose third letter is j and sixth letter is t

['abjectly',
 'adjuster',
 'dejected',
 'dejectly',
 'injector',
 'majestic',
 'objectee',
 'objector',
 'rejecter',
 'rejector',
 'unjilted',
 'unjolted',
 'unjustly']

In [79]:
[w for w in wordlist if re.search("..j..t..", w)] # without ^ (start of string) and $ (end of string), many words longer than 8 characters also match

['abjectedness',
 'abjection',
 'abjective',
 'abjectly',
 'abjectness',
 'adjection',
 'adjectional',
 'adjectival',
 'adjectivally',
 'adjective',
 'adjectively',
 'adjectivism',
 'adjectivitis',
 'adjustable',
 'adjustably',
 'adjustage',
 'adjustation',
 'adjuster',
 'adjustive',
 'adjustment',
 'antejentacular',
 'antiprojectivity',
 'bijouterie',
 'coadjustment',
 'cojusticiar',
 'conjective',
 'conjecturable',
 'conjecturably',
 'conjectural',
 'conjecturalist',
 'conjecturality',
 'conjecturally',
 'conjecture',
 'conjecturer',
 'coprojector',
 'counterobjection',
 'dejected',
 'dejectedly',
 'dejectedness',
 'dejectile',
 'dejection',
 'dejectly',
 'dejectory',
 'dejecture',
 'disjection',
 'guanajuatite',
 'inadjustability',
 'inadjustable',
 'injectable',
 'injection',
 'injector',
 'injustice',
 'insubjection',
 'interjection',
 'interjectional',
 'interjectionalize',
 'interjectionally',
 'interjectionary',
 'interjectionize',
 'interjectiveness',
 'interjector',
 'interje

In [80]:
sum(1 for w in wordlist if re.search("^e-?mail$", w)) # ? matches the preceding character zero or one time; this line counts how many words are email or e-mail

0
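
The anchors and the ? operator can be checked on a small hand-made list that does not require any corpus. The words in `sample_words` are chosen only for illustration.

```python
import re

sample_words = ["abjectly", "abjectedness", "email", "e-mail", "emails", "jumped", "edit"]

ends_in_ed = [w for w in sample_words if re.search(r"ed$", w)]         # $ anchors the end
anchored = [w for w in sample_words if re.search(r"^..j..t..$", w)]    # exactly 8 characters
unanchored = [w for w in sample_words if re.search(r"..j..t..", w)]    # may match inside longer words
emails = [w for w in sample_words if re.search(r"^e-?mail$", w)]       # optional hyphen

print(ends_in_ed, anchored, unanchored, emails)
```

The count of 0 above simply means that this particular word list contains neither email nor e-mail.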

Ranges and closures

In [30]:
# Typed on a phone keypad (T9) as the key sequence 4653. Which other words are produced by the same sequence?
[w for w in wordlist if re.search("^[ghi][mno][jkl][def]$", w)]

['gold', 'golf', 'hold', 'hole']

In [31]:
# match words that use only keys 4, 5, 6 in the middle row of the keypad
[w for w in wordlist if re.search("^[g-o]+$", w)]   # - denotes a range; + matches one or more times

['g',
 'ghoom',
 'gig',
 'giggling',
 'gigolo',
 'gilim',
 'gill',
 'gilling',
 'gilo',
 'gim',
 'gin',
 'ging',
 'gingili',
 'gink',
 'ginkgo',
 'ginning',
 'gio',
 'glink',
 'glom',
 'glonoin',
 'gloom',
 'glooming',
 'gnomon',
 'go',
 'gog',
 'gogo',
 'goi',
 'going',
 'gol',
 'goli',
 'gon',
 'gong',
 'gonion',
 'goo',
 'googol',
 'gook',
 'gool',
 'goon',
 'h',
 'hi',
 'high',
 'hill',
 'him',
 'hin',
 'hing',
 'hinoki',
 'ho',
 'hog',
 'hoggin',
 'hogling',
 'hoi',
 'hoin',
 'holing',
 'holl',
 'hollin',
 'hollo',
 'hollong',
 'holm',
 'homo',
 'homologon',
 'hong',
 'honk',
 'hook',
 'hoon',
 'i',
 'igloo',
 'ihi',
 'ilk',
 'ill',
 'imi',
 'imino',
 'immi',
 'in',
 'ing',
 'ingoing',
 'inion',
 'ink',
 'inkling',
 'inlook',
 'inn',
 'inning',
 'io',
 'ion',
 'j',
 'jhool',
 'jig',
 'jing',
 'jingling',
 'jingo',
 'jinjili',
 'jink',
 'jinn',
 'jinni',
 'jo',
 'jog',
 'johnin',
 'join',
 'joining',
 'joll',
 'joom',
 'k',
 'kiki',
 'kil',
 'kilhig',
 'kilim',
 'kill',
 'killing',

In [32]:
chat_words = sorted(set(w for w in nltk.corpus.nps_chat.words()))
[w for w in chat_words if re.search("^m+i+n+e+$", w)]  # + matches one or more times

['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'mine',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

In [33]:
[w for w in chat_words if re.search("^m*i*n*e*$", w)]   # * matches zero or more times

['',
 'e',
 'i',
 'in',
 'm',
 'me',
 'meeeeeeeeeeeee',
 'mi',
 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee',
 'miiiiiinnnnnnnnnneeeeeeee',
 'min',
 'mine',
 'mm',
 'mmm',
 'mmmm',
 'mmmmm',
 'mmmmmm',
 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee',
 'mmmmmmmmmm',
 'mmmmmmmmmmmmm',
 'mmmmmmmmmmmmmm',
 'n',
 'ne']

In [34]:
[w for w in chat_words if re.search("^[ha]+$", w)]  # [ ] matches any one character from the set, so the characters can appear in any order

['a',
 'aaaaaaaaaaaaaaaaa',
 'aaahhhh',
 'ah',
 'ahah',
 'ahahah',
 'ahh',
 'ahhahahaha',
 'ahhh',
 'ahhhh',
 'ahhhhhh',
 'ahhhhhhhhhhhhhh',
 'h',
 'ha',
 'haaa',
 'hah',
 'haha',
 'hahaaa',
 'hahah',
 'hahaha',
 'hahahaa',
 'hahahah',
 'hahahaha',
 'hahahahaaa',
 'hahahahahaha',
 'hahahahahahaha',
 'hahahahahahahahahahahahahahahaha',
 'hahahhahah',
 'hahhahahaha']

In [35]:
wsj = sorted(set(nltk.corpus.treebank.words()))
[w for w in wsj if re.search(r"^[0-9]+\.[0-9]+$", w)] # \. removes the dot's special meaning so that it matches a literal . (a raw string keeps the backslash intact)

['0.0085',
 '0.05',
 '0.1',
 '0.16',
 '0.2',
 '0.25',
 '0.28',
 '0.3',
 '0.4',
 '0.5',
 '0.50',
 '0.54',
 '0.56',
 '0.60',
 '0.7',
 '0.82',
 '0.84',
 '0.9',
 '0.95',
 '0.99',
 '1.01',
 '1.1',
 '1.125',
 '1.14',
 '1.1650',
 '1.17',
 '1.18',
 '1.19',
 '1.2',
 '1.20',
 '1.24',
 '1.25',
 '1.26',
 '1.28',
 '1.35',
 '1.39',
 '1.4',
 '1.457',
 '1.46',
 '1.49',
 '1.5',
 '1.50',
 '1.55',
 '1.56',
 '1.5755',
 '1.5805',
 '1.6',
 '1.61',
 '1.637',
 '1.64',
 '1.65',
 '1.7',
 '1.75',
 '1.76',
 '1.8',
 '1.82',
 '1.8415',
 '1.85',
 '1.8500',
 '1.9',
 '1.916',
 '1.92',
 '10.19',
 '10.2',
 '10.5',
 '107.03',
 '107.9',
 '109.73',
 '11.10',
 '11.5',
 '11.57',
 '11.6',
 '11.72',
 '11.95',
 '112.9',
 '113.2',
 '116.3',
 '116.4',
 '116.7',
 '116.9',
 '118.6',
 '12.09',
 '12.5',
 '12.52',
 '12.68',
 '12.7',
 '12.82',
 '12.97',
 '120.7',
 '1206.26',
 '121.6',
 '126.1',
 '126.15',
 '127.03',
 '129.91',
 '13.1',
 '13.15',
 '13.5',
 '13.50',
 '13.625',
 '13.65',
 '13.73',
 '13.8',
 '13.90',
 '130.6',
 '130.7',
 '

In [113]:
[w for w in wsj if re.search(r"^[A-Z]+\$$", w)] # \$ matches a literal dollar sign; the final $ anchors the end of the string

['C$', 'US$']

In [36]:
[w for w in wsj if re.search("^[0-9]{4}$", w)] # {4} matches the preceding character or set exactly four times

['1614',
 '1637',
 '1787',
 '1901',
 '1903',
 '1917',
 '1925',
 '1929',
 '1933',
 '1934',
 '1948',
 '1953',
 '1955',
 '1956',
 '1961',
 '1965',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1971',
 '1972',
 '1973',
 '1975',
 '1976',
 '1977',
 '1979',
 '1980',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '2000',
 '2005',
 '2009',
 '2017',
 '2019',
 '2029',
 '3057',
 '8300']

In [37]:
[w for w in wsj if re.search("^[0-9]+-[a-z]{3,5}$", w)] # the - in the middle is a literal hyphen; {3,5} matches the preceding character or set between 3 and 5 times

['10-day',
 '10-lap',
 '10-year',
 '100-share',
 '12-point',
 '12-year',
 '14-hour',
 '15-day',
 '150-point',
 '190-point',
 '20-point',
 '20-stock',
 '21-month',
 '237-seat',
 '240-page',
 '27-year',
 '30-day',
 '30-point',
 '30-share',
 '30-year',
 '300-day',
 '36-day',
 '36-store',
 '42-year',
 '50-state',
 '500-stock',
 '52-week',
 '69-point',
 '84-month',
 '87-store',
 '90-day']

In [38]:
[w for w in wsj if re.search("^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$", w)]  # {5,} matches the preceding character or set at least 5 times; {,6} matches it at most 6 times

['black-and-white',
 'bread-and-butter',
 'father-in-law',
 'machine-gun-toting',
 'savings-and-loan']
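
The counted quantifiers can be verified on a few hand-picked strings without loading the treebank corpus. The list `samples` is made up for illustration.

```python
import re

samples = ["1999", "19990", "10-day", "10-year", "10-minutes",
           "black-and-white", "a-b-c", "0.05", "US$"]

four_digits = [w for w in samples if re.search(r"^[0-9]{4}$", w)]            # exactly 4 digits
num_word = [w for w in samples if re.search(r"^[0-9]+-[a-z]{3,5}$", w)]      # 3 to 5 letters after the hyphen
three_part = [w for w in samples
              if re.search(r"^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$", w)]          # 5+, then 2-3, then up to 6 letters
decimals = [w for w in samples if re.search(r"^[0-9]+\.[0-9]+$", w)]         # literal dot between digits

print(four_digits, num_word, three_part, decimals)
```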

In [39]:
[w for w in wsj if re.search("(ed|ing)$", w)] # (ed|ing)$ matches words ending in ed or ing

['62%-owned',
 'Absorbed',
 'According',
 'Adopting',
 'Advanced',
 'Advancing',
 'Alfred',
 'Allied',
 'Annualized',
 'Anything',
 'Arbitrage-related',
 'Arbitraging',
 'Asked',
 'Assuming',
 'Atlanta-based',
 'Baking',
 'Banking',
 'Beginning',
 'Beijing',
 'Being',
 'Bermuda-based',
 'Betting',
 'Boeing',
 'Broadcasting',
 'Bucking',
 'Buying',
 'Calif.-based',
 'Change-ringing',
 'Citing',
 'Concerned',
 'Confronted',
 'Conn.based',
 'Consolidated',
 'Continued',
 'Continuing',
 'Declining',
 'Defending',
 'Depending',
 'Designated',
 'Determining',
 'Developed',
 'Died',
 'During',
 'Encouraged',
 'Encouraging',
 'English-speaking',
 'Estimated',
 'Everything',
 'Excluding',
 'Exxon-owned',
 'Faulding',
 'Fed',
 'Feeding',
 'Filling',
 'Filmed',
 'Financing',
 'Following',
 'Founded',
 'Fracturing',
 'Francisco-based',
 'Fred',
 'Funded',
 'Funding',
 'Generalized',
 'Germany-based',
 'Getting',
 'Guaranteed',
 'Having',
 'Heating',
 'Heightened',
 'Holding',
 'Housing',
 'Illumin

In [124]:
for i in [w for w in wsj if re.search("ed|ing$", w)]:                # without ( ), the pattern means "ed anywhere, or ing at the end"
    if i not in [w for w in wsj if re.search("(ed|ing)$", w)]:
        print (i)

Biedermann
Breeden
Cathedral
Cedric
Confederation
Credit
Federal
Federalist
Federation
Freddie
Frederick
Friedrichs
Impediments
Intermediate
Kennedy
Media
Medical
Medicine
Mercedes
Montedison
Nederlanden
Needham
Proceeds
Reddington
Redevelopment
Roederer
Speedway
Sweden
Teddy
Toledo
Wednesday
Wedtech
acknowledge
acknowledges
agreed-upon
allegedly
beds
buttoned-down
closed-end
comedies
concede
concedes
credentials
credibility
credit
creditor
creditors
credits
creditworthiness
deeds
discredit
edition
editions
editor
editorial
editorially
editors
education
educational
educators
exceedingly
exceeds
federal
federally
feeds
fixed-income
fixed-price
fixed-rate
freedom
freedoms
greedy
hundreds
immediate
immediately
impede
incredible
ingredients
intermediate
knowledge
knowledgeable
limited-partnership
medallions
media
medical
medicine
mediocre
needle-like
needs
needy
obedient
pediatrician
pianist-comedian
precedent
precedes
predecessor
predict
predictable
predictably
predicts
predispose
procedu
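
The effect of the missing parentheses comes from operator precedence: | binds more loosely than $, so "ed|ing$" is read as "ed" or "ing$". A small sketch on made-up words shows the difference.

```python
import re

samples = ["Federal", "needs", "walked", "walking", "credit"]

loose = [w for w in samples if re.search(r"ed|ing$", w)]      # "ed" anywhere, or "ing" at the end
strict = [w for w in samples if re.search(r"(ed|ing)$", w)]   # "ed" or "ing", both at the end
only_loose = [w for w in loose if w not in set(strict)]       # words caught only by the loose pattern

print(only_loose)
```

Computing the strict list once and testing membership against a set also avoids rebuilding the list on every loop iteration.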

In [126]:
[w for w in wsj if re.search("w(i|e|ai|oo)t", w)] # match words containing wit, wet, wait, or woot

['Hymowitz',
 'Switzerland',
 'awaits',
 'bellwether',
 'notwithstanding',
 'switch',
 'switched',
 'wait',
 'waited',
 'waiting',
 'wherewithal',
 'witches',
 'with',
 'withdraw',
 'withdrawal',
 'withdrawn',
 'withdrew',
 'withhold',
 'within',
 'without',
 'withstand',
 'witness',
 'witnesses']

In [40]:
word = "supercalifragilisticexpialidocious"
re.findall(r"[aeiou]", word)

['u',
 'e',
 'a',
 'i',
 'a',
 'i',
 'i',
 'i',
 'e',
 'i',
 'a',
 'i',
 'o',
 'i',
 'o',
 'u']

In [43]:
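# Look at sequences of two or more vowels in some text and determine their relative frequency.
# (This first attempt has the for clauses in the wrong order; the next cell corrects it.)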
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for vs in re.findall(r"[aeiou]{2,}", word)  for word in wsj)
fd.most_common(12)

[('ia', 12408), ('iou', 12408)]

In [None]:
wsj = sorted(set(nltk.corpus.treebank.words()))
fd = nltk.FreqDist(vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word))
fd.most_common(12)

[('io', 549),
 ('ea', 476),
 ('ie', 331),
 ('ou', 329),
 ('ai', 261),
 ('ia', 253),
 ('ee', 217),
 ('oo', 174),
 ('ua', 109),
 ('au', 106),
 ('ue', 105),
 ('ui', 95)]

In [51]:
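[vs for word in wsj for vs in re.findall(r'[aeiou]{2,}', word)]   # Question: why does it matter whether for word in wsj comes first or last? The for clauses nest from left to right, so here word is bound by the outer loop before findall uses it.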

['ea',
 'oi',
 'ea',
 'ou',
 'oi',
 'ea',
 'ea',
 'oi',
 'oi',
 'ea',
 'io',
 'ea',
 'ea',
 'ea',
 'oi',
 'ea',
 'ea',
 'ea',
 'ea',
 'ea',
 'ea',
 'ea',
 'ee',
 'ea',
 'ea',
 'ea',
 'ea',
 'ea',
 'ea',
 'ea',
 'ea',
 'oi',
 'ea',
 'ea',
 'ou',
 'ou',
 'ou',
 'ie',
 'ui',
 'io',
 'ua',
 'io',
 'ai',
 'ai',
 'ai',
 'io',
 'ie',
 'ue',
 'ue',
 'ia',
 'ie',
 'ea',
 'ai',
 'ou',
 'ia',
 'ei',
 'ie',
 'ea',
 'ea',
 'ie',
 'ia',
 'ia',
 'ua',
 'ie',
 'io',
 'ea',
 'ia',
 'io',
 'ui',
 'ia',
 'ia',
 'ea',
 'iai',
 'ai',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'io',
 'oo',
 'io',
 'ia',
 'ia',
 'ia',
 'ia',
 'ue',
 'ea',
 'ai',
 'ai',
 'ue',
 'ie',
 'au',
 'ea',
 'ea',
 'ea',
 'ea',
 'eau',
 'au',
 'ei',
 'ei',
 'ei',
 'ei',
 'ei',
 'ia',
 'ie',
 'io',
 'ue',
 'oa',
 'oei',
 'oe',
 'ia',
 'oo',
 'oo',
 'oo',
 'eau',
 'ou',
 'ou',
 'ai',
 'ou',
 'ai',
 'oo',
 'ea',
 'au',
 'ia',
 'ea',
 'ea',
 'ee',
 'ia',
 'ai',
 'oa',
 'oo',
 'oo',
 'oo',
 'ei',
 'ei',
 'ea',
 'ui',
 'ui',
 'eau',
 'ie',
 'ia',
 

In [53]:
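[vs for vs in re.findall(r'[aeiou]{2,}', word)  for word in wsj]   # Question: why does it matter whether for word in wsj comes first or last? With it last, findall runs once on the word left over from an earlier cell, and its matches are repeated for every item in wsj.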

['ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
 'ia',
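
The two orderings can be compared on a tiny list. In a comprehension the for clauses nest from left to right, exactly like nested for statements, and the leftmost iterable is evaluated first in the surrounding scope. The list `words` is made up; `word` plays the role of the variable left over from the earlier supercalifragilisticexpialidocious cell.

```python
import re

word = "supercalifragilisticexpialidocious"   # left over from an earlier step
words = ["queue", "sea", "tree"]

# Correct order: the outer loop binds w, the inner loop uses it
correct = [vs for w in words for vs in re.findall(r"[aeiou]{2,}", w)]

# The same thing written as nested loops
expanded = []
for w in words:
    for vs in re.findall(r"[aeiou]{2,}", w):
        expanded.append(vs)

# Swapped order: findall sees the stale outer word exactly once,
# then each of its matches is repeated len(words) times
swapped = [vs for vs in re.findall(r"[aeiou]{2,}", word) for word in words]

print(correct)
print(swapped)
```

This is why the earlier frequency distribution reported only ia and iou, each counted once per item in wsj.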

In [61]:
import re
[int(n) for n in re.findall("[0-9]{2,}", '2009-12-31')]

[2009, 12, 31]

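Ignoring word-internal vowels

English text is highly redundant, and it is still easy to read when word-internal vowels are left out; sometimes this is quite evident. For example, declaration becomes dclrtn and inalienable becomes inlnble, retaining any initial or final vowel sequences. In our next example, the regular expression matches initial vowel sequences, final vowel sequences, and all consonants; everything else is ignored.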

In [66]:
regexp = r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]"
def compress(word):
    pieces = re.findall(regexp, word)
    return "".join(pieces)
english_udhr = nltk.corpus.udhr.words("English-Latin1")
print(english_udhr[:75],"\n")
print(nltk.tokenwrap(compress(w) for w in english_udhr[:75]))

['Universal', 'Declaration', 'of', 'Human', 'Rights', 'Preamble', 'Whereas', 'recognition', 'of', 'the', 'inherent', 'dignity', 'and', 'of', 'the', 'equal', 'and', 'inalienable', 'rights', 'of', 'all', 'members', 'of', 'the', 'human', 'family', 'is', 'the', 'foundation', 'of', 'freedom', ',', 'justice', 'and', 'peace', 'in', 'the', 'world', ',', 'Whereas', 'disregard', 'and', 'contempt', 'for', 'human', 'rights', 'have', 'resulted', 'in', 'barbarous', 'acts', 'which', 'have', 'outraged', 'the', 'conscience', 'of', 'mankind', ',', 'and', 'the', 'advent', 'of', 'a', 'world', 'in', 'which', 'human', 'beings', 'shall', 'enjoy', 'freedom', 'of', 'speech', 'and'] 

Unvrsl Dclrtn of Hmn Rghts Prmble Whrs rcgntn of the inhrnt dgnty and
of the eql and inlnble rghts of all mmbrs of the hmn fmly is the fndtn
of frdm , jstce and pce in the wrld , Whrs dsrgrd and cntmpt fr hmn
rghts hve rsltd in brbrs acts whch hve outrgd the cnscnce of mnknd ,
and the advnt of a wrld in whch hmn bngs shll enjy frd
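
To see which pieces the three alternatives keep, the same pattern can be applied to single words without loading the udhr corpus. The constant name `VOWEL_PATTERN` is an illustrative choice; the pattern itself is the one used above.

```python
import re

VOWEL_PATTERN = r"^[AEIOUaeiou]+|[AEIOUaeiou]+$|[^AEIOUaeiou]"

def compress(word):
    """Keep initial vowels, final vowels and every non-vowel character."""
    return "".join(re.findall(VOWEL_PATTERN, word))

pieces = re.findall(VOWEL_PATTERN, "inalienable")   # the fragments findall keeps, in order
compressed = [compress(w) for w in ["Declaration", "inalienable", "of", "the"]]

print(pieces)
print(compressed)
```

Short words such as of and the survive unchanged because their vowels sit at the start or end of the word.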

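Combining regular expressions with conditional frequency distributions

Here we extract all consonant-vowel sequences, such as ka and si, from the words of Rotokas. Since each of these is a pair, it can be used to initialize a conditional frequency distribution. We then tabulate the frequency of each pair: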

In [68]:
rotokas_words = nltk.corpus.toolbox.words("rotokas.dic")
cvs = [cv for w in rotokas_words for cv in re.findall(r"[ptksvr][aeiou]", w)]
print (cvs[:10])

['ka', 'ka', 'ka', 'ka', 'ka', 'ro', 'ka', 'ka', 'vi', 'ko']


In [69]:
cfd = nltk.ConditionalFreqDist(cvs)
cfd.tabulate()

    a   e   i   o   u 
k 418 148  94 420 173 
p  83  31 105  34  51 
r 187  63  84  89  79 
s   0   0 100   2   1 
t  47   8   0 148  37 
v  93  27 105  48  49 
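
A two-character string unpacks into a (condition, sample) pair, which is why a plain list of strings such as 'ka' is enough to build the table. The sketch below mimics that counting with a defaultdict of Counter objects on three sample words; it is an illustration of the idea, not NLTK's ConditionalFreqDist itself.

```python
import re
from collections import Counter, defaultdict

sample_words = ["kasuari", "kapo", "kaapo"]
cvs = [cv for w in sample_words for cv in re.findall(r"[ptksvr][aeiou]", w)]

table = defaultdict(Counter)
for consonant, vowel in cvs:        # each 2-character string unpacks into two letters
    table[consonant][vowel] += 1    # count the vowel under its consonant

for consonant in sorted(table):
    print(consonant, dict(table[consonant]))
```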


A list of words for each consonant-vowel pair

In [71]:
cv_word_pairs = [(cv, w) for w in rotokas_words for cv in re.findall(r"[ptksvr][aeiou]", w)]
cv_index = nltk.Index(cv_word_pairs)
print(cv_index["su"],"\n\n",cv_index["po"])

['kasuari'] 

 ['kaapo', 'kaapopato', 'kaipori', 'kaiporipie', 'kaiporivira', 'kapo', 'kapoa', 'kapokao', 'kapokapo', 'kapokapo', 'kapokapoa', 'kapokapoa', 'kapokapora', 'kapokapora', 'kapokaporo', 'kapokaporo', 'kapokari', 'kapokarito', 'kapokoa', 'kapoo', 'kapooto', 'kapoovira', 'kapopaa', 'kaporo', 'kaporo', 'kaporopa', 'kaporoto', 'kapoto', 'karokaropo', 'karopo', 'kepo', 'kepoi', 'keposi', 'kepoto']


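This code processes each word w in turn and, for each one, finds every substring that matches the regular expression «[ptksvr][aeiou]». For the word kasuari, it finds ka, su and ri. Therefore, cv_word_pairs will contain ('ka', 'kasuari'), ('su', 'kasuari') and ('ri', 'kasuari'). One further step, using nltk.Index(), converts this into a useful index.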
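
The indexing step can be sketched with a defaultdict of lists, which groups the second element of each pair under the first. The three sample words are chosen for illustration, and this stands in for nltk.Index() rather than reproducing it.

```python
import re
from collections import defaultdict

sample_words = ["kasuari", "kapo", "kaapo"]
cv_word_pairs = [(cv, w) for w in sample_words
                 for cv in re.findall(r"[ptksvr][aeiou]", w)]

cv_index = defaultdict(list)
for cv, w in cv_word_pairs:
    cv_index[cv].append(w)     # every word is filed under each pair it contains

print(cv_index["su"], cv_index["po"])
```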

Finding word stems

In [72]:
def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

In [73]:
re.findall(r"^.*(ing|ly|ed|ious|ies|ive|es|s|ment)$", "processing")

['ing']

In [75]:
re.findall(r"^.*(?:ing|ly|ed|ious|ies|ive|es|s|ment)$", "processing")  # (?:...) is a non-capturing group, so findall returns the whole matched string instead of just the captured fragment

['processing']

In [77]:
re.findall(r"^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$", "processing") # the two pairs of parentheses capture the stem and the suffix separately

[('process', 'ing')]

In [80]:
re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', 'processes')   # .* is greedy: it grabs as much of the input as possible

[('processe', 's')]

In [78]:
re.findall(r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$", "processing") # writing *? instead of * makes the match non-greedy

[('process', 'ing')]

In [81]:
re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$', 'language') # the ? after the suffix group makes the suffix optional

[('language', '')]
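
The difference between the greedy and non-greedy stem only shows on a word like processes, where more than one suffix could match. The sketch below reuses the suffix list from the cells above; the constant name `SUFFIXES` is an illustrative choice.

```python
import re

SUFFIXES = r"(ing|ly|ed|ious|ies|ive|es|s|ment)"

greedy = re.findall(r"^(.*)" + SUFFIXES + r"$", "processes")      # stem takes as much as it can
lazy = re.findall(r"^(.*?)" + SUFFIXES + r"$", "processes")       # stem takes as little as it can
optional = re.findall(r"^(.*?)" + SUFFIXES + r"?$", "language")   # the suffix may be absent

print(greedy, lazy, optional)
```

With the non-greedy stem, the shortest stem that still leaves a valid suffix wins, so es is preferred over s.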

In [82]:
def stem2(word):
    regexp  = r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$'
    stem, suffix = re.findall(regexp, word)[0]
    return stem
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
 is no basis for a system of government.  Supreme executive power derives from
 a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
print([stem(t) for t in tokens],"\n\n",[stem2(t) for t in tokens])

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.'] 

 ['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'ly', 'in', 'pond', 'distribut', 'sword', 'i', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'Supreme', 'execut', 'power', 'deriv', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


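Searching tokenized text with NLTK's findall()

You can use a special kind of regular expression to search across multiple words in a text (where the text is a list of tokens). For example, "<a> <man>" finds all instances of a man in the text.

The angle brackets mark token boundaries, and any whitespace between the angle brackets is ignored (this behavior is specific to NLTK's findall() method for texts).

In the first example below we include <.*>, which matches any single token, and enclose it in parentheses so that only the matched word (e.g. monied) is produced rather than the whole matched phrase (e.g. a monied man).

The second example finds three-word phrases ending with the word bro.

The last example finds sequences of three or more words beginning with the letter l.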

In [89]:
from nltk.corpus import gutenberg, nps_chat
moby = nltk.Text(gutenberg.words("melville-moby_dick.txt"))
print(moby[:10],"\n")
print(moby.findall(r"<a><.*><man>"),"\n")
print(moby.findall(r"<a>(<.*>)<man>"))

['[', 'Moby', 'Dick', 'by', 'Herman', 'Melville', '1851', ']', 'ETYMOLOGY', '.'] 

a monied man; a nervous man; a dangerous man; a white man; a white
man; a white man; a pious man; a queer man; a good man; a mature man;
a white man; a Cape man; a great man; a wise man; a wise man; a
butterless man; a white man; a fiendish man; a pale man; a furious
man; a better man; a certain man; a complete man; a dismasted man; a
younger man; a brave man; a brave man; a brave man; a brave man
None 

monied; nervous; dangerous; white; white; white; pious; queer; good;
mature; white; Cape; great; wise; wise; butterless; white; fiendish;
pale; furious; better; certain; complete; dismasted; younger; brave;
brave; brave; brave
None


In [86]:
chat = nltk.Text(nps_chat.words())
print(chat[:10],"\n")
chat.findall(r"<.*><.*><bro>")

['now', 'im', 'left', 'with', 'this', 'gay', 'name', ':P', 'PART', 'hey'] 

you rule bro; telling you bro; u twizted bro


In [87]:
chat.findall(r"<l.*>{3,}")

lol lol lol; lmao lol lol; lol lol lol; la la la la la; la la la; la
la la; lovely lol lol love; lol lol lol.; la la la; la la la
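
One way to picture how such token patterns could work is to wrap every token in angle brackets, join them into a single string, and translate the pattern into an ordinary regular expression. The helper `token_findall` below is a simplified, hypothetical sketch of that idea and is not NLTK's actual code: it assumes tokens contain no < or > characters and supports only patterns without capturing parentheses.

```python
import re

def token_findall(tokens, pattern):
    """Search a token list with an angle-bracket pattern (simplified sketch)."""
    raw = "".join("<" + t + ">" for t in tokens)          # e.g. "<a><wise><man>"
    pattern = re.sub(r"\s", "", pattern)                  # whitespace in the pattern is ignored
    pattern = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    pattern = re.sub(r"(?<!\\)\.", "[^>]", pattern)       # "." must not run past a token boundary
    hits = re.findall(pattern, raw)
    return [h[1:-1].split("><") for h in hits]            # strip brackets, split back into tokens

tokens = "he is a wise man and a brave man lol lol lol ok".split()
print(token_findall(tokens, "<a> <.*> <man>"))
print(token_findall(tokens, "<l.*>{3,}"))
```

Restricting the dot to non-bracket characters is what keeps <.*> from swallowing several tokens at once.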


In [94]:
p=r'[a-zA-Z]+'
nltk.re_show(p,'123asd456')

123{asd}456


Searching a large text corpus for expressions of the form x and other ys

In [95]:
from nltk.corpus import brown
hobbies_learned = nltk.Text(brown.words(categories = ["hobbies", "learned"]))
hobbies_learned.findall(r"<\w*><and><other><\w*s>")

speed and other activities; water and other liquids; tomb and other
landmarks; Statues and other monuments; pearls and other jewels;
charts and other items; roads and other features; figures and other
objects; military and other areas; demands and other factors;
abstracts and other compilations; iron and other metals


In [96]:
hobbies_learned.findall(r"<as><\w*><as><\w*>")

as accurately as possible; as well as the; as faithfully as possible;
as much as what; as neat as a; as simple as you; as well as other; as
well as other; as involved as determining; as well as other; as
important as another; as accurately as possible; as accurate as any;
as much as any; as different as a; as Orphic as that; as coppery as
Delawares; as good as another; as large as small; as well as ease; as
well as their; as well as possible; as straight as possible; as well
as nailed; as smoothly as the; as soon as a; as well as injuries; as
well as many; as well as reason; as well as in; as well as of; as well
as a; as well as summer; as well as providing; as important as
cooling; as evenly as it; as much as shading; as well as some; as well
as subsoil; as high as possible; as well as many; as general as
electrical; as long as the; as well as the; as much as was; as well as
set; as well as by; as high as 15; as well as aid; as much as
possible; as well as personalities; as low as a; 

# 3.3 Normalizing Text

In [3]:
raw = """DENNIS: Listen, strange women lying in ponds distributing swords
... is no basis for a system of government.  Supreme executive power derives from
... a mandate from the masses, not from some farcical aquatic ceremony."""
tokens = word_tokenize(raw)
print(tokens)

['DENNIS', ':', 'Listen', ',', 'strange', 'women', 'lying', 'in', 'ponds', 'distributing', 'swords', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'masses', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']


## 1. Stemmers
Notice that the Porter stemmer correctly handles the word lying (mapping it to lie), whereas the Lancaster stemmer does not.

In [4]:
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in tokens])

['denni', ':', 'listen', ',', 'strang', 'women', 'lie', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'basi', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'power', 'deriv', 'from', 'a', 'mandat', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcic', 'aquat', 'ceremoni', '.']


In [5]:
lancaster = nltk.LancasterStemmer()
print([lancaster.stem(t) for t in tokens])

['den', ':', 'list', ',', 'strange', 'wom', 'lying', 'in', 'pond', 'distribut', 'sword', 'is', 'no', 'bas', 'for', 'a', 'system', 'of', 'govern', '.', 'suprem', 'execut', 'pow', 'der', 'from', 'a', 'mand', 'from', 'the', 'mass', ',', 'not', 'from', 'som', 'farc', 'aqu', 'ceremony', '.']
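
Both stemmers strip suffixes by rule, without consulting a dictionary, which is why stems such as `basi` or `ceremoni` appear above. The following is a deliberately naive suffix stripper that makes this behaviour easy to see. It is only an illustration: `naive_stem` and its suffix list are assumptions for this sketch, and it is not the Porter or Lancaster algorithm.

```python
import re

def naive_stem(word):
    """Strip at most one common English suffix, with no dictionary check."""
    # The lazy group takes the shortest prefix that leaves a listed suffix.
    stem, _suffix = re.findall(
        r"^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)?$", word)[0]
    return stem

for word in ["lying", "ponds", "basis", "women"]:
    print(word, "->", naive_stem(word))
```

The rule handles `ponds` sensibly but reduces `lying` to `ly` and `basis` to `basi`, and it leaves the irregular plural `women` untouched. Real stemmers add many more rules and conditions to limit such errors.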


Example 3-1. Indexing a text using a stemmer

In [None]:
class IndexedText(object):
    def __init__(self, stemmer, text):
        self._text = text
        self._stemmer = stemmer
        self._index = nltk.Index((self._stem(word), i)
                                for (i, word) in enumerate(self._text))

    def concordance(self, word, width=40):
        key = self._stem(word)
        wc = width // 4  # words of context, use integer division
        index=0
        for i in self._index[key]:
            index+=1
            lcontext = ' '.join(self._text[max(0, i-wc):i])
            rcontext = ' '.join(self._text[i:min(len(self._text), i+wc)])
            ldisplay = '%*s' % (width, lcontext[-width:] if len(lcontext) >= width else lcontext)
            rdisplay = '%-*s' % (width, rcontext[:width] if len(rcontext) >= width else rcontext)
            print(f"{index}>",ldisplay, rdisplay)

    def _stem(self, word):
        return self._stemmer.stem(word).lower()

porter = nltk.PorterStemmer()
grail = nltk.corpus.webtext.words('grail.txt')
text = IndexedText(porter, grail)
text.concordance('lie')

1> r king ! DENNIS : Listen , strange women lying in ponds distributing swords is no
2>  beat a very brave retreat . ROBIN : All lies ! MINSTREL : [ singing ] Bravest of
3>        Nay . Nay . Come . Come . You may lie here . Oh , but you are wounded !   
4> doctors immediately ! No , no , please ! Lie down . [ clap clap ] PIGLET : Well  
5> ere is much danger , for beyond the cave lies the Gorge of Eternal Peril , which 
6>    you . Oh ... TIM : To the north there lies a cave -- the cave of Caerbannog --
7> h it and lived ! Bones of full fifty men lie strewn about its lair . So , brave k
8> not stop our fight ' til each one of you lies dead , and the Holy Grail returns t
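
The core of `IndexedText` is an index from each normalized word to the positions where it occurs. A minimal standard-library sketch of that idea follows, assuming that `nltk.Index` behaves like a mapping from keys to lists of values. `build_index` is a hypothetical helper, and lowercasing stands in for a real stemmer.

```python
from collections import defaultdict

def build_index(tokens, normalize):
    """Map each normalized token to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, token in enumerate(tokens):
        index[normalize(token)].append(position)
    return index

tokens = ["Lie", "down", ".", "You", "may", "lie", "here", "."]
index = build_index(tokens, str.lower)
print(index["lie"])   # positions of "Lie" and "lie"
```

Looking up a word then means normalizing the query the same way and reading the stored positions, from which the left and right context can be sliced out of the token list.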


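## 2. Lemmatization
The WordNet lemmatizer removes an affix only if the resulting word is in its dictionary. This additional check makes the lemmatizer slower than the stemmers above. Notice that it does not handle lying, but it does convert women to woman.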

In [15]:
wnl = nltk.WordNetLemmatizer()
print([wnl.lemmatize(t) for t in tokens])

['DENNIS', ':', 'Listen', ',', 'strange', 'woman', 'lying', 'in', 'pond', 'distributing', 'sword', 'is', 'no', 'basis', 'for', 'a', 'system', 'of', 'government', '.', 'Supreme', 'executive', 'power', 'derives', 'from', 'a', 'mandate', 'from', 'the', 'mass', ',', 'not', 'from', 'some', 'farcical', 'aquatic', 'ceremony', '.']
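
The dictionary check described above can be illustrated with a toy lemmatizer. This is a sketch only: the tiny `vocabulary`, the three rewrite rules, and the name `toy_lemmatize` are all assumptions made for illustration and do not reproduce WordNet's actual rules or data.

```python
def toy_lemmatize(word, vocabulary,
                  rules=(("men", "man"), ("ies", "y"), ("s", ""))):
    """Rewrite a suffix only when the result is a known dictionary word."""
    if word in vocabulary:
        return word
    for suffix, replacement in rules:
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)] + replacement
            if candidate in vocabulary:   # the extra dictionary check
                return candidate
    return word                           # no valid rewrite: leave unchanged

vocabulary = {"woman", "pond", "sword", "lie"}   # hypothetical mini-dictionary
for word in ["women", "ponds", "basis", "lying"]:
    print(word, "->", toy_lemmatize(word, vocabulary))
```

Here `basis` stays intact because the candidate `basi` is not in the dictionary, whereas a rule-only stemmer would happily produce it.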


# 3.3 Regular Expressions for Tokenizing Text

## 1. Simple Approaches to Tokenization

In [16]:
raw = """'When I'M a Duchess,' she said to herself, (not in a very hopeful tone
though), 'I won't have any pepper in my kitchen AT ALL. Soup does very
well without--Maybe it's always pepper that makes people hot-tempered,'..."""
print(re.split(r" ", raw)) # split the raw text on space characters

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone\nthough),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very\nwell', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]


In [17]:
print(re.split(r"[ \t\n]+",raw)) # split the raw text on runs of spaces, tabs (\t), or newlines (\n)

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]


In [18]:
print(re.split(r"\s+", raw)) # split the raw text on runs of any whitespace characters

["'When", "I'M", 'a', "Duchess,'", 'she', 'said', 'to', 'herself,', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though),', "'I", "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL.', 'Soup', 'does', 'very', 'well', 'without--Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', "hot-tempered,'..."]


In [19]:
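print(re.split(r"\W+", raw)) # split on the complement of \w; \w matches word characters, equivalent to [a-zA-Z0-9_] for ASCII text; \W is its complement, i.e. any character other than a letter, digit, or underscore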

['', 'When', 'I', 'M', 'a', 'Duchess', 'she', 'said', 'to', 'herself', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', 'I', 'won', 't', 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', 'Soup', 'does', 'very', 'well', 'without', 'Maybe', 'it', 's', 'always', 'pepper', 'that', 'makes', 'people', 'hot', 'tempered', '']


In [20]:
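print(re.findall(r"\w+|\S\w*", raw))  # first try to match a run of word characters; failing that, match one non-whitespace character (\S is the complement of \s) followed by any word characters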

["'When", 'I', "'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'I", 'won', "'t", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '-', '-Maybe', 'it', "'s", 'always', 'pepper', 'that', 'makes', 'people', 'hot', '-tempered', ',', "'", '.', '.', '.']


In [21]:
print(re.findall(r"\w+(?:[-']\w+)*|'|[-.(]+|\S\w*", raw))  # \w+(?:[-']\w+)* matches words such as hot-tempered and it's

["'", 'When', "I'M", 'a', 'Duchess', ',', "'", 'she', 'said', 'to', 'herself', ',', '(', 'not', 'in', 'a', 'very', 'hopeful', 'tone', 'though', ')', ',', "'", 'I', "won't", 'have', 'any', 'pepper', 'in', 'my', 'kitchen', 'AT', 'ALL', '.', 'Soup', 'does', 'very', 'well', 'without', '--', 'Maybe', "it's", 'always', 'pepper', 'that', 'makes', 'people', 'hot-tempered', ',', "'", '...']
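
The final pattern is easier to read and maintain when written in verbose mode, with one alternative per line. The sketch below wraps it in a small function using only the standard `re` module. The names `TOKEN_RE` and `tokenize` are introduced here for illustration.

```python
import re

TOKEN_RE = re.compile(r"""
      \w+(?:[-']\w+)*    # words, allowing internal hyphens and apostrophes
    | '                  # a quotation mark standing on its own
    | [-.(]+             # runs of hyphens, periods, or opening parentheses
    | \S\w*              # any other non-space character plus trailing word characters
""", re.VERBOSE)

def tokenize(text):
    """Return the tokens found by TOKEN_RE, scanning left to right."""
    return TOKEN_RE.findall(text)

print(tokenize("it's hot-tempered..."))
print(tokenize("(not well--Maybe)"))
```

The order of the alternatives matters: the regex engine tries them left to right at each position, so the more specific word pattern must come before the catch-all `\S\w*`.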


# 3.4 Segmentation

## 1. Sentence Segmentation

In [22]:
len(nltk.corpus.brown.words()) / len(nltk.corpus.brown.sents()) # average number of words per sentence in the Brown Corpus

20.250994070456922

In [23]:
sent_tokenizer=nltk.data.load('tokenizers/punkt/english.pickle')
text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
sents = sent_tokenizer.tokenize(text)
pprint.pprint(sents[171:181])

['In the wild events which were to follow this girl had no\n'
 'part at all; he never saw her again until all his tale was over.',
 'And yet, in some indescribable way, she kept recurring like a\n'
 'motive in music through all his mad adventures afterwards, and the\n'
 'glory of her strange hair ran like a red thread through those dark\n'
 'and ill-drawn tapestries of the night.',
 'For what followed was so\nimprobable, that it might well have been a dream.',
 'When Syme went out into the starlit street, he found it for the\n'
 'moment empty.',
 'Then he realised (in some odd way) that the silence\n'
 'was rather a living silence than a dead one.',
 'Directly outside the\n'
 'door stood a street lamp, whose gleam gilded the leaves of the tree\n'
 'that bent out over the fence behind him.',
 'About a foot from the\n'
 'lamp-post stood a figure almost as rigid and motionless as the\n'
 'lamp-post itself.',
 'The tall hat and long frock coat were black; the\n'
 'face, in an abrupt shadow
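
The Punkt tokenizer loaded above is a pre-trained model. For contrast, the sketch below splits on whitespace that follows sentence-final punctuation. This naive rule is an assumption made for illustration and is not how Punkt works. It shows why sentence segmentation is harder than it looks.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(naive_sentences("He left. She stayed! Why?"))
print(naive_sentences("Mr. Smith arrived. He sat down."))
```

The second call breaks the text after the abbreviation `Mr.`, since the rule cannot tell an abbreviation period from a sentence boundary.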

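## 2. Word Segmentation
A similar problem arises in spoken language processing, where the hearer must segment a continuous speech stream into individual words.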

In [24]:
text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
seg2 = "0100100100100001001001000010100100010010000100010010000"
seg3 = "0000100100000011001000000110000100010000001100010000001"

In [None]:
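# Example 3-2. Reconstructing segmented text from the string representations seg1 and seg2, which represent the initial and final segmentations of some hypothetical child-directed speech. The segment() function uses them to reproduce the segmented text.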
def segment(text, segs):
    words = []
    last = 0
    for i in range(len(segs)):
        if segs[i] == "1":
            words.append(text[last:i+1])
            last = i + 1
    words.append(text[last:])
    return words
print(segment(text,seg1))
print(segment(text,seg2))
print(segment(text,seg3))

['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy', 'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
['doyou', 'see', 'thekitt', 'y', 'see', 'thedogg', 'y', 'doyou', 'like', 'thekitt', 'y', 'like', 'thedogg', 'y']
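
In this representation, a `1` at position `i` marks a word boundary after character `i`, so a text of n characters needs n-1 bits. The inverse mapping, from a list of words back to a bit string, makes the encoding explicit. `encode_segmentation` is a hypothetical helper added for illustration, `segment()` is repeated so the block runs on its own, and every word is assumed to be non-empty.

```python
def segment(text, segs):
    """Split text after every position whose bit is '1'."""
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == "1":
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def encode_segmentation(words):
    """Build the bit string for a list of non-empty words."""
    bits = "".join("0" * (len(word) - 1) + "1" for word in words)
    return bits[:-1]   # no boundary bit is stored after the final character

words = ["do", "you", "see", "the", "kitty"]
segs = encode_segmentation(words)
print(segs)
print(segment("".join(words), segs) == words)
```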


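Example 3-3. Computing the cost of storing the lexicon and reconstructing the source text

The final step is to search for the pattern of zeros and ones that minimizes the objective function (a lower score is better), as shown in Example 3-4. Note that the best segmentation includes "words" like "thekitty", because there is not enough evidence in the data to split them any further.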


In [27]:
def evaluate(text, segs):
    words = segment(text, segs)
    text_size = len(words)
    lexicon_size = sum(len(word) + 1 for word in set(words))
    return text_size + lexicon_size
print(evaluate(text, seg1))
print(evaluate(text, seg2))
print(evaluate(text, seg3))

64
48
47
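
To see where these scores come from, the objective can be split into its two parts: the number of words needed to reconstruct the text, and the size of the lexicon (each distinct word's length plus one delimiter character). The sketch below does this for the segmentation given by `seg2`. `cost_breakdown` is a hypothetical helper, and the word list is copied from the `segment(text, seg2)` output above.

```python
def cost_breakdown(words):
    """Return (text_size, lexicon_size) for a segmentation given as words."""
    text_size = len(words)                                      # one entry per word used
    lexicon_size = sum(len(word) + 1 for word in set(words))    # +1 for a delimiter
    return text_size, lexicon_size

seg2_words = ['do', 'you', 'see', 'the', 'kitty', 'see', 'the', 'doggy',
              'do', 'you', 'like', 'the', 'kitty', 'like', 'the', 'doggy']
text_size, lexicon_size = cost_breakdown(seg2_words)
print(text_size, lexicon_size, text_size + lexicon_size)
```

For `seg2` the 16 words plus a lexicon cost of 32 (seven distinct words) give the total of 48 reported above. `seg3` scores 47 because its 14 words and six lexicon entries cost 14 + 33.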


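Example 3-4. Non-deterministic search using simulated annealing: begin by searching phrase segmentations only; randomly perturb the zeros and ones in proportion to the "temperature"; with each iteration the temperature is lowered and the perturbation of boundaries is reduced.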

In [28]:
from random import randint

def flip(segs, pos):
    return segs[:pos] + str(1-int(segs[pos])) + segs[pos+1:]  # flip the digit at position pos in segs: 1 becomes 0, 0 becomes 1

def flip_n(segs, n):
    for i in range(n):
        segs = flip(segs, randint(0,len(segs)-1))  # flip a randomly chosen position in segs, repeated n times
    return segs

def anneal(text, segs, iterations, cooling_rate):
    temperature = float(len(segs))
    while temperature > 0.5:
        best_segs, best = segs, evaluate(text, segs)
        for i in range(iterations):
            guess = flip_n(segs, int(round(temperature)))  # round() rounds the float to the nearest integer
            score = evaluate(text, guess)
            if score < best:
                best, best_segs = score, guess
        score, segs = best, best_segs
        temperature = temperature / cooling_rate
        print (evaluate(text, segs), segment(text, segs))
    print()
    return segs

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"

seg1 = "0000000000000001000000000010000000000000000100000000000"

anneal(text, seg1, 5000, 1.2)


64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
64 ['doyouseethekitty', 'seethedoggy', 'doyoulikethekitty', 'likethedoggy']
62 ['doyouseethek', 'itty', 'seethedoggydoyo', 'u', 'liketh', 'ek', 'itty', 'liketh', 'edoggy']
62 ['doyouseethek', 'itty', 'seethedoggydoyo', 'u', 'liketh', 'ek', 'itty', 'liketh', 'edoggy']
62 ['doyouseethek', 'itty', 'seethedoggydoyo', 'u', 'liketh', 'ek', 'itty', 'liketh', 'edoggy']
59 ['do', 'you', 'seeth', 'ek', 'itty', 'seeth', 'edoggydo', 'you', 'liketh', 'ek', 'itty', 'liketh', 'e

'0000100001000001000010000010000100000100000100000100000'
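
For comparison, here is a deterministic greedy search that flips one bit at a time and keeps a flip only when the score improves. This is plain hill climbing, a simpler relative of simulated annealing and not the method of Example 3-4. The name `hill_climb` is hypothetical, and the helper functions are repeated so the block runs on its own.

```python
def segment(text, segs):
    words, last = [], 0
    for i, bit in enumerate(segs):
        if bit == "1":
            words.append(text[last:i + 1])
            last = i + 1
    words.append(text[last:])
    return words

def evaluate(text, segs):
    words = segment(text, segs)
    return len(words) + sum(len(word) + 1 for word in set(words))

def flip(segs, pos):
    return segs[:pos] + str(1 - int(segs[pos])) + segs[pos + 1:]

def hill_climb(text, segs):
    """Accept single-bit flips while they strictly lower the score."""
    best = evaluate(text, segs)
    improved = True
    while improved:
        improved = False
        for pos in range(len(segs)):
            candidate = flip(segs, pos)
            score = evaluate(text, candidate)
            if score < best:
                segs, best, improved = candidate, score, True
    return segs, best

text = "doyouseethekittyseethedoggydoyoulikethekittylikethedoggy"
seg1 = "0000000000000001000000000010000000000000000100000000000"
result, score = hill_climb(text, seg1)
print(score, segment(text, result))
```

Because it never accepts a worse segmentation, this search may stop at a local minimum. The random multi-bit perturbations of annealing are one way to escape such traps.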

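With enough data, it is possible to segment text into words automatically with reasonable accuracy. Such methods can be applied to tokenization for writing systems that have no visual representation of word boundaries.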

# 3.5 Formatting: From Lists to Strings

## 1. From Lists to Strings

In [149]:
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']
print(" ".join(silly))

We called him Tortoise because he taught us .
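
`join()` is a method of the separator string, and it only accepts strings as items. A short sketch of the round trip with `split()`, and of joining non-string items, follows. The variable names are chosen here for illustration.

```python
silly = ['We', 'called', 'him', 'Tortoise', 'because', 'he', 'taught', 'us', '.']

sentence = " ".join(silly)           # the separator goes between items only
round_trip = sentence.split(" ")     # split() undoes join() for this separator
print(round_trip == silly)

numbers = [3, 1, 4]
print(", ".join(str(n) for n in numbers))   # convert non-strings first
```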


## 1. Strings and Formats
We have seen that there are two ways to display the contents of an object:

In [151]:
word = "cat"
sentence = """
hello
word
"""
print(word)
print(sentence)

cat

hello
word



The print command yields Python's attempt to produce the most human-readable form of an object.

In [152]:
word

'cat'

In [153]:
sentence

'\nhello\nword\n'

The second method, naming the variable at a prompt, shows us a string that can be used to recreate this object.

## 2. Formatted Output
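
The two displays correspond to the built-in functions `str()` and `repr()`. The sketch below shows that the `repr()` form really can be used to rebuild the object, using `ast.literal_eval` from the standard library to evaluate the literal safely.

```python
import ast

sentence = "\nhello\nword\n"

readable = str(sentence)      # what print() shows: real line breaks
literal = repr(sentence)      # what the prompt shows: quotes and \n escapes
print(literal)

rebuilt = ast.literal_eval(literal)   # parse the literal back into a string
print(rebuilt == sentence)
```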

In [154]:
# 1. Alternating variables and constants
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in fdist:
    print(word, "->", fdist[word], end = ";")

dog -> 4;cat -> 3;snake -> 1;

In [155]:
# 2. Using the str.format() method
fdist = nltk.FreqDist(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])
for word in fdist:
    print("{}->{};".format(word, fdist[word]), end = ' ')

dog->4; cat->3; snake->1; 
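
The same output can be produced with an f-string, which puts the expressions directly inside the braces. The sketch below uses `collections.Counter` from the standard library as a stand-in for `nltk.FreqDist`, so that it runs without NLTK. `most_common()` is called to make the ordering by count explicit.

```python
from collections import Counter

counts = Counter(['dog', 'cat', 'dog', 'cat', 'dog', 'snake', 'dog', 'cat'])

# Build each "word->count;" piece with an f-string, most frequent first.
line = " ".join(f"{word}->{count};" for word, count in counts.most_common())
print(line)
```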

## 3. Alignment with the str.format() Method

In [160]:
"{:6}".format(41)   # field width 6; numbers are right-aligned by default

'    41'

In [161]:
"{:<6}".format(41)   # field width 6; < left-aligns the number

'41    '

In [163]:
"{:6}".format("dog") # field width 6; strings are left-aligned by default

'dog   '

In [167]:
"{:>6}".format("dog") # field width 6; > right-aligns the string

'   dog'

In [168]:
# Control the precision of a floating-point number
import math
"{:.4f}".format(math.pi)  # show 4 digits after the decimal point

'3.1416'

In [169]:
# Format as a percentage
"accuracy for {} words: {:.4%}".format(9375, 3205 / 9375)

'accuracy for 9375 words: 34.1867%'
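
A few further format specifications follow the same pattern of `fill`, `align`, `sign`, `width`, and `.precision`. The sketch below collects some of them. The values are chosen only for illustration.

```python
import math

centered = "{:^7}".format("dog")         # ^ centers the text in 7 columns
filled = "{:*>6}".format(41)             # pad on the left with '*'
zero_padded = "{:06.2f}".format(math.pi) # width 6, zero-padded, 2 decimals
signed = "{:+.2f}".format(math.pi)       # always show the sign
percent = f"{3205 / 9375:.1%}"           # the same specs work in f-strings

for value in (centered, filled, zero_padded, signed, percent):
    print(repr(value))
```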

## 4. Format Strings for Tabulating Data

In [29]:
def tabulate(cfdist, words, categories):
    print("{:20}".format("Category"), end = " ")
    for word in words:
        print("{:>6}".format(word), end = " ")
    print ()
    for category in categories:
        print("{:20}".format(category), end = " ")
        for word in words:
            print("{:6}".format(cfdist[category][word]), end = " ")
        print()
        
from nltk.corpus import brown

cfd = nltk.ConditionalFreqDist(
    (genre, word) 
    for genre in brown.categories() 
    for word in brown.words(categories = genre))

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modals = ['can', 'could', 'may', 'might', 'must', 'will']

tabulate(cfd, modals, genres)

Category                can  could    may  might   must   will 
news                     93     86     66     38     50    389 
religion                 82     59     78     12     54     71 
hobbies                 268     58    131     22     83    264 
science_fiction          16     49      4     12      8     16 
romance                  74    193     11     51     45     43 
humor                    16     30      8      8      9     13 


In [183]:
# Set the column width automatically
#width = max(len(w) for w in words)
"{:{width}}".format("Monty Python", width = 15)

'Monty Python   '
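
Combining the nested `{width}` field with a computed maximum gives a table whose columns size themselves. The sketch below works on a plain dictionary of dictionaries instead of a `ConditionalFreqDist`, so it runs without NLTK. `format_table` is a hypothetical helper, and the four counts are copied from the table above.

```python
def format_table(counts, words, categories):
    """Return table lines with widths computed from the data."""
    label_width = max(len(label) for label in ["Category"] + list(categories))
    cell_width = max(
        [len(word) for word in words]
        + [len(str(counts[c][w])) for c in categories for w in words])
    header = "{:{lw}}".format("Category", lw=label_width) + "".join(
        " {:>{cw}}".format(word, cw=cell_width) for word in words)
    lines = [header]
    for category in categories:
        row = "{:{lw}}".format(category, lw=label_width) + "".join(
            " {:>{cw}}".format(counts[category][word], cw=cell_width)
            for word in words)
        lines.append(row)
    return lines

counts = {"news": {"can": 93, "could": 86},
          "religion": {"can": 82, "could": 59}}
lines = format_table(counts, ["can", "could"], ["news", "religion"])
print("\n".join(lines))
```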

## 5. Writing Results to a File

In [185]:
output_file = open("3.output.txt", "w")
words = set(nltk.corpus.genesis.words("english-kjv.txt"))
for word in sorted(words):
    print(word, file = output_file)

In [187]:
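# With file.write(), non-text data must first be converted to a string; print() does this conversion itself, so str() here is optional.
print(str(len(words)), file = output_file)
output_file.close()  # close the file so that everything written is flushed to disk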
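
A `with` block closes the file automatically, even if an error occurs while writing. The sketch below writes to a temporary directory and reads the result back. The three words are a hypothetical stand-in for the Genesis vocabulary so that the example runs without NLTK data.

```python
import os
import tempfile

words = {"Abel", "Adam", "Cain"}   # hypothetical stand-in for the corpus words
path = os.path.join(tempfile.mkdtemp(), "output.txt")

with open(path, "w", encoding="utf8") as output_file:
    for word in sorted(words):
        print(word, file=output_file)
    print(len(words), file=output_file)   # print() converts the int itself

with open(path, encoding="utf8") as input_file:
    lines = input_file.read().splitlines()
print(lines)
```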

## 6. Text Wrapping

In [188]:
saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
         'more', 'is', 'said', 'than', 'done', '.']
for word in saying:
    print(word, "(" + str(len(word)) + ") , ", end = " ")

After (5) ,  all (3) ,  is (2) ,  said (4) ,  and (3) ,  done (4) ,  , (1) ,  more (4) ,  is (2) ,  said (4) ,  than (4) ,  done (4) ,  . (1) ,  

We can wrap the text with the help of Python's textwrap module.

In [190]:
from textwrap import fill
fmt = "%s (%d) , "  # named fmt to avoid shadowing the built-in format()
pieces = [fmt % (word, len(word)) for word in saying]
output = " ".join(pieces)
print(output,"\n")
print(fill(output))

After (5) ,  all (3) ,  is (2) ,  said (4) ,  and (3) ,  done (4) ,  , (1) ,  more (4) ,  is (2) ,  said (4) ,  than (4) ,  done (4) ,  . (1) ,  

After (5) ,  all (3) ,  is (2) ,  said (4) ,  and (3) ,  done (4) ,  ,
(1) ,  more (4) ,  is (2) ,  said (4) ,  than (4) ,  done (4) ,  . (1)
,
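
In the wrapped output above, a line break can fall between a word and its length. One possible workaround is to glue each word to its count with a placeholder character before wrapping and to restore the space afterwards. The sketch below uses an underscore as the placeholder, which assumes that no token contains an underscore. The width of 40 is arbitrary.

```python
from textwrap import fill

saying = ['After', 'all', 'is', 'said', 'and', 'done', ',',
          'more', 'is', 'said', 'than', 'done', '.']

# Use "_" inside each piece so fill() cannot break a word from its count.
pieces = ["{}_({}),".format(word, len(word)) for word in saying]
wrapped = fill(" ".join(pieces), width=40).replace("_", " ")
print(wrapped)
```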
