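Part 2

Introduction to the nltk.corpus module in NLTK

The nltk.corpus module provides access to a range of corpora. A corpus is a collection of text and language data used for linguistic research and for developing and evaluating natural language processing (NLP) systems. nltk.corpus exposes reader interfaces for many common language resources, so developers can load and use them with little setup.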


1. stopwords

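The stopwords corpus provides lists of common stop words, which can be removed during text preprocessing because they carry little content on their own.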

In [3]:
import nltk
nltk.data.path.append('C:\\mini\\envs\\cl-43\\nltk_data')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\86182\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [4]:
import nltk
from nltk.corpus import stopwords

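# Load the English stop words into a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Print the stop words (a set is unordered, so the order varies between runs)
print(stop_words)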

{'s', "shan't", "you've", 'these', 'shouldn', "shouldn't", 'them', 'had', 'over', 'him', "that'll", 'i', 'up', 'yourselves', 'more', 'weren', 'how', 've', 'do', 'after', 'can', 'mightn', 'doing', 'down', "couldn't", 'where', 'mustn', "weren't", 'theirs', "should've", 'myself', 'didn', 'there', 'yourself', "aren't", "didn't", 'they', 'as', 'ours', 'off', 'don', 'most', 'is', 'me', 'during', 'before', 'from', 'won', 'herself', 'for', 'which', 'you', 'by', 'only', 'at', 't', "you're", 'few', 'all', "hasn't", 'their', 'then', 'been', 'our', 'very', 'have', 'ma', 'y', 'same', 'couldn', "doesn't", 'wasn', 'not', 'its', 'against', 'am', 'too', 'further', 'here', 'doesn', 'hers', 'should', 'ain', 'or', 'having', 'the', 'other', 'who', 'm', 'above', "isn't", 'until', 'under', 'about', 'does', 'were', 'isn', 'into', 'if', 'be', 'through', 'both', "mustn't", 'my', 'was', 'his', 'some', 'now', "mightn't", 'such', 'd', 'but', 'because', 'when', 'will', 'wouldn', 'an', 'this', 'ourselves', 'your', '

Test cases

Verify the length of the list:

In [12]:
assert len(stop_words) > 100, "Expected more than 100 stopwords"

Verify that common stop words are included:

In [14]:
assert 'is' in stop_words, "'is' should be in stopwords"
assert 'from' in stop_words, "'from' should be in stopwords"


Verify that a rare word is not included:

In [13]:
assert 'conundrum' not in stop_words, "'conundrum' should not be in stopwords"
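
The typical use of the list is to filter tokens during preprocessing. The following is a minimal sketch of that step, not part of the original notebook: `remove_stopwords` and `demo_stop_words` are hypothetical names, and the small hand-written set only stands in for the NLTK list so the cell runs without downloaded data.

```python
def remove_stopwords(tokens, stop_words):
    """Return the tokens whose lowercase form is not in stop_words."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

# Small stand-in set (an assumption for illustration); in the notebook,
# pass the stop_words set built from stopwords.words('english') instead.
demo_stop_words = {'the', 'is', 'from', 'a', 'of'}
tokens = ['The', 'answer', 'is', 'far', 'from', 'obvious']
print(remove_stopwords(tokens, demo_stop_words))  # ['answer', 'far', 'obvious']
```

Tokens are lowercased before the lookup because the NLTK English list is stored in lowercase.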

2. wordnet

wordnet is an English lexical database that groups words into sets of synonyms (synsets) and records the semantic relations between them.

In [16]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\86182\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [9]:
from nltk.corpus import wordnet

# Look up the synsets for a word
synsets = wordnet.synsets('car')
for synset in synsets:
    print(synset.name(), ':', synset.definition())

car.n.01 : a motor vehicle with four wheels; usually propelled by an internal combustion engine
car.n.02 : a wheeled vehicle adapted to the rails of railroad
car.n.03 : the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
car.n.04 : where passengers ride up and down
cable_car.n.01 : a conveyance for passengers or freight on a cable railway


Test cases

Verify that at least one synset is returned:

In [10]:
assert len(synsets) > 0, "Expected at least one synset for 'car'"


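Verify that the synsets include the expected synonym. Synset names have the form car.n.01, so 'motorcar' never appears in s.name(); it is a lemma of the synset and has to be checked through s.lemma_names():

In [19]:
assert any('motorcar' in s.lemma_names() for s in synsets), "'motorcar' should be a lemma of some synset"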
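
To see all synonyms at once, the lemma names of every synset can be gathered into one list. The following is a minimal sketch under stated assumptions: `collect_lemma_names` and `FakeSynset` are hypothetical names, and the fake objects (with illustrative lemma lists) only stand in for real synsets so the cell runs without WordNet data. The only NLTK call it relies on is the synset method `lemma_names()`.

```python
def collect_lemma_names(synsets):
    """Collect distinct lemma names across synsets, preserving first-seen order."""
    names = []
    for synset in synsets:
        for name in synset.lemma_names():
            if name not in names:
                names.append(name)
    return names


class FakeSynset:
    """Hypothetical stand-in for a WordNet synset (illustration only)."""

    def __init__(self, lemmas):
        self._lemmas = lemmas

    def lemma_names(self):
        return list(self._lemmas)


# Illustrative lemma lists; real values come from wordnet.synsets('car').
demo = [
    FakeSynset(['car', 'auto', 'automobile', 'machine', 'motorcar']),
    FakeSynset(['car', 'railcar', 'railway_car', 'railroad_car']),
]
print(collect_lemma_names(demo))
```

In the notebook, the same function can be called as collect_lemma_names(synsets) with the synsets variable from the cell above.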

3. treebank

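treebank is a corpus of annotated English sentences (NLTK ships a sample of the Penn Treebank) used to train and evaluate part-of-speech taggers and other syntactic analysis tools.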

In [41]:
import nltk
nltk.download('treebank')

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\86182\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.


True

In [42]:
from nltk.corpus import treebank

# Get the tokenized sentences
sentences = treebank.sents()
print(sentences[0])

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']


Test cases

Verify that sentences were loaded from the corpus:

In [43]:
assert len(sentences) > 0, "Expected sentences in treebank corpus"


Verify that each sentence has the expected format:

In [44]:
assert isinstance(sentences[0], list), "Each sentence should be a list of tokens"
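
Because the corpus is meant for tagger training and evaluation, a common first step is to inspect the tag distribution. The following is a minimal sketch, not part of the original notebook: `tag_counts` is a hypothetical helper, and the demo sentence is hand-tagged for illustration so the cell runs without the corpus.

```python
from collections import Counter


def tag_counts(tagged_sents):
    """Count POS tags over sentences given as lists of (word, tag) pairs."""
    return Counter(tag for sent in tagged_sents for _, tag in sent)


# Hand-tagged demo data (an assumption for illustration only).
demo = [[('Pierre', 'NNP'), ('Vinken', 'NNP'), ('will', 'MD'), ('join', 'VB')]]
print(tag_counts(demo).most_common(1))  # [('NNP', 2)]
```

In the notebook, the input would be treebank.tagged_sents(), which yields sentences in this (word, tag) format.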

4. brown

brown is the Brown Corpus, a classic English corpus containing texts from different genres (news, editorials, fiction, and so on).

In [33]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\86182\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


True

In [34]:
from nltk.corpus import brown

# List the genre categories in the corpus
genres = brown.categories()
print(genres[:5])

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government']


Test cases

Verify that the genre categories were loaded:

In [35]:
assert len(genres) > 0, "Expected genres in brown corpus"


Verify that each genre contains at least one file:

In [36]:
for genre in genres[:5]:
    texts = brown.fileids(categories=genre)
    assert len(texts) > 0, f"Expected texts for genre: {genre}"
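
Genre categories are mostly useful for comparing vocabulary across text types. The following is a minimal sketch of a per-genre word count, not part of the original notebook: `top_words` is a hypothetical helper, and the demo tokens are made up so the cell runs without the corpus.

```python
from collections import Counter


def top_words(words, n=3):
    """Return the n most frequent alphabetic tokens, compared case-insensitively."""
    counts = Counter(w.lower() for w in words if w.isalpha())
    return counts.most_common(n)


# Made-up demo tokens (an assumption for illustration only).
demo_words = ['The', 'jury', 'said', 'the', 'election', 'was', 'fair', '.',
              'The', 'jury', 'agreed', '.']
print(top_words(demo_words, 2))  # [('the', 3), ('jury', 2)]
```

In the notebook, a call such as top_words(brown.words(categories='news'), 10) would apply the same count to one genre.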


5. gutenberg

gutenberg is a corpus of texts selected from Project Gutenberg, covering a variety of public-domain literary works.

In [29]:
import nltk
nltk.download('gutenberg')


[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\86182\AppData\Roaming\nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

In [28]:
from nltk.corpus import gutenberg

# Load the raw text of Hamlet
text = gutenberg.raw('shakespeare-hamlet.txt')
print(text[:500])


[The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your selfe

   Bar. Long liue the King

   Fran. Barnardo?
  Bar. He

   Fran. You come most carefully vpon your houre

   Bar. 'Tis now strook twelue, get thee to bed Francisco

   Fran. For this releefe much thankes: 'Tis bitter cold,
And I am sicke at heart

   Barn. Haue you had quiet Guard?
  Fran. Not


Test cases

Verify that the requested text was loaded:

In [31]:
assert len(text) > 0, "Expected text from gutenberg corpus"


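Verify that the content is the expected text. The file opens with a bracketed title line but does not end with ']', so the check is made against the title prefix shown in the output above:

In [30]:
assert text.startswith('[The Tragedie of Hamlet'), "Expected Shakespeare's Hamlet text"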
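
A content check is less brittle when it looks only at the title line rather than at the first and last characters. The following is a minimal sketch, not part of the original notebook: `first_nonblank_line` is a hypothetical helper, and the sample string copies the opening lines from the output shown above.

```python
def first_nonblank_line(raw_text):
    """Return the first line containing non-whitespace text, stripped; '' if none."""
    for line in raw_text.splitlines():
        if line.strip():
            return line.strip()
    return ''


# Opening lines copied from the printed output of gutenberg.raw() above.
sample = ("[The Tragedie of Hamlet by William Shakespeare 1599]\n\n\n"
          "Actus Primus. Scoena Prima.\n")
print(first_nonblank_line(sample))
```

In the notebook, first_nonblank_line(text) can be compared against the expected title of the file.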