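Introduction to the nltk.corpus Module in NLTK

The nltk.corpus module is used to access and process corpora. A corpus is a collection of text and language data used for linguistic research and for developing and evaluating natural language processing (NLP) systems. The module provides interfaces to many common language resources, so developers can load and use them with little effort.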


1. stopwords

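The stopwords corpus provides lists of common stopwords. Removing these words during text preprocessing discards tokens that carry little meaning on their own.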

In [10]:
import nltk
nltk.data.path.append(r'.\nltk_data')
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\31542\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [11]:
import nltk
from nltk.corpus import stopwords

# Load the English stopwords
stop_words = set(stopwords.words('english'))

# Print the stopword set
print(stop_words)

{'itself', 'when', 'at', 'into', 'in', 'off', 'its', 'ain', 'don', 'only', 'me', 'themselves', "it's", 'once', 'yourselves', 'have', 'each', "couldn't", 'were', 'y', 'to', "wouldn't", 'through', "didn't", "you'll", 'mustn', 'nor', 'having', 'until', 'no', 'below', "you're", 'than', 'has', 'an', "mightn't", 'them', 'as', 'most', 'other', "hadn't", 'he', 'aren', 'too', 'it', "haven't", 'didn', "mustn't", 's', 'o', "you'd", 'ours', 'theirs', 'd', 'that', 'out', 'down', 'shan', 'now', 'being', "don't", 'him', 'couldn', "isn't", "hasn't", 'shouldn', 'been', 'weren', 't', 'why', 'those', 'their', 'a', 'where', 'herself', 'because', 'but', 'did', 'our', 'here', 'hasn', 'while', 'or', 'some', 'if', 'from', 'doesn', "needn't", 'yourself', 'your', 'm', 'she', "weren't", 'himself', 'wasn', 'am', 'after', 'all', "shouldn't", 'hadn', 'will', 'such', 'before', 'his', 'does', 'won', 'these', 'same', 'just', 'for', "shan't", 'about', 'of', 'there', 'doing', 'over', 'which', 'what', 'then', 'so', 'abov

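Test cases

Basic text processing: remove stopwords from a text using the stopword list and verify that the result matches the expected output.

The test builds a sample text, splits it into words, drops every word found in the stop_words set, and stores the rest in filtered_words. If the joined result differs from the expected string, the assertion fails.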

In [12]:
def processing():
    text = "This is an example sentence to demonstrate stopwords removal."
    stop_words = set(stopwords.words('english'))
    
    # Tokenize on whitespace
    words = text.split()
    
    # Remove stopwords (the comparison is case-insensitive)
    filtered_words = [word for word in words if word.lower() not in stop_words]
    
    # Expected result
    expected_output = "example sentence demonstrate stopwords removal."
    
    # Join the list back into a string
    processed_text = ' '.join(filtered_words)
    
    # Check that the processed text equals the expected output
    assert processed_text == expected_output, f"Expected: {expected_output}, but got: {processed_text}"

processing()
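
One caveat with the test above: `text.split()` keeps punctuation attached to words, so a stopword followed by a period or question mark (for example `it?`) does not match the list and survives filtering. The sketch below shows one possible way around this, using a regular expression to extract word tokens before filtering. The small `STOP_WORDS` set and the `remove_stopwords` helper are illustrative assumptions, not part of NLTK; in the notebook you would pass `set(stopwords.words('english'))` instead.

```python
import re

# Hypothetical stand-in for set(stopwords.words('english')); only a few
# entries are listed so the example runs without downloading NLTK data.
STOP_WORDS = {"this", "is", "an", "to", "it", "the", "and", "a", "of"}

# Matches runs of letters, optionally with one internal apostrophe ("don't").
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")

def remove_stopwords(text, stop_words=STOP_WORDS):
    """Return the non-stopword tokens of text, ignoring punctuation."""
    tokens = TOKEN_RE.findall(text)
    return [tok for tok in tokens if tok.lower() not in stop_words]

sample = "This is an example sentence. Is it? It is."

# Whitespace splitting leaves 'it?' and 'is.' in place.
naive = [w for w in sample.split() if w.lower() not in STOP_WORDS]
print(naive)

# Regex tokenization removes them.
print(remove_stopwords(sample))
```

The trade-off is that punctuation is dropped from the result entirely, so the expected string in a test like `processing()` would need to be written without the trailing period.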


Empty text: verify that processing an empty string raises no error and returns the expected empty string.


In [13]:
def empty_text():
    text = ""
    stop_words = set(stopwords.words('english'))
    
    # Tokenize on whitespace
    words = text.split()
    
    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    
    # The expected result is an empty string
    expected_output = ""
    
    # Join the list back into a string
    processed_text = ' '.join(filtered_words)
    
    # Check that the processed text equals the expected output
    assert processed_text == expected_output, f"Expected: {expected_output}, but got: {processed_text}"

empty_text()


Text made up entirely of stopwords: verify that when every word in the input is a stopword, the result is an empty string.

In [14]:
def all_stopwords():
    text = "the and a of to"
    stop_words = set(stopwords.words('english'))
    
    # Tokenize on whitespace
    words = text.split()
    
    # Remove stopwords
    filtered_words = [word for word in words if word.lower() not in stop_words]
    
    # The expected result is an empty string
    expected_output = ""
    
    # Join the list back into a string
    processed_text = ' '.join(filtered_words)
    
    # Check that the processed text equals the expected output
    assert processed_text == expected_output, f"Expected: {expected_output}, but got: {processed_text}"

all_stopwords()


2. wordnet

wordnet is a lexical database of English that groups words into sets of synonyms (synsets) and records the semantic relations between them.

In [15]:
import nltk
nltk.download('wordnet')


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\31542\AppData\Roaming\nltk_data...


True

In [16]:
from nltk.corpus import wordnet

# Look up the synsets for a word
synsets = wordnet.synsets('car')
for synset in synsets:
    print(synset.name(), ':', synset.definition())

car.n.01 : a motor vehicle with four wheels; usually propelled by an internal combustion engine
car.n.02 : a wheeled vehicle adapted to the rails of railroad
car.n.03 : the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
car.n.04 : where passengers ride up and down
cable_car.n.01 : a conveyance for passengers or freight on a cable railway


Test cases

Basic synonym lookup: verify that the synsets for a given word can be found and that synonyms can be collected from them.

In [17]:
def search(word):
    # Look up the synsets for the word
    synsets = wordnet.synsets(word)
    
    # Expect at least one synset
    assert len(synsets) > 0, f"Expected to find synonyms for '{word}', but found none."
    
    # Collect the lemma names of every synset as synonyms
    synonyms = set()
    for synset in synsets:
        for lemma in synset.lemmas():
            synonyms.add(lemma.name())
    
    return synonyms

word = 'car'
synonyms = search(word)
print(f"Synonyms for '{word}': {synonyms}")

Synonyms for 'car': {'gondola', 'cable_car', 'machine', 'railway_car', 'auto', 'car', 'automobile', 'elevator_car', 'railroad_car', 'railcar', 'motorcar'}


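Synset definition check: verify that every synset returned for a word has a non-empty definition.
Each synset is identified by a name of the form 'lemma.pos.nn', for example 'car.n.01'; the test uses that name in the error message when a definition is missing.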

In [18]:
def definitions():
    word = 'car'
    
    # Look up the synsets for the word
    synsets = wordnet.synsets(word)
    
    # Check that the definition of each synset is non-empty
    for synset in synsets:
        assert synset.definition(), f"Definition for synset '{synset.name()}' is empty or None."

definitions()
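
The name format itself can also be checked. The sketch below splits a synset name into lemma, part-of-speech letter, and sense number, and rejects names that do not fit. `VALID_POS` and `parse_synset_name` are hypothetical helpers written for this example, and the sample names are copied from the output shown earlier in this section. With WordNet loaded, the same function could be applied to `synset.name()` for each result of `wordnet.synsets(word)`.

```python
# Part-of-speech letters used in WordNet synset names:
# noun, verb, adjective, satellite adjective, adverb.
VALID_POS = {"n", "v", "a", "s", "r"}

def parse_synset_name(name):
    """Split 'lemma.pos.nn' into (lemma, pos, sense_number).

    Raises ValueError if the name does not follow that shape.
    """
    # Split from the right so that a lemma containing dots stays in one piece.
    parts = name.rsplit(".", 2)
    if len(parts) != 3:
        raise ValueError(f"Not a synset name: {name!r}")
    lemma, pos, sense = parts
    if not lemma or pos not in VALID_POS:
        raise ValueError(f"Not a synset name: {name!r}")
    if len(sense) < 2 or not sense.isdigit():
        raise ValueError(f"Not a synset name: {name!r}")
    return lemma, pos, int(sense)

# Names taken from the notebook output above.
for name in ["car.n.01", "car.n.04", "cable_car.n.01"]:
    print(parse_synset_name(name))
```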

3. treebank

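treebank is a corpus of annotated English sentences (NLTK ships a sample of the Penn Treebank) used to train and evaluate part-of-speech taggers and other syntactic analysis tools.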

In [19]:
import nltk
nltk.download('treebank')

[nltk_data] Downloading package treebank to
[nltk_data]     C:\Users\31542\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\treebank.zip.


True

In [20]:
from nltk.corpus import treebank

# Get the sentences
sentences = treebank.sents()
print(sentences[0])

['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']


Test cases

Basic sentence retrieval: verify that sentences can be read from treebank.

In [21]:
def sentence():
    # Get the sentences
    sentences = treebank.sents()
    
    # Expect at least one sentence
    assert len(sentences) > 0, "Expected to find sentences in treebank, but found none."
    
    # Print the sentences that were found
    print("Sample sentences from treebank:")
    for index, tokens in enumerate(sentences[:5]):  # print the first five sentences as a sample
        print(f"{index + 1}. {' '.join(tokens)}")

# Run the test and print the samples
sentence()

Sample sentences from treebank:
1. Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
2. Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
3. Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named *-1 a nonexecutive director of this British industrial conglomerate .
4. A form of asbestos once used * * to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed * to it more than 30 years ago , researchers reported 0 *T*-1 .
5. The asbestos fiber , crocidolite , is unusually resilient once it enters the lungs , with even brief exposures to it causing symptoms that *T*-1 show up decades later , researchers said 0 *T*-2 .
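
Tokens such as `*-1`, `*T*-1`, and `0` in the sentences above are not words from the original text. They are empty elements introduced by the Penn Treebank annotation. In the tagged view of this corpus they carry the tag `-NONE-`, which gives a simple way to filter them out. The sketch below works on a hand-written list of (word, tag) pairs for part of sentence 3, so it runs without the corpus; the tags in `sample` are illustrative, and `drop_empty_elements` is a hypothetical helper. With the corpus downloaded, `treebank.tagged_sents()` should supply pairs of the same shape.

```python
def drop_empty_elements(tagged_sentence):
    """Return only the words whose tag is not the empty-element tag '-NONE-'."""
    return [word for word, tag in tagged_sentence if tag != "-NONE-"]

# Hand-written (word, tag) pairs for a fragment of sentence 3 above.
sample = [
    ("was", "VBD"),
    ("named", "VBN"),
    ("*-1", "-NONE-"),
    ("a", "DT"),
    ("nonexecutive", "JJ"),
    ("director", "NN"),
]

print(" ".join(drop_empty_elements(sample)))
```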


Sentence format check: verify that the retrieved sentences have the expected structure, that is, each sentence is a list of word strings.

In [22]:
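def check_treebank_format():  # renamed so it does not shadow the built-in format()
    # Get the sentences
    sentences = treebank.sents()
    
    # Check the format of every sentence
    for sentence in sentences:
        assert isinstance(sentence, list) and all(isinstance(word, str) for word in sentence), \
            f"Invalid format for sentence: {sentence}"

check_treebank_format()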

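Sentence count check: verify that the number of retrieved sentences meets expectations, for example that at least a given number of sentences is available.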

In [23]:
def count():
    expected_sentence_count = 500  # assume at least 500 sentences are available
    
    # Get the sentences
    sentences = treebank.sents()
    
    # Check that the number of sentences meets the expectation
    assert len(sentences) >= expected_sentence_count, \
        f"Expected at least {expected_sentence_count} sentences, but found {len(sentences)}."

count()

4. brown

brown is the Brown Corpus, a classic English corpus whose texts are grouped by genre (news, editorial, fiction, and so on).

In [24]:
import nltk
nltk.download('brown')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\31542\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


True

In [25]:

from nltk.corpus import brown

# Download the 'brown' corpus if it is not already present
nltk.download('brown')

# Get the available genres
genres = brown.categories()
print("Available genres:", genres)

# Choose a genre
selected_genre = 'news'

# Get the sentences of the selected genre
sentences = brown.sents(categories=selected_genre)

# Print the first 5 sentences
print(f"\nSample sentences from '{selected_genre}' genre:")
for index, sentence in enumerate(sentences[:5], 1):
    print(f"Sentence {index}: {' '.join(sentence)}")


Available genres: ['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

Sample sentences from 'news' genre:
Sentence 1: The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
Sentence 2: The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .
Sentence 3: The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to investigate reports of possible `` irregularities '' in the hard-fought primary which was won by Mayor-nominate Ivan Allen Jr. .
Sentence 4: `` Only a relative handful of such reports was received '' , the jury said , `` considering t

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\31542\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


Test cases

Basic genre retrieval: verify that the list of genres can be read from the brown corpus.

In [26]:
def retrieval():
    # Get the genres
    genres = brown.categories()
    
    # Expect at least one genre
    assert len(genres) > 0, "Expected to find genres in brown corpus, but found none."
    
    # Print the genres that were found
    print("Genres found in Brown corpus:")
    for genre in genres:
        print(genre)

# Run the test and print the genres
retrieval()

Genres found in Brown corpus:
adventure
belles_lettres
editorial
fiction
government
hobbies
humor
learned
lore
mystery
news
religion
reviews
romance
science_fiction


Sentence format check: verify that the retrieved sentences have the expected structure, that is, each sentence is a list of word strings.


In [27]:
def check_brown_format():  # renamed so it does not shadow the built-in format()
    # Choose a genre
    selected_genre = 'news'
    
    # Get the sentences of the selected genre
    sentences = brown.sents(categories=selected_genre)
    
    # Check the format of the first five sentences
    for sentence in sentences[:5]:
        assert isinstance(sentence, list) and all(isinstance(word, str) for word in sentence), \
            f"Invalid format for sentence: {sentence}"

check_brown_format()

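Sentence count check: verify that the number of sentences in the selected genre meets expectations, for example that at least a given number of sentences is available.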

In [28]:
def count():
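    # Choose a genre
    selected_genre = 'news'
    
    # Get the sentences of the selected genre
    sentences = brown.sents(categories=selected_genre)
    
    expected_sentence_count = 500  # assume at least 500 sentences are available
    
    # Check that the number of sentences meets the expectation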
    assert len(sentences) >= expected_sentence_count, \
        f"Expected at least {expected_sentence_count} sentences, but found {len(sentences)}."


count()
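
The genre list itself can be checked in a similar way. This is a minimal sketch, assuming the 15 genres printed in the output earlier in this section are the expected set. `EXPECTED_GENRES` is copied from that output and `check_genres` is a hypothetical helper; in the notebook you would call it as `check_genres(brown.categories())`.

```python
# Genre names copied from the notebook output above.
EXPECTED_GENRES = [
    "adventure", "belles_lettres", "editorial", "fiction", "government",
    "hobbies", "humor", "learned", "lore", "mystery", "news", "religion",
    "reviews", "romance", "science_fiction",
]

def check_genres(found, expected=EXPECTED_GENRES):
    """Assert that every expected genre is present and that the counts match."""
    missing = sorted(set(expected) - set(found))
    assert not missing, f"Missing genres: {missing}"
    assert len(found) == len(expected), \
        f"Expected {len(expected)} genres, but found {len(found)}."
    return True

# Stand-in for brown.categories() so the example runs without the corpus.
print(check_genres(list(EXPECTED_GENRES)))
```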

5. gutenberg

gutenberg is a corpus containing a selection of texts from Project Gutenberg, covering a range of public-domain literary works.

In [29]:
import nltk
nltk.download('gutenberg')


[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\31542\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


True

In [30]:
from nltk.corpus import gutenberg

# Get the raw text
text = gutenberg.raw('shakespeare-hamlet.txt')
print(text[:500])


[The Tragedie of Hamlet by William Shakespeare 1599]


Actus Primus. Scoena Prima.

Enter Barnardo and Francisco two Centinels.

  Barnardo. Who's there?
  Fran. Nay answer me: Stand & vnfold
your selfe

   Bar. Long liue the King

   Fran. Barnardo?
  Bar. He

   Fran. You come most carefully vpon your houre

   Bar. 'Tis now strook twelue, get thee to bed Francisco

   Fran. For this releefe much thankes: 'Tis bitter cold,
And I am sicke at heart

   Barn. Haue you had quiet Guard?
  Fran. Not


Test cases

Basic text retrieval: verify that the text of a given file can be read from the gutenberg corpus.

In [31]:
def retrieval():
    # Get the raw text
    text = gutenberg.raw('shakespeare-hamlet.txt')
    
    # Expect the text length to be greater than 0
    assert len(text) > 0, "Expected to retrieve text from gutenberg corpus, but found empty."

retrieval()

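Content check: verify that the retrieved text begins with the expected fragment. The corpus file keeps the original spelling and headings (it opens with "[The Tragedie of Hamlet by William Shakespeare 1599]" followed by "Actus Primus. Scoena Prima.", as the output above shows), so a modernized heading such as "ACT I. SCENE I." does not occur in it and cannot serve as the expected text.

In [32]:
def content():
    expected_start_text = "[The Tragedie of Hamlet by William Shakespeare 1599]"
    
    # Get the raw text
    text = gutenberg.raw('shakespeare-hamlet.txt')
    
    # Check that the text starts with the expected fragment (ignoring leading whitespace)
    assert text.lstrip().startswith(expected_start_text), \
        f"Expected text to start with '{expected_start_text}', but it starts with '{text[:60]}'."

content()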

Length check: verify that the length of the retrieved text meets expectations, for example that it is at least a given minimum length.

In [33]:
def length():
    min_expected_length = 1000  # assume the text is at least 1000 characters long
    
    # Get the raw text
    text = gutenberg.raw('shakespeare-hamlet.txt')
    
    # Check the length
    assert len(text) >= min_expected_length, \
        f"Expected text length to be at least {min_expected_length}, but found {len(text)}."

length()
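
The three gutenberg checks above can also be combined into one helper that reports every problem instead of stopping at the first failed assertion. This is a sketch under stated assumptions: `check_text` is a hypothetical function, and `excerpt` is an abridged copy of the opening lines shown in the output earlier, used here so the example runs without the corpus. In the notebook you would pass `gutenberg.raw('shakespeare-hamlet.txt')` as the text.

```python
def check_text(text, expected_start, min_length):
    """Return a list of problems found in text; an empty list means all checks passed."""
    problems = []
    if not text:
        problems.append("text is empty")
    # Leading whitespace is ignored when comparing the start of the text.
    if not text.lstrip().startswith(expected_start):
        problems.append(f"text does not start with {expected_start!r}")
    if len(text) < min_length:
        problems.append(f"length {len(text)} is below {min_length}")
    return problems

# Abridged from the first lines of the notebook output above.
excerpt = (
    "[The Tragedie of Hamlet by William Shakespeare 1599]\n\n\n"
    "Actus Primus. Scoena Prima.\n"
)

# The original-spelling title matches the start of the excerpt.
print(check_text(excerpt, "[The Tragedie of Hamlet", 50))

# A modernized heading is reported as a problem.
print(check_text(excerpt, "ACT I. SCENE I.", 50))
```

Returning a list rather than asserting makes it easier to see all mismatches at once; wrapping the call in `assert not check_text(...)` restores the fail-fast behavior of the other tests.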