Part 3

The nltk.tokenize Module in NLTK

The nltk.tokenize module handles tokenization: splitting text into words, sentences, or other tokens.

1. word_tokenize

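The word_tokenize function splits a string into a list of word and punctuation tokens. Tokenization breaks continuous text into smaller units so that later analysis and processing steps can work with them.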

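The punkt model must be downloaded before first use. (Recent NLTK releases may request the punkt_tab resource instead; if a LookupError appears, download the resource it names.)

Punkt is a sentence boundary detector trained with unsupervised learning. word_tokenize needs it because it first splits the text into sentences and then tokenizes each sentence into words.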

In [1]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Honjoutx\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Test cases

In [2]:
cases = [
    {
        "text": "The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday. "
    },
    {
        "text": "China and Russia opposed the meeting and called for a procedural vote by the 15 members, which was defeated. "
    },
    {
        "text": "One consequence is that divided families are even more divided. No departures means no reunification with families abroad."
    }
]

In [3]:
for case in cases:
    text = case["text"]
    result = word_tokenize(text)
    print(result)

['The', 'international', 'community', 'must', 'continue', 'to', 'pay', 'close', 'attention', 'to', 'the', 'troubling', 'human', 'rights', 'situation', 'in', 'the', 'Democratic', 'People', '’', 's', 'Republic', 'of', 'Korea', '(', 'DPRK', ')', 'and', 'find', 'ways', 'to', 'revive', 'dialogue', 'with', 'the', 'Government', ',', 'the', 'UN', 'Security', 'Council', 'heard', 'on', 'Wednesday', '.']
['China', 'and', 'Russia', 'opposed', 'the', 'meeting', 'and', 'called', 'for', 'a', 'procedural', 'vote', 'by', 'the', '15', 'members', ',', 'which', 'was', 'defeated', '.']
['One', 'consequence', 'is', 'that', 'divided', 'families', 'are', 'even', 'more', 'divided', '.', 'No', 'departures', 'means', 'no', 'reunification', 'with', 'families', 'abroad', '.']
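
In the first result, the curly apostrophe in People’s comes out as three tokens ('People', '’', 's'). One possible workaround, sketched below, is to map typographic apostrophes to the ASCII apostrophe before tokenizing, assuming the usual possessive handling is wanted (the straight form would typically tokenize as 'People' and "'s"). The helper name normalize_apostrophes and its character table are hypothetical and not part of NLTK.

```python
# Hypothetical pre-processing helper; not an NLTK function.
# Maps typographic single quotes (U+2018, U+2019) to the ASCII apostrophe.
_APOSTROPHES = str.maketrans({"\u2019": "'", "\u2018": "'"})


def normalize_apostrophes(text):
    """Return text with curly single quotes replaced by "'"."""
    return text.translate(_APOSTROPHES)


sample = "the Democratic People\u2019s Republic of Korea (DPRK)"
print(normalize_apostrophes(sample))
```

The normalized string can then be passed to word_tokenize in the same way as the test cases above.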


2. sent_tokenize

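sent_tokenize splits text into a list of sentences, so that each sentence can be analyzed or processed on its own. It relies on the same Punkt model downloaded earlier.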

In [4]:
from nltk.tokenize import sent_tokenize

Test cases

In [5]:
cases = [
    {
        "text": "The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday. "
    },
    {
        "text": "Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010. Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him."
    },
    {
        "text": "He too welcomed the OECD figures announced on Wednesday and said there is now an opportunity to consider what the transition to renewable energy really means for SIDS. It amounts to economic transformation, he said."
    }
]


In [6]:
for case in cases:
    text = case["text"]
    result = sent_tokenize(text)
    print(result)

['The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday.']
['Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010.', 'Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him.']
['He too welcomed the OECD figures announced on Wednesday and said there is now an opportunity to consider what the transition to renewable energy really means for SIDS.', 'It amounts to economic transformation, he said.']
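
The second result keeps "Mr. Kim" inside one sentence even though "Mr." ends with a period. For contrast, here is a minimal sketch of a naive splitter that breaks after every '.', '!' or '?' followed by whitespace. It uses only the standard re module, and the function name naive_sent_split is made up for this illustration.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    There is no abbreviation handling, so 'Mr.' counts as a sentence end.
    """
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())
    return [piece for piece in pieces if piece]


text = (
    "Mr. Kim was 19 when he left to study in Beijing in 2010. "
    "Using the internet, he learned about his homeland."
)
for sentence in naive_sent_split(text):
    print(repr(sentence))
```

The naive version returns three pieces because it treats "Mr." as the end of a sentence, which is the kind of error the Punkt model is meant to avoid.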


3. RegexpTokenizer

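RegexpTokenizer tokenizes text using a regular expression. Unlike word_tokenize and sent_tokenize, it lets the user define the tokenization rule with a custom pattern, which allows more flexible and precise control over what counts as a token.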
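
As a rough mental model, RegexpTokenizer has two modes: by default the pattern describes the tokens themselves, and with gaps=True the pattern describes the separators between tokens. The sketch below imitates both modes with the standard re module. regexp_tokenize_sketch is a hypothetical stand-in, not NLTK's implementation, and it ignores NLTK's other options such as flags.

```python
import re


def regexp_tokenize_sketch(pattern, text, gaps=False):
    """Approximate RegexpTokenizer(pattern, gaps=gaps).tokenize(text) using re."""
    if gaps:
        # The pattern matches separators: split on it and drop empty strings.
        return [tok for tok in re.split(pattern, text) if tok]
    # The pattern matches tokens: return every non-overlapping match.
    return re.findall(pattern, text)


text = "Mr. Kim was 19 when he left."
print(regexp_tokenize_sketch(r"\w+", text))              # tokens are word runs
print(regexp_tokenize_sketch(r"\s+", text, gaps=True))   # tokens are what lies between whitespace
```

In match mode the punctuation disappears because nothing in the pattern matches it; in gap mode it stays attached to the neighboring word.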

In [9]:
from nltk.tokenize import RegexpTokenizer

Test cases

In [7]:
cases = [
    {
        "text": "The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday. "
    },
    {
        "text": "Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010. Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him."
    },
    {
        "text": "He too welcomed the OECD figures announced on Wednesday and said there is now an opportunity to consider what the transition to renewable energy really means for SIDS. It amounts to economic transformation, he said."
    }
]

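Define a regular expression that keeps only word tokens (runs of letters, digits, and underscores); punctuation is dropped.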

In [15]:
tokenizer = RegexpTokenizer(r'\w+')

In [16]:
for case in cases:
    text = case["text"]
    tokens = tokenizer.tokenize(text)
    print(tokens)

['The', 'international', 'community', 'must', 'continue', 'to', 'pay', 'close', 'attention', 'to', 'the', 'troubling', 'human', 'rights', 'situation', 'in', 'the', 'Democratic', 'People', 's', 'Republic', 'of', 'Korea', 'DPRK', 'and', 'find', 'ways', 'to', 'revive', 'dialogue', 'with', 'the', 'Government', 'the', 'UN', 'Security', 'Council', 'heard', 'on', 'Wednesday']
['Born', 'to', 'a', 'leading', 'family', 'in', 'the', 'capital', 'Pyongyang', 'Mr', 'Kim', 'was', '19', 'when', 'he', 'left', 'to', 'study', 'in', 'Beijing', 'in', '2010', 'Using', 'the', 'internet', 'he', 'said', 'he', 'learned', 'about', 'his', 'homeland', 'and', 'the', 'horrific', 'truth', 'previously', 'hidden', 'to', 'him']
['He', 'too', 'welcomed', 'the', 'OECD', 'figures', 'announced', 'on', 'Wednesday', 'and', 'said', 'there', 'is', 'now', 'an', 'opportunity', 'to', 'consider', 'what', 'the', 'transition', 'to', 'renewable', 'energy', 'really', 'means', 'for', 'SIDS', 'It', 'amounts', 'to', 'economic', 'transform

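Define a regular expression that keeps both words and punctuation marks, with each punctuation character as its own token.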

In [17]:
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]')

In [18]:
for case in cases:
    text = case["text"]
    tokens = tokenizer.tokenize(text)
    print(tokens)

['The', 'international', 'community', 'must', 'continue', 'to', 'pay', 'close', 'attention', 'to', 'the', 'troubling', 'human', 'rights', 'situation', 'in', 'the', 'Democratic', 'People', '’', 's', 'Republic', 'of', 'Korea', '(', 'DPRK', ')', 'and', 'find', 'ways', 'to', 'revive', 'dialogue', 'with', 'the', 'Government', ',', 'the', 'UN', 'Security', 'Council', 'heard', 'on', 'Wednesday', '.']
['Born', 'to', 'a', 'leading', 'family', 'in', 'the', 'capital', ',', 'Pyongyang', ',', 'Mr', '.', 'Kim', 'was', '19', 'when', 'he', 'left', 'to', 'study', 'in', 'Beijing', 'in', '2010', '.', 'Using', 'the', 'internet', ',', 'he', 'said', 'he', 'learned', 'about', 'his', 'homeland', 'and', '“', 'the', 'horrific', 'truth', '”', 'previously', 'hidden', 'to', 'him', '.']
['He', 'too', 'welcomed', 'the', 'OECD', 'figures', 'announced', 'on', 'Wednesday', 'and', 'said', 'there', 'is', 'now', 'an', 'opportunity', 'to', 'consider', 'what', 'the', 'transition', 'to', 'renewable', 'energy', 'really', 'mea
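
Both patterns above split People’s into separate pieces. If possessives and contractions should stay whole, one option is to allow an apostrophe followed by more word characters inside a token. The pattern below is an illustrative assumption, not something taken from NLTK. It is checked here with re.findall, which should behave like RegexpTokenizer in its default mode as long as the pattern has no capturing groups, so the same string could be passed to RegexpTokenizer.

```python
import re

# Illustrative pattern (an assumption, not from NLTK):
#   \w+(?:['’]\w+)*   a word, optionally continued by apostrophe + word characters
#   [^\w\s]           otherwise, any single punctuation character
PATTERN = r"\w+(?:['’]\w+)*|[^\w\s]"

text = "the Democratic People’s Republic of Korea (DPRK)."
print(re.findall(PATTERN, text))
```

The group is written as non-capturing, (?:...), so that findall returns whole matches rather than group contents.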

4. TreebankWordTokenizer

