Part 3

The nltk.tokenize module in NLTK

The nltk.tokenize module is dedicated to tokenization: the process of splitting text into words, sentences, or other tokens.

1. word_tokenize

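The word_tokenize function splits a string into a list of word tokens. Tokenization breaks a continuous stretch of text into smaller units that later analysis and processing steps can work with.

The punkt model must be downloaded before first use. Newer NLTK releases may ask for the punkt_tab resource instead.

Punkt is a sentence-boundary detector trained with unsupervised learning. word_tokenize uses it to split the text into sentences before tokenizing each sentence into words.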

In [14]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Honjoutx\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

Test cases

In [21]:
cases = [
    {
        "text": "The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday. "
    },
    {
        "text": "China and Russia opposed the meeting and called for a procedural vote by the 15 members, which was defeated. "
    },
    {
        "text": "One consequence is that divided families are even more divided. No departures means no reunification with families abroad."
    }
]

In [3]:
for case in cases:
    text = case["text"]
    result = word_tokenize(text)
    print(result)

['The', 'international', 'community', 'must', 'continue', 'to', 'pay', 'close', 'attention', 'to', 'the', 'troubling', 'human', 'rights', 'situation', 'in', 'the', 'Democratic', 'People', '’', 's', 'Republic', 'of', 'Korea', '(', 'DPRK', ')', 'and', 'find', 'ways', 'to', 'revive', 'dialogue', 'with', 'the', 'Government', ',', 'the', 'UN', 'Security', 'Council', 'heard', 'on', 'Wednesday', '.']
['China', 'and', 'Russia', 'opposed', 'the', 'meeting', 'and', 'called', 'for', 'a', 'procedural', 'vote', 'by', 'the', '15', 'members', ',', 'which', 'was', 'defeated', '.']
['One', 'consequence', 'is', 'that', 'divided', 'families', 'are', 'even', 'more', 'divided', '.', 'No', 'departures', 'means', 'no', 'reunification', 'with', 'families', 'abroad', '.']


In [18]:
from collections import Counter
for case in cases:
    text = case["text"]
    result = word_tokenize(text)
    word_freq = Counter(result)
    print(word_freq)


Counter({'the': 4, 'to': 3, 'The': 1, 'international': 1, 'community': 1, 'must': 1, 'continue': 1, 'pay': 1, 'close': 1, 'attention': 1, 'troubling': 1, 'human': 1, 'rights': 1, 'situation': 1, 'in': 1, 'Democratic': 1, 'People': 1, '’': 1, 's': 1, 'Republic': 1, 'of': 1, 'Korea': 1, '(': 1, 'DPRK': 1, ')': 1, 'and': 1, 'find': 1, 'ways': 1, 'revive': 1, 'dialogue': 1, 'with': 1, 'Government': 1, ',': 1, 'UN': 1, 'Security': 1, 'Council': 1, 'heard': 1, 'on': 1, 'Wednesday': 1, '.': 1})
Counter({'and': 2, 'the': 2, 'China': 1, 'Russia': 1, 'opposed': 1, 'meeting': 1, 'called': 1, 'for': 1, 'a': 1, 'procedural': 1, 'vote': 1, 'by': 1, '15': 1, 'members': 1, ',': 1, 'which': 1, 'was': 1, 'defeated': 1, '.': 1})
Counter({'divided': 2, 'families': 2, '.': 2, 'One': 1, 'consequence': 1, 'is': 1, 'that': 1, 'are': 1, 'even': 1, 'more': 1, 'No': 1, 'departures': 1, 'means': 1, 'no': 1, 'reunification': 1, 'with': 1, 'abroad': 1})
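In the counts above, 'The' and 'the' are separate keys because Counter is case-sensitive. A minimal sketch of folding case before counting follows; the token list is hard-coded from the start of the first word_tokenize output so the snippet runs without NLTK, and the name folded_freq is only illustrative.

```python
from collections import Counter

# First tokens of the first test case, copied from the word_tokenize output above.
tokens = ['The', 'international', 'community', 'must', 'continue', 'to', 'pay',
          'close', 'attention', 'to', 'the', 'troubling', 'human', 'rights',
          'situation', 'in', 'the', 'Democratic', 'People']

# Lowercase every token so that 'The' and 'the' share one key.
folded_freq = Counter(token.lower() for token in tokens)

print(folded_freq.most_common(2))
```

Whether to fold case depends on the task: it merges sentence-initial words with their lowercase forms, but it also merges proper nouns with common words.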


In [19]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Honjoutx\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [23]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from collections import Counter

# Build the stop-word set once instead of once per case.
stop_words = set(stopwords.words('english'))
for case in cases:
    text = case["text"]
    result = word_tokenize(text)
    # Keep alphanumeric tokens that are not stop words.
    filtered = [word for word in result if word.lower() not in stop_words and word.isalnum()]
    freq = Counter(filtered)

# These two lines run after the loop, so they report only the last case.
keywords = freq.most_common(5)
print(f"Keywords: {keywords}")



Keywords: [('divided', 2), ('families', 2), ('One', 1), ('consequence', 1), ('even', 1)]
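Only the keywords of the last case appear above, because the extraction runs once after the loop. One possible way to report keywords for every case is sketched below. The helper top_keywords and the small STOP_WORDS set are assumptions made for illustration: the set stands in for stopwords.words('english') so the snippet runs without downloads, and the token lists are copied from the word_tokenize output earlier.

```python
from collections import Counter

# Small stand-in for stopwords.words('english'); an assumption, not the NLTK list.
STOP_WORDS = {"the", "and", "for", "a", "by", "which", "was",
              "is", "that", "are", "more", "no", "with"}

def top_keywords(tokens, stop_words, n=5):
    """Return the n most frequent alphanumeric tokens that are not stop words."""
    filtered = [t for t in tokens if t.isalnum() and t.lower() not in stop_words]
    return Counter(filtered).most_common(n)

# Token lists copied from the word_tokenize output for the second and third case.
tokenized_cases = [
    ['China', 'and', 'Russia', 'opposed', 'the', 'meeting', 'and', 'called',
     'for', 'a', 'procedural', 'vote', 'by', 'the', '15', 'members', ',',
     'which', 'was', 'defeated', '.'],
    ['One', 'consequence', 'is', 'that', 'divided', 'families', 'are', 'even',
     'more', 'divided', '.', 'No', 'departures', 'means', 'no', 'reunification',
     'with', 'families', 'abroad', '.'],
]

# One keyword list per case, computed inside the comprehension.
keywords_per_case = [top_keywords(tokens, STOP_WORDS) for tokens in tokenized_cases]
for i, keywords in enumerate(keywords_per_case, start=1):
    print(f"Case {i} keywords: {keywords}")
```

With the real NLTK stop-word list, passing set(stopwords.words('english')) as the second argument should work the same way.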


2. sent_tokenize

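sent_tokenize splits text into a list of sentences. Sentence segmentation breaks a continuous stretch of text into individual sentences that later analysis and processing steps can work with.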

In [4]:
from nltk.tokenize import sent_tokenize

Test cases

In [5]:
cases = [
    {
        "text": "The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday. "
    },
    {
        "text": "Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010. Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him."
    },
    {
        "text": "He too welcomed the OECD figures announced on Wednesday and said there is now an opportunity to consider what the transition to renewable energy really means for SIDS. It amounts to economic transformation, he said."
    }
]


In [6]:
for case in cases:
    text = case["text"]
    result = sent_tokenize(text)
    print(result)

['The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday.']
['Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010.', 'Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him.']
['He too welcomed the OECD figures announced on Wednesday and said there is now an opportunity to consider what the transition to renewable energy really means for SIDS.', 'It amounts to economic transformation, he said.']
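Note that sent_tokenize did not break the second case after "Mr.". For contrast, here is a rough sketch of a naive splitter that ends a sentence at every '.', '!' or '?' followed by whitespace. The regular expression is an illustrative assumption, not what Punkt does, and the text is a shortened version of the second test case.

```python
import re

text = ("Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 "
        "when he left to study in Beijing in 2010. Using the internet, he said "
        "he learned about his homeland.")

# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
naive_sentences = re.split(r'(?<=[.!?])\s+', text)

for sentence in naive_sentences:
    print(repr(sentence))
```

The naive rule yields three pieces and cuts the text after "Mr.", which is the kind of abbreviation handling that Punkt learns from data.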


In [24]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Honjoutx\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.


True

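Part-of-speech tagging for the words of each sentence

The output below was produced while cases still held the word_tokenize test cases (cell In [21]), and the sentence counter restarts at 1 for every case.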

In [26]:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag

for case in cases:
    text = case["text"]
    result = sent_tokenize(text)
    tagged_sentences = [pos_tag(word_tokenize(sentence)) for sentence in result]
    for i, tagged_sentence in enumerate(tagged_sentences, start=1):
        print(f"Sentence {i}: {tagged_sentence}\n")


Sentence 1: [('The', 'DT'), ('international', 'JJ'), ('community', 'NN'), ('must', 'MD'), ('continue', 'VB'), ('to', 'TO'), ('pay', 'VB'), ('close', 'JJ'), ('attention', 'NN'), ('to', 'TO'), ('the', 'DT'), ('troubling', 'VBG'), ('human', 'JJ'), ('rights', 'NNS'), ('situation', 'NN'), ('in', 'IN'), ('the', 'DT'), ('Democratic', 'JJ'), ('People', 'NNP'), ('’', 'NNP'), ('s', 'VBZ'), ('Republic', 'NNP'), ('of', 'IN'), ('Korea', 'NNP'), ('(', '('), ('DPRK', 'NNP'), (')', ')'), ('and', 'CC'), ('find', 'VBP'), ('ways', 'NNS'), ('to', 'TO'), ('revive', 'VB'), ('dialogue', 'NN'), ('with', 'IN'), ('the', 'DT'), ('Government', 'NNP'), (',', ','), ('the', 'DT'), ('UN', 'NNP'), ('Security', 'NNP'), ('Council', 'NNP'), ('heard', 'NN'), ('on', 'IN'), ('Wednesday', 'NNP'), ('.', '.')]

Sentence 1: [('China', 'NNP'), ('and', 'CC'), ('Russia', 'NNP'), ('opposed', 'VBD'), ('the', 'DT'), ('meeting', 'NN'), ('and', 'CC'), ('called', 'VBD'), ('for', 'IN'), ('a', 'DT'), ('procedural', 'JJ'), ('vote', 'NN'), 
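A common next step after tagging is to filter tokens by tag. The sketch below works on a hard-coded list of (word, tag) pairs copied from the visible part of the output above, so it runs without the tagger model; the variable names are only illustrative.

```python
# (word, tag) pairs copied from the visible part of the pos_tag output above.
tagged = [('China', 'NNP'), ('and', 'CC'), ('Russia', 'NNP'), ('opposed', 'VBD'),
          ('the', 'DT'), ('meeting', 'NN'), ('and', 'CC'), ('called', 'VBD'),
          ('for', 'IN'), ('a', 'DT'), ('procedural', 'JJ'), ('vote', 'NN')]

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS).
nouns = [word for word, tag in tagged if tag.startswith('NN')]
# Proper nouns only.
proper_nouns = [word for word, tag in tagged if tag == 'NNP']

print("Nouns:", nouns)
print("Proper nouns:", proper_nouns)
```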

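Sentence-level spelling correction

This example uses the third-party autocorrect package. As with the previous cell, the recorded output comes from the word_tokenize test cases. Note that the speller also rewrites acronyms: DPRK becomes PRK and UN becomes Up.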

In [27]:
import nltk
from nltk.tokenize import sent_tokenize
from autocorrect import Speller  # third-party package: pip install autocorrect

spell = Speller()
sentences = []
for case in cases:
    text = case["text"]
    result = sent_tokenize(text)
    for sentence in result:
        # Correct each whitespace-separated word, then rebuild the sentence.
        words = sentence.split()
        corrected_words = [spell(word) for word in words]
        corrected_sentence = ' '.join(corrected_words)
        sentences.append(corrected_sentence)
        # Printed inside the loop, so each line shows all sentences corrected so far.
        corrected_text = ' '.join(sentences)
        print(f"corrected: {corrected_text}")



corrected: The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (PRK) and find ways to revive dialogue with the Government, the Up Security Council heard on Wednesday.
corrected: The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (PRK) and find ways to revive dialogue with the Government, the Up Security Council heard on Wednesday. China and Russia opposed the meeting and called for a procedural vote by the 15 members, which was defeated.
corrected: The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (PRK) and find ways to revive dialogue with the Government, the Up Security Council heard on Wednesday. China and Russia opposed the meeting and called for a procedural vote by the 15 members, which w
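In the output above the speller changed the acronyms DPRK and UN. One possible mitigation, sketched here under stated assumptions, is to skip tokens that are entirely uppercase. The helper correct_sentence and the toy_speller function are hypothetical: toy_speller is a tiny dictionary lookup that imitates the miscorrections seen above so the snippet runs without the autocorrect package.

```python
def correct_sentence(sentence, correct_word):
    """Correct each word, but leave all-uppercase tokens (likely acronyms) alone."""
    corrected = []
    for word in sentence.split():
        # Ignore surrounding punctuation when checking for an acronym.
        core = word.strip('()[]{},.;:!?"\'')
        if len(core) > 1 and core.isupper():
            corrected.append(word)
        else:
            corrected.append(correct_word(word))
    return ' '.join(corrected)

# Hypothetical stand-in for a real speller; it imitates the miscorrections above.
TOY_FIXES = {"teh": "the", "UN": "Up", "(DPRK)": "(PRK)"}

def toy_speller(word):
    return TOY_FIXES.get(word, word)

sentence = "teh UN Security Council discussed Korea (DPRK)"
unprotected = ' '.join(toy_speller(word) for word in sentence.split())
protected = correct_sentence(sentence, toy_speller)

print(unprotected)
print(protected)
```

With autocorrect installed, a Speller() instance could be passed as correct_word, since the cell above calls it the same way on single words.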

3. RegexpTokenizer

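RegexpTokenizer tokenizes text with a regular expression. Unlike word_tokenize and sent_tokenize, it lets the user define the tokenization rules with a custom regular expression, which allows more flexible and precise tokenization.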

In [29]:
from nltk.tokenize import RegexpTokenizer

Test cases

In [7]:
cases = [
    {
        "text": "The international community must continue to pay close attention to the troubling human rights situation in the Democratic People’s Republic of Korea (DPRK) and find ways to revive dialogue with the Government, the UN Security Council heard on Wednesday. "
    },
    {
        "text": "Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010. Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him."
    },
    {
        "text": "He too welcomed the OECD figures announced on Wednesday and said there is now an opportunity to consider what the transition to renewable energy really means for SIDS. It amounts to economic transformation, he said."
    }
]

Define a regular expression that keeps only words (runs of letters, digits, and underscores)

In [15]:
tokenizer = RegexpTokenizer(r'\w+')

In [16]:
for case in cases:
    text = case["text"]
    tokens = tokenizer.tokenize(text)
    print(tokens)

['The', 'international', 'community', 'must', 'continue', 'to', 'pay', 'close', 'attention', 'to', 'the', 'troubling', 'human', 'rights', 'situation', 'in', 'the', 'Democratic', 'People', 's', 'Republic', 'of', 'Korea', 'DPRK', 'and', 'find', 'ways', 'to', 'revive', 'dialogue', 'with', 'the', 'Government', 'the', 'UN', 'Security', 'Council', 'heard', 'on', 'Wednesday']
['Born', 'to', 'a', 'leading', 'family', 'in', 'the', 'capital', 'Pyongyang', 'Mr', 'Kim', 'was', '19', 'when', 'he', 'left', 'to', 'study', 'in', 'Beijing', 'in', '2010', 'Using', 'the', 'internet', 'he', 'said', 'he', 'learned', 'about', 'his', 'homeland', 'and', 'the', 'horrific', 'truth', 'previously', 'hidden', 'to', 'him']
['He', 'too', 'welcomed', 'the', 'OECD', 'figures', 'announced', 'on', 'Wednesday', 'and', 'said', 'there', 'is', 'now', 'an', 'opportunity', 'to', 'consider', 'what', 'the', 'transition', 'to', 'renewable', 'energy', 'really', 'means', 'for', 'SIDS', 'It', 'amounts', 'to', 'economic', 'transform

Define a regular expression that keeps both words and punctuation marks

In [17]:
tokenizer = RegexpTokenizer(r'\w+|[^\w\s]')

In [18]:
for case in cases:
    text = case["text"]
    tokens = tokenizer.tokenize(text)
    print(tokens)

['The', 'international', 'community', 'must', 'continue', 'to', 'pay', 'close', 'attention', 'to', 'the', 'troubling', 'human', 'rights', 'situation', 'in', 'the', 'Democratic', 'People', '’', 's', 'Republic', 'of', 'Korea', '(', 'DPRK', ')', 'and', 'find', 'ways', 'to', 'revive', 'dialogue', 'with', 'the', 'Government', ',', 'the', 'UN', 'Security', 'Council', 'heard', 'on', 'Wednesday', '.']
['Born', 'to', 'a', 'leading', 'family', 'in', 'the', 'capital', ',', 'Pyongyang', ',', 'Mr', '.', 'Kim', 'was', '19', 'when', 'he', 'left', 'to', 'study', 'in', 'Beijing', 'in', '2010', '.', 'Using', 'the', 'internet', ',', 'he', 'said', 'he', 'learned', 'about', 'his', 'homeland', 'and', '“', 'the', 'horrific', 'truth', '”', 'previously', 'hidden', 'to', 'him', '.']
['He', 'too', 'welcomed', 'the', 'OECD', 'figures', 'announced', 'on', 'Wednesday', 'and', 'said', 'there', 'is', 'now', 'an', 'opportunity', 'to', 'consider', 'what', 'the', 'transition', 'to', 'renewable', 'energy', 'really', 'mea

In [33]:
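# Extract email addresses from a string.
text = "Contact us at support@example.com, sales@example.com, or info@example.com."

# [A-Za-z]{2,} matches the top-level domain; a '|' inside a character class would match a literal pipe.
tokenizer = RegexpTokenizer(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b')
tokens = tokenizer.tokenize(text)

print(tokens)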


['support@example.com', 'sales@example.com', 'info@example.com']
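The patterns so far describe the tokens themselves. RegexpTokenizer also accepts gaps=True, in which case the pattern describes the separators between tokens. A short sketch comparing the two modes on a sentence adapted from the test cases:

```python
from nltk.tokenize import RegexpTokenizer

text = "Mr. Kim was 19 when he left to study in Beijing in 2010."

# Default mode: the pattern matches the tokens.
word_tokenizer = RegexpTokenizer(r'\w+')
# gaps=True: the pattern matches the separators, here runs of whitespace.
gap_tokenizer = RegexpTokenizer(r'\s+', gaps=True)

word_tokens = word_tokenizer.tokenize(text)
gap_tokens = gap_tokenizer.tokenize(text)

print(word_tokens)
print(gap_tokens)
```

Splitting on whitespace keeps punctuation attached to the neighbouring word, while the \w+ pattern drops it.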


4. TreebankWordTokenizer


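TreebankWordTokenizer is a tokenizer built on the tokenization conventions of the Penn Treebank. It splits English text into words and punctuation marks and applies specific rules for punctuation, contractions, and hyphens. It assumes the text has already been split into sentences, so in the output of this section only the final period becomes a separate token, while 2010. stays attached.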

In [39]:
from nltk.tokenize import TreebankWordTokenizer

Test cases

In [40]:
cases = [
    {
        "text": "Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010. Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him."
    },
]

Text preprocessing

In [41]:
tokenizer = TreebankWordTokenizer()

In [42]:
text = cases[0]["text"]
tokens = tokenizer.tokenize(text)
print(tokens)

['Born', 'to', 'a', 'leading', 'family', 'in', 'the', 'capital', ',', 'Pyongyang', ',', 'Mr.', 'Kim', 'was', '19', 'when', 'he', 'left', 'to', 'study', 'in', 'Beijing', 'in', '2010.', 'Using', 'the', 'internet', ',', 'he', 'said', 'he', 'learned', 'about', 'his', 'homeland', 'and', '“the', 'horrific', 'truth”', 'previously', 'hidden', 'to', 'him', '.']
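In the output above, '2010.' stays a single token because the tokenizer was given two sentences at once. A sketch of tokenizing one sentence at a time follows; the sentences are hard-coded from the sent_tokenize output earlier in this part so that no model download is needed.

```python
from nltk.tokenize import TreebankWordTokenizer

# Sentences copied from the sent_tokenize output earlier in this part.
sentences = [
    "Born to a leading family in the capital, Pyongyang, Mr. Kim was 19 when he left to study in Beijing in 2010.",
    "Using the internet, he said he learned about his homeland and “the horrific truth” previously hidden to him.",
]

tokenizer = TreebankWordTokenizer()
# Tokenize each sentence separately so every sentence-final period is split off.
tokens_per_sentence = [tokenizer.tokenize(sentence) for sentence in sentences]

for tokens in tokens_per_sentence:
    print(tokens[-3:])
```

In a full pipeline the hard-coded list would normally come from sent_tokenize(text).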


Sentiment analysis

Tokenizing the text first can help improve recognition accuracy.

This additionally requires the nltk.sentiment.vader module.

In [34]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\Honjoutx\AppData\Roaming\nltk_data...


True

In [43]:
analyzer = SentimentIntensityAnalyzer()
target_text = " ".join(tokens)
sentiment = analyzer.polarity_scores(target_text)
print(sentiment)

{'neg': 0.104, 'neu': 0.896, 'pos': 0.0, 'compound': -0.6597}
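The compound value summarizes the other three scores in a single number between -1 and 1. A minimal sketch of turning it into a label is shown below; the helper label_sentiment is hypothetical, and the 0.05 threshold is a commonly used convention that should be treated as an assumption and tuned for the data at hand.

```python
def label_sentiment(scores, threshold=0.05):
    """Map a VADER score dict to a coarse label using its compound value."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Scores copied from the output above.
scores = {'neg': 0.104, 'neu': 0.896, 'pos': 0.0, 'compound': -0.6597}
label = label_sentiment(scores)
print(label)
```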


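Machine translation

As a preprocessing step, the source-language text is split into words, and each word is then translated. googletrans needs network access and this code sends one request per token; the recorded run ended in a connection timeout.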

In [None]:
from googletrans import Translator

In [45]:
tokenizer = TreebankWordTokenizer()
translator = Translator()
text = cases[0]["text"]
tokens = tokenizer.tokenize(text)
translated_tokens = [translator.translate(word, src='en', dest='es').text for word in tokens]
translate = ' '.join(translated_tokens)
print(f"Translated Text: {translate}")

ConnectTimeout: timed out
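Network calls like the one above can fail transiently. One generic way to cope is a small retry wrapper, sketched here with the standard library only. Both call_with_retry and flaky_translate are hypothetical: flaky_translate is a stand-in that fails twice and then returns a fixed string, so the snippet does not contact any translation service.

```python
import time

def call_with_retry(func, *args, retries=3, delay=0.0, exceptions=(Exception,)):
    """Call func(*args), retrying up to `retries` times on the given exceptions."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return func(*args)
        except exceptions as err:
            last_error = err
            # Wait a little longer after each failed attempt.
            time.sleep(delay * attempt)
    raise last_error

attempts = {"count": 0}

def flaky_translate(text):
    """Stand-in for a network translation call: fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("timed out")
    return "hola mundo"

result = call_with_retry(flaky_translate, "hello world")
print(result, attempts["count"])
```

Passing the whole sentence in one call, rather than one call per token, would also reduce the number of requests that can time out.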

5. TweetTokenizer

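TweetTokenizer is a tokenizer designed specifically for tweets. It handles text features that are typical of tweets, such as emoticons, hashtags, user mentions, and URLs.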

In [5]:
from nltk.tokenize import TweetTokenizer

Test cases

In [9]:
cases = [
    "Thanks for the support, @user1! #grateful",
    "Had a great time with @user2 and @user3 yesterday! #friends",
    "Shoutout to @user4 for the amazing work! #appreciation"
]

Hashtag extraction

In [7]:
tokenizer = TweetTokenizer()

In [10]:
tags = []
for tweet in cases:
    tokens = tokenizer.tokenize(tweet)
    tags.extend([token for token in tokens if token.startswith('#')])

print("tags:", tags)

tags: ['#grateful', '#friends', '#appreciation']


User mention analysis

In [12]:
tokenizer = TweetTokenizer()

In [13]:
users = []
for tweet in cases:
    tokens = tokenizer.tokenize(tweet)
    users.extend([token for token in tokens if token.startswith('@')])
    
print("User Mentions:", users)

User Mentions: ['@user1', '@user2', '@user3', '@user4']
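When mentions should be dropped rather than collected, TweetTokenizer offers the strip_handles option, and reduce_len shortens runs of three or more repeated characters to three. A brief sketch on a made-up tweet (the tweet text is an assumption for illustration):

```python
from nltk.tokenize import TweetTokenizer

# Made-up tweet with a mention, an elongated word, and a hashtag.
tweet = "@user1 thaaaaanks for the support!!! #grateful"

default_tokens = TweetTokenizer().tokenize(tweet)
# strip_handles removes @mentions; reduce_len caps repeated characters at three.
clean_tokens = TweetTokenizer(strip_handles=True, reduce_len=True).tokenize(tweet)

print(default_tokens)
print(clean_tokens)
```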


In [50]:
tokenizer = TweetTokenizer()

cases = [
    "Check out the new product at https://example.com! #newproduct @example_user 😊",
    "Visit our website for more details: http://example.org #info",
    "Don't miss out on the sale! https://sale.example.com #discount",
    "More updates at https://blog.example.com #tech @example_user"
]


# URL extraction: str.startswith accepts a tuple of prefixes.
urls = []
for tweet in cases:
    tokens = tokenizer.tokenize(tweet)
    urls.extend([token for token in tokens if token.startswith(('http://', 'https://'))])

print("URLs:", urls)


URLs: ['https://example.com', 'http://example.org', 'https://sale.example.com', 'https://blog.example.com']
