## Step 0: Latent Dirichlet Allocation ##

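LDA is used to classify the text in a document into particular topics. It builds a topic-per-document model and a words-per-topic model, both modeled as Dirichlet distributions.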

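* Each document is modeled as a multinomial distribution over topics, and each topic is modeled as a multinomial distribution over words.
* LDA assumes that every chunk of text we feed into it contains words that are somehow related. Choosing the right corpus is therefore crucial.
* It also assumes that documents are produced from a mixture of topics. Those topics then generate words according to their probability distributions.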
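The generative story described above can be sketched with the standard library alone. This is a toy illustration, not how gensim trains a model: the vocabulary, the prior values and the helper names `sample_dirichlet` and `generate_document` are assumptions made up for this example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, doc_alpha, vocab, length, rng):
    """Generate one document: pick a topic per word, then a word from that topic."""
    # Per-document topic mixture (theta), drawn from a Dirichlet prior.
    theta = sample_dirichlet(doc_alpha, rng)
    words = []
    for _ in range(length):
        # Choose a topic for this word position according to theta.
        topic = rng.choices(range(len(theta)), weights=theta)[0]
        # Choose a word according to that topic's word distribution.
        words.append(rng.choices(vocab, weights=topic_word[topic])[0])
    return theta, words

rng = random.Random(0)
vocab = ["rain", "bushfir", "polic", "arrest", "govt", "fund"]  # toy vocabulary (assumption)
num_topics = 3
# One word distribution per topic, each drawn from a symmetric Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, doc = generate_document(topic_word, [0.5] * num_topics, vocab, 8, rng)
print(theta)
print(doc)
```

Training LDA runs this story in reverse: given only the observed words, it estimates the topic mixtures and word distributions that most plausibly generated them.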

## Step 1: Load the dataset

The dataset we'll use is a list of over one million news headlines published over a period of 15 years. We'll start by loading it from the `abcnews-date-text.csv` file.

In [1]:
'''
Load the dataset from the CSV and save it to 'data_text'
'''
import pandas as pd
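# NOTE: `error_bad_lines` was deprecated in pandas 1.3 and removed in 2.0;
# `on_bad_lines='skip'` is the equivalent on newer versions.
data = pd.read_csv('abcnews-date-text.csv', on_bad_lines='skip')
# We only need the Headlines text column from the data
# (.copy() avoids a SettingWithCopyWarning when the 'index' column is added)
data_text = data[:300000][['headline_text']].copy()
data_text['index'] = data_text.index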

documents = data_text

Let's take a look at the dataset:

In [2]:
'''
Get the total number of documents
'''
print(len(documents))

300000


In [3]:
documents[:5]

| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |


## Step 2: Data preprocessing ##

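We will perform the following steps:

* **Tokenization**: split the text into sentences and the sentences into words. Lowercase the words and remove punctuation.
* Words with 3 or fewer characters are removed (the code keeps only tokens with `len(token) > 3`).
* All **stopwords** are removed.
* Words are **lemmatized**: words in the third person are changed to the first person, and verbs in past and future tenses are changed into the present tense.
* Words are **stemmed**: words are reduced to their root form.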
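To make the filtering part of this list concrete, here is a minimal standard-library sketch. It covers only tokenization, stopword removal and the length filter, and deliberately leaves out lemmatization and stemming. The stopword set `TOY_STOPWORDS` and the function `toy_preprocess` are hypothetical stand-ins for this illustration, not gensim's `STOPWORDS` or `simple_preprocess`.

```python
import re

# Hypothetical, tiny stopword set for illustration only.
TOY_STOPWORDS = {"in", "for", "to", "of", "be", "must", "the", "and"}

def toy_preprocess(text, min_len=4):
    """Lowercase, tokenize on runs of letters, drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # min_len=4 mirrors the `len(token) > 3` check used in preprocess().
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) >= min_len]

print(toy_preprocess("air nz staff in aust strike for pay rise"))
```

Short tokens such as "air", "nz" and "pay" are dropped by the length filter alone, even though they are not stopwords.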

In [4]:
'''
Loading Gensim and nltk libraries
'''
# pip install gensim
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import numpy as np
np.random.seed(400)

In [5]:
import nltk
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/brentweiliu/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

### Lemmatizer example
Before preprocessing the dataset, let's look at a lemmatizing example. What is the output if we lemmatize the word "went"?

In [6]:
print(WordNetLemmatizer().lemmatize('went', pos = 'v')) # past tense to present tense

go


### Stemmer example
Let's also look at a stemming example. We pass a number of words to the stemmer and see how it deals with each one:

In [7]:
stemmer = SnowballStemmer("english")
original_words = ['caresses', 'flies', 'dies', 'mules', 'denied','died', 'agreed', 'owned', 
           'humbled', 'sized','meeting', 'stating', 'siezing', 'itemization','sensational', 
           'traditional', 'reference', 'colonizer','plotted']
singles = [stemmer.stem(plural) for plural in original_words]

pd.DataFrame(data={'original word':original_words, 'stemmed':singles })

| | original word | stemmed |
|---|---|---|
| 0 | caresses | caress |
| 1 | flies | fli |
| 2 | dies | die |
| 3 | mules | mule |
| 4 | denied | deni |
| 5 | died | die |
| 6 | agreed | agre |
| 7 | owned | own |
| 8 | humbled | humbl |
| 9 | sized | size |


In [8]:
'''
Write a function to perform the pre processing steps on the entire dataset
'''
def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

# Tokenize and lemmatize
def preprocess(text):
    result=[]
    for token in gensim.utils.simple_preprocess(text) :
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            # TODO: Apply lemmatize_stemming() on the token, then add to the results list
            result.append(lemmatize_stemming(token))
    return result



In [9]:
'''
Preview a document after preprocessing
'''
document_num = 4310

# Pandas dataframe, select data with mask
mask = (documents['index'] == document_num)
doc_sample = documents[mask].values[0][0]

print("Original document: ")
words = []
for word in doc_sample.split(' '):
    words.append(word)
print(words)

print("\n\nTokenized and lemmatized document: ")
print(preprocess(doc_sample))

Original document: 
['rain', 'helps', 'dampen', 'bushfires']


Tokenized and lemmatized document: 
['rain', 'help', 'dampen', 'bushfir']


In [10]:
print(type(documents))
print(len(documents))
documents[:10]

<class 'pandas.core.frame.DataFrame'>
300000


| | headline_text | index |
|---|---|---|
| 0 | aba decides against community broadcasting lic... | 0 |
| 1 | act fire witnesses must be aware of defamation | 1 |
| 2 | a g calls for infrastructure protection summit | 2 |
| 3 | air nz staff in aust strike for pay rise | 3 |
| 4 | air nz strike to affect australian travellers | 4 |
| 5 | ambitious olsson wins triple jump | 5 |
| 6 | antic delighted with record breaking barca | 6 |
| 7 | aussie qualifier stosur wastes four memphis match | 7 |
| 8 | aust addresses un security council over iraq | 8 |
| 9 | australia is locked into war timetable opp | 9 |


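Now let's preprocess all the news headlines. To do that, we use the [map](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html) function from pandas to apply `preprocess()` to the `headline_text` column.

**Note**: This may take a few minutes (it takes 6 minutes on my laptop).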

In [11]:
# TODO: preprocess all the headlines, saving the results in a new 'headline_text_processed' column
documents['headline_text_processed'] = documents['headline_text'].map(preprocess)

In [12]:
'''
Preview 'processed_docs'
'''
processed_docs = documents['headline_text_processed'].tolist()
processed_docs[:5]

[['decid', 'communiti', 'broadcast', 'licenc'],
 ['wit', 'awar', 'defam'],
 ['call', 'infrastructur', 'protect', 'summit'],
 ['staff', 'aust', 'strike', 'rise'],
 ['strike', 'affect', 'australian', 'travel']]

In [13]:
type(processed_docs)

list

## Step 3.1: Bag of words on the dataset

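Now let's create a dictionary from `processed_docs` containing the number of times a word appears in the training set. To do that, pass `processed_docs` to [`gensim.corpora.Dictionary()`](https://radimrehurek.com/gensim/corpora/dictionary.html) and call the result `dictionary`.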

In [14]:
'''
Create a dictionary from 'processed_docs' containing the number of times a word appears 
in the training set using gensim.corpora.Dictionary and call it 'dictionary'
'''
dictionary = gensim.corpora.Dictionary(processed_docs)

In [15]:
print(dictionary.token2id)

{'broadcast': 0, 'communiti': 1, 'decid': 2, 'licenc': 3, 'awar': 4, 'defam': 5, 'wit': 6, 'call': 7, 'infrastructur': 8, 'protect': 9, 'summit': 10, 'aust': 11, 'rise': 12, 'staff': 13, 'strike': 14, 'affect': 15, 'australian': 16, 'travel': 17, 'ambiti': 18, 'jump': 19, 'olsson': 20, 'tripl': 21, 'win': 22, 'antic': 23, 'barca': 24, 'break': 25, 'delight': 26, 'record': 27, 'aussi': 28, 'match': 29, 'memphi': 30, 'qualifi': 31, 'stosur': 32, 'wast': 33, 'address': 34, 'council': 35, 'iraq': 36, 'secur': 37, 'australia': 38, 'lock': 39, 'timet': 40, 'contribut': 41, 'million': 42, 'birthday': 43, 'celebr': 44, 'robson': 45, 'ahead': 46, 'bathhous': 47, 'plan': 48, 'championship': 49, 'cycl': 50, 'hop': 51, 'launceston': 52, 'boost': 53, 'paroo': 54, 'suppli': 55, 'water': 56, 'bill': 57, 'blizzard': 58, 'buri': 59, 'state': 60, 'unit': 61, 'brigadi': 62, 'dismiss': 63, 'harass': 64, 'report': 65, 'troop': 66, 'arriv': 67, 'british': 68, 'combat': 69, 'daili': 70, 'kuwait': 71, 'bryant

In [16]:
'''
Checking dictionary created
'''
count = 0
for k, v in dictionary.iteritems():
    print(k, v)
    count += 1
    if count > 10:
        break

0 broadcast
1 communiti
2 decid
3 licenc
4 awar
5 defam
6 wit
7 call
8 infrastructur
9 protect
10 summit


In [17]:
#convert tokenized documents to vectors:
new_doc = "broadcast Broadcast communiti decid"
new_vec = dictionary.doc2bow(new_doc.lower().split())
print(new_vec) 

[(0, 2), (1, 1), (2, 1)]


**Gensim filter_extremes**

[`filter_extremes(no_below=5, no_above=0.5, keep_n=100000)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.filter_extremes)

Filter out tokens that appear in

* fewer than no_below documents (an absolute number), or
* more than no_above documents (a fraction of the total corpus size, not an absolute number).
* After (1) and (2), keep only the first keep_n most frequent tokens (or keep all of them if None).
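The three rules can be sketched in plain Python on a toy corpus. This is a simplified illustration of the filtering logic, not gensim's implementation, which may differ in details such as rounding and tie-breaking. The function name `toy_filter_extremes` and the sample documents are assumptions for this example.

```python
from collections import Counter

def toy_filter_extremes(docs, no_below=5, no_above=0.5, keep_n=100000):
    """Return the set of tokens that survive the three filtering rules."""
    # Document frequency: the number of documents each token appears in.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = no_above * len(docs)  # fraction converted to an absolute count
    kept = [t for t, df in doc_freq.items() if no_below <= df <= max_docs]
    # Keep only the keep_n tokens with the highest document frequency.
    kept.sort(key=lambda t: (-doc_freq[t], t))
    if keep_n is not None:
        kept = kept[:keep_n]
    return set(kept)

docs = [
    ["rain", "help", "bushfir"],
    ["rain", "polic", "help"],
    ["rain", "govt", "help"],
    ["rain", "polic", "fund"],
]
print(sorted(toy_filter_extremes(docs, no_below=2, no_above=0.75)))
```

Here "rain" is dropped for being too common (it appears in all four documents), while "bushfir", "govt" and "fund" are dropped for being too rare.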

In [18]:
'''
OPTIONAL STEP
Remove very rare and very common words:

- words appearing in fewer than 5 documents
- words appearing in more than 50% of all documents
'''
# TODO: apply dictionary.filter_extremes() with the parameters mentioned above
dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=100000)


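**Gensim doc2bow**

[`doc2bow(document)`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary.doc2bow)

* Convert a document (a list of words) into the bag-of-words format = a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized string (either unicode or utf8-encoded). No further preprocessing is done on the words in the document; apply tokenization, stemming, etc. before calling this method.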
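As a rough mental model, the conversion amounts to counting the tokens that the dictionary knows and reporting them by id. The sketch below uses only the standard library. `toy_doc2bow` is a hypothetical helper, and the `token2id` mapping contains just the first three ids printed earlier in this notebook.

```python
from collections import Counter

def toy_doc2bow(tokens, token2id):
    """Count known tokens and return sorted (token_id, token_count) pairs."""
    # Tokens missing from the mapping are silently ignored.
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"broadcast": 0, "communiti": 1, "decid": 2}
tokens = "broadcast Broadcast communiti decid".lower().split()
print(toy_doc2bow(tokens, token2id))
```

The result is a sparse representation: only tokens that actually occur in the document are listed.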

In [19]:
'''
Create the Bag-of-words model for each document i.e for each document we create a dictionary reporting how many
words and how many times those words appear. Save this to 'bow_corpus'
'''
# TODO
bow_corpus = [dictionary.doc2bow(doc) for doc in processed_docs]

In [20]:
'''
Checking Bag of Words corpus for our sample document --> (token_id, token_count)
'''
bow_corpus[document_num]

[(76, 1), (113, 1), (482, 1), (4016, 1)]

In [21]:
'''
Preview BOW for our sample preprocessed document
'''
# Here document_num is document number 4310 which we have checked in Step 2
# ['rain', 'help', 'dampen', 'bushfir']
bow_doc_4310 = bow_corpus[document_num]

for i in range(len(bow_doc_4310)):
    print("Word {} (\"{}\") appears {} time.".format(bow_doc_4310[i][0], 
                                                     dictionary[bow_doc_4310[i][0]], 
                                                     bow_doc_4310[i][1]))

Word 76 ("bushfir") appears 1 time.
Word 113 ("help") appears 1 time.
Word 482 ("rain") appears 1 time.
Word 4016 ("dampen") appears 1 time.


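## Step 3.2: TF-IDF on our document set ##

While performing TF-IDF on the corpus is not necessary for an LDA implementation using the gensim model, it is recommended. TF-IDF expects a bag-of-words (integer values) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality.

*Please note: the author of Gensim dictates that the standard procedure for LDA is to use the Bag of Words model.*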

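**TF-IDF stands for "Term Frequency, Inverse Document Frequency".**

* It is a way to score the importance of words (or "terms") in a document based on how frequently they appear across multiple documents.
* If a word appears frequently in a document, it is important, so it gets a high score. But if a word appears in many documents, it is not a unique identifier, so it gets a low score.
* Therefore, common words like "the" and "for", which appear in many documents, are scaled down. Words that appear frequently in a single document are scaled up.

In other words:

* TF(w) = `(Number of times term w appears in a document) / (Total number of terms in the document)`.
* IDF(w) = `log(Total number of documents / Number of documents with term w in it)`. The base of the logarithm only rescales the scores; the example below uses base 10.

**For example**

* Consider a document containing `100` words in which the word "tiger" appears 3 times.
* The term frequency (i.e., tf) for "tiger" is then:
    - `TF = (3 / 100) = 0.03`.

* Now, assume we have `10 million` documents and the word "tiger" appears in `1000` of them. The inverse document frequency (i.e., idf) is then calculated as:
    - `IDF = log_10(10,000,000 / 1,000) = 4`.

* Thus, the tf-idf weight is the product of these quantities:
    - `TF-IDF = 0.03 * 4 = 0.12`.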
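The arithmetic of the "tiger" example can be checked in a few lines. Note that the value 4 only comes out with a base-10 logarithm, so that is what this sketch uses. The helper names `tf` and `idf_base10` are made up for this illustration, and gensim's `TfidfModel` applies its own weighting and normalization, so its scores will not match these numbers.

```python
import math

def tf(term_count, total_terms):
    """Term frequency: share of the document's terms taken up by the term."""
    return term_count / total_terms

def idf_base10(total_docs, docs_with_term):
    """Inverse document frequency using a base-10 logarithm."""
    return math.log10(total_docs / docs_with_term)

tf_tiger = tf(3, 100)
idf_tiger = idf_base10(10_000_000, 1_000)
print(tf_tiger, idf_tiger, round(tf_tiger * idf_tiger, 4))

# With the natural logarithm the IDF would be about 9.21 instead of 4,
# which rescales every score by the same constant factor.
print(math.log(10_000_000 / 1_000))
```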

In [22]:
'''
Create tf-idf model object using models.TfidfModel on 'bow_corpus' and save it to 'tfidf'
'''
from gensim import corpora, models

# TODO
tfidf = models.TfidfModel(bow_corpus)

In [23]:
'''
Apply transformation to the entire corpus and call it 'corpus_tfidf'
'''
# TODO
corpus_tfidf = tfidf[bow_corpus]

In [24]:
'''
Preview TF-IDF scores for our first document --> (token_id, tfidf score)
'''
from pprint import pprint
for doc in corpus_tfidf:
    pprint(doc)
    break

[(0, 0.5959813347777092),
 (1, 0.39204529549491984),
 (2, 0.48531419274988147),
 (3, 0.5055461098578569)]


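## Step 4.1: Running LDA using Bag of Words ##

We are going for 10 topics in the document corpus.

**We will run LDA with `LdaMulticore`, which parallelizes training across CPU cores to speed it up.**

Some of the parameters we will be tweaking are:

* **num_topics** is the number of requested latent topics to be extracted from the training corpus.
* **id2word** is a mapping from word ids (integers) to words (strings). It is used to determine the vocabulary size, as well as for debugging and topic printing.
* **workers** is the number of extra processes to use for parallelization. By default, all available cores are used (gensim estimates this as `cpu_count() - 1`).
* **alpha** and **eta** are hyperparameters that affect the sparsity of the document-topic (theta) and topic-word (lambda) distributions. We will leave them at their default values for now (the default is `1/num_topics`).
    - Alpha controls the per-document topic distribution.
        * High alpha: every document contains a mixture of all topics (documents appear similar to each other).
        * Low alpha: every document contains only a few topics.
    - Eta controls the per-topic word distribution.
        * High eta: every topic contains a mixture of most words (topics appear similar to each other).
        * Low eta: every topic contains only a few words.

* **passes** is the number of training passes through the corpus. For example, if the training corpus has 50,000 documents, the chunk size is 10,000 and passes is 2, then online training is done in 10 updates: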
    * `#1 documents 0-9,999 `
    * `#2 documents 10,000-19,999 `
    * `#3 documents 20,000-29,999 `
    * `#4 documents 30,000-39,999 `
    * `#5 documents 40,000-49,999 `
    * `#6 documents 0-9,999 `
    * `#7 documents 10,000-19,999 `
    * `#8 documents 20,000-29,999 `
    * `#9 documents 30,000-39,999 `
    * `#10 documents 40,000-49,999`
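The update count in this example is simply the number of chunks per pass multiplied by the number of passes. The sketch below enumerates the chunks to reproduce the list. The function `update_schedule` is hypothetical and is a simplification of how an online trainer walks the corpus; the actual update timing in gensim depends on further settings.

```python
def update_schedule(num_docs, chunksize, passes):
    """List (update_number, first_doc, last_doc) for every online update."""
    schedule = []
    update = 0
    for _ in range(passes):
        # Each pass walks the whole corpus once, one chunk at a time.
        for start in range(0, num_docs, chunksize):
            update += 1
            end = min(start + chunksize, num_docs) - 1
            schedule.append((update, start, end))
    return schedule

schedule = update_schedule(50_000, 10_000, 2)
for update, first, last in schedule:
    print(f"#{update} documents {first:,}-{last:,}")
```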

In [25]:
# LDA mono-core -- fallback code in case LdaMulticore throws an error on your machine
# lda_model = gensim.models.LdaModel(bow_corpus, 
#                                    num_topics = 10, 
#                                    id2word = dictionary,                                    
#                                    passes = 50)

# LDA multicore 
'''
Train your lda model using gensim.models.LdaMulticore and save it to 'lda_model'
'''
# TODO
lda_model = gensim.models.LdaMulticore(bow_corpus, 
                                       num_topics=10, 
                                       id2word = dictionary, 
                                       passes = 2, 
                                       workers=2)

In [26]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model.print_topics(-1):
    print("Topic: {} \nWords: {}".format(idx, topic))
    print("\n")

Topic: 0 
Words: 0.022*"opposit" + 0.021*"test" + 0.020*"work" + 0.017*"talk" + 0.017*"probe" + 0.016*"call" + 0.015*"hold" + 0.015*"polic" + 0.013*"say" + 0.013*"push"


Topic: 1 
Words: 0.033*"claim" + 0.024*"minist" + 0.019*"reject" + 0.018*"protest" + 0.015*"south" + 0.014*"worker" + 0.013*"strike" + 0.011*"vote" + 0.011*"damag" + 0.010*"worri"


Topic: 2 
Words: 0.026*"open" + 0.017*"deal" + 0.016*"hospit" + 0.015*"inquiri" + 0.013*"action" + 0.012*"guilti" + 0.012*"final" + 0.011*"find" + 0.010*"build" + 0.010*"injuri"


Topic: 3 
Words: 0.037*"govt" + 0.031*"water" + 0.023*"urg" + 0.020*"fund" + 0.019*"plan" + 0.015*"group" + 0.014*"boost" + 0.014*"council" + 0.013*"help" + 0.012*"health"


Topic: 4 
Words: 0.023*"nation" + 0.022*"labor" + 0.017*"win" + 0.016*"meet" + 0.016*"howard" + 0.012*"chang" + 0.011*"state" + 0.010*"say" + 0.010*"liber" + 0.010*"park"


Topic: 5 
Words: 0.034*"polic" + 0.017*"arrest" + 0.016*"jail" + 0.013*"offic" + 0.013*"famili" + 0.012*"timor" + 0.011*"rudd" + 0

### Classification of the topics ###

Using the words in each topic and their corresponding weights, what categories were you able to infer?

* 0: 
* 1: 
* 2: 
* 3: 
* 4: 
* 5: 
* 6: 
* 7:  
* 8: 
* 9: 

## Step 4.2: Running LDA using TF-IDF ##

In [27]:
'''
Define lda model using corpus_tfidf, again using gensim.models.LdaMulticore()
'''
# TODO
lda_model_tfidf = gensim.models.LdaMulticore(corpus_tfidf, 
                                             num_topics=10, 
                                             id2word = dictionary, 
                                             passes = 2)

In [28]:
'''
For each topic, we will explore the words occurring in that topic and their relative weights
'''
for idx, topic in lda_model_tfidf.print_topics(-1):
    print("Topic: {} Word: {}".format(idx, topic))
    print("\n")

Topic: 0 Word: 0.010*"govt" + 0.008*"council" + 0.008*"plan" + 0.006*"urg" + 0.006*"fund" + 0.006*"opposit" + 0.006*"toll" + 0.006*"road" + 0.006*"health" + 0.005*"water"


Topic: 1 Word: 0.017*"crash" + 0.016*"kill" + 0.012*"iraq" + 0.008*"bomb" + 0.007*"baghdad" + 0.006*"troop" + 0.006*"soldier" + 0.006*"die" + 0.006*"blast" + 0.006*"attack"


Topic: 2 Word: 0.007*"guilti" + 0.007*"terror" + 0.007*"iran" + 0.006*"hick" + 0.006*"plead" + 0.006*"pakistan" + 0.006*"rudd" + 0.005*"nuclear" + 0.005*"polic" + 0.004*"arrest"


Topic: 3 Word: 0.005*"wind" + 0.005*"farm" + 0.005*"climat" + 0.005*"plan" + 0.005*"share" + 0.005*"takeov" + 0.004*"news" + 0.004*"council" + 0.004*"market" + 0.004*"pipelin"


Topic: 4 Word: 0.024*"closer" + 0.015*"miss" + 0.012*"search" + 0.009*"polic" + 0.005*"accid" + 0.005*"councillor" + 0.005*"girl" + 0.005*"bodi" + 0.005*"woman" + 0.004*"continu"


Topic: 5 Word: 0.006*"palestinian" + 0.005*"elect" + 0.005*"isra" + 0.004*"gaza" + 0.004*"govt" + 0.004*"lebanon"

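### Classification of the topics ###

As we can see, when using tf-idf, heavier weights are given to less frequent words, which results in nouns being factored in. That makes it harder to figure out the categories, as nouns can be hard to categorize. This goes to show that the model we apply depends on the type of text corpus we are dealing with.

Using the words in each topic and their corresponding weights, what categories were you able to infer?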

* 0: 
* 1:  
* 2: 
* 3: 
* 4:  
* 5: 
* 6: 
* 7: 
* 8: 
* 9: 

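## Step 5.1: Performance evaluation by classifying a sample document using the LDA Bag of Words model ##

We will check where our test document would be classified.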

In [29]:
'''
Text of sample document 4310
'''
processed_docs[4310]

['rain', 'help', 'dampen', 'bushfir']

In [30]:
'''
Check which topic our test document belongs to using the LDA Bag of Words model.
'''
document_num = 4310
# Our test document is document number 4310

# TODO
for index, score in sorted(lda_model[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model.print_topic(index, 10)))


Score: 0.6196737289428711	 
Topic: 0.037*"govt" + 0.031*"water" + 0.023*"urg" + 0.020*"fund" + 0.019*"plan" + 0.015*"group" + 0.014*"boost" + 0.014*"council" + 0.013*"help" + 0.012*"health"

Score: 0.22024796903133392	 
Topic: 0.026*"miss" + 0.021*"forc" + 0.017*"search" + 0.013*"iraq" + 0.013*"world" + 0.012*"lead" + 0.012*"continu" + 0.012*"close" + 0.011*"lose" + 0.011*"troop"

Score: 0.020016156136989594	 
Topic: 0.026*"open" + 0.017*"deal" + 0.016*"hospit" + 0.015*"inquiri" + 0.013*"action" + 0.012*"guilti" + 0.012*"final" + 0.011*"find" + 0.010*"build" + 0.010*"injuri"

Score: 0.020011691376566887	 
Topic: 0.022*"opposit" + 0.021*"test" + 0.020*"work" + 0.017*"talk" + 0.017*"probe" + 0.016*"call" + 0.015*"hold" + 0.015*"polic" + 0.013*"say" + 0.013*"push"

Score: 0.020010966807603836	 
Topic: 0.034*"polic" + 0.017*"arrest" + 0.016*"jail" + 0.013*"offic" + 0.013*"famili" + 0.012*"timor" + 0.011*"rudd" + 0.010*"sale" + 0.009*"suspect" + 0.009*"releas"

Score: 0.02000836282968521	 

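### The document has the highest probability of belonging to the topic we assigned as X, which is the correct classification ###

## Step 5.2: Performance evaluation by classifying a sample document using the LDA TF-IDF model ##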

In [31]:
'''
Check which topic our test document belongs to using the LDA TF-IDF model.
'''
# Our test document is document number 4310
for index, score in sorted(lda_model_tfidf[bow_corpus[document_num]], key=lambda tup: -1*tup[1]):
    print("\nScore: {}\t \nTopic: {}".format(score, lda_model_tfidf.print_topic(index, 10)))


Score: 0.8199503421783447	 
Topic: 0.005*"clean" + 0.005*"afghanistan" + 0.005*"uranium" + 0.005*"farmer" + 0.005*"grower" + 0.004*"rain" + 0.004*"damag" + 0.004*"govt" + 0.004*"polic" + 0.004*"robberi"

Score: 0.02001890540122986	 
Topic: 0.006*"bird" + 0.006*"break" + 0.006*"govt" + 0.006*"rate" + 0.006*"bushfir" + 0.006*"rise" + 0.005*"hill" + 0.005*"inquiri" + 0.005*"polic" + 0.005*"plan"

Score: 0.0200049988925457	 
Topic: 0.007*"govt" + 0.006*"plan" + 0.005*"urg" + 0.005*"drink" + 0.005*"fund" + 0.005*"driver" + 0.005*"care" + 0.005*"polic" + 0.005*"boost" + 0.004*"drive"

Score: 0.02000466175377369	 
Topic: 0.010*"govt" + 0.008*"council" + 0.008*"plan" + 0.006*"urg" + 0.006*"fund" + 0.006*"opposit" + 0.006*"toll" + 0.006*"road" + 0.006*"health" + 0.005*"water"

Score: 0.020004266873002052	 
Topic: 0.007*"guilti" + 0.007*"terror" + 0.007*"iran" + 0.006*"hick" + 0.006*"plead" + 0.006*"pakistan" + 0.006*"rudd" + 0.005*"nuclear" + 0.005*"polic" + 0.004*"arrest"

Score: 0.0200041830

### The document has the highest probability (`x%`) of belonging to the topic we assigned as X ###

## Step 6: Testing the model on an unseen document ##

In [32]:
unseen_document = "My favorite sports activities are running and swimming."

# Data preprocessing step for the unseen document
bow_vector = dictionary.doc2bow(preprocess(unseen_document))

for index, score in sorted(lda_model[bow_vector], key=lambda tup: -1*tup[1]):
    print("Score: {}\t Topic: {}".format(score, lda_model.print_topic(index, 5)))

Score: 0.42003557085990906	 Topic: 0.034*"polic" + 0.034*"kill" + 0.032*"crash" + 0.028*"death" + 0.024*"investig"
Score: 0.2248527854681015	 Topic: 0.034*"polic" + 0.017*"arrest" + 0.016*"jail" + 0.013*"offic" + 0.013*"famili"
Score: 0.21507854759693146	 Topic: 0.023*"nation" + 0.022*"labor" + 0.017*"win" + 0.016*"meet" + 0.016*"howard"
Score: 0.020005878061056137	 Topic: 0.022*"opposit" + 0.021*"test" + 0.020*"work" + 0.017*"talk" + 0.017*"probe"
Score: 0.02000458911061287	 Topic: 0.037*"charg" + 0.033*"court" + 0.031*"face" + 0.030*"council" + 0.025*"plan"
Score: 0.020004551857709885	 Topic: 0.037*"govt" + 0.031*"water" + 0.023*"urg" + 0.020*"fund" + 0.019*"plan"
Score: 0.020004533231258392	 Topic: 0.033*"claim" + 0.024*"minist" + 0.019*"reject" + 0.018*"protest" + 0.015*"south"
Score: 0.020004529505968094	 Topic: 0.026*"open" + 0.017*"deal" + 0.016*"hospit" + 0.015*"inquiri" + 0.013*"action"
Score: 0.020004529505968094	 Topic: 0.024*"closer" + 0.019*"record" + 0.017*"coast" + 0.017

The model correctly classifies the unseen document into category X with a probability of x%.