## LDA模型应用：希拉里邮件门

In [18]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

sns.set_context('notebook')
sns.set_style('white')

import warnings
warnings.filterwarnings('ignore')

In [19]:
HillaryEmail = pd.read_csv('HillaryEmails.csv')
emailBodyText = HillaryEmail[['Id', 'ExtractedBodyText']].dropna()

In [20]:
emailBodyText.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6742 entries, 1 to 7944
Data columns (total 2 columns):
Id                   6742 non-null int64
ExtractedBodyText    6742 non-null object
dtypes: int64(1), object(1)
memory usage: 158.0+ KB


### 文本预处理

In [21]:
def emailBodyTextcleaning(text):
    text = text.replace('\n'," ")
    text = re.sub(r"-", " ", text)
    text = re.sub(r"\d+/\d+/\d+", "", text)
    text = re.sub(r"[0-2]?[0-9]:[0-6][0-9]", "", text)
    text = re.sub(r"[\w]+@[\.\w]+", "", text)
    text = re.sub(r"/[a-zA-Z]*[:\//\]*[A-Za-z0-9\-_]+\.+[A-Za-z0-9\.\/%&=\?\-_]+/i", "", text)
    pure_text = ''
    for letter in text:
        if letter.isalpha() or letter==' ':
            pure_text += letter
    text = ' '.join(word for word in pure_text.split() if len(word)>1)
    return text

In [22]:
BodyText = emailBodyText['ExtractedBodyText'].apply(lambda c: emailBodyTextcleaning(c))

In [23]:
BodyText.head(1).values

array(['Thursday March PM Latest How Syria is aiding Qaddafi and more Sid hrc memo syria aiding libya docx hrc memo syria aiding libya docx March For Hillary'],
      dtype=object)

In [24]:
bodytextlist = BodyText.values

### LDA模型构建

使用gensim库构建模型，其要求语料库形式为：

[[一，条，邮件，在，这里],[第，二，条，邮件，在，这里],[今天，天气，肿么，样],...]

In [25]:
import gensim
from gensim import corpora, models, similarities

停止词，在中英文中没有意义，出现概率可能还比较高，需要去除...

In [26]:
stopwordlist = ['very', 'ourselves', 'am', 'doesn', 'through', 'me', 'against', 'up', 'just', 'her', 'ours', 
            'couldn', 'because', 'is', 'isn', 'it', 'only', 'in', 'such', 'too', 'mustn', 'under', 'their', 
            'if', 'to', 'my', 'himself', 'after', 'why', 'while', 'can', 'each', 'itself', 'his', 'all', 'once', 
            'herself', 'more', 'our', 'they', 'hasn', 'on', 'ma', 'them', 'its', 'where', 'did', 'll', 'you', 
            'didn', 'nor', 'as', 'now', 'before', 'those', 'yours', 'from', 'who', 'was', 'm', 'been', 'will', 
            'into', 'same', 'how', 'some', 'of', 'out', 'with', 's', 'being', 't', 'mightn', 'she', 'again', 'be', 
            'by', 'shan', 'have', 'yourselves', 'needn', 'and', 'are', 'o', 'these', 'further', 'most', 'yourself', 
            'having', 'aren', 'here', 'he', 'were', 'but', 'this', 'myself', 'own', 'we', 'so', 'i', 'does', 'both', 
            'when', 'between', 'd', 'had', 'the', 'y', 'has', 'down', 'off', 'than', 'haven', 'whom', 'wouldn', 
            'should', 've', 'over', 'themselves', 'few', 'then', 'hadn', 'what', 'until', 'won', 'no', 'about', 
            'any', 'that', 'for', 'shouldn', 'don', 'do', 'there', 'doing', 'an', 'or', 'ain', 'hers', 'wasn', 
            'weren', 'above', 'a', 'at', 'your', 'theirs', 'below', 'other', 'not', 're', 'him', 'during', 'which']

人工分词：

英文句式中都是包含空格的，直接空格分词就行；而对于中文稍微麻烦，可以使用CoreNLP、HaNLP，或结巴分词来处理...

分词的意义在于将句子转换为有意义的小单元，让计算机来识别...

In [27]:
texts = [[word for word in bodytext.lower().split() if word not in stopwordlist] for bodytext in bodytextlist]

In [28]:
texts[0]

['thursday',
 'march',
 'pm',
 'latest',
 'syria',
 'aiding',
 'qaddafi',
 'sid',
 'hrc',
 'memo',
 'syria',
 'aiding',
 'libya',
 'docx',
 'hrc',
 'memo',
 'syria',
 'aiding',
 'libya',
 'docx',
 'march',
 'hillary']

### 建立语料库

语料库建立会对每个词加上index，并会统计每个词的词频

In [29]:
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

In [30]:
corpus[13]

[(51, 1), (505, 1), (506, 1), (507, 1), (508, 1)]

### 开始建模

In [31]:
lda = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=20)

出现次数为top5的单词：

In [32]:
lda.print_topic(10, topn=5)

'0.025*"call" + 0.015*"tomorrow" + 0.011*"talk" + 0.010*"im" + 0.009*"also"'

打印所有主题：

In [33]:
lda.print_topics(num_topics=20, num_words=5)

[(0,
  '0.019*"pm" + 0.013*"state" + 0.011*"fw" + 0.010*"cheryl" + 0.008*"mills"'),
 (1,
  '0.009*"think" + 0.006*"would" + 0.006*"germany" + 0.005*"tomorrow" + 0.005*"one"'),
 (2,
  '0.011*"know" + 0.009*"work" + 0.007*"good" + 0.005*"women" + 0.005*"holbrooke"'),
 (3, '0.014*"call" + 0.012*"add" + 0.011*"pls" + 0.008*"mod" + 0.007*"would"'),
 (4,
  '0.007*"would" + 0.006*"government" + 0.006*"us" + 0.005*"party" + 0.005*"security"'),
 (5,
  '0.014*"qddr" + 0.005*"arizona" + 0.004*"fco" + 0.004*"got" + 0.004*"called"'),
 (6, '0.008*"would" + 0.007*"mr" + 0.005*"said" + 0.005*"one" + 0.005*"us"'),
 (7,
  '0.029*"fyi" + 0.013*"boehner" + 0.012*"doc" + 0.008*"house" + 0.007*"us"'),
 (8,
  '0.006*"us" + 0.005*"new" + 0.005*"obama" + 0.005*"policy" + 0.004*"said"'),
 (9,
  '0.007*"get" + 0.006*"didnt" + 0.005*"madame" + 0.004*"roger" + 0.004*"brother"'),
 (10,
  '0.025*"call" + 0.015*"tomorrow" + 0.011*"talk" + 0.010*"im" + 0.009*"also"'),
 (11,
  '0.012*"fyi" + 0.009*"us" + 0.005*"would" 

希拉里twitter数据：

```
To all the little girls watching...never doubt that you are valuable and powerful & deserving of every chance & opportunity in the world.

I was greeted by this heartwarming display on the corner of my street today. Thank you to all of you who did this. Happy Thanksgiving. -H

Hoping everyone has a safe & Happy Thanksgiving today, & quality time with family & friends. -H

Scripture tells us: Let us not grow weary in doing good, for in due season, we shall reap, if we do not lose heart.

Let us have faith in each other. Let us not grow weary. Let us not lose heart. For there are more seasons to come and...more work to do

We have still have not shattered that highest and hardest glass ceiling. But some day, someone will

To Barack and Michelle Obama, our country owes you an enormous debt of gratitude. We thank you for your graceful, determined leadership

Our constitutional democracy demands our participation, not just every four years, but all the time

You represent the best of America, and being your candidate has been one of the greatest honors of my life

Last night I congratulated Donald Trump and offered to work with him on behalf of our country

Already voted? That's great! Now help Hillary win by signing up to make calls now

It's Election Day! Millions of Americans have cast their votes for Hillary—join them and confirm where you vote

We don’t want to shrink the vision of this country. We want to keep expanding it

We have a chance to elect a 45th president who will build on our progress, who will finish the job

I love our country, and I believe in our people, and I will never, ever quit on you. No matter what

```

In [34]:
data_twitter = pd.read_excel('./HillaryTwitters.xlsx', names=['BodyText'])
data_twitter.dropna(inplace=True)

text = data_twitter['BodyText']
text = text.apply(lambda s: emailBodyTextcleaning(s))
textlist = text.values

texts = [[word for word in doc.lower().split() if word not in stopwordlist] for doc in textlist]

dictionary = corpora.Dictionary(texts)
tweets = [dictionary.doc2bow(text) for text in texts]

lst = []
length = 0
def get_topic(tweets):
    topic = lda.get_document_topics(tweets)
    for i in range(len(topic)):
        for j in range(len(topic[i])):
            lst.append(topic[i][j][1])
        length = len(lst) - len(topic[i])
        if i == 0:
            print ("第%i条twitter最有可能属于topic%i，属于该topic的概率为：%.3f" % (i+1, topic[i][lst.index(max(lst))][0], max(lst)))
        else:
            print("第%i条twitter最有可能属于topic%i，属于该topic的概率为：%.3f" % (i+1, topic[i][lst[length:].index(max(lst[length:]))][0], max(lst[length:])))

get_topic(tweets)

第1条twitter最有可能属于topic0，属于该topic的概率为：0.905
第2条twitter最有可能属于topic0，属于该topic的概率为：0.839
第3条twitter最有可能属于topic7，属于该topic的概率为：0.502
第4条twitter最有可能属于topic7，属于该topic的概率为：0.791
第5条twitter最有可能属于topic7，属于该topic的概率为：0.775
第6条twitter最有可能属于topic7，属于该topic的概率为：0.927
第7条twitter最有可能属于topic7，属于该topic的概率为：0.754
第8条twitter最有可能属于topic0，属于该topic的概率为：0.336
第9条twitter最有可能属于topic4，属于该topic的概率为：0.792
第10条twitter最有可能属于topic4，属于该topic的概率为：0.914
第11条twitter最有可能属于topic12，属于该topic的概率为：0.427
第12条twitter最有可能属于topic11，属于该topic的概率为：0.770
第13条twitter最有可能属于topic11，属于该topic的概率为：0.770
第14条twitter最有可能属于topic11，属于该topic的概率为：0.526


In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

In [36]:
tfidf = TfidfVectorizer(stop_words='english')
tf = tfidf.fit_transform(BodyText)
tf_feature_names = tfidf.get_feature_names()

In [37]:
lda = LatentDirichletAllocation(n_components=20)
lda.fit(tf)



LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=20, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)

In [38]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        message = "Topic #%d: " % topic_idx
        message += " ".join([feature_names[i]
                             for i in topic.argsort()[:-n_top_words - 1:-1]])
        print(message)

In [39]:
print_top_words(lda, tf_feature_names, n_top_words=20)

Topic #0: roger yeah counts sheets yup congratulate discouraging courier philippes inquiring pouch cin angolans bowl ssci stephens younis pronounciation bios retried
Topic #1: ok gheit roccos iiloty donnas mccue meghann redraft trimming nasrallah breathlessness lyi serendipitous zaki hbi hossam screed arming thats referring
Topic #2: pis print lock ns shelton prothero qs laurenpls amendmentcould thinking fine drafts ill comfortable thx email fw fax september day
Topic #3: pm secretarys office arrive room meeting en depart route residence private department daily conference monica state kris floor andrews minutes
Topic #4: thx benghazi doc supposed foia redactions waiver date select produced comm cb dept maggie perfect church fox secret thankfully data
Topic #5: interesting aye haim mullens congratsiiiiii prelude betcha kinney leecia ophelia furstenberg binta kives hochberg rudin saban duke nasr clay whos
Topic #6: yellow leg ffyi ba greet elmendorf morsis dmv license somewherein blumen

In [41]:
import numpy as np
import lda
import lda.datasets
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)  # model.fit_transform(X) is also available
topic_word = model.topic_word_  # model.components_ also works
n_top_words = 8
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words+1):-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))
doc_topic = model.doc_topic_
for i in range(10):
    print("{} (top topic: {})".format(titles[i], doc_topic[i].argmax()))

INFO:lda:n_documents: 395
INFO:lda:vocab_size: 4258
INFO:lda:n_words: 84010
INFO:lda:n_topics: 20
INFO:lda:n_iter: 1500
INFO:lda:<0> log likelihood: -1051748
INFO:lda:<10> log likelihood: -719800
INFO:lda:<20> log likelihood: -699115
INFO:lda:<30> log likelihood: -689370
INFO:lda:<40> log likelihood: -684918
INFO:lda:<50> log likelihood: -681322
INFO:lda:<60> log likelihood: -678979
INFO:lda:<70> log likelihood: -676598
INFO:lda:<80> log likelihood: -675383
INFO:lda:<90> log likelihood: -673316
INFO:lda:<100> log likelihood: -672761
INFO:lda:<110> log likelihood: -671320
INFO:lda:<120> log likelihood: -669744
INFO:lda:<130> log likelihood: -669292
INFO:lda:<140> log likelihood: -667940
INFO:lda:<150> log likelihood: -668038
INFO:lda:<160> log likelihood: -667429
INFO:lda:<170> log likelihood: -666475
INFO:lda:<180> log likelihood: -665562
INFO:lda:<190> log likelihood: -664920
INFO:lda:<200> log likelihood: -664979
INFO:lda:<210> log likelihood: -664722
INFO:lda:<220> log likelihood: -

Topic 0: british churchill sale million major letters west britain
Topic 1: church government political country state people party against
Topic 2: elvis king fans presley life concert young death
Topic 3: yeltsin russian russia president kremlin moscow michael operation
Topic 4: pope vatican paul john surgery hospital pontiff rome
Topic 5: family funeral police miami versace cunanan city service
Topic 6: simpson former years court president wife south church
Topic 7: order mother successor election nuns church nirmala head
Topic 8: charles prince diana royal king queen parker bowles
Topic 9: film french france against bardot paris poster animal
Topic 10: germany german war nazi letter christian book jews
Topic 11: east peace prize award timor quebec belo leader
Topic 12: n't life show told very love television father
Topic 13: years year time last church world people say
Topic 14: mother teresa heart calcutta charity nun hospital missionaries
Topic 15: city salonika capital buddhist c