# Introduction to Natural Language Processing, with Practice

# 1. Basic Concepts

<img src='./image/nlp.jpg' />

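Natural language processing is an important area of computer science and artificial intelligence. It studies the theories and methods that let people and computers communicate effectively in natural language, and it draws on linguistics, computer science, and mathematics. Because its subject is natural language, the language people use every day, it is closely tied to linguistics, yet it differs in an important way: the goal is not to study natural language in general but to build computer systems, software systems in particular, that can communicate in natural language effectively. That makes it a part of computer science.

Natural language processing (NLP) is the field of computer science, artificial intelligence, and linguistics concerned with the interaction between computers and human (natural) languages.

As data analysts, a large part of our daily work is extracting information from the financial documents published by exchanges, listed companies, and fund companies.

Fund names and all manner of specific financial data are typical examples. Mastering a few NLP techniques could make that daily work considerably easier.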

# 2. Main Areas

- Text to speech / Speech synthesis
- Speech recognition
- Chinese word segmentation
- Part-of-speech tagging
- Parsing
- Natural language generation
- Text categorization
- Information retrieval
- Information extraction
- Text proofing
- Question answering
- Machine translation
- Automatic summarization
- Textual entailment

<img src='./image/nlparc.jpg' />

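# 3. A Typical Workflow

## 3.1 Collecting Data

For us analysts, this means pulling the filings we care about out of our document library and then, ideally, stacking them up as samples with one sentence per sample.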

In [26]:
import spacy
nlp = spacy.load('en_core_web_lg')

Let's split the following passage into sentences:

This prospectus offers variable annuity contract allowing you to accumulate values and paying you benefits on a variable and/or fixed basis. This prospectus provides information regarding the material provisions of your variable annuity contract. We may restrict the availability of this contract to certain broker-dealers. National Security Life V.I. and Annuity Company ("National Security") issues the contract. This contract is only available in New York.

In [6]:
def splitparagraph2sentence(paragraph):
    doc = nlp(paragraph)
    return [sentence.text for sentence in doc.sents]

In [8]:
sentences = splitparagraph2sentence('This prospectus offers variable annuity contract allowing you to accumulate values and paying you benefits on a variable and/or fixed basis. This prospectus provides information regarding the material provisions of your variable annuity contract. We may restrict the availability of this contract to certain broker-dealers. National Security Life V.I. and Annuity Company ("National Security") issues the contract. This contract is only available in New York.')
for sentence in sentences:
    print(sentence)

This prospectus offers variable annuity contract allowing you to accumulate values and paying you benefits on a variable and/or fixed basis.
This prospectus provides information regarding the material provisions of your variable annuity contract.
We may restrict the availability of this contract to certain broker-dealers.
National Security Life V.I. and Annuity Company ("National Security") issues the contract.
This contract is only available in New York.


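Note: the periods in "National Security Life V.I." are not blindly treated as sentence boundaries. spaCy decides where sentences end from its linguistic analysis of the text (by default, from the dependency parse) rather than from punctuation alone.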
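For contrast with the note above, here is a minimal sketch of a naive splitter that breaks after every period followed by whitespace. It uses only the standard library; `naive_split` is a hypothetical helper written for this illustration and is not part of spaCy.

```python
import re

paragraph = (
    'We may restrict the availability of this contract to certain broker-dealers. '
    'National Security Life V.I. and Annuity Company ("National Security") issues the contract. '
    'This contract is only available in New York.'
)


def naive_split(text):
    """Split after every period that is followed by whitespace (illustration only)."""
    return [part.strip() for part in re.split(r'(?<=\.)\s+', text) if part.strip()]


naive_sentences = naive_split(paragraph)
for sentence in naive_sentences:
    print(sentence)
```

The three sentences come out as four pieces, because the rule cuts the company name after "V.I.". This is the failure that the spaCy-based splitter avoids.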

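<b>If supervised classification is an option, label the samples as you collect them</b>, because the alternative, unsupervised clustering followed by judging text types from similarity, is time-consuming and laborious and tends to give poorer results.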

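## 3.2 Cleaning Data

Our first principle: "even the best model cannot rescue garbage data." So let's clean the data first!

We apply the following steps, guided by one rule:
remove what varies and keep only what is constant, or what can be turned into a constant.

1. Remove all irrelevant characters, such as any non-alphanumeric character

2. Tokenize the text by splitting it into individual words

3. Remove irrelevant tokens, such as Twitter "@" mentions or URLs

4. Convert all characters to lowercase so that words such as "hello", "Hello", and "HELLO" are treated as the same word

5. Consider merging misspellings and alternative spellings into a single representative word (for example "cool" / "kewl" / "cooool")

6. Consider lemmatization (reducing words such as "am", "are", and "is" to a common form such as "be")

7. Replace all proper nouns with the placeholder token propn, that is, turn variables into constants

8. Remove stop words, such as for a an of the and to about after in among as...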
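Steps 3 and 5 in the list above have no dedicated code in this section, so here is a minimal standard-library sketch of one way they might look. The function names, the `SPELLING_VARIANTS` table, and the rule of collapsing long letter runs to two letters are all assumptions made for illustration; real data would need a more careful approach (the mention pattern, for instance, would also hit e-mail addresses).

```python
import re

# Hypothetical lookup table of known spelling variants (illustration only).
SPELLING_VARIANTS = {'kewl': 'cool'}


def remove_mentions_and_urls(text):
    """Step 3: drop URLs and @mentions, then tidy up the whitespace."""
    text = re.sub(r'https?://\S+|www\.\S+', ' ', text)
    text = re.sub(r'@\w+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()


def normalize_spelling(word):
    """Step 5: shrink runs of 3+ identical characters to 2, then map known variants."""
    collapsed = re.sub(r'(.)\1{2,}', r'\1\1', word.lower())
    return SPELLING_VARIANTS.get(collapsed, collapsed)


print(remove_mentions_and_urls('Thanks @fund_desk see https://example.com/filing.pdf today'))
print([normalize_spelling(word) for word in ['cool', 'kewl', 'cooool']])
```

Collapsing repeated letters is only a heuristic: it handles "cooool" but would turn "sooo" into "soo", so a dictionary lookup or spell checker may be a better fit for noisy text.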

Implementation details:

### Removing All Non-Letter Characters
Take this sentence as an example: Also assume that, when the Owner is age 76, a step up occurs and the highest quarterly Contract Value is greater than the BDB; in that case, the GAWA percentage will be re determined based on the Owner's attained age of 76, resulting in a new GAWA percentage of 6%.

In [10]:
import re
text = '''Also assume that, when the Owner is age 76, a step up occurs and the highest quarterly Contract Value is greater than the BDB; in that case, the GAWA percentage will be re determined based on the Owner's attained age of 76, resulting in a new GAWA percentage of 6%.'''
text = re.sub(r'\W', ' ', text)
text = re.sub(r'\d', ' ', text)
text = re.sub(r'( ){2,}', ' ', text).strip()
print(text)

Also assume that when the Owner is age a step up occurs and the highest quarterly Contract Value is greater than the BDB in that case the GAWA percentage will be re determined based on the Owner s attained age of resulting in a new GAWA percentage of


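### Part-of-Speech Tagging and Lemmatization

#### What Is a Part of Speech?

A part of speech is a category that words are assigned to according to their grammatical properties, for example:

- ADV: adverb; sample: very, well, exactly, tomorrow, up, down
- VERB: verb; sample: run, eat, ate, running, eats
- ADJ: adjective; sample: big, old, green
- DET: determiner; sample: a, an, this, no
- NOUN: noun; sample: girl, boy, cat, tree
- ADP: adposition (preposition); sample: in, to, during
- PROPN: proper noun; sample: Mary, London, HBO, Google
- CCONJ: coordinating conjunction; sample: and, or, but

Reference: http://universaldependencies.org/u/pos/all.html

The example below shows how to get the part of speech of each word in a sentence with spaCy: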

In [27]:
def getwordtokenattributes(text):
    """Collect selected spaCy attributes for every token in the text."""
    doc = nlp(text)
    result = []
    for token in doc:
        result.append({
            'text': token.text,
            'lemma_': token.lemma_,      # base form of the word
            'pos_': token.pos_,          # coarse-grained part of speech
            'tag_': token.tag_,          # fine-grained part-of-speech tag
            'dep_': token.dep_,          # syntactic dependency relation
            'shape_': token.shape_,      # word shape (capitalization, digits)
            'is_alpha': token.is_alpha,  # True if the token is alphabetic
            'is_stop': token.is_stop,    # True if the token is a stop word
        })
    return result

In [31]:
result = getwordtokenattributes(r'I cook with new best cook. A good cook cooks a good cook.')

In [32]:
print(result)

[{'text': 'I', 'lemma_': '-PRON-', 'pos_': 'PRON', 'tag_': 'PRP', 'dep_': 'nsubj', 'shape_': 'X', 'is_alpha': True, 'is_stop': False}, {'text': 'cook', 'lemma_': 'cook', 'pos_': 'VERB', 'tag_': 'VBP', 'dep_': 'ROOT', 'shape_': 'xxxx', 'is_alpha': True, 'is_stop': False}, {'text': 'with', 'lemma_': 'with', 'pos_': 'ADP', 'tag_': 'IN', 'dep_': 'prep', 'shape_': 'xxxx', 'is_alpha': True, 'is_stop': False}, {'text': 'new', 'lemma_': 'new', 'pos_': 'ADJ', 'tag_': 'JJ', 'dep_': 'amod', 'shape_': 'xxx', 'is_alpha': True, 'is_stop': False}, {'text': 'best', 'lemma_': 'good', 'pos_': 'ADJ', 'tag_': 'JJS', 'dep_': 'amod', 'shape_': 'xxxx', 'is_alpha': True, 'is_stop': False}, {'text': 'cook', 'lemma_': 'cook', 'pos_': 'NOUN', 'tag_': 'NN', 'dep_': 'pobj', 'shape_': 'xxxx', 'is_alpha': True, 'is_stop': False}, {'text': '.', 'lemma_': '.', 'pos_': 'PUNCT', 'tag_': '.', 'dep_': 'punct', 'shape_': '.', 'is_alpha': False, 'is_stop': False}, {'text': 'A', 'lemma_': 'a', 'pos_': 'DET', 'tag_': 'DT', 'd

A pandas DataFrame makes the result easier to read:

In [34]:
import pandas as pd
df = pd.DataFrame(result)
df

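| | dep_ | is_alpha | is_stop | lemma_ | pos_ | shape_ | tag_ | text |
|---|---|---|---|---|---|---|---|---|
| 0 | nsubj | True | False | -PRON- | PRON | X | PRP | I |
| 1 | ROOT | True | False | cook | VERB | xxxx | VBP | cook |
| 2 | prep | True | False | with | ADP | xxxx | IN | with |
| 3 | amod | True | False | new | ADJ | xxx | JJ | new |
| 4 | amod | True | False | good | ADJ | xxxx | JJS | best |
| 5 | pobj | True | False | cook | NOUN | xxxx | NN | cook |
| 6 | punct | False | False | . | PUNCT | . | . | . |
| 7 | det | True | False | a | DET | X | DT | A |
| 8 | amod | True | False | good | ADJ | xxxx | JJ | good |
| 9 | nsubj | True | False | cook | NOUN | xxxx | NN | cook |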


#### Getting Lemmas through Lemmatization

In [39]:
def lemmatization(sentence, allowed_postags=''):
    """Return the lemma of each token; see https://spacy.io/api/annotation"""
    doc = nlp(sentence)
    # allowed_postags is a comma-separated string such as 'NOUN,ADJ,VERB,ADV'.
    # In most cases it should be left empty; otherwise many words,
    # such as "no" and "or", are dropped.
    if len(allowed_postags) > 0:
        allowed = {postag.upper().strip() for postag in allowed_postags.split(',')}
        resultlist = [token.lemma_ for token in doc if token.pos_ in allowed]
    else:
        resultlist = [token.lemma_ for token in doc]
    return resultlist

In [40]:
text = r'The product is the best than others.'

In [41]:
print(' '.join(lemmatization(text)))

the product be the good than other .


#### Extracting Phrases with Part-of-Speech Patterns

In [42]:
import textacy
def extractverbphrase(text, pattern=r'(<ADV>*<NOUN|PROPN>*<VERB><DET>?<ADV>*<VERB|ADJ>+<ADP>?<DET>?<NUM>*<ADJ>*<NOUN|PROPN>*<ADV>?)|(<VERB>?<NOUN|PROPN>*<ADV>?<VERB><ADP>?<ADJ|VERB>*<ADP>?<DET>?<VERB>?<NOUN|PROPN>*)|(<DET>?<ADJ>+<NOUN|PROPN>+)|(<ADV>*<ADJ><ADP><DET>?<VERB|ADJ>*<NOUN|PROPN>*)|(<DET><NOUN><CCONJ><NOUN>)|(<NOUN|PROPN>*<CCONJ>?<NOUN|PROPN>+<ADP><NOUN|PROPN>+)|(<ADP><DET><NOUN|PROPN>+)'):
    # ADV: adverb; sample: very, well, exactly, tomorrow, up, down
    # VERB: verb; sample: run, eat, ate, running, eats
    # ADJ: adjective; sample: big, old, green
    # DET: determiner; sample: a, an, this, no
    # NOUN: noun; sample: girl, boy, cat, tree
    # ADP: adposition (preposition); sample: in, to, during
    # PROPN: proper noun; sample: Mary, London, HBO, Google
    # CCONJ: coordinating conjunction; sample: and, or, but
    # Reference: http://universaldependencies.org/u/pos/all.html
    # Note: pos_regex_matches comes from older textacy releases;
    # newer releases may no longer provide it.
    doc = nlp(text)
    return list(textacy.extract.pos_regex_matches(doc, pattern))

In [46]:
text = r'Effective April 24, 2017, there are new Investment Divisions for which Accumulation Unit information is not yet available.'
phraselist = extractverbphrase(text, pattern=r'(<PROPN>+)')
for phrase in phraselist:
    print(phrase)

April
Investment Divisions
Accumulation Unit


#### A Unified Text-Cleaning Routine

Chaining the cleaning steps together gives one unified text-cleaning routine:

In [44]:
import re

In [47]:
def removespecialchar(sentence):
    # Raw strings keep regex escapes such as \W from being read as string escapes
    result = re.sub(r'\W', ' ', sentence)
    return re.sub(r'( ){2,}', ' ', result)

In [48]:
def clearandlemmasentence(sentence,
                          stopword='for a an the and in among'):
    stoplist = set(stopword.split())
    sentence = removespecialchar(sentence).lower().strip()
    sentence = ' '.join([word.strip() for word
                         in sentence.lower().strip().split()
                         if len(word.strip()) > 0
                         and word not in stoplist]).strip()
    sentence = re.sub(r'(propn\s+){2,}', 'propn ', sentence)
    if len(sentence) == 0:
        sentence = 'only for test'
    lemmawordlist = lemmatization(sentence)
    return lemmawordlist

In [49]:
def replacevariabletextfromtextblock(textblock):
    """
    Variable Text:
    1. PROPN words, such as: Mainstay VP Funds Trust, replace them with propn
    2. Date part, such as January 1, 2018, replace them with date
    3. Number, such as 1, 2, replace with space
    :param textblock:
    :return:
    """
    # replace date string with "date"
    datepattern = r'((January|February|March|April|May|June|July|August|September|October|November|December)[\s]*[0-9]{1,2}[\s]*,[\s]*[0-9]{4})|([0-9]{1,2}/[0-9]{1,2}/[0-9]{4})'
    textblock = re.sub(datepattern, 'date', textblock)
    datepattern = r'\d{2}\/\d{2}\/(\d{4}|\d{2})'
    textblock = re.sub(datepattern, 'date', textblock)
    # Handle cases such as *CTIVP, where the PROPN cannot be recognized
    textblock = textblock.replace('*', ' ')
    textblock = re.sub(r'( ){2,}', ' ', textblock).strip()
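    # The words before and after "Money Market Fund" are often the name of a
    # specific fund company, so strip the company-specific parts here while
    # keeping the phrase itself from being replaced as a proper noun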
    textblock = textblock.replace(' of ', ' ')\
        .replace(' Inc. ', ' ')\
        .replace('&', '')\
        .replace(' LLC ', ' ')
    textblock = textblock.replace('-', ' ').replace('–', ' ')
    textblock = re.sub(r'( ){2,}', ' ', textblock).strip()
    phraselist = extractverbphrase(textblock, '<PROPN>+')
    phraselist.sort(key=lambda i: len(i), reverse=True)
    if len(phraselist) > 0:
        for phrase in phraselist:
            phrasetext = phrase.text
            # avoid remove important words which are related with category
            if 'money market fund' in phrasetext.lower():
                textblock = textblock.replace(phrasetext, 'money market fund')
            noexcludewordlist = ['date',
                                 ' merge ',
                                 ' merged ',
                                 ' merging ',
                                 ' merger ',
                                 'acquir',
                                 'surviv',
                                 'liquidat',
                                 'transfer',
                                 'reorganiz',
                                 'expense table',
                                 'fee summary',
                                 'operating expenses',
                                 'annual fund',
                                 'the adviser',
                                 'benefit payment',
                                 'variable account option']
            shouldignore = False
            for word in noexcludewordlist:
                if word in phrasetext.lower():
                    shouldignore = True
                    break
            if shouldignore:
                continue
            if not any([phrasetext.lower() == 'fund',
                        len(phrasetext.split()) <= 2]):
                textblock = textblock.replace(phrasetext, 'propn')
    textblock = textblock.replace('PIMCO', ' ')
    textblock = re.sub(r'\W', ' ', textblock)
    textblock = re.sub(r'\d', ' ', textblock)
    textblock = re.sub(r'(propn\s+){2,}', 'propn ', textblock)
    textblock = re.sub(r'( ){2,}', ' ', textblock).strip()
    return textblock

In [50]:
def cleardatafordoc2vector(doc):
    temp = ' '.join(
        clearandlemmasentence(replacevariabletextfromtextblock(doc),
                              'for a an of the and or to about after in among as at be been was were is are being b c d e f g h i j k l m n o p q r s t u v w x y z'
                              )).strip()
    temp = temp.replace('-PRON-', 'pron')
    return temp

Now we can test the effect:<br>
Original sentence: 122 66 32 15 14 5 13 17 *CTIVP SM – Eaton Vance Floating Rate Income Fund (Class 2) liquidated on April 27, 2018. 

In [51]:
text = r'122 66 32 15 14 5 13 17 *CTIVP SM – Eaton Vance Floating Rate Income Fund (Class 2) liquidated on April 27, 2018. '
print(cleardatafordoc2vector(text))

propn class liquidate on date


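## 3.3 Finding a Good Data Representation

### 3.3.1 Bag of Words

The bag-of-words (BoW) model is a document representation commonly used in information retrieval.

In information retrieval, the BoW model ignores a document's word order, grammar, and syntax and treats it simply as a collection of words, where each word occurs independently of whether any other word occurs.

In other words, the word at any position in the document is assumed to be chosen independently, unaffected by the document's meaning.

Drawbacks of the bag-of-words model:

The core of the model is building a vocabulary and then assigning values to the vocabulary entries from the text, but it has almost no way of expressing the relationship between similar words.

For example, "I like Beijing" and "I don't like Beijing" mean very different things, yet the bag-of-words model judges them to be highly similar.

"I like Beijing" and "I love Beijing" mean almost exactly the same thing, but the model cannot represent the close relationship between "like" and "love". (The model may still give these two sentences a high similarity score because of the other words they share; the point is that it knows nothing about how the two verbs relate.)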
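The drawback described above can be checked with a few lines of plain Python. This is a minimal sketch, assuming whitespace tokenization and raw word counts; `bow_cosine` is a hypothetical helper written for this illustration, and the English sentences stand in for the examples in the text.

```python
from collections import Counter
from math import sqrt


def bow_cosine(sentence_a, sentence_b):
    """Cosine similarity between the word-count vectors of two sentences."""
    bag_a = Counter(sentence_a.lower().split())
    bag_b = Counter(sentence_b.lower().split())
    # A Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * bag_b[word] for word, count in bag_a.items())
    norm_a = sqrt(sum(count * count for count in bag_a.values()))
    norm_b = sqrt(sum(count * count for count in bag_b.values()))
    return dot / (norm_a * norm_b)


negated = bow_cosine('I like Beijing', 'I do not like Beijing')
synonym = bow_cosine('I like Beijing', 'I love Beijing')
print(round(negated, 3), round(synonym, 3))
```

The negated sentence scores higher than the near-synonym, because the score depends only on which words are shared, not on what they mean.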

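The investment-name similarity application uses exactly this combination at its core: bag of words + TF-IDF + cosine similarity. Investment names on their own rarely, if ever, need semantic analysis.

Here is a code example: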

In [52]:
from gensim import corpora, models, similarities



In [53]:
namelist = [
    'ODDO BHF US Mid Cap CI-EUR H',
    'ODDO BHF US Mid Cap CR-USD',
    'Credit Suisse Index Fund (CH) - CSIF (CH) Bond Fiscal Strength EUR Blue ZA',
    'Winton Diversified Futures Fund (Luxembourg) C GBP Acc',
    'Prescient Core Equity Fund B5',
    'Robeco QI GTAA Plus DHL $',
    'Franklin US Rising Dividends T',
    'FT MLP Closed-End Fund & Energy 52 CA',
    'FT Richard Bern Adv TS Amer Ind 16-3 CA',
    'FT Municipal FT Income Select CE 81 CA',
    'Raiffeisen-Pensionsfonds-Österreich 2007 VT',
    'Multipartner SICAV - Carthesio Asian Credit Fund B EUR',
    'HSBC Wealth Strategic Solutions Fund (1) - Conservative Portfolio Income X',
    'American Beacon Flexible Bond Fund A Class',
    'Robeco QI GTAA Plus IHL $',
    'AXA World Funds - Global Equity Income M Capitalisation EUR']
stoplist = set('for a an of the and to in - $ &'.split())

In [54]:
data_train = []
for name in namelist:
    data_train.append([word for word in name.strip().split() 
                       if word not in stoplist])
print(data_train)

[['ODDO', 'BHF', 'US', 'Mid', 'Cap', 'CI-EUR', 'H'], ['ODDO', 'BHF', 'US', 'Mid', 'Cap', 'CR-USD'], ['Credit', 'Suisse', 'Index', 'Fund', '(CH)', 'CSIF', '(CH)', 'Bond', 'Fiscal', 'Strength', 'EUR', 'Blue', 'ZA'], ['Winton', 'Diversified', 'Futures', 'Fund', '(Luxembourg)', 'C', 'GBP', 'Acc'], ['Prescient', 'Core', 'Equity', 'Fund', 'B5'], ['Robeco', 'QI', 'GTAA', 'Plus', 'DHL'], ['Franklin', 'US', 'Rising', 'Dividends', 'T'], ['FT', 'MLP', 'Closed-End', 'Fund', 'Energy', '52', 'CA'], ['FT', 'Richard', 'Bern', 'Adv', 'TS', 'Amer', 'Ind', '16-3', 'CA'], ['FT', 'Municipal', 'FT', 'Income', 'Select', 'CE', '81', 'CA'], ['Raiffeisen-Pensionsfonds-Österreich', '2007', 'VT'], ['Multipartner', 'SICAV', 'Carthesio', 'Asian', 'Credit', 'Fund', 'B', 'EUR'], ['HSBC', 'Wealth', 'Strategic', 'Solutions', 'Fund', '(1)', 'Conservative', 'Portfolio', 'Income', 'X'], ['American', 'Beacon', 'Flexible', 'Bond', 'Fund', 'A', 'Class'], ['Robeco', 'QI', 'GTAA', 'Plus', 'IHL'], ['AXA', 'World', 'Funds', 'Glo

The code below shows how to build the bag-of-words dictionary and corpus and save them to files:

In [55]:
import os
os.makedirs('./nlpmodel', exist_ok=True)  # make sure the output folder exists
dictionary = corpora.Dictionary(data_train)
print('Index assigned to each token')
print(dictionary.token2id)
dictpath = './nlpmodel/corpus.dict'
dictionary.save(dictpath)
# Convert each tokenized name into (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in data_train]
print('Token ids and counts for each name')
for bow in corpus:
    print(bow)
modelpath = './nlpmodel/corpus.mm'
corpora.MmCorpus.serialize(modelpath, corpus)

Index assigned to each token
{'BHF': 0, 'CI-EUR': 1, 'Cap': 2, 'H': 3, 'Mid': 4, 'ODDO': 5, 'US': 6, 'CR-USD': 7, '(CH)': 8, 'Blue': 9, 'Bond': 10, 'CSIF': 11, 'Credit': 12, 'EUR': 13, 'Fiscal': 14, 'Fund': 15, 'Index': 16, 'Strength': 17, 'Suisse': 18, 'ZA': 19, '(Luxembourg)': 20, 'Acc': 21, 'C': 22, 'Diversified': 23, 'Futures': 24, 'GBP': 25, 'Winton': 26, 'B5': 27, 'Core': 28, 'Equity': 29, 'Prescient': 30, 'DHL': 31, 'GTAA': 32, 'Plus': 33, 'QI': 34, 'Robeco': 35, 'Dividends': 36, 'Franklin': 37, 'Rising': 38, 'T': 39, '52': 40, 'CA': 41, 'Closed-End': 42, 'Energy': 43, 'FT': 44, 'MLP': 45, '16-3': 46, 'Adv': 47, 'Amer': 48, 'Bern': 49, 'Ind': 50, 'Richard': 51, 'TS': 52, '81': 53, 'CE': 54, 'Income': 55, 'Municipal': 56, 'Select': 57, '2007': 58, 'Raiffeisen-Pensionsfonds-Österreich': 59, 'VT': 60, 'Asian': 61, 'B': 62, 'Carthesio': 63, 'Multipartner': 64, 'SICAV': 65, '(1)': 66, 'Conservative': 67, 'HSBC': 68, 'Portfolio': 69, 'Solutions': 70, 'Strategic': 71, 'Wealth': 72, 'X': 73, 'A': 74, 

Next, we compute sentence similarity with the TF-IDF model:

In [56]:
# Initialize the model
corpus = corpora.MmCorpus(modelpath)
dictionary = corpora.Dictionary.load(dictpath)
tfidf_model = models.TfidfModel(corpus)
index = similarities.SparseMatrixSimilarity(
    tfidf_model[corpus],
    num_features=len(dictionary.keys()))

In [58]:
print('Prepare the test sentence')
testtext = 'CR-USD ODDO Mid Cap BHF US'.split()
doc_text_vec = dictionary.doc2bow(testtext)
print(doc_text_vec)

Prepare the test sentence
[(0, 1), (2, 1), (4, 1), (5, 1), (6, 1), (7, 1)]


In [59]:
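# doc_text_vec holds raw counts (not TF-IDF weighted or normalized), so these
# scores can exceed 1; the next cell applies tfidf_model to the query first.
print('Similarity straight from the TF-IDF index; larger values mean higher similarity')
print(index.get_similarities(doc_text_vec))

Similarity straight from the TF-IDF index; larger values mean higher similarity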
[1.6776148  2.421514   0.         0.         0.         0.
 0.28899837 0.         0.         0.         0.         0.
 0.         0.         0.         0.        ]


In [60]:
test_simi = index[tfidf_model[doc_text_vec]]
test_simi = sorted(enumerate(test_simi), key=lambda item: -item[1])
outputlist = [test for test in test_simi if test[1] > 0.2]
print(outputlist)

[(1, 1.0), (0, 0.6401823)]


To see which name is the most similar, use the index to look it up in the original list:

In [61]:
print('raw sentence: ', ' '.join(testtext))
for output in outputlist:
    print(namelist[output[0]],'---------similarity: ', output[1])

raw sentence:  ODDO BHF US Mid Cap CR-USD
ODDO BHF US Mid Cap CR-USD ---------similarity:  1.0
ODDO BHF US Mid Cap CI-EUR H ---------similarity:  0.6401823
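To make the scores above less of a black box, here is a minimal sketch that recomputes them by hand with the standard library only. It assumes the weighting is term count times log2(N / document frequency), followed by scaling each vector to unit length, which is my understanding of gensim's `TfidfModel` defaults; the helper names are hypothetical.

```python
from collections import Counter
from math import log2, sqrt

names = [
    'ODDO BHF US Mid Cap CI-EUR H',
    'ODDO BHF US Mid Cap CR-USD',
    'Credit Suisse Index Fund (CH) - CSIF (CH) Bond Fiscal Strength EUR Blue ZA',
    'Winton Diversified Futures Fund (Luxembourg) C GBP Acc',
    'Prescient Core Equity Fund B5',
    'Robeco QI GTAA Plus DHL $',
    'Franklin US Rising Dividends T',
    'FT MLP Closed-End Fund & Energy 52 CA',
    'FT Richard Bern Adv TS Amer Ind 16-3 CA',
    'FT Municipal FT Income Select CE 81 CA',
    'Raiffeisen-Pensionsfonds-Österreich 2007 VT',
    'Multipartner SICAV - Carthesio Asian Credit Fund B EUR',
    'HSBC Wealth Strategic Solutions Fund (1) - Conservative Portfolio Income X',
    'American Beacon Flexible Bond Fund A Class',
    'Robeco QI GTAA Plus IHL $',
    'AXA World Funds - Global Equity Income M Capitalisation EUR']
stop_words = set('for a an of the and to in - $ &'.split())
docs = [[w for w in name.split() if w not in stop_words] for name in names]

# Document frequency: in how many names does each token appear?
doc_freq = Counter(token for doc in docs for token in set(doc))


def tfidf_unit_vector(tokens):
    """Weight known tokens by count * log2(N / df), then scale to unit length."""
    counts = Counter(token for token in tokens if token in doc_freq)
    weights = {t: c * log2(len(docs) / doc_freq[t]) for t, c in counts.items()}
    norm = sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def cosine(u, v):
    """Dot product of two unit-length sparse vectors."""
    return sum(weight * v.get(token, 0.0) for token, weight in u.items())


query = tfidf_unit_vector('CR-USD ODDO Mid Cap BHF US'.split())
scores = [cosine(query, tfidf_unit_vector(doc)) for doc in docs]
print(round(scores[1], 4), round(scores[0], 4))
```

Under these assumptions the two top scores should come out at about 1.0 and 0.64, in line with the gensim output above. Word order plays no role, which is why the shuffled query still matches the second name exactly.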


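### 3.3.2 TaggedDocument in Doc2Vec

Doc2Vec is similar to Word2Vec in that both capture some semantics, but its indexing unit is the sentence (document) rather than the word.

The Doc2Vec training set is made up of TaggedDocument objects. The official description reads: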

Represents a document along with a tag, input document format for class: `gensim.models.doc2vec.Doc2Vec`.

A single document, made up of `words` (a list of unicode string tokens) and `tags` (a list of tokens).

Tags may be one or more unicode string tokens, but typical practice (which will also be the most memory-efficient) is for the tags list to include a unique integer id as the only tag.

In [100]:
from gensim.models.doc2vec import TaggedDocument, Doc2Vec

In [106]:
sentencelist = [
    'may prize winner teacher bomb',
    'production value use cgi digital ink paint make thing look really slick voice fine well problem thing script',
    'got heart right place also wilt awhile',
    'prof movie goodness thing good movie',
    'well go forever',
    'overproduced generally disappointing effort likely rouse rush hour crowd']
x_train = []
for index, sentence in enumerate(sentencelist):
    document = TaggedDocument(sentence.split(), tags=['{0}'.format(index)])
    print(document)
    x_train.append(document)
# model_dm = Doc2Vec(x_train, min_count=1, window=3, size=200, sample=1e-3, negative=5, workers=2)
# print(model_dm)

TaggedDocument(['may', 'prize', 'winner', 'teacher', 'bomb'], ['0'])
TaggedDocument(['production', 'value', 'use', 'cgi', 'digital', 'ink', 'paint', 'make', 'thing', 'look', 'really', 'slick', 'voice', 'fine', 'well', 'problem', 'thing', 'script'], ['1'])
TaggedDocument(['got', 'heart', 'right', 'place', 'also', 'wilt', 'awhile'], ['2'])
TaggedDocument(['prof', 'movie', 'goodness', 'thing', 'good', 'movie'], ['3'])
TaggedDocument(['well', 'go', 'forever'], ['4'])
TaggedDocument(['overproduced', 'generally', 'disappointing', 'effort', 'likely', 'rouse', 'rush', 'hour', 'crowd'], ['5'])


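### 3.3.3 Tokenizer and pad_sequences in Keras

#### The text.Tokenizer Class

This class counts the words in a set of texts and builds a vocabulary, so that texts can be turned into vectors based on each word's position in that vocabulary. 
`__init__(num_words)`: the constructor; `num_words` is the maximum vocabulary size.

##### Methods

- fit_on_texts(texts): builds the token vocabulary from a collection of documents; `texts` is a list in which each element is one document.
- texts_to_sequences(texts): converts the documents into lists of word indices, with shape `[len(texts), len(text)]`, that is, (number of documents, length of each document).
- texts_to_matrix(texts): converts the documents into a matrix with shape `[len(texts), num_words]`.

##### Attributes

- document_count: the number of documents processed
- word_index: a dict mapping each word to its index id, starting from <b>1</b>
- word_counts: a dict holding the number of times each word occurs across all documents
- word_docs: a dict holding the number of documents each word appears in
- index_docs: a dict holding the number of documents each word id appears in

Example: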

In [113]:
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import text_to_word_sequence

In [114]:
sentencelist = [
    'may prize winner teacher bomb',
    'production value use cgi digital ink paint make thing look really slick voice fine well problem thing script',
    'got heart right place also wilt awhile',
    'prof movie goodness thing good movie',
    'well go forever',
    'overproduced generally disappointing effort likely rouse rush hour crowd']

In [118]:
print('text_to_word_sequence works much like str.split')
print(text_to_word_sequence(sentencelist[0]))

text_to_word_sequence works much like str.split
['may', 'prize', 'winner', 'teacher', 'bomb']


In [132]:
max_features = 2000

In [135]:
tokenizer = Tokenizer(num_words=max_features, split=' ')
tokenizer.fit_on_texts(sentencelist)
print('tokenizer.word_counts')
print(tokenizer.word_counts)
print()
print('tokenizer.word_index')
print(tokenizer.word_index)
print()
print('tokenizer.word_docs')
print(tokenizer.word_docs)

print()
print('tokenizer.index_docs')
print(tokenizer.index_docs) 

tokenizer.word_counts
OrderedDict([('may', 1), ('prize', 1), ('winner', 1), ('teacher', 1), ('bomb', 1), ('production', 1), ('value', 1), ('use', 1), ('cgi', 1), ('digital', 1), ('ink', 1), ('paint', 1), ('make', 1), ('thing', 3), ('look', 1), ('really', 1), ('slick', 1), ('voice', 1), ('fine', 1), ('well', 2), ('problem', 1), ('script', 1), ('got', 1), ('heart', 1), ('right', 1), ('place', 1), ('also', 1), ('wilt', 1), ('awhile', 1), ('prof', 1), ('movie', 2), ('goodness', 1), ('good', 1), ('go', 1), ('forever', 1), ('overproduced', 1), ('generally', 1), ('disappointing', 1), ('effort', 1), ('likely', 1), ('rouse', 1), ('rush', 1), ('hour', 1), ('crowd', 1)])

tokenizer.word_index
{'thing': 1, 'well': 2, 'movie': 3, 'may': 4, 'prize': 5, 'winner': 6, 'teacher': 7, 'bomb': 8, 'production': 9, 'value': 10, 'use': 11, 'cgi': 12, 'digital': 13, 'ink': 14, 'paint': 15, 'make': 16, 'look': 17, 'really': 18, 'slick': 19, 'voice': 20, 'fine': 21, 'problem': 22, 'script': 23, 'got': 24, 'heart

In [143]:
sequences = tokenizer.texts_to_sequences(sentencelist)
print(sequences)

[[4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16, 1, 17, 18, 19, 20, 21, 2, 22, 1, 23], [24, 25, 26, 27, 28, 29, 30], [31, 3, 32, 1, 33, 3], [2, 34, 35], [36, 37, 38, 39, 40, 41, 42, 43, 44]]
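The index numbers above follow a simple rule that can be sketched in plain Python: count every word, rank words by descending count (ties keep first-seen order), and number them from 1. This is only an approximation of what `Tokenizer` does. It skips the punctuation filtering and the `num_words` cut-off, and the variable names are chosen for this illustration.

```python
from collections import Counter

sentences = [
    'may prize winner teacher bomb',
    'production value use cgi digital ink paint make thing look really slick voice fine well problem thing script',
    'got heart right place also wilt awhile',
    'prof movie goodness thing good movie',
    'well go forever',
    'overproduced generally disappointing effort likely rouse rush hour crowd']

# Count every word across all sentences (a Counter keeps first-seen order).
word_counts = Counter()
for sentence in sentences:
    word_counts.update(sentence.lower().split())

# sorted() is stable, so words with equal counts stay in first-seen order.
ranked = sorted(word_counts, key=lambda word: -word_counts[word])
word_index = {word: rank for rank, word in enumerate(ranked, start=1)}

# Replace each word with its index, one list per sentence.
sequences = [[word_index[word] for word in sentence.lower().split()]
             for sentence in sentences]
print(sequences[0])
print(sequences[3])
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.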


One-hot style encoding (each row marks which word indices occur in the document):

In [142]:
print(tokenizer.texts_to_matrix(sentencelist))

[[0. 0. 0. ... 0. 0. 0.]
 [0. 1. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]
 [0. 1. 0. ... 0. 0. 0.]
 [0. 0. 1. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]
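As a rough sketch of how such a matrix can be built, the function below turns index sequences into binary rows of width `num_words`. The function name is hypothetical, the sequences are copied from the `texts_to_sequences` output earlier in this section, and the sketch covers only a binary encoding.

```python
def sequences_to_binary_matrix(sequences, num_words):
    """Build one row per sequence; column i is 1.0 if word index i occurs in it."""
    matrix = []
    for sequence in sequences:
        row = [0.0] * num_words
        for word_id in sequence:
            # Column 0 stays empty because word indices start at 1.
            if 0 < word_id < num_words:
                row[word_id] = 1.0
        matrix.append(row)
    return matrix


sequences = [
    [4, 5, 6, 7, 8],
    [9, 10, 11, 12, 13, 14, 15, 16, 1, 17, 18, 19, 20, 21, 2, 22, 1, 23],
    [24, 25, 26, 27, 28, 29, 30],
    [31, 3, 32, 1, 33, 3],
    [2, 34, 35],
    [36, 37, 38, 39, 40, 41, 42, 43, 44]]

matrix = sequences_to_binary_matrix(sequences, num_words=2000)
for row in matrix:
    print(row[:3])
```

A repeated word still yields a single 1.0, so each row records presence rather than frequency or word order.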


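pad_sequences pads every sequence to length maxlen; sentences shorter than maxlen are filled with 0 (at the front by default, as the output below shows).

<b><font color='red'>This matters a great deal: samples used for classification training in Keras must be padded to the same length before training can proceed.</font></b>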

In [148]:
X = pad_sequences(sequences, maxlen=20)
print(X)

[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  4  5  6  7  8]
 [ 0  0  9 10 11 12 13 14 15 16  1 17 18 19 20 21  2 22  1 23]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0 24 25 26 27 28 29 30]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0 31  3 32  1 33  3]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  2 34 35]
 [ 0  0  0  0  0  0  0  0  0  0  0 36 37 38 39 40 41 42 43 44]]
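The padding behaviour shown above can be sketched in plain Python. The helper below assumes padding at the front and, for sequences longer than `maxlen`, dropping items from the front, which is my understanding of the Keras defaults. `pad_pre` is a hypothetical name, and the sketch expects `maxlen` to be at least 1.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` items."""
    padded = []
    for sequence in sequences:
        kept = sequence[-maxlen:]  # drops items from the front when too long
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


print(pad_pre([[4, 5, 6, 7, 8], [2, 34, 35]], maxlen=8))
print(pad_pre([[1, 2, 3, 4, 5]], maxlen=3))
```

Every row comes out with the same length, so the result can be stacked into a single rectangular array for training.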


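## 3.4 Training a Model

After text cleaning and bag-of-words conversion or "vectorization" (in quotes because it is not the same concept as a true word vector; here it only means building a vector index over words or sentences), the next step is modeling.

The sections above already covered how to build a simple model such as TF-IDF. What about word-vector models (word2vec), sentence-vector models (doc2vec), and LSTM, biLSTM, GRU, or even biGRU models built with Keras?

We will look at these in detail in the `Common Models` chapter.

## 3.5 Using the Model

We will look at this in detail in the `Common Models` chapter.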

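# 4. Common NLP Packages

A craftsman who wants to do good work must first sharpen his tools. There are by now a great many Python packages dedicated to specific NLP tasks.

They are introduced one by one below.

## 4.1 Gensim

## 4.2 spaCy

## 4.3 NLTK

## 4.4 Jieba

# 5. Common Models

## 5.1 TF-IDF

## 5.2 Word Vectors

## 5.3 Document Vectors

## 5.4 Deep Learning Models

# 6. Sample Preparation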