## 1. Gensim Word2Vec
+ An open-source Python NLP library that can be used to train your own word2vec models.
+ Official site: https://radimrehurek.com/gensim/

### 1.1 Word2Vec Example

In [1]:
from gensim.models import Word2Vec

In [2]:
sentences = [['cat', 'say', 'meow'], ['dog', 'say', 'woof']]
model = Word2Vec(sentences, min_count=1)

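**Word2Vec hyperparameters**:
    + size: dimensionality of the word vectors; default 100 (renamed `vector_size` in gensim 4.0 and later)
    + window: maximum distance between the current word and a context word during training; default 5
    + min_count: minimum number of occurrences; words that appear fewer times are ignored; default 5
    + workers: number of worker threads used for training; default 3
    + sg: training algorithm, CBOW (0) or skip-gram (1); default 0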
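
To make `window` and `min_count` concrete, here is a minimal pure-Python sketch of how a skip-gram style trainer might turn sentences into (center, context) pairs. It is a simplification and not gensim's implementation (gensim, for instance, also randomly shrinks the effective window); `build_vocab` and `skipgram_pairs` are hypothetical helper names.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Keep only words that occur at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

def skipgram_pairs(sentences, window=5, min_count=1):
    """Yield (center, context) pairs at most `window` positions apart."""
    vocab = build_vocab(sentences, min_count)
    for sentence in sentences:
        # Drop rare words first, as a min_count filter would.
        words = [w for w in sentence if w in vocab]
        for i, center in enumerate(words):
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    yield center, words[j]

sentences = [['cat', 'say', 'meow'], ['dog', 'say', 'woof']]
pairs = list(skipgram_pairs(sentences, window=1, min_count=1))
print(pairs)
```

With `min_count=2` only `'say'` survives in this toy corpus, so no pairs are produced at all, which is why the example in this section passes `min_count=1`.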

In [3]:
model.wv.vocab

{'cat': <gensim.models.keyedvectors.Vocab at 0x1a1544e860>,
 'say': <gensim.models.keyedvectors.Vocab at 0x1a1544e898>,
 'meow': <gensim.models.keyedvectors.Vocab at 0x1a1544e8d0>,
 'dog': <gensim.models.keyedvectors.Vocab at 0x1a1544e908>,
 'woof': <gensim.models.keyedvectors.Vocab at 0x1a1544e940>}

In [4]:
model.wv.most_similar('cat')

[('meow', 0.061344344168901443),
 ('dog', 0.03607504069805145),
 ('woof', -0.011934641748666763),
 ('say', -0.05083806812763214)]
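
The scores in this output are cosine similarities between word vectors. A minimal sketch of that ranking, using made-up 3-dimensional vectors rather than the trained model (`toy_vectors` and `rank_by_cosine` are hypothetical names, not part of gensim):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors for illustration only.
toy_vectors = {
    'cat': [1.0, 0.0, 0.0],
    'meow': [0.9, 0.1, 0.0],
    'dog': [0.0, 1.0, 0.0],
}
print(rank_by_cosine('cat', toy_vectors))
```

Scores close to zero, as in the real output above, are what one would expect from a model trained on only two three-word sentences.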

In [5]:
model.wv['dog']

array([-0.00324721, -0.00372931, -0.00132748, -0.0012167 ,  0.00421101,
        0.0016716 , -0.00329486,  0.0021082 ,  0.00307677,  0.00447989,
        0.00136921, -0.00494903,  0.00495272, -0.0016848 ,  0.00385503,
       -0.00326887, -0.00105373,  0.00193478, -0.0004452 , -0.00385089,
        0.00210099,  0.00366337,  0.00010262, -0.00441987,  0.00183587,
       -0.00135071, -0.00198604,  0.00108519, -0.00120252, -0.00399885,
        0.00490881,  0.00253169, -0.00297924, -0.00365355,  0.00378577,
        0.00476437, -0.00082827, -0.00486974,  0.00374594, -0.00163786,
        0.00125674,  0.00491777,  0.00112151,  0.00118815,  0.00050925,
        0.00028776,  0.00301926,  0.00326783,  0.00044682,  0.00361566,
        0.00413984,  0.00042816, -0.00342651,  0.00253967,  0.00255286,
       -0.00066174,  0.00186746,  0.00250534, -0.00108073,  0.00447955,
       -0.00357158,  0.00496709, -0.00258402,  0.00299036, -0.00422876,
        0.00032904,  0.0024151 ,  0.00031464,  0.00438139, -0.00

### 1.2 Training Word2Vec on a News Corpus

In [6]:
database = '/Users/liling/Documents/00TrelloNLP/DATA/news_data.csv'

In [7]:
import pandas as pd

In [8]:
content = pd.read_csv(database, encoding='gb18030')

In [9]:
content.head()

Unnamed: 0,id,author,source,content,feature,title,url
0,89617,,快科技@http://www.kkj.cn/,此外，自本周（6月12日）起，除小米手机6等15款机型外，其余机型已暂停更新发布（含开发版/...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""37""...",小米MIUI 9首批机型曝光：共计15款,http://www.cnbeta.com/articles/tech/623597.htm
1,89616,,快科技@http://www.kkj.cn/,骁龙835作为唯一通过Windows 10桌面平台认证的ARM处理器，高通强调，不会因为只考...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""15""...",骁龙835在Windows 10上的性能表现有望改善,http://www.cnbeta.com/articles/tech/623599.htm
2,89615,,快科技@http://www.kkj.cn/,此前的一加3T搭载的是3400mAh电池，DashCharge快充规格为5V/4A。\r\n...,"{""type"":""科技"",""site"":""cnbeta"",""commentNum"":""18""...",一加手机5细节曝光：3300mAh、充半小时用1天,http://www.cnbeta.com/articles/tech/623601.htm
3,89614,,新华社,这是6月18日在葡萄牙中部大佩德罗冈地区拍摄的被森林大火烧毁的汽车。新华社记者张立云摄\r\n,"{""type"":""国际新闻"",""site"":""环球"",""commentNum"":""0"",""j...",葡森林火灾造成至少62人死亡 政府宣布进入紧急状态（组图）,http://world.huanqiu.com/hot/2017-06/10866126....
4,89613,胡淑丽_MN7479,深圳大件事,（原标题：44岁女子跑深圳约会网友被拒，暴雨中裸身奔走……）\r\n@深圳交警微博称：昨日清...,"{""type"":""新闻"",""site"":""网易热门"",""commentNum"":""978"",...",44岁女子约网友被拒暴雨中裸奔 交警为其披衣相随,http://news.163.com/17/0618/00/CN617P3Q0001875...


In [10]:
# Use the news `content` column as the training corpus (first 100 articles)
samples = content['content'][:100].tolist()

In [11]:
import jieba

In [12]:
def cut(string):
    return ' '.join(jieba.cut(string))

In [13]:
cut('这是一个测试')

Building prefix dict from /Users/liling/anaconda3/envs/UdaCourse/lib/python3.6/site-packages/jieba/dict.txt ...
Dumping model to file cache /var/folders/jh/b14_bh2n1753x9hvqr8zhtg40000gn/T/jieba.cache
Loading model cost 1.6531429290771484 seconds.
Prefix dict has been built succesfully.


'这是 一个 测试'

In [14]:
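# Segment the training corpus and write it to a file, one sample per write.
# UTF-8 is set explicitly so that LineSentence can read the file back.
with open('mini_samples.txt', 'w', encoding='utf-8') as f: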
    for s in samples:
        f.write(cut(s) + '\n')

In [16]:
from gensim.models.word2vec import LineSentence

In [18]:
sentences = LineSentence('mini_samples.txt')
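
`LineSentence` expects one sentence per line with tokens separated by whitespace, which is why the segmented text was written out space-joined. Roughly, it behaves like the following sketch (a simplification under that assumption; `iter_sentences` is a hypothetical name, not gensim's implementation):

```python
import io

def iter_sentences(lines):
    """Yield each non-empty line as a list of whitespace-separated tokens."""
    for line in lines:
        tokens = line.split()
        if tokens:
            yield tokens

# A file object opened on mini_samples.txt would work the same way;
# StringIO stands in for it here so the sketch is self-contained.
fake_file = io.StringIO('这是 一个 测试\n小米 手机\n')
print(list(iter_sentences(fake_file)))
```

Streaming line by line like this means the whole corpus never has to be held in memory at once.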

In [19]:
model = Word2Vec(sentences, min_count=1)

In [20]:
model.wv['小米']

array([ 0.02697092, -0.01238676,  0.00570923,  0.01853787,  0.02806742,
       -0.01462012,  0.0350339 , -0.00179451,  0.01759951, -0.02614262,
       -0.02788118,  0.03073505,  0.0157554 ,  0.04029954,  0.02168221,
        0.00470081, -0.00819971,  0.01049386, -0.00412498,  0.02106893,
        0.00220568, -0.01992517,  0.01322071, -0.00595903, -0.00906719,
        0.00678828,  0.0349687 ,  0.01869414, -0.01228004,  0.05206617,
       -0.02875679,  0.02785175,  0.01187365, -0.01059754, -0.01204202,
       -0.02703042,  0.00813641, -0.01283405, -0.00111703,  0.0076571 ,
       -0.02203112,  0.01463147, -0.00241408, -0.00466241, -0.00071005,
       -0.02957297, -0.02144731,  0.01862562, -0.02289989, -0.00208864,
        0.0214048 ,  0.00385532,  0.02306326,  0.01654031,  0.01121977,
       -0.03119729,  0.00314351, -0.0039577 ,  0.00575926, -0.00233956,
       -0.00728019,  0.02484645,  0.00684136,  0.02297481, -0.01400474,
        0.02863117,  0.00999623, -0.01596094,  0.02313304,  0.02

In [21]:
model.wv.most_similar('小米')

[('变化', 0.9910062551498413),
 ('国内', 0.9909408092498779),
 ('实施', 0.9909210205078125),
 ('人们', 0.9909133911132812),
 ('人民币', 0.9909095764160156),
 ('请', 0.9908913969993591),
 ('投资者', 0.9908566474914551),
 ('完全', 0.9908537268638611),
 ('车辆', 0.9908179640769958),
 ('学园', 0.9908175468444824)]

In [22]:
model.wv.vocab

{'此外': <gensim.models.keyedvectors.Vocab at 0x1a236971d0>,
 '，': <gensim.models.keyedvectors.Vocab at 0x1a23697080>,
 '自': <gensim.models.keyedvectors.Vocab at 0x1a236970b8>,
 '本周': <gensim.models.keyedvectors.Vocab at 0x1a23697320>,
 '（': <gensim.models.keyedvectors.Vocab at 0x1a23697358>,
 '6': <gensim.models.keyedvectors.Vocab at 0x1a23697390>,
 '月': <gensim.models.keyedvectors.Vocab at 0x1a236973c8>,
 '12': <gensim.models.keyedvectors.Vocab at 0x1a23697400>,
 '日': <gensim.models.keyedvectors.Vocab at 0x1a23697438>,
 '）': <gensim.models.keyedvectors.Vocab at 0x1a23697470>,
 '起': <gensim.models.keyedvectors.Vocab at 0x1a236974a8>,
 '除': <gensim.models.keyedvectors.Vocab at 0x1a236974e0>,
 '小米': <gensim.models.keyedvectors.Vocab at 0x1a23697518>,
 '手机': <gensim.models.keyedvectors.Vocab at 0x1a23697550>,
 '等': <gensim.models.keyedvectors.Vocab at 0x1a23697588>,
 '15': <gensim.models.keyedvectors.Vocab at 0x1a236975c0>,
 '款': <gensim.models.keyedvectors.Vocab at 0x1a236975f8>,
 '机型': <
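
The vocabulary above includes punctuation tokens such as '，' and '（'. One possible cleanup, sketched here as an assumption rather than as part of the original pipeline, is to drop tokens that consist only of punctuation or whitespace before writing the corpus. `is_punct` and `clean_tokens` are hypothetical helpers built on the standard `unicodedata` module.

```python
import unicodedata

def is_punct(token):
    """True if every character is Unicode punctuation or whitespace."""
    return all(
        unicodedata.category(ch).startswith('P') or ch.isspace()
        for ch in token
    )

def clean_tokens(tokens):
    """Drop empty tokens and tokens made up only of punctuation/whitespace."""
    return [t for t in tokens if t and not is_punct(t)]

tokens = ['此外', '，', '自', '本周', '（', '6', '月', '12', '日', '）']
print(clean_tokens(tokens))
```

Digits and single-character words are kept here; whether to filter those as well depends on the downstream task.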

## 2. NER and Dependency Parsing
+ NER identifies person names (person), organization names (organization), and locations (location).
+ HIT (Harbin Institute of Technology) LTP: https://github.com/HIT-SCIR/ltp
+ Stanford CoreNLP: https://stanfordnlp.github.io/CoreNLP/