### What NLP concerns
* words
* entities
* syntax
* semantics
* dialogue (business-specific conversation scenarios)
* reading comprehension
* generation
* representation + policy

### How not to represent words as numbers
* one-hot
    * Cannot distinguish similar words from dissimilar ones.
    * Space-consuming.
    * Vectors must be recalculated when new words are added.
* PCA (principal component analysis)
* SVD (singular value decomposition)
    * All word vectors must be recalculated when new words are added.
    * Computationally expensive.
    * Hard to implement.
    * Cannot handle polysemy.
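The drawbacks listed above are easy to see on a toy vocabulary. The following is only a sketch: the vocabulary, the co-occurrence counts, and the helper name `one_hot` are made up for illustration and do not come from the notebook's data.

```python
import numpy as np

# Hypothetical toy vocabulary, chosen only for illustration.
vocab = ["cat", "dog", "car", "truck"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return the one-hot vector of `word` over the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

# Any two distinct one-hot vectors are orthogonal, so "cat" looks exactly as
# unrelated to "dog" as it does to "truck".
cat_dog = one_hot("cat") @ one_hot("dog")
cat_truck = one_hot("cat") @ one_hot("truck")

# The full one-hot table is |V| x |V|; adding a word changes every vector's length.
one_hot_matrix = np.eye(len(vocab))

# SVD alternative: factor a (made-up) symmetric co-occurrence matrix and keep
# the top-k singular directions as dense word vectors. Adding a word means
# rebuilding the matrix and recomputing the whole factorization.
cooccurrence = np.array([
    [0, 4, 0, 1],
    [4, 0, 1, 0],
    [0, 1, 0, 5],
    [1, 0, 5, 0],
], dtype=float)
U, S, Vt = np.linalg.svd(cooccurrence)
k = 2
dense_vectors = U[:, :k] * S[:k]  # one k-dimensional row per word

print(cat_dog, cat_truck)
print(dense_vectors.shape)
```

Both dot products are zero, which is the "cannot distinguish similar words" problem in concrete form.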

### What features do our vectors need
* space-efficient
* adaptively updatable (extensible)
* semantic similarity (handles polysemy reasonably well)

### Embedding
* graph embedding, node embedding, etc
* importance of representation.
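One common way to picture an embedding is as a lookup table: a matrix with one dense row per item (a word, a graph node, and so on). The sketch below, with made-up sizes and random values, shows that picking a row by index gives the same result as multiplying a one-hot vector by the matrix. The names `embedding_matrix` and `word_index` are assumptions for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embedding_dim = 5, 3  # illustrative sizes only
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

word_index = 2  # hypothetical index of some word in the vocabulary

# Looking up a row by index ...
looked_up = embedding_matrix[word_index]

# ... is equivalent to multiplying the one-hot vector by the matrix.
one_hot_vec = np.zeros(vocab_size)
one_hot_vec[word_index] = 1.0
projected = one_hot_vec @ embedding_matrix

print(np.allclose(looked_up, projected))
```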

### softmax: from a vector to a probability distribution

In [12]:
import numpy as np

In [30]:
vector = np.array([1,2,3,4,-1,-4,0])

In [31]:
np.sum(np.exp(vector))

86.17721996378178

In [32]:
def softmax(vector):
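    # Subtract the max to prevent overflow in np.exp; the result is unchanged.
    # Build a new array instead of using -=, which would modify the caller's array.
    vector = vector - np.max(vector)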
    exp = np.exp(vector)
    exp_sum = np.sum(exp)
    return exp / exp_sum

In [33]:
softmax(vector)

array([3.15429278e-02, 8.57425675e-02, 2.33072463e-01, 6.33556641e-01,
       4.26887107e-03, 2.12534576e-04, 1.16039947e-02])

In [34]:
np.sum(softmax(vector))

1.0000000000000002
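Subtracting the maximum matters because softmax(x) = softmax(x - c) for any constant c, while `np.exp` overflows for large inputs. The sketch below compares a version without the shift against a shifted one. The function names `naive_softmax` and `stable_softmax` and the input values are assumptions made for this example.

```python
import numpy as np

def naive_softmax(vector):
    """Softmax without the max shift; overflows for large inputs."""
    exp = np.exp(vector)
    return exp / np.sum(exp)

def stable_softmax(vector):
    """Softmax with the max subtracted first; does not modify the input."""
    shifted = np.asarray(vector, dtype=float) - np.max(vector)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

large = np.array([1000.0, 1001.0, 1002.0])

# exp(1000) overflows to inf in float64, and inf / inf is nan.
with np.errstate(over="ignore", invalid="ignore"):
    naive = naive_softmax(large)

# The shifted version computes softmax([-2, -1, 0]) instead.
stable = stable_softmax(large)

print(naive)
print(stable)
```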

### word2vec
> Two algorithms
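* skip-gram: the center word predicts its surroundings, i.e. which words appear around the target word
* CBOW (continuous bag of words): the surroundings predict the center word, i.e. which target word appears given the surrounding words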
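In the standard formulation, skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. The difference shows up in how training examples are built from a sentence. This is a minimal sketch with a made-up sentence; the helper names `skipgram_pairs` and `cbow_examples` are hypothetical and not part of gensim.

```python
def skipgram_pairs(tokens, window):
    """Yield (center, context) pairs: the center word predicts each neighbour."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

def cbow_examples(tokens, window):
    """Yield (context_words, center) examples: the neighbours predict the center."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        yield context, center

tokens = ["the", "quick", "brown", "fox"]  # made-up sentence
sg = list(skipgram_pairs(tokens, window=1))
cb = list(cbow_examples(tokens, window=1))

print(sg)
print(cb)
```

With `window=1`, skip-gram produces one pair per neighbour, whereas CBOW produces one example per position with all neighbours grouped together.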

> Two training methods
* Hierarchical softmax
* Negative sampling
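Both methods avoid a full softmax over the vocabulary. Negative sampling does so by training a binary classifier that separates the true context word from a handful of sampled noise words. The sketch below computes that loss for a single (center, context) pair. The vectors are random and the names (`negative_sampling_loss`, `center_vec`, and so on) are assumptions for this example, not gensim's internals.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one (center, context) pair with k sampled negative words."""
    # Push the true pair's score up ...
    positive_term = -np.log(sigmoid(context_vec @ center_vec))
    # ... and push the scores of the k noise words down.
    negative_term = -np.sum(np.log(sigmoid(-(negative_vecs @ center_vec))))
    return positive_term + negative_term

rng = np.random.default_rng(0)
dim, k = 8, 5  # illustrative sizes only
center_vec = rng.normal(scale=0.1, size=dim)
context_vec = rng.normal(scale=0.1, size=dim)
negative_vecs = rng.normal(scale=0.1, size=(k, dim))

loss = negative_sampling_loss(center_vec, context_vec, negative_vecs)
print(loss)
```

Each update touches only `k + 1` output vectors instead of the whole vocabulary, which is where the speed-up comes from.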

In [7]:
import smart_open
from gensim.models.word2vec import Word2Vec, LineSentence


In [31]:
data_file = 'D:\\Github\\NLP\\Artificial_Intelligence_for_NLP\\Week_04_0727\\Assignment\\wiki_00_cut'
wiki_sentences = LineSentence(smart_open.open(data_file, 'r', encoding='utf-8'))
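`LineSentence` treats each line of the file as one sentence whose words are already preprocessed and separated by whitespace. The sketch below mimics that behaviour in plain Python to show what the model ends up consuming. `iter_line_sentences` is a hypothetical helper, not gensim's implementation, and it ignores `max_sentence_length`.

```python
import os
import tempfile

def iter_line_sentences(path, limit=None):
    """Yield one list of whitespace-separated tokens per line of `path`."""
    with open(path, "r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle):
            if limit is not None and line_number >= limit:
                break
            yield line.split()

# Write a tiny pre-tokenized corpus to a temporary file for the demo.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("数学 是 一 门 学科\nword2vec learns word vectors\n")
    corpus_path = tmp.name

sentences = list(iter_line_sentences(corpus_path))
first_only = list(iter_line_sentences(corpus_path, limit=1))
os.remove(corpus_path)

print(sentences)
```

This is why Chinese text has to be word-segmented before training: the whitespace split is the only tokenization applied.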

In [32]:
# Note: `size` is the gensim 3.x parameter name; gensim 4.x renamed it to `vector_size`.
model_wiki = Word2Vec(wiki_sentences, size=200, window=10, min_count=20, workers=4)

In [43]:
model_wiki.save('model2vec.model')

In [36]:
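# Note: this only opens a text-mode file handle; it does not load the model.
# To reload the saved model, use Word2Vec.load('model2vec.model').
smart_open.open('model2vec.model')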

<_io.TextIOWrapper name='model2vec.model' mode='r' encoding='cp936'>

In [37]:
model_wiki.wv.most_similar('数学')

[('研究', 0.9977470636367798),
 ('可以', 0.9974096417427063),
 ('操作系统', 0.9969146251678467),
 ('系统', 0.9966925382614136),
 ('一个', 0.9958970546722412),
 ('或', 0.9955787658691406),
 ('领域', 0.9952758550643921),
 ('这些', 0.9941273331642151),
 ('是', 0.9937964081764221),
 ('例如', 0.9932696223258972)]
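`most_similar` ranks words by cosine similarity to the query vector. The sketch below reproduces that ranking on made-up three-dimensional vectors; `toy_most_similar` and `toy_vectors` are hypothetical names, and the numbers are not taken from the trained model.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_most_similar(query, vectors, topn=2):
    """Rank every other word by cosine similarity to `query`."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up vectors: "math" and "algebra" point in nearly the same direction.
toy_vectors = {
    "math": np.array([1.0, 0.9, 0.0]),
    "algebra": np.array([0.9, 1.0, 0.1]),
    "banana": np.array([0.0, 0.1, 1.0]),
}

ranking = toy_most_similar("math", toy_vectors)
print(ranking)
```

When every score is close to 1, as in the output above, the vectors have probably not separated much yet, for example because the corpus is small. That reading is an assumption rather than something the notebook verifies.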

In [38]:
help(model_wiki.save)

Help on method save in module gensim.models.word2vec:

save(*args, **kwargs) method of gensim.models.word2vec.Word2Vec instance
    Save the model.
    This saved model can be loaded again using :func:`~gensim.models.word2vec.Word2Vec.load`, which supports
    online training and getting vectors for vocabulary words.
    
    Parameters
    ----------
    fname : str
        Path to the file.



In [39]:
LineSentence?

Init signature: LineSentence(source, max_sentence_length=10000, limit=None)
Docstring:
Iterate over a file that contains sentences: one line = one sentence.
Words must be already preprocessed and separated by whitespace.
Init docstring:
Parameters
----------
source : string or a file-like object
    Path to the file on disk, or an already-open file object (must support `seek(0)`).
limit : int or None
    Clip the file to the first `limit` lines. Do no clipping if `limit is None` (the default).

Examples
--------
.. sourcecode:: pycon

    >>> from gensim.test.utils import datapath
    >>> sentences = LineSentence(datapath('lee_background.cor'))
    >>> for sentence in sentences:
    ...     pass
File:           c:\users\administrator\anaconda3\lib\site-packages\gensim\models\word2vec.py