## Testing Notebook
In this notebook, I try out gensim for working with word embeddings.

### Load From Pre-trained Txt File
First, I load the pretrained vectors from the txt file.
Loading all of the data can take more than 30 minutes on my laptop, so it is better to set a limit on the number of vectors read.

I also save the model to a bin file for later use, since loading from it is much faster.

In [9]:
from gensim.models import KeyedVectors

file = 'data/Tencent_AILab_ChineseEmbedding/Tencent_AILab_ChineseEmbedding.txt'
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False, limit=100000)
wv_from_text.init_sims(replace=True)  # L2-normalize the vectors in place to save memory (deprecated since gensim 4.0)
wv_from_text.save('./test.bin')
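
To make the effect of `limit` concrete, here is a minimal sketch of reading only the first few entries of a word2vec-format text file. It assumes the usual layout of a `<count> <dim>` header line followed by one `<word> <v1> ... <vdim>` line per entry. The function `read_word2vec_text` and the sample data are hypothetical and only for illustration; this is not how gensim implements loading.

```python
import io


def read_word2vec_text(handle, limit=None):
    """Parse word2vec text format from an open text handle.

    Hypothetical helper for illustration: reads the '<count> <dim>' header,
    then at most `limit` entries of the form '<word> <v1> ... <vdim>'.
    """
    count, dim = (int(x) for x in handle.readline().split())
    if limit is not None:
        count = min(count, limit)  # stop early, like a load limit
    vectors = {}
    for _ in range(count):
        parts = handle.readline().rstrip().split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError('unexpected vector length for %r' % word)
        vectors[word] = values
    return vectors


# Made-up sample data standing in for the real txt file
sample = io.StringIO("3 2\n中国 0.1 0.2\n美国 0.3 0.4\n国家 0.5 0.6\n")
loaded = read_word2vec_text(sample, limit=2)
print(list(loaded))
```

Because only the first entries are read, a limit trades vocabulary coverage for loading time.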

In [18]:
wv_from_text.most_similar('阿森纳')

[('曼联', 0.9391189813613892),
 ('切尔西', 0.9346709847450256),
 ('利物浦', 0.9224727153778076),
 ('热刺', 0.913425087928772),
 ('温格', 0.9049314856529236),
 ('曼城', 0.9032983779907227),
 ('西布朗', 0.8680431246757507),
 ('英超', 0.863417387008667),
 ('斯旺西', 0.8625228404998779),
 ('利物浦的', 0.8616222143173218)]

### Load From bin File
In practice, I load the model from the bin file, which is fast and finishes in a few seconds.

In [20]:
model = KeyedVectors.load('./test.bin')

## Basic Function Usage

### most_similar
This function accepts a query word and returns the most similar words in the vocabulary, ranked by similarity score (the top 10 by default). The score is the cosine similarity between the word vectors.

In [21]:
model.most_similar('中国')

[('在中国', 0.7863314747810364),
 ('中国国内', 0.7796300053596497),
 ('说中国', 0.7657051086425781),
 ('其他国家', 0.765122652053833),
 ('国家', 0.7572988271713257),
 ('中国本土', 0.7543554306030273),
 ('美国', 0.7531554698944092),
 ('中国目前', 0.7521167993545532),
 ('中国发展', 0.750878095626831),
 ('目前中国', 0.7493101954460144)]
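
As a rough illustration of what a cosine-similarity ranking does, here is a minimal sketch in plain Python. The helpers `cosine_similarity` and `top_n_similar` and the 2-dimensional toy vectors are hypothetical and made up for this example; gensim's own implementation is vectorized and differs in detail.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_n_similar(query, vectors, n=10):
    """Rank every other key in `vectors` by cosine similarity to `query`."""
    q = vectors[query]
    scores = [(word, cosine_similarity(q, vec))
              for word, vec in vectors.items() if word != query]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


# Hypothetical toy vectors, not taken from the real model
toy_vectors = {
    'a': [1.0, 0.0],
    'b': [2.0, 0.0],  # same direction as 'a', different length
    'c': [1.0, 1.0],  # 45 degrees away from 'a'
    'd': [0.0, 1.0],  # orthogonal to 'a'
}
ranked = top_n_similar('a', toy_vectors, n=3)
print(ranked)
```

Note that `'b'` ranks first even though it is twice as long as `'a'`: cosine similarity depends only on direction, not on vector length.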

### get embedding vector
I can get a word's embedding vector and use it to compute other scores with different algorithms.

In [27]:
v = model['中国']
print(v.shape)
print(v)

(200,)
[ 0.0577162  -0.06945625  0.05540956  0.11758969  0.09546565  0.03729196
 -0.02974207  0.06095821 -0.01023129  0.01493978  0.04045184 -0.00883508
 -0.11121243  0.0104354   0.02105665 -0.00915447 -0.06020505 -0.19626671
 -0.01872811 -0.03399367 -0.05467222 -0.06345588 -0.03076352 -0.00747567
  0.02761459  0.00362954 -0.05004403  0.07759292  0.0802442   0.08721583
 -0.02959119  0.01097502 -0.0118164   0.01257534  0.01229093 -0.0234512
  0.07917012 -0.00348566 -0.18709278  0.02513031 -0.09646855  0.05539405
  0.07661617  0.02312359 -0.04140972 -0.01033562 -0.03993502 -0.21838924
 -0.0030519  -0.06632831 -0.07338238  0.04787275 -0.05623603  0.04381279
  0.09508055 -0.02240663 -0.02314732  0.04810788  0.0465389   0.08131615
  0.05668471  0.06790247  0.02324101  0.144167    0.01661706  0.08001515
 -0.00941637  0.01126704  0.07383653 -0.0613494  -0.06450471 -0.09825138
  0.08055142 -0.01678923 -0.11295786  0.01174735  0.01541066  0.16284311
  0.15695193 -0.0534214   0.04785389  0.11198
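
As one example of computing scores from raw vectors, here is a minimal sketch comparing the dot product and the Euclidean distance of two unit-length vectors. The vectors were normalized with `init_sims(replace=True)` when the model was built, so for them the dot product equals the cosine similarity and the squared Euclidean distance equals `2 - 2 * cosine`. The helper names and the toy vectors below are hypothetical; with the real model, `v = model['中国']` would take the place of the toy data.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy vectors standing in for two word embeddings
u = normalize([3.0, 4.0])
w = normalize([4.0, 3.0])

cos_sim = dot(u, w)
dist = euclidean_distance(u, w)
print(cos_sim, dist)
```

For unit-length vectors the two scores therefore produce the same ranking: a higher cosine similarity always means a smaller Euclidean distance.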

### check whether a word is in the model
The dataset may not cover every word I query. Because I loaded only a limited subset of the vectors, a query word is even more likely to be missing. I should check that the word exists first to avoid a `KeyError`.

In [28]:
# Test membership on the model itself; `model.vocab` was removed in gensim 4.0
if 'lyc' in model:
    print('In Vocabulary')
else:
    print('Not in Vocabulary')

Not in Vocabulary
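
When querying several words at once, the same check can be applied up front. Below is a minimal sketch, assuming only that the vocabulary object supports the `in` operator. The helper `split_by_vocabulary` and the toy vocabulary are hypothetical; a plain dict stands in for the loaded model here.

```python
def split_by_vocabulary(words, vocabulary):
    """Separate words into those present in `vocabulary` and those missing."""
    known = [w for w in words if w in vocabulary]
    unknown = [w for w in words if w not in vocabulary]
    return known, unknown


# A plain dict stands in for the loaded model (hypothetical toy data)
toy_vocabulary = {'中国': 0, '美国': 1}
known, unknown = split_by_vocabulary(['中国', 'lyc', '美国'], toy_vocabulary)
print(known)
print(unknown)
```

Only the words in `known` would then be passed on to lookups such as `most_similar`.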
