## Testing Notebook
In this notebook, I use gensim to work with pre-trained word embeddings.

### Load From Pre-trained Txt File
First, I load the pre-trained vectors from a text file.
Loading the whole file can take more than 30 minutes on my laptop, so it is better to set a `limit` on the number of vectors read.

I also save the model as a bin file for later use, since loading from it is much faster.

In [5]:
from gensim.models import KeyedVectors

In [46]:
file = 'data/Tencent_AILab_ChineseEmbedding/Tencent_AILab_ChineseEmbedding.txt'
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False, limit=4000000)
wv_from_text.init_sims(replace=True) # L2-normalize the vectors in place and drop the originals to save memory
wv_from_text.save('./test_40.bin')
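
To make the `limit` argument more concrete, here is a minimal sketch of how a word2vec-format text file is laid out and what reading only the first few entries looks like. It is not gensim's implementation: `read_word2vec_text` and the toy data are hypothetical, and the sketch assumes a header line of `<vocab_size> <vector_size>` followed by one word and its values per line.

```python
import io

def read_word2vec_text(stream, limit=None):
    """Read up to `limit` vectors from a word2vec-format text stream."""
    # Header line: "<vocab_size> <vector_size>"
    vocab_size, dim = (int(x) for x in stream.readline().split())
    count = vocab_size if limit is None else min(vocab_size, limit)
    vectors = {}
    for _ in range(count):
        parts = stream.readline().rstrip().split(' ')
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f'expected {dim} values for {word!r}')
        vectors[word] = values
    return vectors

# Toy stand-in for the real file (3 words, 2 dimensions); not real data.
toy_file = io.StringIO('3 2\n的 0.1 0.2\n了 0.3 0.4\n手机 0.5 0.6\n')
toy_vectors = read_word2vec_text(toy_file, limit=2)
print(list(toy_vectors))
```

Only the first `limit` entries are kept, so any word that appears later in the file will be missing from the loaded model.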

In [31]:
wv_from_text.most_similar('安徽工业大学')

[('安徽理工大学', 0.8747442960739136),
 ('安徽财经大学', 0.8564743995666504),
 ('安徽工程大学', 0.8556878566741943),
 ('合肥工业大学', 0.8495832085609436),
 ('铜陵学院', 0.8441342115402222),
 ('皖西学院', 0.831087589263916),
 ('安师大', 0.828484058380127),
 ('安徽师范大学', 0.82568359375),
 ('巢湖学院', 0.8209227323532104),
 ('安庆师范学院', 0.8206217288970947)]

### Load From bin File
In practice, I load the model from the bin file, which is fast and finishes in a few seconds.

In [6]:
model = KeyedVectors.load('./test_10.bin')

## Basic Function Usage

### most_similar
This function accepts a query word and returns the words most similar to it (the top 10 by default), ranked by similarity score. Similarity here is the cosine similarity between the word vectors.

In [21]:
model.most_similar('tcl手机')

[('中兴手机', 0.8854156732559204),
 ('联想手机', 0.8789077997207642),
 ('酷派手机', 0.8411391377449036),
 ('lg手机', 0.8332662582397461),
 ('moto手机', 0.7984577417373657),
 ('夏普手机', 0.7713048458099365),
 ('波导手机', 0.7678113579750061),
 ('联想移动', 0.7676339149475098),
 ('摩托罗拉手机', 0.7662297487258911),
 ('金立手机', 0.7615381479263306)]

In [28]:
model.similarity('tcl集团','tcl')

0.68678486
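
The ranking that `most_similar` performs can be sketched with plain NumPy. This is a simplified illustration, not gensim's code: `top_n_by_cosine` and the toy vocabulary are hypothetical, and unlike `most_similar` the sketch does not exclude the query word from the results.

```python
import numpy as np

def top_n_by_cosine(query, words, matrix, topn=10):
    """Rank `words` by cosine similarity between `query` and the rows of `matrix`."""
    # L2-normalize so that a dot product equals the cosine of the angle.
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_rows @ unit_query
    # Indices of the highest scores first.
    best = np.argsort(-scores)[:topn]
    return [(words[i], float(scores[i])) for i in best]

# Hypothetical toy vocabulary with 2-dimensional vectors.
toy_words = ['a', 'b', 'c']
toy_matrix = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
ranked = top_n_by_cosine(np.array([2.0, 0.0]), toy_words, toy_matrix, topn=2)
print(ranked)
```

Because cosine similarity only depends on direction, the query `[2.0, 0.0]` matches `'a'` perfectly even though the two vectors have different lengths.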

### Get Embedding Vector
I can get a word's embedding vector and use it to compute other scores with different algorithms.

In [32]:
v = model['安徽工业大学']
print(v.shape)
print(type(v[0]))
print(v)

(200,)
<class 'numpy.float32'>
[ 0.03383196 -0.17869404  0.05238409 -0.05279097 -0.00203355 -0.02550736
  0.00590293  0.02501949 -0.00388571 -0.05073139 -0.01879838  0.03750166
 -0.0429432  -0.01066442 -0.05733578  0.00487093 -0.16680478 -0.01408257
 -0.0126278  -0.00468757 -0.10181407 -0.04922124  0.0236799   0.08008453
 -0.11077128 -0.07158781 -0.02592197  0.06746666  0.09003778  0.12129757
  0.13584374 -0.1022609  -0.06597835  0.17919205 -0.06002511 -0.04983841
  0.01059006  0.0650035  -0.07256994 -0.07949052 -0.0896534  -0.04846947
  0.00800002  0.11383043 -0.00127295 -0.05526561 -0.04370026 -0.0228006
 -0.04405353 -0.0121503   0.07846978  0.08052275  0.0815885  -0.13652511
  0.06184749  0.03619649 -0.06231837  0.14912842 -0.03663978  0.02024984
 -0.05889713  0.02001837  0.12010405 -0.03547539 -0.09309956 -0.02752634
 -0.04178631  0.06795077  0.01024363  0.0994449  -0.10717772  0.05576936
  0.05840507  0.13200833 -0.05574288  0.03868127  0.04881038  0.13337329
 -0.08431953  0.02521

In [37]:
v1 = model['职业技术学院']
print(v1.shape)
print(v1)

v2 = (v + v1) / 2
print(v2)

(200,)
[-0.00741409 -0.09131071  0.05083036 -0.04319877 -0.02532532  0.00266612
  0.00615757  0.04831969  0.0793904  -0.02619914 -0.02589273  0.08999024
 -0.07087526 -0.03282047 -0.06789108 -0.00793395 -0.07140415  0.00347837
  0.00112126 -0.01099633 -0.05703546 -0.06683233  0.00785789  0.04541919
  0.00940179 -0.0959845  -0.01589643  0.0202605   0.02683976  0.20117304
  0.01861057 -0.07905024 -0.05859815  0.2854236  -0.05189861 -0.07470401
 -0.01923194  0.10207884 -0.05151923 -0.067327   -0.05337382 -0.05632971
 -0.00216718  0.08146963  0.02686448 -0.05758338  0.01312144 -0.00799029
  0.00268229 -0.05050042  0.13428317  0.02066507  0.02833731  0.0036809
  0.12844509 -0.1307342  -0.06585654  0.12269804 -0.08318231  0.00231075
 -0.03047573 -0.01288016  0.08632147 -0.05644405 -0.05607536 -0.02531487
  0.07175215 -0.04911696 -0.01090006  0.12970683 -0.09437403  0.07975813
  0.05520441  0.06935226  0.03716362 -0.05611055  0.07835138  0.05934883
 -0.10233817  0.09836157  0.08353174 -0.01351
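
Averaging two vectors, as `v2 = (v + v1) / 2` does, gives a vector that is equally similar (by cosine) to both inputs whenever the inputs have the same length. The sketch below illustrates this with random unit vectors standing in for `v` and `v1`. It assumes the stored vectors are unit length, which is what `init_sims(replace=True)` produces; the `cosine` helper and the random data are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Plain cosine similarity between two 1-D arrays."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)

# Two hypothetical unit-length vectors standing in for v and v1.
a = rng.normal(size=200)
a /= np.linalg.norm(a)
b = rng.normal(size=200)
b /= np.linalg.norm(b)

mean = (a + b) / 2

# Both similarities agree up to floating-point rounding.
print(cosine(a, mean), cosine(b, mean))
```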

In [40]:
import numpy as np
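def calculate_cosine_similarity(a, b):
    # Flatten the inputs to 1-D arrays so a plain dot product works.
    vector_a = np.asarray(a).ravel()
    vector_b = np.asarray(b).ravel()
    num = float(np.dot(vector_a, vector_b))
    denom = float(np.linalg.norm(vector_a) * np.linalg.norm(vector_b))
    cos = num / denom
    # Rescale the cosine from [-1, 1] to [0, 1]; model.similarity returns the raw cosine.
    sim = 0.5 + 0.5 * cos
    return sim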

In [45]:
sim_v_v1 = calculate_cosine_similarity(v, v1)
sim_model = model.similarity('安徽工业大学', '职业技术学院')
print(sim_v_v1)
print(sim_model)

sim_v_v2 = calculate_cosine_similarity(v, v2)
sim_v1_v2 = calculate_cosine_similarity(v1, v2)
print(sim_v_v2)
print(sim_v1_v2)

0.8541409969329834
0.708282
0.9620987381219261
0.9620987703686315
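
The first two numbers above differ because `calculate_cosine_similarity` rescales the cosine with `0.5 + 0.5 * cos`, while `model.similarity` prints the raw cosine. A quick check of the arithmetic, using the value printed above (the `rescale` helper is just for illustration):

```python
def rescale(cos):
    """Map a cosine in [-1, 1] to [0, 1], as calculate_cosine_similarity does."""
    return 0.5 + 0.5 * cos

raw_cosine = 0.708282  # value printed by model.similarity above
rescaled = rescale(raw_cosine)
print(round(rescaled, 6))
```

The result agrees with the 0.854141 printed by `calculate_cosine_similarity`, up to the rounding of the displayed cosine.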


### Check Whether a Word Is in the Model
The dataset may not cover every word I query, and because I load only part of it, a query word may well be missing. Looking up a missing word raises a `KeyError`, so I should check that the word exists first.

In [28]:
if 'lyc' in model:  # membership test works on the KeyedVectors object itself
    print('In Vocabulary')
else:
    print('Not IN Vocabulary')

Not IN Vocabulary
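
The same check can be wrapped in a small helper so that lookups never raise. This is a minimal sketch: `vector_or_default` is a hypothetical name, and a plain dict stands in for the loaded model. It only assumes the object supports `in` and `[]`, as `KeyedVectors` does.

```python
def vector_or_default(vectors, word, default=None):
    """Return vectors[word] if the word is known, otherwise `default`."""
    return vectors[word] if word in vectors else default

# A plain dict stands in for the loaded model here; the values are made up.
toy_model = {'tcl': [0.1, 0.2]}

print(vector_or_default(toy_model, 'tcl'))
print(vector_or_default(toy_model, 'lyc'))
```

Returning `None` for unknown words lets the caller decide whether to skip the word, substitute a zero vector, or report it.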
