## Testing Notebook
In this notebook, I use gensim to experiment with pretrained word embeddings.

### Load From Pre-trained Txt File
First, I load the pretrained vectors from the txt file.
Loading the full file can take more than 30 minutes on my laptop, so it is better to set a limit on the number of vectors read.

I also save the model as a bin file for later use, which is much faster to load.

In [3]:
from gensim.models import KeyedVectors

In [2]:
file = 'data/Tencent_AILab_ChineseEmbedding/Tencent_AILab_ChineseEmbedding.txt'
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False, limit=5000000)
wv_from_text.init_sims(replace=True)  # L2-normalize in place; discards the raw vectors to save memory
wv_from_text.save('./test_50.bin')

In [31]:
wv_from_text.most_similar('安徽工业大学')

[('安徽理工大学', 0.8747442960739136),
 ('安徽财经大学', 0.8564743995666504),
 ('安徽工程大学', 0.8556878566741943),
 ('合肥工业大学', 0.8495832085609436),
 ('铜陵学院', 0.8441342115402222),
 ('皖西学院', 0.831087589263916),
 ('安师大', 0.828484058380127),
 ('安徽师范大学', 0.82568359375),
 ('巢湖学院', 0.8209227323532104),
 ('安庆师范学院', 0.8206217288970947)]

### Load From bin File
In practice, I load the model from the bin file, which finishes in a few seconds.

In [4]:
model = KeyedVectors.load('./test_50.bin')
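
The two loading steps can be combined so that the slow text parse happens only once. The following is a minimal sketch, assuming gensim is installed; `load_vectors` and its parameter names are hypothetical and not part of gensim. It only uses the three gensim calls already shown in this notebook.

```python
import os


def load_vectors(text_path, cache_path, limit=None):
    """Load KeyedVectors from cache_path if it exists, else parse text_path.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Imported lazily so this module can be imported without gensim installed.
    from gensim.models import KeyedVectors

    if os.path.exists(cache_path):
        # Fast path: the file written by a previous run via .save().
        return KeyedVectors.load(cache_path)

    # Slow path: parse the word2vec text format, reading at most `limit` vectors.
    kv = KeyedVectors.load_word2vec_format(text_path, binary=False, limit=limit)
    kv.save(cache_path)
    return kv
```

With this helper, the first call pays the parsing cost and every later call with the same `cache_path` takes the fast path.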

## Basic Function Usage

### most_similar
This function accepts a query word and returns the most similar words (the top 10 by default), ranked by similarity score. The score is the cosine similarity between the word vectors.

In [16]:
model.most_similar('sysu')

[('sjtu', 0.70429927110672),
 ('cuhk', 0.6616533994674683),
 ('scut', 0.6580467224121094),
 ('hust', 0.6544046401977539),
 ('scau', 0.647921621799469),
 ('ecnu', 0.643638014793396),
 ('scnu', 0.6425882577896118),
 ('bnu', 0.6310718059539795),
 ('shnu', 0.6247861385345459),
 ('ruc', 0.6199719905853271)]

In [5]:
model.most_similar('爱芽')

[('云赞', 0.552207350730896),
 ('宋琼莹', 0.5513955354690552),
 ('原质资本', 0.5392557382583618),
 ('showki', 0.5390989780426025),
 ('蜜曰科技', 0.5359789133071899),
 ('舒朵', 0.5354244112968445),
 ('小肤', 0.5312100052833557),
 ('辣妈tv', 0.525022029876709),
 ('小芽', 0.5227168202400208),
 ('心芽', 0.5217597484588623)]

In [17]:
model.similarity('sysu','中山大学')

0.5393443
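
To make the ranking concrete, here is a small sketch of the idea behind `most_similar`: score every other word by cosine similarity to the query and keep the top results. It is plain Python on made-up 2-D vectors, not gensim's actual implementation; `toy_most_similar` and `toy_vectors` are hypothetical names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def toy_most_similar(vectors, query, topn=10):
    """Rank every other word in `vectors` by cosine similarity to `query`."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Made-up 2-D vectors, for illustration only (the real vectors have 200 dimensions).
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
ranking = toy_most_similar(toy_vectors, "a", topn=2)
print(ranking)
```

Here "b" ranks first because it points in almost the same direction as "a", while "d" points the opposite way and drops out of the top two.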

### Get Embedding Vector
I can retrieve a word's embedding vector and use it to compute scores with other algorithms.

In [6]:
v = model['西安交通大学']
print(v.shape)
print(type(v[0]))
print(v)

(200,)
<class 'numpy.float32'>
[-3.90783101e-02  1.26140621e-02 -3.64476778e-02  4.90907487e-03
  4.44173440e-02  1.44178374e-02  4.92498465e-02  9.20460448e-02
  1.05726205e-01  9.89282224e-03  4.50496785e-02  1.03399731e-01
 -1.28922269e-01  6.56003505e-02 -1.42683744e-01 -5.64271323e-02
 -1.58208802e-01 -6.23102747e-02 -3.11024617e-02 -2.56488062e-02
 -1.07827716e-01 -8.16079378e-02 -2.98235286e-03  3.92048247e-02
 -6.87738955e-02 -7.45566785e-02  8.86625126e-02  1.02725305e-01
  5.28476462e-02  1.12537034e-01  7.41369501e-02 -1.16632804e-01
 -1.19706951e-01  8.08524117e-02 -7.20953718e-02 -1.10964410e-01
  3.48094143e-02  8.97642821e-02 -2.32640654e-02 -2.56293043e-02
 -1.49546415e-01  7.80194029e-02  6.73387274e-02  5.45945838e-02
  3.71408872e-02 -7.41552562e-02  4.54047229e-03 -1.23741604e-01
 -4.84370179e-02  1.60180498e-02  8.02557543e-02  2.39772517e-02
  1.32991867e-02 -7.61461910e-04  4.51279171e-02 -2.50083897e-02
 -8.92194584e-02  1.01541966e-01  5.16217425e-02 -4.9849361

In [37]:
v1 = model['职业技术学院']
print(v1.shape)
print(v1)

v2 = (v + v1) / 2
print(v2)

(200,)
[-0.00741409 -0.09131071  0.05083036 -0.04319877 -0.02532532  0.00266612
  0.00615757  0.04831969  0.0793904  -0.02619914 -0.02589273  0.08999024
 -0.07087526 -0.03282047 -0.06789108 -0.00793395 -0.07140415  0.00347837
  0.00112126 -0.01099633 -0.05703546 -0.06683233  0.00785789  0.04541919
  0.00940179 -0.0959845  -0.01589643  0.0202605   0.02683976  0.20117304
  0.01861057 -0.07905024 -0.05859815  0.2854236  -0.05189861 -0.07470401
 -0.01923194  0.10207884 -0.05151923 -0.067327   -0.05337382 -0.05632971
 -0.00216718  0.08146963  0.02686448 -0.05758338  0.01312144 -0.00799029
  0.00268229 -0.05050042  0.13428317  0.02066507  0.02833731  0.0036809
  0.12844509 -0.1307342  -0.06585654  0.12269804 -0.08318231  0.00231075
 -0.03047573 -0.01288016  0.08632147 -0.05644405 -0.05607536 -0.02531487
  0.07175215 -0.04911696 -0.01090006  0.12970683 -0.09437403  0.07975813
  0.05520441  0.06935226  0.03716362 -0.05611055  0.07835138  0.05934883
 -0.10233817  0.09836157  0.08353174 -0.01351

In [6]:
import numpy as np
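def calculate_cosine_similarity(a, b):
    # Flatten the inputs to 1-D arrays (NumPy no longer recommends np.matrix).
    vector_a = np.asarray(a).ravel()
    vector_b = np.asarray(b).ravel()
    num = float(np.dot(vector_a, vector_b))
    denom = np.linalg.norm(vector_a) * np.linalg.norm(vector_b)
    cos = num / denom
    # Rescale the cosine from [-1, 1] to [0, 1].
    sim = 0.5 + 0.5 * cos
    return sim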

In [45]:
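# The printed values are consistent with v holding the vector for '安徽工业大学'
# (0.5 + 0.5 * 0.708282 ≈ 0.854141), so the cells were apparently run out of order.
sim_v_v1 = calculate_cosine_similarity(v, v1)
sim_model = model.similarity('安徽工业大学', '职业技术学院')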
print(sim_v_v1)
print(sim_model)

sim_v_v2 = calculate_cosine_similarity(v, v2)
sim_v1_v2 = calculate_cosine_similarity(v1, v2)
print(sim_v_v2)
print(sim_v1_v2)

0.8541409969329834
0.708282
0.9620987381219261
0.9620987703686315
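
The last two printed values are nearly identical for a geometric reason. Assuming the stored vectors have unit length (as they should after `init_sims(replace=True)`), the mean of two vectors bisects the angle between them, so it is equally similar to both. Its cosine to either one is sqrt((1 + cos θ) / 2); with cos θ = 0.708282 that is about 0.9242, which rescales to about 0.9621. The sketch below checks this on made-up 2-D vectors in plain Python; all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Made-up vectors, normalized to unit length like the stored embeddings.
a = unit([3.0, 1.0])
b = unit([1.0, 2.0])
mean = [(x + y) / 2 for x, y in zip(a, b)]

cos_ab = cosine(a, b)
cos_a_mean = cosine(a, mean)
cos_b_mean = cosine(b, mean)
half_angle = math.sqrt((1 + cos_ab) / 2)

print(cos_a_mean, cos_b_mean, half_angle)
```

All three printed numbers should agree up to floating-point rounding.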


In [18]:
vec1 = model['sysu']
vec2 = model['中山大学']
sim_model = model.similarity('sysu','中山大学')
sim_vec1_vec2 = calculate_cosine_similarity(vec1, vec2)

print(sim_model)
print(sim_vec1_vec2)

0.5393443
0.7696721714539629
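
The two numbers differ because `calculate_cosine_similarity` returns 0.5 + 0.5 · cos rather than the raw cosine that `model.similarity` reports. The short check below uses only values printed earlier in this notebook and shows the mapping in both directions; `rescale` and `unscale` are hypothetical helper names.

```python
def rescale(cos):
    """Map a raw cosine in [-1, 1] to the [0, 1] score used in this notebook."""
    return 0.5 + 0.5 * cos


def unscale(sim):
    """Invert rescale: recover the raw cosine from the [0, 1] score."""
    return 2.0 * sim - 1.0


# (model.similarity output, calculate_cosine_similarity output) as printed above.
printed_pairs = [
    (0.708282, 0.8541409969329834),
    (0.5393443, 0.7696721714539629),
]

for raw, rescaled in printed_pairs:
    # The first two columns should agree up to the precision of the printout.
    print(rescale(raw), rescaled, unscale(rescaled))
```

To compare the custom function directly with `model.similarity`, either drop the rescaling step or apply `unscale` to its result.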


### Check Whether a Word Is in the Model
The dataset may not cover every word I query, and because I loaded only a limited subset of the vectors, a query word is even more likely to be missing. I should check that the word exists first to avoid a KeyError.

In [28]:
if 'lyc' in model.vocab:  # gensim 3.x attribute; gensim 4 uses model.key_to_index
    print('In Vocabulary')
else:
    print('Not IN Vocabulary')

Not IN Vocabulary
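
The check can be wrapped in a small lookup helper so that missing words return a default instead of raising. This is a minimal sketch: `get_vector` is a hypothetical name, and it relies only on `word in kv` and `kv[word]`, which to my knowledge gensim's `KeyedVectors` supports. A plain dict stands in for the model here so the example runs without gensim.

```python
def get_vector(kv, word, default=None):
    """Return kv[word] if the word is present, otherwise `default`.

    Hypothetical helper; `kv` can be any object supporting `in` and indexing.
    """
    if word in kv:
        return kv[word]
    return default


# A plain dict with made-up values stands in for the loaded model.
toy_model = {"sysu": [0.1, 0.2]}

found = get_vector(toy_model, "sysu")
missing = get_vector(toy_model, "lyc")
print(found)
print(missing)
```

Callers can then test the result against `None` before computing any similarity.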


In [7]:
vec3 = model['硕士研究生']
vec4 = model['硕士']
sim_model = model.similarity('硕士研究生','硕士')
sim_vec3_vec4 = calculate_cosine_similarity(vec3, vec4)

print(sim_model)
print(sim_vec3_vec4)

0.8521307
0.9260654151439667
