## Testing Notebook
In this notebook, I use gensim to work with pre-trained word embeddings.

### Load From Pre-trained Txt File
First, I load the pre-trained vectors from the txt file.
Loading the full file can take more than 30 minutes on my laptop, so it is better to set a limit on the number of vectors read.

I also save the model as a bin file for later use, since loading from that file is much faster.

In [7]:
from gensim.models import KeyedVectors

In [11]:
file = 'data/Tencent_AILab_ChineseEmbedding/Tencent_AILab_ChineseEmbedding.txt'
wv_from_text = KeyedVectors.load_word2vec_format(file, binary=False, limit=200000)
wv_from_text.init_sims(replace=True) # L2-normalize the vectors; replace=True discards the originals to save memory
wv_from_text.save('./test.bin')
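
To make the effect of `limit` concrete, here is a minimal sketch of a reader for the word2vec text layout: a header line with the vocabulary size and the dimension, then one word per line followed by its values. This is a simplified illustration, not gensim's actual implementation; `read_word2vec_text` and the tiny sample data are hypothetical names and values made up for the example.

```python
import io

def read_word2vec_text(stream, limit=None):
    """Read vectors from a word2vec-style text stream.

    The first line holds "<vocab_size> <dimension>". Every following line
    holds a word followed by <dimension> space-separated floats.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    if limit is not None:
        # Only the first `limit` rows are read; the rest are never parsed.
        vocab_size = min(vocab_size, limit)

    vectors = {}
    for _ in range(vocab_size):
        line = stream.readline()
        if not line:
            break  # the file is shorter than the header claims
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = [float(x) for x in values]
    return vectors

# Tiny made-up file: 3 words, 2 dimensions.
sample = io.StringIO("3 2\n英超 0.1 0.2\n曼联 0.3 0.4\n温格 0.5 0.6\n")
first_two = read_word2vec_text(sample, limit=2)
print(list(first_two))  # ['英超', '曼联']
```

Because reading stops after `limit` rows, any word that appears later in the file is simply absent from the loaded model.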

In [18]:
wv_from_text.most_similar('阿森纳')

[('曼联', 0.9391189813613892),
 ('切尔西', 0.9346709847450256),
 ('利物浦', 0.9224727153778076),
 ('热刺', 0.913425087928772),
 ('温格', 0.9049314856529236),
 ('曼城', 0.9032983779907227),
 ('西布朗', 0.8680431246757507),
 ('英超', 0.863417387008667),
 ('斯旺西', 0.8625228404998779),
 ('利物浦的', 0.8616222143173218)]

### Load From bin File
In practice, I load the model from the bin file, which is fast and finishes in a few seconds.

In [8]:
model = KeyedVectors.load('./test.bin')

## Basic Function Usage

### most_similar
This function accepts a query word and returns the most similar words (the top 10 by default), ranked by similarity. The similarity here is the cosine similarity between word vectors.

In [10]:
model.most_similar('中国')

KeyError: "word '医科' not in vocabulary"
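
As a rough illustration of how such a ranking can be produced, the sketch below scores every other word by cosine similarity and keeps the best `topn`. It is a dependency-free toy, not gensim's code; `cosine`, `most_similar_sketch`, and the `toy` vectors are hypothetical names and values chosen for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_sketch(vectors, query, topn=10):
    """Return the `topn` words closest to `query`, best first."""
    q = vectors[query]  # raises KeyError if the query word is unknown
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Made-up 2-dimensional vectors.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
ranked = most_similar_sketch(toy, "a", topn=2)
print(ranked)
```

Here `b` ranks first because it points in nearly the same direction as `a`, while `d` points the opposite way and falls outside the top two.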

### get embedding vector
I can get a word's embedding vector and use it to compute scores with different algorithms.

In [9]:
v = model['细胞']
print(v.shape)
print(type(v[0]))
print(v)

KeyError: "word '医科' not in vocabulary"
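
As one example of computing scores from raw vectors: when vectors have unit length (which is what `init_sims(replace=True)` produces at load time), the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * cosine`. The sketch below checks this on two made-up vectors; the helper names are hypothetical and plain lists stand in for the model's arrays.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Two made-up vectors, normalized to unit length.
u = l2_normalize([3.0, 4.0])
v = l2_normalize([4.0, 3.0])

cos_uv = dot(u, v)         # for unit vectors, this is the cosine similarity
dist_uv = euclidean(u, v)

# Both printed values agree up to floating-point rounding.
print(dist_uv ** 2, 2 - 2 * cos_uv)
```

So on normalized vectors, ranking by cosine similarity and ranking by Euclidean distance give the same order.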

### check whether a word is in model
This dataset may not cover every word we query. Because I loaded only a reduced subset of the dataset, a query word may well be missing. I should check that the word exists first to avoid errors.

In [28]:
if 'lyc' in model.vocab:
    print('In Vocabulary')
else:
    print('Not IN Vocabulary')

Not IN Vocabulary
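
When several words need checking at once, a small helper keeps the queries from raising `KeyError`. The sketch below is a hypothetical helper, `split_by_vocabulary`, that works with anything supporting the `in` operator; a plain dict stands in for the model here. Note that `model.vocab` is the gensim 3.x attribute. In gensim 4 the equivalent mapping is `key_to_index`, and as far as I know `word in model` works in both versions.

```python
def split_by_vocabulary(words, vocabulary):
    """Split `words` into those found in `vocabulary` and those missing."""
    known, unknown = [], []
    for w in words:
        (known if w in vocabulary else unknown).append(w)
    return known, unknown

# A plain dict stands in for the model's vocabulary (made-up contents).
toy_vocab = {"中国": 0, "细胞": 1}
known, unknown = split_by_vocabulary(["中国", "lyc", "细胞"], toy_vocab)
print(known, unknown)  # ['中国', '细胞'] ['lyc']
```

With the real model, the call would look like `split_by_vocabulary(words, model.vocab)`, assuming gensim 3.x as used in this notebook.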
