# Word2vec in Gensim

Train a simple word2vec model with [Gensim](https://radimrehurek.com/gensim/) on the small Wikipedia corpus in `../data/corpora/enlang1.txt`.

In [9]:
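import logging

import gensim

# Log training progress at INFO level.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Each corpus line becomes one "sentence": a list of whitespace-separated tokens.
sentences = []
with open('../data/corpora/enlang1.txt') as f:
    for line in f:
        sentences.append(line.strip().split())

# 50-dimensional vectors; words seen fewer than 3 times are dropped.
# Note: Gensim 4.0+ renamed the `size` parameter to `vector_size`.
model = gensim.models.Word2Vec(sentences, size=50, min_count=3)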

2019-02-07 14:53:36,963 : INFO : collecting all words and their counts
2019-02-07 14:53:36,964 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2019-02-07 14:53:37,893 : INFO : collected 175177 word types from a corpus of 4873383 raw words and 5000 sentences
2019-02-07 14:53:37,894 : INFO : Loading a fresh vocabulary
2019-02-07 14:53:38,365 : INFO : min_count=3 retains 62837 unique words (35% of original 175177, drops 112340)
2019-02-07 14:53:38,365 : INFO : min_count=3 leaves 4736664 word corpus (97% of original 4873383, drops 136719)
2019-02-07 14:53:38,588 : INFO : deleting the raw counts dictionary of 175177 items
2019-02-07 14:53:38,595 : INFO : sample=0.001 downsamples 26 most-common words
2019-02-07 14:53:38,596 : INFO : downsampling leaves estimated 3788259 word corpus (80.0% of prior 4736664)
2019-02-07 14:53:38,849 : INFO : estimated required memory for 62837 words and 50 dimensions: 56553300 bytes
2019-02-07 14:53:38,850 : INFO : resetting layer wei
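
The `min_count=3` lines in the log above report how many word types and raw words survive frequency pruning. A minimal sketch of that step in plain Python, run on a toy corpus; the helper name `prune_vocab` and the sample sentences are illustrative and not part of Gensim:

```python
from collections import Counter


def prune_vocab(sentences, min_count):
    """Count tokens and keep only those seen at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    return counts, kept


# Toy corpus standing in for the tokenized Wikipedia lines (hypothetical data).
toy_sentences = [
    "the car is on the road".split(),
    "the road is long".split(),
    "a car on the long road".split(),
]

counts, kept = prune_vocab(toy_sentences, min_count=3)
raw_words = sum(counts.values())   # total tokens before pruning
kept_words = sum(kept.values())    # total tokens that survive pruning

print(f"retains {len(kept)} of {len(counts)} word types")
print(f"leaves {kept_words} of {raw_words} raw words")
```

As in the real log, pruning typically removes a large share of word types but only a small share of the running text, because rare words contribute few tokens.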

In [10]:
print(model.wv['car'])


[-1.7218808  -0.4272013  -0.79122293 -4.145425    0.4663584   1.2451011
 -1.3500787   0.96638715 -1.7594365  -3.2623775  -1.5156037   4.202594
  0.96996725 -1.3616059  -0.50863105 -0.11124986 -0.9238543  -0.7314218
 -1.1395476  -0.35831714  0.7078443   0.9052772  -1.2973964   0.14551657
  1.2575114  -1.3295317  -1.308183    0.92065614  1.5180007  -1.0573603
  1.1286033  -0.6752914   0.5298296  -2.511676    2.705531   -0.78752106
 -1.1190983   2.0153103   1.4105419  -0.02672557 -0.05394428 -2.2373958
  0.26088235  0.02553769  1.0599087   0.11638223 -0.58372146 -2.910903
  1.4348471   0.20868112]


In [12]:
model.wv.most_similar(positive=['queens', 'king'], negative=['queen'])

[('expressway', 0.7484003901481628),
 ('mezzanine', 0.7359848618507385),
 ('southbound', 0.7286300659179688),
 ('motorway', 0.7252560257911682),
 ('keilor', 0.7196506261825562),
 ('roads', 0.7130234241485596),
 ('northumberland', 0.7093843221664429),
 ('turnpike', 0.7042462825775146),
 ('bypass', 0.6979405879974365),
 ('junction', 0.696420431137085)]
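
Queries like the one above are commonly explained as vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and rank the remaining vocabulary by cosine similarity to the result. The sketch below illustrates the idea on hand-made 3-D vectors in plain Python. The `toy_vectors` table and the `analogy` helper are hypothetical, and Gensim's own implementation differs in details such as vector normalization.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, positive, negative, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]
    # The query words themselves are excluded from the ranking.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hand-made vectors (assumed axes: royalty, female, plural), for illustration only.
toy_vectors = {
    "king":   [1.0, 0.0, 0.0],
    "queen":  [1.0, 1.0, 0.0],
    "kings":  [1.0, 0.0, 1.0],
    "queens": [1.0, 1.0, 1.0],
    "road":   [0.0, 0.2, 0.9],
}

result = analogy(toy_vectors, positive=["queens", "king"], negative=["queen"])
print(result)
```

On these idealized vectors the top answer is `kings`. The small model trained above returns road-related words instead, which suggests that the corpus and dimensionality are too small for the analogy structure to emerge.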

# Import better models

Load word vectors pretrained on the [Common Crawl](https://fasttext.cc/docs/en/english-vectors.html) corpus (600B tokens) and experiment with them.

In [4]:
from gensim.models.keyedvectors import KeyedVectors
# binary=False: the .vec file is in the plain-text word2vec format.
word_vectors = KeyedVectors.load_word2vec_format('../data/crawl-300.vec', binary=False)

2019-02-07 14:09:02,756 : INFO : loading projection weights from ../data/crawl-300.vec
2019-02-07 14:11:44,898 : INFO : loaded (500000, 300) matrix from ../data/crawl-300.vec
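
The `.vec` file loaded above uses the plain-text word2vec format: a header line with the vocabulary size and dimensionality, then one line per word containing the word followed by its vector components. A minimal pure-Python reader for that layout, shown on an in-memory toy file; the `read_vec` helper and the sample data are illustrative, and for real files `KeyedVectors.load_word2vec_format` remains the practical choice:

```python
import io


def read_vec(handle, limit=None):
    """Parse word2vec text format: 'count dim' header, then 'word v1 v2 ...' lines."""
    n_words, dim = map(int, handle.readline().split())
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break  # stop early, keeping only the first `limit` words
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return dim, vectors


# Toy file contents (hypothetical data): 3 words, 2 dimensions.
toy_file = "3 2\nking 0.5 0.1\nqueen 0.4 0.2\nroad -0.3 0.9\n"

dim, vectors = read_vec(io.StringIO(toy_file))
_, first_two = read_vec(io.StringIO(toy_file), limit=2)

print(dim, sorted(vectors))
print(sorted(first_two))
```

`load_word2vec_format` also accepts a `limit` argument that reads only the first N vectors, which can shorten the load time seen in the log when the full vocabulary is not needed.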


In [13]:
word_vectors.most_similar(positive=['kings', 'queen'], negative=['king'])

[('queens', 0.838758111000061),
 ('queen.', 0.6004167795181274),
 ('monarchs', 0.5899762511253357),
 ('Queen', 0.5859925150871277),
 ('empresses', 0.5775150656700134),
 ('princes', 0.5499585866928101),
 ('QUEEN', 0.5448766350746155),
 ('royals', 0.5442696213722229),
 ('princesses', 0.5383291840553284),
 ('royal', 0.5232111215591431)]
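
The result list above contains case and punctuation variants of a query word (`queen.`, `Queen`, `QUEEN`), because the Common Crawl vocabulary is case-sensitive and keeps some punctuation. One possible way to tidy such output is to drop results whose normalized form matches a query word or an earlier result. The `dedupe_results` helper below is a hypothetical post-processing step, not a Gensim feature; the data is copied from the output above.

```python
import string


def normalize(word):
    """Lowercase and strip leading/trailing punctuation."""
    return word.lower().strip(string.punctuation)


def dedupe_results(results, query_words):
    """Drop results that are variants of a query word or of an earlier result."""
    seen = {normalize(w) for w in query_words}
    cleaned = []
    for word, score in results:
        key = normalize(word)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append((word, score))
    return cleaned


# Output of the kings + queen - king query shown above.
results = [
    ('queens', 0.838758111000061),
    ('queen.', 0.6004167795181274),
    ('monarchs', 0.5899762511253357),
    ('Queen', 0.5859925150871277),
    ('empresses', 0.5775150656700134),
    ('princes', 0.5499585866928101),
    ('QUEEN', 0.5448766350746155),
    ('royals', 0.5442696213722229),
    ('princesses', 0.5383291840553284),
    ('royal', 0.5232111215591431),
]

cleaned = dedupe_results(results, query_words=['kings', 'queen', 'king'])
print([word for word, _ in cleaned])
```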

In [14]:
word_vectors.most_similar(positive=['woman', 'husband'], negative=['man'])

[('wife', 0.7529045343399048),
 ('daughter', 0.6500850915908813),
 ('mother-in-law', 0.6470040678977966),
 ('spouse', 0.6457177996635437),
 ('husbands', 0.633111298084259),
 ('mother', 0.6005339622497559),
 ('ex-husband', 0.5952433347702026),
 ('daughter-in-law', 0.5948172807693481),
 ('ex-wife', 0.5728636384010315),
 ('daughters', 0.5600826144218445)]

In [15]:
word_vectors.most_similar(positive=['Paris', 'Spain'], negative=['France'])

[('Madrid', 0.8625081181526184),
 ('Barcelona', 0.7637038826942444),
 ('Sevilla', 0.6874054670333862),
 ('Seville', 0.6747833490371704),
 ('Malaga', 0.6494930386543274),
 ('Zaragoza', 0.645937442779541),
 ('Valencia', 0.6383104920387268),
 ('Alicante', 0.6115807890892029),
 ('Salamanca', 0.6041631102561951),
 ('Murcia', 0.6019024848937988)]

In [8]:
word_vectors.most_similar(positive=['Donald', 'Putin'], negative=['Trump'])

[('Vladimir', 0.644631028175354),
 ('Medvedev', 0.6112760901451111),
 ('Sergei', 0.5950400233268738),
 ('Dmitry', 0.5793238878250122),
 ('Oleg', 0.5696351528167725),
 ('Denis', 0.5639139413833618),
 ('Mikhail', 0.5574285387992859),
 ('Anatoly', 0.5540498495101929),
 ('Igor', 0.5533066987991333),
 ('Ivan', 0.5529454350471497)]