# Word2vec in Gensim

Train a simple word2vec model using [Gensim](https://radimrehurek.com/gensim/). Use the small Wikipedia corpus from `../data/corpora/enlang1.txt`.

In [1]:
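import logging

import gensim

# Show Gensim's training progress in the cell output.
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# One text per line; split on whitespace to get a list of tokens.
sentences = []
with open('../data/corpora/enlang1.txt') as f:
    for line in f:
        sentences.append(line.strip().split())

# 50-dimensional vectors; ignore words that occur fewer than 3 times.
# Note: Gensim >= 4.0 renamed the `size` argument to `vector_size`.
model = gensim.models.Word2Vec(sentences, size=50, min_count=3)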

2018-02-01 15:56:46,389 : INFO : collecting all words and their counts
2018-02-01 15:56:46,390 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2018-02-01 15:56:47,257 : INFO : collected 175177 word types from a corpus of 4873383 raw words and 5000 sentences
2018-02-01 15:56:47,258 : INFO : Loading a fresh vocabulary
2018-02-01 15:56:47,601 : INFO : min_count=3 retains 62837 unique words (35% of original 175177, drops 112340)
2018-02-01 15:56:47,602 : INFO : min_count=3 leaves 4736664 word corpus (97% of original 4873383, drops 136719)
2018-02-01 15:56:47,770 : INFO : deleting the raw counts dictionary of 175177 items
2018-02-01 15:56:47,780 : INFO : sample=0.001 downsamples 26 most-common words
2018-02-01 15:56:47,782 : INFO : downsampling leaves estimated 3788259 word corpus (80.0% of prior 4736664)
2018-02-01 15:56:47,783 : INFO : estimated required memory for 62837 words and 50 dimensions: 56553300 bytes
2018-02-01 15:56:48,008 : INFO : resetting layer wei
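
The log above reports that `min_count=3` drops most word types but only a small share of the running words. The sketch below illustrates that pruning rule on a made-up toy corpus; `prune_vocabulary` is a hypothetical helper written for this illustration, not part of Gensim, and it uses `min_count=2` because the toy corpus is tiny.

```python
from collections import Counter

def prune_vocabulary(sentences, min_count):
    """Count tokens and keep only word types seen at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    kept = {word: n for word, n in counts.items() if n >= min_count}
    total_tokens = sum(counts.values())
    kept_tokens = sum(kept.values())
    return counts, kept, total_tokens, kept_tokens

# Toy corpus, made up for illustration (not taken from enlang1.txt).
toy_sentences = [
    "the car drove down the road".split(),
    "the bus drove down the road".split(),
    "the car passed the bus".split(),
]

counts, kept, total_tokens, kept_tokens = prune_vocabulary(toy_sentences, min_count=2)
print(f"{len(kept)} of {len(counts)} word types kept")
print(f"{kept_tokens} of {total_tokens} tokens kept")
```

Rare words are dropped because they have too few contexts to learn a reliable vector from, while the frequent words that remain still cover almost all of the text.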

In [2]:
print(model.wv['car'])


[ 0.20474598  1.451848   -1.0239041   0.42626384  2.9364133   2.5634243
  0.94328135 -0.11614431 -1.1550511   0.6351805  -0.27550706 -0.91280895
  0.8834353   0.16459714 -2.0092673   0.6252315  -0.6797438  -1.5800899
  0.44569865  1.788979   -0.23695281  1.0348423  -0.5377606   1.8198183
 -1.0216933  -0.23971567  1.2494572   0.33862925  1.5519047   1.6086415
 -1.559984    0.8733524   0.20594183 -3.7557235  -2.2541203   0.0679892
  0.5491095   1.5905684  -0.5127586   2.8371532  -2.9389822  -0.01068709
  0.25589764 -1.4922242   0.85257924  2.8131728  -0.21944268  3.9228218
 -0.4912506  -2.195461  ]


In [3]:
# Analogy query: "car" is to "cars" as "bus" is to ...?
model.wv.most_similar(positive=['cars', 'bus'], negative=['car'])

2018-02-01 15:57:03,542 : INFO : precomputing L2-norms of word weight vectors


[('buses', 0.8922812938690186),
 ('routes', 0.8808276057243347),
 ('bts', 0.8472105264663696),
 ('platforms', 0.8464639186859131),
 ('roads', 0.8395819664001465),
 ('trains', 0.8203502893447876),
 ('rail', 0.8144564628601074),
 ('freight', 0.8041046261787415),
 ('retail', 0.8016695976257324),
 ('suburban', 0.799876868724823)]
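
The query above amounts to vector arithmetic: add the `positive` vectors, subtract the `negative` ones, and rank the remaining words by cosine similarity to the result. The sketch below shows roughly that computation on hand-made 3-dimensional toy vectors. The `analogy` and `unit` helpers and the toy values are assumptions for illustration, not Gensim's actual implementation.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    inputs = set(positive) | set(negative)
    scores = [
        (word, sum(q * x for q, x in zip(query, unit(vec))))
        for word, vec in vectors.items()
        if word not in inputs  # the query words themselves are excluded
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hand-made toy vectors: axis 0 ~ "car", axis 1 ~ "bus", axis 2 ~ "plural".
toy = {
    "car":   [1.0, 0.0, 0.0],
    "cars":  [1.0, 0.0, 1.0],
    "bus":   [0.0, 1.0, 0.0],
    "buses": [0.0, 1.0, 1.0],
    "road":  [0.5, 0.5, 0.0],
}
result = analogy(toy, positive=["cars", "bus"], negative=["car"])
print(result)
```

In this toy space "buses" ranks first because subtracting "car" from "cars" leaves mostly the plural direction, which is then added to "bus".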

# Import better models

Import word vectors trained on the [Common Crawl](https://fasttext.cc/docs/en/english-vectors.html) corpus (600B tokens) and experiment with them.

In [4]:
from gensim.models.keyedvectors import KeyedVectors

# The .vec file is in the plain-text word2vec format, hence binary=False.
word_vectors = KeyedVectors.load_word2vec_format('../data/crawl-300.vec', binary=False)

2018-02-01 15:57:03,653 : INFO : loading projection weights from ../data/crawl-300.vec
2018-02-01 15:58:34,941 : INFO : loaded (500000, 300) matrix from ../data/crawl-300.vec
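
For reference, the text word2vec format read here starts with a header line giving the number of words and the vector dimension, followed by one line per word with its values. The sketch below parses a tiny made-up sample in that format. `read_vec` is a hypothetical helper for illustration only; its `limit` argument is modeled on the `limit` parameter of `load_word2vec_format`, which caps how many vectors are read.

```python
import io

def read_vec(stream, limit=None):
    """Parse text word2vec data: a 'count dim' header, then 'word v1 ... v_dim' lines."""
    count, dim = (int(x) for x in stream.readline().split())
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break  # stop early, keeping only the first `limit` words
        parts = line.rstrip().split(" ")
        word, values = parts[0], [float(x) for x in parts[1:]]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return count, dim, vectors

# A tiny made-up sample standing in for crawl-300.vec.
sample = io.StringIO("3 2\nking 0.1 0.2\nqueen 0.3 0.4\nroyal 0.5 0.6\n")
count, dim, vectors = read_vec(sample, limit=2)
print(count, dim, list(vectors))
```

Reading only the first entries is a common way to keep memory use and load time manageable when a file lists its words from most to least frequent.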


In [5]:
word_vectors.most_similar(positive=['kings', 'queen'], negative=['king'])

2018-02-01 15:58:34,947 : INFO : precomputing L2-norms of word weight vectors


[('queens', 0.838758111000061),
 ('queen.', 0.6004167795181274),
 ('monarchs', 0.5899762511253357),
 ('Queen', 0.5859925150871277),
 ('empresses', 0.5775150656700134),
 ('princes', 0.5499585866928101),
 ('QUEEN', 0.5448766350746155),
 ('royals', 0.5442696213722229),
 ('princesses', 0.5383291840553284),
 ('royal', 0.5232111215591431)]

In [6]:
word_vectors.most_similar(positive=['woman', 'husband'], negative=['man'])

[('wife', 0.7529045343399048),
 ('daughter', 0.6500850915908813),
 ('mother-in-law', 0.6470040678977966),
 ('spouse', 0.6457177996635437),
 ('husbands', 0.633111298084259),
 ('mother', 0.6005339622497559),
 ('ex-husband', 0.5952433347702026),
 ('daughter-in-law', 0.5948172807693481),
 ('ex-wife', 0.5728636384010315),
 ('daughters', 0.5600826144218445)]

In [7]:
word_vectors.most_similar(positive=['Paris', 'Spain'], negative=['France'])

[('Madrid', 0.8625081181526184),
 ('Barcelona', 0.7637038826942444),
 ('Sevilla', 0.6874054670333862),
 ('Seville', 0.6747833490371704),
 ('Malaga', 0.6494930386543274),
 ('Zaragoza', 0.645937442779541),
 ('Valencia', 0.6383104920387268),
 ('Alicante', 0.6115807890892029),
 ('Salamanca', 0.6041631102561951),
 ('Murcia', 0.6019024848937988)]

In [8]:
word_vectors.most_similar(positive=['Donald', 'Putin'], negative=['Trump'])

[('Vladimir', 0.644631028175354),
 ('Medvedev', 0.6112760901451111),
 ('Sergei', 0.5950400233268738),
 ('Dmitry', 0.5793238878250122),
 ('Oleg', 0.5696351528167725),
 ('Denis', 0.5639139413833618),
 ('Mikhail', 0.5574285387992859),
 ('Anatoly', 0.5540498495101929),
 ('Igor', 0.5533066987991333),
 ('Ivan', 0.5529454350471497)]