# Summary

This notebook explores word embeddings with Gensim: we train new embeddings on a dataset of our own and compare them with pre-trained GloVe embeddings.

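Before getting started, install the gensim library. The code in this notebook uses the Gensim 3.x API (for example the `size` and `iter` arguments of `Word2Vec`, which were renamed in Gensim 4), so pin a 3.x release: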

```sh
conda install gensim=3.4.0
```


In [3]:
# !pip install gensim


In [1]:
!pip install fasttext



In [2]:
import re
from gensim.models import Word2Vec, KeyedVectors
from gensim.scripts.glove2word2vec import glove2word2vec
from gensim.test.utils import datapath

First, let's train a new word2vec model on our own data. Reading the wiki data raised a `UnicodeDecodeError` (see the commented-out cell below), so we use the Brown corpus from NLTK instead.

In [14]:
# import nltk
# nltk.download('brown')

In [15]:
from nltk.corpus import brown as brown_corp

sentences = brown_corp.sents()
print('# sentences in Brown corpus:', len(sentences))

# sentences in Brown corpus: 57340


In [2]:
# sentences=[]
# filename="../data/wiki.10K.txt"
# with open(filename) as file:
#     for line in file:
#         words=line.rstrip().lower()
#         # replace any sequence of whitespace (space, tab, newline, etc.) with single space
#         words=re.sub(r"\s+", " ", words)
#         sentences.append(words.split(" "))

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 2963: character maps to <undefined>
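
The `UnicodeDecodeError` above is what Python raises when `open()` falls back to the platform's default codec (`charmap`, typically cp1252 on Windows) and the file is not in that encoding. A minimal sketch of one possible fix, assuming the wiki file is UTF-8 encoded (the notebook does not say); `read_sentences` is a hypothetical helper name:

```python
import re


def read_sentences(filename, encoding="utf-8"):
    """Read one sentence per line and return a list of lowercased token lists.

    Assumption: the file is UTF-8 encoded; pass another `encoding` if it is not.
    """
    sentences = []
    # An explicit encoding avoids the platform-dependent default codec.
    with open(filename, encoding=encoding) as file:
        for line in file:
            # Collapse any run of whitespace (space, tab, etc.) into one space.
            words = re.sub(r"\s+", " ", line.strip().lower())
            if words:  # skip empty lines
                sentences.append(words.split(" "))
    return sentences


# Example call (path taken from the commented-out cell above):
# sentences = read_sentences("../data/wiki.10K.txt")
```

If the file turns out to use a different encoding, the same error will reappear with a different byte offset, which is a hint to try another `encoding` value.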

## Word2Vec trained via CBOW

In [16]:
cbow_model = Word2Vec(
    sentences,
    size=100,
    window=5,
    min_count=2,
    workers=10,
    sg=0,  # 0 selects CBOW
)

cbow_trained_vectors = cbow_model.wv
# save vectors to file if you want to use them later
cbow_trained_vectors.save_word2vec_format('cbow_embeddings.txt', binary=False)

## Word2Vec trained via Skip-Gram

In [17]:
sg_model = Word2Vec(
    sentences,
    size=100,
    window=5,
    min_count=2,
    workers=10,
    sg=1,  # 1 selects skip-gram
)
sg_trained_vectors = sg_model.wv

In [18]:
# save embeddings to file for re-use
sg_trained_vectors.save_word2vec_format('sg_embeddings.txt', binary=False)

In [19]:
sg_trained_vectors.most_similar('actor', topn=10)

[('priest', 0.9478057622909546),
 ('critic', 0.9460633993148804),
 ('painter', 0.9407342672348022),
 ('apprentice', 0.9377787709236145),
 ('fame', 0.9376294612884521),
 ('dancer', 0.9374251365661621),
 ('masterpiece', 0.9358892440795898),
 ('suggestion', 0.9349619150161743),
 ('servant', 0.9344593286514282),
 ('Being', 0.9334083199501038)]

In [20]:
cbow_trained_vectors.most_similar("actor", topn=10)

[('missionary', 0.9562637805938721),
 ('prohibition', 0.9549413919448853),
 ('commander', 0.9545937180519104),
 ('proposal', 0.9542382955551147),
 ('contract', 0.9517210721969604),
 ('Commission', 0.9508786201477051),
 ('fame', 0.9494434595108032),
 ('prize', 0.9459772706031799),
 ('incomplete', 0.9456023573875427),
 ('acceptance', 0.944739818572998)]

__Obs:__

+ `fame` appears in both top-10 lists.
+ Skip-gram seems to surface rarer words such as `apprentice` and `servant`.

In [21]:
sg_trained_vectors.most_similar('engineer', topn=10)

[('scientist', 0.9173941612243652),
 ('manufacturer', 0.9163181185722351),
 ('actor', 0.9149243831634521),
 ('magazine', 0.9141924977302551),
 ('commission', 0.9136365652084351),
 ('award', 0.9118192195892334),
 ('prize', 0.9062156081199646),
 ('Englishman', 0.9060776233673096),
 ('Orthodox', 0.9049038887023926),
 ('critic', 0.9046592712402344)]

In [22]:
cbow_trained_vectors.most_similar("engineer", topn=10)

[('soldier', 0.9836461544036865),
 ('attorney', 0.961202085018158),
 ('prize', 0.9600339531898499),
 ('English', 0.9591624736785889),
 ('description', 0.9585482478141785),
 ('Foundation', 0.9573974013328552),
 ('tie', 0.9568837285041809),
 ('strain', 0.956856906414032),
 ('former', 0.9559668898582458),
 ('impersonal', 0.9555135369300842)]

## Train on a bigger corpus

In [4]:
# # download data
# import os.path
# if not os.path.isfile('text8'):
#     !wget -c http://mattmahoney.net/dc/text8.zip
#     !unzip text8.zip

'unzip' is not recognized as an internal or external command,
operable program or batch file.
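
`wget` and `unzip` are shell tools that a stock Windows install does not provide, which is what the message above reports. A portable sketch using only the Python standard library follows; `fetch_and_extract` is a hypothetical helper, and the `../data` target directory is an assumption based on the `corpus_file` path used in this notebook:

```python
import os
import urllib.request
import zipfile


def fetch_and_extract(url, zip_path, target_dir):
    """Download `url` to `zip_path` (unless it already exists) and unzip it."""
    if not os.path.isfile(zip_path):
        urllib.request.urlretrieve(url, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
        # Return the paths of the extracted members.
        return [os.path.join(target_dir, name) for name in archive.namelist()]


# Example call (commented out because it downloads the archive):
# fetch_and_extract("http://mattmahoney.net/dc/text8.zip", "text8.zip", "../data")
```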


In [23]:
from gensim.models.word2vec import Text8Corpus

lr = 0.05
dim = 100
ws = 5
epoch = 5
minCount = 5
neg = 5
loss = 'ns'
t = 1e-4

# fastText-style hyperparameter names (lr, dim, ws, ...) mapped onto Gensim's Word2Vec arguments
params = {
    'alpha': lr,
    'size': dim,
    'window': ws,
    'iter': epoch,
    'min_count': minCount,
    'sample': t,
    'sg': 1,
    'hs': 1,  # use hierarchical softmax for model training
    # 'negative': neg,  # left unset, so Gensim's default (negative=5) applies
}

In [24]:
corpus_file = '../data/text8'

%time gs_model = Word2Vec(Text8Corpus(corpus_file), **params)

Wall time: 7min 53s


In [25]:
gs_vectors = gs_model.wv
gs_vectors.most_similar('actor', topn=10)

[('actress', 0.9190559387207031),
 ('comedian', 0.8873367309570312),
 ('singer', 0.858544647693634),
 ('musician', 0.830625593662262),
 ('american', 0.8068164587020874),
 ('screenwriter', 0.8065409064292908),
 ('footballer', 0.799182653427124),
 ('monkhouse', 0.7879550457000732),
 ('playwright', 0.7821558713912964),
 ('tobey', 0.7811105251312256)]

In [26]:
gs_vectors.save_word2vec_format('gensim_embeddings.txt', binary=False)

# Comparison with pre-trained GloVe

Let's load vectors that have already been trained on a much bigger dataset. [GloVe vectors](https://nlp.stanford.edu/projects/glove/) are trained with a different method than word2vec, but the resulting vectors can be read by Gensim after a format conversion. The top 50K words of the 300-dimensional "Common Crawl (42B)" vectors can be found here: [glove.42B.300d.50K.txt](https://drive.google.com/file/d/1n1jt0UIdI3CD26cY1EIeks39XH5S8O8M/view?usp=sharing); download the file and place it in your `data` directory.

In [None]:
# First convert the GloVe format into word2vec format; this creates a new file
glove_file="../data/glove.42B.300d.50K.txt"
glove_in_w2v_format="../data/glove.42B.300d.50K.w2v.txt"
_ = glove2word2vec(glove_file, glove_in_w2v_format)
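
The two text formats differ only in that the word2vec format starts with a header line giving the number of vectors and their dimensionality. A rough standard-library sketch of that conversion, for illustration only (`add_w2v_header` is a hypothetical name, not Gensim's implementation):

```python
def add_w2v_header(glove_path, w2v_path, encoding="utf-8"):
    """Copy a GloVe text file to `w2v_path`, prepending '<count> <dim>'."""
    num_vectors, dim = 0, 0
    # First pass: count the vectors and read the dimensionality.
    with open(glove_path, encoding=encoding) as src:
        for line in src:
            if num_vectors == 0:
                # Each line is a word followed by `dim` numbers.
                dim = len(line.rstrip().split(" ")) - 1
            num_vectors += 1
    # Second pass: write the header, then the vectors unchanged.
    with open(glove_path, encoding=encoding) as src, \
            open(w2v_path, "w", encoding=encoding) as dst:
        dst.write(f"{num_vectors} {dim}\n")
        for line in src:
            dst.write(line)
    return num_vectors, dim
```

Reading the file twice keeps memory use flat, which matters for embedding files that run to several gigabytes.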

In [None]:
glove = KeyedVectors.load_word2vec_format(glove_in_w2v_format, binary=False)

In [None]:
glove.most_similar("actor", topn=10)

`most_similar` supports vector arithmetic: it averages the input vectors, with each negative vector first multiplied by -1, and returns the words closest to that average. Play around with this function to discover other analogies that this representation has learned.
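
The arithmetic described above can be sketched in plain Python on a toy vocabulary. This illustrates the idea (unit-normalize, average with negated negatives, rank by cosine similarity) and is not Gensim's code; the 2-dimensional vectors and the `analogy` helper are made up for the example:

```python
import math

# Toy 2-d embeddings, invented for illustration
# (first axis: "royalty", second axis: "gender").
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to mean(positive vectors, -negative vectors)."""
    signed = [unit(vectors[w]) for w in positive]
    signed += [[-x for x in unit(vectors[w])] for w in negative]
    mean = [sum(column) / len(signed) for column in zip(*signed)]
    query = unit(mean)
    # Score every word that is not part of the query itself.
    scores = {
        word: sum(a * b for a, b in zip(unit(vec), query))
        for word, vec in vectors.items()
        if word not in positive and word not in negative
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:topn]


print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

On these toy vectors the top-ranked word is `queen`, mirroring the classic "king - man + woman" example.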

In [None]:
# Analogy: one is to two as three is to ?  (i.e. two - one + three)
# Uncomment one set of words at a time.
# one="man"
# two="king"
# three="woman"

one="paris"
two="france"
three="berlin"

glove.most_similar(positive=[two, three], negative=[one], topn=5)

We can also evaluate the quality of the learned vectors intrinsically, by comparing their similarity scores with human judgments in the WordSim-353 dataset.

In [None]:
glove.evaluate_word_pairs(datapath('wordsim353.tsv'))

In [None]:
# Evaluate the vectors trained above on text8 (`gs_vectors`)
gs_vectors.evaluate_word_pairs(datapath('wordsim353.tsv'))
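
As a sketch of what this kind of intrinsic evaluation measures: the model's similarity score for each word pair is correlated with the human rating, typically with a rank (Spearman) correlation. The helper names and the toy scores below are invented for illustration; ties receive average ranks:

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical scores for four word pairs.
human_ratings = [9.0, 7.5, 3.0, 1.0]
model_similarities = [0.8, 0.9, 0.2, 0.1]
print(spearman(human_ratings, model_similarities))
```

A value close to 1 means the model orders the word pairs much as the human annotators do; only the ordering matters, not the absolute similarity values.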