# Word2Vec 

In this notebook we train word embeddings on the NoWaC dataset. The goal is simply to investigate how Word2Vec works.

Some notes on how the data is processed:
- Punctuation is not removed, but interpreted as words.
- All words are lower case.
- Words that occur fewer than 5 times in the corpus are removed.
- CBOW is used.
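
The preprocessing rules above can be sketched on a toy corpus. This is only an illustration: `tokenize` and `filter_rare` are hypothetical helpers, and the sketch assumes that punctuation is already separated by whitespace in the corpus file. In the notebook itself, the rare-word filtering is handled by gensim through the `min_count` parameter rather than by hand.

```python
from collections import Counter


def tokenize(line):
    # Lower-case and split on whitespace; punctuation that is already
    # whitespace-separated stays as its own token.
    return line.lower().split()


def filter_rare(sentences, min_count=5):
    # Count every token in the corpus, then drop tokens seen fewer than
    # min_count times.
    counts = Counter(tok for sent in sentences for tok in sent)
    return [[tok for tok in sent if counts[tok] >= min_count]
            for sent in sentences]


# Toy corpus: one sentence repeated five times plus one rare sentence.
toy = ["Hei , verden !"] * 5 + ["Sjelden setning ."]
tokenized = [tokenize(line) for line in toy]
filtered = filter_rare(tokenized, min_count=5)

print(filtered[0])   # tokens of the frequent sentence survive
print(filtered[-1])  # every token of the rare sentence is dropped
```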

In [1]:
from gensim.models import Word2Vec
import re
import codecs

In [2]:
! du -h nowac_cleaned.txt

2.3G	nowac_cleaned.txt


In [3]:
! wc nowac_cleaned.txt

  22419428  413486154 2371748257 nowac_cleaned.txt


## Read from file with generator

We read from the file with a generator so we don't have to fit the whole file in memory.

Reading in chunks would be faster (possibly 25% less time), but then we would have to generate the vocabulary separately, and possibly read the file multiple times.

In [2]:
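class Sentences(object):
    """Iterate over the corpus file, yielding one tokenized line at a time."""
    startNr = re.compile(r'\A\d+\s+')  # matches the number at the start of a line

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # Re-open the file on every pass so the corpus can be iterated repeatedly.
        with codecs.open(self.filename, encoding='utf-8') as f:
            for line in f:
                # Strip the leading number, lower-case, and split on whitespace.
                yield self.startNr.sub('', line).lower().split()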
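
Vocabulary building and each training pass iterate over the corpus separately, so the object must be re-iterable and not a one-shot generator. The following is a minimal sketch of that property, using a hypothetical stand-in class `NumberedLines` and a temporary two-line file in an assumed `<number> <text>` layout. It relies only on the standard library.

```python
import io
import os
import re
import tempfile


class NumberedLines(object):
    """Hypothetical stand-in for the Sentences class in this notebook."""
    start_nr = re.compile(r'\A\d+\s+')  # leading line number

    def __init__(self, filename):
        self.filename = filename

    def __iter__(self):
        # A new file handle is opened each time iteration starts.
        with io.open(self.filename, encoding='utf-8') as f:
            for line in f:
                yield self.start_nr.sub('', line).lower().split()


# Write a tiny file in the assumed "<number> <text>" layout.
with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as tmp:
    tmp.write("1 Hei , verden !\n2 Dette er en test .\n")
    path = tmp.name

corpus = NumberedLines(path)
first_pass = list(corpus)
second_pass = list(corpus)  # works again because __iter__ restarts the read
os.remove(path)

print(first_pass)
print(first_pass == second_pass)
```

Only one line is held in memory at a time, at the cost of re-reading the file from disk for every pass.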

## Test case 

In [111]:
# ! head -n 1000000 nowac_cleaned.txt > test-nowac.txt

In [112]:
# sentences = Sentences('test-nowac.txt')

## One pass through the data

Estimated to take around 20 minutes.

In [119]:
sentences = Sentences('nowac_cleaned.txt')

In [120]:
model = Word2Vec(size=100, # dimensionality of the word embeddings
                 window=5, # max distance between current and predicted word
                 min_count=5, # ignore words that occur fewer times than this
                 workers=2, # number of worker threads
                 sg=0, # 0 for CBOW and 1 for skip-gram
                 iter=1, # number of iterations over the corpus
                 )

In [121]:
%%time
model.build_vocab(sentences)

CPU times: user 6min 51s, sys: 6.56 s, total: 6min 58s
Wall time: 7min 5s


In [122]:
%%time
model.train(sentences)

CPU times: user 24min 12s, sys: 36.7 s, total: 24min 48s
Wall time: 12min 7s


284351565

In [124]:
model.most_similar(positive=['hun', 'gutt'], negative=['han'])

[(u'jente', 0.9299687743186951),
 (u'dame', 0.8450915217399597),
 (u'pike', 0.7537344098091125),
 (u'kvinne', 0.7472227811813354),
 (u'ten\xe5ring', 0.7296069860458374),
 (u'datter', 0.7209495902061462),
 (u'venninne', 0.7089371681213379),
 (u'stores\xf8ster', 0.7047269940376282),
 (u'lilles\xf8ster', 0.7041319012641907),
 (u'mann', 0.7030349373817444)]
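
To make the analogy query more concrete, here is a rough sketch of the arithmetic behind it: add the unit vectors of the positive words, subtract those of the negative words, and rank the remaining words by cosine similarity to the result. The `analogy` helper and the 2-d vectors are made up for illustration. This is my understanding of the general idea and not a copy of gensim's implementation.

```python
import math


def unit(v):
    # Scale a vector to length 1.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=3):
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    query = unit(query)
    exclude = set(positive) | set(negative)  # never return the query words
    scores = [(w, sum(q * x for q, x in zip(query, unit(v))))
              for w, v in vectors.items() if w not in exclude]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: axis 0 loosely encodes gender, axis 1 loosely encodes age.
toy = {
    'han':    [1.0, 0.0],
    'hun':    [-1.0, 0.0],
    'gutt':   [1.0, 1.0],
    'jente':  [-1.0, 1.0],
    'mann':   [1.0, -1.0],
    'kvinne': [-1.0, -1.0],
}

result = analogy(toy, positive=['hun', 'gutt'], negative=['han'])
print(result[0][0])  # the closest remaining word in this toy space
```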

## A second pass

In [125]:
%%time
model.train(sentences)

CPU times: user 24min 22s, sys: 33 s, total: 24min 55s
Wall time: 11min 59s


284342197

In [126]:
model.most_similar(positive=['hun', 'gutt'], negative=['han'])

[(u'jente', 0.9386008977890015),
 (u'dame', 0.8350882530212402),
 (u'pike', 0.7984364032745361),
 (u'kvinne', 0.7504758238792419),
 (u'ten\xe5ring', 0.7339444160461426),
 (u'datter', 0.7331860065460205),
 (u'venninne', 0.7170625329017639),
 (u'kattunge', 0.7054935097694397),
 (u'mann', 0.7037138938903809),
 (u'lilles\xf8ster', 0.6950181126594543)]

## A third pass

In [133]:
%%time
model.train(sentences)

CPU times: user 23min 48s, sys: 30.9 s, total: 24min 19s
Wall time: 11min 35s


284356813

In [134]:
model.most_similar(positive=['hun', 'gutt'], negative=['han'])

[(u'jente', 0.9425826668739319),
 (u'dame', 0.8421979546546936),
 (u'pike', 0.8175129890441895),
 (u'kvinne', 0.7444599866867065),
 (u'ten\xe5ring', 0.7209830284118652),
 (u'datter', 0.7146647572517395),
 (u'mann', 0.7126047611236572),
 (u'lilles\xf8ster', 0.7066370248794556),
 (u'jentunge', 0.7033668756484985),
 (u'venninne', 0.6961722373962402)]

## Save the model

In [135]:
model.save('models/first_model')

In [136]:
! du -h models/first_model*

57M	models/first_model
267M	models/first_model.syn0.npy
267M	models/first_model.syn1neg.npy


## Load the model

In [6]:
model = Word2Vec.load('models/first_model')

In [7]:
model.most_similar(positive=['hun', 'gutt'], negative=['han'])

[(u'jente', 0.9425826668739319),
 (u'dame', 0.8421979546546936),
 (u'pike', 0.8175129890441895),
 (u'kvinne', 0.7444599866867065),
 (u'ten\xe5ring', 0.7209830284118652),
 (u'datter', 0.7146647572517395),
 (u'mann', 0.7126047611236572),
 (u'lilles\xf8ster', 0.7066370248794556),
 (u'jentunge', 0.7033668756484985),
 (u'venninne', 0.6961722373962402)]

## Investigate the model

In [10]:
model.estimate_memory()

{'syn0': 279679600,
 'syn1neg': 279679600,
 'total': 908958700,
 'vocab': 349599500}
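
These numbers can be reproduced with simple arithmetic from the vocabulary size of 699199 that `len(model.vocab)` reports in this notebook and the embedding size of 100. The sketch below assumes 4-byte (float32) weights and a flat overhead of 500 bytes per vocabulary entry. Both values are assumptions that happen to match the output above; they are not taken from the gensim documentation.

```python
vocab_size = 699199          # len(model.vocab) in this notebook
vector_size = 100            # the size= argument passed to Word2Vec
bytes_per_float = 4          # assumption: float32 weights
bytes_per_vocab_entry = 500  # assumption: rough per-word overhead

syn0 = vocab_size * vector_size * bytes_per_float  # input embeddings
syn1neg = syn0                                     # output weights, same shape
vocab = vocab_size * bytes_per_vocab_entry
total = syn0 + syn1neg + vocab

# Size of one weight matrix in MiB, for comparison with the du output
# for the saved .npy files.
syn0_mib = syn0 / (1024.0 ** 2)

print(syn0, vocab, total)
print(round(syn0_mib))
```

The two weight matrices account for most of the on-disk size, which is consistent with the 267M that `du` reports for each saved `.npy` file.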

As far as I can tell, the vocabulary contains this many words:

In [12]:
len(model.vocab)

699199

and this many instances of hei:

In [20]:
model.vocab['hei'].count

51190

and this is the vector for hei:

In [21]:
model['hei']

array([ 1.54843998, -0.32993838, -1.70601118, -3.39740086, -3.05587769,
        1.68532646,  0.65438181, -2.35846734,  0.28345835, -0.75644213,
        2.5716002 ,  1.51044118, -5.34372663, -0.00618741,  0.77062452,
       -2.53049088, -2.91748667,  2.61833525,  0.35818765,  1.6469332 ,
        1.36840808, -3.23291183,  2.53219628, -1.84026778,  1.1187433 ,
       -1.14597976, -0.34033257, -1.15849817, -0.54650331,  0.68562329,
       -0.90300816,  1.64296544,  2.73593807,  4.24588299,  0.40872321,
       -0.21481653,  2.02794957, -1.96033096, -3.63693905,  0.83259249,
        2.25177884,  0.33194867,  1.64359975, -0.44653463, -0.3839815 ,
       -0.25759134, -1.73013878,  1.3357091 ,  3.07605457,  0.74214005,
       -1.05492687, -0.40796721, -2.86962771, -4.00833416, -0.38941222,
       -1.14391899,  0.48896596,  3.42566657, -2.04793859, -1.46021271,
        4.4045105 , -0.14737129, -2.62188148,  0.81073081, -1.07147849,
        0.89766145, -2.16014409,  1.23088992,  1.84157062, -2.33