# **Word2Vec - Skipgram**

## enwiki cleaned and tokenized with Gensim

**API:** https://radimrehurek.com/gensim/apiref.html

**Dataset:** https://dumps.wikimedia.org/enwiki/

**Info:**

https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c

https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/


## **1. Import packages**

In [None]:
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)
#%cd 'drive/My Drive/TFG/Code/Pre_Textual'
import gensim
import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec

## **2. Construct corpus**

In [None]:
# Enable INFO-level logging to follow gensim's progress
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)

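# Build the corpus from the Wikipedia dump (the dump is scanned once to build the dictionary)
wiki = WikiCorpus('enwiki-20210520-pages-articles.xml.bz2')

# Pickle the corpus object (dictionary and settings, not the article texts)
wiki.save('enwiki.corpus')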

2021-05-07 22:19:31,473: INFO: adding document #0 to Dictionary(0 unique tokens: [])
2021-05-07 22:20:22,670: INFO: adding document #10000 to Dictionary(462543 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
2021-05-07 22:21:14,914: INFO: adding document #20000 to Dictionary(664089 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
2021-05-07 22:21:54,176: INFO: adding document #30000 to Dictionary(810263 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
2021-05-07 22:22:29,129: INFO: adding document #40000 to Dictionary(938507 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
2021-05-07 22:22:56,226: INFO: adding document #50000 to Dictionary(1022118 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
2021-05-07 22:23:10,909: INFO: adding document #60000 to Dictionary(1040970 unique tokens: ['ability', 'able', 'abolish', 'abolition', 'about']...)
2021-05-07 22:23:23,853: INFO: 

## **3. Yield list of tokens**

In [None]:
import numpy as np
from gensim.corpora.wikicorpus import WikiCorpus

# Reload the pickled corpus object saved in step 2
enwiki = WikiCorpus.load('enwiki.corpus')

# Restartable iterable over the dump: each item is one article's list of tokens
class MySentences:
    def __iter__(self):
        for text in enwiki.get_texts():
            yield text
            
sentences = MySentences()
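
`MySentences` is a class with `__iter__` rather than a plain generator because Word2Vec walks the corpus more than once (one pass to build the vocabulary, then one pass per epoch). A minimal sketch of the difference, using a made-up two-document toy corpus in place of `enwiki.get_texts()`; the names `TOY_TEXTS`, `one_shot` and `Restartable` are hypothetical and exist only for this illustration:

```python
# Toy stand-in for enwiki.get_texts(); the documents are made up for illustration.
TOY_TEXTS = [['anarchism', 'is', 'a', 'philosophy'],
             ['autism', 'is', 'a', 'disorder']]

def one_shot():
    # A generator object can be consumed only once.
    for text in TOY_TEXTS:
        yield text

class Restartable:
    # __iter__ builds a fresh generator every time iteration starts.
    def __iter__(self):
        for text in TOY_TEXTS:
            yield text

gen = one_shot()
first_pass = len(list(gen))     # sees both documents
second_pass = len(list(gen))    # generator is exhausted, sees nothing

corpus = Restartable()
passes = [len(list(corpus)) for _ in range(3)]  # every pass sees both documents

print(first_pass, second_pass, passes)
# 2 0 [2, 2, 2]
```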

## **4. Train skip-gram model**

In [None]:
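# Word2Vec skip-gram model
w2v = Word2Vec(sentences,
               vector_size=300,   # dimensionality of the word vectors
               window=5,          # maximum distance between center and context word
               min_count=5,       # ignore words with fewer than 5 occurrences
               negative=5,        # negative samples per positive example
               workers=multiprocessing.cpu_count(),
               sg=1,              # 1 = skip-gram (0 would be CBOW)
               sample=1e-5,       # subsampling threshold for frequent words
               epochs=10)         # passes over the corpus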

w2v.save('w2v.model')

# Save the word vectors (one row per word)
np.save('w2v_vectors.npy', w2v.wv.vectors)

# Save the vocabulary, in the same order as the vector rows
# (gensim < 4.0: list(w2v.wv.vocab.keys()))
words = np.array(w2v.wv.index_to_key)
np.save('w2v_keys.npy', words)
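
With `sg=1`, each word is trained to predict the words around it, up to `window` positions away. A simplified sketch of how such (center, context) pairs are formed. `skipgram_pairs` is a hypothetical helper, not a gensim function, and it leaves out the subsampling (`sample`) and negative sampling (`negative`) that the real training also applies:

```python
def skipgram_pairs(tokens, window):
    """Return every (center, context) pair within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the context window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# Toy sentence, made up for illustration.
pairs = skipgram_pairs(['the', 'cat', 'sat'], window=1)
print(pairs)
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```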

## **5. Test Model**

In [None]:
import gensim 
model = gensim.models.Word2Vec.load('w2v.model')

# Test the model: nearest neighbours of 'computer' by cosine similarity
model.wv.most_similar('computer')

[('computers', 0.7918115258216858),
 ('computing', 0.7566682696342468),
 ('software', 0.7058992981910706),
 ('sigcas', 0.6703202128410339),
 ('cgrg', 0.6638060212135315),
 ('technology', 0.6627057790756226),
 ('probeware', 0.6473554968833923),
 ('vlsi', 0.6452200412750244),
 ('icics', 0.6422359347343445),
 ('teleinformatics', 0.6419001817703247)]

In [None]:
# Summary of the model (vocabulary size, vector size, learning rate)
print(model)

# Cosine similarity of related word pairs
print(model.wv.similarity('apple', 'banana'))
print(model.wv.similarity('car', 'bus'))
print(model.wv.similarity('car', 'ship'))

# Cosine similarity of unrelated word pairs
print(model.wv.similarity('apple', 'bus'))
print(model.wv.similarity('apple', 'car'))

# Words closest to 'car' and 'minivan' taken together
print(model.wv.most_similar(positive=['car', 'minivan'], topn=5))

Word2Vec(vocab=2514312, vector_size=300, alpha=0.025)
0.43023682
0.4644236
0.26581895
0.24994202
0.17902678
[('suv', 0.8029348254203796), ('cars', 0.7596056461334229), ('suvs', 0.721185028553009), ('vehicle', 0.7203112244606018), ('truck', 0.7158130407333374)]
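
The scores above are cosine similarities between word vectors. As a rough sketch, the same quantity can be computed by hand from a list of keys and a matching list of vectors. The toy `keys` and `vectors` below are made up for illustration; in this notebook the saved `w2v_keys.npy` and `w2v_vectors.npy` files would play that role. `cosine_similarity` and `similarity` are hypothetical helper names:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy data (assumption): row i of `vectors` belongs to keys[i].
keys = ['car', 'bus', 'apple']
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
index = {word: i for i, word in enumerate(keys)}

def similarity(w1, w2):
    # Look up both rows, then compare them.
    return cosine_similarity(vectors[index[w1]], vectors[index[w2]])

car_bus = similarity('car', 'bus')
car_apple = similarity('car', 'apple')
print(round(car_bus, 4), car_apple)
# 0.7071 0.0
```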


## **6. Continue training on Visual Genome sentences**

In [None]:
import numpy as np

# Load VisualGenome sentences
VG_ARRAY = np.load('VG_sentences.npy', allow_pickle=True)
VG_SENTENCES = VG_ARRAY.tolist() # Convert to list
COUNT_SENTENCES = len(VG_SENTENCES) # Count of sentences

In [None]:
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)

import gensim 
model = gensim.models.Word2Vec.load('w2v.model')

2021-05-26 19:50:36,090: INFO: loading Word2Vec object from /home/mariella/Downloads/Jose_Lainer/w2v.model
2021-05-26 19:50:36,798: INFO: loading wv recursively from /home/mariella/Downloads/Jose_Lainer/w2v.model.wv.* with mmap=None
2021-05-26 19:50:36,798: INFO: loading vectors from /home/mariella/Downloads/Jose_Lainer/w2v.model.wv.vectors.npy with mmap=None
2021-05-26 19:50:37,489: INFO: loading syn1neg from /home/mariella/Downloads/Jose_Lainer/w2v.model.syn1neg.npy with mmap=None
2021-05-26 19:50:39,572: INFO: setting ignored attribute cum_table to None
2021-05-26 19:50:54,519: INFO: Word2Vec lifecycle event {'fname': '/home/mariella/Downloads/Jose_Lainer/w2v.model', 'datetime': '2021-05-26T19:50:54.519421', 'gensim': '4.0.1', 'python': '3.8.5 (default, Jan 27 2021, 15:41:15) \n[GCC 9.3.0]', 'platform': 'Linux-5.8.0-53-generic-x86_64-with-glibc2.29', 'event': 'loaded'}


In [None]:
# Update the weights with the Visual Genome sentences.
# train() does not extend the vocabulary: words not seen in enwiki are skipped.
model.train(corpus_iterable=VG_SENTENCES, total_examples=COUNT_SENTENCES, epochs=10)

model.save('VG_w2v.model')

# Save the updated word vectors
np.save('VG_w2v_vectors.npy', model.wv.vectors)
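
Because `model.train` only updates words that are already in the model's vocabulary, it can be worth checking how many Visual Genome tokens fall outside it. A minimal sketch, assuming `VG_SENTENCES` is a list of token lists. `out_of_vocabulary` is a hypothetical helper, and the toy sentences and vocabulary below stand in for `VG_SENTENCES` and `model.wv.key_to_index`:

```python
from collections import Counter

def out_of_vocabulary(sentences, vocab):
    """Count tokens in `sentences` that are missing from `vocab`."""
    missing = Counter()
    for sentence in sentences:
        for token in sentence:
            if token not in vocab:
                missing[token] += 1
    return missing

# Toy stand-ins (assumption): made-up sentences and a made-up vocabulary.
toy_sentences = [['man', 'riding', 'skateboard'], ['skateboard', 'on', 'ramp']]
toy_vocab = {'man': 0, 'riding': 1, 'on': 2, 'ramp': 3}

missing = out_of_vocabulary(toy_sentences, toy_vocab)
print(missing)
# Counter({'skateboard': 2})
```

If many tokens turn up here, gensim's `build_vocab(..., update=True)` can add them to the model before `train` is called. Whether that is desirable depends on the experiment.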