# **Word2Vec - Skipgram**

## enwiki cleaned and tokenized with Gensim

**API:** https://radimrehurek.com/gensim/apiref.html

**Dataset:** https://dumps.wikimedia.org/enwiki/

**Info:**

https://towardsdatascience.com/word-embedding-with-word2vec-and-fasttext-a209c1d3e12c

https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/


## **1. Import packages**

In [None]:
#from google.colab import drive
#drive.mount('/content/drive', force_remount=True)
#%cd 'drive/My Drive/TFG/Code/Pre_Textual'
import gensim
import multiprocessing
from gensim.corpora.wikicorpus import WikiCorpus
from gensim.models.word2vec import Word2Vec

## **2. Construct corpus**

In [None]:
# Log gensim's progress messages at INFO level
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)

# Construct a corpus
wiki = WikiCorpus('enwiki-20210520-pages-articles.xml.bz2')

# Save the in-memory corpus object to disk
wiki.save('enwiki.corpus')

## **3. Yield list of tokens**

In [None]:
import numpy as np

# Reload the corpus object saved in step 2 with gensim's own loader
enwiki = WikiCorpus.load('enwiki.corpus')

# Restartable iterable over the dump: each pass yields one token list per article
class MySentences(object):
    def __iter__(self):
        for text in enwiki.get_texts():
            yield text

sentences = MySentences()
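
Gensim iterates over the corpus more than once (one pass to build the vocabulary, then one pass per epoch), which is why the sentences are wrapped in a class with `__iter__` rather than passed as a plain generator. The sketch below illustrates the difference in plain Python. `get_texts` and `RestartableCorpus` are hypothetical stand-ins with made-up documents, not part of gensim.

```python
# Toy stand-in for enwiki.get_texts(); the documents below are made up.
def get_texts():
    yield ['anarchism', 'is', 'a', 'political', 'philosophy']
    yield ['autism', 'is', 'a', 'neurodevelopmental', 'condition']


class RestartableCorpus:
    """Iterable that starts a fresh pass every time it is iterated."""

    def __iter__(self):
        for tokens in get_texts():
            yield tokens


one_shot = get_texts()             # plain generator: exhausted after one pass
restartable = RestartableCorpus()  # new generator on every iteration

# Two passes over each corpus, counting the documents seen per pass
one_shot_counts = [sum(1 for _ in one_shot) for _ in range(2)]
restartable_counts = [sum(1 for _ in restartable) for _ in range(2)]

print(one_shot_counts)     # [2, 0]
print(restartable_counts)  # [2, 2]
```

The generator is empty on its second pass, so a model fed with it would see no data after the vocabulary scan.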

## **4. Train skip-gram model**

In [None]:
# Word2Vec skip-gram model
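w2v = Word2Vec(sentences,
               vector_size=300,   # embedding dimensionality
               window=5,          # max distance between target and context word
               min_count=5,       # ignore words seen fewer than 5 times
               negative=5,        # negative samples per positive example
               workers=multiprocessing.cpu_count(),
               sg=1,              # 1 = skip-gram, 0 = CBOW
               sample=1e-5,       # downsampling threshold for frequent words
               epochs=10)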

w2v.save('w2v.model')

# Save word vectors
np.save('w2v_vectors.npy', w2v.wv.vectors)

# Save word keys (row i of the vectors belongs to words[i])
# gensim < 4.0: words = np.array(list(w2v.wv.vocab.keys()))
words = np.array(list(w2v.wv.index_to_key))
np.save('w2v_keys.npy', words)
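
The two `.npy` files only make sense together, since row `i` of the vector matrix belongs to the `i`-th key. A minimal sketch of reading them back without gensim follows. The tiny `keys` and `vectors` arrays are made-up stand-ins for the real exports, and they are written to a temporary directory so the example does not touch the files above.

```python
import os
import tempfile

import numpy as np

# Hypothetical stand-ins for w2v.wv.index_to_key and w2v.wv.vectors
keys = np.array(['the', 'computer', 'car'])
vectors = np.arange(12, dtype=np.float32).reshape(3, 4)

with tempfile.TemporaryDirectory() as tmp:
    keys_path = os.path.join(tmp, 'w2v_keys.npy')
    vectors_path = os.path.join(tmp, 'w2v_vectors.npy')
    np.save(keys_path, keys)
    np.save(vectors_path, vectors)

    # Both arrays are fully read into memory before the directory is removed
    loaded_keys = np.load(keys_path)
    loaded_vectors = np.load(vectors_path)

# Map each word to its row in the vector matrix
word_to_index = {word: i for i, word in enumerate(loaded_keys.tolist())}
computer_vector = loaded_vectors[word_to_index['computer']]
print(computer_vector)
```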

## **5. Test Model**

In [None]:
import gensim 
model = gensim.models.Word2Vec.load('w2v.model')

# Test the model
model.wv.most_similar('computer')

In [None]:
# Main params of the model
print(model)
# Testing word pairs
print(model.wv.similarity('apple', 'banana'))
print(model.wv.similarity('car', 'bus'))
print(model.wv.similarity('car', 'ship'))

print(model.wv.similarity('apple', 'bus'))
print(model.wv.similarity('apple', 'car'))


print(model.wv.most_similar(positive=['car', 'minivan'], topn=5))
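
The scores printed by `similarity` are cosine similarities between the two word vectors. The sketch below recomputes that quantity with NumPy on made-up 3-dimensional vectors. `cosine_similarity` is a hypothetical helper, not a gensim function, and the vectors are illustrative rather than taken from the trained model.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Made-up vectors: 'car' and 'bus' point the same way, 'apple' is orthogonal
car = [1.0, 2.0, 0.0]
bus = [2.0, 4.0, 0.0]
apple = [0.0, 0.0, 3.0]

car_bus = cosine_similarity(car, bus)
car_apple = cosine_similarity(car, apple)
print(car_bus, car_apple)
```

Because the measure depends only on direction, scaling a vector (as with `bus` here) leaves the score unchanged.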

## **6. Train Model with VisualGenome sentences**

In [None]:
import numpy as np

# Load VisualGenome sentences
VG_ARRAY = np.load('VG_sentences.npy', allow_pickle=True)
VG_SENTENCES = VG_ARRAY.tolist() # Convert to list
COUNT_SENTENCES = len(VG_SENTENCES) # Count of sentences

In [None]:
import logging
logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
logging.root.setLevel(level=logging.INFO)

import gensim 
model = gensim.models.Word2Vec.load('w2v.model')

In [None]:
# Continue training on the VisualGenome sentences.
# No vocabulary update is done here, so only words already in the model are trained.
model.train(corpus_iterable=VG_SENTENCES, total_examples=COUNT_SENTENCES, epochs=10)

model.save('VG_w2v.model')

# Save the word vectors fine-tuned on VisualGenome
np.save('VG_w2v_vectors.npy', model.wv.vectors)
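
Since `train` is called without a prior `build_vocab(..., update=True)`, tokens that are missing from the Wikipedia vocabulary are skipped during this step. A rough way to see how much of the new data is actually used is to measure token coverage. The sketch below assumes each sentence is a list of tokens. `vocabulary_coverage`, the sample sentences, and the sample vocabulary are all hypothetical; with the real model, `known_words` would be `model.wv.index_to_key`.

```python
def vocabulary_coverage(sentences, known_words):
    """Fraction of tokens in `sentences` that appear in `known_words`."""
    known = set(known_words)
    total = 0
    covered = 0
    for tokens in sentences:
        for token in tokens:
            total += 1
            if token in known:
                covered += 1
    return covered / total if total else 0.0


# Made-up stand-ins for VG_SENTENCES and the model vocabulary
sample_sentences = [['a', 'man', 'riding', 'a', 'skateboard'],
                    ['a', 'zebra', 'grazing']]
sample_vocabulary = ['a', 'man', 'riding', 'zebra']

coverage = vocabulary_coverage(sample_sentences, sample_vocabulary)
print(coverage)  # 0.75
```

A low coverage value would suggest extending the vocabulary before the extra training pass.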