# Word Embeddings Training Tutorial
This tutorial shows how to train word embeddings with word2vec, using the Python package Gensim:

https://radimrehurek.com/gensim/

We are going to use a movie review data set to train our own word embeddings and compare them with the pre-trained word embeddings from Google.

- Movie Review data set
http://ai.stanford.edu/~amaas/data/sentiment/

- Google pre-trained word embeddings using word2vec
https://code.google.com/archive/p/word2vec/

## 1. Data Loading
The first step, as always, is to load the data and process it into a form suitable for the algorithm.

### Option 1: Not enough RAM
If you do not have enough RAM, use the SentenceIterator class, which was created to avoid running out of memory while training the model. Run the code below only if you do not have enough RAM.

In [11]:
#from sentence_iterator import SentenceIterator
#file_path = "/home/jose/aclImdb_data/all_data"
#sentences = SentenceIterator(file_path)
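
The sentence_iterator module itself is not listed here. As a rough idea of how a class like SentenceIterator can keep memory use low, the following is a minimal sketch that reads one file at a time and yields one tokenized sentence at a time. The class name StreamingSentences and the regex-based splitting are assumptions made for illustration, not the actual implementation; the code in Option 2 uses nltk's sent_tokenize and word_tokenize instead.

```python
import os
import re


class StreamingSentences:
    """Hypothetical stand-in for SentenceIterator: yields one tokenized sentence at a time."""

    def __init__(self, dir_path):
        self.dir_path = dir_path

    def __iter__(self):
        # A fresh pass over the files starts every time the object is iterated,
        # so the corpus can be scanned more than once without holding it in RAM.
        for fname in sorted(os.listdir(self.dir_path)):
            with open(os.path.join(self.dir_path, fname), "r", encoding="utf-8") as f:
                text = f.read()
            # Naive split after ., ! or ? followed by whitespace (an assumption,
            # much cruder than nltk's sent_tokenize).
            for sent in re.split(r"(?<=[.!?])\s+", text):
                tokens = re.findall(r"\w+", sent.lower())
                if tokens:
                    yield tokens
```

Because the object can be iterated repeatedly, an instance of such a class could in principle be passed to Word2Vec in place of an in-memory list of sentences.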

### Option 2: Enough RAM
If you have enough RAM, it is better to read the files, split them into sentences, and save a pickle file for future use. You only need to run this code once.

In [20]:
# Read the files
#import pickle
#import os
#from nltk import sent_tokenize
#from nltk import word_tokenize

pickle_file = "imdb_extracted_sentences.pickle"
#file_path = "/home/jose/aclImdb_data/all_data"
#sentences = []
#for fname in os.listdir(file_path):
#    # Split each review into sentences, then tokenize each sentence into words
#    with open(os.path.join(file_path, fname), 'r') as review:
#        for sent in sent_tokenize(review.read()):
#            sentences.append(word_tokenize(sent))
#
#with open(pickle_file, 'wb') as f:
#    pickle.dump(sentences, f, protocol=pickle.HIGHEST_PROTOCOL)

If you have already saved the pickle file, you do not need to run the code above; just load the pickle file.

In [21]:
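# pickle is imported here because the import in the cell above is commented out
import pickle

# The with block closes the file automatically, so no explicit close() is needed
with open(pickle_file, "rb") as f:
    sentences = pickle.load(f)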
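
The "build once, load afterwards" pattern described above can also be wrapped in a small helper so that the same cell works on the first run and on every later run. This is only a sketch: the helper name load_or_build and its build_fn argument are hypothetical and do not come from the code in this section.

```python
import os
import pickle


def load_or_build(pickle_path, build_fn):
    """Return the cached object if the pickle exists, otherwise build and cache it."""
    if os.path.exists(pickle_path):
        with open(pickle_path, "rb") as f:
            return pickle.load(f)
    # First run: build the data with the caller-supplied function and cache it
    data = build_fn()
    with open(pickle_path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)
    return data
```

Here build_fn would be a function that runs the expensive step, such as reading the review files and tokenizing them, and returns the resulting list of sentences.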

## 2. Train the model
The algorithm requires us to specify some hyperparameters, such as:
- min_count: the minimum number of times a word must occur in the corpus to be used by the algorithm during training
- embeddings_size: the size of the real-valued vector that represents each unique word in the corpus
- percent: the fraction of sentences to process (0.15 means 15%). This is useful when you want a quick result from a subset of the data
- workers: the number of worker threads used for parallel processing
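
To make the effect of min_count and percent concrete, here is a small sketch on a toy corpus. It only imitates the two ideas (taking a leading slice of the sentences and dropping rare words) with plain Python; the toy sentences are invented for illustration and the code does not call Gensim.

```python
import math
from collections import Counter

# Toy corpus (invented for illustration): each sentence is a list of tokens
toy_sentences = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "thin"],
    ["great", "cast"],
    ["terrible", "ending"],
]

min_count = 2
percent = 0.75

# Keep only the first `percent` fraction of the sentences (3 of the 4 here)
subset = toy_sentences[0:math.floor(percent * len(toy_sentences))]

# Count every word and keep those that occur at least min_count times
counts = Counter(word for sentence in subset for word in sentence)
kept = sorted(word for word, count in counts.items() if count >= min_count)

print(len(subset))  # 3
print(kept)         # ['great', 'the', 'was']
```

Words such as "movie" or "cast" occur only once in the subset, so with min_count = 2 they would receive no vector at all.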

In [17]:
# Import the library
import gensim
import math

min_count = 10
embeddings_size = 100
percent = 0.15
workers = 20
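# Train on the first `percent` fraction of the sentences.
# `size` sets the vector size in Gensim releases before 4.0; Gensim 4.0 renamed it to `vector_size`.
subset = sentences[0:math.floor(percent*len(sentences))]
model = gensim.models.Word2Vec(subset, size=embeddings_size, min_count=min_count, workers=workers)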

In [18]:
# saving the model
model.save("word2vec_model")

## 3. Inspecting the model
It is useful to look at the semantic similarity of some words as an intrinsic evaluation of the model.

In [39]:
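# In Gensim 4.0 and later, use model.wv.key_to_index instead of model.wv.vocab
print("Size of the vocabulary: " + str(len(model.wv.vocab)))
# Similarity queries live on the word vectors object (model.wv)
model.wv.most_similar(positive=['woman'])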

Size of the vocabulary: 16705


[('girl', 0.8700416088104248),
 ('man', 0.8666628003120422),
 ('lady', 0.8453189134597778),
 ('boy', 0.8107814788818359),
 ('soldier', 0.7831786870956421),
 ('doctor', 0.7675391435623169),
 ('child', 0.732384443283081),
 ('witch', 0.7167837619781494),
 ('scientist', 0.7161659002304077),
 ('cop', 0.6994512677192688)]

In [26]:
model.wv.similarity('dog', 'fox')

0.65367832859740349
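
The similarity score above is the cosine similarity between the two word vectors: the dot product divided by the product of the vector lengths. The following sketch computes it by hand. The function name cosine_similarity and the three-dimensional vectors are made up for illustration; real embeddings from this model have 100 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up 3-dimensional vectors standing in for two word embeddings
dog = [0.9, 0.1, 0.3]
fox = [0.8, 0.3, 0.1]
print(cosine_similarity(dog, fox))
```

A value close to 1 means the vectors point in nearly the same direction, while a value close to 0 means they are nearly orthogonal.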

In [29]:
model.wv['animal']

array([-0.18082204, -0.05777469,  0.28638139, -0.43027464, -0.3152011 ,
        0.06283583,  0.39537835, -0.96961647, -0.11849733, -0.01840558,
       -0.32806531,  0.03565818, -0.10399359,  0.35991848,  0.222792  ,
        0.07531917, -0.1085555 , -0.15583414, -0.06360928,  0.02611457,
       -0.10706156, -0.54339021,  0.05757902, -0.12644817, -0.22230825,
        0.7339049 ,  0.16918948,  0.28775319, -0.13272724,  0.19950433,
       -0.12631072, -0.2535795 , -0.0032704 ,  0.11509611, -0.02309781,
       -0.05108172, -0.05151635, -0.40891787,  0.21989913,  0.11440427,
       -0.0152181 , -0.27502182, -0.63500124, -0.19826609, -0.06698717,
        0.21478115, -0.03544712,  0.44662356, -0.08495852, -0.15820642,
       -0.06566726, -0.13116595, -0.22253287,  0.18820772, -0.29603323,
        0.13104057,  0.0482242 ,  0.0267632 , -0.09839885,  0.20925035,
       -0.39864907,  0.0827572 ,  0.26668873,  0.25393853,  0.20875359,
        0.3907806 , -0.26723692, -0.44289762, -0.47170168, -0.05

## 4. Load a pre-trained Model
In this case we are going to load Google's pre-trained model.

The description of the model can be found at
https://code.google.com/archive/p/word2vec/

In [31]:
google_model = gensim.models.KeyedVectors.load_word2vec_format('/data2/data/GoogleNews-vectors-negative300.bin.gz', binary=True)

### 4.1 Inspect the model

In [42]:
# In Gensim 4.0 and later, use google_model.key_to_index instead of google_model.vocab
print("Size of the Vocabulary: " + str(len(google_model.vocab)))
google_model.most_similar(positive=['woman'])

Size of the Vocabulary: 3000000


[('man', 0.7664012312889099),
 ('girl', 0.7494640946388245),
 ('teenage_girl', 0.7336829900741577),
 ('teenager', 0.631708562374115),
 ('lady', 0.6288785934448242),
 ('teenaged_girl', 0.6141784191131592),
 ('mother', 0.607630729675293),
 ('policewoman', 0.6069462299346924),
 ('boy', 0.5975908041000366),
 ('Woman', 0.5770983099937439)]
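
One simple way to compare this result with the model trained on the movie reviews is to check how many of the ten nearest neighbours of 'woman' the two models share. The sketch below copies the two neighbour lists from the outputs in this tutorial; the Jaccard overlap is just one illustrative choice of measure, not an established evaluation of either model.

```python
# Ten nearest neighbours of 'woman', copied from the two outputs in this tutorial
own_neighbors = ['girl', 'man', 'lady', 'boy', 'soldier',
                 'doctor', 'child', 'witch', 'scientist', 'cop']
google_neighbors = ['man', 'girl', 'teenage_girl', 'teenager', 'lady',
                    'teenaged_girl', 'mother', 'policewoman', 'boy', 'Woman']

# Words that appear in both lists
shared = sorted(set(own_neighbors) & set(google_neighbors))

# Jaccard overlap: shared words divided by all distinct words in either list
jaccard = len(shared) / len(set(own_neighbors) | set(google_neighbors))

print(shared)   # ['boy', 'girl', 'lady', 'man']
print(jaccard)  # 0.25
```

Four of the ten neighbours coincide, while the remaining ones reflect the different corpora: movie reviews on one side and news text on the other.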

In [43]:
google_model.similarity('dog', 'fox')

0.52753879318272923