# Word2Vec

The Word2Vec technique is a method for obtaining and word embeding suitable for natural language processing, such as finding synonyms, making analogies and suggesting missing words for a sentence. It uses a neural network to learn from a large text (or corpus) am build a n-dimensional space containing one vector for each word on the training vocabulary.

This technique is relatively recent, being first published by a team of Google researches in 2013. The two main algorithms applied are skipgram and cbow. In both cases, we use a set of hyperparams, such as number of features in the word embeding (or the dimensionality of each word vector), the number of epochs and the size of our context window, to define the way our language model will be trained.

The objective of this notebook, is to explore how the variation of these hyperparameters will affect the accuracy of our model. The following implementation was made using gensim. It covers the training and validation of various models, however, **it does not cover any kind of sophisticated text lemmatization**, since we are using an already noralized corpus. Needless to say, that is not the real word scenario in most cases, therefore, any pratical implementation of Word2Vec should take this into account.

In [1]:
import math
import sys
import pandas as pd
import itertools
from time import time, sleep
import multiprocessing

# File management
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
from os import path, remove

# NLP
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath
import spacy
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /Users/eem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Downloading the corpus

The first thing we got to do is to download the corpus we will be using to train our model. Here we will use a text of lowercase and unponctuated englih words supported by gensim as standard. The file is 100MB long, so it wont be included on the repository, nevertheless, the code bellow should perform the propper configuration.

In [2]:
corpus_uri = 'http://mattmahoney.net/dc/text8.zip'
target_name = './corpus.txt'
corpus_language = 'english'

if not path.exists(target_name):
    try:
        resp = urlopen(corpus_uri)
        file = ZipFile(BytesIO(resp.read()))

        target_file = open(target_name, 'w')
        for line in file.open(file.namelist()[0]).readlines():
            target_file.write(line.decode('utf-8'))
        target_file.close()
    except:
        if path.exists(target_name):
            remove(target_name)
            
corpus = open(target_name, 'r')
content = corpus.read()
print(content[:1024])

 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic instituti

## Pre processing

Our text is mostly normalized, so all we are going to do is break the text into sentences (using gensim standard support) and remove stopwords using the natural language toolkit library group of english stopwords.

In [3]:
# Split text into sentences
stopwords = nltk.corpus.stopwords.words('english')
sentences = list(itertools.islice(Text8Corpus(target_name),None))
sentences = [[word for word in sentence if word not in stopwords] for sentence in sentences]

## Creating and training the language model

The following functions are used for building the vocabulary and training our models. We will have a set of 81 combinations of parametres for each algorithm, therefore we will have to train 162 models. Each model takes a few minutes to train so it would be a very time consuming process to run each time. Gladly, gensim allows us to save our models into files, so we only have to train the models once.

The code will then verify if for a given set of hyperparameters a model was already created, if so, it will simply load the model from its file.

In [4]:
def build_vocabulary(model, sentences):
    t = time()
    model.build_vocab(sentences, progress_per=10000)
    print('Time to build vocab: {} mins'.format(round((time() - t) / 60, 2)))

def train(model, sentences, epochs, corpus_size):
    t = time()
    model.train(sentences, total_examples=corpus_size, epochs=epochs, report_delay=1)
    print('Time to train the model: {} mins'.format(round((time() - t) / 60, 2)))

def build_model(sentences, min_count, window, vector_size, alpha, epochs, sg, corpus_size, model_name):
    cores = multiprocessing.cpu_count()
    model = Word2Vec(min_count=min_count, window=window, vector_size=vector_size, alpha=0.001, workers=cores-1, sg=sg)
    corpus_size = model.corpus_count if corpus_size == 0 else corpus_size
    
    build_vocabulary(model, sentences)
    train(model, sentences, epochs, corpus_size)
    
    model.save(model_name)
    
def get_model_name(sg, window, vector_size, epochs, corpus_size):
    return "./models/{sg}-{window}-{vector_size}-{epochs}-{corpus_size}.model".format(
        sg = 'skipgram' if sg == 1 else 'cbow',
        window = window,
        vector_size = vector_size,
        epochs = epochs,
        corpus_size = corpus_size
    )
    
def build_if_not_exists(sentences, sg=1, window=2, vector_size=100, epochs=30, corpus_size=0):
    model_name = get_model_name(sg, window, vector_size, epochs, corpus_size)
    
    if not path.isfile(model_name):
        return build_model(
            sentences = sentences,
            min_count = 10,
            window = window,
            vector_size = vector_size,
            alpha = 0.001,
            epochs = epochs,
            sg = sg,
            corpus_size = corpus_size,
            model_name = model_name
        )

### Parameters

For the hyperparameters to be tunned we will have:

1. vector_sizes, wich indicates the number of features in each word embeding of the model
2. windows, which reffers to the size of the context use to evaluate the sentences during training
3. corpus_sizes, which is the number of sentences from our corpus that will be considered for training
4. the number of epochs performed during training

each one of the parameters will have three values and each value will be evenly permutated generating 81 combinations.

In [5]:
vector_sizes = [50, 100, 300]
windows = [2, 5, 10]
corpus_sizes = [math.floor(len(sentences)*.33), math.floor(len(sentences)*.66), len(sentences)]
epochs = [10, 20, 30]

params = [list(i) for i in itertools.product(windows, vector_sizes, epochs, corpus_sizes)]

### Skipgram

In [6]:
for param in params:
    build_if_not_exists(sentences, 1, param[0], param[1], param[2], param[3])

Time to build vocab: 0.04 mins
Time to train the model: 1.19 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.23 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.21 mins
Time to build vocab: 0.04 mins
Time to train the model: 2.42 mins
Time to build vocab: 0.04 mins
Time to train the model: 2.52 mins
Time to build vocab: 0.04 mins
Time to train the model: 2.43 mins
Time to build vocab: 0.05 mins
Time to train the model: 3.6 mins
Time to build vocab: 0.04 mins
Time to train the model: 3.58 mins
Time to build vocab: 0.04 mins
Time to train the model: 3.59 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.24 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.26 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.24 mins
Time to build vocab: 0.04 mins
Time to train the model: 2.48 mins
Time to build vocab: 0.04 mins
Time to train the model: 2.49 mins
Time to build vocab: 0.04 mins
Time to train the model: 2.48 mins
Time to bui

### Cbow

In [7]:
for param in params:
    build_if_not_exists(sentences, 0, param[0], param[1], param[2], param[3])

Time to build vocab: 0.04 mins
Time to train the model: 0.56 mins
Time to build vocab: 0.04 mins
Time to train the model: 0.56 mins
Time to build vocab: 0.04 mins
Time to train the model: 0.57 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.12 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.12 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.13 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.68 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.68 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.68 mins
Time to build vocab: 0.04 mins
Time to train the model: 0.6 mins
Time to build vocab: 0.04 mins
Time to train the model: 0.61 mins
Time to build vocab: 0.04 mins
Time to train the model: 0.6 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.22 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.21 mins
Time to build vocab: 0.04 mins
Time to train the model: 1.22 mins
Time to buil

### Retrieving a model

In [8]:
def get_model(corpus_size, sg=1, window=2, vector_size=100, epochs=30):
    model_name = get_model_name(sg, window, vector_size, epochs, corpus_size)
    
    if path.isfile(model_name):
        return Word2Vec.load(model_name).wv
    else:
        print('Model not trained')

## Analogies

With our model trained, we can query for analog words. Take for example the pair germany and berlin, there is a clear relation between them, so if we were to provide a third word, it is to be expected that the model will return a fourth word bearing the same relation with the third word as the second did to the first, in this case, supose we input the word france, a reasonable answer would be paris, since paris is to france as berlin is to germany.

This is shown bellow.

In [9]:
def analogy(model, word, is_to, as_word):
    result = model.most_similar(negative=[word], positive=[is_to, as_word])
    return result[0][0]

In [67]:
model = get_model(corpus_size=len(sentences), sg=1, window=5, vector_size=100, epochs=20)
print('germany is to berlin as france is to: {}'.format(analogy(model, 'germany', 'berlin', 'france')))
print('star is to sun as planet is to: {}'.format(analogy(model, 'star', 'sun', 'planet')))
print('man is to king as woman is to: {}'.format(analogy(model, 'man', 'king', 'woman')))
print('teacher is to school as nurse is to: {}'.format(analogy(model, 'teacher', 'school', 'nurse')))
print('frederick is to king as elizabeth is to: {}'.format(analogy(model, 'frederick', 'king', 'elizabeth')))
print('feline is to cat as canine is to: {}'.format(analogy(model, 'feline', 'cat', 'canine')))
print('car is to road as boat is to: {}'.format(analogy(model, 'car', 'road', 'boat')))
print('fast is to faster as easy is to: {}'.format(analogy(model, 'fast', 'faster', 'easy')))
print('small is to big as good is to: {}'.format(analogy(model, 'small', 'big', 'good')))

germany is to berlin as france is to: paris
star is to sun as planet is to: jupiter
man is to king as woman is to: empress
teacher is to school as nurse is to: clinic
frederick is to king as elizabeth is to: queen
feline is to cat as canine is to: dog
car is to road as boat is to: shore
fast is to faster as easy is to: easier
small is to big as good is to: bad


Of course, it does not always work...

In [70]:
print('hat is to head as shirt is to: {}'.format(analogy(model, 'hat', 'head', 'shoe')))

hat is to head as shirt is to: severed


We can use this caracteristic to measure the accuracy of our model. Given a collection of pre stablished analogies, we will provide the first three words and search the model's reponse for the fourth. Keep in mind that the output of the model is not a single word as the exemples above, it is an array of probabilities, and each component of the array will be accounted for during the accuracy measure, this will also be done by gensim.

In [12]:
# Download test analogies
analogies_uri = 'https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt'
analogies_file_name = 'questions-words.txt'

if not path.exists(analogies_file_name):
    try:
        resp = urlopen(analogies_uri)
        file = open(analogies_file_name, 'wb')
        file.write(resp.read())
        file.close()
    except:
        if path.exists(analogies_file_name):
            remove(analogies_uri)

In [13]:
def test_model(sg, window, vector_size, epochs, corpus_size):
    model = get_model(corpus_size, sg, window, vector_size, epochs)
    analogy_scores = model.evaluate_word_analogies(datapath(analogies_file_name))
    return analogy_scores[0]

The following script will be renspondible for performing the test for each model and saving the results to a dataframe (it even has a nice progress bar). This also takes quite a lot of time, so we will save the dataframes to a csv file so we can load it later.

In [14]:
results_file_name = 'accuracy.csv'
accuracy_df = pd.DataFrame(columns=['algorithm', 'window', 'vector size', 'epochs', 'corpus size', 'accuracy'])

if not path.exists(results_file_name):
    total = len(params)
    
    # Skipgram
    done = 0
    print('Skipgram test progress:')
    for param in params:
        accuracy = test_model(sg=1, window=param[0], vector_size=param[1], epochs=param[2], corpus_size=param[3])
        data = {
            'algorithm': 'skipgram',
            'window': param[0],
            'vector size': param[1],
            'epochs': param[2],
            'corpus size': param[3],
            'accuracy': accuracy
        }
        accuracy_df = accuracy_df.append(data, ignore_index=True)
        done = done + 1
        
        progress = math.floor(20*(done/total))
        sys.stdout.write('\r')
        sys.stdout.write("[%-20s] %d%%" % ('='*progress, 5*progress))
        sys.stdout.flush()

    # Cbow
    done = 0
    print('\nCbow test progress:')
    for param in params:
        accuracy = test_model(sg=0, window=param[0], vector_size=param[1], epochs=param[2], corpus_size=param[3])
        data = {
            'algorithm': 'cbow',
            'window': param[0],
            'vector size': param[1],
            'epochs': param[2],
            'corpus size': param[3],
            'accuracy': accuracy
        }
        accuracy_df = accuracy_df.append(data, ignore_index=True)
        done = done + 1
        
        progress = math.floor(20*(done/total))
        sys.stdout.write('\r')
        sys.stdout.write("[%-20s] %d%%" % ('='*progress, 5*progress))
        sys.stdout.flush()
        
    # Save results
    results_file = open(results_file_name, 'w')
    results_file.write(accuracy_df.to_csv())
    results_file.close()
else:
    accuracy_df = pd.read_csv(results_file_name)

Skipgram test progress:
Cbow test progress:

You can see below the best achieved results for each algorithm. Skipgram had a maximum of 24.5% of accuracy rate while cbow only achieved about 8% accuracy.

In [15]:
print('Maximum skipgram accuracy: {}'.format(accuracy_df[accuracy_df.algorithm == 'skipgram'].accuracy.max()))
print('Maximum cbow accuracy: {}'.format(accuracy_df[accuracy_df.algorithm == 'cbow'].accuracy.max()))

Maximum skipgram accuracy: 0.24380325329202168
Maximum cbow accuracy: 0.08591531112832429
