# Word2Vec

Word2Vec is a technique for learning word embeddings suitable for natural language processing tasks such as finding synonyms, completing analogies and suggesting missing words in a sentence. It uses a neural network to learn from a large text (or corpus) and build an n-dimensional space containing one vector for each word in the training vocabulary.

This technique is relatively recent, first published by a team of Google researchers in 2013. Its two main algorithms are skip-gram (skipgram) and continuous bag-of-words (cbow). In both cases, a set of hyperparameters defines how our language model is trained, such as the number of features in the word embedding (the dimensionality of each word vector), the number of epochs and the size of the context window.
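To make the role of the context window concrete, here is a minimal sketch of how each algorithm could turn a sentence into training examples. The function names are hypothetical and are not part of gensim. Real implementations add refinements that this sketch ignores, such as randomly shrinking the window.

```python
def context_window(tokens, index, window):
    """Return the words within `window` positions of tokens[index], excluding the word itself."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def skipgram_pairs(tokens, window):
    """Skip-gram: the center word predicts each neighbour, one (center, context) pair at a time."""
    return [(center, ctx)
            for i, center in enumerate(tokens)
            for ctx in context_window(tokens, i, window)]


def cbow_examples(tokens, window):
    """CBOW: all neighbours together predict the center word, one example per position."""
    return [(context_window(tokens, i, window), center)
            for i, center in enumerate(tokens)]


tokens = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(tokens, window=1))
print(cbow_examples(tokens, window=1))
```

A larger window produces more pairs per word, which is one reason the window size affects both training time and the kind of relations the vectors capture.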

The objective of this notebook is to explore how varying these hyperparameters affects the accuracy of our model. The implementation below uses gensim. It covers the training and validation of various models; however, **it does not cover any kind of sophisticated text lemmatization**, since we are using an already normalized corpus. Needless to say, that is not the real-world scenario in most cases, so any practical implementation of Word2Vec should take this into account.

In [1]:
import math
import sys
import pandas as pd
import itertools
from time import time, sleep
import multiprocessing

# File management
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
from os import path, remove

# NLP
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
from gensim.test.utils import datapath
import spacy
import nltk
nltk.download('stopwords')

# Data analysis
from matplotlib import pyplot as plt

[nltk_data] Downloading package stopwords to /Users/eem/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
def progress_bar(done = 0, total = 100):
    progress = 20*(done/total)
    sys.stdout.write('\r')
    sys.stdout.write("[%-20s] %d%%" % ('='*math.floor(progress), 5*progress))
    sys.stdout.flush()
    if done == total:
        print('')

## Downloading the corpus

The first thing we have to do is download the corpus we will use to train our model. Here we use text8, a text of lowercase, unpunctuated English words that gensim supports out of the box. The file is about 100MB, so it is not included in the repository; nevertheless, the code below should take care of the setup.

In [3]:
corpus_uri = 'http://mattmahoney.net/dc/text8.zip'
target_name = './corpus.txt'
corpus_language = 'english'

if not path.exists(target_name):
    try:
        # Download the archive into memory and extract its single member
        with urlopen(corpus_uri) as resp:
            archive = ZipFile(BytesIO(resp.read()))
        with archive.open(archive.namelist()[0]) as source, open(target_name, 'w') as target_file:
            for line in source:
                target_file.write(line.decode('utf-8'))
    except Exception:
        # Do not leave a partial file behind, then surface the original error
        if path.exists(target_name):
            remove(target_name)
        raise

with open(target_name, 'r') as corpus:
    content = corpus.read()
print(content[:1024])

 anarchism originated as a term of abuse first used against early working class radicals including the diggers of the english revolution and the sans culottes of the french revolution whilst the term is still used in a pejorative way to describe any act that used violent means to destroy the organization of society it has also been taken up as a positive label by self defined anarchists the word anarchism is derived from the greek without archons ruler chief king anarchism as a political philosophy is the belief that rulers are unnecessary and should be abolished although there are differing interpretations of what this means anarchism also refers to related social movements that advocate the elimination of authoritarian institutions particularly the state the word anarchy as most anarchists use it does not imply chaos nihilism or anomie but rather a harmonious anti authoritarian society in place of what are regarded as authoritarian political structures and coercive economic instituti

## Preprocessing

Our text is already mostly normalized, so all we do is break it into sentences (using gensim's built-in Text8Corpus reader) and remove stopwords using NLTK's list of English stopwords.

In [4]:
# Split text into sentences
stopwords = set(nltk.corpus.stopwords.words('english'))  # a set makes membership tests fast
sentences = list(Text8Corpus(target_name))
# Drop stopwords from every sentence
sentences = [[word for word in sentence if word not in stopwords] for sentence in sentences]
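
The text8 file has no punctuation, so the sentences produced above are really consecutive runs of words rather than grammatical sentences. The sketch below illustrates the same two steps on a tiny input. The helper names and the stopword list are made up for illustration, and this is not gensim's or NLTK's implementation.

```python
def chunk_words(words, max_len):
    """Split a flat list of words into consecutive chunks of at most max_len words."""
    return [words[i:i + max_len] for i in range(0, len(words), max_len)]


def drop_stopwords(sentences, stopwords):
    """Remove every stopword from every sentence."""
    stopwords = set(stopwords)  # set lookup instead of scanning a list
    return [[word for word in sentence if word not in stopwords]
            for sentence in sentences]


text = "anarchism originated as a term of abuse first used against early working class radicals"
toy_stopwords = ["as", "a", "of", "against"]  # hypothetical list, much shorter than NLTK's

chunks = chunk_words(text.split(), max_len=5)
cleaned = drop_stopwords(chunks, toy_stopwords)
for sentence in cleaned:
    print(sentence)
```

Note that removing stopwords after chunking leaves sentences of uneven length, which is harmless for training.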

## Creating and training the language model

The following functions build the vocabulary and train our models. We train one model for every combination of hyperparameters and for each of the two algorithms, so the number of models is twice the size of the parameter grid. Each model takes a few minutes to train, so it would be very time consuming to repeat the process on every run. Fortunately, gensim allows us to save models to files, so we only have to train each model once.

The code therefore checks whether a model already exists for a given set of hyperparameters; if so, training is skipped and the saved model is loaded from its file when needed.

In [5]:
def build_vocabulary(model, sentences):
    t = time()
    model.build_vocab(sentences, progress_per=10000)
    return time() - t

def train(model, sentences, epochs, corpus_size):
    t = time()
    model.train(sentences, total_examples=corpus_size, epochs=epochs, report_delay=1)
    return time() - t

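def build_model(sentences, min_count, window, vector_size, alpha, epochs, sg, corpus_size, model_name):
    from os import makedirs  # local import: only `path` and `remove` are imported at the top
    # corpus_size == 0 means "use the whole corpus"; otherwise keep only the first corpus_size sentences
    if corpus_size:
        sentences = sentences[:corpus_size]
    cores = multiprocessing.cpu_count()
    # Use the alpha argument instead of a hard-coded value, and always keep at least one worker
    model = Word2Vec(min_count=min_count, window=window, vector_size=vector_size, alpha=alpha, workers=max(1, cores-1), sg=sg)
    
    t = build_vocabulary(model, sentences)
    # model.corpus_count is only populated once the vocabulary has been built
    t = t + train(model, sentences, epochs, model.corpus_count)
    
    # Make sure the output directory exists before saving
    makedirs(path.dirname(model_name), exist_ok=True)
    model.save(model_name)
    return t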
    
def get_model_name(sg, window, vector_size, epochs, corpus_size, min_count):
    return "./models/{sg}-{window}-{vector_size}-{epochs}-{corpus_size}-{min_count}.model".format(
        sg = 'skipgram' if sg == 1 else 'cbow',
        window = window,
        vector_size = vector_size,
        epochs = epochs,
        corpus_size = corpus_size,
        min_count = min_count
    )
    
def build_if_not_exists(sentences, sg=1, window=2, vector_size=100, epochs=30, corpus_size=0, min_count=50):
    model_name = get_model_name(sg, window, vector_size, epochs, corpus_size, min_count)
    
    if not path.isfile(model_name):
        return build_model(
            sentences = sentences,
            min_count = min_count,
            window = window,
            vector_size = vector_size,
            alpha = 0.001,
            epochs = epochs,
            sg = sg,
            corpus_size = corpus_size,
            model_name = model_name
        )
    else:
        return 0
    
def get_model(corpus_size, sg=1, window=2, vector_size=100, epochs=30, min_count=50):
    model_name = get_model_name(sg, window, vector_size, epochs, corpus_size, min_count)
    
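    if path.isfile(model_name):
        return Word2Vec.load(model_name).wv
    else:
        # Fail loudly instead of returning None, which would only break later
        raise FileNotFoundError('Model not trained: {}'.format(model_name))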

### Parameters

The hyperparameters to be tuned are:

1. vector_sizes, which indicates the number of features in each word embedding of the model
2. windows, which refers to the size of the context used to evaluate the sentences during training
3. corpus_sizes, which is the number of sentences from our corpus that will be considered for training
4. epochs, the number of epochs performed during training
5. min_counts, the minimum number of times a word must occur in the corpus to be kept in the vocabulary

Every value of each parameter is combined with every value of the others. With the lists below (three values for windows, epochs and min_counts, two for vector_sizes and corpus_sizes), this yields 3 × 2 × 3 × 2 × 3 = 108 combinations.

In [6]:
vector_sizes = [50, 300]
windows = [2, 5, 10]
min_counts = [10, 30, 50]
corpus_sizes = [math.floor(len(sentences)*.5), len(sentences)]
epochs = [10, 20, 30]

params = [list(i) for i in itertools.product(windows, vector_sizes, epochs, corpus_sizes, min_counts)]
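
As a quick sanity check, the size of such a grid is the product of the lengths of the value lists. The sketch below mirrors the lists from the cell above, except for `corpus_fraction`, which is a hypothetical stand-in for the two corpus sizes so that the example runs without the corpus.

```python
import itertools
import math

grid = {
    "window": [2, 5, 10],
    "vector_size": [50, 300],
    "epochs": [10, 20, 30],
    "corpus_fraction": [0.5, 1.0],  # stand-in for the two corpus sizes
    "min_count": [10, 30, 50],
}

# Cartesian product: every value of each parameter paired with every value of the others
combinations = list(itertools.product(*grid.values()))

# The number of combinations equals the product of the list lengths
expected = math.prod(len(values) for values in grid.values())
print(len(combinations), expected)
```

Because the count grows multiplicatively, adding a single value to one list adds a whole slice of models to train.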

In [9]:
def train_with_params(sentences, algorithm, params):
    done = 0
    total = len(params)
    t = 0
    algorithm_name = 'skipgram' if algorithm == 1 else 'cbow'
    print('Training {} models:'.format(algorithm_name))
    progress_bar(done, total)

    for param in params:
        t = t + build_if_not_exists(sentences, algorithm, param[0], param[1], param[2], param[3], param[4])
        done = done + 1
        progress_bar(done, total)

    minutes = math.floor(t/60)
    hours = math.floor(minutes/60)
    minutes = minutes % 60
    print('Took {} hours and {} minutes to make {} {} models'.format(hours, minutes, done, algorithm_name))

### Skipgram

In [10]:
train_with_params(sentences, 1, params)

Training skipgram models:
Took 0 hours and 1 minutes to make 1 skipgram models


### Cbow

In [None]:
train_with_params(sentences, 0, params)

## Analogies

With our model trained, we can query it for analogous words. Take the pair germany and berlin: there is a clear relation between them. If we provide a third word, we expect the model to return a fourth word that bears the same relation to the third as the second does to the first. For example, given france, a reasonable answer is paris, since paris is to france as berlin is to germany.

This is shown below.

In [None]:
def analogy(model, word, is_to, as_word):
    result = model.most_similar(negative=[word], positive=[is_to, as_word])
    return result[0][0]
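
The query in `analogy` amounts to vector arithmetic: subtract the vector of the first word, add the other two, and look for the closest remaining word. Here is a minimal sketch of that idea with made-up three-dimensional vectors. The values and the `toy_analogy` helper are assumptions for illustration, and gensim's implementation differs in details such as vector normalisation.

```python
import math

# Made-up embeddings: the axes loosely mean "german", "french", "capital city"
toy_vectors = {
    "germany": [1.0, 0.0, 0.0],
    "berlin":  [1.0, 0.0, 1.0],
    "france":  [0.0, 1.0, 0.0],
    "paris":   [0.0, 1.0, 1.0],
    "river":   [0.3, 0.3, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_analogy(vectors, word, is_to, as_word):
    """Return the word closest to vec(is_to) - vec(word) + vec(as_word)."""
    target = [b - a + c for a, b, c in zip(vectors[word], vectors[is_to], vectors[as_word])]
    # The query words themselves are excluded from the candidates
    candidates = [w for w in vectors if w not in (word, is_to, as_word)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(toy_analogy(toy_vectors, "germany", "berlin", "france"))
```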

In [None]:
model = get_model(corpus_size=len(sentences), sg=1, window=5, vector_size=100, epochs=20, min_count=30)
print('germany is to berlin as france is to: {}'.format(analogy(model, 'germany', 'berlin', 'france')))
print('star is to sun as planet is to: {}'.format(analogy(model, 'star', 'sun', 'planet')))
print('man is to king as woman is to: {}'.format(analogy(model, 'man', 'king', 'woman')))
print('teacher is to school as nurse is to: {}'.format(analogy(model, 'teacher', 'school', 'nurse')))
print('frederick is to king as elizabeth is to: {}'.format(analogy(model, 'frederick', 'king', 'elizabeth')))
# print('feline is to cat as canine is to: {}'.format(analogy(model, 'feline', 'cat', 'canine')))
print('car is to road as boat is to: {}'.format(analogy(model, 'car', 'road', 'boat')))
print('fast is to faster as easy is to: {}'.format(analogy(model, 'fast', 'faster', 'easy')))
print('small is to big as good is to: {}'.format(analogy(model, 'small', 'big', 'good')))

Of course, it does not always work...

In [None]:
print('hat is to head as shirt is to: {}'.format(analogy(model, 'hat', 'head', 'shirt')))

We can use this characteristic to measure the accuracy of our model. Given a collection of pre-established analogies, we provide the first three words and check the model's response for the fourth. Keep in mind that the model's output is not a single word as in the examples above: it is a ranked list of candidate words with their similarity scores. gensim takes care of comparing this output against the expected word when computing the accuracy.

In [None]:
# Download test analogies
analogies_uri = 'https://raw.githubusercontent.com/nicholas-leonard/word2vec/master/questions-words.txt'
analogies_file_name = 'questions-words.txt'

if not path.exists(analogies_file_name):
    try:
        with urlopen(analogies_uri) as resp, open(analogies_file_name, 'wb') as file:
            file.write(resp.read())
    except Exception:
        # Remove the partially written local file (not the URI) and surface the error
        if path.exists(analogies_file_name):
            remove(analogies_file_name)
        raise

In [None]:
def test_model(sg, window, vector_size, epochs, corpus_size, min_count):
    model = get_model(corpus_size, sg, window, vector_size, epochs, min_count)
    # Evaluate against the analogies file downloaded above
    analogy_scores = model.evaluate_word_analogies(analogies_file_name)
    return analogy_scores[0]
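
To give an idea of what such an evaluation involves, here is a minimal sketch that parses analogy questions and scores a predictor. It assumes the file layout of `questions-words.txt` (section headers starting with `:` followed by lines of four words). The function names, the toy predictor and the choice to skip questions with unknown words are assumptions, not gensim's code.

```python
def parse_analogies(lines):
    """Yield (a, b, c, expected) tuples, skipping ': section' headers and blank lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(":"):
            continue
        a, b, c, expected = line.lower().split()
        yield a, b, c, expected


def analogy_accuracy(questions, predict, vocabulary):
    """Fraction of in-vocabulary questions for which predict(a, b, c) equals the expected word."""
    correct = total = 0
    for a, b, c, expected in questions:
        if not {a, b, c, expected} <= vocabulary:
            continue  # skip questions containing words the model never saw
        total += 1
        correct += predict(a, b, c) == expected
    return correct / total if total else 0.0


sample = [
    ": capital-common-countries",
    "Berlin Germany Paris France",
    "Berlin Germany Rome Italy",
    "Berlin Germany Tokyo Japan",
]

# Hypothetical predictor with canned answers, standing in for a trained model
answers = {("berlin", "germany", "paris"): "france",
           ("berlin", "germany", "rome"): "spain"}


def toy_predict(a, b, c):
    return answers.get((a, b, c))


vocabulary = {"berlin", "germany", "paris", "france", "rome", "italy", "spain"}
accuracy = analogy_accuracy(parse_analogies(sample), toy_predict, vocabulary)
print(accuracy)
```

Skipping out-of-vocabulary questions matters here: a high min_count shrinks the vocabulary, so fewer questions are scored and accuracies of different models are not computed over exactly the same set.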

The following script is responsible for testing each model and saving the results to a dataframe (it even has a nice progress bar). This also takes quite a lot of time, so we save the dataframe to a csv file that we can load later.

In [None]:
results_file_name = 'accuracy.csv'
columns = ['algorithm', 'window', 'vector size', 'epochs', 'corpus size', 'min count', 'accuracy']

if not path.exists(results_file_name):
    total = len(params)
    rows = []

    # Evaluate every trained model, one algorithm at a time
    for sg, algorithm_name in [(1, 'skipgram'), (0, 'cbow')]:
        done = 0
        print('{} test progress:'.format(algorithm_name.capitalize()))
        progress_bar(done, total)
        for param in params:
            accuracy = test_model(sg=sg, window=param[0], vector_size=param[1], epochs=param[2], corpus_size=param[3], min_count=param[4])
            rows.append({
                'algorithm': algorithm_name,
                'window': param[0],
                'vector size': param[1],
                'epochs': param[2],
                'corpus size': param[3],
                'min count': param[4],
                'accuracy': accuracy
            })
            done = done + 1
            progress_bar(done, total)

    # DataFrame.append was removed in pandas 2.0, so build the frame from the collected rows
    accuracy_df = pd.DataFrame(rows, columns=columns)

    # Save results (without the index, so reloading yields the same columns)
    accuracy_df.to_csv(results_file_name, index=False)
else:
    accuracy_df = pd.read_csv(results_file_name)

You can see below the best results achieved by each algorithm. Skipgram reached a maximum accuracy of 24.5%, while cbow only achieved about 8%.

In [None]:
print('Maximum skipgram accuracy: {}'.format(accuracy_df[accuracy_df.algorithm == 'skipgram'].accuracy.max()))
print('Maximum cbow accuracy: {}'.format(accuracy_df[accuracy_df.algorithm == 'cbow'].accuracy.max()))

## Hyperparameter study

With our accuracy data gathered, the figures below show how each parameter affected the average accuracy obtained.

In [None]:
def plot_influence(param, dataframe):
    # Average only the accuracy column for each value of the parameter
    skipgram_data = dataframe[dataframe.algorithm == 'skipgram'].groupby(param)['accuracy'].mean()
    cbow_data = dataframe[dataframe.algorithm == 'cbow'].groupby(param)['accuracy'].mean()
    
    plt.title('Influence of {} in model accuracy'.format(param))
    plt.xlabel(param)
    plt.ylabel('accuracy')
    plt.plot(cbow_data, label='cbow')
    # Use the group index so the x values line up with the averaged accuracies
    plt.scatter(x=cbow_data.index, y=cbow_data)
    plt.plot(skipgram_data, label='skipgram', color='red')
    plt.scatter(x=skipgram_data.index, y=skipgram_data, color='red')
    plt.legend()
    
def parameter_variance(params, dataframe):
    plt.title('Variance of accuracy for each parameter')
    plt.ylabel('variance of mean accuracy')
    # Variance of the per-value mean accuracy, one entry per parameter
    variances = [dataframe.groupby(param)['accuracy'].mean().var() for param in params]
        
    plt.bar(range(len(params)), variances, tick_label=params)

In [None]:
plt.figure(figsize=(10, 8), dpi=100)
# Separate name so the training grid stored in `params` is not overwritten
studied_params = ['epochs', 'window', 'vector size', 'corpus size']

for i in range(len(studied_params)):
    plt.subplot(2, 2, i+1)
    plot_influence(studied_params[i], accuracy_df)

plt.subplots_adjust(hspace = .4)
plt.show()

As we can see, for both algorithms the number of epochs and the size of the context window had a much greater impact on overall accuracy, while vector size and corpus size produced a smaller variation in the results. We can confirm this tendency with the following plot of the variance for each parameter.

Moreover, as the vector size increases, the average accuracy decreases for the cbow algorithm. This could be an indicator of overfitting, as we try to extract a large number of features from a limited corpus.

In [None]:
plt.figure(figsize=(10, 4), dpi=100)
parameter_variance(studied_params, accuracy_df)
plt.show()
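
The quantity behind this bar chart can be sketched without pandas: group the results by one parameter, average the accuracy within each group, and take the variance of those averages. The rows below are made-up numbers used only to show the computation, and `variance_of_group_means` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean, variance

# Made-up accuracies for a 2 x 2 grid, not results from this notebook
rows = [
    {"epochs": 10, "window": 2, "accuracy": 0.10},
    {"epochs": 10, "window": 5, "accuracy": 0.12},
    {"epochs": 30, "window": 2, "accuracy": 0.20},
    {"epochs": 30, "window": 5, "accuracy": 0.22},
]


def variance_of_group_means(rows, param):
    """Sample variance of the mean accuracy observed for each value of `param`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[param]].append(row["accuracy"])
    group_means = [mean(values) for values in groups.values()]
    return variance(group_means)


for param in ("epochs", "window"):
    print(param, variance_of_group_means(rows, param))
```

In this toy grid the accuracy moves far more with the number of epochs than with the window, so epochs gets the larger variance. This is the sense in which a taller bar marks a more influential parameter.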

With this data, we can prioritize the variation of the most relevant params and set a constant value for the remaining ones.

For the skipgram algorithm, we will use a vector size of 100 and our entire corpus, with values between 30 and 70 for the min count, between 20 and 60 for the number of epochs, and window sizes of 5 and 10. We will do the same for the cbow algorithm, except that we will use a vector size of 50 instead of 100, since it gave us better accuracy.

In [None]:
vector_sizes = [100]
windows = [5, 10]
min_counts = [30, 50, 70]
corpus_sizes = [len(sentences)]
epochs = [20, 30, 60]
skipgram_params = [list(i) for i in itertools.product(windows, vector_sizes, epochs, corpus_sizes, min_counts)]

train_with_params(sentences, 1, skipgram_params)

In [None]:
vector_sizes = [50]
windows = [5, 10]
min_counts = [30, 50, 70]
corpus_sizes = [len(sentences)]
epochs = [20, 30, 60]
cbow_params = [list(i) for i in itertools.product(windows, vector_sizes, epochs, corpus_sizes, min_counts)]

train_with_params(sentences, 0, cbow_params)