## Introducing the royal family of vectors
### A Basic Illustrative Example of Word2Vec

#### Introduction
This notebook runs through a basic example of how to vectorize words from a pre-existing corpus, illustrating the vector algebra at the core of one possible use of Word2Vec for social scientists. To do this, I rely on Gensim, an open-source topic modelling library that can process raw, unstructured plain text using unsupervised machine learning algorithms. Word2Vec is one of the algorithms included in Gensim (others include Latent Semantic Indexing and Latent Dirichlet Allocation). The virtual environment needed to run this code is available on GitHub (gensimEnv.yml).

#### Getting Started - From text to tokens

In [25]:
#Importing Word2Vec and the dataset downloader from Gensim
from gensim.models.word2vec import Word2Vec
import gensim.downloader as api

Once Word2Vec has been imported from Gensim, we can begin to vectorize words. In this example, the corpus used to vectorize words is a pre-processed corpus from the Gensim database, called "text8". Text8 consists of the first 100,000,000 bytes of plain text from Wikipedia, of which I use only a subset to speed up computation.

In [26]:
#Downloading an existing corpus from the Gensim database - Text from Wikipedia
dataset = api.load("text8")


The next step in the vectorization process is to create a list of words from the unstructured plain text file, a process known as tokenization. Strictly speaking, tokenization produces a list of lists: each inner list is a sequence of words that the model treats as one sentence, and the outer list holding all of these sequences forms the corpus. The "text8" file has already been processed (the next demonstration runs through basic text processing, standard in natural language processing), so it can simply be subsetted before being used to vectorize words with the Word2Vec algorithm. As previously mentioned, I subset the list to speed up computation, given the size of the corpus.

In [27]:
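#Collecting the (already tokenized) corpus into a list of token lists
data = list(dataset)
#Keeping only the first 1000 token lists to speed up training
data_part1 = data[:1000]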
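
To make the list-of-lists structure concrete, here is a minimal sketch on a made-up two-sentence string. The sample text, the variable names and the naive split-based tokenizer are all illustrative assumptions. "text8" arrives already tokenized, so none of this is needed for the corpus used in this notebook.

```python
# Hypothetical raw text, used only to illustrate the data structure.
raw_text = "the king rules the realm. the queen rules the court."

# Naive tokenization: split into sentences on ".", then into words on whitespace.
toy_corpus = [
    sentence.split()
    for sentence in raw_text.split(".")
    if sentence.strip()  # drop the empty string left after the final "."
]

# Subsetting the corpus is ordinary list slicing, as with data[:1000] above.
toy_subset = toy_corpus[:1]

print(toy_corpus)
print(toy_subset)
```

Each inner list is one "sentence" and the outer list is the corpus, which is the shape that the Word2Vec model expects as input.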

#### Training the Word2Vec model - From tokens to vectors

Now that we have "tokenized" our text, we are ready to vectorize words using the Word2Vec algorithm in Gensim. The default Word2Vec model is the Continuous Bag of Words (CBOW) model (see presentation for details), but it is trivial to switch to Skip-gram (for instructions and additional details on the option ```sg```, see [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)). Skip-gram is reported to train more slowly than CBOW, but to perform better when the training corpus is small (see more details [here](https://arxiv.org/pdf/1301.3781.pdf)). In the code below, ```min_count``` tells the Word2Vec model to ignore words whose frequency is below the specified value (2, in this case), which should improve the model by removing rare words. The basic idea is that the vector of a word that rarely appears in the training corpus will not represent its meaning accurately. Further options can be added, such as specifying a desired vector dimension or the way in which vectors are computed, but these are beyond the scope of this basic example (more details [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)).

In [28]:
#Training the Word2Vec model with text from Wikipedia (in the gensim database)
word2vec = Word2Vec(data_part1, min_count=2)
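
The difference between CBOW and Skip-gram mentioned above can be sketched without Gensim. The snippet below only builds the training examples each model would learn from on one made-up sentence: CBOW predicts a word from its surrounding words, while Skip-gram predicts each surrounding word from the word in the middle. The helper `training_examples`, the fixed window of 2 and the sentence are assumptions for illustration. Gensim's actual implementation adds further details that are ignored here.

```python
def training_examples(tokens, window=2):
    """Yield (target, context_words) for every position in a token list."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


# Hypothetical sentence, used only for illustration.
sentence = ["the", "king", "rules", "the", "realm"]

# CBOW: the surrounding words jointly predict the word in the middle.
cbow_examples = [
    (context, target) for target, context in training_examples(sentence)
]

# Skip-gram: the middle word predicts each surrounding word separately.
skipgram_examples = [
    (target, context_word)
    for target, context in training_examples(sentence)
    for context_word in context
]

print(cbow_examples[1])       # the example whose target is "king"
print(skipgram_examples[:3])  # the first three (input, output) pairs
```

Skip-gram produces one training pair per context word rather than one per position, which is consistent with it being the slower of the two to train.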

#### Visualizing the Word2Vec model - Words as Vectors
We have now vectorized a set of words using their semantic context in a sample of Wikipedia articles. Obviously, these articles do not contain every word in the English language. In fact, we have "only" vectorized 100698 words, and this constitutes the extent of our vocabulary.


In [22]:
vocab = len(word2vec.wv)
print(vocab)

100698
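
The vocabulary size printed above depends on both the corpus and ```min_count```: only words that occur at least ```min_count``` times receive a vector. The sketch below mimics that filtering on a made-up corpus with `collections.Counter`. It is an approximation of the idea under these assumptions, not Gensim's internal code, and the toy corpus is hypothetical.

```python
from collections import Counter

# Hypothetical tokenized corpus (a list of token lists).
toy_corpus = [
    ["the", "king", "rules", "the", "realm"],
    ["the", "queen", "rules", "the", "court"],
]
min_count = 2

# Count how often each word occurs across the whole corpus.
word_counts = Counter(word for sentence in toy_corpus for word in sentence)

# Keep only the words that meet the frequency threshold.
toy_vocab = sorted(
    word for word, count in word_counts.items() if count >= min_count
)

print(toy_vocab)
print(len(toy_vocab))
```

Here "king" and "queen" each occur only once, so they would be dropped. This is why a rare word can be missing from the vocabulary of a model trained on a small subset.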


As you can imagine, the larger your "training" corpus (equivalent to our "text8" in this example), the more words you can vectorize and, generally, the more accurate your model. More text means more semantic context from which the model can learn, and therefore vectors that more accurately represent the meaning of each word within your chosen text(s). If you are using Word2Vec for semantic analysis, the choice of training corpus should be dictated by the type of analysis you wish to carry out (a more in-depth discussion [here](https://kbpedia.org/use-cases/document-specific-word2vec-training-corpuses/)). The vector for "king" can be displayed as follows. It is a 100-dimensional vector (vector dimensions can be changed in the options for the Word2Vec model, more details [here](https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)).

In [24]:
#Displaying the vector for the word "King"
word2vec.wv['king']

array([-0.6247059 ,  2.734628  ,  0.07881403,  0.8415265 ,  1.0919182 ,
       -1.1927114 , -0.15348631,  0.31121325,  1.2566534 , -0.67269063,
        0.5604339 ,  2.6822858 ,  2.854661  ,  1.7861245 ,  3.8821514 ,
       -1.1519864 ,  1.8963438 , -0.42539087,  2.2171638 , -0.6451796 ,
        1.3156921 , -0.77337134,  2.1716301 ,  3.0620973 ,  2.2385871 ,
        1.9413279 ,  0.76725775, -1.1715403 , -1.7636768 ,  0.4159524 ,
       -0.37398598, -2.7588546 ,  0.58317375, -1.5069119 ,  1.7781116 ,
        1.7118299 , -0.42223325, -1.4407419 ,  1.7727789 ,  1.5959058 ,
       -0.94584805, -0.01704919, -0.263002  ,  2.6214325 ,  0.9325347 ,
       -1.944263  , -3.248921  , -1.9091796 ,  1.0752041 ,  0.41822544,
        3.2948973 ,  0.09252111, -3.8474638 , -2.503489  , -0.3280734 ,
        0.8311117 ,  3.3224537 ,  0.52778506,  0.30486453,  1.3455604 ,
        0.24585572,  0.9263525 ,  1.793085  , -1.0099862 ,  0.3908879 ,
       -1.2896887 , -1.3128164 , -0.11316916, -0.41064888, -1.25

#### Semantic Similarity - From Vectors to Vector Algebra

Given that we now have a set of words represented in vector format, we can use the cosine similarity between these vectors as a measure of semantic similarity. Following the premise in the presentation, the similarity between "king", "queen", "male" and "female" can be computed as follows. As expected, "king" and "female" are less semantically similar than "king" and "male", although both similarities are close to zero in this training subset.

In [38]:
#Cosine Similarity between King and Queen
word2vec.wv.similarity('king', 'queen')

0.71888465

In [39]:
#Cosine Similarity between King and Female
word2vec.wv.similarity('king', 'female')

0.04135818

In [40]:
#Cosine Similarity between King and Male
word2vec.wv.similarity('king', 'male')

0.070132636

In [41]:
#Cosine Similarity between Male and Female 
word2vec.wv.similarity('male', 'female')

0.92445123
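
The similarity scores above are cosine similarities: the dot product of two vectors divided by the product of their lengths. The sketch below computes this by hand and then illustrates the kind of vector algebra mentioned in the introduction. The two-dimensional vectors are invented for illustration (one axis loosely standing for "royalty", the other for "gender") and are not taken from the trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical 2-dimensional vectors: [royalty, gender].
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "male": [0.0, 1.0],
    "female": [0.0, -1.0],
}

# Vector algebra: king - male + female, computed component by component.
analogy = [
    k - m + f
    for k, m, f in zip(
        toy_vectors["king"], toy_vectors["male"], toy_vectors["female"]
    )
]

# The word whose vector points in the direction closest to the result.
closest = max(
    toy_vectors, key=lambda word: cosine_similarity(analogy, toy_vectors[word])
)

print(cosine_similarity(toy_vectors["king"], toy_vectors["male"]))
print(analogy, closest)
```

With the trained model, a comparable query would be `word2vec.wv.most_similar(positive=['king', 'female'], negative=['male'])`. What it returns depends on the training subset, so no result is shown here.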

#### Conclusion 

This notebook has covered the basics of Word2Vec, showing how to vectorize words from a pre-processed sample of Wikipedia articles. The code was adapted from the following [tutorial](https://www.machinelearningplus.com/nlp/gensim-tutorial/#14howtotrainword2vecmodelusinggensim).