# Gensim Word2Vec Tutorial

A Gensim tutorial on training a word2vec model on biomedical text and exploring the result

Based on <http://mccormickml.com/2016/04/27/word2vec-resources/>



# Plan

1. [Briefing about Word2Vec](#Briefing-about-Word2Vec:)
    * [Purpose of the tutorial](#Purpose-of-the-tutorial:)
    * [Brief explanation](#Brief-explanation:)

2. [Getting Started](#Getting-Started)
    * [Setting up the environment](#Setting-up-the-environment:)
    * [The data](#The-data:)
3. [Preprocessing](#Preprocessing)
    * [Cleaning](#Cleaning)
    * [Bigrams](#Bigrams)
    * [Most frequent words](#Most-Frequent-Words)
    
4. [Training the Model](#Training-the-model)
    * [Gensim Word2Vec Implementation](#Gensim-Word2Vec-Implementation:)
    * [Why I seperate the training of the model in 3 steps](#Why-I-seperate-the-training-of-the-model-in-3-steps:)
    * [Training the model](#Training-the-model)
        * [The parameters](#The-parameters)
        * [Building the vocabulary table](#Building-the-Vocabulary-Table)
        * [Training of the model](#Training-of-the-model)
        * [Saving the model](#Saving-the-model:)
5. [Exploring the Model](#Exploring-the-model)
    * [Most similar to](#Most-similar-to:)
    * [Similarities](#Similarities:)
    * [Odd-one-out](#Odd-One-Out:)
    * [Analogy difference](#Analogy-difference:)
    * [t-SNE visualizations](#t-SNE-visualizations:)
        * [10 Most similar words vs. 8 Random words](#10-Most-similar-words-vs.-8-Random-words:)
        * [10 Most similar words vs. 10 Most dissimilar](#10-Most-similar-words-vs.-10-Most-dissimilar:)
        * [10 Most similar words vs. 11th to 20th Most similar words](#10-Most-similar-words-vs.-11th-to-20th-Most-similar-words:)
6. [Final Thoughts](#Final-Thoughts)
7. [Material for more in depths understanding](#Material-for-more-in-depths-understanding:)
8. [Acknowledgements](#Acknowledgements)
9. [References](#References:)
10. [End](#End)

# Briefing about Word2Vec:

<img src="http://mccormickml.com/assets/word2vec/skip_gram_net_arch.png" alt="drawing" width="550"/>

[[1]](#References:)


## Purpose of the tutorial:
This tutorial focuses on the correct use of the Word2Vec package from the Gensim library; therefore, I am not going to explain the concepts and ideas behind Word2Vec in detail here. I am simply going to give a very brief explanation and provide you with links to good, in-depth tutorials.

## Brief explanation:

Word2Vec was introduced in two [papers](#Material-for-more-in-depths-understanding:) published in 2013 by a team of researchers at Google. Along with the papers, the researchers published their implementation in C. The Python implementation followed soon after the first paper, in [Gensim](https://radimrehurek.com/gensim/index.html).

The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: "dog", "puppy" and "pup" are often used in similar situations, with similar surrounding words like "good", "fluffy" or "cute", and according to Word2Vec they will therefore share a similar vector representation.<br>
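
To make "context" concrete, here is a minimal sketch of how (target, context) pairs can be read off a sentence with a fixed window. The helper `context_pairs` and the toy sentence are my own illustration, not part of Gensim, and real implementations add details (such as randomly shrinking the window) that are left out here.

```python
def context_pairs(tokens, window=2):
    """Yield (target, context) pairs for each token and its neighbours."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                yield target, tokens[j]


# Hypothetical toy sentence, already tokenized
sentence = ["the", "fluffy", "puppy", "is", "cute"]
pairs = list(context_pairs(sentence, window=2))

# Context words seen around "puppy" with a window of 2
puppy_contexts = [context for target, context in pairs if target == "puppy"]
print(puppy_contexts)
```

Words that keep appearing with the same context words end up with similar training signals, which is what pulls their vectors together.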

From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.
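
"Similarity" between two word vectors is usually measured with cosine similarity. The sketch below uses hand-made 3-dimensional toy vectors (invented for illustration, not taken from any trained model) to show the computation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up vectors: "dog" and "puppy" point in nearly the same direction
toy_vectors = {
    "dog": [0.9, 0.8, 0.1],
    "puppy": [0.8, 0.9, 0.2],
    "invoice": [0.1, 0.0, 0.9],
}

sim_dog_puppy = cosine_similarity(toy_vectors["dog"], toy_vectors["puppy"])
sim_dog_invoice = cosine_similarity(toy_vectors["dog"], toy_vectors["invoice"])
print(round(sim_dog_puppy, 3), round(sim_dog_invoice, 3))
```

A value close to 1 means the vectors point in almost the same direction; a value near 0 means they are almost unrelated.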



In [None]:
# !pip install spacy

In [None]:
import logging  # Setting up the loggings to monitor gensim
import re  # For preprocessing
from collections import defaultdict  # For word frequency
from time import time  # To time our operations

import pandas as pd  # For data handling
import spacy  # For preprocessing

logging.basicConfig(
    format="%(levelname)s - %(asctime)s: %(message)s",
    datefmt="%H:%M:%S",
    level=logging.INFO,
)

# Preprocessing




In [None]:
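# path to the text data (one document per line)
data_dir = "<DATA_DIR>"
# we only want the raw text, so read each line into a single column:
# "delimiter" is a separator string assumed never to occur in the data
df = pd.read_csv(data_dir, sep="delimiter", header=None, engine="python")
df.shape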

In [None]:
df.head()

In [None]:
# working with dataframes is easier if we have a nicer column name - so rename 0 to TEXT
df.rename(columns={0: "TEXT"}, inplace=True)

In [None]:
df.isnull().sum()

Removing the missing values:

In [None]:
df = df.dropna().reset_index(drop=True)
df.isnull().sum()

In [None]:
df.shape

## Cleaning:
We are lemmatizing and removing the stopwords and non-alphabetic characters for each line of text.

### NOTE
Lemmatizing is something we would usually skip for medical texts: it can reduce the vocabulary substantially and may not work well for niche medical terms.


In [None]:
nlp = spacy.load(
    "en_core_web_sm", disable=["ner", "parser"]
)  # disabling Named Entity Recognition and the dependency parser for speed


def cleaning(doc):
    # Lemmatizes and removes stopwords
    # doc needs to be a spacy Doc object
    txt = [token.lemma_ for token in doc if not token.is_stop]
    # Word2Vec uses context words to learn the vector representation of a target word,
    # if a sentence is only one or two words long,
    # the benefit for the training is very small
    if len(txt) > 2:
        return " ".join(txt)

Removes non-alphabetic characters:

In [None]:
brief_cleaning = (re.sub("[^A-Za-z']+", " ", str(row)).lower() for row in df["TEXT"])
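
To see what this regular expression keeps and drops, here is a small sketch on a made-up clinical-style string (the sample text is invented for illustration). Note that digits are removed along with punctuation, which may or may not be what you want for medical text.

```python
import re

# Hypothetical sample line; not taken from the dataset
raw = "BP 120/80, pt's HR=72!"

# Same pattern as above: every run of characters that is not a letter
# or an apostrophe is replaced by a single space, then lowercased
cleaned = re.sub("[^A-Za-z']+", " ", raw).lower()
print(repr(cleaned))
```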

Taking advantage of spaCy's `.pipe()` method to speed up the cleaning process:

In [None]:
txt = [cleaning(doc) for doc in nlp.pipe(brief_cleaning, n_process=2)]

Put the results in a DataFrame to remove missing values and duplicates:

In [None]:
df_clean = pd.DataFrame({"clean": txt})
df_clean = df_clean.dropna().drop_duplicates()
df_clean.shape

In [None]:
df_clean.head()

## Bigrams:
We are using Gensim Phrases package to automatically detect common phrases (bigrams) from a list of sentences.
https://radimrehurek.com/gensim/models/phrases.html



In [None]:
from gensim.models.phrases import Phraser, Phrases

`Phrases()` takes a list of lists of words as input, so we split each cleaned line into tokens:

In [None]:
sent = [row.split() for row in df_clean["clean"]]

In [None]:
len(sent)

Creates the relevant phrases from the list of sentences:

In [None]:
phrases = Phrases(sent, min_count=30, progress_per=10000)

The goal of Phraser() is to cut down memory consumption of Phrases(), by discarding model state not strictly needed for the bigram detection task:

In [None]:
bigram = Phraser(phrases)

In [None]:
bigram

Transform the corpus based on the bigrams detected:

In [None]:
sentences = bigram[sent]
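
For intuition about how a pair of words gets promoted to a bigram, here is a simplified, pure-Python sketch of the scoring formula from Mikolov et al., which Gensim's default scorer is modelled on as far as I understand it. The function `bigram_scores`, the toy sentences, and the choice to use the number of distinct unigrams as the vocabulary size are my own assumptions; Gensim's internals differ in the details.

```python
from collections import Counter


def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs: (count(a, b) - min_count) / (count(a) * count(b)) * vocab size."""
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs only
    vocab_size = len(unigrams)  # assumption: distinct unigrams only
    return {
        (a, b): (n_ab - min_count) / unigrams[a] / unigrams[b] * vocab_size
        for (a, b), n_ab in bigrams.items()
    }


# Hypothetical toy corpus
toy = [
    ["heart", "failure", "patient"],
    ["heart", "failure", "history"],
    ["patient", "history", "heart", "failure"],
]
scores = bigram_scores(toy, min_count=1)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Pairs that occur together more often than their individual frequencies would suggest get the highest score; pairs scoring above a threshold are joined with an underscore, which is why bigrams show up as tokens like `heart_failure`.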

## Most Frequent Words:
Mainly a sanity check of the effectiveness of the lemmatization, removal of stopwords, and addition of bigrams.

In [None]:
word_freq = defaultdict(int)
for sentence in sentences:  # avoid shadowing the `sent` list defined above
    for token in sentence:
        word_freq[token] += 1
len(word_freq)

In [None]:
sorted(word_freq, key=word_freq.get, reverse=True)[:10]

# Training the model
## Gensim Word2Vec Implementation:
We use Gensim implementation of word2vec: https://radimrehurek.com/gensim/models/word2vec.html

In [None]:
import multiprocessing

from gensim.models import Word2Vec

## Why I seperate the training of the model in 3 steps:
I prefer to separate the training into 3 distinct steps for clarity and monitoring.
1. `Word2Vec()`: 
>In this first step, I set up the parameters of the model one by one. <br>I purposefully do not supply the `sentences` parameter, which leaves the model uninitialized.
2. `.build_vocab()`: 
>This builds the vocabulary from a sequence of sentences and thereby initializes the model. <br>With the logging, I can follow the progress and, even more importantly, the effect of `min_count` and `sample` on the word corpus. I noticed that these two parameters, and `sample` in particular, have a great influence on the performance of a model. Displaying both makes it easier to manage their influence accurately.
3. `.train()`:
>Finally, this trains the model.<br>
The logging here is mainly useful for monitoring, making sure that no threads are executed instantaneously.

In [None]:
cores = multiprocessing.cpu_count()  # Count the number of cores in a computer

cores

## The parameters:

* `min_count` <font color='purple'>=</font> <font color='green'>int</font> - Ignores all words with total absolute frequency lower than this - (2, 100)


* `window` <font color='purple'>=</font> <font color='green'>int</font> - The maximum distance between the current and predicted word within a sentence. E.g. `window` words on the left and `window` words on the right of our target - (2, 10)


* `vector_size` <font color='purple'>=</font> <font color='green'>int</font> - Dimensionality of the feature vectors (called `size` in Gensim versions before 4.0). - (50, 300)


* `sample` <font color='purple'>=</font> <font color='green'>float</font> - The threshold for configuring which higher-frequency words are randomly downsampled. Highly influential.  - (0, 1e-5)


* `alpha` <font color='purple'>=</font> <font color='green'>float</font> - The initial learning rate - (0.01, 0.05)


* `min_alpha` <font color='purple'>=</font> <font color='green'>float</font> - Learning rate will linearly drop to `min_alpha` as training progresses. To set it: alpha - (min_alpha * epochs) ~ 0.00


* `negative` <font color='purple'>=</font> <font color='green'>int</font> - If > 0, negative sampling will be used; the int for negative specifies how many "noise words" should be drawn. If set to 0, no negative sampling is used. - (5, 20)


* `workers` <font color='purple'>=</font> <font color='green'>int</font> - Use these many worker threads to train the model (=faster training with multicore machines)
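
Since `sample` is so influential, it helps to see what it does to individual words. The sketch below follows the subsampling rule of the original word2vec C code as I understand it: a word that makes up a fraction `f` of the corpus is kept with probability `(sqrt(f / sample) + 1) * sample / f`, capped at 1. The function name `keep_probability` and the example fractions are my own.

```python
import math


def keep_probability(word_fraction, sample=6e-5):
    """Probability of keeping one occurrence of a word during training.

    word_fraction: share of all corpus tokens taken up by this word.
    sample: the downsampling threshold (0 disables downsampling).
    """
    if sample <= 0:
        return 1.0
    ratio = word_fraction / sample
    return min(1.0, (math.sqrt(ratio) + 1) / ratio)


# A very frequent word (1% of all tokens) vs. a rare one (0.001%)
p_frequent = keep_probability(0.01)
p_rare = keep_probability(0.00001)
print(round(p_frequent, 4), p_rare)
```

With `sample=6e-5`, a word covering 1% of the corpus keeps well under a tenth of its occurrences, while rare words are left untouched. Lowering `sample` makes the downsampling more aggressive.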

In [None]:
w2v_model = Word2Vec(
    min_count=20,
    window=2,
    vector_size=300,
    sample=6e-5,
    alpha=0.03,
    min_alpha=0.0007,
    negative=20,
    workers=cores - 1,
)

## Building the Vocabulary Table:
Word2Vec requires us to build the vocabulary table (simply digesting all the words and filtering out the unique words, and doing some basic counts on them):

In [None]:
%%time

w2v_model.build_vocab(sentences, progress_per=500)

In [None]:
w2v_model.sorted_vocab

In [None]:
w2v_model.corpus_count
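
Conceptually, building the vocabulary boils down to counting words, dropping those below `min_count`, and assigning an index to the rest. Here is a pure-Python sketch of that idea; the function `build_vocab_sketch` and the toy sentences are hypothetical, and Gensim's real `build_vocab()` does considerably more (downsampling tables, weight initialization, and so on).

```python
from collections import Counter


def build_vocab_sketch(sentences, min_count=2):
    """Return (word -> index, raw counts), keeping words seen at least min_count times."""
    counts = Counter(token for tokens in sentences for token in tokens)
    kept = [w for w, c in counts.items() if c >= min_count]
    # Most frequent first; ties broken alphabetically to keep the order stable
    kept.sort(key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(kept)}, counts


# Hypothetical toy corpus
toy = [
    ["chest", "pain", "cardiac"],
    ["chest", "pain"],
    ["cardiac", "arrest", "chest"],
]
index, counts = build_vocab_sketch(toy, min_count=2)
print(index)
```

Here "arrest" appears only once and is discarded, which is the same effect a higher `min_count` has on rare medical terms in the real corpus.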

## Training of the model:
_Parameters of the training:_
* `total_examples` <font color='purple'>=</font> <font color='green'>int</font> - Count of sentences;
* `epochs` <font color='purple'>=</font> <font color='green'>int</font> - Number of iterations (epochs) over the corpus - [10, 20, 30]

In [None]:
%%time
w2v_model.train(
    sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1
)
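
During training, the learning rate falls linearly from `alpha` to `min_alpha`. The sketch below shows that schedule sampled once per epoch, using the values set on the model above; the function `learning_rate` is my own, and Gensim actually updates the rate more finely (per batch of work) rather than once per epoch.

```python
def learning_rate(progress, alpha=0.03, min_alpha=0.0007):
    """Linearly interpolate from alpha (progress=0) to min_alpha (progress=1)."""
    progress = min(max(progress, 0.0), 1.0)  # clamp to [0, 1]
    return alpha - (alpha - min_alpha) * progress


epochs = 30
schedule = [learning_rate(epoch / epochs) for epoch in range(epochs + 1)]
print(schedule[0], schedule[epochs // 2], schedule[-1])
```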

In [None]:
# save model
save_path = "<SAVE_PATH>"
w2v_model.save(f"{save_path}/incident_word2vec.model")

# Exploring the model
## Most similar to:

Here, we will ask our model to find the words most similar to some basic medical concepts:

In [None]:
w2v_model.wv.most_similar(positive=["cardiac"])

_A small note:_<br>
We can query bigrams too, provided they made it into the vocabulary:

In [None]:
# w2v_model.wv.most_similar(positive=["heart_disease"])

In [None]:
w2v_model.wv.most_similar(positive=["lung"])

## Similarities:
Here, we will see how similar two words are to each other:

In [None]:
w2v_model.wv.similarity("leg", "head")

## Odd-One-Out:

Here, we ask the model which word does not belong in a given list:

In [None]:
w2v_model.wv.doesnt_match(["discharge", "heart", "lung"])
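
As far as I understand it, the odd one out is the word whose vector is least similar to the average of all the vectors in the list. The sketch below reproduces that idea in pure Python with invented 2-dimensional vectors; `odd_one_out` and the toy values are hypothetical and not Gensim's actual code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def odd_one_out(vectors):
    """Return the word least similar (by cosine) to the mean of the unit vectors."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = unit([sum(u[i] for u in units.values()) / len(units) for i in range(dim)])
    sims = {w: sum(a * b for a, b in zip(u, mean)) for w, u in units.items()}
    return min(sims, key=sims.get)


# Made-up vectors: "heart" and "lung" point one way, "discharge" another
toy = {"discharge": [0.1, 0.9], "heart": [0.9, 0.2], "lung": [0.8, 0.3]}
outlier = odd_one_out(toy)
print(outlier)
```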

## Analogy difference:


In [None]:
w2v_model.wv.most_similar(positive=["heart", "lung"], negative=["leg"], topn=3)
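
The `positive` and `negative` lists are combined into a single query vector: roughly, the unit vectors of the positive words are added, those of the negative words subtracted, and the remaining vocabulary is ranked by cosine similarity to the result. Here is a toy sketch of that procedure; the function `analogy` and the 2-dimensional vectors are invented for illustration and only approximate what Gensim does.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    units = {w: unit(v) for w, v in vectors.items()}
    dim = len(next(iter(units.values())))
    query = [0.0] * dim
    for w in positive:
        query = [q + x for q, x in zip(query, units[w])]
    for w in negative:
        query = [q - x for q, x in zip(query, units[w])]
    query = unit(query)
    exclude = set(positive) | set(negative)  # input words are never returned
    scored = [
        (w, sum(a * b for a, b in zip(u, query)))
        for w, u in units.items()
        if w not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up vectors: first axis ~ "organ", second axis ~ "limb"
toy = {
    "heart": [0.9, 0.1],
    "lung": [0.8, 0.2],
    "leg": [0.1, 0.9],
    "chest": [0.7, 0.1],
    "arm": [0.2, 0.8],
    "foot": [0.1, 0.95],
}
result = analogy(toy, positive=["heart", "lung"], negative=["leg"])
print([word for word, _ in result])
```

In this toy space, subtracting "leg" pushes the query away from limb-like words, so "chest" ranks first.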

## t-SNE visualizations:
t-SNE is a non-linear dimensionality reduction algorithm that attempts to represent high-dimensional data and the underlying relationships between vectors in a lower-dimensional space.<br>
Here is a good tutorial on it: https://medium.com/@luckylwk/visualising-high-dimensional-datasets-using-pca-and-t-sne-in-python-8ef87e7915b

In [None]:
import matplotlib.pyplot as plt
import numpy as np

%matplotlib inline

import seaborn as sns

sns.set_style("darkgrid")

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

Our goal in this section is to plot our 300-dimensional vectors in 2-dimensional graphs and see if we can spot interesting patterns.<br>
For that we are going to use the t-SNE implementation from scikit-learn.

To make the visualizations more relevant, we will look at the relationships between a query word (in <font color='red'>**red**</font>), its most similar words in the model (in <font color="blue">**blue**</font>), and other words from the vocabulary (in <font color='green'>**green**</font>).

In [None]:
def tsnescatterplot(model, word, list_names, features_dimensionality=300):
    """Plot in seaborn the results from the t-SNE dimensionality reduction algorithm
    of the vectors of a query word,
    its list of most similar words, and a list of words.
    """
    arrays = np.empty((0, features_dimensionality), dtype="f")
    word_labels = [word]
    color_list = ["red"]

    # adds the vector of the query word
    arrays = np.append(arrays, model.wv.__getitem__([word]), axis=0)

    # gets list of most similar words
    close_words = model.wv.most_similar([word])

    print(f"length of close words: {len(close_words)}")

    # adds the vector for each of the closest words to the array
    for wrd_score in close_words:
        wrd_vector = model.wv.__getitem__([wrd_score[0]])
        word_labels.append(wrd_score[0])
        color_list.append("blue")
        arrays = np.append(arrays, wrd_vector, axis=0)

    # adds the vector for each of the words from list_names to the array
    for wrd in list_names:
        wrd_vector = model.wv.__getitem__([wrd])
        word_labels.append(wrd)
        color_list.append("green")
        arrays = np.append(arrays, wrd_vector, axis=0)

    print(f"Shape of arrays going to PCA: {arrays.shape}")
    # PCA allows at most min(n_samples, n_features) components

    # Pre-reduces the vectors with PCA, keeping as many components as allowed
    reduc = PCA(
        n_components=min(arrays.shape[0], features_dimensionality)
    ).fit_transform(arrays)

    # Finds t-SNE coordinates for 2 dimensions
    np.set_printoptions(suppress=True)

    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)

    # Sets everything up to plot
    df = pd.DataFrame(
        {
            "x": [x for x in Y[:, 0]],
            "y": [y for y in Y[:, 1]],
            "words": word_labels,
            "color": color_list,
        }
    )

    fig, _ = plt.subplots()
    fig.set_size_inches(9, 9)

    # Basic plot
    p1 = sns.regplot(
        data=df,
        x="x",
        y="y",
        fit_reg=False,
        marker="o",
        scatter_kws={"s": 40, "facecolors": df["color"]},
    )

    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
        p1.text(
            df["x"][line],
            df["y"][line],
            "  " + df["words"][line].title(),
            horizontalalignment="left",
            verticalalignment="bottom",
            size="medium",
            color=df["color"][line],
            weight="normal",
        ).set_size(15)

    plt.xlim(Y[:, 0].min() - 50, Y[:, 0].max() + 50)
    plt.ylim(Y[:, 1].min() - 50, Y[:, 1].max() + 50)

    plt.title("t-SNE visualization for {}".format(word.title()))

In [None]:
w2v_model.wv.key_to_index

Code inspired by: [[2]](#References:)

## 10 Most similar words vs. 8 Random words:
Let's compare where the vector representation of our query word, its 10 most similar words from the model, and a handful of other words from the vocabulary lie in a 2D graph:

In [None]:
word = "heart"

In [None]:
tsnescatterplot(
    w2v_model, word, ["bedside", "left", "discharge", "give", "immediately"]
)

## 10 Most similar words vs. 10 Most dissimilar


In [None]:
tsnescatterplot(
    w2v_model, word, [i[0] for i in w2v_model.wv.most_similar(negative=[word])]
)

## 10 Most similar words vs. 11th to 20th Most similar words:


In [None]:
tsnescatterplot(
    w2v_model,
    word,
    [t[0] for t in w2v_model.wv.most_similar(positive=[word], topn=20)][10:],
)




# Materials for more in-depth understanding:
* Word Embeddings introduction: https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/
* Word2Vec introduction: https://skymind.ai/wiki/word2vec
* Another Word2Vec introduction: http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
* A great Gensim implementation tutorial: http://kavita-ganesan.com/gensim-word2vec-tutorial-starter-code/#.W467ScBjM2x
* Original articles from Mikolov et al.: https://arxiv.org/abs/1301.3781 and https://arxiv.org/abs/1310.4546


