# Word2Vec and Doc2Vec Embeddings Generation

- Each document can be viewed as a bag of words. Collectively, all the words (tokens) across the documents constitute the vocabulary. Each of these words is converted to a fixed-length dense vector using the Word2Vec algorithm. The size of each vector is 100.


- The Doc2Vec algorithm produces a dense vector representation of a whole document. It considers the global ordering of words in the document. We transform each document (discharge summary) into a D2V embedding of size 128.


- These embeddings, D2V (one per document) and W2V (one per word in the vocabulary), are used as input encodings in our deep learning model.

- We use the Word2Vec and Doc2Vec models provided by **Gensim** to train and create the embeddings.

In [12]:
import csv
from gensim.models.word2vec import LineSentence, Word2Vec
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import preprocess_string, split_on_space
import random
import os
import time

In [13]:
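# Set seeds for reproducibility.
# Note: PYTHONHASHSEED only takes effect if it is set before the Python interpreter starts,
# so this assignment affects child processes, not the current kernel. Gensim training is also
# only fully deterministic with a single worker thread (workers=1).
seed = 24
random.seed(seed)
os.environ["PYTHONHASHSEED"] = str(seed)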

In [None]:
# Mount the project directory from Google Drive. (This cell is only intended to be run in a Colab environment.)

from google.colab import drive
drive.mount('drive')

In [25]:
# Define the base project directory.

PROJECT_DIR = 'drive/My Drive/cs598-dl/' # For Google drive only

# PROJECT_DIR = '../' # For local directory

## Word2Vec Model

We will start with Word2Vec embedding generation for the words used across the discharge summary documents.

In [28]:
# Create a streaming iterator that reads the discharge summaries and yields them
# as sentences, each sentence being a list of tokens.
# This is the input source for the Gensim Word2Vec model.
# Such a streaming source lets Gensim train the model without loading the whole corpus into memory.

class NotesIter(object):
    def __iter__(self):
        # newline='' is what the csv module recommends for files it reads.
        with open(PROJECT_DIR + 'data/NOTES_2.csv', newline='') as notes:
            reader = csv.reader(notes)
            next(reader, None)  # skip the first (header) row
            for row in reader:
                report = row[2]
                # Treat each line of the report as one "sentence".
                for sentence in report.splitlines():
                    yield self._getTokens(sentence)

    def _getTokens(self, sentence):
        # Here we use Gensim's default pre-processing, which tokenizes the text with the following transformations:
        # lowercase,
        # strip (html) tags,
        # strip punctuation,
        # strip multiple whitespaces,
        # strip numbers,
        # remove stopwords,
        # strip short words (shorter than 3 characters),
        # stem text
        return preprocess_string(sentence)
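
The reason for wrapping the reader in a class with `__iter__`, rather than passing a plain generator, is that Gensim makes several passes over the corpus (one to build the vocabulary, then one per training epoch), and a generator is exhausted after the first pass. The following is a minimal standard-library sketch of that difference; the in-memory CSV text, the `RestartableRows` class, and the whitespace tokenizer are hypothetical stand-ins for `NOTES_2.csv`, `NotesIter`, and `preprocess_string`.

```python
import csv
import io

# Hypothetical stand-in for the notes CSV: a header row, then (row id, HADM_ID, text).
CSV_TEXT = (
    "ROW_ID,HADM_ID,TEXT\n"
    '1,100001,"chest pain on admission\nstable at discharge"\n'
    '2,100002,"fever resolved"\n'
)


class RestartableRows:
    """Yields one token list per line of each note; every __iter__ call starts a fresh pass."""

    def __iter__(self):
        reader = csv.reader(io.StringIO(CSV_TEXT))
        next(reader, None)  # skip the header row
        for row in reader:
            for line in row[2].splitlines():
                yield line.split()  # simple whitespace tokenizer, for illustration only


def one_shot_rows():
    """The same logic as a plain generator function; its result can be consumed only once."""
    yield from RestartableRows()


restartable = RestartableRows()
first_pass = list(restartable)
second_pass = list(restartable)  # a new pass over the data: the same sentences again

generator = one_shot_rows()
gen_first = list(generator)
gen_second = list(generator)  # already exhausted: empty

print(len(first_pass), len(second_pass))  # 3 3
print(len(gen_first), len(gen_second))    # 3 0
```

A model given the one-shot generator would see data only on its first pass, which is why an object with `__iter__` is the safer input source.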

In [29]:
# Create the Word2Vec model with NotesIter as the input source and an output vector size of 100.
sta = time.time()
model = Word2Vec(NotesIter(), min_count=1, vector_size = 100, workers=4, seed = seed)
end = time.time()
print('Time spent: ' + str(end - sta))

Time spent: 2212.24529337883


In [10]:
# Extract and save the mapping from vocabulary tokens to their vectors.
# The 'wv' property of the model is a KeyedVectors object,
# which is a memory-efficient representation of the trained vectors.

wv = model.wv
wv.save(PROJECT_DIR + 'data/wv.kv')

## Doc2Vec Model

Now, we create a Doc2Vec model and generate embeddings per discharge summary document.

In [33]:
# We create an iterator that reads each document and yields it as a tagged list of tokens.
# This iterator is used as a streaming source for the Doc2Vec model.
# Such a streaming source lets Gensim train the model without loading the whole corpus into memory.

class DocsIter(object):
    def __iter__(self):
        # newline='' is what the csv module recommends for files it reads.
        with open(PROJECT_DIR + 'data/NOTES_2.csv', newline='') as notes:
            reader = csv.reader(notes)
            next(reader, None)  # skip the first (header) row
            for row in reader:
                # Here we tag each document with its corresponding HADM_ID (Hospitalization ID).
                yield TaggedDocument(words=preprocess_string(row[2]), tags=[row[1]])

In [34]:
# Create the Doc2Vec model with DocsIter as the input source and an output vector size of 128.

sta = time.time()
document_model = Doc2Vec(DocsIter(), vector_size=128, min_count=1, workers=4, seed = seed)
end = time.time()
print('Time spent: ' + str(end - sta))

Time spent: 3078.5452456474304


In [30]:
# Extract and save the mapping from HADM_ID tags to their D2V vectors.
# The 'dv' property of the model is a KeyedVectors object,
# which is a memory-efficient representation of the trained document vectors.

dv = document_model.dv
dv.save(PROJECT_DIR + 'data/dv.kv')