# Microbiome Embedding with word2vec

To deal with the sparseness of the microbiome data, we can use word2vec to create a dense embedding for each OTU. Representing each OTU as a dense vector should make it easier to use in predictive modeling later on. In this demonstration the data is read from a local pkl file; a proper implementation would pull the data from S3.

In [4]:
import pandas as pd
import gensim
from random import shuffle


# pull table from local file
samples = pd.read_pickle(r'C:\Users\bwesterber\Downloads\biom_table.pkl')

# replace nans with zeros
samples = samples.fillna(0)

Since we are framing this as an NLP problem, let's think of each microbiome sample as a sentence and each OTU as a word. Here we construct the sentences that will be fed into the word2vec model.

In [5]:
sentences = []
# assumes columns are samples and rows are OTUs, as the indexing below implies
for sample in samples.columns:
    counts = samples[sample]
    # keep only the OTUs present in this sample, as string tokens
    sentences.append([str(otu) for otu in counts[counts > 0].index])
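To make the sample-as-sentence framing concrete, here is a minimal sketch on a toy table. The table, its sample names, and its counts are hypothetical and only mirror the layout the cell above appears to assume (rows are OTU IDs, columns are samples).

```python
import pandas as pd

# Hypothetical toy table: rows are OTU IDs, columns are samples.
toy = pd.DataFrame(
    {"sample_1": [3, 0, 1], "sample_2": [0, 5, 2]},
    index=[84239, 354969, 278561],
)

toy_sentences = []
for sample_id in toy.columns:
    counts = toy[sample_id]
    # A "sentence" is the list of OTU IDs with a non-zero count in this sample.
    toy_sentences.append([str(otu) for otu in counts[counts > 0].index])

print(toy_sentences)
```

Each inner list plays the role of one sentence, and abundance information is discarded: an OTU is either present in the sentence or not.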

Here we can augment the dataset used for generating the embeddings by shuffling each sample. In CBOW mode the context words are averaged, so word order shouldn't matter for the embedding as long as the window covers the whole sample, but having a larger training set might improve how word2vec learns the conditional probability of each word. The usefulness of this is still unclear, and the step may not be required.

In [6]:
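# shuffle each sample around and append it to the training data
augmentation_constant = 2
generated_sentences = []
for sentence in sentences:
    for augmentation in range(augmentation_constant):
        # shuffle a copy: shuffle() works in place, so appending the same
        # list twice would store two references to one ordering
        shuffled = list(sentence)
        shuffle(shuffled)
        generated_sentences.append(shuffled)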
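The claim that word order should not matter in CBOW mode can be illustrated with a simplified sketch. It assumes the CBOW input is the plain average of the context vectors and ignores details of the real gensim implementation such as random window shrinking; the vectors and OTU names are hypothetical.

```python
# Hypothetical 2-d vectors for three OTUs, for illustration only.
vectors = {"otu_a": [1.0, 0.0], "otu_b": [0.0, 1.0], "otu_c": [2.0, 2.0]}


def cbow_context(sentence, target_index, window):
    """Average the vectors of the words within `window` positions of the target."""
    lo = max(0, target_index - window)
    hi = min(len(sentence), target_index + window + 1)
    context = [w for i, w in enumerate(sentence[lo:hi], start=lo) if i != target_index]
    dim = len(vectors[context[0]])
    return [sum(vectors[w][d] for w in context) / len(context) for d in range(dim)]


original = ["otu_a", "otu_b", "otu_c"]
shuffled = ["otu_c", "otu_a", "otu_b"]

# A window that spans the whole sample gives the same context in either order.
wide_original = cbow_context(original, original.index("otu_a"), window=100)
wide_shuffled = cbow_context(shuffled, shuffled.index("otu_a"), window=100)

# A window narrower than the sample makes the context depend on the order.
narrow_original = cbow_context(original, original.index("otu_c"), window=1)
narrow_shuffled = cbow_context(shuffled, shuffled.index("otu_c"), window=1)

print(wide_original == wide_shuffled, narrow_original == narrow_shuffled)
```

Under these assumptions, shuffling only changes what the model sees when a sample contains more OTUs than the window can cover, which is one reason the benefit of the augmentation step is uncertain.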

Prepared data is now fed into the gensim word2vec model for training in CBOW mode. 

In [7]:
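# sg=0 selects CBOW; gensim >= 4.0 renames `size` to `vector_size`
model = gensim.models.Word2Vec(generated_sentences, size=100, min_count=2, window=100, workers=4, sg=0)
# the constructor has already trained once; this call runs 5 more epochs
model.train(generated_sentences, total_examples=len(generated_sentences), epochs=5)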

(27331117, 28082160)

We can now evaluate the model to see which OTUs are most similar to each other in the embedding space. The most_similar method returns the nearest neighbors of the input OTU along with their cosine similarity to it.

In [9]:
model.wv.most_similar('84239', topn = 10)

[('354969', 0.9211254119873047),
 ('278561', 0.5733003616333008),
 ('298273', 0.421808123588562),
 ('74035', 0.39019107818603516),
 ('128089', 0.38910770416259766),
 ('321953', 0.37498682737350464),
 ('321024', 0.3368404805660248),
 ('179973', 0.32836204767227173),
 ('104226', 0.32391905784606934),
 ('286850', 0.3225645422935486)]
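For reference, the scores above are cosine similarities, i.e. the dot product of two vectors divided by the product of their lengths. The following sketch computes it by hand on hypothetical 2-d vectors; it is not how gensim implements the lookup internally, only an illustration of the quantity being reported.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical vectors, for illustration only.
same_direction = cosine_similarity([1.0, 0.0], [3.0, 0.0])  # parallel vectors
diagonal = cosine_similarity([1.0, 0.0], [1.0, 1.0])        # 45 degrees apart
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 90 degrees apart

print(same_direction, diagonal, orthogonal)
```

Because the measure ignores vector length, a score near 1 means two OTU vectors point in nearly the same direction, regardless of their magnitudes.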

# Microbiome Embedding with Poincare Embeddings

If we think of each OTU as a vertex in an undirected graph, we can embed it using a Poincare embedding. Any two OTUs that co-occur in a sample share an edge.
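As a starting point, the co-occurrence graph described here could be built as a set of undirected edges from per-sample OTU lists. This is only a sketch under assumptions: the function name `cooccurrence_edges` and the toy input are hypothetical, and the embedding step itself is not shown.

```python
from itertools import combinations


def cooccurrence_edges(sentences):
    """Return the set of undirected edges between OTUs that share a sample."""
    edges = set()
    for sentence in sentences:
        # Sort the unique OTUs so that (u, v) and (v, u) map to the same edge.
        for u, v in combinations(sorted(set(sentence)), 2):
            edges.add((u, v))
    return edges


# Hypothetical toy input: two samples given as lists of OTU IDs.
toy_sentences = [["84239", "354969", "278561"], ["84239", "354969"]]
edges = cooccurrence_edges(toy_sentences)

print(sorted(edges))
```

The number of edges grows quadratically with the number of OTUs per sample, so some filtering (for example, a minimum co-occurrence count) may be needed on real data. An edge list of this shape is the kind of input that gensim's `PoincareModel` takes as relation tuples, though whether a flat co-occurrence graph suits a hyperbolic embedding remains to be tested.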

TODO