In this section, we will look at different factors that can affect the stability of your word embeddings, and ways to mitigate these issues in order to make your observations robust.

This notebook contains code to train various word2vec models using gensim, but we won't run them during the tutorial, in order to save time. Although you could run them yourself, I would not recommend using these settings for research or making design decisions about your experiments:

* for research, make sure you train larger models
* for design decisions, please consult the original papers, which show these effects over hundreds of runs.

This notebook is just for us to get a sense of how different things can get. Go ahead and run it for fun if you like, though!

In [None]:
import collections
import pickle
from gensim.models import Word2Vec
import random
import matplotlib.pyplot as plt
import seaborn as sns
import util

`build_w2v_model` is the function we will use to build all our models. I tried to make it somewhat realistic, but that means that I can't build them all during the tutorial. So we'll just have to use results "that I prepared earlier"!

In [None]:
def build_w2v_model(sentences, seed=0, num_workers=8):
  """Build a basic word2vec model using gensim.
  
  Args:
    sentences: corpus as list of lists of tokens
    seed: random seed (gensim.models.Word2Vec parameter)
    num_workers: number of threads to use (gensim.models.Word2Vec parameter)
    
  Returns:
    gensim.models.Word2Vec model
  """
  return Word2Vec(sentences=sentences, vector_size=10, window=5, min_count=10, seed=seed, workers=num_workers)
  
def embedding_corpus_from_docs(doc_list):
  """Build a corpus for training embeddings from a specific list of documents.

  Args:
    doc_list: list of documents, each in list of lists of tokens form

  Returns:
    A list of lists of tokens comprising sentences from all the documents in order.
  """
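  # Flatten one level; same result as sum(doc_list, []) without the quadratic list copying
  return [sentence for doc in doc_list for sentence in doc]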

We will use a corpus of documents from /r/AskScience on Reddit, in the form of a dict:
```
{
"id_1": [
	["This", "is", "a", "sentence", "."],
	["This", "is", "another", "sentence", "."],
	],
"id_2": [
	["This", "is", "a", "third", "sentence", "."],
	["This", "is", "a", "fourth", "sentence", "."],
	],
...
}
```

The various preprocessing steps, etc. are hidden in `util.py`. For now, we will just retrieve them and not pay attention to the details.
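
To make the format concrete, here is a small sketch of how a dict in this shape flattens into the list of sentences that word2vec trains on. The `toy_docs` data is made up for illustration and is not part of the AskScience corpus.

```python
# Toy corpus in the same shape as the AskScience dict (made-up data).
toy_docs = {
    "id_1": [
        ["This", "is", "a", "sentence", "."],
        ["This", "is", "another", "sentence", "."],
    ],
    "id_2": [
        ["This", "is", "a", "third", "sentence", "."],
    ],
}

# Flatten the documents into one list of sentences, preserving document order.
toy_corpus = [sentence for doc in toy_docs.values() for sentence in doc]

print(len(toy_docs))     # number of documents
print(len(toy_corpus))   # number of sentences across all documents
print(toy_corpus[2][3])  # fourth token of the third sentence
```

Document boundaries disappear after flattening, which is why any document-level sampling has to happen on the dict, before this step.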

In [None]:
askscience_docs = util.preprocess_askscience()

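First, we will compare two models trained with different random seeds. Setting `num_workers=1` makes the gensim training deterministic for a given seed, so any difference between the two models comes from the seed alone. (The gensim documentation notes that in Python 3, reproducibility across interpreter launches also requires fixing the `PYTHONHASHSEED` environment variable.)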

\* Note: you can also get 'random' results by using a higher `num_workers`, but this is due to thread scheduling and is not reproducible. The tradeoff might be worth it though, since you can train a lot faster, and by the end of this tutorial we will have ways of dealing with randomness.

In [None]:
fixed_corpus = embedding_corpus_from_docs(list(askscience_docs.values())[:1000])

print("Starting")
w2v_1 = build_w2v_model(fixed_corpus, seed=23, num_workers=1)
print("Built first model")
w2v_2 = build_w2v_model(fixed_corpus, seed=96, num_workers=1)
print("Built second model")

We will compare two models using a list of salient words for /r/AskScience. I got these by doing some LDA shenanigans, but I am the least qualified person in the room to talk about LDA, so I will just gloss over that part.

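One way to quantify the changes is to look at the difference in cosines calculated with the two models. We plot the pairwise cosines in a scatter plot below, one point per unordered word pair (28 pairs for the eight words enabled here, 190 for the full list of 20):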
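
For reference, this is the quantity being compared. The sketch below computes cosine similarity by hand for a few made-up three-dimensional vectors (they do not come from any trained model) and also prints 1 minus the similarity, which is what gensim's `wv.distance` returns.

```python
import itertools
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors, chosen only to make the arithmetic easy to follow.
toy_vectors = {
    "brain": [1.0, 0.0, 0.0],
    "sleep": [1.0, 1.0, 0.0],
    "galaxy": [0.0, 0.0, 2.0],
}

# Iterate over every unordered pair of words, as the plotting code does.
for w1, w2 in itertools.combinations(toy_vectors, 2):
    sim = cosine_similarity(toy_vectors[w1], toy_vectors[w2])
    print(w1, w2, "similarity:", round(sim, 3), "distance:", round(1 - sim, 3))
```

If two models agreed perfectly, every pair would land on the diagonal of the scatter plot.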

In [None]:
ASK_SCIENCE_WORDS = ["bacteria", "plant", "species", "brain",
"muscle", "sleep", "human", "galaxy",]
# "space", "planet", "universe", "electricity",
# "light", "magnetic", "field", "power",
# "calorie", "chemical", "temperature", "pressure"]

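def plot_cosine_changes(model_1, model_2, words):
  """Scatter-plot each word pair's cosine distance in model_1 against model_2."""
  cosine_pairs = []

  # wv.distance is the cosine distance, i.e. 1 - cosine similarity
  for i, word1 in enumerate(words):
    for word2 in words[i+1:]:
      cosine_pairs.append((model_1.wv.distance(word1, word2), model_2.wv.distance(word1, word2)))
  x, y = zip(*cosine_pairs)
  # seaborn 0.12+ requires x and y to be passed as keyword arguments
  sns.scatterplot(x=x, y=y)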
  
plot_cosine_changes(w2v_1, w2v_2, ASK_SCIENCE_WORDS)

Clearly, raw cosines are not very reliable -- while they haven't changed drastically, these are substantial changes considering both models were trained with the **exact same distribution of contexts**. 

A more realistic way to compare models follows how embeddings are actually used: we make qualitative judgments based on nearest neighbor lists. So instead, we should look at changes in nearest neighbor lists, and then try to quantify that change.

In [None]:
def compare_nearest_neighbors(model_1, model_2, words):
  for word in words:
    print(word)
    nn_1 = [x[0] for x in model_1.wv.most_similar(positive=[word])]
    nn_2 = [x[0] for x in model_2.wv.most_similar(positive=[word])]
    print(nn_1)
    print(nn_2)
    
compare_nearest_neighbors(w2v_1, w2v_2, ASK_SCIENCE_WORDS)

To quantify the change in nearest neighbors, we can measure (1) how much of the nearest neighbor set is shared between the two models (Jaccard similarity, so lower values mean more of the set has changed) and (2) the change in rank of the words that appear in both lists.
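
A small worked example of both measures, using two made-up neighbor lists (the words and their order are hypothetical, not model output):

```python
def jaccard(l1, l2):
    """Size of the intersection divided by size of the union."""
    s1, s2 = set(l1), set(l2)
    return len(s1 & s2) / len(s1 | s2)

# Hypothetical nearest-neighbor lists for one query word in two models.
nn_1 = ["neuron", "cortex", "nerve", "memory"]
nn_2 = ["cortex", "neuron", "synapse", "nerve"]

overlap = jaccard(nn_1, nn_2)

# Rank difference (position in model 1 minus position in model 2),
# computed only for neighbors that appear in both lists.
rank_diffs = {w: nn_1.index(w) - nn_2.index(w) for w in nn_1 if w in nn_2}

print(overlap)     # 3 shared words out of 5 distinct words -> 0.6
print(rank_diffs)  # {'neuron': -1, 'cortex': 1, 'nerve': -1}
```

The two numbers are complementary: the Jaccard value says how many neighbors survived, and the rank differences say how much the survivors moved.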

In [None]:
def jaccard(l1, l2):
  return len(set(l1).intersection(set(l2))) / len(set(l1).union(set(l2)))

def quantify_nn_change(model_1, model_2, words):
  rank_diffs = []

  for word in words:
    nn_1 = [x[0] for x in model_1.wv.most_similar(positive=[word])]
    nn_2 = [x[0] for x in model_2.wv.most_similar(positive=[word])]
    print(word, jaccard(nn_1, nn_2))
    for nn in nn_1:
      # Only neighbors present in both lists have a rank difference
      if nn in nn_2:
        rank_diffs.append(nn_1.index(nn) - nn_2.index(nn))
  fig, ax = plt.subplots()
  sns.histplot(rank_diffs, ax=ax)
  ax.set_xlabel("Difference in nearest neighbor rank")
  
quantify_nn_change(w2v_1, w2v_2, ASK_SCIENCE_WORDS)

All of these changes are taking place for the exact same set of documents. However, changes to the set of documents can result in instability as well. Below, we train two models, each on 95% of the documents in the corpus.

In [None]:
doc_keys = sorted(askscience_docs.keys())
sample_size = int(0.95 * len(doc_keys))  # each model sees 95% of the documents
keys1 = random.sample(doc_keys, sample_size)
keys2 = random.sample(doc_keys, sample_size)

w2v_1 = build_w2v_model( embedding_corpus_from_docs([askscience_docs[key] for key in keys1]))
print("Built first model")
w2v_2 = build_w2v_model( embedding_corpus_from_docs([askscience_docs[key] for key in keys2]))
print("Built second model")

# Show results here

This is really important because while a corpus of documents represents the mental model of the document creators, it represents only a *sample* of this mental model. We need to treat it in a statistically appropriate way. The fact that sampling can make such a huge difference indicates that we need to use bootstrapping (i.e. sampling with replacement).
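
To make the distinction concrete, the sketch below contrasts subsampling without replacement (as in the 95% experiment) with a bootstrap sample drawn with replacement. The document ids are made up, and `rng` is a local seeded generator used only so the illustration is repeatable.

```python
import random

rng = random.Random(0)  # local seeded generator, for a repeatable illustration
doc_ids = [f"doc_{i}" for i in range(1000)]  # hypothetical document ids

# Subsampling: draw 95% of the documents, each at most once.
subsample = rng.sample(doc_ids, round(0.95 * len(doc_ids)))

# Bootstrapping: draw as many documents as the corpus has, with replacement,
# so some documents appear several times and others not at all.
bootstrap = rng.choices(doc_ids, k=len(doc_ids))

print("subsample: distinct fraction", len(set(subsample)) / len(doc_ids))
print("bootstrap: distinct fraction", len(set(bootstrap)) / len(doc_ids))
```

In expectation a bootstrap sample contains about 63% of the distinct documents (1 - 1/e), so each bootstrapped model sees a noticeably different corpus while the total amount of training data stays the same.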

Note that it is also possible to sample at the sentence level. While this gives much more appealing bounds, it is likely to be misleading -- a word's contexts all over a document are likely to be similar, and we aren't really excluding a particular usage of the word by just excluding some of its occurrences in a document.
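
A rough sketch of why sentence-level sampling perturbs the corpus so little. It uses a made-up corpus of 100 documents with 20 sentences each, where every "sentence" is just a `(doc_id, sentence_index)` placeholder, and counts how many documents still contribute at least one sentence after each kind of 95% sample.

```python
import random

rng = random.Random(0)  # local seeded generator, for a repeatable illustration

# Hypothetical corpus: 100 documents, 20 placeholder sentences per document.
toy_docs = {
    f"doc_{d}": [(f"doc_{d}", s) for s in range(20)]
    for d in range(100)
}
all_sentences = [sent for doc in toy_docs.values() for sent in doc]

# Document-level sampling: 5% of the documents are dropped entirely.
kept_docs = rng.sample(sorted(toy_docs), round(0.95 * len(toy_docs)))

# Sentence-level sampling: 5% of the sentences are dropped, spread across documents.
kept_sentences = rng.sample(all_sentences, round(0.95 * len(all_sentences)))

docs_seen_doc_level = set(kept_docs)
docs_seen_sentence_level = {doc_id for doc_id, _ in kept_sentences}

print("documents represented, document-level:", len(docs_seen_doc_level))
print("documents represented, sentence-level:", len(docs_seen_sentence_level))
```

With sentence-level sampling essentially every document is still represented, so the usages it contains remain in the training data; only document-level sampling removes a usage wholesale.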

In [None]:
sentences1 = random.sample(fixed_corpus, int(0.95 * len(fixed_corpus)))
sentences2 = random.sample(fixed_corpus, int(0.95 * len(fixed_corpus)))

w2v_1 = build_w2v_model(sentences1)
print("Built first model")
w2v_2 = build_w2v_model(sentences2)
print("Built second model")

# Show results for sentence-level sampling here

The way to ensure robust results is to use bootstrapping, and report aggregated results over multiple runs.

In [None]:
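import tqdm

BOOTSTRAP_NUM = 20

bootstrap_models_1 = []
for i in tqdm.tqdm(range(BOOTSTRAP_NUM)):
  # Sample documents with replacement. NOTE: the trailing [:10] keeps only 10 documents
  # per sample, presumably to keep this demo fast; drop it for a full-size bootstrap sample.
  selected_keys = random.choices(sorted(askscience_docs.keys()), k=len(askscience_docs.keys()))[:10]
  bootstrap_models_1.append(build_w2v_model(embedding_corpus_from_docs([askscience_docs[key] for key in selected_keys])))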


Let's compare this with another set of 20 bootstrapped models.

In [None]:
bootstrap_models_2 = []
for i in tqdm.tqdm(range(BOOTSTRAP_NUM)):
  # Same sampling as above, including the [:10] truncation to 10 documents per sample
  selected_keys = random.choices(sorted(askscience_docs.keys()), k=len(askscience_docs.keys()))[:10]
  bootstrap_models_2.append(build_w2v_model(embedding_corpus_from_docs([askscience_docs[key] for key in selected_keys])))

We can compare the nearest neighbors by aggregating their ranks over each run.
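
One possible way to summarize the aggregated ranks, sketched here on made-up neighbor lists from three hypothetical runs: for each neighbor, report the fraction of runs in which it appears and its mean rank in those runs. This is an illustrative summary, not the only reasonable one.

```python
import collections

# Hypothetical nearest-neighbor lists for one query word across three runs.
runs = [
    ["neuron", "cortex", "nerve"],
    ["cortex", "neuron", "synapse"],
    ["neuron", "nerve", "cortex"],
]

# Collect the (1-based) rank of each neighbor in every run where it appears.
ranks = collections.defaultdict(list)
for neighbors in runs:
    for rank, neighbor in enumerate(neighbors, start=1):
        ranks[neighbor].append(rank)

# Summarize as (fraction of runs containing the neighbor, mean rank when present).
summary = {
    neighbor: (len(r) / len(runs), sum(r) / len(r))
    for neighbor, r in ranks.items()
}

# Most stable neighbors first: highest frequency, then best mean rank.
for neighbor, (freq, mean_rank) in sorted(
        summary.items(), key=lambda kv: (-kv[1][0], kv[1][1])):
    print(f"{neighbor}: in {freq:.0%} of runs, mean rank {mean_rank:.2f}")
```

Neighbors that show up in nearly every run with a consistent rank are the ones worth building a qualitative argument on.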

In [None]:
def aggregate_nearest_neighbors(words, model_list):
  nearest_neighbor_rank_counter = collections.defaultdict(lambda:collections.defaultdict(list))
  for model in model_list:
    for word in words:
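      # most_similar returns (word, similarity) tuples; key on the word so ranks aggregate across models
      for i, (nn, _) in enumerate(model.wv.most_similar(positive=[word])):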
        nearest_neighbor_rank_counter[word][nn].append(i+1)

  for k, v in nearest_neighbor_rank_counter.items():
    print(k)
    print(v)
    
aggregate_nearest_neighbors(ASK_SCIENCE_WORDS, bootstrap_models_1)

In [None]:
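# NOTE: final_docs is not defined anywhere in this notebook. It presumably refers to the
# preprocessed document dict (askscience_docs above); define it before running this cell.
sorted_docs = [final_docs[k] for k in sorted(final_docs.keys())]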

sorted_keys = sorted(final_docs.keys())


random.seed(131)
bootstrapped_keylists = []
for _ in range(20):
    bootstrapped_keylists.append(random.choices(sorted_keys, k=len(sorted_keys)))
    
models = []   
for i, key_list in enumerate(bootstrapped_keylists):
    models.append(
        build_w2v_model(
            embedding_corpus_from_docs(
                [final_docs[k] for k in key_list]), seed=4))
    print(f'Built model {i}')

In [None]:
# import nltk
# nltk.download('wordnet')
# from nltk.corpus import wordnet as wn
# def get_lemma(word):
#     lemma = wn.morphy(word)
#     if lemma is None:
#         return word
#     else:
#         return lemma
    
# from nltk.stem.wordnet import WordNetLemmatizer
# def get_lemma2(word):
#     return WordNetLemmatizer().lemmatize(word)
  
# nltk.download('stopwords')
# en_stop = set(nltk.corpus.stopwords.words('english'))

# def prepare_text_for_lda(text):
#     tokens = tokenize(text)
#     tokens = [token for token in tokens if len(token) > 4]
#     tokens = [token for token in tokens if token not in en_stop]
#     tokens = [get_lemma(token) for token in tokens]
#     return tokens

In [None]:
# lda_corpus = []
# for doc_id, doc in askscience_docs.items():
#   doc_tokens = sum(doc, [])
#   lda_corpus.append([get_lemma(token) for token in doc_tokens if len(token) > 4 and token not in en_stop])

# from gensim import corpora
# dictionary = corpora.Dictionary(lda_corpus)
# corpus = [dictionary.doc2bow(text) for text in lda_corpus]
# import pickle
# pickle.dump(corpus, open('corpus.pkl', 'wb'))
# dictionary.save('dictionary.gensim')

# import gensim
# NUM_TOPICS = 20
# ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics = NUM_TOPICS, id2word=dictionary, passes=15)
# ldamodel.save('model5.gensim')
# topics = ldamodel.print_topics(num_words=4)
# for topic in topics:
#     print(topic)

In [None]:
blerp = """(0, '0.091*"light" + 0.047*"speed" + 0.033*"would" + 0.022*"move"')
(1, '0.017*"article" + 0.015*"people" + 0.013*"scientific" + 0.012*"study"')
(2, '0.096*"earth" + 0.067*"would" + 0.053*"planet" + 0.043*"space"')
(3, '0.056*"cause" + 0.034*"effects" + 0.026*"effect" + 0.020*"damage"')
(4, '0.022*"chemical" + 0.019*"reaction" + 0.019*"different" + 0.015*"material"')
(5, '0.040*"animal" + 0.037*"humans" + 0.031*"human" + 0.030*"species"')
(6, '0.040*"weight" + 0.033*"muscle" + 0.032*"drink" + 0.024*"calorie"')
(7, '0.032*"image" + 0.020*"picture" + 0.019*"pattern" + 0.017*"computer"')
(8, '0.042*"plant" + 0.028*"bacteria" + 0.018*"electricity" + 0.017*"hands"')
(9, '0.043*"title" + 0.041*"ground" + 0.022*"cloud" + 0.021*"train"')
(10, '0.049*"field" + 0.026*"magnetic" + 0.017*"project" + 0.014*"fields"')
(11, '0.173*"would" + 0.056*"could" + 0.038*"possible" + 0.020*"years"')
(12, '0.023*"question" + 0.022*"would" + 0.022*"thanks" + 0.022*"understand"')
(13, '0.129*"water" + 0.047*"would" + 0.045*"temperature" + 0.033*"pressure"')
(14, '0.165*"energy" + 0.047*"power" + 0.031*"charge" + 0.030*"increase"')
(15, '0.077*"brain" + 0.056*"sound" + 0.043*"people" + 0.037*"person"')
(16, '0.039*"question" + 0.022*"answer" + 0.021*"really" + 0.021*"could"')
(17, '0.032*"galaxy" + 0.019*"shape" + 0.018*"point" + 0.018*"measure"')
(18, '0.049*"universe" + 0.035*"matter" + 0.026*"black" + 0.020*"space"')
(19, '0.033*"happen" + 0.017*"sleep" + 0.016*"notice" + 0.014*"something"')
"""

print(blerp.split())