# Word Embeddings

**Table of Contents**
1. Deriving own Word Embeddings from Corpus 
    - Word2Vec-Module-Gensim
        - Comparison with BoW vocabulary: Missing >4000 words
    - FastText-Module-Gensim
        - Generic Word2Vec (i.e. no subword information)
        - Word2Vec with subwords/n-grams
        - Comparison with BoW vocabulary: Almost complete overlap (3 or 4 words missing)
2. Assessing Pretrained Word Embeddings
    - Word2Vec trained on English Wikipedia
        - Comparison with BoW vocabulary: Missing >8000 words
    - Word2Vec with subword information trained on English Wikipedia
        - Comparison with BoW vocabulary: Missing >8000 words
3. Feature Generation and Export for Learning2Rank


    
**TLDR:**

We decided to derive our own embeddings (Word2Vec, CBOW, with and without subword information) from the given corpus using the FastText module of Gensim.

This was the only approach that covered nearly the entire vocabulary (all but 3 or 4 words) that we obtained from the BoW representation of the collection and previously used for TFIDF, UnigramLM, and BM25.


## 1. Deriving own Word Embeddings from Corpus 

Unlike the probabilistic ranking models and TFIDF, the embeddings are derived from the raw text documents provided. Hence, we first have to apply some preprocessing.

In [6]:
# Preprocessing
# Gensim expects a list of token lists (Unicode strings) as input. Since the collection is small, we can load everything into memory.
import re
doc_list= []
with open('../nfcorpus/raw/doc_dump.txt', 'r') as rf1:
    for line in rf1:
        l = re.sub("MED-.*\t", "",line).lower().strip('\n').split()
        doc_list.append(l) 
len(doc_list) # TODO: Report this in project report

5371
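A small check of the preprocessing regex, as a sketch only: the sample line and its tab-separated layout are assumptions, since this notebook does not show the contents of `doc_dump.txt`. The pattern `"MED-.*\t"` is greedy, so on a line with several tabs it removes everything up to the last tab. An anchored pattern such as `^MED-\d+\t` (a hypothetical alternative, not what the cell above uses) would remove only the document id.

```python
import re

# Hypothetical line with several tab-separated fields (the layout is an assumption).
line = "MED-10\tsome title\tStatin use and breast cancer survival\n"

# Pattern from the preprocessing cell: greedy, matches up to the LAST tab.
greedy = re.sub("MED-.*\t", "", line).lower().strip("\n").split()

# Hypothetical alternative: anchored, removes only the leading document id.
anchored = re.sub(r"^MED-\d+\t", "", line).lower().split()

print(greedy)    # tokens after the last tab only
print(anchored)  # tokens of all fields after the document id
```

If the dump has exactly one tab per line, both patterns give the same tokens.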

### A note on Collocations / Phrases

In order to have a unified approach when learning to rank, we ultimately considered only unigrams and matched the embeddings obtained in this notebook with the (reduced) vocabulary from the BoW representation of the collection.

So we do not consider any bigrams. The corresponding code is commented out below, and this section is included only for the sake of completeness.

When deriving word embeddings (WE), a first preprocessing step is extracting multi-word expressions.
We use gensim's phrase detection [module](https://radimrehurek.com/gensim/models/phrases.html#id2).

We use gensim's default approach to detect collocations, which is outlined [here](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality), and keep the default parameter settings apart from `min_count=2`.
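To make the collocation criterion concrete, here is a minimal sketch of the score from the linked paper, `(count(a b) - delta) / (count(a) * count(b))`, in plain Python. The function name `phrase_scores`, the toy corpus, and `delta=1` are our own illustrative choices. As far as we know, Gensim's default scorer additionally rescales this value by the vocabulary size, so the numbers here are not comparable to Gensim's `threshold`.

```python
from collections import Counter

def phrase_scores(sentences, delta=1):
    """Score every adjacent word pair: (count(ab) - delta) / (count(a) * count(b))."""
    unigrams = Counter(tok for sent in sentences for tok in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    return {
        (a, b): (n_ab - delta) / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

# Toy corpus (hypothetical), already tokenized and lowercased.
toy_corpus = [
    ["breast", "cancer", "risk"],
    ["breast", "cancer", "screening"],
    ["cancer", "risk", "factors"],
    ["high", "risk", "of", "breast", "cancer"],
]

scores = phrase_scores(toy_corpus)
best_pair = max(scores, key=scores.get)
print(best_pair, round(scores[best_pair], 3))
```

Pairs that occur only `delta` times or fewer get a score of zero or below, which is how the discount suppresses phrases built from very rare words.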

In [7]:
from gensim import models
# step 1: train the detector on the tokenized documents
phrases = models.phrases.Phrases(doc_list, min_count=2) # phrases have to occur at least twice
# step 2: create a Phraser object to transform any sentence
bigram = models.phrases.Phraser(phrases)

In [8]:
#little sanity check to see if it has worked: breast cancer should be detected as a collocation
bigram['Exhibits', 'a', 'high', 'risk ' ,'of' , 'breast', 'cancer']

['Exhibits', 'a', 'high', 'risk ', 'of', 'breast_cancer']

The Phraser object will then be used as a chained 'function' when creating the embeddings.

## Word2Vec with CBOW

We use gensim's Word2Vec implementation with the default parameter settings described [here](https://radimrehurek.com/gensim/models/word2vec.html).

We modified only the following parameter:
- `min_count=1`: words have to occur only once to be included in the vocabulary. This is justified because the corpus is small; otherwise, words from the vocabulary built from the BoW representation of the documents could be excluded here.
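The effect of the frequency cut-off can be sketched in a few lines of plain Python. `build_vocab` and the toy corpus are hypothetical and only imitate the idea of `min_count`. Gensim's default for Word2Vec is `min_count=5`.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least min_count times in the corpus."""
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

# Toy corpus (hypothetical), already tokenized.
toy_corpus = [
    ["statin", "use", "and", "breast", "cancer"],
    ["breast", "cancer", "and", "diet"],
]

vocab_all = build_vocab(toy_corpus, min_count=1)
vocab_frequent = build_vocab(toy_corpus, min_count=2)
print(sorted(vocab_all - vocab_frequent))  # words lost by raising the cut-off
```

In a small collection many domain terms occur only once or twice, so any cut-off above 1 removes terms that the BoW vocabulary still contains.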



In [9]:
import gensim
#word2vec = models.Word2Vec(bigram[doc_list],min_count=1, workers=4)
word2vec = models.Word2Vec(doc_list,min_count=1, workers=4)
# also tried skip-gram and min_count=0: both produce the same vocabulary
word2vec.save('our_word2vec')

In [10]:
''' 
# free RAM as recommended in the docs once we stop training
word2vec_vectors = word2vec.wv
del (word2vec.wv)
'''

' \n# free RAM as recommended in the docs once we stop training\nword2vec_vectors = word2vec.wv\ndel (word2vec.wv)\n'

### Observation 1: No stopwords or special characters are removed

Removing them goes beyond the scope of our project work, but the model could be tuned further here.

We also see that certain n-grams that will be captured by fastText aren't part of the vocabulary created with Word2Vec ("Can"cer).

In [12]:
[i in word2vec.wv for i in [ 'of', 'by', 'the', 'and','.',',','%','$','2', '23', '234','X','Can']]

[True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False]

### Observation 2: No (implicit) Lemmatization or Stemming has occurred

In [14]:
[i in word2vec.wv for i in ['describe', 'described', 'describes', 'describing']]

[True, True, True, True]

In [16]:
#but we will live with this as they are subjectively very similar and this goes beyond the scope of our topic
word2vec.wv.similarity('describe', 'described')

0.657413450180564
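The number above is a cosine similarity between the two word vectors. As a reminder of what that means, here is a minimal plain-Python sketch; the helper `cosine_similarity` and the example vectors are ours and are not taken from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same_direction, orthogonal)
```

The measure ignores vector length, so values near 1 only say that two words point in a similar direction in the embedding space.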

In [18]:
# as expected, we obtain a vocabulary of 64585 words, each described by a 100-dimensional dense vector
word2vec.wv.vectors.shape # 89269 words when allowing for bigrams

(64585, 100)

### Observation 3: There is a substantial vocabulary mismatch between precomputed BoWs and Word2Vec-CBOW embeddings
... and hence we are not using Gensim's native Word2Vec module for feature generation
... but derive Word2Vec embeddings via the FastText module, which achieves the desired behaviour.

In [20]:
import pandas as pd
inverted_index = pd.read_pickle('../0_Collection_and_Inverted_Index/pickle/inverted_index.pkl')

In [21]:
# BoW vocabulary 
len(inverted_index.index)

29052

In [22]:
#overlap between two sets
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in word2vec.wv.vocab: # ... in  word2vec.wv  : returns the same
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

4839
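The same comparison can be written with set operations, which avoids the explicit loop. This is a sketch with a hypothetical helper `vocabulary_gap` and toy vocabularies; in the notebook the inputs would be `inverted_index.index` and the embedding vocabulary.

```python
def vocabulary_gap(reference_vocab, embedding_vocab):
    """Split the reference vocabulary into covered and missing words."""
    reference = set(reference_vocab)
    covered = reference & set(embedding_vocab)
    missing = reference - covered
    return covered, missing

# Toy vocabularies (hypothetical).
bow_vocab = ["cancer", "three-phase", "hplc"]
embedding_vocab = ["cancer", "hplc", "the"]

covered, missing = vocabulary_gap(bow_vocab, embedding_vocab)
print(len(covered), sorted(missing))
```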

In [23]:
# it is hard to explain why these words are in the smaller BoW vocabulary but not in the vocabulary obtained from the WEs
no_overlap

{'ensued',
 'methylnitrosamino',
 'reml',
 'three-phase',
 'interpretive',
 'opes',
 'neuroendocrine/circadian',
 'wallace',
 'birds/pen',
 'hhq',
 'provokes',
 'xrd',
 'hydroxycinnamates',
 'affq',
 'sulfonic',
 'p-cresidine',
 'brix',
 '-nitro-l-arginine-methyl-ester',
 'tailed',
 'realities',
 'chocolate--guilty',
 'reflections',
 'mufas',
 'degeneration/low-back',
 'janeiro',
 'lcnac',
 'paulo',
 'ri',
 'small-to-medium-chain',
 'stolen',
 'scdns',
 'wipes',
 'equivalent/l',
 'hdyroxyvitamin',
 'alcohol-mediated',
 'coordinately',
 'fatique',
 'brackish',
 'glucolipotoxicity',
 'hplc-esi-ms/ms',
 'rhaponticum',
 'hydroxytamoxifen',
 'immuncastration',
 'petals',
 'sj/bjc',
 'segmentation',
 'unsaturates',
 'flavanols/d',
 "m's",
 'therans',
 'metabolife',
 'recursive',
 'lactoferrin',
 'cryptosporidium',
 'lethargy',
 'srm',
 '-active',
 'vegandiet',
 'lpha-estran',
 'germacrone',
 'suspect',
 'malx',
 'ohms',
 'essentiality',
 'irr',
 'arachidonic-acid-derived',
 'bbzp',
 'cucumbe

## Word2Vec with Subword Information ...

... splits words into character n-grams of arbitrary length (the length has to be specified as a range). It then proceeds in the same way as vanilla Word2Vec (with either the Skipgram or the CBOW architecture).

**The advantage of Word2Vec with subword information is that it can produce vectors for out-of-vocabulary or misspelled terms, if they can be constructed from the character n-grams in the vocabulary.**

If you don't use any subword information (subword n-grams), it should yield the same results as Word2Vec. (Irritatingly, Word2Vec in FastText yields better results than Gensim's Word2Vec module, which drops more than 4000 vocabulary terms...)
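To illustrate the subword idea, the sketch below extracts character n-grams the way fastText describes them: the word is wrapped in `<` and `>` boundary markers and every substring within the length range is collected. The helper `char_ngrams` is our own plain-Python illustration, not Gensim's implementation, and it ignores the hashing of n-grams into buckets.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """All character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

trigrams = char_ngrams("cancer", min_n=3, max_n=3)
print(trigrams)

# A misspelled word still shares n-grams with the correct one.
shared = set(char_ngrams("caner", 3, 3)) & set(trigrams)
print(sorted(shared))
```

Because the misspelling shares several n-grams with the correct word, a vector composed from n-gram vectors can end up close to the vector of the correct word.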

## FastText with CBOW

In [24]:
gensim.models.fasttext.FAST_VERSION > -1 # make sure that you are using Cython backend

True

In [25]:
import gensim
#fasttext= gensim.models.FastText(bigram[doc_list], min_count= 1, min_n= 3, max_n=12)
fasttext= gensim.models.FastText(doc_list, min_count= 1, min_n= 3, max_n=12)

In [26]:
fasttext.save('our_fasttext')

In [27]:
''' 
# free RAM as recommended in the docs
fasttext_vectors = fasttext.wv
del (fasttext.wv)
type(fasttext_vectors)
'''

' \n# free RAM as recommended in the docs\nfasttext_vectors = fasttext.wv\ndel (fasttext.wv)\ntype(fasttext_vectors)\n'

In [28]:
# you don't want to recompute the FastText vectors since it takes quite long
# this loads the whole model, (not only the vectors)
fasttext= gensim.models.FastText.load('our_fasttext')

In [33]:
# put this in presentation: this is why W2V with Subwords are cool..
fasttext.wv.similarity('breawe caner', 'breast cancer') # this is the primary use case: out of vocab predictions

0.8028783918571138

In [39]:
# overlap between two sets
# fasttext produces the result we expect, word2vec however not
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttext.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

3

In [40]:
len(overlap)

29049

In [79]:
no_overlap

{'nw', 'rq', 'w'}

In [41]:
fasttextword2vec= gensim.models.FastText(doc_list, min_count= 1, word_ngrams=0)

In [42]:
fasttextword2vec.save('our_fasttextword2vec') # save the model without subword information (not `fasttext`)

In [43]:
# overlap between two sets
# fasttext produces the result we expect, word2vec however not
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttextword2vec.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

4

In [44]:
no_overlap

{':{', 'nw', 'rq', 'w'}

In [45]:
fasttextword2vec.vocabulary.max_vocab_size

In [46]:
len(fasttextword2vec.wv.vocab)

64585

In [48]:
len(word2vec.wv.vocab)

64585

In [49]:
# fasttext and fasttextword2vec have same vocabulary
overlap=set()
no_overlap=set()
for word in fasttextword2vec.wv.vocab:
    if word in fasttext.wv.vocab:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

0

In [50]:
# the vocabulary is the same in all three cases
overlap=set()
no_overlap=set()
for word in fasttextword2vec.wv.vocab:
    if word in word2vec.wv.vocab:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

0

**Open Question: Why does the Word2Vec module miss more than 4000 words?**
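One possible explanation, which we have not verified in this notebook: the three models share the same 64585-word vocabulary, so the different overlap counts may come from how membership is tested rather than from the training itself. `word in word2vec.wv.vocab` is an exact lookup, whereas, as far as we know for the Gensim 3.x API used here, `word in fasttext.wv` also succeeds if at least one character n-gram of the word is known. This should be checked against the installed Gensim version. The sketch below imitates the two tests with a hypothetical `ToySubwordVocab` class in plain Python; it is not Gensim's implementation.

```python
def char_ngrams(word, min_n, max_n):
    """All character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

class ToySubwordVocab:
    """Hypothetical stand-in for a vocabulary with subword information."""

    def __init__(self, words, min_n=3, max_n=6):
        self.words = set(words)
        self.min_n, self.max_n = min_n, max_n
        self.ngrams = {ng for w in self.words for ng in char_ngrams(w, min_n, max_n)}

    def has_exact(self, word):
        # Exact lookup, comparable to testing against a plain word vocabulary.
        return word in self.words

    def has_subword_match(self, word):
        # Also accept words for which at least one character n-gram is known.
        known = any(ng in self.ngrams for ng in char_ngrams(word, self.min_n, self.max_n))
        return word in self.words or known

vocab = ToySubwordVocab(["three", "phase", "cancer"])
for query in ["three-phase", "w"]:
    print(query, vocab.has_exact(query), vocab.has_subword_match(query))
```

If this is the cause, replacing `word in fasttext.wv` with `word in fasttext.wv.vocab` in the overlap cells should reproduce the Word2Vec count, and the remaining gap would point to a tokenization difference between the BoW pipeline and the whitespace split used here.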

In [52]:
fasttext.wv.num_ngram_vectors

805905

In [51]:
fasttextword2vec.wv.num_ngram_vectors # > 64585, which makes sense: there are more n-grams than words in the vocabulary

341164

## 2. Assessing Pretrained Embeddings

As we will see, pretrained embeddings derived from larger corpora do not suit our task, since they would miss more than 8000 words of our BoW vocabulary.

We use these pre-trained embeddings: https://fasttext.cc/docs/en/english-vectors.html
They were obtained through the approach outlined here: https://arxiv.org/pdf/1712.09405.pdf

In [36]:
# since we are not using any subword information (fastText n-grams for out-of-vocabulary words), we can import the embeddings as easily as follows
from gensim.models.keyedvectors import KeyedVectors
word2vec_wiki = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
fasttext_wiki = KeyedVectors.load_word2vec_format("wiki-news-300d-1M-subword.vec")
fasttext_commoncrawl = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec")

In [41]:
# we have a 300-dimensional dense vector for all models
300==len(word2vec_wiki.get_vector('cancer'))==len(fasttext_wiki.get_vector('cancer'))==len(fasttext_commoncrawl.get_vector('cancer'))

True
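The `.vec` files are plain text: the first line holds the number of words and the vector dimension, and each following line holds a word followed by its vector components, separated by spaces. If only the vocabulary overlap is of interest, the words can be read without building the vectors. The sketch below uses a hypothetical helper `read_vec_vocabulary` and a tiny in-memory sample instead of the real files.

```python
import io

def read_vec_vocabulary(handle):
    """Return (dimension, list of words) from a word2vec-style text file."""
    n_words, dim = (int(x) for x in handle.readline().split())
    words = []
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values, got {len(parts) - 1}")
        words.append(parts[0])
    if len(words) != n_words:
        raise ValueError(f"header announces {n_words} words, found {len(words)}")
    return dim, words

# Tiny in-memory sample (hypothetical) in the same layout as the .vec files.
sample = io.StringIO("3 2\ncancer 0.1 0.2\nbreast 0.3 0.4\nhplc 0.5 0.6\n")
dim, words = read_vec_vocabulary(sample)

bow_vocab = {"cancer", "hplc", "servings/day"}
missing = bow_vocab - set(words)
print(dim, sorted(missing))
```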

In [59]:
import pandas as pd
inverted_index=pd.read_pickle('inverted_index.pkl')
# overlap with vocabulary from Word2Vec embeddings generated from English Wikipedia
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in word2vec_wiki.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

8219

In [61]:
# overlap with vocabulary from fastText embeddings generated from English Wikipedia
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttext_wiki.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)


8219

In [58]:
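# overlap with vocabulary from fastText embeddings generated from Common Crawl
# NOTE: the check below still uses fasttext_wiki, so the count shown is for the Wikipedia model;
# use fasttext_commoncrawl here to measure the Common Crawl overlap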
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttext_wiki.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)


8219

In [62]:
len(overlap)

20833

In [63]:
# we would throw away good domain-specific candidates if we only looked at the intersection of both vocabularies
no_overlap

{'p/s',
 "'hort",
 'servings/day',
 'genetischen',
 'nocp',
 'inflammatory/proliferative',
 'isoflavanoid',
 'conceiveably',
 'disease-promoting',
 'praldiet',
 'ear-tagged',
 'combined-treatment',
 'attitudes/beliefs',
 'walked/d',
 'hplc',
 'octadecylsilyl',
 'hdaci',
 'occlusion/stenosis',
 'nepp',
 'stress-linked',
 'wwtps',
 'cpges',
 'hrct',
 'anti-gal',
 'dust-exposed',
 'azinobis',
 'ebsco',
 'plasma/mass',
 'awamori-fermented',
 'cdt-time',
 'nnk-specific',
 'tea-derived',
 'sub-fulminant',
 'multi-intervention',
 'antiulcer',
 "daughter's",
 'overnight-fasted',
 'l-aspartyl-l-phenylalanyl-methyl',
 'cdks',
 'cnki',
 'neurodegenerations',
 'chrono-log',
 'chemoprotection',
 'aphv',
 'kidney-transplant',
 'padecerla',
 'ific',
 'seafood-borne',
 'diabetes--what',
 'n-nitrosamines',
 'qald',
 'alpha-ethinylestradiol',
 'withdrawn/depressed',
 'sick-listing',
 'isphagula',
 'single-blinded',
 'meriones',
 'pathogen-recognition',
 'pubmed/medline',
 'peonidin',
 'hasc',
 'hmg-coa'

In [115]:
len(fasttext.wv.vectors)==len(word2vec.wv.vectors)

True

## 3. Feature Generation

Please run Embeddings.ipynb within the folder of this Jupyter notebook.