# Word Embeddings

** Table of Contents**
1. Deriving own Word Embeddings from Corpus 
    - Word2Vec-CBOW
        - Comparison with BoW vocabulary: Missing >4000 words
    - FastText-CBOW
        - Generic Fasttext (i.e. with subwords/n-grams)
        - Word2Vec in FastTest  (i.e. with subwords/n-grams)
        - Comparison with BoW vocabulary: Almost complete overlap (3/4 words missing)
2. Assessing Pretrained Word Embeddings
    - Word2Vec trained on English Wikipedia
        - Comparison with BoW vocabulary: Missing >8000 words
    - FastText trained on English Wikipedia
        - Comparison with BoW vocabulary: Missing >8000 words
3. Feature Generation and Export for Learning2Rank


    
**TLDR:**

We decided to  derive our own embeddings (Word2Vec, CBOW, with/without subword information) from the given corpus using the FastText moudule of Gensim. 

This was the only approach that covered the entire  vocabulary we obtained from the BoW representation of the collection, and that we used previously for TFIDF,UnigramLM, and BM25.


## 1. Deriving own Word Embeddings from Corpus 

Other than for the probabilistic rating models and TFIDF, we look into the raw text documents provided. Hence, we have first to apply some preprocessing.

In [100]:
# Preprocessing
# Gensim requires list of lists of Unicode 8 strings as an input. Since we have a small collection, we are fine with loading everything into memory.
import re
doc_list= []
with open('./nfcorpus/raw/doc_dump.txt', 'r') as rf1:
    for line in rf1:
        l = re.sub("MED-.*\t", "",line).lower().strip('\n').split()
        doc_list.append(l) 
len(doc_list) # TODO: Report this in project report

5371

In [101]:
# as we can see and would expect the vocabulary of the big corpus, containing raw text is more than twice as large as the Bow vocabulary - 29052
vocabulary_big=set()
for doc in doc_list:
    for word in doc:
        vocabulary_big.add(word)
len(vocabulary_big)

64585

### A note on Collocations / Phrases

In order to have a unified approach when learning to rank, we only consider unigrams and match the embeddings obtained in this notebook with the (reduced) vocabulary from the BoW representation of the collection.

So, we do not consider any bigrams. The corresponding code is commented out below. This section is just for the sake of completeness.

When deriving WE, a first preprocessing step is extracting multi-word expressions.
We use gensim's phrase detection [module].(https://radimrehurek.com/gensim/models/phrases.html#id2).

And we use gensim's default approach and parameter settings to detect collocations which is outlined [here](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality). 

In [66]:
from gensim import models
# step 1: train the detector with
phrases = models.phrases.Phrases(doc_list, min_count=2) # phrases have to occur at least two times
# step 2: create a Phraser object to transform any sentence 
bigram = models.phrases.Phraser(phrases)

In [9]:
#little sanity check to see if it has worked: breast cancer should be detected as a collocation
bigram['Exhibits', 'a', 'high', 'risk ' ,'of' , 'breast', 'cancer']

['Exhibits', 'a', 'high', 'risk ', 'of', 'breast_cancer']

The Phraser object will then be used as a chained 'function' when creating the embeddings.

## Word2Vec with CBOW

We are using gensim's Word2Vec implementation and default parameter settings as described [here](https://radimrehurek.com/gensim/models/word2vec.html). 

We only modified the following parameters:
- Words have to occur only once to be included in the vocabulary. This is justified since we only have a small corpus, and otherwise we would risk that we have words in the vocabulary created from the BoW representation of the docs, that are excluded here.



In [95]:
import gensim
#word2vec = models.Word2Vec(bigram[doc_list],min_count=1, workers=4)
word2vec = models.Word2Vec(doc_list,min_count=1, workers=4)
# also tried with skipgram, produces same vocablary, also set min_count to zero, produces same vocabulary
word2vec.save('our_word2vec')

In [102]:
''' 
# free RAM as recommended in the docs and if we top training
word2vec_vectors = word2vec.wv
del (word2vec.wv)
'''

' \n# free RAM as recommended in the docs and if we top training\nword2vec_vectors = word2vec.wv\ndel (word2vec.wv)\n'

### Observation 1: No stopword and chars are being removed

This goes beyond the scope of our project work, but here we could further tune the model.

We also see that, certain n-grams that will be captured by fastText aren't part of the vocabulary created with Word2Vec ("Can"cer).

In [103]:
[i in word2vec_vectors for i in [ 'of', 'by', 'the', 'and','.',',','%','$','2', '23', '234','X','Can']]

[True,
 True,
 True,
 True,
 True,
 True,
 True,
 False,
 True,
 True,
 True,
 False,
 False]

### Observation 2: No (implicit) Lemmatization or Stemming has occured

In [104]:
[i in word2vec_vectors for i in ['describe', 'described', 'describes', 'describing']]

[True, True, True, True]

In [105]:
#but we will live with this as they are subjectively very similar and this goes beyond the scope of our topic
word2vec_vectors.similarity('describe', 'described')

0.7210535364528369

In [106]:
# as expected we created a 64585 dimensional vocabulary, each word being described by a 100 dimensional dense vector
word2vec_vectors.vectors.shape # 89269 if allwing for bigrams

(64585, 100)

### Observation 3: There is a substantial vocabulary mismatch between  precomputed BoWs and Word2Vec-CBOW embeddings
... and hence we are not using Gensims native Word2Vec modules for feature generation
... but derive Word2Vec embeddings via the FastText module which achieves the desired behaviour.

In [107]:
import pandas as pd
inverted_index=pd.read_pickle('inverted_index.pkl')

In [108]:
# BoW vocabulary 
len(inverted_index.index)

29052

In [143]:
#overlap between two sets
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in word2vec.wv.vocab: # ... in  word2vec.wv  : returns the same
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

4839

In [110]:
#this is only hard to explain why these words are in the smaller BoW representation and not the vocabulary obtained from WEs
no_overlap

{'loins',
 "'hort",
 'tracker',
 'dihydroequilin',
 'imidacloprid',
 'day/person',
 'elixir',
 'familiarization',
 'non-vegans',
 'b-carotene',
 'muscling',
 'elegans',
 'well-tested',
 'substudies',
 'linzess',
 'emollients',
 'attitudes/beliefs',
 'dihydrochalcones',
 'gumes',
 'methyl]dihydro',
 'newborn--case',
 'nutricia',
 'bee',
 'dai',
 'klrg',
 'err',
 'hbd',
 'lifestyle/dietary',
 'disliked',
 'midwives',
 'apmis',
 'hemianopsia',
 'unmotivated',
 'ema',
 'diethylstilboestrol--a',
 'formation--a',
 'hydroxynonenal',
 'ttv',
 'chamomilla',
 'chf',
 'hypertonic-hypovolemia',
 'min/d',
 'cbpi',
 'furfur',
 'macrobiotics',
 'landscapes',
 'nordmann/dortet/poirel',
 'obscene',
 'azinobis',
 'fra',
 'melengestrol',
 'osteosarcoma',
 'collagenase',
 'diphenyldichloroethene',
 'essence',
 'teamwork',
 'tetrahydrocannabinol',
 'wuerzburg',
 'aacr',
 'biometrics',
 'steakhouse',
 'overwrought',
 'ine',
 'ividis',
 'decins',
 'hydroxybutyric',
 'methylenetetrahydrofolate',
 'naion',
 'p

## FastText

FastText splits words into character n-grams of arbitray lenght (has to be specified as a range). It proceeds then same way as Word2Vec (either Skipgram or CBOW architecture).

**The advantage of fastText is that it makes predictions for out-of-vocabulary or misspelled terms, if they can be constructed from the character n-grams in the vocabulary.**

If you don't use any subword information/subword n-grams than it yields the same results Word2Vec. (Or irritatingly even better results as we found out since it doesn't sort out more than 4000 vocabulary terms...)

## FastText with CBOW

In [111]:
gensim.models.fasttext.FAST_VERSION > -1 # make sure that you are using Cython backend

True

In [94]:
import gensim
#fasttext= gensim.models.FastText(bigram[doc_list], min_count= 1, min_n= 3, max_n=12)
fasttext= gensim.models.FastText(doc_list, min_count= 1, min_n= 3, max_n=12)

In [98]:
fasttext.save('our_fasttext')

In [100]:
''' 
# free RAM as recommended in the docs
fasttext_vectors = fasttext.wv
del (fasttext.wv)
type(fasttext_vectors)
'''

' \n# free RAM as recommended in the docs\nfasttext_vectors = fasttext.wv\ndel (fasttext.wv)\ntype(fasttext_vectors)\n'

In [112]:
# you don't want to recompute the FastText vectors since it takes quite long
# this loads the whole model, (not only the vectors)
fasttext= gensim.models.FastText.load('our_fasttext')

In [114]:
fasttext.wv.similarity('brear_caner', 'breast_cancer') # this is the primary use case: out of vocab predictions

0.8268258623852386

In [115]:
# overlap between two sets
# fasttext produces the result we expect, word2vec however not
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttext.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

3

In [78]:
len(overlap)

29049

In [79]:
no_overlap

{'nw', 'rq', 'w'}

In [83]:
fasttextword2vec= gensim.models.FastText(doc_list, min_count= 1, word_ngrams=0)

In [117]:
fasttext.save('our_fasttextword2vec')

In [116]:
# overlap between two sets
# fasttext produces the result we expect, word2vec however not
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttextword2vec.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

4

In [118]:
no_overlap

{':{', 'nw', 'rq', 'w'}

In [128]:
fasttextword2vec.vocabulary.max_vocab_size

In [142]:
len(fasttextword2vec.wv.vocab)

64585

In [144]:
len(fasttextword2vec.wv.vocab)

64585

In [145]:
len(word2vec.wv.vocab)

64585

In [154]:
# fasttext and fasttextword2vec have same vocabulary
overlap=set()
no_overlap=set()
for word in fasttextword2vec.wv.vocab:
    if word in fasttext.wv.vocab:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

0

In [153]:
#obviously the vocabulary is the same in all three cases...
no_overlap=set()
for word in fasttextword2vec.wv.vocab:
    if word in word2vec.wv.vocab:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

0

** Open Question: Why does Word2Vec module miss 4000 words? **

In [163]:
fasttext.wv.num_ngram_vectors

805905

In [165]:
fasttextword2vec.wv.num_ngram_vectors # > 64585: why is this larger than the vocabulary?

341164

## 2. Assessing Pretrained Embeddings

As we will see, pretrained embeddings derived from larger corpora do not suit our task since we would miss more than 8000 words in our BoW vocabulary

We are using these pre-trained embeddings: https://fasttext.cc/docs/en/english-vectors.html
Which are obtained through the approach outlined here: https://arxiv.org/pdf/1712.09405.pdf

In [36]:
# since we are not using any subword information (fasttext ngrams for out of vocabulary words), we can import the embeddings as easy as follows
from gensim.models.keyedvectors import KeyedVectors
word2vec_wiki = KeyedVectors.load_word2vec_format("wiki-news-300d-1M.vec")
fasttext_wiki = KeyedVectors.load_word2vec_format("wiki-news-300d-1M-subword.vec")
fasttext_commoncrawl = KeyedVectors.load_word2vec_format("crawl-300d-2M.vec")

In [41]:
# whave a 300 dimensional dense vector for all models
300==len(word2vec_wiki.get_vector('cancer'))==len(fasttext_wiki.get_vector('cancer'))==len(fasttext_commoncrawl.get_vector('cancer'))

True

In [59]:
import pandas as pd
inverted_index=pd.read_pickle('inverted_index.pkl')
#overlap with vocabulary from Word2Vec emdeddings generated from English Wikipeda
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in word2vec_wiki.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

  import sys


8219

In [61]:
#overlap with vocabulary from fasttext emdeddings generated from English Wikipeda
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttext_wiki.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

  """


8219

In [58]:
#overlap with vocabulary from fasttext emdeddings generated from commoncrawl
overlap=set()
no_overlap=set()
for word in list(inverted_index.index):
    if word in fasttext_wiki.wv:
        overlap.add(word)
    else: 
        no_overlap.add(word)
len(no_overlap)

  """


8219

In [62]:
len(overlap)

20833

In [63]:
# we would throw away good domain-specific candidates when only looking at the union of both vocabularies
no_overlap

{'p/s',
 "'hort",
 'servings/day',
 'genetischen',
 'nocp',
 'inflammatory/proliferative',
 'isoflavanoid',
 'conceiveably',
 'disease-promoting',
 'praldiet',
 'ear-tagged',
 'combined-treatment',
 'attitudes/beliefs',
 'walked/d',
 'hplc',
 'octadecylsilyl',
 'hdaci',
 'occlusion/stenosis',
 'nepp',
 'stress-linked',
 'wwtps',
 'cpges',
 'hrct',
 'anti-gal',
 'dust-exposed',
 'azinobis',
 'ebsco',
 'plasma/mass',
 'awamori-fermented',
 'cdt-time',
 'nnk-specific',
 'tea-derived',
 'sub-fulminant',
 'multi-intervention',
 'antiulcer',
 "daughter's",
 'overnight-fasted',
 'l-aspartyl-l-phenylalanyl-methyl',
 'cdks',
 'cnki',
 'neurodegenerations',
 'chrono-log',
 'chemoprotection',
 'aphv',
 'kidney-transplant',
 'padecerla',
 'ific',
 'seafood-borne',
 'diabetes--what',
 'n-nitrosamines',
 'qald',
 'alpha-ethinylestradiol',
 'withdrawn/depressed',
 'sick-listing',
 'isphagula',
 'single-blinded',
 'meriones',
 'pathogen-recognition',
 'pubmed/medline',
 'peonidin',
 'hasc',
 'hmg-coa'

In [115]:
len(fasttext.wv.vectors)==len(word2vec.wv.vectors)

True

# 3. Feature Generation

In [2]:
#FastText Embeddings, 100-d dense vector
import pandas as pd
import numpy as np
import gensim
fasttext= gensim.models.FastText.load('our_fasttext')
word2vec= gensim.models.FastText.load('our_fasttextword2vec')
inverted_index=pd.read_pickle('inverted_index.pkl')
##
fasttext_embeddings_list=[]
words_not_covered_in_fasttext=[]
for word in inverted_index.index:
    try:
        fasttext_embeddings_list.append(fasttext.wv.get_vector(word))
    except:
        words_not_covered_in_fasttext.append(word)
        fasttext_embeddings_list.append(np.zeros(100)) # for those 3 OOV we insert an array consisting of zeros
fasttext_embeddings=pd.Series(fasttext_embeddings_list,index=inverted_index.index)
fasttext_embeddings.head()

'hort    [-0.239141, -1.4904516, 1.142754, -0.29527017,...
+        [-0.1422618, -0.26532656, 0.2698323, 0.4454199...
-        [-0.10083045, -0.20921779, 0.16661192, 0.19205...
--a      [-0.106107, -0.42222458, 0.19491445, 0.0256388...
--all    [0.23656483, -1.3561225, 0.25603667, 0.3251931...
dtype: object

In [3]:
word2vec.wv.get_vector('word')

array([ 0.2572194 , -0.6651085 ,  0.79225636, -0.01694296,  0.33287758,
       -0.54349154,  0.00667507,  0.1850751 ,  0.54274076,  0.03368193,
       -0.8320921 , -0.14556715,  0.49946925, -0.7071163 , -0.6381891 ,
        0.8695892 , -0.03466732,  0.33267492,  0.14396028, -0.15146871,
        0.03048434, -0.81679654, -0.8710392 ,  0.8360582 ,  0.5293026 ,
        0.88567525, -0.47475386,  0.11644375, -0.8636472 ,  0.01304353,
       -0.9219213 ,  0.18258792,  1.513047  ,  0.08487371, -0.07466734,
        0.95229095, -0.13856596, -0.6068335 ,  0.07104414, -0.08772398,
       -0.3221791 ,  0.54721814, -0.51129454, -0.3732295 ,  0.27261707,
       -0.22245723, -0.21325155,  0.11936326, -0.2954692 ,  0.13292292,
       -0.5082604 , -0.16721785,  0.53673196,  0.06266623, -0.3204079 ,
        0.174656  ,  0.597623  ,  0.32837942,  0.1105896 ,  0.4582284 ,
       -0.22074609,  0.6194034 ,  1.2198602 , -0.12896219,  0.33726743,
       -0.04403872,  0.57322377, -1.0528064 , -0.15330076, -1.14

In [4]:
#Word2Vec Embeddings, 100-d dense vector
word2vec_embeddings_list=[]
words_not_covered_in_word2vec=[]
for word in inverted_index.index:
    try:
        word2vec_embeddings_list.append(word2vec.wv.get_vector(word))
    except:
        words_not_covered_in_word2vec.append(word)
        word2vec_embeddings_list.append(np.zeros(100)) # for those 3 OOV we insert an array consisting of zeros
word2vec_embeddings=pd.Series(word2vec_embeddings_list,index=inverted_index.index)
word2vec_embeddings.head()

'hort    [-0.239141, -1.4904516, 1.142754, -0.29527017,...
+        [-0.1422618, -0.26532656, 0.2698323, 0.4454199...
-        [-0.10083045, -0.20921779, 0.16661192, 0.19205...
--a      [-0.106107, -0.42222458, 0.19491445, 0.0256388...
--all    [0.23656483, -1.3561225, 0.25603667, 0.3251931...
dtype: object

In [5]:
tfidf=pd.read_pickle('tfidf.pkl')
tfidf.shape

(29052, 3633)

In [6]:
weighted_word2vec_embeddings=tfidf.multiply(word2vec_embeddings,axis=0)

In [6]:
weighted_word2vec_embeddings.loc['cancer'].head()

MED-10     [-4.4339123, -0.12741917, 4.5506024, -9.852687...
MED-14     [-4.4318056, -0.12735865, 4.548441, -9.848006,...
MED-118    [-0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, ...
MED-301    [-0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, ...
MED-306    [-0.0, -0.0, 0.0, -0.0, 0.0, -0.0, 0.0, -0.0, ...
Name: cancer, dtype: object

In [10]:
weighted_word2vec_embeddings.shape

(29052, 3633)

In [None]:
store = pd.HDFStore('store.h5')
store['word2vec'] = weighted_word2vec_embeddings # save it
#store['df']  # load it