# Word Embeddings

Unlike for the probabilistic rating models and TF-IDF, we work directly on the raw text documents provided here. Hence, we first have to apply some preprocessing.

In [1]:
# Preprocessing
# Gensim expects a list of lists of Unicode (UTF-8) string tokens as input.
# Since we have a small collection, we are fine with loading everything into memory.
import re
doc_list= []
with open('/Users/d071503/Desktop/IR/irteamproject/nfcorpus/raw/doc_dump.txt', 'r') as rf1:
    for line in rf1:
        l = re.sub("MED-.*\t", "",line).lower().strip('\n').split()
        doc_list.append(l) 
len(doc_list) # TODO: Report this in project report. Maybe also other summary stats.

5371
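
To make the effect of this preprocessing step concrete, here is a minimal sketch that applies the same expression to a single made-up line. The ID `MED-10`, the sample text, and the helper name `tokenize_line` are hypothetical; only the `MED-<id><tab><text>` layout is taken from the regular expression above.

```python
import re

def tokenize_line(line):
    """Strip the leading 'MED-<id><tab>' prefix, lowercase, and split on whitespace."""
    return re.sub("MED-.*\t", "", line).lower().strip("\n").split()

# Hypothetical example line (not taken from the corpus).
sample = "MED-10\tStatin Use and Breast Cancer Survival\n"
tokens = tokenize_line(sample)
print(tokens)
```

Note that tokens are split on whitespace only, so no further normalization (punctuation handling, stemming) happens at this stage.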

## Collocations / Phrases

For both models, we first extract multi-word expressions, using the same procedure.
We use gensim's phrase detection [module](https://radimrehurek.com/gensim/models/phrases.html#id2).

We use gensim's default approach and parameter settings (except for `min_count`, see below) to detect collocations. The approach is outlined [here](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality).

The formula applied is:

`(count(word_a followed by word_b) - min_count) * N / (count(word_a) * count(word_b)) > threshold`, where `N` is the total vocabulary size.

Gensim sets the default threshold to 10.
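
The scoring rule above can be sketched in a few lines of plain Python. This is only an illustration of the formula, not gensim's actual code; the function name `phrase_score` and all counts are made up.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score of the bigram (word_a, word_b) according to the formula above."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)

# Made-up counts for illustration only.
score = phrase_score(count_a=120, count_b=200, count_ab=80,
                     vocab_size=20000, min_count=2)
is_phrase = score > 10  # 10 is the default threshold mentioned above
print(score, is_phrase)
```

A pair that co-occurs exactly `min_count` times gets a score of 0, so it can never pass a positive threshold.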



In [3]:
from gensim import models
# step 1: train the detector on the tokenized documents
phrases = models.phrases.Phrases(doc_list, min_count=2) # phrases have to occur at least two times
# step 2: create a Phraser object to transform any sentence 
bigram = models.phrases.Phraser(phrases)

In [4]:
# Quick sanity check: 'breast cancer' should be detected as a collocation
bigram['Exhibits', 'a', 'high', 'risk ' ,'of' , 'breast', 'cancer']

['Exhibits', 'a', 'high', 'risk ', 'of', 'breast_cancer']

The Phraser object will then be used as a chained 'function' when creating the embeddings.

## Word2Vec with CBOW

We are using gensim's Word2Vec implementation and default parameter settings as described [here](https://radimrehurek.com/gensim/models/word2vec.html). 

We only modified the following parameter:
- `min_count=2`: words have to occur at least twice to be included in the vocabulary

In addition, we detect phrases as described above and train with four worker threads (`workers=4`).

In [5]:
import gensim
word2vec = models.Word2Vec(bigram[doc_list], min_count=2, workers=4)
word2vec.save('our_word2vec')

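### Observation 1: No stopwords or special characters seem to be removed

Handling this goes beyond the scope of our project work. We may come back to it and check the gensim documentation for further implementation details.

We also see that subword fragments which fastText would capture as character n-grams (e.g. "Can" in "Cancer") are not part of the vocabulary. Note that the corpus was lowercased during preprocessing, so capitalized tokens such as 'X' and 'Can' cannot appear in the vocabulary in any case.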

In [6]:
[i in word2vec.wv for i in [ 'of', 'by', 'the','.',',','%','$','2', '23', '234','X','Can']]

[True, True, True, True, True, True, False, True, True, True, False, False]

### Observation 2: No (implicit) lemmatization or stemming has occurred

For now, we will live with this, but we should consider lemmatization/stemming either as a further pre-processing or post-processing step (idea for post-processing: averaging over vectors).

In [7]:
[i in word2vec.wv for i in ['describe', 'described', 'describes', 'describing']]

[True, True, True, True]
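
The post-processing idea mentioned above (averaging over the vectors of inflected forms) could look roughly like this. It is a sketch under assumptions: the toy vectors are made up, and the lemma lookup `toy_lemmas` is a hypothetical stand-in for a real lemmatizer or stemmer.

```python
from collections import defaultdict

def average_by_group(vectors, group_of):
    """Average the vectors of all words that map to the same group key."""
    grouped = defaultdict(list)
    for word, vec in vectors.items():
        grouped[group_of(word)].append(vec)
    return {
        key: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for key, vecs in grouped.items()
    }

# Toy 2-dimensional vectors (made up); real ones would come from word2vec.wv.
toy_vectors = {
    "describe": [1.0, 0.0],
    "described": [0.0, 1.0],
    "describes": [1.0, 1.0],
    "describing": [2.0, 2.0],
}
# Hypothetical lemma lookup; a real lemmatizer or stemmer would replace it.
toy_lemmas = {word: "describe" for word in toy_vectors}

merged = average_by_group(toy_vectors, toy_lemmas.get)
print(merged)
```

Whether a plain average is the right way to merge inflected forms is an open design choice; a frequency-weighted average would be an alternative.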

## fastText

FastText splits words into character n-grams whose lengths lie within a range that has to be specified. It then proceeds in the same way as Word2Vec (with either the skip-gram or the CBOW architecture).
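
As an illustration of the n-gram splitting, here is a minimal sketch in plain Python. The helper name `char_ngrams` is hypothetical, and the `<` and `>` word-boundary markers follow the fastText paper; that gensim's internal extraction matches this sketch in every detail is an assumption.

```python
def char_ngrams(word, min_n=3, max_n=10):
    """Return all character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    ]

# A narrow range keeps the output readable.
ngrams = char_ngrams("cancer", min_n=3, max_n=4)
print(ngrams)
```

With this scheme, the fragment `can` is one of the n-grams of "cancer", even though it is not a word of its own in the example.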

TODO: find a paper that explains in more depth how skip-gram works


The advantage of fastText is that it can produce vectors for out-of-vocabulary or misspelled terms, provided they can be constructed from character n-grams seen during training.
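
A rough sketch of how a vector for an unseen word can be composed from its n-grams is shown below. It is a simplification under assumptions: the n-gram table and its values are made up, the helper names are hypothetical, and real implementations hash n-grams into a fixed number of buckets instead of storing them in a dictionary.

```python
def char_ngrams(word, min_n=3, max_n=10):
    """Return all character n-grams of '<word>' with min_n <= n <= max_n."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(wrapped) - n + 1)]

def oov_vector(word, ngram_vectors, min_n=3, max_n=10):
    """Average the vectors of all n-grams of `word` found in `ngram_vectors`.

    Returns None if none of the word's n-grams is known.
    """
    known = [ngram_vectors[g] for g in char_ngrams(word, min_n, max_n)
             if g in ngram_vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

# Toy n-gram table with made-up 2-dimensional vectors.
toy_ngram_vectors = {"<ca": [1.0, 0.0], "anc": [0.0, 1.0], "er>": [2.0, 2.0]}

# The misspelling "cancr" shares the n-grams "<ca" and "anc" with "cancer".
vec = oov_vector("cancr", toy_ngram_vectors, min_n=3, max_n=3)
print(vec)
```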

One major disadvantage of the existing fastText implementation is that multi-word phrases such as “New York” are not captured on their own; further pre-processing is necessary (for Word2Vec, we did this with the Phraser above).


We are using the fastText implementation of gensim.
Apart from `min_count`, `min_n` and `max_n`, the parameters are left at their defaults, which also implies that we are using the CBOW architecture.



## fastText with CBOW

In [8]:
gensim.models.fasttext.FAST_VERSION > -1 # make sure that you are using Cython backend

True

In [9]:
# same procedure as above, but on the raw doc_list (the Phraser is not applied here);
# this will take substantially longer
import gensim
fasttext = gensim.models.FastText(doc_list, min_count=2, min_n=3, max_n=10)

In [13]:
fasttext.save('our_fasttext')

In [None]:
# TODO: examples, find out similarities, e.g. p-value/significant...

In [17]:
%%bash
pwd

/Users/d071503/Desktop/IR/irteamproject/embeddings
