# Word and Document Embeddings

In this part of the lab, we will play with word and document embeddings and see how they can be applied to (un)supervised document classification. We will make use of the scikit-learn and gensim Python libraries. We will also need the "Google News" pre-trained word vectors. If you have not done so already, you can download them from here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing. This binary file contains embeddings for more than 3 million unique words and phrases, learned with word2vec from a corpus of more than 100 billion words (see [Mikolov et al. NIPS'13](http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) for more details).

Throughout the entire session, we will be working with the BBC Sport dataset (available at http://mlg.ucd.ie/datasets/bbc.html). The dataset consists of sports news articles from the BBC Sport website. There are 737 articles in total categorized into the following 5 classes: (1) athletics, (2) cricket, (3) football, (4) rugby, and (5) tennis. Let's first read the data.

In [None]:
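def load_data(filename):
    labels = []
    docs = []

    # each line holds a class label and an article, separated by a tab
    with open(filename, encoding='utf8', errors='ignore') as f:
        for line in f:
            content = line.split('\t')
            labels.append(content[0])
            # drop the trailing newline (safe even if the last line has none)
            docs.append(content[1].rstrip('\n'))

    return docs, labels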

docs, class_labels = load_data('data/bbcsport.txt')
print("Example of an article:", docs[0])

The documents in the dataset have already undergone some preprocessing. We will apply a few further steps: we will remove some punctuation marks, diacritics, and non-letter characters, if any, and represent each document as a list of tokens. We will also randomly shuffle the articles. Finally, since the class labels are strings, we will encode them as integers between 0 and n_classes-1.

In [None]:
import re
import numpy as np
from sklearn.preprocessing import LabelEncoder

def clean_str(string):
    string = re.sub(r"[^A-Za-z0-9(),!?\'\`]", " ", string)     
    string = re.sub(r"\'s", " \'s", string) 
    string = re.sub(r"\'ve", " \'ve", string) 
    string = re.sub(r"n\'t", " n\'t", string) 
    string = re.sub(r"\'re", " \'re", string) 
    string = re.sub(r"\'d", " \'d", string) 
    string = re.sub(r"\'ll", " \'ll", string) 
    string = re.sub(r",", " , ", string) 
    string = re.sub(r"!", " ! ", string) 
    # the replacement is a plain string, so brackets and '?' must not be escaped there
    string = re.sub(r"\(", " ( ", string)
    string = re.sub(r"\)", " ) ", string)
    string = re.sub(r"\?", " ? ", string)
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip().split()

    
def preprocessing(docs):
    preprocessed_docs = []

    for doc in docs:
        preprocessed_docs.append(clean_str(doc))

    return preprocessed_docs

processed_docs = preprocessing(docs)

le = LabelEncoder()
labels = le.fit_transform(class_labels)

n = len(processed_docs)
idx = np.random.permutation(n)

# apply the same random permutation to the documents and to their labels
processed_docs = [processed_docs[i] for i in idx]
y = labels[idx].astype(np.int64)

print("Preprocessed document:", processed_docs[10])

### Experimenting with word embeddings 
First, we will get familiar with some properties of word embeddings by performing some operations manually.

We will first use gensim's build_vocab() method, which extracts the vocabulary from a list of documents (where each document is a list of tokens). Create a new list that contains all existing documents, plus one extra document consisting of the following 8 tokens: 'queen', 'king', 'woman', 'man', 'aunt', 'uncle', 'son', 'daughter'. Then, load the Google News word vectors for the words in our vocabulary only. This trick avoids loading all the vectors into memory (which would require 5-6 GB of RAM).

In [None]:
from gensim.models.word2vec import Word2Vec

# returns cosine similarity between two vectors
def cosine(vec1, vec2):
    return np.dot(vec1, vec2)/(np.linalg.norm(vec1)*np.linalg.norm(vec2))

# returns the vector of a word
def my_vector_getter(word, model):
    try:
        word_array = model.wv[word].astype(np.float64)
        return word_array
    except KeyError:
        print('word: <', word, '> not in vocabulary!')

# returns cosine similarity between two word vectors
def my_cos_similarity(word1, word2, model):
    return cosine(my_vector_getter(word1, model), my_vector_getter(word2, model))

# note: this notebook uses the gensim 3.x API (in gensim 4, `size` was renamed to `vector_size`)
model = Word2Vec(size=300, min_count=0)


#your code here


# load vectors corresponding to our vocabulary
path_to_embeddings = '...'
model.intersect_word2vec_format(path_to_embeddings, binary=True)

Compute the cosine similarity in the embedding space between semantically close words (e.g., "man" and "woman") and between unrelated words. You can make use of the functions defined above. What do you observe? Similarly, perform some vector operations (e.g., "king"-"man"+"woman") and interpret the results.
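To make the two operations concrete before running them on the real vectors, here is a minimal sketch on made-up data. The 3-dimensional vectors in `toy` are hypothetical and were chosen by hand so that the regularity holds exactly; they are not taken from the Google News model, and real embeddings only satisfy such relations approximately.

```python
import numpy as np

def cosine(vec1, vec2):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Hypothetical toy vectors; the dimensions loosely mean (royalty, maleness, sport).
toy = {
    'king':   np.array([0.9, 0.9, 0.0]),
    'queen':  np.array([0.9, 0.1, 0.0]),
    'man':    np.array([0.1, 0.9, 0.0]),
    'woman':  np.array([0.1, 0.1, 0.0]),
    'tennis': np.array([0.0, 0.0, 1.0]),
}

# Related words share non-zero dimensions; unrelated ones are orthogonal here.
sim_related = cosine(toy['man'], toy['woman'])
sim_unrelated = cosine(toy['man'], toy['tennis'])
print("man/woman:", sim_related, "man/tennis:", sim_unrelated)

# Analogy: compute king - man + woman, then pick the closest word
# among those that were not used to build the query.
target = toy['king'] - toy['man'] + toy['woman']
candidates = [w for w in toy if w not in ('king', 'man', 'woman')]
best = max(candidates, key=lambda w: cosine(target, toy[w]))
print("king - man + woman is closest to:", best)
```

With the real model, the same comparison can be done with `my_cos_similarity(word1, word2, model)` from the cell above, and gensim's `model.wv.most_similar(positive=[...], negative=[...])` performs the analogy search over the loaded vocabulary.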

In [None]:
#your code here

Project the word vectors into a lower-dimensional space using PCA/t-SNE and visualize the projections of the 200 most frequent words. What can you say about the embedding space?

In [None]:
import operator
from collections import Counter
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

%matplotlib inline

# ========== visualize word embeddings of 'n_mf' most frequent words ==========

n_mf = 200
all_tokens = [token for sublist in processed_docs for token in sublist]
t_counts = dict(Counter(all_tokens))
sorted_t_counts = sorted(t_counts.items(), key=operator.itemgetter(1), reverse=True)
mft = [elt[0] for elt in sorted_t_counts]

# store the vectors of the most frequent words in numpy array
mft_vecs = np.zeros((n_mf,300))
for i,token in enumerate(mft[:n_mf]):
    mft_vecs[i,:] = model.wv[token]

my_pca = PCA(n_components=10)
my_tsne = TSNE(n_components=2)

mft_vecs_pca = my_pca.fit_transform(mft_vecs)
mft_vecs_tsne = my_tsne.fit_transform(mft_vecs_pca)

fig, ax = plt.subplots()
ax.scatter(mft_vecs_tsne[:,0], mft_vecs_tsne[:,1],s=3)
for xx, yy, token in zip(mft_vecs_tsne[:,0] , mft_vecs_tsne[:,1], mft):     
    ax.annotate(token, xy=(xx, yy), size=8)
fig.suptitle('t-SNE visualization of word embeddings',fontsize=20)
fig.set_size_inches(11,7)

Similarly, observe some regularities in the space made of the first two PCs (e.g., gender regularities).

In [None]:
# ========== visualize regularities among word vectors ==========

my_pca = PCA(n_components=2)
# numpy array containing the vectors of all words
all_vecs = model.wv.vectors
all_vecs_pca = my_pca.fit_transform(all_vecs) 

my_words = ['queen','king','woman','man','aunt','uncle','son','daughter']

# model.wv.index2word lists the words in the same order as the rows of model.wv.vectors
idxs = [model.wv.index2word.index(elt) for elt in my_words]

fig, ax = plt.subplots()
ax.scatter(all_vecs_pca[idxs,0], all_vecs_pca[idxs,1],s=3)
for xx, yy, token in zip(all_vecs_pca[idxs,0], all_vecs_pca[idxs,1], my_words):     
    ax.annotate(token, xy=(xx, yy), size=8)
fig.suptitle('PCA visualization of gender regularities',fontsize=15)
fig.set_size_inches(7,5)

### Document embeddings for supervised text categorization
We will next use an unsupervised method that generates document representations, and apply it to classify the articles from the BBC Sport dataset. We will first split the dataset into a training set and a test set. Use 90% of the documents for training. The remaining documents will serve as our test set.
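One possible way to do the split is sketched below on stand-in data. Since the documents were already shuffled during preprocessing, a simple cut at the 90% mark is enough. The names `toy_docs`, `toy_y` and the `*_toy` outputs are hypothetical; in the notebook you would slice `processed_docs` and `y` instead, and the later cells expect the training parts to be called `processed_docs_train` and `y_train`.

```python
import numpy as np

# Hypothetical stand-ins for `processed_docs` (list of token lists) and `y` (label array).
toy_docs = [['doc', str(i)] for i in range(20)]
toy_y = np.arange(20) % 5

# Index of the cut: the first 90% of the (already shuffled) documents go to training.
n_train = (9 * len(toy_docs)) // 10

docs_train_toy, docs_test_toy = toy_docs[:n_train], toy_docs[n_train:]
y_train_toy, y_test_toy = toy_y[:n_train], toy_y[n_train:]

print("training documents:", len(docs_train_toy))
print("test documents:", len(docs_test_toy))
```

scikit-learn's `train_test_split` (with `test_size=0.1`) is an alternative that shuffles and splits in one call.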

In [None]:
#your code here

Next, we will experiment with doc2vec ([Le and Mikolov ICML'14](https://cs.stanford.edu/~quocle/paragraph_vector.pdf)). Doc2vec is an extension of word2vec that can learn vectors for all the documents in a collection in a fully unsupervised manner. The embeddings can thus be used for unsupervised or supervised classification. In a supervised setting, an inference stage is required to obtain the vectors of the documents in the test set: the model is trained on the new documents with all other parameters kept fixed, so that only the vectors of the new documents are learned.

Before using doc2vec features with an SVM for supervised classification, we will learn document embeddings for our training set with the PV-DBOW architecture (using gensim's [Doc2Vec](https://radimrehurek.com/gensim/models/doc2vec.html) object). Are the similarities between documents meaningful? Do similar documents share the same labels? Visualize 2D maps of the document embedding space.

In [None]:
from gensim.models.doc2vec import Doc2Vec,TaggedDocument

d2v_training_data = []
for i,doc in enumerate(processed_docs_train):
    d2v_training_data.append(TaggedDocument(words=doc,tags=[i]))
    
# ========== learning doc embeddings with doc2vec ==========

# PV stands for 'Paragraph Vector'
# PV-DBOW (distributed bag-of-words) dm=0

d2v = Doc2Vec(d2v_training_data, vector_size=100, window=10, alpha=0.1, min_alpha=1e-4, dm=0, negative=1, epochs=10, min_count=2, workers=4)
d2v_vecs = np.zeros((len(processed_docs_train), 100))
for i in range(len(processed_docs_train)):
    d2v_vecs[i,:] = d2v.docvecs[i]

# ========== experimenting with doc2vec ==========

print(d2v.docvecs.most_similar(0))
idxs_most_similar = [elt[0] for elt in d2v.docvecs.most_similar(0)]
print([y_train[idx] for idx in idxs_most_similar])

# visualize document embeddings
n_plot = 1000

my_pca = PCA(n_components=10)
my_tsne = TSNE(n_components=2)
d2v_vecs_pca = my_pca.fit_transform(d2v_vecs[:n_plot,:]) 
d2v_vecs_tsne = my_tsne.fit_transform(d2v_vecs_pca)

labels_plt = list(y_train[:n_plot])

palette = plt.get_cmap('hsv',len(list(set(labels_plt))))
fig, ax = plt.subplots()

my_colors = {0:'r', 1:'b', 2: 'g', 3:'y', 4:'k'}

for label in list(set(labels_plt)):
    idxs = [idx for idx,elt in enumerate(labels_plt) if elt==label]
    ax.scatter(d2v_vecs_tsne[idxs,0], 
               d2v_vecs_tsne[idxs,1], 
               c = my_colors[label],
               label=str(label),
               alpha=0.7,
               s=40)

ax.legend(scatterpoints=1)
fig.suptitle('t-SNE visualization of document embeddings',fontsize=20)
fig.set_size_inches(11,7)

Compare doc2vec against the traditional bag-of-words representation with tf-idf weighting on the task of text categorization. Use the [infer_vector](https://radimrehurek.com/gensim/models/doc2vec.html) method of doc2vec to generate representations for the documents of the test set. Use the [TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) object of scikit-learn to generate the traditional bag-of-words representation, and the [LinearSVC](http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) object to perform text categorization. To calculate the accuracies of the two approaches, use the [accuracy_score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) function of scikit-learn. What do you observe?
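The tf-idf half of the comparison could look roughly like the sketch below, shown here on a tiny invented corpus rather than on the BBC Sport data. The documents, labels and variable names (`train_docs`, `test_docs`, and so on) are hypothetical. One detail worth noting: our documents are lists of tokens, whereas `TfidfVectorizer` expects strings by default, so the tokens are joined back together first.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Hypothetical tokenized toy corpus (stand-in for the BBC Sport train/test splits).
train_docs = [['goal', 'striker', 'penalty'], ['wicket', 'bowler', 'innings'],
              ['striker', 'goal', 'keeper'], ['innings', 'wicket', 'batsman']]
train_y = [0, 1, 0, 1]
test_docs = [['penalty', 'goal'], ['bowler', 'wicket']]
test_y = [0, 1]

# Fit the vocabulary and idf weights on the training set only,
# then reuse them to transform the test set.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform([' '.join(doc) for doc in train_docs])
X_test = vectorizer.transform([' '.join(doc) for doc in test_docs])

# Train a linear SVM on the tf-idf features and evaluate it on the test set.
clf = LinearSVC()
clf.fit(X_train, train_y)
acc = accuracy_score(test_y, clf.predict(X_test))
print("tf-idf accuracy:", acc)
```

For the doc2vec half, the same classifier can be trained on `d2v_vecs`, with the test features obtained by calling `d2v.infer_vector(doc)` on each tokenized test document.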

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

#your code here