
# Word Embeddings

### Dataset

The datasets are taken from the `thedeep` project and were produced by the DEEP (https://www.thedeep.io) platform. DEEP is an open-source platform that aims to facilitate the processing of textual data for international humanitarian response organizations. The platform enables the classification of text excerpts, extracted from news and reports, into a set of domain-specific classes. The provided dataset has 12 classes (labels) such as agriculture, health, and protection.

Download from [this link](https://drive.jku.at/filr/public-link/file-download/0cce88f07c9c862b017c9cfba294077a/33590/5792942781153185740/nlp2021_22_data.zip).

- `thedeep.$name$.train.txt`: Train set in CSV format with three fields: sentence_id, text, and label.
- `thedeep.$name$.validation.txt`: Validation set in CSV format with three fields: sentence_id, text, and label.
- `thedeep.$name$.test.txt`: Test set in CSV format with three fields: sentence_id, text, and label.
- `thedeep.$name$.label.txt`: Captions of the labels.
- `README.txt`: Terms of use of the dataset.


### Similarity, Nearest Neighbors, and WE Evaluation

In [1]:
import csv
import numpy as np
from tqdm import tqdm
from scipy import stats
import gensim.downloader as api

#### Calculate word-to-word similarities


In [2]:
# download the word-embedder model
wv = api.load('word2vec-google-news-300')

In [3]:
def cosine_similarity_calc(vec_1,vec_2):
	sim = np.dot(vec_1,vec_2) / (np.linalg.norm(vec_1)*np.linalg.norm(vec_2))
	return sim
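
As a quick sanity check of the formula, the following minimal sketch evaluates the cosine on three hand-made pairs of vectors. The helper name `cosine` and the toy vectors are assumptions for illustration only; parallel vectors should give a value close to 1, orthogonal ones close to 0, and opposite ones close to -1.

```python
import numpy as np

def cosine(u, v):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# toy vectors, not taken from the embedding model
same_direction = cosine(np.array([1.0, 2.0]), np.array([2.0, 4.0]))
orthogonal = cosine(np.array([1.0, 0.0]), np.array([0.0, 3.0]))
opposite = cosine(np.array([1.0, 1.0]), np.array([-1.0, -1.0]))
print(same_direction, orthogonal, opposite)
```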

In [4]:
def source_to_words(source_word:str, compare:list):
    source_vec = wv[source_word]
    for word in compare:
        comp_vec = wv[word]
        print(f"cosine sim({source_word}, {word}):", cosine_similarity_calc(source_vec, comp_vec))

In [5]:
source_1 = "car"
compare_list1 = ["minivan","bicycle","airplane","cereal", "communism"]
source_to_words(source_1, compare_list1)

cosine sim(car, minivan): 0.69070363
cosine sim(car, bicycle): 0.5364484
cosine sim(car, airplane): 0.42435578
cosine sim(car, cereal): 0.13924746
cosine sim(car, communism): 0.05820294


In [6]:
source_2 = "fish"
compare_list2 = ["shark","sea","fishing","water", "boat"]
source_to_words(source_2, compare_list2)

cosine sim(fish, shark): 0.5278309
cosine sim(fish, sea): 0.3250058
cosine sim(fish, fishing): 0.63979906
cosine sim(fish, water): 0.42041764
cosine sim(fish, boat): 0.37478778


In [7]:
source_3 = "rice"
compare_list3 = ["asia","bowl","plant","China", "sushi"]
source_to_words(source_3, compare_list3)

cosine sim(rice, asia): 0.18458173
cosine sim(rice, bowl): 0.25551498
cosine sim(rice, plant): 0.13637118
cosine sim(rice, China): 0.17849869
cosine sim(rice, sushi): 0.36377826


#### Calculate nearest neighbors

In [8]:
def nearest_neighbors(source_vec, target_vec, k):
    """
    efficient way: the dot product between vector and matrix produces
    a list of scalars; each entry is then divided by the product of
    the norm of the source vector and the norm of its row vector
    ----------
    returns [("most similar word", cosine_value), ...]
    ----------
    """
    # norm of every row vector (computed in one call) and of the source vector
    vec_norms = np.linalg.norm(target_vec, axis=1)
    source_len = np.linalg.norm(source_vec)

    # our model has shape 3mill x 300, but we need 300 x 3mill,
    # so we get a list with one scalar per word
    target_transposed = target_vec.T
    cos_list = np.dot(source_vec, target_transposed)
    
    # divide the list of scalars by (||source_vec|| * ||target_vec||), vectorized
    cos_list = cos_list / (source_len * vec_norms)
        
    k_indices = np.argpartition(cos_list, -k)[-k:] # get indices of k largest elements
    
    to_sort = list()
    for idx in k_indices:
        to_sort.append((wv.index_to_key[idx], round(cos_list[idx], 4)))
        
    sorted_k = sorted(to_sort, key = lambda x: x[1])[::-1] # sort by second element and reverse list
    return sorted_k
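
The same idea can be tried on a tiny matrix without loading the 3 million word model. This is a minimal sketch under stated assumptions: `top_k_cosine` and the 2-D toy matrix are hypothetical, and the function returns row indices instead of words.

```python
import numpy as np

def top_k_cosine(query, matrix, k):
    """Return (row_index, cosine) pairs of the k rows most similar to query, best first."""
    row_norms = np.linalg.norm(matrix, axis=1)
    cos = matrix @ query / (row_norms * np.linalg.norm(query))
    top = np.argpartition(cos, -k)[-k:]       # unordered indices of the k largest values
    top = top[np.argsort(cos[top])[::-1]]     # order only those k by similarity
    return [(int(i), float(cos[i])) for i in top]

# toy "embedding matrix" with four 2-D rows
toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
result = top_k_cosine(np.array([1.0, 0.0]), toy, k=2)
print(result)
```

`np.argpartition` avoids sorting the whole similarity list; only the k selected entries are sorted afterwards.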

In [9]:
nearest_neighbors(wv["rice"], wv.vectors, 10)

[('rice', 1.0),
 ('milled_rice', 0.6632),
 ('wheat_flour', 0.6618),
 ('paddy_rice', 0.6603),
 ('paddy', 0.6452),
 ('unhusked_rice', 0.6451),
 ('cassava', 0.6379),
 ('parboiled_rice', 0.637),
 ('maize', 0.636),
 ('wheat', 0.6308)]

In [10]:
nearest_neighbors(wv["mathematics"], wv.vectors, 10)

[('mathematics', 1.0),
 ('math', 0.8161),
 ('Mathematics', 0.7574),
 ('maths', 0.742),
 ('Math', 0.6675),
 ('mathematics_physics', 0.6631),
 ('algebra_trigonometry', 0.656),
 ('calculus_trigonometry', 0.6443),
 ('algebra', 0.6425),
 ('mathematical_sciences', 0.6379)]

In [11]:
nearest_neighbors(wv["dildo"], wv.vectors, 10)

[('dildo', 1.0),
 ('dildos', 0.6557),
 ('vibrator', 0.6179),
 ('nipple_clamps', 0.603),
 ('strap_ons', 0.5892),
 ('clit', 0.5726),
 ('vagina', 0.5708),
 ('vibrators', 0.5669),
 ('anally', 0.5658),
 ('dick', 0.5645)]

#### WE evaluation
***
1) Word Similarity:
- datasets: WordSim353 partitioned into two datasets (WordSim Similarity and WordSim Relatedness), MEN dataset, Mechanical Turk dataset, Rare Words dataset, SimLex-999 dataset
- the word vectors are evaluated by ranking the pairs according to their cosine similarities and measuring the correlation (Spearman’s ρ) with the human ratings

2) Analogy:
- datasets: MSR analogy dataset, Google analogy dataset
- the two analogy datasets present questions of the form "a is to a* as b is to b*", where b* is hidden and must be guessed from the entire vocabulary
- analogy questions are answered using 3CosAdd and 3CosMul
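
The two scoring rules named in the last bullet can be written out as follows. This is a sketch of the usual formulation as we understand it from the literature, not a transcript of the code in this notebook. Here V is the vocabulary, ε is a small constant that guards against division by zero, and for 3CosMul the cosines are assumed to be shifted to [0, 1], for example via (x + 1) / 2.

```latex
\text{3CosAdd:}\quad \hat{b}^{*} = \arg\max_{b' \in V \setminus \{a, a^{*}, b\}} \big( \cos(b', a^{*}) - \cos(b', a) + \cos(b', b) \big)

\text{3CosMul:}\quad \hat{b}^{*} = \arg\max_{b' \in V \setminus \{a, a^{*}, b\}} \frac{\cos(b', a^{*}) \cdot \cos(b', b)}{\cos(b', a) + \varepsilon}
```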

#### Word Similarity Evaluation

In [12]:
wv_words = wv.key_to_index # dict of all words in the model, so membership tests are fast
m_turk = list()
m_turk_similarities = list()

with open("Mtruk.csv") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    for row in csv_reader:
        # do not include words that are not in our embedding model
        if row[0] not in wv_words or row[1] not in wv_words:
            continue
        else:
            m_turk.append(row) # list of lists
            m_turk_similarities.append(float(row[2])) # human rating as a number, not a string

In [13]:
wv_similarities = list()

for elem in m_turk:
    wv_similarities.append(cosine_similarity_calc(wv[elem[0]], wv[elem[1]]))

In [14]:
stats.spearmanr(wv_similarities, m_turk_similarities)

SpearmanrResult(correlation=0.6843994695942136, pvalue=2.44586235254813e-39)
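
To make the meaning of Spearman's ρ concrete, here is a minimal sketch that computes it as the Pearson correlation of the two rank vectors. The helper names and the four made-up rating pairs are assumptions for illustration, and the sketch ignores ties, which a library routine such as `stats.spearmanr` handles.

```python
import numpy as np

def ranks(values):
    """Rank positions 1..n of the values (assumes there are no ties)."""
    order = np.argsort(values)
    r = np.empty(len(values), dtype=float)
    r[order] = np.arange(1, len(values) + 1)
    return r

def spearman_no_ties(x, y):
    """Spearman's rho as the Pearson correlation of the two rank vectors."""
    return float(np.corrcoef(ranks(x), ranks(y))[0, 1])

human = [9.1, 7.5, 3.2, 1.0]        # made-up human ratings
model = [0.80, 0.55, 0.60, 0.05]    # made-up cosine similarities
rho = spearman_no_ties(human, model)
print(rho)
```

Only the ordering of the pairs matters, so the different scales of human ratings and cosine values do not affect the result.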

#### Analogy Evaluation

In [4]:
def three_cos_add(a, a_star, b, wv):
    a_index, a_star_index = wv.get_index(a), wv.get_index(a_star)
    b_index = wv.get_index(b)
    wv_normed = wv.get_normed_vectors()
    # note: for a matrix argument, cosine_similarity_calc divides by the norm of the whole matrix,
    # so all scores are scaled by the same constant; the argmax is not affected
    sim_list = cosine_similarity_calc(wv_normed, wv[a_star] - wv[a] + wv[b])
    # exclude a, a_star and b from embedding model
    sim_list[np.array([a_index, a_star_index, b_index])] = -np.inf
    return np.argmax(sim_list)

def three_cos_mul(a, a_star, b, wv, epsilon=1e-3):
    a_index, a_star_index = wv.get_index(a), wv.get_index(a_star)
    b_index = wv.get_index(b)
    wv_normed = wv.get_normed_vectors()
    # note: the constant scaling described above changes the weight of epsilon relative to the
    # cosines, and the cosines are not shifted to [0, 1], so this deviates from the usual 3CosMul
    sim_list = (cosine_similarity_calc(wv_normed, wv[a_star]) * cosine_similarity_calc(wv_normed, wv[b])) / (cosine_similarity_calc(wv_normed, wv[a]) + epsilon)
    # exclude a, a_star and b from embedding model
    sim_list[np.array([a_index, a_star_index, b_index])] = -np.inf
    return np.argmax(sim_list)

def find_analogy(a:str, a_star:str, b:str, embedding_model, method):
    """
    find b_star such that a is to a_star as b is to b_star
    -------------------------
    returns most analogical word as string
    -------------------------
    """
    # look the word up in the model that was passed in, not in the global wv
    return embedding_model.index_to_key[method(a, a_star, b, embedding_model)]

In [5]:
find_analogy('man',"king","woman", wv, three_cos_add)

'queen'
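
For comparison, the following minimal sketch applies both scoring rules to a toy vocabulary. It is not the implementation used above: it normalizes every row separately and, for the multiplicative rule, shifts the cosines to [0, 1]. The function `analogy_scores`, the five words and their 2-D vectors are all hypothetical.

```python
import numpy as np

def analogy_scores(E, a, a_star, b, method="add", eps=1e-3):
    """Score every row of E as a candidate b* for 'a : a* :: b : b*'.

    E holds one embedding per row; a, a_star and b are row indices.
    """
    En = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit-length rows
    cos_a, cos_a_star, cos_b = En @ En[a], En @ En[a_star], En @ En[b]
    if method == "add":
        scores = cos_a_star - cos_a + cos_b
    else:
        # shift cosines from [-1, 1] to [0, 1] so products and ratios stay non-negative
        sa, sas, sb = (cos_a + 1) / 2, (cos_a_star + 1) / 2, (cos_b + 1) / 2
        scores = sas * sb / (sa + eps)
    scores[[a, a_star, b]] = -np.inf                     # never return a query word
    return scores

# hypothetical 2-D embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
vocab = ["man", "king", "woman", "queen", "apple"]
E = np.array([[0.1, 1.0], [1.0, 1.0], [0.1, -1.0], [1.0, -1.0], [-1.0, 0.1]])
best_add = vocab[int(np.argmax(analogy_scores(E, 0, 1, 2, "add")))]
best_mul = vocab[int(np.argmax(analogy_scores(E, 0, 1, 2, "mul")))]
print(best_add, best_mul)
```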

In [6]:
# analogy dataset
with open("questions-words.txt") as fh:
    text_list = fh.read().splitlines()[1:] # elements are strings
    text_list = [elem.split() for elem in text_list] # split by whitespace
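
Analogy files of this kind may contain category header lines (starting with `:`) between the four-word questions. A minimal sketch, assuming that layout, of how such lines could be filtered out before drawing a subset is shown here. The function name and the toy lines are hypothetical, and the sketch samples without replacement, which differs from `np.random.choice` with its default settings.

```python
import random

def sample_questions(lines, n, seed=123):
    """Split lines on whitespace, keep only 4-word questions, sample n without replacement."""
    questions = [parts for parts in (line.split() for line in lines) if len(parts) == 4]
    rng = random.Random(seed)
    return rng.sample(questions, min(n, len(questions)))

toy_lines = [
    ": capital-common-countries",
    "Athens Greece Berlin Germany",
    "Athens Greece Paris France",
    ": family",
    "man king woman queen",
]
subset = sample_questions(toy_lines, 2)
print(subset)
```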

#### Evaluation of analogies with smaller subset

In [7]:
def evaluate_analogies(test_set:list, method):
    n = len(test_set)
    correct = 0
    
    for elem in tqdm(test_set, desc="evaluate analogies"):
        analogy = find_analogy(elem[0], elem[1], elem[2], wv, method)
        if analogy == elem[3]:
            correct += 1
            
    return correct / n

In [11]:
np.random.seed(123)
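sample_idx = np.random.choice(len(text_list), 1000) # sample indices: np.random.choice expects a 1-D input
evaluate_analogies([text_list[i] for i in sample_idx], three_cos_add)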

  
evaluate analogies: 100%|██████████████████████████████████████████████████████████| 1000/1000 [24:51<00:00,  1.49s/it]


0.753

In [12]:
np.random.seed(123)
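sample_idx = np.random.choice(len(text_list), 1000) # sample indices: np.random.choice expects a 1-D input
evaluate_analogies([text_list[i] for i in sample_idx], three_cos_mul)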

  
evaluate analogies: 100%|██████████████████████████████████████████████████████████| 1000/1000 [44:11<00:00,  2.65s/it]


0.562

**Note:** The 3CosMul result depends on the seed: other experiment runs produced accuracies between 50 and 75 percent. Because running this function took a long time, we omitted further experiments.

### Document Classification with WE

In [22]:
import csv
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from collections import defaultdict

import sklearn

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [23]:
size = "small"

with open(f'nlpwdl2021_data/thedeep.{size}.train.txt', "r", encoding="utf8") as csvfile:
    train = list(csv.reader(csvfile)) # list of lists with 3 entries: sentence ID, text, label
    
with open(f'nlpwdl2021_data/thedeep.{size}.test.txt', "r", encoding="utf8") as csvfile:
    test = list(csv.reader(csvfile))
    
with open(f'nlpwdl2021_data/thedeep.{size}.validation.txt', "r", encoding="utf8") as csvfile:
    validation = list(csv.reader(csvfile))
    
print(len(train), len(test), len(validation))

8400 1800 1800


In [24]:
# reuse old functions
def preprocess_text(text:str=None, return_label:bool=False, label:int=None):
    """function to preprocess string
    returns a list of tokens
    """
    #1 to lowercase
    text = text.lower()
    
    #2 remove all special characters
    text = re.sub(r"\W", " ", text)
    
    #3 remove single characters with space to the left and right (e.g. the "s" left over from a possessive)
    text = re.sub(r"\s+[a-z]\s+", " ", text)
    
    #4 remove double whitespaces to single space
    text = re.sub(' +', ' ', text)
    
    #5.2 replace numbers
    text = re.sub(r'\d+', '<num>', text)
    
    #5.3 now count frequency of <num> and <dates> (the counts are not used further; no <dates> placeholder is inserted above)
    count_num, count_dates = text.count("<num>"), text.count("<dates>")
    
    #5.4 replace it (so that in tokens it does not appear)
    text = text.replace("<num>", "").replace("<dates>", "")
    
    #6 Stop words and tokenization
    stop_words = set(stopwords.words('english'))
    tokens = word_tokenize(text)
    token_text = [i for i in tokens if i not in stop_words]
    
    #7 Lemmatization
    lemmatizer = WordNetLemmatizer()
    result = [lemmatizer.lemmatize(word) for word in token_text]
    if not return_label:
        return result
    else:
        return result, label

def create_dictionary(preprocessed_tokens:list=None):
    """Creates a word dictionary given a PREPROCESSED text. Returns the dictionary of all counts sorted by frequency, the list of out-of-vocabulary words (count <= threshold), and the count list."""
    threshold = 2

    count_list = list()
    filtered_dict = dict()
    out_of_vocabulary = list()
    
    word_dict = defaultdict(lambda: 0)
    for word in preprocessed_tokens:
        word_dict[word] += 1
    
    sorted_dict = {k: v for k, v in sorted(word_dict.items(), key=lambda item:item[1], reverse=True)}
    
    for key, value in sorted_dict.items():
        if value > threshold:
            filtered_dict[key] = value
        else:
            out_of_vocabulary.append(key)

        count_list.append(value)
    
    return sorted_dict, out_of_vocabulary, count_list

def merge_tokens(data:list):
    """
    function that preprocesses text and returns two different lists
    1. token_list ... used for dictionary to find out threshold value
    2. documents ... used for removing oov words
    """
    token_list = list() # list with all tokens of all documents ["word1", "word2", ...]
    documents = list() # elements are list of tokens [[doc1 tokens], [doc2 tokens], ...]
    
    for sample in tqdm(data):
        tokens = preprocess_text(sample[1]) # preprocess each document only once
        token_list += tokens
        documents.append(tokens)

    return token_list, documents
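
The regular-expression steps of the preprocessing can be tried in isolation. The sketch below is a simplified stand-in that uses only `re`: it leaves out the stop-word removal and lemmatization done with `nltk`, and drops numbers directly instead of going through a `<num>` placeholder. The function name and the sample sentence are made up.

```python
import re

def normalize(text):
    """Lowercase, strip special characters, single letters and numbers, then split on whitespace."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)            # special characters -> space
    text = re.sub(r"\s+[a-z]\s+", " ", text)   # isolated single letters, e.g. a possessive "s"
    text = re.sub(r"\d+", " ", text)           # drop numbers
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text.split()

tokens = normalize("The minister's report, dated 12 April 2021, lists 3 cases!")
print(tokens)
```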

### Mapping the words to embeddings
The model used is the same as above: Google's pretrained Word2Vec model. We map the words of each document (after preprocessing and cutting out the words below the threshold) to their vectors.

In [25]:
# for train -----------------------
# preprocess all texts
texts_train = []
for doc in train:
    texts_train.append(preprocess_text(doc[1]))

# make a dictionary
dictionarys_texts_list_train = [] # contains all dictionarys
for pre_text in texts_train:
    dictionarys_texts_list_train.append(create_dictionary(pre_text)[0])
    
    
# for test ----------------------
texts_test = []
for doc in test:
    texts_test.append(preprocess_text(doc[1]))

# make a dictionary
dictionarys_texts_list_test = [] # contains all dictionaries of the test set
for pre_text in texts_test:
    dictionarys_texts_list_test.append(create_dictionary(pre_text)[0])

In [26]:
# go over every dictionary of the train and test set and map the words with the pretrained model;
# a single shared map keeps the random vectors of unknown words identical in both splits
word_map_train = {} # maps word to embedding (covers train and test vocabulary)
not_found = 0

for doc in dictionarys_texts_list_train + dictionarys_texts_list_test:
    for word in doc:
        if word not in word_map_train:
            try:
                word_map_train[word] = np.array(wv[word])
            except KeyError:
                # word is not in the pretrained model: draw a random embedding of the same size
                word_map_train[word] = np.random.uniform(low=-1.0, high=1.0, size=wv.vector_size)
                not_found += 1

# the test set uses the same map
word_map_test = word_map_train
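
The lookup with a random fallback can be illustrated on a toy vocabulary. This is a minimal sketch under stated assumptions: `build_word_map`, the seed, and the 3-dimensional "pretrained" vectors are hypothetical, and a plain dict stands in for the embedding model.

```python
import numpy as np

def build_word_map(words, pretrained, dim, seed=0):
    """Map each word to its pretrained vector, or to a random vector if it is unknown."""
    rng = np.random.default_rng(seed)   # fixed seed so the random vectors are reproducible
    word_map, not_found = {}, 0
    for word in words:
        if word in word_map:
            continue
        if word in pretrained:
            word_map[word] = np.asarray(pretrained[word], dtype=float)
        else:
            word_map[word] = rng.uniform(-1.0, 1.0, size=dim)
            not_found += 1
    return word_map, not_found

# hypothetical 3-dimensional "pretrained" vectors
toy_pretrained = {"health": [0.1, 0.2, 0.3], "water": [0.0, 0.5, 0.5]}
toy_map, toy_missing = build_word_map(["health", "jonglei", "water", "jonglei"], toy_pretrained, dim=3)
print(toy_missing, sorted(toy_map))
```

Because every word is stored once, an unknown word receives the same random vector wherever it occurs.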

In [27]:
def doc_representation(doc:list=None):
    """Input: doc, a list of vectors of the same shape.
    Returns the document embedding e_d, the mean of the word vectors."""
    
    # sum the vectors element-wise and divide by their number
    e_d = (1/len(doc)) * np.sum(doc, axis=0)
    
    return e_d
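
A small worked example of the averaging step is shown here. It is a sketch, not the function used in this notebook: the name `mean_embedding` and the zero-vector fallback for an empty document are assumptions, added because dividing by `len(doc)` fails when no token survives preprocessing.

```python
import numpy as np

def mean_embedding(vectors, dim):
    """Average a list of equally sized word vectors; fall back to zeros for an empty document."""
    if len(vectors) == 0:
        return np.zeros(dim)
    return np.mean(np.stack(vectors), axis=0)

doc = [np.array([1.0, 0.0, 2.0]), np.array([3.0, 2.0, 0.0])]
e_d = mean_embedding(doc, dim=3)    # element-wise mean of the two vectors
empty = mean_embedding([], dim=3)
print(e_d, empty)
```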

### Classification
Classify the documents from the test set.

In [28]:
text_and_labels_train = {} # contains all dictionarys
text_and_labels_test = {} # contains all dictionarys


# get train set
for doc in train:
    # preprocess text and make dict ([0] is the full sorted dictionary, so no tokens are cut off here)
    pre_text = list(create_dictionary(preprocess_text(doc[1]))[0].keys())
    
    # assign the label and the (preprocessed) text, i.e. the dictionary keys, to the id
    text_and_labels_train[doc[0]] = {"label": doc[2], "text": pre_text}
    
    # now add a new entry - the vector representation e_d (see doc_representation above)
    text_and_labels_train[doc[0]]["vector"] = doc_representation([word_map_train[word] for word in pre_text])
    
# now with test set
for doc in test:
    # preprocess text and make dict ([0] is the full sorted dictionary, so no tokens are cut off here)
    pre_text = list(create_dictionary(preprocess_text(doc[1]))[0].keys())
    
    # assign the label and the (preprocessed) text, i.e. the dictionary keys, to the id
    text_and_labels_test[doc[0]] = {"label": doc[2], "text": pre_text}
    
    # now add a new entry - the vector representation e_d (word_map_train also contains the test words)
    text_and_labels_test[doc[0]]["vector"] = doc_representation([word_map_train[word] for word in pre_text])

In [29]:
for item in text_and_labels_test.items():
    print(item)
    break

('11267', {'label': '4', 'text': ['health', 'said', 'died', 'facility', 'minister', 'jonglei', 'state', 'angok', 'gordon', 'far', 'people', 'cholera', 'duk', 'county', 'number', 'reported', 'last', 'evening', 'actually', 'case', 'monday', 'april', 'added', 'still', 'treated', 'rest', 'discharged'], 'vector': array([-2.40211193e-02, -2.69282547e-03,  4.65271574e-02,  4.16446081e-02,
       -4.11767683e-02, -2.52486798e-02,  5.06995292e-02, -7.73757420e-02,
        3.09039325e-02, -3.39679409e-02, -3.14135742e-02, -1.81439759e-01,
       -5.62723502e-02,  3.47959475e-02, -6.61549626e-02,  9.49345343e-02,
       -2.53688379e-03,  8.95255864e-02, -4.16840290e-02,  8.16017567e-05,
       -3.48079483e-02,  4.07584960e-02,  4.57891639e-02, -6.12057694e-02,
        6.63169626e-02, -3.12747544e-02, -6.72252837e-02,  1.02127287e-02,
       -3.27678302e-02, -9.11770967e-03,  3.09081945e-02, -1.34938124e-02,
       -8.17382915e-02, -7.44070315e-02, -2.20341461e-02, -6.52211166e-03,
       -3.91954

### Evaluation (1/3: Random Forest)

In [30]:
forest_clf = RandomForestClassifier(n_estimators=80, random_state=42)

# fit the classifier on the train vectors and train labels (both are built with list comprehensions)
forest_clf.fit([item["vector"] for item in text_and_labels_train.values()], [item["label"] for item in text_and_labels_train.values()])

# get predictions for the test vectors
pred = forest_clf.predict([item["vector"] for item in text_and_labels_test.values()])

In [31]:
# get the accuracy
print("Accuracy for Random Forest on Test Set: ",sklearn.metrics.accuracy_score([item["label"] for item in text_and_labels_test.values()], pred))

Accuracy for Random Forest on Test Set:  0.5188888888888888


### Evaluation (2/3: kNN)

In [32]:
# repeat above but with other classifiers
neigh_clf = KNeighborsClassifier(n_neighbors=40)

neigh_clf.fit([item["vector"] for item in text_and_labels_train.values()], [item["label"] for item in text_and_labels_train.values()])
pred = neigh_clf.predict([item["vector"] for item in text_and_labels_test.values()])

In [33]:
print("Accuracy for kNN on Test Set: ", sklearn.metrics.accuracy_score([item["label"] for item in text_and_labels_test.values()], pred))

Accuracy for kNN on Test Set:  0.5455555555555556


### Evaluation (3/3: Gradient Boost)
⚠️**WARNING**⚠️ - takes long to run:)

In [34]:
boost_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0, max_depth=1, random_state=0)

boost_clf.fit([item["vector"] for item in text_and_labels_train.values()], [item["label"] for item in text_and_labels_train.values()])

pred = boost_clf.predict([item["vector"] for item in text_and_labels_test.values()])

In [35]:
print("Accuracy for Gradient Boost on Test Set: ", sklearn.metrics.accuracy_score([item["label"] for item in text_and_labels_test.values()], pred))

Accuracy for Gradient Boost on Test Set:  0.5155555555555555
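
The three evaluations rebuild the same vector and label lists several times. One possible way to avoid this is sketched below with hypothetical helpers (`to_xy`, `accuracy`) and a toy dictionary shaped like `text_and_labels_test`. It only uses NumPy, so the accuracy function is a stand-in for `sklearn.metrics.accuracy_score`, not a replacement the notebook relies on.

```python
import numpy as np

def to_xy(records):
    """Stack the 'vector' entries into a 2-D array and collect the 'label' entries."""
    X = np.stack([rec["vector"] for rec in records.values()])
    y = np.array([rec["label"] for rec in records.values()])
    return X, y

def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and label agree."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# toy records in the same layout as text_and_labels_test (made-up ids, labels and vectors)
toy_records = {
    "11": {"label": "4", "vector": np.array([0.1, 0.2])},
    "12": {"label": "7", "vector": np.array([0.3, 0.4])},
    "13": {"label": "4", "vector": np.array([0.5, 0.6])},
}
X_toy, y_toy = to_xy(toy_records)
toy_accuracy = accuracy(y_toy, ["4", "4", "4"])
print(X_toy.shape, toy_accuracy)
```

With such arrays, each classifier could be fitted and scored on the same `X` and `y` without repeating the list comprehensions.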


<a name="section-references"></a><h2 style="color:rgb(0,120,170)">References</h2>

[1] O. Levy, Y. Goldberg, and I. Dagan. Improving distributional similarity with lessons learned from word embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, 2015.

[2] M. Pagliardini, P. Gupta, and M. Jaggi. Unsupervised Learning of Sentence Embeddings using Compositional n-Gram Features. In Proceedings of the conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2018.