# Simple Sentence Similarity (from github) - with application of the methods to a sample dataset at the end

Word embeddings have become widespread in Natural Language Processing. They allow us to easily compute the semantic similarity between two words, or to find the words most similar to a target word. However, in many applications we're more interested in the similarity between two sentences or short texts. In this notebook, I compare some simple ways of computing sentence similarity and investigate how they perform.

In [1]:
import pandas as pd
import numpy as np
import scipy.stats  # scipy.stats.pearsonr and scipy.stats.spearmanr are used in the experiments
import os
import matplotlib.pyplot as plt



## Preparation

First we need to do some preparation: some of our models require the sentences to be tokenized, some do not. For that reason we'll make a simple Sentence class where we keep both the raw sentence and the tokenized sentence. The individual methods below will then pick the input they need.

In [2]:
import nltk

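# Requires the NLTK stopword list and tokenizer data to be downloaded beforehand (see nltk.download).
STOP = set(nltk.corpus.stopwords.words("english"))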

class Sentence:
    
    def __init__(self, sentence):
        self.raw = sentence
        normalized_sentence = sentence.replace("‘", "'").replace("’", "'")
        self.tokens = [t.lower() for t in nltk.word_tokenize(normalized_sentence)]
        self.tokens_without_stop = [t for t in self.tokens if t not in STOP]

Next, we're going to use the popular [Gensim](https://radimrehurek.com/gensim/) library to load a set of widely used pre-trained word embeddings: [word2vec](https://www.tensorflow.org/tutorials/word2vec).

In [5]:
import gensim

from gensim.models import Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec

PATH_TO_WORD2VEC = os.path.expanduser("models/GoogleNews-vectors-negative300.bin")
word2vec = gensim.models.KeyedVectors.load_word2vec_format(PATH_TO_WORD2VEC, binary=True, limit=500000)



Finally, in order to compute weighted averages of word embeddings later, we are going to load a file with word frequencies. These word frequencies have been collected from Wikipedia and saved in a tab-separated file. 

In [6]:
import csv

PATH_TO_FREQUENCIES_FILE = "data/sentence_similarity/frequencies.tsv"
PATH_TO_DOC_FREQUENCIES_FILE = "data/sentence_similarity/doc_frequencies.tsv"

def read_tsv(f):
    frequencies = {}
    with open(f) as tsv:
        tsv_reader = csv.reader(tsv, delimiter="\t")
        for row in tsv_reader: 
            frequencies[row[0]] = int(row[1])
        
    return frequencies
        
frequencies = read_tsv(PATH_TO_FREQUENCIES_FILE)
doc_frequencies = read_tsv(PATH_TO_DOC_FREQUENCIES_FILE)
doc_frequencies["NUM_DOCS"] = 1288431


## Similarity methods

### Baseline

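As our baseline, we're going for the simplest way of computing sentence embeddings: just take the embeddings of the words in the sentence (optionally leaving out the stopwords) and compute their average. When document frequencies are supplied, each distinct word is weighted by its tf-idf score, that is, its frequency in the sentence multiplied by the log of the number of documents divided by one plus its document frequency. Otherwise the distinct words are averaged without weights.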

We then use cosine similarity to compare two sentence embeddings.
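
As a minimal sketch of this baseline, the snippet below averages a few made-up 3-dimensional vectors and compares the two averages with cosine similarity. The `toy_vectors` table and its values are assumptions for illustration only, not real word2vec embeddings, and the helper names are hypothetical.

```python
import numpy as np

# Hypothetical toy embeddings; real word2vec vectors have 300 dimensions.
toy_vectors = {
    "cat": np.array([1.0, 0.0, 0.0]),
    "dog": np.array([0.8, 0.6, 0.0]),
    "sleeps": np.array([0.0, 0.0, 1.0]),
    "runs": np.array([0.0, 0.6, 0.8]),
}

def average_embedding(tokens, vectors):
    """Unweighted mean of the vectors of all tokens found in `vectors`."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

emb1 = average_embedding(["cat", "sleeps"], toy_vectors)
emb2 = average_embedding(["dog", "runs"], toy_vectors)
similarity = cosine(emb1, emb2)
```

A sentence compared with itself scores 1, and unrelated sentences score closer to 0.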

In [7]:
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import math

def run_avg_benchmark(sentences1, sentences2, model=None, use_stoplist=False, doc_freqs=None): 

    if doc_freqs is not None:
        N = doc_freqs["NUM_DOCS"]
    
    sims = []
    for (sent1, sent2) in zip(sentences1, sentences2):
    
        tokens1 = sent1.tokens_without_stop if use_stoplist else sent1.tokens
        tokens2 = sent2.tokens_without_stop if use_stoplist else sent2.tokens

        tokens1 = [token for token in tokens1 if token in model]
        tokens2 = [token for token in tokens2 if token in model]
        
        if len(tokens1) == 0 or len(tokens2) == 0:
            sims.append(0)
            continue
        
        tokfreqs1 = Counter(tokens1)
        tokfreqs2 = Counter(tokens2)
        
        weights1 = [tokfreqs1[token] * math.log(N/(doc_freqs.get(token, 0)+1)) 
                    for token in tokfreqs1] if doc_freqs else None
        weights2 = [tokfreqs2[token] * math.log(N/(doc_freqs.get(token, 0)+1)) 
                    for token in tokfreqs2] if doc_freqs else None
        
        if weights1 and weights2:
            if sum(weights1) == 0 or sum(weights2) == 0:
                sims.append(0)
                continue
        embedding1 = np.average([model[token] for token in tokfreqs1], axis=0, weights=weights1).reshape(1, -1)
        embedding2 = np.average([model[token] for token in tokfreqs2], axis=0, weights=weights2).reshape(1, -1)

        sim = cosine_similarity(embedding1, embedding2)[0][0]
        sims.append(sim)

    return sims
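
The tf-idf weights in `run_avg_benchmark` follow the pattern `tf * log(N / (df + 1))`. The sketch below works through that formula on its own. The document counts and corpus size are hypothetical values chosen for illustration, and `tfidf_weights` is a helper name introduced here.

```python
import math
from collections import Counter

# Hypothetical document counts, for illustration only.
toy_doc_freqs = {"the": 900, "marble": 9}
NUM_DOCS = 1000  # assumed corpus size

def tfidf_weights(tokens, doc_freqs, num_docs):
    """Map each distinct token to tf * log(N / (df + 1))."""
    term_freqs = Counter(tokens)
    return {
        token: tf * math.log(num_docs / (doc_freqs.get(token, 0) + 1))
        for token, tf in term_freqs.items()
    }

weights = tfidf_weights(["the", "marble", "the"], toy_doc_freqs, NUM_DOCS)
```

Even though "the" occurs twice in the sentence, the rare word "marble" receives a much larger weight, which is the effect the weighting is meant to have.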

### Word Mover's Distance

Word Mover's Distance (WMD) is a popular alternative to the simple average embedding similarity. It uses the word embeddings of the words in two texts to measure the minimum distance that the words in one text need to "travel" in semantic space to reach the words of the other text. WMD is available in the Gensim library. Because it is a distance rather than a similarity, the function below returns its negation, so that higher scores mean more similar sentences.

In [8]:
def run_wmd_benchmark(sentences1, sentences2, model, use_stoplist=False):
    
    sims = []
    for (sent1, sent2) in zip(sentences1, sentences2):
    
        tokens1 = sent1.tokens_without_stop if use_stoplist else sent1.tokens
        tokens2 = sent2.tokens_without_stop if use_stoplist else sent2.tokens
        
        tokens1 = [token for token in tokens1 if token in model]
        tokens2 = [token for token in tokens2 if token in model]
        
        if len(tokens1) == 0 or len(tokens2) == 0:
            tokens1 = [token for token in sent1.tokens if token in model]
            tokens2 = [token for token in sent2.tokens if token in model]
            
        sims.append(-model.wmdistance(tokens1, tokens2))
        
    return sims
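
To make the "travel" idea concrete, here is a rough sketch that solves the underlying transport problem for very short texts with `scipy.optimize.linprog`. It is not Gensim's implementation, which may differ in details such as vector normalization. The function name `toy_wmd` and the two-dimensional vectors are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def toy_wmd(tokens1, tokens2, vectors):
    """Word Mover's Distance for two short token lists, solved as a linear program."""
    words1, words2 = sorted(set(tokens1)), sorted(set(tokens2))
    # Normalized bag-of-words weights: the "mass" each distinct word carries.
    mass1 = np.array([tokens1.count(w) for w in words1], dtype=float) / len(tokens1)
    mass2 = np.array([tokens2.count(w) for w in words2], dtype=float) / len(tokens2)
    # Travel cost between two words = Euclidean distance between their vectors.
    cost = np.array([[np.linalg.norm(vectors[a] - vectors[b]) for b in words2]
                     for a in words1])
    n, m = cost.shape
    constraints = []
    for i in range(n):  # word i of text 1 ships out exactly mass1[i]
        row = np.zeros((n, m))
        row[i, :] = 1.0
        constraints.append(row.ravel())
    for j in range(m):  # word j of text 2 receives exactly mass2[j]
        col = np.zeros((n, m))
        col[:, j] = 1.0
        constraints.append(col.ravel())
    targets = np.concatenate([mass1, mass2])
    # The last constraint is implied by the others (both masses sum to 1), so drop it.
    result = linprog(cost.ravel(), A_eq=np.array(constraints[:-1]),
                     b_eq=targets[:-1], bounds=(0, None), method="highs")
    return result.fun

# Hypothetical 2-D vectors, for illustration only.
toy_vectors = {"king": np.array([0.0, 0.0]), "queen": np.array([3.0, 4.0])}
distance = toy_wmd(["king"], ["queen"], toy_vectors)
```

With one word on each side, all the mass has to travel between the two vectors, so the distance is simply their Euclidean distance. Identical texts have a distance of zero.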

### Smooth Inverse Frequency

Taking the average of the word embeddings in a sentence, like we did above, is a very crude method of computing sentence embeddings. Most importantly, this gives far too much weight to words that are quite irrelevant, semantically speaking. Smooth Inverse Frequency tries to solve this problem. 

To compute SIF sentence embeddings, we first compute a weighted average of the token embeddings in the sentence. This procedure is very similar to the weighted average we used above, with the single difference that the word embeddings are weighted by `a/(a+p(w))`, where `a` is a parameter that is set to `0.001` by default, and `p(w)` is the estimated relative frequency of the word `w` in a reference corpus.
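
A small worked example of the weight `a / (a + p(w))` may help. The raw counts in `toy_freqs` are hypothetical, and `sif_weight` is a helper name introduced here for illustration.

```python
# Hypothetical raw corpus counts, for illustration only.
toy_freqs = {"the": 50_000, "of": 30_000, "marble": 20}
total = sum(toy_freqs.values())

def sif_weight(token, freqs, total_freq, a=0.001):
    """SIF weight a / (a + p(w)), with p(w) the relative frequency of the token."""
    p_w = freqs.get(token, 0) / total_freq
    return a / (a + p_w)

weights = {token: sif_weight(token, toy_freqs, total) for token in toy_freqs}
unseen = sif_weight("quartz", toy_freqs, total)  # token absent from the frequency table
```

Frequent words such as "the" end up with weights close to 0, rare words get weights close to 1, and a word that is missing from the frequency table gets exactly 1.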

Next, we need to perform common component removal: we compute the first principal component of the sentence embeddings we obtained above and subtract from each embedding its projection on this component. This corrects for the influence of high-frequency words that mostly have a syntactic or discourse function, such as "just", "there", "but", etc.
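
The projection step can be sketched with plain NumPy. This version uses `np.linalg.svd` rather than scikit-learn's `TruncatedSVD`, and the random "sentence embeddings" are made up for illustration. Like the uncentered SVD, it takes the first right singular vector of the raw matrix as the common component.

```python
import numpy as np

def remove_first_component(X):
    """Subtract from each row its projection on the first right singular vector of X."""
    # Rows of vt are the right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    pc = vt[0]  # shape: (n_features,)
    return X - np.outer(X @ pc, pc)

rng = np.random.default_rng(0)
# Hypothetical embeddings: 6 sentences, 4 dimensions, sharing a large common offset.
X = rng.normal(size=(6, 4)) + np.array([5.0, 0.0, 0.0, 0.0])
X_clean = remove_first_component(X)
```

After the removal, every row of `X_clean` is orthogonal to the removed direction, so the shared offset no longer dominates the cosine similarities.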

In [9]:
from sklearn.decomposition import TruncatedSVD

def remove_first_principal_component(X):
    svd = TruncatedSVD(n_components=1, n_iter=7, random_state=0)
    svd.fit(X)
    pc = svd.components_
    XX = X - X.dot(pc.transpose()) * pc
    return XX


def run_sif_benchmark(sentences1, sentences2, model, freqs={}, use_stoplist=False, a=0.001): 
    total_freq = sum(freqs.values())
    embeddings = []
    
    # SIF requires us to first collect all sentence embeddings and then perform 
    # common component removal.
    for (sent1, sent2) in zip(sentences1, sentences2): 
        
        tokens1 = sent1.tokens_without_stop if use_stoplist else sent1.tokens
        tokens2 = sent2.tokens_without_stop if use_stoplist else sent2.tokens   
        
        tokens1 = [token for token in tokens1 if token in model]
        tokens2 = [token for token in tokens2 if token in model]
        
        if tokens1 == []: tokens1 = ['empty']
        if tokens2 == []: tokens2 = ['empty'] 
        
        weights1 = [a/(a+freqs.get(token,0)/total_freq) for token in tokens1]
        weights2 = [a/(a+freqs.get(token,0)/total_freq) for token in tokens2]
        
        embedding1 = np.average([model[token] for token in tokens1], axis=0, weights=weights1)
        embedding2 = np.average([model[token] for token in tokens2], axis=0, weights=weights2)
        embeddings.append(embedding1)
        embeddings.append(embedding2)

            
        
        
    embeddings = remove_first_principal_component(np.array(embeddings))
    sims = [cosine_similarity(embeddings[idx*2].reshape(1, -1), 
                              embeddings[idx*2+1].reshape(1, -1))[0][0] 
            for idx in range(int(len(embeddings)/2))]

    return sims

The methods above share two important characteristics: 

- As simple bag-of-words methods, they do not take word order into account.
- The word embeddings they use have been learned in an unsupervised manner. 

Both these characteristics are potential downsides: 

- Since differences in word order can point to differences in meaning (compare `the dog bites the man` with `the man bites the dog`), we'd like our sentence embeddings to be sensitive to this variation.
- Supervised training can help sentence embeddings learn the meaning of a sentence more directly.
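
The word-order point is easy to verify for the averaging approach. In the sketch below, random vectors stand in for real word embeddings (an assumption for illustration) and the two example sentences from the list above receive the same averaged embedding.

```python
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical random vectors standing in for real word embeddings.
toy_vectors = {w: rng.normal(size=5) for w in ["the", "dog", "bites", "man"]}

def average_embedding(sentence, vectors):
    """Mean of the word vectors of a whitespace-tokenized sentence."""
    return np.mean([vectors[w] for w in sentence.split()], axis=0)

emb_a = average_embedding("the dog bites the man", toy_vectors)
emb_b = average_embedding("the man bites the dog", toy_vectors)
same = np.allclose(emb_a, emb_b)  # both sentences contain the same multiset of words
```

Because the two sentences contain exactly the same words, any method that only looks at the bag of words cannot tell them apart.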

## Experiments

In [10]:
def run_experiment(df, benchmarks): 
    
    sentences1 = [Sentence(s) for s in df['sent_1']]
    sentences2 = [Sentence(s) for s in df['sent_2']]
    
    pearson_cors, spearman_cors = [], []
    for label, method in benchmarks:
        sims = method(sentences1, sentences2)
        pearson_correlation = scipy.stats.pearsonr(sims, df['sim'])[0]
        print(label, pearson_correlation)
        pearson_cors.append(pearson_correlation)
        spearman_correlation = scipy.stats.spearmanr(sims, df['sim'])[0]
        spearman_cors.append(spearman_correlation)
        
    return pearson_cors, spearman_cors

In [11]:
import functools as ft

benchmarks = [("AVG-W2V", ft.partial(run_avg_benchmark, model=word2vec, use_stoplist=False)),
              ("AVG-W2V-STOP", ft.partial(run_avg_benchmark, model=word2vec, use_stoplist=True)),
              ("AVG-W2V-TFIDF", ft.partial(run_avg_benchmark, model=word2vec, use_stoplist=False, doc_freqs=doc_frequencies)),
              ("AVG-W2V-TFIDF-STOP", ft.partial(run_avg_benchmark, model=word2vec, use_stoplist=True, doc_freqs=doc_frequencies)),
              ("WMD-W2V", ft.partial(run_wmd_benchmark, model=word2vec, use_stoplist=False)), 
              ("SIF-W2V", ft.partial(run_sif_benchmark, freqs=frequencies, model=word2vec, use_stoplist=False)),
             ]

global PYEMD_EXT

# Application to our dataset

In [12]:
# data = pd.read_csv('Example_dataset_marble - 5_column_with_correct.csv')
data = pd.read_csv('Example_dataset_marble_v2 - 2_data_no_omission.csv')


## Simple version

Unavailable for now.

In [None]:
def run_all_simple(df, benchmarks): 
    """Score each (Field_en, Field_correct_en) pair with every benchmark method."""
    sentences1 = [Sentence(s) for s in df['Field_en']]
    sentences2 = [Sentence(s) for s in df['Field_correct_en']]
    sims = {"Sent1":sentences1, "Sent2": sentences2}
    for label, method in benchmarks:
        sims[label] = method(sentences1, sentences2)

    frame = pd.DataFrame(sims)
    frame['Field_en'] = df['Field_en']
    frame['Field_correct_en'] = df['Field_correct_en']
    return frame

In [None]:
exp_frame = data.copy()
exp_frame.tail()

In [None]:
exp_frame[exp_frame['Field_en'].str.strip() == '']

In [None]:
frame_sim = run_all_simple(exp_frame, benchmarks)

In [None]:
frame_sim = pd.concat([frame_sim, 
                       exp_frame[["Accuracy_score","Code",
                                  "Fieldname"]]], 
                        axis=1)

In [None]:
frame_sim = frame_sim[['Fieldname', "Field_en",'Field_correct_en',
                       'AVG-W2V', 'AVG-W2V-STOP', 'AVG-W2V-TFIDF', 
                       'AVG-W2V-TFIDF-STOP','WMD-W2V','SIF-W2V', 
                       'Accuracy_score','Code']]
frame_sim.tail()

## Match the sentences

In [13]:
def run_all_match(df, model, benchmarks): 
    size = len(model.index)
    text_frame = df.copy()
    sims = {"stud_sentence":[],
            "stud_field":[],}
    for label, method in benchmarks:
        sims[label+"_all_scores"] = []
        sims[label+"_similarity"] = []
        sims[label+"_aimed_sentence"] = []
        sims[label+"_aimed_field"] = []
        
    for _, row in text_frame.iterrows():
        stud_sentence = row["Field_en"]
        sims["stud_sentence"].append(stud_sentence)
        sims["stud_field"].append(row["Fieldname"])
        # Pair the student sentence with every model answer of the same category.
        student_sentences = [Sentence(stud_sentence)]*size
        model_sentences = model[row['Category']].apply(Sentence)
        for label, method in benchmarks:
            similarity_scores = method(student_sentences, model_sentences)
            similarity = max(similarity_scores)
            # Position of the best-scoring model answer.
            best_idx = int(np.argmax(similarity_scores))
            aimed_sentence = model_sentences.iloc[best_idx]
            aimed_field = model_sentences.index[best_idx]
            sims[label+"_all_scores"].append(similarity_scores)
            sims[label+"_similarity"].append(similarity)
            sims[label+"_aimed_sentence"].append(aimed_sentence.raw)
            sims[label+"_aimed_field"].append(aimed_field)
    frame = pd.DataFrame(sims)
    return frame
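
The core of the matching step is "score the student answer against every model answer and keep the best one". The sketch below isolates that idea. The token-overlap score is only a stand-in for the benchmark methods, the student answer is made up, and `best_match` and `token_overlap` are hypothetical helper names.

```python
def best_match(student_answer, model_answers, similarity):
    """Return (field, answer, score) of the model answer most similar to the student answer."""
    scored = [(similarity(student_answer, answer), field, answer)
              for field, answer in model_answers.items()]
    score, field, answer = max(scored, key=lambda item: item[0])
    return field, answer, score

def token_overlap(a, b):
    """Stand-in similarity (Jaccard overlap of lowercased tokens), not one of the benchmarks."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    return len(set_a & set_b) / len(set_a | set_b)

# Two of the Botox model answers; the student answer is invented for this example.
model_answers = {
    "Field3_en": "People look younger",
    "Field4_en": "Facial expression can change",
}
field, answer, score = best_match("people will look younger", model_answers, token_overlap)
```

Swapping `token_overlap` for any function that scores a pair of sentences gives the same selection logic with a different similarity measure.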

In [14]:
model_frame = pd.read_csv('correct_answers.csv', index_col=0)
exp_frame2 = data.copy()
# For old dataset
# exp_frame2['Category'] = [tab[0] for tab in exp_frame2['Fieldname'].str.split(";")]
model_frame = model_frame[[ 'Field1_en', 'Field2_en','Field3_en', 'Field4_en']]
model_frame = model_frame.transpose()
model_frame.head()

| TextName | Beton | Botox | Geld | Metro | Muziek | Suez |
|---|---|---|---|---|---|---|
| Field1_en | Central heating | Can help prevent muscle tightness | People should not be distracted for too long risk | Waste need not be processed in some way | learn to read and play music | No natural connection Western Indian Ocean |
| Field2_en | Concrete dries out | Can help against wrinkles between the eyes and... | A person may experience only short deep happin... | Artificial reef constructed | Improving mathematics vaardigheiten | Ships make long trip around African continent |
| Field3_en | Buildings are smaller | People look younger | People get used to luxury | More plankton and marine fauna | Can help bring back old memories | Shorter waterway needed |
| Field4_en | Elevators Bliven hang | Facial expression can change | Money does not gellukig long time | More fish (such as mackerel, grouper, sea fish... | Higher scores on IQ tests | Suez canal dug |


In [None]:
frame_sim2 = run_all_match(exp_frame2, model_frame, benchmarks)
frame_sim2

In [None]:
frame_sim2 = pd.concat([frame_sim2, 
                        exp_frame2[['IDStud', 'IDClass', 
                                    "Category","Accuracy_score",
                                    "Code","Fieldname"]]], axis=1)

In [None]:
frame_sim2.to_csv("complete_result_matched.csv")
sif_matched = frame_sim2[["SIF-W2V_aimed_sentence", 
            "SIF-W2V_aimed_field", 
            'SIF-W2V_similarity',
            'stud_field',
            'stud_sentence',
            'IDStud', 'IDClass', 
            "Category","Accuracy_score",
            "Code","Fieldname"]].copy()
sif_matched.to_csv("sif_wv2_matched.csv")
avg_w2v_matched = frame_sim2[["AVG-W2V_aimed_sentence", 
            "AVG-W2V_aimed_field", 
            'AVG-W2V_similarity',
            'stud_field',
            'stud_sentence',
            'IDStud', 'IDClass', 
            "Category","Accuracy_score",
            "Code","Fieldname"]].copy()
avg_w2v_matched.to_csv("avg_wv2_matched.csv")