<a href="https://colab.research.google.com/github/harmonydata/experiments/blob/main/harmony_wmd_experiment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Adding Word Mover's Distance to Harmony

See paper: Kusner, Matt, et al. "From word embeddings to document distances." International conference on machine learning. PMLR, 2015.

https://www.cs.cornell.edu/~kilian/papers/wmd_metric.pdf


Can Harmony show a distance metric between two instruments?

Each instrument is a sequence of sentence embeddings.

Harmony already compares two sentence embeddings with its cosine similarity function, and a similarity can be turned into a distance (for example, one minus the similarity).
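
As a small illustration of comparing a single pair of embeddings, here is a minimal sketch of cosine similarity and the matching cosine distance in plain Python. It is not Harmony's own implementation; the function names and the toy two-dimensional vectors are made up for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """One common way to turn the similarity into a distance."""
    return 1.0 - cosine_similarity(u, v)

# Toy 2-d "embeddings" (hypothetical values, real embeddings have hundreds of dimensions)
a = [1.0, 0.0]
b = [0.0, 1.0]
c = [2.0, 0.0]

print(cosine_distance(a, c))  # same direction, different length
print(cosine_distance(a, b))  # orthogonal vectors
```

Cosine distance ignores vector length, so `a` and `c` count as identical here, while the orthogonal pair `a` and `b` is one unit apart.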

But what about the distance between two **sequences** of embeddings?

The Word Mover's Distance (WMD) algorithm might help!
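
To make the idea concrete, here is a toy sketch of the underlying earth mover's distance, under two simplifying assumptions: both sequences have the same number of vectors, and every vector carries weight 1. In that special case the cheapest way to move all the mass is a one-to-one pairing, so a brute-force search over pairings finds it. The function names and the toy vectors are hypothetical; a real solver is needed for anything larger than a handful of items.

```python
import itertools
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def toy_sequence_distance(seq1, seq2):
    """Cheapest total cost of pairing every vector in seq1 with a distinct
    vector in seq2 (equal lengths, weight 1 per vector). Brute force, so
    only suitable for very short sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("This toy version needs sequences of equal length")
    n = len(seq1)
    best = math.inf
    for perm in itertools.permutations(range(n)):
        cost = sum(euclidean(seq1[i], seq2[perm[i]]) for i in range(n))
        best = min(best, cost)
    return best

# Toy 2-d "sentence embeddings" (hypothetical values)
seq_a = [[0.0, 0.0], [1.0, 0.0]]
seq_b = [[1.0, 0.0], [0.0, 0.0]]   # same vectors as seq_a, different order
seq_c = [[0.0, 3.0], [1.0, 3.0]]   # seq_a shifted up by 3

print(toy_sequence_distance(seq_a, seq_b))  # order is ignored
print(toy_sequence_distance(seq_a, seq_c))  # each vector moves 3 units
```

The distance does not depend on the order of the items, which suits instruments whose questions cover the same ground in a different order.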

The code below comes from **Eve Cheng**'s branch and pull request: https://github.com/EveWCheng/harmony/blob/main/src/harmony/matching/wmd_matcher.py

In [3]:
!pip install wmd
!pip install harmonydata

Collecting wmd
  Downloading wmd-1.3.2.tar.gz (104 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wmd
  Building wheel for wmd (setup.py) ... done
  Created wheel for wmd: filename=wmd-1.3.2-cp310-cp310-linux_x86_64.whl size=1150986 sha256=230eb31372332abc8eb975655ea8956296fdb3e59be165dfd7388feac61bcccc
  Stored in directory: /root/.cache/pip/wheels/7e/09/7f/ebf39133074a0411263ce255a480293fb2e91bceaeed6a4141
Successfully built wmd
Installing collected packages: wmd
Successfully installed wmd-1.3.2
Collecting harmonydata
  Downloading harmonydata-0.5.2-py3-none-any.whl (92 kB)
Collecting pydantic==1.10.7 (from harmonydata)
  Downloading pydantic-1.10.7-cp310-cp310-manylinux_2_

In [5]:
from wmd import WMD
import numpy as np
import math
import libwmdrelax

def euclidean_dist(point1, point2):
    if len(point1) != len(point2):
        raise ValueError("Points must have the same number of dimensions")

    squared_distance = sum((p1 - p2) ** 2 for p1, p2 in zip(point1, point2))
    distance = math.sqrt(squared_distance)
    return distance

def par_to_vecs(par,vectorisation_function):
    return [vectorisation_function(sent) for sent in par]

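def dist(vecs1, vecs2):
    """Build the inputs for libwmdrelax: a pairwise distance matrix over the
    union of both sequences, plus one weight vector per sequence."""
    vec_union = list(vecs1 + vecs2)
    n1, n2 = len(vecs1), len(vecs2)
    n = len(vec_union)
    # Symmetric Euclidean distance matrix; the diagonal stays at zero.
    dist_ = np.zeros((n, n))
    for i in range(n):
        for j in range(i):
            dist_[i, j] = dist_[j, i] = euclidean_dist(vec_union[i], vec_union[j])

    # Each sentence gets weight 1 in its own sequence and 0 in the other.
    # Note: the weights are not normalised to sum to 1.
    nw1 = [1.0] * n1 + [0.0] * n2
    nw2 = [0.0] * n1 + [1.0] * n2
    return (np.array(dist_, dtype=np.float32),
            np.array(nw1, dtype=np.float32),
            np.array(nw2, dtype=np.float32))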


def pars_dist_emd_emdrelaxed(par1, par2, vectorisation_function):
    """Return (emd, emd_relaxed) between two lists of sentences."""
    # Scratch caches required by libwmdrelax, created with size 100.
    relax_cache = libwmdrelax.emd_relaxed_cache_init(100)
    cache = libwmdrelax.emd_cache_init(100)

    # Embed every sentence, then build the distance matrix and weights.
    vecs1 = par_to_vecs(par1, vectorisation_function)
    vecs2 = par_to_vecs(par2, vectorisation_function)
    dist_, nw1, nw2 = dist(vecs1, vecs2)
    emd = libwmdrelax.emd(nw1, nw2, dist_, cache)
    emd_relaxed = libwmdrelax.emd_relaxed(nw1, nw2, dist_, relax_cache)
    return emd, emd_relaxed


In [7]:
import harmony
import numpy as np
from harmony import match_instruments
import json
from wmd import WMD

def texts_similarity_matrix_benchmark(text_vectors):
    # Create numpy array of texts vectors
    # Get similarity with polarity
    vectors_pos, vectors_neg = harmony.matching.matcher.vectors_pos_neg(text_vectors)
    # Default to None so the return below cannot hit an unbound variable.
    pos_pairwise_similarity = None
    if vectors_pos.any():
        pos_pairwise_similarity = harmony.matching.matcher_utils.cosine_similarity(vectors_pos, vectors_pos)
    return pos_pairwise_similarity

def test_similarity():
    questions = ["I was bothered by things that usually don’t bother me.","I did not feel like eating; my appetite was poor.","I felt that I could not shake off the blues even with help from my family or friends.","I felt I was just as good as other people."]
    # The list above is immediately overridden by this shorter pair for a quick check.
    questions = ["lost my key", "found my car"]
    vectorisation_function = harmony.matching.default_matcher.convert_texts_to_vector
    text_vectors = harmony.matching.matcher.process_questions(questions)
    print(text_vectors)
    text_vectors = harmony.matching.matcher.vectorise_texts(text_vectors,vectorisation_function)
    print(texts_similarity_matrix_benchmark(text_vectors))
#   pip install harmonydata

def test_match_instruments_with_function():
    instruments = harmony.example_instruments["CES_D English"], harmony.example_instruments["GAD-7 Portuguese"]
    print(instruments[0])
    query = "Lost much sleep over worry?"
    vectorisation_function = harmony.matching.default_matcher.convert_texts_to_vector
    all_questions, similarity_with_polarity, query_similarity, new_vectors_dict=harmony.matching.matcher.match_instruments_with_function(instruments[1:10],query,vectorisation_function,[],[],np.zeros((0, 0)),{})
    print(all_questions)
    print(similarity_with_polarity)
#    print(query_similarity)
#    print(new_vectors_dict)
    np.savetxt("sim_with_polarity.txt", similarity_with_polarity, fmt='%d', delimiter='\t')

In [11]:
vectorisation_function = harmony.matching.default_matcher.convert_texts_to_vector
par1 = ["I want to go outside","oh outside is nice"]
par2 = ["I want to go outside maybe","oh outside is nice"]
par3 = ["You are a dog", "I love dogs"]
par4 = ["I am sad","are you sad"]

print ("Comparing two sequences to itself")

emd,emd_relaxed = pars_dist_emd_emdrelaxed(par1, par1,vectorisation_function)
print(emd)
print(emd_relaxed)

print ("Comparing", par4, par3)

emd,emd_relaxed = pars_dist_emd_emdrelaxed(par4,par3,vectorisation_function)
print(emd)
print(emd_relaxed)

print ("Comparing", par1, par3)

emd,emd_relaxed = pars_dist_emd_emdrelaxed(par1,par3,vectorisation_function)
print(emd)
print(emd_relaxed)

Comparing two sequences to itself
0.0
0.0
Comparing ['I am sad', 'are you sad'] ['You are a dog', 'I love dogs']
12.535743713378906
12.53574275970459
Comparing ['I want to go outside', 'oh outside is nice'] ['You are a dog', 'I love dogs']
13.033099174499512
13.033098220825195
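
In the output above, the second number of each pair (`emd_relaxed`) is never larger than the first (`emd`). That matches the relaxed distance described in the Kusner et al. paper, where each vector simply sends all of its weight to its nearest neighbour in the other sequence and the larger of the two directions is kept. The sketch below shows that idea in plain Python for equal-length sequences with weight 1 per vector. It is an illustration only, not the `libwmdrelax` implementation, and the function names and toy vectors are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def one_sided_cost(src, dst):
    """Every source vector sends its whole weight (1.0) to its nearest
    destination vector, ignoring how much each destination receives."""
    return sum(min(euclidean(s, d) for d in dst) for s in src)

def relaxed_distance(seq1, seq2):
    """Larger of the two one-sided costs."""
    return max(one_sided_cost(seq1, seq2), one_sided_cost(seq2, seq1))

# Toy 2-d "sentence embeddings" (hypothetical values)
seq_a = [[0.0, 0.0], [4.0, 0.0]]
seq_b = [[0.0, 1.0], [0.0, 2.0]]

print(one_sided_cost(seq_a, seq_b))   # seq_a -> seq_b
print(one_sided_cost(seq_b, seq_a))   # seq_b -> seq_a
print(relaxed_distance(seq_a, seq_b))
```

For these toy vectors the best one-to-one pairing costs `1 + sqrt(20)`, while the relaxed value is `1 + sqrt(17)`, because both vectors of `seq_a` are allowed to pick the same nearest neighbour.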
