In [1]:
#| hide
#| default_exp simple_sentence_similarity

In [2]:
#| hide
%matplotlib inline

# Simple Sentence Similarity
(follows: https://github.com/nlptown/nlp-notebooks/blob/master/Simple%20Sentence%20Similarity.ipynb)

## Data

### STS Benchmark

The STS Benchmark gathers the English data from the SemEval sentence similarity tasks (2012-2017). The data is split into training, development, and test sets (URL: http://ixa2.si.ehu.es/stswiki/index.php/STSbenchmark).

In [3]:
#| export
import pandas as pd
import numpy as np
import scipy
import math
import os
import tensorflow as tf
import matplotlib.pyplot as plt
import seaborn as sns


In [4]:
#| export
def load_sts_dataset(filename):
  """ Loads a subset of the STS dataset into a DataFrame. In particular both sentences and their human rated similarity score."""
  sent_pairs = []
  with open(filename, "r") as f:
    for line in f:
      ts = line.strip().split('\t')
      sent_pairs.append((ts[5], ts[6], float(ts[4])))
  return pd.DataFrame(sent_pairs, columns=["sent_1", "sent_2", "sim"])

"""
# commented out: We will use the local downloaded files
def download_and_load_sts_data():
  # We will grab the STS datasets from their website.
  sts_dataset = tf.keras.utils.get_file(
    fname="Stsbenchmark.tar.gz",
    origin="http://ixa2.si.ehu.es/stswiki/images/4/48/Stsbenchmark.tar.gz",
    extract=True
  )
"""

sts_dev = load_sts_dataset('/home/peter/Documents/data/nlp/stsbenchmark/sts-dev.csv')
sts_test = load_sts_dataset('/home/peter/Documents/data/nlp/stsbenchmark/sts-test.csv')
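
As a small illustration of the column layout that `load_sts_dataset` relies on, the sketch below parses one tab-separated line. The first four column values are placeholders chosen for this example (assumptions, not values copied from the files); only the positions of the score (index 4) and the two sentences (indices 5 and 6) are taken from the function above.

```python
def parse_sts_line(line):
    """Split one tab-separated STS line into (sent_1, sent_2, score)."""
    fields = line.rstrip("\n").split("\t")
    # Index 4 holds the similarity score, indices 5 and 6 hold the two sentences.
    return fields[5], fields[6], float(fields[4])

# Illustrative line: the first four fields are placeholder metadata values.
example = (
    "genre\tfile\tyear\t0001\t5.000\t"
    "A man with a hard hat is dancing.\t"
    "A man wearing a hard hat is dancing."
)

sent_1, sent_2, sim = parse_sts_line(example)
print(sent_1, "|", sent_2, "|", sim)
```

Lines with fewer than seven fields would raise an `IndexError` here, just as they would in `load_sts_dataset`.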

In [5]:
#| hide
sts_dev[:5]

Unnamed: 0,sent_1,sent_2,sim
0,A man with a hard hat is dancing.,A man wearing a hard hat is dancing.,5.0
1,A young child is riding a horse.,A child is riding a horse.,4.75
2,A man is feeding a mouse to a snake.,The man is feeding a mouse to the snake.,5.0
3,A woman is playing the guitar.,A man is playing guitar.,2.4
4,A woman is playing the flute.,A man is playing a flute.,2.75


In [6]:
#| hide
sts_test[:5]

Unnamed: 0,sent_1,sent_2,sim
0,A girl is styling her hair.,A girl is brushing her hair.,2.5
1,A group of men play soccer on the beach.,A group of boys are playing soccer on the beach.,3.6
2,One woman is measuring another woman's ankle.,A woman measures another woman's ankle.,5.0
3,A man is cutting up a cucumber.,A man is slicing a cucumber.,4.2
4,A man is playing a harp.,A man is playing a keyboard.,1.5


### SICK data

The SICK dataset contains roughly 10,000 English sentence pairs, each labelled with a semantic relatedness score and an entailment relation.

In [7]:
#| export
import requests

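def download_sick(f):
    """Download a tab-separated SICK file from URL `f` and return it as a DataFrame."""
    response = requests.get(f)
    response.raise_for_status()  # fail early on HTTP errors instead of parsing an error page

    # Skip the header row, drop empty lines, and keep only well-formed 5-column rows.
    lines = response.text.split("\n")[1:]
    lines = [l.split("\t") for l in lines if len(l) > 0]
    lines = [l for l in lines if len(l) == 5]

    df = pd.DataFrame(lines, columns=["idx", "sent_1", "sent_2", "sim", "label"])
    df['sim'] = pd.to_numeric(df['sim'])
    return df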
    
sick_train = download_sick("https://raw.githubusercontent.com/alvations/stasis/master/SICK-data/SICK_train.txt")
sick_dev = download_sick("https://raw.githubusercontent.com/alvations/stasis/master/SICK-data/SICK_trial.txt")
sick_test = download_sick("https://raw.githubusercontent.com/alvations/stasis/master/SICK-data/SICK_test_annotated.txt")
sick_all = pd.concat([sick_train, sick_test, sick_dev])

In [8]:
#| test
sick_all[:5]

Unnamed: 0,idx,sent_1,sent_2,sim,label
0,1,A group of kids is playing in a yard and an ol...,A group of boys in a yard is playing and a man...,4.5,NEUTRAL
1,2,A group of children is playing in the house an...,A group of kids is playing in a yard and an ol...,3.2,NEUTRAL
2,3,The young boys are playing outdoors and the ma...,The kids are playing outdoors near a man with ...,4.7,ENTAILMENT
3,5,The kids are playing outdoors near a man with ...,A group of kids is playing in a yard and an ol...,3.4,NEUTRAL
4,9,The young boys are playing outdoors and the ma...,A group of kids is playing in a yard and an ol...,3.7,NEUTRAL


## Preparation

Some of the models we will use require **tokenization**; others do not. We define a simple `Sentence` class that keeps both the raw sentence and the tokenized sentence, so each method can pick the input it needs.

In [9]:
#| export
import nltk

STOP = set(nltk.corpus.stopwords.words('english'))

class Sentence:
  def __init__(self, sentence):
    self.raw = sentence
    normalized_sentence = sentence.replace("‘", "'").replace("’", "'")
    self.tokens = [t.lower() for t in nltk.word_tokenize(normalized_sentence)]
    self.tokens_without_stop = [t for t in self.tokens if t not in STOP]
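
To make the three views of a sentence concrete, here is a minimal stand-alone sketch. It does not use NLTK: `toy_tokenize` is a rough regex stand-in for `nltk.word_tokenize`, and `TOY_STOP` is a tiny hand-picked stoplist. Both are assumptions made for illustration, so the real `Sentence` class may tokenize and filter differently.

```python
import re

# Assumption: a tiny hand-picked stoplist standing in for NLTK's English stopwords.
TOY_STOP = {"a", "an", "the", "is", "are", "with"}

def toy_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    # Normalize curly apostrophes to straight ones, as the Sentence class does.
    normalized = text.replace("‘", "'").replace("’", "'")
    return re.findall(r"\w+|[^\w\s]", normalized.lower())

raw = "A man with a hard hat is dancing."
tokens = toy_tokenize(raw)
tokens_without_stop = [t for t in tokens if t not in TOY_STOP]

print(tokens)
print(tokens_without_stop)
```

Note that punctuation survives the stopword filter in this sketch, because the stoplist contains only words.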

Next, we will use the `gensim` library to load two sets of pre-trained word embeddings: `word2vec` and `GloVe`:

In [13]:
#| export
import gensim
from gensim.models import Word2Vec
from gensim.scripts.glove2word2vec import glove2word2vec

PATH_TO_WORD2VEC = os.path.expanduser("~/Documents/data/nlp/GoogleNews-vectors-negative300.bin")
PATH_TO_GLOVE = os.path.expanduser("~/Documents/data/nlp/glove.840B.300d.txt")

word2vec = gensim.models.KeyedVectors.load_word2vec_format(PATH_TO_WORD2VEC, binary=True)

In order to load the GloVe file we downloaded, we need to convert it into word2vec format and then load the embeddings into a `gensim` model. This will take some time:
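
The main difference between the two text formats is a header: the word2vec text format starts with a line holding the vocabulary size and the vector dimension, while a GloVe file starts directly with the first vector. The sketch below shows that conversion idea on a toy file. `add_word2vec_header` is a hypothetical helper written for this example (not a gensim function), and it assumes every line is a token followed by space-separated floats.

```python
import os
import tempfile

def add_word2vec_header(glove_path, out_path):
    """Copy a GloVe text file to out_path, prefixed with a 'vocab_size dim' header."""
    # First pass: infer the dimension from the first line and count the lines.
    with open(glove_path, encoding="utf-8") as src:
        first = src.readline().rstrip("\n").split(" ")
        dim = len(first) - 1
        vocab_size = 1 + sum(1 for _ in src)
    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dim}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dim

# Toy two-word, three-dimensional "GloVe" file.
with tempfile.TemporaryDirectory() as tmp:
    glove_file = os.path.join(tmp, "toy_glove.txt")
    w2v_file = os.path.join(tmp, "toy_glove.w2v.txt")
    with open(glove_file, "w", encoding="utf-8") as f:
        f.write("cat 0.1 0.2 0.3\ndog 0.4 0.5 0.6\n")
    vocab_size, dim = add_word2vec_header(glove_file, w2v_file)
    with open(w2v_file, encoding="utf-8") as f:
        header = f.readline().strip()

print(header)
```

Newer gensim releases (4.x) also accept `no_header=True` in `KeyedVectors.load_word2vec_format`, which may make a separate conversion step unnecessary.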

In [15]:
#| hide
tmp_file = '/home/peter/Documents/data/nlp/glove.840B.300d.w2v.txt'
glove2word2vec(PATH_TO_GLOVE, tmp_file)
glove = gensim.models.KeyedVectors.load_word2vec_format(tmp_file)


EOFError: unexpected end of input; is count incorrect or file otherwise damaged?

To compute weighted averages of word embeddings at a later stage, we load a file of word frequencies collected from Wikipedia:

In [17]:
#| export
import csv

PATH_TO_FREQUENCIES_FILE = os.path.expanduser('~/Documents/data/nlp/frequencies.tsv')
PATH_TO_DOC_FREQUENCIES_FILE = os.path.expanduser('~/Documents/data/nlp/doc_frequencies.tsv')

def read_tsv(f):
  frequencies = {}
  with open(f) as tsv:
    tsv_reader = csv.reader(tsv, delimiter='\t')
    for row in tsv_reader:
      frequencies[row[0]] = int(row[1])
  return frequencies

frequencies = read_tsv(PATH_TO_FREQUENCIES_FILE)
doc_frequencies = read_tsv(PATH_TO_DOC_FREQUENCIES_FILE)
doc_frequencies["NUM_DOCS"] = 1_288_431
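
One common way to turn such document frequencies into weights is TF-IDF: a token's count in the sentence multiplied by the log of the inverse document frequency. The sketch below is one possible formulation under stated assumptions. The `+ 1` smoothing, the helper name `tf_idf_weights`, and the toy counts in `toy_doc_freqs` are all chosen for illustration and are not necessarily what the benchmark code uses.

```python
import math
from collections import Counter

# Toy document frequencies (assumed values); NUM_DOCS mirrors the key used above.
toy_doc_freqs = {"man": 50, "hat": 5, "NUM_DOCS": 1000}

def tf_idf_weights(tokens, doc_freqs):
    """Return {token: term_count * log(N / (doc_freq + 1))} for one sentence."""
    n_docs = doc_freqs["NUM_DOCS"]
    token_counts = Counter(tokens)
    return {
        # Unseen tokens get doc_freq 0, so the +1 avoids division by zero.
        tok: count * math.log(n_docs / (doc_freqs.get(tok, 0) + 1))
        for tok, count in token_counts.items()
    }

weights = tf_idf_weights(["man", "hat", "hat", "zeppelin"], toy_doc_freqs)
for token, weight in weights.items():
    print(f"{token}: {weight:.3f}")
```

Rare or unseen tokens receive the largest weights, while frequent tokens contribute less to the sentence average.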

In [23]:
#| hide
len(doc_frequencies)

3388134

## Similarity methods

### Baseline

The simplest way of computing a sentence embedding is to take the embeddings of the words in the sentence (minus the stopwords) and compute their average, weighted by the sentence frequency of each word.

We can then use **cosine similarity** to measure how similar two sentence embeddings are.
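
The averaging-plus-cosine idea can be sketched without any embedding model. The two-dimensional vectors in `toy_vectors` are made up for this example (real embeddings have hundreds of dimensions), and `average_embedding` and `cosine` are hypothetical helpers, not part of the benchmark code.

```python
import math

# Assumption: made-up 2-dimensional "embeddings" for illustration only.
toy_vectors = {
    "man": [1.0, 0.0],
    "woman": [0.9, 0.1],
    "guitar": [0.0, 1.0],
    "flute": [0.1, 0.9],
}

def average_embedding(tokens, vectors, weights=None):
    """Weighted average of the vectors of all tokens found in `vectors`."""
    known = [t for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token has an embedding")
    # Default to equal weights when none are given.
    w = [1.0 if weights is None else weights.get(t, 1.0) for t in known]
    total = sum(w)
    dim = len(vectors[known[0]])
    return [
        sum(wi * vectors[t][d] for wi, t in zip(w, known)) / total
        for d in range(dim)
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

emb_1 = average_embedding(["man", "guitar"], toy_vectors)
emb_2 = average_embedding(["woman", "guitar"], toy_vectors)
emb_3 = average_embedding(["flute"], toy_vectors)

sim_close = cosine(emb_1, emb_2)
sim_far = cosine(emb_1, emb_3)
print(sim_close, sim_far)
```

Cosine similarity depends only on the angle between the vectors, so sentences of different lengths can be compared without further normalization.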

In [28]:
#| export
from sklearn.metrics.pairwise import cosine_similarity
from collections import Counter
import math

def run_avg_benchmark(sentences1, sentences2, model=None, use_stoplists=False, doc_freqs=None):
  if doc_freqs is not None:
    N = doc_freqs['NUM_DOCS']

  sims = []
  for (sent1, sent2) in zip(sentences1, sentences2):
    