In [3]:
with open('trial/STS.input.txt', 'r') as f:
    input_text = f.read()

In [4]:
dataset = []
for line in input_text.splitlines():
    identifier, sentence1, sentence2 = line.split('\t')
    dataset.append({'id': identifier, 's1': sentence1, 's2': sentence2})

In [8]:
import nltk
from nltk import pos_tag
from nltk.metrics import jaccard_distance
from nltk.stem import WordNetLemmatizer
# nltk.download('punkt')

wnl = WordNetLemmatizer()

def lemmatize(p):
    # p is a (token, POS tag) pair; only nouns (N*) and verbs (V*) are lemmatized
    if p[1][0] in {'N','V'}:
        return wnl.lemmatize(p[0].lower(), pos=p[1][0].lower())
    return p[0]

similarities = []
for row in dataset:
    identifier = row['id']
    # We tokenize the sentences so that, for instance, the final word
    # of a sentence and the punctuation mark that follows it are not
    # treated as a single token.
    s1 = nltk.word_tokenize(row['s1'])
    s2 = nltk.word_tokenize(row['s2'])
    # POS tags (i.e. word categories) are required by the lemmatizer:
    # they reduce ambiguity so that it can pick the right lemma.
    s1_pos = pos_tag(s1)
    s2_pos = pos_tag(s2)
    s1_lemmas = [lemmatize(pair) for pair in s1_pos]
    s2_lemmas = [lemmatize(pair) for pair in s2_pos]
    # Once the tokens are lemmatized (i.e. reduced to their canonical form),
    # we compute the Jaccard distance between the two sets of lemmas.
    similarity = jaccard_distance(set(s1_lemmas), set(s2_lemmas))
    similarities.append(similarity)
similarities


[0.6923076923076923,
 0.6666666666666666,
 0.5333333333333333,
 0.5454545454545454,
 0.7692307692307693,
 0.8620689655172413]

In [10]:
from scipy.stats import pearsonr
# Gold-standard scores: the second whitespace-separated field of each line
with open('trial/STS.gs.txt', 'r') as f:
    refs = [int(line.split()[1]) for line in f]
# Values computed in the previous cell, one per sentence pair
tsts = list(similarities)
pearsonr(refs, tsts)[0]

0.49067081037569205

Answer justification: The Pearson correlation is higher than the one obtained in the previous session (0.49 vs 0.39) because lemmatization maps each word to its canonical form, so all the inflected forms of a word are treated as the same token. Since we want to measure semantic distance, this helps: in word-level comparisons, morphological information (at least in English) mostly introduces "noise". However, the result is still relatively low, because word-level comparisons miss the deeper semantic information carried by sequences of words (i.e. the sentence as a whole).
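
To illustrate the effect described above, here is a minimal sketch that does not depend on NLTK. The `TOY_LEMMAS` table and the `toy_lemmatize` helper are hypothetical stand-ins for the POS tagger plus `WordNetLemmatizer`, and `jaccard_distance` is reimplemented by hand from its usual definition, so the numbers are only meant to show the mechanism, not to reproduce the results of the notebook.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

# Hypothetical toy lemma table (assumption: stands in for a real lemmatizer)
TOY_LEMMAS = {"dogs": "dog", "barked": "bark", "barks": "bark"}

def toy_lemmatize(tokens):
    """Replace each token by its toy lemma, leaving unknown tokens unchanged."""
    return [TOY_LEMMAS.get(token, token) for token in tokens]

s1 = "the dogs barked".split()
s2 = "the dog barks".split()

# On raw tokens only "the" is shared (1 of 5 distinct tokens)
raw_distance = jaccard_distance(set(s1), set(s2))
# On lemmas both sentences become {"the", "dog", "bark"}
lemma_distance = jaccard_distance(set(toy_lemmatize(s1)), set(toy_lemmatize(s2)))

print(raw_distance, lemma_distance)
```

In this toy case the inflected forms hide almost all of the overlap on raw tokens, while the lemma sets coincide and the distance drops to zero.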

In general, we expect lemmas to perform better than raw words (although they will still miss synonyms, for instance). In some cases, however, they could perform worse. In particular, we can think of at least three such cases:
1 - Cases in which the lemmatizer (or the POS tagger) fails, since taggers are not infallible.
2 - Cases in which the two texts are word-by-word identical except for one word that has the same lemma but a different morpheme (e.g. singular vs plural), where that morpheme carries semantically relevant information. Here the desired similarity would be almost 100%, but not exactly 100%, whereas the lemma-based comparison treats the texts as identical. Even so, it would still work quite well.
3 - Cases in which morphemes carry semantically relevant information, which is probably more important in languages other than English.
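
As a rough illustration of case 2 and of the synonym limitation, the sketch below again uses a hand-written Jaccard distance and a hypothetical `TOY_LEMMAS` table instead of the real tagger and lemmatizer. The example sentences are made up for this purpose and are not taken from the trial data.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0
    return 1 - len(a & b) / len(union)

# Hypothetical toy lemma table (assumption: stands in for a real lemmatizer)
TOY_LEMMAS = {"books": "book", "bought": "buy", "drives": "drive"}

def lemma_set(sentence):
    """Whitespace-tokenize a sentence and map each token to its toy lemma."""
    return {TOY_LEMMAS.get(token, token) for token in sentence.split()}

# Case 2: singular vs plural changes the meaning slightly,
# but both sentences collapse to the same set of lemmas.
plural_distance = jaccard_distance(
    lemma_set("she bought the book"),
    lemma_set("she bought the books"),
)

# Synonyms: "car" and "automobile" have different lemmas,
# so they still count as different tokens.
synonym_distance = jaccard_distance(
    lemma_set("he drives a car"),
    lemma_set("he drives an automobile"),
)

print(plural_distance, synonym_distance)
```

The first pair gets a distance of zero, so the number distinction is lost entirely, while the second pair keeps a nonzero distance even though the two sentences mean nearly the same thing.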