### In this exercise we see the definition of a false friends word detection algorithm. The general definition of a false friend is that of two almost homonymous words that share many characters in common but which differ greatly in meaning.
We used the lexical resource WordNet to access the different meanings (synsets) of the terms.

To understand whether or not two words are lexically similar, we decided to look at their **Edit Distance**, that is the minimum number of operations of insertion, removal, modification, to transform one string into another.

After that, to check if two words are False Friends we checked their **Wu & Palmer similarity** making sure it is less than a certain threshold.

This way, terms with **high lexical similarity** and **low semantic similarity** are good candidates to be False Friends.

In [1]:
from random import seed, randint
from nltk import edit_distance
from nltk.corpus import semcor, wordnet
from nltk import Tree
from itertools import combinations
from typing import List, Union, Tuple

### Data Load

In [2]:
# SemCor tagged corpus
tagged_sentences = semcor.tagged_sents(tag='sem')

### Core Functions

In [3]:
# Compute |sentence_num| random sentences from the SemCor corpus
def compute_random_sentences(sentence_num: int, custom_seed: int=None) -> List[str]:
    seed(custom_seed)
    max_index = 10000
    
    indices = set()
    while len(indices) != sentence_num:
        index = randint(0, max_index)
        indices.add(index)
    
    sentences = [tagged_sentences[index] for index in indices]
    return sentences


# Get content words from the random sentences got before
def get_content_words(tagged_sentences: List[List[Union[str, Tree]]]) -> List[str]:
    content_words = []
    for sentence in tagged_sentences:
        for word in sentence:
            if type(word) is Tree and type(word.label()) != str and word.label().synset().pos() in ["n", "v", "s", "r"] and len(word[0]) > 3:
                try:
                    content_words.append(word[0].lower())
                except:
                    content_words.extend([el.lower() for el in word[0]]) # The element is a multi-word expression

    
    return content_words


# Compute pairs that are lexically similar, it uses the edit distance to measure it (edit distance < threshold)
def compute_close_pairs(content_words: List[str], threshold: int=2) -> List[Tuple[str, str]]:
    pairs = set(combinations(content_words, 2))
    close_pairs = [pair for pair in pairs if edit_distance(pair[0], pair[1]) < threshold] # Only keeps words lexically close
    close_pairs = [pair for pair in close_pairs if edit_distance(pair[0], pair[1]) != 0] # Delete pairs of identical words
    return close_pairs


# Compute False Friends, i.e. words that are lexically similar but semantically different. It uses the Wu & Palmer similarity to measure it
def compute_false_friends(close_pairs: List[Tuple[str, str]], threshold: float) -> List[Tuple[str, str]]:
    false_friends = []
    for pair in close_pairs:
        synsets_1 = wordnet.synsets(pair[0])
        synsets_2 = wordnet.synsets(pair[1])
        if not (synsets_1 and synsets_2):
            continue

        similarity = wordnet.wup_similarity(synsets_1[0], synsets_2[0]) # wu&palmer similarity between the first wordnet synsets
        if similarity < threshold:
            false_friends.append(pair)
    
    return false_friends

### False Friends Computation

In [4]:
# Hyperparameters
sentence_num = 20
edit_threshold = 2
similarity_threshold = 0.3

sentences = compute_random_sentences(sentence_num)
content_words = get_content_words(sentences)
close_pairs = compute_close_pairs(content_words, threshold=edit_threshold)
false_friends = compute_false_friends(close_pairs, threshold=similarity_threshold)

print(false_friends)

[('lost', 'most'), ('fact', 'face')]
