# Introduction to Human Language Technology - Lab. Session 7: Word Sequences
### Authors: Rueben Álvarez and Albert Espín

### 1. Sentence pairs, word tokenization and named-entity tagging.
The following cells read the pairs of sentences from the file "STS.input.txt" of the trial corpus and store each pair as a tuple, with one element per sentence. Each sentence is converted into a set of words and named entities, obtained through word tokenization followed by named-entity tagging and chunking with the NLTK named-entity chunker. A function that tags with the Stanford NLP named-entity tagger is also defined, but it is not used in the later calculations because it keeps the terms of multi-word named entities in separate strings, whereas the NLTK chunker groups them into a single string, as we want. Note that with the current NLTK version (3.3) the API syntax differs from the one in the lab statement, but the result is the same. The sentences are not lowercased before tagging, since that would make the named-entity tagger fail to identify named entities.

In [1]:
import nltk
import nltk.corpus
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.metrics import jaccard_distance
from nltk.parse import CoreNLPParser

# resources needed by ne_chunk (uncomment and run once)
# nltk.download('maxent_ne_chunker')
# nltk.download('words')

# Stanford named-entity tagger (expects a CoreNLP server to be reachable at the URL below)
named_entity_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')

In [8]:
def get_stanford_named_entity_chunked_sentence(sentence):
    """Given the passed sentence string, returns an unordered list with the chunks (words and named entities) it contains, using Stanford NLP"""
        
    # obtain an array with the sentence tokens
    tokenized_sentence = nltk.word_tokenize(sentence)
    
    # tag each word to classify it as a normal word or a named entity (e.g. a person or an organization);
    # the set removes duplicate (token, tag) pairs, so the original word order is not preserved
    tagged_sentence = set(named_entity_tagger.tag(tokenized_sentence))
    
    chunked_sentence = []
    for tagged_token in tagged_sentence:
        
        token = tagged_token[0]
        tag = tagged_token[1]
        
        # make normal words have lower case, also discard punctuation marks
        if tag == 'O':
            if token.isalnum():
                chunked_sentence.append(token.lower())
         
        # keep named entities with the original capitalization
        else:
            chunked_sentence.append(token)
    
    return chunked_sentence

# example: note that the Stanford NLP named-entity tagger does not group the terms of a named entity into a single string
print(get_stanford_named_entity_chunked_sentence("Mark Pedersen and John Smith are working at Google since 1994 for 1000$ per week."))

['Smith', 'and', '1994', 'per', 'Google', 'Mark', 'week', 'working', 'for', '$', 'at', 'since', '1000', 'Pedersen', 'John', 'are']


In [9]:
def get_named_entity_chunked_sentence(sentence):
    """Given the passed sentence string, returns a set with the chunks (words and named entities) it contains, using NLTK"""
        
    # obtain an array with the sentence tokens
    tokenized_sentence = nltk.word_tokenize(sentence)
    
    # POS-tag the tokens, then chunk them with the NLTK named-entity chunker
    chunk_tree = ne_chunk(pos_tag(tokenized_sentence), binary=True)
    
    chunked_sentence = []
    for chunk in chunk_tree:
        
        # keep named entities with the original capitalization
        if hasattr(chunk, 'label'):
            token = ' '.join(term[0] for term in chunk)
            chunked_sentence.append(token)
            
        # make normal words have lower case, also discard punctuation marks
        else:
            token = chunk[0]
            if token.isalnum():
                chunked_sentence.append(token.lower())
    
    return set(chunked_sentence)

# example: note that the NLTK named-entity chunker does group the terms of a named entity into a single string
print(get_named_entity_chunked_sentence("Mark Pedersen and John Smith are working at Google since 1994 for 1000$ per week."))

{'and', 'are', 'working', 'Google', '1000', 'per', 'Mark Pedersen', '1994', 'since', 'at', 'week', 'for', 'John Smith'}


In [5]:
import os

# full path of the corpus file, assuming that the "trial" folder containing the input file is in the notebook's working directory
absolute_file_path = os.path.join(os.getcwd(), "trial", "STS.input.txt")

# find all sentence pairs in the document
sentence_pairs = []
with open(absolute_file_path) as f:
    lines = f.readlines()
    for line in lines:
        index, sentence0, sentence1 = line.split("\t")
        sentence_pairs.append((get_named_entity_chunked_sentence(sentence0), get_named_entity_chunked_sentence(sentence1)))
        print("First sentence: \t", sentence0, "\nSecond sentence: \t", sentence1, "\n")
    print()  
    
# the pairs of sentences are shown
for pair in sentence_pairs:
    print("{}\n{}\n\n".format(pair[0], pair[1]))

First sentence: 	 The bird is bathing in the sink. 
Second sentence: 	 Birdie is washing itself in the water basin.
 

First sentence: 	 In May 2010, the troops attempted to invade Kabul. 
Second sentence: 	 The US army invaded Kabul on May 7th last year, 2010.
 

First sentence: 	 John said he is considered a witness but not a suspect. 
Second sentence: 	 "He is not a suspect anymore." John said.
 

First sentence: 	 They flew out of the nest in groups. 
Second sentence: 	 They flew into the nest together.
 

First sentence: 	 The woman is playing the violin. 
Second sentence: 	 The young lady enjoys listening to the guitar.
 

First sentence: 	 John went horse back riding at dawn with a whole group of friends. 
Second sentence: 	 Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.
 


{'is', 'bathing', 'in', 'sink', 'bird', 'the'}
{'basin', 'itself', 'is', 'washing', 'in', 'Birdie', 'water', 'the'}


{'to', 'invade', 'may', 'in', 'attempted', '2010', 

### 2. Sentence similarity calculation using words and named entities, compared with the gold standard.
Each pair of sentences is compared using the Jaccard distance between their sets of words and named entities: the more elements two sentences share, the smaller the distance and the more alike they are considered to be. Each pair is shown along with its distance and its dissimilarity score, i.e. the distance scaled to the 0-5 range and rounded so that it is comparable with the gold standard score.
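As a worked illustration of the distance and the scaling, the following minimal sketch recomputes the values for the first trial pair by hand. The token sets are copied from the output shown above; the helper name `jaccard_distance_sketch` is hypothetical, and returning 0.0 for two empty sets is an assumption made here for convenience.

```python
# Token sets copied from the first trial pair in the output above
first = {'is', 'bathing', 'in', 'sink', 'bird', 'the'}
second = {'basin', 'itself', 'is', 'washing', 'in', 'Birdie', 'water', 'the'}


def jaccard_distance_sketch(a, b):
    """Jaccard distance: (|a ∪ b| - |a ∩ b|) / |a ∪ b| (hypothetical helper)."""
    union = a | b
    if not union:
        # assumption: two empty sets are treated as identical
        return 0.0
    return (len(union) - len(a & b)) / len(union)


# 3 shared elements ('is', 'in', 'the') out of 11 distinct ones
distance = jaccard_distance_sketch(first, second)

# scale the [0, 1] distance to the 0-5 range and round, as in the cell below
score = round(distance * 5)

print(round(distance, 3), score)  # 0.727 4
```

The result agrees with the distance (0.727) and score (4) reported for this pair in the output of the next cell.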

In [6]:
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

# gold standard file path, assuming that the "trial" folder is in the notebook's working directory
absolute_file_path = os.path.join(os.getcwd(), "trial", "STS.gs.txt")

# get the gold standard scores
gold_scores = []
with open(absolute_file_path) as f:
    lines = f.readlines()
    for line in lines:
        _, score = line.split("\t")
        gold_scores.append(int(score))
        
word_ne_scores = []

# compute the Jaccard distance to see how similar or different two sentences are
for i in range(len(sentence_pairs)):
    pair = sentence_pairs[i]
    word_ne_dist = jaccard_distance(set(pair[0]), set(pair[1]))
    word_ne_score = round(word_ne_dist * 5)
    word_ne_scores.append(word_ne_score)
    print("First sentence words and named entities: ", pair[0], "\nSecond sentence words and named entities: ", pair[1], "\nWord-and-named-entity-based distance:", round(word_ne_dist, 3), "\nWord-and-named-entity-based dissimilarity score:", word_ne_score, "\nGold standard dissimilarity score:", gold_scores[i], "\n") 

# Pearson correlation between the tested and the gold standard scores
word_ne_pearson = pearsonr(word_ne_scores, gold_scores)
print("Pearson correlation between word-and-named-entity-based method and gold standard:", round(word_ne_pearson[0], 3))

First sentence words and named entities:  {'is', 'bathing', 'in', 'sink', 'bird', 'the'} 
Second sentence words and named entities:  {'basin', 'itself', 'is', 'washing', 'in', 'Birdie', 'water', 'the'} 
Word-and-named-entity-based distance: 0.727 
Word-and-named-entity-based dissimilarity score: 4 
Gold standard dissimilarity score: 5 

First sentence words and named entities:  {'to', 'invade', 'may', 'in', 'attempted', '2010', 'troops', 'Kabul', 'the'} 
Second sentence words and named entities:  {'invaded', 'last', 'may', '7th', '2010', 'year', 'US', 'on', 'army', 'Kabul', 'the'} 
Word-and-named-entity-based distance: 0.75 
Word-and-named-entity-based dissimilarity score: 4 
Gold standard dissimilarity score: 4 

First sentence words and named entities:  {'not', 'is', 'a', 'he', 'considered', 'witness', 'said', 'suspect', 'John', 'but'} 
Second sentence words and named entities:  {'not', 'is', 'a', 'he', 'anymore', 'said', 'suspect', 'John'} 
Word-and-named-entity-based distance: 0.36

### 3. Explanation of the difference between using words along with named entities, and the gold standard results.
The scores obtained with our current approach (which considers words and named entities) are very different from those in the gold standard: the Pearson correlation is roughly 0.21, where 0 means no linear correlation and 1 means perfect positive correlation. This happens because the gold standard uses more advanced techniques to determine how similar two sentences are. In particular, the method used here relies only on word tokenization and on the grouping of words that represent the same concept, such as a person's first name and surname.
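To make the meaning of the correlation value more concrete, here is a minimal sketch of the Pearson coefficient written out by hand. The function name `pearson_sketch` and the toy score lists are hypothetical and are not taken from the trial data; the notebook itself relies on `scipy.stats.pearsonr`.

```python
from math import sqrt


def pearson_sketch(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# toy score lists (hypothetical, for illustration only)
reference = [1, 2, 3, 4]

aligned = pearson_sketch(reference, [2, 4, 6, 8])      # scores rise together
reversed_ = pearson_sketch(reference, [8, 6, 4, 2])    # one rises while the other falls
unrelated = pearson_sketch(reference, [1, -1, -1, 1])  # no linear relation

print(round(aligned, 3), round(reversed_, 3), round(unrelated, 3))
```

The three toy cases give values at the extremes and the middle of the coefficient's range, which helps to place a value such as 0.21 much closer to "no linear relation" than to full agreement.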

If we put the results in context with previous sessions, we recall that using only words gave exactly the same result for the trial sentences (0.21 correlation), while using lemmas (0.36) and synsets (0.50) proved to be more accurate. We obtain the same result as with words alone because the six pairs of trial sentences do not contain any multi-term named entity.

To evaluate more rigorously how useful named entities are alongside words, however, we should check the results on the training data of the IHLT course project, which contains a larger number of sentences, some of which may contain multi-term named entities. In those cases, the named entities will match (and thus affect the similarity calculation) only if both sentences in a pair contain the full multi-term named entity. For example, if one sentence has "John Smith" and the other has "John", no match is found. With words alone, "John" would have matched and only "Smith" would have been left unmatched. The two sentences may well refer to the same person even if one uses only the first name and the other adds the surname, and the named-entity-based approach used in this session would fail to identify this because it is too strict.
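The effect described above can be sketched with two hand-built token sets. The sentences and the helper name `jaccard_distance_sketch` are hypothetical, chosen only to illustrate the "John Smith" versus "John" case; they do not come from the corpus.

```python
def jaccard_distance_sketch(a, b):
    """Jaccard distance: (|a ∪ b| - |a ∩ b|) / |a ∪ b| (hypothetical helper)."""
    union = a | b
    if not union:
        # assumption: two empty sets are treated as identical
        return 0.0
    return (len(union) - len(a & b)) / len(union)


# hypothetical sentences with multi-term named entities grouped into one string
grouped_a = {'John Smith', 'works', 'at', 'Google'}
grouped_b = {'John', 'works', 'at', 'Google'}

# the same sentences with every word kept as a separate token
words_a = {'John', 'Smith', 'works', 'at', 'Google'}
words_b = {'John', 'works', 'at', 'Google'}

grouped_distance = jaccard_distance_sketch(grouped_a, grouped_b)
words_distance = jaccard_distance_sketch(words_a, words_b)

# grouping makes 'John Smith' and 'John' two unrelated elements,
# so the pair looks more different than with plain words
print(grouped_distance, words_distance)  # 0.4 0.2
```

In this toy case the grouped representation doubles the distance, because neither "John Smith" nor "John" finds a counterpart in the other sentence.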

It is not worthwhile to compute named entities if they are only going to be compared at a lexical level alongside words, as done here. In that case, using words alone gives an equal or similar result in most sentence comparisons and avoids the extra computational cost of named-entity recognition.

Using named entities would be more interesting in combination with semantic approaches such as the synset-comparison techniques used in past sessions, which on their own proved more accurate than the purely lexical and syntactic methods. Named-entity tagging provides the type of each named entity (e.g. a person or an organization), which, together with synsets, would give interesting semantic information about a sentence and enable a better comparison with others. The impact of using named entities, however, would only be significant if named entities appear frequently in the set of sentence pairs being compared.