# Introduction to Human Language Technology - Lab. Session 8: Parsing
### Authors: Rueben Álvarez and Albert Espín

## 1. Constituency parsing with NLTK
This section will define a grammar and use different parsing methods to determine the constituency tree of a sentence.

### 1.1. Grammar definition
The following cell defines a small grammar that can parse sentences built from a very specific vocabulary and a fixed set of relations between terms.

In [1]:
import nltk
from nltk import CFG, ChartParser

'''the grammar is expanded considering that:
    - sentences are composed of a noun phrase and a verb phrase
    - "mice" is added as a possible plural noun (NNS)
    - "with" is added as a possible preposition (reusing the CC category)
    - "play" is added as the only possible verb (V)
    - the verb phrase (VP) is defined as a verb (V) or a verb plus a prepositional phrase (V PP)
    - the prepositional phrase (PP) is defined as a preposition plus a noun phrase (CC NP)
'''
grammar = CFG.fromstring('''
                        S -> NP VP
                        NP -> NNS | JJ NNS | NP CC NP 
                        NNS -> "cats" | "dogs" | "mice" | NNS CC NNS
                        JJ -> "big" | "small" | "lazy"
                        CC -> "and" | "or" | "with"
                        V -> "play"
                        VP -> V PP | V
                        PP -> CC NP
                        ''')

# sentence tokens
sentence = nltk.word_tokenize("Lazy cats play with mice".lower())

# launch the CoreNLP server in a terminal (needed for the dependency parsing cells of section 2)
# java -mx4g -cp "/home/jan/Downloads/stanford-corenlp-full-2018-10-05/*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

### 1.2. Constituency parsing
The following cell performs constituency parsing with three parsers: BottomUpChartParser, BottomUpLeftCornerChartParser and LeftCornerChartParser.


In [2]:
from nltk import BottomUpChartParser, ChartParser, LeftCornerChartParser

all_edges = []

print("Sentence to parse: {}\n".format(sentence))

# use different parser classes; ChartParser with its default strategy
# behaves as a BottomUpLeftCornerChartParser
for parser_class, parser_name in [
    (BottomUpChartParser, "BottomUpChartParser"),
    (ChartParser, "BottomUpLeftCornerChartParser"),
    (LeftCornerChartParser, "LeftCornerChartParser"),
]:

    # use the grammar to create a parser, and parse the sentence with it  
    parser = parser_class(grammar)
    parse = parser.parse(sentence)

    print("Parsing with {}:".format(parser_name))

    # show the possible constituency trees
    trees = []
    for tree in parse:
        trees.append(tree)
    print("Number of trees: ", len(trees))
    for tree in trees:
        print("Parsed tree: ", tree)
        
    # show the edges that were found and those that were filtered out
    parse = parser.chart_parse(sentence)
    print("There are {} edges:".format(parse.num_edges()))
    edges = parse.edges()
    for edge in edges:
        print(edge)
    if parser_class == BottomUpChartParser:
        all_edges = edges
    else:
        print("Edges that were filtered out:")
        for edge in all_edges:
            if edge not in edges:
                print(edge)
    print("\n")

Sentence to parse: ['lazy', 'cats', 'play', 'with', 'mice']

Parsing with BottomUpChartParser:
Number of trees:  1
Parsed tree:  (S
  (NP (JJ lazy) (NNS cats))
  (VP (V play) (PP (CC with) (NP (NNS mice)))))
There are 50 edges:
[0:1] 'lazy'
[1:2] 'cats'
[2:3] 'play'
[3:4] 'with'
[4:5] 'mice'
[0:0] JJ -> * 'lazy'
[0:1] JJ -> 'lazy' *
[0:0] NP -> * JJ NNS
[0:1] NP -> JJ * NNS
[1:1] NNS -> * 'cats'
[1:2] NNS -> 'cats' *
[1:1] NP -> * NNS
[1:1] NNS -> * NNS CC NNS
[0:2] NP -> JJ NNS *
[1:2] NP -> NNS *
[1:2] NNS -> NNS * CC NNS
[1:1] S  -> * NP VP
[1:1] NP -> * NP CC NP
[1:2] S  -> NP * VP
[1:2] NP -> NP * CC NP
[0:0] S  -> * NP VP
[0:0] NP -> * NP CC NP
[0:2] S  -> NP * VP
[0:2] NP -> NP * CC NP
[2:2] V  -> * 'play'
[2:3] V  -> 'play' *
[2:2] VP -> * V PP
[2:2] VP -> * V
[2:3] VP -> V * PP
[2:3] VP -> V *
[1:3] S  -> NP VP *
[0:3] S  -> NP VP *
[3:3] CC -> * 'with'
[3:4] CC -> 'with' *
[3:3] PP -> * CC NP
[3:4] PP -> CC * NP
[4:4] NNS -> * 'mice'
[4:5] NNS -> 'mice' *
[4:4] NP -> * NNS
[4

### 1.3. Discussion of the results

All three parsers (BottomUpChartParser, BottomUpLeftCornerChartParser and LeftCornerChartParser) produce the same constituency tree for the analyzed sentence ("Lazy cats play with mice"). The sentence is composed of a noun phrase ("lazy cats") and a verb phrase ("play with mice"). The noun phrase is made up of an adjective ("lazy") and a plural noun ("cats"). The verb phrase has two elements: a verb ("play") and a prepositional phrase ("with mice"), which in turn consists of a preposition ("with") and a noun phrase formed by a single plural noun ("mice").

The difference between the parsers lies not in the resulting tree but in how it is obtained, and in which edges are filtered out according to the principles of each parsing technique. BottomUpChartParser applies no filtering: it builds the tree from the bottom level (individual words) up to the top level (the whole sentence) and ends up producing 50 edges.

Both BottomUpLeftCornerChartParser and LeftCornerChartParser are more efficient, since they filter out some edges. The exact list of filtered edges can be seen in the output of the previous section. BottomUpLeftCornerChartParser filters out edges without any word subsumption, that is, zero-width edges that do not yet span any word of the sentence, such as "[0:0] JJ -> * 'lazy'" or "[2:2] V -> * 'play'". Thanks to this, it considers only 31 edges. LeftCornerChartParser is the most efficient of the three, with 25 edges in total. Like BottomUpLeftCornerChartParser, it filters out edges without any word subsumption, and it additionally discards edges without any new word subsumption. This is the case, for example, of "[0:2] NP -> NP * CC NP": the word that follows "lazy cats" is "play", which cannot start a CC, so this edge could never be extended to subsume a new word.
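
One way to read "edges without any word subsumption" is as zero-width edges, whose span `[i:i]` covers no token. The sketch below is only an illustration of that reading, not part of NLTK: it takes a few edge strings copied from the BottomUpChartParser listing above and separates the zero-width ones. The helper names `span` and `is_zero_width` are hypothetical.

```python
import re

# A few edges copied verbatim from the BottomUpChartParser listing above.
edges = [
    "[0:0] JJ -> * 'lazy'",
    "[0:1] JJ -> 'lazy' *",
    "[0:0] NP -> * JJ NNS",
    "[0:1] NP -> JJ * NNS",
    "[1:1] NNS -> * 'cats'",
    "[1:2] NNS -> 'cats' *",
    "[0:2] NP -> JJ NNS *",
]

def span(edge):
    """Return the (start, end) token span of an edge string such as "[0:1] ..."."""
    start, end = re.match(r"\[(\d+):(\d+)\]", edge).groups()
    return int(start), int(end)

def is_zero_width(edge):
    """True if the edge covers no token yet (start and end positions coincide)."""
    start, end = span(edge)
    return start == end

# Zero-width edges are pure predictions; the remaining ones span at least one word.
zero_width = [e for e in edges if is_zero_width(e)]
kept = [e for e in edges if not is_zero_width(e)]

print("Zero-width edges:", zero_width)
print("Edges spanning at least one word:", kept)
```

Applied to a complete edge listing, the same test would separate the predictions that span no word from the edges that cover at least one word of the sentence.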

## 2. Dependency parsing with NLTK

### 2.1. Sentence pairs and dependency parsing
The following cell finds the pairs of sentences in the file "STS.input.txt" of the trial corpus. Afterwards, it stores the pairs as tuples (i.e. each sentence of the pair becomes a tuple element), with each sentence transformed into a list of dependency triples generated with a dependency parser.

In [3]:
from nltk.parse.corenlp import CoreNLPDependencyParser

def get_dependency_triples(sentence):

    """Returns a list with the dependency parsing triples of the passed sentence"""

    # generate the dependency tree using the CoreNLP dependency parser (the server must be running)
    parser = CoreNLPDependencyParser("http://localhost:9000")
    parse = parser.raw_parse(sentence)

    # extract the triples from the first dependency tree
    tree = next(parse)
    return list(tree.triples())
    
print(get_dependency_triples("Smith jumps over the lazy dog"))

[(('jumps', 'VBZ'), 'nsubj', ('Smith', 'NNP')), (('jumps', 'VBZ'), 'nmod', ('dog', 'NN')), (('dog', 'NN'), 'case', ('over', 'IN')), (('dog', 'NN'), 'det', ('the', 'DT')), (('dog', 'NN'), 'amod', ('lazy', 'JJ'))]


In [4]:
import os
import sys

# full path of the corpus file, assuming that the "trial" folder containing the input file is in the same directory as the notebook
absolute_file_path = os.path.join(os.path.dirname(os.path.abspath("__file__")), "trial", "STS.input.txt")

# find all sentence pairs in the document
sentence_pairs = []
sentence_set_pairs = []
with open(absolute_file_path) as f:
    lines = f.readlines()
    for line in lines:
        index, sentence0, sentence1 = line.split("\t")
        sentence_pairs.append((get_dependency_triples(sentence0), get_dependency_triples(sentence1)))
        print("First sentence: \t", sentence0, "\nSecond sentence: \t", sentence1, "\n")
    print()  
    
# the pairs of sentences are shown
for pair in sentence_pairs:
    print("{}\n{}\n\n".format(pair[0], pair[1]))

First sentence: 	 The bird is bathing in the sink. 
Second sentence: 	 Birdie is washing itself in the water basin.
 

First sentence: 	 In May 2010, the troops attempted to invade Kabul. 
Second sentence: 	 The US army invaded Kabul on May 7th last year, 2010.
 

First sentence: 	 John said he is considered a witness but not a suspect. 
Second sentence: 	 "He is not a suspect anymore." John said.
 

First sentence: 	 They flew out of the nest in groups. 
Second sentence: 	 They flew into the nest together.
 

First sentence: 	 The woman is playing the violin. 
Second sentence: 	 The young lady enjoys listening to the guitar.
 

First sentence: 	 John went horse back riding at dawn with a whole group of friends. 
Second sentence: 	 Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.
 


[(('bathing', 'NN'), 'nsubj', ('bird', 'NN')), (('bird', 'NN'), 'det', ('The', 'DT')), (('bathing', 'NN'), 'cop', ('is', 'VBZ')), (('bathing', 'NN'), 'nmod', ('sink', 'N

### 2.2. Sentence similarity calculation using triples from dependency parsing
The pairs of sentences are compared using the Jaccard distance: the more dependency triples two sentences share, the more alike they are considered to be. Each pair is shown along with its distance and its dissimilarity score (the distance scaled to the 0-5 range to be comparable with the gold standard).
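
As a minimal sketch of the computation described above, the following pure-Python function reproduces the Jaccard distance on sets, applied to the triples of the first sentence pair as they appear in the parser output of this notebook. The function name `jaccard_distance_sketch` is hypothetical (the notebook itself uses `nltk.metrics.jaccard_distance`), and treating two empty sets as identical is an assumption of this sketch.

```python
def jaccard_distance_sketch(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # assumption of this sketch: two empty sets count as identical
    return (len(union) - len(a & b)) / len(union)

# Dependency triples of the first sentence pair, copied from the notebook output.
first = {
    (('bathing', 'NN'), 'nsubj', ('bird', 'NN')),
    (('bird', 'NN'), 'det', ('The', 'DT')),
    (('bathing', 'NN'), 'cop', ('is', 'VBZ')),
    (('bathing', 'NN'), 'nmod', ('sink', 'NN')),
    (('sink', 'NN'), 'case', ('in', 'IN')),
    (('sink', 'NN'), 'det', ('the', 'DT')),
    (('bathing', 'NN'), 'punct', ('.', '.')),
}
second = {
    (('washing', 'VBG'), 'nsubj', ('Birdie', 'NNP')),
    (('washing', 'VBG'), 'aux', ('is', 'VBZ')),
    (('washing', 'VBG'), 'dobj', ('itself', 'PRP')),
    (('washing', 'VBG'), 'nmod', ('basin', 'NN')),
    (('basin', 'NN'), 'case', ('in', 'IN')),
    (('basin', 'NN'), 'det', ('the', 'DT')),
    (('basin', 'NN'), 'compound', ('water', 'NN')),
    (('washing', 'VBG'), 'punct', ('.', '.')),
}

dist = jaccard_distance_sketch(first, second)
score = round(dist * 5)  # scale the [0, 1] distance to the 0-5 range
print(dist, score)  # prints: 1.0 5
```

No triple is shared between the two sentences, because even triples with the same relation and dependent (such as the "case" relation with "in") have different head words, so the distance is maximal.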

In [5]:
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

# gold standard file path (the "trial" folder is assumed to be in the same directory as the notebook)
absolute_file_path = os.path.join(os.path.dirname(os.path.abspath("__file__")), "trial", "STS.gs.txt")

# get the gold standard scores
gold_scores = []
with open(absolute_file_path) as f:
    lines = f.readlines()
    for line in lines:
        _, score = line.split("\t")
        gold_scores.append(int(score))
        
dependency_scores = []

# compute the Jaccard distance to see how similar or different two sentences are
for i, pair in enumerate(sentence_pairs):
    dependency_dist = jaccard_distance(set(pair[0]), set(pair[1]))
    # scale the distance from [0, 1] to the 0-5 range of the gold standard
    dependency_score = round(dependency_dist * 5)
    dependency_scores.append(dependency_score)
    print("First sentence dependency triples: ", pair[0],
          "\nSecond sentence dependency triples: ", pair[1],
          "\nDependency-triples-based distance:", round(dependency_dist, 3),
          "\nDependency-triples-based dissimilarity score:", dependency_score,
          "\nGold standard dissimilarity score:", gold_scores[i], "\n")

# Pearson correlation between the tested and the gold standard scores
dependency_pearson = pearsonr(dependency_scores, gold_scores)
print("Pearson correlation between dependency-triples-based method and gold standard:", round(dependency_pearson[0], 3))

First sentence dependency triples:  [(('bathing', 'NN'), 'nsubj', ('bird', 'NN')), (('bird', 'NN'), 'det', ('The', 'DT')), (('bathing', 'NN'), 'cop', ('is', 'VBZ')), (('bathing', 'NN'), 'nmod', ('sink', 'NN')), (('sink', 'NN'), 'case', ('in', 'IN')), (('sink', 'NN'), 'det', ('the', 'DT')), (('bathing', 'NN'), 'punct', ('.', '.'))] 
Second sentence dependency triples:  [(('washing', 'VBG'), 'nsubj', ('Birdie', 'NNP')), (('washing', 'VBG'), 'aux', ('is', 'VBZ')), (('washing', 'VBG'), 'dobj', ('itself', 'PRP')), (('washing', 'VBG'), 'nmod', ('basin', 'NN')), (('basin', 'NN'), 'case', ('in', 'IN')), (('basin', 'NN'), 'det', ('the', 'DT')), (('basin', 'NN'), 'compound', ('water', 'NN')), (('washing', 'VBG'), 'punct', ('.', '.'))] 
Dependency-triples-based distance: 1.0 
Dependency-triples-based dissimilarity score: 5 
Gold standard dissimilarity score: 5 

First sentence dependency triples:  [(('attempted', 'VBN'), 'nmod', ('May', 'NNP')), (('May', 'NNP'), 'case', ('In', 'IN')), (('May', 'N

### 2.3. Explanation of the results of the sentence similarity measure using dependency triples, compared to the gold standard
The Pearson correlation between the gold standard and the method that scores two sentences by their proportion of common dependency triples is close to zero (-0.131), so there is very little correlation. Note that the computed score is a distance: if the gold standard follows the usual STS convention, where 5 means equivalent meaning, a negative sign is the expected direction, and it is the small magnitude that shows the lack of agreement. In other words, dependency triples alone are not informative enough to determine how similar two sentences are. For two triples to count as equal, and thus make the sentence pair more similar, all three elements of the triple (the two words and their syntactic relation) must be equal. A single differing element, e.g. one of the two words, is enough to break the match, even if the differing words have the same meaning. To compute sentence similarity with the Jaccard distance it is better to use other approaches, such as the word-, lemma- or synset-based comparisons of previous sessions, which yielded Pearson correlations of a higher absolute value.

Dependency parsing could, however, be used to compare the syntactic similarity of sentences, by analyzing the relations that appear in the dependency tree independently of the particular words involved. The proportion of common relations can be computed easily by applying the Jaccard distance to the set of dependency relations (i.e. the relation element of each dependency triple, instead of the whole triple). This naive calculation, however, does not take into account the ordering of the relations (from general to specific in the sentence, for example). A tree exploration algorithm, such as a modified version of breadth-first search, could be used to check how many relations appear at the same level of both trees, or after the same parent relations. This would give a much more detailed view of the syntactic similarity of the sentences, although it would still ignore their semantic similarity.
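
The two ideas above can be sketched in a few lines of pure Python, under stated assumptions. The function names (`relation_set`, `relations_by_depth`, `levelwise_distance`) are hypothetical, nodes are identified by their (word, tag) pair (which is ambiguous when a word is repeated in a sentence), and averaging the per-level Jaccard distances is just one possible way of combining levels. The triples are those of the first sentence pair in the notebook output.

```python
from collections import defaultdict, deque

def jaccard(a, b):
    """Jaccard distance between two sets (two empty sets count as identical)."""
    union = a | b
    return (len(union) - len(a & b)) / len(union) if union else 0.0

def relation_set(triples):
    """Keep only the relation label of each (head, relation, dependent) triple."""
    return {rel for _, rel, _ in triples}

def relations_by_depth(triples):
    """Breadth-first walk from the root(s), grouping relation labels by tree depth."""
    children, dependents = defaultdict(list), set()
    for head, rel, dep in triples:
        children[head].append((rel, dep))
        dependents.add(dep)
    roots = [head for head in children if head not in dependents]
    levels, seen = defaultdict(set), set()
    queue = deque((root, 1) for root in roots)
    while queue:
        node, depth = queue.popleft()
        if node in seen:  # guard against repeated (word, tag) nodes
            continue
        seen.add(node)
        for rel, dep in children.get(node, []):
            levels[depth].add(rel)
            queue.append((dep, depth + 1))
    return dict(levels)

def levelwise_distance(triples_a, triples_b):
    """Mean Jaccard distance between the relation sets found at each depth."""
    levels_a, levels_b = relations_by_depth(triples_a), relations_by_depth(triples_b)
    depths = sorted(set(levels_a) | set(levels_b))
    if not depths:
        return 0.0
    return sum(jaccard(levels_a.get(d, set()), levels_b.get(d, set())) for d in depths) / len(depths)

first = [
    (('bathing', 'NN'), 'nsubj', ('bird', 'NN')), (('bird', 'NN'), 'det', ('The', 'DT')),
    (('bathing', 'NN'), 'cop', ('is', 'VBZ')), (('bathing', 'NN'), 'nmod', ('sink', 'NN')),
    (('sink', 'NN'), 'case', ('in', 'IN')), (('sink', 'NN'), 'det', ('the', 'DT')),
    (('bathing', 'NN'), 'punct', ('.', '.')),
]
second = [
    (('washing', 'VBG'), 'nsubj', ('Birdie', 'NNP')), (('washing', 'VBG'), 'aux', ('is', 'VBZ')),
    (('washing', 'VBG'), 'dobj', ('itself', 'PRP')), (('washing', 'VBG'), 'nmod', ('basin', 'NN')),
    (('basin', 'NN'), 'case', ('in', 'IN')), (('basin', 'NN'), 'det', ('the', 'DT')),
    (('basin', 'NN'), 'compound', ('water', 'NN')), (('washing', 'VBG'), 'punct', ('.', '.')),
]

flat = jaccard(relation_set(first), relation_set(second))
by_level = levelwise_distance(first, second)
print(round(flat, 3), round(by_level, 3))
```

For this pair the whole-triple distance is 1.0, while both relation-only variants give a distance below 0.5, which matches the intuition that the two sentences are syntactically close even though they share no exact triple.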

