# Mandatory Exercise - Session 8

### Students: Nafis Banirazi & Jan Carbonell

### Lab Objective:
The objective of this lab is to:
- Consider the following sentence: "Lazy cats play with mice."
- Expand the grammar of the example related to non-probabilistic chart parsers so that it covers this new sentence.
- Perform the constituency parsing using a BottomUpChartParser, a BottomUpLeftCornerChartParser and a LeftCornerChartParser.
- For each of them, provide the resulting tree, the number of edges and the list of explored edges.
- Determine which parser is the most efficient for parsing the sentence.
- Determine which edges are filtered out by each parser and why.

## 1. Constituency parsing with NLTK
This section will define a grammar and use different parsing methods to determine the constituency tree of a sentence.

### 1.1. Grammar definition
The following cell defines a grammar that can parse sentences built from a small, fixed vocabulary and a few specific relations between the words.

In [1]:
import nltk
from nltk import CFG, ChartParser

'''The grammar is expanded considering that:
    - sentences are composed of a noun phrase and a verb phrase
    - "mice" is added as a possible plural noun (NNS)
    - "with" is added as a possible preposition (reusing the CC category)
    - "play" is added as the only possible verb (V)
    - the verb phrase (VP) is defined as a verb (V) or a verb plus a prepositional phrase (V PP)
    - the prepositional phrase (PP) is defined as a preposition plus a noun phrase (CC NP)
'''
grammar = CFG.fromstring('''
                        S -> NP VP
                        NP -> NNS | JJ NNS | NP CC NP 
                        NNS -> "cats" | "dogs" | "mice" | NNS CC NNS
                        JJ -> "big" | "small" | "lazy"
                        CC -> "and" | "or" | "with"
                        V -> "play"
                        VP -> V PP | V
                        PP -> CC NP
                        ''')

# sentence tokens
sentence = nltk.word_tokenize("Lazy cats play with mice".lower())

In [2]:
print(grammar)
print(sentence)

Grammar with 18 productions (start state = S)
    S -> NP VP
    NP -> NNS
    NP -> JJ NNS
    NP -> NP CC NP
    NNS -> 'cats'
    NNS -> 'dogs'
    NNS -> 'mice'
    NNS -> NNS CC NNS
    JJ -> 'big'
    JJ -> 'small'
    JJ -> 'lazy'
    CC -> 'and'
    CC -> 'or'
    CC -> 'with'
    V -> 'play'
    VP -> V PP
    VP -> V
    PP -> CC NP
['lazy', 'cats', 'play', 'with', 'mice']
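
Every token has to match a terminal of the grammar exactly, which is why the sentence is lowercased before parsing: the grammar contains "lazy", not "Lazy". The sketch below checks this in plain Python. `LEXICON` and `uncovered_tokens` are illustrative names that are not part of NLTK, and `str.split` stands in for `nltk.word_tokenize`.

```python
# Terminals of the grammar above, grouped by their pre-terminal category.
LEXICON = {
    "NNS": {"cats", "dogs", "mice"},
    "JJ": {"big", "small", "lazy"},
    "CC": {"and", "or", "with"},
    "V": {"play"},
}

def uncovered_tokens(tokens, lexicon=LEXICON):
    """Return the tokens that no lexical rule of the grammar can produce."""
    vocabulary = set().union(*lexicon.values())
    return [token for token in tokens if token not in vocabulary]

raw = "Lazy cats play with mice"
print(uncovered_tokens(raw.split()))          # the capitalised "Lazy" is not a terminal
print(uncovered_tokens(raw.lower().split()))  # every token is covered
```

With NLTK itself, `grammar.check_coverage(sentence)` performs a similar check and raises a `ValueError` when some token is not covered.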


### 1.2. Constituency parsing
The following cell performs the constituency parsing using a BottomUpChartParser, a BottomUpLeftCornerChartParser and a LeftCornerChartParser.

In [36]:
from nltk import BottomUpChartParser, ChartParser, LeftCornerChartParser

total_edges = []

print("Sentence to parse: {}\n".format(sentence))

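# using different parser classes
# (ChartParser uses the bottom-up left-corner strategy by default,
# so it stands in for BottomUpLeftCornerChartParser here)
parser_type = [(BottomUpChartParser, "BottomUpChartParser"), (ChartParser, "BottomUpLeftCornerChartParser"), (LeftCornerChartParser, "LeftCornerChartParser")]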

# using the grammar to create a parser, then parse the sentence with it 
for parser_class, parser_name in  parser_type:    
    parser = parser_class(grammar)
    parsed_sentence = parser.parse(sentence)

    print("Parsing with: {}".format(parser_name))
    
    # showing the constituency trees
    possible_trees = []
    for tree in parsed_sentence:
        possible_trees.append(tree)
    print("Number of trees: ", len(possible_trees))
    for tree in possible_trees:
        print("Parsed tree: ", tree)
        
    # list of the applied edges
    parse = parser.chart_parse(sentence)
    print("Number of edges: {}".format(parse.num_edges()))
    
    edges = parse.edges()
    for edge in edges:
        print(edge)
    if parser_class is BottomUpChartParser:
        total_edges = edges
    else:
        print("\n")
        print("Edges that were filtered out:")
        for edge in total_edges:
            if edge not in edges:
                print(edge)
    print("_________________________________________\n")

Sentence to parse: ['lazy', 'cats', 'play', 'with', 'mice']

Parsing with: BottomUpChartParser
Number of trees:  1
Parsed tree:  (S
  (NP (JJ lazy) (NNS cats))
  (VP (V play) (PP (CC with) (NP (NNS mice)))))
Number of edges: 50
[0:1] 'lazy'
[1:2] 'cats'
[2:3] 'play'
[3:4] 'with'
[4:5] 'mice'
[0:0] JJ -> * 'lazy'
[0:1] JJ -> 'lazy' *
[0:0] NP -> * JJ NNS
[0:1] NP -> JJ * NNS
[1:1] NNS -> * 'cats'
[1:2] NNS -> 'cats' *
[1:1] NP -> * NNS
[1:1] NNS -> * NNS CC NNS
[0:2] NP -> JJ NNS *
[1:2] NP -> NNS *
[1:2] NNS -> NNS * CC NNS
[1:1] S  -> * NP VP
[1:1] NP -> * NP CC NP
[1:2] S  -> NP * VP
[1:2] NP -> NP * CC NP
[0:0] S  -> * NP VP
[0:0] NP -> * NP CC NP
[0:2] S  -> NP * VP
[0:2] NP -> NP * CC NP
[2:2] V  -> * 'play'
[2:3] V  -> 'play' *
[2:2] VP -> * V PP
[2:2] VP -> * V
[2:3] VP -> V * PP
[2:3] VP -> V *
[1:3] S  -> NP VP *
[0:3] S  -> NP VP *
[3:3] CC -> * 'with'
[3:4] CC -> 'with' *
[3:3] PP -> * CC NP
[3:4] PP -> CC * NP
[4:4] NNS -> * 'mice'
[4:5] NNS -> 'mice' *
[4:4] NP -> * NNS
[4

## Discussion
- (A) Which parser is the most efficient for parsing the sentence?
- (B) Which edges are filtered out by each parser and why?
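
To approach (B), it helps to read the edge notation printed by the chart: `[i:j]` is the span of tokens covered, and `*` marks how much of the right-hand side has been recognised so far. The following sketch parses a few edges copied from the BottomUpChartParser output above and labels them. `parse_edge`, `edge_kind` and the three labels are assumptions made for illustration, not NLTK API.

```python
import re
from collections import Counter

EDGE_PATTERN = re.compile(r"\[(\d+):(\d+)\]\s+(\S+)\s*->\s*(.*)")

def parse_edge(text):
    """Split an edge such as "[0:1] NP -> JJ * NNS" into its parts."""
    match = EDGE_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError("not a rule edge: " + text)
    start, end, lhs, rhs = match.groups()
    symbols = rhs.split()
    dot = symbols.index("*")  # number of right-hand-side symbols already found
    return int(start), int(end), lhs, symbols[:dot], symbols[dot + 1:]

def edge_kind(text):
    """Label an edge as a prediction, a partial match or a complete constituent."""
    _start, _end, _lhs, found, missing = parse_edge(text)
    if not missing:
        return "complete"
    if not found:
        return "prediction"  # zero-width edge: nothing recognised yet
    return "partial"

# A few edges copied from the BottomUpChartParser output above.
edges = [
    "[0:0] NP -> * JJ NNS",
    "[0:1] NP -> JJ * NNS",
    "[0:2] NP -> JJ NNS *",
    "[1:1] S  -> * NP VP",
    "[1:2] S  -> NP * VP",
    "[1:3] S  -> NP VP *",
]
print(Counter(edge_kind(edge) for edge in edges))
```

Note that `[1:3] S -> NP VP *` is complete but spans only "cats play", so it cannot be part of a parse of the whole sentence. Edges like this one, and the predictions leading to them, are worth looking for in the "Edges that were filtered out" lists.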

## 2. Dependency parsing with NLTK

In [37]:
from nltk.parse.corenlp import CoreNLPDependencyParser

def get_dependency_triples(sentence):
    
    """Returns a list with the dependency-parsing triples for the passed sentence"""
    
    # CoreNLP dependency parser, as the Stanford one is deprecated: https://github.com/nltk/nltk/issues/2010
    parser = CoreNLPDependencyParser("http://localhost:9000")
    parse = parser.raw_parse(sentence)
    
    # extract the triples from the dependency tree
    triples = []
    tree = next(parse)
    for triple in tree.triples():
        triples.append(triple)
    return triples
    
print(get_dependency_triples("Smith jumps over the lazy dog"))

[(('jumps', 'VBZ'), 'nsubj', ('Smith', 'NNP')), (('jumps', 'VBZ'), 'nmod', ('dog', 'NN')), (('dog', 'NN'), 'case', ('over', 'IN')), (('dog', 'NN'), 'det', ('the', 'DT')), (('dog', 'NN'), 'amod', ('lazy', 'JJ'))]
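
Each triple has the form `((head, head_tag), relation, (dependent, dependent_tag))`. As a small sketch, the output above can be regrouped by head word to make the tree structure easier to read. The triples are copied from the output above; `dependents_by_head` is a hypothetical helper, not part of NLTK.

```python
from collections import defaultdict

# Triples copied from the parser output above.
triples = [
    (('jumps', 'VBZ'), 'nsubj', ('Smith', 'NNP')),
    (('jumps', 'VBZ'), 'nmod', ('dog', 'NN')),
    (('dog', 'NN'), 'case', ('over', 'IN')),
    (('dog', 'NN'), 'det', ('the', 'DT')),
    (('dog', 'NN'), 'amod', ('lazy', 'JJ')),
]

def dependents_by_head(triples):
    """Map each head word to its (relation, dependent word) pairs."""
    grouped = defaultdict(list)
    for (head, _head_tag), relation, (dependent, _dep_tag) in triples:
        grouped[head].append((relation, dependent))
    return dict(grouped)

for head, dependents in dependents_by_head(triples).items():
    print(head, "->", dependents)
```

Here "jumps" is the root of the sentence, with "Smith" as its subject and "dog" as a nominal modifier that in turn governs "over", "the" and "lazy".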


In [42]:
import os
import sys

# the trial data is in the lab 08 folder (path relative to the notebook's working directory)
absolute_file_path = os.path.join(os.getcwd(), "trial", "STS.input.txt")


#value initialization and instantiation
d = {}
tests = []
standard = []

# find all sentence pairs in the document
sentence_pairs = []
sentence_set_pairs = []
with open(absolute_file_path) as f:
    lines = f.readlines()
    for line in lines:
        index, sentence0, sentence1 = line.split("\t")
        if index in d:
            d[index] = sentence0, sentence1
        else:
            d[index] = (get_dependency_triples(sentence0), get_dependency_triples(sentence1))
            print("First sentence: \t", sentence0, "\nSecond sentence: \t", sentence1, "\n")
    print()  

First sentence: 	 The bird is bathing in the sink. 
Second sentence: 	 Birdie is washing itself in the water basin.
 

First sentence: 	 In May 2010, the troops attempted to invade Kabul. 
Second sentence: 	 The US army invaded Kabul on May 7th last year, 2010.
 

First sentence: 	 John said he is considered a witness but not a suspect. 
Second sentence: 	 "He is not a suspect anymore." John said.
 

First sentence: 	 They flew out of the nest in groups. 
Second sentence: 	 They flew into the nest together.
 

First sentence: 	 The woman is playing the violin. 
Second sentence: 	 The young lady enjoys listening to the guitar.
 

First sentence: 	 John went horse back riding at dawn with a whole group of friends. 
Second sentence: 	 Sunrise at dawn is a magnificent view to take in if you wake up early enough for it.
 




### 2.1. Sentence similarity calculation using triples vs the gold standard
Each pair of sentences is compared using the Jaccard distance. Every sentence is first reduced to the **set** of its dependency triples, *the unique (head, relation, dependent) tuples returned by the parser*; the more triples two sentences have in common, the more similar they are. We then compute the **Jaccard similarity as 1 - Jaccard distance**.

In [44]:
from nltk.metrics import jaccard_distance
from scipy.stats import pearsonr

for key in d:
        
    w1 = set(d[key][0])
    w2 = set(d[key][1])

    # jaccard similarity 1 - jaccard distance
    dist = jaccard_distance(w1, w2)
    jaccard_similarity = 1 - dist
    tests.append(round(jaccard_similarity,3))
print(tests)

[0.0, 0.0, 0.0, 0.4, 0.0, 0.033]
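
For reference, the Jaccard similarity of two sets A and B is |A ∩ B| / |A ∪ B|. The sketch below computes it in plain Python on sets of triples. The first set is the parser output shown earlier for "Smith jumps over the lazy dog"; the second is hand-built by dropping the `amod` triple and is not real parser output. Returning 1.0 for two empty sets is a choice made here for illustration.

```python
def jaccard_similarity(a, b):
    """Return |a & b| / |a | b|, i.e. 1 - Jaccard distance."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)

# Parser output shown earlier for "Smith jumps over the lazy dog".
triples_a = {
    (('jumps', 'VBZ'), 'nsubj', ('Smith', 'NNP')),
    (('jumps', 'VBZ'), 'nmod', ('dog', 'NN')),
    (('dog', 'NN'), 'case', ('over', 'IN')),
    (('dog', 'NN'), 'det', ('the', 'DT')),
    (('dog', 'NN'), 'amod', ('lazy', 'JJ')),
}
# Hand-built variant without the adjective (NOT real parser output).
triples_b = triples_a - {(('dog', 'NN'), 'amod', ('lazy', 'JJ'))}

print(round(jaccard_similarity(triples_a, triples_b), 3))  # 4 shared / 5 total = 0.8
```

A triple only counts as shared when head, relation, dependent and both tags all match, so even close paraphrases can end up with few triples in common. This may explain the many 0.0 values above.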


## 3. Compare the results with the gold standard using the Pearson correlation
Now we open the gold standard file and calculate the Pearson correlation between its scores and our similarity values.

**Pearson Correlation**
It measures the linear relationship between two sets of data, that is, the strength of the association between the two variables. It takes a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.

**Coefficient Value** -- Strength of Association
- 0.1 < | r | < 0.3 -- small correlation
- 0.3 < | r | < 0.5 -- medium/moderate correlation
- | r | > 0.5 -- large/strong correlation

In [45]:
# read the gold standard scores (second tab-separated column)
with open('./trial/STS.gs.txt', 'r') as gs_file:
    for line in gs_file:
        line = line.strip().split("\t")
        standard.append(int(line[1]))

a = pearsonr(standard[0:6], tests[0:6])[0]
print('Pearson correlation:', round(a,3))

Pearson correlation: -0.187
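
As a sketch of what `pearsonr` computes, the coefficient is the covariance of the two lists divided by the product of their standard deviations. The plain-Python version below is for illustration only: `pearson` and `strength` are hypothetical helpers, the "negligible" label is an assumption because the thresholds above do not name the range below 0.1, and values exactly on a threshold fall into the lower band here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either list is constant.
    return cov / math.sqrt(var_x * var_y)

def strength(r):
    """Label |r| using the thresholds listed above."""
    size = abs(r)
    if size > 0.5:
        return "large/strong"
    if size > 0.3:
        return "medium/moderate"
    if size > 0.1:
        return "small"
    return "negligible"  # assumption: this range is not named above

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly linear, increasing
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))  # perfectly linear, decreasing
print(strength(-0.187))                      # the value reported above
```

By this scale the reported -0.187 is a small negative correlation. With only six sentence pairs, it should be read with caution.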
