# Syntactic (constituent|dependency) parsing:

![Parsing](./parsing.png)


# COLX 535 Lab Assignment 1: Noun Phrases

## Assignment Objectives

In this assignment you will
- Identify noun phrase chunks using POS tags
- Extract information from noun phrases in the Penn Treebank



#### Reference
Abney, S. P. (1991). Parsing by Chunks. In R. C. Berwick, S. Abney, & C. Tenny (Eds.), *Principle-Based Parsing* (pp. 257–278). Kluwer Academic Publishers. https://link.springer.com/chapter/10.1007/978-94-011-3474-3_10


$[$I begin$]$ $[$with an intuition$]$: $[$when I read$]$ $[$a sentence$]$, $[$I read it$]$ $[$a chunk$]$ $[$at a time$]$

CoNLL-2000 Shared task: NP chunking https://www.clips.uantwerpen.be/conll2000/chunking/

#### How is this related to class?

In class, we talked about how sentences are made up of phrases. The task of finding phrases is called "chunking". In this lab, you will use regular expressions to identify chunks in parsed sentences from the treebank, before we move on to more ML-based methods in later labs.

## Getting Started

This assignment requires that you have downloaded the following NLTK corpora/lexicons:

In [16]:
import nltk
# nltk.download("punkt")
# nltk.download("treebank")
# nltk.download("averaged_perceptron_tagger")
# nltk.download("wordnet")

# nltk.download('omw-1.4')

Run the code below to access relevant modules (you can add to this as needed):

In [17]:
from nltk.corpus import treebank
from nltk import word_tokenize, pos_tag, RegexpParser
from nltk.tree import Tree
from nltk.chunk.util import ChunkScore
from nltk.stem import WordNetLemmatizer 

## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:

- Submit the assignment by filling in this jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)
- Make sure that you are familiar with the [MDS policy](https://ubc-mds.github.io/policies/) concerning plagiarism

### Exercise 1: simple NP chunking
rubric={accuracy:3, efficiency:1}

We will start by building a basic NP chunker. A simple approach to the task of NP chunking is to assume that a sequence of words is an NP if 

* it contains only determiners, nouns, pronouns, and adjectives,
* and it contains at least one noun or pronoun. 

The first two letters of the relevant POS tags are provided for you in the sets `NP_POS` and `NP_HEAD_POS`. 

Write a function which takes a raw sentence (a string) and 

1. tokenizes and POS tags it using NLTK (you might also want to have an end-of-sequence token and tag)
1. finds all contiguous sequences of words that fit the above description, and returns them. 

For the input sentence _the big dog barked at the bird_ , you should return a list of two NPs `["the big dog", "the bird"]`. Note that you should only return *maximal* sequences. This means that even though `"big dog"` fits our description, you shouldn't return this sequence because it is already included in the longer sequence `"the big dog"`.  

Please use the provided sets `NP_POS` and `NP_HEAD_POS` in your solution, and you **should not** use a regex when implementing `get_chunks`. You might want to add some additional tests to show you've covered all the cases.


```
..
3.	DT	Determiner
..
7.	JJ	Adjective
8.	JJR	Adjective, comparative                  
9.	JJS	Adjective, superlative                  
..
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
..
18.	PRP	Personal pronoun
19.	PRP$	Possessive pronoun
```
source: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
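
As a quick illustration of how the two-letter prefixes relate to the full tags in the table, the sketch below checks a hand-tagged sentence against the two sets. The tags are written by hand for illustration (they are not the output of `pos_tag`), and `flags` is a name introduced only for this example.

```python
NP_POS = {"DT", "NN", "JJ", "PR"}
NP_HEAD_POS = {"NN", "PR"}

# Hand-tagged example (an assumption for illustration, not real tagger output).
tagged = [("the", "DT"), ("big", "JJ"), ("dog", "NN"), ("barked", "VBD"),
          ("at", "IN"), ("my", "PRP$"), ("birds", "NNS")]

# For each word: can it be part of an NP, and can it serve as the required head?
flags = [(word, tag[:2] in NP_POS, tag[:2] in NP_HEAD_POS) for word, tag in tagged]

for word, in_np, is_head in flags:
    print(f"{word:8} part of NP: {in_np!s:5} possible head: {is_head}")
```

Note that `PRP$` shares the prefix `PR` with `PRP`, so under this simple scheme a possessive pronoun also counts as a possible head.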


In [18]:
NP_POS = {"DT", "NN", "JJ", "PR"}  # first two letters of the POS tags that can appear inside a noun phrase
NP_HEAD_POS = {"NN", "PR"}  # each chunk must contain at least one tag starting with one of these

def check_chunk_head(chunk):
    return any(pos[:2] in NP_HEAD_POS for word, pos in chunk)

def get_chunks(sentence):
    '''Extracts noun phrases from a sentence, keeping maximal runs of words whose POS tags start with a prefix in NP_POS
    and requiring at least one tag with a prefix in NP_HEAD_POS. Returns the chunks as a list of strings'''
    # your code here
    
    
    # your code here
    return chunks 

Here are a few examples which show you the input format for `get_chunks` and the intended output format. Your function should pass these assertions.  

In [19]:
assert sorted(get_chunks("the quick brown fox jumped over the lazy dog")) == sorted(["the quick brown fox", "the lazy dog"])
assert get_chunks("life is good") == ["life"]
assert get_chunks("life is good and chickens are tasty") == ["life", "chickens"]
print("Success!")

Success!


### Exercise 2: regex chunking

Create three different NLTK regex noun chunkers using the `RegexpParser` class.

This class is built for chunking.  We'll be looking at it in lecture 2.  (Check the slides!)
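
As a rough sketch of the grammar syntax, the example below chunks verb groups rather than noun phrases, so it does not answer any of the exercises. The `VG` label, the rule, and the hand-tagged sentence are assumptions made up for illustration.

```python
from nltk import RegexpParser

# One rule: an optional modal followed by one or more verb tags.
# Each <...> item is a regex over POS tags; {...} marks the span to chunk.
verb_group_chunker = RegexpParser("VG: {<MD>?<VB.*>+}")

# Hand-tagged input in the same (word, tag) format that pos_tag returns.
tagged = [("the", "DT"), ("dog", "NN"), ("will", "MD"), ("bark", "VB")]
tree = verb_group_chunker.parse(tagged)
print(tree)

# Collect the leaves of every chunk that carries the VG label.
vg_chunks = [subtree.leaves() for subtree in tree.subtrees() if subtree.label() == "VG"]
```

The parser returns an NLTK `Tree` rooted at `S`, with chunked tokens grouped under the rule's label and all other tokens left as direct children of the root.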

#### 2.1
rubric={accuracy:2}
1. `simple_chunk` which exactly duplicates the logic from Exercise 1.

In [20]:
# your code here


#### 2.2
rubric={accuracy:2}

2. `ordered_chunk` which captures the standard English NP word order. For the purposes of this assignment, assume that English NPs are defined by the following properties:

 * The syntactic head of an NP is either a personal pronoun, common noun or proper noun. Every NP has to contain at least one of these. Note that there can be more.
 * If the head is a noun, it can be preceded by a determiner (also called an article) as in _the dog_ or a possessive pronoun as in _my dogs_. 
 * If the head is a noun, it can be preceded by one or more adjectives as in _beautiful weather_, _the grey dog_ and _grey dogs_.
 * If a determiner or possessive pronoun occurs, it has to be the first token of the NP.

In [21]:
# your code here


#### 2.3
rubric={accuracy:2}

3. `conj_chunk` which allows for coordination of two NPs matching `ordered_chunk` using a coordinating conjunction `CC`. Note that often there is only one determiner in a coordinated NP, as in "the Globe and Mail"; however, "the Globe and the Mail" is also grammatical. Hint: both sides of the coordination should have the same regex, since the rules of English are the same on the left and right side of a conjunction.

In [22]:
# your code here


In [23]:
sent = "I gave John my old Globe and Mail"
assert (str(simple_chunk.parse(pos_tag(word_tokenize(sent)))) == str(Tree.fromstring("(S (NP I/PRP) gave/VBD (NP John/NNP my/PRP$ old/JJ Globe/NNP) and/CC (NP Mail/NNP))")))
assert (str(ordered_chunk.parse(pos_tag(word_tokenize(sent)))) == str(Tree.fromstring("(S (NP I/PRP) gave/VBD (NP John/NNP) (NP my/PRP$ old/JJ Globe/NNP) and/CC (NP Mail/NNP))")))
assert (str(conj_chunk.parse(pos_tag(word_tokenize(sent)))) == str(Tree.fromstring("(S (NP I/PRP) gave/VBD (NP John/NNP) (NP my/PRP$ old/JJ Globe/NNP and/CC Mail/NNP))")))
print("Success!")

Success!


![Chunking](chunking.png)

### Exercise 3: chunking evaluation and improvement

We will now evaluate our regular expression chunkers by comparing their output to gold standard chunks extracted from the Penn Treebank.

#### 3.1
rubric={accuracy:3, quality:1}

First, we will create a new test set for our chunkers by pulling out noun phrases from the Penn Treebank. You should start by creating a function `convert_to_chunk` which converts standard syntactic trees into shallow chunk trees, where all phrases except `NP` have been flattened.    

Your `convert_to_chunk` function should take a list of syntax trees as input and return a list of chunk trees as output. Here is an example of a syntax tree and the corresponding chunk tree:

```
input syntax tree:
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))

output chunk tree:
(S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN)
  Nov./NNP
  29/CD
  ./.
)
```

Most of your work will happen in the helper function `convert_to_chunk_`. It returns all the child nodes of the chunk tree as a list. For example, given the input:
```
sent = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT the) (NN cat))))")
```
the `convert_to_chunk_` function should return a list of three elements:
```
[Tree('NP', [('the', 'DT'), ('dog', 'NN')]), ('saw', 'VBD'), Tree('NP', [('the', 'DT'), ('cat', 'NN')])]
```
Notice that the VP has been flattened. The `convert_to_chunk` function then transforms this list into a chunk tree:
```
(S 
 (NP the/DT dog/NN) 
 saw/VBD 
 (NP the/DT cat/NN)
)
```

Loop over the Penn Treebank to build your test set. Only extract NPs which are labeled exactly `NP` (i.e. not `NP-TMP`, `NP-SBJ` etc.). You should only pull out shallow NPs, i.e. those that contain no other NPs, and skip any NP which has a "\*" in one of its leaves. A boolean function for testing these conditions is partially written for you; complete it by adding the case for "\*".

(**HINT**: Recursion will be helpful, and a helper function is defined for this purpose. Also see the [pos](https://www.nltk.org/api/nltk.html#nltk.tree.Tree.pos) method for trees.)
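
The following sketch shows a few `Tree` methods that may be useful here. It is only an illustration on a small hand-written tree, not a solution, and the variable names are introduced just for this example.

```python
from nltk.tree import Tree

tree = Tree.fromstring(
    "(S (NP-SBJ (DT the) (NN dog)) (VP (VBD saw) (NP (DT the) (NN cat))))"
)

# pos() flattens the whole tree into (word, tag) pairs.
word_tag_pairs = tree.pos()

# subtrees() visits every node top-down, including the tree itself and the
# preterminal nodes such as (DT the), which are Trees too.
labels = [subtree.label() for subtree in tree.subtrees()]

# A chunk tree is an ordinary Tree whose leaves are (word, tag) tuples.
chunk = Tree("NP", [("the", "DT"), ("cat", "NN")])
print(word_tag_pairs)
print(labels)
print(chunk)
```

Because preterminals are also returned by `subtrees()`, any check on subtree labels sees POS tags such as `NN` and `NNP` alongside phrase labels such as `NP`.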

![Eval](eval.png)

![is_wanted_NP](is_wanted_np.png)

In [24]:
def is_wanted_NP(tree):
    '''returns True only if the NLTK tree is labeled exactly "NP" and contains neither other NPs nor traces ("*")'''
    if tree.label() != "NP":
        return False
    
    subtrees = list(tree.subtrees())[1:]
    if any([subtree.label().startswith("NP") for subtree in subtrees]):
        return False
    # your code here
    
    ...
    # your code here

    return True

def convert_to_chunk_(tree,chunks):
    '''Recursively finds any shallow NPs in the tree, converting the parse into the NLTK chunk format.
       The list of chunks is returned'''
    
    # your code here
    
    # your code here

    return chunks

tree = Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (DT the) (NN cat))))")
assert(convert_to_chunk_(tree,[]) == [Tree('NP', [('the', 'DT'), ('dog', 'NN')]), ('saw', 'VBD'), Tree('NP', [('the', 'DT'),('cat','NN')])])

def convert_to_chunk(tree):
    return Tree("S",convert_to_chunk_(tree,[]))


treebank_test  = []

for parsed_sent in treebank.parsed_sents():
    treebank_test.append(convert_to_chunk(parsed_sent))

#### 3.2
rubric={accuracy:1}

Now, evaluate the three regex chunkers from Exercise 2 using the built-in NLTK chunk evaluation system (**HINT**: if you've done 2 and 3.1 correctly, your f-scores should be close to or above 60%).

Note that `ordered_chunk` should result in a better F-score than `simple_chunk`, but `conj_chunk` might not.

```
                        GOLD    SYS
Pierre          NNP     B-NP    B-NP
Vinken          NNP     I-NP    B-NP
,               ,       O       O
61              CD      B-NP    B-NP
years           NNS     I-NP    I-NP
...

processed 5 tokens with 2 phrases; found: 3 phrases; correct: 1.
accuracy:  80.00%; precision:  33.33%; recall:  50.00%; FB1:  40.00
               NP: precision:  33.33%; recall:  50.00%; FB1:  40.00  3
```

$\text{precision}  = \displaystyle\frac{|\text{relevant chunks} \cap \text{retrieved chunks}|}{|\text{retrieved chunks}|} = \frac{1}{3}$

$\text{recall}  = \displaystyle\frac{|\text{relevant chunks} \cap \text{retrieved chunks}|}{|\text{relevant chunks}|}  = \frac{1}{2}$


$f_1 = \displaystyle\frac{2\cdot\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}} = \frac{2 \cdot \frac{1}{3} \cdot \frac{1}{2}}{\frac{1}{3} + \frac{1}{2}} = \frac{2}{5} = 0.4$
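
The same numbers can be reproduced by hand. The sketch below converts the five GOLD and SYS tags from the example into chunk spans and computes the three scores exactly; the helper `iob_to_spans` is a hypothetical name written for this illustration and assumes every chunk starts with a `B-` tag.

```python
from fractions import Fraction

def iob_to_spans(tags):
    """Turn a list of IOB tags into a set of (start, end) chunk spans."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        # An O tag or a new B- tag closes the chunk that is currently open.
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.add((start, i))
                start = None
        if tag.startswith("B-"):
            start = i
    if start is not None:
        spans.add((start, len(tags)))
    return spans

gold_spans = iob_to_spans(["B-NP", "I-NP", "O", "B-NP", "I-NP"])
sys_spans = iob_to_spans(["B-NP", "B-NP", "O", "B-NP", "I-NP"])

# A chunk only counts as correct if both of its boundaries match.
correct = len(gold_spans & sys_spans)
precision = Fraction(correct, len(sys_spans))
recall = Fraction(correct, len(gold_spans))
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Splitting _Pierre Vinken_ into two chunks costs one missed gold chunk and two spurious system chunks, which is why chunk-level scores are much lower than the 80% tag accuracy.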

In [25]:
### Your code here



simple chunk


  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  print(simple_chunk.evaluate(treebank_test))


ChunkParse score:
    IOB Accuracy:  79.0%
    Precision:     50.3%
    Recall:        72.4%
    F-Measure:     59.3%

ordered chunk

ChunkParse score:
    IOB Accuracy:  79.2%
    Precision:     50.5%
    Recall:        74.0%
    F-Measure:     60.0%

conj chunk

ChunkParse score:
    IOB Accuracy:  79.0%
    Precision:     50.6%
    Recall:        71.4%
    F-Measure:     59.2%


#### 3.3
rubric={accuracy:2}

Use slicing to split your test set into two subsets, one with 50 sentences and one with the rest. Look at the errors your best chunker is making on the set with 50 sentences (your development set), and identify at least one problem that can be fixed. 
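
One possible way to look for errors, sketched here on a toy pair of trees rather than on the treebank, is to feed gold and system trees to `ChunkScore` (imported at the top of this notebook) and list the chunks it counts as missed or incorrect. The trees `gold` and `guess` are hand-built assumptions for illustration.

```python
from nltk.tree import Tree
from nltk.chunk.util import ChunkScore

# Gold standard: "Pierre Vinken" is a single NP.
gold = Tree("S", [
    Tree("NP", [("Pierre", "NNP"), ("Vinken", "NNP")]),
    (",", ","),
    Tree("NP", [("61", "CD"), ("years", "NNS")]),
])
# Hypothetical system output: the name is wrongly split into two NPs.
guess = Tree("S", [
    Tree("NP", [("Pierre", "NNP")]),
    Tree("NP", [("Vinken", "NNP")]),
    (",", ","),
    Tree("NP", [("61", "CD"), ("years", "NNS")]),
])

chunk_score = ChunkScore()
chunk_score.score(gold, guess)  # argument order: (correct, guessed)

missed = chunk_score.missed()        # gold chunks the system did not find
incorrect = chunk_score.incorrect()  # system chunks that are not in the gold
print("missed:", missed)
print("incorrect:", incorrect)
print(chunk_score.precision(), chunk_score.recall(), chunk_score.f_measure())
```

Calling `score` once per sentence pair accumulates the counts, so the same object can be reused across the whole development set before inspecting `missed()` and `incorrect()`.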

In [26]:
### Your code here



SYS: (S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  61/CD
  (NP years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN Nov./NNP)
  29/CD
  ./.)
GOLD: (S
  (NP Pierre/NNP Vinken/NNP)
  ,/,
  (NP 61/CD years/NNS)
  old/JJ
  ,/,
  will/MD
  join/VB
  (NP the/DT board/NN)
  as/IN
  (NP a/DT nonexecutive/JJ director/NN)
  Nov./NNP
  29/CD
  ./.)
SYS: (S
  (NP Mr./NNP Vinken/NNP)
  is/VBZ
  (NP chairman/NN)
  of/IN
  (NP Elsevier/NNP N.V./NNP)
  ,/,
  (NP the/DT Dutch/NNP)
  publishing/VBG
  (NP group/NN)
  ./.)
GOLD: (S
  Mr./NNP
  Vinken/NNP
  is/VBZ
  (NP chairman/NN)
  of/IN
  (NP Elsevier/NNP N.V./NNP)
  ,/,
  (NP the/DT Dutch/NNP publishing/VBG group/NN)
  ./.)
SYS: (S
  (NP Rudolph/NNP Agnew/NNP)
  ,/,
  55/CD
  (NP years/NNS)
  old/JJ
  and/CC
  (NP former/JJ chairman/NN)
  of/IN
  (NP Consolidated/NNP Gold/NNP Fields/NNP PLC/NNP)
  ,/,
  was/VBD
  named/VBN
  *-1/-NONE-
  (NP a/DT nonexecutive/JJ director/NN)
  of/IN
  (NP this/D

#### 3.4
rubric={reasoning:1}

Explain the problem you saw in your data.

YOUR ANSWER HERE

#### 3.5 
rubric={accuracy:1}

Make a new regex chunker which addresses that issue, and show that it is better than the others using your new test set (the one with the 50 dev sentences excluded). If you don't see an overall improvement in F-score, try again until you do.

In [None]:
### Your code here


### Exercise 4: identifying predicates and objects

You will now build a function which extracts predicate-object pairs from syntax trees. For example, for the sentence _I bought the toys_, your function should identify that the predicate of the sentence is _bought_ and its object is _toys_. The function should then return the lemmatized pair `("buy", "toy")`. 


#### 4.1 Optional
rubric={accuracy:2}

First, write a recursive function `get_head` which takes two arguments: `phrase` and `phrase_type` as input. The `phrase` argument is an NLTK tree representing either an NP or a VP, and `phrase_type` is either `"N"` or `"V"` for NPs and VPs, respectively. Your function should return the **lemmatized** syntactic head of `phrase`. For example, given the following NLTK syntax tree as input

```
(NP 
 (DT the) 
 (JJ grey) 
 (NN dogs) 
)
```

your function should return `dog`. 

You can assume that the head is either the right-most token with the appropriate POS `V.*` or `N.*`, or the syntactic head of the right-most child phrase having type `NP.*` or `VP.*` depending on `phrase_type`. This means that you may need to call `get_head` recursively. For example, for 

```
(NP 
 (DT the) 
 (JJ second) 
 (NN incentive) 
 (NN plan)
)
```

you should return `"plan"` which is the right-most noun. As another example, consider 

```
(NP 
 (DT the) 
 (JJ blue) 
 (NN bird)
 (CC and)
 (NP 
   (DT the)
   (JJ yellow)
   (NN butterfly)
 )
)
```

Here you should return `"butterfly"`, which is the head of the right-most child NP.

If you can't identify a syntactic head, you should return `None`.
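
To make the tree navigation concrete, the sketch below walks the direct children of the coordinated NP from right to left and tells tokens apart from child phrases. It only illustrates the traversal (no lemmatization and no recursion), and `summary` is a name introduced for this example.

```python
from nltk.tree import Tree

noun_phrase = Tree.fromstring(
    "(NP (DT the) (JJ blue) (NN bird) (CC and) "
    "(NP (DT the) (JJ yellow) (NN butterfly)))"
)

# Iterating over a Tree yields its direct children only. A preterminal such as
# (NN bird) has height 2; a child phrase such as the inner NP is taller.
summary = [
    (child.label(), "token" if child.height() == 2 else "phrase")
    for child in reversed(noun_phrase)
]
print(summary)

# The word under a preterminal is its only child.
print(noun_phrase[2].label(), noun_phrase[2][0])
```

Scanning from the right means the first suitable child encountered is the right-most one, which matches the head rule described above.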

![Head](head.png)

In [27]:
# lemmatizer.lemmatize(word,pos) returns the lemma for word. 
# pos should be 'n' for nouns and 'v' for verbs.
lemmatizer = WordNetLemmatizer()

def get_head(phrase, phrase_type):
    '''returns the lemmatized lexical head assuming the provided phrase_type ("N","V",etc.)'''
    head = None
    # your code here
    
    # your code here
    return head


In [28]:
assert (get_head(Tree.fromstring("(NP (DT the) (JJ second) (NN incentive) (NN plan))"), "N") == "plan")
assert (get_head(Tree.fromstring("(NP-SUBJ (NP (DT the) (NNS policies)) (PP (IN of) (NP (NN tomorrow))))"), "N") == "policy")
assert (get_head(Tree.fromstring("(VP (VBN offered) (NP (NNS advertisers)))"),"V") == "offer")
print("Success!")

Success!


#### 4.2 Optional

rubric={accuracy:2, quality:1}

Next, use the `get_head` function you just wrote in a function which pulls out "normal" verb-object relationships, e.g. "buy" and "toy" in *I bought the toys*. This will involve getting the head of the verb phrase, and the head of its **first** NP child.

Note that a single sentence can contain several nested VPs, so you should use recursion when implementing `get_short_distance_verb_noun_pairs`.

![Pair](pair.png)

In [29]:
def get_short_distance_verb_noun_pairs(parsed_sent):
    '''extracts verb-object pairs from a parsed sentence, 
       and returns them as a set of (verb,noun) tuples'''
    pairs = set()
    # your code here


    # your code here

    return pairs

Now extract all predicate-object pairs from the treebank. You should get at least 2500, but no more than 3200 pairs.

In [30]:
total_pairs = set()
for parsed_sent in treebank.parsed_sents():
    total_pairs.update(get_short_distance_verb_noun_pairs(parsed_sent))
    
print("Got %u predicate-object pairs" % len(total_pairs))
assert ("deduct","expense") in total_pairs
assert 2500 <= len(total_pairs) <= 3200
print("Success!")

Got 2839 predicate-object pairs
Success!
