<h1><center>CHAPTER 8: Analyzing Sentence Structure</center></h1>

 We need a way to deal with the ambiguity that natural language is famous for. We also need to be able to cope with the fact that there are an unlimited number of possible sentences, and we can only write finite programs to analyze their structures and discover their meanings.

The goal of this chapter is to answer the following questions:

How can we use a formal grammar to describe the structure of an unlimited set of sentences?
How do we represent the structure of sentences using syntax trees?
How do parsers analyze a sentence and automatically build a syntax tree?

**To run the code below, install NLTK and download the required data packages.**

In [0]:
import nltk
import random #shuffle
from nltk.corpus import names
from nltk.corpus import brown
from nltk.corpus import movie_reviews
from nltk.corpus import conll2000


nltk.download('punkt') #for word_tokenize
nltk.download('averaged_perceptron_tagger')#for pos tagger
nltk.download('tagsets') #for pos_tag help
nltk.download('universal_tagset') #universal tags for pos

####Corpora###
nltk.download('names')
nltk.download('brown') #brown
nltk.download('nps_chat') #nps chat 
nltk.download('conll2000') #conll 
nltk.download('treebank') #penn 
nltk.download('sinica_treebank') #sinica treebank
nltk.download('indian') #indian corpus
nltk.download('mac_morpho') #mac morpho
nltk.download('rte') # recognizing text entailment
nltk.download('senseval') #senseval
nltk.download('ppattach')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Unzipping taggers/universal_tagset.zip.
[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Unzipping corpora/names.zip.
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package nps_chat to /root/nltk_data...
[nltk_data]   Unzipping corpora/nps_chat.zip.
[nltk_data] Downloading package conll2000 to /root/nltk_data...
[nltk_data]   Unzipping corpora/conll2000.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   U

True

<h2>1. Some Grammatical Dilemmas</h2>

<h3>Linguistic Data and Unlimited Possibilities</h3>


Previous chapters have shown you how to process and analyse text corpora, and we have stressed the challenges for NLP in dealing with the vast amount of electronic language data that is growing daily. Let's consider this data more closely. As a thought experiment, suppose we have a gigantic corpus consisting of everything that has been either uttered or written in English over, say, the last 50 years. Would we be justified in calling this corpus "the language of modern English"? There are a number of reasons why we might answer no.
 
 
Accordingly, we can argue that "modern English" is not equivalent to the very big set of word sequences in our imaginary corpus. Speakers of English can make judgements about these sequences, and will reject some of them as being ungrammatical.

Equally, it is easy to compose a new sentence and have speakers agree that it is perfectly good English. For example, sentences have an interesting property that they can be embedded inside larger sentences. Consider the following sentences:

	
a.		Usain Bolt broke the 100m record

b.		The Jamaica Observer reported that Usain Bolt broke the 100m record

c.		Andre said The Jamaica Observer reported that Usain Bolt broke the 100m record

d.		I think Andre said the Jamaica Observer reported that Usain Bolt broke the 100m record



If we replaced whole sentences with the symbol S, we would see patterns like Andre said S and I think S. These are templates for taking a sentence and constructing a bigger sentence. There are other templates we can use, like S but S, and S when S. With a bit of ingenuity we can construct some really long sentences using these templates. 

**Here's an impressive example from a Winnie the Pooh story**

[You can imagine Piglet's joy when at last the ship came in sight of him.] In after-years he liked to think that he had been in Very Great Danger during the Terrible Flood, but the only danger he had really been in was the last half-hour of his imprisonment, when Owl, who had just flown up, sat on a branch of his tree to comfort him, and told him a very long story about an aunt who had once laid a seagull's egg by mistake, and the story went on and on, rather like this sentence, until Piglet who was listening out of his window without much hope, went to sleep quietly and naturally, slipping slowly out of the window towards the water until he was only hanging on by his toes, at which moment, luckily, a sudden loud squawk from Owl, which was really part of the story, being what his aunt said, woke the Piglet up and just gave him time to jerk himself back into safety and say, "How interesting, and did she?" when — well, you can imagine his joy when at last he saw the good ship, Brain of Pooh (Captain, C. Robin; 1st Mate, P. Bear) coming over the sea to rescue him...


This long sentence actually has a simple structure that begins S but S when S. We can see from this example that language provides us with constructions which seem to allow us to extend sentences indefinitely. It is also striking that we can understand sentences of arbitrary length that we've never heard before: it's not hard to concoct an entirely novel sentence, one that has probably never been used before in the history of the language, yet all speakers of the language will understand it.

The purpose of a grammar is to give an explicit description of a language. But the way in which we think of a grammar is closely intertwined with what we consider to be a language. Is it a large but finite set of observed utterances and written texts? Is it something more abstract like the implicit knowledge that competent speakers have about grammatical sentences? Or is it some combination of the two? We won't take a stand on this issue, but instead will introduce the main approaches.

In this chapter, we will adopt the formal framework of "generative grammar", in which a "language" is considered to be nothing more than an enormous collection of all grammatical sentences, and a grammar is a formal notation that can be used for "generating" the members of this set. Grammars use recursive productions of the form S → S and S.
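The sentence templates discussed above can be sketched in a few lines of plain Python. This is only an illustration: the `TEMPLATES` list and the `embed` function are names made up for this sketch, and the `{S}` slot plays the role of the symbol S in a recursive production.

```python
# Hypothetical templates; "{S}" marks the slot where a whole sentence is embedded.
TEMPLATES = [
    "The Jamaica Observer reported that {S}",
    "Andre said {S}",
    "I think {S}",
]

def embed(sentence, depth):
    """Recursively wrap `sentence` inside `depth` larger sentences."""
    if depth == 0:
        return sentence
    # Cycle through the templates so that any depth is possible.
    template = TEMPLATES[(depth - 1) % len(TEMPLATES)]
    return template.format(S=embed(sentence, depth - 1))

base = "Usain Bolt broke the 100m record"
for depth in range(4):
    print(embed(base, depth))
```

Because `embed` calls itself, there is no upper limit on `depth`: a finite set of templates describes an unlimited set of sentences.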

**Ubiquitous Ambiguity**

A well-known example of ambiguity is shown below from the Groucho Marx movie, Animal Crackers (1930):

While hunting in Africa, I shot an elephant in my pajamas. How an elephant got into my pajamas I'll never know.

Let's take a closer look at the ambiguity in the phrase: I shot an elephant in my pajamas. First we need to define a simple grammar:

In [0]:
from nltk import CFG
groucho_grammar = CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")


This grammar permits the sentence to be analyzed in two ways, depending on whether the prepositional phrase in my pajamas describes the elephant or the shooting event.

In [0]:
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
trees = parser.parse(sent)
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V shot) (NP (Det an) (N elephant)))
    (PP (P in) (NP (Det my) (N pajamas)))))
(S
  (NP I)
  (VP
    (V shot)
    (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))


The program produces two bracketed structures, which we can depict as trees:

![alt text](https://scontent.fist1-1.fna.fbcdn.net/v/t1.15752-9/66532603_2280423715619995_8833154114616557568_n.png?_nc_cat=105&_nc_oc=AQkXkKpKVPGzvLA5BhYSU9iPLwPQBJmLCU3dkPzQ1iA6kO4qBSaI6s1nrbNSuYCtCD8&_nc_ht=scontent.fist1-1.fna&oh=ee12a873c0c5a2a8f771353b400c7956&oe=5DB4572F)

Notice that there's no ambiguity concerning the meaning of any of the words; e.g. the word *shot* doesn't refer to the act of using a gun in the first reading and using a camera in the second.

**Your Turn** Consider the following sentences and see if you can think of two quite different interpretations: Fighting animals could be dangerous. Visiting relatives can be tiresome. Is ambiguity of the individual words to blame? If not, what is the cause of the ambiguity?

## 2. What's the Use of Syntax?

**Beyond n-grams**

Frequency information in bigrams can be used to generate text that seems perfectly acceptable for small sequences of words but rapidly degenerates into nonsense. Here's a pair of examples that we created by computing the bigrams over the text of a children's story, *The Adventures of Buster Brown* (http://www.gutenberg.org/files/22816/22816.txt):

(4)		
a.		He roared with me the pail slip down his back

b.		The worst part and clumsy looking for whoever heard light


You intuitively know that these sequences are "word-salad", but you probably find it hard to pin down what's wrong with them. One benefit of studying grammar is that it provides a conceptual framework and vocabulary for spelling out these intuitions. Let's take a closer look at the sequence *the worst part and clumsy looking*. This looks like a coordinate structure, where two phrases are joined by a coordinating conjunction such as *and*, *but*, or *or*. Here's an informal (and simplified) statement of how coordination works syntactically:

Coordinate Structure:

If *v1* and *v2* are both phrases of grammatical category *X*, then *v1 and v2* is also a phrase of category *X*.

Here are a couple of examples. In the first, two NPs (noun phrases) have been conjoined to make an NP, while in the second, two APs (adjective phrases) have been conjoined to make an AP.

(5)		
a.		The book's ending was (NP the worst part and the best part) for me.

b.		On land they are (AP slow and clumsy looking).


What we can't do is conjoin an NP and an AP, which is why *the worst part and clumsy looking* is ungrammatical. Before we can formalize these ideas, we need to understand the concept of constituent structure.

Constituent structure is based on the observation that words combine with other words to form units. The evidence that a sequence of words forms such a unit is given by substitutability — that is, a sequence of words in a well-formed sentence can be replaced by a shorter sequence without rendering the sentence ill-formed. To clarify this idea, consider the following sentence:

(6)		The little bear saw the fine fat trout in the brook.

The fact that we can substitute *He* for *The little bear* indicates that the latter sequence is a unit. By contrast, we cannot replace *little bear saw* in the same way.

(7)		
a.		He saw the fine fat trout in the brook.

b.		*The he the fine fat trout in the brook.



**Substitution of Word Sequences:** Working from the top row, we can replace particular sequences of words (e.g. the brook) with individual words (e.g. it); repeating this process we arrive at a grammatical two-word sentence.


![alt text](https://scontent.fist1-2.fna.fbcdn.net/v/t1.15752-9/66391169_326583831564252_3603630423525031936_n.png?_nc_cat=101&_nc_oc=AQnmf4uX--BzZUq_bBP86ABplN9k5R0eORTTEZhJq_5lelDnT0OVRFdF3by9Jn6R0Cg&_nc_ht=scontent.fist1-2.fna&oh=b80cb83a9e62e6a98370a19e7d1bdbd1&oe=5DBCE162)

**Substitution of Word Sequences Plus Grammatical Categories:** Here, we have added grammatical category labels to the words we saw in the earlier figure. The labels NP, VP, and PP stand for noun phrase, verb phrase and prepositional phrase respectively.

![alt text](https://scontent.fist1-2.fna.fbcdn.net/v/t1.15752-9/66668976_441666776428715_171090851164848128_n.png?_nc_cat=109&_nc_oc=AQnOLonWuqda8HcniLXBBlw0ZfFKbaJRXdxjffT7fe-DTmYC3Xrxrd3IxdFSE_pmrjc&_nc_ht=scontent.fist1-2.fna&oh=8ce8bee81c3403ebcd0a5c3bce43ba35&oe=5DA52EF5)

Now if we strip out the words apart from the topmost row, add an S node, and flip the figure over, we end up with a standard phrase structure tree. Each node in this tree (including the words) is called a constituent. The immediate constituents of S are NP and VP.



![alt text](https://scontent.fist1-1.fna.fbcdn.net/v/t1.15752-9/66314181_1012062082332697_210592670408507392_n.png?_nc_cat=102&_nc_oc=AQnW-aq2Kc_sbVUCd1-Rf-DNR0pMynGVvqd8Vth9ruiYmmGgceT_nLzwzTxzahMC9Xo&_nc_ht=scontent.fist1-1.fna&oh=4bbd78a33d78a0feb3519fea0b121dd8&oe=5DBA5255)

## 3. Context Free Grammar

### A Simple Grammar

Let's start off by looking at a simple context-free grammar. By convention, the left-hand side of the first production is the **start symbol** of the grammar, typically `S`, and all well-formed trees must have this symbol as their root label. Now we define a grammar and show how to parse a simple sentence admitted by the grammar.

In NLTK, context-free grammars are defined in the `nltk.grammar` module.

In [0]:
grammar1 = nltk.CFG.fromstring("""
  S -> NP VP 
  VP -> V NP | V NP PP
  PP -> P NP
  V -> "saw" | "ate" | "walked"
  NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
  Det -> "a" | "an" | "the" | "my"
  N -> "man" | "dog" | "cat" | "telescope" | "park"
  P -> "in" | "on" | "by" | "with"
  """)


 	
sent = "Mary saw Bob".split()
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse(sent):
    print(tree)

for p in grammar1.productions():
    print(p)

(S (NP Mary) (VP (V saw) (NP Bob)))
S -> NP VP
VP -> V NP
VP -> V NP PP
PP -> P NP
V -> 'saw'
V -> 'ate'
V -> 'walked'
NP -> 'John'
NP -> 'Mary'
NP -> 'Bob'
NP -> Det N
NP -> Det N PP
Det -> 'a'
Det -> 'an'
Det -> 'the'
Det -> 'my'
N -> 'man'
N -> 'dog'
N -> 'cat'
N -> 'telescope'
N -> 'park'
P -> 'in'
P -> 'on'
P -> 'by'
P -> 'with'


Here, `S -> NP VP` means a sentence consists of a noun phrase `NP` and a verb phrase `VP`. Similarly, `VP -> V NP | V NP PP` means a verb phrase `VP` consists of either a verb `V` and a noun phrase `NP`, or a verb `V`, a noun phrase `NP` and a prepositional phrase `PP`.


`V -> "saw" | "ate" | "walked"` represents the vocabulary for the verbs.

**Recursive Descent Parser Demo:** This tool allows you to watch the operation of a recursive descent parser as it grows the parse tree and matches it against the input words.

![](https://www.nltk.org/images/parse_rdparsewindow.png)

If we parse the sentence *the dog saw a man in the park* using `grammar1`, we end up with two trees:

a) 

![](https://www.nltk.org/book/tree_images/ch08-tree-4.png)

b) 

![alt text](https://www.nltk.org/book/tree_images/ch08-tree-5.png)

Since our grammar licenses two trees for this sentence, the sentence is said to be **structurally ambiguous**. The ambiguity in question is called a prepositional phrase attachment ambiguity. When the `PP` is attached to `VP`, the intended interpretation is that the seeing event happened in the park. However, if the `PP` is attached to `NP`, then it was the man who was in the park, and the agent of the seeing (the dog) might have been sitting on the balcony of an apartment overlooking the park.

### Writing Your Own Grammars

If you are interested in experimenting with writing CFGs, you will find it helpful to create and edit your grammar in a text file, say mygrammar.cfg. You can then load it into NLTK as follows:



```
grammar1 = nltk.data.load('file:mygrammar.cfg')
```


If the command `print(tree)` produces no output, this is probably because your sentence `sent` is not admitted by your grammar. In this case, call the parser with tracing turned on:

In [0]:
rd_parser = nltk.RecursiveDescentParser(grammar1, trace=2)
for tree in rd_parser.parse(sent):
    print(tree)

You can also check what productions are currently in the grammar with the command:

In [0]:
for p in grammar1.productions(): 
    print(p)

S -> NP VP
VP -> V NP
VP -> V NP PP
PP -> P NP
V -> 'saw'
V -> 'ate'
V -> 'walked'
NP -> 'John'
NP -> 'Mary'
NP -> 'Bob'
NP -> Det N
NP -> Det N PP
Det -> 'a'
Det -> 'an'
Det -> 'the'
Det -> 'my'
N -> 'man'
N -> 'dog'
N -> 'cat'
N -> 'telescope'
N -> 'park'
P -> 'in'
P -> 'on'
P -> 'by'
P -> 'with'


### Recursion in Syntactic Structure

A grammar is said to be recursive if a category occurring on the left-hand side of a production also appears on the right-hand side of a production.

In [0]:
grammar2 = nltk.CFG.fromstring("""
  S  -> NP VP
  NP -> Det Nom | PropN
  Nom -> Adj Nom | N
  VP -> V Adj | V NP | V S | V NP PP
  PP -> P NP
  PropN -> 'Buster' | 'Chatterer' | 'Joe'
  Det -> 'the' | 'a'
  N -> 'bear' | 'squirrel' | 'tree' | 'fish' | 'log'
  Adj  -> 'angry' | 'frightened' |  'little' | 'tall'
  V ->  'chased'  | 'saw' | 'said' | 'thought' | 'was' | 'put'
  P -> 'on'
  """)

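In this grammar, the production `Nom -> Adj Nom` is directly recursive on `Nom`, while `S` is indirectly recursive through `S -> NP VP` and `VP -> V S`. To see how recursion arises from this grammar, consider the following trees. (a) involves nested nominal phrases, while (b) contains nested sentences.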

a)

![](https://www.nltk.org/book/tree_images/ch08-tree-6.png)

b) 

![](https://www.nltk.org/book/tree_images/ch08-tree-7.png)

Further explanation about recursive parsing is given in the next section.
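The definition of a recursive grammar can be checked mechanically. The sketch below restates the phrase-level productions of `grammar2` as plain Python data and reports which categories can reappear beneath themselves. The names `PRODUCTIONS`, `reachable`, `recursive_categories`, and `directly_recursive` are assumptions made for this illustration and are not part of NLTK.

```python
# Phrase-level productions of grammar2 as plain data.
# Lexical productions are omitted because words cannot be expanded further.
PRODUCTIONS = {
    "S": [("NP", "VP")],
    "NP": [("Det", "Nom"), ("PropN",)],
    "Nom": [("Adj", "Nom"), ("N",)],
    "VP": [("V", "Adj"), ("V", "NP"), ("V", "S"), ("V", "NP", "PP")],
    "PP": [("P", "NP")],
}

def reachable(category):
    """Return every symbol that can appear below `category` in some tree."""
    seen = set()
    stack = [category]
    while stack:
        current = stack.pop()
        for rhs in PRODUCTIONS.get(current, []):
            for symbol in rhs:
                if symbol not in seen:
                    seen.add(symbol)
                    stack.append(symbol)
    return seen

def recursive_categories():
    """Categories that can contain themselves, directly or indirectly."""
    return sorted(cat for cat in PRODUCTIONS if cat in reachable(cat))

def directly_recursive():
    """Categories that appear on the right-hand side of their own production."""
    return sorted(cat for cat, rhss in PRODUCTIONS.items()
                  if any(cat in rhs for rhs in rhss))

print(recursive_categories())  # ['Nom', 'S', 'VP']
print(directly_recursive())    # ['Nom']
```

Only `Nom` is directly recursive; `S` and `VP` are recursive only through each other, which is what allows sentences to be nested inside sentences.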

## 4. Parsing With Context Free Grammar

A **parser** processes input sentences according to the productions of a grammar, and builds one or more constituent structures that conform to the grammar. A grammar is a declarative specification of well-formedness — it is actually just a string, not a program. A parser is a procedural interpretation of the grammar. 

Many natural language applications involve parsing at some point; for example, we would expect the natural language questions submitted to a question-answering system to undergo parsing as an initial step.

### Recursive Descent Parsing

The simplest kind of parser interprets a grammar as a specification of how to break a high-level goal into several lower-level subgoals. The top-level goal is to find an S. The `S → NP VP` production permits the parser to replace this goal with two subgoals: find an `NP`, then find a `VP`. Each of these subgoals can be replaced in turn by sub-sub-goals, using productions that have `NP` and `VP` on their left-hand side. Eventually, this expansion process leads to subgoals such as: find the word *telescope*. Such subgoals can be directly compared against the input sequence, and succeed if the next word is matched. If there is no match the parser must back up and try a different alternative.

![](https://www.nltk.org/images/rdparser1-6.png)

**Six Stages of a Recursive Descent Parser:** the parser begins with a tree consisting of the node S; at each stage it consults the grammar to find a production that can be used to enlarge the tree; when a lexical production is encountered, its word is compared against the input; after a complete parse has been found, the parser backtracks to look for more parses.
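The goal-and-subgoal process described above can be sketched in plain Python. This is a simplified illustration rather than NLTK's `RecursiveDescentParser`: the dictionary `GRAMMAR` restates `grammar1`, and the helper names `expand`, `expand_sequence`, and `parse` are assumptions made for this sketch. Backtracking is handled by Python generators, which resume with the next alternative whenever a subgoal fails.

```python
# grammar1 restated as a dictionary: category -> list of right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "VP": [["V", "NP"], ["V", "NP", "PP"]],
    "PP": [["P", "NP"]],
    "NP": [["John"], ["Mary"], ["Bob"], ["Det", "N"], ["Det", "N", "PP"]],
    "V": [["saw"], ["ate"], ["walked"]],
    "Det": [["a"], ["an"], ["the"], ["my"]],
    "N": [["man"], ["dog"], ["cat"], ["telescope"], ["park"]],
    "P": [["in"], ["on"], ["by"], ["with"]],
}

def expand(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way `symbol` can cover tokens[pos:next_pos]."""
    if symbol not in GRAMMAR:
        # A word: the subgoal succeeds only if it matches the next input token.
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Replace the goal `symbol` with the subgoals on the right-hand side.
        for children, end in expand_sequence(rhs, tokens, pos):
            yield (symbol, *children), end

def expand_sequence(symbols, tokens, pos):
    """Yield (children, next_pos) for each way to satisfy the subgoals in order."""
    if not symbols:
        yield [], pos
        return
    for tree, mid in expand(symbols[0], tokens, pos):
        for rest, end in expand_sequence(symbols[1:], tokens, mid):
            yield [tree] + rest, end

def parse(sentence):
    """Return every tree rooted in S that covers the whole sentence."""
    tokens = sentence.split()
    return [tree for tree, end in expand("S", tokens, 0) if end == len(tokens)]

print(parse("Mary saw Bob"))
print(len(parse("the dog saw a man in the park")))  # 2
```

Like any recursive descent parser, this sketch would recurse forever on a left-recursive production, since `expand` would keep calling itself on the same symbol at the same position.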






**Note:**

`RecursiveDescentParser()` takes an optional parameter `trace`. If `trace` is greater than zero, then the parser will report the steps that it takes as it parses a text.

**Shortcomings of Recursive Descent Parsing:**



1.   Left-recursive productions like NP -> NP PP send it into an infinite loop.
2.   The parser wastes a lot of time considering words and structures that do not correspond to the input sentence.
3.   The backtracking process may discard parsed constituents that will need to be rebuilt again later. For example, backtracking over VP -> V NP will discard the subtree created for the NP. If the parser then proceeds with VP -> V NP PP, then the NP subtree must be created all over again.


### Shift-Reduce Parsing

A simple kind of bottom-up parser is the **shift-reduce parser**. In common with all bottom-up parsers, a shift-reduce parser tries to find sequences of words and phrases that correspond to the *right hand* side of a grammar production, and replace them with the left-hand side, until the whole sentence is reduced to an `S`.

The shift-reduce parser repeatedly pushes the next input word onto a stack; this is the **shift** operation. If the top n items on the stack match the n items on the right-hand side of some production, then they are all popped off the stack, and the item on the left-hand side of the production is pushed on the stack. This replacement of the top n items with a single item is the **reduce** operation. The parser finishes when all the input is consumed and there is only one item remaining on the stack, a parse tree with an `S` node as its root. Six stages of the execution of this parser are shown:

![](https://www.nltk.org/images/srparser1-6.png)

In [0]:
sr_parser = nltk.ShiftReduceParser(grammar1, trace=2)
sent = 'Mary saw a dog'.split()
for tree in sr_parser.parse(sent):
    print(tree)

Parsing 'Mary saw a dog'
    [ * Mary saw a dog]
  S [ 'Mary' * saw a dog]
  R [ NP * saw a dog]
  S [ NP 'saw' * a dog]
  R [ NP V * a dog]
  S [ NP V 'a' * dog]
  R [ NP V Det * dog]
  S [ NP V Det 'dog' * ]
  R [ NP V Det N * ]
  R [ NP V NP * ]
  R [ NP VP * ]
  R [ S * ]
(S (NP Mary) (VP (V saw) (NP (Det a) (N dog))))


A shift-reduce parser can reach a dead end and fail to find any parse, even if the input sentence is well-formed according to the grammar. When this happens, no input remains, and the stack contains items which cannot be reduced to an `S`. The problem arises because there are choices made earlier that cannot be undone by the parser (although users of the graphical demonstration can undo their choices). There are two kinds of choices to be made by the parser: (a) which reduction to do when more than one is possible, and (b) whether to shift or reduce when either action is possible.

The advantage of shift-reduce parsers over recursive descent parsers is that they only build structure that corresponds to the words in the input. Furthermore, they only build each sub-structure once, e.g.  NP(Det(the), N(man)) is only built and pushed onto the stack a single time, regardless of whether it will later be used by the VP -> V NP PP reduction or the NP -> NP PP reduction.
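The shift and reduce operations can be sketched in plain Python. This is a simplified illustration, not NLTK's `ShiftReduceParser`: it assumes a greedy strategy (reduce whenever any production matches, otherwise shift), and the names `PRODUCTIONS`, `LEXICON`, `label`, `reduce_once`, and `shift_reduce` are made up for this sketch. The productions restate `grammar1`.

```python
# Phrase-level productions of grammar1 as (lhs, rhs) pairs.
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")),
    ("PP", ("P", "NP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N", "PP")),
]
LEXICON = {
    "V": ["saw", "ate", "walked"],
    "NP": ["John", "Mary", "Bob"],
    "Det": ["a", "an", "the", "my"],
    "N": ["man", "dog", "cat", "telescope", "park"],
    "P": ["in", "on", "by", "with"],
}
# Add one lexical production per word, e.g. ("V", ("saw",)).
PRODUCTIONS += [(cat, (word,)) for cat, words in LEXICON.items() for word in words]

def label(item):
    """A stack item is a word (str) or a tree (tuple starting with its category)."""
    return item if isinstance(item, str) else item[0]

def reduce_once(stack):
    """Apply the first production whose right-hand side matches the top of the stack."""
    for lhs, rhs in PRODUCTIONS:
        n = len(rhs)
        if n <= len(stack) and tuple(label(x) for x in stack[-n:]) == rhs:
            stack[-n:] = [(lhs, *stack[-n:])]  # pop n items, push one tree
            return True
    return False

def shift_reduce(tokens):
    """Return a tree rooted in S, or None if the parser reaches a dead end."""
    stack, remaining = [], list(tokens)
    while True:
        while reduce_once(stack):  # reduce for as long as something matches
            pass
        if not remaining:
            break
        stack.append(remaining.pop(0))  # shift the next word
    if len(stack) == 1 and label(stack[0]) == "S":
        return stack[0]
    return None

print(shift_reduce("Mary saw a dog".split()))
print(shift_reduce("the dog saw a man in the park".split()))  # None
```

The second call shows the dead end described above: the greedy strategy reduces *the dog saw a man* to an `S` too early, so the remaining `PP` can never be attached, even though the grammar admits the sentence.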

### The Left-Corner Parser

One of the problems with the recursive descent parser is that it goes into an infinite loop when it encounters a left-recursive production. This is because it applies the grammar productions blindly, without considering the actual input sentence. A left-corner parser is a hybrid between the bottom-up and top-down approaches we have seen.

The key idea of left-corner parsing is to combine top-down processing with bottom-up processing in order to avoid the mistakes that pure top-down and pure bottom-up techniques are prone to. Before we look at how this is done, you need to know what the left corner of a rule is. The left corner of a rule is the first symbol on the right-hand side. For example, `NP` is the left corner of the rule `S -> NP VP`, and `'the'` is the left corner of the rule `Det -> 'the'`.

A left-corner parser starts with a top-down prediction fixing the category that is to be recognized, like for example `S`. Next, it takes a bottom-up step and then alternates bottom-up and top-down steps until it has reached an `S`.

Grammar `grammar1` allows us to produce the following parse of *John saw Mary*:

![](https://www.nltk.org/book/tree_images/ch08-tree-8.png)

Recall that the grammar (defined in Section 3) has the following productions for expanding NP:

a.		NP -> Det N

b.		NP -> Det N PP

c.		NP -> "John" | "Mary" | "Bob"

If you were asked to select one of the rules for `NP` for this particular sentence, you would select (c). A left-corner parser makes the same choice by consulting the input. The first word is *John*, so the parser looks for a production whose left corner is `'John'` and finds (c); it does not try (a) or (b), whose left corner `Det` does not match the input. Having recognized an `NP`, it then looks for a production whose left corner is `NP`, and so on.

Once the `NP` subtree is complete and the parser has reached a rule such as `S -> NP VP`, it moves on to the remaining symbol of that rule, `VP`, and recognizes it in the same way to complete the tree.

A more detailed, illustrated explanation is available at: http://cs.union.edu/~striegnk/courses/nlp-with-prolog/html/node53.html
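The notion of a left corner is easy to compute. The sketch below restates part of `grammar1` as plain Python data and looks up the productions that have a given symbol as their left corner. The names `PRODUCTIONS`, `productions_with_left_corner`, and `left_corner_chain` are assumptions made for this illustration; it shows only the bottom-up lookup step, not a full left-corner parser.

```python
# A fragment of grammar1 as (lhs, rhs) pairs; only a few words are included.
PRODUCTIONS = [
    ("S", ("NP", "VP")),
    ("VP", ("V", "NP")),
    ("VP", ("V", "NP", "PP")),
    ("PP", ("P", "NP")),
    ("NP", ("Det", "N")),
    ("NP", ("Det", "N", "PP")),
    ("NP", ("John",)),
    ("NP", ("Mary",)),
    ("V", ("saw",)),
    ("Det", ("the",)),
    ("N", ("dog",)),
]

def productions_with_left_corner(symbol):
    """Return the productions whose first right-hand-side symbol is `symbol`."""
    return [(lhs, rhs) for lhs, rhs in PRODUCTIONS if rhs[0] == symbol]

def left_corner_chain(word):
    """Climb from a word to the categories it can begin, taking the first match each time."""
    chain = [word]
    while True:
        matches = productions_with_left_corner(chain[-1])
        # Stop at the top, or if a left-recursive rule would repeat a category.
        if not matches or matches[0][0] in chain:
            return chain
        chain.append(matches[0][0])

print(productions_with_left_corner("John"))  # [('NP', ('John',))]
print(productions_with_left_corner("Det"))
print(left_corner_chain("John"))             # ['John', 'NP', 'S']
```

Looking up `'John'` returns only rule (c), while rules (a) and (b) are found only when the input actually starts with a determiner.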

### Well-Formed Substring Tables

The simple parsers discussed above suffer from limitations in both completeness and efficiency. In order to remedy these, we will apply the algorithm design technique of dynamic programming to the parsing problem. This approach to parsing is known as chart parsing. We introduce the main idea in this section; see the online materials available for this chapter for more implementation details.

Dynamic programming allows us to build the `PP` *in my pajamas* just once. The first time we build it we save it in a table, then we look it up when we need to use it as a subconstituent of either the object NP or the higher VP. This table is known as a **well-formed substring table**, or WFST for short. (The term "substring" refers to a contiguous sequence of words within a sentence.)



***

**The following slides give a clear, illustrated explanation of filling the WFST chart with the CYK algorithm. We highly recommend reading them before continuing:**

http://www.sfs.uni-tuebingen.de/~dm/04/winter/684.01/slides/09-single.pdf

***

Let's take the sentence *I shot an elephant in my pajamas.* as an example. The indexing of this sentence will be:

![alt text](https://www.nltk.org/images/chart_positions1.png)


For our WFST, we create an *(n+1) × (n+1)* matrix as a list of lists in Python, where *n* is the number of tokens, and initialize it with the lexical categories of each token in the `init_wfst()` function. The `complete_wfst()` function then fills in the remaining cells. We also define a utility function `display()` to pretty-print the WFST for us.

In [0]:
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

def init_wfst(tokens, grammar):
    """Create an empty chart and fill the diagonal with each token's lexical category."""
    numtokens = len(tokens)
    wfst = [[None for i in range(numtokens+1)] for j in range(numtokens+1)]
    for i in range(numtokens):
        # Use the first lexical production that yields this token.
        productions = grammar.productions(rhs=tokens[i])
        wfst[i][i+1] = productions[0].lhs()
    return wfst

def complete_wfst(wfst, tokens, grammar, trace=False):
    """Fill in the chart for spans of increasing width."""
    # Map each right-hand side to its left-hand side; only pairs of
    # nonterminals (binary productions) can match in the loop below.
    index = dict((p.rhs(), p.lhs()) for p in grammar.productions())
    numtokens = len(tokens)
    for span in range(2, numtokens+1):
        for start in range(numtokens+1-span):
            end = start + span
            # Try every split point between start and end.
            for mid in range(start+1, end):
                nt1, nt2 = wfst[start][mid], wfst[mid][end]
                if nt1 and nt2 and (nt1,nt2) in index:
                    wfst[start][end] = index[(nt1,nt2)]
                    if trace:
                        print("[%s] %3s [%s] %3s [%s] ==> [%s] %3s [%s]" %
                              (start, nt1, mid, nt2, end, start, index[(nt1,nt2)], end))
    return wfst

def display(wfst, tokens):
    print('\nWFST ' + ' '.join(("%-4d" % i) for i in range(1, len(wfst))))
    for i in range(len(wfst)-1):
        print("%d   " % i, end=" ")
        for j in range(1, len(wfst)):
            print("%-4s" % (wfst[i][j] or '.'), end=" ")
        print()
        
print("Initialization:")
tokens = "I shot an elephant in my pajamas".split()
wfst0 = init_wfst(tokens, groucho_grammar)
display(wfst0, tokens)
print("\nCompleted chart:")
wfst1 = complete_wfst(wfst0, tokens, groucho_grammar)
display(wfst1, tokens)

Initialization:

WFST 1    2    3    4    5    6    7   
0    NP   .    .    .    .    .    .    
1    .    V    .    .    .    .    .    
2    .    .    Det  .    .    .    .    
3    .    .    .    N    .    .    .    
4    .    .    .    .    P    .    .    
5    .    .    .    .    .    Det  .    
6    .    .    .    .    .    .    N    

Completed chart:

WFST 1    2    3    4    5    6    7   
0    NP   .    .    S    .    .    S    
1    .    V    .    VP   .    .    VP   
2    .    .    Det  NP   .    .    .    
3    .    .    .    N    .    .    .    
4    .    .    .    .    P    .    PP   
5    .    .    .    .    .    Det  NP   
6    .    .    .    .    .    .    N    


After the chart is completed, the relations between the words in the sentence according to the given grammar are:

![alt text](https://www.nltk.org/images/chart_positions2.png)

Notice that we have not used any built-in parsing functions here. We've implemented a complete, primitive chart parser from the ground up!





<h2>5. Dependencies and Dependency Grammar</h2>

Phrase structure grammar is concerned with how words and sequences of words combine to form constituents. A distinct and complementary approach, dependency grammar, focusses instead on how words relate to other words. Dependency is a binary asymmetric relation that holds between a **head** and its **dependents**. The head of a sentence is usually taken to be the tensed verb, and every other word is either dependent on the sentence head, or connects to it through a path of dependencies.

A dependency representation is a labeled directed graph, where the nodes are the lexical items and the labeled arcs represent dependency relations from heads to dependents.

Here is an example dependency graph, where arrows point from heads to their dependents:
![alt text](https://i.hizliresim.com/VQvdkq.png)

The arcs in the above example are labeled with the grammatical function that holds between a dependent and its head. For example, *I* is the `SBJ` (subject) of *shot* (which is the head of the whole sentence), and *in* is an `NMOD` (noun modifier) of *elephant*. In contrast to phrase structure grammar, therefore, dependency grammars can be used to directly express grammatical functions as a type of dependency.

Here's one way of encoding a dependency grammar in NLTK — note that it only captures bare dependency information without specifying the type of dependency:

In [0]:
groucho_dep_grammar = nltk.DependencyGrammar.fromstring("""
  'shot' -> 'I' | 'elephant' | 'in'
  'elephant' -> 'an' | 'in'
  'in' -> 'pajamas'
  'pajamas' -> 'my'
  """)
print(groucho_dep_grammar)

Dependency grammar with 7 productions
  'shot' -> 'I'
  'shot' -> 'elephant'
  'shot' -> 'in'
  'elephant' -> 'an'
  'elephant' -> 'in'
  'in' -> 'pajamas'
  'pajamas' -> 'my'


A dependency graph is **projective** if, when all the words are written in linear order, the edges can be drawn above the words without crossing. This is equivalent to saying that a word and all its descendants (dependents and dependents of its dependents, etc.) form a contiguous sequence of words within the sentence.

The next example shows how `groucho_dep_grammar` provides an alternative approach to capturing the attachment ambiguity that we examined earlier with phrase structure grammar:

In [0]:
pdp = nltk.ProjectiveDependencyParser(groucho_dep_grammar)
sent = 'I shot an elephant in my pajamas'.split()
trees = pdp.parse(sent)
for tree in trees:
    print(tree)

(shot I (elephant an (in (pajamas my))))
(shot I (elephant an) (in (pajamas my)))


These bracketed dependency structures can also be displayed as trees, where dependents are shown as children of their heads.
> ![alt text](https://i.hizliresim.com/kMnNM7.png)


In languages with more flexible word order than English, non-projective dependencies are more frequent.
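The contiguity definition of projectivity given above can be tested directly. In this sketch a dependency analysis is encoded as a list `heads`, where `heads[i]` is the position of the head of word `i` and `None` marks the root. This encoding and the function names `descendants` and `is_projective` are assumptions made for illustration. The head assignments for the third sentence are one plausible analysis chosen for the example, not output from a parser.

```python
def descendants(heads, i):
    """Return word i together with everything that depends on it, directly or indirectly."""
    result = {i}
    changed = True
    while changed:
        changed = False
        for dependent, head in enumerate(heads):
            if head in result and dependent not in result:
                result.add(dependent)
                changed = True
    return result

def is_projective(heads):
    """True if every word and its descendants form a contiguous sequence."""
    for i in range(len(heads)):
        span = descendants(heads, i)
        if max(span) - min(span) + 1 != len(span):
            return False
    return True

# "I shot an elephant in my pajamas" (positions 0-6), two attachments for "in".
noun_attachment = [1, None, 3, 1, 3, 6, 4]   # in -> elephant
verb_attachment = [1, None, 3, 1, 1, 6, 4]   # in -> shot
print(is_projective(noun_attachment))  # True
print(is_projective(verb_attachment))  # True

# "A hearing is scheduled on the issue today" (positions 0-7),
# with "on the issue" analysed as a dependent of "hearing".
hearing = [1, 2, None, 2, 1, 6, 4, 3]
print(is_projective(hearing))  # False
```

In the last example, *hearing* and its descendants occupy positions 0, 1, 4, 5 and 6, which are interrupted by *is scheduled*, so the analysis is non-projective.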

Various criteria have been proposed for deciding what is the head *H* and what is the dependent *D* in a construction *C*. Some of the most important are the following:
> 1. *H* determines the distribution class of *C*; or alternatively, the external syntactic properties of *C* are due to *H*.
> 2. *H* determines the semantic type of *C*.
> 3. *H* is obligatory while *D* may be optional.
> 4. *H* selects *D* and determines whether it is obligatory or optional.
> 5. The morphological form of *D* is determined by *H* (e.g. agreement or case government).

When we say in a phrase structure grammar that the immediate constituents of a `PP` are `P` and `NP`, we are implicitly appealing to the head / dependent distinction. A prepositional phrase is a phrase whose head is a preposition; moreover, the `NP` is a dependent of `P`. The same distinction carries over to the other types of phrase that we have discussed. The key point to note here is that although phrase structure grammars seem very different from dependency grammars, they implicitly embody a recognition of dependency relations. While CFGs are not intended to directly capture dependencies, more recent linguistic frameworks have increasingly adopted formalisms which combine aspects of both approaches.
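
To make the implicit head / dependent relation concrete, here is a small sketch that reads dependency pairs off a phrase structure tree. It is plain Python rather than an NLTK facility: the tuple encoding of trees, the `HEAD_CHILD` table and the function names are all assumptions chosen for illustration.

```python
# Hypothetical head table: which child category heads each phrase type.
HEAD_CHILD = {"PP": "P", "NP": "N"}


def is_preterminal(tree):
    # A preterminal looks like ("P", "in"): a category over a single word.
    return len(tree) == 2 and isinstance(tree[1], str)


def head_child(tree):
    # Pick the child whose category is listed as the head of this phrase.
    for child in tree[1:]:
        if child[0] == HEAD_CHILD[tree[0]]:
            return child
    raise ValueError("no head child found for " + tree[0])


def lexical_head(tree):
    # Follow head children down to a word.
    if is_preterminal(tree):
        return tree[1]
    return lexical_head(head_child(tree))


def dependencies(tree):
    """Return (head word, dependent word) pairs implied by the tree."""
    if is_preterminal(tree):
        return []
    head = head_child(tree)
    pairs = []
    for child in tree[1:]:
        if child is not head:
            pairs.append((lexical_head(head), lexical_head(child)))
        pairs.extend(dependencies(child))
    return pairs


pp = ("PP", ("P", "in"), ("NP", ("Det", "my"), ("N", "pajamas")))
print(dependencies(pp))  # [('in', 'pajamas'), ('pajamas', 'my')]
```

The two pairs correspond to the productions `'in' -> 'pajamas'` and `'pajamas' -> 'my'` in `groucho_dep_grammar`.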

<h3>Valency and the Lexicon</h3>

Let us take a closer look at verbs and their dependents. The grammar in the example in Section 3 correctly generates examples like 1a-1d.

> 1a.		The squirrel was frightened.

> 1b.		Chatterer saw the bear.

> 1c.		Chatterer thought Buster was angry.

> 1d.		Joe put the fish on the log.

These possibilities correspond to the following productions:

> ![alt text](https://i.hizliresim.com/dLkyEV.png)

That is, *was* can occur with a following `Adj`, *saw* can occur with a following `NP`, *thought* can occur with a following `S`, and *put* can occur with a following `NP` and `PP`. The dependents `Adj`, `NP`, `PP` and `S` are often called **complements** of the respective verbs, and there are strong constraints on which verbs can occur with which complements.

By contrast with 1a-1d, the word sequences in 2a-2d are ill-formed:
> 2a.		The squirrel was Buster was angry.

> 2b.		Chatterer saw frightened.

> 2c.		Chatterer thought the bear.

> 2d.		Joe put on the log.


In the tradition of dependency grammar, the verbs in 1a-1d are said to have different **valencies**. Valency restrictions are not just applicable to verbs, but also to the other classes of heads.
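
One simple way to picture valency is as a lexicon that maps each verb to the complement sequences it allows. The sketch below uses only the four verbs and complement patterns from examples 1a-1d; the dictionary layout and the helper name `licenses` are assumptions for illustration, not an NLTK data structure.

```python
# Complement frames read off examples 1a-1d (illustrative layout).
VALENCY = {
    "was": {("Adj",)},
    "saw": {("NP",)},
    "thought": {("S",)},
    "put": {("NP", "PP")},
}


def licenses(verb, complements):
    """True if the verb allows exactly this sequence of complements."""
    return tuple(complements) in VALENCY.get(verb, set())


print(licenses("put", ["NP", "PP"]))  # 1d: Joe put the fish on the log -> True
print(licenses("put", ["PP"]))        # 2d: Joe put on the log -> False
```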

Within frameworks based on phrase structure grammar, various techniques have been proposed for excluding the ungrammatical examples in 2a-2d. In a CFG, we need some way of constraining grammar productions which expand `VP` so that verbs *only* co-occur with their correct complements. We can do this by dividing the class of verbs into "subcategories", each of which is associated with a different set of complements. For example, **transitive verbs** such as *chased* and *saw* require a following `NP` object complement; that is, they are **subcategorized** for `NP` direct objects. If we introduce a new category label for transitive verbs, namely `TV` (for Transitive Verb), then we can use it in the following productions:



```
VP -> TV NP
TV -> 'chased' | 'saw'
```

> ![alt text](https://i.hizliresim.com/9Y2JBN.png)
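
As a rough check that subcategory labels really do block ill-formed sequences, the sketch below embeds the `TV` productions in a tiny CFG and counts parses. Everything apart from the two `TV` productions is an assumption added so that the example is self-contained, including the `Cop` label for *was* and the small vocabulary.

```python
import nltk

# Toy grammar: only the TV productions come from the text above.
subcat_grammar = nltk.CFG.fromstring("""
    S -> NP VP
    VP -> TV NP | Cop Adj
    NP -> 'Chatterer' | Det N
    Det -> 'the'
    N -> 'bear' | 'squirrel'
    TV -> 'chased' | 'saw'
    Cop -> 'was'
    Adj -> 'frightened'
    """)
subcat_parser = nltk.ChartParser(subcat_grammar)


def count_parses(sentence):
    # Number of trees the chart parser finds for a whitespace-split sentence.
    return len(list(subcat_parser.parse(sentence.split())))


print(count_parses("Chatterer saw the bear"))    # 1b: one parse
print(count_parses("Chatterer saw frightened"))  # 2b: no parse
```

Because *saw* is only a `TV`, and `TV` only combines with an `NP`, the sequence in 2b receives no analysis.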

Complements are often contrasted with modifiers (or adjuncts), although both are kinds of dependent. Prepositional phrases, adjectives and adverbs typically function as modifiers. Unlike complements, modifiers are optional, can often be iterated, and are not selected for by heads in the same way as complements. For example, the adverb *really* can be added as a modifier to all the sentences in 3a-3d:

> 3a. The squirrel really was frightened.

> 3b. Chatterer really saw the bear.

> 3c. 	Chatterer really thought Buster was angry.

> 3d. Joe really put the fish on the log.

The structural ambiguity of `PP` attachment, which we have illustrated in both phrase structure and dependency grammars, corresponds semantically to an ambiguity in the scope of the modifier.

<h2>6. Grammar Development</h2>

<h3>Treebanks and Grammars</h3>

The `corpus` module defines the `treebank` corpus reader, which contains a 10% sample of the Penn Treebank corpus.

In [0]:
from nltk.corpus import treebank
t = treebank.parsed_sents('wsj_0001.mrg')[0]
print(t)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


We can use this data to help develop a grammar. For example, the program below uses a simple filter to find verbs that take sentential complements. Assuming we already have a production of the form `VP -> Vs S`, this information enables us to identify particular verbs that would be included in the expansion of `Vs`.

In [0]:
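from nltk.corpus import treebank

# Named is_vp_with_s rather than filter to avoid shadowing Python's built-in filter().
def is_vp_with_s(tree):
    # Labels of the children that are subtrees (leaf strings have no label).
    child_nodes = [child.label() for child in tree
                   if isinstance(child, nltk.Tree)]
    # Keep VP nodes that have a sentential (S) child.
    return (tree.label() == 'VP') and ('S' in child_nodes)

[subtree for tree in treebank.parsed_sents()
  for subtree in tree.subtrees(is_vp_with_s)]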

The Prepositional Phrase Attachment Corpus, `nltk.corpus.ppattach` is another source of information about the valency of particular verbs. Here we illustrate a technique for mining this corpus. It finds pairs of prepositional phrases where the preposition and noun are fixed, but where the choice of verb determines whether the prepositional phrase is attached to the `VP` or to the `NP`.

Amongst the output lines of this program we find `offer-from-group N: ['rejected'] V: ['received']`, which indicates that *received* expects a separate `PP` complement attached to the `VP`, while *rejected* does not. As before, we can use this information to help construct the grammar.

In [0]:
from collections import defaultdict
entries = nltk.corpus.ppattach.attachments('training')
table = defaultdict(lambda: defaultdict(set))
for entry in entries:
  key = entry.noun1 + '-' + entry.prep + '-' + entry.noun2
  table[key][entry.attachment].add(entry.verb)

for key in sorted(table):
  if len(table[key]) > 1:
    print(key, 'N:', sorted(table[key]['N']), 'V:', sorted(table[key]['V']))

The NLTK corpus collection includes data from the PE08 Cross-Framework and Cross Domain Parser Evaluation Shared Task. A collection of larger grammars has been prepared for the purpose of comparing different parsers, which can be obtained by downloading the `large_grammars` package (e.g. `python -m nltk.downloader large_grammars`).

The NLTK corpus collection also includes a sample from the *Sinica Treebank Corpus*, consisting of 10,000 parsed sentences drawn from the *Academia Sinica Balanced Corpus of Modern Chinese*. To display one of the trees in this corpus, switch to a different environment, since Colaboratory does not support displays, and run the code below:


```
nltk.corpus.sinica_treebank.parsed_sents()[3450].draw()              
```

The output:


![alt text](https://i.hizliresim.com/Z5kEgk.png)




<h3>Pernicious Ambiguity</h3>

Unfortunately, as the coverage of the grammar increases and the length of the input sentences grows, the number of parse trees grows rapidly. In fact, it grows at an astronomical rate.

Let's explore this issue with the help of a simple example. The word *fish* is both a noun and a verb. We can make up the sentence *fish fish fish*, meaning *fish like to fish for other fish*. (Try this with *police* if you prefer something more sensible.) Here is a toy grammar for the "fish" sentences.

In [0]:
grammar = nltk.CFG.fromstring("""
    S -> NP V NP
    NP -> NP Sbar
    Sbar -> NP V
    NP -> 'fish'
    V -> 'fish'
    """)

Now we can try parsing a longer sentence, *fish fish fish fish fish*, which amongst other things, means '*fish that other fish fish are in the habit of fishing fish themselves*'. We use the NLTK chart parser, which was mentioned earlier in this chapter. This sentence has two readings.

In [0]:
tokens = ["fish"] * 5
cp = nltk.ChartParser(grammar)
for tree in cp.parse(tokens):
  print(tree)

(S (NP fish) (V fish) (NP (NP fish) (Sbar (NP fish) (V fish))))
(S (NP (NP fish) (Sbar (NP fish) (V fish))) (V fish) (NP fish))
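
To get a feel for how fast the number of readings grows, the sketch below counts the parses that the toy "fish" grammar assigns to longer sentences without building the trees. It is plain Python written for this one grammar only; the function names are assumptions, and each function mirrors one production.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_np(n):
    """Number of ways the toy grammar derives an NP from n copies of 'fish'."""
    if n == 1:
        return 1  # NP -> 'fish'
    # NP -> NP Sbar, where the Sbar covers the last m words (m >= 2).
    return sum(count_np(n - m) * count_sbar(m) for m in range(2, n))


def count_sbar(m):
    # Sbar -> NP V: an NP over m - 1 words followed by the one-word V.
    return count_np(m - 1)


def count_s(n):
    # S -> NP V NP: an NP over i words, one word for V, an NP over the rest.
    return sum(count_np(i) * count_np(n - 1 - i) for i in range(1, n - 1))


for n in (3, 5, 7, 9, 11):
    print(n, count_s(n))
```

For sentence lengths 3, 5, 7, 9 and 11 this gives 1, 2, 5, 14 and 42 parses. The count for five words agrees with the two trees printed by the chart parser, and the sequence continues as the Catalan numbers.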


<h3>Weighted Grammar</h3>

Chart parsers improve the efficiency of computing multiple parses of the same sentence, but they are still overwhelmed by the sheer number of possible parses. Weighted grammars and probabilistic parsing algorithms have provided an effective solution to these problems.

Before looking at these, we need to understand why the notion of grammaticality could be *gradient*. Consider the verb *give*. This verb requires both a direct object (the thing being given) and an indirect object (the recipient). These complements can be given in either order, as illustrated in 1a and 1b. In the "prepositional dative" form in 1a, the direct object appears first, followed by a prepositional phrase containing the indirect object.

> 1a. Kim gave a bone to the dog

> 1b. Kim gave the dog a bone

In the "double object" form in 1b, the indirect object appears first, followed by the direct object. In the above case, either order is acceptable. However, if the indirect object is a pronoun, there is a strong preference for the double object construction:

> 2a. \*Kim gives the heebie-jeebies to me (prepositional dative)

> 2b. Kim gives me the heebie-jeebies (double object)

Using the Penn Treebank sample, we can examine all instances of prepositional dative and double object constructions involving *give*:



In [0]:
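def give(t):
    # A VP headed by give/gave whose first complement is an NP and whose second
    # is a dative PP (prepositional dative) or another NP (double object).
    return t.label() == 'VP' and len(t) > 2 and t[1].label() == 'NP'\
           and (t[2].label() == 'PP-DTV' or t[2].label() == 'NP')\
           and ('give' in t[0].leaves() or 'gave' in t[0].leaves())

def sent(t):
    # Join the leaves, skipping tokens that start with '*', '-' or '0'.
    return ' '.join(token for token in t.leaves() if token[0] not in '*-0')

def print_node(t, width):
    # Print the verb and its two complements, truncated to `width` characters.
    output = "%s %s: %s / %s: %s" %\
        (sent(t[0]), t[1].label(), sent(t[1]), t[2].label(), sent(t[2]))
    if len(output) > width:
        output = output[:width] + "..."
    print(output)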

In [0]:
for tree in nltk.corpus.treebank.parsed_sents():
  for t in tree.subtrees(give):
    print_node(t, 72)

We can observe a strong tendency for the shortest complement to appear first. However, this does not account for a form like `give NP: federal judges / NP: a raise`, where animacy may play a role. In fact, there turn out to be a large number of contributing factors, as surveyed by Bresnan & Hay (2006). Such preferences can be represented in a weighted grammar.

A **probabilistic context free grammar** (or PCFG) is a context free grammar that associates a probability with each of its productions. It generates the same set of parses for a text that the corresponding context free grammar does, and assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the product of the probabilities of the productions used to generate it.

The simplest way to define a PCFG is to load it from a specially formatted string consisting of a sequence of weighted productions, where weights appear in brackets:

In [0]:
grammar = nltk.PCFG.fromstring("""
    S    -> NP VP              [1.0]
    VP   -> TV NP              [0.4]
    VP   -> IV                 [0.3]
    VP   -> DatV NP NP         [0.3]
    TV   -> 'saw'              [1.0]
    IV   -> 'ate'              [1.0]
    DatV -> 'gave'             [1.0]
    NP   -> 'telescopes'       [0.8]
    NP   -> 'Jack'             [0.2]
    """)

print(grammar)

It is sometimes convenient to combine multiple productions into a single line, e.g. `VP -> TV NP [0.4] | IV [0.3] | DatV NP NP [0.3]`. In order to ensure that the trees generated by the grammar form a probability distribution, PCFG grammars impose the constraint that all productions with a given left-hand side must have probabilities that sum to one. The grammar in the example above obeys this constraint: for `S`, there is only one production, with a probability of 1.0; for `VP`, 0.4+0.3+0.3=1.0; and for `NP`, 0.8+0.2=1.0. The parse tree returned by `parse()` includes probabilities:

In [0]:
viterbi_parser = nltk.ViterbiParser(grammar)
for tree in viterbi_parser.parse(['Jack', 'saw', 'telescopes']):
  print(tree)

(S (NP Jack) (VP (TV saw) (NP telescopes))) (p=0.064)
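
The reported probability can be reproduced by hand. The sketch below copies the production probabilities from the PCFG above into a plain dictionary (the dictionary layout and variable names are assumptions for illustration), checks the sum-to-one constraint, and multiplies the probabilities of the five productions used in the parse.

```python
from collections import defaultdict
from math import isclose, prod

# Productions and probabilities copied from the PCFG above.
PRODUCTIONS = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("TV", "NP")): 0.4,
    ("VP", ("IV",)): 0.3,
    ("VP", ("DatV", "NP", "NP")): 0.3,
    ("TV", ("saw",)): 1.0,
    ("IV", ("ate",)): 1.0,
    ("DatV", ("gave",)): 1.0,
    ("NP", ("telescopes",)): 0.8,
    ("NP", ("Jack",)): 0.2,
}

# Probabilities of productions sharing a left-hand side must sum to one.
totals = defaultdict(float)
for (lhs, _rhs), p in PRODUCTIONS.items():
    totals[lhs] += p
assert all(isclose(total, 1.0) for total in totals.values())

# Productions used in (S (NP Jack) (VP (TV saw) (NP telescopes))).
used = [
    ("S", ("NP", "VP")),
    ("NP", ("Jack",)),
    ("VP", ("TV", "NP")),
    ("TV", ("saw",)),
    ("NP", ("telescopes",)),
]
parse_prob = prod(PRODUCTIONS[rule] for rule in used)
print(round(parse_prob, 3))  # 0.064
```

That is, 1.0 × 0.2 × 0.4 × 1.0 × 0.8 = 0.064, matching the value attached to the tree.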
