# Constituency Grammars with NLTK

- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

*Notebook Covers Material of*:
- [NLTK](https://www.nltk.org/book/ch08.html) Chapter 8: Analyzing Sentence Structure

__Requirements__

- [NLTK](https://www.nltk.org/)

## Defining Context Free Grammars (CFG)

A CFG is defined by a *start symbol* and a set of *production rules*. The *start symbol* defines the root node of parse trees (usually __S__).

*Production rules* specify allowed parent-child relations in a parse tree. Each production specifies what node can be the parent of a particular set of children nodes. 

For example, the production `S -> NP VP` specifies that an `S` node can be the parent of an `NP` node and a `VP` node.

The left-hand side of a production rule is a single *non-terminal*, the potential parent node, while the right-hand side lists the allowed children, which may be *non-terminals* or *terminals* (words of the text).

A production like `VP -> V NP | V NP PP` has a disjunction on the right-hand side, shown by the `|`, and is an abbreviation for the two productions `VP -> V NP` and `VP -> V NP PP`.
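As a quick check of the `|` shorthand, the sketch below builds a tiny grammar with `nltk.CFG.fromstring` and lists its `VP` productions. The lexical rules (`"it"`, `"saw"`, `"with"`) are placeholders chosen only for this illustration.

```python
import nltk

# A small grammar; the words are placeholders for illustration only.
grammar = nltk.CFG.fromstring("""
    VP -> V NP | V NP PP
    NP -> "it"
    PP -> P NP
    V -> "saw"
    P -> "with"
""")

# The disjunction is stored as two separate productions.
vp_rules = grammar.productions(lhs=nltk.Nonterminal("VP"))
for rule in vp_rules:
    print(rule)
```

Each alternative separated by `|` ends up as its own production object with the same left-hand side.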



### Syntactic Categories

| __Symbol__ | __Meaning__ | __Example__ |
|:-----------|:------------|:------------|
| S   | sentence             | I saw the man |
| NP  | noun phrase          | the man | 
| VP  | verb phrase          | saw the man |
| PP  | prepositional phrase | with a telescope |
| Det | determiner  | the |
| N   | noun        | man |
| V   | verb        | saw |
| P   | preposition | with |


- Non-Terminals: `S`, `NP`, `VP`, `PP`
- Pre-Terminals: `Det`, `N`, `V`, `P` (Part-of-Speech Tags)
- Terminals (Leaves): the, man, saw, ...

### Defining Grammar in NLTK

A grammar can be defined either as a single multi-line string or as a list of strings, each holding one production rule (possibly with `|` alternatives).

In [14]:
import nltk

rules = [
    'S -> NP VP',
    'NP -> Det N | Det N PP | PRON',
    'VP -> V NP | V NP PP',
    'PP -> P NP',
    'Det -> "the" | "a"',
    'N -> "man" | "telescope"',
    'PRON -> "I"',
    'V -> "saw"',
    'P -> "with"'   
]

toy_grammar = nltk.CFG.fromstring(rules)

print(toy_grammar)


Grammar with 14 productions (start state = S)
    S -> NP VP
    NP -> Det N
    NP -> Det N PP
    NP -> PRON
    VP -> V NP
    VP -> V NP PP
    PP -> P NP
    Det -> 'the'
    Det -> 'a'
    N -> 'man'
    N -> 'telescope'
    PRON -> 'I'
    V -> 'saw'
    P -> 'with'
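The same grammar can also be written as one multi-line string. The following sketch (the variable names `grammar_text`, `from_string`, and `from_list` are made up for this example) suggests that both input forms yield the same productions.

```python
import nltk

grammar_text = """
    S -> NP VP
    NP -> Det N | Det N PP | PRON
    VP -> V NP | V NP PP
    PP -> P NP
    Det -> "the" | "a"
    N -> "man" | "telescope"
    PRON -> "I"
    V -> "saw"
    P -> "with"
"""

# Form 1: a single multi-line string.
from_string = nltk.CFG.fromstring(grammar_text)

# Form 2: a list of strings, here obtained by splitting the same text.
from_list = nltk.CFG.fromstring(grammar_text.strip().splitlines())

# Both forms should describe the same set of productions.
print(from_string.productions() == from_list.productions())
```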


A grammar object has two components:
- start symbol
- production rules

These can be accessed as follows.

In [15]:
print(toy_grammar.start())

S


In [16]:
print(toy_grammar.productions())

[S -> NP VP, NP -> Det N, NP -> Det N PP, NP -> PRON, VP -> V NP, VP -> V NP PP, PP -> P NP, Det -> 'the', Det -> 'a', N -> 'man', N -> 'telescope', PRON -> 'I', V -> 'saw', P -> 'with']


Each production has 2 parts:
- left-hand side
- right-hand side

which can be accessed as follows:

In [17]:
rule = toy_grammar.productions()[0]
print(rule.lhs())
print(rule.rhs())

S
(NP, VP)
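The right-hand side also tells us whether a production is lexical (it introduces a word) or purely phrasal. As a rough sketch, `Production.is_lexical()` and `Production.is_nonlexical()` can be used to separate the two groups and recover the pre-terminals listed in the table above; the variable names are illustrative.

```python
import nltk

toy_grammar = nltk.CFG.fromstring([
    'S -> NP VP',
    'NP -> Det N | Det N PP | PRON',
    'VP -> V NP | V NP PP',
    'PP -> P NP',
    'Det -> "the" | "a"',
    'N -> "man" | "telescope"',
    'PRON -> "I"',
    'V -> "saw"',
    'P -> "with"',
])

# Lexical productions have at least one terminal (a word) on the right.
lexical = [p for p in toy_grammar.productions() if p.is_lexical()]
# Non-lexical productions contain only non-terminals on the right.
phrasal = [p for p in toy_grammar.productions() if p.is_nonlexical()]

# The left-hand sides of lexical productions are the pre-terminals.
pre_terminals = sorted({str(p.lhs()) for p in lexical})
print(pre_terminals)
print(len(lexical), len(phrasal))
```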


## Parsing with CFG

> A parser processes input sentences according to the productions of a grammar, and builds one or more constituent structures that conform to the grammar. A grammar is a declarative specification of well-formedness — it is actually just a string, not a program. A parser is a procedural interpretation of the grammar. It searches through the space of trees licensed by a grammar to find one that has the required sentence along its fringe (outer edges).


### Available CFG Parsers

- Recursive descent parsing
    - top-down algorithm
    - pro: finds all successful parses.
    - con: inefficient; it tries every rule by brute force, even those that cannot match the input, and goes into an infinite loop on a left-recursive rule.
    - `nltk.RecursiveDescentParser()`

- Shift-reduce parsing
    - bottom-up algorithm
    - pro: efficient; it only works with rules that match the input words.
    - con: may fail to find a legitimate parse even when there is one.
    - `nltk.ShiftReduceParser()`

- The left-corner parser
    - a top-down parser with bottom-up filtering
    - `nltk.LeftCornerChartParser()`: combines left-corner parsing and chart parsing

- Chart parsing
    - uses dynamic programming: builds and consults a well-formed substring table (WFST)
    - pro: efficient.
    - con: may use a lot of memory on long sentences.
    - `nltk.ChartParser()`


In [18]:
parser = nltk.ChartParser(toy_grammar)

sent = "I saw the man with a telescope"

for tree in parser.parse(sent.split()):
    print(tree)

(S
  (NP (PRON I))
  (VP
    (V saw)
    (NP (Det the) (N man))
    (PP (P with) (NP (Det a) (N telescope)))))
(S
  (NP (PRON I))
  (VP
    (V saw)
    (NP (Det the) (N man) (PP (P with) (NP (Det a) (N telescope))))))


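The sentence has two possible parse trees, so it is structurally ambiguous. This is a case of prepositional phrase attachment ambiguity: the PP `with a telescope` attaches to the verb phrase in the first tree and to the noun phrase `the man` in the second.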
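The other parsers listed above can be tried on the same sentence. The sketch below simply counts the trees each parser returns; the dictionary names are illustrative. It does not assume any particular result for the shift-reduce parser, which, as noted above, may miss valid parses.

```python
import nltk

toy_grammar = nltk.CFG.fromstring([
    'S -> NP VP',
    'NP -> Det N | Det N PP | PRON',
    'VP -> V NP | V NP PP',
    'PP -> P NP',
    'Det -> "the" | "a"',
    'N -> "man" | "telescope"',
    'PRON -> "I"',
    'V -> "saw"',
    'P -> "with"',
])
tokens = "I saw the man with a telescope".split()

# The shift-reduce parser may print warnings about rules it cannot use.
parsers = {
    "chart": nltk.ChartParser(toy_grammar),
    "recursive descent": nltk.RecursiveDescentParser(toy_grammar),
    "shift-reduce": nltk.ShiftReduceParser(toy_grammar),
}

# Count how many parse trees each parser finds for the same input.
counts = {}
for name, parser in parsers.items():
    counts[name] = len(list(parser.parse(tokens)))
    print(f"{name}: {counts[name]} parse(s)")
```

The recursive descent parser is safe here only because the toy grammar has no left-recursive rule.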

### Exercise

- Define a grammar that covers the following sentences.

    - show me movies in italian
    - show me movies made in italy
    - show me movies from italy
    
- Try different parsers


In [102]:
rules = [
    'S -> NP VP | VP', # Added: VP
    'NP -> Det N | Det N PP | PRON',
    'VP -> V NP | V NP PP | V PRON NP', # Added: V PRON NP
    'PP -> P NP | V P NP', # Added: V P NP
    'Det -> "the" | ', # Added: empty
    'N -> "movies" | "italy" | "italian"',
    'PRON -> "me"',
    'V -> "show" | "made"',
    'P -> "in" | "from"'   
]

es_grammar = nltk.CFG.fromstring(rules)

parser = nltk.ChartParser(es_grammar)
sent = "show me movies in italian"
for tree in parser.parse(sent.split()):
    print(tree)

(S
  (VP
    (V show)
    (PRON me)
    (NP (Det ) (N movies) (PP (P in) (NP (Det ) (N italian))))))
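One possible way to check this grammar against all three exercise sentences is sketched below. The rules are copied from the cell above; `parse_counts` and `covered` are names made up for this example. The last part uses `check_coverage()` to illustrate that terminals are case-sensitive.

```python
import nltk

# The exercise grammar from above, copied unchanged.
rules = [
    'S -> NP VP | VP',
    'NP -> Det N | Det N PP | PRON',
    'VP -> V NP | V NP PP | V PRON NP',
    'PP -> P NP | V P NP',
    'Det -> "the" | ',
    'N -> "movies" | "italy" | "italian"',
    'PRON -> "me"',
    'V -> "show" | "made"',
    'P -> "in" | "from"',
]
es_grammar = nltk.CFG.fromstring(rules)
parser = nltk.ChartParser(es_grammar)

sentences = [
    "show me movies in italian",
    "show me movies made in italy",
    "show me movies from italy",
]

# Count the parse trees found for each sentence.
parse_counts = {}
for sent in sentences:
    parse_counts[sent] = len(list(parser.parse(sent.split())))
    print(sent, "->", parse_counts[sent], "parse(s)")

# "Italy" (capitalized) is not in the lexicon, so coverage fails.
try:
    es_grammar.check_coverage("show me movies from Italy".split())
    covered = True
except ValueError as err:
    covered = False
    print(err)
```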


## Probabilistic Context Free Grammars

PCFGs are very similar to CFGs: each production simply carries an additional probability.

For any given left-hand-side non-terminal, the probabilities of its productions must sum to 1.0.

In [19]:
weighted_rules = [
    'S -> NP VP [1.0]',
    'NP -> Det N [0.6]',
    'NP -> Det N PP [0.3]',
    'NP -> PRON [0.1]',
    'VP -> V NP [0.7]',
    'VP -> V NP PP [0.3]',
    'PP -> P NP [1.0]',
    'Det -> "the" [0.5]',
    'Det -> "a" [0.5]',
    'N -> "man" [0.5]',
    'N -> "telescope" [0.5]',
    'PRON -> "I" [1.0]',
    'V -> "saw" [1.0]',
    'P -> "with" [1.0]'   
]

toy_grammar = nltk.PCFG.fromstring(weighted_rules)

print(toy_grammar)

Grammar with 14 productions (start state = S)
    S -> NP VP [1.0]
    NP -> Det N [0.6]
    NP -> Det N PP [0.3]
    NP -> PRON [0.1]
    VP -> V NP [0.7]
    VP -> V NP PP [0.3]
    PP -> P NP [1.0]
    Det -> 'the' [0.5]
    Det -> 'a' [0.5]
    N -> 'man' [0.5]
    N -> 'telescope' [0.5]
    PRON -> 'I' [1.0]
    V -> 'saw' [1.0]
    P -> 'with' [1.0]
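The sum-to-one constraint can be checked by hand, and NLTK enforces it when the grammar is built. The sketch below sums the probabilities per left-hand side and then tries a deliberately invalid two-rule grammar (invented for this example), which is expected to raise a `ValueError`.

```python
from collections import defaultdict

import nltk

toy_pcfg = nltk.PCFG.fromstring([
    'S -> NP VP [1.0]',
    'NP -> Det N [0.6] | Det N PP [0.3] | PRON [0.1]',
    'VP -> V NP [0.7] | V NP PP [0.3]',
    'PP -> P NP [1.0]',
    'Det -> "the" [0.5] | "a" [0.5]',
    'N -> "man" [0.5] | "telescope" [0.5]',
    'PRON -> "I" [1.0]',
    'V -> "saw" [1.0]',
    'P -> "with" [1.0]',
])

# Sum the probabilities of all productions sharing a left-hand side.
totals = defaultdict(float)
for prod in toy_pcfg.productions():
    totals[str(prod.lhs())] += prod.prob()
print(dict(totals))

# An invalid grammar: the probabilities for S sum to 0.9, not 1.0.
try:
    nltk.PCFG.fromstring(['S -> "a" [0.5]', 'S -> "b" [0.4]'])
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```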


In addition to a left-hand side and a right-hand side, each probabilistic production has a probability.

In [20]:
rule = toy_grammar.productions()[0]
print(rule.lhs())
print(rule.rhs())
print(rule.prob())

S
(NP, VP)
1.0


## Learning Grammars from a Treebank

The most important method is to induce a PCFG from the trees in a treebank with `induce_pcfg()`.

NLTK provides a portion of the Penn Treebank corpus, which we can use to induce rules.

In [21]:
nltk.download('treebank')

[nltk_data] Downloading package treebank to
[nltk_data]     /Users/mdestro/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


True

Production rules can be extracted with the `productions()` method while iterating over the parsed sentences in the corpus.

In [22]:
from nltk.corpus import treebank

print(treebank)

productions = []
# collect the productions of every parsed sentence in the corpus sample
for item in treebank.fileids():
    for tree in treebank.parsed_sents(item):
        productions += tree.productions()
    
print(len(productions))

<BracketParseCorpusReader in '/Users/mdestro/nltk_data/corpora/treebank/combined'>
179360


The grammar can be induced as follows:

In [23]:
from nltk import Nonterminal
S = Nonterminal('S')
grammar = nltk.induce_pcfg(S, productions)
print(grammar)

11236]
    PP -> ADVP PP CC PP NP-2 [0.000193836]
    NP-2 -> NP , NP [0.0238095]
    NNP -> 'Wallach' [0.00010627]
    NP -> JJ NNP NNP NNP NNP [4.21514e-05]
    NNP -> 'Attorney' [0.00010627]
    NNP -> 'Meese' [0.00010627]
    ADJP -> JJ VBN [0.0014556]
    VBN -> 'fashioned' [0.000468604]
    NN -> 'bribery' [7.59532e-05]
    VP -> VBD NP-CLR PP-CLR , S-ADV [6.8918e-05]
    NP-CLR -> JJ NN [0.037037]
    NN -> 'peddling' [7.59532e-05]
    NP -> ADJP `` JJ '' NN NNS [4.21514e-05]
    RB -> 'politically' [0.000354359]
    JJ -> 'respectable' [0.000171409]
    NN -> 'confidant' [7.59532e-05]
    NNP -> 'Lyn' [0.00010627]
    NNP -> 'Nofzinger' [0.00010627]
    VP -> VBD CC VBD NP PP-CLR [6.8918e-05]
    NP-SBJ -> DT VBN [0.000130856]
    VBN -> 'bribed' [0.000468604]
    NN -> 'merit' [7.59532e-05]
    JJ -> 'corrupt' [0.000171409]
    NN -> 'scheme' [7.59532e-05]
    NP -> CONJP NP , CC NP [4.21514e-05]
    NN -> 'bag' [7.59532e-05]
    NN -> 'crook' [7.59532e-05]
    VP -> VBD NP AD
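To see what `induce_pcfg()` estimates without downloading the corpus, here is a minimal sketch on two hand-written trees (invented for this example, not taken from the treebank). Each probability is the relative frequency of a production among all productions with the same left-hand side.

```python
import nltk
from nltk import Nonterminal, Tree

# Two toy trees, written by hand for illustration.
trees = [
    Tree.fromstring("(S (NP (PRON I)) (VP (V saw) (NP (Det the) (N man))))"),
    Tree.fromstring(
        "(S (NP (Det a) (N man)) (VP (V saw) (NP (Det the) (N telescope))))"
    ),
]

# Collect the productions used in every tree, duplicates included.
productions = []
for tree in trees:
    productions += tree.productions()

mini_grammar = nltk.induce_pcfg(Nonterminal("S"), productions)

# Map each NP expansion to its estimated probability.
np_probs = {
    " ".join(str(sym) for sym in prod.rhs()): prod.prob()
    for prod in mini_grammar.productions(lhs=Nonterminal("NP"))
}
print(np_probs)
```

`NP -> Det N` occurs three times out of four `NP` expansions, and `NP -> PRON` once, which is what the estimated probabilities reflect.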

### PCFG Parsers
NLTK provides several PCFG parsers:

From `nltk.parse.viterbi`

- ViterbiParser
    - A bottom-up PCFG parser that uses dynamic programming to find the single most likely parse for a text. `ViterbiParser` parses texts by filling in a “most likely constituent table”. This table records the most probable tree representation for any given span and node value. In particular, it has an entry for every start index, end index, and node value, recording the most likely subtree that spans from the start index to the end index and has the given node value.

From `nltk.parse.pchart` module

- InsideChartParser
    - A bottom-up parser for PCFG grammars that tries edges in descending order of the inside probabilities of their trees.
    - pass the `beam_size` argument (e.g. `beam_size = len(tokens)+1`) to limit the size of the edge queue
    
- RandomChartParser
    - A bottom-up parser for PCFG grammars that tries edges in random order. This sorting order results in a random search strategy.
    
- UnsortedChartParser
    - A bottom-up parser for PCFG grammars that tries edges in no particular order.

- LongestChartParser
    - A bottom-up parser for PCFG grammars that tries longer edges before shorter ones. This sorting order results in a type of best-first search strategy.

Read about them in the [documentation](http://www.nltk.org/api/nltk.parse.html).

Let's parse one of the sentences from above using the Viterbi parser.

In [101]:
parser = nltk.ViterbiParser(grammar)

for tree in parser.parse("show me movies made in Italy".split()):
    print(tree)

(S
  (NP-SBJ-3
    (NP (NN show))
    (FRAG (NP-SBJ (PRP me)) (NP (NNS movies))))
  (VP (VBN made) (PP-CLR (IN in) (NP (NNP Italy))))) (p=1.17393e-28)
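With the induced grammar the probabilities are hard to verify by hand, so a smaller sketch with the toy PCFG from earlier may help. It assumes the same weighted rules as above and compares `ViterbiParser` with `InsideChartParser` (used here without a beam, so nothing is pruned). The probability of a tree is the product of the probabilities of the productions it uses.

```python
import nltk
from nltk.parse.pchart import InsideChartParser

toy_pcfg = nltk.PCFG.fromstring([
    'S -> NP VP [1.0]',
    'NP -> Det N [0.6] | Det N PP [0.3] | PRON [0.1]',
    'VP -> V NP [0.7] | V NP PP [0.3]',
    'PP -> P NP [1.0]',
    'Det -> "the" [0.5] | "a" [0.5]',
    'N -> "man" [0.5] | "telescope" [0.5]',
    'PRON -> "I" [1.0]',
    'V -> "saw" [1.0]',
    'P -> "with" [1.0]',
])
tokens = "I saw the man with a telescope".split()

# The Viterbi parser returns only the single most likely tree.
best = next(nltk.ViterbiParser(toy_pcfg).parse(tokens))
print(best)
print(best.prob())

# The inside chart parser returns its parses sorted by probability.
ranked = list(InsideChartParser(toy_pcfg).parse(tokens))
for tree in ranked:
    print(tree.prob())
```

Under these weights the noun-phrase attachment wins: 0.1 × 0.7 × 0.3 × 0.25 × 0.15 = 0.0007875, against 0.1 × 0.3 × 0.15 × 0.15 = 0.000675 for the verb-phrase attachment.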


#### Exercise

- Try different parsers to parse the sentences from above (about movies)
- Compare the assigned probabilities
- Compare the time it takes to parse the sentences

## Generating Sentences

Grammars can also be used to generate sentences, using the `generate` function (see the [documentation](http://www.nltk.org/api/nltk.parse.html#module-nltk.parse.generate)).

Its signature is `nltk.parse.generate.generate(grammar, start=None, depth=None, n=None)`, with the following arguments:

- grammar – The Grammar used to generate sentences.
- start – The Nonterminal from which to start generating sentences.
- depth – The maximal depth of the generated tree.
- n – The maximum number of sentences to return.

In [25]:
from nltk.parse.generate import generate

for sent in generate(toy_grammar, n=10):
    print(sent)

['the', 'man', 'saw', 'the', 'man']
['the', 'man', 'saw', 'the', 'telescope']
['the', 'man', 'saw', 'a', 'man']
['the', 'man', 'saw', 'a', 'telescope']
['the', 'man', 'saw', 'the', 'man', 'with', 'the', 'man']
['the', 'man', 'saw', 'the', 'man', 'with', 'the', 'telescope']
['the', 'man', 'saw', 'the', 'man', 'with', 'a', 'man']
['the', 'man', 'saw', 'the', 'man', 'with', 'a', 'telescope']
['the', 'man', 'saw', 'the', 'man', 'with', 'the', 'man', 'with', 'the', 'man']
['the', 'man', 'saw', 'the', 'man', 'with', 'the', 'man', 'with', 'the', 'telescope']
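The `start` and `depth` arguments restrict what is generated. As a sketch, the call below starts from `NP` instead of `S` and limits the tree depth to 3, which leaves no room for an embedded `PP`. It uses the plain CFG version of the toy grammar; the name `noun_phrases` is illustrative.

```python
import nltk
from nltk import Nonterminal
from nltk.parse.generate import generate

toy_grammar = nltk.CFG.fromstring([
    'S -> NP VP',
    'NP -> Det N | Det N PP | PRON',
    'VP -> V NP | V NP PP',
    'PP -> P NP',
    'Det -> "the" | "a"',
    'N -> "man" | "telescope"',
    'PRON -> "I"',
    'V -> "saw"',
    'P -> "with"',
])

# Generate noun phrases only, with a shallow depth limit.
noun_phrases = list(generate(toy_grammar, start=Nonterminal("NP"), depth=3))
for phrase in noun_phrases:
    print(" ".join(phrase))
```

Increasing `depth` allows the recursive `NP -> Det N PP` rule to expand, so the number of generated phrases grows quickly.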


#### Exercise

- Use the grammar you defined to generate sentences.
- Experiment with the `depth` and `start` parameters.

In [99]:
for sent in generate(es_grammar, n=10):
    print(sent)

['the', 'movies', 'show', 'the', 'movies']
['the', 'movies', 'show', 'the', 'italy']
['the', 'movies', 'show', 'the', 'italian']
['the', 'movies', 'show', 'movies']
['the', 'movies', 'show', 'italy']
['the', 'movies', 'show', 'italian']
['the', 'movies', 'show', 'the', 'movies', 'in', 'the', 'movies']
['the', 'movies', 'show', 'the', 'movies', 'in', 'the', 'italy']
['the', 'movies', 'show', 'the', 'movies', 'in', 'the', 'italian']
['the', 'movies', 'show', 'the', 'movies', 'in', 'movies']
