## Part 3. Beyond Toy Grammars

The grammars we have worked with so far are tiny. To get a feel for something more realistic, let us train a probabilistic context-free grammar (PCFG) on the part of the Penn Treebank corpus that is distributed with NLTK:

In [None]:
# Import the NLTK library with the Penn treebank data
import sys
!{sys.executable} -m pip install nltk
import nltk
nltk.download('treebank')

# Train a PCFG on data from a real treebank (this may take a while to run)
productions = []
for tree in nltk.corpus.treebank.parsed_sents():
    productions += tree.productions()
S = nltk.grammar.Nonterminal('S')
grammar = nltk.induce_pcfg(S, productions)

You might be curious about what kind of sentences the training data for our parser contains. Print some of the treebank trees by changing the sentence ID `sent_id` in the cell below. To draw a tree, copy-paste the bracketed output into the syntax tree generator at http://mshang.ca/syntree/.

In [None]:
sent_id = 0  # change this to another integer

print(nltk.corpus.treebank.parsed_sents()[sent_id].pformat(parens='[]'))

Next, parse some sentences of your own choice with the parser that you have just trained. Note that every word you use must be in the vocabulary of the parser, that is, it must occur in the training data. Otherwise an error message will be displayed.
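To find out which words fall outside the vocabulary before parsing, one option is to collect the terminals of the grammar and compare them with the tokens of the sentence. The following is a minimal sketch under stated assumptions: `unknown_words` is a hypothetical helper (not part of NLTK), and `toy_grammar` is a small hand-written stand-in so that the example runs without the treebank. In the notebook you would pass the trained `grammar` instead.

```python
import nltk

def unknown_words(grammar, tokens):
    """Return the tokens that never occur as a terminal in the grammar.

    Hypothetical helper for illustration; terminals in an NLTK grammar
    are plain strings, while non-terminals are Nonterminal objects.
    """
    vocabulary = {
        symbol
        for production in grammar.productions()
        for symbol in production.rhs()
        if isinstance(symbol, str)
    }
    return [token for token in tokens if token not in vocabulary]

# Toy stand-in for the trained grammar (an assumption for this example)
toy_grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> 'we' [0.5] | 'lunch' [0.5]
    VP -> V NP [1.0]
    V -> 'eat' [1.0]
""")

print(unknown_words(toy_grammar, "we eat lunch".split()))
print(unknown_words(toy_grammar, "we eat dinner".split()))
```

The comparison is case-sensitive, so a word that only occurs capitalised in the training data will be reported as unknown when written in lower case.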

In [None]:
sent = 'let us meet tomorrow after lunch .'  # change this to another sentence

# Parse the sentence (this might take a while)
viterbi_parser = nltk.ViterbiParser(grammar)
trees = viterbi_parser.parse(sent.split())
for tree in trees:
    print(tree.pformat(parens='[]'))

Copy-paste some of the parses into the syntax tree generator and look at the pictures. How well do you think the parser works?
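When judging a parse, it can also help to look at the probability the parser assigns to it. The sketch below assumes a small hand-written grammar (`toy_grammar`, not the treebank grammar) so that the numbers are easy to check by hand. The probability of a tree is the product of the probabilities of the rules used in it.

```python
import nltk

# Toy PCFG (an assumption for this example, not the treebank grammar)
toy_grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> 'we' [0.5] | 'lunch' [0.5]
    VP -> V NP [1.0]
    V -> 'eat' [1.0]
""")

toy_parser = nltk.ViterbiParser(toy_grammar)
toy_trees = list(toy_parser.parse("we eat lunch".split()))

for tree in toy_trees:
    # Rules used: S (1.0), NP -> 'we' (0.5), VP (1.0), V (1.0), NP -> 'lunch' (0.5)
    print(tree.pformat(parens='[]'))
    print("probability:", tree.prob())
```

With the treebank grammar the probabilities are typically very small, because a realistic parse multiplies together many rule probabilities that are each well below 1.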

Next, investigate the contents of the grammar. Print all productions (rules) that have a specific non-terminal on the left-hand side of the rule. Change the value of the non-terminal to a few different values. (You can see what kind of non-terminals exist in the grammar by looking at the parse trees that you have just produced.)

Note that each production has a probability, because this is a _probabilistic_ context-free grammar.
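As a rough illustration of where these probabilities come from: `nltk.induce_pcfg` estimates the probability of a rule as the number of times the rule occurs divided by the number of times its left-hand side occurs. The sketch below redoes that calculation in plain Python on a handful of made-up productions. The data in `observed` and the helper `relative_frequencies` are hypothetical and only serve to show the arithmetic.

```python
from collections import Counter

# Made-up productions as (left-hand side, right-hand side) pairs,
# standing in for the rules collected from the treebank trees
observed = [
    ("NP", ("DT", "NN")),
    ("NP", ("DT", "NN")),
    ("NP", ("PRP",)),
    ("PRP", ("it",)),
    ("PRP", ("they",)),
    ("PRP", ("it",)),
    ("PRP", ("it",)),
]

def relative_frequencies(productions):
    """Estimate P(lhs -> rhs) as count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(productions)
    lhs_counts = Counter(lhs for lhs, _ in productions)
    return {
        rule: count / lhs_counts[rule[0]]
        for rule, count in rule_counts.items()
    }

probs = relative_frequencies(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)} [{p:.2f}]")
```

Because every count is divided by the total for its left-hand side, the probabilities of all rules that share a left-hand side sum to 1.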

In [None]:
non_terminal_symbol = "PRP"  # change this to another non-terminal, e.g. "NP" or "VP"

prods = [p for p in grammar.productions() if str(p.lhs()) == non_terminal_symbol]
for (i, p) in enumerate(prods):
    print("#{:d}:".format(i), p)

When you are done with this part you can continue to the home assignment.
