# Question 3: Building and Evaluating a Simple PCFG Parser

In this question, we construct a Viterbi parser for the PCFG induced in Question 2 and evaluate this statistical parser.

## Question 3.1: Build a Parser
First, we reuse the `pcfg_cnf_learn` function from Q2, together with the helper functions it depends on:



In [4]:
import math
import nltk
from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader
from nltk.grammar import Production, ProbabilisticProduction, PCFG
from nltk import Tree, Nonterminal
import matplotlib.pyplot as plt



def simplify_functional_tag(tag):
    if tag == "-NONE-":
        return tag
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg')


def get_tag(tree):
    if isinstance(tree, Tree):
        return Nonterminal(simplify_functional_tag(tree.label()))
    else:
        return tree

def tree_to_production(tree):
    return Production(get_tag(tree), [get_tag(child) for child in tree])

def tree_to_productions(tree):
    yield tree_to_production(tree)
    for child in tree:
        if isinstance(child, Tree):
            for prod in tree_to_productions(child):
                yield prod
                
                
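def pcfg_cnf_learn(treebank, n):
    """Learn a PCFG from the first n treebank trees after CNF conversion."""
    trees = treebank.parsed_sents()[:n]
    pcount = {}  # production -> number of occurrences
    lcount = {}  # left-hand side -> number of occurrences
    for s in trees:
        # Binarize the tree in place (right-factored, with Markov annotations).
        nltk.treetransforms.chomsky_normal_form(s, factor='right', horzMarkov=1, vertMarkov=1, childChar='|',
                                                parentChar='^')
        curr = tree_to_productions(s)
        for prod in curr:
            # Skip productions that mention the empty-element tag -NONE-.
            if not ("-NONE-" in str(prod.lhs()) or "-NONE-" in str(prod.rhs())):
                lcount[prod.lhs()] = lcount.get(prod.lhs(), 0) + 1
                pcount[prod] = pcount.get(prod, 0) + 1
    # Maximum-likelihood estimate: count(lhs -> rhs) / count(lhs).
    prods = [
        ProbabilisticProduction(p.lhs(), p.rhs(), prob=pcount[p] / lcount[p.lhs()])
        for p in pcount
    ]

    return PCFG(Nonterminal("S"), prods)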
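The probability that `pcfg_cnf_learn` assigns to each production is a relative frequency: the number of times the rule was seen, divided by the number of times its left-hand side was seen. The sketch below reproduces that estimate on a handful of made-up rule observations, using plain tuples instead of NLTK `Production` objects. The rule list and the function name `estimate_rule_probs` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical toy observations, written as (lhs, rhs) pairs.
observed = [
    ("S", ("NP", "VP")),
    ("S", ("NP", "VP")),
    ("NP", ("PRP",)),
    ("NP", ("PRP",)),
    ("NP", ("PRP",)),
    ("NP", ("DT", "NN")),
    ("VP", ("VB", "NP")),
]

def estimate_rule_probs(rules):
    """Return {rule: count(rule) / count(lhs of rule)}."""
    pcount = Counter(rules)                    # how often each rule occurs
    lcount = Counter(lhs for lhs, _ in rules)  # how often each LHS occurs
    return {rule: pcount[rule] / lcount[rule[0]] for rule in pcount}

probs = estimate_rule_probs(observed)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p}]")
```

With this estimate, the probabilities of all rules that share a left-hand side sum to 1, which is the condition a PCFG has to satisfy.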



### 3.1.1
Now, we split the NLTK treebank corpus into an 80% training set and a 20% test set.

In [5]:
n_sents = len(treebank.parsed_sents())
train_index = math.floor(n_sents * 0.8)
print("training set size: " + str(train_index))
print("test set size: " + str(n_sents - train_index))

training set size: 3131
test set size: 783
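As a quick check of the arithmetic, the two sizes printed above imply a corpus of 3131 + 783 = 3914 parsed sentences. The sketch below redoes the split on a list of integer indices standing in for the trees. The total of 3914 is inferred from the printed sizes, and the helper name `split_index` is hypothetical.

```python
import math

def split_index(n_items, train_fraction=0.8):
    """Index of the first test item when the leading fraction is used for training."""
    return math.floor(n_items * train_fraction)

n_sentences = 3914                   # inferred from the sizes printed above
indices = list(range(n_sentences))   # stand-ins for the parsed sentences

train_index = split_index(n_sentences)
train_ids = indices[:train_index]    # first 80%: used for learning the grammar
test_ids = indices[train_index:]     # remaining 20%: held out for evaluation

print(len(train_ids), len(test_ids))
```

Because `pcfg_cnf_learn` trains on the first `n` trees, the held-out portion would presumably be the slice starting at `train_index`.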


### 3.1.2
Now, we learn a PCFG from the Chomsky Normal Form version of the training set (the first `train_index` trees of the treebank).

In [6]:
g = pcfg_cnf_learn(treebank, train_index)

### 3.1.3
We construct a `ViterbiParser` from `g`, the learned PCFG.

In [7]:
vp = nltk.parse.viterbi.ViterbiParser(g)
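A Viterbi parser returns the single most probable parse, scored as the product of the probabilities of the rules used in the tree. The sketch below illustrates that idea with a small probabilistic CKY chart over a hypothetical toy grammar whose probabilities are made up. It handles only binary and lexical rules, and it is not NLTK's implementation.

```python
from collections import defaultdict

# Hypothetical toy grammar; all probabilities are invented for illustration.
lexical = {  # word -> list of (label, probability)
    "he": [("NP", 0.5)],
    "will": [("MD", 1.0)],
    "join": [("VB", 1.0)],
    "the": [("DT", 1.0)],
    "board": [("NN", 1.0)],
}
binary = [  # (parent, left child, right child, probability)
    ("S", "NP", "VP", 1.0),
    ("VP", "MD", "VP", 0.4),
    ("VP", "VB", "NP", 0.6),
    ("NP", "DT", "NN", 0.5),
]

def viterbi_cky(words, start="S"):
    """Return (probability, tree) of the best parse, or None if there is none."""
    n = len(words)
    best = defaultdict(dict)  # best[(i, j)][label] = (prob, tree) for words[i:j]
    # Fill in the spans of width 1 from the lexical rules.
    for i, word in enumerate(words):
        for label, p in lexical.get(word, []):
            best[(i, i + 1)][label] = (p, (label, word))
    # Combine smaller spans into larger ones, keeping the best score per label.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):  # split point
                for parent, left, right, p in binary:
                    if left in best[(i, k)] and right in best[(k, j)]:
                        lp, ltree = best[(i, k)][left]
                        rp, rtree = best[(k, j)][right]
                        cand = p * lp * rp
                        if cand > best[(i, j)].get(parent, (0.0, None))[0]:
                            best[(i, j)][parent] = (cand, (parent, ltree, rtree))
    return best[(0, n)].get(start)

prob, tree = viterbi_cky("he will join the board".split())
print(prob)
print(tree)
```

For this toy grammar the best parse has probability of about 0.06, that is, 1.0 × 0.5 × 0.4 × 0.6 × 0.5. A grammar learned from the treebank spreads probability over far more rules, so its parse probabilities are much smaller.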


Now, let's test the parser on a short sentence:

In [19]:
t = vp.parse("he will join the board".split())
for st in t:
    print(st)

(S
  (NP (PRP he))
  (VP^<S
    (MD will)
    (VP^<VP> (VB join) (NP^<VP> (DT the) (NN board))))) (p=1.02639e-12)
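The label `VP^<S` in the parse above has no closing `>`. One possible explanation, offered here as an assumption and not a confirmed diagnosis, is that `simplify_functional_tag` runs after parent annotation, so a label whose parent carries a functional tag is cut at the first hyphen. The sketch below shows that behaviour on a label of that form and adds a hypothetical variant, `simplify_parent_annotated`, that simplifies the base label and the parent label separately.

```python
def simplify_functional_tag(tag):
    # Same logic as the helper defined earlier in this question.
    if tag == "-NONE-":
        return tag
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

def simplify_parent_annotated(label, parent_char="^"):
    """Hypothetical variant: strip functional tags on both sides of '^<...>'.

    It only handles the parent annotation; hyphens inside '|<...>'
    sibling annotations are not treated specially.
    """
    base, sep, parent = label.partition(parent_char + "<")
    if not sep:  # no parent annotation present
        return simplify_functional_tag(label)
    parent = parent.rstrip(">")
    return f"{simplify_functional_tag(base)}{parent_char}<{simplify_functional_tag(parent)}>"

annotated = "VP^<S-TPC-1>"  # illustrative label: VP whose parent is S-TPC-1
print(simplify_functional_tag(annotated))    # VP^<S
print(simplify_parent_annotated(annotated))  # VP^<S>
```

If this is what happens, labels such as `NP-SBJ^<S>` would also lose their parent annotation entirely, which would fit the unannotated `NP` in the parse above.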
