In [None]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs236299-2020/lab3-3.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")

In [None]:
# Initialize Otter
import otter
grader = otter.Notebook()

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$

# Course 236299
## Lab 3-3 - Probabilistic context-free grammars

In previous labs, you have practiced constituency parsing using context-free grammars with the CKY parsing algorithm. In this lab you will extend this framework to a probabilistic one, probabilistic context-free grammars (PCFG).

New bits of Python used for the first time in the _solution set_ for this lab, and which you may therefore find useful:

* [`math.prod`](https://docs.python.org/3/library/math.html#math.prod)
* [`nltk.tree.Tree.productions`](https://www.nltk.org/api/nltk.html?highlight=production#nltk.tree.Tree.productions)

## Preparations

In [None]:
import copy
import math
import nltk
import pandas as pd

from collections import Counter, defaultdict
from pprint import pprint

## Syntactic ambiguity

Let's start with the following grammar for arithmetic word expressions:

In [None]:
arithmetic_grammar = nltk.CFG.fromstring("""
    S -> NUM | NUM OP S | S OP NUM
    NUM -> 'one' | 'two' | 'three' | 'four' | 'five' 
    NUM -> 'six' | 'seven' | 'eight' | 'nine' | 'ten' 
    OP -> ADD | SUB | MULT | DIV
    ADD -> 'plus' | 'added' 'to'
    SUB -> 'minus' 
    MULT -> 'times' | 'multiplied' 'by'
    DIV -> 'divided' 'by'
""")

>   It might have been more natural to have the rules
>   ```
>   S -> NUM | S OP S
>   ```
>   but we have purposefully introduced an ambiguity into the grammar via these more restricted rules above so that we can disambiguate later by rule weightings.

We can use the given CFG to parse the phrase "two plus three times four" and print the possible parse trees.

In [None]:
parser = nltk.parse.BottomUpChartParser(arithmetic_grammar)
phrase = "two plus three times four"
parses = list(parser.parse(phrase.split()))

for i, tree in enumerate(parses):
  print(f"Possible parse {i+1}:\n")
  tree.pretty_print()

In this example, every parse tree represents an arithmetic expression. Manually calculate the value of the expression denoted by each of the parse trees.

<!--
BEGIN QUESTION
name: parsed_equation_result
-->

In [None]:
#TODO
result_tree1 = ...
result_tree2 = ...
result_tree3 = ...
result_tree4 = ...

In [None]:
grader.check("parsed_equation_result")

As you can see, we got four different parse trees, yielding different numerical results. It is interesting to notice that some of the parse trees (the pairs (1, 2) and (3, 4)) have different structures but the same "meaning" (same denoted value), while for other trees different structures induce different meanings (for example the pair (1, 3)).

When a single string admits several structural analyses, we say that it exhibits structural ambiguity. Since natural language is often ambiguous, this is a very realistic concern. One approach to this issue is to define a scoring system for the possible parses and choose the highest-scoring tree. We will see how this can be done by taking a probabilistic approach to CFGs.

## Probabilistic context-free grammars

To assign probabilities to strings, we will use a probabilistic context-free grammar (PCFG), a CFG in which each rule is augmented with a probability.

Formally, a PCFG has the same components as a CFG, except that each rule carries a probability:

* $\cal{N}$ – a set of nonterminal symbols
* $\Sigma$ – a set of terminal symbols
* $\cal{R}$ – a set of rules or productions, each of the form $A \rightarrow \beta\ [p]$,
where $A$ is a nonterminal, $\beta$ is a string of terminal or nonterminal symbols,
and $p$ is a number between 0 and 1 expressing $\Prob(\beta \given A)$
* $S$ – a designated start symbol

Note that to constitute a valid probability distribution we require that $\sum_\beta \Prob(\beta \given A) =1$, that is, the probabilities associated with all rules with the same left-hand side must sum to one.
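
The normalization requirement can be checked mechanically. The following is a minimal sketch, assuming a hypothetical toy rule table `toy_rules` that maps `(lhs, rhs)` pairs to probabilities; neither the table nor the helper names `lhs_totals` and `is_normalized` are part of NLTK or of this lab.

```python
import math
from collections import defaultdict

# Hypothetical toy rule table (not from the lab): (lhs, rhs) -> probability.
toy_rules = {
    ("NP", ("DET", "N")): 0.7,
    ("NP", ("N",)): 0.3,
    ("DET", ("'the'",)): 0.6,
    ("DET", ("'a'",)): 0.4,
    ("N", ("'dog'",)): 1.0,
}

def lhs_totals(rules):
    """Sum the rule probabilities separately for each left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), prob in rules.items():
        totals[lhs] += prob
    return dict(totals)

def is_normalized(rules):
    """True if every left-hand side's probabilities sum to one (up to rounding)."""
    return all(math.isclose(total, 1.0) for total in lhs_totals(rules).values())

print(lhs_totals(toy_rules))
print(is_normalized(toy_rules))
```

Comparing with `math.isclose` rather than `==` avoids spurious failures caused by floating-point rounding when several probabilities are added.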

Take a look at the following PCFG based on the arithmetic grammar above:

In [None]:
probabilistic_arithmetic_grammar = nltk.PCFG.fromstring("""
    S -> NUM [0.35] | NUM OP S [0.4] | S OP NUM [0.25] 
    OP -> ADD [0.4] | SUB [0.2] | MULT [0.3] | DIV [0.1]
    NUM -> 'one' [0.1] | 'two' [0.1] | 'three' [0.1] | 'four' [0.1] | 'five' [0.1]
    NUM -> 'six' [0.1] | 'seven' [0.1] | 'eight' [0.1] | 'nine' [0.1] | 'ten' [0.1]
    ADD -> 'plus' [0.8] | 'added' 'to' [0.2]
    SUB -> 'minus' [1.0]
    MULT -> 'times' [0.9] | 'multiplied' 'by' [0.1]
    DIV -> 'divided' 'by' [1.0]
""")

We can use the [nltk.CFG.productions()](https://www.nltk.org/api/nltk.html?highlight=production#nltk.grammar.CFG.productions) method to get a list of the PCFG's productions:

In [None]:
probabilistic_arithmetic_grammar.productions()

Each of the productions in the list is an instance of the [ProbabilisticProduction](https://www.nltk.org/api/nltk.html?highlight=production#nltk.grammar.ProbabilisticProduction) class. Each such instance is defined by three attributes: its left-hand side (`lhs`), right-hand side (`rhs`), and rule probability (`prob`). Each can be retrieved by calling the method of the same name:

In [None]:
pprod_example = probabilistic_arithmetic_grammar.productions()[1]
print(f'For the production "{pprod_example}":\n' 
      f'left hand side of the rule is {pprod_example.lhs()}\n'
      f'right hand side of the rule is {pprod_example.rhs()}\n'
      f'probability of the rule is {pprod_example.prob()}')

For non-probabilistic grammars, the class of productions is [Production](https://www.nltk.org/api/nltk.html?highlight=production#nltk.grammar.Production), which doesn't have a probability attribute and is only defined by its lhs and rhs attributes:

In [None]:
print(f'PCFG production: {probabilistic_arithmetic_grammar.productions()[1]} \n'
      f'      vs.\n'
      f'CFG production:  {arithmetic_grammar.productions()[1]}') 

## Parse tree probabilities

To use a PCFG to select among parse trees, we need to be able to calculate the probability of a parse tree. The probability of a parse tree is simply the product of the probabilities of its constituents, where the probability of a constituent is the probability of the rule that expands it. A rule used several times in a tree therefore contributes its probability once per use.
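
As a small worked instance, consider the shorter phrase "two plus three" analyzed with the rule `S -> NUM OP S`. The sketch below multiplies the six rule probabilities involved, which are copied by hand from `probabilistic_arithmetic_grammar` above; the list `rule_probabilities` is an illustrative name, and the tree is not one of the parses in `parses`.

```python
import math

# Probabilities of the rules used in the tree
# (S (NUM two) (OP (ADD plus)) (S (NUM three))),
# copied by hand from the PCFG defined earlier.
rule_probabilities = [
    0.4,   # S -> NUM OP S
    0.1,   # NUM -> 'two'
    0.4,   # OP -> ADD
    0.8,   # ADD -> 'plus'
    0.35,  # S -> NUM
    0.1,   # NUM -> 'three'
]

# The tree probability is the product of all the rule probabilities.
tree_probability = math.prod(rule_probabilities)
print(f"{tree_probability:1.2e}")
```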

You'll use the PCFG `probabilistic_arithmetic_grammar` to calculate the probability of each of the parse trees in `parses` (the list of trees which were parsed from the sentence "two plus three times four"). 

To do that, you'll need to get all the productions used in a parse tree (using the [productions](https://www.nltk.org/api/nltk.html?highlight=production#nltk.tree.Tree.productions) method), find their probabilities, and multiply them together.
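
To see what `productions` returns, here is a minimal sketch on a hand-written one-word tree (the name `toy_tree` is only for illustration, and the tree is not one of the parses above).

```python
from nltk.tree import Tree

# A hand-written bracketed tree, used only for illustration.
toy_tree = Tree.fromstring("(S (NUM two))")

# One production per constituent, listed from the root downward.
for production in toy_tree.productions():
    print(production, "| lhs:", production.lhs(), "| rhs:", production.rhs())
```

Note that the productions read off a tree are plain `Production` objects: they carry a left-hand side and a right-hand side, but no probability.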

First, we will create a dictionary from the PCFG, so that we can easily access the rule probabilities. Write a function which accepts a PCFG and returns a dictionary whose keys are the CFG (not PCFG) grammar rules and values are the associated probabilities. 
<!--
BEGIN QUESTION
name: pcfg_to_dict
-->

In [None]:
#TODO - returns a dictionary whose keys are `nltk.grammar.Production` objects
#       and whose values are the associated probabilities
def pcfg_to_dict(pcfg):
  ...

In [None]:
grader.check("pcfg_to_dict")

We can use the function you wrote to convert `probabilistic_arithmetic_grammar` to a dictionary and inspect it to make sure it's working.

In [None]:
pprint(pcfg_to_dict(probabilistic_arithmetic_grammar))

Now for the payoff: Write a function that takes a parse tree and a PCFG and returns the probability of the parse tree according to the PCFG. The `pcfg_to_dict` function you just wrote is likely to come in handy.

> Note that we are asking for the probability (not the log probability). For simplicity, we **don't work in log space** in this lab, but for parse trees of longer sentences (which we'll see in the project) we might have to work in log space to avoid underflow.
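
The underflow concern can be illustrated with a minimal sketch, assuming a hypothetical tree that uses 400 rules of probability 0.1 each (far larger than any tree in this lab). The direct product falls below the smallest representable positive float, while the sum of log probabilities remains an ordinary finite number.

```python
import math

# 400 hypothetical rule probabilities of 0.1 each, as in a very large tree.
probs = [0.1] * 400

# Multiplying directly: the running product eventually underflows to zero.
direct_product = math.prod(probs)

# Summing logarithms instead keeps the value in a representable range.
log_probability = sum(math.log(p) for p in probs)

print(direct_product)
print(log_probability)
```

Because the logarithm is monotonic, comparing log probabilities ranks parse trees in the same order as comparing the probabilities themselves.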

<!--
BEGIN QUESTION
name: parsed_trees_probs
-->

In [None]:
# TODO: returns the probability of the parse tree.
# `tree.productions()` might be useful for getting the
# productions of a parse tree
def parse_probability(tree, pcfg):
    ...

In [None]:
grader.check("parsed_trees_probs")

We'll use it to calculate and print out the probability of each parse tree.

In [None]:
for i, tree in enumerate(parses):
    print(f'Probability of parsed tree {i+1} is '
          f'{parse_probability(tree, probabilistic_arithmetic_grammar):1.2e}')

<!-- BEGIN QUESTION -->

---
**Question:** Which of the trees is the most probable parse? Are there trees that got the same score (probability)? Do they have the same arithmetic result? Explain.

<!--
BEGIN QUESTION
name: open_response_ambiguity
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



---
## Estimating rule probabilities from a corpus

In the previous section, you received a CFG already augmented with rule probabilities. But where do rule probabilities come from? One way to generate rule probabilities is to learn them from a training corpus.

In this section you will use a toy corpus of parsed sentences to generate maximum likelihood estimates of rule probabilities by counting the number of occurrences of a rule used in the corpus.

In [None]:
# The raw corpus, before splitting into separate phrases
corpus_raw = """
    (S (S (NUM two) (OP plus) (S (NUM six))) (OP times) (NUM one))
    (S (S (NUM eight) (OP minus) (S (NUM three))) (OP plus) (NUM seven))
    (S (NUM two) (OP plus) (S (S (NUM three)) (OP times) (NUM four))) 
    (S (S (NUM eight) (OP divided by) (S (NUM four))) (OP times) (NUM two))
    (S (S (NUM five) (OP divided by) (S (NUM two))) (OP plus) (NUM one))
    (S (NUM five) (OP minus) (S (NUM one) (OP times) (S (NUM four)))) 
    (S (S (NUM two) (OP times) (S (NUM three))) (OP plus) (NUM four))
    (S (S (S (NUM ten)) (OP minus) (NUM two)) (OP times) (NUM three))
"""

def corpus_from_string(raw):
  """Return a corpus as a list of sentences.
  
  The `raw` corpus is split at newlines, trimmed of whitespace, and blank 
  lines eliminated.
  """
  stripped = (line.strip() for line in raw.split('\n'))
  return [line for line in stripped if line]

# The processed corpus we'll use
corpus = corpus_from_string(corpus_raw)

Recall that for the rule probabilities to define a valid probability distribution, the following needs to hold:

$$\sum_\beta p(\beta \given A) =1$$

This means that after counting rule occurrences we will need to normalize the counts as

\begin{align}
p(\beta \given A) 
  &= \frac{\cnt{A \to \beta}}{\sum_{\beta'} \cnt{A \to \beta'}} \\
  &= \frac{\cnt{A \to \beta}}{\cnt{A}}
\end{align}
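
The normalization can be illustrated on hand-written data. This is a minimal sketch, assuming a hypothetical list `observed` of `(lhs, rhs)` pairs, one per rule use; it uses plain tuples rather than NLTK productions and is unrelated to the corpus above.

```python
from collections import Counter

# Hypothetical observed rule uses, one (lhs, rhs) pair per constituent.
observed = [
    ("NP", ("DET", "N")),
    ("NP", ("DET", "N")),
    ("NP", ("N",)),
    ("DET", ("the",)),
    ("DET", ("the",)),
    ("DET", ("a",)),
]

# Count of each rule, and count of each left-hand side.
rule_counts = Counter(observed)
lhs_counts = Counter(lhs for lhs, _rhs in observed)

# Relative frequency: count of the rule divided by count of its left-hand side.
estimates = {
    rule: count / lhs_counts[rule[0]]
    for rule, count in rule_counts.items()
}

for rule, prob in estimates.items():
    print(rule, round(prob, 3))
```

Here `NP` is expanded three times, twice as `DET N`, so that rule is estimated at 2/3 and the estimates for each left-hand side sum to one by construction.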

We will define three functions: 

1. `rule_counter` - accepts a list of sentences and returns a dictionary of rule counts (where the key is the production defined by the lhs and rhs and the value is the number of rule occurrences)
2. `lhs_counter` - accepts a list of sentences and returns a dictionary of lhs counts (where the key is the lhs nonterminal and the value is the count of that nonterminal's occurrences as a lhs)
3. `rule_probs` - accepts a list of sentences and returns a dictionary of rule probabilities (where the key is the production and the value is the rule probability).

Implement these functions as specified above.

<!--
BEGIN QUESTION
name: probs_from_corpus
-->

In [None]:
#TODO 
def rule_counter(sentence_list):
  ...

#TODO
def lhs_counter(sentence_list):
  ...

#TODO
def rule_probs(sentence_list):
  ...

In [None]:
grader.check("probs_from_corpus")

Now we can use the `rule_probs` function you wrote to get the rule probabilities from our corpus:

In [None]:
probs_from_corpus = rule_probs(corpus)
probs_from_corpus

Observe that the probabilities of the two rules `S -> NUM OP S` and `S -> S OP NUM` are equal. **Modify** the `corpus` defined above so that the rule `S -> NUM OP S` gets a higher probability than `S -> S OP NUM`. Call your new version `corpus_new`. Remember that `corpus_new` should be a list of strings, not a single string (like `corpus_raw` above).

<!--
BEGIN QUESTION
name: change_corpus
-->

In [None]:
corpus_new = ...

In [None]:
grader.check("change_corpus")

<!-- BEGIN QUESTION -->

---
**Question:** The example that we provided of an ambiguity introduced by multiple productions and disambiguated by their probabilities – the multiple rules for arithmetic expressions – is admittedly quite artificial. Can you think of other (more natural) examples, in natural language or elsewhere, where this phenomenon might occur?

<!--
BEGIN QUESTION
name: open_response_other_examples
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

---

## Lab debrief – for consensus submission only

**Question:** We're interested in any thoughts your group has about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# End of Lab 3-3

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()