# Probabilistic Context-Free Grammars

- 📺 **Video:** [https://youtu.be/q3dLP9YQLPA](https://youtu.be/q3dLP9YQLPA)

## Overview
- Extend CFGs with probabilities to prefer more plausible parses.
- Normalize rule probabilities per left-hand side to define a generative model.
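
The normalization constraint can be checked mechanically. The following is a minimal sketch, assuming rules are stored as a dictionary keyed by `(lhs, rhs)`; the helper name `lhs_totals` is hypothetical and the numbers are illustrative.

```python
from collections import defaultdict
import math

# Toy grammar (illustrative numbers): (lhs, rhs) -> probability.
rules = {
    ('S', ('NP', 'VP')): 1.0,
    ('VP', ('V', 'NP')): 0.4,
    ('VP', ('VP', 'PP')): 0.3,
    ('VP', ('V', 'PP')): 0.3,
    ('NP', ('Det', 'N')): 0.6,
    ('NP', ('NP', 'PP')): 0.4,
    ('PP', ('P', 'NP')): 1.0,
}

def lhs_totals(rules):
    """Sum rule probabilities separately for each left-hand side."""
    totals = defaultdict(float)
    for (lhs, _rhs), p in rules.items():
        totals[lhs] += p
    return dict(totals)

totals = lhs_totals(rules)
for lhs, total in totals.items():
    # isclose guards against floating-point rounding in the sums.
    status = 'ok' if math.isclose(total, 1.0) else 'NOT normalized'
    print(f'{lhs}: {total:.2f} ({status})')
```

If any left-hand side fails the check, the grammar no longer defines a proper conditional distribution over expansions of that nonterminal.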

## Key ideas
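- **Rule probabilities:** the probabilities of all productions that share a left-hand-side nonterminal sum to 1.
- **Parse probability:** the product of the probabilities of every rule used in the parse tree.
- **Inside algorithm:** a dynamic program that sums the probabilities of all subtrees a nonterminal can derive over a span.
- **Maximum likelihood:** estimate rule probabilities from a treebank by relative frequency, count(A → α) / count(A).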
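
Maximum-likelihood estimation from a treebank reduces to counting: each rule's probability is its count divided by the count of its left-hand side. The sketch below assumes trees are nested tuples `(label, child, ...)` with words as plain strings; the two-tree treebank and the helper names `extract_rules` and `estimate_pcfg` are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical two-tree treebank; words are plain strings at the leaves.
treebank = [
    ('S', ('NP', ('Det', 'the'), ('N', 'cat')),
          ('VP', ('V', 'sat'),
                 ('PP', ('P', 'on'), ('NP', ('Det', 'the'), ('N', 'mat'))))),
    ('S', ('NP', ('Det', 'the'), ('N', 'cat')),
          ('VP', ('V', 'saw'), ('NP', ('Det', 'the'), ('N', 'mat')))),
]

def extract_rules(tree):
    """Yield (lhs, rhs) for every production used in a tree."""
    lhs, children = tree[0], tree[1:]
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield lhs, rhs
    for child in children:
        if isinstance(child, tuple):
            yield from extract_rules(child)

def estimate_pcfg(trees):
    """Relative-frequency estimate: count(lhs -> rhs) / count(lhs)."""
    rule_counts = Counter(r for t in trees for r in extract_rules(t))
    lhs_counts = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}

probs = estimate_pcfg(treebank)
for (lhs, rhs), p in sorted(probs.items()):
    print(f"{lhs} -> {' '.join(rhs)}: {p:.2f}")
```

Because every count is divided by the total for its own left-hand side, the estimated probabilities are normalized per nonterminal by construction. Rules never seen in the treebank get probability zero, which is why practical systems usually add smoothing.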

## Demo
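Compute the probabilities of two candidate parse trees under a toy PCFG, echoing the lecture (https://youtu.be/ieW08zDCcmE). `parse1` uses `VP -> V PP` and spans "the cat sat on the mat". `parse2` uses `VP -> VP PP` over an inner `VP -> V NP`, so it spans the longer string "the cat sat the mat on the mat" rather than the same sentence as `parse1`. The comparison therefore shows how a tree's probability is computed; a true ambiguity would score two trees over one string.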

```python
# Binary rules of the toy PCFG: (lhs, rhs) -> probability.
# Probabilities that share a left-hand side sum to 1.
rules = {
    ('S', ('NP', 'VP')): 1.0,
    ('VP', ('V', 'NP')): 0.4,
    ('VP', ('VP', 'PP')): 0.3,
    ('VP', ('V', 'PP')): 0.3,
    ('NP', ('Det', 'N')): 0.6,
    ('NP', ('NP', 'PP')): 0.4,
    ('PP', ('P', 'NP')): 1.0
}
# Lexical rules: (tag, word) -> probability.
lexicon = {
    ('Det', 'the'): 1.0,
    ('N', 'cat'): 0.5,
    ('N', 'mat'): 0.5,
    ('V', 'sat'): 1.0,
    ('P', 'on'): 1.0
}

# Tree 1: VP -> V PP (yield: "the cat sat on the mat").
parse1 = [
    ('S', ('NP', ('Det', 'the'), ('N', 'cat')), ('VP', ('V', 'sat'), ('PP', ('P', 'on'), ('NP', ('Det', 'the'), ('N', 'mat')))))
]

# Tree 2: VP -> VP PP over an inner VP -> V NP
# (yield: "the cat sat the mat on the mat").
parse2 = [
    ('S', ('NP', ('Det', 'the'), ('N', 'cat')), ('VP', ('VP', ('V', 'sat'), ('NP', ('Det', 'the'), ('N', 'mat'))), ('PP', ('P', 'on'), ('NP', ('Det', 'the'), ('N', 'mat')))))
]

def parse_probability(tree):
    """Return the product of the probabilities of all rules used in `tree`."""
    # Preterminal node (tag, word): look up the lexical probability.
    if isinstance(tree, tuple) and len(tree) == 2 and isinstance(tree[1], str):
        return lexicon[(tree[0], tree[1])]
    # Internal node: probability of this rule times the children's probabilities.
    lhs = tree[0]
    children = tree[1:]
    rhs_symbols = tuple(child[0] if isinstance(child, tuple) else child for child in children)
    prob = rules[(lhs, rhs_symbols)]
    for child in children:
        prob *= parse_probability(child)
    return prob

p1 = parse_probability(parse1[0])
p2 = parse_probability(parse2[0])
print('Parse 1 probability:', p1)
print('Parse 2 probability:', p2)
print('Preferred parse:', 'parse1' if p1 > p2 else 'parse2')
```

```
Parse 1 probability: 0.027
Parse 2 probability: 0.0032399999999999994
Preferred parse: parse1
```
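
The inside algorithm listed under Key ideas can be sketched as a CKY-style chart over the same toy grammar. This is a minimal sketch, assuming the grammar is in Chomsky normal form (binary rules plus lexical rules, as in the demo); the function name `inside_probability` and the chart layout are illustrative choices, not taken from the lecture.

```python
from collections import defaultdict

# Same toy grammar as the demo: binary rules and lexical rules.
rules = {
    ('S', ('NP', 'VP')): 1.0,
    ('VP', ('V', 'NP')): 0.4,
    ('VP', ('VP', 'PP')): 0.3,
    ('VP', ('V', 'PP')): 0.3,
    ('NP', ('Det', 'N')): 0.6,
    ('NP', ('NP', 'PP')): 0.4,
    ('PP', ('P', 'NP')): 1.0,
}
lexicon = {
    ('Det', 'the'): 1.0,
    ('N', 'cat'): 0.5,
    ('N', 'mat'): 0.5,
    ('V', 'sat'): 1.0,
    ('P', 'on'): 1.0,
}

def inside_probability(words, root='S'):
    """Sum the probabilities of all parses of `words` rooted at `root`."""
    n = len(words)
    # chart[(i, j)][A] = total probability that A derives words[i:j].
    chart = defaultdict(lambda: defaultdict(float))
    # Width-1 spans: lexical rules.
    for i, word in enumerate(words):
        for (tag, w), p in lexicon.items():
            if w == word:
                chart[(i, i + 1)][tag] += p
    # Wider spans: combine two adjacent sub-spans with a binary rule.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (lhs, (left, right)), p in rules.items():
                    left_p = chart[(i, k)].get(left, 0.0)
                    right_p = chart[(k, j)].get(right, 0.0)
                    if left_p and right_p:
                        chart[(i, j)][lhs] += p * left_p * right_p
    return chart[(0, n)].get(root, 0.0)

short = inside_probability('the cat sat on the mat'.split())
long_ = inside_probability('the cat sat the mat on the mat'.split())
print('Inside probability (6 words):', short)
print('Inside probability (8 words):', long_)
```

For the six-word sentence the chart finds a single derivation, so the inside probability matches Parse 1 (about 0.027). For the eight-word string that `parse2` spans, it sums the VP-attachment tree (about 0.00324) and an NP-attachment tree that uses `NP -> NP PP` (about 0.00432), giving about 0.00756. Replacing the sum with a max, and keeping back-pointers, would turn the same chart into a Viterbi search for the single best parse.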


## Try it
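- Modify the demo: change the rule probabilities (keeping each left-hand side normalized) and check whether the preferred parse changes.
- Add a tiny dataset or counter-example, such as a small treebank from which to re-estimate the rule probabilities.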

