# Taaltheorie en Taalverwerking · 2019 · Week 15

In this set of exercises you are asked to implement several grammars using the Python NLTK library. Provide your answers in this notebook. To save you from searching through the wrong classes, we provide some of the imports you can use to complete the assignments. A `CFG` is parsed with the `RecursiveDescentParser`, and a `FeatureGrammar` is parsed with the `FeatureEarleyChartParser`:

In [None]:
# FILL THIS IN FOR YOUR GROUP, also name your file as: tttv-w15-<group>-<name1>-<name2>.ipynb

# Group        : D
# Name - UvaID : Joshua de Roos
# Name - UvaID : Lodewijk van Keizerswaard
# Date         : 16-04-19

In [16]:
import nltk
from nltk import CFG
from nltk.grammar import FeatureGrammar
from nltk.parse import RecursiveDescentParser, FeatureEarleyChartParser

# Function that works for multiple types of parsers (You are free to use something else if you want.)
def check_sentence(parser, sentence):
    print("--------------------------------------------------")
    print("Checking if provided sentence matches the grammar:")
    print(sentence)
    if isinstance(sentence, str):
        sentence = sentence.split()
    tree_found = False
    results = parser.parse(sentence)
    for tree in results:
        tree_found = True
        print(tree)
    if not tree_found:
        print(sentence, "Does not match the provided grammar.")
    print("--------------------------------------------------")
    return tree_found

## Problem 1 (6 pts total)

Consider the following simple grammar:
<table>
<tr>
    <td>Phrase structure rules</td>
    <td>Lexicon</td>
</tr>
<tr>
    <td>S $\rightarrow$ NP VP</td>
    <td>Det $\rightarrow$ <i>the</i></td>
</tr>
<tr>
    <td>NP $\rightarrow$ Det N</td>
    <td>N $\rightarrow$ <i>journalist</i> | <i>detective</i></td>
</tr>
<tr>
    <td>VP $\rightarrow$ V NP</td>
    <td>V $\rightarrow$ <i>interviews</i></td>
</tr>
</table>

Implement this grammar as an NLTK `CFG` object (let's call this initial version **cfg\_1**).

In [17]:
# Finish the declaration of cfg_1
cfg_1 = CFG.fromstring("""
  S -> NP VP
  NP -> Det N
  VP -> V NP
  Det -> 'the'
  N -> 'journalist'
  N -> 'detective'
  V -> 'interviews'
""")

# Use the correct parser for cfg_1 (in case you missed it... scroll up)
cfg_1_parser = RecursiveDescentParser(cfg_1)
# The following inputs should produce the corresponding results
check_sentence(cfg_1_parser, 'the journalist interviews the detective') # True
check_sentence(cfg_1_parser, 'the the journalist interviews the detective') # False

--------------------------------------------------
Checking if provided sentence matches the grammar:
the journalist interviews the detective
(S
  (NP (Det the) (N journalist))
  (VP (V interviews) (NP (Det the) (N detective))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
the the journalist interviews the detective
['the', 'the', 'journalist', 'interviews', 'the', 'detective'] Does not match the provided grammar.
--------------------------------------------------


False

### Question 1.1 (3 pts):
Now write a second grammar, **cfg\_2**, that extends your **cfg\_1** with the plural forms *journalists*, *detectives* and *interview* and that is able to handle number agreement between the subject NP and the verb. That is, your **cfg\_2** should accept and reject sentences such as the following: 

| Accepted                                         | Rejected                                        |
|----------------------------------------------|-----------------------------------------------|
| the detectives interview the journalist | the detective interview the journalist   |
| the journalist interviews the detective | the journalists interviews the detective |


__Hint:__ You should deal with number agreement by adding an extra argument for number to those rules that require it, not by unnecessarily duplicating rules. In particular, your **cfg\_2** must have only one sentential rule with the start symbol $S$. The normal `CFG` class of NLTK will not be able to use features, hence look at the `FeatureGrammar` class of NLTK.

Possibly useful link: http://www.nltk.org/book/ch09.html.
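
The agreement mechanism behind the hint can be illustrated without NLTK. The following is a minimal sketch in plain Python, assuming flat feature structures represented as dictionaries; the helper `unify` is a hypothetical name for illustration and is not NLTK's implementation. It shows why a shared variable such as `?n` forces the subject and the verb to carry the same `NUM` value: two structures unify only if they do not disagree on any feature.

```python
def unify(a, b):
    """Unify two flat feature dicts; return the merged dict, or None on a clash."""
    merged = dict(a)
    for feature, value in b.items():
        if feature in merged and merged[feature] != value:
            return None  # conflicting values, e.g. NUM=sg against NUM=pl
        merged[feature] = value
    return merged


subject_sg = {"NUM": "sg"}  # e.g. "the detective"
verb_sg = {"NUM": "sg"}     # e.g. "interviews"
verb_pl = {"NUM": "pl"}     # e.g. "interview"

print(unify(subject_sg, verb_sg))  # {'NUM': 'sg'}
print(unify(subject_sg, verb_pl))  # None
```

A structure that leaves a feature unspecified unifies with anything for that feature, which is why a symbol without a `NUM` annotation places no number restriction on its position.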


In [19]:
# Finish the declaration of cfg_2
cfg_2 = FeatureGrammar.fromstring("""
  S -> NP[NUM=?n] VP[NUM=?n]
  NP[NUM=?n] -> Det N[NUM=?n]
  VP[NUM=?n] -> V[NUM=?n] NP
  Det -> 'the'
  N[NUM=sg] -> 'journalist'
  N[NUM=sg] -> 'detective'
  N[NUM=pl] -> 'journalists'
  N[NUM=pl] -> 'detectives'
  V[NUM=sg] -> 'interviews'
  V[NUM=pl] -> 'interview'
""")

# Use the correct parser for cfg_2 (in case you missed it... scroll up)
cfg_2_parser = FeatureEarleyChartParser(cfg_2)
# The following inputs should produce the corresponding results
check_sentence(cfg_2_parser, 'the detectives interview the journalist')  # True
check_sentence(cfg_2_parser, 'the journalist interviews the detectives') # True
check_sentence(cfg_2_parser, 'the detective interview the journalist')   # False
check_sentence(cfg_2_parser, 'the journalists interviews the detective') # False

--------------------------------------------------
Checking if provided sentence matches the grammar:
the detectives interview the journalist
(S[]
  (NP[NUM='pl'] (Det[] the) (N[NUM='pl'] detectives))
  (VP[NUM='pl']
    (V[NUM='pl'] interview)
    (NP[NUM='sg'] (Det[] the) (N[NUM='sg'] journalist))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
the journalist interviews the detectives
(S[]
  (NP[NUM='sg'] (Det[] the) (N[NUM='sg'] journalist))
  (VP[NUM='sg']
    (V[NUM='sg'] interviews)
    (NP[NUM='pl'] (Det[] the) (N[NUM='pl'] detectives))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
the detective interview the journalist
['the', 'detective', 'interview', 'the', 'journalist'] Does not match the provided grammar.
--------------------------------------------------
------

False

### Question 1.2 (3 pts)

Now write a third version of your grammar, **cfg\_3**, that extends **cfg\_2** with the pronouns *she*, *he*, *they*, *her*, *him*, and *them*. These pronouns differ in their number (*she*, *he*, *her*, *him* are singular, while *they* and *them* are plural) and in their case (*she*, *he*, and *they* are "nominative", while *her*, *him*, and *them* are "accusative").

Nominative pronouns can only appear in subject position (as in *she interviews the investigators*), while accusative pronouns can only appear in object position (as in *the journalist interviews them*). Your **cfg\_3** must be able to handle both number agreement between the subject NP and the verb and case restrictions for the subject and the object NPs. That is, it should accept and reject sentences such as the following:

| Accepted                                         | Rejected                                        |
|----------------------------------------------|-----------------------------------------------|
| he interviews the journalists | her interviews they   |
| the detectives interview him | them interviews the detective |


__Hint:__ You need a new rule for NPs that consist of a pronoun (*NP $\rightarrow$ Pro*), to which you will need to add extra arguments. Again, do not unnecessarily duplicate rules; exploit the power of argument unification. Your grammar must again have only one sentential rule with the start symbol $S$. In case you get stuck, ask yourself: Can symbols have multiple features in NLTK?

In [43]:
# Finish the declaration of cfg_3
cfg_3 = FeatureGrammar.fromstring("""
    S -> NP[NUM=?n, CASE=nom] VP[NUM=?n]
    NP[NUM=?n] -> Det N[NUM=?n]
    NP[NUM=?n, CASE=?c] -> Pro[NUM=?n, CASE=?c]
    VP[NUM=?n] -> V[NUM=?n] NP[CASE=acc]
    Det -> 'the'
    N[NUM=sg] -> 'journalist' | 'detective'
    N[NUM=pl] -> 'journalists' | 'detectives'
    Pro[NUM=sg, CASE=nom] -> 'she' | 'he'
    Pro[NUM=pl, CASE=nom] -> 'they'
    Pro[NUM=sg, CASE=acc] -> 'her' | 'him'
    Pro[NUM=pl, CASE=acc] -> 'them'
    V[NUM=sg] -> 'interviews'
    V[NUM=pl] -> 'interview'
""")
# Use the correct parser for cfg_3 (in case you missed it... scroll up)
cfg_3_parser = FeatureEarleyChartParser(cfg_3)
# The following inputs should produce the corresponding results
check_sentence(cfg_3_parser, 'she interviews the journalist') # True
check_sentence(cfg_3_parser, 'the detectives interview him') # True
check_sentence(cfg_3_parser, 'her interviews they') # False
check_sentence(cfg_3_parser, 'them interviews the detective') # False

--------------------------------------------------
Checking if provided sentence matches the grammar:
she interviews the journalist
(S[]
  (NP[CASE='nom', NUM='sg'] (Pro[CASE='nom', NUM='sg'] she))
  (VP[NUM='sg']
    (V[NUM='sg'] interviews)
    (NP[NUM='sg'] (Det[] the) (N[NUM='sg'] journalist))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
the detectives interview him
(S[]
  (NP[NUM='pl'] (Det[] the) (N[NUM='pl'] detectives))
  (VP[NUM='pl']
    (V[NUM='pl'] interview)
    (NP[CASE='acc', NUM='sg'] (Pro[CASE='acc', NUM='sg'] him))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
her interviews they
['her', 'interviews', 'they'] Does not match the provided grammar.
--------------------------------------------------
--------------------------------------------------
Checkin

False

## Problem 2 (4 pts total)
Last week, in Homework 1, Python Exercise 3, you were asked to implement a specification of the following FSA: 
![Second FSM should be here, else add the image in the root of your jupyter notebook](fsm2.png)

In class we have seen that for every FSA, there is not only an equivalent regular expression, but also an equivalent right-linear grammar (a grammar of type 3 in the Chomsky hierarchy). 

### Question 2.1

Write a right-linear grammar that is equivalent to this FSA and call it **fsa\_1**.

If your grammar is equivalent to the FSA above, you should be able to verify that strings such as "x y x x"  or "x y x y x" are part of the language while a string such as "x x y x x y" is not:


In [46]:
# Define fsa_1
fsa_1 = CFG.fromstring("""
  q_0 -> 'x' q_1
  q_0 -> 'x'
  q_1 -> 'x'
  q_1 -> 'x' q_1
  q_1 -> 'y' q_2
  q_2 -> 'x'
  q_2 -> 'x' q_1
 
""")
# Use the correct parser for fsa_1
fsa_1_parser = RecursiveDescentParser(fsa_1)
# The following inputs should produce the corresponding results
check_sentence(fsa_1_parser, 'x') # True
check_sentence(fsa_1_parser, 'x y') # False
check_sentence(fsa_1_parser, 'x y x x') # True
check_sentence(fsa_1_parser, 'x y x y x') # True
check_sentence(fsa_1_parser, 'x x y x x y') # False

--------------------------------------------------
Checking if provided sentence matches the grammar:
x
(q_0 x)
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x y
['x', 'y'] Does not match the provided grammar.
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x y x x
(q_0 x (q_1 y (q_2 x (q_1 x))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x y x y x
(q_0 x (q_1 y (q_2 x (q_1 y (q_2 x)))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x x y x x y
['x', 'x', 'y', 'x', 'x', 'y'] Does not match the provided grammar.
--------------------------------------------------


False
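
The correspondence between a right-linear grammar and an FSA can be made explicit by reading every production of **fsa\_1** as a transition. The sketch below is plain Python and does not use NLTK: each non-terminal is treated as a state, a rule `A -> 'x' B` as a transition from `A` to `B` on `x`, and a rule `A -> 'x'` as a transition into a single final state. The names `RULES`, `ACCEPT`, and `accepts` are assumptions introduced for illustration only.

```python
# Productions of fsa_1 as (left-hand side, terminal, next non-terminal or None).
# None means the rule ends the derivation, i.e. it moves to the final state.
RULES = [
    ("q_0", "x", "q_1"), ("q_0", "x", None),
    ("q_1", "x", None), ("q_1", "x", "q_1"), ("q_1", "y", "q_2"),
    ("q_2", "x", None), ("q_2", "x", "q_1"),
]
ACCEPT = "accept"  # hypothetical name for the single final state


def accepts(tokens, rules=RULES, start="q_0"):
    """Simulate the resulting NFA by tracking the set of reachable states."""
    states = {start}
    for token in tokens:
        states = {
            ACCEPT if target is None else target
            for source, symbol, target in rules
            if source in states and symbol == token
        }
    return ACCEPT in states


for sentence in ["x", "x y", "x y x x", "x y x y x", "x x y x x y"]:
    print(sentence, accepts(sentence.split()))
```

The automaton is non-deterministic because `q_0 -> 'x' q_1` and `q_0 -> 'x'` both fire on the same symbol, which is why the simulation keeps a set of states rather than a single one.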

### Question 2.2

It is possible to specify a grammar that is equivalent to this FSA and contains only two non-terminal symbols. If the grammar you have provided includes more than two non-terminals, now give a version that contains only two and call it **fsa\_2**.


In [49]:
# Define fsa_2
fsa_2 = CFG.fromstring("""
  q_0 -> 'x' q_1
  q_0 -> 'x'
  q_1 -> 'x'
  q_1 -> 'x' q_1
  q_1 -> 'y' q_0
""")
# Use the correct parser for fsa_2
fsa_2_parser = RecursiveDescentParser(fsa_2)
# The following inputs should produce the corresponding results
check_sentence(fsa_2_parser, 'x') # True
check_sentence(fsa_2_parser, 'x y') # False
check_sentence(fsa_2_parser, 'x y x x') # True
check_sentence(fsa_2_parser, 'x y x y x') # True
check_sentence(fsa_2_parser, 'x x y x x y') # False

--------------------------------------------------
Checking if provided sentence matches the grammar:
x
(q_0 x)
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x y
['x', 'y'] Does not match the provided grammar.
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x y x x
(q_0 x (q_1 y (q_0 x (q_1 x))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x y x y x
(q_0 x (q_1 y (q_0 x (q_1 y (q_0 x)))))
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
x x y x x y
['x', 'x', 'y', 'x', 'x', 'y'] Does not match the provided grammar.
--------------------------------------------------


False
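
A handful of test sentences cannot show that the two-symbol grammar generates the same language as the three-symbol one. As a rough cross-check, the sketch below (plain Python, no NLTK) enumerates every string of at most eight tokens that each grammar derives and compares the two sets. The encoding of the productions and the names `FSA_1`, `FSA_2`, and `language` are assumptions made for this illustration; agreement up to a bounded length is evidence, not a proof of equivalence.

```python
# Each production is (left-hand side, terminal, next non-terminal or None).
FSA_1 = [
    ("q_0", "x", "q_1"), ("q_0", "x", None),
    ("q_1", "x", None), ("q_1", "x", "q_1"), ("q_1", "y", "q_2"),
    ("q_2", "x", None), ("q_2", "x", "q_1"),
]
FSA_2 = [
    ("q_0", "x", "q_1"), ("q_0", "x", None),
    ("q_1", "x", None), ("q_1", "x", "q_1"), ("q_1", "y", "q_0"),
]


def language(rules, start="q_0", max_len=8):
    """Return all strings of at most max_len tokens derivable from start."""
    results = set()
    frontier = [((), start)]  # (tokens produced so far, pending non-terminal)
    while frontier:
        prefix, nonterminal = frontier.pop()
        for lhs, terminal, nxt in rules:
            if lhs != nonterminal:
                continue
            tokens = prefix + (terminal,)
            if len(tokens) > max_len:
                continue  # every rule adds one token, so this bounds the search
            if nxt is None:
                results.add(" ".join(tokens))
            else:
                frontier.append((tokens, nxt))
    return results


print(language(FSA_1) == language(FSA_2))  # True
```

The two sets coincide because `q_2` in **fsa\_1** has exactly the same right-hand sides as `q_0`, so replacing `q_2` by `q_0` does not change which strings can be derived.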

## Problem 3 (4 pts total)

Let $L = \{a^nb^{2n} \mid n \geq 0\}$ be the formal language which contains all strings of the following form: an unbroken block of $a$'s of length $n$ followed by an unbroken block of $b$'s of length $2n$, and nothing else. For example, $abb$, $aabbbb$, and $aaabbbbbb$ belong to $L$, as does the empty string.
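
The definition of $L$ can be turned directly into a membership test, which is handy as a reference when checking a grammar against more strings than the few given here. This is a minimal sketch in plain Python; the function name `in_language` is hypothetical, and it takes the sentence as a list of tokens, in the same way `check_sentence` splits its input.

```python
def in_language(tokens):
    """Return True if tokens is n 'a's followed by exactly 2n 'b's (n >= 0)."""
    n = 0
    while n < len(tokens) and tokens[n] == "a":
        n += 1  # count the leading block of a's
    rest = tokens[n:]
    return len(rest) == 2 * n and all(token == "b" for token in rest)


for sentence in ["", "a b b", "a a b b b b", "a b b b"]:
    print(repr(sentence), in_language(sentence.split()))
```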

### Question 3.1 
Write a grammar that generates this language and call it **formal\_1**.

In [61]:
# Define formal_1
formal_1 = CFG.fromstring("""
  q_0 -> 'a' q_0 'b' 'b'
  q_0 -> 'a' 'b' 'b'
  q_0 -> 
""")
# Use the correct parser for formal_1
formal_1_parser = RecursiveDescentParser(formal_1)
# The following inputs should produce the corresponding results
check_sentence(formal_1_parser, '') # True
check_sentence(formal_1_parser, 'a b b') # True
check_sentence(formal_1_parser, 'a a b b b b') # True
check_sentence(formal_1_parser, 'a b b b') # False

--------------------------------------------------
Checking if provided sentence matches the grammar:

(q_0 )
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
a b b
(q_0 a (q_0 ) b b)
(q_0 a b b)
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
a a b b b b
(q_0 a (q_0 a (q_0 ) b b) b b)
(q_0 a (q_0 a b b) b b)
--------------------------------------------------
--------------------------------------------------
Checking if provided sentence matches the grammar:
a b b b
['a', 'b', 'b', 'b'] Does not match the provided grammar.
--------------------------------------------------


False

### Question 3.2

Does your grammar correspond to a right-linear grammar? Why?

#### Answers:

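No. The first rule, `q_0 -> 'a' q_0 'b' 'b'`, has terminals after the non-terminal, so it does not match the rule skeleton `A -> x B` or `A -> x` that is required for a right-linear grammar. Rewriting the grammar cannot fix this: right-linear grammars generate exactly the regular languages, and $L$ is not regular, because recognising it requires counting an unbounded number of $a$'s.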
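
The shape test used in this answer can also be checked mechanically. The sketch below is plain Python and assumes the reading in which `x` stands for a (possibly empty) string of terminals, so a rule is right-linear exactly when a non-terminal occurs at most once and only as the last symbol. The names `FORMAL_1`, `NONTERMINALS`, and `violations` are hypothetical and exist only for this illustration.

```python
NONTERMINALS = {"q_0"}

# Productions of formal_1 as (left-hand side, list of right-hand-side symbols).
FORMAL_1 = [
    ("q_0", ["a", "q_0", "b", "b"]),
    ("q_0", ["a", "b", "b"]),
    ("q_0", []),
]


def violations(productions, nonterminals):
    """Return the rules in which a non-terminal occurs before the last position."""
    return [
        (lhs, rhs)
        for lhs, rhs in productions
        if any(symbol in nonterminals for symbol in rhs[:-1])
    ]


print(violations(FORMAL_1, NONTERMINALS))
# [('q_0', ['a', 'q_0', 'b', 'b'])]
```

Only the recursive rule is reported, which matches the answer above: the other two rules contain no non-terminal on the right-hand side at all.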