Mapped x:gc: to X -- manual intervention for a variety of case issues,
e.g. `xp`:gc: goes to ``XP``, while `Det`:gc: goes to ``Det``.

Finished annotating ch09.  In addition to annotations I changed the following:

* converted some ..ex:: displays of feature structures into program output, which reduces the total number of examples and ensures that they correspond to the input (some didn't)

* fixed inconsistent case in attribute names, e.g. there was a place where the text claimed that (24) is equivalent to (23), even though one used uppercase attributes and the other used lowercase attributes

* reworked a few passages to use simpler language, and to make slightly less use of "formalism" and "framework"; in one place changed "adopt" to "illustrate", since we're not taking a stance as we would in a theory paper; some passages had phrases that put the material in context, but perhaps not a context that the reader needs to know about, e.g.:

    There are a number of notations for representing reentrancy in
    matrix-style representations of feature structures.

* In the introduction to X-bar syntax, I moved the literature references and the notational remark about horizontal bars to the further reading section (though it still needs to be worked in there).

* Some other heavy-going stuff is flagged.





svn/trunk@7814
stevenbird committed Feb 28, 2009
1 parent c73d6ab commit 95db04f
Showing 10 changed files with 508 additions and 491 deletions.
4 changes: 3 additions & 1 deletion book/CheckList.txt
@@ -48,4 +48,6 @@ ch08 typography should no longer use NP
ch08 section 8.6 on grammar development is incomplete (incl PE08 discussion)
ch08 assumes knowledge of "head" (did some content disappear?)
ch09 lacks our standard opening
ch09 uses :lex: role, not processed by docbook
ch09 could mention use of trees as source of features for ML
ch09 includes contents of grammar files that have changed in data distribution
42 changes: 21 additions & 21 deletions book/ch07.rst
@@ -357,7 +357,7 @@ where parsing constructs nested structures that are arbitrarily deep,
chunking creates structures of fixed depth (typically depth 2). These
chunks often correspond to the lowest level of grouping identified in
the full parse tree. This is illustrated in ex-parsing-chunking_ below,
which shows an `np`:gc: chunk structure and a completely parsed
which shows an ``NP`` chunk structure and a completely parsed
counterpart:

.. _ex-parsing-chunking:
@@ -663,9 +663,9 @@ chunk, then excise the chink:

Two other rules for forming chunks are splitting and merging.
A permissive chunking rule might put
`the cat the dog chased`:lx: into a single `np`:gc: chunk
`the cat the dog chased`:lx: into a single ``NP`` chunk
because it does not detect that determiners introduce new chunks.
For this we would need a rule to split an `np`:gc: chunk
For this we would need a rule to split an ``NP`` chunk
prior to any determiner, using a pattern like: ``"NP: <.*>}{<DT>"``.
Conversely, we can craft rules to merge adjacent chunks under
particular circumstances, e.g. ``"NP: <NN>{}<NN>"``.
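
As an aside (not part of this commit), here is a minimal sketch of how a split rule like the one above might be used with ``nltk.RegexpParser``; the grammar and the toy tagged sentence are illustrative only, and the exact output may vary across NLTK versions:

    import nltk

    # Illustrative grammar: first chunk determiner/noun sequences, then split
    # an NP wherever a noun is immediately followed by a determiner.
    grammar = r"""
    NP:
      {<DT|NN>+}      # chunk sequences of determiners and nouns
      <NN>}{<DT>      # split between a noun and a following determiner
    """
    cp = nltk.RegexpParser(grammar)
    sentence = [("the", "DT"), ("cat", "NN"), ("the", "DT"),
                ("dog", "NN"), ("chased", "VBD")]
    print(cp.parse(sentence))   # expect two separate NP chunks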
@@ -676,10 +676,10 @@ and merging patterns in any order.
Multiple Chunk Types
--------------------

So far we have only developed `np`:gc: chunkers. However, as we saw earlier
in the chapter, the CoNLL chunking data is also annotated for `pp`:gc: and
`vp`:gc: chunks. Here is an example of a chunked sentence that contains
`np`:gc:, `vp`:gc:, and `pp`:gc: chunk types.
So far we have only developed ``NP`` chunkers. However, as we saw earlier
in the chapter, the CoNLL chunking data is also annotated for ``PP`` and
``VP`` chunks. Here is an example of a chunked sentence that contains
``NP``, ``VP``, and ``PP`` chunk types.

>>> from nltk.corpus import conll2000
>>> print conll2000.chunked_sents('train.txt')[99]
@@ -863,7 +863,7 @@ Reading IOB Format and the CoNLL 2000 Corpus

Using the ``corpora`` module we can load Wall Street Journal
text that has been tagged then chunked using the IOB notation. The
chunk categories provided in this corpus are `np`:gc:, `vp`:gc: and `pp`:gc:. As we
chunk categories provided in this corpus are ``NP``, ``VP`` and ``PP``. As we
have seen, each sentence is represented using multiple lines, as shown
below::

@@ -876,7 +876,7 @@ below::
|nopar| A conversion function ``chunk.conllstr2tree()`` builds a tree
representation from one of these multi-line strings. Moreover, it
permits us to choose any subset of the three chunk types to use. The
example below produces only `np`:gc: chunks:
example below produces only ``NP`` chunks:

.. doctest-ignore::
>>> text = '''
@@ -933,7 +933,7 @@ example that reads the 100th sentence of the "train" portion of the corpus:


|nopar|
This showed three chunk types, for `np`:gc:, `vp`:gc: and `pp`:gc:.
This showed three chunk types, for ``NP``, ``VP`` and ``PP``.
We can also select which chunk types to read:

>>> print nltk.corpus.conll2000.chunked_sents('train.txt', chunk_types=('NP',))[99]
@@ -961,7 +961,7 @@ We start off by establishing a baseline for the trivial chunk parser
0.440845995079

This indicates that more than a third of the words are tagged with
``O`` (i.e., not in an `np`:gc: chunk). Now let's try a naive regular
``O`` (i.e., not in an ``NP`` chunk). Now let's try a naive regular
expression chunker that looks for tags (e.g., ``CD``, ``DT``, ``JJ``,
etc.) beginning with letters that are typical of noun phrase tags:

@@ -975,7 +975,7 @@ order to develop a more data-driven approach, let's define a function
``chunked_tags()`` that takes some chunked data
and sets up a conditional frequency distribution.
For each tag, it counts up the number of times the tag
occurs inside an `np`:gc: chunk (the ``True`` case, where ``chtag`` is
occurs inside an ``NP`` chunk (the ``True`` case, where ``chtag`` is
``B-NP`` or ``I-NP``), or outside a chunk (the ``False`` case, where
``chtag`` is ``O``). It returns a list of those tags that occur
inside chunks more often than outside chunks.
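
As an aside (not part of this commit), here is a sketch of what a ``chunked_tags()`` along these lines might look like; the book's own definition is elided from this hunk, and this version uses the current NLTK API (``tree2conlltags``, ``ConditionalFreqDist``), which differs from the 2009-era code:

    import nltk
    from nltk.corpus import conll2000

    def chunked_tags(train):
        # For each POS tag, count occurrences inside an NP chunk (True) vs.
        # outside any chunk (False), then keep the tags that occur inside
        # chunks more often than outside.
        cfdist = nltk.ConditionalFreqDist(
            (tag, chtag in ('B-NP', 'I-NP'))
            for sent in train
            for (word, tag, chtag) in nltk.chunk.tree2conlltags(sent))
        return [tag for tag in cfdist.conditions()
                if cfdist[tag][True] > cfdist[tag][False]]

    train_sents = conll2000.chunked_sents('train.txt', chunk_types=('NP',))
    print(chunked_tags(train_sents))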
@@ -1245,7 +1245,7 @@ structures having a depth of at most four.
(NP the/DT cat/NN)
(VP sit/VB (PP on/IN (NP the/DT mat/NN)))))

Unfortunately this result misses the `vp`:gc: headed by `saw`:lx:. It has
Unfortunately this result misses the ``VP`` headed by `saw`:lx:. It has
other shortcomings too. Let's see what happens when we apply this
chunker to a sentence having deeper nesting.

@@ -1298,10 +1298,10 @@ example of a tree (note that they are standardly drawn upside-down):
.. tree:: (S (NP Alice) (VP (V chased) (NP the rabbit)))

We use a 'family' metaphor to talk about the
relationships of nodes in a tree: for example, `s`:gc: is the
`parent`:dt: of `vp`:gc:; conversely `vp`:gc: is a `daughter`:dt: (or
`child`:dt:) of `s`:gc:. Also, since `np`:gc: and `vp`:gc: are both
daughters of `s`:gc:, they are also `sisters`:dt:.
relationships of nodes in a tree: for example, ``S`` is the
`parent`:dt: of ``VP``; conversely ``VP`` is a `daughter`:dt: (or
`child`:dt:) of ``S``. Also, since ``NP`` and ``VP`` are both
daughters of ``S``, they are also `sisters`:dt:.
For convenience, there is also a text format for specifying
trees:

@@ -1437,7 +1437,7 @@ answer because the part-of-speech tags are too impoverished and do not
give us sufficient information about the lexical item. A second
approach is to write utility programs to analyze the training data,
such as counting the number of times a given part-of-speech tag occurs
inside and outside an `np`:gc: chunk. A third approach is to evaluate the
inside and outside an ``NP`` chunk. A third approach is to evaluate the
system against some gold standard data to obtain an overall
performance score. We can even use this to parameterize the system,
specifying which chunk rules are used on a given run, and tabulating
@@ -1463,11 +1463,11 @@ The word `chink`:dt: initially meant a sequence of stopwords,
according to a 1975 paper by Ross and Tukey [Abney1996PST]_.

The IOB format (or sometimes `BIO Format`:dt:) was developed for
`np`:gc: chunking by [Ramshaw1995TCU]_, and was used for the shared `np`:gc:
``NP`` chunking by [Ramshaw1995TCU]_, and was used for the shared ``NP``
bracketing task run by the *Conference on Natural Language Learning*
(|CoNLL|) in 1999. The same format was
adopted by |CoNLL| 2000 for annotating a section of Wall Street
Journal text as part of a shared task on `np`:gc: chunking.
Journal text as part of a shared task on ``NP`` chunking.

Section 13.5 of [JurafskyMartin2008]_ contains a discussion of chunking.
Chapter 22 covers information extraction, including named entity recognition.
@@ -1585,7 +1585,7 @@ Exercises
re-evaluate it, to see if you have discovered an improved baseline.

#. |hard|
Develop an `np`:gc: chunker that converts POS-tagged text into a list of
Develop an ``NP`` chunker that converts POS-tagged text into a list of
tuples, where each tuple consists of a verb followed by a sequence of
noun phrases and prepositions,
e.g. ``the little cat sat on the mat`` becomes ``('sat', 'on', 'NP')``...
76 changes: 38 additions & 38 deletions book/ch08-extras.rst
@@ -32,7 +32,7 @@ follows:
.. XXX "So we might adopt the heuristic that" -> "Suppose that"
So we might adopt the heuristic that the subject of a sentence is the
`np`:gc: chunk that immediately precedes the tensed verb: this would
``NP`` chunk that immediately precedes the tensed verb: this would
correctly yield ``(NP the/DT little/JJ bear/NN)`` as
subject. Unfortunately, this simple rule very quickly fails, as shown
by a more complex example.
@@ -58,27 +58,27 @@ by a more complex example.
What's doing the "preventing" in this example is not the firm monetary
policy, but rather the restated commitment to such a policy. We can
also see from this example that a different simple rule, namely
treating the initial `np`:gc: chunk as the subject, also fails, since this
treating the initial ``NP`` chunk as the subject, also fails, since this
would give us the ``(NP the/DT Exchequer/NNP)``. By contrast, a
complete phrase structure analysis of
the sentence would group together all the pre-verbal `np`:gc: chunks
into a single `np`:gc: constituent:
the sentence would group together all the pre-verbal ``NP`` chunks
into a single ``NP`` constituent:

.. ex::
.. tree:: (NP(NP (NP (Nom (N Chancellor) (PP (P of)(NP (Det the) (N Exchequer))))(NP Nigel Lawson)) (POSS 's))(Nom (Adj restated)(Nom (N commitment)(PP (P to)(NP (Det a)(Nom (Adj firm) (Nom (Adj monetary)(Nom (N policy)))))))))
:scale: 80:80:50

We still have a little work to do to determine which part of this complex
`np`:gc: corresponds to the "who", but nevertheless, this is much
``NP`` corresponds to the "who", but nevertheless, this is much
more tractable than answering the same question from a flat sequence
of chunks.

"Subject" and "direct object" are examples of `grammatical
functions`:dt:. Although they are not captured directly in a phrase
structure grammar, they can be defined in terms of tree
configurations. In ex-gfs_, the subject of `s`:gc: is the `np`:gc:
immediately dominated by `s`:gc: while the direct object of `v`:gc:
is the `np`:gc: directly dominated by `vp`:gc:.
configurations. In ex-gfs_, the subject of ``S`` is the ``NP``
immediately dominated by ``S`` while the direct object of ``V``
is the ``NP`` directly dominated by ``VP``.

.. _ex-gfs:
.. ex::
@@ -128,14 +128,14 @@ top-down parser processes *VP* |rarr| *V* *NP* *PP*,
it may find *V* and *NP* but not the *PP*. This work
can be reused when processing *VP* |rarr| *V* *NP*.
Thus, we will record the
hypothesis that "the `v`:gc: constituent `likes`:lx: is the beginning of a `vp`:gc:."
hypothesis that "the ``V`` constituent `likes`:lx: is the beginning of a ``VP``."

We can do this by adding a `dot`:dt: to the edge's right hand side.
Material to the left of the dot records what has been found so far;
material to the right of the dot specifies what still needs to be found in order
to complete the constituent. For example, the edge in
ex-dottededge_ records the hypothesis that "a `vp`:gc: starts with the `v`:gc:
`likes`:lx:, but still needs an `np`:gc: to become complete":
ex-dottededge_ records the hypothesis that "a ``VP`` starts with the ``V``
`likes`:lx:, but still needs an ``NP`` to become complete":

.. _ex-dottededge:
.. ex::
@@ -149,18 +149,18 @@
-------------

Let's take stock.
An edge [`VP`:gc: |rarr| |dot| `V`:gc: `NP`:gc: `PP`:gc:, (*i*, *i*)]
records the hypothesis that a `VP`:gc: begins at location *i*, and that we anticipate
finding a sequence `V NP PP`:gc: starting here. This is known as a
An edge [``VP`` |rarr| |dot| ``V`` ``NP`` ``PP``, (*i*, *i*)]
records the hypothesis that a ``VP`` begins at location *i*, and that we anticipate
finding a sequence ``V NP PP`` starting here. This is known as a
`self-loop edge`:dt:; see ex-chart-intro-selfloop_.
An edge [`VP`:gc: |rarr| `V`:gc: |dot| `NP`:gc: `PP`:gc:, (*i*, *j*)]
records the fact that we have discovered a `V`:gc: spanning (*i*, *j*),
and hypothesize a following `NP PP`:gc: sequence to complete a `VP`:gc:
An edge [``VP`` |rarr| ``V`` |dot| ``NP`` ``PP``, (*i*, *j*)]
records the fact that we have discovered a ``V`` spanning (*i*, *j*),
and hypothesize a following ``NP PP`` sequence to complete a ``VP``
beginning at *i*. This is known as an `incomplete edge`:dt:;
see ex-chart-intro-incomplete_.
An edge [`VP`:gc: |rarr| `V`:gc: `NP`:gc: `PP`:gc: |dot| , (*i*, *k*)]
records the discovery that a `VP`:gc: consisting of the sequence
`V NP PP`:gc: has been discovered for the span (*i*, *j*). This is known
An edge [``VP`` |rarr| ``V`` ``NP`` ``PP`` |dot| , (*i*, *k*)]
records that a ``VP`` consisting of the sequence
``V NP PP`` has been found spanning (*i*, *k*). This is known
as a `complete edge`:dt:; see ex-chart-intro-parseedge_.
If a complete edge spans the entire sentence, and has the grammar's
start symbol as its left-hand side, then the edge is called a `parse
@@ -244,7 +244,7 @@ bottom-up parsing starts from the input string,
and tries to find sequences of words and phrases that
correspond to the *right hand* side of a grammar production. The
parser then replaces these with the left-hand side of the production,
until the whole sentence is reduced to an `S`:gc:. Bottom-up chart
until the whole sentence is reduced to an ``S``. Bottom-up chart
parsing is an extension of this approach in which hypotheses about
structure are recorded as edges on a chart. In terms of our earlier
terminology, bottom-up chart parsing can be seen as a parsing
@@ -296,11 +296,11 @@ for each grammar production whose right hand side begins with category
:scale: 30

The next step is to use the Fundamental Rule to add edges
like [`np`:gc: |rarr| Lee |dot| , (0, 1)],
like [``NP`` |rarr| Lee |dot| , (0, 1)],
where we have "moved the dot" one position to the right.
After this, we will now be able to add new self-loop edges such as
[`s`:gc: |rarr| |dot| `np`:gc: `vp`:gc:, (0, 0)] and
[`vp`:gc: |rarr| |dot| `vp`:gc: `np`:gc:, (1, 1)], and use these to
[``S`` |rarr| |dot| ``NP`` ``VP``, (0, 0)] and
[``VP`` |rarr| |dot| ``VP`` ``NP``, (1, 1)], and use these to
build more complete edges.
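
As an aside (not part of this commit), the bottom-up strategy described here can be tried directly with NLTK's chart parser; the toy grammar below is made up for illustration and is not the chapter's own grammar:

    import nltk
    from nltk.parse.chart import BottomUpChartParser

    # Illustrative toy grammar
    grammar = nltk.CFG.fromstring("""
      S  -> NP VP
      VP -> V NP
      NP -> 'Lee' | 'coffee'
      V  -> 'likes'
    """)
    # trace=1 prints the edges as they are added to the chart
    parser = BottomUpChartParser(grammar, trace=1)
    for tree in parser.parse(['Lee', 'likes', 'coffee']):
        print(tree)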

Using these three rules, we can parse a sentence as shown in
@@ -329,29 +329,29 @@ Top-Down Parsing
----------------

Top-down chart parsing works in a similar way to the recursive descent
parser, in that it starts off with the top-level goal of finding an `s`:gc:.
This goal is broken down into the subgoals of trying to find constituents such as `np`:gc: and
`vp`:gc: predicted by the grammar.
parser, in that it starts off with the top-level goal of finding an ``S``.
This goal is broken down into the subgoals of trying to find constituents such as ``NP`` and
``VP`` predicted by the grammar.
To create a top-down chart parser, we use the Fundamental Rule as before plus
three other rules: the `Top-Down Initialization Rule`:dt:, the `Top-Down
Expand Rule`:dt:, and the `Top-Down Match Rule`:dt:.
The Top-Down Initialization Rule in ex-td-init-rule_
captures the fact that the root of any
parse must be the start symbol `s`:gc:\.
parse must be the start symbol ``S``\.

.. _ex-td-init-rule:
.. ex:: `Top-Down Initialization Rule`:dt: For each production `s`:gc: |rarr| |alpha|
add the self-loop edge [`s`:gc: |rarr| |dot|\ |alpha|\ , (0, 0)]
.. ex:: `Top-Down Initialization Rule`:dt: For each production ``S`` |rarr| |alpha|
add the self-loop edge [``S`` |rarr| |dot|\ |alpha|\ , (0, 0)]

|chart_td_ex1|

.. |chart_td_ex1| image:: ../images/chart_td_ex1.png
:scale: 30

In our running example, we are predicting that we will be able to find an `np`:gc: and a
`vp`:gc: starting at 0, but have not yet satisfied these subgoals.
In order to find an `np`:gc: we need to
invoke a production that has `np`:gc: on its left hand side. This work
In our running example, we are predicting that we will be able to find an ``NP`` and a
``VP`` starting at 0, but have not yet satisfied these subgoals.
In order to find an ``NP`` we need to
invoke a production that has ``NP`` on its left hand side. This work
is done by the Top-Down Expand Rule ex-td-expand-rule_.
This tells us that if our chart contains an incomplete
edge whose dot is followed by a nonterminal *B*, then the parser
@@ -387,7 +387,7 @@ add an edge if the terminal corresponds to the current input symbol.

Here we see our example chart after applying the Top-Down Match rule.
After this, we can apply the fundamental rule to
add the edge [`np`:gc: |rarr| Lee |dot| , (0, 1)].
add the edge [``NP`` |rarr| Lee |dot| , (0, 1)].

Using these four rules, we can parse a sentence top-down as shown in
ex-top-down-strategy_.
@@ -452,8 +452,8 @@ which *P* dominates *w*. More precisely:

To illustrate, suppose the input is of the form
`I saw ...`:lx:, and the chart already contains the edge
[`vp`:gc: |rarr| |dot| `v`:gc: ..., (1, 1)]. Then the Scanner Rule will add to
the chart the edges [`v`:gc: |rarr| 'saw', (1, 2)]
[``VP`` |rarr| |dot| ``V`` ..., (1, 1)]. Then the Scanner Rule will add to
the chart the edges [``V`` |rarr| 'saw', (1, 2)]
and ['saw' |rarr| |dot|\ , (1, 2)]. So in effect the Scanner Rule packages up a
sequence of three rule applications: the Bottom-Up Initialization Rule for
[*w* |rarr| |dot|\ , (*j*, *j*\ +1)],
@@ -706,7 +706,7 @@ is the product of the probability of the production that
generated it and the probabilities of its children. For example, the
probability of the edge ``[Edge: S`` |rarr| ``NP``\ |dot|\ ``VP, 0:2]``
is the probability of the PCFG production ``S`` |rarr| ``NP VP``
multiplied by the probability of its `np`:gc: child.
multiplied by the probability of its ``NP`` child.
(Note that an edge's tree only includes children for elements to the left
of the edge's dot.)
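
As an aside (not part of this commit), a sketch that makes the edge-probability arithmetic concrete, using a made-up toy PCFG and NLTK's probabilistic chart parser (current API; details may differ from the book's code):

    import nltk
    from nltk.parse.pchart import InsideChartParser

    # Made-up toy PCFG; the probabilities are illustrative only.
    grammar = nltk.PCFG.fromstring("""
      S  -> NP VP    [1.0]
      VP -> V NP     [1.0]
      NP -> 'Lee'    [0.4]
      NP -> 'coffee' [0.6]
      V  -> 'likes'  [1.0]
    """)
    # By the rule above, an edge like [S -> NP . VP, 0:1] over 'Lee' has
    # probability P(S -> NP VP) * P(NP -> 'Lee') = 1.0 * 0.4 = 0.4.
    parser = InsideChartParser(grammar)
    for tree in parser.parse(['Lee', 'likes', 'coffee']):
        print(tree.prob(), tree)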

