## Part 2. Chunking

Now that you know how to tokenize and POS tag a text, you will learn how to build a _chunker_ that works on top of the POS tags. A chunker is a useful tool for data mining and information extraction. For instance, you can use it to search a corpus for specific, linguistically interesting patterns. First, import NLTK and download the necessary resources:

In [None]:
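import sys
# Install NLTK into the same environment that runs this notebook
!{sys.executable} -m pip install nltk
import nltk
# Tokenizer model, POS tagger model and the Brown corpus.
# Note: recent NLTK releases may ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'
# instead; add them to the list if you get a LookupError later on.
nltk.download(['punkt', 'averaged_perceptron_tagger', 'brown'])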

### 2.1 How to write a chunker

In the example below, a sentence is first tokenized and POS tagged. A very simple chunker then finds one specific type of noun phrase (NP) in the text: an optional determiner, followed by any number of adjectives, followed by a noun.

In [None]:
sentence = "The little yellow dog barked at the cat."
tokenized = nltk.word_tokenize(sentence)
pos_tagged = nltk.pos_tag(tokenized)

chunk_grammar = "NP: { <DT>? <JJ>* <NN> }"
chunk_parser = nltk.RegexpParser(chunk_grammar)
result = chunk_parser.parse(pos_tagged)
print(result.pformat(parens='[]'))

If you want to visualize the result, copy and paste the bracketed structure into the syntax tree generator at http://mshang.ca/syntree/

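The tree shows why this is called _shallow_ parsing: the chunker only groups adjacent words into flat chunks directly below the root, and it does not build any nested structure inside or above those chunks.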
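The flatness of the tree can also be checked in code. The sketch below assumes a hand-tagged copy of the example sentence (the tags are written by hand for illustration, so no tagger model is needed) and prints the direct children of the root together with the height of the tree.

```python
import nltk

# Hand-tagged sentence (Penn Treebank style tags written by hand for illustration,
# so the example does not depend on the tagger model).
tagged = [("The", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
          ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN"), (".", ".")]

chunk_parser = nltk.RegexpParser("NP: { <DT>? <JJ>* <NN> }")
tree = chunk_parser.parse(tagged)

# The direct children of the root are either chunks (Tree objects) or
# unchunked (word, tag) tuples; with this grammar, chunks never contain other chunks.
for child in tree:
    if isinstance(child, nltk.Tree):
        print(child.label(), [word for word, tag in child.leaves()])
    else:
        print("-", child)

# Root -> chunk -> token gives a height of 3, however long the sentence is.
print("tree height:", tree.height())
```

A full syntactic parse of the same sentence would typically be deeper, because phrases would be nested inside other phrases.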

The chunk grammar is written using **regular expressions**. First you name the chunk (e.g., `NP`). Then inside curly brackets `{ ... }` you write a sequence of POS tags. Each POS tag must be placed inside angle brackets `<...>`. You can use regular expression syntax both inside the angle brackets (if you want to match many different types of parts-of-speech) and outside the angle brackets (if you want to make some of the tags optional, for instance).
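As a minimal sketch of the two places where regular expression syntax can appear, the example below uses a hand-tagged token list (the sentence and its tags are made up for illustration). Inside the angle brackets, `<NN.*>` and `<DT|CD>` each match several tags; outside them, `?`, `*` and `+` control how often a tag may occur.

```python
import nltk

# Hand-tagged tokens (tags assigned by hand for illustration).
tagged = [("Alice", "NNP"), ("bought", "VBD"), ("three", "CD"),
          ("old", "JJ"), ("books", "NNS"), ("and", "CC"),
          ("a", "DT"), ("lamp", "NN"), (".", ".")]

# Inside the brackets: <NN.*> matches any tag that starts with NN (NN, NNS, NNP, ...),
# and <DT|CD> matches either a determiner or a cardinal number.
# Outside the brackets: ? makes the preceding tag optional, * allows zero or more
# occurrences, and + requires one or more.
grammar = "NP: { <DT|CD>? <JJ>* <NN.*>+ }"
parser = nltk.RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP chunk as a plain string
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees() if subtree.label() == "NP"]
print(noun_phrases)  # ['Alice', 'three old books', 'a lamp']
```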

You can have multiple rules for the same chunk, and you can identify many different types of chunks in the same grammar. The example below further clarifies how to write chunk grammars in NLTK:

In [None]:
sentence = "Rapunzel let down her long golden hair."
tokenized = nltk.word_tokenize(sentence)
pos_tagged = nltk.pos_tag(tokenized)

# This chunk grammar is written on multiple lines, with comments for each rule
chunk_grammar = r"""
  NP: { <DT|PRP\$>? <JJ>* <NN> }   # NP chunk: determiner/possessive, adjectives and noun
  NP: { <NNP>+ }                   # NP chunk: sequences of proper nouns
  VB: { <VB.*> <RP>? }             # VB chunk: some verb form optionally followed by a particle
"""
chunk_parser = nltk.RegexpParser(chunk_grammar)
print(chunk_parser.parse(pos_tagged).pformat(parens='[]'))

### 2.2 Using a chunker for information extraction

With a larger data set, you can use your chunker as a data mining tool. Let us look for a particular type of expression in the Brown corpus:

In [None]:
# Get the POS tagged Brown corpus
pos_tagged = nltk.corpus.brown.tagged_words()

# Define chunker and parse data with it
chunk_grammar = r"""
  CHUNK: { <VBN> <TO> <VB.*> <RP>? }   # Chunk: past participle verb + "to" + other verb + optional particle
"""
chunk_parser = nltk.RegexpParser(chunk_grammar)
tree = chunk_parser.parse(pos_tagged)

# Print all the matches in the data
for subtree in tree.subtrees():
    if subtree.label() == "CHUNK":
        print(subtree)
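Printing every match produces a lot of output on a large corpus, so it can help to count the matched word sequences instead. The sketch below is one possible way to do that. The helper name `count_chunks` is hypothetical, and the three hand-tagged sentences are made-up stand-ins for `nltk.corpus.brown.tagged_sents()`.

```python
from collections import Counter

import nltk


def count_chunks(tagged_sents, grammar, label="CHUNK"):
    """Parse each tagged sentence separately and count the chunk strings found."""
    parser = nltk.RegexpParser(grammar)
    counts = Counter()
    for sent in tagged_sents:
        tree = parser.parse(sent)
        # Only visit subtrees that carry the chunk label we are interested in
        for subtree in tree.subtrees(lambda t: t.label() == label):
            words = " ".join(word.lower() for word, tag in subtree.leaves())
            counts[words] += 1
    return counts


# Hand-tagged sentences for illustration; with the corpus downloaded, you could
# pass nltk.corpus.brown.tagged_sents() instead.
sample = [
    [("He", "PPS"), ("was", "BEDZ"), ("expected", "VBN"), ("to", "TO"),
     ("give", "VB"), ("up", "RP"), (".", ".")],
    [("She", "PPS"), ("was", "BEDZ"), ("expected", "VBN"), ("to", "TO"),
     ("give", "VB"), ("up", "RP"), (".", ".")],
    [("They", "PPSS"), ("were", "BED"), ("forced", "VBN"), ("to", "TO"),
     ("leave", "VB"), (".", ".")],
]

grammar = "CHUNK: { <VBN> <TO> <VB.*> <RP>? }"
counts = count_chunks(sample, grammar)
print(counts.most_common())  # [('expected to give up', 2), ('forced to leave', 1)]
```

Parsing sentence by sentence is a design choice in this sketch: it keeps a chunk from spanning a sentence boundary, which can happen when the whole corpus is parsed as one long sequence of words.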

Your task is now to modify the chunk grammar in order to extract other types of chunks from the Brown corpus, for instance:
* Extract more complicated noun phrases than in the toy examples above.
* Extract chunks of some types of named entities (NER = named entity recognition).
* Explore some types of verb-argument structures.
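One thing to keep in mind for these tasks: the Brown corpus is annotated with its own tagset, which differs from the Penn Treebank tags returned by `nltk.pos_tag`. For example, Brown uses `AT` for articles, `NP` for proper nouns and `PP$` for possessive pronouns, so the toy grammars above need to be adapted. The sketch below is a possible starting point rather than a solution. The sentence is hand-tagged in Brown style for illustration, and the chunk labels `NAME` and `NPCHUNK` are arbitrary choices.

```python
import nltk

# Hand-tagged in the style of the Brown tagset (an illustration, not corpus data):
# AT = article, JJ = adjective, NN/NNS = common noun, NP = proper noun, IN = preposition.
tagged = [("The", "AT"), ("old", "JJ"), ("mayor", "NN"), ("praised", "VBD"),
          ("John", "NP"), ("Smith", "NP"), ("for", "IN"),
          ("the", "AT"), ("new", "JJ"), ("city", "NN"), ("budget", "NN"), (".", ".")]

# The noun phrase chunk is labelled NPCHUNK rather than NP, because NP is
# already the Brown tag for proper nouns and the two could be confused.
chunk_grammar = r"""
  NAME:    { <NP>+ }                        # one or more proper nouns in a row
  NPCHUNK: { <AT|PP\$>? <JJ>* <NN|NNS>+ }   # article/possessive, adjectives, one or more nouns
"""
tree = nltk.RegexpParser(chunk_grammar).parse(tagged)

# Collect (label, words) pairs for both chunk types, in sentence order
found = [(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
         for subtree in tree.subtrees(lambda t: t.label() in ("NAME", "NPCHUNK"))]
for label, words in found:
    print(label, "->", words)
```

Real Brown data also contains tag variants with suffixes such as `-TL` (words in titles), so patterns like `<NP.*>` may be worth trying when you run a grammar of this kind on the corpus.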

When you are done here, continue to Part 3.