<a href="https://colab.research.google.com/github/goel4ever/machine-learning-notebooks/blob/main/nlp_chunking.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# NLP: Chunking

A `phrase` is a word or group of words that works as a single unit to perform a grammatical function. While `tokenizing` allows you to identify words and sentences, `chunking` allows you to identify phrases.

Chunking makes use of `POS tags` to group words and apply chunk tags to those groups. Chunks don't overlap, so one instance of a word can be in only one chunk at a time. This notebook focuses on chunking sentences using Natural Language Processing.

We'll use the NLTK package for the implementation. A collection of texts is called a corpus. NLTK provides several corpora, covering everything from novels hosted by Project Gutenberg to inaugural speeches by presidents of the United States.

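To analyze text with NLTK, you first need to download the resources it depends on. Rather than fetching everything in one go, this notebook calls `nltk.download()` for each resource as it is needed: `punkt` for tokenization and `averaged_perceptron_tagger` for POS tagging. Each download is a one-off, but it may take some time.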

In [2]:
import nltk
# Download resource punkt for tokenization
nltk.download('punkt')

# Required imports
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
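Re-running the cell repeats the download check every time. A minimal sketch of a guard that downloads only when the resource is missing is shown below. The helper name `ensure_resource` and its `downloader` parameter are hypothetical (not part of NLTK); the sketch relies on `nltk.data.find()` raising `LookupError` for a missing resource, and the resource paths are taken from the download logs in this notebook.

```python
import nltk


def ensure_resource(resource_path, package_id, downloader=nltk.download):
    """Download an NLTK package only if its resource is not already installed.

    Hypothetical helper for illustration.
    Returns True if a download was triggered, False if the resource was found.
    """
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
        return False
    except LookupError:
        downloader(package_id)
        return True


# Example calls (commented out so nothing is fetched by accident):
# ensure_resource("tokenizers/punkt", "punkt")
# ensure_resource("taggers/averaged_perceptron_tagger", "averaged_perceptron_tagger")
```

Passing the downloader as a parameter is only a convenience for trying the helper without network access.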


In [3]:
# Before you can chunk, you need to make sure that the parts of speech in your text are tagged, so create a string for POS tagging.
quote = "It's a dangerous business, Frodo, going out your door."

In [4]:
# Tokenize the string by word
words_in_quote = word_tokenize(quote)
words_in_quote

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door',
 '.']
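To see why a tokenizer is used instead of plain string splitting, the sketch below compares `str.split()` with NLTK's rule-based `TreebankWordTokenizer`. This is a swapped-in tokenizer chosen because it needs no downloaded model; it is related to, but not the same as, the `word_tokenize` call used above, so treat the comparison as illustrative only.

```python
from nltk.tokenize import TreebankWordTokenizer

quote = "It's a dangerous business, Frodo, going out your door."

# Naive whitespace splitting leaves punctuation and clitics glued to words,
# e.g. "It's" and "business," stay as single tokens.
naive_tokens = quote.split()
print(naive_tokens)

# The Treebank tokenizer separates clitics such as "'s" and punctuation
# such as "," into their own tokens, which a POS tagger can then label.
treebank_tokens = TreebankWordTokenizer().tokenize(quote)
print(treebank_tokens)
```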

In [5]:
# Tag those words by part of speech
nltk.download("averaged_perceptron_tagger")
pos_tags = nltk.pos_tag(words_in_quote)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]
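The tags above follow the Penn Treebank tagset. As a reading aid, the sketch below pairs each tagged token with a short description. The glossary is hand-written and covers only the tags that occur in this sentence; the names `TAG_GLOSSARY` and `describe` are hypothetical helpers, not NLTK functions.

```python
# Tagged tokens copied from the output above.
pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

# Hand-written glossary for the Penn Treebank tags used in this sentence only.
TAG_GLOSSARY = {
    "PRP": "personal pronoun",
    "VBZ": "verb, 3rd person singular present",
    "DT": "determiner",
    "JJ": "adjective",
    "NN": "noun, singular or mass",
    "NNP": "proper noun, singular",
    "VBG": "verb, gerund or present participle",
    "RP": "particle",
    "PRP$": "possessive pronoun",
    ",": "punctuation",
    ".": "punctuation",
}


def describe(tagged_tokens, glossary=TAG_GLOSSARY):
    """Return (word, tag, description) triples; unknown tags get a placeholder."""
    return [(word, tag, glossary.get(tag, "not in glossary")) for word, tag in tagged_tokens]


described = describe(pos_tags)
for word, tag, meaning in described:
    print(f"{word:10} {tag:5} {meaning}")
```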

In [6]:
# In order to chunk, you first need to define a chunk grammar.
# A chunk grammar is a combination of rules on how sentences should be chunked.
# It often uses regular expressions, or regexes.
grammar = "NP: {<DT>?<JJ>*<NN>}"

# 1. NP stands for noun phrase.
# 2. Start with an optional (?) determiner (<DT>)
# 3. Can have any number (*) of adjectives (<JJ>)
# 4. End with a singular common noun (<NN>); proper nouns (<NNP>) such as 'Frodo' do not match

# Read about Noun Phrase chunking here
# https://www.nltk.org/book/ch07.html#noun-phrase-chunking
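One way to build intuition for the tag pattern is to treat the sentence's tags as a single string and run an ordinary regular expression over it. The sketch below does that with Python's `re` module. It is a simplified illustration of the idea under that assumption, not NLTK's actual implementation, and the variable names are made up for this example.

```python
import re

# POS tags of the quote, in sentence order.
tags = ["PRP", "VBZ", "DT", "JJ", "NN", ",", "NNP", ",", "VBG", "RP", "PRP$", "NN", "."]

# Write every tag in angle brackets, mirroring the <TAG> notation of the grammar.
tag_string = "".join(f"<{tag}>" for tag in tags)

# Plain-regex counterpart of {<DT>?<JJ>*<NN>}:
# an optional determiner, any number of adjectives, then a singular noun.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

matches = np_pattern.findall(tag_string)
print(matches)  # ['<DT><JJ><NN>', '<NN>']
```

The first match corresponds to 'a dangerous business' and the second to 'door'. `<NNP>` is not matched because the pattern requires the closing `>` directly after `NN`.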

In [7]:
# Create a chunk parser with this grammar
chunk_parser = nltk.RegexpParser(grammar)
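The grammar above ends in `<NN>`, which matches only singular common nouns. As a sketch of how a small change to the grammar changes the result, the example below compares it with a variant ending in `<NN.*>`, which also accepts tags such as `NNP`. The tagged tokens are hard-coded from the tagger output in this notebook so that no download is needed, and `noun_phrases` is a hypothetical helper written for this comparison.

```python
import nltk

# Tagged tokens copied from the tagger output in this notebook.
pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]


def noun_phrases(grammar, tagged_tokens):
    """Chunk the tagged tokens with the given grammar and return NP chunks as strings."""
    tree = nltk.RegexpParser(grammar).parse(tagged_tokens)
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
    ]


strict = noun_phrases("NP: {<DT>?<JJ>*<NN>}", pos_tags)
broad = noun_phrases("NP: {<DT>?<JJ>*<NN.*>}", pos_tags)

print(strict)  # ['a dangerous business', 'door']
print(broad)   # ['a dangerous business', 'Frodo', 'door']
```

Whether the broader pattern is preferable depends on the task; it is shown here only to make the effect of the final tag visible.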

In [11]:
# Try the parser with the quote, and draw a tree
tree = chunk_parser.parse(pos_tags)

# tree.draw() opens a GUI window, so it fails in notebooks that have no display (such as Colab)
# tree.draw()

# pretty_print() prints the tree itself and returns None, so it should not be wrapped in print()
tree.pretty_print()

# You got two noun phrases:
# 1. 'a dangerous business' has a determiner, an adjective, and a noun.
# 2. 'door' has just a noun.

                                            S                                                       
   _________________________________________|___________________________________________________     
  |      |     |      |      |      |       |        |      |            NP                     NP  
  |      |     |      |      |      |       |        |      |    ________|____________          |    
It/PRP 's/VBZ ,/, Frodo/NNP ,/, going/VBG out/RP your/PRP$ ./. a/DT dangerous/JJ business/NN door/NN
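Because chunks don't overlap, every token can carry exactly one chunk label. A common flat representation is IOB tagging: `B-NP` marks the first token of a noun phrase, `I-NP` a token inside one, and `O` a token outside any chunk. The sketch below converts the chunk tree with `tree2conlltags` from `nltk.chunk`. The tagged tokens are hard-coded from the tagger output in this notebook so that the example runs without downloads.

```python
import nltk
from nltk.chunk import tree2conlltags

# Tagged tokens copied from the tagger output in this notebook.
pos_tags = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

# Same grammar as in the notebook.
tree = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}").parse(pos_tags)

# Flatten the tree into (word, POS tag, IOB chunk tag) triples, in sentence order.
iob_tags = tree2conlltags(tree)
for word, pos, chunk_tag in iob_tags:
    print(f"{word:10} {pos:5} {chunk_tag}")
```

Unlike the drawn tree, this listing keeps the tokens in their original sentence order, which makes it easier to see where each chunk starts and ends.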
