
# NLP: Chinking

`Chinking` is used together with `chunking`: chunking defines the patterns to include in a chunk, while chinking defines the patterns to `exclude` from it.

In [1]:
import nltk
# Download resource punkt for tokenization
nltk.download('punkt')

# Required imports
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
# Before you can chunk, you need to make sure that the parts of speech in your text are tagged, so create a string for POS tagging.
# We'll reuse the quote used in chunking
quote = "It's a dangerous business, Frodo, going out your door."

In [3]:
# Tokenize the string by word
words_in_quote = word_tokenize(quote)
words_in_quote

['It',
 "'s",
 'a',
 'dangerous',
 'business',
 ',',
 'Frodo',
 ',',
 'going',
 'out',
 'your',
 'door',
 '.']

In [4]:
# Tag those words by part of speech
nltk.download("averaged_perceptron_tagger")
pos_tags = nltk.pos_tag(words_in_quote)
pos_tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('It', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('dangerous', 'JJ'),
 ('business', 'NN'),
 (',', ','),
 ('Frodo', 'NNP'),
 (',', ','),
 ('going', 'VBG'),
 ('out', 'RP'),
 ('your', 'PRP$'),
 ('door', 'NN'),
 ('.', '.')]

In [5]:
# Create a grammar to determine what we want to include and exclude in our chunks
grammar = """
Chunk: {<.*>+}
        }<JJ>{
"""

# The grammar spans more than one line, so it is written as a triple-quoted string (""").
# The first rule, {<.*>+}, uses curly braces to describe what to include in your chunks: here, one or more tokens with any tag.
# The second rule, }<JJ>{, uses reversed curly braces to describe what to exclude from your chunks: here, adjectives (JJ).
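The same two-rule shape works for other tags. As a minimal sketch, the grammar below chinks verbs instead of adjectives. The names `tagged_quote`, `verb_chink_grammar`, `verb_parser`, `verb_tree`, and `verb_chunks` are illustrative and not part of the original notebook, and the tags are typed in by hand from the tagger output shown earlier so that no model download is needed.

```python
import nltk

# Hand-tagged tokens (copied from the tagger output shown earlier),
# so this sketch does not depend on downloading a tagger model.
tagged_quote = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

# Rule 1 chunks every token; rule 2 chinks any tag that starts with VB.
verb_chink_grammar = r"""
Chunk: {<.*>+}
       }<VB.*>{
"""

verb_parser = nltk.RegexpParser(verb_chink_grammar)
verb_tree = verb_parser.parse(tagged_quote)

# Collect the words of every subtree labelled "Chunk", in sentence order.
verb_chunks = [
    [word for word, tag in subtree.leaves()]
    for subtree in verb_tree.subtrees(lambda t: t.label() == "Chunk")
]
print(verb_chunks)
```

With these tags, the two verbs (`'s` and `going`) are left outside the chunks, so three chunks remain.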

In [6]:
# Create a chunk parser with the grammar
chunk_parser = nltk.RegexpParser(grammar)

In [9]:
# Chunk the sentence with the chink
tree = chunk_parser.parse(pos_tags)

# tree.draw() opens a GUI window, so it fails in hosted notebooks that have no display
# tree.draw()
# pretty_print() prints the tree itself and returns None, so it should not be wrapped in print()
tree.pretty_print()

# In this case, ('dangerous', 'JJ') was excluded from the chunks because it’s an adjective (JJ).
# The first chunk has all the text that appeared before the excluded adjective.
# The second chunk contains everything after the excluded adjective.
# The text rendering draws dangerous/JJ at the far left, but in the parsed tree it sits between the two chunks.

                                                    S                                               
      ______________________________________________|_____________                                   
     |              Chunk                                       Chunk                               
     |          ______|_____          ____________________________|_______________________________   
dangerous/JJ It/PRP 's/VBZ a/DT business/NN ,/, Frodo/NNP ,/, going/VBG out/RP your/PRP$ door/NN ./.
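To make the result concrete, the effect of chinking `JJ` out of a chunk that initially covers the whole sentence can be imitated in plain Python. This is only a sketch of the outcome, not how `RegexpParser` is implemented. The helper `split_on_chinks` and the variable names are hypothetical, and the tags are copied by hand from the tagger output shown earlier.

```python
from typing import List, Set, Tuple

TaggedToken = Tuple[str, str]


def split_on_chinks(tagged: List[TaggedToken], chink_tags: Set[str]) -> List[List[TaggedToken]]:
    """Group consecutive tokens into chunks, leaving out tokens whose tag is chinked."""
    chunks: List[List[TaggedToken]] = []
    current: List[TaggedToken] = []
    for word, tag in tagged:
        if tag in chink_tags:
            # A chinked token closes the chunk that was being built (if any).
            if current:
                chunks.append(current)
                current = []
        else:
            current.append((word, tag))
    if current:
        chunks.append(current)
    return chunks


# Hand-copied (word, tag) pairs for the quote.
tagged_example = [
    ("It", "PRP"), ("'s", "VBZ"), ("a", "DT"), ("dangerous", "JJ"),
    ("business", "NN"), (",", ","), ("Frodo", "NNP"), (",", ","),
    ("going", "VBG"), ("out", "RP"), ("your", "PRP$"), ("door", "NN"),
    (".", "."),
]

example_chunks = split_on_chinks(tagged_example, {"JJ"})
for chunk in example_chunks:
    print([word for word, tag in chunk])
```

The two printed lists correspond to the two `Chunk` subtrees: the words before the adjective and the words after it.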
