---------------
#### Chunking and POS Tagging
- Break text into chunks (such as noun phrases) using part-of-speech (POS) tags, then filter the relevant chunks to serve as keyphrases.
---------------

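`POS Tagging`: Before chunking, we tag each word with its part of speech (POS) to capture the grammatical structure of the sentence and the role of each word. NLTK's default tagger, `nltk.pos_tag`, assigns Penn Treebank tags such as `DT` (determiner), `JJ` (adjective), and `NN` (noun).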
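
To make the role of the tags concrete, the sketch below maps Penn Treebank tags to coarse word classes by their prefix. The tagged sentence is written by hand for illustration rather than produced by the tagger, and `coarse_role` is a hypothetical helper, not part of NLTK.

```python
# Hand-written (word, tag) pairs in the format nltk.pos_tag returns.
# The tags are illustrative assumptions, not actual tagger output.
tagged_tokens = [("The", "DT"), ("old", "JJ"), ("bridges", "NNS"),
                 ("carry", "VBP"), ("heavy", "JJ"), ("traffic", "NN"), (".", ".")]

# Penn Treebank tag prefixes and the coarse word class each one marks.
PREFIX_ROLES = [("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"),
                ("RB", "adverb"), ("DT", "determiner")]

def coarse_role(tag):
    """Hypothetical helper: map a Penn Treebank tag to a coarse word class."""
    for prefix, role in PREFIX_ROLES:
        if tag.startswith(prefix):
            return role
    return "other"

roles = [(word, coarse_role(tag)) for word, tag in tagged_tokens]
print(roles)
```

Matching on prefixes is the same idea as the `<NN.*>` pattern in a chunk grammar: `NN`, `NNS`, `NNP`, and `NNPS` all count as nouns.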

`Chunking`: Using the POS tags, we define tag patterns that identify chunks in the text. For keyphrase extraction, noun phrase chunking is the common choice, because noun phrases usually name the entities and topics a text is about.
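
As a rough illustration of what a tag pattern does, the sketch below encodes a tag sequence as a string and runs an ordinary regular expression over it. This is a simplified stand-in written with the standard `re` module, not NLTK's actual implementation; the tagged tokens are hand-written assumptions and `find_noun_phrases` is a hypothetical helper.

```python
import re

# Hand-written (word, tag) pairs; the Penn Treebank tags are illustrative.
tagged = [("her", "PRP$"), ("old", "JJ"), ("bicycle", "NN"), ("needs", "VBZ"),
          ("a", "DT"), ("new", "JJ"), ("tire", "NN"), (".", ".")]

# Encode the tag sequence as one string, e.g. "<PRP$><JJ><NN>...".
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Optional determiner or possessive pronoun, any adjectives, one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>|<PRP\$>)?(?:<JJ>)*(?:<NN[^>]*>)+")

def find_noun_phrases(tagged, tag_string):
    """Return the words covered by each match of NP_PATTERN."""
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        # Each tag starts with "<", so counting "<" converts a character
        # offset in tag_string back into a token index.
        start = tag_string.count("<", 0, match.start())
        end = start + match.group().count("<")
        phrases.append(" ".join(word for word, _ in tagged[start:end]))
    return phrases

print(find_noun_phrases(tagged, tag_string))
# ['her old bicycle', 'a new tire']
```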

In [1]:
import nltk
from nltk.chunk import RegexpParser
from nltk.corpus import brown

In [2]:
nltk.download('brown')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\bhupe\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bhupe\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\bhupe\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [3]:
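# Define the chunk grammar.
# Here, a basic noun phrase grammar is used.
# nltk.pos_tag emits Penn Treebank tags, where possessive pronouns are PRP$
# (PP$ is the Brown tagset equivalent and would never match here).
grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN.*>+}   # chunk determiner/possessive, adjectives and noun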
"""

In [4]:
# Create the chunk parser
chunk_parser = RegexpParser(grammar)

In [5]:
# Sample text
text = "The quick brown fox jumps over the lazy dog."

In [6]:
# Tokenize and POS tag the text
tokens = nltk.word_tokenize(text)

In [7]:
tagged_tokens = nltk.pos_tag(tokens)

In [8]:
# Parse the tagged tokens using the chunk parser
tree = chunk_parser.parse(tagged_tokens)

In [9]:
# Extract the noun phrases as keyphrases
keyphrases = [' '.join(leaf[0] for leaf in subtree.leaves())
              for subtree in tree.subtrees(lambda t: t.label() == 'NP')]

print(keyphrases)

['The quick brown fox', 'the lazy dog']


- `RegexpParser` builds a chunk parser from the provided grammar.

- The grammar specifies that a noun phrase (NP) is an optional determiner (DT) or possessive pronoun, followed by any number of adjectives (JJ) and then one or more nouns (NN.*, which also covers plural and proper nouns).

- After parsing the POS-tagged tokens, we extract the chunks labeled NP (noun phrases) and treat them as keyphrases.

- For the sample sentence, this yields two keyphrases: ['The quick brown fox', 'the lazy dog'].
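
The extracted chunks still carry determiners and mixed casing, so a light filtering pass is often useful before treating them as keyphrases. The following is a minimal sketch of one possible post-processing step; `normalize_keyphrases` and the `DETERMINERS` set are hypothetical names chosen for this example, not part of NLTK.

```python
# Small, hand-picked set of leading words to strip (an assumption for this sketch).
DETERMINERS = {"the", "a", "an", "this", "that", "these", "those"}

def normalize_keyphrases(phrases):
    """Lowercase phrases, drop a leading determiner, and remove duplicates."""
    seen = set()
    result = []
    for phrase in phrases:
        words = phrase.lower().split()
        # Drop a single leading determiner, if present.
        if words and words[0] in DETERMINERS:
            words = words[1:]
        cleaned = " ".join(words)
        # Keep the first occurrence of each non-empty phrase, preserving order.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result

print(normalize_keyphrases(['The quick brown fox', 'the lazy dog']))
# ['quick brown fox', 'lazy dog']
```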

#### Example: the Brown corpus
- Let's use a section of the Brown corpus, specifically the news category, to demonstrate keyphrase extraction through chunking and POS tagging.

In [15]:
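# Define chunk grammar for noun phrases and verb phrases
# Rules run in order, so by the time the VP rule runs every <NN> already sits
# inside an NP chunk; the VP rule therefore has to match <NP>, not the raw tags.
grammar = r"""
     NP: {<DT>?<JJ>*<NN>}   # Chunk sequences of DT, JJ, and NN tags as NP (noun phrases)
     VP: {<VB.*><NP>}       # Chunk a verb (VB.*) followed by an NP chunk as VP (verb phrases)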
"""

In [16]:
# Create the chunk parser
chunk_parser = RegexpParser(grammar)
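
When a grammar has several rules, `RegexpParser` applies them as successive stages, so a later rule sees the chunks built by earlier rules as single elements. The sketch below shows this on a hand-tagged sentence (the tags are illustrative assumptions, not tagger output), which avoids downloading any NLTK data.

```python
from nltk.chunk import RegexpParser

# Two-stage grammar: NP chunks are built first, then VP wraps a verb plus an NP.
cascade_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}   # optional determiner, adjectives, singular noun
    VP: {<VB.*><NP>}       # verb followed by an already-built NP chunk
"""
cascade_parser = RegexpParser(cascade_grammar)

# Hand-written Penn Treebank tags for a short sentence.
tagged_sentence = [("The", "DT"), ("committee", "NN"), ("approved", "VBD"),
                   ("the", "DT"), ("new", "JJ"), ("budget", "NN"), (".", ".")]

cascade_tree = cascade_parser.parse(tagged_sentence)

# Collect the words under each VP chunk; leaves() flattens the nested NP.
verb_phrases = [" ".join(word for word, _ in subtree.leaves())
                for subtree in cascade_tree.subtrees(lambda t: t.label() == "VP")]
print(verb_phrases)
# ['approved the new budget']
```

Note that the object NP is nested inside the VP, so filtering subtrees by the `NP` label still returns it alongside the subject NP.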

In [17]:
# Extract a sample from the Brown corpus
text = " ".join(brown.words(categories='news')[:1500])


In [18]:
# Tokenize and POS tag the text
tokens = nltk.word_tokenize(text)
tagged_tokens = nltk.pos_tag(tokens)

In [19]:
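# Parse the tagged tokens with the new chunk parser
# (without this step, `tree` still holds the parse of the earlier sample sentence)
tree = chunk_parser.parse(tagged_tokens)

# Extract noun phrases as keyphrases
keyphrases = [' '.join(leaf[0] for leaf in subtree.leaves())
              for subtree in tree.subtrees(lambda t: t.label() == 'NP')]

print(keyphrases)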
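
On a longer text the list of noun phrases grows quickly, so some ranking is needed to pick out the key ones. One simple option, sketched below, is to count how often each multi-word phrase occurs. `top_keyphrases` is a hypothetical helper, and the sample phrases are made up for illustration; they are not output from the Brown corpus.

```python
from collections import Counter

def top_keyphrases(phrases, n=5, min_words=2):
    """Return the n most frequent phrases that have at least min_words words."""
    counts = Counter(phrase.lower() for phrase in phrases
                     if len(phrase.split()) >= min_words)
    return counts.most_common(n)

# Made-up noun phrases standing in for a real extraction result.
sample_phrases = ["the jury", "The jury", "the election", "a report",
                  "the jury", "the election", "report"]

print(top_keyphrases(sample_phrases, n=2))
# [('the jury', 3), ('the election', 2)]
```

Raw frequency favors generic phrases, so in practice this is usually combined with determiner stripping or a stopword filter.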
