## Experiment 10
## Build a Chunker

A chunker, in the context of natural language processing (NLP), is a program or algorithm that identifies and segments syntactic units (chunks) in a sentence. These chunks typically consist of words that form a meaningful unit, such as noun phrases, verb phrases, or prepositional phrases.

The process of chunking involves dividing a sentence into chunks based on the grammatical structure and relationships between words. It's an intermediate step between part-of-speech tagging and full parsing. While part-of-speech tagging assigns a part-of-speech label to each word in a sentence, chunking goes a step further by grouping words into meaningful units.

Here's an example to illustrate chunking:

**Sentence:** "The black cat sat on the windowsill."

**Part-of-Speech Tagging:**
```
The/DT black/JJ cat/NN sat/VBD on/IN the/DT windowsill/NN ./.
```

**Chunking Result:**
```
[The black cat] [sat] [on the windowsill].
```

In this example, the chunker identifies and groups the words into three chunks: a noun phrase ("The black cat"), a verb ("sat"), and a prepositional phrase ("on the windowsill").
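To make the grouping step concrete, here is a minimal sketch of a flat chunker in plain Python. It assumes the tokens are already tagged as in the example. The `RULES` table and the `chunk` function are hypothetical names invented for this illustration and are not part of any library.

```python
import re

# Hand-tagged tokens from the example above (no tagger is run here).
tagged = [("The", "DT"), ("black", "JJ"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("windowsill", "NN"), (".", ".")]

# Hypothetical flat chunk rules, written as regexes over "<TAG>" strings.
# Order matters: PP is tried before NP so "on" stays with its noun phrase.
RULES = [
    ("PP", re.compile(r"<IN>(<DT>)?(<JJ>)*<NN>")),
    ("NP", re.compile(r"(<DT>)?(<JJ>)*<NN>")),
    ("V", re.compile(r"<VB[A-Z]?>")),
]

def chunk(tagged_tokens):
    """Return (label, words) pairs; label is None for tokens outside any chunk."""
    tags = [f"<{tag}>" for _, tag in tagged_tokens]
    chunks, i = [], 0
    while i < len(tags):
        rest = "".join(tags[i:])
        for label, pattern in RULES:
            m = pattern.match(rest)
            if m:
                n = m.group(0).count("<")  # number of tags the match consumed
                break
        else:
            label, n = None, 1             # no rule matched at this position
        chunks.append((label, [word for word, _ in tagged_tokens[i:i + n]]))
        i += n
    return chunks

result = chunk(tagged)
for label, words in result:
    print(label, words)
```

Trying `PP` before `NP` is a design choice of this sketch: it keeps the preposition attached to the noun phrase that follows it, which reproduces the three chunks above and leaves the final period outside any chunk.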

Chunking, also called shallow parsing, is used in NLP applications such as information extraction and named entity recognition. It can be implemented with regular expressions over POS tags, hand-written rule systems, or machine learning models.

**Chunking in Natural Language Processing (NLP) - Explanation with Output:**

```python
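import nltk
from nltk.chunk import RegexpParser
from nltk.tokenize import word_tokenize

# Download tokenizer and tagger data (needed once; newer NLTK releases may
# instead ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')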

# Sample sentence
sentence = "The cat sat on the mat."

# Tokenize the sentence
words = word_tokenize(sentence)

# Perform POS tagging
pos_tags = nltk.pos_tag(words)

# Define chunking patterns using regular expressions
chunking_patterns = r"""
    NP: {<DT>?<JJ>*<NN>}   # chunk optional determiner, adjectives, and noun
    PP: {<IN><NP>}         # chunk prepositions followed by NP
    VP: {<VB.*><NP|PP>*}   # chunk verbs and their arguments
"""

# Create a chunk parser using the defined patterns
chunk_parser = RegexpParser(chunking_patterns)

# Apply chunking to the POS-tagged words
chunks = chunk_parser.parse(pos_tags)

# Display the chunks
print(chunks)
```

**Output:**
```
(S
  (NP The/DT cat/NN)
  (VP sat/VBD (PP on/IN (NP the/DT mat/NN)))
  ./.)
```

In the output, the sentence is parsed into a tree structure, where:
- **NP (Noun Phrase):** "The cat", plus "the mat" nested inside the PP
- **VP (Verb Phrase):** "sat on the mat"
- **PP (Prepositional Phrase):** "on the mat"

Each phrase is structured hierarchically, providing a syntactic representation of the sentence's components based on the defined chunking patterns.
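The hierarchy can also be read programmatically. The sketch below assumes NLTK is installed and uses hand-tagged tokens in place of `nltk.pos_tag`, so no tokenizer or tagger data is needed. It walks the tree with `Tree.subtrees()` and collects each chunk's label and words.

```python
from nltk.chunk import RegexpParser

# Hand-tagged tokens (an assumption standing in for the tagger's output).
tagged = [("The", "DT"), ("cat", "NN"), ("sat", "VBD"),
          ("on", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")]

# Same three-rule grammar as in the example above.
grammar = r"""
    NP: {<DT>?<JJ>*<NN>}
    PP: {<IN><NP>}
    VP: {<VB.*><NP|PP>*}
"""
tree = RegexpParser(grammar).parse(tagged)

# Collect (label, phrase) for every chunk below the root S node.
phrases = [
    (sub.label(), " ".join(word for word, tag in sub.leaves()))
    for sub in tree.subtrees(lambda t: t.label() != "S")
]
for label, phrase in phrases:
    print(f"{label}: {phrase}")
```

Because `subtrees()` visits nested nodes as well, the inner noun phrase "the mat" is listed separately from the PP and VP that contain it.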

## Explanation of Code

1. **Import Libraries and Download Data:**
   - Import necessary libraries including NLTK modules and download required data.

2. **Load and Train POS Tagger:**
   - Load the tagged sentences from NLTK's Penn Treebank sample and keep the first 3,000 for training.
   - Train a unigram tagger on them, using a default tagger that assigns `NN` to unknown words as its backoff.

3. **Tokenize and Tag Sample Sentence:**
   - Tokenize the sample sentence into words.
   - Use the trained POS tagger to tag the words in the sample sentence.

4. **Define Chunk Grammar:**
   - Define a regular expression-based chunk grammar.
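   - NP (Noun Phrase): An optional determiner (DT), any number of adjectives (JJ), and a single noun (NN).
   - VP (Verb Phrase): A verb (any tag starting with VB) followed by any number of noun phrases (NP) or prepositional phrases (PP). This grammar defines no PP rule, so only NP chunks can actually follow the verb.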

5. **Create Chunk Parser:**
   - Create a chunk parser using the defined regular expression-based grammar.

6. **Apply Chunk Parser:**
   - Apply the chunk parser to the POS-tagged sentence, creating a tree structure representing the chunks.

7. **Visualize and Print Tree Structure:**
   - Optionally, visualize the resulting tree structure.
   - Print the tree structure using the `pretty_print()` method.

This code demonstrates chunking with a regular expression grammar that targets noun phrases (NP) and verb phrases (VP). The resulting tree shows the chunks found in the sample sentence. Which chunks appear depends on the tags: in the run shown below, the tagger labels `jumps` as NN rather than as a verb, so the tree contains only NP chunks and no VP.
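The tagging scheme from step 2 can be approximated in a few lines of plain Python. This is a simplified sketch, not NLTK's implementation: `train_unigram` and `tag` are hypothetical helpers, and the tiny training set is invented so that several words of the sample sentence are unseen.

```python
from collections import Counter, defaultdict

# Tiny hand-made training set (hypothetical; the notebook trains on treebank).
train = [
    [("The", "DT"), ("dog", "NN"), ("sat", "VBD"), (".", ".")],
    [("A", "DT"), ("quick", "JJ"), ("cat", "NN"), ("ran", "VBD"),
     ("over", "IN"), ("the", "DT"), ("mat", "NN"), (".", ".")],
]

def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag(tokens, model, default="NN"):
    """Look each token up in the model; unseen tokens get the default tag."""
    return [(tok, model.get(tok, default)) for tok in tokens]

model = train_unigram(train)
tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "."]
tagged = tag(tokens, model)
print(tagged)
```

In this sketch `brown`, `fox`, `jumps`, and `lazy` never occur in the training data, so all four fall through to the default `NN`. A verb tagged this way can never start a VP chunk, whatever the grammar says.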

```python
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
from nltk.corpus import treebank

# Download the NLTK data (you only need to do this once)
nltk.download('punkt')
nltk.download('treebank')

# Train a POS tagger on the first 3000 tagged sentences of the treebank sample
tagged_sentences = treebank.tagged_sents()
train_data = tagged_sentences[:3000]
pos_tagger = nltk.DefaultTagger('NN')  # Default tagger for unknown words
pos_tagger = nltk.UnigramTagger(train_data, backoff=pos_tagger)

# Define a sample sentence
sample_sentence = "The quick brown fox jumps over the lazy dog."

# Tokenize and tag the sample sentence
tokens = word_tokenize(sample_sentence)
pos_tags = pos_tagger.tag(tokens)

# Define a regular expression-based chunk grammar
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}   # Chunk sequences of DT, JJ, and NN
    VP: {<VB.*><NP|PP>*}   # Chunk verbs and their arguments
"""

# Create a chunk parser using the regular expression-based grammar
chunk_parser = RegexpParser(chunk_grammar)

# Apply the chunk parser to the POS-tagged sentence
tree = chunk_parser.parse(pos_tags)

# Visualize the resulting tree (optional; opens a Tk window, needs a display)
# tree.draw()

# Print the tree structure
tree.pretty_print()
```

**Output:**
```
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!

                               S
    ___________________________|__________________________________________
   |     |            NP               NP      NP            NP           NP
   |     |     _______|________        |       |        _____|_____       |
over/IN ./. The/DT quick/JJ brown/NN fox/NN jumps/NN the/DT     lazy/NN dog/NN
```
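
For comparison, the sketch below bypasses the trained tagger and feeds hand-corrected tags to the same two-rule grammar. It assumes NLTK is installed. The tags are an assumption about the intended analysis of the sentence, not the output of any tagger.

```python
from nltk import RegexpParser

# Hand-corrected tags (assumed, not produced by a tagger).
tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"), ("the", "DT"), ("lazy", "JJ"),
          ("dog", "NN"), (".", ".")]

# Same two-rule grammar as in the cell above.
chunk_grammar = r"""
    NP: {<DT>?<JJ>*<NN>}   # Chunk sequences of DT, JJ, and NN
    VP: {<VB.*><NP|PP>*}   # Chunk verbs and their arguments
"""
tree = RegexpParser(chunk_grammar).parse(tagged)
print(tree)

# Labels of all nodes in pre-order, and the tokens inside the VP chunk.
labels = [sub.label() for sub in tree.subtrees()]
vp_leaves = [sub.leaves() for sub in tree.subtrees() if sub.label() == "VP"]
```

With these tags the parser should find two NP chunks and one VP chunk. The VP contains only `jumps`, because the grammar has no rule that builds a PP around `over`, so nothing matching `<NP|PP>` directly follows the verb.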

