### Part-Of-Speech Tagging And Chunking

#### Aim: To understand how machines analyze sentence structure using:
- Tokenization
- POS Tagging
- Chunking (Shallow Parsing)
- Constituency Grammar
- Dependency Parsing 



#### Theory:

Human language has structure. Words have roles:
- Nouns name things 
- Verbs show actions 
- Adjectives describe nouns 

A machine must identify these roles before it can interpret the meaning of a sentence.

#### Step 1: Input, Preprocessing and Tokenization

In [2]:
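import nltk
from nltk.tokenize import word_tokenize

# nltk.download('punkt_tab')  # tokenizer models, needed once (resource name assumed for recent NLTK versions)

text = "The rapid advancement of artificial intelligence is fundamentally transforming how humans interact with complex data and language processing systems."
print("Raw Text:", text)

# Normalize case so that "The" and "the" become the same token
text = text.lower()
print("\nClean Text:", text)

# Split the sentence into word and punctuation tokens
tokens = word_tokenize(text)
print("\nTokens", tokens)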

Raw Text: The rapid advancement of artificial intelligence is fundamentally transforming how humans interact with complex data and language processing systems.

Clean Text: the rapid advancement of artificial intelligence is fundamentally transforming how humans interact with complex data and language processing systems.

Tokens ['the', 'rapid', 'advancement', 'of', 'artificial', 'intelligence', 'is', 'fundamentally', 'transforming', 'how', 'humans', 'interact', 'with', 'complex', 'data', 'and', 'language', 'processing', 'systems', '.']
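
To see roughly what a tokenizer does, here is a minimal regex-based sketch. It is not how `word_tokenize` works internally (NLTK also handles contractions, abbreviations, and other cases); the function name `simple_tokenize` is hypothetical and exists only for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (i.e. a punctuation mark).
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("language processing systems."))
# ['language', 'processing', 'systems', '.']
```

Note that the final period becomes its own token, just as in the NLTK output above.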


#### Step 2: Part-Of-Speech Tagging 
What:
1. POS tagging assigns a grammatical role (part of speech) to each word.

Why: Meaning depends on the role a word plays:
1. "book" as a noun: "read a book"
2. "book" as a verb: "book a ticket"

How: A tagger combines:
1. Dictionary knowledge (which tags a word can take)
2. The context of neighboring words
3. Probabilistic models (HMM, CRF, neural networks)
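
The first two ingredients can be illustrated with a toy tagger. This is a minimal sketch, not how NLTK's tagger works: the `LEXICON` dictionary, the `toy_tag` function, and the single context rule are all hypothetical and cover only the "book" example.

```python
# Toy lexicon: each word maps to the tags it can take (tiny and hypothetical).
LEXICON = {
    "read": ["VB"],
    "a": ["DT"],
    "book": ["NN", "VB"],  # ambiguous: noun or verb
    "ticket": ["NN"],
}

def toy_tag(tokens):
    """Tag tokens using dictionary lookup plus one context rule."""
    tagged = []
    for i, word in enumerate(tokens):
        candidates = LEXICON.get(word, ["NN"])  # unknown words default to noun
        tag = candidates[0]
        if len(candidates) > 1:
            prev_tag = tagged[i - 1][1] if i > 0 else None
            # Context rule: after a determiner expect a noun;
            # otherwise (e.g. at the start of a sentence) prefer the verb.
            tag = "NN" if prev_tag == "DT" else "VB"
        tagged.append((word, tag))
    return tagged

print(toy_tag(["read", "a", "book"]))
# [('read', 'VB'), ('a', 'DT'), ('book', 'NN')]
print(toy_tag(["book", "a", "ticket"]))
# [('book', 'VB'), ('a', 'DT'), ('ticket', 'NN')]
```

A real tagger replaces the hand-written rule with probabilities learned from a tagged corpus.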

In [4]:
# nltk.download('averaged_perceptron_tagger_eng')  # tagger model, needed once
pos_tags = nltk.pos_tag(tokens)  # list of (word, tag) pairs

print("\nPos Tagged Words:")
for word, tag in pos_tags:
    print(word, "->", tag)


Pos Tagged Words:
the -> DT
rapid -> JJ
advancement -> NN
of -> IN
artificial -> JJ
intelligence -> NN
is -> VBZ
fundamentally -> RB
transforming -> VBG
how -> WRB
humans -> NNS
interact -> VBP
with -> IN
complex -> JJ
data -> NNS
and -> CC
language -> NN
processing -> NN
systems -> NNS
. -> .


##### Explanation:
1. Nouns:
    - NN (Noun, singular or mass): advancement, intelligence, language, processing
    - NNS (Noun, plural): humans, data, systems
2. Verbs:
    - VBZ (Verb, 3rd person singular present): is
    - VBG (Verb, gerund or present participle): transforming. These forms end in "-ing" and often describe ongoing actions.
    - VBP (Verb, non-3rd person singular present): interact
3. Modifiers:
    - JJ (Adjective): rapid, artificial, complex
    - RB (Adverb): fundamentally
4. Function-word tags:
    - DT (Determiner): the
    - IN (Preposition or subordinating conjunction): of, with
    - CC (Coordinating conjunction): and
    - WRB (Wh-adverb): how
    - "." (Sentence-final punctuation): the period at the end

##### Types of POS Taggers
Rule-Based:
- Uses hand-written grammar rules, e.g. if a word ends with "-tion", tag it as a noun

Transformation-Based:
- Starts with basic tags, then corrects them with context rules (e.g. the Brill tagger)

Statistical:
- Uses probabilities estimated from large tagged corpora (e.g. HMM, CRF)

Modern:
- Neural networks (e.g. spaCy's tagger, BERT-based taggers)
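
The first two tagger types can be sketched together. The suffix rules and the single correction rule below are hypothetical and greatly simplified; a real transformation-based tagger learns its correction rules from data rather than having them written by hand.

```python
def suffix_tag(word):
    """Rule-based guess from the word ending (simplified, hypothetical rules)."""
    if word.endswith(("tion", "ment")):
        return "NN"
    if word.endswith("ly"):
        return "RB"
    if word.endswith("ing"):
        return "VBG"
    return "NN"  # fallback when no rule fires

def apply_transformations(tagged):
    """Transformation-based step: patch the initial tags with a context rule."""
    fixed = list(tagged)
    for i in range(1, len(fixed)):
        word, tag = fixed[i]
        # Rule: change VBG to NN when the previous tag is NN
        # (e.g. "language processing" is a noun compound).
        if tag == "VBG" and fixed[i - 1][1] == "NN":
            fixed[i] = (word, "NN")
    return fixed

initial = [(w, suffix_tag(w)) for w in ["language", "processing", "systems"]]
print(initial)
# [('language', 'NN'), ('processing', 'VBG'), ('systems', 'NN')]
print(apply_transformations(initial))
# [('language', 'NN'), ('processing', 'NN'), ('systems', 'NN')]
```

The suffix rule alone mislabels "processing" as a verb form; the context rule corrects it to a noun, which matches the NN tag in the tagger output above.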



#### Step 3: Chunking (Shallow Parsing)


What:

Chunking groups adjacent words into non-overlapping phrases such as:
- Noun Phrase (NP)
- Verb Phrase (VP)

Why:
- Full parsing is comparatively slow
- Chunking gives a quick, flat phrase structure

Chunk Grammar:
- NP: a determiner, then any number of adjectives, then a singular noun

In [5]:
grammar = r"""
  NP: {<DT><JJ>*<NN>}   # Noun Phrase rule
"""

chunk_parser = nltk.RegexpParser(grammar)
chunk_tree = chunk_parser.parse(pos_tags)

print("\nCHUNK TREE:")
print(chunk_tree)


CHUNK TREE:
(S
  (NP the/DT rapid/JJ advancement/NN)
  of/IN
  artificial/JJ
  intelligence/NN
  is/VBZ
  fundamentally/RB
  transforming/VBG
  how/WRB
  humans/NNS
  interact/VBP
  with/IN
  complex/JJ
  data/NNS
  and/CC
  language/NN
  processing/NN
  systems/NNS
  ./.)
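
Only one NP is found because the rule requires a determiner and a singular noun (NN), so phrases such as "artificial intelligence" or "complex data" are left unchunked. The sketch below shows the idea behind regex chunking in plain Python with a broader pattern: an optional determiner, any adjectives, and one or more nouns of any kind. The function `np_chunks` is hypothetical; in NLTK the comparable rule would be written as `NP: {<DT>?<JJ>*<NN.*>+}`.

```python
import re

# (word, tag) pairs copied from the tagger output in Step 2
tagged = [
    ("the", "DT"), ("rapid", "JJ"), ("advancement", "NN"), ("of", "IN"),
    ("artificial", "JJ"), ("intelligence", "NN"), ("is", "VBZ"),
    ("fundamentally", "RB"), ("transforming", "VBG"), ("how", "WRB"),
    ("humans", "NNS"), ("interact", "VBP"), ("with", "IN"),
    ("complex", "JJ"), ("data", "NNS"), ("and", "CC"),
    ("language", "NN"), ("processing", "NN"), ("systems", "NNS"), (".", "."),
]

def np_chunks(tagged):
    """Return noun phrases matching: optional DT, any JJ, one or more NN*."""
    # Encode the tag sequence as one string, e.g. "<DT><JJ><NN><IN>..."
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    pattern = re.compile(r"(<DT>)?(<JJ>)*(<NN[^>]*>)+")
    chunks = []
    for match in pattern.finditer(tag_string):
        # Each "<" marks one token, so counting them maps the match
        # back to token positions.
        start = tag_string[:match.start()].count("<")
        end = start + match.group().count("<")
        chunks.append(" ".join(word for word, _ in tagged[start:end]))
    return chunks

print(np_chunks(tagged))
# ['the rapid advancement', 'artificial intelligence', 'humans',
#  'complex data', 'language processing systems']
```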


#### Step 4: Constituency Grammar
What:
- Represents a sentence as a hierarchy of nested phrases (constituents)

Sentence:
- The cat sat on the mat

Structure:
- [S [NP The cat] [VP sat [PP on [NP the mat]]]]

Chunking vs. full parsing:
- Chunking gives a partial, flat constituency structure
- Full parsing builds the complete nested tree

The NLTK chunk tree from Step 3 shows only one level of this hierarchy: NP chunks directly under S.
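
To make the nesting concrete, the bracketed structure above can be read into a nested Python list. This is a minimal sketch; the helper names `parse_brackets` and `leaves` are hypothetical, and the parser assumes well-formed input in exactly this bracket notation.

```python
def parse_brackets(s):
    """Parse '[S [NP The cat] ...]' into nested lists: [label, child, ...]."""
    tokens = s.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        pos += 1                 # skip "["
        node = [tokens[pos]]     # phrase label, e.g. "NP"
        pos += 1
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                node.append(parse_node())   # nested phrase
            else:
                node.append(tokens[pos])    # a word (leaf)
                pos += 1
        pos += 1                 # skip "]"
        return node

    return parse_node()

def leaves(node):
    """Collect the words of a tree from left to right."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words

tree = parse_brackets("[S [NP The cat] [VP sat [PP on [NP the mat]]]]")
print(tree)
# ['S', ['NP', 'The', 'cat'], ['VP', 'sat', ['PP', 'on', ['NP', 'the', 'mat']]]]
print(leaves(tree))
# ['The', 'cat', 'sat', 'on', 'the', 'mat']
```

The NP "the mat" sits inside a PP, which sits inside the VP. A flat chunker cannot express this kind of nesting.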


#### Step 5: Dependency Parsing 
What:

Shows the relationships between words:
- Which word is the subject?
- Which word is the object?

Why:
- Meaning depends on the relationships between words, not just their order

Example:
- "She enjoys music"
- enjoys -> ROOT
- She -> subject of enjoys
- music -> object of enjoys

For dependency parsing, we use spaCy.

In [5]:
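import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # text is the lowercased sentence from Step 1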

print("\nDependency Parsing:")
for token in doc:
    print(
        token.text,
        "-> head:", token.head.text,
        "| relation:", token.dep_
    )


Dependency Parsing:
the -> head: advancement | relation: det
rapid -> head: advancement | relation: amod
advancement -> head: transforming | relation: nsubj
of -> head: advancement | relation: prep
artificial -> head: intelligence | relation: amod
intelligence -> head: of | relation: pobj
is -> head: transforming | relation: aux
fundamentally -> head: transforming | relation: advmod
transforming -> head: transforming | relation: ROOT
how -> head: interact | relation: advmod
humans -> head: interact | relation: nsubj
interact -> head: transforming | relation: ccomp
with -> head: interact | relation: prep
complex -> head: data | relation: amod
data -> head: systems | relation: nmod
and -> head: data | relation: cc
language -> head: systems | relation: compound
processing -> head: systems | relation: compound
systems -> head: with | relation: pobj
. -> head: transforming | relation: punct


Explanation:
- token.text = the current word
- token.head = the word it depends on (the ROOT token is its own head)
- token.dep_ = the grammatical relation to the head

Example output for another sentence (not the one parsed above):
- fox -> head: jumps | relation: nsubj
- jumps -> head: jumps | relation: ROOT
- dog -> head: over | relation: pobj
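
Because every token has exactly one head and the ROOT is its own head, the parse forms a tree. The sketch below follows the head links from a word up to the root, using the head of each word copied from the spaCy output for the main sentence. The `heads` dictionary and the `path_to_root` function are hypothetical; keying by word works here only because no word repeats in the sentence, and real code would use token indices.

```python
# word -> head, copied from the dependency output for the main sentence
heads = {
    "the": "advancement", "rapid": "advancement",
    "advancement": "transforming", "of": "advancement",
    "artificial": "intelligence", "intelligence": "of",
    "is": "transforming", "fundamentally": "transforming",
    "transforming": "transforming",  # ROOT points to itself
    "how": "interact", "humans": "interact", "interact": "transforming",
    "with": "interact", "complex": "data", "data": "systems",
    "and": "data", "language": "systems", "processing": "systems",
    "systems": "with", ".": "transforming",
}

def path_to_root(word, heads):
    """Follow head links from a word until reaching the ROOT."""
    path = [word]
    while heads[path[-1]] != path[-1]:  # stop at the self-loop (ROOT)
        path.append(heads[path[-1]])
    return path

print(path_to_root("artificial", heads))
# ['artificial', 'intelligence', 'of', 'advancement', 'transforming']
print(path_to_root("systems", heads))
# ['systems', 'with', 'interact', 'transforming']
```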

#### Step 6: Theory of Dependency
Relations:
- nsubj = nominal subject
- dobj = direct object
- amod = adjectival modifier
- ROOT = root of the sentence, usually the main verb

This allows a machine to determine:
- Who did what to whom
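
As a rough illustration, the "She enjoys music" example from Step 5 can be turned into a (subject, verb, object) triple by pairing the `nsubj` and `dobj` relations that share a head. The rows are written by hand rather than produced by spaCy, and `who_did_what` is a hypothetical helper that assumes each verb appears only once in the sentence.

```python
# (word, head, relation) rows for "She enjoys music", written by hand
parsed = [
    ("She", "enjoys", "nsubj"),
    ("enjoys", "enjoys", "ROOT"),
    ("music", "enjoys", "dobj"),
]

def who_did_what(parsed):
    """Extract (subject, verb, object) triples from dependency rows."""
    subjects = {head: word for word, head, rel in parsed if rel == "nsubj"}
    objects = {head: word for word, head, rel in parsed if rel == "dobj"}
    # Keep only verbs that have both a subject and a direct object
    return [(subjects[verb], verb, objects[verb])
            for verb in subjects if verb in objects]

print(who_did_what(parsed))
# [('She', 'enjoys', 'music')]
```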

#### Step 7: Final Pipeline Summary
Raw sentence -> Cleaning -> Tokenization -> POS Tagging -> Chunking -> Constituency Structure -> Dependency Parsing

The pipeline converts:
- Plain text -> Structured grammar -> Meaning