<a href="https://colab.research.google.com/github/ihabiba/NLP-Labs/blob/main/Phrases_Clauses_Syntax.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Phrase Chunking with NLTK

In [1]:
import nltk
from nltk import word_tokenize, pos_tag
from nltk.chunk import RegexpParser

# Download required NLTK resources (run once – if they’re already downloaded, this will just skip)
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


True

In [2]:
# Example sentences to test phrase chunking
sentences = [
    "The quick brown fox jumps over the lazy dog.",
    "A beautiful butterfly landed on the colorful flower.",
    "The experienced teacher explained the complex concept clearly."
]

# Define grammar for chunking
# NP: Noun Phrase
# PP: Prepositional Phrase
# VP: Verb Phrase
grammar = r"""
    NP: {<DT|PRP\$>?<JJ>*<NN.*>+}
    PP: {<IN><NP>}
    VP: {<VB.*><NP|PP|CLAUSE>+}
"""

# Create a chunk parser based on the grammar above
cp = RegexpParser(grammar)

In [3]:
def demo_phrase_chunking():
    """
    Demonstrates basic noun phrase chunking using regular expressions.
    """

    for sentence in sentences:
        print("\nSentence:", sentence)

        # 1. Tokenize the sentence into words
        tokens = word_tokenize(sentence)

        # 2. Tag each token with its Part-of-Speech (POS) label
        pos_tags = pos_tag(tokens)
        print("POS Tags:", pos_tags)

        # 3. Parse the POS-tagged sentence into a chunk tree
        tree = cp.parse(pos_tags)
        print("Parse Tree:")
        print(tree)

        # 4. Extract noun phrases (NP) from the tree
        noun_phrases = []
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                np_text = ' '.join(word for word, tag in subtree.leaves())
                noun_phrases.append(np_text)

        print("Extracted Noun Phrases:", noun_phrases)

# Run demo
demo_phrase_chunking()


Sentence: The quick brown fox jumps over the lazy dog.
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Parse Tree:
(S
  (NP The/DT quick/JJ brown/NN fox/NN)
  (VP jumps/VBZ (PP over/IN (NP the/DT lazy/JJ dog/NN)))
  ./.)
Extracted Noun Phrases: ['The quick brown fox', 'the lazy dog']

Sentence: A beautiful butterfly landed on the colorful flower.
POS Tags: [('A', 'DT'), ('beautiful', 'JJ'), ('butterfly', 'NN'), ('landed', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('colorful', 'JJ'), ('flower', 'NN'), ('.', '.')]
Parse Tree:
(S
  (NP A/DT beautiful/JJ butterfly/NN)
  (VP landed/VBD (PP on/IN (NP the/DT colorful/JJ flower/NN)))
  ./.)
Extracted Noun Phrases: ['A beautiful butterfly', 'the colorful flower']

Sentence: The experienced teacher explained the complex concept clearly.
POS Tags: [('The', 'DT'), ('experienced', 'JJ'), ('teacher', 'NN'), ('explained', 'VBD'), ('the',

### **Justify why chunking is important in an NLP pipeline**
#### Chunking groups POS-tagged words into phrases such as NP and VP, which gives the sentence a shallow structure. Downstream tasks such as named entity recognition and information extraction can then work on phrase-level units (for example, "the lazy dog") instead of isolated words.

### **Compare the chunking and tokenization processes**
#### Tokenization splits raw text into tokens (words and punctuation). Chunking runs later, on POS-tagged tokens, and groups them into larger units such as noun phrases. In short, tokenization breaks text apart and chunking groups the pieces back together.
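
To make the contrast concrete, here is a minimal sketch that shows both views of one sentence. It reuses the POS tags printed in the notebook output above, so no tagger download is needed. The single-rule `np_chunker` grammar and the variable names are illustrative choices, not part of the original lab code.

```python
from nltk.chunk import RegexpParser
from nltk.tree import Tree

# POS-tagged tokens copied from the notebook output above
tagged = [('A', 'DT'), ('beautiful', 'JJ'), ('butterfly', 'NN'), ('landed', 'VBD'),
          ('on', 'IN'), ('the', 'DT'), ('colorful', 'JJ'), ('flower', 'NN'), ('.', '.')]

# Tokenization view: a flat list of words and punctuation
tokens = [word for word, tag in tagged]

# Chunking view: the same tokens, with noun phrases grouped into single units
np_chunker = RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
tree = np_chunker.parse(tagged)
chunks = [
    ' '.join(word for word, tag in node.leaves()) if isinstance(node, Tree) else node[0]
    for node in tree
]

print("Tokens:", tokens)
print("Chunks:", chunks)
# Chunks: ['A beautiful butterfly', 'landed', 'on', 'the colorful flower', '.']
```

Nine tokens collapse into five units, two of which are multi-word noun phrases.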


### **Below is the modified version of the code that extracts and displays Noun Phrases (NP), Verb Phrases (VP), and Pronoun Phrases (PRONP)**

In [5]:
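# Rebuild the parser with a PRONP rule (personal pronouns, tagged PRP)
# so that pronoun phrases can actually be chunked and extracted
grammar = r"""
    PRONP: {<PRP>}
    NP: {<DT|PRP\$>?<JJ>*<NN.*>+}
    PP: {<IN><NP|PRONP>}
    VP: {<VB.*><NP|PP|PRONP|CLAUSE>+}
"""
cp = RegexpParser(grammar)

def demo_phrase_chunking():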
    """
    Demonstrates chunking of Noun Phrases, Verb Phrases, and Pronoun Phrases.
    """
    for sentence in sentences:
        print("\nSentence:", sentence)

        # Tokenize and POS tag
        tokens = word_tokenize(sentence)
        pos_tags = pos_tag(tokens)
        print("POS Tags:", pos_tags)

        # Parse into a chunk tree
        tree = cp.parse(pos_tags)
        print("Parse Tree:")
        print(tree)

        # Lists for each type of phrase
        noun_phrases = []
        verb_phrases = []
        pronoun_phrases = []

        # Extract phrases by label
        for subtree in tree.subtrees():
            label = subtree.label()
            phrase_text = ' '.join(word for word, tag in subtree.leaves())

            if label == 'NP':
                noun_phrases.append(phrase_text)
            elif label == 'VP':
                verb_phrases.append(phrase_text)
            elif label == 'PRONP':
                pronoun_phrases.append(phrase_text)

        print("Extracted Noun Phrases (NP):", noun_phrases)
        print("Extracted Verb Phrases (VP):", verb_phrases)
        print("Extracted Pronoun Phrases (PRONP):", pronoun_phrases)

# Run the demo
demo_phrase_chunking()



Sentence: The quick brown fox jumps over the lazy dog.
POS Tags: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
Parse Tree:
(S
  (NP The/DT quick/JJ brown/NN fox/NN)
  (VP jumps/VBZ (PP over/IN (NP the/DT lazy/JJ dog/NN)))
  ./.)
Extracted Noun Phrases (NP): ['The quick brown fox', 'the lazy dog']
Extracted Verb Phrases (VP): ['jumps over the lazy dog']
Extracted Pronoun Phrases (PRONP): []

Sentence: A beautiful butterfly landed on the colorful flower.
POS Tags: [('A', 'DT'), ('beautiful', 'JJ'), ('butterfly', 'NN'), ('landed', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('colorful', 'JJ'), ('flower', 'NN'), ('.', '.')]
Parse Tree:
(S
  (NP A/DT beautiful/JJ butterfly/NN)
  (VP landed/VBD (PP on/IN (NP the/DT colorful/JJ flower/NN)))
  ./.)
Extracted Noun Phrases (NP): ['A beautiful butterfly', 'the colorful flower']
Extracted Verb Phrases (VP): ['landed on the colorful flower']
Ext

# Dependent and Independent Clauses with spaCy

In [9]:
import spacy

# Load English model
nlp = spacy.load("en_core_web_sm")


In [10]:
def identify_clauses(text):
    """
    Identify independent and dependent clauses in a sentence.
    """
    doc = nlp(text)

    print(f"\nSentence: {text}")
    print("="*60)

    # Identify the root verb (forms the main/independent clause)
    root = [token for token in doc if token.dep_ == "ROOT"][0]

    print("\nINDEPENDENT CLAUSE:")
    print("  Main verb:", root.text)

    independent_words = [root.text]

    # Get subject and object of the main clause
    for child in root.children:
        if child.dep_ in ["nsubj", "nsubjpass"]:
            print("  Subject:", child.text)
            independent_words.insert(0, child.text)
        elif child.dep_ in ["dobj", "attr"]:
            print("  Object:", child.text)
            independent_words.append(child.text)

    print("  Clause:", " ".join(independent_words))

    # Dependent clauses
    print("\nDEPENDENT CLAUSES:")
    dependent_labels = {
        'advcl': 'Adverbial Clause',
        'ccomp': 'Complement Clause',
        'xcomp': 'Complement Clause',
        'relcl': 'Relative Clause',
        'acl': 'Clausal Modifier'
    }

    found = False

    for token in doc:
        if token.dep_ in dependent_labels:
            found = True
            clause_words = [t.text for t in token.subtree]

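            # Check only the clause verb's direct children, so a marker that
            # belongs to a nested clause is not reported for this clause
            marker = ""
            for t in token.children:
                if t.dep_ == "mark":
                    marker = t.text
                    break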

            print(f"\n  Type: {dependent_labels[token.dep_]}")
            if marker:
                print("  Marker:", marker)
            print("  Verb:", token.text)
            print("  Clause:", " ".join(clause_words))

    if not found:
        print("  None found (simple sentence)")


In [11]:
sentences = [
    "The dog barks.",
    "I stayed home because it was raining.",
    "I believe that she is right.",
    "The student who studied hard passed the exam.",
    "She said that she would come when she finished her work.",
    "Although it was difficult, we completed the project."
]

for s in sentences:
    identify_clauses(s)
    print("-" * 60)



Sentence: The dog barks.

INDEPENDENT CLAUSE:
  Main verb: barks
  Subject: dog
  Clause: dog barks

DEPENDENT CLAUSES:
  None found (simple sentence)
------------------------------------------------------------

Sentence: I stayed home because it was raining.

INDEPENDENT CLAUSE:
  Main verb: stayed
  Subject: I
  Clause: I stayed

DEPENDENT CLAUSES:

  Type: Adverbial Clause
  Marker: because
  Verb: raining
  Clause: because it was raining
------------------------------------------------------------

Sentence: I believe that she is right.

INDEPENDENT CLAUSE:
  Main verb: believe
  Subject: I
  Clause: I believe

DEPENDENT CLAUSES:

  Type: Complement Clause
  Marker: that
  Verb: is
  Clause: that she is right
------------------------------------------------------------

Sentence: The student who studied hard passed the exam.

INDEPENDENT CLAUSE:
  Main verb: passed
  Subject: student
  Object: exam
  Clause: student passed exam

DEPENDENT CLAUSES:

  Type: Relative Clause
  Verb:

### **Justify why breaking sentences into clauses is important in an NLP pipeline**
#### Breaking a sentence into clauses exposes the internal structure of complex sentences. Each clause expresses its own idea, with its own subject and verb, so identifying clauses helps tasks such as parsing, relation extraction, summarization, and meaning interpretation handle each idea separately.

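### **What is the difference between a phrase and a clause?**
#### A phrase is a group of words that acts as a single unit but has no subject-verb pair, so it cannot stand alone (for example, "over the lazy dog"). A clause has both a subject and a verb. It is independent if it can stand alone as a sentence and dependent if it cannot.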
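
The subject-plus-verb test can be sketched in a few lines. The triples below are hand-annotated in the style of spaCy's `token.pos_` and `token.dep_` attributes. They are assumptions made for illustration, not parser output, and `is_clause` is a hypothetical helper rather than a spaCy function.

```python
# Dependency labels that mark a subject, and coarse POS tags that mark a verb
SUBJECT_DEPS = {"nsubj", "nsubjpass"}
VERB_POS = {"VERB", "AUX"}

def is_clause(span):
    """Return True if the span contains both a subject and a verb."""
    has_subject = any(dep in SUBJECT_DEPS for _, _, dep in span)
    has_verb = any(pos in VERB_POS for _, pos, _ in span)
    return has_subject and has_verb

# Hand-annotated (word, coarse POS, dependency label) triples
phrase = [("over", "ADP", "prep"), ("the", "DET", "det"),
          ("lazy", "ADJ", "amod"), ("dog", "NOUN", "pobj")]
clause = [("because", "SCONJ", "mark"), ("it", "PRON", "nsubj"),
          ("was", "AUX", "aux"), ("raining", "VERB", "advcl")]

print(is_clause(phrase))  # False: neither a subject nor a verb
print(is_clause(clause))  # True: subject "it" and verb "was raining"
```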

### **Identify the clauses that exist in the following sentence:**
### “The man who lives next door said that he would help us if we needed anything.”
#### Independent clause: **The man said**  
#### Dependent clause (relative): **who lives next door**  
#### Dependent clause (noun/complement): **that he would help us**  
#### Dependent clause (conditional): **if we needed anything**


# Hierarchical Syntax Tree with NLTK

In [6]:
import nltk
from nltk import pos_tag, word_tokenize, RegexpParser

# Download NLTK resources (run once)
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [7]:
# Example text
sample_text = "The quick brown fox jumps over the lazy dog"

# Tokenize and POS tag
tagged = pos_tag(word_tokenize(sample_text))

# Define chunk patterns
chunker = RegexpParser("""
    NP: {<DT>?<JJ>*<NN.*>+}   # Noun Phrases
    P: {<IN>}                 # Prepositions
    V: {<VB.*>}               # Verbs
    PP: {<P><NP>}             # Prepositional Phrases
    VP: {<V><NP|PP>*}         # Verb Phrases
""")


In [8]:
# Parse and extract phrases
output = chunker.parse(tagged)

print("POS Tags:")
for word, tag in tagged:
    print(f"{word:10} -> {tag}")

print("\nParsed Output:")
print(output)

print("\nTree Structure:")
output.pretty_print()

print("\nExtracted Phrases:")
for subtree in output.subtrees():
    if subtree.label() != 'S':
        phrase_text = ' '.join(word for word, tag in subtree.leaves())
        print(f"{subtree.label()}: {phrase_text}")


POS Tags:
The        -> DT
quick      -> JJ
brown      -> NN
fox        -> NN
jumps      -> VBZ
over       -> IN
the        -> DT
lazy       -> JJ
dog        -> NN

Parsed Output:
(S
  (NP The/DT quick/JJ brown/NN fox/NN)
  (VP (V jumps/VBZ) (PP (P over/IN) (NP the/DT lazy/JJ dog/NN))))

Tree Structure:
                                    S                                      
           _________________________|_______________                        
          |                                         VP                     
          |                          _______________|_____                  
          |                         |                     PP               
          |                         |         ____________|_____            
          NP                        V        P                  NP         
   _______|________________         |        |       ___________|______     
The/DT quick/JJ brown/NN fox/NN jumps/VBZ over/IN the/DT     lazy/JJ dog/NN


Extra

### **Discuss the difference between constituency and dependency trees.**
#### A constituency tree shows how words group into nested phrases (NP, VP, PP), so its internal nodes are phrase labels. A dependency tree has no phrase nodes. It links each word directly to its head with a labeled relation such as `nsubj` or `det`. Constituency captures phrase structure, while dependency captures grammatical relations between words.
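
As a rough illustration, the same sentence can be written down in both forms. The constituency tree is the bracketed chunk tree printed by the notebook above. The dependency triples are hand-annotated with spaCy-style labels and should be read as an assumption, not as actual parser output.

```python
from nltk.tree import Tree

# Constituency view: the bracketed chunk tree from the notebook output
constituency = Tree.fromstring(
    "(S (NP The/DT quick/JJ brown/NN fox/NN) "
    "(VP (V jumps/VBZ) (PP (P over/IN) (NP the/DT lazy/JJ dog/NN))))"
)

# Dependency view: one (dependent, relation, head) triple per word
# (hand-annotated for illustration; the root points at itself)
dependencies = [
    ("The", "det", "fox"), ("quick", "amod", "fox"), ("brown", "amod", "fox"),
    ("fox", "nsubj", "jumps"), ("jumps", "ROOT", "jumps"), ("over", "prep", "jumps"),
    ("the", "det", "dog"), ("lazy", "amod", "dog"), ("dog", "pobj", "over"),
]

# Constituency: internal nodes are phrases that span several words
phrase_labels = [subtree.label() for subtree in constituency.subtrees()]
print("Phrase nodes:", phrase_labels)

# Dependency: no phrase nodes, only labeled word-to-word links
for dependent, relation, head in dependencies:
    print(f"{dependent:6} --{relation}--> {head}")
```

The constituency tree adds seven phrase nodes on top of the nine words, while the dependency analysis has exactly one link per word and nothing else.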

### **What information can we derive from analysing a sentence using a hierarchical syntax tree?**
#### It shows the sentence’s full structure: subjects, verbs, objects, modifiers, phrase boundaries, and how each part connects. This helps reveal roles, relationships, and the internal organization of the sentence.
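
A short sketch of how some of this information can be read off the tree programmatically, using the chunk tree printed earlier in the notebook. The helper `phrase_spans` is a hypothetical name introduced here. It reports each phrase's label and its word-index boundaries.

```python
from nltk.tree import Tree

tree = Tree.fromstring(
    "(S (NP The/DT quick/JJ brown/NN fox/NN) "
    "(VP (V jumps/VBZ) (PP (P over/IN) (NP the/DT lazy/JJ dog/NN))))"
)

def phrase_spans(node, start=0):
    """Yield (label, start, end) word-index spans for every phrase under node."""
    offset = start
    for child in node:
        if isinstance(child, Tree):
            yield from phrase_spans(child, offset)
            offset += len(child.leaves())
        else:
            offset += 1  # a single word
    yield (node.label(), start, offset)

# Strip the "/TAG" suffix from each leaf to recover the plain words
words = [leaf.split("/")[0] for leaf in tree.leaves()]
spans = list(phrase_spans(tree))

for label, begin, end in spans:
    print(f"{label:3} [{begin}:{end}] {' '.join(words[begin:end])}")

# Nesting depth of the whole tree, counting the leaf level
print("Tree height:", tree.height())
```

The spans give the phrase boundaries, and the nesting shows which phrase sits inside which: the NP "the lazy dog" lies inside the PP, which lies inside the VP.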

### **Explain how a hierarchical syntax tree supports NLP tasks like parsing or machine translation.**
#### It provides a clear structural map of the sentence, letting NLP systems understand who does what to whom. This reduces ambiguity, improves alignment between languages, and helps parsers and MT models generate grammatically correct and semantically accurate output.
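
One way a tree can help with cross-language alignment is reordering: if the target language places the verb at the end of the clause, the VP's children can be rearranged before translation. The toy sketch below shows only that idea. The function `verb_final` and the example tree are hypothetical, and real MT systems are far more involved.

```python
from nltk.tree import Tree

def is_verb_node(node):
    """True for a subtree labeled V (the verb chunk label used in this notebook)."""
    return isinstance(node, Tree) and node.label() == "V"

def verb_final(node):
    """Return a copy of the tree with each VP's verb moved after its other children."""
    if not isinstance(node, Tree):
        return node  # a leaf (plain word string)
    children = [verb_final(child) for child in node]
    if node.label() == "VP":
        non_verbs = [c for c in children if not is_verb_node(c)]
        verbs = [c for c in children if is_verb_node(c)]
        children = non_verbs + verbs
    return Tree(node.label(), children)

# Hand-written example tree (subject, verb, object)
english = Tree.fromstring("(S (NP The teacher) (VP (V explained) (NP the concept)))")
reordered = verb_final(english)

print(" ".join(english.leaves()))    # The teacher explained the concept
print(" ".join(reordered.leaves()))  # The teacher the concept explained
```

Because the rule operates on the VP node instead of on word positions, it applies unchanged however long the subject or object phrases are.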
