# Day 3: POS Tagging & Named Entity Recognition (NER)
**The AI Engineer Course 2026 - Section 22**

**Student:** Natruja

**Date:** Saturday, February 14, 2026

---

## Learning Objectives
1. Understand Part-of-Speech (POS) tagging
2. Learn about Named Entity Recognition (NER)
3. Use spaCy for advanced NLP tasks
4. Extract meaningful information from text

## Setup: Install and Import Required Libraries

In [1]:
import subprocess
import sys

# Install spaCy
subprocess.check_call([sys.executable, "-m", "pip", "install", "spacy", "-q"])

# Download English model
subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm", "-q"])

print("✓ spaCy installed and English model downloaded successfully!")

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
✓ spaCy installed and English model downloaded successfully!


In [2]:
# Import necessary libraries
import spacy

# Load the English model
nlp = spacy.load('en_core_web_sm')

print("✓ All imports successful!")

✓ All imports successful!


## Part-of-Speech (POS) Tagging: Understanding Word Roles

**POS Tagging** identifies the grammatical role of each word in a sentence.

### Common POS Tags:
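Each entry gives the category, the coarse tag that spaCy returns in `token.pos_`, and a representative Penn Treebank tag (available in `token.tag_`).

- **NOUN** (`NOUN`, NN): person, place, or thing (e.g., "dog", "city", "book")
- **VERB** (`VERB`, VB): action or state (e.g., "run", "jump", "eat")
- **ADJECTIVE** (`ADJ`, JJ): describes nouns (e.g., "beautiful", "quick")
- **ADVERB** (`ADV`, RB): modifies verbs, adjectives, or other adverbs (e.g., "quickly", "slowly")
- **PRONOUN** (`PRON`, PRP): replaces nouns (e.g., "he", "she", "it")
- **PREPOSITION** (`ADP`, IN): shows relationships (e.g., "in", "on", "at")
- **CONJUNCTION** (`CCONJ`, CC): connects words or clauses (e.g., "and", "but")
- **DETERMINER** (`DET`, DT): specifies nouns (e.g., "the", "a")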
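
The list above names each category alongside its Penn Treebank tag (NN, VB, JJ, ...), while the spaCy example that follows prints coarse labels such as `ADJ` and `ADP`. The sketch below is a simplified, hand-written mapping between the two for the eight categories listed. `PENN_TO_UNIVERSAL` and `to_universal` are illustrative names, not part of spaCy, and the real correspondence has more cases (for example, `IN` can also correspond to `SCONJ`).

```python
# Simplified mapping written by hand for illustration (an assumption, not spaCy's own table).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN",
    "VB": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "IN": "ADP",
    "CC": "CCONJ",
    "DT": "DET",
}

def to_universal(penn_tag):
    """Return the coarse tag for a Penn Treebank tag, or 'X' (other) if unknown."""
    return PENN_TO_UNIVERSAL.get(penn_tag, "X")

for tag in ["DT", "JJ", "NN", "VB", "ZZ"]:
    print(f"{tag:<4} -> {to_universal(tag)}")
```

With a loaded model, comparing `token.pos_` and `token.tag_` for the same token shows the coarse and fine-grained labels side by side.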

### Why POS Tagging Matters:
- Understand sentence structure
- Extract specific information types
- Improve other NLP tasks
- Help machines understand meaning

## EXAMPLE: POS Tagging with spaCy

In [12]:
# Sample sentence
text = "The quick brown fox jumps over the lazy dog."

# Process with spaCy
doc = nlp(text)

print(f"Text: {text}")
print("\n" + "="*60)
print(f"{'Word':<15} | {'POS Tag':<15} | {'Explanation':<25}")
print("-"*60)

# Display POS tags
pos_explanations = {
    'DET': 'Determiner',
    'ADJ': 'Adjective',
    'NOUN': 'Noun',
    'VERB': 'Verb',
    'ADP': 'Preposition',
    'PUNCT': 'Punctuation'
}

for token in doc:
    explanation = pos_explanations.get(token.pos_, token.pos_)
    print(f"{token.text:<15} | {token.pos_:<15} | {explanation:<25}")

Text: The quick brown fox jumps over the lazy dog.

Word            | POS Tag         | Explanation              
------------------------------------------------------------
The             | DET             | Determiner               
quick           | ADJ             | Adjective                
brown           | ADJ             | Adjective                
fox             | NOUN            | Noun                     
jumps           | VERB            | Verb                     
over            | ADP             | Preposition              
the             | DET             | Determiner               
lazy            | ADJ             | Adjective                
dog             | NOUN            | Noun                     
.               | PUNCT           | Punctuation              


## Named Entity Recognition (NER): Identifying What's What

**NER** identifies and classifies named entities (real-world objects like people, places, organizations).

### Common Entity Types:
- **PERSON**: Names of people (e.g., "John Smith", "Elon Musk")
- **ORG**: Organizations (e.g., "Google", "NASA", "Apple")
- **GPE**: Geographic/Political entities (e.g., "France", "New York", "USA")
- **PRODUCT**: Products/Objects (e.g., "iPhone", "Tesla Model 3")
- **MONEY**: Monetary values (e.g., "$100", "€50")
- **DATE**: Dates (e.g., "Monday", "February 14")
- **TIME**: Times (e.g., "2:30 PM", "3 hours")
- **EVENT**: Events (e.g., "World Cup", "Olympics")

### Applications:
- Information extraction
- Resume parsing
- Chatbot understanding
- Knowledge graph building
- Recommendation systems

## EXAMPLE: Named Entity Recognition with spaCy

In [13]:
# Sample text with multiple entities
text = "Elon Musk founded Tesla in California. On February 14, 2024, he announced a new product worth $500 million."

# Process with spaCy
doc = nlp(text)

print(f"Text: {text}")
print("\n" + "="*60)
print("\nNamed Entities Found:")
print("-"*60)

# Display entities
for ent in doc.ents:
    print(f"Text: {ent.text:<20} | Type: {ent.label_:<10} | Start: {ent.start_char} | End: {ent.end_char}")

print(f"\nTotal entities found: {len(doc.ents)}")

Text: Elon Musk founded Tesla in California. On February 14, 2024, he announced a new product worth $500 million.


Named Entities Found:
------------------------------------------------------------
Text: Elon Musk            | Type: PERSON     | Start: 0 | End: 9
Text: Tesla                | Type: ORG        | Start: 18 | End: 23
Text: California           | Type: GPE        | Start: 27 | End: 37
Text: February 14, 2024    | Type: DATE       | Start: 42 | End: 59
Text: $500 million         | Type: MONEY      | Start: 94 | End: 106

Total entities found: 5
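
The `Start` and `End` columns are character offsets into the original string, so slicing the text with them should give back the entity text. The sketch below checks this in plain Python using the offsets printed above; the `spans` list is transcribed by hand from that output rather than produced by spaCy here.

```python
text = ("Elon Musk founded Tesla in California. On February 14, 2024, "
        "he announced a new product worth $500 million.")

# (start_char, end_char, label) transcribed from the output above
spans = [
    (0, 9, "PERSON"),
    (18, 23, "ORG"),
    (27, 37, "GPE"),
    (42, 59, "DATE"),
    (94, 106, "MONEY"),
]

# The end offset is exclusive, exactly like a Python slice
for start, end, label in spans:
    print(f"{label:<8} {text[start:end]!r}")
```

Because the end offset is exclusive, `end - start` is the length of the entity text in characters.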


## EXAMPLE: Filtering Entities by Type

In [14]:
# Sample text
text = "Apple CEO Tim Cook announced a new iPhone in California. The price is $999 on September 12, 2024."

# Process with spaCy
doc = nlp(text)

print(f"Text: {text}")
print("\n" + "="*60)

# Filter entities by type
people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
locations = [ent.text for ent in doc.ents if ent.label_ == 'GPE']
dates = [ent.text for ent in doc.ents if ent.label_ == 'DATE']
money = [ent.text for ent in doc.ents if ent.label_ == 'MONEY']

print("\nExtracted Entities by Type:")
print(f"  People: {people}")
print(f"  Organizations: {orgs}")
print(f"  Locations: {locations}")
print(f"  Dates: {dates}")
print(f"  Money: {money}")

Text: Apple CEO Tim Cook announced a new iPhone in California. The price is $999 on September 12, 2024.


Extracted Entities by Type:
  People: ['Tim Cook']
  Organizations: ['Apple', 'iPhone']
  Locations: ['California']
  Dates: ['September 12, 2024']
  Money: ['999']
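
The five list comprehensions above each scan `doc.ents` once. A possible alternative is to group entities by label in a single pass. The sketch below does this with `collections.defaultdict`, using `(text, label)` pairs copied from the output above as a stand-in for `doc.ents`; with a real `Doc`, the pairs would come from `(ent.text, ent.label_)`.

```python
from collections import defaultdict

# (entity text, label) pairs transcribed from the output above
pairs = [
    ("Apple", "ORG"),
    ("Tim Cook", "PERSON"),
    ("iPhone", "ORG"),
    ("California", "GPE"),
    ("999", "MONEY"),
    ("September 12, 2024", "DATE"),
]

by_label = defaultdict(list)
for ent_text, label in pairs:
    by_label[label].append(ent_text)  # a new list is created on first use of a label

for label, texts in by_label.items():
    print(f"{label}: {texts}")
```

This version also picks up labels that were not anticipated in advance, which the fixed set of comprehensions would silently drop.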


---
# EXERCISES: 15 Tasks Organized by Difficulty

## ⭐ EASY: Exercise 1 - Process Text with spaCy and Print Tokens

In [4]:
# Exercise 1: Process text and display all tokens
text = "Python is a powerful programming language."

# TODO: Process text with spaCy
doc = nlp(text)

# TODO: Print total tokens
print(f"Text: {text}")
print(f"Total tokens: {len(doc)}")

print("\nTokens:")
for token in doc:
    print(f"  - {token.text}")

Text: Python is a powerful programming language.
Total tokens: 7

Tokens:
  - Python
  - is
  - a
  - powerful
  - programming
  - language
  - .


## ⭐ EASY: Exercise 2 - Get POS Tag for Each Word

In [5]:
# Exercise 2: Get POS tag for each word in a sentence
text = "Dogs love playing fetch."

# TODO: Process text with spaCy
doc = nlp(text)

print(f"Text: {text}")
print("\nWord -> POS Tag:")
for token in doc:
    print(f"  {token.text:<10} -> {token.pos_}")

Text: Dogs love playing fetch.

Word -> POS Tag:
  Dogs       -> PROPN
  love       -> AUX
  playing    -> VERB
  fetch      -> NOUN
  .          -> PUNCT


## ⭐ EASY: Exercise 3 - Find All Entities in a Sentence

In [6]:
# Exercise 3: Extract all entities from text
text = "Barack Obama was the 44th President of the United States."

# TODO: Process text with spaCy
doc = nlp(text)

print(f"Text: {text}")
print(f"\nEntities found: {len(doc.ents)}")
for ent in doc.ents:
    print(f"  - {ent.text}: {ent.label_}")

Text: Barack Obama was the 44th President of the United States.

Entities found: 3
  - Barack Obama: PERSON
  - 44th: ORDINAL
  - the United States: GPE


## ⭐ EASY: Exercise 4 - Count Tokens in a Document

In [7]:
# Exercise 4: Count total tokens in text
text = "Machine learning is transforming industries worldwide."
doc = nlp(text)

# TODO: Count the total number of tokens
token_count = len(doc)

print(f"Text: {text}")
print(f"Total tokens: {token_count}")

Text: Machine learning is transforming industries worldwide.
Total tokens: 7


## ⭐ EASY: Exercise 5 - Print Entity Text and Label

In [8]:
# Exercise 5: Display entity details in formatted table
text = "Google CEO Sundar Pichai lives in California."
doc = nlp(text)

print(f"Text: {text}")
print("\nEntity Details:")
print(f"{'Entity Text':<20} | {'Entity Type':<10}")
print("-" * 32)
for ent in doc.ents:
    print(f"{ent.text:<20} | {ent.label_:<10}")

Text: Google CEO Sundar Pichai lives in California.

Entity Details:
Entity Text          | Entity Type
--------------------------------
Google               | ORG       
Sundar Pichai        | PERSON    
California           | GPE       


## ⭐⭐ MEDIUM: Exercise 6 - Extract Only Nouns from Text

In [11]:
# Exercise 6: Extract all nouns from text
text = "The cat and dog played in the park."
doc = nlp(text)

# TODO: Extract only NOUN tokens
nouns = [token.text for token in doc if token.pos_ == 'NOUN']

print(f"Text: {text}")
print(f"Nouns extracted: {nouns}")
print(f"Total nouns: {len(nouns)}")

Text: The cat and dog played in the park.
Nouns extracted: ['cat', 'dog', 'park']
Total nouns: 3


## ⭐⭐ MEDIUM: Exercise 7 - Extract Only PERSON Entities

In [12]:
# Exercise 7: Extract all PERSON entities
text = "Steve Jobs founded Apple. Bill Gates created Microsoft. Both revolutionized technology."
doc = nlp(text)

# TODO: Extract all entities with label_ == 'PERSON'
people = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

print(f"Text: {text}")
print(f"People mentioned: {people}")
print(f"Total people: {len(people)}")

Text: Steve Jobs founded Apple. Bill Gates created Microsoft. Both revolutionized technology.
People mentioned: ['Steve Jobs', 'Bill Gates']
Total people: 2


## ⭐⭐ MEDIUM: Exercise 8 - Get All Verbs and Their Lemmas

In [13]:
# Exercise 8: Extract verbs and their base forms (lemmas)
text = "The students study hard. They studied chemistry yesterday."
doc = nlp(text)

# TODO: Extract VERB tokens with their lemmas
verbs = []
for token in doc:
    if token.pos_ == 'VERB':
        verbs.append((token.text, token.lemma_))

print(f"Text: {text}")
print("\nVerbs and their lemmas:")
for word, lemma in verbs:
    print(f"  {word} -> {lemma}")

Text: The students study hard. They studied chemistry yesterday.

Verbs and their lemmas:
  study -> study
  studied -> study


## ⭐⭐ MEDIUM: Exercise 9 - Compare Entities in Two Sentences

In [39]:
# Exercise 9: Extract and compare entities from two different texts
text1 = "John works at Microsoft in Seattle."
text2 = "Sarah works at Google in California."

doc1 = nlp(text1)
doc2 = nlp(text2)

# TODO: Extract PERSON entities from both docs
people_text1 = [ent.text for ent in doc1.ents if ent.label_ == 'PERSON']
people_text2 = [ent.text for ent in doc2.ents if ent.label_ == 'PERSON']

print("Text 1:", text1)
print("People:", people_text1)
print()
print("Text 2:", text2)
print("People:", people_text2)
print()
print("People in both texts:", set(people_text1) & set(people_text2))

Text 1: John works at Microsoft in Seattle.
People: ['John']

Text 2: Sarah works at Google in California.
People: ['Sarah']

People in both texts: set()


## ⭐⭐ MEDIUM: Exercise 10 - Count POS Tag Frequencies

In [41]:
# Exercise 10: Count how many tokens of each POS type exist
text = "The quick brown fox jumps over the lazy dog and runs away."
doc = nlp(text)

# TODO: Create a dictionary to count each POS tag
pos_counts = {}
for token in doc:
    pos = token.pos_
    if pos in pos_counts:
        pos_counts[pos] += 1
    else:
        pos_counts[pos] = 1

print(f"Text: {text}")
print("\nPOS Tag Frequencies:")
for pos, count in sorted(pos_counts.items()):
    print(f"  {pos}: {count}")

Text: The quick brown fox jumps over the lazy dog and runs away.

POS Tag Frequencies:
  ADJ: 3
  ADP: 1
  ADV: 1
  CCONJ: 1
  DET: 2
  NOUN: 2
  PUNCT: 1
  VERB: 2
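
The same tally can be expressed with `collections.Counter` from the standard library. The sketch below is a minimal illustration on a hand-written list of tags; the per-token tags are an assumption chosen to agree with the frequencies printed above, and with a real `Doc` the list would be `[token.pos_ for token in doc]`.

```python
from collections import Counter

# One coarse POS tag per token of the sentence above (assumed, consistent with the printed counts)
tags = ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET",
        "ADJ", "NOUN", "CCONJ", "VERB", "ADV", "PUNCT"]

pos_counts = Counter(tags)

for pos, count in sorted(pos_counts.items()):
    print(f"  {pos}: {count}")

# most_common(n) returns the n most frequent (tag, count) pairs
print(pos_counts.most_common(1))  # [('ADJ', 3)]
```

`Counter` removes the explicit `if`/`else` branch and also provides `most_common` for ranking.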


## ⭐⭐⭐ HARD: Exercise 11 - Build Entity Extraction Function

In [42]:
# Exercise 11: Create a function that returns entities organized by type
def extract_entities_by_type(text):
    """Extract entities and organize them by type in a dictionary."""
    doc = nlp(text)
    
    # TODO: Initialize a dictionary with entity types as keys
    entities = {
        'PERSON': [],
        'ORG': [],
        'GPE': []
    }
    
    # TODO: Loop through entities and add them to the appropriate list
    for ent in doc.ents:
        if ent.label_ in entities:
            entities[ent.label_].append(ent.text)
    
    return entities

# Test the function
text = "Elon Musk works at Tesla in California. Jeff Bezos founded Amazon in Seattle."
result = extract_entities_by_type(text)

print(f"Text: {text}")
print("\nExtracted Entities by Type:")
for entity_type, entities_list in result.items():
    print(f"  {entity_type}: {entities_list}")

Text: Elon Musk works at Tesla in California. Jeff Bezos founded Amazon in Seattle.

Extracted Entities by Type:
  PERSON: ['Elon Musk', 'Jeff Bezos']
  ORG: ['Tesla', 'Amazon']
  GPE: ['California', 'Seattle']


## ⭐⭐⭐ HARD: Exercise 12 - Analyze Paragraph: Entities, POS, Key Nouns

In [51]:
# Exercise 12: Comprehensive analysis of a paragraph
paragraph = """Artificial intelligence is transforming the world.
Companies like Google, Microsoft, and OpenAI are investing heavily.
Researchers in California and New York are making breakthrough discoveries every day."""

doc = nlp(paragraph)

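# TODO: Extract unique entities
unique_entities = {ent.text for ent in doc.ents}

# TODO: Count NOUN tokens
noun_count = sum(1 for token in doc if token.pos_ == 'NOUN')

# TODO: Extract nouns (not just count)
# A set removes duplicates, so the resulting order is not the order in the text
key_nouns = list({token.text for token in doc if token.pos_ == 'NOUN'})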

print("PARAGRAPH ANALYSIS")
print("=" * 60)
print(f"Text: {paragraph[:100]}...\n")
print(f"Total entities: {len(doc.ents)}")
print(f"Unique entities: {unique_entities}")
print(f"\nNoun count: {noun_count}")
print(f"Key nouns: {key_nouns}")

PARAGRAPH ANALYSIS
Text: Artificial intelligence is transforming the world.
Companies like Google, Microsoft, and OpenAI are ...

Total entities: 6
Unique entities: {'California', 'Microsoft', 'Google', 'OpenAI', 'New York', 'every day'}

Noun count: 6
Key nouns: ['intelligence', 'discoveries', 'world', 'day', 'Researchers', 'Companies']


## ⭐⭐⭐ HARD: Exercise 13 - Process Multiple Documents

In [63]:
# Exercise 13: Extract and find common entities across multiple documents
documents = [
    "Apple released the iPhone in California.",
    "Google headquarters is also in California.",
    "Microsoft is based in Washington."
]

# TODO: Process each document and collect all ORG entities
all_organizations = []
for doc_text in documents:
    doc = nlp(doc_text)
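    orgs = [ent.text for ent in doc.ents if ent.label_ == 'ORG']
    all_organizations.extend(orgs)

# TODO: Find organizations that appear in multiple documents
common_orgs = [org for org in all_organizations if all_organizations.count(org) > 1]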

print("MULTI-DOCUMENT ANALYSIS")
print("=" * 60)
for i, doc_text in enumerate(documents, 1):
    print(f"{i}. {doc_text}")

print(f"\nAll organizations found: {set(all_organizations)}")
print(f"Organizations appearing multiple times: {set(common_orgs)}")

MULTI-DOCUMENT ANALYSIS
1. Apple released the iPhone in California.
2. Google headquarters is also in California.
3. Microsoft is based in Washington.

All organizations found: {'Apple', 'Microsoft', 'iPhone', 'Google'}
Organizations appearing multiple times: set()
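
One way to decide whether an organization "appears in multiple documents" is to count documents rather than mentions, so that a name repeated inside a single document is not mistaken for a cross-document match. The sketch below illustrates that idea in plain Python. The helper `orgs_in_multiple_docs` is a hypothetical name, and the per-document lists are inferred from the output above rather than computed by spaCy here.

```python
from collections import Counter

def orgs_in_multiple_docs(orgs_per_doc):
    """Return organizations mentioned in at least two different documents."""
    doc_freq = Counter()
    for orgs in orgs_per_doc:
        doc_freq.update(set(orgs))  # set(): each document counts at most once per name
    return {org for org, n in doc_freq.items() if n >= 2}

# ORG entities per document, inferred from the output above
orgs_per_doc = [["Apple", "iPhone"], ["Google"], ["Microsoft"]]
print(orgs_in_multiple_docs(orgs_per_doc))  # set()

# Hypothetical fourth document that mentions Apple twice and Google once
extended = orgs_per_doc + [["Apple", "Apple", "Google"]]
print(orgs_in_multiple_docs(extended))
```

In the three sample documents no organization is shared, which is why the result is empty; only places such as California repeat, and those carry the `GPE` label.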


## ⭐⭐⭐ HARD: Exercise 14 - Build Text Highlighting Function

In [66]:
# Exercise 14: Create a function that highlights entities in text
def highlight_entities(text, entity_type='PERSON'):
    """Highlight entities of a specific type in text."""
    doc = nlp(text)
    
    # TODO: Build a list of entity positions
    entity_spans = []
    for ent in doc.ents:
        if ent.label_ == entity_type:
            entity_spans.append((ent.start_char, ent.end_char, ent.text))
    
    # TODO: Sort by start position and create highlighted version
    highlighted = ""
    last_end = 0
    
    for start, end, entity_text in sorted(entity_spans):
        highlighted += text[last_end:start]
        highlighted += f"[{entity_text.upper()}]"
        last_end = end
    
    highlighted += text[last_end:]
    return highlighted

# Test the function
text = "John Smith and Mary Johnson work at Apple."
result = highlight_entities(text, entity_type='PERSON')

print(f"Original: {text}")
print(f"Highlighted: {result}")

Original: John Smith and Mary Johnson work at Apple.
Highlighted: [JOHN SMITH] and [MARY JOHNSON] work at Apple.


## ⭐⭐⭐ HARD: Exercise 15 - Create NER Summary Report

In [67]:
# Exercise 15: Generate a comprehensive NER summary report
news_article = """Tech giant Apple announced record earnings on January 15, 2024.
CEO Tim Cook, based in Cupertino, California, presented the results to investors.
The company stock price increased by 5 percent to $185 per share.
Analysts predict further growth in the coming quarter.
Microsoft and Google also reported strong performance this month."""

def generate_ner_report(text):
    """Generate a comprehensive NER summary report."""
    doc = nlp(text)
    
    # TODO: Initialize report dictionary
    report = {
        'PERSON': [],
        'ORG': [],
        'GPE': [],
        'MONEY': [],
        'DATE': [],
        'total_tokens': len(doc),
        'total_entities': len(doc.ents)
    }
    
    # TODO: Populate report with entities
    for ent in doc.ents:
        if ent.label_ in report and isinstance(report[ent.label_], list):
            report[ent.label_].append(ent.text)
    
    return report

report = generate_ner_report(news_article)

print("NER SUMMARY REPORT")
print("=" * 60)
print(f"Total tokens: {report['total_tokens']}")
print(f"Total entities: {report['total_entities']}")

print("\nEntity Breakdown:")
for entity_type in ['PERSON', 'ORG', 'GPE', 'MONEY', 'DATE']:
    if report[entity_type]:
        print(f"  {entity_type:10} ({len(report[entity_type])}): {report[entity_type]}")

NER SUMMARY REPORT
Total tokens: 66
Total entities: 11

Entity Breakdown:
  PERSON     (1): ['Tim Cook']
  ORG        (3): ['Apple', 'Microsoft', 'Google']
  GPE        (2): ['Cupertino', 'California']
  MONEY      (1): ['185']
  DATE       (3): ['January 15, 2024', 'the coming quarter', 'this month']
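
The report shows 11 entities in total, but the breakdown lists fewer. A quick consistency check makes the gap explicit. The numbers below are copied from the report above. The entity that falls outside the tracked types may well be "5 percent" with a label such as `PERCENT`, though the output does not confirm that.

```python
# Per-type counts and overall total copied from the report above
tracked_counts = {"PERSON": 1, "ORG": 3, "GPE": 2, "MONEY": 1, "DATE": 3}
total_entities = 11

tracked_total = sum(tracked_counts.values())
untracked = total_entities - tracked_total

print(f"Tracked: {tracked_total}, untracked: {untracked}")  # Tracked: 10, untracked: 1
```

With the live `doc`, the untracked entities could be listed by keeping those whose `ent.label_` is not a key of `report`.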


---
## ⭐⭐⭐ BONUS HARD EXERCISES (16-20)
**Extra challenges for deeper practice!**

## ⭐⭐⭐ HARD: Exercise 16 - Dependency Parsing & Sentence Structure

In [20]:
# Exercise 16: Analyze dependency relationships between words
text = "The talented engineer designed an innovative AI system for the company."
doc = nlp(text)

# TODO: Print each token with its dependency label and head word
print("DEPENDENCY PARSING")
print("=" * 60)
print(f"{'Token':<15} | {'DEP':<12} | {'Head Word':<15} | {'POS':<8}")
print("-" * 60)
for token in doc:
    print(f"{token.text:<15} | {token.dep_:<12} | {token.head.text:<15} | {token.pos_:<8}")

# TODO: Find the ROOT verb of the sentence
root_verb = [token for token in doc if token.dep_ == 'ROOT']
print(f"\nRoot verb: {root_verb}")

# TODO: Find all direct objects (dobj)
direct_objects = [token.text for token in doc if token.dep_ == 'dobj']
print(f"Direct objects: {direct_objects}")

# TODO: Find all subjects (nsubj)
subjects = [token.text for token in doc if token.dep_ == 'nsubj']
print(f"Subjects: {subjects}")

DEPENDENCY PARSING
Token           | DEP          | Head Word       | POS     
------------------------------------------------------------
The             | det          | engineer        | DET     
talented        | amod         | engineer        | ADJ     
engineer        | nsubj        | designed        | NOUN    
designed        | ROOT         | designed        | VERB    
an              | det          | system          | DET     
innovative      | amod         | system          | ADJ     
AI              | compound     | system          | PROPN   
system          | dobj         | designed        | NOUN    
for             | prep         | system          | ADP     
the             | det          | company         | DET     
company         | pobj         | for             | NOUN    
.               | punct        | designed        | PUNCT   

Root verb: [designed]
Direct objects: ['system']
Subjects: ['engineer']
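
Each row of the table links a token to its head, so following head links repeatedly leads to the root of the sentence. The sketch below walks that chain in plain Python. The `rows` list is transcribed by hand from the table above, with each head word resolved to its position in the list (an assumption made here so that "The" and "the" stay distinct); with a real `Doc`, the same walk would use `token.head`.

```python
# (text, dep, head_index) rows transcribed from the table above;
# head_index is the position of the head word within this list.
rows = [
    ("The", "det", 2), ("talented", "amod", 2), ("engineer", "nsubj", 3),
    ("designed", "ROOT", 3), ("an", "det", 7), ("innovative", "amod", 7),
    ("AI", "compound", 7), ("system", "dobj", 3), ("for", "prep", 7),
    ("the", "det", 10), ("company", "pobj", 8), (".", "punct", 3),
]

def path_to_root(index):
    """Follow head links from a token until reaching the root, whose head is itself."""
    path = [rows[index][0]]
    while rows[index][2] != index:
        index = rows[index][2]
        path.append(rows[index][0])
    return path

print(" -> ".join(path_to_root(10)))  # company -> for -> system -> designed
```

The loop terminates because the root token is its own head, as the `designed | ROOT | designed` row shows.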


## ⭐⭐⭐ HARD: Exercise 17 - Entity Frequency & Ranking Across Multiple Texts

In [26]:
# Exercise 17: Build an entity frequency ranker from multiple news headlines
headlines = [
    "Apple launches new iPhone in California and New York.",
    "Google and Microsoft compete in AI race.",
    "Tim Cook visits Google headquarters in California.",
    "Microsoft acquires startup in New York for $2 billion.",
    "Apple and Google announce partnership in California.",
    "Elon Musk praises Microsoft AI tools.",
    "Jeff Bezos invests in California tech startup."
]

# TODO: Build a frequency dictionary for ALL entity types
entity_freq = {}  # key: (entity_text, entity_label), value: count

for headline in headlines:
    doc = nlp(headline)
    for ent in doc.ents:
        key = (ent.text, ent.label_)
        entity_freq[key] = entity_freq.get(key, 0) + 1

# TODO: Sort by frequency (highest first)
ranked = sorted(entity_freq.items(), key=lambda x: x[1], reverse=True)

print("ENTITY FREQUENCY RANKING")
print("=" * 60)
print(f"{'Rank':<6} | {'Entity':<20} | {'Type':<10} | {'Count':<6}")
print("-" * 60)
for rank, ((text, label), count) in enumerate(ranked, 1):
    print(f"{rank:<6} | {text:<20} | {label:<10} | {count:<6}")

# TODO: Find the most mentioned ORG
top_org = [text for (text, label), count in ranked if label == 'ORG']
print(f"\nMost mentioned organization: {top_org[0] if top_org else 'None'}")

ENTITY FREQUENCY RANKING
Rank   | Entity               | Type       | Count 
------------------------------------------------------------
1      | California           | GPE        | 4     
2      | Google               | ORG        | 3     
3      | Apple                | ORG        | 2     
4      | New York             | GPE        | 2     
5      | Microsoft            | ORG        | 2     
6      | iPhone               | ORG        | 1     
7      | AI                   | GPE        | 1     
8      | Tim Cook             | PERSON     | 1     
9      | $2 billion           | MONEY      | 1     
10     | Microsoft AI         | ORG        | 1     
11     | Jeff Bezos           | PERSON     | 1     

Most mentioned organization: Google


## ⭐⭐⭐ HARD: Exercise 18 - Build a POS Pattern Matcher

In [29]:
# Exercise 18: Find specific POS patterns (e.g., ADJ + NOUN combinations)
texts = [
    "The brilliant scientist made an amazing discovery.",
    "A dangerous storm hit the coastal city yesterday.",
    "The young developer built a powerful application."
]

def find_pos_patterns(text, pattern):
    """Find consecutive POS tag patterns in text.
    pattern: list of POS tags, e.g., ['ADJ', 'NOUN']
    Returns list of matched phrases.
    """
    doc = nlp(text)
    matches = []
    tokens = list(doc)
    
    # TODO: Slide a window of pattern length across tokens
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        
        # TODO: Check if POS tags match the pattern
        pos_tags = [token.pos_ for token in window]
        if pos_tags == pattern:
            phrase = " ".join([t.text for t in window])
            matches.append(phrase)
    
    return matches

# Find ADJ + NOUN patterns
print("ADJ + NOUN Patterns Found:")
print("=" * 40)
for text in texts:
    results = find_pos_patterns(text, ['ADJ', 'NOUN'])
    print(f"  '{text[:40]}...'")
    print(f"    Matches: {results}")

# TODO: Now find NOUN + VERB patterns
print("\nNOUN + VERB Patterns Found:")
print("=" * 40)
for text in texts:
    results = find_pos_patterns(text, ['NOUN', 'VERB'])
    print(f"  '{text[:40]}...'")
    print(f"    Matches: {results}")

ADJ + NOUN Patterns Found:
  'The brilliant scientist made an amazing ...'
    Matches: ['brilliant scientist', 'amazing discovery']
  'A dangerous storm hit the coastal city y...'
    Matches: ['dangerous storm', 'coastal city']
  'The young developer built a powerful app...'
    Matches: ['young developer', 'powerful application']

NOUN + VERB Patterns Found:
  'The brilliant scientist made an amazing ...'
    Matches: ['scientist made']
  'A dangerous storm hit the coastal city y...'
    Matches: ['storm hit']
  'The young developer built a powerful app...'
    Matches: ['developer built']
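
`find_pos_patterns` requires an exact tag at each position, so a phrase like "AI system" is missed by a `['NOUN', 'NOUN']` pattern when "AI" is tagged `PROPN`. A possible variant accepts a set of tags per position. The sketch below is a hypothetical extension, not part of the exercise: `find_flexible_patterns` is an illustrative name, and it works on hand-tagged `(text, pos)` pairs instead of a spaCy `Doc`.

```python
def find_flexible_patterns(tagged, pattern):
    """Match consecutive tokens against a pattern of accepted-tag sets.

    tagged:  list of (text, pos) pairs
    pattern: list of sets, one set of accepted POS tags per position
    """
    width = len(pattern)
    matches = []
    for i in range(len(tagged) - width + 1):
        window = tagged[i:i + width]
        # Every token in the window must carry one of the tags allowed at its position
        if all(pos in allowed for (_, pos), allowed in zip(window, pattern)):
            matches.append(" ".join(word for word, _ in window))
    return matches

# Hand-tagged tokens (an assumption; a real run would use token.text and token.pos_)
tagged = [("an", "DET"), ("innovative", "ADJ"), ("AI", "PROPN"), ("system", "NOUN")]

print(find_flexible_patterns(tagged, [{"NOUN", "PROPN"}, {"NOUN"}]))  # ['AI system']
print(find_flexible_patterns(tagged, [{"ADJ"}, {"NOUN"}]))            # []
```

Passing single-element sets reproduces the exact-match behaviour of the original function.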


## ⭐⭐⭐ HARD: Exercise 19 - Entity Relationship Mapper

In [33]:
# Exercise 19: Map relationships between PERSON and ORG entities in same sentence
texts = [
    "Tim Cook is the CEO of Apple.",
    "Sundar Pichai leads Google. Satya Nadella manages Microsoft.",
    "Jensen Huang founded NVIDIA in California.",
    "Mark Zuckerberg built Meta from his Harvard dorm room."
]

def extract_person_org_relations(text):
    """Find PERSON-ORG pairs that appear in the same sentence."""
    doc = nlp(text)
    relations = []
    
    # TODO: Loop through each sentence in the document
    for sent in doc.sents:
        # TODO: Extract PERSON and ORG entities from this sentence
        persons = [ent.text for ent in sent.ents if ent.label_ == 'PERSON']
        orgs = [ent.text for ent in sent.ents if ent.label_ == 'ORG']
        
        # TODO: Create pairs of (person, org) for entities in same sentence
        for person in persons:
            for org in orgs:
                relations.append((person, org))
    
    return relations

print("PERSON ↔ ORG RELATIONSHIPS")
print("=" * 50)
all_relations = []
for text in texts:
    rels = extract_person_org_relations(text)
    all_relations.extend(rels)
    for person, org in rels:
        print(f"  {person} → {org}")

print(f"\nTotal relationships found: {len(all_relations)}")

PERSON ↔ ORG RELATIONSHIPS
  Tim Cook → Apple
  Satya Nadella manages → Microsoft
  Mark Zuckerberg → Meta
  Mark Zuckerberg → Harvard

Total relationships found: 4


## ⭐⭐⭐ HARD: Exercise 20 - Complete NLP Analysis Pipeline

In [35]:
# Exercise 20: Build a complete text analysis pipeline combining POS + NER + Stats
article = """Artificial intelligence company OpenAI, led by Sam Altman,
released ChatGPT in November 2022. The product gained 100 million users
within two months. Google responded by launching Bard, while Microsoft
invested $10 billion in OpenAI. Experts in San Francisco and London
predict AI will transform every industry by 2030."""

def full_nlp_pipeline(text):
    """Run complete NLP analysis: tokenization, POS, NER, and statistics."""
    doc = nlp(text)
    
    # --- SECTION 1: Basic Stats ---
    # TODO: Count sentences, tokens, and unique tokens
    num_sentences = len(list(doc.sents))
    num_tokens = len(doc)
    unique_tokens = len(set([token.lower_ for token in doc if not token.is_punct]))
    
    # --- SECTION 2: POS Distribution ---
    # TODO: Get top 3 most common POS tags
    pos_counts = {}
    for token in doc:
        if not token.is_punct and not token.is_space:
            pos = token.pos_
            pos_counts[pos] = pos_counts.get(pos, 0) + 1
    
    top_pos = sorted(pos_counts.items(), key=lambda x: x[1], reverse=True)[:3]
    
    # --- SECTION 3: Entity Analysis ---
    # TODO: Group entities by type
    entity_groups = {}
    for ent in doc.ents:
        label = ent.label_
        if label not in entity_groups:
            entity_groups[label] = []
        entity_groups[label].append(ent.text)
    
    # --- SECTION 4: Key Phrases (ADJ+NOUN) ---
    # TODO: Extract adjective-noun phrases
    key_phrases = []
    tokens_list = list(doc)
    for i in range(len(tokens_list) - 1):
        if tokens_list[i].pos_ == 'ADJ' and tokens_list[i+1].pos_ == 'NOUN':
            key_phrases.append(f"{tokens_list[i].text} {tokens_list[i+1].text}")
    
    return {
        'sentences': num_sentences,
        'tokens': num_tokens,
        'unique_tokens': unique_tokens,
        'top_pos': top_pos,
        'entities': entity_groups,
        'key_phrases': key_phrases
    }

# Run the pipeline
results = full_nlp_pipeline(article)

# Display results
print("COMPLETE NLP ANALYSIS REPORT")
print("=" * 60)
print(f"\n📊 BASIC STATS:")
print(f"   Sentences: {results['sentences']}")
print(f"   Tokens: {results['tokens']}")
print(f"   Unique tokens: {results['unique_tokens']}")

print(f"\n🏷️ TOP POS TAGS:")
for pos, count in results['top_pos']:
    print(f"   {pos}: {count}")

print(f"\n🔍 ENTITIES FOUND:")
for label, entities in results['entities'].items():
    print(f"   {label}: {entities}")

print(f"\n📝 KEY PHRASES (ADJ+NOUN):")
for phrase in results['key_phrases']:
    print(f"   - {phrase}")

COMPLETE NLP ANALYSIS REPORT

📊 BASIC STATS:
   Sentences: 4
   Tokens: 60
   Unique tokens: 45

🏷️ TOP POS TAGS:
   PROPN: 12
   NOUN: 8
   VERB: 8

🔍 ENTITIES FOUND:
   PERSON: ['OpenAI', 'Sam Altman', 'Bard']
   DATE: ['November 2022', 'two months', '2030']
   CARDINAL: ['100 million']
   ORG: ['Microsoft']
   MONEY: ['$10 billion']
   GPE: ['OpenAI', 'San Francisco', 'London', 'AI']

📝 KEY PHRASES (ADJ+NOUN):
   - Artificial intelligence


---
## Summary

### Key Takeaways:
- **POS Tagging** identifies the grammatical role of words (NOUN, VERB, ADJ, etc.)
- **Named Entity Recognition** identifies and classifies named entities (people, places, organizations, etc.)
- **spaCy** is a powerful library for both tasks with pre-trained models
- These techniques enable information extraction and deeper text understanding

### Real-World Applications:
- Resume parsing and job matching
- News article analysis
- Customer support automation
- Knowledge base construction
- Question answering systems

### What's Next:
Tomorrow we'll learn **Sentiment Analysis** - determining whether text expresses positive, negative, or neutral emotions!

---

*Created for Natruja's NLP study plan*