# 📦 spaCy Matcher
## 🧠 What is Phrase Matching?
- Phrase matching is the process of identifying predefined phrases (multi-word expressions) in text.
- In spaCy, this is efficiently handled using the _**<code>Matcher</code>**_ class, which can match patterns based on tokens, and the _**<code>PhraseMatcher</code>**_ class, which is optimized specifically for matching sequences of tokens (phrases).


## ⚙️ 1. Setup: Install & Load spaCy

In [3]:
import spacy
from spacy.matcher import Matcher

# Load English language model
nlp = spacy.load("en_core_web_sm")

# Create a matcher object
matcher = Matcher(nlp.vocab)

## 📚 2. Basic Matcher Pattern

In [5]:
pattern = [
    {"LOWER": "data"},
    {"LOWER": "science"}
]

matcher.add("DATA_SCIENCE", [pattern])
doc = nlp("I love learning Data Science and Data Engineering.")

matches = matcher(doc)
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Matched: {span.text}")


Matched: Data Science


In [6]:
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]

# Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}]

# Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

#help(matcher.add)
matcher.add('SolarPower', [pattern1, pattern2, pattern3])

doc = nlp("The Solar Power industry continues to grow as solarpower increases. Solar-power is amazing.")

found_matches = matcher(doc)
print(found_matches)
print("-----------------------------------------------------------------------------------")
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id] # get string representation
    span = doc[start:end]    # get the matched span
    print(match_id,  string_id, start, end, span.text)

[(8656102463236116519, 1, 3), (8656102463236116519, 8, 9), (8656102463236116519, 11, 14)]
-----------------------------------------------------------------------------------
8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 8 9 solarpower
8656102463236116519 SolarPower 11 14 Solar-power


In [7]:
# help(matcher.add)

matcher.remove('SolarPower')

In [8]:
# solarpower SolarPower
pattern1 = [{'LOWER':'solarpower'}]

# "solar" and "power" separated by zero or more punctuation tokens (solar power, solar-power, solar--power, etc.)
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True,'OP':'*'},{'LOWER':'power'}]

matcher.add('SolarPower',[pattern1, pattern2])

doc2 = nlp('Solar--power is solarpower yay!')
found_matches = matcher(doc2)
for match_id, start, end in found_matches:
    str_id = nlp.vocab.strings[match_id]
    span = doc2[start:end]
    print(match_id, str_id, start, end, span.text)

8656102463236116519 SolarPower 0 3 Solar--power
8656102463236116519 SolarPower 4 5 solarpower


## 🛠️ 3. Using Token Attributes
- Let’s go through several key token attributes used in patterns:

| Attribute  | Description      | Example                                   |
| ---------- | ---------------- | ----------------------------------------- |
| `TEXT`     | Exact text       | `"TEXT": "machine"`                       |
| `LOWER`    | Lowercased text  | `"LOWER": "machine"`                      |
| `LEMMA`    | Base form        | `"LEMMA": "be"` matches "is", "was", etc. |
| `POS`      | Part-of-speech   | `"POS": "NOUN"`                           |
| `TAG`      | Fine-grained POS | `"TAG": "VBD"` for past tense             |
| `IS_ALPHA` | Alphabetic only? | `"IS_ALPHA": True`                        |
| `IS_DIGIT` | Digits only?     | `"IS_DIGIT": True`                        |
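
To make the difference between a few of these attributes concrete, here is a minimal sketch. It assumes a blank English pipeline (`spacy.blank("en")`), which only tokenizes; that is enough for `TEXT`, `LOWER` and `IS_DIGIT`, whereas `LEMMA`, `POS` and `TAG` need a trained pipeline such as `en_core_web_sm`. The rule names and the sample sentence are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline (tokenizer only) is sufficient for these lexical attributes
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# TEXT is case-sensitive, LOWER compares the lowercased token text
matcher.add("EXACT_TEXT", [[{"TEXT": "machine"}]])
matcher.add("ANY_CASE", [[{"LOWER": "machine"}]])
# IS_DIGIT is true only for tokens made up entirely of digits
matcher.add("DIGITS", [[{"IS_DIGIT": True}]])

doc = nlp("Machine 42 replaced the old machine")

# Collect (rule name, matched text) pairs
results = [
    (nlp.vocab.strings[match_id], doc[start:end].text)
    for match_id, start, end in matcher(doc)
]
for rule, text in sorted(results):
    print(rule, "->", text)
```

Only `ANY_CASE` should pick up the capitalized "Machine", which is why the earlier examples use `LOWER` rather than `TEXT`.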


## 🧠 4. Advanced Example: Match Adjective followed by Noun

In [11]:
pattern = [
    {"POS": "ADJ"},  # Adjective
    {"POS": "NOUN"}  # Noun
]

matcher.add("ADJ_NOUN", [pattern])
doc = nlp("She has great ideas and amazing energy.")

matches = matcher(doc)
for match_id, start, end in matches:
    print("Match:", doc[start:end].text)


Match: great ideas
Match: amazing energy


## 🔁 5. Using Wildcards and Quantifiers
### Match "deep", then one or more alphabetic tokens, then "learning"

In [13]:
pattern = [
    {"LOWER": "deep"},
    {"IS_ALPHA": True, "OP": "+"},  # One or more words
    {"LOWER": "learning"}
]

matcher.add("DEEP_PATTERN", [pattern])
doc = nlp("Deep neural learning is different from deep learning.")

matches = matcher(doc)
for match_id, start, end in matches:
    print("Match:", doc[start:end].text)


Match: Deep neural learning
Match: neural learning
Match: deep learning
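Match: Deep neural learning is different from deep learning

- Because `+` requires at least one token between "deep" and "learning", `DEEP_PATTERN` on its own cannot produce the two-token matches. "neural learning" and "deep learning" most likely come from the `ADJ_NOUN` rule added in section 4, which is still registered on the same `matcher`.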


## Operators OP for Repetition & Optionality

| OP Symbol | Meaning           |
| --------- | ----------------- |
| `"!"`     | negate (0 times)  |
| `"?"`     | optional (0 or 1) |
| `"*"`     | 0 or more times   |
| `"+"`     | 1 or more times   |
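
As a rough illustration of how these operators behave, the sketch below uses `"?"` to make a token optional and then shows that `"+"` can return several overlapping matches, which the `greedy="LONGEST"` argument of `Matcher.add` can reduce to the longest one. It assumes a blank English pipeline (tokenizer only); the rule names `DEEP_OPTIONAL` and `DEEP_PLUS` are hypothetical.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline, no trained model needed
doc = nlp("Deep neural learning is different from deep learning.")

# "?" makes the middle token optional: "deep learning" and "deep neural learning"
optional = Matcher(nlp.vocab)
optional.add("DEEP_OPTIONAL", [[
    {"LOWER": "deep"},
    {"LOWER": "neural", "OP": "?"},
    {"LOWER": "learning"},
]])
optional_texts = sorted(doc[s:e].text for _, s, e in optional(doc))
print(optional_texts)

# "+" can yield several overlapping matches for the same rule
plus_pattern = [
    {"LOWER": "deep"},
    {"IS_ALPHA": True, "OP": "+"},
    {"LOWER": "learning"},
]
all_matches = Matcher(nlp.vocab)
all_matches.add("DEEP_PLUS", [plus_pattern])
all_texts = sorted(doc[s:e].text for _, s, e in all_matches(doc))
print(all_texts)

# greedy="LONGEST" keeps only the longest of the overlapping matches
longest_only = Matcher(nlp.vocab)
longest_only.add("DEEP_PLUS", [plus_pattern], greedy="LONGEST")
longest_texts = [doc[s:e].text for _, s, e in longest_only(doc)]
print(longest_texts)
```

Using a separate `Matcher` per experiment keeps rules from earlier cells out of the results.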


## 🔄 6. Match Numbers Followed by Nouns (e.g., "5 apples", "100 dollars")

In [16]:
pattern = [
    {"LIKE_NUM": True},
    {"POS": "NOUN"}
]

matcher.add("NUM_NOUN", [pattern])
doc = nlp("He bought 5 apples and 100 dollars worth of items.")

matches = matcher(doc)
for match_id, start, end in matches:
    print("Match:", doc[start:end].text)


Match: 5 apples
Match: 100 dollars
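
`LIKE_NUM` is not limited to digits; for English it also covers spelled-out numbers. The following sketch shows this under the assumption of a blank English pipeline, so `{"POS": "NOUN"}` is swapped for `{"IS_ALPHA": True}` because no tagger is available. The rule name and sentence are invented for the example.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: tokenizer only, so POS is replaced by IS_ALPHA here
matcher = Matcher(nlp.vocab)

# A number-like token followed by any alphabetic token
matcher.add("NUM_WORD", [[{"LIKE_NUM": True}, {"IS_ALPHA": True}]])

doc = nlp("She counted ten apples and 5 pears.")
num_matches = [doc[start:end].text for _, start, end in matcher(doc)]
print(num_matches)
```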


## 🎯 7. Named Entity + Custom Rule Matching (e.g., Person + Verb)

In [18]:
pattern = [
    {"ENT_TYPE": "PERSON"},
    {"POS": "VERB"}
]

matcher.add("PERSON_VERB", [pattern])
doc = nlp("Elon Musk founded SpaceX. Bill Gate leads Microsoft.")

matches = matcher(doc)
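for match_id, start, end in matches:
    # The pattern covers one PERSON token plus the verb, so only the last name is included
    print("Match:", doc[start:end].text)
    # Widen the span by one token to include the first name (assumes start > 0)
    print("Match:", doc[start-1:end].text)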

Match: Musk founded
Match: Elon Musk founded
Match: Gate leads
Match: Bill Gate leads
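
Instead of widening the span by hand with `start-1`, one option is to let the pattern consume every token of the entity. The sketch below is one way to do that, using `"OP": "+"` on the `PERSON` token together with `greedy="LONGEST"`. To keep it independent of what a trained pipeline predicts, it assumes a hand-annotated `Doc` (the `pos` and `ents` values are written manually); with a real pipeline you would pass `nlp(text)` instead.

```python
from spacy.matcher import Matcher
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Assumption: annotations are supplied by hand so the sketch does not depend
# on model predictions; normally they come from a trained pipeline
vocab = Vocab()
doc = Doc(
    vocab,
    words=["Elon", "Musk", "founded", "SpaceX", "."],
    pos=["PROPN", "PROPN", "VERB", "PROPN", "PUNCT"],
    ents=["B-PERSON", "I-PERSON", "O", "B-ORG", "O"],
)

matcher = Matcher(vocab)
# One or more PERSON tokens followed by a verb
pattern = [{"ENT_TYPE": "PERSON", "OP": "+"}, {"POS": "VERB"}]
# greedy="LONGEST" drops the shorter overlapping match ("Musk founded")
matcher.add("PERSON_VERB", [pattern], greedy="LONGEST")

person_verb = [doc[start:end].text for _, start, end in matcher(doc)]
print(person_verb)
```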


## 🔄 8. Multiple Patterns for a Single Rule

In [20]:
matcher = Matcher(nlp.vocab)
matcher.add("TECH_TERMS", [
    [{"LOWER": "machine"}, {"LOWER": "learning"}],
    [{"LOWER": "deep"}, {"LOWER": "learning"}],
    [{"LOWER": "natural"}, {"LOWER": "language"}, {"LOWER": "processing"}]
])

doc = nlp("Deep learning and natural language processing are branches of machine learning.")

matches = matcher(doc)
for match_id, start, end in matches:
    print("Match:", doc[start:end].text)


Match: Deep learning
Match: natural language processing
Match: machine learning


## 🧼 9. Removing or Listing Patterns

In [22]:
# matcher.remove("TECH_TERMS")

# How to List Added Match Patterns Properly
from spacy.matcher import Matcher
import spacy

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Add a few example patterns
matcher.add("DATA_SCIENCE", [[{"LOWER": "data"}, {"LOWER": "science"}]])
matcher.add("MACHINE_LEARNING", [[{"LOWER": "machine"}, {"LOWER": "learning"}]])

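# List all rule names added to the matcher
# Note: _patterns is an internal attribute keyed by hash IDs, so it may change between spaCy versions
pattern_names = list(matcher._patterns.keys())
# Convert the hash IDs back to their string names
pattern_names = [nlp.vocab.strings[pid] for pid in pattern_names]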

print("Added pattern names:", pattern_names)


Added pattern names: ['DATA_SCIENCE', 'MACHINE_LEARNING']
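
If you already know the rule names, the public `Matcher` methods may be enough and avoid touching internals. A small sketch, assuming a blank English pipeline (only its vocab is used) and the same two rules as above:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # assumption: only the shared vocab is needed here
matcher = Matcher(nlp.vocab)
matcher.add("DATA_SCIENCE", [[{"LOWER": "data"}, {"LOWER": "science"}]])
matcher.add("MACHINE_LEARNING", [[{"LOWER": "machine"}, {"LOWER": "learning"}]])

rule_count = len(matcher)                  # number of rules
has_rule = "DATA_SCIENCE" in matcher       # membership test by rule name
print(rule_count, has_rule)

# get() returns the on_match callback (None here) and the rule's patterns
on_match, patterns = matcher.get("DATA_SCIENCE")
print(on_match, patterns)

# remove() deletes a rule by name
matcher.remove("MACHINE_LEARNING")
print(len(matcher), "MACHINE_LEARNING" in matcher)
```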


## ✅ Summary: When to Use Matcher

| Use Case                          | Use `Matcher`?               |
| --------------------------------- | ---------------------------- |
| Fixed phrases                     | ❌ Use `PhraseMatcher`        |
| POS-based patterns                | ✅ Yes                        |
| Patterns involving entity types   | ✅ Yes                        |
| Repetitions, optional tokens      | ✅ Yes                        |
| Sentence/structure-level patterns | Consider `DependencyMatcher` |


# 🧠 Phrase Matching with PhraseMatcher

## 📦 Step 1: Install spaCy and Load Language Model
```
pip install spacy
python -m spacy download en_core_web_sm
```


## 📘 Step 2: Load Language Model and Import PhraseMatcher

In [26]:
import spacy
from spacy.matcher import PhraseMatcher

# Load the small English pipeline: tokenizer, tagger, parser and NER (it ships without static word vectors)
nlp = spacy.load("en_core_web_sm")

# Create PhraseMatcher object
matcher = PhraseMatcher(nlp.vocab)

In [27]:
# Read a sample document
with open('./reaganomics.txt') as f:
    doc3 = nlp(f.read())
    
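# Step 3: Define the List of Phrases
# Note: 'supply-side economics,' includes a trailing comma, so it only matches where a comma follows
phrases_list = ['voodoo economics', 'supply-side economics,', 'trickle-down economics', 'free-market economics']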

# Step 4: Create Phrase Patterns
phrase_pattern = [nlp(text) for text in phrases_list]

# Step 5: Add Patterns
matcher.add('EcoMatcher',phrase_pattern)

# Step 6: Apply the matcher to the doc
found_matches = matcher(doc3)

# Step 7: Display the results
for match_id, start, end in found_matches:
    str_id = nlp.vocab.strings[match_id]
    span = doc3[start:end]
    print(match_id, str_id, start, end, span.text)

2351661100535932681 EcoMatcher 41 46 supply-side economics,
2351661100535932681 EcoMatcher 49 53 trickle-down economics
2351661100535932681 EcoMatcher 54 56 voodoo economics
2351661100535932681 EcoMatcher 61 65 free-market economics
2351661100535932681 EcoMatcher 2987 2991 trickle-down economics


## 🔁 Bonus: Matching Case Insensitively

In [29]:
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # Compare lowercased token text, so matching is case-insensitive
matcher.add("EcoMatcher", phrase_pattern)  # Reuse the Doc patterns created in the previous cell
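
The effect of `attr="LOWER"` is easiest to see side by side with the default. This is a minimal sketch, assuming a blank English pipeline (tokenizer only) and a made-up sentence rather than the `reaganomics.txt` file; the patterns are built with `nlp.make_doc`, which only tokenizes.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer only, no trained model needed
phrases = ["voodoo economics", "supply-side economics"]
# make_doc only tokenizes, which avoids running a full pipeline on each phrase
patterns = [nlp.make_doc(text) for text in phrases]

exact = PhraseMatcher(nlp.vocab)                   # default: compare verbatim token text
exact.add("EcoMatcher", patterns)
any_case = PhraseMatcher(nlp.vocab, attr="LOWER")  # compare lowercased token text
any_case.add("EcoMatcher", patterns)

doc = nlp("Critics called it Voodoo Economics; supporters preferred supply-side economics.")
exact_hits = [doc[s:e].text for _, s, e in exact(doc)]
lower_hits = [doc[s:e].text for _, s, e in any_case(doc)]
print(exact_hits)
print(lower_hits)
```

The default matcher should miss the capitalized "Voodoo Economics", while the `LOWER` matcher finds both phrases.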

## 🧠 Understanding How It Works
- <code>PhraseMatcher</code> matches exact sequences of tokens.
- It is faster and more efficient than <code>Matcher</code> when you have predefined phrases.
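- The <code>add()</code> function takes:
    - <code>key</code>: the rule name (its hash is returned as <code>match_id</code>)
    - <code>docs</code>: a list of <code>Doc</code> objects to use as patterns
- The output <code>matches</code> is a list of tuples (<code>match_id, start, end</code>), where <code>start</code> and <code>end</code> are token indices and <code>end</code> is exclusive.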
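
The points above can be sketched in a few lines. The example assumes a blank English pipeline and an invented sentence; it unpacks one match tuple and then uses the `as_spans=True` option to get `Span` objects labelled with the key instead of raw tuples.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer only
matcher = PhraseMatcher(nlp.vocab)
matcher.add("EcoMatcher", [nlp.make_doc("trickle-down economics")])

doc = nlp("He dismissed trickle-down economics outright.")

# Default output: (match_id, start, end) with token indices, end exclusive
match_id, start, end = matcher(doc)[0]
key_name = nlp.vocab.strings[match_id]  # map the hash back to the key string
print(key_name, start, end, doc[start:end].text)

# as_spans=True returns Span objects whose label is the key
spans = matcher(doc, as_spans=True)
print([(span.label_, span.text) for span in spans])
```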

# 🔄 PhraseMatcher vs Matcher

| Feature     | PhraseMatcher                | Matcher                              |
| ----------- | ---------------------------- | ------------------------------------ |
| Purpose     | Match exact phrases          | Match patterns with token attributes |
| Speed       | Faster                       | Slightly slower                      |
| Flexibility | Less flexible                | Very flexible (POS, DEP, etc.)       |
| Use Case    | Lookup dictionaries, phrases | Complex rule-based matching          |
