<div style="  background: linear-gradient(145deg, #0f172a, #1e293b);  border: 4px solid transparent;  border-radius: 14px;  padding: 18px 22px;  margin: 12px 0;  font-size: 26px;  font-weight: 600;  color: #f8fafc;  box-shadow: 0 6px 14px rgba(0,0,0,0.25);  background-clip: padding-box;  position: relative;">  <div style="    position: absolute;    inset: 0;    padding: 4px;    border-radius: 14px;    background: linear-gradient(90deg, #06b6d4, #3b82f6, #8b5cf6);    -webkit-mask:       linear-gradient(#fff 0 0) content-box,       linear-gradient(#fff 0 0);    -webkit-mask-composite: xor;    mask-composite: exclude;    pointer-events: none;  "></div>    <b>Customizing spaCy Models</b>    <br/>  <span style="color:#9ca3af; font-size: 18px; font-weight: 400;">(Natural Language Processing with spaCy)</span></div>

## Table of Contents

1. [Why Train spaCy Models?](#section-1)
2. [Training Data Preparation](#section-2)
3. [Training with spaCy](#section-3)
4. [Course Wrap-up and Review](#section-4)
5. [Conclusion](#conclusion)

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 1. Why Train spaCy Models?</span><br>

### The Need for Customization

Pre-trained spaCy models work well for general-purpose NLP. However, they may not have seen text from your specific domain during their original training.

**Common Specific Domains:**
*   **Twitter Data:** Contains slang, hashtags, and informal grammar.
*   **Medical Data:** Contains complex pharmaceutical names, symptoms, and diagnosis codes (e.g., "pulmonary fibrosis", "atrial fibrillation").

Training a custom model can give better results on your domain, and it is often needed for domain-specific text classification or Named Entity Recognition (NER).

### Assessing Model Performance

Before starting the training process, you should ask:
1.  Do the default spaCy models perform well enough on our data?
2.  Does our domain include many labels that are absent in spaCy models?

<div style="background: #e0f2fe; border-left: 16px solid #0284c7; padding: 14px 18px; border-radius: 8px; font-size: 18px; color: #075985;"> 💡 <b>Tip:</b> Always test the pre-trained model first. If it fails to recognize core entities in your domain, customization is necessary. </div>

### Example: Misclassification
In the example below, the standard model may misclassify "Oxford Street". It is a street, but the model may label it as an organization (`ORG`), while your application may need a different, more specific label.



In [None]:
import spacy

# Load the small English model
nlp = spacy.load("en_core_web_sm")

# Test text
text = "The car was navigating to the Oxford Street."

# Process the text
doc = nlp(text)

# Print entities and their labels
print([(ent.text, ent.label_) for ent in doc.ents])



**Analysis:**
If the output is `[('the Oxford Street', 'ORG')]` or similar, and you require it to be a `GPE` (Geopolitical Entity) or `FAC` (Facility), the model requires fine-tuning.
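
When deciding which label you actually need, it can help to look up what each label stands for. A minimal sketch using `spacy.explain`, which returns a short description for a label string (or `None` if the label is not in spaCy's glossary); the list of labels is just the ones discussed above:

```python
import spacy

# Labels mentioned in the analysis above
labels = ["ORG", "GPE", "FAC"]

# spacy.explain returns a short description, or None for unknown labels
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label}: {description}")
```

This only needs the spaCy package itself, not a downloaded model.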

### Output Labels in spaCy Models
If your domain includes labels absent in standard models (e.g., specific financial terms like `CAGR`, medical terms like `DISEASE` or `CHEMICAL`), you must train a custom model.
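
One way to check for absent labels is to compare the labels you need against the labels the model can predict. The sketch below is plain Python; `missing_labels` is a hypothetical helper, and the hard-coded `model_labels` tuple is only an illustrative subset. With a loaded pipeline you would typically pass `nlp.get_pipe("ner").labels` instead.

```python
def missing_labels(model_labels, required_labels):
    """Return the required labels that the model cannot predict, sorted."""
    return sorted(set(required_labels) - set(model_labels))


# Illustrative subset of entity labels (an assumption for this sketch);
# in practice, use nlp.get_pipe("ner").labels from your loaded pipeline
model_labels = ("PERSON", "ORG", "GPE", "LOC", "FAC", "DATE", "MONEY")

# Hypothetical labels a medical project might need
required_labels = ["PERSON", "DISEASE", "CHEMICAL"]

missing = missing_labels(model_labels, required_labels)
print(missing)
```

A non-empty result suggests that the pre-trained model alone will not cover your domain.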

**Steps for Custom Model Training:**
1.  **Collect** domain-specific data.
2.  **Annotate** the data (label intents, entities, etc.).
3.  **Determine** whether to update an existing model or train a model from scratch.

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 2. Training Data Preparation</span><br>

### The Training Workflow
Training a model involves an iterative process:
1.  Annotate and prepare input data.
2.  Initialize the model weights.
3.  Predict a few examples with current weights.
4.  Compare predictions with correct answers.
5.  Use an optimizer to calculate weights that improve performance.
6.  Update weights slightly.
7.  Repeat from step 3.
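
The loop in steps 2 to 7 can be illustrated with a toy example that is unrelated to spaCy's internals: a "model" with a single weight `w` that learns `y = 2x` by gradient descent. The data, learning rate, and number of epochs are all made up for illustration.

```python
import random

# Toy data for y = 2x; the "model" is a single weight w
data = [(float(x), 2.0 * x) for x in range(1, 6)]

random.seed(0)
w = random.uniform(-1.0, 1.0)  # step 2: initialize the weight
learning_rate = 0.01

for epoch in range(50):  # step 7: repeat
    random.shuffle(data)
    for x, y_true in data:
        y_pred = w * x                 # step 3: predict with current weight
        error = y_pred - y_true        # step 4: compare with the correct answer
        gradient = 2 * error * x       # step 5: gradient of the squared error
        w -= learning_rate * gradient  # step 6: update the weight slightly

print(round(w, 3))
```

spaCy hides these details behind `nlp.update`, but the overall cycle of predict, compare, and update is the same idea.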

### Annotating Data
Annotation is the process of labeling the intent or entities within your text. The raw data usually comes in a dictionary format or JSON-like structure before being converted for spaCy.

**Example 1: Medical Data Annotation**


In [None]:
annotated_data_medical = {
    "sentence": "Antiviral drugs used against influenza include neuraminidase inhibitors.",
    "entities": [
        {"label": "Medicine", "value": "neuraminidase inhibitors"}
    ]
}
print(annotated_data_medical)



**Example 2: General Entity Annotation**


In [None]:
annotated_data_general = {
    "sentence": "Bill Gates visited the SFO Airport.",
    "entities": [
        {"label": "PERSON", "value": "Bill Gates"},
        {"label": "LOC", "value": "SFO Airport"}
    ]
}
print(annotated_data_general)



### spaCy Training Data Format
Before spaCy can use the annotated data, it must be converted into a specific format: a list of `(text, annotations)` tuples.
*   **First element:** The raw sentence string.
*   **Second element:** A dictionary containing an `"entities"` key.
*   **Entities list:** A list of tuples `(start_char, end_char, label)`, where `end_char` is exclusive (the index just past the last character of the entity).
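
A value-based annotation like the ones in the previous section can be turned into this offset-based format by locating each entity string in the sentence. The sketch below is plain Python; `to_spacy_format` is a hypothetical helper, and it assumes each entity value occurs in the sentence (it uses the first occurrence if there are several).

```python
def to_spacy_format(record):
    """Convert {"sentence": ..., "entities": [{"label", "value"}, ...]}
    into a (text, {"entities": [(start, end, label), ...]}) tuple."""
    text = record["sentence"]
    spans = []
    for entity in record["entities"]:
        start = text.find(entity["value"])  # first occurrence only
        if start == -1:
            raise ValueError(f"{entity['value']!r} not found in sentence")
        end = start + len(entity["value"])  # end offset is exclusive
        spans.append((start, end, entity["label"]))
    return (text, {"entities": spans})


record = {
    "sentence": "Bill Gates visited the SFO Airport.",
    "entities": [
        {"label": "PERSON", "value": "Bill Gates"},
        {"label": "LOC", "value": "SFO Airport"},
    ],
}

converted = to_spacy_format(record)
print(converted)
```

For repeated entity strings you would need explicit offsets from your annotation tool rather than `str.find`.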



In [None]:
# The required format for spaCy training
training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Sam's house.", {"entities": [(13, 16, "PERSON")]}),
    ("I will go.", {"entities": []})
]

print("Training Data Sample:", training_data[0])



### The `Example` Object
The training loop does not accept raw text directly. In spaCy v3, each training item is an `Example` object, which pairs a `Doc` that will hold the model's predictions (`example.predicted`) with a reference `Doc` that holds the gold-standard annotations (`example.reference`).



In [None]:
import spacy
from spacy.training import Example

nlp = spacy.load("en_core_web_sm")

# Raw text and annotations
text = "I will visit you in Austin."
annotations = {"entities": [(20, 26, "GPE")]}

# Create a Doc object
doc = nlp.make_doc(text)

# Create an Example object
example_sentence = Example.from_dict(doc, annotations)

# Inspect the example object (converted back to dict for visibility)
print(example_sentence.to_dict())
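
Entity offsets need to line up with token boundaries to be usable as annotations, so it is worth checking them before building `Example` objects. A minimal sketch using `Doc.char_span`, which returns a `Span` for aligned offsets and `None` otherwise (with its default strict alignment); a blank English pipeline is used here as an assumption so that no model download is needed:

```python
import spacy

# A blank English pipeline only tokenizes, so no model download is needed
nlp = spacy.blank("en")
text = "I will visit you in Austin."
doc = nlp.make_doc(text)

# Offsets that line up with token boundaries give a Span
good = doc.char_span(20, 26, label="GPE")

# Offsets that cut a token in half give None
bad = doc.char_span(20, 25, label="GPE")

print(good.text, good.label_)
print(bad)
```

Running such a check over the whole dataset is a cheap way to catch off-by-one errors in hand-written offsets.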



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 3. Training with spaCy</span><br>

### Training Steps Implementation
When updating an existing model, we generally follow these steps:
1.  Annotate and prepare input data.
2.  **Disable other pipeline components** (to prevent them from being affected by training).
3.  Train the model for a few **epochs**.
4.  Evaluate model performance.
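
For step 4, entity predictions are commonly scored with precision, recall, and F1 over exact span matches. The sketch below is plain Python with made-up spans; `score_entities` is a hypothetical helper, not a spaCy function.

```python
def score_entities(predicted, gold):
    """Exact-match precision, recall and F1 over (start, end, label) tuples."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up spans for one sentence, for illustration only
gold = [(20, 26, "GPE"), (0, 4, "PERSON")]
predicted = [(20, 26, "GPE"), (7, 12, "ORG")]

scores = score_entities(predicted, gold)
print(scores)
```

spaCy also provides `nlp.evaluate`, which takes a list of `Example` objects and reports entity scores for a whole pipeline.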

### Disabling Pipeline Components
If we are only training the Named Entity Recognizer (NER), we should disable the tagger, parser, and other components to keep their weights stable and speed up training.



In [None]:
# Identify pipes to disable (everything except 'ner')
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

# In a training script, wrap the loop in 'with nlp.select_pipes(disable=other_pipes):'
# so the pipes are disabled temporarily and restored afterwards.
# (nlp.disable_pipes() is the older, legacy name for the same idea.)
print(f"Pipes to disable: {other_pipes}")
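
The effect of temporarily disabling components can be seen by inspecting `nlp.pipe_names` before, inside, and after a `select_pipes` block. This sketch assumes a blank pipeline with two untrained components, chosen only so that it runs without a downloaded model:

```python
import spacy

# Blank pipeline with two untrained components (no model download needed)
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
nlp.add_pipe("ner")

before = list(nlp.pipe_names)

# Inside the block only "ner" is active; the others are restored on exit
with nlp.select_pipes(enable=["ner"]):
    during = list(nlp.pipe_names)

after = list(nlp.pipe_names)

print("before:", before)
print("during:", during)
print("after: ", after)
```

`select_pipes` accepts either `enable` or `disable`, so you can name whichever list is shorter.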



### The Training Loop
Here is the complete procedure to update the model. We iterate through the data multiple times (epochs), shuffle the data to prevent order bias, and update the model using `nlp.update`.



In [None]:
import random
import spacy
from spacy.training import Example

# Load model
nlp = spacy.load("en_core_web_sm")

# Define training data
training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Sam's house.", {"entities": [(13, 16, "PERSON")]}),
    ("I will go.", {"entities": []})
]

# 1. Create an optimizer
optimizer = nlp.create_optimizer()

# 2. Define other pipes to disable
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

# 3. Training loop
epochs = 5
# select_pipes is the spaCy v3 name for the older disable_pipes
with nlp.select_pipes(disable=other_pipes):
    for i in range(epochs):
        random.shuffle(training_data)
        losses = {}
        
        # For larger datasets, batching with spaCy's minibatch helper is recommended;
        # here we update on one example at a time to keep the loop easy to follow.
        for text, annotation in training_data:
            doc = nlp.make_doc(text)
            example = Example.from_dict(doc, annotation)
            
            # Update the model
            nlp.update([example], sgd=optimizer, losses=losses)
            
        print(f"Epoch {i+1} Losses: {losses}")
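
For comparison, here is a batched variant of the loop using `spacy.util.minibatch`. As an assumption for this sketch, it trains a fresh `ner` component in a blank English pipeline rather than updating `en_core_web_sm`, so it runs without a downloaded model; the batch size and epoch count are arbitrary.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch

random.seed(0)

training_data = [
    ("I will visit you in Austin.", {"entities": [(20, 26, "GPE")]}),
    ("I'm going to Sam's house.", {"entities": [(13, 16, "PERSON")]}),
    ("I will go.", {"entities": []}),
]

# Assumption: a blank pipeline with an untrained "ner" component
nlp = spacy.blank("en")
nlp.add_pipe("ner")

# Build Example objects once, up front
examples = [
    Example.from_dict(nlp.make_doc(text), annotations)
    for text, annotations in training_data
]

# initialize() collects the labels from the examples and returns an optimizer
optimizer = nlp.initialize(lambda: examples)

for epoch in range(5):
    random.shuffle(examples)
    losses = {}
    # Update on small batches instead of single examples
    for batch in minibatch(examples, size=2):
        nlp.update(batch, sgd=optimizer, losses=losses)
    print(f"Epoch {epoch + 1} Losses: {losses}")
```

Three sentences are far too few to learn anything useful; the point is only the shape of the loop.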



### Saving and Loading the Model
Once the model is trained, save it to disk so that you can reuse it later.



In [None]:
import shutil
import os

# Define output directory
output_dir = "ner_model_output"

# Clean up previous runs for this notebook
if os.path.exists(output_dir):
    shutil.rmtree(output_dir)
os.makedirs(output_dir)

# Save the NER component specifically (or nlp.to_disk for the whole pipeline)
ner = nlp.get_pipe("ner")
ner.to_disk(output_dir)
print(f"Model saved to {output_dir}")

# Load the saved component back:
# create an untrained NER pipe and read the saved data into it
new_ner = nlp.create_pipe("ner")
new_ner.from_disk(output_dir)

# Note: in spaCy v3, nlp.add_pipe expects a registered factory name (a string),
# not a component instance. To reuse the whole pipeline, save it with
# nlp.to_disk(...) and load it again with spacy.load(...).
print("Model loaded successfully.")
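
Saving and reloading the whole pipeline is often simpler than handling a single component. A minimal sketch, assuming a tiny blank pipeline stands in for your trained one and using a temporary directory instead of a real output path:

```python
import tempfile

import spacy
from spacy.training import Example

# Assumption: a tiny blank pipeline stands in for your trained one
nlp = spacy.blank("en")
nlp.add_pipe("ner")
example = Example.from_dict(
    nlp.make_doc("I will visit you in Austin."),
    {"entities": [(20, 26, "GPE")]},
)
nlp.initialize(lambda: [example])

with tempfile.TemporaryDirectory() as output_dir:
    # Save every component together with the config and vocabulary
    nlp.to_disk(output_dir)
    # Load it back the same way you would load an installed pipeline
    nlp_loaded = spacy.load(output_dir)

print(nlp_loaded.pipe_names)
print(nlp_loaded.get_pipe("ner").labels)
```

The reloaded pipeline can then be called on text like any other `nlp` object.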



### Model for Inference
After loading the model, you can use it to predict entities on new text.



In [None]:
# Inference example
text = "I am flying to Austin specifically."
doc = nlp(text)

# Extract entities
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(f"Inference Results: {entities}")



---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 4. Course Wrap-up and Review</span><br>

This section summarizes the key concepts covered in the "Natural Language Processing with spaCy" course.

### Chapter 1: Introduction to NLP and spaCy
*   Focused on using spaCy's text processing pipelines.
*   Key components: Tokenizer, Tagger, Parser, NER.
*   Output: The `Doc` object.

### Chapter 2: Linguistic Annotations and Word Vectors
*   Worked with spaCy classes: `Doc`, `Token`, and `Span`.
*   Used word vectors to measure semantic similarity and explore analogies (e.g., King - Man + Woman ≈ Queen).

### Chapter 3: Data Analysis with spaCy
*   Used `Matcher` and `PhraseMatcher` to extract terms based on patterns.

**Matcher Example Code:**


In [None]:
import spacy
from spacy.matcher import Matcher, PhraseMatcher

nlp = spacy.load("en_core_web_sm")

# 1. Token Matcher
matcher = Matcher(nlp.vocab)
# Pattern: "good" followed by "morning" or "evening"
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])

# 2. Phrase Matcher
phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
terms = ["Stocks", "Bonds"]
patterns = [nlp.make_doc(term) for term in terms]
phrase_matcher.add("InvestmentTerms", patterns)

print("Matchers initialized successfully.")
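
To see what the matchers return, you can run them on a `Doc` and slice out the matched spans. This sketch assumes a blank English pipeline (both patterns rely only on lowercase token text, so the tokenizer is enough) and a made-up sentence:

```python
import spacy
from spacy.matcher import Matcher, PhraseMatcher

# A blank pipeline is enough here: both matchers only use token text
nlp = spacy.blank("en")
doc = nlp("Good morning! I hold stocks and bonds. Have a good evening.")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "good"}, {"LOWER": {"IN": ["morning", "evening"]}}]
matcher.add("morning_greeting", [pattern])

phrase_matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
phrase_patterns = [nlp.make_doc(term) for term in ["Stocks", "Bonds"]]
phrase_matcher.add("InvestmentTerms", phrase_patterns)

# Each match is a (match_id, start_token, end_token) tuple
greetings = [doc[start:end].text for _, start, end in matcher(doc)]
terms_found = [doc[start:end].text for _, start, end in phrase_matcher(doc)]

print(greetings)
print(terms_found)
```

Because both matchers compare on `LOWER`, "Good morning" and "good evening" are matched regardless of capitalization.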



### Chapter 4: Customizing spaCy Models
*   Annotating data for training.
*   Training models (updating weights).
*   Using custom models for inference.

### Recommended Resources
To further your learning, the following resources are recommended:
*   Introduction to Deep Learning in Python
*   Introduction to Deep Learning with PyTorch
*   Introduction to ChatGPT

---

<br><span style="  display: inline-block;  color: #fff;  background: linear-gradient(135deg, #a31616ff, #02b7ffff);  padding: 12px 20px;  border-radius: 12px;  font-size: 28px;  font-weight: 700;  box-shadow: 0 4px 12px rgba(0,0,0,0.2);  transition: transform 0.2s ease, box-shadow 0.2s ease;">  🧾 5. Conclusion</span><br>

### Summary
In this notebook, we explored the advanced capabilities of spaCy, specifically focusing on **custom model training**. We learned that while pre-trained models are powerful, specific domains like medicine or social media often require custom training data to achieve high accuracy.

**Key Takeaways:**
1.  **Data Preparation is Key:** The quality of your model depends heavily on the quality and format of your annotated data (a list of tuples with character offsets).
2.  **The `Example` Object:** spaCy v3 requires converting raw text and annotations into `Example` objects for training.
3.  **Training Loop:** We must disable unrelated pipeline components, shuffle data, and iterate through epochs to update model weights using `nlp.update`.
4.  **Persistence:** Trained models can be saved to disk and reloaded for inference in production environments.

### Next Steps
*   **Practice:** Create a custom dataset for a domain you are interested in (e.g., sports, finance).
*   **Experiment:** Try training a model from scratch (blank language model) versus updating an existing one to see the difference in performance.
*   **Explore:** Look into spaCy's config system (`config.cfg`) for more advanced hyperparameter tuning and training configurations.
