# Fine-Tuning spaCy NER for Nutrition Product Entities

In this notebook, we'll learn how to **fine-tune a spaCy Named Entity Recognition (NER) model** to recognize domain-specific entities in nutrition supplement product descriptions.

## Use Case
We want to extract key information from nutrition product descriptions:
- **PRODUCT**: Product names (e.g., "Whey Protein Isolate 90", "Creatine Monohydrate")
- **INGREDIENT**: Active ingredients (e.g., "BCAAs", "glutamic acid", "caffeine")
- **QUANTITY**: Amounts and measurements (e.g., "23g of protein", "25g serving")
- **BENEFIT**: Health/fitness benefits (e.g., "muscle growth", "recovery")

## Why Fine-Tune?
Pre-trained spaCy models are trained on general text (news, web content) and don't recognize domain-specific entities well. Fine-tuning allows us to:
1. Add new entity types specific to our domain
2. Improve recognition of existing entity types in specialized contexts
3. Create a model tailored to our specific application

## 1. Setup and Installation

In [None]:
# Install required packages
!pip install -U spacy
!python -m spacy download en_core_web_sm

In [None]:
# Standard imports
import spacy
from spacy.tokens import DocBin
from spacy.training import Example
from spacy.util import minibatch, compounding
import pandas as pd
import random
import warnings
from pathlib import Path

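# Silence warnings to keep the notebook output tidy. Note that this also hides
# spaCy's warnings about entity spans that do not align with token boundaries.
warnings.filterwarnings('ignore')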

print(f"spaCy version: {spacy.__version__}")

## 2. Load and Explore the Dataset

We'll use the GoNutrition product descriptions dataset, which contains detailed information about nutrition supplements.

In [None]:
# Load the dataset
df = pd.read_csv('../02-Text_Classification/gonutrition.csv')
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Let's look at one product description
print("Product:", df.iloc[0]['product_name'])
print("\nDescription (first 1000 chars):")
print(df.iloc[0]['product_description'][:1000])

## 3. Test the Pre-trained Model

Let's see how well the default spaCy model recognizes entities in our domain-specific text.

In [None]:
# Load the pre-trained model
nlp = spacy.load('en_core_web_sm')

# Helper function to display entities
def show_ents(doc):
    """Display entities found in a document."""
    if doc.ents:
        for ent in doc.ents:
            print(f"{ent.text:30} | {ent.label_:10} | {spacy.explain(ent.label_)}")
    else:
        print('No named entities found.')

In [None]:
# Test on a nutrition-related sentence
test_text = "Whey Protein Isolate 90 provides 23g of protein per 25g serving with BCAAs and glutamic acid for muscle growth."
doc = nlp(test_text)

print("Test text:")
print(test_text)
print("\nEntities found by pre-trained model:")
show_ents(doc)

In [None]:
# Visualize with displaCy
from spacy import displacy

displacy.render(doc, style='ent', jupyter=True)

As you can see, the pre-trained model doesn't recognize:
- "Whey Protein Isolate 90" as a PRODUCT
- "BCAAs" or "glutamic acid" as INGREDIENTs
- "muscle growth" as a BENEFIT

**This is why we need to fine-tune!**

## 4. Prepare Training Data

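spaCy accepts training annotations as `(text, annotations)` tuples, where each entity is given by its character offsets (`start` inclusive, `end` exclusive) and a label: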
```python
("text", {"entities": [(start, end, "LABEL"), ...]})
```
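
Counting character offsets by hand is error-prone, so it can help to compute them from the phrases themselves. The sketch below assumes a hypothetical helper, `annotate`, that is not part of spaCy: it locates each phrase with `str.find` and builds the tuple in the format shown above.

```python
def annotate(text, labeled_phrases):
    """Build a (text, {"entities": [...]}) tuple by locating each phrase in text.

    labeled_phrases is a list of (phrase, label) pairs given in reading order.
    Each phrase is searched for after the end of the previous match, so a
    phrase that occurs twice resolves to the next unused occurrence.
    """
    entities = []
    cursor = 0
    for phrase, label in labeled_phrases:
        start = text.find(phrase, cursor)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in {text!r} after position {cursor}")
        end = start + len(phrase)  # end offset is exclusive
        entities.append((start, end, label))
        cursor = end
    return (text, {"entities": entities})


example = annotate(
    "Whey Protein Isolate 90 provides 23g of protein per 25g serving.",
    [
        ("Whey Protein Isolate 90", "PRODUCT"),
        ("23g of protein", "QUANTITY"),
        ("25g serving", "QUANTITY"),
    ],
)
print(example[1]["entities"])
# [(0, 23, 'PRODUCT'), (33, 47, 'QUANTITY'), (52, 63, 'QUANTITY')]
```

Because every offset comes from the phrase text, a typo in a phrase fails loudly with a `ValueError` instead of silently producing a shifted span.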

We'll manually annotate some examples from our dataset. In production, you would use annotation tools like:
- [Prodigy](https://prodi.gy/) (by spaCy creators)
- [Label Studio](https://labelstud.io/)
- [Doccano](https://github.com/doccano/doccano)

In [None]:
# Define our custom entity labels
CUSTOM_LABELS = ["PRODUCT", "INGREDIENT", "QUANTITY", "BENEFIT"]

# Manually annotated training data
# Format: (text, {"entities": [(start_char, end_char, label), ...]})
TRAIN_DATA = [
    (
        "Whey Protein Isolate 90 provides 23g of protein per 25g serving.",
        {"entities": [(0, 23, "PRODUCT"), (33, 47, "QUANTITY"), (52, 63, "QUANTITY")]}
    ),
    (
        "This whey protein isolate powder is 90% protein and extremely low in fat.",
        {"entities": [(5, 32, "PRODUCT"), (36, 47, "QUANTITY")]}
    ),
    (
        "It's packed with BCAAs and glutamic acid for muscle growth and recovery.",
        {"entities": [(17, 22, "INGREDIENT"), (27, 40, "INGREDIENT"), (45, 58, "BENEFIT"), (63, 71, "BENEFIT")]}
    ),
    (
        "Creatine Monohydrate helps increase power and strength during high intensity exercise.",
        {"entities": [(0, 20, "PRODUCT"), (36, 41, "BENEFIT"), (46, 54, "BENEFIT")]}
    ),
    (
        "Beta Alanine has been shown to increase carnosine production.",
        {"entities": [(0, 12, "INGREDIENT"), (40, 49, "INGREDIENT")]}
    ),
    (
        "Volt pre workout formula includes 12 advanced active ingredients with 389mg of caffeine.",
        # Entity spans must not overlap, so the quantity covers only "389mg"
        {"entities": [(0, 4, "PRODUCT"), (70, 75, "QUANTITY"), (79, 87, "INGREDIENT")]}
    ),
    (
        "Take 5g of Creatine Monohydrate to improve lean muscle gains.",
        {"entities": [(5, 7, "QUANTITY"), (11, 31, "PRODUCT"), (43, 60, "BENEFIT")]}
    ),
    (
        "GN Whey Protein 80 contains 20g of premium grade protein per 25g serving.",
        {"entities": [(0, 18, "PRODUCT"), (28, 56, "QUANTITY"), (61, 72, "QUANTITY")]}
    ),
    (
        "L Taurine and Beta Alanine work to reduce muscular fatigue and soreness.",
        {"entities": [(0, 9, "INGREDIENT"), (14, 26, "INGREDIENT"), (42, 58, "BENEFIT")]}
    ),
    (
        "Our protein powder helps with faster recovery and muscle repair.",
        {"entities": [(4, 18, "PRODUCT"), (30, 45, "BENEFIT"), (50, 63, "BENEFIT")]}
    ),
    (
        "Arginine Alpha Ketoglutarate enhances nitric oxide production for better pump.",
        {"entities": [(0, 28, "INGREDIENT"), (38, 50, "INGREDIENT"), (66, 77, "BENEFIT")]}
    ),
    (
        "Citrulline Malate at 4g per serving can increase training intensity.",
        {"entities": [(0, 17, "INGREDIENT"), (21, 35, "QUANTITY"), (49, 67, "BENEFIT")]}
    ),
    (
        "Vitamin B6 and guarana extract provide energy and mental focus.",
        {"entities": [(0, 10, "INGREDIENT"), (15, 30, "INGREDIENT"), (39, 45, "BENEFIT"), (50, 62, "BENEFIT")]}
    ),
    (
        "Each scoop contains 22g of whey protein isolate with essential amino acids.",
        # Entity spans must not overlap, so the quantity covers only "22g"
        {"entities": [(20, 23, "QUANTITY"), (27, 47, "PRODUCT"), (53, 74, "INGREDIENT")]}
    ),
    (
        "ZMA zinc and magnesium formula supports normal testosterone production.",
        {"entities": [(0, 3, "PRODUCT"), (4, 8, "INGREDIENT"), (13, 22, "INGREDIENT"), (40, 70, "BENEFIT")]}
    ),
]

print(f"Created {len(TRAIN_DATA)} training examples")

In [None]:
# Let's verify our annotations are correct
def verify_annotations(train_data):
    """Verify that annotation spans match the text."""
    for text, annotations in train_data:
        for start, end, label in annotations["entities"]:
            span_text = text[start:end]
            print(f"{label:12} | '{span_text}'")
        print("-" * 40)

# Check every example: a span that cuts a word or includes punctuation is a labeling error
print("Verifying annotations:")
print("=" * 40)
verify_annotations(TRAIN_DATA)
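
Reading printed spans by eye works for a handful of examples, but an automated check scales better. The sketch below is a spaCy-free sanity check using a hypothetical helper, `find_annotation_problems`: it flags spans that fall outside the text, carry stray whitespace or trailing punctuation, cut through a word, or overlap another span. The two sample sentences are made up for illustration and are deliberately mislabeled.

```python
def find_annotation_problems(train_data):
    """Return human-readable descriptions of suspicious entity spans."""
    problems = []
    for idx, (text, annotations) in enumerate(train_data):
        prev_end = 0
        # Sort by start offset so overlaps can be detected in one pass
        for start, end, label in sorted(annotations["entities"]):
            where = f"example {idx}, span ({start}, {end}, {label!r})"
            if not 0 <= start < end <= len(text):
                problems.append(f"{where}: offsets fall outside the text")
                continue
            span_text = text[start:end]
            if span_text != span_text.strip():
                problems.append(f"{where}: leading or trailing whitespace in {span_text!r}")
            if span_text[-1] in ".,;:!?":
                problems.append(f"{where}: trailing punctuation in {span_text!r}")
            if start > 0 and text[start - 1].isalnum() and text[start].isalnum():
                problems.append(f"{where}: starts in the middle of a word")
            if end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
                problems.append(f"{where}: ends in the middle of a word")
            if start < prev_end:
                problems.append(f"{where}: overlaps the previous span")
            prev_end = max(prev_end, end)
    return problems


# Hypothetical, deliberately broken annotations
sample = [
    ("Add 10g of BCAA powder.", {"entities": [(4, 7, "QUANTITY"), (11, 23, "PRODUCT")]}),
    ("Zinc supports recovery.", {"entities": [(0, 4, "INGREDIENT"), (2, 13, "BENEFIT")]}),
]
problems = find_annotation_problems(sample)
for problem in problems:
    print(problem)
```

This check only approximates token boundaries. For an exact check against spaCy's tokenizer, `doc.char_span(start, end)` returns `None` when a span does not align with token boundaries.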

## 5. Create the Training Pipeline

We have two options for fine-tuning:
1. **Update existing model**: Add new entity types to a pre-trained model
2. **Train from blank**: Create a new model from scratch

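We'll use option 1 (updating an existing model) since we want to build on the model's existing capabilities while adding our custom entities. Be aware that training only on the new labels can make the model forget its original entity types (often called catastrophic forgetting) unless examples of those types are mixed into the training data.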

In [None]:
def create_training_model(base_model="en_core_web_sm", custom_labels=None):
    """
    Create a model for training by loading a base model and adding custom NER labels.
    
    Args:
        base_model: Name of the base spaCy model to use
        custom_labels: List of custom entity labels to add
    
    Returns:
        spaCy nlp object ready for training
    """
    # Load the base model
    nlp = spacy.load(base_model)
    
    # Get the NER component
    if "ner" not in nlp.pipe_names:
        ner = nlp.add_pipe("ner", last=True)
    else:
        ner = nlp.get_pipe("ner")
    
    # Add custom labels
    if custom_labels:
        for label in custom_labels:
            ner.add_label(label)
            print(f"Added label: {label}")
    
    return nlp

# Create the model
nlp_train = create_training_model(custom_labels=CUSTOM_LABELS)

In [None]:
# Check all labels in the model
ner = nlp_train.get_pipe("ner")
print("All NER labels in the model:")
for label in ner.labels:
    print(f"  - {label}")

## 6. Convert Training Data to spaCy Format

We need to convert our training data to spaCy's `Example` objects.

In [None]:
def create_examples(nlp, train_data):
    """
    Convert training data to spaCy Example objects.
    
    Args:
        nlp: spaCy nlp object
        train_data: List of (text, annotations) tuples
    
    Returns:
        List of Example objects
    """
    examples = []
    for text, annotations in train_data:
        doc = nlp.make_doc(text)
        example = Example.from_dict(doc, annotations)
        examples.append(example)
    return examples

# Create examples
train_examples = create_examples(nlp_train, TRAIN_DATA)
print(f"Created {len(train_examples)} training examples")
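
It may help to see why character offsets have to line up with tokens. spaCy's NER predicts one tag per token, so each annotated span is converted to per-token tags in the BILUO scheme (Begin, In, Last, Unit, Out). The sketch below is a simplified illustration that splits on whitespace. It is not spaCy's implementation: spaCy uses its own tokenizer and provides `spacy.training.offsets_to_biluo_tags` for the real conversion. The function name `offsets_to_tags_sketch` is hypothetical.

```python
def offsets_to_tags_sketch(text, entities):
    """Map (start, end, label) character spans to BILUO tags over whitespace tokens."""
    # Record the character span of every whitespace-separated token
    tokens = []
    cursor = 0
    for word in text.split():
        start = text.index(word, cursor)
        cursor = start + len(word)
        tokens.append((start, cursor))

    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in entities:
        # Tokens that share at least one character with the entity span
        touched = [i for i, (s, e) in enumerate(tokens) if s < ent_end and e > ent_start]
        aligned = (
            bool(touched)
            and tokens[touched[0]][0] == ent_start
            and tokens[touched[-1]][1] == ent_end
        )
        if not aligned:
            # The span cuts through a token, so no clean tag can be assigned
            for i in touched:
                tags[i] = "-"
        elif len(touched) == 1:
            tags[touched[0]] = f"U-{label}"
        else:
            tags[touched[0]] = f"B-{label}"
            for i in touched[1:-1]:
                tags[i] = f"I-{label}"
            tags[touched[-1]] = f"L-{label}"
    return tags


text = "Beta Alanine has been shown to increase carnosine production."
good = offsets_to_tags_sketch(text, [(0, 12, "INGREDIENT"), (40, 49, "INGREDIENT")])
bad = offsets_to_tags_sketch(text, [(0, 11, "INGREDIENT")])  # "Beta Alanin"
print(good)
print(bad)
```

In the second call the span ends one character early, so the affected tokens get `-` instead of an entity tag. spaCy treats misaligned spans in a similar way, which means a mislabeled example contributes little or nothing to training.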

## 7. Train the Model

Now we'll train the NER component with our custom data.

In [None]:
def train_ner(nlp, train_examples, n_iter=30, drop=0.5):
    """
    Train the NER component of a spaCy model.
    
    Args:
        nlp: spaCy nlp object
        train_examples: List of Example objects
        n_iter: Number of training iterations
        drop: Dropout rate for regularization
    
    Returns:
        List of NER loss values, one per iteration
    """
    # Get the NER component
    ner = nlp.get_pipe("ner")
    
    # Disable other pipeline components during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]
    
    losses_history = []
    
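    with nlp.select_pipes(disable=other_pipes):
        # Keep the pre-trained weights: resume_training() returns an optimizer
        # without re-initializing the model (nlp.initialize() would reset the weights)
        optimizer = nlp.resume_training()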
        
        print("Training the NER model...")
        print("=" * 50)
        
        for iteration in range(n_iter):
            # Shuffle training data
            random.shuffle(train_examples)
            losses = {}
            
            # Create minibatches
            batches = minibatch(train_examples, size=compounding(4.0, 32.0, 1.001))
            
            for batch in batches:
                nlp.update(
                    batch,
                    drop=drop,
                    sgd=optimizer,
                    losses=losses
                )
            
            losses_history.append(losses.get('ner', 0))
            
            # Print progress every 5 iterations
            if (iteration + 1) % 5 == 0:
                print(f"Iteration {iteration + 1:3d}/{n_iter} | Loss: {losses.get('ner', 0):.4f}")
    
    print("=" * 50)
    print("Training complete!")
    
    return losses_history

In [None]:
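# Set random seeds for reproducibility
# (fix_random_seed also seeds NumPy and the ML backend used during training)
random.seed(42)
spacy.util.fix_random_seed(42)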

# Train the model
losses = train_ner(nlp_train, train_examples, n_iter=30, drop=0.5)
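
The batch size in the training loop comes from `compounding(4.0, 32.0, 1.001)`. The sketch below re-implements that schedule in plain Python to show how slowly it grows. It assumes the schedule starts at the first value, multiplies by the third value on every step, and is capped at the second value. The name `compounding_sketch` is hypothetical and is not the library function.

```python
from itertools import islice


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)
        value *= compound


# Batch sizes over the first 120 steps; minibatch() truncates each value to an int
sizes = list(islice(compounding_sketch(4.0, 32.0, 1.001), 120))
print(int(sizes[0]), int(sizes[-1]))
# 4 4

# Number of steps needed before the schedule reaches the cap of 32
steps_to_cap = next(i for i, v in enumerate(compounding_sketch(4.0, 32.0, 1.001)) if v >= 32.0)
print(steps_to_cap)
```

With only 15 examples, each epoch produces about four batches, and our loop creates a fresh schedule every epoch, so the batch size effectively stays at 4. The compounding schedule starts to matter only on much larger datasets.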

In [None]:
# Visualize the training loss
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(losses, marker='o', markersize=3)
plt.title('NER Training Loss Over Iterations')
plt.xlabel('Iteration')
plt.ylabel('Loss')
plt.grid(True, alpha=0.3)
plt.show()

## 8. Evaluate the Fine-Tuned Model

Let's test our fine-tuned model on the same text we tested earlier.

In [None]:
# Test on the same text as before
test_text = "Whey Protein Isolate 90 provides 23g of protein per 25g serving with BCAAs and glutamic acid for muscle growth."
doc = nlp_train(test_text)

print("Test text:")
print(test_text)
print("\nEntities found by FINE-TUNED model:")
show_ents(doc)

In [None]:
# Visualize with displaCy
# Define custom colors for our new labels
colors = {
    "PRODUCT": "#7aecec",      # Light blue
    "INGREDIENT": "#bfeeb7",   # Light green  
    "QUANTITY": "#feca74",     # Orange
    "BENEFIT": "#ff9561"       # Coral
}

options = {"ents": CUSTOM_LABELS + ["MONEY", "ORG", "GPE", "DATE"], "colors": colors}

displacy.render(doc, style='ent', jupyter=True, options=options)

In [None]:
# Test on new, unseen text
test_texts = [
    "Add 10g of BCAA powder to your post-workout shake for enhanced recovery.",
    "Our Mass Gainer contains 50g of protein and 250g of carbohydrates per serving.",
    "Caffeine and taurine boost energy levels and mental alertness during training.",
    "Take Omega-3 Fish Oil daily for joint health and reduced inflammation."
]

print("Testing on unseen text:")
print("=" * 60)

for text in test_texts:
    doc = nlp_train(text)
    print(f"\nText: {text}")
    print("Entities:")
    if doc.ents:
        for ent in doc.ents:
            print(f"  - {ent.text} ({ent.label_})")
    else:
        print("  No entities found")
    print("-" * 60)
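
Eyeballing predictions is a useful first check, but it does not tell us how often the model is right. A common measure is exact-match precision, recall, and F1 over entity spans. The sketch below computes these from plain `(start, end, label)` tuples. The helper `ner_prf` is hypothetical, and the gold and predicted spans are made-up values for the first test sentence above, not actual model output.

```python
def ner_prf(gold_docs, pred_docs):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold, pred = set(gold), set(pred)
        tp += len(gold & pred)   # predicted spans that match gold exactly
        fp += len(pred - gold)   # predicted spans with no exact gold match
        fn += len(gold - pred)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# "Add 10g of BCAA powder to your post-workout shake for enhanced recovery."
# Hypothetical gold annotation and hypothetical model prediction
gold_docs = [[(4, 7, "QUANTITY"), (11, 22, "PRODUCT"), (54, 71, "BENEFIT")]]
pred_docs = [[(4, 7, "QUANTITY"), (11, 15, "INGREDIENT"), (54, 71, "BENEFIT")]]

p, r, f = ner_prf(gold_docs, pred_docs)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
# P=0.67 R=0.67 F1=0.67
```

Predicted spans can be collected from a processed document with `[(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]`. spaCy can also compute these scores itself: `nlp.evaluate(examples)` returns a dictionary that includes `ents_p`, `ents_r`, and `ents_f`.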

## 9. Test on Actual Product Descriptions

In [None]:
# Test on a real product description from our dataset
# Take a snippet from the first product
sample_text = df.iloc[0]['product_description'][:500]

doc = nlp_train(sample_text)

print("Sample from dataset:")
print("=" * 60)
print(sample_text)
print("\n" + "=" * 60)
print("\nEntities found:")
for ent in doc.ents:
    print(f"  {ent.label_:12} | {ent.text}")

In [None]:
# Visualize the entities in the product description
displacy.render(doc, style='ent', jupyter=True, options=options)

## 10. Save the Fine-Tuned Model

Let's save our model so we can use it later.

In [None]:
# Create output directory
output_dir = Path("./nutrition_ner_model")
output_dir.mkdir(exist_ok=True)

# Save the model
nlp_train.to_disk(output_dir)
print(f"Model saved to: {output_dir.absolute()}")

In [None]:
# Test loading the saved model
nlp_loaded = spacy.load(output_dir)

# Verify it works
test_doc = nlp_loaded("Creatine Monohydrate provides 5g per serving for strength gains.")
print("Testing loaded model:")
for ent in test_doc.ents:
    print(f"  {ent.label_:12} | {ent.text}")

## 11. Entity Extraction Pipeline

Let's create a function to extract structured information from product descriptions.

In [None]:
def extract_nutrition_entities(text, nlp):
    """
    Extract nutrition-related entities from text.
    
    Args:
        text: Input text to process
        nlp: Trained spaCy model
    
    Returns:
        Dictionary with extracted entities by type
    """
    doc = nlp(text)
    
    entities = {
        "products": [],
        "ingredients": [],
        "quantities": [],
        "benefits": []
    }
    
    for ent in doc.ents:
        if ent.label_ == "PRODUCT":
            entities["products"].append(ent.text)
        elif ent.label_ == "INGREDIENT":
            entities["ingredients"].append(ent.text)
        elif ent.label_ == "QUANTITY":
            entities["quantities"].append(ent.text)
        elif ent.label_ == "BENEFIT":
            entities["benefits"].append(ent.text)
    
    # Remove duplicates while preserving order
    for key in entities:
        entities[key] = list(dict.fromkeys(entities[key]))
    
    return entities

In [None]:
# Process all products in the dataset
print("Extracting entities from all products:")
print("=" * 60)

for idx, row in df.iterrows():
    entities = extract_nutrition_entities(row['product_description'][:1000], nlp_train)
    
    print(f"\n{row['product_name']}")
    print("-" * 40)
    print(f"  Products:    {entities['products'][:5]}")
    print(f"  Ingredients: {entities['ingredients'][:5]}")
    print(f"  Quantities:  {entities['quantities'][:5]}")
    print(f"  Benefits:    {entities['benefits'][:5]}")

## 12. Summary and Next Steps

### What We Learned

1. **Why fine-tuning matters**: Pre-trained models don't recognize domain-specific entities
2. **Data preparation**: How to format training data with entity annotations
3. **Model training**: How to update a spaCy model with new entity types
4. **Evaluation**: Testing the model on seen and unseen data
5. **Deployment**: Saving and loading trained models

### Improving the Model

To improve accuracy, you should:

1. **Add more training data**: 15 examples is minimal; aim for 100+ annotated examples
2. **Use annotation tools**: Prodigy or Label Studio for efficient annotation
3. **Include negative examples**: Text without entities to reduce false positives
4. **Increase iterations**: More training epochs (with more data)
5. **Use a larger base model**: Try `en_core_web_lg` for better embeddings
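
Two of these suggestions can be sketched in plain Python: adding negative examples and holding out data for evaluation. The sentences below are made up for illustration, and `split_train_dev` is a hypothetical helper. A negative example uses the same tuple format with an empty entity list.

```python
import random


def split_train_dev(data, dev_fraction=0.2, seed=42):
    """Shuffle a copy of the data and split it into (train, dev) lists."""
    shuffled = list(data)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    n_dev = max(1, int(len(shuffled) * dev_fraction))
    return shuffled[n_dev:], shuffled[:n_dev]


# Hypothetical annotated sentences
positives = [
    ("Creatine Monohydrate supports strength.",
     {"entities": [(0, 20, "PRODUCT"), (30, 38, "BENEFIT")]}),
    ("Caffeine boosts energy.",
     {"entities": [(0, 8, "INGREDIENT"), (16, 22, "BENEFIT")]}),
    ("Take 5g daily.",
     {"entities": [(5, 7, "QUANTITY")]}),
]

# Negative examples: text with no entities of interest
negatives = [
    ("Orders placed before noon ship the same day.", {"entities": []}),
    ("Store in a cool, dry place.", {"entities": []}),
]

train_data, dev_data = split_train_dev(positives + negatives)
print(f"{len(train_data)} training examples, {len(dev_data)} held-out examples")
```

Keeping the held-out examples out of training is what makes precision, recall, and F1 meaningful. Scores measured on sentences the model has already seen will look better than its real performance.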

### spaCy 3.x Config System

For production use, spaCy 3.x recommends using the config-based training system:

```bash
python -m spacy init config config.cfg --lang en --pipeline ner
python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

This provides better reproducibility and more training options.

In [None]:
# Final comparison: pre-trained vs fine-tuned
nlp_pretrained = spacy.load('en_core_web_sm')

comparison_text = "Take 5g of Creatine Monohydrate and BCAAs daily for muscle growth and faster recovery."

print("COMPARISON: Pre-trained vs Fine-tuned Model")
print("=" * 60)
print(f"\nText: {comparison_text}")

print("\n--- Pre-trained Model ---")
doc_pre = nlp_pretrained(comparison_text)
if doc_pre.ents:
    for ent in doc_pre.ents:
        print(f"  {ent.label_:12} | {ent.text}")
else:
    print("  No entities found")

print("\n--- Fine-tuned Model ---")
doc_fine = nlp_train(comparison_text)
if doc_fine.ents:
    for ent in doc_fine.ents:
        print(f"  {ent.label_:12} | {ent.text}")
else:
    print("  No entities found")

## Exercises

1. **Add more training data**: Annotate 10 more sentences from the product descriptions and retrain the model

2. **Add a new entity type**: Add a "DOSAGE" entity type to capture recommended dosages (e.g., "2-5 servings per day")

3. **Evaluate quantitatively**: Split your data into train/test sets and calculate precision, recall, and F1 scores

4. **Try different base models**: Compare results using `en_core_web_sm`, `en_core_web_md`, and `en_core_web_lg`

5. **Build an application**: Create a simple function that takes a product description URL and returns structured product information