# 🍷 Wine & Food NLP with spaCy

This project builds two lightweight NLP models using [spaCy](https://spacy.io) for handling natural language queries related to wine and food. It supports:

- **Intent Classification**: Distinguish between wine or food recommendation queries.
- **Named Entity Recognition (NER)**: Extract key details like wine name, price, and tasting descriptions.

---
<br>

### Text Classification Model (Intent Detection)
Example Inputs:
- “Recommend a red wine under 300 HKD that pairs with grilled lamb.” → recommend_wine
- “What should I cook for dinner to go with a chilled bottle of Sancerre?” → recommend_food


Output Example
- {'recommend_wine': 0.01, 'recommend_food': 0.99}
<br>

---



In [26]:
import spacy
from spacy.training.example import Example

# prepare training data with annotations
TRAIN_DATA = [
    ("What wine goes well with spicy Thai green curry?", {"cats": {"recommend_wine": 1.0, "recommend_food": 0.0}}),
    ("Suggest a red wine under 300 HKD that pairs with grilled lamb.", {"cats": {"recommend_wine": 1.0, "recommend_food": 0.0}}),
    ("I have a bottle of Amarone — what foods would pair well with it?", {"cats": {"recommend_wine": 0.0, "recommend_food": 1.0}}),
    ("What should I cook for dinner to go with a chilled bottle of Sancerre?", {"cats": {"recommend_wine": 0.0, "recommend_food": 1.0}})
]

# create blank NLP pipeline and add labels
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
textcat.add_label("recommend_wine")
textcat.add_label("recommend_food")

# train
optimizer = nlp.begin_training()
for i in range(20):
    losses = {}
    for text, annotation in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotation)
        nlp.update([example], sgd=optimizer, losses=losses)
    print(f"Epoch {i+1}, Losses: {losses}")

# save text categorization model
nlp.to_disk("textcat_model")

Epoch 1, Losses: {'textcat': 1.0340981185436249}
Epoch 2, Losses: {'textcat': 0.7742716521024704}
Epoch 3, Losses: {'textcat': 0.4202875867486}
Epoch 4, Losses: {'textcat': 0.13618924655020237}
Epoch 5, Losses: {'textcat': 0.02502437774091959}
Epoch 6, Losses: {'textcat': 0.002704028840526007}
Epoch 7, Losses: {'textcat': 0.0002550006211095024}
Epoch 8, Losses: {'textcat': 3.303227640572004e-05}
Epoch 9, Losses: {'textcat': 6.782430546081741e-06}
Epoch 10, Losses: {'textcat': 2.1016941502693953e-06}
Epoch 11, Losses: {'textcat': 8.968619056304306e-07}
Epoch 12, Losses: {'textcat': 4.847795551654599e-07}
Epoch 13, Losses: {'textcat': 3.1036319469990303e-07}
Epoch 14, Losses: {'textcat': 2.2392674026150416e-07}
Epoch 15, Losses: {'textcat': 1.7591662349047965e-07}
Epoch 16, Losses: {'textcat': 1.4668297509956574e-07}
Epoch 17, Losses: {'textcat': 1.2749947941870232e-07}
Epoch 18, Losses: {'textcat': 1.1411040290454366e-07}
Epoch 19, Losses: {'textcat': 1.0419687868079563e-07}
Epoch 20, L

<br>

To use the trained model, load it and pass a query.

In [27]:
nlp_textcat = spacy.load("textcat_model")

query = "What should I cook to go with my Champagne?" # test a query out!
doc = nlp_textcat(query)
print(doc.cats)

{'recommend_wine': 0.1159258708357811, 'recommend_food': 0.8840740919113159}


<br>

### Named Entity Recognition (NER) Model
Extracted Fields:
- wine_name: “Silver Oak 2019”
- price: “$120”
- description_phrase: “vanilla and cherries”

Output Example
- wine_name : 2020 Opus One
- price : $299
- description_phrase : chocolate, spice, and tobacco

---

**Quick Notes**: 
- A problem you will likely encounter when training a NER with multiple labels is the quality of annotations
- Annotation indices (i.e., start index and end index) should match with the start and end indices of the token/s
- For example, "Try Château Margaux for $350." "Château Margaux" has a valid token span from start=4 to end=19. An error would occur if you try to index end=18.
- Remember this when generating training data using LLMs, as they are prone to inaccurate indexing. Use Prodigy to help label your training data

In [None]:
import spacy
from spacy.training.example import Example
from spacy.tokens import DocBin

In [None]:
# function for checking misaligned indices
def check_alignment(train_data):
    for i, (text, annots) in enumerate(train_data):
        doc = nlp(text)
        entities = []
        errors = False

        print(f"Sample {i+1}: {text}")
        for start, end, label in annots["entities"]:
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            if span is None:
                print(f"Misaligned span: '{text[start:end]}' -> ({start}, {end}) for label {label}")
                errors = True
            else:
                print(f"Valid span: {span.text} ({span.start_char}, {span.end_char})")
        
        if not errors:
            print("all spans aligned")

In [16]:
# Training data for NER
NER_DATA = [
    ("What wine goes well with spicy Thai green curry with coconut milk?",
     {"entities": [
         (25, 66, "pairing_item")  # "spicy Thai green curry with coconut milk"
     ]}),

    ("Recommend a red wine under 300 HKD that pairs well with grilled lamb and comes from Spain",
     {"entities": [
         (27, 34, "price"),          # "300 HKD"
         (56, 68, "pairing_item"),   # "grilled lamb"
         (84, 89, "region")          # "Spain"
     ]}),

    ("Suggest a celebratory wine that works with oysters and has high acidity.",
     {"entities": [
         (10, 21, "occasion"),               # "celebratory"
         (43, 50, "pairing_item"),          # "oysters"
         (59, 71, "description_phrase")     # "high acidity"
     ]}),

    ("I'm cooking mushroom risotto and want something medium-bodied and earthy to go with it.",
     {"entities": [
         (12, 28, "pairing_item"),          # "mushroom risotto"
         (48, 61, "description_phrase")     # "medium-bodied"
     ]}),

    ("Pair a bold Napa Cabernet Sauvignon with sushi.",
     {"entities": [
         (7, 35, "wine_name"),              # "bold Napa Cabernet Sauvignon"
         (41, 46, "pairing_item")           # "sushi"
     ]}),

    ("Find me a dry Riesling from Mosel to serve with pork belly and sauerkraut.",
     {"entities": [
         (14, 33, "wine_name"),             # "dry Riesling from Mosel"
         (48, 73, "pairing_item")           # "pork belly and sauerkraut"
     ]}),

    ("I want a sparkling wine under $100 that goes with smoked salmon.",
     {"entities": [
         (9, 23, "wine_name"),              # "sparkling wine"
         (30, 34, "price"),                 # "$100"
         (50, 64, "pairing_item")           # "smoked salmon"
     ]}),

    ("Looking for a wine to pair with mushroom pâté — preferably something earthy and herbal.",
     {"entities": [
         (32, 45, "pairing_item"),          # "mushroom pâté"
         (69, 87, "description_phrase")     # "earthy and herbal"
     ]}),

    ("What's a good wine under 50 USD for roast duck?",
     {"entities": [
         (25, 31, "price"),                 # "50 USD"
         (36, 46, "pairing_item")           # "roast duck"
     ]}),

    ("Recommend something festive that pairs with oysters and caviar.",
     {"entities": [
         (20, 27, "occasion"),              # "festive"
         (44, 51, "pairing_item"),          # "oysters"
         (56, 62, "pairing_item")           # "caviar"
     ]}),
]



In [28]:
# check_alignment(NER_DATA)

In [18]:
labels = ["wine_name", "price", "description_phrase", "pairing_item", "region", "occasion"]

# create blank NER pipeline and add labels
nlp_ner = spacy.blank("en")
ner = nlp_ner.add_pipe("ner")

for label in labels:
    ner.add_label(label)

# train model
optimizer = nlp_ner.begin_training()
for i in range(30):
    losses = {}
    for text, annotations in NER_DATA:
        example = Example.from_dict(nlp_ner.make_doc(text), annotations)
        nlp_ner.update([example], drop=0.1, losses=losses)
    print(f"Epoch {i+1}, losses: {losses}")

nlp_ner.to_disk("ner_model")

Epoch 1, losses: {'ner': np.float32(117.89655)}
Epoch 2, losses: {'ner': np.float32(44.81151)}
Epoch 3, losses: {'ner': np.float32(39.62362)}
Epoch 4, losses: {'ner': np.float32(27.899956)}
Epoch 5, losses: {'ner': np.float32(87.491005)}
Epoch 6, losses: {'ner': np.float32(31.17111)}
Epoch 7, losses: {'ner': np.float32(29.020218)}
Epoch 8, losses: {'ner': np.float32(18.3967)}
Epoch 9, losses: {'ner': np.float32(27.289991)}
Epoch 10, losses: {'ner': np.float32(14.38584)}
Epoch 11, losses: {'ner': np.float32(11.4727745)}
Epoch 12, losses: {'ner': np.float32(21.41718)}
Epoch 13, losses: {'ner': np.float32(6.208621)}
Epoch 14, losses: {'ner': np.float32(4.08717)}
Epoch 15, losses: {'ner': np.float32(4.831746)}
Epoch 16, losses: {'ner': np.float32(0.085311145)}
Epoch 17, losses: {'ner': np.float32(0.004360127)}
Epoch 18, losses: {'ner': np.float32(0.020516783)}
Epoch 19, losses: {'ner': np.float32(0.00071766844)}
Epoch 20, losses: {'ner': np.float32(0.0018037045)}
Epoch 21, losses: {'ner': 

In [18]:
def extract_wine_query_fields(doc):
    fields = {
        "wine_name": None,
        "price": None,
        "description_phrase": None,
        "pairing_item": [],
        "region": None,
        "occasion": None
    }

    for ent in doc.ents:
        print(f"Detected: {ent.text} ({ent.label_})")
        if ent.label_ == "pairing_item":
            fields["pairing_item"].append(ent.text)
        elif ent.label_ in fields:
            fields[ent.label_] = ent.text

    if not fields["pairing_item"]:
        fields["pairing_item"] = None

    return fields


In [25]:
text = "Find a crisp Sauvignon Blanc that pairs well with goat cheese."
doc = nlp_ner(text)

print([(ent.text, ent.label_) for ent in doc.ents])


[('Blanc', 'pairing_item'), ('goat cheese.', 'pairing_item')]


### Current Problems
- issues with alignment of labelling, which leads to improper learning -> understand problem
- lack of training samples for each field -> automate training data generation using spacy project
- consider possible problem that each combination of field is registered as unique


### How to Use
- Train each model separately using Jupyter or Python script.
- Use spacy.load() to load and run predictions.
- You can optionally combine both models into a single pipeline or deploy via FastAPI.

### Next Steps
- Add more examples for robust predictions
- Combine TextCat and NER in one spaCy pipeline
- Deploy with Flask or FastAPI for real-time use
- Use Label Studio or Prodigy to label at scale